# Some useful DataFrame functions

1. Random Data Generation
2. Summary and Descriptive Statistics
3. Sample covariance and correlation
4. Contingency Table
5. Frequent Items


## 1 - Random Data Generation
Random data generation is useful for testing existing algorithms and for implementing randomized algorithms, such as random projection.

Spark provides methods in pyspark.sql.functions for generating columns that contain i.i.d. values drawn from a distribution, e.g., uniform (rand) and standard normal (randn).
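
To make the two distributions concrete without a cluster, here is a minimal sketch in plain Python. It uses the standard `random` module rather than Spark, and the helper name `random_columns` is hypothetical. Each row gets one uniform draw on [0, 1) and one standard normal draw, and the fixed seeds make repeated runs produce the same columns. The numbers will not match Spark's output, since the generators differ.

```python
import random


def random_columns(n_rows, uniform_seed, normal_seed):
    """Return rows of (id, uniform, normal) with i.i.d. draws per column."""
    # One independent generator per column, so each column has its own seed.
    uniform_rng = random.Random(uniform_seed)
    normal_rng = random.Random(normal_seed)
    return [
        (i, uniform_rng.random(), normal_rng.gauss(0.0, 1.0))
        for i in range(n_rows)
    ]


rows = random_columns(10, uniform_seed=10, normal_seed=27)
for row in rows:
    print(row)
```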

In [1]:
from pyspark import SparkConf, SparkContext
# Only SQLContext is used below, so import it explicitly instead of using wildcards.
from pyspark.sql import SQLContext
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

In [2]:
from pyspark.sql.functions import rand, randn

In [3]:
# Create a DataFrame with one int column and 10 rows.
df = sqlContext.range(0, 10)

In [4]:
df.show()

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+



In [5]:
# Generate two more columns: one drawn from the uniform distribution, one from the standard normal distribution.
df.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal")).show()

+---+-------------------+--------------------+
| id|            uniform|              normal|
+---+-------------------+--------------------+
|  0|0.41371264720975787|  0.5888539012978773|
|  1| 0.1982919638208397| 0.06157382353970104|
|  2|0.12030715258495939|  1.0854146699817222|
|  3|0.44292918521277047| -0.4798519469521663|
|  4| 0.8898784253886249| -0.8820294772950535|
|  5| 0.2731073068483362|-0.15116027592854422|
|  6|   0.87079354700073|-0.27674189870783683|
|  7|0.27149331793166864|-0.18575112254167045|
|  8| 0.6037143578435027|   0.734722467897308|
|  9| 0.1435668838975337|-0.30123700668427145|
+---+-------------------+--------------------+



## 2 - Summary and Descriptive Statistics
- The first operation to perform after importing data is to get some sense of what it looks like.
- For numerical columns, knowing the descriptive summary statistics can help a lot in understanding the distribution of your data.
- The function describe returns a DataFrame containing the number of non-null entries (count), mean, standard deviation, minimum, and maximum for each numerical column.

In [6]:
# A slightly different way to generate the two random columns
df = sqlContext.range(0, 10).withColumn('uniform', rand(seed=10)).withColumn('normal', randn(seed=27))
df.describe().show()

+-------+------------------+-------------------+--------------------+
|summary|                id|            uniform|              normal|
+-------+------------------+-------------------+--------------------+
|  count|                10|                 10|                  10|
|   mean|               4.5|0.42277947877387234|0.019379313460706583|
| stddev|3.0276503540974917|0.28230337640258524|  0.6053138209606246|
|    min|                 0|0.12030715258495939| -0.8820294772950535|
|    max|                 9| 0.8898784253886249|  1.0854146699817222|
+-------+------------------+-------------------+--------------------+
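
The stddev value of about 3.03 for the id column is consistent with the sample standard deviation (dividing by n - 1) rather than the population one. As a rough cross-check, the sketch below recomputes the same five statistics for the values 0 to 9 in plain Python. The helper name `describe_column` is hypothetical and is not part of Spark.

```python
import statistics


def describe_column(values):
    """Compute count, mean, sample stddev, min and max, skipping None entries."""
    values = [v for v in values if v is not None]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


summary = describe_column(range(10))
print(summary)
```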



If you have a DataFrame with a large number of columns, you can also run describe on a subset of the columns:

In [7]:
df.describe('uniform', 'normal').show()

+-------+-------------------+--------------------+
|summary|            uniform|              normal|
+-------+-------------------+--------------------+
|  count|                 10|                  10|
|   mean|0.42277947877387234|0.019379313460706583|
| stddev|0.28230337640258524|  0.6053138209606246|
|    min|0.12030715258495939| -0.8820294772950535|
|    max| 0.8898784253886249|  1.0854146699817222|
+-------+-------------------+--------------------+



While describe works well for quick exploratory data analysis, you can also control the list of descriptive statistics and the columns they apply to with a regular select on a DataFrame:

In [8]:
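# Note: importing min and max from pyspark.sql.functions shadows Python's built-in min() and max().
from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()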

+-------------------+-------------------+------------------+
|       avg(uniform)|       min(uniform)|      max(uniform)|
+-------------------+-------------------+------------------+
|0.42277947877387234|0.12030715258495939|0.8898784253886249|
+-------------------+-------------------+------------------+



## 3 - Sample covariance and correlation
- Covariance is a measure of how two variables change with respect to each other.
- A positive number would mean that there is a tendency that as one variable increases, the other increases as well.
- A negative number would mean that as one variable increases, the other variable has a tendency to decrease.
- The sample covariance of two columns of a DataFrame can be calculated as follows:

In [9]:
df = sqlContext.range(0, 10).withColumn('rand1', rand(seed=10)).withColumn('rand2', rand(seed=27))

In [10]:
df.stat.cov('rand1', 'rand2')

-0.011320674637528958

In [11]:
df.stat.cov('id', 'id')

9.166666666666666

As you can see from the above, the covariance of the two randomly generated columns is close to zero, while the covariance of the id column with itself is much larger; it is simply the sample variance of id.
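
The value 9.17 for the id column can be reproduced by hand. The sketch below is a plain Python version of the usual sample covariance formula, summing the products of deviations from the mean and dividing by n - 1. The function name `sample_cov` is an assumption for illustration and is not a Spark API.

```python
def sample_cov(xs, ys):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


ids = list(range(10))
# Covariance of a column with itself is its sample variance.
print(sample_cov(ids, ids))
```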

The covariance value of 9.17 might be hard to interpret because it depends on the scale of the data. Correlation is covariance normalized by the standard deviations of the two variables. It always lies between -1 and 1, which makes the strength of the linear relationship between two random variables easier to read.

In [12]:
df.stat.corr('rand1', 'rand2')

-0.15104231216965627

In [13]:
df.stat.corr('id', 'id')

1.0

In the above example, id correlates perfectly with itself, while the two randomly generated columns have a low correlation value.
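
To show how the normalization works, here is a minimal plain Python sketch of the Pearson correlation coefficient, which divides the sum of products of deviations by the square root of the product of the two sums of squared deviations. The function name `pearson_corr` is hypothetical, and the sketch assumes neither column is constant.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_x = sum((x - mean_x) ** 2 for x in xs)
    sq_y = sum((y - mean_y) ** 2 for y in ys)
    # The (n - 1) factors of covariance and variances cancel out.
    return cross / math.sqrt(sq_x * sq_y)


ids = list(range(10))
print(pearson_corr(ids, ids))        # a column against itself
print(pearson_corr(ids, ids[::-1]))  # a column against its reverse
```

Rescaling a column changes its covariance with other columns but leaves the correlation unchanged, which is why correlation is easier to compare across datasets.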

## 4 - Contingency Table
- Cross-tabulation provides a table of the frequency distribution for a set of variables.
- Cross-tabulation is a powerful tool in statistics for examining whether variables are statistically dependent or independent.
- In Spark, you can cross-tabulate two columns of a DataFrame to obtain the counts of the different pairs that are observed in those columns.

Here is an example of how to use crosstab to obtain the contingency table.

In [14]:
# Create a DataFrame with two columns (name, item)
names = ["Alice", "Bob", "Mike"]
items = ["milk", "bread", "butter", "apples", "oranges"]
df = sqlContext.createDataFrame([(names[i % 3], items[i % 5]) for i in range(100)], ["name", "item"])

In [15]:
# Take a look at the first 10 rows.
df.show(10)

+-----+-------+
| name|   item|
+-----+-------+
|Alice|   milk|
|  Bob|  bread|
| Mike| butter|
|Alice| apples|
|  Bob|oranges|
| Mike|   milk|
|Alice|  bread|
|  Bob| butter|
| Mike| apples|
|Alice|oranges|
+-----+-------+
only showing top 10 rows



In [16]:
df.stat.crosstab("name", "item").show()

+---------+------+-------+------+----+-----+
|name_item|apples|oranges|butter|milk|bread|
+---------+------+-------+------+----+-----+
|      Bob|     6|      7|     7|   6|    7|
|     Mike|     7|      6|     7|   7|    6|
|    Alice|     7|      7|     6|   7|    7|
+---------+------+-------+------+----+-----+



One important thing to keep in mind is that the cardinality of the columns we run crosstab on cannot be too big. That is to say, the number of distinct “name” and “item” values cannot be too large. Just imagine if “item” contained 1 billion distinct entries: how would you fit that table on your screen?!
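
The counts in the table can be reproduced without Spark. The sketch below rebuilds the same 100 (name, item) pairs and tallies them with `collections.Counter`. It is only an illustration of what a contingency table holds, not how crosstab is implemented. The number of cells is the number of distinct names times the number of distinct items, which is why high-cardinality columns are a problem.

```python
from collections import Counter

names = ["Alice", "Bob", "Mike"]
items = ["milk", "bread", "butter", "apples", "oranges"]
rows = [(names[i % 3], items[i % 5]) for i in range(100)]

# Count how often each (name, item) pair occurs.
pair_counts = Counter(rows)

# Print one row per name, with one count per item.
print("name", items)
for name in names:
    print(name, [pair_counts[(name, item)] for item in items])
```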

## 5 - Frequent Items
Figuring out which items are frequent in each column can be very useful for understanding a dataset.

In Spark, you can find the frequent items for a set of columns using DataFrames. Spark uses a one-pass algorithm proposed by Karp et al. It is a fast, approximate algorithm that always returns all the items that appear in at least a user-specified minimum proportion of rows.

Note that the result might contain false positives, i.e. items that are not frequent.
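
To give a feel for how a one-pass, counter-based approach can behave this way, here is a minimal plain Python sketch of the classic bounded-counter scheme (often called Misra-Gries). It is written in the spirit of the algorithm mentioned above and is not Spark's actual implementation. The function name `frequent_items` and the choice of `int(1 / support)` counters are assumptions made for this illustration.

```python
def frequent_items(values, support):
    """One-pass candidate search using a bounded number of counters.

    Every value occurring in at least `support` of the rows is kept,
    but some returned values may be infrequent (false positives).
    """
    max_counters = int(1 / support)
    counters = {}
    for value in values:
        if value in counters:
            counters[value] += 1
        elif len(counters) < max_counters:
            counters[value] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)


# Half of the rows hold the value 1; the rest are distinct odd numbers.
column = [1 if i % 2 == 0 else i for i in range(100)]
print(frequent_items(column, 0.4))
```

Because the counters only track candidates, a second pass over the data would be needed to remove false positives by checking exact counts.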

In [17]:
df = sqlContext.createDataFrame([(1, 2, 3) if i % 2 == 0 else (i, 2 * i, i % 4) for i in range(100)], ["a", "b", "c"])

In [18]:
df.show(10)

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  1|  2|  1|
|  1|  2|  3|
|  3|  6|  3|
|  1|  2|  3|
|  5| 10|  1|
|  1|  2|  3|
|  7| 14|  3|
|  1|  2|  3|
|  9| 18|  1|
+---+---+---+
only showing top 10 rows



Given the above DataFrame, the following code finds the items that show up in at least 40% of the rows for each column:

In [19]:
freq = df.stat.freqItems(["a", "b", "c"], 0.4)
freq.collect()[0]

Row(a_freqItems=[11, 1], b_freqItems=[2, 22], c_freqItems=[1, 3])

As you can see, “11” and “1” are reported as the frequent values for column “a”. Only “1” is truly frequent; “11” appears in a single row and is a false positive. You can also find frequent items for column combinations by creating a composite column using the struct function:

In [20]:
from pyspark.sql.functions import struct
freq = df.withColumn('ab', struct('a', 'b')).stat.freqItems(['ab'], 0.4)
freq.collect()[0]

Row(ab_freqItems=[Row(a=11, b=22), Row(a=1, b=2)])

From the above example, the combinations “a=11 and b=22” and “a=1 and b=2” are reported as frequent in this dataset. Note that “a=11 and b=22” is a false positive.