# Analyzing Tabular Data

Unlike previous notebooks, where we dealt with unstructured textual data, here we work with structured data. We will focus on the reader object exposed as `spark.read` (a `DataFrameReader`) to load delimited data rather than unstructured text.

In [1]:
import os
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

We can easily create a data frame from data in our program with the `spark.createDataFrame` function, as below. The first parameter is the data itself: you can provide a list of items (here, a list of lists), a pandas data frame, or a resilient distributed dataset (RDD). The second parameter is the schema of the data frame. Here we pass a list of column names, which is sufficient because PySpark will infer the types of our columns (string, long, and double, respectively).

In [2]:
my_grocery_list = [
    ["Banana", 2, 1.74],
    ["Apple", 4, 2.04],
    ["Carrot", 1, 1.09],
    ["Cake", 1, 10.99],
]

df_grocery_list = spark.createDataFrame(
    my_grocery_list, ["Item","Quantity","Price"]
)

df_grocery_list.printSchema()

root
 |-- Item: string (nullable = true)
 |-- Quantity: long (nullable = true)
 |-- Price: double (nullable = true)
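
To make the inferred types above less mysterious, here is a minimal plain-Python sketch of the idea. It is only an illustration, not PySpark's inference code: the helper `sketch_infer_types` and the `PYTHON_TO_SPARK` table are hypothetical names, and the sketch looks at the first row only.

```python
# Hypothetical illustration of type inference; not PySpark's implementation.
PYTHON_TO_SPARK = {str: "string", int: "long", float: "double", bool: "boolean"}


def sketch_infer_types(rows, column_names):
    """Guess a Spark-style type name for each column from the first row."""
    first_row = rows[0]
    return {
        name: PYTHON_TO_SPARK.get(type(value), "unknown")
        for name, value in zip(column_names, first_row)
    }


my_grocery_list = [
    ["Banana", 2, 1.74],
    ["Apple", 4, 2.04],
    ["Carrot", 1, 1.09],
    ["Cake", 1, 10.99],
]

inferred = sketch_infer_types(my_grocery_list, ["Item", "Quantity", "Price"])
print(inferred)
```

The mapping agrees with the schema printed above: Python `str`, `int`, and `float` values show up as `string`, `long`, and `double` columns.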



## Exploratory Data Analysis
PySpark doesn't provide any charting capabilities and doesn't play well with charting libraries (like `Matplotlib`, `seaborn`, `Altair`, or `plot.ly`), and this makes a lot of sense: PySpark distributes your data over many computers, so it doesn't make much sense to distribute chart creation. The usual solution is to transform your data using PySpark, use the `toPandas()` method to convert your PySpark data frame into a pandas data frame, and then use your favorite charting library.

When using `toPandas()`, remember that you lose the advantages of working with multiple machines, because all the data accumulates on the driver. Reserve this operation for an aggregated or otherwise manageable data set.

A general rule of thumb: on a 16 GB driver, if $\text{number of rows} \times \text{number of columns}$ is around $100{,}000$ or more, it's better to reduce the data size further before calling `toPandas()`.
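
One way to apply this rule of thumb is to count cells before converting. The sketch below is plain Python; the names `CELL_BUDGET` and `fits_rule_of_thumb` are made up for illustration, and the 100,000 figure is simply the one quoted above, not a hard limit.

```python
# Rule-of-thumb figure quoted in the text for a 16 GB driver (an assumption,
# not a measured limit).
CELL_BUDGET = 100_000


def cell_count(n_rows, n_columns):
    """Number of cells in a table with the given shape."""
    return n_rows * n_columns


def fits_rule_of_thumb(n_rows, n_columns, budget=CELL_BUDGET):
    """Return True when the table stays under the rule-of-thumb budget."""
    return cell_count(n_rows, n_columns) < budget


# A tiny table such as the grocery list (4 rows, 3 columns) is fine.
print(fits_rule_of_thumb(4, 3))

# A table with 238,945 rows and 30 columns is far above the budget.
print(cell_count(238_945, 30), fits_rule_of_thumb(238_945, 30))
```

With a PySpark data frame `df`, the inputs would be `df.count()` and `len(df.columns)`. Keep in mind that `count()` is itself an action that scans the data.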

The **dataset** we are using comes from the CRTC (Canadian Radio-Television and Telecommunications Commission). Every broadcaster is mandated to provide a complete log of the programs and commercials showcased to the Canadian public.

We will try to answer this question: *"What are the channels with the greatest and least proportion of commercials?"*

In [20]:
spark = SparkSession.builder.appName(
    "Analyzing CRTC"
).getOrCreate()

DIRECTORY = 'data/broadcast_logs/'

logs = spark.read.csv(
    os.path.join(DIRECTORY,"BroadcastLogs_2018_Q3_M8_sample.csv"),
    sep="|",
    header=True,
    inferSchema=True,
    timestampFormat="yyyy-MM-dd"
)

logs.printSchema()

root
 |-- BroadcastLogID: integer (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- LogDate: timestamp (nullable = true)
 |-- SequenceNO: integer (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: timestamp (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string 

Star schemas are common in the relational database world because of normalization, a process used to avoid duplicating pieces of data and to improve data integrity. Data normalization is illustrated in the figure below, where our center table, `logs`, contains IDs that map to auxiliary tables called link tables. The `CD_Category` link table, for example, contains many fields (e.g., `Category_CD` and `English_description`) that are made available to `logs` when you link the tables with the `Category_ID` key.

![star schema](images/analyse_tabular_star_schema.png)

In Spark’s universe, we often prefer working with a single table instead of linking a multitude of tables to get the data. We call these denormalized tables, or, colloquially, fat tables. 

## The basics of data manipulation: selecting, dropping, renaming, ordering, diagnosing

### Selecting

In [4]:
# Selecting a few columns of our data to show

logs.select("BroadcastLogID","LogServiceID","LogDate").show(5,False)

+--------------+------------+-------------------+
|BroadcastLogID|LogServiceID|LogDate            |
+--------------+------------+-------------------+
|1196192316    |3157        |2018-08-01 00:00:00|
|1196192317    |3157        |2018-08-01 00:00:00|
|1196192318    |3157        |2018-08-01 00:00:00|
|1196192319    |3157        |2018-08-01 00:00:00|
|1196192320    |3157        |2018-08-01 00:00:00|
+--------------+------------+-------------------+
only showing top 5 rows



Let's expand our selection code to see every column in groups of three.

In [5]:
import numpy as np

column_split = np.array_split(
    np.array(logs.columns), len(logs.columns) // 3
)

print(column_split)

for x in column_split:
    logs.select(*x).show(5,False)

[array(['BroadcastLogID', 'LogServiceID', 'LogDate'], dtype='<U22'), array(['SequenceNO', 'AudienceTargetAgeID', 'AudienceTargetEthnicID'],
      dtype='<U22'), array(['CategoryID', 'ClosedCaptionID', 'CountryOfOriginID'], dtype='<U22'), array(['DubDramaCreditID', 'EthnicProgramID', 'ProductionSourceID'],
      dtype='<U22'), array(['ProgramClassID', 'FilmClassificationID', 'ExhibitionID'],
      dtype='<U22'), array(['Duration', 'EndTime', 'LogEntryDate'], dtype='<U22'), array(['ProductionNO', 'ProgramTitle', 'StartTime'], dtype='<U22'), array(['Subtitle', 'NetworkAffiliationID', 'SpecialAttentionID'],
      dtype='<U22'), array(['BroadcastOriginPointID', 'CompositionID', 'Producer1'],
      dtype='<U22'), array(['Producer2', 'Language1', 'Language2'], dtype='<U22')]
+--------------+------------+-------------------+
|BroadcastLogID|LogServiceID|LogDate            |
+--------------+------------+-------------------+
|1196192316    |3157        |2018-08-01 00:00:00|
|1196192317    |3157 
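
A note on the split above: `np.array_split` with `len(logs.columns) // 3` sections produces groups of exactly three only when the number of columns is a multiple of three. Otherwise the leftover columns are spread over the first groups, which then hold four names. A minimal plain-Python alternative, using a hypothetical helper named `chunk`, slices the list in fixed steps instead:

```python
def chunk(items, size):
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Sample column names for illustration; seven names do not divide evenly by three.
columns = [
    "BroadcastLogID", "LogServiceID", "LogDate",
    "SequenceNO", "AudienceTargetAgeID", "AudienceTargetEthnicID",
    "CategoryID",
]

groups = chunk(columns, 3)
for group in groups:
    print(group)
```

With the data frame at hand, the loop would become `for group in chunk(logs.columns, 3): logs.select(*group).show(5, False)`.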

### Dropping

Let's get rid of two columns in our current data frame in the spirit of tidying up.
Hopefully, it will bring us joy:
- `BroadcastLogID` is the primary key of the table and will not serve us in answering our questions.
- `SequenceNO` is a sequence number and won't be useful either.

In [6]:
logs = logs.drop("BroadcastLogID","SequenceNO")

# Check that both columns are gone (membership tests are case-sensitive)
print("BroadcastLogID" in logs.columns)
print("SequenceNO" in logs.columns)

False
False
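
A caution about this kind of check: `logs.columns` is a plain Python list of strings, so `in` compares names exactly, including case. A name typed with different casing reports `False` even when the column is still present, which is why the check should use the spelling shown in the schema (`SequenceNO`). The sketch below uses a small sample list and a hypothetical helper, `has_column`, to show the difference.

```python
# Sample list standing in for logs.columns (illustration only).
columns = ["LogServiceID", "LogDate", "SequenceNO"]

# Exact, case-sensitive comparison: only the second spelling matches.
print("SequenceNo" in columns)
print("SequenceNO" in columns)


def has_column(column_names, name):
    """Case-insensitive membership test on a list of column names."""
    return name.lower() in {c.lower() for c in column_names}


print(has_column(columns, "SequenceNo"))
```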


In [7]:
# drop and select are two sides of the same coin.

logs = logs.select(
    *[s for s in logs.columns if s not in ["BroadcastLogID","SequenceNO"]]
)

### Creating new columns with `withColumn()`

Let's start with an example.

In [8]:
logs.select(F.col("Duration")).show(5)

+----------------+
|        Duration|
+----------------+
|02:00:00.0000000|
|00:00:30.0000000|
|00:00:15.0000000|
|00:00:15.0000000|
|00:00:15.0000000|
+----------------+
only showing top 5 rows



In [9]:
print(logs.select(F.col("Duration")).dtypes)

[('Duration', 'string')]


Let's extract the hours, minutes, and seconds from the `Duration` column. The `substr()` method takes two parameters. The first gives the position where the substring starts, the first character being `1`, not `0` as in Python. The second gives the length of the substring we want to extract, in number of characters. The function application returns a string `Column` that we convert to an integer via the `cast()` method.

In [10]:
logs.select(
    F.col("Duration"),
    F.col("Duration").substr(1, 2).cast("int").alias("dur_hours"),
    F.col("Duration").substr(4, 2).cast("int").alias("dur_minutes"),
    F.col("Duration").substr(7, 2).cast("int").alias("dur_seconds"),
).distinct().show(
    5
)

+----------------+---------+-----------+-----------+
|        Duration|dur_hours|dur_minutes|dur_seconds|
+----------------+---------+-----------+-----------+
|00:04:52.0000000|        0|          4|         52|
|00:10:06.0000000|        0|         10|          6|
|00:09:52.0000000|        0|          9|         52|
|00:04:26.0000000|        0|          4|         26|
|00:14:59.0000000|        0|         14|         59|
+----------------+---------+-----------+-----------+
only showing top 5 rows
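
Because the 1-based start position is easy to get wrong, here is a small plain-Python sketch of the same slicing. The helper `substr` below only mimics the `(start, length)` convention described above; it is not the PySpark method.

```python
def substr(text, start, length):
    """Mimic the substr(start, length) convention: `start` is 1-based."""
    begin = start - 1  # convert to Python's 0-based indexing
    return text[begin:begin + length]


duration = "00:04:52.0000000"  # first row of the output above

hours = int(substr(duration, 1, 2))
minutes = int(substr(duration, 4, 2))
seconds = int(substr(duration, 7, 2))

print(hours, minutes, seconds)
```

The positions 1, 4, and 7 skip over the `:` separators sitting at positions 3 and 6.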



Let's merge all these values into a single field: the duration of the program in seconds.

In [11]:
logs.select(
    F.col("Duration"),
    (
        F.col("Duration").substr(1, 2).cast("int") * 60 * 60
        + F.col("Duration").substr(4, 2).cast("int") * 60
        + F.col("Duration").substr(7, 2).cast("int")
    ).alias("Duration_seconds"),
).distinct().show(5)

# We could assign this select() result back to `logs` to add the new column,
# but we would then have to list every other column we want to keep.
# withColumn(), shown next, is a quicker way to add a single column.

+----------------+----------------+
|        Duration|Duration_seconds|
+----------------+----------------+
|01:59:30.0000000|            7170|
|00:31:00.0000000|            1860|
|00:28:08.0000000|            1688|
|00:32:00.0000000|            1920|
|00:30:00.0000000|            1800|
+----------------+----------------+
only showing top 5 rows
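
As a sanity check on the arithmetic, the same conversion can be reproduced in plain Python and compared with the rows printed above. The function name `duration_to_seconds` is made up for this sketch, and it splits on `:` rather than using fixed positions.

```python
def duration_to_seconds(duration):
    """Convert an 'HH:MM:SS.fffffff' string into whole seconds."""
    hours, minutes, seconds = duration.split(":")
    # int(float(...)) drops the fractional part, like taking two characters.
    return int(hours) * 60 * 60 + int(minutes) * 60 + int(float(seconds))


# Rows copied from the output above.
expected = {
    "01:59:30.0000000": 7170,
    "00:31:00.0000000": 1860,
    "00:28:08.0000000": 1688,
    "00:32:00.0000000": 1920,
    "00:30:00.0000000": 1800,
}

for text, seconds in expected.items():
    print(text, duration_to_seconds(text), seconds)
```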



In [12]:
# Let's add this column to our dataframe using withColumn()

logs = logs.withColumn(
    "Duration_seconds",
    (
        F.col("Duration").substr(1, 2).cast("int") * 60 * 60
        + F.col("Duration").substr(4, 2).cast("int") * 60
        + F.col("Duration").substr(7, 2).cast("int")
    )
)

logs.printSchema()

root
 |-- LogServiceID: integer (nullable = true)
 |-- LogDate: timestamp (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: timestamp (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nullable = true)
 |-- NetworkAffiliationID: integer (nullable = true)
 |-- SpecialAttenti

>**WARNING**: If you create a column with `withColumn()` and give it a name that already exists in your data frame, PySpark will happily overwrite the existing column.
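
If silent overwrites worry you, a small guard on the column names can run before `withColumn()`. This is a plain-Python sketch with a hypothetical helper, `assert_new_column`. It compares names without regard to case, on the assumption that Spark's default case-insensitive column resolution is in effect.

```python
def assert_new_column(existing_columns, new_name):
    """Return new_name, or raise if it would replace an existing column."""
    # Assumption: names that differ only by case refer to the same column.
    taken = {c.lower() for c in existing_columns}
    if new_name.lower() in taken:
        raise ValueError(f"column {new_name!r} already exists")
    return new_name


existing = ["LogServiceID", "LogDate", "Duration"]  # sample names

print(assert_new_column(existing, "Duration_seconds"))

try:
    assert_new_column(existing, "duration")
except ValueError as error:
    print(error)
```

In the notebook, the call would look like `logs.withColumn(assert_new_column(logs.columns, "Duration_seconds"), ...)`.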


### Renaming and reordering columns

Though renaming can be done with `select()` and `alias()`, here we will use `withColumnRenamed()`.

In [13]:
logs = logs.withColumnRenamed("Duration_seconds", "duration_seconds")

logs.printSchema()

root
 |-- LogServiceID: integer (nullable = true)
 |-- LogDate: timestamp (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: timestamp (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nullable = true)
 |-- NetworkAffiliationID: integer (nullable = true)
 |-- SpecialAttenti

We can also rename all the columns of our data frame in one fell swoop. This relies on the `toDF()` method, which returns a new data frame with the new column names. Just like `drop()`, `toDF()` takes `*cols`, and just like with `select()` and `drop()`, we need to unpack our column names if they are in a list.

In [14]:
logs.toDF(*[x.lower() for x in logs.columns]).printSchema()

root
 |-- logserviceid: integer (nullable = true)
 |-- logdate: timestamp (nullable = true)
 |-- audiencetargetageid: integer (nullable = true)
 |-- audiencetargetethnicid: integer (nullable = true)
 |-- categoryid: integer (nullable = true)
 |-- closedcaptionid: integer (nullable = true)
 |-- countryoforiginid: integer (nullable = true)
 |-- dubdramacreditid: integer (nullable = true)
 |-- ethnicprogramid: integer (nullable = true)
 |-- productionsourceid: integer (nullable = true)
 |-- programclassid: integer (nullable = true)
 |-- filmclassificationid: integer (nullable = true)
 |-- exhibitionid: integer (nullable = true)
 |-- duration: string (nullable = true)
 |-- endtime: string (nullable = true)
 |-- logentrydate: timestamp (nullable = true)
 |-- productionno: string (nullable = true)
 |-- programtitle: string (nullable = true)
 |-- starttime: string (nullable = true)
 |-- subtitle: string (nullable = true)
 |-- networkaffiliationid: integer (nullable = true)
 |-- specialattenti
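
Lowercasing is only one possible transformation; any function from old name to new name works with `toDF()`. As an illustration, the sketch below builds snake_case names with a regular expression. The helper `to_snake_case` is hypothetical and only handles the naming patterns seen in this data set.

```python
import re


def to_snake_case(name):
    """Insert '_' before an uppercase letter that follows a lowercase letter or digit."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


# Sample names taken from the schema above.
columns = ["LogServiceID", "AudienceTargetAgeID", "ProductionNO", "Producer1"]

new_names = [to_snake_case(c) for c in columns]
print(new_names)
```

The renamed data frame would then be `logs.toDF(*[to_snake_case(c) for c in logs.columns])`. It is worth checking that the new names are still unique before applying them.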

In [15]:
# Our final step is reordering columns. 
# Since reordering columns is equivalent to 
# selecting columns in a different order, 
# select() is the perfect method for the job.

logs.select(sorted(logs.columns)).printSchema()

root
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- BroadcastOriginPointID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CompositionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- Language1: integer (nullable = true)
 |-- Language2: integer (nullable = true)
 |-- LogDate: timestamp (nullable = true)
 |-- LogEntryDate: timestamp (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- NetworkAffiliationID: integer (nullable = true)
 |-- Producer1: string (nullable = true)
 |-- Producer2: string (nullable = true)
 |-- ProductionNO: 
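
Alphabetical order is just one choice. A common variation is to put a few key columns first and sort the rest. Since `select()` accepts a list of names, the ordering can be prepared in plain Python; the helper `reorder` below is a hypothetical sketch.

```python
def reorder(column_names, first):
    """Return the names with `first` up front and the remaining ones sorted."""
    missing = [c for c in first if c not in column_names]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    rest = sorted(c for c in column_names if c not in first)
    return list(first) + rest


# Sample names for illustration.
columns = ["Duration", "LogDate", "CategoryID", "LogServiceID"]

ordered = reorder(columns, ["LogServiceID", "LogDate"])
print(ordered)
```

The result can be passed straight to the data frame, for example `logs.select(reorder(logs.columns, ["LogServiceID", "LogDate"]))`.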

### Diagnosing a data frame with `describe()` and `summary()`

When applied to a data frame with no parameters, `describe()` will show summary statistics (count, mean, standard deviation, min, and max) on all numerical and string columns. To avoid screen overflow, I display the column descriptions one by one by iterating over the list of columns and showing the output of `describe()` in the next listing. Note that `describe()` will (lazily) compute the data frame but won’t display it, just like any transformation, so we have to `show()` the result.


In [17]:
for i in logs.columns:
    logs.describe(i).show()

+-------+------------------+
|summary|      LogServiceID|
+-------+------------------+
|  count|            238945|
|   mean| 3450.890284375065|
| stddev|199.50673962554765|
|    min|              3157|
|    max|              3925|
+-------+------------------+

+-------+
|summary|
+-------+
|  count|
|   mean|
| stddev|
|    min|
|    max|
+-------+

+-------+-------------------+
|summary|AudienceTargetAgeID|
+-------+-------------------+
|  count|              16112|
|   mean| 3.4929245283018866|
| stddev| 1.0415963394745125|
|    min|                  1|
|    max|                  4|
+-------+-------------------+

+-------+----------------------+
|summary|AudienceTargetEthnicID|
+-------+----------------------+
|  count|                  1710|
|   mean|    120.56432748538012|
| stddev|      71.9869405943613|
|    min|                     4|
|    max|                   337|
+-------+----------------------+

+-------+------------------+
|summary|        CategoryID|
+-------+-----------
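
The counts above differ from column to column (238945 for `LogServiceID`, 16112 for `AudienceTargetAgeID`) because `count` only considers non-null values. The plain-Python sketch below mimics the five statistics on a small list containing `None`. The helper `describe_sketch` is hypothetical, and it uses the sample standard deviation, which is my understanding of what Spark's `stddev` reports.

```python
import statistics


def describe_sketch(values):
    """Compute count, mean, stddev, min, and max while ignoring None."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "stddev": statistics.stdev(present),  # sample standard deviation
        "min": min(present),
        "max": max(present),
    }


# Made-up values; None stands in for a null entry.
sample = [1, 4, 4, None, 4, None]

summary = describe_sketch(sample)
print(summary)
```

Six values go in, but the count is four: the two missing entries are skipped, just as the sparse ID columns above report far fewer rows than `LogServiceID`.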

`describe()` is a fantastic method, but what if you want more? `summary()` to the rescue! Where `describe()` takes `*cols` as a parameter (one or more columns, the same way as `select()` or `drop()`), `summary()` takes `*statistics` as a parameter. This means that you'll need to select the columns you want to see before calling the `summary()` method.

In [18]:
for i in logs.columns:
    logs.select(i).summary().show()

+-------+------------------+
|summary|      LogServiceID|
+-------+------------------+
|  count|            238945|
|   mean| 3450.890284375065|
| stddev|199.50673962554765|
|    min|              3157|
|    25%|              3287|
|    50%|              3379|
|    75%|              3627|
|    max|              3925|
+-------+------------------+

+-------+
|summary|
+-------+
|  count|
|   mean|
| stddev|
|    min|
|    25%|
|    50%|
|    75%|
|    max|
+-------+

+-------+-------------------+
|summary|AudienceTargetAgeID|
+-------+-------------------+
|  count|              16112|
|   mean| 3.4929245283018866|
| stddev| 1.0415963394745125|
|    min|                  1|
|    25%|                  4|
|    50%|                  4|
|    75%|                  4|
|    max|                  4|
+-------+-------------------+

+-------+----------------------+
|summary|AudienceTargetEthnicID|
+-------+----------------------+
|  count|                  1710|
|   mean|    120.56432748538012|
| st

In [19]:
for i in logs.columns:
    logs.select(i).summary("min","10%","90%").show()

+-------+------------+
|summary|LogServiceID|
+-------+------------+
|    min|        3157|
|    10%|        3236|
|    90%|        3709|
+-------+------------+

+-------+
|summary|
+-------+
|    min|
|    10%|
|    90%|
+-------+

+-------+-------------------+
|summary|AudienceTargetAgeID|
+-------+-------------------+
|    min|                  1|
|    10%|                  1|
|    90%|                  4|
+-------+-------------------+

+-------+----------------------+
|summary|AudienceTargetEthnicID|
+-------+----------------------+
|    min|                     4|
|    10%|                    74|
|    90%|                   258|
+-------+----------------------+

+-------+----------+
|summary|CategoryID|
+-------+----------+
|    min|         1|
|    10%|         3|
|    90%|        29|
+-------+----------+

+-------+---------------+
|summary|ClosedCaptionID|
+-------+---------------+
|    min|              1|
|    10%|              1|
|    90%|              1|
+-------+-----------
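
To build intuition for entries such as `10%` and `90%`, here is a nearest-rank percentile in plain Python. This is only a sketch: `summary()` computes approximate percentiles with its own algorithm, so results on real data may differ, and the helper name `nearest_rank_percentile` is made up.

```python
def nearest_rank_percentile(values, percent):
    """Return the value at the given integer percentile (nearest-rank method)."""
    present = sorted(v for v in values if v is not None)
    if not present:
        return None
    # Ceiling of percent * n / 100 using integer arithmetic, at least 1.
    rank = max(1, (percent * len(present) + 99) // 100)
    return present[rank - 1]


data = list(range(1, 11))  # the integers 1 through 10

low = nearest_rank_percentile(data, 10)
mid = nearest_rank_percentile(data, 50)
high = nearest_rank_percentile(data, 90)
print(low, mid, high)
```

Like the values in the tables above, each percentile returned here is an element of the data rather than an interpolated number.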

***
<p style="text-align:left;">
    <a href="./3_Scaling.ipynb">Previous Chapter</a>
    <span style="float:right;">
        <a href="./5_Joining_Grouping.ipynb">Next Chapter</a>
    </span>
</p>
