# Analyzing Tabular Data

Unlike previous notebooks, where we dealt with unstructured textual data, here we turn to structured data. We will focus on the `SparkReader` object for reading delimited data.

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

We can easily create a data frame from data already in our program with the `spark.createDataFrame` function, as below. The first parameter is the data itself: you can provide a list of items (here, a list of lists), a pandas data frame, or a resilient distributed dataset (RDD). The second parameter is the schema of the data frame. Here we pass a list of column names, which is sufficient because PySpark will infer the types of our columns (string, long, and double, respectively).

In [3]:
my_grocery_list = [
    ["Banana", 2, 1.74],
    ["Apple", 4, 2.04],
    ["Carrot", 1, 1.09],
    ["Cake", 1, 10.99],
]

df_grocery_list = spark.createDataFrame(
    my_grocery_list, ["Item", "Quantity", "Price"]
)

df_grocery_list.printSchema()

root
 |-- Item: string (nullable = true)
 |-- Quantity: long (nullable = true)
 |-- Price: double (nullable = true)



## Exploratory Data Analysis
PySpark doesn’t provide any charting capabilities and doesn’t play well with charting libraries (like `Matplotlib`, `seaborn`, `Altair`, or `plot.ly`), and this makes sense: PySpark distributes your data over many computers, and there is little point in distributing the creation of a chart. The usual solution is to transform your data using PySpark, use the `toPandas()` method to convert your PySpark data frame into a pandas data frame, and then use your favorite charting library.

When using `toPandas()`, remember that you lose the advantages of working with multiple machines, as all the data will accumulate on the driver. Reserve this operation for an aggregated or otherwise manageable data set.

A general rule of thumb: if $\#\ \text{rows} \times \#\ \text{columns} > 100{,}000$ on a 16 GB driver, it's better to reduce the data size further before converting.
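The rule of thumb can be turned into a small guard around `toPandas()`. The sketch below is only an illustration: the helper names `cell_count` and `to_pandas_if_small` and the `max_cells` parameter are hypothetical, not part of PySpark, and the default of 100,000 is simply the figure quoted above for a 16 GB driver. It relies only on the `count()`, `columns`, and `toPandas()` members of a PySpark data frame.

```python
def cell_count(n_rows: int, n_cols: int) -> int:
    """Rows times columns: the crude size estimate from the rule of thumb."""
    return n_rows * n_cols


def to_pandas_if_small(df, max_cells: int = 100_000):
    """Convert a PySpark data frame to pandas only if it looks small enough.

    `max_cells` is an assumed threshold (100,000 for a 16 GB driver);
    treat it as a starting point, not a hard limit.
    """
    # count() triggers a Spark job; len(df.columns) only reads metadata.
    cells = cell_count(df.count(), len(df.columns))
    if cells > max_cells:
        raise ValueError(
            f"{cells:,} cells exceeds the limit of {max_cells:,}; "
            "aggregate or filter the data frame before calling toPandas()."
        )
    return df.toPandas()
```

For example, `to_pandas_if_small(df_grocery_list)` would pass the check, since the grocery list has 4 rows and 3 columns (12 cells). Because `count()` is itself a full pass over the data, a guard like this probably makes most sense right after an aggregation, when the result is expected to be small anyway.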

The **dataset** we are using comes from the CRTC (Canadian Radio-Television and Telecommunications Commission). Every broadcaster is mandated to provide a complete log of the programs and commercials shown to the Canadian public.

We will try to answer this question: *"What are the channels with the greatest and least proportion of commercials?"*