# Advanced Data Analysis - week 3, lecture 1, examples

In the advanced data analysis course, we assume basic knowledge of Python, as could be acquired by attending the *Introduction to Programming* bridging course.

This notebook includes the examples and exercises presented in **Week 3**, lecture 1. There is an additional notebook with the examples and exercises suggested for autonomous study during the week.

In **week 3**, we will focus on introducing Spark.

**This notebook should be run on Google Colab**.


## Install Spark

In [1]:
# RUN THIS CELL ONLY IF RUNNING IN COLAB

!apt-get install openjdk-11-jdk-headless
!pip install pyspark


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
openjdk-11-jdk-headless is already the newest version (11.0.20.1+1-0ubuntu1~22.04).
0 upgraded, 0 newly installed, 0 to remove and 17 not upgraded.
Collecting pyspark
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.1-py2.py3-none-any.whl size=311285387 sha256=1e135567cff58f8edd53dda85b1152385ed8746accd2e8f9b569cbc8426e5a26
  Stored in directory: /root/.cache/pip/wheels/0d/77/a3/ff2f74cc9ab41f8f594dabf0579c2a7c6de920d584206e0834
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.1

### Downloading data files

This cell will download the dataset files used in the computation.

In [7]:
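!pip install gdown
# --fuzzy lets gdown extract the file id from a Drive sharing link;
# -O sets the output name expected by unzip below
!gdown --fuzzy "https://drive.google.com/file/d/1Suzt37ohetSKLNP0kFUv0Ji1joiXumir/view?usp=sharing" -O sbe_data_2324.zip
!unzip -o sbe_data_2324.zip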


Downloading...
From: https://drive.google.com/file/d/1Suzt37ohetSKLNP0kFUv0Ji1joiXumir/view?usp=sharing
To: /content/view?usp=sharing
80.5kB [00:00, 4.76MB/s]
Archive:  sbe_data_2324.zip
  inflating: data/AD-covid.csv       
  inflating: data/AE-covid.csv       
  inflating: data/AF-covid.csv       
  inflating: data/AG-covid.csv       
  inflating: data/AL-covid.csv       
  inflating: data/ALL-covid.csv      
  inflating: data/AM-covid.csv       
  inflating: data/AO-covid.csv       
  inflating: data/AR-covid.csv       
  inflating: data/AT-covid.csv       
  inflating: data/AU-covid.csv       
  inflating: data/AW-covid.csv       
  inflating: data/AZ-covid.csv       
  inflating: data/BA-covid.csv       
  inflating: data/BB-covid.csv       
  inflating: data/BD-covid.csv       
  inflating: data/BE-covid.csv       
  inflating: data/BF-covid.csv       
  inflating: data/BG-covid.csv       
  inflating: data/BH-covid.csv       
  inflating: data/BI-covid.csv       
  inflating: da

## Programming (with Pandas API for Spark)

We now show how to program with the [**Pandas API**](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html) for Spark.

To use the Pandas API on Spark, you must import *pyspark.pandas*. After that, you can write code with the familiar Pandas API, but it will run on Spark.

In [10]:
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

# imports pandas API for Spark
import pyspark.pandas as ps


### Data model : DataFrame

In the *Pandas* API on Spark, a table is represented as a [**DataFrame**](https://pandas.pydata.org/docs/reference/frame.html), which is backed by an underlying Spark DataFrame. (Follow the link for the DataFrame documentation.)

There are multiple ways to create an initial DataFrame. For example, you can create a DataFrame from a Python dictionary, as follows:

In [11]:
population = ps.DataFrame({"country": ["PT", "ES", "DE"],
                           "population": [10276617, 46937060, 83019213]})

print( population)


  country  population
0      PT    10276617
1      ES    46937060
2      DE    83019213


Pandas maintains an additional column, the index, holding an increasing integer. This column, the first one shown when printing the DataFrame, is created automatically.

#### Loading DataFrame from CSV files

More often, you will want to load the data from files. To create a DataFrame from a CSV file, you can use the ```read_csv``` function.

Note: If the following code fails, the most likely reason is that you do not have the *data* directory with the data files.

In [12]:
import os

# Build the path in an OS-independent way
# File lec1-example.csv is in directory data
fileName = os.path.join( "data", "lec1-example.csv")

# Read a CSV file into a DataFrame
df = ps.read_csv(fileName)

print( df)




        Name  Age  Educational level Company
0     Andrew   55                1.0    Good
1   Bernhard   43                2.0    Good
2   Carolina   37                5.0     Bad
3     Dennis   82                3.0    Good
4        Eve   23                3.2     Bad
5       Fred   46                5.0    Good
6    Gwyneth   38                4.2     Bad
7     Hayden   50                4.0     Bad
8      Irene   29                4.5     Bad
9      James   42                4.1    Good
10     Kevin   35                4.5     Bad
11       Lea   38                2.5    Good
12    Marcus   31                4.8     Bad
13     Nigel   71                2.3    Good


#### Saving DataFrame into CSV files

You can save a DataFrame into a CSV file using the ```to_csv``` function.

In [13]:
import os

# Build the path in an OS-independent way
# File lec5-saved.csv will be in directory data
fileName = os.path.join( "data", "lec5-saved.csv")

# Save the DataFrame into a CSV file
df.to_csv( fileName)




Please check the file created. Is it the same as the original lec1-example.csv?

No, it has an additional column with the row number. You can also save the DataFrame without this number by using the ```index=False``` option.

In [14]:
import os

# Build the path in an OS-independent way
# File lec5-saved-noindex.csv will be in directory data
fileName = os.path.join( "data", "lec5-saved-noindex.csv")

# Save the DataFrame into a CSV file, without the index column
df.to_csv( fileName, index=False)


### Data processing with Pandas

We now show the transformations necessary to perform the exercises proposed above.


#### Selecting rows based on conditions

It is possible to select the rows for which a column has a given value as follows:

In [15]:
# Select the persons that are good company.
good = df[df["Company"]=="Good"]

print(good)

        Name  Age  Educational level Company
0     Andrew   55                1.0    Good
1   Bernhard   43                2.0    Good
3     Dennis   82                3.0    Good
5       Fred   46                5.0    Good
9      James   42                4.1    Good
11       Lea   38                2.5    Good
13     Nigel   71                2.3    Good


In [16]:
# Select the persons that are good company and have an educational level of at least 3.
goodEd3plus = df[(df["Company"]=="Good") & (df["Educational level"]>=3.0)]

print(goodEd3plus)

     Name  Age  Educational level Company
3  Dennis   82                3.0    Good
5    Fred   46                5.0    Good
9   James   42                4.1    Good
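
Conditions can also be combined with `|` (or) and negated with `~`. The sketch below uses plain pandas on a small hypothetical table (not the course data), so that it runs without a Spark session; the same expressions are expected to work with `pyspark.pandas`. Each condition needs its own parentheses because `&`, `|` and `~` bind more tightly than comparison operators such as `==`.

```python
import pandas as pd

# Hypothetical sample data, mirroring some columns of lec1-example.csv
people = pd.DataFrame({
    "Name": ["Andrew", "Carolina", "Dennis", "Eve"],
    "Age": [55, 37, 82, 23],
    "Company": ["Good", "Bad", "Good", "Bad"],
})

# OR: persons that are good company or younger than 30
good_or_young = people[(people["Company"] == "Good") | (people["Age"] < 30)]

# NOT: persons that are not good company
not_good = people[~(people["Company"] == "Good")]

print(good_or_young["Name"].tolist())
print(not_good["Name"].tolist())
```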


#### Selecting a subset of the columns

Often, we do not need all data that is in a table. We can get rid of the data we do not need by selecting the columns we want using the following syntax ```dataframe[[col1,col2,...]]```.

In the following example we create a new DataFrame containing only the Name and Age columns.

In [17]:
# Keep only the Name and Age columns.
person_age = df[["Name","Age"]]

print(person_age)

        Name  Age
0     Andrew   55
1   Bernhard   43
2   Carolina   37
3     Dennis   82
4        Eve   23
5       Fred   46
6    Gwyneth   38
7     Hayden   50
8      Irene   29
9      James   42
10     Kevin   35
11       Lea   38
12    Marcus   31
13     Nigel   71


#### Applying reduce/aggregation functions

Pandas allows you to compute a reduction/aggregation over the values of one or multiple columns.

You must select the columns for which you want to perform the computation, and then call the reduce/aggregation function.

The following example first computes the minimum age (```min``` function), and then the minimum of both *Age* and *Educational level* at the same time. Pandas has multiple useful aggregation functions, including maximum (```max```), minimum (```min```), mean (```mean```), median (```median```), and standard deviation (```std```). Check the [**DataFrame** documentation](https://pandas.pydata.org/docs/reference/frame.html) for the list of available functions.

In [18]:
minAge = good["Age"].min()
print( "Minimum age is ")
print( minAge)

mins = good[["Age","Educational level"]].min()
print( "Minimum information for several columns now")
print( mins)


Minimum age is 
38
Minimum information for several columns now
Age                  38.0
Educational level     1.0
dtype: float64


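Wait, this was not what we wanted in the first place: we want the full information about the youngest person that is good company, not just the minimum age.

The function ```nsmallest(num elems, columns)``` allows you to compute that: it returns the rows with the smallest values in the given columns.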

In [19]:
youngestGood = good.nsmallest(1,["Age"])
print( youngestGood)


   Name  Age  Educational level Company
11  Lea   38                2.5    Good
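
As a minimal sketch of how ```nsmallest``` relates to sorting, the example below uses plain pandas and a hypothetical four-row table (not the course data). It shows that asking for the two smallest ages gives the same rows as sorting by age and keeping the first two, and that ```nlargest``` is the counterpart for the largest values.

```python
import pandas as pd

# Hypothetical sample data
good_people = pd.DataFrame({
    "Name": ["Andrew", "Dennis", "Lea", "Nigel"],
    "Age": [55, 82, 38, 71],
})

# The two youngest persons, returned as full rows
two_youngest = good_people.nsmallest(2, "Age")

# Same rows obtained by sorting and keeping the first two
two_youngest_sorted = good_people.sort_values("Age").head(2)

# nlargest is the counterpart for the largest values
oldest = good_people.nlargest(1, "Age")

print(two_youngest)
print(oldest)
```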


#### Applying reduce/aggregation functions per group

```groupby([cols])``` allows you to group the rows of a DataFrame before applying an aggregation function to each of the groups.

The following example computes the lowest age for each value of Company.


In [20]:
youngest = df[["Age","Company"]].groupby(["Company"]).min()
print( youngest)

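youngestAny = youngest.idxmin()
# idxmin returns the index label (the Company value) of the row with the lowest Age
print( "The company group with the youngest person is " + youngestAny["Age"])


         Age
Company     
Good      38
Bad       23
The company group with the youngest person is Bad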
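
The sketch below, written with plain pandas on hypothetical data, illustrates two points: after ```groupby```, the grouping column becomes the index of the result, so ```idxmin``` returns a group label rather than a person; and several aggregations can be requested at once with ```agg```. Whether every form of ```agg``` is available in `pyspark.pandas` should be checked against its documentation.

```python
import pandas as pd

# Hypothetical sample data
people = pd.DataFrame({
    "Age": [55, 43, 37, 23],
    "Company": ["Good", "Good", "Bad", "Bad"],
})

# Minimum age per group; "Company" becomes the index of the result
youngest = people.groupby("Company")[["Age"]].min()

# idxmin returns the index label of the smallest value, here a Company value
group_with_youngest = youngest["Age"].idxmin()

# Several aggregations per group in a single call
stats = people.groupby("Company")["Age"].agg(["min", "max", "mean"])

print(group_with_youngest)
print(stats)
```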


#### Access Spark Dataframes from Pandas Spark Dataframe

The native Spark DataFrame has a different API from the Pandas API on Spark. It is possible to expose a Pandas-on-Spark DataFrame as a native Spark DataFrame using the *to_spark* function.

To print the data in a Spark DataFrame, use the *show* function; *printSchema* prints the column names and types.


In [21]:
dfSDF = df.to_spark()

dfSDF.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Educational level: double (nullable = true)
 |-- Company: string (nullable = true)





In [22]:
fileName = os.path.join( "data", "lec1-example.csv")

df = ps.read_csv(fileName)

youngest = df[["Age","Company"]].groupby(["Company"]).min()

print( youngest)



         Age
Company     
Good      38
Bad       23


In [23]:
youngestSDF = youngest.to_spark(index_col='index')

youngestSDF.show()

+-----+---+
|index|Age|
+-----+---+
| Good| 38|
|  Bad| 23|
+-----+---+



An interesting function is *explain*, which shows the plan of operations that Spark will execute to compute the result.

In [24]:
youngestSDF.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[__index_level_0__#587], functions=[min(Age#550)])
   +- Exchange hashpartitioning(__index_level_0__#587, 200), ENSURE_REQUIREMENTS, [plan_id=605]
      +- HashAggregate(keys=[__index_level_0__#587], functions=[partial_min(Age#550)])
         +- Project [Company#552 AS __index_level_0__#587, Age#550]
            +- Filter atleastnnonnulls(1, Company#552)
               +- FileScan csv [Age#550,Company#552] Batched: false, DataFilters: [atleastnnonnulls(1, Company#552)], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/content/data/lec1-example.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<Age:int,Company:string>




In the plan above, which lists the operations to be executed (from bottom to top), it is possible to observe that Spark executes the **min** aggregation by first computing the **partial_min** for each group in each partition, and only after that exchanges data among partitions (Exchange line) to compute the final minimum for each group. Note that this approach is much more efficient than propagating all rows among partitions, as much less data is transferred.
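
The two-step aggregation in this plan can be imitated in pure Python. This is only a sketch of the idea, not Spark's implementation: the three partitions and their contents are hypothetical, and Spark decides the real data layout.

```python
from collections import defaultdict

# Hypothetical partitions of (company, age) rows
partitions = [
    [("Good", 55), ("Good", 43), ("Bad", 37)],
    [("Good", 82), ("Bad", 23), ("Good", 46)],
    [("Bad", 38), ("Bad", 50), ("Good", 38)],
]

def partial_min(rows):
    """Compute the minimum age per company inside one partition."""
    result = {}
    for company, age in rows:
        result[company] = min(age, result.get(company, age))
    return result

# Step 1: each partition reduces its own rows (partial_min in the plan)
partials = [partial_min(rows) for rows in partitions]

# Step 2: only the small partial results are exchanged, grouped by key
exchanged = defaultdict(list)
for partial in partials:
    for company, age in partial.items():
        exchanged[company].append(age)

# Step 3: the final minimum per company is computed from the partial results
final_min = {company: min(ages) for company, ages in exchanged.items()}

rows_total = sum(len(rows) for rows in partitions)
values_exchanged = sum(len(partial) for partial in partials)
print(final_min)
print(rows_total, "rows read,", values_exchanged, "values exchanged")
```

In this toy case the saving is small (6 values exchanged instead of 9 rows), but the number of exchanged values depends on the number of groups per partition, not on the number of rows.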

Now let's see another example.

In [31]:
df[["Age","Company"]].groupby(["Company"]).mean()

               Age
Company           
Good     53.857143
Bad      34.714286


In [32]:
means = df[["Age","Company"]].groupby(["Company"]).mean().reset_index()
meanGood = means[means["Company"]=="Good"]
meanGood.to_spark().explain()


== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[__index_level_0__#892], functions=[avg(Age#550)])
   +- Exchange hashpartitioning(__index_level_0__#892, 200), ENSURE_REQUIREMENTS, [plan_id=1015]
      +- HashAggregate(keys=[__index_level_0__#892], functions=[partial_avg(Age#550)])
         +- Project [Company#552 AS __index_level_0__#892, Age#550]
            +- Filter (atleastnnonnulls(1, Company#552) AND CASE WHEN CASE WHEN isnull((Company#552 = Good)) THEN false ELSE isnull((Company#552 = Good)) END THEN false ELSE CASE WHEN isnull((Company#552 = Good)) THEN false ELSE (Company#552 = Good) END END)
               +- FileScan csv [Age#550,Company#552] Batched: false, DataFilters: [atleastnnonnulls(1, Company#552), CASE WHEN CASE WHEN isnull((Company#552 = Good)) THEN false EL..., Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/content/data/lec1-example.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<Age:int,Company:string>






The plan should be read from bottom to top, starting with FileScan, which represents reading the data from the file. In this example, you can see that, even though filtering is the last operation in the program (```meanGood = means[means["Company"]=="Good"]```), Spark starts by filtering out the rows for which the company is not good, right after reading them from the file.

Why?

Because there is no point in computing the mean age of the persons that are not good company if, in the end, only the result for those that are good company is needed. This kind of optimization can have a great impact on program execution when the data is large.
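
The reordering is safe here because the filter is on the grouping key, so both orders give the same result. The pure-Python sketch below checks this on a few hypothetical rows; it illustrates the equivalence only and is not how Spark's optimizer is implemented.

```python
from statistics import mean

# Hypothetical (company, age) rows
rows = [("Good", 55), ("Bad", 37), ("Good", 43), ("Bad", 23), ("Good", 82)]

def mean_age_by_company(data):
    """Group the rows by company and compute the mean age of each group."""
    groups = {}
    for company, age in data:
        groups.setdefault(company, []).append(age)
    return {company: mean(ages) for company, ages in groups.items()}

# As written in the program: aggregate every group, then keep "Good"
aggregate_then_filter = {
    company: value
    for company, value in mean_age_by_company(rows).items()
    if company == "Good"
}

# As planned by the optimizer: drop the other rows first, then aggregate
filter_then_aggregate = mean_age_by_company(
    [row for row in rows if row[0] == "Good"]
)

print(aggregate_then_filter == filter_then_aggregate)
```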

#### Creating Pandas Dataframes from Pandas Spark Dataframe

It is possible to create a *plain* Pandas DataFrame from a Pandas-on-Spark DataFrame using the *to_pandas* function.

While a Pandas-on-Spark DataFrame is distributed and partitioned across multiple machines/cores, a plain Pandas DataFrame resides on a single machine, so the data must fit in that machine's memory.


In [33]:
dfPDF = df.to_pandas()

print(dfPDF)

        Name  Age  Educational level Company
0     Andrew   55                1.0    Good
1   Bernhard   43                2.0    Good
2   Carolina   37                5.0     Bad
3     Dennis   82                3.0    Good
4        Eve   23                3.2     Bad
5       Fred   46                5.0    Good
6    Gwyneth   38                4.2     Bad
7     Hayden   50                4.0     Bad
8      Irene   29                4.5     Bad
9      James   42                4.1    Good
10     Kevin   35                4.5     Bad
11       Lea   38                2.5    Good
12    Marcus   31                4.8     Bad
13     Nigel   71                2.3    Good


