# Making Your Way Into PySpark
* **Purpose**: This notebook walks through a few basic commands to help you get familiar with PySpark if you are coming from Pandas.

We will go over the main functions for:
* Loading data
* Summarizing
* Slicing
* Filtering
* Grouping
* Replacing
* Arranging


## Installing PySpark in Colab

In [None]:
!pip install pyspark py4j

## Imports

In [23]:
# import spark and functions
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col, mean, count, when

# Imports from Python
import pandas as pd
import numpy as np

In [None]:
# Path of the dataset to be used
pth = "/content/sample_data/california_housing_train.csv"

## Loading a dataset

### Pandas

In [24]:
# Load data with Pandas
dfp = pd.read_csv(pth)

In [26]:
# Visualizing the data
dfp.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


To do the same with Spark, we first need to create a Spark session.

### Creating a Spark Session

In [13]:
# Create a spark session
spark = SparkSession.builder.appName("tests").getOrCreate()

### Loading Dataset with Spark

In [14]:
# Load data to session
df = spark.read.csv(pth, header=True, inferSchema=True)

In [27]:
# Visualizing the Data
df.limit(5).show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|           80100.0|
|  -114.56|   33.69|              17.0|      720.0|         174.0|     333.0|     117.0|       1.6509|           85700.0|
|  -114.57|   33.64|              14.0|     1501.0|         337.0|     515.0|     226.0|       3.1917|           73400.0|
|  -114.57|   33.57|              20.0|     1454.0|         326.0|     624.0|     262.0|        1.925|           65500.0|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+

## Summarizing Data

### Pandas

In [28]:
# Summarizing Data in Pandas
dfp.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-119.562108,35.625225,28.589353,2643.664412,539.410824,1429.573941,501.221941,3.883578,207300.912353
std,2.005166,2.13734,12.586937,2179.947071,421.499452,1147.852959,384.520841,1.908157,115983.764387
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.79,33.93,18.0,1462.0,297.0,790.0,282.0,2.566375,119400.0
50%,-118.49,34.25,29.0,2127.0,434.0,1167.0,409.0,3.5446,180400.0
75%,-118.0,37.72,37.0,3151.25,648.25,1721.0,605.25,4.767,265000.0
max,-114.31,41.95,52.0,37937.0,6445.0,35682.0,6082.0,15.0001,500001.0


### Spark

In [29]:
# Summarizing Data with Spark
df.describe().show()

+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+------------------+------------------+
|summary|          longitude|          latitude|housing_median_age|      total_rooms|   total_bedrooms|        population|       households|     median_income|median_house_value|
+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+------------------+------------------+
|  count|              17000|             17000|             17000|            17000|            17000|             17000|            17000|             17000|             17000|
|   mean|-119.56210823529375|  35.6252247058827| 28.58935294117647|2643.664411764706|539.4108235294118|1429.5739411764705|501.2219411764706| 3.883578100000021|207300.91235294117|
| stddev| 2.0051664084260357|2.1373397946570867|12.586936981660406|2179.947071452777|421.4994515798648| 1

Spark's `describe()` does not report percentiles, but we can compute them when needed.

In [42]:
# Percentiles in Spark (F.percentile requires Spark 3.5+)
(df
 .agg(*[F.percentile(c, [.25, .5, .75]) for c in df.columns])  # `c` avoids shadowing the imported `col`
 .show()
)

+------------------------------------------------+-----------------------------------------------+---------------------------------------------------------+--------------------------------------------------+-----------------------------------------------------+-------------------------------------------------+-------------------------------------------------+----------------------------------------------------+---------------------------------------------------------+
|percentile(longitude, array(0.25, 0.5, 0.75), 1)|percentile(latitude, array(0.25, 0.5, 0.75), 1)|percentile(housing_median_age, array(0.25, 0.5, 0.75), 1)|percentile(total_rooms, array(0.25, 0.5, 0.75), 1)|percentile(total_bedrooms, array(0.25, 0.5, 0.75), 1)|percentile(population, array(0.25, 0.5, 0.75), 1)|percentile(households, array(0.25, 0.5, 0.75), 1)|percentile(median_income, array(0.25, 0.5, 0.75), 1)|percentile(median_house_value, array(0.25, 0.5, 0.75), 1)|
+------------------------------------------------+----

## Slicing

### Pandas

In [54]:
# Slicing (Selecting) Data in Pandas
dfp.loc[10:20, ['households', 'housing_median_age', 'median_house_value']]

Unnamed: 0,households,housing_median_age,median_house_value
10,824.0,16.0,86500.0
11,437.0,21.0,62000.0
12,211.0,48.0,48600.0
13,479.0,31.0,70400.0
14,300.0,15.0,45000.0
15,401.0,17.0,69100.0
16,256.0,28.0,94900.0
17,27.0,21.0,25000.0
18,320.0,17.0,44000.0
19,15.0,17.0,27500.0


### Spark

In [55]:
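# Slicing in Spark
# Spark DataFrames have no row index, so you can't slice directly by row position.
# limit(10) returns the first 10 rows, not rows 10 to 20 as in the Pandas example.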
(df
 .select('households', 'housing_median_age', 'median_house_value')
 .limit(10)
 .show()
)

+----------+------------------+------------------+
|households|housing_median_age|median_house_value|
+----------+------------------+------------------+
|     472.0|              15.0|           66900.0|
|     463.0|              19.0|           80100.0|
|     117.0|              17.0|           85700.0|
|     226.0|              14.0|           73400.0|
|     262.0|              20.0|           65500.0|
|     239.0|              29.0|           74000.0|
|     633.0|              25.0|           82400.0|
|     158.0|              41.0|           48500.0|
|    1056.0|              34.0|           58400.0|
|     271.0|              46.0|           48100.0|
+----------+------------------+------------------+



## Filtering

### Pandas

In [59]:
# Filtering data in Pandas
dfp.query('housing_median_age < 20').head(10)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
10,-114.6,33.62,16.0,3741.0,801.0,2434.0,824.0,2.6797,86500.0
14,-114.63,32.76,15.0,1448.0,378.0,949.0,300.0,0.8585,45000.0
15,-114.65,34.89,17.0,2556.0,587.0,1005.0,401.0,1.6991,69100.0
18,-114.66,32.74,17.0,1388.0,386.0,775.0,320.0,1.2049,44000.0
19,-114.67,33.92,17.0,97.0,24.0,29.0,15.0,1.2656,27500.0
23,-114.98,33.82,15.0,644.0,129.0,137.0,52.0,3.2097,71300.0


### Spark

In [57]:
# Filtering data in Spark
(df
 .filter(df.housing_median_age < 20)
 .show()  # show() prints the first 20 rows by default
)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|           80100.0|
|  -114.56|   33.69|              17.0|      720.0|         174.0|     333.0|     117.0|       1.6509|           85700.0|
|  -114.57|   33.64|              14.0|     1501.0|         337.0|     515.0|     226.0|       3.1917|           73400.0|
|   -114.6|   33.62|              16.0|     3741.0|         801.0|    2434.0|     824.0|       2.6797|           86500.0|
|  -114.63|   32.76|    

## Grouping

### Pandas

In [66]:
# Grouping in Pandas
(dfp
 .groupby('housing_median_age')
 ['median_house_value']
 .mean()
 .reset_index()
 .sort_values('housing_median_age')
 .head(10)
)

Unnamed: 0,housing_median_age,median_house_value
0,1.0,190250.0
1,2.0,229438.836735
2,3.0,239450.043478
3,4.0,230054.10559
4,5.0,211035.708543
5,6.0,206768.24031
6,7.0,188445.059603
7,8.0,190805.073034
8,9.0,190306.994186
9,10.0,178416.393805


In [73]:
# Get different aggregation values for different variables
(dfp
 .groupby('housing_median_age')
 .agg({'median_house_value': 'mean',
       'population':'max',
       'median_income':'median'})
 .reset_index()
 .sort_values('housing_median_age')
 .head(10)
 )

Unnamed: 0,housing_median_age,median_house_value,population,median_income
0,1.0,190250.0,872.0,4.7568
1,2.0,229438.836735,8652.0,4.6336
2,3.0,239450.043478,9623.0,5.40415
3,4.0,230054.10559,16122.0,4.9432
4,5.0,211035.708543,11956.0,4.3598
5,6.0,206768.24031,8222.0,4.1458
6,7.0,188445.059603,15037.0,3.9464
7,8.0,190805.073034,15507.0,3.9363
8,9.0,190306.994186,12873.0,4.05935
9,10.0,178416.393805,9851.0,3.85075



### Spark

In [18]:
# Grouping in Spark
(df
 .groupBy('housing_median_age')
 .agg(mean('median_house_value').alias('median_house_value'))
 .sort('housing_median_age')
 .show()
)

+------------------+------------------+
|housing_median_age|median_house_value|
+------------------+------------------+
|               1.0|          190250.0|
|               2.0|229438.83673469388|
|               3.0|239450.04347826086|
|               4.0| 230054.1055900621|
|               5.0|211035.70854271358|
|               6.0|206768.24031007753|
|               7.0|  188445.059602649|
|               8.0|190805.07303370786|
|               9.0| 190306.9941860465|
|              10.0|178416.39380530972|
|              11.0|       182480.3125|
|              12.0|     181590.640625|
|              13.0|188065.47389558234|
|              14.0|191181.28818443805|
|              15.0|181031.02403846153|
|              16.0|200354.50708661418|
|              17.0|191772.24479166666|
|              18.0|192074.71548117156|
|              19.0|196017.75242718446|
|              20.0|192681.75195822454|
+------------------+------------------+
only showing top 20 rows



In [76]:
# Different aggregations for different columns in Spark (F.median requires Spark 3.4+)
(df
 .groupBy('housing_median_age')
 .agg(mean('median_house_value').alias('median_house_value'),
      F.max('population').alias('population'),
      F.median('median_income').alias('median_income'))
 .sort('housing_median_age')
 .show()
)

+------------------+------------------+----------+------------------+
|housing_median_age|median_house_value|population|     median_income|
+------------------+------------------+----------+------------------+
|               1.0|          190250.0|     872.0|            4.7568|
|               2.0|229438.83673469388|    8652.0|            4.6336|
|               3.0|239450.04347826086|    9623.0|           5.40415|
|               4.0| 230054.1055900621|   16122.0|            4.9432|
|               5.0|211035.70854271358|   11956.0|            4.3598|
|               6.0|206768.24031007753|    8222.0|            4.1458|
|               7.0|  188445.059602649|   15037.0|            3.9464|
|               8.0|190805.07303370786|   15507.0|            3.9363|
|               9.0| 190306.9941860465|   12873.0|           4.05935|
|              10.0|178416.39380530972|    9851.0|3.8507499999999997|
|              11.0|       182480.3125|   28566.0|           3.59375|
|              12.0|

## Replacing

### Pandas

In [84]:
# Replacing values in Pandas
(dfp
 .assign(housing_median_age=  # overwrite the column with the replaced values
         dfp['housing_median_age'].where(dfp['housing_median_age'] > 15,  # keep ages above 15
                                         other="potential buy"))          # label the rest
)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,potential buy,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,potential buy,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0
...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0


### Spark

In [89]:
# Replace values in Spark
(df
 .withColumn('housing_median_age',
             when(col('housing_median_age') <= 15, 'potential buy')
             .otherwise(col('housing_median_age')))
 .show()
)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|  -114.31|   34.19|     potential buy|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|           80100.0|
|  -114.56|   33.69|              17.0|      720.0|         174.0|     333.0|     117.0|       1.6509|           85700.0|
|  -114.57|   33.64|     potential buy|     1501.0|         337.0|     515.0|     226.0|       3.1917|           73400.0|
|  -114.57|   33.57|              20.0|     1454.0|         326.0|     624.0|     262.0|        1.925|           65500.0|
|  -114.58|   33.63|    

## Arranging

### Pandas

In [91]:
# Arrange values in Pandas
(dfp
 .sort_values('median_house_value')
 .head(10)
)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
568,-117.02,36.4,19.0,619.0,239.0,490.0,164.0,2.1,14999.0
16643,-122.74,39.71,16.0,255.0,73.0,85.0,38.0,1.6607,14999.0
16801,-123.17,40.31,36.0,98.0,28.0,18.0,8.0,0.536,14999.0
3226,-117.86,34.24,52.0,803.0,267.0,628.0,225.0,4.1932,14999.0
7182,-118.33,34.15,39.0,493.0,168.0,259.0,138.0,2.3667,17500.0
15499,-122.32,37.93,33.0,296.0,73.0,216.0,63.0,2.675,22500.0
11653,-121.29,37.95,52.0,107.0,79.0,167.0,53.0,0.7917,22500.0
264,-116.57,35.43,8.0,9975.0,1743.0,6835.0,1439.0,2.7138,22500.0
17,-114.65,32.79,21.0,44.0,33.0,64.0,27.0,0.8571,25000.0
9636,-119.45,35.13,34.0,1440.0,309.0,808.0,294.0,2.3013,26600.0


### Spark

In [92]:
# Arrange values in Spark
(df
 .orderBy('median_house_value')
 .show()
)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|  -117.02|    36.4|              19.0|      619.0|         239.0|     490.0|     164.0|          2.1|           14999.0|
|  -117.86|   34.24|              52.0|      803.0|         267.0|     628.0|     225.0|       4.1932|           14999.0|
|  -122.74|   39.71|              16.0|      255.0|          73.0|      85.0|      38.0|       1.6607|           14999.0|
|  -123.17|   40.31|              36.0|       98.0|          28.0|      18.0|       8.0|        0.536|           14999.0|
|  -118.33|   34.15|              39.0|      493.0|         168.0|     259.0|     138.0|       2.3667|           17500.0|
|  -116.57|   35.43|    