# Spark API Exercises

__1) Create a spark data frame that contains your favorite programming languages.__

* The name of the column should be language
* View the schema of the dataframe
* Output the shape of the dataframe
* Show the first 5 records in the dataframe

In [1]:
import pandas as pd
import numpy as np
import pyspark

In [2]:
#Create the df
languages = pd.DataFrame({
    'language': ['Python', 'C#', 'C', 'C++', 'SQL', 'Java', 'JavaScript']
})

In [5]:
#Start spark
spark = pyspark.sql.SparkSession.builder.getOrCreate()

#Create spark df
df = spark.createDataFrame(languages)

In [45]:
#Show the schema
df.printSchema()

root
 |-- language: string (nullable = true)



In [15]:
#Show the shape. # rows, # columns
df.count(), len(df.columns)

(7, 1)

In [16]:
df.show(5)

+--------+
|language|
+--------+
|  Python|
|      C#|
|       C|
|     C++|
|     SQL|
+--------+
only showing top 5 rows



__2) Load the mpg dataset as a spark dataframe.__

In [17]:
from pydataset import data

In [31]:
#Get the df
mpg = data('mpg')

In [32]:
#Create spark df
mpg = spark.createDataFrame(mpg)

In [23]:
mpg.show(2)

+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|manufacturer|model|displ|year|cyl|     trans|drv|cty|hwy| fl|  class|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|        audi|   a4|  1.8|1999|  4|  auto(l5)|  f| 18| 29|  p|compact|
|        audi|   a4|  1.8|1999|  4|manual(m5)|  f| 21| 29|  p|compact|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
only showing top 2 rows



__a) Create 1 column of output that contains a message like the one below for each vehicle:__

* The 1999 audi a4 has a 4 cylinder engine.

In [25]:
from pyspark.sql.functions import concat, lit

In [35]:
mpg.select(
    concat(
        lit('The '), 'year', lit(' '), 'manufacturer', lit(' '), 'model',
        lit(' has a '), 'cyl', lit(' cylinder engine.'),
    ).alias('string')
).show(truncate=False)

+--------------------------------------------------------------+
|string                                                        |
+--------------------------------------------------------------+
|The 1999 audi a4 has a 4 cylinder engine.                     |
|The 1999 audi a4 has a 4 cylinder engine.                     |
|The 2008 audi a4 has a 4 cylinder engine.                     |
|The 2008 audi a4 has a 4 cylinder engine.                     |
|The 1999 audi a4 has a 6 cylinder engine.                     |
|The 1999 audi a4 has a 6 cylinder engine.                     |
|The 2008 audi a4 has a 6 cylinder engine.                     |
|The 1999 audi a4 quattro has a 4 cylinder engine.             |
|The 1999 audi a4 quattro has a 4 cylinder engine.             |
|The 2008 audi a4 quattro has a 4 cylinder engine.             |
|The 2008 audi a4 quattro has a 4 cylinder engine.             |
|The 1999 audi a4 quattro has a 6 cylinder engine.             |
|The 1999 audi a4 quattro

__b) Transform the trans column so that it only contains either manual or auto.__

In [36]:
from pyspark.sql.functions import regexp_replace

In [38]:
mpg.select('trans', regexp_replace('trans', r'\(.*\)$', '')).show()

+----------+-----------------------------------+
|     trans|regexp_replace(trans, \(.*\)$, , 1)|
+----------+-----------------------------------+
|  auto(l5)|                               auto|
|manual(m5)|                             manual|
|manual(m6)|                             manual|
|  auto(av)|                               auto|
|  auto(l5)|                               auto|
|manual(m5)|                             manual|
|  auto(av)|                               auto|
|manual(m5)|                             manual|
|  auto(l5)|                               auto|
|manual(m6)|                             manual|
|  auto(s6)|                               auto|
|  auto(l5)|                               auto|
|manual(m5)|                             manual|
|  auto(s6)|                               auto|
|manual(m6)|                             manual|
|  auto(l5)|                               auto|
|  auto(s6)|                               auto|
|  auto(s6)|        
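
The pattern can be sanity-checked outside Spark. The sketch below applies the same regular expression with Python's `re` module to a few `trans` values taken from the output above. It assumes the pattern behaves the same under Python's regex engine as under Spark's Java-based one, which should hold for a pattern this simple. `strip_trans_detail` is a hypothetical helper name, not part of the exercise.

```python
import re

# Same pattern as the regexp_replace call: a parenthesised suffix at the end of the string.
TRANS_SUFFIX = re.compile(r'\(.*\)$')


def strip_trans_detail(trans):
    """Drop the trailing '(...)' detail, leaving only 'auto' or 'manual'."""
    return TRANS_SUFFIX.sub('', trans)


# Sample values copied from the trans column shown above.
samples = ['auto(l5)', 'manual(m5)', 'manual(m6)', 'auto(av)', 'auto(s6)']
cleaned = [strip_trans_detail(value) for value in samples]
print(cleaned)
```

The `select` above only displays the cleaned values next to the original column. To overwrite the column instead, one option is `mpg.withColumn('trans', regexp_replace('trans', r'\(.*\)$', ''))`.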

__3) Load the tips dataset as a spark dataframe.__

In [59]:
#Load the data
tips = data('tips')

In [60]:
#Create the spark df
tips = spark.createDataFrame(tips)

__a) What percentage of observations are smokers?__

In [43]:
# Note: importing min and max here shadows Python's built-ins of the same name
from pyspark.sql.functions import min, max, mean, asc, desc

In [51]:
tips.show(3)

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
+----------+----+------+------+---+------+----+
only showing top 3 rows



In [54]:
(tips.filter(tips.smoker == 'Yes').count() / tips.count())

0.38114754098360654

__b) Create a column that contains the tip percentage__

In [61]:
tips = tips.select('*', (tips.tip / tips.total_bill).alias('tip_percentage'))
tips.show()

+----------+----+------+------+---+------+----+-------------------+
|total_bill| tip|   sex|smoker|day|  time|size|     tip_percentage|
+----------+----+------+------+---+------+----+-------------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|0.05944673337257211|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|0.16054158607350097|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|0.16658733936220846|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2| 0.1397804054054054|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|0.14680764538430255|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|0.18623962040332148|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|0.22805017103762829|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|0.11607142857142858|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|0.13031914893617022|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2| 0.2185385656292287|
|     10.27|1.71|  Male|    No|Sun|Dinner|   2| 0.1665043816942551|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|0

__c) Calculate the average tip percentage for each combination of sex and smoker__

In [62]:
tips.groupBy('sex', 'smoker').agg(mean('tip_percentage')).show()

+------+------+-------------------+
|   sex|smoker|avg(tip_percentage)|
+------+------+-------------------+
|  Male|    No| 0.1606687151291298|
|Female|    No| 0.1569209707691836|
|  Male|   Yes|0.15277117520248512|
|Female|   Yes|0.18215035269941032|
+------+------+-------------------+



__4) Use the seattle weather dataset referenced in the lesson to answer the questions below.__

In [63]:
# Note: this rebinds `data`, replacing the pydataset loader imported earlier
from vega_datasets import data

In [73]:
#Load the data
weather = data.seattle_weather()

In [74]:
weather = spark.createDataFrame(weather)

In [66]:
weather.printSchema()

root
 |-- date: timestamp (nullable = true)
 |-- precipitation: double (nullable = true)
 |-- temp_max: double (nullable = true)
 |-- temp_min: double (nullable = true)
 |-- wind: double (nullable = true)
 |-- weather: string (nullable = true)



In [67]:
weather.show(5)

+-------------------+-------------+--------+--------+----+-------+
|               date|precipitation|temp_max|temp_min|wind|weather|
+-------------------+-------------+--------+--------+----+-------+
|2012-01-01 00:00:00|          0.0|    12.8|     5.0| 4.7|drizzle|
|2012-01-02 00:00:00|         10.9|    10.6|     2.8| 4.5|   rain|
|2012-01-03 00:00:00|          0.8|    11.7|     7.2| 2.3|   rain|
|2012-01-04 00:00:00|         20.3|    12.2|     5.6| 4.7|   rain|
|2012-01-05 00:00:00|          1.3|     8.9|     2.8| 6.1|   rain|
+-------------------+-------------+--------+--------+----+-------+
only showing top 5 rows



__a) Convert the temperatures to fahrenheit.__

In [75]:
# F = C * 1.8 + 32; cast('int') drops the fractional part
weather = weather.select(
    'date',
    'precipitation',
    (weather.temp_max * 1.8 + 32).cast('int').alias('temp_max'),
    (weather.temp_min * 1.8 + 32).cast('int').alias('temp_min'),
    'wind',
    'weather',
)

In [76]:
weather.show(5)

+-------------------+-------------+--------+--------+----+-------+
|               date|precipitation|temp_max|temp_min|wind|weather|
+-------------------+-------------+--------+--------+----+-------+
|2012-01-01 00:00:00|          0.0|      55|      41| 4.7|drizzle|
|2012-01-02 00:00:00|         10.9|      51|      37| 4.5|   rain|
|2012-01-03 00:00:00|          0.8|      53|      44| 2.3|   rain|
|2012-01-04 00:00:00|         20.3|      53|      42| 4.7|   rain|
|2012-01-05 00:00:00|          1.3|      48|      37| 6.1|   rain|
+-------------------+-------------+--------+--------+----+-------+
only showing top 5 rows
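
Casting to `int` truncates rather than rounds, which explains why 7.2 °C appears as 44 and 12.2 °C as 53 in the table above. The plain-Python sketch below mimics that behaviour and compares it with rounding. Both function names are hypothetical, and Python's `int()` is used here as a stand-in for the Spark cast.

```python
def to_fahrenheit_truncated(celsius):
    """Mimic (C * 1.8 + 32).cast('int'): the fractional part is dropped."""
    return int(celsius * 1.8 + 32)


def to_fahrenheit_rounded(celsius):
    """Alternative: round to the nearest whole degree instead."""
    return round(celsius * 1.8 + 32)


# temp_min on 2012-01-03 and temp_max on 2012-01-04, from the table above
for celsius in (7.2, 12.2):
    print(celsius, to_fahrenheit_truncated(celsius), to_fahrenheit_rounded(celsius))
```

If whole degrees are wanted without the downward bias, rounding before the cast is one option. Keeping the values as doubles avoids the question entirely.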



__b) Which month has the most rain, on average?__

In [81]:
from pyspark.sql.functions import month, year, quarter

In [95]:
weather.groupBy(month('date')).agg(mean('precipitation')).sort(desc('avg(precipitation)')).show(1)


+-----------+------------------+
|month(date)|avg(precipitation)|
+-----------+------------------+
|         11| 5.354166666666667|
+-----------+------------------+
only showing top 1 row



__c) Which year was the windiest?__

In [108]:
# Note: this shadows Python's built-in sum
from pyspark.sql.functions import sum

In [112]:
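# Note: sum totals the daily readings, so leap year 2012 (366 days) gets one extra day;
# mean('wind') would compare the years on a per-day basis instead
weather.groupBy(year('date')).agg(sum("wind")).sort(desc('sum(wind)')).show(1)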

+----------+---------+
|year(date)|sum(wind)|
+----------+---------+
|      2012|   1244.7|
+----------+---------+
only showing top 1 row
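
Ranking years by total wind slightly favors 2012, because it is a leap year and contributes one more daily reading. The sketch below uses made-up data (the same hypothetical wind speed every day) to show the effect: the totals differ while the per-day averages do not. It illustrates only the bias and says nothing about which year is actually windiest by average.

```python
from statistics import mean

# Hypothetical data: identical daily wind speed in both years.
daily_wind = 3.0
wind_by_year = {
    2012: [daily_wind] * 366,  # leap year
    2013: [daily_wind] * 365,
}

totals = {yr: sum(values) for yr, values in wind_by_year.items()}
averages = {yr: mean(values) for yr, values in wind_by_year.items()}

print(totals)    # the leap year has the larger total
print(averages)  # the per-day averages are identical
```

In the Spark query, swapping `sum("wind")` for `mean("wind")` (and sorting by `avg(wind)`) would give the per-day comparison.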



__d) What is the most frequent type of weather in January?__

In [115]:
weather.filter(month('date') == 1).groupBy('weather').count().sort(desc('count')).show(1)

+-------+-----+
|weather|count|
+-------+-----+
|    fog|   38|
+-------+-----+
only showing top 1 row



__e) What is the average high and low temperature on sunny days in July in 2013 and 2014?__

In [120]:
weather.filter(
    ((year('date') == 2013) | (year('date') == 2014))
    & (month('date') == 7)
    & (weather.weather == 'sun')
).select(mean('temp_max'), mean('temp_min')).show()

+-----------------+-----------------+
|    avg(temp_max)|    avg(temp_min)|
+-----------------+-----------------+
|79.84615384615384|57.09615384615385|
+-----------------+-----------------+



__f) What percentage of days were rainy in q3 of 2015?__

In [121]:
q3_2015 = weather.filter((quarter('date') == 3) & (year('date') == 2015))

In [122]:
q3_2015.filter(q3_2015.weather == 'rain').count() / q3_2015.count()

0.021739130434782608
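
As a cross-check on the denominator, the sketch below counts the days in Q3 of 2015 in plain Python. It assumes the usual calendar-quarter rule (January to March is Q1, and so on), and `quarter_of` is a hypothetical helper. With one row per day, as the 365 rows for 2015 later in this notebook suggest, the fraction above is consistent with 2 days labelled `rain` out of 92.

```python
from datetime import date, timedelta


def quarter_of(day):
    """Quarter number (1-4) of a date, using the usual calendar-quarter rule."""
    return (day.month - 1) // 3 + 1


# Walk every day of 2015 and keep those that fall in Q3.
start = date(2015, 1, 1)
q3_days = [
    start + timedelta(days=offset)
    for offset in range(365)
    if quarter_of(start + timedelta(days=offset)) == 3
]

print(q3_days[0], q3_days[-1], len(q3_days))
```

Note that the query counts only rows whose `weather` label is `rain`. Filtering on `precipitation > 0`, as in part g, is a different definition of a rainy day and may give a different answer.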

__g) For each year, find what percentage of days it rained (had non-zero precipitation)__

In [165]:
rainy_days = weather.filter(weather.precipitation > 0).groupBy(year('date').alias('year')).count().withColumnRenamed('count','rainy_days')
rainy_days.show()

+----+----------+
|year|rainy_days|
+----+----------+
|2012|       177|
|2013|       152|
|2014|       150|
|2015|       144|
+----+----------+



In [166]:
total_days = weather.groupBy(year('date').alias('year1')).count().withColumnRenamed('count', 'total_days')
total_days.show()

+-----+----------+
|year1|total_days|
+-----+----------+
| 2012|       366|
| 2013|       365|
| 2014|       365|
| 2015|       365|
+-----+----------+



In [168]:
rainy_days = rainy_days.join(total_days, on = rainy_days.year == total_days.year1).drop('year1')

In [169]:
rainy_days.show()

+----+----------+----------+
|year|rainy_days|total_days|
+----+----------+----------+
|2012|       177|       366|
|2013|       152|       365|
|2014|       150|       365|
|2015|       144|       365|
+----+----------+----------+



In [172]:
rainy_days.withColumn('percent_rainy', rainy_days.rainy_days / rainy_days.total_days).show()

+----+----------+----------+-------------------+
|year|rainy_days|total_days|      percent_rainy|
+----+----------+----------+-------------------+
|2012|       177|       366|0.48360655737704916|
|2013|       152|       365|0.41643835616438357|
|2014|       150|       365|  0.410958904109589|
|2015|       144|       365|0.39452054794520547|
+----+----------+----------+-------------------+
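
The `percent_rainy` column holds fractions between 0 and 1. The sketch below converts the counts from the joined table into percentages in plain Python; the two-decimal rounding is an arbitrary choice for display.

```python
# (rainy_days, total_days) per year, copied from the joined table above
counts = {
    2012: (177, 366),
    2013: (152, 365),
    2014: (150, 365),
    2015: (144, 365),
}

# Scale each fraction to a percentage and round for display.
percent_rainy = {
    yr: round(rainy / total * 100, 2)
    for yr, (rainy, total) in counts.items()
}

for yr, pct in percent_rainy.items():
    print(f'{yr}: {pct}%')
```

In Spark, multiplying the ratio by 100 inside `withColumn` would produce the same scaling.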

