<h2 id="exercises">Exercises - Spark API</h2>
<p>Create a directory named <code>spark</code> within your <code>ds-methodologies</code> repository. This is where you will do the exercises for this module.</p>
<p>Create a Jupyter notebook or Python script named <code>spark101</code> for this exercise.</p>

In [7]:
import pandas as pd
import numpy as np
from pydataset import data

import pyspark
import pyspark.sql.functions as F

In [8]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()

## 1
Create a Spark dataframe that contains your favorite programming languages.

[(Up to top)](#Exercises---Spark-API)
[(Down to 2)](#2)

### 1.a
The name of the column should be <code>language</code>

[(Back to 1)](#1)

In [21]:
langlist = [['Python', "https://www.python.org/", 9], 
            ['SQL', '', 8], 
            ['R', 'https://www.r-project.org/about.html', 2], 
            ['C++', 'http://www.cplusplus.com/', 2], 
            ['Java', "https://www.java.com/en/", 1], 
            ['JavaScript', "https://www.javascript.com/", 2], 
            ['Bash', 'https://www.gnu.org/software/bash/', 1], 
            ['MATLAB', 'https://www.mathworks.com/products/matlab.html', 4], 
            ['C#', 'https://docs.microsoft.com/en-us/dotnet/csharp/', 2], 
            ['.NET', 'https://dotnet.microsoft.com/', 5], 
            ['Visual Basic', 'https://docs.microsoft.com/en-us/dotnet/visual-basic/', 7], 
            ['PHP', 'https://www.php.net/', 3], 
            ['STATA', 'https://www.stata.com/', 4], 
            ['Scala', 'https://www.scala-lang.org/', 4], 
            ['Go', 'https://golang.org/', 6], 
            ['Ruby', 'https://www.ruby-lang.org/en/', 4], 
            ['Julia', 'https://julialang.org/', 5]
           ]
langpd = pd.DataFrame(langlist, columns=['language','url','kevscore'])
langdf = spark.createDataFrame(langpd)

### 1.b
View the schema of the dataframe

[(Back to 1)](#1)

In [177]:
langdf.printSchema()

root
 |-- language: string (nullable = true)
 |-- url: string (nullable = true)
 |-- kevscore: long (nullable = true)



### 1.c
Output the shape of the dataframe

[(Back to 1)](#1)

In [178]:
print(f'({langdf.count()}, {len(langdf.columns)})')

(17, 3)


### 1.d

Show the first 5 records in the dataframe

[(Back to 1)](#1)

In [181]:
langdf.sort(F.desc('kevscore'), 'language').show(10, truncate=False)

+------------+-----------------------------------------------------+--------+
|language    |url                                                  |kevscore|
+------------+-----------------------------------------------------+--------+
|Python      |https://www.python.org/                              |9       |
|SQL         |                                                     |8       |
|Visual Basic|https://docs.microsoft.com/en-us/dotnet/visual-basic/|7       |
|Go          |https://golang.org/                                  |6       |
|.NET        |https://dotnet.microsoft.com/                        |5       |
|Julia       |https://julialang.org/                               |5       |
|MATLAB      |https://www.mathworks.com/products/matlab.html       |4       |
|Ruby        |https://www.ruby-lang.org/en/                        |4       |
|STATA       |https://www.stata.com/                               |4       |
|Scala       |https://www.scala-lang.org/                       

## 2

Load the <code>mpg</code> dataset as a Spark dataframe.

[(Up to top)](#Exercises---Spark-API)
[(Up to 1)](#1)
[(Down to 3)](#3)

In [187]:
from pydataset import data

mpg = spark.createDataFrame(data("mpg"))
mpg.show(1)

+------------+-----+-----+----+---+--------+---+---+---+---+-------+
|manufacturer|model|displ|year|cyl|   trans|drv|cty|hwy| fl|  class|
+------------+-----+-----+----+---+--------+---+---+---+---+-------+
|        audi|   a4|  1.8|1999|  4|auto(l5)|  f| 18| 29|  p|compact|
+------------+-----+-----+----+---+--------+---+---+---+---+-------+
only showing top 1 row



### 2.a
For each vehicle, create 1 column of output that contains a message like the one below:
<pre><code>The 1999 audi a4 has a 4 cylinder engine.</code></pre>

[(Back to 2)](#2)

In [111]:
(mpg
 .groupBy('manufacturer', 'model', 'year', 'cyl')
 .agg(F.count('*').alias('cars'))
 .select(
     F.concat(
        F.lit('The '),
        (F.when(F.col('cars') == 1, F.lit(''))
            .otherwise(F.col('cars'))), 
        (F.when(F.col('cars') == 1, F.lit(''))
            .otherwise(F.lit(' '))), 
        'year', F.lit(' '), 'manufacturer', 
        F.lit(' '), 'model', 
        (F.when(F.col('cars') == 1, F.lit(' has a '))
            .otherwise(F.lit('s have '))), 
        'cyl', F.lit(' cylinder engine'),
        (F.when(F.col('cars') == 1, F.lit('.'))
            .otherwise(F.lit('s.'))), 

    ).alias('descriptive_sentence')).show(15, truncate=False))

+--------------------------------------------------------------+
|descriptive_sentence                                          |
+--------------------------------------------------------------+
|The 2 2008 chevrolet k1500 tahoe 4wds have 8 cylinder engines.|
|The 2 2008 volkswagen new beetles have 5 cylinder engines.    |
|The 2 1999 audi a4 quattros have 6 cylinder engines.          |
|The 2 2008 toyota corollas have 4 cylinder engines.           |
|The 2008 audi a6 quattro has a 8 cylinder engine.             |
|The 2008 volkswagen passat has a 6 cylinder engine.           |
|The 2008 nissan maxima has a 6 cylinder engine.               |
|The 3 1999 pontiac grand prixs have 6 cylinder engines.       |
|The 2 1999 chevrolet corvettes have 8 cylinder engines.       |
|The 2008 ford explorer 4wd has a 6 cylinder engine.           |
|The 2008 audi a6 quattro has a 6 cylinder engine.             |
|The 2008 toyota camry solara has a 6 cylinder engine.         |
|The 2 2008 nissan altima

### 2.b
Transform the <code>trans</code> column so that it only contains either <code>manual</code> or <code>auto</code>.

[(Back to 2)](#2)

In [50]:
(mpg.select(
    '*', (
        F.when(mpg.trans.like('auto%'), 'auto')
        .when(mpg.trans.like('manual%'), 'manual')
        .otherwise('unknown')
        .alias('transtype')))
    .drop('trans')).show(10)

+------------+----------+-----+----+---+---+---+---+---+-------+---------+
|manufacturer|     model|displ|year|cyl|drv|cty|hwy| fl|  class|transtype|
+------------+----------+-----+----+---+---+---+---+---+-------+---------+
|        audi|        a4|  1.8|1999|  4|  f| 18| 29|  p|compact|     auto|
|        audi|        a4|  1.8|1999|  4|  f| 21| 29|  p|compact|   manual|
|        audi|        a4|  2.0|2008|  4|  f| 20| 31|  p|compact|   manual|
|        audi|        a4|  2.0|2008|  4|  f| 21| 30|  p|compact|     auto|
|        audi|        a4|  2.8|1999|  6|  f| 16| 26|  p|compact|     auto|
|        audi|        a4|  2.8|1999|  6|  f| 18| 26|  p|compact|   manual|
|        audi|        a4|  3.1|2008|  6|  f| 18| 27|  p|compact|     auto|
|        audi|a4 quattro|  1.8|1999|  4|  4| 18| 26|  p|compact|   manual|
|        audi|a4 quattro|  1.8|1999|  4|  4| 16| 25|  p|compact|     auto|
|        audi|a4 quattro|  2.0|2008|  4|  4| 20| 28|  p|compact|   manual|
+------------+----------+

## 3
Load the <code>tips</code> dataset as a Spark dataframe.

[(Up to top)](#Exercises---Spark-API)
[(Up to 2)](#2)
[(Down to 4)](#4)

In [188]:
from pydataset import data

tips = spark.createDataFrame(data("tips"))
tips.show(1)

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
+----------+----+------+------+---+------+----+
only showing top 1 row



### 3.a
What percentage of observations are smokers?

[(Back to 3)](#3)

In [190]:
tips.select(F.when(tips.smoker == 'Yes', 1).otherwise(0).alias('smokes')).agg(F.avg('smokes')).show()

+-------------------+
|        avg(smokes)|
+-------------------+
|0.38114754098360654|
+-------------------+



### 3.b
Create a column that contains the tip percentage

[(Back to 3)](#3)

In [68]:
tipswpctg = tips.select('*', F.round((tips.tip / tips.total_bill),5).alias('tip_pctg'))
tipswpctg.show(15)

+----------+----+------+------+---+------+----+--------+
|total_bill| tip|   sex|smoker|day|  time|size|tip_pctg|
+----------+----+------+------+---+------+----+--------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2| 0.05945|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3| 0.16054|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3| 0.16659|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2| 0.13978|
|     24.59|3.61|Female|    No|Sun|Dinner|   4| 0.14681|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4| 0.18624|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2| 0.22805|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4| 0.11607|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2| 0.13032|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2| 0.21854|
|     10.27|1.71|  Male|    No|Sun|Dinner|   2|  0.1665|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|  0.1418|
|     15.42|1.57|  Male|    No|Sun|Dinner|   2| 0.10182|
|     18.43| 3.0|  Male|    No|Sun|Dinner|   4| 0.16278|
|     14.83|3.02|Female|    No|

### 3.c
Calculate the average tip percentage for each combination of sex and smoker.

[(Back to 3)](#3)

In [79]:
(tipswpctg
 .groupBy('sex','smoker')
 .agg(
     F.round(F.avg('tip_pctg'),5).alias('avg_tip_pctg'), 
     F.round((F.sum('tip')/F.sum('total_bill')), 5).alias('gross_tip_pctg')
 )).show()

+------+------+------------+--------------+
|   sex|smoker|avg_tip_pctg|gross_tip_pctg|
+------+------+------------+--------------+
|  Male|    No|     0.16067|       0.15731|
|  Male|   Yes|     0.15277|       0.13692|
|Female|    No|     0.15692|       0.15319|
|Female|   Yes|     0.18215|       0.16306|
+------+------+------------+--------------+
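
The two columns answer slightly different questions: `avg_tip_pctg` is the mean of each bill's tip percentage, while `gross_tip_pctg` is total tips divided by total bills, which weights large bills more heavily. A small plain-Python sketch with made-up numbers (not taken from `tips`) illustrates why the two can differ:

```python
# Hypothetical (total_bill, tip) pairs, chosen only to show the effect.
bills = [(10.0, 2.0), (100.0, 10.0)]

# Mean of per-bill ratios: every bill counts equally.
avg_tip_pctg = sum(tip / total for total, tip in bills) / len(bills)

# Ratio of sums: large bills dominate the result.
gross_tip_pctg = sum(tip for _, tip in bills) / sum(total for total, _ in bills)

print(round(avg_tip_pctg, 5), round(gross_tip_pctg, 5))
```

Here the small bill with a generous tip pulls the per-bill average up, while the gross figure stays close to the large bill's rate.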



## 4
Use the Seattle weather dataset referenced in the lesson to answer the questions below.

[(Up to top)](#Exercises---Spark-API)
[(Up to 3)](#3)

In [191]:
from vega_datasets import data

weather = data.seattle_weather().assign(date=lambda df: df.date.astype(str))
weather = spark.createDataFrame(weather)
weather.printSchema()
weather.show(6)

root
 |-- date: string (nullable = true)
 |-- precipitation: double (nullable = true)
 |-- temp_max: double (nullable = true)
 |-- temp_min: double (nullable = true)
 |-- wind: double (nullable = true)
 |-- weather: string (nullable = true)

+----------+-------------+--------+--------+----+-------+
|      date|precipitation|temp_max|temp_min|wind|weather|
+----------+-------------+--------+--------+----+-------+
|2012-01-01|          0.0|    12.8|     5.0| 4.7|drizzle|
|2012-01-02|         10.9|    10.6|     2.8| 4.5|   rain|
|2012-01-03|          0.8|    11.7|     7.2| 2.3|   rain|
|2012-01-04|         20.3|    12.2|     5.6| 4.7|   rain|
|2012-01-05|          1.3|     8.9|     2.8| 6.1|   rain|
|2012-01-06|          2.5|     4.4|     2.2| 2.2|   rain|
+----------+-------------+--------+--------+----+-------+
only showing top 6 rows



### 4.a
Convert the temperatures to Fahrenheit.

[(Back to 4)](#4)
[(Go to function test)](#4.a.test)

In [164]:
import pyspark.sql.types as T

def c_to_f(temp_c):
    return (9 * temp_c * .2) + 32

spark.udf.register("degreesCtoF", c_to_f, T.DoubleType())

+---+---------+
| id|id_c_to_f|
+---+---------+
|  1|     33.8|
|  2|     35.6|
|  3|     37.4|
|  4|     39.2|
|  5|     41.0|
+---+---------+
only showing top 5 rows



In [140]:
weatherf = weather.select(
    '*',
    F.expr('round(degreesCtoF(temp_max),1) temp_max_f'),
    F.expr('round(degreesCtoF(temp_min),1) temp_min_f')
)
weatherf.show(5)
# type(temp_max_f)

+----------+-------------+--------+--------+----+-------+----------+----------+
|      date|precipitation|temp_max|temp_min|wind|weather|temp_max_f|temp_min_f|
+----------+-------------+--------+--------+----+-------+----------+----------+
|2012-01-01|          0.0|    12.8|     5.0| 4.7|drizzle|      55.0|      41.0|
|2012-01-02|         10.9|    10.6|     2.8| 4.5|   rain|      51.1|      37.0|
|2012-01-03|          0.8|    11.7|     7.2| 2.3|   rain|      53.1|      45.0|
|2012-01-04|         20.3|    12.2|     5.6| 4.7|   rain|      54.0|      42.1|
|2012-01-05|          1.3|     8.9|     2.8| 6.1|   rain|      48.0|      37.0|
+----------+-------------+--------+--------+----+-------+----------+----------+
only showing top 5 rows



In [193]:
weatherfm = weatherf.select(
    'date', 'precipitation', 'temp_max', 'temp_max_f', 'temp_min', 'temp_min_f', 'wind', 'weather',
    F.expr('LEFT(date, 7) year_month')
)
weatherfm.show(5)

+----------+-------------+--------+----------+--------+----------+----+-------+----------+
|      date|precipitation|temp_max|temp_max_f|temp_min|temp_min_f|wind|weather|year_month|
+----------+-------------+--------+----------+--------+----------+----+-------+----------+
|2012-01-01|          0.0|    12.8|      55.0|     5.0|      41.0| 4.7|drizzle|   2012-01|
|2012-01-02|         10.9|    10.6|      51.1|     2.8|      37.0| 4.5|   rain|   2012-01|
|2012-01-03|          0.8|    11.7|      53.1|     7.2|      45.0| 2.3|   rain|   2012-01|
|2012-01-04|         20.3|    12.2|      54.0|     5.6|      42.1| 4.7|   rain|   2012-01|
|2012-01-05|          1.3|     8.9|      48.0|     2.8|      37.0| 6.1|   rain|   2012-01|
+----------+-------------+--------+----------+--------+----------+----+-------+----------+
only showing top 5 rows



In [175]:
weatherfmgym = (
    weatherfm
    .groupby('year_month')
    .agg(
        F.count('date').alias('days'), 
        F.round(F.sum('precipitation'), 2).alias('tot_precip'), 
        F.round(F.avg('wind'), 2).alias('avg_wind')
    )
    .sort('year_month')
)
# show() returns None, so call it separately to keep the dataframe in the variable.
weatherfmgym.show(5)

+----------+----+----------+--------+
|year_month|days|tot_precip|avg_wind|
+----------+----+----------+--------+
|   2012-01|  31|     173.3|     3.9|
|   2012-02|  29|      92.3|     3.9|
|   2012-03|  31|     183.0|    4.25|
|   2012-04|  30|      68.1|    3.37|
|   2012-05|  31|      52.2|    3.35|
+----------+----+----------+--------+
only showing top 5 rows



In [176]:
weatherfmpymw = (
    weatherfm
    .groupby('year_month')
    .pivot('weather')
    .agg(
        F.count('date').alias('days')
    )
    .sort('year_month')
    .na.fill(0)
)
# show() returns None, so call it separately to keep the dataframe in the variable.
weatherfmpymw.show(5)

+----------+-------+---+----+----+---+
|year_month|drizzle|fog|rain|snow|sun|
+----------+-------+---+----+----+---+
|   2012-01|      2|  0|  18|   7|  4|
|   2012-02|      1|  0|  17|   3|  8|
|   2012-03|      1|  0|  19|   5|  6|
|   2012-04|      2|  0|  19|   1|  8|
|   2012-05|      1|  0|  16|   0| 14|
+----------+-------+---+----+----+---+
only showing top 5 rows



### 4.b
Which month has the most rain, on average?

[(Back to 4)](#4)

### 4.c
Which year was the windiest?

[(Back to 4)](#4)

### 4.d
What is the most frequent type of weather in January?

[(Back to 4)](#4)

### 4.e
What is the average high and low temperature on sunny days in July in 2013 and 2014?

[(Back to 4)](#4)

### 4.f
What percentage of days were rainy in Q3 of 2015?

[(Back to 4)](#4)

### 4.g
For each year, find what percentage of days it rained (had non-zero precipitation).

[(Back to 4)](#4)

### 4.a.test
[(Back to 4.a)](#4.a)

In [192]:
# createOrReplaceTempView replaces the deprecated registerTempTable.
spark.range(1, 20).createOrReplaceTempView("test")
spark.sql('''select id, degreesCtoF(id) as id_c_to_f from test''').show()

+---+------------------+
| id|         id_c_to_f|
+---+------------------+
|  1|              33.8|
|  2|              35.6|
|  3|              37.4|
|  4|              39.2|
|  5|              41.0|
|  6|              42.8|
|  7|              44.6|
|  8|              46.4|
|  9|              48.2|
| 10|              50.0|
| 11|              51.8|
| 12|              53.6|
| 13|55.400000000000006|
| 14|              57.2|
| 15|              59.0|
| 16|              60.8|
| 17|              62.6|
| 18|              64.4|
| 19|              66.2|
+---+------------------+

