## Exercises

Using the [repo setup directions](https://ds.codeup.com/fundamentals/git/), set up a new local and remote repository named `spark-exercises`. The local version of your repo should live inside of `~/codeup-data-science`.

Save this work in your `spark-exercises` repo. Then add, commit, and push your changes.

Create a Jupyter notebook or Python script named `spark101` for this exercise.



# imports 

In [1]:
import pyspark


In [2]:
#create a spark session (or reuse the existing one)
spark = pyspark.sql.SparkSession.builder.getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/10/20 14:49:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/20 14:49:39 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


# 1. Create a spark data frame that contains your favorite programming languages.

    - The name of the column should be `language`
    - View the schema of the dataframe
    - Output the shape of the dataframe
    - Show the first 5 records in the dataframe



In [11]:
from pyspark.sql import Row

fav_language = spark.createDataFrame([
    Row(language='python', projects=5),
    Row(language='swift', projects=2),
    Row(language='javascript', projects=1),
    Row(language='basic', projects=1),
    Row(language='r', projects=0),
    Row(language='c++', projects=0)
    ])
fav_language

DataFrame[language: string, projects: bigint]

In [7]:
#view schema of data frame
fav_language.printSchema()

root
 |-- language: string (nullable = true)
 |-- projects: long (nullable = true)



In [9]:
#output shape of dataframe
#count gives the number of rows, len of df.columns gives the number of columns
print(fav_language.count(), 'rows', len(fav_language.columns), 'columns')

6 rows 2 columns


In [13]:
#show the first 5 records in the data frame
fav_language.show(5)

+----------+--------+
|  language|projects|
+----------+--------+
|    python|       5|
|     swift|       2|
|javascript|       1|
|     basic|       1|
|         r|       0|
+----------+--------+
only showing top 5 rows



# 2. Load the `mpg` dataset as a spark dataframe.

    * Create 1 column of output that contains a message like the one below for each vehicle:

            The 1999 audi a4 has a 4 cylinder engine.

    * Transform the `trans` column so that it only contains either `manual` or `auto`.
---


In [18]:
from pydataset import data

#import functions such as concat, lit, sum, avg, min, max, count, mean, regexp_extract, and regexp_replace
import pyspark.sql.functions as F


mpg = spark.createDataFrame(data('mpg'))

## Create 1 column of output that contains a message for each vehicle:

In [21]:
#literal spaces between year, manufacturer, and model keep the words from running together
message = F.concat(
    F.lit('The '), mpg.year, F.lit(' '), mpg.manufacturer, F.lit(' '), mpg.model,
    F.lit(' has a '), mpg.cyl, F.lit(' cylinder engine.'),
).alias('message')
mpg.select(message).show(5, truncate=False)

+-----------------------------------------+
|message                                  |
+-----------------------------------------+
|The 1999 audi a4 has a 4 cylinder engine.|
|The 1999 audi a4 has a 4 cylinder engine.|
|The 2008 audi a4 has a 4 cylinder engine.|
|The 2008 audi a4 has a 4 cylinder engine.|
|The 1999 audi a4 has a 6 cylinder engine.|
+-----------------------------------------+
only showing top 5 rows



## Transform the `trans` column so that it only contains either `manual` or `auto`

In [22]:
mpg.select('trans').show(3)

+----------+
|     trans|
+----------+
|  auto(l5)|
|manual(m5)|
|manual(m6)|
+----------+
only showing top 3 rows



In [50]:
mpg.select('trans',
    F.when(mpg.trans.contains('auto'), 'auto')
    .otherwise('manual')).show(20)

+----------+---------------------------------------------------------+
|     trans|CASE WHEN contains(trans, auto) THEN auto ELSE manual END|
+----------+---------------------------------------------------------+
|  auto(l5)|                                                     auto|
|manual(m5)|                                                   manual|
|manual(m6)|                                                   manual|
|  auto(av)|                                                     auto|
|  auto(l5)|                                                     auto|
|manual(m5)|                                                   manual|
|  auto(av)|                                                     auto|
|manual(m5)|                                                   manual|
|  auto(l5)|                                                     auto|
|manual(m6)|                                                   manual|
|  auto(s6)|                                                     auto|
|  aut

In [53]:
mpg.select('trans', F.regexp_extract('trans', r'^(\w+)', 1).alias('attempts')).show(3)

+----------+--------+
|     trans|attempts|
+----------+--------+
|  auto(l5)|    auto|
|manual(m5)|  manual|
|manual(m6)|  manual|
+----------+--------+
only showing top 3 rows



# 3. Load the `tips` dataset as a spark dataframe.

    1. What percentage of observations are smokers?
    1. Create a column that contains the tip percentage
    1. Calculate the average tip percentage for each combination of sex and smoker.



In [54]:
tips = spark.createDataFrame(data('tips'))

In [55]:
#similar to df.info() in pandas
tips.printSchema()

root
 |-- total_bill: double (nullable = true)
 |-- tip: double (nullable = true)
 |-- sex: string (nullable = true)
 |-- smoker: string (nullable = true)
 |-- day: string (nullable = true)
 |-- time: string (nullable = true)
 |-- size: long (nullable = true)



## What percentage of observations are smokers?

In [68]:
#tips.rollup(tips.smoker == 'Yes').count().show()
#tips.groupBy(tips.smoker).agg(F.sum(tips.smoker == 'Yes')).show() --> not good
#fraction of observations that are smokers (multiply by 100 for a percentage)
tips.filter(tips.smoker == 'Yes').count() / tips.count()

0.38114754098360654

## Create a column that contains the tip percentage

In [70]:
tips.printSchema()


root
 |-- total_bill: double (nullable = true)
 |-- tip: double (nullable = true)
 |-- sex: string (nullable = true)
 |-- smoker: string (nullable = true)
 |-- day: string (nullable = true)
 |-- time: string (nullable = true)
 |-- size: long (nullable = true)



In [82]:
col1 = F.round((tips.tip/tips.total_bill),2).alias('tip_percentage')
tips.select(tips.total_bill, tips.tip, col1).show(5)

+----------+----+--------------+
|total_bill| tip|tip_percentage|
+----------+----+--------------+
|     16.99|1.01|          0.06|
|     10.34|1.66|          0.16|
|     21.01| 3.5|          0.17|
|     23.68|3.31|          0.14|
|     24.59|3.61|          0.15|
+----------+----+--------------+
only showing top 5 rows



# 4. Use the Seattle weather dataset referenced in the lesson to answer the questions below.

    - Convert the temperatures to Fahrenheit.
    - Which month has the most rain, on average?
    - Which year was the windiest?
    - What is the most frequent type of weather in January?
    - What is the average high and low temperature on sunny days in July in 2013 and 2014?
    - What percentage of days were rainy in Q3 of 2015?
    - For each year, find what percentage of days it rained (had non-zero precipitation).

In [83]:
from vega_datasets import data

weather = data.seattle_weather().assign(date=lambda df: df.date.astype(str))
weather = spark.createDataFrame(weather)

In [85]:
weather.show(5)

+----------+-------------+--------+--------+----+-------+
|      date|precipitation|temp_max|temp_min|wind|weather|
+----------+-------------+--------+--------+----+-------+
|2012-01-01|          0.0|    12.8|     5.0| 4.7|drizzle|
|2012-01-02|         10.9|    10.6|     2.8| 4.5|   rain|
|2012-01-03|          0.8|    11.7|     7.2| 2.3|   rain|
|2012-01-04|         20.3|    12.2|     5.6| 4.7|   rain|
|2012-01-05|          1.3|     8.9|     2.8| 6.1|   rain|
+----------+-------------+--------+--------+----+-------+
only showing top 5 rows

