# MSBX 5420 Assignment 2
## Task 1 - Warm Up with Spark

### Replicate the MapReduce Calculations in Assignment 1 with Spark RDD, DataFrame and SQL
As a warm-up, your first task is to replicate the MapReduce jobs from Assignment 1 with Spark RDD and DataFrame/SQL.
First, let's load the NFL dataset into Spark. For convenience, we will use a Spark session and a DataFrame.

In [None]:
from pyspark.sql import SparkSession

#use all local cores with 'local[*]' to speed things up
spark = SparkSession.builder.master('local[*]').appName('spark_nfl_data').getOrCreate()

In the NFL dataset, the missing values we skipped in Assignment 1 are recorded as "NA". By default, Spark treats empty or null values as missing, so here we need to tell Spark to treat "NA" as a missing value.

In [None]:
df_nfl = spark.read.options(header=True, nullValue='NA', inferSchema=True).csv('./NFL_Play_by_Play_2009-2018.csv')
df_nfl.show(5)

In [None]:
#let's check and clean the data with dataframe
print(df_nfl.count())
print(df_nfl.distinct().count())

In [None]:
#drop duplicate rows
df_nfl = df_nfl.dropDuplicates()

In [None]:
#let's look at the data
df_nfl.printSchema()

In [None]:
#take a look at the data via pandas
import pandas as pd
#disable the row/column limits to not truncate the displayed data
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
df_nfl.limit(10).toPandas().head(10)

In [None]:
#let's also see the number of partitions
df_nfl.rdd.getNumPartitions()

In [None]:
#convert the dataset to a RDD
rdd_nfl = df_nfl.rdd
rdd_nfl.take(5)

Now we have the RDD converted from the DataFrame, so we can run RDD operations on `rdd_nfl`. Let's replicate the two calculations: (1) the number of plays in each game and (2) the average yards gained in each game.
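The two RDD calculations follow the usual map-then-reduce-by-key pattern. The sketch below mirrors that pattern in plain Python rather than Spark, on made-up rows. `reduce_by_key` is a hypothetical helper that stands in for what `reduceByKey` does on an RDD of key-value pairs, and the sample values are invented for illustration.

```python
from collections import defaultdict
from functools import reduce

# Invented sample rows: (game_id, yards_gained); None stands for a missing "NA" value.
rows = [(1, 5), (1, None), (1, 10), (2, 3), (2, 7), (2, None), (2, 2)]

def reduce_by_key(pairs, func):
    """Hypothetical stand-in for reduceByKey: fold the values of each key with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

# (1) number of plays per game: map each row to (game_id, 1), then sum per key
plays_per_game = reduce_by_key([(g, 1) for g, _ in rows], lambda a, b: a + b)

# (2) average yards per game: drop None, map to (game_id, (yards, 1)),
# sum both parts per key, then divide the total by the count
pairs = [(g, (y, 1)) for g, y in rows if y is not None]
sums = reduce_by_key(pairs, lambda a, b: (a[0] + b[0], a[1] + b[1]))
avg_yards_per_game = {g: total / n for g, (total, n) in sums.items()}
```

Carrying a (sum, count) pair through the reduce step matters because partial averages cannot be merged correctly, while partial sums and counts can.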

In [None]:
#mapreduce with spark RDD for number of plays in each game
#[Your Code]


In [None]:
#mapreduce with spark RDD for average yards gained
#hint: "NA" values in the column "yards_gained" were read as nulls, so after converting the dataframe to an RDD they are None, while the other values are integers
#[Your Code]


Next, let's do it with spark dataframe and SQL. The dataframe is `df_nfl`.

In [None]:
#use dataframe operations/api
#[Your Code]


In [None]:
#use spark sql - don't forget to create temp view for df_nfl before querying the dataframe
#[Your Code]


## Task 2 - Data Analytics with Spark DataFrame and SQL
### Answer four data analytics questions on the NFL dataset
With the NFL Dataframe `df_nfl`, use either dataframe operations/API or spark SQL to answer the following questions.
First of all, let's build a data viewer to look at the data so we can understand the values better.

In [None]:
#build a data viewer to check data for a game; the list basically contains all the columns to use in this task
game_info_all = ['play_id', 'game_id', 'home_team', 'away_team', 'game_date', 
                 'posteam', 'posteam_type', 'defteam',
                 'total_home_score', 'total_away_score',
                 'touchdown', 'pass_touchdown', 'rush_touchdown', 'return_touchdown']
#df_nfl.select(game_info_all).limit(200).toPandas().head(200)
df_nfl.select(game_info_all).where('game_id = 2018111110').toPandas().head(200)

Now you are going to answer the following questions using spark dataframe or spark SQL. You can choose either one to solve the problem and output results.
1. Which game(s) had the highest number of plays from 2009 to 2018? And which game had the largest final score difference?

In [None]:
from pyspark.sql import functions as fn
from pyspark.sql import Window

#you need to show the game info with the highest plays, so let's obtain game level information
game_info = ['game_id', 'home_team', 'away_team', 'game_date', 'total_home_score', 'total_away_score']
#because we need the final scores of each game as game level info, we keep only the row with the maximum play id in each game
window = Window.partitionBy('game_id')
nfl_game_info = df_nfl.withColumn("max_play_id", fn.max("play_id").over(window)).filter("max_play_id = play_id").drop("max_play_id").select(game_info)
nfl_game_info.show()
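The window step above keeps, for every game, only the row whose `play_id` equals that game's maximum, which the notebook treats as the last play and therefore as the row carrying the final scores. A plain-Python sketch of the same idea follows, on invented rows. It is not Spark code, and the sample values are assumptions.

```python
# Invented sample rows: (game_id, play_id, total_home_score, total_away_score)
plays = [
    (100, 1, 0, 0),
    (100, 2, 7, 0),
    (100, 3, 7, 3),
    (200, 1, 0, 0),
    (200, 2, 0, 6),
]

# First pass: maximum play_id per game
# (the value that fn.max("play_id").over(window) attaches to every row of a game)
max_play_id = {}
for game_id, play_id, _, _ in plays:
    max_play_id[game_id] = max(play_id, max_play_id.get(game_id, play_id))

# Second pass: keep only the rows whose play_id equals that maximum
last_plays = [row for row in plays if row[1] == max_play_id[row[0]]]
```

Unlike a plain `groupBy` with `max`, this approach keeps the other columns of the selected row, which is why the notebook uses a window instead of an aggregation here.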

In [None]:
#get number of plays in each game
nfl_num_play = df_nfl.groupBy('game_id').agg(fn.count('play_id').alias('num_plays'))
nfl_num_play.show()

#join the two dataframes
nfl_game_info = nfl_game_info.join(nfl_num_play, 'game_id')

In [None]:
#[Your Code] to get the game with highest number of plays


In [None]:
#now it is the score difference
nfl_game_info = nfl_game_info.withColumn('score_diff', fn.abs(nfl_game_info['total_home_score'] - nfl_game_info['total_away_score']))

In [None]:
#[Your Code] to get the game with highest score difference


2. On average, how many plays are needed for a successful touchdown? And how many plays are needed for the home team and the away team, respectively?
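One thing to watch in this question is a game (or a team within a game) with zero touchdowns, where plays divided by touchdowns is undefined. The plain-Python sketch below, on invented per-game totals, skips such games before averaging the per-game ratio. Whether to skip them or handle them differently is an assumption, not something the assignment specifies.

```python
# Invented per-game totals: (game_id, total_plays, total_touchdowns)
game_totals = [(1, 150, 5), (2, 160, 4), (3, 140, 0)]

# Plays per touchdown for each game; games without a touchdown are skipped (an assumption)
ratios = [plays / tds for _, plays, tds in game_totals if tds > 0]

# Average of the per-game ratios
avg_plays_per_td = sum(ratios) / len(ratios)
```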

In [None]:
nfl_game_play = df_nfl.groupBy('game_id').agg(fn.count('play_id').alias('total_plays'), fn.sum('touchdown').alias('total_touchdowns'))

In [None]:
#[Your Code] to take average for total_plays/total_touchdowns


In [None]:
nfl_team_play = df_nfl.groupBy('game_id', 'posteam_type').agg(fn.count('play_id').alias('total_plays'), fn.sum('touchdown').alias('total_touchdowns'))

In [None]:
#[Your Code] to take average for total_plays/total_touchdowns by posteam_type


3. Which type of touchdown is most likely on average: rush touchdown, pass touchdown or return touchdown? Do the probabilities differ between home and away teams?

In [None]:
#[Your Code] to calculate total touchdowns and total of each type of touchdowns by game
#then take average for each type of touchdown divided by total touchdowns


In [None]:
#[Your Code] to calculate total touchdowns and total of each type of touchdowns by game and posteam_type
#then take average for each type of touchdown divided by total touchdowns by posteam_type


4. For each calendar year, which team(s) had the highest winning rate?

In [None]:
#let's look at the available teams
df_nfl.select('home_team').distinct().show(50)

In [None]:
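#note: with this rule a tied game is credited to the away team
nfl_game_info = nfl_game_info.withColumn('win_team', fn.when(fn.col('total_home_score') > fn.col('total_away_score'), fn.col('home_team')).otherwise(fn.col('away_team')))
#substring positions in spark are 1-based, so start at 1 to take the four-digit year
nfl_game_info = nfl_game_info.withColumn('game_year', fn.substring('game_date', 1, 4))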
nfl_game_info.show()

In [None]:
#create three sub-dataframes, each grouped by team and year
win_count = nfl_game_info.groupBy(fn.col('win_team').alias('team'), 'game_year').agg(fn.count('win_team').alias('win_count'))
home_count = nfl_game_info.groupBy(fn.col('home_team').alias('team'), 'game_year').agg(fn.count('home_team').alias('home_count'))
away_count = nfl_game_info.groupBy(fn.col('away_team').alias('team'), 'game_year').agg(fn.count('away_team').alias('away_count'))

In [None]:
#[Your Code] to join the three dataframes on 'team' and 'game_year' as team_count for subsequent calculations


In [None]:
#generate total game counts and winning rate
team_count = team_count.withColumn('game_count', team_count['home_count'] + team_count['away_count'])
team_count = team_count.withColumn('win_rate', team_count['win_count'] / team_count['game_count'])
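The remaining step is a "top per group" selection: for each `game_year`, keep every team whose `win_rate` equals that year's maximum, so that ties are all reported. A plain-Python sketch on invented counts follows. It is not Spark code, and the team names and numbers are made up.

```python
# Invented rows: (team, game_year, win_count, game_count)
team_rows = [
    ("AAA", "2017", 12, 16),
    ("BBB", "2017", 12, 16),
    ("CCC", "2017", 4, 16),
    ("AAA", "2018", 8, 16),
    ("BBB", "2018", 10, 16),
]

# Winning rate per team-year
rates = [(team, year, wins / games) for team, year, wins, games in team_rows]

# Highest rate in each year
best_rate = {}
for _, year, rate in rates:
    best_rate[year] = max(rate, best_rate.get(year, rate))

# Keep every team that reaches its year's best rate (ties included)
best_teams = {}
for team, year, rate in rates:
    if rate == best_rate[year]:
        best_teams.setdefault(year, []).append(team)
```

Comparing each row against the per-year maximum, rather than sorting and taking the first row, is what lets several teams share the top spot in the same year.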

In [None]:
#[Your Code] to obtain the team(s) with highest winning rate in each calendar year
