# MSBX 5420 Assignment 2
## Task 1 - Warm Up with Spark

### Replicate the MapReduce Calculations in Assignment 1 with Spark RDD, DataFrame and SQL
As a warm-up, your first task is to replicate the MapReduce jobs from Assignment 1 with Spark RDDs and with DataFrame/SQL.
First, let's load the NFL dataset into Spark. For convenience, we will use a Spark session and a DataFrame.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[2]').config("spark.executor.memory", "1g").config("spark.driver.memory", "1g").appName('spark_nfl_data2').getOrCreate()

In [2]:
spark.sparkContext.getConf().getAll()

[('spark.driver.extraJavaOptions',
  '"-Dio.netty.tryReflectionSetAccessible=true"'),
 ('spark.driver.port', '34833'),
 ('spark.app.name', 'spark_nfl_data2'),
 ('spark.executor.id', 'driver'),
 ('spark.app.id', 'local-1614639102732'),
 ('spark.driver.memory', '1g'),
 ('spark.executor.extraJavaOptions',
  '"-Dio.netty.tryReflectionSetAccessible=true"'),
 ('spark.executor.memory', '1g'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.submit.pyFiles', ''),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.master', 'local[2]'),
 ('spark.driver.host', 'caab37041184')]

In the NFL dataset, the missing values we skipped in Assignment 1 are encoded as "NA". By default, Spark treats empty or null fields as missing values, so here we need to tell Spark to treat "NA" as a missing value too.
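To make the idea behind the `nullValue` option concrete, here is a minimal plain-Python sketch (no Spark involved) that maps every field equal to a chosen token to `None` while parsing CSV text. The helper `parse_with_null` and the two sample rows are hypothetical and exist only for illustration; this is not how Spark implements the option.

```python
import csv
import io


def parse_with_null(text, null_value="NA"):
    """Parse CSV text into dicts, turning fields equal to null_value into None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (None if value == null_value else value) for key, value in row.items()}
        for row in reader
    ]


# Hypothetical two-row sample in the spirit of the NFL file
sample = "game_id,yards_gained\n2009091000,5\n2009091000,NA\n"
rows = parse_with_null(sample)
print(rows)
```

Without the mapping step, the second row would keep the string `"NA"`, and a numeric column containing it could not be treated as purely numeric.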

In [3]:
df_nfl = spark.read.options(header=True, nullValue='NA', inferSchema=True).csv('./NFL_Play_by_Play_2009-2018.csv')
df_nfl.show(5)

+-------+----------+---------+---------+-------+------------+-------+-------------+------------+----------+-------------------------+----------------------+----------------------+---------+-----------+-----+---+---+----+----------+-----+------+-------+------+--------------------+---------+------------+-------+---------+-----------+--------+--------+-----------+-----------+-------------+---------+-----------------+------------+-------+-----------------+-------------+------------------+---------------------+-----------------------+-----------------------+-------+------------+-------+--------------------------+--------------------------+----------------+----------------+-------------+-------------+------------------+------------------+------------------+-----------------------+--------------------+-------------------+--------------------+-------------------+-------------------+--------------------+-------------------+----------------+-------------------------+-------------------+---------

In [4]:
#let's check and clean the data with dataframe
print(df_nfl.count())
print(df_nfl.distinct().count())

449371
446982


In [5]:
#drop duplicate rows
df_nfl = df_nfl.dropDuplicates()

In [6]:
#let's look at the data
df_nfl.printSchema()

root
 |-- play_id: integer (nullable = true)
 |-- game_id: integer (nullable = true)
 |-- home_team: string (nullable = true)
 |-- away_team: string (nullable = true)
 |-- posteam: string (nullable = true)
 |-- posteam_type: string (nullable = true)
 |-- defteam: string (nullable = true)
 |-- side_of_field: string (nullable = true)
 |-- yardline_100: integer (nullable = true)
 |-- game_date: string (nullable = true)
 |-- quarter_seconds_remaining: integer (nullable = true)
 |-- half_seconds_remaining: double (nullable = true)
 |-- game_seconds_remaining: double (nullable = true)
 |-- game_half: string (nullable = true)
 |-- quarter_end: integer (nullable = true)
 |-- drive: integer (nullable = true)
 |-- sp: integer (nullable = true)
 |-- qtr: integer (nullable = true)
 |-- down: integer (nullable = true)
 |-- goal_to_go: integer (nullable = true)
 |-- time: string (nullable = true)
 |-- yrdln: string (nullable = true)
 |-- ydstogo: integer (nullable = true)
 |-- ydsnet: integer (nulla

In [7]:
#take a look at the data via pandas
import pandas as pd
#disable the row/column limits to not truncate the displayed data
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
df_nfl.limit(10).toPandas().head(10)

Unnamed: 0,play_id,game_id,home_team,away_team,posteam,posteam_type,defteam,side_of_field,yardline_100,game_date,quarter_seconds_remaining,half_seconds_remaining,game_seconds_remaining,game_half,quarter_end,drive,sp,qtr,down,goal_to_go,time,yrdln,ydstogo,ydsnet,desc,play_type,yards_gained,shotgun,no_huddle,qb_dropback,qb_kneel,qb_spike,qb_scramble,pass_length,pass_location,air_yards,yards_after_catch,run_location,run_gap,field_goal_result,kick_distance,extra_point_result,two_point_conv_result,home_timeouts_remaining,away_timeouts_remaining,timeout,timeout_team,td_team,posteam_timeouts_remaining,defteam_timeouts_remaining,total_home_score,total_away_score,posteam_score,defteam_score,score_differential,posteam_score_post,defteam_score_post,score_differential_post,no_score_prob,opp_fg_prob,opp_safety_prob,opp_td_prob,fg_prob,safety_prob,td_prob,extra_point_prob,two_point_conversion_prob,ep,epa,total_home_epa,total_away_epa,total_home_rush_epa,total_away_rush_epa,total_home_pass_epa,total_away_pass_epa,air_epa,yac_epa,comp_air_epa,comp_yac_epa,total_home_comp_air_epa,total_away_comp_air_epa,total_home_comp_yac_epa,total_away_comp_yac_epa,total_home_raw_air_epa,total_away_raw_air_epa,total_home_raw_yac_epa,total_away_raw_yac_epa,wp,def_wp,home_wp,away_wp,wpa,home_wp_post,away_wp_post,total_home_rush_wpa,total_away_rush_wpa,total_home_pass_wpa,total_away_pass_wpa,air_wpa,yac_wpa,comp_air_wpa,comp_yac_wpa,total_home_comp_air_wpa,total_away_comp_air_wpa,total_home_comp_yac_wpa,total_away_comp_yac_wpa,total_home_raw_air_wpa,total_away_raw_air_wpa,total_home_raw_yac_wpa,total_away_raw_yac_wpa,punt_blocked,first_down_rush,first_down_pass,first_down_penalty,third_down_converted,third_down_failed,fourth_down_converted,fourth_down_failed,incomplete_pass,interception,punt_inside_twenty,punt_in_endzone,punt_out_of_bounds,punt_downed,punt_fair_catch,kickoff_inside_twenty,kickoff_in_endzone,kickoff_out_of_bounds,kickoff_downed,kickoff_fair_catch,fumble_forced,fumble_not_forced,fumble
_out_of_bounds,solo_tackle,safety,penalty,tackled_for_loss,fumble_lost,own_kickoff_recovery,own_kickoff_recovery_td,qb_hit,rush_attempt,pass_attempt,sack,touchdown,pass_touchdown,rush_touchdown,return_touchdown,extra_point_attempt,two_point_attempt,field_goal_attempt,kickoff_attempt,punt_attempt,fumble,complete_pass,assist_tackle,lateral_reception,lateral_rush,lateral_return,lateral_recovery,passer_player_id,passer_player_name,receiver_player_id,receiver_player_name,rusher_player_id,rusher_player_name,lateral_receiver_player_id,lateral_receiver_player_name,lateral_rusher_player_id,lateral_rusher_player_name,lateral_sack_player_id,lateral_sack_player_name,interception_player_id,interception_player_name,lateral_interception_player_id,lateral_interception_player_name,punt_returner_player_id,punt_returner_player_name,lateral_punt_returner_player_id,lateral_punt_returner_player_name,kickoff_returner_player_name,kickoff_returner_player_id,lateral_kickoff_returner_player_id,lateral_kickoff_returner_player_name,punter_player_id,punter_player_name,kicker_player_name,kicker_player_id,own_kickoff_recovery_player_id,own_kickoff_recovery_player_name,blocked_player_id,blocked_player_name,tackle_for_loss_1_player_id,tackle_for_loss_1_player_name,tackle_for_loss_2_player_id,tackle_for_loss_2_player_name,qb_hit_1_player_id,qb_hit_1_player_name,qb_hit_2_player_id,qb_hit_2_player_name,forced_fumble_player_1_team,forced_fumble_player_1_player_id,forced_fumble_player_1_player_name,forced_fumble_player_2_team,forced_fumble_player_2_player_id,forced_fumble_player_2_player_name,solo_tackle_1_team,solo_tackle_2_team,solo_tackle_1_player_id,solo_tackle_2_player_id,solo_tackle_1_player_name,solo_tackle_2_player_name,assist_tackle_1_player_id,assist_tackle_1_player_name,assist_tackle_1_team,assist_tackle_2_player_id,assist_tackle_2_player_name,assist_tackle_2_team,assist_tackle_3_player_id,assist_tackle_3_player_name,assist_tackle_3_team,assist_tackle_4_player_id,assist_tackle_4_player_name,as
sist_tackle_4_team,pass_defense_1_player_id,pass_defense_1_player_name,pass_defense_2_player_id,pass_defense_2_player_name,fumbled_1_team,fumbled_1_player_id,fumbled_1_player_name,fumbled_2_player_id,fumbled_2_player_name,fumbled_2_team,fumble_recovery_1_team,fumble_recovery_1_yards,fumble_recovery_1_player_id,fumble_recovery_1_player_name,fumble_recovery_2_team,fumble_recovery_2_yards,fumble_recovery_2_player_id,fumble_recovery_2_player_name,return_team,return_yards,penalty_team,penalty_player_id,penalty_player_name,penalty_yards,replay_or_challenge,replay_or_challenge_result,penalty_type,defensive_two_point_attempt,defensive_two_point_conv,defensive_extra_point_attempt,defensive_extra_point_conv
0,2972,2009091000,PIT,TEN,PIT,home,TEN,PIT,82,2009-09-10,25,925.0,925.0,Half2,0,19,0,3,3.0,0,00:25,PIT 18,7,8,(:25) (Shotgun) B.Roethlisberger pass short ri...,pass,5,1,0,1,0,0,0,short,right,4.0,1.0,,,,,,,3,3,0,,,3,3,7,7,7,7,0,7,7,0,0.049016,0.231857,0.009026,0.342126,0.13309,0.00373,0.231156,0.0,0.0,-1.08368,-0.529102,-1.8779,1.8779,-9.979168,9.979168,-0.664738,0.664738,-1.064585,0.535482,-1.064585,0.535482,11.1406,-11.1406,-7.884627,7.884627,20.060447,-20.060447,-16.414136,16.414136,0.436217,0.563783,0.436217,0.563783,-0.019123,0.417094,0.582906,-0.292289,0.292289,0.013755,-0.013755,-0.037935,0.018813,-0.037935,0.018813,0.336156,-0.336156,-0.226747,0.226747,-0.142437,0.142437,0.286112,-0.286112,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,00-0022924,B.Roethlisberger,00-0017162,H.Ward,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,00-0019733,N.Harper,TEN,00-0021219,C.Hope,TEN,,,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,0,,,0,0,0,0
1,3110,2009091000,PIT,TEN,TEN,away,PIT,PIT,34,2009-09-10,790,790.0,790.0,Half2,0,20,0,4,1.0,0,13:10,PIT 34,10,31,(13:10) C.Johnson left tackle to PIT 32 for 2 ...,run,2,0,0,0,0,0,0,,,,,left,tackle,,,,,3,3,0,,,3,3,7,7,7,7,0,7,7,0,0.026651,0.048917,0.000111,0.073142,0.346123,0.002868,0.502188,0.0,0.0,3.900459,-0.311933,-3.853643,3.853643,-8.989579,8.989579,-3.162022,3.162022,,,0.0,0.0,8.645225,-8.645225,-7.886536,7.886536,17.565073,-17.565073,-16.416045,16.416045,0.621188,0.378812,0.378812,0.621188,-0.010799,0.389611,0.610389,-0.256658,0.256658,-0.084232,0.084232,,,0.0,0.0,0.240199,-0.240199,-0.228776,0.228776,-0.238395,0.238395,0.284083,-0.284083,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,,,,,00-0026164,C.Johnson,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,00-0015080,A.Smith,PIT,00-0022697,K.Fox,PIT,,,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,0,,,0,0,0,0
2,4089,2009091000,PIT,TEN,PIT,home,TEN,PIT,59,2009-09-10,870,870.0,870.0,Overtime,0,25,0,5,2.0,0,14:30,PIT 41,2,21,"(14:30) (No Huddle, Shotgun) M.Moore left tack...",run,2,1,1,0,0,0,0,,,,,left,tackle,,,,,2,2,0,,,2,2,10,10,10,10,0,10,10,0,0.040792,0.103352,0.001092,0.151517,0.259334,0.003464,0.440449,0.0,0.0,2.495214,0.038122,2.083957,-2.083957,-10.333891,10.333891,1.200538,-1.200538,,,0.0,0.0,8.783677,-8.783677,-2.504563,2.504563,13.279489,-13.279489,-6.486087,6.486087,0.703247,0.296753,0.703247,0.296753,0.008758,0.712005,0.287995,-0.26069,0.26069,0.003903,-0.003903,,,0.0,0.0,0.228296,-0.228296,-0.047788,0.047788,-0.594251,0.594251,0.785364,-0.785364,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,00-0022217,M.Moore,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,TEN,,00-0024331,,S.Tulloch,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,0,,,0,0,0,0
3,1577,2009091304,CLE,MIN,CLE,home,MIN,CLE,53,2009-09-13,150,150.0,1950.0,Half1,0,9,0,2,4.0,0,02:30,CLE 47,5,5,"(2:30) D.Zastudil punts 38 yards to MIN 15, Ce...",punt,0,0,0,0,0,0,0,,,,,,,,38.0,,,3,2,0,,,3,2,6,10,6,10,-4,6,10,-4,0.490218,0.095856,0.001281,0.138467,0.142093,0.003102,0.128984,0.0,0.0,0.075976,-0.22131,-4.734675,4.734675,-3.262794,3.262794,-2.941432,2.941432,,,0.0,0.0,-0.123651,0.123651,-1.739479,1.739479,-3.437378,3.437378,1.839947,-1.839947,0.376919,0.623081,0.376919,0.623081,0.01211,0.38903,0.61097,-0.047341,0.047341,-0.093375,0.093375,,,0.0,0.0,0.003722,-0.003722,-0.060262,0.060262,-0.105275,0.105275,0.052666,-0.052666,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,00-0025951,D.Reynaud,,,,,,,00-0021235,D.Zastudil,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,0,,,0,0,0,0
4,1984,2009091304,CLE,MIN,MIN,away,CLE,MID,50,2009-09-13,855,1755.0,1755.0,Half2,0,13,0,3,2.0,0,14:15,MID 50,5,11,(14:15) A.Peterson left guard to CLE 44 for 6 ...,run,6,0,0,0,0,0,0,,,,,left,guard,,,,,3,3,0,,,3,3,12,10,10,12,-2,10,12,-2,0.000982,0.09652,0.000779,0.149546,0.316745,0.003761,0.431665,0.0,0.0,2.641476,0.689826,-0.24832,0.24832,-4.022258,4.022258,-3.66632,3.66632,,,0.0,0.0,-0.692441,0.692441,-2.447779,2.447779,-5.397055,5.397055,3.494788,-3.494788,0.510715,0.489285,0.489285,0.510715,0.020889,0.468396,0.531604,-0.069457,0.069457,-0.133975,0.133975,,,0.0,0.0,-0.016043,0.016043,-0.108685,0.108685,-0.167717,0.167717,0.078815,-0.078815,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,00-0025394,A.Peterson,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,CLE,,00-0001563,,D.Bowens,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,0,,,0,0,0,0
5,2496,2009091304,CLE,MIN,MIN,away,CLE,CLE,45,2009-09-13,273,1173.0,1173.0,Half2,0,15,0,3,1.0,0,04:33,CLE 45,10,42,(4:33) C.Taylor right guard to CLE 40 for 5 ya...,run,5,0,0,0,0,0,0,,,,,right,guard,,,,,3,3,0,,,3,3,12,17,17,12,5,17,12,5,0.008135,0.071978,0.000318,0.109373,0.329474,0.003265,0.477457,0.0,0.0,3.354972,0.027541,-7.720655,7.720655,-4.338499,4.338499,-10.571493,10.571493,,,0.0,0.0,-1.093482,1.093482,-7.90268,7.90268,-5.133019,5.133019,-5.497688,5.497688,0.775073,0.224927,0.224927,0.775073,0.001469,0.223458,0.776542,-0.081955,0.081955,-0.344597,0.344597,,,0.0,0.0,-0.037921,0.037921,-0.294059,0.294059,-0.168755,0.168755,-0.198977,0.198977,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,00-0021314,C.Taylor,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,CLE,,00-0024249,,D.Jackson,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,0,,,0,0,0,0
6,1356,2009091307,NO,DET,NO,home,DET,NO,65,2009-09-13,529,529.0,2329.0,Half1,0,11,0,2,1.0,0,08:49,NO 35,10,13,(8:49) M.Bell right tackle to NO 48 for 13 yar...,run,13,0,0,0,0,0,0,,,,,right,tackle,,,,,3,2,0,,,3,2,14,10,14,10,4,14,10,4,0.154751,0.103339,0.001514,0.153695,0.233368,0.002919,0.350414,0.0,0.0,1.769929,0.912886,6.093394,-6.093394,0.854689,-0.854689,7.267959,-7.267959,,,0.0,0.0,13.065823,-13.065823,-5.448348,5.448348,14.976949,-14.976949,-7.70899,7.70899,0.690826,0.309174,0.690826,0.309174,0.033655,0.724481,0.275519,0.020638,-0.020638,0.25832,-0.25832,,,0.0,0.0,0.312843,-0.312843,-0.048087,0.048087,0.348654,-0.348654,-0.090333,0.090333,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,00-0024011,M.Bell,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,DET,,00-0020329,,A.Henry,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,0,,,0,0,0,0
7,3039,2009091308,TB,DAL,DAL,away,TB,TB,2,2009-09-13,763,763.0,763.0,Half2,0,18,1,4,,0,12:43,TB 2,0,81,"(Kick formation) N.Folk extra point is GOOD, C...",extra_point,0,0,0,0,0,0,0,,,,,,,,20.0,good,,3,3,0,,,3,3,14,27,26,14,12,27,14,13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.990795,0.0,0.990795,0.009205,-13.484174,13.484174,8.963263,-8.963263,-9.845547,9.845547,,,0.0,0.0,-0.685125,0.685125,-13.685992,13.685992,2.429224,-2.429224,-13.374705,13.374705,0.894655,0.105345,0.105345,0.894655,0.007729,0.097617,0.902383,0.233593,-0.233593,-0.213758,0.213758,,,0.0,0.0,-0.022189,0.022189,-0.298771,0.298771,0.116615,-0.116615,-0.366189,0.366189,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,N.Folk,00-0025565,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,0,,,0,0,0,0
8,3175,2009091305,HOU,NYJ,NYJ,away,HOU,NYJ,78,2009-09-13,702,702.0,702.0,Half2,0,18,0,4,2.0,0,11:42,NYJ 22,8,7,(11:42) M.Sanchez scrambles right end ran ob a...,run,5,0,0,1,0,0,1,,,,,right,end,,,,,3,3,0,,,3,3,6,17,17,6,11,17,6,11,0.110901,0.169717,0.005623,0.253685,0.183553,0.003378,0.273142,0.0,0.0,0.173219,0.147724,-11.517267,11.517267,3.733709,-3.733709,-26.283959,26.283959,,,0.0,0.0,-10.065641,10.065641,-7.149768,7.149768,-23.414674,23.414674,2.231304,-2.231304,0.867009,0.132991,0.132991,0.867009,0.005407,0.127584,0.872416,0.107552,-0.107552,-0.588099,0.588099,,,0.0,0.0,0.548666,-0.548666,-0.254464,0.254464,0.137623,-0.137623,0.169211,-0.169211,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,00-0026898,M.Sanchez,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,HOU,,00-0023750,,M.Williams,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,0,,,0,0,0,0
9,3238,2009091306,IND,JAC,JAC,away,IND,JAC,99,2009-09-13,411,411.0,411.0,Half2,0,17,0,4,2.0,0,06:51,JAC 1,10,2,(6:51) D.Garrard up the middle to JAC 3 for 2 ...,run,2,0,0,0,0,0,0,,,,,middle,,,,,,2,2,0,,,2,2,14,12,12,14,-2,12,14,-2,0.343785,0.173685,0.016357,0.256821,0.079669,0.002112,0.127571,0.0,0.0,-1.215293,-0.297364,2.276548,-2.276548,-11.569201,11.569201,18.399077,-18.399077,,,0.0,0.0,3.503658,-3.503658,18.04894,-18.04894,10.409396,-10.409396,7.893985,-7.893985,0.348137,0.651863,0.651863,0.348137,-0.017733,0.669596,0.330404,-0.328184,0.328184,0.654096,-0.654096,,,0.0,0.0,0.121944,-0.121944,0.623202,-0.623202,0.310032,-0.310032,0.340819,-0.340819,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,00-0021231,D.Garrard,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,IND,,00-0021484,,G.Brackett,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,0,,,0,0,0,0


In [8]:
#convert the DataFrame to an RDD and check the number of partitions
rdd_nfl = df_nfl.rdd
rdd_nfl.getNumPartitions()

200

In [9]:
#repartition the RDD into 4 partitions and look at the first rows
rdd_nfl = rdd_nfl.repartition(4)
rdd_nfl.take(5)

[Row(play_id=786, game_id=2009091303, home_team='CIN', away_team='DEN', posteam='DEN', posteam_type='away', defteam='CIN', side_of_field='DEN', yardline_100=56, game_date='2009-09-13', quarter_seconds_remaining=104, half_seconds_remaining=1004.0, game_seconds_remaining=2804.0, game_half='Half1', quarter_end=0, drive=5, sp=0, qtr=1, down=1, goal_to_go=0, time='01:44', yrdln='DEN 44', ydstogo=10, ydsnet=4, desc='(1:44) K.Orton sacked at DEN 40 for -4 yards (J.Fanene). CIN-J.Fanene was injured during the play.', play_type='pass', yards_gained=-4, shotgun=0, no_huddle=0, qb_dropback=1, qb_kneel=0, qb_spike=0, qb_scramble=0, pass_length=None, pass_location=None, air_yards=None, yards_after_catch=None, run_location=None, run_gap=None, field_goal_result=None, kick_distance=None, extra_point_result=None, two_point_conv_result=None, home_timeouts_remaining=3, away_timeouts_remaining=3, timeout=0, timeout_team=None, td_team=None, posteam_timeouts_remaining=3, defteam_timeouts_remaining=3, total_

Now that the RDD has been converted from the DataFrame, we can run RDD operations on `rdd_nfl`. Let's replicate the two calculations: (1) the number of plays in each game and (2) the average yards gained in each game.

In [10]:
#mapreduce with spark RDD for sum of plays
#[Your Code]
nfl_play_sum = rdd_nfl.map(lambda row: (row['game_id'], 1)).reduceByKey(lambda x, y: x+y)
nfl_play_sum.take(10)

[(2009092000, 177),
 (2009092008, 158),
 (2009100500, 167),
 (2009101812, 174),
 (2009102508, 162),
 (2009111512, 187),
 (2009112204, 161),
 (2009121312, 170),
 (2009121400, 169),
 (2009122004, 171)]

In [11]:
#mapreduce with spark RDD for average yards gained
#hint: the "NA" values in the column "yards_gained" were read as nulls by the DataFrame reader, so in the RDD they are None and all other values are integers
#[Your Code]
#skip rows without a yards_gained value, then average the remaining values per game
nfl_yard_avg = rdd_nfl.filter(lambda row: row['yards_gained'] is not None).map(lambda row: (row['game_id'], row['yards_gained'])).groupByKey().mapValues(lambda x: sum(x) / len(x))
nfl_yard_avg.take(10)

[(2009092000, 4.581920903954802),
 (2009092008, 3.8417721518987342),
 (2009100500, 4.538922155688622),
 (2009101812, 3.5977011494252875),
 (2009102508, 4.382716049382716),
 (2009111512, 4.7272727272727275),
 (2009112204, 4.068322981366459),
 (2009121312, 4.047058823529412),
 (2009121400, 3.3846153846153846),
 (2009122004, 3.2280701754385963)]
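The cell above collects all values of a game before averaging them. A common alternative is to carry a `(sum, count)` pair per key and merge pairs, which is the kind of associative merge that `reduceByKey` expects. The sketch below shows that merge logic in plain Python on hypothetical records; `merge` and `average_by_key` are illustrative names, not part of the assignment.

```python
def merge(a, b):
    """Combine two (sum, count) pairs into one."""
    return (a[0] + b[0], a[1] + b[1])


def average_by_key(records):
    """Average the values per key, skipping None values."""
    totals = {}
    for key, value in records:
        if value is None:
            continue  # same role as the filter on yards_gained
        pair = (value, 1)
        totals[key] = merge(totals[key], pair) if key in totals else pair
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical (game_id, yards_gained) records
records = [(1, 4), (1, None), (1, 6), (2, -3), (2, 3)]
averages = average_by_key(records)
print(averages)  # {1: 5.0, 2: 0.0}
```

In PySpark, the same pattern would be a `map` to `(game_id, (yards_gained, 1))`, then `reduceByKey(merge)`, then `mapValues` to divide the sum by the count. This is offered as a possible variant, not as the method used in the cell above.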

Next, let's do the same with the Spark DataFrame API and Spark SQL. The DataFrame is `df_nfl`.

In [12]:
#use dataframe operations/api
#[Your Code]
df_nfl.groupBy('game_id').count().show()
df_nfl.groupBy('game_id').avg('yards_gained').show()

+----------+-----+
|   game_id|count|
+----------+-----+
|2010110703|  157|
|2010112111|  161|
|2012111100|  178|
|2013101302|  186|
|2014092105|  177|
|2015110810|  202|
|2016102301|  165|
|2015121312|  180|
|2011103008|  181|
|2017102208|  163|
|2017112609|  188|
|2018092303|  174|
|2018111110|  184|
|2009092011|  164|
|2013110700|  173|
|2018120205|  171|
|2009122012|  163|
|2012111101|  147|
|2012120907|  195|
|2015092005|  166|
+----------+-----+
only showing top 20 rows

+----------+------------------+
|   game_id| avg(yards_gained)|
+----------+------------------+
|2010110703| 4.840764331210191|
|2010112111| 3.031055900621118|
|2012111100|3.4269662921348316|
|2013101302| 4.091397849462366|
|2014092105|3.0395480225988702|
|2015110810| 4.306930693069307|
|2016102301|5.5212121212121215|
|2015121312|3.9166666666666665|
|2011103008|3.5359116022099446|
|2017102208|3.6748466257668713|
|2017112609|  3.00531914893617|
|2018092303| 4.632183908045977|
|2018111110| 4.728260869565218|
|20090

In [13]:
#use spark sql - don't forget to create temp view for df_nfl before querying the dataframe
#[Your Code]
df_nfl.createOrReplaceTempView('df_nfl')
spark.sql('select game_id, count(*) from df_nfl group by game_id').show()
spark.sql('select game_id, avg(yards_gained) from df_nfl group by game_id').show()

+----------+--------+
|   game_id|count(1)|
+----------+--------+
|2010110703|     157|
|2010112111|     161|
|2012111100|     178|
|2013101302|     186|
|2014092105|     177|
|2015110810|     202|
|2016102301|     165|
|2015121312|     180|
|2011103008|     181|
|2017102208|     163|
|2017112609|     188|
|2018092303|     174|
|2018111110|     184|
|2009092011|     164|
|2013110700|     173|
|2018120205|     171|
|2009122012|     163|
|2012111101|     147|
|2012120907|     195|
|2015092005|     166|
+----------+--------+
only showing top 20 rows

+----------+------------------+
|   game_id| avg(yards_gained)|
+----------+------------------+
|2010110703| 4.840764331210191|
|2010112111| 3.031055900621118|
|2012111100|3.4269662921348316|
|2013101302| 4.091397849462366|
|2014092105|3.0395480225988702|
|2015110810| 4.306930693069307|
|2016102301|5.5212121212121215|
|2015121312|3.9166666666666665|
|2011103008|3.5359116022099446|
|2017102208|3.6748466257668713|
|2017112609|  3.00531914893617

## Task 2 - Data Analytics with Spark DataFrame and SQL
### Answer four data analytics questions on the NFL dataset
With the NFL DataFrame `df_nfl`, use either DataFrame operations/API or Spark SQL to answer the following questions.
First of all, let's build a data viewer to look at the data so we can understand the values better.

In [14]:
#build a data viewer to check data for a game; the list basically contains all the columns to use in this task
game_info_all = ['play_id', 'game_id', 'home_team', 'away_team', 'game_date', 
                 'posteam', 'posteam_type', 'defteam',
                 'total_home_score', 'total_away_score',
                 'touchdown', 'pass_touchdown', 'rush_touchdown', 'return_touchdown']
#df_nfl.select(game_info).limit(200).toPandas().head(200)
df_nfl.select(game_info_all).where('game_id = 2018111110').toPandas().head(200)

Unnamed: 0,play_id,game_id,home_team,away_team,game_date,posteam,posteam_type,defteam,total_home_score,total_away_score,touchdown,pass_touchdown,rush_touchdown,return_touchdown
0,703,2018111110,LA,SEA,2018-11-11,SEA,away,LA,7,13,1.0,0.0,1.0,0.0
1,1792,2018111110,LA,SEA,2018-11-11,SEA,away,LA,17,14,0.0,0.0,0.0,0.0
2,2840,2018111110,LA,SEA,2018-11-11,LA,home,SEA,20,21,0.0,0.0,0.0,0.0
3,3647,2018111110,LA,SEA,2018-11-11,SEA,away,LA,36,24,0.0,0.0,0.0,0.0
4,4239,2018111110,LA,SEA,2018-11-11,SEA,away,LA,36,31,0.0,0.0,0.0,0.0
5,190,2018111110,LA,SEA,2018-11-11,SEA,away,LA,0,0,0.0,0.0,0.0,0.0
6,3549,2018111110,LA,SEA,2018-11-11,SEA,away,LA,36,24,0.0,0.0,0.0,0.0
7,1076,2018111110,LA,SEA,2018-11-11,SEA,away,LA,10,14,0.0,0.0,0.0,0.0
8,3516,2018111110,LA,SEA,2018-11-11,SEA,away,LA,36,24,0.0,0.0,0.0,0.0
9,2439,2018111110,LA,SEA,2018-11-11,SEA,away,LA,20,14,0.0,0.0,0.0,0.0


Now answer the following questions using either the Spark DataFrame API or Spark SQL. You can choose either one to solve the problem and output the results.
1. Which game(s) had the highest number of plays from 2009 to 2018? And which game had the largest final score difference?

In [15]:
from pyspark.sql import functions as fn
from pyspark.sql import Window

#you need to show the game info with the highest plays, so let's obtain game level information
game_info = ['game_id', 'home_team', 'away_team', 'game_date', 'total_home_score', 'total_away_score']
#because we need the final scores for each game as game-level info, we keep only the row with the maximum play_id in each game
window = Window.partitionBy('game_id')
nfl_game_info = df_nfl.withColumn("max_play_id", fn.max("play_id").over(window)).filter("max_play_id = play_id").drop("max_play_id").select(game_info)
nfl_game_info.show()

+----------+---------+---------+----------+----------------+----------------+
|   game_id|home_team|away_team| game_date|total_home_score|total_away_score|
+----------+---------+---------+----------+----------------+----------------+
|2009092011|      CHI|      PIT|2009-09-20|              17|              14|
|2010110703|      HOU|       SD|2010-11-07|              23|              29|
|2010112111|       SF|       TB|2010-11-21|               0|              21|
|2011103008|      PIT|       NE|2011-10-30|              25|              17|
|2012111100|      CAR|      DEN|2012-11-11|              14|              34|
|2013101302|      CLE|      DET|2013-10-13|              17|              31|
|2013110700|      MIN|      WAS|2013-11-07|              34|              27|
|2014092105|       NE|      OAK|2014-09-21|              16|               9|
|2015110810|      DAL|      PHI|2015-11-08|              27|              32|
|2015121312|       GB|      DAL|2015-12-13|              28|    
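The window expression above keeps, for each game, the row whose `play_id` equals the game's maximum `play_id`, on the assumption that this last play carries the final score. The same selection can be sketched in plain Python on hypothetical rows (the helper name `last_play_per_game` is illustrative only):

```python
def last_play_per_game(plays):
    """Return {game_id: play}, keeping the play with the largest play_id."""
    last = {}
    for play in plays:
        game = play["game_id"]
        if game not in last or play["play_id"] > last[game]["play_id"]:
            last[game] = play
    return last


# Hypothetical plays; the rows are deliberately out of order
plays = [
    {"game_id": 1, "play_id": 10, "total_home_score": 0, "total_away_score": 0},
    {"game_id": 1, "play_id": 250, "total_home_score": 17, "total_away_score": 14},
    {"game_id": 2, "play_id": 300, "total_home_score": 3, "total_away_score": 21},
    {"game_id": 2, "play_id": 40, "total_home_score": 0, "total_away_score": 7},
]
final = last_play_per_game(plays)
for game_id, play in sorted(final.items()):
    print(game_id, play["total_home_score"], play["total_away_score"])
```

Because the selection depends only on `play_id`, the order in which rows arrive does not matter, which is why it still works after the rows have been shuffled across partitions.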

In [16]:
#DataFrame API Solution
#get number of plays in each game
nfl_num_play = df_nfl.groupBy('game_id').agg(fn.count('play_id').alias('num_plays'))
nfl_num_play.show()

#join the two dataframes
nfl_game_info = nfl_game_info.join(nfl_num_play, 'game_id')

#[Your Code] to get the game with highest number of plays
win = Window.partitionBy()
nfl_game_info.withColumn("max_plays", fn.max("num_plays").over(win)).filter("max_plays = num_plays").drop("max_plays").show()

+----------+---------+
|   game_id|num_plays|
+----------+---------+
|2010110703|      157|
|2010112111|      161|
|2012111100|      178|
|2013101302|      186|
|2014092105|      177|
|2015110810|      202|
|2016102301|      165|
|2015121312|      180|
|2011103008|      181|
|2017102208|      163|
|2017112609|      188|
|2018092303|      174|
|2018111110|      184|
|2009092011|      164|
|2013110700|      173|
|2018120205|      171|
|2009122012|      163|
|2012111101|      147|
|2012120907|      195|
|2015092005|      166|
+----------+---------+
only showing top 20 rows

+----------+---------+---------+----------+----------------+----------------+---------+
|   game_id|home_team|away_team| game_date|total_home_score|total_away_score|num_plays|
+----------+---------+---------+----------+----------------+----------------+---------+
|2011120406|       NO|      DET|2011-12-04|              52|              38|      272|
+----------+---------+---------+----------+----------------+----------

In [17]:
#DataFrame API Solution
#now it is the score difference
nfl_game_info = nfl_game_info.withColumn('score_diff', fn.abs(nfl_game_info['total_home_score'] - nfl_game_info['total_away_score']))

#[Your Code] to get the game with highest score difference
win = Window.partitionBy()
nfl_game_info.withColumn("max_score_diff", fn.max("score_diff").over(win)).filter("max_score_diff = score_diff").drop("max_score_diff").show()

+----------+---------+---------+----------+----------------+----------------+---------+----------+
|   game_id|home_team|away_team| game_date|total_home_score|total_away_score|num_plays|score_diff|
+----------+---------+---------+----------+----------------+----------------+---------+----------+
|2009101810|       NE|      TEN|2009-10-18|              59|               0|      175|        59|
+----------+---------+---------+----------+----------------+----------------+---------+----------+
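Filtering on the global maximum, instead of sorting and taking the first row, returns every game tied for the top value, which matters because the question asks for "game(s)". A plain-Python sketch of that filter with hypothetical scores (`rows_at_max` is an illustrative helper, not part of the assignment):

```python
def rows_at_max(rows, key):
    """Return all rows whose value under key equals the maximum."""
    if not rows:
        return []
    top = max(row[key] for row in rows)
    return [row for row in rows if row[key] == top]


# Hypothetical games; B and C tie for the largest score difference
games = [
    {"game_id": "A", "home": 24, "away": 17},
    {"game_id": "B", "home": 3, "away": 62},
    {"game_id": "C", "home": 62, "away": 3},
]
for game in games:
    game["score_diff"] = abs(game["home"] - game["away"])

winners = rows_at_max(games, "score_diff")
print([game["game_id"] for game in winners])  # ['B', 'C']
```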



In [None]:
#Spark SQL Solution
#max plays
nfl_num_play = spark.sql('select game_id, count(*) as num_plays from df_nfl group by game_id')

nfl_game_info.createOrReplaceTempView('nfl_game_info')
nfl_num_play.createOrReplaceTempView('nfl_num_play')

nfl_game_info = spark.sql('select a.*, b.num_plays from nfl_game_info a, nfl_num_play b where a.game_id = b.game_id')

#[Your Code] to get the game with highest number of plays
nfl_game_info.createOrReplaceTempView('nfl_game_info')
spark.sql('select * from nfl_game_info where num_plays = (select max(num_plays) from nfl_game_info)').show()

In [None]:
#Spark SQL Solution
#max score diff
nfl_game_info = spark.sql('select *, abs(total_home_score-total_away_score) as score_diff from nfl_game_info')

#[Your Code] to get the game with highest score difference
nfl_game_info.createOrReplaceTempView('nfl_game_info')
spark.sql('select * from nfl_game_info where score_diff = (select max(score_diff) from nfl_game_info)').show()

2. On average how many plays are needed for a successful touchdown? And how many plays are needed for home team and away team, respectively?

In [18]:
#DataFrame API Solution
nfl_game_play = df_nfl.groupBy('game_id').agg(fn.count('play_id').alias('total_plays'), fn.sum('touchdown').alias('total_touchdowns'))
#[Your Code] to take average for total_plays/total_touchdowns
nfl_game_play.groupby().agg(fn.avg(fn.col('total_plays')/fn.col('total_touchdowns'))).show()

nfl_team_play = df_nfl.groupBy('game_id', 'posteam_type').agg(fn.count('play_id').alias('total_plays'), fn.sum('touchdown').alias('total_touchdowns'))
#[Your Code] to take average for total_plays/total_touchdowns by posteam_type
nfl_team_play.groupby('posteam_type').agg(fn.avg(fn.col('total_plays')/fn.col('total_touchdowns'))).show()

+-------------------------------------+
|avg((total_plays / total_touchdowns))|
+-------------------------------------+
|                    43.57610031013332|
+-------------------------------------+

+------------+-------------------------------------+
|posteam_type|avg((total_plays / total_touchdowns))|
+------------+-------------------------------------+
|        null|                                 null|
|        away|                    42.94638401085658|
|        home|                    40.42116942690338|
+------------+-------------------------------------+
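Note that the cell above averages the per-game ratio `total_plays / total_touchdowns`, which is a different statistic from total plays divided by total touchdowns over all games. The sketch below contrasts the two on hypothetical numbers. Skipping games without a touchdown in the first function is an assumption made here for illustration.

```python
def mean_of_ratios(games):
    """Average plays-per-touchdown over games that have at least one touchdown."""
    ratios = [plays / touchdowns for plays, touchdowns in games if touchdowns]
    return sum(ratios) / len(ratios)


def pooled_ratio(games):
    """Total plays divided by total touchdowns across all games."""
    total_plays = sum(plays for plays, _ in games)
    total_touchdowns = sum(touchdowns for _, touchdowns in games)
    return total_plays / total_touchdowns


# Hypothetical (total_plays, total_touchdowns) per game
games = [(160, 4), (180, 9), (150, 0)]
print(mean_of_ratios(games))  # 30.0
print(pooled_ratio(games))    # 490 / 13
```

The two numbers differ because the mean of ratios weights every game equally, while the pooled ratio weights every touchdown equally and also counts the plays of games that had no touchdown.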



In [None]:
#Spark SQL Solution
df_nfl.createOrReplaceTempView('df_nfl')

nfl_game_play = spark.sql('select game_id, count(*) as total_plays, \
                          sum(touchdown) as total_touchdowns from df_nfl group by game_id')

#[Your Code] to take average for total_plays/total_touchdowns
nfl_game_play.createOrReplaceTempView('nfl_game_play')
spark.sql('select avg(total_plays/total_touchdowns) from nfl_game_play').show()

nfl_team_play = spark.sql('select game_id, posteam_type, count(*) as total_plays, \
                          sum(touchdown) as total_touchdowns from df_nfl group by game_id, posteam_type')

#[Your Code] to take average for total_plays/total_touchdowns by posteam_type
nfl_team_play.createOrReplaceTempView('nfl_team_play')
spark.sql('select posteam_type, avg(total_plays/total_touchdowns) from nfl_team_play group by posteam_type').show()

3. Which type of touchdown is most likely on average: rush, pass, or return? Do the probabilities differ between the home and away teams?

In [19]:
#DataFrame API Solution
#[Your Code] to calculate total touchdowns and total of each type of touchdowns by game
#then take average for each type of touchdown divided by total touchdowns
nfl_game_play = df_nfl.groupBy('game_id').agg(fn.sum('touchdown').alias('total_touchdowns'),
                                              fn.sum('pass_touchdown').alias('pass_touchdowns'),
                                              fn.sum('rush_touchdown').alias('rush_touchdowns'),
                                              fn.sum('return_touchdown').alias('return_touchdowns'))
nfl_game_play.groupBy().agg(fn.avg(fn.col('pass_touchdowns')/fn.col('total_touchdowns')),
                            fn.avg(fn.col('rush_touchdowns')/fn.col('total_touchdowns')),
                            fn.avg(fn.col('return_touchdowns')/fn.col('total_touchdowns'))).show()

#[Your Code] to calculate total touchdowns and total of each type of touchdowns by game and posteam_type
#then take average for each type of touchdown divided by total touchdowns by posteam_type
nfl_team_play = df_nfl.groupBy('game_id', 'posteam_type').agg(fn.sum('touchdown').alias('total_touchdowns'),
                                                              fn.sum('pass_touchdown').alias('pass_touchdowns'),
                                                              fn.sum('rush_touchdown').alias('rush_touchdowns'),
                                                              fn.sum('return_touchdown').alias('return_touchdowns'))
nfl_team_play.groupBy('posteam_type').agg(fn.avg(fn.col('pass_touchdowns')/fn.col('total_touchdowns')),
                                          fn.avg(fn.col('rush_touchdowns')/fn.col('total_touchdowns')),
                                          fn.avg(fn.col('return_touchdowns')/fn.col('total_touchdowns'))).show()

+-----------------------------------------+-----------------------------------------+-------------------------------------------+
|avg((pass_touchdowns / total_touchdowns))|avg((rush_touchdowns / total_touchdowns))|avg((return_touchdowns / total_touchdowns))|
+-----------------------------------------+-----------------------------------------+-------------------------------------------+
|                       0.6008999522827617|                       0.3172862012197468|                        0.05594969649884093|
+-----------------------------------------+-----------------------------------------+-------------------------------------------+

+------------+-----------------------------------------+-----------------------------------------+-------------------------------------------+
|posteam_type|avg((pass_touchdowns / total_touchdowns))|avg((rush_touchdowns / total_touchdowns))|avg((return_touchdowns / total_touchdowns))|
+------------+-----------------------------------------+-------

In [None]:
#Spark SQL Solution
df_nfl.createOrReplaceTempView('df_nfl')

#[Your Code] to calculate total touchdowns and total of each type of touchdowns by game
#then take average for each type of touchdown divided by total touchdowns
nfl_game_play = spark.sql('select game_id, sum(touchdown) as total_touchdowns, \
                          sum(pass_touchdown) as pass_touchdowns, sum(rush_touchdown) as rush_touchdowns, \
                          sum(return_touchdown) as return_touchdowns from df_nfl group by game_id')
nfl_game_play.createOrReplaceTempView('nfl_game_play')
spark.sql('select avg(pass_touchdowns/total_touchdowns), avg(rush_touchdowns/total_touchdowns), \
          avg(return_touchdowns/total_touchdowns) from nfl_game_play').show()

#[Your Code] to calculate total touchdowns and total of each type of touchdowns by game and posteam_type
#then take average for each type of touchdown divided by total touchdowns by posteam_type
nfl_team_play = spark.sql('select game_id, posteam_type, sum(touchdown) as total_touchdowns, \
                          sum(pass_touchdown) as pass_touchdowns, sum(rush_touchdown) as rush_touchdowns, \
                          sum(return_touchdown) as return_touchdowns from df_nfl group by game_id, posteam_type')
nfl_team_play.createOrReplaceTempView('nfl_team_play')
spark.sql('select posteam_type, avg(pass_touchdowns/total_touchdowns), avg(rush_touchdowns/total_touchdowns), \
          avg(return_touchdowns/total_touchdowns) from nfl_team_play group by posteam_type').show()
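The queries for this question average each game's share, so every game carries the same weight no matter how many touchdowns it had. Another reasonable reading pools the touchdowns of all games first and then divides. The two can give different numbers, as this plain-Python sketch with made-up counts suggests (the values are hypothetical, not from the dataset).

```python
# Hypothetical per-game (pass_touchdowns, total_touchdowns) counts.
games = [(1, 2), (3, 4), (6, 6)]

# Mean of per-game shares: each game carries equal weight.
mean_of_shares = sum(p / t for p, t in games) / len(games)

# Pooled share: each touchdown carries equal weight.
pooled_share = sum(p for p, _ in games) / sum(t for _, t in games)

print(mean_of_shares)          # 0.75
print(round(pooled_share, 4))  # 0.8333
```

The assignment's approach corresponds to `mean_of_shares`; which definition is preferable depends on whether the unit of interest is the game or the touchdown.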

4. For each calendar year, which team(s) had the highest winning rate?

In [20]:
#let's look at the available teams
df_nfl.select('home_team').distinct().show(50)

+---------+
|home_team|
+---------+
|      NYJ|
|      CAR|
|       LA|
|       TB|
|      OAK|
|      DET|
|      TEN|
|      BUF|
|      BAL|
|      LAC|
|       NE|
|      JAC|
|       GB|
|      DEN|
|       SF|
|      ARI|
|       KC|
|      SEA|
|      CIN|
|      DAL|
|      CLE|
|      MIA|
|       SD|
|      STL|
|      MIN|
|      ATL|
|      PHI|
|      WAS|
|      NYG|
|      PIT|
|       NO|
|      IND|
|      HOU|
|      JAX|
|      CHI|
+---------+



In [21]:
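#the winner is the home team if it scored more, otherwise the away team (note: a tied game is also credited to the away team)
nfl_game_info = nfl_game_info.withColumn('win_team', fn.when(fn.col('total_home_score') > fn.col('total_away_score'), fn.col('home_team')).otherwise(fn.col('away_team')))
#substring positions are 1-based in Spark; take the first four characters as the year
nfl_game_info = nfl_game_info.withColumn('game_year', fn.substring('game_date', 1, 4))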
nfl_game_info.show()

+----------+---------+---------+----------+----------------+----------------+---------+----------+--------+---------+
|   game_id|home_team|away_team| game_date|total_home_score|total_away_score|num_plays|score_diff|win_team|game_year|
+----------+---------+---------+----------+----------------+----------------+---------+----------+--------+---------+
|2009092011|      CHI|      PIT|2009-09-20|              17|              14|      164|         3|     CHI|     2009|
|2010110703|      HOU|       SD|2010-11-07|              23|              29|      157|         6|      SD|     2010|
|2010112111|       SF|       TB|2010-11-21|               0|              21|      161|        21|      TB|     2010|
|2011103008|      PIT|       NE|2011-10-30|              25|              17|      181|         8|     PIT|     2011|
|2012111100|      CAR|      DEN|2012-11-11|              14|              34|      178|        20|     DEN|     2012|
|2013101302|      CLE|      DET|2013-10-13|             
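The `when(...).otherwise(...)` expression in this cell assigns the away team whenever the home score is not strictly greater, so a tied game counts as an away win. A tie-aware rule could look like the plain-Python sketch below. The function name `winner` and the choice to return `None` for a tie are illustrative assumptions, not part of the assignment; the first two calls use scores from the table above, the third is made up.

```python
def winner(home_team, away_team, home_score, away_score):
    """Return the winning team's code, or None when the game is tied."""
    if home_score > away_score:
        return home_team
    if away_score > home_score:
        return away_team
    return None  # tie: no winner recorded


print(winner("CHI", "PIT", 17, 14))  # CHI
print(winner("HOU", "SD", 23, 29))   # SD
print(winner("AAA", "BBB", 20, 20))  # None (hypothetical tied game)
```

In the DataFrame API, the equivalent would be to chain a second `when` for the away team and omit `otherwise`, which leaves `win_team` null for ties.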

In [22]:
#DataFrame API Solution
#create three sub dataframe, by team-year
win_count = nfl_game_info.groupBy(fn.col('win_team').alias('team'), 'game_year').agg(fn.count('win_team').alias('win_count'))
home_count = nfl_game_info.groupBy(fn.col('home_team').alias('team'), 'game_year').agg(fn.count('home_team').alias('home_count'))
away_count = nfl_game_info.groupBy(fn.col('away_team').alias('team'), 'game_year').agg(fn.count('away_team').alias('away_count'))

#[Your Code] to join the three dataframes by 'team' and 'game_year' for subsequent calculations
team_count = win_count.join(home_count, ['team', 'game_year']).join(away_count, ['team', 'game_year'])
team_count.show()

+----+---------+---------+----------+----------+
|team|game_year|win_count|home_count|away_count|
+----+---------+---------+----------+----------+
| ATL|     2012|       14|         9|         8|
| STL|     2012|        8|         9|         8|
| WAS|     2015|        8|         8|         7|
| WAS|     2014|        4|         8|         8|
|  NE|     2013|       11|         8|         8|
| PHI|     2012|        5|         9|         8|
|  GB|     2010|       11|         7|         9|
| DET|     2012|        4|         8|         9|
| PHI|     2010|       10|         7|         9|
| BAL|     2011|       12|         9|         7|
| DEN|     2012|       13|         9|         8|
| NYG|     2014|        7|         8|         8|
| TEN|     2016|        8|         7|         9|
| DET|     2009|        2|         7|         8|
| CLE|     2018|        6|         7|         7|
| MIN|     2009|       11|         7|         8|
| HOU|     2011|       11|         8|         8|
| OAK|     2010|    

In [23]:
#DataFrame API Solution
#generate total game counts and winning rate
team_count = team_count.withColumn('game_count', team_count['home_count'] + team_count['away_count'])
team_count = team_count.withColumn('win_rate', team_count['win_count'] / team_count['game_count'])

#[Your Code] to obtain the team(s) with highest winning rate in each calendar year
from pyspark.sql import Window  #harmless if Window was already imported earlier

window = Window.partitionBy('game_year')
team_count.withColumn("max_win_rate", fn.max("win_rate").over(window)).filter("max_win_rate = win_rate").drop("max_win_rate").show()

+----+---------+---------+----------+----------+----------+------------------+
|team|game_year|win_count|home_count|away_count|game_count|          win_rate|
+----+---------+---------+----------+----------+----------+------------------+
| DAL|     2016|       13|         9|         7|        16|            0.8125|
|  NE|     2016|       13|         8|         8|        16|            0.8125|
| ATL|     2012|       14|         9|         8|        17|0.8235294117647058|
|  NE|     2017|       14|         8|         9|        17|0.8235294117647058|
| PIT|     2017|       14|         9|         8|        17|0.8235294117647058|
| MIN|     2017|       14|         9|         8|        17|0.8235294117647058|
| DAL|     2014|       12|         8|         8|        16|              0.75|
|  GB|     2014|       12|         8|         8|        16|              0.75|
|  NE|     2014|       12|         8|         8|        16|              0.75|
| SEA|     2014|       12|         8|         8|    
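The window expression in this cell computes the maximum `win_rate` within each `game_year` and keeps every row equal to it, which is why tied teams all appear in the output. The same two-pass logic in plain Python is sketched below. The first three rows reuse win and game counts from the output above; the `XXX` row is hypothetical and added only to show a team being filtered out.

```python
from collections import defaultdict

# (team, game_year, win_count, game_count)
rows = [
    ("DAL", "2016", 13, 16),
    ("NE", "2016", 13, 16),
    ("ATL", "2012", 14, 17),
    ("XXX", "2016", 8, 16),  # hypothetical team, not in the dataset
]

# Pass 1: best win rate per year (the role of max(...).over(window)).
best_rate = defaultdict(float)
for team, year, wins, games in rows:
    best_rate[year] = max(best_rate[year], wins / games)

# Pass 2: keep every team whose rate equals its year's best, ties included.
leaders = [
    (team, year)
    for team, year, wins, games in rows
    if wins / games == best_rate[year]
]

print(leaders)  # [('DAL', '2016'), ('NE', '2016'), ('ATL', '2012')]
```

Comparing against the group maximum, rather than ranking and taking the first row, is what preserves ties.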

In [None]:
#Spark SQL Solution
nfl_game_info.createOrReplaceTempView('nfl_game_info')

#create three sub dataframe, by team-year
win_count = spark.sql('select win_team as team, game_year, count(*) as win_count from nfl_game_info group by win_team, game_year')
home_count = spark.sql('select home_team as team, game_year, count(*) as home_count from nfl_game_info group by home_team, game_year')
away_count = spark.sql('select away_team as team, game_year, count(*) as away_count from nfl_game_info group by away_team, game_year')

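#[Your Code] to join the three dataframes by 'team' and 'game_year' for subsequent calculations
win_count.createOrReplaceTempView('win_count')
home_count.createOrReplaceTempView('home_count')
away_count.createOrReplaceTempView('away_count')

#match on both team and game_year; matching on team alone would pair every year of a team with every other year
team_count = spark.sql('select a.*, b.home_count, c.away_count from win_count a, home_count b, away_count c \
                       where a.team=b.team and a.team=c.team and a.game_year=b.game_year and a.game_year=c.game_year')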
team_count.show()
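Because `win_count`, `home_count` and `away_count` each hold one row per team and year, the join condition has to match on both columns. Matching on `team` alone pairs every year of a team with every other year of the same team. The sketch below demonstrates the effect with Python's built-in `sqlite3` instead of Spark, an assumption made only so the example runs without a Spark session. The four rows reuse the NE values for 2016 and 2017 from the DataFrame output earlier in this question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table win_count (team text, game_year text, win_count integer);
    create table home_count (team text, game_year text, home_count integer);
    insert into win_count values ('NE', '2016', 13), ('NE', '2017', 14);
    insert into home_count values ('NE', '2016', 8), ('NE', '2017', 8);
""")

# Join on team only: each win row matches both home rows.
team_only = conn.execute(
    "select count(*) from win_count a, home_count b where a.team = b.team"
).fetchone()[0]

# Join on team and year: one matching home row per win row.
team_and_year = conn.execute(
    "select count(*) from win_count a, home_count b "
    "where a.team = b.team and a.game_year = b.game_year"
).fetchone()[0]

print(team_only)      # 4
print(team_and_year)  # 2
conn.close()
```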

In [None]:
#Spark SQL Solution
#generate total game counts and winning rate
team_count = team_count.withColumn('game_count', team_count['home_count'] + team_count['away_count'])
team_count = team_count.withColumn('win_rate', team_count['win_count'] / team_count['game_count'])

#[Your Code] to obtain the team(s) with highest winning rate in each calendar year
team_count.createOrReplaceTempView('team_count')
spark.sql('select a.* from team_count a, (select game_year, max(win_rate) as max_win_rate from team_count group by game_year) b where a.game_year=b.game_year and a.win_rate=b.max_win_rate').show()

In [None]:
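#if game_date is stored as a month/day/year string, convert it to 'yyyy-MM-dd' before extracting the year
#in Spark date patterns 'MM' is the month; lowercase 'mm' means minutes
nfl_game_info = nfl_game_info.withColumn('game_date_1', fn.to_date('game_date', 'MM/dd/yyyy'))
nfl_game_info = nfl_game_info.withColumn('game_date_str', fn.date_format('game_date_1', 'yyyy-MM-dd'))
nfl_game_info = nfl_game_info.withColumn('game_year', fn.substring('game_date_str', 1, 4))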
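The three steps in this cell (parse a month/day/year string, reformat it as an ISO date, take the first four characters as the year) can be traced on a single value in plain Python. This is only a sketch: Python's `strptime` uses its own directives (`%m` for month, `%M` for minute), which are distinct from Spark's pattern letters, and the input string is a hypothetical example.

```python
from datetime import datetime

raw = "09/20/2009"  # hypothetical month/day/year input

parsed = datetime.strptime(raw, "%m/%d/%Y")  # parse month/day/year
iso = parsed.strftime("%Y-%m-%d")            # reformat as an ISO date string
game_year = iso[:4]                          # first four characters are the year

print(iso)        # 2009-09-20
print(game_year)  # 2009
```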