Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone is acceptable. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('/content/NFLPlaybyPlay2015.csv')
df.head()


Unnamed: 0.1,Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,yrdln,yrdline100,ydstogo,ydsnet,GoalToGo,FirstDown,posteam,DefensiveTeam,desc,PlayAttempted,Yards.Gained,sp,Touchdown,ExPointResult,TwoPointConv,DefTwoPoint,Safety,PuntResult,PlayType,Passer,PassAttempt,PassOutcome,PassLength,PassLocation,InterceptionThrown,Interceptor,Rusher,RushAttempt,RunLocation,RunGap,Receiver,Reception,ReturnResult,Returner,BlockingPlayer,Tackler1,Tackler2,FieldGoalResult,FieldGoalDistance,Fumble,RecFumbTeam,RecFumbPlayer,Sack,Challenge.Replay,ChalReplayResult,Accepted.Penalty,PenalizedTeam,PenaltyType,PenalizedPlayer,Penalty.Yards,PosTeamScore,DefTeamScore,ScoreDiff,AbsScoreDiff,Season
0,36,2015-09-10,2015091000,1,1,,15:00,15,3600.0,0.0,NE,35.0,35.0,0,0,0.0,,PIT,NE,S.Gostkowski kicks 65 yards from NE 35 to end ...,1,0,0,0,,,,0,,Kickoff,,0,,,,0,,,0,,,,0,Touchback,,,,,,,0,,,0,0,,0,,,,0,0.0,0.0,0.0,0.0,2015
1,51,2015-09-10,2015091000,1,1,1.0,15:00,15,3600.0,0.0,PIT,20.0,80.0,10,18,0.0,1.0,PIT,NE,(15:00) De.Williams right tackle to PIT 38 for...,1,18,0,0,,,,0,,Run,,0,,,,0,,D.Hightower,1,right,tackle,,0,,,,D.Hightower,,,,0,,,0,0,,0,,,,0,0.0,0.0,0.0,0.0,2015
2,72,2015-09-10,2015091000,1,1,1.0,14:21,15,3561.0,39.0,PIT,38.0,62.0,10,31,0.0,0.0,PIT,NE,(14:21) B.Roethlisberger pass short right to A...,1,9,0,0,,,,0,,Pass,B.Roethlisberger,1,Complete,Short,right,0,,,0,,,A.Brown,1,,,,D.Hightower,,,,0,,,0,0,,0,,,,0,0.0,0.0,0.0,0.0,2015
3,101,2015-09-10,2015091000,1,1,2.0,14:04,15,3544.0,17.0,PIT,47.0,53.0,1,31,0.0,1.0,PIT,NE,(14:04) De.Williams right guard to NE 49 for 4...,1,4,0,0,,,,0,,Run,,0,,,,0,,J.Collins,1,right,guard,,0,,,,J.Collins,M.Brown,,,0,,,0,0,,0,,,,0,0.0,0.0,0.0,0.0,2015
4,122,2015-09-10,2015091000,1,1,1.0,13:26,14,3506.0,38.0,NE,49.0,49.0,10,45,0.0,1.0,PIT,NE,(13:26) B.Roethlisberger pass short right to H...,1,14,0,0,,,,0,,Pass,B.Roethlisberger,1,Complete,Short,right,0,,,0,,,H.Miller,1,,,,J.Mayo,,,,0,,,0,0,,0,,,,0,0.0,0.0,0.0,0.0,2015


In [5]:
# This is a classification problem: the target is PlayType, and I'll predict whether a play is a run or a pass.
# I'll drop the rows that are special teams plays, because predicting run vs. pass is an entirely different problem if I include every game situation.
target = 'PlayType'

In [6]:
df[target].value_counts(normalize = True)

Pass                  0.397212
Run                   0.284181
No Play               0.056537
Kickoff               0.055605
Punt                  0.052960
Timeout               0.040300
Sack                  0.025819
Extra Point           0.024410
Field Goal            0.021418
Quarter End           0.014633
QB Kneel              0.009213
End of Game           0.004574
Onside Kick           0.001452
Spike                 0.001127
Half End              0.000130
Name: PlayType, dtype: float64

In [7]:
# Ignoring the non-run/pass plays, the two classes are relatively close, but the data skews toward pass plays over run plays

In [12]:
# Keep only run and pass plays; .copy() avoids a SettingWithCopyWarning when columns are modified later
df_rp = df[df['PlayType'].isin(['Pass', 'Run'])].copy()

In [13]:
df_rp

Unnamed: 0.1,Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,yrdln,yrdline100,ydstogo,ydsnet,GoalToGo,FirstDown,posteam,DefensiveTeam,desc,PlayAttempted,Yards.Gained,sp,Touchdown,ExPointResult,TwoPointConv,DefTwoPoint,Safety,PuntResult,PlayType,Passer,PassAttempt,PassOutcome,PassLength,PassLocation,InterceptionThrown,Interceptor,Rusher,RushAttempt,RunLocation,RunGap,Receiver,Reception,ReturnResult,Returner,BlockingPlayer,Tackler1,Tackler2,FieldGoalResult,FieldGoalDistance,Fumble,RecFumbTeam,RecFumbPlayer,Sack,Challenge.Replay,ChalReplayResult,Accepted.Penalty,PenalizedTeam,PenaltyType,PenalizedPlayer,Penalty.Yards,PosTeamScore,DefTeamScore,ScoreDiff,AbsScoreDiff,Season
1,51,2015-09-10,2015091000,1,1,1.0,15:00,15,3600.0,0.0,PIT,20.0,80.0,10,18,0.0,1.0,PIT,NE,(15:00) De.Williams right tackle to PIT 38 for...,1,18,0,0,,,,0,,Run,,0,,,,0,,D.Hightower,1,right,tackle,,0,,,,D.Hightower,,,,0,,,0,0,,0,,,,0,0.0,0.0,0.0,0.0,2015
2,72,2015-09-10,2015091000,1,1,1.0,14:21,15,3561.0,39.0,PIT,38.0,62.0,10,31,0.0,0.0,PIT,NE,(14:21) B.Roethlisberger pass short right to A...,1,9,0,0,,,,0,,Pass,B.Roethlisberger,1,Complete,Short,right,0,,,0,,,A.Brown,1,,,,D.Hightower,,,,0,,,0,0,,0,,,,0,0.0,0.0,0.0,0.0,2015
3,101,2015-09-10,2015091000,1,1,2.0,14:04,15,3544.0,17.0,PIT,47.0,53.0,1,31,0.0,1.0,PIT,NE,(14:04) De.Williams right guard to NE 49 for 4...,1,4,0,0,,,,0,,Run,,0,,,,0,,J.Collins,1,right,guard,,0,,,,J.Collins,M.Brown,,,0,,,0,0,,0,,,,0,0.0,0.0,0.0,0.0,2015
4,122,2015-09-10,2015091000,1,1,1.0,13:26,14,3506.0,38.0,NE,49.0,49.0,10,45,0.0,1.0,PIT,NE,(13:26) B.Roethlisberger pass short right to H...,1,14,0,0,,,,0,,Pass,B.Roethlisberger,1,Complete,Short,right,0,,,0,,,H.Miller,1,,,,J.Mayo,,,,0,,,0,0,,0,,,,0,0.0,0.0,0.0,0.0,2015
5,159,2015-09-10,2015091000,1,1,1.0,12:42,13,3462.0,44.0,NE,35.0,35.0,10,56,0.0,1.0,PIT,NE,(12:42) (Shotgun) De.Williams right guard to N...,1,11,0,0,,,,0,,Run,,0,,,,0,,J.Collins,1,right,guard,,0,,,,J.Collins,,,,0,,,0,0,,0,,,,0,0.0,0.0,0.0,0.0,2015
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46124,389412,2016-01-03,2016010310,22,4,4.0,00:43,1,43.0,7.0,GB,42.0,58.0,10,12,0.0,1.0,GB,MIN,"(:43) (No Huddle, Shotgun) A.Rodgers pass shor...",1,12,0,0,,,,0,,Pass,A.Rodgers,1,Complete,Short,middle,0,,,0,,,R.Rodgers,1,,,,C.Munnerlyn,A.Sendejo,,,0,,,0,0,,0,,,,0,13.0,19.0,-6.0,6.0,2015
46125,391814,2016-01-03,2016010310,22,4,1.0,00:27,1,27.0,16.0,MIN,46.0,46.0,10,19,0.0,0.0,GB,MIN,(:27) (No Huddle) A.Rodgers pass short right t...,1,7,0,0,,,,0,,Pass,A.Rodgers,1,Complete,Short,right,0,,,0,,,D.Adams,1,,,,T.Newman,,,,0,,,0,0,,0,,,,0,13.0,19.0,-6.0,6.0,2015
46126,394216,2016-01-03,2016010310,22,4,2.0,00:24,1,24.0,3.0,MIN,39.0,39.0,3,19,0.0,0.0,GB,MIN,"(:24) (No Huddle, Shotgun) A.Rodgers pass inco...",1,0,0,0,,,,0,,Pass,A.Rodgers,1,Incomplete Pass,Deep,left,0,,,0,,,J.Jones,0,,,,,,,,0,,,0,0,,0,,,,0,13.0,19.0,-6.0,6.0,2015
46127,396414,2016-01-03,2016010310,22,4,3.0,00:15,1,15.0,9.0,MIN,39.0,39.0,3,20,0.0,0.0,GB,MIN,"(:15) (No Huddle, Shotgun) A.Rodgers pass shor...",1,1,0,0,,,,0,,Pass,A.Rodgers,1,Complete,Short,left,0,,,0,,,R.Rodgers,1,,,,X.Rhodes,,,,0,,,0,0,,0,,,,0,13.0,19.0,-6.0,6.0,2015


In [14]:
df_rp[target].value_counts(normalize = True)

Pass    0.582941
Run     0.417059
Name: PlayType, dtype: float64

In [15]:
# The majority class (Pass) makes up 58% of the plays, which is between 50% and 70%, so accuracy should be a sufficient metric
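As a reference point for that accuracy figure, here is a minimal sketch of a majority-class baseline. It uses a toy Series standing in for `df_rp[target]` (the 58/42 counts are made up to mirror the proportions above), and `majority_baseline` is a hypothetical helper name, not something from the notebook.

```python
import pandas as pd

def majority_baseline(y: pd.Series):
    """Return the most frequent class and the share of rows it covers."""
    freqs = y.value_counts(normalize=True)  # sorted from most to least frequent
    return freqs.index[0], float(freqs.iloc[0])

# Toy stand-in for df_rp['PlayType'] (assumption: not the real data)
toy_target = pd.Series(['Pass'] * 58 + ['Run'] * 42, name='PlayType')

majority_class, baseline_accuracy = majority_baseline(toy_target)
print(majority_class, round(baseline_accuracy, 2))
```

Any model would need to beat this baseline accuracy (always guessing the majority class) to be worth keeping.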

In [16]:
# Because I'm only looking at run and pass plays, most of the potential outliers have already been cut from my dataset

# When I split the data, I'll probably use a random train/test split
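A random split into train, validation, and test sets could look roughly like the sketch below. The toy DataFrame, the 60/20/20 proportions, the `stratify` option, and `random_state=42` are all assumptions for illustration; the notebook has not settled on any of them.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the run/pass DataFrame (assumption: not the real data)
toy = pd.DataFrame({
    'down': [1, 2, 3, 4] * 25,
    'ydstogo': list(range(1, 11)) * 10,
    'PlayType': ['Pass'] * 60 + ['Run'] * 40,
})
target = 'PlayType'

# First hold out 20% of the rows as the test set
train_val, test = train_test_split(
    toy, test_size=0.2, stratify=toy[target], random_state=42
)
# Then take 25% of the remainder (20% of the total) as the validation set
train, val = train_test_split(
    train_val, test_size=0.25, stratify=train_val[target], random_state=42
)
print(train.shape, val.shape, test.shape)
```

Stratifying on the target keeps the pass/run ratio about the same in each subset, which makes the accuracy scores easier to compare.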

In [17]:
# I think I'll have to engineer some of the more important features myself, such as the results of prior plays, which strongly influence future play calling
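One possible way to build "previous play" features is to shift columns within each game and offense, as sketched below. This assumes the rows are already in chronological order; the grouping keys and the new column names `prev_play_type` and `prev_yards_gained` are hypothetical choices, and the toy rows are made up.

```python
import pandas as pd

# Toy stand-in using column names from the dataset (assumption: not the real data)
toy = pd.DataFrame({
    'GameID': [1, 1, 1, 1, 2, 2],
    'posteam': ['PIT', 'PIT', 'NE', 'PIT', 'GB', 'GB'],
    'PlayType': ['Run', 'Pass', 'Run', 'Pass', 'Pass', 'Run'],
    'Yards.Gained': [18, 9, 4, 14, 12, 7],
})

# Shift by one row within each (game, offense) group, so every play
# sees only what that offense did on its previous play in the same game
prev = toy.groupby(['GameID', 'posteam'])[['PlayType', 'Yards.Gained']].shift(1)

toy['prev_play_type'] = prev['PlayType']
toy['prev_yards_gained'] = prev['Yards.Gained']
print(toy)
```

The first play of each group has no previous play, so its new columns are missing values that would need handling before modeling.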

In [18]:
# As for the columns I'll be dropping, many give information that applies only to a run or only to a pass, so I'll have to drop those to prevent data leakage

In [20]:
df = df_rp

In [24]:
# infer_datetime_format is deprecated in pandas 2.x; the ISO dates here parse without it
df['Date'] = pd.to_datetime(df['Date'])

In [25]:
df.head(1)

Unnamed: 0.1,Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,yrdln,yrdline100,ydstogo,ydsnet,GoalToGo,FirstDown,posteam,DefensiveTeam,desc,PlayAttempted,Yards.Gained,sp,Touchdown,ExPointResult,TwoPointConv,DefTwoPoint,Safety,PuntResult,PlayType,Passer,PassAttempt,PassOutcome,PassLength,PassLocation,InterceptionThrown,Interceptor,Rusher,RushAttempt,RunLocation,RunGap,Receiver,Reception,ReturnResult,Returner,BlockingPlayer,Tackler1,Tackler2,FieldGoalResult,FieldGoalDistance,Fumble,RecFumbTeam,RecFumbPlayer,Sack,Challenge.Replay,ChalReplayResult,Accepted.Penalty,PenalizedTeam,PenaltyType,PenalizedPlayer,Penalty.Yards,PosTeamScore,DefTeamScore,ScoreDiff,AbsScoreDiff,Season
1,51,2015-09-10,2015091000,1,1,1.0,15:00,15,3600.0,0.0,PIT,20.0,80.0,10,18,0.0,1.0,PIT,NE,(15:00) De.Williams right tackle to PIT 38 for...,1,18,0,0,,,,0,,Run,,0,,,,0,,D.Hightower,1,right,tackle,,0,,,,D.Hightower,,,,0,,,0,0,,0,,,,0,0.0,0.0,0.0,0.0,2015


In [26]:
leak_cols = ['Passer', 'PassAttempt', 'PassOutcome', 'PassLength', 'PassLocation', 'InterceptionThrown', 'Interceptor', 'Rusher', 'RushAttempt', 'RunLocation', 'RunGap', 'Receiver', 'Reception']

# However, I won't drop these columns until later, because I need some of them to engineer features that describe the results of previous plays
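The eventual drop could be sketched as follows, once the feature engineering is done. The toy DataFrame and the shortened `leak_cols` list are assumptions for illustration; the real step would use the full list defined above.

```python
import pandas as pd

# Shortened version of leak_cols, just for this toy example
leak_cols = ['Passer', 'PassOutcome', 'Rusher', 'RunGap']
target = 'PlayType'

# Toy stand-in for the run/pass DataFrame (assumption: not the real data)
toy = pd.DataFrame({
    'down': [1, 2, 3],
    'ydstogo': [10, 1, 7],
    'Passer': [None, 'B.Roethlisberger', None],
    'PassOutcome': [None, 'Complete', None],
    'Rusher': ['De.Williams', None, 'De.Williams'],
    'RunGap': ['tackle', None, 'guard'],
    'PlayType': ['Run', 'Pass', 'Run'],
})

# Feature engineering from the leaky columns would happen before this point.

# Drop the leaky columns and separate the target from the features
X = toy.drop(columns=leak_cols + [target])
y = toy[target]
print(X.columns.tolist())
```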