# DS 5110 Final Project
## 2020 NFL Big Data Bowl
### Group 3: Anoop Nath, Lauren O'Donnell, OC Ofoma

### Introduction
Our final project addresses the 2020 [NFL Big Data Bowl](https://www.kaggle.com/competitions/nfl-big-data-bowl-2020/overview) prompt. The competition, hosted on Kaggle, asks participants to predict, for each play, the cumulative probability distribution of the yards gained. Submissions are evaluated with the Continuous Ranked Probability Score (CRPS), which measures how far the predicted cumulative distribution lies from the yardage actually gained (lower is better). The CRPS is computed as:

$C = \dfrac{1}{199N} \sum^N_{m=1} \sum^{99}_{n=-99} (P(y \leq n) - H(n-Y_m))^2 $

where $P$ is the predicted distribution, $N$ is the number of plays in the test set, $Y_m$ is the actual yardage gained on play $m$, and $H(x)$ is the Heaviside step function ($H(x) = 1$ for $x \geq 0$ and zero otherwise).
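As a rough illustration, the score can be sketched in plain Python as the squared difference between the predicted CDF and the step function at the actual yardage, averaged over the 199 yard values and the $N$ plays. The function names `heaviside` and `crps` and the toy inputs are our own assumptions for this sketch, not part of the competition code.

```python
def heaviside(x):
    """Heaviside step function: 1 for x >= 0, otherwise 0."""
    return 1.0 if x >= 0 else 0.0


def crps(pred_cdfs, actual_yards):
    """Average squared gap between predicted CDFs and observed step functions.

    pred_cdfs: one list per play with 199 values P(y <= n) for n = -99..99.
    actual_yards: the observed yardage for each play.
    """
    n_plays = len(actual_yards)
    total = 0.0
    for cdf, y_m in zip(pred_cdfs, actual_yards):
        for idx, n in enumerate(range(-99, 100)):
            total += (cdf[idx] - heaviside(n - y_m)) ** 2
    return total / (199 * n_plays)


# Toy example (hypothetical data): the play actually gained 3 yards.
perfect = [heaviside(n - 3) for n in range(-99, 100)]     # step exactly at 3
off_by_two = [heaviside(n - 5) for n in range(-99, 100)]  # step at 5 instead

score_perfect = crps([perfect], [3])
score_off = crps([off_by_two], [3])
```

A perfect step prediction scores 0, while the prediction that is off by two yards differs from the truth at two of the 199 yard values and therefore scores 2/199.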

The submission will not score if any of the predicted values has $P(y \leq k) > P(y \leq k + 1)$ for any $k$ (i.e. the CDF must be non-decreasing).
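One simple way to guarantee a non-decreasing CDF before submission is to clip each value to $[0, 1]$ and then take a running maximum. This is only a sketch of one possible post-processing step; the helper name `make_valid_cdf` and the sample values are hypothetical.

```python
def make_valid_cdf(values):
    """Clip values to [0, 1] and apply a running maximum so the result never decreases."""
    fixed = []
    running_max = 0.0
    for v in values:
        v = min(max(v, 0.0), 1.0)          # keep each probability inside [0, 1]
        running_max = max(running_max, v)  # never fall below an earlier value
        fixed.append(running_max)
    return fixed


# Hypothetical raw model output with two small dips and one value above 1.
raw = [0.0, 0.2, 0.15, 0.6, 0.55, 1.2]
cdf = make_valid_cdf(raw)
```

The running maximum changes the model output as little as possible on inputs that are already valid, but it can only raise values, so other repairs (for example, sorting) may suit some models better.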

As part of the DS 5110 project requirements, we are to generate three models utilizing techniques from this course in `pyspark`. 

### Importing Dataset
The data were provided as part of the [2020 NFL Big Data Bowl](https://www.kaggle.com/competitions/nfl-big-data-bowl-2020/) hosted on Kaggle. For this project, the data were downloaded from Kaggle locally and uploaded to a shared location. 

We ran the project on Rivanna, where the data were stored in a single shared location accessible to the whole group. Code was shared on our [GitHub Repository](https://github.com/lauren-odonnell/DS5110_NFL_BigDataBowl).

A data dictionary was provided by the 2020 NFL Big Data Bowl hosted on Kaggle and used as a schema to import the data. The variables are described below (**cite**)

* `GameId` - a unique game identifier
* `PlayId` - a unique play identifier
* `Team` - home or away
* `X` - player position along the long axis of the field. See figure below.
* `Y` - player position along the short axis of the field. See figure below.
* `S` - speed in yards/second
* `A` - acceleration in yards/second^2
* `Dis` - distance traveled from prior time point, in yards
* `Orientation` - orientation of player (deg)
* `Dir` - angle of player motion (deg)
* `NflId` - a unique identifier of the player
* `DisplayName` - player's name
* `JerseyNumber` - jersey number
* `Season` - year of the season
* `YardLine` - the yard line of the line of scrimmage
* `Quarter` - game quarter (1-5, 5 == overtime)
* `GameClock` - time on the game clock
* `PossessionTeam` - team with possession
* `Down` - the down (1-4)
* `Distance` - yards needed for a first down
* `FieldPosition` - which side of the field the play is happening on
* `HomeScoreBeforePlay` - home team score before play started
* `VisitorScoreBeforePlay` - visitor team score before play started
* `NflIdRusher` - the NflId of the rushing player
* `OffenseFormation` - offense formation
* `OffensePersonnel` - offensive team positional grouping
* `DefendersInTheBox` - number of defenders lined up near the line of scrimmage, spanning the width of the offensive line
* `DefensePersonnel` - defensive team positional grouping
* `PlayDirection` - direction the play is headed
* `TimeHandoff` - UTC time of the handoff
* `TimeSnap` - UTC time of the snap
* `Yards` - the yardage gained on the play (you are predicting this)
* `PlayerHeight` - player height (ft-in)
* `PlayerWeight` - player weight (lbs)
* `PlayerBirthDate` - birth date (mm/dd/yyyy)
* `PlayerCollegeName` - where the player attended college
* `Position` - the player's position (the specific role on the field that they typically play)
* `HomeTeamAbbr` - home team abbreviation
* `VisitorTeamAbbr` - visitor team abbreviation
* `Week` - week into the season
* `Stadium` - stadium where the game is being played
* `Location` - city where the game is being played
* `StadiumType` - description of the stadium environment
* `Turf` - description of the field surface
* `GameWeather` - description of the game weather
* `Temperature` - temperature (deg F)
* `Humidity` - humidity
* `WindSpeed` - wind speed in miles/hour
* `WindDirection` - wind direction


<img src="NFL_Data_Dictionary.png">  

In [1]:
import os
print(os.getcwd())

from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("comm") \
        .getOrCreate()

/sfs/qumulo/qhome/qsq6zz/ds5110/project


/opt/conda/lib/python3.7/site-packages/pyspark/bin/load-spark-env.sh: line 68: ps: command not found
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/07/28 16:32:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
# setting dataset schema for import

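# import data types
# LongType is used for PlayID: its values are too large for a 32-bit IntegerType and would otherwise load as null
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType, FloatType, TimestampType

schema = StructType([
    StructField('GameID', IntegerType(), True),
    StructField('PlayID', LongType(), True), 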
    StructField('Team', StringType(), True), 
    StructField('X', FloatType(), True), 
    StructField('Y', FloatType(), True), 
    StructField('S', FloatType(), True), 
    StructField('A', FloatType(), True), 
    StructField('Dis', FloatType(), True), 
    StructField('Orientation', FloatType(), True), 
    StructField('Dir', FloatType(), True), 
    StructField('NflId', IntegerType(), True), 
    StructField('DisplayName', StringType(), True), 
    StructField('JerseyNumber', IntegerType(), True), 
    StructField('Season', IntegerType(), True), 
    StructField('YardLine', IntegerType(), True), 
    StructField('Quarter', IntegerType(), True), 
    StructField('GameClock', TimestampType(), True), 
    StructField('Possessionteam', StringType(), True), 
    StructField('Down', IntegerType(), True), 
    StructField('Distance', IntegerType(), True), 
    StructField('FieldPosition', StringType(), True), 
    StructField('HomeScoreBeforePlay', IntegerType(), True), 
    StructField('VisitorScoreBeforePlay', IntegerType(), True), 
    StructField('NflIdRusher', IntegerType(), True), 
    StructField('OffenseFormation', StringType(), True), 
    StructField('OffensePersonnel', StringType(), True), 
    StructField('DefendersInTheBox', IntegerType(), True), 
    StructField('DefensePersonnel', StringType(), True), 
    StructField('PlayDirection', StringType(), True), 
    StructField('TimeHandoff', TimestampType(), True), 
    StructField('TimeSnap', TimestampType(), True), 
    StructField('Yards', IntegerType(), True), 
    StructField('PlayerHeight', StringType(), True), 
    StructField('PlayerWeight', IntegerType(), True), 
    StructField('PlayerBirthDate', StringType(), True), 
    StructField('PlayerCollegeName', StringType(), True), 
    StructField('Position', StringType(), True), 
    StructField('HomeTeamAbbr', StringType(), True), 
    StructField('VisitorTeamAbbr', StringType(), True), 
    StructField('Week', IntegerType(), True), 
    StructField('Stadium', StringType(), True), 
    StructField('Location', StringType(), True), 
    StructField('StadiumType', StringType(), True), 
    StructField('Turf', StringType(), True), 
    StructField('GameWeather', StringType(), True), 
    StructField('Temperature', IntegerType(), True), 
    StructField('Humidity', IntegerType(), True), 
    StructField('WindSpeed', IntegerType(), True), 
    StructField('WindDirection', StringType(), True)])

In [3]:
path = '/sfs/qumulo/qhome/nux9aq/project' # update as necessary; current is the shared location containing the data

data = spark.read.csv(path, header = True, schema = schema)

In [4]:
data.show(1)

23/07/28 16:32:42 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


                                                                                

+----------+------+----+-----+-----+----+----+---+-----------+------+------+-----------+------------+------+--------+-------+-------------------+--------------+----+--------+-------------+-------------------+----------------------+-----------+----------------+----------------+-----------------+----------------+-------------+-------------------+-------------------+-----+------------+------------+---------------+-----------------+--------+------------+---------------+----+----------------+--------------+-----------+----------+--------------+-----------+--------+---------+-------------+
|    GameID|PlayID|Team|    X|    Y|   S|   A|Dis|Orientation|   Dir| NflId|DisplayName|JerseyNumber|Season|YardLine|Quarter|          GameClock|Possessionteam|Down|Distance|FieldPosition|HomeScoreBeforePlay|VisitorScoreBeforePlay|NflIdRusher|OffenseFormation|OffensePersonnel|DefendersInTheBox|DefensePersonnel|PlayDirection|        TimeHandoff|           TimeSnap|Yards|PlayerHeight|PlayerWeight|PlayerBirth

In [5]:
print('Number of rows: ', data.count())



Number of rows:  682801


                                                                                

### Data Preprocessing
NFL tracking data are known to require substantial wrangling, from correcting readings from sensors that were worn backwards during a game to engineering variables that fit the research question at hand.
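Two examples of the kind of variable engineering this dataset invites are converting `PlayerHeight` from its `ft-in` string into inches and mirroring plays so that they all run in the same direction. The sketch below is illustrative only: the helper names, the assumed field dimensions (120 by 53.3 yards), and the assumption that `PlayDirection` takes the values `'left'` and `'right'` are ours and should be checked against the data.

```python
FIELD_LENGTH = 120.0       # yards along X, including both end zones (assumed)
FIELD_WIDTH = 160.0 / 3.0  # yards along Y, about 53.3 (assumed)


def height_to_inches(height):
    """Convert a 'ft-in' string such as '6-2' into total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)


def standardize_left_to_right(x, y, direction_deg, play_direction):
    """Mirror a play headed 'left' so that every play runs toward increasing X."""
    if play_direction == "left":
        return (FIELD_LENGTH - x,
                FIELD_WIDTH - y,
                (direction_deg + 180.0) % 360.0)
    return x, y, direction_deg


height = height_to_inches("6-2")  # 6 ft 2 in
flipped = standardize_left_to_right(30.0, 20.0, 90.0, "left")
unchanged = standardize_left_to_right(30.0, 20.0, 90.0, "right")
```

In `pyspark`, logic like this would typically be expressed with column expressions or a UDF rather than row-by-row Python calls.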

### Data Splitting/Sampling

### Exploratory Data Analysis (EDA)
**NOTES:** Need at least two graphs. 

### Model Construction
**NOTES:** Ideally created with a pipeline and will include the following: 

    a. A benchmark model, which is relatively simple. This could be a regression model with a small number of features (possibly a single feature). This provides a basis for comparison and a sanity check.
    b. Two relatively more sophisticated models (e.g., random forest, gradient boosted tree). The best model found in your experiments is called the champion model.

The model construction process should follow the best practices covered in class, including:

    a. Data preprocessing. The required steps will depend on the model, and could include:
        i. dummy variable construction
        ii. feature scaling
        iii. handling missing values and outliers
        iv. handling semi-structured / unstructured data
        v. dimensionality reduction (e.g., PCA)
    b. Data splitting (train/validation/test sets, for example). The test set should be left out for evaluation purposes. It should NOT be used in training.
    c. K-fold cross-validation for hyperparameter tuning
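Because the data dictionary mixes per-player fields (such as `X` and `NflId`) with per-play fields (such as `Yards`), the rows appear to hold one record per player per play. Under that assumption, a random row-level split could place rows from the same play in both training and test sets. The sketch below holds out whole games for testing and assigns the remaining games to K folds; the function name `split_games`, its parameters, and the toy game IDs are hypothetical.

```python
import random


def split_games(game_ids, test_fraction=0.2, k=3, seed=42):
    """Hold out whole games for testing, then split the remaining games into k folds."""
    unique_games = sorted(set(game_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique_games)

    n_test = max(1, int(len(unique_games) * test_fraction))
    test_games = set(unique_games[:n_test])
    train_games = unique_games[n_test:]

    # Deal the training games round-robin into k disjoint folds.
    folds = [set(train_games[i::k]) for i in range(k)]
    return test_games, folds


# Toy example: 10 hypothetical games, each appearing on many rows.
game_ids = [g for g in range(1, 11) for _ in range(22)]
test_games, folds = split_games(game_ids)
```

Each fold in turn serves as the validation set while the other folds are used for training; the held-out test games are touched only for the final evaluation.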

#### Model 1: 

#### Model 2: 

#### Model 3: 

### Model Evaluation
**NOTES:** This should include computation of relevant metrics, and a comparison between models.

For all appropriate models (benchmark, champion, and other relevant models), the following should be conducted:
    
    a. Evaluate relevant metrics. 
        For regression, this would include
            i. R-squared
        For classification, this would include:
            i. accuracy
            ii. precision, recall, F1 score
            iii. confusion matrix
            iv. area under ROC curve (AUROC)
        Depending on the application, additional evaluation could make sense such as lift charts
    b. Sensitivity analysis
        For your champion model, show relevant metrics for different hyperparameter values. This gives an idea of the model sensitivity.
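For reference, the regression and classification metrics listed above can be written out in a few lines of plain Python. This is a minimal sketch with made-up numbers, intended only to make the definitions concrete; in the project itself these would normally come from the `pyspark` evaluators.

```python
def r_squared(actual, predicted):
    """R-squared: 1 minus the ratio of residual to total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def precision_recall_f1(actual, predicted, positive=1):
    """Precision, recall, and F1 score for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical yardage values for the regression metric.
r2 = r_squared([3, 5, 2, 10], [4, 4, 3, 9])

# Hypothetical binary labels for the classification metrics.
precision, recall, f1 = precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
```

For the toy regression data the residual sum of squares is 4 and the total sum of squares is 38, so R-squared is 1 - 4/38. For the toy labels there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.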