
# Capstone 1: Fantasy NFL (Pre-processing & Training Data Development)

The next step for my Capstone project is to clean up the latest version of my dataset to ensure it is ready for the Modelling stage of the project.

## Getting Started

### Import packages & load dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('NFL_FantasyData_2015_2019_EDA_v3.csv')
#df.head()

### Review Dataset

The dataset has an unnamed column, 'Unnamed: 0', left over from the import. It carries no information and should be removed.

In [3]:
df = df.drop(columns='Unnamed: 0')
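An 'Unnamed: 0' column typically appears when a DataFrame was saved with `to_csv` while keeping its index and is then read back without `index_col`. The sketch below reproduces this on a small made-up frame (the toy values and the in-memory buffer are assumptions, not the project's file) and shows that passing `index_col=0` is one possible alternative to dropping the column afterwards.

```python
import io

import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({"POS": ["QB", "RB"], "FAN_ACTUAL": [18.4, 7.2]})

# Writing with the default index=True stores the index as an unnamed first column.
buffer = io.StringIO()
toy.to_csv(buffer)
csv_text = buffer.getvalue()

# Reading it back without index_col yields an 'Unnamed: 0' column.
naive = pd.read_csv(io.StringIO(csv_text))

# Telling pandas that the first column is the index avoids the extra column.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(clean.columns))
```

Either route gives the same columns in the end; dropping the column, as done above, has the advantage of working even when the file layout is not known in advance.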

In [4]:
#df.head()

In [5]:
df.shape

(22410, 47)

Next, we will look at which data types are in the dataset.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22410 entries, 0 to 22409
Data columns (total 47 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   TEAM             22410 non-null  object 
 1   OPP              22410 non-null  object 
 2   DATE             22410 non-null  object 
 3   SEASON           22410 non-null  int64  
 4   WEEK             22410 non-null  object 
 5   MONTH            22410 non-null  object 
 6   TIME             22410 non-null  object 
 7   POS              22410 non-null  object 
 8   PLAYER           22410 non-null  object 
 9   FAN_ACTUAL       22410 non-null  float64
 10  HOME             22410 non-null  int64  
 11  DOME             22410 non-null  int64  
 12  GRASS            22410 non-null  int64  
 13  SUNDAY           22410 non-null  int64  
 14  WEEK_SEASON_ID   22410 non-null  int64  
 15  FAN_AVG          22410 non-null  float64
 16  PASSCOMP_AVG     22410 non-null  float64
 17  PASSATT_AVG 

There are 8 object (categorical) columns that we will need to make numeric.

## Categorical variables

In [7]:
# np.object was removed in NumPy 1.24; the built-in object is equivalent here
object_cols = list(df.columns[df.dtypes == object])

In [8]:
for x in object_cols:
  val = df[x].nunique()
  print(x,' = ', val)

TEAM  =  32
OPP  =  32
DATE  =  248
WEEK  =  17
MONTH  =  5
TIME  =  3
POS  =  4
PLAYER  =  1012


The review of the object columns shows that including them all would create more than 1,300 new columns.
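As a quick check of that figure, the unique-value counts printed above can simply be added up. The dictionary below is copied by hand from that output; the second total assumes one level per column is dropped, as `drop_first=True` would do.

```python
# Unique-value counts copied from the output above.
unique_counts = {
    "TEAM": 32, "OPP": 32, "DATE": 248, "WEEK": 17,
    "MONTH": 5, "TIME": 3, "POS": 4, "PLAYER": 1012,
}

# One dummy column per unique value.
full_dummies = sum(unique_counts.values())

# One dummy column fewer per variable when the first level is dropped.
reduced_dummies = sum(n - 1 for n in unique_counts.values())

print(full_dummies, reduced_dummies)  # 1353 1345
```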

For the moment, we will not look at the date fields DATE, WEEK, MONTH, and TIME.

We will instead concentrate on the POS, TEAM, OPP, and PLAYER columns. 

### Remove 'PLAYER' column

First, I will remove the PLAYER field. Converting it to dummy variables would create 1,011 additional columns, which would add too many dimensions to the problem. It is also unlikely that the player's name itself will be a main indicator of performance, but rather the statistics that they produce.

In [9]:
df = df.drop(columns='PLAYER')

In [10]:
#df.shape

### Remove 'TEAM' and 'OPP' column

Transforming the TEAM and OPP columns into dummy variables would add 31 columns each. We also noted during our EDA that, although the Team and Coach columns were providing us with information, this information could be proxied using team performance statistics (i.e. it likely isn't the name of the team that influences a player's performance, but rather how the team performs under a given organisational structure).

With this in mind, I have decided to remove both the 'TEAM' and 'OPP' columns. However, based on the performance of our initial modelling, we could look to reintroduce them if required.

In [11]:
df = df.drop(columns=['TEAM', 'OPP'])

In [12]:
#df.shape

### Create dummy variables for 'POS' column

Throughout the EDA, we saw that the position of a player was an important indicator of fantasy performance, and therefore we will include it in our model. To do this, we will create dummy variables below.

In [13]:
dummy_POS = pd.get_dummies(df.POS, prefix='POS', drop_first=True)

In [14]:
df = pd.concat([df, dummy_POS], axis=1).drop(columns=['POS'])

In [15]:
df.head()
df.shape

(22410, 46)
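To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up column. The position labels QB, RB, TE and WR are an assumption inferred from the resulting dummy names; only the 4 unique values are confirmed by the output above.

```python
import pandas as pd

# Toy position column (labels are assumed, not read from the real data).
toy = pd.DataFrame({"POS": ["QB", "RB", "TE", "WR", "RB"]})

# drop_first=True drops the alphabetically first level, which becomes the baseline.
dummies = pd.get_dummies(toy["POS"], prefix="POS", drop_first=True)

print(list(dummies.columns))
print(dummies)
```

The first row (a QB in this toy data) has no dummy set, so the dropped level is encoded implicitly as "all zeros". This avoids a redundant column whose value is fully determined by the others.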

## Date variables

### Date

Our data has 248 unique values in our 'DATE' column. While we want to capture some time element in our model, I don't believe it needs to be as granular as a specific date. For this reason, we will remove the date field from our dataset for now, but can look to include it again if the coarser time variables prove insufficient.

In [16]:
df = df.drop(columns=['DATE'])

In [17]:
df.shape

(22410, 45)

### Week, Month & Season

As mentioned when removing the 'DATE' column, we have other variables in our dataset that we believe can capture the time and date aspect of the data. These are the following columns:

* 'WEEK' - which gameweek a game was played in (there are 17 gameweeks in an NFL season)
* 'MONTH' - which month a game was played in
* 'SEASON' - which season a game was played in
* 'WEEK_SEASON_ID' - the running number of the gameweek, counted from the start of this dataset
* 'SUNDAY' - whether the game was played on a Sunday (1 if yes, 0 if no)
* 'TIME' - what time of day the game was played (Noon, Afternoon, Night)


For this, there are two main questions that need to be answered in order:

1.  Do we want to include 'WEEK_SEASON_ID'?
  * This is an ordered numeric series that may cause issues with our model if not correctly applied. For this reason, it is probably best to __remove 'WEEK_SEASON_ID'.__

2.  Do we want to keep 'WEEK' or 'MONTH' column?
  * During our EDA, we found that both showed the same trend: average fantasy points fell as the season went on. Including both would likely duplicate information, so it is probably best to proceed with only one for our modelling. __As 'MONTH' requires fewer dummy variables, we will begin with this but can return to include 'WEEK' if required later.__


This means that we will proceed with the 'MONTH', 'SEASON', 'TIME', and 'SUNDAY' columns to capture the time elements of the data in our modelling. To do this, we will need to create dummy variables for each of these columns (for 'SEASON' we will first convert it to an object), except for 'SUNDAY', which is already a binary column.

In [18]:
df = df.drop(columns=['WEEK_SEASON_ID', 'WEEK'])

In [19]:
df.shape

(22410, 43)

In [20]:
df['SEASON'] = df['SEASON'].astype(object)

In [21]:
dummy_SEASON = pd.get_dummies(df.SEASON, prefix='SEASON', drop_first=True)
dummy_MONTH = pd.get_dummies(df.MONTH, prefix='MONTH', drop_first=True)
dummy_TIME = pd.get_dummies(df.TIME, prefix='TIME', drop_first=True)

In [22]:
df = pd.concat([df, dummy_SEASON, dummy_MONTH, dummy_TIME], axis=1).drop(columns=['SEASON', 'MONTH', 'TIME'])

In [23]:
df.head()
df.shape

(22410, 50)
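As a possible shortcut, the same encoding can be sketched in a single call by passing the `columns` argument to `pd.get_dummies`. The toy frame below is made up for illustration; when `columns` is given, pandas encodes the listed columns regardless of their dtype, so the integer 'SEASON' column needs no prior conversion in this variant.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the real data (values are made up).
toy = pd.DataFrame({
    "SEASON": [2015, 2016, 2016, 2017],
    "MONTH": ["September", "October", "December", "January"],
    "TIME": ["Noon", "Night", "Afternoon", "Noon"],
    "SUNDAY": [1, 0, 1, 1],
})

# Encode the three listed columns in one step; SUNDAY is passed through unchanged.
encoded = pd.get_dummies(toy, columns=["SEASON", "MONTH", "TIME"], drop_first=True)

print(list(encoded.columns))
```

The originals are removed automatically, so no separate `concat` and `drop` are needed. Whether this reads better than the explicit three-step version is a matter of taste.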

## Training and Test Data

Now that all the data is in numeric format, we will need to scale it so that every feature is on a comparable scale for modelling. However, prior to this, we will need to split our data into Train and Test data so that the scaler is fitted on the training data only.

First, I will import the train_test_split function from sklearn.

In [24]:
from sklearn.model_selection import train_test_split

Next, I will break out my data into independent and dependent variables.

For our model, we have two potential dependent variables: 'FAN_ACTUAL' or 'cluster_4'. 'FAN_ACTUAL' is a continuous variable and 'cluster_4' is a categorical variable, so the choice between them will dictate the type of model: regression or classification.

To start we will focus on 'FAN_ACTUAL'.

In [25]:
X = df.drop(['FAN_ACTUAL', 'cluster_4'], axis=1)
y = df['FAN_ACTUAL']

Now, I will split my data into training and test data. Due to the high number of dimensions in the dataset, I am going to set my test size at 20% - lower than the default option of 25%.

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
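The behaviour of the split can be checked on a small synthetic example. The sketch below uses made-up random data (the column names `a`, `b`, `c` and `target` are placeholders) to confirm the 80/20 proportions, that `random_state` makes the split repeatable, and that features and target stay aligned row by row.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 100 rows, 3 features.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_toy = pd.Series(rng.normal(size=100), name="target")

# Same call as above, on the toy data.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Repeating the call with the same random_state selects the same rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
print(X_te.index.equals(X_te2.index))  # same rows chosen both times
print(X_tr.index.equals(y_tr.index))   # features and target remain aligned
```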

## Scaling

In [27]:
X.describe()

statistic,HOME,DOME,GRASS,SUNDAY,FAN_AVG,PASSCOMP_AVG,PASSATT_AVG,PASSCOMP%_AVG,PASSYDS_AVG,PASSTD_AVG,INT_AVG,QBRAT_AVG,SACK_AVG,SACKYDS_AVG,PASSYDS_300_AVG,PASSYDS_400_AVG,RUSHATT_AVG,RUSHYDS_AVG,RUSHTD_AVG,FUM_AVG,FUMLST_AVG,RUSHYDS_100_AVG,RUSHYDS_200_AVG,TGTS_AVG,REC_AVG,RECYDS_AVG,RECTD_AVG,RECYDS_100_AVG,RECYDS_200_AVG,PTS_FOR_AVG,PTS_AGT_AVG,WIN/TIE_AVG,OPP_PTS_FOR_AVG,OPP_PTS_AGT_AVG,OPP_WIN/TIE_AVG,POS_RB,POS_TE,POS_WR,SEASON_2016,SEASON_2017,SEASON_2018,SEASON_2019,MONTH_January,MONTH_November,MONTH_October,MONTH_September,TIME_Night,TIME_Noon
count,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0,22410.0
mean,0.49286,0.25087,0.559036,0.847256,7.540876,2.452454,3.861323,0.079474,27.963528,0.175565,0.089573,11.367761,0.256783,1.699108,0.053429,0.003153,2.895828,12.283151,0.089194,0.116354,0.055332,0.0426,0.000491,3.633118,2.44128,27.912379,0.174554,0.084627,0.000807,22.746088,22.665462,0.504332,22.667321,22.71248,0.500892,0.296029,0.184516,0.391031,0.203614,0.204596,0.202811,0.20183,0.025524,0.23784,0.253503,0.19112,0.19643,0.549353
std,0.49996,0.433524,0.496514,0.359749,6.1616,6.878654,10.75693,0.208177,78.749941,0.535357,0.296911,29.988032,0.778899,5.273701,0.222486,0.030521,4.947567,22.129624,0.219471,0.239117,0.13889,0.184523,0.011799,2.909681,1.906358,25.561921,0.262745,0.24865,0.014224,5.930808,5.475222,0.285957,5.94076,5.459414,0.286925,0.456514,0.387913,0.487992,0.402694,0.403415,0.402102,0.401375,0.157715,0.42577,0.435026,0.393192,0.397306,0.497569
min,0.0,0.0,0.0,0.0,-2.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-7.0,0.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.25,1.0,7.5,0.0,0.0,0.0,18.75,19.0,0.25,18.5,19.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,1.0,1.0,5.841667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.25,0.0,0.0,0.0,0.0,0.0,3.25,2.25,21.75,0.0,0.0,0.0,22.5,22.5,0.5,22.25,22.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,1.0,1.0,1.0,1.0,10.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5,14.729167,0.0,0.25,0.0,0.0,0.0,5.5,3.5,42.0,0.25,0.0,0.0,26.5,26.25,0.75,26.5,26.5,0.75,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
max,1.0,1.0,1.0,1.0,44.09,36.0,54.0,1.0,403.0,5.0,3.0,158.3,7.0,54.0,2.0,1.0,29.5,168.0,2.0,3.0,2.0,2.0,0.5,17.0,15.0,166.0,3.0,2.0,0.333333,43.75,43.75,1.0,43.75,43.75,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


As can be seen in the table above, the range of values differs greatly between columns. This suggests that, in order to compare values across columns, we should transform all our values to a similar scale.

To do this, we will use the StandardScaler class from sklearn.preprocessing.

In [28]:
from sklearn.preprocessing import StandardScaler

In [29]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
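StandardScaler standardises each column as z = (x - mean) / std, using the mean and standard deviation learned in `fit`. The sketch below illustrates this on synthetic data (the values are made up; only the column names are borrowed from the dataset). It also wraps the scaled arrays back into DataFrames, which is one optional way to keep column names for later inspection.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic train/test frames with one continuous and one binary feature.
rng = np.random.default_rng(0)
train = pd.DataFrame({"PTS_FOR_AVG": rng.normal(22, 6, 200), "HOME": rng.integers(0, 2, 200)})
test = pd.DataFrame({"PTS_FOR_AVG": rng.normal(22, 6, 50), "HOME": rng.integers(0, 2, 50)})

# Learn the mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(train)

# Apply the same training statistics to both sets, keeping labels and index.
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

# Training columns end up with mean 0 and (population) standard deviation 1.
print(train_scaled.mean().round(6))
print(train_scaled.std(ddof=0).round(6))

# Test columns are only approximately centred, because they reuse training statistics.
print(test_scaled.mean().round(3))
```

Fitting on the training data alone keeps information about the test set out of the preprocessing step, which is why the split is done before scaling. Note that the binary dummy columns are standardised too; whether that is desirable depends on the model used later.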

In [30]:
#pd.DataFrame(X_train_scaled, columns=list(X.columns))

In [31]:
#pd.DataFrame(X_test_scaled, columns=list(X.columns))