# Capstone 1: Fantasy NFL (Pre-processing & Training Data Development)

The next step in my Capstone project is to clean up the latest version of my dataset so that it is ready for the modelling stage of the project.

## Import packages & load dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('NFL_FantasyData_2015_2019_EDA.csv')
#df.head()

The dataset has an unnamed column, 'Unnamed: 0', created on import. It carries no useful information and should be removed.

In [3]:
df = df.drop(columns='Unnamed: 0')
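
As a side note on where a column like 'Unnamed: 0' usually comes from: it typically appears when a DataFrame was saved with its index and then read back without saying so. The sketch below uses a small made-up frame (the values are hypothetical, not taken from the real dataset) to show two common ways of avoiding the column instead of dropping it afterwards.

```python
import io
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({"PLAYER": ["A", "B"], "FAN_ACTUAL": [12.5, 7.0]})

# Writing with the default index=True stores the index as an unnamed first column.
csv_text = toy.to_csv()

# Reading it back naively reproduces the 'Unnamed: 0' column.
naive = pd.read_csv(io.StringIO(csv_text))

# Option 1: tell pandas that the first column is the index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: do not write the index in the first place.
clean = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(naive.columns))
print(list(with_index.columns))
print(list(clean.columns))
```

Either option would make the separate `drop` step unnecessary, assuming the extra column really is a saved index.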

In [4]:
#df.head()

In [5]:
df.shape

(19529, 34)

Next, we will look at the data types in the dataset.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19529 entries, 0 to 19528
Data columns (total 34 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   POS             19529 non-null  object 
 1   TEAM            19529 non-null  object 
 2   OPP             19529 non-null  object 
 3   DATE            19529 non-null  object 
 4   SEASON          19529 non-null  int64  
 5   WEEK            19529 non-null  object 
 6   MONTH           19529 non-null  object 
 7   DAY             19529 non-null  object 
 8   TIME            19529 non-null  object 
 9   PLAYER          19529 non-null  object 
 10  FAN_ACTUAL      19529 non-null  float64
 11  HOME            19529 non-null  int64  
 12  DOME            19529 non-null  int64  
 13  GRASS           19529 non-null  int64  
 14  WEEK_NUMBER     19529 non-null  int64  
 15  FAN_AVG         19529 non-null  float64
 16  RUSHATT_AVG     19529 non-null  float64
 17  RUSHYDS_AVG     19529 non-null 

There are 9 object (categorical) columns that we will need to convert to numeric.

## Categorical variables

In [7]:
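# np.object was removed in NumPy 1.24, so select object-dtype columns by dtype name instead
object_cols = list(df.select_dtypes(include='object').columns)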

In [8]:
for x in object_cols:
  val = df[x].nunique()
  print(x,' = ', val)

POS  =  4
TEAM  =  32
OPP  =  32
DATE  =  248
WEEK  =  17
MONTH  =  5
DAY  =  4
TIME  =  5
PLAYER  =  890


The review of the object columns shows that one-hot encoding all of them would create more than 1,200 new columns.

For the moment, we will not look at the date fields of DATE, WEEK, MONTH, DAY, TIME.

We will instead concentrate on the POS, TEAM, OPP, and PLAYER columns.
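
To make the column count concrete before encoding, here is a minimal sketch on a made-up frame (the column values are hypothetical). It compares the number of dummy columns with and without `drop_first=True`, which removes one level per variable.

```python
import pandas as pd

# Toy data with two categorical columns and one numeric column (hypothetical values).
toy = pd.DataFrame({
    "POS": ["QB", "RB", "WR", "RB"],
    "TEAM": ["NE", "NE", "GB", "DAL"],
    "FAN_AVG": [18.2, 9.1, 11.4, 7.7],
})

cat_cols = ["POS", "TEAM"]
n_unique = toy[cat_cols].nunique()

# One dummy per level, versus one fewer per variable when drop_first=True.
full_dummies = int(n_unique.sum())
dropped_first = int((n_unique - 1).sum())

encoded = pd.get_dummies(toy, columns=cat_cols, drop_first=True)

print(full_dummies, dropped_first)
print(encoded.shape)
print(list(encoded.columns))
```

Here `get_dummies` is called on the whole frame with `columns=`, which replaces the original categorical columns in one step. It is an alternative to building each dummy frame separately and concatenating them.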

In [9]:
dummy_POS = pd.get_dummies(df.POS, prefix='POS', drop_first=True)
dummy_TEAM = pd.get_dummies(df.TEAM, prefix='TEAM', drop_first=True)
dummy_OPP = pd.get_dummies(df.OPP, prefix='OPP', drop_first=True)
dummy_PLAYER = pd.get_dummies(df.PLAYER, prefix='PLAYER', drop_first=True)

In [10]:
df = pd.concat([df, dummy_POS, dummy_TEAM, dummy_OPP, dummy_PLAYER], axis=1).drop(columns=['POS', 'TEAM', 'OPP', 'PLAYER'])
#df.head()

In [11]:
df.shape

(19529, 984)

## Date Variables

There are a couple of different steps to transforming the date variables mentioned previously.

First, we will remove the 'WEEK' column because it duplicates the 'WEEK_NUMBER' field, which is already numeric and ordered.

In [12]:
df = df.drop(columns=['WEEK'])

Next, we will turn the 'MONTH' and 'DAY' columns into numbers. 

While these fields are ordinal (in the sense that months and days occur in a fixed order), I don't believe treating them as ordered numbers would be correct for this analysis. Numbering the months would place January 11 units away from December, even though the two are adjacent in an NFL season. Instead, we want to see whether playing in a particular month or on a particular day impacts a player's performance. Therefore, we will again use dummy variables for each column to support our modelling attempts.

We will apply the same approach to the 'TIME' variable.
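
The ordinal problem described above can be illustrated with a small sketch. The month labels below are an assumption for illustration only; the real 'MONTH' column may use different spellings.

```python
import pandas as pd

# Hypothetical month labels covering one season, in playing order.
months = pd.Series(
    ["September", "October", "November", "December", "January"], name="MONTH"
)

# Ordinal encoding by calendar number.
calendar_number = {
    "January": 1, "September": 9, "October": 10, "November": 11, "December": 12,
}
ordinal = months.map(calendar_number)

# December and January are adjacent in the season but far apart numerically.
gap = abs(int(ordinal[3]) - int(ordinal[4]))

# Dummy encoding treats each month as its own category with no implied distance.
dummies = pd.get_dummies(months, prefix="MONTH", drop_first=True)

print(gap)
print(dummies.shape)
```

With dummy variables, a model can learn a separate effect for each month rather than a single slope across month numbers.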

In [13]:
dummy_MONTH = pd.get_dummies(df.MONTH, prefix='MONTH', drop_first=True)
dummy_DAY = pd.get_dummies(df.DAY, prefix='DAY', drop_first=True)
dummy_TIME = pd.get_dummies(df.TIME, prefix='TIME', drop_first=True)

In [14]:
df = pd.concat([df, dummy_MONTH, dummy_DAY, dummy_TIME], axis=1).drop(columns=['MONTH', 'DAY', 'TIME'])
#df.head()

Lastly, we have the 'DATE' column.

To convert this, we will turn each date into a UNIX timestamp (seconds since 1970-01-01). This keeps the number of columns unchanged, avoiding the 247 additional dummy columns that the 248 unique dates would otherwise require, and preserves the intervals between dates. The downside of this approach is that it removes some of the interpretability of the variable.

In [15]:
df['DATE'] = pd.to_datetime(df.DATE)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19529 entries, 0 to 19528
Columns: 991 entries, DATE to TIME_Noon
dtypes: datetime64[ns](1), float64(20), int64(5), uint8(965)
memory usage: 21.8 MB


In [16]:
df['DATE_STAMP'] = df['DATE'].values.astype(np.int64) // 10 ** 9
df.DATE_STAMP.value_counts()
df = df.drop(columns='DATE')
#df.head()
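
As a hedged alternative to the integer cast above, the sketch below computes seconds since the epoch by subtracting the epoch and dividing by a one-second `Timedelta`. The dates are made up for illustration. This form does not depend on the datetime column being stored at nanosecond resolution, which the `// 10 ** 9` division assumes.

```python
import pandas as pd

# Two hypothetical game dates, one week apart.
dates = pd.Series(pd.to_datetime(["2019-09-08", "2019-09-15"]))

# Seconds since the UNIX epoch, independent of the stored datetime resolution.
stamps = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Intervals are preserved: one week is 7 * 86400 seconds.
interval = int(stamps.iloc[1] - stamps.iloc[0])

# The conversion is reversible, which recovers some interpretability when needed.
roundtrip = pd.to_datetime(stamps, unit="s")

print(interval)
print(bool((roundtrip == dates).all()))
```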

In [17]:
df.shape

(19529, 991)
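
As a sanity check, the column counts reported above can be reproduced from the unique-value counts printed earlier. This is only arithmetic on the numbers already shown in this notebook, assuming `drop_first=True` removes one level per variable.

```python
# Column count after loading and dropping 'Unnamed: 0'.
start_cols = 34

# Unique values per encoded categorical column, as printed earlier.
categorical = {"POS": 4, "TEAM": 32, "OPP": 32, "PLAYER": 890}
date_parts = {"MONTH": 5, "DAY": 4, "TIME": 5}

def dummy_count(levels):
    # With drop_first=True each variable adds (number of levels - 1) columns.
    return sum(n - 1 for n in levels.values())

# Remove the 4 original columns and add their dummies.
after_categoricals = start_cols - len(categorical) + dummy_count(categorical)

# Drop 'WEEK', then swap MONTH, DAY and TIME for their dummies.
after_date_dummies = after_categoricals - 1 - len(date_parts) + dummy_count(date_parts)

# Add 'DATE_STAMP' and drop 'DATE': no net change.
final_cols = after_date_dummies + 1 - 1

print(after_categoricals, after_date_dummies, final_cols)
```

The three values match the shapes reported by `df.shape` and `df.info()` at each stage.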

Now that all the data is in numeric format, we will need to scale it so that features measured on very different ranges are comparable for modelling. However, prior to this, we need to split our data into training and test sets.

## Training and Test Data

First, I will import the train_test_split function from sklearn.

In [18]:
from sklearn.model_selection import train_test_split

Next, I will break out my data into independent and dependent variables.

In [19]:
X = df.drop('FAN_ACTUAL', axis=1)
y = df['FAN_ACTUAL']

Now, I will split my data into training and test data. Due to the high number of dimensions in the dataset, I am going to set my test size at 20%, lower than the default of 25%.

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
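
To illustrate what `test_size` and `random_state` do, here is a small sketch on synthetic arrays (not the real data). It checks the resulting split sizes and confirms that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y: 100 rows, 2 features.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Repeating the call with the same seed gives the same rows in the same order.
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_train, X_train_again))
```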

## Scaling

In [21]:
df_columns = list(df.columns)

In [22]:
'''
for x in df_columns:
  _ = plt.hist(df[x])
  _ = plt.title(x)
  plt.show()
'''


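Most columns are not normally distributed and sit on very different scales, so we will standardize the features (zero mean, unit variance) before modelling. The scaler is fit on the training data only and then applied to both the training and test sets.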

In [23]:
from sklearn.preprocessing import StandardScaler

In [24]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
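
The sketch below repeats the same fit-on-train, transform-both pattern on synthetic skewed data (an assumption for illustration, not the real features). It shows that the scaled training columns have mean 0 and standard deviation 1, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, right-skewed features standing in for the real data.
rng = np.random.default_rng(42)
X = rng.exponential(scale=5.0, size=(1000, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Fit on the training data only, so no test-set information leaks into the scaler.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)

print(train_means.round(3), train_stds.round(3))
print(X_test_scaled.mean(axis=0).round(3))
```

Note that standardizing shifts and rescales each column but does not change the shape of its distribution, so skewed features remain skewed after scaling.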

In [25]:
#pd.DataFrame(X_train_scaled, columns=list(X.columns))

In [26]:
#pd.DataFrame(X_test_scaled, columns=list(X.columns))