# Feature Analysis
Since our data contains a large number of features, we will use univariate feature selection techniques to reduce the dimensionality of the input and keep only the most informative features. This reduces the complexity of our model and lets it train on the most impactful features, which should lead to better overall results.

In [34]:
import pandas as pd
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

Load the data from the CSV files.

In [9]:
features = pd.read_csv("data/scaled_features.csv")
labels = pd.read_csv("data/labels.csv")

# combine into one dataframe
data = pd.concat([features, labels], axis=1)

data

Unnamed: 0,home_team,away_team,posteam,posteam_type,defteam,side_of_field,yardline_100,quarter_seconds_remaining,half_seconds_remaining,game_seconds_remaining,...,home_timeouts_remaining,away_timeouts_remaining,posteam_timeouts_remaining,defteam_timeouts_remaining,total_home_score,total_away_score,posteam_score,defteam_score,score_differential,play_type
0,0.000000,0.741935,0.741935,1.0,0.000000,0.46,0.755102,1.000000,1.000000,1.000000,...,1.00,1.0,1.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.50000,0
1,0.000000,0.741935,0.741935,1.0,0.000000,0.46,0.755102,0.970000,0.985000,0.992500,...,1.00,1.0,1.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.50000,1
2,0.000000,0.741935,0.741935,1.0,0.000000,0.46,0.755102,0.916667,0.958333,0.979167,...,1.00,1.0,1.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.50000,1
3,0.000000,0.741935,0.000000,0.0,0.741935,0.00,0.520408,0.836667,0.918333,0.959167,...,1.00,1.0,1.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.50000,0
4,0.000000,0.741935,0.000000,0.0,0.741935,0.46,0.469388,0.797778,0.898889,0.949444,...,1.00,1.0,1.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.50000,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31970,0.290323,1.000000,0.290323,0.0,1.000000,0.62,0.112245,0.046667,0.023333,0.011667,...,0.25,0.5,0.000000,0.333333,0.396226,0.440678,0.355932,0.440678,0.44898,1
31971,0.290323,1.000000,0.290323,0.0,1.000000,0.62,0.040816,0.024444,0.012222,0.006111,...,0.25,0.5,0.000000,0.333333,0.396226,0.440678,0.355932,0.440678,0.44898,1
31972,0.290323,1.000000,0.290323,0.0,1.000000,0.62,0.040816,0.016667,0.008333,0.004167,...,0.25,0.5,0.000000,0.333333,0.396226,0.440678,0.355932,0.440678,0.44898,1
31973,0.290323,1.000000,0.290323,0.0,1.000000,0.62,0.040816,0.013333,0.006667,0.003333,...,0.25,0.5,0.000000,0.333333,0.396226,0.440678,0.355932,0.440678,0.44898,1


## Chi-Squared Test
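The first technique is the Chi-Squared test, which tests whether two categorical variables are independent of each other. The larger the statistic (and the smaller the p-value), the stronger the evidence that a feature and the play type are related. In our case most features are categorical rather than numerical, and even yardline and time can be treated as categorical since they are discrete. Note that scikit-learn's chi2 function requires non-negative feature values, which our scaled features satisfy.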
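
To make the statistic less of a black box, here is a minimal sketch of how a Chi-Squared score can be computed for a single feature. It follows the idea behind scikit-learn's `chi2`, which treats each non-negative feature value as a count and compares the per-class totals with what would be expected if the feature were independent of the label. The function name `chi2_score` and the toy data are hypothetical and are not part of the notebook.

```python
def chi2_score(feature, target):
    """Chi-squared statistic for one non-negative feature against class labels."""
    n = len(target)
    total = sum(feature)  # feature values are treated as counts
    stat = 0.0
    for cls in sorted(set(target)):
        # Observed: total feature "count" that falls in this class
        observed = sum(x for x, y in zip(feature, target) if y == cls)
        # Expected under independence: class frequency times overall total
        class_freq = sum(1 for y in target if y == cls) / n
        expected = class_freq * total
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat


# Toy stand-in for a binary flag such as no_huddle (made-up values)
target = [1, 1, 1, 1, 0, 0, 0, 0]
related_flag = [1, 1, 1, 0, 0, 0, 0, 0]    # only occurs with class 1
unrelated_flag = [1, 0, 1, 0, 1, 0, 1, 0]  # evenly split across classes

related_stat = chi2_score(related_flag, target)
unrelated_stat = chi2_score(unrelated_flag, target)
print(related_stat, unrelated_stat)
```

The flag that only appears with one class gets a positive score, while the evenly split flag scores zero. Because values are treated as counts, the magnitude of the statistic depends on how the features are scaled.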

In [25]:
chi2_stats, p_values = chi2(features, labels)

chi2_stats_df = pd.DataFrame(chi2_stats, columns=["Chi2 Stat"])
p_values_df = pd.DataFrame(p_values, columns=["p Value"])

chi2_results = pd.concat([chi2_stats_df, p_values_df], axis=1)
chi2_results = chi2_results.set_index(features.columns)
chi2_results.sort_values(by="p Value")

Unnamed: 0,Chi2 Stat,p Value
down,461.009183,2.899187e-102
no_huddle,143.820461,3.889058e-33
goal_to_go,63.092909,1.971827e-15
defteam_score,59.277307,1.369467e-14
half_seconds_remaining,53.596484,2.461996e-13
quarter_seconds_remaining,45.670235,1.39933e-11
posteam_timeouts_remaining,22.323441,2.303772e-06
score_differential,20.052412,7.534844e-06
ydstogo,18.505738,1.693936e-05
posteam_score,16.682742,4.41811e-05


This offers us some very interesting results! The teams in the game are hardly relevant to whether the play will be a pass or a run. Interestingly, no_huddle was the second most important feature according to the Chi-Squared test.

## ANOVA F-Test
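Another univariate feature analysis technique is the ANOVA F-test. For each feature, it compares how much the feature's mean differs between the classes (pass and run) with how much the feature varies within each class. A large F statistic means the class means are far apart relative to the spread inside each class. Unlike the Chi-Squared test, the F-test treats the features as numerical values.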
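
As an illustration of what the F statistic measures, the sketch below computes the one-way ANOVA F value for a single feature by hand: the between-class sum of squares divided by its degrees of freedom, over the within-class sum of squares divided by its degrees of freedom. The function name `anova_f` and the toy values are hypothetical.

```python
def anova_f(feature, target):
    """One-way ANOVA F statistic for one numerical feature against class labels."""
    # Group the feature values by class label
    groups = {}
    for x, y in zip(feature, target):
        groups.setdefault(y, []).append(x)

    n = len(feature)
    k = len(groups)
    grand_mean = sum(feature) / n

    # Variation of the class means around the grand mean
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of the values around their own class mean
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups.values() for x in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy stand-in for the down feature (made-up values): class 1 skews to later downs
down = [1, 1, 2, 1, 3, 3, 2, 3]
play_type = [0, 0, 0, 0, 1, 1, 1, 1]

f_value = anova_f(down, play_type)
print(f_value)  # 18.0
```

Here the class means (1.25 and 2.75) are far apart compared with the spread inside each class, so the F value is large.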

In [28]:
# f_classif expects a 1-D label array, so flatten the single-column DataFrame
f_stats, p_values = f_classif(features, labels.values.ravel())

f_stats_df = pd.DataFrame(f_stats, columns=["F Stat"])
p_values_df = pd.DataFrame(p_values, columns=["p Value"])

f_results = pd.concat([f_stats_df, p_values_df], axis=1)
f_results = f_results.set_index(features.columns)
f_results.sort_values(by="p Value")


Unnamed: 0,F Stat,p Value
down,1705.24616,0.0
score_differential,817.636685,1.365436e-177
defteam_score,411.548331,6.311517e-91
ydstogo,348.550807,2.262807e-77
posteam_timeouts_remaining,305.547246,4.229939e-68
half_seconds_remaining,279.42154,1.849846e-62
quarter_seconds_remaining,244.535294,6.449532e-55
away_timeouts_remaining,168.357511,2.113315e-38
no_huddle,156.982911,6.271279e-36
home_timeouts_remaining,151.069952,1.211645e-34


Most of the features these two techniques found useful are the same. Both rank down as the most important feature for determining whether the play will be a pass or a run. This is most likely because, as we saw during our data analysis, plays on 3rd down are almost exclusively passes. Interestingly, while the teams tend not to matter, the number of timeouts remaining for both teams seems significant.
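
The overlap between the two rankings can be checked directly. The sketch below uses the top-10 feature names copied from the two result tables above; the variable names are only for illustration.

```python
# Top-10 features by p-value, copied from the Chi-Squared and F-test tables
chi2_top10 = {
    "down", "no_huddle", "goal_to_go", "defteam_score",
    "half_seconds_remaining", "quarter_seconds_remaining",
    "posteam_timeouts_remaining", "score_differential",
    "ydstogo", "posteam_score",
}
f_top10 = {
    "down", "score_differential", "defteam_score", "ydstogo",
    "posteam_timeouts_remaining", "half_seconds_remaining",
    "quarter_seconds_remaining", "away_timeouts_remaining",
    "no_huddle", "home_timeouts_remaining",
}

shared = chi2_top10 & f_top10      # features both tests rank in their top 10
only_chi2 = chi2_top10 - f_top10   # features only the Chi-Squared test picks
only_f = f_top10 - chi2_top10      # features only the F-test picks

print(len(shared), sorted(only_chi2), sorted(only_f))
```

Eight of the ten features are shared. The Chi-Squared test additionally picks goal_to_go and posteam_score, while the F-test picks the home and away timeouts instead.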

## Mutual Information Test
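The final univariate test we will explore is the mutual information test. Mutual information measures how much knowing a feature reduces the uncertainty (entropy) of the label. A value of 0 means the feature and the label are independent, and larger values mean a stronger dependency.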
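
For two discrete variables, mutual information can be computed straight from the definition, I(X; Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) * p(y))). The sketch below does exactly that in nats. It is only meant to illustrate the quantity: scikit-learn's `mutual_info_classif` estimates it with a nearest-neighbor method for continuous features, so its numbers will not match this plug-in formula exactly. The function name `mutual_information` and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(feature, target):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, target))  # counts of each (x, y) pair
    x_counts = Counter(feature)
    y_counts = Counter(target)

    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) rewritten in terms of raw counts
        ratio = count * n / (x_counts[x] * y_counts[y])
        mi += p_xy * log(ratio)
    return mi


target = [0, 0, 1, 1]
mi_informative = mutual_information([0, 0, 1, 1], target)  # feature mirrors the label
mi_independent = mutual_information([0, 1, 0, 1], target)  # feature tells us nothing
print(mi_informative, mi_independent)
```

A feature that fully determines a balanced binary label reaches log(2), about 0.693 nats, which is the entropy of the label itself. An independent feature scores 0.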

In [37]:
# mutual_info_classif expects a 1-D label array, so flatten the single-column DataFrame
mutual_stats = mutual_info_classif(features, labels.values.ravel())

mutual_stats_df = pd.DataFrame(mutual_stats, columns=["Mutual Info"])
mutual_stats_df = mutual_stats_df.set_index(features.columns)

mutual_stats_df.sort_values(by="Mutual Info", ascending=False)


Unnamed: 0,Mutual Info
down,0.03084
ydstogo,0.028896
score_differential,0.018766
posteam_timeouts_remaining,0.015443
game_seconds_remaining,0.012641
defteam_score,0.010656
half_seconds_remaining,0.010133
away_timeouts_remaining,0.007225
posteam,0.006397
yardline_100,0.005703


Like the F-test, this technique ranks score_differential near the top, whereas the Chi-Squared test ranked it only eighth. The rest of the results tend to be consistent with the other techniques. We must now select the k best features that we believe will help our model. Hopefully, by reducing the complexity, the model will have an easier time training and give better results.
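
One way to carry out this selection is scikit-learn's `SelectKBest`, which keeps the k features with the highest scores under a chosen scoring function. The sketch below is only an illustration on synthetic data: the column names, the choice of `f_classif` as the scoring function, and `k=1` are all assumptions. In this notebook the inputs would instead be `features` and `labels.values.ravel()`, with k still to be decided.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: one column tracks the label, two are pure noise
rng = np.random.default_rng(0)
n_samples = 200
y = rng.integers(0, 2, size=n_samples)
X = pd.DataFrame({
    "informative": y + rng.normal(0, 0.3, size=n_samples),
    "noise_a": rng.normal(size=n_samples),
    "noise_b": rng.normal(size=n_samples),
})

# Keep the k highest-scoring features according to the ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(X, y)

# get_support() returns a boolean mask over the input columns
selected = list(X.columns[selector.get_support()])
print(selected)
```

The same pattern works with `chi2` or `mutual_info_classif` as the scoring function, so the three rankings above can each be turned into a reduced feature set and compared on the downstream model.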