# 39AA Project Part 2
[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/glaframb71/CS39AA-Project/blob/main/39aa-project.ipynb)


<font size=3>Welcome to Part 2 of my 39AA Project! In this notebook, we will be taking the dataset that was explored in Part 1 and trying to complete our task of predicting an athlete's football position based on their height, weight, and college. Let's get started by importing a few packages we will use in this notebook.</font>

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE

<font size='3'>Now let's import the players data from the project GitHub.</font>

In [2]:
playersDataUrl = "https://raw.githubusercontent.com/glaframb71/CS39AA-Project/main/data/players.csv"
playersData = pd.read_csv(playersDataUrl)

playersData.head()

| | nflId | height | weight | birthDate | collegeName | position | displayName |
|---|---|---|---|---|---|---|---|
| 0 | 25511 | 6-4 | 225 | 1977-08-03 | Michigan | QB | Tom Brady |
| 1 | 29550 | 6-4 | 328 | 1982-01-22 | Arkansas | T | Jason Peters |
| 2 | 29851 | 6-2 | 225 | 1983-12-02 | California | QB | Aaron Rodgers |
| 3 | 30842 | 6-6 | 267 | 1984-05-19 | UCLA | TE | Marcedes Lewis |
| 4 | 33084 | 6-4 | 217 | 1985-05-17 | Boston College | QB | Matt Ryan |


<font size='3'>Great! The only thing missing from the DataFrame is a 'height_inches' column. The 'height' column currently stores player heights as feet-inches strings (for example, '6-4'), so let's convert it into a column that represents player heights in inches as ints.</font>

In [3]:
# First we will add in the height_inches column to playersData and slot it next to the height column.
playersData = playersData.copy()
playersData.loc[:, 'height_inches'] = playersData['height'].apply(lambda x: int(x.split('-')[0]) * 12 + int(x.split('-')[1]))
height_index = playersData.columns.get_loc('height')
playersData.insert(height_index + 1, "height_inches", playersData.pop("height_inches"))

playersData.head()

| | nflId | height | height_inches | weight | birthDate | collegeName | position | displayName |
|---|---|---|---|---|---|---|---|---|
| 0 | 25511 | 6-4 | 76 | 225 | 1977-08-03 | Michigan | QB | Tom Brady |
| 1 | 29550 | 6-4 | 76 | 328 | 1982-01-22 | Arkansas | T | Jason Peters |
| 2 | 29851 | 6-2 | 74 | 225 | 1983-12-02 | California | QB | Aaron Rodgers |
| 3 | 30842 | 6-6 | 78 | 267 | 1984-05-19 | UCLA | TE | Marcedes Lewis |
| 4 | 33084 | 6-4 | 76 | 217 | 1985-05-17 | Boston College | QB | Matt Ryan |
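
<font size='3'>As an aside, the same feet-inches conversion can also be written without a lambda. The snippet below is a minimal sketch on a small made-up Series (the values are illustrative, not taken from the dataset beyond the '6-4' style shown above), using pandas string methods to split each height once.</font>

```python
import pandas as pd

# Toy heights in the same 'feet-inches' format as the 'height' column (illustrative values).
heights = pd.Series(["6-4", "6-2", "5-11"], name="height")

# Split each string once into two columns (feet, inches) and cast both to int.
parts = heights.str.split("-", expand=True).astype(int)

# Total inches = feet * 12 + inches.
height_inches = parts[0] * 12 + parts[1]

print(height_inches.tolist())  # [76, 74, 71]
```

<font size='3'>Either approach should give the same result; the vectorized form simply avoids calling `split` twice per row.</font>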


<font size='3'>Now that the 'height_inches' column has been made, let's clean up the DataFrame to hold only the columns that are relevant to the task we are looking to accomplish. As a refresher, these columns will be 'height_inches', 'weight', 'collegeName', and 'position'.</font>

In [4]:
players = playersData[['height_inches', 'weight', 'collegeName', 'position']]

players.head()

| | height_inches | weight | collegeName | position |
|---|---|---|---|---|
| 0 | 76 | 225 | Michigan | QB |
| 1 | 76 | 328 | Arkansas | T |
| 2 | 74 | 225 | California | QB |
| 3 | 78 | 267 | UCLA | TE |
| 4 | 76 | 217 | Boston College | QB |


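<font size='3'>Next, let's drop the 'DB' and 'LS' positions from the dataset. Together they account for only two rows, as the shapes printed below show.</font>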

In [5]:
print('Players shape before filter:')
print(players.shape)
positions_to_drop = ['DB', 'LS']
filtered_players = players[~players['position'].isin(positions_to_drop)]
print('Players shape after filter:')
print(filtered_players.shape)

Players shape before filter:
(1683, 4)
Players shape after filter:
(1681, 4)
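
<font size='3'>The notebook does not say how 'DB' and 'LS' were picked. One plausible way to find such rare classes is to count the rows per position and flag anything below a threshold. The sketch below uses a made-up DataFrame and a hypothetical `min_count` threshold, so treat it as an illustration rather than the method used here.</font>

```python
import pandas as pd

# Toy stand-in for the players DataFrame (illustrative values only).
toy = pd.DataFrame({"position": ["QB", "QB", "WR", "WR", "WR", "LS"]})

# Count how many rows each position has.
counts = toy["position"].value_counts()

# Hypothetical threshold: positions with fewer rows than this are treated as too rare.
min_count = 2
rare_positions = counts[counts < min_count].index.tolist()

# Keep only the rows whose position is not in the rare list.
filtered = toy[~toy["position"].isin(rare_positions)]

print(rare_positions)
print(filtered.shape)
```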


<font size='3'>We now have our DataFrame holding the relevant information that is needed. Before splitting into training and validation sets with the method **train_test_split** imported from sklearn.model_selection, let's first separate the features (X) from the target (y).</font>

In [6]:
X = filtered_players.drop('position', axis=1)
y = filtered_players['position']
X.head()

| | height_inches | weight | collegeName |
|---|---|---|---|
| 0 | 76 | 225 | Michigan |
| 1 | 76 | 328 | Arkansas |
| 2 | 74 | 225 | California |
| 3 | 78 | 267 | UCLA |
| 4 | 76 | 217 | Boston College |


In [7]:
# One-hot encode 'collegeName' so that every feature in X is numeric.
X = pd.get_dummies(X, columns=['collegeName'])
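
<font size='3'>To make the effect of one-hot encoding concrete, here is a small sketch on made-up rows. Each distinct college becomes its own indicator column named `collegeName_<college>`, while the numeric columns pass through unchanged.</font>

```python
import pandas as pd

# Toy feature table with the same column names as X (illustrative values only).
toy_X = pd.DataFrame({
    "height_inches": [76, 74, 78],
    "weight": [225, 225, 267],
    "collegeName": ["Michigan", "California", "Michigan"],
})

# Replace 'collegeName' with one indicator column per distinct college.
encoded = pd.get_dummies(toy_X, columns=["collegeName"])

print(encoded.columns.tolist())
```

<font size='3'>With many distinct colleges, this produces a wide and mostly empty feature table, which is worth keeping in mind when judging model performance.</font>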

In [8]:
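# Oversample only the rarest position with SMOTE. k_neighbors=1 keeps the neighbor
# requirement minimal, so it still works when that class has very few samples.
# No random_state is set, so the synthetic samples may differ between runs.
smote = SMOTE(sampling_strategy='minority', k_neighbors=1)
X_sm, y_sm = smote.fit_resample(X, y)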

In [9]:
X_train_raw, X_val_raw, y_train, y_val = train_test_split(X_sm, y_sm, test_size=0.20, random_state=42)

<font size='3'>Sweet! We have split our dataset into training and validation sets. The 'collegeName' column was already converted into a numerical format with one-hot encoding before the split, so the data is ready for a machine learning algorithm. Let's start with a Random Forest classifier and measure its accuracy on the validation set.</font>

In [10]:
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize the Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=42)

# Fit the model using the training data
rf_clf.fit(X_train_raw, y_train)

# Use the trained model to make predictions on the validation data
y_pred = rf_clf.predict(X_val_raw)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_val, y_pred)

# Print the accuracy
print("Validation Accuracy: ", accuracy)


Validation Accuracy:  0.2328042328042328
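
<font size='3'>An accuracy figure like this is easier to interpret next to a trivial baseline. The sketch below, on made-up labels, uses scikit-learn's `DummyClassifier` to always predict the most frequent class. It is only an illustration of the idea; the baseline for the real data was not computed in this notebook.</font>

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels: 6 'WR', 3 'QB', 1 'K' (illustrative values only).
y_toy = np.array(["WR"] * 6 + ["QB"] * 3 + ["K"])

# The 'most_frequent' strategy ignores the features, so a placeholder column is enough.
X_toy = np.zeros((len(y_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
print("Baseline Accuracy: ", baseline_accuracy)  # 0.6
```

<font size='3'>A model is only adding value if it clearly beats this kind of baseline on the same validation data.</font>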

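<font size='3'>Before running an exhaustive grid search like the one in the next cell, it can help to estimate its cost. The sketch below counts the candidate combinations for the same parameter grid using scikit-learn's `ParameterGrid` and multiplies by the number of cross-validation folds. The variable names are chosen for illustration.</font>

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the grid search cell.
param_grid = {
    "n_estimators": [100, 200, 300, 400],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Every combination of values is one candidate: 4 * 4 * 3 * 3.
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold.
cv_folds = 5
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)  # 144 720
```

<font size='3'>These numbers match the "144 candidates, totalling 720 fits" line that GridSearchCV prints when it starts.</font>
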

In [11]:
from sklearn.model_selection import GridSearchCV
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the Random Forest Classifier
rf_clf = RandomForestClassifier(random_state=42)

# Initialize the GridSearchCV object
grid_search = GridSearchCV(estimator=rf_clf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit the GridSearchCV object to the data
grid_search.fit(X_train_raw, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best parameters: ", best_params)

# Get the best estimator
best_rf_clf = grid_search.best_estimator_

# Use the best estimator to make predictions on the validation data
y_pred = best_rf_clf.predict(X_val_raw)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_val, y_pred)

# Print the accuracy
print("Validation Accuracy: ", accuracy)

Fitting 5 folds for each of 144 candidates, totalling 720 fits
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   1.1s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   2.1s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   1.9s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time=   2.7s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=400; total time=   3.6s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   0.8s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   0.8s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time=   1.7s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time=   1.7s
[CV] END m