# 39AA Project Part 2
[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/glaframb71/CS39AA-Project/blob/main/39aa-project.ipynb)


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

Let's import our players data.

In [2]:
playersDataUrl = "https://raw.githubusercontent.com/glaframb71/CS39AA-Project/main/data/players.csv"
playersData = pd.read_csv(playersDataUrl)

playersData.head(10)

|   | nflId | height | weight | birthDate | collegeName | position | displayName |
|---|---|---|---|---|---|---|---|
| 0 | 25511 | 6-4 | 225 | 1977-08-03 | Michigan | QB | Tom Brady |
| 1 | 29550 | 6-4 | 328 | 1982-01-22 | Arkansas | T | Jason Peters |
| 2 | 29851 | 6-2 | 225 | 1983-12-02 | California | QB | Aaron Rodgers |
| 3 | 30842 | 6-6 | 267 | 1984-05-19 | UCLA | TE | Marcedes Lewis |
| 4 | 33084 | 6-4 | 217 | 1985-05-17 | Boston College | QB | Matt Ryan |
| 5 | 33099 | 6-6 | 245 | 1985-01-16 | Delaware | QB | Joe Flacco |
| 6 | 33107 | 6-4 | 315 | 1985-08-30 | Virginia Tech | T | Duane Brown |
| 7 | 33130 | 5-10 | 175 | 1986-12-01 | California | WR | DeSean Jackson |
| 8 | 33131 | 6-8 | 300 | 1986-09-01 | Miami | DE | Calais Campbell |
| 9 | 33138 | 6-3 | 222 | 1985-07-02 | Michigan | QB | Chad Henne |


Great! The only thing missing from the DataFrame is a 'height_inches' column. The 'height' column currently stores each player's height as a feet-inches string (for example, '6-4'), so we will convert it to total inches as an int.

In [3]:
# First we will add in the height_inches column to playersData and slot it next to the height column.
playersData = playersData.copy()
playersData.loc[:, 'height_inches'] = playersData['height'].apply(lambda x: int(x.split('-')[0]) * 12 + int(x.split('-')[1]))
height_index = playersData.columns.get_loc('height')
playersData.insert(height_index + 1, "height_inches", playersData.pop("height_inches"))

playersData.head()

|   | nflId | height | height_inches | weight | birthDate | collegeName | position | displayName |
|---|---|---|---|---|---|---|---|---|
| 0 | 25511 | 6-4 | 76 | 225 | 1977-08-03 | Michigan | QB | Tom Brady |
| 1 | 29550 | 6-4 | 76 | 328 | 1982-01-22 | Arkansas | T | Jason Peters |
| 2 | 29851 | 6-2 | 74 | 225 | 1983-12-02 | California | QB | Aaron Rodgers |
| 3 | 30842 | 6-6 | 78 | 267 | 1984-05-19 | UCLA | TE | Marcedes Lewis |
| 4 | 33084 | 6-4 | 76 | 217 | 1985-05-17 | Boston College | QB | Matt Ryan |
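
The conversion is easier to check in isolation when it is pulled out of the lambda. The following is a minimal sketch, not the notebook's own code: `height_to_inches` is a hypothetical helper name, and it assumes every value is a well-formed 'feet-inches' string like the ones shown in the table.

```python
def height_to_inches(height: str) -> int:
    """Convert a 'feet-inches' string such as '6-4' to total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)


# Sample heights taken from the rows shown in the notebook output.
print(height_to_inches("6-4"))   # 76
print(height_to_inches("5-10"))  # 70
```

With this helper, the column could be built as `playersData['height'].map(height_to_inches)`, which avoids splitting each string twice.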


Now that the 'height_inches' column exists, let's trim the DataFrame down to the columns relevant to our task: the features (height, weight, college) and the position we want to predict.

In [4]:
players = playersData[['height_inches', 'weight', 'collegeName', 'position']]

players.head()

|   | height_inches | weight | collegeName | position |
|---|---|---|---|---|
| 0 | 76 | 225 | Michigan | QB |
| 1 | 76 | 328 | Arkansas | T |
| 2 | 74 | 225 | California | QB |
| 3 | 78 | 267 | UCLA | TE |
| 4 | 76 | 217 | Boston College | QB |


Now that our data is imported and cleaned up, let's split it into a raw training set (80% of the rows) and a validation set (the remaining 20%).

In [5]:
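# Features: height, weight, and college. Target: position.
X = players[['height_inches', 'weight', 'collegeName']].copy()
y = players['position'].copy()

# Hold out 20% of the rows for validation; random_state makes the split reproducible.
X_train_raw, X_val_raw, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=42)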
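
As a quick sanity check on the split sizes: with a fractional `test_size`, scikit-learn rounds the validation count up to a whole number of rows and gives the rest to training. The sketch below only redoes that arithmetic. It assumes a total of 1,683 rows, which is inferred from the shapes printed later in the notebook (1346 + 337), not read from the data.

```python
import math

n_rows = 1346 + 337               # total rows, inferred from the shapes printed later
n_val = math.ceil(0.20 * n_rows)  # validation rows: the 20% fraction is rounded up
n_train = n_rows - n_val          # the remaining rows go to training

print(n_train, n_val)  # 1346 337
```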

Next, we one-hot encode the 'collegeName' column of the training and validation feature sets.

In [6]:
# Concatenate the training and validation sets (training rows first)
# so that both end up with the same one-hot columns
combined_data = pd.concat([X_train_raw, X_val_raw])

# Perform one-hot encoding on the 'collegeName' column
combined_data_encoded = pd.get_dummies(combined_data, columns=['collegeName'])

# Split the combined data back into training and validation sets
X_train_encoded = combined_data_encoded[:len(X_train_raw)]
X_val_encoded = combined_data_encoded[len(X_train_raw):]

print("X_train_encoded Shape:")
print(X_train_encoded.shape)

print("\nX_val_encoded Shape:")
X_val_encoded.shape


X_train_encoded Shape:
(1346, 228)

X_val_encoded Shape:
(337, 228)
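
Concatenating before encoding guarantees matching columns, but it also lets the validation rows decide which dummy columns exist. One alternative, sketched here on made-up rows rather than the real data, is to learn the columns from the training set alone and then align the validation set to them with `reindex`. The `train` and `val` frames below are hypothetical stand-ins for `X_train_raw` and `X_val_raw`.

```python
import pandas as pd

# Toy stand-ins for X_train_raw / X_val_raw (values are illustrative only).
train = pd.DataFrame(
    {"weight": [225, 328, 267], "collegeName": ["Michigan", "Arkansas", "UCLA"]}
)
val = pd.DataFrame(
    {"weight": [245, 222], "collegeName": ["Delaware", "Michigan"]}
)

# Learn the dummy columns from the training rows only.
train_encoded = pd.get_dummies(train, columns=["collegeName"])

# Encode the validation rows, then align them to the training columns:
# colleges unseen in training are dropped, and training colleges that
# do not occur in validation are added as all-zero columns.
val_encoded = pd.get_dummies(val, columns=["collegeName"]).reindex(
    columns=train_encoded.columns, fill_value=0
)

print(list(val_encoded.columns))
print(val_encoded.shape)
```

The trade-off is that a validation player from an unseen college gets zeros in every college column, which is also what a model would face on genuinely new data.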

In [7]:
# Concatenate the training and validation targets (training rows first)
# so that both end up with the same one-hot columns
combined_target = pd.concat([y_train, y_val])

# Perform one-hot encoding on the position labels
# (combined_target is a Series, so no `columns` argument is needed)
combined_target_encoded = pd.get_dummies(combined_target)

# Split the combined target back into training and validation sets
y_train_encoded = combined_target_encoded[:len(y_train)]
y_val_encoded = combined_target_encoded[len(y_train):]

# Now y_train_encoded and y_val_encoded contain the one-hot encoded 'position' column

print("y_train_encoded Shape:")
print(y_train_encoded.shape)

print("y_val_encoded Shape:")
print(y_val_encoded.shape)

y_train_encoded Shape:
(1346, 19)
y_val_encoded Shape:
(337, 19)
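
One-hot targets eventually have to be mapped back to position labels, for example when reading a model's predictions. A minimal sketch of that round trip, assuming a toy label Series in place of `y_train` (the values are illustrative only):

```python
import pandas as pd

# Toy stand-in for y_train (illustrative values only).
positions = pd.Series(["QB", "T", "QB", "TE", "WR"], name="position")

# One column per distinct position, one row per player.
encoded = pd.get_dummies(positions)

# Recover each row's label as the name of the column holding its 1.
# Casting to int keeps idxmax independent of the dummy dtype.
decoded = encoded.astype(int).idxmax(axis=1)

print(encoded.shape)
print(decoded.tolist())
```

The same `idxmax(axis=1)` step applies to a frame of predicted class scores, as long as its columns are in the same order as the encoded target.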
