## DX799S O1 Data Science Capstone (Summer 1 2025): ACTIVITY 6.3 ##

Each week, you will apply the concepts of that week to your Integrated Capstone Project’s dataset. In preparation for Milestone One, create a Jupyter Notebook (similar to in Module B, Semester Two) that illustrates these lessons. There are no specific questions to answer in your Jupyter Notebook files in this course; your general goal is to analyze your data using the methods you have learned about in this course and in this program and draw interesting conclusions. 

For Week 6, include concepts such as decision trees and random forests. Complete your Jupyter Notebook homework by 11:59pm ET on Sunday. 

In Week 7, you will compile your findings from your Jupyter Notebook homework into your Milestone One assignment for grading. For full instructions and the rubric for Milestone One, refer to the following link. 

The following dataset, "Video Review", was compiled from reviewable video evidence and describes the events that led to concussions on punt plays during the NFL 2016-2017 season. The target, Primary_Impact_Type, records whether the concussion resulted from a Helmet-to-Helmet, Helmet-to-Body, or Helmet-to-Ground impact.

In [76]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    GridSearchCV,
    RandomizedSearchCV,
    RepeatedStratifiedKFold  # stratified splitter used for repeated_cv below
)
from sklearn.preprocessing import LabelEncoder, StandardScaler
# The targets are categorical, so the classifier (not the regressor) is needed
from sklearn.tree import DecisionTreeClassifier

In [77]:
repeated_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
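
As a quick check of what this splitter produces, the sketch below uses a small synthetic label vector (an assumption for illustration, not the capstone data). It shows that 5 splits repeated 5 times give 25 train/validation partitions, and that each validation fold keeps the class proportions of the full label vector. The names `y_toy`, `X_toy`, and `demo_cv` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical toy labels: 3 classes with 10 samples each (not the capstone data).
y_toy = np.repeat([0, 1, 2], 10)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect how the splitter partitions rows

demo_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)

# Total number of train/validation partitions: n_splits * n_repeats.
n_partitions = demo_cv.get_n_splits(X_toy, y_toy)

# Per-class counts inside every validation fold.
fold_class_counts = [
    np.bincount(y_toy[val_idx], minlength=3).tolist()
    for _, val_idx in demo_cv.split(X_toy, y_toy)
]

print(n_partitions)
print(fold_class_counts[0])
```

With very small datasets, each validation fold holds only a handful of rows, which is one reason fold-to-fold accuracy can vary widely.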

In [78]:
df_videoreview = pd.read_csv("video_review.csv")

label_encoder = LabelEncoder()
for col in df_videoreview.select_dtypes(include=['object']).columns:
    df_videoreview[col] = label_encoder.fit_transform(df_videoreview[col].astype(str))

target_column = 'Primary_Impact_Type'
X = df_videoreview.drop(columns=[target_column])
y = df_videoreview[target_column]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

dt_classifier = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42
)

cv_scores = cross_val_score(dt_classifier, X_train_scaled, y_train, cv=repeated_cv, scoring='accuracy')
mean_accuracy = np.mean(cv_scores)
std_accuracy = np.std(cv_scores)

dt_classifier.fit(X_train_scaled, y_train)
train_acc = dt_classifier.score(X_train_scaled, y_train)

print(f"Mean CV Accuracy (Video Review): {mean_accuracy:.2%}")
print(f"Std. Dev. CV Accuracy Video Review: {std_accuracy:.2%}")
print(f"Train Accuracy Video Review: {train_acc:.2%}")


Mean CV Accuracy (Video Review): 42.93%
Std. Dev. CV Accuracy Video Review: 16.69%
Train Accuracy Video Review: 72.41%
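
To put a cross-validated accuracy like the one above in context, it can help to compare it with a majority-class baseline evaluated on the same folds. The sketch below does this on synthetic three-class data from `make_classification`, which is an assumption standing in for the real dataset; `X_demo`, `y_demo`, `baseline`, and `tree` are hypothetical names, and the numbers it prints say nothing about the Video Review results.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in (an assumption; not the Video Review data).
X_demo, y_demo = make_classification(
    n_samples=150, n_features=8, n_informative=4,
    n_classes=3, random_state=42,
)

demo_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent")

# Same hyperparameters as the decision tree in the notebook.
tree = DecisionTreeClassifier(
    max_depth=5, min_samples_split=10, min_samples_leaf=5, random_state=42
)

baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=demo_cv, scoring="accuracy")
tree_scores = cross_val_score(tree, X_demo, y_demo, cv=demo_cv, scoring="accuracy")

print(f"Baseline CV accuracy: {baseline_scores.mean():.2%}")
print(f"Decision tree CV accuracy: {tree_scores.mean():.2%}")
```

A model whose cross-validated accuracy is close to the baseline is learning little beyond the class frequencies, whatever its training accuracy.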




In [79]:
#Video Review Dataset with Random Forest 

rf_classifier = RandomForestClassifier(
    n_estimators=100,         
    max_depth=5,              
    min_samples_split=10,     
    min_samples_leaf=5,       
    random_state=42,
    class_weight='balanced'   
)

rf_cv_scores = cross_val_score(rf_classifier, X_train_scaled, y_train, cv=repeated_cv, scoring='accuracy')
rf_mean_accuracy = np.mean(rf_cv_scores)
rf_std_accuracy = np.std(rf_cv_scores)

rf_classifier.fit(X_train_scaled, y_train)
rf_train_acc = rf_classifier.score(X_train_scaled, y_train)

print(f"Random Forest CV Accuracy Video Review: {rf_mean_accuracy:.2%}")
print(f"Random Forest CV Std. Dev. Video Review: {rf_std_accuracy:.2%}")
print(f"Random Forest Train Accuracy Video Review: {rf_train_acc:.2%}")






Random Forest CV Accuracy Video Review: 45.47%
Random Forest CV Std. Dev. Video Review: 16.94%
Random Forest Train Accuracy Video Review: 72.41%


The next dataset, Injury Record, is used to examine the relationship between playing surface and the injuries and performance of NFL athletes. It covers 105 lower-limb injuries that occurred during regular-season NFL games over two seasons, and records the surface each game was played on and the number of days the player missed due to the injury (an indication of its severity). The target here is Surface, the type of field surface (synthetic or natural) on which the injury occurred.

In [80]:
df_injuryrecord = pd.read_csv("InjuryRecord.csv")

In [81]:
#Injury Record Dataset with Decision Tree 

label_encoder = LabelEncoder()

for col in df_injuryrecord.select_dtypes(include=['object']).columns:
    df_injuryrecord[col] = label_encoder.fit_transform(df_injuryrecord[col].astype(str))

target_column = 'Surface'  
X_injury = df_injuryrecord.drop(columns=[target_column])
y_injury = df_injuryrecord[target_column]

X_train, X_test, y_train, y_test = train_test_split(X_injury, y_injury, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

dt_classifier_injury = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42
)

cv_scores_injury = cross_val_score(dt_classifier_injury, X_train_scaled, y_train, cv=repeated_cv, scoring='accuracy')
mean_accuracy_injury = np.mean(cv_scores_injury)
std_accuracy_injury = np.std(cv_scores_injury)

dt_classifier_injury.fit(X_train_scaled, y_train)
train_acc_injury = dt_classifier_injury.score(X_train_scaled, y_train)

print(f"Mean CV Accuracy (Injury Record): {mean_accuracy_injury:.2%}")
print(f"Std. Dev. CV Accuracy Injury Record: {std_accuracy_injury:.2%}")
print(f"Train Accuracy Injury Record: {train_acc_injury:.2%}")


Mean CV Accuracy (Injury Record): 42.40%
Std. Dev. CV Accuracy Injury Record: 8.94%
Train Accuracy Injury Record: 67.86%
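
The cells above fit `StandardScaler` on the whole training split before cross-validation, so each validation fold has already influenced the scaling statistics. An alternative, sketched here on synthetic data (an assumption, not the Injury Record data), wraps the scaler and the tree in a pipeline so the scaler is refit inside every fold. Tree splits depend on the ordering of feature values rather than their scale, so this is mainly a matter of evaluation hygiene and is unlikely to move the accuracy. `X_demo`, `y_demo`, and `pipe` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary stand-in (an assumption; not the Injury Record data).
X_demo, y_demo = make_classification(
    n_samples=120, n_features=6, n_informative=3, random_state=0
)

demo_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)

# The scaler is fit only on the training portion of each fold.
pipe = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(
        max_depth=5, min_samples_split=10, min_samples_leaf=5, random_state=42
    ),
)

pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=demo_cv, scoring="accuracy")
print(f"Pipeline CV accuracy: {pipe_scores.mean():.2%} (+/- {pipe_scores.std():.2%})")
```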


In [82]:
#Injury Record Dataset with Random Forest

rf_classifier_injury = RandomForestClassifier(
    n_estimators=100,         
    max_depth=5,              
    min_samples_split=10,     
    min_samples_leaf=5,       
    random_state=42,
    class_weight='balanced'   
)

rf_cv_scores_injury = cross_val_score(rf_classifier_injury, X_train_scaled, y_train, cv=repeated_cv, scoring='accuracy')
rf_mean_accuracy_injury = np.mean(rf_cv_scores_injury)
rf_std_accuracy_injury = np.std(rf_cv_scores_injury)

rf_classifier_injury.fit(X_train_scaled, y_train)
rf_train_acc_injury = rf_classifier_injury.score(X_train_scaled, y_train)

print(f"Random Forest CV Accuracy Injury Record: {rf_mean_accuracy_injury:.2%}")
print(f"Random Forest CV Std. Dev. Injury Record: {rf_std_accuracy_injury:.2%}")
print(f"Random Forest Train Accuracy Injury Record: {rf_train_acc_injury:.2%}")


Random Forest CV Accuracy Injury Record: 40.26%
Random Forest CV Std. Dev. Injury Record: 9.18%
Random Forest Train Accuracy Injury Record: 75.00%
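
The encoding loops in this notebook reuse a single `LabelEncoder` for every column, so after the loop only the last column's mapping is still available. The sketch below keeps one encoder per column in a dictionary, which makes it possible to turn encoded values (including predictions of the target) back into their original labels. The toy frame and the names `toy`, `encoders`, and `encoded` are hypothetical and only illustrate the pattern.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame (not the real InjuryRecord.csv contents).
toy = pd.DataFrame({
    "Surface": ["Natural", "Synthetic", "Natural"],
    "BodyPart": ["Knee", "Ankle", "Foot"],
})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()

for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col].astype(str))
    encoders[col] = enc

# Recover the original labels for the target column.
decoded_surface = encoders["Surface"].inverse_transform(encoded["Surface"])

print(encoded["Surface"].tolist())
print(list(decoded_surface))
```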


The last dataset, PlayList, "contains information about each player-play in the dataset, to include the player’s assigned roster position, stadium type, field type, weather, play type, position for the play, and position group". It was provided alongside the Injury Record dataset, so it adds information about the environment in which players' lower-body injuries occurred during these two NFL seasons. The target here is "PlayType", which includes values such as kickoff, run, and pass.

In [83]:
#Playlist Dataset with Decision Tree

df_playlist = pd.read_csv("PlayList.csv")

label_encoder = LabelEncoder()

for col in df_playlist.select_dtypes(include=['object']).columns:
    df_playlist[col] = label_encoder.fit_transform(df_playlist[col].astype(str))


target_column = 'PlayType'  
X_playlist = df_playlist.drop(columns=[target_column])
y_playlist = df_playlist[target_column]

X_train, X_test, y_train, y_test = train_test_split(X_playlist, y_playlist, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

dt_classifier_playlist = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42
)

cv_scores_playlist = cross_val_score(dt_classifier_playlist, X_train_scaled, y_train, cv=repeated_cv, scoring='accuracy')
mean_accuracy_playlist = np.mean(cv_scores_playlist)
std_accuracy_playlist = np.std(cv_scores_playlist)

dt_classifier_playlist.fit(X_train_scaled, y_train)
train_acc_playlist = dt_classifier_playlist.score(X_train_scaled, y_train)

print(f"Mean CV Accuracy (Playlist): {mean_accuracy_playlist:.2%}")
print(f"Std. Dev. CV Accuracy Playlist: {std_accuracy_playlist:.2%}")
print(f"Train Accuracy Playlist: {train_acc_playlist:.2%}")

Mean CV Accuracy (Playlist): 52.20%
Std. Dev. CV Accuracy Playlist: 0.08%
Train Accuracy Playlist: 52.26%


In [84]:
#Playlist Dataset with Random Forest

rf_classifier_playlist = RandomForestClassifier(
    n_estimators=100,         
    max_depth=5,              
    min_samples_split=10,     
    min_samples_leaf=5,       
    random_state=42,
    class_weight='balanced'   
)

rf_cv_scores_playlist = cross_val_score(rf_classifier_playlist, X_train_scaled, y_train, cv=repeated_cv, scoring='accuracy')
rf_mean_accuracy_playlist = np.mean(rf_cv_scores_playlist)
rf_std_accuracy_playlist = np.std(rf_cv_scores_playlist)

rf_classifier_playlist.fit(X_train_scaled, y_train)
rf_train_acc_playlist = rf_classifier_playlist.score(X_train_scaled, y_train)

print(f"Random Forest CV Accuracy Playlist: {rf_mean_accuracy_playlist:.2%}")
print(f"Random Forest CV Std. Dev. Playlist: {rf_std_accuracy_playlist:.2%}")
print(f"Random Forest Train Accuracy Playlist: {rf_train_acc_playlist:.2%}")

Random Forest CV Accuracy Playlist: 3.60%
Random Forest CV Std. Dev. Playlist: 0.12%
Random Forest Train Accuracy Playlist: 3.82%
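
The random forest and decision tree accuracies for PlayList differ sharply. One possible factor, stated here as an assumption that has not been verified on this data, is that `class_weight='balanced'` pushes predictions toward rare classes, which plain accuracy penalizes when the target is imbalanced. The sketch below uses synthetic imbalanced data (not PlayList) to show how `cross_validate` can report accuracy and balanced accuracy side by side, together with the class counts, as a starting point for diagnosing this. All names in it are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced 3-class stand-in (an assumption; not the PlayList data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, weights=[0.8, 0.15, 0.05], random_state=42,
)

# Class counts show how skewed the target is.
class_counts = np.bincount(y_demo)

forest = RandomForestClassifier(
    n_estimators=50, max_depth=5, min_samples_split=10,
    min_samples_leaf=5, class_weight="balanced", random_state=42,
)

demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score the same folds with two metrics at once.
results = cross_validate(
    forest, X_demo, y_demo, cv=demo_cv,
    scoring=["accuracy", "balanced_accuracy"],
)

mean_acc = results["test_accuracy"].mean()
mean_bal_acc = results["test_balanced_accuracy"].mean()

print(f"Class counts: {class_counts.tolist()}")
print(f"Accuracy: {mean_acc:.2%}")
print(f"Balanced accuracy: {mean_bal_acc:.2%}")
```

Running the same comparison with and without `class_weight='balanced'` on the real data would show whether the weighting explains the gap.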
