## DX799S O1 Data Science Capstone (Summer 1 2025): ACTIVITY 4.2 ##

Each week, you will apply the concepts of that week to your Integrated Capstone Project’s dataset. In preparation for Milestone One, create a Jupyter Notebook (similar to the one in Module B, semester two) that illustrates these lessons. There are no specific questions to answer in your Jupyter Notebook files in this course; your general goal is to analyze your data using the methods you have learned in this course and in this program, and to draw interesting conclusions.

For Week 4, include concepts such as logistic regression and feature scaling. This homework should be submitted for peer review in the assignment titled 4.3 Peer Review: Week 4 Jupyter Notebook. Complete and submit your Jupyter Notebook homework by 11:59pm ET on Sunday. 

In Week 7, you will compile your findings from your Jupyter Notebook homework into your Milestone One assignment for grading. For full instructions and the rubric for Milestone One, refer to the following link.  

In [15]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.model_selection import GridSearchCV, cross_val_predict, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler


The following dataset, "Video Review", was compiled from reviewable video evidence and outlines the events that resulted in concussions during punt plays in the NFL 2016-2017 season. The target, Primary_Impact_Type, indicates whether the concussion resulted from a Helmet-to-Helmet, Helmet-to-Body, or Helmet-to-Ground impact.

In [16]:
#Video Review Dataset with Feature Scaling

df_videoreview = pd.read_csv("video_review.csv")

label_encoder = LabelEncoder()

print("Object columns before encoding:")
print(df_videoreview.select_dtypes(include=['object']).columns)

for col in df_videoreview.select_dtypes(include=['object']).columns:
    df_videoreview[col] = label_encoder.fit_transform(df_videoreview[col].astype(str))


target_column = 'Primary_Impact_Type'  
X = df_videoreview.drop(columns=[target_column])
y = df_videoreview[target_column]

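# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply the same transform to the test split.
# The pipeline in the next cell scales internally, so these arrays are not reused there.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)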



Object columns before encoding:
Index(['Player_Activity_Derived', 'Turnover_Related', 'Primary_Impact_Type',
       'Primary_Partner_GSISID', 'Primary_Partner_Activity_Derived',
       'Friendly_Fire'],
      dtype='object')
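Reusing a single `LabelEncoder` for every column runs fine, but each `fit_transform` call overwrites the previous mapping, so the integer codes for the target can no longer be decoded into impact-type labels. Integer codes also impose an arbitrary order on nominal features. The sketch below shows one possible alternative: a dedicated encoder for the target and one-hot encoding for categorical features inside the pipeline. The toy frame and its values are invented for illustration and only borrow a few column names; it is not the real `video_review.csv`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Hypothetical toy rows standing in for video_review.csv (values are invented).
toy = pd.DataFrame({
    "Player_Activity_Derived": ["Tackling", "Blocked", "Blocking", "Tackling", "Tackled", "Blocked"],
    "Friendly_Fire": ["No", "No", "Yes", "No", "Unclear", "No"],
    "Season_Year": [2016, 2016, 2017, 2017, 2016, 2017],
    "Primary_Impact_Type": ["Helmet-to-helmet", "Helmet-to-body", "Helmet-to-helmet",
                            "Helmet-to-ground", "Helmet-to-body", "Helmet-to-helmet"],
})

target_column = "Primary_Impact_Type"
X_toy = toy.drop(columns=[target_column])

# A dedicated encoder for the target keeps the integer-to-label mapping recoverable.
target_encoder = LabelEncoder()
y_toy = target_encoder.fit_transform(toy[target_column])

# Columns are listed explicitly here; in the real notebook they would come from the data.
categorical_cols = ["Player_Activity_Derived", "Friendly_Fire"]
numeric_cols = ["Season_Year"]

# One-hot encode nominal features and standardize numeric ones in a single step.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", StandardScaler(), numeric_cols),
])

toy_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(X_toy, y_toy)

# Predictions can be mapped back to readable labels.
decoded = target_encoder.inverse_transform(toy_pipeline.predict(X_toy))
print(decoded)
```

Because the encoders live inside the pipeline, they are refit on each training fold during cross-validation rather than on the full dataset.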


In [17]:
#Video Review Dataset Logistic Regression


pipeline_video = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])

param_grid = {
    'model__C': [0.01, 0.1, 1, 10, 100, 200, 300]
}

grid_search = GridSearchCV(
    estimator=pipeline_video,
    param_grid=param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1
)

grid_search.fit(X, y)

best_model = grid_search.best_estimator_

y_pred = cross_val_predict(best_model, X, y, cv=5)

accuracy = accuracy_score(y, y_pred)
precision = precision_score(y, y_pred, average="weighted") 
f1 = f1_score(y, y_pred, average="weighted")

print(f"Best C value: {grid_search.best_params_['model__C']}")
print(f"Cross-validated Accuracy for Video Review Dataset: {accuracy:.4f}")
print(f"Cross-validated Precision for Video Review Dataset: {precision:.4f}")
print(f"Cross-validated F1-score for Video Review Dataset: {f1:.4f}")


Best C value: 100
Cross-validated Accuracy for Video Review Dataset: 0.5946
Cross-validated Precision for Video Review Dataset: 0.5903
Cross-validated F1-score for Video Review Dataset: 0.5871
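A cross-validated accuracy on a multi-class target is easier to interpret next to a trivial reference model. The sketch below scores a majority-class `DummyClassifier` with the same `cross_val_predict` call used above. It runs on synthetic data from `make_classification`, which is an assumption standing in for the encoded features and target, so the printed numbers say nothing about the Video Review dataset itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_predict

# Synthetic, imbalanced three-class data (a stand-in, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, weights=[0.5, 0.3, 0.2], random_state=42,
)

# Always predicts the most frequent class seen in the training fold.
baseline = DummyClassifier(strategy="most_frequent")
y_base = cross_val_predict(baseline, X_demo, y_demo, cv=5)

baseline_accuracy = accuracy_score(y_demo, y_base)
# zero_division=0 silences the warning for classes that are never predicted.
baseline_f1 = f1_score(y_demo, y_base, average="weighted", zero_division=0)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline weighted F1: {baseline_f1:.4f}")
```

If the logistic regression scores land close to this kind of baseline on the real data, the features are adding little beyond the class frequencies.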




The next dataset, Injury Record, is used to examine the relationship between playing surface and the injuries and performance of NFL athletes. It covers 105 lower-limb injuries that occurred during the regular season over two NFL seasons, and records the surface each game was played on and the number of days the player missed due to injury (an indication of severity). The target in this case is Surface, which gives the type of field (synthetic or natural) on which the injury occurred.

In [18]:
df_injuryrecord = pd.read_csv("InjuryRecord.csv")

In [19]:
#Injury Record Dataset with Feature Scaling


label_encoder = LabelEncoder()

print("Object columns before encoding:")
print(df_injuryrecord.select_dtypes(include=['object']).columns)

for col in df_injuryrecord.select_dtypes(include=['object']).columns:
    df_injuryrecord[col] = label_encoder.fit_transform(df_injuryrecord[col].astype(str))


target_column = 'Surface'  
X_injury = df_injuryrecord.drop(columns=[target_column])
y_injury = df_injuryrecord[target_column]

X_train, X_test, y_train, y_test = train_test_split(X_injury, y_injury, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Object columns before encoding:
Index(['GameID', 'PlayKey', 'BodyPart', 'Surface'], dtype='object')


In [20]:
#Injury Record Dataset Logistic Regression

pipeline_injury = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])

param_grid = {
    'model__C': [0.01, 0.1, 1, 10, 100, 200, 300]
}

grid_search = GridSearchCV(
    estimator=pipeline_injury,
    param_grid=param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1
)

grid_search.fit(X_injury, y_injury)

best_model = grid_search.best_estimator_

y_pred_injury = cross_val_predict(best_model, X_injury, y_injury, cv=5)

accuracy_injury = accuracy_score(y_injury, y_pred_injury)
precision_injury = precision_score(y_injury, y_pred_injury, average="weighted") 
f1_injury = f1_score(y_injury, y_pred_injury, average="weighted")

print(f"Best C value: {grid_search.best_params_['model__C']}")
print(f"Cross-validated Accuracy for Injury Record Dataset: {accuracy_injury:.4f}")
print(f"Cross-validated Precision for Injury Record Dataset: {precision_injury:.4f}")
print(f"Cross-validated F1-score for Injury Record Dataset: {f1_injury:.4f}")



Best C value: 0.1
Cross-validated Accuracy for Injury Record Dataset: 0.5048
Cross-validated Precision for Injury Record Dataset: 0.4900
Cross-validated F1-score for Injury Record Dataset: 0.4857
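In the cell above, `GridSearchCV` selects `C` using every row, and `cross_val_predict` then re-scores the selected model on those same rows, so the reported metrics may be slightly optimistic. One common remedy is nested cross-validation, where the whole search is refit inside each outer fold. The following is a minimal sketch on synthetic binary data (105 rows, chosen only to echo the dataset size); the fold seeds and the shortened `C` grid are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data standing in for the encoded Injury Record features.
X_demo, y_demo = make_classification(
    n_samples=105, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

pipeline_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Inner folds choose C; outer folds estimate performance.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    estimator=pipeline_demo,
    param_grid={"model__C": [0.01, 0.1, 1, 10, 100]},
    cv=inner_cv,
    scoring="f1_weighted",
)

# Each outer fold refits the full search, so C is never tuned on the rows being predicted.
y_pred_nested = cross_val_predict(search, X_demo, y_demo, cv=outer_cv)
nested_f1 = f1_score(y_demo, y_pred_nested, average="weighted")

print(f"Nested cross-validated F1-score: {nested_f1:.4f}")
```

With only about a hundred rows, the gap between nested and non-nested estimates can vary a lot from one seed to another, so repeating the procedure with several seeds gives a more stable picture.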


The last dataset, Concussion, contains a list of concussion injuries that occurred in the National Football League from 2012 to 2014. The data includes features such as Position, Pre-Season Injury?, Week of Injury, Weeks Injured, Games Missed, Reported Injury Type, and Average Playtime Before Injury. The target in this case will be "Reported Injury Type", which will be limited to just concussions.

In [21]:
#Concussion Dataset with Feature Scaling

df_concussion = pd.read_csv("Concussion Injuries 2012-2014 (1).csv")
df_clean_concussion = df_concussion.drop(columns=['ID', 'Player', 'Game', 'Date', 'Winning Team?', 'Unknown Injury?'])
df_clean_concussion = df_clean_concussion.dropna()

label_encoder = LabelEncoder()

for col in df_clean_concussion.select_dtypes(include=['object']).columns:
    df_clean_concussion[col] = label_encoder.fit_transform(df_clean_concussion[col].astype(str))

target_column = 'Reported Injury Type'  
X_concussion = df_clean_concussion.drop(columns=[target_column])
y_concussion = df_clean_concussion[target_column]

X_train, X_test, y_train, y_test = train_test_split(X_concussion, y_concussion, test_size=0.2, random_state=42)

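# Fit the scaler on the training split only, then apply the same transform to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)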


In [22]:
#Concussion Dataset with Logistic Regression

pipeline_concussion = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])

param_grid = {
    'model__C': [0.01, 0.1, 1, 10, 100, 200, 300]
}

grid_search = GridSearchCV(
    estimator=pipeline_concussion,
    param_grid=param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1
)

grid_search.fit(X_concussion, y_concussion)

best_pipeline_concussion = grid_search.best_estimator_

y_pred_concussion = cross_val_predict(best_pipeline_concussion, X_concussion, y_concussion, cv=5)

accuracy_concussion = accuracy_score(y_concussion, y_pred_concussion)
precision_concussion = precision_score(y_concussion, y_pred_concussion, average="weighted") 
f1_concussion = f1_score(y_concussion, y_pred_concussion, average="weighted")

print(f"Best C value: {grid_search.best_params_['model__C']}")
print(f"Cross-validated Accuracy for Concussion Dataset: {accuracy_concussion:.4f}")
print(f"Cross-validated Precision for Concussion Dataset: {precision_concussion:.4f}")
print(f"Cross-validated F1-score for Concussion Dataset: {f1_concussion:.4f}")


Best C value: 100
Cross-validated Accuracy for Concussion Dataset: 0.8031
Cross-validated Precision for Concussion Dataset: 0.7681
Cross-validated F1-score for Concussion Dataset: 0.7781
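Weighted averages can hide a class that the model rarely or never predicts. A per-class breakdown is one way to check for that. The sketch below uses small hand-written label lists, which are purely illustrative and not taken from the Concussion dataset, in which class 2 is never predicted.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-written illustrative labels; class 2 is never predicted.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1, 2])
print(cm)

# zero_division=0 reports 0.0 for a never-predicted class instead of raising a warning.
report = classification_report(
    y_true_demo, y_pred_demo, labels=[0, 1, 2], zero_division=0, output_dict=True
)
for label in ["0", "1", "2"]:
    scores = report[label]
    print(f"class {label}: precision={scores['precision']:.2f}, recall={scores['recall']:.2f}")
```

In the notebook, the same two calls could be applied to `y_concussion` and `y_pred_concussion` to see which injury types drive the weighted scores.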
