## DX799S O1 Data Science Capstone (Summer 1 2025): ACTIVITY 5.2 ##

Each week, you will apply that week's concepts to your Integrated Capstone Project’s dataset. In preparation for Milestone One, create a Jupyter Notebook (similar to the one in Module B, Semester Two) that illustrates these lessons. There are no specific questions to answer in your Jupyter Notebook files in this course; your general goal is to analyze your data using the methods you have learned in this course and in this program, and to draw interesting conclusions.

For Week 5, include concepts such as support vector machines, the kernel trick, and regularization for support vector machines. Complete your Jupyter Notebook homework by 11:59pm ET on Sunday. 

In Week 7, you will compile your findings from your Jupyter Notebook homework into your Milestone One assignment for grading. For full instructions and the rubric for Milestone One, refer to the following link. 

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import GridSearchCV


The following dataset, "Video Review", was compiled from reviewable video evidence and describes the events that resulted in concussions during punt plays in the NFL 2016-2017 season. The target, Primary_Impact_Type, records whether the concussion resulted from a Helmet-to-Helmet, Helmet-to-Body, or Helmet-to-Ground impact.

In [2]:
#Video Review Dataset with support vector machines

df_videoreview = pd.read_csv("video_review.csv")

label_encoder = LabelEncoder()

for col in df_videoreview.select_dtypes(include=['object']).columns:
    df_videoreview[col] = label_encoder.fit_transform(df_videoreview[col].astype(str))


target_column = 'Primary_Impact_Type'  
X = df_videoreview.drop(columns=[target_column])
y = df_videoreview[target_column]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

svm = SVC(kernel='rbf', decision_function_shape='ovr', class_weight='balanced', random_state=42)
svm.fit(X_train_scaled, y_train)

y_pred_train = svm.predict(X_train_scaled)

print("Training Accuracy Video Review Dataset:", accuracy_score(y_train, y_pred_train))
print("\nClassification Report (Video Review Dataset):\n", classification_report(y_train, y_pred_train))
print("\nConfusion Matrix (Video Review Dataset):\n", confusion_matrix(y_train, y_pred_train))


Training Accuracy Video Review Dataset: 0.7586206896551724

Classification Report (Video Review Dataset):
               precision    recall  f1-score   support

           0       0.71      0.86      0.77        14
           1       1.00      1.00      1.00         1
           2       0.80      0.62      0.70        13
           3       1.00      1.00      1.00         1

    accuracy                           0.76        29
   macro avg       0.88      0.87      0.87        29
weighted avg       0.77      0.76      0.75        29


Confusion Matrix (Video Review Dataset):
 [[12  0  2  0]
 [ 0  1  0  0]
 [ 5  0  8  0]
 [ 0  0  0  1]]
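
The metrics above are computed on the training split only, so they do not show how the model generalizes. The following is a minimal sketch of a held-out evaluation, assuming synthetic data from `make_classification` as a stand-in for the Video Review features. The names `X_demo`, `demo_scaler`, and `demo_svm` are hypothetical and not part of the notebook. The scaler is fit on the training split and only applied to the test split.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the Video Review dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=42
)

# Stratify so both splits keep roughly the same class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training split only, then reuse it on the test split.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

demo_svm = SVC(kernel='rbf', class_weight='balanced', random_state=42)
demo_svm.fit(X_tr_scaled, y_tr)

train_acc = accuracy_score(y_tr, demo_svm.predict(X_tr_scaled))
test_acc = accuracy_score(y_te, demo_svm.predict(X_te_scaled))
print("Training accuracy (demo):", train_acc)
print("Held-out test accuracy (demo):", test_acc)
```

Applied to the notebook's own variables, the same pattern would amount to calling `scaler.transform(X_test)` and scoring the fitted model against `y_test`.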


In [3]:
#Video Review Dataset regularization for support vector machines

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(SVC(kernel='rbf', class_weight='balanced', random_state=42), param_grid, cv=5)
grid.fit(X_train_scaled, y_train)

print("Best C Video Review Dataset:", grid.best_params_['C'])
print("Best cross-val accuracy Video Review Dataset:", grid.best_score_)

train_accuracy = grid.best_estimator_.score(X_train_scaled, y_train)
print("Training set accuracy with best C Video Review Dataset:", train_accuracy)


Best C Video Review Dataset: 1
Best cross-val accuracy Video Review Dataset: 0.4133333333333333
Training set accuracy with best C Video Review Dataset: 0.7586206896551724
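
The cross-validated accuracy (about 0.41) is well below the training accuracy (about 0.76), which suggests the fitted model does not generalize on this small sample of 29 training rows. One possible refinement, sketched below under stated assumptions, is to place the scaler inside a `Pipeline` so it is refit on each fold, and to search `gamma` together with `C` for the RBF kernel. The data is synthetic, and `pipe`, `param_grid_demo`, and `search` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the Video Review dataset).
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=3, random_state=0
)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(kernel="rbf", class_weight="balanced")),
])

# Step-prefixed parameter names route values to the SVC step.
param_grid_demo = {
    "svc__C": [0.01, 0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.01, 0.1, 1],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid_demo, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters (demo):", search.best_params_)
print("Best cross-val accuracy (demo):", search.best_score_)
```

Smaller `C` values regularize more strongly (wider margin, more tolerated violations), while `gamma` controls how local the RBF kernel's influence is, so the two are usually tuned together.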




In [4]:
#Video Review Dataset the kernel trick 

kernels = ['linear', 'poly', 'rbf', 'sigmoid']

for kernel in kernels:
    print(f"\n Video Review Dataset Training SVM with {kernel} kernel:")
    svm = SVC(kernel=kernel, degree=3, class_weight='balanced', random_state=42)  # degree is only used by the 'poly' kernel
    svm.fit(X_train_scaled, y_train)
    y_pred_train = svm.predict(X_train_scaled)
    print(classification_report(y_train, y_pred_train))



 Video Review Dataset Training SVM with linear kernel:
              precision    recall  f1-score   support

           0       0.64      0.64      0.64        14
           1       1.00      1.00      1.00         1
           2       0.62      0.62      0.62        13
           3       1.00      1.00      1.00         1

    accuracy                           0.66        29
   macro avg       0.81      0.81      0.81        29
weighted avg       0.66      0.66      0.66        29


 Video Review Dataset Training SVM with poly kernel:
              precision    recall  f1-score   support

           0       1.00      0.64      0.78        14
           1       1.00      1.00      1.00         1
           2       0.72      1.00      0.84        13
           3       1.00      1.00      1.00         1

    accuracy                           0.83        29
   macro avg       0.93      0.91      0.91        29
weighted avg       0.88      0.83      0.82        29


 Video Review Datas

The next dataset, Injury Record, is used to examine the relationship between the playing surface and the injuries and performance of NFL athletes. It covers 105 lower-limb injuries that occurred during regular-season NFL games over two seasons, and it records the surface the game was played on and the number of days the player missed due to the injury (an indication of its severity). The target in this case is Surface, the type of field (synthetic or natural) on which the injury occurred.

In [5]:
df_injuryrecord = pd.read_csv("InjuryRecord.csv")

In [6]:
#Injury Record Dataset with support vector machines

label_encoder = LabelEncoder()

for col in df_injuryrecord.select_dtypes(include=['object']).columns:
    df_injuryrecord[col] = label_encoder.fit_transform(df_injuryrecord[col].astype(str))


target_column = 'Surface'  
X_injury = df_injuryrecord.drop(columns=[target_column])
y_injury = df_injuryrecord[target_column]

X_train, X_test, y_train, y_test = train_test_split(X_injury, y_injury, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)


svm_injury = SVC(kernel='rbf', decision_function_shape='ovr', class_weight='balanced', random_state=42)
svm_injury.fit(X_train_scaled, y_train)

y_pred_train = svm_injury.predict(X_train_scaled)

print("Training Accuracy Injury Record Dataset:", accuracy_score(y_train, y_pred_train))
print("\nClassification Report (Injury Record Dataset):\n", classification_report(y_train, y_pred_train))
print("\nConfusion Matrix (Injury Record Dataset):\n", confusion_matrix(y_train, y_pred_train))

Training Accuracy Injury Record Dataset: 0.6547619047619048

Classification Report (Injury Record Dataset):
               precision    recall  f1-score   support

           0       0.62      0.64      0.63        39
           1       0.68      0.67      0.67        45

    accuracy                           0.65        84
   macro avg       0.65      0.65      0.65        84
weighted avg       0.66      0.65      0.66        84


Confusion Matrix (Injury Record Dataset):
 [[25 14]
 [15 30]]


In [7]:
#Injury Record Dataset regularization for support vector machines
param_grid_injury = {'C': [0.01, 0.1, 1, 10, 100]}
grid_injury = GridSearchCV(SVC(kernel='rbf', class_weight='balanced', random_state=42), param_grid_injury, cv=5)
grid_injury.fit(X_train_scaled, y_train)

print("Best C Injury Record:", grid_injury.best_params_['C'])
print("Best cross-val accuracy Injury Record Dataset:", grid_injury.best_score_)

train_accuracy_injury = grid_injury.best_estimator_.score(X_train_scaled, y_train)
print("Training set accuracy with best C Injury Record Dataset:", train_accuracy_injury)



Best C Injury Record: 0.01
Best cross-val accuracy Injury Record Dataset: 0.5110294117647058
Training set accuracy with best C Injury Record Dataset: 0.5357142857142857
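
The training accuracy reported here (0.5357) equals 45/84, the share of the larger class in the training split, which is consistent with a model that does little better than always predicting one class. A quick way to check this kind of result is to compare against a trivial baseline. The sketch below assumes synthetic data with a similar size and class balance; `baseline` and `svm_pipe` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the Injury Record dataset).
X_demo, y_demo = make_classification(
    n_samples=105, n_features=5, n_informative=2, n_redundant=0,
    weights=[0.46, 0.54], random_state=1
)

# Baseline that always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")

# RBF SVM with strong regularization (small C), scaled inside the pipeline.
svm_pipe = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=0.01, class_weight="balanced"),
)

baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5)
svm_scores = cross_val_score(svm_pipe, X_demo, y_demo, cv=5)

print("Baseline cross-val accuracy (demo):", baseline_scores.mean())
print("SVM cross-val accuracy (demo):", svm_scores.mean())
```

If the SVM's cross-validated accuracy is not clearly above the baseline, the features may carry little information about the target at that setting.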


In [8]:
#Injury Record Dataset the kernel trick 

kernels_injury = ['linear', 'poly', 'rbf', 'sigmoid']

for kernel in kernels_injury:
    print(f"\n Injury Record Dataset Training SVM with {kernel} kernel:")
    svm_injury = SVC(kernel=kernel, degree=3, class_weight='balanced', random_state=42)  # degree is only used by the 'poly' kernel
    svm_injury.fit(X_train_scaled, y_train)
    y_pred_train = svm_injury.predict(X_train_scaled)
    print(classification_report(y_train, y_pred_train))


 Injury Record Dataset Training SVM with linear kernel:
              precision    recall  f1-score   support

           0       0.53      0.46      0.49        39
           1       0.58      0.64      0.61        45

    accuracy                           0.56        84
   macro avg       0.55      0.55      0.55        84
weighted avg       0.56      0.56      0.56        84


 Injury Record Dataset Training SVM with poly kernel:
              precision    recall  f1-score   support

           0       0.68      0.44      0.53        39
           1       0.63      0.82      0.71        45

    accuracy                           0.64        84
   macro avg       0.65      0.63      0.62        84
weighted avg       0.65      0.64      0.63        84


 Injury Record Dataset Training SVM with rbf kernel:
              precision    recall  f1-score   support

           0       0.62      0.64      0.63        39
           1       0.68      0.67      0.67        45

    accuracy    

The last dataset, Concussion, contains a list of concussion injuries that occurred in the National Football League from 2012 to 2014. The data includes features such as Position, Pre-Season Injury?, Week of Injury, Weeks Injured, Games Missed, Reported Injury Type, Average Playtime Before Injury, etc. The target in this case is "Reported Injury Type", which will be limited to just concussions.

In [9]:
#Concussion Dataset with support vector machines

df_concussion = pd.read_csv("Concussion Injuries 2012-2014 (1).csv")
df_clean_concussion = df_concussion.drop(columns=['ID', 'Player', 'Game', 'Date', 'Winning Team?', 'Unknown Injury?'])
df_clean_concussion = df_clean_concussion.dropna()

label_encoder = LabelEncoder()

for col in df_clean_concussion.select_dtypes(include=['object']).columns:
    df_clean_concussion[col] = label_encoder.fit_transform(df_clean_concussion[col].astype(str))

target_column = 'Reported Injury Type'  
X_concussion = df_clean_concussion.drop(columns=[target_column])
y_concussion = df_clean_concussion[target_column]

X_train, X_test, y_train, y_test = train_test_split(X_concussion, y_concussion, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)



# Named for the Concussion dataset (the earlier "playlist" name was a leftover)
svm_concussion = SVC(kernel='rbf', decision_function_shape='ovr', class_weight='balanced', random_state=42)
svm_concussion.fit(X_train_scaled, y_train)

y_pred_train = svm_concussion.predict(X_train_scaled)

print("Training Accuracy Concussion Dataset:", accuracy_score(y_train, y_pred_train))
print("\nClassification Report (Concussion Dataset):\n", classification_report(y_train, y_pred_train))
print("\nConfusion Matrix (Concussion Dataset):\n", confusion_matrix(y_train, y_pred_train))

Training Accuracy Concussion Dataset: 0.7846153846153846

Classification Report (Concussion Dataset):
               precision    recall  f1-score   support

           0       0.98      0.75      0.85       209
           1       0.47      0.92      0.63        51

    accuracy                           0.78       260
   macro avg       0.72      0.84      0.74       260
weighted avg       0.88      0.78      0.81       260


Confusion Matrix (Concussion Dataset):
 [[157  52]
 [  4  47]]


In [10]:
#Concussion Dataset with regularization for support vector machines
param_grid_concussion = {'C': [0.01, 0.1, 1, 10, 100]}
grid_concussion = GridSearchCV(SVC(kernel='rbf', class_weight='balanced', random_state=42), param_grid_concussion, cv=5)
grid_concussion.fit(X_train_scaled, y_train)

print("Best C Concussion:", grid_concussion.best_params_['C'])
print("Best cross-val accuracy Concussion Dataset:", grid_concussion.best_score_)

train_accuracy_concussion = grid_concussion.best_estimator_.score(X_train_scaled, y_train)
print("Training set accuracy with best C Concussion Dataset:", train_accuracy_concussion)


Best C Concussion: 10
Best cross-val accuracy Concussion Dataset: 0.7615384615384615
Training set accuracy with best C Concussion Dataset: 0.95
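
With the best `C` of 10, training accuracy (0.95) is well above the cross-validated accuracy (about 0.76), a gap that often indicates overfitting as regularization weakens. One way to see this trade-off across the whole grid is a validation curve. The sketch below uses synthetic, imbalanced data as a stand-in, and `curve_pipe` and `C_range` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the Concussion dataset).
X_demo, y_demo = make_classification(
    n_samples=260, n_features=10, n_informative=4,
    weights=[0.8, 0.2], random_state=7
)

# make_pipeline names the SVC step "svc", so its C is addressed as "svc__C".
curve_pipe = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", class_weight="balanced"),
)

C_range = [0.01, 0.1, 1, 10, 100]
train_scores, val_scores = validation_curve(
    curve_pipe, X_demo, y_demo,
    param_name="svc__C", param_range=C_range, cv=5
)

# Each row holds the per-fold scores for one value of C.
for C, tr, va in zip(C_range, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"C={C}: train={tr:.3f}, validation={va:.3f}")
```

A training score that keeps rising with `C` while the validation score flattens or drops is the usual sign that larger `C` values are fitting noise.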


In [11]:
#Concussion Dataset with the kernel trick 

kernels_concussion = ['linear', 'poly', 'rbf', 'sigmoid']

for kernel in kernels_concussion:
    print(f"\n Concussion Dataset Training SVM with {kernel} kernel:")
    # degree is only used by the 'poly' kernel
    svm_concussion = SVC(kernel=kernel, degree=3, class_weight='balanced', random_state=42)
    svm_concussion.fit(X_train_scaled, y_train)
    y_pred_train = svm_concussion.predict(X_train_scaled)
    print(classification_report(y_train, y_pred_train))


 Concussion Dataset Training SVM with linear kernel:
              precision    recall  f1-score   support

           0       0.98      0.63      0.76       209
           1       0.38      0.94      0.54        51

    accuracy                           0.69       260
   macro avg       0.68      0.78      0.65       260
weighted avg       0.86      0.69      0.72       260


 Concussion Dataset Training SVM with poly kernel:
              precision    recall  f1-score   support

           0       0.97      0.89      0.93       209
           1       0.66      0.90      0.76        51

    accuracy                           0.89       260
   macro avg       0.82      0.89      0.84       260
weighted avg       0.91      0.89      0.89       260


 Concussion Dataset Training SVM with rbf kernel:
              precision    recall  f1-score   support

           0       0.98      0.75      0.85       209
           1       0.47      0.92      0.63        51

    accuracy             