

```
# This is formatted as code
```

## Omkar Mahesh Patil
## The project I attempted was: (1) Tabular, (2) Image, (3) Text, (4) Time-series
(please delete as appropriate)

# CSC8111 Coursework Specification

For this coursework you will perform **THREE** short questions which cover the breadth of the machine learning module along with attempting **ONE** of the four longer project-style questions. All of the short tasks and longer project-style questions can be found in this notebook. You should provide all of your answers in this notebook and submit it to Canvas before the submission deadline.

The learning objectives of these short questions are:
- To demonstrate a wide-range of machine learning skills.
- To be able to apply the most appropriate approach at the right time.



---
## Question 1: Classification (10 marks)

Load the dataset below, where X and y are the feature (input) variables and target (output) variable. Based on this dataset, build TWO classifiers using different machine learning approaches to predict the two classes in the target variable. You are free to use any appropriate machine learning models and libraries, but you need to split the dataset into training and test sets and optimise the model's hyperparameters (e.g. using GridSearchCV()). As a result, the performance metrics of the best classifier should be reported over the test set. Please follow the steps below to complete the code.

The dataset is available at:
https://ncl.instructure.com/courses/53509/files/7659751?wrap=1 and
https://ncl.instructure.com/courses/53509/files/7659755?wrap=1


## Set up the environment and load the dataset

In [None]:
# just run this cell, don't change the code
import numpy as np
from numpy import loadtxt
X = loadtxt('cls_X.csv', delimiter=',')
y = loadtxt('cls_y.csv', delimiter=',')

## Q1.1 Split the data into training and test sets (20% for testing)

In [None]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set - Features:", X_train.shape, "Target:", y_train.shape)
print("Test set - Features:", X_test.shape, "Target:", y_test.shape)

Training set - Features: (320, 6) Target: (320,)
Test set - Features: (80, 6) Target: (80,)


## Q1.2 Create your first classifier

#### Q1.2.1 First, make an attempt by using an appropriate machine learning method without optimising the hyperparameter(s). Report the model accuracy over the test set (i.e. test accuracy).

In [None]:
# write your code below to replace the ellipsis "..."
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

dt_classifier = DecisionTreeClassifier(random_state=42)

# Train the Decision Tree model using the training data
dt_classifier.fit(X_train, y_train)

# Make predictions on the test set
predictions = dt_classifier.predict(X_test)
# Print classification report
print("Classification Report:")
print(classification_report(y_test, predictions))


Classification Report:
              precision    recall  f1-score   support

         0.0       0.80      0.78      0.79        36
         1.0       0.82      0.84      0.83        44

    accuracy                           0.81        80
   macro avg       0.81      0.81      0.81        80
weighted avg       0.81      0.81      0.81        80



#### Q1.2.2 Then, optimise the hyperparameter(s) using the same machine learning method as above. Report the best hyperparameter(s) and, use it to make your first classifier and print out its test accuracy.

In [None]:
# write your code below to replace the ellipsis "..."
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

# Creating a Decision Tree Classifier
decision_tree = DecisionTreeClassifier(random_state=42)

# Define hyperparameters and their values to tune
param_grid = {
    'max_leaf_nodes': [None],
    'min_samples_split': [2],
    'min_samples_leaf': [1],
    'class_weight': [None]

}

# GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(estimator=decision_tree, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best parameters and the best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Score:", best_score)

# Train the model using the best parameters
best_decision_tree = DecisionTreeClassifier(**best_params, random_state=42)
best_decision_tree.fit(X_train, y_train)

# Make predictions on the test set
y_pred = best_decision_tree.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Additional evaluation metrics
print("Classification Report:")
print(classification_report(y_test, y_pred))


Best Parameters: {'class_weight': None, 'max_leaf_nodes': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best Score: 0.79375
Accuracy: 0.8125
Classification Report:
              precision    recall  f1-score   support

         0.0       0.80      0.78      0.79        36
         1.0       0.82      0.84      0.83        44

    accuracy                           0.81        80
   macro avg       0.81      0.81      0.81        80
weighted avg       0.81      0.81      0.81        80



## Q1.3 Create your second classifier

#### Q1.3.1 First, without optimising the hyperparameter(s), make an attempt by using a different machine learning method to the first classifier. Report the model accuracy over the test set (i.e. test accuracy).

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest classifier
random_forest = RandomForestClassifier()

# Fit the model using the training data
random_forest.fit(X_train, y_train)

# Make predictions on the test set
predictions = random_forest.predict(X_test)

# Evaluate model performance (e.g., accuracy)
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy of the Random Forest classifier: {accuracy:.2f}")


#### Q1.3.2 Then, optimise the hyperparameter(s) using the same machine learning method as above. Report the best hyperparameter(s) and, use it to make your second classifier and print out its test accuracy.



In [None]:
# write your code below to replace the ellipsis "..."
# write your code below to replace the ellipsis "..."
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters and their values to search
param_grid = {
    'n_estimators': [10, 20, 30],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
}

# Create a Random Forest classifier
random_forest = RandomForestClassifier()

# Perform GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(estimator=random_forest, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Use the best model
best_model = grid_search.best_estimator_

# Make predictions on the test set using the best model
best_predictions1 = best_model.predict(X_test)

# Evaluate model performance
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, best_predictions1)
print(f"Accuracy of the tuned Random Forest classifier: {accuracy:.2f}")


## Q1.4 Report the precision, recall, f1 score and confusion matrix on the best of the two classifiers

In [None]:
# write your code below to replace the ellipsis "..."
# write your code below to replace the ellipsis "..."
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Calculate precision, recall, and F1-score
precision = precision_score(y_test, best_predictions1)
recall = recall_score(y_test, best_predictions1)
f1 = f1_score(y_test, best_predictions1)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")

# Calculate the confusion matrix
conf_matrix = confusion_matrix(y_test, best_predictions1)
print("Confusion Matrix:")
print(conf_matrix)


---
## Question 2: Regression (10 marks)

In this question you are given a simple dataset which you will perform regression on to predict values. You will build TWO Regression models and then take the best one and perform hyperparameter tuning on it.

## Set up the environment

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

## Read in the data

You'll need to download the data.csv file from https://ncl.instructure.com/courses/53509/files/7657710?wrap=1 and upload it to your Google Drive. I placed it in a folder called data. Then you need to mount your Google Drive in Colab (cell below).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Then read in the data

In [None]:
data = np.loadtxt('/content/drive/MyDrive/Data/data.csv', delimiter=',')
print(data)

[[1.00000000e+00 9.00000000e+00 2.57259990e+00 2.49788633e+02]
 [1.00000000e+00 5.00000000e+00 9.21413366e+00 5.04502032e+02]
 [1.00000000e+00 1.70000000e+01 7.12330090e+00 1.33580225e+03]
 ...
 [5.00000000e+00 3.10000000e+01 6.80121067e+00 3.15979690e+03]
 [5.00000000e+00 1.00000000e+01 4.14995662e+00 6.21315789e+02]
 [5.00000000e+00 9.00000000e+00 9.61878173e+00 1.30105857e+03]]


## Q2.1 Split the data into X and y

X is the first three columns

y is the last column

In [None]:

import numpy as np


X=data[:,:3] # Select all rows and all columns except the last one for features
y = data[:, -1]   # Select all rows and the last column for the target variable

# Print the shapes of X and y to verify the split
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

Shape of X: (1000, 3)
Shape of y: (1000,)


## Q2.2 Create the Train and Test datasets

20% of the data is kept back for testing

In [None]:
# your answer here
from sklearn.model_selection import train_test_split

# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting sets
print("Training set - Features:", X_train.shape, "Target:", y_train.shape)
print("Test set - Features:", X_test.shape, "Target:", y_test.shape)

Training set - Features: (800, 3) Target: (800,)
Test set - Features: (200, 3) Target: (200,)


## Q2.3 Use TWO Regression approaches on the dataset

In each case report the R^2 value against the test data.

#LinearRegression

Q2.3.1 Regression approach 1

In [None]:
# your answer here
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


# 1. Linear Regression
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
linear_reg_predictions = linear_reg.predict(X_test)
linear_reg_r2 = r2_score(y_test, linear_reg_predictions)
print(f"R-squared (Linear Regression): {linear_reg_r2:.4f}")




R-squared (Linear Regression): 0.8625


#Ridge Regression

Q2.3.2 Regression approach 2

In [None]:
# your answer here
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Generating random data for demonstration
#X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Splitting the data into training and testing sets
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and fitting the Ridge regression model
#alpha = 0.1  # Regularization strength
#ridge.fit(X_train, y_train)

ridge=Ridge()
ridge.fit(X_train,y_train)
# Making predictions on the test set
predictions = ridge.predict(X_test)

r2 = r2_score(y_test, predictions)
print(f"R^2 Score: {r2}")



R^2 Score: 0.8625239425970855


#Hyperparameter using Ridge Regression

## Q2.4 Optimise the hyperparameters

Take your best Regression approach from above and identify the best hyperparameters. Note as some Regression approaches have many hyperparameters you may limit yourself here to just THREE.

Q2.4.1 Search for the best hyperparameters

In [None]:
# your answer here
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler


# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Ridge Regression model
ridge_regression = Ridge()

# Define hyperparameters to search
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1, 10]
}

# Grid Search for best alpha
grid_search = GridSearchCV(estimator=ridge_regression, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best parameters and the best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Score:", best_score)

# Train the model using the best alpha
best_ridge_regression = Ridge(**best_params)
best_ridge_regression.fit(X_train, y_train)

# Evaluate the model
score = best_ridge_regression.score(X_test, y_test)
print("R-squared on Test Set:", score)





Best Parameters: {'alpha': 10}
Best Score: 0.8446284937146773
R-squared on Test Set: 0.8624012211555678


#Best hyperparameter


Q2.4.2 Output the best hyperparameters found

In [None]:
# your answer here
# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

Best hyperparameters: {'alpha': 10}


Q2.4.3 Show the results for the best model

In [None]:
# your answer here
# Use the best model
best_model = grid_search.best_estimator_

# Make predictions on the test set using the best model
best_predictions = best_model.predict(X_test)
#best_predictions
best_model

---
## Question 3: Deep Learning (10 marks)

Q3.1 For MNIST dataset, implement a deep learning model with 3 hidden layers with layer size: 128, 256, 50.


In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.datasets import mnist
import keras.utils as utils

batch_size = 128
nb_classes = 10
im_dim = 784
nb_epoch = 2

In [None]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(60000, im_dim)
X_test = X_test.reshape(10000, im_dim)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
Y_train = utils.to_categorical(y_train, nb_classes)
Y_test = utils.to_categorical(y_test, nb_classes)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [None]:
model = Sequential()
model.add(Dense(128, input_shape=(im_dim,), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(nb_classes, activation='softmax'))

Add code to output your network structure

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 128)               100480    
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 256)               33024     
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense_2 (Dense)             (None, 50)                12850     
                                                                 
 dropout_2 (Dropout)         (None, 50)                0         
                                                                 
 dense_3 (Dense)             (None, 10)                5

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Train the model for just two epochs to show it works. All code provided - just run.

In [None]:
history = model.fit(X_train, Y_train, epochs=nb_epoch,
                    validation_split = 0.2,
                    batch_size=batch_size, verbose=1)
evaluation = model.evaluate(X_test, Y_test, verbose=1)
print('Summary: Loss over the test dataset: %.2f, Accuracy: %.2f' % (evaluation[0], evaluation[1]))

Epoch 1/2
Epoch 2/2
Summary: Loss over the test dataset: 0.12, Accuracy: 0.96


Q 3.2 For MNIST dataset, implement a CNN model with only one 2D CNN layer as the hidden layer.

In [None]:
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K

batch_size = 128
num_classes = 10
epochs = 2


(x_train, y_train), (x_test, y_test) = mnist.load_data()

# input image dimensions
img_rows, img_cols = 28, 28
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples


In [None]:
# Write down your code about the CNN model of Q3.2 here
# you should call your model 'model'
# Write down your code about the CNN model of Q3.2 here
# you should call your model 'model'
# ...
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
#model.add(Conv2D(64, (3, 3), activation='relu'))
#model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
#model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

Add code to output your network structure

In [None]:
# your code here
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d (MaxPooling2  (None, 13, 13, 32)        0         
 D)                                                              
                                                                 
 flatten (Flatten)           (None, 5408)              0         
                                                                 
 dropout_3 (Dropout)         (None, 5408)              0         
                                                                 
 dense_4 (Dense)             (None, 10)                54090     
                                                                 
Total params: 54410 (212.54 KB)
Trainable params: 54410 (212.54 KB)
Non-trainable params: 0 (0.00 Byte)
________________

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


We just train for two epochs to demonstrate that the network does work. Just run it.

In [None]:
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
              verbose=1, shuffle=True,
              validation_split = 0.2)
score = model.evaluate(x_test, y_test, verbose=0)
print('Summary: Loss over the test dataset: %.2f, Accuracy: %.2f' % (score[0], score[1]))

Epoch 1/2
Epoch 2/2
Summary: Loss over the test dataset: 0.12, Accuracy: 0.97


---
---
---
#Mini-projects: Introduction

The remainder of this document defines four project-style questions which go more deeply into one of the areas from the module. You should pick **ONE** of these project-stye questions to answer.

The learning objectives of this assignment are:
1. To learn about the design of machine learning analysis pipelines
2. To understand how to select appropriate methods given the dataset type
3. To learn how to conduct machine learning experimentation in a rigorous and effective manner
4. To critically evaluate the performance of the designed machine learning pipelines
5. To learn and practice the skills of reporting machine learning experiments

For this coursework you will be provided with a choice of four different datasets of different nature
1. A tabular dataset (defined as a classification problem)
2. An image dataset
3. A text dataset
4. A time series dataset

Your job is easy to state: You should pick ONE out of these four options and design a range of machine learning pipelines appropriate to the nature of each of the selected datasets. Overall, we expect that you will perform a thorough investigation involving (whenever relevant) all parts of a machine learning pipeline (exploration, preprocessing, model training, model interpretation and evaluation), evaluating a range of options for all parts of the pipeline and with proper hyperparameter tuning.

You will have to write a short report (as part of this notebook) that presents the experiments you did, their justification, a detailed description of the performance of your designed pipelines using the most appropriate presentation tools (e.g., tables of results, plots). We expect that you should be able to present your work at a level of detail that would enable a fellow student to reproduce your steps.

## Deliverables
An inline report and code blocks addressing the marking scheme below. The report should have 1000 to 2000 words. The word count excludes references, tables, figures and section headers, and has a 10% leeway.

## Marking scheme

- Writing Style, references, figures, etc. 7 marks
- Dataset exploration 7 marks
- Methods 21 marks
- Results of analysis 21 marks
- Discussion 14 marks

---
---
## Project 1: Tabular dataset (70 marks)

The dataset, called FARS, is a collection of statistics of US road traffic accidents. The class label is about the severity of the accident. It has 20 features and over 100K examples. The dataset is available in Canvas as a CSV file, in which the last column contains the class labels: https://ncl.instructure.com/courses/53509/files/7652449/download?download_frd=1

Experiments on the tabular dataset will be relatively fast compared to the other three options. To compensate, we expect that you evaluate a very broad range of options for the design of your machine learning pipelines, including (but not limited to) data normalisation, feature/instance selection, class imbalance correction, several (appropriate) machine learning models, hyperparameter tuning and cross-validation evaluation.

# Tabular Data


##**Importing libraries**

In the following section importing pandas,seaboarn and sklearn ,libraries.

In [49]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.svm import SVC


##Read in the data

Loading fars data & reading it through pandas library.The data is being read into data variable

In [50]:
file_path = '/content/fars 2.csv'
data = pd.read_csv(file_path)

Number of row 100968 and column is 30

In [51]:
data.shape

(100968, 30)

Printing all the 30 columns of the data variable

In [52]:
data.columns

Index(['CASE_STATE', 'AGE', 'SEX', 'PERSON_TYPE', 'SEATING_POSITION',
       'RESTRAINT_SYSTEM-USE', 'AIR_BAG_AVAILABILITY/DEPLOYMENT', 'EJECTION',
       'EJECTION_PATH', 'EXTRICATION', 'NON_MOTORIST_LOCATION',
       'POLICE_REPORTED_ALCOHOL_INVOLVEMENT', 'METHOD_ALCOHOL_DETERMINATION',
       'ALCOHOL_TEST_TYPE', 'ALCOHOL_TEST_RESULT',
       'POLICE-REPORTED_DRUG_INVOLVEMENT', 'METHOD_OF_DRUG_DETERMINATION',
       'DRUG_TEST_TYPE_(1_of_3)', 'DRUG_TEST_RESULTS_(1_of_3)',
       'DRUG_TEST_TYPE_(2_of_3)', 'DRUG_TEST_RESULTS_(2_of_3)',
       'DRUG_TEST_TYPE_(3_of_3)', 'DRUG_TEST_RESULTS_(3_of_3)',
       'HISPANIC_ORIGIN', 'TAKEN_TO_HOSPITAL',
       'RELATED_FACTOR_(1)-PERSON_LEVEL', 'RELATED_FACTOR_(2)-PERSON_LEVEL',
       'RELATED_FACTOR_(3)-PERSON_LEVEL', 'RACE', 'INJURY_SEVERITY'],
      dtype='object')

data.describe() provides a summary of descriptive statistics for numerical columns in the dataset.                                        
Count- tells about number of non null values present in each column.                                    
Mean-calculates the average value of each column which has datatype as int 64                            
std-define the standard deviation for each column           
min-It defines the minimum value of that particular column             
max-It defines the maximun value of that particular column
25%- Value below which 25% of the data falls.        
50%-Middle value separating the higher half from the lower half of the data.                  
75%-Value below which 75% of the data falls.




In [53]:
data.describe()

Unnamed: 0,AGE,ALCOHOL_TEST_RESULT,DRUG_TEST_RESULTS_(1_of_3),DRUG_TEST_RESULTS_(2_of_3),DRUG_TEST_RESULTS_(3_of_3)
count,100968.0,100968.0,100968.0,100968.0,100968.0
mean,37.106707,68.023116,207.393758,100.089672,95.441556
std,22.109641,42.306371,396.194002,295.089512,292.121277
min,0.0,0.0,0.0,0.0,0.0
25%,20.0,15.0,0.0,0.0,0.0
50%,32.0,96.0,0.0,0.0,0.0
75%,49.0,96.0,1.0,0.0,0.0
max,99.0,99.0,999.0,999.0,999.0


Prints the first five rows of the dataset

In [54]:
data.head()

Unnamed: 0,CASE_STATE,AGE,SEX,PERSON_TYPE,SEATING_POSITION,RESTRAINT_SYSTEM-USE,AIR_BAG_AVAILABILITY/DEPLOYMENT,EJECTION,EJECTION_PATH,EXTRICATION,...,DRUG_TEST_RESULTS_(2_of_3),DRUG_TEST_TYPE_(3_of_3),DRUG_TEST_RESULTS_(3_of_3),HISPANIC_ORIGIN,TAKEN_TO_HOSPITAL,RELATED_FACTOR_(1)-PERSON_LEVEL,RELATED_FACTOR_(2)-PERSON_LEVEL,RELATED_FACTOR_(3)-PERSON_LEVEL,RACE,INJURY_SEVERITY
0,Alabama,34,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),None_Used/Not_Applicable,Air_Bag_Available_but_Not_Deployed_for_this_Seat,Totally_Ejected,Unknown,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,White,Fatal_Injury
1,Alabama,20,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),None_Used/Not_Applicable,Deployed_Air_Bag_from_Front,Totally_Ejected,Unknown,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,White,Fatal_Injury
2,Alabama,43,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Black,Fatal_Injury
3,Alabama,38,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Front_Seat_-_Right_Side,Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Incapaciting_Injury
4,Alabama,50,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),Lap_and_Shoulder_Belt,Deployed_Air_Bag_from_Front,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Non-Hispanic,Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Black,Fatal_Injury


Prints the last five rows of the dataset

In [55]:
data.tail()

Unnamed: 0,CASE_STATE,AGE,SEX,PERSON_TYPE,SEATING_POSITION,RESTRAINT_SYSTEM-USE,AIR_BAG_AVAILABILITY/DEPLOYMENT,EJECTION,EJECTION_PATH,EXTRICATION,...,DRUG_TEST_RESULTS_(2_of_3),DRUG_TEST_TYPE_(3_of_3),DRUG_TEST_RESULTS_(3_of_3),HISPANIC_ORIGIN,TAKEN_TO_HOSPITAL,RELATED_FACTOR_(1)-PERSON_LEVEL,RELATED_FACTOR_(2)-PERSON_LEVEL,RELATED_FACTOR_(3)-PERSON_LEVEL,RACE,INJURY_SEVERITY
100963,Wyoming,10,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Left_Side,Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Possible_Injury
100964,Wyoming,9,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Right_Side,Lap_and_Shoulder_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Possible_Injury
100965,Wyoming,7,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Middle,Lap_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Possible_Injury
100966,Wyoming,4,Female,Passenger_of_a_Motor_Vehicle_in_Transport,Second_Seat_-_Middle,Lap_Belt,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Not_a_Fatality_(Not_Applicable),Yes,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_a_Fatality_(Not_Applicable),Possible_Injury
100967,Wyoming,61,Male,Driver,Front_Seat_-_Left_Side_(Drivers_Side),None_Used/Not_Applicable,Air_Bag_Not_Available_for_this_Seat,Not_Ejected,Not_Ejected/Not_Applicable,Not_Extricated,...,0,Not_Tested_for_Drugs,0,Hispanic_-_Origin_Not_Specified_or_Other_Origin,No,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,Not_Applicable_-_Driver/None_-_All_Other_Persons,White,Fatal_Injury


#Data Cleaning

In the process of data cleaning we are preparing the quality data which will be useful in training the models further.It involves different aspect like  checking missing values,eliminating Duplicates.


##Check for Null Values
isnull().sum()-function calculates  the number of missing values present in each column        
len(data)-this function returns the number of rows in each column    
In the percent_missing we are calculates the percentage of missing values present in each column.

In [56]:
percent_missing = round(data.isnull().sum()/len(data)*100,2)
percent_missing

CASE_STATE                             0.0
AGE                                    0.0
SEX                                    0.0
PERSON_TYPE                            0.0
SEATING_POSITION                       0.0
RESTRAINT_SYSTEM-USE                   0.0
AIR_BAG_AVAILABILITY/DEPLOYMENT        0.0
EJECTION                               0.0
EJECTION_PATH                          0.0
EXTRICATION                            0.0
NON_MOTORIST_LOCATION                  0.0
POLICE_REPORTED_ALCOHOL_INVOLVEMENT    0.0
METHOD_ALCOHOL_DETERMINATION           0.0
ALCOHOL_TEST_TYPE                      0.0
ALCOHOL_TEST_RESULT                    0.0
POLICE-REPORTED_DRUG_INVOLVEMENT       0.0
METHOD_OF_DRUG_DETERMINATION           0.0
DRUG_TEST_TYPE_(1_of_3)                0.0
DRUG_TEST_RESULTS_(1_of_3)             0.0
DRUG_TEST_TYPE_(2_of_3)                0.0
DRUG_TEST_RESULTS_(2_of_3)             0.0
DRUG_TEST_TYPE_(3_of_3)                0.0
DRUG_TEST_RESULTS_(3_of_3)             0.0
HISPANIC_OR

##Check for duplicate values
In the below code we are finding the duplicate rows in the dataset.

In [57]:
data.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
100963    False
100964    False
100965    False
100966    False
100967    False
Length: 100968, dtype: bool

##Remove Duplicate values from the dataset

Droping the duplicates values present in the dataset

In [58]:
data.drop_duplicates(inplace=True)

#Size of dataset after removing null and duplicate

After removing the duplicates number of rows are 93004 and columns are 30.

In [59]:
data.shape

(93004, 30)

#Report

In [60]:
pip install sweetviz




In [23]:
import sweetviz as sv
import pandas as pd
report = sv.analyze(data)
report.show_html('report.html')

                                             |          | [  0%]   00:00 -> (? left)

Report report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


#correlation matrix


Through the correlation matrix we are depicting the relation between the columns which has datatype has int64

In [61]:
correlation_matrix =data
correlation_matrix = data.corr()
for column in correlation_matrix.columns:
    print(f"Correlation of '{column}' with other columns:")
    print(correlation_matrix[column])
    print("\n")

Correlation of 'AGE' with other columns:
AGE                           1.000000
ALCOHOL_TEST_RESULT          -0.074804
DRUG_TEST_RESULTS_(1_of_3)    0.029645
DRUG_TEST_RESULTS_(2_of_3)    0.024178
DRUG_TEST_RESULTS_(3_of_3)    0.025266
Name: AGE, dtype: float64


Correlation of 'ALCOHOL_TEST_RESULT' with other columns:
AGE                          -0.074804
ALCOHOL_TEST_RESULT           1.000000
DRUG_TEST_RESULTS_(1_of_3)    0.035061
DRUG_TEST_RESULTS_(2_of_3)    0.081114
DRUG_TEST_RESULTS_(3_of_3)    0.104216
Name: ALCOHOL_TEST_RESULT, dtype: float64


Correlation of 'DRUG_TEST_RESULTS_(1_of_3)' with other columns:
AGE                           0.029645
ALCOHOL_TEST_RESULT           0.035061
DRUG_TEST_RESULTS_(1_of_3)    1.000000
DRUG_TEST_RESULTS_(2_of_3)    0.618869
DRUG_TEST_RESULTS_(3_of_3)    0.612399
Name: DRUG_TEST_RESULTS_(1_of_3), dtype: float64


Correlation of 'DRUG_TEST_RESULTS_(2_of_3)' with other columns:
AGE                           0.024178
ALCOHOL_TEST_RESULT        

  correlation_matrix = data.corr()


#Encoding


##One Hot Encoding & Label Encoding

Machine learning understand numerical data rather than string.
In our data set we have combination of numerical as well as strig data.Through one hot encoding we are converting categorical data into the numeric.      
Out of 30 columns 24 columns are encoding using one hot encoding.      
On the output column we are performing labelencoding.
And the rest five columns are in numerical format so there is no need to perform encoding on them.

First Label encoding is performed on the injury severity columns.
In the next step we have dropped integer columns of the dataframe.
Further one hot encoding is performed on the rest 24 columns and resetting the index.    
At the last we are concatinating "df_encoded" which has the 24 columns on which one hot encoding is performed with "includecolumn" which has 5 colums which are of type numerical.
So "df_xyz" has now 362 columns and further concatinated with "column_to_drop" which has the injury severity column.

In [62]:


le = LabelEncoder()
data['INJURY_SEVERITY']= le.fit_transform(data['INJURY_SEVERITY'])

columns_to_drop = ['AGE', 'INJURY_SEVERITY','ALCOHOL_TEST_RESULT','DRUG_TEST_RESULTS_(1_of_3)','DRUG_TEST_RESULTS_(2_of_3)','DRUG_TEST_RESULTS_(3_of_3)']
df_temp = data.drop(columns=columns_to_drop)

column_to_drop = data['INJURY_SEVERITY']
categorical_columns = df_temp.select_dtypes(include=['object']).columns
df_subset = df_temp[categorical_columns]

encoder = OneHotEncoder(sparse=False)
one_hot_encoded = encoder.fit_transform(df_subset)
df_encoded = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(df_subset.columns))

includecolumn=data[['AGE','ALCOHOL_TEST_RESULT','DRUG_TEST_RESULTS_(1_of_3)','DRUG_TEST_RESULTS_(2_of_3)','DRUG_TEST_RESULTS_(3_of_3)']]


includecolumn.reset_index(drop=True, inplace=True)
df_encoded.reset_index(drop=True, inplace=True)


df_xyz = pd.concat([includecolumn,df_encoded], axis=1)

df_xyz.reset_index(drop=True, inplace=True)
column_to_drop.reset_index(drop=True, inplace=True)


df_xyz1 = pd.concat([df_xyz,column_to_drop], axis=1)
df_temp.shape





(93004, 24)

total number of rows are 93004 and columns are 363 after doing one hot encoding and label encoding

In [63]:
df_xyz1.shape

(93004, 363)

For futher processing the data is divided into two sets 'X' and 'Y'.       
X which has 29 columns and Y which has Injury Severity column.



In [27]:
X=df_xyz1.drop(columns=['INJURY_SEVERITY'])
Y=df_xyz1['INJURY_SEVERITY']

nunique method in pandas is used to find out unique values present in the dataset.
Y dataset has 8 unique values

In [28]:
Y.nunique()

8

In [29]:
X.nunique()

AGE                                                       99
ALCOHOL_TEST_RESULT                                       69
DRUG_TEST_RESULTS_(1_of_3)                                95
DRUG_TEST_RESULTS_(2_of_3)                                73
DRUG_TEST_RESULTS_(3_of_3)                                59
                                                          ..
RACE_Other_Indian_(Includes_South_and_Central_America)     2
RACE_Samoan                                                2
RACE_Unknown                                               2
RACE_Vietnamese                                            2
RACE_White                                                 2
Length: 362, dtype: int64

#Split dataset into training set and test set
####training dataset consist of 80 percentage & testing dataset consist of 20 percentage

In [64]:
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2, random_state=42)

In [65]:
X.shape

(93004, 362)

#OverSampler


An oversampler in machine learning refers to a technique used to tackle class imbalance in datasets.

##RandomOverSampler

Purpose of using RandomOverSampler is to prevent the machine learning model from being biased towards the majority class and to improve its ability to learn from the minority class.
As a input X_train and y_trai are given as input.
after oversampling the data is in this two varibales.
X_train---->X_random_oversampled
y_train---->y_random_oversampled

In [66]:
from imblearn.over_sampling import RandomOverSampler
from collections import Counter

random_oversampler = RandomOverSampler(random_state=42)

x_random_oversampled, y_random_oversampled = random_oversampler.fit_resample(X_train,y_train)

print('Original dataset shape', Counter(Y))
print('Resampled dataset shape', Counter(y_random_oversampled))

Original dataset shape Counter({1: 41442, 4: 15642, 2: 14230, 5: 12945, 6: 8104, 7: 399, 3: 233, 0: 9})
Resampled dataset shape Counter({4: 33224, 1: 33224, 5: 33224, 6: 33224, 2: 33224, 7: 33224, 3: 33224, 0: 33224})


#Data Normalised  

Normalization refers to the process of scaling numerical features to a standard range, typically between 0 and 1 or within a specific range. Normalizing data is crucial in machine learning to ensure that features with different scales contribute equally to the analysis and model training.

---



##MinMaxScaler

After over sampling is donw now data need to be normalised for this we are using MinMax Scaler which transform the features between 0 and 1.          
x_random_oversampled & X_test to transform the features as states in the below code.

In [67]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(x_random_oversampled)

X_test_scaled = scaler.transform(X_test)

#Decision Tree

#####Decision tree in machine learning is a supervised learning algorithm used for both classification and regression tasks.It has tree like strcture.Tree-like model where each internal node represents a decision based on a feature

##Decision tree with hyperparameter

min_samples_split->minimum number of samples required to split an internal node.parametersare(1,2,4)
min_samples_leaf->minimum number of samples required to be at a leaf node. Nodes that would create leaf nodes containing fewer samples.parameters are(2,5,10)
max_depth->maximum depth that the tree can grow to during the learning process.parameters are(None,5,10,15)

In [22]:


param_grid = {
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

dt_classifier = DecisionTreeClassifier(random_state=42)

grid_search = GridSearchCV(estimator=dt_classifier, param_grid=param_grid, cv=3, scoring='accuracy', verbose=1)

grid_search.fit(X_train_scaled, y_random_oversampled)

best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

best_dt = DecisionTreeClassifier(random_state=42, **best_params)
best_dt.fit(X_train_scaled, y_random_oversampled)

predictions = best_dt.predict(X_test_scaled)

print(classification_report(y_test,predictions))

Fitting 3 folds for each of 36 candidates, totalling 108 fits
Best Hyperparameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         4
           1       1.00      1.00      1.00      8218
           2       0.50      0.51      0.50      2881
           3       0.27      0.33      0.30        39
           4       0.83      0.80      0.81      3108
           5       0.37      0.36      0.36      2615
           6       0.23      0.24      0.23      1650
           7       0.40      0.45      0.42        86

    accuracy                           0.73     18601
   macro avg       0.45      0.46      0.45     18601
weighted avg       0.73      0.73      0.73     18601



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


##Decision Tree without Hyperparameter

In [68]:
decision_tree = DecisionTreeClassifier(random_state=42)

decision_tree.fit(X_train_scaled, y_random_oversampled)

y_pred = decision_tree.predict(X_test_scaled)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         4
           1       1.00      1.00      1.00      8218
           2       0.50      0.51      0.50      2881
           3       0.27      0.33      0.30        39
           4       0.83      0.80      0.81      3108
           5       0.37      0.36      0.36      2615
           6       0.23      0.24      0.23      1650
           7       0.40      0.45      0.42        86

    accuracy                           0.73     18601
   macro avg       0.45      0.46      0.45     18601
weighted avg       0.73      0.73      0.73     18601



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


#Random Forest

RandomForest model is a machine learnig model which is used both for classification and regression.It outpouts the mean prediction (for regression) or the mode prediction (for classification).

##RandomForest with hyperparameter

Hyperparameter Tunning is finding the best hyper parameter to improve the models performance.The hyperparameter used are           
n_estimators-> which determines the  number of decision tree that will be used in the forest
max_depth->controlling the complexity of the tree and preventing overfitting.
min_samples_split->minimum number of samples required to split an internal node during the construction of individual decision trees within the forest hyperparameters are
n_estimators->100, 200, 300
max_depth : 5, 10, 15
min_samples_split: 2, 5, 10

In [None]:


param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]

}
rf = RandomForestClassifier()

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='accuracy', verbose=2, n_jobs=-1)
grid_search.fit(X_train_scaled, y_random_oversampled)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

accuracy = best_model.score(X_test_scaled, y_test)
print(f"Accuracy: {accuracy}")


##Random Forest without Hyperparameter


In [37]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

rf_classifier = RandomForestClassifier(random_state=42)

rf_classifier.fit(X_train_scaled, y_random_oversampled)


predictions = rf_classifier.predict(X_test)

conf_matrix = confusion_matrix(y_test, predictions)
print("Confusion Matrix:")
print(conf_matrix)


print(                                        )
print(classification_report(y_test, y_pred))




Confusion Matrix:
[[   0    0    0    0    3    1    0    0]
 [   0 8214    0    0    0    0    0    4]
 [   0    0 1800   14    1  785  278    3]
 [   0    0    8   15    1   10    5    0]
 [  11    0    5    0 2830   32  175   55]
 [   0    0 1135   13  119  964  363   21]
 [   0    0  406    8  429  422  374   11]
 [   0    1    3    0   35    4    4   39]]

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         4
           1       1.00      1.00      1.00      8218
           2       0.50      0.51      0.50      2881
           3       0.27      0.33      0.30        39
           4       0.83      0.80      0.81      3108
           5       0.37      0.36      0.36      2615
           6       0.23      0.24      0.23      1650
           7       0.40      0.45      0.42        86

    accuracy                           0.73     18601
   macro avg       0.45      0.46      0.45     18601
weighted avg       0.73      0.73      0

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


#Support Vector Machine

Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for both classification and regression tasks,mostly used for classification.SVM helps in drawing a decision boundary between data points belonging to different categories by finding the best possible

##Without hyperparameters

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

svm_classifier = SVC(kernel='rbf', random_state=42)

svm_classifier.fit(X_train_scaled, y_random_oversampled)

predictions = svm_classifier.predict(X_test_scaled)

print("Classification Report:")
print(classification_report(y_test, predictions))


##With Hyperparameter
C->controls the trade-off between maximizing the margin and minimizing the classification error.
kernel->sed to map data into a higher-dimensional space
gamma->influences the reach of the kernel and the impact of a single training
degree->degree of the polynomial kernel function
hyperparameters are
C: 0.1, 1, 10, 100
kernel:linear,rbf,poly
gamma:scale,auto
degree: 2, 3, 4, 5

In [None]:



svm_classifier = SVC()

param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto'],
    'degree': [2, 3, 4, 5]
}

grid_search = GridSearchCV(svm_classifier, param_grid, cv=5)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Score:", best_score)

best_svm = grid_search.best_estimator_
test_score = best_svm.score(X_test, y_test)
print("Test Set Score with Best Parameters:", test_score)


#Cross Validation

Cross-validation is a technique used in machine learning to assess how well a trained model generalizes to new
Cross-validation divides the input data in sets and then train the model to find out the best possible accuracy.
CV='5' specifies number of folds to be perofrmed

##Support Vector Machine

In [69]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

dt_classifier = DecisionTreeClassifier(random_state=42)

scores = cross_val_score(dt_classifier, X, Y, cv=5)

print("Cross-Validation Scores:", scores)
print("Average Cross-Validation Score:", scores.mean())

Cross-Validation Scores: [0.63727757 0.71936993 0.695554   0.63362185 0.72150538]
Average Cross-Validation Score: 0.6814657438350233


##Random Forest

In [None]:
from sklearn.model_selection import cross_val_score

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)


scores = cross_val_score(rf_classifier, X, Y, cv=5)


print("Cross-Validation Scores:", scores)
print("Average Cross-Validation Score:", scores.mean())

Cross-Validation Scores: [0.73302511 0.7421644  0.74813182 0.77431321 0.75870968]
Average Cross-Validation Score: 0.7512688426394003


##Decision Tree

In [None]:
from sklearn.model_selection import cross_val_score

dt_classifier = DecisionTreeClassifier(random_state=42)

scores = cross_val_score(dt_classifier, X, y, cv=5)


print("Cross-Validation Scores:", scores)
print("Average Cross-Validation Score:", scores.mean())

Cross-Validation Scores: [0.9  0.85 0.83 0.86 0.87]
Average Cross-Validation Score: 0.8619999999999999


#Conclusion

