Assessment Task


Suggested possible analysis 

- What are the most important features for predicting X as a target variable?
- Which classification approach do you prefer for the prediction of X as a target variable, and why?
  - How to classify the loyal and churn customers using Support Vector Machines?
- Why is dimensionality reduction important in machine learning?

The pair would need to consider the following instructions (a - d) during the development of this pair project.
a) Logical justification based on the reasoning for the specific choice of machine learning approaches.
b) Multiple machine learning approaches (at least two) using hyperparameters and a comparison between the chosen modelling approaches.
c) Visualise your comparison of ML modelling outcomes. You may use a statistical approach to argue that one feature is more important than other features (for example, using PCA).
d) Cross-validation methods should be used to justify the authenticity of your ML results.
Your pair will present their findings and defend the results in the report (MS Doc or Open word format) by highlighting their individual contribution. Your report should capture the following aspects that are relevant to your project investigations.
1. Motivation, a description of the problem domain, and an explanation of how the project's goals are justified using Prediction / Classification / Clustering Rules / Dimensionality Reduction etc. (10 marks)
2. Characterization of data, explanation and description of techniques used for the variation in the accuracy across three training splits (10% / 20% / 30%) using cross validation techniques. (30 marks)
3. Interpret and explain the results obtained, discuss overfitting / underfitting / generalisation, provide a rationale for the chosen model and use visualisations to support your findings. Comments in Python code, conclusions of the project should be specified at the end of the report. Harvard Style must be used for citations and references. (20 marks)
4. Each team member presents a PowerPoint presentation of their work (maximum 5 slides) to emphasize their distinctive contributions based on their involvement in the project's conceptual understanding, code development, and deployment. (20 marks individual)
5. Each team member fully describes their individual contributions to the project in a reflective journal, using at least 500 to 700 words as well as images, diagrams, figures, and visualizations to elaborate his/her work. (20 marks individual)
Submission Requirements
All assessment submissions must meet the minimum requirements listed below. Failure to do so may have implications for the marks awarded.
● The code and datasets should be provided and uploaded in zip format on Moodle.
● Clearly detail the number of words used in the report.
● Number of Words in the report (2000 words +/-10%) excluding diagrams, code, references and titles. Number of words used to express individual contributions is part of the mentioned words.
● In the case of individual submission, students will submit a (1000 words +/-10%) report.
● Describe the contribution of each team member in the project clearly and use a bar chart or pie chart to represent the effort and time spent during this project. Use version control like Github or any other tool to show the progress of both team members in CA1. You should have at least 5 commits on Github before submission.
● The rubric is provided for the detailed breakdown of marks at the end of this CA1.
● Use Harvard Referencing when citing third party material.
● Be the student’s own work.
● Include the CCT assessment cover page.
● Be submitted by the deadline date specified or be subject to late submission penalties.

● Note: The names of pair members must be uploaded via the link provided on Moodle by 15th October 2023 (23:59).
● The number of words used must be clearly specified after each section in the report.

# DO CROSS VALIDATION AND ANN

# Libraries

In [1]:
%matplotlib inline

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score

from sklearn import metrics

from sklearn.preprocessing import MinMaxScaler

from sklearn.metrics import mean_squared_error, r2_score

from sklearn.metrics import classification_report

from sklearn.metrics import accuracy_score

# EDA

In [2]:
bike = pd.read_csv("Seoul_Bike.csv")

DOI
10.24432/C5F62R
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.

https://archive.ics.uci.edu/dataset/560/seoul+bike+sharing+demand


In [3]:
bike.head(3)

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day
0,01/12/2017,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
1,01/12/2017,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
2,01/12/2017,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Yes


In [4]:
bike.shape

(8760, 14)

In [5]:
bike.describe()

Unnamed: 0,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm)
count,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0
mean,704.602055,11.5,12.882922,58.226256,1.724909,1436.825799,4.073813,0.569111,0.148687,0.075068
std,644.997468,6.922582,11.944825,20.362413,1.0363,608.298712,13.060369,0.868746,1.128193,0.436746
min,0.0,0.0,-17.8,0.0,0.0,27.0,-30.6,0.0,0.0,0.0
25%,191.0,5.75,3.5,42.0,0.9,940.0,-4.7,0.0,0.0,0.0
50%,504.5,11.5,13.7,57.0,1.5,1698.0,5.1,0.01,0.0,0.0
75%,1065.25,17.25,22.5,74.0,2.3,2000.0,14.8,0.93,0.0,0.0
max,3556.0,23.0,39.4,98.0,7.4,2000.0,27.2,3.52,35.0,8.8


In [6]:
bike.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Date                       8760 non-null   object 
 1   Rented Bike Count          8760 non-null   int64  
 2   Hour                       8760 non-null   int64  
 3   Temperature(°C)            8760 non-null   float64
 4   Humidity(%)                8760 non-null   int64  
 5   Wind speed (m/s)           8760 non-null   float64
 6   Visibility (10m)           8760 non-null   int64  
 7   Dew point temperature(°C)  8760 non-null   float64
 8   Solar Radiation (MJ/m2)    8760 non-null   float64
 9   Rainfall(mm)               8760 non-null   float64
 10  Snowfall (cm)              8760 non-null   float64
 11  Seasons                    8760 non-null   object 
 12  Holiday                    8760 non-null   object 
 13  Functioning Day            8760 non-null   object 

## Changing from categorical to numerical

In [7]:
bike["Seasons"].unique()

array(['Winter', 'Spring', 'Summer', 'Autumn'], dtype=object)

In [8]:
bike["Holiday"].unique()

array(['No Holiday', 'Holiday'], dtype=object)

In [9]:
bike["Functioning Day"].unique()

array(['Yes', 'No'], dtype=object)

In [10]:
# Map each season to an integer code; assigning the result back avoids chained in-place replacement
bike['Seasons'] = bike['Seasons'].map({'Winter': 0, 'Spring': 1, 'Summer': 2, 'Autumn': 3})

In [11]:
bike['Seasons'].tail(2)

8758    3
8759    3
Name: Seasons, dtype: int64

In [12]:
# Encode Holiday as 0 (No Holiday) / 1 (Holiday); assigning back avoids chained in-place replacement
bike["Holiday"] = bike["Holiday"].map({'No Holiday': 0, 'Holiday': 1})

In [13]:
bike["Holiday"].tail(2)

8758    0
8759    0
Name: Holiday, dtype: int64

In [14]:
# Encode Functioning Day as 0 (No) / 1 (Yes); assigning back avoids chained in-place replacement
bike["Functioning Day"] = bike["Functioning Day"].map({'No': 0, 'Yes': 1})

In [15]:
bike["Functioning Day"].tail(2)

8758    1
8759    1
Name: Functioning Day, dtype: int64
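
The three encodings above could also be driven from a single dictionary of mappings. The sketch below is illustrative only: `toy` is a hypothetical four-row frame, not the bike data, and the one-hot variant at the end is an assumption about what might suit models that would otherwise read the season codes as an ordering.

```python
import pandas as pd

# Hypothetical toy frame standing in for the three categorical columns.
toy = pd.DataFrame({
    "Seasons": ["Winter", "Spring", "Summer", "Autumn"],
    "Holiday": ["No Holiday", "Holiday", "No Holiday", "No Holiday"],
    "Functioning Day": ["Yes", "Yes", "No", "Yes"],
})

# Same integer codes as used in this notebook.
encodings = {
    "Seasons": {"Winter": 0, "Spring": 1, "Summer": 2, "Autumn": 3},
    "Holiday": {"No Holiday": 0, "Holiday": 1},
    "Functioning Day": {"No": 0, "Yes": 1},
}

encoded = toy.copy()
for column, mapping in encodings.items():
    encoded[column] = encoded[column].map(mapping)

# Alternative for Seasons: one-hot columns, so no ordering is implied.
one_hot = pd.get_dummies(toy["Seasons"], prefix="Season")

print(encoded)
print(one_hot.columns.tolist())
```

Integer codes keep the frame compact; one-hot columns may be preferable for linear models, at the cost of extra features.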

# Graphics 

sns.histplot(data = bike, y = "Rented Bike Count", x = "Hour")

sns.histplot(data = bike, x = "Holiday")

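Far more records fall on non-holidays than on holidays, which may suggest that locals use the bikes more than tourists; note that this plot counts hourly records, not rentals.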

sns.histplot(data = bike, x = "Rented Bike Count")

sns.histplot(data = bike, x = "Temperature(°C)")

sns.histplot(data = bike, x = "Humidity(%)")



sns.histplot(data = bike, x = "Wind speed (m/s)")



sns.histplot(data = bike, x = "Visibility (10m)")

sns.histplot(data = bike, x = "Dew point temperature(°C)")

sns.histplot(data = bike, x = "Solar Radiation (MJ/m2)")


sns.histplot(data = bike, x = "Rainfall(mm)")

sns.histplot(data = bike, x = "Snowfall (cm)")

sns.histplot(data = bike, x = "Seasons")

The seasons have roughly the same number of observations, but the number of rented bikes differs between seasons.

bike.head(1)

bike.plot(x = "Temperature(°C)", y = "Rented Bike Count", style = "*")
plt.title("Rented Bike Count vs Temperature(°C)")
plt.xlabel("Temperature(°C)")
plt.ylabel("Rented Bike Count")
plt.show()

bike.plot(x = "Hour", y = "Rented Bike Count", style = "*")
plt.title("Rented Bike Count vs Hour")
plt.xlabel("Hour")
plt.ylabel("Rented Bike Count")
plt.show()

sns.histplot(data = bike, x = "Seasons", y = "Rented Bike Count")

Season codes: Winter = 0, Spring = 1, Summer = 2, Autumn = 3

sns.histplot(data = bike, x = "Seasons", y = "Temperature(°C)")

sns.boxplot(data = bike["Rented Bike Count"])

sns.boxplot(data = bike["Temperature(°C)"])

sns.boxplot(data = bike["Visibility (10m)"])

# Correlation

In [16]:
bike.head(3)

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day
0,01/12/2017,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,0,0,1
1,01/12/2017,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,0,0,1
2,01/12/2017,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,0,0,1


In [17]:
bike_d = bike.drop(["Date"], axis = 1)

In [18]:
bike_d.head(3)

Unnamed: 0,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day
0,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,0,0,1
1,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,0,0,1
2,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,0,0,1


In [19]:
corr_matrix = np.corrcoef(bike_d, rowvar = False)

# Label the axes with column positions; the notes below refer to these indices
n_features = corr_matrix.shape[0]
sns.heatmap(corr_matrix, annot = True, cmap = 'coolwarm', fmt = '.2f', 
            xticklabels = range(n_features), yticklabels = range(n_features))
plt.title("Feature Correlation Matrix")
plt.show()

The highest correlations (column indices of bike_d in parentheses) are between: 
- Rented Bike Count & Temperature(°C) = 0.54 (0 & 2)
- Temperature(°C) & Dew point temperature(°C) = 0.91 (2 & 6)
- Temperature(°C) & Seasons = 0.59 (2 & 10)
- Humidity(%) & Visibility (10m) = 0.54 (3 & 5)
- Humidity(%) & Dew point temperature(°C) = 0.54 (3 & 6)
- Dew point temperature(°C) & Seasons = 0.58 (6 & 10)

In [None]:
def find_highly_correlated_vars(corr_matrix, threshold = 0.5):
    if isinstance(corr_matrix, pd.DataFrame):
        corr_matrix = corr_matrix.values
    rows, cols = np.where(np.abs(corr_matrix) > threshold)
    unique_pairs = set((min(r, c), max(r, c)) for r, c in zip(rows, cols) if r != c)
    return list(unique_pairs)

corr_matrix = bike_d.corr()
highly_corr_vars = find_highly_correlated_vars(corr_matrix, threshold=0.5)

print("Pairs of highly correlated variables:", highly_corr_vars)
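
The function above returns pairs of column positions. A variant that reports column names and coefficients may be easier to read against the notes; the sketch below is one possible way to do that. `named_correlated_pairs` and the synthetic `demo` frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def named_correlated_pairs(frame, threshold=0.5):
    """Return (col_a, col_b, r) for every column pair with |r| above the threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Synthetic demo data (not the bike dataset): b tracks a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),
    "c": rng.normal(size=500),
})

pairs = named_correlated_pairs(demo)
print(pairs)
```

On the notebook data the call would be `named_correlated_pairs(bike_d, threshold=0.5)`, assuming `bike_d` contains only numeric columns at that point.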

In [None]:
bike_d.head(1)

In [None]:
sns.pairplot(bike[["Rented Bike Count", "Temperature(°C)"]])

#sns.pairplot(bike[["Rented Bike Count", "Seasons"]])

sns.pairplot(bike[["Temperature(°C)", "Dew point temperature(°C)"]])

sns.pairplot(bike[["Temperature(°C)", "Seasons"]])

sns.pairplot(bike[["Humidity(%)", "Visibility (10m)"]])

sns.pairplot(bike[["Humidity(%)", "Dew point temperature(°C)"]])

sns.pairplot(bike[["Dew point temperature(°C)", "Seasons"]])

- Rented Bike Count & Temperature(°C) = 0.54 (0 & 2)
- Temperature(°C) & Dew point temperature(°C) = 0.91 (2 & 6)
- Temperature(°C) & Seasons = 0.59 (2 & 10)
- Humidity(%) & Visibility (10m) = 0.54 (3 & 5)
- Humidity(%) & Dew point temperature(°C) = 0.54 (3 & 6)
- Dew point temperature(°C) & Seasons = 0.58 (6 & 10)

Multicollinearity
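
One common way to quantify multicollinearity, such as the 0.91 correlation between temperature and dew point noted above, is the variance inflation factor (VIF). The sketch below is a minimal illustration using scikit-learn only. `variance_inflation_factors` is a hypothetical helper and the data is synthetic; the column names are placeholders, not the bike columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def variance_inflation_factors(frame):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the others."""
    vifs = {}
    for col in frame.columns:
        others = frame.drop(columns=[col])
        r2 = LinearRegression().fit(others, frame[col]).score(others, frame[col])
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

# Synthetic stand-in: "dew" is built mostly from "temp", "wind" is independent.
rng = np.random.default_rng(1)
temp = rng.normal(size=1000)
demo = pd.DataFrame({
    "temp": temp,
    "dew": 0.9 * temp + 0.3 * rng.normal(size=1000),
    "wind": rng.normal(size=1000),
})

vif = variance_inflation_factors(demo)
print(vif.round(2))
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity; the exact cut-off is a judgment call.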

bike_split = bike.drop(["Rented Bike Count", "Date"], axis = 1)

rbc = bike[["Rented Bike Count"]]

rbc.head()

X = bike_split

y = rbc

X.head(3)

y.head(3)

bike_array = bike_d.values

In [None]:
X = bike_d.iloc[:, 1:]  # All rows, columns from index 1 onwards (excluding the first column)
y = bike_d.iloc[:, 0]   # All rows, only the first column

In [None]:
y

In [None]:
X

# Display number of rows and columns
X.shape, y.shape

X = bike_array[:, 1:13]

y = bike_array[:, 0:1]

X

X.shape

y.shape

y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 50)

In [None]:
X.shape, y.shape, X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
scaler = MinMaxScaler()

scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)

X_test_scaled  = scaler.transform(X_test)

X_scaled = scaler.transform(X)

# PCA

from sklearn.decomposition import PCA

pca = PCA()

pca.fit(X_scaled)

variance = pca.explained_variance_ratio_

var = np.cumsum(np.round(pca.explained_variance_ratio_, decimals = 3)*100)



plt.figure()
# Cumulative explained variance against the number of components (1-based)
n_comp = np.arange(1, len(pca.explained_variance_ratio_) + 1)
plt.plot(n_comp, np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("Number of Components")
plt.xticks(n_comp, fontsize = 15)
plt.ylabel("Cumulative explained variance (fraction)")
plt.yticks(np.arange(0.0, 1.05, 0.1), fontsize = 15)
plt.title("Cumulative Explained Variance (scaled features)")
plt.show()

pca = PCA(n_components = 6)

bike_pca = pca.fit_transform(X_scaled)

bike_pca = pd.DataFrame(bike_pca)

bike_pca.head()

bike_concat = pd.concat([bike_pca, rbc], axis = 1)

bike_concat.head()

bike_array = bike_concat.values

X = bike_array[:,0:6]

y = bike_array[:,6]

y

X

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 12)

X.shape, y.shape, X_train.shape, X_test.shape, y_train.shape, y_test.shape
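
In the cells above, PCA is fitted on `X_scaled`, which contains every row, before the data is split, so the test rows influence the components. A possible alternative is to chain the scaler, PCA and the model in a scikit-learn `Pipeline`, which fits each step on the training rows only. The sketch below uses synthetic data from `make_regression` as a stand-in for the 12 bike features; the variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 12 bike features (assumption, not the real data).
X_demo, y_demo = make_regression(n_samples=500, n_features=12, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Scaler and PCA are fitted on the training rows only when pipe.fit is called.
pipe = make_pipeline(MinMaxScaler(), PCA(n_components=6), LinearRegression())
pipe.fit(X_tr, y_tr)

print("Test R-squared:", pipe.score(X_te, y_te))
```

The same pipeline object can be passed to `cross_val_score` or `GridSearchCV`, in which case the preprocessing is refitted inside every fold.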

# Models

# Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
linear_reg = LinearRegression()

linear_reg.fit(X_train, y_train)

yp_lr = linear_reg.predict(X_test)

mse = mean_squared_error(y_test, yp_lr)
r2 = r2_score(y_test, yp_lr)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [None]:
pred = pd.DataFrame({'Actual': y_test, 'Predicted': yp_lr})
pred.head()
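
The brief asks for accuracy to be compared across three splits (10% / 20% / 30%) with cross-validation, and the note at the top of the notebook lists cross-validation as still to do. The sketch below shows one way this could look for the linear model. It assumes the percentages are test-set fractions and uses synthetic data; on the notebook data, `X` and `y` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the six PCA components and the rental counts.
X_demo, y_demo = make_regression(n_samples=600, n_features=6, noise=15.0, random_state=12)

results = {}
for test_size in (0.1, 0.2, 0.3):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=12
    )
    model = LinearRegression()
    # 5-fold cross-validation on the training part only.
    cv = KFold(n_splits=5, shuffle=True, random_state=12)
    cv_scores = cross_val_score(model, X_tr, y_tr, cv=cv, scoring="r2")
    # Held-out score on the test part for comparison.
    test_r2 = model.fit(X_tr, y_tr).score(X_te, y_te)
    results[test_size] = (cv_scores.mean(), cv_scores.std(), test_r2)
    print(f"test_size={test_size:.1f}  CV R2={cv_scores.mean():.3f} "
          f"+/- {cv_scores.std():.3f}  test R2={test_r2:.3f}")
```

A large gap between the cross-validated score and the held-out score would be one hint of overfitting or of an unlucky split.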

## Polynomial 2 and 3

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
degree = 2
poly_features = PolynomialFeatures(degree=degree)

X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)

yp_p2 = poly_reg.predict(X_test_poly)

mse = mean_squared_error(y_test, yp_p2)
r2 = r2_score(y_test, yp_p2)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [None]:
pred_p2 = pd.DataFrame({'Actual': y_test, 'Predicted': yp_p2})
pred_p2.head()

In [None]:
degree = 3
poly_features = PolynomialFeatures(degree=degree)

X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)

yp_p3 = poly_reg.predict(X_test_poly)

# Evaluate the model performance
mse = mean_squared_error(y_test, yp_p3)
r2 = r2_score(y_test, yp_p3)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

## Ridge

In [None]:
from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=1.0)  # Adjust alpha based on the strength of regularization
ridge_reg.fit(X_train, y_train)

yp_lnR = ridge_reg.predict(X_test)

mse = mean_squared_error(y_test, yp_lnR)
r2 = r2_score(y_test, yp_lnR)

# Print the results
print(f'Mean Squared Error (Ridge): {mse}')
print(f'R-squared (Ridge): {r2}')

In [None]:
pred_lnR = pd.DataFrame({'Actual': y_test, 'Predicted': yp_lnR})
pred_lnR.head()

In [None]:
degree = 2

poly = PolynomialFeatures(degree=degree, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

ridge_reg = Ridge(alpha = 10.0)  # Adjust alpha based on the strength of regularization
ridge_reg.fit(X_train_poly, y_train)

yp_rp2 = ridge_reg.predict(X_test_poly)

# Evaluate the Ridge model
mse_ridge = mean_squared_error(y_test, yp_rp2)
r2_ridge = r2_score(y_test, yp_rp2)

print(f'Mean Squared Error (Ridge): {mse_ridge}')
print(f'R-squared (Ridge): {r2_ridge}')

## Lasso

In [None]:
from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=1.0)  # Adjust alpha based on the strength of regularization
lasso_reg.fit(X_train, y_train)

yp_lnL = lasso_reg.predict(X_test)

mse = mean_squared_error(y_test, yp_lnL)
r2 = r2_score(y_test, yp_lnL)

print(f'Mean Squared Error (Lasso): {mse}')
print(f'R-squared (Lasso): {r2}')

In [None]:
degree = 2

poly = PolynomialFeatures(degree=degree, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
 
lasso_reg = Lasso(alpha = 10.0, max_iter = 100000)
lasso_reg.fit(X_train_poly, y_train)

# Make predictions on the test set
yp_lp2 = lasso_reg.predict(X_test_poly)

# Evaluate the Lasso model
mse_lasso = mean_squared_error(y_test, yp_lp2)
r2_lasso = r2_score(y_test, yp_lp2)

print(f'Mean Squared Error (Lasso): {mse_lasso}')
print(f'R-squared (Lasso): {r2_lasso}')
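
The `alpha` values for Ridge and Lasso above are set by hand. A cross-validated search is one way they might be chosen instead; the sketch below runs `GridSearchCV` over a small, arbitrary grid on synthetic data. The grid values and the data are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=20.0, random_state=12)

# Arbitrary illustrative grid of regularization strengths.
alphas = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

best = {}
for name, estimator in (("Ridge", Ridge()), ("Lasso", Lasso(max_iter=100000))):
    search = GridSearchCV(estimator, alphas, cv=5, scoring="r2")
    search.fit(X_demo, y_demo)
    best[name] = (search.best_params_["alpha"], search.best_score_)
    print(f"{name}: best alpha={best[name][0]}, mean CV R2={best[name][1]:.3f}")
```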

# Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(RandomForestRegressor(random_state = 12), param_grid, cv = 5, scoring = 'r2')
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

y_pred_best = best_model.predict(X_test)
r2_best = r2_score(y_test, y_pred_best)

print(f'Best Hyperparameters: {best_params}')
print(f'R-squared (Best Model): {r2_best}')

In [None]:
rf_regressor = RandomForestRegressor(n_estimators = 100, random_state = 12)

rf_regressor.fit(X_train, y_train)

In [None]:
yp_rf = rf_regressor.predict(X_test)

In [None]:
mse = mean_squared_error(y_test, yp_rf)
r2 = r2_score(y_test, yp_rf)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [None]:
pred_rf = pd.DataFrame({'Actual': y_test, 'Predicted': yp_rf})
pred_rf.head()
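
The brief asks which features matter most for the target. A random forest exposes `feature_importances_`, which gives one possible answer when the model is fitted on the original (non-PCA) columns; fitted on PCA components, as in the cells above, the importances describe components rather than features. The sketch below uses synthetic data with hypothetical column names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in with hypothetical column names (not the bike data).
rng = np.random.default_rng(12)
demo = pd.DataFrame({
    "temperature": rng.normal(size=800),
    "hour": rng.integers(0, 24, size=800),
    "noise": rng.normal(size=800),
})
# Target depends strongly on temperature, weakly on hour, and not on noise.
target = 50 * demo["temperature"] + 2 * demo["hour"] + rng.normal(scale=5, size=800)

forest = RandomForestRegressor(n_estimators=100, random_state=12)
forest.fit(demo, target)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be biased toward high-cardinality features, so `sklearn.inspection.permutation_importance` may be worth checking as a second opinion.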

# ANN

In [None]:
!pip install tensorflow
!pip install keras

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

In [None]:
model = keras.Sequential([
    # Input size follows the current training matrix (the PCA components)
    keras.layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1)  # Output layer with 1 neuron for regression
])

In [None]:
model.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
# Fit on X_train so the rows match y_train (X_train_scaled came from an earlier, different split)
model.fit(X_train, y_train, epochs = 200, batch_size = 64, validation_split = 0.2)

In [None]:
yp_ann = model.predict(X_test)

In [None]:
yp_ann

In [None]:
pred_ann = pd.DataFrame({'Actual': y_test.ravel(), 'Predicted': yp_ann.ravel()})

In [None]:
pred_ann

In [None]:
mse = mean_squared_error(y_test, yp_ann)
r2 = r2_score(y_test, yp_ann)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [None]:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()

# Add the input layer and the first hidden layer
model.add(Dense(units=64, activation='relu', input_dim=X_train.shape[1]))

# Add additional hidden layers if needed
model.add(Dense(units=32, activation='relu'))

model.add(Dense(units=16, activation='relu'))

model.add(Dense(units=8, activation='relu'))

# Output layer for regression (1 neuron)
model.add(Dense(units=1, activation='linear'))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

#from keras.layers import Dropout
#model.add(Dropout(0.5)) 

#from keras.regularizers import l1, l2
#model.add(Dense(units=64, activation='relu', input_dim=X_train.shape[1], kernel_regularizer=l2(0.01)))

from keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(monitor='val_loss', patience = 10, restore_best_weights=True)

# Train the model on X_train: it matches input_dim and y_train (X_train_scaled came from an earlier split)
model.fit(X_train, y_train, epochs = 300, batch_size = 64, validation_split=0.2, callbacks=[early_stopping])

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error (ANN): {mse}')
print(f'R-squared (ANN): {r2}')


# SVM

In [None]:
from sklearn import svm

In [None]:
regressor = svm.SVR(kernel = 'linear', C = 5.0, epsilon = 2.5)

regressor.fit(X_train, y_train)

yp_svm = regressor.predict(X_test)

mse = mean_squared_error(y_test, yp_svm)
r2 = r2_score(y_test, yp_svm)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [None]:
pred_svm = pd.DataFrame({'Actual': y_test, 'Predicted': yp_svm})
pred_svm.head()

In [None]:
regressor = svm.SVR(kernel = 'poly', C = 50.0, epsilon = 3.5, degree = 2)

regressor.fit(X_train, y_train)

yp_svm_p2 = regressor.predict(X_test)

mse = mean_squared_error(y_test, yp_svm_p2)
r2 = r2_score(y_test, yp_svm_p2)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [None]:
from sklearn.svm import LinearSVR

linear_svr_model = LinearSVR(C = 10.0, epsilon = 0.2)  # Adjust parameters based on your data
# Fit on X_train so the rows match y_train (X_train_scaled came from an earlier, different split)
linear_svr_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = linear_svr_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error (Linear SVR): {mse}')
print(f'R-squared (Linear SVR): {r2}')

In [None]:
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Reuse the existing X_train / X_test split so these scores stay comparable with the other models
# (re-splitting here would overwrite y_test and invalidate the earlier predictions).

# Standardize the features (important for SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train SVR models with different kernels
linear_svr = SVR(kernel='linear')
poly_svr = SVR(kernel='poly', degree=2)
rbf_svr = SVR(kernel='rbf')
sigmoid_svr = SVR(kernel='sigmoid')

# Fit the models
linear_svr.fit(X_train_scaled, y_train)
poly_svr.fit(X_train_scaled, y_train)
rbf_svr.fit(X_train_scaled, y_train)
sigmoid_svr.fit(X_train_scaled, y_train)

# Make predictions
y_pred_linear = linear_svr.predict(X_test_scaled)
y_pred_poly = poly_svr.predict(X_test_scaled)
y_pred_rbf = rbf_svr.predict(X_test_scaled)
y_pred_sigmoid = sigmoid_svr.predict(X_test_scaled)

# Evaluate the models
mse_linear = mean_squared_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)

mse_poly = mean_squared_error(y_test, y_pred_poly)
r2_poly = r2_score(y_test, y_pred_poly)

mse_rbf = mean_squared_error(y_test, y_pred_rbf)
r2_rbf = r2_score(y_test, y_pred_rbf)

mse_sigmoid = mean_squared_error(y_test, y_pred_sigmoid)
r2_sigmoid = r2_score(y_test, y_pred_sigmoid)

print(f'Mean Squared Error (Linear SVR): {mse_linear}, R-squared: {r2_linear}')
print(f'Mean Squared Error (Poly SVR): {mse_poly}, R-squared: {r2_poly}')
print(f'Mean Squared Error (RBF SVR): {mse_rbf}, R-squared: {r2_rbf}')
print(f'Mean Squared Error (Sigmoid SVR): {mse_sigmoid}, R-squared: {r2_sigmoid}')
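
The kernel comparison above relies on a single train/test split. A cross-validated version, with the scaler refitted inside each fold through a pipeline, might give a more stable ranking. The sketch below uses synthetic data, and the choice of `C=10.0` is an arbitrary assumption.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the six PCA components and the target.
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)

kernel_scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # The scaler is fitted on the training folds only, inside each CV split.
    pipe = make_pipeline(StandardScaler(), SVR(kernel=kernel, C=10.0))
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
    kernel_scores[kernel] = scores.mean()
    print(f"{kernel:8s} mean CV R2 = {scores.mean():.3f}")
```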


# KNN

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': [1, 3, 5, 7, 9]}  # Adjust the range based on your data
grid_search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

best_k = grid_search.best_params_['n_neighbors']
#knn_regressor = KNeighborsRegressor(n_neighbors=best_k)
#knn_regressor.fit(X_train, y_train)

best_k

In [None]:
knn_regressor = KNeighborsRegressor(n_neighbors = 7, metric='manhattan') 

knn_regressor.fit(X_train, y_train)

yp_knn = knn_regressor.predict(X_test)

mse = mean_squared_error(y_test, yp_knn)
r2 = r2_score(y_test, yp_knn)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [None]:
pred_knn = pd.DataFrame({'Actual': y_test, 'Predicted': yp_knn})
pred_knn.head()

In [None]:
from sklearn.ensemble import BaggingRegressor

bagging_regressor = BaggingRegressor(KNeighborsRegressor(n_neighbors = 4), n_estimators = 10, random_state=12)
bagging_regressor.fit(X_train, y_train)


yp_knn_bgg = bagging_regressor.predict(X_test)

mse = mean_squared_error(y_test, yp_knn_bgg)
r2 = r2_score(y_test, yp_knn_bgg)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

# Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_

In [None]:
best_params

In [None]:
tree_regressor = DecisionTreeRegressor(
    max_depth = 5,
    min_samples_split = 10,
    min_samples_leaf = 5,
    random_state = 12
)

tree_regressor.fit(X_train, y_train)

yp_dt = tree_regressor.predict(X_test)

mse = mean_squared_error(y_test, yp_dt)
r2 = r2_score(y_test, yp_dt)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [None]:
tree_regressor = DecisionTreeRegressor(
    max_depth = 6,
    min_samples_split = 12,
    min_samples_leaf = 6,
    random_state = 12
)

tree_regressor.fit(X_train, y_train)

yp_dt = tree_regressor.predict(X_test)

mse = mean_squared_error(y_test, yp_dt)
r2 = r2_score(y_test, yp_dt)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [None]:
tree_regressor = DecisionTreeRegressor(
    max_depth = 7,
    min_samples_split = 14,
    min_samples_leaf = 7,
    random_state = 12
)

tree_regressor.fit(X_train, y_train)

yp_dt = tree_regressor.predict(X_test)

mse = mean_squared_error(y_test, yp_dt)
r2 = r2_score(y_test, yp_dt)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [None]:
tree_regressor = DecisionTreeRegressor(
    max_depth = 10,
    min_samples_split = 20,
    min_samples_leaf = 10,
    random_state = 12
)

tree_regressor.fit(X_train, y_train)

yp_dt = tree_regressor.predict(X_test)

mse = mean_squared_error(y_test, yp_dt)
r2 = r2_score(y_test, yp_dt)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [None]:
tree_regressor = DecisionTreeRegressor(
    max_depth = 10,
    min_samples_split = 5,
    min_samples_leaf = 4,
    random_state = 12
)

tree_regressor.fit(X_train, y_train)

yp_dt = tree_regressor.predict(X_test)

mse = mean_squared_error(y_test, yp_dt)
r2 = r2_score(y_test, yp_dt)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

In [None]:
pred_dt = pd.DataFrame({'Actual': y_test, 'Predicted': yp_dt})
pred_dt.head()
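
The decision tree cells above vary the depth and leaf sizes by hand. Comparing training and test R-squared across depths is one way to discuss overfitting and underfitting, as the brief requires. The sketch below does this on synthetic data; the depth values are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the notebook's X and y.
X_demo, y_demo = make_regression(n_samples=600, n_features=6, noise=25.0, random_state=12)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=12)

depth_scores = {}
for depth in (2, 5, 10, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=12).fit(X_tr, y_tr)
    # Store (train R2, test R2) so the gap between them can be inspected.
    depth_scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train R2={depth_scores[depth][0]:.3f}, "
          f"test R2={depth_scores[depth][1]:.3f}")
```

A shallow tree with low scores on both sets suggests underfitting, while a fully grown tree with a near-perfect training score and a clearly lower test score suggests overfitting.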

In [None]:
# These are R-squared scores for regression models, not classification reports
print("R-squared LR:", r2_score(y_test, yp_lr))

print("R-squared LR (poly 2):", r2_score(y_test, yp_p2))

print("R-squared RF:", r2_score(y_test, yp_rf))

print("R-squared ANN:", r2_score(y_test, yp_ann))

print("R-squared SVM:", r2_score(y_test, yp_svm))

print("R-squared KNN:", r2_score(y_test, yp_knn))

print("R-squared DT:", r2_score(y_test, yp_dt))
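
The brief also asks for the model comparison to be visualised. A horizontal bar chart of the scores is one simple option. In the sketch below the model names and numbers are placeholders only, not results from this notebook; the actual `r2_score` values would replace them.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder scores only; replace with the r2_score values computed in the notebook.
r2_by_model = pd.Series({"Model A": 0.50, "Model B": 0.70, "Model C": 0.60}).sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(r2_by_model.index, r2_by_model.values)
ax.set_xlabel("R-squared (test set)")
ax.set_title("Model comparison")
fig.tight_layout()
```

Sorting the series first places the best-scoring model at the top of the chart.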