<a href="https://colab.research.google.com/github/kripperda/MLA_KMR/blob/main/MLA3_KMR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



Machine Learning Assignment 3
---
Kory Ripperda


Imports

In [26]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error
from sklearn.compose import ColumnTransformer

from google.colab import drive


Mount Google Drive

In [27]:
# Mount your Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Load the Data

In [28]:
# Read the file into a DataFrame named 'df'
df = pd.read_csv('/content/drive/MyDrive/banknote_authentication.csv')

Statistics and Visual Exploration

In [29]:
# Summary statistics for every numeric column
numerical_summary = df.describe()
print("Numerical Summary:")
print(numerical_summary)

# Count the rows in each class; the label column is named 'forgery'
class_counts = df['forgery'].value_counts().reset_index()
class_counts.columns = ['forgery', 'count']
print("\nClass Counts:")
print(class_counts)

plt.figure(figsize=(10, 5))
sns.barplot(x='forgery', y='count', data=class_counts)
plt.title('Forgery vs. No Forgery')
plt.show()

# Pairwise scatter plots, colored by class
sns.pairplot(df, hue='forgery')
plt.show()

Numerical Summary:
          variance     skewness     curtosis      entropy      forgery
count  1372.000000  1372.000000  1372.000000  1372.000000  1372.000000
mean      0.433735     1.922353     1.397627    -1.191657     0.444606
std       2.842763     5.869047     4.310030     2.101013     0.497103
min      -7.042100   -13.773100    -5.286100    -8.548200     0.000000
25%      -1.773000    -1.708200    -1.574975    -2.413450     0.000000
50%       0.496180     2.319650     0.616630    -0.586650     0.000000
75%       2.821475     6.814625     3.179250     0.394810     1.000000
max       6.824800    12.951600    17.927400     2.449500     1.000000

From the pairplot, we can see the following patterns:
 - The 'skewness' and 'curtosis' features appear to separate the two classes most clearly, especially in the scatter plots that involve them.
 - The 'variance' and 'entropy' features also show some separation, but it is less distinct than for 'skewness' and 'curtosis'.
 - The scatter plots of 'variance' vs. 'skewness', 'variance' vs. 'curtosis', and 'skewness' vs. 'curtosis' show noticeable clusters.
 - The scatter plots of 'entropy' against the other features also show some separation.
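
Visual impressions from a pairplot can be checked numerically. The sketch below scores each feature by the gap between the two class means, measured in units of pooled standard deviation; a larger score suggests cleaner separation on that feature alone. The small DataFrame is made-up illustrative data, and `separation_score`, `feature_a`, `feature_b`, and `label` are hypothetical names, not part of the assignment.

```python
import numpy as np
import pandas as pd

def separation_score(frame, label_col):
    """Absolute gap between the two class means divided by the pooled std, per feature."""
    groups = frame.groupby(label_col)
    means = groups.mean()
    variances = groups.var()  # sample variance (ddof=1) within each class
    pooled_std = np.sqrt(variances.mean())  # simple average of the two class variances
    gap = (means.iloc[0] - means.iloc[1]).abs()
    return (gap / pooled_std).sort_values(ascending=False)

# Made-up values for illustration only; substitute the real DataFrame.
toy = pd.DataFrame({
    'feature_a': [3.0, 2.5, 3.5, -2.0, -1.5, -2.5],
    'feature_b': [0.1, -0.4, 0.3, 0.0, -0.3, 0.2],
    'label':     [0, 0, 0, 1, 1, 1],
})

scores = separation_score(toy, 'label')
print(scores)
```

On the banknote data this would be called as `separation_score(df, 'forgery')`, assuming the label column is named 'forgery' as in the numerical summary. A per-feature score like this ignores interactions between features, so it complements the pairplot rather than replacing it.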

Split Data into Train/Test

In [30]:
# Features are every column except the label; the label column is named 'forgery'
X = df.drop('forgery', axis=1)
y = df['forgery']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)
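
The split above is not stratified, so the share of forged notes in the training and test sets can drift from the overall share by chance. One optional variation, shown here as a sketch rather than as the assignment's method, is to pass `stratify` to `train_test_split`. The labels below are synthetic but use the class balance implied by the numerical summary (a mean of 0.444606 over 1372 rows, i.e. 610 ones and 762 zeros); `X_demo` and `y_demo` are placeholder names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels matching the class balance in the summary table:
# 762 zeros and 610 ones (1372 rows in total).
y_demo = np.array([0] * 762 + [1] * 610)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature, not real data

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=50, stratify=y_demo
)

print(f"Positive rate overall: {y_demo.mean():.4f}")
print(f"Positive rate train:   {y_tr.mean():.4f}")
print(f"Positive rate test:    {y_te.mean():.4f}")
```

With stratification the three rates agree to within rounding, which keeps precision and recall on the test set comparable to what the model saw during training.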

Build the Pipeline

In [None]:
numerical_features = X.columns
transformer = ColumnTransformer(transformers=[('minmax', MinMaxScaler(), numerical_features)])
pipeline = Pipeline(steps=[('transformer', transformer), ('svc', SVC(kernel='linear'))])

Execute the Model

In [None]:
pipeline.fit(X_train, y_train)

Evaluate the Model

In [None]:
y_pred = pipeline.predict(X_test)

def plot_cm(y_true, y_pred, figsize=(5,5)):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=figsize)
    sns.heatmap(cm, annot=True, fmt='d')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()

plot_cm(y_test, y_pred)

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Precision, recall, and F1 score are all very high, which indicates that the model reliably distinguishes genuine banknotes from forged ones. The confusion matrix supports this: very few test samples are misclassified.
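
To make the link between the confusion matrix and the three scores explicit, the sketch below computes them directly from counts. For binary labels 0/1, scikit-learn's `confusion_matrix` puts true labels on the rows and predictions on the columns, so `cm.ravel()` yields `tn, fp, fn, tp`. The counts used here are hypothetical and are not results from this notebook; `scores_from_counts` is an illustrative helper.

```python
def scores_from_counts(tp, fp, fn):
    """Precision, recall, and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right
    recall = tp / (tp + fn)     # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for illustration only.
tp, fp, fn = 45, 5, 15

precision, recall, f1 = scores_from_counts(tp, fp, fn)
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1 Score: {f1:.3f}")
```

True negatives do not appear in any of the three formulas, which is why these scores describe performance on the positive class (here, label 1) only.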

Conclusion

The SVC model with a linear kernel, applied after MinMax scaling, is well suited to predicting whether a banknote is a forgery. The high precision, recall, and F1 score, together with the very low number of misclassifications in the confusion matrix, suggest that the model can be used reliably for this classification task. The clear separation between the two classes observed in the pairplot is consistent with the model's high performance.

SVM Regression

Load the Data

In [None]:
# Path to file in Google Drive
df = pd.read_csv('/content/drive/MyDrive/Steel_industry_data.csv')

# Rename columns; the list must contain exactly one name per column in the CSV
df.columns = ['date', 'usage_kwh', 'lagging_current_power_factor', 'lagging_current_reactive_power',
              'leading_current_power_factor', 'leading_current_reactive_power', 'nsm', 'curve_flags']



Split Data into Train/Test

In [None]:
X = df.drop(['date', 'usage_kwh'], axis=1)
y = df['usage_kwh']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)

Build the Pipeline

In [None]:
numerical_features = X.select_dtypes(include=['number']).columns
categorical_features = X.select_dtypes(include=['object']).columns

transformer = ColumnTransformer(transformers=[
    ('num', MinMaxScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

pipeline = Pipeline(steps=[
    ('transformer', transformer),
    ('svr', SVR())
])

Execute the Model

In [None]:
pipeline.fit(X_train, y_train)

Evaluate the Model

In [None]:
y_pred_train = pipeline.predict(X_train)
y_pred_test = pipeline.predict(X_test)

rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))

print(f"RMSE Train: {rmse_train}")
print(f"RMSE Test: {rmse_test}")

The RMSE (Root Mean Squared Error) is the square root of the mean squared residual (prediction error); when the residuals average to zero, it equals their standard deviation. In this case, the RMSE is approximately 53.6 on the training set and approximately 63.8 on the test set.

RMSE is expressed in the same units as 'usage_kwh' and summarizes the typical size of the prediction errors, with large errors weighted more heavily than small ones. The test RMSE is about 19% higher than the training RMSE; some gap is expected, because a model usually performs better on the data it was trained on.
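
RMSE is not the same as the average absolute error (MAE): because errors are squared before averaging, a single large miss raises RMSE much more than MAE. The sketch below shows this with made-up residuals; the helper names `rmse` and `mae` and both error lists are illustrative assumptions.

```python
import math

def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)

# Made-up residuals for illustration only; both lists have the same MAE.
even_errors = [10, -10, 10, -10]
one_large_error = [1, -1, 1, -37]

print(f"Even errors:     RMSE={rmse(even_errors):.2f}  MAE={mae(even_errors):.2f}")
print(f"One large error: RMSE={rmse(one_large_error):.2f}  MAE={mae(one_large_error):.2f}")
```

Reporting both metrics side by side can hint at whether a high RMSE comes from uniformly mediocre predictions or from a few large misses.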

A higher RMSE means larger prediction errors. The values here appear relatively high, which suggests that the model is not capturing all the underlying patterns in the data or that the data itself is highly variable. The pipeline uses SVR with its default hyperparameters (RBF kernel, C=1.0, epsilon=0.1), so further model tuning or feature engineering might be necessary to improve performance.

Conclusion

Whether this Support Vector Regression model is suitable for predicting 'usage_kwh' depends on the error range that the specific application can accept. The RMSE values suggest that the predictions carry a significant degree of error. The model might still be useful if a rough estimate is sufficient, but it might not be suitable for applications that require high precision.
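
One way to judge whether an RMSE is acceptable is to compare it with a trivial baseline that always predicts the training mean (scikit-learn's `DummyRegressor(strategy='mean')` does the same thing inside a pipeline). The sketch below uses made-up targets and predictions, so the numbers say nothing about the steel data; all names ending in `_demo` are placeholders.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Made-up targets and predictions for illustration only.
y_train_demo = [20.0, 40.0, 60.0, 80.0]
y_test_demo = [30.0, 50.0, 70.0]
model_pred_demo = [35.0, 45.0, 75.0]

# Baseline: always predict the mean of the training targets.
train_mean = sum(y_train_demo) / len(y_train_demo)
baseline_pred = [train_mean] * len(y_test_demo)

rmse_model = rmse(y_test_demo, model_pred_demo)
rmse_baseline = rmse(y_test_demo, baseline_pred)
print(f"Model RMSE:       {rmse_model:.2f}")
print(f"Baseline RMSE:    {rmse_baseline:.2f}")
print(f"Model / baseline: {rmse_model / rmse_baseline:.2f}")
```

A ratio well below 1 means the model adds real predictive value over the mean; a ratio near 1 means it is doing little better than guessing the average.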

To improve the model's performance, consider the following:
 - Feature engineering: Explore creating new features or transforming existing ones to better capture the underlying patterns.
 - Hyperparameter tuning: Optimize the hyperparameters of the SVR model, such as the kernel, C, and epsilon.
 - Model selection: Experiment with other regression models, such as Random Forest Regression or Gradient Boosting Regression, to see if they perform better.
 - Data analysis: Investigate the data further to identify potential outliers or other issues that might be affecting the model's performance.
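
As a starting point for the hyperparameter tuning suggestion, the sketch below runs a small cross-validated grid search over the kernel, C, and epsilon of an SVR pipeline. It uses synthetic data in place of the steel dataset, and the grid values are assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for the real features and target.
rng = np.random.default_rng(50)
X_demo = rng.uniform(0, 10, size=(120, 2))
y_demo = 3.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(0, 0.1, size=120)

pipe = Pipeline(steps=[('scale', MinMaxScaler()), ('svr', SVR())])

# Illustrative grid; step-name prefixes ('svr__') route values to the SVR step.
param_grid = {
    'svr__kernel': ['linear', 'rbf'],
    'svr__C': [1, 10, 100],
    'svr__epsilon': [0.1, 1.0],
}

# Scoring is negated RMSE, so higher (closer to zero) is better.
search = GridSearchCV(pipe, param_grid, scoring='neg_root_mean_squared_error', cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Cross-validated RMSE: {-search.best_score_:.3f}")
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so no information from the validation fold leaks into the scaling. SVR training time grows quickly with the number of rows, so on a large dataset it may be worth searching on a subsample first.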