
In [None]:
## Optional installs
%pip install numpy pandas scikit-learn matplotlib



In [None]:
## Imports
import warnings
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer, load_wine, load_diabetes, fetch_california_housing, load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

## Python & Pandas Refresher

### Load in the data

In [None]:
# Load the breast cancer dataset
cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns=np.append(cancer['feature_names'], ['target']))

# Display the first 5 rows of the dataframe
df.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0.0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0.0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0.0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0.0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0.0
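
A possible alternative, assuming scikit-learn 0.23 or newer: `load_breast_cancer(as_frame=True)` returns the same data as a ready-made DataFrame, which avoids stacking arrays by hand and keeps `target` as integers rather than floats. The names `cancer_bunch` and `cancer_df` are illustrative only.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays
cancer_bunch = load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns plus a 'target' column
cancer_df = cancer_bunch.frame

print(cancer_df.shape)
print(cancer_df['target'].dtype)
```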


### Choose your parameters and target variable
**The parameters (more commonly called features) are the inputs that your model uses to make predictions; the target variable is the value that your model aims to predict.**

**In our breast cancer case, the target variable is called `target`, where 0 represents a malignant tumor and 1 represents a benign tumor.**

##### **Categorical vs Continuous Data**
Categorical means the target variable is organized into buckets or "categories".

Continuous means the target variable takes values in $\mathbb{R}$ (a fancy way of saying the target is a real number, such as a price or a temperature).

In the breast cancer case, the target is a categorical variable because we only have two buckets (categories): each tumor is classified as either benign or malignant.
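
One way to confirm that the target is categorical, and to see which label each integer stands for, is to inspect the dataset's `target_names` and count the rows in each class. A minimal sketch; the names `cancer_data`, `labels`, and `counts` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()

# np.unique with return_counts=True gives each distinct label and how often it occurs
labels, counts = np.unique(cancer_data.target, return_counts=True)

# target_names[i] is the human-readable name of integer label i
for label, count in zip(labels, counts):
    print(label, cancer_data.target_names[label], count)
```

Only two distinct labels appear, which is what makes this a binary classification problem.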



In [None]:
# Split the data into features and target
X = df.drop(columns=['target'])
y = df['target']

### Scale the data (if necessary)

In [None]:
# Standardize the data
scaler = StandardScaler()
X = scaler.fit_transform(X)

#### Split the data into training and testing sets

In [None]:
# Split the standardized dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
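
The cells above fit the scaler on every row before splitting, so the test rows influence the mean and standard deviation used for scaling. A common alternative, sketched here under the assumption that you want the test set kept fully unseen, is to split first and let a pipeline fit the scaler on the training rows only. The names `pipe`, `X_raw`, `y_raw`, `X_tr`, `X_te`, `y_tr`, and `y_te` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_raw, y_raw = load_breast_cancer(return_X_y=True)

# Split first so the test rows never touch the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from X_tr only;
# score() reuses those statistics when transforming X_te
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))
pipe.fit(X_tr, y_tr)

print(pipe.score(X_te, y_te))
```

A pipeline built this way can also be passed directly to `GridSearchCV`, so the scaler is refit inside each cross-validation fold.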

#### Let's look at a simple ML model

Remember decision trees?

In [None]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

# Calculate accuracy, precision, recall, and the confusion matrix
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print("Confusion Matrix:")
print(confusion)

Accuracy: 0.9473684210526315
Precision: 0.9577464788732394
Recall: 0.9577464788732394
Confusion Matrix:
[[40  3]
 [ 3 68]]
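
To see where these numbers come from, the metrics can be recomputed by hand from the confusion matrix printed above. This sketch relies on scikit-learn's convention that rows are true labels and columns are predicted labels, with label 1 treated as the positive class; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the decision tree output above
cm_example = np.array([[40, 3],
                       [3, 68]])

# For a 2x2 matrix, ravel() yields: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = cm_example.ravel()

accuracy_manual = (tp + tn) / cm_example.sum()  # share of all predictions that are correct
precision_manual = tp / (tp + fp)               # of the rows predicted as 1, how many truly are 1
recall_manual = tp / (tp + fn)                  # of the rows that truly are 1, how many were found

print(accuracy_manual, precision_manual, recall_manual)
```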


##### The results look pretty good, but can we do better?

## Neural Networks (MLP)

In [None]:
# Load the breast cancer dataset
cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns=np.append(cancer['feature_names'], ['target']))


In [None]:
X = df.drop(columns=['target'])
y = df['target']

In [None]:
# Standardize the data
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [None]:
# Split the standardized dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

##### Let's look at the model as it comes, out of the box

In [None]:
# Define the MLPClassifier model
model = MLPClassifier()

[scikit-learn MLPClassifier documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)
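
"Out of the box" means every hyperparameter keeps its default value. A quick way to see what those defaults are, sketched with the estimator's `get_params()` method (the selection of names printed here is just an illustrative subset):

```python
from sklearn.neural_network import MLPClassifier

# get_params() returns a dict of every constructor argument and its current value
default_params = MLPClassifier().get_params()

for name in ['hidden_layer_sizes', 'activation', 'solver', 'alpha', 'max_iter']:
    print(name, '=', default_params[name])
```

The default network has a single hidden layer of 100 units, which is the baseline that any tuned model needs to beat.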

In [None]:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Calculate accuracy, precision, recall, and the confusion matrix
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print("Confusion Matrix:")
print(confusion)

Accuracy: 0.9736842105263158
Precision: 0.9722222222222222
Recall: 0.9859154929577465
Confusion Matrix:
[[41  2]
 [ 1 70]]




##### Are there any hyperparameters that make the model work better?

In [None]:
model = MLPClassifier(random_state=42, max_iter=2000, early_stopping=True, validation_fraction=0.1, n_iter_no_change=10)

In [None]:
# Define the parameter grid for GridSearchCV
param_grid = {
    'hidden_layer_sizes': [(50, 50), (100, 100), (50, 100, 50)],
    'activation': ['relu', 'tanh'],
    'solver': ['adam', 'sgd'],
    'alpha': [0.0001, 0.001, 0.01],
    'batch_size': [32, 64, 128],
}
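
Before launching a grid search it can help to count how many models it will train. A rough sketch using scikit-learn's `ParameterGrid`; the grid is repeated so the snippet runs on its own, and `n_folds = 5` is an assumption chosen to match the `cv=5` setting used with `GridSearchCV` in this notebook.

```python
from sklearn.model_selection import ParameterGrid

param_grid_copy = {
    'hidden_layer_sizes': [(50, 50), (100, 100), (50, 100, 50)],
    'activation': ['relu', 'tanh'],
    'solver': ['adam', 'sgd'],
    'alpha': [0.0001, 0.001, 0.01],
    'batch_size': [32, 64, 128],
}

# ParameterGrid enumerates every combination: 3 * 2 * 2 * 3 * 3
n_candidates = len(ParameterGrid(param_grid_copy))

# Each candidate is trained once per cross-validation fold
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 108 540
```

With several hundred network fits, the search can take a while; trimming the grid or switching to `RandomizedSearchCV` are two ways to shorten it.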

In [None]:
# Perform GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)


In [None]:
# Get the best model from grid search
best_model = grid_search.best_estimator_

In [None]:
# Make predictions on the testing data
y_pred = best_model.predict(X_test)

In [None]:
# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

In [None]:
# Print the best hyperparameters and performance metrics
print("Best NN hyperparameters found:")
print(grid_search.best_params_)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print('Confusion matrix:\n', cm)

## SVM

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression, although it is most commonly applied to classification.

An SVM aims to find the optimal hyperplane in an N-dimensional space (where N is the number of features) that separates data points into different classes. The optimal hyperplane is the one that maximizes the margin, that is, the gap between the closest points of the different classes.
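
To make the margin idea concrete, here is a small sketch on four made-up 2-D points (not the breast cancer data). For a linear SVM with weight vector $w$, the margin width is $2 / \lVert w \rVert$. The very large `C` is an assumption used to approximate a hard margin, and all variable names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Two points of class 0 on the line x = 0 and two points of class 1 on the line x = 2
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C penalizes margin violations heavily, approximating a hard margin
clf = SVC(kernel='linear', C=1e6).fit(X_toy, y_toy)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2 / np.linalg.norm(w)

print(w, b, margin_width)
```

The two classes are 2 units apart along the first axis, so the fitted hyperplane should sit near x = 1 with a margin width close to 2.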

Let's make an SVM for our breast cancer data set.

In [None]:
# Load breast cancer dataset
cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns=np.append(cancer['feature_names'], ['target']))

In [None]:
X = df.drop(columns=['target'])
y = df['target']

In [None]:
# Standardize the data
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [None]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
svc = SVC(probability=True)

[scikit-learn SVC documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

In [None]:
# Define hyperparameters to tune
hyperparams = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
                'kernel': ['linear', 'rbf'],
                'gamma': ['scale', 'auto']}
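
One detail about this grid: `gamma` only affects the `rbf`, `poly`, and `sigmoid` kernels, so pairing it with `kernel='linear'` trains the same linear model twice. A possible refinement, sketched with hypothetical names `full_grid` and `split_grid`, is to pass `GridSearchCV` a list of dictionaries so each kernel is searched only over the parameters it uses.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]

# Same grid as in the cell above: every kernel is combined with every gamma
full_grid = {'C': C_values,
             'kernel': ['linear', 'rbf'],
             'gamma': ['scale', 'auto']}

# One dict per kernel; gamma is listed only where it has an effect
split_grid = [
    {'kernel': ['linear'], 'C': C_values},
    {'kernel': ['rbf'], 'C': C_values, 'gamma': ['scale', 'auto']},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))

print(n_full, n_split)  # 28 21
```

The split grid covers the same distinct models with a quarter fewer candidates.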

In [None]:
# Perform hyperparameter tuning using k-fold cross-validation
grid_search = GridSearchCV(svc, hyperparams, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_svc = grid_search.best_estimator_

In [None]:
# Make predictions on test set
y_pred = best_svc.predict(X_test)

In [None]:
# Calculate performance metrics
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

In [None]:
# Print the best hyperparameters and SVM performance metrics
print("Best SVM hyperparameters found:")
print(grid_search.best_params_)
print('Accuracy:', acc)
print('Precision:', prec)
print('Recall:', rec)
print('F1 score:', f1)
print('Confusion matrix:\n', cm)

## Try It Yourself

##### 1. Load in your dataset and do some exploratory analysis (i.e., look at your data)

**[Here is the link to the Iris dataset documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html)**

In [None]:
# You can do the titanic dataset if you want an extra challenge
# data_dir = "https://dlsun.github.io/pods/data/"
# titanic = pd.read_csv(data_dir + "titanic.csv")

iris = load_iris()
df = pd.DataFrame(np.c_[iris['data'], iris['target']],
                  columns=np.append(iris['feature_names'], ['target']))


##### 2. Split your data into features and target

In [None]:
# PUT CODE HERE

##### 3. Scale your feature data (if needed) and justify why you did or did not scale.

In [None]:
# PUT CODE HERE

##### 4. Split your data into train and test sets

In [None]:
# PUT CODE HERE

##### 5. Define/instantiate your model (Choose NN or SVM... or both)

In [None]:
# PUT CODE HERE

##### 6. Fit your model to your training set

In [None]:
# PUT CODE HERE

##### 7. Predict on your testing set

In [None]:
# PUT CODE HERE

##### 8. Calculate performance metrics for your base model

HINT: If you're using the Iris dataset, you'll need to look up how to calculate metrics for multi-class classification.
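
As a starting point, here is a small sketch on made-up labels (not the Iris data). With more than two classes, `precision_score` and `recall_score` need an `average` argument; `'macro'` is one possible choice, which computes the metric per class and then takes the unweighted mean. All names below are illustrative.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true and predicted labels for a 3-class problem
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 1, 2, 2, 2]

# Accuracy needs no change for multi-class data
acc_toy = accuracy_score(y_true_toy, y_pred_toy)

# average='macro' averages the per-class scores with equal weight per class
macro_precision = precision_score(y_true_toy, y_pred_toy, average='macro')
macro_recall = recall_score(y_true_toy, y_pred_toy, average='macro')

print(acc_toy, macro_precision, macro_recall)
```

Other values of `average`, such as `'weighted'` or `'micro'`, weight the classes differently, so it is worth stating which one you used when you interpret your results.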

In [None]:
# PUT CODE HERE

##### 9. Interpret your results (...this should be in English)

(put your interpretations here)

##### 10. Tune your model (hyperparameter tuning...maybe GridSearchCV) to try to get a better accuracy score.

In [None]:
# PUT CODE HERE