<a href="https://colab.research.google.com/github/mohammadham/EX-Classification-Hands-on-ML-with-Scikit-Learn-Keras-TensorFlow-by-Au-elien-G-eron/blob/main/Classification_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

ex 1 :

In this code, we use the `fetch_openml` function from scikit-learn to load the MNIST dataset. We then split the data into training and test sets using the `train_test_split` function, which randomly splits the data into two sets based on the `test_size` parameter (in this case, 20% of the data is used for testing).

We define the parameter grid to search for the best hyperparameters for the KNeighborsClassifier model, which includes the weights and n_neighbors hyperparameters.
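To make the size of this search concrete, here is a minimal sketch that enumerates the grid by hand with the standard library. It assumes the same grid and `cv=5` as the cell below; the names `candidates` and `total_fits` are only illustrative.

```python
from itertools import product

# Same grid as in the cell below: two weighting schemes, three neighbor counts
param_grid = {"weights": ["uniform", "distance"], "n_neighbors": [3, 4, 5]}
cv_folds = 5  # matches cv=5 passed to GridSearchCV

# Every combination of values is one candidate model
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

for candidate in candidates:
    print(candidate)

total_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

This agrees with the "Fitting 5 folds for each of 6 candidates, totalling 30 fits" line in the output log.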

We use `GridSearchCV` to search for the best hyperparameters, and print out the best hyperparameters found by `GridSearchCV` and the corresponding cross-validated accuracy score.

We evaluate the best model on the test set and print out the final accuracy score.
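Since each fit on MNIST takes roughly half a minute in the log below, it can help to rehearse the same workflow on something small first. The sketch below assumes scikit-learn's bundled digits dataset (8x8 images) as a stand-in for MNIST; that dataset is not part of the exercise and the scores it gives say nothing about the MNIST results.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the small bundled digits dataset (8x8 images) stands in for MNIST
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

param_grid = [{"weights": ["uniform", "distance"], "n_neighbors": [3, 4, 5]}]
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# best_score_ is the mean cross-validated accuracy of the best candidate
print("Best hyperparameters:", grid_search.best_params_)
print("Best CV accuracy:", grid_search.best_score_)

# With the default refit=True, predict() uses the best model refit on all of X_train
test_accuracy = accuracy_score(y_test, grid_search.predict(X_test))
print("Test accuracy:", test_accuracy)
```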


In [1]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the MNIST dataset
mnist = fetch_openml('mnist_784', version=1)
#mnist = fetch_openml('mnist_784', version=1, parser='auto')
X, y = mnist["data"], mnist["target"]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid to search
param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]

# Create a KNeighborsClassifier model
knn_clf = KNeighborsClassifier()

# Use GridSearchCV to search for the best hyperparameters
grid_search = GridSearchCV(knn_clf, param_grid, cv=5, verbose=3)
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and corresponding accuracy score
print("Best hyperparameters: ", grid_search.best_params_)
print("Best accuracy score: ", grid_search.best_score_)

# Evaluate the best model on the test set
y_pred = grid_search.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)
print("Final accuracy score: ", final_accuracy)


Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV 1/5] END ....n_neighbors=3, weights=uniform;, score=0.969 total time=  28.1s
[CV 2/5] END ....n_neighbors=3, weights=uniform;, score=0.969 total time=  28.4s
[CV 3/5] END ....n_neighbors=3, weights=uniform;, score=0.972 total time=  27.9s
[CV 4/5] END ....n_neighbors=3, weights=uniform;, score=0.971 total time=  28.2s
[CV 5/5] END ....n_neighbors=3, weights=uniform;, score=0.970 total time=  28.5s
[CV 1/5] END ...n_neighbors=3, weights=distance;, score=0.971 total time=  29.5s
[CV 2/5] END ...n_neighbors=3, weights=distance;, score=0.970 total time=  27.7s
[CV 3/5] END ...n_neighbors=3, weights=distance;, score=0.973 total time=  27.8s
[CV 4/5] END ...n_neighbors=3, weights=distance;, score=0.972 total time=  28.0s
[CV 5/5] END ...n_neighbors=3, weights=distance;, score=0.971 total time=  27.5s
[CV 1/5] END ....n_neighbors=4, weights=uniform;, score=0.967 total time=  27.9s
[CV 2/5] END ....n_neighbors=4, weights=uniform;,

ex 2 :
In this code, we define a function `shift_image` that can shift an MNIST image in any direction by one pixel. We then create shifted copies of each image in the training set by calling `shift_image` four times (once for each direction) and adding the shifted images to the training set.

We convert the augmented training set to NumPy arrays and shuffle the data. We then use `GridSearchCV` to find the best hyperparameters for the KNeighborsClassifier model on the augmented training set.

We train the KNeighborsClassifier model on the augmented training set using the best hyperparameters found by `GridSearchCV`. Finally, we evaluate the model on the test set and print out the final accuracy score.
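To see what a one-pixel shift does, the sketch below uses a NumPy-only variant on a tiny 4x4 image. Note that this is a swapped-in technique: the cell below uses `scipy.ndimage.shift`, whereas the hypothetical helper `shift_image_np` copies a slice into a zero-filled array, which should behave the same way for whole-pixel shifts. The `side` parameter is an assumption added so the demo can use a small image.

```python
import numpy as np

def shift_image_np(image, dx, dy, side=28):
    """Shift a flattened side x side image by (dx, dy) pixels, filling with zeros."""
    img = image.reshape(side, side)
    shifted = np.zeros_like(img)
    # Rows move by dy, columns by dx; pixels pushed past the edge are dropped
    src_rows = slice(max(0, -dy), side - max(0, dy))
    dst_rows = slice(max(0, dy), side - max(0, -dy))
    src_cols = slice(max(0, -dx), side - max(0, dx))
    dst_cols = slice(max(0, dx), side - max(0, -dx))
    shifted[dst_rows, dst_cols] = img[src_rows, src_cols]
    return shifted.reshape(-1)

# Tiny 4x4 "image" with a single lit pixel at row 1, column 1
side = 4
image = np.zeros(side * side)
image[1 * side + 1] = 1.0

shifted_right = shift_image_np(image, dx=1, dy=0, side=side).reshape(side, side)
shifted_up = shift_image_np(image, dx=0, dy=-1, side=side).reshape(side, side)
print(shifted_right)
print(shifted_up)

# Augmentation: the originals plus one shifted copy per direction
X_small = np.stack([image, image, image])
shifts = ((1, 0), (-1, 0), (0, 1), (0, -1))
X_augmented = np.concatenate(
    [X_small] + [np.stack([shift_image_np(img, dx, dy, side) for img in X_small])
                 for dx, dy in shifts])
print(X_augmented.shape)  # five times as many rows as X_small
```

The augmented set is five times the size of the original, so the grid search on it can be expected to take considerably longer than in ex 1.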

In [None]:
import numpy as np
from scipy.ndimage import shift
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the MNIST dataset
mnist = fetch_openml('mnist_784', version=1)
X, y = mnist["data"].to_numpy(), mnist["target"].to_numpy()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a function to shift an image in any direction by one pixel
def shift_image(image, dx, dy):
    image = image.reshape((28, 28))
    shifted_image = shift(image, [dy, dx], cval=0, mode="constant")
    return shifted_image.reshape([-1])

# Create shifted copies of each image in the training set
X_train_augmented = [image for image in X_train]
y_train_augmented = [label for label in y_train]

for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    for image, label in zip(X_train, y_train):
        X_train_augmented.append(shift_image(image, dx, dy))
        y_train_augmented.append(label)

# Convert the augmented training set to numpy arrays and shuffle the data
X_train_augmented = np.array(X_train_augmented)
y_train_augmented = np.array(y_train_augmented)
shuffle_idx = np.random.permutation(len(X_train_augmented))
X_train_augmented = X_train_augmented[shuffle_idx]
y_train_augmented = y_train_augmented[shuffle_idx]

# Use GridSearchCV to find the best hyperparameters for the KNeighborsClassifier model
param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]
knn_clf = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf, param_grid, cv=5, verbose=3)
grid_search.fit(X_train_augmented, y_train_augmented)

# Train the KNeighborsClassifier model on the augmented training set
knn_clf = KNeighborsClassifier(**grid_search.best_params_)
knn_clf.fit(X_train_augmented, y_train_augmented)

# Evaluate the model on the test set
y_pred = knn_clf.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)
print("Final accuracy score: ", final_accuracy)

ex 3 :


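This code downloads the Titanic dataset, preprocesses the data using pipelines, and compares a Random Forest classifier with a Support Vector Machine classifier using 10-fold cross-validation. It then adds two exploratory features, AgeBucket and RelativesOnboard.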

In [None]:
import os
import urllib.request
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Define the paths and URLs for the Titanic dataset
TITANIC_PATH = os.path.join("datasets", "titanic")
DOWNLOAD_URL = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/titanic/"

# Function to download the Titanic dataset
def fetch_titanic_data(url=DOWNLOAD_URL, path=TITANIC_PATH):
    if not os.path.isdir(path):
        os.makedirs(path)
    for filename in ("train.csv", "test.csv"):
        filepath = os.path.join(path, filename)
        if not os.path.isfile(filepath):
            print("Downloading", filename)
            urllib.request.urlretrieve(url + filename, filepath)

# Function to load the Titanic dataset
def load_titanic_data(filename, titanic_path=TITANIC_PATH):
    csv_path = os.path.join(titanic_path, filename)
    return pd.read_csv(csv_path)

# Download and load the Titanic dataset
fetch_titanic_data()
train_data = load_titanic_data("train.csv")
test_data = load_titanic_data("test.csv")

# Set the index of the training and test data to "PassengerId"
train_data = train_data.set_index("PassengerId")
test_data = test_data.set_index("PassengerId")

# Print information about the training data
train_data.info()

# Compute the median age of female passengers in the training data
train_data[train_data["Sex"]=="female"]["Age"].median()

# Print some statistics about the training data
train_data.describe()

# Count the number of survivors and non-survivors in the training data
train_data["Survived"].value_counts()

# Count the number of passengers in each passenger class in the training data
train_data["Pclass"].value_counts()

# Count the number of male and female passengers in the training data
train_data["Sex"].value_counts()

# Count the number of passengers who embarked at each port in the training data
train_data["Embarked"].value_counts()

# Define the preprocessing pipelines for numerical and categorical data
num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ])
cat_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("cat_encoder", OneHotEncoder(sparse_output=False)),  # scikit-learn >= 1.2; older versions use sparse=False
    ])

# Combine the numerical and categorical pipelines using ColumnTransformer
num_attribs = ["Age", "SibSp", "Parch", "Fare"]
cat_attribs = ["Pclass", "Sex", "Embarked"]
preprocess_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
    ])

# Preprocess the training data and fit a Random Forest classifier
X_train = preprocess_pipeline.fit_transform(train_data[num_attribs + cat_attribs])
y_train = train_data["Survived"]
forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_scores = cross_val_score(forest_clf, X_train, y_train, cv=10)
print("Random Forest accuracy:", forest_scores.mean())

# Preprocess the training data and fit a Support Vector Machine classifier
svm_clf = SVC(gamma="auto")
svm_scores = cross_val_score(svm_clf, X_train, y_train, cv=10)
print("SVM accuracy:", svm_scores.mean())

# Plot the accuracy scores of the two classifiers
plt.figure(figsize=(8, 4))
plt.plot([1]*10, svm_scores, ".")
plt.plot([2]*10, forest_scores, ".")
plt.boxplot([svm_scores, forest_scores], labels=("SVM","Random Forest"))
plt.ylabel("Accuracy", fontsize=14)
plt.show()

# Create a new feature "AgeBucket" by grouping ages into buckets of 15 years
train_data["AgeBucket"] = train_data["Age"] // 15 * 15
train_data[["AgeBucket", "Survived"]].groupby(['AgeBucket']).mean()

# Create a new feature "RelativesOnboard" by adding the number of siblings/spouses and parents/children
train_data["RelativesOnboard"] = train_data["SibSp"] + train_data["Parch"]
train_data[["RelativesOnboard", "Survived"]].groupby(['RelativesOnboard']).mean()

ex 3 :
Some improvements to the code:

1-Add comments to explain the purpose of each section of code, and use more descriptive variable names to make the code easier to read and understand.

2-Use f-strings to format strings instead of concatenation.

3-Add error handling to the `fetch_titanic_data` function in case the download fails.

4-Use `train_test_split` to split the training data into a training set and a validation set for model selection.

5-Use `GridSearchCV` to perform hyperparameter tuning for the Random Forest classifier.

6-Use `RandomizedSearchCV` to perform hyperparameter tuning for the Support Vector Machine classifier.

7-Use `precision_recall_curve` and `roc_curve` to plot precision-recall and ROC curves for the classifiers.

8-Use `confusion_matrix` to compute the confusion matrix for the classifiers.
 
Here's the improved code:

In [None]:
import os
import urllib.request
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import precision_recall_curve, roc_curve, confusion_matrix

# Define the paths and URLs for the Titanic dataset
TITANIC_PATH = os.path.join("datasets", "titanic")
DOWNLOAD_URL = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/titanic/"

# Function to download the Titanic dataset
def fetch_titanic_data(url=DOWNLOAD_URL, path=TITANIC_PATH):
    if not os.path.isdir(path):
        os.makedirs(path)
    for filename in ("train.csv", "test.csv"):
        filepath = os.path.join(path, filename)
        if not os.path.isfile(filepath):
            print(f"Downloading {filename}")
            try:
                urllib.request.urlretrieve(url + filename, filepath)
            except OSError as exc:  # urllib's URLError and HTTPError are OSError subclasses
                print(f"Failed to download {filename}: {exc}")

# Function to load the Titanic dataset
def load_titanic_data(filename, titanic_path=TITANIC_PATH):
    csv_path = os.path.join(titanic_path, filename)
    return pd.read_csv(csv_path)

# Download and load the Titanic dataset
fetch_titanic_data()
train_data = load_titanic_data("train.csv")
test_data = load_titanic_data("test.csv")

# Set the index of the training and test data to "PassengerId"
train_data = train_data.set_index("PassengerId")
test_data = test_data.set_index("PassengerId")

# Print information about the training data
print("Training data info:")
train_data.info()  # info() prints its report directly and returns None

# Compute the median age of female passengers in the training data
female_median_age = train_data[train_data["Sex"]=="female"]["Age"].median()
print(f"Median age of female passengers: {female_median_age}")

# Print some statistics about the training data
print("Training data statistics:")
print(train_data.describe())

# Count the number of survivors and non-survivors in the training data
survival_counts = train_data["Survived"].value_counts()
print("Survival counts:")
print(survival_counts)

# Count the number of passengers in each passenger class in the training data
class_counts = train_data["Pclass"].value_counts()
print("Passenger class counts:")
print(class_counts)

# Count the number of male and female passengers in the training data
sex_counts = train_data["Sex"].value_counts()
print("Sex counts:")
print(sex_counts)

# Count the number of passengers who embarked at each port in the training data
embarked_counts = train_data["Embarked"].value_counts()
print("Embarked counts:")
print(embarked_counts)

# Define the preprocessing pipelines for numerical and categorical data
num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ])
cat_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("cat_encoder", OneHotEncoder(sparse_output=False)),  # scikit-learn >= 1.2; older versions use sparse=False
    ])

# Combine the numerical and categorical pipelines using ColumnTransformer
num_attribs = ["Age", "SibSp", "Parch", "Fare"]
cat_attribs = ["Pclass", "Sex", "Embarked"]
preprocess_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
    ])

# Preprocess the training data and split it into a training set and a validation set
X = preprocess_pipeline.fit_transform(train_data[num_attribs + cat_attribs])
y = train_data["Survived"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a Random Forest classifier and perform hyperparameter tuning using GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
forest_clf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(forest_clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print("Random Forest best parameters:", grid_search.best_params_)
print("Random Forest best score:", grid_search.best_score_)

# Fit a Support Vector Machine classifier and perform hyperparameter tuning using RandomizedSearchCV
param_dist = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto'],
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid']
}
svm_clf = SVC(random_state=42)
random_search = RandomizedSearchCV(svm_clf, param_distributions=param_dist, n_iter=20, cv=5, scoring='accuracy', random_state=42)
random_search.fit(X_train, y_train)
print("SVM best parameters:", random_search.best_params_)
print("SVM best score:", random_search.best_score_)

# Evaluate the Random Forest classifier on the validation set
forest_clf = RandomForestClassifier(n_estimators=grid_search.best_params_['n_estimators'],
                                     max_depth=grid_search.best_params_['max_depth'],
                                     min_samples_split=grid_search.best_params_['min_samples_split'],
                                     min_samples_leaf=grid_search.best_params_['min_samples_leaf'],
                                     random_state=42)
forest_clf.fit(X_train, y_train)
y_pred_forest = forest_clf.predict(X_val)
print("Random Forest accuracy on validation set:", (y_pred_forest == y_val).mean())
print("Random Forest confusion matrix:")
print(confusion_matrix(y_val, y_pred_forest))

# Evaluate the Support Vector Machine classifier on the validation set
svm_clf = SVC(C=random_search.best_params_['C'],
              gamma=random_search.best_params_['gamma'],
              kernel=random_search.best_params_['kernel'],
              random_state=42)
svm_clf.fit(X_train, y_train)
y_pred_svm = svm_clf.predict(X_val)
print("SVM accuracy on validation set:", (y_pred_svm == y_val).mean())
print("SVM confusion matrix:")
print(confusion_matrix(y_val, y_pred_svm))

# Plot precision-recall and ROC curves for the Random Forest classifier
y_scores_forest = forest_clf.predict_proba(X_val)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_val, y_scores_forest)
fpr, tpr, roc_thresholds = roc_curve(y_val, y_scores_forest)  # separate name so the precision-recall thresholds are not overwritten
plt.figure(figsize=(8, 4))
plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
plt.xlabel("Threshold", fontsize=14)
plt.legend(loc="upper left", fontsize=14)
plt.ylim([0, 1])
plt.title("Precision-Recall Curve", fontsize=16)
plt.show()
plt.figure(figsize=(8, 4))
plt.plot(fpr, tpr, linewidth=2)
plt.plot([0, 1], [0, 1], 'k--')
plt.axis([0, 1, 0, 1])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC Curve', fontsize=16)
plt.show()

# Plot precision-recall and ROC curves for the Support Vector Machine classifier
y_scores_svm = svm_clf.decision_function(X_val)
precisions, recalls, thresholds = precision_recall_curve(y_val, y_scores_svm)
fpr, tpr, roc_thresholds = roc_curve(y_val, y_scores_svm)  # separate name so the precision-recall thresholds are not overwritten
plt.figure(figsize=(8, 4))
plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
plt.xlabel("Threshold", fontsize=14)
plt.legend(loc="upper left", fontsize=14)
plt.ylim([0, 1])
plt.title("Precision-Recall Curve", fontsize=16)
plt.show()
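plt.figure(figsize=(8, 4))
plt.plot(fpr, tpr, linewidth=2)
plt.plot([0, 1], [0, 1], 'k--')
plt.axis([0, 1, 0, 1])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('ROC Curve', fontsize=16)
plt.show()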

Description of this code:
This code performs a binary classification task on the Titanic dataset, which contains information about passengers on the Titanic and whether they survived or not. The code downloads the dataset, preprocesses the data using pipelines, and splits the training data into a training set and a validation set. It then fits a Random Forest classifier and a Support Vector Machine classifier on the training set, performs hyperparameter tuning using GridSearchCV and RandomizedSearchCV, and evaluates the classifiers on the validation set. Finally, it plots precision-recall and ROC curves for the classifiers using the validation set.

The code first defines the paths and URLs for the Titanic dataset and defines functions to download and load the data. It then downloads and loads the data, sets the index of the training and test data to "PassengerId", and prints some information about the training data, such as the median age of female passengers, survival counts, passenger class counts, sex counts, and embarked counts.

The code then defines preprocessing pipelines for numerical and categorical data using SimpleImputer, StandardScaler, and OneHotEncoder, and combines them using ColumnTransformer. It preprocesses the training data and splits it into a training set and a validation set using train_test_split.
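The effect of these pipelines is easier to see on a few rows. The sketch below assumes a made-up three-column frame (the values are invented for illustration and are not taken from the Titanic data) and uses the encoder's default output, converting to a dense array afterwards so it does not depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only (not taken from the Titanic data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Fare": [7.25, 71.3, 8.05, 26.0],
    "Sex": ["male", "female", np.nan, "female"],
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("cat_encoder", OneHotEncoder()),
])
preprocess = ColumnTransformer([
    ("num", num_pipeline, ["Age", "Fare"]),
    ("cat", cat_pipeline, ["Sex"]),
])

X_toy = preprocess.fit_transform(toy)
# The result may be a sparse matrix; convert so it prints as a dense array
if hasattr(X_toy, "toarray"):
    X_toy = X_toy.toarray()
print(X_toy.shape)
print(X_toy)
```

The missing age is replaced by the median of the known ages, the missing `Sex` value by the most frequent category, and the single categorical column becomes two one-hot columns.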

The code then fits a Random Forest classifier and performs hyperparameter tuning using GridSearchCV. It also fits a Support Vector Machine classifier and performs hyperparameter tuning using RandomizedSearchCV. It prints the best parameters and best score for each classifier.
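`RandomizedSearchCV` can also sample from continuous distributions rather than fixed lists. As a hedged variation on the SVM search, the sketch below assumes synthetic data from `make_classification` and a log-uniform range for `C`; neither is in the original notebook, and the reduced kernel list is only there to keep the demo fast.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data; the notebook uses the preprocessed Titanic features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Assumption: a log-uniform range for C instead of the fixed list [0.1, 1, 10, 100]
param_dist = {
    "C": loguniform(0.1, 100),
    "gamma": ["scale", "auto"],
    "kernel": ["linear", "rbf"],
}
random_search = RandomizedSearchCV(
    SVC(), param_distributions=param_dist, n_iter=10, cv=5,
    scoring="accuracy", random_state=42)
random_search.fit(X_demo, y_demo)

print("Best parameters:", random_search.best_params_)
print("Best CV accuracy:", random_search.best_score_)

# refit=True (the default) has already retrained the best candidate on all the data
best_svm = random_search.best_estimator_
```

Because of the default refit, `best_estimator_` can be used directly for prediction, so rebuilding the classifier from `best_params_` is optional.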

The code then evaluates the Random Forest classifier and the Support Vector Machine classifier on the validation set, computes the accuracy and confusion matrix for each classifier, and plots precision-recall and ROC curves for each classifier using precision_recall_curve and roc_curve.
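The shapes returned by these metric functions are worth checking on a small case. The sketch below assumes four hand-made labels and scores, invented for illustration, and keeps the two threshold arrays under separate names (`pr_thresholds` and `roc_thresholds` are illustrative names).

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_recall_curve,
                             roc_auc_score, roc_curve)

# Hand-made labels and scores for illustration only
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# Confusion matrix at a 0.5 decision threshold: rows are true classes, columns are predictions
y_pred = (y_scores >= 0.5).astype(int)
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Each curve returns its own threshold array, so keep them under separate names
precisions, recalls, pr_thresholds = precision_recall_curve(y_true, y_scores)
fpr, tpr, roc_thresholds = roc_curve(y_true, y_scores)

# precisions and recalls have one more entry than pr_thresholds
print(len(precisions), len(recalls), len(pr_thresholds))
auc = roc_auc_score(y_true, y_scores)
print("ROC AUC:", auc)
```

The extra entry is why the plotting code slices `precisions[:-1]` and `recalls[:-1]` before plotting them against the thresholds.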

Overall, this code demonstrates how to preprocess data using pipelines, split data into training and validation sets, perform hyperparameter tuning using GridSearchCV and RandomizedSearchCV, and evaluate classifiers using accuracy, confusion matrix, precision-recall curve, and ROC curve.
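One point to keep in mind: in the code above the preprocessing is fitted on all of the training data before `train_test_split`, so the imputation medians and scaling statistics are computed with the validation rows included. A possible variation, not in the original notebook, is to put the preprocessing and the classifier in one `Pipeline`. The sketch below assumes synthetic numeric data with a few values removed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with some values knocked out to mimic missing entries
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

full_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Each fold refits the imputer and scaler on that fold's training part only
scores = cross_val_score(full_pipeline, X_demo, y_demo, cv=5)
print("Mean CV accuracy:", scores.mean())
```

The same pipeline object can be passed to `GridSearchCV` or `RandomizedSearchCV`, with parameter names prefixed by the step name (for example `forest__n_estimators`).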

================================================================================

ex 4 😢:
