# Machine Learning Fundamentals

## Model Selection

In [None]:
# List every scikit-learn classifier
from sklearn.utils import all_estimators

estimators = all_estimators(type_filter='classifier') # also try 'regressor'
classification_estimators = []
# enumerate() replaces the manual counter
for i, (name, class_) in enumerate(estimators, start=1):
    classification_estimators.append(class_.__name__)
    print(f'{i}. {class_.__name__}')

In [None]:
# Import all sklearn functions
from sklearn.utils.discovery import all_functions
functions = all_functions()
for name, function in functions:
    print(name)

### Machine Learning - Algorithm Tradeoffs

* Linear Regression
    * High interpretability.
    * Lower accuracy for complex datasets.

* Decision Trees
    * Prone to overfitting, reducing accuracy.

* K-Nearest Neighbours (KNN)
    * Accuracy can be affected by irrelevant features.

* Support Vector Machines (SVMs)
    * Lower interpretability due to data transformation.
    * Can be more accurate for certain datasets.

* Random Forests
    * Reduced interpretability due to ensemble method.
    * Typically, higher accuracy than individual trees or KNN.

* Neural Networks
    * Least interpretable due to complex architectures.
    * Can achieve high accuracy on complex tasks.
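The accuracy side of these trade-offs can be checked empirically. The sketch below is illustrative only: it assumes scikit-learn's bundled iris data, mostly default hyperparameters, and feature scaling for the distance-based models. The variable names (`X_demo`, `candidates`, `cv_scores`) are made up for this example, and rankings on other datasets may differ.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Candidate models; the names and settings are illustrative choices
candidates = {
    'Decision tree': DecisionTreeClassifier(random_state=0),
    'KNN (k=5)': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    'SVM (RBF kernel)': make_pipeline(StandardScaler(), SVC()),
    'Random forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_scores = {}
for model_name, model in candidates.items():
    # 5-fold cross-validation returns one accuracy value per fold
    cv_scores[model_name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(f'{model_name}: {cv_scores[model_name]:.3f}')
```

Cross-validation is used here instead of a single train/test split so that the comparison depends less on one lucky or unlucky split.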

## Data Preparation

### For supervised learning:

* **Categorise the problem early**
    * Classification.
    * Clustering.
    * Regression.
    * Ranking.

* **Check Data Quality**
    * Data formats and data types.
    * Consistency of expression e.g. 5.93, $5.93, five ninety-three ...

* **Reduce Data**
    * Attribute Sampling.
    * Record Sampling.
    * Aggregation.

* **Clean Data**
    * Substitute or fill missing values.

* **Create New Features**
    * E.g. creating a ‘day’ feature from a ‘date’ feature.

* **Rescale Data**
    * Data normalisation.
        * Rescale numeric values to a 0.0 – 1.0 range.
        * Encode categorical data as numbers (e.g. 0 or 1 for a two-category feature).
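The preparation steps above can be sketched with `pandas`. The data frame below is hypothetical (the column names and values are invented for illustration), and filling with the median is just one of several reasonable choices for missing values.

```python
import pandas as pd

# Hypothetical raw data: prices recorded inconsistently, one missing value
raw = pd.DataFrame({
    'date': ['2024-01-05', '2024-01-06', '2024-01-07', '2024-01-08'],
    'price': ['5.93', '$5.93', None, '$7.10'],
    'member': ['yes', 'no', 'yes', 'no'],
})

prepared = raw.copy()

# Check data quality: strip the currency symbol and convert to a numeric type
prepared['price'] = pd.to_numeric(prepared['price'].str.replace('$', '', regex=False))

# Clean data: fill the missing price with the column median
prepared['price'] = prepared['price'].fillna(prepared['price'].median())

# Create new features: derive the day of the week from the date
prepared['date'] = pd.to_datetime(prepared['date'])
prepared['day'] = prepared['date'].dt.day_name()

# Rescale data: min-max normalisation to the 0.0 - 1.0 range
price_min, price_max = prepared['price'].min(), prepared['price'].max()
prepared['price_scaled'] = (prepared['price'] - price_min) / (price_max - price_min)

# Encode categorical data as 0 or 1
prepared['member'] = prepared['member'].map({'no': 0, 'yes': 1})

print(prepared)
```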

## Data Exploration

### Iris Dataset Demo

* Exploring the `iris` dataset
    * One of the built-in datasets in `seaborn`.
    * Use `pandas`, `matplotlib` and `seaborn` functions to assess the data.
        * Type and shape
        * Feature names
        * Target names
        * Array types

In [None]:
# fetch the seaborn built-in dataset 'iris'
import seaborn as sns
import pandas as pd 
import matplotlib.pyplot as plt
df_iris = sns.load_dataset('iris')

df_iris.sample(5)

##### We use a pairplot to see whether the classes in the data are likely to be separable.
* Grid of scatter plots.
* Each feature plotted against every other feature.  

* Reading a Pairplot
    * Diagonal
        * Shows distribution of a single variable.
        * Usually a form of histogram.
    * Off-diagonal
        * Shows scatter plot between two variables.
        * Rising from left to right
            * Suggests positive correlation
            * e.g. Petal Length and Petal Width

* Insights from Pairplot:
    * Understand relationships between variables.
    * Identify clusters and outliers.
    * Discover trends and patterns.

In [None]:
# Pair Plot
sns.set_style(style="ticks")
sns.set_palette('viridis')
sns.pairplot(df_iris, hue="species", markers=["o", "s", "D"])
plt.show()
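A rising scatter in a pairplot can be backed up with a number. The sketch below computes Pearson correlations between the iris features. As an assumption for convenience it loads scikit-learn's bundled copy of the dataset, whose column names include units (for example `petal length (cm)`), rather than the `seaborn` copy used above.

```python
from sklearn.datasets import load_iris

# Load iris as a DataFrame (four feature columns plus a numeric 'target' column)
iris_frame = load_iris(as_frame=True).frame

# Pearson correlation between every pair of features
corr = iris_frame.drop(columns='target').corr()
print(corr.round(2))

# The pair highlighted in the pairplot notes
petal_corr = corr.loc['petal length (cm)', 'petal width (cm)']
print(f'Petal length vs petal width: {petal_corr:.2f}')
```

A value close to 1 indicates a strong positive linear relationship, which matches a scatter that rises from left to right.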

##### Next, we fit a very simple classifier (a decision tree) to test whether our assumption is accurate.

In [None]:
# Decision Tree Classifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# using the iris dataset again 
# we first need to encode the species names as numeric variables
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df_iris['species_num'] = label_encoder.fit_transform(df_iris['species'])

# Define X and y
X = df_iris.drop(columns=['species', 'species_num'])  # `columns=` already implies axis=1
y = df_iris['species_num']

# Train/test split 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

# Create instance of classifier
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Evaluate training and testing accuracy 
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

# Visualise the tree  
from sklearn import tree 

# Putting the feature names and class names into lists
# (label_encoder.classes_ keeps the names in the same order as the numeric codes)
fn = list(X.columns)
cn = list(label_encoder.classes_)

# making the figure larger to be more readable
plt.figure(figsize=(10,8))

# plotting the tree
tree.plot_tree(clf,
               feature_names = fn, 
               class_names=cn,
               filled = True);

## Types of machine learning

### Supervised Learning

* The model is trained on data with known inputs and known outputs (labels).
* Training adjusts the model to reduce the error between its predictions and the known outputs.
* Two major types of supervised learning methods:
    * Classification
    * Regression

#### Linear Regression
* Supervised learning method used for predicting:
    * A continuous outcome variable (Y)
    * Based on one or more predictor variables (X)
* Assumes a linear relationship between the predictor variables and the outcome.
* Each coefficient represents the change in the outcome associated with a one-unit change in that predictor, holding the other predictors constant.
* Model fitting involves minimising the sum of the squared differences between the observed and predicted values of the outcome.
* Performance is evaluated using measures such as Mean Absolute Error (MAE), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).
* Assumptions include linearity, independence of observations, constant variance of residuals, and normality of residuals.
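To make the fitting and evaluation steps concrete, here is a minimal sketch on synthetic data (the data and every variable name are assumptions for illustration). It solves the least-squares problem directly with NumPy, fits the same model with scikit-learn, and computes the three error measures listed above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic data: y = 2x + 1 plus random noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=100)
y_demo = 2.0 * x_demo + 1.0 + rng.normal(0, 1, size=100)

# Least squares by hand: add a column of ones for the intercept and solve
A = np.column_stack([np.ones_like(x_demo), x_demo])
(intercept, slope), *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# The same fit with scikit-learn
model = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)
y_hat = model.predict(x_demo.reshape(-1, 1))

# Error measures
mae = mean_absolute_error(y_demo, y_hat)
mse = mean_squared_error(y_demo, y_hat)
rmse = np.sqrt(mse)

print(f'by hand:      intercept={intercept:.3f}, slope={slope:.3f}')
print(f'scikit-learn: intercept={model.intercept_:.3f}, slope={model.coef_[0]:.3f}')
print(f'MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}')
```

Both approaches minimise the same sum of squared residuals, so the two sets of coefficients should agree up to floating-point error.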

##### Example with one input variable

In [None]:
# fetch the tips dataset
df = sns.load_dataset('tips')
print(df.sample())



In [None]:
# Define X and y
X = df[['total_bill']]
y = df['tip']

In [None]:
# Train/test split 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

In [None]:
from sklearn.linear_model import LinearRegression

linreg = LinearRegression().fit(X_train, y_train)

print('linear model intercept: {}'.format(linreg.intercept_))
print('linear model coeff:\n{}'.format(linreg.coef_))
print('R-squared score (training): {:.3f}'.format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(linreg.score(X_test, y_test)))

In [None]:
plt.figure(figsize=(5,4))
plt.scatter(X, y, marker= 'o', s=50, alpha=0.8)
# draw the fitted line using the model's own predictions
plt.plot(X, linreg.predict(X), 'r-')
plt.title('Least-squares linear regression')
plt.xlabel('Total bill')
plt.ylabel('Tip')
plt.show()

##### Example with multiple input variables

In [None]:
# fetch the tips dataset
df_tips = sns.load_dataset('tips')
print("Dataframe before pre-processing:")
print(df_tips.head(1))

# encode non-numeric variables as integer codes
# (note: this imposes an arbitrary order on 'day'; one-hot encoding avoids that)
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
for col in ['day', 'time', 'sex', 'smoker']:
    df_tips[col] = label_encoder.fit_transform(df_tips[col])

print("\nDataframe after pre-processing:")
print(df_tips.head(1))

In [None]:
# Define X and y 
X = df_tips.drop(columns=['tip'])
y = df_tips['tip']

# Train/test split 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

# Create another instance of the model
from sklearn.linear_model import LinearRegression

linreg2 = LinearRegression().fit(X_train, y_train)

print('linear model intercept: {}'.format(linreg2.intercept_))
print('linear model coeff:\n{}'.format(linreg2.coef_))
print('R-squared score (training): {:.3f}'.format(linreg2.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(linreg2.score(X_test, y_test)))

#### Logistic Regression
* A statistical model used for binary classification problems.
* Estimates the probability of an event occurring based on one or more predictor variables.
* Uses the logistic (sigmoid) function to return a probability value between 0 and 1.
* Binary outcome: the probability is converted into one of two classes, typically using a threshold of 0.5.
* Applications
    * Healthcare, finance, social sciences.
    * For risk assessment and prediction.
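The role of the sigmoid function can be shown in a few lines. This is a minimal sketch with hypothetical coefficients (`intercept` and `coef` are invented, not fitted to any data): a linear score is squashed into a probability, which is then thresholded into a class.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a one-feature model
intercept, coef = -3.0, 0.5
x_values = np.array([0.0, 6.0, 12.0])

# Linear score, then squashed into a probability
probabilities = sigmoid(intercept + coef * x_values)

# Common decision rule: predict class 1 when the probability is at least 0.5
predictions = (probabilities >= 0.5).astype(int)

print(probabilities.round(3))
print(predictions)
```

A linear score of 0 maps to a probability of exactly 0.5, so the decision boundary sits where `intercept + coef * x` equals zero.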

##### Example using Titanic dataset

In [None]:
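# fetch the titanic dataset
df_titanic = sns.load_dataset('titanic')
# drop rows with missing values only in the columns used below
# (how='any' across all columns would discard most rows, because 'deck' is mostly missing)
df_titanic = df_titanic.dropna(subset=['sex', 'age', 'pclass', 'survived']).copy()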

from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df_titanic['sex']= label_encoder.fit_transform(df_titanic['sex'])

# Define X and y
X = df_titanic[['sex','age','pclass']]
y = df_titanic['survived']

df_titanic.head()

In [None]:
# Train/test split 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

In [None]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

In [None]:
# import the metrics class
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

In [None]:
# import required modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

class_names = ['Perished', 'Survived'] # names of the classes
fig, ax = plt.subplots()
# create heatmap, labelling rows (actual) and columns (predicted) with the class names
cm_df = pd.DataFrame(cnf_matrix, index=class_names, columns=class_names)
sns.heatmap(cm_df, annot=True, cmap="YlGnBu", fmt='g', ax=ax)
ax.xaxis.set_label_position("top")
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.tight_layout()
plt.show()

In [None]:
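# predicted probability of the positive class (column 1)
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr, tpr, label="Logistic regression, AUC = {:.3f}".format(auc))
plt.plot([0, 1], [0, 1], 'k--', label="Chance")
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc=4)
plt.show()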

In [None]:
# plot a logistic regression curve of survival (response) against age (predictor)
# (logistic=True requires the statsmodels package)
sns.regplot(x='age', y='survived', data=df_titanic, logistic=True, ci=None)

#### K-Nearest Neighbours (KNN)

* The K-NN algorithm compares a new entry with existing, labelled data entries.

* It assigns the new entry the most common class among its k closest neighbours.
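The idea can be written out from scratch in a few lines. This is a minimal sketch on invented toy data, assuming Euclidean distance and a simple majority vote. The function `knn_predict` is hypothetical and is not how scikit-learn implements `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np

def knn_predict(X_known, y_known, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbours."""
    # Euclidean distance from the new point to every known point
    distances = np.linalg.norm(X_known - new_point, axis=1)
    # Indices of the k closest known points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    return Counter(y_known[nearest].tolist()).most_common(1)[0][0]

# Toy data: two well-separated groups
X_known = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_known = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_known, y_known, np.array([1.1, 1.0])))  # prints 0
print(knn_predict(X_known, y_known, np.array([4.9, 5.0])))  # prints 1
```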

In [None]:
# KNN example using two features of the iris dataset
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris # another way of accessing built-in datasets
from sklearn.model_selection import train_test_split
iris = load_iris(as_frame=True)
X = iris.data[["sepal length (cm)", "sepal width (cm)"]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

In [None]:
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

In [None]:
# Plot the decision regions for both weighting schemes
import numpy as np
from matplotlib.colors import ListedColormap

n_neighbors = 3
h = .02  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['coral', 'gainsboro', 'linen'])
cmap_bold = ListedColormap(['firebrick', 'darkslategray', 'orange'])

# Use plain NumPy arrays so that positional indexing ([:, 0]) works
X_arr, X_train_arr = X.to_numpy(), X_train.to_numpy()

for weights in ['uniform', 'distance']:
    # create an instance of the classifier with this weighting and fit the data
    # (a separate name keeps the `knn` model fitted above unchanged)
    knn_w = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights).fit(X_train_arr, y_train)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X_arr[:, 0].min() - 1, X_arr[:, 0].max() + 1
    y_min, y_max = X_arr[:, 1].min() - 1, X_arr[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = knn_w.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the data points
    plt.scatter(X_arr[:, 0], X_arr[:, 1], c=y, cmap=cmap_bold)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))

plt.show()

##### Evaluating a model's performance

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score 
y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average=None)
recall = recall_score(y_test, y_pred, average=None)

print("Accuracy \nThe proportion of correct predictions:", accuracy)
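print("\nPrecision \nOf the predictions for each class, the proportion that were correct (penalises false positives):", precision)
print("\nRecall \nOf the actual members of each class, the proportion that were found (penalises false negatives):", recall)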
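To see where these scores come from, the sketch below derives them by hand from a confusion matrix. The labels are hypothetical and binary to keep the arithmetic simple. Precision is TP / (TP + FP), so it falls as false positives rise; recall is TP / (TP + FN), so it falls as false negatives rise.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical binary labels: 4 actual positives, 6 actual negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_manual = tp / (tp + fp)  # 3 / (3 + 1)
recall_manual = tp / (tp + fn)     # 3 / (3 + 1)
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)  # 8 / 10

print(precision_manual, recall_manual, accuracy_manual)
```

The manual values should match `precision_score(y_true, y_hat)` and `recall_score(y_true, y_hat)`.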

##### Normalising and scaling data

In [None]:
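# NOTE: this cell assumes `df` is a DataFrame with a binary 'fraud' target column;
# no such dataset is loaded earlier in this notebook.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data into features (X) and target (y)
X = df.drop(columns='fraud')
y = df['fraud']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Scale the features using StandardScaler
# (fit on the training set only, so no information leaks from the test set)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)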
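The following self-contained sketch shows what the two common rescaling options do. It uses synthetic features (the income-like and age-like columns are invented), and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic features on very different scales
rng = np.random.default_rng(0)
X_raw = np.column_stack([rng.normal(50_000, 15_000, 200),  # income-like column
                         rng.normal(35, 10, 200)])         # age-like column
X_tr, X_te = train_test_split(X_raw, test_size=0.2, random_state=0)

# Standardisation: fit on the training set only, then reuse the same
# mean and standard deviation on the test set
std_scaler = StandardScaler().fit(X_tr)
X_tr_std = std_scaler.transform(X_tr)
X_te_std = std_scaler.transform(X_te)

# Normalisation: min-max scaling of the training set to the 0.0 - 1.0 range
minmax = MinMaxScaler().fit(X_tr)
X_tr_mm = minmax.transform(X_tr)

print('standardised train means:', X_tr_std.mean(axis=0).round(3))
print('standardised train stds: ', X_tr_std.std(axis=0).round(3))
print('min-max train range:     ', X_tr_mm.min(axis=0), X_tr_mm.max(axis=0))
```

The test set is transformed with statistics learned from the training set, so its means and ranges will be close to, but not exactly, the training values.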

### Unsupervised Learning  
* Machine learning techniques for discovering patterns in data.

* **Clustering**
    * Finding natural clusters of data points based on the variables.

* **Dimensionality Reduction**
    * Searching for patterns and correlations.
    * Using these patterns to express data in a compressed form.

#### K-means Clustering
* Widely used method for cluster analysis.
* Aims to partition a set of objects into K clusters.
* Minimises the sum of the squared distances between the objects and their assigned cluster mean.
* Applications:
    * Customer segmentation.
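The partitioning procedure can be sketched directly in NumPy. This is a bare-bones version of the standard iterative algorithm (often called Lloyd's algorithm) with random initial centroids and a fixed number of iterations. The function `kmeans_sketch` and the toy data are assumptions for illustration; scikit-learn's `KMeans` is more sophisticated.

```python
import numpy as np

def assign(points, centroids):
    """Return the index of the nearest centroid for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(points, k, n_iter=10, seed=0):
    """Alternate between assigning points and updating cluster means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Update step: each centroid moves to the mean of its points
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster is empty
                centroids[j] = points[labels == j].mean(axis=0)
    # Final assignment so that labels match the returned centroids
    return assign(points, centroids), centroids

# Toy data: two well-separated groups
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

Cluster numbering is arbitrary: which group is called 0 and which 1 depends on the random starting centroids.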

##### K-means example 
Borrowed from Kaggle: https://www.kaggle.com/code/satishgunjal/tutorial-k-means-clustering

In [None]:
# Fetch the data 
mall_df = pd.read_csv('https://raw.githubusercontent.com/jargonautical/bsuBootcampCohort5/refs/heads/main/Mall_Customers.csv')
mall_df.sample()

##### How does the data look?

In [None]:
# Simple scatterplot 
plt.figure(figsize=(10,6))
plt.scatter(mall_df['Annual Income (k$)'], mall_df['Spending Score (1-100)'])
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.title('Unlabelled Mall Customer Data')
plt.show()

##### Building the model

In [None]:
# Create an instance of the model
from sklearn.cluster import KMeans 
X = mall_df.iloc[:, [3,4]].values # using only the last 2 columns
kmeans = KMeans(n_clusters=5).fit(X) # assuming 5 clusters, based on the scatterplot above

##### Evaluating the model  
* Unsupervised models don't have 'true positives', 'false positives' etc. because nothing in the data is labelled.
* Instead we look at `inertia`:   
    (the sum of the squared distances between each data point and the centroid of the cluster it has been assigned to)  
* Inertia generally keeps falling as clusters are added, so rather than minimising it we look for the 'elbow': the number of clusters beyond which adding more gives only a small improvement.
* The elbow plot is a simple method using a `for` loop to try different numbers of clusters and plot the corresponding inertia.
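To make the definition of inertia concrete, the sketch below recomputes it by hand and compares the result with the `inertia_` attribute of a fitted `KMeans` model. The blob data is synthetic and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four groups
X_blobs, _ = make_blobs(n_samples=300, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_blobs)

# Inertia by hand: squared distance from each point to its assigned centroid, summed
assigned_centroids = km.cluster_centers_[km.labels_]
inertia_manual = ((X_blobs - assigned_centroids) ** 2).sum()

print(f'manual:       {inertia_manual:.3f}')
print(f'scikit-learn: {km.inertia_:.3f}')

# Inertia for k = 1..6, the raw material for an elbow plot
inertias_demo = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_blobs).inertia_
                 for k in range(1, 7)]
print([round(v, 1) for v in inertias_demo])
```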

##### Elbow Plot

In [None]:
# ELBOW PLOT
from sklearn.cluster import KMeans
X = mall_df.iloc[:, [3,4]].values 
inertias = []

for i in range(1,11):
    kmeans = KMeans(n_clusters = i, init = 'random', random_state = 42).fit(X)
    inertias.append(kmeans.inertia_)

plt.plot(range(1,11), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

In [None]:
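# A confusion matrix needs true labels, which the mall data lacks.
# A label-free alternative is the silhouette score (-1 to 1, higher is better).
from sklearn.metrics import silhouette_score
labels = KMeans(n_clusters=5, random_state=42).fit_predict(X)
print('Silhouette score (k=5): {:.3f}'.format(silhouette_score(X, labels)))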

##### 5 clusters looks like a good fit.
* We re-fit the model specifying this, and get predictions. 
* Then we can plot the predictions and assess whether we are happy with the model.

In [None]:
kmeans= KMeans(n_clusters = 5, random_state = 42)
# Compute k-means clustering
kmeans.fit(X)
# Predict the cluster index for each sample
pred = kmeans.predict(X)

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(X[pred == 0, 0], X[pred == 0, 1], c = 'deepskyblue', label = 'Cluster 0')
plt.scatter(X[pred == 1, 0], X[pred == 1, 1], c = 'salmon', label = 'Cluster 1')
plt.scatter(X[pred == 2, 0], X[pred == 2, 1], c = 'darkviolet', label = 'Cluster 2')
plt.scatter(X[pred == 3, 0], X[pred == 3, 1], c = 'lawngreen', label = 'Cluster 3')
plt.scatter(X[pred == 4, 0], X[pred == 4, 1], c = 'gold', label = 'Cluster 4')

plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:, 1],s = 300, c = 'firebrick', label = 'Centroid', marker='*')

plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.title('Customer Clusters')
plt.show()

#### Dimensionality reduction

In [None]:
from sklearn.decomposition import PCA

# import the data
df_iris = sns.load_dataset('iris')
features = ["sepal_width", "sepal_length", "petal_width", "petal_length"]

# encode species names 
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df_iris['species'] = label_encoder.fit_transform(df_iris['species'])

# create an instance of the Principal Component Analysis (PCA) model
pca = PCA(n_components=2) # we specify that we are reducing the data to 2 components
components = pca.fit_transform(df_iris[features]) # then we fit this instance to the iris features

# plot the results
plt.figure(figsize=(10, 6))
plt.scatter(components[:, 0], # position of each observation on the first principal component
            components[:, 1], # position of each observation on the second principal component
            alpha=0.7, c=df_iris['species'], cmap='viridis')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')

plt.show()
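To judge how much information survives the compression, the share of variance retained by the two components can be inspected. As an assumption for convenience, the sketch below uses scikit-learn's bundled copy of the iris data, and the variable names are invented for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris, _ = load_iris(return_X_y=True)

# Compress four features into two principal components
pca_demo = PCA(n_components=2).fit(X_iris)
X_compressed = pca_demo.transform(X_iris)

# Map the compressed data back to the original four-feature space
X_restored = pca_demo.inverse_transform(X_compressed)

# Share of the total variance captured by each component, and in total
retained = pca_demo.explained_variance_ratio_.sum()
print('per component:', pca_demo.explained_variance_ratio_.round(3))
print(f'total retained: {retained:.3f}')
print('shapes:', X_iris.shape, '->', X_compressed.shape, '->', X_restored.shape)
```

The restored data is an approximation of the original; the closer the retained share is to 1, the smaller the reconstruction error.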
