# Reminders
Homework 2 is due next week. You will need to know how to use Scikit-learn's `LinearRegression` model, which we'll go over today. 

# Lecture 7 - CME 193 - scikit-learn

[Scikit-learn](https://scikit-learn.org/stable/) is a library that allows you to do machine learning, that is, make predictions from data, in Python. There are four basic machine learning tasks:

 1. Regression: predict a number from datapoints, given datapoints and corresponding numbers
 2. Classification: predict a category from datapoints, given datapoints and corresponding categories
 3. Clustering: predict a category from datapoints, given only datapoints
 4. Dimensionality reduction: make datapoints lower-dimensional so that we can visualize the data

Here is a [handy flowchart](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) of when to use each technique.

![](https://scikit-learn.org/stable/_static/ml_map.png)

# Start of Basic Section

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Regression
Abalone are a type of edible marine snail, and they have internal rings that correspond to their age (like trees). Determining the age requires cutting the shell through the cone, staining it, and counting the number of rings through a microscope, which is a boring and time-consuming task. In the following, we will use a dataset of [abalone measurements](https://archive.ics.uci.edu/ml/datasets/abalone). It has the following fields:

    Name / data type / unit / description
    Sex / nominal / -- / M, F, and I (infant)
    Length / continuous / mm / longest shell measurement
    Diameter / continuous / mm / perpendicular to length
    Height / continuous / mm / with meat in shell
    Whole weight / continuous / grams / whole abalone
    Shucked weight / continuous / grams / weight of meat
    Viscera weight / continuous / grams / gut weight (after bleeding)
    Shell weight / continuous / grams / after being dried
    Rings / integer / -- / +1.5 gives the age in years

Suppose we are interested in predicting the age of the abalone given their measurements. This is an example of a regression problem.

In [None]:
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',
                   header=None, names=['sex', 'length', 'diameter', 'height', 'weight', 'shucked_weight',
                                       'viscera_weight', 'shell_weight', 'rings'])

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# pandas provides DataFrame.plot / Series.plot as a convenient way to make simple plots (a wrapper around matplotlib)
# refer back to the Optional section of lec 6 or https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html

df['rings'].plot(kind='hist') # most abalone have roughly 6 to 11 rings

In [None]:
df.plot('weight', 'rings', kind='scatter') # we can see some trend here

### Four steps of ML
0. get the data
1. import the model
2. train the model
3. use the model to predict
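
The four steps above can be sketched end to end on a tiny synthetic dataset before we apply them to the abalone data. This is only an illustration: the data (points on the line y = 2x + 1) and the `*_demo` names are made up for this sketch and are not part of the lecture dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # 1. import the model

# 0. get the data: synthetic points on the line y = 2x + 1 (assumed, for illustration only)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])  # 2D array: (n_samples, n_features)
y_demo = 2.0 * X_demo[:, 0] + 1.0                # 1D array: (n_samples,)

# 2. train the model
demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# 3. use the model to predict the value at x = 4
demo_pred = demo_model.predict(np.array([[4.0]]))
print(demo_model.coef_, demo_model.intercept_, demo_pred)
```

Because the synthetic data lies exactly on a line, the fitted coefficient and intercept should recover the slope and offset used to generate it.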

In [None]:
df.head()

In [None]:
# 0. get the data
X = df[['weight']].to_numpy() # input features as a 2D array
y = df['rings'].to_numpy() # target values to predict, as a 1D array

In [None]:
X.shape # we expect the input data for linear regression model to be 2D

In [None]:
X

In [None]:
y.shape

In [None]:
y

In [None]:
from sklearn.linear_model import LinearRegression
# 1. create an untrained LinearRegression model
model = LinearRegression() # Stats203

# 2. train the model on the dataset
model.fit(X, y)

# For a single sample x, the model is y = A @ x + B (A: coefficients, B: intercept).
# After fitting, the optimized A and B are stored in coef_ and intercept_
print(model.coef_, model.intercept_)

In [None]:
#3. make a prediction
test_data = np.array([[1.5], [2.2]])
model.predict(test_data) # predict the number of rings for weights 1.5 and 2.2

In [None]:
df.plot('weight', 'rings', kind='scatter')

weight = np.linspace(0, 3, 10).reshape(-1, 1)
plt.plot(weight, model.predict(weight), 'r') # fitted line: y = 3.55 * weight + 6.99

In [None]:
print(model.score(X, y))
# score() returns the R^2 coefficient: the fraction of the variance in y that the model explains -> STATS 203
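
To make the meaning of R^2 concrete, here is a minimal sketch that computes it by hand as 1 - SS_res / SS_tot and compares it with `score()`. The noisy linear data and the `*_r2` names are assumptions made up for this example; they are not the abalone data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic noisy linear data (assumed, for illustration only)
rng = np.random.default_rng(0)
X_r2 = rng.uniform(0, 3, size=(200, 1))
y_r2 = 3.0 * X_r2[:, 0] + 7.0 + rng.normal(0, 2.0, size=200)

r2_model = LinearRegression().fit(X_r2, y_r2)
y_hat = r2_model.predict(X_r2)

ss_res = np.sum((y_r2 - y_hat) ** 2)        # variation left unexplained by the model
ss_tot = np.sum((y_r2 - y_r2.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_model.score(X_r2, y_r2))  # the two values should agree
```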

The fit looks OK, but it is not great. The trend looks more like $\text{rings} = \text{const} \cdot \sqrt{\text{weight}}$, so let's add $\sqrt{\text{weight}}$ as a feature and try again.

In [None]:
# create a new column with the square root of weight
df['root_weight'] = np.sqrt(df['weight'])

In [None]:
df.head()

In [None]:
X = df[['weight','root_weight']].to_numpy()
y = df['rings'].to_numpy()
model = LinearRegression()
model.fit(X, y)

In [None]:
X.shape, y.shape

In [None]:
# y = A@X + B
# X 2x1
# A 1x2
model.coef_
# The first coefficient is for `weight`, and second coef is for `root_weight`

In [None]:
# How does .predict() work under the hood?

print("Calculated with the formula y = A@x + b: ", model.coef_ @ X[1,:] + model.intercept_) # A@x + b

print("Result of .predict() function: ", model.predict(X[1,:].reshape(1,-1))[0]) # predict() expects a 2D array

In [None]:
weight = np.linspace(0, 3, 100).reshape(-1, 1)
root_weight = np.sqrt(weight)
features = np.hstack((weight,root_weight))
df.plot('weight', 'rings', kind='scatter')
plt.plot(weight, model.predict(features), 'r')

In [None]:
model.score(X,y) # an improvement over the single-feature model's R^2 = 0.29

As we can see above, the points are much denser near the red line than in the region where rings > 20. To visualize this density, we use a heatmap (2D histogram), which counts the points that fall within each 2D bin.

In [None]:
plt.hist2d(df['weight'],df['rings'],bins=(50,30));
plt.plot(weight, model.predict(features), 'r')
plt.colorbar()

# End of Basic Section

## Classification

Another example of a machine learning problem is classification. Here we will use a dataset of flower measurements from three different flower species of *Iris* (*Iris setosa*, *Iris virginica*, and *Iris versicolor*). We aim to predict the species of the flower. Because the species is not a numerical output, it is not a regression problem, but a classification problem.

In [None]:
from sklearn import datasets
iris = datasets.load_iris()
print(iris.DESCR)

In [None]:
X = iris.data[:, :2] # first two columns: sepal length and sepal width
y = iris.target_names[iris.target]

In [None]:
print(X.shape)
X[:10]

In [None]:
y

In [None]:
# Plot sepal length against sepal width, with one color per species
for name in iris.target_names:
    plt.scatter(X[y == name, 0], X[y == name, 1], label=name)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.legend();

We have labeled data, i.e. we know the types of irises for each of these data points. We want to learn a model that can, given sepal width and sepal length, predict which type of iris it will be. To train the model, we do what is called a train/test split so we can evaluate how well our model performs. 
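
Conceptually, a train/test split shuffles the row indices and sets a fraction of them aside for evaluation. The NumPy sketch below shows the idea on made-up toy arrays; it is not scikit-learn's actual implementation, and the `*_toy`, `X_tr`, and `X_te` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 10
X_toy = np.arange(n_samples * 2).reshape(n_samples, 2)  # toy features
y_toy = np.arange(n_samples)                            # toy labels

shuffled = rng.permutation(n_samples)     # random order of the row indices
n_test = int(np.ceil(0.2 * n_samples))    # hold out 20% of the rows (rounded up in this sketch)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

X_tr, X_te = X_toy[train_idx], X_toy[test_idx]
y_tr, y_te = y_toy[train_idx], y_toy[test_idx]
print(X_tr.shape, X_te.shape)
```

The important property is that no row appears in both sets, so the test score measures performance on data the model has never seen.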

In [None]:
# split our dataset into training and testing sets
# Training set: used to train the model
# Testing set: used to evaluate the performance of the trained model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# intuition: a point is classified by looking at its nearest neighbors
# (if all neighbors are blue → the point is likely to be blue)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
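
The neighbor-voting intuition can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: `knn_predict_one` is a hypothetical helper, the toy points are made up, and it uses Euclidean distance with a plain majority vote.

```python
import numpy as np
from collections import Counter

def knn_predict_one(X_known, y_known, x_new, k=5):
    """Hypothetical helper: majority vote among the k nearest labeled points."""
    distances = np.linalg.norm(X_known - x_new, axis=1)  # Euclidean distance to every labeled point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_known[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (assumed): three 'blue' points near the origin, three 'red' points near (5, 5)
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array(['blue', 'blue', 'blue', 'red', 'red', 'red'])

toy_pred = knn_predict_one(toy_X, toy_y, np.array([0.3, 0.3]), k=3)
print(toy_pred)
```

The query point sits next to the blue cluster, so its three nearest neighbors are all blue and the vote is unanimous.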

In [None]:
y_predict = model.predict(X_test)
y_predict

In [None]:
y_test

## Evaluating your model in a more elegant way

In [None]:
y_predict == y_test

In [None]:
print(np.mean(y_predict == y_test))  # Accuracy

In [None]:
import sklearn.metrics as metrics
metrics.accuracy_score(y_test, y_predict) # sklearn metrics take (y_true, y_pred), in that order

More comprehensive scores for classification:

- Precision: fraction of relevant instances among the retrieved instances
- Recall: fraction of relevant instances that were retrieved
- F1-score: 2*(Precision * Recall)/(Precision+Recall), the harmonic mean of precision and recall, which summarizes model performance (max 1.0, higher is better)

![](https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg)
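
As a worked example of these definitions, the sketch below counts true positives, false positives, and false negatives for one class and derives the three scores from them. The label arrays are made up for illustration, and the class `'a'` is arbitrarily treated as the "relevant" class.

```python
import numpy as np

# Toy labels (assumed): 3 true 'a' and 5 true 'b'
y_true_toy = np.array(['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'])
y_pred_toy = np.array(['a', 'a', 'b', 'a', 'b', 'b', 'b', 'b'])

positive = 'a'  # the class we treat as "relevant"
tp = np.sum((y_pred_toy == positive) & (y_true_toy == positive))  # predicted 'a', truly 'a'
fp = np.sum((y_pred_toy == positive) & (y_true_toy != positive))  # predicted 'a', truly 'b'
fn = np.sum((y_pred_toy != positive) & (y_true_toy == positive))  # predicted 'b', truly 'a'

precision = tp / (tp + fp)  # of everything predicted 'a', how much was right
recall = tp / (tp + fn)     # of everything truly 'a', how much was found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```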

In [None]:
# e.g. virginica precision = 0.67: if we predict that 100 flowers out of a mixed sample of 1000
# are virginica, about 67 of those 100 predictions will be correct.
# virginica recall = 0.33: we only found 33% of the true virginica;
# the other 67% were predicted as something else.

# classification_report expects (y_true, y_pred), in that order
print(metrics.classification_report(y_test, y_predict))

In [None]:
# Cross validation:
# cross_val_score is a simple cross-validation tool for detecting over-fitting and estimating how well the model generalizes.
# We won't go into details, but it divides the data into multiple subsets and runs the model multiple times to measure its average performance

# With cv=5, split the data into 5 chunks and train the model 5 times, each time holding out a different chunk for validation
from sklearn.model_selection import cross_val_score
model = KNeighborsClassifier()
scores = cross_val_score(model, X, y, cv=5)
scores
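
For readers curious about what happens inside `cross_val_score`, here is a rough sketch of the equivalent manual loop. It assumes that for a classifier with an integer `cv`, scikit-learn uses unshuffled stratified folds, and the `*_cv` and `fold_model` names are made up for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris_cv = datasets.load_iris()
X_cv, y_cv = iris_cv.data[:, :2], iris_cv.target  # sepal features only

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_cv, y_cv):
    fold_model = KNeighborsClassifier()                # a fresh model for every fold
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])   # train on 4 chunks
    manual_scores.append(fold_model.score(X_cv[val_idx], y_cv[val_idx]))  # accuracy on the held-out chunk
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(KNeighborsClassifier(), X_cv, y_cv, cv=5)
print(manual_scores)
print(auto_scores)
```

Each score is the accuracy on a chunk that the model did not see during training, which is why the mean is a more reliable estimate than a single train/test split.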

In [None]:
print(f"Accuracy: {scores.mean()} (+/- {scores.std()})")  # cross_val_score reports accuracy for classifiers by default

### How can we do better? Use More / Different data

In [None]:
X = iris.data[:, 2:]
y = iris.target_names[iris.target]

for name in iris.target_names:
    plt.scatter(X[y == name, 0], X[y == name, 1], label=name)
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend();

# Petal measurements separate the species much better than sepal measurements

In [None]:
X = iris.data[:, 2:]
y = iris.target_names[iris.target]
model = KNeighborsClassifier()
scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy: {scores.mean()} (+/- {scores.std()})")  # cross_val_score reports accuracy for classifiers by default

# Looks good, but what is the downside?

### More Advanced Topics -> Keywords: feature engineering, regularization

![](https://www.mathworks.com/discovery/overfitting/_jcr_content/mainParsys/image.adapt.full.medium.svg/1705396624275.svg)


# Exercise (Post Lecture)
Try to fit some of the models in the following cell to the same data. Compute the relevant statistics (e.g. accuracy, precision, recall). Look up the documentation for the classifier, and see if the classifier takes any parameters. How does changing the parameter affect the result?

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [None]:
# # 0.Get data
# X = iris.data
# y = iris.target_names[iris.target]
# # Split the data
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# # 1. Get Model
# model = <YOUR MODEL FUNCTION>()

# # 2. Train model
# model.fit(X_train, y_train)

# # 3. Make Prediction
# print(metrics.classification_report(y_test, model.predict(X_test)))  # (y_true, y_pred)

## Clustering

Clustering is useful if our dataset is not labelled with the categories we want to predict, but we nevertheless expect there to be a certain number of categories. For example, suppose we have the previous dataset, but we are missing the labels. We can use a clustering algorithm like k-means to *cluster* the datapoints. Because we don't have labels, clustering is what is called an **unsupervised learning** algorithm.

![](https://www.researchgate.net/publication/351953193/figure/fig3/AS:11431281117150742@1675395484096/Supervised-and-unsupervised-machine-learning-a-Schematic-representation-of-an.png)

In [None]:
X = iris.data

In [None]:
for name in iris.target_names:
    plt.scatter(X[y == name, 0], X[y == name, 1], label=name)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.legend()
plt.show()

In [None]:
plt.scatter(X[:, 0], X[:, 1])
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.show()

In [None]:
from sklearn.cluster import KMeans
# We initialize three starting centroids and iteratively adjust them toward an optimized solution
# k-means works best when clusters are roughly spherical and similar in size
# (it can be viewed as a limiting case of a Gaussian mixture model)
model = KMeans(n_clusters=3, random_state=0)
model.fit(X)

In [None]:
# give every flower in the data a cluster label
model.labels_

In [None]:
# The cluster numbering differs from the target labels (which is fine, cluster IDs are arbitrary), but the grouping is similar
iris.target
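
Since cluster IDs are arbitrary, comparing `model.labels_` and `iris.target` element by element is not meaningful. One possible way to compare them, sketched below, is a contingency table plus the adjusted Rand index, which ignores how clusters are numbered. The `iris_km`, `km`, `ari`, and `table` names are made up for this example, and `n_init=10` is an assumption made here to keep the result stable.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, confusion_matrix

iris_km = datasets.load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris_km.data)

# Rows: true species, columns: cluster IDs. A good clustering has one dominant entry per row.
table = confusion_matrix(iris_km.target, km.labels_)
# Adjusted Rand index: 1.0 means identical groupings, values near 0 mean random agreement.
ari = adjusted_rand_score(iris_km.target, km.labels_)

print(table)
print(ari)
```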

In [None]:
# Visualize the clusters using the sepal measurements
for name in [0,1,2]:
    plt.scatter(X[model.labels_ == name, 0], X[model.labels_ == name, 1], label=name)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.legend()
plt.show()

In [None]:
for name in iris.target_names:
    plt.scatter(X[y == name, 0], X[y == name, 1], label=name)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.legend()
plt.show()

### More Advanced Topics: How to choose the appropriate number of clusters? [Elbow Method]; Are there more metrics for evaluation? -> CS229
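
A minimal sketch of the elbow method, under the assumption that we judge each `k` by the fitted model's `inertia_` (the within-cluster sum of squared distances). The range of `k` values and the `X_elbow`, `km_k`, and `inertias` names are choices made for this example.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

X_elbow = datasets.load_iris().data

ks = list(range(1, 8))
inertias = []
for k in ks:
    km_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_elbow)
    inertias.append(km_k.inertia_)  # within-cluster sum of squared distances

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` (for example with `plt.plot(ks, inertias, 'o-')`) gives a curve that drops steeply at first and then flattens; the "elbow" where it flattens is a common heuristic for choosing the number of clusters.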

### Exercise (Post Lecture)

Load the breast cancer dataset.

- Try to cluster it into two clusters and check whether the clusters match the target class from the dataset, which specifies whether a tumor is malignant or benign. Here we are testing whether we can identify malignant and benign tumors without even looking at the target class, i.e. using unsupervised learning.

- Next, train a supervised classifier, a `KNeighborsClassifier`, and see how much improvement we get.

In [None]:
bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target
print(bc.DESCR)

In [None]:
# # Unsupervised
# YOUR CODE HERE

In [None]:
# # Supervised
# YOUR CODE HERE

## Dimensionality reduction

Dimensionality reduction is another unsupervised learning problem (that is, it does not require labels). It aims to project datapoints into a lower dimensional space while preserving distances between datapoints. Remember when we looked at using the SVD to do PCA in Lecture 4? This was dimensionality reduction!
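
As a reminder of the PCA connection, the sketch below projects the iris data to 2 dimensions twice: once with scikit-learn's `PCA` and once with a plain SVD of the centered data. The variable names are made up for this example, and the two results are only expected to agree up to the sign of each component.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X_pca_in = datasets.load_iris().data  # 150 samples, 4 features

# Projection with scikit-learn
X_pca = PCA(n_components=2).fit_transform(X_pca_in)

# Same projection via SVD: center the data, then project onto the top 2 right singular vectors
X_centered = X_pca_in - X_pca_in.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_svd = X_centered @ Vt[:2].T

print(X_pca_in.shape, X_pca.shape, X_svd.shape)
```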

In [None]:
X = iris.data[:, :]
y = iris.target_names[iris.target]

for name in iris.target_names:
    plt.scatter(X[y == name, 0], X[y == name, 1], label=name)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.legend()

In [None]:
for name in iris.target_names:
    plt.scatter(X[y == name, 2], X[y == name, 3], label=name)
plt.xlabel('Petal length')
plt.ylabel('Petal width')
plt.legend()

We are going to use an algorithm called [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html).

T-distributed Stochastic Neighbor Embedding is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.

If the number of features is very high, it is highly recommended to first use another dimensionality reduction method to reduce the number of dimensions to a reasonable amount (e.g. 50): PCA for dense data or TruncatedSVD for sparse data (both are available in `sklearn.decomposition`).
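
A rough sketch of that two-stage approach, shown on the breast cancer dataset purely to illustrate the mechanics: its 30 features are not really high-dimensional enough to need the PCA step, and the choice of 10 components, the fixed `random_state`, and the variable names are assumptions made for this example.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X_bc_demo = datasets.load_breast_cancer().data  # 569 samples, 30 features

# Stage 1: PCA down to a moderate number of dimensions (10 is an arbitrary choice here)
X_reduced = PCA(n_components=10).fit_transform(X_bc_demo)

# Stage 2: t-SNE from the PCA output down to 2 dimensions for plotting
X_embedded = TSNE(n_components=2, random_state=0).fit_transform(X_reduced)

print(X_bc_demo.shape, X_reduced.shape, X_embedded.shape)
```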


In [None]:
from sklearn.manifold import TSNE
model = TSNE(n_components=2)
X_transformed = model.fit_transform(X)

In [None]:
# the data is now embedded in 2 dimensions, with neighborhood structure preserved as far as possible
print(X.shape, X_transformed.shape)

In [None]:
for name in iris.target_names:
    plt.scatter(X_transformed[y == name, 0], X_transformed[y == name, 1], label=name)

plt.legend()

Let's take a look at the breast cancer dataset with dimensionality reduction.

In [None]:
bc = datasets.load_breast_cancer()
print(bc.DESCR)

In [None]:
print(bc.keys())
print(bc['data'].shape)
print(bc.target_names)
print(bc.target)
print(bc.target_names[bc.target])

In [None]:
X = bc.data
y = bc.target_names[bc.target]
model = TSNE(n_components=2)
X_transformed = model.fit_transform(X)

In [None]:
# the data is now embedded in 2 dimensions, with neighborhood structure preserved as far as possible
print(X.shape, X_transformed.shape)

In [None]:
# Using the full dataset with all 30 attributes
print(X.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(metrics.classification_report(y_test, y_pred))  # (y_true, y_pred)

In [None]:
# Using the 2-D embedding -> we get 92% accuracy -> still very good
# (the exact value can vary between runs, because t-SNE is stochastic when random_state is not set)
print(X_transformed.shape)
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.2, random_state=0)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
print(metrics.classification_report(y_test, model.predict(X_test)))  # (y_true, y_pred)

In [None]:
# visualize the reduced dimension data
for name in bc.target_names:
    plt.scatter(X_transformed[y == name, 0], X_transformed[y == name, 1], label=name)

plt.legend()

In [None]:
# visualize the predicted reduced dimension data
ypred = model.predict(X_transformed)
for name in bc.target_names:
    plt.scatter(X_transformed[ypred == name, 0], X_transformed[ypred == name, 1], label=name)

plt.legend()