# Hands-On Introduction to Machine Learning

# Part 2: Introduction to ML

<a target="_blank" href="https://colab.research.google.com/github/nunorc/hands-on-intro-ml/blob/master/notebooks/Part-2-Intro-ML.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook gives a quick introduction to Machine Learning, illustrating the use of some techniques and approaches for supervised learning using the [Python](https://www.python.org/) programming language.
The topics briefly covered include single-variable and multivariate linear regression, and image classification with SVM.

## 1. Linear Regression Example

### 1.1 Single Variable

In a nutshell, linear regression is a supervised learning approach that attempts
to capture the mapping between inputs and an output using a linear model.
Let's recall the house price estimation problem: given a set of features describing a house, estimate its sale price based on data.

We start with a small sample dataset to work on. The original, complete dataset is available from Kaggle at https://www.kaggle.com/datasets/shree1992/housedata; this smaller dataset includes only some of the features and samples from the original data.

We start by downloading the dataset CSV file and saving it locally using the `requests` package.


In [None]:
import requests

url = 'https://github.com/nunorc/hands-on-intro-ml/raw/master/notebooks/housing-data.csv'

# get the content of the file
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop early if the download failed

# save the content to a local file
with open('housing-data.csv', 'wb') as f:
    f.write(r.content)

We now have the local `housing-data.csv` file with our data. Next, we can load the dataset into a `pandas` dataframe for inspection and manipulation:

In [None]:
import pandas as pd

dataset = pd.read_csv('housing-data.csv')

We can inspect the content of the dataframe that `pandas` created just by calling the variable; the notebook pretty-prints the data:

In [None]:
dataset

We can see that the table has 4207 rows (i.e. observations, or samples) and 5 columns: four features (number of bedrooms, number of bathrooms, living area in square feet, and number of floors) plus the price for which the house was sold.
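Beyond printing the whole table, a few `pandas` calls give a quicker overview of a dataframe. The sketch below uses a tiny in-memory CSV whose values are made up for illustration (only the column names follow the dataset described above), so it runs without the downloaded file; the name `toy_df` is hypothetical.

```python
import io

import pandas as pd

# Made-up rows for illustration only; column names follow the housing dataset
csv_text = """bedrooms,bathrooms,sqft_living,floors,price
3,1.5,1340,1.5,313000
5,2.5,3650,2.0,2384000
3,2.0,1930,1.0,342000
"""

# read_csv accepts a file path or any file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)       # (number of rows, number of columns)
print(toy_df.dtypes)      # data type of each column
print(toy_df.describe())  # summary statistics per numeric column
```

The same three calls should work unchanged on the `dataset` dataframe loaded from `housing-data.csv`.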

Let's start by looking at single-variable linear regression, which means that our model has only one input variable, also called the independent variable. In this case we can express our linear regression model as:

$$Y = aX + b$$

so the goal becomes finding the best values for $a$ and $b$.

In [None]:
features = ['sqft_living']
target = 'price'

X = dataset[features].to_numpy()
y = dataset[target].to_numpy()

len(X), len(y)

Let's plot our feature versus the price to visualize the data distribution. We can do this easily with `matplotlib`:

In [None]:
import matplotlib.pyplot as plt

plt.scatter(X, y, marker='.')
plt.xlabel('sqft_living (X)')
plt.ylabel('price (y)')

plt.show()

Just by visually inspecting the plot, there seems to be a relation between the price (our dependent variable) and the area (our independent variable), i.e. the price increases as the area increases, but it is hard to define this relation just by looking at the plot.

Let's create our first linear regression model. We start by importing the `LinearRegression` class from the `sklearn.linear_model` module, and we create a new variable `lr` holding an instance of `LinearRegression`:

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

Now it's time to train this model with our data stored in the `X` variable, and with the target vector for the prices stored in `y`:

In [None]:
lr.fit(X, y)

We can check the coefficients of our model, `a` and `b`, i.e. the slope and the intercept of our linear model `lr`, with the following code:

In [None]:
f"y = { lr.coef_[0] }x + { lr.intercept_ }"
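To make the meaning of the slope and intercept more concrete, here is a minimal sketch that computes them with the closed-form least-squares formulas and compares the result with `LinearRegression`. It uses synthetic data generated from an assumed line $y = 3x + 2$ plus a little noise, not the housing dataset, and the names `x_toy`, `y_toy` and `toy_lr` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise
rng = np.random.default_rng(0)
x_toy = rng.uniform(0, 10, size=50)
y_toy = 3.0 * x_toy + 2.0 + rng.normal(0, 0.1, size=50)

# Closed-form least-squares estimates:
#   a = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   b = mean(y) - a * mean(x)
x_centered = x_toy - x_toy.mean()
a = np.sum(x_centered * (y_toy - y_toy.mean())) / np.sum(x_centered ** 2)
b = y_toy.mean() - a * x_toy.mean()

# The same fit with scikit-learn; the input must be 2D, hence the reshape
toy_lr = LinearRegression().fit(x_toy.reshape(-1, 1), y_toy)

print(f"closed form:  y = {a}x + {b}")
print(f"scikit-learn: y = {toy_lr.coef_[0]}x + {toy_lr.intercept_}")
```

Both lines should print nearly identical values, close to the slope and intercept used to generate the data.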

We can now add the fitted linear regression to our previous plot.
We do this by predicting the price for a range of values of `x` between
the minimum and maximum value from the dataset. We create this list of values using the `arange` function from the `numpy` package.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.scatter(X, y, marker='.')
plt.xlabel('sqft_living (X)')
plt.ylabel('price (y)')

# x values (step 1) spanning the observed range; use scalar min/max of the 2D array
x_list = np.arange(X.min(), X.max())
plt.plot(x_list, lr.predict(x_list.reshape(-1, 1)), color='red')

plt.show()

**Activity:** try to plot the remaining features individually versus the price, to get an idea of possible correlations and good candidates for features.


**Activity:** you can also try to fit a new linear regression model to each of these single input features.


### 1.2 Multivariate linear regression

Let's now train a model that uses all the features available in the dataset to predict the house price. In this case our dependent variable `Y`, the predicted house price, is a linear combination of `n` features `X`, each multiplied by its weight, plus an intercept `b`:

$$Y = w_1X_1 + w_2X_2 + \dots + w_nX_n + b$$

So let's redefine our `X` and `y` variables from before as:

In [None]:
features = ['bedrooms', 'bathrooms', 'sqft_living', 'floors']
target = 'price'

X = dataset[features].to_numpy()
y = dataset[target].to_numpy()

X, y

Note that `X` is now a two-dimensional array (a list of lists), because for each observation we have a list of input features.

We now create a new linear regression object `lr` and fit it to the data:

In [None]:
lr = LinearRegression()

lr.fit(X, y)

As before, we can check the weights in our model for each feature.

In [None]:
f"y = { lr.coef_[0] } bedrooms + { lr.coef_[1] } bathrooms + { lr.coef_[2] } sqft_living + { lr.coef_[3] } floors + { lr.intercept_ }"
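The weighted sum in the formula above is just a matrix-vector product. The following sketch, on a small made-up dataset (the names `X_toy`, `y_toy`, `toy_lr` and the numbers are assumptions for illustration), checks that computing `X @ coef_ + intercept_` by hand gives the same result as calling `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: 5 samples, 2 features, target built as 2*x1 - 1*x2 + 0.5
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + 0.5

toy_lr = LinearRegression().fit(X_toy, y_toy)

# Manual prediction: weighted sum of the features plus the intercept
manual = X_toy @ toy_lr.coef_ + toy_lr.intercept_

print(toy_lr.coef_, toy_lr.intercept_)
print(np.allclose(manual, toy_lr.predict(X_toy)))  # True
```

Because the toy target is exactly linear, the fitted weights should recover the values used to build it.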

**Activity:** try other options for building a model, for example polynomial regression; you can use the same `X` and `y`.

Another common and important step is feature scaling, so that the values of all input features are scaled to the same range. A common way of doing this is, for example, the `MinMaxScaler`, which by default maps each feature to the range [0, 1]:

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_scaled = scaler.fit_transform(X)

X_scaled

We can see that `X_scaled` holds different values from the original `X`: the values are now scaled, which means that the absolute magnitude of each feature has less impact on the coefficients calculated to fit the data.
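To see what the scaler computes, the sketch below applies the min-max formula $(x - \min) / (\max - \min)$ column by column and compares it with `MinMaxScaler` using its default settings. The small array `X_toy` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two features on very different scales
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

# Min-max scaling by hand, one minimum and maximum per column
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation with scikit-learn (default range is [0, 1])
scaled = MinMaxScaler().fit_transform(X_toy)

print(scaled)
print(np.allclose(manual, scaled))  # True
```

After scaling, both columns span the same range even though their original units differ by two orders of magnitude.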

**Activity:** try training the linear regression model again, this time with the scaled data, and compare the weights obtained for the model.

## 2. Image Classification Example

Let's now look at a problem where we use images as inputs to our model, and try to classify these images into labels. This dataset, included with the `sklearn` package, consists of 8x8 pixel images of hand-written digits, and the goal is to assign a class between 0 and 9 that represents the digit.

We start by importing the dataset.

In [None]:
from sklearn import datasets

dataset = datasets.load_digits()

Let's plot the first image of the dataset:

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(2, 2))
plt.imshow(dataset.images[0], cmap=plt.cm.gray_r)

Which has the corresponding label:

In [None]:
dataset.target[0]

**Activity:** check other images in the dataset to see examples of other labels by changing the index used in the previous cells.

Note that each image has shape 8x8, so to apply a classifier to this data we need to flatten the images, turning each two-dimensional array of grayscale values of shape 8x8 into a vector of 64 elements.

In [None]:
# flatten the images
n_samples = len(dataset.images)
X = dataset.images.reshape((n_samples, -1))

X.shape

By inspecting `X.shape` in the previous cell we can see that the input feature matrix now has 1797 observations, or samples, and each sample is a vector of 64 features, i.e. one grayscale value for each pixel.
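The effect of `reshape` with `-1` is easier to follow on a tiny example. The sketch below uses two hypothetical 2x3 "images" instead of the 8x8 digits: `-1` asks NumPy to infer the second dimension, and reshaping a row back restores the original image.

```python
import numpy as np

# Two hypothetical 2x3 "images" filled with the values 0..11
images = np.arange(12).reshape(2, 2, 3)

# Keep the number of samples, let NumPy infer the rest (2 * 3 = 6)
flat = images.reshape((len(images), -1))

print(flat.shape)  # (2, 6)
print(flat[0])     # [0 1 2 3 4 5]

# Flattening is reversible: reshape a row back to the image shape
restored = flat[0].reshape(2, 3)
print(np.array_equal(restored, images[0]))  # True
```

Flattening goes row by row, so the first 3 values of a flattened row are the first row of the image, the next 3 are the second row, and so on.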

Next, we split the data into train and test subsets, so that we can fit a classifier to the training data and then validate its predictions on the test data, using the `train_test_split` function:

In [None]:
from sklearn.model_selection import train_test_split

# split data into 70% train and 30% test subsets
X_train, X_test, y_train, y_test = train_test_split(X, dataset.target, test_size=0.3)

The result of the `train_test_split` function is a list of subsets of our feature matrix `X` and target variable `y`. We can check the shape of each subset, as illustrated in the cell below, to see that all the values match correctly:

In [None]:
f"X_train: {X_train.shape}, y_train: {y_train.shape}, X_test: {X_test.shape}, y_test: {y_test.shape}"
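By default `train_test_split` shuffles the data randomly, so every run produces a different split. The sketch below, on made-up data, shows two optional parameters that may be useful: `random_state` to make the split reproducible and `stratify` to keep the class proportions in both subsets. The variable names and the choice of 20 samples are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 2 features, two balanced classes
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# Fixed seed for reproducibility, stratified on the labels
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)  # (14, 2) (6, 2)
print(np.bincount(y_te))       # samples per class in the test subset
```

Rows and labels are shuffled together, so each row of `X_te` still lines up with its label in `y_te`.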

Now let's use our training subsets to train a model, for example a Support Vector Machine (SVM) classifier. We can use the `SVC` (Support Vector Classification) class from the `svm` module to create a new model `clf`:

In [None]:
from sklearn import datasets, metrics, svm

clf = svm.SVC()

This model follows the same approach as before: we train it using the `fit` method, giving the data as arguments. This time we only pass the training subsets and hold out the testing subset:

In [None]:
clf.fit(X_train, y_train)

And now we can use the trained model to predict the classes for the samples in the test subset:

In [None]:
predicted = clf.predict(X_test)

We can now plot, for example, the first image in the test subset (note that we have to reshape the image back to 8x8 from a 64-element vector):

In [None]:
plt.figure(figsize=(2, 2))
plt.imshow(np.array(X_test[0]).reshape(8, 8), cmap=plt.cm.gray_r)

We can then compare the true class from the testing subset with the predicted class for the first example:

In [None]:
f"true y: {y_test[0]} predicted y: {predicted[0]}"

**Activity:** compare the predicted class with the true class for other samples to see how well the model is doing. You can also plot the image. Hint: update the indexes used in the previous cells.

To have a better idea of how well the model is performing let's calculate the model accuracy using the `accuracy_score` function.


In [None]:
from sklearn.metrics import accuracy_score

score = accuracy_score(y_test, predicted)

score

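With around 98% accuracy the model is performing quite well on this task (the exact value varies from run to run, because the train/test split is random). There are many other metrics available to evaluate a model's performance; just to give another example, the `classification_report` function summarizes precision, recall and F1-score per class: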


In [None]:
from sklearn.metrics import classification_report

report = classification_report(y_test, predicted)

print(report)
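As a rough sketch of what these metrics measure, accuracy is simply the fraction of predictions that match the true labels. The example below checks this by hand on a few made-up labels, and also prints a confusion matrix, another metric from `sklearn.metrics` that shows which classes get mistaken for which. The label arrays are assumptions for illustration, not results from the digits model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: one sample of class 2 is predicted as class 1
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

# Accuracy by hand: fraction of matching labels (5 out of 6 here)
manual_acc = np.mean(y_true == y_pred)
acc = accuracy_score(y_true, y_pred)
print(manual_acc, acc)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

The same two calls can be applied to `y_test` and `predicted` from the cells above.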

**Activity:** try to train another classification model from the `sklearn` package and compare the illustrated metrics to see which model performs better.

## Wrap Up

Although only a very limited number of models were illustrated in this notebook, the same approach can be used with many other techniques from the `sklearn` package: all the classes use the same kind of interface and follow the same strategy as the ones discussed.

**Congratulations!** You completed the *Part 2: Introduction to Machine Learning* notebook!

Pour yourself a cup of your favorite drink and take a minute to enjoy your first steps in mastering ML techniques.