# Shallow Machine Learning Introduction

- Statistics is the workhorse of machine learning (ML).

## Shallow learning
- scikit-learn (a.k.a. sklearn)

## Categories

| Regression | Classification | Clustering | Dimension Reduction |
| :-: | :-: | :-: | :-: |
| **Linear** | Logistic Regression | K-means | Principal Component Analysis |
| Polynomial | Support Vector Machine | Mean-Shift | Linear Discriminant Analysis |
| Stepwise | Naive Bayes | DBSCAN | Generalized Discriminant Analysis |
| Ridge | Nearest Neighbor | Agglomerative Hierarchical | Autoencoder |
| Lasso | Decision Tree | Spectral Clustering | Non-Negative Matrix Factorization |
| ElasticNet | Random Forest | Gaussian Mixture | UMAP |

<p><img alt="Classification" width="600" src="00_images/31_machine_learning/shallow_learning_depictions.jpg" align="center" hspace="10px" vspace="0px"></p>

Image Source: de Oliveira, E.C.L., da Costa, K.S., Taube, P.S., Lima, A.H. and Junior, C.D.S.D.S., 2022. Biological Membrane-Penetrating Peptides: Computational Prediction and Applications. Frontiers in Cellular and Infection Microbiology, 12, p.838259. (https://doi.org/10.3389/fcimb.2022.838259)

<hr style="border:2px solid gray"></hr>

## Linear Regression Refresher

**Idea**: Optimize the orientation of a line (i.e., the slope and y-intercept) that best fits coupled parameters (e.g. vaccination effectiveness as a function of dosage).

The equation that defines a line that has a single "feature" (i.e., one independent variable) is 

$y = m*x + b$

where m is the slope and b is the y-intercept.


- A simple, but prominent technique in ML

- Used often in supervised learning


Additional Info: https://en.wikipedia.org/wiki/Linear_regression
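
As a small illustration of what "best fit" means for a single feature, the slope and y-intercept can be computed directly from the least-squares formulas. This is only a sketch on made-up numbers (the lists `x` and `y` are hypothetical and chosen to lie exactly on $y = 2*x + 1$), not the approach used later with scikit-learn:

```python
# Closed-form least-squares fit of y = m*x + b for a single feature.
# The data below are made up for illustration (they lie exactly on y = 2x + 1).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# slope = covariance(x, y) / variance(x)
covariance = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
variance = sum((xi - x_mean) ** 2 for xi in x)
m = covariance / variance

# the fitted line passes through the point (x_mean, y_mean)
b = y_mean - m * x_mean

print(f'slope m = {m:0.2f}, y-intercept b = {b:0.2f}')
```

Because the made-up points fall exactly on a line, the fit recovers a slope of 2 and a y-intercept of 1.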

## Learning by example

**Example data**: housing prices across the United States

source: https://github.com/whoparthgarg/House-Price-Prediction (and https://www.kaggle.com/vedavyasv/usa-housing)

- **Avg. Area Income**: Average income of the residents of the city where the house is located
- **Avg. Area House Age**: Average age of houses within the same city
- **Avg. Area Number of Rooms**: Average number of rooms for houses within the same city
- **Avg. Area Number of Bedrooms**: Average number of bedrooms for houses within the same city
- **Area Population**: Population of the city where the house is located
- **Price**: Price of the house
- **Address**: Address for the house

In [None]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

The dataset (**usa_housing.csv**) can be downloaded from the git repository: https://github.com/karlkirschner/Scientific_Programming_Course

In [None]:
## for Google Colaboratory
# from google.colab import files
# uploaded = files.upload()

In [None]:
!head -2 usa_housing.csv

Rename the headers since they are very long:

In [None]:
# `names` replaces the header row; header=1 marks the second line of the file as that row
housing = pd.read_csv('usa_housing.csv', header=1,
                      names=['income', 'age', 'rooms', 'bedrooms', 'population', 'price', 'address'])
housing

In [None]:
housing.describe()
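
Renaming headers while reading can be tried without the data file by using a small in-memory table. The sketch below is an assumption-laden stand-in: the two rows and the shortened column set are made up, and the column names sit on the first line of the text, so `header=0` is used here (the real file may be laid out differently).

```python
import io
import pandas as pd

# Two made-up rows standing in for the real file (values are not from the dataset).
csv_text = (
    'Avg. Area Income,Avg. Area Number of Rooms,Price\n'
    '60000.0,6.5,1000000.0\n'
    '70000.0,7.2,1200000.0\n'
)

# header=0: the first line holds the original column names, which `names` replaces
demo = pd.read_csv(io.StringIO(csv_text), header=0, names=['income', 'rooms', 'price'])

print(demo.columns.tolist())
print(len(demo))
```

Both data rows survive, and only the column labels change.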

#### Observables - a.k.a. "Features"

Definitions of words used in ML...

Which **features** (i.e., data observables such as `income`, `age`, etc.) do we want the machine **to learn** from in order to **make a prediction** of an outcome observable (e.g., `price`)?

In code, we can define the features as follows:

In [None]:
feature_list = ['income', 'age', 'rooms', 'bedrooms', 'population']

#### Visualize the data
Let's plot the features versus price to see what it might look like:

In [None]:
# def plot_features(feature_list, df: pd.DataFrame, plt_predict: bool=False):
#     fig = plt.figure(figsize=(11, 8))

#     fig.subplots_adjust(wspace=0.2, hspace=0.5)

#     for count, feature in enumerate(feature_list):
#         ax = fig.add_subplot(3, 2, count+1)  # first position can not be zero

#         ax.set_xlabel(xlabel=feature)
#         ax.set_ylabel(ylabel='price')

#         ax.scatter(df[feature], df['price'], color='dodgerblue', s=10, alpha=0.3)
#         if plt_predict:
#             plt.plot(df[feature], predict, color='black', linewidth=10, linestyle='solid')

#     plt.show()

In [None]:
# plot_features(feature_list=feature_list, df=housing)

In [None]:
fig = plt.figure(figsize=(11, 8))

fig.subplots_adjust(wspace=0.2, hspace=0.5)

for count, feature in enumerate(feature_list):
    ax = fig.add_subplot(3, 2, count+1)  # first position can not be zero
    
    ax.set_xlabel(xlabel=feature)
    ax.set_ylabel(ylabel='price')
    
    ax.scatter(housing[feature], housing['price'], color='dodgerblue', s=10, alpha=0.3)

plt.show()

<hr style="border:1px solid gray"></hr>

## Linear Regression on a Single Feature (i.e., one-dimensional)

The **simplest scenario** is to focus on **1 feature** (e.g., rooms) and see if we can create a model that allows us to predict a **house price** based on the number of rooms.

In [None]:
target = housing['price'].values
features = housing['rooms'].values

### Training and Testing

- Good **data scholarship** means we need to **split our data** into **training** and **test** sets. We do this by using the following scikit-learn function:

`train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)`

- Returns: a list containing train-test split of the data input.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
features_train, features_test, target_train, target_test = train_test_split(features, target,
                                                                            test_size=0.25, train_size=0.75,
                                                                            random_state=1)

Let's double check the algorithm - we should have 25% of the data reserved for future testing.

In [None]:
print(f'Length of the training data: {len(target_train)}')
print(f'Length of the test data: {len(target_test)}')

print(f'Fraction of data used for the test data set: '
      f'{len(target_test) / (len(target_train) + len(target_test)) :0.2f}')
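
The same check can be reproduced on a small synthetic data set, which also shows that each feature value stays paired with its own target after shuffling. This is a sketch only; the arrays `demo_features` and `demo_target` are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up samples: one feature and one target
demo_features = np.arange(100, dtype=float)
demo_target = 2.0 * demo_features + 1.0

f_train, f_test, t_train, t_test = train_test_split(demo_features, demo_target,
                                                    test_size=0.25, random_state=1)

print(len(f_train), len(f_test))

# each feature stays paired with its own target after shuffling
print(np.allclose(t_test, 2.0 * f_test + 1.0))
```

With 100 samples and `test_size=0.25`, 75 samples go to training and 25 to testing.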

#### Understanding what the output is
- Let's look at the data, and see what shape the NumPy arrays are:

In [None]:
display(features_train)
display(features_train.shape)

In [None]:
display(target_train)
display(target_train.shape)

#### Visualize the training and test data


In [None]:
plt.figure()

plt.scatter(features_train, target_train, s=10)
plt.scatter(features_test, target_test, s=8)

plt.show()

#### Reshape the data
- scikit-learn's <font color='dodgerblue'>LinearRegression</font> requires the data to have a certain <font color='dodgerblue'>NumPy array shape</font>
- The `target_train` and `target_test` are both already in their correct shape

In [None]:
display(target_train)
display(target_train.shape)

display(target_test)
display(target_test.shape)

- However, since we have only one feature (i.e., one column $\rightarrow$ number of rooms), the feature arrays need to be reshaped into 2D arrays (i.e., nested lists with one row per sample and one column per feature).

**Note:** If we do not reshape the data, then in the next step (i.e., `model = reg.fit(X=features_train, y=target_train)`) we would obtain the following error:

`ValueError: Expected 2D array, got 1D array instead:
array=[7.76350224 6.67325638 6.39398078 ... 6.11019169 7.04733826 5.35511362].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.`

Numpy's reshape function: https://numpy.org/doc/stable/reference/generated/numpy.reshape.html
- `One shape dimension can be -1. In this case, the value is inferred from the length of the array and the remaining dimensions.`
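
To make the role of `-1` concrete, here is a minimal sketch on a made-up four-element array (the values are hypothetical, not taken from the housing data):

```python
import numpy as np

one_d = np.array([7.8, 6.7, 6.4, 6.1])     # made-up room counts, shape (4,)

column = np.reshape(one_d, (-1, 1))         # 4 samples, 1 feature  -> shape (4, 1)
row = np.reshape(one_d, (1, -1))            # 1 sample, 4 features  -> shape (1, 4)

print(one_d.shape, column.shape, row.shape)
```

The `(-1, 1)` form is the one needed here, since each house is one sample with a single feature.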

Originally the data was:

In [None]:
display(features_train)
display(features_train.shape)

Reshape the data:

In [None]:
features_train = np.reshape(features_train, (-1, 1))
features_train

In [None]:
features_test = np.reshape(features_test, (-1, 1))
features_test

### Least Squares Linear Regression

- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

- `sklearn.linear_model.LinearRegression(*, fit_intercept=True, copy_X=True, n_jobs=None, positive=False)`

We will train in two steps:
1. Define our **model** to be a linear regression

In [None]:
reg = LinearRegression(fit_intercept=True)

2. Have the model **learn** from our data (i.e., optimize for a best fit)
     - This is the creation of a **model** that represents our training data

In [None]:
model = reg.fit(X=features_train, y=target_train)

In [None]:
model

<!-- To obtain the weights (a.k.a. coefficients) for each feature (i.e., currently only for rooms):
`print(f'Coefficients: {model.coef_}')` -->
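
The two training steps can be checked on data where the answer is known in advance. The sketch below uses made-up points that lie exactly on $y = 3*x + 2$ (the names `demo_x`, `demo_y`, `demo_reg`, and `demo_model` are hypothetical). It also shows that `fit` returns the estimator itself, so the fitted model and the original regression object are the same object.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up data lying exactly on y = 3*x + 2
demo_x = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)  # 2D: (n_samples, n_features)
demo_y = 3.0 * demo_x[:, 0] + 2.0

# step 1: define the model
demo_reg = LinearRegression(fit_intercept=True)

# step 2: learn from the data
demo_model = demo_reg.fit(X=demo_x, y=demo_y)

print(demo_model is demo_reg)                      # fit returns the same estimator
print(demo_model.coef_, demo_model.intercept_)     # close to [3.] and 2.0
print(demo_model.predict(np.array([[5.0]])))       # close to [17.]
```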

### Making predictions using your model

- `predict`
    - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict

- One new house with 5 rooms


In [None]:
new_house_features = np.array([ [5] ])

display(new_house_features)
display(new_house_features.shape)

In [None]:
model.predict(X=new_house_features)

- Two new houses with 5 and 2 rooms

In [None]:
new_house_features = np.array([ [5], [2] ])
display(new_house_features.shape)

model.predict(X=new_house_features)

#### Evaluate the fit using the Coefficient of Determination ($R^2$)  - goodness-of-fit
- https://en.wikipedia.org/wiki/Coefficient_of_determination

Two ways to obtain this value:
1. `score`
    - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score


2. `r2_score`
    - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score

- Score = 1 : **Best possible model**
    
- Score = 0 : **Poor model** - a model that predicts the expected value of y regardless of the input feature values

- Score < 0 : **Worse than a constant model** - the predictions fit the data worse than always predicting the mean of y (note that $R^2$ cannot exceed 1)
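
These score values follow from the definition $R^2 = 1 - SS_{res}/SS_{tot}$. As a minimal sketch, the hypothetical helper `r_squared` below (not a scikit-learn function) evaluates that formula on made-up numbers for three cases: perfect predictions, always predicting the mean, and predictions that are worse than the mean.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_mean) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]               # made-up target values

print(r_squared(y_true, [1.0, 2.0, 3.0, 4.0]))   # perfect predictions
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))   # always predicts the mean
print(r_squared(y_true, [4.0, 3.0, 2.0, 1.0]))   # worse than predicting the mean
```

The three calls give 1.0, 0.0, and -3.0, respectively.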

Using the **test data** set:

In [None]:
display(features_test.shape)

predict = model.predict(X=features_test)
predict

- `score`

In [None]:
model.score(X=features_test, y=target_test)

- `r2_score`

In [None]:
r2_score(y_true=target_test, y_pred=predict, multioutput='uniform_average')

#### Overlay the scattered data with the linear regression prediction

In [None]:
plt.figure()

plt.scatter(features_test, target_test, s=10, alpha=0.5)
plt.plot(features_test, predict, color='black', linewidth=10, linestyle='solid')

plt.show()

The `coefficient` (i.e., the slope) and `y-intercept` of the resulting fitted line:

In [None]:
print(f'Coefficients: {model.coef_}')
print()
print(f'y-intercept: {model.intercept_}')

Now, let's make a price prediction for the first data entry:

In [None]:
housing.loc[[0]]

Using our ML model:

In [None]:
price = model.predict(X=np.array([ [6.7] ]))
f'{price[0]:0.3e}'

Alternatively, as a proof-of-concept:

- using the `coefficients` and `y-intercept`, we can use the **equation for a line** `y = m*x + b` to obtain a predicted price:

In [None]:
price = (model.coef_[0] * 6.7) + model.intercept_
f'{price:0.3e}'

**Results**: We see that for the first data entry, the model predicts a lower price than the actual price.



<hr style="border:1px solid gray"></hr>

## Model from two features

The equation for a linear model that has two "features" (i.e., two independent variables) is

$y = m_1*x_1 + m_2*x_2 + b$

- $x_1$ and $x_2$ = data for the two features
- $m_1$ and $m_2$ = the coefficients
- $b$ = y-intercept
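
To see how two coefficients and a y-intercept can be obtained at once, the sketch below solves the least-squares problem with NumPy on made-up, noise-free data generated from $y = 2*x_1 + 3*x_2 + 5$. The variable names and numbers are assumptions for illustration, and this is a stand-in for, not a description of, what scikit-learn does internally.

```python
import numpy as np

# made-up data generated from y = 2*x_1 + 3*x_2 + 5 (no noise)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 2.0 * x1 + 3.0 * x2 + 5.0

# design matrix: one column per feature plus a column of ones for the intercept
design = np.column_stack([x1, x2, np.ones_like(x1)])

# least-squares solution of design @ [m_1, m_2, b] = y
solution, *_ = np.linalg.lstsq(design, y, rcond=None)
m1, m2, b = solution

print(f'm_1 = {m1:0.2f}, m_2 = {m2:0.2f}, b = {b:0.2f}')
```

Because the data contain no noise, the fit recovers the coefficients 2 and 3 and the intercept 5.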


- How does one do this using multiple features (i.e., in multidimensional space)?
- Let's start by generating a model that uses two features, 'age' and 'rooms', to make a prediction

In [None]:
housing

In [None]:
target = housing['price'].values

display(target)

In [None]:
two_features = ['age', 'rooms']

display(two_features)
display(housing[two_features].shape)

**Notice**: since we are dealing with **2 features** (i.e., 2 Pandas DataFrame columns), the selection is already two-dimensional, so we pass the **DataFrame directly to `train_test_split`** without reshaping, unlike the 1-feature example above.
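
The difference in dimensionality can be seen on a tiny made-up DataFrame (the values below are hypothetical and only mimic the housing columns):

```python
import pandas as pd

demo = pd.DataFrame({'age': [5.7, 6.0, 5.9],
                     'rooms': [7.0, 6.7, 8.5],
                     'price': [1.0e6, 1.5e6, 1.1e6]})   # made-up values

one_feature = demo['rooms'].values         # single column -> 1D NumPy array
two_feature = demo[['age', 'rooms']]       # list of columns -> 2D DataFrame

print(one_feature.shape)
print(two_feature.shape)
```

The single column has shape `(3,)` and would need reshaping, while the two-column selection already has shape `(3, 2)`.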

In [None]:
features_train, features_test, target_train, target_test = train_test_split(housing[two_features], target,
                                                                            test_size=0.25, train_size=0.75,
                                                                            random_state=1)

In [None]:
features_train

In [None]:
reg = LinearRegression(fit_intercept=True)

In [None]:
model = reg.fit(X=features_train, y=target_train)

In [None]:
model.score(X=features_test, y=target_test)

In [None]:
predict = model.predict(X=features_test)
predict

Create a plot function that allows us to visualize multiple price vs. features.

In [None]:
def plot_features(feature_list: list,
                  target: np.ndarray,
                  feature_df: pd.DataFrame,
                  predict: np.ndarray=None):
    ''' Create a plot with multiple subplots displayed in two columns.
    
        Args
            feature_list: x-axis features to be extracted from feature_df (i.e. column names)
            target: y-axis data (e.g., price)
            feature_df: x-axis data (one column per feature)
            predict: optional predicted values based on machine learning
        Returns
            plot
        
        Library dependencies
            matplotlib
            numpy
            pandas
    '''

    if not isinstance(feature_list, list):
        raise TypeError('Input features are not given as a list.')
    elif not isinstance(target, np.ndarray):
        raise TypeError('Target values are not given as a NumPy array.')
    elif not isinstance(feature_df, pd.DataFrame):
        raise TypeError('feature_df is not given as a Pandas dataframe.')
    elif predict is not None and not isinstance(predict, np.ndarray):
        raise TypeError('predict is not given as a NumPy array.')
    else:  

        number_of_rows = int(np.ceil(len(feature_list)/2))  # number of rows for a 2 column plot

        fig = plt.figure(figsize=(11, 3*number_of_rows))    # same height subplots regardless of rows

        fig.subplots_adjust(wspace=0.2, hspace=0.5)

        for count, feature in enumerate(feature_list):    
            ax = fig.add_subplot(number_of_rows, 2, count+1)  # first position can not be zero

            ax.set_xlabel(xlabel=feature)
            ax.set_ylabel(ylabel='price')

            ax.scatter(feature_df[feature], target, color='dodgerblue', s=20, alpha=0.3)

            if predict is not None:
                ax.scatter(feature_df[feature], predict, color='orange', s=10, alpha=0.5, linestyle='solid')

        plt.show()

In [None]:
plot_features(feature_list=two_features, feature_df=features_test, target=target_test, predict=predict)

#### What would the resulting two-feature linear equation look like, for one of the input houses?

$y = (m_1*x_1) + (m_2*x_2) + (b)$

In [None]:
print(f'Coefficients: {model.coef_}')
print()
print(f'y-intercept: {model.intercept_}')

In [None]:
print(f'y = ({model.coef_[0]:0.2e} * x_1) \n'\
      f'  + ({model.coef_[1]:0.2e} * x_2) \n'\
      f'  + {model.intercept_:0.2e}')

#### Apply it to an individual house (i.e., the first data entry) to see how it reproduces the actual target value.

Recall that we can use Pandas `loc` to isolate rows and columns:

In [None]:
display(housing.loc[[0]])

display(housing.loc[[0, 3], ['age', 'rooms']])

display(housing.loc[[0], ['age', 'rooms']])

Using our **ML model**:

In [None]:
price = model.predict(X=housing.loc[[0], ['age', 'rooms']])

f'{price[0]:0.3e}'

Alternatively, using the **equation for a line**:

In [None]:
print(f'y = ({model.coef_[0]:0.2e} * {housing.loc[0, "age"]:0.2e}) \n'\
      f'  + ({model.coef_[1]:0.2e} * {housing.loc[0, "rooms"]:0.2e}) \n'\
      f'  + {model.intercept_:0.2e}')

In [None]:
# housing.loc[0, "age"] returns a scalar, so no float() conversion is needed
price = (model.coef_[0] * housing.loc[0, "age"])     \
      + (model.coef_[1] * housing.loc[0, "rooms"])   \
      + model.intercept_

f'{price:0.3e}'

As we see, the predicted `$ 1.208e+06` is still lower than the actual `$ 1.506e+06` value.

##### Sidenote: plot the line corresponding to each feature

1. Create a straight line for each feature for plotting
2. Plot the data and overlay it with the straight lines

In [None]:
# each line uses only that feature's own coefficient (the intercept is omitted)
age_line = (model.coef_[0] * features_test["age"])

rooms_line = (model.coef_[1] * features_test["rooms"])

In [None]:
plt.figure()

plt.scatter(features_test['age'], target_test, s=20, alpha=0.5)
plt.scatter(features_test['rooms'],  target_test, s=20, alpha=0.5)

plt.plot(features_test['age'], age_line, color='dodgerblue', linewidth=10, alpha=0.5, linestyle='solid')
plt.plot(features_test['rooms'], rooms_line, color='orange', linewidth=10, alpha=0.5, linestyle='solid')

plt.show()

<hr style="border:1px solid gray"></hr>

## Model from five features

In [None]:
five_features = ['income', 'age', 'rooms', 'bedrooms', 'population']

display(housing[five_features])
display(housing[five_features].shape)

In [None]:
features_train, features_test, target_train, target_test = train_test_split(housing[five_features], target,
                                                                            test_size=0.25, train_size=0.75,
                                                                            random_state=1)

In [None]:
model = reg.fit(X=features_train, y=target_train)

In [None]:
model.score(X=features_test, y=target_test)

In [None]:
predict = model.predict(X=features_test)
predict

Let's visualize how well the ML-predicted values compare to the original `test` data:

In [None]:
plot_features(feature_list=five_features, feature_df=features_test, target=target_test, predict=predict)

#### Apply it to an individual house (i.e., the first data entry) to see how it reproduces the actual target value.

In [None]:
housing.loc[[0]]

Using our **ML model**:

In [None]:
price = model.predict(X=housing.loc[[0], five_features])

f'{price[0]:0.3e}'

Alternatively, using the **equation for a line**:

$y = (m_1*x_1) + (m_2*x_2) + (m_3*x_3) + (m_4*x_4) + (m_5*x_5) + (b)$

In [None]:
print(f'y = ({model.coef_[0]:0.2e} * {housing.loc[0, "income"]:0.2e}) \n'\
      f'  + ({model.coef_[1]:0.2e} * {housing.loc[0, "age"]:0.2e}) \n'\
      f'  + ({model.coef_[2]:0.2e} * {housing.loc[0, "rooms"]:0.2e}) \n'\
      f'  + ({model.coef_[3]:0.2e} * {housing.loc[0, "bedrooms"]:0.2e}) \n'\
      f'  + ({model.coef_[4]:0.2e} * {housing.loc[0, "population"]:0.2e}) \n'\
      f'  + {model.intercept_:0.2e}')

In [None]:
# housing.loc[0, <column>] returns a scalar, so no float() conversion is needed
price = (model.coef_[0] * housing.loc[0, "income"])     \
      + (model.coef_[1] * housing.loc[0, "age"])        \
      + (model.coef_[2] * housing.loc[0, "rooms"])      \
      + (model.coef_[3] * housing.loc[0, "bedrooms"])   \
      + (model.coef_[4] * housing.loc[0, "population"]) \
      + model.intercept_

f'{price:0.3e}'

Now we see very good agreement between the model predicted value and the target `$ 1.506e+06` value.
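
Writing out one term per feature becomes tedious as the number of features grows. The same sum can be expressed as a single dot product between the coefficients and the feature values. The sketch below uses hypothetical coefficients, intercept, and feature values (they are not taken from the fitted model) to show that both forms agree:

```python
import numpy as np

# hypothetical coefficients, intercept, and feature values (not from the fitted model)
coefficients = np.array([2.0, 3.0, 0.5, 1.0, 4.0])
intercept = 10.0
house = np.array([1.0, 2.0, 4.0, 3.0, 0.5])

# term-by-term sum, one product per feature
price_sum = sum(m * x for m, x in zip(coefficients, house)) + intercept

# the same calculation as a single dot product
price_dot = np.dot(coefficients, house) + intercept

print(price_sum, price_dot)
```

With a fitted model, the analogous expression would use `model.coef_` and `model.intercept_` in place of the hypothetical arrays.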

#### How do you run the model for a new house?

1. Create a new dataframe that provides the house's features
2. Use `predict` to generate a predicted value

In [None]:
new_house_features = pd.DataFrame(np.array([ [8.00e4, 6.5, 7.0, 4.0, 40.0e3 ] ]),
                                  columns=five_features)

display(new_house_features)

In [None]:
new_house_price = model.predict(X=new_house_features)
new_house_price

In [None]:
display(new_house_features)
print(f'Cost of the above house is predicted to be: ${new_house_price[0]:0.3e}.')