# Shallow Machine Learning Introduction

#### Statistics is the workhorse of machine learning.

## Shallow learning
- scikit-learn (a.k.a. sklearn)

## Categories

| Regression | Classification | Clustering | Dimension Reduction|
| :-: | :-: | :-: | :-: |
| **Linear** | Logistic Regression | K-means | Principal Component Analysis |
| Polynomial | Support Vector Machine | Mean-Shift | Linear Discriminant Analysis |
| Stepwise | Naive Bayes | DBSCAN | Generalized Discriminant Analysis |
| Ridge | Nearest Neighbor | Agglomerative Hierarchical | Autoencoder |
| Lasso | Decision Tree | Spectral Clustering | Non-Negative Matrix Factorization |
| ElasticNet | Random Forest | Gaussian Mixture | UMAP |

## Linear Regression Refresher

**Idea**: Optimize the orientation of a line (i.e. its slope and y-intercept) so that it best fits paired data (e.g. vaccination effectiveness as a function of dosage).

The equation that defines a line is 

$y = mx + b$

where m is the slope and b is the y-intercept.
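
As a minimal sketch of what "best fit" means here, the slope and y-intercept can be estimated with the closed-form least-squares formulas. The data points below are made up for illustration (they lie exactly on $y = 2x + 1$) and are not part of the housing dataset:

```python
import numpy as np

# Synthetic, noise-free points that lie exactly on y = 2x + 1 (made-up data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Closed-form least-squares estimates of the slope (m) and y-intercept (b)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

print(f'slope m = {m:.2f}, y-intercept b = {b:.2f}')
```

With noisy data the same formulas return the line that minimizes the sum of squared vertical distances between the points and the line.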


- A simple but prevalent technique in machine learning

- Often used in supervised learning


Additional Info: https://en.wikipedia.org/wiki/Linear_regression

## Learning by example

**Example data**: housing prices across the United States

source: https://github.com/whoparthgarg/House-Price-Prediction (and https://www.kaggle.com/vedavyasv/usa-housing)

- **Avg. Area Income**: Average income of the residents of the city in which the house is located
- **Avg. Area House Age**: Average age of houses within the same city
- **Avg. Area Number of Rooms**: Average number of rooms for houses within the same city
- **Avg. Area Number of Bedrooms**: Average number of bedrooms for houses within the same city
- **Area Population**: Population of the city in which the house is located
- **Price**: Price of the house
- **Address**: Address for the house

In [None]:
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

The dataset (**usa_housing.csv**) can be downloaded from the git repository: https://github.com/karlkirschner/Scientific_Programming_Course

In [None]:
## for Google Colaboratory
# from google.colab import files
# uploaded = files.upload()

In [None]:
!head -2 usa_housing.csv

In [None]:
headers = ['income', 'age', 'rooms', 'bedrooms', 'population', 'price', 'address']

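# header=0: the file's first line holds the original column names, which `names` replaces
# (header=1 would treat the first data row as the header and silently drop it)
housing = pd.read_csv('usa_housing.csv', header=0, names=headers)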
housing

In [None]:
housing.describe()

#### Plot how the different features correlate with the price:

- Using pandas' built-in plot function, we can do this quickly:

In [None]:
for feature in headers[0:-2]:
    housing.plot(x=feature, y='price', kind='scatter')
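
The visual impression from scatter plots can be complemented with a number, the Pearson correlation coefficient. The following is a minimal sketch on a made-up DataFrame (the `toy` data, its column values and the `noise` column are assumptions for illustration, not the housing data):

```python
import numpy as np
import pandas as pd

# Made-up data: 'rooms' drives the price, 'noise' is unrelated to it
rng = np.random.default_rng(seed=0)
rooms = rng.normal(7.0, 1.0, 200)

toy = pd.DataFrame({'rooms': rooms,
                    'noise': rng.normal(0.0, 1.0, 200),
                    'price': 100_000 * rooms + rng.normal(0.0, 50_000, 200)})

# Pearson correlation of every feature with the price (values lie in [-1, 1])
corr_with_price = toy.corr()['price'].drop('price')
print(corr_with_price)
```

A value near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests that a straight line explains little of the price.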

---
## Linear Regression on a Single Feature (i.e. 1D)

The simplest scenario is to focus on one feature (i.e. rooms) and see if we can create a model that predicts a house price based on the number of rooms.

In [None]:
target = housing['price'].values
features = housing['rooms'].values

### Training and Testing

- Good data scholarship means we need to split our data into training and test sets. We do this by using the following scikit-learn function:

`sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)`

- Returns: a list containing train-test split of the data input.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
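
To see what the function returns, here is a minimal sketch on a hypothetical toy dataset of 8 samples (the `toy_*` arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples of one feature and a matching target
toy_features = np.arange(8)
toy_target = toy_features * 10

# The splits are returned in this order:
# features (train, test), then target (train, test)
f_train, f_test, t_train, t_test = train_test_split(toy_features, toy_target,
                                                    test_size=0.25, random_state=1)

print(f_train.shape, f_test.shape)

# Both arrays are shuffled identically, so feature/target pairs stay aligned
print(np.array_equal(t_train, f_train * 10))
```

Setting `random_state` to a fixed integer makes the shuffle reproducible, so rerunning the cell gives the same split.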

In [None]:
features_train, features_test, target_train, target_test = train_test_split(features, target,
                                                                            test_size=0.25, train_size=0.75,
                                                                            random_state=1)

Let's double-check the split: 25% of the data should be reserved for testing.

In [None]:
print(f'Length of the training data: {len(target_train)}')
print(f'Length of the test data: {len(target_test)}')

print(f'Percentage of data used for the test data set: '
      f'{len(target_test) / (len(target_train) + len(target_test)):0.2f}')

#### Understanding what the output is
- Let's look at the data and see what shape the NumPy arrays have:

In [None]:
features_train

In [None]:
features_train.shape

In [None]:
target_train

In [None]:
target_train.shape

#### Visualize the data
Let's plot the house price versus the number of rooms to get a visual understanding of the data:

In [None]:
plt.figure()
plt.scatter(features_train, target_train)
plt.show()

#### Reshape the data
- scikit-learn's LinearRegression requires the data to have a certain NumPy array shape
- `target_train` and `target_test` are both already in the correct shape (1D)
- However, since we have only one feature (i.e. one column -> number of rooms), the feature arrays need to be reshaped into 2D arrays with a single column (i.e. one nested list per sample):

In [None]:
features_train

**Note:** If we do not reshape the data, then in the next step (i.e. `model = reg.fit(X=features_train, y=target_train)`) we would obtain the following error:

`ValueError: Expected 2D array, got 1D array instead:
array=[7.76350224 6.67325638 6.39398078 ... 6.11019169 7.04733826 5.35511362].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.`

NumPy's reshape function: https://numpy.org/doc/stable/reference/generated/numpy.reshape.html
- the `-1`: a wildcard for an unknown dimension, whose size NumPy infers automatically
    - "One shape dimension can be -1. In this case, the value is inferred from the length of the array and remaining dimensions."
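
The effect of the `-1` wildcard can be sketched on a small made-up array (the `values` array is an illustration, not the housing data):

```python
import numpy as np

values = np.arange(6)            # 1D array with shape (6,)

column = values.reshape(-1, 1)   # 6 samples, 1 feature  -> shape (6, 1)
row = values.reshape(1, -1)      # 1 sample, 6 features  -> shape (1, 6)
pairs = values.reshape(-1, 2)    # NumPy infers 3 rows   -> shape (3, 2)

print(values.shape, column.shape, row.shape, pairs.shape)
```

For a single feature, the `(-1, 1)` form is the one scikit-learn expects: one row per sample and one column per feature.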

In [None]:
## Note: the following two statements are equivalent
features_train = np.reshape(features_train, (-1, 1))
# features_train = features_train.reshape(-1, 1)

features_train

In [None]:
features_test = features_test.reshape(-1, 1)
features_test

### Least Squares Linear Regression

- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

`sklearn.linear_model.LinearRegression(*, fit_intercept=True, copy_X=True, n_jobs=None, positive=False)`

We will train in two steps:
1. Define our **model** to be a linear regression
1. Have the model **learn** from our data (i.e. optimize for a best fit) 

Alternatively, one could combine them into a single statement: `reg = LinearRegression().fit(X, y)`
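
The two approaches give the same result because `fit` returns the fitted estimator itself. A minimal sketch on made-up data (the arrays `X` and `y` and the variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: one feature (2D array) and one target (1D array)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Two steps: define the estimator, then fit it
reg_two_step = LinearRegression()
model_two_step = reg_two_step.fit(X, y)

# One step: define and fit in a single expression
model_one_step = LinearRegression().fit(X, y)

# fit() returns the same object it was called on
print(model_two_step is reg_two_step)
print(np.allclose(model_two_step.coef_, model_one_step.coef_))
```

Keeping the two steps separate can be convenient when the same model definition is reused or inspected before training.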

**Learn / fit** our data, thereby creating a **model** that represents our training data:

In [None]:
reg = LinearRegression(fit_intercept=True)

In [None]:
model = reg.fit(X=features_train, y=target_train)

To obtain the weights (a.k.a. coefficients) for each feature (i.e. currently only for rooms):

In [None]:
model.coef_
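
The coefficient is the slope $m$ of the fitted line; the y-intercept $b$ is stored separately in the `intercept_` attribute. A minimal sketch on made-up data that lies exactly on $y = 2x + 1$ (the `toy_model` name and the data are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that lies exactly on y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

toy_model = LinearRegression(fit_intercept=True).fit(X, y)

slope = toy_model.coef_[0]        # one weight per feature (here: one)
intercept = toy_model.intercept_  # the fitted y-intercept

# A prediction is simply slope * x + intercept
manual = slope * 5.0 + intercept
print(np.isclose(manual, toy_model.predict(np.array([[5.0]]))[0]))
```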

#### Evaluate the fit using $R^2$ goodness-of-fit

Two ways to obtain this value:
1. score
2. r2_score (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score)

    "**Best possible score is 1.0** and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a score of 0.0."

In [None]:
model.score(X=features_test, y=target_test)

In [None]:
predict = model.predict(X=features_test)

In [None]:
r2_score(y_true=target_test, y_pred=predict, multioutput='uniform_average')
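
To make the $R^2$ value less of a black box, it can also be computed by hand as $R^2 = 1 - SS_{res} / SS_{tot}$. A minimal sketch, using made-up observed and predicted values rather than the housing data:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(r2_manual, r2_score(y_true, y_pred)))

# A constant model that always predicts the mean scores 0.0
y_mean = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, y_mean))
```

This also shows why the score can become negative: a model whose residuals are larger than those of the constant mean prediction has $SS_{res} > SS_{tot}$.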

#### Overlay the scattered data with the linear regression prediction

In [None]:
plt.figure()
plt.scatter(features_train, target_train)
plt.plot(features_test, predict, color='black', linewidth=10, linestyle='solid')
plt.show()

## The next step:
- How does one do this using multiple features (i.e. in a multidimensional space)?
- Let's generate a model that uses 'income', 'age', 'rooms', 'bedrooms' and 'population' to make a prediction