# Shallow Machine Learning Introduction
- **s**ci**k**it-**learn** (a.k.a. sklearn)
- https://scikit-learn.org
- installation: https://scikit-learn.org/stable/install.html

## Categories

| <font color='dodgerblue'>Regression</font> | <font color='dodgerblue'>Classification</font> | <font color='dodgerblue'>Clustering</font> | <font color='dodgerblue'>Dimension Reduction</font>|
| :-: | :-: | :-: | :-: |
| **Linear** | Logistic Regression | K-means | Principal Component Analysis |
| Polynomial | Support Vector Machine | Mean-Shift | Linear Discriminant Analysis |
| StepWise | Naive Bayes | DBSCAN | Generalized Discriminant Analysis |
| Ridge | Nearest Neighbor | Agglomerative Hierarchical | Autoencoder |
| Lasso | Decision Tree | Spectral Clustering | Non-Negative Matrix Factorization |
| ElasticNet | Random Forest | Gaussian Mixture | UMAP |

<p><center><img alt="Classification" width="600" src="00_images/31_machine_learning/shallow_learning_depictions.jpg" align="center" hspace="10px" vspace="0px"></center></p>

**Image Source**: de Oliveira, E.C.L., da Costa, K.S., Taube, P.S., Lima, A.H. and Junior, C.D.S.D.S., 2022. Biological Membrane-Penetrating Peptides: Computational Prediction and Applications. Frontiers in Cellular and Infection Microbiology, 12, p.838259. (https://doi.org/10.3389/fcimb.2022.838259)

<hr style="border:2px solid gray"></hr>

## Linear Regression Refresher

**Idea**: <font color='dodgerblue'>Optimize the orientation of a line</font> that **best fits** **coupled/correlated parameters** 
- **1 dependent** and **1 independent** variable: $y = m*x + b$
- optimize the **slope** and **y-intercept**
- a simple, but prominent technique in ML
- used frequently in supervised learning
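
As a minimal sketch of what "optimizing the slope and y-intercept" means, the closed-form least-squares estimates can be computed directly with NumPy. The data below is hypothetical and noise-free (it lies exactly on $y = 2x + 1$), so the fit should recover those two values.

```python
import numpy as np

# Hypothetical, noise-free data that lies exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Closed-form least-squares estimates for one independent variable
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(f'slope (m) = {slope:.2f}, y-intercept (b) = {intercept:.2f}')
```

With real (noisy) data the same formulas give the line that minimizes the sum of squared vertical distances between the points and the line.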

**Example Data**
- vaccination effectiveness and dosage
- $\text{CO}_2$ emissions and engine size
- life expectancy and vaccination coverage
- GPA and course attendance


Additional Info: https://en.wikipedia.org/wiki/Linear_regression

## Learning by example

**Example data**: housing prices across the United States

source: https://github.com/whoparthgarg/House-Price-Prediction (and https://www.kaggle.com/vedavyasv/usa-housing)

- **Avg. Area Income**: Average income of the residents of the city where the house is located
- **Avg. Area House Age**: Average age of houses within the same city
- **Avg. Area Number of Rooms**: Average number of rooms for houses within the same city
- **Avg. Area Number of Bedrooms**: Average number of bedrooms for houses within the same city
- **Area Population**: Population of the city where the house is located
- **Price**: Price of the house
- **Address**: Address for the house

In [None]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

The dataset (**usa_housing.csv**) can be downloaded from the git repository: https://github.com/karlkirschner/Scientific_Programming_Course

In [None]:
!head -2 usa_housing.csv

Rename the headers since they are very long

In [None]:
housing = pd.read_csv('usa_housing.csv', header=1,
                      names=['income', 'age', 'rooms', 'bedrooms', 'population', 'price', 'address'])
housing

In [None]:
housing.describe()

#### Observables - a.k.a. "Features"

(Definitions of words used in ML.)

What **features** do we want the machine **to learn from** for **making a prediction** of a **target observable**?

- <font color='dodgerblue'>features</font>: independent variables (`income`, `age`, `rooms`, `bedrooms`, `population`)
- <font color='dodgerblue'>target observable</font>: dependent variable (`price`)


Coding-wise, we can define the features like the following:

In [None]:
feature_list = ['income', 'age', 'rooms', 'bedrooms', 'population']

#### Visualize the data
Let's plot the features versus price to see what it might look like:

In [None]:
fig = plt.figure(figsize=(11, 8))

fig.subplots_adjust(wspace=0.2, hspace=0.5)

for count, feature in enumerate(feature_list):
    ax = fig.add_subplot(3, 2, count+1)  # first position can not be zero

    ax.set_xlabel(xlabel=feature)
    ax.set_ylabel(ylabel='price')

    ax.scatter(housing[feature], housing['price'], color='dodgerblue', s=10, alpha=0.3)

plt.show()

<hr style="border:1px solid gray"></hr>

## Linear Regression on a Single Feature (i.e., one-dimensional)

The **simplest scenario** is to focus on **1 feature** (e.g., `rooms`) and see if we can create a model for predicting a **house price** (i.e., `price`).

In [None]:
feature = housing['rooms'].values
target = housing['price'].values

### Training and Testing

- Good **data scholarship** means we need to **split our data** into **training** and **test** sets. We do this by using the following scikit-learn function:

`train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)`

- Returns: a list containing train-test split of the data input.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
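
To see what `train_test_split` returns before applying it to the housing data, here is a small sketch on hypothetical toy arrays (8 samples; the names `toy_feature` and `toy_target` are illustrative only). It shows that the split sizes follow `test_size` and that each feature value stays paired with its target value after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples, where target = 10 * feature
toy_feature = np.arange(8)
toy_target = toy_feature * 10

f_train, f_test, t_train, t_test = train_test_split(toy_feature, toy_target,
                                                    test_size=0.25, random_state=1)

print(f'training samples: {len(f_train)}, test samples: {len(f_test)}')

# Feature/target pairs remain aligned after the shuffled split
print(np.array_equal(t_train, f_train * 10), np.array_equal(t_test, f_test * 10))
```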

In [None]:
feature_train, feature_test, target_train, target_test = train_test_split(feature, target,
                                                                          test_size=0.25, train_size=0.75,
                                                                          random_state=1)

Let's double check the algorithm - we should have 25% of the data reserved for future testing.

In [None]:
print(f'Length of the training data: {len(target_train)}')
print(f'Length of the test data: {len(target_test)}')

print(f'Fraction of data used for the test data set: '
      f'{len(target_test) / (len(target_train) + len(target_test)) :0.2f}')

#### Visualize the training and test data

In [None]:
plt.figure()

plt.scatter(feature_train, target_train, s=10, label='Training Data')
plt.scatter(feature_test, target_test, s=8, label='Test Data')

plt.xlabel(xlabel='# of Rooms')
plt.ylabel(ylabel='Price')

plt.legend(loc='best')

plt.show()

#### Reshape the data
- scikit-learn's <font color='dodgerblue'>LinearRegression</font> requires the data to have a certain <font color='dodgerblue'>NumPy array shape</font>
- **Already Done**: the `target_train` and `target_test` are both already in their correct shape
- **Need to Do**:  reshape `feature_train` and `feature_test` (because they contain a single feature)

Feature Train:

In [None]:
display(feature_train)
display(feature_train.shape)

Feature Test:

In [None]:
display(feature_test)
display(feature_test.shape)

Since we only have **one feature** (i.e., one column; number of rooms), the feature arrays need **reshaping into a 2D array** (i.e., nested lists with one row per house and one column for the feature).

**Note:** If we do not reshape the data, then in the next step (i.e., `model = reg.fit(X=feature_train, y=target_train)`) we would obtain the following error:

`ValueError: Expected 2D array, got 1D array instead:
array=[7.76350224 6.67325638 6.39398078 ... 6.11019169 7.04733826 5.35511362].
Reshape your data using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.`

Numpy's reshape function: https://numpy.org/doc/stable/reference/generated/numpy.reshape.html
- One shape dimension can be -1
    - Then, its value is inferred from the array length and the remaining dimensions.
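
A minimal sketch of the `-1` behaviour, using a short hypothetical array rather than the housing data:

```python
import numpy as np

flat = np.array([7.8, 6.7, 6.4, 6.1])   # hypothetical 1D feature values
column = np.reshape(flat, (-1, 1))      # -1: NumPy infers the number of rows (here 4)

print(flat.shape)
print(column.shape)
print(column)
```

The result has one row per sample and a single column, which is the 2D layout that scikit-learn expects for `X`.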

In [None]:
display(feature_train)
display(feature_train.shape)

Reshape the data:

In [None]:
feature_train = np.reshape(feature_train, (-1, 1))
features_test = np.reshape(feature_test, (-1, 1))

display(feature_train)
display(feature_train.shape)

### Least Squares Linear Regression

- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

- `sklearn.linear_model.LinearRegression(*, fit_intercept=True, copy_X=True, n_jobs=None, positive=False)`

We will train in two steps
1. Define our **callable model**
    - linear regression
    - fit the y-intercept

In [None]:
reg = LinearRegression(fit_intercept=True)

2. Have the model **learn** from our data (i.e., optimize for a best fit)
     - This is the creation of a **model** that represents our training data

In [None]:
model = reg.fit(X=feature_train, y=target_train)

In [None]:
print(type(model))
model

### Making predictions using your model

- `predict`: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict
    - **Args**: np.ndarray of feature(s)
    - **Return**: predicted target value

**Create a new/independent/unknown house**:
- Known feature: <font color='dodgerblue'>5 rooms</font>
- Predict target: <font color='dodgerblue'>Cost</font>

In [None]:
new_house_feature = np.array([ [5] ])

display(new_house_feature)
display(new_house_feature.shape)

In [None]:
model.predict(X=new_house_feature)

In [None]:
print(f'Thus, the house with 5 rooms is predicted to cost ca. {model.predict(X=new_house_feature)[0]:0.1e} dollars.')

Demonstrate if we had 2 new houses:
- 5 rooms
- 2 rooms

In [None]:
new_houses_feature = np.array([ [5], [2] ])
display(new_houses_feature)
display(new_houses_feature.shape)

model.predict(X=new_houses_feature)

In [None]:
print(f'''Thus, the houses with 5 and 2 rooms are predicted to cost ca.
      $ {model.predict(X=new_houses_feature)[0]:0.1e} and
      $ {model.predict(X=new_houses_feature)[1]:0.1e}, respectively.''')

#### Evaluate the fit using the Coefficient of Determination ($R^2$)  - goodness-of-fit
- https://en.wikipedia.org/wiki/Coefficient_of_determination

Two ways to obtain this value:
1. `score`
    - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score
    - `score(X, y, sample_weight=None)`
    - `sample_weight`: setting the relative importance of the data


2. `r2_score`
    - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score
    - `r2_score(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average', force_finite=True)`
    - `force_finite`: use when y_true is constant

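- Score = 1: **Best possible model**
    
- Score = 0: **Poor model** (no better than always predicting the mean of the target)

- Score < 0: **Wrong model** (or **wrong constraints applied**) - worse than always predicting the mean; the score cannot be larger than 1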
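
The score ranges can be checked with a short sketch that computes $R^2 = 1 - SS_{res}/SS_{tot}$ by hand and compares it with `r2_score`. The arrays and the helper `r2_by_hand` are hypothetical and only for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])          # hypothetical target values
y_good = np.array([1.1, 1.9, 3.2, 3.8])          # close to the targets
y_mean = np.full_like(y_true, y_true.mean())     # always predicts the mean
y_bad = y_true[::-1].copy()                      # reversed: worse than the mean


def r2_by_hand(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    ''' Illustrative helper: R^2 = 1 - (residual sum of squares / total sum of squares). '''
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


for name, y_pred in [('good', y_good), ('mean', y_mean), ('bad', y_bad)]:
    print(f'{name:>4}: by hand = {r2_by_hand(y_true, y_pred):6.2f}, '
          f'r2_score = {r2_score(y_true=y_true, y_pred=y_pred):6.2f}')
```

The three cases give a score near 1, exactly 0, and a negative value, respectively.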

Using the housing **test data** set:

In [None]:
display(features_test.shape)

predict = model.predict(X=features_test)
predict

- `score`

In [None]:
model.score(X=features_test, y=target_test, sample_weight=None)

- `r2_score`

In [None]:
r2_score(y_true=target_test, y_pred=predict, multioutput='uniform_average', sample_weight=None, force_finite=True)

#### Overlay the scattered data with the model's prediction
- recall that the model is a **linear regression** - straight line

In [None]:
plt.figure()

plt.scatter(features_test, target_test, s=10, alpha=0.5, label='raw test data')
plt.plot(features_test, predict, color='black', linewidth=5, linestyle='solid', label='Linear Reg. Pred.')

plt.xlabel(xlabel='# of Rooms')
plt.ylabel(ylabel='Price')

plt.legend(loc='best')

plt.show()

The `coefficient` (i.e., the slope `m`) and `y-intercept` of the resulting fitted line:

In [None]:
print(f'Coefficients: {model.coef_}')
print(f'y-intercept: {model.intercept_}\n')

print(f'''Linear regression line:
      y(x) = price(room) = {model.coef_[0]:0.2e}x + {model.intercept_:0.2e}''')

#### Proof-of-concept

- using the **line equation**, our optimized `coefficients` and `y-intercept`, we can predict the price.

First, recall from above that this was done using the `predict` function:

In [None]:
model.predict(X=np.array([ [5] ]))

Now using an optimized straight line equation:

In [None]:
price = (model.coef_[0] * 5) + model.intercept_
f'{price}'

<hr style="border:1px solid gray"></hr>

## Create a Model from Two Features

The equation that defines a linear model with two "features" (i.e., two independent variables; geometrically a plane rather than a line) is 

$y = m_1*x_1 + m_2*x_2 + b$

- $x_1$ and $x_2$ = data for the two features
- $m_1$ and $m_2$ = the coefficients
- $b$ = y-intercept


- Extend this to `n` features (i.e., in multiple-dimensional space).

- Let's first generate a model that uses 2 features:
    - 'age' and 'rooms'
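
Before moving to the housing data, here is a minimal sketch of the multi-feature equation on hypothetical synthetic data (generated to follow $y = 3x_1 - 2x_2 + 5$ exactly). It illustrates that `predict` is equivalent to multiplying each feature by its coefficient and adding the y-intercept. The names `toy_model` and `by_hand` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: 50 samples, 2 features, exact relation y = 3*x_1 - 2*x_2 + 5
X = rng.uniform(0, 10, size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

toy_model = LinearRegression(fit_intercept=True).fit(X, y)

# y = m_1*x_1 + m_2*x_2 + b, written as a matrix product over all samples
by_hand = X @ toy_model.coef_ + toy_model.intercept_

print(f'Coefficients: {toy_model.coef_}')
print(f'y-intercept: {toy_model.intercept_:0.2f}')
print(f'Matches predict: {np.allclose(by_hand, toy_model.predict(X))}')
```

The same matrix product works unchanged for `n` features, since `coef_` simply gains one entry per feature.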

In [None]:
housing

In [None]:
two_features = ['age', 'rooms']

display(two_features)
display(housing[two_features].shape)

**Notice**: since there is more than **1 feature** (i.e., 2 Pandas DataFrame columns), we can pass the **DataFrame directly to `train_test_split`** without reshaping it (unlike the above example using 1 feature).

In [None]:
features_train, features_test, target_train, target_test = train_test_split(housing[two_features], target,
                                                                            test_size=0.25, train_size=0.75,
                                                                            random_state=1)

features_train

In [None]:
reg = LinearRegression(fit_intercept=True)

In [None]:
model = reg.fit(X=features_train, y=target_train)

model.score(X=features_test, y=target_test)

In [None]:
predict = model.predict(X=features_test)
predict

Create a plot function that allows us to visualize multiple price vs. features.

In [None]:
def plot_features(feature_list: list,
                  target: np.ndarray,
                  feature_df: pd.DataFrame,
                  predict: np.ndarray=None):
    ''' Create a plot with multiple subplots displayed in two columns.
    
        Args
            feature_list: x-axis features to be extracted from feature_df (i.e. column names)
            target: y-axis data (i.e. known target values)
            feature_df: x-axis data (i.e. DataFrame that contains the feature columns)
            predict: predicted values based on machine learning
        Returns
            plot
        
        Library dependencies
            matplotlib
            numpy
            pandas
    '''
    if not isinstance(feature_list, list):
        raise TypeError('Input features are not given as a list.')
    elif not isinstance(target, np.ndarray):
        raise TypeError('Target values are not given as a NumPy array.')
    elif not isinstance(feature_df, pd.DataFrame):
        raise TypeError('feature_df is not given as a Pandas dataframe.')
    elif predict is not None and not isinstance(predict, np.ndarray):
        raise TypeError('predict is not given as a NumPy array.')
    else:  

        number_of_rows = int(np.ceil(len(feature_list)/2))  # number of rows for a 2 column plot

        fig = plt.figure(figsize=(11, 3*number_of_rows))    # same height subplots regardless of rows

        fig.subplots_adjust(wspace=0.2, hspace=0.5)

        for count, feature in enumerate(feature_list):    
            ax = fig.add_subplot(number_of_rows, 2, count+1)  # first position can not be zero

            ax.set_xlabel(xlabel=feature)
            ax.set_ylabel(ylabel='price')

            ax.scatter(feature_df[feature], target, color='dodgerblue', s=20, alpha=0.3, label='known')

            if predict is not None:
                ax.scatter(feature_df[feature], predict, color='orange', s=10, alpha=0.5, linestyle='solid', label='prediction')
            
            ax.legend(loc='best')

        plt.show()

In [None]:
plot_features(feature_list=two_features, feature_df=features_test, target=target_test, predict=predict)

#### What would the resulting two-feature linear equation look like, for one of the input houses?

$y = (m_1*x_1) + (m_2*x_2) + (b)$

In [None]:
print(f'Coefficients: {model.coef_}')
print()
print(f'y-intercept: {model.intercept_}')

In [None]:
print(f'y = ({model.coef_[0]:0.2e} * x_1) \n'\
      f'  + ({model.coef_[1]:0.2e} * x_2) \n'\
      f'  + {model.intercept_:0.2e}')

#### Apply it to an individual house (i.e., the first data entry) to see how it reproduces the actual target value.

Use Pandas `loc[[]]` to isolate a row:

In [None]:
housing.loc[[0]]

Recall that we can use Pandas `loc[[ , ]]` to isolate rows and columns:

In [None]:
display(housing.loc[[0], ['age', 'rooms', 'price']])

Using our **ML model**:

In [None]:
price = model.predict(X=housing.loc[[0], ['age', 'rooms']])

f'{price[0]:0.2e}'

Alternatively, using the **equation for a line**

$y = (m_1*x_1) + (m_2*x_2) + (b)$

In [None]:
print(f'y = ({model.coef_[0]:0.2e} * {float(housing["age"].iloc[0]):0.2e})'\
      f' + ({model.coef_[1]:0.2e} * {float(housing["rooms"].iloc[0]):0.2e})'\
      f' + {model.intercept_:0.2e}')

In [None]:
predicted_price = (model.coef_[0] * float(housing["age"].iloc[0]))     \
                + (model.coef_[1] * float(housing["rooms"].iloc[0]))        \
                + model.intercept_

f'{predicted_price:0.2e}'

In [None]:
actual_price = housing["price"].iloc[0]

print(f'The listed price in the dataset is: {actual_price:0.2e}, '\
      f'a difference of {actual_price-predicted_price:0.2e}.')

##### Sidenote: plot the line corresponding to each subfeature

1. Create a straight line for plotting
2. Scatter plot the data and overlay with the straight lines
3. Do this in a loop that cycles over the features

Features:

In [None]:
two_features

In [None]:
plt.figure()

for index, observable in enumerate(two_features):
    # use the coefficient that belongs to this feature (y-intercept and other feature are omitted)
    observable_line = (model.coef_[index] * features_test[observable])
    
    plt.scatter(features_test[observable], target_test, s=20, alpha=0.5)
    plt.plot(features_test[observable], observable_line, linewidth=10, alpha=0.5, linestyle='solid')

<hr style="border:1px solid gray"></hr>

### Model from five features

In [None]:
five_features = ['income', 'age', 'rooms', 'bedrooms', 'population']

display(housing[five_features])
display(housing[five_features].shape)

In [None]:
features_train, features_test, target_train, target_test = train_test_split(housing[five_features], target,
                                                                            test_size=0.25, train_size=0.75,
                                                                            random_state=1)

In [None]:
model = reg.fit(X=features_train, y=target_train)

model.score(X=features_test, y=target_test)

In [None]:
predict = model.predict(X=features_test)
predict

Let's visualize how well the predicted values compare to the original `test` input data:

In [None]:
plot_features(feature_list=five_features, feature_df=features_test, target=target_test, predict=predict)

#### Apply it to an individual house (i.e., the first data entry) to see how it reproduces the actual target value.

In [None]:
housing.loc[[0]]

Using our **ML model**:

In [None]:
price = model.predict(X=housing.loc[[0], five_features])

f'{price[0]:0.3e}'

Alternatively, using the **equation for a line**:

$y = (m_1*x_1) + (m_2*x_2) + (m_3*x_3) + (m_4*x_4) + (m_5*x_5) + (b)$

In [None]:
print(f'''  y = ({model.coef_[0]:0.2e} * {float(housing["income"].iloc[0]):0.2e})
    + ({model.coef_[1]:0.2e} * {float(housing["age"].iloc[0]):0.2e})
    + ({model.coef_[2]:0.2e} * {float(housing["rooms"].iloc[0]):0.2e})
    + ({model.coef_[3]:0.2e} * {float(housing["bedrooms"].iloc[0]):0.2e})
    + ({model.coef_[4]:0.2e} * {float(housing["population"].iloc[0]):0.2e})
    + {model.intercept_:0.2e}''')

In [None]:
price = (model.coef_[0] * float(housing["income"].iloc[0]))     \
      + (model.coef_[1] * float(housing["age"].iloc[0]))        \
      + (model.coef_[2] * float(housing["rooms"].iloc[0]))      \
      + (model.coef_[3] * float(housing["bedrooms"].iloc[0]))   \
      + (model.coef_[4] * float(housing["population"].iloc[0])) \
      + model.intercept_

f'{price:0.3e}'

Now we see very good agreement between the model's predicted value and the target value of `$ 1.506e+06`.

<hr style="border:1px solid gray"></hr>

#### How do you run the model for a new house?

1. Create a new dataframe that provides the house's features
2. Use `predict` to generate a predicted value

In [None]:
new_house_features = pd.DataFrame(np.array([ [8.00e4, 6.5, 7.0, 4.0, 40.0e3 ] ]),
                                  columns=five_features)

display(new_house_features)

In [None]:
new_house_price = model.predict(X=new_house_features)
new_house_price

In [None]:
display(new_house_features)

print(f'The cost of the above house is predicted to be: ${float(new_house_price[0]):0.3e}.')

<hr style="border:1px solid gray"></hr>

## Clustering

### Why Cluster?
**<font color='dodgerblue'>One clusters data to discover inherent structures and patterns within a dataset.</font>**

- **Exploratory Data Analysis**: first step in understanding a new dataset
    - Identify natural divisions, outliers, or dominant characteristics
    - helps in forming hypotheses for further analysis

- **Simplification and Summarization**: focus on the cluster characteristics themselves (not individual data points)
    -  more conceptually manageable

- **Data Preprocessing and Compression**: dimensionality reduction
    - speed up computations
    - more conceptually manageable

- **Anomaly Detection**: Data points that do not fit well into any cluster, or form very small, isolated clusters, can be identified as anomalies or outliers.
    - pollution monitoring
    - disease outbreaks
    - error detection in experiments

- **Feature Engineering**: cluster assignments themselves can be used as new features for supervised learning tasks


### Scikit-learn Algorithms
- https://scikit-learn.org/1.5/modules/clustering.html

**<font color='dodgerblue'>Distances between Points</font>** (the core techniques)

- **K-Means** (widely used)
    - General-purpose, even cluster size, flat geometry, not too many clusters, inductive

- **Mean-shift**
    - Many clusters, uneven cluster size, non-flat geometry, inductive
    - Distances between points

- **Ward hierarchical clustering**
    - Many clusters, possibly connectivity constraints, transductive
    - Distances between points

- **OPTICS**
    - Non-flat geometry, uneven cluster sizes, variable cluster density, outlier removal, transductive
    - Distances between points

- **Bisecting K-Means**
    - General-purpose, even cluster size, flat geometry, no empty clusters, inductive, hierarchical
    - Distances between points

**<font color='dodgerblue'>Distances between nearest points</font>** (more specialized techniques)

- **DBSCAN**
    - identify clusters as **dense regions of points** in the data space, **separated** by areas of **lower point density** (noise).
    - Non-flat geometry, uneven cluster sizes, outlier removal, transductive

- **HDBSCAN**
    - Non-flat geometry, uneven cluster sizes, outlier removal, transductive, hierarchical, variable cluster density
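
To illustrate the DBSCAN description above (dense regions separated by noise), here is a minimal sketch on hypothetical 2D points: two tight groups and one isolated point. The `eps` and `min_samples` values are assumptions chosen to suit this toy data, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense groups of three points and one isolated point
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
                   [20.0, 20.0]])

# eps: neighborhood radius; min_samples: points (including itself) needed to form a dense region
db = DBSCAN(eps=0.5, min_samples=3).fit(points)

# A label of -1 marks a point treated as noise (i.e., an outlier)
print(db.labels_)
```

Unlike K-means, the number of clusters is not given in advance; it follows from the density parameters.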

**<font color='dodgerblue'>Graph distance (e.g. nearest-neighbor graph)</font>**

- **Affinity propagation**
    - Many clusters, uneven cluster size, non-flat geometry, inductive

- **Spectral clustering**
    - Few clusters, even cluster size, non-flat geometry, transductive

**<font color='dodgerblue'>Others</font>**

- **Gaussian mixtures**
    - Flat geometry, good for density estimation, inductive
    - Mahalanobis distances to centers


- **BIRCH**
    - Large dataset, outlier removal, data reduction, inductive
    - Euclidean distance between points


- **Agglomerative clustering**
    - Many clusters, possibly connectivity constraints, non-Euclidean distances, transductive
    - Any pairwise distance

---

#### Scaling data to have the same magnitude

Clustering and dimensionality reduction often benefit from working on data with similar magnitudes.

- Scale numerical values only: drop the `address` column

In [None]:
housing.drop('address', axis='columns', inplace=True)

- `StandardScaler`

  $x_{scaled} = \frac{(x - mean)}{sd}$

  where $sd$ is the standard deviation.
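
As a quick check of the formula, the following sketch scales a small hypothetical column by hand and compares it with `StandardScaler`. Note that `StandardScaler` uses the population standard deviation (`ddof=0`, NumPy's default), whereas Pandas' `.std()` defaults to `ddof=1`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([[1.0], [2.0], [3.0], [4.0]])   # hypothetical single-column data

# (x - mean) / sd, with the population standard deviation (ddof=0)
by_hand = (values - values.mean(axis=0)) / values.std(axis=0)
by_sklearn = StandardScaler().fit_transform(values)

print(by_hand.ravel())
print(f'Matches StandardScaler: {np.allclose(by_hand, by_sklearn)}')
print(f'Scaled mean: {by_sklearn.mean():.2f}, scaled sd: {by_sklearn.std():.2f}')
```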

(We will also import seaborn for easily creating nice plots of the resulting data analysis.)

In [None]:
import seaborn as sns
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_features = scaler.fit_transform(housing)
scaled_features_df = pd.DataFrame(scaled_features, columns=housing.columns)

scaled_features_df

#### Recovering the original data from the scaled data

Original, unscaled data:

In [None]:
housing

In [None]:
recovered_data = scaler.inverse_transform(scaled_features)
recovered_df = pd.DataFrame(recovered_data, columns=housing.columns)
recovered_df

---

Now back to clustering...

#### kmeans

`sklearn.cluster.k_means(X, n_clusters, *, sample_weight=None, init='k-means++', n_init='auto', max_iter=300, verbose=False, tol=0.0001, random_state=None, copy_x=True, algorithm='lloyd', return_n_iter=False)`

- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.k_means.html
- `X` needs to have data passed in a **2-dimensional container** (e.g., np.ndarray[[]]: `X=np.array([[1, 2], [1, 4], [1, 0]])`)

**Returns**:
- <font color='dodgerblue'>centroids</font>: cluster centers (i.e., coordinates)
- <font color='dodgerblue'>labels</font>: the centroid index that each data point belongs to
- <font color='dodgerblue'>inertia</font>: a metric for how well the data points belong to the clusters
    - "sum of squared distances to the closest centroid for all observations in the training set"
    - <font color='dodgerblue'>It quantifies how "tight" or "compact" the clusters are.</font> - a lower value indicates tighter clusters.
    <!-- - Practical usage - in conjunction with the Elbow Method to help determine a reasonable number of clusters (K) by observing the point of diminishing returns in inertia reduction. -->
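
The three return values can be inspected with a small sketch on hypothetical points (two obvious groups). It also recomputes the inertia by hand as the sum of squared distances between each point and its assigned centroid. Here `n_init=10` is an assumption made so the sketch also runs on scikit-learn versions that predate `n_init='auto'`.

```python
import numpy as np
from sklearn.cluster import k_means

# Hypothetical data: two well-separated pairs of points
points = np.array([[0.0, 0.0], [0.0, 1.0],
                   [10.0, 10.0], [10.0, 11.0]])

centroids, labels, inertia = k_means(X=points, n_clusters=2, random_state=0, n_init=10)

# Sum of squared distances from each point to the centroid it is assigned to
by_hand = np.sum((points - centroids[labels]) ** 2)

print(f'Centroids:\n{centroids}')
print(f'Labels: {labels}')
print(f'Inertia: {inertia:.2f} (by hand: {by_hand:.2f})')
```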

In [None]:
from sklearn.cluster import k_means

Remember that we can access specific columns in the DataFrame via:

In [None]:
scaled_features_df[['income', 'rooms', 'price']]

##### **Example 1**: how it **fails** when not including the properly shaped `X`.

- `X` needs to have data passed in a **2-dimensional container** (e.g., np.ndarray[[]]: `X=np.array([[1, 2], [1, 4], [1, 0]])`)
- `scaled_features_df['income']`: 1-dimensional

In [None]:
scaled_features_df['income'].shape

In [None]:
centroids, labels, inertia = k_means(X=scaled_features_df['income'], n_clusters=3, random_state=0, n_init='auto')

##### Example 2: a properly shaped `X`:
- `X=scaled_features_df[['income']]`

In [None]:
scaled_features_df[['income']].shape

In [None]:
centroids, labels, inertia = k_means(X=scaled_features_df[['income']], n_clusters=3, random_state=0, n_init='auto')

display(f'Centroids: {centroids}')
display(f'Labels:    {labels}')
display(f'Inertia:   {inertia:.2e}')

##### Example 3: a properly shaped `X` with two features:
- `X=scaled_features_df[['income', 'population']]`
- Adding complexity now with clustering based on two features

In [None]:
cluster_features = ['income', 'population']

In [None]:
centroids, labels, inertia = k_means(X=scaled_features_df[cluster_features], n_clusters=2, random_state=0, n_init='auto')

display(f'Centroids: {centroids}')
display(f'Labels:    {labels}')
display(f'Inertia:   {inertia:.2e}')

In [None]:
centroids, labels, inertia = k_means(X=scaled_features_df[cluster_features], n_clusters=3, random_state=0, n_init='auto')

display(f'Centroids: {centroids}')
display(f'Labels:    {labels}')
display(f'Inertia:   {inertia:.2e}')

**Note**: Inertia goes down with an increasing number of clusters
- generally true - more clusters give a lower inertia ("tighter" clusters)
- In practice: you need to decide when a further reduction in inertia has little impact (e.g. elbow method)

In [None]:
plt.figure(figsize=(11, 5))
sns.scatterplot(data=scaled_features_df, x='income', y='population', hue=labels)

for centroid_centers in centroids:
    plt.scatter(x=centroid_centers[0], y=centroid_centers[1],
                marker=r"$\odot$", s=150, edgecolor='DodgerBlue', linewidths=1)

In [None]:
# cluster_features = ['bedrooms', 'rooms']

inertias_list = []

cluster_number = range(1, 15)

for i in cluster_number:
    centroids, labels, inertia = k_means(X=scaled_features_df[cluster_features], n_clusters=i, random_state=0, n_init='auto')
    inertias_list.append(inertia)

plt.plot(cluster_number, inertias_list, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

Okay - let's take `n_clusters` to be `6` as the approximate elbow bend.

In [None]:
centroids, labels, inertia = k_means(X=scaled_features_df[cluster_features], n_clusters=6, random_state=0, n_init='auto')

display(f'Centroids: {centroids}')
display(f'Labels:    {labels}')
display(f'Inertia:   {inertia:.2e}')

In [None]:
plt.figure(figsize=(11, 5))

# sns.scatterplot(data=housing, x='bedrooms', y='rooms', hue=labels)
sns.scatterplot(data=scaled_features_df, x='income', y='population', hue=labels)

for centroid_centers in centroids:
    plt.scatter(x=centroid_centers[0], y=centroid_centers[1],
                marker=r"$\odot$", s=150, edgecolor='DodgerBlue', linewidths=1)

<hr style="border:1px solid gray"></hr>

## Dimensionality Reduction

### Why Reduce a Dataset's Dimensions?

- Simplify complex datasets by reducing the number of features (variables) while retaining as much meaningful information as possible
- "Dimensionality reduction is a method for representing a given dataset using a lower number of features (i.e. dimensions) while still capturing the original data’s meaningful properties." [1]

- helps us **understand** the data better
    - **visualization**
- **improves machine learning performance**
- can help the data analysis by **reducing the original data's noise**

### Scikit-learn Algorithms
- https://scikit-learn.org/stable/modules/unsupervised_reduction.html

- <font color='dodgerblue'>Principal Component Analysis</font> (**PCA**)
    - https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
    - **Reproducible** (deterministic)
    - Linear
    - **Unsupervised** learning
    - Most fundamental approach - good for when data has linear structure

<br>

- <font color='dodgerblue'>Linear Discriminant Analysis</font> (**LDA**)
    - https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis
    - **Reproducible** (deterministic)
    - Linear 
    - **supervised** learning
    - Very good for class separation - finds dimensions that best discriminate between classes
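
Since LDA is supervised, its `fit_transform` needs class labels in addition to the features. A minimal sketch on hypothetical two-class data (the class means and sizes are assumptions chosen for illustration) reduces three features to the single dimension that best separates the classes:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Hypothetical data: two classes with 3 features each, separated along the first two features
class_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(50, 3))
class_b = rng.normal(loc=[4.0, 4.0, 0.0], scale=1.0, size=(50, 3))

X = np.vstack([class_a, class_b])
y = np.array([0] * 50 + [1] * 50)    # class labels (required: supervised)

# With 2 classes, at most 1 discriminant component is available
lda = LinearDiscriminantAnalysis(n_components=1)
projected = lda.fit_transform(X, y)

print(projected.shape)
print(f'Training accuracy: {lda.score(X, y):.2f}')
```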

<br>

- <font color='dodgerblue'>t-Distributed Stochastic Neighbor Embedding</font> (**t-SNE**)
    - https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE
    - **Not reproducible** (non-deterministic; **stochastic**) due to random number usage
        - However, setting the **`random_state`** does allow reproducibility
        - Proper usage is to run it **multiple times** with different random seeds and **gather statistics**
    - **Non-linear**
    - **Unsupervised** learning
    - **preserves local data structures** (points that are close in the high-dimensional space are also close in the low-dimensional space)


**Note**:
- `decomposition` module: algorithms dedicated to transforming data by "decomposing" it into fundamental components or factors (e.g., PCA) - mostly **linear** in nature
- `manifold` module: algorithms dedicated to discovering **non-linear** structures within high-dimensional data.
    - A "manifold" is a lower-dimensional space embedded within a higher-dimensional space.

**Reference**
1. **Source**: https://www.ibm.com/topics/dimensionality-reduction

#### PCA

Look for a feature combination (in the reduced dimensions) that best captures the variance of the original features.

`decomposition.PCA` - sets up your PCA analysis with how many dimensions are wanted

`class sklearn.decomposition.PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', n_oversamples=10, power_iteration_normalizer='auto', random_state=None)`

- `n_components`
    - Number of components to keep - how many dimensions is the data reduced to
- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
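
How much of the original variance each component captures is available through the fitted object's `explained_variance_ratio_` attribute. The sketch below uses hypothetical synthetic data with three features that vary almost entirely along one direction, so the first component should capture nearly all of the variance. The names `toy_pca` and `reduced` are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.normal(size=200)

# Hypothetical data: 3 correlated features (driven by t) plus a little noise
data = np.column_stack([t, 2.0 * t, -t]) + rng.normal(scale=0.05, size=(200, 3))

toy_pca = PCA(n_components=2)
reduced = toy_pca.fit_transform(data)

print(reduced.shape)                          # 200 samples, now 2 dimensions
print(toy_pca.explained_variance_ratio_)      # fraction of variance per component
```

Inspecting these ratios is one way to judge how many components are worth keeping.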

In [None]:
from sklearn import decomposition

In [None]:
pca = decomposition.PCA(n_components=3)

`fit(X, y=None)`
- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit

In [None]:
scaled_features_df

In [None]:
clusters = pca.fit_transform(scaled_features_df)
clusters

In [None]:
clusters.shape

In [None]:
pca_1 = clusters[0:100, 0] # first 100 rows of column 0
pca_2 = clusters[0:100, 1] # for column 1
pca_3 = clusters[0:100, 2] # for column 2

In [None]:
plt.figure(figsize=(4, 4))
plt.scatter(pca_1, pca_2, c=pca_1, cmap=plt.get_cmap('Dark2', 3), alpha=0.2)
plt.xlabel('pc1')
plt.ylabel('pc2')

plt.figure(figsize=(4, 4))
plt.scatter(pca_1, pca_3, c=pca_1, cmap=plt.get_cmap('Dark2', 3), alpha=0.2)
plt.xlabel('pc1')
plt.ylabel('pc3')

plt.figure(figsize=(4, 4))
plt.scatter(pca_2, pca_3, c=pca_1, cmap=plt.get_cmap('Dark2', 3), alpha=0.2)
plt.xlabel('pc2')
plt.ylabel('pc3')

In [None]:
fig = plt.figure(figsize=(8, 8))

ax = fig.add_subplot(111, projection='3d')

scatter_plot = ax.scatter(pca_1, pca_2, pca_3, c=pca_3, cmap=plt.get_cmap('Dark2', 3))

ax.set_xlabel('pca 1')
ax.set_ylabel('pca 2')
ax.set_zlabel('pca 3')

ax.set_box_aspect(None, zoom=0.9) # Keep original box aspect ratio

plt.colorbar(scatter_plot, shrink=0.30, location='right', label='PCA 3 Value')

#### t-SNE

- Very sensitive to feature scaling: use sklearn's `StandardScaler`

- set the `random_state` for reproducibility
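
The effect of `random_state` can be explored with a small sketch on hypothetical random data: two runs with the same seed are compared. The `perplexity=10` value is an assumption made because the perplexity must be smaller than the number of samples (60 here); no particular outcome is claimed for other machines or library versions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
toy_data = rng.normal(size=(60, 5))   # hypothetical data, already on a similar scale

# Two independent runs that share the same random_state
run_a = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(toy_data)
run_b = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(toy_data)

print(run_a.shape)
print(f'Same seed gives the same embedding: {np.allclose(run_a, run_b)}')
```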

In [None]:
from sklearn.manifold import TSNE

In [None]:
tsne = TSNE(n_components=2, random_state=42)

tsne_results_2d = tsne.fit_transform(scaled_features_df)

pd.DataFrame(tsne_results_2d)

In [None]:
tsne_1 = tsne_results_2d[0:100, 0]
tsne_2 = tsne_results_2d[0:100, 1]

In [None]:
plt.figure(figsize=(4, 4))

plt.scatter(tsne_1, tsne_2, c=tsne_1, cmap=plt.get_cmap('Dark2', 2), alpha=0.5)

plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')

plt.show()

In [None]:
tsne_3d = TSNE(n_components=3, random_state=42)
tsne_results_3d = tsne_3d.fit_transform(scaled_features_df)

tsne_3d_1 = tsne_results_3d[0:100, 0]
tsne_3d_2 = tsne_results_3d[0:100, 1]
tsne_3d_3 = tsne_results_3d[0:100, 2]

In [None]:
fig = plt.figure(figsize=(8, 8))

ax = fig.add_subplot(111, projection='3d')

scatter_plot = ax.scatter(tsne_3d_1, tsne_3d_2, tsne_3d_3, c=tsne_3d_3, cmap=plt.get_cmap('plasma', 3), alpha=0.7)

ax.set_xlabel('t-SNE 1')
ax.set_ylabel('t-SNE 2')
ax.set_zlabel('t-SNE 3')

ax.set_box_aspect(None, zoom=0.9) # Keep original box aspect ratio

plt.colorbar(scatter_plot, shrink=0.30, location='right', label='t-SNE 3 Value')

plt.show()