# Don't Drop Out Of Numpy

In [1]:
cd ..

/home/jovyan/projects/dsi/09-unsupervised_learning-a_tutorial_on_pca


In [2]:
from lib.preprocessing import BoxCoxTransformer
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

This lesson makes heavy use of the numerical Python library, `numpy`. When working in `numpy`, it is very important that you do not "drop out of `numpy`", that is, convert your data into regular Python lists.

In [3]:
type([1,2])

list

In [4]:
type(np.array([1,2]))

numpy.ndarray

The most common way to "drop out of `numpy`" is to use a list comprehension on a `numpy` array.

In [5]:
type([v for v in np.array([1,2])])

list
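
To stay in `numpy`, the comprehension can usually be replaced by a vectorized expression. A minimal sketch, where the array `arr` and the doubling operation are made up purely for illustration:

```python
import numpy as np

arr = np.array([1, 2])

# A list comprehension loops in Python and returns a plain list.
doubled_list = [2 * v for v in arr]

# The vectorized equivalent operates on the whole array and stays an ndarray.
doubled_arr = 2 * arr

print(type(doubled_list))  # <class 'list'>
print(type(doubled_arr))   # <class 'numpy.ndarray'>
```

If a list has already been created, `np.array(...)` converts it back, but the Python-level loop has been paid for by then.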

#### `numpy` vs `math`

Python has a `math` library in addition to `numpy`. The main difference is that `numpy` functions operate elementwise on whole arrays, whereas `math` functions accept only scalar values.

In [6]:
import math

We will need cosine and sine functions to define our true function. As we will be performing vector calculations, we will need to use the `numpy` trigonometric functions as opposed to the `math` trigonometric functions.

In [7]:
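vv = np.linspace(1, 1000, 1000)
# np.cos is applied elementwise to the whole array
np.cos(vv)
# math.cos accepts only a scalar, so passing an array raises a TypeError
try:
    math.cos(vv)
except TypeError as e:
    print(e)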

only length-1 arrays can be converted to Python scalars


We could instead apply the `math` function to each element with a list comprehension.

In [8]:
cos_vv = [math.cos(v) for v in vv]

The issue is time.

In [None]:
%%timeit
np.cos(vv)

In [None]:
%%timeit
[math.cos(v) for v in vv]

The gap in run time widens as the array length $n$ grows.

In [None]:
%%timeit 
np.cos(np.linspace(1,1000,10000))

In [None]:
%%timeit 
[math.cos(v) for v in np.linspace(1,1000,10000)]

In [None]:
%%timeit 
np.cos(np.linspace(1,1000,1000000))

In [None]:
%%timeit 
[math.cos(v) for v in np.linspace(1,1000,1000000)]
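
`%%timeit` is a Jupyter cell magic, so the cells above only run inside a notebook. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used. The call count `n_calls` is an arbitrary choice for illustration, and the timings depend on the machine:

```python
import math
import timeit

import numpy as np

vv = np.linspace(1, 1000, 1000)

# Time each approach over the same number of calls.
n_calls = 200
t_numpy = timeit.timeit(lambda: np.cos(vv), number=n_calls)
t_math = timeit.timeit(lambda: [math.cos(v) for v in vv], number=n_calls)

print(f"numpy: {t_numpy / n_calls * 1e6:.1f} us per call")
print(f"math:  {t_math / n_calls * 1e6:.1f} us per call")

# Both approaches compute the same values; only the speed differs.
same = np.allclose(np.cos(vv), [math.cos(v) for v in vv])
```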

# Unsupervised Learning

With unsupervised learning, we are still interested in developing models, but we do not have an output for each input. Rather, we simply have data points.


### With unsupervised learning, there is no target.

# Moments

If the points represent mass:
- the zeroth moment is the total mass
- the first moment divided by the total mass is the center of mass
- the second moment is the rotational inertia

If the points represent probability density:
- the zeroth moment is the total probability (i.e. one)
- the first moment is the mean
- the second central moment is the variance
- the third standardized moment is the skewness
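
The probability-density moments listed above can be estimated directly from a sample. The following is a minimal sketch in plain `numpy`; the exponential sample is made up for illustration, and the estimates use the population (biased) form of each moment:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)  # illustrative right-skewed sample

mean = x.mean()                            # first moment
centered = x - mean
variance = np.mean(centered ** 2)          # second central moment
std = np.sqrt(variance)
skewness = np.mean((centered / std) ** 3)  # third standardized moment

print(mean, variance, skewness)
```

Here `variance` agrees with `x.var()`, and `skewness` is the same quantity that `scipy.stats.skew` returns with its default settings.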

### Critical to Remember

- the **mean** is the expected value of a feature
- the **variance** is a central moment and describes the spread around an expected value
- the **skewness** is a standardized moment and describes the asymmetry of the feature's distribution about its mean; a symmetric distribution, such as the normal distribution, has zero skewness
- to center a feature, subtract the mean from the feature

   e.g. `X_c = X - X.mean()`

- to standardize a feature, subtract the mean and divide by the standard deviation

   e.g. `X_sc = (X - X.mean())/X.std()`
- `sklearn` has shipped a built-in tool for removing skew since version 0.20: `sklearn.preprocessing.PowerTransformer`, which supports the Box-Cox and Yeo-Johnson transforms. Older versions have none.
- the `lib.preprocessing` module in this repository includes an `sklearn` compatible `BoxCoxTransformer`.
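
The centering and standardizing steps in the list above can be checked on a small example. This sketch uses a made-up two-feature `numpy` array rather than a `DataFrame`, so the column axis has to be given explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 2))  # made-up data, two features

# Center: subtract each column's mean.
X_c = X - X.mean(axis=0)

# Standardize: center, then divide by each column's standard deviation.
X_sc = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_c.mean(axis=0))   # approximately zero in every column
print(X_sc.std(axis=0))   # approximately one in every column
```

Two details are easy to miss. On a 2-D `numpy` array, `X.mean()` with no `axis` averages over every entry, whereas on a `pandas` `DataFrame` it averages each column. Also, `numpy`'s `std` defaults to `ddof=0` while `pandas`' `std` defaults to `ddof=1`, so the two give slightly different standardized values.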

## Deskewing Data

In [None]:
from sklearn.datasets import load_iris
from scipy.stats import skew

In [None]:
X, y = load_iris(return_X_y=True)
X = pd.DataFrame(X)

In [None]:
stats = X.describe().T
stats['skew'] = skew(X)
stats

In [None]:
fig = plt.figure(figsize=(20,6))
for i, col in enumerate(X.columns):
    fig.add_subplot(2, 2, i + 1)               # one panel per feature in a 2x2 grid
    sns.distplot(X[col], label=str(col))
    plt.axvline(X[col].mean(), c='red')        # red line: mean
    plt.axvline(X[col].median(), c='black')    # black line: median
    plt.legend()

In [None]:
X_dsk = pd.DataFrame(BoxCoxTransformer().fit_transform(X))
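
The `BoxCoxTransformer` used above lives in this repository's `lib.preprocessing`, and its internals are not shown here. As a rough stand-in, assuming it applies a standard Box-Cox transform to each column, the effect on a single feature can be sketched with `scipy.stats.boxcox`. The lognormal sample below is made up for illustration. Box-Cox requires strictly positive input.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.75, size=2_000)  # positive, right-skewed

# With no lambda given, boxcox estimates it by maximum likelihood
# and returns the transformed data along with that estimate.
x_bc, lmbda = boxcox(x)

skew_before = skew(x)
skew_after = skew(x_bc)

print(lmbda, skew_before, skew_after)
```

The skewness after the transform should be much closer to zero than before, which is the same comparison the `skew` column makes in the tables in this section.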

In [None]:
stats_dsk = X_dsk.describe().T
stats_dsk['skew'] = skew(X_dsk)
stats_dsk

In [None]:
fig = plt.figure(figsize=(20,6))
for i, col in enumerate(X_dsk.columns):
    fig.add_subplot(2, 2, i + 1)                   # one panel per feature in a 2x2 grid
    sns.distplot(X_dsk[col], label=str(col))
    plt.axvline(X_dsk[col].mean(), c='red')        # red line: mean
    plt.axvline(X_dsk[col].median(), c='black')    # black line: median
    plt.legend()

In [None]:
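# Keep the Box-Cox output for columns 0 and 1,
# but restore the original values for columns 2 and 3.
X_cond = X_dsk.copy()
X_cond[2] = X[2]
X_cond[3] = X[3]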

In [None]:
stats_cond = X_cond.describe().T
stats_cond['skew'] = skew(X_cond)
stats_cond

In [None]:
fig = plt.figure(figsize=(20,6))
for i, col in enumerate(X_cond.columns):
    fig.add_subplot(2, 2, i + 1)                    # one panel per feature in a 2x2 grid
    sns.distplot(X_cond[col], label=str(col))
    plt.axvline(X_cond[col].mean(), c='red')        # red line: mean
    plt.axvline(X_cond[col].median(), c='black')    # black line: median
    plt.legend()