You can follow along and play with this notebook by clicking the badge below.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jasongfleischer/UCSD_COGS118A/blob/main/Notebooks/Lecture_02_feature_representation.ipynb)


# Feature representation

1. Standardizing (z-transform), log transform, and other normalizations
2. One hot encoding
3. Representing image data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# also in preprocessing: QuantileTransformer, MinMaxScaler and others!

from PIL import Image
import seaborn as sns

import requests # need to use HTTP stream to load data from Github inside Google Colab

In [None]:
penguins = sns.load_dataset('penguins').dropna() # dropna() gets rid of rows with missing data
penguins

# 1. StandardScaler() et al.
Let's re-scale the real-valued inputs. If some variables take values a couple of orders of magnitude larger than others, some machine learning algorithms (for example, distance-based ones) will key on the big values and effectively ignore the small ones.
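
To make the idea concrete, here is a minimal sketch of the z-transform computed by hand. The two columns and their values are made up for illustration (they are not penguin data). Each column has its mean subtracted and is then divided by its standard deviation, which is what `StandardScaler` does with its default settings.

```python
import numpy as np

# Hypothetical toy data: column 0 is in the thousands, column 1 is in the tens
toy = np.array([[3500.0, 18.0],
                [4200.0, 20.0],
                [5100.0, 15.0],
                [3900.0, 17.0]])

# z-transform: subtract each column's mean, divide by its standard deviation
# (ddof=0, the population standard deviation, matches StandardScaler's default)
toy_z = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(toy_z.mean(axis=0))  # approximately 0 for every column
print(toy_z.std(axis=0))   # approximately 1 for every column
```

After the transform both columns live on the same scale, even though the raw values differed by two orders of magnitude.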

In [None]:
X = penguins.drop(['species','island','sex'], axis=1)

X

In [None]:
X.hist(); 
plt.tight_layout(); #necessary to keep the subfigure titles from overlapping other rows

In [None]:
# the data is now z-transformed: each value is "standard deviations from the mean of its column"
X_s = pd.DataFrame( StandardScaler().fit_transform(X), columns=X.columns, index=X.index)

X_s

In [None]:
X_s.hist(); 
plt.tight_layout(); 

In [None]:
X_l = np.log(X) # log transform the data to make it look more normal (all values must be > 0)

X_l.hist() 
plt.tight_layout(); 

In [None]:
# log transform and then scale... it's still not really normal, but it is less skewed on some body & bill measures
X_ls = pd.DataFrame( StandardScaler().fit_transform(X_l), columns=X_l.columns, index=X_l.index)

X_ls.hist() 
plt.tight_layout(); 
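
As a rough check on the "less skewed" claim, the sketch below measures skewness before and after a log transform. The data is a synthetic log-normal sample (an assumption for illustration, not the penguin measurements), and `skewness` is a hypothetical helper written here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed sample standing in for a skewed measurement
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

def skewness(v):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    z = (v - v.mean()) / v.std()
    return (z ** 3).mean()

raw_skew = skewness(skewed)
log_skew = skewness(np.log(skewed))  # np.log needs strictly positive values

print(raw_skew, log_skew)
```

For this synthetic sample the log transform removes nearly all of the skew; real measurements like the penguin data will usually improve less dramatically.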

# 2. OneHotEncoder()

OK, there are 3 categorical variables in the data: species, the island the bird was found on, and the sex of the bird.
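
Before using the scikit-learn encoder, here is a minimal by-hand sketch of what one-hot encoding does, using plain NumPy on a few made-up labels (the `labels` array is an assumption for illustration). Each category gets its own column, and each row has a single 1 in the column of its category.

```python
import numpy as np

# Hypothetical labels for illustration
labels = np.array(['Dream', 'Biscoe', 'Torgersen', 'Biscoe'])

# categories: sorted unique values; codes: position of each label in categories
categories, codes = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```

This is only a sketch of the idea; `OneHotEncoder` additionally remembers the categories it was fit on, so new data can be encoded consistently later.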

In [None]:
penguins.species.unique()

In [None]:
penguins.island.unique()

In [None]:
penguins.sex.unique()

In [None]:
encoder =  OneHotEncoder().fit( penguins[['species','island', 'sex']] )
encoder.categories_

In [None]:
transformed = encoder.transform( penguins[['species','island', 'sex']] ).toarray() 
# toarray() turns the output from a sparse to a dense matrix

transformed
# 1st 3 columns are species, next 3 columns are island, last 2 columns are sex

In [None]:
for index, category in enumerate( np.concatenate(encoder.categories_) ):
    X_s[category] = transformed[:,index]
    
X_s

In [None]:
X_s.hist(); 
plt.tight_layout();

Now all the categorical variables are one-hot encoded, all the real-valued variables have been scaled, and **ALL** variables are on the same order of magnitude, so there's no variable-favoritism that can happen :)

# 3. Image encoding

Images can be encoded numerically. Typically this will be in a color space, which you can think of as a vector space. For grayscale (like below) this is done by putting the pixels into a 2-D array whose rows and columns correspond to the rows and columns of the image. The number at each array location is between 0 and 255 (8 bits) and denotes the brightness of that pixel... bigger numbers are closer to white, smaller numbers are closer to black.

In [None]:
im0 = np.array( [[0,0,0,0,0,200,200,200,200,200],
       [0,0,0,0,0,200,200,200,200,200],
       [0,0,0,0,0,200,200,200,200,200],
       [0,0,0,0,0,200,200,200,200,200],
       [0,0,0,0,0,200,200,200,200,200],
       [0,0,0,0,0,200,200,200,200,200],
       [0,0,0,0,0,200,200,200,200,200],
       [0,0,0,0,0,200,200,200,200,200],
       [0,0,0,0,0,200,200,200,200,200],
       [0,0,0,0,0,200,200,200,200,200]
      ])
im0

In [None]:
plt.imshow(im0,cmap='gray',vmin=0,vmax=255);
plt.axis('off');

If we use a color image, we will likely be in RGB color space (but other color spaces are possible!).  Each pixel is now a 3-D vector $(x_1,x_2,x_3)$ with the numbers representing Red, Green, and Blue intensity respectively 

In [None]:

b = [0,0,0] # black is all zeros
y = [255,255,0] # yellow is red + green in additive color mixing 

im1 =  np.array(
       [[b,b,b,b,y,y,b,b,b,b],
        [b,b,y,y,y,y,y,y,b,b],
        [b,y,y,y,y,y,y,y,y,b],
        [b,y,y,b,y,y,b,y,y,b],
        [y,y,y,y,y,y,y,y,y,y],
        [y,y,b,y,y,y,y,b,y,y],
        [b,y,y,b,b,b,b,y,y,b],
        [b,y,y,y,y,y,y,y,y,b],
        [b,b,y,y,y,y,y,y,b,b],
        [b,b,b,b,y,y,b,b,b,b]])


In [None]:
plt.imshow(im1);
plt.axis('off');

OK, so that's what image data looks like.  

But images are 3-D data: pixel rows x pixel columns x color vectors. We talk about machine learning algorithms taking an input that's a 1-D vector. And the penguin example above is like that... each penguin is represented as a 1-D vector of numbers.

How can we take a 3-D array and make it 1-D? NumPy arrays provide a method `.flatten()` which, by default, unravels first the 3rd dimension (color), then the 2nd dimension (columns), and then lastly the 1st dimension (rows).

So we start with 

$$
\begin{bmatrix}
[R_{(0,0)},G_{(0,0)},B_{(0,0)}] & \cdots & [R_{(0,m)},G_{(0,m)},B_{(0,m)}] \\
\vdots & \ddots & \vdots \\
[R_{(n,0)},G_{(n,0)},B_{(n,0)}] & \cdots & [R_{(n,m)},G_{(n,m)},B_{(n,m)}]
\end{bmatrix}
$$

And get out 

$$ [R_{(0,0)},G_{(0,0)},B_{(0,0)}, R_{(0, 1)}, \cdots, B_{(0,m)}, R_{(1,0)}, \cdots \cdots B_{(n,m)} ] $$
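
The ordering above can be checked on a tiny example. This is a minimal sketch using a made-up 2 x 3 RGB array filled with consecutive integers (the shape and the names `img`, `flat_index`, and `restored` are assumptions for illustration), so each value's position in the flattened vector is easy to see.

```python
import numpy as np

# Hypothetical 2-row x 3-column RGB image with recognizable values
n_rows, n_cols, n_channels = 2, 3, 3
img = np.arange(n_rows * n_cols * n_channels).reshape(n_rows, n_cols, n_channels)

flat = img.flatten()  # default order='C': the last axis (color) varies fastest

# Channel k of the pixel at (r, c) lands at this position in the flat vector
r, c, k = 1, 2, 0
flat_index = (r * n_cols + c) * n_channels + k
print(flat[flat_index] == img[r, c, k])

# reshape with the original shape undoes the flattening
restored = flat.reshape(img.shape)
print(np.array_equal(restored, img))
```

Because the flattening is reversible, no pixel information is lost; only the explicit row/column/color structure is hidden from the learning algorithm.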


In [None]:
im1.flatten() # unraveled in the order above

In [None]:
# here's another image, one that's a bit more complicated than our smiley
im2 = Image.open(
    requests.get('https://github.com/jasongfleischer/UCSD_COGS118A/raw/main/Notebooks/data/party-popper_1f389.png', stream=True).raw
)

im2

In [None]:
# OK, we want to make a dataset, so we need to downsize the giant popper to match im1, and set it up in the same format

im2a = np.array( im2.resize((10,10)).convert('RGB').getdata() ).flatten()
im2a

In [None]:
imdata = pd.DataFrame( [im1.flatten(),im2a], index=['smiley','popper'] )
# cool, here's some data to train our algorithm on!

imdata