# Reducing Complexity 

I will be using a dataset provided with “The Elements of Statistical Learning: 
Data Mining, Inference, and Prediction” by Trevor Hastie, Robert Tibshirani & Jerome Friedman. 

> Normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service. The original scanned digits are binary and of different sizes and orientations; the images here have been deslanted and size normalized, resulting in 16 x 16 grayscale images (Le Cun et al., 1990).
> 
> The data are in two gzipped files, and each line consists of the digit id (0-9) followed by the 256 grayscale values.
> 
> There are 7291 training observations and 2007 test observations [...]

Sources: 
- https://web.stanford.edu/~hastie/ElemStatLearn/datasets/zip.info.txt
- https://web.stanford.edu/~hastie/ElemStatLearn/datasets/zip.test.gz
- https://web.stanford.edu/~hastie/ElemStatLearn/datasets/zip.train.gz
--- 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans

## Importing and preparing the data

In [2]:
# Read the space-separated file into a pandas DataFrame 
# header=None: the first line is data, not a header; sep=" ": a single space separates the fields 
# see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html 
data = pd.read_csv("data/zip.train", header=None, sep=" ")

In [3]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,248,249,250,251,252,253,254,255,256,257
0,6.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.631,0.862,...,0.823,1.0,0.482,-0.474,-0.991,-1.0,-1.0,-1.0,-1.0,
1,5.0,-1.0,-1.0,-1.0,-0.813,-0.671,-0.809,-0.887,-0.671,-0.853,...,-0.671,-0.033,0.761,0.762,0.126,-0.095,-0.671,-0.828,-1.0,
2,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-0.109,1.0,-0.179,-1.0,-1.0,-1.0,-1.0,
3,7.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.273,0.684,0.96,0.45,...,1.0,0.536,-0.987,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,
4,3.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.928,-0.204,0.751,0.466,...,0.639,1.0,1.0,0.791,0.439,-0.199,-0.883,-1.0,-1.0,


- As described in ```zip.info.txt```, column 0 holds the digit (0-9) represented by the 256 grayscale values that follow it in each row. 
- Column 257 is NaN throughout because every line ends with a trailing space, which ```read_csv``` parses as one extra, empty field. 
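
The trailing-separator effect can be reproduced on a tiny in-memory sample. This is only a sketch: the two sample lines and the names `sample`, `df` and `cleaned` are made up for illustration and are not part of the dataset.

```python
import io

import pandas as pd

# Two made-up lines in the same layout as zip.train: label, values, trailing space.
sample = "6.0 -1.0 0.5 \n5.0 -1.0 1.0 \n"

df = pd.read_csv(io.StringIO(sample), header=None, sep=" ")
print(df.shape)             # three real fields plus one empty trailing field
print(df[3].isna().all())   # the extra column contains only NaN

# Dropping every all-NaN column is an alternative to deleting column 257 by name.
cleaned = df.dropna(axis=1, how="all")
print(cleaned.shape)
```

Dropping by "all NaN" avoids hard-coding the column number, at the cost of also removing any other column that happens to be entirely empty.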

In [4]:
# Lose the last column 
del data[257]

- To reduce complexity in the data, I will downscale each of the 256 grayscale values from the continuous range [-1, 1] to 16 discrete levels (1 to 16). 
- The mapping is: f(x) = round((x+1)*7.5+1), so -1 maps to 1 and 1 maps to 16. 
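
As a quick worked example of this mapping, the sketch below applies it to a few values taken from the first row shown above. The helper name `rescale` is hypothetical and only used here for illustration.

```python
import numpy as np


def rescale(x):
    """Map grayscale values in [-1, 1] to whole-number levels in [1, 16]."""
    return np.round((np.asarray(x, dtype=float) + 1) * 7.5 + 1)


# -1 is the darkest value, 1 the brightest; -0.631 and 0.862 appear in row 0.
levels = rescale([-1.0, -0.631, 0.862, 1.0])
print(levels.tolist())  # [1.0, 4.0, 15.0, 16.0]
```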

In [5]:
# selects the first column holding the digit id (0-9)
data_digits = data.iloc[:,0]

In [6]:
data_digits.head()

0    6.0
1    5.0
2    4.0
3    7.0
4    3.0
Name: 0, dtype: float64

In [7]:
"""
The built-in int() can *not* be applied to a whole pandas Series, so in order to 
convert the values in column 0 to integers I define and apply a lambda function.
"""

int_x = lambda x: int(x)
data_digits = data_digits.apply(int_x)

In [8]:
data_digits.head()

0    6
1    5
2    4
3    7
4    3
Name: 0, dtype: int64
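
For comparison, pandas also offers a vectorized cast through `Series.astype`. The sketch below, on a small made-up Series named `labels`, suggests that it gives the same result as the lambda approach for whole-number floats like these.

```python
import pandas as pd

# Made-up stand-in for the digit column.
labels = pd.Series([6.0, 5.0, 4.0, 7.0, 3.0])

via_apply = labels.apply(lambda x: int(x))   # element-wise, as in the cell above
via_astype = labels.astype("int64")          # vectorized cast

print(via_apply.equals(via_astype))
print(via_astype.dtype)
```

Note that both approaches truncate toward zero, so they are only interchangeable with rounding when the floats are already whole numbers, as they are here.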

In [9]:
# extract the columns holding the grayscale values and apply rescaling function 

data_grayscale_values = (data.iloc[:,1:257]).apply(lambda x: (x+1)*7.5+1)

In [10]:
data_grayscale_values.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,247,248,249,250,251,252,253,254,255,256
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.7675,14.965,7.2475,...,10.78,14.6725,16.0,12.115,4.945,1.0675,1.0,1.0,1.0,1.0
1,1.0,1.0,1.0,2.4025,3.4675,2.4325,1.8475,3.4675,2.1025,1.0,...,3.4675,3.4675,8.2525,14.2075,14.215,9.445,7.7875,3.4675,2.29,1.0
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.03,...,1.0,1.0,1.0,7.6825,16.0,7.1575,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0,1.0,6.4525,13.63,15.7,11.875,7.9975,...,6.115,16.0,12.52,1.0975,1.0,1.0,1.0,1.0,1.0,1.0
4,1.0,1.0,1.0,1.0,1.0,1.54,6.97,14.1325,11.995,10.255,...,11.995,13.2925,16.0,16.0,14.4325,11.7925,7.0075,1.8775,1.0,1.0


In [11]:
# rounds floats to nearest whole-number float 

data_grayscale_values = np.round(data_grayscale_values)
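
One detail worth knowing about `np.round`: values exactly halfway between two integers are rounded to the nearest even integer, not always upward. The sketch below illustrates this on made-up inputs and contrasts it with a hypothetical round-half-up variant built from `np.floor`; whether any rescaled pixel actually lands exactly on a half is not checked here.

```python
import numpy as np

halves = np.array([0.5, 1.5, 2.5, 3.5])

half_to_even = np.round(halves)       # NumPy's default behaviour
half_up = np.floor(halves + 0.5)      # alternative: always round halves upward

print(half_to_even.tolist())  # [0.0, 2.0, 2.0, 4.0]
print(half_up.tolist())       # [1.0, 2.0, 3.0, 4.0]
```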

In [12]:
# data_grayscale_values is already rounded, so just join labels and pixels column-wise
data_rc = pd.concat([data_digits, data_grayscale_values], axis=1)

In [13]:
data_rc.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,247,248,249,250,251,252,253,254,255,256
0,6,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,15.0,...,11.0,15.0,16.0,12.0,5.0,1.0,1.0,1.0,1.0,1.0
1,5,1.0,1.0,1.0,2.0,3.0,2.0,2.0,3.0,2.0,...,3.0,3.0,8.0,14.0,14.0,9.0,8.0,3.0,2.0,1.0
2,4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,8.0,16.0,7.0,1.0,1.0,1.0,1.0
3,7,1.0,1.0,1.0,1.0,1.0,6.0,14.0,16.0,12.0,...,6.0,16.0,13.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,3,1.0,1.0,1.0,1.0,1.0,2.0,7.0,14.0,12.0,...,12.0,13.0,16.0,16.0,14.0,12.0,7.0,2.0,1.0,1.0
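
A simple sanity check on the reduced data is to confirm that every pixel is a whole number between 1 and 16. The sketch below does this on a small made-up frame, `sample_rc`, that mimics the layout of `data_rc` (label in column 0, pixels after it); the final cast to integers is an optional extra step, not something the notebook does.

```python
import pandas as pd

# Made-up stand-in for data_rc: label column 0 plus three reduced pixel columns.
sample_rc = pd.DataFrame({
    0: [6, 5, 4],
    1: [1.0, 1.0, 1.0],
    2: [4.0, 3.0, 1.0],
    3: [15.0, 2.0, 16.0],
})

pixels = sample_rc.iloc[:, 1:]
values = pixels.to_numpy()

in_range = bool(values.min() >= 1 and values.max() <= 16)
whole_numbers = bool((values == values.round()).all())
print(in_range, whole_numbers)

# Optional: store the levels as integers instead of whole-number floats.
pixels_int = pixels.astype("int64")
print(pixels_int.dtypes.unique())
```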
