# DAT210x - Programming with Python for DS

## Module 4 - Lab 2

In [2]:
import math
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from mpl_toolkits.mplot3d import Axes3D
#%matplotlib inline
#%matplotlib notebook
from sklearn import preprocessing


In [3]:
# Look pretty...

# matplotlib.style.use('ggplot')
plt.style.use('ggplot')

### Some Boilerplate Code

For your convenience, we've included some boilerplate code here which will help you out. You aren't expected to know how to write this code on your own at this point, but it'll assist with your visualizations. We've added some notes to the code in case you're interested in knowing what it's doing:

### A Note on SKLearn's `.transform()` calls:

Any time you perform a transformation on your data, you lose the column header names because the output of SciKit-Learn's `.transform()` method is an NDArray and not a dataframe.

This actually makes a lot of sense because there are essentially two types of transformations:
- Those that adjust the scale of your features, and
- Those that alter the number of features, perhaps even changing their values entirely.

An example of adjusting the scale of a feature would be changing centimeters to inches. Changing the feature entirely would be like using PCA to reduce 300 columns to 30. In either case, the original column's units have either been altered or no longer exist at all, so it's up to you to assign names to your columns after any transformation, if you'd like to store the resulting NDArray back into a dataframe.
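As a minimal sketch of that round trip, assuming a small made-up dataframe (the names `df_demo`, `height_cm` and `weight_kg` are hypothetical and not part of the lab): the scaler hands back a plain NumPy array, and the column names have to be reattached by hand.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, not taken from the lab's dataset
df_demo = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 65.0, 80.0],
})

scaled = StandardScaler().fit_transform(df_demo)
print(type(scaled))  # a NumPy ndarray: the column names are gone

# Scaling keeps the number of features, so the old names still line up
scaled_df = pd.DataFrame(scaled, columns=df_demo.columns, index=df_demo.index)
print(scaled_df.columns.tolist())
```

This only works because scaling leaves the number of columns unchanged. After something like PCA, the old names no longer describe the new columns, so you would pick new ones.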

In [4]:
def scaleFeaturesDF(df):
    # Feature scaling is a type of transformation that only changes the
    # scale, but not number of features. Because of this, we can still
    # use the original dataset's column names... so long as we keep in
    # mind that the _units_ have been altered:

    scaled = preprocessing.StandardScaler().fit_transform(df)
    scaled = pd.DataFrame(scaled, columns=df.columns)
    
    print("New Variances:\n", scaled.var())
    #print("New Describe:\n", scaled.describe())
    return scaled

SKLearn contains many methods for transforming your features by scaling them (a type of pre-processing):

- `RobustScaler`
- `Normalizer`
- `MinMaxScaler`
- `MaxAbsScaler`
- `StandardScaler`
- ...

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

However, for PCA to be effective, a few requirements must be met, and these will drive the selection of your scaler. PCA requires your data to be standardized: its _mean_ should equal 0, and it should have unit variance.

SKLearn's regular `Normalizer()` doesn't zero out the mean of your data; it only rescales each sample (row) to unit norm, so it could be inappropriate to use depending on your data. `MinMaxScaler` and `MaxAbsScaler` both fail to set a unit variance, so you won't be using them here either. `RobustScaler` can work, again depending on your data (watch for outliers!). So for this assignment, you're going to use the `StandardScaler`. Get familiar with it by visiting these two websites:

- http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler
- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
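The difference between the scalers can be checked numerically. The sketch below uses randomly generated data (an assumption made purely for illustration) and compares the per-feature mean and variance after `StandardScaler` and `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales
X = np.column_stack([rng.normal(100.0, 20.0, 200), rng.normal(5.0, 0.5, 200)])

X_std = StandardScaler().fit_transform(X)
X_mm = MinMaxScaler().fit_transform(X)

print("StandardScaler mean:", X_std.mean(axis=0))  # approximately 0
print("StandardScaler var: ", X_std.var(axis=0))   # approximately 1
print("MinMaxScaler mean:  ", X_mm.mean(axis=0))   # somewhere inside [0, 1]
print("MinMaxScaler var:   ", X_mm.var(axis=0))    # well below 1
```

One detail to keep in mind: `StandardScaler` divides by the population standard deviation, while Pandas' `.var()` uses the sample variance by default. The variances printed by `scaleFeaturesDF()` will therefore come out slightly above 1 on small datasets.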

Lastly, some code to help with visualizations:

In [5]:
def drawVectors(transformed_features, components_, columns, plt, scaled):
    if not scaled:
        return plt.axes() # No cheating ;-)

    num_columns = len(columns)

    # This function will project your *original* feature (columns)
    # onto your principal component feature-space, so that you can
    # visualize how "important" each one was in the
    # multi-dimensional scaling

    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:,0])
    yvector = components_[1] * max(transformed_features[:,1])

    ## visualize projections

    # Sort each column by its length. These are your *original*
    # columns, not the principal components.
    important_features = { columns[i] : math.sqrt(xvector[i]**2 + yvector[i]**2) for i in range(num_columns) }
    important_features = sorted(zip(important_features.values(), important_features.keys()), reverse=True)
    print("Features by importance:\n", important_features)

    ax = plt.axes()

    for i in range(num_columns):
        # Use an arrow to project each original feature as a
        # labeled vector on your principal component axes
        plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75)
        plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75)

    return ax

### And Now, The Assignment

In [77]:
# Do * NOT * alter this line, until instructed!
scaleFeatures = False

Load up the dataset specified on the lab instructions page and remove any and all _rows_ that have a NaN in them. You should be a pro at this by now ;-)

**QUESTION**: Should the `id` column be included in your dataset as a feature?
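For the row-dropping step, here is a minimal sketch on a made-up dataframe (the name `demo` and its values are hypothetical). It removes every row that contains a NaN and then renumbers the index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two different rows
demo = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "bgr": [117.0, np.nan, 380.0, 157.0],
    "wc": ["6700", "12100", None, "11000"],
})

# axis=0 drops rows; reset_index gives the survivors a clean 0..n-1 index
complete = demo.dropna(axis=0).reset_index(drop=True)
print(len(demo), "->", len(complete))
```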

In [28]:
# .. your code here ..
df=pd.read_csv('Datasets/kidney_disease.csv')
df.dropna(axis=0,inplace=True)
df.reset_index(drop=True,inplace=True)
#df.drop(labels=['id'],axis=1,inplace=True)
#df.head()

Let's build some color-coded labels; the actual label feature will be removed prior to executing PCA, since it's unsupervised. You're only labeling by color so you can see the effects of PCA:

In [29]:
labels = ['red' if i=='ckd' else 'green' for i in df.classification]

Use an indexer to select only the following columns: `['bgr','wc','rc']`

In [30]:
# .. your code here ..
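# .copy() gives an independent dataframe, so later column assignments
# do not trigger a SettingWithCopyWarning
df1=df[['bgr','wc','rc']].copy()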
df1.dtypes


bgr    float64
wc      object
rc      object
dtype: object

Either take a look at the attribute info section of UCI's [Chronic Kidney Disease](https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease) page, or alternatively, look at the first few rows of your dataframe using `.head()`. What kind of data type should these three columns be? Compare what you see with the results when you print out your dataframe's `dtypes`.

If Pandas did not properly detect and convert your columns to the data types you expected, use an appropriate command to coerce these features to the right type.
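One way to do this coercion is `pd.to_numeric` with `errors='coerce'`. The sketch below runs on a made-up column (the values, including the stray `'?'`, are assumptions for illustration).

```python
import pandas as pd

# Hypothetical column: numbers stored as strings, plus one unparseable entry
raw = pd.Series(["6700", "8400", "?", "12100"], name="wc")

# errors='coerce' turns anything that cannot be parsed into NaN
clean = pd.to_numeric(raw, errors="coerce")
print(clean.dtype)
print(clean.isna().sum(), "value(s) could not be parsed")
```

If any values are coerced to NaN, those rows need to be dropped or filled before running PCA, since scikit-learn's `PCA` does not accept missing values.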

In [31]:
# .. your code here ..
# `wc` and `rc` were read as strings (dtype object); coerce them to numbers.
# Values that cannot be parsed become NaN.
df1.wc=pd.to_numeric(df1.wc,errors='coerce')
df1.rc=pd.to_numeric(df1.rc,errors='coerce')
# Apply the same conversion to the full dataframe
df.wc=pd.to_numeric(df.wc,errors='coerce')
df.rc=pd.to_numeric(df.rc,errors='coerce')

PCA operates based on variance. The variable with the greatest variance will dominate. Examine your data using a command that will check the variance of every feature in your dataset, and then print out the results. Also print out the results of running `.describe()` on your dataset.

_Hint:_ If you do not see all three variables: `'bgr'`, `'wc'`, and `'rc'`, then it's likely you did not complete the previous step properly.
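To see what "dominate" means in practice, here is a small sketch on synthetic data (an assumption for illustration, with one feature given a variance on the order of `wc`'s). It compares the first principal component before and after standardizing.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical features: independent, but on wildly different scales
big = rng.normal(8000.0, 3000.0, 300)   # variance around 9e6
small = rng.normal(5.0, 1.0, 300)       # variance around 1
X = np.column_stack([big, small])

raw_pc1 = PCA(n_components=1).fit(X).components_[0]
scaled_pc1 = PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_[0]

print("Unscaled PC1 loadings:", np.abs(raw_pc1))     # almost all weight on `big`
print("Scaled PC1 loadings:  ", np.abs(scaled_pc1))  # weight split evenly
```

With only two standardized features, the first component always weights both equally. The point is that neither feature can dominate just because of its units.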

In [32]:
# .. your code here ..
# numeric_only=True skips the remaining text columns
df.var(numeric_only=True)


id      1.060869e+04
age     2.406297e+02
bp      1.248891e+02
sg      3.023865e-05
al      1.996936e+00
su      6.616141e-01
bgr     4.217182e+03
bu      2.246322e+03
sc      9.471717e+00
sod     5.609143e+01
pot     1.208501e+01
hemo    8.307100e+00
wc      9.777380e+06
rc      1.039104e+00
dtype: float64

In [12]:
df.describe()

| | id | age | bp | sg | al | su | bgr | bu | sc | sod | pot | hemo | wc | rc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 |
| mean | 274.841772 | 49.563291 | 74.050633 | 1.019873 | 0.797468 | 0.253165 | 131.341772 | 52.575949 | 2.188608 | 138.848101 | 4.636709 | 13.687342 | 8475.949367 | 4.891772 |
| std | 102.998517 | 15.512244 | 11.175381 | 0.005499 | 1.41313 | 0.813397 | 64.939832 | 47.395382 | 3.077615 | 7.489421 | 3.476351 | 2.882204 | 3126.880181 | 1.019364 |
| min | 3.0 | 6.0 | 50.0 | 1.005 | 0.0 | 0.0 | 70.0 | 10.0 | 0.4 | 111.0 | 2.5 | 3.1 | 3800.0 | 2.1 |
| 25% | 243.0 | 39.25 | 60.0 | 1.02 | 0.0 | 0.0 | 97.0 | 26.0 | 0.7 | 135.0 | 3.7 | 12.6 | 6525.0 | 4.5 |
| 50% | 298.5 | 50.5 | 80.0 | 1.02 | 0.0 | 0.0 | 115.5 | 39.5 | 1.1 | 139.0 | 4.5 | 14.25 | 7800.0 | 4.95 |
| 75% | 355.75 | 60.0 | 80.0 | 1.025 | 1.0 | 0.0 | 131.75 | 49.75 | 1.6 | 144.0 | 4.9 | 15.775 | 9775.0 | 5.6 |
| max | 399.0 | 83.0 | 110.0 | 1.025 | 4.0 | 5.0 | 490.0 | 309.0 | 15.2 | 150.0 | 47.0 | 17.8 | 26400.0 | 8.0 |


Below, we assume your dataframe's variable is named `df`. If it isn't, make the appropriate changes. But do not alter the code in `scaleFeaturesDF()` just yet!

In [139]:
# .. your (possible) code adjustment here ..

if scaleFeatures: df1 = scaleFeaturesDF(df1)


Run PCA on your dataset, reducing it to 2 principal components. Make sure your PCA model is saved in a variable called `'pca'`, and that the results of your transformation are saved in another variable `'T'`:
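The fit/transform pattern can be sketched on random data first. The names `pca_demo` and `T_demo` and the 50x5 input are assumptions for illustration, chosen so they do not clash with the lab's `pca` and `T`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))      # hypothetical data: 50 samples, 5 features

pca_demo = PCA(n_components=2)
pca_demo.fit(X)                   # learn the principal axes
T_demo = pca_demo.transform(X)    # project the samples onto them

print(T_demo.shape)                        # (50, 2)
print(pca_demo.components_.shape)          # (2, 5): one row per component
print(pca_demo.explained_variance_ratio_)  # share of variance each component keeps
```

Each row of `components_` holds one weight per original feature, which is what `drawVectors()` uses to draw the feature arrows.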

In [79]:
# .. your code here ..
import numpy as np
from sklearn.decomposition import PCA

# Fit a 2-component PCA model, then project the data onto those components
pca = PCA(n_components=2)
pca.fit(df1)
T = pca.transform(df1)

# Overlay the original features as vectors (only drawn when scaleFeatures is True)
ax = drawVectors(T, pca.components_, df1.columns.values, plt, scaleFeatures)

# Convert the NDArray back into a dataframe so Pandas can plot it
T = pd.DataFrame(T)
T.columns = ['component1', 'component2']
T.plot.scatter(x='component1', y='component2', marker='o', c=labels, alpha=0.75, ax=ax)

plt.show()

Now, plot the transformed data as a scatter plot. Recall that transforming the data will result in a NumPy NDArray. You can either use MatPlotLib to graph it directly, or you can convert it back to DataFrame and have Pandas do it for you.

Since we've already demonstrated how to plot directly with MatPlotLib in `Module4/assignment1.ipynb`, this time we'll show you how to convert your transformed data back into a Pandas DataFrame and have Pandas plot it from there.
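A minimal sketch of the Pandas route, assuming a random stand-in array in place of the real PCA output and alternating stand-in colors in place of the real labels (`T_arr`, `T_df` and `colors` are hypothetical names):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line inside a notebook
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
T_arr = rng.normal(size=(20, 2))  # stand-in for the output of pca.transform(...)
colors = ["red" if i % 2 else "green" for i in range(len(T_arr))]  # stand-in labels

# Name the columns explicitly: the NDArray carries no headers of its own
T_df = pd.DataFrame(T_arr, columns=["component1", "component2"])
ax = T_df.plot.scatter(x="component1", y="component2", marker="o", c=colors, alpha=0.75)
```

The list passed as `c` needs one color per row, which is why the labels must be built from the same rows that went into the transformation.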

In [35]:
df1=df.drop(['id', 'classification', 'rbc', 'pc', 'pcc', 'ba', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane'],axis=1)
df1

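| | age | bp | sg | al | su | bgr | bu | sc | sod | pot | hemo | pcv | wc | rc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | 117.0 | 56.0 | 3.8 | 111.0 | 2.5 | 11.2 | 32 | 6700 | 3.9 |
| 1 | 53.0 | 90.0 | 1.020 | 2.0 | 0.0 | 70.0 | 107.0 | 7.2 | 114.0 | 3.7 | 9.5 | 29 | 12100 | 3.7 |
| 2 | 63.0 | 70.0 | 1.010 | 3.0 | 0.0 | 380.0 | 60.0 | 2.7 | 131.0 | 4.2 | 10.8 | 32 | 4500 | 3.8 |
| 3 | 68.0 | 80.0 | 1.010 | 3.0 | 2.0 | 157.0 | 90.0 | 4.1 | 130.0 | 6.4 | 5.6 | 16 | 11000 | 2.6 |
| 4 | 61.0 | 80.0 | 1.015 | 2.0 | 0.0 | 173.0 | 148.0 | 3.9 | 135.0 | 5.2 | 7.7 | 24 | 9200 | 3.2 |
| 5 | 48.0 | 80.0 | 1.025 | 4.0 | 0.0 | 95.0 | 163.0 | 7.7 | 136.0 | 3.8 | 9.8 | 32 | 6900 | 3.4 |
| 6 | 69.0 | 70.0 | 1.010 | 3.0 | 4.0 | 264.0 | 87.0 | 2.7 | 130.0 | 4.0 | 12.5 | 37 | 9600 | 4.1 |
| 7 | 73.0 | 70.0 | 1.005 | 0.0 | 0.0 | 70.0 | 32.0 | 0.9 | 125.0 | 4.0 | 10.0 | 29 | 18900 | 3.5 |
| 8 | 73.0 | 80.0 | 1.020 | 2.0 | 0.0 | 253.0 | 142.0 | 4.6 | 138.0 | 5.8 | 10.5 | 33 | 7200 | 4.3 |
| 9 | 46.0 | 60.0 | 1.010 | 1.0 | 0.0 | 163.0 | 92.0 | 3.3 | 141.0 | 4.0 | 9.8 | 28 | 14600 | 3.2 |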


In [36]:
df1.dtypes
df1=pd.get_dummies(df1)
#df1

In [6]:
from scipy.io import loadmat

# loadmat returns a dict that maps MATLAB variable names to NumPy arrays
dffmat=loadmat('Datasets/face_data.mat')

dff2=pd.DataFrame(dffmat['lights'])
dff3=pd.DataFrame(dffmat['image_pcs'])
dff4=pd.DataFrame(dffmat['poses'])
dff1=pd.DataFrame(dffmat['images'])

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
dff1=pd.concat([dff1,dff2],ignore_index=True)
dff1

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 688 | 689 | 690 | 691 | 692 | 693 | 694 | 695 | 696 | 697 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.030515 | 0.016176 | 0.016176 | 0.016176 | ... | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.032521 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.016176 |
| 1 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.058134 | 0.016176 | 0.016176 | 0.016176 | ... | 0.016176 | 0.032077 | 0.016176 | 0.020083 | 0.077405 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.016176 |
| 2 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.058241 | 0.016176 | 0.016176 | 0.016176 | ... | 0.016176 | 0.113971 | 0.016176 | 0.040395 | 0.206648 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.016176 |
| 3 | 0.016176 | 0.016176 | 0.016176 | 0.024188 | 0.016176 | 0.016176 | 0.056526 | 0.016176 | 0.016176 | 0.016176 | ... | 0.016176 | 0.101379 | 0.016176 | 0.038051 | 0.359222 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.016176 |
| 4 | 0.016176 | 0.016988 | 0.016176 | 0.039859 | 0.016176 | 0.016176 | 0.107307 | 0.021094 | 0.020083 | 0.016176 | ... | 0.016176 | 0.036351 | 0.016176 | 0.028676 | 0.533104 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.016176 |
| 5 | 0.016176 | 0.028140 | 0.016176 | 0.040671 | 0.016176 | 0.016176 | 0.223591 | 0.057996 | 0.040395 | 0.016176 | ... | 0.016176 | 0.122074 | 0.016176 | 0.047426 | 0.532047 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.016176 |
| 6 | 0.016176 | 0.047518 | 0.016176 | 0.038710 | 0.016176 | 0.016176 | 0.385907 | 0.123162 | 0.056801 | 0.016176 | ... | 0.016176 | 0.124081 | 0.016176 | 0.034926 | 0.528998 | 0.016176 | 0.016176 | 0.016176 | 0.016176 | 0.016176 |
| 7 | 0.016176 | 0.050950 | 0.016176 | 0.035983 | 0.016176 | 0.016176 | 0.441437 | 0.086887 | 0.075551 | 0.016176 | ... | 0.016176 | 0.139553 | 0.016176 | 0.063051 | 0.501333 | 0.016176 | 0.016176 | 0.016176 | 0.031388 | 0.016176 |
| 8 | 0.016176 | 0.080469 | 0.016176 | 0.034559 | 0.016176 | 0.016176 | 0.405928 | 0.138006 | 0.078676 | 0.016176 | ... | 0.016176 | 0.283992 | 0.016176 | 0.059926 | 0.410907 | 0.016176 | 0.016176 | 0.016176 | 0.130867 | 0.016176 |
| 9 | 0.016176 | 0.116146 | 0.016176 | 0.033517 | 0.016176 | 0.016176 | 0.343444 | 0.252558 | 0.083364 | 0.016176 | ... | 0.016176 | 0.414767 | 0.016176 | 0.031801 | 0.326057 | 0.016176 | 0.016176 | 0.016176 | 0.274341 | 0.016176 |


In [8]:
import numpy as np
from sklearn import manifold

# The face data was loaded into `dff1` above
iso = manifold.Isomap(n_neighbors=4, n_components=2)
fitmap=iso.fit_transform(dff1)
fitmap.shape

# drawVectors() expects PCA components, which Isomap does not produce, and
# `labels` belongs to the kidney dataset, so the points are left uncolored
fitmap = pd.DataFrame(fitmap)
fitmap.columns = ['component1', 'component2']
fitmap.plot.scatter(x='component1', y='component2', marker='o', alpha=0.75)
plt.show()

In [141]:
import numpy as np
from sklearn import manifold

iso = manifold.Isomap(n_neighbors=4, n_components=3)
# fit() returns the estimator itself; fit_transform() returns the embedded NDArray
fitmap=iso.fit_transform(dff1)

fitmap = pd.DataFrame(fitmap)
fitmap.columns = ['component1', 'component2','component3']

fig = plt.figure()
ax1=fig.add_subplot(111,projection='3d')

ax1.set_xlabel('component1')
ax1.set_ylabel('component2')
ax1.set_zlabel('component3')
# Plot the three Isomap coordinates
ax1.scatter(fitmap['component1'],fitmap['component2'],fitmap['component3'],c='red',marker='o')
plt.show()

In [103]:
# .. your code here ..
import numpy as np
from sklearn.decomposition import PCA

# Same PCA workflow as before, this time on the face data in `dff1`
pca =PCA(n_components=2)
pca.fit(dff1)
T = pca.transform(dff1)

ax = drawVectors(T, pca.components_, dff1.columns.values, plt, scaleFeatures)
T  = pd.DataFrame(T)
T.columns = ['comp1','comp2']
# `labels` was built from the kidney dataset, so it is not used to color these points
T.plot.scatter(x='comp1', y='comp2',marker='o', alpha=0.75, ax=ax)

plt.show()