### True Learning Objectives

- How can I process data in Python?

#### How do I process data elements in a DataFrame?

With our knowledge of **loops** (repetitive actions) and **if..else** (conditional statements), we return our focus to the DataFrame, this time to examine how to perform data cleaning and preprocessing tasks.

We will be using the breast cancer diagnostic data set from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

Let's start by looking at our data.

In [2]:
import pandas as pd
data = pd.read_csv('data/breast-cancer-wisconsin.csv', header=None)
data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses','Class']

print('Number of instances = %d' % (data.shape[0]))
print('Number of attributes = %d' % (data.shape[1]))
data.head()

Number of instances = 699
Number of attributes = 11


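| | Sample code | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |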


The `Sample code` column only identifies individual records. To make it easier to address issues such as duplication, we will go ahead and drop it.

In [3]:
data = data.drop(['Sample code'],axis=1)
data.head()

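| | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |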
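To illustrate why dropping an identifier column helps with duplicate detection, here is a minimal sketch on a small made-up DataFrame. The name `toy`, its column names, and its values are all hypothetical and are not taken from the breast cancer data. The sketch relies on the pandas methods `drop` and `duplicated`.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 hold the same measurements under different IDs
toy = pd.DataFrame({'id': [101, 102, 103],
                    'thickness': [5, 3, 5],
                    'label': [2, 4, 2]})

# With the ID column present, no two rows are identical
dup_with_id = int(toy.duplicated().sum())

# After dropping the ID column, the repeated measurements show up as a duplicate
toy_no_id = toy.drop(['id'], axis=1)
dup_without_id = int(toy_no_id.duplicated().sum())

print('Duplicates with id column:', dup_with_id)
print('Duplicates without id column:', dup_without_id)
```

`duplicated()` marks a row as a duplicate only if an identical row appeared earlier, so the first occurrence is not counted.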


The next cells use the Iris flower data set, whose columns include sepal length, sepal width, petal length, and petal width. We can print out these values for a record using column names and row indices:

In [5]:
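# At row 0 (column names other than 'sepal length' are assumed to follow the same pattern)
print(data['sepal length'][0], data['sepal width'][0],
      data['petal length'][0], data['petal width'][0], sep='\n')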

4.9
3.0
1.4
0.2


We can also print out this information in a nicer format:

In [14]:
print ('Sepal length for record', 0, 'is:', data['sepal length'][0])

Sepal length for record 0 is: 5.1


How can we do it for five records?

In [15]:
print ('Sepal length for record', 0, 'is:', data['sepal length'][0])
print ('Sepal length for record', 1, 'is:', data['sepal length'][1])
print ('Sepal length for record', 2, 'is:', data['sepal length'][2])
print ('Sepal length for record', 3, 'is:', data['sepal length'][3])
print ('Sepal length for record', 4, 'is:', data['sepal length'][4])

Sepal length for record 0 is: 5.1
Sepal length for record 1 is: 4.9
Sepal length for record 2 is: 4.7
Sepal length for record 3 is: 4.6
Sepal length for record 4 is: 5.0


The above approach does not scale if we want to do it for 20, 50, or 100 records. To do this, we need to use a **loop**.

A **loop** lets you repeat an action a specified number of times.

In [16]:
for i in range(0,5):
    print ('Sepal length for record', i, 'is:', data['sepal length'][i])    

Sepal length for record 0 is: 5.1
Sepal length for record 1 is: 4.9
Sepal length for record 2 is: 4.7
Sepal length for record 3 is: 4.6
Sepal length for record 4 is: 5.0


What does `range` do?

In [17]:
for i in range(0,5):
    print (i)
    print ('Sepal length for record', i, 'is:', data['sepal length'][i])

0
Sepal length for record 0 is: 5.1
1
Sepal length for record 1 is: 4.9
2
Sepal length for record 2 is: 4.7
3
Sepal length for record 3 is: 4.6
4
Sepal length for record 4 is: 5.0
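To see what `range` produces on its own, without a DataFrame, one option is to convert it to a list. This is a small standalone sketch; the variable names are chosen only for illustration.

```python
# range(start, stop) yields start, start + 1, ..., stop - 1; stop itself is excluded
first_five = list(range(0, 5))
print(first_five)

# range(5) is shorthand for range(0, 5)
same_five = list(range(5))
print(same_five)

# An optional third argument sets the step size
evens = list(range(0, 10, 2))
print(evens)
```

Because the stop value is excluded, `range(0, 5)` visits exactly the five row indices 0 through 4 used in the loop above.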


### Data Operation

We want to find the Euclidean distance between every pair of points in a small example data set:

In [18]:
point_data = {'point': ['p1','p2','p3','p4'],
              'x': [0,2,3,4],
              'y': [2,0,1,1]}  

In [22]:
import math

for i in range(0,4):
    for j in range(0,4):
        d = math.sqrt(math.pow(point_data['x'][i] - point_data['x'][j],2) + 
                      math.pow(point_data['y'][i] - point_data['y'][j],2))
        print ('Distance between', point_data['point'][i], 'and', point_data['point'][j], 'is', d)

Distance between p1 and p1 is 0.0
Distance between p1 and p2 is 2.8284271247461903
Distance between p1 and p3 is 3.1622776601683795
Distance between p1 and p4 is 4.123105625617661
Distance between p2 and p1 is 2.8284271247461903
Distance between p2 and p2 is 0.0
Distance between p2 and p3 is 1.4142135623730951
Distance between p2 and p4 is 2.23606797749979
Distance between p3 and p1 is 3.1622776601683795
Distance between p3 and p2 is 1.4142135623730951
Distance between p3 and p3 is 0.0
Distance between p3 and p4 is 1.0
Distance between p4 and p1 is 4.123105625617661
Distance between p4 and p2 is 2.23606797749979
Distance between p4 and p3 is 1.0
Distance between p4 and p4 is 0.0
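The nested loop above reports every pair twice (p1 to p2 and p2 to p1) and also prints the zero distance from each point to itself. One possible variation, sketched below, starts the inner loop at `i + 1` so that each pair is visited once. The `distances` dictionary is an addition for illustration and is not part of the original cell.

```python
import math

point_data = {'point': ['p1', 'p2', 'p3', 'p4'],
              'x': [0, 2, 3, 4],
              'y': [2, 0, 1, 1]}

n = len(point_data['point'])
distances = {}
for i in range(0, n):
    # Start j at i + 1 so each pair is visited once and self-distances are skipped
    for j in range(i + 1, n):
        dx = point_data['x'][i] - point_data['x'][j]
        dy = point_data['y'][i] - point_data['y'][j]
        d = math.sqrt(dx ** 2 + dy ** 2)
        distances[(point_data['point'][i], point_data['point'][j])] = d
        print('Distance between', point_data['point'][i], 'and', point_data['point'][j], 'is', d)
```

With four points this yields six distances instead of sixteen.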


How Python's community helps to make this better ...

Library **SciPy** (https://www.scipy.org/): SciPy is a free and open-source Python library used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.

We will be using functions `squareform` and `pdist`: https://docs.scipy.org/doc/scipy/reference/spatial.distance.html 

**Euclidean Distance (slide 5)**

In [35]:
import pandas as pd
from scipy.spatial.distance import squareform, pdist

df = pd.DataFrame(point_data)
print(df)
pd.DataFrame(squareform(pdist(df.iloc[:, 1:], metric='euclidean')),
             columns=df.point.unique(), 
             index=df.point.unique())

  point  x  y
0    p1  0  2
1    p2  2  0
2    p3  3  1
3    p4  4  1


| | p1 | p2 | p3 | p4 |
|---|---|---|---|---|
| p1 | 0.0 | 2.828427 | 3.162278 | 4.123106 |
| p2 | 2.828427 | 0.0 | 1.414214 | 2.236068 |
| p3 | 3.162278 | 1.414214 | 0.0 | 1.0 |
| p4 | 4.123106 | 2.236068 | 1.0 | 0.0 |
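To make the roles of the two SciPy functions clearer, the sketch below calls them separately on the same four points. `pdist` returns a condensed one-dimensional array of pairwise distances, and `squareform` expands it into the symmetric matrix shown above. The names `points`, `condensed`, and `square` are chosen here for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Same four points as above, as rows of (x, y)
points = np.array([[0, 2], [2, 0], [3, 1], [4, 1]])

# pdist returns a condensed 1-D array with one entry per unordered pair:
# (p1,p2), (p1,p3), (p1,p4), (p2,p3), (p2,p4), (p3,p4)
condensed = pdist(points, metric='euclidean')
print(condensed.shape)

# squareform expands the condensed array into a symmetric matrix with a zero diagonal
square = squareform(condensed)
print(square.shape)
```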


## Question 1:
Modify the code below so that it calculates the pairwise distance of the point DataFrame using City block distance. 
*Hint: Visit the spatial distance documentation page above to find the correct metric name for City block*

In [38]:
pd.DataFrame(squareform(pdist(df.iloc[:, 1:], metric='______')),
             columns=df.point.unique(), 
             index=df.point.unique())

| | p1 | p2 | p3 | p4 |
|---|---|---|---|---|
| p1 | 0.0 | 4.0 | 4.0 | 5.0 |
| p2 | 4.0 | 0.0 | 2.0 | 3.0 |
| p3 | 4.0 | 2.0 | 0.0 | 1.0 |
| p4 | 5.0 | 3.0 | 1.0 | 0.0 |


**Mahalanobis Distance (slide 10)**

Yet another supporting package:

**NumPy** (www.numpy.org): NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. (slide 1

In [59]:
import numpy as np

cov_mat = np.matrix([[0.3,0.2],[0.2,0.3]])
print(cov_mat)
A = np.matrix([[0.5],[0.5]])
B = np.matrix([[0],[1]])
C = np.matrix([[1.5],[1.5]])  # column vector, same shape as A and B

A_B = (A - B).T * cov_mat.I * (A - B)
print(A_B)

[[0.3 0.2]
 [0.2 0.3]]
[[5.]]
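As a cross-check, SciPy provides `scipy.spatial.distance.mahalanobis`, which takes two one-dimensional vectors and the inverse covariance matrix. Note that it returns the square root of the quadratic form computed in the cell above, so the two values differ by a square root. The sketch below uses plain NumPy arrays rather than `np.matrix`; the names `a_vec`, `b_vec`, `quad_form`, and `d_scipy` are introduced here for illustration.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

cov = np.array([[0.3, 0.2], [0.2, 0.3]])
cov_inv = np.linalg.inv(cov)

a_vec = np.array([0.5, 0.5])
b_vec = np.array([0.0, 1.0])

# Quadratic form (A - B)^T * inverse covariance * (A - B), as in the cell above
diff = a_vec - b_vec
quad_form = diff @ cov_inv @ diff
print(quad_form)

# SciPy's mahalanobis returns the square root of that quadratic form
d_scipy = mahalanobis(a_vec, b_vec, cov_inv)
print(d_scipy)
```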


## Question 2:
Modify the code cell below to find the Mahalanobis distance between points A and C (A_C) from the above code cell, and then print out the result.

In [None]:
A_C = _________________________
print(___)

In [47]:
A - B

array([ 0.5, -0.5])

**Similarity measurements (slide 14)**

Given the two binary vectors, identify points of similarity:

In [60]:
x = [1,0,0,0,0,0,0,0,0,0]
y = [0,0,0,0,0,0,1,0,0,1]

This is where conditional statements come in: `if ... elif ... else ...` (`elif` is short for "else if").

In [61]:
f01 = 0
f10 = 0
f00 = 0
f11 = 0

for i in range(0,10):
    if x[i] == 0 and y[i] == 1:
        f01 = f01 + 1
    elif x[i] == 1 and y[i] == 0:
        f10 = f10 + 1
    elif x[i] == 0 and y[i] == 0:
        f00 = f00 + 1
    else:
        f11 = f11 + 1
        
print (f01, f10, f00, f11)
            

2 1 7 0
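The same counts can also be obtained without indexing by position. The sketch below is one possible alternative, not the notebook's own approach: it pairs the elements with the built-in `zip` and tallies each combination in a dictionary. The `counts` dictionary is a name introduced here for illustration.

```python
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# counts[(a, b)] holds how many positions have x == a and y == b
counts = {(0, 1): 0, (1, 0): 0, (0, 0): 0, (1, 1): 0}

# zip pairs up the elements of x and y position by position
for xi, yi in zip(x, y):
    counts[(xi, yi)] = counts[(xi, yi)] + 1

f01, f10, f00, f11 = counts[(0, 1)], counts[(1, 0)], counts[(0, 0)], counts[(1, 1)]
print(f01, f10, f00, f11)
```

This version does not need to know the vector length in advance, whereas `range(0,10)` in the cell above is tied to vectors of exactly ten elements.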


## Question 3:

Based on the above similarity measurements, write the Python code to calculate Simple Matching and Jaccard Coefficients

In [None]:
SMC = ( + ) / ( + + + )
print (SMC)

J = 
print (J)

## Question 4:

Using the information on vectors d1 and d2 on slide 15, complete the following code cell to calculate cosine similarity.

In [None]:
import math 

d1 = 
d2 = 

# calculate inner product. Hint: use a loop
inner_product = 0

for i in range ():
    inner_product = inner_product + 
    
# calculate length of d1. Also use a loop
d1_len = 0

# calculate length of d2. 


# calculate cosine similarity

cosine_d1_d2 = inner_product / (d1_len * d2_len)
print (cosine_d1_d2)