### True Learning Objectives

- How can I process data in Python?

#### How do I explore data elements in a DataFrame?

With Pandas, we can load data into an Excel-like data structure called a **DataFrame**. In Excel, we can simplify repetitive actions by copying a formula (Ctrl-C) and dragging it with the mouse across multiple rows, so that a single formula is applied to every row using the correct cells in each row.

In Python, as in any other language, this repeated action can be performed using a **loop**. In other literature, it might also be called **iteration**.

Let's start by looking at our data. This is the famous Iris flower data set that has been used to demonstrate many data mining/machine learning techniques. 

In [None]:
import pandas

data = pandas.read_csv('data/iris.csv', header=None)
print (data.head())
print (data.shape)

The `print` function allows us to print out the output of the `data.head()` function. In the previous lecture, we used `data.head()` directly. This works in an exploratory mode, but in a full-fledged program we need to use the `print` function to echo output to the screen.

The data set contains no header (hence the default headers 0, 1, 2, 3, 4), but we know that the columns are arranged as follows: sepal length in cm, sepal width in cm, petal length in cm, petal width in cm, and iris class (Iris Setosa, Iris Versicolour, and Iris Virginica). We first need to set the headers for our DataFrame:

In [None]:
data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
print (data.head())
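
The column names can also be supplied while loading, through the `names` parameter of `pandas.read_csv`. The sketch below is a minimal illustration: it reads two sample rows from an in-memory string instead of `data/iris.csv` so that it runs on its own, and the names `sample_csv`, `column_names`, and `sample` are assumptions made for this example.

```python
import io
import pandas

# Two sample rows in the same layout as the Iris file, held in memory
# so that the example does not depend on data/iris.csv
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)

column_names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']

# header=None: the file has no header row
# names=...:   use these labels instead of the defaults 0, 1, 2, 3, 4
sample = pandas.read_csv(sample_csv, header=None, names=column_names)
print(sample.head())
print(sample.shape)
```

Setting `data.columns` afterwards, as in the cell above, gives the same column labels; passing `names` simply does it in one step.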

We can print out the sepal length, sepal width, petal length, or petal width of any record using the column name and the row index:

In [None]:
# At row 0
print (data['sepal length'][0])

We can also print out this information in a nicer format:

In [None]:
print ('Sepal length for record', 0, 'is:', data['sepal length'][0])

How can we do it for five records?

In [None]:
print ('Sepal length for record', 0, 'is:', data['sepal length'][0])
print ('Sepal length for record', 1, 'is:', data['sepal length'][1])
print ('Sepal length for record', 2, 'is:', data['sepal length'][2])
print ('Sepal length for record', 3, 'is:', data['sepal length'][3])
print ('Sepal length for record', 4, 'is:', data['sepal length'][4])

The above approach does not scale if we want to do it for 20, 50, or 100 records. For that, we need to use a **loop**.

A **loop** allows you to specify the number of repetitions.

In [None]:
for i in range(0,5):
    print ('Sepal length for record', i, 'is:', data['sepal length'][i])    

What does `range` do?

In [None]:
for i in range(0,5):
    print (i)
    print ('Sepal length for record', i, 'is:', data['sepal length'][i])
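
As a small sketch of what `range` produces: `range(start, stop)` yields the integers from `start` up to, but not including, `stop`. Wrapping it in `list` makes the values visible. The variable names below are chosen for this illustration only.

```python
# range(0, 5) produces 0, 1, 2, 3, 4: the stop value 5 is not included
first_five = list(range(0, 5))
print(first_five)

# When the start is omitted, counting begins at 0
same_five = list(range(5))
print(same_five)

# An optional third argument sets the step between values
every_other = list(range(0, 10, 2))
print(every_other)
```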

### Data Operation

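We want to find the Euclidean distance between two records, such as two rows of the Iris data set. The example below uses four small two-dimensional points so that the arithmetic is easy to check: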

In [None]:
point_data = {'point': ['p1','p2','p3','p4'],
              'x': [0,2,3,4],
              'y': [2,0,1,1]}  

In [None]:
import math

for i in range(0,4):
    for j in range(0,4):
        d = math.sqrt(math.pow(point_data['x'][i] - point_data['x'][j],2) + 
                      math.pow(point_data['y'][i] - point_data['y'][j],2))
        print ('Distance between', point_data['point'][i], 'and', point_data['point'][j], 'is', d)
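
The nested loop above prints every ordered pair, so each distance appears twice and every point is also compared with itself. A possible variation, sketched here with the standard-library functions `itertools.combinations` and `math.dist` (available from Python 3.8), visits each unordered pair once. The names `points` and `distances` are assumptions introduced for this example.

```python
import itertools
import math

point_data = {'point': ['p1', 'p2', 'p3', 'p4'],
              'x': [0, 2, 3, 4],
              'y': [2, 0, 1, 1]}

# Pair each label with its (x, y) coordinates
points = list(zip(point_data['point'], zip(point_data['x'], point_data['y'])))

distances = {}
# combinations(..., 2) yields each unordered pair of points exactly once
for (name_a, xy_a), (name_b, xy_b) in itertools.combinations(points, 2):
    d = math.dist(xy_a, xy_b)  # Euclidean distance between two coordinate tuples
    distances[(name_a, name_b)] = d
    print('Distance between', name_a, 'and', name_b, 'is', d)
```

Four points give six unordered pairs, compared with the sixteen lines printed by the nested loop.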

Python's community provides libraries that make this easier:

**SciPy** (https://www.scipy.org/): SciPy is a free and open-source Python library for scientific and technical computing. It contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, and other tasks common in science and engineering.

We will be using functions `squareform` and `pdist`: https://docs.scipy.org/doc/scipy/reference/spatial.distance.html 

**Euclidean Distance (slide 5)**

In [None]:
from scipy.spatial.distance import squareform, pdist

df = pandas.DataFrame(point_data)
print(df)
pandas.DataFrame(squareform(pdist(df.iloc[:, 1:], metric='euclidean')),
                 columns=df.point.unique(), 
                 index=df.point.unique())
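
To see what the two SciPy functions contribute, the sketch below separates them. `pdist` returns a condensed one-dimensional array holding one distance per unordered pair of rows, and `squareform` expands it into the symmetric square matrix. The variable names `coords`, `condensed`, and `square` are assumptions for this illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Same four points as above, one row per point: (x, y)
coords = np.array([[0, 2], [2, 0], [3, 1], [4, 1]])

# Condensed form: 4 points give 4 * 3 / 2 = 6 pairwise distances
condensed = pdist(coords, metric='euclidean')
print(condensed.shape)

# Square form: a 4 x 4 symmetric matrix with zeros on the diagonal
square = squareform(condensed)
print(square.shape)
print(square)
```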

## Question 1:
Modify the code below so that it calculates the pairwise distances between the points in the DataFrame using the city block distance. 
*Hint: Visit the spatial distance documentation page above to find the correct metric name for city block.*

In [None]:
pandas.DataFrame(squareform(pdist(df.iloc[:, 1:], metric='______')),
                 columns=df.point.unique(), 
                 index=df.point.unique())

**Mahalanobis Distance (slide 10)**

Yet another supporting package:

**NumPy** (www.numpy.org): NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. (slide 1

In [None]:
import numpy as np

cov_mat = np.matrix([[0.3,0.2],[0.2,0.3]])
print(cov_mat)
# Each point is a 2x1 column vector
A = np.matrix([[0.5],[0.5]])
B = np.matrix([[0],[1]])
C = np.matrix([[1.5],[1.5]])

# (A - B)^T * inverse of the covariance matrix * (A - B)
A_B = (A - B).T * cov_mat.I * (A - B)
print(A_B)
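
The same quantity can be written with plain NumPy arrays and the `@` operator, which the NumPy documentation recommends over `np.matrix` for new code. The sketch below also calls `scipy.spatial.distance.mahalanobis`, which takes the inverse covariance matrix as its third argument and returns the square root of the quadratic form computed above. The variable names are assumptions for this illustration.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

cov = np.array([[0.3, 0.2], [0.2, 0.3]])
cov_inv = np.linalg.inv(cov)  # inverse of the covariance matrix

a = np.array([0.5, 0.5])
b = np.array([0.0, 1.0])

# Quadratic form (a - b)^T * cov_inv * (a - b), as in the cell above
diff = a - b
quad_form = diff @ cov_inv @ diff
print(quad_form)

# SciPy's mahalanobis(u, v, VI) returns the square root of that quadratic form
d_ab = mahalanobis(a, b, cov_inv)
print(d_ab, d_ab ** 2)
```

When comparing results, check which of the two conventions (with or without the square root) the slide uses.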

## Question 2:
Modify the code cell below to find the Mahalanobis distance between points A and C (A_C) from the above code cell, and then print out the result. 

In [None]:
A_C = _________________________
print(___)

**Similarity measurements (slide 14)**

Given the two binary vectors, identify points of similarity:

In [None]:
x = [1,0,0,0,0,0,0,0,0,0]
y = [0,0,0,0,0,0,1,0,0,1]

This is where conditional statements come in: `if ... elif ... else ...` (`elif` is short for "else if").

In [None]:
f01 = 0
f10 = 0
f00 = 0
f11 = 0

for i in range(0,10):
    if x[i] == 0 and y[i] == 1:
        f01 = f01 + 1
    elif x[i] == 1 and y[i] == 0:
        f10 = f10 + 1
    elif x[i] == 0 and y[i] == 0:
        f00 = f00 + 1
    else:
        f11 = f11 + 1
        
print (f01, f10, f00, f11)
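
The same tally can be written without index arithmetic. The following sketch pairs the elements with `zip` and counts each combination with `collections.Counter` from the standard library; the name `pair_counts` is an assumption for this illustration, and the loop above remains the version the questions below refer to.

```python
from collections import Counter

x = [1,0,0,0,0,0,0,0,0,0]
y = [0,0,0,0,0,0,1,0,0,1]

# zip pairs up x[i] with y[i]; Counter tallies how often each (x, y) pair occurs
pair_counts = Counter(zip(x, y))

f01 = pair_counts[(0, 1)]  # x is 0, y is 1
f10 = pair_counts[(1, 0)]  # x is 1, y is 0
f00 = pair_counts[(0, 0)]  # both are 0
f11 = pair_counts[(1, 1)]  # both are 1 (Counter returns 0 for a missing key)

print(f01, f10, f00, f11)
```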
            

## Question 3:

Based on the counts from the similarity measurements above, write the Python code to calculate the Simple Matching Coefficient (SMC) and the Jaccard Coefficient (J).

In [None]:
SMC = ( + ) / ( + + + )
print (SMC)

J = 
print (J)

## Question 4:

Using the vectors d1 and d2 from slide 15, complete the following code cell to calculate their cosine similarity.

In [None]:
import math 

d1 = 
d2 = 

# calculate inner product. Hint: use a loop
inner_product = 0

for i in range ():
    inner_product = inner_product + 
    
# calculate length of d1. Also use a loop
d1_len = 0

# calculate length of d2. 


# calculate cosine similarity

cosine_d1_d2 = inner_product / (d1_len * d2_len)
print (cosine_d1_d2)