# Assignment 3

**0.** First, you need to download the [wine quality dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv) from the UCI machine learning repository.

The following code uses Pandas to read the CSV file into a DataFrame object named `data`. Because the file separates fields with semicolons, the delimiter must be set to ';'. The code then removes duplicated rows and displays the first five rows of the DataFrame.

In [None]:
import pandas as pd

data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', delimiter=';')
#data = pd.read_csv('https://github.com/liuhoward/teaching/raw/master/business_intelligence/winequality-white.csv', delimiter=';')
print('number of rows before: {}'.format(len(data)))
# remove duplicated rows
data.drop_duplicates(inplace=True)
print('number of rows after: {}'.format(len(data)))
data.head()

**1. Data exploration**  

1)	Compute the mean, standard deviation, minimum, and maximum for each of the 12 attributes.

In [None]:
from pandas.api.types import is_numeric_dtype

for col in data.columns:
    if is_numeric_dtype(data[col]):
        print('%s:' % (col))
        print('\t Mean = %.2f' % data[col].mean())
        print('\t Standard deviation = %.2f' % data[col].std())
        print('\t Minimum = %.2f' % data[col].min())
        print('\t Maximum = %.2f' % data[col].max())
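
The same four statistics can also be collected in a single table. The following is a minimal sketch, assuming a small made-up DataFrame named `toy` (its values are illustrative and not taken from the wine data); on the real data, `toy` would be replaced by `data`.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine DataFrame
toy = pd.DataFrame({
    'alcohol': [9.0, 10.0, 11.0, 12.0],
    'quality': [5, 6, 6, 7],
})

# One row per statistic, one column per attribute
summary = toy.agg(['mean', 'std', 'min', 'max'])
print(summary.round(2))
```

Like `Series.std()` in the loop above, `agg` with `'std'` computes the sample standard deviation (divisor n - 1).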

2)	Plot the histogram and boxplot for each of the 12 attributes.

In [None]:
%matplotlib inline

data.hist(bins=50, figsize=(10,10))

In [None]:
%matplotlib inline

data.boxplot(figsize=(15, 10), rot=45, fontsize=15)

3)	Compute the correlation for each pair of attributes. It should be a 12x12 correlation matrix. Discuss the correlation between each attribute and the Quality (which is the last attribute) and identify the ones that are most/least related to the Quality.

In [None]:
print('Correlation:')
data.corr()

We can also visualize the correlation matrix as a color-coded plot.

In [None]:
import matplotlib.pyplot as plt

corr = data.corr()

plt.matshow(corr)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90);
plt.yticks(range(len(corr.columns)), corr.columns);
plt.show()
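
To identify the attributes most and least related to quality, one option is to pull the `quality` column out of the correlation matrix and order it by absolute value. The sketch below assumes a made-up DataFrame `toy` whose column names only mimic the wine dataset; the numbers are illustrative.

```python
import pandas as pd

# Hypothetical toy data; column names mimic the wine dataset
toy = pd.DataFrame({
    'alcohol': [9.0, 10.0, 11.0, 12.0, 13.0],
    'density': [1.00, 0.99, 0.98, 0.99, 0.97],
    'pH':      [3.1, 3.3, 3.0, 3.4, 3.2],
    'quality': [5, 5, 6, 7, 8],
})

# Correlation of every attribute with quality, excluding quality itself
corr_with_quality = toy.corr()['quality'].drop('quality')

# Order by strength of the relationship, ignoring sign
ranked = corr_with_quality.reindex(
    corr_with_quality.abs().sort_values(ascending=False).index
)
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top, since they are as informative as strongly positive ones.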

**2. Similarity**

1)  Extract the instances (i.e., rows) whose Quality score is 8, then compute the Euclidean distance between every pair of these instances. Identify the 10 pairs with the smallest Euclidean distance.

In [None]:
# select data with quality==8
data_quality8 = data[data['quality'] == 8]

data_quality8.head()

In [None]:
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np

# calculate pairwise Euclidean distances as a matrix
euclidean_dist = euclidean_distances(data_quality8)
euclidean_dist

In [None]:
# flatten the upper triangle of the matrix into a list of (i, j, distance) tuples
euclidean_dist_array = list()
size = len(data_quality8)
for i in range(size):
    for j in range(i + 1, size):
        euclidean_dist_array.append((i, j, euclidean_dist[i][j]))
        
# sort the distances
sorted_dist = sorted(euclidean_dist_array, key=lambda e: e[2])

for k in range(10):
    print('top {} pair:'.format(k + 1))
    i_index, j_index, distance = sorted_dist[k]
    print('euclidean distance = %.8f' % distance)
    # column names are too long, so we do not display them by setting 'header=False'
    print((data_quality8.loc[data_quality8.index[[i_index, j_index]], data_quality8.columns]).to_string(header=False))
    print('\n')
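
The nested loop above visits each unordered pair exactly once. As a compact illustration of the same idea, the sketch below uses `itertools.combinations` on a few made-up 2-D points (the `points` array is hypothetical and stands in for the quality-8 rows).

```python
from itertools import combinations
import numpy as np

# Hypothetical 2-D points standing in for the quality-8 rows
points = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [0.0, 1.0],
    [6.0, 8.0],
])

# Euclidean distance for every unordered pair (i < j)
pairs = [
    (i, j, float(np.linalg.norm(points[i] - points[j])))
    for i, j in combinations(range(len(points)), 2)
]

# Smallest distances first; keep the two closest pairs
closest = sorted(pairs, key=lambda p: p[2])[:2]
for i, j, dist in closest:
    print('rows %d and %d: distance = %.4f' % (i, j, dist))
```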

2)	Extract the instances whose Quality score is 8, then compute the cosine similarity between every pair of these instances (i.e., rows). Identify the 10 pairs with the highest cosine similarity.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# calculate pairwise cosine similarities as a matrix
cosine_sim = cosine_similarity(data_quality8)
cosine_sim

In [None]:
# flatten the upper triangle of the matrix into a list of (i, j, similarity) tuples
cosine_sim_array = list()
size = len(data_quality8)
for i in range(size):
    for j in range(i + 1, size):
        cosine_sim_array.append((i, j, cosine_sim[i][j]))
        
# sort the similarities, highest first
sorted_sim = sorted(cosine_sim_array, key=lambda e: e[2], reverse=True)

for k in range(10):
    print('top {} pair:'.format(k + 1))
    i_index, j_index, similarity = sorted_sim[k]
    print('cosine similarity = %.8f' % similarity)
    # column names are too long, so we do not display them by setting 'header=False'
    print((data_quality8.loc[data_quality8.index[[i_index, j_index]], data_quality8.columns]).to_string(header=False))
    print('\n')
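
When comparing the two measures, it may help to recall that cosine similarity depends only on the direction of the vectors, while Euclidean distance also depends on their magnitude. The sketch below illustrates this with made-up vectors; `cosine_sim` is a hypothetical helper written for this example, not part of scikit-learn.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a                      # same direction, ten times the magnitude
c = np.array([3.0, 2.0, 1.0])   # same magnitude as a, different direction

print('cosine(a, b)    = %.4f' % cosine_sim(a, b))
print('cosine(a, c)    = %.4f' % cosine_sim(a, c))
print('euclidean(a, b) = %.4f' % np.linalg.norm(a - b))
print('euclidean(a, c) = %.4f' % np.linalg.norm(a - c))
```

Here `b` is the most similar to `a` by cosine similarity but the farthest from `a` by Euclidean distance, so attributes with large numeric ranges can affect the two rankings differently.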

3)	Compare the two top-10 lists and discuss which measure (Euclidean distance or cosine similarity) works better.

your answer:

**3. Data preprocessing**

1)	Use the first 11 attributes (excluding Quality) and run PCA to reduce the data to two dimensions.

Using PCA, the data matrix is projected onto its first two principal components. The projected values of the original wine data are stored in a pandas DataFrame object named `projected`.

In [None]:
import pandas as pd
from sklearn.decomposition import PCA

# slice data to keep only 11 attributes
data_no_quality = data.iloc[:, 0:11]

numComponents = 2
pca = PCA(n_components=numComponents)
pca.fit(data_no_quality)

projected = pca.transform(data_no_quality)
projected = pd.DataFrame(projected,columns=['pc1','pc2'],index=data_no_quality.index)

projected.head()
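
As a rough illustration of what the projection does, the sketch below centers a small made-up matrix and projects it onto its first two principal directions using NumPy's SVD. The matrix `X` and the name `projected_toy` are hypothetical; the cell above relies on scikit-learn's `PCA`, whose components may differ from this sketch in sign.

```python
import numpy as np

# Hypothetical data: 5 samples, 3 attributes
X = np.array([
    [2.0, 0.0, 1.0],
    [4.0, 1.0, 1.0],
    [6.0, 1.0, 2.0],
    [8.0, 3.0, 2.0],
    [10.0, 4.0, 3.0],
])

# Center each attribute at zero mean
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by explained variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first two principal directions
projected_toy = X_centered @ Vt[:2].T

# Fraction of total variance captured by each component
explained_ratio = (S ** 2) / np.sum(S ** 2)
print(projected_toy.shape)
print(explained_ratio[:2].round(3))
```

Because PCA works on variances, attributes measured on larger scales tend to dominate the leading components unless the data is standardized first.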

2)	Draw a scatterplot of the reduced two-dimensional data, coloring each instance (one point in two dimensions) according to its Quality score.

In [None]:
# add quality label
projected['quality'] = data['quality']
projected.head()


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(30,20))

colors = {3:'b', 4:'g', 5:'r', 6:'c', 7:'m', 8:'y', 9:'k'}
markerTypes = {3:'s', 4:'v', 5:'^', 6:'o', 7:'+', 8:'x', 9:'D'}

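for quality_type in markerTypes.keys():
    d = projected[projected['quality']==quality_type]
    # label each group so the legend maps color and marker to a quality score
    plt.scatter(d['pc1'],d['pc2'],c=colors[quality_type],s=60,marker=markerTypes[quality_type],label=str(quality_type))

plt.xlabel('pc1')
plt.ylabel('pc2')
plt.legend(title='quality')
plt.show()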

3)	Examine the scatterplot and discuss any potential clusters and outliers.

your answer