# Week 2. Data Exploration and Pre-processing
Load the red wine quality data from the UCI Machine Learning Repository and show a few samples with pandas.

In [None]:
import pandas as pd

df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv' , sep = ';')
df.head()          # import and display the first five rows of wine data

## 1. Explore statistical information about the data

In [None]:
import matplotlib.pyplot as plt

plt.style.use('ggplot')
print(df.isnull().any()) # missing-value check: True marks a column that contains nulls; this data set is quite clean


Next, we use the describe() function to see summary statistics for every attribute.

In [None]:
df.describe()

In [None]:
#df.hist(figsize=(15, 15))
df.boxplot(figsize=(15, 15))   # draw a boxplot for every attribute

# Outlier rule of thumb: a value above Q3 + 1.5*IQR or below Q1 - 1.5*IQR, where IQR = Q3 - Q1


In [None]:
df.boxplot(['alcohol','quality'])   # boxplot on the specified attributes: alcohol and quality

In [None]:
df.boxplot(['pH'])     # boxplot on the specified attribute: pH

In [None]:
df.hist(figsize=(15, 15))     # draw a histogram for each of the 12 attributes


Can you spot any outliers in the histograms and boxplots?

Values that fall far outside the range where most of the data lies are candidate outliers.
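
The 1.5 x IQR rule mentioned with the boxplots can also be applied numerically. The sketch below is only an illustration: the helper name `iqr_outlier_mask` and the toy values are made up here and are not part of the wine data.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Toy data (invented for illustration): one value sits far from the rest
toy = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 50.0])
mask = iqr_outlier_mask(toy)
print(toy[mask])   # only the value 50.0 is flagged
```

On the wine data, something like `df.apply(iqr_outlier_mask).sum()` should count the flagged values per attribute.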

In [None]:
import numpy as np

d1 = np.array(df.iloc[0], dtype=np.float32)
d2 = np.array(df.iloc[1], dtype=np.float32)
print(d1)           # display the first row of wine data
print(d2)           # display the second row of wine data

How do we compute the Manhattan distance and the Euclidean distance? <br>
Manhattan distance: $d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$<br>
Euclidean distance: $d(i,j) = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 +\cdots+ (x_{ip}-x_{jp})^2}$ <br>
NumPy supports element-wise mathematical operations on vectors, so both formulas translate directly into array expressions.
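
As a small worked example of the two formulas, the sketch below uses two made-up three-dimensional points (not rows of the wine data) whose distances are easy to check by hand.

```python
import numpy as np

# Two hypothetical points chosen so the arithmetic is easy to verify
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

diff = a - b                             # element-wise difference: [-3., -4., 0.]
manhattan = np.sum(np.abs(diff))         # 3 + 4 + 0 = 7
euclidean = np.sqrt(np.sum(diff ** 2))   # sqrt(9 + 16 + 0) = 5
print(manhattan, euclidean)              # 7.0 5.0
```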

In [None]:
print(np.sum(np.abs(d1-d2)))     # the sum of absolute differences between data points d1 and d2
print(np.sqrt(np.sum(np.power((d1-d2),2))))  # the square root of the sum of squared differences between data points d1 and d2

scikit-learn also provides functions that compute these distances more conveniently.

In [None]:
from sklearn.metrics.pairwise import manhattan_distances
from sklearn.metrics.pairwise import euclidean_distances

print(manhattan_distances([d1,d2]))
print(euclidean_distances([d1,d2]))     # Results are displayed as distance matrices
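
To make the distance-matrix output easier to read, here is a minimal sketch with three made-up 2-D points. Entry `[i, j]` of each matrix is the distance between point `i` and point `j`, so the diagonal is zero and the matrix is symmetric.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, manhattan_distances

# Three hypothetical points lying on one line, spaced so the distances are 5 and 10
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

E = euclidean_distances(points)   # 3 x 3 matrix of pairwise Euclidean distances
M = manhattan_distances(points)   # 3 x 3 matrix of pairwise Manhattan distances

print(np.round(E, 3))
print(np.round(M, 3))
print(np.round(E[0, 2], 3))       # distance between the first and the third point
```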

## 2. Pre-process data

### Drop rows that match some criteria

In [None]:
# drop rows with some criteria
df.hist('residual sugar')         # draw histogram for the original data
idx = df['residual sugar'] <= 10   # keep rows where residual sugar is less than or equal to 10, drop the others
temp_df = df[idx]
temp_df.hist('residual sugar')    # draw histogram for the selected data
print(df.shape)
print(temp_df.shape)

### Replace with mean or median values

In [None]:
noise_df = pd.read_csv('./winequality-red_with_noise.csv',sep=';')
noise_df.head(10)   # display the first 10 rows of another wine data set with noise


In [None]:
noise_df.isnull().any()          # check whether there is any missing data

You can see that two columns now contain missing values.
We will fill the missing values in the first column, "fixed acidity", with its mean, and replace the ones in the second column, "volatile acidity", with its median.
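
The imputation described above can be sketched on a tiny toy frame first. The values below are invented for illustration and only the column names are borrowed from the wine data.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing entry per column
toy = pd.DataFrame({
    'fixed acidity':    [7.0, np.nan, 8.0, 9.0],
    'volatile acidity': [0.2, 0.4, np.nan, 0.9],
})

fill_values = {
    'fixed acidity': toy['fixed acidity'].mean(),          # (7 + 8 + 9) / 3 = 8.0
    'volatile acidity': toy['volatile acidity'].median(),  # median of 0.2, 0.4, 0.9 = 0.4
}

filled = toy.fillna(value=fill_values)   # returns a new DataFrame
print(filled)
print(toy.isnull().sum().sum())          # the original toy frame still has 2 missing values
```

Both `mean()` and `median()` skip missing values by default, so passing `skipna=True` explicitly is optional.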

In [None]:
v1 = noise_df['fixed acidity'].mean(skipna=True)        # calculate the mean value for the attribute fixed acidity
v2 = noise_df['volatile acidity'].median(skipna=True)   # calculate the median value for the attribute volatile acidity
print(v1,v2)

In [None]:
values = {'fixed acidity':v1,'volatile acidity':v2}
clean_df = noise_df.fillna(value=values)      # fill in the missing value with mean or median
clean_df.head(10)
# note that fillna returns a new DataFrame; the original noise_df has not been changed.

In [None]:
clean_df.isnull().any()             # check missing value again after filling in mean or median, now clean

## Let's see how normalization affects performance

In [None]:
# We drop the "quality" column from the features and use it as the target of a multi-class classification problem.

In [None]:
X = df.drop('quality', axis=1)   # axis=1 drops a column (axis=0 would drop rows); recent pandas versions require the keyword
y = df['quality']                # the class label
X.head()

In [None]:
y.head()     # display the first five rows of the class/label attribute: quality

In [None]:
#from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
from sklearn import neighbors, linear_model

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
knn = neighbors.KNeighborsClassifier(n_neighbors = 5)      # KNN: k-nearest neighbour classification algorithm
knn_model_1 = knn.fit(X_train, y_train)
print('k-NN accuracy for test set: %f' % knn_model_1.score(X_test, y_test))
from sklearn.metrics import classification_report
y_true, y_pred = y_test, knn_model_1.predict(X_test)
print(classification_report(y_true, y_pred))  # precision, recall, and f1-score are evaluation metrics, the higher, the better


You can see that the k-NN classifier reaches an accuracy of about 45.6% on the raw features. Next, we apply two normalization methods, min-max scaling and z-score scaling, to see how the results change.
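
Before using the scikit-learn helpers, it may help to see the two formulas written out by hand. This is a minimal sketch on a made-up vector: min-max scaling maps values to $[0, 1]$ via $(x - \min) / (\max - \min)$, and z-score scaling computes $(x - \mu) / \sigma$.

```python
import numpy as np

# Hypothetical feature values, chosen only for illustration
x = np.array([2.0, 4.0, 6.0, 8.0])

# Min-max scaling: smallest value becomes 0, largest becomes 1
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score scaling: zero mean and unit standard deviation
# (x.std() is the population standard deviation, ddof=0, which is also what scikit-learn's scalers use)
z_score = (x - x.mean()) / x.std()

print(np.round(min_max, 3))
print(np.round(z_score, 3))
```

k-NN relies on distances, so an attribute with a large numeric range can dominate the result unless the attributes are brought to a comparable scale.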

In [None]:
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()     # do the min-max normalization
Xs = min_max_scaler.fit_transform(X)      # rescale every feature to the range [0, 1]
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2, random_state=42)
knn_model_2 = knn.fit(Xs_train, y_train)
print('k-NN score for test set: %f' % knn_model_2.score(Xs_test, y_test))
y_true, y_pred = y_test, knn_model_2.predict(Xs_test)
print(classification_report(y_true, y_pred))

Min-max normalization improves the performance.
Let's see how z-score normalization does.

In [None]:
from sklearn import preprocessing

Xs = preprocessing.scale(X)     # do the z-score normalization (zero mean, unit variance per feature)
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2, random_state=42)
knn_model_3 = knn.fit(Xs_train, y_train)
print('k-NN score for test set: %f' % knn_model_3.score(Xs_test, y_test))
y_true, y_pred = y_test, knn_model_3.predict(Xs_test)
print(classification_report(y_true, y_pred))

The performance improves further. In practice, which normalization method works best depends on the data, so it is worth trying both.
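
One way to try both methods side by side is sketched below. It is not the notebook's own procedure: the helper name `compare_scalers` and the synthetic data from `make_classification` are assumptions made for illustration. The sketch also wraps each scaler in a pipeline so that the scaler is fitted on the training split only, which is a common way to keep test-set statistics out of the preprocessing step.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def compare_scalers(X, y, n_neighbors=5, random_state=42):
    """Return the k-NN test accuracy without scaling, with min-max and with z-score scaling."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state)
    candidates = {
        'none': make_pipeline(KNeighborsClassifier(n_neighbors=n_neighbors)),
        'min-max': make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=n_neighbors)),
        'z-score': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=n_neighbors)),
    }
    # Each pipeline fits its scaler on X_train only, then scores on X_test
    return {name: model.fit(X_train, y_train).score(X_test, y_test)
            for name, model in candidates.items()}

# Synthetic stand-in data (hypothetical); on the wine data, pass X and y instead
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
scores = compare_scalers(X_demo, y_demo)
for name, acc in scores.items():
    print(f'{name}: {acc:.3f}')
```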