The [pandas](http://pandas.pydata.org) package allows you to handle complex tables of data of different types and time series. Click on the following cell. Then click the 'Run' button to import pandas and check its version.

In [None]:
import pandas as pd
pd.__version__

### Task 1: import the wine dataset

###### Import the dataset, wine.csv, by replacing the question marks ‘???’ in the following cell with the correct path and file name.

Click [here](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for more information on pandas.read_csv.

In [None]:
import pandas as pd

data = pd.read_csv("../mlnc_DATA/wine.csv")
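
As a minimal sketch of what `pd.read_csv` returns, the snippet below reads a tiny CSV from an in-memory string instead of a file. The column names and values are made up for illustration and are not taken from wine.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content (not the real wine data): a header row and three records.
csv_text = """alcohol,ash,label
13.2,2.1,1
12.4,2.4,2
13.7,2.3,3
"""

# read_csv accepts a path or a file-like object; by default the first row
# is used as the column names.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.columns.tolist())
print(toy.dtypes)
```

The result is a DataFrame, so everything that follows in this notebook (shape, head, tail, iloc) would apply to `toy` in the same way.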

You can use the shape attribute of a pandas dataframe to check the dimensionality of the dataset as shown in the following cell.

In [None]:
data.shape
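
The following sketch, on a small made-up table, illustrates that shape is a tuple of (number of rows, number of columns) and can be unpacked directly. The name `toy` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical toy table: 3 rows, 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape is accessed without parentheses and holds (rows, columns).
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```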

To get a rough idea of this data file’s content, you can print the first five or the last five rows using the commands shown in the following two cells. You can also pass an integer in the brackets to choose the number of rows; its absolute value should be no more than the total number of rows in the dataset. More details can be found in [pandas.DataFrame.head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) and [pandas.DataFrame.tail](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html).

In [None]:
data.head()

In [None]:
data.tail()
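
To illustrate the integer argument of head() and tail(), here is a small sketch on a made-up six-row table (the name `toy` and its values are hypothetical). It also shows what a negative integer does with head().

```python
import pandas as pd

# Hypothetical toy table with six rows.
toy = pd.DataFrame({"x": [10, 20, 30, 40, 50, 60]})

first_three = toy.head(3)         # the first three rows
last_two = toy.tail(2)            # the last two rows
all_but_last_two = toy.head(-2)   # a negative n returns everything except the last |n| rows

print(first_three)
print(last_two)
print(all_but_last_two)
```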

### Task 2: Access the data 

One way to access the data loaded with pandas is to use the [iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) indexer, which selects rows and columns by integer position. Run the following two cells to see what you get and compare the output with the outputs of data.head() and data.tail(). Note that in Python, the index of an array (or matrix) counts from zero. Inside the square brackets of iloc in the following cell, the comma separates two parts: the first part selects rows; the second part selects columns. 

In [None]:
first_two_rows = data.iloc[0:2, :]
first_two_rows

In [None]:
first_two_columns = data.iloc[:, 0:2]
first_two_columns

Slicing: getting and setting smaller segments within a larger dataframe.  
To access a slice of a dataframe, you can use start:stop:step for each part inside the square brackets. The default values are start=0, stop=size of the dimension, step=1, and the element at the stop index is not included. For example, as you may already know, [0:2, :] selects rows starting from the first row (index 0) and stopping before the third row (index 2) with a step of 1, i.e. the first two rows. The second colon means that all columns are selected. 
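
The slicing rules can be tried out on a small made-up table. This is only a sketch; the name `toy` and its contents are hypothetical and unrelated to the wine data.

```python
import pandas as pd

# Hypothetical toy table: 5 rows, 3 columns.
toy = pd.DataFrame({"a": [0, 1, 2, 3, 4],
                    "b": [10, 11, 12, 13, 14],
                    "c": [20, 21, 22, 23, 24]})

rows_0_1 = toy.iloc[0:2, :]          # rows 0 and 1; the stop index 2 is excluded
every_other_row = toy.iloc[::2, :]   # default start and stop, step of 2: rows 0, 2, 4
last_two_cols = toy.iloc[:, 1:3]     # columns at positions 1 and 2

print(rows_0_1)
print(every_other_row)
print(last_two_cols)
```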

#### Task 2 a) Replace the question marks ‘???’ in the following cell to get the first 13 columns in the dataframe.

In [None]:
"""Get all features"""
Inputs = data.iloc[:, 0:13]

###### You can also pass a single integer in either part of the square brackets to get a specific row or a specific column.

#### Task 2 b) Replace the question marks ‘???’ in the following cell to get the last column in the dataframe.

In [None]:
"""Get labels"""
Labels = data.iloc[:,13]
Labels
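
As a small illustration of integer indexing with iloc, the sketch below selects the last column of a made-up table in three ways. The table `toy` and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy table with two feature columns and one label column.
toy = pd.DataFrame({"f1": [1.0, 2.0], "f2": [3.0, 4.0], "label": [1, 2]})

last_by_position = toy.iloc[:, 2]    # a single integer returns a Series
last_by_negative = toy.iloc[:, -1]   # -1 counts from the end, so no column count is needed
as_frame = toy.iloc[:, [-1]]         # a list of positions keeps a one-column DataFrame

print(type(last_by_position).__name__, type(as_frame).__name__)
```

Using -1 may be convenient when the number of columns is not known in advance.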

### Normalising the data - do it for features only
Before doing a PCA, you need to subtract the mean value from each feature. You can do this by applying [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from [sklearn](https://scikit-learn.org/stable/).preprocessing, which not only removes the mean but also scales each feature to unit variance. Run the following cell: the first line imports StandardScaler; the second line normalises the data using fit_transform.

In [None]:
from sklearn.preprocessing import StandardScaler
x1 = StandardScaler().fit_transform(Inputs)

There are two steps in fit_transform(). First, fit() is used to extract the mean value and the standard deviation from each feature. Then transform() is applied to remove the mean and scale the corresponding feature. You can also use fit() and transform() separately.  

### Task 3: Normalise the data using fit() and transform() separately.  Replace the question marks ‘???’ in the following cell.

In [None]:
statistics = StandardScaler().fit(Inputs)
x2 = statistics.transform(Inputs)
print(x2.mean(axis=0)) # print the mean value of each feature after removing the mean
print(x2.std(axis=0)) # print the standard deviation of each feature after scaling to unit variance
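
What the fit and transform steps compute can be sketched with plain NumPy on a small made-up array. This is an illustration of the arithmetic, not scikit-learn's implementation; it assumes the population standard deviation (ddof=0), which is what the StandardScaler documentation describes. The array `X` is hypothetical.

```python
import numpy as np

# Hypothetical data: 3 samples, 2 features on very different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# "fit": estimate the per-feature statistics.
mean = X.mean(axis=0)
std = X.std(axis=0)   # NumPy's default is ddof=0 (population standard deviation)

# "transform": remove the mean and scale to unit variance.
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```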

### Task 4: Do a principal component analysis (PCA). You can do it by applying [PCA from sklearn.decomposition](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html). Replace the question marks "???" in the following cell.

In [None]:
from sklearn.decomposition import PCA # import PCA

pca = PCA( ) # initialising a PCA instance for all features (empty)
proj_wine = pca.fit_transform(x2) # fit() performs the eigen-decomposition; transform() projects the data into the PCA space
print(proj_wine.shape)

Similarly to StandardScaler(), methods fit() and transform() can also be used separately for PCA(). Run the following cell.

In [None]:
eigen_decom = PCA().fit(x2)
proj_wine = eigen_decom.transform(x2)
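
The idea behind the fit and transform steps of PCA can be sketched with NumPy as an eigen-decomposition of the covariance matrix. This is a sketch of the underlying idea only: scikit-learn's PCA works through a singular value decomposition, so the signs of individual components may differ. The random data below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # hypothetical data: 50 samples, 3 features
Xc = X - X.mean(axis=0)                 # centre each feature

# "fit": eigen-decomposition of the covariance matrix (n - 1 in the denominator).
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Sort eigenvalues and their eigenvectors in descending order.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# "transform": project the centred data onto the eigenvectors.
proj = Xc @ eigvecs
print(proj.shape)
print(eigvals)
```

The variance of each projected column equals the corresponding eigenvalue, which is the quantity reported as explained variance.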

It is important to report how much variance has been captured in a PCA analysis. You can obtain the information as shown in the following cell. Run the following cell.

In [None]:
print(pca.explained_variance_)  # Eigen-values sorted in descending order
print(pca.explained_variance_ratio_)  # Each eigenvalue divided by the sum of all eigenvalues (the trace of the covariance matrix)


import numpy as np
pca.explained_variance_[0]/np.sum(pca.explained_variance_)

If we want to keep only two decimal places in the results, we can use the round function from [numpy](https://numpy.org/) as shown in the following cell.

In [None]:
import numpy as np

print(np.round(pca.explained_variance_,2)) 
print(np.round(pca.explained_variance_ratio_,2)) 
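
The second argument of np.round sets the number of decimal places. A brief sketch with made-up numbers (the array `ratios` is hypothetical and is not the output of the PCA above):

```python
import numpy as np

# Hypothetical values, for illustration only.
ratios = np.array([0.41237, 0.28654, 0.10049])

one_decimal = np.round(ratios, 1)    # one decimal place
two_decimals = np.round(ratios, 2)   # two decimal places

print(one_decimal)
print(two_decimals)
```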

### Task 5: Replace the question marks ‘???’ in the following cell to see how much variance has been captured using the first two principal components. 

In [None]:
var = np.sum(pca.explained_variance_[:2])
print(var)
var_percentage = np.sum(pca.explained_variance_ratio_[:2])*100
print(var_percentage,'%')
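
The same summation can be done for every number of components at once with a cumulative sum. The sketch below uses made-up ratios (the array `ratios` and the 80% threshold are assumptions chosen for illustration) and also finds the smallest number of components that reaches the threshold.

```python
import numpy as np

# Hypothetical explained variance ratios that sum to 1.
ratios = np.array([0.5, 0.25, 0.15, 0.10])

# cumulative[k - 1] is the fraction of variance captured by the first k components.
cumulative = np.cumsum(ratios)

# Smallest number of components capturing at least 80% of the variance.
threshold = 0.80
n_components = int(np.argmax(cumulative >= threshold) + 1)

print(cumulative)
print(n_components)
```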

### Scree plot: index of principal components against variance (variance percentage)

To produce a plot, you may use [matplotlib.pyplot](https://matplotlib.org/2.0.2/users/pyplot_tutorial.html). To do that, run the following cell to import matplotlib.pyplot.

In [None]:
import matplotlib.pyplot as plt

### Task 6: Replace the question marks ‘???’ in the following cell to produce a scree plot. 

In [None]:
figure = plt.figure()
ax = plt.gca()
plt.plot(pca.explained_variance_ratio_, color='red', linestyle='dotted')
ax.set_title("Scree plot")
ax.set_xlabel("Index of principal components")
ax.set_ylabel("Explained Variance Ratio")

# Elbow point at about PC 7
# i.e. about 7 principal components would be kept (each is a linear combination of the original features)
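
When a scree plot is drawn without explicit x values, the horizontal axis starts at 0, so the first principal component appears at index 0. The sketch below is one possible variant that numbers the components from 1 and marks each point. The ratios are made up and the names `fig_scree` and `ax_scree` are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical explained variance ratios, for illustration only.
ratios = np.array([0.5, 0.25, 0.15, 0.10])
component_numbers = np.arange(1, len(ratios) + 1)   # 1, 2, 3, 4

fig_scree, ax_scree = plt.subplots()
ax_scree.plot(component_numbers, ratios, color="red", linestyle="dotted", marker="o")
ax_scree.set_xticks(component_numbers)              # one tick per component
ax_scree.set_title("Scree plot")
ax_scree.set_xlabel("Principal component number")
ax_scree.set_ylabel("Explained variance ratio")
```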

### Data Visualisation using PCA

### Task 7: Replace the question marks ‘???’ in the following cell to produce a scatter plot of the first principal component against the second principal component. You can see more details on how to use the scatter function from [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html).

In [None]:
figure = plt.figure()
ax = plt.gca()
plt.scatter(proj_wine[:,0],proj_wine[:,1], c=Labels, edgecolor='none', alpha=0.5)
ax.set_title("The PCA plot")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")

You can put two or more subplots in one figure by using the [subplots](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html) function.

### Task 8: Replace the question marks ‘???’ in the following cell to produce a figure including two subplots in one row.

In [None]:
# import matplotlib.pyplot as plt  (already imported above)

fig, axes = plt.subplots(nrows=1, ncols=2)
fig.tight_layout(pad=2.50) # set subplot spacing

plt.subplot(121)
dots_trn=plt.scatter(proj_wine[:,0],proj_wine[:,1], c=Labels, edgecolor='none', alpha=0.5)
plt.xlabel('PC1')
plt.ylabel('PC2')
classes=['C1', 'C2', 'C3']

"""plt.legend(handles=dots_trn.legend_elements()[0], labels=classes)"""

plt.subplot(122)
plt.plot(pca.explained_variance_ratio_, color='red', linestyle='dotted')
plt.xlabel("Index of principal components")
plt.ylabel("The explained variance ratio")
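
An alternative way to fill two subplots is to draw on the axes objects returned by plt.subplots directly. The sketch below uses random points and made-up class labels (all data and the names `fig_demo`, `ax_left`, `ax_right` are hypothetical) and also adds a class legend built from the scatter plot's legend_elements().

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))             # hypothetical 2-D projections
labels = np.repeat([1, 2, 3], 10)             # hypothetical class labels
ratios = np.array([0.5, 0.25, 0.15, 0.10])    # hypothetical variance ratios

# plt.subplots returns the figure and one axes object per subplot.
fig_demo, (ax_left, ax_right) = plt.subplots(nrows=1, ncols=2, figsize=(9, 4))

# Left subplot: scatter plot coloured by class, with a legend.
dots = ax_left.scatter(points[:, 0], points[:, 1], c=labels, edgecolor="none", alpha=0.5)
ax_left.set_xlabel("PC1")
ax_left.set_ylabel("PC2")
handles, _ = dots.legend_elements()           # one handle per distinct label value
ax_left.legend(handles=handles, labels=["C1", "C2", "C3"])

# Right subplot: scree plot.
ax_right.plot(ratios, color="red", linestyle="dotted")
ax_right.set_xlabel("Index of principal components")
ax_right.set_ylabel("The explained variance ratio")

fig_demo.tight_layout(pad=2.5)                # set subplot spacing after drawing
```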


### Task 9: Save the PCA plot to a file 

You can save a figure using the savefig() method. For example, to save the PCA scatter figure you produced in Task 7, you can run the code in the following cell.


In [None]:
figure.savefig('pca_wine.png')

A file called pca_wine.png is saved in the current working directory. To check that it contains what you expect, you can run the code in the following cell.

In [None]:
from IPython.display import Image
Image('pca_wine.png')

You can find the list of supported file types for your system by using the code shown in the following cell.

In [None]:
figure.canvas.get_supported_filetypes()
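
As a sketch of how these pieces fit together, the snippet below draws a small made-up figure, checks that PNG is among the supported file types, and saves it with a chosen resolution. The figure, the file name demo_plot.png and the temporary directory are all hypothetical; dpi and bbox_inches are optional savefig arguments.

```python
import os
import tempfile
import matplotlib.pyplot as plt

# Hypothetical figure, for illustration only.
fig_demo, ax_demo = plt.subplots()
ax_demo.plot([0, 1, 2], [0, 1, 4])

# A dictionary mapping file extensions to descriptions.
supported = fig_demo.canvas.get_supported_filetypes()
print("png" in supported)

# Save into a temporary directory so that no existing file is overwritten.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "demo_plot.png")
fig_demo.savefig(out_path, dpi=150, bbox_inches="tight")  # higher resolution, trimmed margins
print(os.path.exists(out_path))
```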