In [0]:
# this is so that this notebook works in both Python 2.7 and 3.x
from __future__ import print_function, division

## Principal Component Analysis 

In this module, you will be using Scikit-Learn, Pandas and Numpy. Start by loading 

* `numpy` as `np`
* `pandas` as `pd`
* the `PCA` class, imported from `sklearn.decomposition` (part of the `sklearn` library)

In [0]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

You will also want to visualise some of the results, so load:

* `matplotlib.pylab` as `plt`
* don't forget to add the line `%matplotlib inline` so that the plots are displayed in this notebook
* `seaborn` as `sns`

In [0]:
import matplotlib.pylab as plt
%matplotlib inline
import seaborn as sns

### Learning Activity - Apply PCA to the input data using scikit-learn

In scikit-learn, the usual methodology is:

* instantiate an object from the class associated with the method of interest (e.g. `PCA`)
* apply a `.fit` or `.fit_transform` method to train the model
* apply a `.predict` method (if relevant)

More information about the PCA object in sklearn can be found here: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
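
The instantiate / fit / apply pattern can be sketched as follows. This is only an illustration on made-up random data: the names `X_toy`, `pca_toy` and `X_reduced` are hypothetical and not part of the exercise. Note that `PCA` is a transformer, so the last step uses `.transform` rather than `.predict`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 100 samples, 4 continuous features
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 4))

# 1. Instantiate the estimator (here keeping 2 components)
pca_toy = PCA(n_components=2)

# 2. Fit the model to the data
pca_toy.fit(X_toy)

# 3. Apply the fitted model to project the data onto the components
X_reduced = pca_toy.transform(X_toy)
print(X_reduced.shape)  # (100, 2)
```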

Before starting, you need to load the dataset `customers` from `data/online_retail_afterEDA.csv`. 

In [0]:
customers = pd.read_csv("data/online_retail_afterEDA.csv",
                        index_col = "CustomerID")
customers.head()

Select the continuous features (PCA is not meant to work with other types of variables). The continuous variables are

```python
['balance', 'max_spent', 'mean_spent', 'min_spent', 
'n_orders', 'total_items', 'total_items_returned', 
'total_refunded', 'total_spent']
```

Keep the name `customers` for the resulting DataFrame. Before subsetting, also save the last column (has returned or not), which you will use later on, in a variable called `has_returned`.

In [0]:
continuous_features = ['balance', 'max_spent', 'mean_spent',
                    'min_spent', 'n_orders', 'total_items', 
                    'total_items_returned', 'total_refunded', 
                    'total_spent']

has_returned = customers['has_returned']
customers = customers[continuous_features]
customers.head()

#### Getting started with PCA

You can now initialise a PCA object and create an index for each Principal Component:

In [0]:
# ..add your code here..


The values of the Principal Components (scores) can be computed with the `fit_transform()` method (alternatively, `fit()` followed by `transform()`). This method returns a matrix of scores with one row per observation, where the first column contains the first principal component, the second column the second component, and so on.
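
As a minimal sketch of this step, the snippet below runs `fit_transform()` on a small made-up DataFrame and wraps the scores in a DataFrame with one named column per component. The data and the names `toy`, `pca_toy`, `pc_names` and `scores_df` are assumptions for illustration only; the exercise itself uses `customers`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy data: 50 observations, 3 continuous features
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

# With no n_components given, PCA keeps min(n_samples, n_features) components
pca_toy = PCA()
scores = pca_toy.fit_transform(toy)

# Label the components PC1, PC2, ... and keep the original row index
pc_names = ["PC{}".format(i + 1) for i in range(scores.shape[1])]
scores_df = pd.DataFrame(scores, index=toy.index, columns=pc_names)
print(scores_df.shape)  # (50, 3)
```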

In [0]:
# Create the PCA scores matrix and check the dimensionality


The loadings for the principal components are stored in the `components_` attribute of the fitted PCA object. This is a matrix of shape `(n_components, n_features)`: the first row contains the loadings for the first principal component, the second row contains the loadings for the second principal component, and so on. Transpose it with `.T` if you prefer one column per component.
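
A quick way to check the layout is to look at the shape. The sketch below uses hypothetical toy data (`toy`, `pca_toy` and `loadings_df` are illustrative names): scikit-learn documents `components_` as having shape `(n_components, n_features)`, so transposing it gives a features-by-components table with one column per principal component.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy data: 50 observations, 3 continuous features
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

pca_toy = PCA(n_components=2).fit(toy)
print(pca_toy.components_.shape)  # (2, 3): (n_components, n_features)

# Transpose so that rows are the original features and columns are the PCs
loadings_df = pd.DataFrame(pca_toy.components_.T,
                           index=toy.columns,
                           columns=["PC1", "PC2"])
print(loadings_df.shape)  # (3, 2)
```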

In [0]:
# Create the PCA loadings matrix and check the dimensionality


### Learning Activity: Calculate and plot the explained and cumulative variance 

But how much information is carried by each component, and how much would we lose by dropping some of them? We can figure this out by looking at the explained and cumulative variance. The explained variance gives us the proportion of the total variance explained by each successive Principal Component. The cumulative variance is the running sum of these proportions: its value at the k-th component is the share of variance retained by keeping the first k components.
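
A minimal sketch of these two quantities, assuming made-up data with features on different scales (the names `toy`, `pca_toy` and `variance_df` are hypothetical). It uses the `explained_variance_ratio_` attribute of a fitted `PCA` object and `np.cumsum` for the running total.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy data: 4 features with different spreads
rng = np.random.RandomState(0)
toy = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

pca_toy = PCA().fit(toy)

# Proportion of the total variance explained by each component
explained = pca_toy.explained_variance_ratio_

# Running sum of the proportions; reaches 1 when all components are kept
cumulative = np.cumsum(explained)

variance_df = pd.DataFrame(
    {"explained": explained, "cumulative": cumulative},
    index=["PC{}".format(i + 1) for i in range(len(explained))],
)
print(variance_df)
```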

In [0]:
# Calculate the explained variance

# Calculate the cumulative variance

# Combine both in a data frame


We can also plot the explained variance using a barplot with seaborn:

In [0]:
# Plot the explained variance per PC using a barplot


#### Bonus: cumulative variance

As a bonus, can you compute the cumulative explained variance, and can you plot it using a barplot?

In [0]:
# ..add your code here..


### Test Activity - Reading in the associated classes 

At this stage, we will import the associated classes and join them row by row with the `customers` DataFrame, so that we can use this knowledge during visualisation. Try to import the customer classes from the provided "`customer_classes.csv`" into the variable **_y_**. Remember to also define the column that will be used as the row labels of the DataFrame, as in the previous step (if you are unsure, open the csv file and decide which column name to use).

In this case, the variable **_y_** contains two classes (binary case), "yes" and "no", which represent the returning and non-returning customers respectively.

### Learning Activity - Joining with class

A very useful feature of `pandas` is its `join()` method, which combines tables based on an index (or key column) shared between them. Here we use `join()` to combine the input features with the information on whether customers return (the associated classes).

You will join the class labels with the dataset by a shared index. **`CustomerID`** is the obvious choice here, as it is the only column shared between the two DataFrames.
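
As a small illustration of an index-based join, the sketch below builds two made-up DataFrames that share a `CustomerID` index. The values, the column name `class` and the names `scores_toy`, `y_toy` and `joined` are all hypothetical. By default `DataFrame.join()` aligns rows on the index and keeps the rows of the left table, so the row order of the two tables does not need to match.

```python
import pandas as pd

# Hypothetical PCA scores indexed by CustomerID
scores_toy = pd.DataFrame(
    {"PC1": [0.5, -1.2, 0.7], "PC2": [0.1, 0.3, -0.4]},
    index=pd.Index([101, 102, 103], name="CustomerID"),
)

# Hypothetical class labels, deliberately in a different row order
y_toy = pd.DataFrame(
    {"class": ["yes", "no", "yes"]},
    index=pd.Index([103, 101, 102], name="CustomerID"),
)

# Rows are matched on the shared index, not on their position
joined = scores_toy.join(y_toy)
print(joined)
```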

In [0]:
# Join the PCA scores and y DataFrames based on the common CustomerIDs


### Learning Activity - Export to csv

Now that we have produced datasets that are ready for applying some machine learning algorithms, we will save (or "export") them to our local machine and working repository. This also serves as a checkpoint for the bootcamp, so that you can get started straight away with the next module even if you got stuck in some part above.

Writing a `pd.DataFrame` to disk is very easy: you just use the `.to_csv()` method and specify the file path where you want it saved. There are also other [formats](http://pandas.pydata.org/pandas-docs/stable/api.html#id12) that you can save to, using methods that work in the same way.
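
A minimal round-trip sketch, assuming a made-up DataFrame and a temporary directory (the file name `toy_scores.csv` and the names `toy`, `out_path` and `reloaded` are hypothetical). The index is written as the first column by default, so it can be restored on reading with `index_col`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data to export
toy = pd.DataFrame(
    {"PC1": [0.5, -1.2], "class": ["yes", "no"]},
    index=pd.Index([101, 102], name="CustomerID"),
)

# Write to a temporary location so nothing in the working repository is overwritten
out_path = os.path.join(tempfile.mkdtemp(), "toy_scores.csv")
toy.to_csv(out_path)

# Read the file back, restoring CustomerID as the row labels
reloaded = pd.read_csv(out_path, index_col="CustomerID")
print(reloaded)
```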

In [0]:
# Save to a csv file with the '.to_csv()' method 
# and give the file a name you want
