A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run All_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 1. Dimensionality Reduction.

This problem will give you a chance to practice using a dimensionality reduction technique (PCA) on data about Delta Airline's aircraft.

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA

from nose.tools import assert_equal, assert_is_instance, assert_almost_equal
from numpy.testing import assert_array_almost_equal

Delta Airline (and other major airlines) publishes data on all of its aircraft on its [website](http://www.delta.com/content/www/en_US/traveling-with-us/airports-and-aircraft/Aircraft.html). For example, the following image shows the specifications of the AIRBUS A319 VIP.

![](https://github.com/UI-DataScience/accy571-fa16/raw/master/Week7/assignments/images/AIRBUS_A319_VIP.png)

In this problem, we will use `/home/data_scientist/data/delta.csv`, a CSV file that has aircraft data taken from the Delta Airline website.

In [None]:
df = pd.read_csv('/home/data_scientist/data/delta.csv', index_col='Aircraft')

This data set has 34 columns (including the names of the aircraft)
  on 44 aircraft. It includes both quantitative measurements, such as cruising speed,
  accommodation, and range in miles, and categorical data,
  such as whether a particular aircraft has Wi-Fi or video.
  These binary variables are assigned a value of 1 for yes or 0 for no.
  
```python
>>> print(df.head())
```
```
                  Seat Width (Club)  Seat Pitch (Club)  Seat (Club)  \
Aircraft                                                              
Airbus A319                     0.0                  0            0   
Airbus A319 VIP                19.4                 44           12   
Airbus A320                     0.0                  0            0   
Airbus A320 32-R                0.0                  0            0   
Airbus A330-200                 0.0                  0            0   

                  Seat Width (First Class)  Seat Pitch (First Class)  \
Aircraft                                                               
Airbus A319                           21.0                        36   
Airbus A319 VIP                       19.4                        40   
Airbus A320                           21.0                        36   
Airbus A320 32-R                      21.0                        36   
Airbus A330-200                        0.0                         0   

                  Seats (First Class)  Seat Width (Business)  \
Aircraft                                                       
Airbus A319                        12                      0   
Airbus A319 VIP                    28                     21   
Airbus A320                        12                      0   
Airbus A320 32-R                   12                      0   
Airbus A330-200                     0                     21   

                  Seat Pitch (Business)  Seats (Business)  \
Aircraft                                                    
Airbus A319                           0                 0   
Airbus A319 VIP                      59                14   
Airbus A320                           0                 0   
Airbus A320 32-R                      0                 0   
Airbus A330-200                      60                32   

                  Seat Width (Eco Comfort)   ...     Video  Power  Satellite  \
Aircraft                                     ...                               
Airbus A319                           17.2   ...         0      0          0   
Airbus A319 VIP                        0.0   ...         1      0          0   
Airbus A320                           17.2   ...         0      0          0   
Airbus A320 32-R                      17.2   ...         0      0          0   
Airbus A330-200                       18.0   ...         1      1          0   

                  Flat-bed  Sleeper  Club  First Class  Business  Eco Comfort  \
Aircraft                                                                        
Airbus A319              0        0     0            1         0            1   
Airbus A319 VIP          0        0     1            1         1            0   
Airbus A320              0        0     0            1         0            1   
Airbus A320 32-R         0        0     0            1         0            1   
Airbus A330-200          1        0     0            0         1            1   

                  Economy  
Aircraft                   
Airbus A319             1  
Airbus A319 VIP         0  
Airbus A320             1  
Airbus A320 32-R        1  
Airbus A330-200         1  

[5 rows x 33 columns]
```

First, let's visualize the relationships between the following attributes related to the physical characteristics of the aircraft:

- Cruising Speed (mph)
- Range (miles)
- Engines
- Wingspan (ft)
- Tail Height (ft)
- Length (ft)

(You don't have to create the following plot.)

![](https://github.com/UI-DataScience/accy571-fa16/raw/master/Week7/assignments/images/pair_grid_physical.png)

**You do not have to create this plot.** I don't include the code that generated this plot, and leave it as an optional exercise for you.

We can see that there are pretty strong positive correlations between all these variables, as all of them are related to the aircraft’s overall size. Remarkably there is an almost perfectly linear relationship between wingspan and tail height.

The exception here is the variable right in the middle which is the number of engines. There is one lone outlier which has four engines, while all the other aircraft have two. In this way the engines variable is really more like a categorical variable, but we shall see as the analysis progresses that this is not really important, as there are other variables which more strongly discern the aircraft from one another than this.
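The correlations described above can also be checked numerically. The following is a minimal sketch on a made-up toy DataFrame (the values are hypothetical and are not taken from `delta.csv`; only the column names come from the list above), using `DataFrame.corr()` to compute the Pearson correlation between every pair of columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for delta.csv; only the column
# names are borrowed from the attribute list above.
rng = np.random.default_rng(0)
size = rng.uniform(1.0, 3.0, 20)  # a hidden "overall size" factor
toy = pd.DataFrame({
    'Wingspan (ft)': 60 * size + rng.normal(0, 2, 20),
    'Tail Height (ft)': 20 * size + rng.normal(0, 1, 20),
    'Length (ft)': 80 * size + rng.normal(0, 5, 20),
})

# Pearson correlation between every pair of columns.
corr = toy.corr()
print(corr.round(2))
```

Assuming `df` has been loaded as shown earlier, the same call on a list of the six column names (for example `df[cols].corr()`, where `cols` is a name introduced here for illustration) should give the matrix that corresponds to the pair grid.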

## Principal Components Analysis (A naive approach)

Let’s say we know nothing about dimensionality reduction techniques and just naively apply principal components analysis to the data. (You might want to read through the entire notebook to see why we are calling this the naive approach.)

- Write a function named `fit_pca()` that takes a pandas.DataFrame and uses [sklearn.decomposition.PCA](http://scikit-learn.org/0.17/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA) to fit a PCA model on all values of `df`.
- Note that `fit_pca()` also takes a second argument, `n_components`, which should be passed to the `n_components` parameter of `PCA()`. Use default values for all optional parameters in `PCA()` except `n_components`.
- The function should return an instance of the `PCA` object. For example,
```python
def fit_pca(df, n_components):
    pca = PCA(
        # YOUR CODE HERE
    )
    # YOUR CODE HERE
    return pca
```

In [None]:
def fit_pca(df, n_components):
    """
    Uses sklearn.decomposition.PCA to fit a PCA model on "df".
    
    Parameters
    ----------
    df: A pandas.DataFrame. Comes from delta.csv.
    n_components: An int. Number of principal components to keep.
    
    Returns
    -------
    An sklearn.decomposition.pca.PCA instance.
    """
    
    # YOUR CODE HERE
    
    return pca

In [None]:
# we keep all components by setting n_components = number of columns in df
pca_naive = fit_pca(df, n_components=df.shape[1])
print(pca_naive.explained_variance_ratio_)

In [None]:
assert_is_instance(pca_naive, PCA)
assert_almost_equal(pca_naive.explained_variance_ratio_.sum(), 1.0, 3)
assert_equal(pca_naive.n_components_, df.shape[1])
assert_equal(pca_naive.whiten, False)

Let's visualize the percentage of variance explained by each of the selected components.

![](https://github.com/UI-DataScience/accy571-fa16/raw/master/Week7/assignments/images/var_naive.png)

(Again, **you do not have to create this plot.** I don't include the code that generated this plot, and leave it as an optional exercise for you.)

Taking this naive approach, we can see that the first principal component accounts for 99.9% of the variance in the data. (Note the y-axis is on a log scale.) In the following code cell, we see that the first principal component is just the range in miles.

In [None]:
abs_val = np.abs(pca_naive.components_[0])
max_pos = abs_val.argmax()
max_val = abs_val.max()

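# max_val is the largest absolute loading on the first component, not a share of variance
print('"{0}" has the largest loading ({1:0.3f}) on the first principal component.'.format(df.columns[max_pos], max_val))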

This is because the variables in the data set are on very different scales.
  PCA is a scale-dependent method. For example, if the range of one column is [-100, 100],
  while that of another column is [-0.1, 0.1], PCA will place more weight
  on the feature with the larger variance.
  One way to avoid this is to *standardize* a data set by
  scaling each feature so that every feature has
  zero mean and unit variance.
  
For further detail, see
  [Preprocessing data](http://scikit-learn.org/stable/modules/preprocessing.html).
  The function `sklearn.preprocessing.scale` provides a quick and easy way to
  perform this operation on a single array-like dataset.
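To make the scale dependence concrete, here is a small NumPy-only sketch on hypothetical toy data (not the Delta data). The helper names `explained_variance_ratio` and `standardize` are introduced here for illustration. The first helper computes the share of variance along each principal axis from the eigenvalues of the covariance matrix; the second does by hand what `scale` does with its default settings, subtracting each column's mean and dividing by its standard deviation.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of the total variance along each principal axis of X."""
    # The eigenvalues of the covariance matrix are the variances along
    # the principal axes; reverse them so the largest comes first.
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    return eigvals / eigvals.sum()

def standardize(X):
    """Shift each column to zero mean and rescale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Hypothetical toy data: two independent features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(-100, 100, 500),  # wide-range feature
    rng.uniform(-0.1, 0.1, 500),  # narrow-range feature
])

raw_ratio = explained_variance_ratio(X)
std_ratio = explained_variance_ratio(standardize(X))
print(raw_ratio)  # the first entry is close to 1
print(std_ratio)  # the two entries are much closer to each other
```

Before standardization, the wide-range column dominates the first component simply because its variance is so much larger; after standardization, neither column is favored.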

In [None]:
scaled = scale(df)

# we keep only 10 components
n_components = 10
pca = fit_pca(scaled, n_components=n_components)
print(pca.explained_variance_ratio_)

![](https://github.com/UI-DataScience/accy571-fa16/raw/master/Week7/assignments/images/var_scaled.png)

(**You do not have to create this plot.** I don't include the code that generated this plot, and leave it as an optional exercise for you.)

Great, so now we’re in business. There are various rules of thumb for selecting the number of principal components to retain in an analysis of this type. One that I’ve encountered is

```
Pick the number of components which explain 85% or greater of the variation.
```

So, we will keep the first 4 principal components (remember that we are counting from zero, so these are the 0th, 1st, 2nd, and 3rd components). In Problem 2, we will use these four components to fit a $k$-means model. Before we move on to the next problem, let's apply the dimensionality reduction to the scaled data. (In the previous sections, we didn't actually have to call `transform()`; this step makes sure that the scaled data is actually transformed.)
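The 85% rule of thumb can be applied programmatically to an array of explained variance ratios. The sketch below is one possible way to do it; the function name `n_components_for` and the ratios in `toy_ratios` are made up for illustration and are not the values produced by the Delta data.

```python
import numpy as np

def n_components_for(ratios, threshold=0.85):
    """Smallest number of leading components whose cumulative explained
    variance ratio reaches `threshold`."""
    cumulative = np.cumsum(ratios)
    # Index of the first cumulative value >= threshold, plus one to turn
    # the zero-based index into a count.
    count = int(np.searchsorted(cumulative, threshold)) + 1
    # If the threshold is never reached, fall back to all components.
    return min(count, len(cumulative))

# Made-up ratios for illustration only.
toy_ratios = np.array([0.50, 0.20, 0.10, 0.08, 0.05, 0.04, 0.03])
print(n_components_for(toy_ratios))
```

With a fitted model, the same function could be called on `pca.explained_variance_ratio_`.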

## Apply dimensional reduction

- Write a function named `reduce()` that takes a PCA model (already fitted on the array) and a Numpy array, and applies dimensionality reduction to the array.
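As background, applying a PCA model without whitening amounts to centering the data and projecting it onto the principal axes. The following NumPy-only sketch illustrates that idea on a hypothetical toy array; the function name `project` is introduced here for illustration, and this is not the scikit-learn call that `reduce()` is expected to use. The signs of the resulting columns may differ from what scikit-learn returns.

```python
import numpy as np

def project(X, k):
    """Project X onto its first k principal axes using an SVD."""
    centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T  # shape: (n_samples, k)

# Hypothetical toy array: 6 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
Z = project(X, k=2)
print(Z.shape)  # (6, 2)
```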

In [None]:
def reduce(pca, array):
    """
    Applies the "pca" model on array.
    
    Parameters
    ----------
    pca: An sklearn.decomposition.PCA instance.
    array: A Numpy array.
    
    Returns
    -------
    A Numpy array
    """
    
    # YOUR CODE HERE
    
    return reduced

In [None]:
reduced = reduce(pca, scaled)

In [None]:
assert_is_instance(reduced, np.ndarray)
assert_array_almost_equal(reduced, pca.fit_transform(scaled))