### Self-Study Colab Activity 6.1: Summarizing Data with PCA

This activity explores the results of applying PCA to a dataset.  Below, a dataset from a credit card company is loaded and displayed.  It contains customer demographic and payment information.  The final column, `default.payment.next.month`, is the label we want to create profiles for.  

Use PCA to reduce the dimensionality of the data to 2 and to 3 dimensions.  Then, draw scatterplots of the reduced data and color the points by `default.payment.next.month`.  Do 2 or 3 principal components appear to separate the data into clear groups?  Why or why not?  Post your visualizations, along with your argument for whether the components offer a more succinct representation of the data, on the discussion board for this activity.  (Note: In this assignment you should use the sklearn version of `PCA`.)
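
One thing to keep in mind while working on this: PCA finds directions of maximum variance, so columns measured in large units can dominate the components unless the data is standardized first. The following is a minimal sketch on synthetic data. The three columns and their scales are made up for illustration and are not taken from `credit.csv`, and `StandardScaler` is one possible choice rather than something this activity prescribes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the credit dataset):
# three independent columns on very different scales.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.normal(35, 9, size=n),           # an age-like column
    rng.normal(50_000, 70_000, size=n),  # a bill-amount-like column
    rng.integers(-2, 9, size=n),         # a small integer status-like column
])

# PCA on the raw values: the large-scale column carries almost all the variance.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA after standardizing each column to mean 0 and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print('raw:   ', raw_ratio)
print('scaled:', scaled_ratio)
```

If the first raw ratio is close to 1, the first component is mostly tracking the largest-scale column, which is worth mentioning when you interpret your scatterplots.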

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA

In [None]:
default = pd.read_csv('module 6/colab_activity6_1_starter/data/credit.csv', index_col=0)

In [None]:
default.info()

In [None]:
default.head()

##### Plotting in 3D

Below, example plots are made with both `matplotlib` and `plotly` to demonstrate how to construct three-dimensional plots.  

`matplotlib`

In `matplotlib`, an `axes` object is created with the projection specified as '3d'.  The `axes` object is then used to draw 3D objects.  Below, a scatter plot is built, but many other 3D objects can be created in addition to points.  For more information see the documentation on 3D plotting [here](https://matplotlib.org/stable/api/toolkits/mplot3d.html?highlight=3d).


`plotly`

Use the `scatter_3d` function directly.

In [None]:
plt.figure(figsize=(6, 6))
ax = plt.axes(projection='3d')
ax.scatter3D(default['AGE'], default['BILL_AMT1'], default['BILL_AMT2'], c=default['default.payment.next.month'],
             alpha=0.4)
ax.set_xlabel('AGE', labelpad=20)
ax.set_ylabel('Bill 1 Amount', labelpad=20)
ax.set_zlabel('Bill 2 Amount', labelpad=20)
ax.view_init(10, 60)
plt.title('Age and Bill Amount Colored by Default')
plt.tight_layout();

In [None]:
px.scatter_3d(data_frame=default, x='AGE', y='BILL_AMT1', z='BILL_AMT2', color='default.payment.next.month')

In [None]:
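pca2 = PCA(n_components=2)
pca3 = PCA(n_components=3)
default_pca = PCA()
# Fit on the features only; the label column is left out so it can be used for coloring.
default_pca.fit(default.drop(columns='default.payment.next.month'))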

In [None]:
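# Project the features (label column excluded) onto 2 and 3 principal components.
features = default.drop(columns='default.payment.next.month')
dim_reduction_2 = pca2.fit_transform(X=features)
dim_reduction_3 = pca3.fit_transform(X=features)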
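
`fit_transform` returns a plain NumPy array, so the column names and the label are lost. One way to make the result easier to plot is to wrap it in a DataFrame with named component columns and attach the label. The sketch below uses randomly generated stand-in data, and the names `features`, `target`, `pca_df`, `PC1`, `PC2`, and `PC3` are illustrative choices, not part of the starter notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random stand-in data (an assumption, not the credit dataset).
rng = np.random.default_rng(1)
features = pd.DataFrame(rng.normal(size=(200, 5)),
                        columns=[f'f{i}' for i in range(5)])
target = pd.Series(rng.integers(0, 2, size=200), index=features.index)

# Reduce to three components and keep the original row index.
components = PCA(n_components=3).fit_transform(features)
pca_df = pd.DataFrame(components, columns=['PC1', 'PC2', 'PC3'],
                      index=features.index)

# Store the label as strings so plotting libraries treat it as a category.
pca_df['default'] = target.astype(str)

print(pca_df.head())
```

A frame like this can be passed directly to `px.scatter_3d` with `x='PC1', y='PC2', z='PC3', color='default'`, or to `px.scatter` using only the first two components.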

In [None]:
dim_reduction_2

In [None]:
dim_reduction_3

In [None]:
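# 3D scatter of the three principal components, colored by the default label.
px.scatter_3d(x=dim_reduction_3[:, 0], y=dim_reduction_3[:, 1], z=dim_reduction_3[:, 2], color=default['default.payment.next.month'].astype(str).to_numpy())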

In [None]:
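# 2D scatter of the two principal components, colored by the default label.
px.scatter(x=dim_reduction_2[:, 0], y=dim_reduction_2[:, 1], color=default['default.payment.next.month'].astype(str).to_numpy())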


In [None]:
explained_variance = default_pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)


In [None]:
explained_variance

In [None]:
cumulative_variance
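
The cumulative sum can also be used to read off how many components are needed to reach a given share of the variance. Below is a small sketch of that calculation. The ratios are hypothetical values chosen for illustration, not results from the credit dataset, and the 85% threshold is an arbitrary assumption.

```python
import numpy as np

# Hypothetical explained-variance ratios (not from the credit dataset).
explained = np.array([0.55, 0.25, 0.10, 0.06, 0.04])
cumulative = np.cumsum(explained)

threshold = 0.85  # assumed target share of variance to retain

# argmax returns the first position where the condition is True;
# add 1 because components are counted from 1.
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative)
print(n_needed)
```

With the notebook's own `cumulative_variance` array in place of `cumulative`, the same two lines give the component count for whatever threshold you choose to argue for.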

In [None]:


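# Number the components from 1 so the x-axis matches the usual PC1, PC2, ... naming.
fig = px.line(x=np.arange(1, len(explained_variance) + 1), y=explained_variance, markers=True, title='Scree Plot',
              labels={'x': 'Principal Component', 'y': 'Explained Variance Ratio'})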
fig.update_layout(showlegend=False)
fig.show()
fig.write_image('module 6/colab_activity6_1_starter/images/pca.png')
