# Summarizing data with PCA

This activity explores the results of applying PCA to a dataset.  Below, a dataset from a credit card company is loaded and displayed.  It contains customer demographics and payment information.  The final column, `default payment next month` (renamed to `will_default` below), is what we want to create profiles for.

Use PCA to reduce the dimensionality of the data to 2 and 3 dimensions.  Then draw scatterplots of the resulting data and color the points by `default`.  Does it seem that 2 or 3 principal components will separate the data into clear groups?  Why or why not?  Post your visualizations, and your argument for whether the components offer a more succinct representation of the data, on the discussion board for this activity.  (Note: In this assignment you should use the sklearn version of `PCA`.)

## Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from scipy.linalg import svd

# import seaborn as sns
# from mpl_toolkits import mplot3d

In [None]:
pd.set_option("display.max_columns", None)

## Data Load and Display

### Load

In [None]:
df = pd.read_csv("./data/credit.csv")

### Cleanup

In [None]:
df = df.rename(columns={"default.payment.next.month": "will_default"}).drop(
    columns="ID"
)

### Display

In [None]:
# df.info()
df.head()

## Intro to Plotting in 3D

### With matplotlib

In [None]:
plt.figure(figsize=(6, 6))
ax = plt.axes(projection="3d")
ax.scatter3D(
    df["AGE"],
    df["BILL_AMT1"],
    df["BILL_AMT2"],
    c=df["will_default"],
    alpha=0.4,
)
ax.set_xlabel("AGE", labelpad=20)
ax.set_ylabel("Bill 1 Amount", labelpad=20)
ax.set_zlabel("Bill 2 Amount", labelpad=20)
ax.view_init(10, 60)
plt.title("Age and Bill Amount Colored by Default")
plt.tight_layout()

### With plotly

In [None]:
px.scatter_3d(
    data_frame=df,
    x="AGE",
    y="BILL_AMT1",
    z="BILL_AMT2",
    color="will_default",
)

## Correlation Coefficients

### Calculate

In [None]:
corr = df.corr()

In [None]:
display(corr)

In [None]:
# sns.heatmap(corr, annot=False)
fig = px.imshow(corr, title="Correlation Coefficients", height=700, width=700)
fig.show()

### Observations

Notice the many lines of constant color, and even whole blocks of constant color, in the heatmap:
- The correlation coefficient is roughly the same across several contiguous variables
- i.e. there is a lot of redundancy in the data
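
The redundancy seen in the heatmap can also be listed numerically. The following is a minimal sketch on a small synthetic frame (the columns `a`, `b`, `c` and the `0.9` threshold are illustrative assumptions, not part of the credit data); it keeps each variable pair once and reports those whose absolute correlation exceeds the threshold.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the credit data (an assumption for illustration):
# "b" is nearly a copy of "a", while "c" is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame(
    {
        "a": a,
        "b": a + 0.05 * rng.normal(size=500),
        "c": rng.normal(size=500),
    }
)

corr_demo = demo.corr()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (always 1.0) is dropped.
mask = np.triu(np.ones(corr_demo.shape, dtype=bool), k=1)
pairs = corr_demo.where(mask).stack()

# Pairs whose absolute correlation exceeds an arbitrary threshold.
threshold = 0.9
redundant_pairs = pairs[pairs.abs() > threshold].sort_values(ascending=False)
print(redundant_pairs)
```

In the notebook, the same steps could be applied to `corr` in place of `corr_demo`.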

## Perform the PCA

### Scale

In [None]:
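# Scale only the features; the `will_default` label is left out
# so that it does not influence the principal components.
features = df.drop(columns="will_default")
scaling = StandardScaler()
df_scaled = scaling.fit_transform(features)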

### SVD

#### Compute the Decomp

In [None]:
(U, sigma, VT) = svd(df_scaled, full_matrices=False)

#### Plot Singular Values

In [None]:
plt.plot(np.arange(len(sigma)) + 1, sigma, linestyle="solid")
plt.grid(True)
plt.show()

#### Explained Ratios

In [None]:
# Variance explained is proportional to the squared singular values
percent_variance_explained = sigma**2 / (sigma**2).sum()
cum_percent_variance_explained = np.cumsum(percent_variance_explained)
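
For centered data, the variance captured by each principal component is proportional to the square of the corresponding singular value. The sketch below checks this on synthetic data (the matrix `X` is made up for illustration) by comparing ratios computed from the SVD against sklearn's `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)  # center each column, as PCA does internally

# Singular values of the centered data matrix.
s = np.linalg.svd(X, compute_uv=False)

# Share of total variance carried by each component.
ratio_from_svd = s**2 / np.sum(s**2)

# The same quantity as reported by sklearn's PCA.
ratio_from_pca = PCA().fit(X).explained_variance_ratio_

print(np.allclose(ratio_from_svd, ratio_from_pca))
```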

#### Plot Cumulative Variance vs Num Components

In [None]:
plt.plot(
    np.arange(len(cum_percent_variance_explained)) + 1,
    cum_percent_variance_explained,
    linestyle="solid",
    marker="o",
)
plt.grid(True)
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Variance Explained")
plt.title("Cumulative Variance Explained vs. Num Components")
plt.show()

In [None]:
nums = np.arange(len(sigma))

### PCA

In [None]:
r = 3
principal = PCA(n_components=r)
principal.fit(df_scaled)
x = principal.transform(df_scaled)

In [None]:
df_pca = pd.DataFrame(x, columns=["component_" + str(k + 1) for k in range(r)])
df_pca.head()
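
The scatterplots the activity asks for could be drawn along the following lines. This is only a sketch on synthetic stand-in data (the arrays `features` and `labels` are assumptions, not the credit dataset): it projects scaled features onto 3 components, then plots the first two in 2D and all three in 3D, colored by the binary label.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 300 rows, 6 features,
# with the label-1 rows shifted so the two groups differ.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=300)
features = rng.normal(size=(300, 6)) + labels[:, None] * 1.5

# Scale the features, then project onto the first 3 principal components.
scaled = StandardScaler().fit_transform(features)
components = PCA(n_components=3).fit_transform(scaled)

fig = plt.figure(figsize=(12, 6))

# Left panel: first two components, colored by label.
ax2d = fig.add_subplot(1, 2, 1)
ax2d.scatter(components[:, 0], components[:, 1], c=labels, alpha=0.4)
ax2d.set_xlabel("component_1")
ax2d.set_ylabel("component_2")
ax2d.set_title("2 Components Colored by Label")

# Right panel: all three components, colored by label.
ax3d = fig.add_subplot(1, 2, 2, projection="3d")
ax3d.scatter(
    components[:, 0],
    components[:, 1],
    components[:, 2],
    c=labels,
    alpha=0.4,
)
ax3d.set_xlabel("component_1")
ax3d.set_ylabel("component_2")
ax3d.set_zlabel("component_3")
ax3d.set_title("3 Components Colored by Label")

plt.tight_layout()
```

In the notebook, the columns of `df_pca` and `df["will_default"]` would take the place of `components` and `labels`.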