# Unsupervised Learning Final Project

This is the final project for the **Data Science in Python: Unsupervised Learning** course. This notebook is split into seven sections:
1. Data Prep & EDA
2. K-Means Clustering
3. PCA
4. K-Means Clustering: Round 2
5. PCA: Round 2
6. EDA on Clusters
7. Make Recommendations

## 0. Goal & Scope

**GOAL**: You are trying to better understand the company’s different segments of employees and how to increase employee retention within each segment.

**SCOPE**: Your task is to use a clustering technique to segment the employees, a dimensionality reduction technique to visualize the segments, and finally explore the clusters to make recommendations to increase retention.

## 1. Data Prep & EDA

### a. Data Prep: Check the data types

The data can be found in the `employee_data.csv` file.

In [None]:
# read in the employee data


In [None]:
# note the number of rows and columns


In [None]:
# view the data types of all the columns


In [None]:
# look at the numeric columns


In [None]:
# look at the non-numeric columns


### b. Data Prep: Convert the data types

Use `np.where` and `pd.get_dummies` to create a DataFrame for modeling where all fields are numeric.
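
As a rough illustration of the two techniques, the sketch below runs them on a small made-up DataFrame. The column names (`Gender`, `Attrition`, `Department`, `Age`) and the category values are assumptions for the example and may not match `employee_data.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the employee data; the column names and category
# values are assumptions, not taken from employee_data.csv.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Attrition": ["Yes", "No", "No", "Yes"],
    "Department": ["Sales", "HR", "Sales", "R&D"],
    "Age": [41, 29, 35, 52],
})

model_df = toy.copy()

# Binary text columns: np.where maps one value to 1 and everything else to 0.
model_df["Gender"] = np.where(model_df["Gender"] == "Female", 1, 0)
model_df["Attrition"] = np.where(model_df["Attrition"] == "Yes", 1, 0)

# Multi-valued column: one 0/1 indicator column per department.
dept_dummies = pd.get_dummies(model_df["Department"], dtype=int)

# Swap the text column for its indicator columns.
model_df = pd.concat([model_df.drop(columns="Department"), dept_dummies], axis=1)
```

`np.where` suits columns with exactly two values, while `pd.get_dummies` handles columns with three or more categories without implying an order between them.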

In [None]:
# create a copy of the dataframe


In [None]:
# look at the gender values


In [None]:
# change gender into a numeric field using np.where


In [None]:
# look at the attrition values


In [None]:
# change attrition to a numeric field using np.where


In [None]:
# look at the department values


In [None]:
# change department to a numeric field via dummy variables


In [None]:
# attach the columns back on to the dataframe


In [None]:
# view the cleaned dataframe


In [None]:
# note the number of rows and columns


### c. EDA

Our goal is to find the different types of employees at the company and take a look at their attrition (whether they end up leaving or not).
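
One possible way to compute both summaries, sketched on made-up data. It assumes attrition is already coded as 1 (left) and 0 (stayed); the column names are illustrative only.

```python
import pandas as pd

# Toy data; assumes Attrition is already coded 1 = left, 0 = stayed.
toy = pd.DataFrame({
    "Attrition": [1, 0, 0, 1, 0],
    "Age": [30, 40, 50, 20, 45],
    "JobSatisfaction": [1, 4, 3, 2, 4],
})

# The mean of a 0/1 column is the share of ones, i.e. the attrition rate.
overall_attrition = toy["Attrition"].mean()

# Column means for employees who stay (0) versus leave (1), side by side.
summary = toy.groupby("Attrition").mean()
```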

In [None]:
# what is the overall attrition rate, i.e. what percent of employees leave the company?


In [None]:
# create a summary table to show the mean of each column for employees who stay vs leave - what are your takeaways?


### d. Data Prep: Remove the Attrition and ID Columns

Exclude the attrition column (to be overlaid onto our clusters later on) and the ID column.

In [None]:
# create a new dataframe without the attrition column for us to model on


In [None]:
# drop the employee ID column as well before modeling


In [None]:
# note the number of rows and columns in the dataframe


In [None]:
# create a pair plot comparing all the columns of the dataframe - what do you notice?


## 2. K-Means Clustering

Let's segment the employees using K-Means clustering.

### a. Standardize the data
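
A minimal sketch of standardization with scikit-learn's `StandardScaler`, using randomly generated data in place of the employee DataFrame (the column names are assumptions). It also shows one detail worth knowing when checking the result: `StandardScaler` uses the population standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Random stand-in for the modeling DataFrame; column names are made up.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.integers(20, 60, size=50),
    "MonthlyIncome": rng.normal(5000, 1500, size=50),
})

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

# StandardScaler divides by the population standard deviation (ddof=0),
# so check with ddof=0; pandas' default .std() uses ddof=1 and lands
# slightly above 1.
means = scaled.mean()
stds = scaled.std(ddof=0)
```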

In [None]:
# scale the data using standardization


In [None]:
# double check that all the column means are 0 and standard deviations are 1


### b. Write a loop to fit models with 2 to 15 clusters and record the inertia and silhouette scores
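
The loop could look something like the sketch below. It runs on synthetic blobs from `make_blobs` rather than the scaled employee data, and the `n_init` and `random_state` settings are illustrative choices, not requirements of the project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled employee data.
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=42)

inertia_values = []
silhouette_scores = []
k_values = range(2, 16)  # 2 through 15 inclusive

for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    # Inertia: sum of squared distances from each point to its cluster center.
    inertia_values.append(kmeans.inertia_)
    # Silhouette: how well separated the clusters are, from -1 to 1.
    silhouette_scores.append(silhouette_score(X, kmeans.labels_))
```

Inertia always falls as k grows, so it is read for an elbow, whereas the silhouette score can be compared directly across k values, with higher being better.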

In [None]:
# import kmeans and write a loop to fit models with 2 to 15 clusters


In [None]:
# plot the inertia values


In [None]:
# plot the silhouette scores


### c. Identify a k value that looks like an elbow on the inertia plot and has a high silhouette score
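
Once a k value is chosen, the cluster sizes and centers can be pulled out roughly as follows. This is a sketch on synthetic data with k = 3 picked arbitrarily; the feature names are placeholders.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled employee data; feature names are made up.
X, _ = make_blobs(n_samples=200, centers=3, n_features=4, random_state=0)
features = pd.DataFrame(X, columns=["f1", "f2", "f3", "f4"])

# k = 3 is an arbitrary choice for this example.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# Number of rows assigned to each cluster label.
cluster_sizes = pd.Series(kmeans.labels_).value_counts().sort_index()

# One row per cluster, one column per feature; this table is what a
# heat map of the cluster centers would display.
centers = pd.DataFrame(kmeans.cluster_centers_, columns=features.columns)
```

A table like `centers` can be passed to a heat map function such as seaborn's `heatmap`. Because the inputs are standardized, values far from 0 mark the features that distinguish a cluster.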

In [None]:
# fit a kmeans model for the k value that you identified


In [None]:
# find the number of employees in each cluster


In [None]:
# create a heat map of the cluster centers


In [None]:
# interpret the clusters


## 3. PCA

Let's visualize the data using PCA.

### a. Fit a PCA Model with 2 components for visualization
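
A sketch of the PCA steps in this section, using random numbers in place of the scaled employee data. The column names `A` to `D` and the `PC1`/`PC2` labels are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random stand-in for the scaled data; column names are made up.
rng = np.random.default_rng(1)
toy_scaled = pd.DataFrame(rng.normal(size=(100, 4)), columns=["A", "B", "C", "D"])

pca = PCA(n_components=2)
pca.fit(toy_scaled)

# Share of total variance captured by each component.
explained = pca.explained_variance_ratio_

# Rows are components, columns are the original features; large absolute
# weights show which features drive each component.
loadings = pd.DataFrame(
    pca.components_, columns=toy_scaled.columns, index=["PC1", "PC2"]
)

# Project every row onto the two components for plotting.
coords = pd.DataFrame(pca.transform(toy_scaled), columns=["PC1", "PC2"])
```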

In [None]:
# fit a PCA model with 2 components


In [None]:
# view the explained variance ratio


In [None]:
# view the components


In [None]:
# view the columns


In [None]:
# interpret the components


### b. Overlay the K-Means cluster colors

In [None]:
# transform the data


In [None]:
# plot the data


In [None]:
# overlay the kmeans clusters (hint: set the hue to be the cluster labels)


### c. Overlay the Department colors instead

In [None]:
# overlay the department colors (hint: set the hue to be the department column)


## 4. K-Means Clustering: Round 2

Since the departments seemed to dominate the visualization, let's exclude them and try fitting more K-Means models.

### a. Create a new dataframe without the Departments

In [None]:
# create a new dataframe that excludes the three department columns from the scaled dataframe


### b. Write a loop to fit models with 2 to 15 clusters and record the inertia and silhouette scores

In [None]:
# write a loop to fit models with 2 to 15 clusters


In [None]:
# plot the inertia values


In [None]:
# plot the silhouette scores


### c. Identify a few k values that look like an elbow on the inertia plot and have a high silhouette score

#### i. k = [some value]

In [None]:
# fit a kmeans model for the k value that you identified


In [None]:
# find the number of employees in each cluster


In [None]:
# create a heat map of the cluster centers


In [None]:
# interpret the clusters


#### ii. k = [another value]

In [None]:
# fit a kmeans model for the k value that you identified


In [None]:
# find the number of employees in each cluster


In [None]:
# create a heat map of the cluster centers


In [None]:
# interpret the clusters


#### iii. k = [another value]

In [None]:
# fit a kmeans model for the k value that you identified


In [None]:
# find the number of employees in each cluster


In [None]:
# create a heat map of the cluster centers


In [None]:
# interpret the clusters


## 5. PCA: Round 2

Let's visualize the data (without Departments) using PCA.

### a. Fit a PCA Model with 2 components for visualization

In [None]:
# fit a PCA model with 2 components


In [None]:
# view the explained variance ratio


In [None]:
# view the components


In [None]:
# view the columns


In [None]:
# interpret the components


### b. Overlay the K-Means cluster colors

In [None]:
# transform the data


In [None]:
# plot the data


In [None]:
# overlay the kmeans clusters (choose your favorite k-means model from the previous section)


### c. OPTIONAL: Create a 3D plot

In [None]:
# fit a PCA model with 3 components


In [None]:
# view the explained variance ratio


In [None]:
# view the components


In [None]:
# view the columns


In [None]:
# interpret the components


In [None]:
# transform the data


In [None]:
# create a 3d scatter plot


## 6. EDA on Clusters

Let's go with the 6-cluster model that was fit without the department data.

### a. Confirm the 6 clusters

In [None]:
# fit a kmeans model with 6 clusters


In [None]:
# view the cluster labels


### b. Create a dataframe with the cluster labels and names
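
One way to build this table is to map each numeric label to a name with a dictionary. In the sketch below, the labels are invented and the names are placeholders; real names should come from your own reading of the cluster centers.

```python
import pandas as pd

# Hypothetical labels, as kmeans.labels_ might return for six clusters.
labels = pd.Series([0, 3, 5, 1, 2, 4, 0, 3], name="Cluster")

# Placeholder names; replace them with names based on your own
# interpretation of the cluster centers.
cluster_names = {
    0: "Segment A",
    1: "Segment B",
    2: "Segment C",
    3: "Segment D",
    4: "Segment E",
    5: "Segment F",
}

# Combine the labels and their names into a single DataFrame.
clusters = pd.DataFrame({
    "Cluster": labels,
    "Cluster_Name": labels.map(cluster_names),
})
```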

In [None]:
# create a dataframe with two columns - one of the label and another of the cluster name

# create a mapping for the cluster names

# combine the labels and names into a single dataframe


### c. View the attrition rates for each cluster
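
The attrition rate per cluster is a grouped mean of the 0/1 attrition column. The sketch below uses invented labels and attrition values, and assumes the two Series share the same row order and index.

```python
import pandas as pd

# Invented data; assumes both Series are aligned row for row.
clusters = pd.Series([0, 0, 1, 1, 1, 2], name="Cluster")
attrition = pd.Series([1, 0, 0, 0, 1, 1], name="Attrition")

# Combine the clusters and attrition data.
combined = pd.concat([clusters, attrition], axis=1)

# Attrition rate per cluster, highest first.
rates = (
    combined.groupby("Cluster")["Attrition"]
    .mean()
    .sort_values(ascending=False)
)

# Cluster sizes, to judge how much weight each rate deserves.
sizes = combined["Cluster"].value_counts().sort_index()
```

Checking `sizes` alongside `rates` matters because a high attrition rate in a very small cluster reflects only a few employees.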

In [None]:
# combine the clusters and attrition data


In [None]:
# what is the attrition rate for each cluster?


In [None]:
# sort the values


In [None]:
# interpret the findings


In [None]:
# find the number of employees in each cluster


### d. View the department breakdown for each cluster

In [None]:
# combine the clusters and department data


In [None]:
# what is the attrition rate for each cluster + department combination?


In [None]:
# sort the values


In [None]:
# interpret the findings


In [None]:
# find the number of employees in each cluster + department combo


## 7. Make Recommendations

In [None]:
# looking at the clusters, what segment info would you share with the team?


In [None]:
# what recommendations would you suggest to retain employees in each cluster?
