# This Jupyter notebook is prepared by Lucas McClean.

## A. Basic Setup

1. Import required libraries: pandas, numpy, matplotlib (set %matplotlib inline), matplotlib pyplot, seaborn, missingno, scipy.stats, sklearn

2. Load the dataset into a DataFrame and display the number of rows and columns

3. Use describe() to show summary statistics of numerical columns

4. Explain any interesting or useful statistics you observe

5. Display the first 5 and last 5 rows of the DataFrame

6. List all numerical columns

7. List all categorical columns

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import missingno as msno
import scipy.stats as st
import sklearn

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
data = pd.read_csv("/content/drive/MyDrive/hrdata.csv")

In [None]:
data.shape

In [None]:
data.describe()

Most of these aren't meaningful; the columns worth looking at are `city_development_index`, `training_hours`, and `city_development_matrics` (a misspelling of "matrix", I presume). The `city_development_index` and `city_development_matrics` columns seem to describe the same underlying value. Both are also pretty consistent (i.e. there is very little variance and very few outliers). As for `training_hours`, there's not much to glean here other than that it might be a useful feature given its wide range.
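
One way to check whether the two city development columns really carry the same value is to compare them directly. The sketch below does this on a small made-up frame (`toy` and its values are hypothetical, built so that one column is ten times the other); on the real data the same calls would run against `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "city_development_index": [0.920, 0.776, 0.624, 0.920, 0.767],
    "city_development_matrics": [9.20, 7.76, 6.24, 9.20, 7.67],
})

a = toy["city_development_index"]
b = toy["city_development_matrics"]

# A Pearson correlation of 1.0 means one column is an exact
# increasing linear function of the other.
pearson_r = a.corr(b)

# If the row-wise ratio is constant, the columns differ only by a scale factor.
ratio = b / a
is_rescaled_copy = bool(np.allclose(ratio, ratio.iloc[0]))

print(f"correlation: {pearson_r:.4f}")
print(f"constant ratio: {is_rescaled_copy}")
```

If both checks pass on the real columns, one of the two is redundant.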

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.select_dtypes(include=["number"]).columns

In [None]:
data.select_dtypes(include=["object"]).columns

## B. Missing Values Analysis

1. Display column-wise count of missing values in descending order

2. Display column-wise percentage of missing values in descending order

3. Create a bar plot showing only columns with missing values, ordered from least missing (left) to most missing (right)

4. Use missingno to generate and interpret:

  - Bar plot

  - Matrix plot (using a sample of 200 rows)

  - Heatmap
  
  - Interpret any interesting patterns observed in the heatmap and at least one other plot


In [None]:
data.isna().sum().sort_values(ascending=False)

In [None]:
(data.isna().mean() * 100).sort_values(ascending=False)

In [None]:
missing_counts = data.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=True)

missing_counts.plot(kind="bar")
plt.ylabel("Missing Values")
plt.xlabel("Columns")
plt.title("Missing Values by Column")
plt.tight_layout()
plt.show()

In [None]:
msno.bar(data)

In [None]:
msno.matrix(data.sample(200))

In [None]:
msno.heatmap(data)

From the heatmap we can see a very strong correlation in missing values between the `company_type` and `company_size`. This makes sense as the data are both related to the previous employer (if there was no previous employer, both size and type would be missing). We can see a similar correlation with `major_discipline` and `education_level`. If a candidate does not have an education, neither will appear.
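
The heatmap can be cross-checked numerically with pandas alone by correlating the missing-value indicators of each column. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical); my understanding is that this is close to what `msno.heatmap` draws, but I haven't verified that it matches exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "company_size": ["50-99", np.nan, "<10", np.nan, "100-500", np.nan],
    "company_type": ["Pvt Ltd", np.nan, "Funded Startup", np.nan, "Pvt Ltd", "NGO"],
    "training_hours": [36, 47, 83, 52, 8, 24],
})

# 1 where a value is missing, 0 where it is present.
is_missing = toy.isna().astype(int)

# Columns that are never (or always) missing have no variance,
# so their correlation is undefined; leave them out.
varying = is_missing.loc[:, is_missing.nunique() > 1]

# Pairwise correlation of the missingness indicators.
nullity_corr = varying.corr()
print(nullity_corr.round(2))
```

Values near 1 mean two columns tend to be missing on the same rows.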

Looking at the "Missing Values by Column", we can see that there are only five columns with a concerning amount of missing data. For our purposes, it is unfortunate that the previous employer data and `target` values are so absent. These likely would be major predictors. The `gender` and `major_discipline` values are likely not as concerning.

## C. Understanding Categorical Attributes

For each categorical feature:

  1. Create a seaborn bar plot showing category counts

  2. Create a seaborn countplot of the feature against the target
  
  3. Interpret anything interesting, especially anything that informs a decision to combine, remove, or add features, or that suggests resampling may be needed.
  

In [None]:
feature = "city"

counts = data[feature].value_counts().reset_index()
counts.columns = [feature, "count"]

plt.figure(figsize=(5, 18))
sns.barplot(data=counts, x="count", y=feature)
plt.title("City Counts")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

In [None]:
feature = "city"

order = data[feature].value_counts().index

plt.figure(figsize=(5, 20))
sns.countplot(data=data, y=feature, hue="target", order=order)
plt.title("Target by City")
plt.tight_layout()
plt.show()

### City

The target ratio varies quite widely from city to city, which suggests that the city may be a good predictor. The number of rows per city, however, is heavily skewed. It might be best to add a column holding the target ratio of each row's city (though that ratio will be unreliable for rarer cities).
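
The per-city ratio idea is usually called target encoding, and the problem with rare cities can be softened by shrinking each city's ratio towards the overall ratio. This is a minimal sketch on made-up rows; `toy`, the city labels, and the `SMOOTHING` value are all hypothetical, and smoothing is my suggestion rather than something the assignment asks for.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "city": ["city_103"] * 6 + ["city_21"] * 3 + ["city_7"],
    "target": [0, 0, 0, 0, 1, 0, 1, 1, 0, 1],
})

# Hypothetical strength: behaves like 5 extra rows at the overall rate.
SMOOTHING = 5

global_rate = toy["target"].mean()
stats = toy.groupby("city")["target"].agg(["sum", "count"])

# Shrink each city's rate towards the overall rate; cities with
# few rows are pulled towards it the most.
city_rate = (stats["sum"] + SMOOTHING * global_rate) / (stats["count"] + SMOOTHING)

toy["city_target_rate"] = toy["city"].map(city_rate)
print(city_rate.round(3))
```

To avoid leaking the target, the rates would need to be computed on the training rows only and then mapped onto the test rows.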

In [None]:
feature = "gender"

counts = data[feature].value_counts().reset_index()
counts.columns = [feature, "count"]

plt.figure(figsize=(5, 4))
sns.barplot(data=counts, y="count", x=feature)
plt.title("Gender Counts")
plt.tight_layout()
plt.show()

In [None]:
feature = "gender"

order = data[feature].value_counts().index

plt.figure(figsize=(5, 4))
sns.countplot(data=data, x=feature, hue="target", order=order)
plt.title("Target by Gender")
plt.tight_layout()
plt.show()

### Gender

These plots show that there is almost no relationship between `gender` and `target`. The ratios for male, female, and other are almost identical. Combined with the fact that this feature has a lot of missing values, it should likely be dropped.
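
The visual impression can be backed up with a chi-square test of independence. The sketch below uses made-up counts in which every gender has the same target ratio (`toy` is hypothetical, and far too small for the test to be trustworthy); on the real data the crosstab would be built from `data`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy frame standing in for `data`; the values are made up
# so that each gender has the same 3:1 split of target values.
toy = pd.DataFrame({
    "gender": ["Male"] * 8 + ["Female"] * 4 + ["Other"] * 4,
    "target": [0, 0, 0, 0, 0, 0, 1, 1,
               0, 0, 0, 1,
               0, 0, 0, 1],
})

# Contingency table: rows are genders, columns are target values.
table = pd.crosstab(toy["gender"], toy["target"])

# Null hypothesis: gender and target are independent.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

A large p-value is consistent with dropping the feature; a small one would argue for keeping it.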

In [None]:
feature = "relevent_experience"

counts = data[feature].value_counts().reset_index()
counts.columns = [feature, "count"]

plt.figure(figsize=(5, 4))
sns.barplot(data=counts, y="count", x=feature)
plt.title("Relevant Experience Counts")
plt.tight_layout()
plt.show()

In [None]:
feature = "relevent_experience"

order = data[feature].value_counts().index

plt.figure(figsize=(5, 4))
sns.countplot(data=data, x=feature, hue="target", order=order)
plt.title("Target by Relevant Experience")
plt.tight_layout()
plt.show()

### Relevant Experience

> Note: "relevant" is misspelled as "relevent" in the data set.

There is a strong relationship between this feature and the `target`. Since it only has two values, it can be converted into a binary column for easier processing by the model. There is a large difference in the counts of the two values, so it's worth considering how to properly sample for the training set.
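
The binary conversion could look like the sketch below. The two label strings are my assumption about what the column contains (check them with `data["relevent_experience"].unique()` first), and `toy` and the new column name `has_relevant_experience` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; the label text is assumed.
toy = pd.DataFrame({
    "relevent_experience": [
        "Has relevent experience",
        "No relevent experience",
        "Has relevent experience",
    ]
})

# Explicit mapping: any label not listed here becomes NaN instead of
# being silently coded as 0, which makes unexpected values easy to spot.
mapping = {
    "Has relevent experience": 1,
    "No relevent experience": 0,
}
toy["has_relevant_experience"] = toy["relevent_experience"].map(mapping)

print(toy)
```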

In [None]:
feature = "enrolled_university"

counts = data[feature].value_counts().reset_index()
counts.columns = [feature, "count"]

plt.figure(figsize=(5, 4))
sns.barplot(data=counts, y="count", x=feature)
plt.title("Enrolled University Counts")
plt.tight_layout()
plt.show()

In [None]:
feature = "enrolled_university"

order = data[feature].value_counts().index

plt.figure(figsize=(5, 4))
sns.countplot(data=data, x=feature, hue="target", order=order)
plt.title("Target by Enrolled University")
plt.tight_layout()
plt.show()

### Enrolled University

Once again, this feature shows a strong and varying relationship to the `target`, but the distribution is fairly skewed. We'll have to be careful when sampling for the training set.

In [None]:
feature = "education_level"

counts = data[feature].value_counts().reset_index()
counts.columns = [feature, "count"]

plt.figure(figsize=(5, 4))
sns.barplot(data=counts, y="count", x=feature)
plt.title("Education Level Counts")
plt.tight_layout()
plt.show()

In [None]:
feature = "education_level"

order = data[feature].value_counts().index

plt.figure(figsize=(5, 4))
sns.countplot(data=data, x=feature, hue="target", order=order)
plt.title("Target by Education Level")
plt.tight_layout()
plt.show()

### Education Level

Interestingly, "Undergraduate" is not a category. It's possible that "Graduate" is intended to mean "Undergraduate" given that "Masters" is also included. That would also align with the fact that it's the most common. This may be worth dropping, however, as the ratio of the target seems to be just about equivalent for each level.

In [None]:
feature = "major_discipline"

counts = data[feature].value_counts().reset_index()
counts.columns = [feature, "count"]

plt.figure(figsize=(5, 4))
sns.barplot(data=counts, y="count", x=feature)
plt.xticks(rotation=45)
plt.title("Major Discipline Counts")
plt.tight_layout()
plt.show()

In [None]:
feature = "major_discipline"

order = data[feature].value_counts().index

plt.figure(figsize=(5, 4))
sns.countplot(data=data, x=feature, hue="target", order=order)
plt.xticks(rotation=45)
plt.title("Target by Major Discipline")
plt.tight_layout()
plt.show()

### Major Discipline

The "STEM" category far outweighs any other. This will likely skew our results so that they apply almost exclusively to candidates from a STEM discipline. If the goal were narrowed to STEM candidates, we could drop the other rows. But the ratio of the target for each discipline seems to be almost equivalent anyway, so we can likely drop this feature.

In [None]:
feature = "experience"

counts = data[feature].value_counts().reset_index()
counts.columns = [feature, "count"]

plt.figure(figsize=(5, 4))
sns.barplot(data=counts, y="count", x=feature)
plt.xticks(rotation=45)
plt.title("Experience Counts")
plt.tight_layout()
plt.show()

In [None]:
feature = "experience"

order = data[feature].value_counts().index

plt.figure(figsize=(5, 4))
sns.countplot(data=data, x=feature, hue="target", order=order)  # pass the order computed above
plt.xticks(rotation=45)
plt.title("Target by Experience")
plt.tight_layout()
plt.show()

### Experience

The data has a lot of members in the ">20" category. It's worth considering how to properly sample for testing. It's also likely worth binning here to reduce the number of categories. Different ranges do seem to have different relationships to the target though.
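
Binning could be done by turning the labels into numbers and cutting them into ranges. This is only a sketch: the "<1" label and the bin edges are my assumptions, and `toy` and the column name `experience_bin` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; the labels are assumed.
toy = pd.DataFrame({"experience": ["<1", "3", "7", "15", ">20", np.nan, "20"]})

# Map the two open-ended labels to numbers just outside the 1-20 range,
# then parse everything as numeric (missing values stay missing).
years = pd.to_numeric(
    toy["experience"].replace({"<1": "0", ">20": "21"}),
    errors="coerce",
)

# Hypothetical bin edges; each interval includes its right edge.
bins = [-1, 2, 5, 10, 20, np.inf]
labels = ["0-2", "3-5", "6-10", "11-20", ">20"]
toy["experience_bin"] = pd.cut(years, bins=bins, labels=labels)

print(toy)
```

The edges are worth tuning against the countplot so that ranges with a similar target ratio end up in the same bin.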

In [None]:
feature = "company_size"

counts = data[feature].value_counts().reset_index()
counts.columns = [feature, "count"]

plt.figure(figsize=(5, 4))
sns.barplot(data=counts, y="count", x=feature)
plt.xticks(rotation=45)
plt.title("Company Size Counts")
plt.tight_layout()
plt.show()

In [None]:
feature = "company_size"

order = data[feature].value_counts().index

plt.figure(figsize=(5, 4))
sns.countplot(data=data, x=feature, hue="target", order=order)
plt.xticks(rotation=45)
plt.title("Target by Company Size")
plt.tight_layout()
plt.show()

### Company Size

This feature seems to have a fair number of samples in each bucket. The target ratio also seems to vary with the company size.

In [None]:
feature = "company_type"

counts = data[feature].value_counts().reset_index()
counts.columns = [feature, "count"]

plt.figure(figsize=(5, 4))
sns.barplot(data=counts, y="count", x=feature)
plt.xticks(rotation=45)
plt.title("Company Type Counts")
plt.tight_layout()
plt.show()

In [None]:
feature = "company_type"

order = data[feature].value_counts().index

plt.figure(figsize=(5, 4))
sns.countplot(data=data, x=feature, hue="target", order=order)
plt.title("Target by Company Type")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### Company Type

There's a major skew towards "Pvt Ltd" with very few samples for the other types. Given that the ratio for the target seems to be roughly the same for each category, we should probably drop this feature.
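
Both claims (the skew in counts and the similar target ratios) can be read off one small table instead of being eyeballed from the plots. A sketch on made-up rows; `toy` is hypothetical and every label except "Pvt Ltd" is assumed.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "company_type": ["Pvt Ltd"] * 8 + ["Funded Startup"] * 4 + ["NGO"] * 2,
    "target": [0, 0, 0, 0, 0, 0, 1, 1,
               0, 0, 0, 1,
               0, 1],
})

# Rows per category and the share of rows with target == 1.
summary = toy.groupby("company_type")["target"].agg(rows="count", target_rate="mean")

# Share of all rows that fall in each category.
summary["row_share"] = summary["rows"] / summary["rows"].sum()

summary = summary.sort_values("rows", ascending=False)
print(summary.round(3))
```

The same three lines work for any of the categorical features by swapping the column name.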

In [None]:
feature = "last_new_job"

counts = data[feature].value_counts().reset_index()
counts.columns = [feature, "count"]

plt.figure(figsize=(5, 4))
sns.barplot(data=counts, y="count", x=feature)
plt.title("Last New Job Counts")
plt.tight_layout()
plt.show()

In [None]:
feature = "last_new_job"

order = data[feature].value_counts().index

plt.figure(figsize=(5, 4))
sns.countplot(data=data, x=feature, hue="target", order=order)
plt.title("Target by Last New Job")
plt.tight_layout()
plt.show()

### Last New Job

We'll have to make sure to use stratified sampling (or something similar) to ensure that all categories here are proportionally represented. The ratios differ fairly significantly, so this feature will likely be a good predictor.
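
A stratified split can be sketched with scikit-learn's `train_test_split` and its `stratify` argument. The frame `toy`, its labels, and the 25% test size are hypothetical; on the real data, rows with a missing `last_new_job` would likely need to be filled or dropped before stratifying.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "last_new_job": ["1"] * 12 + [">4"] * 4 + ["never"] * 4,
    "training_hours": range(20),
})

# stratify keeps each category's share the same in both splits.
train, test = train_test_split(
    toy,
    test_size=0.25,
    stratify=toy["last_new_job"],
    random_state=42,
)

print(train["last_new_job"].value_counts())
print(test["last_new_job"].value_counts())
```

Stratifying on `target` instead (or on a combined key) follows the same pattern.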

In [None]:
feature = "state"

counts = data[feature].value_counts().reset_index()
counts.columns = [feature, "count"]

plt.figure(figsize=(5, 4))
sns.barplot(data=counts, y="count", x=feature)
plt.title("State Counts")
plt.tight_layout()
plt.show()

In [None]:
feature = "state"

order = data[feature].value_counts().index

plt.figure(figsize=(5, 4))
sns.countplot(data=data, x=feature, hue="target", order=order)
plt.title("Target by State")
plt.tight_layout()
plt.show()

### State

Given that there's only one state, we can safely drop this feature.
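
Constant columns like this one can also be found programmatically rather than by plotting each feature. A small sketch on a made-up frame (`toy` and the "CA" value are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "state": ["CA", "CA", "CA", "CA"],
    "gender": ["Male", "Female", None, "Male"],
    "training_hours": [36, 47, 83, 52],
})

# A column with at most one distinct value carries no information.
# dropna=False counts a missing value as its own distinct value.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]

reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```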

## D. Understanding Numerical Attributes

For each numerical feature:

  - Plot the distribution using a histogram

  - Plot the distribution using seaborn distplot

  - Interpret any interesting observations


In [None]:
feature = "enrollee_id"

plt.figure(figsize=(5, 4))
sns.histplot(data=data, x=feature)
plt.title("Enrollee ID Histogram")
plt.tight_layout()
plt.show()

In [None]:
feature = "enrollee_id"

plt.figure(figsize=(5, 4))
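# sns.distplot is deprecated since seaborn 0.11; histplot with a KDE overlay replaces it.
sns.histplot(data=data, x=feature, kde=True, stat="density")
plt.title("Enrollee ID Distribution")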
plt.tight_layout()
plt.show()

### Enrollee ID

This feature has a fairly uniform distribution.

In [None]:
feature = "city_development_index"

plt.figure(figsize=(5, 4))
sns.histplot(data=data, x=feature)
plt.title("City Development Index Histogram")
plt.tight_layout()
plt.show()

In [None]:
feature = "city_development_index"

plt.figure(figsize=(5, 4))
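# sns.distplot is deprecated since seaborn 0.11; histplot with a KDE overlay replaces it.
sns.histplot(data=data, x=feature, kde=True, stat="density")
plt.title("City Development Index Distribution")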
plt.tight_layout()
plt.show()

### City Development Index

The development index is bimodal; however, the second spike consists of a single value. Outside of this outlier, it skews to the left fairly uniformly.

In [None]:
feature = "training_hours"

plt.figure(figsize=(5, 4))
sns.histplot(data=data, x=feature)
plt.title("Training Hours Histogram")
plt.tight_layout()
plt.show()

In [None]:
feature = "training_hours"

plt.figure(figsize=(5, 4))
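# sns.distplot is deprecated since seaborn 0.11; histplot with a KDE overlay replaces it.
sns.histplot(data=data, x=feature, kde=True, stat="density")
plt.title("Training Hours Distribution")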
plt.tight_layout()
plt.show()

### Training Hours

This feature is unimodal and skewed to the right (the long tail is on the high side). Most people have under fifty hours and the number of people keeps decreasing with each increase in training hours. This may be a strong indicator for the target; we'll have to verify in a later step.
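
The skew can be quantified with `scipy.stats.skew`: a positive value means the long tail is on the high side, which is what "most people under fifty hours, fewer and fewer above" describes. The sketch below uses made-up hours (`hours` is hypothetical) and also shows the effect of a log transform, which is my suggestion rather than part of the assignment.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Hypothetical training hours standing in for data["training_hours"].
hours = pd.Series(
    [4, 8, 12, 18, 22, 24, 30, 36, 47, 52, 83, 140, 210, 336],
    name="training_hours",
)

# Sample skewness: > 0 means a long right tail, < 0 a long left tail.
raw_skew = skew(hours)

# log1p compresses large values and is defined at 0 hours.
log_skew = skew(np.log1p(hours))

print(f"skew of raw hours: {raw_skew:.2f}")
print(f"skew after log1p:  {log_skew:.2f}")
```

If the real column behaves the same way, the transformed version may be easier for some models to use.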

In [None]:
feature = "target"

plt.figure(figsize=(5, 4))
sns.histplot(data=data, x=feature)
plt.title("Target Histogram")
plt.tight_layout()
plt.show()

In [None]:
feature = "target"

plt.figure(figsize=(5, 4))
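# sns.distplot is deprecated since seaborn 0.11; histplot with a KDE overlay replaces it.
sns.histplot(data=data, x=feature, kde=True, stat="density")
plt.title("Target Distribution")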
plt.tight_layout()
plt.show()

### Target

As previously mentioned, the target should really be a categorical variable as there are only two values. Most people, however, are not looking for a job change. We'll have to keep that in mind when sampling to ensure an accurate representation.
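
The class imbalance can be put into numbers, and the target cast to a categorical type, roughly as follows. The frame `toy` is made up, and dropping rows with a missing target is an assumption about how they would be handled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({"target": [0.0, 0.0, 0.0, 1.0, 0.0, np.nan, 1.0, 0.0]})

# Share of each class among rows that have a target (NaN is excluded).
class_share = toy["target"].value_counts(normalize=True)

# Keep only labelled rows, then store the target as a category.
labelled = toy.dropna(subset=["target"]).copy()
labelled["target"] = labelled["target"].astype(int).astype("category")

print(class_share.round(3))
print(labelled["target"].dtype)
```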

In [None]:
feature = "city_development_matrics"

plt.figure(figsize=(5, 4))
sns.histplot(data=data, x=feature)
plt.title("City Development Matrix Histogram")
plt.tight_layout()
plt.show()

In [None]:
feature = "city_development_matrics"

plt.figure(figsize=(5, 4))
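# sns.distplot is deprecated since seaborn 0.11; histplot with a KDE overlay replaces it.
sns.histplot(data=data, x=feature, kde=True, stat="density")
plt.title("City Development Matrix Distribution")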
plt.tight_layout()
plt.show()

### City Development Matrix

This feature has the exact same distribution as the index, so it is likely best to drop one or the other.