<a href="https://www.kaggle.com/code/hikmatullahmohammadi/salary-classification-3-models-coparison?scriptVersionId=107211876" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<h1 style='text-align: center; color: white; background: blue; padding: 30px' >
    Salary Classification (Complete ML model)
</h1>
<h2 style='text-align: center; color: white; background: blue; padding: 20px; margin:0'>
    "Three ML Algorithms Comparison"
</h2>


<div style='color: white; background: blue; padding: 20px; margin:0;font-size: 18px'>
    <b>We will cover:</b>
    <ol>
        <li>Data Discovery </li>
        <li>Handling missing values</li>
        <li>Exploratory Data Analysis (EDA)</li>
        <li>Feature Engineering</li>
        <li>Modeling</li>
    </ol>
</div>

In [None]:
# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# read the dataset
df = pd.read_csv('../input/salary-prediction-classification/salary.csv')
df.head()

<h2 style='text-align: center; background: gold; padding: 20px; border: 2px solid black'>
    1- Data Discovery
</h2>

In [None]:
emp_df = df.copy()

emp_df.shape

**Result:**<br />
There are **32561** observations and **15** features

In [None]:
# features:
emp_df.columns

In [None]:
# A high-level overview of the dataset
emp_df.info()

In [None]:
# some statistical values in numeric features
emp_df.describe()

**Result:** <br />
The data distributions of the 'capital-gain' and 'capital-loss' features look weird. We will handle them later.

In [None]:
# look at the features' data types
emp_df.dtypes

**Result:**<br />
There are no inappropriate data types.

In [None]:
# data types exploration
emp_df.select_dtypes(exclude=['object', 'category']).columns

**Result:**<br />
There are **6** features of type 'numeric' and **15-6 = 9** features of type 'object.'

---

Now, let's look into each (categorical) feature one by one. <br />
We will look for missing values, check which values each feature takes, and see how those values are distributed.<br />
**Note1**: We will not handle missing values in this section; we only detect them now and will handle them later.<br />
**Note2**: We will look into some more details in the Data Visualization part.
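
The column-by-column check described above can also be done in one pass. Here is a minimal sketch on a small toy frame (the `toy` data is hypothetical and only mimics the `' ?'` placeholder convention of this dataset); it counts the placeholder in every non-numeric column and then replaces it with NaN.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for emp_df (hypothetical values, same ' ?' placeholder)
toy = pd.DataFrame({
    'workclass': [' Private', ' ?', ' State-gov', ' Private'],
    'occupation': [' Sales', ' ?', ' Adm-clerical', ' Craft-repair'],
    'age': [39, 50, 38, 53],
})

# Pick every non-numeric (text) column
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# Count how often the ' ?' placeholder appears in each text column
placeholder_counts = (toy[text_cols] == ' ?').sum()
print(placeholder_counts)

# Replace the placeholder by NaN in all text columns at once
toy[text_cols] = toy[text_cols].replace(' ?', np.nan)
print(toy.isnull().sum())
```

Looking at `value_counts()` per feature, as we do below, is still useful because it also shows how the values are distributed.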


In [None]:
emp_df['workclass'].value_counts()
# emp_df['workclass'].unique()

In [None]:
# replace ' ?' by NaN (assign back instead of a chained inplace call)
emp_df['workclass'] = emp_df['workclass'].replace(' ?', np.nan)

In [None]:
emp_df['education'].value_counts()

In [None]:
emp_df['marital-status'].value_counts()

In [None]:
emp_df['occupation'].value_counts()

In [None]:
# replace ' ?' by NaN (assign back instead of a chained inplace call)
emp_df['occupation'] = emp_df['occupation'].replace(' ?', np.nan)

In [None]:
emp_df['relationship'].value_counts()

In [None]:
emp_df['race'].value_counts()

In [None]:
emp_df['sex'].value_counts()

In [None]:
emp_df['native-country'].unique()

In [None]:
# replace ' ?' by NaN (assign back instead of a chained inplace call)
emp_df['native-country'] = emp_df['native-country'].replace(' ?', np.nan)

In [None]:
emp_df['salary'].value_counts()

<h2 style='text-align: center; background: gold; padding: 20px; border: 2px solid black'>
    2- Handling missing values
</h2>

In [None]:
emp_df.isnull().sum()

In [None]:
# display the rows where 'workclass' is NaN
rows_with_workclass_na = emp_df[emp_df['workclass'].isnull()]
rows_with_workclass_na.head()

**Result:** <br />
From the table above, we can see that wherever the 'workclass' feature is missing, the 'occupation' feature is missing as well. Hence, the missingness of 'occupation' is tied to 'workclass' rather than being independent of the other features.<br />
So, here is how we deal with it:<br />
We will fill the 'workclass' NaN values with its mode (the most frequent value), and then we will fill the 'occupation' missing values with the occupation that has the highest frequency among the rows where 'workclass' equals that mode.
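
To make the idea concrete, here is a minimal sketch of this two-step fill on a tiny toy frame (the `toy` values are made up for illustration and are not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Toy frame: 'workclass' and 'occupation' are missing in the same row
toy = pd.DataFrame({
    'workclass': [' Private', np.nan, ' Private', ' State-gov', ' Private'],
    'occupation': [' Craft-repair', np.nan, ' Craft-repair', ' Adm-clerical', ' Sales'],
})

# Step 1: the most frequent workclass
workclass_mode = toy['workclass'].mode()[0]

# Step 2: the most frequent occupation among rows with that workclass
occupation_mode = toy.loc[toy['workclass'] == workclass_mode, 'occupation'].mode()[0]

# Fill both columns (assigning back avoids chained inplace calls)
toy['workclass'] = toy['workclass'].fillna(workclass_mode)
toy['occupation'] = toy['occupation'].fillna(occupation_mode)
print(toy)
```

Note that this is a simple heuristic; it pushes all missing rows into the largest category, which slightly inflates that category.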


In [None]:
emp_df['workclass'].mode()

In [None]:
# occupations where 'workclass' is ' Private'
temp = emp_df['occupation'][emp_df['workclass']==' Private']

temp.mode()

**Result:** <br />
We will fill 'workclass' missing values with ' Private' and 'occupation' missing values with ' Craft-repair' 

In [None]:
emp_df['workclass'] = emp_df['workclass'].fillna(emp_df['workclass'].mode()[0])
emp_df['workclass'].isnull().sum()

In [None]:
emp_df['occupation'] = emp_df['occupation'].fillna(' Craft-repair')
emp_df['occupation'].isnull().sum()

In [None]:
emp_df['native-country'] = emp_df['native-country'].fillna(emp_df['native-country'].mode()[0])
emp_df['native-country'].isnull().sum()

In [None]:
emp_df.isnull().sum().sum()

#### --------- No more missing values

<h2 style='text-align: center; background: gold; padding: 20px; border: 2px solid black'>
    3- Exploratory Data Analysis (EDA)
</h2>

First we will do some data visualizations, and then we will get some highly useful [tabular] info using `pd.crosstab()`. **Don't miss that part.**

In [None]:
# How the target variable is distributed (split by sex)
ax = sns.countplot(data=emp_df, x='salary', hue='sex')
ax.set_title('Salary Distribution')

In [None]:
def draw_boxen_plot(feature, hue=None):
    fig = plt.figure(figsize=(6, 5))
    ax = fig.gca()
    sns.boxenplot(data=emp_df, x='salary', y=feature, ax=ax, hue=hue)
    sns.set_style('whitegrid')
    ax.set_title('Salary VS '+feature.title())

In [None]:
# Salary VS Age
draw_boxen_plot('age', 'sex')

**Result:** <br />
Most employees below 35 have a lower income, while those aged 46-50 tend to have a higher income.

In [None]:
# hours-per-week VS Salary
draw_boxen_plot('hours-per-week')

**Result:** <br />
Employees who work over 40 hours per week tend to be paid more than those who work below 40 hours per week. Put differently, most employees with a >50K income work over 40 hours weekly.

In [None]:
# Sex VS Salary
fig = plt.figure(figsize=(10, 6))
ax = sns.countplot(data=emp_df,x='salary', hue='sex')
ax.set_title('Sex VS Salary')

**Result:** <br />
There are very few women having an income of greater than 50k in comparison with men.
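
Raw counts can be misleading when the groups have different sizes, so it helps to look at the share of each salary class within each sex. Here is a minimal sketch using `pd.crosstab` with `normalize='columns'` on a toy frame (the `toy` rows are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Toy data: 4 men (2 of them >50K) and 2 women (none of them >50K)
toy = pd.DataFrame({
    'sex': [' Male', ' Male', ' Male', ' Male', ' Female', ' Female'],
    'salary': [' >50K', ' <=50K', ' <=50K', ' >50K', ' <=50K', ' <=50K'],
})

# normalize='columns' makes every column (each sex) sum to 1,
# so the cells are within-group shares rather than raw counts
share = pd.crosstab(index=toy['salary'], columns=toy['sex'], normalize='columns')
print(share)
```

The same call on `emp_df` would show which fraction of women and which fraction of men fall into each salary class.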

In [None]:
fig = plt.figure(figsize=(10, 6))
ax = sns.countplot(data=emp_df,y='workclass', hue='salary')
ax.set_title('Workclass VS Salary')

**Result:**<br />
*Self-emp-inc* is the only workclass in which employees with a >50K income outnumber those with a <=50K income.


In [None]:
draw_boxen_plot('fnlwgt', 'sex')

**Result:** <br />
Many outliers are detected. In addition, it doesn't seem to be a significant feature. (We will consider removing it.)
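
The plot shows the outliers visually. If we want a number, one common convention is the 1.5 × IQR rule; the sketch below is only an illustration of that rule (the helper `count_iqr_outliers` is a hypothetical name, and the boxen plot itself uses a different, letter-value based definition).

```python
import pandas as pd

def count_iqr_outliers(values):
    """Count values lying more than 1.5 * IQR outside the quartiles."""
    s = pd.Series(values, dtype='float64')
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
    return int(mask.sum())

# Toy example: 100 is far away from the rest of the values
print(count_iqr_outliers([1, 2, 3, 4, 5, 100]))
```

On the real data this could be called as `count_iqr_outliers(emp_df['fnlwgt'])`.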

In [None]:
plt.figure(figsize=(15, 8))
ax = sns.countplot(data=emp_df,y='education', hue='salary')
ax.set_title('Education VS Salary')

**Result:**<br />
Most employees in the Masters, Doctorate, or Prof-school categories have a higher income (>50K).<br />
Most HS-grads have an income of <=50K.

In [None]:
plt.figure(figsize=(15, 8))
ax = sns.countplot(data=emp_df,y='race', hue='salary')
sns.set_palette('Accent_r')
ax.set_title('Race VS Salary')

In [None]:
# correlation between the numeric features only
emp_df.select_dtypes('number').corr()

In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(emp_df.select_dtypes('number').corr(), annot=True, cmap='autumn_r')

<h4 style='text-align: center; background: gold; padding: 10px; border: 2px solid black'>
    Now, let's do some EDA (Exploratory Data Analysis) and gain some useful info using pd.crosstab method.
</h4>

In [None]:
def crosstab_counts(feature, normalize=False):
    return pd.crosstab(
        index=emp_df[feature],
        columns='Counts(%)',
        normalize=normalize
    ).apply(lambda x: round(x*100, 4)).sort_values(by='Counts(%)', ascending=False)

In [None]:
crosstab_counts('salary', True).T

In [None]:
crosstab_counts('sex', True).T

In [None]:
# workclass counts
crosstab_counts('workclass', True).T

In [None]:
# education counts
crosstab_counts('education', True)

In [None]:
# marital-status counts
crosstab_counts('marital-status', True).T

In [None]:
# relationship counts
crosstab_counts('relationship', True).T

In [None]:
crosstab_counts('race', True).T

In [None]:
# Sex vs salary
pd.crosstab(
    index=emp_df['salary'],
    columns=emp_df['sex'],
    margins=True,
    normalize=True
)

In [None]:
# Race vs salary
pd.crosstab(
    index=emp_df['salary'],
    columns=emp_df['race'],
    margins=True,
    normalize=True
)

In [None]:
# Relationship vs salary
pd.crosstab(
    index=emp_df['salary'],
    columns=emp_df['relationship'],
    margins=True,
    normalize=True
)

In [None]:
pd.crosstab(
    index=emp_df['salary'],
    columns=emp_df['sex'],
    margins=True
)

In [None]:
emp_df.head(2)

<h2 style='text-align: center; background: gold; padding: 20px; border: 2px solid black'>
    4- Feature Engineering
</h2>

In this section, we will only look at **Mutual Information (MI)**.

<h4 style='text-align: center; background: gold; padding: 10px; border: 2px solid black'>
Mutual Information</h4>

Mutual Information (MI) is a measure that tells us how much each variable tells us about the target variable. The higher a feature's MI score, the more informative that feature is about the target; a score of 0 means the feature and the target are independent.<br />
**In other words: MI between two features is a measure of the extent to which knowledge of one feature reduces the uncertainty about the other.**
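
For two discrete variables, MI can be computed directly from the joint frequency table as the sum of p(x, y) · log(p(x, y) / (p(x) · p(y))). The sketch below implements this formula by hand to illustrate the definition (the helper `discrete_mutual_info` is a hypothetical name; scikit-learn uses a different, nearest-neighbor based estimator for continuous features).

```python
import numpy as np
import pandas as pd

def discrete_mutual_info(a, b):
    """MI (in nats) between two discrete sequences, from their joint frequencies."""
    # joint probability table p(x, y)
    joint = pd.crosstab(pd.Series(a), pd.Series(b), normalize=True).to_numpy()
    # marginals p(x) (column vector) and p(y) (row vector)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    expected = pa @ pb          # p(x) * p(y) for every cell
    nonzero = joint > 0         # cells with p(x, y) = 0 contribute nothing
    return float((joint[nonzero] * np.log(joint[nonzero] / expected[nonzero])).sum())

# b is fully determined by a: MI equals ln(2), about 0.693
print(discrete_mutual_info([0, 0, 1, 1], [1, 1, 0, 0]))
# a and b are independent: MI is 0
print(discrete_mutual_info([0, 0, 1, 1], [0, 1, 0, 1]))
```

In the first case, knowing one variable removes all uncertainty about the other; in the second case, it removes none.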

In [None]:
emp_df.columns

In [None]:
# convert >50k to 1 and <=50k to 0
emp_df['salary'] = emp_df['salary'].map({' >50K': 1, ' <=50K': 0})

In [None]:
X = emp_df.iloc[:, :-1].copy()  # copy, so later changes to X do not touch emp_df
y = emp_df['salary']

In [None]:
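from sklearn.feature_selection import mutual_info_classif

def get_mi_score(X, y):
    X1 = X.copy()
    # make sure that all categorical (object) features are converted to integer codes
    cat_cols = X1.select_dtypes('object').columns
    for i in cat_cols:
        X1[i], _ = X1[i].factorize()
    # the target is a class label (0/1), so we use the classification variant of MI
    # and tell it which columns hold discrete codes
    mi_score = mutual_info_classif(
        X1, y, discrete_features=X1.columns.isin(cat_cols), random_state=0
    )
    return pd.Series(mi_score, name='MI Score', index=X1.columns).sort_values(ascending=False)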

In [None]:
mi_score = get_mi_score(X, y)
mi_score

In [None]:
# plot mi scores
mi_score = pd.DataFrame(mi_score).sort_values(by='MI Score')
fig = plt.figure(figsize=(8, 6))
ax = fig.gca()
mi_score.plot.barh(ax=ax)
ax.set(title='MI Scores')

In [None]:
# remove the two features with the lowest MI scores (almost 0)
fs = ['native-country', 'fnlwgt']
X = X.drop(columns=fs)
del fs  # delete the fs variable

In [None]:
# convert categories into numbers
for i in X.select_dtypes('object').columns:
    X[i], _ = X[i].factorize()
X.head()

In [None]:
X.dtypes

<h2 style='text-align: center; background: gold; padding: 20px; border: 2px solid black'>
    5- Modeling
</h2>

<div style=' background: gold; padding: 20px; border: 2px solid black'>
    Here we will compare three ML classification algorithms. We will see how to implement each of them, and how they perform in comparison with each other.<br />
<ol>
    <li>Logistic Regression</li>
    <li>Random Forest</li>
    <li>K-Nearest Neighbors</li>
</ol>
</div>
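
Before going through the models one by one, here is a minimal sketch of what such a comparison looks like in a single loop. It runs on synthetic data from `make_classification` (an assumption made to keep the example self-contained), so its scores say nothing about the salary dataset; the names `X_demo`, `y_demo`, `models` and `results` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (stand-in for X and y)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=1/3, random_state=10)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=0),
    'KNN': KNeighborsClassifier(n_neighbors=9),
}

# Fit every model on the same split and record its test accuracy
results = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    results[name] = accuracy_score(yte, model.predict(Xte))

for name, acc in results.items():
    print(f'{name}: {acc * 100:.2f}')
```

Using the same train/test split for all models keeps the comparison fair.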

In [None]:
# split the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=10)

<h2 style='text-align: center; background: gold; padding: 10px; border: 2px solid black'>
    5.1- Logistic Regression
</h2>

In [None]:
# fit the model
from sklearn.linear_model import LogisticRegression
logR = LogisticRegression(random_state=0, max_iter=X.shape[0])
logR.fit(X_train, y_train)

In [None]:
# predict
y_pred = logR.predict(X_test)
y_pred

In [None]:
# confusion matrix (used to calculate the accuracy)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
# percentage
accuracy = (cm[0,0] + cm[1,1]) / cm.sum()
print("Accuracy: ", accuracy * 100)

<h2 style='text-align: center; background: gold; padding: 10px; border: 2px solid black'>
    5.2- Random Forest
</h2>

In [None]:
# fit the model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

In [None]:
# predict
y_pred = rf.predict(X_test)
y_pred

In [None]:
# confusion matrix (used to calculate the accuracy)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
# percentage
accuracy = (cm[0,0] + cm[1,1]) / cm.sum()
print("Accuracy: ", accuracy * 100)

<h2 style='text-align: center; background: gold; padding: 10px; border: 2px solid black'>
    5.3- K-Nearest Neighbors (KNN)
</h2>

In [None]:
# Let's first find the best value for K. In this case for 'n_neighbors' argument
from sklearn.neighbors import KNeighborsClassifier

scores = []
for i in range(1, 10, 2):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    score = knn.score(X_test, y_test)
    scores.append(score)
scores = pd.DataFrame(pd.Series(scores, index=[1,3,5,7,9], name='scores'))\
    .sort_values(by='scores', ascending=False)
scores

From the table above, k=9 has the highest score. Hence, it is the best choice for K.
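
Here K is picked by its score on the test set, which means the test set also takes part in tuning the model. A common alternative is to choose K with cross-validation on the training data only. The sketch below shows that idea on synthetic data (an assumption for the sake of a self-contained example; `X_demo`, `y_demo`, `cv_scores` and `best_k` are illustrative names), so its result does not carry over to the salary dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for X_train and y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Mean 5-fold cross-validated accuracy for each candidate K
cv_scores = {}
for k in range(1, 10, 2):
    knn_demo = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn_demo, X_demo, y_demo, cv=5).mean()

# The K with the highest mean score
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print('best k:', best_k)
```

With this approach the test set is used only once, for the final accuracy.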

In [None]:
# fit the model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)

In [None]:
# predict
y_pred = knn.predict(X_test)

In [None]:
# confusion matrix (used to calculate the accuracy)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
# percentage
accuracy = (cm[0,0] + cm[1,1]) / cm.sum()
print("Accuracy: ", accuracy * 100)

<h2 style='text-align: center; background: gold; padding: 10px; border: 2px solid black'>
    Comparison
</h2>

<div style='background: gold; padding: 10px; border: 2px solid black'>
    Logistic Regression:  <b>82.4488</b><br />
    Random Forest:  <b>85.1575</b><br />
    KNN:  <b>83.8216</b>
</div>

<h4 style='text-align: center;background: gold; padding: 10px; border: 2px solid black'>
    Please check my other works at <a href="https://kaggle.com/hikmatullahmohammadi" target='_blank'>@hikmatullahmohammadi</a>
</h4>

<h1 style='text-align: center;background: gold; padding: 20px; border: 2px solid black'>
    Regards :)
</h1>