## Setting the stage

We import all necessary packages.

In [197]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

from sklearn import linear_model
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor,RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, precision_score, recall_score

## Reading the data

We read the CSV saved from our previous notebook. pandas parses the continent code 'NA' (North America) as a missing value (NaN), so we restore it by replacing every NaN with the string 'NA'.

In [198]:
df = pd.read_csv('./ai-ml-salaries-clean.csv')
df.fillna('NA',inplace=True)
df.head()

Unnamed: 0,work_year,experience_level,employment_type,salary_in_usd,remote_ratio,company_size,employee_continent,company_continent,is_colocated
0,2022,2,FT,130000,NoRemote,M,,,1
1,2022,2,FT,90000,NoRemote,M,,,1
2,2022,2,FT,120000,FullRemote,M,,,1
3,2022,2,FT,100000,FullRemote,M,,,1
4,2022,2,FT,85000,FullRemote,M,,,1
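
The conversion happens because 'NA' is one of the strings `pd.read_csv` treats as missing by default. As an alternative to `fillna`, the following minimal sketch reads a small made-up CSV (not the real file) with `keep_default_na=False` and lists only the empty string in `na_values`, so 'NA' survives as text while truly empty fields still become NaN. The rows and the two column names used here are illustrative assumptions.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the real file is ai-ml-salaries-clean.csv.
csv_text = "salary_in_usd,employee_continent\n130000,NA\n90000,EU\n85000,\n"

# Default parsing: the string 'NA' is read as a missing value.
default_df = pd.read_csv(io.StringIO(csv_text))

# Keep 'NA' as text; only an empty field counts as missing.
strict_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(default_df["employee_continent"].isna().sum())
print(strict_df["employee_continent"].tolist())
```

Unlike a blanket `fillna('NA')`, this approach would not overwrite genuinely missing values in other columns, should any exist.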


Since none of our features is continuous, a regression on the target variable, salary_in_usd, is unlikely to fit well. Instead, we can build a classification model that predicts whether or not an employee earns more than $120k (roughly the overall median).

We can check that this threshold gives a roughly 50-50 split of the data, so accuracy should be a reasonable measure of the model's success.

In [199]:
print(df[(df['salary_in_usd']<120000)]['salary_in_usd'].count())
print(df[(df['salary_in_usd']>=120000)]['salary_in_usd'].count())

654
678
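
The counts above correspond to 678 of 1332 rows (about 51%) at or above the threshold. The same check can be written as a fraction in one step; the sketch below does so on a made-up series, with the median standing in for the 120000 threshold. The salary values are illustrative and not taken from the data set.

```python
import pandas as pd

# Made-up salaries for illustration only; not taken from the data set.
salaries = pd.Series([60000, 85000, 100000, 120000, 130000, 150000, 200000])

threshold = salaries.median()                 # 50% quantile of the toy series
high_share = (salaries >= threshold).mean()   # fraction at or above the threshold

print(threshold, round(high_share, 3))
```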


Before we commit to this classification model, let us also check whether the two classes can be separated by employee continent alone, given that a very large share of the data comes from North America.

They cannot: North America contains many employees on both sides of the threshold, although most high earners are located there.

In [200]:
print(df[(df['salary_in_usd']<120000) & (df['employee_continent']=='NA')]['salary_in_usd'].count())
print(df[(df['salary_in_usd']>=120000) & (df['employee_continent']=='NA')]['salary_in_usd'].count())

309
650
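
The two filtered counts above are one row of a contingency table. A possible way to see every continent at once is `pd.crosstab`, sketched here on toy rows; in the notebook, the columns of `df` would be passed instead. The values and the `salary_band` name are made up for illustration.

```python
import pandas as pd

# Toy rows for illustration; values are invented, column names mirror the notebook.
toy = pd.DataFrame({
    "employee_continent": ["NA", "NA", "NA", "EU", "EU", "AS"],
    "salary_in_usd": [130000, 90000, 150000, 70000, 125000, 60000],
})

# Label each row as 'high' or 'low' relative to the 120000 threshold.
band = (toy["salary_in_usd"] >= 120000).map({True: "high", False: "low"})
band = band.rename("salary_band")

# Rows: continents; columns: salary band; cells: number of employees.
table = pd.crosstab(toy["employee_continent"], band)
print(table)
```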


In [201]:
# df['salary_in_usd'] = MinMaxScaler().fit_transform(df[['salary_in_usd']])

We create our feature set by dropping the target variable.

In [202]:
x = df.drop(['salary_in_usd'],axis=1).copy(deep=True)
x.head()

Unnamed: 0,work_year,experience_level,employment_type,remote_ratio,company_size,employee_continent,company_continent,is_colocated
0,2022,2,FT,NoRemote,M,,,1
1,2022,2,FT,NoRemote,M,,,1
2,2022,2,FT,FullRemote,M,,,1
3,2022,2,FT,FullRemote,M,,,1
4,2022,2,FT,FullRemote,M,,,1


We create our labels.

In [203]:
y=df['salary_in_usd'].copy()
y.head()

0    130000
1     90000
2    120000
3    100000
4     85000
Name: salary_in_usd, dtype: int64

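We binarize the labels using the threshold chosen earlier, since we will be doing classification. Note that the comparison here is strict, so a salary of exactly 120000 is labelled 0, whereas the counts above used >=.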

In [204]:
y = (y > 120000).astype(int)  # 1 if salary is above 120000, else 0
y

0       1
1       0
2       0
3       0
4       0
       ..
1327    1
1328    1
1329    0
1330    0
1331    0
Name: salary_in_usd, Length: 1332, dtype: int64

Since the continent data is not ordinal, we need a way to feed it into our algorithm. In principle, tree-based models such as random forests or boosted trees can handle string-valued categories directly, but the scikit-learn implementations require categorical variables to be encoded as numbers.

We will use one-hot encoding for this feature.

In [205]:
# new_dummy = pd.get_dummies(x.employee_continent).copy()
# new_dummy.columns=['AF_e','AS_e','EU_e','NA_e','OC_e','SA_e']
# x = pd.concat((x,new_dummy),axis=1)
# x.drop(['employee_continent'],axis=1,inplace=True)
# x.head()

We drop the employee continent for now to keep the model simple; we might add it back later.

In [206]:
x.drop(['employee_continent'],axis=1,inplace=True)

As we encode the categorical variables using one-hot encoding, we drop the original columns containing text data.

In [207]:
to_encode=['employment_type','company_size','remote_ratio','company_continent']
for col in to_encode:
    # axis must be passed by keyword in recent pandas; dtype=int keeps 0/1 columns
    x = pd.concat((x, pd.get_dummies(x[col], dtype=int)), axis=1)
    x.drop([col],axis=1,inplace=True)
x.head()

Unnamed: 0,work_year,experience_level,is_colocated,CT,FL,FT,PT,L,M,S,FullRemote,HalfRemote,NoRemote,AF,AS,EU,NA,OC,SA
0,2022,2,1,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0
1,2022,2,1,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0
2,2022,2,1,0,0,1,0,0,1,0,1,0,0,0,0,0,1,0,0
3,2022,2,1,0,0,1,0,0,1,0,1,0,0,0,0,0,1,0,0
4,2022,2,1,0,0,1,0,0,1,0,1,0,0,0,0,0,1,0,0


We now split this data set into training data (60%), validation data (20%), and a final hold-out test set (20%).

In [208]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.4, random_state=79)
x_val, x_test, y_val, y_test = train_test_split(x_test,y_test,test_size=0.5, random_state=93)
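
To make the resulting proportions concrete, the sketch below repeats the two-step split on stand-in arrays with the same number of rows as our data (1332) and prints the size of each part. The `_demo` arrays and shortened variable names are placeholders, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the notebook's data (1332).
x_demo = np.arange(1332).reshape(-1, 1)
y_demo = np.arange(1332) % 2

# First split: 60% train, 40% held back.
x_tr, x_rest, y_tr, y_rest = train_test_split(
    x_demo, y_demo, test_size=0.4, random_state=79)
# Second split: the held-back 40% is halved into validation and test.
x_va, x_te, y_va, y_te = train_test_split(
    x_rest, y_rest, test_size=0.5, random_state=93)

print(len(x_tr), len(x_va), len(x_te))
```

`train_test_split` also accepts a `stratify` argument that would keep the class ratio equal across the parts; the split above does not use it.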

In [209]:
#model= MLPRegressor(learning_rate='constant',activation='relu',hidden_layer_sizes=(10,))
# model=LogisticRegression()
model = RandomForestClassifier()

params={
    'n_estimators':[5,50,100],
    'max_depth':[2,10,20,None]
}

# params={
#     'C':[0.01,0.1,1,10]
# }

model = GridSearchCV(model,params,cv=5)
model.fit(x_train,y_train)
model.best_estimator_

RandomForestClassifier(max_depth=2)
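
The printed estimator lists only non-default parameters, so the selected `n_estimators` is the default of 100. To see how the other grid points scored, the fitted search exposes `best_params_`, `best_score_`, and `cv_results_`. The sketch below illustrates this on synthetic data from `make_classification` with a smaller grid; in the notebook, the already fitted `model` could be inspected the same way. The data and the fixed `random_state` are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook would use x_train and y_train instead.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),   # fixed seed so reruns agree
    {"n_estimators": [5, 50], "max_depth": [2, None]},
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per parameter combination, with its mean cross-validated accuracy.
results = pd.DataFrame(search.cv_results_)
print(search.best_params_, round(search.best_score_, 3))
print(results[["param_n_estimators", "param_max_depth",
               "mean_test_score", "rank_test_score"]])
```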

In [210]:
y_pred = model.predict(x_val)
accuracy_score(y_val,y_pred) , precision_score(y_val,y_pred) , recall_score(y_val,y_pred) 

(0.7781954887218046, 0.7581699346405228, 0.8405797101449275)

In [211]:
y_pred = model.predict(x_test)
accuracy_score(y_test,y_pred), precision_score(y_test,y_pred) , recall_score(y_test,y_pred)

(0.7865168539325843, 0.7423312883435583, 0.8897058823529411)

Our model's accuracy (about 79% on the test set) is not very good; it might be improved with better feature engineering, other types of models, or a wider hyperparameter grid. However, it is worth noting that the model has high recall (about 0.89 on the test set).

This indicates that the model produces few false negatives: only about 11% of the employees earning more than 120k were classified as earning less. The lower precision (about 0.74) shows that this comes at the cost of more false positives.
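
To show how the three test metrics relate, the sketch below recomputes them from confusion-matrix counts. The counts are not printed anywhere in the notebook; they are inferred from the reported test scores under the assumption that the test set has 267 rows (half of the 40% held back from 1332), and they are the integers that reproduce those scores.

```python
# Counts inferred from the reported test metrics, assuming 267 test rows.
# They are a reconstruction, not output read from the notebook.
tp, fn, fp, tn = 121, 15, 42, 89

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted high earners that are right
recall = tp / (tp + fn)                      # share of actual high earners that are found
miss_rate = fn / (tp + fn)                   # share of actual high earners predicted as low

print(accuracy, precision, recall, round(miss_rate, 3))
```

Calling `sklearn.metrics.confusion_matrix(y_test, y_pred)` in the notebook would give the actual counts directly.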