# Predicting Job Satisfaction With Kaggle Survey Data

### Contributors:

### **Mert Pekey - Ali Yıldırım - Zeynep Berksöz**

The aim of this work is to predict the job satisfaction of people who work in the data science field.

Dataset: Modified version of a Kaggle Survey Dataset

Label: Job Satisfaction in the range of 1 to 10

### Mounting Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import StackingRegressor
from sklearn.model_selection import RepeatedKFold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import mutual_info_regression
pd.set_option('display.max_columns', None)

### A Quick Look at the Data

In [None]:
train = pd.read_excel('/content/drive/MyDrive/ContentCreation/train.xlsx')
test = pd.read_excel('/content/drive/MyDrive/ContentCreation/test.xlsx')
explanation = pd.read_csv('/content/drive/MyDrive/ContentCreation/dataset_explanation.csv',delimiter = ';')

In [None]:
print('training shape is', train.shape)

training shape is (5529, 54)


In [None]:
train.head(3)

Unnamed: 0,ID,GenderSelect,Country,Age,EmploymentStatus,CodeWriter,CurrentJobTitleSelect,TitleFit,CurrentEmployerType,MLToolNextYearSelect,MLMethodNextYearSelect,LanguageRecommendationSelect,LearningPlatformUsefulnessBlogs,LearningPlatformUsefulnessKaggle,LearningPlatformUsefulnessCourses,LearningPlatformUsefulnessProjects,LearningPlatformUsefulnessSO,LearningPlatformUsefulnessTextbook,LearningPlatformUsefulnessYouTube,DataScienceIdentitySelect,FormalEducation,MajorSelect,Tenure,PastJobTitlesSelect,MLSkillsSelect,MLTechniquesSelect,EmployerIndustry,EmployerSize,WorkProductionFrequency,WorkAlgorithmsSelect,WorkToolsFrequencyPython,WorkToolsFrequencyR,WorkToolsFrequencySQL,WorkMethodsFrequencyCross-Validation,WorkMethodsFrequencyDataVisualization,WorkMethodsFrequencyDecisionTrees,WorkMethodsFrequencyLogisticRegression,WorkMethodsFrequencyNeuralNetworks,WorkMethodsFrequencyPCA,WorkMethodsFrequencyRandomForests,WorkMethodsFrequencyTimeSeriesAnalysis,WorkChallengeFrequencyPolitics,WorkChallengeFrequencyUnusedResults,WorkChallengeFrequencyDirtyData,WorkChallengeFrequencyExplaining,WorkChallengeFrequencyTalent,WorkChallengeFrequencyClarity,WorkChallengeFrequencyDataAccess,CompensationScore,WorkDataVisualizations,WorkInternalVsExternalTools,WorkMLTeamSeatSelect,RemoteWork,JobSatisfaction
0,1,Male,Pakistan,28.0,"Independent contractor, freelancer, or self-em...",Yes,Software Developer/Software Engineer,Fine,Self-employed,Python,Link Analysis,SAS,,,Very useful,,,,Not Useful,Sort of (Explain more),Bachelor's degree,"Information technology, networking, or system ...",3 to 5 years,"Programmer,Software Developer/Software Engineer",Survival Analysis,"Hidden Markov Models HMMs,Logistic Regression,...",Technology,,Always,Random Forests,Often,,Often,,,,,,,Often,,,,Often,,,,,8.0,51-75% of projects,Approximately half internal and half external,Standalone Team,,4
1,2,Male,Mexico,26.0,Employed full-time,Yes,Computer Scientist,Poorly,Employed by a company that doesn't perform adv...,Python,Deep learning,Python,Somewhat useful,Very useful,,,,Very useful,,No,Master's degree,Computer Science,1 to 2 years,"Computer Scientist,Programmer,Researcher","Natural Language Processing,Supervised Machine...","Bayesian Techniques,Support Vector Machines (S...",Government,"1,000 to 4,999 employees",Rarely,"Bayesian Techniques,SVMs",Sometimes,Often,,Sometimes,Most of the time,,,,,,Rarely,,,,,Often,,,,100% of projects,More internal than external,Business Department,,7


## - Preprocessing

### Missing Values Imputed as "na"

NaN values are replaced with the placeholder string "na" so that they can be handled later, once the data has been analyzed and a strategy has been chosen for each column.

**Note:** The analysis that guided the imputation of missing values is not included in this notebook.

In [None]:
train = train.fillna('na')
test = test.fillna('na')

In [None]:
# mycArray =['India','United States','Sweden',
#            'Japan','China','Israel','Taiwan','Mexico','Canada','Switzerland','Pakistan','Colombia']
# for mycountry in mycArray:
#   train[mycountry] = train['Country'].apply(lambda text: 1 if text==mycountry else 0)
#   train[mycountry] = train[mycountry].astype('float64')
#   test[mycountry] = test['Country'].apply(lambda text: 1 if text==mycountry else 0)
#   test[mycountry] = test[mycountry].astype('float64')

### Removing Outliers From Age Column

In [None]:
train['Age'] = train['Age'].replace('na',train[train['Age']!='na']['Age'].mean())
train = train[(train['Age']<65) & (train['Age'] >18)]
test['Age'] = test['Age'].replace('na',test[test['Age']!='na']['Age'].mean())

### Handling Ordinal Features

In [None]:
freq_map = {'na': 0, 'Rarely' : 1, 'Sometimes': 2, 'Often': 3,'Most of the time': 4}
train.iloc[:,30:48] = train.iloc[:,30:48].replace(freq_map)
train.iloc[:,30:48] = train.iloc[:,30:48].astype('float64')
# The same freq_map is applied to the test set
test.iloc[:,30:48] = test.iloc[:,30:48].replace(freq_map)
test.iloc[:,30:48] = test.iloc[:,30:48].astype('float64')


freq_map2 = {'na': 0, 'Not Useful' : 1, 'Somewhat useful': 2,'Very useful': 3}
train.iloc[:,12:19] = train.iloc[:,12:19].replace(freq_map2)
train.iloc[:,12:19] = train.iloc[:,12:19].astype('float64')
test.iloc[:,12:19] = test.iloc[:,12:19].replace(freq_map2)
test.iloc[:,12:19] = test.iloc[:,12:19].astype('float64')


freq_map3 = {'na': 0, 'None' : 0, 'Less than 10% of projects': 1,
            '10-25% of projects': 2,'26-50% of projects': 3,'51-75% of projects': 4,'76-99% of projects': 5,'100% of projects': 6}
train['WorkDataVisualizations'] = train['WorkDataVisualizations'].replace(freq_map3)
train['WorkDataVisualizations'] = train['WorkDataVisualizations'].astype('float64')
test['WorkDataVisualizations'] = test['WorkDataVisualizations'].replace(freq_map3)
test['WorkDataVisualizations'] = test['WorkDataVisualizations'].astype('float64')


freq_map4 = {'na': 0, 'Poorly' : 1, 'Fine': 2,'Perfectly': 3}
train['TitleFit'] = train['TitleFit'].replace(freq_map4)
train['TitleFit'] = train['TitleFit'].astype('float64')
test['TitleFit'] = test['TitleFit'].replace(freq_map4)
test['TitleFit'] = test['TitleFit'].astype('float64')


freq_map5 = {'na': 0, "I don't write code to analyze data" : 0, 'Less than a year': 1,
            '1 to 2 years': 2,'3 to 5 years': 3,'6 to 10 years': 4,'More than 10 years': 5}
train['Tenure'] = train['Tenure'].replace(freq_map5)
train['Tenure'] = train['Tenure'].astype('float64')
test['Tenure'] = test['Tenure'].replace(freq_map5)
test['Tenure'] = test['Tenure'].astype('float64')


freq_map6 = {'na': 0, "I don't know" : 0, 'I prefer not to answer': 0,
            'Fewer than 10 employees': 1,'10 to 19 employees': 1,'20 to 99 employees': 1,'100 to 499 employees':2 ,'500 to 999 employees': 3,
            '1,000 to 4,999 employees':4,'5,000 to 9,999 employees':5,'10,000 or more employees':5}
train['EmployerSize'] = train['EmployerSize'].replace(freq_map6)
train['EmployerSize'] = train['EmployerSize'].astype('float64')
test['EmployerSize'] = test['EmployerSize'].replace(freq_map6)
test['EmployerSize'] = test['EmployerSize'].astype('float64')


freq_map7 = {'na': 0, "Don't know" : 0, 'Never': 1,
            'Rarely': 2,'Sometimes': 3,'Most of the time': 4,'Always': 5}
train[['WorkProductionFrequency','RemoteWork']] = train[['WorkProductionFrequency','RemoteWork']].replace(freq_map7)
train[['WorkProductionFrequency','RemoteWork']] = train[['WorkProductionFrequency','RemoteWork']].astype('float64')
test[['WorkProductionFrequency','RemoteWork']] = test[['WorkProductionFrequency','RemoteWork']].replace(freq_map7)
test[['WorkProductionFrequency','RemoteWork']] = test[['WorkProductionFrequency','RemoteWork']].astype('float64')
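The ordinal encoding above repeats the same two statements for every column and for both data sets. As a minimal sketch, assuming a small helper is acceptable (`encode_ordinal` and the toy frames are hypothetical and not part of the original notebook), the same idea can be written once. Note that this sketch uses `Series.map`, which turns labels missing from the mapping into NaN, whereas `replace` in the cell above leaves them unchanged.

```python
import pandas as pd

# Hypothetical helper (not part of the original notebook): encode several
# ordinal columns of one or more frames with a single mapping.
def encode_ordinal(frames, columns, mapping):
    for frame in frames:
        for col in columns:
            # Series.map returns NaN for labels that are missing from the
            # mapping, which makes unexpected categories easy to spot.
            frame[col] = frame[col].map(mapping).astype('float64')

freq_map4 = {'na': 0, 'Poorly': 1, 'Fine': 2, 'Perfectly': 3}

# Toy frames standing in for train and test.
toy_train = pd.DataFrame({'TitleFit': ['Fine', 'na', 'Perfectly']})
toy_test = pd.DataFrame({'TitleFit': ['Poorly', 'Fine']})

encode_ordinal([toy_train, toy_test], ['TitleFit'], freq_map4)
print(toy_train['TitleFit'].tolist())
print(toy_test['TitleFit'].tolist())
```

Passing both frames in one call keeps the train and test encodings from drifting apart when a mapping is edited.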

### Binning Values of Some Features (This Part Was Added After Analyzing the Data)

In [None]:
train['WorkInternalVsExternalTools'] = train['WorkInternalVsExternalTools'].replace('na','Do not know')
test['WorkInternalVsExternalTools'] = test['WorkInternalVsExternalTools'].replace('na','Do not know')


train['WorkMLTeamSeatSelect'] = train['WorkMLTeamSeatSelect'].replace('na','Other')
test['WorkMLTeamSeatSelect'] = test['WorkMLTeamSeatSelect'].replace('na','Other')


train['EmployerIndustry'] = train['EmployerIndustry'].replace('na','Other')
test['EmployerIndustry'] = test['EmployerIndustry'].replace('na','Other')


freq_map8 = {'na': 'Other', 'A different identity' : 'Other', 'Non-binary, genderqueer, or gender non-conforming': 'Other'}
train['GenderSelect'] = train['GenderSelect'].replace(freq_map8)
test['GenderSelect'] = test['GenderSelect'].replace(freq_map8)


train['PastJobTitlesSelect'] = train['PastJobTitlesSelect'].replace('na',"I haven't started working yet")
test['PastJobTitlesSelect'] = test['PastJobTitlesSelect'].replace('na',"I haven't started working yet")

mydictML = dict(train['MLToolNextYearSelect'].value_counts())
for word, count in mydictML.items():
    if word == 'na':
         train['MLToolNextYearSelect'] = train['MLToolNextYearSelect'].replace('na',"I don't plan on learning a new tool/technology")
    elif word != 'TensorFlow' and word != 'Python' and word != 'Spark / MLlib' and word != 'R' and word != 'na' and word != 'Other' and word != "I don't plan on learning a new tool/technology":
         train['MLToolNextYearSelect'] = train['MLToolNextYearSelect'].replace(word, 'Other')           
mydictML = dict(test['MLToolNextYearSelect'].value_counts())
for word, count in mydictML.items():
    if word == 'na':
         test['MLToolNextYearSelect'] = test['MLToolNextYearSelect'].replace('na',"I don't plan on learning a new tool/technology")
    elif word != 'TensorFlow' and word != 'Python' and word != 'Spark / MLlib' and word != 'R' and word != 'na' and word != 'Other' and word != "I don't plan on learning a new tool/technology":
         test['MLToolNextYearSelect'] = test['MLToolNextYearSelect'].replace(word, 'Other')




mydictML = dict(train['MLMethodNextYearSelect'].value_counts())
for word, count in mydictML.items():
    if word == 'na':
         train['MLMethodNextYearSelect'] = train['MLMethodNextYearSelect'].replace('na',"I don't plan on learning a new ML/DS method")
    elif word != 'Deep learning' and word != 'Neural Nets' and word != 'Time Series Analysis' and word != 'Bayesian Methods' and word != 'na' and word != 'Other' and word != "Text Mining" and word != "Genetic & Evolutionary Algorithms" and word != "I don't plan on learning a new ML/DS method":
         train['MLMethodNextYearSelect'] = train['MLMethodNextYearSelect'].replace(word, 'Other')           
mydictML = dict(test['MLMethodNextYearSelect'].value_counts())
for word, count in mydictML.items():
    if word == 'na':
         test['MLMethodNextYearSelect'] = test['MLMethodNextYearSelect'].replace('na',"I don't plan on learning a new ML/DS method")
    elif word != 'Deep learning' and word != 'Neural Nets' and word != 'Time Series Analysis' and word != 'Bayesian Methods' and word != 'na' and word != 'Other' and word != "Text Mining" and word != "Genetic & Evolutionary Algorithms" and word != "I don't plan on learning a new ML/DS method":
         test['MLMethodNextYearSelect'] = test['MLMethodNextYearSelect'].replace(word, 'Other')



mydictML = dict(train['LanguageRecommendationSelect'].value_counts())
for word, count in mydictML.items():
    if word == 'na':
         train['LanguageRecommendationSelect'] = train['LanguageRecommendationSelect'].replace('na',"Other")
    elif word != 'Python' and word != 'R' and word != 'SQL' and word != 'na' and word != 'Other':
         train['LanguageRecommendationSelect'] = train['LanguageRecommendationSelect'].replace(word, 'Other')
mydictML = dict(test['LanguageRecommendationSelect'].value_counts())
for word, count in mydictML.items():
    if word == 'na':
         test['LanguageRecommendationSelect'] = test['LanguageRecommendationSelect'].replace('na',"Other")
    elif word != 'Python' and word != 'R' and word != 'SQL' and word != 'na' and word != 'Other':
         test['LanguageRecommendationSelect'] = test['LanguageRecommendationSelect'].replace(word, 'Other')



mydictML = dict(train['FormalEducation'].value_counts())
for word, count in mydictML.items():
    if word == 'I did not complete any formal education past high school' or word == 'I prefer not to answer' or word == 'na':
         train['FormalEducation'] = train['FormalEducation'].replace(word, 'Other')      
mydictML = dict(test['FormalEducation'].value_counts())
for word, count in mydictML.items():
    if word == 'I did not complete any formal education past high school' or word == 'I prefer not to answer' or word == 'na':
         test['FormalEducation'] = test['FormalEducation'].replace(word, 'Other')
            
train['MajorSelect'] = train['MajorSelect'].replace('na', 'Other')
train['MajorSelect'] = train['MajorSelect'].replace('Psychology', 'A social science')
train['MajorSelect'] = train['MajorSelect'].replace('A humanities discipline', 'A social science')
train['MajorSelect'] = train['MajorSelect'].replace('Fine arts or performing arts', 'A social science')
train['MajorSelect'] = train['MajorSelect'].replace('Management information systems', 'Information technology, networking, or system administration')
train['MajorSelect'] = train['MajorSelect'].replace('A health science', 'Basic Science')
train['MajorSelect'] = train['MajorSelect'].replace('Physics', 'Basic Science')
train['MajorSelect'] = train['MajorSelect'].replace('Biology', 'Basic Science')
train['MajorSelect'] = train['MajorSelect'].replace('I never declared a major', 'Other')

test['MajorSelect'] = test['MajorSelect'].replace('na', 'Other')
test['MajorSelect'] = test['MajorSelect'].replace('Psychology', 'A social science')
test['MajorSelect'] = test['MajorSelect'].replace('A humanities discipline', 'A social science')
test['MajorSelect'] = test['MajorSelect'].replace('Fine arts or performing arts', 'A social science')
test['MajorSelect'] = test['MajorSelect'].replace('Management information systems', 'Information technology, networking, or system administration')
test['MajorSelect'] = test['MajorSelect'].replace('A health science', 'Basic Science')
test['MajorSelect'] = test['MajorSelect'].replace('Physics', 'Basic Science')
test['MajorSelect'] = test['MajorSelect'].replace('Biology', 'Basic Science')
test['MajorSelect'] = test['MajorSelect'].replace('I never declared a major', 'Other')




train['CurrentJobTitleSelect'] = train['CurrentJobTitleSelect'].replace('na', 'Other')
train['CurrentJobTitleSelect'] = train['CurrentJobTitleSelect'].replace('Researcher', 'Scientist/Researcher')
train['CurrentJobTitleSelect'] = train['CurrentJobTitleSelect'].replace('Engineer', 'Other')
train['CurrentJobTitleSelect'] = train['CurrentJobTitleSelect'].replace('Statistician', 'Other')
train['CurrentJobTitleSelect'] = train['CurrentJobTitleSelect'].replace('Computer Scientist', 'Other')
train['CurrentJobTitleSelect'] = train['CurrentJobTitleSelect'].replace('Programmer', 'Other')
train['CurrentJobTitleSelect'] = train['CurrentJobTitleSelect'].replace('Predictive Modeler', 'Other')
train['CurrentJobTitleSelect'] = train['CurrentJobTitleSelect'].replace('DBA/Database Engineer', 'Other')
train['CurrentJobTitleSelect'] = train['CurrentJobTitleSelect'].replace('Operations Research Practitioner', 'Other')
train['CurrentJobTitleSelect'] = train['CurrentJobTitleSelect'].replace('Data Miner', 'Other')

test['CurrentJobTitleSelect'] = test['CurrentJobTitleSelect'].replace('na', 'Other')
test['CurrentJobTitleSelect'] = test['CurrentJobTitleSelect'].replace('Researcher', 'Scientist/Researcher')
test['CurrentJobTitleSelect'] = test['CurrentJobTitleSelect'].replace('Engineer', 'Other')
test['CurrentJobTitleSelect'] = test['CurrentJobTitleSelect'].replace('Statistician', 'Other')
test['CurrentJobTitleSelect'] = test['CurrentJobTitleSelect'].replace('Computer Scientist', 'Other')
test['CurrentJobTitleSelect'] = test['CurrentJobTitleSelect'].replace('Programmer', 'Other')
test['CurrentJobTitleSelect'] = test['CurrentJobTitleSelect'].replace('Predictive Modeler', 'Other')
test['CurrentJobTitleSelect'] = test['CurrentJobTitleSelect'].replace('DBA/Database Engineer', 'Other')
test['CurrentJobTitleSelect'] = test['CurrentJobTitleSelect'].replace('Operations Research Practitioner', 'Other')
test['CurrentJobTitleSelect'] = test['CurrentJobTitleSelect'].replace('Data Miner', 'Other')



train['EmployerIndustry'] = train['EmployerIndustry'].replace('Internet-based', 'Technology')
train['EmployerIndustry'] = train['EmployerIndustry'].replace('Telecommunications', 'Technology')
train['EmployerIndustry'] = train['EmployerIndustry'].replace('Pharmaceutical', 'Other')
train['EmployerIndustry'] = train['EmployerIndustry'].replace('Military/Security', 'Other')
train['EmployerIndustry'] = train['EmployerIndustry'].replace('Hospitality/Entertainment/Sports', 'Other')

test['EmployerIndustry'] = test['EmployerIndustry'].replace('Internet-based', 'Technology')
test['EmployerIndustry'] = test['EmployerIndustry'].replace('Telecommunications', 'Technology')
test['EmployerIndustry'] = test['EmployerIndustry'].replace('Pharmaceutical', 'Other')
test['EmployerIndustry'] = test['EmployerIndustry'].replace('Military/Security', 'Other')
test['EmployerIndustry'] = test['EmployerIndustry'].replace('Hospitality/Entertainment/Sports', 'Other')


trainmean = train[train['CompensationScore']!='na']['CompensationScore'].mean()

if trainmean < 3:
  train['CompensationScore'] = train['CompensationScore'].replace('na','Low')
elif trainmean >= 3 and trainmean < 8:
  train['CompensationScore'] = train['CompensationScore'].replace('na','Medium')
elif trainmean >= 8:
  train['CompensationScore'] = train['CompensationScore'].replace('na','High')

train['CompensationScore'] = train['CompensationScore'].replace(0.0,'Low')
train['CompensationScore'] = train['CompensationScore'].replace(1.0,'Low')
train['CompensationScore'] = train['CompensationScore'].replace(2.0,'Low')
train['CompensationScore'] = train['CompensationScore'].replace(3.0,'Medium')
train['CompensationScore'] = train['CompensationScore'].replace(4.0,'Medium')
train['CompensationScore'] = train['CompensationScore'].replace(5.0,'Medium')
train['CompensationScore'] = train['CompensationScore'].replace(6.0,'Medium')
train['CompensationScore'] = train['CompensationScore'].replace(7.0,'Medium')
train['CompensationScore'] = train['CompensationScore'].replace(8.0,'High')
train['CompensationScore'] = train['CompensationScore'].replace(9.0,'High')
train['CompensationScore'] = train['CompensationScore'].replace(10.0,'High')

testmean = test[test['CompensationScore']!='na']['CompensationScore'].mean() 
if testmean< 3:
  test['CompensationScore'] = test['CompensationScore'].replace('na','Low')
elif testmean >= 3 and testmean < 8:
  test['CompensationScore'] = test['CompensationScore'].replace('na','Medium')
elif testmean >= 8:
  test['CompensationScore'] = test['CompensationScore'].replace('na','High')


test['CompensationScore'] = test['CompensationScore'].replace(0.0,'Low')
test['CompensationScore'] = test['CompensationScore'].replace(1.0,'Low')
test['CompensationScore'] = test['CompensationScore'].replace(2.0,'Low')
test['CompensationScore'] = test['CompensationScore'].replace(3.0,'Medium')
test['CompensationScore'] = test['CompensationScore'].replace(4.0,'Medium')
test['CompensationScore'] = test['CompensationScore'].replace(5.0,'Medium')
test['CompensationScore'] = test['CompensationScore'].replace(6.0,'Medium')
test['CompensationScore'] = test['CompensationScore'].replace(7.0,'Medium')
test['CompensationScore'] = test['CompensationScore'].replace(8.0,'High')
test['CompensationScore'] = test['CompensationScore'].replace(9.0,'High')
test['CompensationScore'] = test['CompensationScore'].replace(10.0,'High')
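Two patterns recur in the cell above: collapsing every label outside a keep-list into "Other", and binning a numeric score into Low/Medium/High. The following is a minimal sketch of both, assuming the thresholds used above (below 3 is Low, 3 up to but not including 8 is Medium, 8 and above is High). The helper names `collapse_to_other` and `bin_score` are hypothetical. The collapse helper matches the `LanguageRecommendationSelect` case, where "na" also becomes "Other"; columns that map "na" to a different label would need that replacement applied first.

```python
import numpy as np
import pandas as pd

def collapse_to_other(series, keep, other='Other'):
    # Keep the labels listed in `keep`; every other label (including 'na')
    # is replaced by `other`.
    return series.where(series.isin(keep), other)

def bin_score(series):
    # Half-open bins: [-inf, 3) -> Low, [3, 8) -> Medium, [8, inf) -> High.
    binned = pd.cut(series, bins=[-np.inf, 3, 8, np.inf], right=False,
                    labels=['Low', 'Medium', 'High'])
    return binned.astype(object)

langs = pd.Series(['Python', 'Scala', 'R', 'na', 'SQL'])
collapsed = collapse_to_other(langs, ['Python', 'R', 'SQL'])
print(collapsed.tolist())

scores = pd.Series([0.0, 2.0, 3.0, 7.0, 8.0, 10.0])
binned_scores = bin_score(scores)
print(binned_scores.tolist())
```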

### CodeWriter Feature Consists Only of "Yes" (Dropped)

In [None]:
train.drop('CodeWriter',axis='columns', inplace=True)
test.drop('CodeWriter',axis='columns', inplace=True)

### Helper Functions for Feature Extraction

In [None]:
# Returns the first value before the first ",", or the text itself if there is no comma.
# Returns None for missing values.
def takeFirstEmployer(text):
  if not pd.isnull(text):
    index = text.find(',')
    if index != -1:
        return text[0:index]
    else:
        return text

In [None]:
# Returns the number of comma-separated values in the text (1 if there is no comma).
# Returns None for missing values.
def takeAmount(text):
  if not pd.isnull(text):
    # str.split always returns at least one element, so no special case is needed
    return len(text.split(','))

In [None]:
train['CurrentEmployerType_Amount'] = train['CurrentEmployerType'].apply(takeAmount)
train['PastJobTitlesSelect_Amount'] = train['PastJobTitlesSelect'].apply(takeAmount)
train['MLSkillsSelect_Amount'] = train['MLSkillsSelect'].apply(takeAmount)
train['MLTechniquesSelect_Amount'] = train['MLTechniquesSelect'].apply(takeAmount)
train['WorkAlgorithmsSelect_Amount'] = train['WorkAlgorithmsSelect'].apply(takeAmount)

test['CurrentEmployerType_Amount'] = test['CurrentEmployerType'].apply(takeAmount)
test['PastJobTitlesSelect_Amount'] = test['PastJobTitlesSelect'].apply(takeAmount)
test['MLSkillsSelect_Amount'] = test['MLSkillsSelect'].apply(takeAmount)
test['MLTechniquesSelect_Amount'] = test['MLTechniquesSelect'].apply(takeAmount)
test['WorkAlgorithmsSelect_Amount'] = test['WorkAlgorithmsSelect'].apply(takeAmount)

train['CurrentEmployerType_Amount'] = train['CurrentEmployerType_Amount'].astype('float64')
train['PastJobTitlesSelect_Amount'] = train['PastJobTitlesSelect_Amount'].astype('float64')
train['MLSkillsSelect_Amount'] = train['MLSkillsSelect_Amount'].astype('float64')
train['MLTechniquesSelect_Amount'] = train['MLTechniquesSelect_Amount'].astype('float64')
train['WorkAlgorithmsSelect_Amount'] = train['WorkAlgorithmsSelect_Amount'].astype('float64')


train['CurrentEmployerType2'] = train['CurrentEmployerType'].apply(takeFirstEmployer)
train['MLSkillsSelect2'] = train['MLSkillsSelect'].apply(takeFirstEmployer)
train['MLTechniquesSelect2'] = train['MLTechniquesSelect'].apply(takeFirstEmployer)
train['WorkAlgorithmsSelect2'] = train['WorkAlgorithmsSelect'].apply(takeFirstEmployer)

test['CurrentEmployerType2'] = test['CurrentEmployerType'].apply(takeFirstEmployer)
test['MLSkillsSelect2'] = test['MLSkillsSelect'].apply(takeFirstEmployer)
test['MLTechniquesSelect2'] = test['MLTechniquesSelect'].apply(takeFirstEmployer)
test['WorkAlgorithmsSelect2'] = test['WorkAlgorithmsSelect'].apply(takeFirstEmployer)
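The two helpers can also be expressed with pandas string methods, which avoids a Python-level `apply`. This is only a sketch on toy data, not the notebook's own code. Like the helpers, it treats every comma as a separator, so a single label that itself contains commas is counted as several entries, and the "na" placeholder counts as one entry.

```python
import pandas as pd

answers = pd.Series(['Programmer,Software Developer/Software Engineer',
                     'Computer Scientist,Programmer,Researcher',
                     'na'])

# Number of comma-separated entries: number of commas plus one.
amount = (answers.str.count(',') + 1).astype('float64')

# First entry before the first comma (the whole string if there is none).
first = answers.str.split(',').str[0]

print(amount.tolist())
print(first.tolist())
```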

### Binning Values of New Features

In [None]:
train['CurrentEmployerType2'] = train['CurrentEmployerType2'].replace('na',train['CurrentEmployerType2'].mode()[0])
test['CurrentEmployerType2'] = test['CurrentEmployerType2'].replace('na',test['CurrentEmployerType2'].mode()[0])


freq_map8 = {'RNNs': 'Neural Networks',
             'GANs' : 'Neural Networks',
             'CNNs': 'Neural Networks',
             'Random Forests': 'Ensemble Methods',
             'Gradient Boosted Machines' : 'Ensemble Methods',
             'Evolutionary Approaches': 'Neural Networks',
             'SVMs': 'Neural Networks',
             'HMMs' : 'Markov',
             'Markov Logic Networks': 'Markov'}
train['WorkAlgorithmsSelect2'] = train['WorkAlgorithmsSelect2'].replace(freq_map8)
test['WorkAlgorithmsSelect2'] = test['WorkAlgorithmsSelect2'].replace(freq_map8)


freq_map8 = {'Neural Networks - GANs': 'Neural Networks',
             'Neural Networks - RNNs' : 'Neural Networks',
             'Neural Networks - CNNs': 'Neural Networks',
             'Decision Trees - Random Forests': 'Ensemble Methods',
             'Decision Trees - Gradient Boosted Machines' : 'Ensemble Methods',
             'Evolutionary Approaches': 'Neural Networks',
             'Gradient Boosting':'Ensemble Methods',
             'Support Vector Machines (SVMs)': 'Neural Networks',
             'Hidden Markov Models HMMs' : 'Markov',
             'Markov Logic Networks': 'Markov'}
train['MLTechniquesSelect2'] = train['MLTechniquesSelect2'].replace(freq_map8)
test['MLTechniquesSelect2'] = test['MLTechniquesSelect2'].replace(freq_map8)


freq_map8 = {'Speech Recognition': 'Natural Language Processing',
             'Machine Translation' : 'Natural Language Processing',
             'Time Series': 'Supervised Machine Learning (Tabular Data)'}
train['MLSkillsSelect2'] = train['MLSkillsSelect2'].replace(freq_map8)
test['MLSkillsSelect2'] = test['MLSkillsSelect2'].replace(freq_map8)

### New Feature: Continents

In [2]:
!pip install pycountry-convert

In [None]:
import pycountry_convert as pc

def countryToContinent(countryname):
  if pd.isnull(countryname)==False:
    if countryname != 'Other':
      country_code = pc.country_name_to_country_alpha2(countryname, cn_name_format="default")
      continent_name = pc.country_alpha2_to_continent_code(country_code)
      return continent_name
    else:
      return 'Other'

In [None]:
train['Country'] = train['Country'].replace("People 's Republic of China","China")
train['Country'] = train['Country'].replace("Republic of China","China")
train['Country'] = train['Country'].replace("na","Other")

test['Country'] = test['Country'].replace("People 's Republic of China","China")
test['Country'] = test['Country'].replace("Republic of China","China")
test['Country'] = test['Country'].replace("na","Other")


train['Continent'] = train['Country'].apply(countryToContinent)
test['Continent'] = test['Country'].apply(countryToContinent)

In [None]:
# To Get The Name of the Categorical Features
mycolumns = []
for col in list(train.columns):
    if train[col].dtypes == 'object':
        mycolumns.append(col)

### Generating Dummy Variables For Categorical Features

**Note:** Some categorical features are omitted here because we decided to drop them.

In [None]:
df_train = train.copy()
for col in ['Country',
            'GenderSelect',
            'CurrentJobTitleSelect',
            'LanguageRecommendationSelect',
            'DataScienceIdentitySelect',
            'FormalEducation',
            'MajorSelect',
            'CompensationScore',
            'WorkInternalVsExternalTools',
            'WorkMLTeamSeatSelect',
            'CurrentEmployerType2',
            'MLSkillsSelect2',
            'WorkAlgorithmsSelect2',
            'Continent']:
    df_train = pd.concat([df_train, pd.get_dummies(train[col], drop_first=True, prefix=col)], axis=1)
    df_train = df_train.drop(col,axis='columns')
    
df_test = test.copy()
for col in ['Country',
            'GenderSelect',
             'CurrentJobTitleSelect',
             'LanguageRecommendationSelect',
             'DataScienceIdentitySelect',
             'FormalEducation',
             'MajorSelect',
             'CompensationScore',
             'WorkInternalVsExternalTools',
             'WorkMLTeamSeatSelect',
             'CurrentEmployerType2',
             'MLSkillsSelect2',
             'WorkAlgorithmsSelect2',
             'Continent']:
    df_test = pd.concat([df_test, pd.get_dummies(test[col], drop_first=True, prefix=col)], axis=1)
    df_test = df_test.drop(col,axis='columns')
    
df_train = df_train.drop(['MLTechniquesSelect2','MLMethodNextYearSelect','MLToolNextYearSelect','EmploymentStatus','EmployerIndustry','CurrentEmployerType','PastJobTitlesSelect','MLSkillsSelect','MLTechniquesSelect','WorkAlgorithmsSelect'], axis='columns')
df_test = df_test.drop(['MLTechniquesSelect2','MLMethodNextYearSelect','MLToolNextYearSelect','EmploymentStatus','EmployerIndustry','CurrentEmployerType','PastJobTitlesSelect','MLSkillsSelect','MLTechniquesSelect','WorkAlgorithmsSelect'], axis='columns')

df_train = df_train.drop('ID',axis='columns')
df_test = df_test.drop('ID',axis='columns')
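Because `pd.get_dummies` is called separately on the training and test sets, the two frames can end up with different dummy columns whenever a category appears in only one of them. One possible safeguard, sketched here on toy data with made-up continent codes, is to encode the test set without `drop_first` and then reindex it to the training columns.

```python
import pandas as pd

toy_train = pd.DataFrame({'Continent': ['AS', 'EU', 'OC', 'EU']})
toy_test = pd.DataFrame({'Continent': ['EU', 'SA']})  # 'SA' never occurs in training

train_dummies = pd.get_dummies(toy_train['Continent'], prefix='Continent',
                               drop_first=True, dtype=float)

# Encode the test set without drop_first, then align it to the training
# columns: categories missing from the test set become all-zero columns,
# and categories unseen during training are dropped.
test_dummies = pd.get_dummies(toy_test['Continent'], prefix='Continent', dtype=float)
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0.0)

print(list(train_dummies.columns))
print(test_aligned)
```

After alignment both frames have the same columns in the same order, which is what a scaler or model fitted on the training frame expects.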

### Feature Selection with SelectKBest

**Important Note:** After running this part, we analyzed the results and dropped some features. Therefore, if you run this part now, it will not reproduce the results we saw during the analysis.

**Features dropped after this part:** 

MLTechniquesSelect2, MLMethodNextYearSelect, MLToolNextYearSelect, EmploymentStatus, EmployerIndustry

In [None]:
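# Note: X_train, y_train and X_train_scaled are defined in the Modelling section below,
# so those cells must be run before this one.
best_features = SelectKBest(score_func=f_regression, k=110)
X_selected = best_features.fit_transform(X_train_scaled,y_train)
fit = best_features.fit(X_train,y_train)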
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_train.columns)
feature_scores = pd.concat([dfcolumns,dfscores],axis=1)
feature_scores.columns = ['Specs','Score']
print(feature_scores.nlargest(60,'Score'))
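To make the scoring step easier to follow in isolation, here is a small self-contained sketch of `SelectKBest` with `f_regression` on synthetic data. The toy features, the target and `k=2` are assumptions chosen for illustration; they are unrelated to the survey data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])
# The toy target depends on 'a' and 'c'; 'b' and 'd' are pure noise.
y_toy = 3.0 * X_toy['a'] + 1.5 * X_toy['c'] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=2).fit(X_toy, y_toy)

# One F-score per feature, sorted from most to least informative.
toy_scores = (pd.DataFrame({'Feature': X_toy.columns, 'Score': selector.scores_})
              .sort_values('Score', ascending=False))
print(toy_scores)

# get_support() returns a boolean mask of the k selected features.
selected = list(X_toy.columns[selector.get_support()])
print(selected)
```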

## - Modelling

In [None]:
# Split the train data to X and y
X_train = df_train.drop('JobSatisfaction', axis='columns')
y_train = df_train['JobSatisfaction']

### Scaling Data with MinMaxScaler

The default settings of MinMaxScaler are used.

In [None]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(df_test)

### Cross Validation

##### Ridge Regression and Bayesian Ridge Regression were the two best-performing models for predicting Job Satisfaction on this data

In [None]:
scores = cross_validate(linear_model.Ridge(alpha= 60, solver= 'auto',random_state=1), X_train_scaled, y_train, cv=10,scoring=('neg_root_mean_squared_error'))
print('Ridge Regression')
print('------------------')
print('CV Results: ',scores['test_score'])
print('Average of these results: ',scores['test_score'].mean()*(-1))

scores = cross_validate(linear_model.BayesianRidge(), X_train_scaled, y_train, cv=10,scoring=('neg_root_mean_squared_error'))
print('Bayesian Ridge Regression')
print('------------------')
print('CV Results: ',scores['test_score'])
print('Average of these results: ',scores['test_score'].mean()*(-1))

Ridge
[-1.83194625 -1.89992558 -2.00211263 -1.92822139 -1.81203156 -1.89757889
 -1.94498728 -1.97355826 -1.99550873 -2.04121762]
1.932708819503997
BayesianRidge
[-1.83192838 -1.89991067 -2.00214458 -1.92856712 -1.81237767 -1.89776591
 -1.94503936 -1.97355489 -1.99581892 -2.0412168 ]
1.932832431450286
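In the cells above the scaler is fitted on the full training set before cross-validation, so each validation fold has already influenced the scaling. A common alternative, sketched below on synthetic data (the toy arrays are assumptions, not the survey data), is to wrap the scaler and the model in a pipeline so that the scaler is refitted inside every fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(300, 5))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=300)

# The scaler is refitted on the training part of each fold only.
model = make_pipeline(MinMaxScaler(), Ridge(alpha=60, random_state=1))
cv_scores = cross_validate(model, X_toy, y_toy, cv=10,
                           scoring='neg_root_mean_squared_error')

# The scorer returns negated RMSE values, so flip the sign for reporting.
rmse_per_fold = -cv_scores['test_score']
print('RMSE per fold:', rmse_per_fold)
print('Mean RMSE:', rmse_per_fold.mean())
```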


### Hyperparameter Tuning for Ridge Regression

In [None]:
parambr = {'alpha':list(range(10,120,2)),
           'solver':['auto','svd','saga'],
           'random_state':[1]}
bag_grid = GridSearchCV(estimator = linear_model.Ridge() , param_grid = parambr, cv=10 ,scoring = "neg_root_mean_squared_error",n_jobs=-1)
bag_grid.fit(X_train_scaled,y_train)
print(bag_grid.best_params_)


### Exporting Prediction as .csv File For Kaggle Submission

In [None]:
ridge = linear_model.Ridge(alpha= 60, solver= 'auto',random_state=1)
ridge.fit(X_train_scaled,y_train)
y_pred = ridge.predict(X_test_scaled)

In [None]:
submission = pd.DataFrame(y_pred)
submission.rename(columns={0: "Prediction"}, inplace = True)
submission['ID'] = list(range(1,1001))
submission = submission[['ID', 'Prediction']]
submission.to_csv('CS412SubmissionCSV_Classy_FicationRidge.csv', index=False)
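The submission above hard-codes the IDs as 1 to 1000. As a hedged alternative, assuming the original `test` frame still holds its `ID` column in the same row order as the predictions, the IDs can be taken from the data itself. The toy frame and predictions below are stand-ins for `test` and `y_pred`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: `toy_test` plays the role of `test`, `toy_pred` of `y_pred`.
toy_test = pd.DataFrame({'ID': [101, 102, 103]})
toy_pred = np.array([6.4, 7.1, 5.9])

# Guard against a length mismatch between IDs and predictions.
if len(toy_test) != len(toy_pred):
    raise ValueError('number of predictions does not match number of test rows')

toy_submission = pd.DataFrame({'ID': toy_test['ID'].to_numpy(),
                               'Prediction': toy_pred})
print(toy_submission)
```

The resulting frame can then be written with `to_csv(..., index=False)` exactly as in the cell above.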