# Data Description 
## HR Analytics: Job Change of Data Scientists - Predict who will move to a new job

### Context and Content
A company active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass courses conducted by the company. Many people sign up for the training. The company wants to know which of these candidates really want to work for the company after training and which are looking for new employment, because this helps reduce cost and time, improves the quality of training, and supports course planning and candidate categorization. Information on demographics, education, and experience is available from each candidate's signup and enrollment.

This dataset is also designed to help HR research understand the factors that lead a person to leave their current job. Using model(s) built on the current credentials, demographics, and experience data, you will predict the probability that a candidate will look for a new job or will work for the company, and interpret the factors that affect the employee's decision.

### Inspiration

1. Predict the probability that a candidate will work for the company


2. Interpret the model(s) in a way that illustrates which features affect the candidate's decision


### Features

- enrollee_id : Unique ID for candidate

- city: City code

- city_development_index: Development index of the city (scaled)

- gender: Gender of candidate

- relevent_experience: Relevant experience of candidate

- enrolled_university: Type of University course enrolled if any

- education_level: Education level of candidate

- major_discipline :Education major discipline of candidate

- experience: Candidate total experience in years

- company_size: Number of employees in current employer's company

- company_type : Type of current employer

- last_new_job: Difference in years between previous job and current job

- training_hours: training hours completed

- target: 0 – Not looking for job change, 1 – Looking for a job change

Reference: 
https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists?select=aug_test.csv

In [18]:
import numpy as np
import pandas as pd

In [19]:
# train data
data = pd.read_csv("aug_train.csv")
data

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19154,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19155,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19156,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)
me

In [21]:
data.describe()

Unnamed: 0,enrollee_id,city_development_index,training_hours,target
count,19158.0,19158.0,19158.0,19158.0
mean,16875.358179,0.828848,65.366896,0.249348
std,9616.292592,0.123362,60.058462,0.432647
min,1.0,0.448,1.0,0.0
25%,8554.25,0.74,23.0,0.0
50%,16982.5,0.903,47.0,0.0
75%,25169.75,0.92,88.0,0.0
max,33380.0,0.949,336.0,1.0


## Observations: 

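1. Total number of features = 13 (excluding the target)
2. Number of categorical features = 10
3. Number of numerical features = 3 (including enrollee_id)
4. Number of records = 19158
5. About 75% of the target values are 0 (the mean of target is 0.249), so the data is imbalanced: there are more records in class 0 (not looking for a job change) than in class 1 (looking for a job change)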
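
The imbalance noted in point 5 can be read directly from the target column. The following is a minimal sketch on a toy frame with made-up values (the `toy` frame is an assumption for illustration, not the real data); on the notebook's `data` the same two calls would apply to `data["target"]`.

```python
import pandas as pd

# Toy stand-in for the `target` column (hypothetical values, not the real data)
toy = pd.DataFrame({"target": [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]})

# Share of each class; normalize=True returns fractions instead of counts
class_share = toy["target"].value_counts(normalize=True)
print(class_share)

# For a 0/1 target, the mean equals the share of class 1
positive_rate = toy["target"].mean()
print(f"share of class 1: {positive_rate:.3f}")
```

This is why the mean of 0.249348 in the `describe()` output corresponds to roughly a quarter of the records being in class 1.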

# Data Preprocessing

## Feature selection: 

- Eliminate insignificant features: enrollee_id

In [22]:
new_data = data.drop(columns = 'enrollee_id')

##  Missing values:

### Methods applied to handle missing values: 

As all missing values in this data are in categorical columns, we can either drop the corresponding records or treat missing data as just another category named Unknown. Note: there are other approaches, such as developing a model to predict the missing values, which we have not considered.

If we drop all records that include NaN, more than half of the rows will be eliminated. So we decided to drop NaN values only for some columns (those we think have more impact on the outcome) and use the Unknown category for the rest.

We drop records with NaN values in the following features: education_level, major_discipline, experience, last_new_job 

And use Unknown category for the rest
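
To make the trade-off concrete, here is a small sketch comparing the two strategies on a toy frame. The values are hypothetical and only the column names follow the dataset; `key_cols` is an illustrative name for the columns whose missing rows are dropped.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names follow the dataset
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", np.nan],
    "education_level": ["Graduate", "Masters", np.nan, "Graduate"],
    "company_size": [np.nan, "50-99", "<10", np.nan],
    "training_hours": [36, 47, 83, 52],
})

# Option 1: drop every row that has at least one NaN
rows_all_dropped = len(toy.dropna())

# Option 2 (the mixed strategy): drop rows only where a key column is missing,
# then label the remaining gaps as their own category
key_cols = ["education_level"]
mixed = toy.dropna(subset=key_cols).fillna("Unknown")

print("rows kept when dropping all NaN:", rows_all_dropped)
print("rows kept with the mixed strategy:", len(mixed))
```

On this toy frame the mixed strategy keeps more rows while still leaving no missing values behind.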


In [23]:
new_data.isna().sum()

city                         0
city_development_index       0
gender                    4508
relevent_experience          0
enrolled_university        386
education_level            460
major_discipline          2813
experience                  65
company_size              5938
company_type              6140
last_new_job               423
training_hours               0
target                       0
dtype: int64

In [67]:
# drop NaN values for some columns
data_wo_nan = new_data.dropna(subset=['education_level','major_discipline', 'experience', 'last_new_job'])

# Replace other NaN with Unknown value 
data_wo_nan = data_wo_nan.replace(np.nan,'Unknown')

# check missing values again
data_wo_nan.isna().sum()

city                      0
city_development_index    0
gender                    0
relevent_experience       0
enrolled_university       0
education_level           0
major_discipline          0
experience                0
company_size              0
company_type              0
last_new_job              0
training_hours            0
target                    0
dtype: int64

## Categorical Columns: 
- Check the unique values of each column so we can better decide how to convert them to numerical ones

In [25]:
#Select categorical columns
cat_columns = data_wo_nan.select_dtypes(include='object')
print("Unique values of city: \n",cat_columns.city.unique())
print("\nUnique values of gender: \n",cat_columns.gender.unique())
print("\nUnique values of relevent_experience: \n",cat_columns.relevent_experience.unique())
print("\nUnique values of enrolled_university: \n",cat_columns.enrolled_university.unique())
print("\nUnique values of education_level: \n",cat_columns.education_level.unique())
print("\nUnique values of major_discipline: \n",cat_columns.major_discipline.unique())
print("\nUnique values of experience: \n",cat_columns.experience.unique())
print("\nUnique values of company_size: \n",cat_columns.company_size.unique())
print("\nUnique values of company_type: \n",cat_columns.company_type.unique())
print("\nUnique values of last_new_job: \n",cat_columns.last_new_job.unique())

Unique values of city: 
 ['city_103' 'city_40' 'city_21' 'city_115' 'city_162' 'city_176' 'city_46'
 'city_61' 'city_114' 'city_13' 'city_159' 'city_102' 'city_160' 'city_16'
 'city_64' 'city_101' 'city_83' 'city_105' 'city_104' 'city_73' 'city_67'
 'city_75' 'city_41' 'city_100' 'city_93' 'city_11' 'city_36' 'city_20'
 'city_71' 'city_57' 'city_152' 'city_19' 'city_65' 'city_74' 'city_173'
 'city_136' 'city_98' 'city_97' 'city_90' 'city_50' 'city_138' 'city_82'
 'city_157' 'city_89' 'city_150' 'city_175' 'city_28' 'city_94' 'city_59'
 'city_165' 'city_145' 'city_142' 'city_12' 'city_37' 'city_43' 'city_116'
 'city_99' 'city_23' 'city_10' 'city_45' 'city_128' 'city_70' 'city_158'
 'city_123' 'city_7' 'city_72' 'city_106' 'city_143' 'city_78' 'city_109'
 'city_24' 'city_149' 'city_48' 'city_144' 'city_91' 'city_146' 'city_133'
 'city_126' 'city_118' 'city_134' 'city_167' 'city_27' 'city_84' 'city_54'
 'city_39' 'city_79' 'city_76' 'city_81' 'city_131' 'city_44' 'city_155'
 'city_33' 'ci
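
The per-column print calls in the cell above can also be written as a loop. This is only a sketch on a toy frame with hypothetical values; it selects the non-numeric columns with `exclude="number"`, which here picks out the string columns.

```python
import pandas as pd

# Toy frame with hypothetical values standing in for `data_wo_nan`
toy = pd.DataFrame({
    "gender": ["Male", "Unknown", "Female"],
    "experience": [">20", "15", "<1"],
    "training_hours": [36, 47, 83],
})

# Keep only the non-numeric (string) columns
cat_columns = toy.select_dtypes(exclude="number")

# One loop instead of one print call per column
for name in cat_columns.columns:
    print(f"\nUnique values of {name}:\n", cat_columns[name].unique())
```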

### observations
- relevent_experience is a boolean column; we can convert it to 0 for no experience and 1 for having experience.
- education_level and company_size are ordinal categories.
- experience and last_new_job can be converted to numeric values directly (after stripping the `>` and `<` markers and mapping never to 0).
- company_type, enrolled_university, gender, and major_discipline are nominal categories, and we can use get_dummies for them.
- We can also one-hot encode the city feature, which is also a nominal category, but the drawback is that city has many unique values, so one-hot encoding adds many features to our feature space.
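
How many columns one-hot encoding adds can be estimated before encoding by counting distinct values per nominal column. The sketch below uses a toy frame with hypothetical values; `nominal_cols` is an illustrative name.

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the dataset
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Unknown", "Male"],
    "city": ["city_103", "city_40", "city_21", "city_115"],
    "training_hours": [36, 47, 83, 52],
})

nominal_cols = ["gender", "city"]

# Each nominal column becomes one indicator column per distinct value
added_per_column = toy[nominal_cols].nunique()
print(added_per_column)

encoded = pd.get_dummies(toy, columns=nominal_cols)

# Encoded width = untouched columns + all indicator columns
expected_width = (toy.shape[1] - len(nominal_cols)) + int(added_per_column.sum())
print(encoded.shape[1], expected_width)
```

Running `nunique()` on the real city column would show how much of the final feature count comes from city alone.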

In [65]:
proc_data = data_wo_nan.copy()

# relevent_experience replace with 0 and 1, 1 for having experience and 0 for no experience
proc_data['relevent_experience'] = proc_data['relevent_experience'].replace(['Has relevent experience','No relevent experience'],[1,0])

# manually assign ordinal numbers to education_level and company_size
# education_level: Graduate -> 1, Masters -> 2, Phd -> 3. 'Graduate' could in principle also cover Masters and Phd holders,
# but people with a Phd would usually not describe themselves as Graduate, so Graduate gets a lower number than Masters.
# company_size: Unknown gets 0.
proc_data['education_level'] = proc_data['education_level'].replace(['Graduate','Masters','Phd'],[1,2,3])
proc_data['company_size'] = proc_data['company_size'].replace(['Unknown','<10', '10/49','50-99', '100-500','500-999','1000-4999','5000-9999','10000+'] ,range(0,9))

# convert experience and last_new_job to numeric values: '>20' -> 20, '<1' -> 1, '>4' -> 4, 'never' -> 0
proc_data['experience'] = proc_data['experience'].str.replace('>','').str.replace('<','')
proc_data['experience'] = pd.to_numeric(proc_data['experience'])

proc_data['last_new_job'] = proc_data['last_new_job'].str.replace('>','')
proc_data['last_new_job'] = proc_data['last_new_job'].replace('never',0)
proc_data['last_new_job'] = pd.to_numeric(proc_data['last_new_job'])

final_data = pd.get_dummies(proc_data, columns = ['company_type', 'enrolled_university', 'gender', 'major_discipline','city'])
final_data


Unnamed: 0,city_development_index,relevent_experience,education_level,experience,company_size,last_new_job,training_hours,target,company_type_Early Stage Startup,company_type_Funded Startup,...,city_city_84,city_city_89,city_city_9,city_city_90,city_city_91,city_city_93,city_city_94,city_city_97,city_city_98,city_city_99
0,0.920,1,1,20,0,1,36,1.0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0.776,0,1,15,3,4,47,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0.624,0,1,5,0,0,83,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0.789,0,1,1,0,0,52,1.0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0.767,1,2,20,3,4,8,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19150,0.920,1,1,10,4,3,23,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
19152,0.920,1,1,7,2,1,25,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0
19153,0.878,0,1,14,0,1,42,1.0,0,0,...,0,0,0,0,0,0,0,0,0,0
19154,0.920,1,1,14,0,4,52,1.0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [66]:
final_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16007 entries, 0 to 19155
Columns: 151 entries, city_development_index to city_city_99
dtypes: float64(2), int64(6), uint8(143)
memory usage: 3.3 MB


## Preprocessing function

In [68]:
# preprocessing function
def preprocessing_data(df: pd.DataFrame):
    data = df.copy()
    # drop NaN values for some columns
    data = data.dropna(subset=['education_level','major_discipline', 'experience', 'last_new_job'])
    # Replace other NaN with Unknown value 
    data = data.replace(np.nan,'Unknown')
    # relevent_experience replace with 0 and 1, 1 for having experience and 0 for no experience
    data['relevent_experience'] = data['relevent_experience'].replace(['Has relevent experience','No relevent experience'],[1,0])

    # manually assign ordinal numbers to education_level and company_size
    # education_level: Graduate -> 1, Masters -> 2, Phd -> 3. 'Graduate' could in principle also cover Masters and Phd holders,
    # but people with a Phd would usually not describe themselves as Graduate, so Graduate gets a lower number than Masters.
    # company_size: Unknown gets 0.
    
    data['education_level'] = data['education_level'].replace(['Graduate','Masters','Phd'],[1,2,3])
    data['company_size'] = data['company_size'].replace(['Unknown','<10', '10/49','50-99', '100-500','500-999','1000-4999','5000-9999','10000+'] ,range(0,9))

    # convert experience and last_new_job to numeric values
    data['experience'] = data['experience'].str.replace('>','').str.replace('<','')
    data['experience'] = pd.to_numeric(data['experience'])

    data['last_new_job'] = data['last_new_job'].str.replace('>','')
    data['last_new_job'] = data['last_new_job'].replace('never',0)
    data['last_new_job'] = pd.to_numeric(data['last_new_job'])

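    # drop the ID column, as in the feature selection step
    data = data.drop(columns='enrollee_id')
    # one-hot encode the nominal columns of the local frame (not the global proc_data)
    data = pd.get_dummies(data, columns = ['company_type', 'enrolled_university', 'gender', 'major_discipline','city'])
    return data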

In [69]:
raw_data =  pd.read_csv("aug_train.csv")
processed_data = preprocessing_data(raw_data)

In [70]:
processed_data

Unnamed: 0,city_development_index,relevent_experience,education_level,experience,company_size,last_new_job,training_hours,target,company_type_Early Stage Startup,company_type_Funded Startup,...,city_city_84,city_city_89,city_city_9,city_city_90,city_city_91,city_city_93,city_city_94,city_city_97,city_city_98,city_city_99
0,0.920,1,1,20,0,1,36,1.0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0.776,0,1,15,3,4,47,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0.624,0,1,5,0,0,83,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0.789,0,1,1,0,0,52,1.0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0.767,1,2,20,3,4,8,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19150,0.920,1,1,10,4,3,23,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
19152,0.920,1,1,7,2,1,25,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0
19153,0.878,0,1,14,0,1,42,1.0,0,0,...,0,0,0,0,0,0,0,0,0,0
19154,0.920,1,1,14,0,4,52,1.0,0,0,...,0,0,0,0,0,0,0,0,0,0
