![image info](https://ineuron.ai/images/ineuron-logo.png)

## 1. Problem Statement

- The objective is to build a model that predicts whether a person is looking to change jobs.

## 2. Data Collection
* The dataset is collected from https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists
* The data consists of 14 columns and 19158 rows.

### 2.1 Import Data and Required Packages
#### Importing the Pandas, NumPy, Matplotlib, Seaborn and Warnings libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

#### Import the CSV Data as Pandas DataFrame

In [2]:
df = pd.read_csv('data/aug_train.csv')

#### Show Top 5 Records

In [3]:
df.head()

| | enrollee_id | city | city_development_index | gender | relevent_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8949 | city_103 | 0.92 | Male | Has relevent experience | no_enrollment | Graduate | STEM | >20 | NaN | NaN | 1 | 36 | 1.0 |
| 1 | 29725 | city_40 | 0.776 | Male | No relevent experience | no_enrollment | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 47 | 0.0 |
| 2 | 11561 | city_21 | 0.624 | NaN | No relevent experience | Full time course | Graduate | STEM | 5 | NaN | NaN | never | 83 | 0.0 |
| 3 | 33241 | city_115 | 0.789 | NaN | No relevent experience | NaN | Graduate | Business Degree | <1 | NaN | Pvt Ltd | never | 52 | 1.0 |
| 4 | 666 | city_162 | 0.767 | Male | Has relevent experience | no_enrollment | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 8 | 0.0 |


#### Shape of the dataset

In [4]:
df.shape

(19158, 14)

#### Summary of the dataset

- The `describe()` method summarizes how the numerical columns are distributed.
- It reports the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum of each numerical column.

In [5]:
df.describe()

| | enrollee_id | city_development_index | training_hours | target |
|---|---|---|---|---|
| count | 19158.0 | 19158.0 | 19158.0 | 19158.0 |
| mean | 16875.358179 | 0.828848 | 65.366896 | 0.249348 |
| std | 9616.292592 | 0.123362 | 60.058462 | 0.432647 |
| min | 1.0 | 0.448 | 1.0 | 0.0 |
| 25% | 8554.25 | 0.74 | 23.0 | 0.0 |
| 50% | 16982.5 | 0.903 | 47.0 | 0.0 |
| 75% | 25169.75 | 0.92 | 88.0 | 0.0 |
| max | 33380.0 | 0.949 | 336.0 | 1.0 |


#### Insights
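- The maximum of `training_hours` (336) is far above its mean (65.4) and its 75th percentile (88), which suggests a right-skewed distribution with possible outliers.
- This needs further investigation.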
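One way to start that investigation is to compare the values against the conventional 1.5 × IQR upper fence and to look at the skewness. The sketch below uses a small hypothetical sample in place of `df['training_hours']`; the fence multiplier is a common rule of thumb, not something the dataset prescribes.

```python
import pandas as pd

# Hypothetical sample standing in for df['training_hours'].
training_hours = pd.Series(
    [1, 8, 23, 36, 47, 52, 83, 88, 120, 336], name="training_hours"
)

# Interquartile range and the conventional 1.5 * IQR upper fence.
q1 = training_hours.quantile(0.25)
q3 = training_hours.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the fence are candidates for closer inspection,
# not automatic removal.
outliers = training_hours[training_hours > upper_fence]

# Positive skewness indicates a long right tail.
skewness = training_hours.skew()

print(f"upper fence: {upper_fence:.1f}")
print(f"values above fence: {outliers.tolist()}")
print(f"skewness: {skewness:.2f}")
```

Whether flagged values should be capped, transformed (for example with a log), or kept as-is depends on the model chosen later.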

#### Check Datatypes in the dataset
- `info()` reports the non-null count and the datatype of each column.

In [6]:
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)

#### Insights
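- Most of the data is categorical: there are 10 object columns and 4 numeric columns.
- Several columns have many missing values, notably `company_type` (6140), `company_size` (5938), `gender` (4508) and `major_discipline` (2813).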
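The missing values can also be quantified directly rather than read off the `info()` output. The sketch below is a minimal example on a hypothetical toy frame (`toy` is not part of the notebook); the same two expressions would apply to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the notebook this would be `df`.
toy = pd.DataFrame({
    "gender": ["Male", np.nan, "Male", "Male"],
    "company_size": [np.nan, "50-99", np.nan, "50-99"],
    "training_hours": [36, 47, 83, 8],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(2),
})

# Keep only the columns that actually have gaps, worst first.
missing = missing[missing["missing_count"] > 0].sort_values(
    "missing_count", ascending=False
)
print(missing)
```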

#### `enrollee_id` is unique for each record, so it will not contribute to model building and can be dropped

In [7]:
df['enrollee_id'].nunique()

19158

In [8]:
df = df.drop('enrollee_id', axis=1)
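The same uniqueness check can be applied to every column at once. The sketch below flags columns whose number of distinct values equals the row count, using a hypothetical toy frame. It is only a heuristic: a continuous numeric feature can also be all-unique, so flagged columns still need a manual look.

```python
import pandas as pd

# Hypothetical toy frame; in the notebook this would be `df`.
toy = pd.DataFrame({
    "enrollee_id": [8949, 29725, 11561, 33241],
    "city": ["city_103", "city_40", "city_21", "city_103"],
    "training_hours": [36, 47, 36, 52],
})

# Columns with one distinct value per row behave like identifiers.
id_like = [col for col in toy.columns if toy[col].nunique() == len(toy)]
print(id_like)
```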

#### The experience column contains the values `>20` and `<1`. Replace them with 21 and 0 so the column can be converted to integers

In [9]:
def replace_less_greater(experience):
    # Map the open-ended brackets to the nearest integer outside the known range.
    if experience == '>20':
        return 21
    elif experience == '<1':
        return 0
    else:
        return experience

In [10]:
df['experience'] = df['experience'].map(replace_less_greater)
# Missing experience is treated as 0 years.
df['experience'] = df['experience'].fillna(0).astype('int')
df['experience'].unique()

array([21, 15,  5,  0, 11, 13,  7, 17,  2, 16,  1,  4, 10, 14, 18, 19, 12,
        3,  6,  9,  8, 20])
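The experience cleanup above can also be written without a helper function, as a single chain of `replace`, `pd.to_numeric`, `fillna` and `astype`. This is a sketch on a hypothetical sample of raw values, not the notebook's own cell; it keeps the same conventions (`>20` becomes 21, `<1` becomes 0, missing becomes 0).

```python
import numpy as np
import pandas as pd

# Hypothetical sample of raw `experience` values, including a missing entry.
experience = pd.Series([">20", "15", "5", "<1", np.nan])

cleaned = (
    experience
    .replace({">20": "21", "<1": "0"})  # same boundary convention as above
    .pipe(pd.to_numeric)                # strings -> numbers, NaN preserved
    .fillna(0)                          # convention carried over: missing -> 0
    .astype(int)
)
print(cleaned.tolist())
```

Note that filling missing values with 0 makes them indistinguishable from `<1`. If that distinction matters for the model, a separate indicator column or a different fill value may be worth considering.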

#### The last_new_job column contains the values `>4` and `never`. Replace them with 5 and 0 respectively

In [11]:
df.last_new_job.unique()

array(['1', '>4', 'never', '4', '3', '2', nan], dtype=object)

In [12]:
def replace_last_job(last_new_job):
    # '>4' becomes 5 and 'never' becomes 0; other values are returned unchanged.
    if last_new_job == '>4':
        return 5
    elif last_new_job == 'never':
        return 0
    else:
        return last_new_job

df['last_new_job'] = df['last_new_job'].map(replace_last_job)
# Missing values are treated the same as 'never'.
df['last_new_job'] = df['last_new_job'].fillna(0).astype('int')
df['last_new_job'].unique()

array([1, 5, 0, 4, 3, 2])

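#### Binning for company size
- Each size bracket is first replaced with a representative headcount (for example, `50-99` becomes 55) so the column can be converted to a numeric type.
- Headcounts above 2000 are then labelled `Large-org.`, the remaining known sizes `Small & Medium-org.`, and missing values `Undefined`.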

In [13]:
# Replace each size bracket with a representative headcount, then convert to numeric.
# Assigning the result back avoids relying on inplace=True through attribute access.
size_map = {
    '<10': '9',
    '10/49': '20',
    '50-99': '55',
    '100-500': '300',
    '500-999': '600',
    '1000-4999': '3000',
    '5000-9999': '6000',
    '10000+': '10001',
}
df['company_size'] = pd.to_numeric(df['company_size'].replace(size_map))

In [14]:
# NaN fails both comparisons, so missing sizes are labelled 'Undefined'.
df['company_size'] = np.where(df['company_size'] > 2000, 'Large-org.', np.where(df['company_size'] > 1, 'Small & Medium-org.', 'Undefined'))

In [15]:
df["company_size"].value_counts()

Small & Medium-org.    9310
Undefined              5938
Large-org.             3910
Name: company_size, dtype: int64
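The nested `np.where` call can alternatively be expressed with `pd.cut`, which makes the bin edges explicit. The sketch below runs on hypothetical headcounts rather than on `df`, and uses the same thresholds: values in (1, 2000] are small and medium, values above 2000 are large, and anything else (including missing) is `Undefined`.

```python
import numpy as np
import pandas as pd

# Hypothetical representative headcounts, as produced by the replacement step.
company_size = pd.Series([9, 55, 300, 3000, 10001, np.nan])

binned = pd.cut(
    company_size,
    bins=[1, 2000, np.inf],  # right-closed bins: (1, 2000] and (2000, inf]
    labels=["Small & Medium-org.", "Large-org."],
)

# Missing sizes fall outside every bin; label them explicitly.
binned = binned.cat.add_categories("Undefined").fillna("Undefined")
print(binned.value_counts())
```

The result is a categorical column rather than an array of strings, which also records the full set of allowed labels.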

In [16]:
df.to_csv('data/hr_cleaned.csv', index=False)