# 1. Project Introduction
## 1.1 Problem Statement
A data-driven company offering in-house training programs seeks to identify candidates who are genuinely interested in joining the organization after completing the course. Many individuals sign up for these programs, but not all intend to work for the company afterward. Distinguishing between those seeking employment and those pursuing training for other reasons (e.g., reskilling, career advancement elsewhere) is critical for efficient resource allocation, training customization, and targeted recruitment.

The [dataset](https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists/data) at hand includes demographics, education, experience, and other enrollment-related information. The goal is to develop a predictive model that estimates the probability of a candidate seeking a job change, using the available features. The insights derived can also be used for HR analytics, identifying the factors most strongly associated with an individual’s intention to switch jobs.

The problem is framed as a binary classification task:

* `1` → Candidate is looking for a job change

* `0` → Candidate is not looking for a job change

Additionally, the project aims to offer explainable model outputs that can assist in understanding which features influence job change intent, aiding strategic HR decisions.

## 1.2 Project Goals
* Build a model to predict whether a candidate is looking for a job change.

* Identify key drivers of this decision (feature importance, explainability).

* Support business decisions in HR, training design, and candidate targeting.

* Optionally: Deploy interactive tools (Power BI dashboard, Streamlit app) for practical use.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 2. Data Loading

## 2.1 Dataset Overview
The dataset includes the following key features:

* `enrollee_id`: Unique identifier

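* `city`, `city_development_index`: Geographical and urbanization data

* `gender`, `relevent_experience`, `enrolled_university`, `education_level`, `major_discipline`: Demographic and education attributes (the column name `relevent_experience` is spelled this way in the dataset)

* `experience`, `company_size`, `company_type`, `last_new_job`, `training_hours`: Work experience, employer, and training details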

* `target`: Whether the candidate is seeking a job change (1) or not (0)

*Notes:*

*The dataset is imbalanced (roughly 25% of candidates have `target` = 1), which requires special treatment in modeling.*

*Many features are categorical, and some of them (such as `city`) have high cardinality.*

*Some fields contain missing values, necessitating careful imputation strategies.*
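
The notes above can be checked with a few lines of pandas. The sketch below is only an illustration: the `toy` frame and its values are made up, and the same calls would be applied to the real training data once it is loaded.

```python
import pandas as pd

# Toy stand-in for the training data; the values are illustrative only.
toy = pd.DataFrame({
    "city": ["city_103", "city_40", "city_21", "city_103", "city_162", "city_40"],
    "gender": ["Male", "Male", None, None, "Male", "Female"],
    "target": [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
})

# Share of each class in the target column (a quick view of the imbalance).
class_share = toy["target"].value_counts(normalize=True)

# Number of distinct values per non-numeric column (NaN is not counted).
cardinality = toy.select_dtypes(exclude="number").nunique().sort_values(ascending=False)

print(class_share)
print(cardinality)
```

Columns with many distinct values are the ones where plain one-hot encoding would add many sparse columns.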


## 2.2 Load and Explore Dataset
Use pandas for basic exploration: `.head()`, `.info()`, `.describe()`.

Visualize missing data using seaborn/missingno.
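
Before plotting, it can help to compute the share of missing values per column. The following is a minimal sketch on a made-up `toy` frame (not the real data). The resulting Series could then be passed to a seaborn bar plot, or the full dataframe to `missingno.matrix`, if those libraries are installed.

```python
import pandas as pd

# Toy frame with gaps; the values are illustrative only.
toy = pd.DataFrame({
    "gender": ["Male", None, None, "Female"],
    "company_size": [None, None, "<10", None],
    "training_hours": [36, 47, 83, 52],
})

# Fraction of missing entries per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Keep only the columns that actually have gaps.
missing_share = missing_share[missing_share > 0]

print(missing_share)
```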

Load the train and test data:

In [None]:
import numpy as np
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/hr-data/aug_train.csv')
df_test = pd.read_csv('/content/drive/MyDrive/hr-data/aug_test.csv')

Show the first five rows with `head()`:

In [None]:
df.head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


Summarize the distribution of the numeric columns:

In [None]:
df.describe()

Unnamed: 0,enrollee_id,city_development_index,training_hours,target
count,19158.0,19158.0,19158.0,19158.0
mean,16875.358179,0.828848,65.366896,0.249348
std,9616.292592,0.123362,60.058462,0.432647
min,1.0,0.448,1.0,0.0
25%,8554.25,0.74,23.0,0.0
50%,16982.5,0.903,47.0,0.0
75%,25169.75,0.92,88.0,0.0
max,33380.0,0.949,336.0,1.0


Check the column value types and counts:

In [None]:
# info() prints its report and returns None, so call it directly rather than inside print()
df.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)

Drop the duplicate entries:

In [None]:
df.drop_duplicates(inplace=True)
df_test.drop_duplicates(inplace=True)

Show missing value count:

In [None]:
df.isnull().sum()

Unnamed: 0,0
enrollee_id,0
city,0
city_development_index,0
gender,0
relevent_experience,0
enrolled_university,0
education_level,0
major_discipline,0
experience,0
company_size,0


In [None]:
df_test.isnull().sum()

Unnamed: 0,0
enrollee_id,0
city,0
city_development_index,0
gender,0
relevent_experience,0
enrolled_university,0
education_level,0
major_discipline,0
experience,0
company_size,0


The column 'enrollee_id' is not needed for our analysis, so we can drop it.

In [None]:
df = df.drop(columns='enrollee_id')
df_test = df_test.drop(columns='enrollee_id')

Next, we modify the dataframes by filling in the missing values.

In [None]:
def mis(data):
    """Fill missing values and normalize ordinal-like columns (modifies `data` in place)."""
    # Gaps in categorical columns get an explicit placeholder category.
    data.gender = data.gender.fillna("Other")
    data.enrolled_university = data.enrolled_university.fillna("no_enrollment")
    data.education_level = data.education_level.fillna("Other")
    data.major_discipline = data.major_discipline.fillna("Other")
    data.company_type = data.company_type.fillna("Other")
    # Missing experience becomes "0"; stripping the markers turns ">20" into "20" and "<1" into "1".
    data.experience = data.experience.fillna("0")
    data.experience = data.experience.str.replace(">", "", regex=False)
    data.experience = data.experience.str.replace("<", "", regex=False)
    # Encode company size as an ordinal rank; missing sizes share rank 0 with '<10'.
    size_rank = {'<10': 0, '10/49': 1, '50-99': 2, '100-500': 3,
                 '500-999': 4, '1000-4999': 5, '5000-9999': 6, '10000+': 7}
    data.company_size = data.company_size.map(size_rank).fillna(0)
    # Missing last_new_job is treated as "never"; ">4" is stored as "5".
    data.last_new_job = data.last_new_job.fillna("never").replace(">4", "5")
    return data

In [None]:
df = mis(df)
df_test = mis(df_test)

In [None]:
df.head()

Unnamed: 0,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,20,0.0,Other,1,36,1.0
1,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,2.0,Pvt Ltd,5,47,0.0
2,city_21,0.624,Other,No relevent experience,Full time course,Graduate,STEM,5,0.0,Other,never,83,0.0
3,city_115,0.789,Other,No relevent experience,no_enrollment,Graduate,Business Degree,1,0.0,Pvt Ltd,never,52,1.0
4,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,20,2.0,Funded Startup,4,8,0.0


In [None]:
df['experience'].unique()

array([20., 15.,  5.,  1., 11., 13.,  7., 17.,  2., 16.,  4., 10., 14.,
       18., 19., 12.,  3.,  6.,  9.,  8.,  0.])

In [None]:
# mis() has already stripped the ">" and "<" markers, so only the cast to float remains
df['experience'] = df['experience'].astype(float)

In [None]:
def bin_experience(val):
    val = int(val)
    if val == 0:
        return "No experience"
    elif val <= 3:
        return "Junior"
    elif val <= 6:
        return "Mid-level"
    elif val <= 10:
        return "Senior"
    elif val <= 15:
        return "Experienced"
    else:
        return "Veteran"

In [None]:
df['experience'] = df['experience'].apply(bin_experience)

We will save this cleaned dataframe so it can be used for EDA. Then we will continue with encoding categorical variables so the dataframe can also be used for modeling.

In [None]:
df.to_csv('/content/drive/MyDrive/hr-data/aug_train-eda.csv', index=False)
# Write the test frame (not the train frame) to the test file
df_test.to_csv('/content/drive/MyDrive/hr-data/aug_test-eda.csv', index=False)

Let's check it now:

In [None]:
df['experience'].unique()

array(['Veteran', 'Experienced', 'Mid-level', 'Junior', 'Senior',
       'No experience'], dtype=object)

In [None]:
df_test['experience'] = df_test['experience'].apply(bin_experience)

In [None]:
experience_bin_to_num = {
    "No experience": 0,
    "Junior": 1,
    "Mid-level": 2,
    "Senior": 3,
    "Experienced": 4,
    "Veteran": 5
}

In [None]:
df['experience'] = df['experience'].map(experience_bin_to_num)

In [None]:
df_test['experience'] = df_test['experience'].map(experience_bin_to_num)
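
The binning and the numeric mapping above could also be done in a single step with `pd.cut`. This is an alternative sketch, not the notebook's method: the bin edges are derived from the thresholds in `bin_experience`, and the `years` Series is made-up sample input.

```python
import numpy as np
import pandas as pd

# Same labels and thresholds as bin_experience; each bin is closed on the right,
# e.g. (0, 3] is "Junior" and (15, inf] is "Veteran".
labels = ["No experience", "Junior", "Mid-level", "Senior", "Experienced", "Veteran"]
edges = [-1, 0, 3, 6, 10, 15, np.inf]

# Illustrative sample of experience values in years.
years = pd.Series([0, 1, 3, 4, 6, 7, 10, 11, 15, 16, 20])

# Ordered categorical with one label per value.
binned = pd.cut(years, bins=edges, labels=labels)

# Integer codes 0..5 follow the label order, matching experience_bin_to_num.
codes = binned.cat.codes

print(pd.DataFrame({"years": years, "bin": binned, "code": codes}))
```

Because the categorical is ordered, the label order and the codes cannot drift apart, which is a risk when the labels and the mapping dictionary are maintained separately.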

In [None]:
df.dtypes

Unnamed: 0,0
city,object
city_development_index,float64
gender,object
relevent_experience,object
enrolled_university,object
education_level,object
major_discipline,object
experience,int64
company_size,float64
company_type,object


Check missing values again:

In [None]:
df.isnull().sum()

Unnamed: 0,0
city,0
city_development_index,0
gender,0
relevent_experience,0
enrolled_university,0
education_level,0
major_discipline,0
experience,0
company_size,0


In [None]:
df.head()

Unnamed: 0,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,5,0.0,Other,1,36,1.0
1,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,4,2.0,Pvt Ltd,5,47,0.0
2,city_21,0.624,Other,No relevent experience,Full time course,Graduate,STEM,2,0.0,Other,never,83,0.0
3,city_115,0.789,Other,No relevent experience,no_enrollment,Graduate,Business Degree,1,0.0,Pvt Ltd,never,52,1.0
4,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,5,2.0,Funded Startup,4,8,0.0


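This looks good. Now let's save the modified dataframes:

In [None]:
# The train file name 'aug_train-mod.csv' is an assumption, chosen to mirror the '-eda' files above
df.to_csv('/content/drive/MyDrive/hr-data/aug_train-mod.csv', index=False)
df_test.to_csv('/content/drive/MyDrive/hr-data/aug_test-mod.csv', index=False)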
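
Several columns are still of type `object` at this point and need encoding before modeling. The notebook does not show how this is done, so the following is one possible sketch on a made-up `toy` frame. It one-hot encodes a low-cardinality column and frequency-encodes the high-cardinality `city` column; the `city_freq` column name is hypothetical.

```python
import pandas as pd

# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "city": ["city_103", "city_40", "city_103", "city_21"],
    "gender": ["Male", "Other", "Female", "Male"],
    "training_hours": [36, 47, 83, 52],
})

# Frequency encoding: replace each city with its share of rows.
city_freq = toy["city"].map(toy["city"].value_counts(normalize=True))

# One-hot encode the low-cardinality column as 0/1 integers.
encoded = pd.get_dummies(toy.drop(columns="city"), columns=["gender"], dtype=int)
encoded["city_freq"] = city_freq

print(encoded)
```

In a train/test setting, the city frequencies would normally be computed on the training data only and then mapped onto the test data.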

## 2.3 Store in SQLite, explore with SQL

Use SQL queries to:

* Count candidates by education_level, company_type, etc.

* Compute basic aggregations: avg training hours by group, job change rates by gender, etc.
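
The queries listed above could look like the sketch below. It assumes an in-memory SQLite database and a hypothetical table name `candidates`. The `toy` frame is made-up sample data standing in for the cleaned training dataframe.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the cleaned training data; the values are illustrative only.
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Other", "Female"],
    "education_level": ["Graduate", "Graduate", "Masters", "Graduate"],
    "training_hours": [36, 47, 83, 52],
    "target": [1.0, 0.0, 0.0, 1.0],
})

# In-memory database; a file path could be used instead to persist the table.
conn = sqlite3.connect(":memory:")
toy.to_sql("candidates", conn, index=False, if_exists="replace")

# Count candidates by education level.
counts = pd.read_sql_query(
    "SELECT education_level, COUNT(*) AS n "
    "FROM candidates GROUP BY education_level ORDER BY n DESC",
    conn,
)

# Average training hours and job change rate by gender.
rates = pd.read_sql_query(
    "SELECT gender, AVG(training_hours) AS avg_training_hours, "
    "AVG(target) AS job_change_rate "
    "FROM candidates GROUP BY gender ORDER BY gender",
    conn,
)
conn.close()

print(counts)
print(rates)
```

Because `target` is coded as 0/1, `AVG(target)` directly gives the share of candidates looking for a job change in each group.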