## Exploratory Data Analysis on HR Analytics – Job Change of Data Scientists

This notebook performs an **Exploratory Data Analysis (EDA)** on a dataset that contains information about data science professionals, including their education, experience, company details, and training background.

**Objective**:  
To explore the key factors that may influence whether an individual is likely to change their job.

---

###  Goals of This EDA:

- Understand the structure and quality of the data
- Analyze trends and distributions in features such as **education**, **experience**, and **company type**
- Examine how different features relate to the **target variable** indicating a potential job change

> **Note**: This project focuses solely on EDA without any machine learning modeling.


## Dataset Overview

This dataset contains profile information on data science professionals and is used to explore whether they are likely to change jobs.

Here’s a quick look at the key columns:

|  Feature              |  Description                                           |
|------------------------|----------------------------------------------------------|
| `education_level`      | Highest education level (e.g., Graduate, Masters)        |
| `major_discipline`     | Field of study (e.g., STEM, Business)                    |
| `experience`           | Total years of experience                                |
| `relevent_experience`  | Whether the person has relevant work experience (column name is spelled this way in the data) |
| `company_type`         | Type of company last worked at                           |
| `company_size`         | Size of the company                                      |
| `training_hours`       | Number of training hours completed                       |
| `enrolled_university`  | Current enrollment in university (if any)                |
| `target`               | **Target variable**: 1 = Looking for a job change, 0 = Not |

>  The dataset includes both categorical and numerical features, with some missing values that will be handled during the analysis.


## Importing Required Libraries

The following libraries will be used for data analysis and visualization throughout this notebook.


In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt 
import seaborn as sns
import plotly.express as px

from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder


In [None]:
# Load the training split from the Kaggle input directory
df = pd.read_csv("/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_train.csv")

## Data Inspection

In [None]:
df.head(10)

In [None]:
df.shape

In [None]:
df.columns

In [None]:
for col in df.columns:
    # Print the column name next to its distinct values so the output is readable
    print(col, df[col].unique())


In [None]:
df.nunique()

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.dtypes

## Missing Values Analysis

In [None]:
df.isna().sum()

In [None]:
sns.heatmap(df.isna())

### Missing Values Summary

| Column               | Missing Values | Approx. Percentage |
|----------------------|----------------|---------------------|
| gender               | 4,508          | ~23.5%              |
| enrolled_university  | 386            | ~2.0%               |
| education_level      | 460            | ~2.4%               |
| major_discipline     | 2,813          | ~14.7%              |
| experience           | 65             | ~0.3%               |
| company_size         | 5,938          | ~31.0%              |
| company_type         | 6,140          | ~32.0%              |
| last_new_job         | 423            | ~2.2%               |




> The table above summarizes the missing values in the dataset, including the total number and approximate percentage of missing entries for each column.
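A summary like the one above can be computed directly instead of being typed by hand. The following is a minimal sketch on a small toy frame; the helper name `missing_summary` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's `df` (hypothetical values, not the real data)
toy = pd.DataFrame({
    "gender": ["Male", np.nan, "Female", np.nan],
    "experience": ["5", ">20", np.nan, "<1"],
    "training_hours": [36, 47, 83, 52],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have any NaN."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns with at least one missing entry, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

In the notebook, calling `missing_summary(df)` before any imputation should reproduce the counts in the table.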


### Handling Missing Values

To address missing values in the dataset, we apply a simple two-part strategy:

- The `experience` column originally contains string values such as `'<1'` and `'>20'`, so we first convert all entries to numbers. Missing values are then imputed with the **median**, which is less sensitive to extreme values than the mean.

- For categorical columns such as `gender`, `education_level`, and `company_type`, missing values are replaced with the placeholder `"Unknown"`, which treats missing entries as a distinct category.

This leaves the dataset free of missing values while keeping every row and keeping each feature interpretable.


In [None]:
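# Guard for notebook re-runs: the raw columns are dropped only once
# 'experience_num' exists, so on a fresh run this cell does nothing.
# Note that later cells still read 'experience' and 'enrollee_id'.
if 'experience_num' in df.columns:
    df.drop(['enrollee_id', 'experience'], axis=1, inplace=True)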

In [None]:
def convert_experience(x):
    """Map a raw `experience` entry to a number of years."""
    if pd.isna(x):
        return np.nan         # If the value is NaN, keep it as NaN
    elif x == '>20':
        return 21             # If value is '>20', treat it as 21
    elif x == '<1':
        return 0              # If value is '<1', treat it as 0
    else:
        try:
            return int(x)     # If value is a number as string (e.g., '5'), convert it to int
        except (ValueError, TypeError):
            return np.nan     # If conversion fails (unexpected value), return NaN


In [None]:
# Step 1: Convert 'experience' column from strings to numeric values
df['experience'] = df['experience'].apply(convert_experience)

# Step 2: Fill missing values in 'experience' with the median
df['experience'] = df['experience'].fillna(df['experience'].median())

# Step 3: Fill missing values in categorical text columns with "Unknown"
text_cols = ['gender', 'relevent_experience', 'enrolled_university', 
             'education_level', 'major_discipline', 'company_size', 
             'company_type', 'last_new_job']

for col in text_cols:
    df[col] = df[col].fillna('Unknown')


### Missing Values Handling Summary

To ensure the dataset is clean and ready for analysis, we applied the following steps to handle missing values:

#### 🔹 1. `experience` Column:
- The `experience` column originally contained string values such as `'<1'`, `'>20'`, and other numeric values stored as strings.
- We created a function to:
  - Convert `'<1'` to `0`
  - Convert `'>20'` to `21`
  - Convert other string numbers to integers
  - Leave invalid or missing entries as `NaN`
- After conversion, missing values were imputed using the **median** of the column to minimize the effect of outliers.

#### 🔹 2. Categorical Columns:
- For categorical columns like `gender`, `education_level`, `company_type`, etc., missing values were filled with the placeholder `"Unknown"`.
- This approach allows us to retain rows with missing data and treat them as a separate, meaningful category during analysis or modeling.

####  Final Result:
- The dataset is now free of missing values.
- All values are in a consistent and analysis-friendly format.


In [None]:
df.isna().sum()

## EDA

In [None]:
sns.countplot(x='target', data=df)
plt.title('Target Distribution')
plt.show()


## Target Class Imbalance Analysis

Approximate counts in this dataset:

- Number of people with `target = 1` (interested in changing jobs) ≈ **4,000**
- Number of people with `target = 0` (not interested) ≈ **14,000**

---

###  Imbalanced Classes Detected

| Target Value | Count   | Approx. Percentage |
|--------------|---------|--------------------|
| `target = 0` | 14,000  | ~77%               |
| `target = 1` | 4,000   | ~23%               |
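The figures above are rounded. A minimal sketch of how exact counts and shares could be obtained with `value_counts` follows; the toy frame is a hypothetical stand-in for the notebook's `df`.

```python
import pandas as pd

# Toy stand-in for the notebook's `df` (hypothetical values, not the real data)
toy = pd.DataFrame({"target": [0.0, 0.0, 0.0, 1.0]})

# Absolute counts and percentage shares per class, aligned on the class label
counts = toy["target"].value_counts()
shares = toy["target"].value_counts(normalize=True).mul(100).round(1)
balance = pd.DataFrame({"count": counts, "percent": shares})
print(balance)
```

Running the same two `value_counts` calls on `df['target']` would replace the rounded numbers with exact ones.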


In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x='education_level', hue='target', data=df, palette='Set2')
plt.title('Education Level vs Target')
plt.xlabel('Education Level')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Looking for Job Change')
plt.show()


### Education Level vs Target

The chart above compares the number of people at each education level by their willingness to change jobs (`target`). The counts in the table are approximate:

| Education Level   | Not Looking (`0`) | Looking (`1`) | % Looking |
|-------------------|-------------------|---------------|-----------|
| Graduate          | 8,000             | 3,000         | 27.3%     |
| Masters           | 3,300             | 900           | 21.4%     |
| High School       | 1,500             | 500           | 25.0%     |
| PhD               | 500               | 100           | 16.7%     |
| Primary School    | 300               | 100           | 25.0%     |

**Insights:**
- **Graduates** represent the largest group in the dataset and show a relatively high job-switching rate (27%).
- **Master's degree holders** show slightly lower willingness to change jobs (~21%).
- **PhD holders** are the least likely to be looking for a new job (only ~17%).
- **Primary school** and **High school** levels show surprisingly high switching rates (~25%), but these groups are relatively small in size.
- Overall, education level appears to have some influence on job change behavior.
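The per-level rates in the table can be derived with a cross-tabulation. This is a sketch under the assumption that `target` is coded 0/1; the toy data and the column names `not_looking`, `looking`, and `pct_looking` are illustrative choices, not from the original notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's `df` (hypothetical values, not the real data)
toy = pd.DataFrame({
    "education_level": ["Graduate"] * 4 + ["Masters"] * 2 + ["Phd"] * 2,
    "target": [1, 0, 0, 0, 1, 0, 0, 0],
})

# Count rows per (education level, target) pair
table = pd.crosstab(toy["education_level"], toy["target"])
table = table.rename(columns={0: "not_looking", 1: "looking"})

# Share of each education level that is looking for a job change
total = table["not_looking"] + table["looking"]
table["pct_looking"] = (table["looking"] / total * 100).round(1)
print(table)
```

Reporting the group totals next to the rates also makes it easier to see which rates rest on small groups.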



In [None]:
plt.figure(figsize=(8,5))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()


In [None]:
top_cities = df['city'].value_counts().head(10).reset_index()
top_cities.columns = ['city', 'count']

fig = px.bar(top_cities, 
             x='city', 
             y='count', 
             title='Top 10 Cities with Most Data', 
             color='count', 
             color_continuous_scale='Blues')

fig.update_layout(
    xaxis_title='City',
    yaxis_title='Count',
    xaxis_tickangle=-45
)

fig.show()


Among the top 10 most frequent cities in the dataset:

- **City_103** has the highest number of records: **4,355 entries**.
- **City_104** has the lowest among the top 10: **301 entries**.

This shows a significant concentration of data in a few major cities, particularly City_103.
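How concentrated the records are can be quantified with normalized value counts. A small sketch, using a hypothetical toy frame in place of the notebook's `df`:

```python
import pandas as pd

# Toy stand-in for the notebook's `df` (hypothetical values, not the real data)
toy = pd.DataFrame({"city": ["city_a"] * 5 + ["city_b"] * 3 + ["city_c"] * 2})

# Fraction of all records per city, largest first
city_share = toy["city"].value_counts(normalize=True)

# Cumulative share covered by the two most frequent cities
top2_share = city_share.head(2).sum()
print(city_share)
print(f"Top 2 cities cover {top2_share:.0%} of records")
```

In the notebook, `df['city'].value_counts(normalize=True).head(10).sum()` would give the share of records covered by the top 10 cities.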


In [None]:
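def convert_experience(x):
    # `experience` was already converted to numbers in the imputation step,
    # so here this mostly casts floats to ints. The string branches are kept
    # so the cell also works on the raw column.
    if pd.isna(x):
        return np.nan
    elif x == '>20':
        return 21
    elif x == '<1':
        return 0
    else:
        return int(float(x))

df['experience_num'] = df['experience'].apply(convert_experience)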


In [None]:
fig = px.histogram(df, x='experience_num', nbins=20, color_discrete_sequence=['#fc8d62'],
                   title='Distribution of Experience (Years)', marginal='rug')
fig.update_layout(xaxis_title='Years of Experience', yaxis_title='Count')
fig.show()


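> The histogram shows a peak around **7 years** of experience, with **2,244 individuals** in that bar. Because a histogram bar can group adjacent values, this figure is a bin count and not necessarily the count for exactly 7 years.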
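A histogram bar can merge neighbouring values, so the most common single value is safer to read from `value_counts`. The sketch below uses hypothetical toy data to show how a bin count and a per-value count can disagree.

```python
import pandas as pd

# Toy stand-in for `df['experience_num']` (hypothetical values, not the real data)
toy = pd.DataFrame({"experience_num": [21, 21, 21, 6, 6, 7, 7, 3]})

# Exact count per distinct number of years
per_year = toy["experience_num"].value_counts()
most_common_year = per_year.idxmax()
most_common_count = per_year.max()

# Count for a two-year-wide bin covering 6 and 7 years, as a histogram might draw it
bin_6_7 = toy["experience_num"].between(6, 7).sum()

print(most_common_year, most_common_count, bin_6_7)
```

Here the bin for 6 to 7 years is the tallest bar even though 21 is the most frequent single value.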


In [None]:
discipline_target = df.groupby('major_discipline')['target'].mean().sort_values(ascending=False).reset_index()

In [None]:
discipline_target

###  Job Change Rate by Major Discipline

Here is the proportion of individuals interested in changing jobs (`target = 1`) per major discipline:

- **Other**: 26.8%
- **Business Degree**: 26.3%
- **STEM**: 26.2%
- **No Major**: 24.7%
- **Humanities**: 21.1%
- **Arts**: 20.9%
- **Unknown**: 19.5%

> **Insight:** The top three groups (Other, Business Degree, STEM) are within about 0.6 percentage points of each other, so discipline barely separates them.
> Humanities, Arts, and "Unknown" sit roughly 5 to 7 points below them, with "Unknown" the lowest at 19.5%.
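Small differences between rates are easier to judge when the group sizes are shown alongside them. A minimal sketch using named aggregation; the toy data and the output column names `rate` and `n` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the notebook's `df` (hypothetical values, not the real data)
toy = pd.DataFrame({
    "major_discipline": ["STEM"] * 4 + ["Arts"] * 2 + ["Other"],
    "target": [1, 0, 0, 0, 0, 0, 1],
})

# Job-change rate and number of rows per discipline, highest rate first
discipline = (
    toy.groupby("major_discipline")["target"]
       .agg(rate="mean", n="size")
       .sort_values("rate", ascending=False)
)
print(discipline)
```

A rate computed from a handful of rows (like "Other" in this toy example) deserves less weight than one computed from thousands.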


In [None]:
pivot_df = df.pivot_table(index='company_size', columns='education_level', 
                          values='enrollee_id', aggfunc='count', fill_value=0).reset_index()

melted = pivot_df.melt(id_vars='company_size', var_name='education_level', value_name='count')



In [None]:
fig = px.bar(melted, 
             x='company_size', 
             y='count', 
             color='education_level', 
             title='Education Level by Company Size',
             barmode='stack',
             color_discrete_sequence=px.colors.qualitative.Set2)

fig.update_layout(xaxis_title='Company Size', yaxis_title='Number of People')
fig.show()


###  Education Level Distribution Across Company Sizes

The table below shows how different education levels are distributed across various company sizes:

| Company Size | Graduate | High School | Masters | PhD | Primary School | Unknown |
|--------------|----------|-------------|---------|-----|----------------|---------|
| 10/49        | 998      | 116         | 300     | 25  | 8              | 24      |
| 100-500      | 1646     | 148         | 670     | 66  | 11             | 30      |
| 1000-4999    | 788      | 77          | 394     | 50  | 5              | 14      |
| 10000+       | 1257     | 92          | 590     | 67  | 5              | 8       |
| 50-99        | 1994     | 216         | 768     | 49  | 17             | 39      |
| 500-999      | 571      | 41          | 235     | 19  | 1              | 10      |
| 5000-9999    | 358      | 31          | 142     | 24  | 2              | 6       |
| <10          | 817      | 133         | 296     | 24  | 15             | 23      |
| Unknown      | 3169     | 1163        | 966     | 90  | 244            | 306     |

**Insight:**
- Among people with a known company size, most Masters and PhD holders work at companies in the 100-500 bracket or larger.
- Smaller companies and those with unknown size have a higher share of individuals with lower or unspecified education levels.
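Raw counts are hard to compare across company sizes because the rows have very different totals. One option is a row-normalized cross-tabulation, sketched here on hypothetical toy data.

```python
import pandas as pd

# Toy stand-in for the notebook's `df` (hypothetical values, not the real data)
toy = pd.DataFrame({
    "company_size": ["<10", "<10", "10000+", "10000+", "10000+", "10000+"],
    "education_level": ["Graduate", "High School", "Graduate", "Masters", "Masters", "Phd"],
})

# Percentage of each education level within every company size (rows sum to 100)
edu_share = (
    pd.crosstab(toy["company_size"], toy["education_level"], normalize="index")
      .mul(100)
      .round(1)
)
print(edu_share)
```

Applied to `df`, each row then shows the education mix of one company size, which makes statements about "higher share" directly checkable.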


In [None]:
sns.histplot(df['city_development_index'],bins=10,kde=True)
plt.show()

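* The city development index ranges from a minimum of 0.448000 to a maximum of 0.949000
* Its mean is 0.828848 and its standard deviation is 0.123362
* The minimum lies about three standard deviations below the mean, so the lowest values are candidate outliers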
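One common way to flag outliers is the 1.5 × IQR rule. The sketch below applies it to a hypothetical toy series; the rule is a swapped-in convention for illustration, not something the original notebook computes.

```python
import pandas as pd

# Toy stand-in for `df['city_development_index']` (hypothetical values)
cdi = pd.Series([0.92, 0.92, 0.91, 0.90, 0.92, 0.89, 0.93, 0.45],
                name="city_development_index")

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = cdi.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as candidate outliers
outliers = cdi[(cdi < lower) | (cdi > upper)]
print(outliers)
```

Flagged values are only candidates: a low development index may be a genuine property of a city and not a data error.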

In [None]:
sns.countplot(x='gender',data=df)
plt.show()

* Most people in the dataset are male
* 4,508 people did not specify a gender (shown as `Unknown` after imputation)

In [None]:
df['enrolled_university'].value_counts(dropna=False)


In [None]:
sns.countplot(x='enrolled_university', data=df)
plt.show()

* Most people are not enrolled in any university course
* 386 people did not specify their enrollment status

In [None]:
df['major_discipline'].value_counts(dropna=False)



In [None]:
sns.countplot(x='major_discipline', data=df)
plt.show()

* Most people have STEM as their major discipline
* 2,813 people did not specify a major discipline

In [None]:
sns.countplot(x='experience', data=df)
plt.show()

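* The largest single group has more than 20 years of experience (encoded as 21)
* 65 people did not specify their experience; these entries were imputed with the median earlier, so they do not appear as a separate bar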

In [None]:
sns.histplot(x=df['gender'],hue=df['target'])
plt.show()


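* In absolute numbers, males form the largest group among those who want to change jobs. This is expected because most people in the dataset are male, so counts alone do not show whether the rate differs by gender.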
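To separate group size from propensity, counts and rates can be computed side by side. A minimal sketch on hypothetical toy data; the output column names `looking`, `total`, and `rate` are illustrative.

```python
import pandas as pd

# Toy stand-in for the notebook's `df` (hypothetical values, not the real data)
toy = pd.DataFrame({
    "gender": ["Male"] * 6 + ["Female"] * 2,
    "target": [1, 1, 0, 0, 0, 0, 1, 0],
})

# Per gender: number looking for a change, group size, and share looking
by_gender = toy.groupby("gender")["target"].agg(looking="sum", total="size", rate="mean")
print(by_gender)
```

In this toy example males have the higher count but females have the higher rate, which is the kind of difference a count plot alone can hide.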