Kaggle Dataset used: [Job Placement in Developing Countries](https://www.kaggle.com/datasets/ahsan81/job-placement-dataset)

In [1]:
# Import essential libraries
import numpy as np
import pandas as pd
import seaborn as sb
%matplotlib inline
import matplotlib.pyplot as plt
sb.set()

## Description of dataset

---

> **gender** : Gender of the candidate

> **ssc_percentage** : Secondary school certificate (SSC) exams percentage (10th Grade)

> **ssc_board** : Board of education for ssc exams

> **hsc_percentage** : Higher secondary exams percentage (12th Grade)

> **hsc_board** : Board of education for hsc exams

> **hsc_subject** : Subject of study for hsc

> **degree_percentage** : Percentage of marks in undergrad degree

> **undergrad_degree** : Undergrad degree majors

> **work_experience** : Past work experience

> **emp_test_percentage** : Aptitude test percentage

> **specialisation** : Postgrad degree majors - (MBA specialisation)

> **mba_percent** : Percentage of marks in MBA degree
 
> **status** (RESPONSE VARIABLE) : Status of placement (Placed or Not Placed).

**Background Context on SSC/HSC Central vs Other Board**

The central board usually caters to middle-class children from urban or semi-urban backgrounds, while other boards cater to students from a range of backgrounds, including the rural hinterland. The central board is also described as child-centric and flexible: the syllabus is designed to make learning fun, and each chapter comes with activities and projects to keep children interested. Other boards, by contrast, are usually exam-centric and focus on preparing children to do well in their board examinations.

Read more at:
- http://timesofindia.indiatimes.com/articleshow/2110832.cms?utm_source=contentofinterest&utm_medium=text&utm_campaign=cppst 

- https://yellowslate.com/blog/understand-cbse-vs-ssc-before-you-regret 

In [2]:
df= pd.read_csv('Job_Placement_Data.csv')
df

Unnamed: 0,gender,ssc_percentage,ssc_board,hsc_percentage,hsc_board,hsc_subject,degree_percentage,undergrad_degree,work_experience,emp_test_percentage,specialisation,mba_percent,status
0,M,67.00,Others,91.00,Others,Commerce,58.00,Sci&Tech,No,55.0,Mkt&HR,58.80,Placed
1,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed
2,M,65.00,Central,68.00,Central,Arts,64.00,Comm&Mgmt,No,75.0,Mkt&Fin,57.80,Placed
3,M,56.00,Central,52.00,Central,Science,52.00,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed
4,M,85.80,Central,73.60,Central,Commerce,73.30,Comm&Mgmt,No,96.8,Mkt&Fin,55.50,Placed
...,...,...,...,...,...,...,...,...,...,...,...,...,...
210,M,80.60,Others,82.00,Others,Commerce,77.60,Comm&Mgmt,No,91.0,Mkt&Fin,74.49,Placed
211,M,58.00,Others,60.00,Others,Science,72.00,Sci&Tech,No,74.0,Mkt&Fin,53.62,Placed
212,M,67.00,Others,67.00,Others,Commerce,73.00,Comm&Mgmt,Yes,59.0,Mkt&Fin,69.72,Placed
213,F,74.00,Others,66.00,Others,Commerce,58.00,Comm&Mgmt,No,70.0,Mkt&HR,60.23,Placed


## **Renaming Variables**

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   gender               215 non-null    object 
 1   ssc_percentage       215 non-null    float64
 2   ssc_board            215 non-null    object 
 3   hsc_percentage       215 non-null    float64
 4   hsc_board            215 non-null    object 
 5   hsc_subject          215 non-null    object 
 6   degree_percentage    215 non-null    float64
 7   undergrad_degree     215 non-null    object 
 8   work_experience      215 non-null    object 
 9   emp_test_percentage  215 non-null    float64
 10  specialisation       215 non-null    object 
 11  mba_percent          215 non-null    float64
 12  status               215 non-null    object 
dtypes: float64(5), object(8)
memory usage: 22.0+ KB


In [4]:
# Create a copy of the Dataset
df_clean = df.copy()

# Converting variable names to uppercase and renaming 'mba_percent'
df_clean.columns = df_clean.columns.str.upper()
df_clean = df_clean.rename(columns={"MBA_PERCENT": "MBA_PERCENTAGE"})

df_clean

Unnamed: 0,GENDER,SSC_PERCENTAGE,SSC_BOARD,HSC_PERCENTAGE,HSC_BOARD,HSC_SUBJECT,DEGREE_PERCENTAGE,UNDERGRAD_DEGREE,WORK_EXPERIENCE,EMP_TEST_PERCENTAGE,SPECIALISATION,MBA_PERCENTAGE,STATUS
0,M,67.00,Others,91.00,Others,Commerce,58.00,Sci&Tech,No,55.0,Mkt&HR,58.80,Placed
1,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed
2,M,65.00,Central,68.00,Central,Arts,64.00,Comm&Mgmt,No,75.0,Mkt&Fin,57.80,Placed
3,M,56.00,Central,52.00,Central,Science,52.00,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed
4,M,85.80,Central,73.60,Central,Commerce,73.30,Comm&Mgmt,No,96.8,Mkt&Fin,55.50,Placed
...,...,...,...,...,...,...,...,...,...,...,...,...,...
210,M,80.60,Others,82.00,Others,Commerce,77.60,Comm&Mgmt,No,91.0,Mkt&Fin,74.49,Placed
211,M,58.00,Others,60.00,Others,Science,72.00,Sci&Tech,No,74.0,Mkt&Fin,53.62,Placed
212,M,67.00,Others,67.00,Others,Commerce,73.00,Comm&Mgmt,Yes,59.0,Mkt&Fin,69.72,Placed
213,F,74.00,Others,66.00,Others,Commerce,58.00,Comm&Mgmt,No,70.0,Mkt&HR,60.23,Placed


In [5]:
# Checking data type of each column
print(df_clean.dtypes) 

GENDER                  object
SSC_PERCENTAGE         float64
SSC_BOARD               object
HSC_PERCENTAGE         float64
HSC_BOARD               object
HSC_SUBJECT             object
DEGREE_PERCENTAGE      float64
UNDERGRAD_DEGREE        object
WORK_EXPERIENCE         object
EMP_TEST_PERCENTAGE    float64
SPECIALISATION          object
MBA_PERCENTAGE         float64
STATUS                  object
dtype: object


## **Checking for duplicates**

In [6]:
# Check for duplicate data
print(df_clean.duplicated())

# Counting duplicate data
print("\nNumber of duplicate records : ", df_clean.duplicated().sum())

0      False
1      False
2      False
3      False
4      False
       ...  
210    False
211    False
212    False
213    False
214    False
Length: 215, dtype: bool

Number of duplicate records :  0



### Observations:

*   Total of 215 datapoints
*   No null values
*   No duplicate datapoints
*   13 variables: 5 of type float64 and 8 of type object
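
The checks behind these observations can be sketched on a small made-up frame. The `toy` dataframe and its values are hypothetical stand-ins, not rows from the real dataset; its third row deliberately repeats the first so that the duplicate count is non-zero.

```python
import pandas as pd

# Hypothetical toy data (not the real placement dataset);
# the third row deliberately repeats the first one.
toy = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "ssc_percentage": [67.0, 74.0, 67.0],
    "status": ["Placed", "Placed", "Placed"],
})

n_rows, n_cols = toy.shape                   # number of datapoints and variables
null_counts = toy.isna().sum()               # missing values per column
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row

print("Rows:", n_rows, "Columns:", n_cols)
print("Nulls per column:")
print(null_counts)
print("Duplicate rows:", n_duplicates)
```

Running the same three calls on `df_clean` should reproduce the counts listed in the observations.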


## **One-Hot Encoding of Categorical variables**

One-hot encoding transforms each categorical variable into a set of binary (0/1) indicator columns, one per category. Most machine learning algorithms require numerical input, so the categorical variables must be encoded this way before modelling.

References: https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/

In [7]:
# Import the encoder from sklearn
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()

# OneHotEncoding of categorical predictors (not the response)
df_clean_cat = df_clean[['GENDER','SSC_BOARD','HSC_BOARD','HSC_SUBJECT','UNDERGRAD_DEGREE','SPECIALISATION','WORK_EXPERIENCE']]
ohe.fit(df_clean_cat)
df_clean_cat_ohe = pd.DataFrame(ohe.transform(df_clean_cat).toarray(), 
                                  columns=ohe.get_feature_names_out(df_clean_cat.columns))

# Check the encoded variables
df_clean_cat_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 16 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   GENDER_F                    215 non-null    float64
 1   GENDER_M                    215 non-null    float64
 2   SSC_BOARD_Central           215 non-null    float64
 3   SSC_BOARD_Others            215 non-null    float64
 4   HSC_BOARD_Central           215 non-null    float64
 5   HSC_BOARD_Others            215 non-null    float64
 6   HSC_SUBJECT_Arts            215 non-null    float64
 7   HSC_SUBJECT_Commerce        215 non-null    float64
 8   HSC_SUBJECT_Science         215 non-null    float64
 9   UNDERGRAD_DEGREE_Comm&Mgmt  215 non-null    float64
 10  UNDERGRAD_DEGREE_Others     215 non-null    float64
 11  UNDERGRAD_DEGREE_Sci&Tech   215 non-null    float64
 12  SPECIALISATION_Mkt&Fin      215 non-null    float64
 13  SPECIALISATION_Mkt&HR       215 non-null    float64

All the categorical features have been one-hot encoded: their datatype has changed from 'object' to 'float64'.

In [8]:
# Encode 'STATUS' as a single binary column 'Placed' (1 = Placed, 0 = Not Placed)
status = pd.get_dummies(df_clean['STATUS'],drop_first = True)
df_clean.drop(['STATUS'],axis=1,inplace=True)
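
The effect of `drop_first=True` can be seen on a small made-up series (the `status_toy` values are hypothetical). `pd.get_dummies` orders the categories alphabetically, so 'Not Placed' is the first column and is the one dropped, leaving a single 'Placed' indicator.

```python
import pandas as pd

# Hypothetical status values, for illustration only
status_toy = pd.Series(["Placed", "Not Placed", "Placed"], name="STATUS")

full = pd.get_dummies(status_toy)                      # one column per category
binary = pd.get_dummies(status_toy, drop_first=True)   # first category dropped

# Cast to int so the indicator is 0/1 regardless of the dummy dtype
placed = binary["Placed"].astype(int)

print(list(full.columns))    # ['Not Placed', 'Placed']
print(list(binary.columns))  # ['Placed']
print(placed.tolist())       # [1, 0, 1]
```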

In [9]:
# Combining Numeric features with the OHE Categorical features and response variable 'STATUS'
numeric_data = df_clean[['SSC_PERCENTAGE','HSC_PERCENTAGE','DEGREE_PERCENTAGE','EMP_TEST_PERCENTAGE','MBA_PERCENTAGE']]
df_clean_ohe = pd.concat([numeric_data, df_clean_cat_ohe, status], sort = False, axis = 1).reindex(index=numeric_data.index)

# Check the final dataframe 
df_clean_ohe

Unnamed: 0,SSC_PERCENTAGE,HSC_PERCENTAGE,DEGREE_PERCENTAGE,EMP_TEST_PERCENTAGE,MBA_PERCENTAGE,GENDER_F,GENDER_M,SSC_BOARD_Central,SSC_BOARD_Others,HSC_BOARD_Central,...,HSC_SUBJECT_Commerce,HSC_SUBJECT_Science,UNDERGRAD_DEGREE_Comm&Mgmt,UNDERGRAD_DEGREE_Others,UNDERGRAD_DEGREE_Sci&Tech,SPECIALISATION_Mkt&Fin,SPECIALISATION_Mkt&HR,WORK_EXPERIENCE_No,WORK_EXPERIENCE_Yes,Placed
0,67.00,91.00,58.00,55.0,58.80,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1
1,79.33,78.33,77.48,86.5,66.28,0.0,1.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1
2,65.00,68.00,64.00,75.0,57.80,0.0,1.0,1.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1
3,56.00,52.00,52.00,66.0,59.43,0.0,1.0,1.0,0.0,1.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0
4,85.80,73.60,73.30,96.8,55.50,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210,80.60,82.00,77.60,91.0,74.49,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1
211,58.00,60.00,72.00,74.0,53.62,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1
212,67.00,67.00,73.00,59.0,69.72,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1
213,74.00,66.00,58.00,70.0,60.23,1.0,0.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1


In [10]:
df_clean_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 22 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   SSC_PERCENTAGE              215 non-null    float64
 1   HSC_PERCENTAGE              215 non-null    float64
 2   DEGREE_PERCENTAGE           215 non-null    float64
 3   EMP_TEST_PERCENTAGE         215 non-null    float64
 4   MBA_PERCENTAGE              215 non-null    float64
 5   GENDER_F                    215 non-null    float64
 6   GENDER_M                    215 non-null    float64
 7   SSC_BOARD_Central           215 non-null    float64
 8   SSC_BOARD_Others            215 non-null    float64
 9   HSC_BOARD_Central           215 non-null    float64
 10  HSC_BOARD_Others            215 non-null    float64
 11  HSC_SUBJECT_Arts            215 non-null    float64
 12  HSC_SUBJECT_Commerce        215 non-null    float64
 13  HSC_SUBJECT_Science         215 non-null    float64

In [11]:
# Convert the 'Placed' column (encoded from 'STATUS') from uint to int
df_clean_ohe['Placed'] = df_clean_ohe['Placed'].astype('int')

df_clean_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 22 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   SSC_PERCENTAGE              215 non-null    float64
 1   HSC_PERCENTAGE              215 non-null    float64
 2   DEGREE_PERCENTAGE           215 non-null    float64
 3   EMP_TEST_PERCENTAGE         215 non-null    float64
 4   MBA_PERCENTAGE              215 non-null    float64
 5   GENDER_F                    215 non-null    float64
 6   GENDER_M                    215 non-null    float64
 7   SSC_BOARD_Central           215 non-null    float64
 8   SSC_BOARD_Others            215 non-null    float64
 9   HSC_BOARD_Central           215 non-null    float64
 10  HSC_BOARD_Others            215 non-null    float64
 11  HSC_SUBJECT_Arts            215 non-null    float64
 12  HSC_SUBJECT_Commerce        215 non-null    float64
 13  HSC_SUBJECT_Science         215 non-null    float64

In [12]:
# Save the cleaned, encoded dataframe to a new CSV file (the row index is written as the first column)
df_clean_ohe.to_csv('Job_Placement_Clean.csv')
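
One thing to keep in mind when reloading this file: by default `to_csv` writes the row index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. The sketch below shows this on a hypothetical two-row frame, using an in-memory buffer instead of a file. Whether to pass `index=False` when saving depends on how later steps read the file, so treat it as an option rather than a required change.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned data
toy = pd.DataFrame({"SSC_PERCENTAGE": [67.0, 79.33], "Placed": [1, 1]})

# Default behaviour: the index is written as an unnamed first column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# With index=False only the data columns are written
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(list(with_index.columns))     # ['Unnamed: 0', 'SSC_PERCENTAGE', 'Placed']
print(list(without_index.columns))  # ['SSC_PERCENTAGE', 'Placed']
```

Alternatively, keeping the default and reading back with `pd.read_csv(..., index_col=0)` restores the original index.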