Kaggle Dataset used: [Job Placement in Developing Countries](https://www.kaggle.com/datasets/ahsan81/job-placement-dataset)

In [1]:
# Import essential libraries
import numpy as np
import pandas as pd
import seaborn as sb
%matplotlib inline
import matplotlib.pyplot as plt
sb.set()

## Description of dataset

---

> **gender** : Gender of the candidate

> **ssc_percentage** : Secondary school exams percentage (10th Grade)

> **ssc_board** : Board of education for ssc exams

> **hsc_percentage** : Higher secondary exams percentage (12th Grade)

> **hsc_board** : Board of education for hsc exams

> **hsc_subject** : Subject of study for hsc

> **degree_percentage** : Percentage of marks in undergrad degree

> **undergrad_degree** : Undergrad degree majors

> **work_experience** : Past work experience

> **emp_test_percentage** : Aptitude test percentage

> **specialisation** : Postgrad degree majors - (MBA specialization)

> **mba_percent** : Percentage of marks in MBA degree
 
> **status** (RESPONSE VARIABLE) : Status of placement (`Placed` or `Not Placed`)

**Background Context on SSC/HSC Central vs Other Board**

The central board usually caters to middle-class children from urban or semi-urban backgrounds, while other boards cater to students from a range of backgrounds, including the rural hinterland. The central board is also child-centric and flexible: its syllabus is designed to make learning fun, and each chapter comes with activities and projects to keep children interested. Other boards, by contrast, are usually exam-centric and focus on ensuring that children study to do well in their board examinations.

Read more at:
- http://timesofindia.indiatimes.com/articleshow/2110832.cms?utm_source=contentofinterest&utm_medium=text&utm_campaign=cppst 

- https://yellowslate.com/blog/understand-cbse-vs-ssc-before-you-regret 

In [3]:
df = pd.read_csv('Job_Placement_Data.csv')
df

| | gender | ssc_percentage | ssc_board | hsc_percentage | hsc_board | hsc_subject | degree_percentage | undergrad_degree | work_experience | emp_test_percentage | specialisation | mba_percent | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 67.00 | Others | 91.00 | Others | Commerce | 58.00 | Sci&Tech | No | 55.0 | Mkt&HR | 58.80 | Placed |
| 1 | M | 79.33 | Central | 78.33 | Others | Science | 77.48 | Sci&Tech | Yes | 86.5 | Mkt&Fin | 66.28 | Placed |
| 2 | M | 65.00 | Central | 68.00 | Central | Arts | 64.00 | Comm&Mgmt | No | 75.0 | Mkt&Fin | 57.80 | Placed |
| 3 | M | 56.00 | Central | 52.00 | Central | Science | 52.00 | Sci&Tech | No | 66.0 | Mkt&HR | 59.43 | Not Placed |
| 4 | M | 85.80 | Central | 73.60 | Central | Commerce | 73.30 | Comm&Mgmt | No | 96.8 | Mkt&Fin | 55.50 | Placed |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 210 | M | 80.60 | Others | 82.00 | Others | Commerce | 77.60 | Comm&Mgmt | No | 91.0 | Mkt&Fin | 74.49 | Placed |
| 211 | M | 58.00 | Others | 60.00 | Others | Science | 72.00 | Sci&Tech | No | 74.0 | Mkt&Fin | 53.62 | Placed |
| 212 | M | 67.00 | Others | 67.00 | Others | Commerce | 73.00 | Comm&Mgmt | Yes | 59.0 | Mkt&Fin | 69.72 | Placed |
| 213 | F | 74.00 | Others | 66.00 | Others | Commerce | 58.00 | Comm&Mgmt | No | 70.0 | Mkt&HR | 60.23 | Placed |


## **Cleaning Data**

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   gender               215 non-null    object 
 1   ssc_percentage       215 non-null    float64
 2   ssc_board            215 non-null    object 
 3   hsc_percentage       215 non-null    float64
 4   hsc_board            215 non-null    object 
 5   hsc_subject          215 non-null    object 
 6   degree_percentage    215 non-null    float64
 7   undergrad_degree     215 non-null    object 
 8   work_experience      215 non-null    object 
 9   emp_test_percentage  215 non-null    float64
 10  specialisation       215 non-null    object 
 11  mba_percent          215 non-null    float64
 12  status               215 non-null    object 
dtypes: float64(5), object(8)
memory usage: 22.0+ KB
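
The column names reported by `df.info()` can also be compared against an expected list in code, which catches spelling differences such as `specialisation` versus `specialization`. The following is a minimal sketch: `EXPECTED_COLUMNS` and `check_schema` are hypothetical names introduced here, and `sample` is a stand-in built from the first two rows displayed above so that the snippet runs without the CSV file.

```python
import pandas as pd

# Column names exactly as they appear in the df.info() output above.
EXPECTED_COLUMNS = [
    "gender", "ssc_percentage", "ssc_board", "hsc_percentage", "hsc_board",
    "hsc_subject", "degree_percentage", "undergrad_degree", "work_experience",
    "emp_test_percentage", "specialisation", "mba_percent", "status",
]


def check_schema(frame, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names relative to `expected`."""
    actual = list(frame.columns)
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    return missing, unexpected


# Stand-in frame (assumption: two rows copied from the displayed output).
sample = pd.DataFrame(
    [
        ["M", 67.00, "Others", 91.00, "Others", "Commerce", 58.00,
         "Sci&Tech", "No", 55.0, "Mkt&HR", 58.80, "Placed"],
        ["M", 79.33, "Central", 78.33, "Others", "Science", 77.48,
         "Sci&Tech", "Yes", 86.5, "Mkt&Fin", 66.28, "Placed"],
    ],
    columns=EXPECTED_COLUMNS,
)

missing, unexpected = check_schema(sample)
print("Missing columns   :", missing)
print("Unexpected columns:", unexpected)
```

On the real data, the same call would be `check_schema(df)`; two empty lists mean the file matches the expected schema.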


In [5]:
# Create a copy of the Dataset
df_clean = df.copy()

# Converting variable names to uppercase and renaming 'mba_percent'
df_clean.columns = df_clean.columns.str.upper()
df_clean = df_clean.rename(columns={"MBA_PERCENT": "MBA_PERCENTAGE"})

df_clean

| | GENDER | SSC_PERCENTAGE | SSC_BOARD | HSC_PERCENTAGE | HSC_BOARD | HSC_SUBJECT | DEGREE_PERCENTAGE | UNDERGRAD_DEGREE | WORK_EXPERIENCE | EMP_TEST_PERCENTAGE | SPECIALISATION | MBA_PERCENTAGE | STATUS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 67.00 | Others | 91.00 | Others | Commerce | 58.00 | Sci&Tech | No | 55.0 | Mkt&HR | 58.80 | Placed |
| 1 | M | 79.33 | Central | 78.33 | Others | Science | 77.48 | Sci&Tech | Yes | 86.5 | Mkt&Fin | 66.28 | Placed |
| 2 | M | 65.00 | Central | 68.00 | Central | Arts | 64.00 | Comm&Mgmt | No | 75.0 | Mkt&Fin | 57.80 | Placed |
| 3 | M | 56.00 | Central | 52.00 | Central | Science | 52.00 | Sci&Tech | No | 66.0 | Mkt&HR | 59.43 | Not Placed |
| 4 | M | 85.80 | Central | 73.60 | Central | Commerce | 73.30 | Comm&Mgmt | No | 96.8 | Mkt&Fin | 55.50 | Placed |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 210 | M | 80.60 | Others | 82.00 | Others | Commerce | 77.60 | Comm&Mgmt | No | 91.0 | Mkt&Fin | 74.49 | Placed |
| 211 | M | 58.00 | Others | 60.00 | Others | Science | 72.00 | Sci&Tech | No | 74.0 | Mkt&Fin | 53.62 | Placed |
| 212 | M | 67.00 | Others | 67.00 | Others | Commerce | 73.00 | Comm&Mgmt | Yes | 59.0 | Mkt&Fin | 69.72 | Placed |
| 213 | F | 74.00 | Others | 66.00 | Others | Commerce | 58.00 | Comm&Mgmt | No | 70.0 | Mkt&HR | 60.23 | Placed |


In [6]:
# Check for duplicate data
print(df_clean.duplicated())

# Counting duplicate data
print("\nNumber of duplicate records : ", df_clean.duplicated().sum())

0      False
1      False
2      False
3      False
4      False
       ...  
210    False
211    False
212    False
213    False
214    False
Length: 215, dtype: bool

Number of duplicate records :  0


In [7]:
# Checking data type of each column
print(df_clean.dtypes) 

GENDER                  object
SSC_PERCENTAGE         float64
SSC_BOARD               object
HSC_PERCENTAGE         float64
HSC_BOARD               object
HSC_SUBJECT             object
DEGREE_PERCENTAGE      float64
UNDERGRAD_DEGREE        object
WORK_EXPERIENCE         object
EMP_TEST_PERCENTAGE    float64
SPECIALISATION          object
MBA_PERCENTAGE         float64
STATUS                  object
dtype: object



### Observations from Data cleaning:

*   215 data points (rows) in total
*   No null values in any column
*   No duplicate rows
*   13 variables: 5 of type `float64` and 8 of type `object`
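
The observations above can be gathered in one pass rather than read off several separate outputs. The following is a minimal sketch: `cleaning_summary` is a hypothetical helper introduced here, and `sample` is a small stand-in (three rows and four columns copied from the displayed table) so that the snippet runs without the CSV file.

```python
import pandas as pd


def cleaning_summary(frame):
    """Collect the basic data-quality counts for a DataFrame."""
    numeric_cols = frame.select_dtypes(include="number").shape[1]
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "null_values": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "numeric_columns": numeric_cols,
        "non_numeric_columns": frame.shape[1] - numeric_cols,
    }


# Stand-in frame (assumption: a subset of the rows and columns shown above).
sample = pd.DataFrame(
    {
        "GENDER": ["M", "M", "M"],
        "SSC_PERCENTAGE": [67.00, 79.33, 56.00],
        "SSC_BOARD": ["Others", "Central", "Central"],
        "STATUS": ["Placed", "Placed", "Not Placed"],
    }
)

for name, value in cleaning_summary(sample).items():
    print(f"{name:>20} : {value}")
```

On the real data, the call would be `cleaning_summary(df_clean)`.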


In [8]:
# Save the cleaned DataFrame to a new CSV file
# (by default, the row index is written as the first column)
df_clean.to_csv('Job_Placement_Clean.csv')
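
How `to_csv` treats the row index matters when the cleaned file is read back: by default the index is written as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. The following is a minimal sketch of the options, using a two-row stand-in frame and an in-memory buffer (both assumptions made for illustration) instead of the real file.

```python
import io

import pandas as pd

# Stand-in frame (assumption: two values per column taken from the table above).
sample = pd.DataFrame({"GENDER": ["M", "F"], "MBA_PERCENTAGE": [58.80, 60.23]})

# Default behaviour: the index is written, then read back as a data column.
with_index = pd.read_csv(io.StringIO(sample.to_csv()))

# Option 1: do not write the index at all.
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

# Option 2: write the index, then restore it on load with index_col=0.
recovered = pd.read_csv(io.StringIO(sample.to_csv()), index_col=0)

print(list(with_index.columns))
print(list(without_index.columns))
print(list(recovered.columns))
```

Either option keeps the column list identical to `df_clean` when the file is reloaded; which one to use depends on whether the row index carries meaning.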