# Data Preparation of "Campus Recruitment" Data Set

#### Nicolas Arrieche, Emery Stokes

The creator describes the data set as follows: "This data set consists of placement data of students in Jain University Bangalore. It includes secondary and higher secondary school percentage and specialization. It also includes degree specialization, type and work experience, and salary offers to the placed students".

More information about the dataset can be found [here](https://www.kaggle.com/benroshan/factors-affecting-campus-placement).  

---

We will now clean the data set and prepare it for visualization, analysis, and interactivity.

In [9]:
import pandas as pd
import numpy as np

### Read CSV  


In [27]:
raw_df = pd.read_csv(
    'Placement_Data_Full_Class.csv',
)

print("Shape = ", raw_df.shape)
raw_df.head()

Shape =  (215, 15)


Unnamed: 0,sl_no,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status,salary
0,1,M,67.0,Others,91.0,Others,Commerce,58.0,Sci&Tech,No,55.0,Mkt&HR,58.8,Placed,270000.0
1,2,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed,200000.0
2,3,M,65.0,Central,68.0,Central,Arts,64.0,Comm&Mgmt,No,75.0,Mkt&Fin,57.8,Placed,250000.0
3,4,M,56.0,Central,52.0,Central,Science,52.0,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed,
4,5,M,85.8,Central,73.6,Central,Commerce,73.3,Comm&Mgmt,No,96.8,Mkt&Fin,55.5,Placed,425000.0


### Renaming the columns to clarify what they mean


In [34]:
# Collect the original column names in order
original_names = list(raw_df.columns)

new_names = ['Serial Number','Gender','Secondary Education Percentage (10th Grade)', 'Board of Education',
             'Higher Secondary Education Percentage (12th Grade)','Higher Board of Education',
             'Specialization in Higher Secondary Education','Degree Percentage', 'Degree Type', 'Work Experience', 
             'Employability Test', 'MBA Specialization', 'MBA percentage', 'Placement Status','Salary']

name_change = {}
for i in range(len(original_names)):
    name_change[original_names[i]] = new_names[i]
    

labelled_df = raw_df.rename(columns=name_change)
labelled_df.head()

Unnamed: 0,Serial Number,Gender,Secondary Education Percentage (10th Grade),Board of Education,Higher Secondary Education Percentage (12th Grade),Higher Board of Education,Specialization in Higher Secondary Education,Degree Percentage,Degree Type,Work Experience,Employability Test,MBA Specialization,MBA percentage,Placement Status,Salary
0,1,M,67.0,Others,91.0,Others,Commerce,58.0,Sci&Tech,No,55.0,Mkt&HR,58.8,Placed,270000.0
1,2,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed,200000.0
2,3,M,65.0,Central,68.0,Central,Arts,64.0,Comm&Mgmt,No,75.0,Mkt&Fin,57.8,Placed,250000.0
3,4,M,56.0,Central,52.0,Central,Science,52.0,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed,
4,5,M,85.8,Central,73.6,Central,Commerce,73.3,Comm&Mgmt,No,96.8,Mkt&Fin,55.5,Placed,425000.0
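The renaming step pairs old and new names purely by position, so a missing or extra entry in `new_names` would either raise an `IndexError` or silently leave columns unrenamed. A minimal sketch of a guarded version is shown here; `build_rename_map` is a hypothetical helper that is not part of the original notebook, and the three column names are only a sample.

```python
# Hypothetical helper, not part of the original notebook.
def build_rename_map(original_names, new_names):
    """Pair each original column name with its replacement, by position."""
    if len(original_names) != len(new_names):
        raise ValueError(
            f"expected {len(original_names)} new names, got {len(new_names)}"
        )
    return dict(zip(original_names, new_names))


# Three of the original column names, used here only for illustration.
example_map = build_rename_map(
    ["sl_no", "gender", "ssc_p"],
    ["Serial Number", "Gender", "Secondary Education Percentage (10th Grade)"],
)
print(example_map["ssc_p"])
```

The resulting dictionary can be passed to `rename(columns=...)` in the same way as `name_change`.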


### Removing Unnecessary Columns and Editing Others for Easier Use
To simplify the dataset, we will remove columns that will not be used in our analysis or in making the data frame interactive.

The serial number carries no useful information for the viewer. The board of education columns (central or other) will not mean much to people outside the region. The employability test is conducted by the college, so a viewer has no way to interpret the scores.

In [40]:
dropped_df = labelled_df.drop(columns=['Serial Number', 'Board of Education', 'Higher Board of Education', 'Employability Test'])
dropped_df.head()

Unnamed: 0,Gender,Secondary Education Percentage (10th Grade),Higher Secondary Education Percentage (12th Grade),Specialization in Higher Secondary Education,Degree Percentage,Degree Type,Work Experience,MBA Specialization,MBA percentage,Placement Status,Salary
0,M,67.0,91.0,Commerce,58.0,Sci&Tech,No,Mkt&HR,58.8,Placed,270000.0
1,M,79.33,78.33,Science,77.48,Sci&Tech,Yes,Mkt&Fin,66.28,Placed,200000.0
2,M,65.0,68.0,Arts,64.0,Comm&Mgmt,No,Mkt&Fin,57.8,Placed,250000.0
3,M,56.0,52.0,Science,52.0,Sci&Tech,No,Mkt&HR,59.43,Not Placed,
4,M,85.8,73.6,Commerce,73.3,Comm&Mgmt,No,Mkt&Fin,55.5,Placed,425000.0


### Missing Values  
Next, we will display the number of missing values in each remaining column of the dataset.

In [41]:
# display number of missing values per column
print("Total Missing Values:\n\n" + str(dropped_df.isnull().sum()))

Total Missing Values:

Gender                                                 0
Secondary Education Percentage (10th Grade)            0
Higher Secondary Education Percentage (12th Grade)     0
Specialization in Higher Secondary Education           0
Degree Percentage                                      0
Degree Type                                            0
Work Experience                                        0
MBA Specialization                                     0
MBA percentage                                         0
Placement Status                                       0
Salary                                                67
dtype: int64


####  The only missing values are the 67 in Salary. The sample rows suggest these belong to students who were not placed, so we keep them as they are and continue.
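The 67 missing `Salary` values in the output above are worth a quick check against `Placement Status` before moving on. The sketch below runs that check on a small illustrative frame; the rows are toy data, not the real dataset, and `sample_df` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy rows mirroring the two relevant columns; values are illustrative only.
sample_df = pd.DataFrame(
    {
        "Placement Status": ["Placed", "Placed", "Not Placed", "Placed"],
        "Salary": [270000.0, 200000.0, np.nan, 425000.0],
    }
)

missing_salary = sample_df["Salary"].isnull()
not_placed = sample_df["Placement Status"] == "Not Placed"

# True only if every missing salary belongs to an unplaced student and vice versa.
salary_gaps_match_status = bool((missing_salary == not_placed).all())
print(salary_gaps_match_status)
```

Running the same comparison on `dropped_df` would show whether the missing salaries line up exactly with the unplaced students.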


---  

### Write Cleaned Dataset to CSV  
We will now write this cleaned version of the dataset to CSV for use in the creation of the interactive dashboard.

In [43]:
dropped_df.to_csv('cleaned_data.csv', index=False)
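
As a sanity check on the export, the file can be read back and compared with the frame that was written. The sketch below does this with a small illustrative frame and a temporary directory; the file name `cleaned_data_example.csv` and the frame contents are assumptions made for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small illustrative frame; not the real dataset.
small_df = pd.DataFrame(
    {
        "Gender": ["M", "M"],
        "Placement Status": ["Placed", "Not Placed"],
        "Salary": [270000.0, np.nan],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "cleaned_data_example.csv")
    # index=False keeps the row index out of the file
    small_df.to_csv(csv_path, index=False)
    reloaded_df = pd.read_csv(csv_path)

same_columns = list(reloaded_df.columns) == list(small_df.columns)
same_shape = reloaded_df.shape == small_df.shape
print(same_columns, same_shape)
```

Without `index=False`, the row index would be written as an extra unnamed column and would reappear as `Unnamed: 0` when the file is loaded.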