# Predicting Employment of Students at a University Campus

## Phase 1: Data Pre-processing, Exploration, and Visualisation 

### Group 46
### Zhaojin Liu s3206722, Martin Thu s3494324, Klara Vickov s3873315

## Table of Contents


## Introduction

In [1]:
import warnings

warnings.filterwarnings("ignore")

Import the necessary libraries.

In [2]:
import pandas as pd 
import requests
import numpy as np
import io

### Dataset Source
*The dataset was sourced from Kaggle.*

### Dataset Details
*Describe in detail what the dataset is about*

*Report number of observations (rows) and features (columns)*

*Print 10 random observations*

Run this setting before displaying any pandas data frames so that all columns are shown:

In [3]:
pd.set_option('display.max_columns', None) 

Read in the dataset from the GitHub repository.

In [4]:
df_url = 'https://raw.githubusercontent.com/kvick1/ML_A1/main/Placement_Data_Full_Class.csv'
response = requests.get(df_url)  # keep TLS certificate verification enabled
response.raise_for_status()  # stop early if the download failed
placement = pd.read_csv(io.StringIO(response.content.decode('utf-8')))

In [5]:
placement.shape

(215, 15)

In [6]:
placement.sample(n=10, random_state=999)

,sl_no,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status,salary
148,149,F,77.0,Central,86.0,Central,Arts,56.0,Others,No,57.0,Mkt&Fin,64.08,Placed,240000.0
191,192,M,67.0,Others,61.0,Central,Science,72.0,Comm&Mgmt,No,72.0,Mkt&Fin,61.01,Placed,264000.0
125,126,F,84.0,Central,73.0,Central,Commerce,73.0,Comm&Mgmt,No,75.0,Mkt&Fin,73.33,Placed,350000.0
68,69,F,69.7,Central,47.0,Central,Commerce,72.7,Sci&Tech,No,79.0,Mkt&HR,59.24,Not Placed,
187,188,M,78.5,Central,65.5,Central,Science,67.0,Sci&Tech,Yes,95.0,Mkt&Fin,64.86,Placed,280000.0
91,92,M,52.0,Central,57.0,Central,Commerce,50.8,Comm&Mgmt,No,67.0,Mkt&HR,62.79,Not Placed,
174,175,M,73.24,Others,50.83,Others,Science,64.27,Sci&Tech,Yes,64.0,Mkt&Fin,66.23,Placed,500000.0
7,8,M,82.0,Central,64.0,Central,Science,66.0,Sci&Tech,Yes,67.0,Mkt&Fin,62.14,Placed,252000.0
122,123,F,66.5,Central,66.8,Central,Arts,69.3,Comm&Mgmt,Yes,80.4,Mkt&Fin,71.0,Placed,236000.0
47,48,M,63.0,Central,60.0,Central,Commerce,57.0,Comm&Mgmt,Yes,78.0,Mkt&Fin,54.55,Placed,204000.0


### Dataset Features
*Explain features that will be included in project in a table format*

*Table needs one feature per row with 4 columns - feature name, data type, units, brief description*

*Target Feature*

The `salary` feature will not be included in our project because it is recorded only for students who received a placement. As our model predicts only whether a student is placed, the salary amount is irrelevant. Keeping it would also leak the target, since a non-missing salary implies that the student was placed.
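The relationship between `salary` and `status` can be checked directly. The following is a minimal sketch that uses four rows copied from the random sample shown earlier; the names `sample`, `salary_missing`, and `salary_mirrors_status` are illustrative and not part of the notebook. On the full data, the same check would presumably be run on `placement` before `salary` is dropped.

```python
import numpy as np
import pandas as pd

# Four rows copied from the random sample shown earlier (columns reduced for brevity).
sample = pd.DataFrame({
    'sl_no': [149, 69, 92, 188],
    'status': ['Placed', 'Not Placed', 'Not Placed', 'Placed'],
    'salary': [240000.0, np.nan, np.nan, 280000.0],
})

# Cross-tabulate placement status against whether a salary is recorded.
salary_missing = sample['salary'].isna().rename('salary_missing')
print(pd.crosstab(sample['status'], salary_missing))

# True when salary is missing exactly for the students who were not placed.
salary_mirrors_status = bool((salary_missing == (sample['status'] == 'Not Placed')).all())
print(salary_mirrors_status)
```

If the flag is true on the full dataset as well, `salary` carries the same information as the target and dropping it is the safer choice.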

In [7]:
placement.columns.values

array(['sl_no', 'gender', 'ssc_p', 'ssc_b', 'hsc_p', 'hsc_b', 'hsc_s',
       'degree_p', 'degree_t', 'workex', 'etest_p', 'specialisation',
       'mba_p', 'status', 'salary'], dtype=object)

In [8]:
placement.dtypes

sl_no               int64
gender             object
ssc_p             float64
ssc_b              object
hsc_p             float64
hsc_b              object
hsc_s              object
degree_p          float64
degree_t           object
workex             object
etest_p           float64
specialisation     object
mba_p             float64
status             object
salary            float64
dtype: object

In [9]:
placement = placement.drop(columns=['salary'])

In [10]:
placement.head()

,sl_no,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status
0,1,M,67.0,Others,91.0,Others,Commerce,58.0,Sci&Tech,No,55.0,Mkt&HR,58.8,Placed
1,2,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed
2,3,M,65.0,Central,68.0,Central,Arts,64.0,Comm&Mgmt,No,75.0,Mkt&Fin,57.8,Placed
3,4,M,56.0,Central,52.0,Central,Science,52.0,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed
4,5,M,85.8,Central,73.6,Central,Commerce,73.3,Comm&Mgmt,No,96.8,Mkt&Fin,55.5,Placed


## Goals and Objectives
*Goals and objectives for modeling*

## Data Cleaning and Preprocessing
*deal with missing values, outliers, incorrect values*

*drop ID-like columns*

*add subsections for each cleaning process*

Check for missing values. With `salary` dropped, no missing values remain.

In [11]:
placement.isnull().sum()

sl_no             0
gender            0
ssc_p             0
ssc_b             0
hsc_p             0
hsc_b             0
hsc_s             0
degree_p          0
degree_t          0
workex            0
etest_p           0
specialisation    0
mba_p             0
status            0
dtype: int64

In [12]:
placement.describe(include=np.number).round(3)

,sl_no,ssc_p,hsc_p,degree_p,etest_p,mba_p
count,215.0,215.0,215.0,215.0,215.0,215.0
mean,108.0,67.303,66.333,66.37,72.101,62.278
std,62.209,10.827,10.898,7.359,13.276,5.833
min,1.0,40.89,37.0,50.0,50.0,51.21
25%,54.5,60.6,60.9,61.0,60.0,57.945
50%,108.0,67.0,65.0,66.0,71.0,62.0
75%,161.5,75.7,73.0,72.0,83.5,66.255
max,215.0,89.4,97.7,91.0,98.0,77.89
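The minimum and maximum values above suggest that every score column stays within a plausible range. A simple range check can make this explicit. The sketch below assumes that each `*_p` column is a percentage, so valid values lie between 0 and 100; it uses the first five rows of the dataset as displayed earlier, and `first_rows` and `out_of_range` are illustrative names.

```python
import pandas as pd

# First five rows of the dataset as displayed earlier (score columns only).
first_rows = pd.DataFrame({
    'ssc_p': [67.0, 79.33, 65.0, 56.0, 85.8],
    'hsc_p': [91.0, 78.33, 68.0, 52.0, 73.6],
    'degree_p': [58.0, 77.48, 64.0, 52.0, 73.3],
    'etest_p': [55.0, 86.5, 75.0, 66.0, 96.8],
    'mba_p': [58.8, 66.28, 57.8, 59.43, 55.5],
})

# Assumption: every *_p column is a percentage, so valid values lie in [0, 100].
# Count, per column, how many values fall outside that range.
out_of_range = (first_rows.lt(0) | first_rows.gt(100)).sum()
print(out_of_range)
```

Applied to the full `placement` frame, a non-zero count in any column would flag values that need correcting or removing.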


Handle the ID-like column: rename `sl_no` to `ID` and set it as the index so that it is not used as a feature.

In [13]:
placement.rename(columns={'sl_no': 'ID'}, inplace=True)
placement

,ID,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status
0,1,M,67.00,Others,91.00,Others,Commerce,58.00,Sci&Tech,No,55.0,Mkt&HR,58.80,Placed
1,2,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed
2,3,M,65.00,Central,68.00,Central,Arts,64.00,Comm&Mgmt,No,75.0,Mkt&Fin,57.80,Placed
3,4,M,56.00,Central,52.00,Central,Science,52.00,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed
4,5,M,85.80,Central,73.60,Central,Commerce,73.30,Comm&Mgmt,No,96.8,Mkt&Fin,55.50,Placed
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210,211,M,80.60,Others,82.00,Others,Commerce,77.60,Comm&Mgmt,No,91.0,Mkt&Fin,74.49,Placed
211,212,M,58.00,Others,60.00,Others,Science,72.00,Sci&Tech,No,74.0,Mkt&Fin,53.62,Placed
212,213,M,67.00,Others,67.00,Others,Commerce,73.00,Comm&Mgmt,Yes,59.0,Mkt&Fin,69.72,Placed
213,214,F,74.00,Others,66.00,Others,Commerce,58.00,Comm&Mgmt,No,70.0,Mkt&HR,60.23,Placed


In [15]:
placement.set_index('ID', inplace=True)
placement

ID,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status
1,M,67.00,Others,91.00,Others,Commerce,58.00,Sci&Tech,No,55.0,Mkt&HR,58.80,Placed
2,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed
3,M,65.00,Central,68.00,Central,Arts,64.00,Comm&Mgmt,No,75.0,Mkt&Fin,57.80,Placed
4,M,56.00,Central,52.00,Central,Science,52.00,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed
5,M,85.80,Central,73.60,Central,Commerce,73.30,Comm&Mgmt,No,96.8,Mkt&Fin,55.50,Placed
...,...,...,...,...,...,...,...,...,...,...,...,...,...
211,M,80.60,Others,82.00,Others,Commerce,77.60,Comm&Mgmt,No,91.0,Mkt&Fin,74.49,Placed
212,M,58.00,Others,60.00,Others,Science,72.00,Sci&Tech,No,74.0,Mkt&Fin,53.62,Placed
213,M,67.00,Others,67.00,Others,Commerce,73.00,Comm&Mgmt,Yes,59.0,Mkt&Fin,69.72,Placed
214,F,74.00,Others,66.00,Others,Commerce,58.00,Comm&Mgmt,No,70.0,Mkt&HR,60.23,Placed


## Data Exploration and Visualisation
*charts and graphs as appropriate with proper labels and explanation*

*4 plots each of : 1-variable, 2-variable, 3-variable (12 in total)*

## Literature Review
*Advanced submission - 600+ words*

*minimum 10 journal articles and 4 conference papers in a dedicated references section*

## Summary and Conclusions
*Summarise Phase 1 and state the insights gained*

## References
*References from advanced submission and report in general*

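Display all pandas data frames rounded to 3 decimal places:

In [None]:
# Styler.set_precision has been deprecated since pandas 1.3 and `df` is not defined in this notebook,
# so set the display precision globally instead
pd.set_option('display.precision', 3)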

Do not display more than 10 rows of a pandas data frame.
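One way to enforce this row limit is the pandas `display.max_rows` option. The sketch below is illustrative only; `demo` is a hypothetical 25-row frame that stands in for any longer data frame in the notebook.

```python
import pandas as pd

# Truncate any data frame longer than 10 rows when it is displayed.
pd.set_option('display.max_rows', 10)

# Hypothetical 25-row frame used only to illustrate the truncation.
demo = pd.DataFrame({'x': range(25)})
print(demo)
```

With this option set, longer frames are shown as their first and last rows separated by an ellipsis row, so cells such as `placement` on its own line stay within the limit.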