# Data Challenge Assignment 


# Analysis

In [3]:
import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

from scipy.stats import chi2_contingency
from scipy.stats import norm
import math
from scipy.stats import f_oneway

import seaborn as sns

from sklearn import metrics
from sklearn.metrics import accuracy_score

In [5]:
df = pd.read_excel("Data_Pathrise.xlsx")
df.drop(columns=['primary_track'],inplace=True)

We remove rows with missing values since replacing them with 0 could greatly skew the results of our analysis.

In [6]:
df = df.dropna()

## Chi-Squared

We want to see if there are any correlations between the categorical variables in our dataset.

In [8]:
highest_level_of_education = pd.crosstab(df['highest_level_of_education'],df['placed'])
chi2, p, dof, ex = chi2_contingency(highest_level_of_education, correction=False)
highest_level_of_education 

placed,0,1
highest_level_of_education,Unnamed: 1_level_1,Unnamed: 2_level_1
Bachelor's Degree,278,285
Doctorate or Professional Degree,18,23
GED or equivalent,3,1
High School Graduate,1,5
Master's Degree,161,159
"Some College, No Degree",24,27
Some High School,2,1


In [9]:
employment_status = pd.crosstab(df['employment_status '],df['placed'])
chi2, p, dof, ex = chi2_contingency(employment_status, correction=False)
employment_status

placed,0,1
employment_status,Unnamed: 1_level_1,Unnamed: 2_level_1
Contractor,42,36
Employed Full-Time,96,94
Employed Part-Time,54,60
Student,147,174
Unemployed,148,137


In [11]:
length_of_job_search = pd.crosstab(df['length_of_job_search'],df['placed'])
chi2, p, dof, ex = chi2_contingency(length_of_job_search, correction=False)
length_of_job_search

placed,0,1
length_of_job_search,Unnamed: 1_level_1,Unnamed: 2_level_1
1-2 months,156,177
3-5 months,97,85
6 months to a year,48,52
Less than one month,166,162
Over a year,20,25


In [12]:
biggest_challenge_in_search = pd.crosstab(df['biggest_challenge_in_search'],df['placed'])
chi2, p, dof, ex = chi2_contingency(biggest_challenge_in_search, correction=False)
biggest_challenge_in_search

placed,0,1
biggest_challenge_in_search,Unnamed: 1_level_1,Unnamed: 2_level_1
Behavioral interviewing,13,13
Figuring out which jobs to apply for,30,32
Getting past final round interviews,57,68
Getting past mid-stage interviews,37,37
Getting past phone screens,49,35
Hearing back on my applications,176,205
Lack of relevant experience,51,49
Resume gap,7,6
Technical interviewing,58,52
Technical skills,9,4


In [13]:
professional_experience = pd.crosstab(df['professional_experience'],df['placed'])
chi2, p, dof, ex = chi2_contingency(professional_experience, correction=False)
professional_experience

placed,0,1
professional_experience,Unnamed: 1_level_1,Unnamed: 2_level_1
1-2 years,177,193
3-4 years,126,110
5+ years,53,60
Less than one year,131,138


In [14]:
work_authorization_status = pd.crosstab(df['work_authorization_status'],df['placed'])
chi2, p, dof, ex = chi2_contingency(work_authorization_status, correction=False)
work_authorization_status

placed,0,1
work_authorization_status,Unnamed: 1_level_1,Unnamed: 2_level_1
Canada Citizen,8,6
Citizen,233,250
F1 Visa/CPT,21,33
F1 Visa/OPT,146,134
Green Card,39,41
H1B,12,7
Not Authorized,3,0
Other,23,28
STEM OPT,2,2


In [15]:
cohort_tag = pd.crosstab(df['cohort_tag'],df['placed'])
chi2, p, dof, ex = chi2_contingency(cohort_tag, correction=False)
cohort_tag

placed,0,1
cohort_tag,Unnamed: 1_level_1,Unnamed: 2_level_1
APR18A,0,9
APR18B,6,10
APR19A,8,15
APR19B,9,7
APR20A,2,1
AUG18A,5,16
AUG19A,16,13
AUG19B,17,12
AUG19C,20,13
DEC18A,18,28


Our analysis shows us that only 'primary_track' has a significant relationship with 'placed'.

## ANOVA

In this section, we want to look at the correlation between our numerical variables and 'placed'.

In [16]:
f_oneway(df['number_of_interviews'][df['placed'] == 1],
               df['number_of_interviews'][df['placed'] == 0])

F_onewayResult(statistic=0.4968984543353839, pvalue=0.4810325672694369)

In [17]:
f_oneway(df['number_of_applications'][df['placed'] == 1],
               df['number_of_applications'][df['placed'] == 0])

F_onewayResult(statistic=0.00016274307079096282, pvalue=0.9898241801929569)

In [18]:
f_oneway(df['program_duration_days'][df['placed'] == 1],
              df['program_duration_days'][df['placed'] == 0])


F_onewayResult(statistic=54.15366647321312, pvalue=3.9013114673298523e-13)

There is a significant relationship between days in program and placement. 

## Placement 

Determining whether an applicant will be placed is a binary classification problem. We tackle this with a vanilla neural network as a starting point.

In [20]:
X_merge = df[['cohort_tag',
 'program_duration_days',
 'employment_status ',
 'highest_level_of_education',
 'length_of_job_search',
 'biggest_challenge_in_search',
 'professional_experience',
 'work_authorization_status',
 'number_of_interviews',
 'number_of_applications']]

In [22]:
y_merge = df['placed']

In [23]:
X_train_merge, X_test_merge, y_train_merge, y_test_merge = train_test_split(X_merge, y_merge,test_size=.3,random_state=1)

Because we have a large variety of types of data, we still need to have different preprocessing methods. 'highest_level_of_education' is an ordinal variable which means that labelencoding is the most effective method since it will convert the data into numerical representations and imply a hierarchical relationship for our model.

In [24]:
X_train_label = X_train_merge[['employment_status ','highest_level_of_education','professional_experience','work_authorization_status','length_of_job_search']]
X_test_label = X_test_merge[['employment_status ','highest_level_of_education','professional_experience','work_authorization_status','length_of_job_search']]

labelencoder = LabelEncoder()
X_train_label = X_train_label.apply(labelencoder.fit_transform)
X_test_label = X_test_label.apply(labelencoder.fit_transform)

In [25]:
X_train_numerical = X_train_merge[['number_of_interviews','number_of_applications','program_duration_days']]
X_test_numerical = X_test_merge[['number_of_interviews','number_of_applications','program_duration_days']]
scaler = StandardScaler()
X_train_merge[['number_of_interviews','number_of_applications','program_duration_days']] = scaler.fit_transform(X_train_merge[['number_of_interviews','number_of_applications','program_duration_days']])
X_test_merge[['number_of_interviews','number_of_applications','program_duration_days']] = scaler.fit_transform(X_test_merge[['number_of_interviews','number_of_applications','program_duration_days']])
X_train_numerical =X_train_merge[['number_of_interviews','number_of_applications','program_duration_days']] 
X_test_numerical = X_test_merge[['number_of_interviews','number_of_applications','program_duration_days']] 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train_merge[['number_of_interviews','number_of_applications','program_duration_days']] = scaler.fit_transform(X_train_merge[['number_of_interviews','number_of_applications','program_duration_days']])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value[:, i].tolist())
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#r

We perform OneHotEncoding on variables that are categorical but are independent of one another.

In [27]:
columns = ['cohort_tag','primary_track','biggest_challenge_in_search']
X_train_dum_df = pd.get_dummies(X_train_merge, columns=['cohort_tag','biggest_challenge_in_search'])
X_train_dum_df = X_train_dum_df.iloc[:,7:]

X_test_dum_df = pd.get_dummies(X_test_merge, columns=['cohort_tag','biggest_challenge_in_search'])
X_test_dum_df = X_test_dum_df.iloc[:,7:]

temp3 = [item for item in X_train_dum_df.columns if item not in X_test_dum_df.columns]
temp4 = [item for item in X_test_dum_df.columns if item not in X_train_dum_df.columns]
for i in temp3:
    del X_train_dum_df[i]
for i in temp4:
    del X_test_dum_df[i]

In [28]:
X_train_merge = pd.concat([X_train_label,X_train_dum_df], axis= 1)
X_train_merge = pd.concat([X_train_numerical,X_train_merge], axis= 1)

X_test_merge = pd.concat([X_test_label,X_test_dum_df], axis= 1)
X_test_merge = pd.concat([X_test_numerical,X_test_merge], axis= 1)

Next, we score and combine our model to get a sense of how well it predicts an applicants chance of placement.

In [29]:
clf = MLPClassifier(hidden_layer_sizes=(128),random_state=1, max_iter=5000,solver='sgd',tol=0.000000001).fit(X_train_merge, y_train_merge)
predict = clf.predict(X_test_merge)
clf.score(X_test_merge, y_test_merge)



0.696969696969697

## Days in Program


The next section runs a neural network regression model that aims to predict the number of days it takes for someone to be placed.

In [31]:
df_placement = df[df['placed']==1]

In [34]:
X_merge = df_placement[['cohort_tag',
 'employment_status ',
 'highest_level_of_education',
 'length_of_job_search',
 'biggest_challenge_in_search',
 'professional_experience',
 'work_authorization_status',
 'number_of_interviews',
 'number_of_applications']]

In [35]:
y_merge = df_placement['program_duration_days']

In [36]:
X_train_merge, X_test_merge, y_train_merge, y_test_merge = train_test_split(X_merge, y_merge,test_size=.20,random_state=1)

In [37]:
X_train_label = X_train_merge[['highest_level_of_education','professional_experience','work_authorization_status','length_of_job_search']]
X_test_label = X_test_merge[['highest_level_of_education','professional_experience','work_authorization_status','length_of_job_search']]

X_train_label = X_train_label.apply(labelencoder.fit_transform)
X_test_label = X_test_label.apply(labelencoder.fit_transform)

In [38]:
X_train_numerical = X_train_merge[['number_of_interviews','number_of_applications']]
X_test_numerical = X_test_merge[['number_of_interviews','number_of_applications']]
scaler = StandardScaler()
X_train_merge[['number_of_interviews','number_of_applications']] = scaler.fit_transform(X_train_merge[['number_of_interviews','number_of_applications']])
X_test_merge[['number_of_interviews','number_of_applications']] = scaler.fit_transform(X_test_merge[['number_of_interviews','number_of_applications']])
X_train_numerical =X_train_merge[['number_of_interviews','number_of_applications']] 
X_test_numerical = X_test_merge[['number_of_interviews','number_of_applications']] 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train_merge[['number_of_interviews','number_of_applications']] = scaler.fit_transform(X_train_merge[['number_of_interviews','number_of_applications']])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value[:, i].tolist())
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test_merge[['n

In [41]:
columns = ['cohort_tag','biggest_challenge_in_search']
X_train_dum_df = pd.get_dummies(X_train_merge, columns=['cohort_tag','biggest_challenge_in_search'])
X_train_dum_df = X_train_dum_df.iloc[:,7:]

X_test_dum_df = pd.get_dummies(X_test_merge, columns=['cohort_tag','biggest_challenge_in_search'])
X_test_dum_df = X_test_dum_df.iloc[:,7:]

temp3 = [item for item in X_train_dum_df.columns if item not in X_test_dum_df.columns]
temp4 = [item for item in X_test_dum_df.columns if item not in X_train_dum_df.columns]
for i in temp3:
    del X_train_dum_df[i]
for i in temp4:
    del X_test_dum_df[i]


In [42]:
X_train_merge = pd.concat([X_train_label,X_train_dum_df], axis= 1)
X_train_merge = pd.concat([X_train_numerical,X_train_merge], axis= 1)

X_test_merge = pd.concat([X_test_label,X_test_dum_df], axis= 1)
X_test_merge = pd.concat([X_test_numerical,X_test_merge], axis= 1)
#X_train_merge = pd.concat([X_train_label,X_train_ohe], axis= 1)
#X_test_merge = pd.concat([X_test_ohe,X_test_label], axis= 1)

In [43]:
clf = MLPRegressor(random_state=1, max_iter=5000).fit(X_train_label, y_train_merge)
predict = clf.predict(X_test_label)
clf.score(X_test_label, y_test_merge)



0.06463085511946787

# Analysis

Because some of the data was missing, we ended up filtering the dataset. Additionally, we removed columns that could be of ethical concern from our analysis. This ultimately limited the success of our classication and regression models that predicted placement and program duration. I am confident with more data, that the model will improve over time.

Overall, there is a correlation for program track and cohort date play a role in whether or not someone is placed. For a majority of the summer months, almost more than half of the cohort did not end up being placed. There may be an underlying reason for this, perhaps it isn't the best season for hiring. Surprisingly, degree and amount of experience did not appear to have a significant relationship with placement. This is something to consider when looking at applicants and shows that there are factors like type of relevant work experience that could play major role in success within the program. 