# Predicting Job Chances after Bootcamp
My goal for this project is to predict whether a bootcamp participant gets a developer job afterwards.

## Exploring the Data

In [485]:
# Import libraries
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score

# Read student data
student_data = pd.read_csv("2016-FCC-New-Coders-Survey-Data.csv", low_memory = False)
print("Student data read successfully!")

Student data read successfully!


Let's look at the students who attended a bootcamp and finished it. We also keep only the participants who answered the question of whether they got a job afterwards.

In [486]:
student_data = student_data[(student_data['AttendedBootcamp'] == 1) & (student_data['BootcampFinish'] == 1)]
student_data = student_data.dropna(subset = ['BootcampFullJobAfter'])

Let's see how many bootcamp participants we have information on, and learn about the job success rate among those.

In [487]:
# Calculate the number of participants
n_students = student_data.shape[0]

# Calculate the number of features (-1 to exclude the target column)
n_features = student_data.shape[1] - 1

# Calculate the number of participants who got a job
n_success = student_data[student_data['BootcampFullJobAfter'] == 1].shape[0]

# Calculate the number of participants who did not get a job
n_failed = n_students - n_success

# Calculate the job success rate in percent
success_rate = n_success * 100.0 / n_students

# Print the results
print("Total number of participants: {}".format(n_students))
print("Number of features: {}".format(n_features))
print("Number of participants who got a job: {}".format(n_success))
print("Number of participants who failed to get a job: {}".format(n_failed))
print("Job success rate of bootcamp participants: {:.2f}%".format(success_rate))

Total number of participants: 635
Number of features: 112
Number of participants who got a job: 371
Number of participants who failed to get a job: 264
Job success rate of bootcamp participants: 58.43%
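
The counts above also give a reference point for the F1 scores reported later. As a rough sketch (not part of the original analysis, and with variable names of my own), a trivial model that predicts "got a job" for every participant would reach the following F1 score on the full dataset:

```python
# Counts taken from the output above
n_students = 635
n_success = 371                      # participants who got a job
n_failed = n_students - n_success    # participants who did not

# "Always predict job" baseline: every participant is labelled positive
true_positives = n_success
false_positives = n_failed
false_negatives = 0

precision = true_positives / float(true_positives + false_positives)
recall = true_positives / float(true_positives + false_negatives)
baseline_f1 = 2 * precision * recall / (precision + recall)

print("Baseline F1 (always predict job): {:.4f}".format(baseline_f1))
```

This comes out at roughly 0.74. The value on a held-out test set can differ somewhat, because the share of successful participants changes with the split.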


## Preparing the Data
### Identify feature and target columns
Let's have a look at the feature columns we have. We'll use `'BootcampFullJobAfter'` as the target column.

In [488]:
# Extract feature columns (every column except the target)
feature_cols = list(student_data.columns[:3]) + list(student_data.columns[4:])

# Extract target column 'BootcampFullJobAfter'
target_col = student_data.columns[3]

# Show the list of columns
print("Feature columns:\n{}".format(feature_cols))
print("\nTarget column: {}".format(target_col))

Feature columns:
['Age', 'AttendedBootcamp', 'BootcampFinish', 'BootcampLoanYesNo', 'BootcampMonthsAgo', 'BootcampName', 'BootcampPostSalary', 'BootcampRecommend', 'ChildrenNumber', 'CityPopulation', 'CodeEventBootcamp', 'CodeEventCoffee', 'CodeEventConferences', 'CodeEventDjangoGirls', 'CodeEventGameJam', 'CodeEventGirlDev', 'CodeEventHackathons', 'CodeEventMeetup', 'CodeEventNodeSchool', 'CodeEventNone', 'CodeEventOther', 'CodeEventRailsBridge', 'CodeEventRailsGirls', 'CodeEventStartUpWknd', 'CodeEventWomenCode', 'CodeEventWorkshop', 'CommuteTime', 'CountryCitizen', 'CountryLive', 'EmploymentField', 'EmploymentFieldOther', 'EmploymentStatus', 'EmploymentStatusOther', 'ExpectedEarning', 'FinanciallySupporting', 'Gender', 'HasChildren', 'HasDebt', 'HasFinancialDependents', 'HasHighSpdInternet', 'HasHomeMortgage', 'HasServedInMilitary', 'HasStudentDebt', 'HomeMortgageOwe', 'HoursLearning', 'ID.x', 'ID.y', 'Income', 'IsEthnicMinority', 'IsReceiveDiabilitiesBenefits', 'IsSoftwareDev', 'Is

To make reasonable predictions, we have to drop the columns that contain information we wouldn't have while the person is still attending the bootcamp: `'BootcampMonthsAgo'`, `'BootcampPostSalary'` and `'BootcampRecommend'`. We also drop the columns `'AttendedBootcamp'` and `'BootcampFinish'`, because we only look at students who answered both questions with `yes`, so these columns are constant.

We also get rid of the `'ID.x'`, `'ID.y'`, `'NetworkID'`, `'Part1EndTime'`, `'Part1StartTime'`, `'Part2EndTime'` and `'Part2StartTime'` features, because they are unique to each individual student and don't help with the prediction.

In [489]:
features_to_remove = ['AttendedBootcamp', 'BootcampFinish', 'BootcampMonthsAgo', 'BootcampPostSalary', 'BootcampRecommend', 'ID.x', 'ID.y', 'NetworkID', 'Part1EndTime', 'Part1StartTime', 'Part2EndTime', 'Part2StartTime']

for ftr in features_to_remove:
    feature_cols.remove(ftr)

Let's have a look at the data.

In [490]:
student_data[feature_cols].head()

Unnamed: 0,Age,BootcampLoanYesNo,BootcampName,ChildrenNumber,CityPopulation,CodeEventBootcamp,CodeEventCoffee,CodeEventConferences,CodeEventDjangoGirls,CodeEventGameJam,...,ResourceSoloLearn,ResourceStackOverflow,ResourceTreehouse,ResourceUdacity,ResourceUdemy,ResourceW3Schools,ResourceYouTube,SchoolDegree,SchoolMajor,StudentDebtOwe
93,32.0,0.0,Codify Academy,,"between 100,000 and 1 million",,,,,,...,,,,1.0,,,,bachelor's degree,Biology,
97,26.0,0.0,DaVinci Coders,,more than 1 million,,,1.0,,,...,,,,,1.0,,,master's degree (non-professional),Music,80000.0
130,41.0,1.0,Coder Foundry,3.0,"less than 100,000",,1.0,1.0,,,...,,,,,1.0,,,"some college credit, no degree",,8000.0
159,26.0,0.0,General Assembly,,"between 100,000 and 1 million",,1.0,1.0,,,...,,,,,,,,"some college credit, no degree",,
206,36.0,0.0,Thinkful,2.0,more than 1 million,,1.0,,,,...,,,,,,,,bachelor's degree,Communications,


We have a lot of cells with no value. That's probably either because the people filling out the survey could choose which questions to answer, or because all `0` and `no` answers are stored as `NaN`.

This is especially visible in questions where the answers are `yes` or `no`: there we only see the values 1.0 or `NaN`.

Let's quickly check whether the value 0 ever occurs in the column `'StudentDebtOwe'`, or whether all 0 values are `NaN`.

In [491]:
student_data[student_data['StudentDebtOwe'] == 0]

Unnamed: 0,Age,AttendedBootcamp,BootcampFinish,BootcampFullJobAfter,BootcampLoanYesNo,BootcampMonthsAgo,BootcampName,BootcampPostSalary,BootcampRecommend,ChildrenNumber,...,ResourceSoloLearn,ResourceStackOverflow,ResourceTreehouse,ResourceUdacity,ResourceUdemy,ResourceW3Schools,ResourceYouTube,SchoolDegree,SchoolMajor,StudentDebtOwe


The result is empty, so the value 0 never occurs in this column and zeros seem to be stored as `NaN`. This means it's pretty safe to fill the missing values with 0.

In [492]:
student_data = student_data.fillna(0)
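
The check behind this step can also be expressed as two counts per column. The following is a minimal sketch on made-up values (the toy frame and the variable names are assumptions, not the survey data): if a column has many missing entries but not a single explicit zero, it is plausible that zeros were recorded as `NaN`.

```python
import pandas as pd

# Hypothetical toy column: two debts, two missing entries, no explicit zero
toy = pd.DataFrame({"StudentDebtOwe": [80000.0, float("nan"), 8000.0, float("nan")]})

# Count explicit zeros and missing values separately
n_zero = int((toy["StudentDebtOwe"] == 0).sum())
n_missing = int(toy["StudentDebtOwe"].isnull().sum())
print("zeros: {}, missing: {}".format(n_zero, n_missing))

# Filling the gaps with 0, as done for the survey data
filled = toy["StudentDebtOwe"].fillna(0)
```

Note that this reasoning is weaker for free-text columns, where a filled-in 0 becomes a value of its own.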

In [493]:
# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]

### Preprocess Feature Columns
Most machine learning algorithms expect numeric data. Let's see if we have any non-numeric features.

In [494]:
# Show the feature information by printing the first five rows
print("\nFeature values:")
print(X_all.head())


Feature values:
      Age  BootcampLoanYesNo      BootcampName  ChildrenNumber  \
93   32.0                0.0    Codify Academy             0.0   
97   26.0                0.0    DaVinci Coders             0.0   
130  41.0                1.0     Coder Foundry             3.0   
159  26.0                0.0  General Assembly             0.0   
206  36.0                0.0          Thinkful             2.0   

                    CityPopulation  CodeEventBootcamp  CodeEventCoffee  \
93   between 100,000 and 1 million                0.0              0.0   
97             more than 1 million                0.0              0.0   
130              less than 100,000                0.0              1.0   
159  between 100,000 and 1 million                0.0              1.0   
206            more than 1 million                0.0              1.0   

     CodeEventConferences  CodeEventDjangoGirls  CodeEventGameJam  \
93                    0.0                   0.0               0.0   
97 

There are several binary columns, which look fine after replacing all `NaN` values with zeros. We still have some non-numeric columns. These are known as categorical variables.

To handle the categorical variables, I create one column per possible value and fill these columns with binary values.

In [495]:
def preprocess_features(X):
    ''' Preprocesses the student data and converts non-numeric categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    for col, col_data in X.iteritems():

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'SchoolMajor' => 'SchoolMajor_Accounting', 'SchoolMajor_Acting', ...
            col_data = pd.get_dummies(col_data, prefix = col)  
        
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

X_all = preprocess_features(X_all)
print("Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns)))

Processed feature columns (732 total features):
['Age', 'BootcampLoanYesNo', 'BootcampName_0', 'BootcampName_10x.org.il', 'BootcampName_4Geeks Academy', 'BootcampName_AcadGild', 'BootcampName_Academia de C\xc3\xb3digo', 'BootcampName_Academy X', 'BootcampName_Ada', 'BootcampName_Anyone Can Learn To Code', 'BootcampName_App Academy', 'BootcampName_Atlanta Code', 'BootcampName_Austin Coding Academy', 'BootcampName_Big Nerd Ranch', 'BootcampName_Bit Bootcamp', 'BootcampName_Bitmaker Labs', 'BootcampName_Bloc.io', 'BootcampName_BoiseCodeWorks', 'BootcampName_BrainStation', 'BootcampName_CODEcamp Charleston', 'BootcampName_Camp Code Away', 'BootcampName_CareerFoundry', 'BootcampName_Code Fellows', 'BootcampName_Code For Progress', 'BootcampName_CodeCore Bootcamp', 'BootcampName_CodeaCamp', 'BootcampName_Codecademy Labs', 'BootcampName_Coder Camps', 'BootcampName_Coder Foundry', "BootcampName_Coder's Lab", 'BootcampName_Codesmith', 'BootcampName_Codeup', 'BootcampName_Codify Academy', 'Bootc
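
The `'BootcampName_0'` column in the output above most likely comes from the earlier `fillna(0)`: the filled-in zero is treated as one more category. A small sketch with made-up values shows the effect (the toy column is hypothetical):

```python
import pandas as pd

# Toy categorical column as it looks after fillna(0): strings plus a filled-in 0
bootcamp = pd.Series(["Thinkful", 0, "Ada", "Thinkful"],
                     dtype=object, name="BootcampName")

# One binary column per distinct value, as in preprocess_features
dummies = pd.get_dummies(bootcamp, prefix="BootcampName")

print(sorted(dummies.columns))
```

Each row has exactly one active dummy column, and the row that held the filled-in 0 is flagged in `'BootcampName_0'`, which effectively acts as a "no answer" indicator.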

### Training and Testing Data Split
I use a test set with 25% of the original data.

In [496]:
# 'sklearn.cross_validation' was removed in scikit-learn 0.20; 'sklearn.model_selection' replaces it
from sklearn.model_selection import train_test_split

# Shuffle and split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size = 0.25, random_state = 42)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 476 samples.
Testing set has 159 samples.
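
The sample counts follow from the 25% test share. As far as I know, scikit-learn rounds the test share up and the training share down, which the short calculation below reproduces (the variable names are my own):

```python
import math

n_samples = 635
test_size = 0.25

# Test share is rounded up: 158.75 -> 159
n_test = int(math.ceil(test_size * n_samples))

# Training share is rounded down: 476.25 -> 476
n_train = int(math.floor((1 - test_size) * n_samples))

print("train: {}, test: {}".format(n_train, n_test))
```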


## Training and Evaluating Models

In [497]:
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Print the results
    print("Trained model in {:.4f} seconds".format(end - start))

    
def predict_labels(clf, features, target):
    ''' Makes predictions with a fitted classifier and returns the F1 score. '''

    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()

    # Print the prediction time and return the F1 score of the positive class
    print("Made predictions in {:.4f} seconds.".format(end - start))
    return f1_score(target.values, y_pred, pos_label=1.)


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Trains a classifier and reports its F1 score on the training and test sets. '''

    # Indicate the classifier and the training set size
    print("Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train)))

    # Train the classifier
    train_classifier(clf, X_train, y_train)

    # Print the results of prediction for both training and testing
    print("F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train)))
    print("F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test)))
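
For readers unfamiliar with the metric used by these helpers: the F1 score is the harmonic mean of precision and recall for the positive class (here, "got a job"). The following is a minimal pure-Python sketch of that definition; `f1_from_labels` is a hypothetical helper for illustration, not the scikit-learn implementation.

```python
def f1_from_labels(y_true, y_pred, pos_label=1.0):
    """Compute the F1 score of the positive class from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        # No correct positive prediction: precision or recall is zero
        return 0.0
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative
y_true = [1.0, 1.0, 1.0, 0.0, 0.0]
y_pred = [1.0, 1.0, 0.0, 1.0, 0.0]
toy_f1 = f1_from_labels(y_true, y_pred)
print("Toy F1 score: {:.4f}".format(toy_f1))
```

With precision and recall both at 2/3, the toy score is also 2/3.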

In [498]:
# Import the supervised learning models from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier

# Initialize the three models
clf_A = DecisionTreeClassifier(random_state=42)
clf_B = GaussianNB()
clf_C = SVC(random_state=42)

# Execute the 'train_predict' function for each classifier and each training set size
# train_predict(clf, X_train, y_train, X_test, y_test)
for clf in [clf_A, clf_B, clf_C]:
    print("\n{}: \n".format(clf.__class__.__name__))
    for n in [100, 200, 300]:
        train_predict(clf, X_train[:n], y_train[:n], X_test, y_test)


DecisionTreeClassifier: 

Training a DecisionTreeClassifier using a training set size of 100. . .
Trained model in 0.0021 seconds
Made predictions in 0.0004 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0002 seconds.
F1 score for test set: 0.8478.
Training a DecisionTreeClassifier using a training set size of 200. . .
Trained model in 0.0031 seconds
Made predictions in 0.0003 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0003 seconds.
F1 score for test set: 0.8342.
Training a DecisionTreeClassifier using a training set size of 300. . .
Trained model in 0.0061 seconds
Made predictions in 0.0004 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0002 seconds.
F1 score for test set: 0.8022.

GaussianNB: 

Training a GaussianNB using a training set size of 100. . .
Trained model in 0.0013 seconds
Made predictions in 0.0007 seconds.
F1 score for training set: 0.8644.
Made predictions in 0.0013 seconds.
F1 score for test set: 0.7345.
Tr

In [499]:
# Import 'GridSearchCV' and 'make_scorer'
# ('sklearn.grid_search' was removed in scikit-learn 0.20; 'sklearn.model_selection' replaces it)
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Create the parameters list to tune
parameters = {'min_samples_split': range(5,200,5)}

# Initialize the classifier
clf = DecisionTreeClassifier(random_state=42)

# Make an F1 scoring function using 'make_scorer'
f1_scorer = make_scorer(f1_score,pos_label=1.)

# Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf, parameters, scoring = f1_scorer)

# Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train, y_train)

# Get the best estimator found by the grid search
clf = grid_obj.best_estimator_

# Report the final F1 score for training and testing after parameter tuning
print("Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train)))
print("Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test)))

Made predictions in 0.0006 seconds.
Tuned model has a training F1 score of 0.9280.
Made predictions in 0.0002 seconds.
Tuned model has a testing F1 score of 0.8603.
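
The chosen value of `min_samples_split` is not printed above. A minimal sketch of how it could be read back from a fitted grid search follows. It uses synthetic data from `make_classification` instead of the survey data, a smaller parameter range, an explicit `cv=3`, and the newer `sklearn.model_selection` module; all of these are assumptions of this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey features and the job outcome
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Candidate values for the minimum number of samples needed to split a node
param_grid = {"min_samples_split": list(range(5, 50, 5))}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring=make_scorer(f1_score, pos_label=1),
    cv=3,
)
search.fit(X, y)

# The winning parameter value and its mean cross-validated F1 score
best_split = search.best_params_["min_samples_split"]
best_cv_f1 = search.best_score_
print("Best min_samples_split: {}".format(best_split))
print("Mean cross-validated F1: {:.4f}".format(best_cv_f1))
```

Applied to the notebook, the same attributes would be available as `grid_obj.best_params_` and `grid_obj.best_score_`.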
