The following cells train a logistic regression model at several L2 penalty strengths. The final table reports training and validation accuracy for each penalty, so the best-performing penalty can be chosen.

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

In [4]:
# Imports train/test data. Converts grade to a numeric value.
train_data = pd.read_csv('data/edx_train.csv')
test_data = pd.read_csv('data/edx_test.csv')
train_data['grade'] = pd.to_numeric(train_data.grade, errors = 'coerce')
test_data['grade'] = pd.to_numeric(test_data.grade, errors = 'coerce')
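
To make the effect of `errors = 'coerce'` concrete, here is a minimal sketch on a made-up Series (the values are illustrative assumptions, not taken from the edX files). Entries that cannot be parsed as numbers become NaN instead of raising an error, which is why a later cell fills missing values with 0.

```python
import pandas as pd

# Hypothetical raw 'grade' values as they might look when read in as text
raw_grade = pd.Series(['0.85', 'not available', '1'])

# errors='coerce' turns unparseable entries into NaN instead of raising ValueError
numeric_grade = pd.to_numeric(raw_grade, errors='coerce')

print(numeric_grade.isna().tolist())  # [False, True, False]
```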

In [5]:
features_update = ['registered',
            'viewed',
            'explored',
            'LoE_DI',
            'YoB',
            'grade',
            'nevents',
            'ndays_act',
            'nplay_video',
            'nchapters',
            'nforum_posts'
            ]
target = 'certified'

# Create dummies for non-numeric features (same feature columns for train and test)
train_dummy = pd.get_dummies(train_data[features_update + [target]])
test_dummy = pd.get_dummies(test_data[features_update])

# Train/val split
train_rf, val_rf = train_test_split(train_dummy, test_size = 0.2)

# Collect dummy column names
features_update = list(train_rf.columns)
features_update.remove(target)

# Updates train/val with dummies and accounts for NA values
train_rf[features_update] = train_rf[features_update].fillna(0)
val_rf[features_update] = val_rf[features_update].fillna(0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]
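
The warning above is raised because the frames returned by `train_test_split` are treated by pandas as slices of `train_dummy`. One common way to avoid it, sketched here on a small made-up frame, is to take an explicit copy of each split before assigning to it. The column values, the `toy` frame, and `random_state = 0` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for train_dummy (hypothetical values)
toy = pd.DataFrame({
    'nevents': [10.0, np.nan, 3.0, 7.0, np.nan, 1.0, 4.0, 8.0, np.nan, 2.0],
    'certified': [1, 0, 0, 1, 0, 0, 1, 1, 0, 0],
})
feature_cols = ['nevents']

train_part, val_part = train_test_split(toy, test_size=0.2, random_state=0)

# Explicit copies make it unambiguous that independent frames are being modified
train_part = train_part.copy()
val_part = val_part.copy()

# Fill missing feature values in each split; the original frame is left untouched
train_part[feature_cols] = train_part[feature_cols].fillna(0)
val_part[feature_cols] = val_part[feature_cols].fillna(0)
```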


In [20]:
# Sets the L2 penalties to try
l2_penalties = [0.01, 1.0, 4.0, 10.0, 1e2, 1e3, 1e5]
acc = []

# Loops through all l2_penalties; C is the inverse of the penalty strength
for count, l2_penalty in enumerate(l2_penalties, start = 1):
    model = LogisticRegression(penalty = 'l2', fit_intercept = False, C = (1 / l2_penalty))
    model = model.fit(train_rf[features_update], train_rf[target])

    train_pred = model.predict(train_rf[features_update])
    val_pred = model.predict(val_rf[features_update])

    # Records train/validation accuracy for this penalty
    acc_train = accuracy_score(train_rf[target], train_pred)
    acc_val = accuracy_score(val_rf[target], val_pred)
    acc.append([count, l2_penalty, acc_train, acc_val])

acc_table = pd.DataFrame(acc)
acc_table.columns = ['count', 'l2_penalty', 'train_accuracy', 'validation_accuracy']
acc_table

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

   count  l2_penalty  train_accuracy  validation_accuracy
0      1        0.01        0.985013              0.98516
1      2         1.0        0.983728             0.984589
2      3         4.0        0.983157             0.984018
3      4        10.0        0.981302             0.983447
4      5       100.0        0.964316             0.965753
5      6      1000.0        0.924351              0.93379
6      7    100000.0        0.884385             0.890982
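
For readers wondering why the loop passes `C = (1 / l2_penalty)`: scikit-learn parameterizes regularization through C, the inverse of the regularization strength. As a sketch, assuming labels $y_i \in \{-1, +1\}$, no intercept (matching `fit_intercept = False`), and writing $\lambda$ for `l2_penalty`, the L2-penalized objective can be written as follows. The symbols $w$, $x_i$, $n$, and $\lambda$ are notation introduced here, not names from the notebook.

```latex
\min_{w} \; \frac{1}{2}\, w^\top w
  + C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \, x_i^\top w\right)\right),
\qquad C = \frac{1}{\lambda}
\quad\Longleftrightarrow\quad
\min_{w} \; \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \, x_i^\top w\right)\right)
  + \frac{\lambda}{2}\, \lVert w \rVert_2^2 .
```

The second form follows from dividing the first by C. A larger `l2_penalty` therefore shrinks the coefficients more strongly, which is consistent with the table: both accuracies fall as the penalty grows from 0.01 to 100000.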


The cell below refits the logistic regression with the l2_penalty that achieved the highest validation accuracy in the table above.
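
A minimal sketch of that selection step, using the validation accuracies from the table above. One detail worth noting: `idxmax` returns an index label rather than a position, so `.loc` is the accessor that matches it. The reversed index below is an assumption added purely to make the difference visible; with the default 0..n-1 index, labels and positions coincide.

```python
import pandas as pd

# Validation accuracies copied from the table above
table = pd.DataFrame({
    'l2_penalty': [0.01, 1.0, 4.0, 10.0, 1e2, 1e3, 1e5],
    'validation_accuracy': [0.98516, 0.984589, 0.984018, 0.983447,
                            0.965753, 0.93379, 0.890982],
})

# Hypothetical non-default index: labels run 6..0 instead of 0..6
table.index = range(6, -1, -1)

best_label = table['validation_accuracy'].idxmax()  # label of the best row
best_row = table.loc[best_label]                    # label-based lookup

print(best_label, best_row['l2_penalty'])
```

With this index, a position-based lookup such as `table.iloc[best_label]` would return the row for the largest penalty instead of the best one.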

In [24]:
# Gets the row with the highest validation accuracy
acc_row = acc_table.loc[acc_table['validation_accuracy'].idxmax()]

# Refits the model using the best L2 penalty
model = LogisticRegression(penalty = 'l2', fit_intercept = False, C = (1 / acc_row['l2_penalty']))
model = model.fit(train_rf[features_update], train_rf[target])

val_pred = model.predict(val_rf[features_update])
acc_val = accuracy_score(val_rf[target], val_pred)

print('Accuracy Score: ' + str(acc_val))

Accuracy Score: 0.9851598173515982


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
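
The warning output above points to two remedies: raising `max_iter` or scaling the features. Below is a minimal sketch of both. It assumes synthetic data from `make_classification` in place of the edX features, an arbitrary `max_iter = 1000`, and the default `fit_intercept`, since `StandardScaler` centers the features. Whether either change would alter the accuracies reported in this notebook has not been checked here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption, not the edX data)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_clusters_per_class=1, random_state=0)
# Put the features on very different scales, as raw activity counts often are
X = X * np.logspace(0, 4, num=10)

l2_penalty = 0.01  # same convention as the notebook: C = 1 / l2_penalty

# Standardize the features, then fit with a higher iteration cap
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1 / l2_penalty, max_iter=1000),
)
scaled_model.fit(X, y)

n_iter = int(scaled_model.named_steps['logisticregression'].n_iter_[0])
train_acc = scaled_model.score(X, y)
print('iterations used:', n_iter)
print('training accuracy:', train_acc)
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training data only and then reused unchanged when predicting on validation or test data.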


This model performs fairly well on the validation set and is ready to make final predictions for each student in the test set. For every student, it predicts whether they will earn a certificate in the course.

In [28]:
# Aligns the test dummies with the training feature columns, fills NA values, and makes predictions
test_dummy = test_dummy.reindex(columns = features_update, fill_value = 0).fillna(0)
test_pred = model.predict(test_dummy[features_update])
# Creates a dataframe of the students' userid and the predictions for their certificate outcome.
final_df = pd.DataFrame(list(zip(test_data['userid_DI'], test_pred)), columns = ['userid_DI', 'certified'] )
final_df

           userid_DI  certified
0     MHxPC130476531          1
1     MHxPC130559898          0
2     MHxPC130552712          1
3     MHxPC130394971          1
4     MHxPC130191077          1
...              ...        ...
2915  MHxPC130421523          0
2916  MHxPC130116114          0
2917  MHxPC130239033          0
2918  MHxPC130445460          0
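
Because the training and test sets are one-hot encoded separately, a category that appears in only one of them produces mismatched columns. Below is a minimal sketch of one way to guard against that with `DataFrame.reindex`, using made-up data (the frames and category values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical train/test frames with partly different education levels
train_toy = pd.DataFrame({'LoE_DI': ["Bachelor's", "Master's", 'Secondary'],
                          'nevents': [10, 25, 3]})
test_toy = pd.DataFrame({'LoE_DI': ["Bachelor's", 'Doctorate'],
                         'nevents': [7, 12]})

train_encoded = pd.get_dummies(train_toy)
test_encoded = pd.get_dummies(test_toy)

# Keep exactly the training columns, in training order:
# columns missing from the test set are added as 0, unseen ones are dropped
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_aligned.columns))
```

Selecting columns by name without this step would raise a `KeyError` as soon as the test set lacks a category that was present during training.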
