# Predicting and understanding viewer engagement with educational videos 

Code to analyse and predict viewer engagement with educational videos from a range of video features.


## About the dataset

We extracted training and test datasets of educational video features from the VLE Dataset put together by researcher Sahan Bulathwela at University College London. 

We provide you with two data files for use in training and validating your models: train.csv and test.csv. Each row in these two files corresponds to a single educational video, and includes information about diverse properties of the video content as described further below. The target variable is `engagement` which was defined as True if the median percentage of the video watched across all viewers was at least 30%, and False otherwise.
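
The labelling rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the per-viewer watch fractions and the `label_engagement` helper are hypothetical and are not part of the dataset, which already ships with the computed `engagement` column.

```python
import numpy as np

ENGAGEMENT_THRESHOLD = 0.30  # 30% of the video, as stated in the definition above


def label_engagement(watch_fractions, threshold=ENGAGEMENT_THRESHOLD):
    """Hypothetical helper: True if the median fraction watched is at least `threshold`."""
    return bool(np.median(watch_fractions) >= threshold)


# Made-up per-viewer watch fractions for two videos (illustrative values only).
video_a = [0.10, 0.35, 0.80]  # median is 0.35
video_b = [0.05, 0.20, 0.90]  # median is 0.20

print(label_engagement(video_a))  # True
print(label_engagement(video_b))  # False
```

Using the median rather than the mean means that a handful of viewers who watch the whole video cannot flip the label on their own.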


For this final assignment, you will bring together what you've learned across all four weeks of this course by exploring different prediction models for this new dataset. We also encourage you to apply what you've learned about model selection: tune hyperparameters using training/validation splits of the training data to optimize the model and further improve its performance. Beyond a basic evaluation of model accuracy, we've provided a utility function to visualize which features contribute most and least to overall model performance.
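
As a rough sketch of hyperparameter tuning with a training/validation split, the example below runs a small grid search and then scores the tuned model on held-out data. It uses synthetic data from `make_classification` as a stand-in for `train.csv`, and the grid values, sample size and `random_state` settings are assumptions chosen to keep the example fast, not recommended settings for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for train.csv (7 features, binary target); not the real data.
X, y = make_classification(n_samples=400, n_features=7, n_informative=4, random_state=0)

# Hold out a validation set that the grid search never sees.
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Small illustrative grid; cross-validation on the training part picks the best combination.
param_grid = {"learning_rate": [0.01, 0.1], "n_estimators": [50, 100]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out validation set.
val_scores = search.predict_proba(X_val)[:, 1]
val_auc = roc_auc_score(y_val, val_scores)

print("Best parameters:", search.best_params_)
print("Cross-validated AUC:", round(search.best_score_, 3))
print("Held-out AUC:", round(val_auc, 3))
```

Comparing the cross-validated score with the held-out score gives a quick check on whether the tuned model is overfitting the training folds.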

**File descriptions**

    assets/train.csv - the training set (use only this data for training your model!)
    assets/test.csv - the test set
<br>

**Data fields**

train.csv & test.csv:

    title_word_count - the number of words in the title of the video.
    
    document_entropy - a score indicating how varied the topics covered in the video are, based on the transcript. Videos with smaller entropy scores tend to be more cohesive and more focused on a single topic.
    
    freshness - The number of days elapsed between 01/01/1970 and the lecture's publication date. More recent videos have higher freshness values.
    
    easiness - A text difficulty measure applied to the transcript. A lower score indicates more complex language used by the presenter.
    
    fraction_stopword_presence - A stopword is a very common word like 'the' or 'and'. This feature computes the fraction of all words that are stopwords in the video lecture transcript.
    
    speaker_speed - The average speaking rate in words per minute of the presenter in the video.
    
    silent_period_rate - The fraction of time in the lecture video that is silence (no speaking).
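
To make a few of these definitions concrete, here is a minimal sketch of how `freshness` maps back to a calendar date and how a stopword fraction and a speaking rate could be computed. The helper names and the tiny `STOPWORDS` set are hypothetical; the stopword list and preprocessing actually used to build the dataset are not described here.

```python
import pandas as pd


def freshness_to_date(freshness_days):
    """Convert a freshness value (days since 01/01/1970) to a calendar date."""
    return pd.Timestamp("1970-01-01") + pd.Timedelta(days=freshness_days)


# Tiny illustrative stopword list (an assumption, not the dataset's actual list).
STOPWORDS = {"the", "and", "a", "of", "to", "is", "in"}


def stopword_fraction(transcript):
    """Fraction of whitespace-separated words in the transcript that are stopwords."""
    words = transcript.lower().split()
    if not words:
        return 0.0
    return sum(word in STOPWORDS for word in words) / len(words)


def words_per_minute(word_count, duration_minutes):
    """Average speaking rate in words per minute."""
    return word_count / duration_minutes


print(freshness_to_date(365).date())             # 1971-01-01
print(stopword_fraction("The cat and the dog"))  # 0.6
print(words_per_minute(1500, 10))                # 150.0
```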
    



* Dataset from: https://github.com/sahanbull/VLE-Dataset



In [2]:
import warnings
warnings.filterwarnings("ignore")

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)   # Do not change this value: required to be compatible with solutions generated by the autograder.

In [None]:
## The first (commented-out) function searches for the best hyperparameters based on AUC score; the second (active) function applies the tuned parameters to the data.
"""

def engagement_model():
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import roc_auc_score
    
    metric="roc_auc"
    rec = None
    
    df = pd.read_csv("assets/train.csv", index_col="id")
    X=df.iloc[:,[1,3,6]]
    y=df.iloc[:,-1]
    
    models=np.array([SVC(), GradientBoostingClassifier(), MLPClassifier()])
    grid_vals = np.array([{'kernel': ["rbf", "linear"],'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]},
                          {'learning_rate': [0.001, 0.01, 0.05, 0.1, 1, 10, 100], "n_estimators": [5,10,50,100,500]},
                          {'hidden_layer_sizes': [[100], [50], [100,100], [50,50]],'alpha': [0.0001, 0.001, 0.01, 0.1]}])#Most important parameters
    
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    tree = DecisionTreeClassifier().fit(X_train, y_train)
    imp=tree.feature_importances_
    
    grid_clf_auc = GridSearchCV(models[1], param_grid = grid_vals[1], scoring = metric, cv=2)
    grid_clf_auc.fit(X_train, y_train)
    y_decision_fn_scores_auc = grid_clf_auc.decision_function(X_test) 

    print('Test set AUC: ', roc_auc_score(y_test, y_decision_fn_scores_auc))
    print('Grid best parameter (max. AUC): ', grid_clf_auc.best_params_)
    print('Grid best score (AUC): ', grid_clf_auc.best_score_)
    
    return grid_vals
engagement_model()

Test set AUC:  0.8513744982114587
Grid best parameter (max. AUC):  {'learning_rate': 0.01, 'n_estimators': 500}
Grid best score (AUC):  0.8353437378545192

"""
def engagement_model():
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import roc_auc_score
    
    metric="roc_auc"
    
    df = pd.read_csv("assets/train.csv", index_col="id")
    test = pd.read_csv("assets/test.csv", index_col="id")
    X=df.iloc[:,:-1]
    y=df.iloc[:,-1]
    
    models=np.array([SVC(), GradientBoostingClassifier(), MLPClassifier()])
    grid_vals = np.array([{'kernel': ["rbf", "linear"],'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]},
                          {'learning_rate': [0.001, 0.01, 0.05, 0.1, 1, 10, 100], "n_estimators": [5,10,50,100,500]},
                          {'hidden_layer_sizes': [[100], [50], [100,100], [50,50]],'alpha': [0.0001, 0.001, 0.01, 0.1]}])#Most important parameters
    
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    tree = DecisionTreeClassifier().fit(X_train, y_train)
    imp=tree.feature_importances_
    
    """for i in range(len(models)):
        grid_clf_auc = GridSearchCV(models[i], param_grid = grid_vals[i], scoring = metric)
        grid_clf_auc.fit(X_train, y_train)
        # MLPClassifier has no decision_function, so fall back to predict_proba for it.
        if hasattr(grid_clf_auc, "decision_function"):
            y_scores_auc = grid_clf_auc.decision_function(X_test)
        else:
            y_scores_auc = grid_clf_auc.predict_proba(X_test)[:, 1]

        print('Test set AUC: ', roc_auc_score(y_test, y_scores_auc))
        print('Grid best parameter (max. AUC): ', grid_clf_auc.best_params_)
        print('Grid best score (AUC): ', grid_clf_auc.best_score_)"""
    
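    # Fit gradient boosting with the best parameters found by the grid search above.
    winner = GradientBoostingClassifier(learning_rate=0.01, n_estimators=500).fit(X_train, y_train)

    # Return the predicted probability of engagement for each test video, indexed by video id.
    return pd.Series(winner.predict_proba(test)[:, 1], index=test.index)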
engagement_model()



id
9240     0.021111
9241     0.030671
9242     0.102101
9243     0.954662
9244     0.023380
           ...   
11544    0.027166
11545    0.016209
11546    0.022203
11547    0.914186
11548    0.020090
Length: 2309, dtype: float64
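
To get a rough idea of which features drive predictions like these, one option is to inspect the `feature_importances_` attribute of a fitted `GradientBoostingClassifier`. The sketch below is not the course's utility function: it uses randomly generated data with the column names from the data fields list, and a made-up target that depends on only two features so that the resulting ranking is easy to interpret.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

feature_names = [
    "title_word_count", "document_entropy", "freshness", "easiness",
    "fraction_stopword_presence", "speaker_speed", "silent_period_rate",
]

# Random stand-in data (an assumption); the real values live in assets/train.csv.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, len(feature_names))), columns=feature_names)

# Made-up target driven mostly by document_entropy and, to a lesser degree, freshness.
y = (X["document_entropy"] + 0.5 * X["freshness"] > 0).astype(int)

model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; larger values mean a feature was used for more useful splits.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can overstate features with many distinct values, so a permutation-based check on held-out data is a reasonable complement.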