## Grid Search 

1. change recall to be at the length of the respective course description
1. instead of using regular max_df as cutoff use max_df across subjects as filter? 
    - We want to continue to filter out generic words using max df, how we calculate the max_df threshold isn't as important
    - you can get which words were filtered out using stop_words_ attribute
    - make it generalizable to different school datasets
    - max_df = 1.0 (**NEEDS TO BE 1.0 OTHERWISE IT'S NOT TREATED AS A PERCENTAGE and you "ignore words that only appear in more than 1 document" so you get all the esoteric words that only appear in one course description**)
    - what does this error mean `ValueError: max_df corresponds to < documents than min_df` and why does it occur was max_df = 0?
1. does getting bigrams and trigrams separately make a difference? - try getting the vocab using ngram_range = 1,3 and no limit on max_features
    - Yes it does make a difference and might be an option to explore for feature engineering
    - remove trigrams / no limit on bigram featurse
1. [x] add hidden layer for the regression
1. [x] *optimization:* do not append to dataframes, start w/ lists and convert to dataframe OR initialize a numpy matrix for the hyperparameters using np.empty first and then [populate instead of appending](https://stackoverflow.com/questions/13370570/elegant-grid-search-in-python-numpy)

In [34]:
import time
import os
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
from itertools import chain
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from scipy.spatial.distance import cosine
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold
from sklearn.model_selection import ParameterGrid
from keras.layers import Input, Dense
from keras.models import Model
from keras import backend as K
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
K.set_session(sess)
pd.options.mode.chained_assignment = None 

TRAINING_DIR = os.getcwd()
DATA_DIR = './data'
OUTPUT_DIR = '/home/matthew/ICS-Research/scores'
vectorfile = os.path.join(DATA_DIR, 'course_vecs.tsv')
infofile = os.path.join(DATA_DIR, 'course_info.tsv')
scorefile_path = os.path.join(OUTPUT_DIR, 'score_file.tsv')
textcolumn = 'course_description'

In [35]:
def calculate_metric(df_with_keywords, metric):
    """Calculate metric evaluating quality of inferred keywords with respect to its true course description
    Args: 
        df_with_keywords: dataframe with predicted keywords for every course for every columns
        metrics (string): {r: recall, p: precision, c: cosine similarity, 
        df: document frequency of inferred keywords using course subject as documents}
    Returns
        Desired Metric
    """
    def clean_descrip_title(row):
        punc_remover = str.maketrans('', '', string.punctuation)
        lowered = row['descrip_title'].lower()
        lowered_removed_punc = lowered.translate(punc_remover)
        cleaned_set = set(lowered_removed_punc.split())
        return cleaned_set

    def recall_keywords(row):
        return row['description_title_set'].intersection(row['course_keywords_set'])
    
    def cosine_similarity(x, y):
        return 1 - cosine(x,y)
    
    prediction_df = df_with_keywords.copy()
    only_predicted_keywords_df = prediction_df[prediction_df.columns.difference(['course_name', 'course_title', 'course_description', 'course_subject', 'course_alternative_names'])]
    num_keywords_predicted_per_course = only_predicted_keywords_df.shape[1]
    prediction_df['course_keywords'] = only_predicted_keywords_df.iloc[:,:].apply(lambda x: ', '.join(x), axis=1)
    prediction_df = prediction_df[['course_name', 'course_title', 'course_description', 'course_keywords', 'course_alternative_names']]
    prediction_df['course_keywords'] = prediction_df['course_keywords'].apply(lambda keywords: ', '.join(sorted(set([word.strip() for word in keywords.split(',')]))))
    prediction_df['course_keywords_set'] = prediction_df['course_keywords'].apply(lambda keywords: (set([word.strip() for word in keywords.split(',')])))
    prediction_df['descrip_title'] = prediction_df['course_title'] + ' ' + prediction_df['course_description']
    prediction_df['description_title_set'] = prediction_df.apply(clean_descrip_title, axis = 1)
    prediction_df['shared_words'] = prediction_df.apply(recall_keywords, axis = 1)
    
    if metric == 'r':
        print('[INFO] Calculating Recall...')
        assert num_keywords_predicted_per_course == max_descript_len, 'Number of keywords predicted should equal longest description length'
        prediction_df['recall'] = prediction_df['shared_words'].apply(lambda words: len(list(words)) / max_descript_len)
        average_recall = np.mean(prediction_df['recall'])
        return average_recall
    if metric == 'p':
        print('[INFO] Calculating Precision...')
        assert num_keywords_predicted_per_course == num_top_words, 'Number of keywords predicted should equal number of predicted words per course'
        prediction_df['precision'] = prediction_df['shared_words'].apply(lambda words: len(list(words)) / num_top_words)
        average_precision = np.mean(prediction_df['precision'])
        return average_precision
    if metric == 'c':
        print('[INFO] Calculating Cosine Similarity Between Keyword Distributions...')
        predicted_keyword_list = only_predicted_keywords_df.values.tolist()
        predicted_keyword_list = list(chain.from_iterable(predicted_keyword_list))
        keyword_counter = Counter(predicted_keyword_list)
        print('[INFO] Most common keywords by count: ', keyword_counter.most_common(10))
        
        num_possible_keywords = df_with_keywords.shape[0] * num_top_words
        num_predicted_keywords = len(keyword_counter.keys())
        assert sum(keyword_counter.values()) == split_Y_valid.shape[0] * num_top_words,\
        'Total number of predicted keywords should equal number of courses * number of predicted keywords per course.'
        unif_keyword_vector = np.repeat(num_possible_keywords / num_predicted_keywords, num_predicted_keywords)
        predicted_keyword_vector = np.array(list(keyword_counter.values()))
        assert unif_keyword_vector.shape == predicted_keyword_vector.shape,\
        'Uniform keyword frequency vector should have same dimension as predicted keywords frequency vector.'
    
        cos_sim = cosine_similarity(predicted_keyword_vector, unif_keyword_vector)
        return cos_sim
    if metric == 'df':
        print('[INFO] Calculating Document Frequency of Predicted Keywords across Course Subjects...')
        document_df_cols = df_with_keywords.columns.difference(['course_title', 'course_description', 'course_name', 'course_alternative_names'])
        document_df = df_with_keywords.loc[:,document_df_cols]
        document_df.set_index('course_subject', inplace=True)
        
        document_dict = defaultdict(list)
        terms = set()
        for index, row in document_df.iterrows():
            document_dict[index].extend(row.values)
            terms.update(row.values)

        doc_freq_dict = defaultdict()
        num_docs = len(document_dict.keys())
        for term in terms:
            doc_freq_i = 0
            for key in document_dict.keys():
                if term in document_dict.get(key):
                    doc_freq_i += 1
            doc_freq_dict[term] = doc_freq_i / (num_docs)
            
        print('[INFO] Most common keywords by document frequencies: ', Counter(doc_freq_dict).most_common(10)) 
        average_document_frequency_score = np.mean(list(doc_freq_dict.values()))
        return average_document_frequency_score
        


In [36]:
def get_vocab(dataframe, column, max_df=0.057611, use_idf=True):
    """Gets the vocab labels to be used as inferred keywords. 
    Args:
        Dataframe with column name (string) to parse vocab from.
        Max_df (float): max document frequency for sklearn's vectorizer
        Use_idf (boolean): Use tf-idf to get top feature labels vs just using tf
    Returns:
        Array of vocab labels.
    """
    print("[INFO] Getting vocab...")
    
    dataframe[column] = dataframe[column].fillna('')
    test_corpus = dataframe.course_title.fillna('') + ' ' + dataframe.course_title.fillna('') + ' ' + dataframe.course_description.fillna('')
    vectorizer = TfidfVectorizer(max_df=max_df, stop_words='english', ngram_range=(1,1), use_idf=use_idf) 
    X = vectorizer.fit_transform(test_corpus)   # vectorizer.fit_transform(dataframe[column])
    unigrams = vectorizer.get_feature_names()
    print('[INFO] number unigrams: %d' % (len(unigrams)))

    vectorizer = TfidfVectorizer(max_df=max_df, stop_words='english', ngram_range=(2,2), use_idf=use_idf, max_features=max(1, int(len(unigrams)/2)))
    X = vectorizer.fit_transform(test_corpus)  #  vectorizer.fit_transform(dataframe[column])
    bigrams = vectorizer.get_feature_names()
    print('[INFO] Number of bigrams: %d' % (len(bigrams)))

    vocab = np.concatenate((unigrams, bigrams)) # , trigrams))
    vocab_list = list(vocab)
    removed_numbers_list = [word for word in vocab_list if not any(char.isdigit() for char in word)]
    vocab = np.array(removed_numbers_list)
    return vocab

In [37]:
def to_bag_of_words(dataframe, column, vocab, use_idf=True, tf_bias=.5):
    """Converts text corpus into its BOW representation using predefined vocab.
    Args:
        raw dataframe, text column, and vocabulary.
    Returns:
        A sparse matrix of the bag of words representation of the column.
    """
    vectorizer = TfidfVectorizer(stop_words='english', vocabulary=vocab, use_idf=use_idf)
    X = vectorizer.fit_transform(dataframe[column].values.astype('U'))
    if tf_bias == -999:
        print('[INFO] Not using tf-bias')
        return X
    return (X.multiply(1/X.count_nonzero())).power(-tf_bias)

In [38]:
def logistic_regression(X, Y, use_hidden_layer=False, hidden_layer_size=200, num_epochs=5):
    """Perform multinomial logistic regression from BOW vector space (Y) onto course vector space (X). 
    Args: 
        Matrix of course vectors and corresponding BOW description encodings and number of epochs. 
        Hidden_layer_size must be greater than the max_description_len trying to predict (181)
    Returns:
        Tuple of weights and bias dataframes to use in prediction.
    """
    print('[INFO] Performing logistic regression...')

    inputs = Input(shape=(X.shape[1],)) # course vec
    if use_hidden_layer:
        hidden_layer = Dense(hidden_layer_size, activation='sigmoid')(inputs)
        predictions = Dense(vocabsize, activation='softmax')(hidden_layer)
    else:
        predictions = Dense(vocabsize, activation='softmax')(inputs)
    model = Model(inputs=inputs, outputs=predictions)
    model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
    model.fit(X, Y, epochs=num_epochs)
    weights = model.layers[1].get_weights()[0]
    biases = model.layers[1].get_weights()[1]
    weights_frame = pd.DataFrame(weights)
    biases_frame = pd.DataFrame(biases)
    return(weights_frame, biases)

In [39]:
def predict(course_vecs, course_descripts, trained_weights, trained_biases, vocab_frame, num_words_per_course=10):
    """Predict inferred keywords for each course using train the vectorspace coeffs to predict the BOW of a point.
    Args:
        Course vectors, course description, weights and biases
        num_words_per_course (int): Number of words to predict per course
    Returns:
        Course description dataframe with a new column for every predicted word 
    """
    df_with_keywords = course_descripts.copy()
    # Obtain the softmax predictions for all instances
    softmax_frame = course_vecs.iloc[:,1:].dot(trained_weights.values) + trained_biases 

    # From the softmax predictions, save the top 10 predicted words for each data point
    print('[INFO] Sorting classification results...')
    sorted_frame = np.argsort(softmax_frame,axis=1).iloc[:,-num_words_per_course:]

    print('[INFO] Predicting top k inferred keywords for each course...')
    for i in range(num_words_per_course):
        new_col = vocab_frame.iloc[sorted_frame.iloc[:,i],0] # get the ith top vocab word for each entry
        df_with_keywords['predicted_word_' + str(num_words_per_course-i)] = new_col.values
        
    return df_with_keywords

In [46]:
vec_frame = pd.read_csv(vectorfile, sep = '\t') # Vector space representation of each user, all numeric
info_frame = pd.read_csv(infofile, sep = '\t') # Course information

nonempty_indices = np.where(info_frame[textcolumn].notnull())[0]
filtered_vec_df = vec_frame.iloc[nonempty_indices,:].reset_index(drop = True)
filtered_descript_df = info_frame.iloc[nonempty_indices,:].reset_index(drop = True)
max_descript_len = max(filtered_descript_df.course_description.str.split().str.len())
num_top_words = 10

hyperparams_cols = ['use_idf', 'max_df','tf-bias', 'use_hidden_layer', 'num_epochs', 'recall@max_len', 'precision@10', 'distribution_diff', 
                    'document_frequency']

param_grid = {'use_idf': [True, False],
              'max_df': np.arange(.02, .06, .01),
              'tf_bias': np.append(np.arange(0, 2.5, .5), -999),
              'num_epochs': [5, 10], 
              'use_hidden_layer': [True, False]} 

grid = ParameterGrid(param_grid)
print(len(grid))
# for params in grid:
#     print("[HYPERPARAMS] use_idf: %r, max_df: %f, tf_bias: %f, num_epochs: %d" % 
#           (params['use_idf'], params['max_df'], params['tf_bias'], params['num_epochs']))

192


In [41]:
# simple parameter grid search
# simple_param_grid = {'use_idf': [True],
#               'max_df': [1], # np.arange(0, .0055, .0005),
#               'use_hidden_layer': [True, False],
#               'tf_bias': np.arange(.5, 1.5, .5), 
#               'num_epochs': [5]} 

# grid = ParameterGrid(simple_param_grid)

recall_validation_scores = []
precision_validation_scores = []
distribution_validation_scores = []
document_frequency_validation_scores = []
grid_search_data = []

for params in grid:
    print("***[INFO] Evaluating cross-validated model with hyperparams use_idf: %r, max_df: %f, tf_bias: %f, use_hidden_layer: %r, num_epochs: %d***" % 
          (params['use_idf'], params['max_df'], params['tf_bias'], params['use_hidden_layer'], params['num_epochs']))

    fold_num = 1
    kf = KFold(n_splits=4, random_state=42) # DO NOT FIX RANDOM STATE WHEN RUNNING THE ACTUAL EXPERIMENT - NVM, should be fixed for reproducibility
    for train_idx, valid_idx in kf.split(filtered_vec_df):
        print('======== [INFO] Fold %d' % (fold_num))
        # X = vectors, Y = descriptions
        split_X_train, split_X_valid = filtered_vec_df.iloc[train_idx], filtered_vec_df.iloc[valid_idx]
        split_Y_train, split_Y_valid = filtered_descript_df.iloc[train_idx], filtered_descript_df.iloc[valid_idx]

        vocab = get_vocab(split_Y_train, textcolumn, max_df=params['max_df'], use_idf=params['use_idf']) 
        vocab_frame = pd.DataFrame(vocab)
        vocabsize = len(vocab)

        # Convert the textcolumn of the raw dataframe into bag of words representation
        split_Y_train_BOW = to_bag_of_words(split_Y_train, textcolumn, vocab, tf_bias=params['tf_bias'], use_idf=params['use_idf'])
        split_Y_train_BOW = split_Y_train_BOW.toarray()

        (weights_frame, biases) = logistic_regression(split_X_train.iloc[:,1:], split_Y_train_BOW, 
                                                      use_hidden_layer=params['use_hidden_layer'], num_epochs=params['num_epochs'])

        print('[INFO] Predicting on validation set for recall...')
        df_with_keywords = predict(split_X_valid, split_Y_valid, weights_frame, biases, vocab_frame, max_descript_len)
        fold_i_average_recall = calculate_metric(df_with_keywords, 'r')
        recall_validation_scores.append(fold_i_average_recall)
        print('[INFO] Fold %d recall: %f.' % (fold_num, fold_i_average_recall))
        
        print('[INFO] Predicting on validation set for precision...')
        df_with_keywords = predict(split_X_valid, split_Y_valid, weights_frame, biases, vocab_frame, num_top_words)
        fold_i_average_precision = calculate_metric(df_with_keywords, 'p')
        precision_validation_scores.append(fold_i_average_precision)
        print('[INFO] Fold %d precision: %f.' % (fold_num, fold_i_average_precision))
        

        fold_i_distribution_diff = calculate_metric(df_with_keywords, 'c')
        distribution_validation_scores.append(fold_i_distribution_diff)
        print('[INFO] Fold %d cosine similarity: %f.' % (fold_num, fold_i_distribution_diff))
        
        fold_i_document_frequency = calculate_metric(df_with_keywords, 'df')
        document_frequency_validation_scores.append(fold_i_document_frequency)
        print('[INFO] Fold %d document frequency: %f.' % (fold_num, fold_i_document_frequency))

        fold_num += 1

    recall_i = np.mean(recall_validation_scores)
    precision_i = np.mean(precision_validation_scores)
    distribution_diff_i = np.mean(distribution_validation_scores)
    document_frequency_i = np.mean(document_frequency_validation_scores)

    model_i_params = [params['use_idf'], params['max_df'], params['tf_bias'], params['use_hidden_layer'],
                      params['num_epochs'], recall_i, precision_i, distribution_diff_i, document_frequency_i]

#     model_i_params = pd.DataFrame([model_i_params], columns=hyperparams_cols)
#     grid_search_df.append(model_i_params, sort = False)
    grid_search_data.append(dict(zip(hyperparams_cols, model_i_params)))
    grid_search_df = pd.DataFrame(grid_search_data, columns=hyperparams_cols) 
    print(grid_search_df)
    # print('recall scores:', recall_validation_scores)
    # print('precision scores:', precision_validation_scores)
    # print('distribution scores:', distribution_validation_scores)
    
grid_search_df.to_csv(scorefile_path, index=False, sep='\t')

***[INFO] Evaluating cross-validated model with hyperparams use_idf: True, max_df: 1.000000, tf_bias: 0.500000, use_hidden_layer: True, num_epochs: 5***
[INFO] Getting vocab...
[INFO] number unigrams: 5333
[INFO] Number of bigrams: 2666
[INFO] Performing logistic regression...
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
[INFO] Predicting on validation set for recall...
[INFO] Sorting classification results...
[INFO] Predicting top k inferred keywords for each course...
[INFO] Calculating Recall...
[INFO] Fold 1 recall: 0.000153.
[INFO] Predicting on validation set for precision...
[INFO] Sorting classification results...
[INFO] Predicting top k inferred keywords for each course...
[INFO] Calculating Precision...
[INFO] Fold 1 precision: 0.000163.
[INFO] Calculating Cosine Similarity Between Keyword Distributions...
[INFO] Most common keywords by count:  [('ancestors', 818), ('acclimation', 765), ('agroforestry', 712), ('allotted', 686), ('ample', 658), ('amerindian', 584), ('ais'

[INFO] number unigrams: 5245
[INFO] Number of bigrams: 2622
[INFO] Performing logistic regression...
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
[INFO] Predicting on validation set for recall...
[INFO] Sorting classification results...
[INFO] Predicting top k inferred keywords for each course...
[INFO] Calculating Recall...
[INFO] Fold 2 recall: 0.000937.
[INFO] Predicting on validation set for precision...
[INFO] Sorting classification results...
[INFO] Predicting top k inferred keywords for each course...
[INFO] Calculating Precision...
[INFO] Fold 2 precision: 0.005490.
[INFO] Calculating Cosine Similarity Between Keyword Distributions...
[INFO] Most common keywords by count:  [('legality', 343), ('collaborator', 341), ('hyperrealism', 334), ('send', 321), ('reforming', 303), ('foundered', 133), ('underwrite', 130), ('helped', 125), ('launched', 121), ('gave', 119)]
[INFO] Fold 2 cosine similarity: 0.543258.
[INFO] Calculating Document Frequency of Predicted Keywords across Co

[INFO] number unigrams: 5111
[INFO] Number of bigrams: 2555
[INFO] Performing logistic regression...
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
[INFO] Predicting on validation set for recall...
[INFO] Sorting classification results...
[INFO] Predicting top k inferred keywords for each course...
[INFO] Calculating Recall...
[INFO] Fold 3 recall: 0.000128.
[INFO] Predicting on validation set for precision...
[INFO] Sorting classification results...
[INFO] Predicting top k inferred keywords for each course...
[INFO] Calculating Precision...
[INFO] Fold 3 precision: 0.000122.
[INFO] Calculating Cosine Similarity Between Keyword Distributions...
[INFO] Most common keywords by count:  [('aligned', 1006), ('animations', 890), ('aggregates', 877), ('annealing', 869), ('anthrax', 800), ('alien', 645), ('ancestors', 562), ('anchored', 517), ('abut', 509), ('absorptive', 458)]
[INFO] Fold 3 cosine similarity: 0.397376.
[INFO] Calculating Document Frequency of Predicted Keywords across Cour

In [42]:
grid_search_df

Unnamed: 0,use_idf,max_df,tf-bias,use_hidden_layer,num_epochs,recall@max_len,precision@10,distribution_diff,document_frequency
0,True,1,0.5,True,5,0.000147,0.000149,0.373041,0.14448
1,True,1,0.5,False,5,0.000523,0.00267,0.442515,0.083587
2,True,1,1.0,True,5,0.000395,0.001816,0.432875,0.103138
3,True,1,1.0,False,5,0.000509,0.002491,0.471151,0.083418


In [44]:
pd.read_csv(scorefile_path, sep='\t')

Unnamed: 0,use_idf,max_df,tf-bias,use_hidden_layer,num_epochs,recall@max_len,precision@10,distribution_diff,document_frequency
0,True,1,0.5,True,5,0.000147,0.000149,0.373041,0.14448
1,True,1,0.5,False,5,0.000523,0.00267,0.442515,0.083587
2,True,1,1.0,True,5,0.000395,0.001816,0.432875,0.103138
3,True,1,1.0,False,5,0.000509,0.002491,0.471151,0.083418
