# README
- This notebook follows on from LogReg_training and includes code to test the training sets for prediction.
- The goal is to see which training sets and which LogReg function settings might perform best.
- The prediction function and training sets would then be used in the first stage of the pipeline: first to predict relevant results from a search (e.g. Twitter), then to predict relevant results from text taken from scraped URLs, so that the end user is passed a list of websites to evaluate for inclusion.

In [3]:
#imports + path
import pandas as pd
pd.set_option('display.max_rows', 500)
import pickle
import numpy as np
import category_encoders as ce
import sklearn
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
path = '../'

Step 1:
- Based on the evaluations of the different training sets (see LogReg_training notebook) we focus on three variations for a training model against which to predict: the extended set (unbalanced towards musoW 2/1), the even set, and the extended even set using description only as features. These are the variations that gave us the best precision on negatives during model testing, which we decided was the most useful measure to ensure a minimal number of false positives in our results.
- For the encoding we focus on TF-IDF as this is the most common type of encoding for text-based features. Backward Difference encoding performed better; however, it is unclear whether this type of encoding is being correctly applied in this case.
- The prediction sets used for testing are: the additions to the base set, a set of 20 results manually scraped from Twitter results, and a set of results from GitHub searches (see below for details of each).
- The outputs include confidence scores, probability estimates, and the length of the input (to see how length might affect scoring).
- After some early testing we decided to switch the sklearn function to LogisticRegressionCV as it seems to perform better, especially with 2 CV folds and average_precision or f1 as the scoring parameter. If scoring is set to precision it behaves more drastically, but this can be compensated for with higher CV values. Re-running model testing with LogisticRegressionCV shows higher precision on negatives under all set variations using TF-IDF.
- Lastly, after some further testing of a small-scale version of the pipeline (see LogReg_Twitter notebook), we decided to create a new training set that combines both musoW and MJI as positives and uses scraped data from Twitter searches for 'digital humanities', 'digital library', 'music business', and 'music companies' as negatives. These results were automatically scraped and manually selected (roughly 10% of all scrapes were usable). We made this change because early results with the first version of the training sets seemed to indicate that the baseline difference between classes was likely too subtle for the model to pick up on, especially given the small size of the set. Changing the training set showed marked improvements in confidence scores.
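
For a binary logistic regression model, the confidence score and the probability estimate listed among the outputs are two views of the same quantity: the probability of the positive class is the logistic sigmoid of the signed score. A minimal sketch of that relationship in plain Python follows. The function name and the example scores are made up for illustration and are not taken from the notebook's results.

```python
import math

def score_to_probability(score):
    """Map a signed decision score to P(class = 1) with the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-score))

# Illustrative scores only, not taken from the notebook's results.
example_scores = [-2.0, 0.0, 0.5, 3.0]
for s in example_scores:
    p = score_to_probability(s)
    # A score of 0 sits on the decision boundary (p = 0.5);
    # log(p) is what a log-probability output would report for class 1.
    print(f"score={s:+.1f}  p={p:.3f}  log_p={math.log(p):.3f}")
```

Because the mapping is monotonic, sorting results by score or by positive-class probability gives the same ranking.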

In [6]:
#NEW TRAINING SET V1
new_neg_set = pd.read_excel(path+'LOGREG_RELEVANCE/TRAINING_SETS/non_archive_negative_set_v1.xlsx')
new_neg_set = new_neg_set.drop_duplicates(subset=['Title'])
new_neg_set = new_neg_set.drop_duplicates(subset=['Description'])
new_neg_set['Target'] = '0'
positive_set = pd.read_pickle(path+'LOGREG_RELEVANCE/TRAINING_SETS/trainingset_even_extended.pkl')
positive_set['Target'] = '1'
archive_desc_training_v1 = pd.concat([positive_set, new_neg_set])
archive_desc_training_v1['Target'] = archive_desc_training_v1['Target'].astype('int')
archive_desc_training_v1 = archive_desc_training_v1.reset_index(drop=True)
archive_desc_training_v1.to_pickle(path+'LOGREG_RELEVANCE/TRAINING_SETS/archive_desc_training_v1.pkl')

In [19]:
#NEW TRAINING SET V2
new_neg_set_2 = pd.read_excel(path+'LOGREG_RELEVANCE/TRAINING_SETS/non_archive_negative_set_v2.xlsx')
new_neg_set_2 = new_neg_set_2.drop_duplicates(subset=['Title'])
new_neg_set_2 = new_neg_set_2.drop_duplicates(subset=['Description'])
new_neg_set_2['Target'] = '0'
positive_set_2 = pd.read_pickle(path+'LOGREG_RELEVANCE/TRAINING_SETS/trainingset_extended.pkl')
positive_set_2['Target'] = '1'
archive_desc_training_v2 = pd.concat([positive_set_2, new_neg_set_2])
archive_desc_training_v2['Target'] = archive_desc_training_v2['Target'].astype('int')
archive_desc_training_v2 = archive_desc_training_v2.reset_index(drop=True)
archive_desc_training_v2.to_pickle(path+'LOGREG_RELEVANCE/TRAINING_SETS/archive_desc_training_v2.pkl')
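
The V1 and V2 cells repeat the same steps (de-duplicate the negatives, label both sets, concatenate). One possible way to factor this out is sketched below. `build_training_set` is a hypothetical helper, the toy frames stand in for the real spreadsheet and pickle files, and it assigns integer labels directly rather than casting from strings.

```python
import pandas as pd

def build_training_set(positives, negatives):
    """Label positives as 1 and de-duplicated negatives as 0, then stack them."""
    negatives = (
        negatives.drop_duplicates(subset=['Title'])
                 .drop_duplicates(subset=['Description'])
                 .copy()
    )
    positives = positives.copy()
    negatives['Target'] = 0
    positives['Target'] = 1
    return pd.concat([positives, negatives]).reset_index(drop=True)

# Toy frames standing in for the real spreadsheet and pickle files.
toy_pos = pd.DataFrame({'Title': ['A'], 'Description': ['a music archive']})
toy_neg = pd.DataFrame({
    'Title': ['B', 'B', 'C'],
    'Description': ['a record label', 'another label', 'a record label'],
})
toy_training = build_training_set(toy_pos, toy_neg)
print(toy_training)
```

In the toy data, the second 'B' row is dropped as a duplicate title and the 'C' row as a duplicate description, leaving one negative.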

In [9]:
#LOAD TRAINING DFs 

#base DFs 
training_set_adds = pd.read_pickle(path+'LOGREG_RELEVANCE/TRAINING_SETS/trainingset_extended.pkl')
training_set_even = pd.read_pickle(path+'LOGREG_RELEVANCE/TRAINING_SETS/trainingset_even.pkl')
training_set_even_adds = pd.read_pickle(path+'LOGREG_RELEVANCE/TRAINING_SETS/trainingset_even_extended.pkl')
new_training_set = pd.read_pickle(path+'LOGREG_RELEVANCE/TRAINING_SETS/new_training_set.pkl')

In [3]:
#LOAD PREDICTION DFs

#additions to base set (164)
predictions_1 = pd.read_pickle(path+'LOGREG_RELEVANCE/PREDICTION_SETS/base_prediction_set.pkl') 

#test results from twitter (20)
predictions_2 = pd.read_csv(path+'TWITTER_SEARCHES/twitter_test.csv', encoding='iso-8859-1')

#test results from github (104)
prediction_github_1 = pd.read_pickle(path+'GH_PICKLES/music_archive.pkl')
prediction_github_2 = pd.read_pickle(path+'GH_PICKLES/digital_score.pkl')
prediction_github_3 = pd.read_pickle(path+'GH_PICKLES/library_music.pkl')
prediction_github_4 = pd.read_pickle(path+'GH_PICKLES/oral_history.pkl')
predictions_3 = pd.concat([prediction_github_1, prediction_github_2, prediction_github_3, prediction_github_4]).reset_index(drop=True)
predictions_3 = predictions_3.dropna(how='any').reset_index(drop=True)

In [4]:
#PREDICTION FUNCTION W/ and W/O LOGREGCV 

def lr_model_predict_cv(t_input, t_feature, target, cv_int, score_type, p_input, p_feature, filename, path):
    # Encode the training text as TF-IDF features
    count_vect = CountVectorizer()
    tfidf_transformer = TfidfTransformer()
    x_count = count_vect.fit_transform(t_input[t_feature])
    x_train = tfidf_transformer.fit_transform(x_count)
    y_train = t_input[target].values
    # Fit the cross-validated model and save it to disk
    model = LogisticRegressionCV(solver='liblinear', random_state=44, cv=cv_int, scoring=score_type)
    model.fit(x_train, y_train)
    export = f'LOGREG_RELEVANCE/MODELS/{filename}.sav'
    with open(path+export, 'wb') as model_file:
        pickle.dump(model, model_file)
    # Encode the prediction text with the vocabulary fitted on the training set
    x_new_count = count_vect.transform(p_input[p_feature])
    x_new_train = tfidf_transformer.transform(x_new_count)
    y_predict = model.predict(x_new_train)
    scores = model.decision_function(x_new_train)
    # Note: predict_log_proba returns log-probabilities, one [class 0, class 1] pair per row
    probability = model.predict_log_proba(x_new_train)
    result = p_input.copy()
    result['Prediction'] = list(y_predict)
    result['Score'] = list(scores)
    result['Probability'] = list(probability)
    result['Input Length'] = result[p_feature].str.len()
    return result

def lr_model_predict(t_input, t_feature, target, p_input, p_feature, filename, path):
    # Encode the training text as TF-IDF features
    count_vect = CountVectorizer()
    tfidf_transformer = TfidfTransformer()
    x_count = count_vect.fit_transform(t_input[t_feature])
    x_train = tfidf_transformer.fit_transform(x_count)
    y_train = t_input[target].values
    # Fit the plain model (fixed C, no cross-validation) and save it to disk
    model = LogisticRegression(solver='liblinear', C=10.0, random_state=44)
    model.fit(x_train, y_train)
    export = f'LOGREG_RELEVANCE/MODELS/{filename}.sav'
    with open(path+export, 'wb') as model_file:
        pickle.dump(model, model_file)
    # Encode the prediction text with the vocabulary fitted on the training set
    x_new_count = count_vect.transform(p_input[p_feature])
    x_new_train = tfidf_transformer.transform(x_new_count)
    y_predict = model.predict(x_new_train)
    scores = model.decision_function(x_new_train)
    # Note: predict_log_proba returns log-probabilities, one [class 0, class 1] pair per row
    probability = model.predict_log_proba(x_new_train)
    result = p_input.copy()
    result['Prediction'] = list(y_predict)
    result['Score'] = list(scores)
    result['Probability'] = list(probability)
    result['Input Length'] = result[p_feature].str.len()
    return result
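
Both functions pickle only the fitted model, so the saved `.sav` file cannot encode new text on its own; the fitted vectorizer is needed as well. One possible way around this, sketched here as an assumption rather than as the method used in this notebook, is to wrap the encoding and the classifier in a scikit-learn `Pipeline` and pickle that single object. The toy corpus and variable names are hypothetical. `TfidfVectorizer` is used in place of `CountVectorizer` followed by `TfidfTransformer`, which scikit-learn documents as equivalent.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus for illustration only.
train_texts = [
    'digital archive of music manuscripts',
    'online library of folk song recordings',
    'music business marketing agency',
    'company selling concert tickets',
]
train_labels = [1, 1, 0, 0]

# Encoding and classifier live in one object, fitted together.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(solver='liblinear', C=10.0, random_state=44)),
])
pipeline.fit(train_texts, train_labels)

# Pickling the whole pipeline keeps the fitted vocabulary with the model.
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# The restored object accepts raw text directly.
new_texts = ['archive of jazz recordings']
print(restored.predict(new_texts))
print(restored.decision_function(new_texts))
```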


Test 1:
- predict the even set against extensions 
- adjust CV and scoring values for logregcv function using 2, 5, 10 for CV and average_precision, precision_weighted, precision, f1 and f1_weighted (values are changed manually in the function call)
- run both logreg versions 
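
Rather than changing the CV and scoring values by hand in each call, the combinations listed above could be swept in a loop. The sketch below does this on a made-up toy corpus, so the counts it prints say nothing about the real sets. The variable names are hypothetical, and `TfidfVectorizer` stands in for the `CountVectorizer` plus `TfidfTransformer` pair.

```python
import warnings
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV

# Toy corpus for illustration only: 10 positives and 10 negatives so that
# every fold of the largest CV setting contains both classes.
pos_texts = [f'digital archive of music scores collection {i}' for i in range(10)]
neg_texts = [f'music business company marketing services {i}' for i in range(10)]
train_texts = pos_texts + neg_texts
train_labels = [1] * 10 + [0] * 10
new_texts = ['online archive of folk music recordings',
             'marketing company for touring bands']

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(train_texts)
x_new = vectorizer.transform(new_texts)

positives_found = {}
for cv_int in (2, 5, 10):
    for score_type in ('average_precision', 'precision_weighted',
                       'precision', 'f1', 'f1_weighted'):
        model = LogisticRegressionCV(solver='liblinear', random_state=44,
                                     cv=cv_int, scoring=score_type)
        with warnings.catch_warnings():
            # Some settings predict no positives in a fold, which triggers
            # an undefined-precision warning; silence it for the sweep.
            warnings.simplefilter('ignore')
            model.fit(x_train, train_labels)
        # Count how many of the new texts are predicted as positive.
        positives_found[(cv_int, score_type)] = int(model.predict(x_new).sum())

for setting, count in positives_found.items():
    print(setting, count)
```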

In [None]:
#Predict using LogRegCV and return positive results by descending confidence scores
even_pred_1 = lr_model_predict_cv(training_set_even, 'Description', 'Target', 10, 'average_precision', predictions_1, 'Description', 'even_model_logregcv', '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/')
even_pred_1 = even_pred_1.loc[even_pred_1['Prediction'] == 1]
even_pred_1.sort_values(by='Score', ascending=False).reset_index(drop=True)

This prediction set has 164 entries and technically includes 20 positives (all from GitHub, taken from the ISMIR list and currently not included in the musoW dataset) and some negatives that could be positives (MJI list).

scoring and cv values:
- precision returns no positives.
- average_precision returns 16 positives w/ cv = 10, including 6 from ISMIR
- precision_weighted w/ cv 2 returns 16 positives, same as above
- f1 w/ cv 2 returns 44 positives, including all ISMIR -> drops when you increase the cv value
- f1_weighted same as above 

In [None]:
#Predict using LogReg
even_pred_1 = lr_model_predict(training_set_even, 'Description', 'Target', predictions_1, 'Description', 'even_model_logreg', '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/')
even_pred_1 = even_pred_1.loc[even_pred_1['Prediction'] == 1]
even_pred_1.sort_values(by='Score', ascending=False).reset_index(drop=True)

LogReg function returns 42 positives, including 18 from ISMIR. The rest include some ambiguous MJI entries; however, there are still a lot of false positives.
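
The counts reported for Test 1 can be turned into rough recall and precision figures. The sketch below assumes that only the 20 ISMIR entries count as known positives, so the ambiguous MJI entries are treated as misses and the precision values are lower bounds only. The dictionary layout and names are hypothetical.

```python
# Counts as reported in the Test 1 notes; only the 20 ISMIR entries are
# treated as known positives here, which is an assumption.
known_positives = 20
runs = {
    'LogRegCV average_precision, cv=10': {'returned': 16, 'ismir_hits': 6},
    'LogRegCV f1, cv=2': {'returned': 44, 'ismir_hits': 20},
    'LogReg': {'returned': 42, 'ismir_hits': 18},
}

summary = {}
for name, counts in runs.items():
    # Share of the known positives that the run recovered.
    recall = counts['ismir_hits'] / known_positives
    # Lower bound: ambiguous MJI entries are counted as misses.
    precision_floor = counts['ismir_hits'] / counts['returned']
    summary[name] = (recall, precision_floor)
    print(f'{name}: recall={recall:.2f}, precision>={precision_floor:.2f}')
```

On these figures, the f1 and plain LogReg runs recover most or all of the ISMIR entries, but fewer than half of what they return is a known positive.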

Test 2:
- predict the even set against twitter samples 

In [None]:
#Predict using LogRegCV and return positive results by descending confidence scores
even_pred_2 = lr_model_predict_cv(training_set_even, 'Description', 'Target', 10, 'precision', predictions_2, 'Description', 'even_model_logregcv', '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/')
even_pred_2 = even_pred_2.loc[even_pred_2['Prediction'] == 1]
even_pred_2.sort_values(by='Score', ascending=False).reset_index(drop=True)

This prediction set has 20 entries, with one that's a definite positive and a few that are ambiguous.

scoring and cv values (2, 5, 10):
- precision (all cv values) returns 1 ambiguous 
- average_precision returns 1 ambiguous at cv 2 and 5; 2 ambiguous, one definite and one wrong at cv 10
- precision_weighted same as above for cv 10 but at all cv values 
- f1/f1_weighted (all cv values) return 5 positives, including the definite, one ambiguous and one wrong 

In [None]:
#Predict using LogReg
even_pred_1 = lr_model_predict(training_set_even, 'Description', 'Target', predictions_2, 'Description', 'even_model_logreg', '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/')
even_pred_1 = even_pred_1.loc[even_pred_1['Prediction'] == 1]
even_pred_1.sort_values(by='Score', ascending=False).reset_index(drop=True)

LogReg function returns 5 results including the definite, one wrong, and 3 ambiguous. 

In all cases (logreg and logregcv) the definite result never ranks top in confidence score. 

Test 3:
- predict the even set against github results 

In [None]:
#Predict using LogRegCV and return positive results by descending confidence scores
even_pred_3 = lr_model_predict_cv(training_set_even, 'Description', 'Target', 2, 'precision', predictions_3, 'Description', 'even_model_logregcv', '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/')
even_pred_3 = even_pred_3.loc[even_pred_3['Prediction'] == 1]
even_pred_3 = even_pred_3.sort_values(by='Score', ascending=False).reset_index(drop=True)
even_pred_3

This prediction set has 104 entries and hasn't been assessed for how many relevant results it might contain, as the first sets of tests revealed that the inputs (taken from repo descriptions via the GitHub API) are quite short, which is likely having a negative impact on the results. Results are noted below for reference, but for now the assumption is that the GitHub searches need to be fixed to also scrape README files when available.

scoring and cv values (2, 5, 10):
- precision returns 42 (2, 5) and 62 (10)
- average_precision returns 62 (2, 5) and 87 (10)
- precision_weighted returns 87 (all values)
- f1/f1_weighted returns 93-94 (all values)

In [None]:
#Predict using LogReg
even_pred_1 = lr_model_predict(training_set_even, 'Description', 'Target', predictions_3, 'Description', 'even_model_logreg', '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/')
even_pred_1 = even_pred_1.loc[even_pred_1['Prediction'] == 1]
even_pred_1.sort_values(by='Score', ascending=False).reset_index(drop=True)

LogReg returns 95 positives out of 104.

Test 4:
- predict the extended set against twitter samples 

In [None]:
#Predict using LogRegCV and return positive results by descending confidence scores
even_pred_4 = lr_model_predict_cv(training_set_adds, 'Description', 'Target', 10, 'precision_weighted', predictions_2, 'Description', 'extended_model_logregcv', '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/')
even_pred_4 = even_pred_4.loc[even_pred_4['Prediction'] == 1]
even_pred_4 = even_pred_4.sort_values(by='Score', ascending=False).reset_index(drop=True)
even_pred_4

scoring and cv values (2, 5, 10):
- precision returns 6 positives including definite, ambiguous and 2 wrong 
- average_precision returns 6 positives at cv 2 and 20 at cv5/10
- precision_weighted returns 6 positives including definite, ambiguous and 2 wrong 
- f1/f1_weighted returns 6 positives including definite, ambiguous and 2 wrong 

In [None]:
#Predict using LogReg
even_pred_1 = lr_model_predict(training_set_adds, 'Description', 'Target', predictions_2, 'Description', 'extended_model_logreg', '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/')
even_pred_1 = even_pred_1.loc[even_pred_1['Prediction'] == 1]
even_pred_1.sort_values(by='Score', ascending=False).reset_index(drop=True)

LogReg returns 6 results including definite, two wrong, and 3 ambiguous (same as even set but different order based on confidence scores).

Test 5:
- predict extended even set against twitter samples

In [None]:
#Predict using LogRegCV and return positive results by descending confidence scores
even_pred_5 = lr_model_predict_cv(training_set_even_adds, 'Description', 'Target', 10, 'f1_weighted', predictions_2, 'Description', 'extended_even_model_logregcv', '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/')
even_pred_5 = even_pred_5.loc[even_pred_5['Prediction'] == 1]
even_pred_5 = even_pred_5.sort_values(by='Score', ascending=False).reset_index(drop=True)
even_pred_5

scoring and cv values (2, 5, 10):
- precision (all cv values) returns nothing
- average_precision returns one definite, two ambiguous, one wrong at all cv values w/ the wrong having highest confidence score
- precision_weighted same as above but only three results at cv 2, 5 (dropping to one ambiguous)
- f1/f1_weighted same as average_precision w/ three results for cv 2 on weighted 

In [None]:
#Predict using LogReg
even_pred_1 = lr_model_predict(training_set_even_adds, 'Description', 'Target', predictions_2, 'Description', 'extended_even_model_logreg', '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/')
even_pred_1 = even_pred_1.loc[even_pred_1['Prediction'] == 1]
even_pred_1.sort_values(by='Score', ascending=False).reset_index(drop=True)

LogReg returns four results including definite (lowest confidence score), two ambiguous, and one wrong (highest confidence score). This highest scoring wrong result also has the shortest input length in the prediction set, which seems to confirm doubts around input length from github results. 
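
One way to check the suspicion about input length is to flag short inputs and look at where they land in the score ranking. The sketch below uses a made-up result frame with the same columns the prediction functions return. `MIN_LENGTH` is a hypothetical cut-off, not a value derived from the data.

```python
import pandas as pd

# Toy result frame with the same columns the prediction functions return;
# the values are made up for illustration.
toy_result = pd.DataFrame({
    'Description': ['short text',
                    'a longer description of a digital music archive',
                    'mid-length note on music'],
    'Prediction': [1, 1, 1],
    'Score': [2.4, 0.9, 1.1],
})
toy_result['Input Length'] = toy_result['Description'].str.len()

# Hypothetical cut-off: flag inputs shorter than this many characters.
MIN_LENGTH = 20
toy_result['Short Input'] = toy_result['Input Length'] < MIN_LENGTH

# Rank by confidence and see whether short inputs cluster at the top.
ranked = toy_result.sort_values(by='Score', ascending=False).reset_index(drop=True)
print(ranked[['Score', 'Input Length', 'Short Input']])
```

If short inputs consistently rank near the top on the real prediction sets, dropping or down-weighting them before review might be worth testing.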

Test 6:
- predict extended even set against github

In [None]:
#Predict using LogRegCV and return positive results by descending confidence scores
even_pred_6 = lr_model_predict_cv(training_set_even_adds, 'Description', 'Target', 2, 'f1', predictions_3, 'Description', 'extended_even_model_logregcv', '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/')
even_pred_6 = even_pred_6.loc[even_pred_6['Prediction'] == 1]
even_pred_6 = even_pred_6.sort_values(by='Score', ascending=False).reset_index(drop=True)
even_pred_6

Conducted this test just to see how it might differ from Test 3. All scoring and cv values returned high positive counts, as with Test 3, apart from precision, which returns only 18 positives (all cv values).

In [None]:
#Predict using LogReg
even_pred_1 = lr_model_predict(training_set_even_adds, 'Description', 'Target', predictions_3, 'Description', 'extended_even_model_logreg', '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/')
even_pred_1 = even_pred_1.loc[even_pred_1['Prediction'] == 1]
even_pred_1.sort_values(by='Score', ascending=False).reset_index(drop=True)

LogReg returns 101 positives. 

Test 7:
- predict new training set against twitter samples, and github. 

In [None]:
#Predict using LogRegCV and return positive results by descending confidence scores
even_pred_1 = lr_model_predict_cv(new_training_set, 'Description', 'Target', 2, 'precision', predictions_2, 'Description', 'new_model_logregcv', '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/')
even_pred_1 = even_pred_1.loc[even_pred_1['Prediction'] == 1]
even_pred_1.sort_values(by='Score', ascending=False).reset_index(drop=True)

Twitter samples:
- returns 10 positives at all values, including the definite, four wrong, and the rest ambiguous. The definite ranks around 6-7. The first two are always ambiguous.

GitHub searches:
- returns over 100 positives at all values.

In [None]:
#Predict using LogReg
even_pred_1 = lr_model_predict(new_training_set, 'Description', 'Target', predictions_2, 'Description', 'new_model_logreg', '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/')
even_pred_1 = even_pred_1.loc[even_pred_1['Prediction'] == 1]
even_pred_1.sort_values(by='Score', ascending=False).reset_index(drop=True)

Results for all three prediction sets are similar with LogReg. 