TO DO:
- Work out how to split the function into two: a modelling function and a prediction function. It's difficult because the fitted count vectorizer and tf-idf transform also need to be carried over to the prediction step. The answer might be in the log reg classifier tutorial?

In [1]:
#imports + path
import pandas as pd
pd.set_option('display.max_rows', 500)
import pickle
import numpy as np
import category_encoders as ce
import sklearn
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
path = '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/'

- Based on LogReg evaluations of the different training sets, we focus on three variations for the potential final model/predictions: extended set, even set, and extended even set, each using the description only as the feature. These variations gave the best precision on negatives during model testing, which should give us a model reasonably good at removing what we don't want from our results.
- For the encoding we focus on TF-IDF, as I'm not yet 100% sure that BD encoding is being used correctly. If it is, there is a case for using it with combined features, as it performs better with them.
- After some early tests I switched the sklearn function to LogisticRegressionCV, as it seems to perform better, especially with 2 CV folds and average_precision or f1 as the scoring parameter. With scoring set to precision it behaves more drastically, but this can be compensated for with higher CV values. Re-running model testing with LogisticRegressionCV shows higher precision on negatives under all set variations using TF-IDF.
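One way to see why the scoring parameter changes behaviour: LogisticRegressionCV uses the cross-validated score to pick the regularisation strength C, so a different scorer can select a different C and therefore a different model. The sketch below is only an illustration on a toy corpus invented for this example (the texts and labels are not from the training sets), and it uses TfidfVectorizer as a shorthand for CountVectorizer followed by TfidfTransformer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV

# Toy corpus, invented for illustration only (1 = relevant, 0 = not relevant)
texts = [
    "digital archive of music scores",
    "online library of folk song recordings",
    "database of opera manuscripts",
    "collection of digitised sheet music",
    "recipe blog with cooking tips",
    "football match results and tables",
    "weather forecast for the weekend",
    "stock market news and analysis",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

x_train = TfidfVectorizer().fit_transform(texts)

fitted = {}
for scoring in ["f1", "average_precision"]:
    model = LogisticRegressionCV(solver="liblinear", random_state=44, cv=2, scoring=scoring)
    model.fit(x_train, labels)
    fitted[scoring] = model
    # C_ holds the regularisation strength selected by cross-validation for this scorer
    print(scoring, "-> selected C:", np.ravel(model.C_)[0])
```

On a corpus this small the selected values say nothing about the real training sets; the point is only where to look when two scorers disagree.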

In [3]:
#LOAD TRAINING DFs 

#base DFs 
training_set_adds = pd.read_pickle(path+'LOGREG_RELEVANCE/trainingset_extended.pkl')
training_set_even = pd.read_pickle(path+'LOGREG_RELEVANCE/trainingset_even.pkl')
training_set_even_adds = pd.read_pickle(path+'LOGREG_RELEVANCE/trainingset_even_extended.pkl')

#LOAD BD DFs 

#BD sets 
training_set_adds_ce_desc = pd.read_pickle(path+'LOGREG_RELEVANCE/trainingset_extended_BD_desc.pkl')
training_set_ce_even_desc = pd.read_pickle(path+'LOGREG_RELEVANCE/trainingset_even_BD_desc.pkl')
training_set_ce_even_extended_desc = pd.read_pickle(path+'LOGREG_RELEVANCE/trainingset_even_extended_BD_desc.pkl')

In [5]:
#LOAD PREDICTION DFs

#additions to base set (164)
predictions_1 = pd.read_pickle(path+'LOGREG_RELEVANCE/base_prediction_set.pkl') 

#test results from twitter (20)
predictions_2 = pd.read_csv(path+'TWITTER_SEARCHES/twitter_test.csv', encoding='iso-8859-1')

#test results from github (104)
prediction_github_1 = pd.read_pickle(path+'GH_PICKLES/music_archive.pkl')
prediction_github_2 = pd.read_pickle(path+'GH_PICKLES/digital_score.pkl')
prediction_github_3 = pd.read_pickle(path+'GH_PICKLES/library_music.pkl')
prediction_github_4 = pd.read_pickle(path+'GH_PICKLES/oral_history.pkl')
predictions_3 = pd.concat([prediction_github_1, prediction_github_2, prediction_github_3, prediction_github_4]).reset_index(drop=True)
predictions_3 = predictions_3.dropna(how='any').reset_index(drop=True)

In [52]:
#PREDICTION FUNCTION W/ LOGREGCV 

def lr_model_predict(t_input, t_feature, target, cv_int, score_type, p_input, p_feature, filename, path):
    #fit the count vectorizer + tf-idf transform on the training text
    count_vect = CountVectorizer()
    tfidf_transformer = TfidfTransformer()
    x_count = count_vect.fit_transform(t_input[t_feature])
    x_train = tfidf_transformer.fit_transform(x_count)
    y_train = t_input[target].values
    #train the model, CV picks the regularisation strength using score_type
    model = LogisticRegressionCV(solver='liblinear', random_state=44, cv=cv_int, scoring=score_type)
    model.fit(x_train, y_train)
    #save the model (the fitted vectorizer + transformer are NOT saved here)
    export = f'LOGREG_RELEVANCE/{filename}.sav'
    with open(path+export, 'wb') as f:
        pickle.dump(model, f)
    #reuse the fitted vectorizer + transformer on the prediction text
    x_new_count = count_vect.transform(p_input[p_feature])
    x_new_tfidf = tfidf_transformer.transform(x_new_count)
    y_predict = model.predict(x_new_tfidf)
    scores = model.decision_function(x_new_tfidf)
    #log-probabilities per class, in the order of model.classes_
    probability = model.predict_log_proba(x_new_tfidf)
    result = p_input.copy()
    result['Prediction'] = list(y_predict)
    result['Score'] = list(scores)
    result['Probability'] = list(probability)
    #length of the text that was actually used as the feature
    result['Input Length'] = result[p_feature].str.len()
    return result
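The function stores three related outputs: the class prediction, the decision score, and the log-probabilities. As a hedged illustration of how they relate for a binary logistic regression, the sketch below checks that exponentiating the log-probabilities gives the probabilities, and that the sigmoid of the decision score gives the probability of the positive class. The toy texts are invented for this example, and a plain LogisticRegression stands in for LogisticRegressionCV to keep it short.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus, invented for illustration only (1 = relevant, 0 = not relevant)
texts = [
    "digital archive of music scores",
    "online library of folk song recordings",
    "recipe blog with cooking tips",
    "football match results and tables",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(texts)
model = LogisticRegression(solver="liblinear", random_state=44).fit(x_train, labels)

x_new = vectorizer.transform(["archive of jazz recordings", "cooking tips"])
scores = model.decision_function(x_new)      # one signed score per row
log_proba = model.predict_log_proba(x_new)   # one column per class, ordered as model.classes_
proba = model.predict_proba(x_new)

# exp of the log-probabilities gives back the probabilities
print(np.allclose(np.exp(log_proba), proba))
# for a binary model, the sigmoid of the score is the probability of class 1
print(np.allclose(1 / (1 + np.exp(-scores)), proba[:, 1]))

# a single, easier to read number per row: probability of the positive class
positive_probability = proba[:, 1]
```

If a single readable column is preferred over a pair of log-probabilities, `predict_proba(...)[:, 1]` is one option; it carries the same ranking as the Score column.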


In [19]:
""" def lr_model(t_input, t_feature, target, score_type, filename, path):
    count_vect = CountVectorizer()
    tfidf_transformer = TfidfTransformer()
    x_count = count_vect.fit_transform(t_input[t_feature])
    x_train = tfidf_transformer.fit_transform(x_count)
    y_train = t_input[target].values
    model = LogisticRegressionCV(solver='liblinear', random_state=44, cv=2, scoring=score_type)
    model.fit(x_train, y_train)
    model_file = f'LOGREG_RELEVANCE/{filename}_model.pkl'
    vectorizer_file = f'LOGREG_RELEVANCE/{filename}_vectorizer.pkl'
    with open(path+model_file, 'wb') as f:
        pickle.dump(model, f)
    #save the fitted count vectorizer + tf-idf transformer together (not the file name)
    with open(path+vectorizer_file, 'wb') as f:
        pickle.dump((count_vect, tfidf_transformer), f)

def lr_predict(p_input, p_feature, filename, path):
    with open(path+f'LOGREG_RELEVANCE/{filename}_model.pkl', 'rb') as f:
        model = pickle.load(f)
    #load both fitted objects back, prediction needs the same vocabulary and idf weights
    with open(path+f'LOGREG_RELEVANCE/{filename}_vectorizer.pkl', 'rb') as f:
        count_vect, tfidf_transformer = pickle.load(f)
    x_new_count = count_vect.transform(p_input[p_feature])
    x_new_tfidf = tfidf_transformer.transform(x_new_count)
    y_predict = model.predict(x_new_tfidf)
    #decision_function only takes the feature matrix
    scores = model.decision_function(x_new_tfidf)
    probability = model.predict_log_proba(x_new_tfidf)
    result = p_input.copy()
    result['Prediction'] = list(y_predict)
    result['Score'] = list(scores)
    result['Probability'] = list(probability)
    return result """
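A possible alternative for the split, sketched under assumptions: scikit-learn's Pipeline can hold the count vectorizer, the tf-idf transform and the classifier as a single object, so one pickle carries everything the prediction step needs. The function names `train_pipeline` and `predict_with_pipeline`, the file name and the toy data are all hypothetical and only serve the illustration; this is not the notebook's current approach.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline


def train_pipeline(t_input, t_feature, target, cv_int, score_type, model_file):
    """Fit vectorizer, tf-idf transform and classifier as one object and pickle it."""
    pipeline = Pipeline([
        ("counts", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("model", LogisticRegressionCV(solver="liblinear", random_state=44,
                                       cv=cv_int, scoring=score_type)),
    ])
    pipeline.fit(t_input[t_feature], t_input[target].values)
    with open(model_file, "wb") as f:
        pickle.dump(pipeline, f)
    return pipeline


def predict_with_pipeline(p_input, p_feature, model_file):
    """Load the pickled pipeline and apply it to raw text."""
    with open(model_file, "rb") as f:
        pipeline = pickle.load(f)
    result = p_input.copy()
    # the pipeline applies the fitted vectorizer and tf-idf transform before the model
    result["Prediction"] = pipeline.predict(p_input[p_feature])
    result["Score"] = pipeline.decision_function(p_input[p_feature])
    return result


# Toy data, invented for illustration only
train = pd.DataFrame({
    "Description": [
        "digital archive of music scores",
        "online library of folk song recordings",
        "database of opera manuscripts",
        "collection of digitised sheet music",
        "recipe blog with cooking tips",
        "football match results and tables",
        "weather forecast for the weekend",
        "stock market news and analysis",
    ],
    "Target": [1, 1, 1, 1, 0, 0, 0, 0],
})
new = pd.DataFrame({"Description": ["archive of jazz recordings", "cooking tips for beginners"]})

with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "toy_pipeline.pkl"
    train_pipeline(train, "Description", "Target", 2, "f1", model_file)
    result = predict_with_pipeline(new, "Description", model_file)
print(result)
```

As in the combined function, cross-validation here only happens inside LogisticRegressionCV, after the tf-idf transform has been fitted on the whole training text.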

In [None]:
#Predict even set against extensions
even_pred_1 = lr_model_predict(training_set_even, 'Description', 'Target', 2, 'f1_weighted', predictions_1, 'Description', 'extended_even_model', path)
even_pred_1 = even_pred_1.loc[even_pred_1['Prediction'] == 1]
even_pred_1.sort_values(by='Score', ascending=False)

This prediction set has 164 entries and technically includes 20 positives (ismir list) and some negatives that could be positives (mji list).

scoring and cv values:
- precision returns no positives.
- average_precision returns 16 positives w/ cv = 10, including 6 from ismir 
- precision_weighted w/ cv 2 returns 16 positives, same as above 
- f1 w/ cv 2 returns 44 positives, including all ismir -> drops when you increase the cv value 
- f1_weighted same as above 
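The comparison in the list above was done by re-running the cell with different values. A minimal sketch of how the same sweep could be automated is given below: it refits the model for each scoring/cv pair and counts the predicted positives. The helper name `count_positives` and the toy data are hypothetical, and TfidfVectorizer is used as a shorthand for CountVectorizer followed by TfidfTransformer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV


def count_positives(train_texts, train_labels, new_texts, scorings, cv_values):
    """Return {(scoring, cv): number of rows predicted as 1} for every combination."""
    vectorizer = TfidfVectorizer()
    x_train = vectorizer.fit_transform(train_texts)
    x_new = vectorizer.transform(new_texts)
    counts = {}
    for scoring in scorings:
        for cv in cv_values:
            model = LogisticRegressionCV(solver="liblinear", random_state=44,
                                         cv=cv, scoring=scoring)
            model.fit(x_train, train_labels)
            counts[(scoring, cv)] = int((model.predict(x_new) == 1).sum())
    return counts


# Toy data, invented for illustration only (1 = relevant, 0 = not relevant)
train_texts = [
    "digital archive of music scores",
    "online library of folk song recordings",
    "database of opera manuscripts",
    "collection of digitised sheet music",
    "recipe blog with cooking tips",
    "football match results and tables",
    "weather forecast for the weekend",
    "stock market news and analysis",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
new_texts = ["archive of jazz recordings", "cooking tips for beginners", "opera score library"]

# cv values are kept small here because the toy set only has 4 rows per class
counts = count_positives(train_texts, train_labels, new_texts,
                         ["precision", "average_precision", "f1"], [2, 4])
for key, value in counts.items():
    print(key, value)
```

With the real sets, the call would take the training and prediction descriptions and the cv values used in the notes (2, 5, 10).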

In [None]:
#Predict even set against twitter tests 
even_pred_2 = lr_model_predict(training_set_even, 'Description', 'Target', 2, 'precision_weighted', predictions_2, 'Description', 'extended_even_model', path)
even_pred_2 = even_pred_2.loc[even_pred_2['Prediction'] == 1]
even_pred_2.sort_values(by='Score', ascending=False)

This prediction set has 20 entries, with one entry that's a definite positive and a few that are ambiguous.

scoring and cv values (2, 5, 10):
- precision (all cv values) returns 1 ambiguous 
- average_precision returns 1 ambiguous at cv 2 + 5; 2 ambiguous, the definite and 1 wrong at cv 10
- precision_weighted returns the same as average_precision at cv 10, but at all cv values
- f1/f1_weighted (all cv values) return 5 positives, including the definite, the ambiguous and one wrong 

In [None]:
#Predict even set against github tests 
even_pred_3 = lr_model_predict(training_set_even, 'Description', 'Target', 2, 'f1_weighted', predictions_3, 'Description', 'extended_even_model', path)
even_pred_3 = even_pred_3.loc[even_pred_3['Prediction'] == 1]
even_pred_3 = even_pred_3.sort_values(by='Score', ascending=False)
even_pred_3

This prediction set has 104 entries and TK TK. It does have very short descriptions though, which I think throws the model off.

scoring and cv values (2, 5, 10):
- precision returns 42 (2, 5) and 62 (10)
- average_precision returns 62 (2, 5) and 87 (10)
- precision_weighted returns 87 (all values)
- f1/f1_weighted returns 93-94 (all values)
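If very short descriptions are indeed what throws the model off on the GitHub set, one option to test that idea would be to separate short entries before predicting. The sketch below is only an assumption about how that could look: the helper name `split_by_description_length` and the `min_length` threshold of 50 characters are hypothetical and would need tuning against the real data.

```python
import pandas as pd


def split_by_description_length(df, feature="Description", min_length=50):
    """Split rows into those with at least min_length characters and the rest."""
    lengths = df[feature].str.len()
    long_enough = df.loc[lengths >= min_length].reset_index(drop=True)
    too_short = df.loc[lengths < min_length].reset_index(drop=True)
    return long_enough, too_short


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Description": [
        "score viewer",
        "A digital archive of historical music scores and recordings from European libraries",
        "",
    ]
})

long_enough, too_short = split_by_description_length(toy)
print(len(long_enough), "kept,", len(too_short), "set aside")
```

Rows with a missing description fall into neither group, so this assumes missing values were already dropped, as is done for predictions_3.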

In [None]:
#Predict extended set against twitter tests 
even_pred_4 = lr_model_predict(training_set_adds, 'Description', 'Target', 10, 'precision_weighted', predictions_2, 'Description', 'extended_even_model', path)
even_pred_4 = even_pred_4.loc[even_pred_4['Prediction'] == 1]
even_pred_4 = even_pred_4.sort_values(by='Score', ascending=False)
even_pred_4

scoring and cv values (2, 5, 10):
- precision returns 6 positives including definite, ambiguous and 2 wrong 
- average_precision returns 6 at cv 2 but 20 at cv 5/10 (no good)
- precision_weighted returns 6 positives including definite, ambiguous and 2 wrong 
- f1/f1_weighted returns 6 positives including definite, ambiguous and 2 wrong 

In [None]:
#Predict extended even set against twitter tests 
even_pred_5 = lr_model_predict(training_set_even_adds, 'Description', 'Target', 2, 'f1_weighted', predictions_2, 'Description', 'extended_even_model', path)
even_pred_5 = even_pred_5.loc[even_pred_5['Prediction'] == 1]
even_pred_5 = even_pred_5.sort_values(by='Score', ascending=False)
even_pred_5

scoring and cv values (2, 5, 10):
- precision (all cv values) returns nothing
- average_precision returns 1 definite, 2 ambiguous, 1 wrong at all cv values, w/ the wrong one scoring highest
- precision_weighted same as above but only 3 results at cv 2, 5 (dropping one ambiguous)
- f1/f1_weighted same as average_precision, w/ 3 results for cv 2 on f1_weighted

In [None]:
#Predict extended even set against github tests 
even_pred_6 = lr_model_predict(training_set_even_adds, 'Description', 'Target', 2, 'precision', predictions_3, 'Description', 'extended_even_model', path)
even_pred_6 = even_pred_6.loc[even_pred_6['Prediction'] == 1]
even_pred_6 = even_pred_6.sort_values(by='Score', ascending=False)
even_pred_6

- precision (all cv values) returns 18, which is a lot better than using the even set only as the model. This model w/ these values might work best for GitHub searches?