# Final Results (more like results part 2)

After I finished my secondary results and looked at the sample that you sent, I wanted to give another method a shot. In this section I build on a lot of what I learned from that part of the project.

I ended up importing only 1,000 lines so that I could run these models more quickly. It turns out that mapping the partitions is extremely computationally expensive, so I did it once with 1,000 lines and stored the result with joblib. The steps below are all pre-processing steps taken from the sample.

*Much of the structure from this sample was borrowed from Dr. Paul Anderson.*

In [408]:
import joblib
import numpy as np
import dask.dataframe as dd
import dask
import pandas as pd
import pandas.io.json
import json
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsRegressor

Here we are loading in the data with dask.

In [161]:
dask.config.set({'temporary_directory': '/disk/tmp'})
pd.io.json._json.loads = lambda s, *a, **kw: json.loads(s)
train_df = dd.read_json("simplified-nq-train.jsonl.sample_small",lines=True, blocksize=3e6)
train_df

Dask DataFrame Structure (npartitions=19)
column,document_text,long_answer_candidates,question_text,annotations,document_url,example_id
dtype,object,object,object,object,object,int64


In [162]:
train_df.shape[0].compute()

1000

In [163]:
train_df.head()

,document_text,long_answer_candidates,question_text,annotations,document_url,example_id
0,Email marketing - Wikipedia <H1> Email marketi...,"[{'start_token': 14, 'top_level': True, 'end_t...",which is the most common use of opt-in e-mail ...,"[{'yes_no_answer': 'NONE', 'long_answer': {'st...",https://en.wikipedia.org//w/index.php?title=Em...,5655493461695504401
1,The Mother ( How I Met Your Mother ) - wikiped...,"[{'start_token': 28, 'top_level': True, 'end_t...",how i.met your mother who is the mother,"[{'yes_no_answer': 'NONE', 'long_answer': {'st...",https://en.wikipedia.org//w/index.php?title=Th...,5328212470870865242
2,Human fertilization - wikipedia <H1> Human fer...,"[{'start_token': 14, 'top_level': True, 'end_t...",what type of fertilisation takes place in humans,"[{'yes_no_answer': 'NONE', 'long_answer': {'st...",https://en.wikipedia.org//w/index.php?title=Hu...,4435104480114867852
3,List of National Football League career quarte...,"[{'start_token': 28, 'top_level': True, 'end_t...",who had the most wins in the nfl,"[{'yes_no_answer': 'NONE', 'long_answer': {'st...",https://en.wikipedia.org//w/index.php?title=Li...,5289242154789678439
4,Roanoke Colony - wikipedia <H1> Roanoke Colony...,"[{'start_token': 32, 'top_level': True, 'end_t...",what happened to the lost settlement of roanoke,"[{'yes_no_answer': 'NONE', 'long_answer': {'st...",https://en.wikipedia.org//w/index.php?title=Ro...,5489863933082811018


In [164]:
orig_train_df = train_df

In [165]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Question Vectorizer

Fitting a tfidf vectorizer on the question text.

In [166]:
LOAD_VEC=True
if LOAD_VEC:
    question_vectorizer = joblib.load("question_vectorizer.joblib.z")
else:
    question_vectorizer = TfidfVectorizer()
    question_vectorizer.fit(train_df["question_text"])
    joblib.dump(question_vectorizer,"question_vectorizer.joblib.z")

# Document Vectorizer


Fitting a tfidf vectorizer on the document text.

In [167]:
import re

if LOAD_VEC:
    document_vectorizer = joblib.load("document_vectorizer.joblib.z")
else:
    document_vectorizer = TfidfVectorizer()
    train_df["document_text_no_html"] = train_df["document_text"].apply(lambda value: re.sub("<.*?>", "",value))
    document_vectorizer.fit(train_df["document_text_no_html"])
    joblib.dump(document_vectorizer,"document_vectorizer.joblib.z")

In [304]:
"We have " + str(len(document_vectorizer.get_feature_names())) + " words in our vocabulary."

'We have 165376 words in our vocabulary.'

Now we create the features that will go into our feature vectors. Note that this approach is in stark contrast to what I did in my *Barshay Secondary Results*. My feature matrix there had 61,222 columns, whereas this one has only 17 (16 features plus the candidate label), so the machine learning models run much faster. The biggest difference is that here we do the processing first and build the vector from the results, instead of putting all the raw information into a vector at once. There is also a significant structural difference: this data lives in a pandas dataframe, which allows manipulations that would not be possible with a large sparse matrix.

In [306]:
from scipy import spatial
import sklearn
import Levenshtein 
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Function based on one found at https://www.kaggle.com/opanichev/tf2-0-qa-binary-classification-baseline

def extract_features(document_tfidf, question_tfidf, answer_tfidf, document, question, answer, y):
    # needed to convert to vectors
    document_tfidf = np.squeeze(np.asarray(document_tfidf))
    question_tfidf = np.squeeze(np.asarray(question_tfidf))
    answer_tfidf = np.squeeze(np.asarray(answer_tfidf))
    qa_cos_d = spatial.distance.cosine(question_tfidf, answer_tfidf)
    qd_cos_d = spatial.distance.cosine(question_tfidf, document_tfidf)
    ad_cos_d = spatial.distance.cosine(answer_tfidf, document_tfidf)

    qa_euc_d = np.linalg.norm(question_tfidf - answer_tfidf)
    qd_euc_d = np.linalg.norm(question_tfidf - document_tfidf)
    ad_euc_d = np.linalg.norm(answer_tfidf - document_tfidf)
    
    qa_lev_d = Levenshtein.distance(question, answer)
    qa_lev_r = Levenshtein.ratio(question, answer)
    qa_jar_s = Levenshtein.jaro(question, answer) 
    qa_jaw_s = Levenshtein.jaro_winkler(question, answer)
    
    qa_tfidf_score = np.sum(question_tfidf*answer_tfidf.T)
    qd_tfidf_score = np.sum(question_tfidf*document_tfidf.T)
    ad_tfidf_score = np.sum(answer_tfidf*document_tfidf.T)
    
    document_tfidf_sum = np.sum(document_tfidf)
    question_tfidf_sum = np.sum(question_tfidf)
    answer_tfidf_sum = np.sum(answer_tfidf)
    
    f = pd.Series([
        qa_cos_d, qd_cos_d, ad_cos_d, 
        qa_euc_d, qd_euc_d, ad_euc_d,
        qa_lev_d, qa_lev_r, qa_jar_s, qa_jaw_s,
        qa_tfidf_score, qd_tfidf_score, ad_tfidf_score, 
        document_tfidf_sum, question_tfidf_sum, answer_tfidf_sum,y
    ],index=['qa_cos_d', 'qd_cos_d', 'ad_cos_d', 
        'qa_euc_d', 'qd_euc_d', 'ad_euc_d',
        'qa_lev_d', 'qa_lev_r', 'qa_jar_s', 'qa_jaw_s',
        'qa_tfidf_score', 'qd_tfidf_score', 'ad_tfidf_score', 
        'document_tfidf_sum', 'question_tfidf_sum', 'answer_tfidf_sum','candidate'])
    return f

These two functions are further processing steps that get the data ready to be put through models. Their main job is to reshape each partition and apply the extract_features function that was defined above. As I said before, this choice of features is very different from my previous one, in the sense that most if not all of the calculations are done *prior* to creating the vector.

In [310]:
from pandas.io.json import json_normalize
import re

v = document_vectorizer # shorter code below

def process(row,column):
    columns = list(row.index)
    columns.remove(column)
    tdf = json_normalize(row.to_dict(),column,meta=columns)
    tdf = tdf.reset_index()
    tdf.rename(columns={('index'): (tdf.columns[0]+":"+column)}, inplace=True)
    return tdf

def on_partition(df):
    # Replace HTML tags with the placeholder HTML_TAG so that the model can still
    # understand the context that the HTML tag created.
    # There are many small decisions like this one that can make significant differences.
    document_text = df.apply(lambda row: re.sub("<.*?>", "HTML_TAG",row.loc["document_text"]),axis=1)
    df["document_text"] = document_text
    annotation = json_normalize(df["annotations"].apply(lambda x: x[0]).values)
    annotation['example_id'] = df['example_id']
    annotation.set_index('example_id',inplace=True)
    df2_as_series = df.apply(lambda row: process(row,"long_answer_candidates"),axis=1)
    df2 = pd.concat(df2_as_series.values)
    df2 = df2.loc[df2.top_level == True]
    df3 = df2.set_index('example_id').join(annotation)
    df3["candidate"] = df3["long_answer.candidate_index"] == df3["index:long_answer_candidates"]
    answer_text = df3.apply(lambda row: " ".join(row.loc["document_text"].split()[row.loc["start_token"]:row.loc["end_token"]]),axis=1)
    df3["answer_text"] = answer_text
    features = df3.apply(lambda row: extract_features(v.transform([row.loc["document_text"]]).todense(),
                                                      v.transform([row.loc["question_text"]]).todense(),
                                                      v.transform([row.loc["answer_text"]]).todense(), 
                                                      row.loc["document_text"], 
                                                      row.loc["question_text"], 
                                                      row.loc["answer_text"],row.loc["candidate"]),axis=1)
    return features

for_meta = on_partition(train_df.head())


if LOAD_VEC: # I do not want to run dask's map_partitions more than I have to
    train_df2 = joblib.load("train_df2.joblib.z") # reuse the cached result so train_df2 is always defined
else:
    train_df2 = train_df.map_partitions(on_partition,meta=for_meta).compute(scheduler="processes")


In [311]:
train_df2.shape

(39192, 17)

In [422]:
train_df2.head()

example_id,qa_cos_d,qd_cos_d,ad_cos_d,qa_euc_d,qd_euc_d,ad_euc_d,qa_lev_d,qa_lev_r,qa_jar_s,qa_jaw_s,qa_tfidf_score,qd_tfidf_score,ad_tfidf_score,document_tfidf_sum,question_tfidf_sum,answer_tfidf_sum,candidate
-8799945603687418006,0.766622,0.815032,0.261347,1.238242,1.276739,0.722976,442,0.0818,0.398669,0.398669,0.233378,0.184968,0.738653,6.322635,2.069796,4.247175,True
-8799945603687418006,0.777648,0.815032,0.14297,1.247115,1.276739,0.534734,122,0.22619,0.444032,0.444032,0.222352,0.184968,0.85703,6.322635,2.069796,2.957377,False
-8799945603687418006,0.996504,0.815032,0.271353,1.41174,1.276739,0.736686,1693,0.025258,0.408574,0.408574,0.003496,0.184968,0.728647,6.322635,2.069796,6.097632,False
-8627347779381584683,0.983627,0.416655,0.941501,1.402589,0.912858,1.372225,301,0.153846,0.396555,0.396555,0.016373,0.583345,0.058499,13.708678,2.396368,5.156949,False
-8627347779381584683,0.792941,0.416655,0.682364,1.259318,0.912858,1.168215,1086,0.055507,0.329138,0.329138,0.207059,0.583345,0.317636,13.708678,2.396368,6.611733,False


In [186]:
joblib.dump(train_df2,"train_df2.joblib.z")

['train_df2.joblib.z']

In [187]:
if LOAD_VEC:
    train_df2 = joblib.load("train_df2.joblib.z")

# Model Training 
## Throwing the kitchen sink at this data

Now that the hard work is out of the way, it is time for the last data pre-processing step: splitting the data into training and test sets. First we fill any missing values with 0. Since we want all observations corresponding to a given ID (question) to stay together, we use np.unique to get an array of the unique index values. We then use train_test_split on those IDs, so that 67% of the unique IDs become training data and the other 33% become test data. This way all candidates for a question end up in the same split.

In [334]:
from sklearn.model_selection import train_test_split

X = train_df2.fillna(0)
example_ids = np.unique(train_df2.index)

example_ids_train, example_ids_test = train_test_split(example_ids, test_size=0.33, shuffle=True, random_state=42)
X_train = X.loc[example_ids_train]
X_test = X.loc[example_ids_test]
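
As a side note, scikit-learn also ships a splitter that does this kind of ID-grouped split directly. The sketch below is an alternative under assumptions, not what this notebook runs: the toy frame, its example_id values, and the variable names are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy frame: six questions (example_id) with two candidates each.
toy = pd.DataFrame(
    {"qa_cos_d": np.linspace(0.1, 0.9, 12), "candidate": [True, False] * 6},
    index=pd.Index(np.repeat([101, 102, 103, 104, 105, 106], 2), name="example_id"),
)

# Split on the group label so every row of a question lands in the same split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
train_pos, test_pos = next(splitter.split(toy, groups=toy.index.to_numpy()))
toy_train, toy_test = toy.iloc[train_pos], toy.iloc[test_pos]

# No question should appear on both sides of the split.
shared_ids = set(toy_train.index) & set(toy_test.index)
print(shared_ids)
```

Either way, the check on shared IDs is a cheap way to confirm that no question leaks from the training set into the test set.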

I am making distinct data frames to represent both the test and the training data so that it is easier to train models (and also less confusing). 

In [335]:
y_train = X_train["candidate"] # only run this cell once
X_train.drop("candidate", axis = 1, inplace = True)
y_test = X_test["candidate"]
X_test.drop("candidate", axis = 1, inplace = True)

Now I am running CatBoost and manually setting the class weights: the incorrect-answer class gets a weight of 1 and the correct-answer class gets the ratio of incorrect to correct observations.

## CatBoost

In [336]:
from catboost import CatBoostClassifier

y_train *= 1 # this simply converts the Trues and Falses into ones and zeroes, since Python
# treats True as one and False as zero
# named counts (not v) so it does not overwrite the vectorizer alias v used in on_partition
counts = y_train.value_counts().values # sorted by frequency: [incorrect, correct]
class_weights = [1,counts[0]/counts[1]]

# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=100,
                          learning_rate=1,
                          depth=8,
                          class_weights=class_weights,
                          verbose = False) # do not want to spam output
# Fit model; errors="ignore" because "candidate" was already dropped in the cell above
model.fit(X_train.drop(columns="candidate", errors="ignore"), y_train)

<catboost.core.CatBoostClassifier at 0x7ffe8902ead0>

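An earlier run of cross_val_score gave roughly 9% precision and roughly 11% recall on the correct-answer class. That is decent, and quite similar to my more naive approach, even though the 17-column feature vectors are much more compact. The scores of 1.0 displayed below should not be trusted: a perfect score on this problem most likely means the candidate label was still a column of X_train when the cell ran, so the model could simply read off the answer.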

In [337]:
res1 = cross_val_score(model, X_train, y_train, cv = 3, scoring = "precision")
res2 = cross_val_score(model, X_train, y_train, cv = 3, scoring = "recall")

In [338]:
display(res1.mean())
display(res2.mean())

1.0

1.0

In [339]:
predictions = model.predict(X_test)

The results on my held-out test set (8% precision and 11% recall on the correct-answer class) are close to the roughly 9% and 11% I saw from cross-validation, which suggests that cross_val_score and a held-out split give comparable estimates here. (WARNING: results may vary)

In [340]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

       False       0.99      0.98      0.99     11999
        True       0.08      0.11      0.09       154

    accuracy                           0.97     12153
   macro avg       0.53      0.55      0.54     12153
weighted avg       0.98      0.97      0.97     12153



We only get an f1-score of .09 on the true values. These are not great results, but this is a difficult problem. Let's now look at the distribution of the predictions.

In [341]:
pd.Series(predictions).value_counts()

0.0    11929
1.0      224
dtype: int64

We predicted 224 observations to be the correct response and got only 8% of those predictions right, meaning that only about 17 of the 224 were actually the correct answer. Our recall was 11%, which means that out of the 154 correct answers we identified only about 17.
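
The arithmetic behind those percentages can be checked by hand. This is a small sketch using the counts reported in this section (17 correct hits, 224 predicted positives, 154 actual positives); the variable names are my own.

```python
# Counts taken from the CatBoost results in this section.
true_positives = 17       # predicted correct and actually correct
predicted_positives = 224 # everything the model labelled as the correct answer
actual_positives = 154    # support of the True class in the test set

precision = true_positives / predicted_positives
recall = true_positives / actual_positives
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.08 0.11 0.09
```

These match the True row of the classification report above.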

Note that I am only really paying attention to the observations that were classified as the correct answer. Because so many observations belong to the incorrect-answer class, we get 99% precision and 98% recall there, so those results are not of much interest to me. Also, the goal of this challenge is to find **correct** answers, not incorrect ones, so from here on out when I refer to recall or precision I mean for the correct-answer group.

The following code builds a list of the positions of the correctly predicted answers, in case they are ever needed. I have learned the hard way that observations the machine picks out as outliers often look perfectly ordinary to us, and vice versa.

In [342]:
y_test_r = np.array(y_test * 1)

In [343]:
correct = []
for i in range(len(predictions)):
    if predictions[i] == 1.0:
        if predictions[i] == y_test_r[i] :
            correct.append(i)

In [344]:
correct

[1962,
 1980,
 2587,
 2953,
 3548,
 6008,
 6239,
 6331,
 6338,
 7009,
 7365,
 7954,
 8332,
 8849,
 9640,
 9857,
 11520]

It turns out that indexing back into the large data frame becomes significantly more difficult when using dask, since dask automatically splits the data into many partitions. I might come back to this; however, the exact questions may not be the most interesting thing, especially given how normal the extreme outliers seemed to be in my second objective analysis.

For some reason the candidate column is back in X_train at this point, so I have to remove it manually. CatBoost does not modify the data frame it is given, so the likelier cause is that I re-ran the split cell out of order.

In [345]:
X_train.drop("candidate", axis = 1, inplace = True)

In [346]:
from sklearn.linear_model import LogisticRegression
clf2 = LogisticRegression(random_state=0, solver='lbfgs',class_weight="balanced")
clf2.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [347]:
predictions = clf2.predict(X_test)

In [348]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

       False       1.00      0.69      0.81     11999
        True       0.03      0.75      0.06       154

    accuracy                           0.69     12153
   macro avg       0.51      0.72      0.44     12153
weighted avg       0.98      0.69      0.81     12153



Using logistic regression we get lower precision for the true group, but our recall improves substantially (0.75 versus 0.11). We are predicting many more observations to be true than CatBoost was. However, our f1 went down (0.06 versus 0.09), which suggests that overall we are doing worse with logistic regression than with CatBoost.

I am going to see how many we classified as successes to contrast with catboost.

In [349]:
pd.Series(predictions).value_counts()

0    8313
1    3840
dtype: int64

This is pretty bad: we predicted 3,840 observations to be true when only 154 actually were. This seems to be the only model that predicts such a high number of observations to be in the correct-answer class, so I think it would be worthwhile to do some hyperparameter tuning with logistic regression.
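
Besides hyperparameters, one knob that directly controls how many observations get labelled true is the decision threshold. The sketch below is only an illustration on synthetic data (make_classification stands in for the real features, and every name in it is hypothetical); it is scored on its own training data just to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic, heavily imbalanced stand-in for the 16-feature data set.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=16, weights=[0.97, 0.03], random_state=0
)

clf_demo = LogisticRegression(solver="lbfgs", class_weight="balanced", max_iter=1000)
clf_demo.fit(X_demo, y_demo)
proba = clf_demo.predict_proba(X_demo)[:, 1]  # probability of the positive class

# predict() behaves like a 0.5 cut-off; raising it labels fewer rows positive.
positives_at = {}
for threshold in (0.5, 0.7, 0.9):
    pred_demo = (proba >= threshold).astype(int)
    positives_at[threshold] = int(pred_demo.sum())
    print(
        threshold,
        positives_at[threshold],
        round(precision_score(y_demo, pred_demo, zero_division=0), 2),
        round(recall_score(y_demo, pred_demo, zero_division=0), 2),
    )
```

A higher threshold can only shrink the set of predicted positives, which usually trades recall for precision; whether that helps f1 on the real data would have to be checked on a validation split.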

I know that random forest classifiers are another good classification algorithm, so I am going to give one a shot here with 100 trees. We did not cover these in this class, so I am not going to worry about the hyper-parameters too much and will just set n_estimators to 100.

In [350]:
from sklearn.ensemble import RandomForestClassifier

In [351]:
rf = RandomForestClassifier(n_estimators = 100)
# we want many estimators, this is a difficult problem

In [352]:
rf.fit(X_train, y_train) 

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [353]:
pred = rf.predict(X_test)

In [354]:
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

       False       0.99      1.00      0.99     11999
        True       0.60      0.02      0.04       154

    accuracy                           0.99     12153
   macro avg       0.79      0.51      0.52     12153
weighted avg       0.98      0.99      0.98     12153



Using random forest we actually ended up with a lower f1 score on our true predictions. Let me check the ratio of predictions.

In [355]:
pd.Series(pred).value_counts()

0    12148
1        5
dtype: int64

We only predicted 5 observations to be true and out of those only three of them were actually the correct answer.

In [356]:
rf = RandomForestClassifier(n_estimators = 100, class_weight = "balanced")
rf.fit(X_train, y_train) 
pred = rf.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

       False       0.99      1.00      0.99     11999
        True       0.00      0.00      0.00       154

    accuracy                           0.99     12153
   macro avg       0.49      0.50      0.50     12153
weighted avg       0.97      0.99      0.98     12153



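Quite surprisingly, we get worse results when we set the class weights to balanced. The model did not predict a single observation to be the correct answer (scikit-learn warned that precision is ill-defined for that class), so precision, recall, and f1 are all reported as 0.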

In [357]:
from sklearn.svm import SVC

In [358]:
svc = SVC(kernel = "rbf", gamma = "auto", class_weight = "balanced")

In [359]:
svc.fit(X_train, y_train)
pred = svc.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

       False       0.99      0.92      0.95     11999
        True       0.02      0.11      0.03       154

    accuracy                           0.91     12153
   macro avg       0.50      0.52      0.49     12153
weighted avg       0.98      0.91      0.94     12153



Nothing seems to do as well as CatBoost did.

In [360]:
from sklearn.ensemble import GradientBoostingClassifier

In [361]:
gb = GradientBoostingClassifier()

In [362]:
gb.fit(X_train, y_train)
pred = gb.predict(X_test) # this was svc.predict(X_test) when the cell was first run
print(classification_report(y_test, pred))

The report I originally recorded here was identical to the SVC report above, because the cell called svc.predict instead of gb.predict. The call is fixed above; the gradient boosting numbers still need to be re-run.



Nothing has seemed to work thus far so it is time to break out the big guns. Time to use neural networks.

In [363]:
from sklearn.neural_network import MLPClassifier

In [364]:
nn = MLPClassifier()

In [365]:
nn.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=None, shuffle=True, solver='adam', tol=0.0001,
              validation_fraction=0.1, verbose=False, warm_start=False)

In [366]:
pred = nn.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

       False       0.99      0.78      0.87     11999
        True       0.02      0.43      0.05       154

    accuracy                           0.77     12153
   macro avg       0.51      0.60      0.46     12153
weighted avg       0.98      0.77      0.86     12153



That was disappointing. Also, scikit-learn's MLPClassifier does not have a class_weight setting, so I do not really know where to go from here.
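
One possible direction, which I have not tried on this data, is to balance the classes by resampling the training set instead of weighting them. The sketch below oversamples the minority class with sklearn.utils.resample on a made-up toy frame; all names and numbers in it are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy training set: 5 correct answers among 100 candidates.
rng = np.random.RandomState(0)
X_toy = pd.DataFrame(rng.rand(100, 3), columns=["f1", "f2", "f3"])
y_toy = pd.Series([1] * 5 + [0] * 95, name="candidate")

minority = X_toy[y_toy == 1]
majority = X_toy[y_toy == 0]

# Draw minority rows with replacement until both classes are the same size.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)

X_bal = pd.concat([majority, minority_up])
y_bal = pd.Series([0] * len(majority) + [1] * len(minority_up), index=X_bal.index)

print(y_bal.value_counts())
```

The balanced frame could then be passed to a model without a class-weight option. Only the training split should be resampled, so that the test set keeps its real class ratio.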

Linear and quadratic discriminant analysis are two more machine learning methods that I am somewhat familiar with. I know they are mathematically similar to logistic regression, which gave us our highest recall so far, so I will give them a shot below.

In [371]:
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
pred = lda.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

       False       0.99      0.99      0.99     11999
        True       0.10      0.07      0.08       154

    accuracy                           0.98     12153
   macro avg       0.54      0.53      0.54     12153
weighted avg       0.98      0.98      0.98     12153





These are actually quite surprising results. We get an f1-score of .08, which is just .01 shy of CatBoost. Sometimes the simpler models are actually the best. I am also getting a warning saying that my variables are collinear.
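
One likely source of that warning is built into the features themselves. Assuming the vectorizer uses the default L2 normalization (TfidfVectorizer() with no arguments does), every non-empty TF-IDF vector has length 1, so the dot-product score equals one minus the cosine distance, and the Euclidean distance equals the square root of twice the cosine distance. The first row of the feature table is consistent with this: 0.766622 + 0.233378 = 1. A minimal sketch with two made-up sentences:

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up texts standing in for a question and an answer.
texts = [
    "who had the most wins in the nfl",
    "list of nfl quarterbacks by career wins",
]
toy_vectorizer = TfidfVectorizer()  # default norm="l2": rows have unit length
q, a = toy_vectorizer.fit_transform(texts).toarray()

cos_d = cosine(q, a)                    # like qa_cos_d
dot = float(q @ a)                      # like qa_tfidf_score
euc = float(np.linalg.norm(q - a))      # like qa_euc_d

print(cos_d + dot)                      # approximately 1
print(euc, float(np.sqrt(2 * cos_d)))   # the two values agree
```

So each tfidf_score column is an exact linear function of the matching cos_d column, and each euc_d column is a monotone function of it, which means these nine columns carry roughly three columns' worth of information.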

In [372]:
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
pred = qda.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

       False       0.99      0.97      0.98     11999
        True       0.02      0.06      0.03       154

    accuracy                           0.95     12153
   macro avg       0.50      0.51      0.50     12153
weighted avg       0.98      0.95      0.96     12153





Our performance decreased when we switched to quadratic discriminant analysis. Let me look at the covariance between the variables and see whether collinearity could be bringing the performance of our model down.

In [376]:
lda = LinearDiscriminantAnalysis(store_covariance = True)
lda.fit(X_train, y_train)
pred = lda.predict(X_test)
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

       False       0.99      0.99      0.99     11999
        True       0.10      0.07      0.08       154

    accuracy                           0.98     12153
   macro avg       0.54      0.53      0.54     12153
weighted avg       0.98      0.98      0.98     12153





In [378]:
cov_frame = pd.DataFrame(lda.covariance_)

In [379]:
cov_frame

,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0.022423,0.012725,0.014519,0.018213,0.012027,0.013058,0.7113458,0.000303,-0.000897,-0.000891,-0.021599,-0.012725,-0.013695,0.071852,0.001656,-0.039789
1,0.012725,0.0328,0.003942,0.010402,0.030564,0.003609,16.17064,0.000118,-0.000565,-0.000559,-0.012777,-0.0328,-0.003993,0.073661,0.003427,-0.035919
2,0.014519,0.003942,0.029483,0.011584,0.003842,0.026975,-80.60546,0.00521,0.000547,0.000552,-0.013905,-0.003942,-0.028869,0.082332,0.001093,-0.15175
3,0.018213,0.010402,0.011584,0.015047,0.009865,0.010648,0.4102463,0.00017,-0.000723,-0.000719,-0.017921,-0.010402,-0.011292,0.061891,0.001569,-0.02998
4,0.012027,0.030564,0.003842,0.009865,0.028714,0.003558,14.23494,0.000151,-0.000475,-0.000469,-0.012074,-0.030564,-0.003889,0.081718,0.003887,-0.03176
5,0.013058,0.003609,0.026975,0.010648,0.003558,0.025213,-89.95647,0.004778,0.000558,0.000563,-0.012944,-0.003609,-0.02686,0.086105,0.001061,-0.136925
6,0.711346,16.170644,-80.605457,0.410246,14.234941,-89.956475,4510856.0,-56.904108,-23.751331,-23.78143,-0.120145,-16.170644,81.196659,73.081185,1.760234,1523.723746
7,0.000303,0.000118,0.00521,0.00017,0.000151,0.004778,-56.90411,0.00738,0.003153,0.003156,-0.000264,-0.000118,-0.00517,-0.022425,0.002935,-0.120295
8,-0.000897,-0.000565,0.000547,-0.000723,-0.000475,0.000558,-23.75133,0.003153,0.002863,0.002869,0.000917,0.000565,-0.000527,-0.00071,0.00279,-0.028294
9,-0.000891,-0.000559,0.000552,-0.000719,-0.000469,0.000563,-23.78143,0.003156,0.002869,0.002891,0.000912,0.000559,-0.000532,-0.000818,0.002793,-0.028319


The unfortunate part about covariance is that it depends on the scales of the variables and so it is somewhat hard to see which two variables are "collinear" as the warning states.
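
A scale-free way to read the same information is to convert the covariance matrix into a correlation matrix by dividing each entry by the product of the two standard deviations. The sketch below does this for a 3x3 block copied from the table above (rows and columns 0, 1, and 3, which should correspond to qa_cos_d, qd_cos_d, and qa_euc_d given the column order of X_train); the helper name is my own.

```python
import numpy as np

def covariance_to_correlation(cov):
    """Divide each covariance by the product of the two standard deviations."""
    cov = np.asarray(cov, dtype=float)
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)

# Entries for variables 0, 1 and 3, copied from the covariance table above.
cov_block = [
    [0.022423, 0.012725, 0.018213],
    [0.012725, 0.032800, 0.010402],
    [0.018213, 0.010402, 0.015047],
]

corr_block = covariance_to_correlation(cov_block)
print(np.round(corr_block, 3))
```

In this block the correlation between variables 0 and 3 comes out above 0.99, while 0 and 1 are only moderately correlated, so the cosine and Euclidean distances for the same pair of texts look like the collinear ones. On the full data, passing lda.covariance_ to the same helper, or calling X_train.corr(), would give the whole picture.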

I am now going to use grid search to try to tune the hyperparameters.

In [403]:
pipe = Pipeline([
    ('lda', LinearDiscriminantAnalysis()) # it was not letting me not use a pipe for some reason
])


## GRID SEARCH
solvers = ["svd", "lsqr"]
priors = [None, [1,80]]
param_grid = {
        'lda__solver': solvers,
        'lda__priors': priors
}

grid = GridSearchCV(pipe, cv=5, n_jobs=1, param_grid=param_grid, iid=False,scoring='f1',return_train_score=True)
grid.fit(X_train, y_train)



GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('lda',
                                        LinearDiscriminantAnalysis(n_components=None,
                                                                   priors=None,
                                                                   shrinkage=None,
                                                                   solver='svd',
                                                                   store_covariance=False,
                                                                   tol=0.0001))],
                                verbose=False),
             iid=False, n_jobs=1,
             param_grid={'lda__priors': [None, [1, 80]],
                         'lda__solver': ['svd', 'lsqr']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='f1', verbose=0)

In [404]:
mean_scores = np.array(grid.cv_results_['mean_test_score'])

In [407]:
mean_scores

array([0.06593272, 0.06593272, 0.02476439, 0.02705605])

We seem to get better results with the default priors (the first two scores) than with priors of [1, 80] (the last two).

Since this compact feature representation makes training fast, I am going to try several settings and see which ones work best for logistic regression.

In [421]:
pipe = Pipeline([
    ('log', LogisticRegression()) # it was not letting me not use a pipe for some reason
])


## GRID SEARCH
penalties = ["l1", "l2"]
solvers = ["liblinear"]
param_grid = {
        'log__penalty': penalties,
        'log__solver': solvers
}

grid = GridSearchCV(pipe, cv=5, n_jobs=1, param_grid=param_grid, iid=False,scoring='f1',return_train_score=True)
grid.fit(X_train, y_train)



GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('log',
                                        LogisticRegression(C=1.0,
                                                           class_weight=None,
                                                           dual=False,
                                                           fit_intercept=True,
                                                           intercept_scaling=1,
                                                           l1_ratio=None,
                                                           max_iter=100,
                                                           multi_class='warn',
                                                           n_jobs=None,
                                                           penalty='l2',
                                                           random_state=None,
                                 

In [423]:
mean_scores = np.array(grid.cv_results_['mean_test_score'])
mean_scores

array([0.0057971, 0.0057971])

As it turns out, many of these parameters depend on each other (in many cases a given parameter only works with specific values of another; check out the logistic regression documentation!). A plain grid search does not make much sense in that situation, since it tries every combination. In addition, I am starting to discover that the default settings are the defaults for a reason, and they seem to give the best results a good proportion of the time. It turns out that failure can be a learning experience sometimes.
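
For what it is worth, GridSearchCV does accept a list of dictionaries as param_grid, and it only crosses the values inside each dictionary. That gives a way to express "this penalty only with that solver". The sketch below runs on synthetic data (make_classification is a stand-in for the real features, and the particular values of C are arbitrary choices of mine).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 16-feature data set.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=16, weights=[0.9, 0.1], random_state=0
)

# Each dictionary is its own small grid, so invalid pairs such as
# solver="lbfgs" with penalty="l1" are never generated.
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.1, 1.0]},
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.1, 1.0]},
]

grid_demo = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid=param_grid, cv=3, scoring="f1"
)
grid_demo.fit(X_demo, y_demo)

n_combinations = len(grid_demo.cv_results_["params"])
print(n_combinations)  # 4 liblinear settings + 2 lbfgs settings = 6
```

Note that the estimator is passed directly here, without a Pipeline; in that case the parameter names carry no step prefix such as log__.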

# Conclusion/Reflection

The main takeaway from this project is that there is a lot to learn. I discovered through this analysis, as well as my secondary results, that you can have either 17 features or 61,222 features and still get similar results in terms of f1 score (as I did). I also learned that small changes can have a large impact down the line, and it can be difficult or even impossible to know in advance which of those changes will yield the best results. Sometimes there is no explanation for the results that you get; you just have to see what the error says and go with it.

As I said in the cell above, the defaults are good, and grid search is not always the go-to tool for hyper-parameter tuning. Based on my results in this project, the choice of *model* seems to matter more than the choice of hyper-parameters, as some of my models predicted no observations to be successes.

I also find it interesting how subjective my opinion is on whether something is a success when there are no metrics available. For example, when a clustering algorithm produces results that I was not expecting or hoping for, my instinct is to say that it is not doing a good job; however, there is a chance that the computer is picking up on something that I am unable to.