# **Open Avenues - Week 4**

This week covers supervised learning. This notebook walks through the steps I take to fit a logistic regression classifier.

First, I'll reuse the code from the Week 3 notebook to recreate the same combined TF-IDF matrix.

In [27]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
df = pd.read_csv("open_ave_data.csv")

# modifying the dataset and replacing any NaN values with a NAN keyword
df.fillna("NAN", inplace=True)

# reducing each column to its own corpus
findings = df["findings"].values.tolist()
clinical = df["clinicaldata"].values.tolist()
exam = df["ExamName"].values.tolist()
impression = df["impression"].values.tolist()

# combine the categories
all_documents = findings + clinical + exam + impression

# transforming the documents with the vectorizer
tfidf_documents = vectorizer.fit_transform(all_documents)

# just checking if the shape of the combined matrix makes sense
#tfidf_documents.shape
#tfidf_documents.toarray()

`all_documents` holds every document, that is, every cell of the four text columns (4 columns * 954 rows = 3816 documents, which matches the shape printed further down). This is what is fed into the TF-IDF vectorizer. For the supervised learning model to work, though, it needs to be trained against labeled data, so we'll need a matching list that contains one label per document. Here the label is the column each document came from (0 = findings, 1 = clinicaldata, 2 = ExamName, 3 = impression).

This way, the model has a true label to compare each prediction against, which lets us measure its accuracy.

In [22]:
y = []
# repeating the label value for the length of the documents in the column
f_y = [0] * len(findings)
c_y = [1] * len(clinical)
e_y = [2] * len(exam)
i_y = [3] * len(impression)

# combine all the labels
y = f_y + c_y + e_y + i_y

# confirming the x and y match
print("tfidf: ", tfidf_documents.shape)
print("y label: ", len(y))

tfidf:  (3816, 1084)
y label:  3816
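
The label construction above can also be written with `np.repeat`, and a small lookup makes the integer labels readable later on. This is only a sketch on made-up column sizes: `column_sizes`, `toy_labels`, and `label_to_name` are hypothetical names that do not appear in the notebook.

```python
import numpy as np

# Column order follows the notebook: findings, clinicaldata, ExamName, impression.
column_names = ["findings", "clinicaldata", "ExamName", "impression"]

# Hypothetical sizes for illustration; in the notebook every column has the same length.
column_sizes = [3, 3, 3, 3]

# np.repeat yields one integer label per document, grouped by column.
toy_labels = np.repeat(np.arange(len(column_names)), column_sizes)

# Map each integer label back to the column it stands for.
label_to_name = dict(enumerate(column_names))
first_of_each = [label_to_name[i] for i in toy_labels[[0, 3, 6, 9]].tolist()]

print(toy_labels.tolist())
print(first_of_each)
```

Keeping a mapping like `label_to_name` around makes the confusion matrix and classification report easier to read, since their rows are indexed by these integers.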


Now that every document has a label, we can split the data into a training set and a testing set: the training labels are used to fit the model, and the testing labels are used to check its predictions.

We'll split the data 80/20 for training and testing, which is fairly standard practice.

In [24]:
import numpy as np
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(tfidf_documents, y, test_size=0.2, shuffle=False)
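
One thing to watch with `shuffle=False`: the labels in `y` are ordered in blocks (all 0s, then all 1s, and so on), so the last 20% of the rows all come from the final block. The sketch below illustrates this on toy data and shows one possible alternative, a shuffled split with `stratify`, which keeps every label represented in both sets. The toy arrays and the `random_state` value are assumptions for illustration, not part of the original notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in for the notebook's data: 4 label blocks of 10 documents each,
# ordered the same way as y = f_y + c_y + e_y + i_y.
toy_X = [[i] for i in range(40)]
toy_y = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10

# Unshuffled split: the test set is simply the tail of the data.
_, _, _, y_test_plain = train_test_split(
    toy_X, toy_y, test_size=0.2, shuffle=False
)

# Shuffled, stratified split: each label keeps its share in train and test.
_, _, _, y_test_strat = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=1234
)

plain_counts = Counter(y_test_plain)
strat_counts = Counter(y_test_strat)
print("unshuffled test labels:", dict(plain_counts))
print("stratified test labels:", dict(sorted(strat_counts.items())))
```

In the unshuffled case only the last label appears in the test set, while the stratified split includes all four. Applying the same idea to `tfidf_documents` and `y` would change the metrics reported further down, so the numbers in this notebook should be read with the unshuffled split in mind.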

**Logistic Regression**

In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from time import time
from sklearn.model_selection import train_test_split, GridSearchCV

lr_model = LogisticRegression(random_state=1234)
param_dict = {'C': [0.001, 0.01, 0.1, 1, 10],
             'solver': ['sag', 'lbfgs', 'saga']}

start = time()
grid_search = GridSearchCV(lr_model, param_dict)
grid_search.fit(x_train, y_train)
print("GridSearch took %.2f seconds to complete." % (time()-start))
display(grid_search.best_params_)
print("Cross-Validated Score of the Best Estimator: %.3f" % grid_search.best_score_)

GridSearch took 5.49 seconds to complete.
{'C': 10, 'solver': 'sag'}
Cross-Validated Score of the Best Estimator: 0.997
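
For context on what the grid search is doing: the parameter grid has 5 values of `C` and 3 solvers, so 15 candidates are tried, each scored with cross-validation (5 folds by default in recent scikit-learn versions). The sketch below reruns the same grid on synthetic data to show how the candidates and the refit best model can be inspected. The synthetic dataset from `make_classification` and the `max_iter=1000` setting are assumptions added for this illustration; they are not in the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class data standing in for the TF-IDF matrix (assumption).
toy_X, toy_y = make_classification(
    n_samples=200, n_features=10, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=1234,
)

# Same grid as the notebook.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10],
              "solver": ["sag", "lbfgs", "saga"]}

# max_iter is raised here only to reduce convergence warnings (assumption).
search = GridSearchCV(
    LogisticRegression(random_state=1234, max_iter=1000), param_grid, cv=5
)
search.fit(toy_X, toy_y)

# One entry per parameter combination that was evaluated.
n_candidates = len(search.cv_results_["params"])

# With the default refit=True, this is the best candidate refit on all of toy_X.
best_model = search.best_estimator_

print("candidates tried:", n_candidates)
print("best params:", search.best_params_)
print("best CV score: %.3f" % search.best_score_)
```

Using `best_estimator_` directly for later predictions is one way to guarantee that the evaluated model has the same hyperparameters the search selected.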


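As with any predictive model, we need to look at its evaluation metrics to understand how well it performs. The confusion matrix, per-class report, and accuracy score are below. Keep in mind that the split above used `shuffle=False` while the labels are ordered by column, so the test set (the last 20% of the rows) contains only label 3 (impression) documents. Classes 0 to 2 therefore have zero support, and the accuracy score only reflects how often impression documents are recognized as such.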

**Metrics**

In [26]:
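# note: these hyperparameters differ from the grid search's best_params_
# ({'C': 10, 'solver': 'sag'}); the results below are for C=1 with 'saga'
lr = LogisticRegression(C=1, solver='saga')
lr.fit(x_train, y_train)
lr_preds = lr.predict(x_test)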

print(confusion_matrix(y_test, lr_preds))
print(classification_report(y_test, lr_preds))
print("Accuracy Score: %.3f" % accuracy_score(y_test, lr_preds))

[[  0   0   0   0]
 [  0   0   0   0]
 [  0   0   0   0]
 [ 21   6   1 736]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.00      0.00      0.00         0
           2       0.00      0.00      0.00         0
           3       1.00      0.96      0.98       764

    accuracy                           0.96       764
   macro avg       0.25      0.24      0.25       764
weighted avg       1.00      0.96      0.98       764

Accuracy Score: 0.963
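
To make the shape of the output above easier to interpret, here is a small sketch with hypothetical labels that reproduces the same pattern: every true label is 3, while a few predictions land in other classes. Passing `labels=` fixes the row and column order explicitly, and `zero_division=0` tells `classification_report` what value to use for metrics that are undefined because a class has no true samples. The toy lists are invented for illustration.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical toy labels: all true labels are 3, three predictions are wrong.
toy_true = [3] * 10
toy_pred = [0, 0, 1, 3, 3, 3, 3, 3, 3, 3]
all_labels = [0, 1, 2, 3]

# Rows are true labels, columns are predicted labels, both in all_labels order.
cm = confusion_matrix(toy_true, toy_pred, labels=all_labels)
print(cm)

# output_dict=True returns the report as nested dicts keyed by label string.
report = classification_report(
    toy_true, toy_pred, labels=all_labels, zero_division=0, output_dict=True
)
print("recall for label 3:", report["3"]["recall"])
print("support for label 0:", report["0"]["support"])
```

Only the last row of the matrix is non-zero because only label 3 occurs among the true labels, which is the same situation as in the notebook's test set.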


