# The top-k approach

The top-k method is a useful way to inspect a model's results. It consists of looking at the **most and least successful examples** to identify patterns in them. These patterns can then be used to engineer new features or to iterate on existing ones.

First, we load the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

import sys
sys.path.append("..")
import warnings
warnings.filterwarnings('ignore')

from ml_editor.data_processing import (
    format_raw_df, get_split_by_author, 
    add_text_features_to_df, 
    get_vectorized_series, 
    get_feature_vector_and_label
)
from ml_editor.model_evaluation import get_top_k

data_path = Path('../data/writers.csv')
df = pd.read_csv(data_path)
df = format_raw_df(df.copy())

Then, we add features and split the dataset.

In [2]:
df = add_text_features_to_df(df.loc[df["is_question"]].copy())
train_df, test_df = get_split_by_author(df, test_size=0.2, random_state=40)

We load the trained model, and vectorize the features.

In [3]:
model_path = Path("../models/model_1.pkl")
clf = joblib.load(model_path) 
vectorizer_path = Path("../models/vectorizer_1.pkl")
vectorizer = joblib.load(vectorizer_path) 

In [4]:
train_df["vectors"] = get_vectorized_series(train_df["full_text"].copy(), vectorizer)
test_df["vectors"] = get_vectorized_series(test_df["full_text"].copy(), vectorizer)

features = [
    "action_verb_full",
    "question_mark_full",
    "text_len",
    "language_question",
]
X_train, y_train = get_feature_vector_and_label(train_df, features)
X_test, y_test = get_feature_vector_and_label(test_df, features)

Now, we'll use the top-k method to look at:

- The k best performing examples for each class (high and low scores)
- The k worst performing examples for each class
- The k most unsure examples, where our model's predicted probability is close to 0.5

To read more about how plotting these particular examples can help with model iteration, please refer to Chapter 5 of the book.
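
The three selections listed above can be sketched in a few lines of pandas. This is a minimal illustration on toy data, not the implementation of `get_top_k` in `ml_editor.model_evaluation`: the function name `top_k_sketch`, its `threshold` parameter, and the toy values are assumptions made for this example.

```python
import pandas as pd


def top_k_sketch(df, proba_col, label_col, k=2, threshold=0.5):
    """Hypothetical helper: return the five top-k slices described above."""
    predicted_pos = df[proba_col] > threshold
    actual_pos = df[label_col].astype(bool)

    # Correct predictions, keeping the most confident ones.
    top_pos = df[predicted_pos & actual_pos].nlargest(k, proba_col)
    top_neg = df[~predicted_pos & ~actual_pos].nsmallest(k, proba_col)

    # Errors: positive examples scored lowest, negative examples scored highest.
    worst_pos = df[~predicted_pos & actual_pos].nsmallest(k, proba_col)
    worst_neg = df[predicted_pos & ~actual_pos].nlargest(k, proba_col)

    # Most unsure: smallest distance between the probability and the threshold.
    distance = (df[proba_col] - threshold).abs()
    unsure = df.loc[distance.nsmallest(k).index]

    return top_pos, top_neg, worst_pos, worst_neg, unsure


# Toy data (made up for illustration only).
toy = pd.DataFrame(
    {
        "predicted_proba": [0.9, 0.8, 0.2, 0.1, 0.45, 0.7, 0.3],
        "true_label": [True, True, False, False, True, False, True],
    }
)
top_pos, top_neg, worst_pos, worst_neg, unsure = top_k_sketch(
    toy, "predicted_proba", "true_label", k=1
)
```

Sorting by distance to the threshold, rather than by raw probability, is what separates the unsure group from the other four: it picks out examples the model could not place on either side.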

In [5]:
test_analysis_df = test_df.copy()
y_predicted_proba = clf.predict_proba(X_test)
test_analysis_df["predicted_proba"] = y_predicted_proba[:, 1]
test_analysis_df["true_label"] = y_test

to_display = [
    "predicted_proba",
    "true_label",
    "Title",
    "body_text",
    "text_len",
    "action_verb_full",
    "question_mark_full",
    "language_question",
]
threshold = 0.5


top_pos, top_neg, worst_pos, worst_neg, unsure = get_top_k(test_analysis_df, "predicted_proba", "true_label", k=2)
pd.options.display.max_colwidth = 500

Most confident correct positive predictions

In [6]:
top_pos[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
30514,0.72,True,Classic fantasy races lazy or boring?,"In my fantasy world there are other other sentient races besides humans. Some of them are ""original"", or at least not as common in mainstream stories. On the other hand, I have also included more traditional races like elves. I'm going to continue using elves for my example. Obviously, I have tried to put a personal spin on the elves and not just copy-paste them from Tolkien or anywhere else.\nSomewhere along the way, I realized that I could change them to be any other race I like while mai...",1785,True,True,False
32835,0.7,True,Traits of Bad Writers - Analysing Popular Authors,"I realise that this question can fall in the scope of personal opinion but I am looking for something concrete.\nVery often, not only on this site but other writing related sites, I find people constantly say that certain popular authors are, in reality, bad writers. These include the likes of J.K. Rowling, Dan Brown, Christopher Paolini, etc. Reasons range from minor plot holes to being too verbose. However, that should not exactly be true or should rather be minor insignificant things as o...",1097,True,True,False


Most confident correct negative predictions

In [7]:
top_neg[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
7488,0.08,False,Releases needed for picture books?,Do you need location releases for national parks and model releases for Pets to use in picture books?\n,137,False,True,False
16799,0.09,False,Capitalization of Open form Compound Words in Titles,"What would be considered proper capitalization of open form compound words in titles? Should the second part of the compound word be capitalized? Why?\nFor example, the capitalization for which title would be correct?\n\nCash flow Analysis Report\n\n--OR--\n\nCash Flow Analysis Report\n\nThanks!\n",343,True,True,True


It seems most of the correct negative predictions are **short**. This result reinforces the feature importance analysis, which showed question length to be one of the most important features.

Let's look at the most confident incorrect negative predictions: questions with a positive true label that the model scored low.

In [8]:
worst_pos[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
17543,0.15,True,Using Pronoun 'It' repetitvely for emphasis?,"I'd like to know if using ""It"" repetitively (for emphasis) in this context is okay grammatically.\n\nTV has become the modern day baby sitter. It is raising our children. It is dictating the cultural narrative and shaping future society. It is raising the bored inattentive child. It is raising the consumer child. It is raising the aggressive child. It is raising the obese child. It is raising the misinformed and complacent child. It is raising the disenchanted child. And what’s more, it...",591,False,True,False
14157,0.17,True,Do you bold punctuation directly after bold text?,"Do you bold punctuation directly after bold, linked or italic text? \n",119,False,True,False


On the flip side, short questions that actually received high scores are overrepresented among the examples our model got wrong.
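
Two examples per group are too few to confirm a pattern, so an observation about length is worth checking on the whole test set. The sketch below shows one way to do that, by bucketing questions around the median length and comparing error rates. It runs on toy values made up for illustration; the column names `predicted_label`, `is_error`, and `length_bucket` are assumptions, and in the notebook the same steps would be applied to `test_analysis_df`.

```python
import pandas as pd

# Toy data standing in for test_analysis_df; the values are made up.
toy = pd.DataFrame(
    {
        "text_len": [120, 140, 340, 590, 700, 1000, 1100, 1800],
        "predicted_proba": [0.08, 0.17, 0.09, 0.15, 0.88, 0.75, 0.70, 0.72],
        "true_label": [False, True, False, True, False, False, True, True],
    }
)

threshold = 0.5
toy["predicted_label"] = toy["predicted_proba"] > threshold
toy["is_error"] = toy["predicted_label"] != toy["true_label"]

# Split questions into a short and a long half around the median length.
median_len = toy["text_len"].median()
toy["length_bucket"] = (toy["text_len"] > median_len).map(
    {False: "short", True: "long"}
)

# Per bucket: size, error rate, share of true positives, share predicted positive.
summary = toy.groupby("length_bucket").agg(
    n=("is_error", "count"),
    error_rate=("is_error", "mean"),
    positive_rate=("true_label", "mean"),
    predicted_positive_rate=("predicted_label", "mean"),
)
print(summary)
```

A large gap between `positive_rate` and `predicted_positive_rate` within a bucket would suggest the model leans too heavily on length for that bucket.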

Next, let's look at the most confident incorrect positive predictions: questions with a negative true label that the model scored high.

In [9]:
worst_neg[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
7878,0.88,False,"When quoting a person's informal speech, how much liberty do you have to make changes to what they say?","Even during a formal interview for a news article, people speak informally. They say ""uhm"", they cut off sentences half-way through, they interject phrases like ""you know?"", and they make innocent grammatical mistakes.\nAs somebody who wants to fairly and accurately report the discussion that takes place in an interview, what guidelines should I use in making changes to what a person says?\nWhile the simplest solution is to write exactly what they say and [sic] any errors they make, that can...",694,True,True,False
37550,0.75,False,How to make sure that my reader has not forgotten an incident or character which was described earlier and referenced much later in the writing?,"In the context of writing a fiction novel, how to make sure that readers remember several of the incidents or events or even characters that were described in earlier chapters, when the same gets referenced in the much later part? \nI have also experienced this myself as a reader, especially in mystery novels where a minute detail or character described earlier ends up as a dramatic, concluding climax later. Reader gets a sense of being lost as they happen to forget such things either becaus...",1008,True,True,False


Finally, let's look at the most "unsure" questions: the ones where the model's predicted probability is closest to equal across classes (`0.5` in our case, since we have two classes).

In [10]:
unsure[to_display]

Id,predicted_proba,true_label,Title,body_text,text_len,action_verb_full,question_mark_full,language_question
26444,0.5,False,How to make a choice more sadistic?,"How do I create an impossible choice for my protagonist? I want to place him in a painful dilemma, and I'm having trouble making the choice feel truly impossible.\nFor example, I'm writing a scene where store owner Ed learns of an upcoming sting operation at his store. Ed is friends with both the cop running the operation, and the customer who's the intended target, so I thought that could be developed into a good flashpoint, torn between two friends and dangerous circumstances.\nThe problem...",1065,True,True,False
10263,0.5,False,"How to define a research question when the output is a product, used in research?","My school requests a research question for a project. I came up with one like this: (different topic but very similar, due to NDA)\n\nWhat is the behavior of a bike, depending on speed, time in use and temperature?\n\nThis question is about the research, and defines what is wanted. However this is way too much for the time I will work at the project. My task is to design, build and verify the bike.\nIf I leave the research question as it is, I will be having trouble at the end, because I can...",1345,True,True,False


To find new candidate features, I recommend combining the top-k method with feature importance and vectorization.