### Weekly Dataset

Each week, I'm dedicating myself to exploring, modeling, and doing stuff with a dataset. Trying to master different modalities and such while reviewing different statistical tests I don't use as often as I should.

This week, it's basics: text classification with transformers.

# Who Wrote It: Text Classification on Old Books

Luckily, there's a flood of text data available these days. One great place to easily get some data is <a href="https://www.gutenberg.org/">Project Gutenberg</a>. It has a large repo of books that have fallen into the public domain. You can access them free of charge, and downloading them from the web is super easy. No need to request or pay for an API. You can simply download a .txt file from a book's page.

Let's import our old friends and see how we can easily get a book's data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import requests
import os



In [2]:
# simply get all text from a webpage
resp = requests.get("https://www.gutenberg.org/cache/epub/2554/pg2554.txt")

In [3]:
# turn response object into a string object
text = resp.text

In [4]:
# get word count. Seems about right for this book, which happens to be Crime and Punishment
len(text.split())

206537

In [5]:
# take a peek at the text
text[20_000:20_500]

's I get it\r\nback from a friend...” he broke off in confusion.\r\n\r\n“Well, we will talk about it then, sir.”\r\n\r\n“Good-bye--are you always at home alone, your sister is not here with\r\nyou?” He asked her as casually as possible as he went out into the\r\npassage.\r\n\r\n“What business is she of yours, my good sir?”\r\n\r\n“Oh, nothing particular, I simply asked. You are too quick.... Good-day,\r\nAlyona Ivanovna.”\r\n\r\nRaskolnikov went out in complete confusion. This confusion became more\r\nand more intense. As he '

This text seems to have a bunch of \r and \n in it. We'll have to remove these to clean it up. On top of this, Project Gutenberg's raw text has a lot of info at the very beginning before the book starts in earnest. For Crime and Punishment, you also see a preface.

In [6]:
print(text[:10_000])

﻿The Project Gutenberg eBook of Crime and Punishment
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Crime and Punishment


Author: Fyodor Dostoyevsky

Translator: Constance Garnett

Release date: March 28, 2006 [eBook #2554]
                Most recently updated: August 5, 2021

Language: English



*** START OF THE PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***



CRIME AND PUNISHMENT

By Fyodor Dostoevsky



Translated By Constance Garnett




TRANSLATOR’S PREFACE

A few words about Dostoevsky himself may help the English reader to
understand his work.

Dostoevsky was the son of a

### The Task

So we have some text. Cool. What's the task though? Well, since this data is easy to get and I wanted to re-up my game on using the Trainer API from Hugging Face, text classification seems good enough for me. I did want to work on something that starts from the basics, like term counts, but since I mainly want to focus on the Trainer API, I'll jump right to transformers. I know you're always supposed to start with basic stuff, but I didn't start doing this just to redo <a href="https://www.ggwp.com/">my day job</a> of "do basic where we can, complex where we must."

In light of that, let's get our data first. I'm thinking of an interesting way to see what authors might line up well. Dostoyevsky will have to be one because he writes a ton and is famous and his books are easy to get on Project Gutenberg. Who might be in the same distribution as Dostoyevsky, you ask...well, I think our man <a href="https://en.wikipedia.org/wiki/Leo_Tolstoy">Leo Tolstoy</a> fits the bill. He's Russian and kinda lived at the same time as <a href="https://en.wikipedia.org/wiki/Fyodor_Dostoevsky">Dostoevsky</a>. That seems to fit. Plus, these dudes love to write huge novels about stuff. That'll be good.

First things first then: download our data. I'm gonna pick two big ones by each of our dudes here:

- Tolstoy:
    - War and Peace
    - Anna Karenina
- Dostoevsky:
    - The Possessed (Demons)
    - The Brothers Karamazov
    
These are big enough to give us some good data. We'll break them into chunks later, but first, the download:

### Get Data

In [7]:
links = dict(
    anna_karenina="https://www.gutenberg.org/cache/epub/1399/pg1399.txt",
    war_and_peace="https://www.gutenberg.org/cache/epub/2600/pg2600.txt",
    demons="https://www.gutenberg.org/cache/epub/8117/pg8117.txt",
    brothers_karamazov="https://www.gutenberg.org/cache/epub/28054/pg28054.txt"
)

In [8]:
texts = dict()

for book in links:
    texts[book] = requests.get(links[book]).text

In [9]:
# check we did it right. Seems like it
texts["demons"][:1000]

'\ufeffThe Project Gutenberg eBook of The Possessed (The Devils)\r\n    \r\nThis ebook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this ebook or online\r\nat www.gutenberg.org. If you are not located in the United States,\r\nyou will have to check the laws of the country where you are located\r\nbefore using this eBook.\r\n\r\nTitle: The Possessed (The Devils)\r\n\r\n\r\nAuthor: Fyodor Dostoyevsky\r\n\r\nTranslator: Constance Garnett\r\n\r\nRelease date: May 1, 2005 [eBook #8117]\r\n                Most recently updated: July 5, 2017\r\n\r\nLanguage: English\r\n\r\nCredits: Produced by David Moynihan, David Widger and Michelle Knight\r\n\r\n\r\n*** START OF THE PROJECT GUTENBERG EBOOK THE POSSESSED (THE DEVILS) ***\r\n\r\n\r\n\r\nProduced by David Moynihan, David Widger and 

### Preprocessing

We want to do a bit of preprocessing on the text to make it as model friendly as we can. First, we want to chop off the first few characters and last few characters since they contain information on Project Gutenberg and not the book itself. To be conservative, since we know these books are really big, we can chop the first and last 5,000 characters and probably be fine. Let's do that and ensure the start and end of the texts look like they only contain narrative parts.

In [10]:
preprocessed_text = dict()

for text in texts:
    preprocessed_text[text] = texts[text][5_000:-5_000]

In [11]:
for text in preprocessed_text:
    start = preprocessed_text[text][:500]
    end = preprocessed_text[text][-500:]
    
    print(text)
    print("Start\n", start)
    print()
    print("End\n", end)
    print()

anna_karenina
Start
 r fussing and worrying over household details,
and limited in her ideas, as he considered, was sitting perfectly still
with the letter in her hand, looking at him with an expression of
horror, despair, and indignation.

“What’s this? this?” she asked, pointing to the letter.

And at this recollection, Stepan Arkadyevitch, as is so often the case,
was not so much annoyed at the fact itself as at the way in which he
had met his wife’s words.

There happened to him at that instant what d

End
 the state applicable to this agreement, the
agreement shall be interpreted to make the maximum disclaimer or
limitation permitted by the applicable state law. The invalidity or
unenforceability of any provision of this agreement shall not void the
remaining provisions.

1.F.6. INDEMNITY - You agree to indemnify and hold the Foundation, the
trademark owner, any agent or employee of the Foundation, anyone
providing copies of Project Gutenberg™ electronic works in
accordance with t

Seems like we still need to cut some off. Let's double it to 10,000 chars from the front and the end.

In [12]:
preprocessed_text = dict()

for text in texts:
    preprocessed_text[text] = texts[text][10_000:-10_000]

In [13]:
for text in preprocessed_text:
    start = preprocessed_text[text][:500]
    end = preprocessed_text[text][-500:]
    
    print(text)
    print("Start\n", start)
    print()
    print("End\n", end)
    print()

anna_karenina
Start
 ey understood one another. Stepan
Arkadyevitch’s eyes asked: “Why do you tell me that? don’t you know?”

Matvey put his hands in his jacket pockets, thrust out one leg, and
gazed silently, good-humoredly, with a faint smile, at his master.

“I told them to come on Sunday, and till then not to trouble you or
themselves for nothing,” he said. He had obviously prepared the
sentence beforehand.

Stepan Arkadyevitch saw Matvey wanted to make a joke and attract
attention to himself. Tearing

End
 y charge a reasonable fee for copies of or providing
access to or distributing Project Gutenberg™ electronic works
provided that:

    • You pay a royalty fee of 20% of the gross profits you derive from
        the use of Project Gutenberg™ works calculated using the method
        you already use to calculate your applicable taxes. The fee is owed
        to the owner of the Project Gutenberg™ trademark, but he has
        agreed to donate royalties under this paragraph to the 

Seems like we need to cut off more! Let's try something smarter. Looking at the raw text, it seems like each book ends with:

*** END OF THE PROJECT GUTENBERG EBOOK [Book Name] ***

So we can use that to end our text. And ah, each starts with:

*** START OF THE PROJECT GUTENBERG EBOOK [Book Name] ***

Let's use those as our break points, and then trim off the table of contents with a 5,000 chars limit again.

In [14]:
preprocessed_text = dict()

for text in texts:
    preprocessed_text[text] = texts[text] \
    .split("*** START OF THE PROJECT GUTENBERG EBOOK")[1]\
    .split("*** END OF THE PROJECT GUTENBERG EBOOK")[0][5_000:-5_000]
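
A side note on the marker idea, as a minimal sketch rather than a replacement for the cell above: the same cut can be done with a regex that matches the whole marker line, so the book title and trailing `***` don't end up at the start of the text. The helper name `strip_gutenberg_boilerplate` and the toy `sample` string are made up for illustration, and the pattern assumes the markers look exactly like the ones quoted above.

```python
import re

# Marker lines as quoted above; the title between the fixed parts varies per book.
START_RE = re.compile(r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK .*? \*\*\*")
END_RE = re.compile(r"\*\*\* END OF THE PROJECT GUTENBERG EBOOK .*? \*\*\*")


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the START and END marker lines (hypothetical helper)."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    if start is None or end is None:
        # fail loudly instead of raising an IndexError from a bare split()[1]
        raise ValueError("START/END markers not found")
    return raw[start.end():end.start()].strip()


# toy stand-in for a downloaded book, not real Gutenberg text
sample = (
    "Title: Example\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\r\n"
    "Chapter 1. It was a cold day.\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\r\n"
    "License text..."
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

This only handles the header and footer. Front matter like a preface or a table of contents sits after the START marker, so it still needs a separate trim.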

In [15]:
for text in preprocessed_text:
    start = preprocessed_text[text][:500]
    end = preprocessed_text[text][-500:]
    
    print(text)
    print("Start\n", start)
    print()
    print("End\n", end)
    print()

anna_karenina
Start
 ing himself, begging forgiveness, instead of remaining
indifferent even—anything would have been better than what he did
do—his face utterly involuntarily (reflex spinal action, reflected
Stepan Arkadyevitch, who was fond of physiology)—utterly involuntarily
assumed its habitual, good-humored, and therefore idiotic smile.

This idiotic smile he could not forgive himself. Catching sight of that
smile, Dolly shuddered as though at physical pain, broke out with her
characteristic heat into 

End
 where he heard voices, he
stopped on the terrace, and leaning his elbows on the parapet, he gazed
up at the sky.

It was quite dark now, and in the south, where he was looking, there
were no clouds. The storm had drifted on to the opposite side of the
sky, and there were flashes of lightning and distant thunder from that
quarter. Levin listened to the monotonous drip from the lime trees in
the garden, and looked at the triangle of stars he knew so well, and
the Milky Way with

Let's just do it one more time with more chars to get rid of that table of contents:

In [16]:
preprocessed_text = dict()

for text in texts:
    preprocessed_text[text] = texts[text] \
    .split("*** START OF THE PROJECT GUTENBERG EBOOK")[1]\
    .split("*** END OF THE PROJECT GUTENBERG EBOOK")[0][10_000:-10_000]

In [17]:
for text in preprocessed_text:
    start = preprocessed_text[text][:500]
    end = preprocessed_text[text][-500:]
    
    print(text)
    print("Start\n", start)
    print()
    print("End\n", end)
    print()

anna_karenina
Start
 barber, cutting a
pink path through his long, curly whiskers.

“Thank God!” said Matvey, showing by this response that he, like his
master, realized the significance of this arrival—that is, that Anna
Arkadyevna, the sister he was so fond of, might bring about a
reconciliation between husband and wife.

“Alone, or with her husband?” inquired Matvey.

Stepan Arkadyevitch could not answer, as the barber was at work on his
upper lip, and he raised one finger. Matvey nodded at the
lookin

End
 w minutes after Kitty had left the room she sent for Levin to come
to the nursery.

Leaving his tea, and regretfully interrupting the interesting
conversation, and at the same time uneasily wondering why he had been
sent for, as this only happened on important occasions, Levin went to
the nursery.

Although he had been much interested by Sergey Ivanovitch’s views of
the new epoch in history that would be created by the emancipation of
forty millions of men of Slavonic race acting

Now we're looking good. Ok. Let's take a look at raw text though, not the printed version:

In [18]:
preprocessed_text["demons"][:100]

'tortoise\r\ncomes on the scene with certain sacramental Latin words, and even, if\r\nI remember aright, '

Those \r and \n's are still there. Let's use our regex powers to get rid of them:

In [19]:
import re

In [20]:
for text in preprocessed_text:
    preprocessed_text[text] = re.sub(r"\r", " ", preprocessed_text[text])
    preprocessed_text[text] = re.sub(r"\n", " ", preprocessed_text[text])
    
    # change whitespaces of one or more to be just one whitespace
    preprocessed_text[text] = re.sub(r"\s+", " ", preprocessed_text[text])

In [21]:
# looking good.
preprocessed_text["demons"][:100]

'tortoise comes on the scene with certain sacramental Latin words, and even, if I remember aright, a '
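
As a small aside, `\s` in a regex already matches `\r` and `\n`, so the three substitutions above can be collapsed into one. Here's a sketch on the snippet printed earlier (the variable names are just for illustration):

```python
import re

raw = (
    "tortoise\r\ncomes on the scene with certain sacramental Latin words, "
    "and even, if\r\nI remember aright, "
)

# three passes, mirroring the cell above
three_pass = re.sub(r"\r", " ", raw)
three_pass = re.sub(r"\n", " ", three_pass)
three_pass = re.sub(r"\s+", " ", three_pass)

# one pass: \s+ covers \r, \n, tabs, and runs of spaces
one_pass = re.sub(r"\s+", " ", raw)

# str.split() with no arguments also splits on any whitespace run,
# and additionally drops leading/trailing whitespace
split_join = " ".join(raw.split())

print(three_pass == one_pass)
print(repr(split_join))
```

The only difference between the regex and the `split`/`join` version is that the latter also trims whitespace at the two ends of the string.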

We could do a bunch more stuff. However, I want the Hugging Face tokenizer to decide on things like periods, commas, colons, etc. If we train a model and see that it's bunk, we can go back. For now though, let's just keep it as is.

Our next step is to break these texts up into smaller and more digestible pieces for training. Normally, a lot of transformers take a context window of 512 tokens. While we can't (i.e. I don't want to) specifically see how our text will be completely tokenized, what I'll do is split up our text into 150-word chunks. I'll split on whitespace, which will constitute a word, and create a row in a dataframe that contains the text, who wrote it, and which book it came from for record keeping. I want a robust dataset, so I'll create 10,000 examples from each piece of text to balance it out. That gives us 20,000 examples from each author:

In [22]:
# to see how long we have to wait for our for loops
from tqdm.notebook import tqdm

In [23]:
from collections import defaultdict

In [24]:
book_author_map = dict(anna_karenina="Tolstoy",
                       war_and_peace="Tolstoy",
                       demons="Dostoevsky",
                       brothers_karamazov="Dostoevsky")

In [25]:
text_chunks = defaultdict(list)

for text in tqdm(preprocessed_text):
    split_text = preprocessed_text[text].split()
    for _ in tqdm(range(10_000)):
        start_idx = np.random.randint(0, len(split_text)-150)
        chunk = " ".join(split_text[start_idx:start_idx+150])
        
        # our data to predict on
        text_chunks["text"].append(chunk)
        
        # just some metadata to take into account
        text_chunks["book"].append(text)
        
        # our label
        text_chunks["author"].append(book_author_map[text])

  0%|          | 0/4 [00:00<?, ?it/s]

  0%|          | 0/10000 [00:00<?, ?it/s]

  0%|          | 0/10000 [00:00<?, ?it/s]

  0%|          | 0/10000 [00:00<?, ?it/s]

  0%|          | 0/10000 [00:00<?, ?it/s]
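
One thing worth knowing about the sampling above: 10,000 windows of 150 words is 1.5 million words per book, which is more than any of these novels contains, so the random windows overlap a lot. Near-identical passages can then land on both sides of a random train/test split. A non-overlapping alternative, as a rough sketch (the function name and the toy word list are made up for illustration, and this is not what the notebook does):

```python
def non_overlapping_chunks(words, chunk_size=150):
    """Split a list of words into consecutive chunks of exactly chunk_size words.

    A trailing remainder shorter than chunk_size is dropped.
    """
    n_full = len(words) // chunk_size
    return [
        " ".join(words[i * chunk_size:(i + 1) * chunk_size])
        for i in range(n_full)
    ]


# toy "book" of 1,000 numbered words
toy_words = [f"w{i}" for i in range(1_000)]
chunks = non_overlapping_chunks(toy_words)
print(len(chunks))  # 6 full chunks, the last 100 words are dropped
```

The trade-off is dataset size: a 300,000-word book yields only 2,000 chunks this way, so the 10,000-per-book balance would not survive.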

In [26]:
text_df = pd.DataFrame(text_chunks)

In [27]:
text_df.head()

| | text | book | author |
|---|---|---|---|
| 0 | was one of them. They peeped into the “inferna... | anna_karenina | Tolstoy |
| 1 | you, to be sure. Do you suppose they keep vodk... | anna_karenina | Tolstoy |
| 2 | had obviously considered that he came off vict... | anna_karenina | Tolstoy |
| 3 | was not in the theater that evening. “How litt... | anna_karenina | Tolstoy |
| 4 | too,” said Stepan Arkadyevitch, smiling, as he... | anna_karenina | Tolstoy |


In [28]:
# sanity checks
text_df["book"].value_counts()

book
anna_karenina         10000
war_and_peace         10000
demons                10000
brothers_karamazov    10000
Name: count, dtype: int64

In [29]:
text_df["author"].value_counts()

author
Tolstoy       20000
Dostoevsky    20000
Name: count, dtype: int64

Right on. We have a dataset. However, we still need to consider things like lowercasing, punctuation removal (which I noted earlier), etc. Luckily, Hugging Face tokenizers can take care of some of that. Enough talking about transformers. Let's import them.

### Model Training

We've made it. Now we need to train a model...and make a dataset to fit Hugging Face's format...but then we can train and get a baseline. Awesome. Let's get what we need.

In [30]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.1/84.1 kB 5.8 MB/s eta 0:00:00
Installing collected packages: evaluate
Successfully installed evaluate-0.4.1


In [31]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import datasets
import evaluate

accuracy = evaluate.load("accuracy")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

If you don't know, <a href="https://huggingface.co/">Hugging Face</a> is the place to do language modeling and is quickly becoming the default for all deep learning tasks. You can read up on it. They have a bunch of tutorials. I'm not going to replicate them. I'm just here to train.

Ok...so we'll train a small model first called <a href="https://huggingface.co/albert-base-v2">ALBERT</a>. The name stands for "A Lite BERT", meaning it's like BERT, but does some interesting stuff like reducing the embedding dimensions and sharing weights across layers. In short, it's light on memory. Cool. First though, let's transform our pandas dataframe into a datasets object. This is just a special format for Hugging Face trainers to use. You can read more <a href="https://huggingface.co/learn/nlp-course/chapter5/1?fw=pt">here</a> about them.

In [32]:
from sklearn.model_selection import train_test_split

In [91]:
X_train, X_test, y_train, t_test = train_test_split(text_df,
                                                    text_df["author"],
                                                    test_size=0.1,
                                                    random_state=1,
                                                    stratify=text_df["author"])

In [34]:
text_ds = datasets.Dataset.from_pandas(X_train.reset_index(drop=True).rename(columns={"author": "label"}))

In [35]:
test_ds_split = text_ds.train_test_split(train_size=0.8,
                                         test_size=0.2,
                                         seed=1)

In [36]:
test_ds_split

DatasetDict({
    train: Dataset({
        features: ['text', 'book', 'label'],
        num_rows: 28800
    })
    test: Dataset({
        features: ['text', 'book', 'label'],
        num_rows: 7200
    })
})

In [37]:
# now we can get our models
cp = "albert-base-v2"

In [38]:
tokenizer = AutoTokenizer.from_pretrained(cp)
model = AutoModelForSequenceClassification.from_pretrained(cp, num_labels=2)

Downloading (…)lve/main/config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/47.4M [00:00<?, ?B/s]

Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [39]:
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

In [40]:
def tokenize_function(example):
    return tokenizer(example["text"], 
                     truncation=True, 
                     padding=True)


tokenized_datasets_train = test_ds_split["train"].map(tokenize_function, batched=True)
tokenized_datasets_test = test_ds_split["test"].map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

  0%|          | 0/29 [00:00<?, ?ba/s]

  0%|          | 0/8 [00:00<?, ?ba/s]

In [41]:
tokenized_datasets_train = tokenized_datasets_train.remove_columns(["text", "book"])

In [42]:
tokenized_datasets_train = tokenized_datasets_train.class_encode_column("label")

Casting to class labels:   0%|          | 0/29 [00:00<?, ?ba/s]

Casting the dataset:   0%|          | 0/3 [00:00<?, ?ba/s]

In [43]:
tokenized_datasets_test = tokenized_datasets_test.remove_columns(["text", "book"])

In [44]:
tokenized_datasets_test = tokenized_datasets_test.class_encode_column("label")

Casting to class labels:   0%|          | 0/8 [00:00<?, ?ba/s]

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

In [51]:
training_args = TrainingArguments(
    report_to="none",
    output_dir="/",
    evaluation_strategy="steps",
    weight_decay=0.01,
    logging_steps=1_000,
    auto_find_batch_size=True
)

In [52]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

In [53]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets_train,
    eval_dataset=tokenized_datasets_test,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [54]:
trainer.train()

You're using a AlbertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1000 | 0.47 | 0.385703 | 0.886806 |
| 2000 | 0.3542 | 0.220299 | 0.956806 |
| 3000 | 0.1497 | 0.084365 | 0.980556 |
| 4000 | 0.0834 | 0.069893 | 0.986667 |
| 5000 | 0.0444 | 0.067642 | 0.989167 |
| 6000 | 0.0343 | 0.062676 | 0.990417 |
| 7000 | 0.0244 | 0.04529 | 0.993472 |
| 8000 | 0.0081 | 0.035005 | 0.995556 |
| 9000 | 0.0116 | 0.024314 | 0.996111 |
| 10000 | 0.0064 | 0.020542 | 0.997083 |


TrainOutput(global_step=10800, training_loss=0.11002324165017517, metrics={'train_runtime': 4047.5607, 'train_samples_per_second': 21.346, 'train_steps_per_second': 2.668, 'total_flos': 1279172035252800.0, 'train_loss': 0.11002324165017517, 'epoch': 3.0})
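
A quick back-of-the-envelope check that the numbers in that `TrainOutput` line up with the splits made earlier. The split sizes and the step, epoch, and runtime values come straight from the outputs above. The batch size of 8 is not stated anywhere in the notebook; it's only what the step count implies, assuming a single device.

```python
n_total = 40_000                       # 4 books x 10,000 chunks
n_held_out = n_total // 10             # test_size=0.1 in train_test_split
n_remaining = n_total - n_held_out     # what went into datasets.Dataset
n_train = n_remaining * 8 // 10        # train_size=0.8
n_val = n_remaining - n_train

global_step = 10_800                   # from TrainOutput
epochs = 3                             # 'epoch': 3.0 in TrainOutput
train_runtime = 4047.5607              # seconds, from TrainOutput

steps_per_epoch = global_step // epochs
implied_batch_size = n_train // steps_per_epoch   # inferred, not logged

samples_per_second = n_train * epochs / train_runtime
steps_per_second = global_step / train_runtime

print(n_train, n_val, steps_per_epoch, implied_batch_size)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The 28,800 and 7,200 match the `DatasetDict` printed earlier, and the two throughput figures agree with `train_samples_per_second` and `train_steps_per_second` in the output.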

In [63]:
text_ds_test = datasets \
.Dataset \
.from_pandas(X_test.reset_index(drop=True).rename(columns={"author": "label"})) \
.remove_columns(["book"]) \
.class_encode_column("label") \
.map(tokenize_function, batched=True)

Casting to class labels:   0%|          | 0/4 [00:00<?, ?ba/s]

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

### Prediction

Damn. We got a nice looking model. Makes me think it might be overfitted...let's ignore that though and check accuracy. 

In [65]:
# use the trainer for our predictions
predictions = trainer.predict(text_ds_test)

In [83]:
predictions

PredictionOutput(predictions=array([[ 6.8189187, -5.962311 ],
       [ 6.8215036, -5.960884 ],
       [ 6.8242455, -5.959299 ],
       ...,
       [-5.4024425,  4.954162 ],
       [-5.4015336,  4.9545865],
       [ 6.823697 , -5.9633555]], dtype=float32), label_ids=array([0, 0, 0, ..., 1, 1, 0]), metrics={'test_loss': 0.025159670040011406, 'test_accuracy': 0.9965, 'test_runtime': 48.4414, 'test_samples_per_second': 82.574, 'test_steps_per_second': 10.322})

In [70]:
# do argmax across rows
prediction_labels = np.argmax(predictions.predictions, axis=1)

In [71]:
prediction_labels

array([0, 0, 0, ..., 1, 1, 0])
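
The values in `predictions.predictions` are raw logits, not probabilities. Argmax is all we need for a label, but a softmax turns a row into something comparable to the scores a classification pipeline reports. A minimal sketch in plain Python, using two of the logit rows shown above (the `softmax` helper is written out here for illustration, it's not from the notebook):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# two rows copied from the PredictionOutput above
rows = [[6.8189187, -5.962311], [-5.4024425, 4.954162]]

results = []
for logits in rows:
    probs = softmax(logits)
    predicted = probs.index(max(probs))  # same label argmax would give
    results.append((predicted, probs))
    print(predicted, probs)
```

Since softmax is monotonic, the predicted label is the same whether argmax is taken over logits or over probabilities.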

In [75]:
t_test

24508    Dostoevsky
37321    Dostoevsky
33577    Dostoevsky
22163    Dostoevsky
35387    Dostoevsky
            ...    
26108    Dostoevsky
8039        Tolstoy
2638        Tolstoy
14978       Tolstoy
22760    Dostoevsky
Name: author, Length: 4000, dtype: object

In [77]:
# label encode authors
t_test_encoded = t_test.map({"Dostoevsky": 0, "Tolstoy": 1})

In [80]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [81]:
accuracy_score(prediction_labels, t_test_encoded.values)

0.9965

In [82]:
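# note: sklearn's signature is confusion_matrix(y_true, y_pred); with predictions passed first,
# rows here are predicted labels and columns are true labels
confusion_matrix(prediction_labels, t_test_encoded.values)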

array([[1997,   11],
       [   3, 1989]])
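
To make that matrix easier to read, here's the arithmetic spelled out as a small sketch. One thing to keep in mind: sklearn's `confusion_matrix` takes `(y_true, y_pred)`, and the call above passes the predictions first, so rows are predicted labels and columns are true labels. The column sums of 2,000 each are consistent with the stratified 4,000-row test set.

```python
# Confusion matrix from the output above.
# Rows = predicted label, columns = true label (0 = Dostoevsky, 1 = Tolstoy).
cm = [[1997, 11],
      [3, 1989]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# recall: of the passages truly written by each author, how many were caught
true_dostoevsky = cm[0][0] + cm[1][0]
true_tolstoy = cm[0][1] + cm[1][1]
recall_dostoevsky = cm[0][0] / true_dostoevsky
recall_tolstoy = cm[1][1] / true_tolstoy

print(total, accuracy)
print(true_dostoevsky, true_tolstoy)
print(recall_dostoevsky, recall_tolstoy)
```

So 11 Tolstoy passages were labeled Dostoevsky and 3 Dostoevsky passages were labeled Tolstoy, 14 misses in total.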

We're sitting pretty. Let's take a look at the examples where we're off.

In [92]:
X_test["predictions"] = prediction_labels

In [93]:
X_test["predictions"] = X_test["predictions"].map({0: "Dostoevsky", 1: "Tolstoy"})

In [97]:
bad_preds = X_test.loc[X_test["author"] != X_test["predictions"]]

In [98]:
good_preds = X_test.loc[X_test["author"] == X_test["predictions"]]

In [101]:
for passage in bad_preds["text"].tolist():
    print(passage)
    print()

what to do with oneself.” “How can you be bored, prince? There’s so much that’s interesting now in Germany,” said Marya Yevgenyevna. “But I know everything that’s interesting: the plum soup I know, and the pea sausages I know. I know everything.” “No, you may say what you like, prince, there’s the interest of their institutions,” said the colonel. “But what is there interesting about it? They’re all as pleased as brass halfpence. They’ve conquered everybody, and why am I to be pleased at that? I haven’t conquered anyone; and I’m obliged to take off my own boots, yes, and put them away too; in the morning, get up and dress at once, and go to the dining-room to drink bad tea! How different it is at home! You get up in no haste, you get cross, grumble a little, and come round again. You’ve time to think things over,

him at all. The cause of this is my egotism. I set myself above him and so become much worse than he, for he is lenient to my rudeness while I on the contrary nourish contemp

In [105]:
good_samples = good_preds.sample(20, random_state=1)

In [108]:
for passage in good_samples["text"].tolist():
    print(passage)
    print()

this?" "Allegory, indeed! You are laughing, I see.... Stepan Trofimovitch said truly that I lie under a stone, crushed but not killed, and do nothing but wriggle. It was a good comparison of his." "Stepan Trofimovitch declares that you are mad over the Germans," I laughed. "We've borrowed something from them anyway." "We took twenty kopecks, but we gave up a hundred roubles of our own." We were silent a minute. "He got that sore lying in America." "Who? What sore?" "I mean Kirillov. I spent four months with him lying on the floor of a hut." "Why, have you been in America?" I asked, surprised. "You never told me about it." "What is there to tell? The year before last we spent our last farthing, three of us, going to America in an emigrant steamer, to test the life of the American workman on ourselves, and to verify by

said, chuckling, "if what is advocated in your manifestoes ever comes to pass, will be the first to be hanged." "Perhaps before," Pyotr Stepanovitch said suddenly. "Quite

### Error Finding

Hmmmm...seems like some of these good examples have character names in them. Our model may just have learned specific characters in each novel and used that to predict who wrote it. Let's try one of these passages, replace the names with a blank NAME placeholder, and see what we get.

In [121]:
from transformers import TextClassificationPipeline

In [130]:
pipe = TextClassificationPipeline(model=trainer.model.cpu(), tokenizer=tokenizer, return_all_scores=True)



In [136]:
# with the name no issue
pipe("""
above all—don’t lie.” “You mean about Diderot?” 
“No, not about Diderot. Above all, don’t lie to yourself. 
The man who lies to himself and listens to his own lie comes 
to such a pass that he cannot distinguish the truth within him, 
or around him, and so loses all respect for himself and for others. 
And having no respect he ceases to love, and in order to 
occupy and distract himself without love he gives way to 
passions and coarse pleasures, and sinks to bestiality 
in his vices, all from continual lying to other men and to himself. 
The man who lies to himself can be more easily offended than any one. 
You know it is sometimes very pleasant to take offense, isn’t it? 
A man may know that nobody has insulted him, but that he has 
invented the insult for himself, has lied and exaggerated to make it picturesque,
""")

[[{'label': 'LABEL_0', 'score': 0.999996542930603},
  {'label': 'LABEL_1', 'score': 3.476069423413719e-06}]]

In [137]:
# this one seems to hold
pipe("""
above all—don’t lie.” “You mean about Diderot?” 
“No, not about Diderot. Above all, don’t lie to yourself. 
The man who lies to himself and listens to his own lie comes 
to such a pass that he cannot distinguish the truth within him, 
or around him, and so loses all respect for himself and for others. 
And having no respect he ceases to love, and in order to 
occupy and distract himself without love he gives way to 
passions and coarse pleasures, and sinks to bestiality 
in his vices, all from continual lying to other men and to himself. 
The man who lies to himself can be more easily offended than any one. 
You know it is sometimes very pleasant to take offense, isn’t it? 
A man may know that nobody has insulted him, but that he has 
invented the insult for himself, has lied and exaggerated to make it picturesque,
""".replace("Diderot", "NAME"))

[[{'label': 'LABEL_0', 'score': 0.999982476234436},
  {'label': 'LABEL_1', 'score': 1.748324211803265e-05}]]

In [140]:
pipe("""
Shtcherbatskaya. “On Thursdays we are home, as always.” “Today, then?” 
“We shall be pleased to see you,” the princess said stiffly. 
This stiffness hurt Kitty, and she could not resist the desire 
to smooth over her mother’s coldness. She turned her head, and 
with a smile said: “Good-bye till this evening.” At that moment 
Stepan Arkadyevitch, his hat cocked on one side, 
with beaming face and eyes, strode into the garden 
like a conquering hero. But as he approached his mother-in-law, 
he responded in a mournful and crestfallen tone to her 
inquiries about Dolly’s health. After a little subdued and 
dejected conversation with his mother-in-law, he threw 
out his chest again, and put his arm in Levin’s. “Well, 
shall we set off?” he asked. “I’ve been thinking about 
you all this time, and I’m very, very glad you’ve come,” 
he said, looking him in the face with a significant air.
""")

[[{'label': 'LABEL_0', 'score': 3.176436075591482e-05},
  {'label': 'LABEL_1', 'score': 0.9999682903289795}]]

In [139]:
pipe("""
Shtcherbatskaya. “On Thursdays we are home, as always.” “Today, then?” 
“We shall be pleased to see you,” the princess said stiffly. 
This stiffness hurt Kitty, and she could not resist the desire 
to smooth over her mother’s coldness. She turned her head, and 
with a smile said: “Good-bye till this evening.” At that moment 
Stepan Arkadyevitch, his hat cocked on one side, 
with beaming face and eyes, strode into the garden 
like a conquering hero. But as he approached his mother-in-law, 
he responded in a mournful and crestfallen tone to her 
inquiries about Dolly’s health. After a little subdued and 
dejected conversation with his mother-in-law, he threw 
out his chest again, and put his arm in Levin’s. “Well, 
shall we set off?” he asked. “I’ve been thinking about 
you all this time, and I’m very, very glad you’ve come,” 
he said, looking him in the face with a significant air.
""".replace("Shtcherbatskaya", "NAME") \
.replace("Kitty", "NAME") \
.replace("Stepan Arkadyevitch", "NAME"))

[[{'label': 'LABEL_0', 'score': 3.176523750880733e-05},
  {'label': 'LABEL_1', 'score': 0.9999682903289795}]]
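
Looking at that last cell again, the chained `.replace` calls mask three names but leave "Dolly" and "Levin" in the passage, so the test is a bit weaker than it looks. A helper that takes a list of names makes it easier to mask all of them in one go. This is a sketch only: `mask_names` is a made-up helper, and `snippet` is a shortened stand-in for the passage, not the exact text fed to the pipeline.

```python
import re


def mask_names(passage, names, placeholder="NAME"):
    """Replace every listed name with a placeholder (hypothetical helper).

    Longer names are tried first so a full name like "Stepan Arkadyevitch"
    is replaced as one unit before any shorter name it might contain.
    """
    if not names:
        return passage
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in ordered))
    return pattern.sub(placeholder, passage)


snippet = (
    "This stiffness hurt Kitty, and she could not resist the desire. "
    "Stepan Arkadyevitch responded to her inquiries about Dolly’s health, "
    "and put his arm in Levin’s."
)
names = ["Kitty", "Stepan Arkadyevitch", "Dolly", "Levin"]
masked = mask_names(snippet, names)
print(masked)
```

The masked string can then be passed to `pipe(...)` the same way as the passages above. The list of names still has to be collected by hand, so this doesn't prove that no name slipped through.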

So it seems to hold. Fairly cool. Another thing we should probably look into is content. Even though we included two books per author, the model could have just learned the content of those books and predicted from that. A lot of sticky issues we have with transformers. Still, it's pretty good that names aren't the sole reason...but one last test. We didn't train on Crime and Punishment, so what happens if we try a passage from that...

In [141]:
pipe("""
With a sinking heart and a nervous tremor, he went up to a huge house
which on one side looked on to the canal, and on the other into the
street. This house was let out in tiny tenements and was inhabited by
working people of all kinds--tailors, locksmiths, cooks, Germans of
sorts, girls picking up a living as best they could, petty clerks, etc.
There was a continual coming and going through the two gates and in the
two courtyards of the house. Three or four door-keepers were employed on
the building. The young man was very glad to meet none of them, and
at once slipped unnoticed through the door on the right, and up the
staircase. It was a back staircase, dark and narrow, but he was familiar
with it already, and knew his way, and he liked all these surroundings:
in such darkness even the most inquisitive eyes were not to be dreaded.
""")

[[{'label': 'LABEL_0', 'score': 0.0006668599089607596},
  {'label': 'LABEL_1', 'score': 0.9993330836296082}]]

In [142]:
pipe("""
“Hopelessly in the fullest sense, when you know beforehand that you
will get nothing by it. You know, for instance, beforehand with positive
certainty that this man, this most reputable and exemplary citizen, will
on no consideration give you money; and indeed I ask you why should he?
For he knows of course that I shan’t pay it back. From compassion? But
Mr. Lebeziatnikov who keeps up with modern ideas explained the other day
that compassion is forbidden nowadays by science itself, and that that’s
what is done now in England, where there is political economy. Why, I
ask you, should he give it to me? And yet though I know beforehand that
he won’t, I set off to him and...”
""")

[[{'label': 'LABEL_0', 'score': 0.9999967813491821},
  {'label': 'LABEL_1', 'score': 3.2692801141820382e-06}]]

In [143]:
pipe("""
ell known to the police, had two or three times tried to get at her
through the landlady. ‘And why not?’ said Katerina Ivanovna with a jeer,
‘you are something mighty precious to be so careful of!’ But don’t blame
her, don’t blame her, honoured sir, don’t blame her! She was not herself
when she spoke, but driven to distraction by her illness and the crying
of the hungry children; and it was said more to wound her than anything
else.... For that’s Katerina Ivanovna’s character, and when children
cry, even from hunger, she falls to beating them at once. At six o’clock
I saw Sonia get up, put on her kerchief and her cape, and go out of the
room and about nine o’clock she came back. She walked straight up to
Katerina Ivanovna and she laid thirty roubles on the table before her
in silence. She did not utter a word, she did not even look at her, she
simply picked up our big green _drap de dames_ shawl (we have a shawl,
made of _drap de dames_), put it over her head and face and lay down
on the bed with her face to the wall; only her little shoulders and her
body kept shuddering.... And I went on lying there, just as before....
""")

[[{'label': 'LABEL_0', 'score': 0.000790547754149884},
  {'label': 'LABEL_1', 'score': 0.9992094039916992}]]

Uh oh. It labeled 2 of our 3 Crime and Punishment samples as Tolstoy. This means we really should have thought of a better test set, i.e. something more out of distribution than just passages from the same books.
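
One way to get a tougher test set out of the data we already have is to split by book instead of by random row, so no passage from a held-out book is ever seen in training. A rough sketch, assuming rows shaped like the `text`/`book`/`author` columns built earlier (the `split_by_book` helper and the toy rows are made up for illustration, and nothing here was actually run against the real dataset):

```python
def split_by_book(rows, held_out_books):
    """Put every chunk from the held-out books in the test set (hypothetical helper)."""
    held_out = set(held_out_books)
    train = [r for r in rows if r["book"] not in held_out]
    test = [r for r in rows if r["book"] in held_out]
    return train, test


# toy rows standing in for the real chunks
rows = [
    {"text": "chunk a", "book": "anna_karenina", "author": "Tolstoy"},
    {"text": "chunk b", "book": "war_and_peace", "author": "Tolstoy"},
    {"text": "chunk c", "book": "demons", "author": "Dostoevsky"},
    {"text": "chunk d", "book": "brothers_karamazov", "author": "Dostoevsky"},
]

# hold out one book per author so both labels appear on each side
train_rows, test_rows = split_by_book(rows, ["war_and_peace", "demons"])
print([r["book"] for r in train_rows])
print([r["book"] for r in test_rows])
```

With only two books per author, this leaves a single book per author to train on, so the model could still latch onto that book's content rather than the author's style.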

Maybe we can retrain with C&P as a test set...well, since I don't have infinite GPU, I'll hold off on that until next week. I'm gonna call it here and reread <a href="https://www.fast.ai/posts/2017-11-13-validation-sets.html">this article</a> from Rachel Thomas at fast.ai. So many articles read, so much forgotten...