In this lab we're going to classify authors by their writing style, using samples from printed books. This is the same case study we saw in the lecture.

In [1]:
from text_analytics import TextAnalytics
import os
import pandas as pd

ai = TextAnalytics()
ai.data_dir = os.path.join(".", "data")
print("Done!")

Done!


Let's load the books that we need for each author. We're looking at authors born between 1850 and 1900, each of whom has a number of books here. Having several books per author makes sure that the classifier learns to *generalize* about an author's style, so that we're not just predicting individual books. Each sample is about 5k from a book.

In [2]:
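# Build the path to the compressed file of book samples
file = "stylistics.authorship_1850.gz"
file = os.path.join(ai.data_dir, file)
# pandas infers gzip compression from the .gz extension
df = pd.read_csv(file)
print(df)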

         Author                Title  \
0      abbott_j  alexander_the_great   
1      abbott_j  alexander_the_great   
2      abbott_j  alexander_the_great   
3      abbott_j  alexander_the_great   
4      abbott_j  alexander_the_great   
...         ...                  ...   
15994    wood_h       victor_serenus   
15995    wood_h       victor_serenus   
15996    wood_h       victor_serenus   
15997    wood_h       victor_serenus   
15998    wood_h       victor_serenus   

                                                    Text  
0      note project gutenberg also has an html versio...  
1      it will be recollected to epirus where her fri...  
2      it would be best to endeavor to effect a landi...  
3      transport his army across the straits the army...  
4      that the true greatness of the soul of alexand...  
...                                                  ...  
15994  since i have been with amabel it hath waxed st...  
15995  his face uttered a loud cry and shrank b

We're going to use function word n-grams. These are pre-defined, so we don't need to fit a model before we use them. Let's take a look at how many authors we have here.
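
To make the feature type concrete, here is a minimal sketch of counting function word n-grams in plain Python. The word list and the `function_word_ngrams` helper are hypothetical and only for illustration; the pre-defined inventory and the extraction logic inside *text_analytics* may well differ.

```python
from collections import Counter

# Hypothetical, tiny function word list for illustration only;
# the package uses its own pre-defined inventory.
FUNCTION_WORDS = {"the", "a", "of", "in", "to", "and", "was", "there", "it", "that"}

def function_word_ngrams(text, n=2):
    """Count the n-grams that consist entirely of function words."""
    tokens = text.lower().split()
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        # Keep the n-gram only if every token is a function word
        if all(tok in FUNCTION_WORDS for tok in gram):
            counts[" ".join(gram)] += 1
    return counts

sample = "there was a house in the woods and there was a light in the window"
ngram_counts = function_word_ngrams(sample)
print(ngram_counts.most_common(3))
```

Because the word list is fixed in advance, nothing has to be learned from the corpus before counting, which is the sense in which these features are pre-defined. With the feature type in mind, the next cell goes back to the data and counts the samples per author.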

In [3]:
freq = ai.print_labels(df, "Author")

for author in freq:
    print(author, freq[author])

abbott_j 927
altsheler_j 610
bennett_a 573
bindloss_h 815
chambers_r 728
collingwood_h 659
collins_w 858
crawford_f 912
doyle_a 694
galsworthy_j 337
gissing_g 528
haggard_h 956
hume_f 975
london_j 590
meade_l 701
oppenheim_e 848
parker_g 429
quiller-couch_a 514
stratemeyer_e 792
ward_h 671
warner_c 300
wells_c 515
weyman_s 511
wood_h 556


That is 24 authors, with between 300 and 975 samples each. Telling this many authors apart is going to be difficult using just function word n-grams like "there was" or "in the".

So now we have (1) our data from Project Gutenberg and (2) our style vectorizer (function word n-grams). We're going to classify the samples by author. The basic code is below; it just calls our *text_analytics* package. That package splits the data into training and testing data, learns a classifier, and then evaluates the classifier. We're telling the package to use "Author" as the ground-truth class, with stylistic features.

In [4]:
report = ai.shallow_classification(df, labels="Author", features="style", classifier="lm")
print(report)

                 precision    recall  f1-score   support

       abbott_j       0.99      0.98      0.98        96
    altsheler_j       0.98      1.00      0.99        50
      bennett_a       1.00      1.00      1.00        61
     bindloss_h       0.99      0.99      0.99        96
     chambers_r       0.97      1.00      0.98        65
  collingwood_h       1.00      0.98      0.99        55
      collins_w       0.98      0.98      0.98        95
     crawford_f       0.99      0.99      0.99        94
        doyle_a       1.00      1.00      1.00        62
   galsworthy_j       1.00      1.00      1.00        27
      gissing_g       0.98      0.97      0.97        59
      haggard_h       0.99      0.98      0.98        99
         hume_f       0.99      1.00      1.00       107
       london_j       1.00      1.00      1.00        64
        meade_l       1.00      0.99      0.99        70
    oppenheim_e       1.00      1.00      1.00        75
       parker_g       0.98    

**Be patient**

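And there we go! We're looking at how accurate the classifier is for each author: precision, recall, and f1-score, with support giving the number of test samples for that author.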
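
As a reminder of what the columns in the report mean, here is a small sketch that computes them by hand for one author. The toy labels and the `per_class_scores` helper are made up for illustration; this is not the package's own evaluation code.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correct hits
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false alarms
    fn = sum(1 for t, p in pairs if t == label and p != label)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support

# Made-up toy labels, not taken from the real data
y_true = ["doyle_a", "doyle_a", "doyle_a", "wells_c", "wells_c", "hume_f"]
y_pred = ["doyle_a", "doyle_a", "wells_c", "wells_c", "wells_c", "hume_f"]

for author in sorted(set(y_true)):
    print(author, per_class_scores(y_true, y_pred, author))
```

Precision asks how many of the samples assigned to an author really belong to that author; recall asks how many of that author's samples were found.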

Your scores will differ a bit from the lecture, because we're using random train/test splits. That means the classifier is trained and tested on different samples each time. If you want more advanced examples for how to solve this authorship classification problem, take a look at the *text_analytics.shallow_classification()* function.
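
To see why random splits change the scores from run to run, here is a minimal sketch of a shuffled split in plain Python. The `train_test_split_indices` helper, its `seed` parameter, and the 10% test fraction are assumptions for illustration; how the package actually splits the data is not shown in this lab.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.1, seed=None):
    """Shuffle the sample indices and hold out a fraction for testing."""
    rng = random.Random(seed)  # seed=None gives a different shuffle each run
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# The same seed reproduces the same split; a different seed usually does not
train_a, test_a = train_test_split_indices(20, seed=1)
train_b, test_b = train_test_split_indices(20, seed=1)
train_c, test_c = train_test_split_indices(20, seed=2)

print("same seed, same test set:", test_a == test_b)
print("test set with seed 1:", sorted(test_a))
print("test set with seed 2:", sorted(test_c))
```

Fixing a seed like this is one common way to make results repeatable when comparing against someone else's numbers.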