<a href="https://colab.research.google.com/github/organisciak/Text-Mining-Course/blob/independentstudy/labs/Lab%2007%20Companion%20-%20Preparing%20Wide%20DF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 07 Companion

This notebook describes how the HTRC Extracted Features files were converted to 'wide' dataframes of book x word.

In [0]:
#@title Imports and Installs
!pip install git+https://github.com/massivetexts/htrc-feature-reader.git
import pandas as pd
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from htrc_features import Volume, FeatureReader

## Pandas: combining multiple EF files into one token list

I've added a set of English and French books to our course content: https://github.com/organisciak/Text-Mining-Course/tree/master/data/classification. Here are their HathiTrust volume IDs, split into a training set and a test set:

In [2]:
train_ids = ['hvd.32044014292023', 'hvd.32044102860673', 'mdp.39015038910694', 'pst.000029579440', 'uiug.30112037882914', 'wu.89104415476']
test_ids = ['mdp.39015004295880', 'mdp.39015005725919', 'mdp.39015008815865', 'mdp.39015066049530', 'mdp.39076002736721', 'pst.000062491532' ]
(train_ids + test_ids)

['hvd.32044014292023',
 'hvd.32044102860673',
 'mdp.39015038910694',
 'pst.000029579440',
 'uiug.30112037882914',
 'wu.89104415476',
 'mdp.39015004295880',
 'mdp.39015005725919',
 'mdp.39015008815865',
 'mdp.39015066049530',
 'mdp.39076002736721',
 'pst.000062491532']

All of the files can be loaded into the FeatureReader:

In [0]:
fr = FeatureReader(train_ids + test_ids)

Before we work with *all of them*, consider the type of information we want for each book. We want a DataFrame of word counts for each book, with all of those DataFrames collected into a list.

1) Get a tokenlist DataFrame for the volume, ignoring case, parts of speech, and pages. For simplicity, convert the index to columns, and drop the column called 'section'.

In [7]:
vol = fr.first()
tl = vol.tokenlist(pages=False, pos=False, case=False, drop_section=True)
tl.head(3)

lowercase,count
!,868
!',1
!33,1


Note that the `drop_section` argument removed the part that says 'body'.

2) We want to stick the tokenlists together, so first add the one piece of information that we don't want to lose: the book identifier.

In [8]:
tl['book'] = vol.id
tl.head(3)

lowercase,count,book
!,868,hvd.32044014292023
!',1,hvd.32044014292023
!33,1,hvd.32044014292023
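
If you want to try these two steps, plus the "convert the index to columns" step, without loading a volume, here is a minimal sketch on a hand-made table. The words, counts, and the `example.001` identifier are made up for illustration; only the pandas calls are the same ones used in this notebook.

```python
import pandas as pd

# Made-up token counts, indexed by word the way the tokenlist above is
tl_toy = pd.DataFrame({'count': [3, 1]},
                      index=pd.Index(['the', 'cat'], name='lowercase'))

# Assigning a single value repeats it on every row
tl_toy['book'] = 'example.001'

# reset_index() moves 'lowercase' out of the index into a regular column
flat = tl_toy.reset_index()
print(flat.columns.tolist())
```

After `reset_index()`, every row carries its word, its count, and its book, so nothing is lost when tables from different books are stacked.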


Putting it together: here is a function that takes a volume and returns the desired dataframe as the output:

In [0]:
def prepare_dataframe(input_volume, pos=False, pages=False):
    tl = input_volume.tokenlist(pages=pages, pos=pos, case=False, drop_section=True)
    tl['book'] = input_volume.id
    return tl.reset_index()

For example,

In [34]:
prepare_dataframe(vol).head(3)

,lowercase,count,book
0,!,544,mdp.39015027242315
1,!—and,2,mdp.39015027242315
2,!—before,1,mdp.39015027242315


Great! So, let's use a loop to collect this for every single volume in `fr.volumes()`, then use `pd.concat` to join everything.

At the same time, save a list with additional book information: the title and the language.

In [35]:
book_dataframes = []
book_information = []

for vol in fr.volumes():
    df = prepare_dataframe(vol)
    book_dataframes.append(df)
    book_information.append((vol.id, vol.title, vol.language))
    
books = pd.concat(book_dataframes)
language_assignments = pd.DataFrame(book_information, columns=['book', 'title', 'language'])
books.sample(5)

,lowercase,count,book
879,bat,1,mdp.39015046835560
8530,smoky,1,mdp.39015055831070
7682,public,9,mdp.39015046788223
2732,fours,1,mdp.39015070756609
1849,bodybag,2,mdp.39015054263903
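
As a small illustration of what `pd.concat` does here, the sketch below stacks two hand-made 'long' tables (the books and counts are invented). By default each table keeps its own row labels, so labels repeat across books; `ignore_index=True` is an optional extra, not used above, that renumbers the rows.

```python
import pandas as pd

# Two made-up per-book tables in the same shape prepare_dataframe returns
book_a = pd.DataFrame({'lowercase': ['the', 'cat'], 'count': [3, 1], 'book': 'book_a'})
book_b = pd.DataFrame({'lowercase': ['le', 'chat'], 'count': [4, 2], 'book': 'book_b'})

stacked = pd.concat([book_a, book_b])                        # row labels repeat
renumbered = pd.concat([book_a, book_b], ignore_index=True)  # fresh 0..n-1 labels

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Repeated row labels are harmless for the steps that follow, because the pivot only looks at the `book`, `lowercase`, and `count` columns.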


There are a lot of junk or otherwise uninteresting words, so keep only the words that show up more than $n$ times across the entire collection (here, $n = 5$).

Don't stoplist, because we're looking across languages.

In [0]:
books_filtered = books.groupby('lowercase').filter(lambda x: x['count'].sum() > 5)
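
If `groupby(...).filter(...)` is unfamiliar: it evaluates the function once per group (here, once per word) and keeps or drops *all* of that group's rows together. A minimal sketch on made-up counts:

```python
import pandas as pd

# Made-up long table: 'the' and 'cat' total 7 each, 'zyx' totals 1
books_toy = pd.DataFrame({
    'lowercase': ['the', 'the', 'zyx', 'cat', 'cat'],
    'count':     [4,     3,     1,     2,     5],
    'book':      ['a',   'b',   'a',   'a',   'b'],
})

# Keep every row of a word whose collection-wide total is more than 5
kept = books_toy.groupby('lowercase').filter(lambda x: x['count'].sum() > 5)
print(sorted(kept['lowercase'].unique()))  # ['cat', 'the']
```

Note that the `cat` row with a count of 2 survives: the threshold applies to the word's total across all books, not to each row.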

`books` is 'long', meaning each word is in its own row. To make it wide we need to pivot the DataFrame. The hope is for a DataFrame where each row is a book, each column is a word, and the cells are the frequency counts. Consider how that request becomes the arguments for `books.pivot()`:

In [38]:
book_order = language_assignments['book']
wide_books = (books_filtered.pivot(index='book', columns='lowercase', values='count')
                            .fillna(0)
                            .loc[book_order]
              )
wide_books

lowercase,!,!—and,"""","""'the","""because","""big","""do","""ecce","""eddie","""edgar","""fbi","""from","""get","""go","""good","""have","""he","""home","""how","""if","""in","""is","""it","""jack-you-boys","""kyra","""let","""like","""little","""my","""new","""on","""one","""or","""real","""was","""we","""what","""why","""you","""you're",...,},£,©,«,»,—,"—""",—$,—a,—and,—are,—as,—at,—but,—for,—had,—he,—her,—his,—how,—if,—in,—is,—it,—just,—not,—on,—one,—or,—she,—so,—some,—that,—the,„,•,■,□,★,♦
mdp.39015046835560,34.0,0.0,5785.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,21.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,39.0,0.0,0.0
mdp.39015062842383,148.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,1.0,0.0,273.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,9.0,0.0
mdp.39015073669312,26.0,0.0,3395.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,20.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
mdp.39015055831070,238.0,0.0,4772.0,0.0,2.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,6.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,5.0,0.0,...,1.0,0.0,1.0,0.0,0.0,126.0,3.0,0.0,1.0,10.0,0.0,0.0,0.0,2.0,1.0,1.0,5.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
mdp.39015043780249,104.0,0.0,957.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,2.0,0.0,0.0,33.0,1.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0
uc1.32106012198112,196.0,0.0,2084.0,0.0,2.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,3.0,3.0,1.0,0.0,0.0,3.0,1.0,2.0,0.0,6.0,1.0,2.0,0.0,5.0,0.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,1.0,1.0,1.0,65.0,8.0,0.0,4.0,8.0,2.0,4.0,0.0,2.0,2.0,0.0,2.0,0.0,2.0,1.0,1.0,3.0,2.0,0.0,2.0,1.0,0.0,0.0,2.0,3.0,2.0,0.0,0.0,3.0,0.0,1.0,1.0,0.0,0.0,0.0
mdp.39015070756609,40.0,0.0,2960.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,19.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0
mdp.39015010763418,73.0,0.0,3562.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,2.0,2.0,0.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,1.0,4.0,3.0,9.0,0.0,...,2.0,1.0,1.0,1.0,1.0,16.0,5.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0
uc1.32106017944551,222.0,0.0,3656.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,1.0,0.0,2.0,0.0,0.0,170.0,1.0,0.0,1.0,12.0,0.0,1.0,0.0,0.0,0.0,5.0,4.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,1.0,1.0,1.0,4.0,7.0,0.0,0.0,0.0
mdp.39015046788223,119.0,0.0,7333.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,0.0,25.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,48.0,0.0,0.0


In addition to the pivot:
 - I added `fillna(0)`, which puts in a `0` for every missing (`n/a`) value, i.e. every word that a book never uses.
 - I took the book names from the information dataframe and used `.loc` to put the `wide_books` rows in that same order.
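
Here is the same pivot, fill, and reorder chain on a tiny made-up table, as a sketch of how each argument maps onto the request "rows are books, columns are words, cells are counts". The data and the `['b', 'a']` ordering are invented for illustration.

```python
import pandas as pd

# Made-up long table: one row per (book, word) pair
long_toy = pd.DataFrame({
    'book':      ['a',   'a',   'b',    'b'],
    'lowercase': ['cat', 'the', 'chat', 'the'],
    'count':     [2,     4,     1,      3],
})

wide_toy = (long_toy.pivot(index='book',         # each row is a book
                           columns='lowercase',  # each column is a word
                           values='count')       # each cell is a count
                    .fillna(0)                   # words a book never uses
                    .loc[['b', 'a']]             # force a specific row order
            )
print(wide_toy)
```

Book `a` never uses 'chat', so that cell starts out missing and becomes `0` after `fillna(0)`.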

In [0]:
import os
# Make a 'classification' directory. This is in your temporary Colab space - 
# it will disappear after a few hours
os.makedirs('classification', exist_ok=True) 
wide_books.to_csv('classification/english_french_class.csv', encoding='utf-8')
language_assignments.to_csv('classification/english_french_class_labels.csv', encoding='utf-8', index=False)

## Molding a new document to have the same column order of words

In [42]:
vol = FeatureReader('hvd.hn6ltf').first()
# prepare_dataframe already resets the index, so no second reset_index() is needed
tl = prepare_dataframe(vol)
tl_wide = tl.pivot(index='book', columns='lowercase', values='count').fillna(0)
tl_wide



lowercase,!,!—nay,!—that,!—you,"""","""and","""are","""because","""dear","""do","""dry","""every","""exclaimed","""fills","""find","""he","""how","""howkind","""if","""in","""is","""it","""listen","""little","""my","""old","""on","""or","""part","""promised","""really","""requires","""very","""was","""what","""why","""write","""you","""your","""—punch",...,yet,yield,yielded,yon,york,you,young,younger,youngest,youngster,your,yours,yourself,yourselves,youth,youthful,youths,zeal,zzth,|,—,"—""",—',—'come,—'if,—a,—athenaum,—led,—let,—more,—my,—one,—save,—she,—the,—their,—to,•,•93,•dons
hvd.hn6ltf,166,1,1,1,723,1,1,1,1,1,1,1,1,1,1,3,6,1,1,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,2,3,1,1,2,1,...,154,6,3,2,1,568,31,10,1,1,245,4,28,1,21,3,1,4,1,1,37,7,1,1,1,1,1,1,1,1,2,1,1,1,1,1,2,2,1,1


As you can see, the new document has different words than the training data. To get the appropriate column order, first save the wide_books columns.

In [0]:
c = wide_books.columns

The next step is a bit ugly. If you concat the new book with a zero-row version of the training data (`wide_books.head(0)`), the result has a column for every word in either table, with missing values for all the training words that the new book doesn't have. Then, you can select just the training columns (the `[c]` part), which also drops words the model never saw, and fill the missing values with zero again (`fillna(0)`):

In [44]:
new_book = pd.concat([tl_wide, wide_books.head(0)])[c].fillna(0)
new_book

Unnamed: 0_level_0,!,!—and,"""","""'the","""because","""big","""do","""ecce","""eddie","""edgar","""fbi","""from","""get","""go","""good","""have","""he","""home","""how","""if","""in","""is","""it","""jack-you-boys","""kyra","""let","""like","""little","""my","""new","""on","""one","""or","""real","""was","""we","""what","""why","""you","""you're",...,},£,©,«,»,—,"—""",—$,—a,—and,—are,—as,—at,—but,—for,—had,—he,—her,—his,—how,—if,—in,—is,—it,—just,—not,—on,—one,—or,—she,—so,—some,—that,—the,„,•,■,□,★,♦
hvd.hn6ltf,166.0,0.0,723.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,6.0,1.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,3.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,37.0,7.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0
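
As an aside, pandas also has `DataFrame.reindex`, which should give the same alignment in one step. This is an alternative I'm sketching on made-up data, not what was used to build the course files:

```python
import pandas as pd

# Made-up training table (defines the column order) and a made-up new book
train_wide = pd.DataFrame({'cat': [2.0], 'the': [4.0]},
                          index=pd.Index(['a'], name='book'))
new_wide = pd.DataFrame({'the': [7], 'zebra': [1]},
                        index=pd.Index(['new'], name='book'))

# Keep exactly the training columns, in training order;
# words the new book lacks get 0, and unseen words ('zebra') are dropped
aligned = new_wide.reindex(columns=train_wide.columns, fill_value=0)
print(aligned)
```

Either way, the point is the same: the new document must end up with exactly the training columns, in the training order.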


## Preparing contemporary author example

Since this example has a large number of files, I'll download them by cloning the GitHub repository.

In [0]:
!git clone https://github.com/organisciak/Text-Mining-Course.git

In [0]:
import glob  # glob lists all the file paths that match a pattern
paths = glob.glob('Text-Mining-Course/data/contemporary_books/dataset_files/*bz2')
fr = FeatureReader(paths)

book_dataframes = []
book_information = []

for vol in fr.volumes():
    df = prepare_dataframe(vol, pos=True)
    book_dataframes.append(df)
    # Author is a list, like "[King, Stephen 1947- ]", so we'll grab just the first item,
    # and truncate at the first comma
    author = vol.author[0].split(',')[0]
    # Title includes the author name, as in "Carrie / Stephen King.", so truncate 
    title = vol.title.split(' / ')[0]
    book_information.append((vol.id, author, title))
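
To see the two string clean-ups in isolation, here they are applied to the example values mentioned in the comments above. The variable names `author_field` and `title_field` are just stand-ins for `vol.author` and `vol.title`.

```python
# Stand-ins for vol.author and vol.title, using the examples from the comments
author_field = ['King, Stephen 1947-']
title_field = 'Carrie / Stephen King.'

# First author in the list, truncated at the first comma
author = author_field[0].split(',')[0]
# Everything before the ' / ' that introduces the author statement
title = title_field.split(' / ')[0]

print(author)  # King
print(title)   # Carrie
```

If a title has no ' / ' in it, `split` returns the whole string as its only item, so `[0]` still works.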

In [0]:
books = pd.concat(book_dataframes)

# Include only nouns (non-proper), verbs, adverbs, adjectives, and interjections
include_pos = ['NN', 'NNS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ',
               'RB', 'RBR', 'RBS','JJ', 'JJS', 'JJR','UH']
good_pos = books['pos'].isin(include_pos)
stopword = books['lowercase'].isin(stopwords.words('english'))
alpha = books['lowercase'].str.isalpha()

books_filtered = (books[~stopword & alpha & good_pos]
                    .groupby('lowercase')
                    .filter(lambda x: x['count'].sum() > 5)
                 )

info = pd.DataFrame(book_information, columns=['book', 'author', 'title'])
book_order = info['book']
wide_books = (books_filtered.groupby(['book', 'lowercase'], as_index=False)[['count']].sum()
                            .pivot(index='book', columns='lowercase', values='count')
                            .fillna(0)
                            .loc[book_order]
              )
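
Notice the extra `groupby(['book', 'lowercase'])...sum()` before the pivot, which the English/French example didn't need. My reading of why: with `pos=True`, the same word can appear in a book under more than one part-of-speech tag, and `pivot` refuses to reshape when a (row, column) pair occurs twice. A sketch on made-up data:

```python
import pandas as pd

# Made-up tagged counts: in book 'a', 'run' appears as both a noun and a verb
tagged = pd.DataFrame({
    'book':      ['a',   'a',   'b'],
    'lowercase': ['run', 'run', 'run'],
    'pos':       ['NN',  'VB',  'VB'],
    'count':     [1,     3,     2],
})

# Pivoting directly fails because ('a', 'run') is duplicated
try:
    tagged.pivot(index='book', columns='lowercase', values='count')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# Summing over POS tags first leaves one row per (book, word)
summed = tagged.groupby(['book', 'lowercase'], as_index=False)[['count']].sum()
wide = summed.pivot(index='book', columns='lowercase', values='count')
print(pivot_failed, wide.loc['a', 'run'])
```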

In [51]:
info.sample(5)

,book,author,title
21,pst.000050069378,Grisham,The king of torts (large print)
15,mdp.39015046381565,Grisham,A time to kill
24,mdp.39015029244657,Grisham,The pelican brief
4,mdp.39015043780249,King,The girl who loved Tom Gordon
19,mdp.39015040702071,Atwood,Alias Grace


In [0]:
os.makedirs('contemporary_books', exist_ok=True)
wide_books.to_csv('contemporary_books/contemporary.csv', encoding='utf-8')
info.to_csv('contemporary_books/contemporary_labels.csv', encoding='utf-8', index=False)

## Page-level

To make the dataset a little bit smaller, this example doesn't use single pages: it groups pages into blocks of 10 and treats each block as one document.

In [0]:
paths = glob.glob('Text-Mining-Course/data/contemporary_books/dataset_files/*bz2')
fr = FeatureReader(paths)

book_dataframes = []
book_information = []

for vol in fr.volumes():
    df = prepare_dataframe(vol, pages=True, pos=True)
    book_dataframes.append(df)
    author = vol.author[0].split(',')[0]
    title = vol.title.split(' / ')[0]
    book_information.append((vol.id, author, title))

In [0]:
books = pd.concat(book_dataframes)
# Round each page number down to a multiple of 10 (pages 10-19 -> block 10, etc.)
books['pageblock'] = books['page'] - books['page'] % 10
books['id'] = books['book'] + '-' + books['pageblock'].astype(str)

# Filter by POS: keep only nouns (non-proper)
include_pos = ['NN', 'NNS']
good_pos = books['pos'].isin(include_pos)
stopword = books['lowercase'].isin(stopwords.words('english'))
alpha = books['lowercase'].str.isalpha()
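
The `pageblock` arithmetic is easier to trust once you've seen it on a few numbers. This sketch uses made-up page numbers and a made-up book identifier:

```python
import pandas as pd

pages = pd.Series([1, 9, 10, 19, 20, 27])

# Subtracting the remainder rounds each page down to a multiple of 10
pageblock = pages - pages % 10
print(pageblock.tolist())  # [0, 0, 10, 10, 20, 20]

# Combine with a (hypothetical) book id to name each 10-page 'document'
ids = 'example.001' + '-' + pageblock.astype(str)
print(ids.tolist()[2])  # example.001-10
```

The `astype(str)` matters: you can't join a text column and a number column with `+` until both are text.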

In [0]:
books_filtered = (books[~stopword & alpha & good_pos]
                    .groupby('lowercase')
                    .filter(lambda x: x['count'].sum() > 5)
                 )

In [0]:
info = pd.DataFrame(book_information, columns=['book', 'author', 'title'])
info_with_pages = pd.merge(info, books_filtered[['book', 'pageblock']].drop_duplicates())

In [56]:
info_with_pages.head()

,book,author,title,pageblock
0,mdp.39015046835560,Grisham,The partner,0
1,mdp.39015046835560,Grisham,The partner,10
2,mdp.39015046835560,Grisham,The partner,20
3,mdp.39015046835560,Grisham,The partner,30
4,mdp.39015046835560,Grisham,The partner,40
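
What happened in that merge: `drop_duplicates()` reduces the token table to one row per (book, pageblock) pair, and `pd.merge`, with no other arguments, joins on the column the two tables share (`book`), repeating the book's information for each of its page blocks. A sketch on made-up tables:

```python
import pandas as pd

# Made-up book-level labels and page-block-level tokens
info_toy = pd.DataFrame({'book': ['a', 'b'], 'author': ['King', 'Atwood']})
tokens_toy = pd.DataFrame({
    'book':      ['a',   'a',   'a',   'b'],
    'pageblock': [0,     0,     10,    0],
    'lowercase': ['cat', 'dog', 'cat', 'bird'],
})

# One row per (book, pageblock), however many words each block has
blocks = tokens_toy[['book', 'pageblock']].drop_duplicates()

# Joins on the shared 'book' column; each label row repeats once per block
merged = pd.merge(info_toy, blocks)
print(merged)
```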


In [0]:
page_order = (info_with_pages['book'] + '-' + info_with_pages['pageblock'].astype(str))
wide_books = (books_filtered.groupby(['id', 'lowercase'], as_index=False)[['count']].sum()
                            .pivot(index='id', columns='lowercase', values='count')
                            .fillna(0)
                            .loc[page_order]
             )

In [0]:
# Note the compression: the page-level table is large, so it is saved gzipped
wide_books.to_csv('contemporary_books/contemporary-pages.csv.gz', encoding='utf-8', compression='gzip')
info_with_pages.to_csv('contemporary_books/contemporary-pages_labels.csv', encoding='utf-8', index=False)
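
For reading a file like this back in later, `pd.read_csv` can infer gzip compression from the `.gz` extension, and `index_col=0` restores the first column as the row index. Here is a round-trip sketch on a made-up table written to a temporary directory (the file name `toy-pages.csv.gz` is hypothetical):

```python
import os
import tempfile

import pandas as pd

# Made-up wide table indexed by page-block id
wide_toy = pd.DataFrame({'cat': [2.0, 0.0], 'the': [4.0, 3.0]},
                        index=pd.Index(['a-0', 'a-10'], name='id'))

path = os.path.join(tempfile.mkdtemp(), 'toy-pages.csv.gz')
wide_toy.to_csv(path, encoding='utf-8', compression='gzip')

# Compression is inferred from the '.gz' extension on the way back in
restored = pd.read_csv(path, index_col=0)
print(restored)
```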