# Lab 07 Companion

This notebook describes how the HTRC Extracted Features files were converted to 'wide' dataframes of book x word.

In [None]:
import pandas as pd
from nltk.corpus import stopwords
from htrc_features import FeatureReader

## Pandas: combining multiple EF files into one token list

I've added a set of English and French books to our course content: https://github.com/organisciak/Text-Mining-Course/tree/master/data/classification. Here are the paths (as they look on my system):

In [None]:
# Glob lets us select a number of files using a 'wildcard'
import glob
train_paths = glob.glob('../data/classification/train/*bz2')
test_paths = glob.glob('../data/classification/test/*bz2')
(train_paths + test_paths)

['../data/classification/train\\hvd.32044014292023.json.bz2',
 '../data/classification/train\\hvd.32044102860673.json.bz2',
 '../data/classification/train\\mdp.39015038910694.json.bz2',
 '../data/classification/train\\pst.000029579440.json.bz2',
 '../data/classification/train\\uiug.30112037882914.json.bz2',
 '../data/classification/train\\wu.89104415476.json.bz2',
 '../data/classification/test\\mdp.39015004295880.json.bz2',
 '../data/classification/test\\mdp.39015005725919.json.bz2',
 '../data/classification/test\\mdp.39015008815865.json.bz2',
 '../data/classification/test\\mdp.39015066049530.json.bz2',
 '../data/classification/test\\mdp.39076002736721.json.bz2',
 '../data/classification/test\\pst.000062491532.json.bz2']

All of the files can be loaded into the FeatureReader:

In [None]:
fr = FeatureReader(train_paths + test_paths)

Before we work with *all of them*, consider the type of information we want for each book. We want a DataFrame of word counts for each book, with all of those DataFrames collected into a list.

1) Get a tokenlist DataFrame for the volume, ignoring case, parts of speech, and pages. For simplicity, convert the index to columns, and drop the column called 'section'.

In [None]:
vol = fr.first()
tl = (vol.tokenlist(pages=False, pos=False, case=False)
         .reset_index()
         .drop('section', axis=1)
      )
tl.head(3)

Unnamed: 0,lowercase,count
0,!,868
1,!',1
2,!33,1


When dropping the section column with `drop('section', axis=1)`, `axis=1` tells pandas that you're referring to a column and not a row. Older code often passes the axis positionally, as `drop('section', 1)`, which pandas 2.0 and later no longer accept.
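
A minimal sketch of the axis argument on a made-up frame (the rows below are invented, not taken from an EF file). It shows that `axis=1` and the `columns=` keyword select the same column:

```python
import pandas as pd

# Toy token list with the same columns that reset_index() produces
toy = pd.DataFrame({
    'section': ['body', 'body'],
    'lowercase': ['the', 'whale'],
    'count': [10, 3],
})

# axis=1 (or columns=...) targets a column; axis=0 would look for a row label
by_axis = toy.drop('section', axis=1)
by_name = toy.drop(columns='section')

print(list(by_axis.columns))
```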

2) We want to stick the tokenlists together, so add information that we don't want to lose - the book identifier.

In [None]:
tl['book'] = vol.id
tl.head(3)

Unnamed: 0,lowercase,count,book
0,!,868,hvd.32044014292023
1,!',1,hvd.32044014292023
2,!33,1,hvd.32044014292023


Putting it together: here is a function that takes a volume and returns the desired dataframe as the output:

In [None]:
def prepare_dataframe(input_volume, pos=False, pages=False):
    tl = (input_volume.tokenlist(pages=pages, pos=pos, case=False)
                      .reset_index()
                      .drop('section', axis=1)
      )
    tl['book'] = input_volume.id

    return tl

For example,

In [None]:
prepare_dataframe(vol).head(3)

Unnamed: 0,lowercase,count,book
0,!,868,hvd.32044014292023
1,!',1,hvd.32044014292023
2,!33,1,hvd.32044014292023


Great! So, let's use a loop to collect this for every single volume in `fr.volumes()`, then use `pd.concat` to join everything.

At the same time, save a list with additional book information: the title and the language.

In [None]:
book_dataframes = []
book_information = []

for vol in fr.volumes():
    df = prepare_dataframe(vol)
    book_dataframes.append(df)
    book_information.append((vol.id, vol.title, vol.language))
    
books = pd.concat(book_dataframes)
language_assignments = pd.DataFrame(book_information, columns=['book', 'title', 'language'])

books.sample(5)

Unnamed: 0,lowercase,count,book
6917,jaunes,3,mdp.39015008815865
475,a-drinking,1,pst.000029579440
4362,earshot,1,mdp.39076002736721
9704,piquant,1,mdp.39015008815865
10360,recoiled,5,mdp.39076002736721
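
The collect-then-concatenate pattern can be sketched with two invented token lists (the ids and counts here are placeholders). It also shows why the sampled index values above look arbitrary: each book keeps its own row numbers after `pd.concat`.

```python
import pandas as pd

# Two hypothetical per-book token lists; 'book' is broadcast to every row
book_a = pd.DataFrame({'lowercase': ['the', 'whale'], 'count': [10, 3], 'book': 'toy.a'})
book_b = pd.DataFrame({'lowercase': ['le', 'the'], 'count': [7, 1], 'book': 'toy.b'})

# Stack the frames on top of each other; the original row numbers are kept
stacked = pd.concat([book_a, book_b])

print(stacked)
```

Passing `ignore_index=True` to `pd.concat` would renumber the rows instead, which this notebook does not need.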


There are a lot of junk or uninteresting words, so filter to words that show up more than five times across the entire collection.

Don't stoplist, because we're looking across languages.

In [None]:
books_filtered = books.groupby('lowercase').filter(lambda x: x['count'].sum() > 5)
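
As a small illustration with invented counts, `groupby(...).filter(...)` keeps or drops every row of a word together, based on that word's total across all books:

```python
import pandas as pd

toy = pd.DataFrame({
    'lowercase': ['the', 'the', 'whale', 'le'],
    'count': [4, 3, 2, 6],
    'book': ['toy.a', 'toy.b', 'toy.a', 'toy.b'],
})

# Totals: 'the' = 7, 'whale' = 2, 'le' = 6, so only 'whale' is dropped
kept = toy.groupby('lowercase').filter(lambda x: x['count'].sum() > 5)

print(kept)
```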

`books` is 'long', meaning each word is in its own row. To make it wide we need to pivot the DataFrame. The goal is a DataFrame where each row is a book, each column is a word, and the cells are the frequency counts. Consider how that request becomes the arguments for `books.pivot()`:

In [None]:
book_order = language_assignments['book']
wide_books = (books_filtered.pivot(index='book', columns='lowercase', values='count')
                            .fillna(0)
                            .loc[book_order]
              )
wide_books

book,!,!—,!—the,"""","""""","""because","""if","""it","""only","""or",...,ﬂight,ﬂights,ﬂoor,ﬂown,ﬂuid,ﬂung,ﬂush,ﬂushed,ﬂy,ﬂying
hvd.32044014292023,868.0,0.0,0.0,4582.0,2.0,6.0,10.0,22.0,7.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
hvd.32044102860673,1354.0,0.0,0.0,139.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mdp.39015038910694,910.0,5.0,9.0,29.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
pst.000029579440,452.0,3.0,1.0,2835.0,71.0,2.0,4.0,5.0,2.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
uiug.30112037882914,159.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
wu.89104415476,573.0,0.0,0.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mdp.39015004295880,565.0,0.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mdp.39015005725919,314.0,0.0,0.0,21.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mdp.39015008815865,963.0,0.0,0.0,33.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mdp.39015066049530,119.0,0.0,0.0,51.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In addition to the pivot:
 - I added `fillna(0)`, which puts a `0` in place of every missing (`NaN`) value.
 - I took the book ids from `language_assignments` and put the `wide_books` rows in the same order, so the rows line up with the labels.
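
The three steps (pivot, fill, reorder) can be sketched on a tiny invented long table; the ids and the `order` list are placeholders:

```python
import pandas as pd

long_toy = pd.DataFrame({
    'book': ['toy.a', 'toy.a', 'toy.b'],
    'lowercase': ['the', 'whale', 'the'],
    'count': [10, 3, 7],
})
order = ['toy.b', 'toy.a']  # stands in for language_assignments['book']

wide_toy = (long_toy.pivot(index='book', columns='lowercase', values='count')
                    .fillna(0)    # 'toy.b' never uses 'whale', so that cell starts as NaN
                    .loc[order]   # put the rows in the same order as the labels
           )

print(wide_toy)
```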

In [None]:
wide_books.to_csv('../data/classification/english_french_class.csv', encoding='utf-8')
language_assignments.to_csv('../data/classification/english_french_class_labels.csv', encoding='utf-8', index=False)

## Molding a new document to have the same column order of words

In [None]:
vol = FeatureReader('../data/hvd.hn6ltf.json.bz2').first()
tl = prepare_dataframe(vol)
tl_wide = tl.pivot(index='book', columns='lowercase', values='count').fillna(0)
tl_wide

book,!,!—nay,!—that,!—you,"""","""and","""are","""because","""dear","""do",...,—my,—one,—save,—she,—the,—their,—to,•,•93,•dons
hvd.hn6ltf,166,1,1,1,723,1,1,1,1,1,...,2,1,1,1,1,1,2,2,1,1


As you can see, the new document has a different set of words from the training data. To get the appropriate column order, first save the `wide_books` columns.

In [None]:
c = wide_books.columns

The next step is a bit ugly. If you concatenate the new book with a zero-row version of the training data (`wide_books.head(0)`), pandas adds missing values for every training word that the new book doesn't have. Then you can select just the training columns (the `[c]` part), which also drops words that appear only in the new book, and fill the missing values with zero again (`fillna(0)`):

In [None]:
new_book = pd.concat([tl_wide, wide_books.head(0)])[c].fillna(0)
new_book

book,!,!—,!—the,"""","""""","""because","""if","""it","""only","""or",...,ﬂight,ﬂights,ﬂoor,ﬂown,ﬂuid,ﬂung,ﬂush,ﬂushed,ﬂy,ﬂying
hvd.hn6ltf,166.0,0.0,0.0,723.0,0.0,1.0,1.0,2.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
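
A possible alternative, not used in this notebook, is `DataFrame.reindex`, which aligns a frame to a given set of columns in one step. The sketch below uses invented frames; `train_toy` and `new_toy` are stand-ins for `wide_books` and `tl_wide`:

```python
import pandas as pd

train_toy = pd.DataFrame([[5, 2, 0]], index=['toy.train'], columns=['cat', 'dog', 'emu'])
new_toy = pd.DataFrame([[1, 4]], index=['toy.new'], columns=['dog', 'zebra'])

# Keep only the training columns, in training order; words the new book lacks get 0,
# and words unique to the new book ('zebra') are dropped
aligned = new_toy.reindex(columns=train_toy.columns, fill_value=0)

print(aligned)
```

Either approach should give the classifier a row with exactly the columns it was trained on.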


## Preparing contemporary author example

In [None]:
paths = glob.glob('../data/contemporary_books/dataset_files/*bz2')
fr = FeatureReader(paths)

book_dataframes = []
book_information = []

for vol in fr.volumes():
    df = prepare_dataframe(vol, pos=True)
    book_dataframes.append(df)
    # Author is a list, like "[King, Stephen 1947- ]", so we'll grab just the first item,
    # and truncate at the first comma
    author = vol.author[0].split(',')[0]
    # Title includes the author name, as in "Carrie / Stephen King.", so truncate 
    title = vol.title.split(' / ')[0]
    book_information.append((vol.id, author, title))
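
The two truncations can be checked on the example strings from the comments above (real metadata values may be formatted differently):

```python
# Example values copied from the comments in the loop
authors = ['King, Stephen 1947- ']
raw_title = 'Carrie / Stephen King.'

author = authors[0].split(',')[0]   # text before the first comma
title = raw_title.split(' / ')[0]   # text before the ' / ' separator

# A title without ' / ' comes back unchanged, because split() finds no separator
plain = 'Desperation'.split(' / ')[0]

print(author, '|', title, '|', plain)
```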

In [None]:
books = pd.concat(book_dataframes)

# Include only nouns (non-proper), verbs, adverbs, adjectives, and interjections
include_pos = ['NN', 'NNS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ',
               'RB', 'RBR', 'RBS','JJ', 'JJS', 'JJR','UH']
good_pos = books['pos'].isin(include_pos)
stopword = books['lowercase'].isin(stopwords.words('english'))
alpha = books['lowercase'].str.isalpha()

books_filtered = (books[~stopword & alpha & good_pos]
                    .groupby('lowercase')
                    .filter(lambda x: x['count'].sum() > 5)
                 )

info = pd.DataFrame(book_information, columns=['book', 'author', 'title'])
book_order = info['book']
wide_books = (books_filtered.groupby(['book', 'lowercase'], as_index=False)[['count']].sum()
                            .pivot(index='book', columns='lowercase', values='count')
                            .fillna(0)
                            .loc[book_order]
              )
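
The three boolean masks combine row by row. Here is a minimal sketch with invented rows, where `toy_stoplist` stands in for `stopwords.words('english')` and `keep_pos` is a shortened tag list:

```python
import pandas as pd

toy = pd.DataFrame({
    'lowercase': ['the', 'whale', 'ship', "don't", 'boston'],
    'pos': ['DT', 'NN', 'NN', 'VB', 'NNP'],
    'count': [9, 3, 2, 1, 1],
})
toy_stoplist = ['the']          # stand-in for the NLTK English stopword list
keep_pos = ['NN', 'NNS', 'VB']  # shortened tag list for the example

is_stop = toy['lowercase'].isin(toy_stoplist)
is_alpha = toy['lowercase'].str.isalpha()   # False for "don't" (apostrophe)
is_good_pos = toy['pos'].isin(keep_pos)     # False for the proper noun 'boston'

# ~ negates the stopword mask; & requires all three conditions on the same row
kept = toy[~is_stop & is_alpha & is_good_pos]

print(kept)
```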

In [None]:
info.sample(5)

Unnamed: 0,book,author,title
9,mdp.39015046381565,Grisham,A time to kill
0,mdp.39015005028686,King,The stand
29,uc1.32106012198112,King,Stephen King's Danse macabre
4,mdp.39015031703609,Grisham,The rainmaker
5,mdp.39015038148048,King,Desperation


In [None]:
wide_books.to_csv('../data/contemporary_books/contemporary.csv', encoding='utf-8')
info.to_csv('../data/contemporary_books/contemporary_labels.csv', encoding='utf-8', index=False)

## Page-level

To make the dataset a little smaller, this example groups pages into blocks of 10 rather than treating each page separately.
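
The blocking arithmetic can be sketched on a few made-up page numbers: subtracting `page % 10` rounds each page down to the start of its block.

```python
import pandas as pd

pages = pd.Series([1, 9, 10, 19, 20, 137])

# Pages 0-9 fall in block 0, pages 10-19 in block 10, and so on
pageblock = pages - pages % 10

print(pageblock.tolist())
```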

In [None]:
paths = glob.glob('../data/contemporary_books/dataset_files/*bz2')
fr = FeatureReader(paths)

book_dataframes = []
book_information = []

for vol in fr.volumes():
    df = prepare_dataframe(vol, pages=True, pos=True)
    book_dataframes.append(df)
    author = vol.author[0].split(',')[0]
    title = vol.title.split(' / ')[0]
    book_information.append((vol.id, author, title))

In [None]:
books = pd.concat(book_dataframes)
# Round each page number down to the nearest multiple of 10 (pages 10-19 -> block 10)
books['pageblock'] = books['page'] - books['page'] % 10
books['id'] = books['book'] + '-' + books['pageblock'].astype(str)

# Filter by POS: keep only nouns (non-proper)
include_pos = ['NN', 'NNS']
good_pos = books['pos'].isin(include_pos)
stopword = books['lowercase'].isin(stopwords.words('english'))
alpha = books['lowercase'].str.isalpha()

In [None]:
books_filtered = (books[~stopword & alpha & good_pos]
                    .groupby('lowercase')
                    .filter(lambda x: x['count'].sum() > 5)
                 )

In [None]:
info = pd.DataFrame(book_information, columns=['book', 'author', 'title'])
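# One label row per (book, pageblock) pair that survived filtering, joined on the book id
info_with_pages = pd.merge(info, books_filtered[['book', 'pageblock']].drop_duplicates(), on='book')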

In [None]:
info_with_pages.head()

Unnamed: 0,book,author,title,pageblock
0,mdp.39015005028686,King,The stand,0
1,mdp.39015005028686,King,The stand,10
2,mdp.39015005028686,King,The stand,20
3,mdp.39015005028686,King,The stand,30
4,mdp.39015005028686,King,The stand,40
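
A small sketch of the same merge on invented data shows how one row of book information fans out to one row per page block:

```python
import pandas as pd

info_toy = pd.DataFrame({'book': ['toy.a', 'toy.b'], 'author': ['King', 'Grisham']})
tokens_toy = pd.DataFrame({
    'book': ['toy.a', 'toy.a', 'toy.a', 'toy.b'],
    'pageblock': [0, 0, 10, 0],
    'lowercase': ['car', 'road', 'car', 'law'],
})

# Collapse the token rows to the unique (book, pageblock) pairs
blocks = tokens_toy[['book', 'pageblock']].drop_duplicates()

# Each book's metadata is repeated once for every block it has
labels = pd.merge(info_toy, blocks, on='book')

print(labels)
```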


In [None]:
page_order = (info_with_pages['book'] + '-' + info_with_pages['pageblock'].astype(str))
wide_books = (books_filtered.groupby(['id', 'lowercase'], as_index=False)[['count']].sum()
                            .pivot(index='id', columns='lowercase', values='count')
                            .fillna(0)
                            .loc[page_order]
             )

In [None]:
# Note the compression
wide_books.to_csv('../data/contemporary_books/contemporary-pages.csv.gz', encoding='utf-8', compression='gzip')
info_with_pages.to_csv('../data/contemporary_books/contemporary-pages_labels.csv', encoding='utf-8', index=False)
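
To read the compressed file back later, `pd.read_csv` can infer gzip compression from the `.gz` extension. The round trip below is a sketch on an invented frame written to a temporary directory, not the real dataset:

```python
import os
import tempfile
import pandas as pd

wide_toy = pd.DataFrame([[1.0, 0.0], [0.0, 2.0]],
                        index=pd.Index(['toy.a-0', 'toy.a-10'], name='id'),
                        columns=['car', 'road'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy-pages.csv.gz')
    wide_toy.to_csv(path, encoding='utf-8', compression='gzip')
    # index_col=0 restores the 'id' column as the row index
    restored = pd.read_csv(path, index_col=0)

print(restored)
```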