# TF-IDF With HathiTrust Volumes

## Installation

In [None]:
!pip install htrc-feature-reader

## Download Extracted Features From The Command Line

The example volume is *Lost in the City* by Edward P. Jones:

https://babel.hathitrust.org/cgi/pt?id=mdp.39015029970129

Download the extracted features file into the current directory:

In [None]:
!htid2rsync mdp.39015029970129 | rsync --files-from - data.analytics.hathitrust.org::features-2020.03/ .

Download the extracted features file into a new directory called `hathi-files`:

In [1]:
!htid2rsync mdp.39015029970129 | rsync --files-from - data.analytics.hathitrust.org::features-2020.03/ hathi-files/

[queenpalm] Welcome to the HathiTrust Research Center rsync server.



## Download Extracted Features With Python

In [2]:
from htrc_features import utils

In [3]:
utils.id_to_rsync('mdp.39015029970129')

'mdp/31272/mdp.39015029970129.json.bz2'

## Download Extracted Features From The Web

In [311]:
from htrc_features import Volume

In [312]:
volume = Volume("mdp.39015029970129")

In [313]:
volume

### Read In Extracted Features From File

In [315]:
from htrc_features import Volume

In [316]:
volume = Volume("hathi-files/mdp/31272/mdp.39015029970129.json.bz2")

In [317]:
volume

## Download Volume IDs From HathiTrust Collection

https://solr2.htrc.illinois.edu/solr-ef/?solr-key-q=17083A600227127535CCB993F515A8220&start=31&group-by-vol=0

In [1]:
!pip install htrc


James Baldwin HathiTrust collection that I created: https://babel.hathitrust.org/cgi/mb?a=listis;c=2098723708

In [318]:
from htrc import workset
volume_ids = workset.load_hathitrust_collection('https://babel.hathitrust.org/cgi/mb?a=listis;c=2098723708')

In [319]:
volume_ids

['uva.x000691804',
 'mdp.39015041612683',
 'mdp.39015084118606',
 'mdp.39015031598124',
 'uva.x002228607',
 'mdp.39015063278306',
 'mdp.39015054289775',
 'mdp.39015046412022',
 'mdp.39015027359721',
 'mdp.39015019555161']

In [320]:
for hathi_id in volume_ids:
    vol = Volume(hathi_id)
    print(hathi_id, vol.title)

uva.x000691804 Another country
mdp.39015041612683 Collected essays /
mdp.39015084118606 The devil finds work : an essay /
mdp.39015031598124 The fire next time /
uva.x002228607 Giovanni's room /
mdp.39015063278306 Go tell it on the mountain /
mdp.39015054289775 If Beale Street could talk /
mdp.39015046412022 Just above my head /
mdp.39015027359721 Little man, little man : a story of childhood /
mdp.39015019555161 Notes of a native son /


## Download Volume IDs From HathiTrust Workset

Create a workset in HathiTrust, download it as a CSV file, and then read the volume IDs from its `id` column.

In [349]:
import pandas as pd

volume_ids = pd.read_csv('James-Baldwin-Workset.csv')['id']

In [350]:
for hathi_id in volume_ids:
    vol = Volume(hathi_id)
    print(hathi_id, vol.title)

mdp.39015041612683 Collected essays /
pst.000043352395 Evidence of things not seen /
mdp.39015011372029 No name in the street.
uc1.32106002160320 Giovanni's room; a novel.
mdp.39015001385916 The devil finds work : an essay /
uc1.32106005373219 Go tell it on the mountain.
mdp.39015013970754 The amen corner; a play. -
mdp.39015027359721 Little man, little man : a story of childhood /
uc1.32106007393728 Another country.
mdp.39076006860543 Just above my head /
mdp.39015000361173 Notes of a native son.
pst.000027232392 Going to meet the man.
uc1.32106018392966 The fire next time /
mdp.39015015148243 Blues for Mister Charlie.
uc1.32106013535023 Tell me how long the train's been gone; a novel.


## Make Word Count DataFrame From All Volumes

In [351]:
import pandas as pd

In [352]:
all_tokens = []

for hathi_id in volume_ids:
    #Read in volume
    volume = Volume(hathi_id)
    
    #Make dataframe from token list -- do not include part of speech, sections, or case sensitivity
    token_df = volume.tokenlist(case=False, pos=False, drop_section=True)
    
    #Add book column
    token_df['book'] = volume.title
    
    #Add publication year column
    token_df['year'] = volume.year
    
    all_tokens.append(token_df)

In [353]:
baldwin_df = pd.concat(all_tokens)

Change from multi-level index to regular index with `reset_index()`

In [354]:
baldwin_df_flattened = baldwin_df.reset_index()

In [355]:
baldwin_df_flattened 

Unnamed: 0,page,lowercase,count,book,year
0,7,baldwin,1,Collected essays /,1998
1,7,james,1,Collected essays /,1998
2,9,america,1,Collected essays /,1998
3,9,devil,1,Collected essays /,1998
4,9,essays,1,Collected essays /,1998
...,...,...,...,...,...
730950,513,fo,1,Tell me how long the train's been gone; a novel.,1968
730951,513,ifiid,1,Tell me how long the train's been gone; a novel.,1968
730952,513,of,1,Tell me how long the train's been gone; a novel.,1968
730953,513,university,1,Tell me how long the train's been gone; a novel.,1968


Summarize token counts for each book

In [356]:
baldwin_df_flattened.groupby(['book', 'year', 'lowercase'])[['count']].sum().reset_index()

Unnamed: 0,book,year,lowercase,count
0,Another country.,1962,!,217
1,Another country.,1962,"""-""--•.",1
2,Another country.,1962,""".-",1
3,Another country.,1962,',128
4,Another country.,1962,'',4419
...,...,...,...,...
97029,The fire next time /,1995,•,2
97030,The fire next time /,1995,•>_.,1
97031,The fire next time /,1995,•■>,1
97032,The fire next time /,1995,■,2


In [357]:
word_frequency_df = baldwin_df_flattened.groupby(['book', 'year', 'lowercase'])[['count']].sum().reset_index()

## Remove Infrequent Words, Stopwords, & Punctuation

Remove words that appear 5 or fewer times in a book

In [358]:
word_frequency_df = word_frequency_df[word_frequency_df['count'] > 5]

Remove stopwords

In [359]:
STOPS = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
         'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
         'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
         'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
         'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
         'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
         'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
         'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
         'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
         'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
         'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
         'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp', "!"]

In [360]:
word_frequency_df = word_frequency_df.drop(word_frequency_df[word_frequency_df['lowercase'].isin(STOPS)].index)

Remove punctuation

In [361]:
word_frequency_df = word_frequency_df.drop(word_frequency_df[word_frequency_df['lowercase'].str.contains(r'[^A-Za-z\s]', regex=True)].index)
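
To see which tokens the pattern above discards, here is a small standard-library sketch. The sample tokens are invented for illustration. Because the character class accepts only ASCII letters and whitespace, tokens that contain digits, apostrophes, or accented letters are dropped along with pure punctuation.

```python
import re

# Same character class as the filter above: any character that is not
# an ASCII letter or whitespace marks the token for removal.
NON_LETTER = re.compile(r'[^A-Za-z\s]')

# Hypothetical sample tokens, not taken from the Baldwin volumes
sample_tokens = ['able', "don't", '1963', 'café', 'harlem', '•>_.']

kept = [token for token in sample_tokens if not NON_LETTER.search(token)]
dropped = [token for token in sample_tokens if NON_LETTER.search(token)]

print(kept)
print(dropped)
```

If accented words matter for a given corpus, the character class would need to be widened; that choice is left open here.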

In [362]:
word_frequency_df

Unnamed: 0,book,year,lowercase,count
97,Another country.,1962,able,38
105,Another country.,1962,abruptly,23
110,Another country.,1962,absolutely,16
120,Another country.,1962,accent,6
122,Another country.,1962,accept,17
...,...,...,...,...
96995,The fire next time /,1995,x,8
97000,The fire next time /,1995,years,30
97003,The fire next time /,1995,yes,12
97005,The fire next time /,1995,yet,25


## TF-IDF

In [363]:
word_frequency_df = word_frequency_df.rename(columns={'lowercase': 'term','count': 'term_frequency'})

In [364]:
word_frequency_df

Unnamed: 0,book,year,term,term_frequency
97,Another country.,1962,able,38
105,Another country.,1962,abruptly,23
110,Another country.,1962,absolutely,16
120,Another country.,1962,accent,6
122,Another country.,1962,accept,17
...,...,...,...,...
96995,The fire next time /,1995,x,8
97000,The fire next time /,1995,years,30
97003,The fire next time /,1995,yes,12
97005,The fire next time /,1995,yet,25


Calculate document frequency (in how many books in the collection does this term appear?)

In [365]:
document_frequency_df = (word_frequency_df.groupby(['book','term']).size().unstack(fill_value=0) > 0).sum().reset_index()

In [366]:
document_frequency_df = document_frequency_df.rename(columns={0:'document_frequency'})
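
As a cross-check on the idea of document frequency, the following plain-Python sketch counts the number of distinct books in which each term occurs. The rows are hypothetical stand-ins for `word_frequency_df`, and the variable names are only illustrative.

```python
from collections import defaultdict

# Hypothetical (book, term, term_frequency) rows standing in for word_frequency_df
rows = [
    ('Book A', 'white', 12),
    ('Book A', 'elijah', 7),
    ('Book B', 'white', 30),
    ('Book C', 'white', 9),
    ('Book C', 'elijah', 40),
]

# Collect the set of books in which each term occurs
books_by_term = defaultdict(set)
for book, term, _ in rows:
    books_by_term[term].add(book)

# Document frequency = number of distinct books containing the term
document_frequency = {term: len(books) for term, books in books_by_term.items()}
print(document_frequency)
```

Note that the counts ignore how often a term appears within a book; only presence or absence matters.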

In [367]:
word_frequency_df = word_frequency_df.merge(document_frequency_df, on='term')

In [368]:
word_frequency_df

Unnamed: 0,book,year,term,term_frequency,document_frequency
0,Another country.,1962,able,38,14
1,Blues for Mister Charlie.,1976,able,10,14
2,Collected essays /,1998,able,204,14
3,Evidence of things not seen /,1985,able,14,14
4,Giovanni's room; a novel.,1956,able,28,14
...,...,...,...,...,...
17384,The amen corner; a play. -,1968,maggie,42,1
17385,The amen corner; a play. -,1968,margaret,359,1
17386,The amen corner; a play. -,1968,moore,125,1
17387,The amen corner; a play. -,1968,phillips,8,1


Calculate total number of documents 

In [369]:
total_number_of_documents = baldwin_df_flattened['book'].nunique()

Calculate a TF-IDF score for each term in each book

In [370]:
from math import log

In [371]:
word_frequency_df['tfidf'] = word_frequency_df['term_frequency'] * (log(1 + total_number_of_documents)) /  (1+ word_frequency_df['document_frequency']) + 1
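
Because of operator precedence, the cell above evaluates to tf × ln(1 + N) / (1 + df) + 1. The smoothed IDF described in the scikit-learn documentation instead places the ratio inside the logarithm: ln((1 + N) / (1 + df)) + 1. The sketch below compares the two for a single term. The function names are hypothetical, and N = 15 is an assumption based on the 15 volumes listed in the workset.

```python
from math import log

def tfidf_as_in_cell(term_frequency, document_frequency, n_documents):
    # Mirrors the operator precedence of the cell above:
    # tf * ln(1 + N) / (1 + df) + 1
    return term_frequency * log(1 + n_documents) / (1 + document_frequency) + 1

def tfidf_smoothed(term_frequency, document_frequency, n_documents):
    # Smoothed variant with the ratio inside the logarithm:
    # tf * (ln((1 + N) / (1 + df)) + 1)
    return term_frequency * (log((1 + n_documents) / (1 + document_frequency)) + 1)

# "arthur" in "Just above my head": tf = 1639, df = 2; N = 15 is assumed
print(tfidf_as_in_cell(1639, 2, 15))
print(tfidf_smoothed(1639, 2, 15))
```

The two versions scale very differently: the first shrinks a score roughly in proportion to 1 / (1 + df), while the second discounts common terms only logarithmically, so rankings can change depending on which one is used.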

In [374]:
word_frequency_df['tfidf_normalized']=(word_frequency_df['tfidf'] - word_frequency_df['tfidf'].mean())/word_frequency_df['tfidf'].std()

In [455]:
from sklearn.preprocessing import normalize

word_frequency_df['tfidf_sklearnnormalized'] = normalize(word_frequency_df[['tfidf']], axis=0, norm='l2')
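
The two normalizations above do different things: the first is a z-score (subtract the column mean, divide by the sample standard deviation), and the second divides every value by the Euclidean norm of the whole column. A minimal standard-library sketch, with hypothetical function names and made-up scores, shows the arithmetic.

```python
from math import sqrt
from statistics import mean, stdev

def z_score(values):
    # Subtract the mean and divide by the sample standard deviation
    # (statistics.stdev divides by n - 1, like the pandas default)
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

def l2_scale(values):
    # Divide every value by the Euclidean norm of the whole column
    norm = sqrt(sum(v * v for v in values))
    return [v / norm for v in values]

scores = [3.0, 4.0]  # hypothetical TF-IDF scores
print(z_score(scores))
print(l2_scale(scores))
```

Neither transformation changes the ordering of the scores, so the top terms per book stay the same whichever column is used for sorting.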

Find top 15 words with highest tfidf scores for each book

In [456]:
word_frequency_df.sort_values(by=['book','tfidf_sklearnnormalized'], ascending=False).groupby(['book']).head(15)

Unnamed: 0,book,year,term,term_frequency,document_frequency,tfidf,tfidf_normalized,tfidf_sklearnnormalized
13184,The fire next time /,1995,elijah,39,2,37.043653,1.016645,0.010166
10964,The fire next time /,1995,white,167,14,31.868154,0.810701,0.008745
6884,The fire next time /,1995,one,157,15,28.206027,0.664978,0.00774
6604,The fire next time /,1995,negro,67,8,21.640383,0.403717,0.005939
7173,The fire next time /,1995,people,105,15,19.195113,0.306415,0.005268
11596,The fire next time /,1995,negroes,38,5,18.559729,0.281131,0.005093
11474,The fire next time /,1995,christian,21,3,15.556091,0.16161,0.004269
5999,The fire next time /,1995,man,74,15,13.823223,0.092656,0.003793
11295,The fire next time /,1995,would,68,14,13.569069,0.082543,0.003724
1060,The fire next time /,1995,black,65,14,13.014551,0.060477,0.003571


In [373]:
word_frequency_df.sort_values(by='tfidf', ascending=False)[:100]

Unnamed: 0,book,year,term,term_frequency,document_frequency,tfidf
16880,Just above my head /,1979,arthur,1639,2,1515.757639
10573,Another country.,1962,vivaldo,858,1,1190.440562
16963,Just above my head /,1979,crunch,632,1,877.138036
2978,Another country.,1962,eric,684,2,633.150229
8090,Another country.,1962,rufus,455,1,631.763934
17270,Tell me how long the train's been gone; a novel.,1968,caleb,443,1,615.128402
16822,Just above my head /,1979,julia,659,2,610.045323
1585,Another country.,1962,cass,426,1,591.561398
17385,The amen corner; a play. -,1968,margaret,359,1,498.679676
12138,Tell me how long the train's been gone; a novel.,1968,barbara,530,2,490.824008


In [157]:
word_frequency_df.sort_values(by='tfidf', ascending=False).groupby('book').head(10)

In [166]:
pd.set_option("display.max_rows", 500)

In [172]:
word_frequency_df.groupby('book')['tfidf'].nlargest(10).reset_index()

In [175]:
word_frequency_df.groupby('book')['tfidf'].nlargest(10)

In [178]:
def pretty_plot_top_n(series, top_n=5, index_level=0):
    
    r = series.groupby(level=index_level).nlargest(top_n).reset_index(level=index_level, drop=True)
    #r.plot.bar()
    return r.to_frame()


pretty_plot_top_n(word_frequency_df['tfidf'])

Unnamed: 0,tfidf
0,10.466183
1,53.191929
2,7.651912
3,7.396070
4,6.372699
...,...
13129,268.099871
13130,21.723266
13131,14.815511
13132,44.749117


In [198]:
word_frequency_df[word_frequency_df.groupby('book')['tfidf'].rank(method='first', ascending=False) <= 5]

In [208]:
word_frequency_df[['book', 'tfidf', 'lowercase']].sort_values(by='tfidf')

In [168]:
word_frequency_df.sort_values(by=['book', 'tfidf'], ascending=False)[:100]

Unnamed: 0,book,year,lowercase,count,document_frequency,tfidf
9524,The fire next time /,1963,elijah,40,2,47.051702
7919,The fire next time /,1963,white,174,9,45.516645
13132,The fire next time /,1963,e,19,1,44.749117
4758,The fire next time /,1963,negro,70,4,41.295239
4948,The fire next time /,1963,one,157,10,37.150586
10803,The fire next time /,1963,negroes,38,3,30.166078
5159,The fire next time /,1963,people,110,10,26.328436
8858,The fire next time /,1963,christian,22,2,26.328436
1374,The fire next time /,1963,colour,20,2,24.025851
13133,The fire next time /,1963,o,9,1,21.723266


In [162]:
word_frequency_df.groupby('book')['tfidf'].nlargest()


In [113]:
(word_frequency_df.groupby(['book','lowercase']).size().unstack(fill_value=0) > 0).groupby(level='book')

In [114]:
word_frequency_df

Unnamed: 0,book,year,lowercase,count
142,Another country,1963,able,37
151,Another country,1963,abruptly,22
157,Another country,1963,absolutely,15
167,Another country,1963,accent,6
169,Another country,1963,accept,17
...,...,...,...,...
69431,The fire next time /,1963,would,71
69446,The fire next time /,1963,years,29
69448,The fire next time /,1963,yes,12
69450,The fire next time /,1963,yet,25


### Get Top n Words For Each Book

In [None]:
def calculate_document_frequency(current_token):
    # Number of books in which current_token appears at least once
    matches = baldwin_df_flattened[baldwin_df_flattened['token'] == current_token]
    return matches['book'].nunique()

In [209]:
for book, token in baldwin_df_flattened.groupby('book')['token']:
    #if token == 'black':
    print(book, token.values)

Another country [',0' 'Bal' 'James' ... 'b"U' 'r' "■To'1W"]
Collected essays / ['BALDWIN' 'JAMES' 'AMERICA' ... 'SEP' 'i»a*"6' '£009']
Giovanni's room / ['.' '1988' '3552' ... 'PENN' 'STATE' 'UNIVERSITY']
Go tell it on the mountain / ['.' '>*H' 'Mfr' ... 'DEMCO' 'S' '.<***.']
If Beale Street could talk / ['ALE' 'BALDWIN' 'BE' ... 'vividly' 'work' 'yet']
Just above my head / ['*....' '.' 'J' ... 'هي' 'يم' 'يهم']
Little man, little man : a story of childhood / ['!' '*' '-' ... 'a—x' 'ft' 'r']
Notes of a native son / ['A' 'NATIVE' 'NOTES' ... '00' 'I' 'SEP']
The devil finds work : an essay / ['#' '&' "'" ... 'will' 'with' 'writer']
The fire next time / ['"' '",' '".' ... 'The' 'f' 't']


In [235]:
token = 'black'  # example search string; any plain string works here
[x for x in baldwin_df_flattened['token'].values if token in x]

In [None]:
[len([x for x in df.tokens.values if token in x]) for token in tf.columns]

In [242]:
baldwin_df_flattened.groupby(['book','token']).size().unstack(fill_value=0)

token,!,!!,!!!,"!""--","""","""""7§::v",""""":","""'3'",""",","""-",...,ﬂying,ﬂypaper,！,！！,！！！,（,（~~,）,，,：
book,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Another country,137,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Collected essays /,161,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Giovanni's room /,78,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Go tell it on the mountain /,93,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
If Beale Street could talk /,50,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Just above my head /,756,1,2,1,2,1,1,1,0,2,...,6,1,5,1,1,1,1,4,4,4
"Little man, little man : a story of childhood /",11,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Notes of a native son /,20,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
The devil finds work : an essay /,33,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
The fire next time /,28,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [238]:
(baldwin_df_flattened.groupby(['book','token']).size().unstack(fill_value=0) > 0).sum()

token
!       10
!!       1
!!!      1
!"--     1
"        4
        ..
（        1
（~~      1
）        1
，        1
：        1
Length: 29873, dtype: int64

In [239]:
baldwin_df_flattened.groupby(['book', 'token']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,page,section,pos,count,year,doc_frequency
book,token,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Another country,!,137,137,137,137,137,137
Another country,&,2,2,2,2,2,2
Another country,',435,435,435,435,435,435
Another country,'',26,26,26,26,26,26
Another country,'Cause,1,1,1,1,1,1
...,...,...,...,...,...,...,...
The fire next time /,…--,1,1,1,1,1,1
The fire next time /,■,1,1,1,1,1,1
The fire next time /,"■""'",1,1,1,1,1,1
The fire next time /,■j,1,1,1,1,1,1


In [221]:
baldwin_df_flattened['doc_frequency'] = (baldwin_df_flattened.groupby(['book', 'token'])['token'].count() > 0).sum()
baldwin_df_flattened

Unnamed: 0,page,section,token,pos,count,book,year,doc_frequency
0,1,body,",0",CD,1,Another country,1963,77227
1,1,body,Bal,NN,1,Another country,1963,77227
2,1,body,James,NNP,1,Another country,1963,77227
3,1,body,Library,NNP,1,Another country,1963,77227
4,1,body,Virginia,NNP,1,Another country,1963,77227
...,...,...,...,...,...,...,...,...
885965,222,body,THE,UNK,1,Notes of a native son /,1990,77227
885966,222,body,UNIVERSITY,UNK,1,Notes of a native son /,1990,77227
885967,223,body,00,UNK,1,Notes of a native son /,1990,77227
885968,223,body,I,UNK,1,Notes of a native son /,1990,77227


In [198]:
(baldwin_df_flattened.groupby(['book', 'token'])['count'].sum() > 0).sum()

77227

In [157]:
total_number_of_documents

10

In [148]:
baldwin_df_flattened['book'].nunique()

10

In [154]:
len(baldwin_df_flattened.groupby(['book', 'token'])[['book', 'token']].count() > 0)

77227

In [11]:
from math import log

In [13]:
total_number_of_documents

10

In [None]:
baldwin_df_flattened.groupby(['book', 'token'])[['count']].sum()

In [15]:
term_frequency 

Unnamed: 0_level_0,Unnamed: 1_level_0,count
book,token,Unnamed: 2_level_1
Another country,!,218
Another country,&,2
Another country,',4527
Another country,'',32
Another country,'Cause,1
...,...,...
The fire next time /,…--,1
The fire next time /,■,1
The fire next time /,"■""'",1
The fire next time /,■j,1


In [14]:
term_frequency = baldwin_df_flattened.groupby(['book', 'token'])[['count']].sum()

In [16]:
number_of_documents_with_term = (baldwin_df_flattened.groupby(['book','token']).size().unstack(fill_value=0) > 0).sum()

In [17]:
number_of_documents_with_term

token
!       10
!!       1
!!!      1
!"--     1
"        4
        ..
（        1
（~~      1
）        1
，        1
：        1
Length: 29873, dtype: int64

In [19]:
inverse_document_frequency

token
!       1.230259
!!      3.302585
!!!     3.302585
!"--    3.302585
"       1.575646
          ...   
（       3.302585
（~~     3.302585
）       3.302585
，       3.302585
：       3.302585
Length: 29873, dtype: float64

In [18]:
inverse_document_frequency = (log(total_number_of_documents) / number_of_documents_with_term) + 1

In [None]:
term_frequency * inverse_document_frequency 

In [142]:
baldwin_df_flattened.groupby(['token'])[['count']].sum()

Unnamed: 0_level_0,count
token,Unnamed: 1_level_1
!,2436
!!,1
!!!,2
"!""--",1
"""",11
...,...
（,1
（~~,1
）,6
，,15


In [143]:
baldwin_df_flattened.groupby(['book', 'token'])[['count']].sum() / baldwin_df_flattened.groupby(['token'])[['count']].sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,count
book,token,Unnamed: 2_level_1
Another country,!,0.089491
Another country,&,0.044444
Another country,',0.571158
Another country,'',0.001802
Another country,'Cause,0.250000
...,...,...
The fire next time /,…--,1.000000
The fire next time /,■,0.041667
The fire next time /,"■""'",1.000000
The fire next time /,■j,1.000000


In [None]:
term_frequency = 30  # number of times "pigeons" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = log(total_number_of_documents / number_of_documents_with_term) + 1

In [139]:
baldwin_df_flattened.groupby(['book', 'token'])[['count']].sum().to_numpy().shape

In [126]:
baldwin_df_flattened.groupby(['book', 'token'])[['count']].sum().groupby('book')['count'].nlargest(100).droplevel(0).reset_index()

Unnamed: 0,book,token,count
0,Another country,",",13376
1,Another country,.,10789
2,Another country,the,5745
3,Another country,and,4849
4,Another country,',4527
...,...,...,...
995,The fire next time /,knew,84
996,The fire next time /,how,83
997,The fire next time /,other,83
998,The fire next time /,Elijah,80


In [56]:
baldwin_df.groupby(['token', 'book'])[['count']].sum().sort_values(by='count', ascending=False)[:100]

In [53]:
baldwin_df.groupby(level='token')[['count']].sum().sort_values(by='count', ascending=False)[:100]

Unnamed: 0_level_0,count
token,Unnamed: 1_level_1
",",147523
.,83846
the,66948
and,49906
to,43659
I,37319
of,30582
a,27787
in,22632
that,21247


In [38]:
baldwin_df.sort_values(by='count', ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,count,book,year
page,section,token,pos,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
120,body,-,UNK,1035,The fire next time /,1963
1,body,-,UNK,835,The fire next time /,1963
1,body,º,UNK,515,The fire next time /,1963
120,body,.,UNK,298,The fire next time /,1963
414,body,",",",",85,Just above my head /,1979
...,...,...,...,...,...,...
63,body,for,IN,1,Giovanni's room /,1988
63,body,forever,RB,1,Giovanni's room /,1988
63,body,friendship,NN,1,Giovanni's room /,1988
63,body,gave,VBD,1,Giovanni's room /,1988


In [80]:
baldwin_df_flattened

Unnamed: 0,page,section,token,pos,count,book,year
0,1,body,",0",CD,1,Another country,1963
1,1,body,Bal,NN,1,Another country,1963
2,1,body,James,NNP,1,Another country,1963
3,1,body,Library,NNP,1,Another country,1963
4,1,body,Virginia,NNP,1,Another country,1963
...,...,...,...,...,...,...,...
885965,222,body,THE,UNK,1,Notes of a native son /,1990
885966,222,body,UNIVERSITY,UNK,1,Notes of a native son /,1990
885967,223,body,00,UNK,1,Notes of a native son /,1990
885968,223,body,I,UNK,1,Notes of a native son /,1990


In [109]:
baldwin_df_flattened.book.value_counts()

Just above my head /                               358778
Collected essays /                                 191672
Another country                                     86902
Giovanni's room /                                   70659
Go tell it on the mountain /                        49141
Notes of a native son /                             35212
If Beale Street could talk /                        32503
The fire next time /                                32000
The devil finds work : an essay /                   23595
Little man, little man : a story of childhood /      5508
Name: book, dtype: int64

In [101]:
baldwin_df_flattened.groupby(['book', 'token'])[['count']].sum().nlargest(10, columns='count')

Unnamed: 0_level_0,Unnamed: 1_level_0,count
book,token,Unnamed: 2_level_1
Just above my head /,",",68594
Just above my head /,.,38798
Collected essays /,",",29725
Just above my head /,the,24500
Just above my head /,and,21454
Just above my head /,I,18758
Collected essays /,the,18155
Just above my head /,to,17375
Another country,",",13376
Collected essays /,.,12924


In [None]:
# https://babel.hathitrust.org/cgi/mb?a=listis;c=447365647

## Get Metadata

In [5]:
volume.title

'Lost in the city : stories /'

In [6]:
volume.year

1993

In [7]:
volume.page_count

260

In [8]:
volume.publisher

'HarperPerennial'

In [9]:
volume.handle_url

'http://hdl.handle.net/2027/mdp.39015029970129'

In [16]:
volume.parser.meta

{'id': 'mdp.39015029970129',
 'metadata_schema_version': 'https://schemas.hathitrust.org/EF_Schema_MetadataSubSchema_v_3.0',
 'enumeration_chronology': None,
 'type_of_resource': 'http://id.loc.gov/ontologies/bibframe/Text',
 'title': 'Lost in the city : stories /',
 'date_created': 20200209,
 'pub_date': 1993,
 'language': 'eng',
 'access_profile': 'google',
 'isbn': '0060975571',
 'issn': None,
 'lccn': '92054781',
 'oclc': '27685090',
 'page_count': 260,
 'feature_schema_version': 'https://schemas.hathitrust.org/EF_Schema_FeaturesSubSchema_v_3.0',
 'access_rights': 'ic',
 'alternate_title': None,
 'category': 'American literature',
 'genre_ld': ['http://id.loc.gov/vocabulary/marcgt/doc',
  'http://id.loc.gov/vocabulary/marcgt/fic'],
 'genre': ['document (computer)', 'fiction'],
 'contributor_ld': {'id': 'http://www.viaf.org/viaf/114552168',
  'type': 'http://id.loc.gov/ontologies/bibframe/Person',
  'name': 'Jones, Edward P.'},
 'contributor': 'Jones, Edward P.',
 'handle_url': 'htt

List every available metadata field:

In [18]:
volume.parser.meta.keys()

dict_keys(['id', 'metadata_schema_version', 'enumeration_chronology', 'type_of_resource', 'title', 'date_created', 'pub_date', 'language', 'access_profile', 'isbn', 'issn', 'lccn', 'oclc', 'page_count', 'feature_schema_version', 'access_rights', 'alternate_title', 'category', 'genre_ld', 'genre', 'contributor_ld', 'contributor', 'handle_url', 'source_institution_ld', 'source_institution', 'lcc', 'type', 'is_part_of', 'last_rights_update_date', 'pub_place_ld', 'pub_place', 'main_entity_of_page', 'publisher_ld', 'publisher'])

## Work With Tokens

In [10]:
volume.tokenlist()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,count
page,section,token,pos,Unnamed: 4_level_1
1,body,",",",",1
1,body,.046,CD,1
1,body,1993,CD,1
1,body,3560,CD,1
1,body,AWARD,NN,1
...,...,...,...,...
260,body,world,NN,2
260,body,would,MD,1
260,body,writers,NNS,1
260,body,written,VBN,1


In [14]:
volume.tokenlist(page=False, section=None, pos=False)

ERROR:root:Invalid section argument: None


In [62]:
import pandas as pd

In [52]:
pd.set_option("display.max_rows", 600)

In [10]:
lost_df = volume.tokenlist()

In [98]:
lost_df.columns

Index(['count'], dtype='object')

In [99]:
lost_df

Unnamed: 0_level_0,Unnamed: 1_level_0,count
section,lowercase,Unnamed: 2_level_1
body,!,116
body,"""",1
body,#22>,1
body,$,4
body,%,2
body,...,...
body,•*♦:,1
body,€,1
body,™,5
body,■,18


In [126]:
lost_df.query('page == 41')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,count
page,section,token,pos,Unnamed: 4_level_1
41,body,!,.,1
41,body,&,CC,1
41,body,'d,MD,2
41,body,'ll,MD,1
41,body,'s,POS,1
41,body,",",",",11
41,body,-LSB-,-LRB-,1
41,body,.,.,4
41,body,13th,JJ,1
41,body,At,IN,1


In [90]:
lost_df.query('token == "pigeon"')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,count
page,section,token,pos,Unnamed: 4_level_1
11,body,pigeon,NN,1
13,body,pigeon,NN,1
15,body,pigeon,NN,1
20,body,pigeon,NN,3
23,body,pigeon,NN,2
24,body,pigeon,NN,1
28,body,pigeon,NN,2
31,body,pigeon,NN,1
32,body,pigeon,NN,1
34,body,pigeon,NN,1


In [64]:
volume.tokenlist()[:100]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,count
page,section,token,pos,Unnamed: 4_level_1
1,body,",",",",1
1,body,.046,CD,1
1,body,1993,CD,1
1,body,3560,CD,1
1,body,AWARD,NN,1
1,body,BY,IN,1
1,body,Edward,NNP,1
1,body,Exhale,VB,1
1,body,His,PRP$,1
1,body,Jones,NNP,1


In [11]:
lost_df_flattened = lost_df.reset_index()

In [None]:
lost_df_flattened[lost_df_flattened['token'] == "LIONS"]

In [222]:
lost_df_flattened[lost_df_flattened['token'] == "TRAIN"]

Unnamed: 0,page,section,token,pos,count,chapter
22506,113,body,TRAIN,NN,1,


In [232]:
lost_df_flattened[lost_df_flattened['token'] == "DAY"]

Unnamed: 0,page,section,token,pos,count,chapter
5796,35,body,DAY,NNP,1,Ch2: The First Day


In [227]:
lost_df_flattened[lost_df_flattened['page'] == 117]

Unnamed: 0,page,section,token,pos,count,chapter
23058,117,body,'','',7,
23059,117,body,'re,VBP,1,
23060,117,body,'s,POS,4,
23061,117,body,'s,VBZ,1,
23062,117,body,",",",",21,
23063,117,body,.,.,25,
23064,117,body,12th,JJ,4,
23065,117,body,13th,JJ,1,
23066,117,body,13th,NNP,1,
23067,117,body,14th,JJ,1,


In [132]:
lost_df_flattened.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51297 entries, 0 to 51296
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   page     51297 non-null  uint64
 1   section  51297 non-null  object
 2   token    51297 non-null  object
 3   pos      51297 non-null  object
 4   count    51297 non-null  uint32
dtypes: object(3), uint32(1), uint64(1)
memory usage: 1.8+ MB


In [12]:
def add_chapter_titles(page):
    if page >= 11 and page < 35:
        return "Ch1: The Girl Who Raised Pigeons"
    elif page >= 35 and page < 41:
        return "Ch2: The First Day"
    elif page >= 41 and page < 63:
        return "Ch3: The Night Rhonda Ferguson Was Killed"
    elif page >= 63 and page < 85:
        return "Ch4: Young Lions"
    elif page >= 85 and page < 113:
        return "Ch5: The Store"
    elif page >= 113 and page < 125:
        return "Ch6: An Orange Line Train to Ballston"
    elif page >= 125 and page < 149:
        return "Ch7: The Sunday Following Mother's Day"
    elif page >= 149 and page < 159:
        return "Ch8: Lost in the City"
    elif page >= 159 and page < 184:
        return "Ch9: His Mother's House"
    elif page >= 184 and page < 191:
        return "Ch10: A Butterfly on F Street"
    elif page >= 191 and page < 209:
        return "Ch11: Gospel"
    elif page >= 209 and page < 225:
        return "Ch12: A New Man"
    elif page >= 225 and page < 237:
        return "Ch13: A Dark Night"
    elif page >= 237:
        return "Ch14: Marie"

In [13]:
lost_df_flattened['chapter'] = lost_df_flattened['page'].apply(add_chapter_titles)
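
The long `if`/`elif` chain in `add_chapter_titles` can also be expressed as a lookup over the first page of each chapter. The sketch below uses the standard-library `bisect` module; the page boundaries and titles are copied from the function above, while the names `CHAPTER_STARTS`, `CHAPTER_TITLES`, and `chapter_for_page` are hypothetical.

```python
from bisect import bisect_right

# First page of each chapter, taken from add_chapter_titles above
CHAPTER_STARTS = [11, 35, 41, 63, 85, 113, 125, 149, 159, 184, 191, 209, 225, 237]
CHAPTER_TITLES = [
    "Ch1: The Girl Who Raised Pigeons",
    "Ch2: The First Day",
    "Ch3: The Night Rhonda Ferguson Was Killed",
    "Ch4: Young Lions",
    "Ch5: The Store",
    "Ch6: An Orange Line Train to Ballston",
    "Ch7: The Sunday Following Mother's Day",
    "Ch8: Lost in the City",
    "Ch9: His Mother's House",
    "Ch10: A Butterfly on F Street",
    "Ch11: Gospel",
    "Ch12: A New Man",
    "Ch13: A Dark Night",
    "Ch14: Marie",
]

def chapter_for_page(page):
    # Position of the last chapter whose first page is <= page
    position = bisect_right(CHAPTER_STARTS, page)
    if position == 0:
        return None  # front matter before the first chapter
    return CHAPTER_TITLES[position - 1]

print(chapter_for_page(35))
```

A function like this could be passed to `.apply` in the same way as `add_chapter_titles`; keeping the boundaries in one list makes them easier to adjust for another edition.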

In [9]:
lost_df_flattened['chapter'].value_counts()

Ch5: The Store                               6467
Ch9: His Mother's House                      5406
Ch1: The Girl Who Raised Pigeons             5382
Ch7: The Sunday Following Mother's Day       5228
Ch4: Young Lions                             4660
Ch3: The Night Rhonda Ferguson Was Killed    4469
Ch11: Gospel                                 3980
Ch14: Marie                                  3470
Ch12: A New Man                              3082
Ch13: A Dark Night                           2547
Ch8: Lost in the City                        2162
Ch6: An Orange Line Train to Ballston        1985
Ch2: The First Day                           1101
Ch10: A Butterfly on F Street                 956
Name: chapter, dtype: int64

In [249]:
lost_df_flattened

Unnamed: 0,page,section,token,pos,count,chapter
0,1,body,",",",",1,
1,1,body,.046,CD,1,
2,1,body,1993,CD,1,
3,1,body,3560,CD,1,
4,1,body,AWARD,NN,1,
...,...,...,...,...,...,...
51292,260,body,world,NN,2,Ch14: Marie
51293,260,body,would,MD,1,Ch14: Marie
51294,260,body,writers,NNS,1,Ch14: Marie
51295,260,body,written,VBN,1,Ch14: Marie


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
# `df['sent']` stands in for a column of raw text strings, one per document
x = v.fit_transform(df['sent'])

In [251]:
from sklearn.feature_extraction.text import TfidfVectorizer


In [253]:
from sklearn.feature_extraction.text import TfidfTransformer


In [255]:
tfidf_transformer = TfidfTransformer()

In [None]:
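from numpy import log
def tfidf(x):
    # x holds one token's counts; x.count() is the number of rows (page-level entries) for that token
    return x * log(1+volume.page_count / x.count())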

In [None]:
# Will take a few seconds to run, depending on your system
idf_scores = lost_df_flattened.groupby("token")["count"].transform(tfidf)

In [262]:
lost_df.groupby(level=["token"]).count()

Unnamed: 0_level_0,count
token,Unnamed: 1_level_1
!,55
"""",1
#22>,1
$,3
%,2
...,...
•*♦:,1
€,1
™,4
■,10


In [265]:
len(lost_df_flattened['chapter'].unique())

15

In [None]:
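# Assumed intent: the number of distinct chapters in which each token appears
document_frequency = lost_df_flattened.groupby('token')['chapter'].nunique()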

In [14]:
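from numpy import log
def calculate_tfidf(term_frequency):
    # 14 is the number of chapters; .count() is the number of rows in the group passed in
    return term_frequency * log(1+14 / term_frequency.count())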

In [None]:
# IDF for a single term, treating each chapter as a document (the names below are placeholders):
# log(total_number_of_documents / number_of_documents_with_term) + 1
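
The IDF expression in the cell above has the same form scikit-learn uses when `smooth_idf=False`. As a rough sketch of how it could be applied with each chapter treated as a document, the example below works on hypothetical toy data shaped like `lost_df_flattened`; the names `toy`, `term_freq`, `doc_freq`, `n_documents`, `idf`, and `tfidf_by_hand` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same shape as lost_df_flattened (page and pos omitted).
toy = pd.DataFrame({
    "chapter": ["Ch1", "Ch1", "Ch2", "Ch2", "Ch3"],
    "token":   ["pigeons", "the", "school", "the", "the"],
    "count":   [4, 10, 3, 8, 6],
})

# Term frequency: total count of each token within each chapter.
term_freq = toy.groupby(["chapter", "token"])["count"].sum()

# Document frequency: number of distinct chapters containing each token.
doc_freq = toy.groupby("token")["chapter"].nunique()

# Total number of documents (chapters).
n_documents = toy["chapter"].nunique()

# IDF as written in the cell above: log(N / df) + 1.
idf = np.log(n_documents / doc_freq) + 1

# Look up each row's IDF by its token and multiply by the term frequency.
row_tokens = term_freq.index.get_level_values("token")
tfidf_by_hand = term_freq * idf.loc[row_tokens].to_numpy()
print(tfidf_by_hand)
```

In this toy data "the" appears in every chapter, so its IDF is exactly 1 and its score equals its raw count, while "pigeons" appears in one chapter of three and is weighted up.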

In [None]:
chapter_term_counts = lost_df_flattened.groupby(['chapter', 'token'])['count'].sum()
chapter_term_counts.groupby(level='token').transform(calculate_tfidf)

In [None]:
lost_df_flattened.groupby(['chapter', 'token']).count()

In [16]:
from numpy import log
def tfidf(x):
    return x * log(1+volume.page_count / x.count())



# Will take a few seconds to run, depending on your system
idf_scores = lost_df.groupby(level=["token"]).transform(tfidf)
idf_scores

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,count
page,section,token,pos,Unnamed: 4_level_1
1,body,",",",",0.747214
1,body,.046,CD,5.564520
1,body,1993,CD,4.875197
1,body,3560,CD,5.564520
1,body,AWARD,NN,4.875197
...,...,...,...,...
260,body,world,NN,2.699853
260,body,would,MD,0.874572
260,body,writers,NNS,5.564520
260,body,written,VBN,3.397487


In [256]:
tfidf_transformer.fit_transform(lost_df_flattened[['token', 'count']])

ValueError: could not convert string to float: ','
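
The `ValueError` above arises because `TfidfTransformer` expects a numeric count matrix (documents as rows, terms as columns), not a column of token strings. One possible way around it, sketched here on hypothetical toy data rather than the real volume, is to pivot the counts into a chapter-by-token matrix first. The names `toy`, `count_matrix`, `tfidf_matrix`, and `tfidf_df` are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical toy data shaped like lost_df_flattened (page and pos omitted).
toy = pd.DataFrame({
    "chapter": ["Ch1", "Ch1", "Ch2", "Ch2", "Ch3"],
    "token":   ["pigeons", "the", "school", "the", "the"],
    "count":   [4, 10, 3, 8, 6],
})

# One row per chapter, one column per token, zero where a token is absent.
count_matrix = toy.pivot_table(
    index="chapter", columns="token", values="count", aggfunc="sum", fill_value=0
)

# Default settings: smoothed IDF and L2-normalised rows.
tfidf_matrix = TfidfTransformer().fit_transform(count_matrix)

# Wrap the sparse result in a DataFrame so rows and columns keep their labels.
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(), index=count_matrix.index, columns=count_matrix.columns
)
print(tfidf_df)
```

Because the default settings smooth the IDF and normalise each row, these scores will not match a hand-computed `count * log(...)` value.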

In [252]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

In [None]:
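# Assumed completion of an empty column selection: each token row is treated as one tiny raw-text document
tfidf_vectorizer.fit_transform(lost_df_flattened['token'])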

In [57]:
lost_df_flattened['section'].value_counts()

body    51297
Name: section, dtype: int64

In [58]:
lost_df_flattened[lost_df_flattened['section'] == 'header']

Unnamed: 0,page,section,token,pos,count


In [34]:
lost_df.sort_values(by='count', ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,count
page,section,token,pos,Unnamed: 4_level_1
250,body,.,.,45
6,body,",",",",43
100,body,and,CC,43
163,body,.,.,40
76,body,the,DT,38
...,...,...,...,...
101,body,s,VBZ,1
101,body,sat,VBD,1
101,body,second,JJ,1
101,body,sense,NN,1


In [100]:
lost_df = volume.tokenlist()

In [91]:
lost_df = volume.tokenlist(pages=False, pos=False, case=False)
lost_df.sort_values(by='count', ascending=False)[:100]

Unnamed: 0_level_0,Unnamed: 1_level_0,count
section,lowercase,Unnamed: 2_level_1
body,the,5506
body,",",5498
body,.,4954
body,and,2939
body,to,2316
body,she,1783
body,a,1681
body,'',1619
body,``,1616
body,her,1590
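
Judging by the `lowercase` index level in the output above, `tokenlist(pages=False, pos=False, case=False)` appears to fold case and sum counts across pages and parts of speech. A rough pandas equivalent on a flattened frame might look like the sketch below; it is an approximation under that assumption, not the library's implementation, and `toy` and `collapsed` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows shaped like lost_df_flattened (page, section, token, pos, count).
toy = pd.DataFrame({
    "page":    [1, 1, 2, 2],
    "section": ["body"] * 4,
    "token":   ["The", "the", "She", "she"],
    "pos":     ["DT", "DT", "PRP", "PRP"],
    "count":   [2, 3, 1, 3],
})

# Lowercase the tokens, then sum counts over pages and parts of speech.
collapsed = (
    toy.assign(lowercase=toy["token"].str.lower())
       .groupby(["section", "lowercase"])["count"]
       .sum()
       .sort_values(ascending=False)
)
print(collapsed)
```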


https://babel.hathitrust.org/cgi/pt?id=bc.ark:/13960/t5q84bb6g&view=1up&seq=9

In [80]:
dubliners_volume = Volume("bc.ark:/13960/t5q84bb6g")

In [81]:
dubliners_volume.title

'Dubliners /'

In [85]:
dubliners_volume.tokenlist().groupby(level='section').size()

section
body    43854
dtype: int64

In [None]:
from htrc_features import Volume
vol = Volume('data/ef2-stubby/hvd/34926/hvd.32044093320364.json.bz2')
vol

In [88]:
Volume("mdp.39015062571693").tokenlist().groupby(level='section').size()

section
body    36810
dtype: int64