# TF-IDF with HathiTrust Data

In this lesson, we're going to learn about a text analysis method called *term frequency–inverse document frequency*, often abbreviated *tf-idf*.

While calculating the most frequent words in a text can be useful, the most frequent words usually aren't the most interesting ones, even if we get rid of stop words ("the," "and," "to," etc.). Tf-idf is a method that builds on word frequency but tries more specifically to identify the most distinctively frequent or significant words in a document.

In this lesson, we will cover how to:
- Calculate and normalize tf-idf scores for each short story in Edward P. Jones's *Lost in the City*
- Download and process HathiTrust extracted features — that is, word frequencies for books in the HathiTrust Digital Library (including in-copyright books like *Lost in the City*)
- Prepare HathiTrust extracted features for tf-idf analysis

## Dataset

### *Lost in the City* by Edward P. Jones

<blockquote class="epigraph" style=" padding: 10px">

 [T]he pigeon had taken a step and dropped from the ledge. He caught an upwind that took him nearly as high as the tops of the empty K Street houses. He flew farther into Northeast, into the color and sounds of the city's morning. She did nothing, aside from following him, with her eyes, with her heart, as far as she could.
    
<p class ="attribution">—Edward P. Jones, "The Girl Who Raised Pigeons," <i>Lost in the City</i> (1993) </p>
    
</blockquote>

Edward P. Jones's *Lost in the City* (1993) is a collection of 14 short stories set in Washington, D.C. The first short story, "The Girl Who Raised Pigeons," begins with a young girl raising homing pigeons on her roof.

How distinctive is a "pigeon" in the world of *Lost in the City*? What does this uniqueness (or lack thereof) tell us about the meaning of pigeons in the first short story, "The Girl Who Raised Pigeons," and in the collection as a whole? These are just a few of the questions that we're going to try to answer with tf-idf.

If you already have a collection of plain text (.txt) files that you'd like to analyze, one of the easiest ways to calculate tf-idf scores is to use the Python library scikit-learn. It has a quick and nifty module called [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which does all the math for you behind the scenes. We will cover how to use the TfidfVectorizer in the next lesson.

In this lesson, however, we're going to calculate tf-idf scores manually because *Lost in the City* is still in copyright, which means that, for legal reasons, we can't easily share or access plain text files of the book.

Luckily, the [HathiTrust Digital Library](https://www.hathitrust.org/)—which contains digitized books from Google Books as well as many university libraries—has released word frequencies per page for all 17 million books in its catalog. These word frequencies (plus part of speech tags) are otherwise known as "extracted features." There's a lot of text analysis that we can do with extracted features alone, including tf-idf.

So to calculate tf-idf scores for *Lost in the City*, we're going to use HathiTrust extracted features. That's why we're not using scikit-learn's TfidfVectorizer. It works great with plain text files but not so great with extracted features.

## Breaking Down the TF-IDF Formula

But first, let's quickly discuss the tf-idf formula. The idea is pretty simple.

**tf-idf = term_frequency * inverse_document_frequency**

**term_frequency** = number of times a given term appears in document

**inverse_document_frequency** = log(total number of documents / number of documents with term) + 1**\***

You take the number of times a term occurs in a document (term frequency). Then you take the number of documents in which the same term occurs at least once divided by the total number of documents (document frequency), flip that fraction on its head, take its logarithm, and add 1 (inverse document frequency). Then you multiply the two numbers together (term_frequency * inverse_document_frequency).

The reason we take the *inverse*, or flipped fraction, of document frequency is to boost the rarer words that occur in relatively few documents. Think about the inverse document frequency for the word "said" vs the word "pigeons." The term "said" appears in 13 (document frequency) of 14 (total documents) *Lost in the City* stories (14 / 13 --> a smaller inverse document frequency) while the term "pigeons" only occurs in 2 (document frequency) of the 14 stories (total documents) (14 / 2 --> a bigger inverse document frequency, a bigger tf-idf boost).

\*There are a bunch of slightly different ways that you can calculate inverse document frequency. The version of idf that we're going to use is the [scikit-learn default](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer), which uses "smoothing": it adds 1 to both the numerator and the denominator.

**inverse_document_frequency** = log((1 + total_number_of_documents) / (1 + number_of_documents_with_term)) + 1

```{margin}
> If smooth_idf=True (the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.  
> -[scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)
```
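
To make the effect of smoothing concrete, here is a minimal sketch that compares the unsmoothed and smoothed formulas above. The function names `plain_idf` and `smoothed_idf` are illustrative (they are not part of scikit-learn), and the term with a document frequency of 0 is a hypothetical case.

```python
import math

def plain_idf(total_documents, document_frequency):
    # Unsmoothed version: log(N / df) + 1
    return math.log(total_documents / document_frequency) + 1

def smoothed_idf(total_documents, document_frequency):
    # Smoothed version: log((1 + N) / (1 + df)) + 1
    return math.log((1 + total_documents) / (1 + document_frequency)) + 1

# A term found in 2 of 14 documents: both versions work, with slightly different values
print(plain_idf(14, 2), smoothed_idf(14, 2))

# A hypothetical term found in 0 documents: only the smoothed version can be computed
try:
    plain_idf(14, 0)
except ZeroDivisionError:
    print("plain idf fails when document frequency is 0")
print(smoothed_idf(14, 0))
```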

### Let's test it out

Our calculation needs a [logarithm](https://en.wikipedia.org/wiki/Logarithm) function, so we're going to import the `numpy` package and use `np.log()`, which computes the natural logarithm.

In [221]:
import numpy as np

**"said"**

In [230]:
total_number_of_documents = 14 ##total number of short stories in *Lost in the City*
number_of_documents_with_term = 13 ##number of short stories that contain the word "said"

In [231]:
term_frequency = 47 ##number of times "said" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = np.log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1

In [232]:
term_frequency * inverse_document_frequency

50.24266495988672

**"pigeons"**

In [233]:
total_number_of_documents = 14 ##total number of short stories in *Lost in the City*
number_of_documents_with_term = 2 ##number of short stories that contain the word "pigeons"

In [234]:
term_frequency = 30 ##number of times "pigeons" appears in "The Girl Who Raised Pigeons"
inverse_document_frequency = np.log((1 + total_number_of_documents) / (number_of_documents_with_term +1)) + 1

In [235]:
term_frequency * inverse_document_frequency

78.28313737302301

**tf-idf scores for "The Girl Who Raised Pigeons"**

"said" = 50.24<br>
"pigeons" = 78.28

Though the word "said" appears 47 times in "The Girl Who Raised Pigeons" and the word "pigeons" only appears 30 times, "pigeons" has a higher tf-idf score than "said" because it's a rarer word. The word "pigeons" appears in 2 of 14 stories, while "said" appears in 13 of 14 stories, almost all of them.
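
Since both calculations repeat the same steps, they can be wrapped in a small helper. This is only a sketch: the function name `tf_idf` and its parameter names are made up for illustration, and it uses Python's built-in `math.log()` (also a natural logarithm) in place of `np.log()`.

```python
import math

def tf_idf(term_frequency, document_frequency, total_documents):
    """Multiply raw term frequency by the smoothed inverse document frequency."""
    idf = math.log((1 + total_documents) / (1 + document_frequency)) + 1
    return term_frequency * idf

# Counts taken from the two examples above
said_score = tf_idf(term_frequency=47, document_frequency=13, total_documents=14)
pigeons_score = tf_idf(term_frequency=30, document_frequency=2, total_documents=14)

print(round(said_score, 2), round(pigeons_score, 2))
```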

## Get HathiTrust Extracted Features

Now let's try to calculate tf-idf scores for all the words in all the short stories in *Lost in the City*. To do so, we need word counts, or HathiTrust extracted features, for each story in the collection.

To work with HathiTrust's extracted features, we first need to install and import the [HathiTrust Feature Reader](https://github.com/htrc/htrc-feature-reader).

Install HathiTrust Feature Reader

In [None]:
!pip install htrc-feature-reader

Import necessary libraries

In [202]:
from htrc_features import Volume
import pandas as pd

<div class="admonition pandasreview" name="html-admonition" style="background: black; color: white; padding: 10px">
<p class="title">Pandas</p>
 Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out <a href="https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Analysis/Pandas-Basics-Part1.html"> Pandas Basics (1-3) </a> in this textbook!
    
</div>

Then we need to locate the HathiTrust volume ID for *Lost in the City*. If we search the HathiTrust catalog for this book and then click on "Limited (search only)," it will take us to the following web page: https://babel.hathitrust.org/cgi/pt?id=mdp.39015029970129.

The HathiTrust Volume ID for *Lost in the City* is located after `id=` in this URL: `mdp.39015029970129`.
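
If you have several catalog URLs, you could pull the volume ID out programmatically instead of copying it by hand. The sketch below uses only the Python standard library and assumes the URL has the same `?id=...` form as the one above; other HathiTrust URL formats may need different handling.

```python
from urllib.parse import urlparse, parse_qs

url = "https://babel.hathitrust.org/cgi/pt?id=mdp.39015029970129"

# Split the query string ("id=mdp.39015029970129") into a dictionary of lists
query = parse_qs(urlparse(url).query)

# Take the first (and only) value of the "id" parameter
volume_id = query["id"][0]
print(volume_id)
```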

### Make DataFrame of Word Frequencies From Volume(s)

#### Single Volume

To get HathiTrust extracted features for a single volume, we can create a [`Volume` object](https://github.com/htrc/htrc-feature-reader#volume) and use the `.tokenlist()` method.

In [203]:
Volume('mdp.39015029970129').tokenlist()

page,section,token,pos,count
1,body,",",",",1
1,body,.046,CD,1
1,body,1993,CD,1
1,body,3560,CD,1
1,body,AWARD,NN,1
...,...,...,...,...
260,body,world,NN,2
260,body,would,MD,1
260,body,writers,NNS,1
260,body,written,VBN,1


For each page in *Lost in the City*, this DataFrame displays the page number and section type as well as every word/token that appears on the page, its part-of-speech, and the number of times that word/token occurs on the page. As you can see, there are 51,297 rows in this DataFrame — one for each token that appears on each page.

Let's look at a sample of just 20 words from page 11.

In [204]:
Volume('mdp.39015029970129').tokenlist()[500:520]

page,section,token,pos,count
11,body,out,RP,1
11,body,over,IN,1
11,body,part,NN,1
11,body,past,IN,1
11,body,pee,VB,1
11,body,pigeon,NN,1
11,body,pigeons,NNS,1
11,body,reach,VB,1
11,body,remained,VBD,1
11,body,roof,NN,1


We can also get metadata for a HathiTrust volume by asking for [certain attributes](https://github.com/htrc/htrc-feature-reader#volume).

In [205]:
Volume('mdp.39015029970129').year

1993

In [206]:
Volume('mdp.39015029970129').page_count

260

In [207]:
Volume('mdp.39015029970129').publisher

'HarperPerennial'

#### Multiple Volumes

We might want to get extracted features for multiple volumes at the same time, so we're also going to practice a workflow that will allow us to read in multiple HathiTrust books, even though we're only reading in one book at this moment.

Insert list of desired HathiTrust volume(s)

In [208]:
volume_ids = ['mdp.39015029970129']

Loop through this list of volume IDs and make a DataFrame that includes extracted features, book title, and publication year, then make a list of all DataFrames.

In [209]:
all_tokens = []

for hathi_id in volume_ids:
    
    #Read in HathiTrust volume
    volume = Volume(hathi_id)
    
    #Make dataframe from token list -- do not include part of speech, sections, or case sensitivity
    token_df = volume.tokenlist(case=False, pos=False, drop_section=True)
    
    #Add book column
    token_df['book'] = volume.title
    
    #Add publication year column
    token_df['year'] = volume.year
    
    all_tokens.append(token_df)

Concatenate the list of DataFrames 

In [210]:
lost_df = pd.concat(all_tokens)

Preview the DataFrame

In [211]:
lost_df

page,lowercase,count,book,year
1,",",1,Lost in the city : stories /,1993
1,.046,1,Lost in the city : stories /,1993
1,1993,1,Lost in the city : stories /,1993
1,3560,1,Lost in the city : stories /,1993
1,a,1,Lost in the city : stories /,1993
...,...,...,...,...
260,would,1,Lost in the city : stories /,1993
260,writers,1,Lost in the city : stories /,1993
260,written,1,Lost in the city : stories /,1993
260,york,1,Lost in the city : stories /,1993


Change from multi-level index to regular index with `reset_index()`

In [212]:
lost_df_flattened = lost_df.reset_index()

In [213]:
lost_df_flattened 

Unnamed: 0,page,lowercase,count,book,year
0,1,",",1,Lost in the city : stories /,1993
1,1,.046,1,Lost in the city : stories /,1993
2,1,1993,1,Lost in the city : stories /,1993
3,1,3560,1,Lost in the city : stories /,1993
4,1,a,1,Lost in the city : stories /,1993
...,...,...,...,...,...
47302,260,would,1,Lost in the city : stories /,1993
47303,260,writers,1,Lost in the city : stories /,1993
47304,260,written,1,Lost in the city : stories /,1993
47305,260,york,1,Lost in the city : stories /,1993


Nice! We now have a DataFrame of word counts per page for *Lost in the City*.
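
To see what `reset_index()` does on a smaller scale, here is a toy sketch with made-up counts. It builds a tiny DataFrame with the same two-level (`page`, `lowercase`) index and then flattens it.

```python
import pandas as pd

# Toy word counts indexed by (page, lowercase), mimicking the structure above
toy = pd.DataFrame(
    {"count": [1, 2, 1]},
    index=pd.MultiIndex.from_tuples(
        [(11, "pigeon"), (11, "roof"), (12, "pigeon")],
        names=["page", "lowercase"],
    ),
)

flattened = toy.reset_index()

# Before: only "count" is a column; page and lowercase live in the index
print(list(toy.columns))
# After: page and lowercase are ordinary columns that we can filter and group on
print(list(flattened.columns))
```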

But what we need to move forward with tf-idf is a way of splitting this collection into its individual stories. Remember: to use tf-idf, we need a *collection* of texts because we need to compare word frequency for one document with all the other documents in the collection.

## Add story titles

How can we split up *Lost in the City* into individual stories?

Sometimes HathiTrust Extracted Features helpfully include "section" information for a book, such as chapter titles. Unfortunately, the extracted features for *Lost in the City* do not include chapter or story titles.

They do, however, include page numbers and, if you specify `volume.tokenlist(case=True)`, words with case sensitivity. When I manually combed through the HTRC token list with case sensitivity turned on, I noticed that the title page for each short story seemed to format the title in all-caps. So I searched for all-caps words from each story title and noted down the corresponding page number. This should give us a marker of where every story begins and ends.

The function below will add in *Lost in the City*'s story titles for the correct page numbers and corresponding words.

In [214]:
def add_story_titles(page):
    if page >= 0 and page < 11:
        return "Front Matter"
    elif page >= 11 and page < 35:
        return "01: The Girl Who Raised Pigeons"
    elif page >= 35 and page < 41:
        return "02: The First Day"
    elif page >= 41 and page < 63:
        return "03: The Night Rhonda Ferguson Was Killed"
    elif page >= 63 and page < 85:
        return "04: Young Lions"
    elif page >= 85 and page < 113:
        return "05: The Store"
    elif page >= 113 and page < 125:
        return "06: An Orange Line Train to Ballston"
    elif page >= 125 and page < 149:
        return "07: The Sunday Following Mother's Day"
    elif page >= 149 and page < 159:
        return "08: Lost in the City"
    elif page >= 159 and page < 184:
        return "09: His Mother's House"
    elif page >= 184 and page < 191:
        return "10: A Butterfly on F Street"
    elif page >= 191 and page < 209:
        return "11: Gospel"
    elif page >= 209 and page < 225:
        return "12: A New Man"
    elif page >= 225 and page < 237:
        return "13: A Dark Night"
    elif page >= 237 and page <= 252:
        return "14: Marie"
    elif page > 252:
        return "Back Matter"
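
The same page ranges can also be stored as data rather than as a chain of `elif` statements. The sketch below is one possible alternative, not the approach used in the rest of this lesson: the names `STORY_STARTS`, `STORY_LABELS`, and `story_for_page` are made up, and it assumes page numbers are never negative.

```python
from bisect import bisect_right

# First page of each section, in ascending order (same boundaries as the function above)
STORY_STARTS = [0, 11, 35, 41, 63, 85, 113, 125, 149, 159, 184, 191, 209, 225, 237, 253]
STORY_LABELS = [
    "Front Matter",
    "01: The Girl Who Raised Pigeons",
    "02: The First Day",
    "03: The Night Rhonda Ferguson Was Killed",
    "04: Young Lions",
    "05: The Store",
    "06: An Orange Line Train to Ballston",
    "07: The Sunday Following Mother's Day",
    "08: Lost in the City",
    "09: His Mother's House",
    "10: A Butterfly on F Street",
    "11: Gospel",
    "12: A New Man",
    "13: A Dark Night",
    "14: Marie",
    "Back Matter",
]

def story_for_page(page):
    # Find the last section whose first page is <= the given page
    return STORY_LABELS[bisect_right(STORY_STARTS, page) - 1]

print(story_for_page(11), "|", story_for_page(252), "|", story_for_page(253))
```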

Below we add a new column of story titles to the DataFrame by `apply()`ing our function to the "page" column and dumping the results to `lost_df_flattened['story']`. You can read more about applying functions in ["Pandas Basics - Part 3"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Analysis/Pandas-Basics-Part3.html#applying-functions).

In [215]:
lost_df_flattened['story'] = lost_df_flattened['page'].apply(add_story_titles)

We're also going to drop the "Front Matter" and "Back Matter" from the DataFrame.

In [216]:
lost_df_flattened = lost_df_flattened.drop(lost_df_flattened[lost_df_flattened['story'] == 'Front Matter'].index)

In [217]:
lost_df_flattened = lost_df_flattened.drop(lost_df_flattened[lost_df_flattened['story'] == 'Back Matter'].index)

## Sum Word Counts For Each Story

Page-level information is great. But for tf-idf purposes, we really only care about the frequency of words for every story. Below we group by story and calculate the sum of word frequencies for all the pages in that story.

In [218]:
lost_df_flattened.groupby(['story', 'lowercase'])[['count']].sum().reset_index()

Unnamed: 0,story,lowercase,count
0,01: The Girl Who Raised Pigeons,!,8
1,01: The Girl Who Raised Pigeons,',4
2,01: The Girl Who Raised Pigeons,'',111
3,01: The Girl Who Raised Pigeons,'d,1
4,01: The Girl Who Raised Pigeons,'ll,5
...,...,...,...
18082,14: Marie,yet,1
18083,14: Marie,you,39
18084,14: Marie,you-know-who,1
18085,14: Marie,young,8


Notice how the "page" column no longer exists in the DataFrame and our rows have slimmed down from more than 40,000 to about 18,000.
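
Here is the same group-and-sum step on a toy DataFrame with made-up pages, stories, and counts, which makes it easier to see how page-level rows collapse into story-level rows.

```python
import pandas as pd

# Four page-level rows: "pigeons" appears on two different pages of story A
toy = pd.DataFrame({
    "page": [11, 12, 12, 35],
    "story": ["A", "A", "A", "B"],
    "lowercase": ["pigeons", "pigeons", "roof", "pigeons"],
    "count": [1, 2, 1, 1],
})

# Group rows that share a story and a word, then add up their counts
summed = toy.groupby(["story", "lowercase"])[["count"]].sum().reset_index()
print(summed)
```

The two "pigeons" rows for story A are combined into a single row, and the "page" column disappears because it was not part of the grouping.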

In [219]:
word_frequency_df = lost_df_flattened.groupby(['story', 'lowercase'])[['count']].sum().reset_index()

## Remove Infrequent Words, Stopwords, & Punctuation

We will conclude with some final pre-processing steps. We will remove the list of stopwords defined below.

Make list of stopwords

In [220]:
STOPS = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
         'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
         'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
         'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
         'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
         'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
         'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
         'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
         'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
         'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
         'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
         'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp', "!"]

Remove stopwords

In [221]:
word_frequency_df = word_frequency_df.drop(word_frequency_df[word_frequency_df['lowercase'].isin(STOPS)].index)

We will also remove punctuation and numbers by using the regular expression `[^A-Za-z\s]`, which matches any character that is not a letter or whitespace. Any row whose token contains such a character is dropped from the DataFrame.

In [222]:
word_frequency_df = word_frequency_df.drop(word_frequency_df[word_frequency_df['lowercase'].str.contains(r'[^A-Za-z\s]', regex=True)].index)
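
To check which tokens this pattern catches, you can try it on a short list with Python's built-in `re` module. This is a sketch: the sample tokens are taken from the outputs shown earlier in this lesson, and `NON_LETTER` is a made-up name.

```python
import re

# Matches any single character that is not a letter or whitespace
NON_LETTER = re.compile(r"[^A-Za-z\s]")

tokens = ["pigeons", "'ll", "you-know-who", "1993", "!", "coop"]

# Keep a token only if the pattern finds nothing in it
kept = [token for token in tokens if not NON_LETTER.search(token)]
print(kept)
```

Note that hyphenated words and contractions are dropped along with punctuation and numbers, because they contain at least one non-letter character.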

In [223]:
#Optional (left commented out): keep only words that appear more than 5 times in a story
#word_frequency_df_test = word_frequency_df[word_frequency_df['count'] > 5]

In [224]:
word_frequency_df

Unnamed: 0,story,lowercase,count
36,01: The Girl Who Raised Pigeons,abandoned,2
37,01: The Girl Who Raised Pigeons,able,2
40,01: The Girl Who Raised Pigeons,absently,1
41,01: The Girl Who Raised Pigeons,absolute,1
42,01: The Girl Who Raised Pigeons,accepted,1
...,...,...,...
18079,14: Marie,years,10
18080,14: Marie,yes,2
18081,14: Marie,yesterday,2
18082,14: Marie,yet,1


## TF-IDF

### Term Frequency

We already have term frequencies for each document. Let's rename the columns so that they're consistent with the tf-idf vocabulary that we've been using.

In [225]:
word_frequency_df = word_frequency_df.rename(columns={'lowercase': 'term','count': 'term_frequency'})

In [226]:
word_frequency_df

Unnamed: 0,story,term,term_frequency
36,01: The Girl Who Raised Pigeons,abandoned,2
37,01: The Girl Who Raised Pigeons,able,2
40,01: The Girl Who Raised Pigeons,absently,1
41,01: The Girl Who Raised Pigeons,absolute,1
42,01: The Girl Who Raised Pigeons,accepted,1
...,...,...,...
18079,14: Marie,years,10
18080,14: Marie,yes,2
18081,14: Marie,yesterday,2
18082,14: Marie,yet,1


### Document Frequency

To calculate the number of documents or stories in which each term appears, we're going to create a separate DataFrame and do some Pandas manipulation and calculation.

In [227]:
document_frequency_df = (word_frequency_df.groupby(['story','term']).size().unstack()).sum().reset_index()

If you inspect parts of the complex chain of Pandas methods above (which is always a great way to learn!), you will see that we're momentarily reshaping the DataFrame to see if each term appears in each story...

In [228]:
word_frequency_df.groupby(['story','term']).size().unstack()

story,abandoned,abhored,abide,ability,able,abomination,aboum,aboutfcfteen,abqu,absently,...,ypu,yr,ysirs,ythe,yuddini,zigzagging,zion,zipped,zippers,zoo
01: The Girl Who Raised Pigeons,1.0,,,,1.0,,,,,1.0,...,,1.0,,1.0,,,,,,
02: The First Day,,,,,,,,,,1.0,...,,,,,,,,,,
03: The Night Rhonda Ferguson Was Killed,,,,,1.0,,,,,,...,,,,,,,,,,
04: Young Lions,,,,,1.0,,1.0,,,,...,1.0,,,,,,,,1.0,
05: The Store,,,,1.0,1.0,1.0,,1.0,,,...,,,,,1.0,,,,,
06: An Orange Line Train to Ballston,,,,,1.0,,,,,,...,,,,,,,,,,1.0
07: The Sunday Following Mother's Day,1.0,1.0,1.0,,1.0,,,,1.0,,...,,,1.0,,,,,,,
08: Lost in the City,,,,,,,,,,,...,,,,,,,,,,
09: His Mother's House,1.0,,,,1.0,,,,,,...,,,,,,,,1.0,,
10: A Butterfly on F Street,,,,,1.0,,,,,,...,,,,,,1.0,,,,


Then we're adding up how many stories each term appears in (`.sum()`) and resetting the index (`.reset_index()`) to make a DataFrame.

Finally, we will rename the column in this DataFrame and merge it into our word frequency DataFrame.

In [229]:
document_frequency_df = document_frequency_df.rename(columns={0:'document_frequency'})

In [230]:
word_frequency_df = word_frequency_df.merge(document_frequency_df, on='term')

Now we have term frequency and document frequency.

In [231]:
word_frequency_df

Unnamed: 0,story,term,term_frequency,document_frequency
0,01: The Girl Who Raised Pigeons,abandoned,2,3.0
1,07: The Sunday Following Mother's Day,abandoned,1,3.0
2,09: His Mother's House,abandoned,1,3.0
3,01: The Girl Who Raised Pigeons,able,2,12.0
4,03: The Night Rhonda Ferguson Was Killed,able,3,12.0
...,...,...,...,...
15721,14: Marie,whim,1,1.0
15722,14: Marie,wilamena,20,1.0
15723,14: Marie,wise,8,1.0
15724,14: Marie,womanish,1,1.0


As you can see in the DataFrame above, the term "abandoned" appears 2 times in the story "The Girl Who Raised Pigeons" (term frequency), and it appears in 3 different stories in the collection overall (document frequency).
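
The reshaping, summing, and merging steps in this section can be tried on a toy DataFrame with made-up stories and counts. The last line shows a more direct way to get the same document frequencies with `nunique()`; it is offered as an assumed alternative, not the method used in this lesson.

```python
import pandas as pd

toy = pd.DataFrame({
    "story": ["A", "A", "B", "C"],
    "term": ["pigeons", "roof", "pigeons", "pigeons"],
    "term_frequency": [3, 1, 1, 2],
})

# One row per story, one column per term: 1.0 where the term occurs, NaN elsewhere
presence = toy.groupby(["story", "term"]).size().unstack()

# Summing each column skips NaN, so the total is the number of stories per term
document_frequency = presence.sum().reset_index().rename(columns={0: "document_frequency"})

# Merge on the shared "term" column to attach document frequency to every row
merged = toy.merge(document_frequency, on="term")
print(merged)

# Alternative: count the unique stories in which each term appears
print(toy.groupby("term")["story"].nunique())
```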

### Total Number of Documents 

To calculate the total number of documents in the collection, we count how many unique values are in the "story" column (we know the answer should be 14 short stories).

In [232]:
total_number_of_documents = lost_df_flattened['story'].nunique()

In [233]:
total_number_of_documents

14

### Inverse Document Frequency

As we previously established, there are a lot of slightly different versions of the tf-idf formula, but we're going to use the default version from the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) that adds "smoothing" to inverse document frequency.

```
inverse_document_frequency = log [ (1 + total number of docs) / (1 + document frequency) ] + 1
```

In [234]:
import numpy as np

In [235]:
word_frequency_df['idf'] = np.log((1 + total_number_of_documents) / (1 + word_frequency_df['document_frequency'])) + 1

### TF-IDF

Finally, we will calculate tf-idf by multiplying term frequency and inverse document frequency together.

In [236]:
word_frequency_df['tfidf'] = word_frequency_df['term_frequency'] * word_frequency_df['idf']

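Then we will normalize these values with the scikit-learn library. Note that `axis=0` applies L2 normalization to the whole `tfidf` column at once, across all stories. This rescales every score by the same constant, which differs from the per-document normalization that `TfidfVectorizer` applies by default.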

In [237]:
from sklearn import preprocessing

In [238]:
word_frequency_df['tfidf_normalized'] = preprocessing.normalize(word_frequency_df[['tfidf']], axis=0, norm='l2')
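
As a rough sketch of what L2 normalization does to a single column of numbers: each value is divided by the square root of the sum of all the squared values. The three scores below are made up for illustration.

```python
import math

tfidf_column = [3.0, 4.0, 12.0]

# L2 norm of the column: the square root of the sum of squared values
l2_norm = math.sqrt(sum(value * value for value in tfidf_column))

# Dividing by the norm rescales the values without changing their order
normalized = [value / l2_norm for value in tfidf_column]
print(l2_norm, normalized)
```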

We did it! Now let's inspect the 15 words with the highest tf-idf scores for each story in the collection.

In [239]:
word_frequency_df.sort_values(by=['story','tfidf_normalized'], ascending=[True,False]).groupby(['story']).head(15)

Unnamed: 0,story,term,term_frequency,document_frequency,idf,tfidf,tfidf_normalized
655,01: The Girl Who Raised Pigeons,betsy,44,1.0,3.014903,132.655733,0.106417
3317,01: The Girl Who Raised Pigeons,jenny,42,1.0,3.014903,126.625927,0.10158
212,01: The Girl Who Raised Pigeons,ann,45,2.0,2.609438,117.424706,0.094199
5566,01: The Girl Who Raised Pigeons,robert,36,1.0,3.014903,108.536509,0.087069
1384,01: The Girl Who Raised Pigeons,coop,28,1.0,3.014903,84.417285,0.06772
7887,01: The Girl Who Raised Pigeons,would,84,14.0,1.0,84.0,0.067385
5053,01: The Girl Who Raised Pigeons,pigeons,30,2.0,2.609438,78.283137,0.062799
4238,01: The Girl Who Raised Pigeons,miss,46,10.0,1.310155,60.267127,0.048347
688,01: The Girl Who Raised Pigeons,birds,29,5.0,1.916291,55.572431,0.044581
1191,01: The Girl Who Raised Pigeons,clara,17,1.0,3.014903,51.253351,0.041116


It turns out that "pigeons" is pretty unique to the first short story in *Lost in the City*, with a normalized tf-idf score of about .063, making it one of the most distinctive words in that story along with "coop" and "birds."

What are some other distinctive words in *Lost in the City*?

## Further Resources

- Peter Organisciak and Boris Capitanu, ["Text Mining in Python through the HTRC Feature Reader,"](https://programminghistorian.org/en/lessons/text-mining-with-extracted-features) *The Programming Historian*
