<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created on August 19, 2021<br />
By Ted Lawless and [Nathan Kelber](http://nkelber.com) for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email tdm@ithaka.org.<br />
___

# Find all sentences that match a word or phrase

The Research Goal:

>Identify any sentences in a dataset that match a word/phrase and output them to a file.

## Before getting started

This research question will require a Constellate dataset that contains the full-text of the document. You can create these in the [Constellate dataset builder](https://constellate.org/builder/) by selecting "Full text only" from the "Download Availability" filter. 

## Import Libraries

First, we will import Python libraries to help us with our analysis. We will use:

* The [pandas](https://pandas.pydata.org/) library to organize the matches in a DataFrame, plot them, and write them to CSV
* The [Natural Language Toolkit](https://www.nltk.org/) to parse the raw text into sentences
* The [Constellate client](https://constellate.org/docs/constellate-client) to retrieve our dataset

In [None]:
import pandas as pd
import csv
from nltk.tokenize import sent_tokenize
import constellate


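You may also need to download the NLTK punkt tokenizer models, which `sent_tokenize` requires to split text into sentences. Recent NLTK releases may ask for the `punkt_tab` resource instead; if a `LookupError` names it, download that resource the same way.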

In [None]:
import nltk
nltk.download('punkt')

## Download the dataset and specify the matching text

Download a dataset created in the Constellate application into the notebook environment.

Set the `dataset_id` variable to the id of the dataset you want to retrieve. You can find the id on your [dashboard](https://constellate.org/dataset/dashboard) in the Constellate web application.

If you do not have a dataset, here are two examples you could use:

* `f477f1df-6cd5-c12e-844e-a04128e9b6e5`: All documents from JSTOR published in Proceedings of the American Philosophical Society from 1900 - 1930

* `88a2bfb7-7196-0ca4-d545-d066ae8cc52c`: All documents from JSTOR published in The American Economic Review from 1910 - 1930 and limited to full text availability

In [None]:
dataset_id = "88a2bfb7-7196-0ca4-d545-d066ae8cc52c"

dataset_file = constellate.get_dataset(dataset_id)

Next, define the `matching_phrase` variable that you want to find in the text of the documents. This can be any string; matching is case-insensitive.

In [None]:
matching_phrase = "inflation"
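Note that the check used in this notebook is a plain substring test, so "inflation" also matches "inflationary" and "hyperinflation". If whole-word matches are wanted instead, a regular expression with word boundaries is one option. The sketch below is an illustration only: `contains_phrase` is a hypothetical helper, not part of the notebook or the Constellate client.

```python
import re

def contains_phrase(sentence, phrase, whole_word=False):
    """Return True if phrase occurs in sentence, ignoring case."""
    phrase = phrase.strip()
    if not whole_word:
        # Same behavior as the substring check used in this notebook
        return phrase.lower() in sentence.lower()
    # \b anchors the phrase at word boundaries; re.escape protects punctuation
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None

sample = "Inflationary pressure rose sharply."
print(contains_phrase(sample, "inflation"))                   # substring match
print(contains_phrase(sample, "inflation", whole_word=True))  # whole-word match
```

The substring test reports a match for the sample sentence, while the whole-word test does not, because "Inflationary" continues past the end of the phrase.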

## Tokenize the dataset into sentences

1. Loop through all the documents in the dataset
2. Read the `fullText` field (an array of page texts) and split each page into sentences with NLTK's sentence tokenizer
3. Check each sentence to see whether it contains the matching phrase (case-insensitive)
4. Save each match to a Python list, recording the following:
    * document identifier
    * publication year
    * the page sequence number where the sentence was found (counted from 0)
    * the sentence sequence number within that page (counted from 0)
    * the text of the sentence

In [None]:
# Lower case our matching phrase and strip any whitespace
matching_phrase = matching_phrase.lower().strip()

# Define an empty list to store our matched sentences
matched_sentences = []

# Count our matches
matched = 0

# Count our loop iterations
n = 0

for document in constellate.dataset_reader(dataset_file):
    publication_year = document["publicationYear"]
    # Treat a missing or empty "fullText" field as a document with no pages
    for page_sequence, raw_page_text in enumerate(document.get("fullText") or []):
        # Replace all line breaks with spaces.
        page = " ".join(raw_page_text.split())
        for sentence_sequence, sentence in enumerate(sent_tokenize(page)):
            if matching_phrase in sentence.lower():
                matched_sentences.append((document["id"], publication_year, page_sequence, sentence_sequence, sentence))
                matched += 1
    n += 1
    if (n % 100) == 0:
        print(f"{n} documents scanned", document["id"])
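To see what the loop above records without downloading a dataset, here is a small self-contained sketch. It makes two assumptions that are not part of the notebook: `toy_document` is a made-up record that only mimics the `id`, `publicationYear`, and `fullText` fields, and `naive_sentences` is a crude regular-expression splitter standing in for NLTK's `sent_tokenize` so that the example runs with the standard library alone.

```python
import re

# Made-up stand-in for one document; only the fields used by the loop are present
toy_document = {
    "id": "doc-1",
    "publicationYear": 1921,
    "fullText": [
        "Prices rose quickly.\nInflation was blamed by many.",
        "Wages lagged behind. Some feared further inflation!",
    ],
}

def naive_sentences(text):
    # Crude splitter: break after ., ! or ? when followed by whitespace.
    # Assumption: fine for a demo; it mishandles abbreviations that NLTK copes with.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def find_matches(document, phrase):
    phrase = phrase.lower().strip()
    rows = []
    for page_seq, raw_page in enumerate(document.get("fullText") or []):
        # Collapse line breaks and repeated spaces into single spaces
        page = " ".join(raw_page.split())
        for sent_seq, sentence in enumerate(naive_sentences(page)):
            if phrase in sentence.lower():
                rows.append((document["id"], document["publicationYear"],
                             page_seq, sent_seq, sentence))
    return rows

toy_matches = find_matches(toy_document, "inflation")
for row in toy_matches:
    print(row)
```

Each tuple holds the page and sentence positions as produced by `enumerate`, so both are counted from 0: the second sentence on the first page is reported as page 0, sentence 1.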

In [None]:
# Report number of matches
print(f'{len(matched_sentences)} matching sentences found in your dataset.')

# Preview the matched sentences
# ID, year, page sequence, sentence sequence for page, actual matching sentence
matched_sentences[:3]

## Create sentence dataframe

Create a pandas DataFrame from the matched sentences. This makes it convenient to output as a CSV or analyze further.

In [None]:
sentence_df = pd.DataFrame(matched_sentences, columns=["id", "publication_year", "page_seq", "sentence_seq", "text"])

In [None]:
sentence_df.head()

## Plot matching sentences over time

Visualize the number of matching sentences per publication year as a bar chart.

In [None]:
sentence_df.groupby("publication_year").size()\
  .plot(kind="bar", title="Matching sentences over time", xlabel="publication year", ylabel="matching sentences");
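The bar heights are simply the number of matched rows per year. As a cross-check that does not depend on plotting, the same tally can be computed with the standard library. This is only a sketch: the rows below are made-up examples in the same tuple layout as `matched_sentences`.

```python
from collections import Counter

# Made-up rows: (id, publication_year, page_seq, sentence_seq, text)
example_rows = [
    ("doc-1", 1919, 0, 2, "Inflation followed the war."),
    ("doc-2", 1921, 3, 0, "The inflation of credit continued."),
    ("doc-3", 1921, 1, 4, "Inflation eased by autumn."),
]

# Count rows per publication year (index 1 of each tuple)
matches_per_year = Counter(row[1] for row in example_rows)

for year in sorted(matches_per_year):
    print(year, matches_per_year[year])
```

Years with no matching sentences are absent from both the tally and the chart, so a gap between bars does not necessarily mean a gap in the dataset.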

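## Output a CSV file with the matching sentences

Write the DataFrame to a CSV file named after the dataset id. The `../data` folder must already exist.
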
In [None]:
sent_file = f"../data/{dataset_id}-sentences.csv"

sentence_df.to_csv(sent_file, index=False)
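`to_csv` raises an error if the target folder does not exist. One way to guard against that is to create the folder before writing. The sketch below uses only the standard library and a temporary folder (standing in for `../data`) so that it runs anywhere; the rows and the file name are illustrative.

```python
import csv
import tempfile
from pathlib import Path

# Made-up rows in the same layout as matched_sentences
example_rows = [
    ("doc-1", 1919, 0, 2, "Inflation followed the war."),
    ("doc-2", 1921, 3, 0, 'He called it "inflation, pure and simple".'),
]

# A temporary folder stands in for ../data in this sketch
output_dir = Path(tempfile.mkdtemp()) / "data"
output_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
output_path = output_dir / "example-sentences.csv"

with open(output_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "publication_year", "page_seq", "sentence_seq", "text"])
    writer.writerows(example_rows)

# Read the file back to confirm that quotes and commas survived the round trip
with open(output_path, newline="", encoding="utf-8") as f:
    rows_read = list(csv.reader(f))

print(len(rows_read) - 1, "sentences written to", output_path)
```

In the notebook itself, calling `Path(sent_file).parent.mkdir(parents=True, exist_ok=True)` before `to_csv` would have the same effect.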