# Finding sentences that match a word or phrase

8/19/21

Research question:

>We would like to identify any sentences in a dataset that match the word/phrase X and output those to a file.

This research question requires a Constellate dataset that contains the full text of each document. You can create one in the Constellate application by selecting "Full text only" from the "Download Availability" filter. Here are example datasets to work with:

* `f477f1df-6cd5-c12e-844e-a04128e9b6e5`: All documents from JSTOR published in Proceedings of the American Philosophical Society from 1900 - 1930

* `88a2bfb7-7196-0ca4-d545-d066ae8cc52c`: All documents from JSTOR published in The American Economic Review from 1910 - 1930 and limited to full text availability

First, import Python libraries to help us with our analysis. We will use the [Natural Language Toolkit](https://www.nltk.org/) to parse the raw text into sentences and the Pandas library for plotting and outputting the matched sentences to a CSV file.

In [None]:
from collections import defaultdict, Counter
import csv
import pandas as pd

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
import constellate

## Download the dataset and specify the matching text

Download a dataset created in the Constellate application into the notebook environment.

Set the `dataset_id` variable to the ID of the dataset you want to retrieve. You can find the ID on your [dashboard](https://constellate.org/dataset/dashboard) in the Constellate web application.

Next, set the `matching_phrase` variable to the text you want to find in the documents. This can be any string. Matching is a case-insensitive substring match, so "inflation" also matches "inflationary".
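
To make the matching rule concrete, here is a minimal sketch of the case-insensitive substring test the notebook applies, next to a stricter whole-word variant. The helper names `contains_phrase` and `contains_whole_phrase` are hypothetical and exist only for this illustration; the whole-word variant is an alternative you might prefer, not what the notebook does.

```python
import re


def contains_phrase(sentence, phrase):
    """Case-insensitive substring test, the same rule the notebook uses."""
    return phrase.lower().strip() in sentence.lower()


def contains_whole_phrase(sentence, phrase):
    """Hypothetical stricter variant: the phrase must sit on word boundaries."""
    pattern = r"\b" + re.escape(phrase.strip()) + r"\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


sentence = "Inflationary pressure grew after the war."
substring_hit = contains_phrase(sentence, "inflation")         # substring rule
whole_word_hit = contains_whole_phrase(sentence, "inflation")  # word-boundary rule
print(substring_hit, whole_word_hit)
```

The substring rule counts "Inflationary" as a match for "inflation", while the word-boundary rule does not. Which behavior is right depends on the research question.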

In [None]:
dataset_id = "88a2bfb7-7196-0ca4-d545-d066ae8cc52c"
matching_phrase = "inflation"

In [None]:
dataset_file = constellate.get_dataset(dataset_id)

## Tokenize the dataset into sentences

Loop through all documents in the dataset, read the `fullText` field, which is an array of page text, and split each page into sentences using NLTK's sentence tokenizer. Check each sentence to see if it contains the matching phrase (case-insensitive) and save the matches to a Python list. For each match we record the document identifier, the publication year, the page sequence number where the sentence was found, the sentence sequence number within that page (both zero-based), and the text of the sentence.

In [None]:
matching_phrase = matching_phrase.lower().strip()
matched_sentences = []
matched = 0
n = 0

for document in constellate.dataset_reader(dataset_file):
    publication_year = document["publicationYear"]
    # Fall back to an empty list if a document has no full text.
    for page_sequence, raw_page_text in enumerate(document.get("fullText") or []):
        # Collapse line breaks and runs of whitespace into single spaces.
        page = " ".join(raw_page_text.split())
        for sentence_sequence, sentence in enumerate(sent_tokenize(page)):
            if matching_phrase in sentence.lower():
                matched_sentences.append((document["id"], publication_year, page_sequence, sentence_sequence, sentence))
                matched += 1
    n += 1
    # Report progress every 100 documents.
    if (n % 100) == 0:
        print(f"{n} documents scanned", document["id"])

print(f"{matched} matching sentences found in {n} documents")
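
The scanning logic can be tried on toy data without downloading a dataset. The sketch below is an illustration under stated assumptions: `naive_sentences` is a hypothetical regex stand-in for NLTK's `sent_tokenize` (which needs the Punkt tokenizer data and handles abbreviations far better), and `find_matches` and `toy_documents` are made up for this example. It also treats a document without a `fullText` field as having no pages.

```python
import re


def naive_sentences(text):
    """Hypothetical stand-in for sent_tokenize: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def find_matches(documents, phrase):
    """Return (id, year, page_seq, sentence_seq, sentence) for each matching sentence."""
    phrase = phrase.lower().strip()
    rows = []
    for document in documents:
        # Assumption: a missing fullText field means there is nothing to scan.
        for page_seq, raw_page_text in enumerate(document.get("fullText") or []):
            # Collapse line breaks and repeated whitespace into single spaces.
            page = " ".join(raw_page_text.split())
            for sentence_seq, sentence in enumerate(naive_sentences(page)):
                if phrase in sentence.lower():
                    rows.append((document["id"], document.get("publicationYear"),
                                 page_seq, sentence_seq, sentence))
    return rows


# Made-up documents shaped like the fields the notebook reads.
toy_documents = [
    {"id": "doc-1", "publicationYear": 1920,
     "fullText": ["Prices rose quickly.\nInflation was\nthe main concern.",
                  "No match on this page."]},
    {"id": "doc-2", "publicationYear": 1921},  # no fullText field
]

toy_rows = find_matches(toy_documents, " Inflation ")
print(toy_rows)
```

Both sequence numbers come from `enumerate`, so the first page and the first sentence on a page are numbered 0.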

Preview the matched sentences.

In [None]:
matched_sentences[:4]

## Create sentence dataframe

Create a pandas DataFrame from the matched sentences. This makes it convenient to output as a CSV or analyze further.

In [None]:
sentence_df = pd.DataFrame(matched_sentences, columns=["id", "publication_year", "page_seq", "sentence_seq", "text"])

In [None]:
sentence_df.head()

## Plot matching sentences over time

In [None]:
sentence_df.groupby("publication_year").size()\
  .plot(kind="bar", title="Matching sentences over time", xlabel="publication year", ylabel="matching sentences");
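
Note that `groupby("publication_year").size()` produces no entry for a year without matches, so such years are simply absent from the bar chart. As a rough sketch of the same per-year count using only the standard library, with the gaps filled in as zeros, the code below assumes rows shaped like the tuples in `matched_sentences` and integer publication years. The `counts_by_year` helper and the `toy_rows` data are made up for illustration.

```python
from collections import Counter


def counts_by_year(rows):
    """Count rows per publication year, filling years without matches with zero."""
    counts = Counter(year for _id, year, _page, _sent, _text in rows)
    if not counts:
        return {}
    first, last = min(counts), max(counts)
    return {year: counts.get(year, 0) for year in range(first, last + 1)}


# Made-up rows in the same order as the notebook's tuples.
toy_rows = [
    ("doc-1", 1918, 0, 1, "Inflation was the main concern."),
    ("doc-2", 1920, 3, 0, "The inflation of the currency continued."),
    ("doc-3", 1920, 1, 4, "Wartime inflation raised prices."),
]

yearly = counts_by_year(toy_rows)
print(yearly)
```

Here 1919 appears with a count of 0 even though no row mentions it, which keeps the time axis continuous.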

## Output a CSV file with the matching sentences


In [None]:
sent_file = f"{dataset_id}-sentences.csv"

sentence_df.to_csv(sent_file, index=False)
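
Sentences often contain commas and quotation marks, so the writer has to quote fields correctly; pandas does this for you. For readers who want the same file without a DataFrame, here is a minimal sketch using the standard-library `csv` module, which the notebook already imports. The `write_rows` helper and the `toy_rows` data are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io

COLUMNS = ["id", "publication_year", "page_seq", "sentence_seq", "text"]


def write_rows(rows, handle):
    """Write a header line followed by one CSV row per matched sentence."""
    writer = csv.writer(handle)
    writer.writerow(COLUMNS)
    writer.writerows(rows)


# Made-up row whose text contains both a comma and double quotes.
toy_rows = [("doc-1", 1920, 0, 1, 'He said, "inflation is rising."')]

buffer = io.StringIO()  # in-memory stand-in for an open file
write_rows(toy_rows, buffer)

# Read the text back to confirm the sentence survives the round trip.
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0]["text"])
```

When writing to a real file, open it with `newline=""` as the `csv` documentation recommends, for example `open(sent_file, "w", newline="", encoding="utf-8")`. Values read back from a CSV file are strings, so numeric columns need converting before further analysis.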