# Search for a Word in Context with Python

In this notebook, we'll look at code for reading in a text file and searching for a word in context within that file.

First, we import the modules/packages we need for the rest of our process. In this case, we just need `re`, Python's built-in regular expression module.

Press `Shift + Enter/Return` or click `⏵ Run` to run the cell below.

In [None]:
import re

## Read in a file and print text

Next, we'll read in a text file. Specifically, let's look at a collection of poetry by UW students, which was published just over 100 years ago: *University of Washington Poems* (1924), edited by Glenn Hughes. 

Before you read in the file, take a look at a scan of the original edition: https://babel.hathitrust.org/cgi/pt?id=uc1.$b247983&seq=7  

Then, run the next cell. 

In [None]:
file_path = 'UW-1920s-poetry.txt'

txt_start=2180
txt_end=3900

with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

print(text[txt_start:txt_end])

**Try it:** shift the printing window above by changing the `txt_start` and `txt_end` variables.
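
The printing window works through string slicing. As a minimal sketch of how the two index values behave, here is the same idea on a short stand-in string (the `sample` sentence is invented for illustration and is not from the poetry file):

```python
# A short stand-in string; in the notebook, `text` holds the whole file.
sample = "Rain on the roof, rain on the sea."

# sample[start:end] includes index `start` and stops just before index `end`.
print(sample[0:4])       # the first four characters
print(sample[18:22])     # characters 18 through 21

# An end index past the end of the string does not raise an error;
# Python simply stops at the last character.
print(sample[18:500])
```

This is why setting `txt_end` larger than the length of the file is safe: the slice just runs to the end of the text.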

## Search a text file for a specific word

The code below defines a function, `search_word_in_file`, that accepts four arguments:
- `filename`: name of the file to read
- `word`: the word to search for (case-insensitive)
- `num_print`: maximum number of context examples to print (default: 5)
- `context_window`: number of characters to show before and after the word (default: 30)

Run the cell below - do not edit it if you are trying to follow along :)

In [None]:
def search_word_in_file(filename, word, num_print=5, context_window=30):
    """
    Search for a word in a text file and print the number of occurrences.
    Also print up to num_print occurrences, each with surrounding context.

    Parameters:
    - filename: name of the file to read
    - word: the word to search for (case-insensitive)
    - num_print: maximum number of context examples to print (default: 5)
    - context_window: number of characters to show before/after the word (default: 30)
    """

    # Open the file and read the contents into a single string
    with open(filename, 'r', encoding='utf-8') as file:
        text = file.read()

    # Find all matches of the word (case-insensitive, as a whole word)
    # r'\b' means "word boundary" so we only match the full word;
    # re.escape() makes sure any special characters in `word` are matched literally
    matches = list(re.finditer(rf'\b{re.escape(word)}\b', text, re.IGNORECASE))

    # Print the number of times the word appears
    print(f"The word '{word}' appears {len(matches)} times in the text.\n")

    # Print up to num_print occurrences with context
    for match in matches[:num_print]:
        # Get start and end position of the match
        start, end = match.start(), match.end()
        # Get context before and after the word
        before = text[max(0, start-context_window):start]
        after = text[end:end+context_window]
        # Print the snippet between ✨ markers, with the matched word in [brackets]
        print(f"✨{before}[{text[start:end]}]{after}✨\n\n")

    # Tell the reader when some occurrences were left out
    if len(matches) > num_print:
        print(f"\n[Showing {num_print} out of {len(matches)} occurrences. Increase num_print to see more.]")

The code in the next cell searches the file we've selected (`filename`) for a specified term (`search_term`) and prints up to `num_results` appearances of that term, each shown with `context_window` characters of context on either side. These variables are passed, in order, to the function's `filename`, `word`, `num_print`, and `context_window` parameters.

In [None]:
filename = 'UW-1920s-poetry.txt'
search_term = 'rain'
num_results = 15
context_window=40

search_word_in_file(filename, search_term, num_results, context_window)

**Try it:** change the number of results printed and/or the number of characters shown around each result by adjusting the `num_results` and `context_window` variables above. Look at how the word "rain" (or another term you're interested in) appears within the poems.
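
When you try other terms, keep in mind that `search_word_in_file` matches whole words only, because its pattern wraps the term in `\b` word boundaries. The sketch below illustrates the difference on an invented sentence (the `sample` string and the variable names `loose`, `strict`, and `family` are made up for this example):

```python
import re

# Invented sentence for illustration; not taken from the poetry file.
sample = "The rain fell on the train; a rainbow followed the rains."

# Without word boundaries, 'rain' also matches inside longer words.
loose = re.findall(r'rain', sample, re.IGNORECASE)

# With \b on both sides, only the standalone word matches.
strict = re.findall(r'\brain\b', sample, re.IGNORECASE)

# A boundary at the start plus \w* picks up words that begin with 'rain'.
family = re.findall(r'\brain\w*', sample, re.IGNORECASE)

print(loose)    # four matches: inside rain, train, rainbow, rains
print(strict)   # one match: the standalone word
print(family)   # ['rain', 'rainbow', 'rains']
```

So a search for "rain" will not count "rains" or "rainy"; each of those would need its own search. Because the function passes the term through `re.escape()`, a pattern such as `rain\w*` would be treated as literal text rather than as a regular expression.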

## Search for a term in a different text file

Searching a text file drawn from a printed source with a more complex layout, such as a newspaper page, can be harder. Let's go back to the issue of the *Seattle Star* we were looking at before. A scan of the issue is available here: https://chroniclingamerica.loc.gov/lccn/sn87093407/1925-06-23/ed-1/seq-1/
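
One likely source of that difficulty is text broken into many short lines, as in narrow newspaper columns, which makes context snippets print across several lines. Assuming the newspaper file is laid out that way (worth checking when you print a chunk of it), a minimal sketch for tidying a snippet is to collapse each run of whitespace into a single space. The `raw_snippet` string is invented for illustration, and this step is not part of `search_word_in_file`:

```python
import re

# Invented stand-in for a passage with line breaks in the middle of a sentence.
raw_snippet = "showers and\nlight   rain\nexpected tonight"

# Replace every run of whitespace (spaces, tabs, newlines) with one space,
# then trim any whitespace left at the two ends.
clean_snippet = re.sub(r'\s+', ' ', raw_snippet).strip()

print(clean_snippet)
```

The same `re.sub` call could be applied to the `before` and `after` strings inside the function if you want each result on one line.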

Once you've looked at the issue, try reading in a chunk of the text.

In [None]:
file_path = 'Seattle-Star_06231925.txt'

txt_start=0
txt_end=2000

with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

print(text[txt_start:txt_end])

Next, we can call the same function we wrote above, `search_word_in_file`, to search this new file. 

Edit the variables below to try searching for a new search term within the newspaper pages.

In [None]:
filename = 'Seattle-Star_06231925.txt'
search_term = 'rain'
num_results = 15
context_window=80

search_word_in_file(filename, search_term, num_results, context_window)