# Introduction

In this lesson, we will explore how to use Python for text analysis by working with a trial transcript from the *Proceedings of the Old Bailey*. Text analysis is an essential tool for historians, allowing us to uncover patterns and insights from historical documents that would be difficult to identify manually.

We will begin by scraping the trial transcript from a saved HTML copy of its page on the *Old Bailey Online*, a digital archive containing the records of over 197,000 criminal trials held at the Central Criminal Court in London between 1674 and 1913. These records provide a rich source of information about crime, justice, and society in London from the late seventeenth to the early twentieth century.

This hands-on lesson will teach you how to:

1.   Scrape text data from a webpage.
2.   Clean the text to make it suitable for analysis.
3.   Generate and visualize word frequencies to reveal trends in the language used during the trial.

By the end of the lesson, you will be able to apply these basic text analysis techniques to other historical documents, helping you to engage with historical sources in new and exciting ways.

This lesson was inspired by [this sequence of lessons](https://programminghistorian.org/en/lessons/from-html-to-list-of-words-1) on the [*Programming Historian*](https://programminghistorian.org/)!

## About the *Proceedings of the Old Bailey*

The *Proceedings of the Old Bailey* is a major online resource for historians, providing access to detailed accounts of trials held at London's central criminal court from 1674 to 1913. It includes witness testimony, defense statements, verdicts, and sentencing information. The website also offers powerful search tools, allowing users to investigate topics ranging from crime and punishment to class, gender, and race in London across this period.

The transcript we will analyze today is from the trial of Benjamin Bowsey, a participant in the 1780 [Gordon Riots](https://en.wikipedia.org/wiki/Gordon_Riots), accused of rioting and the destruction of property. This document, like many others in the *Old Bailey Online*, offers valuable insights into the social and political tensions of the time.

# Step 1: Scraping text from the trial transcript

In [None]:
# Install Beautiful Soup (the PyPI package is named beautifulsoup4) and import it
%pip install beautifulsoup4
from bs4 import BeautifulSoup

# Open the HTML file and parse it
# REMEMBER TO CHANGE THIS PATH SO THAT IT POINTS AT A FILE YOU HAVE ACCESS TO!
with open('/Users/kevinpasquette/Documents/GitHub/hist1354repos/riot_trial.html', 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, 'html.parser')

# Extract all the paragraph tags which contain the main text of the trial
paragraphs = soup.find_all('p')

# Concatenate the text content of all paragraphs into one string
raw_text = ' '.join([para.get_text() for para in paragraphs])

# Print the first 500 characters to see what the text looks like
print(raw_text[:500])


*   `BeautifulSoup` parses the HTML content of the saved page.

*   We locate all the `<p>` tags because inspecting the HTML shows that the trial transcript is contained within those tags.

*   We extract the text of each paragraph, join the pieces into a single string (`raw_text`), and print the first 500 characters to verify the extraction.
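To see what `find_all('p')` and `get_text()` do before running them on the full trial page, it can help to try them on a few lines of HTML. The snippet below is a minimal sketch: `sample_html` is made-up markup, not taken from the *Old Bailey Online*, and the `sample_` names are hypothetical.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet standing in for the saved trial page
sample_html = """
<html><body>
  <h1>Sample trial page</h1>
  <p>First paragraph of <b>testimony</b>.</p>
  <p>Second paragraph.</p>
  <div>Navigation text outside any paragraph tag</div>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all("p") returns every <p> element, in document order
sample_paragraphs = sample_soup.find_all("p")

# get_text() drops the tags (including nested ones such as <b>) and keeps the text
sample_text = " ".join(p.get_text() for p in sample_paragraphs)

print(len(sample_paragraphs))
print(sample_text)
```

Only the two `<p>` elements are kept; the heading and the `<div>` are ignored. This is why it is worth checking that the text you want really sits inside `<p>` tags on the page you are working with.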

# Step 2: Cleaning the text

The next step is to clean the scraped text. We will remove digits, punctuation, and other non-alphabetic characters, collapse extra whitespace, convert all text to lowercase for uniformity, and filter out common stop words.

In [11]:

# re ships with Python's standard library, and stopwords is part of NLTK,
# so NLTK is the only package that needs installing
%pip install nltk

import re
import nltk
from nltk.corpus import stopwords

# Download the stopwords from NLTK if not already downloaded
nltk.download('stopwords')

# Define a cleaning function that also removes stop words
def clean_text(text):
    # Remove any non-alphabetic characters (numbers, punctuation, etc.)
    cleaned = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    cleaned = cleaned.lower()
    # Remove extra whitespace
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()

    # Split the text into individual words
    words = cleaned.split()

    # Remove stop words using NLTK's stop word list for English
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words]

    # Return the cleaned and filtered text as a single string
    return ' '.join(filtered_words)

# Clean the extracted raw text with stop word removal
cleaned_text = clean_text(raw_text)

# Print the first 500 characters of the cleaned text to check
print(cleaned_text[:500])

*   `re.sub()` removes every character that is not a letter or whitespace, which strips digits and punctuation.

*   `clean_text` then converts the text to lowercase and collapses runs of whitespace into single spaces.

*   The cleaned text is split into individual words with `.split()`.

*   NLTK's English stop word list (`nltk.corpus.stopwords`) is used to filter out common words such as "the" and "and".

*   The remaining words are joined back into a single string, ready for further analysis.
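The cleaning steps above can be traced on a single sentence. The following is a sketch rather than the lesson's own function: `clean_demo` is a hypothetical helper, the sample sentence is invented, and `demo_stop_words` is a tiny hand-written stop list used here so the example runs without downloading NLTK data.

```python
import re

# Hypothetical mini stop list for illustration only; the lesson uses NLTK's much longer list
demo_stop_words = {"the", "was", "in", "of", "and", "a", "to"}

def clean_demo(text, stop_words):
    """Apply the same steps as clean_text, with a caller-supplied stop list."""
    letters_only = re.sub(r"[^a-zA-Z\s]", "", text)    # drop digits and punctuation
    lowered = letters_only.lower()                      # normalize case
    collapsed = re.sub(r"\s+", " ", lowered).strip()    # collapse runs of whitespace
    return " ".join(w for w in collapsed.split() if w not in stop_words)

sample = "The Prisoner was seen in Newgate, on the 6th of June;   he didn't run."
demo_cleaned = clean_demo(sample, demo_stop_words)
print(demo_cleaned)
```

On this sentence the result is `prisoner seen newgate on th june he didnt run`. Two side effects are worth noticing: `6th` loses its digit and becomes `th`, and `didn't` loses its apostrophe and becomes `didnt`. Fragments like these can show up in the word counts later, so it is worth scanning the cleaned text before interpreting the frequencies.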

# Step 3: Word frequency analysis

Now that we have cleaned the text, we can generate a list of word frequencies. This will help us understand which words are most common in the trial transcript.

In [None]:
from collections import Counter

# Split the cleaned text into individual words
words = cleaned_text.split()

# Use Counter to count the occurrences of each word
word_freq = Counter(words)

# Display the 10 most common words
word_freq.most_common(10)

*   We split the cleaned text into words using `.split()`, which creates a list of words.

*   The `Counter` class from the `collections` module counts the frequency of each word.

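*   `most_common(10)` returns the 10 most frequent words with their counts. Because it is the last expression in the cell, the notebook displays the result automatically.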
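A small example shows the shape of what `Counter` and `most_common()` return. The word list here is invented for illustration, and the relative-frequency calculation is an optional extra that the lesson itself does not use.

```python
from collections import Counter

# Made-up word list standing in for cleaned_text.split()
demo_words = "prisoner house fire prisoner mob house prisoner".split()
demo_freq = Counter(demo_words)

# most_common(n) returns (word, count) pairs, highest count first
print(demo_freq.most_common(2))

# Relative frequency: each count as a share of all words
total = sum(demo_freq.values())
for word, count in demo_freq.most_common(2):
    print(f"{word}: {count} of {total} ({count / total:.0%})")
```

Each entry is a `(word, count)` tuple, which is why the plotting code can later separate the words from the counts. Relative frequencies are useful when comparing transcripts of different lengths, since raw counts grow with the size of the text.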

## Visualizing word frequencies

If you want to visualize the word frequencies, you can use the `matplotlib` library to create a bar chart.

In [None]:
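%pip install matplotlib
import matplotlib.pyplot as plt

# Get the 10 most common words and their frequencies
common_words = word_freq.most_common(10)
# Use a new name so the full `words` list from Step 3 is not overwritten
top_words, counts = zip(*common_words)

# Create a bar chart
plt.figure(figsize=(10, 6))
plt.bar(top_words, counts)
plt.title('Top 10 Most Common Words in Trial Transcript')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
# Keep the rotated labels from being cut off at the bottom of the figure
plt.tight_layout()
plt.show()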

*   This code takes the 10 most frequent words and their counts.

*   It uses `matplotlib` to draw a simple bar chart of the word frequencies.

*   `plt.xticks(rotation=45)` rotates the word labels to make them easier to read.
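If the rotated labels are still hard to read, one possible variation is a horizontal bar chart saved to an image file. This is a sketch under stated assumptions: the `demo_common` counts are invented, the output file name is hypothetical, and the `Agg` backend line is only there so the code also runs as a plain script without a display (it can be dropped in a notebook).

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt

# Made-up (word, count) pairs standing in for word_freq.most_common(10)
demo_common = [("prisoner", 42), ("house", 30), ("mob", 17), ("fire", 9)]
labels, values = zip(*demo_common)

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(labels, values)
ax.invert_yaxis()  # put the most frequent word at the top
ax.set_xlabel("Frequency")
ax.set_title("Most common words (demo data)")
fig.tight_layout()

# Save into the system's temporary folder; replace with any path you can write to
chart_path = os.path.join(tempfile.gettempdir(), "word_freq_demo.png")
fig.savefig(chart_path, dpi=150)
plt.close(fig)
print(chart_path)
```

With horizontal bars the words are written left to right, so no rotation is needed, and saving the figure gives you an image you can drop into notes or a paper.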

# Conclusion

In this lesson, you learned how to scrape, clean, and analyze a historical trial transcript using Python. The techniques covered here provide a foundation for text analysis in digital history, enabling you to extract meaningful insights from primary source documents. As you continue working with historical texts, consider how these computational methods can be applied to different types of documents and how they might complement more traditional historical approaches.

The combination of digital tools and historical research opens new possibilities for exploring the past, allowing us to ask questions of large bodies of sources that would be impractical to answer by reading alone.