This notebook is an exploratory data analysis (EDA) for this project. I have documented my thought process and decisions along the way. Contributions include:
- Scalable data collection and wrangling methods
- Exploratory data visualization and analysis

In [None]:
import plotly.express as px
from pprint import pprint
import pandas as pd
import requests
import spacy

spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")

The first step of EDA is data collection. We will use copyright-free books from Project Gutenberg for our analysis, looking each book up through the Gutendex API at gutendex.com.

In [None]:
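from typing import Optional

# Get book content from Project Gutenberg via the Gutendex API
def get_book_content(book_number: int) -> Optional[str]:
    """Return the plain text of a book, or None if it could not be retrieved."""
    gutendex_url = f"https://gutendex.com/books/{book_number}/"
    try:
        response = requests.get(gutendex_url, timeout=30)
        response.raise_for_status()
        formats = response.json()["formats"]
    except (requests.exceptions.RequestException, KeyError, ValueError) as e:
        print("Error retrieving book information:", str(e))
        return None

    # Format keys may carry a charset suffix (e.g. "text/plain; charset=utf-8"),
    # so match on the MIME type prefix rather than on the exact key.
    book_url = next((url for mime, url in formats.items() if mime.startswith("text/plain")), None)
    if book_url is None:
        print("No plain-text format available for book", book_number)
        return None

    try:
        response = requests.get(book_url, timeout=30)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print("Error retrieving book content:", str(e))
        return None

    # Flatten Windows line endings so that hard-wrapped lines become running text
    return response.text.replace("\r\n", " ")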

Let's try parsing sentences from The Strange Case of Dr. Jekyll and Mr. Hyde by Robert Louis Stevenson. We'll use spaCy's pipeline to split the book into sentences.

In [None]:
content = get_book_content(43)

doc = nlp(content)
sentences = [sent.text for sent in doc.sents]
pprint(sentences[:10])

Just from eyeballing the list, the parser seems to have done a good job, although some entries are not part of the story, such as the Project Gutenberg header and footer. Before we get into cleaning up the data, let's try to understand it a little better.

In [None]:
print(f"This book has {len(sentences)} sentences.")

What is the longest sentence in the book? What about the shortest sentence?

In [None]:
print("The longest sentence is:")
pprint(max(sentences, key=len))

In [None]:
print(f"The shortest sentence is: {min(sentences, key=len)}")
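Beyond the two extremes, a quick summary of the sentence-length distribution can show whether the longest and shortest sentences are isolated outliers. The following is a minimal sketch using only the standard library; `length_summary` and `sample_sentences` are hypothetical names, and the sample list is a made-up stand-in for the `sentences` list above.

```python
from statistics import mean, median


def length_summary(sentences):
    """Return basic character-length statistics for a non-empty list of sentences."""
    lengths = sorted(len(s) for s in sentences)
    return {
        "count": len(lengths),
        "min": lengths[0],
        "median": median(lengths),
        "mean": mean(lengths),
        "max": lengths[-1],
    }


# Made-up sample standing in for the real `sentences` list
sample_sentences = [
    "Hi.",
    "This is a sentence.",
    "A much longer sentence than the others.",
]
summary = length_summary(sample_sentences)
print(summary)
```

Calling `length_summary(sentences)` on the real data would give a rough sense of how far the one-character "sentence" sits from the typical length.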

Checking the output against the actual book on Project Gutenberg (https://www.gutenberg.org/cache/epub/43/pg43.txt) shows that there are some outliers in the data (unwanted text). For example, the shortest sentence is a single character, which probably comes from a bulleted list in the Project Gutenberg footer. Fortunately, Project Gutenberg makes this boilerplate easy to locate, as it wraps the text of the book in markers such as:

> *** START OF THE PROJECT GUTENBERG EBOOK

> *** END OF THE PROJECT GUTENBERG EBOOK

Unfortunately, spaCy will not strip this boilerplate for us, so we have to remove it ourselves before feeding the text into the parser. Let's remove it and parse the sentences again.

In [None]:
def remove_marker(text: str) -> str:
    start_marker = "***"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"

    # The start marker line both begins and ends with "***", so the second
    # occurrence of "***" marks the end of the header
    start_index = text.find(start_marker, text.find(start_marker) + 1)
    # The first occurrence of the end marker is where the footer begins
    end_index = text.find(end_marker)

    if start_index != -1 and end_index != -1:
        text = text[start_index + len(start_marker) : end_index].strip()

    return text
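The function above relies on the header containing no other "***" before the start marker. A slightly more explicit alternative is to match the full marker text with regular expressions. This is only a sketch: `strip_gutenberg_boilerplate` is a hypothetical name, and accepting "THIS" in place of "THE" is an assumption about variant spellings in other Project Gutenberg files, not something verified here.

```python
import re

# Assumed marker shapes:
#   *** START OF THE PROJECT GUTENBERG EBOOK <TITLE> ***
#   *** END OF THE PROJECT GUTENBERG EBOOK <TITLE> ***
# "THIS" is accepted in place of "THE" as a guess at a variant spelling.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.S
)
END_RE = re.compile(r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK")


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the start and end markers, or the input unchanged."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    # Leave the text alone if either marker is missing or they are out of order
    if start is None or end is None or end.start() < start.end():
        return text
    return text[start.end():end.start()].strip()


# Made-up sample with the same overall shape as a Project Gutenberg file
sample = (
    "Header stuff *** START OF THE PROJECT GUTENBERG EBOOK SOME TITLE *** "
    "Body text here. "
    "*** END OF THE PROJECT GUTENBERG EBOOK SOME TITLE *** License text"
)
print(strip_gutenberg_boilerplate(sample))
```

Unlike counting occurrences of "***", this leaves the text untouched when the markers are missing or out of order, which makes failures easier to notice.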

In [None]:
content = remove_marker(content)

doc = nlp(content)
sentences = [sent.text for sent in doc.sents]
pprint(sentences[:10])

This looks much better! To verify, let's check the number of sentences again and look at the shortest sentence in the book.

In [None]:
print(f"This book has {len(sentences)} sentences.")

In [None]:
print(f"The shortest sentence is: {min(sentences, key=len)}")

Hmm, what went wrong here? Surely the shortest sentence can't just be "DR."? Let's take a look at the sentences before and after it.

In [None]:
shortest_sentence = min(sentences, key=len)
shortest_sentence_index = sentences.index(shortest_sentence)

sentence_before = sentences[shortest_sentence_index - 1]
sentence_after = sentences[shortest_sentence_index + 1]

print(f"Before : {sentence_before}")
print(f"Current: {shortest_sentence}")
print(f"After  : {sentence_after}")


It looks like spaCy has interpreted the period after "DR" as the end of a sentence. In this pipeline, sentence boundaries come from the dependency parser, which [uses a non-monotonic arc-eager transition system](https://spacy.io/api/dependencyparser/). That parser is a statistical model rather than a set of hand-written rules, so unusual input such as an all-caps "DR." can lead it to predict a boundary in the wrong place. It may also have to do with the fact that we are using a small, less accurate model (en_core_web_sm). Nonetheless, I think we have sanitized the data enough to move on to the next step!
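If splits like this turn out to matter later, one option is to post-process the sentence list and glue a fragment that ends in a known abbreviation onto the sentence that follows it. The sketch below is plain Python and is not part of spaCy; `ABBREVIATIONS` and `merge_abbreviation_splits` are hypothetical names, the abbreviation set is illustrative rather than exhaustive, and the sample list is made up.

```python
# Illustrative, not exhaustive: abbreviations that rarely end a sentence
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "st."}


def merge_abbreviation_splits(sentences):
    """Join any sentence that ends in a known abbreviation with the one after it."""
    merged = []
    carry = ""  # text waiting to be joined with the next sentence
    for sentence in sentences:
        current = f"{carry} {sentence}".strip() if carry else sentence
        words = current.split()
        last_word = words[-1].lower() if words else ""
        if last_word in ABBREVIATIONS:
            carry = current  # hold on to it and keep accumulating
        else:
            merged.append(current)
            carry = ""
    if carry:  # a trailing fragment with nothing left to join
        merged.append(carry)
    return merged


# Made-up sample imitating the split seen above
sample = ["STORY OF THE DOOR", "DR.", "JEKYLL WAS QUITE AT EASE", "He smiled."]
print(merge_abbreviation_splits(sample))
```

The obvious trade-off is that a sentence which genuinely ends with one of these abbreviations would be merged with its neighbour, so this heuristic should be checked against the data before relying on it.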

Our next task is to separate out the sentences with subordinating conjunctions. We will continue to use spaCy for this. Specifically, spaCy's part-of-speech (POS) tagging marks subordinating conjunctions with the coarse-grained tag "SCONJ". Let's try to identify the sentences that contain one.

In [None]:
sentences_with_sconj = [sent for sent in sentences if any(token.pos_ == "SCONJ" for token in nlp(sent))]
pprint(sentences_with_sconj[:10])
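The cell above parses every sentence once for filtering, and the later cells parse the matching sentences again. A possible refinement is to keep the parsed docs and filter those instead. The sketch below assumes only that each doc is an iterable of tokens with a `pos_` attribute, which holds for spaCy `Doc` objects; `filter_by_pos` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so that the example runs without a model.

```python
from types import SimpleNamespace


def filter_by_pos(docs, pos="SCONJ"):
    """Keep the docs that contain at least one token with the given coarse POS tag."""
    return [doc for doc in docs if any(token.pos_ == pos for token in doc)]


# Stand-in "docs": lists of token-like objects with hand-assigned tags
fake_docs = [
    [SimpleNamespace(text="He", pos_="PRON"), SimpleNamespace(text="left", pos_="VERB")],
    [SimpleNamespace(text="If", pos_="SCONJ"), SimpleNamespace(text="so", pos_="ADV")],
]
kept = filter_by_pos(fake_docs)
print(len(kept))
```

With spaCy, this could be called as `filter_by_pos(nlp.pipe(sentences))`, which processes the sentences in batches; each kept doc still exposes `doc.text` for printing and can be reused for counting and plotting without another parse.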


How many sentences did we identify with subordinating conjunctions?

In [None]:
print(f"There are {len(sentences_with_sconj)} sentences with a subordinating conjunction, out of {len(sentences)} sentences in total.")

Let's try visualizing one of the sentences with subordinating conjunctions.

In [None]:
from spacy import displacy

doc = nlp(sentences_with_sconj[455])
displacy.render(doc, style="dep", jupyter=True, options={"distance": 120})

Let's do some simple visualization. What are the most common subordinating conjunctions in the book?

In [None]:
subordinating_conjunctions = []

for sentence in sentences_with_sconj:
    doc = nlp(sentence)
    for token in doc:
        if token.pos_ == "SCONJ":
            subordinating_conjunctions.append(token.text)

In [None]:
temp_df = pd.DataFrame(subordinating_conjunctions, columns=["sconj"])
subordinating_conjunctions_df = (
    temp_df.groupby("sconj", as_index=False)
    .size()
    .sort_values("size", ascending=False)
    .reset_index(drop=True)
)
subordinating_conjunctions_df.head()
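One thing to watch in these counts is that `token.text` is case-sensitive, so a sentence-initial "That" and a mid-sentence "that" land in separate rows. A minimal sketch of case-insensitive counting with the standard library follows; `count_case_insensitive` is a hypothetical helper and the sample list is made up.

```python
from collections import Counter


def count_case_insensitive(tokens):
    """Count token strings after lowercasing them."""
    return Counter(token.lower() for token in tokens)


# Made-up sample standing in for `subordinating_conjunctions`
sample = ["that", "That", "if", "because", "IF", "that"]
counts = count_case_insensitive(sample)
print(counts.most_common(2))
```

The same effect could be had earlier in the pipeline by appending spaCy's `token.lower_` instead of `token.text` when collecting the conjunctions.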

In [None]:
fig = px.bar(
    subordinating_conjunctions_df,
    x="sconj",
    y="size",
    title="Most common subordinating conjunctions",
)

fig.show()

This result is somewhat surprising to me. I did not expect "that" to be the most common subordinating conjunction in the book; I had expected "because" to be more common than the others. One likely explanation is that the SCONJ tag also covers complementizers, such as "that" in "he said that...", which are very frequent in prose. However, the subordinating conjunctions alone might not be as interesting as the words that follow them. Let's look at the rest of each sentence after its first subordinating conjunction.

In [None]:
phrases = []

for sentence in sentences_with_sconj:
    doc = nlp(sentence)
    sconj_index = next((i for i, token in enumerate(doc) if token.pos_ == "SCONJ"), None)
    if sconj_index is not None:
        phrase = doc[sconj_index:].text  # Span.text keeps the original spacing around punctuation
        phrases.append(phrase)

pprint(phrases[:10])

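This is interesting! It looks like the words that follow the subordinating conjunction explain, in some way, the reason for some action. For example, take this sentence, in which spaCy tagged "upon" as SCONJ (strictly speaking, "upon" is a preposition here, so this is probably a tagging error, but it still illustrates the idea):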

> Will Hyde die upon the scaffold?

The words before "upon" form an incomplete thought. Imagine reading the sentence only up to that point:

> Will Hyde die

You would be left with many possible ways to complete the sentence. However, once you add the word "upon" to the sentence, it sort of raises the question "upon what?":

> Will Hyde die upon _what?_

You could imagine the writer (in this case, Robert Louis Stevenson) completing the sentence with something like:

> Will Hyde die upon _the scaffold?_

This leaves me with several questions:
- What are some good examples of sentences that use subordinating conjunctions to explain the reason for some action?
- What about sentences that use a subordinating conjunction at the beginning of the sentence, like "Upon the scaffold, will Hyde die?"
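
As a starting point for these two questions, here is a rough sketch of two filters over POS-tagged sentences. It works on plain `(text, pos)` pairs so that it runs without a model; `REASON_MARKERS`, `starts_with_sconj`, and `reason_markers_in` are hypothetical names, the marker set is a guess at conjunctions that often introduce a reason, and the tagged sample is hand-written rather than real spaCy output.

```python
# Hypothetical starting set of conjunctions that often introduce a reason
REASON_MARKERS = {"because", "since", "as"}


def starts_with_sconj(tagged):
    """True if the first token that is not punctuation or whitespace is tagged SCONJ."""
    for _text, pos in tagged:
        if pos in {"PUNCT", "SPACE"}:
            continue
        return pos == "SCONJ"
    return False


def reason_markers_in(tagged):
    """Return the lowercased SCONJ tokens that belong to REASON_MARKERS."""
    return [
        text.lower()
        for text, pos in tagged
        if pos == "SCONJ" and text.lower() in REASON_MARKERS
    ]


# Hand-tagged sample sentence; the tags are illustrative, not model output
tagged_sample = [
    ('"', "PUNCT"),
    ("Because", "SCONJ"),
    ("he", "PRON"),
    ("was", "AUX"),
    ("late", "ADJ"),
    (",", "PUNCT"),
    ("he", "PRON"),
    ("ran", "VERB"),
    (".", "PUNCT"),
]
print(starts_with_sconj(tagged_sample), reason_markers_in(tagged_sample))
```

With spaCy, the pairs for a sentence could be built as `[(t.text, t.pos_) for t in nlp(sentence)]`. Whether "since" or "as" actually signals a reason depends on context, so the matches would still need to be read by hand.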