# **NOTE:** Use File > Save a copy in Drive to make a copy before doing anything else


# Project 6: Advanced Text Analysis with SpaCy

### Project Objectives

*   Apply proper text pre-processing techniques to extract meaningful words
*   Use SpaCy to clean and normalize text data
*   Identify the 15 most frequently used meaningful words in a text
*   Analyze how pre-processing affects text analysis results
*   Apply the same techniques to a text of your choice

#### Loading Data

We will fetch The Great Gatsby from [Project Gutenberg](https://www.gutenberg.org/cache/epub/64317/pg64317.txt).


#### Function to fetch data

In [1]:
def fetch_text(raw_url):
  """Download the text at raw_url, caching it on disk. Returns an empty string on failure."""
  import requests
  from pathlib import Path
  import hashlib

  # Downloaded texts are cached here so repeated runs skip the network request
  CACHE_DIR = Path("cs_110_content/text_cache")
  CACHE_DIR.mkdir(parents=True, exist_ok=True)

  def _url_to_filename(url):
    # Name the cache file after a short hash of the URL
    url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    return CACHE_DIR / f"{url_hash}.txt"

  cache_path = _url_to_filename(raw_url)

  SUCCESS_MSG = "✅ Text fetched."
  FAILURE_MSG = "❌ Failed to fetch text."
  try:
    # Download only if the text is not already in the cache
    if not cache_path.exists():
      response = requests.get(raw_url, timeout=10)
      response.raise_for_status()  # raise an error for HTTP failures (4xx/5xx)
      text_data = response.text
      cache_path.write_text(text_data, encoding="utf-8")
    print(SUCCESS_MSG)
    return cache_path.read_text(encoding="utf-8")

  except Exception as e:
    print(FAILURE_MSG)
    print(f"Error: {e}")
    return ""
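
The caching step in fetch_text can be looked at in isolation. The sketch below uses only the standard library and repeats the hashing logic to show that a URL always maps to the same short file name, which is why a second call can read from disk instead of downloading again. The helper name url_to_cache_path is hypothetical and exists only for this illustration.

```python
import hashlib
from pathlib import Path

# Same cache directory that fetch_text uses; nothing is written to disk here.
CACHE_DIR = Path("cs_110_content/text_cache")

def url_to_cache_path(url):
    """Hypothetical helper: map a URL to a stable cache file path."""
    # SHA-1 turns the URL into a fixed hex string; 12 characters keep the name short.
    url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    return CACHE_DIR / f"{url_hash}.txt"

url = "https://www.gutenberg.org/cache/epub/64317/pg64317.txt"
first = url_to_cache_path(url)
second = url_to_cache_path(url)
print(first.name)
print(first == second)  # the same URL always yields the same path
```

Because the file name depends only on the URL, a changed file on the server would not be noticed until the cached copy is deleted.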

#### Save the text in a variable

In [3]:
GREAT_GATSBY_URL = "https://www.gutenberg.org/cache/epub/64317/pg64317.txt"

great_gatsby_text = fetch_text(GREAT_GATSBY_URL)

✅ Text fetched.


#### Statistics about the data

In [4]:
def print_text_stats(text):
  num_chars = len(text)

  lines = text.splitlines()
  num_lines = len(lines)

  num_words = 0
  for line in lines:
    words_in_line = line.split()
    num_words_in_line = len(words_in_line)
    num_words += num_words_in_line

  print(f"Number of characters: {num_chars}")
  print(f"Number of lines: {num_lines}")
  print(f"Number of words: {num_words}")

print_text_stats(great_gatsby_text)

Number of characters: 290077
Number of lines: 6781
Number of words: 51257


In [5]:
def get_word_counts(text):
  word_counts = {}
  lines = text.splitlines()
  for line in lines:
    words = line.split()
    for word in words:
      word = word.lower()
      if word in word_counts:
        word_counts[word] += 1
      else:
        word_counts[word] = 1
  return word_counts

word_counts = get_word_counts(great_gatsby_text)
print(word_counts)
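
One limitation of get_word_counts is easier to see on a small input: splitting on whitespace leaves punctuation attached to words, so the same word can be counted under several different keys. The sketch below uses a made-up sentence and a hypothetical helper, naive_counts, that follows the same lowercase-and-split approach.

```python
def naive_counts(text):
    """Count lowercase, whitespace-separated words (punctuation stays attached)."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

sample = "Gatsby smiled. Gatsby, smiling, waved at Gatsby."
counts = naive_counts(sample)
print(counts)
# "gatsby", "gatsby," and "gatsby." end up as three separate keys
```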



In [7]:
# prompt: Make a new Code Cell below and use Gemini to create a new function called print_top_10_frequent_words that will call the above get_word_counts() and print only the top 10 frequent words.

import operator

def print_top_10_frequent_words(text):
    word_counts = get_word_counts(text)
    sorted_word_counts = dict(sorted(word_counts.items(), key=operator.itemgetter(1), reverse=True))
    top_10_words = list(sorted_word_counts.items())[:10]  # Get the top 10 words and counts
    for word, count in top_10_words:
        print(f"{word}: {count}")

print_top_10_frequent_words(great_gatsby_text)

the: 2543
and: 1540
a: 1439
of: 1224
to: 1186
i: 996
in: 841
he: 769
was: 751
that: 537
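
As an alternative sketch, the standard library's collections.Counter can do the counting and the sorting in one place. This is not the approach used in the cells above. The function name top_n_words and the sample sentence are assumptions made for illustration.

```python
from collections import Counter

def top_n_words(text, n=10):
    """Return the n most frequent lowercase, whitespace-separated words."""
    counts = Counter(text.lower().split())
    # most_common sorts by count, highest first
    return counts.most_common(n)

sample = "the cat and the hat and the bat"
for word, count in top_n_words(sample, n=2):
    print(f"{word}: {count}")
```

Making the number of words a parameter means the same function can print the top 10 or the top 15 without being rewritten.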


### Part 1: Analyzing "Great Gatsby" with Pre-processing

In our previous lab, we simply counted word frequencies without any sophisticated pre-processing, resulting in common but uninformative words (like "the", "and", "to") dominating our results.

You will use SpaCy, a powerful Natural Language Processing library, to pre-process the text before counting. SpaCy provides pre-trained models that can handle tokenization, stop word removal, and lemmatization automatically.





In [9]:
# install spacy library
!pip install spacy



In [10]:
# bring the spacy library into scope
import spacy

In [11]:
# Load a SpaCy model
nlp = spacy.load('en_core_web_sm')

Explanation:

spacy: This is the spaCy library, a popular and efficient NLP library in Python.

load(): This function loads a pre-trained NLP model.

'en_core_web_sm': This is the name of the small English-language model. It's a lightweight model that includes:

*   Tokenization (splitting text into words, punctuation, etc.)
*   Part-of-speech (POS) tagging
*   Named entity recognition (NER), among other components

nlp: This variable now holds the loaded model.

In [12]:
def word_tokenization_normalization(text):

    text = text.lower() # lowercase
    doc = nlp(text)     # loading text into model

    words_normalized = []
    for word in doc:
        if word.text != '\n' \
        and not word.is_stop \
        and not word.is_punct \
        and not word.like_num \
        and len(word.text.strip()) > 2:
            word_lemmatized = str(word.lemma_)
            words_normalized.append(word_lemmatized)

    return words_normalized

This function takes input text and prepares it for our goal of extracting meaningful words by 'cleaning' it. The text is first converted to lowercase. SpaCy then breaks the text into individual tokens. The function goes through each token and drops newline tokens, stop words (common words such as "the"), punctuation marks, number-like tokens, and tokens shorter than 3 characters. Each remaining word is lemmatized, meaning it is changed to its base dictionary form (for example, "said" becomes "say"). Finally, the function returns the results in the words_normalized list.
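
To make the filtering rules concrete without loading a model, here is a toy sketch in plain Python. It is not how SpaCy works internally: the stop-word set and the lemma table are tiny hand-written stand-ins (assumptions for this example), whereas SpaCy supplies these from its language data and trained pipeline.

```python
import re

# Hand-written stand-ins for illustration only; SpaCy provides far larger versions.
TOY_STOP_WORDS = {"the", "and", "was", "at", "his"}
TOY_LEMMAS = {"looked": "look", "houses": "house", "said": "say"}

def toy_normalize(text):
    """Mimic the cleaning steps: lowercase, tokenize, filter, lemmatize."""
    text = text.lower()
    # Split into runs of word characters or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    words = []
    for token in tokens:
        if token in TOY_STOP_WORDS:      # drop common words
            continue
        if not token[0].isalnum():       # drop punctuation
            continue
        if token.isdigit():              # drop number tokens
            continue
        if len(token) <= 2:              # drop tokens shorter than 3 characters
            continue
        words.append(TOY_LEMMAS.get(token, token))  # map to base form if known
    return words

print(toy_normalize("Gatsby looked at the 2 houses, and Tom said no."))
# → ['gatsby', 'look', 'house', 'tom', 'say']
```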

In [13]:
def word_tokenization_normalization(text):

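    # Use the text argument instead of the global great_gatsby_text.
    # The text is not lowercased here, so proper nouns such as "Gatsby" keep their capitalization.
    doc = nlp(text)     # loading text into model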

    words_normalized = []
    for word in doc:
        if word.text != '\n' \
        and not word.is_stop \
        and not word.is_punct \
        and not word.like_num \
        and len(word.text.strip()) > 2:
            word_lemmatized = str(word.lemma_)
            words_normalized.append(word_lemmatized)

    return words_normalized

In [14]:
# prompt: create a new function called create_word_frequency_dict from the result you got from the previous task. This function should return a word frequency dictionary.

def create_word_frequency_dict(words_normalized_list):
  """
  Creates a word frequency dictionary from a list of normalized words.

  Args:
    words_normalized_list: A list of cleaned and normalized words.

  Returns:
    A dictionary where keys are words and values are their frequencies.
  """
  word_frequency = {}
  for word in words_normalized_list:
    word_frequency[word] = word_frequency.get(word, 0) + 1
  return word_frequency

In [15]:
# prompt: create a new function called print_top_words to print out the result

def print_top_words(word_frequency_dict, num_words=15):
  """
  Prints the top most frequent words from a word frequency dictionary.

  Args:
    word_frequency_dict: A dictionary where keys are words and values are their frequencies.
    num_words: The number of top words to print (default is 15).
  """
  sorted_word_frequency = sorted(word_frequency_dict.items(), key=operator.itemgetter(1), reverse=True)
  top_words = sorted_word_frequency[:num_words]

  print(f"Top {num_words} most frequent words:")
  for word, count in top_words:
    print(f"{word}: {count}")

# Example usage: normalize the text, count the words, then print the top 15
words_normalized = word_tokenization_normalization(great_gatsby_text)
word_frequency = create_word_frequency_dict(words_normalized)
print_top_words(word_frequency)

Top 15 most frequent words:
Gatsby: 266
say: 253
come: 207
Tom: 191
go: 190
Daisy: 184
look: 178
know: 173
man: 157
like: 128
think: 120
get: 104
hand: 104
house: 103
little: 101


### Part 2: Text Analysis of your choice with pre-processing

The top words used in The Great Gatsby by F. Scott Fitzgerald are Gatsby, say, come, Tom, go, Daisy, look, know, man, like, think, get, hand, house, and little. These words suggest that the book is about human actions, feelings, and personalities. Most of the top words are verbs, which makes me think the characters do a lot in this book. Character names also appear often, which leads me to believe the book is about how the characters interact with each other. I'm a fan of this book and have read it before, and I do believe it analyzes human behavior and the American Dream.

We need text pre-processing because it eliminates filler words that carry little meaning, which gives people more precise data to analyze. Without it, it would be harder to identify the common words and themes in a piece of text. For example, in both texts used, words like 'the' and 'and' came up a lot, and they don't give readers an idea of what the text's message is. After pre-processing, more substantial words emerge. The words that show up in Pride and Prejudice fit that book ('girl', 'daughter'), and the words here fit Gatsby as well ('Daisy', 'hand').

### Deliverable

Download both notebooks from the File menu (below the name of the file) by choosing Download > Download .ipynb, then submit them.

1. CS110_Project6_Part1.ipynb with all TODO tasks done
2. CS110_Project6_Part2.ipynb with the write-up after producing the 15 most frequently used meaningful words from a text of your choice