<a href="https://colab.research.google.com/github/malehzja/lis4693/blob/main/lab-2/Lab_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 2: Text Pre-Processing using spaCy

In this lab assignment, we will learn to perform some basic text pre-processing using spaCy.

*Note: Before starting this lab assignment, please complete the Introduction to spaCy notebook*

## Learning Objectives

In this exercise, you will:

- Load and process your own text file (transcript.txt)
- Split text into sentences
- Count words and sentences
- Find frequently used words
- Use spaCy’s PhraseMatcher to find specific phrases
- Understand the difference between blank pipelines and full pipelines

This exercise builds directly on concepts discussed in the precursor notebook, "Introduction to spaCy".

## Install spaCy

In [1]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 26.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
import spacy

Importing and installing spaCy

## Load and read your file from GitHub


In [3]:
import requests

url = "https://raw.githubusercontent.com/malehzja/lis4693/refs/heads/main/lab-1/transcript-2.txt"
response = requests.get(url)
response.raise_for_status() # Raise an exception for HTTP errors
text = response.text

In [6]:
print("Number of characters:", len(text)) # print number of characters

print(text[:100])   # print first 100 characters

Number of characters: 6029
Should
I get the laundry going first?
Hello everyone and welcome back to my
channel. I'm going to do


**TASK 1: Load and read your transcript.txt file from lab-1 from GitHub repo directly**

*NOTE: My initial transcript was in Korean so I made an English one and named it transcript-2.*

**TASK 2: How many characters are in your transcript? Print the first 100 characters.**  
6029 characters in total.

## Sentence Segmentation

Create a blank spaCy pipeline and add the sentencizer component.

In [9]:
# Blank English pipeline (tokenizer only) plus a rule-based sentence splitter
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp(text)

sentences = list(doc.sents)

print("Number of sentences:", len(sentences))

print("\nFirst 3 sentences:")
for sent in sentences[:3]:
    print(sent)

Number of sentences: 124

First 3 sentences:
Should
I get the laundry going first?

Hello everyone and welcome back to my
channel.
I'm going to do a Sunday reset
because I've been holding off a lot of
cleaning.


**TASK 3: Perform sentence segmentation using the blank pipeline**

How many sentences are in your transcript?  
124.

Print the first 3 sentences.  
*NOTE: I changed from printing the first 5 to 3.*
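
The sentencizer splits on sentence-final punctuation rather than on line breaks, which is why the caption-style line breaks in the transcript stay inside a sentence. A minimal sketch of that behavior, assuming a short sample adapted from the transcript's opening lines (`nlp_demo`, `sample`, and `demo_sentences` are illustrative names, not part of the lab):

```python
import spacy

# Blank pipeline with only the rule-based sentence splitter added
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")

# Sample text with a line break in the middle of the second sentence
sample = (
    "Should I get the laundry going first? "
    "Hello everyone and welcome back to my\nchannel."
)
demo_doc = nlp_demo(sample)

# Collect each sentence as plain text
demo_sentences = [sent.text.strip() for sent in demo_doc.sents]

print("Number of sentences:", len(demo_sentences))
for sentence in demo_sentences:
    print(repr(sentence))
```

Printing with `repr` makes the embedded `\n` visible, so it is easy to see that the line break did not start a new sentence.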

## Word Count and Token Analysis

Let's count total words and unique words in your text file.

In [10]:
# Keep alphabetic tokens only (drops punctuation and numbers) and lowercase them
words = [token.text.lower() for token in doc if token.is_alpha]

print("Total words:", len(words))
print("Unique words:", len(set(words)))

Total words: 1178
Unique words: 399


**TASK 4: Perform Word Count and Token Analysis**

What is the total number of words?  
1178.

What is the number of unique words?   
399.
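
The totals above depend on the `token.is_alpha` filter and on lowercasing. A small sketch of what the filter keeps and drops, assuming a made-up sample sentence (`nlp_demo`, `all_tokens`, and `alpha_words` are illustrative names):

```python
import spacy

nlp_demo = spacy.blank("en")
demo_doc = nlp_demo("I'm going to do a Sunday reset, and it's 15% off!")

# Every token the tokenizer produced, including punctuation and numbers
all_tokens = [token.text for token in demo_doc]

# Only tokens made up entirely of letters, lowercased as in the lab cell
alpha_words = [token.text.lower() for token in demo_doc if token.is_alpha]

print("All tokens:", all_tokens)
print("Alphabetic words:", alpha_words)
print("Unique words:", len(set(alpha_words)))
```

Contraction pieces such as `'m` contain an apostrophe, so they are not counted as words under this filter, and neither are numbers or symbols.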

## Most Frequent Words

Let's find the top 10 most frequent words in your file.

In [11]:
from collections import Counter

word_freq = Counter(words)

print("Top 10 most frequent words:")
for word, count in word_freq.most_common(10):
    print(word, count)

Top 10 most frequent words:
i 54
the 43
to 34
you 34
a 33
it 32
this 30
and 26
so 26
is 21


**TASK 5: Find Most Frequent Words**

What is the most frequent word?   
The most frequent word is "i".

Why do you think this word appears frequently?    
I think the word "i" appears most frequently because Michelle Choi is speaking over the video in first-person narrative. It is about HER and a day in HER life.
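
All ten of the top words are common function words. One possible follow-up, not required by the lab, is to skip tokens that spaCy flags as stop words so that topic words surface instead. A minimal sketch, assuming a made-up sample sentence (`nlp_demo`, `content_words`, and `content_freq` are illustrative names):

```python
from collections import Counter

import spacy

nlp_demo = spacy.blank("en")
demo_doc = nlp_demo(
    "I think the laundry is the first thing to do, and the laundry is a lot."
)

# Keep alphabetic tokens that are not in spaCy's English stop word list
content_words = [
    token.text.lower()
    for token in demo_doc
    if token.is_alpha and not token.is_stop
]
content_freq = Counter(content_words)

print(content_freq.most_common(3))
```

`token.is_stop` is a lexical attribute, so it is available in a blank pipeline without loading a trained model.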

## Using Full spaCy Pipeline

Now use the full model for better linguistic analysis.
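
For contrast, a blank pipeline has no components until they are added explicitly, so it cannot produce named entities. A small sketch that inspects this (`blank_nlp`, `components_before`, and `components_after` are illustrative names):

```python
import spacy

# A blank pipeline contains only the language's tokenizer rules
blank_nlp = spacy.blank("en")
components_before = list(blank_nlp.pipe_names)

# Components are added one at a time; here, only sentence splitting
blank_nlp.add_pipe("sentencizer")
components_after = list(blank_nlp.pipe_names)

blank_doc = blank_nlp("My parents fly between New York and Korea.")

print("Blank pipeline components:", components_before)
print("After adding sentencizer:", components_after)
# No entity recognizer has run, so no entities are set on the Doc
print("Entities found:", len(blank_doc.ents))
```

Checking `pipe_names` on a loaded model such as `en_core_web_sm` in the same way should list its trained components, including the entity recognizer (`ner`).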

In [13]:
nlp2 = spacy.load("en_core_web_sm")

doc2 = nlp2(text)

print("Named Entities:")
for ent in doc2.ents:
    print(ent.text, ent.label_)

print("Total Named Entities:", len(doc2.ents))

Named Entities:
first ORDINAL
Sunday DATE
the day DATE
Dawn PERSON
Korea GPE
Amazon ORG
one CARDINAL
POV ORG
the Insta 360 Go
Ultra LAW
4K QUANTITY
POV ORG
POV ORG
the Insta 360 Go Ultra LAW
POV ORG
Valentine PERSON
Michelle
Choy PERSON
15% PERCENT
Today DATE
Can This Love Be Translated WORK_OF_ART
Today DATE
New York GPE
Korea GPE
nitpicky GPE
Blair Osman PERSON
one CARDINAL
Japan GPE
Korea GPE
Toyota ORG
the Kia Carnival FAC
a third row DATE
Kia ORG
Talib ORG
POV ORG
Ago Ultra PERSON
Michelle Choy PERSON
15% PERCENT
Total Named Entities: 36


**TASK 6: Run full spaCy pipeline**

How many named entities were found?   
36.  
*NOTE: To reduce user error I added a line of code to count the number of entities.*

What types of entities appear? (PERSON, ORG, DATE, etc.)  
The types of entities that appear are: ORDINAL, DATE, PERSON, GPE, ORG, CARDINAL, LAW, QUANTITY, PERCENT, WORK_OF_ART, and FAC.
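
Entity types can also be tallied in code instead of read off the printed list. A minimal sketch using `Counter` on `ent.label_`; so that it runs without downloading a model, it assumes a made-up sentence whose entity spans are labeled by hand (these labels stand in for model predictions and are not real output):

```python
from collections import Counter

import spacy
from spacy.tokens import Span

nlp_demo = spacy.blank("en")
demo_doc = nlp_demo("My parents fly between New York and Korea on Sunday.")

# Hand-written labels standing in for a model's predictions (demo assumption)
demo_doc.ents = [
    Span(demo_doc, 4, 6, label="GPE"),    # "New York"
    Span(demo_doc, 7, 8, label="GPE"),    # "Korea"
    Span(demo_doc, 9, 10, label="DATE"),  # "Sunday"
]

# Count how many entities carry each label
label_counts = Counter(ent.label_ for ent in demo_doc.ents)
for label, count in label_counts.most_common():
    print(label, count)
```

In the notebook, the same `Counter(ent.label_ for ent in doc2.ents)` expression would summarize the transcript's entities, and `spacy.explain("GPE")` returns a short description of a label.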

## PhraseMatcher

In [16]:
from spacy.matcher import PhraseMatcher

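# Match on the lowercase form of each token so capitalization is ignored
matcher = PhraseMatcher(nlp2.vocab, attr="LOWER")

phrases = ["new york", "life", "michelle choi"]

# make_doc only tokenizes, which is all the matcher needs for LOWER patterns
patterns = [nlp2.make_doc(p) for p in phrases]

matcher.add("TOPIC_PHRASES", patterns)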

matches = matcher(doc2)

print("Matches found:")
for match_id, start, end in matches:
    print(doc2[start:end])
    print("Sentence:", doc2[start].sent)

Matches found:
life
Sentence: If you
ever wanted to start vlogging but you
held back because you thought your life
wasn't interesting enough, think again.

life
Sentence: This camera lets you
capture your life through unique POV
angles and tell stories in a way that
feels creative, personal, and
effortless.
New York
Sentence: My parents will be coming
back and forth from New York and Korea
and they'll be staying for like more
extended amount of time.
life
Sentence: I think right now at
this point in my life, I don't really
care to have a fancy car.


**TASK 7: Use PhraseMatcher**

When you selected your YouTube video in Lab-1, what topic or subject were you interested in? Based on that topic, identify three specific phrases that are directly relevant to your search. For example, if your video was about information retrieval, relevant phrases might include "information retrieval," "text mining," and "data science."

When I selected my YouTube video in Lab-1, I was interested in videos by Michelle Choi about everyday life in New York. Three specific phrases that were directly relevant to my search were: "new york", "life", and "michelle choi".
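
With `attr="LOWER"`, the matcher ignores capitalization but still requires every token to match exactly, so a spelling variant such as "Choy" does not match the pattern "michelle choi". A small sketch of this, assuming a made-up two-sentence sample and a blank pipeline (`nlp_demo`, `demo_matcher`, and `found` are illustrative names):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp_demo = spacy.blank("en")

# Compare lowercase token forms so "New York" matches the pattern "new york"
demo_matcher = PhraseMatcher(nlp_demo.vocab, attr="LOWER")
phrases = ["new york", "life", "michelle choi"]
demo_matcher.add("TOPIC_PHRASES", [nlp_demo.make_doc(p) for p in phrases])

demo_doc = nlp_demo(
    "My parents fly between New York and Korea. Michelle Choy films her life."
)

# Each match is (match_id, start_token, end_token)
found = [demo_doc[start:end].text for _, start, end in demo_matcher(demo_doc)]
print(found)
```

If a name can be transcribed in more than one way, adding each spelling as its own phrase is one simple way to catch all of them.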

**TASK 8: At the end of your Colab notebook, create a new text cell and write a brief reflection for this assignment in a few sentences addressing the following:**

What went well?   
Two things that went well were the overall **run** of code and the **change** of code. I never encountered any errors from the install to the phrase matcher. Also, when making changes, it was easy to know what to edit to fit the given prompt.

What did not go well or what challenges you encountered?  
Two challenges that I encountered were accessing the notebook in Colab and describing outputs. When I tried to open the notebook from the link pasted in the Lab Assignment, it told me that the notebook could not be found. I realized that the address was wrong: the link pointed to the top level of Professor Lamba's main repository, but the notebook was actually in a folder. As for describing outputs, I struggled when asked about the entities and had to re-read the question multiple times before I understood it.

# EXERCISE

Open a new Google Colab notebook and complete the tasks below. As you work, add brief explanations using the **Text (Markdown) cells** throughout your notebook to describe what you are doing.

1. Make a new folder named `lab-2` in your `lis4693` or `lis5693` repo on GitHub. **[1 Point]**

2. Complete the following tasks:


**TASK 1**: Load and read your `transcript.txt` file from lab-1 from GitHub repo directly **[1 Point]**

**TASK 2**: How many characters are in your transcript? Print the first 100 characters. **[1 Point]**

**TASK 3**: Perform sentence segmentation using the blank pipeline
  - How many sentences are in your transcript? **[0.5 Point]**
  - Print the first 3 sentences **[0.5 Point]**

**TASK 4**: Perform Word Count and Token Analysis
  - What is the total number of words? **[0.5 Point]**
  - What is the number of unique words? **[0.5 Point]**

**TASK 5**: Find Most Frequent Words
  - What is the most frequent word? **[0.5 Point]**
  - Why do you think this word appears frequently? **[1 Point]**

**TASK 6**: Run full spaCy pipeline
  - How many named entities were found? **[0.5 Point]**
  - What types of entities appear? (PERSON, ORG, DATE, etc.) **[0.5 Point]**

**TASK 7**: Use PhraseMatcher **[1 Point]**

When you selected your YouTube video in Lab-1, what topic or subject were you interested in? Based on that topic, identify three specific phrases that are directly relevant to your search. For example, if your video was about information retrieval, relevant phrases might include "information retrieval," "text mining," and "data science."

Make sure the video you selected in Lab-1 was clearly related to your chosen topic and was the same video for which you downloaded the transcript in Lab-1.

**TASK 8**: At the end of your Colab notebook, create a new text cell and write a brief reflection for this assignment in a few sentences addressing the following **[2 Points]**:
 - What went well?
 - What did not go well or what challenges you encountered?

3. Push your Google Colab file to your `lab-2` GitHub repo from Colab. *No points will be given if you upload it to GitHub directly!* **[1 Point]**
4. Share the link to your `lab-2` GitHub repo for this lab assignment on CANVAS for credit. **[0.5 Point]**
