<a href="https://colab.research.google.com/github/manika-lamba/SP26-LIS4_5693/blob/main/Lab_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 2: Text Pre-Processing using spaCy

In this lab assignment, we will learn to perform some basic text pre-processing using spaCy.

*Note: Before starting this lab assignment, please complete the Introduction to spaCy notebook.*

## Learning Objectives

In this exercise, you will:

- Load and process your own text file (transcript.txt)
- Split text into sentences
- Count words and sentences
- Find frequently used words
- Use spaCy’s PhraseMatcher to find specific phrases
- Understand the difference between blank pipelines and full pipelines

This exercise builds directly on concepts discussed in the precursor notebook, "Introduction to spaCy".

## Install spaCy

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import spacy

## Load and read your file from GitHub


In [None]:
import requests

url = "https://raw.githubusercontent.com/manika-lamba/SP26-LIS4_5693/refs/heads/main/lab-2/wiki_us.txt"
response = requests.get(url, timeout=30)  # Give up after 30 seconds instead of hanging
response.raise_for_status()  # Raise an exception for HTTP errors
text = response.text

In [None]:
print("Number of characters:", len(text)) # print number of characters

print(text[:300])   # print first 300 characters

Number of characters: 716
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country located in North America. It consists of 50 states, a federal district, five major unincorporated territories, nine Minor Outlying Islands, and 326 Indian reservations. It is the t


## Sentence Segmentation

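Create a blank spaCy pipeline and add the rule-based `sentencizer` component, which splits sentences on punctuation and does not need a trained model.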

In [None]:
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp(text)

sentences = list(doc.sents)

print("Number of sentences:", len(sentences))

print("\nFirst 5 sentences:")
for sent in sentences[:5]:
    print(sent)

Number of sentences: 7

First 5 sentences:
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, nine Minor Outlying Islands, and 326 Indian reservations.
It is the third-largest country by both land and total area.
The United States shares land borders with Canada to its north and with Mexico to its south.
It has maritime borders with the Bahamas, Cuba, Russia, and other nations.


## Word Count and Token Analysis

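Let's count the total words and the unique words in the text file. The `token.is_alpha` filter keeps only alphabetic tokens, so numbers and punctuation are not counted.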

In [None]:
words = [token.text.lower() for token in doc if token.is_alpha]

print("Total words:", len(words))
print("Unique words:", len(set(words)))

Total words: 116
Unique words: 68
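
To see what the `is_alpha` filter keeps and drops, here is a minimal plain-Python sketch. It assumes a hypothetical, hand-written token list in place of a spaCy `Doc`, and relies on `str.isalpha()`, which the spaCy documentation describes as equivalent to `token.is_alpha`.

```python
# Hypothetical token list standing in for the tokens of a spaCy Doc.
tokens = ["The", "U.S.", "has", "50", "states", ",", "and", "the", "states", "vary", "."]

# Keep alphabetic tokens only and lowercase them, as in the cell above.
words = [t.lower() for t in tokens if t.isalpha()]

# Tokens that the filter removes: abbreviations with periods, numbers, punctuation.
dropped = [t for t in tokens if not t.isalpha()]

print("Kept:", words)
print("Dropped:", dropped)
print("Total words:", len(words))
print("Unique words:", len(set(words)))
```

Lowercasing before counting is what makes "The" and "the" count as one unique word.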


## Most Frequent Words

Let's find the 10 most frequent words in the file.

In [None]:
from collections import Counter

word_freq = Counter(words)

print("Top 10 most frequent words:")
for word, count in word_freq.most_common(10):
    print(word, count)

Top 10 most frequent words:
the 9
and 6
is 5
states 4
it 4
with 4
united 3
of 3
america 3
or 3
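
The top of this list is dominated by function words such as "the" and "and". One common follow-up is to drop such stop words before ranking. The sketch below reuses the counts printed above and assumes a small, hand-written stop-word set chosen only for illustration; with spaCy you could instead filter on the `token.is_stop` attribute when building the word list.

```python
from collections import Counter

# Counts copied from the "Top 10 most frequent words" output above.
word_freq = Counter({
    "the": 9, "and": 6, "is": 5, "states": 4, "it": 4,
    "with": 4, "united": 3, "of": 3, "america": 3, "or": 3,
})

# Assumption: a tiny illustrative stop-word set, not spaCy's built-in list.
stop_words = {"the", "and", "is", "it", "with", "of", "or"}

# Keep only the words that are not in the stop-word set.
content_freq = Counter({w: c for w, c in word_freq.items() if w not in stop_words})

print("Most frequent content words:")
for word, count in content_freq.most_common():
    print(word, count)
```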


## Using Full spaCy Pipeline

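Now load the full `en_core_web_sm` pipeline for richer linguistic analysis. Unlike the blank pipeline, which only tokenizes the text (plus the `sentencizer` we added), this model ships with trained components, including a named entity recognizer.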

In [None]:
nlp2 = spacy.load("en_core_web_sm")

doc2 = nlp2(text)

print("Named Entities:")
for ent in doc2.ents:
    print(ent.text, ent.label_)

Named Entities:
The United States of America GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
nine CARDINAL
Minor Outlying Islands ORG
326 CARDINAL
Indian NORP
third ORDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
Russia GPE
over 331 million MONEY
third ORDINAL
Washington GPE
D.C. GPE
New York City GPE
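
To summarize a list like this, it can help to tally the labels. The sketch below is one possible approach: it copies the (text, label) pairs from the output above into a plain list so that it runs without a model. In the notebook you would build the same list with `[(ent.text, ent.label_) for ent in doc2.ents]`.

```python
from collections import Counter

# (text, label) pairs copied from the "Named Entities" output above.
entities = [
    ("The United States of America", "GPE"), ("USA", "GPE"),
    ("the United States", "GPE"), ("U.S.", "GPE"), ("US", "GPE"),
    ("America", "GPE"), ("North America", "LOC"), ("50", "CARDINAL"),
    ("five", "CARDINAL"), ("nine", "CARDINAL"),
    ("Minor Outlying Islands", "ORG"), ("326", "CARDINAL"),
    ("Indian", "NORP"), ("third", "ORDINAL"), ("The United States", "GPE"),
    ("Canada", "GPE"), ("Mexico", "GPE"), ("Bahamas", "GPE"),
    ("Cuba", "GPE"), ("Russia", "GPE"), ("over 331 million", "MONEY"),
    ("third", "ORDINAL"), ("Washington", "GPE"), ("D.C.", "GPE"),
    ("New York City", "GPE"),
]

# Count how many entities carry each label.
label_counts = Counter(label for _, label in entities)

print("Total entities:", len(entities))
for label, count in label_counts.most_common():
    print(label, count)
```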


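## PhraseMatcher

Use spaCy's `PhraseMatcher` to find specific phrases in the processed text. Setting `attr="LOWER"` compares lowercased token text, so the matching is case-insensitive.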

In [None]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp2.vocab, attr="LOWER")

phrases = ["united states", "america", "country"]

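# make_doc only tokenizes, which is all that is needed for matching on LOWER
patterns = [nlp2.make_doc(p) for p in phrases]

# The first argument is just a label for this group of patterns
matcher.add("US_TERMS", patterns)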

matches = matcher(doc2)

print("Matches found:")
for match_id, start, end in matches:
    print(doc2[start:end])
    print("Sentence:", doc2[start].sent)

Matches found:
United States
Sentence: The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country located in North America.
America
Sentence: The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country located in North America.
United States
Sentence: The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country located in North America.
America
Sentence: The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country located in North America.
country
Sentence: The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country located in North America.
America
Sentence: The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country located in Nort
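
To build intuition for what matching on `attr="LOWER"` means, here is a rough plain-Python approximation. It is not how spaCy implements `PhraseMatcher`; it assumes a hypothetical helper, `find_phrases`, and a whitespace-split sentence in place of real spaCy tokens.

```python
def find_phrases(tokens, phrases):
    """Return (start, end) token spans whose lowercased text equals a phrase."""
    lowered = [t.lower() for t in tokens]
    targets = [p.lower().split() for p in phrases]
    spans = []
    for start in range(len(lowered)):
        for target in targets:
            end = start + len(target)
            # Compare the lowercased token window with the lowercased phrase.
            if lowered[start:end] == target:
                spans.append((start, end))
    return spans


# Hypothetical sentence, split on whitespace instead of using spaCy's tokenizer.
tokens = "The United States of America is a country in North America".split()
spans = find_phrases(tokens, ["united states", "america", "country"])

for start, end in spans:
    print(" ".join(tokens[start:end]))
```

As with the real matcher, each result is a pair of token indices, so the original casing can be recovered by slicing the token list.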

# EXERCISE

Open a new Google Colab notebook and complete the tasks below. As you work, add brief explanations using the **Text (Markdown) cells** throughout your notebook to describe what you are doing.

1. Make a new folder named `lab-2` in your `lis4693` or `lis5693` repo on GitHub. **[1 Point]**

2. Complete the following tasks:


**TASK 1**: Load and read your `transcript.txt` file from Lab-1 directly from your GitHub repo **[1 Point]**

**TASK 2**: How many characters are in your transcript? Print the first 100 characters. **[1 Point]**

**TASK 3**: Perform sentence segmentation using the blank pipeline
  - How many sentences are in your transcript? **[0.5 Point]**
  - Print the first 3 sentences **[0.5 Point]**

**TASK 4**: Perform Word Count and Token Analysis
  - What is the total number of words? **[0.5 Point]**
  - What is the number of unique words? **[0.5 Point]**

**TASK 5**: Find Most Frequent Words
  - What is the most frequent word? **[0.5 Point]**
  - Why do you think this word appears frequently? **[1 Point]**

**TASK 6**: Run full spaCy pipeline
  - How many named entities were found? **[0.5 Point]**
  - What types of entities appear? (PERSON, ORG, DATE, etc.) **[0.5 Point]**

**TASK 7**: Use PhraseMatcher **[1 Point]**

When you selected your YouTube video in Lab-1, what topic or subject were you interested in? Based on that topic, identify three specific phrases that are directly relevant to your search. For example, if your video was about information retrieval, relevant phrases might include "information retrieval," "text mining," and "data science."

Make sure the video you selected in Lab-1 was clearly related to your chosen topic and was the same video for which you downloaded the transcript in Lab-1.

**TASK 8**: At the end of your Colab notebook, create a new text cell and write a brief reflection for this assignment in a few sentences addressing the following **[2 Points]**:
 - What went well?
 - What did not go well, or what challenges did you encounter?

3. Push your Google Colab file to your `lab-2` GitHub repo from Colab. *No points will be given if you upload it to GitHub directly!* **[1 Point]**
4. Share the link to your `lab-2` GitHub repo for this lab assignment on CANVAS for credit. **[0.5 Point]**
