# Basics of Natural Language Processing (NLP) Take Home Exercise #



Use the following link to find open source data sets to complete take-home exercises.

[Data Sets](https://opendatascience.com/20-open-datasets-for-natural-language-processing/)

# Run this code in the beginning to limit the output size of the cells

In [4]:
from IPython.display import display, Javascript

def resize_colab_cell():
    # Change maxHeight to adjust the maximum height of the output
    display(Javascript('google.colab.output.setIframeHeight(0, true, {maxHeight: 400})'))

# Apply the limit to the entire notebook by calling the function before each cell runs
get_ipython().events.register('pre_run_cell', resize_colab_cell)

In [5]:
import pandas as pd
questions = pd.read_csv("Datasets/JEOPARDY_CSV.csv")

In [6]:
questions.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [7]:
# sample() returns 1 random row by default; ask for 10 here
questions.sample(10)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
149244,779,1988-01-14,Jeopardy!,REPUBLICANS,$300,This Tennessean replaced Donald Regan as Reaga...,Howard Baker
198173,3041,1997-11-17,Double Jeopardy!,DEMOCRATS,"$1,500",This Ohioan is flying high as the ranking Demo...,John Glenn
192559,4313,2003-05-07,Double Jeopardy!,DOGGY BAG,$800,These keg-carriers were once known simply as h...,Saint Bernards
12838,3780,2001-01-26,Double Jeopardy!,HOW DOES YOUR GARDEN GROW?,$1000,This fertilizer element was leached out of woo...,Potassium
132560,2972,1997-07-01,Jeopardy!,THE BIBLE,$100,"The Lord's Prayer says, ""And lead us not into ...",Deliver us from evil
144198,5811,2009-12-14,Double Jeopardy!,PORTRAIT OF A LADY,$800,"Aptly, her name translates as ""a beautiful wom...",Nefertiti
55908,5673,2009-04-15,Jeopardy!,MOVIES ANY TIME,$600,Attila the Hun & Sacajawea are characters in t...,Night at the Museum
27580,3738,2000-11-29,Jeopardy!,"""TRIPLE"" JEOPARDY!",$200,Tantalizing variety of lunch seen here: (Dagwo...,Triple-decker sandwich
161950,5395,2008-02-08,Jeopardy!,AFRICAN HISTORY,$800,In 1498 this Portuguese explorer rounded South...,(Vasco) da Gama
168616,2932,1997-05-06,Double Jeopardy!,POTPOURRI,$400,The dromedary type of this mammal has 1 hump; ...,Camel


In [8]:
s = questions['Question'][149342]
print(s)

len(s.split())

He was born in Norwich, Connecticut in 1741 & died in London in 1801


14

### 2. Basic Analysis

Perform basic text analysis on the collected text using the spaCy ([spacy.io](http://spacy.io)) library. Try different string manipulations.

In [10]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [11]:
#Process text
doc = nlp(s)

In [12]:
#Extract entities

for entity in doc.ents:
    print(f"Entity: {entity.text}, Label: {entity.label_}")

Entity: Norwich, Label: GPE
Entity: Connecticut, Label: GPE
Entity: 1741, Label: DATE
Entity: London, Label: GPE
Entity: 1801, Label: DATE


### 3. Tokenizer
Create a custom tokenizer in Python that handles:
*   Contractions (e.g., "don't" → "do n't")
*   Keeps punctuation as separate tokens
*   Splits hyphenated words (e.g., "state-of-the-art" → "state of the art")

Compare its results with NLTK's word_tokenize on any sample paragraph and the following examples:
"New York-based company", "It's a beautiful day!", "https://www.example.com"

What differences do you see? What are the advantages and limitations of each approach?

In [14]:
# https://stackoverflow.com/questions/4998629/split-string-with-multiple-delimiters-in-python
# https://www.geeksforgeeks.org/python-split-string-on-all-punctuations/
# https://stackoverflow.com/questions/14576158/how-to-tokenize-contractions-in-python
import re

def custom_tokenize(string_to_split):
    # Regex pattern splits on substrings " " and "-"
    #split_string = re.split(' |-', string_to_split)
    
    # Replace '-' with white space
    processed_string = string_to_split.replace('-', ' ')

    #processed_string = re.findall(r"\w+(?=n't)|n't|\w+(?=')|'\w+|\w+", processed_string)
    
    # findall() returns runs of word characters (\w+) or runs of punctuation ([^\s\w]+)
    processed_string = re.findall( r'\w+|[^\s\w]+', processed_string)
    return processed_string

In [15]:
print(custom_tokenize(s))
print(custom_tokenize("New York-based company"))
print(custom_tokenize("It's a beautiful day!"))
print(custom_tokenize("https://www.example.com"))

['He', 'was', 'born', 'in', 'Norwich', ',', 'Connecticut', 'in', '1741', '&', 'died', 'in', 'London', 'in', '1801']
['New', 'York', 'based', 'company']
['It', "'", 's', 'a', 'beautiful', 'day', '!']
['https', '://', 'www', '.', 'example', '.', 'com']


In [16]:
from nltk.tokenize import word_tokenize
print(word_tokenize(s))
print(word_tokenize("New York-based company"))
print(word_tokenize("It's a beautiful day!"))
print(word_tokenize("https://www.example.com"))

['He', 'was', 'born', 'in', 'Norwich', ',', 'Connecticut', 'in', '1741', '&', 'died', 'in', 'London', 'in', '1801']
['New', 'York-based', 'company']
['It', "'s", 'a', 'beautiful', 'day', '!']
['https', ':', '//www.example.com']
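The custom tokenizer above splits "It's" into three tokens and leaves contractions unhandled. Below is a minimal sketch of one way to cover all three requirements with a single regex. The name `tokenize_with_contractions` and the list of clitics ('s, 're, 've, 'll, 'd, 'm) are assumptions made for illustration, not part of the original notebook.

```python
import re

# Alternatives are tried left to right at each position.
TOKEN_PATTERN = re.compile(
    r"\w+(?=n't\b)"              # stem before n't, e.g. "do" in "don't"
    r"|n't\b"                    # the n't clitic itself
    r"|'(?:s|re|ve|ll|d|m)\b"    # other common clitics (assumed list)
    r"|\w+"                      # ordinary words
    r"|[^\w\s]",                 # each punctuation character on its own
    re.IGNORECASE,
)

def tokenize_with_contractions(text):
    # Hyphens become spaces so hyphenated words split into their parts.
    return TOKEN_PATTERN.findall(text.replace("-", " "))

print(tokenize_with_contractions("don't"))
print(tokenize_with_contractions("It's a beautiful day!"))
print(tokenize_with_contractions("New York-based company"))
```

Like the original attempt, this still breaks URLs apart at every punctuation character, so it shares that limitation.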


### 4. Regex

Try writing your own regex that can capture citations in text, e.g. (Horning, 2022).

In [18]:
sample_text = 'This is a sample reference (Horning, 2022). There is another reference(Shibani, 2024) located here.'

In [19]:
# First attempt: the greedy .* runs from the first '(' to the last ')'
re.findall(r'\(.*\)', sample_text)

['(Horning, 2022). There is another reference(Shibani, 2024)']

In [20]:
# Got all the years
re.findall(r'\d\d\d\d', sample_text)

['2022', '2024']

In [21]:
# Use \s for whitespace (remember your C++ carriage returns, tabs, etc.)
# \d{4} means exactly 4 digits
# Use the lazy .*? to get the shortest match, not the greedy .* alone!

re.findall(r'\(.*?,\s\d{4}\)', sample_text)

['(Horning, 2022)', '(Shibani, 2024)']
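The lazy pattern can still start at an unrelated opening parenthesis and run on to the next citation. The sketch below tightens it by forbidding parentheses, digits, and commas inside the author part, and returns author and year as separate groups. `CITATION_PATTERN` and `find_citations` are hypothetical names, and the pattern assumes citations of the form "(Author, YYYY)".

```python
import re

# "(" + author text without parentheses, digits, or commas + "," + 4-digit year + ")"
CITATION_PATTERN = re.compile(r"\(([^()\d,]+),\s*(\d{4})\)")

def find_citations(text):
    # Returns a list of (author, year) tuples.
    return CITATION_PATTERN.findall(text)

sample = ('This is a sample reference (Horning, 2022). '
          'There is another reference(Shibani, 2024) located here.')
print(find_citations(sample))
print(find_citations("An aside (see above) and a citation (Horning, 2022)."))
```

Multi-source citations such as "(Horning, 2022; Shibani, 2024)" are outside the assumed format and would need a different pattern.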

Extract URLs following a certain format (www., http://, or https://).

In [23]:
sample_url_text = 'This is a sample url https://www.google.com located here. Another link - http://www.google.com. A third(www.uts.edu.au)'

In [24]:
# Can definitely improve on this; the dots are escaped so they match a literal '.'
re.findall(r'(https?|www)(.*?)(\.com|\.au)', sample_url_text)

[('https', '://www.google', '.com'),
 ('http', '://www.google', '.com'),
 ('www', '.uts.edu', '.au')]
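The grouped pattern returns each URL in three pieces and only recognises two endings. A possible refinement, sketched under the assumption that a URL starts with `http://`, `https://`, or `www.` and runs until whitespace, a bracket, or a quote, returns whole URLs instead. `URL_PATTERN` and `find_urls` are hypothetical names.

```python
import re

# Scheme or "www." followed by anything that is not whitespace, a bracket, or a quote.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s()<>\"']+")

def find_urls(text):
    # Trim sentence punctuation that happens to follow a URL.
    return [match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)]

sample = ('This is a sample url https://www.google.com located here. '
          'Another link - http://www.google.com. A third(www.uts.edu.au)')
print(find_urls(sample))
```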

### 5. Word Frequency

Find the list of words that occur more than 10 times in a selected corpus.

Try using different forms of setup: no stopwords, custom stopwords, not removing punctuation, etc. and see what difference in results they produce.
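One way to compare these setups is to put the choices behind function parameters. The sketch below uses only the standard library. `frequent_words`, its parameters, and the simple regex tokenizer are assumptions for illustration, and the punctuation filter covers ASCII punctuation only.

```python
import re
import string
from collections import Counter

def frequent_words(texts, min_count=10, stop_words=(), keep_punctuation=False):
    """Return words occurring more than min_count times across texts."""
    stop_words = {w.lower() for w in stop_words}
    counts = Counter()
    for text in texts:
        if not isinstance(text, str):
            continue  # skip missing values such as NaN
        # Words (optionally with an apostrophe part) or single punctuation marks.
        for token in re.findall(r"\w+(?:'\w+)?|[^\w\s]", text.lower()):
            if token in stop_words:
                continue
            if not keep_punctuation and token in string.punctuation:
                continue
            counts[token] += 1
    return {word: n for word, n in counts.items() if n > min_count}

docs = ["The city, the state!"] * 6
print(frequent_words(docs, min_count=5))
print(frequent_words(docs, min_count=5, stop_words=["the"]))
print(frequent_words(docs, min_count=5, keep_punctuation=True))
```

Passing a column such as `questions['Question']` as `texts` should work the same way, since the function only iterates over the values.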


In [26]:
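# Convert the Series to a single string.
# Caution: to_string() returns the printed form, so the row index is included and long entries may be cut short with '...'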
text = questions['Question'].to_string()
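Because `to_string()` produces the display form of the Series, the resulting text carries the index labels and can shorten long entries according to pandas' `display.max_colwidth` option. That may explain tokens such as the digits, 'th', and 'thi' in the counts further down. A minimal alternative sketch joins the raw values instead; `join_text` is a hypothetical helper, not part of the original notebook.

```python
def join_text(values):
    """Join the string entries of an iterable (for example a pandas Series)."""
    # Non-string entries such as NaN are skipped.
    return " ".join(v for v in values if isinstance(v, str))

print(join_text(["first question", float("nan"), "second question"]))
```

With the notebook's data this would be called as `join_text(questions['Question'])`. The counts shown in the later cells were produced with `to_string()`, so they would differ.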

In [27]:
from collections import Counter

topic_words = [ z.lower() for y in
                [ x.split() for x in questions['Question'] if isinstance(x, str)]
                for z in y]
word_count_dict = dict(Counter(topic_words))
popular_words = sorted(word_count_dict, key = word_count_dict.get, reverse = True)

In [28]:
# Threshold raised to 1000 because the data set is huge
# Stopwords need removing ('the', for example); quotation marks are also stuck to some words and need stripping
# https://stackoverflow.com/questions/15861739/removing-objects-whose-counts-are-less-than-threshold-in-counter
Counter({k: c for k, c in word_count_dict.items() if c >= 1000})

Counter({'for': 35855,
         'the': 170363,
         'last': 2654,
         'years': 2638,
         'of': 112756,
         'his': 16911,
         'was': 29149,
         'under': 1416,
         'house': 1476,
         'this': 123422,
         'no.': 1436,
         'star': 1472,
         'at': 12405,
         'with': 17502,
         '&': 45040,
         'city': 5471,
         'in': 102022,
         'state': 4140,
         'has': 6005,
         'a': 86696,
         'year': 1543,
         'on': 25643,
         '"the': 9327,
         'company': 1579,
         'its': 8656,
         'president': 2707,
         'title': 3932,
         'an': 12884,
         'to': 50657,
         'south': 1722,
         "it's": 12348,
         'use': 1277,
         'named': 5650,
         'first': 10010,
         'we': 1714,
         'from': 19420,
         'it': 14278,
         'i': 4688,
         'take': 1069,
         'island': 2083,
         'now': 2123,
         'known': 3841,
         'if': 3352,
      

In [29]:
# let's remove stopwords first
from nltk.corpus import stopwords
#stopwords
stop_words=stopwords.words("english")

In [30]:
#create word tokens
tokenized_words=word_tokenize(text)

In [31]:
#Create a new variable to store filtered tokens
filtered_tokens=[]
for w in tokenized_words:
    if w not in stop_words:
        # add all filtered tokens excluding stopwords in this list below
        filtered_tokens.append(w)

import string
# punctuations
punctuations=list(string.punctuation)
#Add custom punctuations to the list
punctuations.append("...")

#Create another variable to store all clean tokens
filtered_tokens_clean=[]
for i in filtered_tokens:
    if i not in punctuations:
        filtered_tokens_clean.append(i)

In [32]:
word_count_dict_clean = dict(Counter(filtered_tokens_clean))

In [33]:
Counter({k: c for k, c in word_count_dict_clean.items() if c >= 1000})
# The stopword check is case-sensitive, so capitalised stopwords such as "The" and "For" are still there. Need to lowercase the text before filtering.

Counter({'For': 1171,
         'last': 1341,
         'years': 1056,
         '1': 2043,
         'No': 1232,
         '2': 3847,
         'The': 26990,
         'city': 2668,
         'state': 1962,
         '3': 1866,
         'In': 22865,
         '``': 48990,
         "''": 40384,
         'th': 3521,
         '4': 1259,
         '5': 1015,
         'title': 2136,
         'This': 23402,
         'named': 2560,
         'first': 5501,
         'I': 4914,
         'On': 3338,
         'company': 1000,
         "'s": 33663,
         'A': 11239,
         'man': 2205,
         'one': 4897,
         'His': 1577,
         'c': 1295,
         'href=': 8355,
         'http': 7975,
         'It': 8553,
         'He': 4446,
         'author': 1071,
         'wrote': 2105,
         'book': 1246,
         'When': 1768,
         'seen': 2739,
         'got': 1013,
         'thi': 1272,
         'hit': 1145,
         "'ll": 1187,
         'French': 1405,
         'largest': 1046,
         'If': 

In [34]:
text_lower = text.lower()
tokenized_words=word_tokenize(text_lower)

#Create a new variable to store filtered tokens
filtered_tokens=[]
for w in tokenized_words:
    if w not in stop_words:
        # add all filtered tokens excluding stopwords in this list below
        filtered_tokens.append(w)

import string
# punctuations
punctuations=list(string.punctuation)
#Add custom punctuations to the list
punctuations.append("...")

#Create another variable to store all clean tokens
filtered_tokens_clean=[]
for i in filtered_tokens:
    if i not in punctuations:
        filtered_tokens_clean.append(i)

word_count_dict_clean = dict(Counter(filtered_tokens_clean))

In [35]:
Counter({k: c for k, c in word_count_dict_clean.items() if c >= 1000})

# Much better: the common words now include things like city, state, company, first, years...
# These fit a game show more
# The quote tokens (`` and '') and other leftover punctuation should probably be removed as well
# Notice that there's an 'http' in there; together with 'href=' it probably comes from HTML links embedded in some questions

Counter({'last': 1739,
         'years': 1103,
         '1': 2038,
         '2': 3843,
         'star': 1106,
         'city': 3187,
         'state': 2476,
         '3': 1861,
         '``': 48990,
         'show': 1057,
         "''": 40384,
         'th': 3684,
         '4': 1256,
         '5': 1014,
         'title': 2312,
         'named': 2831,
         'first': 5949,
         'company': 1105,
         "'s": 33666,
         'man': 2682,
         'time': 1029,
         'one': 6479,
         'war': 1072,
         'c': 1782,
         'href=': 8355,
         'http': 7975,
         'author': 1131,
         'wrote': 2114,
         'book': 1462,
         'born': 1519,
         'seen': 3153,
         'got': 1082,
         'thi': 1287,
         'hit': 1175,
         'new': 2395,
         "'ll": 1188,
         'french': 1411,
         'largest': 1056,
         'part': 1354,
         'name': 6535,
         'island': 1092,
         'word': 1985,
         'type': 2477,
         'made': 2164,
