# Basics of Natural Language Processing (NLP) Take Home Exercise #



Use the following link to find open-source data sets for completing the take-home exercises.

[Data Sets](https://opendatascience.com/20-open-datasets-for-natural-language-processing/)

# Run this code at the beginning to limit the output size of the cells

In [1]:
from IPython.display import display, Javascript

def resize_colab_cell():
    # Change the maxHeight value to change the maximum height of the output
    display(Javascript('google.colab.output.setIframeHeight(0, true, {maxHeight: 400})'))

# Apply the limit to the entire notebook by calling the function before every cell run
get_ipython().events.register('pre_run_cell', resize_colab_cell)

### 1. Input Text

Write code to collect the text for analysis from user input.



In [21]:
import pandas as pd
import spacy
nlp = spacy.load("en_core_web_sm")

In [22]:
# Import the same dataset as used in the tutorial
url = 'https://github.com/AntonetteShibani/NLPAnalysis/blob/main/CNN_Articles_2021.csv?raw=true'
df = pd.read_csv(url)
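
The cells above load the tutorial dataset rather than reading user input. A minimal sketch of the interactive variant is shown here; the helper name `collect_text` and its `read_line` parameter are assumptions made for illustration and are not part of the tutorial.

```python
def collect_text(read_line=input):
    """Read lines until a blank line is entered and return them as one string.

    `read_line` defaults to the built-in input(); it is a parameter only so
    the function can be exercised without a keyboard.
    """
    print("Enter the text to analyse (blank line to finish):")
    lines = []
    while True:
        line = read_line()
        if not line.strip():
            break  # a blank line ends the input
        lines.append(line.strip())
    return " ".join(lines)

# In the notebook this could be used as:
# user_text = collect_text()
# doc = nlp(user_text)
```

The returned string can be passed to `nlp(...)` in the same way as the dataset text.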

### 2. Basic Analysis

Perform basic text analysis on the collected text using the spaCy ([spacy.io](http://spacy.io)) library. Try different string manipulations.

In [23]:
doc = nlp(df.head(20).to_string())

for ent in doc.ents:
    print(ent.text, ent.label_)


0 CARDINAL
part_of PERSON
0 CARDINAL
157 CARDINAL
Daniel Dale PERSON
CNN ORG
2021-03-03 DATE
Antifa PERSON
Capitol FAC
Republicans NORP
Antifa PERSON
Trump ORG
Republicans NORP
Antifa PERSON
Capitol FAC
Republicans NORP
Antifa PERSON
Capitol FAC
Republicans NORP
Washington GPE
CNN)FBI PERSON
Christopher Wray PERSON
CNN ORG
MSNBC ORG
Senate ORG
Tuesday DATE
FBI ORG
Antifa NORP
January DATE
US GPE
Capitol FAC
Fox News ORG
Seuss PERSON
America GPE
Americans NORP
Fox ORG
Republicans NORP
Antifa PERSON
Donald Trump PERSON
Antifa PERSON
January DATE
American Enterprise Institute ORG
50% PERCENT
Republicans NORP
Antifa PERSON
Capitol FAC
January DATE
NBC ORG
48% PERCENT
Republican NORP
Antifa PERSON
Capitol ORG
FoxFox News ORG
Fox ORG
Trump ORG
Antifa PERSON
one CARDINAL
Antifa PERSON
Capitol ORG
Newsmax ORG
One CARDINAL
America News ORG
Trump ORG
Rudy Giuliani PERSON
Trump ORG
Michael van der PERSON
Veen PERSON
Republican NORP
Congress ORG
Mo Brooks PERSON
Matt Gaetz PERSON
Ron Johnson PERSO
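
In the output above, the same text sometimes receives different labels (for example `Antifa` as both `PERSON` and `NORP`). One possible way to surface such cases is to group the labels by entity text. The sketch below uses a few pairs copied from the output; in the notebook the list could instead be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter, defaultdict

# A few (text, label) pairs copied from the entity output above
entities = [
    ("Antifa", "PERSON"), ("Antifa", "NORP"), ("Antifa", "PERSON"),
    ("Trump", "ORG"), ("Donald Trump", "PERSON"),
    ("Capitol", "FAC"), ("Capitol", "ORG"),
    ("CNN", "ORG"),
]

# Count how often each label was assigned to each entity text
labels_by_text = defaultdict(Counter)
for text, label in entities:
    labels_by_text[text][label] += 1

# Entity texts that received more than one label are worth a manual check
ambiguous = {
    text: dict(counts)
    for text, counts in labels_by_text.items()
    if len(counts) > 1
}
print(ambiguous)
```

Entries such as `0 CARDINAL` at the top of the output probably come from the index and header text that `to_string()` adds, so passing only the article text to `nlp` may give cleaner results.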

### 3. Tokenizer
Create a custom tokenizer in Python that handles:
*   Contractions (e.g., "don't" → "do n't")
*   Keeps punctuation as separate tokens
*   Splits hyphenated words (e.g., "state-of-the-art" → "state of the art")

Compare its results with NLTK's word_tokenize on any sample paragraph and the following examples:
"New York-based company", "It's a beautiful day!", "https://www.example.com"

What differences do you see? What are the advantages and limitations of each approach?

In [5]:
# Import the NLTK library and download the Punkt tokenizer data
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
strings = ["New York-based company", "It's a beautiful day!", "https://www.example.com"]


In [7]:
tokens = word_tokenize(strings[0])
print(tokens)

['New', 'York-based', 'company']


In [8]:
tokens = word_tokenize(strings[1])
print(tokens)

['It', "'s", 'a', 'beautiful', 'day', '!']


In [9]:
tokens = word_tokenize(strings[2])
print(tokens)

['https', ':', '//www.example.com']


In [10]:
# Custom tokeniser
def custom_tokeniser(text):
    # Hardcoded example contractions and their replacements
    contractions = {
        "don't": "do n't",
        "It's": "it is"
    }

    tokens = []
    for word in text.split():
        if '-' in word:
            # Split hyphenated words into their parts
            tokens.extend(word.split('-'))
        elif word in contractions:
            # Replace a known contraction with its hardcoded expansion
            tokens.extend(contractions[word].split())
        else:
            tokens.append(word)

    return tokens

In [11]:
custom_tokeniser(strings[0])

['New', 'York', 'based', 'company']

In [12]:
custom_tokeniser(strings[1])

['it', 'is', 'a', 'beautiful', 'day!']

In [13]:
custom_tokeniser(strings[2])

# With this implementation of the custom tokeniser, punctuation cannot be separated, because the text is only split on whitespace and hyphens

['https://www.example.com']
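
A regex-based variant can cover the three requirements of the exercise without a hardcoded contraction table. The sketch below is one possible approach, not a reference solution: the name `regex_tokenise`, the decision to keep URLs as single tokens, and the Treebank-style `n't` split are all assumptions.

```python
import re

# A whitespace-delimited chunk that looks like a URL is kept as one token (assumption)
URL_RE = re.compile(r"^(?:https?://|www\.)\S+$")

# Alternatives are tried left to right:
#   \w+(?=n't\b)  word stem in front of "n't"   (do | n't)
#   n't\b         the negation clitic itself
#   '\w+          other clitics such as 's, 're, 'll
#   \w+           ordinary words
#   [^\w\s]       any single punctuation character
TOKEN_RE = re.compile(r"\w+(?=n't\b)|n't\b|'\w+|\w+|[^\w\s]")

def regex_tokenise(text):
    tokens = []
    for chunk in text.split():
        if URL_RE.match(chunk):
            tokens.append(chunk)
            continue
        # Replace hyphens between word characters with spaces: state-of-the-art -> state of the art
        chunk = re.sub(r"(?<=\w)-(?=\w)", " ", chunk)
        tokens.extend(TOKEN_RE.findall(chunk))
    return tokens

print(regex_tokenise("New York-based company"))   # ['New', 'York', 'based', 'company']
print(regex_tokenise("It's a beautiful day!"))    # ['It', "'s", 'a', 'beautiful', 'day', '!']
print(regex_tokenise("https://www.example.com"))  # ['https://www.example.com']
```

Known limitations of this sketch: a URL followed directly by punctuation keeps that punctuation, and a word in single quotes is split as if the opening quote were a clitic.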

In [14]:
# What differences do you see? What are the advantages and limitations of each approach?
'''
Differences observed:
NLTK keeps "York-based" as one token, splits "It's" into "It" and "'s", separates "!" and breaks the URL into three tokens
The custom tokeniser splits "York-based", expands "It's" to "it is", leaves "day!" attached and keeps the URL whole

Advantages of custom:
More customisable
Likely runs faster on simple input because it only splits strings (not measured here)

Limitations of custom:
Lots of hardcoding required unless you use libraries such as regex to help
Contractions are really hard to handle without hardcoding
Punctuation cannot be handled with the implementation above; that would need regex

Advantages of NLTK:
Extremely simple to set up
No hardcoding required
Covers many use cases

Disadvantages of NLTK:
Likely runs slower, with more performance overhead
The output will likely have to be post-processed, which is still easier than hardcoding the contractions yourself
For example, hyphenated words can be split with a for loop over the tokens, and the same goes for punctuation

Generally NLTK is simpler, more robust and easier to use than writing a custom tokeniser in Python without libraries.
'''

### 4. Regex

Try writing your own RegEx that can capture citations in text, e.g. (Horning, 2022).

In [15]:
import re

In [16]:
example = "random text (citation1, 2024) more text (citation2, 2025) and some more text (citation3, 2026)"
pattern = r'\(\w+, \d+\)'  # r'\((.*?)\)' also works, but it captures everything inside any pair of brackets
re.findall(pattern, example)

['(citation1, 2024)', '(citation2, 2025)', '(citation3, 2026)']
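
The pattern above expects a single word before the comma, so it would miss citations with two authors or "et al.". A slightly broader sketch is given below, assuming author names start with a capital letter and years have four digits with an optional letter suffix; the helper name `find_citations` is hypothetical.

```python
import re

CITATION_RE = re.compile(
    r"\(([A-Z][\w'-]+(?:\s(?:et al\.|(?:and|&)\s[A-Z][\w'-]+))?),\s(\d{4}[a-z]?)\)"
)

def find_citations(text):
    """Return (authors, year) pairs for citations such as (Horning, 2022)."""
    return CITATION_RE.findall(text)

sample = "See (Horning, 2022), (Smith & Jones, 2019) and (Lee et al., 2020a), but not (see Figure 2)."
print(find_citations(sample))
```

Because of the capital-letter assumption, lowercase placeholders such as `(citation1, 2024)` are not matched by this version.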

Extract URLs that follow a certain format (starting with www., http:// or https://).

In [17]:
example2 = "test www.google.com.au test http://www.google.com.au test https://www.google.com.au test"
pattern2 = r'\b((www|http:|https:)[^\s]+[\w]+)\b'
re.findall(pattern2, example2)

[('www.google.com.au', 'www'),
 ('http://www.google.com.au', 'http:'),
 ('https://www.google.com.au', 'https:')]
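
`re.findall` returns tuples here because the pattern contains two capturing groups. If a flat list of URLs is preferred, one option is a non-capturing group `(?:...)`. The names `url_pattern` and `urls` below are illustrative only.

```python
import re

example2 = "test www.google.com.au test http://www.google.com.au test https://www.google.com.au test"

# (?:...) groups the alternatives without capturing them, so findall returns whole matches.
# The final \w stops the match at a word character, which drops trailing punctuation.
url_pattern = r"\b(?:https?://|www\.)\S*\w"

urls = re.findall(url_pattern, example2)
print(urls)
# ['www.google.com.au', 'http://www.google.com.au', 'https://www.google.com.au']
```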

### 5. Word Frequency

Find the list of words that occur more than 10 times in a selected corpus.

Try different setups (no stopwords, custom stopwords, keeping punctuation, etc.) and see how the results differ.
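
To compare such setups side by side, the counting step can be wrapped in a small helper that works on any list of token strings. This is a sketch under stated assumptions: the function name `frequent_words` and its parameters are hypothetical, and whitespace-only tokens are always dropped.

```python
from collections import Counter
import string

def frequent_words(tokens, threshold=10, stopwords=frozenset(), drop_punctuation=True):
    """Return {word: count} for words that occur more than `threshold` times."""
    counts = Counter()
    for token in tokens:
        word = token.lower().strip()
        if not word:
            continue  # skip whitespace-only tokens
        if drop_punctuation and all(ch in string.punctuation for ch in word):
            continue  # skip tokens made only of punctuation, e.g. "," or "--"
        if word in stopwords:
            continue
        counts[word] += 1
    return {word: count for word, count in counts.items() if count > threshold}

# Toy token list; in the notebook this could be [token.text for token in doc]
tokens = ["the"] * 12 + ["--"] * 11 + ["CNN"] * 11 + [" "] * 15 + ["rare"] * 2

print(frequent_words(tokens))                          # punctuation removed
print(frequent_words(tokens, drop_punctuation=False))  # punctuation kept
print(frequent_words(tokens, stopwords={"the"}))       # custom stopword list
```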


In [42]:
from collections import Counter
import string

# Import the same dataset as used in the tutorial
url = 'https://github.com/AntonetteShibani/NLPAnalysis/blob/main/CNN_Articles_2021.csv?raw=true'
df = pd.read_csv(url)
doc = nlp(df.head(20).to_string())

# Count how often each lowercased token occurs
word_freq = Counter(token.text.lower() for token in doc)

# Print the word frequencies - before removing punctuation
for word, freq in word_freq.items():
    if freq > 10:
        print(f"{word}: {freq}")




:: 64
  : 15

: 20
         : 12
analysis: 14
by: 157
,: 1445
cnn: 143
 : 129
2021: 22
-: 343
03: 11
+: 20
00:00: 20
politics: 18
of: 733
a: 651
lie: 13
how: 50
the: 1482
that: 364
antifa: 19
capitol: 46
among: 16
is: 241
for: 261
not: 105
trump: 27
has: 139
become: 14
--: 113
after: 60
being: 42
long: 21
right: 46
wing: 15
people: 83
and: 670
.: 1114
washington: 12
(: 59
was: 172
on: 247
when: 58
he: 162
told: 42
had: 69
found: 18
any: 34
to: 853
were: 63
in: 599
january: 35
us: 41
at: 114
news: 21
from: 153
its: 35
it: 184
just: 55
about: 83
": 598
so: 38
's: 287
most: 27
or: 74
with: 194
americans: 25
part: 14
because: 34
conspiracy: 17
think: 12
former: 22
some: 34
have: 118
been: 71
their: 79
own: 17
american: 12
%: 27
said: 156
republican: 12
voters: 19
read: 27
than: 53
this: 103
state: 41
other: 52
far: 17
anti: 29
much: 20
as: 178
threat: 12
but: 89
here: 16
one: 80
media: 19
they: 84
attorney: 15
members: 15
like: 38
candidate: 11
social: 17
help: 39
more: 96
who: 106
them: 3

In [43]:
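# Drop tokens found in string.punctuation. This is a substring test, so in practice it
# removes single punctuation characters; tokens such as "--" and whitespace are kept.
cleaned_dict = {key: value for key, value in word_freq.items() if key not in string.punctuation}
word_freq = cleaned_dict

# Print the word frequencies - after removing punctuation
for word, freq in word_freq.items():
    if freq > 10:
        print(f"{word}: {freq}")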

  : 15

: 20
         : 12
analysis: 14
by: 157
cnn: 143
 : 129
2021: 22
03: 11
00:00: 20
politics: 18
of: 733
a: 651
lie: 13
how: 50
the: 1482
that: 364
antifa: 19
capitol: 46
among: 16
is: 241
for: 261
not: 105
trump: 27
has: 139
become: 14
--: 113
after: 60
being: 42
long: 21
right: 46
wing: 15
people: 83
and: 670
washington: 12
was: 172
on: 247
when: 58
he: 162
told: 42
had: 69
found: 18
any: 34
to: 853
were: 63
in: 599
january: 35
us: 41
at: 114
news: 21
from: 153
its: 35
it: 184
just: 55
about: 83
so: 38
's: 287
most: 27
or: 74
with: 194
americans: 25
part: 14
because: 34
conspiracy: 17
think: 12
former: 22
some: 34
have: 118
been: 71
their: 79
own: 17
american: 12
said: 156
republican: 12
voters: 19
read: 27
than: 53
this: 103
state: 41
other: 52
far: 17
anti: 29
much: 20
as: 178
threat: 12
but: 89
here: 16
one: 80
media: 19
they: 84
attorney: 15
members: 15
like: 38
candidate: 11
social: 17
help: 39
more: 96
who: 106
them: 34
political: 16
used: 20
previously: 13
him: 48
never: