# Welcome to the First Exercise Session of the NLP Tutoring Course

The main topics of this Colab notebook are **tokenization**, **normalization**, **representation**, and **pandas dataframes**. 😸

## **Contents**

1. **Tokenization**
    - **Exercise 1a**: No library/module tokenization.
    - **Exercise 1b**: Library/module tokenization.
    - **Exercise 1c**: Line/sentence tokenization.

2. **Normalization**
    - **Exercise 2a**: Casefolding, whitespaces, accent and punctuation removal.
    - **Exercise 2b**: Stemming.
    - **Exercise 2c**: Lemmatization.

3. **Mini Introduction to Pandas**
    - Make a dataframe
    - Write and load TSV files
    - Write and load JSON files

4. **Representations**
    - **Exercise 3a**: Bag of Words.
    - **Exercise 3b**: One Hot Encoding.

5. **Further Practice**




Your installs and imports should be here ⬇

In [None]:
# Installing the SpaCy library for advanced NLP tasks (tokenization, normalization, parsing, etc.)
!pip install spacy

# Installing the Italian language model for SpaCy (replace 'it_core_news_sm' if another language/model is needed)
!python -m spacy download it_core_news_sm

# Installing the NLTK library for basic NLP tasks (tokenization, normalization, parsing, etc.)
!pip install nltk

# Installing the 'unidecode' module for accent removal and handling special characters
!pip install unidecode


Collecting it-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/it_core_news_sm-3.7.0/it_core_news_sm-3.7.0-py3-none-any.whl (13.0 MB)
Installing collected packages: it-core-news-sm
Successfully installed it-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('it_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting unidecode
  Downloading Unidecode-1.3.8-py3-none-any.whl.metadata (13 kB)
Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)

In [None]:
# Importing SpaCy
import spacy

# Importing the regular expressions (re) module to perform pattern matching, such as searching and replacing text.
import re # check this: https://regex101.com/

# Importing the Counter class from the collections module to count hashable objects.
from collections import Counter

# Importing the NLTK
import nltk

# Importing the PorterStemmer from NLTK for stemming words.
from nltk.stem import PorterStemmer

# Importing the word_tokenize function from NLTK to split text into tokens.
from nltk.tokenize import word_tokenize

# Importing the WordNetLemmatizer from NLTK to perform lemmatization.
from nltk.stem import WordNetLemmatizer

# Importing pandas, a library for data manipulation and analysis, especially for handling data in DataFrame format.
import pandas as pd

# Importing the unidecode function from the unidecode module to remove accents and special characters from text.
from unidecode import unidecode

# Importing NumPy, a library for numerical computations and handling arrays/matrices efficiently.
import numpy as np


## Tokenization


### Exercise 1a: Tokenize the following poem into individual words. Do not use any library. How many tokens does the poem have?


O hushed October morning mild,  
Thy leaves have ripened to the fall;  
Tomorrow’s wind, if it be wild,  
Should waste them all.  
The crows above the forest call;  
Tomorrow they may form and go.  
O hushed October morning mild,  
Begin the hours of this day slow.  
Make the day seem to us less brief.  
Hearts not averse to being beguiled,  
Beguile us in the way you know.  
Release one leaf at break of day;  
At noon release another leaf;  
One from our trees, one far away.  
Retard the sun with gentle mist;  
Enchant the land with amethyst.  
Slow, slow!  
For the grapes’ sake, if they were all,  
Whose leaves already are burnt with frost,  
Whose clustered fruit must else be lost—  
For the grapes’ sake along the wall.




In [None]:
# Here's the variable with the poem
poem = """O hushed October morning mild,
Thy leaves have ripened to the fall;
Tomorrow’s wind, if it be wild,
Should waste them all.
The crows above the forest call;
 Tomorrow they may form and go.
 O hushed October morning mild,
 Begin the hours of this day slow.
 Make the day seem to us less brief.
 Hearts not averse to being beguiled,
 Beguile us in the way you know.
 Release one leaf at break of day;
 At noon release another leaf;
 One from our trees, one far away.
 Retard the sun with gentle mist;
 Enchant the land with amethyst.
 Slow, slow!
 For the grapes’ sake, if they were all,
 Whose leaves already are burnt with frost,
 Whose clustered fruit must else be lost—
 For the grapes’ sake along the wall."""


In [None]:
# str.split() with no arguments splits on any run of whitespace (spaces and newlines).
split_poem = poem.split()
print(f'There are {len(split_poem)} tokens in the poem.')


### Exercise 1b: Now, tokenize the poem using one of the libraries that we talked about in class. How many tokens does the poem have now?  How many unique tokens are there?

In [None]:
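nlp = spacy.load('it_core_news_sm')  # the pipeline installed above (an Italian model, applied here to an English poem)
tokens = [token.text for token in nlp(poem)]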
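
One way to approach this exercise, sketched here with the standard-library `re` module rather than NLTK or spaCy (an assumption made so the example runs without any downloads; `word_tokenize` or a spaCy pipeline would slot in at the same place). The two-line `sample` and the pattern are illustrative choices, not the expected solution. The pattern keeps runs of word characters together and treats each punctuation mark as its own token.

```python
import re
from collections import Counter

# Two lines of the poem as a small sample (the full `poem` string works the same way).
sample = "Release one leaf at break of day;\nAt noon release another leaf;"

# \w+ matches a run of letters/digits; [^\w\s] matches a single punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", sample)

# Counter maps each distinct token to its frequency,
# so its length is the number of unique tokens.
counts = Counter(tokens)

print(f"There are {len(tokens)} tokens, {len(counts)} of them unique.")
print(counts.most_common(3))
```

Note that the count is case-sensitive: `Release` and `release` are two different tokens until the text is normalized.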

### Exercise 1c: Can you tokenize the poem by line, and print each line vertically? What problem do you come across?
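
A minimal sketch of line tokenization, assuming a short excerpt (`sample`) that mimics the layout of the `poem` variable, where some lines start with a stray space. It uses `str.splitlines()` and prints each line with `repr()` so that one possible problem, leftover indentation, becomes visible. Another thing to look for: a line of verse is not necessarily a sentence.

```python
# A short excerpt laid out like the `poem` variable (assumption: same indentation pattern).
sample = (
    "The crows above the forest call;\n"
    " Tomorrow they may form and go.\n"
    " O hushed October morning mild,"
)

# str.splitlines() breaks the string at line boundaries.
lines = sample.splitlines()

for line in lines:
    print(repr(line))  # repr() makes leading spaces visible

# Stripping each line removes the leftover indentation.
clean_lines = [line.strip() for line in lines]

for line in clean_lines:
    print(line)
```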

## Normalization



> Un’armonia mi suona    nelle vene,
allora simile a Dafne
mi trasmuto in un albero alto,
Apollo, perché tu non mi fermi.
Ma    sono una Dafne
accecata dal fumo della follia,
non ho    foglie né fiori;
eppure mentre mi trasmigro
nasce    profonda la luce
e nella solitudine arborea
volgo una triade di Dei.



### Exercise 2a: Casefolding, whitespaces, accent and punctuation removal.

1.   Turn all characters of the new poem into lower case.
2.   Remove redundant whitespaces.
3.   Remove the accent mark from certain characters.
4.   Remove any punctuation marks.



In [None]:
poem = """Un’armonia mi suona    nelle vene,
allora simile a Dafne
mi trasmuto in un albero alto,
Apollo, perché tu non mi fermi.
Ma    sono una Dafne
accecata dal fumo della follia,
non ho    foglie né fiori;
eppure mentre mi trasmigro
nasce    profonda la luce
e nella solitudine arborea
volgo una triade di Dei."""


In [None]:
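poem = poem.lower()                               # step 1: casefolding
cleaned_poem = poem.strip()                       # trim leading/trailing whitespace
cleaned_poem = re.sub(r'\s+', ' ', cleaned_poem)  # step 2: collapse runs of whitespace (newlines included) into one space
print(cleaned_poem)
# Steps 3 and 4 (accent and punctuation removal) are still to do.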
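
The four steps of Exercise 2a can be sketched as follows. This is one possible ordering, not the only valid one. It assumes the standard-library `unicodedata` module for accent removal instead of `unidecode` (a swap made only to keep the example free of extra installs), and it assumes punctuation should be replaced by a space rather than deleted, so that `un’armonia` becomes two words. The two-line `sample` is taken from the poem above.

```python
import re
import unicodedata

sample = "Un’armonia mi suona    nelle vene,\nApollo, perché tu non mi fermi."

# 1. Casefolding: lower-case every character.
text = sample.lower()

# 2. Accent removal: decompose characters (NFD), then drop the combining marks.
decomposed = unicodedata.normalize("NFD", text)
text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# 3. Punctuation removal: replace anything that is neither a word character
#    nor whitespace with a space.
text = re.sub(r"[^\w\s]", " ", text)

# 4. Whitespace: collapse runs of spaces/newlines into one space and trim the ends.
text = re.sub(r"\s+", " ", text).strip()

print(text)
```

Whitespace is collapsed last because the punctuation step can itself leave double spaces behind.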

### Exercise 2b: Stemming
Use one of the libraries to produce a list with all the stems of the words in the poem.

In [None]:
nltk.download('punkt')      # tokenizer data needed by word_tokenize
stemmer = PorterStemmer()   # note: the Porter algorithm targets English
words = word_tokenize(poem)
stemmed_words = [stemmer.stem(word) for word in words]
for word, stemmed in zip(words, stemmed_words):
    print(f'{word}-> {stemmed}')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


o-> o
hushed-> hush
october-> octob
morning-> morn
mild-> mild
,-> ,
thy-> thi
leaves-> leav
have-> have
ripened-> ripen
to-> to
the-> the
fall-> fall
;-> ;
tomorrow-> tomorrow
’-> ’
s-> s
wind-> wind
,-> ,
if-> if
it-> it
be-> be
wild-> wild
,-> ,
should-> should
waste-> wast
them-> them
all-> all
.-> .
the-> the
crows-> crow
above-> abov
the-> the
forest-> forest
call-> call
;-> ;
tomorrow-> tomorrow
they-> they
may-> may
form-> form
and-> and
go-> go
.-> .
o-> o
hushed-> hush
october-> octob
morning-> morn
mild-> mild
,-> ,
begin-> begin
the-> the
hours-> hour
of-> of
this-> thi
day-> day
slow-> slow
.-> .
make-> make
the-> the
day-> day
seem-> seem
to-> to
us-> us
less-> less
brief-> brief
.-> .
hearts-> heart
not-> not
averse-> avers
to-> to
being-> be
beguiled-> beguil
,-> ,
beguile-> beguil
us-> us
in-> in
the-> the
way-> way
you-> you
know-> know
.-> .
release-> releas
one-> one
leaf-> leaf
at-> at
break-> break
of-> of
day-> day
;-> ;
at-> at
noon-> noon
release-> releas
anoth

### Exercise 2c: Lemmatization
Lemmatize the poem - try at least two libraries.

In [None]:
nltk.download('wordnet')  # WordNet data used by WordNetLemmatizer

# Mini Introduction to Pandas

### Let's make a dataframe
**DataFrame**: a two-dimensional, size-mutable, tabular data structure with labeled axes (rows and columns). It is similar to a spreadsheet or an SQL table, where data is organized in rows and columns.

In [None]:
data = {
    'Name': ['Frodo Baggins', 'Gandalf', 'Aragorn', 'Legolas', 'Gimli', 'Samwise Gamgee'],
    'Race': ['Hobbit', 'Maia', 'Human', 'Elf', 'Dwarf', 'Hobbit'],
    'Age': [50, 2019, 87, 2, 139, 38],
    'Notable Trait': ['Ring-bearer', 'Wise', 'King', 'Archer', 'Warrior', 'Loyalty']
}

df = pd.DataFrame(data)

# print("Original DataFrame:")
# print(df)
df

             Name    Race   Age Notable Trait
0   Frodo Baggins  Hobbit    50   Ring-bearer
1         Gandalf    Maia  2019          Wise
2         Aragorn   Human    87          King
3         Legolas     Elf     2        Archer
4           Gimli   Dwarf   139       Warrior
5  Samwise Gamgee  Hobbit    38       Loyalty


Let's save this DataFrame to a TSV file.

**TSV** (Tab-Separated Values file): a text file that uses tab characters to separate values or fields. Each line in the file represents a record or row of data, and each field within that record is separated by a tab.
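
To see what this format looks like as raw text, here is a small sketch using the standard-library `csv` module and an in-memory buffer instead of pandas and a real file (both are assumptions made to keep the example self-contained). The rows are a shortened version of the character table.

```python
import csv
import io

rows = [
    ["Name", "Race"],
    ["Frodo Baggins", "Hobbit"],
    ["Gandalf", "Maia"],
]

# Write the rows into an in-memory buffer, using a tab as the field delimiter.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
raw_text = buffer.getvalue()

print(repr(raw_text))  # repr() shows the tabs as \t and the line ends as \n

# Reading it back: each line becomes one record, split on tabs.
records = list(csv.reader(io.StringIO(raw_text), delimiter="\t"))
print(records)
```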

In [None]:
df.to_csv('lotr_characters.tsv', sep='\t', index=False) # the sep parameter specifies the string used to separate values in the output file.
# The index parameter determines whether to write row indices (the row labels) to the output file.

How to load (read) a TSV file:

In [None]:
df_tsv = pd.read_csv('lotr_characters.tsv', sep='\t')
print(df_tsv)

             Name    Race   Age Notable Trait
0   Frodo Baggins  Hobbit    50   Ring-bearer
1         Gandalf    Maia  2019          Wise
2         Aragorn   Human    87          King
3         Legolas     Elf     2        Archer
4           Gimli   Dwarf   139       Warrior
5  Samwise Gamgee  Hobbit    38       Loyalty


Let's do the same with JSON files

**JSON** (JavaScript Object Notation): a plain-text format, so it can be opened and edited with any text editor. Its structure is based on key-value pairs.
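
A minimal sketch of the key-value structure, using the standard-library `json` module rather than pandas (an assumption made for illustration). It writes two shortened character records as one JSON object per line, a layout often called JSON Lines, and parses them back into dictionaries.

```python
import json

records = [
    {"Name": "Frodo Baggins", "Race": "Hobbit", "Age": 50},
    {"Name": "Gandalf", "Race": "Maia", "Age": 2019},
]

# One JSON object per line: each record is a set of key-value pairs.
json_lines = "\n".join(json.dumps(record) for record in records)
print(json_lines)

# Parsing goes line by line, turning each line back into a dictionary.
parsed = [json.loads(line) for line in json_lines.splitlines()]
print(parsed[1]["Race"])
```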





In [None]:
# orient='records' writes each row as its own JSON object of column-value pairs.
# lines=True puts one such object per line, which suits streaming or line-by-line processing.
df.to_json('lotr_characters.json', orient='records', lines=True)

In [None]:
df_json = pd.read_json('lotr_characters.json', orient='records', lines=True)

print(df_json)

             Name    Race   Age Notable Trait
0   Frodo Baggins  Hobbit    50   Ring-bearer
1         Gandalf    Maia  2019          Wise
2         Aragorn   Human    87          King
3         Legolas     Elf     2        Archer
4           Gimli   Dwarf   139       Warrior
5  Samwise Gamgee  Hobbit    38       Loyalty


# Representations

### Exercise 3a: Bag of Words
Use the first 10 lines of the first poem and
*  Create the BoW vectors for each sentence based on word frequency.
*  Try to use Pandas
*  Optional: Add new sentences to your dataset and expand your BoW model!



In [None]:
# let's use what we learned before about splitting the poem into sentences
# Let's try with regular expressions.

poem = """O hushed October morning mild,
Thy leaves have ripened to the fall;
Tomorrow’s wind, if it be wild,
Should waste them all.
The crows above the forest call;
Tomorrow they may form and go.
O hushed October morning mild,
Begin the hours of this day slow.
Make the day seem to us less brief. """
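
A possible shape for the Bag of Words step, sketched with `collections.Counter` and plain lists. The three sentences, the lower-casing, and the `\w+` token pattern are assumptions chosen for brevity; they are not the required solution.

```python
import re
from collections import Counter

sentences = [
    "O hushed October morning mild,",
    "Thy leaves have ripened to the fall;",
    "The crows above the forest call;",
]

# Lower-case each line and keep only runs of word characters as tokens.
tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]

# The vocabulary is the sorted set of all tokens; its order fixes the vector dimensions.
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# One vector per sentence: the count of every vocabulary word in that sentence.
bow_vectors = []
for toks in tokenized:
    counts = Counter(toks)
    bow_vectors.append([counts[word] for word in vocabulary])

for sentence, vector in zip(sentences, bow_vectors):
    print(vector, sentence)
```

To view the result as a table, `pd.DataFrame(bow_vectors, columns=vocabulary)` wraps the same lists in a DataFrame with one column per vocabulary word.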


### Exercise 3b: One Hot Encoding
Use the first 10 lines of the first poem and

*  Create your own vocabulary/lexicon: Identify unique words from these sentences.
*  Create the one-hot vectors.
*  Optional: Add new sentences to your dataset and expand your one hot embeddings.

In [None]:
# let's create the vocabulary
poem = """O hushed October morning mild,
Thy leaves have ripened to the fall;
Tomorrow’s wind, if it be wild,
Should waste them all.
The crows above the forest call;
Tomorrow they may form and go.
O hushed October morning mild,
Begin the hours of this day slow.
Make the day seem to us less brief."""
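
A minimal sketch of one-hot encoding on a single line of the poem. The helper `one_hot` and the mapping `word_to_index` are hypothetical names introduced here for illustration, and the lower-casing and `\w+` pattern are assumptions.

```python
import re

text = "Tomorrow they may form and go."

# Lower-case the line and keep runs of word characters as tokens.
tokens = re.findall(r"\w+", text.lower())

# Vocabulary: each unique word gets a fixed index.
vocabulary = sorted(set(tokens))
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


for token in tokens:
    print(f"{token:>8} {one_hot(token)}")
```

With NumPy, the rows of `np.eye(len(vocabulary), dtype=int)` give the same vectors, one row per vocabulary index.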



# If you want to practice more, use what you've learnt so far to preprocess a text that you like. For example, you can find a Reddit jokes dataset below.

In [None]:
! git clone https://github.com/taivop/joke-dataset.git

Cloning into 'joke-dataset'...
remote: Enumerating objects: 44, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 44 (delta 12), reused 10 (delta 10), pack-reused 30 (from 1)
Receiving objects: 100% (44/44), 32.38 MiB | 7.01 MiB/s, done.
Resolving deltas: 100% (21/21), done.
Updating files: 100% (9/9), done.


In [None]:
import pandas as pd
df = pd.read_json('/content/joke-dataset/reddit_jokes.json')

Unnamed: 0,body,id,score,title
0,"Now I have to say ""Leroy can you please paint ...",5tz52q,1,I hate how you cant even say black paint anymore
1,Pizza doesn't scream when you put it in the ov...,5tz4dd,0,What's the difference between a Jew in Nazi Ge...
2,...and being there really helped me learn abou...,5tz319,0,I recently went to America....
3,A Sunday school teacher is concerned that his ...,5tz2wj,1,"Brian raises his hand and says, “He’s in Heaven.”"
4,He got caught trying to sell the two books to ...,5tz1pc,0,You hear about the University book store worke...
...,...,...,...,...
194548,Gives me something to read while i'm in the sh...,1a89ts,5,I like a girl with words tattooed on her back.
194549,I mean dyslexia fcuk!!! >_<,1a87we,12,I have sexdaily...
194550,A hockey player showers after three periods.,1a7xnd,44,What's the difference between a hippie chick a...
194551,A father buys a lie detector robot that slaps ...,1a813f,63,new family robot


https://github.com/taivop/joke-dataset.git