# Welcome to the First Exercise Session of the NLP Tutoring Course

The main topics of this Colab notebook are **tokenization**, **normalization**, **representation**, and **pandas dataframes**. 😸

## **Contents**

1. **Tokenization**
    - **Exercise 1a**: No library/module tokenization.
    - **Exercise 1b**: Library/module tokenization.
    - **Exercise 1c**: Line/sentence tokenization.

2. **Normalization**
    - **Exercise 2a**: Casefolding, whitespaces, accent and punctuation removal.
    - **Exercise 2b**: Stemming.
    - **Exercise 2c**: Lemmatization.

3. **Mini Introduction to Pandas**
    - Make a dataframe
    - Write and load TSV files
    - Write and load JSON files

4. **Representations**
    - **Exercise 3a**: Bag of Words.
    - **Exercise 3b**: One Hot Encoding.

5. **Further Practice**



Your installs and imports should be here ⬇

In [1]:
# Installing the SpaCy library for advanced NLP tasks (tokenization, normalization, parsing, etc.)
!pip install spacy

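# Installing the Italian language model for SpaCy (replace 'it_core_news_sm' if another language/model is needed)
!python -m spacy download it_core_news_sm

# Exercises 1b and 3b load the English model 'en_core_web_sm'; uncomment the next line if it is not already installed in your runtime
# !python -m spacy download en_core_web_sm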

# Installing the NLTK library for basic NLP tasks (tokenization, normalization, parsing, etc.)
!pip install nltk

# Installing the 'unidecode' module for accent removal and handling special characters
!pip install unidecode


Collecting it-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/it_core_news_sm-3.7.0/it_core_news_sm-3.7.0-py3-none-any.whl (13.0 MB)
Installing collected packages: it-core-news-sm
Successfully installed it-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('it_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting unidecode
  Downloading Unidecode-1.3.8-py3-none-any.whl.metadata (13 kB)
Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)

In [2]:
# Importing SpaCy
import spacy

# Importing the regular expressions (re) module to perform pattern matching, such as searching and replacing text.
import re # check this: https://regex101.com/

# Importing the Counter class from the collections module to count hashable objects.
from collections import Counter

# Importing NLTK
import nltk

# Importing the PorterStemmer from NLTK for stemming words.
from nltk.stem import PorterStemmer

# Importing the word_tokenize function from NLTK to split text into tokens.
from nltk.tokenize import word_tokenize

# Importing the WordNetLemmatizer from NLTK to perform lemmatization.
from nltk.stem import WordNetLemmatizer

# Importing pandas, a library for data manipulation and analysis, especially for handling data in DataFrame format.
import pandas as pd

# Importing the unidecode function from the unidecode module to remove accents and special characters from text.
from unidecode import unidecode

# Importing NumPy, a library for numerical computations and handling arrays/matrices efficiently.
import numpy as np

# Use this if you want to download a file automatically when the cell is run.
from google.colab import files


## Tokenization


### Exercise 1a: Tokenize the following poem into individual words. Do not use any library. How many tokens does the poem have?


O hushed October morning mild,  
Thy leaves have ripened to the fall;  
Tomorrow’s wind, if it be wild,  
Should waste them all.  
The crows above the forest call;  
Tomorrow they may form and go.  
O hushed October morning mild,  
Begin the hours of this day slow.  
Make the day seem to us less brief.  
Hearts not averse to being beguiled,  
Beguile us in the way you know.  
Release one leaf at break of day;  
At noon release another leaf;  
One from our trees, one far away.  
Retard the sun with gentle mist;  
Enchant the land with amethyst.  
Slow, slow!  
For the grapes’ sake, if they were all,  
Whose leaves already are burnt with frost,  
Whose clustered fruit must else be lost—  
For the grapes’ sake along the wall.




In [3]:
# Solution
# Here's the variable with the poem
poem = """O hushed October morning mild,
Thy leaves have ripened to the fall;
Tomorrow’s wind, if it be wild,
Should waste them all.
The crows above the forest call;
 Tomorrow they may form and go.
 O hushed October morning mild,
 Begin the hours of this day slow.
 Make the day seem to us less brief.
 Hearts not averse to being beguiled,
 Beguile us in the way you know.
 Release one leaf at break of day;
 At noon release another leaf;
 One from our trees, one far away.
 Retard the sun with gentle mist;
 Enchant the land with amethyst.
 Slow, slow!
 For the grapes’ sake, if they were all,
 Whose leaves already are burnt with frost,
 Whose clustered fruit must else be lost—
 For the grapes’ sake along the wall."""

# Tokenize using split()
split_poem = poem.split()

# Count the tokens with len(), and the distinct types with len() and set().
# Reminder: f-strings let us embed expressions directly in the printed text.
print(f'There are {len(split_poem)} tokens in the poem.')
print(f'There are {len(set(split_poem))} distinct types in the poem.')

There are 128 tokens in the poem.
There are 99 distinct types in the poem.


In [4]:
print(split_poem)

['O', 'hushed', 'October', 'morning', 'mild,', 'Thy', 'leaves', 'have', 'ripened', 'to', 'the', 'fall;', 'Tomorrow’s', 'wind,', 'if', 'it', 'be', 'wild,', 'Should', 'waste', 'them', 'all.', 'The', 'crows', 'above', 'the', 'forest', 'call;', 'Tomorrow', 'they', 'may', 'form', 'and', 'go.', 'O', 'hushed', 'October', 'morning', 'mild,', 'Begin', 'the', 'hours', 'of', 'this', 'day', 'slow.', 'Make', 'the', 'day', 'seem', 'to', 'us', 'less', 'brief.', 'Hearts', 'not', 'averse', 'to', 'being', 'beguiled,', 'Beguile', 'us', 'in', 'the', 'way', 'you', 'know.', 'Release', 'one', 'leaf', 'at', 'break', 'of', 'day;', 'At', 'noon', 'release', 'another', 'leaf;', 'One', 'from', 'our', 'trees,', 'one', 'far', 'away.', 'Retard', 'the', 'sun', 'with', 'gentle', 'mist;', 'Enchant', 'the', 'land', 'with', 'amethyst.', 'Slow,', 'slow!', 'For', 'the', 'grapes’', 'sake,', 'if', 'they', 'were', 'all,', 'Whose', 'leaves', 'already', 'are', 'burnt', 'with', 'frost,', 'Whose', 'clustered', 'fruit', 'must', 'el
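
The list above shows that `split()` leaves punctuation glued to the neighbouring word (`'mild,'`, `'fall;'`). Here is a minimal library-free sketch that puts punctuation into separate tokens. It assumes that any character which is neither alphanumeric nor whitespace counts as punctuation; `simple_tokenize` is a hypothetical helper name, not part of the exercise solution.

```python
def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    tokens = []
    current = []  # characters of the word being built
    for ch in text:
        if ch.isalnum():
            current.append(ch)
        else:
            # a non-alphanumeric character ends the current word
            if current:
                tokens.append(''.join(current))
                current = []
            # keep punctuation as its own token, drop whitespace
            if not ch.isspace():
                tokens.append(ch)
    if current:
        tokens.append(''.join(current))
    return tokens


sample = "Slow, slow!\nFor the grapes’ sake"
tokens = simple_tokenize(sample)
print(tokens)
print(f'There are {len(tokens)} tokens in the sample.')
```

With this rule the possessive `grapes’` becomes two tokens, which is one reason token counts differ from one tokenizer to another.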

### Exercise 1b: Now, tokenize the poem using one of the libraries that we talked about in class. How many tokens does the poem have now? How many unique tokens are there?

In [5]:
nlp = spacy.load("en_core_web_sm") # loading the English spaCy model

# process the poem
doc = nlp(poem)

tokenized_poem = [token.text for token in doc]

print(tokenized_poem)

['O', 'hushed', 'October', 'morning', 'mild', ',', '\n', 'Thy', 'leaves', 'have', 'ripened', 'to', 'the', 'fall', ';', '\n', 'Tomorrow', '’s', 'wind', ',', 'if', 'it', 'be', 'wild', ',', '\n', 'Should', 'waste', 'them', 'all', '.', '\n', 'The', 'crows', 'above', 'the', 'forest', 'call', ';', '\n ', 'Tomorrow', 'they', 'may', 'form', 'and', 'go', '.', '\n ', 'O', 'hushed', 'October', 'morning', 'mild', ',', '\n ', 'Begin', 'the', 'hours', 'of', 'this', 'day', 'slow', '.', '\n ', 'Make', 'the', 'day', 'seem', 'to', 'us', 'less', 'brief', '.', '\n ', 'Hearts', 'not', 'averse', 'to', 'being', 'beguiled', ',', '\n ', 'Beguile', 'us', 'in', 'the', 'way', 'you', 'know', '.', '\n ', 'Release', 'one', 'leaf', 'at', 'break', 'of', 'day', ';', '\n ', 'At', 'noon', 'release', 'another', 'leaf', ';', '\n ', 'One', 'from', 'our', 'trees', ',', 'one', 'far', 'away', '.', '\n ', 'Retard', 'the', 'sun', 'with', 'gentle', 'mist', ';', '\n ', 'Enchant', 'the', 'land', 'with', 'amethyst', '.', '\n ', 'Slo

In [6]:
# Let's count the tokens now with Counter.
token_counts = Counter(tokenized_poem)

print(f'There are {len(tokenized_poem)} tokens in the poem.')
print(f'There are {len(token_counts)} distinct types in the poem.')

There are 176 tokens in the poem.
There are 102 distinct types in the poem.
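
The library count is higher partly because punctuation marks and newline tokens such as `'\n'` and `'\n '` are counted as tokens. A small sketch of how those whitespace tokens could be filtered out before counting, using a short hand-written token list (an excerpt in the style of the output above, not the full poem):

```python
from collections import Counter

# a short excerpt of tokens in the style of the spaCy output
tokens = ['O', 'hushed', 'October', 'morning', 'mild', ',', '\n',
          'Thy', 'leaves', ',', '\n ']

# keep only tokens that are not made up entirely of whitespace
content_tokens = [tok for tok in tokens if not tok.isspace()]
counts = Counter(content_tokens)

print(f'{len(tokens)} tokens before filtering, {len(content_tokens)} after.')
print(f'{len(counts)} distinct types after filtering.')
print(counts.most_common(1))
```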


### Exercise 1c: Can you tokenize the poem by line, and print each line vertically? What problem do you come across?

In [7]:
# Let's try with spaCy
for sent in doc.sents:
    print(sent.text)

O hushed October morning mild,
Thy leaves have ripened to the fall;
Tomorrow’s wind, if it be wild,
Should waste them all.

The crows above the forest call;
 Tomorrow they may form and go.
 
O hushed October morning mild,
 Begin the hours of this day slow.
 
Make the day seem to us less brief.
 
Hearts not averse to being beguiled,
 Beguile us in the way you know.
 
Release one leaf at break of day;
 At noon release another leaf;
 One from our trees, one far away.
 
Retard the sun with gentle mist;
 
Enchant the land with amethyst.
 
Slow, slow!
 
For the grapes’ sake, if they were all,
 Whose leaves already are burnt with frost,
 Whose clustered fruit must else be lost—
 For the grapes’ sake along the wall.


In [8]:
# Let's try with regular expressions: split on the spaces that follow a punctuation mark.
lines = re.split(r'(?<=[.!?,;]) +', poem)

for line in lines:
    print(line.strip())

O hushed October morning mild,
Thy leaves have ripened to the fall;
Tomorrow’s wind,
if it be wild,
Should waste them all.
The crows above the forest call;
 Tomorrow they may form and go.
 O hushed October morning mild,
 Begin the hours of this day slow.
 Make the day seem to us less brief.
 Hearts not averse to being beguiled,
 Beguile us in the way you know.
 Release one leaf at break of day;
 At noon release another leaf;
 One from our trees,
one far away.
 Retard the sun with gentle mist;
 Enchant the land with amethyst.
 Slow,
slow!
 For the grapes’ sake,
if they were all,
 Whose leaves already are burnt with frost,
 Whose clustered fruit must else be lost—
 For the grapes’ sake along the wall.
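
Both attempts split on sentences or punctuation rather than on the poem's own line breaks. If the goal is simply one verse per printed line, a sketch using the built-in `str.splitlines()` may be enough. It assumes each verse sits on its own line in the string, and the excerpt below is shortened for illustration.

```python
poem_excerpt = """O hushed October morning mild,
Thy leaves have ripened to the fall;
 Tomorrow’s wind, if it be wild,
"""

# splitlines() breaks on newline characters; strip() removes stray
# leading/trailing spaces, and empty lines are skipped
lines = [line.strip() for line in poem_excerpt.splitlines() if line.strip()]

for number, line in enumerate(lines, start=1):
    print(f'{number}: {line}')
```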


## Normalization



> Un’armonia mi suona    nelle vene,
allora simile a Dafne
mi trasmuto in un albero alto,
Apollo, perché tu non mi fermi.
Ma    sono una Dafne
accecata dal fumo della follia,
non ho    foglie né fiori;
eppure mentre mi trasmigro
nasce    profonda la luce
e nella solitudine arborea
volgo una triade di Dei.



### Exercise 2a: Casefolding, whitespaces, accent and punctuation removal.

1.   Turn all characters of the new poem into lower case.
2.   Remove redundant whitespace.
3.   Remove the accent mark from certain characters.
4.   Remove any punctuation marks.



In [9]:
poem = """Un’armonia mi suona    nelle vene,
allora simile a Dafne
mi trasmuto in un albero alto,
Apollo, perché tu non mi fermi.
Ma    sono una Dafne
accecata dal fumo della follia,
non ho    foglie né fiori;
eppure mentre mi trasmigro
nasce    profonda la luce
e nella solitudine arborea
volgo una triade di Dei."""

# lower-casing
poem = poem.lower()
# print(poem)

# collapse runs of whitespace (including newlines) into a single space
cleaned_poem = re.sub(r'\s+', ' ', poem)

# remove accent marks (unidecode also maps the curly apostrophe ’ to a plain ')
cleaned_poem = unidecode(cleaned_poem)

# remove punctuation marks
cleaned_poem = re.sub(r'[^\w\s]', '', cleaned_poem)

# print(cleaned_poem)

In [10]:
print(cleaned_poem)

unarmonia mi suona nelle vene allora simile a dafne mi trasmuto in un albero alto apollo perche tu non mi fermi ma sono una dafne accecata dal fumo della follia non ho foglie ne fiori eppure mentre mi trasmigro nasce profonda la luce e nella solitudine arborea volgo una triade di dei
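
The same four steps can also be wrapped in one function. The sketch below swaps `unidecode` for the standard-library `unicodedata` module, so it removes accents by decomposing characters and dropping the combining marks. That is an alternative technique, not the one used in the cell above, and `normalize_text` is a hypothetical helper name.

```python
import re
import unicodedata


def normalize_text(text):
    """Lower-case, strip accents, remove punctuation, collapse whitespace."""
    text = text.lower()
    # NFD splits 'é' into 'e' plus a combining accent (category 'Mn')
    decomposed = unicodedata.normalize('NFD', text)
    text = ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
    # drop everything that is neither a word character nor whitespace
    text = re.sub(r'[^\w\s]', '', text)
    # collapse runs of whitespace (spaces, newlines) into one space
    return re.sub(r'\s+', ' ', text).strip()


sample = "Un’armonia mi suona    nelle vene,\nperché né fiori;"
normalized = normalize_text(sample)
print(normalized)
```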


### Exercise 2b: Stemming
Use one of the libraries to produce a list with all the stems of the words in the poem.

In [11]:
# let's use NLTK this time

# sometimes we need to download the nltk data files
nltk.download('punkt')

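# we create the Porter stemmer object
# (note: the Porter algorithm was designed for English, so its output on Italian text is only approximate)
stemmer = PorterStemmer()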

# we need to tokenize before stemming
words = word_tokenize(poem)

# let's stem
stemmed_words = [stemmer.stem(word) for word in words]

# print the original and stemmed words
for word, stemmed in zip(words, stemmed_words):
    print(f"{word} -> {stemmed}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


un -> un
’ -> ’
armonia -> armonia
mi -> mi
suona -> suona
nelle -> nell
vene -> vene
, -> ,
allora -> allora
simile -> simil
a -> a
dafne -> dafn
mi -> mi
trasmuto -> trasmuto
in -> in
un -> un
albero -> albero
alto -> alto
, -> ,
apollo -> apollo
, -> ,
perché -> perché
tu -> tu
non -> non
mi -> mi
fermi -> fermi
. -> .
ma -> ma
sono -> sono
una -> una
dafne -> dafn
accecata -> accecata
dal -> dal
fumo -> fumo
della -> della
follia -> follia
, -> ,
non -> non
ho -> ho
foglie -> fogli
né -> né
fiori -> fiori
; -> ;
eppure -> eppur
mentre -> mentr
mi -> mi
trasmigro -> trasmigro
nasce -> nasc
profonda -> profonda
la -> la
luce -> luce
e -> e
nella -> nella
solitudine -> solitudin
arborea -> arborea
volgo -> volgo
una -> una
triade -> triad
di -> di
dei -> dei
. -> .
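
Since the Porter stemmer targets English, many Italian words above come back unchanged. NLTK also ships a Snowball stemmer with Italian rules. The sketch below shows how it could be applied to a few words from the poem; it is offered as an alternative to try, not as the solution used in class, and the word list is a hand-picked sample.

```python
from nltk.stem.snowball import SnowballStemmer

# Snowball stemmer configured with the Italian rule set
italian_stemmer = SnowballStemmer("italian")

# a few words taken from the poem
words = ["foglie", "fiori", "solitudine", "profonda"]
stems = [italian_stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

Comparing this output with the Porter output is a quick way to see how much a stemmer depends on language-specific suffix rules.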


In [12]:
word  = 'potato'

print(f'I do not want a {word}')

I do not want a potato


### Exercise 2c: Lemmatization
Lemmatize the poem - try at least two libraries.

In [13]:
# Download necessary NLTK data files (if not already downloaded)
# nltk.download('punkt')
nltk.download('wordnet')

# Create a WordNetLemmatizer object
# (WordNet is an English lexical database, so the Italian tokens come back unchanged)
lemmatizer = WordNetLemmatizer()

# Tokenization
tokens = word_tokenize(poem, language='italian')

# Lemmatization
lemmatized_words = [(word, lemmatizer.lemmatize(word)) for word in tokens]

# Print the original tokens and their lemmatized forms
for word, lemmatized in lemmatized_words:
    print(f"{word} -> {lemmatized}")

[nltk_data] Downloading package wordnet to /root/nltk_data...


un -> un
’ -> ’
armonia -> armonia
mi -> mi
suona -> suona
nelle -> nelle
vene -> vene
, -> ,
allora -> allora
simile -> simile
a -> a
dafne -> dafne
mi -> mi
trasmuto -> trasmuto
in -> in
un -> un
albero -> albero
alto -> alto
, -> ,
apollo -> apollo
, -> ,
perché -> perché
tu -> tu
non -> non
mi -> mi
fermi -> fermi
. -> .
ma -> ma
sono -> sono
una -> una
dafne -> dafne
accecata -> accecata
dal -> dal
fumo -> fumo
della -> della
follia -> follia
, -> ,
non -> non
ho -> ho
foglie -> foglie
né -> né
fiori -> fiori
; -> ;
eppure -> eppure
mentre -> mentre
mi -> mi
trasmigro -> trasmigro
nasce -> nasce
profonda -> profonda
la -> la
luce -> luce
e -> e
nella -> nella
solitudine -> solitudine
arborea -> arborea
volgo -> volgo
una -> una
triade -> triade
di -> di
dei -> dei
. -> .


In [14]:
# Load the Italian NLP model (careful to use the right model every time)
nlp = spacy.load("it_core_news_sm")

# Process the text using spacy
doc = nlp(poem)

# Extracting the lemmatized forms of each token
lemmatized_words = [(token.text, token.lemma_) for token in doc]

# Print the original tokens and their lemmatized forms
for word, lemmatized in lemmatized_words:
    print(f"{word} -> {lemmatized}")

un’ -> uno
armonia -> armonia
mi -> mi
suona -> suonare
    ->    
nelle -> in il
vene -> vena
, -> ,

 -> 

allora -> allora
simile -> simile
a -> a
dafne -> dafne

 -> 

mi -> mi
trasmuto -> trasmere
in -> in
un -> uno
albero -> albero
alto -> alto
, -> ,

 -> 

apollo -> apollo
, -> ,
perché -> perché
tu -> tu
non -> non
mi -> mi
fermi -> fermare
. -> .

 -> 

ma -> ma
    ->    
sono -> essere
una -> uno
dafne -> dafne

 -> 

accecata -> accecare
dal -> da il
fumo -> fumo
della -> di il
follia -> follia
, -> ,

 -> 

non -> non
ho -> avere
    ->    
foglie -> foglia
né -> né
fiori -> fiore
; -> ;

 -> 

eppure -> eppure
mentre -> mentre
mi -> mi
trasmigro -> Trasmigro

 -> 

nasce -> nascere
    ->    
profonda -> profondare
la -> il
luce -> luce

 -> 

e -> e
nella -> in il
solitudine -> solitudine
arborea -> arboreo

 -> 

volgo -> volgo
una -> uno
triade -> triade
di -> di
dei -> Dei
. -> .


# Mini Introduction to Pandas

### Let's make a dataframe
**Dataframe**: a two-dimensional, size-mutable, tabular data structure with labeled axes (rows and columns). It can be thought of as similar to a spreadsheet or a SQL table, where data is organized in rows and columns.

In [15]:
data = {
    'Name': ['Frodo Baggins', 'Gandalf', 'Aragorn', 'Legolas', 'Gimli', 'Samwise Gamgee'],
    'Race': ['Hobbit', 'Maia', 'Human', 'Elf', 'Dwarf', 'Hobbit'],
    'Age': [50, 2019, 87, 2, 139, 38],
    'Notable Trait': ['Ring-bearer', 'Wise', 'King', 'Archer', 'Warrior', 'Loyalty']
}

df = pd.DataFrame(data)

# print("Original DataFrame:")
# print(df)
df

             Name    Race   Age Notable Trait
0   Frodo Baggins  Hobbit    50   Ring-bearer
1         Gandalf    Maia  2019          Wise
2         Aragorn   Human    87          King
3         Legolas     Elf     2        Archer
4           Gimli   Dwarf   139       Warrior
5  Samwise Gamgee  Hobbit    38       Loyalty
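
Once the data is in a dataframe, rows and columns can be inspected by label. A short sketch of a few common operations, using three of the columns from the dataframe above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Frodo Baggins', 'Gandalf', 'Aragorn', 'Legolas', 'Gimli', 'Samwise Gamgee'],
    'Race': ['Hobbit', 'Maia', 'Human', 'Elf', 'Dwarf', 'Hobbit'],
    'Age': [50, 2019, 87, 2, 139, 38],
})

# (number of rows, number of columns)
print(df.shape)

# boolean indexing: keep only the rows whose Race is 'Hobbit'
hobbits = df[df['Race'] == 'Hobbit']
print(hobbits)

# sort by a column and read one cell from the first row
oldest = df.sort_values('Age', ascending=False).iloc[0]['Name']
print(f'The oldest character is {oldest}.')
```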


Let's save this df to a TSV file

**TSV** (Tab-Separated Values file): text file that uses tab characters to separate values or fields. Each line in the file represents a record or row of data, and each field within that record is separated by a tab.

In [16]:
df.to_csv('lotr_characters.tsv', sep='\t', index=False) # the sep parameter specifies the string used to separate values in the output file; '\t' writes tabs instead of the default commas.
# The index parameter determines whether to write row indices (the row labels) to the output file.

# download the file locally automatically
#files.download('lotr_characters.tsv')

How to load/read a TSV file

In [17]:
df_tsv = pd.read_csv('lotr_characters.tsv', sep='\t')
print(df_tsv)

             Name    Race   Age Notable Trait
0   Frodo Baggins  Hobbit    50   Ring-bearer
1         Gandalf    Maia  2019          Wise
2         Aragorn   Human    87          King
3         Legolas     Elf     2        Archer
4           Gimli   Dwarf   139       Warrior
5  Samwise Gamgee  Hobbit    38       Loyalty
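
The separator has to match on the writing and the reading side. The sketch below uses an in-memory buffer instead of a file (an assumption made to keep it self-contained) and a two-column toy dataframe to show what happens when it does not match: every row collapses into a single column.

```python
import io

import pandas as pd

df = pd.DataFrame({'Name': ['Frodo Baggins', 'Gandalf'], 'Age': [50, 2019]})

# written with the default comma separator, then read as tab-separated
comma_text = df.to_csv(index=False)
mismatched = pd.read_csv(io.StringIO(comma_text), sep='\t')

# written and read with the same tab separator
tab_text = df.to_csv(index=False, sep='\t')
matched = pd.read_csv(io.StringIO(tab_text), sep='\t')

print(mismatched.shape, list(mismatched.columns))
print(matched.shape, list(matched.columns))
```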


Let's do the same with JSON files

**JSON** (JavaScript Object Notation): a plain-text format, meaning it can be opened and edited with any text editor. Its structure is based on key-value pairs.





In [18]:
df.to_json('lotr_characters.json', orient='records', lines=True) # orient='records' writes each row as one JSON object of column-value pairs, which is easy to read and use in many applications.
# lines=True writes those records one per line (the JSON Lines layout), making the file suitable for streaming or line-by-line processing.

In [19]:
df_json = pd.read_json('lotr_characters.json', orient='records', lines=True)

print(df_json)

             Name    Race   Age Notable Trait
0   Frodo Baggins  Hobbit    50   Ring-bearer
1         Gandalf    Maia  2019          Wise
2         Aragorn   Human    87          King
3         Legolas     Elf     2        Archer
4           Gimli   Dwarf   139       Warrior
5  Samwise Gamgee  Hobbit    38       Loyalty
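
To see what `orient='records'` with `lines=True` actually produces, the JSON text can be inspected directly. In this sketch `to_json` is called without a file path so that it returns a string, and a two-row toy dataframe stands in for the full one.

```python
import json

import pandas as pd

df = pd.DataFrame({'Name': ['Frodo Baggins', 'Gandalf'], 'Age': [50, 2019]})

# with no path given, to_json returns the JSON text as a string
json_text = df.to_json(orient='records', lines=True)

# each non-empty line is a complete JSON object describing one row
json_lines = json_text.strip().split('\n')
records = [json.loads(line) for line in json_lines]

for record in records:
    print(record)
```

Because every line parses on its own, such a file can be processed one record at a time without loading all of it into memory.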


# Representations

### Exercise 3a: Bag of Words
Use the first 10 lines of the first poem and
*  Create the BoW vectors for each sentence based on word frequency.
*  Try to use Pandas
*  Optional: Add new sentences to your dataset and expand your BoW model!



In [20]:
# let's use what we learned before about splitting the poem into sentences
# Let's try with regular expressions.

poem = """O hushed October morning mild,
Thy leaves have ripened to the fall;
Tomorrow’s wind, if it be wild,
Should waste them all.
The crows above the forest call;
Tomorrow they may form and go.
O hushed October morning mild,
Begin the hours of this day slow.
Make the day seem to us less brief. """

sentences = re.split(r'(?<=[.!?,;]) +', poem)

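# load the tokens into a dictionary (each chunk returned by re.split is treated as a document)
# each token is marked with 1 when present, so repeated tokens are not counted more than once
corpus = {}
for i, sent in enumerate(sentences):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())
# print(corpus)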

# # # load the dictionary contents into a pandas dataframe
df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T

df

Unnamed: 0,O,hushed,October,morning,"mild,",Thy,leaves,have,ripened,to,...,hours,of,this,day,slow.,Make,seem,us,less,brief.
sent0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
sent1,1,1,1,1,1,0,0,0,0,1,...,1,1,1,1,1,1,1,1,1,1
sent2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
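
The exercise asks for vectors based on word frequency, with one document per line. A possible variant is sketched below: it treats each verse as a document and counts repeats with `Counter`. The lower-casing and the two sample lines (given without punctuation) are choices made for this illustration, not part of the solution above.

```python
from collections import Counter

import pandas as pd

lines = [
    "O hushed October morning mild",
    "The crows above the forest call",
]

# one Counter per line: token -> number of occurrences in that line
counts_per_line = [Counter(line.lower().split()) for line in lines]

# rows are lines, columns are vocabulary items; missing words become 0
bow = pd.DataFrame(counts_per_line,
                   index=[f'line{i}' for i in range(len(lines))])
bow = bow.fillna(0).astype(int)

print(bow)
```

In the second line "the" occurs twice, so its cell holds 2 rather than 1. That is the difference between a frequency-based and a presence-based bag of words.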


### Exercise 3b: One Hot Encoding
Use the first 10 lines of the first poem and

*  Create your own vocabulary/lexicon: Identify unique words from these sentences.
*  Create the one hot vectors
*  Optional: Add new sentences to your dataset and expand your one hot embeddings.

In [21]:
import spacy
import re
import numpy as np

# poem text
poem = """O hushed October morning mild,
Thy leaves have ripened to the fall;
Tomorrow’s wind, if it be wild"""

# load spaCy's English model
nlp = spacy.load("en_core_web_sm")

# strip punctuation using regex
cleaned_poem = re.sub(r'[^\w\s]', '', poem)

# process the poem using spaCy
doc = nlp(cleaned_poem)

# tokenize the poem
tokenized_poem = [token.text for token in doc]

# create a sorted vocabulary (unique tokens)
voc = sorted(set(tokenized_poem))

# calculate vocabulary size and the number of tokens
voc_size = len(voc)
tokens_length = len(tokenized_poem)

# create a one-hot encoding matrix
onehot_vectors = np.zeros((tokens_length, voc_size), int)

# create a token-to-index mapping for one-hot encoding
token_to_index = {token: idx for idx, token in enumerate(voc)}

# fill the one-hot encoding matrix
for i, token in enumerate(tokenized_poem):
    onehot_vectors[i, token_to_index[token]] = 1

# print the one-hot encoded matrix
print(onehot_vectors)


[[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]]


In [22]:
df = pd.DataFrame(onehot_vectors, columns=voc)
df

Unnamed: 0,\n,O,October,Thy,Tomorrows,be,fall,have,hushed,if,it,leaves,mild,morning,ripened,the,to,wild,wind
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
5,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
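
The explicit loop used to fill the matrix can also be replaced by indexing into an identity matrix, because row `i` of the identity matrix is exactly the one-hot vector for index `i`. A minimal sketch with a five-token toy sequence (not the poem itself):

```python
import numpy as np

tokens = ['the', 'crows', 'above', 'the', 'forest']

# sorted vocabulary and token -> column index mapping
vocab = sorted(set(tokens))
token_to_index = {token: idx for idx, token in enumerate(vocab)}

# one integer id per token, in text order
token_ids = [token_to_index[token] for token in tokens]

# picking rows of the identity matrix yields one one-hot vector per token
onehot = np.eye(len(vocab), dtype=int)[token_ids]
print(onehot)

# argmax recovers the column index, which maps back to the token
decoded = [vocab[i] for i in onehot.argmax(axis=1)]
print(decoded)
```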


# Further Practice

If you want to practice more, use what you've learnt so far to preprocess a text that you like. For example, you can find a Reddit jokes dataset below.

In [23]:
! git clone https://github.com/taivop/joke-dataset.git

Cloning into 'joke-dataset'...
remote: Enumerating objects: 44, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 44 (delta 12), reused 10 (delta 10), pack-reused 30 (from 1)
Receiving objects: 100% (44/44), 32.38 MiB | 18.21 MiB/s, done.
Resolving deltas: 100% (21/21), done.


In [24]:
import pandas as pd
df = pd.read_json('/content/joke-dataset/reddit_jokes.json')
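
As a starting point, the steps from this session can be chained and applied to a text column. The sketch below runs on a tiny made-up dataframe so that it works without the download. The column names `title` and `body` are an assumption about the dataset's fields, and `preprocess` is a hypothetical helper; check `df.columns` on the real data first.

```python
import re

import pandas as pd

# made-up stand-in rows; the real dataset's column names may differ
jokes = pd.DataFrame({
    'title': ['Why did the chicken cross the road?'],
    'body': ['To get to the other side!'],
})


def preprocess(text):
    """Lower-case, remove punctuation, and split on whitespace."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text.split()


# join title and body, then tokenize each joke
jokes['tokens'] = (jokes['title'] + ' ' + jokes['body']).apply(preprocess)
print(jokes['tokens'].iloc[0])
```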