# Lab 1 - Session exercise

Develop a Jupyter notebook that shows the 25 most frequent non-stopwords in the file 'blake-poems.txt' of the Gutenberg corpus.


### Import nltk
Natural Language Toolkit 

In [1]:
# nltk && stopwords
import nltk
from nltk.corpus import stopwords

# Run only when the corpora still need to be downloaded
nltk.download('gutenberg')
nltk.download('stopwords')

### Loading the required file

In [2]:
# load file
bp = nltk.corpus.gutenberg.words('blake-poems.txt')
print('bp length: %d, subset: %s'%(len(bp), bp[:3]))

bp length: 8354, subset: ['[', 'Poems', 'by']


### Lowercase the file:
The stopword list is all lowercase, so the tokens must be lowercased as well for the extraction to work properly; otherwise, for instance, 'A' would not be recognised as the stopword 'a'.

In [3]:
# lowercase the file
bp_lower = [word.lower() for word in bp]
print('bp_lower length: %d, subset: %s'%(len(bp_lower), bp_lower[:3]))

bp_lower length: 8354, subset: ['[', 'poems', 'by']


In [4]:
# read the English stopwords
stop_words = stopwords.words('english')
print('stopwords length: %d, subset: %s'%(len(stop_words), stop_words[:3]))

stopwords length: 179, subset: ['i', 'me', 'my']
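
Two details matter when filtering against this list: membership tests are case-sensitive, and a `list` is scanned item by item on every `in` check, whereas a `set` is not. The following is a minimal sketch with made-up tokens and a hand-written stop-word list (`toy_tokens` and `toy_stop_words` are hypothetical stand-ins, not the real corpus or the NLTK list):

```python
# Toy stand-ins (assumptions): not the real corpus or the NLTK stop-word list.
toy_tokens = ['The', 'Lamb', 'and', 'the', 'Tyger', 'A', 'a']
toy_stop_words = ['the', 'and', 'a']

# A set gives fast membership tests; a list is scanned item by item.
toy_stop_set = set(toy_stop_words)

# Without lowercasing, 'The' and 'A' slip through the filter.
kept_as_is = [w for w in toy_tokens if w not in toy_stop_set]

# Lowercasing first makes 'The'/'the' and 'A'/'a' compare equal.
kept_lowered = [w.lower() for w in toy_tokens if w.lower() not in toy_stop_set]

print(kept_as_is)    # ['The', 'Lamb', 'Tyger', 'A']
print(kept_lowered)  # ['lamb', 'tyger']
```

The same idea could be applied in the notebook by wrapping the stop-word list in `set(...)` before filtering; the results would be identical, only the lookups get cheaper.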


### 'Raw' extraction of the non-stopwords:

In [5]:
# extract non-stopwords from file
non_stopwords = [word for word in bp_lower if word not in stop_words]
# frequencies of words
freqs = {w:non_stopwords.count(w) for w in set(non_stopwords)}
# sort by frequency and keep the 25 most frequent non-stopwords
# (named 'ordered' so the built-in ord() is not shadowed)
ordered = sorted(freqs.items(), key=lambda x: x[1], reverse=True)
print('non-stopwords length: %d, 25 occurrences: %s'%(len(non_stopwords), ordered[:25]))

non-stopwords length: 5225, 25 occurrences: [(',', 680), ('.', 201), ("'", 104), (';', 98), (':', 75), ('?', 65), ('!', 59), ('"', 51), ('little', 45), ('thee', 42), ('thou', 35), ('like', 35), ('thy', 31), ('love', 29), ('night', 28), ('sweet', 28), ('joy', 25), ('away', 24), ('weep', 24), ('father', 22), ('sleep', 21), ('-', 20), ('."', 20), ('shall', 19), ('day', 19)]


### Clean the file:
To get a cleaner extraction, without punctuation marks, signs and symbols, we can use **re.findall()** from Python's regular expression module, which returns all the matches as a list of strings.

In [6]:
import re

#### Extract the words

The function **re.findall** can help us clean the file with the regex ('\w+'),
which matches runs of word characters (letters, digits and the underscore) and therefore leaves out punctuation

In [7]:
# extract only the words
# rejoin the words with space
clean_text = ' '.join(non_stopwords)
# filtering to get only the words
finalnon_stopwords = re.findall(r'\w+', clean_text)
# length
print('new length of non-stopwords: %d'%(len(finalnon_stopwords)))

new length of non-stopwords: 3807
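
To see what the pattern keeps and what it drops, here is a small sketch on a made-up string (`sample` is an assumption, written with spaces between tokens the way `clean_text` is built). It shows that `\w+` also keeps digits and underscores, and that fragments on either side of an apostrophe survive as separate words:

```python
import re

# Made-up sample (assumption), space-joined like clean_text.
sample = "infant ' s head ! tiger_1 , 42 ."

# \w matches letters, digits and the underscore; punctuation is skipped.
words = re.findall(r'\w+', sample)
print(words)  # ['infant', 's', 'head', 'tiger_1', '42']
```

The raw-string prefix `r'...'` keeps Python from interpreting `\w` as a string escape before the pattern reaches the `re` module.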


In [8]:
# frequencies of words
freqs = {w:finalnon_stopwords.count(w) for w in set(finalnon_stopwords)}
# sort by frequency and keep the 25 most frequent non-stopwords
ordered = sorted(freqs.items(), key=lambda x: x[1], reverse=True)
ordered[:25]

[('little', 45),
 ('thee', 42),
 ('thou', 35),
 ('like', 35),
 ('thy', 31),
 ('love', 29),
 ('night', 28),
 ('sweet', 28),
 ('joy', 25),
 ('away', 24),
 ('weep', 24),
 ('father', 22),
 ('sleep', 21),
 ('shall', 19),
 ('day', 19),
 ('mother', 19),
 ('happy', 19),
 ('child', 18),
 ('every', 17),
 ('never', 17),
 ('human', 16),
 ('voice', 16),
 ('er', 16),
 ('infant', 16),
 ('green', 16)]

### Adding new stopwords
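The word **('er', 16)** looks suspicious. A quick check of the file 'blake-poems.txt' shows that it comes from contractions, mostly the recurrent preposition "o'er" but also "where'er" and "ne'er", which are split at the apostrophe during tokenization.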

In [9]:
# check whether the token exists
print('"er" exists?: %s'%("er" in bp))
# look up its context in the corpus
from nltk.text import Text
crp = Text(bp)
crp.concordance('er')

"er" exists?: True
Displaying 16 of 16 matches:
d bid thee feed By the stream and o ' er the mead ; Gave thee clothing of deli
 SONG Sweet dreams , form a shade O ' er my lovely infant ' s head ! Sweet dre
 Sweet Sleep , angel mild , Hover o ' er my happy child ! Sweet smiles , in th
eep , sleep , happy sleep , While o ' er thee doth mother weep . Sweet babe , 
 shine like the gold , As I guard o ' er the fold ." SPRING Sound the flute ! 
AM Once a dream did weave a shade O ' er my angel - guarded bed , That an emme
is eternal winter there . For where ' er the sun does shine , And where ' er t
' er the sun does shine , And where ' er the rain does fall , Babes should nev
p . " Frowning , frowning night , O ' er this desert bright Let thy moon arise
 viewed : Then he gambolled round O ' er the hallowed ground . Leopards , tige
 an Angel mild : Witless woe was ne ' er beguiled ! And I wept both night and 
 ," And I passed the sweet flower o ' er . Then I went to my pretty rose tree 
 meet

We can add 'er' as a stopword at the beginning to get a better extraction: **stop_words.append('er')**
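
Note that `append` modifies the list in place, so running the same cell twice stores 'er' twice. A minimal sketch of a variant that leaves the original list untouched and cannot create duplicates is a set union (`base_stop_words`, `extra_stop_words` and `tokens` are hypothetical names and toy values, not the NLTK list or the corpus):

```python
# Hypothetical stand-in for the NLTK list (assumption), kept short for the example.
base_stop_words = ['i', 'me', 'my']

# Extra tokens to ignore; 'er' is the fragment discussed above.
extra_stop_words = {'er'}

# Set union builds a new collection and never stores duplicates,
# so re-running this gives the same result every time.
all_stop_words = set(base_stop_words) | extra_stop_words

tokens = ['er', 'my', 'lovely', 'infant']
kept = [w for w in tokens if w not in all_stop_words]
print(kept)  # ['lovely', 'infant']
```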

In [16]:
# add the new stopword
stop_words.append("er")
# extract non-stopwords from file
non_stopwords = [word for word in bp_lower if word not in stop_words]
# extract only the words
clean_text = ' '.join(non_stopwords)
finalnon_stopwords = re.findall(r'\w+', clean_text)
# frequencies of words
freqs = {w:finalnon_stopwords.count(w) for w in set(finalnon_stopwords)}
# sort by frequency and keep the 25 most frequent non-stopwords
# reverse=True sorts them in descending order
ordered = sorted(freqs.items(), key=lambda x: x[1], reverse=True)
ordered[:25]

[('little', 45),
 ('thee', 42),
 ('thou', 35),
 ('like', 35),
 ('thy', 31),
 ('love', 29),
 ('night', 28),
 ('sweet', 28),
 ('joy', 25),
 ('away', 24),
 ('weep', 24),
 ('father', 22),
 ('sleep', 21),
 ('shall', 19),
 ('day', 19),
 ('mother', 19),
 ('happy', 19),
 ('child', 18),
 ('every', 17),
 ('never', 17),
 ('human', 16),
 ('voice', 16),
 ('infant', 16),
 ('green', 16),
 ('thel', 16)]

**And there we have the 25 most frequent non-stopwords in the file 'blake-poems.txt' of the Gutenberg corpus.**

## Alternative solution
Additionally, there is an alternative approach to this exercise that streamlines the process with the **Counter** class from Python's `collections` module.

Counter is a container that keeps track of how many times equivalent values are added.
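
As a quick illustration of that behaviour, the sketch below counts a made-up word list (`toy_words` is an assumption, not taken from the corpus). Unlike a dictionary comprehension built on `list.count`, which rescans the whole list once per distinct word, `Counter` tallies everything in a single pass:

```python
from collections import Counter

# Made-up word list (assumption), not taken from the corpus.
toy_words = ['sweet', 'joy', 'sweet', 'night', 'joy', 'sweet']

counts = Counter(toy_words)    # one pass over the list
print(counts['sweet'])         # 3
print(counts['lamb'])          # 0: missing keys count as zero
print(counts.most_common(2))   # [('sweet', 3), ('joy', 2)]
```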

### Import Counter

In [17]:
from collections import Counter

### Perform the extraction again, including the cleaning step

In [18]:
# extract non-stopwords from file
alt_non_stopwords = [word for word in bp_lower if word not in stop_words]
# extract only the words
clean_text = ' '.join(alt_non_stopwords)
f_words = re.findall(r'\w+', clean_text)

#### Using Counter().most_common function

In [19]:
# top 25 occurrences of non-stopwords
top_25 = Counter(f_words).most_common(25)
top_25

[('little', 45),
 ('thee', 42),
 ('like', 35),
 ('thou', 35),
 ('thy', 31),
 ('love', 29),
 ('sweet', 28),
 ('night', 28),
 ('joy', 25),
 ('away', 24),
 ('weep', 24),
 ('father', 22),
 ('sleep', 21),
 ('happy', 19),
 ('shall', 19),
 ('day', 19),
 ('mother', 19),
 ('child', 18),
 ('every', 17),
 ('never', 17),
 ('thel', 16),
 ('hear', 16),
 ('green', 16),
 ('voice', 16),
 ('infant', 16)]
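
This list and the previous one disagree in their last entries ('hear' appears here, 'human' appeared before). At least six words occur 16 times, but only five slots remain below 'never', so the cut-off at 25 falls inside a group of ties. Which tied words make the list depends on ordering: iterating over a `set` yields an arbitrary order, while `most_common` keeps tied elements in the order they were first encountered. One possible way to make the result independent of ordering is to return every word tied with the n-th one. The helper below is a hypothetical sketch on toy data, not part of the original exercise:

```python
from collections import Counter

def top_n_with_ties(words, n):
    """Return the n most common (word, count) pairs plus any words tied with the n-th."""
    if n <= 0:
        return []
    ranked = Counter(words).most_common()  # all pairs, highest count first
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]              # count of the n-th entry
    return [(w, c) for w, c in ranked if c >= cutoff]

# Toy data (assumption): three words share the second-highest count.
toy_words = ['little'] * 3 + ['thee'] * 2 + ['human'] * 2 + ['hear'] * 2 + ['day']
result = top_n_with_ties(toy_words, 2)
print(result)
```

With `n=2` the helper returns four pairs rather than two, because 'thee', 'human' and 'hear' all share the count of the second entry.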

### Conclusion

There are usually several ways to solve a data-processing problem. It is advisable to do some quick research into the problem and learn how other people have solved it, so that the issue is better understood and the simplest solution can be chosen.

Therefore, to avoid *reinventing the wheel*, we ought to use built-in functions and existing modules whenever possible.