<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Exercise.png"  style="display: block; margin-left: auto; margin-right: auto;";/>
</div>

# Exercise: Natural language processing
© ExploreAI Academy

In this exercise, we will preprocess text by converting it to lowercase, removing punctuation, building a bag-of-words, and applying stemming and lemmatization. We will then analyse the processed text to gain some insights.

## Learning objectives

By the end of this exercise, you should be able to:
* Implement text preprocessing techniques such as converting to lowercase and removing punctuation.
* Apply stemming and lemmatization techniques to extract the root forms of words.
* Create a bag-of-words representation to quantify the occurrence of words in text.
* Calculate statistics such as the number of stop words, unique words, and word frequencies in text data.

## Import libraries and read in the data

In [3]:
import nltk
from nltk import TreebankWordTokenizer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import string
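import urllib.request  # import the submodule explicitly; a bare `import urllib` does not guarantee urllib.request is loaded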

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

The data used in this notebook is text from the book "Alice's Adventures in Wonderland" by Lewis Carroll. 

In [2]:
# Read in the data
def read_text_from_url():
    """Download the book text and return it as a decoded string."""
    with urllib.request.urlopen('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/classification_sprint//alice_in_wonderland.txt') as f:
        return f.read().decode('ISO-8859-1')

data = read_text_from_url()
print(data[:863])

Alice's Adventures in Wonderland

                ALICE'S ADVENTURES IN WONDERLAND

                          Lewis Carroll

               THE MILLENNIUM FULCRUM EDITION 3.0




                            CHAPTER I

                      Down the Rabbit-Hole


  Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do:  once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought Alice `without pictures or conversation?'

  So she was considering in her own mind (as well as she could,
for the hot day made her feel very sleepy and stupid), whether
the pleasure of making a daisy-chain would be worth the trouble
of getting up and picking the daisies, when suddenly a White
Rabbit with pink eyes ran close by her.

 


## Data preprocessing
We start by providing the functions required to remove punctuation and create a bag-of-words, and by defining a stemmer, a tokeniser, and a lemmatizer. Once you have applied them to preprocess the data, you will be asked to perform some calculations and analysis in the exercise questions below.


**Convert to lowercase and remove punctuation** 

In [4]:
#Function to remove punctuation

def remove_punctuation(words):
    words = words.lower()
    return ''.join([x for x in words if x not in string.punctuation])

In [5]:
#Apply the remove_punctuation function to the data
data = remove_punctuation(data)
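
To make the effect of `remove_punctuation` concrete, here is a minimal sketch that applies the same logic to a short sample string (the sample string is made up for illustration). Note that hyphens and apostrophes are part of `string.punctuation`, so hyphenated words are merged and possessives lose their apostrophe.

```python
import string

def remove_punctuation(words):
    # Lowercase first, then drop every character listed in string.punctuation
    words = words.lower()
    return ''.join([x for x in words if x not in string.punctuation])

# Illustrative sample, not taken from the notebook's data variable
sample = "Down the Rabbit-Hole: Alice's sister!"
cleaned = remove_punctuation(sample)
print(cleaned)  # down the rabbithole alices sister
```

This is worth keeping in mind when interpreting tokens later: a token such as `rabbithole` comes from the hyphen being removed, not from the original text.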

**Create a bag-of-words and assign our stemmer and lemmatizer**

In [6]:
# Define stemmer function
stemmer = SnowballStemmer('english')

# Tokenise data
tokeniser = TreebankWordTokenizer()
tokens = tokeniser.tokenize(data)

# Define lemmatizer
lemmatizer = WordNetLemmatizer()

# Bag-of-words
def bag_of_words_count(words, word_dict=None):
    """Take a list of words and return a dictionary with each word as a key
       and the number of times that word appeared as the value."""
    # Create a fresh dictionary per call; a mutable default ({}) would be shared between calls
    if word_dict is None:
        word_dict = {}
    for word in words:
        if word in word_dict:
            word_dict[word] += 1
        else:
            word_dict[word] = 1
    return word_dict

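# Remove stop words (build the stop-word set once instead of reloading the list for every token)
english_stopwords = set(stopwords.words('english'))
tokens_less_stopwords = [word for word in tokens if word not in english_stopwords]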

# Create bag-of-words
bag_of_words = bag_of_words_count(tokens_less_stopwords,{})

Pay special attention to what these functions return and how the subsequent texts and lists look.
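
As a rough illustration of what a bag-of-words looks like, the sketch below counts a small, made-up token list by hand and compares the result with `collections.Counter` from the standard library. The token list and the variable names `manual_bag` and `counter_bag` are assumptions for this example only.

```python
from collections import Counter

def bag_of_words_count(words):
    # Same counting idea as the notebook's function, with a fresh dict per call
    word_dict = {}
    for word in words:
        word_dict[word] = word_dict.get(word, 0) + 1
    return word_dict

# Hypothetical tokens for illustration
sample_tokens = ["alice", "rabbit", "alice", "hole", "rabbit", "alice"]

manual_bag = bag_of_words_count(sample_tokens)
counter_bag = Counter(sample_tokens)

print(manual_bag)                       # {'alice': 3, 'rabbit': 2, 'hole': 1}
print(dict(counter_bag) == manual_bag)  # True
```

Both approaches map each distinct word to its number of occurrences; `Counter` is simply a dictionary subclass that does the counting for you.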

## Exercises

### Exercise 1

Use the `stemmer` and `lemmatizer` objects defined in the cells above to write a function that finds the stem and lemma of the nth word in the token list.

_**Function specifications:**_
* Should take a `list` as input and return a `dict` type as output.
* The dictionary should have the keys **'original', 'stem', and 'lemma'** with the corresponding values being the nth word transformed in that way.

**Example result:**

`{'original': 'daisies', 
'stem': 'daisi', 
'lemma': 'daisy'}`

Use your function to find the stem and lemma of the 120th word in `tokens`.

In [18]:
#Your code here 
def do_stemming_and_lemmatization(tokens: list, index: int) -> dict:
    """Takes a list of tokens and a 1-based index, then returns a dictionary with the original, stemmed, and lemmatized word"""
    # Convert the 1-based position to a 0-based list index
    nth_token = tokens[index - 1]
    lemma = lemmatizer.lemmatize(nth_token)
    stem = stemmer.stem(nth_token)
    return {
        "original": nth_token,
        "stem": stem,
        "lemma": lemma
    }


In [19]:
do_stemming_and_lemmatization(tokens=tokens, index=120)

{'original': 'daisies', 'stem': 'daisi', 'lemma': 'daisy'}

### Exercise 2

Create a function that calculates the number of stop words that are in the text in total, including repetitions.   

_Hint:_ You can use the NLTK stop word list, `stopwords.words('english')`.

_**Function specifications:**_
* Function should take a `list` as input. 
* The number of stop words should be returned as an `int`. 

Use your function to calculate the total number of stop words in `tokens`.

In [20]:
#Your code here
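def calc_total_stopwords(tokens: list) -> int:
    """Calculates the total number of stop words in the tokens"""
    # Build the stop-word set once; set lookups avoid rescanning the list for every token
    english_stopwords = set(stopwords.words("english"))
    return sum(1 for token in tokens if token in english_stopwords)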


In [22]:
calc_total_stopwords(tokens=tokens)

13774

### Exercise 3

Write a function that calculates the number of **unique** words in the text.

_**Function specifications:**_
* Function should take a `list` as input and return an `int`. 


Use your function to calculate the number of **unique** words in `tokens`.

In [24]:
#Your code here
def calc_total_unique_words(tokens: list) -> int:
    """Returns the total number of unique words in the tokens list"""
    return len(set(tokens))

In [25]:
calc_total_unique_words(tokens=tokens)

2749

### Exercise 4

Write a function that calculates the kth most frequently occurring word in the bag-of-words.

_**Function specifications:**_
* Function should take a `dict` and an `int` k as input.
* Function should return the kth most common word as a `str`.

_Hint:_ `bag_of_words` already excludes stop words.

**Example input:**
```python
most_common_word(bag={'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}, k=2)

>>> 'apple'
```


Use the function to calculate the 3rd most frequently occurring word in the bag-of-words.

In [48]:
#Your code here
def find_most_common_kth_word(bag: dict, kth: int) -> str:
    """Finds the kth most common word from the provided dict"""
    # Sort (word, count) pairs by count, highest first
    sorted_ = sorted(bag.items(), key=lambda item: item[1], reverse=True)
    # Return only the word, not the (word, count) tuple
    return sorted_[kth-1][0]

In [49]:
find_most_common_kth_word(bag_of_words, 3)

'little'

### Exercise 5

Write a function that calculates the number of words that appear n times in the text.

_**Function specifications:**_
* Input is taken as a `dict` and an `int` n, where n is the number of times the word appears in the text.
* Count the number of words that appear n times in the text.
* Output should be the count as an `int`.

**Example input:** 
```python
word_frequency_count(bag={'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}, n=12)

>>> 2
```

Use the function to calculate the number of words that appear eight times in the bag-of-words.


In [35]:
#Your code here
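def calc_num_words_appearing_nth_times(bag: dict, nth: int) -> int:
    """Counts the words in the bag that appear exactly nth times"""
    counter = 0
    for value in bag.values():
        # Compare against the nth argument rather than a hard-coded 8
        if value == nth:
            counter += 1
    return counter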

In [36]:
calc_num_words_appearing_nth_times(bag_of_words, 8)

49

## Solutions

### Exercise 1

In [37]:
def find_roots(token_list, n):
    
    root_dict = {}
    word = token_list[n-1]
    root_dict['original'] = word
    root_dict["stem"] = stemmer.stem(word)
    root_dict["lemma"] = lemmatizer.lemmatize(word)
    
    return root_dict

In [38]:
find_roots(tokens, 120) 

{'original': 'daisies', 'stem': 'daisi', 'lemma': 'daisy'}

### Exercise 2

In [39]:
def count_stopwords(token_list):
    # Build the stop-word set once so the list is not reloaded for every token
    english_stopwords = set(stopwords.words("english"))
    return sum(1 for word in token_list if word in english_stopwords)

In [40]:
count_stopwords(tokens)

13774
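
The counting logic can be sketched on a small scale without NLTK. The example below uses a tiny, hypothetical stop word set and token list (both made up for illustration); in the notebook you would pass `set(stopwords.words('english'))` instead. Converting the stop words to a set is a design choice that makes each membership check fast.

```python
# Hypothetical, tiny stop word set for illustration only;
# in the notebook this would be set(stopwords.words('english')).
stop_words = {"the", "of", "and", "a", "to", "was"}

# Hypothetical tokens for illustration
sample_tokens = ["alice", "was", "beginning", "to", "get",
                 "very", "tired", "of", "sitting"]

def count_stopwords_in(token_list, stop_words):
    # Count every occurrence, so repeated stop words are counted each time
    return sum(1 for word in token_list if word in stop_words)

stopword_total = count_stopwords_in(sample_tokens, stop_words)
print(stopword_total)  # 3
```

Here "was", "to", and "of" are the three matches.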

### Exercise 3

In [41]:
def unique_words(token_list):
    return len(set(token_list))

In [42]:
unique_words(tokens)

2749

Note: The same result can be achieved by applying `len()` to the dictionary returned by `bag_of_words_count()`, since each unique word in `tokens` becomes exactly one key.

In [43]:
len(bag_of_words_count(tokens,{}))

2749

### Exercise 4

In [44]:
def most_common_word(bag, k):
    switch = [(value, key) for key, value in bag.items()]
    switch = sorted(switch)
    return switch[-k][1]

In [45]:
most_common_word(bag_of_words, 3)

'little'
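
As an alternative sketch, the standard library's `collections.Counter` offers `most_common(k)`, which returns the `k` highest-count pairs in descending order. The helper name `kth_most_common` is an assumption for this example, and it uses the example input from the exercise. Words with equal counts may be ordered differently from the solution above, so results can differ when the kth position falls on a tie.

```python
from collections import Counter

bag = {'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}

def kth_most_common(bag, k):
    # most_common(k) returns the k highest-count (word, count) pairs,
    # so the last pair in that list is the kth most common word
    return Counter(bag).most_common(k)[-1][0]

first = kth_most_common(bag, 1)
second = kth_most_common(bag, 2)
print(first)   # pear
print(second)  # apple
```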

### Exercise 5

In [50]:
def word_frequency_count(bag, n):
    total = sum(1 for value in bag.values() if value == n)
    return total

In [51]:
word_frequency_count(bag_of_words, 8)

49
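
If you need this count for several values of n, one possible approach is to count the counts once with `collections.Counter` and then look up any n. This is a sketch using the example input from the exercise; the name `frequency_of_counts` is an assumption for illustration.

```python
from collections import Counter

bag = {'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}

# Map each occurrence count to the number of words that have that count
frequency_of_counts = Counter(bag.values())

print(frequency_of_counts[12])  # 2
print(frequency_of_counts[8])   # 0 (a Counter returns 0 for missing keys)
```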


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>