# NLP Practical Test

© Explore Data Science Academy

The NLP practical test will take place within this Jupyter notebook. Each question will require you to write a function which will return the answer. This notebook will be graded automatically, so it is important that the names of any existing variables and functions are left unchanged.

A shell function with the correct name for each question has already been defined for you. You will simply need to fill in the necessary code inside the function, as directed by the comments.

## Honour Code

I **YOUR NAME**, **YOUR SURNAME**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the EDSA honour code (https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

#### Import Libraries and Read In the Data

Do not modify or remove any of the code in these cells.

In [1]:
import nltk
from nltk import TreebankWordTokenizer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import string
import urllib.request

nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
# read in the data
def print_some_url():
    with urllib.request.urlopen('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/classification_sprint//alice_in_wonderland.txt') as f:
        return f.read().decode('ISO-8859-1')

data = print_some_url()
print(data[:863])

Alice's Adventures in Wonderland

                ALICE'S ADVENTURES IN WONDERLAND

                          Lewis Carroll

               THE MILLENNIUM FULCRUM EDITION 3.0




                            CHAPTER I

                      Down the Rabbit-Hole


  Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do:  once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought Alice `without pictures or conversation?'

  So she was considering in her own mind (as well as she could,
for the hot day made her feel very sleepy and stupid), whether
the pleasure of making a daisy-chain would be worth the trouble
of getting up and picking the daisies, when suddenly a White
Rabbit with pink eyes ran close by her.

 


#### Convert to lowercase and remove punctuation  

Do not change or remove any of the code in these cells.

In [3]:
def remove_punctuation(words):
    words = words.lower()
    return ''.join([x for x in words if x not in string.punctuation])

In [4]:
data = remove_punctuation(data)
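
As a quick illustration of what this cleaning step does, the sketch below applies the same logic to a short sentence. The function is repeated so that the block runs on its own; the `sample` string is an assumption made up for illustration and is not part of the test data.

```python
import string

def remove_punctuation(words):
    # lowercase first, then drop every character listed in string.punctuation
    words = words.lower()
    return ''.join([x for x in words if x not in string.punctuation])

# hypothetical input, chosen only to show the effect
sample = "Alice's White Rabbit ran close by her."
cleaned = remove_punctuation(sample)
print(cleaned)
```

Note that the apostrophe is removed along with the full stop, so `alice's` becomes `alices` in the cleaned text.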

#### Creating a bag of words and assigning our stemmer and lemmatizer

Pay special attention to what these functions return and how the subsequent texts and lists look.

In [5]:
# define stemmer function
stemmer = SnowballStemmer('english')

# tokenise data
tokeniser = TreebankWordTokenizer()
tokens = tokeniser.tokenize(data)

# define lemmatiser
lemmatizer = WordNetLemmatizer()

# bag of words
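def bag_of_words_count(words, word_dict=None):
    """ this function takes in a list of words and returns a dictionary 
        with each word as a key, and the value represents the number of 
        times that word appeared"""
    # start from a fresh dictionary on each call; a mutable default ({})
    # would be shared between calls and keep accumulating counts
    if word_dict is None:
        word_dict = {}
    for word in words:
        if word in word_dict:
            word_dict[word] += 1
        else:
            word_dict[word] = 1
    return word_dict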

# remove stopwords (build the stopword set once so each lookup is fast)
english_stopwords = set(stopwords.words('english'))
tokens_less_stopwords = [word for word in tokens if word not in english_stopwords]

# create bag of words
bag_of_words = bag_of_words_count(tokens_less_stopwords)
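
To make the shape of the bag of words concrete, here is a minimal sketch on a small token list. The helper `count_words` and the list `toy_tokens` are hypothetical names used only for illustration; the helper is a simplified stand-in for `bag_of_words_count`, not a replacement for it.

```python
from collections import Counter

def count_words(words):
    # map each word to the number of times it appears in the list
    word_dict = {}
    for word in words:
        word_dict[word] = word_dict.get(word, 0) + 1
    return word_dict

# hypothetical token list, not taken from the test data
toy_tokens = ['alice', 'rabbit', 'alice', 'daisies', 'rabbit', 'alice']
toy_bag = count_words(toy_tokens)
print(toy_bag)

# the standard library's Counter produces the same counts
print(toy_bag == dict(Counter(toy_tokens)))
```

The result is a plain `dict` whose keys are words and whose values are counts, which is the structure Questions 4 and 5 take as input.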



## Question 1

Use the stemmer and lemmatizer defined in the cells above to find the stem and lemma of the nth word in the token list, counting from 1.

_**Function Specifications:**_
* Should take a `list` and an `int` n as input and return a `dict` as output.
* The dictionary should have the keys **'original', 'stem' and 'lemma'**, whose values are the nth word unchanged, its stem and its lemma respectively.

In [8]:
### START FUNCTION
def find_roots(token_list, n):
    #your code here
    word = token_list[n-1]
    original = word
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word)
    return {'original': original, 'stem': stem, 'lemma': lemma}
### END FUNCTION

In [9]:
find_roots(tokens, 120)


{'lemma': 'daisy', 'original': 'daisies', 'stem': 'daisi'}

_**Expected Outputs:**_
```python
find_roots(tokens, 120) == 
{'original': 'daisies', 
'stem': 'daisi', 
'lemma': 'daisy'}
```


## Question 2

How many stopwords are in the text in total?   

_Hint_: you can use the nltk stopwords list, `stopwords.words('english')`.

_**Function Specifications:**_
* Function should take a `list` as input 
* The number of stopwords should be returned as an `int` 

In [10]:
### START FUNCTION
def count_stopwords(token_list):
    #your code here
    # build the stopword set once instead of rebuilding the list for every token
    english_stopwords = set(stopwords.words('english'))
    return sum(1 for word in token_list if word in english_stopwords)
### END FUNCTION

In [11]:
count_stopwords(tokens)

13774

_**Expected output:**_

```python
count_stopwords(tokens) == 13774
```
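
The question asks for the total number of stopword occurrences, so a stopword that appears twice is counted twice. The sketch below shows this on a small list. The names `toy_stopwords`, `toy_tokens` and `count_in_set` are hypothetical, and the small set is only a stand-in for the much longer nltk list.

```python
# a tiny stand-in for stopwords.words('english'); the real list is much longer
toy_stopwords = {'the', 'was', 'to', 'of', 'by', 'her'}

def count_in_set(token_list, vocabulary):
    # count every occurrence, so repeated stopwords are counted each time
    return sum(1 for word in token_list if word in vocabulary)

# hypothetical tokens; 'by' appears twice
toy_tokens = ['alice', 'was', 'tired', 'of', 'sitting', 'by',
              'her', 'sister', 'by', 'the', 'bank']
total = count_in_set(toy_tokens, toy_stopwords)
print(total)
```

Here the five distinct stopwords account for six occurrences, because `by` is counted twice.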

## Question 3

How many **unique** words are in the text?

_**Function Specifications:**_
* Function should take a `list` as input and return an `int` 

In [12]:
### START FUNCTION
def unique_words(token_list):
    #your code here
    unique = set(token_list)
    return len(unique)
### END FUNCTION

In [14]:
unique_words(tokens)

2749

_**Expected output:**_

```python
unique_words(tokens) == 2749
```
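
As a small sanity check of the idea, a set keeps one copy of each distinct token, so its length is the number of unique words. The list `toy_tokens` below is a hypothetical example, not part of the test data.

```python
# hypothetical token list with repeated words
toy_tokens = ['alice', 'rabbit', 'alice', 'daisies', 'rabbit', 'alice']

total_tokens = len(toy_tokens)        # every token, repeats included
distinct_tokens = len(set(toy_tokens))  # one entry per distinct token

print(total_tokens, distinct_tokens)
```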

## Question 4

What is the kth most frequently occurring word in the bag of words?

_**Function Specifications:**_
* Function should take a `dict` and an `int` k as input
* Function should return the kth most common word as a `str`

_Hint: `bag_of_words` already excludes stopwords._

Example: 
```python
most_common_word(bag={'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}, k=2)

>>> 'apple'
```

In [52]:
### START FUNCTION
def most_common_word(bag, k):
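    # your code here
    from collections import Counter
    # most_common(k) returns the k highest counts in descending order,
    # so the last pair in that list holds the kth most common word
    word_counts = Counter(bag)
    top_k = word_counts.most_common(k)
    return top_k[-1][0]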


### END FUNCTION

In [56]:
most_common_word(bag_of_words, 3) 

'little'

_**Expected output:**_

```python
most_common_word(bag_of_words, 3) == 'little'

```
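
The same ranking idea can be checked on the small fruit dictionary from the example. This is a minimal sketch; `kth_most_common` and `toy_bag` are hypothetical names used so that the block runs on its own.

```python
from collections import Counter

def kth_most_common(bag, k):
    # most_common(k) lists the k highest counts in descending order,
    # so its last entry is the kth most common (word, count) pair
    return Counter(bag).most_common(k)[-1][0]

toy_bag = {'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}
first = kth_most_common(toy_bag, 1)
second = kth_most_common(toy_bag, 2)
print(first, second)
```

For k = 3 or k = 4 the two words with a count of 12 are tied, so the answer depends on how ties are ordered; the test data for this question does not rely on that case.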

## Question 5

How many words appear n times in the text?

_**Function Specifications:**_
* Input is taken as a `dict` and an `int` n, where n is the number of times the word appears in the text
* Count the number of words that appear n times in the text
* Output should be the count as an `int`

Example: 
```python
word_frequency_count(bag={'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}, n=12)

>>> 2
```

In [17]:
### START FUNCTION
def word_frequency_count(bag, n):
    # your code here
    # count the words whose frequency in the bag is exactly n
    return sum(1 for count in bag.values() if count == n)
    
### END FUNCTION

In [19]:
word_frequency_count(bag_of_words, 5)

97

In [20]:
word_frequency_count(bag_of_words, 8)

49

_**Expected output:**_

```python
word_frequency_count(bag_of_words, 5) == 97
word_frequency_count(bag_of_words, 8) == 49

```
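
For a quick check outside the notebook, the counting logic can be tried on the fruit dictionary from the example. The names `words_with_count` and `toy_bag` are hypothetical and exist only to keep this sketch self-contained.

```python
def words_with_count(bag, n):
    # count the keys whose stored frequency equals n exactly
    return sum(1 for count in bag.values() if count == n)

toy_bag = {'apple': 30, 'orange': 12, 'pear': 50, 'banana': 12}
twelve = words_with_count(toy_bag, 12)  # 'orange' and 'banana'
seven = words_with_count(toy_bag, 7)    # no word appears 7 times
print(twelve, seven)
```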