# Hello again, welcome to notebook 1!

In this notebook we will look at preparing text data.

It is recommended that you complete all exercises that are not marked as optional.

Feel free to be creative and write your own code wherever you want!

## Imports

In [1]:
import string

from nltk import download
download('stopwords')
from nltk.corpus import stopwords

ModuleNotFoundError: No module named 'nltk'

## Lesson 1: Preparing text data
To get a computer to understand a piece of text, whether that's a news headline, a tweet or a Wikipedia article, we need to break it up into components that the computer can recognise. This is called tokenization. We will also want to remove common words like 'and', 'in' and 'but' so that the computer can focus on more important words. In NLP, such words are known as stopwords.

In this lesson we will focus on tokenization and stopword removal for preparing text data. In reality, however, there are many more techniques that we could use to prepare our data, such as lemmatization and stemming. You may want to research and implement these techniques for the final challenge!


### Exercise 1.1: Tokenization

As described above, the technique of splitting up sentences or documents into smaller components is called tokenization. When we perform tokenization, there are many different things we must consider. Should we split up the text into characters, or words? What should we do about punctuation? In this exercise we will think about some of these issues and build our own basic tokenizer.
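
To make the character-versus-word question concrete, here is a small sketch using only the standard library. The variable names and the regular expression are illustrative choices rather than part of the exercise, and none of the three results is the single 'right' answer.

```python
import re

example = "Cats, in my opinion, aren't better than dogs."

# 1. Character level: every character, including spaces, becomes a token
char_tokens = list(example)

# 2. Whitespace level: punctuation stays attached to the neighbouring word
whitespace_tokens = example.split()

# 3. Word/punctuation level: runs of word characters, or single punctuation marks
word_punct_tokens = re.findall(r"\w+|[^\w\s]", example)

print(char_tokens[:5])
print(whitespace_tokens)
print(word_punct_tokens)
```

Note that the third approach splits "aren't" into three tokens ('aren', "'", 't'), which shows how even a small design choice changes the result.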

In [4]:
# Q1.1.1 - Think of three different ways in which you could tokenize the below sentence
#        - Which of the tokenizations will be the best, do you think? Why?

animal_sentence = "Cats, in my opinion, aren't better than dogs."

possibility_1 = "cats   in my opinion   arent better than dogs"
possibility_2 = None  # TODO
possibility_3 = None  # TODO

# Q1.1.2 - What problems could we encounter when trying to tokenize a dataset of tweets?
#        - How might we overcome these?

Hopefully these questions have emphasised that there are lots of things to think about when we perform tokenization, and there is no single 'best' method!
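
As one possible angle on the tweets question, the sketch below keeps hashtags, mentions and URLs intact instead of breaking them apart at the punctuation. The pattern, the function name `tweet_tokenizer` and the example tweet (including the handle and URL) are all made up for illustration. Real tweets contain many more cases, such as emoji and misspellings, that this does not handle.

```python
import re

# Alternatives are tried left to right, so URLs and @/# tags win over plain words
TWEET_PATTERN = re.compile(r"https?://\S+|[@#]\w+|\w+(?:'\w+)?|[^\w\s]")

def tweet_tokenizer(tweet):
    """Split a tweet into tokens, keeping URLs, hashtags and mentions whole."""
    return TWEET_PATTERN.findall(tweet)

example_tweet = "Loving #NLP with @alice! See https://example.com/post"
tweet_tokens = tweet_tokenizer(example_tweet)
print(tweet_tokens)
```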

In [12]:
# Q1.1.3 (optional) - Try importing some tokenizers from different Python modules 
#                   - e.g. nltk, torchtext, spacy
#                   - Compare their outputs, do they all tokenize sentences in the same way?
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

sentence = "Cats, in my opinion, aren't better than dogs. Dogs are the best!"
tokenized_sentence = sent_tokenize(sentence)
print(tokenized_sentence)

# I couldn't get spaCy to install, so this part is left commented out
#import spacy
#spacy_en = spacy.load('en_core_web_sm')  # the 'en' shortcut was removed in spaCy 3
#def tokenizer(text):
#    return [tok.text for tok in spacy_en.tokenizer(text)]

#print(tokenizer(sentence))

["Cats, in my opinion, aren't better than dogs.", 'Dogs are the best!']
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ollie\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


ModuleNotFoundError: No module named 'spacy'

For now, we'll concentrate on just a handful of points when we tokenize our sentences:
- Splitting sentences into a list of words.
- Removing punctuation.
- Changing any uppercase letters to lowercase.

For Q1.1.4 you must complete the skeleton code for a ```basic_tokenizer```.
There are a few hints below that you may want to use in your solution.

In [42]:
# Q1.1.4
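import string

def basic_tokenizer(sentence):
    """Lowercase a sentence, strip punctuation and split it into words."""
    lowered = sentence.lower()
    # Keep only the characters that are not punctuation
    no_punctuation = ''.join(char for char in lowered if char not in string.punctuation)
    # split() with no argument also copes with repeated spaces, tabs and newlines
    return no_punctuation.split()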

basic_tokenizer("Cats, in my opinion, aren't better than dogs.")


['cats', 'in', 'my', 'opinion', 'arent', 'better', 'than', 'dogs']

In [31]:
hint1 = 'This is hint 1'.split()
print('Hint 1:', hint1)

hint2 = 'This! Is! Hint! 2!'.replace('!', '')
print('Hint 2:', hint2)

hint3 = string.punctuation
print('Hint 3:', hint3)

hint4 = 'THIS IS YOUR FINAL HINT'.lower()
print('Hint 4:', hint4)

Hint 1: ['This', 'is', 'hint', '1']
Hint 2: This Is Hint 2
Hint 3: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Hint 4: this is your final hint


Run the cell below to test your ```basic_tokenizer``` function!

In [43]:
test1 = "Cats, in my opinion, aren't better than dogs"
if sorted(basic_tokenizer(test1)) == ['arent', 'better', 'cats', 'dogs', 'in', 'my', 'opinion', 'than']:
    print('Your function passed the test!')
else:
    print("Oops, your function isn't working quite yet. Try again!")

Your function passed the test!
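
The same three steps (lowercasing, removing punctuation, splitting into words) can also be sketched with `str.translate`, which deletes all punctuation characters in a single pass. The name `translate_tokenizer` is hypothetical and this is only one alternative, not the expected solution.

```python
import string

# Translation table that deletes every character found in string.punctuation
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def translate_tokenizer(sentence):
    """Lowercase, delete punctuation, then split on any whitespace."""
    return sentence.lower().translate(PUNCTUATION_TABLE).split()

translated = translate_tokenizer("Cats, in  my opinion, aren't better than dogs.")
print(translated)
```

Because `split()` is called without an argument, the double space in the example does not produce an empty token.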


### Exercise 1.2: Stopwords

In NLP, stopwords are common words that do not add much value to the meaning of a sentence or document. In English, these include: 'the', 'is', 'in', 'for', 'at', etc. We often want to remove these from our data for NLP problems. For example, consider the following two sentences:

1. There is a cat in this hat.
2. There is a flaw in this argument.

We can tell that these sentences are about totally different things, despite having 5 of 7 words in common, but computers aren't as good at inferring meaning as we are!

In English, there are a lot of words that we could consider stopwords. Thankfully, there are Python libraries that contain stopword lists that we can use, rather than creating our own lists.
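
The cat/flaw example can be checked with a few lines of Python. The stopword set below is a tiny hand-picked one used purely for illustration. It is not the NLTK list, and the variable names are assumptions.

```python
# Hand-picked stopwords for this illustration only (not NLTK's list)
toy_stop_words = {'there', 'is', 'a', 'in', 'this'}

sentence_1 = 'there is a cat in this hat'.split()
sentence_2 = 'there is a flaw in this argument'.split()

# Words shared by both sentences before any filtering
shared_before = set(sentence_1) & set(sentence_2)

# Keep only the words that are not stopwords
content_1 = [word for word in sentence_1 if word not in toy_stop_words]
content_2 = [word for word in sentence_2 if word not in toy_stop_words]
shared_after = set(content_1) & set(content_2)

print(len(shared_before))    # 5 shared words before filtering
print(content_1, content_2)  # ['cat', 'hat'] ['flaw', 'argument']
print(len(shared_after))     # 0 shared words after filtering
```

Once the stopwords are gone, the two sentences no longer overlap at all, which is closer to how we read them.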

In [45]:
stop_words = stopwords.words('english')

In [47]:
# Q1.2.1 - How many stopwords are there in this list?
#        - Which stopword is the longest?

print(len(stop_words))
print(max(stop_words, key=len))

179
yourselves


Now that we have a list of stopwords, we need a function to remove them from our text. 

For Q1.2.2 you must complete the function ```remove_stopwords``` below.

The function should take as input a list ```sentence``` of words and a list ```stop_words``` of stopwords.

In [88]:
# Q1.2.2 

def remove_stopwords(sentence, stop_words):
    # 'sentence' is a list of words; avoid naming it 'list', which shadows the built-in
    output = []
    for word in sentence:
        if word not in stop_words:
            output.append(word)
    return output

remove_stopwords(["this","is","a","test"], stop_words)

['test']

Run the cell below to test your ```remove_stopwords``` function!

In [89]:
test2 = ['there', 'is', 'a', 'cat', 'in', 'this', 'hat']
if sorted(remove_stopwords(test2, stop_words)) == ['cat', 'hat']:
    print('Your function passed the test!')
else:
    print("Oops, your function didn't remove all of the stopwords. Try again!")

Your function passed the test!
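
Checking `word not in stop_words` against a list scans the list for every word. For larger datasets, one common option is to convert the stopwords to a set first, so each check is a hash lookup. The sketch below assumes a hypothetical function name `remove_stopwords_with_set` and uses a small made-up stopword list rather than the NLTK one.

```python
def remove_stopwords_with_set(words, stop_words):
    """Return the words that are not stopwords, preserving their order."""
    # Build the set once so each membership check is a hash lookup
    stop_set = set(stop_words)
    return [word for word in words if word not in stop_set]

toy_stop_words = ['there', 'is', 'a', 'in', 'this']  # illustrative, not NLTK's list
filtered = remove_stopwords_with_set(
    ['there', 'is', 'a', 'cat', 'in', 'this', 'hat'], toy_stop_words
)
print(filtered)
```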


### Exercise 1.3: Final processing

Now that we've written functions to tokenize sentences and to remove stopwords, we want to combine them into a complete preprocessing function.

* Q1.3.1 - Write a function ```preprocess``` that preprocesses a sentence ready for NLP techniques to be applied. Your function should tokenize an input sentence and remove stopwords from a predefined list.

* Q1.3.2 (optional) - Research lemmatization and stemming in relation to NLP. Add these techniques to your ```preprocess``` function. (**HINT**: Are there any Python libraries that you could use to help you?)

In [90]:
# Q1.3.1 / Q1.3.2

def preprocess(sentence, stop_words):
    # Tokenize first, then drop the stopwords from the resulting tokens
    tokens = basic_tokenizer(sentence)
    clean_sentence = remove_stopwords(tokens, stop_words)

    return clean_sentence

Run the cell below to test your ```preprocess``` function!

In [91]:
test3 = 'This is the final test! What will the result be?'
if sorted(preprocess(test3, stop_words)) == ['final', 'result', 'test']:
    print('Congratulations! Your preprocessor is working and ready to go.')
else:
    print('Not quite! Try again.')

Congratulations! Your preprocessor is working and ready to go.
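
As a starting point for Q1.3.2, stemming can be pictured as chopping common endings off words so that related forms collapse to the same token. The sketch below is a deliberately crude suffix stripper written for illustration. It is not the Porter algorithm or any library's implementation, and the name `crude_stem`, the suffix list and the minimum stem length are all assumptions.

```python
def crude_stem(word, suffixes=('ing', 'ed', 'es', 's')):
    """Strip the first matching suffix, as long as a stem of 3+ letters remains."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

tokens = ['cats', 'jumped', 'running', 'dogs', 'is']
stems = [crude_stem(token) for token in tokens]
print(stems)
```

The result for 'running' is 'runn' rather than 'run', which shows why real stemmers and lemmatizers need more rules (or a dictionary) than simple suffix removal.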
