## Part 1

In [40]:
import re
from nltk.tokenize import RegexpTokenizer
import spacy
import pandas as pd
from collections import Counter
import pickle

### 1.1. First, tokenize the sentence yourself (i.e. manually convert this string to a list of tokens) without looking at any of the methods in the manual. Then, use at least two different (computational) methods for tokenization. Compare the results: describe the differences and try to explain them given what you know about the way they work.

In [28]:
sample = 'We’re where we were when W.W.W. Wonka (1940-2012) was when he was selected prime minister of the U.K. with 50.23% of the votes. For more information, see: www.we-want-wonka.co.uk/2012'
punctuation = re.compile(r'[!"#$%&\'\(\)\*\+\,\-\.\/:\;\<=\>\?@\[\\\]\^\_`\{|\}~]')

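# Manual baseline (no tokenization library)
def tokenize(text):
    # NOTE: Pattern.sub takes (repl, string) and returns a new string. This call
    # swaps the arguments and discards the result, so punctuation is NOT removed.
    punctuation.sub(text, '')
    text = text.lower()
    text = text.split()
    return text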

result1 = tokenize(sample)

# NLTK method
tokenizer = RegexpTokenizer(r'\w+')  # raw string avoids an invalid-escape warning
result2 = tokenizer.tokenize(sample.lower())

# Spacy method
nlp = spacy.load("en_core_web_sm")
processed_text = nlp(sample.lower())

result3 = [token.text for token in processed_text if not token.is_punct]

Put together, the three token lists look like this (first ten rows).

In [36]:
pd.DataFrame(data=[result1, result2, result3]).T

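|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | we’re | we | we |
| 1 | where | re | ’re |
| 2 | we | where | where |
| 3 | were | we | we |
| 4 | when | were | were |
| 5 | w.w.w. | when | when |
| 6 | wonka | w | w.w.w |
| 7 | (1940-2012) | w | wonka |
| 8 | was | w | 1940 |
| 9 | when | wonka | 2012 |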
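
The first column keeps `w.w.w.` and `(1940-2012)` intact because `punctuation.sub(text, '')` in `tokenize` never changes `text`: `Pattern.sub` expects `(repl, string)` and returns a new string that has to be assigned. A minimal sketch of what the function was presumably meant to do is given here. The name `tokenize_strip_punct` is hypothetical, and building the character class from `string.punctuation` is an assumption that it covers the same ASCII characters as the original pattern.

```python
import re
import string

sample = 'We’re where we were when W.W.W. Wonka (1940-2012) was when he was selected prime minister of the U.K. with 50.23% of the votes. For more information, see: www.we-want-wonka.co.uk/2012'

# Character class over ASCII punctuation (assumed equivalent to the original pattern)
punctuation = re.compile('[' + re.escape(string.punctuation) + ']')

def tokenize_strip_punct(text):
    # Pattern.sub(repl, string) returns a new string, so the result is assigned
    text = punctuation.sub('', text)
    return text.lower().split()

stripped = tokenize_strip_punct(sample)
print(len(stripped))
print(stripped[:8])
```

With the punctuation actually removed, `w.w.w.` collapses to `www` and `(1940-2012)` to `19402012`, while `we’re` survives because the curly apostrophe is not an ASCII punctuation character. The token count stays the same, since the splitting is still done on whitespace only.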


Differences:

- result1 only lowercases the text and splits it on whitespace. The `punctuation.sub(text, '')` call has no effect (its arguments are swapped and its return value is not assigned), so punctuation stays attached to the tokens, as in `w.w.w.` and `(1940-2012)`.

- result2 extracts every sequence that matches the pattern '\w+' (one or more word characters), so every other character, including the apostrophe and the periods, acts as a separator. It is equivalent to the method 'findall' from the re library.

- result3 uses spaCy's own tokenizer, which adds special cases on top of its standard splitting rules. For example, it seems to keep numbers with a decimal part (50.23) and period-separated abbreviations (U.K.) as single tokens. It also keeps full website links together.
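
The equivalence between `RegexpTokenizer('\w+')` and `re.findall` can be checked with the standard library alone. This is a small sketch rather than part of the original notebook, and the variable name `regex_tokens` is illustrative.

```python
import re

sample = 'We’re where we were when W.W.W. Wonka (1940-2012) was when he was selected prime minister of the U.K. with 50.23% of the votes. For more information, see: www.we-want-wonka.co.uk/2012'

# Every maximal run of word characters becomes a token; everything else is dropped
regex_tokens = re.findall(r'\w+', sample.lower())

print(len(regex_tokens))
print(regex_tokens[:10])
```

The curly apostrophe and the periods are not word characters, which is why `we’re` becomes `we` and `re`, and `w.w.w.` becomes three separate `w` tokens.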

### 1.2. In 2017, a Medium article compared the word counts of four different tools (Word, Word Online, Google Docs, LibreOffice). All tools produced different word counts. The author concluded: ‘I’m not counting manually to find out who is correct, so I am declaring them all rubbish’.

In [38]:
print(f"Total number of tokens obtained with custom tokenize function: {len(result1)}")
print(f"Total number of tokens obtained with RegexpTokenizer: {len(result2)}")
print(f"Total number of tokens obtained with spaCy tokenizer: {len(result3)}")

Total number of tokens obtained with custom tokenize function: 28
Total number of tokens obtained with RegexpTokenizer: 40
Total number of tokens obtained with spaCy tokenizer: 30


Since the tokens obtained with spaCy seem the most reliable to me, I would say that the most appropriate length is 30 words.

### 1.3. Compare the occurrences of the words ’we’, ’www’ between tokenization methods.

Custom tokenize function retrieves: {"we’re"}, {"w.w.w."}

RegexpTokenizer retrieves: {"we", "re"}, {"w", "w", "w"}

spaCy tokenizer retrieves: {"we", "’re"}, {"w.w.w"}
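
The occurrences themselves can also be counted directly. The sketch below does this for the two approaches that need no external library, whitespace splitting and the `\w+` pattern. The names `whitespace_counts` and `regex_counts` are illustrative, and spaCy is left out only to keep the example dependency-free.

```python
import re
from collections import Counter

sample = 'We’re where we were when W.W.W. Wonka (1940-2012) was when he was selected prime minister of the U.K. with 50.23% of the votes. For more information, see: www.we-want-wonka.co.uk/2012'
lowered = sample.lower()

whitespace_counts = Counter(lowered.split())
regex_counts = Counter(re.findall(r'\w+', lowered))

# A Counter returns 0 for tokens it has never seen
for word in ('we', 'www'):
    print(word, whitespace_counts[word], regex_counts[word])
```

Whitespace splitting finds `we` once and `www` never, because both are buried inside longer tokens such as `we’re` and the URL. The `\w+` pattern finds `we` three times and `www` once, since it also breaks the URL into pieces.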

### 1.4. Compare the ten most frequent words based on the different (including your own) tokenization methods. Do you think these differences will matter when you have millions of texts, and not some nonsense text solely designed to make you reflect on the impact of preprocessing methods?

In [44]:
counter_1 = Counter(result1)
counter_2 = Counter(result2)
counter_3 = Counter(result3)

counter_1.most_common(5), counter_2.most_common(5), counter_3.most_common(5)

([('when', 2), ('was', 2), ('of', 2), ('the', 2), ('we’re', 1)],
 [('we', 3), ('w', 3), ('when', 2), ('wonka', 2), ('2012', 2)],
 [('we', 2), ('when', 2), ('was', 2), ('of', 2), ('the', 2)])

If the differences are this substantial in such a short text, they will very likely matter in larger samples as well.
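
The question asks for the ten most frequent words, so a top-ten comparison may be useful. The sketch below is again limited to whitespace splitting and the `\w+` pattern, and `shared` is an illustrative name for the words that appear in both top-ten lists.

```python
import re
from collections import Counter

sample = 'We’re where we were when W.W.W. Wonka (1940-2012) was when he was selected prime minister of the U.K. with 50.23% of the votes. For more information, see: www.we-want-wonka.co.uk/2012'
lowered = sample.lower()

top_whitespace = Counter(lowered.split()).most_common(10)
top_regex = Counter(re.findall(r'\w+', lowered)).most_common(10)

# Words that both tokenizers rank in their top ten
shared = {word for word, _ in top_whitespace} & {word for word, _ in top_regex}

print(top_whitespace)
print(top_regex)
print(sorted(shared))
```

In a text this short most counts are tied at one or two, so the exact ranking depends heavily on the order in which tokens first appear.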

## Part 2

In [55]:
# Load pickle file
df = pd.read_pickle(r'discussions.p')

In [58]:
# How many posts?
df.shape[0]

50000

In [None]:
# How many lemmatized nouns and lemmatized adjectives?

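# Join all posts into one string. Series.to_string() is avoided because it
# also writes the index labels and padding into the text.
text = " ".join(df["post"].astype(str))
nlp = spacy.load("en_core_web_sm")
nlp.max_length = len(text)  # raise spaCy's length limit to fit the whole corpus
processed_text = nlp(text.lower())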

# Lemmatized nouns
lem_nouns = [token.lemma_ for token in processed_text if token.pos_ == "NOUN"]
Counter(lem_nouns).most_common(10)
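
Parsing all 50000 posts as one document can need a lot of memory. A possible alternative, sketched here under assumptions, is to stream the posts through `nlp.pipe` and update the counters per document. The helper names `count_lemmas_by_pos` and `count_posts` and the batch size are hypothetical, and `df["post"]` is assumed to hold one string per post.

```python
from collections import Counter

def count_lemmas_by_pos(docs, pos_tags=("NOUN", "ADJ")):
    """Count lemmas per coarse POS tag over an iterable of parsed documents."""
    counts = {tag: Counter() for tag in pos_tags}
    for doc in docs:
        for token in doc:
            if token.pos_ in counts:
                counts[token.pos_][token.lemma_] += 1
    return counts

def count_posts(posts, model="en_core_web_sm", batch_size=256):
    """Stream raw post strings through spaCy instead of parsing one huge text."""
    import spacy  # imported here so the counting helper works without spaCy
    nlp = spacy.load(model)
    return count_lemmas_by_pos(nlp.pipe(posts, batch_size=batch_size))
```

A call such as `count_posts(df["post"].astype(str))["NOUN"].most_common(10)` would then give the most frequent noun lemmas, and the `"ADJ"` entry the adjectives. This sketch leaves the casing untouched, so the posts would need to be lowercased first to mirror the cell above.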