# Preprocessing

In [1]:
from nltk.corpus import twitter_samples

## Exploring the Data

The sample dataset from NLTK is split into positive and negative tweets. It contains exactly 5000 positive and 5000 negative tweets.

In [2]:
postive_tweets = twitter_samples.strings("positive_tweets.json")
negative_tweets = twitter_samples.strings("negative_tweets.json")

num_postive_tweets = len(postive_tweets)
num_negative_tweets = len(negative_tweets)
num_total_tweets = num_postive_tweets + num_negative_tweets

print("Number of positive tweets: ", num_postive_tweets)
print("Number of negative tweets: ", num_negative_tweets)
print("Total number of tweets: ", num_total_tweets)

Number of positive tweets:  5000
Number of negative tweets:  5000
Total number of tweets:  10000


First, we want to get an understanding of what the data looks like.

When you scroll through the samples, you will notice a couple of things that differentiate tweets from ordinary text, for example:
- usernames, so-called `handles`, e.g. `@Lamb2ja`
- hashtags, e.g. `#FollowFriday`
- emojis and smileys, e.g. 💞 or `:)`
- URLs, e.g. `https://t.co/smyYriipxI`
- slang words
- etc.

Make yourself familiar with both the positive and negative tweets!
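
As a rough aid for this exploration, the following sketch counts handles, hashtags, and URLs in a handful of tweets copied from the samples in this notebook. The regular expressions and the `count_features` helper are illustrative assumptions, not part of `htwgnlp`, and they are deliberately simpler than what a robust pipeline would need.

```python
import re

# A few tweets copied from the sample output in this notebook.
sample_tweets = [
    "#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)",
    "@97sides CONGRATS :)",
    "We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI",
    "hopeless for tmr :(",
]

# Deliberately simple patterns (assumptions), good enough for a first look.
PATTERNS = {
    "handles": re.compile(r"@\w+"),
    "hashtags": re.compile(r"#\w+"),
    "urls": re.compile(r"https?://\S+"),
}


def count_features(tweets):
    """Count the matches of each pattern across all given tweets."""
    return {
        name: sum(len(pattern.findall(tweet)) for tweet in tweets)
        for name, pattern in PATTERNS.items()
    }


counts = count_features(sample_tweets)
print(counts)  # {'handles': 4, 'hashtags': 1, 'urls': 1}
```

Running the same helper over the full positive and negative lists gives a quick feel for how common each feature is.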

In [3]:
postive_tweets

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!',
 '@97sides CONGRATS :)',
 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days',
 '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM',
 "We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI",
 '@Impatientraider On second thought, there’s just not enough time for a DD :) But new shorts entering system. Sheep must be buying.',
 'Jgh , but we have to go to Bayan :D bye',
 'As an act of mischievousness, am calling the ETL layer of our in-house warehousing 

In [4]:
negative_tweets

['hopeless for tmr :(',
 "Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 months :(",
 '@Hegelbon That heart sliding into the waste basket. :(',
 '“@ketchBurning: I hate Japanese call him "bani" :( :(”\n\nMe too',
 'Dang starting next week I have "work" :(',
 "oh god, my babies' faces :( https://t.co/9fcwGvaki0",
 '@RileyMcDonough make me smile :((',
 '@f0ggstar @stuartthull work neighbour on motors. Asked why and he said hates the updates on search :( http://t.co/XvmTUikWln',
 'why?:("@tahuodyy: sialan:( https://t.co/Hv1i0xcrL2"',
 'Athabasca glacier was there in #1948 :-( #athabasca #glacier #jasper #jaspernationalpark #alberta #explorealberta #… http://t.co/dZZdqmf7Cz',
 "I have a really good m&amp;g idea but I'm never going to meet them :(((",
 '@Rampageinthebox mare ivan :(',
 '@SophiaMascardo happy trip, keep safe. see you soon :* :(',
 "I'm so tired hahahah :(",
 '@GrumpyCockney With knee replacements they get you up &amp; about the same day. :-(   Ou

## Tweet Preprocessing

We will be using the `htwgnlp` Python package to preprocess the data.
It contains a `preprocessing` module with a `TweetProcessor` class.
The boilerplate code for the class is given, as well as some unit tests that describe the desired behavior.

Your job will be to implement the `TweetProcessor` class, which is located in `src/htwgnlp/preprocessing.py`.
The task is completed successfully if all tests for the first assignment pass.
You can run the tests using the following command:

```bash
make assignment_1
```

> As you can check in the `Makefile`, this will execute `pytest tests/htwgnlp/test_preprocessing.py` under the hood.

Let's assume we have the following requirements for the preprocessing pipeline of our tweets:

- remove URLs, as they are usually shortened and don't add much information to the tweet
- remove hashtag symbols `#`, but preserve the word of the hashtag, since it gives valuable information about the content of the tweet
- remove English stopwords
- remove standard punctuation, but keep emojis like `:)`
- remove Twitter handles like `@stuartthull` completely
- after preprocessing, the tweet is expected to be in a tokenized and stemmed form, i.e. a list of words
- for tokenization, you should use [NLTK's `TweetTokenizer`](https://www.nltk.org/api/nltk.tokenize.casual.html#nltk.tokenize.casual.TweetTokenizer)
- for stemming, you should use [NLTK's `PorterStemmer`](https://www.nltk.org/api/nltk.stem.porter.html)
- lowercase the tweets and shorten runs of a repeated character to at most 3, e.g. `looooove` should be transformed to `looove`

For more implementation details, please refer to the [docstrings](https://realpython.com/documenting-python-code/#documenting-your-python-code-base-using-docstrings) of the `htwgnlp.preprocessing.TweetProcessor` class.
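
To make the repeated-character requirement concrete, here is a minimal standalone sketch using Python's `re` module. The helper name `shorten_repeats` is hypothetical and not part of `htwgnlp`; in the actual assignment, the tokenizer is expected to take care of this step.

```python
import re


def shorten_repeats(text: str) -> str:
    """Cap any run of the same character at three occurrences."""
    # (.) captures one character, \1{2,} matches two or more repeats of it,
    # so the whole pattern matches runs of length three or longer.
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)


print(shorten_repeats("looooove"))  # looove
print(shorten_repeats("happy"))     # happy (runs shorter than three are untouched)
```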


In [5]:
from htwgnlp.preprocessing import TweetProcessor

The following code shows the intended usage of the `TweetProcessor` class.

In [6]:
# instantiate a TweetProcessor object
processor = TweetProcessor()

# we use a selected tweet as an example
i = 2277
postive_tweets[i]

'My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i'

Each processing step described above is encapsulated in a separate method of the `TweetProcessor` class and can be called on its own, for example the `remove_urls(tweet: str)` method.

If your implementation works correctly, the URL `https://t.co/3tfYom0N1i` should be removed when you execute the following line:

In [7]:
tweet = processor.remove_urls(postive_tweets[i])
tweet

'My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… '
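
One possible way to approach URL removal is a regular expression. The sketch below is only an assumption about how such a step could look: the pattern and the `strip_urls` name are illustrative, and the unit tests remain the reference for the required behavior.

```python
import re

# Assumed pattern: "http://" or "https://" followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text: str) -> str:
    """Remove every substring that looks like an http(s) URL."""
    return URL_PATTERN.sub("", text)


example = (
    "My beautiful sunflowers on a sunny Friday morning off :) "
    "#sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i"
)
cleaned = strip_urls(example)
print(repr(cleaned))
```

Note that the space before the URL is kept, which is why the expected output above ends with a trailing space.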

The `remove_hashtags(tweet: str)` method should transform `#sunflowers #favourites #happy #Friday` to `sunflowers favourites happy Friday`.

> Note that lowercasing comes later in the process.

In [8]:
tweet = processor.remove_hashtags(tweet)
tweet

'My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off… '
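
A minimal sketch of the same idea outside the class, assuming a regular expression is an acceptable approach. The function name `strip_hash_symbols` is hypothetical.

```python
import re


def strip_hash_symbols(text: str) -> str:
    """Drop the leading '#' of each hashtag but keep the word itself."""
    return re.sub(r"#(\w+)", r"\1", text)


result = strip_hash_symbols("off :) #sunflowers #favourites #happy #Friday off…")
print(result)  # off :) sunflowers favourites happy Friday off…
```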

After tokenization, the tweet should be lowercased, Twitter handles should be removed, and runs of repeated characters should be shortened to at most three.

> For this step, make sure to read the docs of [NLTK's `TweetTokenizer`](https://www.nltk.org/api/nltk.tokenize.casual.html#nltk.tokenize.casual.TweetTokenizer)

The expected output is a list of tokens. Specifically for our example, at this point, it should be: `['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']`

In [9]:
tweet = processor.tokenize(tweet)
print(tweet)

['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']
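
For orientation, `TweetTokenizer` accepts the keyword arguments `preserve_case`, `strip_handles`, and `reduce_len`. The sketch below shows one plausible configuration on a made-up tweet; whether this is exactly the configuration the tests expect is an assumption, so check the docstrings and the NLTK documentation.

```python
from nltk.tokenize import TweetTokenizer

# Assumed configuration: lowercase everything, drop @handles,
# and shorten long runs of a repeated character.
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

# Made-up example tweet combining a handle, a long character run, and a smiley.
tokens = tokenizer.tokenize("@stuartthull I looooove sunflowers :)")
print(tokens)
```

The tokenizer does not need any downloaded NLTK corpora, so this runs with a plain `nltk` installation.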


After removing stopwords, the expected result is: `['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']`

In [10]:
tweet = processor.remove_stopwords(tweet)
print(tweet)

['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']
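
Stopword removal boils down to filtering the token list against a set. The sketch below uses a tiny hand-picked set purely for illustration; it is an assumption standing in for NLTK's much longer English stopword list, and `drop_stopwords` is a hypothetical name.

```python
# Illustrative subset only; the real English stopword list is much longer.
STOPWORDS = {"my", "on", "a", "off", "the", "is", "and"}


def drop_stopwords(tokens):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in STOPWORDS]


tokens = ['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning',
          'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']
filtered = drop_stopwords(tokens)
print(filtered)
```

Using a `set` keeps each membership check cheap, which matters once the filter runs over all 10000 tweets.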


Removing punctuation makes no difference for our example: `['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']`

> Note that the requirement is to remove only common punctuation and to keep emojis like `:)`. One could argue that the ellipsis `…` should be removed as well, but for this pipeline, let's keep it simple.

In [11]:
tweet = processor.remove_punctuation(tweet)
print(tweet)

['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']
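
To see why `:)` and `…` survive this step, consider a sketch that treats only the single characters in Python's `string.punctuation` as punctuation. This is an assumption about what "standard punctuation" means here, and `drop_punctuation` is a hypothetical helper.

```python
import string

# Set of single ASCII punctuation characters such as '!', ',' and '.'.
PUNCTUATION = set(string.punctuation)


def drop_punctuation(tokens):
    """Remove tokens that are exactly one standard punctuation character."""
    return [token for token in tokens if token not in PUNCTUATION]


kept = drop_punctuation(['happy', '!', ':)', ',', '…'])
print(kept)  # ['happy', ':)', '…']
```

The smiley `:)` is a two-character token and the ellipsis `…` is a non-ASCII character, so neither is in the set and both are kept.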


Finally, the last step is stemming. 
After applying the Porter Stemmer, the tweet should look like this: `['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']`

In [12]:
tweet = processor.stem(tweet)
print(tweet)

['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']
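
The stemming step itself can be tried in isolation with NLTK's `PorterStemmer`, whose `stem` method takes a single word. The sketch below applies it token by token to the list from the previous step and should reproduce the stems shown above, assuming the default stemmer settings.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

tokens = ['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)',
          'sunflowers', 'favourites', 'happy', 'friday', '…']

# Stem each token individually; non-word tokens like ':)' pass through unchanged.
stems = [stemmer.stem(token) for token in tokens]
print(stems)
```

Stems such as `beauti` or `happi` are not dictionary words. The goal is only to map related word forms to the same token.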


The `process_tweet(tweet: str)` method is a shortcut for all of the steps above.

After a successful run of the pipeline, the example tweet should look like this:

```txt
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']
```

In [13]:
print(f"{'Tweet:':<20}{postive_tweets[i]}")
print(f"{'Processed tweet:':<20}{processor.process_tweet(postive_tweets[i])}")

Tweet:              My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i
Processed tweet:    ['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']
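
Conceptually, such a shortcut method just feeds the output of one step into the next. The sketch below shows that chaining pattern with generic stand-in steps; the `compose` helper and the lambdas are hypothetical and much cruder than the real `TweetProcessor` methods.

```python
from functools import reduce


def compose(steps):
    """Return a function that passes its input through each step in order."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


# Stand-in steps (assumptions): the first two map str -> str,
# the third maps str -> list[str], the last maps list[str] -> list[str].
pipeline = compose([
    lambda text: text.replace("#", ""),
    str.lower,
    str.split,
    lambda tokens: [t for t in tokens if t not in {"my", "on", "a"}],
])

processed = pipeline("My #happy Friday")
print(processed)  # ['happy', 'friday']
```

The order of the steps matters: the string-level steps have to run before tokenization, and the token-level steps after it.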


Once your tests pass, this notebook should also produce the expected output.

Congratulations! 🥳🚀 You just completed your first assignment!