Next week, you will start developing NLP models to classify positive and negative tweets. You will use the NLTK twitter samples dataset.

**Twitter Samples dataset documentation**

These samples of Tweets (or 'status updates') were collected from the
Twitter Streaming and REST APIs. Each file consists of
line-separated JSON-formatted tweets, i.e. one Tweet per line. For a
detailed description of the JSON fields in a Tweet, see
https://dev.twitter.com/overview/api/tweets.
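
Because each file holds one JSON object per line, it can be decoded line by line. The sketch below illustrates the idea with two made-up records. The sample content is invented for illustration, and in this notebook `twitter_samples.strings()` already returns the tweet text, so manual parsing should not be needed.

```python
import json

# Two made-up records standing in for lines of a tweets file (illustrative only).
raw_lines = (
    '{"id": 1, "text": "Great day :)"}\n'
    '{"id": 2, "text": "Missed my bus :("}\n'
)

# Decode one JSON object per non-empty line.
tweets = [json.loads(line) for line in raw_lines.splitlines() if line.strip()]

# Keep only the tweet text, which is what the later processing steps work on.
texts = [tweet["text"] for tweet in tweets]
print(texts)
```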

Any use of this data is subject to the Twitter Developer Agreement and
Developer Policy:
https://dev.twitter.com/overview/terms/agreement-and-policy.

These were collected in July 2015 by searching against the following strings:

positive_tweets.json

    ':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
    ':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
    '=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
    'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
    '<3'

negative_tweets.json

    ':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
    ':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
    ':c', ':{', '>:\\', ';('

Before developing sentiment analysis models on this dataset, you need to be able to process tweets. This is what you will learn in this datalab.

In [2]:
import nltk
from nltk.corpus import twitter_samples
import random
import re

In [3]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\neilr\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [4]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

**Task 1: Start with a simple EDA on the dataset.**

- What is the data type of `all_positive_tweets` and `all_negative_tweets`?
- What is the data type of a tweet entry?
- How many tweets are there in each class?
- Print a single tweet from each class.

In [5]:
print('Number of positive tweets: ', len(all_positive_tweets))
print('Number of negative tweets: ', len(all_negative_tweets))

print('\nThe type of all_positive_tweets is: ', type(all_positive_tweets))
print('The type of a tweet entry is: ', type(all_negative_tweets[0]))

Number of positive tweets:  5000
Number of negative tweets:  5000

The type of all_positive_tweets is:  <class 'list'>
The type of a tweet entry is:  <class 'str'>


In [6]:
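print('Positive class:')
# randint includes both endpoints, so the upper bound must be len - 1
print(all_positive_tweets[random.randint(0, len(all_positive_tweets) - 1)] + '\n')

print('Negative class:')
print(all_negative_tweets[random.randint(0, len(all_negative_tweets) - 1)])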

Positive class:
I am not looking forward to a 12 hour shift today :))))))

Negative class:
@koolaidkiller18 no idea why :(
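
When sampling a random tweet, keep in mind that `random.randint(a, b)` can return `b` itself, so an upper bound equal to the list length can raise an `IndexError`. A minimal sketch of two safe options follows; the list `sample_tweets` is a hypothetical stand-in for `all_positive_tweets`.

```python
import random

# Stand-in list; in the notebook this would be all_positive_tweets.
sample_tweets = ['tweet A :)', 'tweet B :D', 'tweet C <3']

random.seed(0)  # fixed seed only so the sketch is repeatable

# Option 1: randint with an upper bound of len - 1 (both endpoints are included).
index = random.randint(0, len(sample_tweets) - 1)
picked_by_index = sample_tweets[index]

# Option 2: random.choice picks an element directly, with no index arithmetic.
picked_directly = random.choice(sample_tweets)

print(picked_by_index)
print(picked_directly)
```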


In [7]:
example_tweet = ('My beautiful sunflowers on a sunny Friday morning off :)'
                 ' #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i')
print(example_tweet)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i


Next, you will create a function called `tweet_processor()`. Task by task, you will add more functionality to it.

In [8]:
def tweet_processor(tweet):
    """
    Input:
        tweet: a string containing a tweet
    Output:
        processed_tweet: a string containing the processed tweet
    """
    
    processed_tweet = tweet # no processing so far
    
    return processed_tweet

**Task 2: Remove hyperlinks from a tweet**

Using `re.sub()`, write a regular expression to remove hyperlinks. Add this to the function `tweet_processor()`.

Example tweet:
`My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i`

Expected output:
`'My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… '`

In [9]:
import re

In [10]:
def tweet_processor(tweet):
    """
    Input:
        tweet: a string containing a tweet
    Output:
        processed_tweet: a string containing the processed tweet
        
    Processing steps:
    - Removes hyperlinks
        
    """
    # Remove URLs using regular expression
    processed_tweet = re.sub(r'http\S+|www\S+|https\S+', '', tweet)
    
    return processed_tweet

In [11]:
tweet_processor(example_tweet)

'My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… '
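
The pattern used here removes any run of non-space characters that starts with `http` or `www`, which also catches ordinary words beginning with those letters. The sketch below compares it with a stricter variant; `strict_pattern` and the sample sentence are assumptions for illustration, not part of the exercise solution.

```python
import re

# Pattern from the notebook cell: anything starting with "http" or "www".
loose_pattern = r'http\S+|www\S+|https\S+'

# Stricter variant (an assumption): require "://" after http(s), or a dot after www.
strict_pattern = r'https?://\S+|www\.\S+'

text = 'Read https://t.co/3tfYom0N1i or www.example.com about httpd servers'

loose_result = re.sub(loose_pattern, '', text)
strict_result = re.sub(strict_pattern, '', text)

print(loose_result)   # the word "httpd" is removed as well
print(strict_result)  # the word "httpd" is kept
```

For the tweets in this dataset the two patterns are likely to behave the same most of the time; the difference only matters for words that merely start with `http` or `www`.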

**Task 3: Remove the `#` sign from hashtags**

Add a `re.sub()` call with a suitable pattern to the `tweet_processor()` function to remove the `#` sign while keeping the hashtag word itself.

Example tweet:
`My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i`

Expected output:
`'My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off… '`

Notice that `tweet_processor()` now removes both hyperlinks and `#` signs.

In [12]:
def tweet_processor(tweet):
    """
    Input:
        tweet: a string containing a tweet
    Output:
        processed_tweet: a string containing the processed tweet
        
    Processing steps:
    - Removes hyperlinks
    - Removes # sign
        
    """
    # Remove URLs using regular expression
    processed_tweet = re.sub(r'http\S+|www\S+|https\S+', '', tweet)
    
    # Remove the # sign but keep the hashtag word
    processed_tweet = re.sub(r'#([^\s]+)', r'\1', processed_tweet)
    
    return processed_tweet

In [13]:
tweet_processor(example_tweet)

'My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off… '
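
The replacement string `r'\1'` is what keeps the hashtag word: it refers back to the text captured by the parentheses in the pattern. A small sketch of the difference, using a made-up sentence:

```python
import re

text = 'Sunny #Friday morning #happy'

# '#' followed by one or more non-space characters; group 1 captures the word.
without_sign = re.sub(r'#([^\s]+)', r'\1', text)

# Replacing with '' instead of r'\1' would drop the whole hashtag, word included.
without_hashtag = re.sub(r'#([^\s]+)', '', text)

print(without_sign)
print(without_hashtag)
```

Keeping the word is usually preferable for sentiment analysis, since hashtags such as `#happy` carry sentiment themselves.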

**Task 4: Tokenize a tweet**

Use `TweetTokenizer` from `nltk` to tokenize tweets. Add this to the `tweet_processor()` function.

Example tweet:
`My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i`

Expected output:
`['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']`

Notice that the output is not a string anymore. It is a list of tokens.

In [14]:
from nltk.tokenize import TweetTokenizer

Decide which arguments to provide to the following parameters:

- **preserve_case** (bool) – Flag indicating whether to preserve the casing (capitalisation) of text used in the tokenize method. Defaults to True. 

- **reduce_len** (bool) – Flag indicating whether to replace repeated character sequences of length 3 or greater with sequences of length 3. Defaults to False.

- **strip_handles** (bool) – Flag indicating whether to remove Twitter handles of text used in the tokenize method. Defaults to False.
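
To get a feel for what these three flags do before choosing values, the sketch below imitates each one with plain string operations. It is a rough approximation using only the standard library, not NLTK's implementation: the handle pattern in particular is simplified, and the real tokenizer is more careful (for example, it is expected to leave emoticons such as `:D` untouched when lowercasing).

```python
import re

text = '@remy: This is waaaaayyyy too much for you!!!!!!'

# reduce_len=True: runs of 3 or more identical characters are cut down to exactly 3.
shortened = re.sub(r'(.)\1{2,}', r'\1\1\1', text)

# strip_handles=True: simplified here as "@" followed by word characters.
no_handles = re.sub(r'@\w+', '', text)

# preserve_case=False: plain lowercasing in this sketch.
lowered = text.lower()

print(shortened)
print(no_handles)
print(lowered)
```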

In [15]:
def tweet_processor(tweet):
    """
    Input:
        tweet: a string containing a tweet
    Output:
        processed_tweet: a list of tokens
        
    Processing steps:
    - Removes hyperlinks
    - Removes # sign
    - Tokenizes
        
    """
    # Remove URLs using regular expression
    processed_tweet = re.sub(r'http\S+|www\S+|https\S+', '', tweet)
    
    # Remove the # sign but keep the hashtag word
    processed_tweet = re.sub(r'#([^\s]+)', r'\1', processed_tweet)
    
    # Tokenize the tweet using TweetTokenizer
    tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=False)
    processed_tweet = tokenizer.tokenize(processed_tweet)
    
    return processed_tweet

In [16]:
print(tweet_processor(example_tweet))

['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']


**Task 5: Remove stopwords and punctuation.**

Go through every token in your list and remove stopwords and punctuation. Add this to the function `tweet_processor()`.

Example tweet:
`My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i`

Expected output:
`['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']`

Take a look at which tokens are removed and try to understand why.

In [17]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords_english = stopwords.words('english') 
print(stopwords_english)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\neilr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [18]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [19]:
def tweet_processor(tweet):
    """
    Input:
        tweet: a string containing a tweet
    Output:
        processed_tweet: a list of tokens
        
    Processing steps:
    - Removes hyperlinks
    - Removes # sign
    - Tokenizes
    - Removes stopwords and punctuation
        
    """
    # Remove URLs using regular expression
    processed_tweet = re.sub(r'http\S+|www\S+|https\S+', '', tweet)
    
    # Remove the # sign but keep the hashtag word
    processed_tweet = re.sub(r'#([^\s]+)', r'\1', processed_tweet)
    
    # Tokenize the tweet using TweetTokenizer
    tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=False)
    processed_tweet = tokenizer.tokenize(processed_tweet)
    
    # Remove stopwords
    stopwords_english = set(stopwords.words('english'))
    processed_tweet = [word for word in processed_tweet if word not in stopwords_english]
    
    # Remove punctuation
    processed_tweet = [word for word in processed_tweet if word not in string.punctuation]
    
    return processed_tweet

In [20]:
print(tweet_processor(example_tweet))

['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']
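
One detail helps explain which tokens survive: `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test. Emoticons such as `:)` and the non-ASCII `…` are not substrings of it and are kept, while a multi-character token that happens to appear inside it (such as `()`) is removed. The sketch below shows this with a hypothetical token list; the set-based variant is an alternative for comparison, not the approach used in this notebook.

```python
import string

# Hypothetical tokens chosen to show the different cases.
tokens = ['!', ':)', '…', '()', 'off']

# Substring test against the punctuation string (as in tweet_processor).
kept_substring_test = [t for t in tokens if t not in string.punctuation]

# Alternative: compare against a set of single punctuation characters.
punctuation_chars = set(string.punctuation)
kept_set_test = [t for t in tokens if t not in punctuation_chars]

print(kept_substring_test)
print(kept_set_test)
```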


**Task 6: Stem the tokens**

Use `PorterStemmer` from NLTK to convert tokens to stems. For background on stemming, see
https://en.wikipedia.org/wiki/Stemming

Add this functionality to the `tweet_processor()` function.

Example tweet:
`My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i`

Expected output:
`['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']`

In [21]:
from nltk.stem import PorterStemmer

In [22]:
def tweet_processor(tweet):
    """
    Input:
        tweet: a string containing a tweet
    Output:
        processed_tweet: a list of tokens
        
    Processing steps:
    - Removes hyperlinks
    - Removes # sign
    - Tokenizes
    - Removes stopwords and punctuation
    - Stems tokens
        
    """
    # Remove URLs using regular expression
    processed_tweet = re.sub(r'http\S+|www\S+|https\S+', '', tweet)
    
    # Remove the # sign but keep the hashtag word
    processed_tweet = re.sub(r'#([^\s]+)', r'\1', processed_tweet)
    
    # Tokenize the tweet using TweetTokenizer
    tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=False)
    processed_tweet = tokenizer.tokenize(processed_tweet)
    
    # Remove stopwords
    stopwords_english = set(stopwords.words('english'))
    processed_tweet = [word for word in processed_tweet if word not in stopwords_english]
    
    # Remove punctuation
    processed_tweet = [word for word in processed_tweet if word not in string.punctuation]
    
    # Stem the tokens
    stemmer = PorterStemmer()
    processed_tweet = [stemmer.stem(word) for word in processed_tweet]

    return processed_tweet

In [23]:
print(tweet_processor(example_tweet))

['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']
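
Stems are not always dictionary words: the stemmer strips suffixes by rule, so `beautiful` becomes `beauti` and `happy` becomes `happi`. To inspect this token by token, a small sketch (assuming NLTK is installed, as elsewhere in this notebook) pairs each token from the Task 5 output with its stem:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Tokens taken from the Task 5 output (duplicates and non-word tokens left out).
tokens = ['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', 'favourites', 'happy']

# Stem each token and print the pairs side by side.
stems = [stemmer.stem(token) for token in tokens]
for token, stem in zip(tokens, stems):
    print(f'{token:>12} -> {stem}')
```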
