# Step 1 - Downloading Twitter Data

Running the command below from the Python interpreter downloads and stores the tweets locally. Once the samples are downloaded, they are available for your use.

You will use the negative and positive tweets to train your model on sentiment analysis later in the tutorial. The tweets with no sentiments will be used to test your model.

If you would like to use your own dataset, you can gather tweets from a specific time period, user, or hashtag by using the Twitter API.

Now that you’ve imported NLTK and downloaded the sample tweets, exit the interactive session by entering in exit(). You are ready to import the tweets and begin processing the data.

In [86]:
import nltk
nltk.download("twitter_samples")

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

# Step 2 - Tokenizing the Data

Language in its original form cannot be accurately processed by a machine, so you need to process the language to make it easier for the machine to understand. The first part of making sense of the data is through a process called tokenization, or splitting strings into smaller parts called tokens.

A token is a sequence of characters in text that serves as a unit. Based on how you create the tokens, they may consist of words, emoticons, hashtags, links, or even individual characters. A basic way of breaking language into tokens is by splitting the text based on whitespace and punctuation.

In [87]:
from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
neutral_tweets = twitter_samples.strings('tweets.20150430-223406.json')

The strings() method of twitter_samples will print all of the tweets within a dataset as strings. Setting the different tweet collections as a variable will make processing and testing easier.

In [88]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

You are ready to use NLTK’s tokenizers. NLTK provides a default tokenizer for tweets with the .tokenized() method. Add a line to create an object that tokenizes the positive_tweets.json dataset:

In [89]:
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
print(tweet_tokens[0])

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']


Here, the .tokenized() method returns special characters such as @ and _. These characters will be removed through regular expressions

# Step 3 - Normalizing the Data

Words have different forms—for instance, “ran”, “runs”, and “running” are various forms of the same verb, “run”. Depending on the requirement of your analysis, all of these versions may need to be converted to the same form, “run”. Normalization in NLP is the process of converting a word to its canonical form.

Normalization helps group together words with the same meaning but different forms. Without normalization, “ran”, “runs”, and “running” would be treated as different words, even though you may want them to be treated as the same word. In this section, you explore stemming and lemmatization, which are two popular techniques of normalization.

Stemming is a process of removing affixes from a word. Stemming, working with only simple verb forms, is a heuristic process that removes the ends of words.

In this tutorial you will use the process of lemmatization, which normalizes a word with the context of vocabulary and morphological analysis of words in text. The lemmatization algorithm analyzes the structure of the word and its context to convert it to a normalized form. Therefore, it comes at a cost of speed. A comparison of stemming and lemmatization ultimately comes down to a trade off between speed and accuracy.

In [90]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

 Before running a lemmatizer, you need to determine the context for each word in your text. This is achieved by a tagging algorithm, which assesses the relative position of a word in a sentence.

In [91]:
from nltk.tag import pos_tag
from pprint import pprint 

print(pprint(pos_tag(tweet_tokens[0])))

[('#FollowFriday', 'JJ'),
 ('@France_Inte', 'NNP'),
 ('@PKuchly57', 'NNP'),
 ('@Milipol_Paris', 'NNP'),
 ('for', 'IN'),
 ('being', 'VBG'),
 ('top', 'JJ'),
 ('engaged', 'VBN'),
 ('members', 'NNS'),
 ('in', 'IN'),
 ('my', 'PRP$'),
 ('community', 'NN'),
 ('this', 'DT'),
 ('week', 'NN'),
 (':)', 'NN')]
None


https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    
From the list of tags, here is the list of the most common items and their meaning:
<ul>
NNP: Noun, proper, singular
</ul>
<ul>
NN: Noun, common, singular or mass
</ul>    
<ul>
IN: Preposition or conjunction, subordinating
</ul>
<ul>
VBG: Verb, gerund or present participle
</ul>
<ul>
VBN: Verb, past participle    
</ul>    

In general, if a tag starts with NN, the word is a noun and if it starts with VB, the word is a verb.

To incorporate this into a function that normalizes a sentence, you should first generate the tags for each token in the text, and then lemmatize each word using the tag.

In [92]:
from nltk.stem.wordnet import WordNetLemmatizer

def lemmatize(tokens):
    
    tagged_tokens = pos_tag(tokens)
    lemmatizer = WordNetLemmatizer()
    lemmatized_sent = []
    
    for word, tag in tagged_tokens:
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sent.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sent
        

print(pprint(lemmatize(nltk.word_tokenize('Hi my name is @Osama and I am typing very fast and being fast means making more mistakes')))) 

['Hi',
 'my',
 'name',
 'be',
 '@',
 'Osama',
 'and',
 'I',
 'be',
 'type',
 'very',
 'fast',
 'and',
 'be',
 'fast',
 'mean',
 'make',
 'more',
 'mistake']
None


You will notice that the verb 'being' changes to its root form, 'be' and 'running' change to 'run'

# Step 4 — Removing Noise from the Data

In this step, you will remove noise from the dataset. Noise is any part of the text that does not add meaning or information to data.

Noise is specific to each project, so what constitutes noise in one project may not be in a different project. For instance, the most common words in a language are called stop words. Some examples of stop words are “is”, “the”, and “a”. They are generally irrelevant when processing language, unless a specific use case warrants their inclusion.

In this tutorial, you will use regular expressions in Python to search for and remove these items:

**Hyperlinks** - All hyperlinks in Twitter are converted to the URL shortener t.co. Therefore, keeping them in the text processing would not add any value to the analysis.

**Twitter handles in replies** - These Twitter usernames are preceded by a @ symbol, which does not convey any meaning.

**Punctuation and special characters** - While these often provide context to textual data, this context is often difficult to process. For simplicity, you will remove all punctuation and special characters from tweets.

To remove hyperlinks, you need to first search for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string.

Since we will normalize word forms within the remove_noise() function, you can comment out the lemmatize_sentence() function from the script.

In [93]:
import re, string
from nltk.corpus import stopwords

stopwords = set(stopwords.words('english'))

def remove_noise(tokens, stop_words):
    cleaned_tokens = []
    
    for token in tokens:
            token = re.sub(pattern = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', repl = '', string = token)
            token = re.sub(pattern = "(@[A-Za-z0-9_]+)", repl = '', string = token)
            
            if len(token) > 0 and (token not in string.punctuation) and (token.lower() not in stop_words):
                cleaned_tokens.append(token)
    
    cleaned_lemmatized_tokens = lemmatize(cleaned_tokens)
    
    return cleaned_lemmatized_tokens

print(pprint(remove_noise(tweet_tokens[0], stopwords)))

['#FollowFriday', 'top', 'engaged', 'member', 'community', 'week', ':)']
None


Before proceeding to the modeling exercise in the next step, use the remove_noise() function to clean the positive and negative tweets.

In [99]:
positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

cleaned_positive_tweets = [remove_noise(token, stopwords) for token in positive_tweet_tokens]
cleaned_negative_tweets = [remove_noise(token, stopwords) for token in negative_tweet_tokens]

print(positive_tweet_tokens[0]) 
print(cleaned_positive_tweets[0]) # print out a random cleaned negative tweet

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']
['#FollowFriday', 'top', 'engaged', 'member', 'community', 'week', ':)']


# Step 5 — Determining Word Density

The most basic form of analysis on textual data is to take out the word frequency. A single tweet is too small of an entity to find out the distribution of words, hence, the analysis of the frequency of words would be done on all positive tweets.

The following snippet defines a generator function, named get_all_words, that takes a list of tweets as an argument to provide a list of words in all of the tweet tokens joined.

Yield is a lazy iterator and it does not drain the memory unlike 'return', that is why it is used for iterating over such large
datasets.

In [102]:
def word_density(cleaned_tweet_list):
    
    for tokens in cleaned_tweet_list:
        for token in tokens:
                yield token
  
all_pos_words = word_density(cleaned_positive_tweets)
print(all_pos_words)

<generator object word_density at 0x000001F1833DE348>


Now that you have compiled all words in the sample of tweets, you can find out which are the most common words using the FreqDist class of NLTK.

The .most_common() method lists the words which occur most frequently in the data.

In [103]:
from nltk import FreqDist

freq_dist_pos = FreqDist(all_pos_words)
print(pprint(freq_dist_pos.most_common(10)))

[(':)', 3691),
 (':-)', 701),
 (':D', 658),
 ('follow', 338),
 ('...', 290),
 ('love', 242),
 ('day', 235),
 ('get', 234),
 ('u', 228),
 ('like', 220)]
None


From this data, you can see that emoticon entities form some of the most common parts of positive tweets. Before proceeding to the next step, make sure you comment out the last line of the script that prints the top ten tokens.

To summarize, you extracted the tweets from nltk, tokenized, normalized, and cleaned up the tweets for using in the model. Finally, you also looked at the frequencies of tokens in the data and checked the frequencies of the top ten tokens.

# Step 6 — Preparing Data for the Model


Sentiment analysis is a process of identifying an attitude of the author on a topic that is being written about. You will create a training data set to train a model. It is a supervised learning machine learning process, which requires you to associate each dataset with a “sentiment” for training. In this tutorial, your model will use the “positive” and “negative” sentiments.

Sentiment analysis can be used to categorize text into a variety of sentiments. For simplicity and availability of the training dataset, this tutorial helps you train your model in only two categories, positive and negative.

A model is a description of a system using rules and equations. It may be as simple as an equation which predicts the weight of a person, given their height. A sentiment analysis model that you will build would associate tweets with a positive or a negative sentiment. You will need to split your dataset into two parts. The purpose of the first part is to build the model, whereas the next part tests the performance of the model.

In the data preparation step, you will prepare the data for sentiment analysis by converting tokens to the dictionary form and then split the data for training and testing purposes.

### Converting Tokens to a Dictionary

First, you will prepare the data to be fed into the model. You will use the Naive Bayes classifier in NLTK to perform the modeling exercise. Notice that the model requires not just a list of words in a tweet, but a Python dictionary with words as keys and True as values. The following function makes a generator function to change the format of the cleaned data.

In [161]:
def token_dict(cleaned_tweet_tokens):
    
    for tweet_tokens in cleaned_tweet_tokens:
        yield dict([token, True] for token in tweet_tokens)
        
positive_tokens_for_model = token_dict(positive_tweet_tokens)
negative_tokens_for_model = token_dict(negative_tweet_tokens)

Attach a Positive or Negative label to each tweet. It then creates a dataset by joining the positive and negative tweets.

By default, the data contains all positive tweets followed by all negative tweets in sequence. When training the model, you should provide a sample of your data that does not contain any bias. To avoid bias, you’ve added code to randomly arrange the data using the .shuffle() method of random.

Finally, the code splits the shuffled data into a ratio of 70:30 for training and testing, respectively. Since the number of tweets is 10000, you can use the first 7000 tweets from the shuffled dataset for training the model and the final 3000 for testing the model.

In [162]:
import random as r

positive_dataset = [(token, 'Positive') for token in positive_tokens_for_model]
negative_dataset = [(token, 'Negative') for token in negative_tokens_for_model]

dataset = positive_dataset + negative_dataset

r.shuffle(dataset)

# splitting the train/test data

train_data = dataset[:7000]
test_data = dataset[7000:]

train_data

[({'So': True,
   '@gwatsky': True,
   'had': True,
   'a': True,
   'fantastic': True,
   'show': True,
   '!': True,
   'And': True,
   'I': True,
   'already': True,
   'want': True,
   'to': True,
   'buy': True,
   'tickets': True,
   'another': True,
   'concert': True,
   '.': True,
   ':D': True},
  'Positive'),
 ({'Someone': True,
   'more': True,
   'mature': True,
   'than': True,
   'me': True,
   'taught': True,
   'once': True,
   'how': True,
   'to': True,
   'hide': True,
   'your': True,
   'privacy': True,
   '.': True,
   'He': True,
   'was': True,
   'wise': True,
   ':)': True,
   'Now': True,
   "it's": True,
   'time': True,
   'grow': True,
   'up': True,
   'in': True,
   'some': True,
   'ways': True},
  'Positive'),
 ({'@OdellSchwarzeJG': True,
   'How': True,
   '?': True,
   'Easy.Get': True,
   'up': True,
   'at': True,
   '5:30': True,
   'am': True,
   ',': True,
   'go': True,
   'to': True,
   'work': True,
   'come': True,
   'home': True,
   'bout

# Step 7 — Building and Testing the Model

Finally, you can use the NaiveBayesClassifier class to build the model. Use the .train() method to train the model and the .accuracy() method to test the model on the testing data.

In [169]:
from nltk import classify
from nltk import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(train_data)

print(f'Confidence is: {classify.accuracy(classifier, test_data)}\n')
print(classifier.show_most_informative_features(20))

Confidence is: 0.9946666666666667

Most Informative Features
                      :) = True           Positi : Negati =    974.7 : 1.0
                     sad = True           Negati : Positi =     30.9 : 1.0
                 arrived = True           Positi : Negati =     27.0 : 1.0
                  THANKS = True           Negati : Positi =     24.2 : 1.0
                  Thanks = True           Positi : Negati =     23.2 : 1.0
                    miss = True           Negati : Positi =     19.5 : 1.0
                    THAT = True           Negati : Positi =     18.1 : 1.0
                  FOLLOW = True           Negati : Positi =     17.4 : 1.0
                    MUCH = True           Negati : Positi =     17.4 : 1.0
                     TOO = True           Negati : Positi =     14.9 : 1.0
                    poor = True           Negati : Positi =     14.7 : 1.0
                   Thank = True           Positi : Negati =     14.1 : 1.0
               followers = True        

In [179]:
custom_good = "The crust of this pizza was very good, but the topping was very bad. Overall I was satisfied with the pizza."
custom_tokens = remove_noise(nltk.word_tokenize(custom_bad), stopwords)

print(classifier.classify(dict([token, True] for token in custom_tokens)))

Negative
