# Preprocessing

In this assignment, we will explore how to preprocess tweets for sentiment analysis.


In [None]:
import nltk                                # Python library for NLP
from nltk.corpus import twitter_samples    # sample Twitter dataset from NLTK
import matplotlib.pyplot as plt            # library for visualization
import random                              # pseudo-random number generator

## About the Twitter dataset

The sample dataset from NLTK is separated into positive and negative tweets. It contains exactly 5000 positive tweets and 5000 negative tweets. The exact match between these classes is not a coincidence: the dataset is intentionally balanced. That balance does not reflect the real distribution of positive and negative classes in live Twitter streams.



In [None]:
# downloads sample twitter dataset.
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

We can load the text fields of the positive and negative tweets by using the module's `strings()` method like this:

In [None]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

Next, we'll print a report with the number of positive and negative tweets. It is also essential to know the data structure of the datasets.

In [None]:
print('Number of positive tweets: ', len(all_positive_tweets))
print('Number of negative tweets: ', len(all_negative_tweets))

print('\nThe type of all_positive_tweets is: ', type(all_positive_tweets))
print('The type of a tweet entry is: ', type(all_negative_tweets[0]))
print(all_negative_tweets[0:5])
print(all_positive_tweets[0:5])

Number of positive tweets:  5000
Number of negative tweets:  5000

The type of all_positive_tweets is:  <class 'list'>
The type of a tweet entry is:  <class 'str'>
['hopeless for tmr :(', "Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 months :(", '@Hegelbon That heart sliding into the waste basket. :(', '“@ketchBurning: I hate Japanese call him "bani" :( :(”\n\nMe too', 'Dang starting next week I have "work" :(']
['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)', '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!', '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!', '@97sides CONGRATS :)', 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days']


## Looking at raw texts



Below, you will print one random positive and one random negative tweet.

In [None]:
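# random.randint includes both endpoints, so the upper bound must be len - 1
print(all_positive_tweets[random.randint(0, len(all_positive_tweets) - 1)])
print(all_negative_tweets[random.randint(0, len(all_negative_tweets) - 1)])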

All my Bae :)) &lt;3 
By : @Merima_Beslagic http://t.co/gDy1trnfjV
@CameronNeil yeah true but I fucked up one of my signature dishes last night! Unfamiliar kitchen :(
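As a minimal sketch of the sampling step: `random.randint(a, b)` includes both endpoints, so the largest valid index for a list is its length minus one, while `random.choice` avoids index arithmetic altogether. The `sample_tweets` list and the seed below are stand-ins for illustration, not part of the dataset.

```python
import random

# Hypothetical stand-in for all_positive_tweets; any non-empty list works.
sample_tweets = ["tweet a :)", "tweet b :)", "tweet c :)"]

rng = random.Random(42)  # seeded generator so the picks are repeatable

# randint includes BOTH endpoints, so the upper bound must be len - 1.
index = rng.randint(0, len(sample_tweets) - 1)
picked_by_index = sample_tweets[index]

# choice picks an element directly, with no index to get wrong.
picked_directly = rng.choice(sample_tweets)

print(picked_by_index)
print(picked_directly)
```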


## Preprocess raw text for Sentiment analysis

Data preprocessing is one of the critical steps in any machine learning project. It includes cleaning and formatting the data before feeding it into a machine learning algorithm. For NLP, the preprocessing steps consist of the following tasks:

* Tokenizing the string
* Lowercasing
* Removing stop words and punctuation
* Stemming




In [None]:
# Our selected sample. Complex enough to exemplify each step
tweet = all_positive_tweets[2277]
print(tweet)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i


Let's import a few more libraries for this purpose.

In [None]:
# download the stopwords from NLTK
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
import re                                  # library for regular expression operations
import string                              # for string operations

from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import TweetTokenizer   # module for tokenizing strings

### Remove hyperlinks, Twitter marks and styles

Since we have a Twitter dataset, we'd like to remove some substrings commonly used on the platform, such as hashtags, retweet marks, and hyperlinks. We'll use the [re](https://docs.python.org/3/library/re.html) library to perform regular expression operations on our tweet. We'll define our search pattern and use the `sub()` method to remove matches by substituting them with an empty string (i.e. `''`).

In [None]:
# remove old style retweet text "RT"
tweet = re.sub(r'^RT[\s]+', '', tweet)

# remove hyperlinks
tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)

# remove hashtags
# only removing the hash # sign from the word
tweet = re.sub(r'#', '', tweet)

print(tweet)

My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off… 
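To see each pattern at work, here is a small sketch that applies the same three substitutions to a made-up tweet. The function name `strip_twitter_markup` and the example text are hypothetical, and the final whitespace cleanup is an extra step that the cell above does not perform.

```python
import re

def strip_twitter_markup(text):
    """Apply the three substitutions from the cell above, then tidy spaces."""
    text = re.sub(r'^RT[\s]+', '', text)             # leading retweet marker
    text = re.sub(r'https?://[^\s\n\r]+', '', text)  # hyperlinks
    text = re.sub(r'#', '', text)                    # hash sign only, the word stays
    # Extra step (assumption): collapse the gaps left behind by removed links.
    return re.sub(r'\s+', ' ', text).strip()

example = "RT @user: Loving #NLP https://t.co/abc123 today"
print(strip_twitter_markup(example))

# The retweet pattern needs whitespace after "RT", so ordinary words are safe.
print(strip_twitter_markup("RTs are fun"))
```

Note that the retweet pattern is anchored with `^`, so only a marker at the very start of the tweet is removed.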


### Tokenize the string

To tokenize means to split the strings into individual words without blanks or tabs. In this same step, we will also convert each word in the string to lower case. The [tokenize](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual) module from NLTK allows us to do these easily:

In [None]:
# instantiate tokenizer class
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)

# tokenize tweets
tweet_tokens = tokenizer.tokenize(tweet)

print()
print('Tokenized string:')
print(tweet_tokens)


Tokenized string:
['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']
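The three options passed to `TweetTokenizer` each do one job: `preserve_case=False` lowercases, `strip_handles=True` drops `@mentions`, and `reduce_len=True` shortens long runs of a repeated character to three. The sketch below roughly approximates those effects with plain regular expressions. It is not NLTK's implementation (the handle pattern in particular is simplified), and `rough_normalize` is a hypothetical name.

```python
import re

def rough_normalize(text):
    """Roughly mimic three TweetTokenizer options using only `re`."""
    text = re.sub(r'@\w{1,15}', '', text)          # like strip_handles=True (simplified)
    text = re.sub(r'(.)\1{2,}', r'\1\1\1', text)   # like reduce_len=True: cap runs at 3
    return text.lower().split()                    # lowercase, then split on whitespace

print(rough_normalize("@97sides yeaaaah yippppy CONGRATS :)"))
```

Capping repeated characters means variants such as `yeaaaah` and `yeaaaaaah` collapse to the same token, which keeps the vocabulary smaller.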


### Remove stop words and punctuation

The next step is to remove stop words and punctuation. Stop words are words that don't add significant meaning to the text. You'll see the list provided by NLTK when you run the cells below.

In [None]:
#Import the english stop words list from NLTK
stopwords_english = stopwords.words('english')

print('Stop words\n')
print(stopwords_english)

print('\nPunctuation\n')
print(string.punctuation)

Stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so

We can see that the stop words list above contains some words that could be important in some contexts.


Time to clean up our tokenized tweet!

In [None]:

print(tweet_tokens)

tweets_clean = []

for word in tweet_tokens: # Go through every word in your tokens list
    if (word not in stopwords_english and  # remove stopwords
        word not in string.punctuation):  # remove punctuation
        tweets_clean.append(word)

print('removed stop words and punctuation:')
print(tweets_clean)

['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']
removed stop words and punctuation:
['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']


Please note that the words **happy** and **sunny** in this list are correctly spelled.
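One detail of the filter above is easy to miss: `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test. That is why single marks are dropped while multi-character tokens such as `:)` and the Unicode ellipsis survive. A small sketch, using a made-up token list and a two-word stand-in for the NLTK stop word list:

```python
import string

print('!' in string.punctuation)    # single ASCII punctuation character
print(':)' in string.punctuation)   # ':' and ')' are not adjacent in the string
print('…' in string.punctuation)    # the Unicode ellipsis is not ASCII punctuation

# Hypothetical stand-ins for tweet_tokens and stopwords_english.
tokens = ['my', 'beautiful', 'sunflowers', '!', ':)', 'off', '…']
stop_words = {'my', 'off'}          # a set gives fast membership checks

kept = [t for t in tokens if t not in stop_words and t not in string.punctuation]
print(kept)
```

Keeping emoticons is useful here, since they carry much of the sentiment signal in these tweets.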

### Stemming

Stemming is the process of converting a word to its most general form, or stem. This helps in reducing the size of our vocabulary.

Consider the words:
 * **learn**
 * **learn**ing
 * **learn**ed
 * **learn**t

All of these words are derived from the common root **learn**. However, in some cases the stemming process produces stems that are not correctly spelled words, for example **happi** and **sunni**. That's because the stemmer chooses the most common stem for a group of related words. For example, we can look at the set of words that comprises the different forms of happy:

 * **happ**y
 * **happi**ness
 * **happi**er

We can see that the prefix **happi** is more commonly used. We cannot choose **happ** because it is the stem of unrelated words like **happen**.
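To make the y-to-i behaviour concrete, here is a toy suffix stripper. It is emphatically not the algorithm NLTK uses, only a hypothetical illustration (the `toy_stem` name, the suffix list, and the length threshold are all assumptions) of how rule-based stemming can map related forms onto one stem that is not itself a dictionary word.

```python
SUFFIXES = ("ness", "ing", "ed", "er")   # assumed, deliberately tiny rule set
VOWELS = "aeiou"

def toy_stem(word):
    """Strip one known suffix, or turn a final consonant+y into i."""
    for suffix in SUFFIXES:
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    # 'happy' -> 'happi', but 'friday' is left alone (vowel before the y).
    if len(word) > 2 and word.endswith("y") and word[-2] not in VOWELS:
        return word[:-1] + "i"
    return word

for w in ["happy", "happiness", "happier", "learning", "learned", "friday"]:
    print(w, "->", toy_stem(w))
```

All three forms of happy end up sharing the stem `happi`, which is the behaviour the text describes.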

NLTK has different modules for stemming and we will be using the [PorterStemmer](https://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter) module which uses the [Porter Stemming Algorithm](https://tartarus.org/martin/PorterStemmer/). Let's see how we can use it in the cell below.

In [None]:

print(tweets_clean)

# Instantiate stemming class
stemmer = PorterStemmer()

# Create an empty list to store the stems
tweets_stem = []

for word in tweets_clean:
    stem_word = stemmer.stem(word)  # stemming word
    tweets_stem.append(stem_word)  # append to the list

print('stemmed words:')
print(tweets_stem)

['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']
stemmed words:
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']


In [None]:
processed_tweet=' '.join(tweets_stem)
processed_tweet

'beauti sunflow sunni friday morn :) sunflow favourit happi friday …'

That's it! Now we have a sentence that can be fed into the next stage of our project.

# Part 2: Sentiment Analysis

In [None]:
import numpy as np
import pandas as pd
nltk.download('twitter_samples')
# select the lists of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# concatenate the lists, 1st part is the positive tweets followed by the negative
tweets = all_positive_tweets + all_negative_tweets

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


In [None]:
#print tweets here
for tweet in tweets:
   print(tweet)

In [None]:
# Labels: 1 for the positive tweets (first part of the list), 0 for the negative ones
y = np.zeros(len(tweets))
y[:len(all_positive_tweets)] = 1

In [None]:
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess_tweet(tweet):
    # Remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # Remove hyperlinks
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)
    # Remove hashtags
    tweet = re.sub(r'#', '', tweet)
    return tweet



# Preprocess the tweets
processed_tweets = [preprocess_tweet(tweet) for tweet in tweets]
print(processed_tweets)





In [None]:
def tokenizing(tweet):
  # instantiate tokenizer class
  tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)

  # tokenize tweets
  tweet_tokens = tokenizer.tokenize(tweet)
  return tweet_tokens

tokenized_tweets=[tokenizing(tweet) for tweet in processed_tweets]

print(tokenized_tweets)



In [None]:
def cleaning(tweet):
  tweet_clean = []

  for word in tweet: # Go through every word in your tokens list
    if (word not in stopwords_english and  # remove stopwords
        word not in string.punctuation):  # remove punctuation
        tweet_clean.append(word)
  return tweet_clean

tweets_clean=[cleaning(tweet) for tweet in tokenized_tweets]
print(tweets_clean)



In [None]:
def stemming(tweet):
  stemmer = PorterStemmer()

  # Create an empty list to store the stems
  tweets_stem = []

  for word in tweet:
      stem_word = stemmer.stem(word)  # stemming word
      tweets_stem.append(stem_word)  # append to the list

  return tweets_stem

tweets_stemall=[stemming(tweet) for tweet in tweets_clean]
print(tweets_stemall)


[['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)'], ['hey', 'jame', 'odd', ':/', 'pleas', 'call', 'contact', 'centr', '02392441234', 'abl', 'assist', ':)', 'mani', 'thank'], ['listen', 'last', 'night', ':)', 'bleed', 'amaz', 'track', 'scotland'], ['congrat', ':)'], ['yeaaah', 'yipppi', 'accnt', 'verifi', 'rqst', 'succeed', 'got', 'blue', 'tick', 'mark', 'fb', 'profil', ':)', '15', 'day'], ['one', 'irresist', ':)', 'flipkartfashionfriday'], ['like', 'keep', 'love', 'custom', 'wait', 'long', 'hope', 'enjoy', 'happi', 'friday', 'lwwf', ':)'], ['second', 'thought', '’', 'enough', 'time', 'dd', ':)', 'new', 'short', 'enter', 'system', 'sheep', 'must', 'buy'], ['jgh', 'go', 'bayan', ':d', 'bye'], ['act', 'mischiev', 'call', 'etl', 'layer', 'in-hous', 'wareh', 'app', 'katamari', 'well', '…', 'name', 'impli', ':p'], ['followfriday', 'top', 'influenc', 'commun', 'week', ':)'], ['love', 'big', '...', 'juici', '...', 'selfi', ':)'], ['follow', 'follow', 'u', 'back', ':)'], ['perf

Now write a function that applies the preprocessing steps to all tweets, and then build an array that contains every processed tweet as a string.

Now use **TfidfVectorizer** to vectorize your tweets into a numeric matrix **X**.
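As a rough picture of what the vectorizer computes, the sketch below builds TF-IDF weights by hand for two tiny made-up documents. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, which is what scikit-learn documents as its default. The helper name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalised)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)                                    # raw term counts
        raw = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0  # guard empty docs
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

docs = [['happi', 'friday'], ['sad', 'friday']]  # made-up stemmed tweets
for row in tfidf_rows(docs):
    print(row)
```

Terms that appear in every document (`friday` here) receive a lower weight than terms that distinguish one document from another.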

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
processed_tweets_strings = [' '.join(tweet) for tweet in tweets_stemall]
print(processed_tweets_strings)
import numpy as np

vectorizer = TfidfVectorizer()
# Vectorize the processed tweets
X = vectorizer.fit_transform(processed_tweets_strings)
print(X)

['followfriday top engag member commun week :)', 'hey jame odd :/ pleas call contact centr 02392441234 abl assist :) mani thank', 'listen last night :) bleed amaz track scotland', 'congrat :)', 'yeaaah yipppi accnt verifi rqst succeed got blue tick mark fb profil :) 15 day', 'one irresist :) flipkartfashionfriday', 'like keep love custom wait long hope enjoy happi friday lwwf :)', 'second thought ’ enough time dd :) new short enter system sheep must buy', 'jgh go bayan :d bye', 'act mischiev call etl layer in-hous wareh app katamari well … name impli :p', 'followfriday top influenc commun week :)', 'love big ... juici ... selfi :)', 'follow follow u back :)', "perfect alreadi know what' wait :)", 'great new opportun junior triathlet age 12 13 gatorad seri get entri :)', 'lay greet card rang print today love job :-)', "friend' lunch ... yummm :) nostalgia tb ku", "id conflict thank help :d here' screenshot work", 'hi liv :)', 'hello need know someth u fm twitter — sure thing :) dm x', '
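One thing worth checking before training: by default `TfidfVectorizer` extracts tokens with the pattern `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters. Applying that pattern by hand suggests that emoticons such as `:)` and single-letter tokens never reach the matrix. The sketch below uses only `re`. Passing something like `token_pattern=r'\S+'` to the vectorizer is one possible way to keep them, offered here as a suggestion rather than as part of the original notebook.

```python
import re

DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's documented default
WHITESPACE_PATTERN = r"\S+"               # assumption: keep every space-separated token

text = "follow follow u back :)"

print(re.findall(DEFAULT_TOKEN_PATTERN, text))
print(re.findall(WHITESPACE_PATTERN, text))
```

Since the tweets were already tokenized and joined with spaces, a whitespace-based pattern preserves exactly the tokens produced by the earlier steps.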

Now that you have a matrix **X** and labels **y**, implement a model to classify these tweets.

Note:

1) You can use a sequential model in TensorFlow with 2 nodes in the last layer.

2) The node with the higher value in the output of *model.predict* corresponds to the predicted class.

3) Use **SparseCategoricalCrossentropy** as the loss function.
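The three notes fit together as follows: the two output nodes produce one score per class, softmax turns the scores into probabilities, the loss is the negative log of the probability assigned to the true class, and the prediction is the index of the larger value. A plain-Python sketch with made-up scores (the helper names are hypothetical, and this mirrors the arithmetic only, not TensorFlow's implementation):

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(probs, label):
    """Loss for one example whose true class is the integer `label`."""
    return -math.log(probs[label])

scores = [0.3, 2.1]                  # made-up outputs: index 0 = negative, 1 = positive
probs = softmax(scores)
predicted = probs.index(max(probs))  # note 2: the node with the higher value wins

print(probs, predicted)
print(sparse_cross_entropy(probs, label=1))  # true class matches the prediction
print(sparse_cross_entropy(probs, label=0))  # true class contradicts the prediction
```

"Sparse" refers to the labels being plain integers (0 or 1) rather than one-hot vectors, which matches how **y** was built.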

In [None]:
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
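Because the label vector is ordered (all positive tweets first, then all negative), the split has to shuffle before cutting. `train_test_split` does so by default. The following stdlib-only sketch shows the idea on ten made-up labels. The function name `shuffled_split` is hypothetical, and it is not a drop-in replacement for the scikit-learn function.

```python
import random

def shuffled_split(n_items, test_size=0.2, seed=1234):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)   # fixed seed makes the split repeatable
    n_test = int(round(n_items * test_size))
    return indices[n_test:], indices[:n_test]

labels = [1] * 5 + [0] * 5                 # ordered like y: positives first
train_idx, test_idx = shuffled_split(len(labels))

print([labels[i] for i in train_idx])
print([labels[i] for i in test_idx])
```

Without the shuffle, cutting off the last 20% of the data would produce a test set made up entirely of negative tweets.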

In [None]:
# model defn
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),

    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
])


In [None]:
#model compilation

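# The last layer already applies softmax, so the loss receives probabilities,
# not logits: from_logits must be False.
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])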

# Train the model
model.fit(X_train.toarray(), y_train, epochs=10, batch_size=32, validation_data=(X_test.toarray(), y_test))

# Evaluate the model
loss, accuracy = model.evaluate(X_test.toarray(), y_test)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)

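# Predict class probabilities for the test tweets
predictions = model.predict(X_test.toarray())
# The predicted label is the index of the output node with the higher value
predicted_labels = predictions.argmax(axis=1)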



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Loss: 1.1819995641708374
Test Accuracy: 0.7059999704360962
