# 7. Text Analysis and Preprocessing
This section outlines the steps used to preprocess text data so that it can be used to train a predictive model.  We will use Tweets by Barack Obama and Donald Trump as our data source.

## 7.1 Exploratory analysis using word clouds
We can identify candidate feature variables in text using a word cloud visualisation, which shows the most frequently used words in the text, with more frequent words drawn larger.
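
Because a word cloud is driven by word frequencies, the same information can be checked as plain counts. The sketch below is illustrative only: the sample text and the small stop-word set are made up for this example, and `collections.Counter` is used simply to count words (it is not how the wordcloud package is implemented).

```python
from collections import Counter

# Hypothetical sample text (not taken from the Tweet files)
sample_text = "Yes we can. Yes we did. Thank you for everything."

# Hypothetical, very small stop-word set for illustration
sample_stopwords = {"we", "you", "for"}

# Lowercase, strip full stops, split into words and drop the stop words
words = [w.strip(".") for w in sample_text.lower().split()]
words = [w for w in words if w and w not in sample_stopwords]

# Count how often each remaining word occurs
word_counts = Counter(words)
print(word_counts.most_common(3))
```

Words with the highest counts are the ones that would appear largest in a word cloud of the same text.
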
### 7.1.1 Install and import required modules
To install WordCloud using Anaconda Navigator:
- Go to Environments.
- Click on Channels.
- Click on Add….
- Type ‘https://anaconda.org/conda-forge’ and press Enter.
- Click on Update channels.
- Search packages for ‘wordcloud’ and install it.

In [None]:
# Import modules:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud, STOPWORDS

### 7.1.2 Create a word cloud

In [None]:
# Get Obama’s Tweets.
obama = pd.read_csv('BarackObamaTweets.csv')
obama.head()

In [None]:
# Get all the text from the obama DataFrame in a Series
obama_tweets = obama.text
print(obama_tweets)
type(obama_tweets)

In [None]:
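# Join all the Tweets into a single string and convert to lower case.
# (str(obama_tweets) would return only the truncated printed summary of the Series.)
obama_str = ' '.join(obama_tweets.astype(str)).lower()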

# Create a Word Cloud using WordCloud()
wordcloud = WordCloud(width = 3000,
                      height = 2000,
                      background_color = 'black',
                      stopwords = STOPWORDS).generate(obama_str)

# Plot the cloud using pyplot
fig = plt.figure(figsize = (40, 30),
                 facecolor = 'k',
                 edgecolor = 'k')
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

### Exercise 7.1
Create a word cloud of Trump’s Tweets (from the file ‘DonaldTrumpTweets.csv’).
Make the following adjustments:
1. Change the background colour to white.
2. Set the minimum font size to 10.
3. Set the maximum number of words to 50.  
  
Documentation for Word Cloud: https://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html.  

In [None]:
# Enter your code here:


## 7.2 Data preparation - Bag-of-words (BOW)
### 7.2.1 Simple example of a BOW
We cannot use text data (e.g., a Tweet) in the mathematical models we use for prediction. So, we need a way to convert the text data into a numerical representation.  
  
Bag-of-words (BOW) is a simple method for doing this. It examines each observation (e.g., a single Tweet) and treats each word as a feature variable whose value is the number of times the word occurs in that observation.  
  
For example, recall the two Donald Trump Tweets:  
  
  
![Tweet 1](tweet1.png)
![Tweet 2](tweet2.png)

Using bag-of-words, these could be represented numerically as shown below.

![bag_of_words](bag_of_words.png)

Note:
- The punctuation has been removed and all letters have been changed to lowercase.
- The word “is” occurs thrice in the second Tweet, so its count is 3. It does not occur in the first Tweet, so its count is zero.
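
The counting step itself can be sketched with the standard library. The two sentences below are hypothetical (they are not the Tweets above, so Exercise 7.2 is left open), and `bag_of_words` is a helper name introduced only for this illustration.

```python
from collections import Counter
from string import punctuation

# Hypothetical example documents
docs = ["The cat sat.", "The cat saw the other cat!"]

def bag_of_words(doc):
    """Return a Counter of the lowercase, punctuation-free words in doc."""
    cleaned = "".join(ch for ch in doc.lower() if ch not in punctuation)
    return Counter(cleaned.split())

bows = [bag_of_words(doc) for doc in docs]

# Shared vocabulary: every word that occurs in at least one document
vocabulary = sorted(set().union(*bows))

# One row of counts per document; a word missing from a document counts as 0
rows = [[bow[word] for word in vocabulary] for bow in bows]

print(vocabulary)
for row in rows:
    print(row)
```

Each row has the same length as the vocabulary, which is what allows documents of different lengths to be compared numerically.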

### Exercise 7.2
In an earlier exercise, we created the top row of the above table. It is the set of all words that occur in either Tweet (called "all_words" in the code below). The program for creating this set is in the cell below.

Now, use an appropriate method to generate a Bag of Words representation for the two Tweets, stored as a Pandas DataFrame. I.e., reproduce the above table.

In [None]:
tweet1 = "We are building the wall..."
tweet2 = "The Wall is being rapidly built! The Economy is GREAT! Our Country is Respected again!"

# Import the punctuation chars
from string import punctuation

# function to create a list of lowercase, non-punctuated words
def clean_tweet(tweet):
    lower_str = tweet.lower()
    clean_str = "".join([ ch for ch in lower_str if ch not in punctuation ])
    # split() with no argument also copes with repeated spaces, tabs and newlines
    split_list = clean_str.split()
    return split_list

tweet1_words = clean_tweet(tweet1)
tweet2_words = clean_tweet(tweet2)
all_words = set(tweet1_words + tweet2_words)

print(tweet1_words)
print(tweet2_words)
print(all_words)

In [None]:
# Enter your code here:


### 7.2.2 Bag-of-words representation of all the Obama and Trump Tweets 
#### Read the data

In [None]:
# Read the CSV files into DataFrames
import pandas as pd

obama = pd.read_csv('BarackObamaTweets.csv')
trump = pd.read_csv('DonaldTrumpTweets.csv')
print(len(obama))
print(len(trump))

#### Combine the data sets into one
We need to combine the data sets. Before doing so, however, we should add the target variable (i.e., what we want to be able to predict). In our case, we want to predict whether a Tweet was written by Barack Obama or by Donald Trump. Remember, we cannot use text variables in a mathematical model, so we will create a new integer variable 'trump' and set it to 1 for all of Donald Trump's Tweets and 0 for all of Barack Obama's Tweets. 

In [None]:
# Create target variable 'trump' indicating for each record whether it was written by Trump (1) or Obama (0)
trump['trump'] = 1
obama['trump'] = 0

# Concatenate the two DataFrames into one
alltweets = pd.concat([trump[['text', 'trump']], obama[['text', 'trump']]], axis = 0)
print(len(alltweets))
print(alltweets.head(5))
print(alltweets.tail(5))

#### Use a module to perform bag-of-words
We are going to create a bag-of-words using a class called CountVectorizer, imported from the module 'sklearn.feature_extraction.text'.  
  
This class splits each Tweet into words (i.e., tokenises it) and builds a vocabulary (i.e., the set of all unique words in the corpus, or body of text). It then counts how often each vocabulary word occurs in each Tweet. The 'fit_transform' method returns a document-term matrix with one row per Tweet and one column per vocabulary word; the words themselves are stored in the vectoriser object.  
  
For more information, see:  
- https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction  
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
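
Conceptually, 'fit' learns the vocabulary and 'transform' produces one row of counts per document. The sketch below mimics those two steps using only the standard library. It is not scikit-learn's implementation, and the function names `fit_vocabulary` and `transform_counts` and the sample documents are invented for illustration. It assumes CountVectorizer's default behaviour of lowercasing the text and keeping tokens of two or more word characters.

```python
import re

# Hypothetical mini-corpus
docs = ["Yes we can", "We can and we will"]

def tokenise(doc):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

def fit_vocabulary(docs):
    # Map each unique token to a column index, in alphabetical order
    tokens = sorted({tok for doc in docs for tok in tokenise(doc)})
    return {tok: col for col, tok in enumerate(tokens)}

def transform_counts(docs, vocabulary):
    # Build one row per document, one column per vocabulary token
    matrix = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenise(doc):
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        matrix.append(row)
    return matrix

vocabulary = fit_vocabulary(docs)
matrix = transform_counts(docs, vocabulary)
print(vocabulary)
print(matrix)
```

The real CountVectorizer stores its result as a SciPy sparse matrix, because most entries are zero; calling `toarray()` on a small result shows it as a dense array like the one printed here.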

In [None]:
# Extract the Tweet text into a Pandas Series
corpus = alltweets['text']
print(type(corpus))
corpus.head()

In [None]:
# Import the CountVectorizer class
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer object
vect = CountVectorizer()

# Perform two methods (fit and transform) from the CountVectorizer class
# fit - Learn the vocabulary of all tokens in the raw text.
# transform - Transform documents to 'document-term matrix' (i.e., bag-of-words).
X = vect.fit_transform(corpus)

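# Inspect the vocabulary (each unique word is a feature).
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs version 1.0 or later.
feature_names = vect.get_feature_names_out()
print('Result is {} tokens = {}'.format(len(feature_names), list(feature_names)))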

#### Search the corpus for unwanted words
There are many strange words in the corpus, such as '0rtsmxdnfc'.  This is because the tweets include many website and picture links.  

In [None]:
# Check some of the items that may be unnecessary and worth removing
def check_string(corpus, string):
    found = [ tweet for tweet in corpus if string in tweet ]
    print('String "{}" total={} found={}'.format(string, len(corpus), len(found)))
    for f in found:
        print(f)
        print('---')
    
check_string(corpus, 'http')

In [None]:
# Also check the Twitter picture links
check_string(corpus, 'pic.twitter')

#### Removing unwanted words using regular expressions
A useful way to find and remove these unnecessary strings is to use "regular expressions".

An example of a regular expression is ‘http\S+’.  You can test this 'regex' using https://www.regexpal.com.  See what it finds when provided with the Tweet below:  
  
‘Tell your friends you fist bumped President Obama enter now for your chance. http://ofa.bo/fAEq pic.twitter.com/zJEelmeIe6’. 

Let's use regular expressions to match and remove the website links. These start with ‘http’.  
  
A suitable regular expression is ‘http\S+’, which means find the string ‘http’ followed by any sequence of non-whitespace characters.
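
The same check can be made directly in Python. This is a minimal sketch based on the example Tweet above, using only the standard `re` module.

```python
import re

# Example Tweet, based on the one quoted above
tweet = ("Tell your friends you fist bumped President Obama enter now "
         "for your chance. http://ofa.bo/fAEq pic.twitter.com/zJEelmeIe6")

# Find every match of 'http' followed by non-whitespace characters
matches = re.findall(r'http\S+', tweet)
print(matches)

# Replace each match with an empty string
cleaned = re.sub(r'http\S+', '', tweet)
print(cleaned)
```

The picture link is untouched because it does not start with 'http', so it needs a pattern of its own.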

In [None]:
# Remove all website links from the corpus

# Module for regular expressions 
import re

def clean_corpus(corpus, regex):
    
    # Use the Pandas Series 'apply' method to apply the lambda function to each element
    corpus_cleaned = corpus.apply(
        # Use the sub() function to replace each matched string with an empty string ('')
        lambda x: re.sub(regex, '', x, flags = re.IGNORECASE)
    )

    # Create the BOW
    vect = CountVectorizer()
    X = vect.fit_transform(corpus_cleaned)
    return (vect, X, corpus_cleaned)

(tokens, X, corpus) = clean_corpus(corpus, r'http\S+')
print('Result is {} tokens remaining: {}'.format(len(tokens.get_feature_names_out()), list(tokens.get_feature_names_out())))

Let's also remove the Twitter picture links that start with 'pic.twitter.com'.

In [None]:
# Also remove words containing picture links
# (the dots are escaped so that they match a literal '.' rather than any character):
(tokens, X, corpus) = clean_corpus(corpus, r'pic\.twitter\.com\S+')
print('Result is {} tokens remaining: {}'.format(len(tokens.get_feature_names_out()), list(tokens.get_feature_names_out())))

Let's also remove any words that contain digits or underscores.

In [None]:
# \S*\d\S*   remove digits (\d) and all joined non-whitespace characters (\S*)
# \w*_+\w*   remove underscores (_+ where + is one or more) and all joined word characters (\w*)
(tokens, X, corpus) = clean_corpus(corpus, r'\S*\d\S*|\w*_+\w*')
print('Result is {} tokens remaining: {}'.format(len(tokens.get_feature_names_out()), list(tokens.get_feature_names_out())))
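
To see what the digit-and-underscore pattern removes, it can be tried on a single string. The sample sentence below is hypothetical and chosen only to contain one word of each kind.

```python
import re

# Hypothetical sample containing digits and an underscore
sample = 'meet at 10am with barack_obama and team2win today'

# Same pattern as above: words containing a digit, or words containing an underscore
pattern = r'\S*\d\S*|\w*_+\w*'
stripped = re.sub(pattern, '', sample)

# Collapse the leftover runs of spaces for readability
tidy = ' '.join(stripped.split())
print(tidy)
```

The whole word is removed in each case, not just the digit or underscore, because the pattern also consumes the characters joined to it.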

In [None]:
# Inspect the cleaned corpus
print(corpus)

#### Removing stop words
Note common words like "is", "the" and "in" in the cleaned corpus. Such words are known as 'stop words'; they are generally presumed to contribute little predictive power to a model and can therefore be removed.
  
The module used to extract our BOW includes a standard set of stop words that can be used as shown below. However, some care may be required in some applications to ensure that these stop words do not remove useful information.  
  
For more information, see:
- https://scikit-learn.org/stable/modules/feature_extraction.html#using-stop-words
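
The caution about losing useful information can be illustrated with a small sketch. The stop-word set here is hypothetical and deliberately tiny (real lists are much longer), and `remove_stop_words` is a helper name introduced only for this example.

```python
# Hypothetical, deliberately tiny stop-word set
sample_stop_words = {"the", "is", "in", "not"}

def remove_stop_words(text, stop_words):
    # Keep only the words that are not in the stop-word set
    return [w for w in text.lower().split() if w not in stop_words]

positive = remove_stop_words("The economy is great", sample_stop_words)
negative = remove_stop_words("The economy is not great", sample_stop_words)

print(positive)
print(negative)
```

If a negation such as "not" is treated as a stop word, two sentences with opposite meanings reduce to the same words, which may matter in applications such as sentiment analysis.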

In [None]:
# Vectorise the cleaned corpus, removing stop words this time:
vect = CountVectorizer(stop_words = 'english')
bow = vect.fit_transform(corpus)

print('Result is {} tokens remaining: {}'.format(len(vect.get_feature_names_out()), list(vect.get_feature_names_out())))

### 7.2.3 BOW limitations
A bag-of-words built from single words is a 'unigram' model (where the unit of analysis n = 1): each Tweet is split into individual words.  Unigrams cannot capture phrases and multi-word expressions, so any dependence on word order is lost. Additionally, the bag-of-words model does not account for potential misspellings or word derivations.  
  
A more sophisticated approach is to use bigrams (where n = 2), which count occurrences of pairs of consecutive words.  Yet another approach is to count groups of consecutive characters; these character n-grams offer some resilience against misspellings and derivations.  

See more here:
https://scikit-learn.org/stable/modules/feature_extraction.html#limitations-of-the-bag-of-words-representation
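
Both kinds of n-gram can be sketched in a few lines of plain Python. The helper names `word_ngrams` and `char_ngrams` and the sample strings are invented for illustration; in practice, CountVectorizer's 'ngram_range' and 'analyzer' parameters can be used to request word or character n-grams.

```python
def word_ngrams(text, n):
    # Slide a window of n consecutive words across the text
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(word, n):
    # Slide a window of n consecutive characters across the word
    return [word[i:i + n] for i in range(len(word) - n + 1)]

bigrams = word_ngrams("the wall is being built", 2)
correct = char_ngrams("building", 3)
misspelt = char_ngrams("bulding", 3)

print(bigrams)
print(correct)
print(misspelt)

# Character trigrams shared by the correct and the misspelt word
shared = set(correct) & set(misspelt)
print(sorted(shared))
```

As whole words, "building" and "bulding" are unrelated features, but they still share several character trigrams, which is the resilience against misspellings described above.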

## 7.3 Saving your pre-processed data
Save your combined data set and your cleaned corpus. You will need them for the next section.

In [None]:
# Store the tweets
alltweets.to_csv('alltweets.csv')

# Store the cleaned corpus
corpus.to_csv('corpus.csv', header = True, index = False)