# Data Science Workshop: Introduction to Natural Language Processing
### *Presented by Laura Stegner, stegner@wisc.edu*
#### Tutorial materials adapted from [this article](https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk)

The objectives for today are as follows:
 * Discuss what NLP is, what it is used for, and some ethical considerations surrounding it
 * Learn the basics of a popular NLP Python package, `nltk`
 * Use `nltk` to perform sentiment analysis on a dataset of tweets

## Quick Review

### What is NLP?
Natural Language Processing (NLP) can be broadly thought of as the computational tools used to help computers understand and manipulate spoken or written natural language to do useful things. This goal can be achieved with the help of various NLP tasks, such as:
 * Part of speech tagging
 * Speech recognition
 * Word sense disambiguation
 * Sentiment analysis
 * Natural language generation
 * Named entity recognition
 * Co-reference resolution

Each of the above tasks is briefly described in [this article](http://www.ibm.com/cloud/learn/natural-language-processing) by IBM. 

Practically, NLP is present in our everyday lives. Some common examples include autocorrect, autocomplete, *related search* terms in a search engine, email filtering, smart agents (e.g. Siri or Alexa), and machine translation (e.g. Google Translate). It is also useful in business applications, such as analyzing reviews or creating automated calling systems and chatbot assistants.

*Did you have any questions about this basic understanding of NLP or high-level questions about the different NLP tasks?*

### What will we use NLP for?
While NLP is being readily implemented in everyday products, it is also greatly useful in data science. NLP can be used to convert messy, unstructured natural language responses (such as interview data or open responses to survey questions) into more structured, processable data forms. Using NLP techniques to analyze data can serve to speed up processing time and also eliminate inconsistencies from manual analysis.

### Where does data for NLP come from?
Businesses will often use reviews or survey responses to extract information about how they are doing. Recorded phone calls with customer service agents are also used for NLP training purposes. That's the "this call may be recorded for quality assurance purposes" message on many customer service calls! It is also common to get data from social media, such as through the [Twitter API](https://developer.twitter.com/en/docs) or [Reddit](https://www.reddit.com/wiki/api).

Researchers may use datasets obtained from the web, including social media or messaging logs, or data from their own work, such as transcribed interviews, open-ended survey responses, or other data collection that results in natural language.

## Sentiment Analysis
Pop quiz: *In your own words, what is sentiment analysis?*

### How do we perform sentiment analysis?
Sentiment analysis is ultimately performed with a machine learning classifier. Today, we will be using a supervised learning technique called Naive Bayes. You can watch [this video](https://www.youtube.com/watch?v=O2L2Uv9pdDA) for an introduction to it if you are curious. The algorithm itself is already provided by the `nltk` library we will be using. In fact, most of the work goes into preprocessing the text into the proper format; the actual classifier is simple to build and use with `nltk`.

Since we are not going to go into the details of Naive Bayes, all you need to know is that this classifier will use a labeled dataset to build a model that can classify an unknown datapoint. In our case, the data will be the tweet text, and the categories will be 'positive' or 'negative'. We will use examples of positive and negative tweets to train our model, then use that model to classify unknown tweets from a different dataset.
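
To make the idea of "learning from labeled examples" a little more concrete, here is a minimal pure-Python sketch of a Naive Bayes classifier over word-presence features. It is not the `nltk` implementation: the toy training data, the function names `train_naive_bayes` and `classify`, and the add-one smoothing are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count how many documents of each label contain each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in labeled_docs:
        label_counts[label] += 1
        for word in set(words):          # presence of a word, not its frequency
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(model, words):
    """Return the label with the highest log-probability for the given words."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)      # log prior
        for word in set(words):
            if word not in vocabulary:
                continue                              # skip words never seen in training
            # add-one smoothing (an assumption of this sketch)
            likelihood = (word_counts[label][word] + 1) / (doc_count + 2)
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data standing in for labeled tweets
toy_training_data = [
    (["love", "this", "great"], "Positive"),
    (["great", "day", "happy"], "Positive"),
    (["hate", "this", "sad"], "Negative"),
    (["sad", "awful", "day"], "Negative"),
]
model = train_naive_bayes(toy_training_data)
prediction = classify(model, ["what", "a", "great", "happy", "day"])
print(prediction)
```

The "naive" part is the assumption that each word contributes independently, which is why the per-word log-likelihoods can simply be added together.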

## Getting started with Sentiment Analysis
### Setting up NLTK
`nltk` is a great library for getting started with NLP. However, before we can use it properly, we will need to download some additional resources (which we will discuss later). To do so, we can run the following Python code:

In [1]:
import nltk
nltk.download('twitter_samples')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/laura/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package punkt to /Users/laura/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/laura/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/laura/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /Users/laura/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Getting to know the dataset
We will be working with a dataset of tweets that is distributed with the `nltk` library. The dataset was downloaded in the setup code we just ran. It contains lists of negative tweets, positive tweets, and unlabeled tweets. We will use the positive and negative tweets to train our model, and the unlabeled tweets to see how it performs!

To access these different lists of tweets, we can import them individually and save them as variables:

In [2]:
from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
unlabeled_tweets = twitter_samples.strings('tweets.20150430-223406.json')

### Text Preprocessing
Real-world natural language data is messy; we cannot just immediately start running a script to analyze the text. First, we must preprocess the data into a format that we can more accurately analyze. This preprocessing will involve three steps, which we will walk through below: tokenization, normalization, and noise removal.

--> You can check out an example of why preprocessing is necessary in this [article](https://towardsdatascience.com/nlp-text-preprocessing-a-practical-guide-and-template-d80874676e79).

The preprocessing steps from raw data into properly formatted data will follow the steps outlined below.
#### 1. **Tokenization**
Here, we want to break apart each tweet into smaller parts called *tokens*. In this case, the tokens are roughly the whitespace-separated pieces of the tweet, with hashtags, handles, and emoticons kept as single tokens. We can tokenize the dataset using the corpus reader's `.tokenized()` method:

In [3]:
sample_tokens = twitter_samples.tokenized('positive_tweets.json')[0]
print("Original tweet:")
print(positive_tweets[0])
print("--------------------")
print("Tokenized tweet:")
print(sample_tokens)

Original tweet:
#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
--------------------
Tokenized tweet:
['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']
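
As a rough illustration of what a tweet-aware tokenizer has to handle, the sketch below uses a hand-written regular expression to keep URLs, hashtags, handles, and a few simple emoticons together as single tokens. The pattern and the function name `simple_tweet_tokenize` are assumptions made for this example and are far simpler than what `nltk` actually does.

```python
import re

# Hypothetical, simplified pattern; alternatives are tried left to right,
# so the special cases come before the generic word and punctuation rules.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+          # URLs
    | [@#]\w+             # handles and hashtags
    | [:;=][-']?[)(DPp]   # a few simple emoticons
    | \w+(?:'\w+)?        # words, optionally with an apostrophe
    | [^\w\s]             # any other single punctuation character
    """,
    re.VERBOSE,
)

def simple_tweet_tokenize(text):
    """Split a tweet into tokens using the simplified pattern above."""
    return TOKEN_PATTERN.findall(text)

example = "Thanks @workshop for the #NLP intro! It's great :) https://example.com/x"
tokens = simple_tweet_tokenize(example)
print(tokens)
```

Note how the emoticon and the URL each survive as one token, whereas a punctuation-splitting tokenizer would break them into several pieces.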


#### 2. **Normalization**
Here, we will convert words to their canonical forms. In other words, we will group words that share a meaning but differ in form, such as "search," "searching," and "searched." To perform the normalization, we will use a technique called *lemmatization*. The lemmatization algorithm analyzes the structure of the word and its context to convert it to a normalized form.
 
Our lemmatization process involves two steps: 
 
First, we use a part-of-speech tagger to help determine the context:
 

In [4]:
from nltk.tag import pos_tag
print(pos_tag(sample_tokens))

[('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]


Second, we can use a lemmatization function. In order to correctly lemmatize words, we need to also include the part-of-speech as contextual information:

In [5]:
from nltk.stem.wordnet import WordNetLemmatizer

def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence

print(lemmatize_sentence(sample_tokens))

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage', 'member', 'in', 'my', 'community', 'this', 'week', ':)']


Note that our lemmatization function includes the `pos_tag()` function, so it will not be run as a standalone step in the final script.
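
The tag-to-part-of-speech mapping inside `lemmatize_sentence()` can also be read on its own. The sketch below factors it into a helper with the hypothetical name `penn_to_wordnet_pos`; it mirrors the same three-way rule (noun, verb, otherwise adjective) and is shown only to make that rule easier to inspect.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to the WordNet POS letter used in this tutorial."""
    if tag.startswith("NN"):
        return "n"   # nouns: NN, NNS, NNP, NNPS
    if tag.startswith("VB"):
        return "v"   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    return "a"       # every other tag is treated as an adjective

# A few (word, tag) pairs taken from the tagger output shown earlier
tagged = [("being", "VBG"), ("members", "NNS"), ("top", "JJ"), ("for", "IN")]
mapped = [(word, penn_to_wordnet_pos(tag)) for word, tag in tagged]
print(mapped)
```

Treating everything that is not a noun or verb as an adjective is a simplification; words such as "for" are simply left unchanged by the lemmatizer.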

#### 3. **Noise Removal**
Here, we will remove parts of the text that do not add any meaning or information. Exactly what constitutes noise will vary from project to project. *Stop words* are the most common words of a language. One simple approach to removing noise is to filter out all of the stop words, since they are generally irrelevant.
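
Stop word filtering itself is just a membership test. Here is a minimal sketch using a tiny hand-picked stop word set (an assumption for this example; the real list from `nltk` is much longer) and a hypothetical helper named `drop_stop_words`.

```python
# A tiny hand-picked stop word set, for illustration only.
example_stop_words = {"the", "is", "a", "of", "to", "and", "in", "that", "some"}

def drop_stop_words(tokens, stop_words):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ["Dang", "that", "is", "some", "rad", "fanart"]
filtered = drop_stop_words(tokens, example_stop_words)
print(filtered)
```

Comparing on the lowercase form means "That" and "that" are both filtered, even though the original casing of the kept tokens is preserved.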

Additionally, we will use *regular expressions* to filter out hyperlinks, Twitter handles in replies, and punctuation/special characters. A regular expression is a sequence of characters that defines a search pattern. They are often used to validate data entry that must take a certain form (such as emails or phone numbers). You can read more about regular expressions [here](https://www.geeksforgeeks.org/write-regular-expressions/).
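
To get a feel for regular expressions before using them on real tweets, the sketch below applies two simplified patterns to a handful of tokens. The patterns and the helper name `strip_links_and_handles` are assumptions for this example, not the exact expressions used in the tutorial.

```python
import re

# Simplified patterns, assumed for this example only.
URL_PATTERN = re.compile(r"https?://\S+")
HANDLE_PATTERN = re.compile(r"@[A-Za-z0-9_]+")

def strip_links_and_handles(token):
    """Remove URLs and @handles from a single token."""
    token = URL_PATTERN.sub("", token)
    return HANDLE_PATTERN.sub("", token)

raw_tokens = ["Dang", "@AbzuGame", "#fanart", "https://t.co/bI8k8tb9ht"]
stripped = [strip_links_and_handles(token) for token in raw_tokens]
print(stripped)
```

Tokens that consisted entirely of a link or handle become empty strings, so a later step still has to discard empty tokens.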

We will use the following function to remove noise from our data. Notice that the noise removal includes lemmatization, so it will not be necessary to run lemmatization as a standalone step in the final script.

In [6]:
import re, string
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):
        # Raw strings avoid invalid-escape warnings for sequences such as \(
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        token = re.sub(r"(@[A-Za-z0-9_]+)", "", token)

        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

print(remove_noise(sample_tokens, stop_words))

['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']


### Building the Naive Bayes Classifier
Now that the data is ready to go, we are just missing the classifier! We will be using the `NaiveBayesClassifier`.

In order to properly classify our dataset, we first need to train our model. To do this, we need to provide a combined dataset that includes the positive and negative tweets. Making this combined dataset involves cleaning the positive and negative tweets, converting the cleaned tweets to the proper form, adding the labels, and combining the positive and negative tweets into one dataset.

#### 1. Clean the training data
We will use the `remove_noise()` function that we created in order to clean both lists:

In [7]:
positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []

for tokens in positive_tweet_tokens:
    positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

for tokens in negative_tweet_tokens:
    negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

As a quick check, we can compare some raw tweets with some cleaned tweets:

In [8]:
print(positive_tweet_tokens[500])
print(positive_cleaned_tokens_list[500])

['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
['dang', 'rad', '#fanart', ':d']


#### 2. Convert to dictionary
Because of the way the `NaiveBayesClassifier` from `nltk` works, we need to transform our data from a list of words into a dictionary where each word is a key whose value is `True`.

In [21]:
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
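
To see what this conversion produces, the sketch below builds one feature dictionary by hand and also shows a property of generators worth remembering: they can be iterated only once. The helper name `tokens_to_features` and the sample tokens are assumptions for this example.

```python
def tokens_to_features(tokens):
    """Turn a list of tokens into a {token: True} feature dictionary."""
    return {token: True for token in tokens}

features = tokens_to_features(["dang", "rad", "#fanart", ":d"])
print(features)

# Generators can be iterated only once: a second pass yields nothing.
feature_stream = (tokens_to_features(t) for t in [["good"], ["bad"]])
first_pass = list(feature_stream)
second_pass = list(feature_stream)
print(len(first_pass), len(second_pass))
```

If a cell that consumes a generator is run twice in a notebook, the second run sees an empty sequence, so the generator has to be recreated first.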

#### 3. Add the labels
Now that we have properly formatted the data, we need to add the labels "Positive" and "Negative" as appropriate. We will combine the two labeled lists into one dataset in the next step:

In [22]:
import random

positive_dataset = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model]

#### 4. Combining positive and negative tweets
The labeled positive and negative datasets need to be combined into one dataset. This is a simple step. Notice that we randomize the order of the tweets to avoid any bias from the original ordering of the tweets.

In [24]:
dataset = positive_dataset + negative_dataset
random.shuffle(dataset)
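
Because the shuffle is random, results can vary slightly from run to run. One way to make a run repeatable, sketched below under the assumption that reproducibility matters for your analysis, is to shuffle with a seeded generator. The helper name `shuffled_copy` and the placeholder data are hypothetical.

```python
import random

def shuffled_copy(items, seed):
    """Return a shuffled copy of items, leaving the original list unchanged."""
    rng = random.Random(seed)   # private generator; global random state is untouched
    result = list(items)
    rng.shuffle(result)
    return result

# Placeholder data standing in for the labeled tweets
labeled = [("tweet %d" % i, "Positive" if i < 5 else "Negative") for i in range(10)]
run_one = shuffled_copy(labeled, seed=42)
run_two = shuffled_copy(labeled, seed=42)
print(run_one == run_two)
```

Using the same seed gives the same ordering, and therefore the same train/test partition, every time the notebook is run.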

#### 5. Split into training and testing sets
Now that we have our combined, labeled dataset, we are finally ready to build the classifier. To evaluate the classifier on tweets it has not seen, we partition the dataset into two sections: training and testing. The training data will be used to build the model, while the testing data will be used to assess its accuracy. Here we use the first 7,000 shuffled tweets for training and the rest for testing.

In [25]:
train_data = dataset[:7000]
test_data = dataset[7000:]
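
The hard-coded index works for this dataset, but a fraction-based split adapts to datasets of other sizes. The following is a small sketch; the helper name `train_test_split` and the 70/30 default are assumptions for illustration.

```python
def train_test_split(dataset, train_fraction=0.7):
    """Split a list into (train, test) at the given fraction."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    cutoff = round(len(dataset) * train_fraction)
    return dataset[:cutoff], dataset[cutoff:]

# Placeholder list standing in for the shuffled, labeled tweets
example_dataset = list(range(10000))
train_part, test_part = train_test_split(example_dataset, 0.7)
print(len(train_part), len(test_part))
```

The split assumes the data has already been shuffled; slicing an unshuffled list would put all of one class in the same partition.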

#### 6. Building the model
`nltk` makes it very simple to build a Naive Bayes classifier now that we have jumped through all of the data preprocessing hoops:

In [26]:
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_data)

#### 7. Testing the model
Now that we have our model built, let's use the testing data to check the accuracy:

In [27]:
print("Accuracy is:", classify.accuracy(classifier, test_data))

Accuracy is: 0.996
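
Accuracy is simply the fraction of test examples whose predicted label matches the true label. The sketch below computes it by hand for a tiny made-up test set, using a stand-in rule (`emoticon_rule`, a hypothetical function, not the trained model) in place of a real classifier.

```python
def accuracy(predict, labeled_examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for features, label in labeled_examples
                  if predict(features) == label)
    return correct / len(labeled_examples)

def emoticon_rule(features):
    """Stand-in classifier: 'Negative' if ':(' is present, else 'Positive'."""
    return "Negative" if ":(" in features else "Positive"

# Made-up test examples in the same (feature dict, label) format
examples = [
    ({"happy": True, ":)": True}, "Positive"),
    ({"sad": True, ":(": True}, "Negative"),
    ({"ugh": True}, "Negative"),
    ({"great": True}, "Positive"),
]
score = accuracy(emoticon_rule, examples)
print(score)
```

A single accuracy number hides which kinds of tweets are misclassified, so it is worth also reading through a few of the errors by hand.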


#### (fun, but optional) Check out the most significant features
Our model will identify certain words that make a tweet more likely to be positive or negative. To check out what words (*features*) it decided were most important in the classifier, we can print them:

In [28]:
# show_most_informative_features() prints the table itself and returns None,
# so it does not need to be wrapped in print()
classifier.show_most_informative_features(10)

Most Informative Features
                      :( = True           Negati : Positi =   2070.9 : 1.0
                      :) = True           Positi : Negati =   1653.4 : 1.0
                     bam = True           Positi : Negati =     23.1 : 1.0
                followed = True           Negati : Positi =     22.9 : 1.0
                     sad = True           Negati : Positi =     20.1 : 1.0
                     x15 = True           Negati : Positi =     19.0 : 1.0
                follower = True           Positi : Negati =     18.6 : 1.0
                     via = True           Positi : Negati =     16.5 : 1.0
                     ugh = True           Negati : Positi =     14.3 : 1.0
                      aw = True           Negati : Positi =     13.6 : 1.0
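
Each ratio in the table compares how likely a feature is under one label versus the other. The sketch below shows the general idea with made-up counts and a simple additive smoothing; the function name `likelihood_ratio`, the counts, and the smoothing value are assumptions, and the exact estimator `nltk` uses internally may differ.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, smoothing=0.5):
    """Ratio of smoothed P(feature | label A) to P(feature | label B)."""
    p_a = (count_a + smoothing) / (total_a + 2 * smoothing)
    p_b = (count_b + smoothing) / (total_b + 2 * smoothing)
    return p_a / p_b

# Made-up counts: the word appears in 40 of 1000 "A" tweets and 1 of 1000 "B" tweets
ratio = likelihood_ratio(40, 1000, 1, 1000)
print(round(ratio, 1))
```

Smoothing keeps the ratio finite even when a feature never appears under one of the labels.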


### Using the Model
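Now that we have our model built, it is time to see it in action! We tokenize a custom tweet, clean it with `remove_noise()`, and convert it to the same dictionary format used for training before classifying it.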

In [30]:
from nltk.tokenize import word_tokenize

#custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."
custom_tweet = 'Congrats #SportStar on your 7th best goal from last season winning goal of the year :) #Baller #Topbin #oneofmanyworldies'

custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

Positive


### The Full Script
I have included the full Python script separately from the Jupyter notebook to make it easier to see the steps without all of the commentary. You can run it from the command line: `python3 semantic_analysis.py`

## Discussion/Questions
 * Questions about what we covered today?
 * What examples of NLP have you encountered either in your research or in your daily life?
 * Where do you think you could use NLP in your research, or other research situations?
 * Based on the readings or your own experiences, what are some important ethical considerations surrounding NLP?
 * Why should we care about this, and how can it affect our data analysis?