# Topic 6. Natural language processing (NLP)

Natural language processing (NLP) is about developing applications and services that can understand human language. Practical examples include speech recognition (for example, Google voice search), topic detection (working out what a piece of content is about), and sentiment analysis.

https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk

# Sentiment Analysis 

In [48]:
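import nltk
# One-time setup: the corpora and models used below must be downloaded first, e.g.
# nltk.download('twitter_samples'); nltk.download('punkt'); nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger'); nltk.download('stopwords')
# (resource names can differ between NLTK versions)
from nltk.tag import pos_tag # takes a list of tokens and returns (token, tag) pairs
from nltk.stem.wordnet import WordNetLemmatizer # reduces inflected words to their base (dictionary) form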
import re, string
from nltk.corpus import stopwords
from nltk.corpus import twitter_samples
from nltk import FreqDist
import random
from nltk import classify
from nltk import NaiveBayesClassifier
from nltk.tokenize import word_tokenize

## Step 1 — Tokenizing the Data

Language in its original form cannot be accurately processed by a machine, so you need to process the language to make it easier for the machine to understand. The first part of making sense of the data is through a process called tokenization, or splitting strings into smaller parts called tokens.

A token is a sequence of characters in text that serves as a unit. Based on how you create the tokens, they may consist of words, emoticons, hashtags, links, or even individual characters. A basic way of breaking language into tokens is by splitting the text based on whitespace and punctuation.
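As a rough illustration of the whitespace-and-punctuation approach, the following sketch tokenizes a string with a single regular expression from Python's standard library. The function name `simple_tokenize` and the sample sentences are made up for this example; this is not the tokenizer NLTK applies to the tweet corpus.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (a "word");
    # [^\w\s] matches any single character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hello, world!"))
# ['Hello', ',', 'world', '!']

print(simple_tokenize("Great day :) #nlp"))
# ['Great', 'day', ':', ')', '#', 'nlp']
```

The second call shows why such a naive split is a poor fit for tweets: the emoticon and the hashtag are broken apart, whereas the tokens shipped with the Twitter corpus keep items such as `:)` intact.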

In [28]:
#create variables for positive_tweets, negative_tweets, and text
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')

In [14]:
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

## Step 2 — Normalizing the Data

Stemming is a heuristic process that strips affixes, usually the endings, from a word. It works only on the surface form of the word, without looking at its context, so the result is not always a valid word.

Words have different forms—for instance, “ran”, “runs”, and “running” are various forms of the same verb, “run”. Depending on the requirement of your analysis, all of these versions may need to be converted to the same form, “run”. Normalization in NLP is the process of converting a word to its canonical form.

Lemmatization analyzes the structure of the word and its context to convert it to a normalized form, so it is more accurate than stemming but slower. Choosing between stemming and lemmatization ultimately comes down to a trade-off between speed and accuracy.
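To make the difference concrete, here is a minimal sketch that contrasts a suffix-chopping heuristic with a dictionary lookup. Both `naive_stem` and `toy_lemmatize` are hypothetical helpers written for this note, and the three-entry lookup table merely stands in for a real lexical database such as WordNet.

```python
def naive_stem(word):
    # Heuristic: chop off the first matching suffix, no dictionary involved.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lookup table standing in for a real lexical database.
TOY_LEMMAS = {"ran": "run", "runs": "run", "running": "run"}

def toy_lemmatize(word):
    # Fall back to the word itself when it is not in the table.
    return TOY_LEMMAS.get(word, word)

for word in ("ran", "runs", "running"):
    print(word, "->", naive_stem(word), "|", toy_lemmatize(word))
# ran -> ran | run
# runs -> run | run
# running -> runn | run
```

The stemmer is cheap but misses the irregular form "ran" and produces the non-word "runn", while the lookup maps all three forms to "run" at the cost of needing a dictionary.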

In [22]:
#print(pos_tag(tweet_tokens[0])) # tag reference: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [23]:
def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer() # e.g. "being" -> "be"
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens): # pos_tag determines the part of speech
        if tag.startswith('NN'): # noun
            pos = 'n'
        elif tag.startswith('VB'): # verb
            pos = 'v'
        else:
            pos = 'a' # everything else is treated as an adjective
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence

#print(lemmatize_sentence(tweet_tokens[0]))

## Step 3 — Removing Noise from the Data

Noise is any part of the text that does not add meaning or information to data.

Noise is specific to each project, so what constitutes noise in one project may not be in a different project. For instance, the most common words in a language are called stop words. Some examples of stop words are “is”, “the”, and “a”. They are generally irrelevant when processing language, unless a specific use case warrants their inclusion.

Examples of noise in this dataset are hyperlinks, Twitter handles, punctuation, and special characters.
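Before the full function, a stripped-down sketch of the filtering idea may help. It leaves out lemmatization, uses a deliberately simplified URL pattern, and relies on a three-word stop list; `strip_noise`, `URL_PATTERN`, `HANDLE_PATTERN`, and `TOY_STOP_WORDS` are names invented for this illustration.

```python
import re
import string

URL_PATTERN = re.compile(r"https?://\S+")        # simplified stand-in for a full URL regex
HANDLE_PATTERN = re.compile(r"@[A-Za-z0-9_]+")   # Twitter handles such as @alice
TOY_STOP_WORDS = {"is", "the", "a"}              # tiny stop list, for illustration only

def strip_noise(tokens, stop_words=frozenset()):
    cleaned = []
    for token in tokens:
        # Blank out URLs and handles; what remains may be an empty string.
        token = URL_PATTERN.sub("", token)
        token = HANDLE_PATTERN.sub("", token)
        # Keep the token only if it is non-empty, not punctuation, and not a stop word.
        if token and token not in string.punctuation and token.lower() not in stop_words:
            cleaned.append(token.lower())
    return cleaned

sample = ["@alice", "The", "demo", "is", "live", "at", "https://example.com/x", "!"]
print(strip_noise(sample, TOY_STOP_WORDS))
# ['demo', 'live', 'at']
```

Note that "at" survives only because the toy stop list is so short; a fuller English stop list would typically drop it as well.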

In [33]:
stop_words = stopwords.words('english')

def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens): 
        # re.sub() replaces every match of the pattern with '' (here: URLs)
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
        # strip Twitter handles such as @user
        token = re.sub(r"(@[A-Za-z0-9_]+)","", token)

        if tag.startswith("NN"): # normalize words (Step 2)
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower()) # punctuation marks (string.punctuation) and stop words have been filtered out
    return cleaned_tokens

positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []

for tokens in positive_tweet_tokens:
    positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

for tokens in negative_tweet_tokens:
    negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words)) # for all tweets
    
#print(positive_tweet_tokens[500])
#print(positive_cleaned_tokens_list[500])

Certain issues can arise during the preprocessing of text. For instance, words written without spaces (“iLoveYou”) will be treated as a single token, and it can be difficult to separate such words. Furthermore, “Hi”, “Hii”, and “Hiiiii” will be treated as different tokens unless you write something specific to tackle the issue. It’s common to fine-tune the noise removal process for your specific data.
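One possible way to soften both issues is sketched below with two small regular-expression helpers. `split_camel_case` and `squeeze_repeats` are hypothetical names, not part of NLTK or of the pipeline above, and neither heuristic is complete.

```python
import re

def split_camel_case(token):
    # Insert a space wherever a lowercase letter is directly followed by an
    # uppercase letter, then split on whitespace.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()

def squeeze_repeats(token):
    # Collapse any character repeated three or more times down to two.
    return re.sub(r"(.)\1{2,}", r"\1\1", token)

print(split_camel_case("iLoveYou"))   # ['i', 'Love', 'You']
print(squeeze_repeats("Hiiiii"))      # Hii
print(squeeze_repeats("Hii"))         # Hii
print(squeeze_repeats("Hi"))          # Hi
```

Squeezing to two characters rather than one avoids turning "good" into "god", but it also means "Hi" and "Hii" still end up as different tokens, so this is a partial fix at best.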

## Step 4 — Determining Word Density

The most basic form of analysis on textual data is to compute word frequencies. A single tweet is too small to reveal a meaningful distribution of words, so the frequency analysis here is done over all positive tweets.

In [36]:
def get_all_words(cleaned_tokens_list): # takes a list of token lists (one per tweet) and yields every token in turn
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

all_pos_words = get_all_words(positive_cleaned_tokens_list) #generator

In [41]:
freq_dist_pos = FreqDist(all_pos_words)
print(freq_dist_pos.most_common(10))

[(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253)]
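NLTK's `FreqDist` is a subclass of `collections.Counter`, so the same counting idea can be sketched with the standard library alone. The toy token lists below are invented for this example. The sketch also shows a detail that is easy to miss: the generator returned by `get_all_words` can be consumed only once.

```python
from collections import Counter

def get_all_words(cleaned_tokens_list):
    # Flatten a list of token lists into a single stream of tokens.
    for tokens in cleaned_tokens_list:
        yield from tokens

toy_tweets = [["love", "good"], ["love", ":)"], [":)", "love"]]

words = get_all_words(toy_tweets)
counts = Counter(words)            # consumes the generator
print(counts.most_common(2))
# [('love', 3), (':)', 2)]

print(list(words))                 # the generator is now exhausted
# []
```

If the word stream is needed a second time, call `get_all_words` again or materialize it into a list first.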


## Step 5 — Preparing Data for the Model

Sentiment analysis can be used to categorize text into a variety of sentiments. For simplicity and availability of the training dataset, this tutorial helps you train your model in only two categories, positive and negative.

In the data preparation step, you will prepare the data for sentiment analysis by converting the tokens to dictionary form and then splitting the data for training and testing.

### Converting Tokens to a Dictionary

This notebook uses the Naive Bayes classifier in NLTK for the modeling exercise. The model requires not just a list of words in a tweet, but a Python dictionary with words as keys and True as values. The following generator function converts the cleaned data into that format.

In [43]:
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens) # True marks the token as present, as the Naive Bayes classifier expects

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
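To see what a single converted tweet looks like, here is a small standalone sketch. The helper name `to_features` and the sample tokens are assumptions made for this illustration; the dictionary comprehension builds the same mapping as the `dict([token, True] ...)` expression above.

```python
def to_features(tokens):
    # Each distinct token becomes a key; the value True records its presence.
    return {token: True for token in tokens}

print(to_features(["good", "good", "day"]))
# {'good': True, 'day': True}
```

Because dictionary keys are unique, a token that occurs twice in a tweet appears only once in the feature dictionary, so the model sees which words are present rather than how often they occur.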

### Naive Bayes classifier

Machine learning can be divided into the following categories of learning goals:

- Classification: assign instances to a predefined class. (see first dataset)

- Clustering: discover classes of instances that belong together.

- Association: learn relations between attributes.

- Numeric prediction (regression): predict a numeric quantity (instead of a class) for an instance. (see second dataset)

Machine learning in practice means automatically finding rules and patterns in examples (data).
A typical learning problem is classification: we want to assign an instance to a particular class.
There are several learning methods for doing this. Today we looked at the Naive Bayes classifier.
Bayesian statistics combines prior knowledge with observations.
Naive Bayes assumes that the attributes of an instance are independent of one another, given the class.

The naive Bayes classifier is also very popular because it is fast to compute and achieves a good recognition rate. It makes it possible to determine which class (the class attribute) an object belongs to. It is based on Bayes' theorem. A naive Bayes classifier can also be viewed as a star-shaped Bayesian network.

The naive basic assumption is that each attribute depends only on the class attribute. Although this is rarely true in reality, naive Bayes classifiers often achieve good results in practical applications, as long as the attributes are not too strongly correlated.
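Written out, the independence assumption lets the classifier factor the likelihood into one term per attribute. The following is a sketch of the standard formulation; the symbols $c$ (a class such as Positive or Negative) and $x_1, \dots, x_n$ (the attributes, here the tokens present in a tweet) are notation introduced for this note.

$$
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The prior $P(c)$ is the prior knowledge, the factors $P(x_i \mid c)$ come from the observations, and the predicted class $\hat{c}$ is the one with the largest product.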


<img src="https://miro.medium.com/max/1400/1*39U1Ln3tSdFqsfQy6ndxOA.png" height=40% />

[Source](https://towardsdatascience.com/introduction-to-naïve-bayes-classifier-fa59e3e24aaf)


If the attributes are strongly dependent on one another, it makes sense to extend the naive Bayes classifier with a tree between the attributes. The result is called a tree-augmented naive Bayes classifier.
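To show how little machinery the basic classifier needs, here is a from-scratch sketch using only the standard library. It is a simplified illustration and not NLTK's implementation: it scores only the tokens present in the input, uses add-one smoothing (an assumption of this sketch), and the names `train_toy_nb` and `predict` as well as the four training examples are made up.

```python
import math
from collections import Counter, defaultdict

def train_toy_nb(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs."""
    label_counts = Counter(label for _, label in labeled_docs)
    token_counts = defaultdict(Counter)  # label -> token -> number of docs containing it
    for tokens, label in labeled_docs:
        for token in set(tokens):
            token_counts[label][token] += 1
    return label_counts, token_counts

def predict(tokens, model):
    label_counts, token_counts = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        # Work in log space so that many small probabilities do not underflow.
        score = math.log(n_docs / total_docs)            # log prior
        for token in set(tokens):
            # Add-one smoothing keeps unseen tokens from zeroing the product.
            likelihood = (token_counts[label][token] + 1) / (n_docs + 2)
            score += math.log(likelihood)                # log likelihood of this token
        scores[label] = score
    return max(scores, key=scores.get)

toy_train = [
    (["love", "good"], "Positive"),
    (["good", "day"], "Positive"),
    (["sad", "bad"], "Negative"),
    (["bad", "day"], "Negative"),
]
model = train_toy_nb(toy_train)
print(predict(["love", "good"], model))   # Positive
print(predict(["sad"], model))            # Negative
```

Each token contributes its own independent factor to the score, which is exactly the naive assumption: the model never looks at which tokens occur together.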

### Splitting the Dataset for Training and Testing the Model

Next, you need to prepare the data for training the NaiveBayesClassifier class. 
This code attaches a Positive or Negative label to each tweet. It then creates a dataset by joining the positive and negative tweets.

By default, the data contains all positive tweets followed by all negative tweets in sequence. When training the model, you should provide a sample of your data that does not contain any bias. To avoid this ordering bias, the code below arranges the data randomly using the .shuffle() method of random before splitting it.

In [44]:
positive_dataset = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model] # each item is a ({word: True, ...}, "Negative") tuple

dataset = positive_dataset + negative_dataset

random.shuffle(dataset) # shuffle the tweets

train_data = dataset[:7000] # split the data: first 7000 tweets for training, the rest for testing
test_data = dataset[7000:]
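The effect of skipping the shuffle can be seen on a toy dataset. The sketch below uses 20 made-up items instead of the real tweets, and it seeds a separate `random.Random` instance (an addition of this note, not something the notebook does) so that the shuffle is repeatable.

```python
import random

# Toy stand-in for the real dataset: 10 positive items followed by 10 negative ones.
dataset = [({"token": True}, "Positive")] * 10 + [({"token": True}, "Negative")] * 10
split = 14  # 70% for training, mirroring a 7000/3000 split of 10000 items

# Without shuffling, the held-out slice contains a single class.
unshuffled_test = dataset[split:]
print({label for _, label in unshuffled_test})   # {'Negative'}

# A seeded Random instance makes the shuffle repeatable between runs.
shuffled = dataset[:]
random.Random(42).shuffle(shuffled)
train_data, test_data = shuffled[:split], shuffled[split:]
print(len(train_data), len(test_data))           # 14 6
```

Seeding is optional; it only helps when you want accuracy figures that can be reproduced exactly from one run to the next.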

## Step 6 — Building and Testing the Model

Finally, you can use the NaiveBayesClassifier class to build the model. Use its .train() method to train the model and the accuracy() function from nltk.classify to test it on the testing data.

In [47]:
classifier = NaiveBayesClassifier.train(train_data) # training

print("Accuracy is:", classify.accuracy(classifier, test_data)) #test set

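# show_most_informative_features() prints the table itself and returns None,
# so wrapping it in print() also emits the trailing "None" in the output
print(classifier.show_most_informative_features(10))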

Accuracy is: 0.9966666666666667
Most Informative Features
                      :) = True           Positi : Negati =    988.4 : 1.0
                     sad = True           Negati : Positi =     23.3 : 1.0
                follower = True           Positi : Negati =     20.6 : 1.0
                  arrive = True           Positi : Negati =     17.4 : 1.0
                     x15 = True           Negati : Positi =     17.0 : 1.0
                     bam = True           Positi : Negati =     17.0 : 1.0
                 welcome = True           Positi : Negati =     14.4 : 1.0
                followed = True           Negati : Positi =     14.2 : 1.0
                    blog = True           Positi : Negati =     13.0 : 1.0
                 awesome = True           Positi : Negati =     12.6 : 1.0
None


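Accuracy is defined as the percentage of tweets in the testing dataset for which the model correctly predicted the sentiment. The run above reached about 99.7% accuracy on the test set, which is high; the exact figure changes from run to run because the data is shuffled randomly before it is split.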
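The definition translates directly into a few lines of code. The sketch below is independent of NLTK: `accuracy` takes any prediction function, and `smiley_rule` is a hypothetical rule-based classifier that exists only to exercise it on four invented examples.

```python
def accuracy(predict_fn, labeled_examples):
    # Fraction of examples whose predicted label equals the gold label.
    correct = sum(1 for features, label in labeled_examples
                  if predict_fn(features) == label)
    return correct / len(labeled_examples)

# Hypothetical rule-based classifier, used only to exercise the function.
def smiley_rule(features):
    return "Positive" if ":)" in features else "Negative"

toy_test = [
    ({":)": True, "thanks": True}, "Positive"),
    ({"sad": True}, "Negative"),
    ({"love": True}, "Positive"),   # no smiley, so the rule gets this one wrong
    ({":(": True}, "Negative"),
]
print(accuracy(smiley_rule, toy_test))   # 0.75
```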

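In the table of most informative features, each row shows the ratio of occurrence of a token in positive and negative tagged tweets in the training dataset. For example, the first row says that the token :) was about 988 times more likely to appear in a Positive tweet than in a Negative one.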
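A rough sketch of how such a ratio can be computed from labeled feature dictionaries follows. The function `presence_ratio`, the smoothing constant of 0.5, and the eight toy tweets are assumptions of this note; the numbers NLTK reports come from its own probability estimates and will not match this simplified calculation exactly.

```python
def presence_ratio(token, featuresets_a, featuresets_b, smoothing=0.5):
    """How much more likely `token` is to be present in class A than in class B."""
    count_a = sum(1 for fs in featuresets_a if token in fs)
    count_b = sum(1 for fs in featuresets_b if token in fs)
    # Smoothed estimates of P(token present | class); smoothing avoids division by zero.
    p_a = (count_a + smoothing) / (len(featuresets_a) + 2 * smoothing)
    p_b = (count_b + smoothing) / (len(featuresets_b) + 2 * smoothing)
    return p_a / p_b

positive = [{":)": True}, {":)": True, "love": True}, {":)": True}, {"blog": True}]
negative = [{"sad": True}, {"sad": True, "ugh": True}, {"love": True}, {"x15": True}]

print(round(presence_ratio(":)", positive, negative), 1))    # 7.0
print(round(presence_ratio("sad", negative, positive), 1))   # 5.0
```

A token that occurs equally often in both classes, such as "love" in this toy data, gets a ratio of 1.0 and is therefore uninformative.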

Next, you can check how the model performs on a custom tweet that is not part of the dataset.

In [49]:
custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."

custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

Negative


## Cleaning Up the Code

All imports should be at the top of the file. Imports from the same library should be grouped together in a single statement.

All functions should be defined after the imports.

All the statements in the file should be housed under an if __name__ == "__main__": condition. This ensures that the statements are not executed if you are importing the functions of the file in another file.
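The layout described above can be sketched on a tiny example. `count_tokens` and `main` are placeholder names chosen for this illustration and do not appear in the script below.

```python
def count_tokens(tokens):
    # Importable helper: safe to reuse from another module.
    return len(tokens)

def main():
    # Script-only behaviour lives here.
    print(count_tokens(["just", "an", "example"]))

if __name__ == "__main__":
    # True when the file is run directly, False when it is imported.
    main()
```

Importing this file from another module makes `count_tokens` available without printing anything, while running it directly prints 3.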

In [50]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import twitter_samples, stopwords
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk import FreqDist, classify, NaiveBayesClassifier

import re, string, random

def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
        token = re.sub(r"(@[A-Za-z0-9_]+)","", token)

        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

if __name__ == "__main__":

    positive_tweets = twitter_samples.strings('positive_tweets.json')
    negative_tweets = twitter_samples.strings('negative_tweets.json')
    text = twitter_samples.strings('tweets.20150430-223406.json')
    tweet_tokens = twitter_samples.tokenized('positive_tweets.json')[0]

    stop_words = stopwords.words('english')

    positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
    negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

    positive_cleaned_tokens_list = []
    negative_cleaned_tokens_list = []

    for tokens in positive_tweet_tokens:
        positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

    for tokens in negative_tweet_tokens:
        negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

    all_pos_words = get_all_words(positive_cleaned_tokens_list)

    freq_dist_pos = FreqDist(all_pos_words)
    print(freq_dist_pos.most_common(10))

    positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
    negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)

    positive_dataset = [(tweet_dict, "Positive")
                         for tweet_dict in positive_tokens_for_model]

    negative_dataset = [(tweet_dict, "Negative")
                         for tweet_dict in negative_tokens_for_model]

    dataset = positive_dataset + negative_dataset

    random.shuffle(dataset)

    train_data = dataset[:7000]
    test_data = dataset[7000:]

    classifier = NaiveBayesClassifier.train(train_data)

    print("Accuracy is:", classify.accuracy(classifier, test_data))

    print(classifier.show_most_informative_features(10))

    custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."

    custom_tokens = remove_noise(word_tokenize(custom_tweet))

    print(custom_tweet, classifier.classify(dict([token, True] for token in custom_tokens)))

[(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253)]
Accuracy is: 0.998
Most Informative Features
                      :( = True           Negati : Positi =   2065.2 : 1.0
                      :) = True           Positi : Negati =   1665.0 : 1.0
                follower = True           Positi : Negati =     36.9 : 1.0
                     sad = True           Negati : Positi =     24.0 : 1.0
                 welcome = True           Positi : Negati =     19.3 : 1.0
                     x15 = True           Negati : Positi =     17.7 : 1.0
               community = True           Positi : Negati =     17.6 : 1.0
                     ugh = True           Negati : Positi =     14.4 : 1.0
                    blog = True           Positi : Negati =     14.3 : 1.0
                     via = True           Positi : Negati =     13.2 : 1.0
None
I ordered just once from TerribleCo, they screwed 