<a href="https://colab.research.google.com/github/mnijhuis-dnb/Artificial_Intelligence_and_Machine_Learning_for_SupTech/blob/main/Tutorials/Tutorial%207%20NLTK%20and%20sentiment%20analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Artificial Intelligence and Machine Learning for SupTech  
Tutorial 7: NLTK and sentiment analysis

*	Constructing a bag of words
*	Classifying sentiments (positive/negative)
*	Example with financial news data

<br/>

15 March 2023  

**Instructors**  
Prof. Iman van Lelyveld (iman.van.lelyveld@vu.nl)<br/>
Dr. Michiel Nijhuis (m.nijhuis@dnb.nl)  

# Preparation

In [None]:
import pandas as pd
import numpy as np
import re

In [None]:
import nltk

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('vader_lexicon')
!pip install twython 

# Load articles from the FT
The file `data/df_articles.csv` contains 436 articles from the Financial Times on the topic of Environmental, Social and Governance (ESG) between March 2021 and 2022.

In [None]:
!gdown 1-4lW68b60PikNWQOfqc5qtriOEHC3_Kq

In [None]:
df_articles = pd.read_csv('/content/df_articles.csv')
df_articles

# Extract tokens from the text
Currently, each article is stored as `date`, `title` and `body`. Our focus will be on `body`, which is stored as one long string of text. See the example below.

In [None]:
df_articles.loc[10, 'title']

In [None]:
df_articles.loc[10, 'body']

This is not particularly useful, as we want to extract meaningful words such as "green", "inflation" or "rise". For this we first have to split the `body` into individual tokens. The regular expression `'\w+'` matches one or more word characters (letters, digits or underscores), so whitespace and punctuation act as separators.

In [None]:
from nltk.tokenize import RegexpTokenizer

# \w+ matches one or more word characters (letters, digits, underscore)
regexp = RegexpTokenizer(r'\w+')

In [None]:
df_articles['body_tokenize'] = df_articles['body'].apply(regexp.tokenize)

In [None]:
df_articles.loc[0, 'body']

In [None]:
df_articles.loc[0, 'body_tokenize']
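
To see what the pattern `\w+` does in isolation, here is a minimal sketch that uses only Python's built-in `re` module. The sample sentence is made up for illustration and is not taken from the FT data. For a simple pattern like this one, `re.findall` should behave the same way as the tokenizer above.

```python
import re

# Hypothetical sample text, not taken from the FT data set.
sample = "ESG funds rose 4.5% in 2021, didn't they?"

# \w+ matches runs of letters, digits and underscores;
# punctuation and whitespace act as separators and are dropped.
tokens = re.findall(r'\w+', sample)
print(tokens)
```

Note that the decimal number and the contraction are split at the punctuation mark, which leaves very short and purely numeric tokens behind. This is one reason to filter such tokens out in a later step.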

# Convert all words to lower case

In [None]:
def convert_to_lowercase(list_of_tokens):
  return [tk.lower() for tk in list_of_tokens]

In [None]:
convert_to_lowercase(['BaBaBa','bbbb','AAAA'])

In [None]:
df_articles['body_tokenize'] = df_articles['body_tokenize'].apply(convert_to_lowercase)

In [None]:
df_articles.loc[10, 'body_tokenize']

# Remove stopwords and short words
This is much better. But many words, the so-called "stopwords", are not informative.

In [None]:
stopwords = nltk.corpus.stopwords.words("english")
stopwords

We want to filter out these words and keep only the meaningful ones. Moreover, short words that are one or two characters long can also be removed, as can tokens that contain digits.

In [None]:
def remove_stopwords(tokens):
    non_stopwords_tokens = []
    for tk in tokens:
        if len(tk) < 3:
            # if the token is less than 3 characters long, jump to next token
            continue

        if any(a.isnumeric() for a in tk):
            # if the token contains a digit, e.g. '34' or '2020a', jump to next token
            continue
            continue

        if tk in stopwords:
            # if the token is a stopword, jump to next token
            continue

        # if no jumps happened, then add the token to the results
        non_stopwords_tokens.append(tk)

    return non_stopwords_tokens
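
The three filters can also be written as a single predicate. The sketch below is a variation, not the notebook's own function. It assumes that only tokens of one or two characters are dropped, as the text states, and it uses a small made-up stopword set (`toy_stopwords`) so that it runs without any NLTK download. Storing stopwords in a `set` rather than a list makes each membership check a hash lookup instead of a scan through the list.

```python
# Hypothetical mini stopword set for illustration; the notebook uses
# nltk.corpus.stopwords.words("english") instead.
toy_stopwords = {'the', 'and', 'only', 'with'}

def keep_token(tk, stopword_set=toy_stopwords, min_len=3):
    """Return True if the token passes all three filters."""
    if len(tk) < min_len:                 # one or two characters: too short
        return False
    if any(ch.isnumeric() for ch in tk):  # contains at least one digit
        return False
    return tk not in stopword_set         # keep only non-stopwords

tokens = ['ab', 'abc', 'only', '2020a', 'england', 'esg']
kept = [tk for tk in tokens if keep_token(tk)]
print(kept)
```

With these assumptions, three-letter tokens such as `esg` survive the filter.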

In [None]:
remove_stopwords([
    'ab',
    'abc',
    'only',
    '2020a',
    'England',
    'ESG',
])

In [None]:
df_articles['body_tokenize_nonstop'] = df_articles['body_tokenize'].apply(remove_stopwords)

In [None]:
df_articles.loc[10, 'body_tokenize_nonstop']

# Stemming and lemmatization

Great, now we need to use stemming or lemmatization to reduce inflected forms of a word to a common base form. For example, `look` is the base form, but we usually see words like `looking`, `looks`, `looked` etc. Stemming strips suffixes using fixed rules, whereas lemmatization goes a step further and also takes the context (such as the part of speech) into account.

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer().lemmatize

In [None]:
lemmatizer('bikes')

In [None]:
lemmatizer('going')
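
The notebook continues with lemmatization. For comparison, here is a minimal stemming sketch that uses NLTK's `PorterStemmer`, applied to the `look` example from the text. The choice of the Porter stemmer is an assumption made for illustration, since the tutorial does not prescribe a particular stemmer.

```python
from nltk.stem import PorterStemmer

# The Porter stemmer is purely rule-based and needs no corpus download.
stemmer = PorterStemmer()

words = ['look', 'looking', 'looks', 'looked']
stems = [stemmer.stem(w) for w in words]
print(stems)
```

All four forms collapse to the same stem. The WordNet lemmatizer, by contrast, treats every word as a noun unless a `pos` argument is passed, so verb forms such as `going` are only reduced when called with `pos='v'`.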

In [None]:
def lemmatize_each_word(words):
    lemmatized_words = []
    for word in words:
        lemmatized = lemmatizer(word)
        lemmatized_words.append(lemmatized)
    return lemmatized_words

In [None]:
lemmatize_each_word(df_articles.loc[10, 'body_tokenize_nonstop'])

In [None]:
df_articles['body_tokenize_nonstop_lemma'] = df_articles['body_tokenize_nonstop'].apply(lemmatize_each_word)

# Assign sentiments 

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer().polarity_scores

In [None]:
analyzer('good')

In [None]:
analyzer('bad')

In [None]:
analyzer('I like coffee')

In [None]:
def get_sentiment(list_of_words):
    words = ' '.join(list_of_words)
    analyzed = analyzer(words)
    return analyzed['compound']

In [None]:
get_sentiment(['I', 'like', 'coffee'])

In [None]:
df_articles['sentiment'] = df_articles['body_tokenize_nonstop_lemma'].apply(get_sentiment)

In [None]:
df_articles['sentiment']
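
The `compound` score is a single number between -1 (most negative) and +1 (most positive). To turn it into a positive/negative label, one can apply a cut-off. The sketch below uses ±0.05, a commonly cited convention for VADER. Both the helper name `label_sentiment` and the threshold value are assumptions and should be tuned for the data at hand.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label.

    The threshold is an assumed cut-off, not a value fixed by NLTK.
    """
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Made-up scores for illustration.
labels = [label_sentiment(score) for score in [0.62, -0.31, 0.01]]
print(labels)
```

On the article data, this could be applied with `df_articles['sentiment'].apply(label_sentiment)`.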

# Plot results

In [None]:
df_articles.head()

We now have the desired results. We would like to plot these over time. So let's extract the `sentiment` column and plot it.

In [None]:
df_articles['sentiment'].plot(figsize=[15,5])

This is very messy. There are several issues here that need to be addressed:
1. The x-axis is wrong, it should be based on dates, not number of entries.
2. The series is very volatile, we are looking for a trend.

Let us first make our problem easier. `df_articles` contains all the data we constructed throughout the notebook. For plotting, however, we only need two columns: (a) the `date` column as the index and (b) the `sentiment` column for the actual time series values.

In [None]:
df_articles.head(5)

## Getting the dates right

In [None]:
sr_sentiment = df_articles.set_index('date')
sr_sentiment = sr_sentiment['sentiment']

In [None]:
sr_sentiment.plot(figsize=[15,5], marker='o')

This looks identical to the previous figure, except that the x-labels are different. If we look closely, however, the dates are not actual dates but text. Let's have a look at the index of `sr_sentiment`.

In [None]:
sr_sentiment.index

The `dtype='object'` means that the index is just a list of string values. We want to have dates.

In [None]:
sr_sentiment.index = pd.to_datetime(sr_sentiment.index)
sr_sentiment.index
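
The effect of the conversion can be checked on a small made-up series. The sketch below assumes ISO-formatted date strings, as in the article data. The extra `sort_index()` step is an addition that is not part of the notebook: once the index holds real dates, sorting puts the observations in chronological order.

```python
import pandas as pd

# Made-up values with date strings in non-chronological order.
toy = pd.Series(
    [0.1, -0.3, 0.4],
    index=['2022-02-04', '2022-02-06', '2022-02-05'],
)

# Convert the text labels to real timestamps, then sort by date.
toy.index = pd.to_datetime(toy.index)
toy = toy.sort_index()
print(toy)
```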

In [None]:
sr_sentiment.plot(figsize=[15,5], marker='o')

## Ensuring a consistent timeseries

What's different? A lot of diagonal lines appear that seem to jump between data points. This is usually a sign of interpolation and missing values. 

Can we see the missing values? Let's have a look at `sr_sentiment`. Notice anything strange?

In [None]:
sr_sentiment

Indeed! There are two data points for `2022-02-26`. We should only have one value for each day. Also, there does not seem to be a value for `2022-02-05`.

Thanks to the index being in date format, we can make use of pandas' powerful datetime functionality. Let's "resample" the dataset and compute the average value for each day. 

In [None]:
sr_sentiment

In [None]:
# for days
sr_sentiment.resample('D').mean()

In [None]:
# for months ('MS' labels each month by its first day)
sr_sentiment.resample('MS').mean()

In [None]:
sr_sentiment = sr_sentiment.resample('D').mean()
sr_sentiment
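
What daily resampling does to duplicate and missing dates is easiest to see on a tiny made-up series. In this sketch, one day has two values and one day has none.

```python
import pandas as pd

# Made-up data: 2022-02-05 is missing, 2022-02-06 appears twice.
idx = pd.to_datetime(['2022-02-04', '2022-02-06', '2022-02-06'])
toy = pd.Series([0.5, 0.2, 0.4], index=idx)

# One row per calendar day: duplicates are averaged,
# days without any observation become NaN.
daily = toy.resample('D').mean()
print(daily)
```

The result has exactly one row per day. The duplicated day holds the mean of its two values, and the gap shows up explicitly as a missing value.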

In [None]:
sr_sentiment.plot(figsize=[15,5], marker='o')

## Dealing with missing values

What should we do with the missing data? We have several options:
1. Fill with zeros.
2. Fill with most recent known value.
3. Fill by interpolating between last known and next known values.

What should we do?

Since no articles were published on the dates with missing values, one could argue that there was nothing worth reporting. Thus, zeros would be a neutral way to deal with the missing values, since a zero denotes neither positive nor negative sentiment.

In [None]:
sr_sentiment = sr_sentiment.fillna(0)
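
The three options listed above can be compared side by side on a made-up series with a two-day gap. This is only an illustration of the pandas calls involved, and says nothing about which option suits a given data set.

```python
import numpy as np
import pandas as pd

# Made-up series with two missing days in the middle.
toy = pd.Series(
    [0.4, np.nan, np.nan, -0.2],
    index=pd.date_range('2022-02-04', periods=4, freq='D'),
)

zeros = toy.fillna(0)       # option 1: treat missing days as neutral
forward = toy.ffill()       # option 2: carry the last known value forward
linear = toy.interpolate()  # option 3: straight line between known values

print(pd.DataFrame({'zeros': zeros, 'ffill': forward, 'linear': linear}))
```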

In [None]:
sr_sentiment.plot(figsize=[15,5], marker='o')

## Aggregating over time

So far, the data points are shown as raw data. That is, each day has a value. However, we are interested in trends, for example the moving average over a week or a month.

In [None]:
sr_sentiment_week = sr_sentiment.resample('W').mean().diff()
sr_sentiment_week.plot(figsize=[15,5], marker='o', title='Weekly average, week-on-week change')

In [None]:
sr_sentiment_week = sr_sentiment.rolling(7).mean()
sr_sentiment_week.plot(figsize=[15,5], marker='o', title='Rolling weekly average, mean')
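
`resample` and `rolling` are easy to confuse. The sketch below uses a made-up two-week series to show the difference: `resample('W')` returns one value per calendar week, while `rolling(7)` returns one value per day, each computed over the seven most recent days.

```python
import pandas as pd

# Made-up daily series of 14 values, starting on a Monday.
days = pd.date_range('2022-01-03', periods=14, freq='D')
toy = pd.Series(range(14), index=days, dtype=float)

# Block average: one row per calendar week (weeks end on Sunday by default).
weekly_blocks = toy.resample('W').mean()

# Moving average: one row per day; the first six days lack a full window.
rolling_week = toy.rolling(7).mean()

print(weekly_blocks)
print(rolling_week)
```

The rolling series starts with missing values until a full window is available. Passing `min_periods` to `rolling` relaxes this requirement.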

In [None]:
sr_sentiment_month = sr_sentiment.rolling(30).mean()
sr_sentiment_month.plot(figsize=[15,5], marker='o', title='Rolling monthly average, mean')

In [None]:
sr_sentiment_month = sr_sentiment.rolling(30).std()
sr_sentiment_month.plot(figsize=[15,5], marker='o', title='Rolling monthly standard deviation')

# Translating word counts into features

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
countvec = CountVectorizer()

In [None]:
X = countvec.fit_transform(df_articles['body_tokenize_nonstop_lemma'].str.join(' '))

In [None]:
df_features = pd.DataFrame(
    X.toarray(), 
    columns=countvec.get_feature_names_out()
)

In [None]:
df_features
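
Each row of `df_features` is a bag of words: one column per vocabulary word, holding the number of times that word occurs in the article. The idea can be sketched without scikit-learn, using `collections.Counter` on made-up token lists. This is a simplified illustration and does not reproduce `CountVectorizer`'s own tokenization and default settings.

```python
from collections import Counter

# Hypothetical, already-cleaned token lists (one list per document).
docs = [
    ['green', 'bond', 'green'],
    ['bond', 'market'],
]

# Vocabulary: every distinct token, sorted alphabetically.
vocab = sorted({tk for doc in docs for tk in doc})

# One row per document, one column per vocabulary word;
# Counter returns 0 for words that do not occur in a document.
matrix = [[Counter(doc)[word] for word in vocab] for doc in docs]

print(vocab)
print(matrix)
```

Word order within a document is discarded, and only the counts remain. This is what makes the representation a "bag" of words.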