# Lecture 5: Exploring Social Media Texts

Many of the examples below are taken from the [NLTK book](http://www.nltk.org/book/). Before we start, we should install the required materials. Run the cell below to install the tools and the corpora. This can take a minute...

In [None]:
!pip install nltk==3.2.4

In [4]:
%matplotlib inline
import nltk
nltk.download('book')

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/kasparbeelen/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/kasparbeelen/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /Users/kasparbeelen/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/kasparbeelen/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /Users/kasparbeelen/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /Users/kasparbeelen/nltk_data...
[nltk_data]    |   Package conll2002 is alread

True


### Introduction to Python's Natural Language Toolkit (NLTK)

In the Digital Humanities, we often treat texts as *raw data*, as input for our programs. Interpretations arise from abstraction, for example, by counting word frequencies, analysing specific segments of a corpus (e.g. KeyWord In Context, or KWIC, analysis) or searching for patterns (e.g. collocations).

NLTK provides several tools for both **processing** data and **interpreting** texts.

Let's inspect the tools NLTK provides us with.

First of all, NLTK helps us with "tokenization": breaking a string into separate "words" (the basic units of a text document, also called tokens).

To apply tokenization, we need to import the `word_tokenize` tool using the `import` syntax below.

In [None]:
from nltk.tokenize import word_tokenize

We can apply this function to any English text: it identifies the tokens and separates words from punctuation. A string is just a sequence of characters; for simple tasks, such as word counting, we first need to split this string into tokens.

Below we first lowercase the string and then tokenize it.

In [None]:
example_sentence = "Downing Street has said it is “extremely concerning” that MPs could attempt to override the government to suspend or delay the article 50 process to leave the EU in their effort to prevent a no-deal Brexit."
print(example_sentence)

In [None]:
example_sentence_lower = example_sentence.lower()
print(example_sentence_lower)

In [None]:
words = word_tokenize(example_sentence_lower)
print(words)

#### Exercise

Select the first sentence of [Alice in Wonderland](http://www.gutenberg.org/cache/epub/28885/pg28885.txt), lowercase it, and then tokenize the resulting string.

In [None]:
# insert code here

After tokenization, we can easily compute the word frequency with the `nltk.FreqDist()` function.

In [None]:
example_words = word_tokenize(example_sentence)
nltk.FreqDist(example_words)

If we'd like to know the frequency of one specific word we put this word between square brackets.

In [None]:
fq = nltk.FreqDist(words)
fq['the']

#### Exercise

What are the frequencies of the words 'to' and 'street'?

In [None]:
# insert code here

`nltk.FreqDist` returns the absolute counts. To compute the **relative frequency**, the word count has to be divided by the total number of tokens.
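
As a small illustration of this calculation, here is a sketch on a made-up sentence. It uses Python's built-in `Counter` (of which `nltk.FreqDist` is a subclass, so the square-bracket lookup works the same way) and `str.split()` as a stand-in for `word_tokenize`. The variable names are only for illustration.

```python
from collections import Counter

# Made-up toy sentence; str.split() stands in for word_tokenize here
toy_tokens = "the cat saw the dog and the bird".split()

toy_counts = Counter(toy_tokens)   # absolute count of every token
total_tokens = len(toy_tokens)     # total number of tokens in the text

# relative frequency = count of the word / total number of tokens
relative_freq = toy_counts["the"] / total_tokens
print(toy_counts["the"], total_tokens, relative_freq)
```

Relative frequencies make texts of different lengths comparable, which absolute counts do not.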

#### Exercise

Can you compute the relative frequency of the word "the" in the example sentence below?

In [None]:
example_sentence = "Downing Street has said it is “extremely concerning” that MPs could attempt to override the government to suspend or delay the article 50 process to leave the EU in their effort to prevent a no-deal Brexit."
# insert code here

#### Exercise

The code below downloads the text of Shakespeare's Romeo and Juliet and saves it in the variable `randj`.
Perform the following steps:
- Lowercase the text;
- Tokenize the lowercased text; save the tokens in a new variable with the name `randj_tokens`.
- How many tokens does Romeo and Juliet contain? How many characters?
- Make a frequency table and compute the relative frequency of the word "love".

In [None]:
import requests
randj = requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt').text

In [None]:
# insert code here

## Text in Context

After tokenizing the text we can apply a bunch of NLTK tools. Below we use the example of Kipling's [Jungle Book](http://www.gutenberg.org/cache/epub/35997/pg35997.txt). 

We first tokenize the book.

In [None]:
import requests
jungle_book = requests.get('http://www.gutenberg.org/cache/epub/35997/pg35997.txt').text
jungle_book_tokens = word_tokenize(jungle_book)

In [None]:
type(jungle_book_tokens)

`word_tokenize` returns a `list` object. Lists are useful for storing information, but not for analysing texts. To allow for more refined text analysis we have to **convert** the list to an NLTK Text object. This type of conversion (from a list to an NLTK Text) is performed by the code below.

In [None]:
jungle_nltk_text = nltk.text.Text(jungle_book_tokens)

In [None]:
type(jungle_nltk_text)

Once we have transformed the text into an NLTK Text object, we can apply several useful tools.

### `.concordance()`

An oft-used technique for distant reading is **Keyword In Context Analysis**, in which we centre a whole corpus on a specific word of interest. NLTK comes with a `concordance()` method that allows you to do just this. Let's analyse how the word "love" is used in the Jungle Book:

In [None]:
jungle_nltk_text.concordance("love")

We can specify the number of hits to print with the `lines` argument.

In [None]:
jungle_nltk_text.concordance("black",lines=35)
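
To get a feeling for what a keyword-in-context listing involves, here is a minimal pure-Python sketch on a made-up sentence. The function `kwic`, its `width` parameter and the case-insensitive matching are assumptions made for this illustration; this is not how NLTK implements `concordance()`.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])    # up to `width` tokens before the hit
            right = " ".join(tokens[i + 1:i + 1 + width])   # up to `width` tokens after the hit
            hits.append((left, token, right))
    return hits

# Made-up toy text; str.split() stands in for word_tokenize
toy_tokens = "I love the jungle and the jungle seems to love me".split()
for left, word, right in kwic(toy_tokens, "love"):
    print(f"{left:>20} | {word} | {right}")
```

Each hit is printed with the keyword in the middle, so the surrounding words line up and can be compared at a glance.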

#### Exercise

How is the word "love" used in Romeo and Juliet? The code below downloads Romeo and Juliet; continue the code.

In [None]:
import requests
randj = requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt').text
# insert code here

### `.similar()`

`concordance()` shows words in their context. For example, in Moby Dick the word "monstrous" occurs in contexts such as "the \_\_\_ pictures" and "a \_\_\_ size". What other words appear in a **similar range of contexts**? We can find out with the `.similar()` method, entering the word you want to analyse within parentheses. Below we ask which words are similar to "forest".

In [None]:
jungle_nltk_text.similar("forest")
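
The idea of "appearing in a similar range of contexts" can be sketched in a few lines of plain Python. The example below is a deliberately simplified illustration, not NLTK's actual implementation: it treats the pair (previous word, next word) as the context and ranks other words by how many contexts they share with the target. The function name `similar_words` and the toy text are made up.

```python
from collections import Counter, defaultdict

def similar_words(tokens, target):
    """Rank words by the number of (left, right) contexts shared with target."""
    contexts = defaultdict(set)
    # Slide a window of three tokens over the text: left neighbour, word, right neighbour
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((left, right))

    target_contexts = contexts.get(target, set())
    shared = Counter()
    for word, word_contexts in contexts.items():
        overlap = len(word_contexts & target_contexts)
        if word != target and overlap > 0:
            shared[word] = overlap
    return shared.most_common()

# Made-up toy text; str.split() stands in for word_tokenize
toy_tokens = "the forest is dark . the jungle is dark . the river is wide .".split()
print(similar_words(toy_tokens, "forest"))
```

In this toy text "jungle" and "river" both occur between "the" and "is", just like "forest", so they are returned as similar words.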

### `.collocations()`

#### Exercise

- Use help() to find out what the `.collocations()` method returns.
- Apply the method to our Jungle Book example.

In [None]:
# insert code

## 1. Research Scenario: How left and right-wing media depict Mueller

The research scenario below applies these techniques to understand how the New York Times depicts Robert S. Mueller. First, we load the data.

Don't worry about the technicality on the last line: this command tells Python to read everything in the "post_message" column as a string (this step prevents our code from crashing at a later stage).

In [None]:
import pandas as pd
nyt = pd.read_csv('https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/lecture_3_data/nytimes.tab',sep='\t')
nyt['post_message'] = nyt['post_message'].astype(str) # this line is a technicality, use it but do not worry if you don't understand it

Before we tokenize the post messages, we have to join them together in one string. `' '.join()` is the opposite of `.split()`. Inspect the examples below to understand how these functions work.

In [None]:
sentence = 'Hello how are you?'
words = sentence.split()
words

In [None]:
print(type(words))

In [None]:
sentence_merged =  ' '.join(words)
sentence_merged

In [None]:
print(type(sentence_merged))

To apply the NLTK tools to Facebook "post_messages" we first select a column with the posts, join all the messages into one long string, then save the result in the `posts` variable. These three steps can be executed in just one line. 

In [None]:
posts = ' '.join(nyt['post_message'])

To understand what happens here, analyse the code below, which uses just the first three post messages.

In [None]:
posts_example = nyt['post_message'][:3]
print(posts_example)

In [None]:
posts_merged_example = ' '.join(posts_example)
print(posts_merged_example)

Once all the post messages are stored as one string, this string is tokenized with `word_tokenize` and subsequently converted to an NLTK Text object.

In [None]:
posts_tokens = word_tokenize(posts)
nltk_posts = nltk.text.Text(posts_tokens)

In [None]:
nltk_posts.concordance("Mueller")

#### Exercise

How does Fox News report about Mueller? Use [these data](https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/lecture_3_data/foxnews.tab).


In [None]:
# insert code here

#### \*\*\* Exercise

In this exercise, we will investigate the text of Facebook posts in relation to the reactions they received. More specifically, we look at what makes different audiences angry or happy when they are confronted with news about Trump.

Use these data
- [New York Times](https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/lecture_3_data/nytimes.tab)
- [Fox News](https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/lecture_3_data/foxnews.tab)

So, again for each of these pages (and each of the emotions):
- Compute the ratio of angry and love reactions (by dividing the number of angry or love reactions by the total number of reactions);
- Select posts with a ratio of love (or angry) reactions higher than 0.3;
- Save the rows that match these conditions in a new DataFrame (for example `df_nyt_angry`);
- Concatenate the post messages into one string, split this string into tokens;
- Convert the resulting list to an NLTK Text object. Use `.concordance()` to investigate what these supposedly angry or happy posts are about;
- Compute the word frequencies. Based on your KWIC analysis, select a few interesting words and compare their relative frequencies  (across emotion or platform);
- Write a small report to document your results.

In [None]:
# insert code here
nyt['ratio_angry'] = nyt['rea_ANGRY'] / nyt['reactions_count_fb']
df_nyt_angry = nyt[nyt['ratio_angry'] > 0.3]
df_nyt_angry

## 2. Research Scenario: Studying changes over time

Besides inspecting the content, we can study changes over time. When are topics salient and when do they disappear? Below we have a look at the comments on the New York Times.

Before looking at trends, the data has to be sorted in chronological order. We use the timestamp of the comment to sort the DataFrame. The sorted DataFrame is saved in a new variable.
> Notice that the `pd.read_csv()` call looks slightly different: we added a `parse_dates` argument, which lists the columns that contain dates as values (Pandas then automatically interprets these values as timestamps).

In [None]:
import pandas as pd
url = "https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/data_nytimes/page_5281959998_2018_12_28_22_00_39_comments.tab"
nyt_comments = pd.read_csv(url,sep='\t',parse_dates=['post_published','comment_published'])
nyt_comments_sorted = nyt_comments.sort_values(by='comment_published')
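
To see the effect of `parse_dates` and `sort_values()` in isolation, here is a small sketch on made-up data. The three rows and the variable names `toy_tsv`, `toy` and `toy_sorted` are invented for this illustration; only the column names follow the real file.

```python
import io
import pandas as pd

# Made-up, tab-separated toy data standing in for the real comments file
toy_tsv = (
    "comment_published\tcomment_message\n"
    "2018-12-28 21:15:00\tthird\n"
    "2018-12-27 09:30:00\tfirst\n"
    "2018-12-28 08:00:00\tsecond\n"
)

# parse_dates tells Pandas to read this column as timestamps instead of text
toy = pd.read_csv(io.StringIO(toy_tsv), sep="\t", parse_dates=["comment_published"])
print(toy["comment_published"].dtype)

# Sorting by the timestamp column puts the comments in chronological order
toy_sorted = toy.sort_values(by="comment_published")
print(toy_sorted["comment_message"].tolist())
```

Note that `sort_values()` returns a new DataFrame, which is why the result is stored in a new variable.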

#### Exercise

Run the code below and inspect the "comment_published" column.

In [None]:
nyt_comments_sorted.head(10)

After sorting the data, we first make sure all values in the "comment_message" column are actually strings.
- Then we, again, join all the sorted comments into one long string. 
- Tokenize this string.
- And convert the list of tokens to a NLTK Text object.

In [None]:
nyt_comments_sorted['comment_message'] = nyt_comments_sorted['comment_message'].astype(str)

In [None]:
import nltk
comments = ' '.join(nyt_comments_sorted['comment_message'])
comments_tokens = word_tokenize(comments)
nltk_comments = nltk.text.Text(comments_tokens)

Once all the comments are stored as an NLTK Text object, we can generate a dispersion plot with `.dispersion_plot()`.

In [None]:
nltk_comments.dispersion_plot(['Merkel','Trump','Mueller'])
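
A dispersion plot is built from word offsets, that is, the positions at which a word occurs in the token list. The sketch below computes these offsets for a made-up token list; exact, case-sensitive matching is an assumption of this illustration.

```python
# Made-up toy text; str.split() stands in for word_tokenize
toy_tokens = "Trump met Merkel . Later Trump spoke . Mueller stayed silent .".split()
targets = ["Merkel", "Trump", "Mueller"]

# For each target word, collect the positions (offsets) at which it occurs
offsets = {
    word: [i for i, token in enumerate(toy_tokens) if token == word]
    for word in targets
}
print(offsets)
```

Because the comments were sorted chronologically before joining, a larger offset corresponds to a later moment in time.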

#### Question

How to interpret this figure?

The dispersion plot tells you where certain words occur, by plotting the offsets, but not how often. To make things easier for you, I created a function `count_words_from_list` that takes a list of words and then computes how often these words appear. Run the cell below to load the function.

In [None]:
from collections import Counter

def count_words_from_list(text,words2count):
    tokens = word_tokenize(text.lower()) # convert string to tokens
    wordfreq = Counter(tokens) # count the tokens, i.e. map tokens to their frequency
    counter = 0
    for w in words2count:
        counter+=wordfreq.get(w,0)
    
    return counter 


To give an example: below we search for three words and compute how often they appear in each comment. We save this information in a new column with the name "comment_about_trump".

In [None]:
words2count = ['president','trump','donald']
nyt_comments_sorted['comment_about_trump'] = nyt_comments_sorted['comment_message'].apply(count_words_from_list,
                                                            words2count=words2count)
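
The call above relies on the fact that `.apply()` passes extra keyword arguments on to the function it applies. The sketch below shows this on three made-up comments. `count_from_list` is a hypothetical, simplified stand-in for `count_words_from_list` that splits on whitespace instead of using `word_tokenize`, so that the example runs without NLTK.

```python
import pandas as pd

def count_from_list(text, words2count):
    # Simplified stand-in: whitespace splitting instead of word_tokenize
    tokens = text.lower().split()
    return sum(tokens.count(word) for word in words2count)

# Three made-up comments
toy_comments = pd.Series([
    "Trump is the president",
    "Nice weather today",
    "Donald Trump again",
])

# The keyword argument words2count is handed to count_from_list for every comment
toy_counts = toy_comments.apply(count_from_list, words2count=["president", "trump", "donald"])
print(toy_counts.tolist())
```

The result has one number per comment, which is why it can be stored directly as a new column.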



In [None]:
nyt_comments_sorted.head()

To plot the results properly, we have to use the timestamp of the comment as the Index of the DataFrame. This is rather technical stuff--do not worry, just run the code.

In [None]:
nyt_comments_sorted.set_index(nyt_comments_sorted["comment_published"],inplace=True)

The figure below plots the results by timestamp. Of course, at such a fine granularity it is difficult to observe trends over time.

In [None]:
%matplotlib inline
nyt_comments_sorted['comment_about_trump'].plot()

Luckily, in Pandas we can easily group our data by hour or day and compute the salience of the Trump topic in the New York Times.

In [None]:
nyt_comments_sorted['comment_about_trump'].resample('h').sum().plot() # sum by hour

In [None]:
nyt_comments_sorted['comment_about_trump'].resample('h').mean().plot() # mean by hour

To view other resampling options (for example, by day instead of by hour), have a look at [this website](http://benalexkeen.com/resampling-time-series-data-with-pandas/).
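
To make the mechanics of resampling concrete, here is a sketch on four made-up observations. The timestamps, values and variable names are invented for this illustration.

```python
import pandas as pd

# Four made-up comments with the number of Trump-related words in each
toy_index = pd.to_datetime([
    "2018-12-28 10:05", "2018-12-28 10:40",
    "2018-12-28 11:15", "2018-12-28 13:30",
])
toy_mentions = pd.Series([1, 0, 2, 1], index=toy_index)

hourly_sum = toy_mentions.resample("h").sum()    # total mentions in each hour
hourly_mean = toy_mentions.resample("h").mean()  # average mentions per comment in each hour
daily_sum = toy_mentions.resample("D").sum()     # total mentions in each day

print(hourly_sum)
print(hourly_mean)
print(daily_sum)
```

The sum depends on how many comments fall into a time slot, whereas the mean does not; hours without any comments get a sum of 0 and a missing value for the mean.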

#### Question

How to interpret these figures and the diverging trends?

In [None]:
# write answer here

#### Exercise

What about Robert Mueller? Plot the saliency of this topic (use words "mueller" and "counsel").

In [None]:
# insert code here

#### \*\*\* Exercise

Use posts from the [New York Times](https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/lecture_3_data/nytimes.tab) and [Fox News](https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/lecture_3_data/foxnews.tab).
For each page:
- Sort the data by the **posts'** Timestamp;
- Select the post message column and `.join()` the posts into one string;
- Tokenize the text;
- Compute word frequencies; compute the relative frequency of the word Trump;
- Make a dispersion plot for 'Kelly','Cohen','Pelosi', and 'Trump'.

In [None]:
# insert code here

# Congratulations on completing Coding the Humanities!

Run the code below to congratulate yourself!

In [1]:
print("🐱"*100)
print("🐶"*100)

🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱🐱
🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶


## 3. Optional Research Scenario: Training an ideological classifier (Advanced Topic)

The part below is optional. It gives code to train an ideology classifier.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score,accuracy_score  

Download the file "cldata.csv" from Canvas. It contains rows with texts and labels: "0" means that the comment was retrieved from the New York Times page, "1" that the comment was written on the Fox News page.

In [None]:
all_data = pd.read_csv('cldata.csv',index_col=0)
all_data.head()

To understand how the data was created, consult the appendix below.

We divide the data into a train and test set. We use 80% for training, 20% for testing.

In [None]:
cut_off = int(all_data.shape[0]*0.8)
cut_off

In [None]:
train,y_train = all_data['comment_message'][:cut_off],all_data['label'][:cut_off]

In [None]:
test,y_test = all_data['comment_message'][cut_off:],all_data['label'][cut_off:]
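
The slicing logic of this split can be sketched on a made-up list of ten documents (the names `toy_texts`, `toy_train` and `toy_test` are invented for this illustration):

```python
# Ten made-up documents
toy_texts = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]

cut_off = int(len(toy_texts) * 0.8)   # index at which the test portion starts
toy_train = toy_texts[:cut_off]       # everything before the cut-off: 80%
toy_test = toy_texts[cut_off:]        # everything from the cut-off onwards: 20%
print(cut_off, len(toy_train), len(toy_test))
```

A split by position only gives a fair test set when the rows are in random order, which is why the data is shuffled in the appendix before it is saved.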

Subsequently we use the CountVectorizer to create a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix). To understand the arguments, consult the [CountVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [None]:
vectorizer = CountVectorizer(ngram_range=(1,3),max_df=0.9,min_df=20) # note: CountVectorizer has no 'norm' argument (that one belongs to TfidfVectorizer)

In [None]:
X_train = vectorizer.fit_transform(train)

In [None]:
X_test = vectorizer.transform(test)
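
The difference between fitting on the training data and only transforming the test data can be illustrated with a hand-made document-term matrix. The sketch below is a strongly simplified illustration in plain Python (single words only, whitespace splitting, no frequency cut-offs), not scikit-learn's implementation; all names and texts are made up.

```python
from collections import Counter

toy_train = ["fake news again", "great news"]
toy_test = ["fake news tonight"]

# "Fit": the vocabulary is built from the training documents only
vocabulary = sorted({word for doc in toy_train for word in doc.split()})

def to_counts(doc):
    # "Transform": one count per vocabulary word; words outside the vocabulary are ignored
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

toy_X_train = [to_counts(doc) for doc in toy_train]
toy_X_test = [to_counts(doc) for doc in toy_test]

print(vocabulary)
print(toy_X_train)
print(toy_X_test)
```

Every row is a document and every column a vocabulary word. The word "tonight" never occurred in the training documents, so it has no column and leaves no trace in the test matrix.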

The code below initializes a classifier model and fits the parameters.

In [None]:
classifier = LinearSVC(C=10,class_weight='balanced')
classifier.fit(X_train,y_train)

After training the model, we calculate labels for the data we separated for testing purposes.

In [None]:
predictions = classifier.predict(X_test)

And compute how accurately the classifier performed on these test examples.

In [None]:
score = accuracy_score(y_test,predictions) # arguments: true labels first, then predicted labels
print(score)
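
Accuracy is simply the share of test examples for which the predicted label equals the true label. A hand-computed sketch with made-up labels:

```python
toy_true = [0, 1, 1, 0, 1]   # made-up true labels
toy_pred = [0, 1, 0, 0, 1]   # made-up predictions

# Count the positions where prediction and true label agree
correct = sum(t == p for t, p in zip(toy_true, toy_pred))
toy_accuracy = correct / len(toy_true)
print(toy_accuracy)
```

Because the appendix builds a data set with equally many comments from both pages, a classifier that guesses at random should score around 0.5, which is a useful baseline when reading the score.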

#### \*\*\* Building an Emotion Classifier (Optional)

### Appendix

Code to create the classifier data from Netvizz data:

In [None]:
import random
import numpy as np
import pandas as pd
from nltk.tokenize import wordpunct_tokenize

def count_tokens(string):
    tokens = wordpunct_tokenize(str(string))
    return len(tokens)

nyt = pd.read_csv('../classifierdata/nyt_comments.tab',sep='\t')
fn = pd.read_csv('../classifierdata/fn_comments.tab',sep='\t')
nyt['comment_message_length'] = nyt['comment_message'].apply(count_tokens)
fn['comment_message_length'] = fn['comment_message'].apply(count_tokens)
nyt_long = nyt[nyt['comment_message_length'] > 10]
fn_long = fn[fn['comment_message_length'] > 10]
fn_long_reduced = fn_long.iloc[:nyt_long.shape[0],:]
all_data = pd.concat([nyt_long,fn_long_reduced])
labels = np.asarray([0]*nyt_long.shape[0] + [1]*fn_long_reduced.shape[0])
assert all_data.shape[0] == labels.shape[0] # sanity check: exactly one label per row
all_data['label'] = labels
order = np.asarray(range(0,all_data.shape[0]))
random.shuffle(order)
all_data = all_data.iloc[order,:]
cldata = all_data.loc[:,['comment_message','label']]
cldata.to_csv('cldata.csv')