# Working with an existing corpus

In this notebook, we will work with an existing corpus, in csv format, to draw information from it and do some basic operations with spaCy.

Go back to the `spacy_intro.ipynb` notebook for an intro to spaCy and for information on how to iterate through tokens and lemmas. 

Here, we will work with larger texts, to either create a corpus or work with one that already exists. In the `data/` directory, there is a file called `gnm_articles.csv` that you'll need. 

Acknowledgements: [Tuomo Hiippala](https://www.mv.helsinki.fi/home/thiippal/), [Programming Historian](https://programminghistorian.org/), [Melanie Walsh](https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html).

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np 
import os
import re
from pprint import pprint
import spacy
nlp = spacy.load('en_core_web_sm')

## Working with a csv file

We will first work with a single file, a part of the [SFU Opinion and Comments Corpus (SOCC)](https://github.com/sfu-discourse-lab/SOCC). The corpus was collected in our lab, the Discourse Processing Lab, for a project on evaluative language in online news comments. It consists of opinion articles, comments, and annotated comments from the Canadian newspaper _The Globe and Mail_. We'll work with the articles, which should be in the data directory. If not, you can always download the corpus directly from the page above or from its [Kaggle page](https://www.kaggle.com/datasets/mtaboada/sfu-opinion-and-comments-corpus-socc) and save the `gnm_articles.csv` file to your data directory. 

We will first read the csv file into a pandas dataframe, `socc_df`. You can find out the size of the dataframe with the pandas attribute `shape` (no parentheses: it is an attribute, not a function), which gives the number of rows and columns. Then, we'll print the first few rows and find out what the headers are. 

In [None]:
# read the csv file into a pandas dataframe

socc_df = pd.read_csv('data/gnm_articles.csv', encoding='utf-8')

In [None]:
# store the number of rows and columns and print them

nRow, nCol = socc_df.shape
print(f'There are {nRow} rows and {nCol} columns')

In [None]:
# print the first 5 rows

socc_df.head(5)

## Information about the dataframe

From the `shape` information, we know that the file has 10,339 rows. That is one article per row, with information about the title of the article, the author, the date of publication and the number of comments it received. The comments are stored in a separate file in the [SOCC corpus](https://github.com/sfu-discourse-lab/SOCC).  

We can examine the pandas dataframe and figure out, for instance, how many comments articles got. `ncomments` is the total number of comments an article received. `ntop_level_comments` is how many of them were beginnings of threads (as opposed to replies). 

You can use `value_counts()` to list characteristics of various columns. Here, we'll do `ncomments` and `author`. I also use `describe()` to give me statistics such as how many values there are, the average, min and max, etc. Note the difference between the output of `describe()` for the comments and for the authors, as the former contains numbers and the latter, strings. 

You can also get the same information in a bar chart, which here I am limiting to the top 10 categories. But you can change that parameter (`value_counts()[:10]`) to get more bars. Note: if the `.plot()` cells don't show you a bar chart, run them again. 

In [None]:
socc_df['ncomments'].value_counts()

In [None]:
socc_df['ncomments'].describe()

In [None]:
socc_df['author'].value_counts()

In [None]:
socc_df['author'].describe()

In [None]:
socc_df['ncomments'].value_counts()[:10].plot(kind='bar')

In [None]:
socc_df['author'].value_counts()[:10].plot(kind='bar')
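
If you want to see the two kinds of `describe()` output side by side without loading the whole corpus, here is a minimal sketch on a made-up dataframe. The column names mirror the ones above, but `toy_df` and all of its values are invented for illustration.

```python
import pandas as pd

# a tiny, made-up dataframe; names and numbers are invented for illustration
toy_df = pd.DataFrame({
    'author': ['A. Writer', 'B. Columnist', 'A. Writer', 'C. Editor'],
    'ncomments': [0, 12, 150, 12],
})

# numeric column: count, mean, std, min, quartiles, max
num_summary = toy_df['ncomments'].describe()
print(num_summary)

# string column: count, unique, top (most frequent value), freq
str_summary = toy_df['author'].describe()
print(str_summary)

# value_counts() lists values from most to least frequent
author_counts = toy_df['author'].value_counts()
print(author_counts)
```

For the string column, `top` is the most frequent value and `freq` is how often it occurs, which is the same information you get from the first row of `value_counts()`.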

## Explore articles with many and with zero comments

We are going to compare articles that have many comments to those that have none. Let's take a random sample of 100 of each and do a few comparisons.

As you saw above, the average number of comments is 64. So let's use that number as the cut-off for articles with many comments. 

We first copy a part of the original data frame into a new one, selecting only rows where the count of ncomments is higher than 64. If you are curious, that gives us 3,179 articles, with a large spread of how many comments each has. We take a random sample of 100 from those and put the value of the `article_text` column into a string, `many_comments`. 

Then, we do exactly the same for articles with 0 comments and put them into a different string, `zero_comments`. There are 2,542 articles with zero comments. 

In [None]:
# get all the articles with >64 into a df, then take a random sample of that frame
# finally, put the article_text column for those into a string

many_comments_df = socc_df[socc_df['ncomments'] > 64]
sample_many_comments_df = many_comments_df.sample(n=100) 
many_comments = ", ".join(sample_many_comments_df['article_text'])

In [None]:
many_comments_df['ncomments'].describe()

In [None]:
many_comments_df['ncomments'].value_counts()

In [None]:
# get all the articles with 0 comments into a df, then take a random sample of that frame
# finally, put the article_text column for those into a string

zero_comments_df = socc_df[socc_df['ncomments'] == 0]
sample_zero_comments_df = zero_comments_df.sample(n=100) 
zero_comments = ", ".join(sample_zero_comments_df['article_text'])

In [None]:
zero_comments_df['ncomments'].describe()
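
Because `sample()` draws a different random subset on every run, your numbers will not match anyone else's, or even your own from an earlier run. If you need a repeatable sample, pandas accepts a `random_state` seed. Here is a minimal sketch on a made-up dataframe; `toy_df`, its values, and the seed 42 are arbitrary choices for illustration, not part of the corpus.

```python
import pandas as pd

# made-up data: 20 "articles" with invented comment counts
toy_df = pd.DataFrame({
    'article_text': [f'text {i}' for i in range(20)],
    'ncomments': [0, 5, 80, 0, 120, 70, 0, 65, 3, 200,
                  0, 90, 0, 64, 300, 0, 75, 1, 0, 66],
})

# same filtering step as above: keep rows with more than 64 comments
many_df = toy_df[toy_df['ncomments'] > 64]

# with the same seed, the two samples contain the same rows
sample_a = many_df.sample(n=3, random_state=42)
sample_b = many_df.sample(n=3, random_state=42)

print(len(many_df))
print(sample_a.index.tolist() == sample_b.index.tolist())
```

Leaving `random_state` out, as in the cells above, is fine for exploration; fixing it is mostly useful when you want to share or re-check results.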

## Data cleanup and analysis

We have two string variables, `many_comments` and `zero_comments` (hint: if you get tired of typing variable names, you can start typing, hit "tab" and it will autocomplete.)

The first thing we are going to do is look at the strings. You can check how long they are (this will be in characters) and print them, to see if they have any code we don't want. You'll see that there are `<p>` and `</p>` tags. Those are html paragraph marks, and we don't need them. So our first task is exciting: We'll define our first function! You'll see the line that starts with `def`. That's a function definition in Python. We use `re`, the regular expression module, to find everything between angle brackets (including the brackets themselves) and replace it with nothing. Then, print the text again to see if the tags are gone.

In [None]:
len(many_comments)

In [None]:
len(zero_comments)

In [None]:
print(many_comments)

In [None]:
print(zero_comments)

In [None]:
# Define an HTML tag removal function

def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

In [None]:
many_clean = remove_html_tags(many_comments)
zero_clean = remove_html_tags(zero_comments)
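
Before looking at the full cleaned strings, it can help to try the function on a short snippet where you can see exactly what changes. The sketch below repeats the function so that it runs on its own; the sample text is invented.

```python
import re

def remove_html_tags(text):
    # non-greedy match: everything from '<' up to the nearest '>'
    clean = re.compile(r'<.*?>')
    return re.sub(clean, '', text)

# a made-up snippet with paragraph and link tags
sample = '<p>First paragraph.</p><p>Second, with a <a href="#">link</a>.</p>'
cleaned = remove_html_tags(sample)
print(cleaned)
```

Notice that the tags are replaced with nothing at all, so two adjacent paragraphs end up with no space between them. If that turns out to matter for your analysis, one option would be to replace tags with a space instead of an empty string.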

In [None]:
print(many_clean)

In [None]:
len(many_clean)

In [None]:
len(zero_clean)

## spaCy

We use spaCy to analyze the text. That'll give us tokens, POS, NER, etc. Go back to the `spacy_intro.ipynb` notebook for details on the kind of information you can get from the spaCy pipeline. I have given you just some examples again here. 

In [None]:
many_doc = nlp(many_clean)

In [None]:
zero_doc = nlp(zero_clean)

In [None]:
# some of the things from the other spacy notebook
# I print only the first 20 tokens in many_doc

for token in many_doc[:20]:
    print(token.text, token.pos_)

In [None]:
# print the entities in the first 50 tokens of zero_doc

for ent in zero_doc[:50].ents:
    print(ent.text, ent.label_)

## Compare many and zero articles

Think about what can make an article have many or no comments. Could it be the topics discussed? The author of the article? The people/places being discussed? We can try and compare these two very small samples using the information that spaCy provides and the information from the articles. 

The first comparison is not linguistic really, but about authors. Check which are the top 4-5 authors in each data sample and if they are significantly different. Note that, because we are taking a random sample from the `many_comments_df` and `zero_comments_df` every time we run this notebook, the results will change every time. You could also do this analysis on the full sample of each type. 

The next analysis compares the number of words and number of sentences in each sample (tokens in the article body). Are there differences? In my case, `many_doc` tends to have more tokens and sentences than `zero_doc`. 

Then, we are going to examine entities. The first analysis creates a dictionary and iterates through the list of entities, to see which entities are more frequent. We do the same for articles with zero comments. These will be long dictionaries, so it may be hard to find any differences. The next thing we do is to count the _types_ of entities, rather than the entities themselves. Do you see any differences? What are the top entities and top entity types in each corpus?

In [None]:
sample_many_comments_df['author'].value_counts()

In [None]:
sample_zero_comments_df['author'].value_counts()

In [None]:
print("Tokens in articles with many comments:", len(many_doc))

In [None]:
print("Sentences in articles with many comments:", len(list(many_doc.sents)))

In [None]:
print("Tokens in articles with zero comments:", len(zero_doc))

In [None]:
print("Sentences in articles with zero comments:", len(list(zero_doc.sents)))

In [None]:
# create a dictionary to store the entities in the many_comments articles

many_ent_dict = {}

# iterate through the list of entities and increase the count every time
# we see the same text

for ent in many_doc.ents:
    many_ent_dict[ent.text] = many_ent_dict.get(ent.text, 0) + 1

# you can print many_ent_dict now, but it's not in sorted order
# instead, create a sorted version of the dictionary and print that

many_ent_dict_sorted = sorted(many_ent_dict.items(), key=lambda item: item[1], reverse=True)

print("Entities in articles with many comments:")
for ent, count in many_ent_dict_sorted:
    print(ent, count)

In [None]:
# do the same for zero_comments

zero_ent_dict = {}

for ent in zero_doc.ents:
    zero_ent_dict[ent.text] = zero_ent_dict.get(ent.text, 0) + 1

zero_ent_dict_sorted = sorted(zero_ent_dict.items(), key=lambda item: item[1], reverse=True)

print("Entities in articles with zero comments:")
for ent, count in zero_ent_dict_sorted:
    print(ent, count)

In [None]:
# now, instead of the entities themselves, we are going to compare entity types
many_ent_types = {}

for ent in many_doc.ents:
    many_ent_types[ent.label_] = many_ent_types.get(ent.label_, 0) + 1

print("Entity types in articles with many comments:")
for ent, count in many_ent_types.items():
    print(ent, count)

In [None]:
# same for articles with zero comments
zero_ent_types = {}

for ent in zero_doc.ents:
    zero_ent_types[ent.label_] = zero_ent_types.get(ent.label_, 0) + 1

print("Entity types in articles with zero comments:")
for ent, count in zero_ent_types.items():
    print(ent, count)
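
The counting pattern used in the cells above (`dict.get(key, 0) + 1`, followed by `sorted()`) is also available ready-made in Python's standard library as `collections.Counter`. Here is a minimal sketch of that alternative. To keep it runnable without spaCy, it uses a hand-written list of (text, label) pairs in place of `doc.ents`; the entities in `toy_ents` are invented.

```python
from collections import Counter

# stand-in for doc.ents: made-up (entity text, entity type) pairs
toy_ents = [
    ('Canada', 'GPE'), ('Justin Trudeau', 'PERSON'), ('Canada', 'GPE'),
    ('Ottawa', 'GPE'), ('2015', 'DATE'), ('Canada', 'GPE'),
]

# count entity texts and entity types in one pass each
ent_counts = Counter(text for text, label in toy_ents)
type_counts = Counter(label for text, label in toy_ents)

# most_common() returns (item, count) pairs, most frequent first
for ent, count in ent_counts.most_common(3):
    print(ent, count)

for label, count in type_counts.most_common():
    print(label, count)
```

With a real doc, the equivalent would be something like `Counter(ent.text for ent in many_doc.ents)`. The dictionary version above does the same job and makes each step visible, which is why it is a good place to start.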

## Comparing top words and top POS

What are the most frequent words, most frequent nouns, verbs, etc? Most frequent POS? To do this, it's best to work with the lemmatized versions, because then we can count _think, thought,_ and _thinking_ as instances of the same lemma, _think_. 
We will also use another tool, a list of [stop words](https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html), so that we don't count frequent function words such as _the, and, of_. There are many such lists for English, depending on what you are interested in counting. The full list from spaCy is available on [spaCy's GitHub](https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py). We also exclude anything that is not a word, i.e., punctuation and numbers. These are the steps in the for loop to count lemmas:

* First, check if it's an alpha type of token
* Then, check that it is not a stop word
* Lowercase the lemma
* Add to the count of lemmas

The next two code blocks count noun chunks, i.e., noun phrases. 

Then, we are going to work with POS tags. I'll do those in a separate section, because I also want to include plots and functions there. 

In [None]:
many_lemma_counts = {}
many_lemma_total = 0

for token in many_doc:
    if token.is_alpha and not token.is_stop:
        lemma = token.lemma_.lower()
        many_lemma_counts[lemma] = many_lemma_counts.get(lemma, 0) + 1
        many_lemma_total = many_lemma_total + 1
        
many_lemma_counts_sorted = sorted(many_lemma_counts.items(), key=lambda item: item[1], reverse=True)

print("Total lemma count: ", many_lemma_total)
print("Lemmas in the many comments:")
for lemma, count in many_lemma_counts_sorted:
    print(lemma, count)

In [None]:
# same for zero comment articles

zero_lemma_counts = {}
zero_lemma_total = 0

for token in zero_doc:
    if token.is_alpha and not token.is_stop:
        lemma = token.lemma_.lower()
        zero_lemma_counts[lemma] = zero_lemma_counts.get(lemma, 0) + 1
        zero_lemma_total = zero_lemma_total + 1
        
zero_lemma_counts_sorted = sorted(zero_lemma_counts.items(), key=lambda item: item[1], reverse=True)
        
print("Total lemmas: ", zero_lemma_total)

print("Lemmas in the zero comments:")
for lemma, count in zero_lemma_counts_sorted:
    print(lemma, count)

In [None]:
# count noun chunks in many comment articles

many_noun_chunk_counts = {}

for noun_chunk in many_doc.noun_chunks:
    noun_chunk_text = noun_chunk.text
    many_noun_chunk_counts[noun_chunk_text] = many_noun_chunk_counts.get(noun_chunk_text, 0) + 1
        
many_noun_chunk_counts_sorted = sorted(many_noun_chunk_counts.items(), key=lambda item: item[1], reverse=True)
        
print("Noun chunk counts in the many comments:")
for noun_chunk, count in many_noun_chunk_counts_sorted:
    print(noun_chunk, count)

In [None]:
# count noun chunks in zero comment articles

zero_noun_chunk_counts = {}

for noun_chunk in zero_doc.noun_chunks:
    noun_chunk_text = noun_chunk.text
    zero_noun_chunk_counts[noun_chunk_text] = zero_noun_chunk_counts.get(noun_chunk_text, 0) + 1
        
zero_noun_chunk_counts_sorted = sorted(zero_noun_chunk_counts.items(), key=lambda item: item[1], reverse=True)
        
print("Noun chunk counts in the zero comments:")
for noun_chunk, count in zero_noun_chunk_counts_sorted:
    print(noun_chunk, count)

## POS and plotting

For POS tags, notice that we don't need to lowercase or lemmatize, as POS tags are already a sort of abstraction over those word-specific characteristics. 

We are also going to keep track of the total POS tags in each corpus. That will allow us to produce _normalized counts_ of the parts of speech. If we want to compare whether the many comments or the zero comments articles have, for instance, more nouns, we can't compare raw counts, because many articles may have more words, and therefore more POS instances (or vice versa). So we divide the count of each POS by the total number of POS in the corpus. 

There are many bases for normalization. A frequency normalized per 100 tokens is just a percentage. For a very large corpus (millions of words), it is common to normalize per 1,000 tokens or more instead. We'll do 100 here. 
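
As a small worked example of the arithmetic, here is a sketch with invented numbers: two corpora of different sizes whose raw noun counts look very different, but whose normalized frequencies turn out to be the same. All variable names and values are made up for illustration.

```python
# made-up raw counts: corpus A is larger than corpus B
nouns_a, total_a = 300, 500
nouns_b, total_b = 120, 200

# raw counts suggest A has far more nouns...
print(nouns_a, nouns_b)

# ...but per 100 tokens (a percentage) the two are identical
per_100_a = round(nouns_a / total_a * 100, 2)
per_100_b = round(nouns_b / total_b * 100, 2)
print(per_100_a, per_100_b)

# the same proportion expressed per 1,000 tokens
per_1000_a = round(nouns_a / total_a * 1000, 2)
print(per_1000_a)
```

The base only changes the scale of the numbers, not the comparison, as long as you use the same base for both corpora.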

Then, we can produce a pretty plot that compares the two corpora and the relative frequency of each type of POS. 

In [None]:
# count parts of speech in many and also keep track of the total

many_pos_counts = {}
many_pos_total = 0

for token in many_doc:
    pos_tag = token.pos_
    many_pos_counts[pos_tag] = many_pos_counts.get(pos_tag, 0) + 1
    many_pos_total = many_pos_total + 1
        
many_pos_counts_sorted = sorted(many_pos_counts.items(), key=lambda item: item[1], reverse=True)
        
print("Total POS: ", many_pos_total)
print("POS counts in the many comments:")
for pos, count in many_pos_counts_sorted:
    print(pos, count)

In [None]:
# count parts of speech in zero

zero_pos_counts = {}
zero_pos_total = 0

for token in zero_doc:
    pos_tag = token.pos_
    zero_pos_counts[pos_tag] = zero_pos_counts.get(pos_tag, 0) + 1
    zero_pos_total = zero_pos_total + 1
        
zero_pos_counts_sorted = sorted(zero_pos_counts.items(), key=lambda item: item[1], reverse=True)

print("Total POS: ", zero_pos_total)
print("POS counts in the zero comments:")
for pos, count in zero_pos_counts_sorted:
    print(pos, count)

Now we normalize both dictionaries. I normalize per 100 and use the `round()` function to keep the percentage to two decimal places. I also sort the resulting dictionary by order of frequency, in descending order. 

In [None]:
many_pos_norm = {}
zero_pos_norm = {}

for pos, count in many_pos_counts.items():
    many_pos_norm[pos] = round((count / many_pos_total) * 100, 2)
    
many_pos_norm_sorted = dict(sorted(many_pos_norm.items(), key=lambda item: item[1], reverse=True))
    
for pos, count in zero_pos_counts.items():
    zero_pos_norm[pos] = round((count / zero_pos_total) * 100, 2)
    
zero_pos_norm_sorted = dict(sorted(zero_pos_norm.items(), key=lambda item: item[1], reverse=True))

In [None]:
# just look at the percentages in each

for pos, perc in many_pos_norm_sorted.items():
    print(pos, perc)

In [None]:
for pos, perc in zero_pos_norm_sorted.items():
    print(pos, perc)

In [None]:
# plotting the two dictionaries, side by side
# use the POS tags from many_doc (already sorted by frequency) for the x axis
pos_list = list(many_pos_norm_sorted.keys())

# look up each tag in both dictionaries; a tag missing from one gets 0
many_freq = [many_pos_norm_sorted.get(pos, 0) for pos in pos_list]
zero_freq = [zero_pos_norm_sorted.get(pos, 0) for pos in pos_list]

plt.figure(figsize=(10, 6))
bar_width = 0.4
index = np.arange(len(pos_list))

# draw the two sets of bars next to each other
plt.bar(index, many_freq, bar_width, color='blue', label='Many comments')
plt.bar(index + bar_width, zero_freq, bar_width, color='brown', label='Zero comments')

plt.xlabel('POS Tag')
plt.ylabel('Frequency')
plt.title('POS frequency in articles with many and with zero comments')
plt.xticks(index + bar_width / 2, pos_list, rotation=45)
plt.legend()
plt.show()

## Defining functions

A lot of what we have done above we have done for two different things, the many and the zero comment articles. That felt tedious, because we were just copy-pasting the same information, only changing the name of some variables. For that kind of situation, creating a generic function is much more efficient. 

Let's look at the `count_pos()` function. We define it using `def`. Then, we give it an argument, the doc (spaCy object) that contains our corpus processed by spaCy. 

Then, we do exactly the same thing we did above: we count the POS tags, we add up how many there are and sort that dictionary. At the end, the function returns the sorted counts as a dictionary. (It also keeps a running total in `pos_total`, although this version does not return it.)

To use the function, we simply call it with the right doc object. So, if we want to count the POS in the articles with many comments, we pass `many_doc` as the argument. And the same for zero comments, with `zero_doc`. 

In [None]:
def count_pos(doc):
    
    pos_counts = {}
    pos_total = 0
    
    for token in doc:
        pos_tag = token.pos_
        pos_counts[pos_tag] = pos_counts.get(pos_tag, 0) + 1
        pos_total = pos_total + 1
        
    pos_counts_sorted = dict(sorted(pos_counts.items(), key = lambda item: item[1], reverse = True))
    
    return pos_counts_sorted

To "call" this function, that is, to use this function, you simply type the name of the function and pass the variable that contains your corpus. Recall that this is `many_doc` and `zero_doc`, from above, the two docs that we produced with spaCy.

In [None]:
count_pos(many_doc)

Now, that only displays the dictionary from `many_doc`. If you want to store that dictionary to use later, then you assign it to a variable. This variable holds the same counts as `many_pos_counts_sorted` above (as a dictionary rather than a list of pairs), but we produced it in a much more efficient way; that's why I'm calling it "better". And you can do the same for the `zero_doc`. 

In [None]:
many_pos_counts_sorted_better = count_pos(many_doc)

In [None]:
zero_pos_counts_sorted_better = count_pos(zero_doc)

In [None]:
many_pos_counts_sorted_better

In [None]:
zero_pos_counts_sorted_better

# Your turn!

Try it! Define a new function that doesn't do raw counts, but normalized counts, something like:

`def pos_count_norm(doc):`

Then you can probably turn most of the things we did twice, for many and for zero, into functions that you can reuse. 