<h2>CS918 Assignment 1</h2>
<h4>Faraz Taheri (1534783)</h4>

<h2>Introduction</h2>
<p>In this assignment, the news articles in the SIGNAL NEWS1 corpus are processed to carry out several natural language processing exercises.</p>
<p><b>***The <i>signal-news1</i> folder must be located in the same directory as this file for the program to execute successfully.</b></p>
<p>To run this program, the following libraries have to be imported. Some of the functions are implemented using the NLTK library, which has to be installed separately.</p>

In [1]:
import json
import re
from collections import defaultdict
import operator
import math
import nltk
from nltk import trigrams
nltk.data.path.append('/modules/cs918/nltk_data/')

<p>These global variables are used across different sections of the program.</p>

In [2]:
articles = []
lemmatised = []  # All of lemmatised words
lemmatised_first_16000 = []  # Lemmatised words for the first 16000 articles
lemmatised_after_16000 = []  # Lemmatised words for the rest of the articles
pos_words = {}  # Positive words for sentiment analysis
neg_words = {}  # Negative words for sentiment analysis

<h2>Part A: Text preprocessing</h2>
<h3>Data Sanitization</h3>

<p>The corpus used in this exercise is a file containing one JSON object per line. It has to be processed and sanitized. Each JSON object has multiple attributes such as <i>id, title, source, etc.</i>, but only the <i>content</i> field is used here.</p>
<p>The text is lowercased, and regular expressions then remove URLs, non-alphanumeric characters, single-character words, and standalone numbers.</p>
<p>Articles are stored in a list; each article is a dictionary that holds only the content field. Other fields of the news file could be stored in the same way.</p>

In [3]:
with open('signal-news1/signal-news1.jsonl', 'r') as f:
    for line in f:
        tmp_article = {'content': json.loads(line)['content']}
        tmp_article['content'] = tmp_article['content'].lower()
        # Remove URLs
        tmp_article['content'] = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', tmp_article['content'], flags=re.MULTILINE)
        # Remove non-alphanumeric characters except whitespace and colons
        tmp_article['content'] = re.sub(r'[^a-zA-Z\d\s:]', '', tmp_article['content'], flags=re.MULTILINE)
        # Remove single characters
        tmp_article['content'] = re.sub(r'\b(\w)\b', '', tmp_article['content'], flags=re.MULTILINE)
        # Remove single numbers
        tmp_article['content'] = re.sub(r'\b\d+\b', '', tmp_article['content'], flags=re.MULTILINE)
        articles.append(tmp_article)
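
<p>To make the effect of each pattern easier to see, the sketch below applies the same steps to a single made-up string. The <i>sanitise</i> helper and the sample text are illustrative assumptions and are not part of the assignment code.</p>

```python
import re


def sanitise(text):
    """Apply the cleaning steps from the loop above to one string (illustrative helper)."""
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    # Remove everything except letters, digits, whitespace and colons
    text = re.sub(r'[^a-zA-Z\d\s:]', '', text)
    # Remove single-character words (this also removes single digits)
    text = re.sub(r'\b(\w)\b', '', text)
    # Remove tokens that consist only of digits
    text = re.sub(r'\b\d+\b', '', text)
    return text


# Made-up sample text, not taken from the corpus
sample = "Visit https://example.com/news now! A 5% rise in 2015, Q3 results."
print(sanitise(sample).split())
# ['visit', 'now', 'rise', 'in', 'q3', 'results']
```

<p>Note that a mixed token such as <i>q3</i> survives, because the number pattern only matches tokens made up entirely of digits.</p>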

<h3>Lemmatisation</h3>
<p>The NLTK package is used for lemmatising the text, via the <i>WordNetLemmatizer</i> class. No POS tag is passed, so the lemmatiser's default (noun) is used.</p>
<p>For each article, all the words are lemmatised and stored as a dictionary in the <i>'lemmatised'</i> field of that article. The lemmatised words are used as keys, and their frequencies are the values.</p><p>There is also a global list of lemmatised words, consisting of the words in all the articles (in the first 16000 articles and the rest).</p>

In [4]:
articles_count = 0
lemmatiser = nltk.stem.WordNetLemmatizer()  # Create the lemmatiser once and reuse it
for article in articles:
    words = article['content'].split()
    article['lemmatised'] = {}
    articles_count += 1
    for word in words:
        lemm = lemmatiser.lemmatize(word)  # Lemmatise the word
        if articles_count <= 16000:
            lemmatised_first_16000.append(lemm)
        else:
            lemmatised_after_16000.append(lemm)
        if lemm not in article['lemmatised']:
            article['lemmatised'][lemm] = 1
        else:
            article['lemmatised'][lemm] += 1
lemmatised = lemmatised_first_16000 + lemmatised_after_16000
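
<p>The per-article frequency dictionary could also be built with <i>collections.Counter</i>. The sketch below is only an illustration: <i>toy_lemmatise</i> and its lookup table are hypothetical stand-ins for the WordNet lemmatiser, so that the example runs without the NLTK data.</p>

```python
from collections import Counter

# Hypothetical stand-in for a lemmatiser: a tiny lookup table, not WordNet
TOY_LEMMAS = {'shares': 'share', 'reports': 'report'}


def toy_lemmatise(word):
    """Return the toy lemma of a word, or the word itself if it is unknown."""
    return TOY_LEMMAS.get(word, word)


words = 'shares rose as reports on shares spread'.split()
# Counter builds the lemma -> frequency mapping in one pass
lemma_counts = Counter(toy_lemmatise(w) for w in words)
print(lemma_counts['share'])   # 2
print(lemma_counts['report'])  # 1
```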

<h2>Part B: N-grams</h2>
<h3>Number of tokens and vocabulary size</h3>
<p>To calculate the number of tokens (N), the size of the lemmatised words list is used.</p>

In [5]:
print('Number of Tokens (N): ', len(lemmatised))

Number of Tokens (N):  5701427


<p>To calculate the vocabulary size (V), the size of unique lemmatised words is used. The set function creates an unordered collection of unique elements in the lemmatised words list.</p>

In [6]:
print('Vocabulary Size (V): ', len(set(lemmatised)))

Vocabulary Size (V):  128800
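
<p>The difference between the two numbers can be seen on a small made-up token list: N counts every occurrence, while V counts each distinct word once. The list below is an illustrative assumption.</p>

```python
# Made-up token list, not taken from the corpus
tokens = ['the', 'share', 'of', 'the', 'company', 'share']

n_tokens = len(tokens)         # every occurrence is counted
vocab_size = len(set(tokens))  # each distinct word is counted once

print(n_tokens)    # 6
print(vocab_size)  # 4
```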


<h3>Top 25 trigrams based on the number of occurrences</h3>
<p>Using NLTK's <i>trigrams</i> function (equivalent to <i>ngrams</i> with <i>n = 3</i>), a list of trigrams in the text is generated. Their frequencies are counted and sorted, and the top 25 are selected. This could alternatively be achieved with NLTK's <i>FreqDist()</i> and its <i>most_common(25)</i> method.</p>

In [7]:
# Calculate top 25 trigrams
tri = trigrams(lemmatised)
#fd = nltk.FreqDist(tri)
#top = fd.most_common(25)
dist = {}
for g in tri:
    if g in dist:
        dist[g] += 1
    else:
        dist[g] = 1
top25 = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:25]  # Sort trigrams and pick top 25
print('Top 25 trigrams: ', [g[0] for g in top25])  # Select the key (Trigram tuple). The frequency is g[1]

Top 25 trigrams:  [('one', 'of', 'the'), ('on', 'share', 'of'), ('on', 'the', 'stock'), ('a', 'well', 'a'), ('in', 'research', 'report'), ('in', 'research', 'note'), ('the', 'united', 'state'), ('for', 'the', 'quarter'), ('average', 'price', 'of'), ('research', 'report', 'on'), ('research', 'note', 'on'), ('share', 'of', 'the'), ('the', 'end', 'of'), ('in', 'report', 'on'), ('earnings', 'per', 'share'), ('cell', 'phone', 'plan'), ('phone', 'plan', 'detail'), ('according', 'to', 'the'), ('of', 'the', 'company'), ('buy', 'rating', 'to'), ('appeared', 'first', 'on'), ('moving', 'average', 'price'), ('day', 'moving', 'average'), ('price', 'target', 'on'), ('part', 'of', 'the')]
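
<p>The same counting can be sketched with the standard library alone. The example below is an illustration on a made-up sentence; <i>count_trigrams</i> is a hypothetical helper, and it produces unpadded trigrams by zipping the token list with two shifted copies of itself.</p>

```python
from collections import Counter


def count_trigrams(tokens):
    """Count every run of three consecutive tokens (no padding)."""
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


# Made-up token sequence, not taken from the corpus
tokens = 'one of the best one of the few'.split()
counts = count_trigrams(tokens)
print(counts.most_common(1))
# [(('one', 'of', 'the'), 2)]
```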


<h3>Sentiment analysis</h3>
<p>The corpus includes a list of positive and negative words. These lists are loaded from a file and are used in order to count the number of positive and negative words in a given text.</p>

In [8]:
# Load positive and negative words
with open('signal-news1/opinion-lexicon-English/positive-words.txt') as f:
    for line in f:
        pos_words[line.strip()] = 1

with open('signal-news1/opinion-lexicon-English/negative-words.txt') as f:
    for line in f:
        neg_words[line.strip()] = -1


# Count positive words in a lemmatised word set
def count_pos_words(words_set):
    num = 0
    for w in words_set:
        if w in pos_words:
            num += words_set[w]
    return num


# Count negative words in a lemmatised word set 
def count_neg_words(words_set):
    num = 0
    for w in words_set:
        if w in neg_words:
            num += words_set[w]
    return num

<p>For each article, the number of positive and negative lemmatised words is counted and based on the results, the positivity or negativity of each article is determined. The sum of positive and negative words throughout the whole document is also calculated.</p>

In [9]:
total_pos_articles = 0
total_neg_articles = 0
total_pos_words = 0
total_neg_words = 0

for article in articles:
    num_pos_words = count_pos_words(article['lemmatised'])
    num_neg_words = count_neg_words(article['lemmatised'])
    total_pos_words += num_pos_words
    total_neg_words += num_neg_words

    if num_pos_words > num_neg_words:
        total_pos_articles += 1
    elif num_pos_words < num_neg_words:
        total_neg_articles += 1

print('Number of positive words: ', total_pos_words)
print('Number of negative words: ', total_neg_words)
print('Number of positive articles: ', total_pos_articles)
print('Number of negative articles: ', total_neg_articles)

Number of positive words:  170251
Number of negative words:  129278
Number of positive articles:  10816
Number of negative articles:  6385
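
<p>The decision rule for a single article can be illustrated as follows. The two mini-lexicons and the <i>classify</i> helper are hypothetical; the real word lists come from the opinion lexicon files. As in the code above, an article with equal counts is assigned to neither class.</p>

```python
# Hypothetical mini-lexicons; the real lists are loaded from the lexicon files
toy_pos = {'good', 'gain'}
toy_neg = {'bad', 'loss'}


def classify(lemma_counts):
    """Label one article from its lemma -> frequency dictionary."""
    pos = sum(n for w, n in lemma_counts.items() if w in toy_pos)
    neg = sum(n for w, n in lemma_counts.items() if w in toy_neg)
    if pos > neg:
        return 'positive'
    if pos < neg:
        return 'negative'
    return 'tie'  # counted as neither positive nor negative


print(classify({'good': 2, 'loss': 1, 'market': 5}))  # positive
print(classify({'gain': 1, 'loss': 1}))               # tie
```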


<h2>Part C: Language models</h2>

<h3>Building a language model</h3>
<p>To generate a 10-word sentence, a language model must be built. Using NLTK's trigrams function, a model is generated based on the first 16000 articles. <br>The padding parameters add <i>None</i> placeholders so that trigrams are also produced for the start and end of the token sequence.<br>The <i>defaultdict</i> class from the <i>collections</i> module is a type of dictionary that creates an entry with a default value when a missing key is accessed.</p><p>The default value of 0.01 is used because <i>log(0)</i> is undefined when calculating perplexity; it also provides slight smoothing.</p>

In [10]:
# Build the trigram model used to generate a 10 word sentence
model = defaultdict(lambda: defaultdict(lambda: 0.01))  # For smoothing. Also 0 cannot be used with math.log
first_16000_trigrams = trigrams(lemmatised_first_16000, pad_right=True, pad_left=True)

for w1, w2, w3 in first_16000_trigrams:
    model[(w1, w2)][w3] += 1  # Count the appearance of the word

<p>For each pair of words, the probability of a following third word appearing must be calculated.</p>

In [11]:
for pair in model:
    total_count = float(sum(model[pair].values()))
    for w3 in model[pair]:
        model[pair][w3] /= total_count  # Count the probability of the word appearing
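
<p>A small worked example may help to show what the two steps above produce. The sketch below repeats them on a made-up token list without NLTK; <i>build_model</i> is a hypothetical helper, and the manual <i>None</i> padding is assumed to mirror the <i>pad_left</i> and <i>pad_right</i> options.</p>

```python
from collections import defaultdict


def build_model(tokens, default=0.01):
    """Count trigrams, then normalise the counts for each word pair."""
    padded = [None, None] + tokens + [None, None]  # assumed equivalent of the padding options
    model = defaultdict(lambda: defaultdict(lambda: default))
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        model[(w1, w2)][w3] += 1  # first occurrence gives default + 1
    for pair in model:
        total = sum(model[pair].values())
        for w3 in model[pair]:
            model[pair][w3] /= total  # convert the count to a probability
    return model


# Made-up tokens: ('a', 'b') is followed by 'c' twice and by 'd' once
toy = build_model('a b c a b c a b d'.split())
print(toy[('a', 'b')]['c'])  # 2.01 / 3.02, roughly 0.666
print(toy[('a', 'b')]['d'])  # 1.01 / 3.02, roughly 0.334
```

<p>Because every seen word starts from the default of 0.01, the stored counts are 2.01 and 1.01 instead of 2 and 1, which is why the probabilities are close to, but not exactly, 2/3 and 1/3.</p>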

<h3>Generating a 10-word sentence</h3>
<p>The next step is to generate a 10-word sentence, with the first 2 words provided. At each step, the word with the highest probability given the previous 2 words is chosen as the next word.</p>

In [12]:
sentence = ["is", "this"]  # First 2 words of the sentence
while len(sentence) < 10:
    words = model.get(tuple(sentence[-2:]))  # Get the trigrams starting with the last pair of words
    word = max(words.items(), key=operator.itemgetter(1))[0]  # Get the next word with maximum probability
    sentence.append(word)
print('Generated 10 word sentence:', ' '.join([w for w in sentence if w]))

Generated 10 word sentence: is this going to be the first time in the
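
<p>The greedy selection can be traced on a hand-written model. In the sketch below, <i>toy_model</i> and its probabilities are made up, and <i>generate</i> is a hypothetical helper that additionally stops when the last two words were never seen as a pair, a case the loop above does not handle.</p>

```python
import operator

# Hypothetical toy model: (w1, w2) -> {w3: probability}
toy_model = {
    ('is', 'this'): {'going': 0.6, 'the': 0.4},
    ('this', 'going'): {'to': 1.0},
    ('going', 'to'): {'be': 0.7, 'work': 0.3},
}


def generate(model, start, length):
    """Greedily extend `start` until `length` words or an unseen word pair."""
    sentence = list(start)
    while len(sentence) < length:
        candidates = model.get(tuple(sentence[-2:]))
        if not candidates:  # unseen context: stop instead of raising an error
            break
        # Pick the continuation with the highest probability
        word = max(candidates.items(), key=operator.itemgetter(1))[0]
        sentence.append(word)
    return sentence


print(' '.join(generate(toy_model, ['is', 'this'], 10)))
# is this going to be
```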


<h3>Calculating the perplexity</h3>
<p>To calculate the perplexity, the trigram model from the previous section is evaluated on the articles after the first 16000. The code below calculates the perplexity using the chain rule for trigrams.</p>

In [13]:
# Calculate the perplexity
P_log = float(0)  # Logs of probabilities
N = 0  # Count
for w1, w2, w3 in trigrams(lemmatised_after_16000, pad_left=True, pad_right=True):
    N += 1
    P_log += math.log2(model[(w1, w2)][w3])

perplexity = pow(2, -P_log/N)
print("Perplexity: ", perplexity)

Perplexity:  42.77968412645628
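
<p>The calculation above corresponds to perplexity = 2 raised to the power of minus the average base-2 log probability of the test trigrams. A minimal sketch on made-up probabilities is given below; the <i>perplexity</i> helper and the input values are illustrative assumptions.</p>

```python
import math


def perplexity(probabilities):
    """2 ** (-(1/N) * sum of log2 p) over the given trigram probabilities."""
    log_sum = sum(math.log2(p) for p in probabilities)
    return 2 ** (-log_sum / len(probabilities))


# Made-up probabilities: log2 values are -1, -2 and -2, so the average is -5/3
print(perplexity([0.5, 0.25, 0.25]))  # 2 ** (5/3), roughly 3.17

# If every trigram has probability 1/4, the perplexity is 4
print(perplexity([0.25] * 4))
```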


<p>This is the end of the document.</p>