<h2>CS918 Assignment 1</h2>
<h4>Faraz Taheri (1534783)</h4>

<p>In this assignment, the news articles in the SIGNAL NEWS1 corpus are processed to perform some natural language processing exercises.</p>
<p>To run this program, the following libraries have to be imported. Some of the functions are implemented using the NLTK library, which has to be installed separately.</p>
<b>***The <i>signal-news1</i> directory must be located in the same directory as this file for the program to run.</b>

In [1]:
import json
import re
import nltk
from nltk import trigrams
from collections import defaultdict
import operator
import math

<p>These global variables will be used across different sections of the program.</p>

In [2]:
articles = []
lemmatised = []  # All of lemmatised words
lemmatised_first_16000 = []  # Lemmatised words for the first 16000 articles
lemmatised_after_16000 = []  # Lemmatised words for the rest of the articles
pos_words = {}  # Positive words for sentiment analysis
neg_words = {}  # Negative words for sentiment analysis

<h2>Part A: Text preprocessing</h2>
<h3>Data Sanitization</h3>

<p>The corpus used in this exercise is a file containing one JSON object per line. It has to be parsed and sanitized. Each JSON object has multiple attributes such as <i>id, title, source, etc.</i>, but only the <i>content</i> field is used here.</p>
<p>Different regex patterns are used to convert the data to the required format.</p>
<p>Articles are stored in a list. Each article is stored as a dictionary that holds only the content field; other fields of the news file could be stored in the same way.</p>

In [3]:
with open('signal-news1/signal-news1.jsonl', 'r') as f:
    for line in f:
        tmp_article = {'content': json.loads(line)['content']}
        tmp_article['content'] = tmp_article['content'].lower()
        # Remove URLs
        tmp_article['content'] = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', tmp_article['content'], flags=re.MULTILINE)
        # Remove non-alphanumeric characters, keeping whitespace and colons
        tmp_article['content'] = re.sub(r'[^a-zA-Z\d\s:]', '', tmp_article['content'], flags=re.MULTILINE)
        # Remove words that consist of a single character
        tmp_article['content'] = re.sub(r'\b(\w)\b', '', tmp_article['content'], flags=re.MULTILINE)
        # Remove tokens that consist only of digits
        tmp_article['content'] = re.sub(r'\b\d+\b', '', tmp_article['content'], flags=re.MULTILINE)
        articles.append(tmp_article)
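
<p>As a rough illustration of what the four substitutions do, the sketch below applies the same patterns, in the same order, to a single made-up sentence. The helper name <i>sanitise</i> and the sample text are assumptions introduced for this example only; they are not part of the assignment code.</p>

```python
import re

# The same four patterns as in the cell above, compiled once.
URL_RE = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
NON_ALNUM_RE = re.compile(r'[^a-zA-Z\d\s:]')
SINGLE_CHAR_RE = re.compile(r'\b(\w)\b')
NUMBER_RE = re.compile(r'\b\d+\b')


def sanitise(text):
    """Hypothetical helper: lowercase the text, then apply the four clean-up steps."""
    text = text.lower()
    text = URL_RE.sub('', text)          # drop URLs
    text = NON_ALNUM_RE.sub('', text)    # drop punctuation (colons and whitespace survive)
    text = SINGLE_CHAR_RE.sub('', text)  # drop one-character words such as "i" and "a"
    text = NUMBER_RE.sub('', text)       # drop tokens made only of digits
    return text


# Made-up sample sentence, not taken from the corpus.
sample = 'Visit https://example.com/news today! I paid 20 dollars for a b2b report.'
cleaned = sanitise(sample)
print(cleaned.split())
```

<p>Note that a mixed token such as <i>b2b</i> is kept, because the digit inside it is not surrounded by word boundaries.</p>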

<h3>Lemmatisation</h3>
<p>The NLTK package is used for lemmatising the text, using the <i>WordNetLemmatizer</i> class. No POS tags are supplied, so every word is lemmatised with the default part of speech (noun).</p>
<p>For each article, all the words are lemmatised and stored as a dictionary in the <i>'lemmatised'</i> field of that article. The lemmatised words are used as keys, and their frequencies are the values.</p><p>There is also a global list of lemmatised words covering all the articles. It is built from two sub-lists: the words of the first 16000 articles and the words of the remaining articles.</p>

In [4]:
lemmatizer = nltk.stem.WordNetLemmatizer()  # Create the lemmatiser once and reuse it
articles_count = 0
for article in articles:
    words = article['content'].split()
    article['lemmatised'] = {}
    articles_count += 1
    for word in words:
        lemm = lemmatizer.lemmatize(word)  # Lemmatise the word (default POS: noun)
        if articles_count <= 16000:
            lemmatised_first_16000.append(lemm)
        else:
            lemmatised_after_16000.append(lemm)
        if lemm not in article['lemmatised']:
            article['lemmatised'][lemm] = 1
        else:
            article['lemmatised'][lemm] += 1
lemmatised = lemmatised_first_16000 + lemmatised_after_16000
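
<p>The per-article frequency dictionary can also be built with <i>collections.Counter</i>. The sketch below is a minimal illustration under stated assumptions: <i>count_lemmas</i> is a hypothetical helper, and <i>toy_lemmatise</i> is a deliberately crude stand-in for <i>WordNetLemmatizer().lemmatize</i> so that the example runs without the WordNet data.</p>

```python
from collections import Counter


def count_lemmas(tokens, lemmatise):
    """Hypothetical helper: return (list of lemmas, Counter of lemma frequencies)."""
    lemmas = [lemmatise(tok) for tok in tokens]
    return lemmas, Counter(lemmas)


def toy_lemmatise(word):
    # Stand-in for the real lemmatiser: only strips a trailing "s" from longer words.
    return word[:-1] if len(word) > 3 and word.endswith('s') else word


# Made-up article text, already sanitised and split into tokens.
tokens = 'shares of the company rose as the shares gained'.split()
lemmas, freq = count_lemmas(tokens, toy_lemmatise)
print(len(lemmas), freq['share'], freq['the'])
```

<p>In the notebook, the real lemmatiser's <i>lemmatize</i> method would be passed in place of <i>toy_lemmatise</i>, and <i>lemmas</i> would be appended to the appropriate global list.</p>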

<h2>Part B: N-grams</h2>
<h3>Number of tokens and vocabulary size</h3>
<p>To calculate the number of tokens (N), the size of the lemmatised words list is used.</p>

In [5]:
print('Number of Tokens (N): ', len(lemmatised))

Number of Tokens (N):  5701427


<p>To calculate the vocabulary size (V), the number of unique lemmatised words is used. The <i>set</i> function creates an unordered collection of the unique elements of the lemmatised words list.</p>

In [13]:
print('Vocabulary Size (V): ', len(set(lemmatised)))

Vocabulary Size (V):  128800


<h3>Top 25 trigrams based on the number of occurrences</h3>
<p>Using the NLTK package's <i>trigrams</i> function (or <i>ngrams()</i>), the trigrams in the text are generated. By calculating their frequencies and sorting them, the top 25 are selected. Alternatively, this can be done by building an NLTK <i>FreqDist</i> from the trigrams and calling its <i>most_common(25)</i> method.</p>

In [15]:
# Calculate top 25 trigrams
tri = trigrams(lemmatised)
# Alternative: top = nltk.FreqDist(tri).most_common(25)
dist = {}
for g in tri:
    if g in dist:
        dist[g] += 1
    else:
        dist[g] = 1
top25 = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:25]  # Sorting trigrams and selecting top 25
print('Top 25 trigrams: ', top25)  # Each entry is (trigram tuple, frequency)

Top 25 trigrams:  [(('one', 'of', 'the'), 2434), (('on', 'share', 'of'), 2095), (('on', 'the', 'stock'), 1567), (('a', 'well', 'a'), 1424), (('in', 'research', 'report'), 1415), (('in', 'research', 'note'), 1373), (('the', 'united', 'state'), 1222), (('for', 'the', 'quarter'), 1221), (('average', 'price', 'of'), 1193), (('research', 'report', 'on'), 1177), (('research', 'note', 'on'), 1138), (('share', 'of', 'the'), 1132), (('the', 'end', 'of'), 1130), (('in', 'report', 'on'), 1124), (('earnings', 'per', 'share'), 1121), (('cell', 'phone', 'plan'), 1073), (('phone', 'plan', 'detail'), 1070), (('according', 'to', 'the'), 1064), (('of', 'the', 'company'), 1057), (('buy', 'rating', 'to'), 1016), (('appeared', 'first', 'on'), 995), (('moving', 'average', 'price'), 995), (('day', 'moving', 'average'), 993), (('price', 'target', 'on'), 981), (('part', 'of', 'the'), 935)]
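
<p>The same counting can be sketched with the standard library alone, using <i>zip</i> to form the trigrams and <i>collections.Counter</i> to rank them. The helper name <i>top_trigrams</i> and the toy token list are assumptions made for this example; the notebook itself uses NLTK's <i>trigrams</i> function.</p>

```python
from collections import Counter


def top_trigrams(tokens, k):
    """Hypothetical helper: count each run of three consecutive tokens, return the k most frequent."""
    grams = zip(tokens, tokens[1:], tokens[2:])  # sliding window of width 3
    return Counter(grams).most_common(k)


# Made-up token list, not taken from the corpus.
tokens = 'one of the best one of the many one of a kind'.split()
top = top_trigrams(tokens, 1)
print(top)
```

<p>Each element of the result is a <i>(trigram, frequency)</i> pair, which is the same shape as the output printed above.</p>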


<h3>Sentiment analysis</h3>
<p>The corpus directory includes lists of positive and negative words. These lists are loaded from their files and are used to count the number of positive and negative words in a given text.</p>

In [16]:
# Load positive and negative words
with open('signal-news1/opinion-lexicon-English/positive-words.txt') as f:
    for line in f:
        pos_words[line.strip()] = 1

with open('signal-news1/opinion-lexicon-English/negative-words.txt') as f:
    for line in f:
        neg_words[line.strip()] = -1


# Count positive word occurrences in a {word: frequency} dictionary
def count_pos_words(words_set):
    num = 0
    for w in words_set:
        if w in pos_words:
            num += words_set[w]
    return num


# Count negative word occurrences in a {word: frequency} dictionary
def count_neg_words(words_set):
    num = 0
    for w in words_set:
        if w in neg_words:
            num += words_set[w]
    return num

<p>For each article, the number of positive and negative lemmatised words is counted. An article is counted as positive if it has more positive than negative words, and as negative if it has more negative than positive words; articles with equal counts are counted as neither. The total numbers of positive and negative words across the whole corpus are also calculated.</p>

In [17]:
num_pos_articles = 0
num_neg_articles = 0
total_pos_words = 0
total_neg_words = 0

for article in articles:
    num_pos_words = count_pos_words(article['lemmatised'])
    num_neg_words = count_neg_words(article['lemmatised'])
    total_pos_words += num_pos_words
    total_neg_words += num_neg_words

    if num_pos_words > num_neg_words:
        num_pos_articles += 1
    elif num_pos_words < num_neg_words:
        num_neg_articles += 1

print('Number of positive words: ', total_pos_words)
print('Number of negative words: ', total_neg_words)
print('Number of positive articles: ', num_pos_articles)
print('Number of negative articles: ', num_neg_articles)

Number of positive words:  170251
Number of negative words:  129278
Number of positive articles:  10816
Number of negative articles:  6385
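
<p>The decision rule for a single article can be illustrated on a toy example. In the sketch below, the function name <i>classify_article</i>, the two tiny lexicons and the frequency dictionaries are all made up for illustration; the notebook uses the lexicon files loaded above.</p>

```python
def classify_article(freq, pos_lexicon, neg_lexicon):
    """Hypothetical helper: label one article from its {lemma: frequency} dictionary."""
    pos = sum(n for w, n in freq.items() if w in pos_lexicon)
    neg = sum(n for w, n in freq.items() if w in neg_lexicon)
    if pos > neg:
        return 'positive'
    if pos < neg:
        return 'negative'
    return 'neutral'  # equal counts: counted as neither positive nor negative


# Toy lexicons and articles (assumptions, not the real opinion lexicon).
pos_lexicon = {'good', 'gain'}
neg_lexicon = {'bad', 'loss'}

label_a = classify_article({'good': 2, 'bad': 1, 'report': 3}, pos_lexicon, neg_lexicon)
label_b = classify_article({'gain': 1, 'loss': 1}, pos_lexicon, neg_lexicon)
print(label_a, label_b)
```

<p>Word frequencies are summed rather than counting distinct words, which matches the behaviour of <i>count_pos_words</i> and <i>count_neg_words</i>.</p>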


<h2>Part C: Language models</h2>

<h3>Building a language model</h3>