# Hong Kong Reddit - N-gram Comments Processing (Part 2)

Following on from Part 1, this notebook processes the comments data to produce ngrams for the ngram viewer.

What are ngrams?
An ngram is a sequence of n consecutive words (where n is a positive integer). Take the example sentence: I love Hong Kong.
* Unigrams consist of a single word, e.g. I, love, Hong, Kong
* Bigrams consist of two words, e.g. I love, love Hong, Hong Kong
* Trigrams consist of three words, e.g. I love Hong, love Hong Kong
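
The definitions above can be sketched in a few lines of plain Python. This is only an illustration: `make_ngrams` is a hypothetical helper, not part of this notebook, which uses `nltk.util.ngrams` for the same job.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love Hong Kong".split()

unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)   # [('I', 'love'), ('love', 'Hong'), ('Hong', 'Kong')]
print(trigrams)  # [('I', 'love', 'Hong'), ('love', 'Hong', 'Kong')]
```

A sentence of k words yields k - n + 1 ngrams of size n, so the four-word example gives 4 unigrams, 3 bigrams and 2 trigrams.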

We will only generate ngrams up to n = 3, because larger ngrams:
1. Require more storage and RAM to process
2. Rarely repeat: it is hard to find, say, a 5-gram that occurs more than once


In [None]:
import re
import json
import datetime
from collections import Counter
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.tokenize import RegexpTokenizer

## Load the data

In [None]:
with open('hongkong_comments_filtered.json', 'r') as f:  
    comments = json.load(f)

In [None]:
# initialize English stopwords
stop_words = set(stopwords.words('english'))

In [None]:
# stores all comments that are in the same year (2010 to 2018 inclusive)
data = {year: [] for year in range(2010, 2019)}

## Natural Language Processing

In [None]:
for comment in comments['data']:
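    # created_utc is a Unix timestamp; take the year in UTC so the result does not depend on the machine's timezone
    comment_year = datetime.datetime.fromtimestamp(round(comment['created_utc']), tz=datetime.timezone.utc).year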
    
    if comment['body'] != '[deleted]': # skip deleted comments
        result = re.sub(r"http\S+|Http\S+|&gt;", "", comment['body']) #remove http links and &gt; artefacts in the data
        data[comment_year].append(result)
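
To see what the cleaning step does to a single comment, here is a minimal sketch that applies the same regular expression to a made-up comment body. `clean_body` and the sample text are hypothetical and only serve as illustration.

```python
import re

def clean_body(body):
    """Strip links and the HTML-escaped '>' that marks quoted text."""
    return re.sub(r"http\S+|Http\S+|&gt;", "", body)

# Hypothetical comment body, not taken from the dataset
sample = "&gt; quoted text\nSee https://example.com/page for details"
cleaned = clean_body(sample)

print(repr(cleaned))  # ' quoted text\nSee  for details'
```

The pattern is case-sensitive, so a link written as `HTTP://...` would pass through unchanged. The whitespace around the removed pieces is left in place, which is harmless here because the tokenizer later splits on non-word characters.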

In [None]:
# Initialize tokenizer. Tokenize words and numbers only, dropping punctuation
tokenizer = RegexpTokenizer(r'\w+')

# For each year...
for key in data:
    # join all comments together, then lower-case the result
    data[key] = (' '.join(filter(None, data[key]))).lower()
    # tokenize the text
    tokens = tokenizer.tokenize(data[key])
    # filter out stopwords from tokens
    tokens = [t for t in tokens if t not in stop_words]
    # remove 'amp', which is left over from the HTML entity &amp;
    data[key] = [t for t in tokens if t not in ['amp']]
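
The effect of the tokenizing and filtering steps can be sketched on one sentence using only the standard library. This is an approximation under stated assumptions: `re.findall(r"\w+", ...)` should behave like `RegexpTokenizer(r'\w+')` for this pattern, and `STOP_WORDS` is a tiny hypothetical stand-in for the NLTK English stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword list (the notebook uses NLTK's)
STOP_WORDS = {"i", "the", "is", "a", "and", "in"}

def tokenize(text):
    """Lower-case, split into word tokens, drop stopwords and 'amp'."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and t != "amp"]

result = tokenize("I love Hong Kong, and the food is great &amp; cheap!")
print(result)  # ['love', 'hong', 'kong', 'food', 'great', 'cheap']
```

Because stopwords are removed before the ngrams are built, words that were separated by a stopword in the original comment become neighbours, so some bigrams and trigrams never appeared verbatim in the text.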

In [None]:
# store all the words
words = {}

In [None]:
# store total frequency of words in each year (2010 to 2018 inclusive)
total_word = {year: 0 for year in range(2010, 2019)}

In [None]:
# for each year...
for key in data:
    # for ngram 1 to 3
    for i in range(1,4):
        # generate ngrams of size i and put them into a list
        n_grams = list(ngrams(data[key], i))
        # count how often each ngram occurs
        counter = Counter(n_grams)
        
        # for each word...
        for word in counter.most_common():
            combine_word = ' '.join(word[0]) #combine bigrams, trigrams
            
            # create new key (word) in words dictionary if not exist
            if combine_word not in words:
                words[combine_word] = [{2010 : 0, 2011 : 0, 2012 : 0, 2013 : 0, 2014 : 0, 2015 : 0, 2016 : 0, 
                                        2017 : 0, 2018 : 0},]
            
            # add value for word
            words[combine_word][0][key] = word[1]

# for each word...
for word in words:
    total_count = 0
    # for each year and value of the word...
    for year, year_value in words[word][0].items():
        total_count += year_value # running total of this word's frequency across all years
        total_word[year] += year_value # running total of all ngram counts in this year, used later for relative frequency
    words[word].append(total_count)
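
The shape of the resulting `words` dictionary is easier to see on toy data. The sketch below rebuilds it for two made-up years; `tokens_by_year` is hypothetical, and the standard-library `zip` stands in for `nltk.util.ngrams`.

```python
from collections import Counter

# Hypothetical toy tokens per year (already lower-cased and filtered)
tokens_by_year = {
    2017: ["hong", "kong", "hong", "kong"],
    2018: ["hong", "kong", "protest"],
}
years = sorted(tokens_by_year)

words = {}
for year, tokens in tokens_by_year.items():
    for n in range(1, 3):  # unigrams and bigrams only, to keep the output short
        grams = zip(*(tokens[i:] for i in range(n)))
        for gram, count in Counter(grams).items():
            key = " ".join(gram)
            # first element of each entry: {year: count}, defaulting to 0
            per_year = words.setdefault(key, [dict.fromkeys(years, 0)])[0]
            per_year[year] = count

# second element of each entry: total count across all years
for entry in words.values():
    entry.append(sum(entry[0].values()))

print(words["hong kong"])  # [{2017: 2, 2018: 1}, 3]
print(words["protest"])    # [{2017: 0, 2018: 1}, 1]
```

Each value is a two-element list: the per-year counts first, then the total across years.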

## Relative Frequency

Absolute frequency does not show how often a word is used compared with the other words in use at the same time, so we will use relative frequency instead: a word's count in a year divided by the total count of all ngrams in that year.

For example, the term 'Occupy Central' only emerged in 2014, during the protests that occurred in Central, Hong Kong. The term may have been cited frequently since then in absolute terms; in terms of relative frequency, however, it is cited less often than the word 'independence' after a year or two.

In [None]:
# Calculate relative frequency of the words in each year
for word in words:
    for year, year_value in words[word][0].items():
        words[word][0][year] = year_value / total_word[year]
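
As a small worked example of the calculation, the sketch below uses made-up counts for two terms. The names `counts`, `totals` and `relative` are hypothetical, and the guard against a year with no words at all is an added assumption that the cell above does not include.

```python
# Hypothetical absolute counts per term and year
counts = {
    "hong kong": {2017: 2, 2018: 1},
    "protest": {2017: 0, 2018: 3},
}
years = [2017, 2018]

# total count of all terms in each year
totals = {year: sum(per_year[year] for per_year in counts.values()) for year in years}

# relative frequency = count in the year / total count in the year
# (assumption: a year with a total of 0 maps to 0.0 instead of raising an error)
relative = {
    word: {year: (c / totals[year] if totals[year] else 0.0) for year, c in per_year.items()}
    for word, per_year in counts.items()
}

print(totals)                 # {2017: 2, 2018: 4}
print(relative["hong kong"])  # {2017: 1.0, 2018: 0.25}
print(relative["protest"])    # {2017: 0.0, 2018: 0.75}
```

Here 'hong kong' falls from 1.0 to 0.25 even though it is still in use, because 'protest' takes up a larger share of the 2018 total.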

## Format to be uploaded to database

In [None]:
# Format as SQL INSERT statements. Not advisable: the resulting file is about 1.2 GB.
with open('ngram_data.sql', 'w', encoding="utf-8") as outfile: 
    table_id = 1
    for key in words:
        json_str = "'{"
        for year_key in words[key][0]:
            string = '"'+ str(year_key) + '":' + str(words[key][0][year_key])
            if year_key == 2018:
                json_str = json_str + string
            else:
                json_str = json_str + string + ','
        # parentheses let the expression continue onto the next line
        finalstr = ("INSERT INTO jobs_ngram_hk (id,word,json,frequency) VALUES (" + str(table_id) + ",'" + str(key) +
                    "'," + json_str + "}'," + str(words[key][1]))
        table_id += 1
        outfile.write(finalstr + ");\n") 
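
The JSON text for each word is assembled by hand above. As a possible alternative, the standard `json` module can produce the same text, which avoids the special case for the last year. This is a sketch only: `year_json` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import json

def year_json(per_year):
    """Serialize a {year: value} dict as compact JSON with string keys."""
    # compact separators match the hand-built string (no spaces)
    return json.dumps(per_year, separators=(",", ":"))

per_year = {2017: 0.25, 2018: 0.5}

# hand-built version, following the same approach as the cell above
by_hand = "{" + ",".join('"' + str(y) + '":' + str(v) for y, v in per_year.items()) + "}"

print(year_json(per_year))  # {"2017":0.25,"2018":0.5}
print(year_json(per_year) == by_hand)  # True
```

`json.dumps` converts the integer year keys to strings, so the output still parses as a JSON object keyed by year.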

In [None]:
# Format for upload through a Postgres client, using ; as the delimiter
with open('ngram_data.csv', 'w', encoding="utf-8") as outfile: 
    table_id = 1
    for key in words:
        json_str = "\"{"
        for year_key in words[key][0]:
            string = '\'\"'+ str(year_key) + '\'\":' + str(words[key][0][year_key])
            if year_key == 2018:
                json_str = json_str + string
            else:
                json_str = json_str + string + ','
        finalstr = str(table_id) + ";" + str(key) + ";" + json_str + "}\";" + str(words[key][1])
        table_id += 1
        outfile.write(finalstr + "\n") 

In [None]:
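# Format for upload with psql's \copy command, using ; as the delimiter.
# Writes to the same file name as the previous cell, so it overwrites that output.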
with open('ngram_data.csv', 'w', encoding="utf-8") as outfile: 
    table_id = 1
    for key in words:
        json_str = "\"{"
        for year_key in words[key][0]:
            string = '\\"'+ str(year_key) + '\\":' + str(words[key][0][year_key])
            if year_key == 2018:
                json_str = json_str + string
            else:
                json_str = json_str + string + ','
        finalstr = str(table_id) + ";" + str(key) + ";" + json_str + "}\";" + str(words[key][1])
        table_id += 1
        outfile.write(finalstr + "\n") 
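
For comparison, the same rows could be written with the standard `csv` module, which handles quoting automatically. Note that this sketch is not equivalent to the cells above: it escapes quotes inside the JSON field by doubling them (the usual CSV convention) rather than with a backslash, so whether it loads correctly depends on the options given to `\copy`. `write_rows` and the sample data are hypothetical.

```python
import csv
import io
import json

def write_rows(words, outfile):
    """Write one 'id;word;json;frequency' row per word."""
    writer = csv.writer(outfile, delimiter=";", lineterminator="\n")
    for table_id, (word, (per_year, total)) in enumerate(words.items(), start=1):
        year_json = json.dumps(per_year, separators=(",", ":"))
        writer.writerow([table_id, word, year_json, total])

# Hypothetical entry in the same shape as the words dictionary
sample_words = {"hong kong": [{2017: 0.25, 2018: 0.5}, 3]}

buffer = io.StringIO()  # stands in for an open file
write_rows(sample_words, buffer)
print(buffer.getvalue())  # 1;hong kong;"{""2017"":0.25,""2018"":0.5}";3
```

Reading the output back with `csv.reader` and the same delimiter returns the original JSON text unchanged.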