# Collocation and N-Grams

This analysis gives us an idea of how words are co-located in text. We want to analyse how 'humanitarian' is used with other words in the corpus over time. The results of this exercise will be visualized using Highcharts network graphs.

## Setup

Importing necessary libraries, declaring global variables and the dataframe file at the top.

In [5]:
import nltk # Using NLTK for ngrams analysis
# nltk.download('stopwords') # Download stopwords
from nltk.collocations import *
from nltk.corpus import stopwords
import pandas as pd # Need this for CSV, DF manipulation
import re # Regex to remove punctuation from strings I split
import itertools # For sorting list and removing duplicates

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tashfeenahmed/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [6]:
# Declaring global variables that will be reused
yearMonths = ['201912', '202001', '202002', '202003', '202004', '202005', '202006', '202007', '202008']
yearMonthsWord = ['Dec 2019', 'Jan 2020', 'Feb 2020', 'Mar 2020', 'Apr 2020', 'May 2020', 'Jun 2020', 'Jul 2020', 'Aug 2020']

In [7]:
# Running the first round for Germany just to check if ngrams will show something interesting
df = pd.read_csv('Clean Data/DE_cleandf.csv')

In [8]:
# Get NLTK stopwords
sw = stopwords.words("english")
sw

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

## Functions

Declaring reusable functions at the top so I can call them later for my analysis.

In [9]:
# Further cleaning

def cleanTextInDf(mystring):
    mystring = mystring.lower() # Text normalization: make string lowercase
    mystring = re.sub(r'[^\w\s]','', mystring) # Text normalization: remove punctuation
    tokens = [token for token in nltk.word_tokenize(mystring) if token not in sw] # Remove stop words
    return " ".join(tokens)

# Get yearMonth

def checkYearMonth(row):
    value = row['date']
    return str(value)[0:6]

# For each country/region, combine the text of each month into one row per month.

def combinedTextForCountryDf(country):
    dfCountryYrList = []
    for ym in yearMonths:
        combinedText = ' '.join(df[(df['yearmonth'] == ym) & (df['country'] == country)].text)
        dictCountryYr = {'country': country, 'yearmonth': ym, 'text': combinedText}
        dfCountryYrList.append(dictCountryYr)
    return dfCountryYrList
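
To make the cleaning steps concrete, here is a minimal standard-library sketch of the same idea. It is not the notebook's code: `clean_text`, `year_month` and `SAMPLE_STOP_WORDS` are hypothetical names, the stop word set is a tiny stand-in for NLTK's English list, and `str.split()` replaces `nltk.word_tokenize` so that the example runs without any downloads.

```python
import re

# Hypothetical, tiny stop word set for illustration only;
# the notebook uses NLTK's full English list instead.
SAMPLE_STOP_WORDS = {"the", "to", "of", "in", "a", "is"}

def clean_text(raw, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase the text, strip punctuation and drop stop words."""
    lowered = raw.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)  # keep word characters and whitespace
    tokens = [t for t in no_punct.split() if t not in stop_words]
    return " ".join(tokens)

def year_month(date_value):
    """Return the YYYYMM prefix of a YYYYMMDD date."""
    return str(date_value)[:6]

sample = "The U.N. pledged humanitarian aid to Yemen."
print(clean_text(sample))
print(year_month(20200602))
```

Because punctuation is stripped before the stop word check, contractions such as "you're" become "youre" and no longer match their entry in a stop word list.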

## Further cleaning and manipulation

I want to lowercase the text, remove punctuation and get a new column 'yearmonth' to show year and month.

In [10]:
# Apply cleaning function to the text column

cleanText = lambda text: cleanTextInDf(text) # Lambda function applies to all cells in a column
cleandf = pd.DataFrame(df.text.apply(cleanText)) # .apply() the function to all cells
df['text'] = cleandf['text']
df['yearmonth'] = df.apply(checkYearMonth, axis=1)
df

Unnamed: 0,name,path,country,network,date,token_freq,text,yearmonth
0,20200602_DE_DPA_NEXIS157175.txt,Raw text/DEClean/20200602_DE_DPA_NEXIS157175.txt,DE,DPA,20200602,4,donors conference yemen hosted united nations ...,202006
1,20200227_DE_DieWelt_FACTIVA15887.txt,Raw text/DEClean/20200227_DE_DieWelt_FACTIVA15...,DE,DieWelt,20200227,4,copyright 2020 axel springer se world bank pan...,202002
2,20200304_DE_DeutscheWelle_FACTIVA15494.txt,Raw text/DEClean/20200304_DE_DeutscheWelle_FAC...,DE,DeutscheWelle,20200304,15,020 deutsche welle activists say iraq politica...,202003
3,20200403_DE_DeutscheWelle_FACTIVA15551.txt,Raw text/DEClean/20200403_DE_DeutscheWelle_FAC...,DE,DeutscheWelle,20200403,109,020 deutsche welle international monetary fund...,202004
4,20200401_DE_DeutscheWelle_GNAPI64780.txt,Raw text/DEClean/20200401_DE_DeutscheWelle_GNA...,DE,DeutscheWelle,20200401,104,ulliover 870000 covid19 cases reported worldwi...,202004
...,...,...,...,...,...,...,...,...
246,20200326_DE_DeutscheWelle_FACTIVA15540.txt,Raw text/DEClean/20200326_DE_DeutscheWelle_FAC...,DE,DeutscheWelle,20200326,107,020 deutsche welle us 82000 cases country set ...,202003
247,20200331_DE_DPA_NEXIS157630.txt,Raw text/DEClean/20200331_DE_DPA_NEXIS157630.txt,DE,DPA,20200331,4,greece migration ministry tuesday confirmed fi...,202003
248,20200421_DE_DPA_NEXIS157585.txt,Raw text/DEClean/20200421_DE_DPA_NEXIS157585.txt,DE,DPA,20200421,5,number people insufficient food supplies nearl...,202004
249,20200329_DE_DeutscheWelle_FACTIVA15546.txt,Raw text/DEClean/20200329_DE_DeutscheWelle_FAC...,DE,DeutscheWelle,20200329,77,020 deutsche welle spain set another grim dail...,202003


Combining the text by month so that I can analyse it month by month.

In [11]:
# Test run with DE: Germany

dfYrList = combinedTextForCountryDf('DE')
deYrDf = pd.DataFrame(dfYrList)
deYrDf

Unnamed: 0,country,yearmonth,text
0,DE,201912,
1,DE,202001,
2,DE,202002,copyright 2020 axel springer se world bank pan...
3,DE,202003,020 deutsche welle activists say iraq politica...
4,DE,202004,020 deutsche welle international monetary fund...
5,DE,202005,020 deutsche welle video address german chance...
6,DE,202006,donors conference yemen hosted united nations ...
7,DE,202007,times gmt questions weekly planner contact ple...
8,DE,202008,times gmt questions weekly planner contact ple...
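
The month-wise combination can be sketched without pandas as follows. This is an illustration under assumptions, not the notebook's implementation: `records`, `months` and `combine_text_by_month` are hypothetical, and the rows are invented stand-ins for the cleaned dataframe.

```python
from collections import defaultdict

# Hypothetical records standing in for rows of the cleaned dataframe.
records = [
    {"country": "DE", "yearmonth": "202002", "text": "world bank pandemic bonds"},
    {"country": "DE", "yearmonth": "202003", "text": "activists say iraq"},
    {"country": "DE", "yearmonth": "202002", "text": "cruise ship docked"},
]
months = ["202001", "202002", "202003"]

def combine_text_by_month(rows, country, year_months):
    """Join all texts of one country per month, keeping months with no articles."""
    buckets = defaultdict(list)
    for row in rows:
        if row["country"] == country:
            buckets[row["yearmonth"]].append(row["text"])
    return [
        {"country": country, "yearmonth": ym, "text": " ".join(buckets[ym])}
        for ym in year_months
    ]

combined = combine_text_by_month(records, "DE", months)
for row in combined:
    print(row["yearmonth"], repr(row["text"]))
```

Months without articles end up with an empty string, as in the first two rows of the table above. Joining articles with a space also means that an n-gram can span the boundary between two articles.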


## NLTK Analysis

I will try both bigrams and trigrams for Germany and see which one works better.

Edit: Trigrams work well because they give me the context of the text by showing frequently occurring word sequences.

In [12]:
text = deYrDf['text'][2] # For February

In [13]:
finder = TrigramCollocationFinder.from_words(text.split(), window_size = 3)

In [14]:
finder.apply_freq_filter(3) # Removes candidate ngrams which have frequency less than min_freq
trigram_measures = nltk.collocations.TrigramAssocMeasures()

In [15]:
for k,v in finder.ngram_fd.items():
    print(k,v)

('world', 'health', 'organization') 7
('small', 'business', 'owners') 4
('shutdown', 'factories', 'china') 4
('vs', 'facts', 'true') 9
('facts', 'true', 'coronavirus') 9
('true', 'coronavirus', 'information') 9
('coronavirus', 'information', 'web') 9
('lilimyths', 'vs', 'facts') 8
('businessman', 'clifford', 'tsache') 3
('clifford', 'tsache', 'says') 3
('business', 'owners', 'say') 3
('due', 'coronavirus', 'outbreak') 3
('020', 'deutsche', 'welle') 4
('total', 'number', 'cases') 3
('luxury', 'cruise', 'ship') 5
('holland', 'america', 'line') 7
('due', 'fears', 'coronavirus') 3
('prime', 'minister', 'prayut') 3
('minister', 'prayut', 'chanocha') 3
('cases', 'coronavirus', 'board') 3
('holland', 'america', 'said') 4
('editorial', 'contactsediting', 'stephen') 3
('contactsediting', 'stephen', 'lowman') 3
('stephen', 'lowman', '49') 3
('lowman', '49', '30') 3
('49', '30', '285231472') 5
('30', '285231472', 'internationaldpacom') 3
('285231472', 'internationaldpacom', 'loaddate') 3
('intern

Let's take the 20 most frequent trigrams and list them in alphabetical order.

In [16]:
sorted(finder.nbest(trigram_measures.raw_freq, 20))

[('020', 'deutsche', 'welle'),
 ('49', '30', '285231472'),
 ('china', 'tourism', 'industry'),
 ('coronavirus', 'hits', 'china'),
 ('coronavirus', 'information', 'web'),
 ('cultural', 'sites', 'coronavirus'),
 ('facts', 'true', 'coronavirus'),
 ('hits', 'china', 'tourism'),
 ('holland', 'america', 'line'),
 ('holland', 'america', 'said'),
 ('hun', 'sen', 'told'),
 ('internationaldpacom', 'loaddate', 'february'),
 ('liliempty', 'cultural', 'sites'),
 ('lilimyths', 'vs', 'facts'),
 ('luxury', 'cruise', 'ship'),
 ('minister', 'hun', 'sen'),
 ('sites', 'coronavirus', 'hits'),
 ('true', 'coronavirus', 'information'),
 ('vs', 'facts', 'true'),
 ('world', 'health', 'organization')]

From the results above, I can tell that German outlets in February 2020 were talking about coronavirus cases, flights to China and health measures across the globe.
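
As a rough illustration of what the finder counts, the sketch below reproduces the steps (count adjacent triples, drop those seen fewer than three times, keep the most frequent) in plain Python. It is not NLTK's implementation; `top_trigrams` and the sample text are hypothetical, and tie-breaking between equally frequent trigrams may differ from `nbest`.

```python
from collections import Counter

def top_trigrams(text, min_freq=3, n=20):
    """Return up to n trigrams seen at least min_freq times, most frequent first."""
    tokens = text.split()
    counts = Counter(zip(tokens, tokens[1:], tokens[2:]))  # adjacent triples
    frequent = [(tg, c) for tg, c in counts.items() if c >= min_freq]
    frequent.sort(key=lambda item: (-item[1], item[0]))  # by count, then alphabetically
    return [tg for tg, _ in frequent[:n]]

# Invented sample text, repeated so that some trigrams pass the frequency filter.
sample = " ".join(["world health organization said"] * 3 + ["luxury cruise ship"] * 4)

by_frequency = top_trigrams(sample)
alphabetical = sorted(by_frequency)  # same trigrams, frequency order discarded
print(by_frequency[0])
print(alphabetical[0])
```

Wrapping the result in `sorted()` keeps the same set of trigrams but orders them alphabetically, so the position of a trigram in the printed list says nothing about how frequent it is.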

Now I want to import the entire dataset and run the same analysis (using a reusable function) for blocks of countries over each month.

In [17]:
ae_df = pd.read_csv('Clean Data/AE_cleandf.csv')
cn_df = pd.read_csv('Clean Data/CN_cleandf.csv')
de_df = pd.read_csv('Clean Data/DE_cleandf.csv')
fr_df = pd.read_csv('Clean Data/FR_cleandf.csv')
ir_df = pd.read_csv('Clean Data/IR_cleandf.csv')
kw_df = pd.read_csv('Clean Data/KW_cleandf.csv')
qa_df = pd.read_csv('Clean Data/QA_cleandf.csv')
ru_df = pd.read_csv('Clean Data/RU_cleandf.csv')
sa_df = pd.read_csv('Clean Data/SA_cleandf.csv')
tr_df = pd.read_csv('Clean Data/TR_cleandf.csv')
uk_df = pd.read_csv('Clean Data/UK_cleandf.csv')
us_df = pd.read_csv('Clean Data/US_cleandf.csv')
df = pd.concat([ae_df, cn_df, de_df, fr_df, ir_df, kw_df, qa_df, ru_df, sa_df, tr_df, uk_df, us_df], ignore_index=True)
df

Unnamed: 0,name,path,country,network,date,token_freq,text
0,20200619_AE_EmiratesNewsAgency_NEXIS212.txt,Raw text/AEClean/20200619_AE_EmiratesNewsAgenc...,AE,EmiratesNewsAgency,20200619,4,MELBOURNE 18th June 2020 WAM The UAE General C...
1,20200422_AE_AlArabiya_FACTIVA2484.txt,Raw text/AEClean/20200422_AE_AlArabiya_FACTIVA...,AE,AlArabiya,20200422,13,020 Al Arabiya All rights Reserved Provided by...
2,20200328_AE_TheNational_GDELT136558.txt,Raw text/AEClean/20200328_AE_TheNational_GDELT...,AE,TheNational,20200328,10,UAE offers to help Syria counter coronavirus t...
3,20200616_AE_KhaleejTimes_NEXIS59.txt,Raw text/AEClean/20200616_AE_KhaleejTimes_NEXI...,AE,KhaleejTimes,20200616,3,LOréal Middle East has launched a UAE solidari...
4,20200318_AE_TheNational_NEXIS18486.txt,Raw text/AEClean/20200318_AE_TheNational_NEXIS...,AE,TheNational,20200318,16,Medical staff push a patient on a gurney to a ...
...,...,...,...,...,...,...,...
13426,20200402_US_AssociatedPress_SERP11978.txt,Raw text/USClean/20200402_US_AssociatedPress_S...,US,AssociatedPress,20200402,5,World Food Program USA Allocates 333000 in Eme...
13427,20200629_US_TheNewYorkTimes_NEXIS794361.txt,Raw text/USClean/20200629_US_TheNewYorkTimes_N...,US,TheNewYorkTimes,20200629,13,The country has been hit with a triplewhammy r...
13428,20200520_US_CNN_GNAPI69344.txt,Raw text/USClean/20200520_US_CNN_GNAPI69344.txt,US,CNN,20200520,18,CNNChinese leader Xi Jinping made preserving d...
13429,20200402_US_VOA_GDELT131430.txt,Raw text/USClean/20200402_US_VOA_GDELT131430.txt,US,VOA,20200402,22,WASHINGTON North Koreas decision to protect it...


## Dataframe

Converting country names into blocks so that the analysis can be shown for the entire block rather than a country.

In [18]:
cdf = df.copy() # Keeping an independent copy of df
# Map each country code to its block in a single replace call
countryBlocks = {
    'AE': 'Gulf Countries', 'KW': 'Gulf Countries', 'SA': 'Gulf Countries', 'QA': 'Gulf Countries',
    'US': 'Euro-Atlantic Countries', 'UK': 'Euro-Atlantic Countries', 'DE': 'Euro-Atlantic Countries', 'FR': 'Euro-Atlantic Countries',
    'CN': 'New Global Media Players', 'RU': 'New Global Media Players', 'IR': 'New Global Media Players', 'TR': 'New Global Media Players',
}
df = df.replace({'country': countryBlocks})

In [19]:
cleanText = lambda text: cleanTextInDf(text) # Lambda function applies to all cells in a column
cleandf = pd.DataFrame(df.text.apply(cleanText)) # .apply() the function to all cells
df['text'] = cleandf['text']
df['yearmonth'] = df.apply(checkYearMonth, axis=1)
df

Unnamed: 0,name,path,country,network,date,token_freq,text,yearmonth
0,20200619_AE_EmiratesNewsAgency_NEXIS212.txt,Raw text/AEClean/20200619_AE_EmiratesNewsAgenc...,Gulf Countries,EmiratesNewsAgency,20200619,4,melbourne 18th june 2020 wam uae general consu...,202006
1,20200422_AE_AlArabiya_FACTIVA2484.txt,Raw text/AEClean/20200422_AE_AlArabiya_FACTIVA...,Gulf Countries,AlArabiya,20200422,13,020 al arabiya rights reserved provided syndig...,202004
2,20200328_AE_TheNational_GDELT136558.txt,Raw text/AEClean/20200328_AE_TheNational_GDELT...,Gulf Countries,TheNational,20200328,10,uae offers help syria counter coronavirus thre...,202003
3,20200616_AE_KhaleejTimes_NEXIS59.txt,Raw text/AEClean/20200616_AE_KhaleejTimes_NEXI...,Gulf Countries,KhaleejTimes,20200616,3,loréal middle east launched uae solidarity pro...,202006
4,20200318_AE_TheNational_NEXIS18486.txt,Raw text/AEClean/20200318_AE_TheNational_NEXIS...,Gulf Countries,TheNational,20200318,16,medical staff push patient gurney waiting medi...,202003
...,...,...,...,...,...,...,...,...
13426,20200402_US_AssociatedPress_SERP11978.txt,Raw text/USClean/20200402_US_AssociatedPress_S...,Euro-Atlantic Countries,AssociatedPress,20200402,5,world food program usa allocates 333000 emerge...,202004
13427,20200629_US_TheNewYorkTimes_NEXIS794361.txt,Raw text/USClean/20200629_US_TheNewYorkTimes_N...,Euro-Atlantic Countries,TheNewYorkTimes,20200629,13,country hit triplewhammy raging coronavirus am...,202006
13428,20200520_US_CNN_GNAPI69344.txt,Raw text/USClean/20200520_US_CNN_GNAPI69344.txt,Euro-Atlantic Countries,CNN,20200520,18,cnnchinese leader xi jinping made preserving d...,202005
13429,20200402_US_VOA_GDELT131430.txt,Raw text/USClean/20200402_US_VOA_GDELT131430.txt,Euro-Atlantic Countries,VOA,20200402,22,washington north koreas decision protect coron...,202004


## Trigrams Function

Creating a function to get trigrams. I want the 20 most frequent trigrams for each month, to make sense of the news data from different regions over time.

In [85]:
def getTrigrams(month, region):
    dfYrList = combinedTextForCountryDf(region) # Combine text into months
    yrDf = pd.DataFrame(dfYrList) # Make a df from list
    text = yrDf['text'][month] # Just need the text for finder
    finder = TrigramCollocationFinder.from_words(text.split(), window_size = 3) # Finds trigrams
    finder.apply_freq_filter(3) # Removes candidate ngrams which have frequency less than min_freq
    trigram_measures = nltk.collocations.TrigramAssocMeasures() # Gets collocations
    return sorted(finder.nbest(trigram_measures.raw_freq, 20)) # Takes the top 20 by frequency, then sorts them alphabetically

In [99]:
def getAllTrigrams(region):
    print('Processing Trigrams for', region)
    print('\n')
    for index, month in enumerate(yearMonthsWord): # I want trigrams for each month
        print(month)
        print('\n')
        trigrams = getTrigrams(index, region)
        for tg in trigrams:
            print(tg)
        print('\n')

## Trigrams

The trigrams will be shown for regions over time.

In [100]:
getAllTrigrams('Euro-Atlantic Countries')

Processing Trigrams for Euro-Atlantic Countries


Dec 2019


('a', 'general', 'view')
('according', 'to', 'the')
('around', 'the', 'world')
('has', 'led', 'to')
('hundreds', 'of', 'thousands')
('in', 'a', 'statement')
('in', 'the', 'city')
('in', 'the', 'state')
('new', 'york', 'new')
('of', 'the', 'states')
('of', 'trust', 'in')
('one', 'of', 'the')
('part', 'of', 'the')
('refers', 'to', 'the')
('said', 'in', 'a')
('the', 'city', 'council')
('the', 'country', 'is')
('the', 'ebola', 'response')
('trust', 'in', 'the')
('york', 'new', 'york')


Jan 2020


('around', 'the', 'world')
('as', 'well', 'as')
('blocktime', 'updatedtimeupdated', 'at')
('cases', 'of', 'the')
('city', 'of', 'wuhan')
('gmt', 'blocktime', 'publishedtime')
('in', 'china', 'and')
('in', 'the', 'city')
('lunar', 'new', 'year')
('of', 'the', 'coronavirus')
('of', 'the', 'new')
('of', 'the', 'outbreak')
('of', 'the', 'virus')
('one', 'of', 'the')
('spread', 'of', 'the')
('the', 'new', 'coronavirus')
('the', 'number', 'of

For Euro-Atlantic Countries there are some interesting trigrams:

* Dec 2019: The Ebola response is an interesting one.
* Jan 2020: Euro-Atlantic news talks about the coronavirus and mentions the World Health Organization.
* Feb 2020: There is a mention of the Diamond Princess cruise ship, on which an outbreak of coronavirus disease (COVID-19) occurred during an international journey.
* Mar 2020: The phrase 'tested positive for' suggests that the news was focusing on new COVID-19 cases.
* Apr - Aug 2020: News mostly focuses on the coronavirus pandemic and mentions the WHO and the number of COVID-19 cases.

In [101]:
getAllTrigrams('Gulf Countries')

Processing Trigrams for Gulf Countries


Dec 2019


('15', 'august', '2020')
('16', 'august', '2020')
('2019', 'blog', 'syria')
('2020', 'coronavirus', 'pandemic')
('august', '2020', 'coronavirus')
('august', '2020', 'dubai')
('august', '2020', 'government')
('coronavirus', 'pandemic', 'coronavirus')
('december', '15', '2019')
('december', '18', '2019')
('december', '19', '2019')
('follow', 'our', 'coverage')
('khaleej', 'times', 'khaleejtimes')
('khaleejtimes', 'december', '15')
('khaleejtimes', 'december', '18')
('khaleejtimes', 'december', '19')
('more', 'coronavirus', 'pandemic')
('read', 'more', 'coronavirus')
('times', 'khaleejtimes', 'december')
('video', 'by', 'asankar')


Jan 2020


('15', 'august', '2020')
('about', 'which', 'countries')
('case', 'of', 'the')
('cases', 'of', 'the')
('confirmed', 'cases', 'of')
('countries', 'have', 'confirmed')
('have', 'confirmed', 'cases')
('in', 'a', 'statement')
('in', 'the', 'country')
('of', 'the', 'coronavirus')
('of', 'the', 'new')
('

Trigrams show the following for Gulf Countries:

* Dec 2019: Mentions of the coronavirus pandemic in December 2019 suggest that some news articles were edited later, since the coronavirus outbreak had not been declared a pandemic in 2019.
* Jan 2020: The Gulf news talks about the coronavirus and mentions the World Health Organization.
* Feb-Jul 2020: The news mostly mentions the spread of coronavirus and the rise in cases.
* Aug 2020: Besides coronavirus, there are mentions of the Beirut explosion, the French president and the UN.

In [102]:
getAllTrigrams('New Global Media Players')

Processing Trigrams for New Global Media Players


Dec 2019


('democratic', 'republic', 'of')
('in', 'dr', 'congo')
('the', 'democratic', 'republic')
('the', 'dr', 'congo')
('the', 'number', 'of')


Jan 2020


('attend', 'medical', 'training')
('before', 'entering', 'subway')
('body', 'temperature', 'checked')
('checked', 'before', 'entering')
('china', 'is', 'shaanxi')
('chinese', 'new', 'year')
('combating', 'novel', 'coronavirus')
('coronavirus', 'in', 'wuhan')
('coronavirus', 'in', 'xian')
('curb', 'spread', 'of')
('entering', 'subway', 'in')
('eve', 'body', 'temperature')
('from', 'shanghai', 'attend')
('frontline', 'of', 'combating')
('in', 'qingdao', 'snow')
('in', 'wuhan', 'people')
('novel', 'coronavirus', 'in')
('of', 'novel', 'coronavirus')
('spread', 'of', 'novel')
('to', 'curb', 'spread')


Feb 2020


('according', 'to', 'the')
('as', 'well', 'as')
('fight', 'against', 'the')
('in', 'a', 'statement')
('of', 'the', 'novel')
('of', 'the', 'virus')
('president', 'vladimir', 

Trigrams from New Global Media Players show:

* Dec 2019: Mentions of the DR Congo, which are due to the Ebola outbreak.
* Jan 2020: News mentions the novel coronavirus, its spread and measures taken to combat it.
* Feb-Aug 2020: Besides coronavirus, there are mentions of the Beirut explosion, the French president and the UN.

## Network Visualisation

We want to show relationships of words in a network graph using Highcharts. Highcharts.js requires an array of arrays with two values. For instance, `['president', 'donald'], ['donald','trump']` will be connected as president -> donald -> trump.
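
One way to produce that array-of-pairs format is to serialize the pairs with `json`, which avoids copy-pasting printed text. The sketch below is an assumption about a possible workflow rather than what this notebook does; `to_highcharts_links` is a hypothetical helper, and how the resulting string is wired into the chart configuration is left open.

```python
import json

def to_highcharts_links(pairs, stop_words=frozenset()):
    """Serialize word pairs as a JSON array of two-element arrays."""
    links = []
    seen = set()
    for first, second in pairs:
        if first in stop_words or second in stop_words:
            continue  # skip pairs containing a stop word
        if first.isnumeric() or second.isnumeric():
            continue  # skip numbers and dates
        if (first, second) in seen:
            continue  # skip exact duplicates
        seen.add((first, second))
        links.append([first, second])
    return json.dumps(links)

pairs = [("president", "donald"), ("donald", "trump"),
         ("president", "donald"), ("15", "august")]
links_json = to_highcharts_links(pairs)
print(links_json)
```

The output is valid JSON, which is also valid JavaScript array syntax, so quoting and escaping of unusual tokens is handled by the serializer.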

### Bigram Functions

Creating functions to render bigrams in the correct format. We need these bigrams for Highcharts visualizations.

In [171]:
# Declaring a vocabulary of stop words that I don't need in the bigrams
myStopWords = ["i", "the", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

def getBigrams(month, region):
    dfYrList = combinedTextForCountryDf(region) # Combine text into months
    yrDf = pd.DataFrame(dfYrList) # Make a dataframe
    text = yrDf['text'][month] # Get the text for bigrams
    finder = BigramCollocationFinder.from_words(text.split(), window_size = 3) # Pairs each word with the next two words, so non-adjacent pairs are counted too
    finder.apply_freq_filter(3) # Removes candidate ngrams which have frequency less than min_freq
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    return sorted(finder.nbest(bigram_measures.raw_freq, 300)) # Takes the top 300 by frequency, then sorts them alphabetically

def getAllBigrams(region):
    print('Processing Bigrams for', region)
    print('\n')
    for index, month in enumerate(yearMonthsWord):
        bigrams = getBigrams(index, region) # Get bigrams by months in a region
        for bg in bigrams:
            if bg[0] in myStopWords or bg[1] in myStopWords: # Skip pairs containing a stop word
                continue
            if bg[0].isnumeric() or bg[1].isnumeric(): # We don't need numbers/dates etc
                continue
            print("['"+bg[0]+"', '"+bg[1]+"'],") # Make a list format that can be copy-pasted into JavaScript code
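
Because the bigram finder is created with `window_size = 3`, each word is paired with the next two words rather than only its immediate neighbour. The plain-Python sketch below illustrates that counting; `windowed_pairs` is a hypothetical helper written for this explanation, not NLTK code.

```python
from collections import Counter

def windowed_pairs(tokens, window_size=3):
    """Count ordered pairs of words that fall within the same window."""
    counts = Counter()
    for i, first in enumerate(tokens):
        # Pair the word with the following (window_size - 1) words.
        for second in tokens[i + 1 : i + window_size]:
            counts[(first, second)] += 1
    return counts

tokens = "lunar new year".split()
wide = windowed_pairs(tokens, window_size=3)
adjacent = windowed_pairs(tokens, window_size=2)
print(sorted(wide))
print(sorted(adjacent))
```

This is why pairs such as `['lunar', 'year']` appear next to `['lunar', 'new']` and `['new', 'year']` in the output: the first and third word of a phrase share a window. A window of 2 would restrict the graph to adjacent words only.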

In [153]:
getAllBigrams('Gulf Countries')

Processing Bigrams for Gulf Countries


['abu', 'dhabi'],
['anjana', 'sankar'],
['app', 'available'],
['asankar', 'khaleej'],
['asankar', 'kt'],
['assistant', 'editor'],
['august', 'coronavirus'],
['august', 'dubai'],
['august', 'government'],
['available', 'download'],
['blog', 'syria'],
['blog', 'video'],
['cases', 'recoveries'],
['combating', 'coronavirus'],
['conferencesupcoming', 'events'],
['conferencesupcoming', 'eventspast'],
['cookbook', 'remote'],
['cookbook', 'workforce'],
['coronavirus', 'combating'],
['coronavirus', 'coronavirus'],
['coronavirus', 'pandemic'],
['coronavirus', 'reports'],
['coronavirus', 'uae'],
['covid19', 'cases'],
['day', 'education'],
['day', 'uae'],
['deaths', 'august'],
['dubai', 'cares'],
['due', 'war'],
['education', 'cookbook'],
['education', 'enrolment'],
['emirates', 'mars'],
['emirates', 'mission'],
['enrolment', 'cookbook'],
['enrolment', 'remote'],
['events', 'conferencesupcoming'],
['events', 'eventspast'],
['eventspast', 'events'],
['follow'

Removing duplicate items from the list:

In [155]:
bigramGulf = [['abu', 'dhabi'],
['anjana', 'sankar'],
['app', 'available'],
['asankar', 'khaleej'],
['asankar', 'kt'],
['assistant', 'editor'],
['august', 'coronavirus'],
['august', 'dubai'],
['august', 'government'],
['available', 'download'],
['blog', 'syria'],
['blog', 'video'],
['cases', 'recoveries'],
['combating', 'coronavirus'],
['conferencesupcoming', 'events'],
['conferencesupcoming', 'eventspast'],
['cookbook', 'remote'],
['cookbook', 'workforce'],
['coronavirus', 'combating'],
['coronavirus', 'coronavirus'],
['coronavirus', 'pandemic'],
['coronavirus', 'reports'],
['coronavirus', 'uae'],
['covid19', 'cases'],
['day', 'education'],
['day', 'uae'],
['deaths', 'august'],
['dubai', 'cares'],
['due', 'war'],
['education', 'cookbook'],
['education', 'enrolment'],
['emirates', 'mars'],
['emirates', 'mission'],
['enrolment', 'cookbook'],
['enrolment', 'remote'],
['events', 'conferencesupcoming'],
['events', 'eventspast'],
['eventspast', 'events'],
['follow', 'coverage'],
['girls', 'education'],
['ground', 'syria'],
['homeless', 'due'],
['independence', 'day'],
['inspired', 'living'],
['khaleej', 'khaleejtimes'],
['khaleej', 'times'],
['khaleejtimes', 'december'],
['kt', 'anjana'],
['kt', 'ktinsyria'],
['kt', 'video'],
['northern', 'syria'],
['pandemic', 'coronavirus'],
['pandemic', 'uae'],
['read', 'coronavirus'],
['read', 'general'],
['recoveries', 'deaths'],
['rendered', 'due'],
['rendered', 'homeless'],
['sankar', 'ktinsyria'],
['syria', 'asankar'],
['syria', 'video'],
['times', 'december'],
['times', 'khaleejtimes'],
['uae', 'august'],
['uae', 'reports'],
['video', 'anjana'],
['video', 'asankar'],
['video', 'sankar'],
['chinese', 'authorities'],
['citizens', 'wuhan'],
['confirmed', 'cases'],
['coronavirus', 'outbreak'],
['coronavirus', 'pandemic'],
['countries', 'confirmed'],
['death', 'toll'],
['evacuation', 'flight'],
['first', 'case'],
['flights', 'china'],
['government', 'said'],
['health', 'minister'],
['health', 'said'],
['hong', 'kong'],
['hubei', 'province'],
['lunar', 'new'],
['lunar', 'year'],
['million', 'people'],
['ministry', 'said'],
['new', 'cases'],
['new', 'coronavirus'],
['new', 'year'],
['news', 'agency'],
['prime', 'minister'],
['said', 'would'],
['south', 'korea'],
['told', 'news'],
['abu', 'dhabi'],
['august', 'coronavirus'],
['august', 'dubai'],
['bin', 'al'],
['chinese', 'people'],
['coronavirus', 'cases'],
['coronavirus', 'coronavirus'],
['coronavirus', 'covid19'],
['coronavirus', 'outbreak'],
['coronavirus', 'pandemic'],
['cruise', 'ship'],
['health', 'minister'],
['health', 'said'],
['humanitarian', 'aid'],
['inspired', 'living'],
['king', 'salman'],
['medical', 'supplies'],
['middle', 'east'],
['new', 'cases'],
['new', 'coronavirus'],
['north', 'korea'],
['pandemic', 'coronavirus'],
['qatar', 'airways'],
['read', 'coronavirus'],
['saudi', 'arabia'],
['sheikh', 'bin'],
['south', 'korea'],
['tested', 'positive'],
['united', 'states'],
['world', 'health'],
['abu', 'dhabi'],
['al', 'jazeera'],
['cases', 'coronavirus'],
['confirmed', 'cases'],
['coronavirus', 'cases'],
['coronavirus', 'outbreak'],
['coronavirus', 'pandemic'],
['death', 'toll'],
['health', 'ministry'],
['health', 'said'],
['new', 'cases'],
['new', 'coronavirus'],
['news', 'agency'],
['prime', 'minister'],
['saudi', 'arabia'],
['south', 'korea'],
['spread', 'coronavirus'],
['tested', 'positive'],
['total', 'number'],
['united', 'states'],
['abu', 'dhabi'],
['al', 'jazeera'],
['confirmed', 'cases'],
['coronavirus', 'cases'],
['coronavirus', 'covid19'],
['coronavirus', 'outbreak'],
['coronavirus', 'pandemic'],
['covid19', 'pandemic'],
['death', 'toll'],
['new', 'cases'],
['new', 'coronavirus'],
['novel', 'coronavirus'],
['prime', 'minister'],
['saudi', 'arabia'],
['united', 'nations'],
['abu', 'dhabi'],
['bin', 'al'],
['confirmed', 'cases'],
['coronavirus', 'cases'],
['coronavirus', 'pandemic'],
['covid19', 'pandemic'],
['death', 'toll'],
['health', 'ministry'],
['million', 'people'],
['new', 'cases'],
['new', 'coronavirus'],
['novel', 'coronavirus'],
['prime', 'minister'],
['saudi', 'arabia'],
['sheikh', 'bin'],
['united', 'nations'],
['world', 'health'],
['abu', 'dhabi'],
['confirmed', 'cases'],
['coronavirus', 'cases'],
['coronavirus', 'pandemic'],
['covid19', 'pandemic'],
['death', 'toll'],
['humanitarian', 'aid'],
['million', 'people'],
['new', 'cases'],
['new', 'coronavirus'],
['novel', 'coronavirus'],
['prime', 'minister'],
['saudi', 'arabia'],
['united', 'nations'],
['abu', 'dhabi'],
['around', 'world'],
['confirmed', 'cases'],
['coronavirus', 'cases'],
['coronavirus', 'pandemic'],
['covid19', 'cases'],
['covid19', 'pandemic'],
['million', 'people'],
['new', 'cases'],
['new', 'coronavirus'],
['per', 'cent'],
['saudi', 'arabia'],
['security', 'council'],
['united', 'nations'],
['abu', 'dhabi'],
['air', 'express'],
['air', 'india'],
['al', 'jazeera'],
['august', 'coronavirus'],
['beirut', 'blast'],
['beirut', 'port'],
['coronavirus', 'pandemic'],
['covid19', 'pandemic'],
['explosion', 'beirut'],
['french', 'president'],
['humanitarian', 'aid'],
['india', 'express'],
['least', 'people'],
['massive', 'explosion'],
['medical', 'aid'],
['new', 'cases'],
['president', 'macron'],
['prime', 'minister'],
['united', 'nations']]

bigramGulf.sort() # groupby only collapses adjacent duplicates, so sort first
bigramGulf = [pair for pair, _ in itertools.groupby(bigramGulf)] # Keep one copy of each pair
bigramGulf

[['abu', 'dhabi'],
 ['air', 'express'],
 ['air', 'india'],
 ['al', 'jazeera'],
 ['anjana', 'sankar'],
 ['app', 'available'],
 ['around', 'world'],
 ['asankar', 'khaleej'],
 ['asankar', 'kt'],
 ['assistant', 'editor'],
 ['august', 'coronavirus'],
 ['august', 'dubai'],
 ['august', 'government'],
 ['available', 'download'],
 ['beirut', 'blast'],
 ['beirut', 'port'],
 ['bin', 'al'],
 ['blog', 'syria'],
 ['blog', 'video'],
 ['cases', 'coronavirus'],
 ['cases', 'recoveries'],
 ['chinese', 'authorities'],
 ['chinese', 'people'],
 ['citizens', 'wuhan'],
 ['combating', 'coronavirus'],
 ['conferencesupcoming', 'events'],
 ['conferencesupcoming', 'eventspast'],
 ['confirmed', 'cases'],
 ['cookbook', 'remote'],
 ['cookbook', 'workforce'],
 ['coronavirus', 'cases'],
 ['coronavirus', 'combating'],
 ['coronavirus', 'coronavirus'],
 ['coronavirus', 'covid19'],
 ['coronavirus', 'outbreak'],
 ['coronavirus', 'pandemic'],
 ['coronavirus', 'reports'],
 ['coronavirus', 'uae'],
 ['countries', 'confirmed'],
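
If the original order of the pairs matters (for example, to keep the months in sequence), duplicates can also be removed without sorting. The following is a small sketch of that alternative; `unique_pairs` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def unique_pairs(pairs):
    """Drop duplicate pairs while keeping the first occurrence of each."""
    # Lists are not hashable, so convert each pair to a tuple;
    # dict keys preserve insertion order in Python 3.7+.
    return [list(pair) for pair in dict.fromkeys(tuple(p) for p in pairs)]

sample = [["new", "cases"], ["abu", "dhabi"], ["new", "cases"], ["abu", "dhabi"]]
deduped = unique_pairs(sample)
print(deduped)
```

Unlike the sort-then-`groupby` approach, this keeps `['new', 'cases']` ahead of `['abu', 'dhabi']` because it appeared first.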


In [156]:
getAllBigrams('New Global Media Players')

Processing Bigrams for New Global Media Players


['cases', 'reported'],
['democratic', 'republic'],
['dr', 'congo'],
['ebola', 'cases'],
['ebola', 'outbreak'],
['recent', 'outbreak'],
['west', 'africa'],
['aid', 'coronavirus'],
['attend', 'medical'],
['attend', 'training'],
['australian', 'open'],
['body', 'checked'],
['body', 'temperature'],
['checked', 'entering'],
['china', 'shaanxi'],
['chinese', 'new'],
['chinese', 'year'],
['combating', 'coronavirus'],
['combating', 'novel'],
['coronavirus', 'control'],
['coronavirus', 'wuhan'],
['coronavirus', 'xian'],
['curb', 'spread'],
['east', 'peace'],
['east', 'plan'],
['entering', 'subway'],
['eve', 'body'],
['eve', 'temperature'],
['focus', 'xi'],
['frontline', 'combating'],
['humanitarian', 'assistance'],
['jan', 'xinhua'],
['measures', 'stepped'],
['medical', 'staff'],
['medical', 'supplies'],
['medical', 'training'],
['medical', 'work'],
['million', 'yuan'],
['new', 'eve'],
['new', 'virus'],
['new', 'year'],
['novel', 'coronavirus'],

In [157]:
bigramMP = [['cases', 'reported'],
['democratic', 'republic'],
['dr', 'congo'],
['ebola', 'cases'],
['ebola', 'outbreak'],
['recent', 'outbreak'],
['west', 'africa'],
['aid', 'coronavirus'],
['attend', 'medical'],
['attend', 'training'],
['australian', 'open'],
['body', 'checked'],
['body', 'temperature'],
['checked', 'entering'],
['china', 'shaanxi'],
['chinese', 'new'],
['chinese', 'year'],
['combating', 'coronavirus'],
['combating', 'novel'],
['coronavirus', 'control'],
['coronavirus', 'wuhan'],
['coronavirus', 'xian'],
['curb', 'spread'],
['east', 'peace'],
['east', 'plan'],
['entering', 'subway'],
['eve', 'body'],
['eve', 'temperature'],
['focus', 'xi'],
['frontline', 'combating'],
['humanitarian', 'assistance'],
['jan', 'xinhua'],
['measures', 'stepped'],
['medical', 'staff'],
['medical', 'supplies'],
['medical', 'training'],
['medical', 'work'],
['million', 'yuan'],
['new', 'eve'],
['new', 'virus'],
['new', 'year'],
['novel', 'coronavirus'],
['nw', 'china'],
['peace', 'plan'],
['people', 'stick'],
['posts', 'chinese'],
['prevention', 'measures'],
['prevention', 'stepped'],
['qingdao', 'scenery'],
['qingdao', 'snow'],
['russian', 'foreign'],
['russian', 'ministry'],
['scenery', 'guizhou'],
['security', 'council'],
['shanghai', 'attend'],
['shanghai', 'medical'],
['snow', 'scenery'],
['spread', 'novel'],
['spring', 'festival'],
['staff', 'shanghai'],
['staff', 'work'],
['stick', 'posts'],
['subway', 'qingdao'],
['temperature', 'checked'],
['training', 'wuhan'],
['united', 'nations'],
['work', 'frontline'],
['wuhan', 'people'],
['wuhan', 'stick'],
['xian', 'china'],
['xian', 'nw'],
['xinhua', 'photos'],
['year', 'body'],
['year', 'eve'],
['chinese', 'people'],
['coronavirus', 'outbreak'],
['foreign', 'ministry'],
['health', 'organization'],
['hubei', 'province'],
['humanitarian', 'aid'],
['medical', 'supplies'],
['novel', 'coronavirus'],
['public', 'health'],
['russian', 'ministry'],
['united', 'nations'],
['united', 'states'],
['vladimir', 'putin'],
['world', 'health'],
['world', 'organization'],
['coronavirus', 'outbreak'],
['coronavirus', 'pandemic'],
['death', 'toll'],
['foreign', 'ministry'],
['health', 'ministry'],
['health', 'organization'],
['iranian', 'foreign'],
['iranian', 'ministry'],
['islamic', 'republic'],
['medical', 'equipment'],
['namaki', 'said'],
['novel', 'coronavirus'],
['plan', 'include'],
['sanctions', 'iran'],
['united', 'nations'],
['united', 'states'],
['us', 'sanctions'],
['world', 'health'],
['world', 'organization'],
['across', 'country'],
['coronavirus', 'outbreak'],
['coronavirus', 'pandemic'],
['covid19', 'pandemic'],
['covid19', 'virus'],
['death', 'toll'],
['foreign', 'ministry'],
['health', 'ministry'],
['health', 'organization'],
['iranian', 'ministry'],
['medical', 'equipment'],
['medical', 'supplies'],
['namaki', 'said'],
['novel', 'coronavirus'],
['united', 'nations'],
['united', 'states'],
['us', 'sanctions'],
['world', 'health'],
['world', 'organization'],
['coronavirus', 'outbreak'],
['coronavirus', 'pandemic'],
['covid19', 'pandemic'],
['death', 'toll'],
['health', 'ministry'],
['health', 'organization'],
['medical', 'supplies'],
['new', 'cases'],
['novel', 'coronavirus'],
['united', 'nations'],
['united', 'states'],
['world', 'health'],
['world', 'organization'],
['coronavirus', 'pandemic'],
['covid19', 'pandemic'],
['death', 'toll'],
['foreign', 'ministry'],
['health', 'ministry'],
['human', 'rights'],
['international', 'community'],
['medical', 'supplies'],
['new', 'cases'],
['novel', 'coronavirus'],
['past', 'hours'],
['prevention', 'control'],
['public', 'health'],
['united', 'nations'],
['united', 'states'],
['world', 'health'],
['coronavirus', 'cases'],
['coronavirus', 'pandemic'],
['covid19', 'cases'],
['covid19', 'pandemic'],
['death', 'toll'],
['foreign', 'ministry'],
['health', 'ministry'],
['health', 'organization'],
['humanitarian', 'aid'],
['million', 'people'],
['new', 'cases'],
['novel', 'coronavirus'],
['past', 'hours'],
['security', 'council'],
['united', 'nations'],
['united', 'states'],
['world', 'health'],
['world', 'organization'],
['bringing', 'country'],
['confirmed', 'new'],
['coronavirus', 'bringing'],
['coronavirus', 'cases'],
['coronavirus', 'infections'],
['coronavirus', 'pandemic'],
['coronavirus', 'patients'],
['coronavirus', 'vaccine'],
['country', 'number'],
['country', 'official'],
['covid19', 'cases'],
['covid19', 'pandemic'],
['death', 'toll'],
['foreign', 'ministry'],
['health', 'ministry'],
['human', 'rights'],
['infections', 'bringing'],
['mayor', 'sergei'],
['ministry', 'said'],
['new', 'cases'],
['new', 'coronavirus'],
['new', 'infections'],
['news', 'reported'],
['number', 'cases'],
['official', 'number'],
['president', 'putin'],
['president', 'vladimir'],
['prime', 'minister'],
['united', 'nations'],
['ussia', 'confirmed'],
['vladimir', 'putin']]
bigramMP.sort()
list(bigram for bigram, _ in itertools.groupby(bigramMP)) # Keep one copy of each bigram (groupby merges adjacent duplicates in the sorted list)

[['across', 'country'],
 ['aid', 'coronavirus'],
 ['attend', 'medical'],
 ['attend', 'training'],
 ['australian', 'open'],
 ['body', 'checked'],
 ['body', 'temperature'],
 ['bringing', 'country'],
 ['cases', 'reported'],
 ['checked', 'entering'],
 ['china', 'shaanxi'],
 ['chinese', 'new'],
 ['chinese', 'people'],
 ['chinese', 'year'],
 ['combating', 'coronavirus'],
 ['combating', 'novel'],
 ['confirmed', 'new'],
 ['coronavirus', 'bringing'],
 ['coronavirus', 'cases'],
 ['coronavirus', 'control'],
 ['coronavirus', 'infections'],
 ['coronavirus', 'outbreak'],
 ['coronavirus', 'pandemic'],
 ['coronavirus', 'patients'],
 ['coronavirus', 'vaccine'],
 ['coronavirus', 'wuhan'],
 ['coronavirus', 'xian'],
 ['country', 'number'],
 ['country', 'official'],
 ['covid19', 'cases'],
 ['covid19', 'pandemic'],
 ['covid19', 'virus'],
 ['curb', 'spread'],
 ['death', 'toll'],
 ['democratic', 'republic'],
 ['dr', 'congo'],
 ['east', 'peace'],
 ['east', 'plan'],
 ['ebola', 'cases'],
 ['ebola', 'outbreak'],
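
The sort-then-`groupby` step above can be wrapped in a small helper so the pattern is not retyped for every region. This is only a minimal sketch: `dedupeBigrams` is a hypothetical name that does not appear elsewhere in this notebook, and the sample list is a hand-picked subset of the bigrams printed above. `itertools.groupby` only merges adjacent equal items, which is why the list is sorted first.

```python
import itertools

def dedupeBigrams(bigrams):
    """Return the unique bigrams in sorted order (hypothetical helper)."""
    ordered = sorted(bigrams)  # groupby only merges *adjacent* duplicates
    # Keep the group key (one bigram per group), not the input list itself
    return [bigram for bigram, _ in itertools.groupby(ordered)]

# Hand-picked subset of the bigrams above, with deliberate repeats
sample = [['novel', 'coronavirus'],
          ['death', 'toll'],
          ['novel', 'coronavirus'],
          ['death', 'toll'],
          ['humanitarian', 'aid']]

unique = dedupeBigrams(sample)
print(unique)
```

The comprehension must yield the loop variable (the group key). Yielding the input list instead would return one full copy of the list per group.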


In [172]:
getAllBigrams('Euro-Atlantic Countries')

Processing Bigrams for Euro-Atlantic Countries


['affordable', 'housing'],
['aid', 'agencies'],
['city', 'council'],
['city', 'officials'],
['donald', 'trump'],
['holiday', 'island'],
['killed', 'people'],
['new', 'new'],
['new', 'york'],
['north', 'korean'],
['officials', 'said'],
['san', 'francisco'],
['blocktime', 'publishedtime'],
['blocktime', 'updatedtimeupdated'],
['chinese', 'authorities'],
['chinese', 'city'],
['christmas', 'island'],
['city', 'wuhan'],
['confirmed', 'cases'],
['coronavirus', 'outbreak'],
['hong', 'kong'],
['hubei', 'province'],
['january', '2020'],
['million', 'people'],
['new', 'coronavirus'],
['new', 'virus'],
['new', 'year'],
['north', 'korea'],
['public', 'health'],
['publishedtime', 'gmt'],
['united', 'states'],
['world', 'health'],
['1', 'hr'],
['begin', 'clip'],
['begin', 'video'],
['bernie', 'sanders'],
['blocktime', 'publishedtime'],
['coronavirus', 'outbreak'],
['cruise', 'ship'],
['diamond', 'princess'],
['donald', 'trump'],
['end', 'clip'],
['end

In [161]:
bigramsEAC = [['affordable', 'housing'],
['aid', 'agencies'],
['city', 'council'],
['city', 'officials'],
['donald', 'trump'],
['holiday', 'island'],
['killed', 'people'],
['new', 'new'],
['new', 'york'],
['north', 'korean'],
['officials', 'said'],
['san', 'francisco'],
['blocktime', 'publishedtime'],
['blocktime', 'updatedtimeupdated'],
['chinese', 'authorities'],
['chinese', 'city'],
['christmas', 'island'],
['city', 'wuhan'],
['confirmed', 'cases'],
['coronavirus', 'outbreak'],
['hong', 'kong'],
['hubei', 'province'],
['million', 'people'],
['new', 'coronavirus'],
['new', 'virus'],
['new', 'year'],
['north', 'korea'],
['public', 'health'],
['publishedtime', 'gmt'],
['united', 'states'],
['world', 'health'],
['begin', 'clip'],
['begin', 'video'],
['bernie', 'sanders'],
['blocktime', 'publishedtime'],
['coronavirus', 'outbreak'],
['cruise', 'ship'],
['diamond', 'princess'],
['donald', 'trump'],
['end', 'clip'],
['end', 'video'],
['hong', 'kong'],
['hr', 'mins'],
['new', 'york'],
['north', 'korea'],
['president', 'trump'],
['public', 'health'],
['publishedtime', 'gmt'],
['south', 'korea'],
['united', 'states'],
['video', 'clip'],
['blocktime', 'publishedtime'],
['confirmed', 'cases'],
['coronavirus', 'cases'],
['coronavirus', 'outbreak'],
['coronavirus', 'pandemic'],
['death', 'toll'],
['new', 'cases'],
['new', 'york'],
['prime', 'minister'],
['public', 'health'],
['publishedtime', 'gmt'],
['social', 'distancing'],
['stay', 'home'],
['tested', 'positive'],
['united', 'states'],
['around', 'world'],
['blocktime', 'publishedtime'],
['confirmed', 'cases'],
['coronavirus', 'pandemic'],
['death', 'toll'],
['donald', 'trump'],
['health', 'organization'],
['last', 'week'],
['new', 'cases'],
['new', 'york'],
['president', 'trump'],
['prime', 'minister'],
['public', 'health'],
['publishedtime', 'bst'],
['social', 'distancing'],
['tested', 'positive'],
['united', 'states'],
['white', 'house'],
['world', 'health'],
['world', 'organization'],
['blocktime', 'publishedtime'],
['confirmed', 'cases'],
['coronavirus', 'cases'],
['coronavirus', 'pandemic'],
['death', 'toll'],
['donald', 'trump'],
['health', 'organization'],
['million', 'people'],
['new', 'cases'],
['new', 'york'],
['prime', 'minister'],
['public', 'health'],
['publishedtime', 'bst'],
['social', 'distancing'],
['united', 'states'],
['white', 'house'],
['world', 'health'],
['world', 'organization'],
['blocktime', 'publishedtime'],
['confirmed', 'cases'],
['coronavirus', 'cases'],
['coronavirus', 'pandemic'],
['death', 'toll'],
['million', 'people'],
['new', 'cases'],
['new', 'york'],
['prime', 'minister'],
['public', 'health'],
['publishedtime', 'bst'],
['social', 'distancing'],
['tested', 'positive'],
['united', 'states'],
['world', 'health'],
['blocktime', 'publishedtime'],
['blocktime', 'updatedtimeupdated'],
['bst', 'blocktime'],
['bst', 'publishedtime'],
['confirmed', 'cases'],
['coronavirus', 'cases'],
['coronavirus', 'pandemic'],
['death', 'toll'],
['million', 'people'],
['new', 'cases'],
['new', 'york'],
['prime', 'minister'],
['public', 'health'],
['publishedtime', 'bst'],
['social', 'distancing'],
['tested', 'positive'],
['united', 'states'],
['ammonium', 'nitrate'],
['begin', 'clip'],
['begin', 'video'],
['donald', 'trump'],
['end', 'clip'],
['end', 'video'],
['new', 'cases'],
['president', 'trump'],
['prime', 'minister'],
['united', 'states'],
['video', 'clip']]
bigramsEAC.sort()
list(bigram for bigram, _ in itertools.groupby(bigramsEAC)) # Keep one copy of each bigram

[['affordable', 'housing'],
 ['aid', 'agencies'],
 ['ammonium', 'nitrate'],
 ['around', 'world'],
 ['begin', 'clip'],
 ['begin', 'video'],
 ['bernie', 'sanders'],
 ['blocktime', 'publishedtime'],
 ['blocktime', 'updatedtimeupdated'],
 ['bst', 'blocktime'],
 ['bst', 'publishedtime'],
 ['chinese', 'authorities'],
 ['chinese', 'city'],
 ['christmas', 'island'],
 ['city', 'council'],
 ['city', 'officials'],
 ['city', 'wuhan'],
 ['confirmed', 'cases'],
 ['coronavirus', 'cases'],
 ['coronavirus', 'outbreak'],
 ['coronavirus', 'pandemic'],
 ['cruise', 'ship'],
 ['death', 'toll'],
 ['diamond', 'princess'],
 ['donald', 'trump'],
 ['end', 'clip'],
 ['end', 'video'],
 ['health', 'organization'],
 ['holiday', 'island'],
 ['hong', 'kong'],
 ['hr', 'mins'],
 ['hubei', 'province'],
 ['killed', 'people'],
 ['last', 'week'],
 ['million', 'people'],
 ['new', 'cases'],
 ['new', 'coronavirus'],
 ['new', 'new'],
 ['new', 'virus'],
 ['new', 'year'],
 ['new', 'york'],
 ['north', 'korea'],
 ['north', 'korean'

## Trigrams to Bigrams Conversion

The network graph takes bigrams (pairs of connected words) as input, but our analysis shows that trigrams make more sense and give more context. So we will build the bigrams for the Highcharts network graph from trigrams: each trigram is split into two overlapping bigrams, and Highcharts joins them back together through the shared middle word. For example, the trigram `('ds4d', 'is', 'cool')` will give us two bigrams, `('ds4d', 'is')` and `('is', 'cool')`.
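
The conversion described above can be sketched in a few lines. `trigramToBigrams` is a hypothetical name used only for this illustration; the notebook's own functions below print the pairs instead of returning them.

```python
def trigramToBigrams(trigram):
    """Split a 3-word tuple into its two overlapping 2-word edges."""
    first, middle, last = trigram
    return [(first, middle), (middle, last)]

edges = trigramToBigrams(('ds4d', 'is', 'cool'))
print(edges)
```

Both pairs share the middle word `'is'`, so a network graph that treats each pair as an edge draws them as one connected chain of three nodes.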

In [34]:
# Declaring a vocabulary of stop words that I don't need in the bigrams
myStopWords = ["i", "the", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

# Get the top N trigrams for one month

def getBigramsFromTri(month, region):
    dfYrList = combinedTextForCountryDf(region) # Combine text into months
    yrDf = pd.DataFrame(dfYrList) # Make a dataframe
    text = yrDf['text'][month] # Get the text for trigrams
    finder = TrigramCollocationFinder.from_words(text.split(), window_size = 3) # Finds trigrams
    finder.apply_freq_filter(3) # Removes candidate trigrams that occur fewer than 3 times
    trigram_measures = nltk.collocations.TrigramAssocMeasures()
    return sorted(finder.nbest(trigram_measures.raw_freq, 20)) # Takes the 20 most frequent trigrams, then sorts them alphabetically

# Loop through the months of a region and make bigrams from trigrams

def getAllBigramsFromTri(region):
    print('Processing Bigrams for', region)
    print('\n')
    for index, month in enumerate(yearMonthsWord):
        trigrams = getBigramsFromTri(index, region) # Get trigrams by months in a region
        for bg in trigrams:
            if any(word in myStopWords for word in bg): # Minus the stop words
                continue
            if any(word.isnumeric() for word in bg): # We don't need numbers/dates etc
                continue
            print("['"+bg[0]+"', '"+bg[1]+"'],") # Make a list format that can be copy-pasted into JavaScript code
            print("['"+bg[1]+"', '"+bg[2]+"'],") # Second bigram of the trigram, same format
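
To make the counting step in `getBigramsFromTri` easier to follow, here is a rough pure-Python approximation of what the NLTK finder does with these settings: count contiguous trigrams, drop the rare ones, keep the most frequent, then sort alphabetically. This is a sketch under assumptions, not the NLTK implementation. `topTrigrams`, `minFreq`, `topN` and the toy text are all made up for illustration, and tie-breaking between equally frequent trigrams may differ from NLTK.

```python
from collections import Counter

def topTrigrams(text, minFreq=3, topN=20):
    """Approximate the frequency filter + n-best steps with a plain Counter."""
    words = text.split()
    # Count every run of three consecutive words
    counts = Counter(zip(words, words[1:], words[2:]))
    # Drop trigrams seen fewer than minFreq times (like apply_freq_filter)
    frequent = Counter({tri: c for tri, c in counts.items() if c >= minFreq})
    # Keep the topN most frequent, then sort alphabetically like the notebook does
    return sorted(tri for tri, _ in frequent.most_common(topN))

# Toy text: one phrase repeated three times plus a one-off phrase
toyText = ' '.join(['world health organization said'] * 3
                   + ['novel coronavirus outbreak'])
result = topTrigrams(toyText)
print(result)
```

Only trigrams that occur at least three times survive the filter, so the one-off phrase at the end of the toy text never reaches the output.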

In [35]:
getAllBigramsFromTri('Gulf Countries')

Processing Bigrams for Gulf Countries


['coronavirus', 'pandemic'],
['pandemic', 'coronavirus'],
['homeless', 'due'],
['due', 'war'],
['khaleej', 'times'],
['times', 'khaleejtimes'],
['read', 'coronavirus'],
['coronavirus', 'pandemic'],
['syria', 'video'],
['video', 'asankar'],
['times', 'khaleejtimes'],
['khaleejtimes', 'december'],
['video', 'asankar'],
['asankar', 'kt'],
['afp', 'news'],
['news', 'agency'],
['case', 'new'],
['new', 'coronavirus'],
['centers', 'disease'],
['disease', 'control'],
['china', 'national'],
['national', 'health'],
['chinese', 'football'],
['football', 'association'],
['confirmed', 'cases'],
['cases', 'virus'],
['countries', 'confirmed'],
['confirmed', 'cases'],
['disease', 'control'],
['control', 'prevention'],
['evacuate', 'citizens'],
['citizens', 'wuhan'],
['first', 'case'],
['case', 'coronavirus'],
['flights', 'mainland'],
['mainland', 'china'],
['longdistance', 'bus'],
['bus', 'services'],
['lunar', 'new'],
['new', 'year'],
['read', 'countries'],
['c

I can see that the UAE is the most prominent news source for the Gulf countries. However, to give a clearer picture, we should show the relationships between words from all countries in the bloc. So I'll modify the functions to work with individual countries.

In [44]:
mainDf = cdf # This is the combined DF for all countries
mainDf['text'] = mainDf['text'].apply(cleanTextInDf) # .apply() the cleaning function to all cells in the column
mainDf['yearmonth'] = mainDf.apply(checkYearMonth, axis=1) # Derive yearmonth from each row of mainDf

In [59]:
# Revising my stop word vocabulary

myStopWords = ["reuters", "news", "video", "clip", "read", "khaleej", "khaleejtimes", "i", "the", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]


# Revising this function to change df to mainDf (which is my new df)

def combinedTextForCountryDf(country):
    dfCountryYrList = []

    for ym in yearMonths:
        combinedText = ' '.join(mainDf[(mainDf['yearmonth'] == ym) & (mainDf['country'] == country)].text)
        dictCountryYr = {'country': country, 'yearmonth': ym, 'text': combinedText}
        dfCountryYrList.append(dictCountryYr)
    return dfCountryYrList

# Revising this function to get only the top 8 results

def getBigramsFromTri(month, region):
    dfYrList = combinedTextForCountryDf(region) # Combine text into months
    yrDf = pd.DataFrame(dfYrList) # Make a dataframe
    text = yrDf['text'][month] # Get the text for trigrams
    finder = TrigramCollocationFinder.from_words(text.split(), window_size = 3) # Finds trigrams
    finder.apply_freq_filter(3) # Removes candidate trigrams that occur fewer than 3 times
    trigram_measures = nltk.collocations.TrigramAssocMeasures()
    return sorted(finder.nbest(trigram_measures.raw_freq, 8)) # Takes the 8 most frequent trigrams, then sorts them alphabetically

# Revising this function to take a list of countries and print a combined array for JS

def getAllBigramsFromTri(countries):
    for country in countries:
        for index, month in enumerate(yearMonthsWord):
            trigrams = getBigramsFromTri(index, country) # Get trigrams by months for a country
            for bg in trigrams:
                if any(word in myStopWords for word in bg): # Minus the stop words
                    continue
                if any(word.isnumeric() for word in bg): # We don't need numbers/dates etc
                    continue
                print("['"+bg[0]+"', '"+bg[1]+"'],") # Make a list format that can be copy-pasted into JavaScript code
                print("['"+bg[1]+"', '"+bg[2]+"'],") # Second bigram of the trigram, same format
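
As an alternative to printing the pairs and de-duplicating them by hand in a later cell, the same filtering could return a list and serialise it in one go. This is a hedged sketch rather than the approach used in this notebook: `trigramsToEdges` is a hypothetical helper, the sample trigrams and the two-word stop list are hand-made for illustration, and `json.dumps` is assumed to produce an array of pairs that can be pasted into the JavaScript code in the same way as the printed lists.

```python
import json

def trigramsToEdges(trigrams, stopWords):
    """Turn trigrams into unique [from, to] pairs, skipping stop words and numbers."""
    edges = []
    seen = set()
    for tri in trigrams:
        # Skip the whole trigram if any word is a stop word or a number
        if any(word in stopWords or word.isnumeric() for word in tri):
            continue
        # Each trigram contributes two overlapping bigrams
        for edge in ((tri[0], tri[1]), (tri[1], tri[2])):
            if edge not in seen:  # Drop duplicates, keep first-seen order
                seen.add(edge)
                edges.append(list(edge))
    return edges

# Hand-made sample trigrams for illustration only
sampleTrigrams = [('humanitarian', 'aid', 'relief'),
                  ('king', 'salman', 'humanitarian'),
                  ('salman', 'humanitarian', 'aid'),
                  ('reuters', 'news', 'agency'),
                  ('january', '2020', 'report')]

edges = trigramsToEdges(sampleTrigrams, stopWords={'reuters', 'news'})
jsArray = json.dumps(edges)  # JSON array syntax is also valid JavaScript
print(jsArray)
```

Tracking seen pairs in a set keeps the edges in the order they first appear, whereas the sort-and-`groupby` approach returns them alphabetically. Either order should work for a network graph, since it only needs the set of edges.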

In [57]:
getAllBigramsFromTri(['SA', 'QA', 'AE', 'KW']) # Gulf donors

['aid', 'relief'],
['relief', 'center'],
['custodian', 'two'],
['two', 'holy'],
['humanitarian', 'aid'],
['aid', 'relief'],
['king', 'salman'],
['salman', 'humanitarian'],
['salman', 'humanitarian'],
['humanitarian', 'aid'],
['two', 'holy'],
['holy', 'mosques'],
['custodian', 'two'],
['two', 'holy'],
['holy', 'mosques'],
['mosques', 'king'],
['mosques', 'king'],
['king', 'salman'],
['told', 'arab'],
['arab', 'news'],
['two', 'holy'],
['holy', 'mosques'],
['world', 'health'],
['health', 'organization'],
['coronavirus', 'covid19'],
['covid19', 'pandemic'],
['coronavirus', 'disease'],
['disease', 'covid19'],
['kingdom', 'saudi'],
['saudi', 'arabia'],
['told', 'arab'],
['arab', 'news'],
['two', 'holy'],
['holy', 'mosques'],
['world', 'health'],
['health', 'organization'],
['aid', 'relief'],
['relief', 'center'],
['gamers', 'without'],
['without', 'borders'],
['humanitarian', 'aid'],
['aid', 'relief'],
['king', 'salman'],
['salman', 'humanitarian'],
['salman', 'humanitarian'],
['humanitaria

In [58]:
# remove duplicates
listToProcess = [['aid', 'relief'],
['relief', 'center'],
['custodian', 'two'],
['two', 'holy'],
['humanitarian', 'aid'],
['aid', 'relief'],
['king', 'salman'],
['salman', 'humanitarian'],
['salman', 'humanitarian'],
['humanitarian', 'aid'],
['two', 'holy'],
['holy', 'mosques'],
['custodian', 'two'],
['two', 'holy'],
['holy', 'mosques'],
['mosques', 'king'],
['mosques', 'king'],
['king', 'salman'],
['told', 'arab'],
['arab', 'news'],
['two', 'holy'],
['holy', 'mosques'],
['world', 'health'],
['health', 'organization'],
['coronavirus', 'covid19'],
['covid19', 'pandemic'],
['coronavirus', 'disease'],
['disease', 'covid19'],
['kingdom', 'saudi'],
['saudi', 'arabia'],
['told', 'arab'],
['arab', 'news'],
['two', 'holy'],
['holy', 'mosques'],
['world', 'health'],
['health', 'organization'],
['aid', 'relief'],
['relief', 'center'],
['gamers', 'without'],
['without', 'borders'],
['humanitarian', 'aid'],
['aid', 'relief'],
['king', 'salman'],
['salman', 'humanitarian'],
['salman', 'humanitarian'],
['humanitarian', 'aid'],
['world', 'health'],
['health', 'organization'],
['gamers', 'without'],
['without', 'borders'],
['humanitarian', 'aid'],
['aid', 'relief'],
['king', 'salman'],
['salman', 'humanitarian'],
['kingdom', 'saudi'],
['saudi', 'arabia'],
['salman', 'humanitarian'],
['humanitarian', 'aid'],
['told', 'arab'],
['arab', 'news'],
['aid', 'relief'],
['relief', 'center'],
['coronavirus', 'disease'],
['disease', 'covid19'],
['humanitarian', 'aid'],
['aid', 'relief'],
['told', 'arab'],
['arab', 'news'],
['un', 'security'],
['security', 'council'],
['world', 'health'],
['health', 'organization'],
['aid', 'relief'],
['relief', 'center'],
['aid', 'yemeni'],
['yemeni', 'city'],
['fourth', 'fifth'],
['fifth', 'batches'],
['humanitarian', 'aid'],
['aid', 'relief'],
['king', 'salman'],
['salman', 'humanitarian'],
['salman', 'humanitarian'],
['humanitarian', 'aid'],
['centers', 'disease'],
['disease', 'control'],
['confirmed', 'cases'],
['cases', 'virus'],
['countries', 'confirmed'],
['confirmed', 'cases'],
['lunar', 'new'],
['new', 'year'],
['spread', 'new'],
['new', 'coronavirus'],
['airways', 'group'],
['group', 'chief'],
['chinese', 'embassy'],
['embassy', 'qatar'],
['diamond', 'princess'],
['princess', 'cruise'],
['princess', 'cruise'],
['cruise', 'ship'],
['qatar', 'airways'],
['airways', 'cargo'],
['world', 'health'],
['health', 'organization'],
['bringing', 'total'],
['total', 'number'],
['confirmed', 'coronavirus'],
['coronavirus', 'cases'],
['new', 'coronavirus'],
['coronavirus', 'cases'],
['president', 'donald'],
['donald', 'trump'],
['reuters', 'news'],
['news', 'agency'],
['world', 'health'],
['health', 'organization'],
['health', 'ministry'],
['ministry', 'said'],
['new', 'coronavirus'],
['coronavirus', 'cases'],
['president', 'donald'],
['donald', 'trump'],
['told', 'al'],
['al', 'jazeera'],
['us', 'president'],
['president', 'donald'],
['world', 'health'],
['health', 'organization'],
['johns', 'hopkins'],
['hopkins', 'university'],
['new', 'coronavirus'],
['coronavirus', 'cases'],
['president', 'donald'],
['donald', 'trump'],
['reuters', 'news'],
['news', 'agency'],
['us', 'president'],
['president', 'donald'],
['world', 'health'],
['health', 'organization'],
['confirmed', 'coronavirus'],
['coronavirus', 'cases'],
['johns', 'hopkins'],
['hopkins', 'university'],
['new', 'coronavirus'],
['coronavirus', 'cases'],
['new', 'coronavirus'],
['coronavirus', 'infections'],
['world', 'health'],
['health', 'organization'],
['confirmed', 'coronavirus'],
['coronavirus', 'cases'],
['johns', 'hopkins'],
['hopkins', 'university'],
['new', 'coronavirus'],
['coronavirus', 'cases'],
['news', 'agency'],
['agency', 'reported'],
['world', 'health'],
['health', 'organization'],
['french', 'president'],
['president', 'emmanuel'],
['massive', 'explosion'],
['explosion', 'beirut'],
['president', 'emmanuel'],
['emmanuel', 'macron'],
['president', 'michel'],
['michel', 'aoun'],
['bin', 'zayed'],
['zayed', 'al'],
['emirates', 'humanitarian'],
['humanitarian', 'city'],
['mohamed', 'bin'],
['bin', 'zayed'],
['sheikh', 'mohamed'],
['mohamed', 'bin'],
['world', 'health'],
['health', 'organisation'],
['mohammed', 'bin'],
['bin', 'rashid'],
['bin', 'rashid'],
['rashid', 'al'],
['mohammed', 'bin'],
['bin', 'rashid'],
['rashid', 'al'],
['al', 'maktoum'],
['minister', 'mustafa'],
['mustafa', 'al'],
['mustafa', 'al'],
['al', 'kadhimi'],
['prime', 'minister'],
['minister', 'mustafa'],
['world', 'refugee'],
['refugee', 'day'],
['air', 'india'],
['india', 'express'],
['arab', 'times'],
['times', 'kuwait'],
['health', 'ministry'],
['ministry', 'said'],
['highness', 'prime'],
['prime', 'minister'],
['kuwait', 'english'],
['english', 'daily'],
['arab', 'times'],
['times', 'kuwait'],
['kuwait', 'english'],
['english', 'daily'],
['kuwait', 'red'],
['red', 'crescent'],
['red', 'crescent'],
['crescent', 'society'],
['alina', 'l'],
['l', 'romanowski'],
['brig', 'deepak'],
['deepak', 'sreevastava'],
['jleeb', 'alshuyoukh'],
['alshuyoukh', 'mahboula'],
['rapid', 'response'],
['response', 'team'],
['together', 'build'],
['build', 'better'],
['united', 'states'],
['states', 'kuwait'],
['foreign', 'affairs'],
['affairs', 'kuwait'],
['highlevel', 'policy'],
['policy', 'dialogue'],
['ambassador', 'plenipotentiary'],
['plenipotentiary', 'cooperative'],
['cooperative', 'republic'],
['republic', 'guyana'],
['dr', 'ahmad'],
['ahmad', 'nasser'],
['dr', 'shamir'],
['shamir', 'ali'],
['eu', 'delegation'],
['delegation', 'kuwait'],
['sheikh', 'dr'],
['dr', 'ahmad'],
['novel', 'coronavirus'],
['coronavirus', 'covid19']]
listToProcess.sort()
list(bigram for bigram, _ in itertools.groupby(listToProcess)) # Keep one copy of each bigram

[['affairs', 'kuwait'],
 ['agency', 'reported'],
 ['ahmad', 'nasser'],
 ['aid', 'relief'],
 ['aid', 'yemeni'],
 ['air', 'india'],
 ['airways', 'cargo'],
 ['airways', 'group'],
 ['al', 'jazeera'],
 ['al', 'kadhimi'],
 ['al', 'maktoum'],
 ['alina', 'l'],
 ['alshuyoukh', 'mahboula'],
 ['ambassador', 'plenipotentiary'],
 ['arab', 'news'],
 ['arab', 'times'],
 ['bin', 'rashid'],
 ['bin', 'zayed'],
 ['brig', 'deepak'],
 ['bringing', 'total'],
 ['build', 'better'],
 ['cases', 'virus'],
 ['centers', 'disease'],
 ['chinese', 'embassy'],
 ['confirmed', 'cases'],
 ['confirmed', 'coronavirus'],
 ['cooperative', 'republic'],
 ['coronavirus', 'cases'],
 ['coronavirus', 'covid19'],
 ['coronavirus', 'disease'],
 ['coronavirus', 'infections'],
 ['countries', 'confirmed'],
 ['covid19', 'pandemic'],
 ['crescent', 'society'],
 ['cruise', 'ship'],
 ['custodian', 'two'],
 ['deepak', 'sreevastava'],
 ['delegation', 'kuwait'],
 ['diamond', 'princess'],
 ['disease', 'control'],
 ['disease', 'covid19'],
 ['dona

In [60]:
getAllBigramsFromTri(['US', 'UK', 'FR', 'DE']) # Euro-Atlantic Countries

['city', 'officials'],
['officials', 'said'],
['project', 'homeless'],
['homeless', 'connect'],
['sales', 'tax'],
['tax', 'revenue'],
['west', 'nile'],
['nile', 'virus'],
['centers', 'disease'],
['disease', 'control'],
['disease', 'control'],
['control', 'prevention'],
['juan', 'guaido'],
['guaido', 'translator'],
['lunar', 'new'],
['new', 'year'],
['minister', 'benjamin'],
['benjamin', 'netanyahu'],
['world', 'health'],
['health', 'organization'],
['world', 'health'],
['health', 'organization'],
['new', 'york'],
['york', 'city'],
['president', 'donald'],
['donald', 'trump'],
['world', 'health'],
['health', 'organization'],
['health', 'care'],
['care', 'workers'],
['new', 'york'],
['york', 'city'],
['new', 'york'],
['york', 'times'],
['personal', 'protective'],
['protective', 'equipment'],
['president', 'donald'],
['donald', 'trump'],
['world', 'health'],
['health', 'organization'],
['centers', 'disease'],
['disease', 'control'],
['disease', 'control'],
['control', 'prevention'],
['new

['camps', 'slums'],
['slums', 'dealing'],
['lilicoronavirus', 'consequences'],
['consequences', 'tourism'],
['lilicoronavirus', 'refugee'],
['refugee', 'camps'],
['president', 'donald'],
['donald', 'trump'],
['refugee', 'camps'],
['camps', 'slums'],
['slums', 'dealing'],
['dealing', 'hygiene'],
['us', 'president'],
['president', 'donald'],
['world', 'health'],
['health', 'organization'],
['among', 'highest'],
['highest', 'world'],
['schools', 'daycare'],
['daycare', 'centres'],
['start', 'air'],
['air', 'campaign'],
['tested', 'positive'],
['positive', 'virus'],
['bab', 'alhawa'],
['alhawa', 'hospital'],
['chancellor', 'angela'],
['angela', 'merkel'],
['coronavirus', 'recovery'],
['recovery', 'aid'],
['german', 'chancellor'],
['chancellor', 'angela'],
['un', 'security'],
['security', 'council'],
['tokyo', 'japan'],
['japan', 'releases'],


In [62]:
# remove duplicates
listToProcess = [['city', 'officials'],
['officials', 'said'],
['project', 'homeless'],
['homeless', 'connect'],
['sales', 'tax'],
['tax', 'revenue'],
['west', 'nile'],
['nile', 'virus'],
['centers', 'disease'],
['disease', 'control'],
['disease', 'control'],
['control', 'prevention'],
['juan', 'guaido'],
['guaido', 'translator'],
['lunar', 'new'],
['new', 'year'],
['minister', 'benjamin'],
['benjamin', 'netanyahu'],
['world', 'health'],
['health', 'organization'],
['world', 'health'],
['health', 'organization'],
['new', 'york'],
['york', 'city'],
['president', 'donald'],
['donald', 'trump'],
['world', 'health'],
['health', 'organization'],
['health', 'care'],
['care', 'workers'],
['new', 'york'],
['york', 'city'],
['new', 'york'],
['york', 'times'],
['personal', 'protective'],
['protective', 'equipment'],
['president', 'donald'],
['donald', 'trump'],
['world', 'health'],
['health', 'organization'],
['centers', 'disease'],
['disease', 'control'],
['disease', 'control'],
['control', 'prevention'],
['new', 'york'],
['york', 'city'],
['new', 'york'],
['york', 'times'],
['president', 'donald'],
['donald', 'trump'],
['world', 'health'],
['health', 'organization'],
['black', 'lives'],
['lives', 'matter'],
['disease', 'control'],
['control', 'prevention'],
['new', 'york'],
['york', 'city'],
['new', 'york'],
['york', 'times'],
['preceded', 'death'],
['death', 'parents'],
['world', 'health'],
['health', 'organization'],
['health', 'care'],
['care', 'workers'],
['new', 'york'],
['york', 'city'],
['new', 'york'],
['york', 'times'],
['president', 'donald'],
['donald', 'trump'],
['un', 'security'],
['security', 'council'],
['world', 'health'],
['health', 'organization'],
['cnn', 'senior'],
['senior', 'international'],
['president', 'donald'],
['donald', 'trump'],
['president', 'united'],
['united', 'states'],
['senior', 'international'],
['international', 'correspondent'],
['tons', 'ammonium'],
['ammonium', 'nitrate'],
['new', 'york'],
['york', 'new'],
['new', 'york'],
['york', 'usa'],
['north', 'korean'],
['korean', 'leader'],
['york', 'new'],
['new', 'york'],
['chinese', 'city'],
['city', 'wuhan'],
['city', 'wuhan'],
['wuhan', 'epicentre'],
['foreign', 'office'],
['office', 'advised'],
['gmt', 'blocktime'],
['blocktime', 'publishedtime'],
['new', 'south'],
['south', 'wales'],
['chief', 'medical'],
['medical', 'officer'],
['diamond', 'princess'],
['princess', 'cruise'],
['gmt', 'blocktime'],
['blocktime', 'publishedtime'],
['princess', 'cruise'],
['cruise', 'ship'],
['speech', 'text'],
['text', 'transcript1'],
['tested', 'positive'],
['positive', 'coronavirus'],
['tested', 'positive'],
['positive', 'virus'],
['world', 'health'],
['health', 'organization'],
['bst', 'blocktime'],
['blocktime', 'publishedtime'],
['chief', 'medical'],
['medical', 'officer'],
['gmt', 'blocktime'],
['blocktime', 'publishedtime'],
['new', 'york'],
['york', 'city'],
['speech', 'text'],
['text', 'transcript1'],
['tested', 'positive'],
['positive', 'coronavirus'],
['tested', 'positive'],
['positive', 'covid19'],
['world', 'health'],
['health', 'organization'],
['bst', 'blocktime'],
['blocktime', 'publishedtime'],
['johns', 'hopkins'],
['hopkins', 'university'],
['new', 'york'],
['york', 'city'],
['personal', 'protective'],
['protective', 'equipment'],
['president', 'donald'],
['donald', 'trump'],
['speech', 'text'],
['text', 'transcript1'],
['world', 'health'],
['health', 'organization'],
['according', 'johns'],
['johns', 'hopkins'],
['bst', 'blocktime'],
['blocktime', 'publishedtime'],
['johns', 'hopkins'],
['hopkins', 'university'],
['number', 'confirmed'],
['confirmed', 'cases'],
['president', 'donald'],
['donald', 'trump'],
['us', 'president'],
['president', 'donald'],
['world', 'health'],
['health', 'organization'],
['according', 'johns'],
['johns', 'hopkins'],
['black', 'lives'],
['lives', 'matter'],
['bst', 'blocktime'],
['blocktime', 'publishedtime'],
['johns', 'hopkins'],
['hopkins', 'university'],
['speech', 'text'],
['text', 'transcript1'],
['tested', 'positive'],
['positive', 'coronavirus'],
['world', 'health'],
['health', 'organization'],
['bst', 'blocktime'],
['blocktime', 'publishedtime'],
['johns', 'hopkins'],
['hopkins', 'university'],
['new', 'coronavirus'],
['coronavirus', 'cases'],
['new', 'south'],
['south', 'wales'],
['premier', 'daniel'],
['daniel', 'andrews'],
['tested', 'positive'],
['positive', 'coronavirus'],
['tested', 'positive'],
['positive', 'covid19'],
['world', 'health'],
['health', 'organization'],
['bst', 'blocktime'],
['blocktime', 'publishedtime'],
['days', 'publication'],
['publication', 'date'],
['new', 'coronavirus'],
['coronavirus', 'cases'],
['speech', 'text'],
['text', 'transcript1'],
['world', 'health'],
['health', 'organization'],
['world', 'aids'],
['aids', 'day'],
['former', 'north'],
['north', 'korean'],
['kim', 'jong'],
['jong', 'un'],
['korea', 'united'],
['united', 'states'],
['leader', 'kim'],
['kim', 'jong'],
['north', 'korea'],
['korea', 'united'],
['ortagus', 'said'],
['said', 'united'],
['said', 'united'],
['united', 'states'],
['general', 'antonio'],
['antonio', 'guterres'],
['immediate', 'global'],
['global', 'ceasefire'],
['president', 'donald'],
['donald', 'trump'],
['secretary', 'general'],
['general', 'antonio'],
['un', 'secretary'],
['secretary', 'general'],
['world', 'health'],
['health', 'organization'],
['world', 'worst'],
['worst', 'humanitarian'],
['worst', 'humanitarian'],
['humanitarian', 'crisis'],
['doctors', 'without'],
['without', 'borders'],
['president', 'donald'],
['donald', 'trump'],
['un', 'security'],
['security', 'council'],
['united', 'arab'],
['arab', 'emirates'],
['us', 'president'],
['president', 'donald'],
['world', 'health'],
['health', 'organization'],
['world', 'worst'],
['worst', 'humanitarian'],
['worst', 'humanitarian'],
['humanitarian', 'crisis'],
['late', 'last'],
['last', 'year'],
['president', 'donald'],
['donald', 'trump'],
['tens', 'thousands'],
['thousands', 'people'],
['un', 'security'],
['security', 'council'],
['us', 'president'],
['president', 'donald'],
['world', 'health'],
['health', 'organization'],
['world', 'worst'],
['worst', 'humanitarian'],
['worst', 'humanitarian'],
['humanitarian', 'crisis'],
['already', 'gripped'],
['gripped', 'un'],
['calls', 'world'],
['world', 'worst'],
['doctors', 'without'],
['without', 'borders'],
['international', 'humanitarian'],
['humanitarian', 'city'],
['rio', 'de'],
['de', 'paz'],
['world', 'health'],
['health', 'organization'],
['world', 'worst'],
['worst', 'humanitarian'],
['worst', 'humanitarian'],
['humanitarian', 'crisis'],
['ferry', 'moby'],
['moby', 'zaza'],
['missile', 'defence'],
['defence', 'system'],
['outbreak', 'dr'],
['dr', 'congo'],
['said', 'last'],
['last', 'month'],
['told', 'afp'],
['afp', 'via'],
['united', 'arab'],
['arab', 'emirates'],
['world', 'health'],
['health', 'organization'],
['worst', 'humanitarian'],
['humanitarian', 'crisis'],
['islamic', 'state'],
['state', 'group'],
['kurdish', 'red'],
['red', 'crescent'],
['northeastern', 'syria'],
['syria', 'contracted'],
['personal', 'protective'],
['protective', 'equipment'],
['tens', 'thousands'],
['thousands', 'people'],
['three', 'health'],
['health', 'workers'],
['un', 'security'],
['security', 'council'],
['coronavirus', 'information'],
['information', 'web'],
['facts', 'true'],
['true', 'coronavirus'],
['holland', 'america'],
['america', 'line'],
['internationaldpacom', 'loaddate'],
['loaddate', 'february'],
['lilimyths', 'vs'],
['vs', 'facts'],
['true', 'coronavirus'],
['coronavirus', 'information'],
['vs', 'facts'],
['facts', 'true'],
['world', 'health'],
['health', 'organization'],
['global', 'spread'],
['spread', 'covid19'],
['lilicoronavirus', 'consequences'],
['consequences', 'tourism'],
['lilicoronavirus', 'timeline'],
['timeline', 'global'],
['president', 'donald'],
['donald', 'trump'],
['timeline', 'global'],
['global', 'spread'],
['touch', 'coronavirus'],
['coronavirus', 'outbreak'],
['us', 'president'],
['president', 'donald'],
['world', 'health'],
['health', 'organization'],
['diary', 'ocean'],
['ocean', 'viking'],
['johns', 'hopkins'],
['hopkins', 'university'],
['lilicoronavirus', 'consequences'],
['consequences', 'tourism'],
['president', 'donald'],
['donald', 'trump'],
['robert', 'koch'],
['koch', 'institute'],
['social', 'distancing'],
['distancing', 'measures'],
['us', 'president'],
['president', 'donald'],
['world', 'health'],
['health', 'organization'],
['camps', 'slums'],
['slums', 'dealing'],
['lilicoronavirus', 'consequences'],
['consequences', 'tourism'],
['lilicoronavirus', 'refugee'],
['refugee', 'camps'],
['president', 'donald'],
['donald', 'trump'],
['refugee', 'camps'],
['camps', 'slums'],
['slums', 'dealing'],
['dealing', 'hygiene'],
['us', 'president'],
['president', 'donald'],
['world', 'health'],
['health', 'organization'],
['among', 'highest'],
['highest', 'world'],
['schools', 'daycare'],
['daycare', 'centres'],
['start', 'air'],
['air', 'campaign'],
['tested', 'positive'],
['positive', 'virus'],
['bab', 'alhawa'],
['alhawa', 'hospital'],
['chancellor', 'angela'],
['angela', 'merkel'],
['coronavirus', 'recovery'],
['recovery', 'aid'],
['german', 'chancellor'],
['chancellor', 'angela'],
['un', 'security'],
['security', 'council'],
['tokyo', 'japan'],
['japan', 'releases']]
# Sort so identical bigrams sit next to each other, then keep one entry per group
listToProcess.sort()
[bigram for bigram, _ in itertools.groupby(listToProcess)]

[['according', 'johns'],
 ['afp', 'via'],
 ['aids', 'day'],
 ['air', 'campaign'],
 ['alhawa', 'hospital'],
 ['already', 'gripped'],
 ['america', 'line'],
 ['ammonium', 'nitrate'],
 ['among', 'highest'],
 ['angela', 'merkel'],
 ['antonio', 'guterres'],
 ['arab', 'emirates'],
 ['bab', 'alhawa'],
 ['benjamin', 'netanyahu'],
 ['black', 'lives'],
 ['blocktime', 'publishedtime'],
 ['bst', 'blocktime'],
 ['calls', 'world'],
 ['camps', 'slums'],
 ['care', 'workers'],
 ['centers', 'disease'],
 ['chancellor', 'angela'],
 ['chief', 'medical'],
 ['chinese', 'city'],
 ['city', 'officials'],
 ['city', 'wuhan'],
 ['cnn', 'senior'],
 ['confirmed', 'cases'],
 ['consequences', 'tourism'],
 ['control', 'prevention'],
 ['coronavirus', 'cases'],
 ['coronavirus', 'information'],
 ['coronavirus', 'outbreak'],
 ['coronavirus', 'recovery'],
 ['cruise', 'ship'],
 ['daniel', 'andrews'],
 ['daycare', 'centres'],
 ['days', 'publication'],
 ['de', 'paz'],
 ['dealing', 'hygiene'],
 ['death', 'parents'],
 ['defence',
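
The sort-and-`groupby` step above removes repeated bigrams. As a minimal sketch of an equivalent approach, the pairs can be converted to tuples and passed through a set, which avoids sorting the original list in place. The names `sample_bigrams` and `unique_bigrams` are hypothetical and used only for illustration; the sample is a small hand-picked subset of the pairs shown above.

```python
# Hypothetical sample; in the notebook this would be the full listToProcess.
sample_bigrams = [
    ['world', 'health'],
    ['health', 'organization'],
    ['new', 'york'],
    ['world', 'health'],
    ['new', 'york'],
]


def unique_bigrams(bigrams):
    """Return the distinct bigrams in sorted order without mutating the input."""
    # Lists are unhashable, so each pair becomes a tuple before entering the set.
    distinct = {tuple(pair) for pair in bigrams}
    # Convert back to lists so the result has the same shape as the input.
    return [list(pair) for pair in sorted(distinct)]


print(unique_bigrams(sample_bigrams))
```

Unlike `listToProcess.sort()`, this leaves the input list in its original order, which may matter if the month-by-month ordering of the pairs is needed later.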

In [63]:
getAllBigramsFromTri(['CN', 'RU', 'IR', 'TR']) # Global Media Players

['attend', 'medical'],
['medical', 'training'],
['body', 'temperature'],
['temperature', 'checked'],
['checked', 'entering'],
['entering', 'subway'],
['chinese', 'new'],
['new', 'year'],
['curb', 'spread'],
['spread', 'novel'],
['new', 'year'],
['year', 'eve'],
['novel', 'coronavirus'],
['coronavirus', 'wuhan'],
['spread', 'novel'],
['novel', 'coronavirus'],
['access', 'chinese'],
['chinese', 'mainland'],
['cctv', 'access'],
['access', 'chinese'],
['china', 'hubei'],
['hubei', 'province'],
['fight', 'novel'],
['novel', 'coronavirus'],
['novel', 'coronavirus'],
['coronavirus', 'outbreak'],
['president', 'xi'],
['xi', 'jinping'],
['world', 'health'],
['health', 'organization'],
['access', 'chinese'],
['chinese', 'mainland'],
['bringing', 'total'],
['total', 'number'],
['global', 'humanitarian'],
['humanitarian', 'response'],
['humanitarian', 'response'],
['response', 'plan'],
['secretarygeneral', 'antonio'],
['antonio', 'guterres'],
['un', 'secretarygeneral'],
['secretarygeneral', 'anton

['according', 'data'],
['data', 'compiled'],
['china', 'last'],
['last', 'december'],
['compiled', 'usbased'],
['usbased', 'johns'],
['johns', 'hopkins'],
['hopkins', 'university'],
['told', 'anadolu'],
['anadolu', 'agency'],
['usbased', 'johns'],
['johns', 'hopkins'],
['world', 'health'],
['health', 'organization'],
['wuhan', 'china'],
['china', 'last'],
['china', 'last'],
['last', 'december'],
['compiled', 'usbased'],
['usbased', 'johns'],
['health', 'ministry'],
['ministry', 'said'],
['johns', 'hopkins'],
['hopkins', 'university'],
['president', 'donald'],
['donald', 'trump'],
['president', 'recep'],
['recep', 'tayyip'],
['usbased', 'johns'],
['johns', 'hopkins'],
['world', 'health'],
['health', 'organization'],
['china', 'last'],
['last', 'december'],
['humanitarian', 'aid'],
['aid', 'pandemic'],
['italy', 'spain'],
['spain', 'uk'],
['johns', 'hopkins'],
['hopkins', 'university'],
['provider', 'humanitarian'],
['humanitarian', 'aid'],
['told', 'anadolu'],
['anadolu', 'agency'],
['w

In [64]:
# remove duplicates
listToProcess = [['attend', 'medical'],
['medical', 'training'],
['body', 'temperature'],
['temperature', 'checked'],
['checked', 'entering'],
['entering', 'subway'],
['chinese', 'new'],
['new', 'year'],
['curb', 'spread'],
['spread', 'novel'],
['new', 'year'],
['year', 'eve'],
['novel', 'coronavirus'],
['coronavirus', 'wuhan'],
['spread', 'novel'],
['novel', 'coronavirus'],
['access', 'chinese'],
['chinese', 'mainland'],
['cctv', 'access'],
['access', 'chinese'],
['china', 'hubei'],
['hubei', 'province'],
['fight', 'novel'],
['novel', 'coronavirus'],
['novel', 'coronavirus'],
['coronavirus', 'outbreak'],
['president', 'xi'],
['xi', 'jinping'],
['world', 'health'],
['health', 'organization'],
['access', 'chinese'],
['chinese', 'mainland'],
['bringing', 'total'],
['total', 'number'],
['global', 'humanitarian'],
['humanitarian', 'response'],
['humanitarian', 'response'],
['response', 'plan'],
['secretarygeneral', 'antonio'],
['antonio', 'guterres'],
['un', 'secretarygeneral'],
['secretarygeneral', 'antonio'],
['united', 'nations'],
['nations', 'march'],
['world', 'health'],
['health', 'organization'],
['access', 'chinese'],
['chinese', 'mainland'],
['cctv', 'access'],
['access', 'chinese'],
['community', 'shared'],
['shared', 'future'],
['president', 'donald'],
['donald', 'trump'],
['secretarygeneral', 'antonio'],
['antonio', 'guterres'],
['told', 'global'],
['global', 'times'],
['world', 'health'],
['health', 'organization'],
['access', 'chinese'],
['chinese', 'mainland'],
['cctv', 'access'],
['access', 'chinese'],
['chinese', 'president'],
['president', 'xi'],
['community', 'shared'],
['shared', 'future'],
['global', 'humanitarian'],
['humanitarian', 'response'],
['president', 'xi'],
['xi', 'jinping'],
['world', 'health'],
['health', 'assembly'],
['world', 'health'],
['health', 'organization'],
['billion', 'us'],
['us', 'dollars'],
['community', 'shared'],
['shared', 'future'],
['council', 'information'],
['information', 'office'],
['cpc', 'central'],
['central', 'committee'],
['epidemic', 'prevention'],
['prevention', 'control'],
['president', 'xi'],
['xi', 'jinping'],
['state', 'council'],
['council', 'information'],
['told', 'global'],
['global', 'times'],
['appeal', 'global'],
['global', 'ceasefire'],
['community', 'shared'],
['shared', 'future'],
['global', 'fight'],
['fight', 'covid19'],
['global', 'humanitarian'],
['humanitarian', 'response'],
['humanitarian', 'response'],
['response', 'plan'],
['new', 'covid19'],
['covid19', 'cases'],
['united', 'nations'],
['nations', 'july'],
['world', 'health'],
['health', 'organization'],
['beirut', 'lebanon'],
['lebanon', 'aug'],
['children', 'lacked'],
['lacked', 'basic'],
['un', 'refugee'],
['refugee', 'agency'],
['world', 'health'],
['health', 'organization'],
['attack', 'us'],
['us', 'embassy'],
['center', 'reconciliation'],
['reconciliation', 'opposing'],
['consumer', 'protection'],
['protection', 'welfare'],
['deputy', 'foreign'],
['foreign', 'minister'],
['dmitry', 'peskov'],
['peskov', 'said'],
['federal', 'service'],
['service', 'oversight'],
['foreign', 'ministry'],
['ministry', 'said'],
['russian', 'foreign'],
['foreign', 'ministry'],
['dmitry', 'peskov'],
['peskov', 'said'],
['foreign', 'minister'],
['minister', 'sergei'],
['foreign', 'ministry'],
['ministry', 'said'],
['moscow', 'time'],
['time', 'february'],
['president', 'vladimir'],
['vladimir', 'putin'],
['russian', 'foreign'],
['foreign', 'ministry'],
['russian', 'president'],
['president', 'vladimir'],
['world', 'health'],
['health', 'organization'],
['minister', 'mikhail'],
['mikhail', 'mishustin'],
['novel', 'coronavirus'],
['coronavirus', 'covid19'],
['president', 'vladimir'],
['vladimir', 'putin'],
['press', 'service'],
['service', 'said'],
['russian', 'president'],
['president', 'vladimir'],
['un', 'secretarygeneral'],
['secretarygeneral', 'antonio'],
['world', 'health'],
['health', 'organization'],
['cases', 'novel'],
['novel', 'coronavirus'],
['coronavirus', 'outbreak'],
['outbreak', 'pandemic'],
['president', 'vladimir'],
['vladimir', 'putin'],
['russian', 'president'],
['president', 'vladimir'],
['world', 'health'],
['health', 'organization'],
['amid', 'coronavirus'],
['coronavirus', 'pandemic'],
['president', 'vladimir'],
['vladimir', 'putin'],
['press', 'service'],
['service', 'said'],
['russian', 'foreign'],
['foreign', 'ministry'],
['russian', 'president'],
['president', 'vladimir'],
['un', 'security'],
['security', 'council'],
['world', 'health'],
['health', 'organization'],
['deputy', 'foreign'],
['foreign', 'minister'],
['novel', 'coronavirus'],
['coronavirus', 'pandemic'],
['president', 'vladimir'],
['vladimir', 'putin'],
['press', 'service'],
['service', 'said'],
['russian', 'foreign'],
['foreign', 'ministry'],
['russian', 'president'],
['president', 'vladimir'],
['world', 'health'],
['health', 'organization'],
['amid', 'coronavirus'],
['coronavirus', 'pandemic'],
['editorial', 'staff'],
['staff', 'reached'],
['president', 'vladimir'],
['vladimir', 'putin'],
['press', 'service'],
['service', 'said'],
['russian', 'president'],
['president', 'vladimir'],
['staff', 'reached'],
['reached', 'engeditorsinterfaxru'],
['world', 'health'],
['health', 'organization'],
['bringing', 'country'],
['country', 'official'],
['coronavirus', 'infections'],
['infections', 'bringing'],
['country', 'official'],
['official', 'number'],
['infections', 'bringing'],
['bringing', 'country'],
['new', 'coronavirus'],
['coronavirus', 'infections'],
['official', 'number'],
['number', 'cases'],
['president', 'vladimir'],
['vladimir', 'putin'],
['resident', 'vladimir'],
['vladimir', 'putin'],
['coronavirus', 'test'],
['test', 'kits'],
['iranian', 'foreign'],
['foreign', 'ministry'],
['islamic', 'republic'],
['republic', 'iran'],
['medical', 'sanctions'],
['sanctions', 'iran'],
['medicine', 'medical'],
['medical', 'equipment'],
['secretary', 'state'],
['state', 'mike'],
['state', 'mike'],
['mike', 'pompeo'],
['us', 'secretary'],
['secretary', 'state'],
['central', 'chinese'],
['chinese', 'city'],
['chinese', 'city'],
['city', 'wuhan'],
['coronavirus', 'outbreak'],
['outbreak', 'country'],
['foreign', 'ministry'],
['ministry', 'spokesman'],
['iranian', 'foreign'],
['foreign', 'ministry'],
['islamic', 'republic'],
['republic', 'iran'],
['late', 'last'],
['last', 'year'],
['world', 'health'],
['health', 'organization'],
['amid', 'coronavirus'],
['coronavirus', 'outbreak'],
['covid19', 'virus'],
['virus', 'identified'],
['infected', 'covid19'],
['covid19', 'virus'],
['iranian', 'foreign'],
['foreign', 'ministry'],
['islamic', 'republic'],
['republic', 'iran'],
['patients', 'infected'],
['infected', 'covid19'],
['world', 'health'],
['health', 'organization'],
['amid', 'coronavirus'],
['coronavirus', 'outbreak'],
['infected', 'covid19'],
['covid19', 'virus'],
['iranian', 'foreign'],
['foreign', 'ministry'],
['islamic', 'republic'],
['republic', 'iran'],
['patients', 'infected'],
['infected', 'covid19'],
['swiss', 'humanitarian'],
['humanitarian', 'trade'],
['world', 'health'],
['health', 'organization'],
['infected', 'covid19'],
['covid19', 'virus'],
['iranian', 'foreign'],
['foreign', 'ministry'],
['iranian', 'health'],
['health', 'ministry'],
['patients', 'infected'],
['infected', 'covid19'],
['sadat', 'lari'],
['lari', 'said'],
['world', 'health'],
['health', 'organization'],
['identified', 'country'],
['country', 'past'],
['iranian', 'foreign'],
['foreign', 'ministry'],
['iranian', 'health'],
['health', 'ministry'],
['iranian', 'president'],
['president', 'hassan'],
['islamic', 'republic'],
['republic', 'iran'],
['president', 'hassan'],
['hassan', 'rouhani'],
['foreign', 'minister'],
['minister', 'mohammad'],
['foreign', 'ministry'],
['ministry', 'spokesman'],
['iranian', 'foreign'],
['foreign', 'ministry'],
['iranian', 'health'],
['health', 'ministry'],
['islamic', 'republic'],
['republic', 'iran'],
['minister', 'mohammad'],
['mohammad', 'javad'],
['mohammad', 'javad'],
['javad', 'zarif'],
['democratic', 'republic'],
['republic', 'congo'],
['countries', 'including'],
['including', 'us'],
['declared', 'outbreak'],
['outbreak', 'international'],
['health', 'organization'],
['organization', 'declared'],
['organization', 'declared'],
['declared', 'outbreak'],
['world', 'health'],
['health', 'organization'],
['according', 'data'],
['data', 'compiled'],
['china', 'last'],
['last', 'december'],
['compiled', 'usbased'],
['usbased', 'johns'],
['johns', 'hopkins'],
['hopkins', 'university'],
['told', 'anadolu'],
['anadolu', 'agency'],
['usbased', 'johns'],
['johns', 'hopkins'],
['world', 'health'],
['health', 'organization'],
['wuhan', 'china'],
['china', 'last'],
['china', 'last'],
['last', 'december'],
['compiled', 'usbased'],
['usbased', 'johns'],
['health', 'ministry'],
['ministry', 'said'],
['johns', 'hopkins'],
['hopkins', 'university'],
['president', 'donald'],
['donald', 'trump'],
['president', 'recep'],
['recep', 'tayyip'],
['usbased', 'johns'],
['johns', 'hopkins'],
['world', 'health'],
['health', 'organization'],
['china', 'last'],
['last', 'december'],
['humanitarian', 'aid'],
['aid', 'pandemic'],
['italy', 'spain'],
['spain', 'uk'],
['johns', 'hopkins'],
['hopkins', 'university'],
['provider', 'humanitarian'],
['humanitarian', 'aid'],
['told', 'anadolu'],
['anadolu', 'agency'],
['world', 'health'],
['health', 'organization'],
['aung', 'san'],
['san', 'suu'],
['democratic', 'republic'],
['republic', 'congo'],
['health', 'ministry'],
['ministry', 'said'],
['president', 'recep'],
['recep', 'tayyip'],
['recep', 'tayyip'],
['tayyip', 'erdogan'],
['san', 'suu'],
['suu', 'kyi'],
['world', 'health'],
['health', 'organization'],
['according', 'figures'],
['figures', 'compiled'],
['johns', 'hopkins'],
['hopkins', 'university'],
['turkish', 'red'],
['red', 'crescent'],
['usbased', 'johns'],
['johns', 'hopkins'],
['world', 'health'],
['health', 'organization'],
['agency', 'morning'],
['morning', 'briefing'],
['agency', 'rundown'],
['rundown', 'latest'],
['anadolu', 'agency'],
['agency', 'morning'],
['anadolu', 'agency'],
['agency', 'rundown'],
['ankara', 'anadolu'],
['anadolu', 'agency'],
['developments', 'coronavirus'],
['coronavirus', 'pandemic'],
['latest', 'developments'],
['developments', 'coronavirus']]
# Sort so identical bigrams sit next to each other, then keep one entry per group
listToProcess.sort()
[bigram for bigram, _ in itertools.groupby(listToProcess)]

[['access', 'chinese'],
 ['according', 'data'],
 ['according', 'figures'],
 ['agency', 'morning'],
 ['agency', 'rundown'],
 ['aid', 'pandemic'],
 ['amid', 'coronavirus'],
 ['anadolu', 'agency'],
 ['ankara', 'anadolu'],
 ['antonio', 'guterres'],
 ['appeal', 'global'],
 ['attack', 'us'],
 ['attend', 'medical'],
 ['aung', 'san'],
 ['beirut', 'lebanon'],
 ['billion', 'us'],
 ['body', 'temperature'],
 ['bringing', 'country'],
 ['bringing', 'total'],
 ['cases', 'novel'],
 ['cctv', 'access'],
 ['center', 'reconciliation'],
 ['central', 'chinese'],
 ['central', 'committee'],
 ['checked', 'entering'],
 ['children', 'lacked'],
 ['china', 'hubei'],
 ['china', 'last'],
 ['chinese', 'city'],
 ['chinese', 'mainland'],
 ['chinese', 'new'],
 ['chinese', 'president'],
 ['city', 'wuhan'],
 ['community', 'shared'],
 ['compiled', 'usbased'],
 ['consumer', 'protection'],
 ['coronavirus', 'covid19'],
 ['coronavirus', 'infections'],
 ['coronavirus', 'outbreak'],
 ['coronavirus', 'pandemic'],
 ['coronavirus',

## Conclusion

The results from this exercise will be visualized on a network graph and shown on the interactive website. Links in the network graph show which words are used together and give users context on how `covid` and `humanitarianism` were represented in news text.
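
As a rough sketch of how the bigram lists could feed such a graph, each pair can be turned into a link record with a source word, a target word, and a count of how often the pair occurred. This assumes the chart consumes links keyed by `from` and `to`; the `weight` field, the function name `bigrams_to_links`, and the sample data are hypothetical additions for illustration and are not taken from the notebook.

```python
import json
from collections import Counter

# Hypothetical sample; in practice this would be a bigram list before de-duplication.
sample_bigrams = [
    ['worst', 'humanitarian'],
    ['humanitarian', 'crisis'],
    ['worst', 'humanitarian'],
]


def bigrams_to_links(bigrams):
    """Build one link per distinct bigram, with an occurrence count."""
    # Tuples are hashable, so Counter can tally how often each pair appears.
    counts = Counter(tuple(pair) for pair in bigrams)
    return [
        {"from": first, "to": second, "weight": n}
        for (first, second), n in sorted(counts.items())
    ]


links = bigrams_to_links(sample_bigrams)
# JSON is a convenient hand-off format for a JavaScript charting library.
print(json.dumps(links, indent=2))
```

Counting before de-duplicating keeps the frequency information, which could be used to size or filter links if the graph becomes too dense.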