In [23]:
from datetime import datetime as dt
import unicodedata
import pandas as pd
import numpy as np
import string

## Exploratory Data Analysis

In [2]:
para_count = []
sntc_count = []
word_count = []
language = []
source = []

for i in range(17):
    data = pd.read_csv('data/data_' + str(i) + '.csv', usecols=['language', 'paragraphs_count', 'sentences_count', 'source', 'words_count'],
                      dtype = {'language': str, 'paragraphs_count': str, 'sentences_count': str, 'source': str, 'words_count': str})
    para_count += [data.paragraphs_count.values]
    sntc_count += [data.sentences_count.values]
    word_count += [data.words_count.values]
    language += [data.language.values]
    source += [data.source.values]

In [3]:
# Convert into arrays and string
para_count = np.concatenate(para_count, axis=0).astype(str)
sntc_count = np.concatenate(sntc_count, axis=0).astype(str)
word_count = np.concatenate(word_count, axis=0).astype(str)

In [4]:
# Convert string list to array
language = np.concatenate(language, axis=0)
source = np.concatenate(source, axis=0)

**Average Counts**

The count columns need further cleaning as well. For example, `para_count` contains a date and `sntc_count` contains some website names. Since these make up only a small proportion of the data, we coerce them to `NaN` and keep only the numerical values.
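As a minimal sketch of that coercion, the toy list below (made-up values, not taken from the dataset) mixes valid counts with a date and a website name. `pd.to_numeric(..., errors='coerce')` turns the unparseable entries into `NaN`, and `np.nanmean` then averages only the numeric ones.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a count column read as strings (values are made up).
raw = ['12', '7', '2020-04-05', 'example.com', 'nan']

# Anything that is not a valid number becomes NaN instead of raising.
counts = pd.to_numeric(raw, errors='coerce')

# How many entries could not be parsed as numbers.
n_invalid = int(np.isnan(counts).sum())

# Mean over the numeric entries only; NaN values are ignored.
mean_valid = np.nanmean(counts)
```

Counting the `NaN` values first is a cheap way to confirm that the discarded entries really are a small share of the column.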

In [5]:
# Average number of paragraphs
para_count = pd.to_numeric(para_count, errors='coerce')
np.nanmean(para_count)

18.136828435875078

In [6]:
# Average number of sentences
sntc_count = pd.to_numeric(sntc_count, errors='coerce')
np.nanmean(sntc_count)

23.224798025040773

In [7]:
# Average word count
word_count = word_count.astype(float)
np.nanmean(word_count)

522.7987642759947

On average, each article has about 18 paragraphs, 23 sentences and 523 words. However, we don't know how the AYLIEN API defines a paragraph (for example, whether a caption counts as one), so we take these numbers with a grain of salt.

**Language**

In [8]:
pd.Series(language).value_counts()

en    1673351
1           2
dtype: int64

We can see that almost all articles are labeled as English (`en`), but two articles have the value `1` as their language. We want to investigate these two articles to ensure that they are in English.

In [9]:
np.where(language == '1')

(array([ 419355, 1635730]),)

So, we'll look into the articles at positions 419355 and 1635730 (zero-based) once we load them in.
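A small sketch of how such positions can be used later, on toy arrays rather than the real data (the `bodies` array and its contents are made up for illustration). The positions are zero-based, so position 2 refers to the third entry.

```python
import numpy as np

# Toy language labels and matching article bodies (made-up values).
language = np.array(['en', 'en', '1', 'en', '1'])
bodies = np.array(['a', 'b', 'c', 'd', 'e'])

# Zero-based positions of every label that is not 'en'.
odd_positions = np.flatnonzero(language != 'en')

# Index the bodies with the same positions to inspect those articles.
odd_bodies = bodies[odd_positions]
```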

**Source**

In [10]:
pd.Series(source).value_counts()

dailymail.co.uk       97356
reuters.com           90420
yahoo.com             85906
indiatimes.com        50433
urdupoint.com         42043
                      ...  
sozcu.com.tr              1
mundodeportivo.com        1
20minutes.fr              1
zeit.de                   1
texas.gov                 1
Length: 429, dtype: int64

We can see that the top 5 sources are Daily Mail UK, Reuters, Yahoo, India Times and Urdu Point. This helps us understand where our articles come from. It also means we have to widen our scope to consider bias globally rather than only in the United States, since we would not have sufficient articles from the United States alone.
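To judge how concentrated the sources are, the counts can be turned into shares. The sketch below uses placeholder source names and counts, not the real ones. `value_counts(normalize=True)` returns fractions that sum to 1, sorted from most to least frequent.

```python
import pandas as pd

# Placeholder source labels (not the real sources or counts).
source = pd.Series(['site-a'] * 5 + ['site-b'] * 3 + ['site-c'] * 2)

# Fraction of articles per source, most frequent first.
shares = source.value_counts(normalize=True)

# Combined share of the two largest sources.
top2_share = shares.head(2).sum()
```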

### Text Cleaning

Now, for our purposes, we are only interested in the `body` and the `published_at` columns. Hence, we get a DataFrame with just those two columns now.

In [12]:
body = []
dates = []

for i in range(17):
    data = pd.read_csv('data/data_' + str(i) + '.csv', usecols=['body', 'published_at'])
    body += [data.body.values]
    dates += [data.published_at.values]

In [13]:
df = pd.DataFrame({'published_at': np.concatenate(dates, axis=0), 'body': np.concatenate(body, axis=0)})

df.head()

| | published_at | body |
|---|---|---|
| 0 | 2020-04-05 | On Sunday, British Prime Minister Boris Johnso... |
| 1 | 2020-04-05 | NSW has now recorded 18 COVID-19 deaths as the... |
| 2 | 2020-04-05 | ChandigarhWith shops and manufacturing units c... |
| 3 | 2020-04-05 | Chandigarh The 23-year-old man, discharged fro... |
| 4 | 2020-04-05 | CHANDIGARH The stillness which had become so m... |


In [14]:
df.shape

(1673356, 2)

The next step is to group the articles by month, since the data did not come in chronological order. We take the first seven characters of `published_at` (the `YYYY-MM` part) as the `date`.

In [15]:
df['date'] = df['published_at'].apply(lambda x: str(x)[:7])

In [16]:
df.date.value_counts()

2020-04    421719
2020-03    356982
2020-05    312505
2020-06    245143
2020-07    243827
2020-02     72057
2020-01     21102
2019-12         8
2019-11         8
nan             3
4               1
2               1
Name: date, dtype: int64

From this, we can see that we have very little data for 2019: only 8 articles each for November and December. We also have some odd values, i.e. `nan`, `4` and `2`. This limits our analysis to January 2020 through July 2020.

First, we remove the null values, the `4` and the `2`. We also remove the 2019 data since it is not sufficient, and start our exploration in 2020.
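An alternative to listing the bad values by hand is to parse the dates and keep only what parses, sketched here on a toy series with made-up values. This assumes every valid entry follows the `YYYY-MM-DD` pattern seen in `published_at`. The explicit `format` ensures that stray values such as `4` become `NaT` rather than being guessed at.

```python
import pandas as pd

# Toy published_at values, including the kinds of stray entries seen above.
published = pd.Series(['2020-04-05', '2019-12-30', 'nan', '4', '2020-01-15'])

# Entries that do not match the format become NaT instead of raising.
parsed = pd.to_datetime(published, format='%Y-%m-%d', errors='coerce')

# Keep only rows from 2020; NaT rows compare as False and are dropped.
keep = parsed.dt.year == 2020

# Year-month label for the rows that are kept.
months = parsed[keep].dt.strftime('%Y-%m')
```

This approach would also catch malformed values that are not in a hand-written drop list.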

In [17]:
# List of values to drop
drop_lst = ['nan', '4', '2', '2019-11', '2019-12']
  
# Drop rows whose date is one of those values
df = df[~df.date.isin(drop_lst)]

In [18]:
# Double check data
df.date.value_counts()

2020-04    421719
2020-03    356982
2020-05    312505
2020-06    245143
2020-07    243827
2020-02     72057
2020-01     21102
Name: date, dtype: int64

We also check to see if our body values have any missing data. If we have any nulls, we would drop the values since there is no way to impute a body value.

In [19]:
df.body.isnull().values.sum()

0
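Had the check returned a nonzero count, the drop could look like the sketch below, shown on a toy frame with made-up rows. `dropna(subset=['body'])` removes only the rows whose `body` is missing.

```python
import pandas as pd

# Toy frame with one missing body (made-up rows).
toy = pd.DataFrame({
    'published_at': ['2020-04-05', '2020-04-06', '2020-04-07'],
    'body': ['some text', None, 'more text'],
})

# Count the missing bodies, then drop those rows.
n_missing = int(toy['body'].isnull().sum())
toy = toy.dropna(subset=['body'])
```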

Now that our data is the way we want it to be, we continue with some text cleaning for the body.

In [20]:
# view one article
df.body[0]

'On Sunday, British Prime Minister Boris Johnson was hospitalized "for tests" because of "persistent" COVID-19 symptoms\xa010 days\xa0after he tested positive, CNN reports.\xa0\nJohnson reportedly went to the unspecified London hospital after his doctor advised him to do so. A press release from his office called the\xa0move\xa0"precautionary."\xa0\nOn March 26, Johnson revealed he had tested positive and that he had been dealing with symptoms since that date. Britain had gone into lockdown two days earlier.\nSince the 26th, Johnson has been quarantined at his Downing Street residence. He is the first known world leader to have contracted the virus.\xa0\nRoughly a month ago, right around the time the U.K. started dealing with an outbreak, Johnson garnered media coverage for saying he\'d shook hands with coronavirus patients during a hospital visit. \xa0\n"I shook hands with everybody, you will be pleased to know, and I continue to shake hands," Johnson said during a press conference th

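From this article, we can see that there are non-breaking spaces (`\xa0`) and newlines, so we apply the following preprocessing steps:
1. Normalize unicode characters (NFKD), which turns `\xa0` into an ordinary space
2. Strip trailing newline characters (newlines inside the text are left in place)
3. Convert all words to lowercase
4. Remove any punctuation

We implement these steps in a `clean_text` function and apply it to every body.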

In [21]:
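def clean_text(s):
    # NFKD replaces compatibility characters, e.g. '\xa0' becomes a plain space
    line = unicodedata.normalize("NFKD", s)
    # rstrip removes newlines only at the end of the string, then lowercase
    line = line.rstrip('\n').lower()
    # Delete every ASCII punctuation character
    line = line.translate(str.maketrans('', '', string.punctuation))
    return line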

In [24]:
df.body = df.body.astype('string')
df.body = df.body.apply(clean_text)

In [25]:
# Check example of cleaned text
df.body[0]

'on sunday british prime minister boris johnson was hospitalized for tests because of persistent covid19 symptoms 10 days after he tested positive cnn reports \njohnson reportedly went to the unspecified london hospital after his doctor advised him to do so a press release from his office called the move precautionary \non march 26 johnson revealed he had tested positive and that he had been dealing with symptoms since that date britain had gone into lockdown two days earlier\nsince the 26th johnson has been quarantined at his downing street residence he is the first known world leader to have contracted the virus \nroughly a month ago right around the time the uk started dealing with an outbreak johnson garnered media coverage for saying hed shook hands with coronavirus patients during a hospital visit  \ni shook hands with everybody you will be pleased to know and i continue to shake hands johnson said during a press conference that took place on march 3 his positive test was registe
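The cleaned example above still contains `\n` inside the text, because `rstrip('\n')` only strips newlines at the end of a string. If interior newlines should go as well, one possible variant is sketched below. The name `clean_text_flat` is hypothetical and the function is not part of the original pipeline. It collapses every run of whitespace, including newlines, into a single space.

```python
import string
import unicodedata

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text_flat(s):
    """Hypothetical variant of clean_text that also removes interior newlines."""
    # NFKD turns compatibility characters such as '\xa0' into plain spaces.
    line = unicodedata.normalize("NFKD", s)
    # Lowercase and strip punctuation, as in clean_text.
    line = line.lower().translate(_PUNCT_TABLE)
    # split() breaks on any whitespace, so newlines and double spaces vanish.
    return " ".join(line.split())

sample = 'He said:\xa0"Hello,\nworld!"\xa0\n'
cleaned = clean_text_flat(sample)
```

Whether to flatten newlines depends on the downstream tokenizer. If sentences or paragraphs are split on `\n` later, keeping them may be preferable.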

Now that our data is clean, we can use word2vec to get embeddings for our words. We want monthly embeddings, so we first split the data by month and save each month as its own CSV file, so that we can load only the months we need.

### Split by Months

In [26]:
jan = df[df.date == '2020-01'].reset_index()
feb = df[df.date == '2020-02'].reset_index()
march = df[df.date == '2020-03'].reset_index()
april = df[df.date == '2020-04'].reset_index()
may = df[df.date == '2020-05'].reset_index()
june = df[df.date == '2020-06'].reset_index()
july = df[df.date == '2020-07'].reset_index()

In [27]:
jan.published_at.unique()

array(['2020-01-31', '2020-01-30', '2020-01-29', '2020-01-28',
       '2020-01-27', '2020-01-26', '2020-01-25', '2020-01-24',
       '2020-01-23', '2020-01-22', '2020-01-21', '2020-01-20',
       '2020-01-19', '2020-01-18', '2020-01-17', '2020-01-16',
       '2020-01-15', '2020-01-14', '2020-01-13', '2020-01-12',
       '2020-01-11', '2020-01-10', '2020-01-09', '2020-01-08',
       '2020-01-07', '2020-01-06', '2020-01-05', '2020-01-04',
       '2020-01-03', '2020-01-02', '2020-01-01'], dtype=object)

In [28]:
may.published_at.unique()

array(['2020-05-31', '2020-05-30', '2020-05-29', '2020-05-28',
       '2020-05-27', '2020-05-26', '2020-05-25', '2020-05-24',
       '2020-05-23', '2020-05-22', '2020-05-21', '2020-05-20',
       '2020-05-19', '2020-05-18', '2020-05-17', '2020-05-16',
       '2020-05-15', '2020-05-14', '2020-05-13', '2020-05-12',
       '2020-05-11', '2020-05-10', '2020-05-09', '2020-05-08',
       '2020-05-07', '2020-05-06', '2020-05-05', '2020-05-04',
       '2020-05-03', '2020-05-02', '2020-05-01'], dtype=object)

For both of these months, we have data for every day from the first of the month through the last.
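Rather than reading the arrays by eye, the coverage can be checked programmatically. The helper below, `missing_days`, is a hypothetical name introduced only for this sketch. It assumes dates are `YYYY-MM-DD` strings and returns the calendar days of a month that never appear.

```python
import pandas as pd

def missing_days(published_at, month):
    """Return the days of `month` ('YYYY-MM') absent from `published_at`."""
    n_days = pd.Period(month, freq='M').days_in_month
    # Every calendar day of the month as a 'YYYY-MM-DD' string.
    expected = pd.date_range(start=month + '-01', periods=n_days, freq='D')
    expected = set(expected.strftime('%Y-%m-%d'))
    return sorted(expected - set(published_at))

# Toy check: a full January 2020, then the same list with one day removed.
full = pd.date_range('2020-01-01', '2020-01-31').strftime('%Y-%m-%d').tolist()
gap = [d for d in full if d != '2020-01-15']

none_missing = missing_days(full, '2020-01')
one_missing = missing_days(gap, '2020-01')
```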

In [29]:
jan.to_csv('months/jan.csv')
feb.to_csv('months/feb.csv')
march.to_csv('months/mar.csv')
april.to_csv('months/apr.csv')
may.to_csv('months/may.csv')
june.to_csv('months/june.csv')
july.to_csv('months/july.csv')
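The split-and-save steps could also be written as a single loop over `groupby`, sketched below on a toy frame with made-up rows. The temporary output directory and the `YYYY-MM.csv` file names are assumptions for the example, not the `months/` layout used above. Unlike the cells above, this sketch drops the old index (`reset_index(drop=True)`, `index=False`), so the saved files carry no extra index columns.

```python
import os
import tempfile
import pandas as pd

# Toy frame standing in for the cleaned df (made-up rows).
toy = pd.DataFrame({
    'published_at': ['2020-01-31', '2020-01-02', '2020-05-10'],
    'body': ['first article', 'second article', 'third article'],
})
toy['date'] = toy['published_at'].str[:7]

# Write one CSV per month into a temporary directory.
out_dir = tempfile.mkdtemp()
written = {}
for month, frame in toy.groupby('date'):
    path = os.path.join(out_dir, month + '.csv')
    frame.reset_index(drop=True).to_csv(path, index=False)
    written[month] = path

# Read one month back to confirm the round trip.
reloaded = pd.read_csv(written['2020-01'])
```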