# Introduction
This notebook is a part of a series of notebooks.
1. ***Data Cleaning***
2. Data Analysis
3. Sentiment Analysis
4. Topic Modelling  

The purpose of this notebook is to clean the data of unnecessary content, such as symbols, numbers, brackets and foreign words. The expected outcome of this data cleaning process is a dataframe of lemmatized English words, from English domains.

# Problem statement
How has the mainstream media's coverage of Covid-19 changed during the pandemic?
- Can an overall change in sentiment be seen?
- Have the topics changed over time?
- Can a change in positivity or negativity within publishers be seen?

[Link to Github repository](https://github.com/miguel2650/covid-19-research)

# Getting the data
The dataset used for this research was found at [Kaggle](https://www.kaggle.com/jannalipenkova/covid19-public-media-dataset). The dataset is open to public downloads with a verified Kaggle account. The dataset consists of 50,000 online articles with full texts, scraped from online media since January 2020 and focused mainly on the non-medical aspects of COVID-19. The articles appear in a variety of languages and have not been preprocessed in any way.  
The authors of the dataset will continually update it. However, we focus on the latest version available at the time of conducting this research.  
*Version: covid19_articles_20200504.csv*

In [3]:
import os.path
# Retrieves the dataset from https://www.kaggle.com/jannalipenkova/covid19-public-media-dataset
# In order to do so, please go to your kaggle account and create an API key.
# Follow the documentation here https://github.com/Kaggle/kaggle-api to set it up.
!kaggle datasets download -d jannalipenkova/covid19-public-media-dataset

Downloading covid19-public-media-dataset.zip to C:\Users\Mikkel\Desktop\covid-19-research





In [4]:
from zipfile import ZipFile

# Creates a ZipFile Object and loads the dataset into it
with ZipFile('covid19-public-media-dataset.zip', 'r') as zipObj:
    # Extracts all the contents of the zip file in to the datasets folder.
    zipObj.extractall('datasets')

In [21]:
import pandas as pd

# Sets the maximum column width used when printing the dataframe.
pd.set_option('max_colwidth', 150)
# Reading the CSV file into a Pandas dataframe.
data_df = pd.read_csv('datasets/covid19_articles_20200504.csv', index_col=0)
data_df.head(2)

Unnamed: 0,title,date,domain,url,author,content,topic_area
0,My experience of surviving cancer twice,2020-01-03,medicalnewstoday,https://www.medicalnewstoday.com/articles/327373,Helen Ziatyk,"“Helen, I’m so sorry to tell you that you have stage 4 ovarian cancer.” I will never forget hearing those words. Cancer treatment was pretty gruel...",healthcare
1,Ginger: Health benefits and dietary tips,2020-01-03,medicalnewstoday,https://www.medicalnewstoday.com/articles/265990.php,Jenna Fletcher,"If you buy something through a link on this page, we may earn a small commission. How this works. People have used ginger in cooking and medicine ...",healthcare


# Cleaning round 1
The simplest round of data cleaning. This makes sure that each article contains only one long line of words.

In [6]:
# Apply first round of text cleaning
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Raw strings keep the regex escapes from being read as string escapes.
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = clean_text_round1

In [22]:
data_clean = pd.DataFrame(data_df.content.apply(round1))
data_clean.head(5)

Unnamed: 0,content
0,“helen i’m so sorry to tell you that you have stage ovarian cancer” i will never forget hearing those words cancer treatment was pretty grueling ...
1,if you buy something through a link on this page we may earn a small commission how this works people have used ginger in cooking and medicine sin...
2,a cluster of more than pneumonia cases in the central chinese city of wuhan may be due to a newly emerging member of the family of viruses that c...
3,passengers arriving at hong kongs international airport are being monitored for signs a mystery illness that emerged in central china credit andy ...
4,the finding that the outbreak of viral pneumonia in china that has struck people may be caused by a coronavirus the family of viruses behind sars...
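
To make the effect of each step concrete, the following sketch runs the same four substitutions one at a time on a made-up sentence. The `sample` string and the `step1` to `step4` names are illustrations only and are not taken from the dataset.

```python
import re
import string

# Hypothetical sample sentence, not from the dataset.
sample = "WHO [update]: 3 new COVID-19 cases, stay home!"

# 1. Lowercase everything.
step1 = sample.lower()
# 2. Drop anything inside square brackets (non-greedy match).
step2 = re.sub(r'\[.*?\]', '', step1)
# 3. Drop ASCII punctuation.
step3 = re.sub('[%s]' % re.escape(string.punctuation), '', step2)
# 4. Drop every token that contains a digit.
step4 = re.sub(r'\w*\d\w*', '', step3)

# Collapse the leftover runs of spaces for display.
print(" ".join(step4.split()))
```

In this sketch "COVID-19" disappears entirely: the hyphen is stripped in step 3, and the resulting token `covid19` contains digits, so step 4 removes it. Only mentions without a number attached, such as "covid" or "coronavirus", get through this round.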


# Cleaning round 2
Some of the symbols and punctuation were not removed the first time around. This second round of cleaning makes sure they are completely removed.
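
The reason these characters slip through is that `string.punctuation` only covers ASCII punctuation, while the articles use typographic quotes and ellipses. A minimal check, using a made-up `sample` string as an assumption:

```python
import re
import string

# Typographic characters that commonly appear in news text.
typographic = "‘’“”…"

# None of them are part of string.punctuation, which is ASCII-only.
survivors = [ch for ch in typographic if ch not in string.punctuation]
print(survivors)

# Hypothetical sample; an explicit character class removes them.
sample = "“helen i’m so sorry…”"
cleaned = re.sub('[‘’“”…]', '', sample)
print(cleaned)
```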

In [8]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and symbols that were missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    # Replace newlines with a space so that words on adjacent lines are not joined together.
    text = re.sub(r'\n', ' ', text)
    return text

round2 = clean_text_round2

In [23]:
data_clean = pd.DataFrame(data_clean.content.apply(round2))
data_clean.head(5)

Unnamed: 0,content
0,helen im so sorry to tell you that you have stage ovarian cancer i will never forget hearing those words cancer treatment was pretty grueling in ...
1,if you buy something through a link on this page we may earn a small commission how this works people have used ginger in cooking and medicine sin...
2,a cluster of more than pneumonia cases in the central chinese city of wuhan may be due to a newly emerging member of the family of viruses that c...
3,passengers arriving at hong kongs international airport are being monitored for signs a mystery illness that emerged in central china credit andy ...
4,the finding that the outbreak of viral pneumonia in china that has struck people may be caused by a coronavirus the family of viruses behind sars...


# Cleaning round 3
After cleaning rounds 1 and 2, the data is free of symbols and contains only words.  
This time around, the data will be cleaned of any words that do not appear to be English. This makes the data easier for us to understand and work with. It also means that the same word won't appear in different languages.
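
The idea of this round can be sketched without any downloads: keep a token only if it is in a known vocabulary or mentions the virus. The `vocabulary` set below is a hypothetical toy stand-in for the NLTK word list, and `keep_known_words` is an illustrative name, not part of the notebook.

```python
import re

# Hypothetical toy vocabulary standing in for the NLTK words corpus.
vocabulary = {"the", "virus", "is", "spreading"}
# Tokens mentioning the virus are always kept, even if not in the vocabulary.
keep_pattern = re.compile(r"corona|covid", re.IGNORECASE)

def keep_known_words(text, vocabulary):
    """Keep tokens found in the vocabulary or matching keep_pattern."""
    # Split into runs of word characters or runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    return " ".join(t for t in tokens
                    if t.lower() in vocabulary or keep_pattern.search(t))

sample = "el coronavirus se propaga the virus is spreading"
result = keep_known_words(sample, vocabulary)
print(result)
```

A filter of this kind works word by word, so a foreign word that happens to be spelled like an English one would still be kept.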

In [10]:
import re
import nltk
# Using the Natural Language Toolkit to clean the data for non-English words.
nltk.download('words')

# Build the word set and the pattern once, rather than on every call;
# rebuilding the set for each article is very slow.
english_words = set(nltk.corpus.words.words())
covid_reg = re.compile(r'corona|covid', re.IGNORECASE)

def clean_text_round3(text):
    '''Remove all non-English words except for words containing "corona" or "covid".'''
    # wordpunct_tokenize splits the text into word and punctuation tokens.
    return " ".join(w for w in nltk.wordpunct_tokenize(text)
                    if w.lower() in english_words or covid_reg.search(w))

round3 = clean_text_round3

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Mikkel\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


In [None]:
data_clean = pd.DataFrame(data_clean.content.apply(round3))
# Creating a folder for pickle files
os.makedirs("pickle", exist_ok=True)
# Saving the cleaned data as a pickle file.
data_clean.to_pickle("pickle/data_clean_r3.pkl")
# To resume from this checkpoint without re-running round 3, load it instead:
# data_clean = pd.read_pickle("pickle/data_clean_r3.pkl")

# Cleaning round 4
Now that the data only contains English words, it is possible to lemmatize them. This prevents the same word from being counted separately in its different inflected forms.

In [13]:
import nltk
from nltk.stem import WordNetLemmatizer
# Using the Natural Language Toolkit to lemmatize the data.
nltk.download('wordnet')

def clean_text_round4(text):
    '''Lemmatize words, leaving "was" unchanged because the lemmatizer would return "wa".'''
    lemmer = WordNetLemmatizer()

    # Compare with equality: `word not in 'was'` is a substring test that also drops "a" and "as".
    return ' '.join(word if word == 'was' else lemmer.lemmatize(word)
                    for word in nltk.wordpunct_tokenize(text))

round4 = clean_text_round4

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Mikkel\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
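
When a single token such as "was" is special-cased, it is worth remembering that Python's `in` operator on a string tests for substrings, not for equality. The sketch below uses a made-up token list to show the difference.

```python
# Hypothetical token list for illustration.
tokens = ["a", "case", "was", "as", "reported"]

# Substring test: "a" and "as" are substrings of "was", so they are dropped too.
substring_filtered = [t for t in tokens if t not in "was"]

# Equality test: only the exact token "was" is affected.
equality_filtered = [t for t in tokens if t != "was"]

print(substring_filtered)
print(equality_filtered)
```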


In [14]:
data_clean = pd.DataFrame(data_clean.content.apply(round4))
data_clean.head(5)


# End of cleaning
This is the end of the data cleaning. More steps could have been applied; however, we decided that the data was clean enough for analysis. The cleaned data will be saved as files for analysis in the next notebook.

In [None]:
# Saves the cleaned data as pickle files, creating the folder if it does not exist.
os.makedirs("pickle", exist_ok=True)
# Saving the cleaned data
data_clean.to_pickle("pickle/data_clean_r4.pkl")
# Saving the raw data
data_df.to_pickle("pickle/data_df.pkl")