<h1>Data Processing</h1>
<h3>
    <ol>
        <li>
            Accessing Previously Extracted Data from Scraping
        </li>
        <li>
            Combining Both Kinds of Articles and Converting The Combination to a DataFrame
        </li>
        <li>
            Scraping to Obtain the Textual Content in all Short-listed Articles
        </li>
        <li>
            Cleaning the Data Obtained to Ensure Only Textual Content is Taken
        </li>
        <li>
            Saving the DataFrame for Short-listed Articles as JSON
        </li>
    </ol>
</h3>

***

<h4>Accessing Previously Extracted Data from Scraping</h4>

In [1]:
import json

# creating a function to read in stored JSON files
def read_from_storage(filename: str) -> list:
    # building the path to the stored file from its base name
    filename = f"Data/extract/{filename}.json"

    # opening the file at that path and parsing its JSON contents
    with open(filename, 'r') as f:
        return json.load(f)

<h4>Combining Both Kinds of Articles and Converting The Combination to a DataFrame</h4>

In [2]:
# reading the scientific and conspiracy articles stored in the data_collection process
science_articles = read_from_storage('science')
conspiracy_articles = read_from_storage('conspiracy')

In [3]:
# trimming both sets of articles to the length of the shorter one

min_len = min(len(science_articles), len(conspiracy_articles))
science_articles = science_articles[:min_len]
conspiracy_articles = conspiracy_articles[:min_len]
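
Slicing keeps the first `min_len` entries of each list, so if a stored list happens to be grouped by source, the sources at the end of the longer list are dropped entirely. As one possible alternative (not what this notebook does), the lists could be down-sampled at random with a fixed seed. The helper name `balance_by_sampling` and the seed value are assumptions made for illustration only.

```python
import random

def balance_by_sampling(first: list, second: list, seed: int = 0) -> tuple:
    """Down-sample both lists to the length of the shorter one."""
    min_len = min(len(first), len(second))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    # Random.sample draws without replacement, so no article is repeated
    return rng.sample(first, min_len), rng.sample(second, min_len)

# toy stand-ins for the stored article lists
a = [{"title": f"a{i}"} for i in range(5)]
b = [{"title": f"b{i}"} for i in range(3)]
a_bal, b_bal = balance_by_sampling(a, b)
print(len(a_bal), len(b_bal))  # 3 3
```

Note that sampling also shuffles the order of the shorter list, which does not matter here because the rows are re-indexed when the two DataFrames are concatenated.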

In [4]:
import pandas as pd

# creating a pandas dataframe from the two lists for the two types of articles
df_science = pd.DataFrame.from_dict(science_articles)
df_conspiracy = pd.DataFrame.from_dict(conspiracy_articles)

In [5]:
# creating a new column to distinguish scientific articles from conspiracy ones

df_science['article_type'] = 1
df_conspiracy['article_type'] = 0

In [6]:
df_science

Unnamed: 0,title,link,article_type
0,\n 09/02/2022\n CoQ10 for Post-COVID Fatigue?\n,https://www.consumerlab.com/clinical-updates/#...,1
1,\n 09/02/2022\n Omicron-Targeting Boosters\n,https://www.consumerlab.com/clinical-updates/#...,1
2,\n \n \n What's in Gatorade?\n \n We were surp...,https://www.consumerlab.com/reviews/electrolyt...,1
3,\n \n \n How Many Times Should You Test for CO...,https://www.consumerlab.com/answers/how-and-wh...,1
4,"\n \n Latest Reviews of N95, KN95 and Other Ma...",https://www.consumerlab.com/answers/how-to-mak...,1
...,...,...,...
260,"Re: How covid-19 spreads: narratives, counter ...",https://www.bmj.com/content/378/bmj-2022-06994...,1
261,Risk of preterm birth and stillbirth after cov...,https://www.bmj.com/content/378/bmj-2022-071416,1
262,Investigating the monkeypox outbreak,https://www.bmj.com/content/377/bmj.o1314,1
263,"After covid, politicians are failing us again ...",https://www.bmj.com/content/378/bmj.o2121,1


In [7]:
df_conspiracy

Unnamed: 0,title,link,article_type
0,\n\t\t\t\t\t\t\t\t\t\tTHE PLAN – WHO plans for...,https://www.winterwatch.net/2022/07/the-plan-w...,0
1,SO MUCH FOR SCIENCE: Vaccine industry no longe...,https://www.sgtreport.com/2022/09/so-much-for-...,0
2,The \xe2\x80\x98World War\xe2\x80\x99 on COVID...,https://www.sgtreport.com/2022/09/the-world-wa...,0
3,SMOKING GUN: THE COVID VAX IS A SCHEDULE 4 POI...,https://www.sgtreport.com/2022/09/smoking-gun-...,0
4,\n\t\t\t\t\t\t\t\t\t\tJudging the Covid-19 Pan...,https://americanfreepress.net/judging-the-covi...,0
...,...,...,...
260,Coronavirus UPDATE: Massive coverup exposed by...,https://www.naturalhealth365.com/coronavirus-d...,0
261,… COVID-19 and Mental Health: The “Unpopular” ...,https://www.naturalhealth365.com/videos/mental...,0
262,COVID-19 and Mental Health: The “Unpopular” Truth,https://www.naturalhealth365.com/videos/mental...,0
263,… IRREFUTABLE evidence: mRNA COVID jab causes ...,https://www.naturalhealth365.com/irrefutable-e...,0


In [8]:
# merging the two dataframes whilst also preserving order
df = pd.concat([df_science, df_conspiracy], axis=0, ignore_index=True)

In [9]:
df

Unnamed: 0,title,link,article_type
0,\n 09/02/2022\n CoQ10 for Post-COVID Fatigue?\n,https://www.consumerlab.com/clinical-updates/#...,1
1,\n 09/02/2022\n Omicron-Targeting Boosters\n,https://www.consumerlab.com/clinical-updates/#...,1
2,\n \n \n What's in Gatorade?\n \n We were surp...,https://www.consumerlab.com/reviews/electrolyt...,1
3,\n \n \n How Many Times Should You Test for CO...,https://www.consumerlab.com/answers/how-and-wh...,1
4,"\n \n Latest Reviews of N95, KN95 and Other Ma...",https://www.consumerlab.com/answers/how-to-mak...,1
...,...,...,...
525,Coronavirus UPDATE: Massive coverup exposed by...,https://www.naturalhealth365.com/coronavirus-d...,0
526,… COVID-19 and Mental Health: The “Unpopular” ...,https://www.naturalhealth365.com/videos/mental...,0
527,COVID-19 and Mental Health: The “Unpopular” Truth,https://www.naturalhealth365.com/videos/mental...,0
528,… IRREFUTABLE evidence: mRNA COVID jab causes ...,https://www.naturalhealth365.com/irrefutable-e...,0


<h4>Scraping to Obtain the Textual Content in all Short-listed Articles</h4>

In [10]:
import numpy as np

# splitting the df into 30 chunks so that scraping can proceed chunk by chunk, with progress reported for each
split_df = np.array_split(df, 30)
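
`np.array_split` does not require the row count to divide evenly: when `n` rows are split into `k` parts, the first `n % k` parts receive one extra row. The pure-Python helper below, whose name `chunk_sizes` is hypothetical, is a small sketch of that sizing rule and not a replacement for the call above.

```python
def chunk_sizes(n_rows: int, n_chunks: int) -> list:
    """Sizes of the parts when n_rows are split into n_chunks near-equal parts."""
    base, extra = divmod(n_rows, n_chunks)
    # the first `extra` chunks receive one additional row
    return [base + 1] * extra + [base] * (n_chunks - extra)

sizes = chunk_sizes(10, 3)
print(sizes)  # [4, 3, 3]
```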

In [11]:
split_df[0]

Unnamed: 0,title,link,article_type
0,\n 09/02/2022\n CoQ10 for Post-COVID Fatigue?\n,https://www.consumerlab.com/clinical-updates/#...,1
1,\n 09/02/2022\n Omicron-Targeting Boosters\n,https://www.consumerlab.com/clinical-updates/#...,1
2,\n \n \n What's in Gatorade?\n \n We were surp...,https://www.consumerlab.com/reviews/electrolyt...,1
3,\n \n \n How Many Times Should You Test for CO...,https://www.consumerlab.com/answers/how-and-wh...,1
4,"\n \n Latest Reviews of N95, KN95 and Other Ma...",https://www.consumerlab.com/answers/how-to-mak...,1
5,\n \n Product Reviews and Answers to Questions...,https://www.consumerlab.com/topic/coronavirus/,1
6,\n Routine Functional Testing after PCI\n D.-W...,https://nejm.org/doi/full/10.1056/NEJMoa220833...,1
7,\n \n \n \n \n Coronavirus\n,https://nejm.org/coronavirus,1
8,Coronavirus,https://nejm.org/coronavirus,1
9,\n Protection against Omicron BA.5 from Previo...,https://nejm.org/doi/full/10.1056/NEJMc2209479...,1


In [12]:
from newspaper import Article
import time

# obtaining the texts for the articles present in the current chunk
def parse_chunk(articles: list) -> list:
    # creating a variable to store the article texts for the current chunk
    extracted_text = []

    # looping over all of the articles
    for article in articles:
        try:
            # attempting to obtain the article text for the current article
            # (article[1] is the value of the 'link' column for this row)
            current_article = Article(article[1])
            current_article.download()
            current_article.parse()

            # adding the article text to storage
            extracted_text.append(current_article.text)
        except Exception:
            # adding in a sentinel value to storage to indicate failure
            extracted_text.append("N/A")

    return extracted_text
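
`parse_chunk` handles one article at a time, so most of the elapsed time is spent waiting on the network. A minimal sketch of one way to overlap those waits with a thread pool follows. The names `fetch_all`, `safe_fetch`, and `fake_fetch` and the worker count are assumptions for illustration; in practice `fetch` would wrap the download-and-parse steps shown above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def fetch_all(links: List[str], fetch: Callable[[str], str], workers: int = 8) -> List[str]:
    """Apply `fetch` to every link concurrently, preserving input order."""
    def safe_fetch(link: str) -> str:
        try:
            return fetch(link)
        except Exception:
            return "N/A"  # same sentinel the notebook uses for failures

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs
        return list(pool.map(safe_fetch, links))

# stub standing in for a real download-and-parse call
def fake_fetch(link: str) -> str:
    if "bad" in link:
        raise ValueError("simulated failure")
    return f"text of {link}"

results = fetch_all(["https://example.org/a", "https://example.org/bad"], fake_fetch)
print(results)  # ['text of https://example.org/a', 'N/A']
```

Because results come back in input order, they can still be matched to DataFrame rows by position.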

In [13]:
len(split_df)

30

In [14]:
# creating a variable to store all the text for all of the articles
all_text = []

In [15]:
# looping over all of the chunks
for index, chunk in enumerate(split_df):
    # printing out a status message - specifying the current chunk number
    print(f'---- Chunk #{index+1} ----')

    # creating a variable to store the time execution is started
    start = time.time()

    # parsing the current chunk
    parsed_current_chunk = parse_chunk(chunk.values)

    # creating a variable to store the time execution is ended
    end = time.time()

    # printing out a status message - specifying the number of seconds elapsed
    print(end-start, "seconds elapsed")

    # adding all of the articles in the current chunk to storage
    all_text.append(parsed_current_chunk)

---- Chunk #1 ----
11.186994075775146 seconds elapsed
---- Chunk #2 ----
31.79817008972168 seconds elapsed
---- Chunk #3 ----
18.847004890441895 seconds elapsed
---- Chunk #4 ----
9.428709745407104 seconds elapsed
---- Chunk #5 ----
12.034369945526123 seconds elapsed
---- Chunk #6 ----
24.070032835006714 seconds elapsed
---- Chunk #7 ----
28.038588285446167 seconds elapsed
---- Chunk #8 ----
40.04304218292236 seconds elapsed
---- Chunk #9 ----
15.187128782272339 seconds elapsed
---- Chunk #10 ----
15.48180603981018 seconds elapsed
---- Chunk #11 ----
18.968178749084473 seconds elapsed
---- Chunk #12 ----
14.915328025817871 seconds elapsed
---- Chunk #13 ----
11.413012981414795 seconds elapsed
---- Chunk #14 ----
23.358027935028076 seconds elapsed
---- Chunk #15 ----
23.26058793067932 seconds elapsed
---- Chunk #16 ----
11.188714027404785 seconds elapsed
---- Chunk #17 ----
6.897191047668457 seconds elapsed
---- Chunk #18 ----
4.972226858139038 seconds elapsed
---- Chunk #19 ----
10.601

In [16]:
df.shape

(530, 3)

In [17]:
# ensuring the number of extracted texts matches the number of rows in the dataframe
assert sum(len(elem) for elem in all_text) == df.shape[0]

In [18]:
df

Unnamed: 0,title,link,article_type
0,\n 09/02/2022\n CoQ10 for Post-COVID Fatigue?\n,https://www.consumerlab.com/clinical-updates/#...,1
1,\n 09/02/2022\n Omicron-Targeting Boosters\n,https://www.consumerlab.com/clinical-updates/#...,1
2,\n \n \n What's in Gatorade?\n \n We were surp...,https://www.consumerlab.com/reviews/electrolyt...,1
3,\n \n \n How Many Times Should You Test for CO...,https://www.consumerlab.com/answers/how-and-wh...,1
4,"\n \n Latest Reviews of N95, KN95 and Other Ma...",https://www.consumerlab.com/answers/how-to-mak...,1
...,...,...,...
525,Coronavirus UPDATE: Massive coverup exposed by...,https://www.naturalhealth365.com/coronavirus-d...,0
526,… COVID-19 and Mental Health: The “Unpopular” ...,https://www.naturalhealth365.com/videos/mental...,0
527,COVID-19 and Mental Health: The “Unpopular” Truth,https://www.naturalhealth365.com/videos/mental...,0
528,… IRREFUTABLE evidence: mRNA COVID jab causes ...,https://www.naturalhealth365.com/irrefutable-e...,0


In [19]:
# creating a new column to store the texts for a particular article
df['text'] = 'undetermined'

# moving the article_type column to the end of the dataframe so that the label comes last
df.insert(len(df.columns)-1, 'article_type', df.pop('article_type'))

In [20]:
# creating a variable to store the current index in the all_text list
current_index = 0

# looping over all of the articles
for current_chunk in all_text:

    # looping over the articles parsed in the current chunk
    for current_text in current_chunk:

        # replacing the value in the current row with current_text
        df.at[current_index, 'text'] = current_text

        # incrementing the current_index
        current_index += 1
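
The nested loop above walks the chunks in order and writes one value per row. An equivalent way to obtain the same ordering, sketched here on toy data, is to flatten the list of lists first. The variable name `flat_text` is an assumption.

```python
from itertools import chain

# toy stand-in for all_text: one inner list per chunk
all_text = [["first", "second"], ["N/A"], ["fourth", "fifth"]]

# chain.from_iterable concatenates the inner lists in order
flat_text = list(chain.from_iterable(all_text))
print(flat_text)  # ['first', 'second', 'N/A', 'fourth', 'fifth']

# with a DataFrame `df` whose row count equals len(flat_text),
# the column could then be assigned in one step: df['text'] = flat_text
```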

In [21]:
df

Unnamed: 0,title,link,text,article_type
0,\n 09/02/2022\n CoQ10 for Post-COVID Fatigue?\n,https://www.consumerlab.com/clinical-updates/#...,Save to favorites\n\nThis feature is restricte...,1
1,\n 09/02/2022\n Omicron-Targeting Boosters\n,https://www.consumerlab.com/clinical-updates/#...,Save to favorites\n\nThis feature is restricte...,1
2,\n \n \n What's in Gatorade?\n \n We were surp...,https://www.consumerlab.com/reviews/electrolyt...,Join our Free Newsletter and Become a Member t...,1
3,\n \n \n How Many Times Should You Test for CO...,https://www.consumerlab.com/answers/how-and-wh...,"Answer:\n\nBefore we get into specifics, let's...",1
4,"\n \n Latest Reviews of N95, KN95 and Other Ma...",https://www.consumerlab.com/answers/how-to-mak...,"Answer:\n\nN95 masks are the gold standard, bu...",1
...,...,...,...,...
525,Coronavirus UPDATE: Massive coverup exposed by...,https://www.naturalhealth365.com/coronavirus-d...,(NaturalHealth365) As we continue to report on...,0
526,… COVID-19 and Mental Health: The “Unpopular” ...,https://www.naturalhealth365.com/videos/mental...,NaturalHeatlh365 with Jonathan Landsman presen...,0
527,COVID-19 and Mental Health: The “Unpopular” Truth,https://www.naturalhealth365.com/videos/mental...,NaturalHeatlh365 with Jonathan Landsman presen...,0
528,… IRREFUTABLE evidence: mRNA COVID jab causes ...,https://www.naturalhealth365.com/irrefutable-e...,(NaturalHealth365) We already have hard proof ...,0


In [22]:
# print out how many invalid articles there are
df['text'].value_counts()['N/A']

166

In [23]:
# drop all rows with N/A in their text column (copying so later writes do not target a view)
df = df[df['text'] != "N/A"].copy()

In [24]:
df

Unnamed: 0,title,link,text,article_type
0,\n 09/02/2022\n CoQ10 for Post-COVID Fatigue?\n,https://www.consumerlab.com/clinical-updates/#...,Save to favorites\n\nThis feature is restricte...,1
1,\n 09/02/2022\n Omicron-Targeting Boosters\n,https://www.consumerlab.com/clinical-updates/#...,Save to favorites\n\nThis feature is restricte...,1
2,\n \n \n What's in Gatorade?\n \n We were surp...,https://www.consumerlab.com/reviews/electrolyt...,Join our Free Newsletter and Become a Member t...,1
3,\n \n \n How Many Times Should You Test for CO...,https://www.consumerlab.com/answers/how-and-wh...,"Answer:\n\nBefore we get into specifics, let's...",1
4,"\n \n Latest Reviews of N95, KN95 and Other Ma...",https://www.consumerlab.com/answers/how-to-mak...,"Answer:\n\nN95 masks are the gold standard, bu...",1
...,...,...,...,...
525,Coronavirus UPDATE: Massive coverup exposed by...,https://www.naturalhealth365.com/coronavirus-d...,(NaturalHealth365) As we continue to report on...,0
526,… COVID-19 and Mental Health: The “Unpopular” ...,https://www.naturalhealth365.com/videos/mental...,NaturalHeatlh365 with Jonathan Landsman presen...,0
527,COVID-19 and Mental Health: The “Unpopular” Truth,https://www.naturalhealth365.com/videos/mental...,NaturalHeatlh365 with Jonathan Landsman presen...,0
528,… IRREFUTABLE evidence: mRNA COVID jab causes ...,https://www.naturalhealth365.com/irrefutable-e...,(NaturalHealth365) We already have hard proof ...,0


<h4>Cleaning the Data Obtained to Ensure <i>Only</i> Textual Content is Taken</h4>

- remove all escape sequences and non-ASCII Unicode characters (like \\xe2)
- call .strip() to remove leading and trailing whitespace
- standardize to a single space between words

In [25]:
import re

# remove literal escape sequences such as \n; sequences like \\xe2\\x80\\x99 are handled less well,
# e.g. the latter part of a possessive or contraction is cut off along with the escape
def clean_str(unfiltered_str: str) -> str:
    formatted_str = re.sub(r'\\\w+', '', unfiltered_str).strip()
    return ' '.join(formatted_str.split())
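
The limitation noted in the comment can be reproduced on a small string: because `\w+` is greedy, the match runs past the escape and swallows any letters that follow it. The sketch below restates `clean_str` so it runs on its own and compares it with a narrower variant that removes only two-digit hex escapes. The name `clean_str_strict` and its patterns are assumptions, not part of the notebook.

```python
import re

def clean_str(unfiltered_str: str) -> str:
    # same behaviour as the notebook's function
    formatted_str = re.sub(r'\\\w+', '', unfiltered_str).strip()
    return ' '.join(formatted_str.split())

def clean_str_strict(unfiltered_str: str) -> str:
    # hypothetical variant: drop only literal \xHH escapes, keep following letters
    no_hex = re.sub(r'\\x[0-9a-fA-F]{2}', '', unfiltered_str)
    # turn literal \n, \t and \r sequences into spaces
    no_escapes = re.sub(r'\\[ntr]', ' ', no_hex)
    return ' '.join(no_escapes.split())

# the doubled backslashes produce literal backslash characters in the string
sample = "\n\t It doesn\\xe2\\x80\\x99t   matter\\n"

print(clean_str(sample))         # It doesn matter
print(clean_str_strict(sample))  # It doesnt matter
```

Neither version restores the apostrophe. Doing that would require decoding the escaped bytes, which is outside the scope of this sketch.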

In [26]:
# loop over all of the rows and replace the existing title and text with cleaned-up versions
for i, row in df.iterrows():
    df.at[i, 'title'] = clean_str(df.at[i, 'title'])
    df.at[i, 'text'] = clean_str(df.at[i, 'text'])

In [27]:
df

Unnamed: 0,title,link,text,article_type
0,09/02/2022 CoQ10 for Post-COVID Fatigue?,https://www.consumerlab.com/clinical-updates/#...,Save to favorites This feature is restricted t...,1
1,09/02/2022 Omicron-Targeting Boosters,https://www.consumerlab.com/clinical-updates/#...,Save to favorites This feature is restricted t...,1
2,What's in Gatorade? We were surprised by what ...,https://www.consumerlab.com/reviews/electrolyt...,Join our Free Newsletter and Become a Member t...,1
3,How Many Times Should You Test for COVID? It m...,https://www.consumerlab.com/answers/how-and-wh...,"Answer: Before we get into specifics, let's st...",1
4,"Latest Reviews of N95, KN95 and Other Masks fo...",https://www.consumerlab.com/answers/how-to-mak...,"Answer: N95 masks are the gold standard, but o...",1
...,...,...,...,...
525,Coronavirus UPDATE: Massive coverup exposed by...,https://www.naturalhealth365.com/coronavirus-d...,(NaturalHealth365) As we continue to report on...,0
526,… COVID-19 and Mental Health: The “Unpopular” ...,https://www.naturalhealth365.com/videos/mental...,NaturalHeatlh365 with Jonathan Landsman presen...,0
527,COVID-19 and Mental Health: The “Unpopular” Truth,https://www.naturalhealth365.com/videos/mental...,NaturalHeatlh365 with Jonathan Landsman presen...,0
528,… IRREFUTABLE evidence: mRNA COVID jab causes ...,https://www.naturalhealth365.com/irrefutable-e...,(NaturalHealth365) We already have hard proof ...,0


<h4>Saving the DataFrame for Short-listed Articles as JSON</h4>

In [None]:
# saving the dataframe under the 'Data/processed' directory for further use
# (to_json returns None when given a path, so the result is not assigned)
df.to_json('Data/processed/processed_data.json', orient='records', indent=4)
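
Because rows with failed downloads were dropped after the two classes had been balanced, the saved file may no longer hold equal numbers of each label. The sketch below shows one way to check this after reading the file back. It uses toy records and a temporary directory rather than the real `Data/processed` path, and every value in it is made up for illustration.

```python
import json
import os
import tempfile
from collections import Counter

# toy records in the shape produced by orient='records': a list of row objects
records = [
    {"title": "t1", "link": "l1", "text": "x", "article_type": 1},
    {"title": "t2", "link": "l2", "text": "y", "article_type": 0},
    {"title": "t3", "link": "l3", "text": "z", "article_type": 0},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed_data.json")
    # stand-in for the df.to_json call
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4)

    # reading the file back as plain Python objects
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

# counting how many records carry each label
counts = Counter(rec["article_type"] for rec in loaded)
print(counts[1], counts[0])  # 1 2
```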