<h1>Data Processing</h1>
<h3>
    <ol>
        <li>
            Accessing Previously Extracted Data from Scraping
        </li>
        <li>
            Combining Both Kinds of Articles and Converting The Combination to a DataFrame
        </li>
        <li>
            Scraping to Obtain the Textual Content in all Short-listed Articles
        </li>
        <li>
            Cleaning the Data Obtained to Ensure Only Textual Content is Taken
        </li>
        <li>
            Saving the DataFrame for Short-listed Articles as JSON
        </li>
    </ol>
</h3>

***

<h4>Accessing Previously Extracted Data from Scraping</h4>

In [131]:
import json

# reading a stored JSON file back into a list of article records
def read_from_storage(filename: str) -> list:
    # building the path of the previously saved file
    filename = f"Data/extract/{filename}.json"

    # opening the file at that path and parsing its JSON contents
    with open(filename, 'r') as f:
        return json.load(f)

<h4>Combining Both Kinds of Articles and Converting The Combination to a DataFrame</h4>

In [132]:
# reading the scientific and conspiracy articles stored in the data_acquisition process

science_articles = read_from_storage('science')
conspiracy_articles = read_from_storage('conspiracy')

In [None]:
# trimming both sets of articles to the length of the shorter one so the two classes are balanced

min_len = min(len(science_articles), len(conspiracy_articles))
science_articles = science_articles[:min_len]
conspiracy_articles = conspiracy_articles[:min_len]
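
Slicing keeps only the first `min_len` items of each list, so the result depends on the order in which the sources were scraped. As a hedged alternative, the sketch below balances two lists by seeded random sampling instead. The helper name `balance_by_sampling` and the toy lists are assumptions made for illustration and are not part of the notebook.

```python
import random

def balance_by_sampling(a: list, b: list, seed: int = 0) -> tuple:
    """Return random, equally sized subsets of two lists."""
    n = min(len(a), len(b))
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return rng.sample(a, n), rng.sample(b, n)

# Toy stand-ins for the two scraped lists (hypothetical data).
science_toy = [{"title": f"science {i}"} for i in range(5)]
conspiracy_toy = [{"title": f"conspiracy {i}"} for i in range(3)]

science_bal, conspiracy_bal = balance_by_sampling(science_toy, conspiracy_toy)
print(len(science_bal), len(conspiracy_bal))  # 3 3
```

Sampling spreads the kept articles across all sources, at the cost of no longer preserving the original scrape order.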

In [None]:
import pandas as pd

# creating a pandas dataframe from each of the two lists of article records
df_science = pd.DataFrame(science_articles)
df_conspiracy = pd.DataFrame(conspiracy_articles)

In [None]:
# creating a new column to distinguish scientific articles from conspiracy ones

df_science['article_type'] = 1
df_conspiracy['article_type'] = 0

In [None]:
df_science

Unnamed: 0,title,link,article_type
0,\n \n \n Quercetin for Seasonal Allergies?\n \...,https://www.consumerlab.com/reviews/quercetin-...,1
1,\n \n \n How Many Times Should You Test for CO...,https://www.consumerlab.com/answers/how-and-wh...,1
2,\n 08/08/2022\n Woman Pleads Guilty for Sellin...,https://www.consumerlab.com/recalls/14684/woma...,1
3,\n 08/08/2022\n Seller of CBD Warned for COVID...,https://www.consumerlab.com/recalls/14683/sell...,1
4,\n \n Product Reviews and Answers to Questions...,https://www.consumerlab.com/topic/coronavirus/,1
...,...,...,...
275,David Oliver: The overwhelming reaction to my ...,https://www.bmj.com/content/378/bmj.o2017,1
276,Investigating the monkeypox outbreak,https://www.bmj.com/content/377/bmj.o1314,1
277,Monkeypox: what we know about the 2022 outbrea...,https://www.bmj.com/content/378/bmj.o2058,1
278,Risk of preterm birth and stillbirth after cov...,https://www.bmj.com/content/378/bmj-2022-071416,1


In [None]:
df_conspiracy

Unnamed: 0,title,link,article_type
0,Fauci to Step Down in December \xe2\x80\x94 Wi...,https://www.sgtreport.com/2022/08/fauci-to-ste...,0
1,\n\t\t\t\t\t\t\t\t\t\tJudging the Covid-19 Pan...,https://americanfreepress.net/judging-the-covi...,0
2,Is the COVID-19 Vaccine a Miracle?,https://biologos.org/post/is-the-covid-19-vacc...,0
3,Is the COVID-19 vaccine safe?,https://biologos.org/resources/is-the-covid-19...,0
4,A Christian Statement on Science for Pandemic ...,https://biologos.org/post/a-christian-statemen...,0
...,...,...,...
275,Most people infected with Omicron weren't even...,https://www.sott.net/article/471164-Most-peopl...,0
276,"Zoonotic Langya virus found in China, CDC says",https://www.sott.net/article/470863-Zoonotic-L...,0
277,CDC quietly removes 'claim' that spike protein...,https://www.sott.net/article/471150-CDC-quietl...,0
278,9/11 vs. COVID: The Sorry Tale of a Nation Gon...,https://www.veteranstoday.com/2022/08/12/9-11-...,0


In [None]:
# merging the two dataframes whilst also preserving order
df = pd.concat([df_science, df_conspiracy], axis=0, ignore_index=True)
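
The effect of `ignore_index=True` is easiest to see on a small example. The sketch below uses two toy frames (hypothetical data) to compare the resulting index with and without the flag.

```python
import pandas as pd

# Two toy frames standing in for df_science and df_conspiracy (hypothetical data).
a = pd.DataFrame({"title": ["a0", "a1"], "article_type": 1})
b = pd.DataFrame({"title": ["b0", "b1"], "article_type": 0})

# Without ignore_index, each frame keeps its own row labels, so labels repeat.
kept = pd.concat([a, b], axis=0)

# With ignore_index, the rows are renumbered 0..n-1 in concatenation order.
renumbered = pd.concat([a, b], axis=0, ignore_index=True)

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```

A unique, consecutive index matters later in the notebook, where rows are addressed by position-like labels through `df.at`.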

In [None]:
df

Unnamed: 0,title,link,article_type
0,\n \n \n Quercetin for Seasonal Allergies?\n \...,https://www.consumerlab.com/reviews/quercetin-...,1
1,\n \n \n How Many Times Should You Test for CO...,https://www.consumerlab.com/answers/how-and-wh...,1
2,\n 08/08/2022\n Woman Pleads Guilty for Sellin...,https://www.consumerlab.com/recalls/14684/woma...,1
3,\n 08/08/2022\n Seller of CBD Warned for COVID...,https://www.consumerlab.com/recalls/14683/sell...,1
4,\n \n Product Reviews and Answers to Questions...,https://www.consumerlab.com/topic/coronavirus/,1
...,...,...,...
555,Most people infected with Omicron weren't even...,https://www.sott.net/article/471164-Most-peopl...,0
556,"Zoonotic Langya virus found in China, CDC says",https://www.sott.net/article/470863-Zoonotic-L...,0
557,CDC quietly removes 'claim' that spike protein...,https://www.sott.net/article/471150-CDC-quietl...,0
558,9/11 vs. COVID: The Sorry Tale of a Nation Gon...,https://www.veteranstoday.com/2022/08/12/9-11-...,0


<h4>Scraping to Obtain the Textual Content in all Short-listed Articles</h4>

In [None]:
import numpy as np

# splitting the df into 30 chunks so that scraping can be run and timed one chunk at a time
split_df = np.array_split(df, 30)
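
Some recent pandas releases have been reported to emit a deprecation warning when `np.array_split` is called directly on a DataFrame; whether that applies depends on the installed versions. A minimal sketch of an equivalent split, assuming a hypothetical helper named `split_frame`, is to split the row positions and then select each group with `iloc`:

```python
import numpy as np
import pandas as pd

def split_frame(frame: pd.DataFrame, n_chunks: int) -> list:
    """Split a DataFrame into n_chunks row-wise pieces of near-equal size."""
    # Split the row positions rather than the frame itself.
    positions = np.array_split(np.arange(len(frame)), n_chunks)
    return [frame.iloc[idx] for idx in positions]

# Toy frame with 25 rows (hypothetical data).
toy = pd.DataFrame({"link": [f"https://example.com/{i}" for i in range(25)]})
chunks = split_frame(toy, 10)

print(len(chunks))               # 10
print([len(c) for c in chunks])  # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

The chunks keep the original row labels, so concatenating them again reproduces the input frame.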

In [None]:
split_df[0]

Unnamed: 0,title,link,article_type
0,\n \n \n Quercetin for Seasonal Allergies?\n \...,https://www.consumerlab.com/reviews/quercetin-...,1
1,\n \n \n How Many Times Should You Test for CO...,https://www.consumerlab.com/answers/how-and-wh...,1
2,\n 08/08/2022\n Woman Pleads Guilty for Sellin...,https://www.consumerlab.com/recalls/14684/woma...,1
3,\n 08/08/2022\n Seller of CBD Warned for COVID...,https://www.consumerlab.com/recalls/14683/sell...,1
4,\n \n Product Reviews and Answers to Questions...,https://www.consumerlab.com/topic/coronavirus/,1
5,"\n \n Latest Reviews of N95, KN95 and Other Ma...",https://www.consumerlab.com/answers/how-to-mak...,1
6,Coronavirus,https://nejm.org/coronavirus,1
7,\n Japanese Encephalitis in Australia\n C. Wal...,https://nejm.org/doi/full/10.1056/NEJMc2207004...,1
8,\n Case Series of Children with Acute Hepatiti...,https://nejm.org/doi/full/10.1056/NEJMoa220629...,1
9,\n \n \n \n \n Coronavirus\n,https://nejm.org/coronavirus,1


In [None]:
from newspaper import Article
import time

# obtaining the texts for the articles present in the current chunk
# (each row is expected in column order: title, link, article_type)
def parse_chunk(articles) -> list:

    # creating a variable to store the article texts for the current chunk
    extracted_text = []

    # looping over all of the articles
    for article in articles:
        try:
            # attempting to download and parse the current article via its link (column 1)
            current_article = Article(article[1])
            current_article.download()
            current_article.parse()

            # adding the article text to storage
            extracted_text.append(current_article.text)
        except Exception:
            # adding in a sentinel value to storage to indicate failure
            extracted_text.append("N/A")

    return extracted_text
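
`parse_chunk` downloads one article at a time. Because the work is dominated by network waits, one possible variation is to fetch several links concurrently with a thread pool. The sketch below is not the notebook's method: `fetch_all` and `fake_fetch` are hypothetical names, and the download step is injected as a function so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

def fetch_all(links: Iterable[str], fetch: Callable[[str], str],
              max_workers: int = 8, sentinel: str = "N/A") -> list:
    """Apply fetch to every link concurrently, keeping the input order."""
    def safe_fetch(link: str) -> str:
        try:
            return fetch(link)
        except Exception:
            # Same sentinel convention as parse_chunk: mark failures, do not stop.
            return sentinel

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs.
        return list(pool.map(safe_fetch, links))

# Stand-in for a real download function (assumption for the demo).
def fake_fetch(link: str) -> str:
    if "bad" in link:
        raise ValueError("simulated download failure")
    return f"text of {link}"

texts = fetch_all(
    ["https://example.com/a", "https://example.com/bad", "https://example.com/c"],
    fake_fetch,
)
print(texts)
```

In the notebook's setting, the injected function could wrap the same `Article` download, parse, and `.text` steps used in `parse_chunk`.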

In [None]:
len(split_df)

30

In [None]:
# creating a variable to store all the text for all of the articles
all_text = []

In [126]:
# looping over every chunk, scraping its articles and timing each run
for index, chunk in enumerate(split_df):
    # printing out a status message - specifying the current chunk number
    print(f'---- Chunk #{index+1} ----')

    # creating a variable to store the time execution is started
    start = time.time()

    # parsing the current chunk
    parsed_current_chunk = parse_chunk(chunk.values)

    # creating a variable to store the time execution is ended
    end = time.time()

    # printing out a status message - specifying the number of seconds elapsed
    print(end-start, "seconds elapsed")

    # adding all of the articles in the current chunk to storage
    all_text.append(parsed_current_chunk)

---- Chunk #1 ----


In [None]:
df.shape

(560, 3)

In [None]:
# ensuring exactly one text was collected for every row of the dataframe
assert sum(len(elem) for elem in all_text) == df.shape[0]

In [None]:
df

Unnamed: 0,title,link,article_type
0,\n \n \n Quercetin for Seasonal Allergies?\n \...,https://www.consumerlab.com/reviews/quercetin-...,1
1,\n \n \n How Many Times Should You Test for CO...,https://www.consumerlab.com/answers/how-and-wh...,1
2,\n 08/08/2022\n Woman Pleads Guilty for Sellin...,https://www.consumerlab.com/recalls/14684/woma...,1
3,\n 08/08/2022\n Seller of CBD Warned for COVID...,https://www.consumerlab.com/recalls/14683/sell...,1
4,\n \n Product Reviews and Answers to Questions...,https://www.consumerlab.com/topic/coronavirus/,1
...,...,...,...
555,Most people infected with Omicron weren't even...,https://www.sott.net/article/471164-Most-peopl...,0
556,"Zoonotic Langya virus found in China, CDC says",https://www.sott.net/article/470863-Zoonotic-L...,0
557,CDC quietly removes 'claim' that spike protein...,https://www.sott.net/article/471150-CDC-quietl...,0
558,9/11 vs. COVID: The Sorry Tale of a Nation Gon...,https://www.veteranstoday.com/2022/08/12/9-11-...,0


In [None]:
# creating a new column to hold each article's text, filled with a placeholder for now
df['text'] = 'undetermined'

# moving the article_type column to the end of the dataframe, after the new text column
df.insert(len(df.columns)-1, 'article_type', df.pop('article_type'))

In [None]:
# creating a variable to store the current index in the all_text list
current_index = 0

# looping over all of the articles
for current_chunk in all_text:

    # looping over the articles parsed in the current chunk
    for current_text in current_chunk:

        # replacing the value in the current row with current_text
        df.at[current_index, 'text'] = current_text

        # incrementing the current_index
        current_index += 1
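
The nested loop writes one cell at a time. As a hedged sketch of an equivalent approach, the per-chunk lists can be flattened with `itertools.chain` and placed in front of `article_type` in a single step. The toy frame and chunk lists below are hypothetical.

```python
from itertools import chain

import pandas as pd

# Toy frame and per-chunk results (hypothetical data).
toy = pd.DataFrame({"link": ["u0", "u1", "u2"], "article_type": [1, 1, 0]})
toy_chunks = [["t0", "t1"], ["t2"]]

# Flatten the list of lists into one list, preserving chunk order.
flat = list(chain.from_iterable(toy_chunks))
assert len(flat) == len(toy)  # one text per row

# Insert the new column directly before article_type.
toy.insert(toy.columns.get_loc("article_type"), "text", flat)

print(list(toy.columns))  # ['link', 'text', 'article_type']
print(list(toy["text"]))  # ['t0', 't1', 't2']
```

This relies on the chunk order matching the row order, which holds when the chunks were produced by splitting the frame sequentially.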

In [130]:
df

Unnamed: 0,title,link,article_type
0,\n \n \n Quercetin for Seasonal Allergies?\n \...,https://www.consumerlab.com/reviews/quercetin-...,1
1,\n \n \n How Many Times Should You Test for CO...,https://www.consumerlab.com/answers/how-and-wh...,1
2,\n 08/08/2022\n Woman Pleads Guilty for Sellin...,https://www.consumerlab.com/recalls/14684/woma...,1
3,\n 08/08/2022\n Seller of CBD Warned for COVID...,https://www.consumerlab.com/recalls/14683/sell...,1
4,\n \n Product Reviews and Answers to Questions...,https://www.consumerlab.com/topic/coronavirus/,1
...,...,...,...
555,Most people infected with Omicron weren't even...,https://www.sott.net/article/471164-Most-peopl...,0
556,"Zoonotic Langya virus found in China, CDC says",https://www.sott.net/article/470863-Zoonotic-L...,0
557,CDC quietly removes 'claim' that spike protein...,https://www.sott.net/article/471150-CDC-quietl...,0
558,9/11 vs. COVID: The Sorry Tale of a Nation Gon...,https://www.veteranstoday.com/2022/08/12/9-11-...,0


In [128]:
# counting how many articles failed to download or parse (marked with the "N/A" sentinel)
(df['text'] == "N/A").sum()


In [None]:
# dropping all rows whose text is the "N/A" sentinel
df = df[df['text'] != "N/A"]
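
Indexing `value_counts()` with a label raises a `KeyError` when that label never occurs, for example when every download succeeded. The sketch below shows two ways to count the sentinel that return 0 in that case, followed by the filter itself. The toy frame is hypothetical.

```python
import pandas as pd

# Toy frame with one failed article (hypothetical data).
toy = pd.DataFrame({"text": ["body", "N/A", "more"], "article_type": [1, 0, 0]})

# Boolean comparison: True counts as 1, so the sum is the number of failures.
n_failed = int((toy["text"] == "N/A").sum())

# Alternative: .get with a default avoids a KeyError when the label is absent.
n_failed_alt = int(toy["text"].value_counts().get("N/A", 0))

# Keep only the rows with real text and renumber them.
kept = toy[toy["text"] != "N/A"].reset_index(drop=True)

print(n_failed, n_failed_alt)  # 1 1
print(len(kept))               # 2
```

Whether to call `reset_index` is a design choice: the notebook keeps the original row labels, which is why the later output still ends at label 558.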

In [None]:
df

Unnamed: 0,title,link,text,article_type
0,\n \n \n Quercetin for Seasonal Allergies?\n \...,https://www.consumerlab.com/reviews/quercetin-...,Save to favorites\n\nThis feature is restricte...,1
1,\n \n \n How Many Times Should You Test for CO...,https://www.consumerlab.com/answers/how-and-wh...,"Answer:\n\nBefore we get into specifics, let's...",1
2,\n 08/08/2022\n Woman Pleads Guilty for Sellin...,https://www.consumerlab.com/recalls/14684/woma...,"On July 27, 2022, Diana Daffin, owner of Savvy...",1
3,\n 08/08/2022\n Seller of CBD Warned for COVID...,https://www.consumerlab.com/recalls/14683/sell...,"On August 4, 2022, the FDA sent a warning lett...",1
4,\n \n Product Reviews and Answers to Questions...,https://www.consumerlab.com/topic/coronavirus/,Save to favorites\n\nThis feature is restricte...,1
...,...,...,...,...
555,Most people infected with Omicron weren't even...,https://www.sott.net/article/471164-Most-peopl...,The lack of public awareness about being infec...,0
556,"Zoonotic Langya virus found in China, CDC says",https://www.sott.net/article/470863-Zoonotic-L...,© Daily PRABHAT/simplifay\n\n\n\nThe 26 patien...,0
557,CDC quietly removes 'claim' that spike protein...,https://www.sott.net/article/471150-CDC-quietl...,The mRNA vaccines cannot give you COVID-19. Th...,0
558,9/11 vs. COVID: The Sorry Tale of a Nation Gon...,https://www.veteranstoday.com/2022/08/12/9-11-...,By John Kaminski\n\nWhat goes around comes aro...,0


<h4>Cleaning the Data Obtained to Ensure <i>Only</i> Textual Content is Taken</h4>

- remove all literal escape sequences, including leftover non-ASCII byte escapes (like \\xe2)
- apply .strip() to remove leading and trailing whitespace
- standardize to a single space between words

In [None]:
import re

# removing literal escape sequences (a backslash followed by word characters, e.g. \\xe2)
# caveat: letters directly after an escape are removed too, so the tail of a possessive or contraction can be cut off
def clean_str(unfiltered_str: str) -> str:
    formatted_str = re.sub(r'\\\w+', '', unfiltered_str).strip()
    # collapsing every run of whitespace (including newlines and tabs) into a single space
    return ' '.join(formatted_str.split())
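
The pattern above consumes every word character after a backslash, which is why the end of a contraction can disappear. One possible refinement, sketched here under the assumption that the residue consists of two-digit hex escapes such as `\xe2` and of literal `\n`, `\t`, `\r` sequences, is to match only those forms. The name `clean_str_narrow` and the sample string are hypothetical.

```python
import re

# Literal two-digit hex escapes, e.g. the six characters "\xe2" typed out in the text.
HEX_ESCAPE = re.compile(r'\\x[0-9a-fA-F]{2}')
# Literal whitespace escapes: a backslash followed by n, t, or r.
WS_ESCAPE = re.compile(r'\\[ntr]')

def clean_str_narrow(text: str) -> str:
    """Remove hex escapes, turn whitespace escapes into spaces, collapse spacing."""
    text = HEX_ESCAPE.sub('', text)   # drop the escape but keep following letters
    text = WS_ESCAPE.sub(' ', text)   # keep a word boundary where a break was
    return ' '.join(text.split())

# Raw string, so the backslashes are literal characters as in the scraped titles.
sample = r"Most people weren\xe2\x80\x99t aware\n\t of it"

print(re.sub(r'\\\w+', '', sample))  # broad pattern: the "t" of "weren't" is lost
print(clean_str_narrow(sample))      # werent aware of it (prefixed by "Most people")
```

The apostrophe itself is still lost, since it was encoded in the removed bytes; recovering it would require decoding the escapes rather than deleting them.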

In [None]:
# looping over all of the rows and replacing the existing title and text with cleaned-up versions
for i in df.index:
    df.at[i, 'title'] = clean_str(df.at[i, 'title'])
    df.at[i, 'text'] = clean_str(df.at[i, 'text'])

In [None]:
df

Unnamed: 0,title,link,text,article_type
0,Quercetin for Seasonal Allergies? A recent stu...,https://www.consumerlab.com/reviews/quercetin-...,Save to favorites This feature is restricted t...,1
1,How Many Times Should You Test for COVID? It m...,https://www.consumerlab.com/answers/how-and-wh...,"Answer: Before we get into specifics, let's st...",1
2,08/08/2022 Woman Pleads Guilty for Selling and...,https://www.consumerlab.com/recalls/14684/woma...,"On July 27, 2022, Diana Daffin, owner of Savvy...",1
3,08/08/2022 Seller of CBD Warned for COVID-19 C...,https://www.consumerlab.com/recalls/14683/sell...,"On August 4, 2022, the FDA sent a warning lett...",1
4,Product Reviews and Answers to Questions About...,https://www.consumerlab.com/topic/coronavirus/,Save to favorites This feature is restricted t...,1
...,...,...,...,...
555,Most people infected with Omicron weren't even...,https://www.sott.net/article/471164-Most-peopl...,The lack of public awareness about being infec...,0
556,"Zoonotic Langya virus found in China, CDC says",https://www.sott.net/article/470863-Zoonotic-L...,© Daily PRABHAT/simplifay The 26 patients deve...,0
557,CDC quietly removes 'claim' that spike protein...,https://www.sott.net/article/471150-CDC-quietl...,The mRNA vaccines cannot give you COVID-19. Th...,0
558,9/11 vs. COVID: The Sorry Tale of a Nation Gon...,https://www.veteranstoday.com/2022/08/12/9-11-...,By John Kaminski What goes around comes around...,0


<h4>Saving the DataFrame for Short-listed Articles as JSON</h4>

In [None]:
# saving the dataframe under the 'Data/processed' directory for further use
# (to_json returns None when given a path, so the result is not assigned)

df.to_json('Data/processed/processed_data.json', orient='records', indent=4)
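
To check what the saved file looks like, the sketch below writes a toy frame with the same `orient='records'` and `indent=4` settings into a temporary directory and reads it back with `json.load`. The toy data and the temporary path are assumptions; the notebook writes to `Data/processed/processed_data.json`.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the processed articles (hypothetical data).
toy = pd.DataFrame({"title": ["a", "b"], "article_type": [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed_data.json"

    # When a path is given, to_json writes the file and returns None.
    result = toy.to_json(out_path, orient="records", indent=4)

    # orient="records" produces a JSON list with one object per row.
    with open(out_path, "r") as f:
        records = json.load(f)

print(result)   # None
print(records)  # [{'title': 'a', 'article_type': 1}, {'title': 'b', 'article_type': 0}]
```

The records layout drops the row labels, so the index left over from filtering is not stored in the file.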