accessing the processed data and loading it into a pandas DataFrame

In [17]:
import pandas as pd

In [18]:
# accessing stored processed data in JSON format
df = pd.read_json('Data/articles.json', orient='records')
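
For reference, `orient='records'` expects the file to hold a JSON array with one object per row. A minimal sketch with made-up rows (the two records and the `example.org` links are illustrative, not taken from `Data/articles.json`):

```python
import io

import pandas as pd

# Hypothetical sample in "records" orientation: a list of row objects.
sample_json = """[
  {"title": "A", "link": "https://example.org/a", "text": "first text", "article_type": "science"},
  {"title": "B", "link": "https://example.org/b", "text": "second text", "article_type": "conspiracy"}
]"""

# read_json accepts a file-like object, so StringIO stands in for the file on disk.
sample_df = pd.read_json(io.StringIO(sample_json), orient="records")
print(sample_df.shape)
print(list(sample_df.columns))
```

Each key of the row objects becomes a column, which is why the real file yields four columns.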

In [19]:
df.shape

(1262, 4)

In [20]:
df.head()

Unnamed: 0,title,link,text,article_type
0,Children and COVID-19 Vaccination Trends,https://www.aap.org/en/pages/2019-novel-corona...,Summary of data publicly reported by the Cente...,science
1,COVID-19 State-Level Data Reports,https://www.aap.org/en/pages/2019-novel-corona...,"On May 11, 2023, the United States ended the P...",science
2,Prevention Papillomavirus can cause 6 types of...,https://www.cancer.org/cancer/risk-prevention/...,Our highly trained specialists are available 2...,science
3,COVID-19,https://www.lung.org/lung-health-diseases/lung...,Can we help you find more info? Start by selec...,science
4,End Youth Vaping Let's join together to end t...,https://www.lung.org/quit-smoking/end-youth-va...,Research – Youth Vaping and Lung Health The Am...,science


In [21]:
# drop the title and link columns, as they are not used for model training
df = df.drop(columns=["title", "link"])

In [22]:
df.shape

(1262, 2)

In [23]:
df['article_type'].value_counts()

article_type
conspiracy    689
science       573
Name: count, dtype: int64

encode the class labels as integers: "science" becomes 1 and "conspiracy" becomes 0

In [24]:
df['article_type'] = df['article_type'].map({
    "science": 1,
    "conspiracy": 0
})
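
One thing to watch with `Series.map` and a dict: any value missing from the dict silently becomes `NaN`. A small sketch of a sanity check, using a made-up series with a deliberately misspelled label (the sample values are hypothetical):

```python
import pandas as pd

# Hypothetical labels; "Science" (capitalised) is not a key in the mapping.
labels = pd.Series(["science", "conspiracy", "Science", "science"])
label_map = {"science": 1, "conspiracy": 0}

encoded = labels.map(label_map)

# Values that did not match any key show up as NaN after mapping.
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)
```

If `unmapped` is empty, every row received a valid 0 or 1.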

In [25]:
df.head()

Unnamed: 0,text,article_type
0,Summary of data publicly reported by the Cente...,1
1,"On May 11, 2023, the United States ended the P...",1
2,Our highly trained specialists are available 2...,1
3,Can we help you find more info? Start by selec...,1
4,Research – Youth Vaping and Lung Health The Am...,1


In [26]:
df['article_type'].value_counts()

article_type
0    689
1    573
Name: count, dtype: int64

the two classes are of similar size (689 vs. 573 articles), so the dataset is only mildly imbalanced
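
To put numbers on the balance, the counts printed above can be turned into class shares. This is plain arithmetic on those two counts; the variable names are only for illustration:

```python
# Counts copied from the value_counts() output above.
counts = {"conspiracy": 689, "science": 573}

total = sum(counts.values())
shares = {label: round(n / total, 3) for label, n in counts.items()}

# Ratio of the larger class to the smaller one.
imbalance_ratio = round(max(counts.values()) / min(counts.values()), 2)

# Accuracy of a classifier that always predicts the majority class.
majority_baseline = max(shares.values())

print(shares, imbalance_ratio, majority_baseline)
```

The majority-class share is a useful floor: a trained model should beat it comfortably.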

import the TextProcessor class and create an instance to process text

In [11]:
from text_processor import TextProcessor
text_processor = TextProcessor()

In [12]:
test_articles = [
    "The QUICK brown fox jumps over the 3:05 lazy 12/25 dog www.google.com/search.",
    "A JOUrney of A A A A 1000 miles begins with a single step.",
    "Life is what happens when you're busy making other plans.",
    "The only limit to our realization of tomorrow will be our doubts of today.",
    "In three words I can sum up everything I've learned about life: it goes on."
]

In [13]:
text_processor.preprocess(test_articles[0])

['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
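
Judging only from this output, `preprocess` appears to lowercase the text, drop URLs, numbers and stop words, and reduce words to a base form. The implementation of `TextProcessor` is not shown, so the following is a rough approximation under those assumptions: the stop-word list is a tiny hypothetical one, and the plural stripping is a naive stand-in for a real lemmatizer.

```python
import re

# Tiny hypothetical stop-word list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "over", "with", "is", "to", "in", "and"}
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def naive_preprocess(text):
    """Lowercase, strip URLs, keep alphabetic tokens, drop stop words."""
    text = URL_PATTERN.sub(" ", text.lower())
    tokens = re.findall(r"[a-z]+", text)  # digits and punctuation are discarded
    result = []
    for token in tokens:
        if token in STOPWORDS:
            continue
        # Naive plural stripping, standing in for proper lemmatization.
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]
        result.append(token)
    return result


sample = "The QUICK brown fox jumps over the 3:05 lazy 12/25 dog www.google.com/search."
tokens = naive_preprocess(sample)
print(tokens)
```

On the first test article this reproduces the token list shown above, though it will diverge from a real lemmatizer on irregular words.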

In [14]:
text_processor.vectorize(text_processor.preprocess(test_articles[0]))[:5]

array([ 0.03666334,  0.01219157, -0.03719561,  0.05736743, -0.00656513],
      dtype=float32)

In [15]:
text_processor.process(test_articles[0])[:5]

array([ 0.03666334,  0.01219157, -0.03719561,  0.05736743, -0.00656513],
      dtype=float32)
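
The two outputs above are identical, which suggests that `process` simply chains `preprocess` and `vectorize`. How `vectorize` builds its vector is not shown; one common approach, assumed here purely for illustration, is to average the word vectors of the tokens (mean pooling). The toy embeddings below are hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; a real model would supply these.
toy_vectors = {
    "quick": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "fox": np.array([0.0, 1.0, 0.0], dtype=np.float32),
}


def mean_pool(tokens, vectors):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known tokens to average")
    return np.mean(known, axis=0)


# "brown" has no vector in the toy table, so only two vectors are averaged.
doc_vector = mean_pool(["quick", "brown", "fox"], toy_vectors)
print(doc_vector)
```

Whatever the article length, the result has the same dimensionality as a single word vector, which is what a downstream classifier needs.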

In [16]:
df['embedding_vector'] = df['text'].apply(lambda x: text_processor.process(x))

ValueError: cannot compute mean with no input
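
The error suggests that at least one article yields no usable tokens after preprocessing, for example an empty `text` value, so there is nothing to average. A possible workaround, sketched with a hypothetical stand-in for `TextProcessor` (its real internals are not shown), is to return `None` for such rows and drop them afterwards:

```python
class StubProcessor:
    """Hypothetical stand-in for TextProcessor, only to make the sketch runnable."""

    def preprocess(self, text):
        return [w.lower() for w in text.split() if w.isalpha()]

    def vectorize(self, tokens):
        if not tokens:
            raise ValueError("cannot compute mean with no input")
        return [float(len(t)) for t in tokens]  # placeholder "vector"

    def process(self, text):
        return self.vectorize(self.preprocess(text))


def safe_process(processor, text):
    """Return the processed vector, or None when the text yields no tokens."""
    if not isinstance(text, str) or not text.strip():
        return None
    try:
        return processor.process(text)
    except ValueError:
        # Note: this also hides other ValueErrors raised by the processor.
        return None


texts = ["quick brown fox", "", "12/25 3:05"]
vectors = [safe_process(StubProcessor(), t) for t in texts]
kept = [i for i, v in enumerate(vectors) if v is not None]
print(kept)
```

In the notebook this would amount to applying `safe_process` with the real `text_processor` and then calling `df.dropna(subset=['embedding_vector'])`; it is worth counting how many rows are dropped before moving on.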

splitting the dataset into training data and test data

In [94]:
from sklearn.model_selection import train_test_split
# hold out 20% of the rows for testing; random_state makes the shuffle reproducible
training_set, test_set = train_test_split(df, test_size=0.2, random_state=1)
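
The split above is purely random. A possible variation, not used in this notebook, is to pass `stratify` so that both sets keep the overall class ratio. The sketch below uses synthetic labels with the same class counts as the dataset (689 and 573); row indices stand in for the actual rows.

```python
from sklearn.model_selection import train_test_split

# Synthetic labels matching the class counts reported earlier.
labels = [0] * 689 + [1] * 573
row_ids = list(range(len(labels)))

train_ids, test_ids, y_train, y_test = train_test_split(
    row_ids,
    labels,
    test_size=0.2,
    random_state=1,
    stratify=labels,  # keep the class ratio in both splits
)

print(len(train_ids), len(test_ids))
print(round(sum(y_test) / len(y_test), 3))
```

With 1262 rows and `test_size=0.2`, the test set size is rounded up, giving 1009 training rows and 253 test rows.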

In [95]:
training_set.head()

Unnamed: 0,title,link,text,article_type
123,coronavirus vaccine,https://www.hopkinsmedicine.org/search?q=coron...,COVID-19 Vaccine: What You Need to Know - Hopk...,1
256,Prognosis and persistence of smell and taste d...,https://www.bmj.com/content/378/bmj-2021-069503,Smell and taste disorders tended to be overloo...,1
258,US CDC announces major changes after criticism...,https://www.bmj.com/content/378/bmj.o2074,Janice Hopkins Tanne New York The US national ...,1
31,Doomscrolling COVID-19 News Takes an Emotional...,https://scienceblogs.com/sb-admin/2021/10/22/d...,"Picture this: it’s April 2020, you’re between ...",1
18,Polio was found in New York City wastewater. S...,https://geneticliteracyproject.org/2022/08/23/...,It’s easy to feel a bit of panic in the air. A...,1
...,...,...,...,...
203,Heat waves,https://watchers.news/category/earth-changes/h...,The second heatwave to hit Europe since mid-Ju...,1
255,"Re: Risk of preterm birth, small for gestation...",https://www.bmj.com/content/378/bmj-2022-07141...,,1
72,ponders society relationship with viruses,https://www.sciencenews.org/article/virology-b...,"Virology Joseph Osmundson W.W. Norton & Co., $...",1
235,CDC ResponseLearn how CDC is responding to COV...,https://www.cdc.gov/coronavirus/2019-ncov/comm...,UPDATE The White House announced that vaccines...,1


In [96]:
test_set.head()

Unnamed: 0,title,link,text,article_type
376,Pandemic of Propaganda – The Unvaccinated Are ...,https://oye.news/news/psychological/propaganda...,I’ve added the dynamic graph using the embed c...,0
421,White House Orders 171 Million Doses of “New” ...,https://vaccineimpact.com/2022/white-house-ord...,"by Brian Shilhavy Editor, Health Impact News T...",0
364,California Police Depts Refuse to Enforce News...,https://thefreethoughtproject.com/be-the-chang...,"Last month, California Governor Gavin Newsom m...",0
354,Watch: Cop Accuses Frontline COVID Worker of \...,https://thefreethoughtproject.com/police-bruta...,"Ulster County, NY — Shana Shaw, 26, is a nurse...",0
463,"10,000% Increase in Cancers Following COVID-19...",https://healthimpactnews.com/2022/10000-increa...,"by Brian Shilhavy Editor, Health Impact News R...",0
...,...,...,...,...
473,Most people infected with Omicron weren't even...,https://www.sott.net/article/471164-Most-peopl...,The lack of public awareness about being infec...,0
407,COVID vaccination and turbo cancer: pathologic...,https://doctors4covidethics.org/covid-vaccinat...,"In this video (26 min, Swedish with English su...",0
342,Journal Article Questioning COVID Vaccination ...,https://thevaccinereaction.org/2021/07/journal...,A scholarly article published in a medical jou...,0
410,with Professor Hannah Fry The BBC Pandemic Par...,https://iaindavis.com/propaganda-fry-part-3/,In Part 1 we looked at the propaganda and mani...,0


<h4>Serializing the Training and Testing DataFrames for Further Use</h4>

In [None]:
train_data.to_pickle('Data/model/train_data.pkl')

In [None]:
test_data.to_pickle('Data/model/test_data.pkl')
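
A pickled DataFrame can be read back with `pd.read_pickle`. A minimal round-trip sketch, using a small stand-in frame and a temporary directory rather than the notebook's `Data/model/` path:

```python
import os
import tempfile

import pandas as pd

# Stand-in data; the real frames come from the steps above.
frame = pd.DataFrame({"text": ["a", "b"], "article_type": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_data.pkl")
    frame.to_pickle(path)            # write
    restored = pd.read_pickle(path)  # read back

print(restored.equals(frame))
```

Note that `to_pickle` does not create missing directories, so the target folder has to exist first (for example via `os.makedirs(..., exist_ok=True)`), and pickle files should only be loaded from trusted sources.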

In [None]:
import json

with open('Data/model/relevant_words.json', 'w') as f:
    json.dump(train_relevant_words, f, indent=4)
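
The saved file can be loaded again with `json.load`. The structure of `train_relevant_words` is not shown in this section, so the dictionary below is a hypothetical placeholder used only to demonstrate the round trip:

```python
import json
import os
import tempfile

# Hypothetical shape for the relevant-words object; the real one may differ.
relevant_words = {"science": ["vaccine", "trial"], "conspiracy": ["hoax"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "relevant_words.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(relevant_words, f, indent=4)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == relevant_words)
```

`json.dump` only handles basic types, so sets or NumPy arrays would need converting to lists before saving.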