# Data Preparation and Preprocessing for Creating Vector Embeddings

The CAM corpus consists of 92,909,509 words (of which 758,151 distinct words), the SE corpus of 7,413,679 words (of which 154,174 distinct words), the YOUGOV corpus of 12,795,527 words (of which 236,853 distinct words). Based on this and other insights from the descriptive analysis, the following considerations present themselves:

* Should we create 3 models, or combine SE and YOUGOV?
* Should we remove stopwords?
* Should we stem/lemmatize?
* Should we replace rare words with a rare word token, to decrease variance (due to limited data) and computational time (due to limited resources)?
* Which of Word2Vec, FastText, and GloVe should we choose?

### Should we create 3 models, or combine SE and YOUGOV?

Given the fact that the CAM corpus is substantially larger than the other two, and that SE and YOUGOV are semantically and contextually somewhat similar, it could make sense to combine them to have a larger dataset. However, this will dilute unique semantic features of SE and YOUGOV. But since we are interested in how the semantics within the CAM community differs from a suitable control group, it seems sensible to combine the corpora.

Moreover, according to Altszyler et al. (2017, https://arxiv.org/abs/1610.01520), Word2Vec significantly outperform simpler statistical NLP models in certain semantic tasks when trained on corpora with >10m words. The SE corpus alone would be below this threshold, which provides another argument for combining the two. For comparison, Karjus and Cuskley (2023) had around 20m words in total. 

### Should we discard "junk" sentences?

In data scraped from the internet, it is common to find chunks of text of low quality (typos, non-english content, or just gibberish overall). These 'junk' sentences can add noise to the data and make it more difficult for models to learn word semantics effectively. Therefore it makes sense to discard sentences that do not live up to a certain level of quality. 

### Should we remove stopwords?

Stopwords are high-frequency words with little semantic significance, like "the", "a", etc. Removing stopwords can significantly reduce the computational costs required for training a word vector embedding model like Word2Vec. Eliminating stop words can hence lead to faster training times and reduced memory requirements. Moreover, the final word embeddings become more concise, resulting in models that are smaller in size and quicker to deploy or query. However, the efficiency gains from removing stopwords come at the potential expense of losing certain contextual nuances. Especially negation words (like "not") might provide important nuance for word vector embeddings. 

According to Qiao et al. (2019, https://arxiv.org/pdf/1904.07531.pdf), stopwords didn't make a difference to the performance metrics by which they evaluated a BERT model. I would recommend removing a custom selection of stopwords (or trying both if viable). Since it will significantly reduce computational complexity without sacrificing much semantic nuance, we will remove stopwords.

### Should we stem/lemmatize?

Stemming/lemmatizing reduces vocabulary size, leading to faster training and more compact models, as well as provide more consistency/better generalizability as the model does not need to learn separate representations for different forms of the same word. However, it might also strip words of their nuanced meanings (e.g. "love" and "loving"). In the context of Word2Vec, which relies on local word contexts to derive embeddings, altering these contexts through lemmatization or stemming can in theory affect the resulting word representations. 

There is some research indicating that stemming has little to no impact on word vector embeddings, and can therefore be omitted (https://ieeexplore.ieee.org/document/9139948). Karjus and Cuskley did lemmatize for the word vector embeddings ("However, lemmatization suits our main goal of ultimately comparing semantic concepts (such as the activity of running, regardless of whether it is expressed as a noun or a verb), rather than morpho-syntax, particularly for our topic, word frequency and semantic divergence analyses (for sentiment analysis, the text was not lemmatized)"). 

We will lemmatize our words to reduce variance and computational time in generating word vector embeddings.

###  Should we replace rare words with a rare word token, to decrease variance (due to limited data) and computational time (due to limited resources)?

Replacing rare words with a specific "rare word" token in Word2Vec training can mitigate overfitting caused by the model's attempts to learn embeddings for words with insufficent information about their context due to their limited occurrences. Moreover, substituting these infrequent terms can lower the computational costs, as the vocabulary size reduces, leading to faster training cycles and lesser memory usage. But, as above, removing rare words can also potentially lead to a loss of specific information. On balance, however, and since we are mostly interested in specific, commonly occurring words, it seems like a reasonable decision to exclude rare words. 

### Should we remove URLs, years and numbers?

A way to reduce unnecessary variance is to replace specific types of linguistic items with specific tokens. For examples, we can replace URLs with a specific URL token (like "url_token"), as URLs will in this context not carry specific semantic meaning. The same can be done for years and numbers. Overall, this will decrease variance/complexity of the vector embedding, as well as make it more computationally efficient to run.

### Additional preprocessing steps

* Remove short documents (less than 100 words)
* Remove punctuation
* Lowercase



## 0. Loading data

In [1]:
# Standard library imports
import os
import json
import re
import pickle
from collections import Counter

# Import third-party libraries
import spacy

# Import custom util functions
from utils import *

: 

: 

In [2]:
# Set the base directory and subdirectories of the corpora
base_dir = 'documents/raw'
subdirs = ['pseudoscience', 'search_engine', 'trusted_sources']

In [None]:
# Load the corpora
corpora, filtered_counts = load_corpora(base_dir, subdirs, filter=False)

## 1. Removing Junk Sentences

In [5]:
# Define a function to split the text into chunks
def chunk_text(text, chunk_size=900000):  # Keeping a buffer from the maximum limit
    """
    Splits the text into chunks of specified size, ensuring not to split sentences.
    """
    # Use regex to split the text by full stops
    sentences = re.split('(?<=[.!?])\s+', text)
    chunks = []
    current_chunk = ''
    for sentence in sentences:
        if len(current_chunk) + len(sentence) > chunk_size:
            chunks.append(current_chunk)
            current_chunk = sentence
        else:
            current_chunk += ' ' + sentence
    if current_chunk:
        chunks.append(current_chunk)
    return chunks

# Process and store the documents
base_store_dir = 'documents/preprocessed'

for corpus in corpora:
    print(corpus)
    already_processed = 0
    for i, (filename, doc) in enumerate(zip(os.listdir(os.path.join(base_dir, corpus)), corpora[corpus])):
        if document_processed(base_store_dir, corpus, filename):
            already_processed += 1
            continue
        if i == already_processed+1:
            print(f"Already processed: {already_processed}")
        # check if its a multiple of 1000
        elif i % 1000 == 0:
            print(f"Document {i}")
        
        # Check for document length and process in chunks if necessary
        if len(doc) > 900000:  # Approximate buffer from max limit
            print(f"Document {filename} has length: {len(doc)}")
            chunks = chunk_text(doc)
            english_parts = []
            non_english_parts = []
            for chunk in chunks:
                english_chunk, non_english_chunk = filter_english_sentences(chunk)
                english_parts.append(english_chunk)
                non_english_parts.extend(non_english_chunk)  # Since non_english is a list
            english = ' '.join(english_parts)
            non_english = non_english_parts  # Already a list
        else:
            english, non_english = filter_english_sentences(doc)

        store_document(english, non_english, base_store_dir, corpus, filename)
        print(f"Processed and stored {filename} in {corpus}.")
        if i == already_processed+20000:
            print(i)
            break


pseudoscience
search_engine
trusted_sources


In [6]:
# Print junk
base_store_dir = 'documents/preprocessed'
load_and_print_junk(base_store_dir, subdirs)


Junk from VCO093069.json in pseudoscience:
Makia
----------------------------------------
Junk from VAP217020.json in pseudoscience:
WND 2016
----------------------------------------
Junk from VAS226027.json in pseudoscience:
Habakkuk 1:5 (KJB)
----------------------------------------
Junk from NAT364062.json in pseudoscience:
This will be the first AIIA in Uttar Pradesh."
----------------------------------------
Junk from HOM012051.json in pseudoscience:
Poole
----------------------------------------
Junk from ANT167003.json in pseudoscience:
Consider the case of Sean and Stephanie Recchi. AmerisourceBergen, McKesson, and Central Health.
----------------------------------------
Junk from VEF271027.json in pseudoscience:
bioRxiv -:x-x • Beasley, D. (2020) • Palmer, S.; Cunniffe, N. and Donnelly, R. (2021) Jefferson, T.; Spencer, E.; Brassey, J. and Heneghan, C. (2020) Mandavilli, A. (2020) Proc. Natl. Acad. Eur. J. Clin. Giurgea, L.T. and Memoli, M.J. (2020) Vaccines Yeadon, M. (2021) 

In [7]:
# Get statistics for each corpus
statistics = {}

for corpus in corpora:
    total_clean_words = 0
    total_junk_words = 0
    
    # Directory where the preprocessed documents for the current corpus are stored
    corpus_dir = os.path.join(base_store_dir, corpus)
    
    # Iterate over each processed document in the corpus directory
    for filename in os.listdir(corpus_dir):
        with open(os.path.join(corpus_dir, filename), 'r') as file:
            data = json.load(file)
            total_clean_words += count_words(data['cleaned_text'])
            total_junk_words += count_words(' '.join(data['junk']))  # Since junk is stored as a list of sentences

    # Compute percentage
    total_words = total_clean_words + total_junk_words
    junk_percentage = (total_junk_words / total_words) * 100 if total_words != 0 else 0  # Handle case when total_words is 0
    
    statistics[corpus] = {
        'clean_words': total_clean_words,
        'junk_words': total_junk_words,
        'junk_percentage': junk_percentage
    }

In [8]:
# Print the results
for corpus, stats in statistics.items():
    print(f"{corpus}:")
    print(f"• Clean words: {stats['clean_words']:,}")
    print(f"• Junk words: {stats['junk_words']:,}")
    print(f"• Percentage of junk: {stats['junk_percentage']:.2f}%")
    print("------------------------------")

pseudoscience:
• Clean words: 80,900,555
• Junk words: 1,840,123
• Percentage of junk: 2.22%
------------------------------
search_engine:
• Clean words: 43,024,759
• Junk words: 1,638,797
• Percentage of junk: 3.67%
------------------------------
trusted_sources:
• Clean words: 9,548,965
• Junk words: 174,425
• Percentage of junk: 1.79%
------------------------------


## 2. Replace special tokens

We will now remove a number of specific terms which, in our context, do not carry much semantic significance. These terms will be replaced by a placeholder token. The Specifically, these will be:

* Years (e.g. 2020, 1992) (replace with [YEAR])
* Numbers, where they have not been removed (replace with [NUM])
* URLs (replace with [URL])

In [3]:
# Set the base directory and subdirectories of the corpora
base_dir = 'documents/preprocessed'
subdirs = ['pseudoscience', 'search_engine', 'trusted_sources']

# Load the corpora
corpora, filtered_counts = load_corpora(base_dir, subdirs, filter=True)

In [4]:
urls_list = []
numbers_count = 0
years_count = 0

for corpus in corpora:
    for i, doc in enumerate(corpora[corpus]):

        # Filter URLs
        doc_wo_urls, urls = replace_urls(doc)
        urls_list.extend(urls)

        # Filter years and numbers
        doc_wo_urls_num, years, numbers = replace_numbers(doc_wo_urls)
        years_count += len(years)
        numbers_count += len(numbers)

        # Reassign the document
        corpora[corpus][i] = doc_wo_urls_num

        # if i == 100:
        #     break

# Turn URLs into counter after processing all documents
urls_counter = Counter(urls_list)

print(f"Total URLs: {len(urls_list)}")
print(f"URLs: {urls_list[:10]}, ...")
print(f"Total years removed: {years_count}")
print(f"Total numbers removed: {numbers_count}")


Total URLs: 61027
URLs: ['ToolsForFreedom.com', 'http://www.fda.gov', 'VoteFraud.news', 'www.samplelecturepage.com', 'www.samplelecturepage.com', 'csps.com', 'https://www.mdpi.com/2076-2607/11/5/1308', 'weatherbell.com', 'Sciencemag.org', 'Greenmedinfo.com'], ...
Total years removed: 410459
Total numbers removed: 1246470


In [5]:
# Store Counter object as a dictionary
most_common_urls = dict(urls_counter.most_common(100))

# Write the data to a JSON file
with open('data/urls_counter.json', 'w', encoding='utf-8') as f:
    json.dump(most_common_urls, f, indent=4, ensure_ascii=False)

In [6]:
# Print the 10 most common URLs that were removed
dict(urls_counter.most_common(10))

{'Brighteon.com': 1496,
 'Amazon.com': 990,
 'HealthImpactNews.com': 561,
 'gmail.com': 532,
 'hsionline.com': 462,
 'ClinicalTrials.gov': 462,
 'Vaccines.news': 413,
 'stillnessinthestorm.com': 399,
 'Cancer.gov': 260,
 'naturalhealthresponse.com': 240}

In [7]:
# Print an example of a document that contains a URL
for corpus in corpora:
    for doc in corpora[corpus]:
        if 'url_token' in doc:
            # Find the index of the text where it says URL
            index = doc.find('url_token')
            print(f"..{doc[index-100:index+100]}..")
            break

..Articles, author of the book Cancer: The Lies, the Truth and the Solutions and senior researcher at url_token...
..Commons Attribution Non-Commercial ShareAlike num_token.0 IGO licence (CC BY-NC-SA num_token.0 IGO; url_token). The health of women, mothers and their families is influenced by access to reproductive ..
..e implications of this can be dire." In the United States, according to an essay on the career site url_token, in order to administer botulinum toxin injections "you must be a physician, physician ass..


### 3. Combining corpora, tokenizing and preprocessing

Here we will perform all the other steps outlined above:
* combining SE and YOUGOV corpora 
* removing rare words, stop words, and words shorter than 3 characters
* lowercase
* remove punctuation and special characters
* lemmatize words
* do some custom steps based on strange tokens encountered

In [8]:
# Load the Counter object from pickle
with open('data/word_freqs.pkl', 'rb') as f:
    word_freqs = pickle.load(f)

In [9]:
# Combine word frequencies for SE and YOUGOV
combined_freqs = word_freqs['search_engine'] + word_freqs['trusted_sources']
word_freqs['se_and_trusted'] = combined_freqs

# Delete individual word frequencies for SE and YOUGOV
del word_freqs['search_engine']
del word_freqs['trusted_sources']

# Now compute and print statistics
for corpus in word_freqs:
    print(f"{corpus}:")
    n_unique = len(word_freqs[corpus].most_common())
    total_words = sum(word_freqs[corpus].values())
    n = 50
    unique_words_less_than_n = len([word for word, count in word_freqs[corpus].items() if count < n])
    total_words_less_than_n = sum([count for word, count in word_freqs[corpus].items() if count < n])

    print(f"Unique words (after preprocessing): {n_unique:,}")
    print(f"Total words (after preprocessing): {total_words:,}")
    print(f"Unique words appearing less than 100 times: {unique_words_less_than_n:,}")
    print(f"Total words appearing less than 100 times: {total_words_less_than_n:,}")
    print(f"Percentage of unique words appearing less than 100 times: {unique_words_less_than_n/n_unique*100:.2f}%")
    print(f"Percentage of total words appearing less than 100 times: {total_words_less_than_n/total_words*100:.2f}%")
    print('-' * 40)

pseudoscience:
Unique words (after preprocessing): 342,194
Total words (after preprocessing): 41,786,947
Unique words appearing less than 100 times: 307,665
Total words appearing less than 100 times: 1,494,036
Percentage of unique words appearing less than 100 times: 89.91%
Percentage of total words appearing less than 100 times: 3.58%
----------------------------------------
se_and_trusted:
Unique words (after preprocessing): 398,124
Total words (after preprocessing): 28,996,649
Unique words appearing less than 100 times: 369,734
Total words appearing less than 100 times: 1,485,158
Percentage of unique words appearing less than 100 times: 92.87%
Percentage of total words appearing less than 100 times: 5.12%
----------------------------------------


Comparing CAM with the combined corpus of YOUGOV and SE, we find that in both, words that appear less than 100 times make up around 91% of unique words. However, they only make up between 3.5$ and 8.7% of total words. So, removing them will reduce computational costs and variance significantly, while only losing little semantic nuance.

In [10]:
corpora.keys()

dict_keys(['pseudoscience', 'search_engine', 'trusted_sources'])

In [11]:
# Combine the corpus of SE and YOUGOV
corpora['se_and_trusted'] = corpora['search_engine'] + corpora['trusted_sources']

In [12]:
# Preprocess SE and YOUGOV and store them
preprocessed_sentences_se_and_trusted = preprocess_for_word2vec(corpora['se_and_trusted'], word_freqs['se_and_trusted'])
with open('data/preprocessed_sentences_se_and_trusted.json', 'w') as f:
    json.dump(preprocessed_sentences_se_and_trusted, f)

Chunk 0 of 679
Chunk 50 of 679
Chunk 100 of 679
Chunk 150 of 679
Chunk 200 of 679
Chunk 250 of 679
Chunk 300 of 679
Chunk 350 of 679
Chunk 400 of 679
Chunk 450 of 679
Chunk 500 of 679
Chunk 550 of 679
Chunk 600 of 679
Chunk 650 of 679


In [13]:
# Preprocess CAM and store it
preprocessed_sentences_cam = preprocess_for_word2vec(corpora['pseudoscience'], word_freqs['pseudoscience'])
with open('data/preprocessed_sentences_pseudoscience.json', 'w') as f:
    json.dump(preprocessed_sentences_cam, f)

Chunk 0 of 1000
Chunk 50 of 1000
Chunk 100 of 1000
Chunk 150 of 1000
Chunk 200 of 1000
Chunk 250 of 1000
Chunk 300 of 1000
Chunk 350 of 1000
Chunk 400 of 1000
Chunk 450 of 1000
Chunk 500 of 1000
Chunk 550 of 1000
Chunk 600 of 1000
Chunk 650 of 1000
Chunk 700 of 1000
Chunk 750 of 1000
Chunk 800 of 1000
Chunk 850 of 1000
Chunk 900 of 1000
Chunk 950 of 1000


In [14]:
# Load the preprocessed sentences
with open('data/preprocessed_sentences_se_and_trusted.json', 'r') as f:
    preprocessed_sentences_seyougov = json.load(f)
with open('data/preprocessed_sentences_pseudoscience.json', 'r') as f:
    preprocessed_sentences_cam = json.load(f)

# Print a few examples 
print("se_and_trusted:")
print(preprocessed_sentences_seyougov[:10])
print()
print("pseudoscience:")
print(preprocessed_sentences_cam[:10])

se_and_trusted:
[['riley', 'year_token', 'antimicrobial', 'activity', 'major', 'component', 'essential', 'oil', 'rare_token', 'rare_token'], ['mode', 'antimicrobial', 'action', 'essential', 'oil', 'rare_token', 'rare_token', 'tea', 'tree', 'oil'], ['study', 'minimum', 'inhibitory', 'concentration', 'mode', 'action', 'oregano', 'essential', 'oil', 'transition', 'pore', 'inner', 'mitochondrial', 'membrane', 'operate', 'open', 'state', 'different', 'selectivity'], ['rare_token', 'rare_token', 'num_token', 'rare_token', 'rare_token', 'rare_token', 'year_token', 'mechanism', 'action', 'spanish', 'oregano', 'chinese', 'cinnamon', 'savory', 'essential', 'oil', 'cell', 'membrane', 'wall', 'escherichia', 'rare_token', 'rare_token', 'year_token'], ['project', 'report', 'extraction', 'essential', 'oil', 'application', 'bachelor', 'technology', 'chemical', 'engineering', 'department', 'chemical', 'engineering', 'national', 'institute', 'technology', 'rourkela-769008', 'rare_token', 'india', 'rare_