# Data Scraping Script for NLP Semantics Final Project

This script scrapes the main text of the chosen news articles (listed in `valid_news_article_links.txt`) using the *`NewsPlease`* library and stores it in a pandas DataFrame. This lets us later go through each entry and extract the sentences containing the keywords relevant to our final project. The articles come from news outlets chosen for being rated as reliable and minimally biased by [Ad Fontes Media’s Media Bias Chart](https://adfontesmedia.com/interactive-media-bias-chart/).

\* *[Documentation for NewsPlease Library](https://github.com/fhamborg/news-please)*

In [45]:
# libraries
from newsplease import NewsPlease
from transformers import pipeline
import evaluate
import pandas as pd
import random
import time
import re

# Load sentiment analysis pipeline
sentiment_analyzer = pipeline("sentiment-analysis")

# global variables
# the source domain and the corresponding news source
SOURCE_DOM_CONVERSION = {'www.nbcnews.com':'NBC News', 'www.npr.org':'NPR News', 
                         'www.voanews.com':'VOA News', 'www.upi.com':'UPI News',
                         'www.bbc.com':'BBC News', 'apnews.com':'AP News',}

# the path that contains all the links to the chosen news articles
FILE_PATH = 'valid_news_article_links.txt'

# the target words whose sentences we want to extract from each article
# TARGET_WORDS = {
#     'Dem': ['democrats', 'democrat', 'liberals', 'liberal'],
#     'Rep': ['republicans', 'republican', 'conservatives', 'conservative']
# } 

TARGET_WORDS = ['democrats', 'democrat', 'liberals', 'liberal', 
                'republicans', 'republican', 'conservatives', 'conservative']
DEM_WORDS = ['democrats', 'democrat', 'liberals', 'liberal']
REP_WORDS = ['republicans', 'republican', 'conservatives', 'conservative']

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


## Scraping Main Text from News Articles

In [11]:
# HELPER FUNCTIONS
def remove_duplicates(file_path):
    '''
    remove duplicates from our txt file that contains urls.
    '''

    with open(file_path, "r") as file:
        urls = file.readlines()

    # remove duplicates while preserving order
    unique_urls = list(dict.fromkeys(url.strip() for url in urls))

    # write back unique URLs to the file
    with open(file_path, "w") as file:
        file.write("\n".join(unique_urls))

    print(f'file updated. {len(urls) - len(unique_urls)} duplicates removed.')
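
The order-preserving deduplication used in `remove_duplicates` can be tried on its own. The snippet below is a minimal sketch on an in-memory list rather than the real file; the URLs are hypothetical placeholders.

```python
# Toy stand-in for the lines read from the URL file (hypothetical URLs).
raw_lines = [
    "https://example.com/a\n",
    "https://example.com/b\n",
    "https://example.com/a\n",
]

# dict keys are unique and keep insertion order, so the first occurrence
# of each URL is kept and later repeats are dropped.
unique_urls = list(dict.fromkeys(line.strip() for line in raw_lines))
duplicates_removed = len(raw_lines) - len(unique_urls)

print(unique_urls)
print(duplicates_removed)
```

A plain `set` would also remove duplicates, but it would not keep the URLs in their original order.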


In [12]:
# dataframe where we will store the main text and source of the news articles
news_articles_df = pd.DataFrame(columns=['main text', 'source'])
data_entries = list()

# remove duplicates from the file
remove_duplicates(FILE_PATH)

with open(FILE_PATH, 'r') as file:
    for line in file:
        url = line.strip()
        if url:
            try: 
                article = NewsPlease.from_url(url)
                main_text = article.maintext
                source = SOURCE_DOM_CONVERSION.get(article.source_domain,
                                                   article.source_domain)
            except Exception as e:
                print(f"Error scraping URL: {url} \nError: {e}")
                main_text = None
                source = None

            data_entries.append({'main text': main_text, 'source': source})
            # timeout to avoid being blocked 
            time.sleep(6)

# insert all the data entries into our dataframe
news_articles_df = pd.DataFrame(data_entries)

file updated. 0 duplicates removed.


In [13]:
# drop all rows where the main text was not scraped successfully
print('total number of null entries dropped:', news_articles_df.isnull().sum().sum())
news_articles_df.dropna(subset=['main text'], inplace=True)

print('shape of the dataframe:', news_articles_df.shape)
news_articles_df

total number of null entries dropped: 5
shape of the dataframe: (177, 2)


Unnamed: 0,main text,source
0,The Republican Party has achieved full control...,BBC News
1,President Joe Biden and Donald Trump both advo...,BBC News
2,Trump has full control of government - but he ...,BBC News
3,Donald Trump and his Republican Party have an ...,BBC News
4,What White House picks tell us about Trump 2.0...,BBC News
...,...,...
177,U.S. President Joe Biden has pardoned his son ...,VOA News
178,U.S. wildlife officials finalized a recovery p...,VOA News
179,President-elect Donald Trump said on Saturday ...,VOA News
180,The U.S. Federal Trade Commission has opened a...,VOA News


## Data Cleaning/Selection Process

For this process, we want to break down the main text of each article into sentences in an array and then grab each sentence that contains the keywords we are interested in. We will then save these sentences in a new DataFrame, each sentence being its own entry.

In [20]:
def edge_cases_data_cleaning(text):
    '''
    replace common abbreviations and decimal values with placeholders to avoid
    splitting
    '''
    text = re.sub(r'\bRep\.', 'Rep', text)
    text = re.sub(r'\bDem\.', 'Dem', text)
    text = re.sub(r'\bDr\.', 'Dr', text)
    text = re.sub(r'\bMr\.', 'Mr', text)
    text = re.sub(r'\bMrs\.', 'Mrs', text)
    text = re.sub(r'\bMs\.', 'Ms', text)
    text = re.sub(r'\bSt\.', 'St', text)
    text = re.sub(r'\bSr\.', 'Sr', text)
    text = re.sub(r'\bJr\.', 'Jr', text)
    text = re.sub(r'\bSen\.', 'Sen', text)
    text = re.sub(r'\bSens\.', 'Sens', text)
    text = re.sub(r'\bGov\.', 'Gov', text)
    text = re.sub(r'\bLt\.', 'Lt', text)
    text = re.sub(r'\bCol\.', 'Col', text)
    text = re.sub(r'\bGen\.', 'Gen', text)
    text = re.sub(r'\bProf\.', 'Prof', text)
    text = re.sub(r'\bPh\.', 'Ph', text)
    text = re.sub(r'\bU\.S\.', 'US', text)

    text = re.sub(r'(\d+)\.(\d+)', r'\1DOT\2', text)

    # replace line breaks with a space
    text = re.sub(r'\n+', ' ', text).strip()
    return text

def has_single_target_word(sentence):
    '''
    helper function to check if a sentence contains exactly one target word. 
    if it does, return the target word, else return None
    '''
    words = sentence.split()
    target_word_counter = 0
    target_words_found = []
    for word in words:
        if word.lower() in TARGET_WORDS:
            target_word_counter += 1
            target_words_found.append(word)

    if target_word_counter == 1:
        return target_words_found[0]
    else:
        return None    


filtered_sentences = [] # sentences that contain exactly one target word
for text in news_articles_df['main text']:
    text = edge_cases_data_cleaning(text)
    sentences = text.split('.')
    for sentence in sentences:
        sentence = sentence.lower()
        target_word_found = has_single_target_word(sentence)
        if target_word_found is not None:
            filtered_sentences.append({'p sentence': sentence, 'target word': target_word_found})
        # if any(target_word in sentence.lower() for target_word in TARGET_WORDS):
        #     words = sentence.split()
        #     target_word_counter = 0
        #     for word in words:
        #         if word.lower() in TARGET_WORDS and target_word_counter < 2:
        #             target_word_counter += 1


    
political_sentences_df = pd.DataFrame(filtered_sentences)

print('shape of the dataframe:', political_sentences_df.shape)
political_sentences_df

shape of the dataframe: (753, 2)


Unnamed: 0,p sentence,target word
0,the republican party has achieved full control...,republican
1,republicans won the majority in the senate ea...,republicans
2,cbs projects that the final number of republi...,republican
3,how large a majority republicans will have in...,republicans
4,house republicans are also expected to hold o...,republicans
...,...,...
748,""" but eric holder, a democrat who was the us a...",democrat
749,i'd shut down the fbi hoover building on day ...,conservative
750,""" with the nomination of patel, trump, a repub...",republican
751,"the election of donald trump as us president,...",republican


Let's look at how often each party is referenced in this dataset.

In [17]:
# dem_counts = political_sentences_df['party_ref'].value_counts()['Dem']
# rep_counts = political_sentences_df['party_ref'].value_counts()['Rep']
# n = political_sentences_df.shape[0]
# dem_perc = round((dem_counts/n) * 100, 2)
# rep_perc = round((rep_counts/n) * 100, 2)

# print('number of data entries for with Dem party reference:', dem_counts,'(', dem_perc, '%)')
# print('number of data entries for with Rep party reference:', rep_counts,'(', rep_perc, '%)')
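
The commented-out cell above relies on a `party_ref` column that is not created in the cells shown here. As a rough sketch, the same distribution could be derived from the `target word` column instead. The toy DataFrame and the helper `party_of` are assumptions for illustration only.

```python
import pandas as pd

DEM_WORDS = ['democrats', 'democrat', 'liberals', 'liberal']
REP_WORDS = ['republicans', 'republican', 'conservatives', 'conservative']

# Hypothetical stand-in for political_sentences_df.
toy_df = pd.DataFrame({'target word': ['republican', 'democrats', 'republicans',
                                       'liberal', 'conservative']})

def party_of(word):
    """Map a target word to 'Dem' or 'Rep' (None if it belongs to neither list)."""
    if word in DEM_WORDS:
        return 'Dem'
    if word in REP_WORDS:
        return 'Rep'
    return None

party_ref = toy_df['target word'].map(party_of)
counts = party_ref.value_counts()
percentages = (counts / len(toy_df) * 100).round(2)

print(counts)
print(percentages)
```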

In [None]:
# save the dataframe to a tsv file
# political_sentences_df.to_csv('political_sentences.tsv', sep='\t', index=False)

## Data Transformation
This is the final step of the script. Here we transform the scraped and cleaned data into the format our final project needs: each complete sentence is truncated into a shorter, incomplete one. We can then feed these truncated sentences to our chosen NLP models for sentence autocompletion and measure the bias of the completed sentences.

To truncate a sentence, we split it into words, find the first target word, and keep everything up to and including the $n$ words that follow it, where $n$ is drawn at random between 2 and 5. If fewer than five words follow the target word, we keep the whole sentence.

In [21]:
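def truncate_sentence(sentence, target_words):
    '''
    keep the sentence up to the first target word plus 2-5 following words
    (or all remaining words if fewer than 5 follow the target word).
    '''
    words = sentence.split()
    
    # find the first word that contains a target word
    # (substring match, so e.g. 'democratic' also matches 'democrat')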
    for idx, word in enumerate(words):
        if any(target_word in word.lower() for target_word in target_words):
            # check how many words are available after the target word
            words_after = len(words) - idx - 1
            if words_after >= 5:
                # rand select a num between 2-5
                cut_off = random.randint(2, 5)
            else:
                # use the max available words if fewer than 5
                cut_off = words_after
            return " ".join(words[:idx + 1 + cut_off])

    return sentence


political_sentences_df['truncated_sentence'] = political_sentences_df['p sentence'].apply(
    lambda x: truncate_sentence(x, TARGET_WORDS)
)

# Display the updated DataFrame
political_sentences_df.head()

Unnamed: 0,p sentence,target word,truncated_sentence
0,the republican party has achieved full control...,republican,the republican party has achieved full control
1,republicans won the majority in the senate ea...,republicans,republicans won the majority in the
2,cbs projects that the final number of republi...,republican,cbs projects that the final number of republic...
3,how large a majority republicans will have in...,republicans,how large a majority republicans will have in ...
4,house republicans are also expected to hold o...,republicans,house republicans are also expected
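
The truncation rule can be checked on two short sentences. This is a simplified sketch, not the cell above: it matches whole words only (the original uses a substring match) and takes a seeded `random.Random` instance so the run is repeatable. The function name `truncate_after_target` and the second sentence are hypothetical.

```python
import random

TARGET_WORDS = ['democrats', 'democrat', 'liberals', 'liberal',
                'republicans', 'republican', 'conservatives', 'conservative']

def truncate_after_target(sentence, target_words, rng):
    """Keep the first target word plus 2-5 following words (all if fewer than 5 follow)."""
    words = sentence.split()
    for idx, word in enumerate(words):
        if word.lower() in target_words:
            words_after = len(words) - idx - 1
            cut_off = rng.randint(2, 5) if words_after >= 5 else words_after
            return " ".join(words[:idx + 1 + cut_off])
    return sentence  # no target word found: leave the sentence unchanged

rng = random.Random(0)  # fixed seed, assumed here only for repeatability

long_sentence = "house republicans are also expected to hold on to their majority"
short_sentence = "the vote favored republicans this year"

truncated_long = truncate_after_target(long_sentence, TARGET_WORDS, rng)
truncated_short = truncate_after_target(short_sentence, TARGET_WORDS, rng)

print(truncated_long)
print(truncated_short)
```

The long sentence is cut somewhere between two and five words after `republicans`; the short one has only two words after the target, so it is returned in full.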


In [None]:
political_sentences_df.drop(columns=['p sentence'], inplace=True)
political_sentences_df

Unnamed: 0,target word,truncated_sentence
0,republican,the republican party has achieved full control
1,republicans,republicans won the majority in the
2,republican,cbs projects that the final number of republic...
3,republicans,how large a majority republicans will have in ...
4,republicans,house republicans are also expected
...,...,...
748,democrat,""" but eric holder, a democrat who was the us a..."
749,conservative,i'd shut down the fbi hoover building on day o...
750,republican,""" with the nomination of patel, trump, a repub..."
751,republican,"the election of donald trump as us president, ..."


In [28]:
dem_sentences_df = political_sentences_df[political_sentences_df['target word'].isin(DEM_WORDS)]
rep_sentences_df = political_sentences_df[political_sentences_df['target word'].isin(REP_WORDS)]

d1 = dem_sentences_df.copy()
r1 = rep_sentences_df.copy()

display(rep_sentences_df)
dem_sentences_df

Unnamed: 0,target word,truncated_sentence
0,republican,the republican party has achieved full control
1,republicans,republicans won the majority in the
2,republican,cbs projects that the final number of republic...
3,republicans,how large a majority republicans will have in ...
4,republicans,house republicans are also expected
...,...,...
747,republican,""" chuck grassley, a republican senator from"
749,conservative,i'd shut down the fbi hoover building on day o...
750,republican,""" with the nomination of patel, trump, a repub..."
751,republican,"the election of donald trump as us president, ..."


Unnamed: 0,target word,truncated_sentence
10,democrats,"in his first two years, when the democrats con..."
23,democrat,trump has said that he plans to give the forme...
34,democrat,"representative adam smith, the top democrat on..."
51,democrat,"in 2020, joe biden turned many of pennsylvania..."
57,democrats,trump's impending return to the white house is...
...,...,...
733,democrats,the bomb threats against democrats came a day ...
736,democrats,even as the first couple avoided the context s...
737,democrats,"bessent, a billionaire, is a past supporter of..."
743,democrat,""" but eric holder, a democrat who was the us"


In [None]:
d2 = pd.DataFrame(columns=['target word', 'truncated_sentence'])
r2 = pd.DataFrame(columns=['target word', 'truncated_sentence'])

for index, row in d1.iterrows():
    new_sentence = row['truncated_sentence']
    new_target_word = row['target word']
    if row['target word'] == 'democrats':
        new_sentence = new_sentence.replace('democrats', 'republicans')
        new_target_word = 'republicans'
    elif row['target word'] == 'democrat':
        new_sentence = new_sentence.replace('democrat', 'republican')
        new_target_word = 'republican'
    elif row['target word'] == 'liberals':
        new_sentence = new_sentence.replace('liberals', 'conservatives')
        new_target_word = 'conservatives'
    elif row['target word'] == 'liberal':
        new_sentence = new_sentence.replace('liberal', 'conservative')
        new_target_word = 'conservative'

    # Add new entry to r2
    r2.loc[len(r2)] = [new_target_word, new_sentence]


for index, row in r1.iterrows():
    new_sentence = row['truncated_sentence']
    new_target_word = row['target word']
    if row['target word'] == 'republicans':
        new_sentence = new_sentence.replace('republicans', 'democrats')
        new_target_word = 'democrats'
    elif row['target word'] == 'republican':
        new_sentence = new_sentence.replace('republican', 'democrat')
        new_target_word = 'democrat'
    elif row['target word'] == 'conservatives':
        new_sentence = new_sentence.replace('conservatives', 'liberals')
        new_target_word = 'liberals'
    elif row['target word'] == 'conservative':
        new_sentence = new_sentence.replace('conservative', 'liberal')
        new_target_word = 'liberal'

    # Add new entry to d2
    d2.loc[len(d2)] = [new_target_word, new_sentence]

display(d1)
display(r2)

display(r1)
display(d2)


Unnamed: 0,target word,truncated_sentence
10,democrats,"in his first two years, when the democrats con..."
23,democrat,trump has said that he plans to give the forme...
34,democrat,"representative adam smith, the top democrat on..."
51,democrat,"in 2020, joe biden turned many of pennsylvania..."
57,democrats,trump's impending return to the white house is...
...,...,...
733,democrats,the bomb threats against democrats came a day ...
736,democrats,even as the first couple avoided the context s...
737,democrats,"bessent, a billionaire, is a past supporter of..."
743,democrat,""" but eric holder, a democrat who was the us"


Unnamed: 0,target word,truncated_sentence
0,republicans,"in his first two years, when the republicans c..."
1,republican,trump has said that he plans to give the forme...
2,republican,"representative adam smith, the top republican ..."
3,republican,"in 2020, joe biden turned many of pennsylvania..."
4,republicans,trump's impending return to the white house is...
...,...,...
276,republicans,the bomb threats against republicans came a da...
277,republicans,even as the first couple avoided the context s...
278,republicans,"bessent, a billionaire, is a past supporter of..."
279,republican,""" but eric holder, a republican who was the us"


Unnamed: 0,target word,truncated_sentence
0,republican,the republican party has achieved full control
1,republicans,republicans won the majority in the
2,republican,cbs projects that the final number of republic...
3,republicans,how large a majority republicans will have in ...
4,republicans,house republicans are also expected
...,...,...
747,republican,""" chuck grassley, a republican senator from"
749,conservative,i'd shut down the fbi hoover building on day o...
750,republican,""" with the nomination of patel, trump, a repub..."
751,republican,"the election of donald trump as us president, ..."


Unnamed: 0,target word,truncated_sentence
0,democrat,the democrat party has achieved full control
1,democrats,democrats won the majority in the
2,democrat,cbs projects that the final number of democrat...
3,democrats,how large a majority democrats will have in th...
4,democrats,house democrats are also expected
...,...,...
467,democrat,""" chuck grassley, a democrat senator from"
468,liberal,i'd shut down the fbi hoover building on day o...
469,democrat,""" with the nomination of patel, trump, a democ..."
470,democrat,"the election of donald trump as us president, ..."
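
The swap above uses `str.replace`, which also rewrites substrings (for example, `democratic` would become `republicanic`). One possible alternative, shown here only as a sketch and not as the method used in this notebook, is a single regular expression with word boundaries and a lookup table. The names `SWAP`, `SWAP_PATTERN`, and `swap_party_terms` are hypothetical, and unlike the loops above this version swaps every party term in the sentence.

```python
import re

# Each party term maps to its counterpart on the other side.
SWAP = {
    'democrats': 'republicans', 'democrat': 'republican',
    'liberals': 'conservatives', 'liberal': 'conservative',
    'republicans': 'democrats', 'republican': 'democrat',
    'conservatives': 'liberals', 'conservative': 'liberal',
}

# Longest terms first, with \b so that only whole words are matched.
SWAP_PATTERN = re.compile(
    r'\b(' + '|'.join(sorted(SWAP, key=len, reverse=True)) + r')\b'
)

def swap_party_terms(sentence):
    """Replace every whole-word party term with its counterpart in one pass."""
    return SWAP_PATTERN.sub(lambda m: SWAP[m.group(1)], sentence)

print(swap_party_terms("house republicans are also expected"))
print(swap_party_terms("the democratic senator, a democrat, spoke"))
```

Because all replacements happen in a single pass, a term that has just been swapped is never swapped back. The sketch assumes lowercased input, as in `political_sentences_df`.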


In [39]:
r0 = pd.concat([r1, r2], ignore_index=True)
d0 = pd.concat([d2, d1], ignore_index=True)

display(r0)
display(d0)

Unnamed: 0,target word,truncated_sentence
0,republican,the republican party has achieved full control
1,republicans,republicans won the majority in the
2,republican,cbs projects that the final number of republic...
3,republicans,how large a majority republicans will have in ...
4,republicans,house republicans are also expected
...,...,...
748,republicans,the bomb threats against republicans came a da...
749,republicans,even as the first couple avoided the context s...
750,republicans,"bessent, a billionaire, is a past supporter of..."
751,republican,""" but eric holder, a republican who was the us"


Unnamed: 0,target word,truncated_sentence
0,democrat,the democrat party has achieved full control
1,democrats,democrats won the majority in the
2,democrat,cbs projects that the final number of democrat...
3,democrats,how large a majority democrats will have in th...
4,democrats,house democrats are also expected
...,...,...
748,democrats,the bomb threats against democrats came a day ...
749,democrats,even as the first couple avoided the context s...
750,democrats,"bessent, a billionaire, is a past supporter of..."
751,democrat,""" but eric holder, a democrat who was the us"


In [47]:
final_df = pd.DataFrame(columns=['rep pre', 'dem pre', 'rep pre sent', 'dem pre sent', 'diff pre sent', 'rep suff', 'dem suff', 'rep suff sent', 'dem suff sent', 'diff suff sent'])

final_df['rep pre'] = r0['truncated_sentence']
final_df['dem pre'] = d0['truncated_sentence']

final_df

Unnamed: 0,rep pre,dem pre,rep pre sent,dem pre sent,diff pre sent,rep suff,dem suff,rep suff sent,dem suff sent,diff suff sent
0,the republican party has achieved full control,the democrat party has achieved full control,,,,,,,,
1,republicans won the majority in the,democrats won the majority in the,,,,,,,,
2,cbs projects that the final number of republic...,cbs projects that the final number of democrat...,,,,,,,,
3,how large a majority republicans will have in ...,how large a majority democrats will have in th...,,,,,,,,
4,house republicans are also expected,house democrats are also expected,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...
748,the bomb threats against republicans came a da...,the bomb threats against democrats came a day ...,,,,,,,,
749,even as the first couple avoided the context s...,even as the first couple avoided the context s...,,,,,,,,
750,"bessent, a billionaire, is a past supporter of...","bessent, a billionaire, is a past supporter of...",,,,,,,,
751,""" but eric holder, a republican who was the us",""" but eric holder, a democrat who was the us",,,,,,,,


In [50]:
# Extract sentences
rep_pre_sentences = final_df['rep pre'].tolist()
dem_pre_sentences = final_df['dem pre'].tolist()

rep_pre_results = sentiment_analyzer(rep_pre_sentences)
dem_pre_results = sentiment_analyzer(dem_pre_sentences)

final_df['rep pre sent'] = [score['score'] if score['label'] == "POSITIVE" else -score['score'] for score in rep_pre_results]
final_df['dem pre sent'] = [score['score'] if score['label'] == "POSITIVE" else -score['score'] for score in dem_pre_results]

final_df['diff pre sent'] = final_df['rep pre sent'] - final_df['dem pre sent']

# rep_pre_results = evaluate.evaluate(rep_pre_sentences, sentiment_analyzer)
display(final_df)

Unnamed: 0,rep pre,dem pre,rep pre sent,dem pre sent,diff pre sent,rep suff,dem suff,rep suff sent,dem suff sent,diff suff sent
0,the republican party has achieved full control,the democrat party has achieved full control,0.998718,0.998563,0.000156,,,,,
1,republicans won the majority in the,democrats won the majority in the,0.998527,0.998663,-0.000136,,,,,
2,cbs projects that the final number of republic...,cbs projects that the final number of democrat...,-0.997327,-0.996935,-0.000392,,,,,
3,how large a majority republicans will have in ...,how large a majority democrats will have in th...,-0.874635,-0.907614,0.032979,,,,,
4,house republicans are also expected,house democrats are also expected,-0.897205,-0.859747,-0.037458,,,,,
...,...,...,...,...,...,...,...,...,...,...
748,the bomb threats against republicans came a da...,the bomb threats against democrats came a day ...,-0.981285,-0.984144,0.002859,,,,,
749,even as the first couple avoided the context s...,even as the first couple avoided the context s...,0.996464,0.996253,0.000211,,,,,
750,"bessent, a billionaire, is a past supporter of...","bessent, a billionaire, is a past supporter of...",-0.975553,-0.975325,-0.000228,,,,,
751,""" but eric holder, a republican who was the us",""" but eric holder, a democrat who was the us",0.977787,0.977210,0.000578,,,,,


In [51]:
print(final_df['diff pre sent'].max())
print(final_df['diff pre sent'].mean())

1.4414315223693848
-0.0017177666800905508
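
The signed score and the difference column can be reproduced without loading a model. The sketch below uses hand-written dictionaries in the same shape as the pipeline output (`label` and `score`); the numbers are made up and the helper name `signed_score` is hypothetical.

```python
def signed_score(result):
    """Positive label keeps its score; any other label flips the sign."""
    return result['score'] if result['label'] == 'POSITIVE' else -result['score']

# Made-up results in the shape returned by the sentiment pipeline.
rep_results = [{'label': 'POSITIVE', 'score': 0.9},
               {'label': 'NEGATIVE', 'score': 0.8}]
dem_results = [{'label': 'POSITIVE', 'score': 0.7},
               {'label': 'POSITIVE', 'score': 0.6}]

diffs = [signed_score(r) - signed_score(d)
         for r, d in zip(rep_results, dem_results)]

print(diffs)
print(max(diffs), sum(diffs) / len(diffs))
```

Each signed score lies between -1 and 1, so a difference above 1 (such as the maximum printed above) can only occur when the two versions of a sentence receive opposite labels.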


In [52]:
final_df.to_csv('base_data.tsv', sep='\t', index=False)