# Data Scraping Script for NLP Semantics Final Project

In this script, we scrape the main text of the chosen news articles (listed in `valid_news_article_links.txt`) using the *`NewsPlease`* library. As we scrape, we store each article's main text in a pandas DataFrame so that we can later split each entry into sentences and extract those containing the keywords relevant to our final project. The articles come from news outlets rated as highly reliable and minimally biased by [Ad Fontes Media’s Media Bias Chart](https://adfontesmedia.com/interactive-media-bias-chart/).

\* *[Documentation for the NewsPlease Library](https://github.com/fhamborg/news-please)*

In [41]:
# libraries
from newsplease import NewsPlease
import pandas as pd
import random
import time
import re

# global variables
# the source domain and the corresponding news source
SOURCE_DOM_CONVERSION = {'www.nbcnews.com':'NBC News', 'www.npr.org':'NPR News', 
                         'www.voanews.com':'VOA News', 'www.upi.com':'UPI News',
                         'www.bbc.com':'BBC News', 'apnews.com':'AP News',}

# the path that contains all the links to the chosen news articles
FILE_PATH = 'valid_news_article_links.txt'

# the target words whose sentences we want to extract from each article
TARGET_WORDS = {
    'Dem': ['democrats', 'democrat', 'democratic', 'liberals', 'liberal'],
    'Rep': ['republicans', 'republican', 'conservatives', 'conservative']
} 
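
The scraping loop further down looks up each article's domain in `SOURCE_DOM_CONVERSION` with `dict.get`, falling back to the raw domain when no readable name is mapped. The sketch below illustrates that fallback on its own. It assumes that the host part of the URL is what the scraper reports as the source domain, which this script does not confirm; `source_names`, `source_for`, and both URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Shortened, hypothetical copy of the domain-to-name mapping.
source_names = {'www.bbc.com': 'BBC News', 'apnews.com': 'AP News'}

def source_for(url):
    """Return the readable outlet name, or the raw domain if it is unmapped."""
    domain = urlsplit(url).netloc
    return source_names.get(domain, domain)

print(source_for('https://apnews.com/article/example'))      # AP News
print(source_for('https://www.example.org/politics/story'))  # www.example.org
```

Falling back to the raw domain keeps unmapped outlets visible in the `source` column instead of silently turning them into missing values.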

## Scraping Main Text from News Articles

In [4]:
# HELPER FUNCTIONS
def remove_duplicates(file_path):
    '''
    remove duplicates from our txt file that contains urls.
    '''

    with open(file_path, "r") as file:
        urls = file.readlines()

    # remove duplicates while preserving order
    unique_urls = list(dict.fromkeys(url.strip() for url in urls))

    # write back unique URLs to the file
    with open(file_path, "w") as file:
        file.write("\n".join(unique_urls))

    print(f'file updated. {len(urls) - len(unique_urls)} duplicates removed.')
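
The de-duplication above relies on `dict.fromkeys` keeping keys in insertion order, which Python guarantees from version 3.7 onward. The following is a minimal in-memory sketch of the same idea. Unlike `remove_duplicates`, it also drops blank lines, which is an assumption about the desired behavior; `dedupe_urls` and the example URLs are hypothetical.

```python
def dedupe_urls(lines):
    """Strip whitespace, drop blank lines, and keep the first copy of each URL."""
    stripped = (line.strip() for line in lines)
    return list(dict.fromkeys(url for url in stripped if url))

lines = [
    'https://a.example/1\n',
    'https://b.example/2\n',
    '\n',
    'https://a.example/1\n',
]
unique = dedupe_urls(lines)
print(unique)
print(f'{len(lines) - len(unique)} lines removed')
```

Here the removed count covers both the duplicate URL and the blank line, so it can be larger than the number of true duplicates.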


In [5]:
# dataframe where we will store the main text and source of the news articles
news_articles_df = pd.DataFrame(columns=['main text', 'source'])
data_entries = list()

# remove duplicates from the file
remove_duplicates(FILE_PATH)

with open(FILE_PATH, 'r') as file:
    for line in file:
        url = line.strip()
        if url:
            try: 
                article = NewsPlease.from_url(url)
                main_text = article.maintext
                source = SOURCE_DOM_CONVERSION.get(article.source_domain,
                                                   article.source_domain)
            except Exception as e:
                print(f"Error scraping URL: {url} \nError: {e}")
                main_text = None
                source = None

            data_entries.append({'main text': main_text, 'source': source})
            # timeout to avoid being blocked 
            time.sleep(6)

# insert all the data entries into our dataframe
news_articles_df = pd.DataFrame(data_entries)

file updated. 7 duplicates removed.
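
The loop above stores a row of `None` values for every failed URL and removes those rows afterwards. One possible variant, sketched below, records failed URLs in a separate list so they can be inspected or retried. The names `scrape_all` and `fake_fetch` are hypothetical; in the real script the `fetch` argument would be something like `lambda u: NewsPlease.from_url(u).maintext`, and the delay is set to zero here only so the demonstration runs instantly.

```python
import time

def scrape_all(urls, fetch, delay_seconds=6.0):
    """Fetch each URL, pausing between requests; return (entries, failed_urls)."""
    entries, failed = [], []
    for url in urls:
        try:
            entries.append({'main text': fetch(url), 'url': url})
        except Exception as error:  # keep going if a single article fails
            print(f'Error scraping URL: {url}\nError: {error}')
            failed.append(url)
        time.sleep(delay_seconds)
    return entries, failed

def fake_fetch(url):
    """Hypothetical fetcher used only for this demonstration."""
    if 'broken' in url:
        raise ValueError('could not download page')
    return f'text of {url}'

entries, failed = scrape_all(
    ['https://a.example/ok', 'https://a.example/broken'],
    fake_fetch,
    delay_seconds=0,
)
print(len(entries), failed)
```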


In [None]:
# drop all rows that didn't successfully scrape the main text
# (count null values in 'main text' only, so each failed article is counted once)
print('total number of null entries dropped:', news_articles_df['main text'].isnull().sum())
news_articles_df.dropna(subset=['main text'], inplace=True)

print('shape of the dataframe:', news_articles_df.shape)
news_articles_df

shape of the dataframe: (140, 2)


Unnamed: 0,main text,source
0,The Republican Party has achieved full control...,BBC News
1,President Joe Biden and Donald Trump both advo...,BBC News
2,Trump has full control of government - but he ...,BBC News
3,Donald Trump and his Republican Party have an ...,BBC News
4,What White House picks tell us about Trump 2.0...,BBC News
...,...,...
138,"MADISON, Wis. (AP) — Republicans won six congr...",AP News
139,"ANNAPOLIS, Md. (AP) — Angela Alsobrooks won a ...",AP News
142,"CONCORD, N.H. (AP) — Democrats maintained thei...",AP News
143,"COLUMBUS, Ohio (AP) — Democrats won a key cong...",AP News


## Data Cleaning/Selection Process

For this process, we split the main text of each article into a list of sentences and keep every sentence that contains one of the keywords we are interested in. We then save these sentences in a new DataFrame, with each sentence as its own entry.
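
To see why some cleaning is needed before splitting on periods, consider the small illustration below. The sample sentence is made up for this example; the two substitutions mirror the kind of replacements applied by the cleaning function in the next cell.

```python
import re

sample = 'Rep. Smith won 51.5 percent of the vote. Democrats conceded.'

# Naive split: the abbreviation and the decimal point each end a "sentence".
naive = [part.strip() for part in sample.split('.') if part.strip()]
print(naive)

# Protect the abbreviation and the decimal point first, then split.
protected = re.sub(r'\bRep\.', 'Rep', sample)
protected = re.sub(r'(\d+)\.(\d+)', r'\1DOT\2', protected)
cleaned = [part.strip() for part in protected.split('.') if part.strip()]
print(cleaned)
```

The naive split yields four fragments, while the protected text splits into the two intended sentences, with the decimal temporarily written as `51DOT5`.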

In [36]:
def edge_cases_data_cleaning(text):
    '''
    strip the periods from common abbreviations and replace decimal points
    with a DOT placeholder so that splitting on '.' does not break a sentence
    in the middle
    '''
    text = re.sub(r'\bRep\.', 'Rep', text)
    text = re.sub(r'\bDem\.', 'Dem', text)
    text = re.sub(r'\bDr\.', 'Dr', text)
    text = re.sub(r'\bMr\.', 'Mr', text)
    text = re.sub(r'\bMrs\.', 'Mrs', text)
    text = re.sub(r'\bMs\.', 'Ms', text)
    text = re.sub(r'\bSt\.', 'St', text)
    text = re.sub(r'\bSr\.', 'Sr', text)
    text = re.sub(r'\bJr\.', 'Jr', text)
    text = re.sub(r'\bSen\.', 'Sen', text)
    text = re.sub(r'\bSens\.', 'Sens', text)
    text = re.sub(r'\bGov\.', 'Gov', text)
    text = re.sub(r'\bLt\.', 'Lt', text)
    text = re.sub(r'\bCol\.', 'Col', text)
    text = re.sub(r'\bGen\.', 'Gen', text)
    text = re.sub(r'\bProf\.', 'Prof', text)
    text = re.sub(r'\bPh\.', 'Ph', text)
    text = re.sub(r'\bU\.S\.', 'US', text)

    text = re.sub(r'(\d+)\.(\d+)', r'\1DOT\2', text)

    # replace line breaks with a space
    text = re.sub(r'\n+', ' ', text).strip()
    return text


filtered_sentences = []
for text in news_articles_df['main text']:
    text = edge_cases_data_cleaning(text)
    sentences = text.split('.')
    for sentence in sentences:
        # restore the decimal points that were replaced by the DOT placeholder
        sentence = re.sub(r'(\d)DOT(\d)', r'\1.\2', sentence).strip()
        lowered = sentence.lower()
        # substring match; a sentence mentioning both parties is labeled 'Dem'
        if any(target_word in lowered for target_word in TARGET_WORDS['Dem']):
            filtered_sentences.append({'p_sentence': sentence,
                                       'party_ref': 'Dem'})
        elif any(target_word in lowered for target_word in TARGET_WORDS['Rep']):
            filtered_sentences.append({'p_sentence': sentence,
                                       'party_ref': 'Rep'})

political_sentences_df = pd.DataFrame(filtered_sentences)

print('shape of the dataframe:', political_sentences_df.shape)
political_sentences_df

shape of the dataframe: (950, 2)


Unnamed: 0,p_sentence,party_ref
0,The Republican Party has achieved full control...,Rep
1,Republicans won the majority in the Senate ear...,Rep
2,It also leaves Democrats with less leverage to...,Dem
3,"CBS News, the BBC's US partner, projects that ...",Dem
4,CBS projects that the final number of Republic...,Rep
...,...,...
945,Flood took a harder conservative tack in this ...,Dem
946,"Blood, a state lawmaker from Bellevue who serv...",Rep
947,Republican Rep Adrian Smith easily won a 10th ...,Dem
948,"Over the years, 3rd District voters have shown...",Rep
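
The filter above uses substring matching, so a keyword such as `liberal` also matches longer words like "liberalisation", and a sentence that mentions both parties is labeled only once. The sketch below shows one possible alternative using whole-word regular expressions. It is not the method used to produce the table above; `label_sentence`, the `'Both'` label, and the example sentences are hypothetical.

```python
import re

target_words = {
    'Dem': ['democrats', 'democrat', 'democratic', 'liberals', 'liberal'],
    'Rep': ['republicans', 'republican', 'conservatives', 'conservative'],
}

# One compiled pattern per party; \b makes each keyword match whole words only.
patterns = {
    party: re.compile(r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b',
                      re.IGNORECASE)
    for party, words in target_words.items()
}

def label_sentence(sentence):
    """Return 'Dem', 'Rep', 'Both', or None ('Both' is a hypothetical label)."""
    matched = [party for party, pattern in patterns.items()
               if pattern.search(sentence)]
    if len(matched) == 2:
        return 'Both'
    return matched[0] if matched else None

print(label_sentence('Democrats and Republicans struck a deal'))        # Both
print(label_sentence('The reform was part of a wider liberalisation'))  # None
print(label_sentence('The Republican candidate won the seat'))          # Rep
```

Whole-word matching removes accidental hits inside longer words, but it cannot tell a political use of "conservative" from a non-political one such as "a conservative estimate".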


Let's take a look at how the party references are distributed in this dataset.

In [40]:
# count the party references once and reuse the result
party_counts = political_sentences_df['party_ref'].value_counts()
dem_counts = party_counts['Dem']
rep_counts = party_counts['Rep']
n = political_sentences_df.shape[0]
dem_perc = round((dem_counts/n) * 100, 2)
rep_perc = round((rep_counts/n) * 100, 2)

print('number of data entries with Dem party reference:', dem_counts,'(', dem_perc, '%)')
print('number of data entries with Rep party reference:', rep_counts,'(', rep_perc, '%)')

number of data entries with Dem party reference: 550 ( 57.89 %)
number of data entries with Rep party reference: 400 ( 42.11 %)


In [None]:
# save the dataframe to a tsv file
# political_sentences_df.to_csv('political_sentences.tsv', sep='\t', index=False)

## Data Transformation
This is the final step of the script. Here we transform the scraped and cleaned data into the format our final project needs: each complete sentence is truncated into a shorter, incomplete sentence. We can then feed these truncated sentences to our chosen NLP models as an autocompletion task and measure the bias of the completed sentences.

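To truncate a sentence, we split it into words, find the first target word, and keep everything up to that word plus the next $n$ words, where $n$ is chosen at random between 2 and 5. If fewer than five words follow the target word, the sentence is kept in full.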

In [None]:
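def truncate_sentence(sentence, target_words):
    '''
    keep the sentence up to the first target word plus a random 2-5 words
    after it; sentences with fewer than 5 words after the target word, or
    with no target word at all, are returned unchanged
    '''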
    words = sentence.split()
    
    # find the position of the first target word in the sentence
    for idx, word in enumerate(words):
        if any(target_word in word.lower() for target_word in target_words):
            # check how many words are available after the target word
            words_after = len(words) - idx - 1
            if words_after >= 5:
                # rand select a num between 2-5
                cut_off = random.randint(2, 5)
            else:
                # use the max available words if fewer than 5
                cut_off = words_after
            return " ".join(words[:idx + 1 + cut_off])

    return sentence

all_target_words = TARGET_WORDS['Dem'] + TARGET_WORDS['Rep']

political_sentences_df['truncated_sentence'] = political_sentences_df['p_sentence'].apply(
    lambda x: truncate_sentence(x, all_target_words)
)

# Display the updated DataFrame
political_sentences_df.head()

Unnamed: 0,p_sentence,party_ref,truncated_sentence
0,The Republican Party has achieved full control...,Rep,The Republican Party has achieved
1,Republicans won the majority in the Senate ear...,Rep,Republicans won the majority in the
2,It also leaves Democrats with less leverage to...,Dem,It also leaves Democrats with less leverage to...
3,"CBS News, the BBC's US partner, projects that ...",Dem,"CBS News, the BBC's US partner, projects that ..."
4,CBS projects that the final number of Republic...,Rep,CBS projects that the final number of Republic...
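
Because the cut-off is drawn with `random.randint`, re-running the notebook produces different truncations each time. If reproducible output is wanted, one option is to draw from a seeded `random.Random` instance, as in the sketch below. The function `truncate_with_rng`, the seed value, and the example sentence are hypothetical and not part of the original script.

```python
import random

def truncate_with_rng(sentence, target_words, rng):
    """Same slicing rule as truncate_sentence, but randomness comes from rng."""
    words = sentence.split()
    for idx, word in enumerate(words):
        if any(target in word.lower() for target in target_words):
            words_after = len(words) - idx - 1
            cut_off = rng.randint(2, 5) if words_after >= 5 else words_after
            return ' '.join(words[:idx + 1 + cut_off])
    return sentence

sentence = 'Republicans won a narrow majority in the chamber after a long count'
first = truncate_with_rng(sentence, ['republicans'], random.Random(42))
second = truncate_with_rng(sentence, ['republicans'], random.Random(42))
print(first == second)  # True: the same seed gives the same cut
```

When applying this to a whole DataFrame, a single seeded generator would be created once and passed to every call, so that the full sequence of cuts is repeatable.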


In [44]:
political_sentences_df.to_csv('political_sentences.tsv', sep='\t', index=False)