# Master Thesis - Mattia Piazzalunga
In this notebook, the generation of the English dataset is carried out thanks to the data collected by Patrick Martinchek https://medium.com/newco/what-i-discovered-about-trump-and-clinton-from-analyzing-4-million-facebook-posts-922a4381fd2f

*Title*: Bridging a GAP: Text Style Transfer from Journalistic to Conversational for enhanced social media dissemination of news

*Supervisor*: Gabriella Pasi <br>
*Author*: Mattia Piazzalunga

*University*: Bicocca University of Milan <br>
*Department*: Informatics, Systems and Communication <br>
*Course*: Computer Science <br>
*Academic year*: 2023/2024

*Info*: This notebook was run locally. Download the whole repository before running.

*For suggestions or questions*: mattiapiazzalunga@outlook.com

## Inizializzazion

### Dowloading libraires

In [6]:
!pip install requests lxml newspaper3k torch ipywidgets sentence-transformers

Collecting newspaper3k
  Downloading newspaper3k-0.2.8-py3-none-any.whl.metadata (11 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-3.1.1-py3-none-any.whl.metadata (10 kB)
Collecting cssselect>=0.9.2 (from newspaper3k)
  Downloading cssselect-1.2.0-py2.py3-none-any.whl.metadata (2.2 kB)
Collecting feedparser>=5.2.1 (from newspaper3k)
  Downloading feedparser-6.0.11-py3-none-any.whl.metadata (2.4 kB)
Collecting tldextract>=2.0.1 (from newspaper3k)
  Downloading tldextract-5.1.2-py3-none-any.whl.metadata (11 kB)
Collecting feedfinder2>=0.0.4 (from newspaper3k)
  Downloading feedfinder2-0.0.4.tar.gz (3.3 kB)
  Preparing metadata (setup.py) ... [?25ldone
[?25hCollecting jieba3k>=0.35.1 (from newspaper3k)
  Downloading jieba3k-0.35.1.zip (7.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.4/7.4 MB[0m [31m64.8 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25h  Preparing metadata (setup.py) ... [?25ldone
Collecting tinysegmenter==0.3 

### Importing libraries

In [None]:
import glob
import pandas as pd
import random
from IPython.display import clear_output
from urllib.parse import urlparse
import os
import re
from newspaper import Article
import time
from bs4 import BeautifulSoup
import torch
from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm

### Generating dataframe from json

Extracting dataframe from json and removing nan

In [3]:
class FileProcessor:
    def __init__(self, path, attributes):
        self.path = path
        self.attributes = attributes

    def get_content_list(self):
        # Get list of all CSV files in the specified directory
        all_files = glob.glob(self.path + "/*.csv")

        if not all_files:
            raise ValueError(f"No CSV files found in directory: {self.path}")

        li = []

        # Print the files being processed (debugging step)
        print(f"Files found: {all_files}")

        # Iterate over all files
        for filename in all_files:
            try:
                # Read each CSV file with encoding utf-16 and skip bad lines
                df = pd.read_csv(filename, encoding="utf-16", index_col=None, header=0, on_bad_lines='skip')

                # Handle missing columns by checking against the given attributes list
                if all(attr in df.columns for attr in self.attributes):

                    # Add a new column with the filename
                    df['filename'] = os.path.basename(filename).replace('.csv', '')                   # Extract just the file name from the path

                    # Reorder columns: filename at the beginning + attributes
                    df = df[['filename'] + self.attributes]

                    li.append(df)
                else:
                    print(f"Skipping file {filename} due to missing columns.")
            except PermissionError:
                print(f"Permission denied for file {filename}. Skipping.")
            except Exception as e:
                print(f"Error reading file {filename}: {e}")

        # If no valid dataframes have been collected, raise an error
        if not li:
            raise ValueError("No valid CSV files could be read.")

        # Concatenate all DataFrames into a single DataFrame
        frame = pd.concat(li, axis=0, ignore_index=True)

        # Return the DataFrame as well if further processing is needed
        return frame

In [4]:
# Load the dataframe
processor = FileProcessor(path="../starting_datasets/EN/", attributes=["message", "post_type", "link"])
df = processor.get_content_list()

Files found: ['../starting_dataset/EN\\abc.csv', '../starting_dataset/EN\\bbc.csv', '../starting_dataset/EN\\cbs.csv', '../starting_dataset/EN\\cnn.csv', '../starting_dataset/EN\\fox.csv', '../starting_dataset/EN\\foxandfriends.csv', '../starting_dataset/EN\\huffington.csv', '../starting_dataset/EN\\latimes.csv', '../starting_dataset/EN\\nbc.csv', '../starting_dataset/EN\\npr.csv', '../starting_dataset/EN\\nytimes.csv', '../starting_dataset/EN\\time.csv', '../starting_dataset/EN\\usatoday.csv', '../starting_dataset/EN\\washington.csv', '../starting_dataset/EN\\wsj.csv']


In [5]:
#Check the dataframe
df.head()

Unnamed: 0,filename,message,post_type,link
0,abc,Roberts took the unusual step of devoting the ...,link,http://abcnews.go.com/blogs/headlines/2011/12/...
1,abc,Do you agree with the new law?,link,http://abcnews.go.com/blogs/politics/2011/12/w...
2,abc,Some pretty cool confetti will rain down on Ne...,link,http://abcnews.go.com/blogs/headlines/2011/12/...
3,abc,,link,http://abcnews.go.com/blogs/politics/2011/12/m...
4,abc,The pharmacy was held up by a man seeking pres...,link,http://abcnews.go.com/US/ny-pharmacy-shootout-...


In [6]:
# Filter the DataFrame to keep only the rows where 'post_type' is exactly 'link', ignoring leading/trailing spaces and case sensitivity
df_filtered = df[df['post_type'].str.strip().str.lower().eq('link'.lower())]

In [7]:
# Replace empty strings with pd.NA
df_filtered = df_filtered.copy()
df_filtered.replace("", pd.NA, inplace=True)

# Replace 'nan' and '<na>' strings with pd.NA
df_filtered.replace(['nan', '<na>'], pd.NA, inplace=True)

In [8]:
len(df_filtered)

427479

In [9]:
df_filtered = df.dropna()

In [10]:
len(df_filtered)

512052

In [11]:
df_filtered.head()

Unnamed: 0,filename,message,post_type,link
0,abc,Roberts took the unusual step of devoting the ...,link,http://abcnews.go.com/blogs/headlines/2011/12/...
1,abc,Do you agree with the new law?,link,http://abcnews.go.com/blogs/politics/2011/12/w...
2,abc,Some pretty cool confetti will rain down on Ne...,link,http://abcnews.go.com/blogs/headlines/2011/12/...
4,abc,The pharmacy was held up by a man seeking pres...,link,http://abcnews.go.com/US/ny-pharmacy-shootout-...
6,abc,There were no immediate reports of damage or i...,link,http://abcnews.go.com/International/wireStory/...


In [12]:
# Remove rows where the number of words (space-separated) in the specified column is greater than 1
df_filtered = df_filtered[df_filtered['link'].str.split().str.len() <= 1]
df_filtered = df_filtered[df_filtered['link'].str.split().str.len() == 1]

In [13]:
len(df_filtered)

512049

In [14]:
df_filtered.head()

Unnamed: 0,filename,message,post_type,link
0,abc,Roberts took the unusual step of devoting the ...,link,http://abcnews.go.com/blogs/headlines/2011/12/...
1,abc,Do you agree with the new law?,link,http://abcnews.go.com/blogs/politics/2011/12/w...
2,abc,Some pretty cool confetti will rain down on Ne...,link,http://abcnews.go.com/blogs/headlines/2011/12/...
4,abc,The pharmacy was held up by a man seeking pres...,link,http://abcnews.go.com/US/ny-pharmacy-shootout-...
6,abc,There were no immediate reports of damage or i...,link,http://abcnews.go.com/International/wireStory/...


In [15]:
# Drop the 'post_type' column from the DataFrame
df_filtered = df_filtered.drop(columns='post_type')

In [16]:
df_filtered.head()

Unnamed: 0,filename,message,link
0,abc,Roberts took the unusual step of devoting the ...,http://abcnews.go.com/blogs/headlines/2011/12/...
1,abc,Do you agree with the new law?,http://abcnews.go.com/blogs/politics/2011/12/w...
2,abc,Some pretty cool confetti will rain down on Ne...,http://abcnews.go.com/blogs/headlines/2011/12/...
4,abc,The pharmacy was held up by a man seeking pres...,http://abcnews.go.com/US/ny-pharmacy-shootout-...
6,abc,There were no immediate reports of damage or i...,http://abcnews.go.com/International/wireStory/...


### Getting real urls
Let's take additional steps so that we don't make too many requests

In [17]:
# Extract the top-level domain from each URL
df_filtered['top_level_domain'] = df_filtered['link'].apply(lambda url: f"{urlparse(url).scheme}://{urlparse(url).netloc}" if pd.notna(url) else pd.NA)

# Get the unique top-level domains and remove NaN values
unique_domains = pd.Series(df_filtered['top_level_domain'].unique()).dropna()

In [18]:
df_filtered.head()

Unnamed: 0,filename,message,link,top_level_domain
0,abc,Roberts took the unusual step of devoting the ...,http://abcnews.go.com/blogs/headlines/2011/12/...,http://abcnews.go.com
1,abc,Do you agree with the new law?,http://abcnews.go.com/blogs/politics/2011/12/w...,http://abcnews.go.com
2,abc,Some pretty cool confetti will rain down on Ne...,http://abcnews.go.com/blogs/headlines/2011/12/...,http://abcnews.go.com
4,abc,The pharmacy was held up by a man seeking pres...,http://abcnews.go.com/US/ny-pharmacy-shootout-...,http://abcnews.go.com
6,abc,There were no immediate reports of damage or i...,http://abcnews.go.com/International/wireStory/...,http://abcnews.go.com


In [19]:
len(unique_domains)

1018

In [20]:
# Function to get the final URL after following redirects
def get_final_url(url):
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
        final_url = response.url
        # Parse and return the scheme and netloc (domain)
        parsed_url = urlparse(final_url)
        return f"{parsed_url.scheme}://{parsed_url.netloc}"
    except requests.RequestException:
        return pd.NA

In [21]:
# Process the list of URLs and extract the final domains
final_domains = [get_final_url(url) for url in unique_domains]

In [22]:
len(final_domains)

1018

In [23]:
# Get the final domains to manually check the correctness to avoid unnecessary requests and problems with fake domains
final_domains

['https://abcnews.go.com',
 'http://abcn.ws',
 'http://getglue.com',
 'https://www.snappytv.com',
 'https://www.theknot.com',
 'https://www.facebook.com',
 'https://www.espnfrontrow.com',
 'https://www.goodmorningamerica.com',
 'http://news.yahoo.com',
 'https://bitly.com',
 'https://www.yahoo.com',
 'https://www.youtube.com',
 'http://espn.go.com',
 'https://www.pinterest.com',
 <NA>,
 'http://abcnewsradioonline.com',
 <NA>,
 'https://www.instagram.com',
 'https://t.co',
 'http://trib.al',
 'http://bitly.com',
 'https://www.youtube.com',
 <NA>,
 'https://www.goodmorningamerica.com',
 'http://www.espn.com',
 'https://www.youtube.com',
 'https://abcnews.go.com',
 'https://abc7news.com',
 'https://abc7.com',
 'https://abc7chicago.com',
 'https://abc7ny.com',
 'https://abc7ny.com',
 'https://giphy.com',
 'https://giphy.com',
 'https://vote.webbyawards.com',
 'http://trib.al',
 'https://abc11.com',
 'https://abc13.com',
 'https://www.bbc.co.uk',
 'https://audioboo.fm',
 'https://www.bbc.co

In [24]:
# Create a dictionary for the mapping of unique domains to final domains
domain_mapping = dict(zip(unique_domains, final_domains))

# Replace the unique domains with final domains in the URLs
df_filtered["top_level_domain"] = df_filtered["top_level_domain"].replace(domain_mapping, regex=True)

In [25]:
df_filtered

Unnamed: 0,filename,message,link,top_level_domain
0,abc,Roberts took the unusual step of devoting the ...,http://abcnews.go.com/blogs/headlines/2011/12/...,https://abcnews.go.com
1,abc,Do you agree with the new law?,http://abcnews.go.com/blogs/politics/2011/12/w...,https://abcnews.go.com
2,abc,Some pretty cool confetti will rain down on Ne...,http://abcnews.go.com/blogs/headlines/2011/12/...,https://abcnews.go.com
4,abc,The pharmacy was held up by a man seeking pres...,http://abcnews.go.com/US/ny-pharmacy-shootout-...,https://abcnews.go.com
6,abc,There were no immediate reports of damage or i...,http://abcnews.go.com/International/wireStory/...,https://abcnews.go.com
...,...,...,...,...
534386,wsj,"“It took massive amounts of work, incredible a...",http://on.wsj.com/2fyLhrR,http://on.wsj.com
534387,wsj,"As he has many times, Donald J. Trump cast his...",http://on.wsj.com/2exxAsn,http://on.wsj.com
534388,wsj,"As the Trump Organization's finance chief, All...",http://on.wsj.com/2faR36h,http://on.wsj.com
534389,wsj,HealthCare.gov has been straining to handle th...,http://on.wsj.com/2exw5dL,http://on.wsj.com


In [26]:
len(df_filtered)

512049

In [27]:
df_filtered_top_level = df_filtered.dropna(subset=['top_level_domain'])

In [28]:
len(df_filtered_top_level)

481451

In [29]:
df_filtered_top_level = df_filtered_top_level.copy()
df_filtered_top_level['top_level_domain'] = df_filtered_top_level['top_level_domain'].str.replace(r'https?://(www\.)?|www\.', '', regex=True)
df_filtered_top_level['top_level_domain'] = df_filtered_top_level['top_level_domain'].str.strip()

In [30]:
df_filtered_top_level.head()

Unnamed: 0,filename,message,link,top_level_domain
0,abc,Roberts took the unusual step of devoting the ...,http://abcnews.go.com/blogs/headlines/2011/12/...,abcnews.go.com
1,abc,Do you agree with the new law?,http://abcnews.go.com/blogs/politics/2011/12/w...,abcnews.go.com
2,abc,Some pretty cool confetti will rain down on Ne...,http://abcnews.go.com/blogs/headlines/2011/12/...,abcnews.go.com
4,abc,The pharmacy was held up by a man seeking pres...,http://abcnews.go.com/US/ny-pharmacy-shootout-...,abcnews.go.com
6,abc,There were no immediate reports of damage or i...,http://abcnews.go.com/International/wireStory/...,abcnews.go.com


Removing old links, fake domains and non-parallel news based on news agencies (the link inside the post must link to a news of the same news agency)

In [10]:
# Dictionary mapping the news agency to the list of domains associated. This dictionary is handcrafted & checked based on the final domains obtained.
domain_dict = {
    'cnn': ['us.cnn.com', 'edition.cnn.com'],
    'nbc': ['cnbc.com', 'nascartalk.nbcsports.com', 'nbcolympics.com', 'nbcsandiego.com', 'nbclosangeles.com', 'nbcchicago.com', 'nbcnews.com', 'nbcdfw.com', 'nbcnewyork.com', 'nbcbayarea.com', 'nbcwashington.com', 'nbcmiami.com', 'nbcsportschicago.com', 'nbcsportsbayarea.com', 'nbcsports.com'],
    'huffington': ['projects.huffingtonpost.com', 'elections.huffingtonpost.com', 'testkitchen.huffingtonpost.com', 'huffingtonpost.co.uk'],
    'washington': ['washingtonpost.com'],
    'usatoday': ['eu.shreveporttimes.com', 'ftw.usatoday.com', 'eu.usatoday.com', 'mmajunkie.usatoday.com', 'eu.citizen-times.com', 'eu.usatodayhss.com', 'sportswire.usatoday.com', 'sportsdata.usatoday.com', 'broncoswire.usatoday.com', 'boxingjunkie.usatoday.com'],
    'fox': ['foxnews.com', 'radio.foxnews.com', 'nation.foxnews.comgeo-block', 'fox29.com', 'fox59.com', 'fox4kc.com', 'fox5ny.com', 'foxbusiness.com'],
    'foxandfriends': ['foxnews.com', 'radio.foxnews.com', 'nation.foxnews.comgeo-block', 'fox29.com', 'fox59.com', 'fox4kc.com', 'fox5ny.com', 'foxbusiness.com'],
    'nytimes': ['archive.nytimes.com', 'nytimes.com', 'cooking.nytimes.com'],
    'abc': ['abcnewsradioonline.com', 'abcnews.go.com', 'abc7ny.com', 'abc7.com', 'abc11.com', 'abc13.com', 'abc7chicago.com', 'abc7news.com'],
    'bbc': ['bbc.co.uk', 'bbc.com'],
    'wsj': ['financingthefuture.wsj.com', 'wsj.com', 'blogs.wsj.com', 'graphics.wsj.com', 's.wsj.net'],
    'npr': ['elections.npr.org', 'elections2012.npr.org', 'blog.apps.npr.org', 'npr.org', 'stateimpact.npr.org'],
    'cbs': ['cbssports.com', 'cbs.com', 'cbsnews.com'],
    'latimes': ['latimes.com', 'events.latimes.com', 'homicide.latimes.com'],
    'time': ['time.com']
}

In [32]:
# Creating a tuple list with the filename and associated domains
filter_conditions = []
for filename, domains in domain_dict.items():
    for domain in domains:
        filter_conditions.append((filename, domain))

# Create a DataFrame from the filter conditions
filter_conditions_df = pd.DataFrame(filter_conditions, columns=['filename', 'top_level_domain'])

# Perform a merge to filter the original DataFrame based on the conditions
df_filtered_top_level = df_filtered_top_level.merge(filter_conditions_df, on=['filename', 'top_level_domain'], how='inner')

In [33]:
len(df_filtered_top_level)

275075

In [34]:
df_filtered_top_level.head()

Unnamed: 0,filename,message,link,top_level_domain
0,abc,Roberts took the unusual step of devoting the ...,http://abcnews.go.com/blogs/headlines/2011/12/...,abcnews.go.com
1,abc,Do you agree with the new law?,http://abcnews.go.com/blogs/politics/2011/12/w...,abcnews.go.com
2,abc,Some pretty cool confetti will rain down on Ne...,http://abcnews.go.com/blogs/headlines/2011/12/...,abcnews.go.com
3,abc,The pharmacy was held up by a man seeking pres...,http://abcnews.go.com/US/ny-pharmacy-shootout-...,abcnews.go.com
4,abc,There were no immediate reports of damage or i...,http://abcnews.go.com/International/wireStory/...,abcnews.go.com


### Make the dataset more manageable, reming some random rows
In future work we could skip this part, but the computational cost must be considered!

In [35]:
# Group the DataFrame by the 'filename' column 
grouped = df_filtered_top_level.groupby('filename')

# Create an empty list to store the sampled dataframes
sampled_dfs = []

# For each unique filename, take a random sample of 2000 rows
for filename, group in grouped:
    # If there are more than 2000 rows for the filename, sample 2000, otherwise take all rows
    sampled = group.sample(n=min(2250, len(group)), random_state=42)
    sampled_dfs.append(sampled)

# Concatenate all the sampled dataframes
df_filtered_top_level_filtered = pd.concat(sampled_dfs)

In [36]:
df_filtered_top_level_filtered.head()

Unnamed: 0,filename,message,link,top_level_domain
8644,abc,Marco Rubio repeated a similar line at least 4...,http://abcnews.go.com/Politics/marco-rubio-rep...,abcnews.go.com
1076,abc,A California parole board will meet today to d...,http://abcnews.go.com/US/california-mass-murde...,abcnews.go.com
2970,abc,The Force is strong with Angry Birds.,http://abcnews.go.com/blogs/technology/2012/10...,abcnews.go.com
2029,abc,$20K for Drumsticks?,http://abcnews.go.com/blogs/politics/2012/07/g...,abcnews.go.com
8543,abc,A precious photo of a Florida deputy spending ...,http://abcnews.go.com/US/photo-shows-florida-d...,abcnews.go.com


In [37]:
len(df_filtered_top_level_filtered)

27718

In [38]:
# Drop the unnecessary columns
df_filtered_top_level_filtered.drop(columns=['filename', 'top_level_domain'], inplace=True)

In [39]:
df_filtered_top_level_filtered.head()

Unnamed: 0,message,link
8644,Marco Rubio repeated a similar line at least 4...,http://abcnews.go.com/Politics/marco-rubio-rep...
1076,A California parole board will meet today to d...,http://abcnews.go.com/US/california-mass-murde...
2970,The Force is strong with Angry Birds.,http://abcnews.go.com/blogs/technology/2012/10...
2029,$20K for Drumsticks?,http://abcnews.go.com/blogs/politics/2012/07/g...
8543,A precious photo of a Florida deputy spending ...,http://abcnews.go.com/US/photo-shows-florida-d...


### Scraping websites

In [19]:
# Updated headers with a recent User-Agent
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/114.0.5735.110 Safari/537.36'),
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
}

# Function to scrape content from a URL using BeautifulSoup
def scrape_content(url):
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
        print(f"Scraping {url}: status code {response.status_code}, content length {len(response.content)}")

        soup = BeautifulSoup(response.text, 'lxml')  # Use 'lxml' parser for better performance

        # Remove unwanted elements (scripts, styles, etc.)
        for element in soup.find_all(["script", "style", "header", "footer", "aside", "nav", "form", "figure"]):
            element.decompose()

        # Define possible content containers based on common website structures
        possible_containers = [
            'article',
            'div[id*="content"]',
            'div[class*="content"]',
            'div[id*="article"]',
            'div[class*="article"]',
            'div[id*="main"]',
            'div[class*="main"]',
            'section[class*="content"]',
            'section[class*="article"]',
            'main',
        ]

        article = None
        for container in possible_containers:
            article = soup.select_one(container)
            if article and article.find_all('p'):
                break

        # If no specific container is found, use the body tag as a fallback
        if not article:
            print(f"No specific content container found for {url}. Using the body tag as fallback.")
            article = soup.body

        # Extract text from the found article or body
        paragraphs = article.find_all('p')
        if not paragraphs:
            # If no paragraphs are found, extract all text
            text = article.get_text(separator=' ', strip=True)
        else:
            text = ' '.join([p.get_text(strip=True) for p in paragraphs])

        if text:
            return text.strip()
        else:
            print(f"No text extracted from {url}.")
            return None

    except requests.exceptions.HTTPError as http_err:
        if response.status_code == 404:
            print(f"404 Error: Page not found for {url}")
        else:
            print(f"HTTP error occurred: {http_err} for {url}")
    except requests.exceptions.Timeout:
        print(f"Timeout while scraping {url}")
    except requests.exceptions.RequestException as e:
        print(f"Error scraping {url}: {str(e)}")
    except Exception as e:
        print(f"Unexpected error scraping {url}: {str(e)}")
    return None

# Function to scrape content using Newspaper3k as a fallback
def scrape_content_with_newspaper(url):
    try:
        article = Article(url)
        article.download()
        article.parse()
        text = article.text.strip()
        if text:
            print(f"Successfully extracted content using Newspaper3k for {url}")
            return text
        else:
            print(f"No text extracted with Newspaper3k from {url}.")
            return None
    except Exception as e:
        print(f"Error scraping {url} with Newspaper3k: {str(e)}")
        return None

def load_progress(df, filename='progress.csv'):
    if os.path.exists(filename):
        print(f"Loading progress from {filename}")
        progress_df = pd.read_csv(filename, dtype={'link': str, 'original': str, 'status': str}, low_memory=False)
        # Filter progress_df to include only links present in df
        progress_df = progress_df[progress_df['link'].isin(df['link'])]
        return progress_df
    return None

def save_progress(df, filename='progress.csv'):
    # Save only the 'link', 'original', and 'status' columns
    df[['link', 'original', 'status']].to_csv(filename, index=False)
    print(f"Progress saved to {filename}")

# Function to validate URL before scraping
def is_url_valid(url):
    try:
        response = requests.head(url, headers=HEADERS, timeout=5)
        if response.status_code == 404:
            return False
        return True
    except requests.exceptions.RequestException:
        return False

# Main scraping function
def scrape_articles(input_data, checkpoint_interval=10, clear_output_interval=300):
    if isinstance(input_data, pd.DataFrame):
        df = input_data.copy()
        print("Using in-memory DataFrame as input.")
    else:
        df = pd.read_csv(input_data)
        print(f"Loaded data from file: {input_data}")

    # Ensure 'status' and 'original' columns exist
    if 'status' not in df.columns:
        df['status'] = ''
    if 'original' not in df.columns:
        df['original'] = ''

    # Ensure 'link' column is string and strip whitespaces
    df['link'] = df['link'].astype(str).str.strip()

    # Remove duplicates based on the 'link' column
    df = df.drop_duplicates(subset='link').reset_index(drop=True)
    print(f"Length of df after removing duplicates: {len(df)}")

    # Check for existing progress
    progress_df = load_progress(df)
    if progress_df is not None:
        # Ensure 'link' column in progress_df is string and strip whitespaces
        progress_df['link'] = progress_df['link'].astype(str).str.strip()

        # Remove duplicates in progress_df
        progress_df = progress_df.drop_duplicates(subset='link').reset_index(drop=True)

        # Merge progress with the original DataFrame
        df = df.merge(
            progress_df[['link', 'original', 'status']],
            on='link',
            how='left',
            suffixes=('', '_progress'),
            validate='one_to_one'
        )

        # Update 'original' and 'status' columns with progress
        for col in ['original', 'status']:
            progress_col = f"{col}_progress"
            if progress_col in df.columns:
                df[col] = df[progress_col]
                df.drop(columns=[progress_col], inplace=True)
        print("Resuming from previous progress.")

    # Fill NaN values in 'original' and 'status' with empty strings
    df['original'] = df['original'].fillna('')
    df['status'] = df['status'].fillna('')

    # Convert 'original' and 'status' columns to string
    df['original'] = df['original'].astype(str)
    df['status'] = df['status'].astype(str)

    total_rows = df.shape[0]
    print(f"Total rows to process: {total_rows}")

    # Iterate over DataFrame rows using enumerate
    for i, (index, row) in enumerate(df.iterrows()):
        # Skip if 'status' is not empty
        if row['status'] != '':
            continue

        url = row['link']

        # Validate URL before scraping
        if not is_url_valid(url):
            print(f"URL is invalid (404 Not Found): {url}")
            df.at[index, 'status'] = 'failed'
            continue  # Skip to the next URL

        content = scrape_content(url)
        if not content:
            print(f"No content found with BeautifulSoup for {url}. Trying Newspaper3k.")
            content = scrape_content_with_newspaper(url)
        if content:
            df.at[index, 'original'] = content
            df.at[index, 'status'] = 'success'
            print(f"Scraped content for index {index}, length: {len(content)}")
        else:
            print(f"Failed to scrape content for index {index}")
            df.at[index, 'status'] = 'failed'

        # Sleep to avoid overwhelming the server (randomized delay)
        time.sleep(random.uniform(0.5, 1.5))

        # Clear output every 'clear_output_interval' iterations
        if (i + 1) % clear_output_interval == 0:
            clear_output(wait=True)
            print(f"Output cleared after {i + 1} iterations.")

        # Save progress at intervals
        if (i + 1) % checkpoint_interval == 0:
            save_progress(df)

    # Final save after completing the loop
    save_progress(df)
    return df

In [20]:
# Copy the dataframe - 
scraped_df = df_filtered_top_level_filtered.copy(deep=True)
scraped_df=scraped_df.iloc[:, :2]
# Run the scraper
scraped_df = scrape_articles(scraped_df, checkpoint_interval=50,  clear_output_interval=300)

Progress saved to progress.csv

HTTP error occurred: 401 Client Error: HTTP Forbidden for url: https://www.wsj.com/articles/amazon-launches-4k-streaming-video-for-prime-members-1418159683?mod=e2fb for http://www.wsj.com/articles/amazon-launches-4k-streaming-video-for-prime-members-1418159683?mod=e2fb

No content found with BeautifulSoup for http://www.wsj.com/articles/amazon-launches-4k-streaming-video-for-prime-members-1418159683?mod=e2fb. Trying Newspaper3k.

Error scraping http://www.wsj.com/articles/amazon-launches-4k-streaming-video-for-prime-members-1418159683?mod=e2fb with Newspaper3k: Article `download()` failed with 403 Client Error: Forbidden for url: https://www.wsj.com/articles/amazon-launches-4k-streaming-video-for-prime-members-1418159683?mod=e2fb on URL http://www.wsj.com/articles/amazon-launches-4k-streaming-video-for-prime-members-1418159683?mod=e2fb

Failed to scrape content for index 27000

HTTP error occurred: 401 Client Error: HTTP Forbidden for url: https://www.ws

### Removing a bad lines

In [110]:
#Copy the dataframe
scraped_df_new = scraped_df.copy(deep=True)
scraped_df_new.head()

Unnamed: 0,message,link,status,original
0,Marco Rubio repeated a similar line at least 4...,http://abcnews.go.com/Politics/marco-rubio-rep...,success,"Marco Rubio made his point again, and again, a..."
1,A California parole board will meet today to d...,http://abcnews.go.com/US/california-mass-murde...,success,"The notorious ""Helter Skelter"" killer was deni..."
2,The Force is strong with Angry Birds.,http://abcnews.go.com/blogs/technology/2012/10...,success,"Credit: Rovio Entertainment|LucasFilm Ltd. ""An..."
3,$20K for Drumsticks?,http://abcnews.go.com/blogs/politics/2012/07/g...,success,The General Services Administration is back in...
4,A precious photo of a Florida deputy spending ...,http://abcnews.go.com/US/photo-shows-florida-d...,success,A photo shows a Florida deputy having a tea pa...


In [12]:
# Function to extract domain
def extract_domain(url):
    parsed_url = urlparse(url)
    domain = parsed_url.netloc
    # Remove leading 'www.' if present
    return domain.replace('www.', '')

# Apply the function to the 'link' column to create a new 'domain' column
scraped_df_new['domain'] = scraped_df_new['link'].apply(extract_domain)

In [13]:
len(scraped_df_new)

27530

In [14]:
scraped_df_new.head()

Unnamed: 0,message,link,status,original,domain
1,A California parole board will meet today to d...,http://abcnews.go.com/US/california-mass-murde...,success,"The notorious ""Helter Skelter"" killer was deni...",abcnews.go.com
2,The Force is strong with Angry Birds.,http://abcnews.go.com/blogs/technology/2012/10...,success,"Credit: Rovio Entertainment|LucasFilm Ltd. ""An...",abcnews.go.com
3,$20K for Drumsticks?,http://abcnews.go.com/blogs/politics/2012/07/g...,success,The General Services Administration is back in...,abcnews.go.com
4,A precious photo of a Florida deputy spending ...,http://abcnews.go.com/US/photo-shows-florida-d...,success,A photo shows a Florida deputy having a tea pa...,abcnews.go.com
5,“My advice to the Muslim Brotherhood is they n...,http://abcnews.go.com/blogs/politics/2013/07/e...,success,Egypt's ambassador to the U.S. says the Muslim...,abcnews.go.com


In [15]:
# Remove rows with 'status' as 'failed'
scraped_df_new = scraped_df_new[scraped_df_new['status'] != 'failed']

In [16]:
len(scraped_df_new)

23552

In [17]:
#Remove rows with wrong scraped content
domains_to_remove = domain_dict['wsj'] + domain_dict['fox'] + domain_dict['foxandfriends']

# Filter out the rows where the domain is in the combined list
scraped_df_new = scraped_df_new[~scraped_df_new['domain'].isin(domains_to_remove)]

In [18]:
len(scraped_df_new)

22943

In [19]:
# Remove rows with NA
scraped_df_new = scraped_df_new.dropna()

In [21]:
#Remove non unicode characters
def normalize_encoding(text):
    if isinstance(text, bytes):
        return text.decode('utf-8', errors='ignore')
    return text

scraped_df_new["message"] = scraped_df_new["message"].apply(normalize_encoding)
scraped_df_new["original"] = scraped_df_new["original"].apply(normalize_encoding)

In [24]:
# Function to trim and replace multiple spaces with one
def clean_string_columns(df):
    # Select only string columns and apply the transformation
    str_cols = df.select_dtypes(include=['object', 'string'])
    df[str_cols.columns] = str_cols.apply(lambda col: col.str.split().str.join(' '))
    return df

# Apply the function to the DataFrame
scraped_df_new = clean_string_columns(scraped_df_new)

In [26]:
# Replace empty strings with pd.NA
scraped_df_new.replace("", pd.NA, inplace=True)

# Remove rows with NA
scraped_df_new = scraped_df_new.dropna()

In [27]:
len(scraped_df_new)

22943

### Removing harmful content

In [28]:
#Remove unwanted columns
scraped_df_new.drop(columns=['link', 'status','domain'], inplace=True)

In [29]:
scraped_df_new.head()

Unnamed: 0,message,original
1,A California parole board will meet today to d...,"The notorious ""Helter Skelter"" killer was deni..."
2,The Force is strong with Angry Birds.,"Credit: Rovio Entertainment|LucasFilm Ltd. ""An..."
3,$20K for Drumsticks?,The General Services Administration is back in...
4,A precious photo of a Florida deputy spending ...,A photo shows a Florida deputy having a tea pa...
5,“My advice to the Muslim Brotherhood is they n...,Egypt's ambassador to the U.S. says the Muslim...


In [30]:
len(scraped_df_new)

22943

In [31]:
# Function to remove rows where any string cell starts with "Don't want to wait?" (ignoring case)
def remove_rows_starting_with_phrase(df, phrase):
    # Define a helper function to check if a cell starts with the specified phrase
    def starts_with_phrase(cell):
        if isinstance(cell, str):
            return not cell.lower().startswith(phrase.lower())
        return True  # Non-string values are ignored

    # Apply the condition to the entire DataFrame
    return df[df.apply(lambda row: all(starts_with_phrase(cell) for cell in row), axis=1)]

# Apply the function
phrase = "Don't want to wait?"
scraped_df_new = remove_rows_starting_with_phrase(scraped_df_new, phrase)

In [32]:
len(scraped_df_new)

22857

In [33]:
# Define the function to remove links at the end of a string
def remove_links_at_end(s):
    if isinstance(s, str):
        # Updated regex to match URLs at the end, ignoring spaces before and after the URL
        pattern = re.compile(r'\s*(https?://|www\.)\S+\s*$')
        prev_s = None
        while prev_s != s:
            prev_s = s
            s = pattern.sub('', s)
        return s.strip()  # Strip any leftover spaces after removing the URL
    else:
        return s

# Apply the function to all cells in the DataFrame
scraped_df_new = scraped_df_new.apply(lambda col: col.apply(remove_links_at_end))

In [35]:
# Regular expression patterns to match dates in common formats
date_patterns = [
    r'\b\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b',  # Matches dates like 25/09/2024 or 09-25-2024
    r'\b\d{4}[-/]\d{2}[-/]\d{2}\b',  # Matches dates like 2024-09-25
    r'\b\d{1,2} \w{3,9} \d{4}\b',  # Matches dates like 25 Sep 2024
    r'\b\w{3,9} \d{1,2}[,.]? \d{4}\b',  # Matches dates like September 25, 2024 or September 25. 2024
    r'\b\w{3} \d{1,2}[,.]? \d{4}\b',  # Matches dates like Mar 25, 2024 or Mar 25. 2024
    r'\b\d{1,2} \w{3,9}\b',  # Matches dates like 25 Sep
    r'\b\w{3,9} \d{1,2}\b'  # Matches dates like September 25
]

def remove_dates_from_string(text):
    # Remove dates from the beginning
    for pattern in date_patterns:
        text = re.sub(r'^' + pattern, '', text).strip()
        # Remove dates from the end
        text = re.sub(pattern + r'$', '', text).strip()
    return text

def clean_dataframe_dates(df):
    # Apply only to columns of type 'object' (usually strings)
    string_columns = df.select_dtypes(include=['object']).columns
    df[string_columns] = df[string_columns].apply(lambda col: col.apply(remove_dates_from_string))
    return df

# Clean the dataframe
scraped_df_new = clean_dataframe_dates(scraped_df_new)

In [36]:
# Function to remove the unwanted text from the end of each cell
def remove_phrase(cell):
    if isinstance(cell, str):
        return cell.replace('24/7 coverage of breaking news and live events', '', 1).strip()
    return cell

# Apply the function to all cells in the dataframe
scraped_df_new = scraped_df_new.apply(lambda col: col.apply(remove_phrase))

In [37]:
# Function to remove 'watch:' from the end of each string, ignoring case
def remove_watch_phrase(cell):
    if isinstance(cell, str):  # Apply only to strings
        # Use a regular expression to remove 'watch:' at the end of the string, ignoring case
        return re.sub(r'(?i)watch:$', '', cell).strip()
    return cell

# Apply the function to all cells in the dataframe
scraped_df_new = scraped_df_new.apply(lambda col: col.apply(remove_watch_phrase))

In [38]:
# Function to remove 'by X X' from the start of each string
def remove_by_name_phrase(cell):
    if isinstance(cell, str):
        return re.sub(r'(?i)^by \w+ \w+', '', cell).strip()
    return cell

# Apply the function to all cells in the dataframe
scraped_df_new = scraped_df_new.apply(lambda col: col.apply(remove_by_name_phrase))

In [39]:
# Function to trim and replace multiple spaces with one
def clean_string_columns(df):
    # Select only string columns and apply transformation to trim and replace multiple spaces with one
    string_columns = df.select_dtypes(include=['object', 'string'])
    df[string_columns.columns] = string_columns.apply(lambda x: x.str.split().str.join(' '))
    return df

# Apply the function to the DataFrame
scraped_df_new = clean_string_columns(scraped_df_new)

# Replace empty strings with pd.NA
scraped_df_new.replace("", pd.NA, inplace=True)

# Remove rows with NA
scraped_df_new = scraped_df_new.dropna()

In [40]:
def remove_rows_with_short_text(df):
    # Define a helper function to check if a cell contains at least 7 words
    def has_minimum_words(cell):
        if isinstance(cell, str):
            return len(cell.split()) >= 7
        return True  # Non-string values are ignored for this condition

    # Apply the condition to each row, using only string columns
    return df[df.apply(lambda row: all(has_minimum_words(cell) for cell in row), axis=1)]

# Apply the function
scraped_df_new = remove_rows_with_short_text(scraped_df_new)

In [42]:
len(scraped_df_new)

20565

In [43]:
# Function to ensure that each string ends with a period (.)
def ensure_period_at_end(cell):
    if isinstance(cell, str):  # Apply only to strings
        # If the string ends with any punctuation, replace it with a period
        cell = re.sub(r'[!?.,;:]+$', '.', cell)
        # If the string doesn't end with a period, add one
        if not cell.endswith('.'):
            cell += '.'
    return cell

# Apply the function to each column of type 'object' (strings)
for col in scraped_df_new.select_dtypes(include='object').columns:
    scraped_df_new[col] = scraped_df_new[col].apply(ensure_period_at_end)

In [44]:
scraped_df_new.head()

Unnamed: 0,message,original
1,A California parole board will meet today to d...,"The notorious ""Helter Skelter"" killer was deni..."
2,The Force is strong with Angry Birds.,"Credit: Rovio Entertainment|LucasFilm Ltd. ""An..."
4,A precious photo of a Florida deputy spending ...,A photo shows a Florida deputy having a tea pa...
5,“My advice to the Muslim Brotherhood is they n...,Egypt's ambassador to the U.S. says the Muslim...
6,"There are more than 129,817 federally licensed...","(Getty Images) There are more than 129,817 fed..."


### Remove bad alignments

In [45]:
# Clean the DataFrame
scraped_df_new = scraped_df_new.dropna(subset=['original', 'message'])
scraped_df_new = scraped_df_new[scraped_df_new['original'].str.strip() != '']
scraped_df_new = scraped_df_new[scraped_df_new['message'].str.strip() != '']

# Reset the index to align with positional indices
scraped_df_new.reset_index(drop=True, inplace=True)

# Load the pre-trained model with device setting
model_name = 'sentence-transformers/paraphrase-mpnet-base-v2'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer(model_name, device=device)

# Define the similarity threshold and batch size
similarity_threshold = 0.7  # Adjust this value as needed
batch_size = 4096  # Adjust batch size based on hardware capabilities

# Initialize lists to store indices of rows to keep and similarities
indices_to_keep = []
similarities = []

# Prepare the DataFrame for batching
n_rows = len(scraped_df_new)
batches = range(0, n_rows, batch_size)

# Process the rows in batches
for start_idx in tqdm(batches, desc="Processing batches"):
    end_idx = min(start_idx + batch_size, n_rows)
    batch_df = scraped_df_new.iloc[start_idx:end_idx]
    
    # Extract texts from the batch
    journalistic_texts = batch_df['original'].tolist()
    conversational_texts = batch_df['message'].tolist()
    
    # Compute embeddings for both sets of texts in the batch
    journalistic_embeddings = model.encode(journalistic_texts, convert_to_tensor=True)
    conversational_embeddings = model.encode(conversational_texts, convert_to_tensor=True)

    # Compute cosine similarity between each pair of texts in the batch
    cosine_scores = util.cos_sim(journalistic_embeddings, conversational_embeddings).diagonal()
    
    # Handle potential zero vectors by filtering out NaN values
    valid_mask = ~torch.isnan(cosine_scores)
    valid_indices = [start_idx + i for i, valid in enumerate(valid_mask) if valid]
    valid_scores = cosine_scores[valid_mask]
    
    # Apply similarity threshold
    threshold_mask = valid_scores >= similarity_threshold
    batch_indices_to_keep = [valid_indices[i] for i, keep in enumerate(threshold_mask) if keep]
    batch_similarities = valid_scores[threshold_mask].cpu().tolist()
    
    # Append results to the final lists
    indices_to_keep.extend(batch_indices_to_keep)
    similarities.extend(batch_similarities)

# Add the similarity column for the filtered rows
df_filtered = scraped_df_new.iloc[indices_to_keep].copy()
df_filtered['Similarity'] = similarities

# Reset the index of the filtered DataFrame
df_filtered.reset_index(drop=True, inplace=True)

# Print the results
print(f"Total rows before filtering: {n_rows}")
print(f"Total rows after filtering: {len(df_filtered)}")

  from tqdm.autonotebook import tqdm, trange


modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/3.73k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/594 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.19k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]



1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Processing batches:   0%|          | 0/6 [00:00<?, ?it/s]

Batches:   0%|          | 0/128 [00:00<?, ?it/s]

Batches:   0%|          | 0/128 [00:00<?, ?it/s]

Processing batches:  17%|█▋        | 1/6 [02:07<10:37, 127.49s/it]

Batches:   0%|          | 0/128 [00:00<?, ?it/s]

Batches:   0%|          | 0/128 [00:00<?, ?it/s]

Processing batches:  33%|███▎      | 2/6 [04:17<08:35, 128.87s/it]

Batches:   0%|          | 0/128 [00:00<?, ?it/s]

Batches:   0%|          | 0/128 [00:00<?, ?it/s]

Processing batches:  50%|█████     | 3/6 [06:16<06:13, 124.61s/it]

Batches:   0%|          | 0/128 [00:00<?, ?it/s]

Batches:   0%|          | 0/128 [00:00<?, ?it/s]

Processing batches:  67%|██████▋   | 4/6 [08:25<04:12, 126.39s/it]

Batches:   0%|          | 0/128 [00:00<?, ?it/s]

Batches:   0%|          | 0/128 [00:00<?, ?it/s]

Processing batches:  83%|████████▎ | 5/6 [10:23<02:03, 123.14s/it]

Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Processing batches: 100%|██████████| 6/6 [10:26<00:00, 104.39s/it]

Total rows before filtering: 20565
Total rows after filtering: 5352





In [55]:
# Replace empty strings with pd.NA
df_filtered.replace("", pd.NA, inplace=True)

# Remove rows with NA
df_filtered = df_filtered.dropna()

In [56]:
len(df_filtered)

5352

### Saving the dataset

In [57]:
# Rename the columns
df_filtered = df_filtered.rename(columns={'original': 'journalistic', 'message': 'conversational'})
df_filtered = df_filtered[['journalistic', 'conversational']]

In [58]:
df_filtered.head()

Unnamed: 0,journalistic,conversational
0,A photo shows a Florida deputy having a tea pa...,A precious photo of a Florida deputy spending ...
1,Egypt's ambassador to the U.S. says the Muslim...,“My advice to the Muslim Brotherhood is they n...
2,"(Getty Images) There are more than 129,817 fed...","There are more than 129,817 federally licensed..."
3,Ferrera thanks Trump for his offensive tactics...,America Ferrera to Donald Trump: Thanks! --.
4,L'Osservatore Romano Vatican Pool/Getty Images...,This time there's no need to mourn; because th...


In [59]:
len(df_filtered)

5352

In [60]:
df_filtered.to_csv("../corpora/J2C_news_EN.csv", index=False)