# Counting sources in scraped news articles 

I wrote a simple program to make it easier to check EquiQuote's results by hand. It takes a CSV of articles scraped from various news outlets, together with EquiQuote's output for each one, extracts EquiQuote's counts of male and female sources, records the user's own counts and any comments, checks whether the two sets of counts match, and writes all of this back to the CSV.

This was just to make the process manageable and accurate, so I didn't get overwhelmed while reading and annotating 75 articles of up to 1,000 words each. Parts of the code can also be repurposed for other content analysis work.

In [78]:
import pandas as pd
import os
import re
from ast import literal_eval
import matplotlib.pyplot as plt


## Define functions

In [55]:
def extract_counts(sources_detected):
    '''Extract the number of sources from each gender and add them as separate columns'''
    
    men_count = 0
    women_count = 0
    
    # Handle cases where sources aren't detected or output is in the wrong format
    if not sources_detected or isinstance(sources_detected, (float, int)):
        return men_count, women_count
    
    # If sources_detected is a string, convert it to a list
    if isinstance(sources_detected, str):
        sources_detected = literal_eval(sources_detected)

    for source in sources_detected:
        if 'Male' in source['Gender']:
            men_count += 1
        elif 'Female' in source['Gender']:
            women_count += 1

    return men_count, women_count
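
To make the expected input concrete, here is a minimal sketch of how one `sources_detected` cell can be parsed and tallied. The sample string and the names in it are invented for illustration. Only the `Source` and `Gender` keys and the `Male`/`Female` labels come from the function above, and `count_by_gender` is a hypothetical helper, not part of the notebook.

```python
from ast import literal_eval

# Invented example of what one sources_detected cell might look like
# once pandas has read it back from the CSV as a string.
sample = (
    "[{'Source': 'Person A', 'Gender': 'Male'}, "
    "{'Source': 'Person B', 'Gender': 'Female'}, "
    "{'Source': 'Person C', 'Gender': 'Male'}]"
)


def count_by_gender(cell):
    """Return (men, women) for one cell. Hypothetical helper for illustration."""
    # Empty cells are read by pandas as NaN (a float), so anything that is
    # not a non-empty string or list is treated as "no sources detected".
    if not isinstance(cell, (str, list)) or not cell:
        return 0, 0

    # literal_eval safely turns the string back into a list of dictionaries
    sources = literal_eval(cell) if isinstance(cell, str) else cell

    men, women = 0, 0
    for source in sources:
        if 'Male' in source['Gender']:
            men += 1
        elif 'Female' in source['Gender']:
            women += 1
    return men, women


print(count_by_gender(sample))        # (2, 1)
print(count_by_gender(float("nan")))  # (0, 0)
```

`literal_eval` is used rather than `eval` because it only accepts Python literals, which is safer when the string comes from a file.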


In [56]:
def chunk_text(text, chunk_size=200):
    '''Display text in smaller chunks so it's easier to read'''
    
    if not isinstance(text, str):
        print("Text not detected. Skipping...")
        return 0, 0

    words = text.split()
    total_men, total_women = 0, 0
    
    for i in range(0, len(words), chunk_size):
        while True:  # Loop in case of need to redo
            
            print(" ".join(words[i:i+chunk_size]))  # Print up to 200 words (or whatever chunk size specified)
            
            print("\n" + "-"*40 + "\n")
            
            while True:  # Loop to handle 'Number of new men quoted' input
                try:
                    chunk_men = int(input("Number of new men quoted: "))
                    break
                except ValueError:
                    print("Invalid input. Please enter a number.")

            while True:  # Loop to handle 'Number of new women quoted' input
                try:
                    chunk_women = int(input("Number of new women quoted: "))
                    break
                except ValueError:
                    print("Invalid input. Please enter a number.")

            action = input(f"You entered {chunk_men} men and {chunk_women} women for this chunk. Press Enter to continue or type 'redo' to re-evaluate: ").strip().lower()
            
            print("\n" + "-"*40 + "\n")
            
            # 'restart' and 'delete' are handled by manual_evaluation, so pass them up.
            # (This check used to run before the input was read, so it could never fire.)
            if action in ('restart', 'delete'):
                return action, action
            
            if action != 'redo':
                total_men += chunk_men
                total_women += chunk_women
                break
        
    return total_men, total_women
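
The chunking itself is separate from the prompts, so it can be tried without any keyboard input. This is a minimal sketch of the same slicing logic. `iter_chunks` is a hypothetical helper name and the text is invented.

```python
def iter_chunks(text, chunk_size=200):
    """Yield successive chunks of at most chunk_size words (hypothetical helper)."""
    words = text.split()
    for start in range(0, len(words), chunk_size):
        # Slicing past the end of the list is safe, so the last chunk
        # simply contains whatever words are left over.
        yield " ".join(words[start:start + chunk_size])


# Invented 450-word text: expect chunks of 200, 200 and 50 words
text = " ".join(f"word{i}" for i in range(450))
sizes = [len(chunk.split()) for chunk in iter_chunks(text)]
print(sizes)  # [200, 200, 50]
```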

In [57]:
def format_sources(sources_list):
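    '''Print the names of the men and women EquiQuote has detected.
    sources_list is a list of dictionaries, or a string containing one.'''
    
    # Nothing to show if EquiQuote detected no sources (e.g. an empty or NaN cell)
    if not sources_list or isinstance(sources_list, (float, int)):
        print("No sources detected.\n")
        return
    
    if isinstance(sources_list, str):
        sources_list = literal_eval(sources_list)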

    men_quoted, women_quoted = [], []

    for source in sources_list:
        if 'Male' in source['Gender']:
            men_quoted.append(source['Source'])
        elif 'Female' in source['Gender']:
            women_quoted.append(source['Source'])

    print("Men quoted:")
    for man in men_quoted:
        print(man)

    print("\nWomen quoted:")
    for woman in women_quoted:
        print(woman)
    print("\n")

In [58]:
def manual_evaluation(df):
    '''For each story, print the link and text, then prompt for my counts of men and women and any comments.
    Automatically check for matches with EquiQuote counts.
    Relies on the global variables csv_file and file_path set in the run loop.'''
        
    # List out the story links
    print("\nAvailable story links in this CSV file:\n" + "-"*60)
    for i, link in enumerate(df['link']):
        print(f"{i + 1}. {link}")
    print("-"*60)

    # Option to start from a certain story
    start_input = input("Enter the number of the story you want to start processing from (or press Enter to start from the first story): ")
    start_from_story = int(start_input) - 1 if start_input else 0
    
    for index, row in df.iloc[start_from_story:].iterrows():
        
        while True: # Loop this in case I want to redo after checking 

            print(f"Story link: {row['link']}\n" + "-"*40 + "\n")

                    
            total_men, total_women = chunk_text(row["text"])
            
            if total_men == 'restart' and total_women == 'restart':
                return manual_evaluation(df)

            # Check if the return value indicates a delete
            elif total_men == 'delete' and total_women == 'delete':
                df.drop(index, inplace=True)
                
                # Confirm saving the CSV immediately after dropping the row
                save_input = input(f"Do you want to save changes to {csv_file}? (yes/no): ")
                if save_input.lower() == 'yes':
                    df.to_csv(file_path, index=False)
                return manual_evaluation(df)
            
            df.at[index, "my_count_men"] = total_men
            df.at[index, "my_count_women"] = total_women

            # Check for match automatically
            df.at[index, "match"] = (total_men == df.at[index, "count_men"]) and (total_women == df.at[index, "count_women"])

            # Print out results to check
            print(f"EquiQuote counted {df.at[index, 'count_women']} women and {df.at[index, 'count_men']} men. You counted {total_women} women and {total_men} men. Was this a match? {'Yes' if df.at[index, 'match'] else 'No'}.")

            while True:  # Loop to handle 'See sources' input
                see_sources = input("See sources detected by EquiQuote? (yes/no): ").lower()
                if see_sources in ['yes', 'no']:
                    break
                print("Invalid input. Please enter 'yes' or 'no'.")

            if see_sources == 'yes':
                format_sources(df.at[index, "sources_detected"])
            
            my_count_comments = input("Add comments if any (else 'NA'): ")
            if not my_count_comments.strip():
                my_count_comments = "NA"
            df.at[index, "my_count_comments"] = my_count_comments

            while True:  # Loop to handle 'redo' input
                redo = input("Do you want to redo? (yes/no): ").lower()
                if redo in ['yes', 'no']:
                    break
                print("Invalid input. Please enter 'yes' or 'no'.")

            if redo != 'yes':
                break
        
        # Confirm saving the CSV
        save_input = input(f"Do you want to save changes to {csv_file}? (yes/no): ")
        if save_input.lower() == 'yes':
            df.to_csv(file_path, index=False)
    
        # Option to restart or continue
        restart_input = input("Press Enter to continue to the next story, or type 'restart' to go back to the beginning: ")
        print("\n" + "-"*40 + "\n")
        
        if restart_input.lower() == 'restart':
            return manual_evaluation(df)
            
    return df
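
The match rule in `manual_evaluation` is applied one story at a time with `df.at`. As a rough sketch, the same rule can be written for a whole DataFrame at once, which is handy for re-checking a file after editing counts by hand. The numbers below are invented; the column names follow the notebook.

```python
import pandas as pd

# Invented counts for three stories
df = pd.DataFrame({
    "count_men": [2, 3, 1],
    "count_women": [0, 2, 1],
    "my_count_men": [1, 3, 1],
    "my_count_women": [0, 2, 2],
})

# A story only matches when both the men and the women counts agree
df["match"] = (df["my_count_men"] == df["count_men"]) & (
    df["my_count_women"] == df["count_women"]
)

print(df["match"].tolist())  # [False, True, False]
```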


## Run the code 

In [79]:
# List the CSV files in the data directory in alphabetical order

data_dir = "data"
csv_files = sorted([f for f in os.listdir(data_dir) if f.endswith('.csv')])

In [114]:
# Print out the list of CSV files
print("Available CSV files:\n" + "-"*60)
for i, csv in enumerate(csv_files):
    print(f"{i + 1}. {csv}")
print("-"*60)

# Option to start from particular file
start_input = input("Enter the number of the file you want to start processing from (or press Enter to start from the first file): ")
start_from = int(start_input) - 1 if start_input else 0

for csv_file in csv_files[start_from:]:
    
    print("-"*60)
    print(f"Processing: {csv_file}\n" + "-"*60)
    
    # Load the CSV file
    file_path = os.path.join(data_dir, csv_file)
    df = pd.read_csv(file_path)
    
    # Extract counts from sources detected
    df["count_men"], df["count_women"] = zip(*df["sources_detected"].apply(extract_counts))
    
    # Initialise new columns with default values, without overwriting
    # counts or comments saved in an earlier session
    defaults = {"my_count_men": 0, "my_count_women": 0, "match": False, "my_count_comments": "NA"}
    
    for column, default in defaults.items():
        if column not in df.columns:
            df[column] = default
    
    # Run manual evaluation
    df = manual_evaluation(df)
   
print("Evaluation completed!")

Available CSV files:
------------------------------------------------------------
1. BBC_2023-08-17.csv
2. BBC_2023-08-18.csv
3. BBC_2023-08-19.csv
4. BBC_2023-08-20.csv
5. BBC_2023-08-21.csv
6. Mail_2023-08-17.csv
7. Mail_2023-08-18.csv
8. Mail_2023-08-19.csv
9. Mail_2023-08-20.csv
10. Mail_2023-08-21.csv
11. Sun_2023-08-17.csv
12. Sun_2023-08-18.csv
13. Sun_2023-08-19.csv
14. Sun_2023-08-20.csv
15. Sun_2023-08-21.csv
------------------------------------------------------------
Enter the number of the file you want to start processing from (or press Enter to start from the first file): 6
------------------------------------------------------------
Processing: Mail_2023-08-17.csv
------------------------------------------------------------

Available story links in this CSV file:
------------------------------------------------------------
1. https://www.dailymail.co.uk/tvshowbiz/article-12418807/Britney-Spears-estranged-husband-Sam-Asghari-BREAKS-SILENCE-amid-divorce-news-says-s-t-hap

Number of new men quoted: 0
Number of new women quoted: 0
You entered 0 men and 0 women for this chunk. Press Enter to continue or type 'redo' to re-evaluate: 

----------------------------------------

EquiQuote counted 0 women and 2 men. You counted 0 women and 1 men. Was this a match? No.
See sources detected by EquiQuote? (yes/no): yes
Men quoted:
Sam Asghari
Neal Hersh

Women quoted:


Add comments if any (else 'NA'): It detected Sam Asghari as a source, but I think he is the main subject here
Do you want to redo? (yes/no): no
Do you want to save changes to Mail_2023-08-17.csv? (yes/no): yes
Press Enter to continue to the next story, or type 'restart' to go back to the beginning: 

----------------------------------------

Story link: https://www.dailymail.co.uk/news/article-12418033/Mother-three-spared-jail-FOURTH-time-attacking-police.html
----------------------------------------

A mother-of-three has been spared jail for a fourth time after attacking police - this time after a

KeyboardInterrupt: Interrupted by user
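
The run loop above fills `count_men` and `count_women` in a single line using `zip(*...)`. A minimal sketch of what that idiom does, using invented `(men, women)` pairs in place of the output of `extract_counts`:

```python
# Invented (men, women) pairs, one per story
pairs = [(2, 0), (3, 2), (0, 1)]

# zip(*pairs) regroups the pairs by position:
# all the first items together, then all the second items
men, women = zip(*pairs)

print(men)    # (2, 3, 0)
print(women)  # (0, 2, 1)
```

One thing to watch for: if a CSV has no rows, there is nothing to unpack and the two-name assignment raises a `ValueError`.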

## Analyse results

In [115]:
def tally_accuracy(csv_files, data_dir):
    '''After manually evaluating the CSV files, count the number of matches and tally the accuracy/match rate'''
    
    sources = ["BBC", "Mail", "Sun"]
    data = {source: {"matches": 0, "non_matches": 0} for source in sources}

    for file in csv_files:
        df = pd.read_csv(os.path.join(data_dir, file))
        for source in sources:
            if file.startswith(source):
                data[source]["matches"] += df["match"].sum()
                data[source]["non_matches"] += len(df) - df["match"].sum()

    total_matches = sum([data[source]["matches"] for source in sources])
    total_non_matches = sum([data[source]["non_matches"] for source in sources])
    
    results = []
    
    for source in sources:
        matches = data[source]["matches"]
        non_matches = data[source]["non_matches"]
        total = matches + non_matches
        accuracy = round((matches / total) * 100, 2) if total != 0 else 0
        results.append([source, matches, non_matches, accuracy])  # One row per outlet
    
    total_accuracy = (total_matches / (total_matches + total_non_matches)) * 100 if (total_matches + total_non_matches) != 0 else 0
    results.append(["Total", total_matches, total_non_matches, round(total_accuracy, 2)])

    df_results = pd.DataFrame(results, columns=["Source", "Matches", "Non-matches", "Accuracy (%)"])
    
    # Display the table
    print(df_results)

In [116]:
# Overall tally
tally_accuracy(csv_files, data_dir)

  Source  Matches  Non-matches  Accuracy (%)
0    BBC       20            3         86.96
1   Mail       15            8         65.22
2    Sun       22            3         88.00
3  Total       57           14         80.28
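
As a quick check of the arithmetic, the accuracy column can be recomputed from the match and non-match counts in the table above. This is only a sketch of the formula (matches divided by total stories, as a percentage), and `accuracy` is a hypothetical helper name.

```python
# Match and non-match counts copied from the table above
tallies = {"BBC": (20, 3), "Mail": (15, 8), "Sun": (22, 3)}


def accuracy(matches, non_matches):
    """Percentage of stories where both counts agreed, rounded to 2 d.p."""
    total = matches + non_matches
    return round(matches / total * 100, 2) if total else 0.0


for outlet, (matches, non_matches) in tallies.items():
    print(outlet, accuracy(matches, non_matches))

# The overall figure pools all stories rather than averaging the three rates
total_matches = sum(m for m, _ in tallies.values())
total_non_matches = sum(n for _, n in tallies.values())
print("Total", accuracy(total_matches, total_non_matches))
```

The total is computed over the pooled 71 stories, which is why it is not simply the mean of the three outlet percentages.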


In [121]:
# Table of rows where there was a disagreement between my and EquiQuote's judgment

no_match = [] 

columns = ['file_name', 'link', 'text', 'count_men', 'my_count_men', 'count_women', 'my_count_women', 'my_count_comments']

for csv_file in csv_files:
    file_path = os.path.join(data_dir, csv_file)
    
    # Read the CSV into a DataFrame
    df = pd.read_csv(file_path)
    
    # Check if "match" column exists
    if "match" in df.columns:
        # Keep only the rows where "match" is False.
        # .copy() gives an independent DataFrame, which avoids pandas' SettingWithCopyWarning
        mismatches = df[df["match"] == False].copy()
        
        # Add a column to the mismatches DataFrame for the CSV file name
        mismatches['file_name'] = csv_file
        
        # Append mismatches to the no_match list
        no_match.append(mismatches[columns])

# Concatenate all mismatched DataFrames
mismatch_df = pd.concat(no_match, ignore_index=True)


In [122]:
mismatch_df.head()

|   | file_name | link | text | count_men | my_count_men | count_women | my_count_women | my_count_comments |
|---|---|---|---|---|---|---|---|---|
| 0 | BBC_2023-08-18.csv | https://www.bbc.co.uk/news/uk-66120934 | Hospital bosses failed to investigate allegati... | 2 | 1 | 0 | 0 | EquiQuote identified Dr Ravi as a source, but ... |
| 1 | BBC_2023-08-19.csv | https://www.bbc.co.uk/news/uk-england-merseysi... | The former chair of the NHS trust where serial... | 3 | 2 | 2 | 1 | In this case, EquiQuote identified Sir Duncan ... |
| 2 | BBC_2023-08-21.csv | https://www.bbc.co.uk/news/uk-england-merseysi... | Neonatal nurse Lucy Letby, who is the UK's mos... | 4 | 5 | 4 | 4 | It missed a short quote from Rishi Sunak sayin... |
| 3 | Mail_2023-08-17.csv | https://www.dailymail.co.uk/tvshowbiz/article-... | Britney Spears' estranged husband Sam Asghari ... | 2 | 1 | 0 | 0 | It detected Sam Asghari as a source, but I thi... |
| 4 | Mail_2023-08-17.csv | https://www.dailymail.co.uk/news/article-12418... | A mother-of-three has been spared jail for a f... | 2 | 2 | 1 | 2 | It missed an unnamed female police officer. |


In [123]:
# Show my comments for the cases where I did not agree with EquiQuote

for csv_file in csv_files:
    file_path = os.path.join(data_dir, csv_file)
    
    # Read the CSV into a DataFrame
    df = pd.read_csv(file_path)
    
    # Check if "match" column exists
    if "match" in df.columns:
        # Keep only the rows where "match" is False
        mismatches = df[df["match"] == False]
        
        # Print the values for my_count_comments
        for comment in mismatches['my_count_comments']:
            print(comment)

EquiQuote identified Dr Ravi as a source, but he was just mentioned in a recollection of events. Brearey is the only real source.
In this case, EquiQuote identified Sir Duncan Nichol as a source when actually I think he's the main newsmaker? It also identified Dr Jane Hawdon as a source, when she was quoted only through other sources. This is admittedly a challenging story, and shows where EquiQuote might struggle. 
It missed a short quote from Rishi Sunak saying it was "cowardly" for criminals not to face victims
It detected Sam Asghari as a source, but I think he is the main subject here
It missed an unnamed female police officer.
EquiQuote counted Dr Ravi; I didn't as I thought he was the newsmaker
EquiQuote identified Lucy Letby as a source, although she was one of the main subjects of the story. It also counted the grandmother, but I felt the she was only quoted in someone else's story. Overall, this story is a bit ambiguous
This is really difficult! Melanie Taylor, the source det