This notebook uses pydantic and the ChatGPT API's function calling to extract details about a protest event from a newspaper article. In the old days, you had to ask the model in the prompt to return a JSON-like object. Later, I wrote the JSON schemas by hand in the function definitions. Now I'm learning to generate them with pydantic.

In [133]:
from datetime import date
from enum import Enum
import json

from pydantic import BaseModel, Field
from typing import Optional, List
from openai import OpenAI
import pandas as pd
from concurrent.futures import ThreadPoolExecutor


In [8]:
coding_date_guidelines = '''
# Coding the Date of a Protest Event from a Media Account

## 1. Carefully Read the Article
- Read through the entire article text thoroughly
- Identify and underline/highlight any phrases that indicate a date or time period relative to the publication date (e.g., "yesterday", "Saturday", "the day after Easter", "last week", "two months ago", etc.)

## 2. Determine the Publication Date
- Obtain the publication date of the article from the provided information
- Write down the publication date in a standard format (e.g., March 24, 2024)

## 3. Create a Calendar Reference
- Prepare a reference calendar for the year of the publication date
- This can be a printed calendar or a handwritten one, whichever is more convenient

## 4. Interpret Relative Date Phrases
- For each relative date phrase identified in the article, interpret it based on the publication date and the calendar reference:
 - "yesterday": Count back one day from the publication date on the calendar
 - Weekday names (e.g., "Saturday"): Find the closest day before or on the publication date that matches the mentioned weekday
 - "the day after Easter": Determine the date of Easter for that year, and count one day after that date on the calendar
 - "last week": Count back 7 days from the publication date
 - "two months ago": Count back 2 months from the publication date's month, adjusting the day if necessary
 - Other relative phrases: Use logical reasoning and the calendar to interpret the phrase relative to the publication date

## 5. Identify the Protest Event Description
- Carefully read through the article again, looking for the section(s) that describe the protest event itself
- Underline or highlight the key details and descriptions related to the protest you need to date

## 6. Prioritize Relevant Date Phrases
- From the list of relative date phrases identified earlier, prioritize the ones that are in closest proximity to the protest event description
- If multiple date phrases are equally close, use your best judgment based on the context clues in the article to determine which is more likely referring to the protest date

## 7. Cross-Reference with Event Details
- As you interpret each prioritized date phrase, cross-reference it with the key details about the protest event, such as:
 - Location mentioned (does the calculated date make sense for that location?)
 - Other events or incidents described around the protest (does the interpreted date align with the chronology?)
 - Quotes from organizers or participants about timing (do they corroborate the calculated date?)

## 8. Resolve Ambiguity
- If multiple relative date phrases are present, prioritize the one closest to the description of the event in the article text
- If phrases conflict, use your best judgment based on the context and chronology of events described in the article

## 9. Record the Calculated Date
- Once you have interpreted the relevant relative date phrase(s), record the calculated date in a YYYY-MM-DD format (e.g., 2023-04-13)

## 10. Double-Check and Revise if Necessary
- Review your work, ensuring that the calculated date aligns with the context and details provided in the article
- If needed, revise your interpretation or consult additional resources (e.g., historical calendars, Easter date calculators) to validate your date calculation
- Be open to revising your date calculation based on additional context clues surrounding the protest event itself
'''
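
Several of the rules in step 4 are mechanical enough to express in code. The sketch below is illustrative only and is not part of the extraction pipeline; the helper name `resolve_relative_date` and the small set of phrases it handles are assumptions made for the example.

```python
from datetime import date, timedelta
from typing import Optional

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def resolve_relative_date(phrase: str, publication_date: date) -> Optional[date]:
    """Interpret a relative date phrase against the publication date.

    Hypothetical helper: covers only a few of the phrases from step 4
    and returns None for anything it does not recognize.
    """
    text = phrase.strip().lower()
    if text == "yesterday":
        return publication_date - timedelta(days=1)
    if text == "last week":
        # Step 4 treats "last week" as seven days before publication.
        return publication_date - timedelta(days=7)
    if text in WEEKDAYS:
        # Closest matching weekday on or before the publication date.
        days_back = (publication_date.weekday() - WEEKDAYS.index(text)) % 7
        return publication_date - timedelta(days=days_back)
    return None


published = date(2023, 10, 22)  # a Sunday
print(resolve_relative_date("Yesterday", published))  # 2023-10-21
print(resolve_relative_date("Saturday", published))   # 2023-10-21
```

Phrases such as "the day after Easter" or "two months ago" need a holiday calendar or month arithmetic, which is why the guidelines leave them to judgment.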

In [9]:
class WeekDay(Enum):
    Monday = "Monday"
    Tuesday = "Tuesday"
    Wednesday = "Wednesday"
    Thursday = "Thursday"
    Friday = "Friday"
    Saturday = "Saturday"
    Sunday = "Sunday"



class DateDetails(BaseModel):
    event_date: date = Field(
        ...,
        description="Date of the protest. Pay attention to dates mentioned in the article and words such as ‘yesterday,’ ‘last week,’ and ‘Monday.’",
    )
    day_of_week: WeekDay = Field(
        ...,
        description="The day of the week the protest occurred, such as Monday or Thursday.",
    )
    date_text: List[str] = Field(
        ...,
        description="List of text descriptors for the protest date, such as 'yesterday', 'last week', or 'Monday' .",
    )



class Protest(BaseModel):
    protest_article: bool = Field(
        False,
        description="Indicates if the article describes a protest against police brutality.",
    )
    summary: str = Field(
        ...,
        description="A focused summary of the article focusing on the protest details.",
    )
    event_date: DateDetails = Field(
        ...,
        description=coding_date_guidelines,
    )

In [10]:
article = {
    "text": """COLUMBUS, Ohio (WCMH) – Yesterday was a national day of protest, and Columbus recognized the day when dozens of families gathered at the Ohio Statehouse to protest police brutality.

Protesters were asking for accountability and justice by sharing how they lost their loved ones, while organizers said the protest was about telling their stories in more than one way.

“We know that more than 1,200 Ohioans have been lost to police violence since the year 2000,” Ohio Families United for Political Action and Change (OFUPAC) Organizing Director Elaine Schleiffer said. “We wanted to represent the loss that that is, the empty shoes, that there’s no replacing those family members.”

OFUPAC is a non-profit organization that unites families who have lost loved ones in officer-involved shootings.

For many of those who turned out to yesterday’s protest, the issue hits close to home. Sabrina Jordan lost her son in an officer-involved shooting in 2017 just outside of Dayton.

“We’re just here also, to, like, celebrate and love each other,” Jordan, who is also OFUPAC’s founder, said. “You know, connect.”

Tania Hudson’s son was fatally shot by a Columbus police officer in 2015. 

“We’re asking accountability,” she said. “Officers be drug tested when they’re involved in a shooting, alcohol test. We understand that they have trauma and drama, too.”

The city’s police union, the Fraternal Order of Police (FOP), said there is already accountability in place.

“Accountability? How much more accountability can they ask for,” FOP Executive Vice President Brian Steel said. “We have an internal affairs. We have an inspector general’s office. We’re investigated by BCI in, say, a police-involved shooting, in a grand jury of our peers. There’s literally no more accountability that can be put on police officers today.”

“Accountability is pretty much all that we can ask for,” Hudson said. “We can’t say justice – ours is gone. There will never be justice for us, but we’re out here trying to save other people’s lives. That’s why we’re constantly out here.”

Protestors also mentioned their frustration with Marsy’s Law, which was originally passed to protect the victims of violent crimes, but which was extended to allow law enforcement departments to shield officers’ names when they are involved in a shooting. Protesters think this shouldn’t be the case while Steel said it’s an important protection for officers who are victims of violent crimes.
""",
    "headline": "Statehouse protest calls for end to police brutality",
    "publication-date": "2023-10-22",
    "source": "WCMH",
}

In [128]:
def get_protest_date(article):
    client = OpenAI(
        max_retries=3,
        timeout=20.0,
    )

    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant that extracts summaries of newspaper articles about political protests as JSON for a database. ",
        },
        {
            "role": "user",
            "content": f"""Extract the details of a protest from the following article.
      Only use information from the article.

      {article}
      
      """,
        },
    ]

    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",  # model = 'gpt-4-turbo-preview',
        functions=[
            {
                "name": "protest_details",
                "description": "Extract insights from media article about protest.",
                "parameters": Protest.model_json_schema(),
            }
        ],
        n=1,
        messages=messages,
    )

    r = json.loads(completion.choices[0].message.function_call.arguments)
    return r['event_date']['event_date']
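
With only the `functions` argument, the model decides for itself whether to call `protest_details`, so `message.function_call` can come back as `None`. One way to reduce that, sketched here, is to also pass `function_call={"name": ...}`, which the Chat Completions API accepts alongside `functions` to require a specific function. The helper `build_request` and the stand-in schema are assumptions for illustration; in the notebook the schema would be `Protest.model_json_schema()`.

```python
def build_request(messages, schema, model="gpt-3.5-turbo"):
    """Assemble keyword arguments for client.chat.completions.create().

    Hypothetical helper: it only builds the dictionary and makes no API call.
    """
    function_name = "protest_details"
    return {
        "model": model,
        "messages": messages,
        "functions": [
            {
                "name": function_name,
                "description": "Extract insights from media article about protest.",
                "parameters": schema,
            }
        ],
        # Require the model to call this function instead of replying in prose.
        "function_call": {"name": function_name},
        "n": 1,
    }


# Stand-in schema so the example runs without pydantic installed.
example_schema = {"type": "object", "properties": {"summary": {"type": "string"}}}
request = build_request([{"role": "user", "content": "article text"}], example_schema)
print(request["function_call"])  # {'name': 'protest_details'}
# Usage (needs an API key): client.chat.completions.create(**request)
```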



In [124]:
r = get_protest_details(article)

In [13]:
# The response contains nested dictionaries, so flatten them into columns
# such as event_date_day_of_week rather than building the DataFrame directly.
df = pd.json_normalize(r, sep="_")
df

Unnamed: 0,summary,event_date_event_date,event_date_day_of_week,event_date_date_text
0,Dozens of families gathered at the Ohio Stateh...,2023-10-21,Friday,[Yesterday]


In [148]:
article_df = pd.read_json(
    "https://raw.githubusercontent.com/nealcaren/notes/main/posts/from-articles-to-events/protest_articles.json"
)

In [149]:
for item in ['title','date', 'site','text']:
    article_df = article_df.rename(columns={item: f'article_{item}'})

article_df = article_df[article_df['article_date'].isna()]

In [163]:
article_df['url'].values[10]

'https://www.cbsnews.com/colorado/video/sag-aftra-members-rally-in-denvers-city-park-1/'

In [162]:
article_df['article_text'].values[9]

"Posted Thursday, April 6, 2023 7:10 am\n\n(April 6, 2023) More than 100 Nantucket High School students gathered on the football field Friday afternoon to protest recent anti-LGBTQ legislation in several southern states, joining schools across the country in a walkout.\n\n“It was a much bigger turnout than I was expecting, which I'm really grateful for,” said senior Ellie Kinsella, one of the organizers of the walkout.\n\n“Even if some of them were just here to skip class, I’m really grateful for the people who came just to support the cause,” she added.\n\nBloomberg reported earlier this month that at least 385 anti-LGBTQ laws have been introduced in state legislatures across the country this year, more than the last five years combined.\n\nLaws passed in primarily Republican-led states such as Texas, Tennessee and Missouri include denying access to healthcare for transgender people, limiting access to bathrooms or sports teams and what can be discussed in classrooms.\n\nSenior Kipper

In [116]:
for item in ['title','date', 'site','text']:
    article_df = article_df.rename(columns={item: f'article_{item}'})

article_df = article_df[article_df['article_date'].notna()]

In [40]:
event_df= pd.read_csv(
    "https://github.com/nonviolent-action-lab/crowd-counting-consortium/raw/master/ccc_compiled_2021-present.csv",
    encoding="latin",
    low_memory=False,
)

1


In [117]:
merged_df = pd.merge(article_df, event_df, left_on='url', right_on='source_1', how='inner')
len(merged_df)

762

In [134]:

# Assumes merged_df and get_protest_date are defined in earlier cells

# Define the relevant columns
relevant_columns = ['article_title', 'article_date', 'article_site', 'article_text']

# Convert the relevant columns of the dataframe to a list of dictionaries
data_to_process = merged_df[relevant_columns].to_dict(orient='records')

# Define a function that wraps your get_protest_date function for convenience
def process_row(row_dict):
    try:
        return get_protest_date(row_dict)
    except Exception as e:
        print(f"Error processing row: {e}")
        return None

# Use ThreadPoolExecutor to apply the function to all rows concurrently
with ThreadPoolExecutor(max_workers=5) as executor:
    # Submit all the tasks and collect the futures
    futures = [executor.submit(process_row, row_dict) for row_dict in data_to_process]
    
    # Wait for all futures to complete and collect the results
    results = [future.result() for future in futures]

# Add the results back into the dataframe (futures are collected in row order)
merged_df['estimated_date'] = results


Error processing row: 'NoneType' object has no attribute 'arguments'
Error processing row: 'event_date'
Error processing row: 'NoneType' object has no attribute 'arguments'
Error processing row: 'NoneType' object has no attribute 'arguments'
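
Both kinds of error above come from the last two lines of `get_protest_date`: the `NoneType` message appears when the model replies without a function call, and `'event_date'` is a `KeyError` raised when the returned arguments lack that key. A defensive version of that step might look like the sketch below. `extract_event_date` is a hypothetical helper, and `SimpleNamespace` stands in for the API response object so the example runs offline.

```python
import json
from types import SimpleNamespace
from typing import Optional


def extract_event_date(message) -> Optional[str]:
    """Return the event date string from a chat message, or None if absent."""
    function_call = getattr(message, "function_call", None)
    if function_call is None:
        # The model answered in plain text instead of calling the function.
        return None
    try:
        arguments = json.loads(function_call.arguments)
    except json.JSONDecodeError:
        # The arguments were not valid JSON.
        return None
    event_date = arguments.get("event_date")
    if not isinstance(event_date, dict):
        # The nested DateDetails object is missing or malformed.
        return None
    return event_date.get("event_date")


# Stand-ins for the two failure cases and a successful call.
no_call = SimpleNamespace(function_call=None)
missing_key = SimpleNamespace(
    function_call=SimpleNamespace(arguments='{"summary": "A rally."}')
)
complete = SimpleNamespace(
    function_call=SimpleNamespace(
        arguments='{"event_date": {"event_date": "2023-10-21"}}'
    )
)

print(extract_event_date(no_call))      # None
print(extract_event_date(missing_key))  # None
print(extract_event_date(complete))     # 2023-10-21
```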


In [139]:
merged_df['date_match'] = merged_df['date'] == merged_df['estimated_date']

In [147]:
merged_df[~merged_df['date_match']][['article_title','date','estimated_date', 'article_date']]

Unnamed: 0,article_title,date,estimated_date,article_date
4,Families Stand Up For Kids in Gaza! : Indybay,2023-10-22,2023-10-20,2023-10-20T00:00:00.000
7,Towson students threatened with arrest at late...,2023-11-15,2023-11-16,2023-11-17T14:37:35.000Z
22,‘No days off’: Durham city sanitation workers ...,2023-09-06,2023-09-05,2023-09-05T22:18:32.000Z
28,"San Francisco AAPI community rallies, calling ...",2023-04-16,2023-04-17,2023-04-17T17:49:00.000Z
35,Community members rally outside HISD headquart...,2023-08-05,2023-08-04,2023-08-05T15:58:36.000
...,...,...,...,...
740,Wisconsin Ukrainians host peace vigil as Russi...,2023-02-25,2023-02-26,2023-02-26T04:30:29.000Z
741,Northampton police incident sparks rally at Ci...,2023-08-13,2023-08-11,2023-08-11T20:17:15.291Z
742,Presidential candidate Robert F. Kennedy draws...,2024-02-05,2024-02-06,2024-02-07T00:00:00.000
743,Kaimuki residents rally to block ‘monster home...,2023-06-23,2023-07-14,2023-06-25T00:00:00.000
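
An exact-match flag hides how far off each estimate is. As a rough sketch, the difference in days between the coded date and the estimated date can be computed directly; the rows below are copied from the table above, and `day_offset` is a hypothetical helper.

```python
from datetime import date


def day_offset(coded: str, estimated: str) -> int:
    """Days from the coded date to the estimated date (negative means earlier)."""
    return (date.fromisoformat(estimated) - date.fromisoformat(coded)).days


# (index, coded date, estimated date), taken from the mismatches shown above.
rows = [
    (4, "2023-10-22", "2023-10-20"),
    (7, "2023-11-15", "2023-11-16"),
    (22, "2023-09-06", "2023-09-05"),
    (743, "2023-06-23", "2023-07-14"),
]

for index, coded, estimated in rows:
    print(index, day_offset(coded, estimated))
```

Among the rows displayed, most estimates are within a day or two of the coded date, while row 743 is three weeks off.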


In [145]:
merged_df.iloc[7]

article_title      Towson students threatened with arrest at late...
article_text       By Gabriel Donahue, Editor-in-Chief\n\nA secon...
url                https://thetowerlight.com/towson-students-thre...
authors                                            [Gabriel Donahue]
article_date                                2023-11-17T14:37:35.000Z
                                         ...                        
resolved_county                                     Baltimore County
resolved_state                                                    MD
fips_code                                                    24005.0
estimated_date                                            2023-11-16
date_match                                                     False
Name: 7, Length: 87, dtype: object

In [91]:
sample_d

{'article_title': 'Civil Rights march taking place in Shreveport this weekend',
 'article_date': None,
 'article_site': 'KTBS',
 'article_text': "Mac Users Didn't Know This Simple Trick To Block All Ads (Do It Now)\n\nSafe Tech Tips"}

In [75]:
results

{'summary': 'Faces of protest: Thursday at the Indiana Statehouse',
 'event_date': {'event_date': '2023-03-16',
  'day_of_week': 'Thursday',
  'date_text': ['Thursday']}}

In [19]:
from datetime import datetime

# The function to check if the day of the week is consistent with the date
def check_day_of_week(date_str, day_of_week):
    """
    Check if the provided day of the week is consistent with the date string.
    
    Parameters:
    date_str (str): Date in the format 'YYYY-MM-DD'
    day_of_week (str): Day of the week
    
    Returns:
    bool: True if the day of the week is consistent with the date, False otherwise.
    """
    # Parse the date string into a datetime object
    date_obj = datetime.strptime(date_str, '%Y-%m-%d')
    
    # Create a dictionary to map datetime weekday numbers to day names
    day_dict = {
        0: 'Monday',
        1: 'Tuesday',
        2: 'Wednesday',
        3: 'Thursday',
        4: 'Friday',
        5: 'Saturday',
        6: 'Sunday'
    }
    
    # Get the day of the week number for the date object
    day_num = date_obj.weekday()
    print(day_num)
    # Compare and return the result
    return day_dict[day_num].lower() == day_of_week.lower()


In [20]:
df["is_consistent"] = df.apply(
    lambda row: check_day_of_week(
        row["event_date_event_date"], row["event_date_day_of_week"]
    ),
    axis=1,
)
df["is_consistent"].mean()

5


0.0
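
The mean of 0.0 reflects the single row in `df`: the model returned 2023-10-21 together with `Friday`, but that date is a Saturday (the `5` printed above is the `weekday()` value for Saturday). Since the weekday follows from the date, one option is to compute it rather than ask the model for it. A minimal sketch, with `weekday_name` as a hypothetical helper:

```python
from datetime import date

DAY_NAMES = (
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
)


def weekday_name(date_str: str) -> str:
    """Return the English weekday name for a 'YYYY-MM-DD' date string."""
    return DAY_NAMES[date.fromisoformat(date_str).weekday()]


print(weekday_name("2023-10-21"))  # Saturday
```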

Estimated cost: 

* gpt-4-0125-preview: 55 articles for $1
* gpt-3.5-turbo: 1106 articles for $1
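
Those rates translate into batch costs with simple division. The sketch below applies them to the 762 merged articles; the rates are the estimates listed above, not official pricing, and `estimated_cost` is a hypothetical helper.

```python
# Articles processed per dollar, from the estimates above (not official pricing).
ARTICLES_PER_DOLLAR = {
    "gpt-4-0125-preview": 55,
    "gpt-3.5-turbo": 1106,
}


def estimated_cost(n_articles: int, model: str) -> float:
    """Approximate cost in dollars of processing n_articles with the given model."""
    return n_articles / ARTICLES_PER_DOLLAR[model]


for model in ARTICLES_PER_DOLLAR:
    print(f"{model}: ${estimated_cost(762, model):.2f} for 762 articles")
```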

In [48]:
Protest.model_json_schema()

{'$defs': {'DateDetails': {'properties': {'event_date': {'description': 'Date of the protest. Pay attention to dates mentioned in the article and words such as ‘yesterday,’ ‘last week,’ and ‘Monday.’',
     'format': 'date',
     'title': 'Event Date',
     'type': 'string'},
    'day_of_week': {'allOf': [{'$ref': '#/$defs/WeekDay'}],
     'description': 'The day of the week the protest occurred, such as Monday or Thursday.'},
    'date_text': {'description': "List of text descriptors for the protest date, such as 'yesterday', 'last week', or 'Monday' .",
     'items': {'type': 'string'},
     'title': 'Date Text',
     'type': 'array'}},
   'required': ['event_date', 'day_of_week', 'date_text'],
   'title': 'DateDetails',
   'type': 'object'},
  'LocationDetails': {'properties': {'city': {'description': 'The city where the protest took place.',
     'title': 'City',
     'type': 'string'},
    'state_abbreviation': {'allOf': [{'$ref': '#/$defs/StateAB'}],
     'description': 'The two-