This notebook addresses a problem in our collaborator's data: many documents were initially associated with the wrong countries. It provides a procedure for obtaining a set of documents that is more reliably linked to a given country and event.

Our process begins by defining an event as a pair of country and time period. We then use a Large Language Model (LLM) to infer a thematic description from the titles of all documents initially gathered for that country and period, so the theme reflects what most of those titles have in common. This theme guides a manual filtering step in which we remove documents whose titles appear unrelated to it. The aim is a collection of documents genuinely associated with the intended country and event. The method is still under development, but it is our current best effort to improve data quality.



# Read Data

In [None]:


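import os
import json
import pandas as pd

# Load the scraped dataset
full_data = pd.read_excel("./Results/Sources/fulldata_scraped.xlsx")
# Read the OpenAI API key from the environment instead of hard-coding it
openaikey = os.environ.get("OPENAI_API_KEY", "")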


In [6]:
len(full_data['source_domain'].unique())

142

In [3]:
vc = full_data['source_domain'].value_counts()

# Display 50 rows at a time
for i in range(0, len(vc), 50):
    print(vc.iloc[i:i+50].to_string())

source_domain
reliefweb.int             22483
allafrica.com              1828
theguardian.com            1589
radiotamazuj.org            971
news.un.org                 969
independent.co.uk           922
globalsecurity.org          789
abcnews.go.com              786
aljazeera.com               781
hrw.org                     762
voanews.com                 755
jamaica-gleaner.com         655
tribune.com.pk              602
apnews.com                  528
dawn.com                    492
darfur24.com                491
english.news.cn             485
dw.com                      485
edition.cnn.com             483
naharnet.com                456
dailysabah.com              453
preventionweb.net           440
naharnet.com:443            432
thedailystar.net            428
jamaicaobserver.com         428
aa.com.tr                   403
al-monitor.com              375
middleeastmonitor.com       365
mb.com.ph                   359
bbc.com                     321
euronews.com              

In [4]:
len(vc)

142
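The counts above list `naharnet.com` and `naharnet.com:443` as separate domains, so the figure of 142 unique domains may slightly overcount. A minimal sketch of one way to merge such entries, assuming the only difference is a trailing port number; `normalize_domain` is a hypothetical helper, not part of the notebook:

```python
from collections import Counter


def normalize_domain(domain):
    """Strip a trailing ':<port>' from a domain string, if present."""
    host, sep, port = domain.rpartition(":")
    if sep and port.isdigit():
        return host
    return domain


# The two naharnet.com entries from the value counts above
raw_counts = {"naharnet.com": 456, "naharnet.com:443": 432}

merged = Counter()
for domain, n in raw_counts.items():
    merged[normalize_domain(domain)] += n

print(dict(merged))
```

On the real data the same function could be applied to the `source_domain` column before counting.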

# Select data

## Prepare dataframe

In [None]:
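import ast
import numpy as np

# Label each document with its ISO week and with its calendar month.
# The ISO year is paired with the ISO week so that dates around New Year
# are not assigned to the wrong year.
full_data['publish_week'] = full_data['publish_date'].apply(
    lambda x: f"Week {x.isocalendar().week} {x.isocalendar().year}"
)
full_data['publish_month'] = full_data['publish_date'].dt.strftime('%B %Y')

# 'country' is stored as the string form of a list: parse it, drop documents
# tagged with more than five countries, then emit one row per (document, country).
full_data['country'] = full_data['country'].apply(ast.literal_eval)
full_data = full_data[full_data['country'].apply(len) <= 5]
full_data = full_data.explode('country').reset_index(drop=True)

countries = np.unique(full_data['country'])

# Number of documents per (country, week), sorted from largest to smallest
ndocs_dict = full_data.groupby(['country', 'publish_week']).size().to_dict()
ndocs_dict = dict(sorted(ndocs_dict.items(), key=lambda item: item[1], reverse=True))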
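An ISO week number belongs to an ISO year, which can differ from the calendar year for dates close to 1 January. The sketch below uses only the standard library to show the labelling on two dates; `week_label` is a hypothetical helper introduced for illustration:

```python
from datetime import date


def week_label(d):
    """Return a label such as 'Week 21 2024' based on the ISO week and ISO year."""
    iso_year, iso_week, _ = d.isocalendar()
    return f"Week {iso_week} {iso_year}"


# A date in the middle of the year: ISO year and calendar year agree
print(week_label(date(2024, 5, 21)))

# Monday 30 December 2024 already belongs to ISO week 1 of 2025
print(week_label(date(2024, 12, 30)))
```

Pairing the ISO week with the calendar year instead would label the second date as week 1 of 2024, which would group it with documents from early January 2024.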



In [12]:
full_data['country'].value_counts()

United States            7701
Palestinian Territory    4586
Israel                   3420
Sudan                    3169
Mexico                   1885
                         ... 
Gibraltar                   1
Andorra                     1
Faroe Islands               1
Mercosur                    1
Christmas Island            1
Name: country, Length: 251, dtype: int64

In [5]:

ndocs_dict

{('United States', 'Week 39 2024'): 490,
 ('United States', 'Week 40 2024'): 424,
 ('United States', 'Week 41 2024'): 413,
 ('United States', 'Week 32 2024'): 382,
 ('United States', 'Week 30 2024'): 379,
 ('United States', 'Week 37 2024'): 372,
 ('United States', 'Week 31 2024'): 358,
 ('Palestinian Territory', 'Week 35 2024'): 331,
 ('United States', 'Week 35 2024'): 315,
 ('United States', 'Week 25 2024'): 313,
 ('United States', 'Week 36 2024'): 312,
 ('United States', 'Week 38 2024'): 303,
 ('United States', 'Week 19 2024'): 302,
 ('United States', 'Week 21 2024'): 294,
 ('United States', 'Week 34 2024'): 294,
 ('United States', 'Week 23 2024'): 291,
 ('United States', 'Week 28 2024'): 291,
 ('United States', 'Week 33 2024'): 281,
 ('United States', 'Week 22 2024'): 278,
 ('United States', 'Week 24 2024'): 277,
 ('United States', 'Week 27 2024'): 266,
 ('United States', 'Week 29 2024'): 266,
 ('Palestinian Territory', 'Week 19 2024'): 261,
 ('United States', 'Week 26 2024'): 257,


## Filtering by country-period 

In [None]:
import random
random.seed(121)

# Event to extract. To sample one at random instead, use:
# key_country, key_period = random.choice(list(ndocs_dict.keys()))
key_country, key_period = "Afghanistan", "Week 21 2024"
filterted_data = full_data[(full_data['country'] == key_country) & (full_data['publish_week'] == key_period)]

titles = list(filterted_data['title'])

In [128]:
filterted_data

Unnamed: 0.1,Unnamed: 0,id_x,article_id,title,content,lang,attachment_id,content_type,url,source_domain,publish_date,duplicate_insert_time,id_y,clean_contents_id,keywords,country,country_json,publish_week,publish_month
134,49525,106031,c3aa9841-3920-3eec-ae03-5f6b1ffefc90,Spain urges its citizens to leave Lebanon,Spain urges its citizens to leave Lebanon\nSpa...,en,,html,https://www.naharnet.com:443/stories/en/308396...,naharnet.com:443,2024-10-01 00:00:00,NaT,106162,106031,"['spain urges', 'leave lebanon', 'citizens', '...",Israel,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,October 2024
135,49525,106031,c3aa9841-3920-3eec-ae03-5f6b1ffefc90,Spain urges its citizens to leave Lebanon,Spain urges its citizens to leave Lebanon\nSpa...,en,,html,https://www.naharnet.com:443/stories/en/308396...,naharnet.com:443,2024-10-01 00:00:00,NaT,106162,106031,"['spain urges', 'leave lebanon', 'citizens', '...",Palestinian Territory,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,October 2024
138,49688,106200,6fb64723-0285-3e7f-baee-b50215b744ec,Spain urges its citizens to leave Lebanon,Spain urges its citizens to leave Lebanon\nSpa...,en,,html,https://naharnet.com/stories/en/308396-spain-u...,naharnet.com,2024-10-01 00:00:00,NaT,106305,106200,"['spain urges', 'leave lebanon', 'citizens', '...",Israel,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,October 2024
139,49688,106200,6fb64723-0285-3e7f-baee-b50215b744ec,Spain urges its citizens to leave Lebanon,Spain urges its citizens to leave Lebanon\nSpa...,en,,html,https://naharnet.com/stories/en/308396-spain-u...,naharnet.com,2024-10-01 00:00:00,NaT,106305,106200,"['spain urges', 'leave lebanon', 'citizens', '...",Palestinian Territory,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,October 2024
1762,50892,107415,f1e9de82-16ad-3491-b69d-b3306cc3d3eb,Syria - Displacement from Lebanon to Syria (DG...,- With the escalation of hostilities between I...,en,,html,https://reliefweb.int/report/syrian-arab-repub...,reliefweb.int,2024-10-04 11:13:32,NaT,107465,107415,"['dg echo', 'unhcr', 'syria', 'sarc', 'lebanon...",Israel,"{'geo_loc': {'Syria': {'latitude': 35.0, 'geon...",Week 40 2024,October 2024
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84660,50936,107460,d4cb5a89-c82a-313d-bc02-2e1148abadc4,PALESTINE One year of hostilities: impact on e...,PALESTINE One year of hostilities: impact on e...,en,c5c046c2-6d00-3d3a-9b30-6f4b5baf89fe,pdf,https://reliefweb.int/attachments/f585e9a5-eee...,reliefweb.int,2024-10-04 13:57:29,NaT,107403,107460,"['impact', 'hostilities', 'gaza', 'education',...",Israel,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,October 2024
84790,50119,106631,3f995037-a30a-3f71-b10a-3361a4ffeafc,Gaza humanitarian response update | 16-29 Sept...,The information below is provided every other ...,en,,html,https://reliefweb.int/report/occupied-palestin...,reliefweb.int,2024-10-02 19:17:03,NaT,106716,106631,"['16', 'whole population', 'west bank', 'warm ...",Palestinian Territory,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,October 2024
84793,50119,106631,3f995037-a30a-3f71-b10a-3361a4ffeafc,Gaza humanitarian response update | 16-29 Sept...,The information below is provided every other ...,en,,html,https://reliefweb.int/report/occupied-palestin...,reliefweb.int,2024-10-02 19:17:03,NaT,106716,106631,"['16', 'whole population', 'west bank', 'warm ...",Israel,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,October 2024
84810,50173,106685,3f995037-a30a-3f71-b10a-3361a4ffeafc,Gaza Humanitarian 29 September 2024,Children benefitting from informal learning ac...,en,f5649e98-14cc-3cf2-b9fe-d9a9fef0fd4d,pdf,https://reliefweb.int/attachments/b85778d2-ec7...,reliefweb.int,2024-10-02 19:17:03,NaT,106784,106685,"['whole population', 'west bank', 'warm clothe...",Palestinian Territory,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,October 2024


## Manual filtering based on the title 

In [8]:
key_country, key_period


('Afghanistan', 'Week 21 2024')

In [None]:
import os
import openai

def call_openai(prompt, max_tokens):
    """Send a single-turn prompt to gpt-4o and return the reply text."""
    # The key is read from the OPENAI_API_KEY environment variable rather than hard-coded
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=max_tokens
    )

    # Extract the text of the first completion
    return response.choices[0].message.content.strip()

In [None]:
titles = filterted_data['title'].tolist()

prompt_topic = """
Here are several document titles. Please identify the main event or topic that most of these titles relate to.
Return just the main event/topic, nothing else.
Here are the titles:

"""

# Append one title per line
prompt_topic += "\n".join(str(title) for title in titles) + "\n"

main_topic = call_openai(prompt_topic, max_tokens=30)
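Because the same document can appear several times in the data, the title list may contain repeats, and a busy country-week can produce a long prompt. A minimal sketch of a prompt builder that removes duplicate titles and caps their number; `build_topic_prompt` and its `max_titles` parameter are hypothetical and do not appear in the notebook:

```python
def build_topic_prompt(titles, max_titles=200):
    """Build the topic prompt from unique, non-empty titles (order preserved)."""
    seen = set()
    unique_titles = []
    for title in titles:
        # Skip missing values such as None or NaN
        if not isinstance(title, str):
            continue
        cleaned = title.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique_titles.append(cleaned)

    header = (
        "Here are several document titles. Please identify the main event or "
        "topic that most of these titles relate to.\n"
        "Return just the main event/topic, nothing else.\n"
        "Here are the titles:\n\n"
    )
    # max_titles is an arbitrary cap chosen for illustration
    return header + "\n".join(unique_titles[:max_titles])


example = build_topic_prompt(["Faces of the floods", "Faces of the floods ", None])
print(example)
```

The resulting string could then be passed to `call_openai` in place of `prompt_topic`.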


In [11]:
main_topic

'Afghanistan Floods'

In [12]:
for idx, title in enumerate(titles):
    print(idx, title)

0 Afghanistan: Minutes of FSAC Monthly Meeting (15 November 2023) [Meeting Minutes]
1 Faces of the floods
2 DTM Pakistan: Bi-Weekly Flow Monitoring of Afghan Returnees from Pakistan (1 - 15 May 2024)
3 Guideline on Food Security and Agriculture Cluster Response Packages (May 2024)
4 Afghanistan: Minutes of FSAC Monthly Meeting (20 March 2023) [Meeting Minutes]
5 Afghanistan floods affect over 30,000 since year-start: UN
6 Afghanistan: Border Consortium Emergency Border Operations, 05 - 18 May 2024
7 WHO distributes 25 tons of aid to flood victims in Afghanistan's Ghor
8 WFP identifies conflict as primary cause of global hunger crisis
9 Afghanistan Floods Flash Update #2 (21 May 2024)
10 UNICEF Afghanistan Humanitarian Situation Update No. 2 (Northern Region Flash Floods) for 26 May 2024.
11 Shelter Cluster Afghanistan: Regional Monthly Update (April 2024)
12 Afghanistan Flooding Situation Report No. 5 (17 - 20 May 2024)
13 Afghanistan Floods: Flash Update #3 - Floods hit the Northeaste

In [13]:
# First pass: indices of titles judged unrelated to the main topic,
# chosen manually by reading the list printed above
remove_index = [2,6,17, 18,21,23,25,36,37,39,52,54,58, 62,63,67,70,72,75,77,78,84,89,96,104]
filterted_titles = [title for idx, title in enumerate(titles) if idx not in remove_index]

# Rows are matched by title, so rows sharing a title are kept or dropped together
selected_sources = filterted_data[filterted_data['title'].isin(filterted_titles)]

# Second pass: same procedure, with indices that refer to positions
# in selected_sources as it stands after the first pass
remove_index = [1, 9, 77, 87, 102, 122, 123, 142, 143, 146, 154, 155, 171, 172, 175, 176, 177, 182, 183, 192, 202]
filtered_titles = {title for i, title in enumerate(selected_sources['title']) if i not in remove_index}
selected_sources = selected_sources[selected_sources['title'].isin(filtered_titles)]
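Because rows are selected by title with `isin`, two rows that share a title are always kept or dropped together, even if only one of their indices was listed. If that is not the intent, one possible alternative is to drop rows by position. A minimal sketch on toy data; `drop_by_position` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd


def drop_by_position(df, remove_positions):
    """Return df without the rows at the given 0-based positions."""
    remove = set(remove_positions)
    keep = [i for i in range(len(df)) if i not in remove]
    return df.iloc[keep]


toy = pd.DataFrame({
    "title": ["Flood update", "Returnee monitoring", "Flood update", "Market report"],
    "content": ["a", "b", "c", "d"],
})

# Positions 1 and 2 are removed, although position 2 shares its title with position 0
kept = drop_by_position(toy, [1, 2])
print(kept["title"].tolist())
```

With the title-based filter, the row at position 2 would survive because its title also belongs to a kept row.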


In [14]:
for idx, title in enumerate(selected_sources['title']):
    print(idx, title)

0 Afghanistan: Minutes of FSAC Monthly Meeting (15 November 2023) [Meeting Minutes]
1 Guideline on Food Security and Agriculture Cluster Response Packages (May 2024)
2 Afghanistan: Minutes of FSAC Monthly Meeting (20 March 2023) [Meeting Minutes]
3 Afghanistan floods affect over 30,000 since year-start: UN
4 WHO distributes 25 tons of aid to flood victims in Afghanistan's Ghor
5 WFP identifies conflict as primary cause of global hunger crisis
6 Afghanistan Floods Flash Update #2 (21 May 2024)
7 UNICEF Afghanistan Humanitarian Situation Update No. 2 (Northern Region Flash Floods) for 26 May 2024.
8 Afghanistan Flooding Situation Report No. 5 (17 - 20 May 2024)
9 Afghanistan Floods: Flash Update #3 - Floods hit the Northeastern Region of Afghanistan (22 May 2024)
10 UNICEF Afghanistan Humanitarian Situation Update No. 1 (Northern Region Flash Floods): 20 May 2024
11 Afghanistan: Countrywide Weekly Market Report: Issue 200: Week 3 May 2024
12 Shelter Cluster Afghanistan: Shelter Needs & A

In [15]:
len(selected_sources)

78

# Save dict of sources 

In [None]:
# Keep one row per distinct document body
selected_sources = selected_sources.drop_duplicates(subset='content')

# Map each row index to a single "title + content" string
sources_dict = {
    index: f"title: {row['title']} \n content: {row['content']}"
    for index, row in selected_sources.iterrows()
}

# main_topic is used verbatim in the file name, so it must not contain path separators
with open(f"./Results/SourcesCountryEvent/Other events/{key_country}_{main_topic}-{key_period}.json", 'w') as f:
    json.dump(sources_dict, f)
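The file name embeds `main_topic`, which is free text returned by the LLM and could in principle contain characters such as `/` that are not valid in a file name. A minimal sketch of a sanitizing step, assuming that replacing such characters with an underscore is acceptable; `safe_filename_part` is a hypothetical helper:

```python
import re


def safe_filename_part(text, replacement="_"):
    """Replace characters that are unsafe in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', replacement, text)
    cleaned = cleaned.strip(" .")
    # Fall back to a placeholder if nothing usable remains
    return cleaned or "untitled"


print(safe_filename_part("Afghanistan Floods"))
print(safe_filename_part("Israel/Hamas war"))
```

A topic without problematic characters passes through unchanged, so existing file names would not be affected.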

In [140]:
selected_sources

Unnamed: 0.1,Unnamed: 0,id_x,article_id,title,content,lang,attachment_id,content_type,url,source_domain,publish_date,duplicate_insert_time,id_y,clean_contents_id,keywords,country,country_json,publish_week,publish_month
3631,50297,106827,3fc4a398-ded7-32a4-a4b1-7bf1d0ac644a,WFP Palestine Emergency Response External Situ...,"HIGHLIGHTS\n• In September to date, WFP reache...",en,,html,https://reliefweb.int/report/occupied-palestin...,reliefweb.int,2024-10-03 12:15:25,NaT,107102,106827,"['34', 'west bank', 'usual number', 'significa...",Palestinian Territory,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,October 2024
6159,49162,105667,15a500d6-b54d-354a-8770-89c98712c018,The United States Announces Nearly $336 Millio...,Office of Press Relations\npress@usaid.gov\nMo...,en,,html,https://reliefweb.int/report/occupied-palestin...,reliefweb.int,2024-09-30 21:15:50,NaT,105724,105667,"['west bank', 'support palestinians', 'humanit...",Palestinian Territory,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,September 2024
6628,51530,108059,c7c526f4-0e70-3fc3-98d4-50fcb161ddf7,UNICEF chief warns Gaza kids face ‘post-genera...,After a year of military operations between Is...,en,,html,https://www.middleeastmonitor.com/20241006-uni...,middleeastmonitor.com,2024-10-06 17:20:00,NaT,108159,108059,"['vaccinating thousands', 'us ”', 'unicef warn...",Palestinian Territory,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,October 2024
7943,51171,107696,97b58a07-3f97-3b2f-a235-e754b46d5b28,Leaked emails show White House ignores early w...,"WASHINGTON, Oct. 5 (Xinhua) -- Leaked emails f...",en,,html,https://english.news.cn/20241005/709ff6adda3d4...,english.news.cn,2024-10-05 00:00:00,NaT,107787,107696,"['white house', 'west bank', 'used u', 'state ...",Palestinian Territory,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,October 2024
9375,50822,107345,4971a171-9e86-389c-ac67-334d39a211da,9 Palestinians killed as Gaza faces wrath of I...,"At least nine Palestinians were killed, and se...",en,,html,https://www.dailysabah.com/world/mid-east/9-pa...,dailysabah.com,2024-10-04 11:58:00,NaT,107555,107345,"['israeli shelling', 'year since', 'territory ...",Palestinian Territory,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,October 2024
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81829,49244,105752,8897b3fe-c0a2-3a7d-9b38-f010b707afcb,Humanitarian Gaza Strip,Makeshift shelters leaving internally displace...,en,db2f33c3-0886-3a96-8204-19526a2f3342,pdf,https://reliefweb.int/attachments/01e445ad-77d...,reliefweb.int,2024-09-30 17:49:15,NaT,105712,105752,"['“ safe', '“ marking', 'younger generation', ...",Palestinian Territory,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,September 2024
81846,49201,105709,8897b3fe-c0a2-3a7d-9b38-f010b707afcb,Humanitarian Situation Update #224 | Gaza Strip,The Humanitarian Situation Update is issued by...,en,,html,https://reliefweb.int/report/occupied-palestin...,reliefweb.int,2024-09-30 17:49:15,NaT,105815,105709,"['gaza strip', '224', '“ safe', '“ marking', '...",Palestinian Territory,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,September 2024
84659,50936,107460,d4cb5a89-c82a-313d-bc02-2e1148abadc4,PALESTINE One year of hostilities: impact on e...,PALESTINE One year of hostilities: impact on e...,en,c5c046c2-6d00-3d3a-9b30-6f4b5baf89fe,pdf,https://reliefweb.int/attachments/f585e9a5-eee...,reliefweb.int,2024-10-04 13:57:29,NaT,107403,107460,"['impact', 'hostilities', 'gaza', 'education',...",Palestinian Territory,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,October 2024
84790,50119,106631,3f995037-a30a-3f71-b10a-3361a4ffeafc,Gaza humanitarian response update | 16-29 Sept...,The information below is provided every other ...,en,,html,https://reliefweb.int/report/occupied-palestin...,reliefweb.int,2024-10-02 19:17:03,NaT,106716,106631,"['16', 'whole population', 'west bank', 'warm ...",Palestinian Territory,"{'geo_loc': {'Gaza': {'latitude': 31.50161, 'g...",Week 40 2024,October 2024


In [146]:
for idx, title in enumerate(selected_sources['title']):
    print(idx, title)

0 WFP Palestine Emergency Response External Situation Report #34 (02 October 2024)
1 The United States Announces Nearly $336 Million in Humanitarian Assistance to Support Palestinians in Gaza and the West Bank
2 UNICEF chief warns Gaza kids face ‘post-generational challenges’
4 9 Palestinians killed as Gaza faces wrath of Israeli shelling
5 Verification of damages to schools based on proximity to damaged sites - Gaza, Occupied Palestinian Territory, Update #6 (September 2024)
6 Permission with conditions for pro-Palestine group protest
7 Palestinian authorities urge displaced Gazans to ignore new Israeli evacuation orders
8 At least 21 killed in Israeli airstrikes on homes, shelters across Gaza
10 UN chief urges peace to end suffering ahead of Gaza war anniversary
11 US to announce over $335m in aid for Palestinians in Gaza, West Bank
12 UN chief appeals for peace ahead of Gaza war anniversary
13 Israeli army orders more evictions of northern Gaza residents after 100 air strikes since 

# Save sources with metadata for all the extracted events in a given folder 

This cell creates new source files that carry metadata for the selected sources.
The metadata currently consists of the title and the URL only; more fields can be added easily if needed.
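Each saved source is a single string of the form `title: <title> \n content: <content>`, so recovering the title and content means splitting that string again. A minimal sketch of a parser that splits only at the first separator and removes only the leading `title: ` prefix; `split_source` is a hypothetical helper:

```python
def split_source(value):
    """Split 'title: <title> \\n content: <content>' into (title, content)."""
    head, sep, content = value.partition("\n content: ")
    if not sep:
        raise ValueError("separator '\\n content: ' not found")
    prefix = "title: "
    if head.startswith(prefix):
        head = head[len(prefix):]
    return head.strip(), content.strip()


title, content = split_source("title: Faces of the floods \n content: Body text")
print(title)
print(content)
```

Splitting at the first separator only keeps the content intact even if the same marker text happens to occur again inside the document body.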

In [11]:

input_folder_path = "/Users/decostanzi/Desktop/Project-ISI/SmartBook/SmartBook-Reports/Pipeline_DatasetDerya/Results/SourcesCountryEvent/Dev set"

output_folder_path = "/Users/decostanzi/Desktop/Project-ISI/SmartBook/SmartBook-Reports/Pipeline_DatasetDerya/Results/SourcesCountryEvent-Metadata/Dev set"
# Loop through all files in the directory
for filename in os.listdir(input_folder_path)[::-1]:
    # Skip anything that is not a JSON file
    if not filename.endswith(".json"):
        continue

    file_path = os.path.join(input_folder_path, filename)
    print(f"Processing file: {file_path}")

    # Load the JSON content
    with open(file_path, "r") as f:
        sources_dict = json.load(f)

    # Collect the updated sources, keyed by their original row number
    updated_sources_dict = {}

    for key, value in sources_dict.items():
        # Each value has the form "title: <title> \n content: <content>"
        title_part, content_part = value.split("\n content: ", 1)
        title = title_part.replace("title: ", "", 1).strip()
        content = content_part.strip()

        # Find rows in full_data that match on either the content or the title
        matching_row = full_data[(full_data['content'] == content) | (full_data['title'] == title)].drop_duplicates(subset='content')

        # If a match is found, take its title and URL; otherwise fall back to the parsed title
        if not matching_row.empty:
            matched_title = matching_row['title'].values[0]
            url = matching_row['url'].values[0]
        else:
            matched_title = title
            url = "No URL available"

        updated_sources_dict[key] = {
            "row_number": key,
            "title": matched_title,
            "content": f"title: {title} \n content: {content}",
            "url": url
        }

    # Write one metadata file per input file
    with open(os.path.join(output_folder_path, f"sources-metadata-{filename}"), "w") as f:
        json.dump(updated_sources_dict, f, indent=4)

Processing file: /Users/decostanzi/Desktop/Project-ISI/SmartBook/SmartBook-Reports/Pipeline_DatasetDerya/Results/SourcesCountryEvent/Dev set/Afghanistan_Afghanistan Floods-Week 21 2024.json
Processing file: /Users/decostanzi/Desktop/Project-ISI/SmartBook/SmartBook-Reports/Pipeline_DatasetDerya/Results/SourcesCountryEvent/Dev set/United Kingdom_UK riots-Week 32 2024.json
Processing file: /Users/decostanzi/Desktop/Project-ISI/SmartBook/SmartBook-Reports/Pipeline_DatasetDerya/Results/SourcesCountryEvent/Dev set/Ukraine_Ukraine-Week 23 2024.json
Processing file: /Users/decostanzi/Desktop/Project-ISI/SmartBook/SmartBook-Reports/Pipeline_DatasetDerya/Results/SourcesCountryEvent/Dev set/Israel_Israel-Hamas war-Week 19 2024.json
Processing file: /Users/decostanzi/Desktop/Project-ISI/SmartBook/SmartBook-Reports/Pipeline_DatasetDerya/Results/SourcesCountryEvent/Dev set/Sudan_Sudan conflict-Week 34 2024.json
Processing file: /Users/decostanzi/Desktop/Project-ISI/SmartBook/SmartBook-Reports/Pipeli