# Seeking Truth in A Time of War

TK: Brief overview of the thesis idea (a few short paragraphs). Include links to, and brief descriptions of, the data source(s) used in the notebook's analysis.

List of findings (based on your QQs):

- Since 1992, 1,687 journalists have been killed worldwide.
- Since Oct. 7, 2023, 159 journalists have been killed in the Israel-Palestine conflict.
- Of those killed in the Israel-Palestine conflict since Oct. 7, 2023, 47 were freelance journalists.
- Between 2023 and 2025, the deadliest year for journalists in Israel and the Occupied Palestinian Territory was 2024, with 76 deaths.
- Globally, Iraq had the highest number of journalist deaths, with 193. Israel and the Occupied Palestinian Territory was second, with 178.

Main questions:
- How many journalists were killed between 1992 and 2025?
- How many journalists have been killed since the Oct. 7, 2023 attacks?
    - How many freelance journalists have been killed since the Oct. 7, 2023 attacks? Are more freelancers dying than journalists who work for a news org? If so, why? For example, do they tend to have fewer resources for navigating war zones (e.g., war zone handlers, typically natives of the area, who help journalists navigate war zones more safely)?
- Which months had the largest numbers of deaths between late 2023 and May 2025? Why? That is, tie spikes in the data to specific conflicts (before and after), and look at deaths over a number of years, including a few years prior and after.
- Which countries (or conflicts) have experienced the highest number of journalist deaths?
- How many journalists were killed reporting from civilian versus active combat zones during the conflict?
- Which news orgs lost the most journalists?
- What are the age and gender demographics of journalists being killed?
- How many journalists died under each category (crossfire, crossfire/combat related, dangerous assignment, murder, unknown) from 2023-2025?
- Which category accounted for the highest number of journalist deaths over 2023-2025?
- How do journalist deaths before and after Oct. 7 compare, both at the global level and in the Israel/Palestine territory, and how does this conflict compare to others?

In [30]:
import pandas as pd
from vega_datasets import data
import altair as alt
import numpy as np
import datetime
import os
from pathlib import Path
pd.set_option('display.max_columns', None)

In [31]:
# Get the current working directory 
current_directory = Path(os.getcwd()).resolve()

In [32]:
DATA_DIR = current_directory.parent.joinpath('data')
DATA_DIR.mkdir(parents=True, exist_ok=True)

In [33]:
# Load the data
csv_file = DATA_DIR / "cpj_data.csv" # Combine path and file name
cpj = pd.read_csv(csv_file)

In [34]:
#Check categorical values like the location column to see if there are any missing values
print("Unique values in the 'location' column:", cpj['location'].unique())

Unique values in the 'location' column: ['Afghanistan' 'Ethiopia' 'Syria'
 'Israel and the Occupied Palestinian Territory' 'Algeria' 'Iraq' 'Libya'
 'Somalia' 'Pakistan' 'Bangladesh' 'South Africa' 'Sierra Leone' 'Yemen'
 'Russia' 'India' 'South Sudan' 'USA' 'Azerbaijan' 'Peru'
 'Democratic Republic of the Congo' 'Mexico' 'East Timor' 'Egypt'
 'Lebanon' 'Ghana' 'Bahrain' 'Maldives' 'Turkey' 'Sri Lanka' 'Sudan'
 'Angola' 'Mozambique' 'Belarus' 'Philippines' 'Central African Republic'
 'Bosnia' 'Georgia' 'Burundi' 'Colombia' 'El Salvador' 'Indonesia'
 'Kyrgyzstan' 'Rwanda' 'Ukraine' 'Madagascar' 'Nicaragua' 'Ivory Coast'
 'Brazil' 'Thailand' 'Myanmar' 'Serbia' 'Nigeria' 'France' 'Nepal'
 'Canada' 'Haiti' 'Ecuador' 'Honduras' 'Bolivia' 'Cambodia' 'Gambia'
 'Barbados' 'Mali' 'Guatemala' 'Malta' 'Panama' 'Tanzania' 'Burkina Faso'
 'Montenegro' 'Iran' 'Paraguay' 'Zimbabwe' 'Guinea' 'Chad'
 'Republic of Congo' 'China' 'Eritrea' 'Tajikistan' 'Kenya' 'Chile'
 'Yugoslavia' 'Kazakhstan' 'Cameroon

In [35]:
#Check for missing values in the location column
print("Missing values in the 'location' column:", cpj['location'].isnull().sum())

Missing values in the 'location' column: 0


## Data Quality Assessments

In [36]:
#Check row counts if that makes sense (and compare to what is on the source agency site)
print("Total number of rows in the dataset:", cpj.shape[0])

Total number of rows in the dataset: 1687


In [37]:
#Get the column names
print("Column names in the dataset:", cpj.columns.tolist())

Column names in the dataset: ['fullName', 'organizations', 'location', 'status', 'typeOfDeath', 'startDisplay', 'mtpage', 'type', 'motiveConfirmed', 'charges']
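
A fuller quality pass might also look at missing values in every column and at fully duplicated rows, since the later counts rest on row totals. The following is a minimal sketch on a toy table standing in for `cpj`; the rows and the names `toy`, `missing_per_column`, `empty_columns`, and `num_duplicates` are invented for illustration and are not taken from the CPJ data.

```python
import pandas as pd

# Toy stand-in for the CPJ table; these rows are invented for illustration only.
toy = pd.DataFrame({
    "fullName": ["Reporter A", "Reporter B", "Reporter B", "Reporter C"],
    "location": ["Iraq", "Syria", "Syria", None],
    "charges": [None, None, None, None],
})

# Missing values per column, not only in 'location'.
missing_per_column = toy.isnull().sum()

# Columns that contain no values at all.
empty_columns = missing_per_column[missing_per_column == len(toy)].index.tolist()

# Fully duplicated rows, which would inflate any count based on row totals.
num_duplicates = int(toy.duplicated().sum())

print(missing_per_column)
print("Entirely empty columns:", empty_columns)
print("Duplicate rows:", num_duplicates)
```

Running the same three checks on `cpj` would show whether any column other than `location` has gaps and whether any journalist is listed twice.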


In [38]:
#Change startDisplay column name to "Date"
cpj = cpj.rename(columns={'startDisplay': 'Date'})
#Convert the Date column to datetime format
cpj['Date'] = pd.to_datetime(cpj['Date'], errors='coerce')

In [39]:
cpj['Year'] = cpj['Date'].dt.strftime('%Y')
cpj['Date'] = cpj['Date'].dt.strftime('%Y-%m-%d')  # Format the date as YYYY-MM-DD
# Check if the Date column is in the correct format
print("First few rows of the dataset with formatted Date column:")
print(cpj[['Date', 'location']].head())

First few rows of the dataset with formatted Date column:
         Date                                       location
0  2018-04-30                                    Afghanistan
1  1998-02-09                                       Ethiopia
2  2012-12-21                                          Syria
3  2023-12-18  Israel and the Occupied Palestinian Territory
4  1996-02-10                                        Algeria
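
Because the conversion uses `errors='coerce'`, any value that cannot be parsed becomes `NaT` silently. A minimal sketch of how one might count those cases is below; the date strings are invented for illustration (the last one is deliberately malformed) and the format of the real `startDisplay` values is not assumed.

```python
import pandas as pd

# Toy date strings invented for illustration; the last one is deliberately malformed.
raw = pd.DataFrame({"Date": ["2018-04-30", "1998-02-09", "not a date"]})

# errors='coerce' converts anything unparseable to NaT rather than raising.
parsed = pd.to_datetime(raw["Date"], errors="coerce")

# Count how many values were lost to coercion.
num_unparsed = int(parsed.isna().sum())

# Derive the year while the column is still a datetime.
years = parsed.dt.year

print("Unparseable dates:", num_unparsed)
print(years.tolist())
```

A count of zero on the real column would confirm that no rows dropped out of the date-based filters that follow.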


In [40]:
cpj.to_csv('~/Desktop/cpj_thesis_2/data/cpj.csv', index=False)

In [41]:
#Filter for rows where the location column is "Israel and the Occupied Palestinian Territory"
filtered_cpj = cpj[cpj['location'] == "Israel and the Occupied Palestinian Territory"]
filtered_cpj.count()

fullName           178
organizations      178
location           178
status             178
typeOfDeath        178
Date               178
mtpage             178
type               178
motiveConfirmed    178
charges              0
Year               178
dtype: int64

In [42]:
#The number of journalists killed in Israel and the Occupied Palestinian Territory was 178.
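
The count above is a row count for the location, whatever the value in `status`. One way to check that every row is in fact a death is to break the rows down by status. This is a sketch on toy data; the rows are invented and the `'Missing'` status value is hypothetical, used only to show what a non-`'Killed'` row would look like.

```python
import pandas as pd

# Toy rows invented for illustration; the 'Missing' status value is hypothetical.
toy = pd.DataFrame({
    "location": ["Iraq", "Iraq", "Syria", "Iraq"],
    "status": ["Killed", "Killed", "Killed", "Missing"],
})

iraq = toy[toy["location"] == "Iraq"]

# Row count for the location, regardless of status.
total_rows = len(iraq)

# Breakdown by status, so any non-'Killed' row becomes visible.
status_counts = iraq["status"].value_counts()

print("Rows for location:", total_rows)
print(status_counts)
```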

## Data Analysis

### Finding 1: A total of 1,687 journalists have been killed since 1992.

In [43]:
#How many journalists were killed between 1992 and 2025?
cpj.count()

fullName           1687
organizations      1687
location           1687
status             1687
typeOfDeath        1687
Date               1687
mtpage             1687
type               1687
motiveConfirmed    1687
charges               0
Year               1687
dtype: int64

### Finding 2: Since Oct. 7, 2023, 159 journalists have been killed in Israel and the Occupied Palestinian Territory.

In [44]:
#What was the number of journalists killed ('status' column) since Oct. 7?

In [45]:
#Filter for number of journalists killed since Oct. 7, 2023 in filtered_cpj
killed_since_oct7 = filtered_cpj[(filtered_cpj['status'] == 'Killed') & (filtered_cpj['Date'] >= '2023-10-07')]
#Count the number of journalists killed
num_killed_since_oct7 = killed_since_oct7.shape[0]
print(f"Number of journalists killed since Oct. 7, 2023: {num_killed_since_oct7}")

Number of journalists killed since Oct. 7, 2023: 159


In [46]:
# Create a copy of the filtered DataFrame to avoid "SettingWithCopyWarning"
filtered_cpj = filtered_cpj.copy()

# Ensure 'Date' column is in datetime format
filtered_cpj['Date'] = pd.to_datetime(filtered_cpj['Date'])

# Filter for journalists killed since October 7
oct_7_date = datetime.datetime(2023, 10, 7)
killed_since_oct7 = filtered_cpj.loc[
    (filtered_cpj['status'] == 'Killed') & (filtered_cpj['Date'] >= oct_7_date)
].copy()  # Explicitly create a copy of the filtered DataFrame

# Add a 'Month' column safely
killed_since_oct7['Month'] = killed_since_oct7['Date'].dt.month

print(killed_since_oct7)

                    fullName                              organizations  \
3             Abdallah Alwan  Holy Quran Radio,Midan,Mugtama,Al-Jazeera   
5        Abdallah Iyad Breis                Rawafed educational channel   
10    Abdel Rahman al-Tanani                                  Freelance   
36         Abdul Rahman Bahr                    Palestine Breaking News   
38        Abdul Rahman Saima                                  Raqami TV   
...                      ...                                        ...   
1650  Yasser Mamdouh El-Fady                         Kan'an news agency   
1651        Yazan al-Zuweidi                                    Al-Ghad   
1655      Yousef Maher Dawas                                  Freelance   
1672       Zahraa Abu Skheil                                  Freelance   
1679          Zayd Abu Zayed                                Quran Radio   

                                           location  status  \
3     Israel and the Occupied Palest

In [47]:
# Add 'Year' column to filtered_cpj DataFrame
filtered_cpj['Year'] = filtered_cpj['Date'].dt.year
filtered_cpj.to_csv('~/Desktop/cpj_thesis_2/data/filtered_cpj.csv', index=False)

### Finding 3: Since Oct. 7, 2023, 47 freelance journalists have been killed in Israel and the Occupied Palestinian Territory.

In [48]:
#What was the number of freelance journalists killed ('status' column) since Oct. 7?

In [49]:
# Filter for number of freelance journalists killed since Oct. 7, 2023 in filtered_cpj
freelance_killed_since_oct7 = filtered_cpj[
    (filtered_cpj['status'] == 'Killed') & 
    (filtered_cpj['Date'] >= '2023-10-07') & 
    (filtered_cpj['organizations'] == 'Freelance')
]
# Count the number of freelance journalists killed
num_freelance_killed_since_oct7 = freelance_killed_since_oct7.shape[0]
print(f"Number of freelance journalists killed since Oct. 7, 2023: {num_freelance_killed_since_oct7}")

Number of freelance journalists killed since Oct. 7, 2023: 47
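
The filter above matches rows whose `organizations` value is exactly `'Freelance'`. Since the column can hold several comma-separated outlets, a freelancer who is also listed with an outlet would not be counted. A minimal sketch of a token-based match is below. The rows and outlet names are invented for illustration, and whether such combined entries exist in the CPJ data is an assumption to verify.

```python
import pandas as pd

# Toy rows invented for illustration; the combined affiliation is hypothetical.
toy = pd.DataFrame({
    "fullName": ["Reporter A", "Reporter B", "Reporter C"],
    "organizations": ["Freelance", "Freelance,Outlet X", "Outlet Y"],
})

# Exact match: counts only rows whose sole affiliation is 'Freelance'.
exact = toy[toy["organizations"] == "Freelance"]

# Token match: split the comma-separated list and look for 'Freelance' in it.
is_freelance = toy["organizations"].fillna("").str.split(",").apply(
    lambda orgs: "Freelance" in [o.strip() for o in orgs]
)
token = toy[is_freelance]

print("Exact match:", len(exact))
print("Token match:", len(token))
```

If both approaches give 47 on the real data, the exact match is sufficient; a larger token count would mean the finding understates freelance deaths.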


In [50]:
#Count freelance journalists killed per year since Oct. 7, 2023
freelance_killed_per_year = freelance_killed_since_oct7.groupby('Year').size().reset_index(name='Freelance Killed')
print("Freelance journalists killed per year since Oct. 7, 2023:")
print(freelance_killed_per_year)

Freelance journalists killed per year since Oct. 7, 2023:
   Year  Freelance Killed
0  2023                12
1  2024                25
2  2025                10


In [51]:
# Create a bar chart for the number of freelance journalists killed per year since Oct. 7, 2023
bar_chart_freelance_killed = alt.Chart(freelance_killed_per_year).mark_bar().encode(
    x=alt.X('Year:O', title='Year', axis=alt.Axis(labelAngle=0)),  # Set x-axis labels horizontal
    y=alt.Y('Freelance Killed:Q', title='Number of Freelance Journalists Killed'),
    #color=alt.Color('Freelance Killed:Q', scale=alt.Scale(scheme='blues')),  # Light-to-dark blue shades
    tooltip=['Year', 'Freelance Killed']  # Add hover tooltip
).properties(
    title='Freelance Journalists Killed per Year since Oct. 7, 2023',
    width=800  # Make the chart wider
).interactive()  # Enable hover interaction

# Display the bar chart
bar_chart_freelance_killed.show()

In [52]:
#Freelance journalists killed per month
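
One possible way to fill in this step is to group by calendar month, keeping year and month together so that, for example, October 2023 and October 2024 stay in separate buckets. The sketch below uses toy rows invented for illustration; `per_month` is a hypothetical name, and in the notebook the input would presumably be `freelance_killed_since_oct7` with its datetime `Date` column.

```python
import pandas as pd

# Toy rows invented for illustration only.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2023-10-10", "2023-10-25", "2023-11-02", "2024-10-05"]),
    "organizations": ["Freelance"] * 4,
})

# Group by year-month period so the same month in different years is not merged.
per_month = (
    toy.groupby(toy["Date"].dt.to_period("M"))
    .size()
    .reset_index(name="Freelance Killed")
    .rename(columns={"Date": "Month"})
)

# Store the period as a string such as '2023-10', which is convenient for charting.
per_month["Month"] = per_month["Month"].astype(str)

print(per_month)
```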

### Finding 4: Between 2023 and 2025, the year with the highest number of deaths in Israel and the Occupied Palestinian Territory was 2024, with 76 deaths.

In [53]:
#Between 2023-2025, what year had the highest number of deaths? Why?

In [54]:
# Find the year with the highest number of deaths
highest_deaths_year = filtered_cpj[
    (filtered_cpj['Date'] >= '2023-01-01') & 
    (filtered_cpj['Date'] <= '2025-12-31')
].groupby('Year').size().reset_index(name='Count').loc[
    lambda df: df['Count'].idxmax()
]
# Print the year with the highest number of deaths
print(f"The year with the highest number of deaths between 2023-2025 is {highest_deaths_year['Year']} with {highest_deaths_year['Count']} deaths.")

The year with the highest number of deaths between 2023-2025 is 2024 with 76 deaths.
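
Printing the full per-year table alongside the maximum can make the comparison easier to judge, especially if one of the years is only partly covered by the data. This is a minimal sketch on toy values invented for illustration; `deaths_per_year` and `top_year` are hypothetical names.

```python
import pandas as pd

# Toy years invented for illustration only.
toy = pd.DataFrame({"Year": [2023, 2023, 2024, 2024, 2024, 2025]})

# Count rows per year.
deaths_per_year = toy.groupby("Year").size().reset_index(name="Count")

# Row with the highest count.
top_year = deaths_per_year.loc[deaths_per_year["Count"].idxmax()]

# Each year's share of the total, for context next to the raw counts.
deaths_per_year["Share"] = deaths_per_year["Count"] / deaths_per_year["Count"].sum()

print(deaths_per_year)
print("Top year:", int(top_year["Year"]), "with", int(top_year["Count"]), "deaths")
```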


### Finding 5: Globally, Iraq had the highest number of journalist deaths, with 193. Israel and the Occupied Palestinian Territory was second, with 178.

In [55]:
#On a global level, which locations have experienced the highest number of journalist deaths since 1992?
# Within the cpj.csv group by 'location' and count the number of deaths 
global_deaths = cpj.groupby('location').size().reset_index(name='Deaths').sort_values(by='Deaths', ascending=False)
# Display the top 10 locations with the highest number of journalist deaths
print("Top 10 locations with the highest number of journalist deaths since 1992:")
print(global_deaths.head(10))

Top 10 locations with the highest number of journalist deaths since 1992:
                                         location  Deaths
46                                           Iraq     193
48  Israel and the Occupied Palestinian Territory     178
90                                          Syria     145
75                                    Philippines      96
84                                        Somalia      73
70                                       Pakistan      68
63                                         Mexico      65
43                                          India      61
78                                         Russia      60
1                                         Algeria      60
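
To put the ranking in proportion, each location's count can also be expressed as a percentage of all deaths in the dataset. The sketch below shows one way to do that on toy rows invented for illustration; `deaths` and the `Percent` column are hypothetical names.

```python
import pandas as pd

# Toy rows invented for illustration only.
toy = pd.DataFrame({"location": ["Iraq"] * 3 + ["Syria"] * 2 + ["Mexico"]})

# Count rows per location, sorted from highest to lowest.
deaths = (
    toy["location"]
    .value_counts()
    .rename_axis("location")
    .reset_index(name="Deaths")
)

# Each location's share of all rows, as a percentage rounded to one decimal.
deaths["Percent"] = (deaths["Deaths"] / deaths["Deaths"].sum() * 100).round(1)

print(deaths)
```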


In [58]:
#Count deaths per location, then define the top 10 locations with the highest number of journalist deaths
global_deaths = cpj.groupby('location').size().reset_index(name='Deaths').sort_values(by='Deaths', ascending=False)
top_10_locations = global_deaths.head(10)
alt.themes.enable("dark")

# Create a bubble chart for the top 10 locations with locations on the y-axis
bubble_chart_top_10_locations = alt.Chart(top_10_locations).mark_circle(size=100).encode(
    x=alt.X('Deaths:Q', title='Number of Deaths'),
    y=alt.Y('location:N', title='Location', sort='-x'),  # Sort locations by number of deaths
    size=alt.Size('Deaths:Q', title='Number of Deaths', scale=alt.Scale(range=[10, 500])),  # Bubble size
    color=alt.Color('Deaths:Q', scale=alt.Scale(scheme='reds')),  # Light-to-dark red shades
    tooltip=['location', 'Deaths']  # Add hover tooltip
).properties(
    title='Top 10 Locations with Highest Journalist Deaths since 1992',
    width=500,  # Make the chart less wide
    height=400  # Set height for better visibility
).interactive()  # Enable hover interaction

# Display the bubble chart
bubble_chart_top_10_locations.show()