# Seeking Truth in A Time of War

TK: Brief overview of the thesis idea (a few short paragraphs). Include links to, and a brief description of, the data source(s) used in the notebook's analysis.

List of findings (based on your QQs):

- A total of 1,687 journalists have been killed since 1992.
- In the Israel-Palestine conflict, 159 journalists have been killed since Oct. 7, 2023.
- Of the journalists killed in the Israel-Palestine conflict since Oct. 7, 2023, 47 were freelancers.
- Between 2023 and 2025, the year with the highest number of deaths in Israel and the Occupied Palestinian Territory was 2024, with 76 deaths.
- 
- 
-

Main questions:
- How many journalists were killed between 1992 and 2025?
- How many journalists have been killed since the Oct. 7, 2023 attacks?
- How many freelance journalists have been killed since the Oct. 7, 2023 attacks?
- Between 2023 and 2025, which year had the highest number of deaths? Why?
- Which regions have experienced the highest number of journalist deaths?
- What are the age and gender demographics of journalists being killed?
- How many journalists were killed reporting from civilian versus active combat zones during the conflict?
- What was the rate of journalists killed per news organization?
- How many journalists died under each category (crossfire, crossfire/combat related, dangerous assignment, murder, unknown) from 2023-2025?
- Which category accounted for the highest number of journalist deaths over 2023-2025?

In [2]:
import pandas as pd
from vega_datasets import data
import altair as alt
import numpy as np
import datetime
import os
from pathlib import Path
pd.set_option('display.max_columns', None)

In [3]:
# Get the current working directory 
current_directory = Path(os.getcwd()).resolve()

In [4]:
DATA_DIR = current_directory.parent.joinpath('data')
DATA_DIR.mkdir(parents=True, exist_ok=True)

In [5]:
# Load the data
csv_file = DATA_DIR / "cpj_data.csv" # Combine path and file name
cpj = pd.read_csv(csv_file)

In [6]:
# Check categorical values, such as the location column, to see if there are any missing values
print("Unique values in the 'location' column:", cpj['location'].unique())

Unique values in the 'location' column: ['Afghanistan' 'Ethiopia' 'Syria'
 'Israel and the Occupied Palestinian Territory' 'Algeria' 'Iraq' 'Libya'
 'Somalia' 'Pakistan' 'Bangladesh' 'South Africa' 'Sierra Leone' 'Yemen'
 'Russia' 'India' 'South Sudan' 'USA' 'Azerbaijan' 'Peru'
 'Democratic Republic of the Congo' 'Mexico' 'East Timor' 'Egypt'
 'Lebanon' 'Ghana' 'Bahrain' 'Maldives' 'Turkey' 'Sri Lanka' 'Sudan'
 'Angola' 'Mozambique' 'Belarus' 'Philippines' 'Central African Republic'
 'Bosnia' 'Georgia' 'Burundi' 'Colombia' 'El Salvador' 'Indonesia'
 'Kyrgyzstan' 'Rwanda' 'Ukraine' 'Madagascar' 'Nicaragua' 'Ivory Coast'
 'Brazil' 'Thailand' 'Myanmar' 'Serbia' 'Nigeria' 'France' 'Nepal'
 'Canada' 'Haiti' 'Ecuador' 'Honduras' 'Bolivia' 'Cambodia' 'Gambia'
 'Barbados' 'Mali' 'Guatemala' 'Malta' 'Panama' 'Tanzania' 'Burkina Faso'
 'Montenegro' 'Iran' 'Paraguay' 'Zimbabwe' 'Guinea' 'Chad'
 'Republic of Congo' 'China' 'Eritrea' 'Tajikistan' 'Kenya' 'Chile'
 'Yugoslavia' 'Kazakhstan' 'Cameroon

In [7]:
#Check for missing values in the location column
print("Missing values in the 'location' column:", cpj['location'].isnull().sum())

Missing values in the 'location' column: 0
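
The same check can be run on every column at once rather than one column at a time. The sketch below uses a small hypothetical `sample` frame (illustrative rows, not the CPJ file) to show how `isnull().sum()` reports missing values per column and how to list columns that hold no data at all; on the real data the equivalent call would be `cpj.isnull().sum()`.

```python
import pandas as pd

# Hypothetical sample rows standing in for the CPJ data (illustrative only)
sample = pd.DataFrame({
    "fullName": ["A", "B", "C"],
    "location": ["Iraq", "Syria", None],
    "charges": [None, None, None],
})

# Count missing values in every column at once
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# List columns in which every row is missing
empty_columns = missing_per_column[missing_per_column == len(sample)].index.tolist()
print("Columns with no data:", empty_columns)
```

A column that is empty in every row can usually be dropped or ignored before analysis.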


## Data Quality Assessments

In [8]:
#Check row counts if that makes sense (and compare to what is on the source agency site)
print("Total number of rows in the dataset:", cpj.shape[0])

Total number of rows in the dataset: 1687


In [9]:
#Get the column names
print("Column names in the dataset:", cpj.columns.tolist())

Column names in the dataset: ['fullName', 'organizations', 'location', 'status', 'typeOfDeath', 'startDisplay', 'mtpage', 'type', 'motiveConfirmed', 'charges']


In [10]:
#Change startDisplay column name to "Date"
cpj = cpj.rename(columns={'startDisplay': 'Date'})
#Convert the Date column to datetime format
cpj['Date'] = pd.to_datetime(cpj['Date'], errors='coerce')

In [11]:
cpj['Date'] = cpj['Date'].dt.strftime('%Y-%m-%d')  # Format the date as YYYY-MM-DD
# Check if the Date column is in the correct format
print("First few rows of the dataset with formatted Date column:")
print(cpj[['Date', 'location']].head())

First few rows of the dataset with formatted Date column:
         Date                                       location
0  2018-04-30                                    Afghanistan
1  1998-02-09                                       Ethiopia
2  2012-12-21                                          Syria
3  2023-12-18  Israel and the Occupied Palestinian Territory
4  1996-02-10                                        Algeria
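
Because `errors='coerce'` silently turns any unparseable date into `NaT`, it is worth counting how many rows were affected before relying on the `Date` column. A minimal sketch, assuming hypothetical raw values (the actual format of `startDisplay` in the CPJ file is not shown here):

```python
import pandas as pd

# Hypothetical raw date strings; the last one cannot be parsed
raw_dates = pd.Series(["2018-04-30", "1998-02-09", "not a date"])

# Unparseable values become NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Count how many values failed to parse
num_unparsed = int(parsed.isna().sum())
print("Unparseable dates:", num_unparsed)

# Format the parsed dates as YYYY-MM-DD strings for display
formatted = parsed.dt.strftime("%Y-%m-%d")
print(formatted.head())
```

If the count is greater than zero, the affected rows drop out of any later date filter, so the totals derived from them would be undercounts.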


In [12]:
# Filter for journalists whose location is "Israel and the Occupied Palestinian Territory"
filtered_cpj = cpj[cpj['location'] == "Israel and the Occupied Palestinian Territory"]
filtered_cpj.count()

fullName           178
organizations      178
location           178
status             178
typeOfDeath        178
Date               178
mtpage             178
type               178
motiveConfirmed    178
charges              0
dtype: int64

In [13]:
# The dataset records 178 journalists killed in Israel and the Occupied Palestinian Territory.

## Data Analysis

### Finding 1: A total of 1,687 journalists have been killed since 1992.

In [14]:
# How many journalists were killed between 1992 and 2025?
cpj.count()

fullName           1687
organizations      1687
location           1687
status             1687
typeOfDeath        1687
Date               1687
mtpage             1687
type               1687
motiveConfirmed    1687
charges               0
dtype: int64

### Finding 2: 159 journalists have been killed since Oct. 7, 2023.

In [15]:
#What was the number of journalists killed ('status' column) since Oct. 7?

In [16]:
#Filter for number of journalists killed since Oct. 7, 2023 in filtered_cpj
killed_since_oct7 = filtered_cpj[(filtered_cpj['status'] == 'Killed') & (filtered_cpj['Date'] >= '2023-10-07')]
#Count the number of journalists killed
num_killed_since_oct7 = killed_since_oct7.shape[0]
print(f"Number of journalists killed since Oct. 7, 2023: {num_killed_since_oct7}")

Number of journalists killed since Oct. 7, 2023: 159


In [35]:
# Create a copy of the filtered DataFrame to avoid "SettingWithCopyWarning"
filtered_cpj = filtered_cpj.copy()

# Ensure 'Date' column is in datetime format
filtered_cpj['Date'] = pd.to_datetime(filtered_cpj['Date'])

# Filter for journalists killed since October 7
oct_7_date = datetime.datetime(2023, 10, 7)
killed_since_oct7 = filtered_cpj.loc[
    (filtered_cpj['status'] == 'Killed') & (filtered_cpj['Date'] >= oct_7_date)
].copy()  # Explicitly create a copy of the filtered DataFrame

# Add a 'Month' column safely
killed_since_oct7['Month'] = killed_since_oct7['Date'].dt.month

print(killed_since_oct7)

                    fullName                              organizations  \
3             Abdallah Alwan  Holy Quran Radio,Midan,Mugtama,Al-Jazeera   
5        Abdallah Iyad Breis                Rawafed educational channel   
10    Abdel Rahman al-Tanani                                  Freelance   
36         Abdul Rahman Bahr                    Palestine Breaking News   
38        Abdul Rahman Saima                                  Raqami TV   
...                      ...                                        ...   
1650  Yasser Mamdouh El-Fady                         Kan'an news agency   
1651        Yazan al-Zuweidi                                    Al-Ghad   
1655      Yousef Maher Dawas                                  Freelance   
1672       Zahraa Abu Skheil                                  Freelance   
1679          Zayd Abu Zayed                                Quran Radio   

                                           location  status  \
3     Israel and the Occupied Palest

In [18]:
# Add 'Year' column to filtered_cpj DataFrame
filtered_cpj['Year'] = filtered_cpj['Date'].dt.year

### Finding 3: 47 freelance journalists have been killed since Oct. 7, 2023.

In [19]:
#What was the number of freelance journalists killed ('status' column) since Oct. 7?

In [20]:
# Filter for number of freelance journalists killed since Oct. 7, 2023 in filtered_cpj
freelance_killed_since_oct7 = filtered_cpj[
    (filtered_cpj['status'] == 'Killed') & 
    (filtered_cpj['Date'] >= '2023-10-07') & 
    (filtered_cpj['organizations'] == 'Freelance')
]
# Count the number of freelance journalists killed
num_freelance_killed_since_oct7 = freelance_killed_since_oct7.shape[0]
print(f"Number of freelance journalists killed since Oct. 7, 2023: {num_freelance_killed_since_oct7}")

Number of freelance journalists killed since Oct. 7, 2023: 47
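
The `organizations` column can hold several comma-separated names, so an exact match on `'Freelance'` only counts rows where that is the sole entry. The sketch below compares the exact match with a check on each listed organization. The `sample` rows and organization names are hypothetical, and whether the CPJ data actually contains mixed entries such as "Freelance" plus an outlet is an assumption that would need to be checked.

```python
import pandas as pd

# Hypothetical rows; organization names are illustrative only
sample = pd.DataFrame({
    "fullName": ["A", "B", "C", "D"],
    "organizations": ["Freelance", "Freelance,Example News", "Example TV", "Freelance"],
})

# Exact match: counts only rows where 'Freelance' is the whole value
exact = sample["organizations"] == "Freelance"

# Split each comma-separated list and check whether any entry is 'Freelance'
listed = sample["organizations"].str.split(",").apply(
    lambda orgs: any(org.strip().lower() == "freelance" for org in orgs)
)

print("Exact matches:", int(exact.sum()))
print("Rows listing Freelance among their organizations:", int(listed.sum()))
```

If the two counts differ on the real data, the finding should state which definition of "freelance" it uses.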


In [21]:
# Count freelance journalists killed per year since Oct. 7, 2023
freelance_killed_per_year = freelance_killed_since_oct7.groupby('Year').size().reset_index(name='Freelance Killed')
print("Freelance journalists killed per year since Oct. 7, 2023:")
print(freelance_killed_per_year)

Freelance journalists killed per year since Oct. 7, 2023:
   Year  Freelance Killed
0  2023                12
1  2024                25
2  2025                10


In [27]:
# Create a bar chart for the number of freelance journalists killed per year since Oct. 7, 2023
bar_chart_freelance_killed = alt.Chart(freelance_killed_per_year).mark_bar().encode(
    x=alt.X('Year:O', title='Year', axis=alt.Axis(labelAngle=0)),  # Set x-axis labels horizontal
    y=alt.Y('Freelance Killed:Q', title='Number of Freelance Journalists Killed'),
    color=alt.Color('Freelance Killed:Q', scale=alt.Scale(scheme='blues')),  # Light-to-dark blue shades
    tooltip=['Year', 'Freelance Killed']  # Add hover tooltip
).properties(
    title='Freelance Journalists Killed per Year since Oct. 7, 2023',
    width=800  # Make the chart wider
).interactive()  # Enable panning and zooming (hover tooltips come from the tooltip encoding)

# Display the bar chart
bar_chart_freelance_killed.show()

### Finding 4: Between 2023 and 2025, the year with the highest number of deaths was 2024, with 76 deaths.

In [23]:
#Between 2023-2025, what year had the highest number of deaths? Why?

In [24]:
# Find the year with the highest number of deaths
highest_deaths_year = filtered_cpj[
    (filtered_cpj['Date'] >= '2023-01-01') & 
    (filtered_cpj['Date'] <= '2025-12-31')
].groupby('Year').size().reset_index(name='Count').loc[
    lambda df: df['Count'].idxmax()
]
# Print the year with the highest number of deaths
print(f"The year with the highest number of deaths between 2023-2025 is {highest_deaths_year['Year']} with {highest_deaths_year['Count']} deaths.")

The year with the highest number of deaths between 2023-2025 is 2024 with 76 deaths.
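
Looking at the full per-year table, rather than only the maximum, makes it easier to see how the peak year compares with the others. A minimal sketch, assuming a hypothetical `sample` frame with a datetime `Date` column (the dates are illustrative, not CPJ records):

```python
import pandas as pd

# Hypothetical dates standing in for the filtered CPJ rows
sample = pd.DataFrame({
    "Date": pd.to_datetime([
        "2023-10-10", "2023-11-02",
        "2024-01-05", "2024-03-09", "2024-08-20",
        "2025-02-01",
    ])
})

# Count deaths per calendar year, ordered by year
deaths_per_year = sample["Date"].dt.year.value_counts().sort_index()
print(deaths_per_year)

# Year with the most deaths and its count
peak_year = int(deaths_per_year.idxmax())
peak_count = int(deaths_per_year.max())
print(f"Peak year: {peak_year} with {peak_count} deaths")
```

Keeping the whole table also helps with the "Why?" part of the question, since partial years stand out when the counts are shown side by side.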


### Finding 5: On a global level, the location that experienced the highest number of journalist deaths was ...

In [32]:
#On a global level, which locations have experienced the highest number of journalist deaths since 1992?
# Group by 'location' and count the number of deaths
global_deaths = cpj.groupby('location').size().reset_index(name='Deaths')
# Sort by the number of deaths in descending order
global_deaths_sorted = global_deaths.sort_values(by='Deaths', ascending=False)
# Print the top locations with the highest number of journalist deaths
print("Locations with the highest number of journalist deaths since 1992:")
print(global_deaths_sorted.head(10))  # Display the top 10 locations

Locations with the highest number of journalist deaths since 1992:
                                         location  Deaths
46                                           Iraq     193
48  Israel and the Occupied Palestinian Territory     178
90                                          Syria     145
75                                    Philippines      96
84                                        Somalia      73
70                                       Pakistan      68
63                                         Mexico      65
43                                          India      61
78                                         Russia      60
1                                         Algeria      60
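
To put these counts in context, each location's total can be expressed as a share of the 1,687 rows in the dataset. The sketch below re-enters the ten figures printed above by hand; the `top_locations` frame and the `Share (%)` column name are illustrative and not part of the notebook.

```python
import pandas as pd

# Top-10 counts copied from the printed output above
top_locations = pd.DataFrame({
    "location": [
        "Iraq", "Israel and the Occupied Palestinian Territory", "Syria",
        "Philippines", "Somalia", "Pakistan", "Mexico", "India", "Russia", "Algeria",
    ],
    "Deaths": [193, 178, 145, 96, 73, 68, 65, 61, 60, 60],
})

# Total number of rows reported for the full dataset
total_rows = 1687

# Share of all recorded deaths for each location, in percent
top_locations["Share (%)"] = (top_locations["Deaths"] / total_rows * 100).round(1)
print(top_locations)

# Combined share of the ten locations
top10_share = round(top_locations["Deaths"].sum() / total_rows * 100, 1)
print("Top 10 combined share (%):", top10_share)
```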


In [40]:
# Assumption: 'global_deaths.csv' is not in the working directory, so the
# yearly totals are derived from the cpj DataFrame instead of a separate file
deaths_per_year = (
    cpj.assign(Year=pd.to_datetime(cpj['Date'], errors='coerce').dt.year)
    .dropna(subset=['Year'])
    .groupby('Year').size().reset_index(name='Deaths')
)
# Convert the year back to a datetime so Altair can plot it on a temporal axis
deaths_per_year['Year'] = pd.to_datetime(deaths_per_year['Year'].astype(int).astype(str), format='%Y')

alt.Chart(deaths_per_year).mark_circle(
    opacity=0.8,
    stroke='black',
    strokeWidth=1,
    strokeOpacity=0.4
).encode(
    alt.X('Year:T')
        .title(None)
        .scale(domain=[pd.to_datetime('1992'), pd.to_datetime('2025')]),
    alt.Y('Deaths:Q')
        .title("Journalist Deaths"),
    alt.Size('Deaths:Q')
        .scale(range=[0, 2500])
        .title('Deaths'),
    alt.ColorValue('steelblue'),
    tooltip=[
        alt.Tooltip("Year:T", format='%Y'),
        alt.Tooltip("Deaths:Q", format='~s')
    ],
).properties(
    width=450,
    height=320,
    title=alt.Title(
        text="Global Journalist Deaths (1992–2025)",
        subtitle="Each bubble represents the total journalist deaths in that year",
        anchor='start'
    )
).configure_axisX(
    grid=False
).configure_view(
    stroke=None
)

In [38]:
# Assumption: filtered_cpj has no 'Entity' or 'Deaths' columns, so deaths are
# aggregated per location and year from cpj, with 'location' standing in for 'Entity'
deaths_by_location = (
    cpj.assign(Year=pd.to_datetime(cpj['Date'], errors='coerce').dt.year)
    .dropna(subset=['Year'])
    .groupby(['location', 'Year']).size().reset_index(name='Deaths')
)
# Convert the year back to a datetime so Altair can plot it on a temporal axis
deaths_by_location['Year'] = pd.to_datetime(deaths_by_location['Year'].astype(int).astype(str), format='%Y')

alt.Chart(deaths_by_location).mark_circle(
    opacity=0.8,
    stroke='black',
    strokeWidth=1,
    strokeOpacity=0.4
).encode(
    alt.X('Year:T')
        .title(None)
        .scale(domain=['1992', '2025']),  # Start and end of the time axis
    alt.Y('location:N')
        .title(None)
        .sort(field="Deaths", op="sum", order='descending'),
    alt.Size('Deaths:Q')
        .scale(range=[0, 2500])
        .title('Deaths')
        .legend(clipHeight=30, format='s'),
    alt.Color('location:N').legend(None),
    tooltip=[
        "location:N",
        alt.Tooltip("Year:T", format='%Y'),
        alt.Tooltip("Deaths:Q", format='~s')
    ],
).properties(
    width=450,
    height=320,
    title=alt.Title(
        text="Global Journalist Deaths (1992-2025)",
        subtitle="The size of the bubble represents the total death count per year",
        anchor='start'
    )
).configure_axisY(
    domain=False,
    ticks=False,
    offset=10
).configure_axisX(
    grid=False,
).configure_view(
    stroke=None
)
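
The main questions also ask how deaths break down by category between 2023 and 2025, and which category is the largest. A minimal sketch of that tally, assuming the categories live in the `typeOfDeath` column. The rows below are hypothetical, and the category spellings are taken from the question list rather than checked against the CPJ file.

```python
import pandas as pd

# Hypothetical rows; dates and categories are illustrative only
sample = pd.DataFrame({
    "Date": pd.to_datetime([
        "2022-05-11", "2023-10-09", "2023-11-20",
        "2024-02-14", "2024-07-31", "2025-03-24",
    ]),
    "typeOfDeath": [
        "Murder", "Crossfire", "Murder",
        "Murder", "Dangerous Assignment", "Unknown",
    ],
})

# Keep only deaths dated between 2023 and 2025
in_range = sample[(sample["Date"] >= "2023-01-01") & (sample["Date"] <= "2025-12-31")]

# Count deaths in each category, largest first
deaths_by_type = in_range.groupby("typeOfDeath").size().sort_values(ascending=False)
print(deaths_by_type)

# Category with the highest count in the sample
top_category = deaths_by_type.idxmax()
print("Largest category in the sample:", top_category)
```

On the real data, the same steps could be applied to `filtered_cpj` once its `Date` column has been converted to datetime.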