## What is this notebook about?

This notebook restricts the data to participants who completed the entire study, i.e. those who answered both the pre- and post-questionnaires and donated their data. In addition, we exclude participants who didn't donate the required number of WhatsApp chats.
If you don't want to replicate the filtering process, you can move on to the next notebook using the file *messages_table_CHB.csv*.  

In [2]:
import os
import sys
import pandas as pd
from pathlib import Path
sys.path.insert(1, os.path.abspath('../../..'))
sys.path.insert(1, os.path.abspath('../'))

path_for_files = "../../data/raw"
# Get a list of participants who have donated the data
all_donations = pd.read_csv(Path(f'{path_for_files}/donation_table_CHB.csv')) 

# Get the survey IDs from the post-survey questionnaire; this file already contains only completed entries.
full_participation_IDs = set(pd.read_excel(Path(f'{path_for_files}/post-survey_CHB.xlsx'), usecols=['external_id'])['external_id']) 

# Get the list of donations with BOTH messaging and survey data! 
relevant_donations = all_donations[all_donations['external_id'].isin(full_participation_IDs)] 
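
The ID-matching step above can be illustrated on toy data. This is only a minimal sketch: the table contents and the `toy_*` names are invented for illustration and do not come from the study files.

```python
import pandas as pd

# Toy stand-ins for the donation table and the post-survey IDs (all values are invented).
toy_donations = pd.DataFrame({
    "donation_id": [1, 2, 3, 4],
    "external_id": ["a", "b", "c", "d"],
})
toy_survey_ids = {"a", "c", "d", "z"}  # "z" completed the survey but never donated

# Keep only donations whose participant also completed the post-survey.
has_survey = toy_donations["external_id"].isin(toy_survey_ids)
toy_relevant = toy_donations[has_survey]
toy_dropped = toy_donations[~has_survey]

print(toy_relevant["donation_id"].tolist())  # donations with survey data
print(toy_dropped["donation_id"].tolist())   # donations without survey data
```

Inspecting the dropped rows in this way is one option for confirming that a donation is excluded because its `external_id` is genuinely missing from the survey file rather than, for example, formatted differently.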

### Removing participants not fulfilling the required number of chat donations.

Since participants were asked to donate between 5 and 7 chats, we only include donations with 5, 6, or 7 chats. Feedback plots were generated for a maximum of 10 chats, and participants could choose which ones to view; we did not track that choice. As a result, for donations with more than 7 chats, the mapping between objective and subjective data becomes unreliable.
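
As a rough illustration of this rule, the sketch below counts distinct chats per donation on invented toy data and flags the donations that fall outside the 5 to 7 range. The column names mirror those in the messages table, while the values and the `toy_*`, `kept_ids` and `removed_ids` names are assumptions made for the example.

```python
import pandas as pd

# Invented toy messages: donation 10 has 5 chats, 11 has 4 chats, 12 has 8 chats.
rows = []
for donation_id, n_chats in [(10, 5), (11, 4), (12, 8)]:
    for chat in range(n_chats):
        # Two messages per chat, so message counts differ from chat counts.
        rows += [{"donation_id": donation_id, "conversation_id": f"{donation_id}-{chat}"}] * 2
toy_messages = pd.DataFrame(rows)

# Number of distinct chats in each donation.
chats_per_donation = toy_messages.groupby("donation_id")["conversation_id"].nunique()

# between() is inclusive on both ends by default, so this keeps 5, 6 and 7.
keep = chats_per_donation.between(5, 7)
kept_ids = chats_per_donation.index[keep].tolist()
removed_ids = chats_per_donation.index[~keep].tolist()

print(kept_ids)
print(removed_ids)
```

Counting with `nunique()` rather than counting rows matters here, because each chat contributes many message rows.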

In [3]:
import pandas as pd

# Load messages and filter to relevant donations
messages_file = pd.read_csv(Path(f'{path_for_files}/messages_table_CHB.csv'))
messages_file = messages_file[messages_file['donation_id'].isin(relevant_donations['donation_id'])]

# Group messages by donation_id
grouped_messages = messages_file.groupby('donation_id')

# Track donation IDs to remove: those with fewer than 5 or more than 7 chats
removed_donations = set()

# Count distinct chats per donation, then flag donations outside the 5-7 range
chats_per_donation = grouped_messages['conversation_id'].nunique()
for donation_id in relevant_donations['donation_id']:
    # A donation without any messages counts as 0 chats instead of raising a KeyError
    n_chats = chats_per_donation.get(donation_id, 0)

    if not (5 <= n_chats <= 7):
        removed_donations.add(donation_id)

# Drop messages from donations whose number of donated chats is not 5, 6, or 7
messages_filtered = messages_file[~messages_file['donation_id'].isin(removed_donations)].reset_index(drop=True)

# Apply the same exclusion to the donation table
donations_filtered = relevant_donations[~relevant_donations['donation_id'].isin(removed_donations)].copy()

# Save output
messages_filtered.to_csv(Path(f'{path_for_files}/messages_table_CHB_filtered.csv'), index=False)
donations_filtered.to_csv(Path(f'{path_for_files}/donation_table_CHB_filtered.csv'), index=False)

65

#### Some sanity checks to make sure numbers add up after filtering

In [6]:
# Message-level summary
original_total = len(messages_file)
removed_total = messages_file[messages_file['donation_id'].isin(removed_donations)].shape[0]
final_total = len(messages_filtered)
expected_final = original_total - removed_total

# Donation-level summary
donation_original_total = len(relevant_donations)
donation_removed_total = len(removed_donations)
donation_final_total = len(donations_filtered)
donation_expected_final = donation_original_total - donation_removed_total

# Print message-level summary
print("Message Summary")
print(f"Original messages: {original_total}")
print(f"Removed messages:  {removed_total}")
print(f"Remaining messages: {final_total}")
print("✅ Message numbers add up!" if final_total == expected_final else "❌ Message count mismatch.")

# Print donation-level summary
print("Donation Summary")
print(f"Original donations: {donation_original_total}")
print(f"Removed donations:  {donation_removed_total}")
print(f"Remaining donations: {donation_final_total}")
print("✅ Donation numbers add up!" if donation_final_total == donation_expected_final else "❌ Donation count mismatch.")

Message Summary
Original messages: 3379218
Removed messages:  157765
Remaining messages: 3221453
✅ Message numbers add up!
Donation Summary
Original donations: 68
Removed donations:  3
Remaining donations: 65
✅ Donation numbers add up!
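
The count comparison above could also be wrapped in a small reusable check. The sketch below is one possible way to do that on toy data; `filtering_is_consistent` is a hypothetical helper that is not part of the original notebook, and it additionally verifies that no removed ID survives in the filtered table.

```python
import pandas as pd

def filtering_is_consistent(original, filtered, removed_ids, id_col="donation_id"):
    """Return True if the row counts add up and no removed ID remains in `filtered`."""
    removed_rows = original[id_col].isin(removed_ids).sum()
    counts_add_up = len(filtered) == len(original) - removed_rows
    no_removed_left = not filtered[id_col].isin(removed_ids).any()
    return bool(counts_add_up and no_removed_left)

# Invented toy table: six rows spread over three donations.
toy_original = pd.DataFrame({"donation_id": [1, 1, 2, 3, 3, 3]})
toy_removed = {2}
toy_filtered = toy_original[~toy_original["donation_id"].isin(toy_removed)]

print(filtering_is_consistent(toy_original, toy_filtered, toy_removed))
```

The same function could be called once with the messages tables and once with the donation tables, since both are filtered on `donation_id`.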
