## Using Python to troubleshoot altered datasets

This notebook uses Python and the `pandas` library to compare two versions of the Behavioral Risk Factor Surveillance System (BRFSS) Prevalence Data, 2010 and prior.

The first version was obtained from the Harvard Library Innovation Lab (HLIL) archive, and represents a snapshot of the data as of December 2024. 

The second version was obtained from the data.gov website and represents the data as updated in early 2025.

#### Links: 
 - HLIL [Archive of Data.gov](https://source.coop/harvard-lil/gov-data/web/data)
 - [BRFSS data](https://catalog.data.gov/dataset/behavioral-risk-factor-surveillance-system-brfss-prevalence-data-2010-and-prior) on Data.gov
 - [pandas](https://pandas.pydata.org/) Python library

In [3]:
import pandas as pd
import json

### Obtaining the data

In order to access the BRFSS Prevalence dataset in the HLIL archive, we need to download the archive's catalog and find the entry for this dataset. 

In [None]:
#Dataset catalog from https://source.coop/harvard-lil/gov-data/web/data
datasets = pd.read_parquet('~/Downloads/datasets.parquet', engine='fastparquet')

In [16]:
datasets.loc[datasets.title == 'Behavioral Risk Factor Surveillance System (BRFSS) Prevalence Data (2010 and prior)'].name.iloc[0]

'behavioral-risk-factor-surveillance-system-brfss-prevalence-data-2010-and-prior'

We can use the value in the catalog's `name` column to construct a URL to download the dataset from the Source Cooperative site. 
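As a minimal sketch, the URL can be assembled from the `name` value with an f-string. The base path is copied from the download link used in this notebook; the helper `archive_url` is hypothetical, and the assumption that other datasets in the archive follow the same pattern has not been verified.

```python
# Base path taken from the download link used in this notebook.
# Assumption: other datasets in the archive follow the same pattern.
BASE = 'https://source.coop/harvard-lil/gov-data/collections/data_gov'

def archive_url(name, version='v1'):
    """Build a download URL from a catalog `name` value (hypothetical helper)."""
    return f'{BASE}/{name}/{version}.zip'

name = 'behavioral-risk-factor-surveillance-system-brfss-prevalence-data-2010-and-prior'
url = archive_url(name)
print(url)
```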

**Note that the download is a very large `.zip` file containing the same dataset in multiple formats.**

After extracting the ZIP archive, we'll open the data in CSV format.

In [18]:
# Downloaded from https://source.coop/harvard-lil/gov-data/collections/data_gov/behavioral-risk-factor-surveillance-system-brfss-prevalence-data-2010-and-prior/v1.zip
v1 = pd.read_csv('~/Downloads/v1/data/files/rows.csv')

Downloading the data from data.gov is more straightforward: we can download the CSV directly.

In [None]:
# https://catalog.data.gov/dataset/behavioral-risk-factor-surveillance-system-brfss-prevalence-data-2010-and-prior
# Metadata: https://catalog.data.gov/harvest/object/4151bcb5-de5b-4097-b69d-33b25bda0fd5

In [19]:
v_latest = pd.read_csv('~/Downloads/Behavioral_Risk_Factor_Surveillance_System__BRFSS__Prevalence_Data__2010_and_prior_.csv')

### Comparing the datasets

To spot differences in column **names**, we can treat each array of column headers as a set and find the symmetric difference.

In [96]:
set(v_latest.columns) ^ set(v1.columns)

set()
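An empty set means the column names are identical. For contrast, here is a small sketch of what a non-empty result looks like, using two hypothetical toy frames (`old` and `new`) that are not part of the BRFSS data. Because sets ignore order, the sketch also compares the header lists directly to detect rearranged columns.

```python
import pandas as pd

# Hypothetical toy frames, for illustration only
old = pd.DataFrame({'Year': [2010], 'Gender': ['Female'], 'Value': [1.0]})
new = pd.DataFrame({'Year': [2010], 'Sex': ['Female'], 'Value': [1.0]})

# Names present in one frame but not the other
renamed = set(new.columns) ^ set(old.columns)
print(sorted(renamed))

# Sets ignore order, so compare the lists to catch reordered columns
same_order = list(old.columns) == list(new.columns)
print(same_order)
```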

A simple check shows that the two datasets have the same number of rows.

In [101]:
len(v1) == len(v_latest)

True

Since the column headers are unchanged and the row counts match, no rows appear to have been added or deleted, so we can look for changes in categorical values. 

The following is a subset of columns with categorical values where intentional manipulation seems most likely to have occurred.

In [20]:
categorical = ['Class', 'Topic', 'Question', 'Response', 'Break_Out_Category', 'Break_Out']

The following code computes the symmetric difference of the **unique values** in each of the categorical columns, printing the name of any column where the difference is non-empty, along with the values found in one dataset but not the other.

In [22]:
for cat in categorical:
    v_latest_vals = set(v_latest[cat].unique())
    v_1_vals = set(v1[cat].unique())
    non_overlap =  v_latest_vals ^ v_1_vals 
    if non_overlap != set():
        print(f'Column: {cat}')
        print(f'Latest version: {[v for v in v_latest_vals if v not in v_1_vals]}')
        print(f'Version 1: {[v for v in v_1_vals if v not in v_latest_vals]}')
        print('----------------')
        

Column: Question
Latest version: ['Sex of respondent']
Version 1: ['Gender of respondent']
----------------
Column: Break_Out_Category
Latest version: ['Sex']
Version 1: ['Gender']
----------------


We can check whether the changes were consistently applied by using the following method:
- group by values in a categorical column
- count the number of rows with each value

In [7]:
v1.groupby('Break_Out_Category').Year.count().sort_values()

Break_Out_Category
Overall                86487
Gender                160196
Education Attained    323122
Race/Ethnicity        364552
Household Income      399078
Age Group             439995
Name: Year, dtype: int64

In [8]:
v_latest.groupby('Break_Out_Category').Year.count().sort_values()

Break_Out_Category
Overall                86487
Sex                   160196
Education Attained    323122
Race/Ethnicity        364552
Household Income      399078
Age Group             439995
Name: Year, dtype: int64
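Comparing the two tables by eye works for six categories. As a sketch of how the same check could be automated, the code below maps the new label back to the old one and tests whether the per-category counts are identical. The frames `old` and `new` are hypothetical toy data, not the BRFSS files.

```python
import pandas as pd

# Hypothetical toy frames, for illustration only
old = pd.DataFrame({'Break_Out_Category': ['Gender', 'Gender', 'Overall', 'Age Group']})
new = pd.DataFrame({'Break_Out_Category': ['Sex', 'Sex', 'Overall', 'Age Group']})

old_counts = old['Break_Out_Category'].value_counts()
# Map the new label back to the old one before counting
new_counts = new['Break_Out_Category'].replace({'Sex': 'Gender'}).value_counts()

# Identical counts per category suggest a consistent one-for-one relabeling
consistent = old_counts.to_dict() == new_counts.to_dict()
print(consistent)
```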

#### Slightly more advanced comparison

Can we confirm that no other changes have been made to the data? If the alterations consisted in replacing instances of one categorical value with another, then the rest of the data should still align between the two versions. 

If the dataset included unique identifiers, we could check simply by matching rows between the two versions on their unique identifiers and confirming that -- excepting these categorical changes -- the values remain unchanged.

This dataset lacks unique identifiers. But we can create our own using a [hash function](https://en.wikipedia.org/wiki/Hash_function), giving each row its unique "thumbprint." If all the thumbprints match up across the two versions, then the data haven't changed. 
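To illustrate the idea on its own, here is a minimal sketch that hashes three hypothetical rows represented as plain Python lists. The helper name `thumbprint` is an assumption made for this example.

```python
import hashlib
import json

def thumbprint(values):
    """Hash a list of JSON-serializable values (hypothetical helper)."""
    text = json.dumps(values)
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

# Hypothetical rows, for illustration only
row_a = [2010, 'Overall', 12.5]
row_b = [2010, 'Overall', 12.5]
row_c = [2010, 'Overall', 12.6]

print(thumbprint(row_a) == thumbprint(row_b))  # identical rows, identical hash
print(thumbprint(row_a) == thumbprint(row_c))  # one changed value, different hash
```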

In [23]:
import hashlib

In [24]:
# This boilerplate lets json.dumps serialize the NumPy types (integers, floats, arrays) found in pandas rows
import numpy as np

class NpEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super(NpEncoder, self).default(obj)
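A short sketch of why the encoder is needed: the standard `json.dumps` raises a `TypeError` on NumPy integer types, while passing the encoder via `cls` converts them first. The class is repeated here only so the example runs on its own, and the sample values are hypothetical.

```python
import json
import numpy as np

class NpEncoder(json.JSONEncoder):
    """Same logic as the encoder defined in this notebook."""
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super().default(obj)

# Hypothetical values of the kind a pandas row might hold
values = [np.int64(2010), np.float32(0.5), 'Overall']

try:
    json.dumps(values)
    plain_ok = True
except TypeError:
    # np.int64 is not a Python int, so the default encoder rejects it
    plain_ok = False

encoded = json.dumps(values, cls=NpEncoder)
print(plain_ok, encoded)
```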

**How the algorithm works**

For each row in the data:

- Sort the values in a fixed order, putting the values from the columns where alterations occur at the end.
- If the row contains altered categorical values, revert them (only for the purposes of creating the hash) to the prior values.
- Otherwise, keep the categorical values as is.
- Convert the row to a string (using the `json.dumps` function).
- Create a hash of the string (using the `sha256` function from `hashlib`; two different inputs could collide in principle, but this is astronomically unlikely).
- Assign the hash digest (a hexadecimal number) to a new column in the dataset.

In [108]:
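columns = [c for c in v1.columns if c not in ['Break_Out_Category', 'Question']]
def hashing_fn(row):
    # Revert altered values to their prior form so matching rows hash identically
    tail_values = []
    if row.Break_Out_Category == 'Sex':
        tail_values.append('Gender')
    else:
        tail_values.append(row.Break_Out_Category)
    if row.Question == 'Sex of respondent':
        tail_values.append('Gender of respondent')
    else:
        tail_values.append(row.Question)
    values = [row[c] for c in columns] + tail_values
    return hashlib.sha256(json.dumps(values, cls=NpEncoder).encode('utf-8')).hexdigest()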

Now we apply the algorithm to each version of the dataset.

In [109]:
v1['id'] = v1.apply(hashing_fn, axis=1)

In [110]:
# Sanity check: are the identifiers unique? (They should be, provided the data contain no duplicate rows)
len(v1.id.unique()) == len(v1)
    

True

In [111]:
v_latest['id'] = v_latest.apply(hashing_fn, axis=1)

In [112]:
len(v_latest.id.unique()) == len(v_latest)

True

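If every ID in the original dataset appears in the altered version, then every original row has a matching hash. Because both versions have the same number of rows and the IDs are unique within each version, the match is one-to-one, and no changes exist that we haven't accounted for.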

In [114]:
len(v1.loc[v1.id.isin(v_latest.id)]) == len(v1)

True
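If the check had returned `False`, the next step would be to find the unmatched rows. One possible approach, sketched here on hypothetical toy frames, is an outer merge with `indicator=True`, which labels each ID as present in both frames or in only one.

```python
import pandas as pd

# Hypothetical toy frames holding only an id column
old = pd.DataFrame({'id': ['a1', 'b2', 'c3']})
new = pd.DataFrame({'id': ['a1', 'b2', 'd4']})

# The _merge column records where each id was found
merged = old.merge(new, on='id', how='outer', indicator=True)
only_old = merged.loc[merged['_merge'] == 'left_only', 'id'].tolist()
only_new = merged.loc[merged['_merge'] == 'right_only', 'id'].tolist()
print(only_old, only_new)
```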

The rest of the code is irrelevant to our comparison: it just creates a random sample of each version of the data (for demo purposes) using the same rows (matched by hash identifier) and with roughly the same proportion of altered rows as in the original data.

In [116]:
gender_set = v1.loc[(v1.Break_Out_Category == 'Gender') | (v1.Question == 'Gender of respondent')]

In [117]:
len(gender_set) / len(v1)

0.11003704685270915

In [118]:
sample_ids = v1.loc[~v1.id.isin(gender_set.id)].id.sample(n=1000)

In [119]:
sample_gender_ids = gender_set.id.sample(n=110)
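Note that `sample` draws different rows on every run, so the exported files will differ each time the notebook is executed. A minimal sketch of making the draw repeatable with the `random_state` parameter follows; the seed value and the toy Series are assumptions for illustration.

```python
import pandas as pd

# Hypothetical Series standing in for a column of hash identifiers
ids = pd.Series([f'id{i}' for i in range(100)])

# The same seed yields the same sample on every run
first = ids.sample(n=10, random_state=42)
second = ids.sample(n=10, random_state=42)
print(first.equals(second))
```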

In [124]:
v1_sample = v1.loc[v1.id.isin(list(sample_ids.values) + list(sample_gender_ids.values))]

In [126]:
len(v1_sample)

1110

In [127]:
v_latest_sample = v_latest.loc[v_latest.id.isin(v1_sample.id)]

In [128]:
len(v_latest_sample)

1110

In [130]:
v1_sample.sort_values(by='id').drop('id', axis=1).to_csv('~/Downloads/brfss_prevalence_data_2010_and_prior_v2024.csv', index=False)
v_latest_sample.sort_values(by='id').drop('id', axis=1).to_csv('~/Downloads/brfss_prevalence_data_2010_and_prior_v2025.csv', index=False)