# Predict the number of members missing from the dataset

A previous estimate suggests that approximately 88 percent of the logbook subscription events survive. Since every member would have been listed in the logbooks, can we estimate what proportion of the membership we have? The approach: take the number of members found per surviving logbook event, scale it up to the estimated total number of logbook events, and compare the result with the number of members in the dataset.

**Givens**
* `surviving_members_count`
* `surviving_logbook_event_count`
* `surviving_members_from_logbooks_count`
* `total_logbook_event_count`

**Assuming**
* `total_logbook_member_count == total_member_count`

**Find**
* `percent_surviving_members`
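
In symbols, the calculation can be sketched as follows. The symbols are shorthand introduced here, not names from the dataset: $p$ is the estimated proportion of logbook events that survive, $E$ is `surviving_logbook_event_count`, $M_L$ is `surviving_members_from_logbooks_count`, $M$ is `surviving_members_count`, and $P$ is `percent_surviving_members`.

$$
\hat{E} = \frac{E}{p}, \qquad
\hat{M} = \hat{E} \cdot \frac{M_L}{E} = \frac{M_L}{p}, \qquad
P = 100 \cdot \frac{M}{\hat{M}} = 100 \cdot \frac{p \, M}{M_L}
$$

Under the stated assumption, and ignoring integer truncation of $\hat{E}$, the event count $E$ cancels, so the estimate depends only on $p$, $M_L$, and $M$.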

In [3]:
import pandas as pd

In [4]:
csv_urls = {
    'members': 'https://dataspace.princeton.edu/bitstream/88435/dsp01b5644v608/2/SCoData_members_v1.1_2021-01.csv',
    'books': 'https://dataspace.princeton.edu/bitstream/88435/dsp016d570067j/2/SCoData_books_v1.1_2021-01.csv',
    'events': 'https://dataspace.princeton.edu/bitstream/88435/dsp012n49t475g/2/SCoData_events_v1.1_2021-01.csv',
}

# load members, books, events as csv
members_df = pd.read_csv(csv_urls['members'])
books_df = pd.read_csv(csv_urls['books'])
events_df = pd.read_csv(csv_urls['events'])

In [5]:
# Estimated share of logbook subscription events that survive (about 88 percent)
LOGBOOK_PROPORTION_ESTIMATE = 0.8826

In [6]:
# Givens by query
surviving_members_count = members_df.shape[0]
# Note: this counts every row in the events dataset, not only logbook-sourced events.
# The count cancels out of the final estimate (apart from the int() truncation below).
surviving_logbook_event_count = events_df.shape[0]
logbook_events_df = events_df[events_df['source_type'].str.contains('Logbook')]
surviving_members_from_logbooks_count = len({
    uri
    for member_list in logbook_events_df['member_uris'].str.split(';')
    for uri in member_list
})

# Givens by previous estimate
total_logbook_event_count = int(surviving_logbook_event_count / LOGBOOK_PROPORTION_ESTIMATE)

# Calculations
n_members_per_logbook_event = surviving_members_from_logbooks_count / surviving_logbook_event_count
total_logbook_member_count  = total_logbook_event_count * n_members_per_logbook_event
total_member_count = total_logbook_member_count  # Key assumption by definition
percent_surviving_members = round(surviving_members_count / total_member_count * 100, 2)

# Results
print(f'Surviving number of logbook events: {surviving_logbook_event_count}')
print(f'Total logbook event count estimate: {total_logbook_event_count}')
print(f'Total number of surviving members: {surviving_members_count}')
print(f'Number of members sourced from the logbooks: {surviving_members_from_logbooks_count}')
print(f'Number of unique members per logbook event: {n_members_per_logbook_event}')
print(f'Estimate of the total member count: {int(total_member_count)}')
print(f'Estimate of the percent of total members that survived: {percent_surviving_members}')

Surviving number of logbook events: 35031
Total logbook event count estimate: 39690
Total number of surviving members: 5601
Number of members sourced from the logbooks: 5016
Number of unique members per logbook event: 0.1431874625331849
Estimate of the total member count: 5683
Estimate of the percent of total members that survived: 98.56
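
As a quick check of the arithmetic, the sketch below recomputes the estimate from the counts printed above without reloading the CSVs. The function name `estimate_percent_surviving` and its parameter names are illustrative and not part of the notebook. It also compares the result with the shortcut in which the event count drops out.

```python
def estimate_percent_surviving(surviving_members, members_from_logbooks,
                               surviving_events, logbook_proportion):
    """Hypothetical helper mirroring the notebook's calculation step by step."""
    total_events = int(surviving_events / logbook_proportion)
    members_per_event = members_from_logbooks / surviving_events
    total_members = total_events * members_per_event
    return total_members, surviving_members / total_members * 100


# Counts copied from the printed output above.
total_members, percent = estimate_percent_surviving(
    surviving_members=5601,
    members_from_logbooks=5016,
    surviving_events=35031,
    logbook_proportion=0.8826,
)

# Shortcut: the event count cancels, leaving members_from_logbooks / proportion.
shortcut_total = 5016 / 0.8826

print(int(total_members), round(percent, 2))
print(int(shortcut_total), round(5601 / shortcut_total * 100, 2))
```

The two totals differ only by a fraction of a member, which comes from truncating the estimated event count to an integer.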


## Shared Account Info

In [7]:
# How many shared accounts are there?
shared_accounts = events_df[events_df['member_uris'].str.contains(';')]['member_uris'].unique()
shared_account_members = {uri for account in shared_accounts for uri in account.split(';')}
print(f'Shared accounts: {len(shared_accounts)}')
print(f'Across {len(shared_account_members)} members')
# Double check that all accounts were shared between two people
assert all([len(account.split(';')) == 2 for account in shared_accounts])

Shared accounts: 29
Across 58 members
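
The shared-account count relies on `member_uris` holding several URIs joined by `;`. A minimal sketch on made-up values (the URIs below are placeholders, not real records) walks through the same split-and-collect steps in plain Python, so it runs without the dataset:

```python
# Placeholder member_uris values; real rows hold full member URIs.
member_uris = [
    'members/a',
    'members/b;members/c',
    'members/b;members/c',  # the same shared account appearing in a second event
    'members/d;members/e',
]

# Unique multi-member strings stand in for shared accounts.
shared_accounts = sorted({uris for uris in member_uris if ';' in uris})
# Flatten each account into the individual members who hold it.
shared_account_members = {uri for account in shared_accounts for uri in account.split(';')}

print(f'Shared accounts: {len(shared_accounts)}')
print(f'Across {len(shared_account_members)} members')
```

One caveat of matching on the raw string: the same pair written in two orders (`a;b` and `b;a`) would be counted as two distinct accounts.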


## Address Book Comparison

In [33]:
def get_members_from_events(df):
    """Return the set of unique member URIs across all events in df."""
    return {uri for member_list in df['member_uris'].str.split(';') for uri in member_list}

events_df['start_date_dt'] = pd.to_datetime(events_df['start_date'], errors='coerce')
events_df['end_date_dt'] = pd.to_datetime(events_df['end_date'], errors='coerce')


# How many people were added from 1919–1935?
events_df_1919_1935 = events_df[(events_df['start_date_dt'] >= pd.to_datetime('1919-01-01')) & (events_df['start_date_dt'] <= pd.to_datetime('1935-12-31'))]
address_book_1919_1935_members = get_members_from_events(events_df_1919_1935[events_df_1919_1935['source_type'] == 'Address Book'])
percent_added_from_address_books = round(len(address_book_1919_1935_members) / len(get_members_from_events(events_df_1919_1935)) * 100, 2)
print(f'From 1919 to 1935, {percent_added_from_address_books} percent of members came exclusively from address books')

# How many people were added from 1935–1937?
events_df_1935_1937 = events_df[(events_df['start_date_dt'] >= pd.to_datetime('1935-01-01')) & (events_df['start_date_dt'] <= pd.to_datetime('1937-12-31'))]
address_book_1935_1937_members = get_members_from_events(events_df_1935_1937[events_df_1935_1937['source_type'] == 'Address Book'])
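# Note: the denominator is the 1919–1935 member set, so this is a share of that earlier membership rather than of the 1935–1937 members.
percent_added_from_address_books = round(len(address_book_1935_1937_members) / len(get_members_from_events(events_df_1919_1935)) * 100, 2)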
print(f'From 1935 to 1937, {percent_added_from_address_books} percent of members came exclusively from address books')

From 1919 to 1935, 5.78 percent of members came exclusively from address books
From 1935 to 1937, 2.13 percent of members came exclusively from address books
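
The cell above counts every member who appears in at least one address-book event in the period. If "exclusively" is read strictly, members who also appear in other sources would need to be removed first. A minimal sketch of that set difference on made-up records follows; the member names are placeholders and the source labels other than `'Address Book'` are illustrative. The same set logic could be applied to the output of `get_members_from_events`.

```python
# (member, source_type) pairs standing in for rows of events_df.
events = [
    ('members/a', 'Address Book'),
    ('members/a', 'Logbook'),
    ('members/b', 'Address Book'),
    ('members/c', 'Logbook'),
    ('members/d', 'Lending Library Card'),
]

all_members = {member for member, _ in events}
address_book_members = {m for m, source in events if source == 'Address Book'}
other_source_members = {m for m, source in events if source != 'Address Book'}

# Members attested only by address books.
address_book_only = address_book_members - other_source_members

percent_any = round(len(address_book_members) / len(all_members) * 100, 2)
percent_only = round(len(address_book_only) / len(all_members) * 100, 2)
print(f'{percent_any} percent appear in address books')
print(f'{percent_only} percent appear only in address books')
```

In this toy data, member `a` appears in both an address book and a logbook, so it counts toward the first figure but not the second.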
