# Defining communities by spatial proximity
Natalia Vélez, March 2021

So far, we've defined communities as family lineages. However, sharing a lineage does not guarantee that two individuals belong to the same community in any meaningful sense. Intuitively, a community is a group of people who live and work together. We have therefore developed two additional, complementary ways to define communities: spatial proximity and networks of interactions.

In this notebook, we define communities by spatial proximity.

In [1]:
%matplotlib inline

import os, glob, re, datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from os.path import join as opj

sns.set_context('talk')
sns.set_style('white')

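# Helpers: glob over a joined path; extract the first regex match as a string or an int
def gsearch(*args): return glob.glob(opj(*args))
def str_extract(pattern, s): return re.search(pattern, s).group(0)  # raises AttributeError if there is no match
def int_extract(pattern, s): return int(str_extract(pattern, s))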

## Prepare data

Seed files:

In [6]:
seed_times

[1573895673,
 1574102503,
 1576038671,
 1578345720,
 1578354747,
 1579713519,
 1580144896,
 1581985139,
 1583642903,
 1584061484,
 1585440511,
 1585512770,
 1585603481,
 1587166656,
 1603997647,
 1608411674]

In [2]:
data_dir = '../../data'
seed_files = gsearch(data_dir, 'publicMapChangeData', 'bigserver2.onehouronelife.com', '*time_mapSeed.txt')
seed_files.sort()

seed_times = [int_extract(r'([0-9]+)(?=time)', f) for f in seed_files]

print('%i seed changes' % len(seed_files))
print(*[datetime.datetime.fromtimestamp(t) for t in seed_times], sep='\n')

16 seed changes
2019-11-16 09:14:33
2019-11-18 18:41:43
2019-12-11 04:31:11
2020-01-06 21:22:00
2020-01-06 23:52:27
2020-01-22 17:18:39
2020-01-27 17:08:16
2020-02-18 00:18:59
2020-03-08 04:48:23
2020-03-13 01:04:44
2020-03-29 00:08:31
2020-03-29 20:12:50
2020-03-30 21:24:41
2020-04-17 23:37:36
2020-10-29 18:54:07
2020-12-19 21:01:14
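
The timestamps above are Unix epoch seconds parsed from the seed file names. As a minimal sketch, assuming a file name of the form `<timestamp>time_mapSeed.txt` (inferred from the glob pattern and the regular expression; the exact path below is hypothetical), the extraction and an explicit UTC conversion might look like this. `datetime.fromtimestamp` without a `tz` argument uses the machine's local time zone, so passing `timezone.utc` keeps the printed dates the same across machines.

```python
import re
from datetime import datetime, timezone

# Hypothetical path; only the '<digits>time_mapSeed.txt' shape is assumed
example_file = ('../../data/publicMapChangeData/'
                'bigserver2.onehouronelife.com/1573895673time_mapSeed.txt')

# Digits immediately followed by 'time' (the lookahead keeps 'time' out of the match)
seed_t = int(re.search(r'([0-9]+)(?=time)', example_file).group(0))

# Convert with an explicit time zone rather than the local default
seed_dt = datetime.fromtimestamp(seed_t, tz=timezone.utc)
print(seed_t, seed_dt.strftime('%Y-%m-%d %H:%M:%S'))
```

For this timestamp, the UTC conversion gives 2019-11-16 09:14:33, which matches the first date in the output above.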


Find the map seed that was in effect when a given family was founded:

In [20]:
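fam_name = 'time-1609746195_eve-3833245_name-STALLINS'
# The family's birth time (Unix seconds) is embedded in its name after 'time-'
birth_t = int_extract(r'(?<=time-)([0-9]+)', fam_name)
# Lag between the family's birth and each seed change; keep seeds at or before birth
tmp_ver = pd.DataFrame({'seed': seed_times})
tmp_ver['lag'] = birth_t - tmp_ver['seed']
tmp_ver = tmp_ver[tmp_ver['lag'] >= 0]
tmp_ver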

| | seed | lag |
|---|---|---|
| 0 | 1573895673 | 35850522 |
| 1 | 1574102503 | 35643692 |
| 2 | 1576038671 | 33707524 |
| 3 | 1578345720 | 31400475 |
| 4 | 1578354747 | 31391448 |
| 5 | 1579713519 | 30032676 |
| 6 | 1580144896 | 29601299 |
| 7 | 1581985139 | 27761056 |
| 8 | 1583642903 | 26103292 |
| 9 | 1584061484 | 25684711 |


In [21]:
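def find_seed(fam_name):
    # Family birth time (Unix seconds), parsed from the family name
    birth_t = int_extract(r'(?<=time-)([0-9]+)', fam_name)

    # Keep only seed changes that happened at or before the family's birth
    tmp_ver = pd.DataFrame({'seed': seed_times})
    tmp_ver['lag'] = birth_t - tmp_ver['seed']
    tmp_ver = tmp_ver[tmp_ver['lag'] >= 0]

    # The smallest non-negative lag marks the most recent seed change.
    # NOTE: if the family predates every seed, tmp_ver is empty, idxmin raises
    # ValueError, and file_ver is never assigned, so the return below fails.
    try:
        file_ver = tmp_ver['lag'].idxmin()
    except ValueError:
        print(fam_name)

    return file_ver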

Load families:

In [23]:
seed_times[0]

1573895673

In [22]:
family_df = pd.read_csv('outputs/family_generations.tsv', sep='\t')
family_df['seed'] = family_df.family.apply(find_seed)
print(family_df.shape)
family_df.head()

time-1573313384_eve-2254999_name-SUGAR


UnboundLocalError: local variable 'file_ver' referenced before assignment
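
The traceback follows from the family printed just before it: SUGAR was founded at 1573313384, before the first seed change (1573895673). No seed has a non-negative lag, so `idxmin` raises `ValueError` on the empty selection and `file_ver` is never assigned before the `return`. One possible fix is sketched below, under the assumption that such families should receive a missing value rather than a seed index. The name `find_seed_safe` is hypothetical. It uses `np.searchsorted` instead of a temporary DataFrame, which assumes the seed times are sorted in ascending order, as they are in the listing above.

```python
import re
import numpy as np

# Seed change times copied from the listing above (ascending order)
seed_times = [1573895673, 1574102503, 1576038671, 1578345720,
              1578354747, 1579713519, 1580144896, 1581985139,
              1583642903, 1584061484, 1585440511, 1585512770,
              1585603481, 1587166656, 1603997647, 1608411674]

def find_seed_safe(fam_name, seeds=seed_times):
    """Hypothetical variant of find_seed.

    Returns the index of the latest seed change at or before the family's
    birth, or NaN if the name has no timestamp or predates every seed.
    """
    match = re.search(r'(?<=time-)([0-9]+)', fam_name)
    if match is None:
        return np.nan
    birth_t = int(match.group(0))

    # side='right' counts the seeds with time <= birth_t
    n_before = int(np.searchsorted(seeds, birth_t, side='right'))
    return n_before - 1 if n_before > 0 else np.nan

print(find_seed_safe('time-1573313384_eve-2254999_name-SUGAR'))
print(find_seed_safe('time-1609746195_eve-3833245_name-STALLINS'))
```

Returning `NaN` keeps these early families in the table, so they can be inspected or dropped explicitly later; whether that is the right treatment depends on how the seed index is used downstream.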

Mark each family by epoch:

In [10]:
tmp_ver = pd.DataFrame({'seed': seed_times})
tmp_ver

| | seed |
|---|---|
| 0 | 1573895673 |
| 1 | 1574102503 |
| 2 | 1576038671 |
| 3 | 1578345720 |
| 4 | 1578354747 |
| 5 | 1579713519 |
| 6 | 1580144896 |
| 7 | 1581985139 |
| 8 | 1583642903 |
| 9 | 1584061484 |


Load lifelogs:

In [15]:
lifelog_df = pd.read_csv('outputs/all_lifelogs_compact.tsv', sep='\t', index_col=0)
print('Original size: %i' % lifelog_df.shape[0])
lifelog_df = pd.merge(lifelog_df, family_df, on = 'avatar')
print('Size after merge: %i' % lifelog_df.shape[0])
lifelog_df['seed'] = lifelog_df.tBirth.apply(find_seed)
lifelog_df.head()


Original size: 2161128
Size after merge: 725825


KeyboardInterrupt: 
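
The cell above was interrupted before it finished. One plausible contributor is that `find_seed` builds a new DataFrame for each of the 725,825 rows; it also parses a family name, whereas `tBirth` presumably holds a birth time. A vectorized alternative is sketched here on toy data. It assumes `tBirth` is a numeric Unix timestamp in seconds, which is an assumption because the column's contents are not shown. The function name `assign_seed` and the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

# First three seed times from the listing above (ascending order required)
seed_times = [1573895673, 1574102503, 1576038671]

def assign_seed(birth_times, seeds):
    """Index of the latest seed at or before each birth time; NaN if none."""
    seeds = np.asarray(seeds)
    # side='right' counts seeds <= birth time, so subtracting 1 gives the index
    idx = np.searchsorted(seeds, np.asarray(birth_times), side='right') - 1
    idx = idx.astype(float)
    idx[idx < 0] = np.nan  # born before the first seed change
    return idx

# Toy stand-in for lifelog_df; real tBirth values are assumed to be numeric
toy_df = pd.DataFrame({'tBirth': [1573313384, 1573895673, 1575000000, 1609746195]})
toy_df['seed'] = assign_seed(toy_df['tBirth'], seed_times)
print(toy_df)
```

Because the lookup is a single binary search over a sorted array, the cost grows with the number of rows times the logarithm of the number of seeds, rather than with one DataFrame construction per row.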