# Brightkite Location-Based Social Network Dataset

This notebook analyzes the Brightkite dataset from SNAP:
- **Network**: 58,228 users and 214,078 friendships (the edge file lists each friendship in both directions, so it loads as 428,156 rows)
- **Check-ins**: about 4.49M location check-ins from April 2008 to October 2010

Source: https://snap.stanford.edu/data/loc-brightkite.html

In [None]:
import pandas as pd
import requests
import gzip
from io import BytesIO, StringIO
import graphistry

# To specify Graphistry account & server, use:
# graphistry.register(api=3, protocol="https", server="hub.graphistry.com",
#                     username="...", password="...")

## Download and Parse Friendship Network

In [50]:
# Download friendship network
edges_url = 'https://snap.stanford.edu/data/loc-brightkite_edges.txt.gz'
print('Downloading friendship network...')
edges_response = requests.get(edges_url)
edges_response.raise_for_status()  # Fail fast on HTTP errors instead of parsing an error page

# Decompress and parse
with gzip.GzipFile(fileobj=BytesIO(edges_response.content)) as f:
    edges_content = f.read().decode('utf-8')

# Parse into DataFrame
edges_df = pd.read_csv(
    StringIO(edges_content),
    sep='\t',
    comment='#',
    names=['user1', 'user2'],
    dtype={'user1': int, 'user2': int}
)

print(f'Loaded {len(edges_df):,} edges')
edges_df.head()

Downloading friendship network...
Loaded 428,156 edges


| | user1 | user2 |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 2 |
| 2 | 0 | 3 |
| 3 | 0 | 4 |
| 4 | 0 | 5 |
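The 428,156 rows loaded above are exactly twice the 214,078 friendships quoted in the introduction, which is consistent with each friendship appearing once per direction. A minimal sketch of how one might check this, using toy data and a hypothetical helper `count_undirected` (not part of the original notebook):

```python
import pandas as pd

# Toy directed edge list: each friendship appears once per direction
toy_edges = pd.DataFrame({
    'user1': [0, 1, 0, 2, 1, 2],
    'user2': [1, 0, 2, 0, 2, 1],
})

def count_undirected(edges: pd.DataFrame) -> int:
    """Count unique unordered (user1, user2) pairs, ignoring self-loops."""
    lo = edges[['user1', 'user2']].min(axis=1)
    hi = edges[['user1', 'user2']].max(axis=1)
    pairs = pd.DataFrame({'lo': lo, 'hi': hi})
    pairs = pairs[pairs['lo'] != pairs['hi']]
    return len(pairs.drop_duplicates())

n_undirected = count_undirected(toy_edges)
print(f'{len(toy_edges)} directed rows -> {n_undirected} undirected friendships')
```

Running the same helper on `edges_df` should report how many distinct friendships the file actually contains.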


## Download and Parse Check-in Data

In [51]:
# Download check-in data
checkins_url = 'https://snap.stanford.edu/data/loc-brightkite_totalCheckins.txt.gz'
print('Downloading check-in data...')
checkins_response = requests.get(checkins_url)
checkins_response.raise_for_status()  # Fail fast on HTTP errors instead of parsing an error page

# Decompress and parse
with gzip.GzipFile(fileobj=BytesIO(checkins_response.content)) as f:
    checkins_content = f.read().decode('utf-8')

# Parse into DataFrame
checkins_df = pd.read_csv(
    StringIO(checkins_content),
    sep='\t',
    comment='#',
    names=['user', 'check_in_time', 'latitude', 'longitude', 'location_id'],
    dtype={'user': int},
    parse_dates=['check_in_time']
)

# Filter out likely invalid coordinates: (0, 0) or missing values
checkins_df = checkins_df[
    checkins_df['latitude'].notna() & 
    checkins_df['longitude'].notna() & 
    ((checkins_df['latitude'] != 0) | (checkins_df['longitude'] != 0))
]

print(f'Loaded {len(checkins_df):,} check-ins')
checkins_df.head()

Downloading check-in data...
Loaded 4,491,144 check-ins


| | user | check_in_time | latitude | longitude | location_id |
|---|---|---|---|---|---|
| 0 | 0 | 2010-10-17 01:48:53+00:00 | 39.747652 | -104.99251 | 88c46bf20db295831bd2d1718ad7e6f5 |
| 1 | 0 | 2010-10-16 06:02:04+00:00 | 39.891383 | -105.070814 | 7a0f88982aa015062b95e3b4843f9ca2 |
| 2 | 0 | 2010-10-16 03:48:54+00:00 | 39.891077 | -105.068532 | dd7cd3d264c2d063832db506fba8bf79 |
| 3 | 0 | 2010-10-14 18:25:51+00:00 | 39.750469 | -104.999073 | 9848afcc62e500a01cf6fbf24b797732f8963683 |
| 4 | 0 | 2010-10-14 00:21:47+00:00 | 39.752713 | -104.996337 | 2ef143e12038c870038df53e0478cefc |
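The parsing and filtering steps above can be tried offline on a few rows. The sketch below uses made-up toy rows in the same tab-separated layout and passes the gzip stream straight to `pd.read_csv` instead of first decoding it into one large string. That shortcut is an alternative to, not a copy of, the cell above. Note that the filter drops only rows where both coordinates are zero, so a point on the prime meridian (longitude 0) is kept.

```python
import gzip
from io import BytesIO

import pandas as pd

# Toy stand-in for the downloaded bytes (made-up rows, same column layout)
toy_text = (
    "0\t2010-10-17T01:48:53Z\t39.747652\t-104.99251\tloc_a\n"
    "0\t2010-10-16T06:02:04Z\t0.0\t0.0\tloc_b\n"   # (0, 0): likely invalid
    "1\t2010-10-15T12:00:00Z\t\t\tloc_c\n"         # missing coordinates
    "1\t2010-10-14T09:30:00Z\t51.5\t0.0\tloc_d\n"  # valid: longitude 0 only
)
toy_bytes = gzip.compress(toy_text.encode('utf-8'))

# Let pandas read from the decompressing stream directly
with gzip.GzipFile(fileobj=BytesIO(toy_bytes)) as f:
    toy_checkins = pd.read_csv(
        f,
        sep='\t',
        names=['user', 'check_in_time', 'latitude', 'longitude', 'location_id'],
        dtype={'user': int},
        parse_dates=['check_in_time'],
    )

# Same validity rule as the notebook: drop missing values and exact (0, 0)
valid = (
    toy_checkins['latitude'].notna()
    & toy_checkins['longitude'].notna()
    & ((toy_checkins['latitude'] != 0) | (toy_checkins['longitude'] != 0))
)
toy_valid = toy_checkins[valid]
print(toy_valid[['user', 'latitude', 'longitude', 'location_id']])
```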


In [52]:
# Filter edges to only include users with valid check-ins
valid_users = set(checkins_df['user'].unique())
edges_df_filtered = edges_df[
    edges_df['user1'].isin(valid_users) & 
    edges_df['user2'].isin(valid_users)
]

print(f'Filtered edges: {len(edges_df):,} -> {len(edges_df_filtered):,}')
print(f'Users in network: {pd.concat([edges_df["user1"], edges_df["user2"]]).nunique():,}')
print(f'Users with valid check-ins: {len(valid_users):,}')
print(f'Users in filtered network: {pd.concat([edges_df_filtered["user1"], edges_df_filtered["user2"]]).nunique():,}')

Filtered edges: 428,156 -> 388,180
Users in network: 58,228
Users with valid check-ins: 50,686
Users in filtered network: 50,111


## Visualize Friendship Network with Graphistry

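This visualization shows the social network of Brightkite users. Each node represents a user, positioned at the first check-in listed for them in the file. In the sample output above the rows run newest first, so this is likely the user's most recent check-in rather than their earliest. Edges represent friendships between users.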

**What to explore:**
- Community clusters: Groups of highly connected friends
- Geographic patterns: Whether friend groups cluster geographically
- Network hubs: Users with many connections (high degree)
- Network structure: Identify isolated groups vs. the main component
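Hubs and isolated groups can also be checked numerically before opening the visualization. The following is a minimal sketch on a toy edge list, assuming (as in this dataset) that every friendship is listed in both directions. The helper `component_labels` is hypothetical and uses a simple union-find; it is not part of the notebook or of Graphistry.

```python
import pandas as pd

# Toy edge list with both directions present, as in the Brightkite file
toy_edges = pd.DataFrame({
    'user1': [0, 1, 0, 2, 0, 3, 4, 5],
    'user2': [1, 0, 2, 0, 3, 0, 5, 4],
})

# With both directions listed, rows per 'user1' equal that user's degree
degree = toy_edges.groupby('user1').size().rename('degree')
hubs = degree.sort_values(ascending=False).head(3)

def component_labels(edges: pd.DataFrame) -> pd.Series:
    """Label each user with a representative of its connected component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in zip(edges['user1'], edges['user2']):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return pd.Series({u: find(u) for u in list(parent)})

labels = component_labels(toy_edges)
component_sizes = labels.value_counts()
print(hubs)
print(component_sizes)
```

On the real data, `component_labels(edges_df_filtered)` would give the size of the main component versus the smaller isolated groups, at the cost of a Python-level loop over all edges.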

In [53]:
# Visualize friendship network (filtered to users with valid check-ins)
# Position each user at the first check-in row listed for them in the file
first_checkins = checkins_df.groupby('user').first().reset_index()

g = graphistry.edges(edges_df_filtered, 'user1', 'user2').nodes(first_checkins, 'user') \
    .layout_settings(play=0) \
    .settings(height=800, url_params={"pointOpacity": 0.6, "edgeOpacity": 0.01})
g.plot()
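A note on the node positions: `groupby('user').first()` returns the first non-null value per column in row order, not the chronologically earliest check-in. The toy example below (made-up rows, hypothetical variable names) contrasts that with an explicit earliest-by-time selection, in case the earliest check-in is what is wanted.

```python
import pandas as pd

# Toy check-ins listed newest first, like the sample output above
toy = pd.DataFrame({
    'user': [0, 0, 1],
    'check_in_time': pd.to_datetime(
        ['2010-10-17 01:48:53', '2010-10-16 06:02:04', '2009-05-01 12:00:00'],
        utc=True,
    ),
    'latitude': [39.747652, 39.891383, 51.5],
    'longitude': [-104.99251, -105.070814, -0.12],
})

# What the notebook does: first listed row per user (here, the newest)
first_listed = toy.groupby('user').first()

# Alternative: chronologically earliest check-in per user
earliest = (
    toy.sort_values('check_in_time')
       .drop_duplicates('user', keep='first')
       .set_index('user')
       .sort_index()
)

print(first_listed[['check_in_time', 'latitude']])
print(earliest[['check_in_time', 'latitude']])
```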

## Create Hypergraph: Users + Check-ins

This hypergraph combines two types of nodes: **user nodes** (blue, placed at the mean latitude/longitude of the user's sampled check-ins) and **check-in nodes** (red, at actual check-in locations). Two types of edges connect them: **friendships** (blue) between users, and **user-to-check-in** edges (red) linking users to their check-ins.

**What to explore:**
- Mobility patterns: Check-in scatter around user's average location reveals travel behavior
- Social-spatial correlation: Do friends visit similar locations?
- Activity levels: Number of red edges from a user shows check-in frequency
- Geographic hotspots: Dense red node clusters indicate popular locations
- User movement range: Distance between user node and their check-ins shows mobility
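The "movement range" idea can be made concrete with a great-circle distance between each check-in and the user's mean position. The sketch below uses the standard haversine formula with made-up coordinates. The helper name `haversine_km` and the column `km_from_mean` are illustrative assumptions, and averaging raw latitude/longitude is only a rough centroid (it misbehaves for users whose check-ins straddle the antimeridian).

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(v) for v in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Toy check-ins (made-up coordinates): user 0 stays local, user 1 travels
toy = pd.DataFrame({
    'user': [0, 0, 1, 1],
    'latitude': [39.75, 39.89, 39.75, 51.51],
    'longitude': [-104.99, -105.07, -104.99, -0.13],
})

# Mean position per user, broadcast back to each check-in row
centroid = toy.groupby('user')[['latitude', 'longitude']].transform('mean')
toy['km_from_mean'] = haversine_km(
    toy['latitude'], toy['longitude'],
    centroid['latitude'], centroid['longitude'],
)

# Average displacement per user as a rough mobility score
mobility = toy.groupby('user')['km_from_mean'].mean()
print(mobility)
```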

In [54]:
# Sample check-ins using per-user cap for fair representation
# Users with ≤6 check-ins: keep all
# Users with >6 check-ins: randomly sample 6

max_per_user = 6  # Maximum check-ins per user

checkins_sampled = checkins_df.groupby('user', group_keys=False)[checkins_df.columns].apply(
    lambda x: x if len(x) <= max_per_user else x.sample(n=max_per_user, random_state=42)
)

print(f'Max check-ins per user: {max_per_user}')
print(f'Original check-ins: {len(checkins_df):,}')
print(f'Sampled check-ins: {len(checkins_sampled):,}')
print(f'Users with check-ins: {checkins_sampled["user"].nunique():,}')

# Create aggregated user nodes with average coordinates (using SAMPLED check-ins for consistency)
user_nodes = checkins_sampled.groupby('user').agg({
    'latitude': 'mean',
    'longitude': 'mean',
    'check_in_time': 'count'
}).reset_index()
user_nodes.columns = ['user', 'avg_latitude', 'avg_longitude', 'checkin_count']
user_nodes['type'] = 'user'
user_nodes['node_id'] = 'user_' + user_nodes['user'].astype(str)

# Create check-in nodes from SAMPLED data
checkin_nodes = checkins_sampled.copy()
checkin_nodes['type'] = 'checkin'
checkin_nodes['node_id'] = 'checkin_' + checkin_nodes.index.astype(str)

# Create user->check-in edges
user_checkin_edges = pd.DataFrame({
    'source': 'user_' + checkin_nodes['user'].astype(str),
    'destination': checkin_nodes['node_id'],
    'type': 'user_to_checkin',
    'user': checkin_nodes['user'].astype(str)
})

# Create friendship edges between user nodes
friendship_edges = pd.DataFrame({
    'source': 'user_' + edges_df_filtered['user1'].astype(str),
    'destination': 'user_' + edges_df_filtered['user2'].astype(str),
    'type': 'friendship'
})

# Combine all edges
all_edges = pd.concat([friendship_edges, user_checkin_edges], ignore_index=True)

# Combine all nodes
all_nodes = pd.concat([
    user_nodes[['node_id', 'user', 'avg_latitude', 'avg_longitude', 'type', 'checkin_count']].rename(
        columns={'avg_latitude': 'latitude', 'avg_longitude': 'longitude'}
    ),
    checkin_nodes[['node_id', 'user', 'latitude', 'longitude', 'type', 'check_in_time', 'location_id']]
], ignore_index=True)

print(f'User nodes: {len(user_nodes):,}')
print(f'Check-in nodes: {len(checkin_nodes):,}')
print(f'Friendship edges: {len(friendship_edges):,}')
print(f'User->check-in edges: {len(user_checkin_edges):,}')
print(f'Total nodes: {len(all_nodes):,}')
print(f'Total edges: {len(all_edges):,}')

Max check-ins per user: 6
Original check-ins: 4,491,144
Sampled check-ins: 231,714
Users with check-ins: 50,686
User nodes: 50,686
Check-in nodes: 231,714
Friendship edges: 388,180
User->check-in edges: 231,714
Total nodes: 282,400
Total edges: 619,894
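As a sanity check on the per-user cap, the number of sampled check-ins should equal the sum over users of `min(check-in count, cap)`. The sketch below verifies this on toy data and also shows an alternative, vectorized way to apply the cap (shuffle once, then keep the first rows per user). It will not select the same rows as the `apply`-based sampler above, but it enforces the same cap.

```python
import pandas as pd

max_per_user = 6  # same cap as in the notebook

# Toy data: user 0 has 10 check-ins, user 1 has 3
toy = pd.DataFrame({
    'user': [0] * 10 + [1] * 3,
    'checkin': range(13),
})

# Shuffle all rows once, then keep at most max_per_user rows per user
sampled = (
    toy.sample(frac=1, random_state=42)
       .groupby('user')
       .head(max_per_user)
)

# Expected size: sum of min(count, cap) over users
expected = toy.groupby('user').size().clip(upper=max_per_user).sum()
print(f'Sampled {len(sampled)} rows, expected {expected}')
```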


In [55]:
# Create hypergraph visualization
g_hyper = graphistry.edges(all_edges, 'source', 'destination').nodes(all_nodes, 'node_id') \
    .encode_point_color("type", as_categorical=True, categorical_mapping={"checkin": "red", "user": "blue"}) \
    .encode_edge_color("type", as_categorical=True, categorical_mapping={"user_to_checkin": "red", "friendship": "blue"}) \
    .layout_settings(play=0) \
    .settings(height=800, url_params={"pointOpacity": 0.6, "edgeOpacity": 0.01})
g_hyper.plot()

## Add Choropleth Map Layer

This visualization adds a geographic choropleth layer using Kepler.gl that color-codes countries by the total number of nodes (users + check-ins) within their borders. The choropleth overlays the hypergraph to provide geographic context for network activity.

**What to explore:**
- Country-level aggregation: Total node count per country shown via color intensity
- Color gradient interpretation: Darker (black/dark green) = minimal activity, brighter (vibrant green) = high activity
- Roughly logarithmic binning: Color steps follow bin edges at 0, 1, 10, 100, 1K, 5K, 10K, and 15K nodes; the first steps are order-of-magnitude increases, while the upper bins are narrower
- Geographic patterns: Compare regional concentration vs. global distribution
- Cross-reference: Match choropleth colors to underlying point clusters on the map
- Network geography: Identify where users and check-ins are concentrated globally
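To see which color step a country would land in, the binning can be reproduced offline with `pandas.cut`. This is a sketch under the assumption that the choropleth's `bins` and `right` settings behave like the `pandas.cut` arguments of the same meaning; that equivalence is not confirmed here. The counts for US, JP, AU, and SE are taken from the country distribution in this notebook; the `example_*` entries are made up to exercise the lower bins.

```python
import pandas as pd

# Bin edges as used for the choropleth's computed column
bins = [0, 1, 10, 100, 1000, 5000, 10000, 15000, 9999999]

counts = pd.Series({
    'US': 170507, 'JP': 18165, 'AU': 8031, 'SE': 4748,   # from the notebook output
    'example_none': 0, 'example_few': 7, 'example_some': 55,  # made-up values
})

# right=False makes each bin left-closed: [0, 1), [1, 10), [10, 100), ...
bin_index = pd.cut(counts, bins=bins, right=False, labels=False)
print(bin_index)
```

With eight bins and eight colors in the gradient, bin index 0 would correspond to the darkest color and index 7 to the brightest, if colors are assigned in bin order.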

In [56]:
# Add country information using reverse_geocoder (fast, offline)
import reverse_geocoder as rg

# Filter nodes with valid coordinates
nodes_with_coords = all_nodes[all_nodes['latitude'].notna() & all_nodes['longitude'].notna()].copy()

print(f'Adding country information to {len(nodes_with_coords):,} nodes with coordinates...')

# Prepare coordinates for batch reverse geocoding
coords = list(zip(nodes_with_coords['latitude'], nodes_with_coords['longitude']))

# Batch reverse geocode (much faster than individual requests)
results = rg.search(coords)

# Extract country codes
nodes_with_coords['country'] = [result['cc'] for result in results]

# Merge back to all_nodes
all_nodes = all_nodes.drop(columns=['country'], errors='ignore')
all_nodes = all_nodes.merge(
    nodes_with_coords[['node_id', 'country']], 
    on='node_id', 
    how='left'
)

print('\nCountry distribution:')
print(all_nodes['country'].value_counts().head(20))

Adding country information to 282,400 nodes with coordinates...

Country distribution:
country
US    170507
JP     18165
GB     17155
AU      8031
CA      7654
DE      7360
SE      4748
NL      4442
IT      3453
FR      3157
NO      3144
ES      2818
FI      1941
CN      1866
BE      1434
CL      1322
IN      1313
BR      1307
PT      1270
CH      1207
Name: count, dtype: int64
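The counts above mix user nodes and check-in nodes. If the split matters, a cross-tabulation by node type separates them. A minimal sketch on toy rows (made-up values) shaped like `all_nodes`:

```python
import pandas as pd

# Toy nodes with the same 'type' and 'country' columns as all_nodes
toy_nodes = pd.DataFrame({
    'node_id': ['user_0', 'checkin_0', 'checkin_1', 'user_1', 'checkin_2'],
    'type': ['user', 'checkin', 'checkin', 'user', 'checkin'],
    'country': ['US', 'US', 'GB', 'JP', 'JP'],
})

# Rows: country, columns: node type, values: node counts
by_type = pd.crosstab(toy_nodes['country'], toy_nodes['type'])
by_type['total'] = by_type.sum(axis=1)
print(by_type.sort_values('total', ascending=False))
```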


In [57]:
from graphistry.kepler import KeplerDataset, KeplerLayer, KeplerEncoding

# Create visualization with countries colored by activity
kepler_ps_encoding = (
    KeplerEncoding()

    # Nodes dataset
    .with_dataset(
        KeplerDataset(
            id="nodes",
            type="nodes",
            label="Nodes"
        )
    )

    # Edges dataset with mapped coordinates
    .with_dataset(
        KeplerDataset(
            id="edges",
            type="edges",
            label="Edges"
        )
    )

    # Countries dataset
    .with_dataset(
        KeplerDataset(
            id="countries",
            type="countries",
            label="Nodes in Countries",
            resolution=110,
            boundary_lakes=False,
            computed_columns={
                "nodes_in_countries": {
                    "type": "aggregate",
                    "computeFromDataset": "nodes",
                    "sourceKey": "country",
                    "targetKey": "iso_a2_eh",
                    "aggregate": "count",
                    "aggregateCol": "node_id",
                    "bins": [0, 1, 10, 100, 1000, 5000, 10000, 15000, 9999999],
                    "right": False,
                    "includeLowest": True
                }
            }
        )
    )

    # Countries geojson layer with color encoding
    .with_layer(
        KeplerLayer({
            "id": "countries-ps-layer",
            "type": "geojson",
            "config": {
                "dataId": "countries",
                "label": "Countries by Num Nodes",
                "columns": {
                    "geojson": "_geometry"
                },
                "isVisible": True,
                "visConfig": {
                    "opacity": 0.7,
                    "strokeOpacity": 0.8,
                    "thickness": 0.5,
                    "strokeColor": [60, 60, 60],
                    "colorRange": {
                        "name": "Custom Gradient",
                        "type": "sequential",
                        "category": "Custom",
                        # One color per bin, intended in bin order (8 bins, 8 colors)
                        "colors": [
                            "#000000",   # Black: bin [0, 1), no nodes
                            "#001a0a",   # Very dark green: [1, 10)
                            "#003314",   # Dark green: [10, 100)
                            "#004d1f",   # Green: [100, 1000)
                            "#00802d",   # Dark lime green: [1000, 5000)
                            "#00b340",   # Medium green: [5000, 10000)
                            "#00e65c",   # Bright green: [10000, 15000)
                            "#1aff8c"    # Vibrant green: 15000 and above
                        ]
                    },
                    "filled": True,
                    "outline": True,
                    "extruded": False,
                    "wireframe": False
                }
            },
            "visualChannels": {
                "colorField": {
                    "name": "nodes_in_countries",
                    "type": "string"
                },
                "colorScale": "ordinal",
                "sizeField": None,
                "sizeScale": "linear"
            }
        })
    )
)

# Create hypergraph visualization with the choropleth layer attached
g_hyper = graphistry.edges(all_edges, 'source', 'destination').nodes(all_nodes, 'node_id') \
    .encode_point_color("type", as_categorical=True, categorical_mapping={"checkin": "red", "user": "blue"}) \
    .encode_edge_color("type", as_categorical=True, categorical_mapping={"user_to_checkin": "red", "friendship": "blue"}) \
    .encode_kepler(kepler_ps_encoding) \
    .layout_settings(play=0) \
    .settings(height=800, url_params={"pointOpacity": 0.6, "edgeOpacity": 0.01})
g_hyper.plot()