# Brightkite Location-Based Social Network Dataset

This notebook analyzes the Brightkite dataset from SNAP:
- **Network**: 58,228 users with 214,078 friendships
- **Check-ins**: about 4.5M location check-ins from April 2008 to October 2010

Source: https://snap.stanford.edu/data/loc-brightkite.html

In [None]:
import pandas as pd
import requests
import gzip
from io import BytesIO, StringIO
import graphistry

# To specify Graphistry account & server, use:
# graphistry.register(api=3, protocol="https", server="hub.graphistry.com",
#                     username="...", password="...")


## Download and Parse Friendship Network

In [2]:
# Download friendship network
edges_url = 'https://snap.stanford.edu/data/loc-brightkite_edges.txt.gz'
print('Downloading friendship network...')
edges_response = requests.get(edges_url)
edges_response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page

# Decompress and parse
with gzip.GzipFile(fileobj=BytesIO(edges_response.content)) as f:
    edges_content = f.read().decode('utf-8')

# Parse into DataFrame
edges_df = pd.read_csv(
    StringIO(edges_content),
    sep='\t',
    comment='#',
    names=['user1', 'user2'],
    dtype={'user1': int, 'user2': int}
)

print(f'Loaded {len(edges_df):,} edges')
edges_df.head()

Downloading friendship network...
Loaded 428,156 edges


| | user1 | user2 |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 2 |
| 2 | 0 | 3 |
| 3 | 0 | 4 |
| 4 | 0 | 5 |
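
The loaded row count (428,156) is exactly twice the 214,078 friendships quoted in the introduction, which is consistent with each undirected friendship being stored once per direction. A minimal sketch of collapsing such pairs into one row per friendship follows. It uses a small made-up edge list rather than the real download, and `toy_edges` and `undirected` are illustrative names that do not appear elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Toy edge list (hypothetical data): each friendship appears in both directions.
toy_edges = pd.DataFrame({
    'user1': [0, 1, 0, 2, 1, 2],
    'user2': [1, 0, 2, 0, 2, 1],
})

# Put the smaller id first so (a, b) and (b, a) become the same row.
pairs = np.sort(toy_edges[['user1', 'user2']].to_numpy(), axis=1)
undirected = pd.DataFrame(pairs, columns=['user1', 'user2']).drop_duplicates()

print(f'{len(toy_edges)} directed rows -> {len(undirected)} undirected friendships')
```

Keeping both directions is harmless for plotting, but counts such as "number of friendships" should be taken from the deduplicated table.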


## Download and Parse Check-in Data

In [3]:
# Download check-in data
checkins_url = 'https://snap.stanford.edu/data/loc-brightkite_totalCheckins.txt.gz'
print('Downloading check-in data...')
checkins_response = requests.get(checkins_url)
checkins_response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page

# Decompress and parse
with gzip.GzipFile(fileobj=BytesIO(checkins_response.content)) as f:
    checkins_content = f.read().decode('utf-8')

# Parse into DataFrame
checkins_df = pd.read_csv(
    StringIO(checkins_content),
    sep='\t',
    comment='#',
    names=['user', 'check_in_time', 'latitude', 'longitude', 'location_id'],
    dtype={'user': int},
    parse_dates=['check_in_time']
)

# Filter out likely invalid coordinates: (0, 0) or missing values
checkins_df = checkins_df[
    checkins_df['latitude'].notna() & 
    checkins_df['longitude'].notna() & 
    ((checkins_df['latitude'] != 0) | (checkins_df['longitude'] != 0))
]

print(f'Loaded {len(checkins_df):,} check-ins')
checkins_df.head()

Downloading check-in data...
Loaded 4,491,144 check-ins


| | user | check_in_time | latitude | longitude | location_id |
|---|---|---|---|---|---|
| 0 | 0 | 2010-10-17 01:48:53+00:00 | 39.747652 | -104.99251 | 88c46bf20db295831bd2d1718ad7e6f5 |
| 1 | 0 | 2010-10-16 06:02:04+00:00 | 39.891383 | -105.070814 | 7a0f88982aa015062b95e3b4843f9ca2 |
| 2 | 0 | 2010-10-16 03:48:54+00:00 | 39.891077 | -105.068532 | dd7cd3d264c2d063832db506fba8bf79 |
| 3 | 0 | 2010-10-14 18:25:51+00:00 | 39.750469 | -104.999073 | 9848afcc62e500a01cf6fbf24b797732f8963683 |
| 4 | 0 | 2010-10-14 00:21:47+00:00 | 39.752713 | -104.996337 | 2ef143e12038c870038df53e0478cefc |
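
The filter in the cell above removes missing coordinates and the (0, 0) placeholder. A possible extension, sketched here on made-up rows, also rejects values outside the valid latitude and longitude ranges. Whether the real file contains such rows is not checked here, and `has_valid_coords` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy check-ins (hypothetical values) including placeholder and out-of-range rows.
toy_checkins = pd.DataFrame({
    'user': [0, 0, 1, 2, 3],
    'latitude': [39.75, 0.0, 95.0, 51.5, None],
    'longitude': [-104.99, 0.0, 10.0, -200.0, 12.0],
})

def has_valid_coords(df):
    """Boolean mask: coordinates present, not (0, 0), and within lat/lon bounds."""
    lat, lon = df['latitude'], df['longitude']
    present = lat.notna() & lon.notna()
    not_origin = (lat != 0) | (lon != 0)
    in_range = lat.between(-90, 90) & lon.between(-180, 180)
    return present & not_origin & in_range

clean = toy_checkins[has_valid_coords(toy_checkins)]
print(clean)
```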


In [4]:
# Filter edges to only include users with valid check-ins
valid_users = set(checkins_df['user'].unique())
edges_df_filtered = edges_df[
    edges_df['user1'].isin(valid_users) & 
    edges_df['user2'].isin(valid_users)
]

print(f'Filtered edges: {len(edges_df):,} -> {len(edges_df_filtered):,}')
print(f'Users in network: {pd.concat([edges_df["user1"], edges_df["user2"]]).nunique():,}')
print(f'Users with valid check-ins: {len(valid_users):,}')
print(f'Users in filtered network: {pd.concat([edges_df_filtered["user1"], edges_df_filtered["user2"]]).nunique():,}')

Filtered edges: 428,156 -> 388,180
Users in network: 58,228
Users with valid check-ins: 50,686
Users in filtered network: 50,111


## Visualize Friendship Network with Graphistry

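This visualization shows the social network of Brightkite users. Each node represents a user, positioned at the first check-in listed for that user in the file. In the sample output above the rows run newest first, so this is likely the user's most recent check-in rather than their earliest. Edges represent friendships between users.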

**What to explore:**
- Community clusters: Groups of highly connected friends
- Geographic patterns: Whether friend groups cluster geographically
- Network hubs: Users with many connections (high degree)
- Network structure: Identify isolated groups vs. the main component
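
As a starting point for the "network hubs" item, node degree can be computed directly from the edge table. The sketch below uses a made-up edge list and assumes, as the loaded row count suggests, that every friendship is listed in both directions. If that assumption does not hold, both columns would need to be counted.

```python
import pandas as pd

# Toy edge list (hypothetical), with every friendship listed in both directions.
toy_edges = pd.DataFrame({
    'user1': [0, 1, 0, 2, 0, 3, 1, 2],
    'user2': [1, 0, 2, 0, 3, 0, 2, 1],
})

# With both directions present, counting rows per 'user1' gives each user's degree.
degree = toy_edges['user1'].value_counts().rename('degree')

# The highest-degree users are candidate hubs.
top_hubs = degree.nlargest(2)
print(top_hubs)
```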

In [9]:
# Visualize friendship network (filtered to users with valid check-ins)
# Use one check-in per user for node positioning: the first one listed in the file
# (rows appear newest first, so this is likely the most recent check-in)
first_checkins = checkins_df.groupby('user').first().reset_index()

g = graphistry.edges(edges_df_filtered, 'user1', 'user2').nodes(first_checkins, 'user') \
    .layout_settings(play=0) \
    .settings(height=800, url_params={"pointOpacity": 0.6, "edgeOpacity": 0.01})
g.plot()
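
Note that `groupby('user').first()` takes the first non-null value of each column in row order, not the earliest row by timestamp. If the chronologically earliest check-in per user is wanted instead, one option is to sort by time and keep the first row per user. This is a sketch on made-up rows, and `earliest` is an illustrative name.

```python
import pandas as pd

# Toy check-ins (hypothetical values), listed newest first as in the sample output.
toy_checkins = pd.DataFrame({
    'user': [0, 0, 1, 1],
    'check_in_time': pd.to_datetime([
        '2010-10-17 01:48:53', '2010-10-14 00:21:47',
        '2009-05-02 12:00:00', '2009-01-01 08:30:00',
    ], utc=True),
    'latitude': [39.75, 39.75, 51.5, 48.85],
    'longitude': [-104.99, -105.0, -0.12, 2.35],
})

# Sort by time, then keep the first row per user: the chronologically earliest check-in.
earliest = (
    toy_checkins.sort_values('check_in_time')
    .drop_duplicates('user', keep='first')
    .sort_values('user')
    .reset_index(drop=True)
)
print(earliest)
```

Unlike `first()`, this keeps whole rows, so latitude, longitude, and timestamp always come from the same check-in.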

## Create Hypergraph: Users + Check-ins

This hypergraph combines two types of nodes: **user nodes** (blue, at average location) and **check-in nodes** (red, at actual check-in locations). Two types of edges connect them: **friendships** (blue) between users, and **user-to-check-in** edges (red) linking users to their check-ins.

**What to explore:**
- Mobility patterns: The scatter of check-ins around a user's average location reveals travel behavior
- Social-spatial correlation: Do friends visit similar locations?
- Activity levels: Number of red edges from a user shows check-in frequency
- Geographic hotspots: Dense red node clusters indicate popular locations
- User movement range: The distance between a user node and its check-ins shows mobility
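
The movement-range idea can also be quantified rather than read off the plot. The sketch below computes the great-circle (haversine) distance from each check-in to its user's mean location and summarizes it per user. The data is made up, and `haversine_km`, `km_from_mean`, and `mobility` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Toy check-ins (hypothetical values): user 0 moves, user 1 stays put.
toy_checkins = pd.DataFrame({
    'user': [0, 0, 1, 1],
    'latitude': [39.0, 41.0, 10.0, 10.0],
    'longitude': [-105.0, -105.0, 20.0, 20.0],
})

# Per-user mean location, broadcast back to each check-in row.
centre = toy_checkins.groupby('user')[['latitude', 'longitude']].transform('mean')
toy_checkins['km_from_mean'] = haversine_km(
    toy_checkins['latitude'], toy_checkins['longitude'],
    centre['latitude'], centre['longitude'],
)

# Median distance per user as a simple mobility summary.
mobility = toy_checkins.groupby('user')['km_from_mean'].median()
print(mobility)
```

Averaging raw latitude and longitude is only a rough centre. It can mislead for users whose check-ins straddle the 180° meridian or span continents.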

In [7]:
# Sample check-ins proportionally per user, keeping at least 1 per user
# This ensures every user has representation while reducing total nodes

# Target number of check-ins to approximate
target_checkins = 500_000
min_per_user = 1  # At least 1 check-in per user

# Calculate sample fraction to approximate target
sample_fraction = target_checkins / len(checkins_df)

sampled_checkins = []
for user_id, user_data in checkins_df.groupby('user'):
    n_checkins = len(user_data)
    # Cap at n_checkins so sample() cannot fail if the target exceeds the data size
    n_sample = min(n_checkins, max(min_per_user, int(n_checkins * sample_fraction)))
    sampled_checkins.append(user_data.sample(n=n_sample, random_state=42))

checkins_sampled = pd.concat(sampled_checkins, ignore_index=True)

print(f'Target check-ins: {target_checkins:,}')
print(f'Original check-ins: {len(checkins_df):,}')
print(f'Sampled check-ins: {len(checkins_sampled):,}')
print(f'Sample fraction: {sample_fraction:.2%}')
print(f'Users with check-ins: {checkins_sampled["user"].nunique():,}')

# Create aggregated user nodes with average coordinates (using ALL check-ins for accuracy)
user_nodes = checkins_df.groupby('user').agg({
    'latitude': 'mean',
    'longitude': 'mean',
    'check_in_time': 'count'
}).reset_index()
user_nodes.columns = ['user', 'avg_latitude', 'avg_longitude', 'checkin_count']
user_nodes['type'] = 'user'
user_nodes['node_id'] = 'user_' + user_nodes['user'].astype(str)

# Create check-in nodes from SAMPLED data
checkin_nodes = checkins_sampled.copy()
checkin_nodes['type'] = 'checkin'
checkin_nodes['node_id'] = 'checkin_' + checkin_nodes.index.astype(str)

# Create user->check-in edges
user_checkin_edges = pd.DataFrame({
    'source': 'user_' + checkin_nodes['user'].astype(str),
    'destination': checkin_nodes['node_id'],
    'type': 'user_to_checkin'
})

# Create friendship edges between user nodes
friendship_edges = pd.DataFrame({
    'source': 'user_' + edges_df_filtered['user1'].astype(str),
    'destination': 'user_' + edges_df_filtered['user2'].astype(str),
    'type': 'friendship'
})

# Combine all edges
all_edges = pd.concat([friendship_edges, user_checkin_edges], ignore_index=True)

# Combine all nodes
all_nodes = pd.concat([
    user_nodes[['node_id', 'avg_latitude', 'avg_longitude', 'type', 'checkin_count']].rename(
        columns={'avg_latitude': 'latitude', 'avg_longitude': 'longitude'}
    ),
    checkin_nodes[['node_id', 'latitude', 'longitude', 'type', 'check_in_time', 'location_id']]
], ignore_index=True)

print(f'User nodes: {len(user_nodes):,}')
print(f'Check-in nodes: {len(checkin_nodes):,}')
print(f'Friendship edges: {len(friendship_edges):,}')
print(f'User->check-in edges: {len(user_checkin_edges):,}')
print(f'Total nodes: {len(all_nodes):,}')
print(f'Total edges: {len(all_edges):,}')

Target check-ins: 500,000
Original check-ins: 4,491,144
Sampled check-ins: 503,445
Sample fraction: 11.13%
Users with check-ins: 50,686
User nodes: 50,686
Check-in nodes: 503,445
Friendship edges: 388,180
User->check-in edges: 503,445
Total nodes: 554,131
Total edges: 891,625
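
The per-user sampling loop in the cell above can also be written without an explicit Python loop, which may be faster with tens of thousands of users. The sketch below shuffles all rows once, then keeps the first `quota` rows of each user, where the quota is the same "proportional share, at least one" rule. It will not select the same rows as the loop, and `sample_per_user` is a hypothetical helper name.

```python
import pandas as pd

def sample_per_user(df, fraction, min_per_user=1, seed=42):
    """Keep roughly `fraction` of each user's rows, but never fewer than `min_per_user`."""
    shuffled = df.sample(frac=1.0, random_state=seed)  # random order across all rows
    group_sizes = shuffled.groupby('user')['user'].transform('count')
    quota = (group_sizes * fraction).astype(int).clip(lower=min_per_user)
    position = shuffled.groupby('user').cumcount()  # 0-based rank within each user
    return shuffled[position < quota].reset_index(drop=True)

# Toy data (hypothetical): users with 8, 1, and 3 check-ins.
toy = pd.DataFrame({
    'user': [1] * 8 + [2] + [3] * 3,
    'location_id': list('abcdefghijkl'),
})

sampled = sample_per_user(toy, fraction=0.25)
counts = sampled['user'].value_counts().to_dict()
print(counts)
```

With a fraction of 0.25, the user with 8 rows keeps 2, and the users with 1 and 3 rows each keep the minimum of 1.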


In [15]:
# Create hypergraph visualization
g_hyper = graphistry.edges(all_edges, 'source', 'destination').nodes(all_nodes, 'node_id') \
    .encode_point_color("type", as_categorical=True, categorical_mapping={"checkin": "red", "user": "blue"}) \
    .encode_edge_color("type", as_categorical=True, categorical_mapping={"user_to_checkin": "red", "friendship": "blue"}) \
    .layout_settings(play=0) \
    .settings(height=800, url_params={"pointOpacity": 0.6, "edgeOpacity": 0.01})
g_hyper.plot()
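
As a sanity check on the tables passed to the plot, it can help to confirm that every edge endpoint has a matching row in the node table. The sketch below shows the idea on made-up tables, and `missing_endpoints` is a hypothetical helper. Applied to `all_edges` and `all_nodes`, it would be expected to return an empty set, since the friendship edges were already filtered to users with valid check-ins.

```python
import pandas as pd

def missing_endpoints(edges, nodes, src='source', dst='destination', node='node_id'):
    """Return the set of edge endpoints that have no row in the node table."""
    endpoints = set(edges[src]) | set(edges[dst])
    return endpoints - set(nodes[node])

# Toy tables (hypothetical): 'user_2' appears in an edge but not in the node table.
toy_nodes = pd.DataFrame({'node_id': ['user_0', 'user_1', 'checkin_0']})
toy_edges = pd.DataFrame({
    'source': ['user_0', 'user_0', 'user_2'],
    'destination': ['user_1', 'checkin_0', 'user_0'],
})

print(missing_endpoints(toy_edges, toy_nodes))
```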