# Project Overview

This project uses high-resolution Geolife GPS trajectories to study everyday mobility as a sequence of discrete **visits**, **trips**, and **exploration decisions**.  
The analysis is organised into three main notebooks, which move from data engineering to behavioural modelling and finally to cross-user comparison.


### Notebook 1 — From raw trajectories to visit-level events

Notebook 1 establishes the data foundation. Starting from the original Geolife `.plt` files, it applies a series of spatial and temporal transformations that convert noisy GPS logs into a clean visit-level dataset. Raw points are first loaded and cleaned, then restricted to the Beijing study area and projected to UTM coordinates on a 500 m grid, which smooths GPS jitter and provides a common spatial reference. Night-time observations are used to infer the user’s primary home location (HOME) and a secondary frequent place (SW), while line–grid intersections identify grid cells that are genuinely “visited” rather than simply passed through in transit. Finally, consecutive points within the same activity cell are collapsed into **visit events**, each labelled as HOME, SW, first-time place (Pv), or return place (Pn), and annotated with timing, distance-from-home, and within-day order attributes.

The resulting `visit_level_table_XXX.csv` is the input to all subsequent notebooks.
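
As a rough illustration of the gridding and HOME-inference steps, the sketch below maps projected coordinates to 500 m cells and picks the cell present on the most distinct nights. It uses only the Python standard library. The function names, the grid origin (`x0`, `y0`), and the tie-breaking rule are assumptions made for this example rather than the notebook's actual code, and the sample points are invented.

```python
from collections import defaultdict
from datetime import datetime
from math import floor

CELL_SIZE_M = 500.0  # grid resolution stated in the text


def to_cell(x_m, y_m, x0=0.0, y0=0.0, size=CELL_SIZE_M):
    """Map projected UTM coordinates (metres) to integer grid indices."""
    # The origin (x0, y0) is an assumption; any fixed origin works.
    return floor((x_m - x0) / size), floor((y_m - y0) / size)


def infer_home(points, night_start=0, night_end=6):
    """Return the cell seen on the most distinct nights.

    `points` is an iterable of (timestamp, x_m, y_m) tuples. Ties are
    broken by the number of night-time points (an assumed rule).
    """
    nights = defaultdict(set)   # cell -> dates with night-time presence
    counts = defaultdict(int)   # cell -> number of night-time points
    for ts, x_m, y_m in points:
        if night_start <= ts.hour < night_end:
            cell = to_cell(x_m, y_m)
            nights[cell].add(ts.date())
            counts[cell] += 1
    if not nights:
        return None
    return max(nights, key=lambda c: (len(nights[c]), counts[c]))


# Invented sample points: (timestamp, easting, northing)
points = [
    (datetime(2009, 4, 1, 1, 0), 1200.0, 3400.0),   # night, cell (2, 6)
    (datetime(2009, 4, 2, 2, 30), 1300.0, 3450.0),  # night, cell (2, 6)
    (datetime(2009, 4, 2, 3, 0), 5100.0, 900.0),    # night, cell (10, 1)
    (datetime(2009, 4, 2, 14, 0), 5100.0, 900.0),   # daytime, ignored
]
home = infer_home(points)
print(home)  # (2, 6)
```

Under the same assumptions, SW could be obtained by calling the function again after dropping all points that fall in the HOME cell.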

![Notebook 1 overview](image/001.png)

### Visit-level concepts (Notebook 1)

- **Raw GPS point / trajectory** – Single Geolife record (lat, lon, timestamp); ordered points form a trajectory.
- **Grid cell (`ci`, `rj`)** – 500 m × 500 m UTM grid; every GPS point is assigned to a cell.
- **HOME** – Grid cell with the highest and most persistent night-time (00:00–06:00) presence.
- **SW** – Second-strongest night-time cell after removing HOME (e.g. workplace / dorm).
- **Activity cell** – Grid cell crossed by enough trajectory line segments in a day to count as a genuine stop; cells that are only passed through on the road are excluded.
- **Visit (visit-level event)** – Continuous stay in the same activity cell; consecutive points merged, short GPS gaps bridged.
- **Pv (first-time place)** – Label for visits to a non-HOME/SW activity cell on the first day that cell appears.
- **Pn (return place)** – Label for visits to the same activity cell on any later day.
- **`place_id`** – Label for the cell: `HOME`, `SW`, `Pv#`, or `Pn#`.
- **`dist_home_m`** – Distance (metres) from the visit cell to HOME in UTM space.
- **`visit_order_in_day`** – Chronological index of visits within a calendar day.
- **`action_order`** – Index that resets to 0 at HOME and counts non-home visits while the user is away.
- **`next_step`** – Category of the following visit: `home`, `sw`, `pv`, `pn`, or `none` (end of day).
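
The visit and Pv/Pn definitions above can be sketched as follows. This is a minimal standard-library example, not the notebook's implementation: the helper names, the `kind` field, and the 30-minute gap used to bridge short GPS gaps are all assumptions, and the sample cells and timestamps are invented.

```python
from datetime import datetime, timedelta


def collapse_visits(points, max_gap=timedelta(minutes=30)):
    """Merge consecutive same-cell points into visits.

    `points` is a time-sorted list of (timestamp, cell) pairs. Gaps of
    up to `max_gap` inside the same cell are bridged (assumed value).
    """
    visits = []
    for ts, cell in points:
        last = visits[-1] if visits else None
        if last and last["cell"] == cell and ts - last["end"] <= max_gap:
            last["end"] = ts            # extend the ongoing visit
        else:
            visits.append({"cell": cell, "start": ts, "end": ts})
    return visits


def label_visits(visits, home, sw):
    """Attach a category: HOME, SW, Pv (first day seen) or Pn (later days)."""
    first_day = {}                      # cell -> date of first appearance
    for v in visits:
        day = v["start"].date()
        if v["cell"] == home:
            v["kind"] = "HOME"
        elif v["cell"] == sw:
            v["kind"] = "SW"
        else:
            first = first_day.setdefault(v["cell"], day)
            v["kind"] = "Pv" if day == first else "Pn"
    return visits


H, A, B = (0, 0), (3, 1), (5, 5)        # hypothetical grid cells
pts = [
    (datetime(2009, 4, 1, 8, 0), H),
    (datetime(2009, 4, 1, 8, 10), H),
    (datetime(2009, 4, 1, 9, 0), A),
    (datetime(2009, 4, 1, 9, 5), A),
    (datetime(2009, 4, 1, 12, 0), H),
    (datetime(2009, 4, 2, 9, 0), A),
    (datetime(2009, 4, 2, 10, 0), B),
]
visits = label_visits(collapse_visits(pts), home=H, sw=None)
kinds = [v["kind"] for v in visits]
print(kinds)  # ['HOME', 'Pv', 'HOME', 'Pn', 'Pv']
```

Cell `A` is labelled Pv on its first day and Pn when it reappears the next day, which matches the day-based definition in the list.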

### Notebook 2 — Trip-level behaviour within a single user

Notebook 2 takes the visit-level table for one user and reconstructs complete home–home trips. Visits are grouped into outings that start at HOME, move through one or more non-home stops, and end when HOME is reached again (or the day ends). Within each trip, the notebook analyses two key dimensions of behaviour.  

First, it estimates a discrete-time hazard of going home: for each stop order \(k\), it measures the probability that the next move is a return to HOME, conditional on the trip having already reached stop \(k\). This yields empirical hazard and survival curves, plus a simple logit trend in stop order.
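
A minimal sketch of the empirical hazard and survival calculation, assuming one record per non-home stop of the form `(k, went_home)` and the usual discrete-time relation \(S_k = \prod_{j \le k} (1 - h_j)\). The function name and the record layout are hypothetical, and the trip lengths are invented for illustration.

```python
from collections import Counter


def hazard_and_survival(stop_records):
    """Empirical going-home hazard h_k and survival S_k by stop order k.

    `stop_records` is an iterable of (k, went_home) pairs, one per
    non-home stop; went_home is True if the next move returned to HOME.
    """
    at_risk = Counter()     # stops observed at each order k
    went_home = Counter()   # of those, how many were followed by HOME
    for k, home_next in stop_records:
        at_risk[k] += 1
        went_home[k] += bool(home_next)
    hazard, survival, alive = {}, {}, 1.0
    for k in sorted(at_risk):
        hazard[k] = went_home[k] / at_risk[k]
        alive *= 1.0 - hazard[k]        # S_k = prod_{j<=k} (1 - h_j)
        survival[k] = alive
    return hazard, survival


# Four invented trips with 1, 1, 2 and 3 non-home stops, all ending at HOME.
trip_lengths = [1, 1, 2, 3]
records = [(k, k == n) for n in trip_lengths for k in range(1, n + 1)]
hazard, survival = hazard_and_survival(records)
print(hazard)    # {1: 0.5, 2: 0.5, 3: 1.0}
print(survival)  # {1: 0.5, 2: 0.25, 3: 0.0}
```

Trips cut off by the end of the day would contribute stops with `went_home=False` at their last order, so they stay in the at-risk counts without adding a return.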

Second, it models the propensity for exploration by tracking when first-time places (Pv) occur along the sequence of stops, and fitting a logistic regression of the Pv indicator on within-trip stop order. Together, these components characterise whether a trip is a short one-stop errand or a longer outing, and how the balance between revisiting known places and exploring new ones evolves as the trip unfolds.
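
To make the Pv logit concrete, the sketch below fits \(\text{logit}\, P(\text{Pv}) = a + b\,k\) by Newton-Raphson using only the standard library, then reports \(e^{b}\) as the odds ratio per extra stop. The notebook may well use a statistics package instead; the function name, iteration count, and sample data are assumptions for illustration.

```python
from math import exp


def fit_logit(xs, ys, n_iter=25):
    """Fit P(y=1) = 1 / (1 + exp(-(a + b*x))) by Newton-Raphson.

    Returns (a, b). Assumes the data are not perfectly separable.
    """
    a = b = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p                 # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w                    # entries of the negative Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det    # Newton step: H^{-1} g
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Invented data: Pv at 1 of 4 first stops and 2 of 4 second stops.
stop_order = [1, 1, 1, 1, 2, 2, 2, 2]
is_pv = [1, 0, 0, 0, 1, 1, 0, 0]
a, b = fit_logit(stop_order, is_pv)
pv_odds_ratio_per_stop = exp(b)
print(round(pv_odds_ratio_per_stop, 6))  # 3.0
```

With only two stop orders the model reproduces the observed shares exactly: the odds rise from 1:3 at the first stop to 1:1 at the second, an odds ratio of 3.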

![Notebook 2 overview](image/002.png)


### Trip-level variables (Notebook 2)

- **`trip_id`** – Identifier of a home–home trip within a day (0 = not in any trip).
- **`action_order_in_trip`** – Order of non-home stops within a trip (0 at HOME, 1, 2, … away from HOME).
- **`n_stops`** – Number of non-home stops in a trip.
- **`trip_start`, `trip_end`** – Start and end timestamps of the trip.
- **`trip_duration_min`** – Trip duration in minutes.
- **`max_dist_home`, `mean_dist_home`** – Maximum and average distance from HOME during the trip.
- **`outcome`** – For each non-home stop: `home`, `explore` (another non-home place), or `end` (trip ends).
- **Hazard of going home `h_k`** – Probability that the next stop is HOME, conditional on the trip having reached stop \(k\).
- **Survival `S_k`** – Probability that the trip is still ongoing after stop \(k\).
- **`overall_pv_share` (within-trip)** – Share of non-home stops that are Pv.
- **`pv_odds_ratio_per_stop`** – Multiplicative change in the odds of Pv per extra stop in the trip (from a logit model).
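
The trip identifiers above can be illustrated with a short sketch that walks through one day's visits in order. Several details are assumptions rather than the notebook's confirmed rules: HOME visits receive `trip_id` 0, non-home visits before the first HOME visit of the day are left outside any trip, and SW counts as an ordinary non-home stop. The function name is hypothetical.

```python
from collections import Counter


def segment_trips(kinds):
    """Assign (trip_id, action_order_in_trip) to one day's visits.

    `kinds` is a chronological list of visit categories such as
    "HOME", "SW", "Pv", "Pn".
    """
    out = []
    trip_id = 0          # id of the most recently opened trip
    order = 0            # stop order inside the current trip
    seen_home = False    # trips can only start after a HOME visit
    for kind in kinds:
        if kind == "HOME":
            seen_home, order = True, 0
            out.append((0, 0))
        elif not seen_home:
            out.append((0, 0))          # away before any HOME: not a trip
        else:
            if order == 0:
                trip_id += 1            # first stop after HOME opens a trip
            order += 1
            out.append((trip_id, order))
    return out


day = ["HOME", "Pn", "Pv", "HOME", "SW", "HOME"]
tags = segment_trips(day)
print(tags)  # [(0, 0), (1, 1), (1, 2), (0, 0), (2, 1), (0, 0)]

# n_stops per trip: count the non-home stops carrying each trip_id.
n_stops = Counter(t for t, _ in tags if t)
print(dict(n_stops))  # {1: 2, 2: 1}
```

A trip that is still open when the day ends simply keeps its last assigned `trip_id`, matching the "or the day ends" case in the description.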

### Notebook 3 — Cross-user comparison of daily exploration patterns

Notebook 3 extends the analysis from a single individual to a panel of long-coverage Geolife users. For each user in the sample, it reloads the corresponding visit-level table, re-attaches the same home–home trip structure as in Notebook 2, and computes a compact set of trip-level and behavioural summaries: number of active days and trips, distribution of non-home stops per trip, user-specific going-home hazards, and within-trip exploration rates. The notebook then fits separate discrete-time logit models for the going-home hazard and the Pv probability for each user, producing a small set of interpretable parameters that describe their overall level of “home-boundness” and their tendency to explore as trips grow longer.  

Finally, these features are combined into a multivariate “trip behaviour space”, where users are compared and clustered into broad types (e.g. errand-oriented, revisiting-oriented, or exploratory roamers). In this way, Notebook 3 acts as a cross-sectional validation of the patterns identified in Notebooks 1 and 2 and shows how similar or heterogeneous daily exploration behaviour is across individuals.
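
The clustering step might look roughly like the following: features are z-scored, then grouped with a small k-means (Lloyd's algorithm). This is a standard-library sketch with a deterministic farthest-point initialisation chosen for reproducibility; the notebook's actual library, feature set, and initialisation are not specified here, and the feature values are invented rather than taken from the project's results.

```python
from statistics import mean, pstdev


def standardise(rows):
    """Z-score each column so no single feature dominates the distances."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]   # guard constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(
            max(points, key=lambda p: min(sqdist(p, c) for c in centres))
        )
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest centre for every point.
        labels = [
            min(range(k), key=lambda j: sqdist(p, centres[j])) for p in points
        ]
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = [mean(col) for col in zip(*members)]
    return labels


# Invented rows: [mean_stops_per_trip, hazard_h1, overall_pv_share]
users = [
    [1.1, 0.90, 0.10],
    [1.2, 0.85, 0.12],
    [3.0, 0.30, 0.10],
    [3.2, 0.25, 0.15],
    [2.0, 0.50, 0.60],
    [2.1, 0.55, 0.65],
]
labels = kmeans(standardise(users), k=3)
print(labels)
```

Cluster numbers from k-means are arbitrary, so any mapping from label to a named user type has to be read off the cluster centres afterwards.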

### User-level summary variables (Notebook 3)

Each row of `summary_fit` summarises one user:

- **`uid`** – User ID.
- **`n_days_total`** – Days with any GPS data.
- **`n_days_nonhome`** – Days with at least one non-home stop.
- **`n_trips`** – Number of detected home–home trips.
- **`mean_stops_per_trip`** – Average number of non-home stops per trip.
- **`median_stops_per_trip`** – Median number of non-home stops per trip.
- **`max_stops_per_trip`** – Maximum number of non-home stops in any trip.
- **`hazard_h1`** – Empirical probability of going home right after the first non-home stop.
- **`hazard_h2_const`** – Average going-home hazard for stops \(k \ge 2\).
- **`hazard_beta_k`** – Logit slope of the going-home hazard with respect to stop order \(k\).
- **`overall_pv_share`** – Fraction of non-home trip stops that are Pv (first-time places).
- **`pv_odds_ratio_per_stop`** – Odds ratio for Pv per additional stop (from the within-trip Pv logit model).
- **`hazard_mse`** – Mean squared error between empirical and fitted going-home hazards.
- **`pv_mse`** – Mean squared error between empirical and fitted Pv probabilities.
- **`cluster3`** – k-means cluster label (0, 1, 2), corresponding to multi-stop revisitors, errand-homebodies, and exploratory roamers.
- **`PC1`, `PC2`, `PC3`** – Scores on the first three principal components of the nine user-level features:  
  - **PC1** – Trip complexity and routine (long multi-stop vs short exploratory outings).  
  - **PC2** – Mobility frequency and intensity (how often and on how many days users go out).  
  - **PC3** – Late-trip behaviour (how quickly the return-home hazard rises or stays low at higher stop orders).
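
As an illustration of how a few of these columns could be derived, the sketch below builds part of one summary row from a user's list of stops per trip. It assumes every trip ends with a return to HOME and treats `hazard_h2_const` as a pooled hazard over all stops with \(k \ge 2\); both are simplifying assumptions, and the function name, user ID, and trip lengths are invented.

```python
from statistics import mean, median


def summarise_user(uid, stops_per_trip):
    """Build part of one summary row from a user's trip lengths.

    Assumes every trip ends with a return to HOME (no day-end cut-offs).
    """
    n_trips = len(stops_per_trip)
    one_stop = sum(1 for n in stops_per_trip if n == 1)
    # Stops at order k >= 2, and how many of them were followed by HOME.
    late_stops = sum(n - 1 for n in stops_per_trip)
    late_returns = sum(1 for n in stops_per_trip if n >= 2)
    return {
        "uid": uid,
        "n_trips": n_trips,
        "mean_stops_per_trip": mean(stops_per_trip),
        "median_stops_per_trip": median(stops_per_trip),
        "max_stops_per_trip": max(stops_per_trip),
        "hazard_h1": one_stop / n_trips,
        "hazard_h2_const": late_returns / late_stops if late_stops else None,
    }


row = summarise_user("000", [1, 1, 2, 3])
print(row["hazard_h1"])            # 0.5
print(row["mean_stops_per_trip"])  # 1.75
```

Here two of the four trips end after the first stop, giving `hazard_h1 = 0.5`, and two of the three later stops are followed by HOME, giving a pooled `hazard_h2_const` of 2/3.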