# Literature Review & Project Overview

This project uses high-resolution Geolife GPS trajectories to examine everyday mobility as a sequence of discrete visits, trips, and exploration decisions.  
The analysis is structured into five Jupyter notebooks.



### 1. What is EPR?
We examine everyday movement as a balance between revisiting familiar places and exploring new ones. In behavioral terms, most people follow habitual travel routines (e.g., home–work–home) yet occasionally deviate to novel destinations. We summarize this “exploit vs. explore” trade-off with an Exploration–Preferencing Ratio (EPR): a higher EPR means more exploration of new sites relative to returns to known ones. Conceptually, our EPR is analogous to the exploration-and-preferential-return model in human mobility research (which shares the acronym), in which at each move an agent either visits a new location or returns to a previously visited one (Pappalardo, Rinzivillo, & Simini, 2016; Song, Koren, Wang, & Barabási, 2010). Large-scale visitation patterns have been shown to arise from EPR-like dynamics in empirical data (Schläpfer et al., 2021; Song, Qu, Blumm, & Barabási, 2010).

In mobility data, we treat EPR operationally as the ratio of novel stops to repeated stops. For example, if a trip includes three previously unvisited stops (exploration) and one repeated stop, the trip-level EPR is 3:1, or equivalently a novel-stop share of 3/4, indicating predominantly exploratory behavior. We adapt this idea to individual GPS trajectories by explicitly labelling each stop as either “Pv” (previously visited) or “Pn” (novel) and then modeling the patterns of Pv/Pn occurrences within trips.

Concretely, each non-home stop is labelled Pn (novel) if it is the first time the user has stopped at that location, or Pv (previously visited) if the user has stopped there before. Over a trip (home → … → home), we then summarize the trip’s exploratory tendency by, for example, the count of Pn stops or the share Pn/(Pv + Pn). A person’s overall EPR can be aggregated from their trips (e.g., as the average per-trip exploration rate).
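The trip-level summary described above can be sketched in a few lines. This is a minimal illustration, not the notebooks' implementation: the function names `trip_exploration_share` and `user_epr` and the example trips are hypothetical, and it assumes each trip is already available as a list of `"Pn"`/`"Pv"` labels.

```python
from statistics import mean

def trip_exploration_share(labels):
    """Share of novel stops in one trip: Pn / (Pv + Pn)."""
    n_new = sum(1 for lab in labels if lab == "Pn")
    n_old = sum(1 for lab in labels if lab == "Pv")
    total = n_new + n_old
    return n_new / total if total else None  # None for trips with no labeled stops

def user_epr(trips):
    """Average the per-trip shares, skipping trips without labeled stops."""
    shares = [s for s in map(trip_exploration_share, trips) if s is not None]
    return mean(shares) if shares else None

# Hypothetical example: three home-to-home trips for one user.
trips = [["Pn", "Pn", "Pn", "Pv"], ["Pv", "Pv"], ["Pn", "Pv"]]
per_trip = [trip_exploration_share(t) for t in trips]
overall = user_epr(trips)
```

Averaging per-trip shares weights every trip equally regardless of length; pooling all stops before taking the share would instead weight long trips more heavily.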

This trip-focused EPR differs from classic probabilistic models (e.g., the EPR model of Song, Koren, Wang, & Barabási, 2010) in that we measure empirical behavior rather than impose a parametric probability of exploration; in that model, the chance of exploring decays as a power of the number of distinct locations already visited. Prior literature has examined related metrics: Pappalardo et al. (2015), for instance, show that individuals separate into “returners,” whose mobility is dominated by a few frequently visited locations, and “explorers,” whose mobility is spread over many locations. Our approach is similar in spirit but works at the granularity of trips and discrete stops.

We also draw on ecological ideas of foraging: just as people navigating information maximize an information-gain rate by choosing to explore “new patches” only when the expected gain outweighs the cost (Pirolli & Card, 1999; Nielsen, 2019), travelers may implicitly weigh the novelty of a potential stop against its travel or time cost. We do not explicitly model that decision rule, but the EPR encapsulates its outcome. In short, EPR is the key behavioral concept linking individual choices of revisiting vs. exploring, and our goal is to measure it from trajectory data.

### 2. Data Processing: From Raw GPS to Home–Home Trips

We apply the methodology to the Microsoft Geolife GPS dataset (182 users, multi-year GPS tracks; Zheng, Xie, & Ma, 2010). After loading and cleaning the Geolife `.plt` trajectories, coordinates are projected to UTM Zone 50N so that distances are measured in meters. Each trajectory is then discretized onto a regular 500 m × 500 m grid, and consecutive fixes assigned to the same grid cell are treated as a single stay segment. This representation reduces sensitivity to GPS jitter and irregular sampling, while producing visits with explicit arrival/departure times and durations.
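The gridding and stay-segment step can be sketched as follows. This is an illustrative sketch under stated assumptions: coordinates are taken to be already projected to meters (the projection itself is not shown), and the names `Stay`, `to_cell`, and `stay_segments` as well as the sample fixes are hypothetical rather than taken from the notebooks.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

CELL_SIZE_M = 500  # grid resolution stated in the text

@dataclass
class Stay:
    cell: tuple          # (column, row) index of the grid cell
    arrival: datetime    # timestamp of the first fix in the cell
    departure: datetime  # timestamp of the last consecutive fix in the cell

    @property
    def duration(self):
        return self.departure - self.arrival

def to_cell(easting, northing, size=CELL_SIZE_M):
    """Map projected coordinates (meters) to integer grid indices."""
    return (int(easting // size), int(northing // size))

def stay_segments(fixes):
    """Collapse consecutive fixes in the same cell into stay segments.

    `fixes` is a time-ordered list of (timestamp, easting, northing).
    """
    stays = []
    for ts, x, y in fixes:
        cell = to_cell(x, y)
        if stays and stays[-1].cell == cell:
            stays[-1].departure = ts  # extend the current stay
        else:
            stays.append(Stay(cell, ts, ts))  # open a new stay
    return stays

# Hypothetical fixes, five minutes apart, crossing one cell boundary.
t0 = datetime(2009, 1, 1, 8, 0)
fixes = [
    (t0, 100.0, 100.0),
    (t0 + timedelta(minutes=5), 450.0, 200.0),
    (t0 + timedelta(minutes=10), 600.0, 200.0),
    (t0 + timedelta(minutes=15), 700.0, 300.0),
]
stays = stay_segments(fixes)
```

Here a stay's duration runs from its first to its last fix in the cell, so a single isolated fix yields a zero-length stay; how such stays are treated downstream is a separate design choice.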

Home location is inferred from night-time (00:00–06:00) observations rather than overall visit frequency: the grid cell with the strongest night-time presence is labeled HOME, and a secondary frequent night-time cell may be labeled SW. To avoid misclassifying fast in-transit corridors as “places,” daily activity locations are identified conservatively using line–grid intersection counts. Consecutive points are connected into segments, and grid cells with sufficiently many segment intersections are retained, with an adaptive threshold that caps the number of activity cells per day. Visit events are then constructed in three steps: collapsing the within-day cell sequence, dropping non-activity blocks as transit, and merging adjacent identical blocks to reduce fragmentation.
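A minimal sketch of the night-time home inference is shown below. It assumes that “strongest night-time presence” means the largest number of fixes between 00:00 and 06:00 and that timestamps are already in local time; both are assumptions for illustration, and the function name `infer_home` and the sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime

def infer_home(fixes, start_hour=0, end_hour=6):
    """Return (home_cell, secondary_cell) ranked by night-time fix counts.

    `fixes` is a list of (timestamp, cell) pairs. Presence is measured by
    fix count here, which is an assumption; time spent or number of
    distinct nights would be alternative measures.
    """
    night = Counter(
        cell for ts, cell in fixes if start_hour <= ts.hour < end_hour
    )
    ranked = [cell for cell, _ in night.most_common(2)]
    home = ranked[0] if ranked else None
    secondary = ranked[1] if len(ranked) > 1 else None
    return home, secondary

# Hypothetical fixes: the daytime cell is the most visited overall,
# but it never appears at night and so cannot be labeled HOME.
fixes = (
    [(datetime(2009, 1, 1, h, 0), (10, 20)) for h in (1, 3, 5)]
    + [(datetime(2009, 1, 2, 2, 0), (11, 20))]
    + [(datetime(2009, 1, 1, h, 0), (30, 40)) for h in (9, 10, 11, 14, 15)]
)
home, secondary = infer_home(fixes)
```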

Each non-home visit is labeled as first-time versus return based on the user’s cumulative visitation history across days: a cell is treated as novel the first day it appears as an activity location and as a return on subsequent days. The resulting visit sequence provides the input for subsequent steps, where home–home trips are constructed and the trip-level Pv/Pn patterns are analyzed to quantify exploration behavior.
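The labelling rule and the home-to-home segmentation can be sketched as follows. The helper names `label_visits` and `home_home_trips` are hypothetical, and the sketch assumes that only closed trips (bounded by HOME on both sides) are kept, which the text does not state explicitly.

```python
def label_visits(visits, home):
    """Label each visit 'HOME', 'Pn' (first day seen), or 'Pv' (seen on an earlier day).

    `visits` is a time-ordered list of (day, cell) pairs.
    """
    first_day = {}
    labeled = []
    for day, cell in visits:
        if cell == home:
            labeled.append((day, cell, "HOME"))
            continue
        first_day.setdefault(cell, day)  # remember the first day this cell appears
        label = "Pn" if first_day[cell] == day else "Pv"
        labeled.append((day, cell, label))
    return labeled

def home_home_trips(labeled):
    """Split the labeled sequence into trips: runs of stops between HOME visits.

    Stops before the first HOME or after the last HOME are dropped
    (an assumption of this sketch).
    """
    trips, current, seen_home = [], [], False
    for _, _, label in labeled:
        if label == "HOME":
            if seen_home and current:
                trips.append(current)
            current, seen_home = [], True
        elif seen_home:
            current.append(label)
    return trips

# Hypothetical visit sequence over three days.
home = (0, 0)
visits = [
    (1, (0, 0)), (1, (5, 5)), (1, (6, 6)), (1, (0, 0)),
    (2, (0, 0)), (2, (5, 5)), (2, (7, 7)), (2, (0, 0)),
    (3, (5, 5)),
]
labeled = label_visits(visits, home)
trips = home_home_trips(labeled)
```

Because novelty is decided per day, a cell visited twice on its first day is labeled Pn both times under this rule.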

![Notebook 1 overview](image/001.png)

### 3. Trip-Level Behavior Within a Single User

Given the home–home trip sequences with first-time/return labels, two step-level aspects of trip behavior are modeled.

The first is the hazard of returning home (trip termination). Each trip is treated as a discrete-time survival process: at every stop (time step) prior to termination, there is some probability that the next move is a return to HOME. Following Singer and Willett’s discrete-time framework, a logistic hazard model is estimated by “exploding” each trip into one row per stop, with a binary outcome indicating whether the next step is HOME (return = 1) or not (0). This outcome is regressed on covariates observed at that step (e.g., stop order, elapsed time since departure, distance from home, and whether the most recent stop was first-time versus a return). The fitted hazard model summarizes how the propensity to end a trip evolves as the trip unfolds.
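To make the “exploding” step concrete, the sketch below builds the one-row-per-stop table and computes the covariate-free empirical hazard by stop order. This is the quantity a logistic hazard model with only stop-order indicators would reproduce; the covariate-based model described above is not fitted here. The names `explode_trip` and `empirical_hazard` and the trip lengths are hypothetical.

```python
from collections import defaultdict

def explode_trip(trip_id, n_stops):
    """One row per stop; event = 1 when the next move is the return to HOME."""
    return [
        {"trip": trip_id, "stop_order": j, "event": int(j == n_stops)}
        for j in range(1, n_stops + 1)
    ]

def empirical_hazard(rows):
    """Share of at-risk trips that end right after each stop order."""
    at_risk, events = defaultdict(int), defaultdict(int)
    for r in rows:
        at_risk[r["stop_order"]] += 1
        events[r["stop_order"]] += r["event"]
    return {j: events[j] / at_risk[j] for j in sorted(at_risk)}

# Hypothetical trips, described only by their number of non-home stops.
trip_lengths = {"t1": 1, "t2": 2, "t3": 2, "t4": 3}
rows = [row for tid, n in trip_lengths.items() for row in explode_trip(tid, n)]
hazard = empirical_hazard(rows)
```

In this toy example one of four trips ends after the first stop, two of the remaining three end after the second, and the last trip ends after the third, so the hazard rises with stop order.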

The second component models exploration propensity at each step. Here the binary outcome indicates whether the current stop is a first-time place versus a return (under the notebook’s Pv/Pn convention). A logistic regression (or similar classifier) is fitted for “explore = 1” using predictors such as step index, time of day, day of week, and distance from home. This model captures how the probability of making a first-time stop changes with trip progress and context, and can be interpreted as a step-level analogue of EPR.
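As a rough illustration of the exploration model, the sketch below fits a one-predictor logistic regression of “explore = 1” on step index. It uses plain gradient ascent as a stand-in for a library routine, ignores the other predictors mentioned above, and runs on made-up data; `fit_logistic` and all values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, n_iter=5000):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            resid = y - sigmoid(b0 + b1 * x)  # observed minus predicted
            g0 += resid
            g1 += resid * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical step-level data: exploration becomes rarer later in the trip.
steps   = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]
explore = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]

b0, b1 = fit_logistic(steps, explore)
p_first = sigmoid(b0 + b1 * 1)   # fitted exploration probability at step 1
p_fourth = sigmoid(b0 + b1 * 4)  # fitted exploration probability at step 4
```

A negative slope `b1` corresponds to exploration probability declining as the trip progresses; in practice a statistics library would also provide standard errors for that slope.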

![Notebook 2 overview](image/002.png)


### 4. Cross-User Comparison: PCA and Clustering of EPR Profiles

Finally, we compare users in the space of these model-derived behaviors. Each user’s model yields a feature vector (e.g., [hazard intercept, hazard slope, exploration intercept, exploration slope, …]) that encodes their tendency to explore and to return. Since this vector may be high-dimensional, we first apply principal components analysis (PCA) to reduce dimensionality and identify the main axes of behavioral variance. PCA reveals whether users vary along, for example, a spectrum from high exploration to a strong preference for revisitation, or along other blended factors.
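A compact sketch of the PCA step, using NumPy's singular value decomposition, is given below. Standardizing the columns first is an assumption of this sketch (the coefficients are on different scales), and the function name `pca_scores` and the feature values are hypothetical.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Standardize columns, then project onto the leading principal components."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # z-score each feature
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)                   # variance share per component
    scores = Z @ Vt[:n_components].T                  # user coordinates in PC space
    return scores, explained[:n_components], Vt[:n_components]

# Hypothetical per-user coefficients:
# [hazard intercept, hazard slope, exploration intercept, exploration slope]
features = np.array([
    [-1.2, 0.30,  0.8, -0.40],
    [-1.0, 0.25,  0.6, -0.35],
    [-0.4, 0.10, -0.5, -0.10],
    [-0.5, 0.15, -0.7, -0.05],
    [-1.5, 0.40,  1.0, -0.50],
    [-0.3, 0.05, -0.9, -0.02],
])
scores, explained, loadings = pca_scores(features)
```

The rows of `loadings` show how each original coefficient contributes to a component, which is what supports a reading such as “exploration vs. revisitation” for a given axis.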

We then apply k-means clustering to the users’ PCA scores, with k chosen by silhouette or cross-validation criteria, to group users into behavioral phenotypes. Prior work has shown that clustering mobility patterns often yields well-separated groups defined by exploration intensity and travel range. In our case, the clustering identifies typical EPR/hazard profiles in the population, for example “local explorers” (generally short trips but often to new places) versus “long-distance returners” (travel far but largely between the same hubs). We can then summarize each cluster by its mean EPR, radius of gyration, and similar indicators, and interpret it in behavioral terms.
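The clustering step can be sketched with a small NumPy implementation of Lloyd's k-means and a mean silhouette score used to pick k. This is an illustrative stand-in for a library routine, with hypothetical function names and made-up PC scores; it covers only the silhouette criterion, not cross-validation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm with initial centers drawn at random from the data."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def mean_silhouette(X, labels):
    """Mean silhouette width; assumes at least two non-empty clusters."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = set(labels.tolist())
    widths = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        a = D[i, same].sum() / (same.sum() - 1)  # mean distance within own cluster
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        widths.append((b - a) / max(a, b))
    return float(np.mean(widths))

# Hypothetical PC scores for six users forming two tight groups.
pc_scores = np.array([
    [-2.0, 0.1], [-1.8, -0.2], [-2.2, 0.0],
    [1.9, 0.1], [2.1, -0.1], [2.0, 0.2],
])
silhouettes = {k: mean_silhouette(pc_scores, kmeans(pc_scores, k)[0]) for k in (2, 3)}
best_k = max(silhouettes, key=silhouettes.get)
labels, centers = kmeans(pc_scores, best_k)
```

Because k-means depends on its starting centers, running it from several seeds and keeping the best solution is a common safeguard on real data.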