# FOURSQUARE PROCESSING

# Overview of Dataset Files

This project relies on three main datasets, all of which are described below. These datasets are essential for processing and building a framework for analyzing user interactions with Points of Interest (POIs). Details about the files and their structure can also be found in the accompanying README.

## 1. **dataset_TIST2015_Checkins.txt**
This file contains all user check-ins and provides information about interactions with various venues (POIs). The data is structured into 4 columns:

1. **User ID**: An anonymized identifier for the user.  
2. **Venue ID**: The unique identifier for a venue (Foursquare).  
3. **UTC Time**: The timestamp of the check-in in Coordinated Universal Time (UTC).  
4. **Timezone Offset**: The offset in minutes between the check-in's local time and UTC. This can be used to calculate the local time of the check-in using:  
   \[
   \text{Local Time} = \text{UTC Time} + \text{Timezone Offset}
   \]
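
As a minimal sketch of this conversion, assuming the UTC time string follows the `%a %b %d %H:%M:%S +0000 %Y` layout that is parsed later in this notebook. The helper name `to_local_time` and the sample values are hypothetical, not taken from the dataset:

```python
from datetime import datetime, timedelta

# Layout of the UTC time column, as parsed later in this notebook.
UTC_FORMAT = '%a %b %d %H:%M:%S +0000 %Y'

def to_local_time(utc_time: str, offset_minutes: int) -> datetime:
    """Return the naive local wall-clock time of a check-in."""
    utc_dt = datetime.strptime(utc_time, UTC_FORMAT)
    return utc_dt + timedelta(minutes=offset_minutes)

# Hypothetical sample values (offset of +240 minutes).
local = to_local_time('Tue Apr 03 18:00:06 +0000 2012', 240)
print(local)  # 2012-04-03 22:00:06
```

The offset is signed, so a negative value moves the local time before the UTC time and can change the calendar date.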

---

## 2. **dataset_TIST2015_POIs.txt**
This file contains metadata about the Points of Interest (POIs) and is structured into 5 columns:

1. **Venue ID**: The unique identifier for the venue (Foursquare).  
2. **Latitude**: The latitude coordinate of the venue.  
3. **Longitude**: The longitude coordinate of the venue.  
4. **Venue Category Name**: The name of the category associated with the venue (e.g., "Restaurant", "Museum").  
5. **Country Code**: A two-letter ISO 3166-1 alpha-2 country code indicating the country where the venue is located.  

---

## 3. **dataset_TIST2015_Cities.txt**
This file provides a list of cities and their geographic details. It is structured into 6 columns:

1. **City Name**: The name of the city.  
2. **Latitude**: The latitude coordinate of the city's center.  
3. **Longitude**: The longitude coordinate of the city's center.  
4. **Country Code**: A two-letter ISO 3166-1 alpha-2 country code indicating the country where the city is located.  
5. **Country Name**: The full name of the country.  
6. **City Type**: The classification of the city, such as "national capital" or "provincial capital".  

---

### Key Notes:
- The **Check-ins** file provides interaction data, which will be mapped to specific cities and POIs using the information from the **POIs** and **Cities** files.
- The **POIs** file links venue details (e.g., location and category) with their respective country codes.
- The **Cities** file is used to assign POIs to cities based on geographic proximity, given the latitude and longitude of both the venue and the city center.

These datasets form the foundation for processing and analysis in this project.


In [1]:
import math
import csv
from datetime import datetime, timedelta
from collections import defaultdict

In [5]:
def haversine(lat1, lon1, lat2, lon2):
    """Return the great-circle distance in kilometres between two points given in degrees."""
    # differences in latitude and longitude, converted to radians
    dLat = math.radians(lat2 - lat1)
    dLon = math.radians(lon2 - lon1)

    # convert the latitudes themselves to radians
    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)

    # apply the haversine formula
    a = (math.sin(dLat / 2) ** 2 +
         math.sin(dLon / 2) ** 2 * math.cos(lat1) * math.cos(lat2))
    rad = 6371  # mean Earth radius in kilometres
    c = 2 * math.asin(math.sqrt(a))
    return rad * c
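
A distance function like this one can be sanity-checked against cases whose answer is known in closed form. The sketch below restates the formula under the hypothetical name `haversine_km` so that it runs on its own. It assumes the same Earth radius of 6371 km used above:

```python
import math

EARTH_RADIUS_KM = 6371  # mean Earth radius, the value used in this notebook

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Identical points are zero kilometres apart.
assert haversine_km(35.0, 139.0, 35.0, 139.0) == 0.0
# 90 degrees of longitude along the equator is a quarter circle: R * pi / 2.
assert math.isclose(haversine_km(0, 0, 0, 90), EARTH_RADIUS_KM * math.pi / 2)
# The distance does not depend on the order of the two endpoints.
assert math.isclose(haversine_km(51.5, -0.1, 48.9, 2.4),
                    haversine_km(48.9, 2.4, 51.5, -0.1))
```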

In [6]:
cities_file = 'data/FoursquareGlobalCheckinDataset/dataset_TIST2015_Cities.txt'
pois_file = 'data/FoursquareGlobalCheckinDataset/dataset_TIST2015_POIs.txt'
checkins_file = 'data/FoursquareGlobalCheckinDataset/dataset_TIST2015_Checkins.txt'
target_countries = ['US', 'GB', 'JP']
target_cities = ['New York', 'London', 'Tokyo']
non_target_categories = ['Residential Building (Apartment / Condo)', 'College Residence Hall']

### Step 1: Process `dataset_TIST2015_Cities.txt`
- **Read the `dataset_TIST2015_Cities.txt` file** containing city information.
- **Create a dictionary** with the city name as the key and the coordinates (latitude and longitude) as the value.
- **Create a second dictionary** for countries, where the key is the country code (ISO 3166-1 alpha-2) and the value is a list of cities that belong to that country. Each city is added to the country’s list as it is read from the file.
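
The two lookups described above can be sketched on in-memory data. The rows below are hypothetical stand-ins for the six-column Cities layout, and `build_city_lookups` is an illustrative name. `defaultdict(list)` replaces the explicit "is the key present" check:

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-separated rows in the six-column Cities layout.
sample = (
    "Tokyo\t35.6895\t139.6917\tJP\tJapan\tNational and provincial capital\n"
    "Osaka\t34.6937\t135.5022\tJP\tJapan\tProvincial capital\n"
    "London\t51.5074\t-0.1278\tGB\tUnited Kingdom\tNational and provincial capital\n"
)

def build_city_lookups(lines):
    """Return ({city: (lat, lon)}, {country code: [city, ...]})."""
    city_coords = {}
    cities_by_country = defaultdict(list)
    for name, lat, lon, country_code, *_ in csv.reader(lines, delimiter='\t'):
        city_coords[name] = (float(lat), float(lon))
        cities_by_country[country_code].append(name)
    return city_coords, dict(cities_by_country)

city_coords, cities_by_country = build_city_lookups(io.StringIO(sample))
print(cities_by_country)  # {'JP': ['Tokyo', 'Osaka'], 'GB': ['London']}
```

Because the city name is the dictionary key, two cities that share a name would overwrite each other in the coordinate lookup. Keying on `(country_code, name)` is one way to avoid that if it matters for the chosen countries.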

In [7]:
def process_cities(file_path):
    city_dict = {}
    country_dict = {}
    with open(file_path, 'r') as file:
        reader = csv.reader(file, delimiter='\t')
        for row in reader:
            city_name, lat, lon, country_code, *_ = row
            lat, lon = float(lat), float(lon)
            city_dict[city_name] = (lat, lon) # city dictionary: {city name: (latitude, longitude)}
            if country_code not in country_dict:
                country_dict[country_code] = []
            country_dict[country_code].append(city_name) # country dictionary: {country code: list of cities in that country}
    return city_dict, country_dict

city_dict, country_dict = process_cities(cities_file)

### Step 2: Process `dataset_TIST2015_POIs.txt`
- **Read the `dataset_TIST2015_POIs.txt` file** containing the information about the POIs.
- For each **Point of Interest**, use the country code to find the cities belonging to that country using the dictionary created in Step 1.
- **Calculate the distance** between the POI and all cities in the country and select the city with the shortest distance. This allows us to determine which city the POI belongs to.
- Assign a **new ID** to each POI (as an integer) and create a new file containing the following tab-separated fields:
  `<Old_Foursquare_ID> <New_ID> <Latitude> <Longitude> <Country_Code> <City>`
- **Filter out POIs** applying the following filters:
    1. **Country Filter**: Only POIs located in the target countries (US, GB, JP) are included.
    2. **City Filter**: Only POIs located in the target cities (New York, London, Tokyo) are included.
    3. **Category Filter**: POIs belonging to the categories "Residential Building (Apartment / Condo)" and "College Residence Hall" are excluded.

The new ID is a sequential integer starting at 1, used in place of the original Foursquare ID (a string). Integer IDs are more compact and are generally easier to work with in later processing.
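
The nearest-city assignment, the filters, and the ID numbering can be sketched together on a few hypothetical POIs. All venue IDs, coordinates, and helper names below are made up for illustration. The distance helper is a compact restatement of the haversine formula so that the block runs on its own:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (Earth radius 6371 km)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

def nearest_city(lat, lon, city_names, city_coords):
    """Return the city whose center is closest to (lat, lon), or None if there are no candidates."""
    return min(
        city_names,
        key=lambda name: haversine_km(lat, lon, *city_coords[name]),
        default=None,
    )

# Hypothetical lookups and POI rows: (venue ID, lat, lon, category, country code).
city_coords = {'Tokyo': (35.6895, 139.6917), 'Osaka': (34.6937, 135.5022)}
cities_by_country = {'JP': ['Tokyo', 'Osaka']}
sample_pois = [
    ('venue_a', 35.66, 139.70, 'Ramen Restaurant', 'JP'),        # near Tokyo
    ('venue_b', 34.70, 135.50, 'Train Station', 'JP'),           # near Osaka
    ('venue_c', 35.70, 139.75, 'College Residence Hall', 'JP'),  # excluded category
]
target_cities = {'Tokyo'}
excluded_categories = {'College Residence Hall'}

kept, id_map = [], {}
for venue_id, lat, lon, category, country in sample_pois:
    city = nearest_city(lat, lon, cities_by_country.get(country, []), city_coords)
    if city in target_cities and category not in excluded_categories:
        id_map[venue_id] = len(id_map) + 1  # sequential integer IDs starting at 1
        kept.append((venue_id, id_map[venue_id], lat, lon, country, city))

print(kept)  # [('venue_a', 1, 35.66, 139.7, 'JP', 'Tokyo')]
```

Only POIs that pass every filter consume an ID, so the new IDs stay contiguous.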

In [8]:
def process_pois(file_path, city_dict, country_dict, target_countries, target_cities, non_target_categories):
    pois = []
    map_id = {}
    new_id = 1
    with open(file_path, 'r') as file:
        reader = csv.reader(file, delimiter='\t')
        for row in reader:
            venue_id, lat, lon, category, country_code = row
            lat, lon = float(lat), float(lon)
            if country_code in target_countries:
                cities = country_dict.get(country_code, [])  # empty if the Cities file lists no city for this country
                min_distance = float('inf')
                closest_city = None
                for city in cities:
                    city_lat, city_lon = city_dict[city]
                    distance = haversine(lat, lon, city_lat, city_lon)
                    if distance < min_distance:
                        min_distance = distance
                        closest_city = city
                if closest_city:
                    if closest_city in target_cities and category not in non_target_categories:
                        pois.append((venue_id, new_id, lat, lon, country_code, closest_city))
                        map_id[venue_id] = new_id
                        new_id += 1
                        
    return pois, map_id

pois, map_id = process_pois(pois_file, city_dict, country_dict, target_countries, target_cities, non_target_categories)
poi_dict = {poi[0]: poi for poi in pois}

The filtered POIs are saved as tab-separated values at `data/FoursquareProcessed/filtered_pois.txt`. The file contains the following columns:

1. **Old Foursquare ID**: The original identifier for the POI from Foursquare.
2. **New ID**: A new integer identifier assigned to the POI for easier processing.
3. **Latitude**: The latitude coordinate of the POI.
4. **Longitude**: The longitude coordinate of the POI.
5. **Country Code**: The two-letter ISO 3166-1 alpha-2 country code indicating the country where the POI is located.
6. **City**: The name of the city to which the POI belongs.

This file is essential for mapping user check-ins to specific POIs and their corresponding cities.
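
When this file is read back, a named record can make the six columns easier to work with than positional indices such as `poi[5]`. The following is a sketch only. `FilteredPoi` and `parse_poi_line` are hypothetical names, and the sample line is made up:

```python
from typing import NamedTuple

class FilteredPoi(NamedTuple):
    foursquare_id: str
    new_id: int
    latitude: float
    longitude: float
    country_code: str
    city: str

def parse_poi_line(line: str) -> FilteredPoi:
    """Parse one tab-separated line of the filtered POI file."""
    old_id, new_id, lat, lon, country, city = line.rstrip('\n').split('\t')
    return FilteredPoi(old_id, int(new_id), float(lat), float(lon), country, city)

# Hypothetical line in the six-column layout described above.
poi = parse_poi_line("venue_a\t1\t35.66\t139.7\tJP\tTokyo\n")
print(poi.city, poi.new_id)  # Tokyo 1
```

A `NamedTuple` still unpacks and indexes like a plain tuple, so code that uses positions keeps working.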

In [9]:
def save_pois(pois, output_file):
    with open(output_file, 'w') as file:
        for poi in pois:
            file.write("\t".join(map(str, poi)) + "\n")

save_pois(pois, 'data/FoursquareProcessed/filtered_pois.txt')

In [12]:
def save_venue_id_mappings(map_id, output_file):
    with open(output_file, 'w') as file:
        for old, new in map_id.items():
            file.write(f"{old}\t{new}\n")

save_venue_id_mappings(map_id, 'data/FoursquareProcessed/venue_id_mappings.txt')



### Step 3: Process `dataset_TIST2015_Checkins.txt`
- **Read the `dataset_TIST2015_Checkins.txt` file** containing user check-in data.
- For each check-in, determine the corresponding **city** by mapping the POI to a city.
- Use the **new IDs** (assigned in Step 2) rather than Foursquare's IDs.
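- **Convert the check-in timestamps**: shift each UTC time by its timezone offset to get the local time, and store the result as an integer timestamp.
- **Skip check-ins** whose venue is not among the POIs kept in Step 2, that is, venues outside New York, London, and Tokyo or in an excluded category.
- **Output** two files for each city (New York, London, Tokyo): one with the individual check-ins and one with per-user visit counts for each POI.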
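
The mapping and filtering steps can be sketched on a few hypothetical rows. The lookup `poi_lookup` and all IDs are made up, and the timestamp conversion is left out to keep the example short:

```python
from collections import defaultdict

# Hypothetical lookup from Foursquare venue ID to (new integer ID, city).
poi_lookup = {'venue_a': (1, 'Tokyo'), 'venue_b': (2, 'London')}

# Hypothetical check-ins: (user ID, Foursquare venue ID).
rows = [
    ('u1', 'venue_a'),
    ('u2', 'venue_x'),  # not among the kept POIs
    ('u1', 'venue_b'),
]

per_city = defaultdict(list)
for user_id, venue_id in rows:
    match = poi_lookup.get(venue_id)
    if match is None:
        continue  # the venue was filtered out earlier, so the check-in is skipped
    new_id, city = match
    per_city[city].append((user_id, new_id))

print(dict(per_city))  # {'Tokyo': [('u1', 1)], 'London': [('u1', 2)]}
```

A single dictionary lookup per check-in handles both the ID translation and the city filter.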

In [13]:
def process_checkins(file_path, target_cities, output_files, output_files_agg):
    
    city_files = {city: open(output_files[city], 'w') for city in target_cities}
    city_agg_files = {city: open(output_files_agg[city], 'w') for city in target_cities}
    user_scores = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))  # {user_id: {city: {poi_id: score}}}

    with open(file_path, 'r') as file:
        reader = csv.reader(file, delimiter='\t')
        for row in reader:
            user_id, venue_id, utc_time, offset = row
            if venue_id in poi_dict:
                poi = poi_dict[venue_id]
                city = poi[5]
                if city in target_cities:

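                    # parse the UTC time, then shift it by the offset (in minutes) to get local wall-clock time
                    utc_dt = datetime.strptime(utc_time, '%a %b %d %H:%M:%S +0000 %Y')
                    local_dt = utc_dt + timedelta(minutes=int(offset))
                    # seconds since the epoch for that wall-clock time; computed by subtraction because
                    # .timestamp() on a naive datetime depends on the timezone of the machine running the code
                    timestamp = int((local_dt - datetime(1970, 1, 1)).total_seconds())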

                    city_files[city].write(f"{user_id}\t{poi[1]}\t{timestamp}\n")

                    user_scores[user_id][city][poi[1]] += 1

    for file in city_files.values():
        file.close()

    for user_id, city_visits in user_scores.items():
        for city, poi_visits in city_visits.items():
            for poi_id, score in poi_visits.items():
                city_agg_files[city].write(f"{user_id}\t{poi_id}\t{score}\n")

    for file in city_agg_files.values():
        file.close()

output_files = {city: f"data/FoursquareProcessed/{city}_checkins.txt" for city in target_cities}
output_files_agg = {city: f"data/FoursquareProcessed/{city}_checkins_agg.txt" for city in target_cities}
process_checkins(checkins_file, target_cities, output_files, output_files_agg)
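
One possible alternative for handling the per-city output files is `contextlib.ExitStack`, which closes every registered file even if an error occurs partway through. This is a sketch, not the notebook's method. `write_per_city` is a hypothetical helper, the rows are made up, and a temporary directory stands in for `data/FoursquareProcessed`:

```python
import os
import tempfile
from contextlib import ExitStack

def write_per_city(rows, cities, out_dir):
    """Write (city, user ID, POI ID) rows to one tab-separated file per city."""
    with ExitStack() as stack:
        # Each file entered on the stack is closed when the with-block exits,
        # including when a write raises an exception.
        handles = {
            city: stack.enter_context(
                open(os.path.join(out_dir, f"{city}_checkins.txt"), 'w'))
            for city in cities
        }
        for city, user_id, poi_id in rows:
            handles[city].write(f"{user_id}\t{poi_id}\n")

cities = ['New York', 'London', 'Tokyo']
rows = [('Tokyo', 'u1', 1), ('London', 'u2', 2), ('Tokyo', 'u3', 1)]

with tempfile.TemporaryDirectory() as out_dir:
    write_per_city(rows, cities, out_dir)
    with open(os.path.join(out_dir, 'Tokyo_checkins.txt')) as f:
        tokyo_lines = f.read().splitlines()

print(tokyo_lines)  # ['u1\t1', 'u3\t1']
```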


The script processes check-in data and generates two types of files for each target city:

1. **Raw Check-ins Files**:
   - Location: `data/FoursquareProcessed/<city>_checkins.txt`
   - Content:
     - **User ID**: Anonymized identifier for the user.
     - **POI ID**: Numeric identifier for the Point of Interest (POI) visited by the user.
     - **Timestamp**: The check-in's local time (UTC time plus the timezone offset), stored as an integer UNIX-style timestamp.

2. **Aggregated Check-ins Files**:
   - Location: `data/FoursquareProcessed/<city>_checkins_agg.txt`
   - Content:
     - **User ID**: Anonymized identifier for the user.
     - **POI ID**: Numeric identifier for the POI.
     - **Score**: Frequency of visits by the user to the POI, representing their interaction level.

These files are useful for analyzing individual check-ins as well as aggregated user preferences for specific POIs within each city.
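
The relationship between the two file types can be illustrated with `collections.Counter`: the score is the number of raw check-ins that share the same user and POI. The pairs below are hypothetical:

```python
from collections import Counter

# Hypothetical raw check-ins for one city: (user ID, POI ID).
checkins = [('u1', 7), ('u1', 7), ('u1', 9), ('u2', 7)]

# Count how often each (user, POI) pair occurs.
scores = Counter(checkins)

for (user_id, poi_id), score in sorted(scores.items()):
    print(f"{user_id}\t{poi_id}\t{score}")
```

Here user `u1` visited POI 7 twice, so that pair gets a score of 2 and every other pair gets a score of 1.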
