# Lab 7 - Barcelona Bike Sharing Station Criticality Analysis

This notebook analyzes historical data about Barcelona bike sharing stations to identify the most "critical" timeslot for each station and generates a KML file for map visualization.

## Objectives:

1. **Analyze station criticality** by timeslot (day of week + hour)
2. **Calculate criticality values** as ratio of full station readings to total readings
3. **Filter by minimum criticality threshold** (command line argument)
4. **Select most critical timeslot** for each station
5. **Generate KML output** for map visualization

## Input Files:

1. **register.csv**: Historical station data
   - Format: `stationId\ttimestamp\tusedslots\tfreeslots`
   - Example: `23 2008-05-15 19:01:00 5 13`

2. **stations.csv**: Station location data
   - Format: `stationId\tlongitude\tlatitude\tname`
   - Example: `1 2.180019 41.397978 Gran Via Corts Catalanes`

## Criticality Definition:
A station is "critical" when free_slots = 0 (station is full).

Criticality = (Number of readings with free_slots = 0) / (Total readings for station-timeslot pair)
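
As a quick worked example of the ratio above, the sketch below computes the criticality of one station-timeslot pair from a handful of made-up readings. The `criticality` helper and the `readings` list are illustrative names and values, not part of the lab code or the dataset.

```python
from typing import List, Tuple


def criticality(readings: List[Tuple[int, int]]) -> float:
    """Ratio of full readings (free_slots == 0) to all readings.

    Each reading is a (used_slots, free_slots) tuple.
    """
    if not readings:
        raise ValueError("at least one reading is required")
    full = sum(1 for _, free_slots in readings if free_slots == 0)
    return full / len(readings)


# Hypothetical readings for one station in one timeslot
readings = [(18, 0), (15, 3), (18, 0), (10, 8)]

# Two of the four readings have no free slots
print(criticality(readings))  # 0.5
```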

## Import libraries and configuration

In [None]:
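from typing import Tuple, Dict, List
from pyspark import SparkConf, SparkContext
from datetime import datetime
import sys

# Reuse the SparkContext the notebook kernel provides, or create one if none exists
# (the application name "Lab7" is an arbitrary choice)
sc = SparkContext.getOrCreate(SparkConf().setAppName("Lab7"))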

## Parameters configuration

In [None]:
# Configuration of paths and parameters
registerPath = "sampleData/registerSample.csv"  # For local testing
stationsPath = "sampleData/stations.csv"        # For local testing
# registerPath = "/data/students/bigdata-01QYD/Lab7/register.csv"  # For HDFS
# stationsPath = "/data/students/bigdata-01QYD/Lab7/stations.csv"  # For HDFS

outputPath = "critical_stations_output/"
kmlOutputPath = "critical_stations.kml"

# Minimum criticality threshold (can be set as command line argument)
minCriticalityThreshold = 0.3  # Example: 30%
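
The threshold above is hardcoded. A minimal sketch of reading it from the command line could look like the following, assuming the threshold is passed as the first argument when the notebook is exported to a script. The helper name `threshold_from_args` and the fallback behavior are assumptions, not part of the lab specification.

```python
import sys
from typing import List


def threshold_from_args(argv: List[str], default: float = 0.3) -> float:
    """Return the threshold given in argv[1], or `default` if it is missing or invalid."""
    if len(argv) < 2:
        return default
    try:
        value = float(argv[1])
    except ValueError:
        # Inside a notebook, argv[1] is often a kernel flag rather than a number
        return default
    # Criticality is a ratio, so only values in [0, 1] make sense
    return value if 0.0 <= value <= 1.0 else default


minCriticalityThreshold = threshold_from_args(sys.argv)
```

Falling back to the default keeps the notebook runnable interactively, where `sys.argv` holds kernel arguments instead of user input.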

## Read and preprocess station data

In [None]:
# Read stations file
stationsRDD = sc.textFile(stationsPath)

# Remove header
stationsHeader = stationsRDD.first()

stationsDataRDD = stationsRDD.filter(lambda line: line != stationsHeader)

def parse_station_line(line: str) -> Tuple[int, Tuple[float, float, str]]:
    """
    Parse station line: id\tlongitude\tlatitude\tname
    Returns (stationId, (longitude, latitude, name))
    """
    try:
        fields = line.split('\t')
        if len(fields) >= 4:
            station_id = int(fields[0].strip())
            longitude = float(fields[1].strip())
            latitude = float(fields[2].strip())
            name = fields[3].strip()
            return (station_id, (longitude, latitude, name))
        else:
            return None
    except ValueError:  # non-numeric id or coordinates
        return None

# Parse station data into (stationId, (longitude, latitude, name)) pairs for the later join
stationsLookupRDD = stationsDataRDD.map(parse_station_line).filter(lambda x: x is not None)

# Cache for joins
stationsLookupRDD.cache()

## Read and preprocess register data

In [None]:
# Read register file
registerRDD = sc.textFile(registerPath)

# Remove header
registerHeader = registerRDD.first()

registerDataRDD = registerRDD.filter(lambda line: line != registerHeader)

def parse_register_line(line: str) -> Tuple[int, str, int, int]:
    """
    Parse register line: stationId\ttimestamp\tusedslots\tfreeslots
    Returns (stationId, timestamp, usedSlots, freeSlots)
    Filter out invalid data (used=0 and free=0)
    """
    try:
        fields = line.split('\t')
        if len(fields) >= 4:
            station_id = int(fields[0].strip())
            timestamp = fields[1].strip()
            used_slots = int(fields[2].strip())
            free_slots = int(fields[3].strip())
            
            # Filter out invalid readings (both used and free = 0)
            if used_slots == 0 and free_slots == 0:
                return None
                
            return (station_id, timestamp, used_slots, free_slots)
        else:
            return None
    except ValueError:  # non-numeric id or slot counts
        return None

# Parse register data
validRegisterRDD = registerDataRDD.map(parse_register_line).filter(lambda x: x is not None)

# Cache for multiple operations
validRegisterRDD.cache()

## Extract timeslots (day of week + hour) from timestamps

In [None]:
def extract_timeslot(timestamp: str) -> Tuple[str, int]:
    """
    Extract day of week and hour from timestamp
    Returns (dayOfWeek, hour)
    """
    try:
        # Parse timestamp: "2008-05-15 12:01:00"
        datetime_obj = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        day_of_week = datetime_obj.strftime("%a")  # Mon, Tue, Wed, etc.
        hour = datetime_obj.hour
        return (day_of_week, hour)
    except ValueError:  # timestamp does not match the expected format
        return None

def create_station_timeslot_data(reading: Tuple[int, str, int, int]):
    """
    Convert reading to ((stationId, timeslot), (isCritical, 1))
    where isCritical = 1 if free_slots == 0, else 0
    """
    station_id, timestamp, used_slots, free_slots = reading
    
    timeslot = extract_timeslot(timestamp)
    if timeslot is None:
        return None
    
    day_of_week, hour = timeslot
    is_critical = 1 if free_slots == 0 else 0
    
    return ((station_id, (day_of_week, hour)), (is_critical, 1))

# Transform readings to station-timeslot pairs with criticality info
stationTimeslotRDD = validRegisterRDD.map(create_station_timeslot_data).filter(lambda x: x is not None)
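
The timeslot extraction can be checked outside Spark. The sketch below is a variant of `extract_timeslot` that takes the day name from a fixed tuple instead of `strftime("%a")`, whose output depends on the active locale. The names `DAY_NAMES` and `timeslot_of` are illustrative and do not appear in the lab code.

```python
from datetime import datetime
from typing import Tuple

# datetime.weekday() returns 0 for Monday through 6 for Sunday
DAY_NAMES = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def timeslot_of(timestamp: str) -> Tuple[str, int]:
    """Return (day_of_week, hour) for a 'YYYY-MM-DD HH:MM:SS' timestamp."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return (DAY_NAMES[dt.weekday()], dt.hour)


# The sample timestamp from the register file description
print(timeslot_of("2008-05-15 19:01:00"))  # ('Thu', 19)
```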

## Calculate criticality for each station-timeslot pair

In [None]:
# Aggregate data: sum critical readings and total readings for each (station, timeslot)
aggregatedRDD = stationTimeslotRDD.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))

def calculate_criticality(data):
    """
    Calculate criticality ratio from aggregated data
    Returns ((stationId, timeslot), criticalityValue)
    """
    (station_id, timeslot), (critical_count, total_count) = data
    
    if total_count == 0:
        return None
    
    criticality = critical_count / total_count
    return ((station_id, timeslot), criticality)

# Calculate criticality for each station-timeslot pair
criticalityRDD = aggregatedRDD.map(calculate_criticality).filter(lambda x: x is not None)

# Cache for multiple operations
criticalityRDD.cache()
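
To make the aggregation step concrete, here is a plain-Python sketch of what `reduceByKey` followed by the ratio computation does, using a few made-up `((stationId, timeslot), (isCritical, 1))` pairs. It only mimics the Spark logic on a local list; the sample values are invented.

```python
from collections import defaultdict

# Hypothetical output of the timeslot extraction step
pairs = [
    ((23, ("Thu", 19)), (1, 1)),
    ((23, ("Thu", 19)), (0, 1)),
    ((23, ("Thu", 19)), (1, 1)),
    ((23, ("Fri", 8)), (0, 1)),
]

# Sum critical readings and total readings per key, as reduceByKey would
totals = defaultdict(lambda: (0, 0))
for key, (critical, count) in pairs:
    critical_sum, count_sum = totals[key]
    totals[key] = (critical_sum + critical, count_sum + count)

# Criticality = critical readings / total readings
criticality_by_key = {
    key: critical_sum / count_sum
    for key, (critical_sum, count_sum) in totals.items()
}

for key, value in sorted(criticality_by_key.items()):
    print(key, round(value, 3))
```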

## Filter by minimum criticality threshold

In [None]:
# Filter pairs with criticality >= threshold
filteredCriticalityRDD = criticalityRDD.filter(lambda x: x[1] >= minCriticalityThreshold)

# Cache for multiple operations
filteredCriticalityRDD.cache()

filteredCount = filteredCriticalityRDD.count()

## Select most critical timeslot for each station

In [None]:
if filteredCount > 0:
    # Transform to (stationId, (timeslot, criticality)) for grouping by station
    stationGroupedRDD = filteredCriticalityRDD.map(lambda x: (x[0][0], (x[0][1], x[1])))
    
    def select_most_critical_timeslot(timeslots_data):
        """
        Select most critical timeslot for a station
        If tied, select earliest hour, then lexicographical day order
        """
        station_id, timeslots = timeslots_data
        timeslots_list = list(timeslots)
        
        if not timeslots_list:
            return None
        
        # Sort by: 1) criticality (desc), 2) hour (asc), 3) day (lexicographical)
        def sort_key(item):
            (day, hour), criticality = item
            return (-criticality, hour, day)  # negative criticality for descending order
        
        sorted_timeslots = sorted(timeslots_list, key=sort_key)
        most_critical = sorted_timeslots[0]
        
        (day, hour), criticality = most_critical
        return (station_id, (day, hour, criticality))
    
    # Group by station and select most critical timeslot
    mostCriticalRDD = stationGroupedRDD.groupByKey().map(select_most_critical_timeslot).filter(lambda x: x is not None)
    
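    # Cache for output operations
    mostCriticalRDD.cache()

    # Count stations so the next cell knows whether there is anything to join
    stationsWithCriticalTimeslots = mostCriticalRDD.count()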
    
else:
    print("No data to process for most critical timeslot selection")
    mostCriticalRDD = sc.emptyRDD()
    stationsWithCriticalTimeslots = 0
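
The tie-breaking rule (highest criticality, then earliest hour, then lexicographical day) can also be written as a pairwise comparison. The sketch below applies it to made-up candidates for a single station with `functools.reduce`; `more_critical` and `Candidate` are hypothetical names.

```python
from functools import reduce
from typing import Tuple

Candidate = Tuple[Tuple[str, int], float]  # ((day, hour), criticality)


def more_critical(a: Candidate, b: Candidate) -> Candidate:
    """Return the preferred candidate: higher criticality, then earlier hour, then day."""
    def key(item: Candidate):
        (day, hour), crit = item
        return (-crit, hour, day)  # negated so that higher criticality sorts first

    return a if key(a) <= key(b) else b


# Hypothetical timeslots of one station; three of them tie on criticality
candidates = [
    (("Wed", 9), 0.5),
    (("Mon", 9), 0.5),
    (("Fri", 10), 0.5),
    (("Tue", 8), 0.4),
]

best = reduce(more_critical, candidates)
print(best)  # (('Mon', 9), 0.5)
```

Because the comparison picks the minimum of a total order, a function like this could likely be passed to `reduceByKey` on `stationGroupedRDD` in place of the `groupByKey` step, which would avoid collecting all timeslots of a station in one place. That substitution is a suggestion, not what the notebook does.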

## Join with station location data and generate KML

In [None]:
if stationsWithCriticalTimeslots > 0:
    # Join most critical timeslots with station location data
    # mostCriticalRDD: (stationId, (day, hour, criticality))
    # stationsLookupRDD: (stationId, (longitude, latitude, name))
    joinedRDD = mostCriticalRDD.join(stationsLookupRDD)
    
    def generate_kml_placemark(data):
        """
        Generate KML Placemark for a station
        """
        station_id, ((day, hour, criticality), (longitude, latitude, name)) = data
        
        # Format KML Placemark
        kml_line = (
            f'<Placemark><name>{station_id}</name><ExtendedData>'
            f'<Data name="DayWeek"><value>{day}</value></Data>'
            f'<Data name="Hour"><value>{hour}</value></Data>'
            f'<Data name="Criticality"><value>{criticality}</value></Data></ExtendedData>'
            f'<Point><coordinates>{longitude},{latitude}</coordinates></Point></Placemark>'
        )
        return kml_line
    
    # Generate KML placemarks
    kmlRDD = joinedRDD.map(generate_kml_placemark)
    
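    # Coalesce to single partition for single output file
    singlePartitionKmlRDD = kmlRDD.coalesce(1)

    # Count entries so the save cell knows whether there is anything to write
    kmlEntries = singlePartitionKmlRDD.count()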
        
else:
    print("No data available for KML generation")
    kmlRDD = sc.emptyRDD()
    kmlEntries = 0

## Save results to output files

In [None]:
if kmlEntries > 0:
    try:
        # Save KML data
        singlePartitionKmlRDD.saveAsTextFile(outputPath)
        print(f"\nKML data saved successfully to: {outputPath}")
    except Exception as e:
        print(f"\nError saving to {outputPath}: {e}")
        print("Note: You may need to delete the output folder if it already exists")
else:
    print("\nNo KML data to save")

## Sample KML file structure

To create a complete KML file for map visualization, wrap the generated placemarks in this structure:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
  <name>Barcelona Bike Sharing Critical Stations</name>
  <description>Stations with critical timeslots (high occupancy)</description>
  
  <!-- Insert generated placemarks here -->
  
</Document>
</kml>
```

Each placemark contains:
- **Station ID** as name
- **Day of week** and **hour** of most critical timeslot
- **Criticality value** (0.0 to 1.0)
- **GPS coordinates** for map positioning
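
The wrapping step can be automated. The following is a minimal sketch, assuming the placemarks are available as a list of strings, one per line of the saved output. The `build_kml` helper and the sample placemark values are illustrative only.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

KML_HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
    '<Document>\n'
    '  <name>Barcelona Bike Sharing Critical Stations</name>\n'
    '  <description>Stations with critical timeslots (high occupancy)</description>'
)
KML_FOOTER = '</Document>\n</kml>'


def build_kml(placemarks: Iterable[str]) -> str:
    """Wrap placemark lines in the KML document structure shown above."""
    body = "\n".join(p.strip() for p in placemarks if p.strip())
    return f"{KML_HEADER}\n{body}\n{KML_FOOTER}\n"


# Hypothetical placemark in the format produced by generate_kml_placemark
sample = (
    '<Placemark><name>1</name><ExtendedData>'
    '<Data name="DayWeek"><value>Thu</value></Data>'
    '<Data name="Hour"><value>19</value></Data>'
    '<Data name="Criticality"><value>0.5</value></Data></ExtendedData>'
    '<Point><coordinates>2.180019,41.397978</coordinates></Point></Placemark>'
)

kml_text = build_kml([sample])

# Sanity check: the result parses as well-formed XML
root = ET.fromstring(kml_text.encode("utf-8"))
placemark_count = len(root.findall(".//{http://www.opengis.net/kml/2.2}Placemark"))
print(placemark_count)  # 1
```

In practice the placemark lines would come from the part file that `saveAsTextFile` writes under `outputPath` (typically named `part-00000` when there is a single partition, though this is worth verifying in your environment), and `kml_text` could then be written to `kmlOutputPath`.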

**Visualization tools:**
- [KML Viewer](https://kmlviewer.nsspot.net)
- [GPS Visualizer](https://www.gpsvisualizer.com)
- Google Earth
- Google My Maps