# Naolib API Data Collector

This notebook collects real-time waiting times from the Naolib API (Nantes public transportation) for a set of key stops and sends each response to a Kafka topic for further processing.

In [2]:
import requests
from kafka import KafkaProducer
import json
import time
from datetime import datetime

## Configure Kafka Producer

First, let's set up our Kafka producer to send the data we collect:

In [3]:
# Kafka configuration
kafka_config = {
    'bootstrap_servers': 'kafka:9092',
}

# Initialize Kafka Producer
producer = KafkaProducer(
    bootstrap_servers=kafka_config['bootstrap_servers'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
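To make the `value_serializer` concrete, the sketch below applies the same lambda outside of Kafka and shows the bytes a consumer would receive. The sample record is made up for illustration. Note that `json.dumps` escapes non-ASCII characters by default, so an accented stop name such as Haluchère travels as a `\u00e8` escape and decodes back unchanged.

```python
import json

# Same serializer as the one passed to KafkaProducer above
serialize = lambda v: json.dumps(v).encode('utf-8')

# Hypothetical record, for illustration only
sample = {"stop_code": "HDMN", "stop_name": "Haluchère"}

payload = serialize(sample)                     # bytes, as written to the topic
decoded = json.loads(payload.decode('utf-8'))   # what a consumer would do

print(payload)
print(decoded == sample)
```

If readable UTF-8 on the wire is preferred, passing `ensure_ascii=False` to `json.dumps` is one option; the consumer-side decoding stays the same.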

## Define Key Stops

Let's define a list of important stops in Nantes for our analysis:

In [4]:
# Define key stops (code, name)
key_stops = [
    ("COMM", "Commerce"),         # Central hub
    ("CTRE", "Chantrerie-Grandes Ecoles"),  # University area
    ("CRQU", "Place du Cirque"),  # City center
    ("HDMN", "Haluchère"),        # Connection to tram line
    ("GSEV", "Gare SNCF Sud")     # Main train station
]

## Define Data Collection Function

Create a function to collect data from the API and send it to Kafka:

In [5]:
def collect_and_send_to_kafka(stop_code, stop_name, topic_name="naolib_realtime"):
    """
    Collect waiting time data from Naolib API for a specific stop and send it to Kafka
    """
    # API URL - Get waiting times for the stop with up to 20 arrivals
    url = f"https://open.tan.fr/ewp/tempsattentelieu.json/{stop_code}/20"
    
    try:
        # Request data from API (the timeout keeps a stalled connection from hanging the loop)
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            data = response.json()
            
            # Add metadata
            timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            enriched_data = {
                "timestamp": timestamp,
                "stop_code": stop_code,
                "stop_name": stop_name,
                "arrivals": data
            }
            
            # Send to Kafka
            producer.send(topic_name, value=enriched_data)
            producer.flush()
            
            print(f"Data for {stop_name} ({stop_code}) sent to Kafka topic '{topic_name}'")
            return True
        else:
            print(f"Failed to fetch data for {stop_name} ({stop_code}): HTTP {response.status_code}")
            return False
    except Exception as e:
        print(f"Error collecting data for {stop_name} ({stop_code}): {str(e)}")
        return False
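One possible refinement, sketched below, is to separate the enrichment step from the network and Kafka calls so it can be checked without either service. `build_record` is a hypothetical helper, not part of the notebook. It also assumes that a timezone-aware UTC timestamp in ISO 8601 form is acceptable downstream, whereas the function above records naive local time.

```python
from datetime import datetime, timezone

def build_record(stop_code, stop_name, arrivals, now=None):
    """Wrap raw API arrivals with the same metadata keys used above.

    `now` is injectable so the function is deterministic in tests.
    It defaults to the current UTC time (an assumption: the original
    cell uses naive local time instead).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return {
        "timestamp": now.isoformat(timespec="seconds"),
        "stop_code": stop_code,
        "stop_name": stop_name,
        "arrivals": arrivals,
    }

# Fixed clock value so the result is reproducible
fixed = datetime(2025, 3, 25, 0, 2, 4, tzinfo=timezone.utc)
record = build_record("COMM", "Commerce", [], now=fixed)
print(record["timestamp"])  # 2025-03-25T00:02:04+00:00
```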

## Manual Test for a Single Stop

Let's test our function with a single stop:

In [6]:
# Test with Commerce stop
test_stop_code, test_stop_name = key_stops[0]
success = collect_and_send_to_kafka(test_stop_code, test_stop_name)

if success:
    print("Test successful!")
else:
    print("Test failed!")

Data for Commerce (COMM) sent to Kafka topic 'naolib_realtime'
Test successful!


## Continuous Data Collection Loop

Now, let's set up a loop to continuously collect data from all our key stops:

In [7]:
# Run for a specific duration (in minutes) or until interrupted
duration_minutes = 60  # Change this as needed
interval_seconds = 30  # Pause between collection cycles (after all stops have been polled)

start_time = datetime.now()
end_time = start_time.timestamp() + (duration_minutes * 60)

try:
    while datetime.now().timestamp() < end_time:
        # Collect data for each stop
        for stop_code, stop_name in key_stops:
            collect_and_send_to_kafka(stop_code, stop_name)
            time.sleep(2)  # Brief pause between stops to avoid API rate limits
        
        # Wait for the next collection cycle
        print(f"Completed collection cycle at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        print(f"Waiting {interval_seconds} seconds for next cycle...")
        time.sleep(interval_seconds)
        
except KeyboardInterrupt:
    print("Data collection interrupted by user.")
finally:
    print("Closing Kafka producer...")
    producer.close()
    print("Data collection complete.")
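As written, the loop sleeps for the full `interval_seconds` after polling finishes, so one cycle lasts roughly the interval plus the five 2-second pauses plus request latency. If a fixed cadence matters, one option is to subtract the time already spent in the cycle from the sleep. The helper below is a hypothetical sketch of that arithmetic, not part of the notebook.

```python
import time

def seconds_until_next_cycle(cycle_start, interval_seconds, now):
    """Return how long to sleep so cycles start every `interval_seconds`.

    All arguments are seconds on the same clock. If the cycle already
    overran the interval, return 0 instead of a negative value.
    """
    elapsed = now - cycle_start
    return max(0.0, interval_seconds - elapsed)

# Worked example: a cycle that started at t=100 s and finished at t=112.5 s
print(seconds_until_next_cycle(100.0, 30, 112.5))  # 17.5

# In the loop, record the start with a monotonic clock and sleep the remainder
cycle_start = time.monotonic()
# ... poll every stop here ...
pause = seconds_until_next_cycle(cycle_start, 30, time.monotonic())
```

`time.monotonic()` is used here because it is not affected by system clock adjustments; the returned `pause` would then be passed to `time.sleep`.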

Data for Commerce (COMM) sent to Kafka topic 'naolib_realtime'
Data for Chantrerie-Grandes Ecoles (CTRE) sent to Kafka topic 'naolib_realtime'
Data for Place du Cirque (CRQU) sent to Kafka topic 'naolib_realtime'
Data for Haluchère (HDMN) sent to Kafka topic 'naolib_realtime'
Data for Gare SNCF Sud (GSEV) sent to Kafka topic 'naolib_realtime'
Completed collection cycle at 2025-03-25 00:02:04
Waiting 30 seconds for next cycle...
Data collection interrupted by user.
Closing Kafka producer...
Data collection complete.


## Analyze Collected Data Structure

Let's look at the structure of the data we're collecting:

In [None]:
# Fetch a sample of data to analyze its structure (up to 5 arrivals at Commerce)
sample_url = "https://open.tan.fr/ewp/tempsattentelieu.json/COMM/5"
response = requests.get(sample_url, timeout=10)

if response.status_code == 200:
    data = response.json()
    
    # Print the JSON structure
    print(json.dumps(data, indent=2))
    
    # Print the important fields we'll use for analysis
    print("\nKey data points for analysis:")
    for item in data:
        line_num = item.get('ligne', {}).get('numLigne', 'Unknown')
        terminus = item.get('terminus', 'Unknown')
        wait_time = item.get('temps', 'Unknown')
        # 'tempsReel' arrives as the string "true"/"false", not a JSON boolean
        is_real_time = item.get('tempsReel', 'Unknown')
        
        print(f"Line {line_num} to {terminus}: Wait time {wait_time} (Real-time: {is_real_time})")
else:
    print(f"Failed to fetch sample data: HTTP {response.status_code}")
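In the sample response for this cell, `temps` is a string such as `"9mn"` and `tempsReel` / `dernierDepart` are the strings `"true"` / `"false"`, while `infotrafic` is a real JSON boolean. Downstream analysis will likely want numbers and booleans, so here is a minimal sketch of two normalizers. Both function names are hypothetical, and the parser deliberately returns `None` for any `temps` value that is not of the form `<n>mn`, since other formats are not shown in this notebook.

```python
import re

# Matches wait times of the form "<digits>mn", e.g. "9mn"
_MINUTES = re.compile(r"^(\d+)\s*mn$")

def parse_wait_minutes(temps):
    """Convert a string like "9mn" to an int; return None for anything else."""
    match = _MINUTES.match(temps.strip()) if isinstance(temps, str) else None
    return int(match.group(1)) if match else None

def parse_flag(value):
    """Normalize a flag that may arrive as a bool or as "true"/"false"."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"

# Field values copied from the first record of the sample response
item = {"temps": "9mn", "tempsReel": "true", "dernierDepart": "false", "infotrafic": True}

print(parse_wait_minutes(item["temps"]))   # 9
print(parse_flag(item["tempsReel"]))       # True
print(parse_flag(item["dernierDepart"]))   # False
```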

[
  {
    "sens": 2,
    "terminus": "Commerce",
    "infotrafic": true,
    "temps": "9mn",
    "dernierDepart": "false",
    "tempsReel": "true",
    "ligne": {
      "numLigne": "1",
      "typeLigne": 1
    },
    "arret": {
      "codeArret": "COMC1"
    }
  },
  {
    "sens": 1,
    "terminus": "Fran\u00e7ois Mitterrand",
    "infotrafic": true,
    "temps": "12mn",
    "dernierDepart": "true",
    "tempsReel": "true",
    "ligne": {
      "numLigne": "1",
      "typeLigne": 1
    },
    "arret": {
      "codeArret": "COMB2"
    }
  },
  {
    "sens": 2,
    "terminus": "Commerce",
    "infotrafic": true,
    "temps": "36mn",
    "dernierDepart": "false",
    "tempsReel": "true",
    "ligne": {
      "numLigne": "1",
      "typeLigne": 1
    },
    "arret": {
      "codeArret": "COMC1"
    }
  },
  {
    "sens": 2,
    "terminus": "Espace Diderot",
    "infotrafic": false,
    "temps": "8mn",
    "dernierDepart": "false",
    "tempsReel": "true",
    "ligne": {
      "numLigne": 