# RAMP on predicting cyclist traffic in Paris


## Introduction

The dataset was collected with cyclist counters installed by the Paris city council at multiple locations. It contains hourly information about cyclist traffic, as well as the following features:
 - counter name
 - counter site name
 - date
 - counter installation date
 - latitude and longitude
 
Available features are quite scarce. However, **we can also use any external data that can help us to predict the target variable.** 
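As a rough illustration of how external data could be attached, the sketch below joins a coarser time series onto hourly counts with `pd.merge_asof`. The `weather` frame, its `temperature_c` column, and the helper `add_external_features` are hypothetical names invented for this example; only `counter_name`, `date`, and `log_bike_count` come from the dataset, and the values are toy data.

```python
import pandas as pd

# Toy stand-in for the counter data: one counter, hourly timestamps.
counts = pd.DataFrame(
    {
        "counter_name": ["A", "A", "A", "A"],
        "date": pd.to_datetime(
            [
                "2020-09-01 01:00",
                "2020-09-01 02:00",
                "2020-09-01 03:00",
                "2020-09-01 04:00",
            ]
        ),
        "log_bike_count": [0.7, 1.1, 0.0, 1.6],
    }
)

# Hypothetical external data: one observation every 3 hours.
weather = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-09-01 00:00", "2020-09-01 03:00"]),
        "temperature_c": [14.2, 13.1],
    }
)


def add_external_features(counts, external, on="date"):
    """Attach the latest external observation at or before each timestamp."""
    counts_sorted = counts.sort_values(on).reset_index(drop=True)
    # merge_asof needs both keys sorted and of the same dtype.
    external_sorted = external.assign(
        **{on: external[on].astype(counts[on].dtype)}
    ).sort_values(on)
    return pd.merge_asof(counts_sorted, external_sorted, on=on, direction="backward")


merged = add_external_features(counts, weather)
print(merged)
```

A backward as-of join only uses observations available at or before each hour, which avoids leaking future information into the features.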

In [2]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

## Loading the data with pandas

First, download the data files from Kaggle and put them into the `data` folder.


The data is stored in [Parquet format](https://parquet.apache.org/), an efficient columnar data format. We can load the train set with pandas:

In [3]:
data = pd.read_parquet(Path("data") / "train.parquet")
data.head()

data_test = pd.read_parquet(Path("data") / "final_test.parquet")
data["date"]

48321    2020-09-01 02:00:00
48324    2020-09-01 03:00:00
48327    2020-09-01 04:00:00
48330    2020-09-01 15:00:00
48333    2020-09-01 18:00:00
                 ...        
929175   2021-09-09 06:00:00
929178   2021-09-09 10:00:00
929181   2021-09-09 15:00:00
929184   2021-09-09 22:00:00
929187   2021-09-09 23:00:00
Name: date, Length: 496827, dtype: datetime64[us]
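The timestamps shown above are not consecutive, since rows from different counters are interleaved. A minimal sketch of how we might check hourly coverage per counter follows. It runs on toy data, the helper `hourly_coverage` is a name made up for this example, and it assumes at most one row per counter per hour.

```python
import pandas as pd

# Toy stand-in for `data`: two counters, one of them missing an hour.
toy = pd.DataFrame(
    {
        "counter_name": ["A", "A", "A", "B", "B"],
        "date": pd.to_datetime(
            [
                "2020-09-01 01:00",
                "2020-09-01 02:00",
                "2020-09-01 03:00",
                "2020-09-01 01:00",
                "2020-09-01 03:00",
            ]
        ),
    }
)


def hourly_coverage(df):
    """Per counter: first and last timestamp, rows present, hours missing in between."""
    summary = df.groupby("counter_name", observed=True)["date"].agg(
        ["min", "max", "count"]
    )
    # Number of hourly slots between the first and last timestamp, inclusive.
    expected = (
        (summary["max"] - summary["min"]) / pd.Timedelta(hours=1)
    ).astype(int) + 1
    summary["missing_hours"] = expected - summary["count"]
    return summary


coverage = hourly_coverage(toy)
print(coverage)
```

On the real data, the same function could be called as `hourly_coverage(data)`.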

## Visualizing the data


Let's visualize the data, starting with the spatial distribution of the counters on a map:

In [17]:
import folium

# Pass the map center as a plain [latitude, longitude] list rather than a pandas Series
m = folium.Map(location=data[["latitude", "longitude"]].mean(axis=0).tolist(), zoom_start=13)

for _, row in (
    data[["counter_name", "latitude", "longitude"]]
    .drop_duplicates("counter_name")
    .iterrows()
):
    folium.Marker(
        row[["latitude", "longitude"]].values.tolist(), popup=row["counter_name"]
    ).add_to(m)

m



In [4]:
import seaborn as sns

grouped_data = (
    # "ME" (month end) replaces the "M" alias, which is deprecated in pandas >= 2.2
    data.groupby(["counter_name", pd.Grouper(freq="ME", key="date")])
    ["log_bike_count"].sum()
    .to_frame()
)
grouped_data = grouped_data.reset_index()
coordinates_mapper = data[['counter_name', 'latitude', 'longitude']].drop_duplicates()
grouped_data = pd.merge(grouped_data, 
                        coordinates_mapper[['counter_name', 'latitude', 'longitude']], 
                        on='counter_name', 
                        how='left')
grouped_data



| | counter_name | date | log_bike_count | latitude | longitude |
|---|---|---|---|---|---|
| 0 | 152 boulevard du Montparnasse E-O | 2020-09-30 | 2572.257371 | 48.840801 | 2.333233 |
| 1 | 152 boulevard du Montparnasse E-O | 2020-10-31 | 2382.779737 | 48.840801 | 2.333233 |
| 2 | 152 boulevard du Montparnasse E-O | 2020-11-30 | 2032.810230 | 48.840801 | 2.333233 |
| 3 | 152 boulevard du Montparnasse E-O | 2020-12-31 | 2026.296477 | 48.840801 | 2.333233 |
| 4 | 152 boulevard du Montparnasse E-O | 2021-01-31 | 1577.473441 | 48.840801 | 2.333233 |
| ... | ... | ... | ... | ... | ... |
| 723 | Voie Georges Pompidou SO-NE | 2021-05-31 | 2402.760973 | 48.848400 | 2.275860 |
| 724 | Voie Georges Pompidou SO-NE | 2021-06-30 | 2674.675392 | 48.848400 | 2.275860 |
| 725 | Voie Georges Pompidou SO-NE | 2021-07-31 | 2677.705478 | 48.848400 | 2.275860 |
| 726 | Voie Georges Pompidou SO-NE | 2021-08-31 | 2438.580608 | 48.848400 | 2.275860 |
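Each row of this table holds the sum of the hourly `log_bike_count` values of one counter over one calendar month, labelled with the month-end date. A similar grouping can be sketched with `dt.to_period("M")` in place of `pd.Grouper`. This is a swapped-in technique, not the one used above: it labels each bucket by calendar month, and it does not rely on the month-end frequency alias, whose spelling differs across pandas versions. The data below is a toy example.

```python
import pandas as pd

# Toy stand-in for `data` with the same three columns used in the aggregation.
toy = pd.DataFrame(
    {
        "counter_name": ["A", "A", "A", "B"],
        "date": pd.to_datetime(
            [
                "2020-09-01 02:00",
                "2020-09-15 08:00",
                "2020-10-03 09:00",
                "2020-09-20 17:00",
            ]
        ),
        "log_bike_count": [1.0, 2.5, 3.0, 0.5],
    }
)

monthly = (
    # Label every row with its calendar month, then sum within (counter, month).
    toy.assign(month=toy["date"].dt.to_period("M"))
    .groupby(["counter_name", "month"], observed=True)["log_bike_count"]
    .sum()
    .reset_index()
)
print(monthly)
```

Unlike the table above, which also contains the coordinates, this sketch keeps only the counter, the month, and the aggregated value.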


In [10]:
import folium
from folium.plugins import TimestampedGeoJson
import matplotlib
from matplotlib import colors


# Normalize log_bike_count for gradient mapping
log_min, log_max = grouped_data["log_bike_count"].min(), grouped_data["log_bike_count"].max()
norm = colors.Normalize(vmin=log_min, vmax=log_max)

# Create a colormap: "RdYlGn_r" is the reversed "RdYlGn", so it runs green → yellow → red
# (cm.get_cmap is deprecated; matplotlib.colormaps requires matplotlib >= 3.5)
cmap = matplotlib.colormaps["RdYlGn_r"]

def get_gradient_color(log_value):
    """Map log_bike_count to a gradient color."""
    rgba_color = cmap(norm(log_value))  # Map normalized value to RGBA
    return colors.rgb2hex(rgba_color[:3])  # Convert to hex color



# Convert DataFrame to GeoJSON-like format
features = [
    {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [row["longitude"], row["latitude"]],
        },
        "properties": {
            "time": row["date"].isoformat(),  # ISO 8601 timestamp string
            "icon": "circle",
            "iconstyle": {
                "fillColor": get_gradient_color(row["log_bike_count"]),
                "fillOpacity": 1,
                "stroke": False,  # boolean, not the (truthy) string "false"
                "radius": 8,
            },
            "style": {
                "weight": 0,
            },
            "popup": f"Monthly sum of log bike count: {row['log_bike_count']:.1f}",
        }
    }
    for _, row in grouped_data.iterrows()
]

geojson_data = {
    "type": "FeatureCollection",
    "features": features,
}

# Initialize a folium map
m = folium.Map(location=data[["latitude", "longitude"]].mean(axis=0).tolist(), zoom_start=13)

TimestampedGeoJson(
    data=geojson_data,
    period="P1M",  # Time interval for each step (1 month here)
    add_last_point=False,
    auto_play=True,
    loop=True,
    max_speed=10,  # Adjust speed of the animation
    loop_button=True,
    date_options="YYYY-MM",  # Format for the date display (monthly)
    time_slider_drag_update=True,
).add_to(m)

m
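One detail of the feature dictionaries is easy to get wrong: GeoJSON stores point coordinates as `[longitude, latitude]`, the reverse of the `[latitude, longitude]` order passed to `folium.Marker` earlier. The sketch below builds a single feature using only the standard library. The helper name `make_point_feature` and the color value are placeholders chosen for this example; the coordinates and date are taken from the first row of the aggregated table.

```python
import json
from datetime import datetime


def make_point_feature(longitude, latitude, when, fill_color, popup):
    """Build one GeoJSON Point feature with the properties used in this notebook."""
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            # GeoJSON order is [longitude, latitude].
            "coordinates": [longitude, latitude],
        },
        "properties": {
            "time": when.isoformat(),  # ISO 8601 timestamp string
            "icon": "circle",
            "iconstyle": {"fillColor": fill_color, "fillOpacity": 1, "radius": 8},
            "popup": popup,
        },
    }


feature = make_point_feature(
    longitude=2.333233,
    latitude=48.840801,
    when=datetime(2020, 9, 30),
    fill_color="#ff0000",  # placeholder color, not derived from the data
    popup="152 boulevard du Montparnasse E-O",
)
print(json.dumps(feature, indent=2))
```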



Note that in this challenge, we consider only the 30 most frequented counting sites, to limit data size.

Next we will look into the temporal distribution of the most frequented bike counter. If we plot it directly we will not see much because there are half a million data points: