# Description

This POC gathers a few experiments on manually creating a DataFrame, or individual rows of a DataFrame.

The end is the most interesting part: we create an empty DataFrame with the same structure as another one, and we append rows to a given DataFrame.

# Loading the data

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv("/tmp/stop_times.txt", nrows=10000)
df.drop(["stop_time_desc", "pickup_type", "drop_off_type"], axis=1, inplace=True)
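# iterating over the DataFrame in (trip_id, stop_sequence) order should allow rebuilding each trip.
# Note: sort_values returns a sorted copy; it is not assigned here, so df keeps the file order.
df.sort_values(["trip_id", "stop_sequence"])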
df

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence
0,125830582-1_406672,10:27:00,10:27:00,StopPoint:50162949,0
1,125830582-1_406672,10:29:00,10:29:00,StopPoint:50162933,1
2,125830582-1_406672,10:30:00,10:30:00,StopPoint:50162935,2
3,125830582-1_406672,10:32:00,10:32:00,StopPoint:50162925,3
4,125830582-1_406672,10:34:00,10:34:00,StopPoint:50162941,4
...,...,...,...,...,...
9995,125830834-1_407200,16:09:00,16:09:00,StopPoint:50162944,11
9996,125830834-1_407200,16:11:00,16:11:00,StopPoint:50162946,12
9997,125830834-1_407200,16:13:00,16:13:00,StopPoint:50162924,13
9998,125830834-1_407200,16:14:00,16:14:00,StopPoint:50162942,14


# Computing per-trip info: number of stops + route_label

In [5]:
# sort explicitly by "trip_id" (with "stop_sequence" as the secondary key) before grouping:
# the sort_values call in the previous cell was not assigned, so df itself is still in file order
grouped_by_trip_id = df.sort_values(["trip_id", "stop_sequence"]).groupby("trip_id", group_keys=False)
trips = grouped_by_trip_id["stop_id"].aggregate(["size", lambda x: "|".join(x)])
# the route_label is the concatenation of the ids of each stop of the trip, separated by "|"
trips.columns = ["nb_stops", "route_label"]
trips.head()
# at this point, for each trip_id, we can look up its number of stops and its route_label

trip_id,nb_stops,route_label
125830456-1_406678,19,StopPoint:50162949|StopPoint:50162933|StopPoin...
125830457-1_407019,19,StopPoint:50162948|StopPoint:50162954|StopPoin...
125830458-1_406679,19,StopPoint:50162949|StopPoint:50162933|StopPoin...
125830459-1_407020,19,StopPoint:50162948|StopPoint:50162954|StopPoin...
125830460-1_406680,19,StopPoint:50162949|StopPoint:50162933|StopPoin...
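
The same aggregation can also be written with named aggregation, which sets the output column names directly instead of renaming them afterwards. This is only a minimal sketch on made-up toy data (`toy`, `toy_trips` and their values are assumptions for illustration, not taken from `stop_times.txt`):

```python
import pandas as pd

# Toy data standing in for stop_times.txt (values are made up for illustration).
toy = pd.DataFrame({
    "trip_id": ["t2", "t1", "t1", "t2", "t1"],
    "stop_id": ["c", "b", "a", "d", "e"],
    "stop_sequence": [0, 1, 0, 1, 2],
})

toy_trips = (
    # Sort first so that stops are joined in stop_sequence order within each trip.
    toy.sort_values(["trip_id", "stop_sequence"])
       .groupby("trip_id")["stop_id"]
       # Named aggregation: keyword = output column name, value = aggregation.
       .agg(nb_stops="size", route_label=lambda s: "|".join(s))
)
print(toy_trips)
```

Sorting before grouping matters here because `groupby` keeps the rows of each group in the order in which they appear in the input.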


# Updating the initial DataFrame to add the number of stops and the route_label

In [6]:
# mutate the DataFrame to add the number of stops of each trip:
nb_stops = df["trip_id"].apply(lambda trip_id: trips.loc[trip_id]["nb_stops"])
df["nb_stops_in_this_trip"] = nb_stops
# mutate the DataFrame to add the route_label of each trip:
df["route_label"] = df["trip_id"].apply(lambda trip_id: trips.loc[trip_id]["route_label"])

df.head()

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,nb_stops_in_this_trip,route_label
0,125830582-1_406672,10:27:00,10:27:00,StopPoint:50162949,0,17,StopPoint:50162949|StopPoint:50162933|StopPoin...
1,125830582-1_406672,10:29:00,10:29:00,StopPoint:50162933,1,17,StopPoint:50162949|StopPoint:50162933|StopPoin...
2,125830582-1_406672,10:30:00,10:30:00,StopPoint:50162935,2,17,StopPoint:50162949|StopPoint:50162933|StopPoin...
3,125830582-1_406672,10:32:00,10:32:00,StopPoint:50162925,3,17,StopPoint:50162949|StopPoint:50162933|StopPoin...
4,125830582-1_406672,10:34:00,10:34:00,StopPoint:50162941,4,17,StopPoint:50162949|StopPoint:50162933|StopPoin...
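
The per-row `apply` with a `trips.loc[...]` lookup works, but the same lookup can probably be expressed more directly with `Series.map`, which accepts a Series and looks each value up in its index. A minimal sketch on made-up toy data (`toy` and `toy_trips` are hypothetical stand-ins for `df` and `trips`):

```python
import pandas as pd

# Hypothetical stand-in for `trips`: one row per trip_id.
toy_trips = pd.DataFrame(
    {"nb_stops": [3, 2], "route_label": ["a|b|e", "c|d"]},
    index=pd.Index(["t1", "t2"], name="trip_id"),
)
# Hypothetical stand-in for `df`: one row per stop time.
toy = pd.DataFrame({"trip_id": ["t2", "t1", "t1", "t2", "t1"]})

# Series.map with a Series argument looks each trip_id up in that Series' index.
toy["nb_stops_in_this_trip"] = toy["trip_id"].map(toy_trips["nb_stops"])
toy["route_label"] = toy["trip_id"].map(toy_trips["route_label"])
print(toy)
```

Unlike the `apply` version, a `trip_id` missing from the lookup table yields `NaN` here instead of raising a `KeyError`.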


# POC: creating a DataFrame by hand

In [9]:
manual1 = pd.DataFrame([[19, "val1", "val2", "val3"]], index=["coucou"], columns=["nb_stops", "stop1", "stop2", "stop3"])
manual1

Unnamed: 0,nb_stops,stop1,stop2,stop3
coucou,19,val1,val2,val3


In [10]:
# a different construction that yields an identical DataFrame:
manual2 = pd.DataFrame({
    "nb_stops": 19,
    "stop1": "val1",
    "stop2": "val2",
    "stop3": "val3",
}, index=["coucou"])
manual2

Unnamed: 0,nb_stops,stop1,stop2,stop3
coucou,19,val1,val2,val3


In [11]:
assert manual1.equals(manual2)

# POC: creating an EMPTY DataFrame with the same structure as another one

In [25]:
clone = pd.DataFrame(data=None, columns=df.columns)
clone

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,nb_stops_in_this_trip,route_label


**Interpretation**: the DataFrame created is empty (no rows), but has the same columns as `df`.
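
One caveat worth illustrating: building the empty frame from `df.columns` copies the column labels only, not the dtypes. If the dtypes matter, slicing zero rows is one possible alternative. A minimal sketch on made-up toy data (`toy`, `by_columns` and `by_slice` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({"trip_id": ["t1", "t2"], "stop_sequence": [0, 1]})

# Same approach as the cell above: only the column labels are copied,
# so the integer dtype of "stop_sequence" is not carried over.
by_columns = pd.DataFrame(data=None, columns=toy.columns)

# Slicing zero rows keeps both the column labels and the dtypes.
by_slice = toy.iloc[0:0].copy()

print(by_columns.dtypes)
print(by_slice.dtypes)
```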

In [26]:
# one way to append rows to a DataFrame:
# (beware: probably very inefficient when repeated, since concat builds a new DataFrame each time)
clone = pd.concat([clone, df.loc[0:3]], ignore_index=True)
clone

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,nb_stops_in_this_trip,route_label
0,125830582-1_406672,10:27:00,10:27:00,StopPoint:50162949,0,17,StopPoint:50162949|StopPoint:50162933|StopPoin...
1,125830582-1_406672,10:29:00,10:29:00,StopPoint:50162933,1,17,StopPoint:50162949|StopPoint:50162933|StopPoin...
2,125830582-1_406672,10:30:00,10:30:00,StopPoint:50162935,2,17,StopPoint:50162949|StopPoint:50162933|StopPoin...
3,125830582-1_406672,10:32:00,10:32:00,StopPoint:50162925,3,17,StopPoint:50162949|StopPoint:50162933|StopPoin...


In [27]:
manual3 = pd.DataFrame(
    [
        ["manualtrip", "14:15:00", "14:15:00", "superstop1", 0, 2, "superstop1|superstop2"],
        ["manualtrip", "14:17:00", "14:17:00", "superstop2", 1, 2, "superstop1|superstop2"],
    ],
    index=[5, 6],
    columns=["trip_id", "arrival_time", "departure_time", "stop_id", "stop_sequence", "nb_stops_in_this_trip", "route_label"],
)
# no ignore_index here, so the appended rows keep manual3's own index labels (5 and 6)
clone = pd.concat([clone, manual3])
clone

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,nb_stops_in_this_trip,route_label
0,125830582-1_406672,10:27:00,10:27:00,StopPoint:50162949,0,17,StopPoint:50162949|StopPoint:50162933|StopPoin...
1,125830582-1_406672,10:29:00,10:29:00,StopPoint:50162933,1,17,StopPoint:50162949|StopPoint:50162933|StopPoin...
2,125830582-1_406672,10:30:00,10:30:00,StopPoint:50162935,2,17,StopPoint:50162949|StopPoint:50162933|StopPoin...
3,125830582-1_406672,10:32:00,10:32:00,StopPoint:50162925,3,17,StopPoint:50162949|StopPoint:50162933|StopPoin...
5,manualtrip,14:15:00,14:15:00,superstop1,0,2,superstop1|superstop2
6,manualtrip,14:17:00,14:17:00,superstop2,1,2,superstop1|superstop2
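
Since each `pd.concat` call builds a new DataFrame, appending rows one at a time in a loop gets expensive. A common workaround is to accumulate the rows in a plain Python list and concatenate once at the end. A minimal sketch, assuming a reduced set of columns and made-up values (`existing`, `rows` and `combined` are hypothetical names):

```python
import pandas as pd

columns = ["trip_id", "stop_id", "stop_sequence"]

# Hypothetical stand-in for the DataFrame we want to extend.
existing = pd.DataFrame([["t1", "a", 0]], columns=columns)

# Accumulate plain dicts first; appending to a list is cheap.
rows = []
for seq, stop in enumerate(["superstop1", "superstop2", "superstop3"]):
    rows.append({"trip_id": "manualtrip", "stop_id": stop, "stop_sequence": seq})

# Build one DataFrame from all the new rows, then concatenate a single time.
new_rows = pd.DataFrame(rows, columns=columns)
combined = pd.concat([existing, new_rows], ignore_index=True)
print(combined)
```

With `ignore_index=True` the result gets a fresh 0..n-1 index, which avoids gaps or duplicate labels such as the missing label 4 in the output above.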
