<center><div style="width: 900px;  padding-top:5px; padding-bottom:10px;border: 3px solid #1625CB; text-align: left;background: #1625CB;">
 <center>DataFrame.apply </center>
</div></center>


The `apply` method is a versatile tool for custom transformations in ETL pipelines.

It lets you:

-  Apply custom logic to rows or columns: process data with complex business rules;
-  Handle data cleaning and standardization: format strings, handle missing values, and so on;
-  Create derived features: compute metrics, scores, or segments based on several fields;
-  Implement domain-specific transformations: apply calculations or rules specific to an industry.


Although `apply` is extremely flexible, it calls a Python function once per element or row, so it can be much slower than vectorized operations on large datasets. Whenever possible, prefer the vectorized functions built into pandas and NumPy (such as `np.where`, used for the recency score in the example below) for better performance.

In ETL pipelines, you will often combine vectorized operations for simple transformations with `apply` for complex logic that is hard to vectorize.
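To make the trade-off concrete, here is a minimal sketch comparing a row-wise `apply` with its vectorized equivalent. The `price` and `quantity` columns are hypothetical toy data, not part of the customer table used in this notebook.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

# Row-wise apply: the lambda is called once per row
total_apply = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

# Vectorized equivalent: a single operation over whole columns
total_vectorized = df["price"] * df["quantity"]

# Both approaches produce the same values
same_result = bool((total_apply == total_vectorized).all())
print(same_result)
```

The two results are identical here; the difference is that the vectorized form avoids one Python function call per row, which tends to matter as the table grows.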

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime



# Customer data
customers = pd.DataFrame({
    'customer_id': range(1, 11),
    'name': ['Alice Smith', 'Bob Johnson', 'Charlie Williams', 'David Brown', 'Eve Davis',
             'Frank Miller', 'Grace Wilson', 'Hannah Moore', 'Ian Taylor', 'Julia Anderson'],
    'signup_date': pd.date_range(start='2022-01-01', periods=10),
    'last_purchase': pd.date_range(start='2023-01-15', periods=10),
    'total_spent': [1250.50, 890.25, 2300.75, 450.00, 3200.80,
                    750.60, 1800.30, 920.45, 2100.90, 1500.20],
    'items_purchased': [15, 8, 25, 5, 30, 10, 20, 12, 22, 18],
    'address': ['123 Main St, New York, NY', '456 Oak Ave, Los Angeles, CA', 
                '789 Pine Rd, Chicago, IL', '321 Cedar Ln, Houston, TX',
                '654 Maple Dr, Phoenix, AZ', '987 Birch Ct, Philadelphia, PA',
                '234 Elm Blvd, San Antonio, TX', '567 Willow Way, San Diego, CA',
                '890 Spruce Path, Dallas, TX', '432 Redwood Cir, San Jose, CA']
})
customers

,customer_id,name,signup_date,last_purchase,total_spent,items_purchased,address
0,1,Alice Smith,2022-01-01,2023-01-15,1250.5,15,"123 Main St, New York, NY"
1,2,Bob Johnson,2022-01-02,2023-01-16,890.25,8,"456 Oak Ave, Los Angeles, CA"
2,3,Charlie Williams,2022-01-03,2023-01-17,2300.75,25,"789 Pine Rd, Chicago, IL"
3,4,David Brown,2022-01-04,2023-01-18,450.0,5,"321 Cedar Ln, Houston, TX"
4,5,Eve Davis,2022-01-05,2023-01-19,3200.8,30,"654 Maple Dr, Phoenix, AZ"
5,6,Frank Miller,2022-01-06,2023-01-20,750.6,10,"987 Birch Ct, Philadelphia, PA"
6,7,Grace Wilson,2022-01-07,2023-01-21,1800.3,20,"234 Elm Blvd, San Antonio, TX"
7,8,Hannah Moore,2022-01-08,2023-01-22,920.45,12,"567 Willow Way, San Diego, CA"
8,9,Ian Taylor,2022-01-09,2023-01-23,2100.9,22,"890 Spruce Path, Dallas, TX"
9,10,Julia Anderson,2022-01-10,2023-01-24,1500.2,18,"432 Redwood Cir, San Jose, CA"


In [4]:
# Apply to a single column: extract the first name
customers['first_name'] = customers['name'].apply(lambda x: x.split()[0])
customers

,customer_id,name,signup_date,last_purchase,total_spent,items_purchased,address,first_name
0,1,Alice Smith,2022-01-01,2023-01-15,1250.5,15,"123 Main St, New York, NY",Alice
1,2,Bob Johnson,2022-01-02,2023-01-16,890.25,8,"456 Oak Ave, Los Angeles, CA",Bob
2,3,Charlie Williams,2022-01-03,2023-01-17,2300.75,25,"789 Pine Rd, Chicago, IL",Charlie
3,4,David Brown,2022-01-04,2023-01-18,450.0,5,"321 Cedar Ln, Houston, TX",David
4,5,Eve Davis,2022-01-05,2023-01-19,3200.8,30,"654 Maple Dr, Phoenix, AZ",Eve
5,6,Frank Miller,2022-01-06,2023-01-20,750.6,10,"987 Birch Ct, Philadelphia, PA",Frank
6,7,Grace Wilson,2022-01-07,2023-01-21,1800.3,20,"234 Elm Blvd, San Antonio, TX",Grace
7,8,Hannah Moore,2022-01-08,2023-01-22,920.45,12,"567 Willow Way, San Diego, CA",Hannah
8,9,Ian Taylor,2022-01-09,2023-01-23,2100.9,22,"890 Spruce Path, Dallas, TX",Ian
9,10,Julia Anderson,2022-01-10,2023-01-24,1500.2,18,"432 Redwood Cir, San Jose, CA",Julia
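For simple string operations like this one, the pandas `.str` accessor is a possible vectorized alternative to `apply`. The sketch below uses a small hypothetical `names` Series that includes a missing value to show how the accessor behaves in that case.

```python
import pandas as pd

# Hypothetical sample with one missing name
names = pd.Series(["Alice Smith", "Bob Johnson", None])

# .str methods propagate missing values instead of raising an error
first_names = names.str.split().str[0]
print(first_names)
```

With the lambda version, `x.split()` would raise an `AttributeError` on the missing entry, whereas the `.str` accessor leaves it as a missing value.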


In [6]:
# Named function to extract the state from the address
def extract_state(address):
    """Extract the state code from an address."""
    try:
        return address.split(',')[-1].strip().split()[-1]
    except (AttributeError, IndexError):
        # Non-string input (e.g. NaN) or an empty last field
        return 'Unknown'

customers['state'] = customers['address'].apply(extract_state)
customers

,customer_id,name,signup_date,last_purchase,total_spent,items_purchased,address,first_name,state
0,1,Alice Smith,2022-01-01,2023-01-15,1250.5,15,"123 Main St, New York, NY",Alice,NY
1,2,Bob Johnson,2022-01-02,2023-01-16,890.25,8,"456 Oak Ave, Los Angeles, CA",Bob,CA
2,3,Charlie Williams,2022-01-03,2023-01-17,2300.75,25,"789 Pine Rd, Chicago, IL",Charlie,IL
3,4,David Brown,2022-01-04,2023-01-18,450.0,5,"321 Cedar Ln, Houston, TX",David,TX
4,5,Eve Davis,2022-01-05,2023-01-19,3200.8,30,"654 Maple Dr, Phoenix, AZ",Eve,AZ
5,6,Frank Miller,2022-01-06,2023-01-20,750.6,10,"987 Birch Ct, Philadelphia, PA",Frank,PA
6,7,Grace Wilson,2022-01-07,2023-01-21,1800.3,20,"234 Elm Blvd, San Antonio, TX",Grace,TX
7,8,Hannah Moore,2022-01-08,2023-01-22,920.45,12,"567 Willow Way, San Diego, CA",Hannah,CA
8,9,Ian Taylor,2022-01-09,2023-01-23,2100.9,22,"890 Spruce Path, Dallas, TX",Ian,TX
9,10,Julia Anderson,2022-01-10,2023-01-24,1500.2,18,"432 Redwood Cir, San Jose, CA",Julia,CA
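A regular expression with `Series.str.extract` is one way to express the same extraction without `apply`. This sketch assumes that states are always two uppercase letters at the end of the address; the sample `addresses` Series is hypothetical.

```python
import pandas as pd

# Hypothetical sample, including a malformed and a missing address
addresses = pd.Series([
    "123 Main St, New York, NY",
    "456 Oak Ave, Los Angeles, CA",
    "no state here",
    None,
])

# Capture a two-letter uppercase code that follows the last comma
states = (
    addresses.str.extract(r",\s*([A-Z]{2})\s*$", expand=False)
    .fillna("Unknown")
)
print(states.tolist())
```

Note that this pattern is stricter than `extract_state`: for `"no state here"`, the function above would return `"here"`, while the regular expression finds no match and falls back to `"Unknown"`.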


In [8]:
# Apply across several columns: compute the average spend per item
def avg_spend_per_item(row):
    """Compute the average amount spent per item."""
    return row['total_spent'] / row['items_purchased']

customers['avg_item_value'] = customers.apply(avg_spend_per_item, axis=1)
customers


,customer_id,name,signup_date,last_purchase,total_spent,items_purchased,address,first_name,state,avg_item_value
0,1,Alice Smith,2022-01-01,2023-01-15,1250.5,15,"123 Main St, New York, NY",Alice,NY,83.366667
1,2,Bob Johnson,2022-01-02,2023-01-16,890.25,8,"456 Oak Ave, Los Angeles, CA",Bob,CA,111.28125
2,3,Charlie Williams,2022-01-03,2023-01-17,2300.75,25,"789 Pine Rd, Chicago, IL",Charlie,IL,92.03
3,4,David Brown,2022-01-04,2023-01-18,450.0,5,"321 Cedar Ln, Houston, TX",David,TX,90.0
4,5,Eve Davis,2022-01-05,2023-01-19,3200.8,30,"654 Maple Dr, Phoenix, AZ",Eve,AZ,106.693333
5,6,Frank Miller,2022-01-06,2023-01-20,750.6,10,"987 Birch Ct, Philadelphia, PA",Frank,PA,75.06
6,7,Grace Wilson,2022-01-07,2023-01-21,1800.3,20,"234 Elm Blvd, San Antonio, TX",Grace,TX,90.015
7,8,Hannah Moore,2022-01-08,2023-01-22,920.45,12,"567 Willow Way, San Diego, CA",Hannah,CA,76.704167
8,9,Ian Taylor,2022-01-09,2023-01-23,2100.9,22,"890 Spruce Path, Dallas, TX",Ian,TX,95.495455
9,10,Julia Anderson,2022-01-10,2023-01-24,1500.2,18,"432 Redwood Cir, San Jose, CA",Julia,CA,83.344444
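Because this calculation is plain arithmetic on two columns, it can also be written as a column-wise division. The sketch below additionally guards against customers with zero items, which is an assumption about the data rather than something the table above contains; the `orders` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, including a customer with no purchases
orders = pd.DataFrame({
    "total_spent": [1250.50, 450.00, 0.0],
    "items_purchased": [15, 5, 0],
})

# Replace 0 with NaN so the division yields NaN for customers without items
items = orders["items_purchased"].replace(0, np.nan)
orders["avg_item_value"] = orders["total_spent"] / items
print(orders)
```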


In [9]:
# Apply a function with additional arguments
def days_since_event(date, reference_date=None):
    """Compute the number of days between a given date and a reference date."""
    if reference_date is None:
        reference_date = datetime.now().date()
    return (reference_date - date.date()).days

In [11]:
# Custom reference date
reference = datetime(2023, 12, 31).date()

# Apply to the signup and last-purchase dates
customers['days_since_signup'] = customers['signup_date'].apply(
    days_since_event, reference_date=reference
)
customers['days_since_purchase'] = customers['last_purchase'].apply(
    days_since_event, reference_date=reference
)
customers

,customer_id,name,signup_date,last_purchase,total_spent,items_purchased,address,first_name,state,avg_item_value,days_since_signup,days_since_purchase
0,1,Alice Smith,2022-01-01,2023-01-15,1250.5,15,"123 Main St, New York, NY",Alice,NY,83.366667,729,350
1,2,Bob Johnson,2022-01-02,2023-01-16,890.25,8,"456 Oak Ave, Los Angeles, CA",Bob,CA,111.28125,728,349
2,3,Charlie Williams,2022-01-03,2023-01-17,2300.75,25,"789 Pine Rd, Chicago, IL",Charlie,IL,92.03,727,348
3,4,David Brown,2022-01-04,2023-01-18,450.0,5,"321 Cedar Ln, Houston, TX",David,TX,90.0,726,347
4,5,Eve Davis,2022-01-05,2023-01-19,3200.8,30,"654 Maple Dr, Phoenix, AZ",Eve,AZ,106.693333,725,346
5,6,Frank Miller,2022-01-06,2023-01-20,750.6,10,"987 Birch Ct, Philadelphia, PA",Frank,PA,75.06,724,345
6,7,Grace Wilson,2022-01-07,2023-01-21,1800.3,20,"234 Elm Blvd, San Antonio, TX",Grace,TX,90.015,723,344
7,8,Hannah Moore,2022-01-08,2023-01-22,920.45,12,"567 Willow Way, San Diego, CA",Hannah,CA,76.704167,722,343
8,9,Ian Taylor,2022-01-09,2023-01-23,2100.9,22,"890 Spruce Path, Dallas, TX",Ian,TX,95.495455,721,342
9,10,Julia Anderson,2022-01-10,2023-01-24,1500.2,18,"432 Redwood Cir, San Jose, CA",Julia,CA,83.344444,720,341
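Date differences are another case where a vectorized form is available: subtracting a datetime column from a `pd.Timestamp` gives a timedelta Series, and `.dt.days` extracts the day count. A minimal sketch, using a shortened hypothetical `signup` Series built the same way as `signup_date`:

```python
import pandas as pd

# Hypothetical three-row version of the signup_date column
signup = pd.Series(pd.date_range(start="2022-01-01", periods=3))
reference = pd.Timestamp("2023-12-31")

# Timestamp minus datetime Series gives a timedelta Series
days_since_signup = (reference - signup).dt.days
print(days_since_signup.tolist())  # [729, 728, 727]
```

The values match the first three rows of `days_since_signup` computed with `apply` above.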


In [None]:
# Use np.where for a vectorized operation (faster than apply)
customers['recency_score'] = np.where(
    customers['days_since_purchase'] < 30, 'Actif',
    np.where(customers['days_since_purchase'] < 90, 'Récent', 'Inactif')
)
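When there are more than two categories, nested `np.where` calls can become hard to read. `np.select` is one possible alternative: it takes a list of conditions and a matching list of labels, and the first condition that is true wins. The `days` Series below is a hypothetical sample.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of days since the last purchase
days = pd.Series([10, 45, 200])

# Conditions are evaluated in order; the first true one determines the label
conditions = [days < 30, days < 90]
labels = ["Actif", "Récent"]
recency = np.select(conditions, labels, default="Inactif")
print(recency.tolist())  # ['Actif', 'Récent', 'Inactif']
```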

In [18]:
# Custom customer segmentation based on spending and recency
def customer_segment(row):
    """Determine the customer segment from the amount spent and the recency score."""
    if row['total_spent'] > 2000:
        tier = 'Valeur Élevée'
    elif row['total_spent'] > 1000:
        tier = 'Valeur Moyenne'
    else:
        tier = 'Faible Valeur'
        
    if row['recency_score'] == 'Actif':
        status = 'Actif'
    elif row['recency_score'] == 'Récent':
        status = 'À Risque'
    else:
        status = 'Perdu'
        
    return f"{tier} - {status}"

customers['segment'] = customers.apply(customer_segment, axis=1)
customers

,customer_id,name,signup_date,last_purchase,total_spent,items_purchased,address,first_name,state,avg_item_value,days_since_signup,days_since_purchase,recency_score,segment
0,1,Alice Smith,2022-01-01,2023-01-15,1250.5,15,"123 Main St, New York, NY",Alice,NY,83.366667,729,350,Inactif,Valeur Moyenne - Perdu
1,2,Bob Johnson,2022-01-02,2023-01-16,890.25,8,"456 Oak Ave, Los Angeles, CA",Bob,CA,111.28125,728,349,Inactif,Faible Valeur - Perdu
2,3,Charlie Williams,2022-01-03,2023-01-17,2300.75,25,"789 Pine Rd, Chicago, IL",Charlie,IL,92.03,727,348,Inactif,Valeur Élevée - Perdu
3,4,David Brown,2022-01-04,2023-01-18,450.0,5,"321 Cedar Ln, Houston, TX",David,TX,90.0,726,347,Inactif,Faible Valeur - Perdu
4,5,Eve Davis,2022-01-05,2023-01-19,3200.8,30,"654 Maple Dr, Phoenix, AZ",Eve,AZ,106.693333,725,346,Inactif,Valeur Élevée - Perdu
5,6,Frank Miller,2022-01-06,2023-01-20,750.6,10,"987 Birch Ct, Philadelphia, PA",Frank,PA,75.06,724,345,Inactif,Faible Valeur - Perdu
6,7,Grace Wilson,2022-01-07,2023-01-21,1800.3,20,"234 Elm Blvd, San Antonio, TX",Grace,TX,90.015,723,344,Inactif,Valeur Moyenne - Perdu
7,8,Hannah Moore,2022-01-08,2023-01-22,920.45,12,"567 Willow Way, San Diego, CA",Hannah,CA,76.704167,722,343,Inactif,Faible Valeur - Perdu
8,9,Ian Taylor,2022-01-09,2023-01-23,2100.9,22,"890 Spruce Path, Dallas, TX",Ian,TX,95.495455,721,342,Inactif,Valeur Élevée - Perdu
9,10,Julia Anderson,2022-01-10,2023-01-24,1500.2,18,"432 Redwood Cir, San Jose, CA",Julia,CA,83.344444,720,341,Inactif,Valeur Moyenne - Perdu
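Even this kind of segmentation can often be vectorized if performance becomes a concern. The sketch below is one possible approach, not the only one: `pd.cut` assigns the spending tier and `Series.map` translates the recency score into a status. The sample frame is hypothetical and includes the boundary values 1000 and 2000.

```python
import pandas as pd

# Hypothetical sample, including the boundary amounts 1000 and 2000
df = pd.DataFrame({
    "total_spent": [2300.75, 1250.50, 450.00, 1000.00, 2000.00],
    "recency_score": ["Actif", "Récent", "Inactif", "Actif", "Inactif"],
})

# Right-closed bins (-inf, 1000], (1000, 2000], (2000, inf]
# mirror the "> 1000" and "> 2000" tests in customer_segment
tier = pd.cut(
    df["total_spent"],
    bins=[-float("inf"), 1000, 2000, float("inf")],
    labels=["Faible Valeur", "Valeur Moyenne", "Valeur Élevée"],
)

# Translate each recency score into a status label
status = df["recency_score"].map(
    {"Actif": "Actif", "Récent": "À Risque", "Inactif": "Perdu"}
)

df["segment"] = tier.astype(str) + " - " + status
print(df["segment"].tolist())
```

One difference from `customer_segment`: the function treats any unexpected recency score as `'Perdu'`, whereas `map` returns a missing value for scores that are not in the dictionary.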


In [17]:
# Final display of the transformed columns
print(customers[['customer_id', 'first_name', 'state', 'avg_item_value', 
                 'days_since_purchase', 'recency_score', 'segment']])

   customer_id first_name state  avg_item_value  days_since_purchase  \
0            1      Alice    NY       83.366667                  350   
1            2        Bob    CA      111.281250                  349   
2            3    Charlie    IL       92.030000                  348   
3            4      David    TX       90.000000                  347   
4            5        Eve    AZ      106.693333                  346   
5            6      Frank    PA       75.060000                  345   
6            7      Grace    TX       90.015000                  344   
7            8     Hannah    CA       76.704167                  343   
8            9        Ian    TX       95.495455                  342   
9           10      Julia    CA       83.344444                  341   

  recency_score                 segment  
0       Inactif  Valeur Moyenne - Perdu  
1       Inactif   Faible Valeur - Perdu  
2       Inactif   Valeur Élevée - Perdu  
3       Inactif   Faible Valeur - Perdu