In [None]:
1. Descriptive analysis of our data
Topic: 
    Predicting the rate of voluntary termination of pregnancy
Goal:
How do geographical and temporal factors influence the rate of voluntary termination of pregnancy at the departmental level, and how can we predict its future evolution?

Benefits:
Predictions can help regional authorities anticipate needs in infrastructure and medical staff.
    
Sources: 
    Data.drees.solidarites-sante.gouv.fr
    Système national des données de santé (SNDS) ; traitements Drees
    
    
Description of the Dataset: 
Number of rows: 1 121
Number of columns : 9


ZONE_GEO : The geographical area covered by the row: a department such as ‘Ain’, or an aggregate such as ‘France entière’ or a region. Text (Object)
IVG_HOSP_INS : Number of abortions (IVG) performed in hospital (full hospitalisation) using surgical instruments. Number (Float)
IVG_HOSP_MED : Number of abortions performed in hospital (inpatient or outpatient) using medication. Number (Float)
IVG_HOSP_INC : Number of abortions performed in hospital for unspecified or unknown reasons. Number (Float)
IVG_CAB : Number of abortions performed in a doctor's office or healthcare facility without full hospitalisation. Number (Float)
IVG_CEN : Number of abortions performed in family planning or education centres. Number (Float)
TOT_IVG : The total number of voluntary terminations of pregnancy recorded for the corresponding ZONE_GEO and year. Number (Float)
TAUX_rec : The abortion rate, generally the number of abortions per 1,000 women of childbearing age (15 to 49 years old). Number (Float; stored as text with a decimal comma in the raw file)
annee : The year in which the abortion data was recorded, from 2016 to 2024. Number (Int)
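
Because TOT_IVG is described as a total, one quick consistency check is to compare it with the sum of the five component columns. The sketch below uses the first three rows as they appear in the `df.head()` output of this notebook. That TOT_IVG is exactly the sum of these five columns is an assumption: it holds for these rows, but the source does not state it as a rule.

```python
import pandas as pd

# First three 2024 rows, copied from the df.head() output
sample = pd.DataFrame({
    "ZONE_GEO": ["Ain", "Aisne", "Allier"],
    "IVG_HOSP_INS": [375.0, 313.0, 149.0],
    "IVG_HOSP_MED": [715.0, 1193.0, 409.0],
    "IVG_HOSP_INC": [2.0, 5.0, 2.0],
    "IVG_CAB": [591.0, 245.0, 419.0],
    "IVG_CEN": [153.0, 13.0, 13.0],
    "TOT_IVG": [1836.0, 1769.0, 992.0],
})

components = ["IVG_HOSP_INS", "IVG_HOSP_MED", "IVG_HOSP_INC", "IVG_CAB", "IVG_CEN"]

# Assumption: TOT_IVG should equal the sum of the five component columns
sample["diff"] = sample["TOT_IVG"] - sample[components].sum(axis=1)
inconsistent = sample[sample["diff"].abs() > 0]
print(len(inconsistent), "inconsistent rows")
```

On the full table, the same check would need to skip rows with missing components, since a NaN in any component makes the comparison meaningless.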


In [14]:
import pandas as pd

df = pd.read_csv("donnees_feuil1.csv", delimiter=';', encoding='ISO-8859-1')

print(df.head())
print(df.info())

                  ZONE_GEO  IVG_HOSP_INS  IVG_HOSP_MED  IVG_HOSP_INC  IVG_CAB  \
0                      Ain         375.0         715.0           2.0    591.0   
1                    Aisne         313.0        1193.0           5.0    245.0   
2                   Allier         149.0         409.0           2.0    419.0   
3  Alpes-de-Haute-Provence         110.0         262.0           2.0    158.0   
4             Hautes-Alpes          87.0         114.0           2.0    299.0   

   IVG_CEN  TOT_IVG     TAUX_rec   annee  
0    153.0   1836.0  12,86038497  2024.0  
1     13.0   1769.0  17,22308224  2024.0  
2     13.0    992.0  17,02536642  2024.0  
3     62.0    594.0  20,04251442  2024.0  
4     48.0    550.0  21,82020154  2024.0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1121 entries, 0 to 1120
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ZONE_GEO      1080 non-null   object 
 1   IVG_HOSP_INS  1

In [None]:
2. Data quality issues detected

TAUX_rec is stored as text with decimal commas and must be converted to float.

Some values are missing in every column.

We drop the regional and total rows and keep only the departments, to avoid bias from counting the same abortions twice.
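
As an alternative to converting TAUX_rec after loading, the decimal comma can be declared when the file is read, using the `decimal` parameter of `pd.read_csv`. The sketch below is a minimal illustration on an in-memory string that mimics the file layout; the two rows come from the `df.head()` output, and the reduced set of columns is a simplification. If the real column also contains non-numeric text, pandas would keep it as object and an explicit conversion would still be needed.

```python
import io
import pandas as pd

# Two rows in the same layout as the source file (';' separator, ',' decimals)
raw = (
    "ZONE_GEO;TOT_IVG;TAUX_rec;annee\n"
    "Ain;1836;12,86038497;2024\n"
    "Aisne;1769;17,22308224;2024\n"
)

# decimal=',' tells the parser to read '12,86' as the float 12.86
demo = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")
print(demo.dtypes)
```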

In [15]:
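# Convert the TAUX_rec decimal separator (comma -> dot)
df['TAUX_rec_clean'] = (
    df['TAUX_rec']
    .astype(str)                               # NaN becomes the string 'nan'
    .str.replace(',', '.', regex=False)        # decimal comma -> decimal point
    .str.replace(r'[^0-9\.]', '', regex=True)  # keep only digits and dots
)

# Strings that cannot be parsed (including the emptied 'nan') become NaN
df['TAUX_rec_clean'] = pd.to_numeric(df['TAUX_rec_clean'], errors='coerce')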

In [16]:
print(df['TAUX_rec_clean'].head())
print(df['TAUX_rec_clean'].isna().sum(), "missing values after conversion")


0    12.860385
1    17.223082
2    17.025366
3    20.042514
4    21.820202
Name: TAUX_rec_clean, dtype: float64
59 missing values after conversion


In [None]:
# Filter out aggregated rows (regions, totals, etc.)
# Caveat: matching is by case-insensitive substring, so a department whose name
# contains a keyword is dropped as well (e.g. 'Alpes-de-Haute-Provence' via "Provence").
aggregate_keywords = ["Total", "France entière", "Résidence inconnue", "résidence à l'étranger",
    "Auvergne", "Bourgogne", "Bretagne", "Centre", "Corse", "Grand Est",
    "Guadeloupe", "Guyane", "Hauts-de-France", "Île-de-France", "La Réunion",
    "Martinique", "Mayotte", "Normandie", "Nouvelle-Aquitaine", "Occitanie",
    "Pays de la Loire", "Provence"]

# Keep only the rows whose ZONE_GEO matches none of the keywords
mask = ~df['ZONE_GEO'].astype(str).str.contains('|'.join(aggregate_keywords), case=False, na=False)
df_filtered = df[mask].copy()
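
Substring matching can remove more rows than intended. The sketch below contrasts it with matching on the whole label, which keeps departments whose name merely contains a region keyword. The zone labels and the set `aggregate_labels` are hypothetical: the exact strings used in ZONE_GEO would have to be checked with `df['ZONE_GEO'].unique()` before applying this to the real table.

```python
import pandas as pd

# Hypothetical labels; the exact strings in ZONE_GEO are an assumption
zones = pd.Series([
    "Ain", "Alpes-de-Haute-Provence", "Corse-du-Sud",
    "Provence-Alpes-Côte d'Azur", "Corse", "France entière", None,
])

# Substring matching: also flags the two departments
by_substring = zones.str.contains("Provence|Corse|France entière", case=False, na=False)

# Whole-label matching against assumed full names of the aggregate rows
aggregate_labels = {"provence-alpes-côte d'azur", "corse", "france entière"}
normalized = zones.str.strip().str.lower()
is_aggregate = normalized.isin(aggregate_labels)

# Keep rows that are neither aggregates nor missing
departments = zones[~is_aggregate & zones.notna()]
print(departments.tolist())
```

With whole-label matching the three departments survive, whereas the substring mask flags five of the six non-missing labels.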

In [None]:
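# Drop every row that still has a missing value in any column
df_cleaned = df_filtered.dropna().copy()

# The year was read as float (e.g. 2024.0); store it as an integer
df_cleaned['annee'] = df_cleaned['annee'].astype(int)

print(f"Preprocessing done. Size of the cleaned DataFrame: {len(df_cleaned)} rows.")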

In [None]:
3. Problem formulation

Regression: we will use linear regression to predict the IVG rate of a given department for the following year, and evaluate it with RMSE or MAE.
Classification: we will also classify departments, for example by level of IVG use (low, medium, high) or by method, and evaluate the classifier with accuracy, F1-score and a confusion matrix.
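
To make the two regression metrics concrete, here is a minimal sketch that computes RMSE and MAE by hand with NumPy. The values are illustrative only and are not taken from the dataset. RMSE squares the errors before averaging, so the single large error weighs more heavily on it than on MAE.

```python
import numpy as np

# Illustrative values only, not taken from the dataset
y_true = np.array([12.0, 17.0, 20.0, 22.0])
y_hat = np.array([13.0, 16.0, 20.0, 26.0])

errors = y_hat - y_true                # [1, -1, 0, 4]
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((1 + 1 + 0 + 16) / 4) = sqrt(4.5)
mae = np.mean(np.abs(errors))          # (1 + 1 + 0 + 4) / 4 = 1.5

print(f"RMSE = {rmse:.4f}, MAE = {mae:.4f}")
```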

In [None]:
4. Selection and implementation of a baseline model

Regression
Predict the IVG rate by department and year, and possibly by type of IVG.
Model: multiple linear regression.

Classification
Assign each department to a class for a given year.
Model: decision tree or logistic regression.

In [None]:
# Baseline: multiple linear regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np


X = df_cleaned[['ZONE_GEO', 'annee']]
# Target: the numeric rate (the raw TAUX_rec column still holds comma-decimal strings)
y = df_cleaned['TAUX_rec_clean']


# Temporal split: train on the years before 2024, test on 2024
test_year = 2024
X_train = X[X['annee'] < test_year]
X_test = X[X['annee'] == test_year]
y_train = y[X['annee'] < test_year]
y_test = y[X['annee'] == test_year]


# One-hot encode the departments on the full table so that
# the train and test sets share the same columns
X_full = pd.get_dummies(X, columns=['ZONE_GEO'], drop_first=True)
X_train_processed = X_full.loc[X_train.index]
X_test_processed = X_full.loc[X_test.index]


baseline_model = LinearRegression()
baseline_model.fit(X_train_processed, y_train)


y_pred = baseline_model.predict(X_test_processed)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Baseline RMSE on the 2024 data: {rmse:.4f}")

In [None]:
# Baseline: classification
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


# Temporal split: train on the years before 2024, test on 2024
test_year = 2024
df_train = df_cleaned[df_cleaned['annee'] < test_year].copy()
df_test = df_cleaned[df_cleaned['annee'] == test_year].copy()


# Class thresholds are computed on the training years only, using the numeric
# rate (the raw TAUX_rec column still holds comma-decimal strings)
tertiles = df_train['TAUX_rec_clean'].quantile([0.33, 0.66])
Q1 = tertiles.iloc[0]
Q2 = tertiles.iloc[1]


def categorize_taux(taux):
    """Map a rate to 'Faible' (low), 'Moyen' (medium) or 'Élevé' (high)."""
    if taux <= Q1:
        return 'Faible'
    elif taux <= Q2:
        return 'Moyen'
    else:
        return 'Élevé'


df_cleaned['TAUX_CLASS'] = df_cleaned['TAUX_rec_clean'].apply(categorize_taux)

X = df_cleaned[['ZONE_GEO', 'annee']]
y = df_cleaned['TAUX_CLASS']

X_train = X[df_cleaned['annee'] < test_year]
X_test = X[df_cleaned['annee'] == test_year]
y_train = y[df_cleaned['annee'] < test_year]
y_test = y[df_cleaned['annee'] == test_year]


preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['ZONE_GEO']),
        ('num', 'passthrough', ['annee'])
    ],
    remainder='passthrough'
)

# Build the pipeline (preprocessing + classification model)
# multi_class is left at its default value, which is what 'auto' selected

model_classification = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42))
])


model_classification.fit(X_train, y_train)
y_pred_class = model_classification.predict(X_test)


accuracy = accuracy_score(y_test, y_pred_class)

print(f"\n--- Baseline classification model result (logistic regression) ---")
print(f"Quantiles used as class thresholds (Q1: {Q1:.2f}, Q2: {Q2:.2f})")
print(f"Accuracy on the year 2024: {accuracy:.4f}")
print("\n--- Classification report (details) ---")
print(classification_report(y_test, y_pred_class))
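
The problem formulation also lists the confusion matrix and the F1-score as evaluation tools. The sketch below shows how both could be obtained with scikit-learn's `confusion_matrix` and `f1_score`. The labels are illustrative and reuse the three class names of this notebook; in the notebook itself, `y_test` and `y_pred_class` would take the place of the two demo lists.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Fixed class order so that the rows and columns of the matrix are readable
labels = ["Faible", "Moyen", "Élevé"]

# Illustrative labels only, not model output
y_true_demo = ["Faible", "Faible", "Moyen", "Élevé", "Élevé", "Moyen"]
y_pred_demo = ["Faible", "Moyen", "Moyen", "Élevé", "Moyen", "Moyen"]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# Macro F1: unweighted mean of the per-class F1-scores
macro_f1 = f1_score(y_true_demo, y_pred_demo, labels=labels, average="macro")

print(cm)
print(f"Macro F1: {macro_f1:.4f}")
```

Passing `labels` explicitly keeps the Faible / Moyen / Élevé order instead of the alphabetical order scikit-learn would otherwise use.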