{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploratory Data Analysis (EDA) - Diabetes\n",
    "\n",
    "**Objective:** Understand the relationships between the various symptoms and the diabetes diagnosis, using the file `train_with_id.csv`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Importing Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "# Visual styling for the plots\n",
    "sns.set_style(\"whitegrid\")\n",
    "plt.style.use(\"fivethirtyeight\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Loading and Inspecting the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the CSV file\n",
    "df = pd.read_csv('train_with_id.csv')\n",
    "\n",
    "# Show the first 5 rows\n",
    "print(\"Data preview:\")\n",
    "display(df.head())\n",
    "\n",
    "# General information (dtypes, null values)\n",
    "print(\"\\nDataFrame info:\")\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Initial Cleaning\n",
    "\n",
    "The `ID` column is not a relevant feature for prediction, so we drop it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_cleaned = df.drop('ID', axis=1)\n",
    "print(\"Shape after dropping ID:\", df_cleaned.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Data Visualization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Distribution of the target variable (`class`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(8, 6))\n",
    "sns.countplot(x='class', data=df_cleaned, palette='viridis')\n",
    "plt.title('Distribution of Diagnoses (Positive vs. Negative)')\n",
    "plt.xlabel('Diagnosis')\n",
    "plt.ylabel('Number of patients')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The classes are slightly imbalanced, with more positive cases than negative ones."
   ]
  },
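  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To put a number on that imbalance, here is a minimal sketch that tabulates the count and share of each class. It assumes `df_cleaned` from the cleaning step and a target column named `class`; the names `target_counts`, `target_shares`, and `balance` are introduced here for illustration only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Absolute count of each diagnosis label\n",
    "target_counts = df_cleaned['class'].value_counts()\n",
    "\n",
    "# Same counts expressed as a fraction of all rows\n",
    "target_shares = df_cleaned['class'].value_counts(normalize=True)\n",
    "\n",
    "# Both Series are indexed by class label, so they align row by row\n",
    "balance = pd.DataFrame({'count': target_counts, 'share': target_shares.round(3)})\n",
    "display(balance)"
   ]
  },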
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Analysis of the categorical variables\n",
    "\n",
    "We examine how each symptom relates to the diagnosis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Every column except the numeric 'Age' and the target 'class' is treated as categorical\n",
    "categorical_features = [col for col in df_cleaned.columns if col not in ['Age', 'class']]\n",
    "\n",
    "for feature in categorical_features:\n",
    "    plt.figure(figsize=(10, 6))\n",
    "    sns.countplot(x=feature, hue='class', data=df_cleaned, palette='coolwarm')\n",
    "    plt.title(f'Distribution of \"{feature}\" by Diagnosis')\n",
    "    plt.xlabel(feature)\n",
    "    plt.ylabel('Number of patients')\n",
    "    plt.legend(title='Diagnosis')\n",
    "    plt.tight_layout()\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The symptoms **Polyuria** (frequent urination) and **Polydipsia** (excessive thirst) appear to be very strong indicators of a positive diagnosis."
   ]
  },
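  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The visual impression from the count plots can be checked with a row-normalized cross-tabulation, which gives the share of each diagnosis within every symptom value. This sketch assumes the symptom columns are named exactly `Polyuria` and `Polydipsia` in `df_cleaned`; a column that is missing is skipped. The names `symptom` and `rates` are illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for symptom in ['Polyuria', 'Polydipsia']:\n",
    "    # Column names are an assumption; skip any that are absent\n",
    "    if symptom not in df_cleaned.columns:\n",
    "        print(f'Column {symptom!r} not found, skipping.')\n",
    "        continue\n",
    "    # normalize='index' makes each row (symptom value) sum to 1\n",
    "    rates = pd.crosstab(df_cleaned[symptom], df_cleaned['class'], normalize='index')\n",
    "    display(rates.round(3))"
   ]
  },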
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. Preparing the Data for Modeling and for the Clean File"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copy the DataFrame before encoding\n",
    "df_encoded = df_cleaned.copy()\n",
    "\n",
    "# Use LabelEncoder to convert the text columns to integers\n",
    "for column in df_encoded.columns:\n",
    "    if df_encoded[column].dtype == 'object':\n",
    "        le = LabelEncoder()\n",
    "        df_encoded[column] = le.fit_transform(df_encoded[column])\n",
    "\n",
    "print(\"Preview of the numerically encoded data:\")\n",
    "display(df_encoded.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6. Correlation Matrix\n",
    "\n",
    "Now that the data is numeric, we can compute the correlations between the features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute the correlation matrix\n",
    "corr_matrix = df_encoded.corr()\n",
    "\n",
    "# Display it as a heatmap\n",
    "plt.figure(figsize=(18, 15))\n",
    "sns.heatmap(corr_matrix, annot=True, cmap='viridis', fmt='.2f', linewidths=0.5)\n",
    "plt.title('Feature Correlation Matrix')\n",
    "plt.show()"
   ]
  },
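  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reading a single column off a large heatmap is error-prone, so this short sketch ranks the features by the absolute value of their correlation with `class`. It assumes `corr_matrix` from the previous cell; `class_corr` and `order` are illustrative names. Because most columns were label-encoded, the coefficients are computed on integer codes and are best read as a rough ranking."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Correlation of every feature with the target, excluding the target itself\n",
    "class_corr = corr_matrix['class'].drop('class')\n",
    "\n",
    "# Order by strength of association, ignoring the sign\n",
    "order = class_corr.abs().sort_values(ascending=False).index\n",
    "\n",
    "# Show the signed coefficients in that order\n",
    "display(class_corr.reindex(order).to_frame('correlation_with_class'))"
   ]
  },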
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7. Creating the `diabetes_clean.csv` File\n",
    "\n",
    "We save the cleaned and encoded DataFrame to a new CSV file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "output_filename = 'diabetes_clean.csv'\n",
    "df_encoded.to_csv(output_filename, index=False)\n",
    "\n",
    "print(f\"File '{output_filename}' was created successfully.\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
