{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploratory Data Analysis (EDA) for the Xente Dataset\n",
    "\n",
    "This notebook explores the Xente dataset to uncover patterns, identify data quality issues, and guide feature engineering for the Bati Bank credit risk model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Set seaborn theme for better visualizations\n",
    "sns.set_theme(style='whitegrid')\n",
    "\n",
    "# Load dataset\n",
    "df = pd.read_csv('../data/raw/xente_data.csv')\n",
    "\n",
    "# Overview\n",
    "print('Dataset Info:')\n",
    "df.info()  # info() prints its own report and returns None, so it is not wrapped in print()\n",
    "print('\\nFirst 5 Rows:')\n",
    "print(df.head())\n",
    "\n",
    "# Summary Statistics\n",
    "print('\\nSummary Statistics:')\n",
    "print(df.describe())\n",
    "\n",
    "# Missing Values\n",
    "print('\\nMissing Values:')\n",
    "print(df.isnull().sum())\n",
    "\n",
    "# Numerical Feature Distributions\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.histplot(df['Amount'], bins=50, kde=True)\n",
    "plt.title('Distribution of Transaction Amount')\n",
    "plt.xlabel('Amount')\n",
    "plt.ylabel('Count')\n",
    "plt.show()\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.histplot(df['Value'], bins=50, kde=True)\n",
    "plt.title('Distribution of Transaction Value')\n",
    "plt.xlabel('Value')\n",
    "plt.ylabel('Count')\n",
    "plt.show()\n",
    "\n",
    "# Categorical Feature Distributions\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.countplot(y='ProductCategory', data=df, order=df['ProductCategory'].value_counts().index)\n",
    "plt.title('Distribution of Product Categories')\n",
    "plt.xlabel('Count')\n",
    "plt.ylabel('Product Category')\n",
    "plt.show()\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.countplot(y='ChannelId', data=df, order=df['ChannelId'].value_counts().index)\n",
    "plt.title('Distribution of Channel IDs')\n",
    "plt.xlabel('Count')\n",
    "plt.ylabel('Channel ID')\n",
    "plt.show()\n",
    "\n",
    "# Correlation Analysis\n",
    "plt.figure(figsize=(8, 6))\n",
    "sns.heatmap(df[['Amount', 'Value']].corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)\n",
    "plt.title('Correlation Heatmap')\n",
    "plt.show()\n",
    "\n",
    "# Outlier Detection\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.boxplot(x=df['Amount'])\n",
    "plt.title('Box Plot of Transaction Amount')\n",
    "plt.xlabel('Amount')\n",
    "plt.show()\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.boxplot(x=df['Value'])\n",
    "plt.title('Box Plot of Transaction Value')\n",
    "plt.xlabel('Value')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key Insights\n",
    "1. **Skewed Distributions**: The `Amount` and `Value` columns are highly right-skewed, with most transactions having low values and a few large outliers, suggesting the need for log-transformation or robust scaling in feature engineering.\n",
    "2. **Dominant Categories**: A few `ProductCategory` values (e.g., airtime, financial services) account for most transactions, which makes the column a candidate feature for credit risk modeling; the rare categories may need to be grouped before encoding.\n",
    "3. **No Missing Values**: The dataset has no missing values, simplifying preprocessing steps.\n",
    "4. **High Correlation**: `Amount` and `Value` are highly correlated (coefficient close to 1), so the two columns carry largely redundant information and one of them can likely be dropped during feature selection.\n",
    "5. **Outliers**: Box plots reveal significant outliers in `Amount`, which may need to be capped or removed to improve model robustness."
   ]
  }
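  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next cell is a minimal sketch of how insights 1, 4, and 5 could be turned into preprocessing steps; it is not part of the original analysis. It runs on a small toy DataFrame with made-up values rather than on `df`. The helper names `signed_log1p` and `cap_iqr`, the IQR multiplier of 1.5, and the 0.95 correlation threshold are illustrative assumptions. The sign-preserving log assumes `Amount` may contain negative values; if it does not, a plain `np.log1p` would be enough."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Toy stand-in for the transaction data (hypothetical values, not taken from the Xente dataset)\n",
    "toy = pd.DataFrame({'Amount': [-50.0, 100.0, 250.0, 500.0, 1000.0, 2000.0, 900000.0]})\n",
    "toy['Value'] = toy['Amount'].abs()\n",
    "\n",
    "def signed_log1p(s):\n",
    "    # Compress large magnitudes with log1p while keeping the sign of each value\n",
    "    return np.sign(s) * np.log1p(s.abs())\n",
    "\n",
    "def cap_iqr(s, k=1.5):\n",
    "    # Clip values outside [Q1 - k*IQR, Q3 + k*IQR] instead of dropping rows\n",
    "    q1, q3 = s.quantile([0.25, 0.75])\n",
    "    iqr = q3 - q1\n",
    "    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)\n",
    "\n",
    "# Insight 1: reduce right skew; Insight 5: cap extreme values\n",
    "toy['Amount_log'] = signed_log1p(toy['Amount'])\n",
    "toy['Amount_capped'] = cap_iqr(toy['Amount'])\n",
    "\n",
    "# Insight 4: if the two columns are nearly collinear, keep only one of them\n",
    "corr = toy['Amount'].corr(toy['Value'])\n",
    "features = toy.drop(columns=['Value']) if abs(corr) > 0.95 else toy\n",
    "\n",
    "print(f'Amount/Value correlation: {corr:.3f}')\n",
    "print(features)"
   ]
  }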
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}