{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "90cf9f1c",
   "metadata": {},
   "source": [
    "# Bank Statement Analyzer with Mistral AI\n",
    "\n",
    "This Jupyter Notebook demonstrates how to extract and analyze data from bank statements using the Mistral AI API."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "213a011d",
   "metadata": {},
   "source": [
    "## Setup and Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1234abc",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import re\n",
    "from mistralai.client import MistralClient\n",
    "from mistralai.models.chat_completion import ChatMessage\n",
    "from datetime import datetime\n",
    "\n",
    "# Replace with your actual Mistral API key\n",
    "api_key = \"YOUR_MISTRAL_API_KEY\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95a2481c",
   "metadata": {},
   "source": [
    "## Core Functions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d786d11",
   "metadata": {},
   "source": [
    "### 1. `extract_data_with_mistral(ocr_text, statement_period)`\n",
    "\n",
    "This function sends the OCR text of a single bank statement page to the Mistral API and extracts key information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f223a114",
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_data_with_mistral(ocr_text, statement_period):\n",
    "    \"\"\"\n",
    "    Extracts key data from OCR text of a bank statement using Mistral AI.\n",
    "\n",
    "    Args:\n",
    "        ocr_text: A string containing the OCR'd text of the bank statement.\n",
    "        statement_period: A string representing the statement period, e.g., \"01/09/2024-30/09/2024\".\n",
    "\n",
    "    Returns:\n",
    "        A dictionary containing the extracted information, or None if an error occurs.\n",
    "    \"\"\"\n",
    "\n",
    "    client = MistralClient(api_key=api_key)\n",
    "\n",
    "    messages = [\n",
    "        ChatMessage(role=\"user\", content=f\"\"\"\n",
    "You are a helpful financial assistant that extracts information from bank statements.  You are provided with the OCR text of a single bank statement page. Analyze the text and extract the following information in JSON format:\n",
    "\n",
    "```json\n",
    "{\n",
    "  \"customer_name\": \"string\",  // Full name of the customer\n",
    "  \"bank_name\": \"string\", // Name of the bank\n",
    "  \"account_number\": \"string\", // Account number\n",
    "  \"statement_period\": \"string\", // Statement period (e.g., \"01/09/2024-30/09/2024\")\n",
    "  \"transactions\": [ // List of transactions\n",
    "    {\n",
    "      \"date\": \"string\", // Date of transaction (YYYY-MM-DD)\n",
    "      \"description\": \"string\", // Transaction description\n",
    "      \"debit\": \"float or null\", // Debit amount (null if not a debit)\n",
    "      \"credit\": \"float or null\", // Credit amount (null if not a credit)\n",
    "      \"balance\": \"float\" // Running balance after the transaction\n",
    "    }}\n",
    "  ],\n",
    "      \"total_inflow\": \"float or null\",  // Use null if not available on this page.\n",
    "    \"total_outflow\": \"float or null\", // Use null if not available on this page.\n",
    "      \"possible_salary\": \"float or null\"\n",
    "}}\n",
    "```\n",
    "\n",
    "Do *NOT* hallucinate information.  If a piece of information is not present, use \"null\".  Prioritize accuracy. Extract Total inflow and outflow, *ONLY* if mentioned in the statement.\n",
    "Here's the statement period to help with date formatting: {statement_period}\n",
    "\n",
    "Here is the OCR text of the bank statement:\n",
    "\n",
    "```\n",
    "{ocr_text}\n",
    "```\n",
    "        \"\"\"),\n",
    "    ]\n",
    "\n",
    "    chat_response = client.chat(\n",
    "        model=\"mistral-large-latest\",  # Or another suitable model like \"mistral-medium\"\n",
    "        messages=messages,\n",
    "    )\n",
    "\n",
    "    try:\n",
    "        # The response content should be a JSON string.\n",
    "        response_json = json.loads(chat_response.choices[0].message.content)\n",
    "        return response_json\n",
    "    except (json.JSONDecodeError, KeyError, IndexError) as e:\n",
    "        print(f\"Error processing Mistral response: {e}\")\n",
    "        print(f\"Raw response: {chat_response}\")\n",
    "        return None"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a5c1e4f",
   "metadata": {},
   "source": [
    "### 2. `categorize_transactions(transactions)`\n",
    "\n",
    "This function takes the extracted transactions and categorizes them based on their descriptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6c1b77a",
   "metadata": {},
   "outputs": [],
   "source": [
    "def categorize_transactions(transactions):\n",
    "    \"\"\"Categorizes transactions based on descriptions.\"\"\"\n",
    "    categorized_transactions = {}\n",
    "    for transaction in transactions:\n",
    "        description = transaction['description'].lower()\n",
    "\n",
    "        # Initialize categories\n",
    "        if not categorized_transactions: #first time, add all categories.\n",
    "            categorized_transactions = {\n",
    "              \"Transfers to Other Banks\": 0.0,\n",
    "              \"Transfers to Other SCB Accounts\": 0.0,\n",
    "              \"Bill Payments and Shopping\": 0.0,\n",
    "              \"ATM Withdrawals\": 0.0,\n",
    "              \"PromptPay Transfers\": 0.0,\n",
    "              \"Other Inflow\": 0.0\n",
    "          }\n",
    "\n",
    "        # Categorization Logic (add more rules as needed)\n",
    "        if any(bank in description for bank in [\"kbank\", \"ktb\", \"gsb\", \"bay\", \"ttb\", \"cimb\"]):\n",
    "            categorized_transactions[\"Transfers to Other Banks\"] += (transaction.get('debit') or 0.0)\n",
    "\n",
    "        elif \"scb x\" in description and \"transfer from\" not in description:\n",
    "            categorized_transactions[\"Transfers to Other SCB Accounts\"] += (transaction.get('debit') or 0.0)\n",
    "\n",
    "        elif any(keyword in description for keyword in [\"k+ shop\", \"flash pay\", \"minuteshome shop\", \"true money\", \"omise\", \"sips shopeepay\",\"abacus digital\"]):\n",
    "            categorized_transactions[\"Bill Payments and Shopping\"] += (transaction.get('debit') or 0.0)\n",
    "        elif \"online disb. money thunder\" in description:\n",
    "            categorized_transactions[\"Other Inflow\"] += (transaction.get('credit') or 0.0)\n",
    "        elif \"cardless atm\" in description.lower() or \"terminal no.\" in description.lower():\n",
    "            categorized_transactions[\"ATM Withdrawals\"] += (transaction.get('debit') or 0.0)\n",
    "        elif \"promptpay\" in description:\n",
    "            categorized_transactions[\"PromptPay Transfers\"] += (transaction.get('debit') or 0.0)\n",
    "        elif \"transfer from\" in description.lower():\n",
    "             categorized_transactions[\"Other Inflow\"] += (transaction.get('credit') or 0.0)\n",
    "        elif \"ออมสิน\" in description.lower():\n",
    "             categorized_transactions[\"Other Inflow\"] += (transaction.get('credit') or 0.0)\n",
    "        elif \"กสิกรไทย\" in description.lower():\n",
    "            categorized_transactions[\"Other Inflow\"] += (transaction.get('credit') or 0.0)\n",
    "\n",
    "\n",
    "    return categorized_transactions"
   ]
  },
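  {
   "cell_type": "markdown",
   "id": "c4e2a7d0",
   "metadata": {},
   "source": [
    "As a quick offline check, the cell below runs `categorize_transactions` on three hypothetical transactions. The descriptions and amounts are invented for illustration and do not come from a real statement. It assumes the cell defining the function has already been run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7f03b19",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical sample data, for illustration only.\n",
    "sample_transactions = [\n",
    "    {\"description\": \"Transfer to KBANK x1234\", \"debit\": 500.0, \"credit\": None},\n",
    "    {\"description\": \"PromptPay x5678\", \"debit\": 200.0, \"credit\": None},\n",
    "    {\"description\": \"Transfer from SCB x9999\", \"debit\": None, \"credit\": 15000.0},\n",
    "]\n",
    "\n",
    "# Print the per-category totals produced by the keyword rules.\n",
    "sample_totals = categorize_transactions(sample_transactions)\n",
    "for category, amount in sample_totals.items():\n",
    "    print(f\"{category}: {amount:.2f}\")"
   ]
  },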
  {
   "cell_type": "markdown",
   "id": "91a5cf91",
   "metadata": {},
   "source": [
    "### 3. `combine_monthly_data(all_extracted_data)`\n",
    "\n",
    "This function combines the extracted data from multiple pages (represented as a list of dictionaries) into monthly summaries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c55d863",
   "metadata": {},
   "outputs": [],
   "source": [
    "def combine_monthly_data(all_extracted_data):\n",
    "    \"\"\"Combines extracted data from multiple pages into monthly summaries.\"\"\"\n",
    "    combined_data = {}\n",
    "\n",
    "    for page_data in all_extracted_data:\n",
    "        if not page_data or 'transactions' not in page_data:\n",
    "            continue\n",
    "\n",
    "        # Extract statement period and infer month and year\n",
    "        statement_period = page_data.get('statement_period')\n",
    "        if not statement_period:\n",
    "            continue  # Skip if we can't determine the period\n",
    "\n",
    "        try:\n",
    "            # Attempt to parse the *end* date of the statement period to get month/year\n",
    "            end_date_str = statement_period.split('-')[1].strip()\n",
    "            end_date = datetime.strptime(end_date_str, '%d/%m/%Y')\n",
    "            month_year_key = end_date.strftime('%B %Y')  # e.g., \"September 2024\"\n",
    "        except (ValueError, IndexError):\n",
    "            continue #skip this\n",
    "\n",
    "        if month_year_key not in combined_data:\n",
    "            combined_data[month_year_key] = {\n",
    "                \"total_inflow\": 0.0,\n",
    "                \"total_outflow\": 0.0,\n",
    "                \"categories\": {  # Initialize categories\n",
    "                    \"Transfers to Other Banks\": 0.0,\n",
    "                    \"Transfers to Other SCB Accounts\": 0.0,\n",
    "                    \"Bill Payments and Shopping\": 0.0,\n",
    "                    \"ATM Withdrawals\": 0.0,\n",
    "                    \"PromptPay Transfers\": 0.0,\n",
    "                    \"Other Inflow\": 0.0,\n",
    "                },\n",
    "                \"possible_salary\": None,  # Initialize possible salary\n",
    "            }\n",
    "\n",
    "        # Accumulate inflow, outflow\n",
    "        combined_data[month_year_key][\"total_inflow\"] += (page_data.get('total_inflow') or 0.0)\n",
    "        combined_data[month_year_key][\"total_outflow\"] += (page_data.get('total_outflow') or 0.0)\n",
    "\n",
    "        # Combine categorized transactions\n",
    "        categorized = categorize_transactions(page_data['transactions'])\n",
    "        for category, amount in categorized.items():\n",
    "            combined_data[month_year_key][\"categories\"][category] += amount\n",
    "\n",
    "\n",
    "        # Update possible salary (keep the highest value seen)\n",
    "        if page_data.get('possible_salary') is not None:\n",
    "            if (combined_data[month_year_key][\"possible_salary\"] is None or\n",
    "                page_data['possible_salary'] > combined_data[month_year_key][\"possible_salary\"]):\n",
    "                combined_data[month_year_key][\"possible_salary\"] = page_data['possible_salary']\n",
    "\n",
    "    return combined_data"
   ]
  },
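  {
   "cell_type": "markdown",
   "id": "e19b6c52",
   "metadata": {},
   "source": [
    "The aggregation can be checked without calling the API by passing in two hypothetical page dictionaries that share a statement period. All values below are invented for illustration. It assumes the setup cell and the cells defining `categorize_transactions` and `combine_monthly_data` have already been run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f80a3d64",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical page results, shaped like the output of extract_data_with_mistral.\n",
    "sample_pages = [\n",
    "    {\n",
    "        \"statement_period\": \"01/09/2024-30/09/2024\",\n",
    "        \"transactions\": [{\"description\": \"PromptPay x5678\", \"debit\": 200.0, \"credit\": None}],\n",
    "        \"total_inflow\": None,\n",
    "        \"total_outflow\": None,\n",
    "        \"possible_salary\": None,\n",
    "    },\n",
    "    {\n",
    "        \"statement_period\": \"01/09/2024-30/09/2024\",\n",
    "        \"transactions\": [{\"description\": \"Transfer from SCB x9999\", \"debit\": None, \"credit\": 15000.0}],\n",
    "        \"total_inflow\": 15000.0,\n",
    "        \"total_outflow\": 200.0,\n",
    "        \"possible_salary\": 15000.0,\n",
    "    },\n",
    "]\n",
    "\n",
    "# Both pages share the same statement end date, so they are merged under one month key.\n",
    "print(json.dumps(combine_monthly_data(sample_pages), indent=2, ensure_ascii=False))"
   ]
  },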
  {
   "cell_type": "markdown",
   "id": "65b8855e",
   "metadata": {},
   "source": [
    "## Main Execution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e6a483a",
   "metadata": {},
   "outputs": [],
   "source": [
    "def main():\n",
    "    all_pages_data = []\n",
    "\n",
    "    # --- Process each OCR text file (simulating multiple pages) ---\n",
    "    ocr_texts = {\n",
    "          \"page_1.txt\": \"01/09/2024-30/09/2024\",  # Example filename and period\n",
    "          \"page_2.txt\": \"01/09/2024-30/09/2024\",\n",
    "          \"page_3.txt\":\"01/09/2024-30/09/2024\",\n",
    "          \"page_4.txt\":\"01/01/2025-31/01/2025\",\n",
    "          \"page_5.txt\":\"01/01/2025-31/01/2025\",\n",
    "          \"page_6.txt\":\"01/01/2025-31/01/2025\",\n",
    "          \"page_7.txt\": \"01/01/2025-31/01/2025\",\n",
    "          \"page_8.txt\": \"01/01/2025-31/01/2025\",\n",
    "          \"page_9.txt\": \"01/12/2024-31/12/2024\",\n",
    "          \"page_10.txt\": \"01/12/2024-31/12/2024\",\n",
    "          \"page_11.txt\": \"01/12/2024-31/12/2024\",\n",
    "          \"page_12.txt\": \"01/12/2024-31/12/2024\",\n",
    "          \"page_13.txt\": \"01/12/2024-31/12/2024\",\n",
    "          \"page_14.txt\":\"01/11/2024-30/11/2024\",\n",
    "          \"page_15.txt\": \"01/11/2024-30/11/2024\",\n",
    "          \"page_16.txt\": \"01/10/2024-31/10/2024\",\n",
    "          \"page_17.txt\": \"01/10/2024-31/10/2024\",\n",
    "          \"page_18.txt\":\"01/08/2024-31/08/2024\",\n",
    "          \"page_19.txt\":\"01/08/2024-31/08/2024\"\n",
    "    }\n",
    "\n",
    "    for filename, statement_period in ocr_texts.items():\n",
    "        try:\n",
    "            with open(filename, \"r\", encoding=\"utf-8\") as file:\n",
    "                ocr_text = file.read()\n",
    "                page_data = extract_data_with_mistral(ocr_text, statement_period)\n",
    "                if page_data:\n",
    "                    all_pages_data.append(page_data)\n",
    "        except FileNotFoundError:\n",
    "            print(f\"Error: File '{filename}' not found.\")\n",
    "            return\n",
    "\n",
    "    # Combine data from all pages into monthly summaries\n",
    "    combined_monthly_data = combine_monthly_data(all_pages_data)\n",
    "\n",
    "    # Print the combined results\n",
    "    print(json.dumps(combined_monthly_data, indent=2, ensure_ascii=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c3a1b4c",
   "metadata": {},
   "source": [
    "## Run the Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9b98e7f",
   "metadata": {},
   "outputs": [],
   "source": [
    "if __name__ == \"__main__\":\n",
    "    main()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}