{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# BPS Question Deduplication — Interactive Notebook\n",
    "### Semantic Similarity • Cross‑Survey Duplicate Detection • Clustering\n",
    "\n",
    "Notebook ini menampilkan pipeline lengkap:\n",
    "1. Load data pertanyaan BPS\n",
    "2. Generate semantic embeddings\n",
    "3. Hitung similarity & candidate duplikat\n",
    "4. Visualisasi similarity heatmap\n",
    "5. Clustering pertanyaan mirip\n",
    "6. Tampilkan hasil akhir\n",
    "\n",
    "**Model NLP:** SentenceTransformers (MiniLM v2)\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sentence_transformers import SentenceTransformer\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "from sklearn.neighbors import NearestNeighbors\n",
    "import networkx as nx\n",
    "\n",
    "print(\"Libraries loaded.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Load Data\n",
    "Dataset contoh berisi 50 pertanyaan dummy lintas survei BPS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"data/raw_questions.csv\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Embedding dengan SentenceTransformers\n",
    "Menggunakan model:\n",
    "**`all-MiniLM-L6-v2` (English/Multilingual friendly)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = SentenceTransformer(\"sentence-transformers/all-MiniLM-L6-v2\")\n",
    "emb = model.encode(df['question_text'].tolist(), batch_size=32, show_progress_bar=True)\n",
    "emb.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Similarity Matrix + Heatmap"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sim_matrix = cosine_similarity(emb)\n",
    "plt.figure(figsize=(10,8))\n",
    "sns.heatmap(sim_matrix, cmap=\"viridis\")\n",
    "plt.title(\"Similarity Matrix Heatmap\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Cari pasangan duplikat\n",
    "Threshold default: **0.78**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "THRESHOLD = 0.78\n",
    "\n",
    "pairs = []\n",
    "n = len(df)\n",
    "\n",
    "for i in range(n):\n",
    "    for j in range(i+1, n):\n",
    "        sim = sim_matrix[i][j]\n",
    "        if sim >= THRESHOLD:\n",
    "            pairs.append({\n",
    "                \"id1\": df.loc[i,\"question_id\"],\n",
    "                \"id2\": df.loc[j,\"question_id\"],\n",
    "                \"survey1\": df.loc[i,\"survey_name\"],\n",
    "                \"survey2\": df.loc[j,\"survey_name\"],\n",
    "                \"question1\": df.loc[i,\"question_text\"],\n",
    "                \"question2\": df.loc[j,\"question_text\"],\n",
    "                \"similarity\": float(sim)\n",
    "            })\n",
    "\n",
    "pairs_df = pd.DataFrame(pairs)\n",
    "pairs_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Clustering Dengan Connected Components\n",
    "Cluster = kelompok pertanyaan yang saling mirip."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "G = nx.Graph()\n",
    "\n",
    "for _, r in pairs_df.iterrows():\n",
    "    G.add_edge(r['id1'], r['id2'])\n",
    "\n",
    "clusters = list(nx.connected_components(G))\n",
    "clusters = [sorted(list(c)) for c in clusters]\n",
    "\n",
    "clusters[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Simpan Output ke Folder results/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os, json\n",
    "\n",
    "os.makedirs(\"results\", exist_ok=True)\n",
    "\n",
    "pairs_df.to_csv(\"results/similarity_pairs.csv\", index=False)\n",
    "np.save(\"results/embeddings.npy\", emb)\n",
    "\n",
    "with open(\"results/clusters.json\", \"w\") as f:\n",
    "    json.dump({\"clusters\": clusters}, f, indent=4, ensure_ascii=False)\n",
    "\n",
    "print(\"Saved all results to /results/\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
