<a href="https://colab.research.google.com/github/noranazmy/learnai/blob/main/QuranWordEmbeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring word proximities in the Quran

This notebook is an exploration of the distribution of words in the Quran.

Using a neural network, words are transformed into vectors such that the distance between two vectors reflects how often the corresponding words appear in similar contexts. **Clusters of vectors show which words tend to occur close to each other in the text, not what they mean or how they relate to each other conceptually.**
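
As a minimal sketch of what "distance" can mean here, assume cosine similarity as the measure (the neighbor computation later in this notebook uses it as well). The three vectors are made-up toy values, not model output, and `cosine_similarity` is a helper written for this illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors (hypothetical values, for illustration only)
word_a = [0.9, 0.1, 0.0]
word_b = [0.8, 0.2, 0.1]
word_c = [0.0, 0.1, 0.9]

print(cosine_similarity(word_a, word_b))  # near 1: vectors point the same way
print(cosine_similarity(word_a, word_c))  # near 0: vectors are almost orthogonal
```

A high score only says that two words were seen in similar surroundings during training; it says nothing about their meaning.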

The Quran is a text whose meaning and significance cannot be reduced to statistical patterns. These patterns cannot capture its spiritual depth or uncover any fundamental truths within it. On a personal note, I believe that **datafication is fundamentally opposite to the kind of engagement a text such as the Quran demands.**

In fact, Arabic speakers will find this exercise to be a clear demonstration of the difficulty of applying this type of computation to Quranic Arabic, the richness that is immediately lost, and the sheer number of errors encountered as soon as you try to preprocess the individual words down to some normal form.

For instance, attempting to reduce:

* 'مالك'

from the third verse of the Quran, 'مالك يوم الدين', returns:

* 'مال'

when it should return:

* 'ملك'
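
One way such an error can arise: in undiacritized text, 'مالك' can also be read as 'مال' followed by the pronoun suffix 'ك'. The toy function below is a hypothetical illustration of that ambiguity, not a description of how CAMeL Tools works internally; `naive_strip` and its short suffix list are assumptions made for this sketch.

```python
# A few pronoun suffixes, chosen for illustration only (not a complete list)
PRONOUN_SUFFIXES = ("هم", "ها", "ك", "ه")

def naive_strip(token, min_stem=3):
    """Strip one trailing pronoun suffix if at least `min_stem` letters remain."""
    for suffix in PRONOUN_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token

print(naive_strip("مالك"))  # the final letter is treated as a suffix
print(naive_strip("يوم"))   # no listed suffix matches; returned unchanged
```

Without context or diacritics, a purely word-level procedure has no way to tell which reading is intended.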

This exercise should not be considered as anything more than a technical challenge to:

1. Work with neural networks
1. Create and visualize word embeddings
1. Work with the Arabic language and the Quran as a particularly challenging dataset, testing out different approaches to preprocessing and discovering the corresponding Python libraries


## 0. Setup

This notebook uses `camel-tools` to preprocess each word in the Quran down to a primary form such as a lemma or root. Particles like prepositions can also be removed.

In [None]:
Processing = "Lemma" # @param ["Lemma", "Root", "None"] {type:"string"}
Particles = "Exclude" # @param ["Include", "Exclude"] {type:"string"}

In [None]:
# Install the necessary packages
!pip -q install fasttext camel-tools

In [None]:
# Download the morphology database
!camel_data -i morphology-db-msa-r13

In [None]:
from camel_tools.morphology.database import MorphologyDB
from camel_tools.morphology.analyzer import Analyzer

db = MorphologyDB.builtin_db()
analyzer = Analyzer(db)

def preprocess_token(token, form="lex", skip_particles=False):
    analyses = analyzer.analyze(token)
    if not analyses:
        return token

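    # Use the first analysis returned. No disambiguation is applied,
    # so ambiguous tokens may be mapped to the wrong form.
    analysis = analyses[0]
    pos = analysis.get('pos', '')

    # Skip particles (prepositions, conjunctions, etc.)
    if skip_particles and pos in ('prep', 'conj', 'part'):
        return None

    # Fall back to the original token when the requested form is missing
    # or is marked UNKNOWN or NTWS
    if form in analysis and analysis[form].upper() not in ("UNKNOWN", "NTWS"):
        return analysis[form]
    return token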

def preprocess(verses):
    if Processing == "None":
        return verses
    form = "lex" if Processing == "Lemma" else "root"
    skip_particles = Particles == "Exclude"
    processed_verses = []
    for verse in verses:
        tokens = verse.split()
        forms = [preprocess_token(token, form, skip_particles) for token in tokens]
        forms = [f for f in forms if f is not None]
        if forms:
            processed_verses.append(" ".join(forms))
    return processed_verses

## 1. Load the Quran corpus

We use [Tanzil](http://tanzil.net/updates/) to download the entire Quran as text with diacritics, verse numbers, and other markers removed.

* Tanzil Quran Text (Simple Clean, Version 1.1)
* License: Creative Commons Attribution 3.0
* Copyright (C) 2007-2025 Tanzil Project

In [None]:
from pathlib import Path

# Read the Quran file contents
contents = Path('quran-simple-clean-no-verse-numbers.txt').read_text(encoding="utf-8")

# Remove the copyright after the blank line
quran_lines = contents.split("\n\n", 1)[0]
verses = quran_lines.split("\n")

print(f"Loaded {len(verses)} Quranic verses")
print(f"{verses[:10]}")

## 2. Preprocessing

We use CAMeL Tools to replace each word with a normal form: the **root** or the **lemma**, depending on the `Processing` setting. This step can also be disabled entirely.

Particles such as prepositions can also be removed.
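
The effect of these two options can be sketched on a toy example. The lookup table below is hand-written for illustration and stands in for the morphological analyzer; `TOY_ANALYSES` and `normalize` are hypothetical names, and the entries are not CAMeL Tools output.

```python
# Hand-written stand-in for a morphological analyzer (illustrative only).
# Each token maps to a hypothetical lemma, root, and part-of-speech tag.
TOY_ANALYSES = {
    "الكتاب": {"lex": "كتاب", "root": "ك.ت.ب", "pos": "noun"},
    "في": {"lex": "في", "root": "في", "pos": "prep"},
    "كتبوا": {"lex": "كتب", "root": "ك.ت.ب", "pos": "verb"},
}

def normalize(verse, form="lex", skip_particles=False):
    """Replace each token with its lemma or root, optionally dropping particles."""
    out = []
    for token in verse.split():
        analysis = TOY_ANALYSES.get(token)
        if analysis is None:
            out.append(token)  # unknown tokens pass through unchanged
            continue
        if skip_particles and analysis["pos"] in ("prep", "conj", "part"):
            continue  # drop the particle entirely
        out.append(analysis[form])
    return " ".join(out)

verse = "كتبوا في الكتاب"
print(normalize(verse, form="lex"))
print(normalize(verse, form="root", skip_particles=True))
```

With roots and particle removal, the two remaining tokens collapse to the same string, which shows how much surface detail this kind of normalization discards.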

In [None]:
# Transform derived words into a primary form
processed_verses = preprocess(verses)

# Write the processed corpus temporarily with the copyright removed
Path("quran-corpus.tmp").write_text("\n".join(processed_verses), encoding="utf-8")

print(f"Processing chosen: {Processing}.")
print(processed_verses[:15])

## 3. Training

In [None]:
import fasttext

model = fasttext.train_unsupervised(
    input="quran-corpus.tmp",
    model="skipgram",
    dim=100,
    ws=5,
    minn=1,
    maxn=1,
    epoch=25,
    lr=0.05,
    thread=2
)

vocabulary = set(model.get_words())
vocabulary_size = len(vocabulary)
print(f"Finished training on {len(processed_verses)} Quranic verses. Vocabulary size is {vocabulary_size}.")
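
A quick way to inspect the trained model is fastText's built-in nearest-neighbor query. This is a minimal sketch, assuming `model` is the object returned by `train_unsupervised` above; `format_neighbors` is a hypothetical helper name introduced here.

```python
def format_neighbors(model, word, k=10):
    """Return the k nearest neighbors of `word` as 'token<TAB>score' lines."""
    # get_nearest_neighbors returns a list of (similarity, token) pairs
    pairs = model.get_nearest_neighbors(word, k=k)
    return [f"{token}\t{score:.3f}" for score, token in pairs]
```

Calling `print("\n".join(format_neighbors(model, some_word)))` with a word from the vocabulary lists its closest vectors, which is a useful sanity check before moving on to the visualization.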

## 4. Visualization

In [None]:
from collections import Counter
import pandas as pd

# Count tokens in the processed corpus, since that is the text the model was trained on
token_counts = Counter()
for verse in processed_verses:
    token_counts.update(verse.split())

most_common = token_counts.most_common(vocabulary_size)
export_words = [w for w, _ in most_common if w in vocabulary]
pd.DataFrame(most_common, columns=["Token", "Frequency"])

In [None]:
Path("metadata.tsv").write_text("\n".join(export_words) + "\n", encoding="utf-8")

meta_lines = ["token\tfreq"]
for w in export_words:
    meta_lines.append(f"{w}\t{token_counts[w]}")
Path("full_metadata.tsv").write_text("\n".join(meta_lines) + "\n", encoding="utf-8")
print("Wrote full_metadata.tsv")

In [None]:
import numpy as np

vecs = np.vstack([model.get_word_vector(w) for w in export_words])

with open("vectors.tsv", "w", encoding="utf-8") as f:
    for row in vecs:
        f.write("\t".join(f"{x:.6f}" for x in row) + "\n")

print("vectors.tsv shape:", vecs.shape)


In [None]:
import json
import numpy as np
from pathlib import Path

k = 15

X = vecs.astype(np.float32)
X /= (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)

neighbors = {}
for i, w in enumerate(export_words):
    sims = X @ X[i]                 # cosine sims to all
    sims[i] = -1.0                  # exclude self
    idx = np.argpartition(-sims, k)[:k]
    idx = idx[np.argsort(-sims[idx])]
    neighbors[w] = [{"token": export_words[j], "score": float(sims[j])} for j in idx]

Path("neighbors.json").write_text(json.dumps(neighbors, ensure_ascii=False, indent=2), encoding="utf-8")
print("Wrote neighbors.json")

In [None]:
# Remove any previous projector output; -f avoids an error if the directory does not exist
!rm -rf projector

In [None]:
from tensorboard.plugins import projector
import tensorflow as tf
import os
import shutil

LOG_DIR = "projector"
os.makedirs(LOG_DIR, exist_ok=True)

# Copy metadata
shutil.copy("metadata.tsv", os.path.join(LOG_DIR, "metadata.tsv"))

# Create checkpoint
embedding_var = tf.Variable(vecs, name="embedding")
checkpoint = tf.train.Checkpoint(embedding=embedding_var)
checkpoint.save(os.path.join(LOG_DIR, "embedding.ckpt"))

# Configure projector
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = "metadata.tsv"

projector.visualize_embeddings(LOG_DIR, config)

In [None]:
%load_ext tensorboard
%tensorboard --logdir projector