In [1]:
import pandas as pd
import numpy as np
import os 

from utils import load_eviction_case_identifiers, extract_eviction_case_text
from utils import clean_texts, prepare_topic_modeling_corpus, selection_of_number_of_topics
from utils import visualize_topics, generate_topic_distributions

%load_ext autoreload
%autoreload 2

Here is the pipeline for the model 'nl_core_news_lg': 
	['tok2vec', 'morphologizer', 'lemmatizer', 'attribute_ruler', 'ner', 'company_name_detector', 'merge_entities']


## Preprocessing Description

In this preprocessing phase, we perform the following steps to prepare the data for topic modeling:

#### Load and Clean the Eviction Cases:

- `load_eviction_case_identifiers()` function:
loads the European Case Law Identifier (ECLI) codes associated with eviction cases.

- `extract_eviction_case_text()` function:
takes the list of ECLI codes and retrieves the full text of the corresponding judgments from a preloaded dataset.

- `clean_texts()` function:
cleans each judgment text in the following steps and saves the result:
    - Lemmatization: Convert words to their base or dictionary form (lemma).
    - Lowercasing: Convert all text to lowercase to ensure uniformity.
    - Stop Word Removal: Exclude common stop words and legal terms that are not relevant to the analysis.
    - Token Filtering: Remove non-alphabetic tokens and words with fewer than four characters.
    - Entity Removal: Exclude tokens that are part of named entities, as these may not contribute meaningfully to the topic modeling.
    - Saving: Write the cleaned texts to a CSV file (`./data/clean_texts.csv` in this notebook).

This ensures that the preprocessed data is ready for the subsequent topic modeling step.
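
The token-level filters listed above can be sketched as follows. This is only an illustration and not the code inside `utils.clean_texts`: the `Token` class, the `filter_tokens` helper and the tiny stop-word set are hypothetical stand-ins for the lemmas and entity flags that the spaCy pipeline would supply.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for a token produced by an NLP pipeline."""
    lemma: str
    is_entity: bool = False


# Hypothetical stop-word list; the real list (including legal terms) is not shown.
STOP_WORDS = {"deze", "voor", "rechtbank"}


def filter_tokens(tokens, stop_words=STOP_WORDS, min_len=4):
    """Apply the cleaning rules described above to one document."""
    kept = []
    for tok in tokens:
        lemma = tok.lemma.lower()          # lowercasing of the lemma
        if tok.is_entity:                  # entity removal
            continue
        if not lemma.isalpha():            # drop non-alphabetic tokens
            continue
        if len(lemma) < min_len:           # drop words shorter than four characters
            continue
        if lemma in stop_words:            # stop word removal
            continue
        kept.append(lemma)
    return kept


example = [
    Token("Ontruiming"),
    Token("Amsterdam", is_entity=True),
    Token("2000"),
    Token("huur"),
    Token("de"),
    Token("rechtbank"),
    Token("huurder"),
]
print(filter_tokens(example))  # ['ontruiming', 'huur', 'huurder']
```

The order of the checks does not change the result, since a token is dropped as soon as any single rule applies.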


In [2]:
# load ECLI numbers and the corresponding judgment texts, then clean and save the texts
eviction_ecli = load_eviction_case_identifiers()
eviction_ecli_with_texts, eviction_texts = extract_eviction_case_text(eviction_ecli)

ecli_nos, cleaned_texts = clean_texts(eviction_ecli_with_texts, eviction_texts, "./data/clean_texts.csv")

There are 5047 ecli numbers of eviction cases.
There are 5021 eviction-related cases with texts (26 cases with no text).


## Topic Modeling Process

__1) Loading Preprocessed Data:__

We begin by loading the preprocessed textual data from a CSV file containing cleaned text. The data is stored as a list of documents, where each document is represented as a list of words.


__2) Text Vectorization (Bag of Words):__

The text is then converted into a vector representation using the Bag of Words model. This step creates a dictionary (`idx2word`) that links each unique word to an integer index, and a document-term matrix (`doc_term_matrix`) that records how often each word occurs in each document.
- Words that appear in fewer than 10 documents or in more than 40% of the documents are excluded, so that very rare and very common terms do not dominate the topics.
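
The document-frequency filter can be illustrated in plain Python. The sketch below is an assumption about what `prepare_topic_modeling_corpus` does conceptually, not its actual implementation; `build_bow` is a hypothetical helper that returns an index-to-word mapping and, per document, a list of `(word index, count)` pairs.

```python
from collections import Counter


def build_bow(docs, min_doc_count=10, max_doc_proportion=0.4):
    """Build a filtered vocabulary and sparse bag-of-words vectors."""
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    # Keep words that are neither too rare nor too common.
    max_docs = max_doc_proportion * len(docs)
    vocab = sorted(
        word for word, df in doc_freq.items()
        if min_doc_count <= df <= max_docs
    )
    word2idx = {word: i for i, word in enumerate(vocab)}
    idx2word = {i: word for word, i in word2idx.items()}

    # One sparse vector of (word index, count) pairs per document.
    doc_term_matrix = []
    for doc in docs:
        counts = Counter(word2idx[w] for w in doc if w in word2idx)
        doc_term_matrix.append(sorted(counts.items()))
    return idx2word, doc_term_matrix


toy_docs = [
    ["huur", "huur", "schuld"],
    ["huur", "ontruiming"],
    ["huur", "schuld"],
    ["woning"],
]
idx2word_toy, dtm_toy = build_bow(toy_docs, min_doc_count=2, max_doc_proportion=0.5)
print(idx2word_toy)  # {0: 'schuld'}
print(dtm_toy)       # [[(0, 1)], [], [(0, 1)], []]
```

In the toy corpus, `huur` is dropped because it occurs in three of four documents (more than 50%), while `ontruiming` and `woning` are dropped because they occur in only one document.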


__3) Coherence Measure Computation:__

To determine the optimal number of topics for the model, we compute coherence scores across a range of topic numbers (from 10 to 25). The coherence score measures the interpretability of the topics, with a higher score indicating more coherent topics. This analysis uses the `c_v` coherence measure, which counts word co-occurrences within a sliding window over the texts and combines them through an indirect confirmation measure (normalized pointwise mutual information compared by cosine similarity).
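
Once the scores are available, choosing the number of topics amounts to taking the candidate with the highest coherence. The helper below is a hypothetical sketch, assuming that `ntopics_range` and `coherence_values` are two sequences of equal length; the scores in the example are made-up numbers, not results from this corpus.

```python
def best_num_topics(ntopics_range, coherence_values):
    """Return the (number of topics, score) pair with the highest coherence."""
    if len(ntopics_range) != len(coherence_values):
        raise ValueError("each candidate needs exactly one coherence score")
    best = max(range(len(coherence_values)), key=lambda i: coherence_values[i])
    return ntopics_range[best], coherence_values[best]


# Illustrative values only.
toy_range = [10, 11, 12]
toy_scores = [0.40, 0.45, 0.43]
print(best_num_topics(toy_range, toy_scores))  # (11, 0.45)
```

In practice the curve is often flat near its maximum, so it can be worth inspecting the neighbouring candidates as well instead of relying on the single best score.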


__4) Topic Visualization:__

Finally, we visualize the topics generated by the model. The `visualize_topics` function produces interactive visualizations that display the top words associated with each topic. This helps in understanding the distinct themes captured by the model.
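
The top words shown in such a visualization are the highest-weighted vocabulary entries of each topic. A minimal sketch of that ranking is given below; `top_words` and the toy weights are hypothetical and are not taken from `visualize_topics` or from the fitted model.

```python
def top_words(topic_word_weights, idx2word, n=3):
    """For each topic, return the n words with the largest weight."""
    result = []
    for weights in topic_word_weights:
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([idx2word[i] for i in ranked[:n]])
    return result


# Toy vocabulary and two toy topics (illustrative weights only).
toy_idx2word = {0: "huur", 1: "schuld", 2: "overlast", 3: "woning"}
toy_topics = [
    [0.50, 0.30, 0.05, 0.15],
    [0.10, 0.05, 0.60, 0.25],
]
print(top_words(toy_topics, toy_idx2word, n=2))
# [['huur', 'schuld'], ['overlast', 'woning']]
```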


Together, these steps let us identify and interpret the latent topics in the corpus of eviction judgments.


In [10]:
# 1) load preprocessed data
df = pd.read_csv("./data/clean_texts.csv")
ecli = df.ecli.tolist()
clean_text = df.clean_text.tolist()
docs = [txt.split(" ") for txt in clean_text]
df.head()

Unnamed: 0,ecli,clean_text
0,ECLI:NL:RBAMS:2000:AA5199,rolnummer verloop procedure terechtzitting eis...
1,ECLI:NL:RBAMS:2000:AF0022,schorsing executie ontruimingsvonnis schuldsan...
2,ECLI:NL:RBMID:2000:AF0403,oordeelt schuldsaneringsregeling ontbinding hu...
3,ECLI:NL:RBROT:2000:AF0496,ontruimingsbevoegdheid belangenafweging verbod...
4,ECLI:NL:RBARN:2000:AA4293,vonnis president arrondissementsrechtbank kort...


In [7]:
# Text Vectorization (Bag of Words)
idx2word, doc_term_matrix = prepare_topic_modeling_corpus(docs, min_doc_count=10, max_doc_proportion=0.4)

Tokens considered: Appear in at least 10 documents and at most 40.0% of documents.
Number of unique tokens: 7186
Number of documents: 5021


In [8]:
# compute coherence measures for different number of topics 
ntopics_range, coherence_values = selection_of_number_of_topics(
    idx2word, 
    doc_term_matrix, 
    docs,
    start=10,  # the minimum number of topics
    stop=26,   # exclusive upper bound, so at most 25 topics are tried
    step=1,
    coherence_type='c_v'
)

Computing coherence scores using "c_v" coherence measure...


  logger.warn("stats accumulation interrupted; <= %d documents processed", self._num_docs)
Traceback (most recent call last):
  File "/home/mohammad/anaconda3/lib/python3.9/multiprocessing/queues.py", line 251, in _feed
    send_bytes(obj)
  File "/home/mohammad/anaconda3/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/mohammad/anaconda3/lib/python3.9/multiprocessing/connection.py", line 410, in _send_bytes
    self._send(buf)
  File "/home/mohammad/anaconda3/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

KeyboardInterrupt: 

In [5]:
# visualize the topics through their top words
visualize_topics(idx2word, doc_term_matrix)

  from imp import reload


Saved LDA visualization for model 5 to ./pics/lda_D5.html.


## Compute the Topic Distribution of Each Case

In [18]:
from utils import generate_topic_distributions

ntopics = 10

df_embedding = generate_topic_distributions(ecli, doc_term_matrix, ntopics)

csv_file = f'./data/topics_distribution_D{ntopics}.csv'
df_embedding.to_csv(csv_file)
print(f"The topic distribution (for {ntopics} topics) has been saved in {csv_file}.")

df_embedding.head()

The topic distribution (for 10 topics) has been saved in ./data/topics_distribution_D10.csv.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
ECLI:NL:RBAMS:2000:AA5199,0.067015,0.152803,0.0,0.198621,0.014613,0.0,0.012703,0.434296,0.11044,0.0
ECLI:NL:RBAMS:2000:AF0022,0.094789,0.036663,0.0,0.197276,0.0,0.362863,0.0,0.174135,0.107741,0.019224
ECLI:NL:RBMID:2000:AF0403,0.065546,0.023613,0.0,0.406954,0.013781,0.205834,0.0,0.201847,0.075587,0.0
ECLI:NL:RBROT:2000:AF0496,0.120674,0.091034,0.0,0.096353,0.0,0.360068,0.045406,0.217152,0.046998,0.018216
ECLI:NL:RBARN:2000:AA4293,0.13989,0.576206,0.0,0.0,0.089005,0.079133,0.0,0.040606,0.064075,0.0


(5021, 10)
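
Each row of this table holds the topic weights of one judgment, so a common next step is to look up the dominant topic per case. The sketch below uses two rows copied from the output above; the `dominant_topic` helper is a hypothetical illustration and not part of `utils`.

```python
def dominant_topic(weights):
    """Return the index of the topic with the largest weight."""
    return max(range(len(weights)), key=lambda k: weights[k])


# Two rows copied from the table above (10 topic weights per case).
rows = {
    "ECLI:NL:RBAMS:2000:AA5199": [0.067015, 0.152803, 0.0, 0.198621, 0.014613,
                                  0.0, 0.012703, 0.434296, 0.11044, 0.0],
    "ECLI:NL:RBARN:2000:AA4293": [0.13989, 0.576206, 0.0, 0.0, 0.089005,
                                  0.079133, 0.0, 0.040606, 0.064075, 0.0],
}

for ecli_code, weights in rows.items():
    print(ecli_code, "-> topic", dominant_topic(weights))
# ECLI:NL:RBAMS:2000:AA5199 -> topic 7
# ECLI:NL:RBARN:2000:AA4293 -> topic 1
```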


