# Legal Documents File System Analysis

This notebook analyzes the file system structure of the legal documents repository. It counts files by type, converts the JSON metadata files to TSV, and merges the results into a single table for easier processing.

### Overview
- Clone the legal documents repository
- Analyze file type distribution across different document categories
- Convert JSON metadata files to structured TSV format
- Merge the per-category TSV files into a single table for further processing and analysis

In [2]:
!git clone https://github.com/nuuuwan/lk_legal_docs

fatal: destination path 'lk_legal_docs' already exists and is not an empty directory.
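
The `fatal` message above appears whenever the cell is re-run after a successful clone. A minimal sketch of a guard that skips the clone when the folder is already populated is shown below. The helper name `clone_if_missing` and its `run` parameter are illustrative assumptions, not part of the original notebook.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/nuuuwan/lk_legal_docs"
DEST = Path("lk_legal_docs")

def clone_if_missing(url, dest, run=subprocess.run):
    """Clone `url` into `dest` unless `dest` already exists and is non-empty.

    Returns True if a clone was attempted, False if it was skipped.
    `run` is injectable (hypothetical parameter) so the logic can be
    exercised without network access.
    """
    dest = Path(dest)
    if dest.is_dir() and any(dest.iterdir()):
        print(f"'{dest}' already exists; skipping clone")
        return False
    run(["git", "clone", url, str(dest)], check=True)
    return True
```

Calling `clone_if_missing(REPO_URL, DEST)` in place of the `!git clone` line should make the cell safe to re-run, assuming `git` is available on the path.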


## 1. File Type Counts

Count files by extension under a single category directory (here, `bills`) to see which file types the repository stores.

In [6]:
import os
from collections import defaultdict

# 🔧 Set your target directory
ROOT_DIR = "lk_legal_docs/data/bills"  # <-- Change this

def count_file_extensions(root_dir):
    extension_counts = defaultdict(int)

    for dirpath, _, filenames in os.walk(root_dir):
        for filename in filenames:
            _, ext = os.path.splitext(filename)
            ext = ext.lower().strip('.')
            ext = ext if ext else "[no extension]"
            extension_counts[ext] += 1

    return extension_counts

counts = count_file_extensions(ROOT_DIR)

print(f"\n📁 File type counts in '{ROOT_DIR}':\n")
for ext, count in sorted(counts.items(), key=lambda x: (-x[1], x[0])):
    print(f".{ext:<15} : {count}")


📁 File type counts in 'lk_legal_docs/data/bills':

.json            : 1351
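
The count above covers only the `bills` folder. To compare all document categories in one pass, something like the following sketch could be used. The category list is taken from the folders processed later in this notebook; the function names are illustrative assumptions.

```python
from collections import Counter
from pathlib import Path

# Category folders, as used by the conversion step of this notebook.
CATEGORIES = ["acts", "bills", "extra-gazettes", "gazettes"]

def count_extensions(root_dir):
    """Return a Counter of lower-cased file extensions under root_dir."""
    counts = Counter()
    root = Path(root_dir)
    if not root.is_dir():
        return counts  # missing folder: report no files rather than fail
    for path in root.rglob("*"):
        if path.is_file():
            counts[path.suffix.lower().lstrip(".") or "[no extension]"] += 1
    return counts

def count_by_category(data_dir, categories=CATEGORIES):
    """Map each category name to the extension counts of its folder."""
    return {name: count_extensions(Path(data_dir) / name) for name in categories}

for name, counts in count_by_category("lk_legal_docs/data").items():
    print(name, dict(counts.most_common()))
```

A `Counter` per category keeps the result easy to compare side by side; folders that do not exist simply yield empty counts.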


## 2. JSON Metadata to TSV Conversion

Flatten the JSON metadata files in each document category (acts, bills, extra-gazettes, gazettes) and write one TSV file per category.

In [7]:
import os
import json
import csv

# 🔧 Define your input/output pairs here
JOBS = [
    ("lk_legal_docs/data/acts", "acts.tsv"),
    ("lk_legal_docs/data/bills", "bills.tsv"),
    ("lk_legal_docs/data/extra-gazettes", "extra-gazettes.tsv"),
    ("lk_legal_docs/data/gazettes", "gazettes.tsv"),
]

def flatten_json(y, parent_key='', sep='_'):
    """Recursively flatten a nested dictionary."""
    items = []
    for k, v in y.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_json(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

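def parse_and_flatten(json_path):
    """Parse and flatten one JSON file; log the error and return None on failure."""
    try:
        with open(json_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
    except (json.JSONDecodeError, UnicodeDecodeError, OSError) as e:
        print(f"[ERROR] Failed to read {json_path}: {e}")
        return None
    # flatten_json expects a dict; skip files whose top level is a list or scalar
    if not isinstance(data, dict):
        print(f"[ERROR] Skipping {json_path}: top-level JSON is not an object")
        return None
    return flatten_json(data)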

def get_all_json_files(root_dir):
    """Recursively collect all .json files."""
    return [
        os.path.join(dirpath, filename)
        for dirpath, _, filenames in os.walk(root_dir)
        for filename in filenames if filename.lower().endswith(".json")
    ]

def write_tsv(data_list, output_file):
    """Write flattened dicts to TSV."""
    if not data_list:
        print(f"⚠️ No valid data to write for {output_file}")
        return

    all_keys = set()
    for row in data_list:
        all_keys.update(row.keys())
    fieldnames = sorted(all_keys)

    with open(output_file, "w", newline='', encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter='\t', extrasaction='ignore')
        writer.writeheader()
        writer.writerows(data_list)

    print(f"✅ Wrote {len(data_list)} rows → {output_file}")

# 🔁 Process each job
for root_dir, output_path in JOBS:
    print(f"\n🔍 Processing folder: {root_dir}")
    json_files = get_all_json_files(root_dir)
    print(f"  → Found {len(json_files)} JSON files")

    rows = []
    for path in json_files:
        result = parse_and_flatten(path)
        if result:
            rows.append(result)

    write_tsv(rows, output_path)


🔍 Processing folder: lk_legal_docs/data/acts
  → Found 1647 JSON files
✅ Wrote 1647 rows → acts.tsv

🔍 Processing folder: lk_legal_docs/data/bills
  → Found 1351 JSON files
✅ Wrote 1351 rows → bills.tsv

🔍 Processing folder: lk_legal_docs/data/extra-gazettes
  → Found 34770 JSON files
✅ Wrote 34770 rows → extra-gazettes.tsv

🔍 Processing folder: lk_legal_docs/data/gazettes
  → Found 6275 JSON files
✅ Wrote 6275 rows → gazettes.tsv
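
To make the flattening step concrete, here is a small sketch of what `flatten_json` does to one nested record. The sample record is hypothetical: its field names are inferred from the `lang_to_source_url_*` columns that appear in the resulting TSV files, and the URL is made up.

```python
def flatten_json(record, parent_key="", sep="_"):
    """Flatten nested dicts by joining keys with `sep` (same logic as the cell above)."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_json(value, new_key, sep=sep))
        else:
            flat[new_key] = value
    return flat

# Hypothetical record: structure is an assumption, values are placeholders.
sample = {
    "doc_num": "05/1989",
    "date": "1989-05-11",
    "lang_to_source_url": {"si": "https://example.org/si.pdf"},
}

flat = flatten_json(sample)
print(flat)
# {'doc_num': '05/1989', 'date': '1989-05-11', 'lang_to_source_url_si': 'https://example.org/si.pdf'}
```

Because the set of nested keys can differ between files, `write_tsv` takes the union of all keys as the header, and rows lacking a key get an empty cell.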


## 3. Merging the Category TSV Files

Load the four TSV files, tag each row with its document category, and merge them into a single table with a consistent column order.

In [10]:
import pandas as pd
import os

# TSV files written by the conversion step (paths relative to the working directory)
files = [
    "acts.tsv",
    "bills.tsv",
    "extra-gazettes.tsv",
    "gazettes.tsv",
]

# List to store DataFrames
dfs = []

# Load each file, derive doc_type_name from the file name, and append to the list
for file_path in files:
    doc_type = os.path.splitext(os.path.basename(file_path))[0]
    df = pd.read_csv(file_path, sep='\t')
    df['doc_type_name'] = doc_type
    dfs.append(df)

# Concatenate all DataFrames
merged_df = pd.concat(dfs, ignore_index=True)

# Optional: Preview merged data
merged_df.head()

| | date | description | dir_data | doc_num | doc_type_name | id | lang_to_source_url_en | lang_to_source_url_si | lang_to_source_url_ta |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1989-05-11 | Tax Amnesty | data/acts/1989/05-1989 | 05/1989 | acts | 05-1989 | | https://documents.gov.lk/view/acts/1989/5/05-1... | |
| 1 | 1989-05-30 | Provincial Councils (Consequential Provisions) | data/acts/1989/12-1989 | 12/1989 | acts | 12-1989 | | https://documents.gov.lk/view/acts/1989/5/12-1... | |
| 2 | 1989-12-30 | Appropriation | data/acts/1989/18-1989 | 18/1989 | acts | 18-1989 | | https://documents.gov.lk/view/acts/1989/12/18-... | |
| 3 | 1989-05-15 | Inland Revenue (Amendment) | data/acts/1989/11-1989 | 11/1989 | acts | 11-1989 | | https://documents.gov.lk/view/acts/1989/5/11-1... | |
| 4 | 1989-11-22 | Age of Majority (Amendment) | data/acts/1989/17-1989 | 17/1989 | acts | 17-1989 | | https://documents.gov.lk/view/acts/1989/11/17-... | |


In [11]:
# Rename columns
merged_df = merged_df.rename(columns={
    'lang_to_source_url_si': 'source_url_si',
    'lang_to_source_url_ta': 'source_url_ta',
    'lang_to_source_url_en': 'source_url_en'
})

# Reorder columns
desired_order = [
    'doc_type_name',
    'date',
    'description',
    'source_url_si',
    'source_url_ta',
    'source_url_en',
    'doc_num',
    'id',
    'dir_data'
]
merged_df = merged_df[desired_order]

merged_df.head()


| | doc_type_name | date | description | source_url_si | source_url_ta | source_url_en | doc_num | id | dir_data |
|---|---|---|---|---|---|---|---|---|---|
| 0 | acts | 1989-05-11 | Tax Amnesty | https://documents.gov.lk/view/acts/1989/5/05-1... | | | 05/1989 | 05-1989 | data/acts/1989/05-1989 |
| 1 | acts | 1989-05-30 | Provincial Councils (Consequential Provisions) | https://documents.gov.lk/view/acts/1989/5/12-1... | | | 12/1989 | 12-1989 | data/acts/1989/12-1989 |
| 2 | acts | 1989-12-30 | Appropriation | https://documents.gov.lk/view/acts/1989/12/18-... | | | 18/1989 | 18-1989 | data/acts/1989/18-1989 |
| 3 | acts | 1989-05-15 | Inland Revenue (Amendment) | https://documents.gov.lk/view/acts/1989/5/11-1... | | | 11/1989 | 11-1989 | data/acts/1989/11-1989 |
| 4 | acts | 1989-11-22 | Age of Majority (Amendment) | https://documents.gov.lk/view/acts/1989/11/17-... | | | 17/1989 | 17-1989 | data/acts/1989/17-1989 |


In [12]:
merged_df.to_csv("all_manual.tsv", sep='\t', index=False)
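
As a quick sanity check on the exported file, the rows per document category can be counted with the standard library alone. This is a sketch; the helper name `count_rows_by_type` is an illustrative assumption.

```python
import csv
from collections import Counter

def count_rows_by_type(tsv_path, column="doc_type_name"):
    """Count data rows in a TSV file, grouped by the value in `column`."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return Counter(row[column] for row in reader)
```

If no rows were lost in the merge, `count_rows_by_type("all_manual.tsv")` should agree with the row counts reported by the conversion step (1647 acts, 1351 bills, 34770 extra-gazettes, 6275 gazettes).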
