# 🧠 AI Twin: Data Preprocessing & Hugging Face Push Pipeline

This notebook takes scraped text data (website, LinkedIn, CV, SOP, chat logs),
cleans and chunks it, extracts structured information using OpenAI,
converts everything into instruction-tuning JSONL format, and pushes
the final dataset to a **private** Hugging Face repository.

### Notebook Structure
| Step | Description |
|------|-------------|
| 0 | Load environment variables & libraries |
| 1 | Load scraped data |
| 2 | Clean & preprocess text |
| 3 | Chunk text into ~3,000-char blocks |
| 4 | Extract structured info via OpenAI |
| 5 | Save locally as `.jsonl` |
| 6 | Push dataset to Hugging Face Hub |

---
## Step 0: Load Environment Variables & Libraries

We read API keys from the `.env` file and import every library we'll need.
Make sure your `.env` contains:
```
OPENAI_API_KEY=sk-...
HF_TOKEN=hf_...
```

In [1]:
# ============================================================
# Step 0: Environment & Imports
# ============================================================

import os
import re
import json
import time
import textwrap
from pathlib import Path

from dotenv import load_dotenv
from openai import OpenAI
from tqdm.notebook import tqdm          # progress bars inside Jupyter
from datasets import Dataset
from huggingface_hub import login as hf_login

# Load .env file
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
HF_TOKEN       = os.getenv("HF_TOKEN")

assert OPENAI_API_KEY, "❌ OPENAI_API_KEY not found in .env"
assert HF_TOKEN,       "❌ HF_TOKEN not found in .env"

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)
MODEL  = "gpt-4o-mini"   # fast & cheap; swap to gpt-4o if you want higher quality

# Authenticate with Hugging Face
hf_login(token=HF_TOKEN)

print("✅ Environment loaded and authenticated.")

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


✅ Environment loaded and authenticated.


In [2]:
#!pip install pymupdf

---
## Step 1: Load Scraped Data

All raw text from your various sources goes into the `scraped_data` dictionary.
Replace the placeholder strings below with your **actual scraped content**,
or load them from files.

> **Tip:** If you already ran `web_data(scrap).ipynb`, you can copy-paste
> the `scraped_data` dict from that notebook, or load the saved markdown file.

In [3]:
# ============================================================
# Step 1: Load Scraped Data (Markdown + PDFs)
# ============================================================

# --- Imports ---
from pathlib import Path
import fitz  # PyMuPDF

# --- 1a. Load the Personal Knowledge Base markdown ---
knowledge_base_path = Path(r"G:\Github_Projects\Ai_twin\file\CV_statement_details\Personal_Knowledge_Base.md")
knowledge_base_text = knowledge_base_path.read_text(encoding="utf-8") if knowledge_base_path.exists() else ""
print(f"📄 Personal Knowledge Base length: {len(knowledge_base_text):,} chars")

# --- 1b. Load CV / SOP / Personal Statement PDFs ---
def read_pdf(path: str) -> str:
    """Extract all text from a PDF file."""
    # The context manager closes the document even if extraction fails
    with fitz.open(path) as doc:
        pages = [page.get_text() for page in doc]
    return "\n".join(pages)

cv_text       = read_pdf(r"G:\Github_Projects\Ai_twin\file\CV_statement_details\Md_Maruf_Mullah_CV.pdf")
sop_text      = read_pdf(r"G:\Github_Projects\Ai_twin\file\CV_statement_details\Statement of Purpose of Md. Maruf Mullah.pdf")
ps_text       = read_pdf(r"G:\Github_Projects\Ai_twin\file\CV_statement_details\Personal Statement of Md. Maruf Mullah.pdf")
chat_logs_text = read_pdf(r"G:\Github_Projects\Ai_twin\file\chat\chat50.pdf")

print(f"📄 CV length:  {len(cv_text):,} chars")
print(f"📄 SOP length: {len(sop_text):,} chars")
print(f"📄 PS length:  {len(ps_text):,} chars")
print(f"📄 Chat logs length: {len(chat_logs_text):,} chars")

# --- 1c. Assemble the master dictionary including Markdown ---
scraped_data = {
    "Personal_Knowledge_Base": knowledge_base_text,
    "CV": cv_text,
    "SOP": sop_text,
    "PersonalStatement": ps_text,
    "ChatLogs": chat_logs_text,
}

# Quick overview
print("\n Data source sizes:")
for label, text in scraped_data.items():
    print(f"   {label:35s} → {len(text):>10,} chars")

📄 Personal Knowledge Base length: 5,141 chars
📄 CV length:  6,033 chars
📄 SOP length: 3,848 chars
📄 PS length:  3,129 chars
📄 Chat logs length: 7,346,040 chars

 Data source sizes:
   Personal_Knowledge_Base             →      5,141 chars
   CV                                  →      6,033 chars
   SOP                                 →      3,848 chars
   PersonalStatement                   →      3,129 chars
   ChatLogs                            →  7,346,040 chars


---
## Step 2: Clean & Preprocess Text

We apply several cleaning rules:
1. Strip HTML artifacts, URLs, and excessive whitespace
2. Remove navigation / footer / cookie-banner boilerplate
3. Drop tiny junk lines (< 20 chars)
4. Collapse multiple blank lines
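
To make the rules concrete, here is a minimal, self-contained sketch that applies them to a few sample lines. The function name `clean_demo`, the shortened `NOISE` list, and the sample text are assumptions made for illustration; the cell below holds the notebook's actual implementation.

```python
import re

# Hypothetical, shortened keyword list used only for this illustration
NOISE = ["cookie", "sign in", "all rights reserved"]
MIN_LEN = 20  # same threshold as rule 3

def clean_demo(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)       # rule 1: HTML tags
    text = re.sub(r"https?://\S+", "", text)   # rule 1: URLs
    text = re.sub(r"[ \t]+", " ", text)        # rule 1: runs of spaces/tabs
    kept = []
    for line in (ln.strip() for ln in text.splitlines()):
        if len(line) < MIN_LEN:                # rule 3: tiny junk lines
            continue
        if any(kw in line.lower() for kw in NOISE):  # rule 2: boilerplate
            continue
        kept.append(line)
    return "\n".join(kept)

sample = (
    "<p>I studied Mechanical Engineering and built ML models.</p>\n"
    "Home | About\n"
    "We use cookies to improve your experience on this site.\n"
    "See my portfolio at https://example.com for project details.\n"
)
print(clean_demo(sample))
```

Only the first and last sample lines survive. The keyword test is a plain substring match, so a legitimate sentence that happens to contain a keyword is dropped as well; it is worth spot-checking the cleaned text of short sources such as a CV.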

In [4]:
# ============================================================
# Step 2: Clean & Preprocess Text
# ============================================================

# Words that signal navigation / boilerplate noise
NOISE_KEYWORDS = [
    "menu", "navigation", "navbar", "footer", "sidebar",
    "cookie", "privacy policy", "terms of service",
    "sign in", "sign up", "log in", "log out",
    "subscribe", "newsletter", "advertisement",
    "all rights reserved", "copyright",
]

MIN_LINE_LENGTH = 20  # discard lines shorter than this

# ---- Cost-control: cap large sources ----
# ChatLogs alone is ~5 M chars → ~2000+ chunks → $$$
# We keep only the first N chars (at a clean boundary).
MAX_CHARS = {
    "ChatLogs": 20_000,   # <-- adjust this number as needed
}


def clean_text(text: str) -> str:
    """Deep-clean a block of text."""
    # 1. Remove leftover HTML tags
    text = re.sub(r"<[^>]+>", " ", text)

    # 2. Remove URLs
    text = re.sub(r"https?://\S+", "", text)

    # 3. Normalize whitespace within lines
    text = re.sub(r"[ \t]+", " ", text)

    # 4. Split into lines, strip, filter
    lines = [line.strip() for line in text.splitlines()]
    cleaned_lines = []
    for line in lines:
        if len(line) < MIN_LINE_LENGTH:
            continue
        low = line.lower()
        if any(kw in low for kw in NOISE_KEYWORDS):
            continue
        cleaned_lines.append(line)

    # 5. Collapse multiple blank lines (a safeguard: blank lines are already
    #    dropped by the MIN_LINE_LENGTH filter above)
    result = "\n".join(cleaned_lines)
    result = re.sub(r"\n{3,}", "\n\n", result)
    return result.strip()


def smart_truncate(text: str, max_chars: int) -> str:
    """Truncate text at a clean paragraph or sentence boundary."""
    if len(text) <= max_chars:
        return text
    # Try to cut at a paragraph break
    cut = text.rfind('\n\n', 0, max_chars)
    if cut == -1:
        # Fall back to sentence break
        cut = text.rfind('. ', 0, max_chars)
    if cut == -1:
        cut = max_chars
    return text[:cut].strip()


# Apply cleaning to every source
cleaned_data = {}
for label, raw_text in tqdm(scraped_data.items(), desc="Cleaning"):
    cleaned = clean_text(raw_text)

    # Apply per-source character cap (if configured)
    if label in MAX_CHARS:
        before = len(cleaned)
        cleaned = smart_truncate(cleaned, MAX_CHARS[label])
        print(f"  {label}: {len(raw_text):,} -> {before:,} (cleaned) -> {len(cleaned):,} chars  (capped to {MAX_CHARS[label]:,})")
    else:
        print(f"  {label}: {len(raw_text):,} -> {len(cleaned):,} chars  "
              f"({100 * len(cleaned) / max(len(raw_text), 1):.1f}% kept)")

    cleaned_data[label] = cleaned

print(f"\nFinal sizes after cleaning + capping:")
for label, text in cleaned_data.items():
    print(f"   {label:35s} -> {len(text):>10,} chars")
total = sum(len(t) for t in cleaned_data.values())
print(f"   {'TOTAL':35s} -> {total:>10,} chars")
print(f"\nEstimated chunks: ~{total // 3000}")
print("\n✅ Cleaning complete.")


  Personal_Knowledge_Base: 5,141 -> 4,653 chars  (90.5% kept)
  CV: 6,033 -> 5,658 chars  (93.8% kept)
  SOP: 3,848 -> 3,676 chars  (95.5% kept)
  PersonalStatement: 3,129 -> 3,039 chars  (97.1% kept)
  ChatLogs: 7,346,040 -> 5,183,960 (cleaned) -> 19,995 chars  (capped to 20,000)

Final sizes after cleaning + capping:
   Personal_Knowledge_Base             ->      4,653 chars
   CV                                  ->      5,658 chars
   SOP                                 ->      3,676 chars
   PersonalStatement                   ->      3,039 chars
   ChatLogs                            ->     19,995 chars
   TOTAL                               ->     37,021 chars

Estimated chunks: ~12

✅ Cleaning complete.


---
## Step 3: Chunk Text into ~3,000-char Blocks

Large texts need to be split into manageable chunks before we send them to
the LLM for extraction. We use a sliding-window approach that tries to
break at paragraph boundaries, with a small overlap to preserve context.
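
The overlap arithmetic is easier to see on a tiny input. The sketch below is a simplified illustration, not the notebook's chunker: it uses fixed-size windows and skips the boundary search, and the name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size windows; each new window starts `size - overlap` chars later."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):   # this window reached the end
            break
        start += step
    return chunks

demo = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
for chunk in sliding_chunks(demo, size=10, overlap=3):
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk repeats the last three characters of the previous one. Because of this repetition, and because every source is chunked separately, the real chunk count can come out higher than a simple `total // 3000` estimate.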

In [5]:
# ============================================================
# Step 3: Chunk Text  (memory-safe, infinite-loop-proof)
# ============================================================

import gc

CHUNK_SIZE    = 3000   # target characters per chunk
CHUNK_OVERLAP = 200    # overlap between consecutive chunks


def chunk_text(text: str, chunk_size: int = CHUNK_SIZE,
               overlap: int = CHUNK_OVERLAP):
    """
    Generator that yields chunks of ~`chunk_size` characters.
    Tries to break at paragraph or sentence boundaries.

    Key fix:  `start` is GUARANTEED to advance on every iteration:
    break points are only searched past `start + min_advance`, and a
    safety check forces an advance if the overlap would cancel it. The
    loop therefore always terminates, even on multi-million-character texts.
    """
    if not text.strip():
        return

    length      = len(text)
    min_advance = max(chunk_size // 2, 1)   # earliest allowed break point: half a chunk past start
    start       = 0

    while start < length:
        end = min(start + chunk_size, length)

        if end < length:
            # Try to break at a paragraph boundary
            bp = text.rfind('\n\n', start + min_advance, end)
            if bp == -1:
                # Fall back to sentence-ending period
                bp = text.rfind('. ', start + min_advance, end)
            if bp == -1:
                # Fall back to any newline
                bp = text.rfind('\n', start + min_advance, end)
            if bp != -1:
                end = bp + 1

        chunk = text[start:end].strip()
        if chunk:
            yield chunk

        # Guarantee forward progress
        next_start = end - overlap if end < length else length
        if next_start <= start:
            next_start = start + min_advance   # safety: force advance
        start = next_start


# --- Chunk every cleaned source, one at a time ---
all_chunks = []

for label, text in tqdm(cleaned_data.items(), desc='Chunking'):
    chunk_count = 0
    for i, chunk in enumerate(chunk_text(text)):
        all_chunks.append({
            'source':      label,
            'chunk_index': i,
            'text':        chunk,
        })
        chunk_count += 1
    print(f'  {label}: {chunk_count} chunks')
    gc.collect()           # free intermediate objects between large sources

print(f'\n✅ Total chunks: {len(all_chunks)}')
print(f'   Estimated memory: {sum(len(c["text"]) for c in all_chunks) / 1024:.0f} KB')

  Personal_Knowledge_Base: 2 chunks
  CV: 3 chunks
  SOP: 2 chunks
  PersonalStatement: 2 chunks
  ChatLogs: 9 chunks

✅ Total chunks: 18
   Estimated memory: 39 KB


---
## Step 4: Extract Structured Info Using OpenAI

For each chunk we ask the LLM to:
1. **Summarize** the content
2. **Extract** structured facts (education, skills, projects, etc.)
3. Return the result as a **JSON object**

We then convert each response into an **instruction-tuning** sample:
```json
{"instruction": "...", "input": "...", "output": "..."}
```

> ⏱ This step makes API calls; expect ~1-2 seconds per chunk.
> A progress bar shows real-time status.
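
Since the model is only asked, not forced, to return the expected keys, it can help to validate each response before turning it into a training sample. The sketch below is one possible check under that assumption; `parse_extraction`, `to_sample`, and `REQUIRED_KEYS` are hypothetical names, and the sample string is made up for the demonstration.

```python
import json
from typing import Optional

# Keys the system prompt asks the model to return
REQUIRED_KEYS = {
    "summary", "category", "key_facts",
    "instruction_prompt", "ideal_response",
}

def parse_extraction(raw: str) -> Optional[dict]:
    """Return the parsed record, or None if it is not usable."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    if not REQUIRED_KEYS.issubset(record):       # some key is missing
        return None
    if not isinstance(record["key_facts"], list):
        return None
    return record

def to_sample(record: dict) -> dict:
    """Map a validated record to the instruction-tuning shape."""
    return {
        "instruction": record["instruction_prompt"],
        "input": "",
        "output": record["ideal_response"],
    }

# Made-up response text, for illustration only
raw = json.dumps({
    "summary": "Example summary.",
    "category": "skills",
    "key_facts": ["Example fact"],
    "instruction_prompt": "What are your skills?",
    "ideal_response": "I work mostly with Python.",
})
print(to_sample(parse_extraction(raw)))
```

A response that fails this check could be retried or counted as failed, instead of silently falling back to default values later in the pipeline.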

In [8]:
# ============================================================
# Step 4: Extraction (Optimized & Resume-Safe)
# ============================================================

import textwrap

# 🔥 Define the prompt that was missing
SYSTEM_PROMPT = textwrap.dedent("""\
    You are an expert AI assistant building a structured knowledge base
    about a person named Md. Maruf Mullah. Your job is to extract factual,
    biographical, academic, professional, and personality-related information
    from a chunk of text so it can be used for an AI Twin and RAG system.

    Return your answer as a **valid JSON object** with these keys:
    {
      "summary": "<2-3 sentence summary of the chunk>",
      "category": "<one of: biography, education, skills, projects, research,
                    experience, achievements, goals, values, writing_style,
                    contact, chat, other>",
      "key_facts": ["<fact 1>", "<fact 2>", ...],
      "instruction_prompt": "<a natural question a user might ask that
                              this chunk answers>",
      "ideal_response": "<a complete, conversational answer to that question
                          using ONLY information in this chunk>"
    }

    Rules:
    - Only use facts present in the text
    - Do NOT invent anything
    - Keep the ideal_response in first person as Maruf would say it
    - Return ONLY the JSON, NO markdown fences
""")

OUTPUT_FILE = "extracted_records.jsonl"

def extract_info(chunk_text: str, source_label: str,
                 max_retries: int = 3):

    user_prompt = f"Source: {source_label}\n\nCONTENT:\n{chunk_text}"

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=MODEL,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user",   "content": user_prompt},
                ],
                temperature=0.2,      # lower = more deterministic JSON
                max_tokens=700,       # 🔥 reduce from 1024 (saves cost)
            )

            raw = response.choices[0].message.content.strip()

            # Remove markdown fences safely
            raw = re.sub(r"^```(?:json)?\s*", "", raw)
            raw = re.sub(r"\s*```$", "", raw)

            return json.loads(raw)

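        except json.JSONDecodeError:
            # The model returned text that is not valid JSON
            if attempt < max_retries - 1:
                print("⚠️ JSON decode failed. Retrying...")
                time.sleep(2 ** attempt)

        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)   # exponential backoff: 1 s, 2 s, ...
            else:
                print(f"⚠️ Failed: {e}")

    # All attempts failed
    return None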


# ============================================================
# Run Extraction (Memory Safe + Resume Safe)
# ============================================================

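processed = 0
failed = 0

# Collect the chunks already saved by an earlier run, so that re-running
# this cell resumes instead of appending duplicate records.
done = set()
if Path(OUTPUT_FILE).exists():
    with open(OUTPUT_FILE, "r", encoding="utf-8") as f:
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue   # blank or partially written line
            done.add((rec.get("source"), rec.get("chunk_index")))

# Open file in append mode (safe if kernel stops)
with open(OUTPUT_FILE, "a", encoding="utf-8") as f:

    for chunk in tqdm(all_chunks, desc="Extracting structured info"):

        # Skip chunks that already have a saved record
        if (chunk["source"], chunk["chunk_index"]) in done:
            continue

        result = extract_info(chunk["text"], chunk["source"])

        if result:
            result["source"] = chunk["source"]
            result["chunk_index"] = chunk["chunk_index"]

            f.write(json.dumps(result) + "\n")  # 🔥 write immediately
            processed += 1
        else:
            failed += 1

        # Small delay to respect rate limits
        time.sleep(0.5)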

print(f"\n✅ Extraction complete.")
print(f"   Processed: {processed}")
print(f"   Failed:    {failed}")
print(f"   Saved to:  {OUTPUT_FILE}")


✅ Extraction complete.
   Processed: 18
   Failed:    0
   Saved to:  extracted_records.jsonl


### 4b: Convert to Instruction-Tuning JSONL Format

Each record is transformed into the standard `instruction / input / output`
format used for supervised fine-tuning of LLMs.
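
If the extraction cell has been run more than once, the append-mode file `extracted_records.jsonl` can hold several records for the same chunk, which would inflate the sample count. A minimal sketch of removing such duplicates before building samples follows; the helper name `dedupe_records` and the demo records are assumptions for illustration.

```python
def dedupe_records(records: list[dict]) -> list[dict]:
    """Keep the last record seen for each (source, chunk_index) pair."""
    latest = {}
    for rec in records:
        key = (rec.get("source"), rec.get("chunk_index"))
        latest[key] = rec   # a later run overwrites an earlier one
    return list(latest.values())

# Made-up records: chunk 0 of "CV" appears twice
demo_records = [
    {"source": "CV", "chunk_index": 0, "summary": "first run"},
    {"source": "CV", "chunk_index": 1, "summary": "only run"},
    {"source": "CV", "chunk_index": 0, "summary": "second run"},
]
print(len(dedupe_records(demo_records)))  # 2
```

Keeping the most recent record is a design choice; keeping the first one would work equally well when the prompt has not changed between runs.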

In [10]:
# ============================================================
# Step 4b: Build instruction-tuning samples
# ============================================================

instruction_samples = []
OUTPUT_FILE = "extracted_records.jsonl"

# 🔥 Load records from the file we just saved (blank lines are skipped)
try:
    with open(OUTPUT_FILE, "r", encoding="utf-8") as f:
        extracted_records = [json.loads(line) for line in f if line.strip()]
    print(f"✅ Loaded {len(extracted_records)} records from {OUTPUT_FILE}")
except FileNotFoundError:
    print(f"⚠️ Error: {OUTPUT_FILE} not found. Did Step 4 run?")
    extracted_records = []

for rec in extracted_records:
    # Primary sample: question → answer
    instruction_samples.append({
        "instruction": rec.get("instruction_prompt", "Tell me about yourself."),
        "input":       "",  # no extra input needed
        "output":      rec.get("ideal_response", rec.get("summary", "")),
        "source":      rec.get("source", "unknown"),
        "category":    rec.get("category", "other"),
    })

    # Bonus sample: "Summarize this about Maruf" → summary
    if rec.get("summary"):
        instruction_samples.append({
            "instruction": f"Summarize Maruf's {rec.get('category', 'background')} information.",
            "input":       "",
            "output":      rec["summary"],
            "source":      rec.get("source", "unknown"),
            "category":    rec.get("category", "other"),
        })

    # Bonus sample: key facts as bullet list
    facts = rec.get("key_facts", [])
    if facts:
        instruction_samples.append({
            "instruction": f"List key facts about Maruf's {rec.get('category', 'background')}.",
            "input":       "",
            "output":      "\n".join(f"- {f}" for f in facts),
            "source":      rec.get("source", "unknown"),
            "category":    rec.get("category", "other"),
        })

print(f"✅ Total instruction-tuning samples: {len(instruction_samples)}")

# Preview the first sample
if instruction_samples:
    print("\n--- Sample Preview ---")
    print(json.dumps(instruction_samples[0], indent=2, ensure_ascii=False))


✅ Loaded 49 records from extracted_records.jsonl
✅ Total instruction-tuning samples: 147

--- Sample Preview ---
{
  "instruction": "What are Md. Maruf Mullah's professional background and research interests?",
  "input": "",
  "output": "I am a Mechanical Engineer and Researcher with a strong focus on bridging classical engineering and computational intelligence. My research interests include machine learning, materials science, robotics, and renewable energy applications. I have industrial experience at IFAD Autos PLC and PRAN-RFL Group, and I achieved 99.9% accuracy in a casting defect classification project.",
  "source": "Personal_Knowledge_Base",
  "category": "biography"
}


---
## Step 5: Save Locally as `.jsonl`

We write every instruction-tuning sample as one JSON object per line.
This file can be directly used with most fine-tuning frameworks
(Hugging Face Trainer, Axolotl, LLaMA-Factory, etc.).
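
Before handing the file to a training framework, it can be worth reading it back and checking that every line parses and carries the expected fields. The sketch below shows one way to do that; `check_jsonl` is a hypothetical helper, and the demo runs on a throwaway file, not on the real dataset.

```python
import json
import tempfile
from pathlib import Path

def check_jsonl(path, required=("instruction", "input", "output")):
    """Return (ok_count, problems); problems holds (line_no, reason) pairs."""
    ok, problems = 0, []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                problems.append((line_no, "invalid JSON"))
                continue
            missing = [k for k in required if k not in row]
            if missing:
                problems.append((line_no, f"missing {missing}"))
            else:
                ok += 1
    return ok, problems

# Demo on a temporary file with one good row and one incomplete row
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text(
        '{"instruction": "Q", "input": "", "output": "A"}\n'
        '{"instruction": "Q only"}\n',
        encoding="utf-8",
    )
    print(check_jsonl(demo_path))
```

In the notebook, the same function could be pointed at `JSONL_PATH` once the cell below has written the file.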

In [11]:
# ============================================================
# Step 5: Save to JSONL locally
# ============================================================

OUTPUT_DIR  = Path(r"G:\Github_Projects\Ai_twin\dataset")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

JSONL_PATH  = OUTPUT_DIR / "ai_twin_instruction_data.jsonl"

with open(JSONL_PATH, "w", encoding="utf-8") as f:
    for sample in instruction_samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")

print(f"💾 Saved {len(instruction_samples)} samples to:")
print(f"   {JSONL_PATH}")
print(f"   File size: {JSONL_PATH.stat().st_size / 1024:.1f} KB")

💾 Saved 147 samples to:
   G:\Github_Projects\Ai_twin\dataset\ai_twin_instruction_data.jsonl
   File size: 68.0 KB


### 5b: (Optional) Save extracted records for debugging

Save the raw structured extraction results as a separate JSON file
so you can inspect them later without re-running the API calls.

In [12]:
# Optional: save raw extraction results
RAW_JSON_PATH = OUTPUT_DIR / "ai_twin_raw_extractions.json"

with open(RAW_JSON_PATH, "w", encoding="utf-8") as f:
    json.dump(extracted_records, f, indent=2, ensure_ascii=False)

print(f"💾 Raw extractions saved to: {RAW_JSON_PATH}")

💾 Raw extractions saved to: G:\Github_Projects\Ai_twin\dataset\ai_twin_raw_extractions.json


---
## Step 6: Push Dataset to Hugging Face Hub

We load the JSONL file into a Hugging Face `Dataset` object and push it to
a **private** repository on the Hub.

| Setting | Value |
|---------|-------|
| Username | `marufmullah50` |
| Dataset name | `Ai-Twin-data` |
| Visibility | **private** |

In [13]:
# ============================================================
# Step 6: Push to Hugging Face Hub
# ============================================================

HF_USERNAME    = "marufmullah50"
HF_DATASET     = "Ai-Twin-data"
HF_REPO_ID     = f"{HF_USERNAME}/{HF_DATASET}"

# Load JSONL into a Hugging Face Dataset
dataset = Dataset.from_json(str(JSONL_PATH))

print(f"📦 Dataset loaded: {len(dataset)} rows")
print(f"   Columns: {dataset.column_names}")
print(f"\n   Preview:")
dataset[:3]

📦 Dataset loaded: 147 rows
   Columns: ['instruction', 'input', 'output', 'source', 'category']

   Preview:


{'instruction': ["What are Md. Maruf Mullah's professional background and research interests?",
  "Summarize Maruf's biography information.",
  "List key facts about Maruf's biography."],
 'input': ['', '', ''],
 'output': ['I am a Mechanical Engineer and Researcher with a strong focus on bridging classical engineering and computational intelligence. My research interests include machine learning, materials science, robotics, and renewable energy applications. I have industrial experience at IFAD Autos PLC and PRAN-RFL Group, and I achieved 99.9% accuracy in a casting defect classification project.',
  'Md. Maruf Mullah is a Mechanical Engineer and Researcher focused on integrating classical engineering with computational intelligence to address challenges in materials science, manufacturing, and robotics. He has a B.Sc. in Mechanical Engineering and is actively involved in various research areas including machine learning and smart manufacturing.',
  '- Md. Maruf Mullah is a Mechanica

In [14]:
# Push to Hub (private)
dataset.push_to_hub(
    repo_id=HF_REPO_ID,
    private=True,
    token=HF_TOKEN,
)

print(f"\n🚀 Dataset pushed to: https://huggingface.co/datasets/{HF_REPO_ID}")
print("   Visibility: PRIVATE ✅")


🚀 Dataset pushed to: https://huggingface.co/datasets/marufmullah50/Ai-Twin-data
   Visibility: PRIVATE ✅


---
## ✅ Done!

### What was created

| Artifact | Location |
|----------|----------|
| Instruction-tuning data | `dataset/ai_twin_instruction_data.jsonl` |
| Raw extractions backup | `dataset/ai_twin_raw_extractions.json` |
| HF Dataset (private) | `huggingface.co/datasets/marufmullah50/Ai-Twin-data` |

### Next Steps
1. **Review** the JSONL file and inspect a few samples for quality
2. **Fine-tune** an LLM (e.g., Llama 3, Mistral, Phi-3) using this dataset
3. **Iterate**: add more data sources and re-run this pipeline
4. **Build a RAG system** with the knowledge base for real-time Q&A

In [15]:
# Quick dataset stats
print("📊 Final Dataset Statistics")
print("=" * 40)
print(f"Total samples:     {len(instruction_samples)}")
print(f"Unique sources:    {len(set(s['source'] for s in instruction_samples))}")
print(f"Unique categories: {len(set(s['category'] for s in instruction_samples))}")
print()

# Breakdown by source
from collections import Counter
source_counts = Counter(s["source"] for s in instruction_samples)
print("By Source:")
for src, cnt in source_counts.most_common():
    print(f"   {src:35s} → {cnt:>4} samples")

print()
cat_counts = Counter(s["category"] for s in instruction_samples)
print("By Category:")
for cat, cnt in cat_counts.most_common():
    print(f"   {cat:35s} → {cnt:>4} samples")

📊 Final Dataset Statistics
Total samples:     147
Unique sources:    5
Unique categories: 7

By Source:
   ChatLogs                            →   66 samples
   CV                                  →   27 samples
   Personal_Knowledge_Base             →   18 samples
   SOP                                 →   18 samples
   PersonalStatement                   →   18 samples

By Category:
   other                               →   51 samples
   biography                           →   33 samples
   education                           →   27 samples
   values                              →    9 samples
   experience                          →    9 samples
   research                            →    9 samples
   goals                               →    9 samples
