
# Thesis → Chunks (Direct from LaTeX, with starred DSA chapter)

Parses the LaTeX root file set in `ROOT_TEX`, resolves `\input`/`\include`, extracts the Abstract/Resumen and numbered headings,
**and handles a starred `\chapter*{Data and Software Availability}`**. Writes `chunks.jsonl` to `OUT_DIR`.


## 1) Configuration

In [None]:

from __future__ import annotations

import os, re, io, json, uuid, pathlib
from dataclasses import dataclass, asdict
from typing import List, Dict, Tuple, Optional

ROOT_TEX = "/content/thesis_llm.tex"
OUT_DIR = "/content"
CHUNK_SIZE = 300
OVERLAP = 60
KEEP_CAPTIONS = True

os.makedirs(OUT_DIR, exist_ok=True)
print("Root TeX:", ROOT_TEX)


Root TeX: /content/thesis_llm.tex


## 2) Canonical titles (optional normalization)

In [None]:

CANONICAL_TITLES = {
    "1": "General Introduction",
    "1.1": "M dwarfs and the substellar realm",
    "1.1.1": "M dwarfs",
    "1.1.2": "... and beyond",
    "1.2": "From stars to data: the Virtual Observatory",
    "1.3": "The age of artificial intelligence",
    "1.4": "Aims and objectives of the thesis",
    "2": "Ultracool Dwarfs in J-PLUS",
    "2.1": "J-PLUS",
    "2.2": "Methodology",
    "2.2.1": "Parallax-based selection",
    "2.2.2": "Proper motion-based selection",
    "2.2.3": "Photometry-based selection",
    "2.2.4": "VOSA filtering",
    "2.3": "Analysis",
    "2.3.1": "Temperatures and distances",
    "2.3.2": "Kinematics",
    "2.3.3": "Binarity",
    "2.4": "Known ultracool dwarfs",
    "2.4.1": "Recovered known UCDs",
    "2.4.2": "New candidate UCDs vs. previously known",
    "2.5": "Machine learning analysis",
    "2.5.1": "PCA cut",
    "2.5.2": "SVM model",
    "2.5.3": "Blind test",
    "2.6": "Detection of strong emission line emitters",
    "2.7": "Conclusions",
    "3": "Detection of Flaring M dwarfs with multi-filter Photometry",
    "3.1": "Observations",
    "3.1.1": "Sample selection",
    "3.1.2": "Observational details",
    "3.1.3": "Data reduction",
    "3.2": "Results and discussion",
    "3.2.1": "Reduced spectra",
    "3.2.2": "Light curve analysis",
    "3.3": "Planetary habitability",
    "3.4": "Conclusions",
    "4": "Autoencoders and Deep Transfer Learning in CARMENES",
    "4.1": "Context",
    "4.2": "Data",
    "4.3": "Methodology",
    "4.3.1": "Feature extraction using an autoencoder",
    "4.3.2": "Deep transfer learning",
    "4.3.3": "Stellar parameter estimation",
    "4.4": "Results and discussion",
    "4.4.1": "Stellar parameters analysis",
    "4.4.2": "Comparison with the literature",
    "4.5": "Conclusions",
    "5": "Characterisation of Ultracool Dwarfs with Deep Transfer Learning",
    "5.1": "Testbed environment with SpeX",
    "5.2": "Ultracool dwarf characterisation",
    "6": "General conclusions and future work",
    "6.1": "Summary of the Thesis",
    "6.2": "Future Directions",
    "DSA": "Data and Software Availability",
}


## 3) Load LaTeX (resolve includes) and extract document body

In [None]:

INCLUDE_CMD = re.compile(r'\\(?:input|include)\{([^}]+)\}')
BEGIN_DOC = re.compile(r'\\begin\{document\}', re.IGNORECASE)
END_DOC = re.compile(r'\\end\{document\}', re.IGNORECASE)

def read_file_text(path: str) -> str:
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()

def resolve_path(base_dir: str, rel: str) -> Optional[str]:
    p = pathlib.Path(base_dir) / rel
    if p.suffix != ".tex":
        if p.exists(): return str(p)
        p2 = p.with_suffix(".tex")
        if p2.exists(): return str(p2)
    else:
        if p.exists(): return str(p)
    return None

def read_tex_with_includes(path: str, visited=None) -> str:
    if visited is None: visited = set()
    path = str(pathlib.Path(path).resolve())
    if path in visited: return ""
    visited.add(path)
    base_dir = str(pathlib.Path(path).parent)
    text = read_file_text(path)

    def repl(m):
        rel = m.group(1)
        rp = resolve_path(base_dir, rel)
        if rp:
            return read_tex_with_includes(rp, visited)
        return ""  # silently skip missing include
    return INCLUDE_CMD.sub(repl, text)

full = read_tex_with_includes(ROOT_TEX)
m1 = BEGIN_DOC.search(full)
m2 = END_DOC.search(full)
body = full[m1.end():m2.start()] if (m1 and m2 and m2.start() > m1.end()) else full
print("Loaded LaTeX body characters:", len(body))


Loaded LaTeX body characters: 215245
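Note that includes are resolved before comments are stripped, so a commented-out line such as `% \input{old}` would still be inlined by `read_tex_with_includes`. A minimal sketch of how active includes could be told apart from commented ones; `active_includes` and `UNESCAPED_PERCENT` are hypothetical names, not part of the notebook, and the check ignores rarer cases such as `\\%`:

```python
import re
from typing import List

INCLUDE_CMD = re.compile(r'\\(?:input|include)\{([^}]+)\}')
# A % that is not preceded by a backslash starts a LaTeX comment.
UNESCAPED_PERCENT = re.compile(r'(?<!\\)%')

def active_includes(text: str) -> List[str]:
    """List include targets that are not commented out (hypothetical helper)."""
    targets = []
    for m in INCLUDE_CMD.finditer(text):
        # Look only at the part of the current line that precedes the command.
        line_start = text.rfind("\n", 0, m.start()) + 1
        prefix = text[line_start:m.start()]
        if UNESCAPED_PERCENT.search(prefix):
            continue  # an unescaped % earlier on the line comments this command out
        targets.append(m.group(1))
    return targets

src = "\\input{intro}\n% \\input{old}\n50\\% done \\include{ch2}\n"
print(active_includes(src))  # ['intro', 'ch2']
```

The same prefix test could be applied inside the `repl` callback, returning the matched text unchanged when the command is commented out.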


## 4) Preprocess: strip comments/refs; keep caption text

In [None]:

# strip comments
def strip_comments(s: str) -> str:
    return re.sub(r'(?m)^[ \t]*%.*$', '', s)

# remove references (commands and thebibliography)
def strip_bibliography(s: str) -> str:
    s = re.sub(r'\\bibliography\{[^}]*\}', '', s)
    s = re.sub(r'\\bibliographystyle\{[^}]*\}', '', s)
    s = re.sub(r'\\printbibliography\b.*', '', s)
    s = re.sub(r'\\begin\{thebibliography\}.*?\\end\{thebibliography\}', '', s, flags=re.DOTALL|re.IGNORECASE)
    return s

# Only keep captions from figure/table envs
ENV_RE = re.compile(r'\\begin\{(figure\*?|table\*?)\}(.*?)\\end\{\1\}', re.DOTALL|re.IGNORECASE)
CAP_RE = re.compile(r'\\caption\{(.*?)\}', re.DOTALL)

def keep_captions_only(s: str) -> str:
    def repl(m):
        inner = m.group(2)
        caps = CAP_RE.findall(inner)
        if not caps: return ""
        texts = []
        for c in caps:
            t = re.sub(r'\s+', ' ', c).strip()
            if t: texts.append(f"[[CAPTION]] {t}")
        return "\n".join(texts) + "\n"
    return ENV_RE.sub(repl, s)

clean = strip_comments(body)
clean = strip_bibliography(clean)
clean = keep_captions_only(clean)
print("Preprocessed characters:", len(clean))


Preprocessed characters: 205754
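`CAP_RE` stops at the first `}`, so a caption with nested braces (for example `\caption{Spectra of \textbf{M dwarfs}}`) would be cut short, and the `\caption[short]{long}` form would not match at all. A sketch of brace-balanced extraction, assuming two hypothetical helpers `extract_braced` and `find_captions` that are not in the notebook:

```python
from typing import List, Tuple

def extract_braced(s: str, open_idx: int) -> Tuple[str, int]:
    """Return (content, index_after_close) for the brace group opening at open_idx."""
    if s[open_idx] != "{":
        raise ValueError("open_idx must point at an opening brace")
    depth = 0
    i = open_idx
    while i < len(s):
        c = s[i]
        if c == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return s[open_idx + 1:i], i + 1
        i += 1
    raise ValueError("unbalanced braces")

def find_captions(s: str) -> List[str]:
    """Collect the full long argument of every caption command in s."""
    out: List[str] = []
    marker = "\\caption"
    pos = 0
    while True:
        k = s.find(marker, pos)
        if k == -1:
            break
        j = k + len(marker)
        # Skip an optional short caption: \caption[short]{long}
        if j < len(s) and s[j] == "[":
            close = s.find("]", j)
            if close == -1:
                break
            j = close + 1
        if j < len(s) and s[j] == "{":
            content, pos = extract_braced(s, j)
            out.append(" ".join(content.split()))  # collapse whitespace
        else:
            pos = j  # e.g. \captionsetup: not a caption, keep scanning
    return out

sample = r"\begin{figure}\caption[Short]{Spectra of \textbf{M dwarfs} in J-PLUS}\end{figure}"
print(find_captions(sample))
```

The extracted text still contains inline commands such as `\textbf{...}`; those would be removed later by `latex_to_text`, as for any other caption.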


## 5) Parse structure (includes starred DSA chapter)

In [None]:

# Abstract/Resumen
ABS_RE  = re.compile(r'\\begin\{abstract\}(.*?)\\end\{abstract\}', re.DOTALL|re.IGNORECASE)
RES_RE  = re.compile(r'\\begin\{resumen\}(.*?)\\end\{resumen\}', re.DOTALL|re.IGNORECASE)

# Headings (allow starred), capture title
CH_RE    = re.compile(r'\\chapter\*?\{(.*?)\}', re.DOTALL)
SEC_RE   = re.compile(r'\\section\*?\{(.*?)\}', re.DOTALL)
SUB_RE   = re.compile(r'\\subsection\*?\{(.*?)\}', re.DOTALL)
CAP_LINE = re.compile(r'\[\[CAPTION\]\]\s+(.*)$', re.MULTILINE)

@dataclass
class Event:
    kind: str     # 'abs'|'res'|'chapter'|'section'|'subsection'|'caption'|'text'
    start: int
    end: int
    content: str
    starred: bool = False

def find_all(regex, text, kind):
    out = []
    for m in regex.finditer(text):
        content = re.sub(r'\s+',' ', m.group(1)).strip()
        # A heading is starred when the match begins with \chapter*, \section* or \subsection*
        starred = m.group(0).startswith(f"\\{kind}*")
        out.append(Event(kind, m.start(), m.end(), content, starred))
    return out

events = []
events += [Event('abs', m.start(), m.end(), re.sub(r'\s+',' ', m.group(1)).strip()) for m in ABS_RE.finditer(clean)]
events += [Event('res', m.start(), m.end(), re.sub(r'\s+',' ', m.group(1)).strip()) for m in RES_RE.finditer(clean)]
events += find_all(CH_RE, clean, 'chapter')
events += find_all(SEC_RE, clean, 'section')
events += find_all(SUB_RE, clean, 'subsection')
events += [Event('caption', m.start(), m.end(), m.group(1).strip()) for m in CAP_LINE.finditer(clean)]

events.sort(key=lambda e: e.start)
merged = []
last_end = 0
for ev in events:
    if ev.start > last_end:
        txt = clean[last_end:ev.start]
        if txt.strip():
            merged.append(Event('text', last_end, ev.start, txt))
    merged.append(ev)
    last_end = ev.end
if last_end < len(clean):
    tail = clean[last_end:]
    if tail.strip():
        merged.append(Event('text', last_end, len(clean), tail))

print("Events:", len(merged))


Events: 110


## 6) LaTeX → plain text

In [None]:

def strip_math(s: str) -> str:
    s = re.sub(r'\$\$.*?\$\$', ' [MATH] ', s, flags=re.DOTALL)
    s = re.sub(r'\$[^$]*\$', ' [MATH] ', s)
    s = re.sub(r'\\\(.*?\\\)', ' [MATH] ', s, flags=re.DOTALL)
    s = re.sub(r'\\\[.*?\\\]', ' [MATH] ', s, flags=re.DOTALL)
    return s

def replace_citations(s: str) -> str:
    return re.sub(r'\\cite[a-zA-Z]*\*?\{[^}]*\}', ' (CITATION) ', s)

def strip_commands_preserve_text(s: str) -> str:
    s = re.sub(r'\\(textbf|textit|emph|texttt)\*?\{([^}]*)\}', r'\2', s)
    s = re.sub(r'\\href\{[^}]*\}\{([^}]*)\}', r'\1', s)
    s = re.sub(r'\\url\{([^}]*)\}', r'\1', s)
    s = re.sub(r'\\(label|ref|eqref)\{[^}]*\}', ' ', s)
    s = re.sub(r'\\footnote\{[^}]*\}', ' ', s)
    s = re.sub(r'\\[a-zA-Z@]+\*?(\[[^\]]*\])?(\{[^}]*\})?', ' ', s)
    return s

def latex_to_text(s: str) -> str:
    s = strip_math(s)
    s = replace_citations(s)
    s = strip_commands_preserve_text(s)
    s = re.sub(r'~', ' ', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s
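
The pattern in `replace_citations` expects `{` directly after the command name, so a citation with optional arguments would slip past it and only be partly removed by the generic command stripper. Residue such as `[]{tess}` in the sample chunks is consistent with a source like `\citep[TESS;][]{tess}`, although the original LaTeX is not shown here. A sketch of a more permissive pattern, assuming natbib-style `\cite...[pre][post]{keys}` commands; `CITE_RE` and `replace_citations_with_optargs` are hypothetical names:

```python
import re

# Assumed natbib-style form: \cite<variant>*[pre][post]{key1,key2}
CITE_RE = re.compile(r'\\cite[a-zA-Z]*\*?(?:\s*\[[^\]]*\])*\s*\{[^}]*\}')

def replace_citations_with_optargs(s: str) -> str:
    """Variant of replace_citations that also consumes [..] optional arguments."""
    return CITE_RE.sub(' (CITATION) ', s)

sample = r"the Transiting Exoplanet Survey Satellite \citep[TESS;][]{tess} and \cite{mayor2003}"
print(replace_citations_with_optargs(sample))
```

Any text inside the optional arguments (here the acronym `TESS;`) is dropped along with the keys, which may or may not be desirable for retrieval.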


## 7) Numbering + segmentation (with starred DSA detection)

In [None]:

@dataclass
class Segment:
    type: str
    text: str
    chapter_key: Optional[str]
    chapter: Optional[str]
    section_key: Optional[str]
    section: Optional[str]
    subsection_key: Optional[str]
    subsection: Optional[str]

segments: List[Segment] = []

ch = sec = sub = 0
ctx = {"chapter_key": None, "chapter": None, "section_key": None, "section": None, "subsection_key": None, "subsection": None}

def canonical_title(key: str, fallback: str) -> str:
    return CANONICAL_TITLES.get(key, fallback)

def is_dsa_title(title: str) -> bool:
    t = title.lower().strip()
    return "data and software availability" in t or "data & software availability" in t

for ev in merged:
    if ev.kind in ("abs","res"):
        seg_text = latex_to_text(ev.content)
        if not seg_text: continue
        seg_type = "abstract" if ev.kind == "abs" else "resumen"
        segments.append(Segment(seg_type, seg_text, "FRONT", "Front Matter", ev.kind.upper(), seg_type.capitalize(), None, None))
    elif ev.kind == "chapter":
        title = latex_to_text(ev.content)
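        if ev.starred and is_dsa_title(title):
            # Special starred DSA chapter: key "DSA".
            # Reset the counters so that its sections start at DSA.1 instead of
            # continuing the numbering of the previous chapter.
            sec = 0; sub = 0
            ctx.update({"chapter_key": "DSA", "chapter": CANONICAL_TITLES["DSA"], "section_key": None, "section": None, "subsection_key": None, "subsection": None})
            continue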
        ch += 1; sec = 0; sub = 0
        ck = str(ch)
        ctx.update({"chapter_key": ck, "chapter": canonical_title(ck, title), "section_key": None, "section": None, "subsection_key": None, "subsection": None})
    elif ev.kind == "section":
        if ctx["chapter_key"] is None:
            ch += 1; ctx["chapter_key"] = str(ch); ctx["chapter"] = canonical_title(ctx["chapter_key"], f"Chapter {ch}")
        sec += 1; sub = 0
        sk = f"{ctx['chapter_key']}.{sec}"
        title = latex_to_text(ev.content)
        ctx.update({"section_key": sk, "section": canonical_title(sk, title), "subsection_key": None, "subsection": None})
    elif ev.kind == "subsection":
        if ctx["section_key"] is None:
            sec += 1; ctx["section_key"] = f"{ctx['chapter_key']}.{sec}"; ctx["section"] = canonical_title(ctx["section_key"], f"Section {ctx['section_key']}")
        sub += 1
        sbk = f"{ctx['chapter_key']}.{sec}.{sub}"
        title = latex_to_text(ev.content)
        ctx.update({"subsection_key": sbk, "subsection": canonical_title(sbk, title)})
    elif ev.kind == "caption" and KEEP_CAPTIONS:
        cap_text = latex_to_text(ev.content)
        if cap_text:
            segments.append(Segment("caption", cap_text, ctx["chapter_key"], ctx["chapter"], ctx["section_key"], ctx["section"], ctx["subsection_key"], ctx["subsection"]))
    elif ev.kind == "text":
        body_text = latex_to_text(ev.content)
        if body_text:
            segments.append(Segment("body", body_text, ctx["chapter_key"], ctx["chapter"], ctx["section_key"], ctx["section"], ctx["subsection_key"], ctx["subsection"]))

print("Segments:", len(segments))
# show a few contexts
print("Sample contexts:")
shown = 0
for s in segments:
    if s.type in ("body","caption") and s.chapter_key:
        print(s.chapter_key, s.chapter, "|", s.section_key, s.section, "|", s.subsection_key, s.subsection)
        shown += 1
        if shown >= 10: break


Segments: 48
Sample contexts:
1 General Introduction | 1.1 M dwarfs and the substellar realm | 1.1.1 M dwarfs
1 General Introduction | 1.1 M dwarfs and the substellar realm | 1.1.2 ... and beyond
1 General Introduction | 1.2 From stars to data: the Virtual Observatory | None None
1 General Introduction | 1.3 The age of artificial intelligence | None None
1 General Introduction | 1.4 Aims and objectives of the thesis | None None
2 Ultracool Dwarfs in J-PLUS | None None | None None
2 Ultracool Dwarfs in J-PLUS | 2.1 J-PLUS | None None
2 Ultracool Dwarfs in J-PLUS | 2.2 Methodology | None None
2 Ultracool Dwarfs in J-PLUS | 2.2 Methodology | 2.2.1 Parallax-based selection
2 Ultracool Dwarfs in J-PLUS | 2.2 Methodology | 2.2.2 Proper motion-based selection


## 8) Chunk and write `chunks.jsonl`

In [None]:

def sliding_windows(paragraph: str, max_words: int, overlap: int) -> List[str]:
    words = paragraph.split()
    if not words: return []
    if len(words) <= max_words: return [paragraph]
    out, start = [], 0
    while start < len(words):
        end = min(len(words), start + max_words)
        out.append(" ".join(words[start:end]))
        if end == len(words): break
        start = max(0, end - overlap)
    return out

@dataclass
class Chunk:
    id: str
    text: str
    type: str
    page: int
    chapter_key: Optional[str]
    chapter: Optional[str]
    section_key: Optional[str]
    section: Optional[str]
    subsection_key: Optional[str]
    subsection: Optional[str]
    thesis_part: Optional[str]
    chunk_idx: int
    chunk_total: int

def thesis_part_from_ctx(ck: Optional[str], sec_title: Optional[str]) -> Optional[str]:
    if ck == "1": return "Introduction"
    if ck in {"2","3","4","5"}:
        return "Conclusions" if (sec_title and "conclusion" in sec_title.lower()) else "Methods/Results"
    if ck == "6": return "Conclusions"
    if ck == "DSA": return "Data & Software"
    if ck == "FRONT": return "Front Matter"
    return None

chunks: List[Chunk] = []
for seg in segments:
    if not seg.text.strip(): continue
    windows = sliding_windows(seg.text.strip(), CHUNK_SIZE, OVERLAP)
    total = len(windows)
    for i, w in enumerate(windows):
        chunks.append(Chunk(
            id=str(uuid.uuid4()), text=w, type=seg.type, page=-1,
            chapter_key=seg.chapter_key, chapter=seg.chapter,
            section_key=seg.section_key, section=seg.section,
            subsection_key=seg.subsection_key, subsection=seg.subsection,
            thesis_part=thesis_part_from_ctx(seg.chapter_key, seg.section),
            chunk_idx=i, chunk_total=total
        ))

out_path = os.path.join(OUT_DIR, "chunks.jsonl")
with open(out_path, "w", encoding="utf-8") as f:
    for c in chunks:
        f.write(json.dumps(asdict(c), ensure_ascii=False) + "\n")

print(f"Wrote {len(chunks)} chunks → {out_path}")


Wrote 130 chunks → /content/chunks.jsonl
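
With `CHUNK_SIZE = 300` and `OVERLAP = 60`, each window starts 240 words after the previous one. `sliding_windows` as written never advances `start` if `overlap >= max_words`, so it would loop forever for such a configuration. A sketch with an explicit guard and a closed-form window count; `sliding_windows_checked` and `expected_windows` are hypothetical names, and unlike the original this version always re-joins the words:

```python
import math
from typing import List

def sliding_windows_checked(text: str, max_words: int, overlap: int) -> List[str]:
    """Word windows of at most max_words, consecutive windows sharing `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    if not words:
        return []
    stride = max_words - overlap
    out = []
    for start in range(0, len(words), stride):
        out.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the last word
    return out

def expected_windows(n_words: int, max_words: int, overlap: int) -> int:
    """Number of windows produced for a text of n_words words."""
    if n_words <= 0:
        return 0
    if n_words <= max_words:
        return 1
    return 1 + math.ceil((n_words - max_words) / (max_words - overlap))

text = " ".join(f"w{i}" for i in range(1000))
print(len(sliding_windows_checked(text, 300, 60)))  # 4
print(expected_windows(1000, 300, 60))              # 4
```

For a 1000-word segment the windows start at words 0, 240, 480 and 720, which matches the formula 1 + ceil((1000 - 300) / 240) = 4.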


## 9) Inspect & stats

In [None]:

import itertools, statistics

print("Sample chunks:")
for row in itertools.islice([asdict(c) for c in chunks], 5):
    print("-"*80)
    print(row["text"][:500])
    print({k: row[k] for k in ["type","chapter_key","chapter","section_key","section","subsection_key","subsection","chunk_idx","chunk_total"]})

lengths = [len(c.text.split()) for c in chunks]
print("\nChunk length (words) — count:", len(lengths))
if lengths:
    print("min/median/mean/max:", min(lengths), statistics.median(lengths), round(statistics.mean(lengths),1), max(lengths))


Sample chunks:
--------------------------------------------------------------------------------
For centuries, humans have gazed at the night sky, wondering what the bright objects up there might look like. Thanks to the technological advances in the last decades, which allow an ever more detailed exploration of our universe, we now know that most of our nearest neighbours were so faint that we could not see them with our naked eyes. Faint, cool, low-mass stars known as M dwarfs are by far the most common type of star in the Solar Neighbourhood (CITATION) . As presented by (CITATION) , the
{'type': 'body', 'chapter_key': '1', 'chapter': 'General Introduction', 'section_key': '1.1', 'section': 'M dwarfs and the substellar realm', 'subsection_key': '1.1.1', 'subsection': 'M dwarfs', 'chunk_idx': 0, 'chunk_total': 4}
--------------------------------------------------------------------------------
(CITATION) , several programs have been established with the goal of identifying potentially 

## 10) Embeddings

In [None]:
import json
import pickle
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

In [None]:
i=0

In [None]:
i+=1
chunks[i]

Chunk(id='b51bc4eb-a937-44de-ae80-c9f0dd9d41d8', text='(CITATION) , several programs have been established with the goal of identifying potentially habitable planets orbiting M dwarfs. Notable examples include the Transiting Exoplanet Survey Satellite []{tess}, the Echelle Spectrograph for Rocky Exoplanet and Stable Spectroscopic Observations []{pepe21} and its predecessor, the High-Accuracy Radial velocity Planet Searcher []{mayor2003,bonfils13}, or the Calar Alto high-Resolution search for M dwarfs with Exoearths with Near-infrared and optical Echelle Spectrographs []{Quirrenbach16,Quirrenbach20}. The small size and low luminosity of M dwarfs, compared to Sun-like stars, make it easier to detect close-in terrestrial planets in their habitable zones (CITATION) . Moreover, M dwarfs have established themselves in recent years as very suitable targets in the search for Earth-like planets (CITATION) , with several studies confirming an elevated occurrence rate of Earth-like planets around

In [None]:
# Load your chunks
chunks_path = "/content/chunks.jsonl"
chunks = []
with open(chunks_path, "r", encoding="utf-8") as f:
    for line in f:
        chunks.append(json.loads(line))

print(f"Loaded {len(chunks)} chunks")

# Load the model
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Rebuild the texts you embed (do this before encoding)
def make_embed_text(row):
    path = " / ".join(x for x in [
        f"Chapter {row.get('chapter')}" if row.get('chapter_key') else None,
    ] if x)
    return f"{path}\n\n{row['text']}".strip() if path else row["text"]

texts = [make_embed_text(c) for c in chunks]  # then encode these

# Compute embeddings
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True, convert_to_numpy=True)
np.save("/content/embeddings_multiling.npy", embeddings)

# Attach embeddings to the chunks
for i, emb in enumerate(embeddings):
    chunks[i]["embedding"] = emb.tolist()  # JSON serializable

# Save enriched chunks with embeddings
out_path = "/content/chunks_with_embeddings_multiling.jsonl"
with open(out_path, "w", encoding="utf-8") as f:
    for c in chunks:
        f.write(json.dumps(c, ensure_ascii=False) + "\n")

print(f"Saved embeddings → {out_path}")
print("Embedding shape:", embeddings.shape)

Loaded 130 chunks


Saved embeddings → /content/chunks_with_embeddings_multiling.jsonl
Embedding shape: (130, 768)
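
Because each row of the enriched JSONL stores its vector as a plain list, the file can be sanity-checked with the standard library alone. A minimal sketch, assuming a hypothetical helper `embedding_shape` that is not part of the notebook; it returns the row count and vector length and raises if any row is missing a vector or has a different length:

```python
import json
from typing import Optional, Tuple

def embedding_shape(jsonl_path: str, field: str = "embedding") -> Tuple[int, int]:
    """Return (rows, dim) for a JSONL file, raising if a row is missing or ragged."""
    rows = 0
    dim: Optional[int] = None
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            vec = json.loads(line).get(field)
            if not isinstance(vec, list):
                raise ValueError(f"line {lineno}: no '{field}' list")
            if dim is None:
                dim = len(vec)
            elif len(vec) != dim:
                raise ValueError(f"line {lineno}: length {len(vec)} != {dim}")
            rows += 1
    return rows, (dim or 0)
```

Calling `embedding_shape("/content/chunks_with_embeddings_multiling.jsonl")` after the cell above should agree with the printed `Embedding shape`.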
