# MPIA Arxiv on Deck 2: Debugging notebook

This notebook collects a few first-order commands for diagnosing issues with individual papers.
Main definitions are taken from the main notebook.

In [1]:
# Imports
import os
from IPython.display import Markdown, display
from tqdm.notebook import tqdm
import warnings
from PIL import Image 

# requires arxiv_on_deck_2

from arxiv_on_deck_2.arxiv2 import (get_new_papers, 
                                    get_paper_from_identifier,
                                    retrieve_document_source, 
                                    get_markdown_badge)
from arxiv_on_deck_2 import (latex,
                             latex_bib,
                             mpia,
                             highlight_authors_in_list)

# Sometimes images are really big
Image.MAX_IMAGE_PIXELS = 1000000000 

# Some useful definitions.
class AffiliationWarning(UserWarning):
    pass

class AffiliationError(RuntimeError):
    pass

def validation(source: str):
    """Check the source for an MPIA affiliation before parsing the TeX code.

    Raises AffiliationError if the verification fails.
    """
    check = mpia.affiliation_verifications(source, verbose=True)
    if check is not True:
        raise AffiliationError("mpia.affiliation_verifications: " + str(check))

        
warnings.simplefilter('always', AffiliationWarning)

We get the author list from the MPIA website

In [4]:
# !rm -f tmp_mpia_authors.yml

In [5]:
# Getting the list of authors can take some time (internet connection)
# Caching the MPIA author list to avoid running this line every time we restart the kernel.
import yaml
try:
    with open('tmp_mpia_authors.yml', 'r') as fin:
        mpia_authors = yaml.load(fin, yaml.BaseLoader)
    print("`mpia.get_mpia_mitarbeiter_list()`: restored from cache")
except FileNotFoundError:
    print("`mpia.get_mpia_mitarbeiter_list()`: cannot be restored from cache.")
    # get list from MPIA website
    # it automatically filters identified non-scientists :func:`mpia.filter_non_scientists`
    mpia_authors = mpia.get_mpia_mitarbeiter_list()
    with open('tmp_mpia_authors.yml', 'w') as fout:
        fout.write(yaml.dump(mpia_authors))

`mpia.get_mpia_mitarbeiter_list()`: cannot be restored from cache.
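
The cache-or-fetch pattern of the cell above can be isolated into a small helper. This is only a sketch under stated assumptions: `load_or_fetch` is a hypothetical name, JSON from the standard library replaces the YAML file used in the notebook, and `fake_fetch` with its made-up entry stands in for the slow call to `mpia.get_mpia_mitarbeiter_list()`.

```python
import json
import os
import tempfile


def load_or_fetch(cache_path, fetch):
    """Return (data, restored) where `restored` tells if the cache was used.

    Hypothetical helper: `fetch` is any zero-argument callable that
    produces JSON-serializable data (here a stand-in for a network call).
    """
    try:
        with open(cache_path, "r", encoding="utf-8") as fin:
            return json.load(fin), True
    except FileNotFoundError:
        data = fetch()
        with open(cache_path, "w", encoding="utf-8") as fout:
            json.dump(data, fout)
        return data, False


# Stand-in fetcher that records how often it is called (made-up entry)
calls = []

def fake_fetch():
    calls.append(1)
    return [["Jane Doe", "J. Doe"]]


cache_file = os.path.join(tempfile.mkdtemp(), "authors.json")
first, restored_first = load_or_fetch(cache_file, fake_fetch)
second, restored_second = load_or_fetch(cache_file, fake_fetch)
print(restored_first, restored_second, len(calls))
```

Deleting the cache file (as the commented `rm` cell does) forces the next call to fetch again.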


We get the paper to debug

In [6]:
which = "2409.13189"
paper = get_paper_from_identifier(which)
paper


|||
|---:|:---|
| [![arXiv](https://img.shields.io/badge/arXiv-2409.13189-b31b1b.svg)](https://arxiv.org/abs/2409.13189) | **Fast Outflow in the Host Galaxy of the Luminous z $=$ 7.5 Quasar J1007$+$2115**  |
|| Weizhe Liu, et al. |
|*Appeared on*| *2024-09-20*|
|*Comments*| **|
|**Abstract**| James Webb Space Telescope opens a new window to directly probe luminous quasars powered by billion solar mass black holes in the epoch of reionization and their co-evolution with massive galaxies with unprecedented details. In this paper, we report the first results from the deep NIRSpec integral field spectroscopy study of a quasar at $z = 7.5$. We obtain a bolometric luminosity of $\sim$$1.8\times10^{47}$ erg s$^{-1}$ and a black hole mass of $\sim$0.7--2.5$\times10^{9}$ M$_{\odot}$ based on H$\beta$ emission line from the quasar spectrum. We discover $\sim$2 kpc scale, highly blueshifted ($\sim$$-$870 km/s) and broad ($\sim$1400 km/s) [O III] line emission after the quasar PSF has been subtracted. Such line emission most likely originates from a fast, quasar-driven outflow, the earliest one on galactic-scale known so far. The dynamical properties of this outflow fall within the typical ranges of quasar-driven outflows at lower redshift, and the outflow may be fast enough to reach the circumgalactic medium. Combining both the extended and nuclear outflow together, the mass outflow rate, $\sim$300 M$_{\odot}$yr, is $\sim$60%--380% of the star formation rate of the quasar host galaxy, suggesting that the outflow may expel a significant amount of gas from the inner region of the galaxy. The kinetic energy outflow rate, $\sim$3.6$\times10^{44}$ erg s$^{-1}$, is $\sim$0.2% of the quasar bolometric luminosity, which is comparable to the minimum value required for negative feedback based on simulation predictions. The dynamical timescale of the extended outflow is $\sim$1.7 Myr, consistent with the typical quasar lifetime in this era.|

In [7]:
import re
from typing import Sequence

def author_match(author: str, hl_list: Sequence[str], verbose=False) -> Sequence[str]:
    """ Matching author names with a family name reference list
    
    :param author: the author string to check
    :param hl_list: the list of reference authors to match
    :param verbose: prints matching results if set
    :return: the matching sequences or empty sequence if None
    """
    # escape the name so that '.' in initials is matched literally
    pattern = r"\b{:s}\b".format(re.escape(author))
    for hl in hl_list:
        match = re.findall(pattern, hl, re.IGNORECASE)
        if match:
            if verbose:
                print(author, ' -> ',  hl, ' | ', match)
            return match
    # no match: return an empty sequence, as documented
    return []
        

def highlight_authors_in_list(author_list: Sequence[str], 
                              hl_list: Sequence[str], 
                              verbose: bool = False) -> Sequence[str]:
    """ highlight all authors of the paper that match `hl_list` entries

    :param author_list: the list of authors
    :param hl_list: the list of authors to highlight
    :param verbose: prints matching results if set
    :return: the list of authors with the highlighted authors
    """
    new_authors = []
    for author in author_list:
        match = author_match(author, hl_list, verbose)
        if match:
            new_authors.append(f"<mark>{author}</mark>")
        else:
            new_authors.append(f"{author}")
    return new_authors
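
Because the author name is searched inside each reference entry, a short name can match a longer reference: this is why `J. Wolf` is paired with `D. J. Wolf` in the output of the next cell. The sketch below flags such partial matches so they can be checked by hand. It uses only the standard `re` module; `find_partial_matches` is a hypothetical diagnostic helper, not part of `arxiv_on_deck_2`, and the reference list holds only the two entries seen in this notebook.

```python
import re


def find_partial_matches(author, references):
    """Return (reference, is_exact) pairs where `author` occurs as whole words.

    Hypothetical diagnostic helper: `is_exact` is False when the reference
    contains more than the author string (e.g. an extra initial).
    """
    pattern = re.compile(r"\b" + re.escape(author) + r"\b", re.IGNORECASE)
    results = []
    for ref in references:
        if pattern.search(ref):
            results.append((ref, ref.casefold() == author.casefold()))
    return results


references = ["E. Bañados", "D. J. Wolf"]
exact = find_partial_matches("E. Bañados", references)
partial = find_partial_matches("J. Wolf", references)
print(exact)
print(partial)
```

An entry with `is_exact` set to `False` is not necessarily wrong, but it is a good candidate for a manual affiliation check.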

In [8]:
# Check author list with their initials
normed_author_list = [mpia.get_initials(k) for k in paper['authors']]
normed_mpia_authors = [k[1] for k in mpia_authors]
hl_authors = highlight_authors_in_list(normed_author_list, normed_mpia_authors, verbose=True)
matches = [(hl, orig) for hl, orig in zip(hl_authors, paper['authors']) if 'mark' in hl]
if not matches:
    warnings.warn(AffiliationWarning("WARNING: This paper does not seem to have MPIA authors."))
    
paper['authors'] = hl_authors
paper

E. Bañados  ->  E. Bañados  |  ['E. Bañados']
J. Wolf  ->  D. J. Wolf  |  ['J. Wolf']



|||
|---:|:---|
| [![arXiv](https://img.shields.io/badge/arXiv-2409.13189-b31b1b.svg)](https://arxiv.org/abs/2409.13189) | **Fast Outflow in the Host Galaxy of the Luminous z $=$ 7.5 Quasar J1007$+$2115**  |
|| W. Liu, et al. -- incl., <mark>E. Bañados</mark>, <mark>J. Wolf</mark> |
|*Appeared on*| *2024-09-20*|
|*Comments*| **|
|**Abstract**| James Webb Space Telescope opens a new window to directly probe luminous quasars powered by billion solar mass black holes in the epoch of reionization and their co-evolution with massive galaxies with unprecedented details. In this paper, we report the first results from the deep NIRSpec integral field spectroscopy study of a quasar at $z = 7.5$. We obtain a bolometric luminosity of $\sim$$1.8\times10^{47}$ erg s$^{-1}$ and a black hole mass of $\sim$0.7--2.5$\times10^{9}$ M$_{\odot}$ based on H$\beta$ emission line from the quasar spectrum. We discover $\sim$2 kpc scale, highly blueshifted ($\sim$$-$870 km/s) and broad ($\sim$1400 km/s) [O III] line emission after the quasar PSF has been subtracted. Such line emission most likely originates from a fast, quasar-driven outflow, the earliest one on galactic-scale known so far. The dynamical properties of this outflow fall within the typical ranges of quasar-driven outflows at lower redshift, and the outflow may be fast enough to reach the circumgalactic medium. Combining both the extended and nuclear outflow together, the mass outflow rate, $\sim$300 M$_{\odot}$yr, is $\sim$60%--380% of the star formation rate of the quasar host galaxy, suggesting that the outflow may expel a significant amount of gas from the inner region of the galaxy. The kinetic energy outflow rate, $\sim$3.6$\times10^{44}$ erg s$^{-1}$, is $\sim$0.2% of the quasar bolometric luminosity, which is comparable to the minimum value required for negative feedback based on simulation predictions. The dynamical timescale of the extended outflow is $\sim$1.7 Myr, consistent with the typical quasar lifetime in this era.|

We get the (TeX) source:
* retrieve the tarball
* find the main tex file and parse it
* parse for affiliations (since we are debugging, we do not stop if none are found)
* generate the output markdown

In [9]:
def get_markdown_qrcode(paper_id: str):
    """ Generate a qrcode to the arxiv page using qrserver.com
    
    :param paper_id: arXiv identifier of the paper
    :returns: markdown text
    """
    url = r"https://api.qrserver.com/v1/create-qr-code/?size=100x100&data="
    # quote the whole `src` value so that the attribute is valid HTML
    txt = f"""<img src="{url}https://arxiv.org/abs/{paper_id}">"""
    txt = '<div id="qrcode">' + txt + '</div>'
    return txt
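
The QR-code link embeds one URL inside another as the `data` query parameter. A minimal sketch of a more defensive variant is shown below: it percent-encodes the embedded URL with `urllib.parse.quote` and escapes the attribute value with `html.escape`. The name `qrcode_img_tag` and the `size` parameter are assumptions made for illustration; the endpoint is the one used in the cell above, and this sketch assumes the service decodes the `data` parameter as an ordinary query string.

```python
from html import escape
from urllib.parse import quote

QR_ENDPOINT = "https://api.qrserver.com/v1/create-qr-code/"


def qrcode_img_tag(paper_id: str, size: int = 100) -> str:
    """Build an <img> tag pointing to a QR code of the arXiv abstract page.

    Hypothetical variant of `get_markdown_qrcode`.
    """
    # percent-encode the embedded URL so ':' and '/' cannot confuse the query
    target = quote(f"https://arxiv.org/abs/{paper_id}", safe="")
    src = f"{QR_ENDPOINT}?size={size}x{size}&data={target}"
    # escape '&' and quotes so the attribute value is valid HTML
    return f'<div id="qrcode"><img src="{escape(src, quote=True)}"></div>'


tag = qrcode_img_tag("2409.13189")
print(tag)
```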

In [28]:
def clean_non_western_encoded_characters_commands(text: str) -> str:
    """ Strip the CJK environment commands wrapping non-western characters
    
    Only the begin/end CJK commands are removed; the enclosed characters are kept.
    
    :param text: the text to clean
    :return: the cleaned text
    """
    text = re.sub(r"(\\begin{CJK}{UTF8}{gbsn})(.*?)(\\end{CJK})", r"\2", text)
    return text


def get_initials(name: str) -> str:
    """ Get the short name, e.g., A.-B. FamName
    :param name: full name
    :returns: initials
    """
    initials = []
    if '(' in name:
        name = clean_non_western_encoded_characters_commands(name)
        suffix = re.findall(r"\((.*?)\)", name)[0]
        name = name.replace(f"({suffix})", '')
    else:
        suffix = ''
    split = name.split()
    for token in split[:-1]:
        if '-' in token:
            current = '-'.join([k[0] + '.' for k in token.split('-')])
        else:
            current = token[0] + '.'
        initials.append(current)
    initials.append(split[-1].strip())
    if suffix:
        initials.append(f"({suffix})")
    return ' '.join(initials)

['W. Liu (刘伟哲)',
 'X. Fan',
 'J. Yang',
 'E. Bañados',
 'F. Wang',
 'J. Wolf',
 'A. J. Barth',
 'T. Costa',
 'R. Decarli',
 'A.-C. Eilers',
 'F. Loiacono',
 'Y. Shen',
 'E. P. Farina',
 'X. Jin',
 'H. D. Jun',
 'M. Li',
 'A. Lupi',
 'M. A. Marshall',
 'Z. Pan',
 'M. Pudoka',
 'M.-Y. Zhuang (庄明阳)',
 'Jaclyn~B.~Champagne',
 'H. Li',
 'F. Sun',
 'W. L. Tee',
 'A. Vayner',
 'H. Zhang']
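
In the list above, `'Jaclyn~B.~Champagne'` is left unabbreviated: the TeX tie `~` is not whitespace, so `name.split()` sees a single token. One possible fix is to replace ties with spaces before computing initials. The sketch below is an assumption-laden illustration: `normalize_tex_ties` and `short_name` are hypothetical names, and `short_name` is a simplified stand-in for `get_initials` that ignores parenthesized suffixes.

```python
import re


def normalize_tex_ties(name: str) -> str:
    """Replace TeX ties (`~`) with spaces, leaving `\\~` accents untouched."""
    # the negative lookbehind skips tildes that are part of a `\~` accent command
    without_ties = re.sub(r"(?<!\\)~", " ", name)
    return re.sub(r"\s+", " ", without_ties).strip()


def short_name(name: str) -> str:
    """Simplified stand-in for `get_initials` (no suffix handling)."""
    *given, family = normalize_tex_ties(name).split()
    initials = []
    for token in given:
        # hyphenated given names keep their hyphen: A-B -> A.-B.
        initials.append("-".join(part[0] + "." for part in token.split("-") if part))
    return " ".join(initials + [family])


print(short_name("Jaclyn~B.~Champagne"))
```

With this normalization the name would be compared as `J. B. Champagne`, in the same form as the other entries.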

In [30]:
paper_id = f'{which:s}'
folder = f'tmp_{paper_id:s}'

def robust_call(fn, value, *args, **kwargs):
    try:
        return fn(value, *args, **kwargs)
    except Exception:
        return value
    

if not os.path.isdir(folder):
    folder = retrieve_document_source(f"{paper_id}", f'tmp_{paper_id}')

try:
    doc = latex.LatexDocument(folder, validation=validation)    
except AffiliationError as affilerror:
    msg = f"ArXiv:{paper_id:s} is not an MPIA paper... " + str(affilerror)
    print(msg)

# Hack because sometimes author parsing does not work well
if (len(doc.authors) != len(paper['authors'])):
    doc._authors = paper['authors']
else:
    # highlight authors (FIXME: doc.highlight_authors)
    # done on arxiv paper already
    doc._authors = highlight_authors_in_list(
        [robust_call(get_initials, k) for k in doc.authors], 
        normed_mpia_authors, verbose=True)
if (doc.abstract) in (None, ''):
    doc._abstract = paper['abstract']

doc.comment = get_markdown_badge(paper_id) 
if paper['comments']:
    doc.comment += " _" + paper['comments'] + "_"

full_md = doc.generate_markdown_text()

full_md += get_markdown_qrcode(paper_id)

# replace citations
try:
    bibdata = latex_bib.LatexBib.from_doc(doc)
    full_md = latex_bib.replace_citations(full_md, bibdata)
except Exception as e:
    # debugging: report the problem and keep the citations un-replaced
    print(f"Citation replacement failed: {e!r}")

E. Bañados  ->  E. Bañados  |  ['E. Bañados']
J. Wolf  ->  D. J. Wolf  |  ['J. Wolf']
Found 109 bibliographic references in tmp_2409.13189/J1007.bbl.


In [31]:
print(full_md)

<div class="macros" style="visibility:hidden;">
$\newcommand{\ensuremath}{}$
$\newcommand{\xspace}{}$
$\newcommand{\object}[1]{\texttt{#1}}$
$\newcommand{\farcs}{{.}''}$
$\newcommand{\farcm}{{.}'}$
$\newcommand{\arcsec}{''}$
$\newcommand{\arcmin}{'}$
$\newcommand{\ion}[2]{#1#2}$
$\newcommand{\textsc}[1]{\textrm{#1}}$
$\newcommand{\hl}[1]{\textrm{#1}}$
$\newcommand{\footnote}[1]{}$
$\newcommand{\ftm}{J1007+2115}$
$\newcommand{\vdag}{(v)^\dagger}$
$\newcommand$
$\newcommand$
$\newcommand{\xh}{<\chi_{H I}>}$
$\newcommand{\TQ}{t_{Q}}$
$\newcommand{\paa}{{Pa\alpha}}$
$\newcommand{\pab}{{Pa\beta}}$
$\newcommand{\av}{{A_{V}}}$
$\newcommand{\ebv}{E(B-V)}$
$\newcommand{\siv}{[S~{\sc iv}] 10.51 \mum}$
$\newcommand{\oiiitext}{[O~{\sc iii}]}$
$\newcommand{\sivtext}{[S~{\sc iv}]}$
$\newcommand{\lya}{Ly\alpha}$
$\newcommand{\cii}{[C~{\sc ii}] 158 \mum}$
$\newcommand{\ciitext}{[C~{\sc ii}]}$
$\newcommand{\mum}{\ifmmode{\rm \mu m}\else{\mum}\fi}$
$\newcommand{\vdisp}{\vdisp}$
$\newcommand{\wba}{W_{80}

In [32]:
print(doc.abstract)

James Webb Space Telescope opens a new window to directly probe luminous quasars powered by billion solar mass black holes in the epoch of reionization and their co-evolution with massive galaxies with unprecedented details. In this paper, we report the first results from the deep NIRSpec integral field spectroscopy study of a quasar at $z = 7.5$ . We obtain a bolometric luminosity of $\sim1.8\times10^{47}$ $\ergs$ and a black hole mass of ${$\sim$0.7--2.5$\times10^{9}$}$ $\msun$ based on $\hb$ emission line from the quasar spectrum. We discover $\sim$ 2 kpc scale, highly blueshifted ( $\sim-$ 870 $\kms$ ) and broad ( $\sim$ 1400 $\kms$ ) $\oiiitext$ line emission after the quasar PSF has been subtracted. Such line emission most likely originates from a fast, quasar-driven outflow, the earliest one on galactic-scale known so far. The dynamical properties of this outflow fall within the typical ranges of quasar-driven outflows at lower redshift, and the outflow may be fast enough to rea

In [33]:
def export_markdown_summary(md: str, md_fname:str, directory: str):
    """Export MD document and associated relevant images"""
    import os
    import shutil
    import re

    if (os.path.exists(directory) and not os.path.isdir(directory)):
        raise RuntimeError(f"a non-directory file exists with name {directory:s}")

    if (not os.path.exists(directory)):
        print(f"creating directory {directory:s}")
        # makedirs also creates missing parents (e.g. `_build` for `_build/html/`)
        os.makedirs(directory)

    fig_fnames = (re.compile(r'\[Fig.*\]\((.*)\)').findall(md) + 
                  re.compile(r'\<img src="([^>\s]*)"[^>]*/>').findall(md))
    for fname in fig_fnames:
        if 'http' in fname:
            # No need to copy online figures
            continue
        destdir = os.path.join(directory, os.path.dirname(fname))
        destfname = os.path.join(destdir, os.path.basename(fname))
        os.makedirs(destdir, exist_ok=True)
        shutil.copy(fname, destfname)
    with open(os.path.join(directory, md_fname), 'w') as fout:
        fout.write(md)
    print("exported in ", os.path.join(directory, md_fname))
    for fk in fig_fnames:
        print("    + " + os.path.join(directory, fk))

In [34]:
export_markdown_summary(full_md, f"{paper_id:s}.md", '_build/html/')

exported in  _build/html/2409.13189.md
    + _build/html/tmp_2409.13189/./J1007spec.png
    + _build/html/tmp_2409.13189/./J1007_Fig3.png
    + _build/html/tmp_2409.13189/./J1007_outflow.png
    + _build/html/tmp_2409.13189/./J1007_outflow2.png
    + _build/html/tmp_2409.13189/./J1007_outflow3.png
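
To check which figures an export would copy without touching the file system, the two patterns used in `export_markdown_summary` can be run on their own. The sketch below is for illustration only: `local_figures` is a hypothetical helper, and the `sample` markdown is made up (only the file names are taken from the output above).

```python
import re

# Same patterns as in `export_markdown_summary`
MD_LINK = re.compile(r"\[Fig.*\]\((.*)\)")
IMG_TAG = re.compile(r'<img src="([^>\s]*)"[^>]*/>')


def local_figures(md: str):
    """List figure paths referenced in `md`, skipping online images."""
    found = MD_LINK.findall(md) + IMG_TAG.findall(md)
    return [fname for fname in found if "http" not in fname]


# Made-up markdown snippet for illustration
sample = """
<img src="tmp_2409.13189/./J1007spec.png" alt="Fig1" width="100%"/>
[Fig. 2](tmp_2409.13189/./J1007_Fig3.png)
<img src="https://example.org/remote.png"/>
"""

figures = local_figures(sample)
print(figures)
```

Note that the `<img>` pattern only matches self-closing tags ending in `/>`, so a tag closed with a plain `>` is not picked up.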
