# Convert Jama Glossary to LaTeX

1. In Jama, go to the glossary and choose *Export* $\to$ *Excel* to write out `CTA-Glossary.xls`

2. The ID column is not read correctly from the XLS format, so open the result in Excel or Numbers and export it in XLSX format.

3. Read it into a Pandas DataFrame.

Note that the header is on row 3 (counting from zero, as the `header=3` argument does), so we need to specify that; the rows before it are ignored.

In [19]:
from openpyxl import load_workbook

In [20]:
import pandas as pd
import numpy as np
import re

Before reading the file with Pandas, we first need to use openpyxl (or xlrd) to extract the hyperlink from the ID column, because Pandas ignores it (at least in the current version). See https://github.com/pandas-dev/pandas/issues/13439

In [21]:
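def get_link_text(link):
    """Convert Excel HYPERLINK() syntax to just the link text."""
    # the second argument is the displayed text; drop its opening quote and the closing '")'
    _, text = link.split(',')
    return text[1:-2].replace('_', '')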
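
As a quick illustration of what this helper does, the sketch below applies it to a made-up `HYPERLINK` formula and adds a regex-based variant that returns both the URL and the text. The sample formula, the example URL, and the `parse_hyperlink` helper are assumptions for illustration only; the exact string Jama writes into the cell may differ.

```python
import re


def get_link_text(link):
    """Helper from the notebook, repeated so this sketch runs on its own."""
    x, y = link.split(',')
    return y[1:-2].replace('_', '')


# Hypothetical pattern for =HYPERLINK("url","text"); not taken from Jama's documentation.
_HYPERLINK_RE = re.compile(
    r'=HYPERLINK\(\s*"(?P<url>[^"]*)"\s*,\s*"(?P<text>[^"]*)"\s*\)',
    re.IGNORECASE,
)


def parse_hyperlink(formula):
    """Return (url, text) for a HYPERLINK formula, or None if it does not match."""
    match = _HYPERLINK_RE.fullmatch(formula.strip())
    if match is None:
        return None
    return match.group('url'), match.group('text')


# Made-up cell content, for illustration only.
sample = '=HYPERLINK("https://example.org/item?id=1","CTA-GLOS-206")'
print(get_link_text(sample))
print(parse_hyperlink(sample))
```

Unlike the split-based helper, the regex variant does not fail when the URL itself contains a comma, and it returns `None` for cells that do not hold a formula.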

In [22]:
workbook = load_workbook('Inputs/CTA-Glossary-2.xlsx')
worksheet = workbook['Sheet1']
col = 4
outcol = 6

for row in worksheet.rows:
    cell = row[col]
    outcell = row[outcol]
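    try:
        if len(cell.hyperlink.target) > 0:
            outcell.value = cell.hyperlink.target
            cell.value = get_link_text(cell.value)
    except (AttributeError, TypeError, ValueError):
        # no hyperlink on this cell (e.g. header rows), or the value is not a HYPERLINK formula
        pass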

workbook.save("Processed/CTA-Glossary-Processed.xlsx")

In [23]:
glossary = pd.read_excel(
    "Processed/CTA-Glossary-Processed.xlsx", 
    header=3, 
    sheet_name='Sheet1',
    usecols=[0,1,2,3,4,5,6],
    converters={'ID': str},
)  

glossary = glossary.rename(columns={'Unnamed: 6':'URL'})

In [24]:
glossary

| | Modified Date | Last Activity Date | Name | Description | ID | Status | URL |
|---|---|---|---|---|---|---|---|
| 0 | 09/06/2017 | 30/10/2018 | CTA Constituents | | CTA-FLD-4 | | https://jama.cta-observatory.org/perspective.r... |
| 1 | 16/05/2018 | 30/10/2018 | CTAO | The Cherenkov Telescope Array Observatory, an ... | CTA-GLOS-206 | Stable | https://jama.cta-observatory.org/perspective.r... |
| 2 | 16/05/2018 | 30/10/2018 | CTA North | CTA Observation site hosting an Array of Chere... | CTA-GLOS-207 | Stable | https://jama.cta-observatory.org/perspective.r... |
| 3 | 16/05/2018 | 30/10/2018 | CTA South | CTA Observation site hosting an Array of Chere... | CTA-GLOS-208 | Stable | https://jama.cta-observatory.org/perspective.r... |
| 4 | 19/10/2018 | 30/10/2018 | Headquarters | The primary centre for CTAO governance, admini... | CTA-GLOS-209 | Stable | https://jama.cta-observatory.org/perspective.r... |
| 5 | 19/10/2018 | 30/10/2018 | Science Data Management Centre (SDMC) | The primary centre for the management of CTA d... | CTA-GLOS-210 | Stable | https://jama.cta-observatory.org/perspective.r... |
| 6 | 16/05/2018 | 30/10/2018 | Array Site | One of the two observation sites, CTA-N or CTA-S. | CTA-GLOS-211 | Stable | https://jama.cta-observatory.org/perspective.r... |
| 7 | 16/05/2018 | 30/10/2018 | System | The word system is used at multiple levels in ... | CTA-GLOS-212 | Stable | https://jama.cta-observatory.org/perspective.r... |
| 8 | 16/05/2018 | 30/10/2018 | Array | All of the Cherenkov Telescopes at one of the ... | CTA-GLOS-213 | Stable | https://jama.cta-observatory.org/perspective.r... |
| 9 | 16/05/2018 | 30/10/2018 | Sub-array | A sub-set of the Cherenkov Telescopes at one o... | CTA-GLOS-214 | Stable | https://jama.cta-observatory.org/perspective.r... |


Extract the acronyms, where available, by searching for parentheses, and split each acronym from its expanded form (used later to properly label acronym-like glossary entries).

In [25]:
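def is_acronym(name):
    """Return the text inside parentheses in `name`, or None if there is none."""
    match = re.match(pattern=r'.*(\(.*\)).*', string=name)
    if not match:
        return None
    # strip the enclosing parentheses
    return match.group(1)[1:-1]

def get_shortname(name):
    """Return `name` with any parenthesised part removed."""
    return re.sub(pattern=r'\(.*\)', repl='', string=name).strip()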
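
To make the behaviour of the two helpers concrete, here is a small self-contained sketch that runs them on two names from the table above. The functions are repeated (with raw-string patterns) only so that the block runs on its own.

```python
import re


def is_acronym(name):
    """Return the text inside parentheses, or None if the name has none."""
    match = re.match(r'.*(\(.*\)).*', name)
    if not match:
        return None
    return match.group(1)[1:-1]  # strip the enclosing parentheses


def get_shortname(name):
    """Return the name with any parenthesised part removed."""
    return re.sub(r'\(.*\)', '', name).strip()


for name in ["Science Data Management Centre (SDMC)", "CTA North"]:
    print(repr(name), "->", is_acronym(name), "|", get_shortname(name))
```

A name without parentheses yields `None` as its acronym and is returned unchanged as its short name, which is what later selects between the two entry templates.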

In [26]:
# get rid of undefined terms; .copy() avoids pandas' SettingWithCopyWarning on the assignments below
glossary = glossary.dropna(subset=['Description']).copy()
glossary['Acronym'] = glossary['Name'].apply(is_acronym)
glossary['ShortName'] = glossary['Name'].apply(get_shortname)


In [27]:
glossary.head()

| | Modified Date | Last Activity Date | Name | Description | ID | Status | URL | Acronym | ShortName |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 16/05/2018 | 30/10/2018 | CTAO | The Cherenkov Telescope Array Observatory, an ... | CTA-GLOS-206 | Stable | https://jama.cta-observatory.org/perspective.r... | | CTAO |
| 2 | 16/05/2018 | 30/10/2018 | CTA North | CTA Observation site hosting an Array of Chere... | CTA-GLOS-207 | Stable | https://jama.cta-observatory.org/perspective.r... | | CTA North |
| 3 | 16/05/2018 | 30/10/2018 | CTA South | CTA Observation site hosting an Array of Chere... | CTA-GLOS-208 | Stable | https://jama.cta-observatory.org/perspective.r... | | CTA South |
| 4 | 19/10/2018 | 30/10/2018 | Headquarters | The primary centre for CTAO governance, admini... | CTA-GLOS-209 | Stable | https://jama.cta-observatory.org/perspective.r... | | Headquarters |
| 5 | 19/10/2018 | 30/10/2018 | Science Data Management Centre (SDMC) | The primary centre for the management of CTA d... | CTA-GLOS-210 | Stable | https://jama.cta-observatory.org/perspective.r... | SDMC | Science Data Management Centre |


Define some format strings for glossaries and acronym entries:

In [28]:
glossary_rec = """
\\newglossaryentry{{{name}}}{{
    name={{{name}}}, 
    description={{{description} ({ident})}}
}}
"""

In [29]:
# a simpler version for acronyms, which are written as plain glossary entries
acronym_rec = """
\\newglossaryentry{{{label}}}{{
    name={{{abbrev}}}, 
    description={{{description} ({ident})}}, 
    first={{{name} ({abbrev})}}, 
}}
"""

In [30]:
import re

def convert_to_glossary(acro, name, description, ident, url):
    """
    Convert a row in the table to a glossary or acronym entry.
    """
    name = name.strip()
    description = description.strip()
    description = description.replace('_', r'\_')
    description = description.replace('%', r'\%')
    description = description.replace('\n', ' ')
    description = re.sub('[^\x00-\x7F]+', ' ', description)  # remove non-ascii chars

    ident = ident.strip()
    ident = ident.replace('_', r'\_')
    # need extra set of {} outside href to hide it from the Tabular environment in the glossary
    # (raw f-string, so that \h is not treated as an escape sequence)
    ident = rf'{{\href{{{url}}}{{{ident}}}}}'
    ident = re.sub('[^\x00-\x7F]+', ' ', ident)

    if acro is not None:  # if it's an acronym
        return acronym_rec.format(
            label=acro, abbrev=acro, name=name, description=f"({name}) {description}", ident=ident
        )

    # otherwise regular glossary entry
    return glossary_rec.format(name=name, description=description, ident=ident)
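
The function above escapes only `_` and `%`, which may be all this glossary needs. If descriptions ever contain other LaTeX special characters, a single-pass escaper along the following lines could be swapped in. This is a sketch under that assumption: `escape_latex` and its replacement table are hypothetical and not part of the notebook.

```python
import re

# Hypothetical replacement table for LaTeX special characters.
_LATEX_SPECIALS = {
    '\\': r'\textbackslash{}',
    '&': r'\&',
    '%': r'\%',
    '$': r'\$',
    '#': r'\#',
    '_': r'\_',
    '{': r'\{',
    '}': r'\}',
    '~': r'\textasciitilde{}',
    '^': r'\textasciicircum{}',
}

# One alternation over all special characters, so the text is scanned only once
# and characters introduced by a replacement are never escaped a second time.
_SPECIALS_RE = re.compile('|'.join(re.escape(char) for char in _LATEX_SPECIALS))


def escape_latex(text):
    """Escape LaTeX special characters in plain text (hypothetical helper)."""
    return _SPECIALS_RE.sub(lambda m: _LATEX_SPECIALS[m.group(0)], text)


print(escape_latex("50% of R&D_cost"))
```

Doing the substitution in one pass avoids the ordering problem of chained `str.replace` calls, where escaping backslashes after other characters would corrupt the escapes that were already inserted.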
    

Loop through the rows and write each glossary entry to a LaTeX `.inc` file, which you can include in your LaTeX document with
```latex
\input cta-glossary-defs.inc
```
and then reference the entries later in the text:
```latex
This is an example of \glspl{Dark Pedestal} calculated in the \gls{OES}
```

In [31]:
with open("cta-glossary-defs.inc", 'w') as outfile:
    for acro, name, description, ident, url in zip(glossary.Acronym, glossary.ShortName, glossary.Description, glossary.ID, glossary.URL):
        outfile.write(convert_to_glossary(acro, name, description, ident, url))


In [32]:
! tail -n 40 cta-glossary-defs.inc

\newglossaryentry{Availability}{
    name={Availability}, 
    description={The ability of an item or system to be in a state to perform a required function under given conditions over a given time interval assuming that the required external resources are provided. Generally, the Availability is defined by the formula A = (Uptime) / (Uptime + Downtime), where "Uptime" is the total time that the system is performing required functions and "Downtime" is the time where the system is not able to perform (can include the "time off" if corrective maintenance activities are deferred to be performed during daytime, or "MTTR" if corrective maintenance activities can be done during night in safe conditions, see ECA). ({\href{https://jama.cta-observatory.org/perspective.req?projectId=11&docId=31079}{CTA-GLOS-312}})}
}

\newglossaryentry{ACMT}{
    name={ACMT}, 
    description={(Active Corrective Maintenance Time) The direct time spent by maintenance personnel after the arrival at the loc