In [21]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import astro_parser

# Parsing the Astrochymist

## Game plan

The table of detections is the second table on the page. The basic idea behind the code is to loop through every row of the table, snagging every column and parsing the data one column at a time.

One of the major "hacks" involves parsing all `<th>` and `<td>` cells together: the HTML sometimes swaps the two tags, so you can't reliably parse a row by indexing cells of a single type. By fetching both tag types at once, you end up with a list in the correct order (fortuitously...).
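
The source of `astro_parser` is not shown in this notebook, so the following is only a minimal sketch of the idea, written with the standard library's `html.parser`. The class name `RowCellCollector` and the sample markup are hypothetical; the point is that treating `<th>` and `<td>` alike keeps the cells of each row in document order even when the two tags are swapped.

```python
from html.parser import HTMLParser


class RowCellCollector(HTMLParser):
    """Collect the text of every <th> and <td> cell, row by row, in document order."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            # Both tag types open a cell, so their relative order is preserved
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical markup: the first row uses <th> for the year, the second uses <td>.
sample_html = """
<table>
  <tr><th>1937</th><td>CH</td><td>zeta Oph</td></tr>
  <tr><td>1940</td><th>CN</th><td>zeta Oph</td></tr>
</table>
"""

collector = RowCellCollector()
collector.feed(sample_html)
cell_rows = collector.rows
print(cell_rows)
```

Indexing only the `<td>` cells would give a different column layout for each of these two rows; collecting both tag types gives the same three-column layout for both.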

Additionally, some rows are continuations of the previous row. These are caught by checking that the molecule column (column 2) does not start with a number. Finally, rows for molecules with disputed detections are picked up by the fact that they have only three columns.
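
The two row checks can be sketched as a small classifier. This is an illustration under assumptions, not the code in `astro_parser`: the function name `classify_row`, the order of the checks, the reading that a leading digit in column 2 marks a continuation row, and the sample rows are all hypothetical.

```python
def classify_row(cells):
    """Label one row of cell strings as 'disputed', 'continuation' or 'detection'."""
    # Assumption: disputed detections are the rows with exactly three columns.
    if len(cells) == 3:
        return "disputed"
    # Assumption: a molecule column (second cell) that starts with a digit
    # means the row continues the previous one.
    if len(cells) > 1 and cells[1][:1].isdigit():
        return "continuation"
    return "detection"


# Hypothetical rows, made up purely to exercise the three branches.
sample_rows = [
    ["1937", "CH", "some source", "Radio"],
    ["", "1941 follow-up", "some source", "UV/Vis"],
    ["2005", "XYZ", "some source"],
]
labels = [classify_row(row) for row in sample_rows]
print(labels)
```

Slicing with `cells[1][:1]` rather than indexing with `cells[1][0]` keeps the check safe when the second cell is empty.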

In [33]:
df = astro_parser.main()

In [4]:
with pd.option_context("display.max_rows",400):
    display(df)

Unnamed: 0,Year,Molecule,Source,Detection Method
0,1937,CHmethylidyne,ζ Oph,"[Radio, UV/Vis]"
1,1940,CNcyano radical,ζ Oph,"[Radio, UV/Vis]"
2,1941,CH+methylidyne cation,"ζ Oph, ξ Per, χ2 Ori, 55 Cyg",[UV/Vis]
3,1963,OHhydroxyl radical,Cas A,[Radio]
4,1968,NH3ammonia,galactic center,[Radio]
5,1969,H2Owater,"Sgr B2, Orion nebula, W49",[Radio]
6,1969,H2COformaldehyde,Numerous sources,[Radio]
7,1970,COcarbon monoxide,"Orion nebula, IRC +10216","[Radio, UV/Vis]"
8,1970,H2hydrogen,Persei,[UV/Vis]
9,1970,HCO+formyl cation,"Orion, W51, W3(OH), L134, Sgr A(NH3A)",[Radio]


## Timeseries analysis

Iterate over the rows of the dataframe and build cumulative histograms of detections for each of the three detection techniques (UV/Vis, radio, and IR).
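
To make "cumulative" concrete, here is a minimal standard-library sketch on three rows taken from the table above (1937, 1941 and 1963). The variable names `sample`, `flags` and `cumulative` are illustrative only; each technique gets a 0/1 flag per row, and `itertools.accumulate` turns the flags into running totals.

```python
from itertools import accumulate

techniques = ["UV/Vis", "Radio", "IR"]

# (year, detection methods) for three rows of the detections table
sample = [
    (1937, ["Radio", "UV/Vis"]),
    (1941, ["UV/Vis"]),
    (1963, ["Radio"]),
]

# One 0/1 flag per row and technique
flags = {t: [int(t in methods) for _, methods in sample] for t in techniques}

# Running totals: the value at row i counts detections up to and including row i
cumulative = {t: list(accumulate(f)) for t, f in flags.items()}
total = list(range(1, len(sample) + 1))

print(cumulative)
print(total)
```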

In [41]:
timedata = list()
techniques = ["UV/Vis", "Radio", "IR"]
total_detections = 0
# Running counts carried from row to row: [Year, UV/Vis, Radio, IR, Total]
row_data = [0] * 5
for index, row in df.iterrows():
    # Copy so that each appended row is an independent snapshot
    row_data = row_data.copy()
    row_data[0] = row["Year"]
    for tech_index, technique in enumerate(techniques):
        if technique in row["Detection Method"]:
            # Increment (rather than set to 1) so the count accumulates
            row_data[tech_index + 1] += 1
    total_detections += 1
    row_data[-1] = total_detections
    timedata.append(row_data)

In [42]:
timedf = pd.DataFrame(timedata, columns=["Year", "UV/Vis", "Radio", "IR", "Total"])

In [43]:
timedf

Unnamed: 0,Year,UV/Vis,Radio,IR,Total
0,1937,1,1,0,1
1,1940,2,2,0,2
2,1941,3,2,0,3
3,1963,3,3,0,4
4,1968,3,4,0,5
5,1969,3,5,0,6
6,1969,3,6,0,7
7,1970,4,7,0,8
8,1970,5,7,0,9
9,1970,5,8,0,10
