# ALMA Dataset Generation

The ALMA dataset corresponds to a set of manually selected antenna container logs that cover the full lifecycle of the related computer between reboots.

In [225]:
!ls ../../data/raw/alma | tail

dv25-acsStartContainer_cppContainer_2017-07-10_17.03.32.841
dv25-acsStartContainer_cppContainer_2017-07-10_17.12.59.636
dv25-acsStartContainer_cppContainer_2017-07-10_19.30.47.674
dv25-acsStartContainer_cppContainer_2017-07-10_20.43.06.773
dv25-acsStartContainer_cppContainer_2017-07-10_20.56.06.754
dv25-acsStartContainer_cppContainer_2017-07-11_19.55.26.410
dv25-acsStartContainer_cppContainer_2017-07-11_20.41.40.861
dv25-acsStartContainer_cppContainer_2017-07-11_20.55.42.275
dv25-acsStartContainer_cppContainer_2017-07-12_00.14.04.823
dv25-acsStartContainer_cppContainer_2017-07-12_00.40.12.586


In [226]:
!tail -n 5 ../../data/raw/alma/dv25-acsStartContainer_cppContainer_2017-07-10_17.03.32.841

2017-07-10T17:09:35.050 [CONTROL/DV25/cppContainer-GL - virtual void AmbDeviceImpl::monitorEnc(ACS::Time*, const AMBSystem::AmbRelativeAddr&, AmbDataLength_t&, AmbDataMem_t*)] CAMB Error (type=10000, code=0) Detail="The monitor request returned an error.  AMB status = 11, channel = 1, node number = 0x13, RCA = 0xb009."
terminate called after throwing an instance of 'St9bad_alloc'
  what():  St9bad_alloc
/alma/ACS-2015.8/ACSSW/bin/acsStartContainer: line 112:  3298 Aborted                 (core dumped) $COMMANDLINE
2017-07-10T17:09:48.100 INFO [acsStartContainer] Container: 'CONTROL/DV25/cppContainer' exited with code: 134.


## Load an antenna file

Given an antenna file:

In [345]:
ant_file="../../data/raw/alma/da41-acsStartContainer_cppContainer_2017-07-10_00.15.06.909"

Let's consider only the individual lines that follow this pattern:
```
TIMESTAMP [Source] logtext
```
The matching lines are read into a pandas DataFrame.

In [346]:
import pandas as pd
RAWLINES=!cat $ant_file | egrep "^2[0-9]..\-..\-..T.*[0-9][0-9][0-9] \["
raw=pd.DataFrame(RAWLINES)
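
The `!cat ... | egrep` call relies on IPython shell magic, so it only runs inside a notebook. As a rough sketch, the same line filter could be applied in plain Python. The names `LINE_RULE`, `filter_lines` and `read_matching_lines` are illustrative and not part of the original notebook.

```python
import re
import pandas as pd

# Same rule as the egrep pattern: a line that starts with a 2xxx year, carries
# a timestamp ending in three millisecond digits, then a space and "[".
LINE_RULE = re.compile(r"^2[0-9]..-..-..T.*[0-9][0-9][0-9] \[")

def filter_lines(lines):
    """Keep only the lines that match LINE_RULE, as a one-column DataFrame."""
    kept = [line.rstrip("\n") for line in lines if LINE_RULE.search(line)]
    return pd.DataFrame(kept)

def read_matching_lines(path):
    """Apply filter_lines to a log file on disk (hypothetical helper)."""
    with open(path, errors="replace") as handle:
        return filter_lines(handle)
```

Lines such as stack traces, or lines with a log level between the timestamp and the bracket, are dropped, just as with the `egrep` filter.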

Each line is then split into separate columns, and redundant prefixes such as `CONTROL/<antenna>` (for example `CONTROL/DA41`) are removed from the source.

In [347]:
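import re
# Matches the antenna prefix, e.g. "CONTROL/DA41"
regex = re.compile(r"CONTROL\/[A-Z][A-Z][0-9][0-9]")

# The first 23 characters hold the timestamp (YYYY-MM-DDTHH:MM:SS.mmm)
raw["@timestamp"] = raw[0].apply( lambda r: pd.to_datetime( r[:23] ))
# The source is the text between "[" and the first "]", without the antenna prefix
raw["source"]  = raw[0].apply( lambda r: regex.sub( "", r[24:].split("]")[0][1:] ) )
# The log text is what follows the first "] " (later "] " separators become a space)
raw["logtext"] = raw[0].apply( lambda r: " ".join(r[24:].split("] ")[1:]) )
del raw[0]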
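
To make the split concrete, here is a minimal sketch that parses one line with a single regular expression using named groups. It is an alternative to the slicing above, not the notebook's method. `LINE`, `ANTENNA_PREFIX` and `parse_line` are hypothetical names, and the sample line is rebuilt from the second row of the table shown in the next cell. Unlike the `split`/`join` version, this sketch keeps any later `"] "` in the log text untouched.

```python
import re
import pandas as pd

# timestamp (23 non-space characters), "[source]", then the log text
LINE = re.compile(r"^(?P<ts>\S{23}) \[(?P<source>[^\]]*)\] ?(?P<logtext>.*)$")
ANTENNA_PREFIX = re.compile(r"CONTROL/[A-Z][A-Z][0-9][0-9]")

def parse_line(line):
    """Return a dict with @timestamp, source and logtext, or None if no match."""
    match = LINE.match(line)
    if match is None:
        return None
    return {
        "@timestamp": pd.to_datetime(match.group("ts")),
        "source": ANTENNA_PREFIX.sub("", match.group("source")),
        "logtext": match.group("logtext"),
    }

example = parse_line(
    "2017-07-10T00:15:11.260 "
    "[CONTROL/DA41/cppContainer - maci::ContainerImpl::getManager] "
    "Resolving manager..."
)
```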

In [348]:
raw

Unnamed: 0,@timestamp,source,logtext
0,2017-07-10 00:15:11.255,/cppContainer-GL - cdb::DAOImpl,DAO:'MACI/Containers/CONTROL/DA41/cppContainer...
1,2017-07-10 00:15:11.260,/cppContainer - maci::ContainerImpl::getManager,Resolving manager...
2,2017-07-10 00:15:11.263,/cppContainer-GL - maci::MACIHelper::resolveMa...,ManagerReference obtained via command line: 'c...
3,2017-07-10 00:15:11.783,/cppContainer - maci::Container::init,Recovery enabled.
4,2017-07-10 00:15:11.784,/cppContainer - maci::ContainerImpl::init,Container 'CONTROL/DA41/cppContainer' activated.
...,...,...,...
22079,2017-07-10 16:33:04.787,/cppContainer-GL - void Control::DelayCalculat...,"Illegal Parameter Error (type=10000, code=2) D..."
22080,2017-07-10 16:33:05.880,/cppContainer-GL - void Control::DelayCalculat...,"Illegal Parameter Error (type=10000, code=2) D..."
22081,2017-07-10 16:33:07.045,/cppContainer-GL - void Control::DelayCalculat...,"Illegal Parameter Error (type=10000, code=2) D..."
22082,2017-07-10 16:33:08.276,/cppContainer-GL - void Control::DelayCalculat...,"Illegal Parameter Error (type=10000, code=2) D..."


## Searching for Antenna Observing

Based on previous work, we chose a high-level task called "Antenna Observing", which is characterized by the following start and end events:

```
Request to load 'AntInterferometryController'
...
Switched state of component ... AntInterferometryController: DESTROYING -> DEFUNCT
```
Let's search for both markers in the log text:

In [349]:
start = raw[ raw["logtext"].str.contains("Request to load.*AntInterferometryController", regex=True) ]
end = raw[ raw["logtext"].str.contains("AntInterferometryController: DESTROYING -> DEFUNCT", regex=False) ]

The events between those two markers correspond to the Antenna Observing instances to be analyzed.

In [350]:
ant_obs=pd.DataFrame( { 'start': start["@timestamp"].values, 'end': end["@timestamp"].values })

In [351]:
ant_obs

Unnamed: 0,start,end
0,2017-07-10 00:38:38.588,2017-07-10 00:41:21.199
1,2017-07-10 00:42:11.811,2017-07-10 00:44:22.704
2,2017-07-10 01:04:21.772,2017-07-10 01:06:57.225
3,2017-07-10 01:07:56.474,2017-07-10 01:24:03.884
4,2017-07-10 01:24:35.320,2017-07-10 01:32:30.904
5,2017-07-10 01:36:19.689,2017-07-10 03:02:15.280
6,2017-07-10 03:03:01.875,2017-07-10 04:33:51.515
7,2017-07-10 04:34:21.718,2017-07-10 05:34:23.953
8,2017-07-10 05:34:57.171,2017-07-10 06:51:49.588
9,2017-07-10 06:52:27.492,2017-07-10 08:01:46.526
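
Building `ant_obs` from two columns assumes that the start and end markers come in equal numbers and in matching order. If a container dies before the end marker is logged, that assumption may not hold. As a hedged alternative, each start could be paired with the first end marker that follows it using `pd.merge_asof`. `pair_markers` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def pair_markers(start_times, end_times):
    """Pair each start with the first end at or after it; drop unmatched starts."""
    starts = pd.DataFrame({"start": pd.to_datetime(list(start_times))}).sort_values("start")
    ends = pd.DataFrame({"end": pd.to_datetime(list(end_times))}).sort_values("end")
    # direction="forward" picks the nearest end that is not earlier than the start
    paired = pd.merge_asof(starts, ends, left_on="start", right_on="end",
                           direction="forward")
    return paired.dropna(subset=["end"]).reset_index(drop=True)

paired = pair_markers(
    ["2017-07-10 00:38:38.588", "2017-07-10 00:42:11.811", "2017-07-10 12:00:00.000"],
    ["2017-07-10 00:41:21.199", "2017-07-10 00:44:22.704"],
)
```

In this sketch a start with no later end is dropped, and two consecutive starts with no end between them would both be paired with the same end, so the result still needs a sanity check.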


Now let's filter the raw logs by the `ant_obs` time windows and count the log lines in each one:

In [352]:
for i, r in ant_obs.iterrows():
    # Single boolean mask: timestamps inside [start, end]
    in_window = (raw["@timestamp"] >= r["start"]) & (raw["@timestamp"] <= r["end"])
    print( r["start"], len(raw[in_window]) )

2017-07-10 00:38:38.588000 121
2017-07-10 00:42:11.811000 69
2017-07-10 01:04:21.772000 92
2017-07-10 01:07:56.474000 276
2017-07-10 01:24:35.320000 206
2017-07-10 01:36:19.689000 605
2017-07-10 03:03:01.875000 397
2017-07-10 04:34:21.718000 222
2017-07-10 05:34:57.171000 232
2017-07-10 06:52:27.492000 532
2017-07-10 08:02:32.387000 516
2017-07-10 09:12:34.736000 712
2017-07-10 10:33:46.244000 442
2017-07-10 11:35:29.013000 52
2017-07-10 11:42:15.192000 114
2017-07-10 11:54:56.997000 446
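
A window selection like the one in the loop above can also be expressed with `Series.between`, which builds one inclusive boolean mask over the timestamp column. The helper name `logs_in_window` and the toy frame are illustrative only.

```python
import pandas as pd

def logs_in_window(frame, start, end, column="@timestamp"):
    """Return the rows whose timestamp lies in [start, end], both ends included."""
    mask = frame[column].between(start, end)
    return frame[mask]

# Toy frame with four log lines; the last one falls outside the window
toy = pd.DataFrame({
    "@timestamp": pd.to_datetime([
        "2017-07-10 00:38:38.588",
        "2017-07-10 00:40:00.000",
        "2017-07-10 00:41:21.199",
        "2017-07-10 00:42:11.811",
    ]),
    "logtext": ["a", "b", "c", "d"],
})
selected = logs_in_window(
    toy,
    pd.Timestamp("2017-07-10 00:38:38.588"),
    pd.Timestamp("2017-07-10 00:41:21.199"),
)
```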


  


## All Together

In [356]:
import pandas as pd
import re

def get_alma_logs(ant_file):
    """Return a list of DataFrames, one per Antenna Observing window found in ant_file."""
    RAWLINES=!cat $ant_file | egrep "^2[0-9]..\-..\-..T.*[0-9][0-9][0-9] \["
    raw=pd.DataFrame(RAWLINES)
    if len(raw) == 0:
        return []

    regex = re.compile(r"CONTROL\/[A-Z][A-Z][0-9][0-9]")

    raw["@timestamp"] = raw[0].apply( lambda r: pd.to_datetime( r[:23] ))
    raw["source"]  = raw[0].apply( lambda r: regex.sub( "", r[24:].split("]")[0][1:] ) )
    raw["logtext"] = raw[0].apply( lambda r: " ".join(r[24:].split("] ")[1:]) )
    del raw[0]
    
    start = raw[ raw["logtext"].str.contains("Request to load.*AntInterferometryController", regex=True) ]
    end = raw[ raw["logtext"].str.contains("AntInterferometryController: DESTROYING -> DEFUNCT", regex=False) ]
    
    minl = min(len(start), len(end))
    ant_obs=pd.DataFrame( { 'start': start[:minl]["@timestamp"].values, 'end': end[:minl]["@timestamp"].values })
    
    obs_logs = []
    for i, r in ant_obs.iterrows():
        # Single boolean mask: timestamps inside [start, end]
        in_window = (raw["@timestamp"] >= r["start"]) & (raw["@timestamp"] <= r["end"])
        obs_logs.append( raw[in_window] )
    
    return obs_logs

In [357]:
for logs in get_alma_logs("../../data/raw/alma/da41-acsStartContainer_cppContainer_2017-07-10_00.15.06.909"):
    print ( "At %s there are %s logs" % (logs[["@timestamp"]].values[0], len(logs)) )

At ['2017-07-10T00:38:38.588000000'] there are 121 logs
At ['2017-07-10T00:42:11.811000000'] there are 69 logs
At ['2017-07-10T01:04:21.772000000'] there are 92 logs
At ['2017-07-10T01:07:56.474000000'] there are 276 logs
At ['2017-07-10T01:24:35.320000000'] there are 206 logs
At ['2017-07-10T01:36:19.689000000'] there are 605 logs
At ['2017-07-10T03:03:01.875000000'] there are 397 logs
At ['2017-07-10T04:34:21.718000000'] there are 222 logs
At ['2017-07-10T05:34:57.171000000'] there are 232 logs
At ['2017-07-10T06:52:27.492000000'] there are 532 logs
At ['2017-07-10T08:02:32.387000000'] there are 516 logs
At ['2017-07-10T09:12:34.736000000'] there are 712 logs
At ['2017-07-10T10:33:46.244000000'] there are 442 logs
At ['2017-07-10T11:35:29.013000000'] there are 52 logs
At ['2017-07-10T11:42:15.192000000'] there are 114 logs
At ['2017-07-10T11:54:56.997000000'] there are 446 logs




## Storing ALMA datasets in INTERIM

In [358]:
FILES=!ls ../../data/raw/alma/

In [360]:
!mkdir -p ../../data/interim/ALMA/
all_stats=[]
for f in FILES:
    alllogs = get_alma_logs("../../data/raw/alma/%s" % f)
    print("Processing %s with #%s logs" % (f, len(alllogs)))
    for logs in alllogs:
        stats={}
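        stats["antenna"] = f[:4]  # antenna name, e.g. "da41"
        # Timestamp of the first log line in the window, truncated to milliseconds
        stats["@timestamp"] = str(logs["@timestamp"][0:1].values[0])[:23]
        stats["# logs"] = len(logs)
        # One CSV file per Antenna Observing window
        logs.to_csv("../../data/interim/ALMA/%s-%s-AntObs.csv" % (stats["@timestamp"], stats["antenna"]), index=False )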
        
        all_stats.append(stats)
pstats=pd.DataFrame(all_stats)
pstats.to_csv("../../data/interim/ALMA/count-AntObs.csv", index=False)
pstats

Processing da41-acsStartContainer_cppContainer_2017-07-01_18.16.48.847 with #0 logs
Processing da41-acsStartContainer_cppContainer_2017-07-01_20.06.49.894 with #7 logs




Processing da41-acsStartContainer_cppContainer_2017-07-01_22.45.35.715 with #0 logs
Processing da41-acsStartContainer_cppContainer_2017-07-03_20.37.03.645 with #0 logs
Processing da41-acsStartContainer_cppContainer_2017-07-03_21.43.00.460 with #2 logs
Processing da41-acsStartContainer_cppContainer_2017-07-05_19.22.44.226 with #0 logs
Processing da41-acsStartContainer_cppContainer_2017-07-05_19.44.33.199 with #0 logs
Processing da41-acsStartContainer_cppContainer_2017-07-05_19.56.28.307 with #0 logs
Processing da41-acsStartContainer_cppContainer_2017-07-05_21.11.36.377 with #0 logs
Processing da41-acsStartContainer_cppContainer_2017-07-05_21.12.49.008 with #0 logs
Processing da41-acsStartContainer_cppContainer_2017-07-05_21.51.45.672 with #4 logs
Processing da41-acsStartContainer_cppContainer_2017-07-07_20.39.24.481 with #0 logs
Processing da41-acsStartContainer_cppContainer_2017-07-07_21.01.00.244 with #3 logs
Processing da41-acsStartContainer_cppContainer_2017-07-10_00.15.06.909 with 

Unnamed: 0,antenna,@timestamp,# logs
0,da41,2017-07-01T21:02:13.979,18
1,da41,2017-07-01T21:03:21.631,19
2,da41,2017-07-01T21:20:08.150,41
3,da41,2017-07-01T21:23:26.907,341
4,da41,2017-07-01T21:37:25.567,314
...,...,...,...
625,dv25,2017-07-12T11:02:24.567,273
626,dv25,2017-07-12T12:24:18.011,206
627,dv25,2017-07-12T12:54:40.157,963
628,dv25,2017-07-12T14:39:02.679,933
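
For later analysis, the interim files can be read back with the timestamp column parsed as dates. The following is a minimal sketch, assuming the directory layout used above. `load_ant_obs` is a hypothetical helper, and it skips `count-AntObs.csv` because that summary file also matches the `*-AntObs.csv` pattern.

```python
from pathlib import Path
import pandas as pd

def load_ant_obs(directory):
    """Read every per-window *-AntObs.csv file into a dict keyed by file name."""
    frames = {}
    for path in sorted(Path(directory).glob("*-AntObs.csv")):
        if path.name == "count-AntObs.csv":
            continue  # summary table, not a log window
        frames[path.name] = pd.read_csv(path, parse_dates=["@timestamp"])
    return frames
```

A call such as `load_ant_obs("../../data/interim/ALMA/")` would then return one DataFrame per stored window.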
