# Analyse File Structure

In this notebook I will explore:
* How to load data
* How to verify data

In [58]:
import os

%load_ext autoreload
import utils as ut  # we'll store functions that work in here!
%autoreload 2

import pandas as pd
from matplotlib.pyplot import subplots, show

print("Complete!")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Complete!


## Structure of data

Looking into the file, we see that each block of data begins with a header line giving the time in 24-hour format (HHMM) and two other values,
followed by rows of data.

    0010     25         9                                   
     0    459     11.9   137.   -8.11    8.68    0.25   5   5   5    8    9    7


This pattern is repeated for each time.
First, therefore, we will parse these blocks into time-based dictionaries.

In [59]:
for path, fn, data in ut.get_data():
    for j, key in enumerate(data.keys()):
        print(key)
        if j > 10:
            break
    break

data/20100715-0716 CONSON/WP SSP
kai00715.w3a
0010
0020
0030
0040
0050
0100
0110
0120
0129
0140
0150
0200
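
The internals of `ut.get_data` are not shown in this notebook. As a rough sketch of the grouping step only, assuming a time header always has exactly 3 whitespace-separated fields and a data row always has 13 (both assumptions read off the sample above), something like the following would build the time-keyed dictionary. The name `parse_blocks` is hypothetical and is not part of `utils`.

```python
def parse_blocks(lines):
    """Group data rows under the most recent time header.

    Assumptions (not confirmed by the source file format):
    a time header has 3 fields, a data row has 13, and any
    other line (for example the column titles) is skipped.
    """
    blocks = {}
    current = None
    for line in lines:
        fields = line.split()
        if len(fields) == 3:
            # New time block, keyed by the "HHMM" string.
            current = fields[0]
            blocks[current] = []
        elif len(fields) == 13 and current is not None:
            # Data row: keep the raw strings, types are set later.
            blocks[current].append(fields)
    return blocks


# The two sample lines shown earlier in this notebook.
sample = [
    "0010     25         9",
    " 0    459     11.9   137.   -8.11    8.68    0.25   5   5   5    8    9    7",
]
blocks = parse_blocks(sample)
print(list(blocks))
print(blocks["0010"][0])
```

Keeping the keys as strings preserves the leading zeros, which matches the keys printed above.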


## Parsing rows

Next we need to set the datatype of each row item correctly and then throw the damnable thing at Pandas.

    QC   Height    WS     WD      u       v       w    No. in Cns      SNR (db)     
    Code (m msl) (m/s)  (deg)   (m/s)   (m/s)   (m/s)  SW  NW   V   SW   NW   V     
    0009     25         9                                   
     0    213      8.2    96.   -8.17    0.83    0.33   5   5   5   16   16   14
     
Here is what we will use, assuming the header remains fixed.
I have written a function, `degrees`, so that the odd '-950' wind direction is returned as None instead.

In [36]:
ut.HEADER

{'QC': int,
 'Height': int,
 'WS': float,
 'WD': <function utils.degrees(value, dtype=<class 'int'>)>,
 'u': float,
 'v': float,
 'w': float,
 'No. in Cns SW': int,
 'No. in Cns NW': int,
 'No. in Cns V': int,
 'SNR (db) SW': int,
 'SNR (db) NW': int,
 'SNR (db) V': int}
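
The body of `utils.degrees` is not shown here, only its signature `degrees(value, dtype=int)`. A minimal sketch of how such a converter could behave, assuming the raw field arrives as a string such as `"137."` and that `-950` is the only sentinel for a missing direction, is given below. The name `degrees_sketch` and the constant `MISSING_WD` are hypothetical.

```python
MISSING_WD = -950  # assumed sentinel for a missing wind direction


def degrees_sketch(value, dtype=int):
    """Convert a wind-direction field, returning None for the sentinel."""
    # Go through float first so that strings like "137." parse cleanly;
    # int("137.") on its own would raise ValueError.
    number = float(value)
    if number == MISSING_WD:
        return None
    return dtype(number)


print(degrees_sketch("137."))  # 137
print(degrees_sketch("-950"))  # None
```

When a column contains any None, pandas stores it as a float column with NaN, which would explain why WD can display as `82.0` rather than `82`.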

In [60]:
def load_data_into_frame(data):
    frames = {}
    for time in data:
        ##t = pd.to_datetime(time, format="%H%M").time()
        t = time
        frames[t] = pd.DataFrame(data[time], columns=ut.HEADER.keys())
        
        # Iterating over a DataFrame yields column names, so convert column by column.
        for column in frames[t]:
            frames[t][column] = frames[t][column].apply(ut.HEADER[column])
        frames[t].set_index('Height', inplace=True)
        
    df = pd.concat(frames, axis=0, names=["time", "height"])
    return df
    
for path, fn, data in ut.get_data():
    df = load_data_into_frame(data)
    break
    
df.head()

data/20100715-0716 CONSON/WP SSP
kai00715.w3a


| time | height | QC | WS | WD | u | v | w | No. in Cns SW | No. in Cns NW | No. in Cns V | SNR (db) SW | SNR (db) NW | SNR (db) V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 315 | 0 | 11.4 | 82.0 | -11.24 | -1.66 | 0.3 | 5 | 6 | 6 | 7 | 6 | 4 |
| 10 | 518 | 0 | 13.0 | 81.0 | -12.86 | -2.06 | 0.2 | 6 | 5 | 6 | 19 | 11 | 10 |
| 10 | 720 | 0 | 13.0 | 78.0 | -12.72 | -2.79 | 0.1 | 6 | 6 | 6 | 27 | 10 | 9 |
| 10 | 923 | 0 | 14.8 | 95.0 | -14.73 | 1.27 | 0.6 | 5 | 6 | 6 | 10 | 11 | 10 |
| 10 | 1125 | 0 | 12.6 | 99.0 | -12.46 | 2.06 | 0.3 | 6 | 6 | 6 | 9 | 10 | 8 |
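
To make the `pd.concat` step in `load_data_into_frame` easier to follow, here is a small self-contained sketch. Passing a dictionary of frames makes the keys the outer index level, and `names` labels both levels. The frames are toy data: the `"0010"` wind speeds are taken from the first two rows of the table, while the `"0020"` values are made up for illustration.

```python
import pandas as pd

# Toy frames keyed by time string; the "0020" values are invented.
frames = {
    "0010": pd.DataFrame({"Height": [315, 518], "WS": [11.4, 13.0]}).set_index("Height"),
    "0020": pd.DataFrame({"Height": [315, 518], "WS": [10.9, 12.1]}).set_index("Height"),
}

# Dictionary keys become the outer ("time") level of a MultiIndex.
toy = pd.concat(frames, axis=0, names=["time", "height"])

print(toy.index.names)
print(toy.loc["0010"])               # all heights for one time
print(toy.xs(315, level="height"))   # one height across all times
```

With this layout a single time block is selected with `.loc`, and a single height across all times with `.xs`.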
