# Cleaning

This notebook describes and tests the procedures for cleaning the NOTAMs (Notices to Airmen).

## Preamble

In [481]:
# load basic libraries
import numpy as np
import os
import sys
import re
import pandas as pd
from importlib import reload

# add ./python to python path
sys.path.insert(0, '../python')

# load local libraries
import cleaning
import text_processing

## Options

In [482]:
# options
data_dir = '../0_data'

## Data cleaning

The steps below describe how to run the cleaning workflow.

In [489]:
# reload the module to pick up any changes
# made in the cleaning.py file
reload(cleaning)

# create cleaner object
cleaner = cleaning.Cleaning()

# read the data
cleaner.read(data_dir+'/23-08-2018/Export.txt')

# split the NOTAM into items
cleaner.split()

# clean the structured and unstructured parts
cleaner.clean()

# write the cleaned data to a CSV file
cleaner.write(data_dir+'/23-08-2018/clean.csv')

Reading file...done (found 98547 NOTAMs).
Splitting items...done.
Cleaning unstructured part...done.
Writting file...done.
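
The input and output paths above repeat the dated folder name. A minimal sketch of building both paths from one place with `os.path.join`; the variable names `export_date`, `in_path` and `out_path` are hypothetical and not part of the notebook:

```python
import os

# same values as in the cells above
data_dir = '../0_data'
export_date = '23-08-2018'  # hypothetical variable, defined once

# build the input and output paths from the same folder
in_path = os.path.join(data_dir, export_date, 'Export.txt')
out_path = os.path.join(data_dir, export_date, 'clean.csv')

print(in_path)
print(out_path)
```

These paths could then be passed to `cleaner.read` and `cleaner.write` in place of the concatenated strings.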


## Tests

This section performs a few tests on the cleaned data.

In [484]:
# get the data frame
df = cleaner.get_df()

In [485]:
# number of entries whose cleaned text is an empty string
print(sum(df['text_clean'].values == ''))

458
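
The count above only catches empty strings: an entry that is `None` or `NaN` does not compare equal to `''` and would be missed. A minimal sketch on a toy data frame (the values are invented for illustration and are not taken from the NOTAM export) that counts the two cases separately:

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({'text_clean': ['rwy closed', '', None, 'twy a closed', '']})

# entries that are exactly the empty string
n_empty = int((toy['text_clean'] == '').sum())

# entries that are missing (None or NaN)
n_missing = int(toy['text_clean'].isna().sum())

print(n_empty, n_missing)
```

Running the same two counts on `df['text_clean']` would show whether the cleaned column contains missing values in addition to empty strings.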


In [486]:
# Text processing module 
# provided by Joao at SWISS.
# Modified so that it returns
# the processed text in the 
# "text_azureml" column

reload(text_processing)
_ = text_processing.azureml_main(df)

In [487]:
# reload the cleaning module
reload(cleaning)

# show the cleaning of the
# unstructured text of a single NOTAM
text = df['text'].iloc[119]
acronyms_dict = cleaning.load_acronyms_dict()
text_clean = cleaning.clean_unstructured(text, acronyms_dict)

print(text+'\n')
print(text_clean)

SELCAL FREQ 10024 KHZ U/S

selcal freq <num> khz u s
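
The output above suggests that the cleaning lowercases the text, turns punctuation into spaces and replaces runs of digits with a `<num>` placeholder. The sketch below reproduces that behaviour for this one example. `normalize_sketch` is a hypothetical helper written for illustration, not the actual `cleaning.clean_unstructured`, which also takes the acronyms dictionary and produces `<coord>` placeholders.

```python
import re

def normalize_sketch(text):
    """Hypothetical, simplified normalization (not the real cleaning code)."""
    # lowercase everything
    text = text.lower()
    # replace anything that is not a letter, digit or whitespace with a space
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    # replace each run of digits with a placeholder
    text = re.sub(r'\d+', '<num>', text)
    # collapse repeated whitespace and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_sketch('SELCAL FREQ 10024 KHZ U/S'))
# selcal freq <num> khz u s
```

The order matters in this sketch: punctuation is removed before the digits are replaced, so the angle brackets of the placeholder are not stripped again.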


In [488]:
# for a few more NOTAMs, print the raw text,
# the azureml-processed text and the cleaned text
for i in range(116, 121):
    print(df.iloc[i]['text'])
    print(df.iloc[i]['text_azureml'])
    print(df.iloc[i]['text_clean']+'\n')

(TDM TRK K 180701050001  1807010500 1807012100  DINTY CUNDU 33N140W 33N150W 32N160W 29N170W 28N180E 27N170E  26N160E 25N150E 24N140E TUNTO  RTS/KLAX DINTY  TUNTO R595 SEDKU  TUNTO IGURU  TUNTO GUMBO  RMK/NO TRK ADVISORY FOR TRK K TONIGHT  ALTITUDE MAY BE RESTRICTED WHILE CROSSING ATS ROUTES
 TDM TRK K <number>  <number> <number>  DINTY CUNDU <coordinate> <coordinate> <coordinate> <coordinate> <coordinate> <coordinate>  <coordinate> <coordinate> <coordinate> TUNTO  RTS KLAX DINTY  TUNTO R<number> SEDKU  TUNTO IGURU  TUNTO GUMBO  REMARK NO TRK ADVISORY FOR TRK K TONIGHT  ALTITUDE MAY BE RESTRICTED WHILE CROSSING ATS ROUTES
 tdm trk k <num> <num> <num> dinty cundu <coord> <coord> <coord> <coord> <coord> <coord> <coord> <coord> <coord> tunto rts klax dinty tunto r<num> sedku tunto iguru tunto gumbo rmk no trk advisory for trk k tonight alt may be restricted while crossing ats routes

(TDM TRK J 180701050001  1807010500 1807012100  BOXER KYLLE KANUA KURTT KATCH LOHNE ARCAL 59N160W ONEOX NUL