# Data Retrieval, Preprocessing and LSA

We retrieve our data from NASA’s Aviation Safety Reporting System (ASRS) at https://asrs.arc.nasa.gov/search/database.html and analyze pilot and controller narratives to gain insight into midair collisions and the factors that contribute to them.
Our study is limited to collision reports on the ASRS website. We do not differentiate between near midair collisions and actual midair collisions, since both are events we wish to understand better.

In [1]:
import numpy as np
import json as js
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Retrieve the data from the csv file
data = pd.read_csv("ASRS_DBOnline.csv",index_col=0).reset_index()
data.head()

Unnamed: 0,Unnamed: 1,Time,Time.1,Place,Place.1,Place.2,Place.3,Place.4,Place.5,Environment,...,Events.3,Events.4,Events.5,Assessments,Assessments.1,Report 1,Report 1.1,Report 2,Report 2.1,Report 1.2
0,ACN,Date,Local Time Of Day,Locale Reference,State Reference,Relative Position.Angle.Radial,Relative Position.Distance.Nautical Miles,Altitude.AGL.Single Value,Altitude.MSL.Single Value,Flight Conditions,...,Detector,When Detected,Result,Contributing Factors / Situations,Primary Problem,Narrative,Callback,Narrative,Callback,Synopsis
1,,,,,,,,,,,...,,,,,,,,,,
2,85251,198804,0601-1200,BOS,MA,,,,16000,IMC,...,Automation Air Traffic Control; Person Air Tra...,,Air Traffic Control Issued New Clearance,,Human Factors,MLG Y WAS HANDED OFF TO ME DSNDING FROM FL240 ...,,,,ARTCC CTLR HAD LESS THAN STANDARD SEPARATION W...
3,85627,198804,1201-1800,GDM,MA,111,10,,10700,VMC,...,Person Air Traffic Control; Person Flight Crew,,Flight Crew Took Evasive Action,,Human Factors,I WAS FLYING THE ACFT AT 11000' MSL LEVEL FLT ...,,,,LGT ON IFR ARR ROUTE WAS GIVEN TRAFFIC ON UNK-...
4,87789,198805,1801-2400,BOS,MA,,7,,1500,Mixed,...,Automation Air Traffic Control; Person Air Tra...,,General None Reported / Taken,,Human Factors,THE WX WAS RPTED VFR 20 SCATTERED; 250 SCATTER...,,,,ACR-MLG BEING VECTORED FOR VISUAL APCH WAS DES...


In [3]:
# Extract the two report columns we care about (selected by column position)
report1 = data[data.columns[91]]
report2 = data[data.columns[95]]
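
Positional indices such as 91 and 95 stop pointing at the right columns if the export layout changes. A minimal sketch of selecting by field name instead, assuming the export keeps two header rows (group name, then field name) as the `data.head()` output suggests. The in-memory CSV and the helper name `select_by_field` are hypothetical stand-ins for illustration, not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical miniature of the two-row header layout seen in data.head()
sample_csv = io.StringIO(
    ",Time,Report 1,Report 1\n"
    "ACN,Date,Narrative,Synopsis\n"
    "85251,198804,MLG Y WAS HANDED OFF,ARTCC CTLR HAD LTSS\n"
    "85627,198804,I WAS FLYING THE ACFT,LGT ON IFR ARR ROUTE\n"
)

# Read everything as text, without treating any row as the header
raw = pd.read_csv(sample_csv, header=None, dtype=str)
field_names = raw.iloc[1]                      # second row holds the field names
records = raw.iloc[2:].reset_index(drop=True)  # remaining rows are the reports


def select_by_field(records, field_names, name):
    """Return every column whose field name equals `name`."""
    positions = [i for i, field in enumerate(field_names) if field == name]
    return records.iloc[:, positions]


synopsis = select_by_field(records, field_names, "Synopsis")
```

Because a field name such as `Narrative` can occur more than once in the real export (Report 1 and Report 2), the helper returns a DataFrame with all matching columns rather than a single Series.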

In [4]:
# Preprocess the report text: tokenize, stem, and drop insignificant terms
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

ps = PorterStemmer()
new_report = []
word_data = []
seen_words = set()  # original tokens already recorded in word_data

# Stems to drop; add further insignificant words here.
# They are compared against the stemmed (lowercased) token.
insignificant_terms = {'air','data','follow','in','the','had','for','from','on','to','with','and','while','','than','less','of','at','an'}
for i in range(len(report2)):
    # Skip missing reports
    if pd.isnull(report2[i]):
        continue
    kept = []
    for w in word_tokenize(report2[i]):
        stem = ps.stem(w)
        # Record each original word and its stem only once
        if w not in seen_words:
            seen_words.add(w)
            word_data.append([w, stem])
        if stem not in insignificant_terms:
            kept.append(stem)
    new_report.append(" ".join(kept))

[nltk_data] Downloading package punkt to /Users/aslstem/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
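
The stem-and-filter step can also be sketched as a small reusable function. This is an illustration only: it splits on whitespace instead of calling `word_tokenize` (so it needs no `punkt` download), `STOP_STEMS` is a hypothetical subset of `insignificant_terms`, and the names `stem_report` and `word_reference` are assumptions introduced here.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical subset of the notebook's insignificant_terms
STOP_STEMS = {"the", "had", "than", "less", "of", "at"}


def stem_report(text, word_reference):
    """Stem whitespace-separated tokens and drop stop stems.

    Each original token is recorded once in `word_reference`
    (token -> stem), using the dict keys as the duplicate check.
    """
    kept = []
    for token in text.split():
        stem = stemmer.stem(token)  # PorterStemmer lowercases by default
        word_reference.setdefault(token, stem)
        if stem not in STOP_STEMS:
            kept.append(stem)
    return " ".join(kept)


word_reference = {}
cleaned = stem_report("ARTCC CTLR HAD LESS THAN STANDARD SEPARATION", word_reference)
```

Keeping the reference in a dict avoids scanning a growing list for every token, which matters once the corpus holds thousands of reports.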


In [5]:
# TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
V = TfidfVectorizer(min_df=4, max_df=0.8)
dtm = V.fit_transform(new_report)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = V.get_feature_names_out().tolist()

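# Center the dtm by subtracting each column's mean
# toarray() gives an ndarray; todense() returns np.matrix, which recent scikit-learn versions no longer accept
dtm_dense = dtm.toarray()
centered_dtm = dtm_dense - np.mean(dtm_dense, axis=0)
np.sum(centered_dtm, axis=0)[:10]  # sanity check: column sums should be close to 0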

# Apply SVD to centered_dtm
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
lsa = TruncatedSVD(10, algorithm = 'randomized')
dtm_lsa = lsa.fit_transform(centered_dtm)
# Normalize each document's LSA vector to unit length
dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)
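
To make the TF-IDF, centering, SVD and normalization steps easier to inspect, here is a minimal sketch of the same pipeline on a toy corpus. The four short "reports" are invented from stems that appear in this notebook, and all variable names prefixed with `toy_` are assumptions introduced for illustration. Running `TruncatedSVD` on a column-centered matrix is equivalent to PCA, which is why the centering step comes first.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

# Invented toy corpus built from stems seen in the notebook
toy_reports = [
    "acr rwi apch separ",
    "acft rwi apch conflict",
    "acr alt tcasii ra",
    "acft alt tcasii dscnt",
]

# Document-term matrix with TF-IDF weights, converted to a dense ndarray
toy_dtm = TfidfVectorizer().fit_transform(toy_reports).toarray()

# Column-center: every term column now has mean zero
toy_centered = toy_dtm - toy_dtm.mean(axis=0)

# Project onto 2 latent components, then scale each document to unit length
svd = TruncatedSVD(n_components=2, algorithm="randomized", random_state=0)
toy_lsa = Normalizer(copy=False).fit_transform(svd.fit_transform(toy_centered))

col_means = toy_centered.mean(axis=0)
row_norms = np.linalg.norm(toy_lsa, axis=1)
```

Converting to a dense array is fine at this scale, but centering destroys sparsity, so for a large corpus the dense matrix may not fit in memory.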

In [6]:
# Find the top 8 terms (largest weights) for each LSA component
Filtered_Categories = []
for i in range(lsa.components_.shape[0]):
    top = np.argsort(lsa.components_[i])[::-1][:8]
    Filtered_Categories.append([terms[j] for j in top])
Filtered_Categories

[['acr', 'error', 'standard', 'system', 'separ', 'between', 'ltss', 'sy'],
 ['ctlr', 'zbw', 'experienc', 'operror', 'at', 'separ', 'standard', 'ft'],
 ['tcasii', 'to', 'alt', 'acr', 'ra', 'ltss', 'assign', 'dscnt'],
 ['rwi', 'on', 'acr', 'apch', 'experienc', 'ltss', 'operror', 'error'],
 ['rwi', 'on', 'acft', 'of', 'apch', 'to', 'separ', 'standard'],
 ['class', 'in', 'airspac', 'tcasii', 'separ', 'ra', 'at', 'system'],
 ['plt', 'at', 'pa28', 'pattern', 'conflict', 'same', 'ltss', 'bed'],
 ['ft', 'crew', 'alt', 'at', 'conflict', 'dep', 'through', 'rwi'],
 ['acft', 'close', 'prox', 'at', 'ha', 'sma', 'tcasii', 'ft'],
 ['acft', 'alt', 'anoth', 'same', 'crew', 'at', 'class', 'airspac']]

Full forms of the abbreviations in the first five components above:
1. air carrier, error, standard, system, separation, between, Less Than Standard Separation, system (SYS)
2. controller, Boston Air Route Traffic Control Center, experience, operational error, at, separation, standard, feet
3. Traffic Alert and Collision Avoidance System II, to, altitude, air carrier, Resolution Advisory, Less Than Standard Separation, assign, descent
4. runway, on, air carrier, approach, experience, Less Than Standard Separation, operational error, error
5. runway, on, aircraft, of, approach, to, separation, standard
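
The top-term extraction used for the component lists can be sketched as a small function that works for any number of terms per component. The function name `top_terms` and the toy weights are hypothetical, chosen only to make the ranking easy to verify by hand.

```python
import numpy as np


def top_terms(components, terms, k=8):
    """Return the k highest-weighted terms for each component row."""
    components = np.asarray(components)
    # argsort is ascending, so reverse each row and keep the first k indices
    order = np.argsort(components, axis=1)[:, ::-1][:, :k]
    return [[terms[j] for j in row] for row in order]


# Toy example: 2 components over 4 terms
toy_components = np.array([[0.1, 0.7, -0.2, 0.4],
                           [0.5, -0.3, 0.6, 0.0]])
toy_terms = ["acft", "alt", "rwi", "separ"]
top2 = top_terms(toy_components, toy_terms, k=2)
# top2 == [['alt', 'separ'], ['rwi', 'acft']]
```

This ranks by signed weight, so strongly negative loadings are ignored. Ranking by `np.abs(components)` instead would surface terms that matter in either direction.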

In [9]:
# Build the word reference table (original word -> stem)
word_reference = pd.DataFrame(word_data, columns=["before", "after"])
word_reference.head()

Unnamed: 0,before,after
0,Synopsis,synopsi
1,ARTCC,artcc
2,CTLR,ctlr
3,HAD,had
4,LESS,less
