# METAR Data Analysis Project

METAR is the acronym for Meteorological Aerodrome Report. It is internationally recognized shorthand for weather data used by the aviation community.

### The Problem:

A flight school is looking to expand to new locations. You're given METAR data and asked to identify the 10 best and 10 worst locations based on the following criteria:

* Visibility of 10 statute miles or greater 

* Cloud ceiling of 3,000 ft above ground or higher

* Winds less than 15 kts
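
The three criteria can be written as a single check. This is a minimal sketch, not code from the notebook: `is_flyable` and the constant names are hypothetical, and treating a missing ceiling as acceptable is an assumption.

```python
# Thresholds taken from the criteria listed above.
MIN_VISIBILITY_SM = 10
MIN_CEILING_FT = 3000
MAX_WIND_KT = 15


def is_flyable(visibility_sm, ceiling_ft, wind_kt):
    """Return True when an observation meets all three criteria.

    ceiling_ft may be None, meaning no ceiling was reported (assumption:
    no ceiling counts as meeting the ceiling criterion).
    """
    ceiling_ok = ceiling_ft is None or ceiling_ft >= MIN_CEILING_FT
    return visibility_sm >= MIN_VISIBILITY_SM and ceiling_ok and wind_kt < MAX_WIND_KT


print(is_flyable(10, None, 5))    # clear sky, light wind
print(is_flyable(10, 1400, 5))    # ceiling below 3,000 ft
print(is_flyable(10, 3500, 15))   # wind is not below 15 kt
```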

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

%matplotlib inline

In [2]:
metar = 'metar_data_export.txt'

### Data Import

In [3]:
df = pd.read_csv(metar, delimiter='\t')

In [4]:
df.head()

Unnamed: 0,2015-03-25 21:15:00,KCXP 252115Z AUTO 08005KT 10SM CLR 17/M01 A3033 RMK AO2
0,2015-03-25 21:13:00,KWRB 252113Z AUTO 06005KT 10SM BKN014 OVC024 2...
1,2015-03-25 21:12:00,KTUL 252112Z 14006KT 10SM TS BKN035 BKN120 BKN...
2,2015-03-25 21:11:00,KDRT 252111Z AUTO 13007KT 10SM SCT023 22/17 A2...
3,2015-03-25 21:10:00,KBRL 252110Z AUTO 33009KT 8SM SCT016 07/03 A30...
4,2015-03-25 21:10:00,KCMX 252110Z AUTO 26016KT 4SM -SN BR OVC007 01...


In [5]:
df.shape

(7731240, 2)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7731240 entries, 0 to 7731239
Data columns (total 2 columns):
 #   Column                                                   Dtype 
---  ------                                                   ----- 
 0   2015-03-25 21:15:00                                      object
 1   KCXP 252115Z AUTO 08005KT 10SM CLR 17/M01 A3033 RMK AO2  object
dtypes: object(2)
memory usage: 59.0+ MB


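The initial import produces only two columns (a timestamp and the raw report), and the first observation was consumed as the header row. Both would need more cleaning, so the import is reattempted with a space delimiter and explicit column names.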
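
For comparison, a tab-delimited read can keep the first observation as data by passing `header=None`. The sketch below runs on three lines abridged from the outputs in this notebook; the column names `timestamp` and `report` are hypothetical, and the file is assumed to hold one timestamp, a tab, and the raw report per line.

```python
import io
import pandas as pd

# Sample lines abridged from the notebook outputs (assumed file layout:
# timestamp <TAB> raw METAR report).
sample = (
    "2015-03-25 21:15:00\tKCXP 252115Z AUTO 08005KT 10SM CLR 17/M01 A3033 RMK AO2\n"
    "2015-03-25 21:13:00\tKWRB 252113Z AUTO 06005KT 10SM BKN014 OVC024 20/17 A3009 RMK\n"
    "2015-03-25 21:12:00\tKTUL 252112Z 14006KT 10SM TS BKN035 BKN120 BKN250 24/16 A2975\n"
)

raw = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,                    # keep the first observation as data
    names=["timestamp", "report"],  # hypothetical column names
    parse_dates=["timestamp"],
)

# The station id is the first token of the report.
raw["id"] = raw["report"].str.split().str[0]
print(raw[["timestamp", "id"]])
```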

Reference for what each METAR code means:

https://business.desu.edu/sites/business/files/document/16/metar_and_taf_codes.pdf

In [6]:
col_names = ['date', 'id', 'obsv_time', 'automated', 'wind_speed', 'visibility', 'cloud_cover',
             'temp_dewpt', 'altimeter', 'altimeter_remarks', 'ao2']

In [7]:
df2 = pd.read_csv(metar, delimiter=' ', header = None, names = col_names)

In [8]:
df2.head()

Unnamed: 0,date,id,obsv_time,automated,wind_speed,visibility,cloud_cover,temp_dewpt,altimeter,altimeter_remarks,ao2
0,2015-03-25,21:15:00\tKCXP,252115Z,AUTO,08005KT,10SM,CLR,17/M01,A3033,RMK,AO2
1,2015-03-25,21:13:00\tKWRB,252113Z,AUTO,06005KT,10SM,BKN014,OVC024,20/17,A3009,RMK
2,2015-03-25,21:12:00\tKTUL,252112Z,14006KT,10SM,TS,BKN035,BKN120,BKN250,24/16,A2975
3,2015-03-25,21:11:00\tKDRT,252111Z,AUTO,13007KT,10SM,SCT023,22/17,A2986,RMK,AO2
4,2015-03-25,21:10:00\tKBRL,252110Z,AUTO,33009KT,8SM,SCT016,07/03,A3006,RMK,AO2


In [11]:
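# The space split leaves "21:15:00\tKCXP" in 'id'; split it on the tab into
# a time column and a station id column. Run this cell only once: after the
# split 'id' no longer contains a tab, and a re-run raises a ValueError.
df2[['time', 'id']] = df2['id'].str.split('\t', n=1, expand=True)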

In [12]:
assert len(df2['time']) == len(df2['id'])

In [13]:
df2.head()

Unnamed: 0,date,id,obsv_time,automated,wind_speed,visibility,cloud_cover,temp_dewpt,altimeter,altimeter_remarks,ao2,time
0,2015-03-25,KCXP,252115Z,AUTO,08005KT,10SM,CLR,17/M01,A3033,RMK,AO2,21:15:00
1,2015-03-25,KWRB,252113Z,AUTO,06005KT,10SM,BKN014,OVC024,20/17,A3009,RMK,21:13:00
2,2015-03-25,KTUL,252112Z,14006KT,10SM,TS,BKN035,BKN120,BKN250,24/16,A2975,21:12:00
3,2015-03-25,KDRT,252111Z,AUTO,13007KT,10SM,SCT023,22/17,A2986,RMK,AO2,21:11:00
4,2015-03-25,KBRL,252110Z,AUTO,33009KT,8SM,SCT016,07/03,A3006,RMK,AO2,21:10:00


In [14]:
df2.columns

Index(['date', 'id', 'obsv_time', 'automated', 'wind_speed', 'visibility',
       'cloud_cover', 'temp_dewpt', 'altimeter', 'altimeter_remarks', 'ao2',
       'time'],
      dtype='object')

In [22]:
cols = list(df2.columns)

cols

['date',
 'id',
 'obsv_time',
 'automated',
 'wind_speed',
 'visibility',
 'cloud_cover',
 'temp_dewpt',
 'altimeter',
 'altimeter_remarks',
 'ao2',
 'time']

In [23]:
cols = cols[-1:] + cols[:-1]

cols

['time',
 'date',
 'id',
 'obsv_time',
 'automated',
 'wind_speed',
 'visibility',
 'cloud_cover',
 'temp_dewpt',
 'altimeter',
 'altimeter_remarks',
 'ao2']

In [24]:
df2 = df2[cols]

In [25]:
df2.head()

Unnamed: 0,time,date,id,obsv_time,automated,wind_speed,visibility,cloud_cover,temp_dewpt,altimeter,altimeter_remarks,ao2
0,21:15:00,2015-03-25,KCXP,252115Z,AUTO,08005KT,10SM,CLR,17/M01,A3033,RMK,AO2
1,21:13:00,2015-03-25,KWRB,252113Z,AUTO,06005KT,10SM,BKN014,OVC024,20/17,A3009,RMK
2,21:12:00,2015-03-25,KTUL,252112Z,14006KT,10SM,TS,BKN035,BKN120,BKN250,24/16,A2975
3,21:11:00,2015-03-25,KDRT,252111Z,AUTO,13007KT,10SM,SCT023,22/17,A2986,RMK,AO2
4,21:10:00,2015-03-25,KBRL,252110Z,AUTO,33009KT,8SM,SCT016,07/03,A3006,RMK,AO2


Rows without the optional `AUTO` token are shifted one column to the left, so their values need to be moved over.

In [26]:
# Count rows whose fourth token is 'AUTO'; everything else is non-automated.
auto = (df2['automated'] == 'AUTO').sum()
not_auto = len(df2) - auto

print(f"Num. of Automated: {auto}")
print(f"Num. of Non-Automated: {not_auto}")

Num. of Automated: 4361474
Num. of Non-Automated: 3369767


In [27]:
print(round(auto / len(df2), 2))
print(round(not_auto / len(df2), 2))

0.56
0.44


---

## Data Manipulation

About 44% of the observations have no `AUTO` token in df2['automated'], which is too many to throw out.

Separate the data into two data frames, realign the columns of the non-automated rows, then recombine.
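
As an alternative sketch (not the approach used in the cells that follow), the optional `AUTO` token could be removed from the raw report text before splitting, so every row lines up without a split-and-recombine step. The column names given to `parsed` are hypothetical.

```python
import pandas as pd

reports = pd.Series([
    "KCXP 252115Z AUTO 08005KT 10SM CLR 17/M01 A3033 RMK AO2",
    "KTUL 252112Z 14006KT 10SM TS BKN035 BKN120 BKN250 24/16 A2975",
])

# Remove the first standalone AUTO token so the wind group is always token 2.
aligned = reports.str.replace(r"\sAUTO\s", " ", n=1, regex=True)
tokens = aligned.str.split(expand=True)

parsed = tokens[[0, 2, 3]].copy()
parsed.columns = ["id", "wind", "visibility"]  # hypothetical names
print(parsed)
```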


In [28]:
df = df2.copy()

In [29]:
df.head()

Unnamed: 0,time,date,id,obsv_time,automated,wind_speed,visibility,cloud_cover,temp_dewpt,altimeter,altimeter_remarks,ao2
0,21:15:00,2015-03-25,KCXP,252115Z,AUTO,08005KT,10SM,CLR,17/M01,A3033,RMK,AO2
1,21:13:00,2015-03-25,KWRB,252113Z,AUTO,06005KT,10SM,BKN014,OVC024,20/17,A3009,RMK
2,21:12:00,2015-03-25,KTUL,252112Z,14006KT,10SM,TS,BKN035,BKN120,BKN250,24/16,A2975
3,21:11:00,2015-03-25,KDRT,252111Z,AUTO,13007KT,10SM,SCT023,22/17,A2986,RMK,AO2
4,21:10:00,2015-03-25,KBRL,252110Z,AUTO,33009KT,8SM,SCT016,07/03,A3006,RMK,AO2


In [30]:
df_auto = df[df['automated'] == 'AUTO']

df_auto.head()

Unnamed: 0,time,date,id,obsv_time,automated,wind_speed,visibility,cloud_cover,temp_dewpt,altimeter,altimeter_remarks,ao2
0,21:15:00,2015-03-25,KCXP,252115Z,AUTO,08005KT,10SM,CLR,17/M01,A3033,RMK,AO2
1,21:13:00,2015-03-25,KWRB,252113Z,AUTO,06005KT,10SM,BKN014,OVC024,20/17,A3009,RMK
3,21:11:00,2015-03-25,KDRT,252111Z,AUTO,13007KT,10SM,SCT023,22/17,A2986,RMK,AO2
4,21:10:00,2015-03-25,KBRL,252110Z,AUTO,33009KT,8SM,SCT016,07/03,A3006,RMK,AO2
5,21:10:00,2015-03-25,KCMX,252110Z,AUTO,26016KT,4SM,-SN,BR,OVC007,01/M01,A2959


In [31]:
df_auto = df_auto[['date', 'id', 'wind_speed', 'visibility', 'cloud_cover']]

In [32]:
df_auto.head()

Unnamed: 0,date,id,wind_speed,visibility,cloud_cover
0,2015-03-25,KCXP,08005KT,10SM,CLR
1,2015-03-25,KWRB,06005KT,10SM,BKN014
3,2015-03-25,KDRT,13007KT,10SM,SCT023
4,2015-03-25,KBRL,33009KT,8SM,SCT016
5,2015-03-25,KCMX,26016KT,4SM,-SN


In [33]:
df_non_auto = df[df['automated'] != 'AUTO']

df_non_auto.head()

Unnamed: 0,time,date,id,obsv_time,automated,wind_speed,visibility,cloud_cover,temp_dewpt,altimeter,altimeter_remarks,ao2
2,21:12:00,2015-03-25,KTUL,252112Z,14006KT,10SM,TS,BKN035,BKN120,BKN250,24/16,A2975
7,21:10:00,2015-03-25,KSZL,252110Z,03009KT,9SM,-RA,VCTS,SCT040,BKN100,10/07,A3000
9,21:09:00,2015-03-25,KGCK,252109Z,01029G38KT,10SM,BKN018,BKN033,BKN080,07/02,A3008,RMK
10,21:09:00,2015-03-25,KGSP,252109Z,02009KT,7SM,FEW011,SCT021,OVC041,16/13,A3015,RMK
15,21:07:00,2015-03-25,KSFO,252107Z,29014KT,10SM,FEW200,21/12,A3020,RMK,AO2,WSHFT


In [34]:
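# For non-automated rows every field sits one column to the left, so
# 'automated' holds the wind group, 'wind_speed' the visibility, and
# 'visibility' the first sky/weather token.
df_non_auto = df_non_auto[['date', 'id', 'automated', 'wind_speed', 'visibility']]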

df_non_auto.head()

Unnamed: 0,date,id,automated,wind_speed,visibility
2,2015-03-25,KTUL,14006KT,10SM,TS
7,2015-03-25,KSZL,03009KT,9SM,-RA
9,2015-03-25,KGCK,01029G38KT,10SM,BKN018
10,2015-03-25,KGSP,02009KT,7SM,FEW011
15,2015-03-25,KSFO,29014KT,10SM,FEW200


In [35]:
df_non_auto.columns = df_auto.columns

df_non_auto.head()

Unnamed: 0,date,id,wind_speed,visibility,cloud_cover
2,2015-03-25,KTUL,14006KT,10SM,TS
7,2015-03-25,KSZL,03009KT,9SM,-RA
9,2015-03-25,KGCK,01029G38KT,10SM,BKN018
10,2015-03-25,KGSP,02009KT,7SM,FEW011
15,2015-03-25,KSFO,29014KT,10SM,FEW200


In [36]:
all_dfs = pd.concat([df_auto, df_non_auto])

In [37]:
all_dfs.head()

Unnamed: 0,date,id,wind_speed,visibility,cloud_cover
0,2015-03-25,KCXP,08005KT,10SM,CLR
1,2015-03-25,KWRB,06005KT,10SM,BKN014
3,2015-03-25,KDRT,13007KT,10SM,SCT023
4,2015-03-25,KBRL,33009KT,8SM,SCT016
5,2015-03-25,KCMX,26016KT,4SM,-SN


In [38]:
assert len(all_dfs) == len(df), "Not the same"

Columns of interest are:

- id
- wind_speed
- visibility
- cloud_cover

Next steps: strip the `KT` suffix from wind_speed, strip the `SM` suffix from visibility, and define conditions for cloud_cover.
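
Visibility tokens are not always whole numbers: METAR also reports fractions such as `1/2SM`, and a leading `M` means "less than". A hedged sketch of a converter follows; `visibility_to_miles` is a hypothetical helper, and returning `None` for tokens that are not visibility groups is an assumption.

```python
from fractions import Fraction

import pandas as pd


def visibility_to_miles(token):
    """Convert a token such as '10SM' or '1/2SM' to statute miles, else None."""
    if not isinstance(token, str) or not token.endswith("SM"):
        return None
    value = token[:-2].lstrip("M")  # 'M1/4SM' means "less than 1/4 mile"
    try:
        return float(Fraction(value))
    except (ValueError, ZeroDivisionError):
        return None


vis = pd.Series(["10SM", "8SM", "1/2SM", "M1/4SM", "-SN", None])
print(vis.map(visibility_to_miles).tolist())
```

Tokens that are not visibility groups (for example `-SN` from a misaligned row) come back as missing values, which makes them easy to count before deciding how to handle them.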

In [39]:
all_dfs['cloud_cover'].nunique()

13000

In [40]:
all_dfs.dtypes

date           object
id             object
wind_speed     object
visibility     object
cloud_cover    object
dtype: object

In [41]:
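# Split "10SM" into the number and the trailing remainder; the remainder
# lands in a temporary 'drop' column that is deleted in the next cell.
all_dfs[['visibility', 'drop']] = all_dfs['visibility'].str.split('SM', n=1, expand=True)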


In [42]:
del all_dfs['drop']

In [43]:
all_dfs.head()

Unnamed: 0,date,id,wind_speed,visibility,cloud_cover
0,2015-03-25,KCXP,08005KT,10,CLR
1,2015-03-25,KWRB,06005KT,10,BKN014
3,2015-03-25,KDRT,13007KT,10,SCT023
4,2015-03-25,KBRL,33009KT,8,SCT016
5,2015-03-25,KCMX,26016KT,4,-SN


In [44]:
all_dfs.reset_index(inplace = True)

In [45]:
all_dfs.head()

Unnamed: 0,index,date,id,wind_speed,visibility,cloud_cover
0,0,2015-03-25,KCXP,08005KT,10,CLR
1,1,2015-03-25,KWRB,06005KT,10,BKN014
2,3,2015-03-25,KDRT,13007KT,10,SCT023
3,4,2015-03-25,KBRL,33009KT,8,SCT016
4,5,2015-03-25,KCMX,26016KT,4,-SN


In [46]:
del all_dfs['index']

all_dfs.head()

Unnamed: 0,date,id,wind_speed,visibility,cloud_cover
0,2015-03-25,KCXP,08005KT,10,CLR
1,2015-03-25,KWRB,06005KT,10,BKN014
2,2015-03-25,KDRT,13007KT,10,SCT023
3,2015-03-25,KBRL,33009KT,8,SCT016
4,2015-03-25,KCMX,26016KT,4,-SN


In [47]:
all_dfs.tail()

Unnamed: 0,date,id,wind_speed,visibility,cloud_cover
7731236,2016-04-28,KMFD,06006KT,10,OVC005
7731237,2016-04-28,KCRP,13013KT,7,SCT012
7731238,2016-04-28,KUES,07012G16KT,10,OVC008
7731239,2016-04-28,KVPS,20011KT,10,BKN012
7731240,2016-04-28,KMSN,07010KT,10,OVC016


### Wind_speed:

- first three digits are the true direction in degrees
- next two digits are the sustained speed in knots
- if gusts are present, a `G` followed by two digits gives the maximum wind speed over the past few minutes

One option is to split off the trailing `KT` and then slice out the speed digits. A regular expression that pulls out the relevant numbers is probably cleaner; pandas supports this through `Series.str.extract`.
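
The regular-expression idea can be sketched with `Series.str.extract` and named groups. This is an illustration on a handful of tokens, not the notebook's final parser; the group names are hypothetical, and allowing `VRB` (variable direction) and three-digit speeds is an assumption about what the data may contain.

```python
import pandas as pd

wind = pd.Series(["08005KT", "01029G38KT", "VRB03KT", "00000KT", "10SM"])

# direction (3 digits or VRB), sustained speed (2-3 digits), optional gust.
pattern = r"^(?P<direction>\d{3}|VRB)(?P<speed>\d{2,3})(?:G(?P<gust>\d{2,3}))?KT$"
parts = wind.str.extract(pattern)

# Tokens that do not match (e.g. a misaligned "10SM") become NaN.
parts["speed"] = pd.to_numeric(parts["speed"])
parts["gust"] = pd.to_numeric(parts["gust"])
print(parts)
```

Whether the wind criterion should use the sustained speed or the gust is a judgment call the problem statement leaves open.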

In [49]:
pct_clr = len(all_dfs[all_dfs['cloud_cover'] == 'CLR'])

print(pct_clr)

3246810


In [53]:
clr = round(pct_clr / len(all_dfs), 4) * 100

print(f"Clear Skies: {clr}%")

Clear Skies: 42.0%


### Cloud_cover:

- CLR = sky clear
- FEW = few, >0-2 eighths
- SCT = scattered, 3-4 eighths
- BKN = broken, 5-7 eighths
- OVC = overcast, 8 eighths
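
The ceiling criterion needs a height, not just a coverage code. In METAR the ceiling is the lowest broken or overcast layer (or a vertical visibility group, `VV`), and the three digits after the code give the height in hundreds of feet above ground. The sketch below assumes access to the full report string, since the `cloud_cover` column here holds only the first token after visibility; `ceiling_ft` is a hypothetical helper.

```python
import re

# BKN, OVC or VV followed by a height in hundreds of feet.
CEILING_LAYER = re.compile(r"\b(?:BKN|OVC|VV)(\d{3})")


def ceiling_ft(report):
    """Return the lowest BKN/OVC/VV height in feet, or None if no ceiling."""
    body = report.split(" RMK")[0]  # ignore the remarks section
    heights = [int(h) * 100 for h in CEILING_LAYER.findall(body)]
    return min(heights) if heights else None


print(ceiling_ft("KWRB 252113Z AUTO 06005KT 10SM BKN014 OVC024 20/17 A3009 RMK"))
print(ceiling_ft("KCXP 252115Z AUTO 08005KT 10SM CLR 17/M01 A3033 RMK AO2"))
```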

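For cloud cover, a rules-based score is one option, e.g.:

- clear = 5
- few = 4
- sct = 3
- bkn = 2
- else: 1

In practice the outcome is almost binary: an observation either meets the criteria or it does not. So compute the proportion of days on which flying is possible at each station, then rank stations by that proportion of flyable days.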
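
The ranking step can be sketched as a group-by on a boolean column. The data below is made up for illustration: the station ids are placeholders, `flyable` is a hypothetical column, and the share is computed per observation rather than per day (a per-day version would first need a rule for when a day counts as flyable).

```python
import pandas as pd

# Toy data with placeholder station ids; not results from the METAR file.
obs = pd.DataFrame({
    "id": ["STN_A", "STN_A", "STN_B", "STN_B", "STN_C", "STN_C"],
    "flyable": [True, True, True, False, False, False],
})

# The mean of a boolean column is the share of True values.
share = obs.groupby("id")["flyable"].mean().sort_values(ascending=False)

best = share.head(10)   # highest share of flyable observations
worst = share.tail(10)  # lowest share of flyable observations
print(share)
```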

In [54]:
all_dfs['id'].unique()

array(['KCXP', 'KWRB', 'KDRT', 'KBRL', 'KCMX', 'KINL', 'KSKA', 'KALI',
       'KDLF', 'KFRI', 'KHQM', 'KAFF', 'KDNL', 'KAST', 'KLBX', 'KTRM',
       'KDAN', 'KDAA', 'KADW', 'KBAB', 'KBAD', 'KBIF', 'KBIX', 'KCBM',
       'KCEF', 'KCOF', 'KDMA', 'KDYS', 'KFAF', 'KFBG', 'KFHU', 'KFSI',
       'KFTK', 'KGRF', 'KGRK', 'KGTB', 'KHMN', 'KHRT', 'KHST', 'KIAB',
       'KINS', 'KLFI', 'KLSF', 'KLSV', 'KLUF', 'KMCF', 'KMIB', 'KMUI',
       'KMXF', 'KOFF', 'KPAM', 'KPOB', 'KQGX', 'KRCA', 'KRDR', 'KRIV',
       'KRND', 'KSKF', 'KSSC', 'KSVN', 'KTCM', 'KTIK', 'KVAD', 'KWRI',
       'KHIB', 'KADF', 'KAND', 'KART', 'KBIH', 'KCEC', 'KCTB', 'KDIK',
       'KDUG', 'KEED', 'KEKO', 'KGFA', 'KINW', 'KJMS', 'KMTW', 'KOFK',
       'KQRH', 'KTPH', 'KUKI', 'KWJF', 'KWMC', 'KMMH', 'KOAJ', 'KONP',
       'KOUN', 'KPEQ', 'KPGV', 'KPNA', 'KPQI', 'KPVW', 'KQQY', 'KQSN',
       'KRKD', 'KRNH', 'KRUT', 'KSDY', 'KSEZ', 'KSGU', 'KSJS', 'KSME',
       'KSMN', 'KSOA', 'KTEX', 'KTVF', 'KVIS', 'KWWR', 'K36U', 'KASN',
      

In [55]:
uniques = len(all_dfs['id'].unique())

In [56]:
uniques

702

In [57]:
print(round(len(all_dfs)/ uniques))

11013


That is roughly 11,000 observations per station on average.

___