# METAR Data Analysis Project

METAR stands for Meteorological Aerodrome Report. It is an internationally recognized shorthand format for weather data used by the aviation community.

The abbreviations and codes used in METAR reports are documented by the US National Weather Service: [source](https://www.weather.gov/media/wrh/mesowest/metar_decode_key.pdf)

### The Problem:

A flight school is looking to expand to new locations. You're given METAR data to identify the 10 best and 10 worst locations based on the following criteria:

* Visibility of 10 statute miles or greater 

* Cloud ceiling of 3,000 ft above ground or higher

* Winds less than 15 kts
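
The three criteria can be expressed as a boolean mask over individual observations. This is a minimal sketch rather than the notebook's final scoring method. It uses the parsed column names that appear later in this notebook (`visibility_sm`, `ceiling_ft`, `wind_spd_kt`). The `meets_criteria` helper and the station IDs `KAAA` and `KBBB` are hypothetical, and treating a missing ceiling as "no ceiling reported" is an assumption.

```python
import pandas as pd


def meets_criteria(obs: pd.DataFrame) -> pd.Series:
    """Flag observations that satisfy all three criteria (hypothetical helper)."""
    good_vis = obs["visibility_sm"] >= 10
    # Assumption: a missing ceiling means no broken/overcast layer was reported
    good_ceiling = obs["ceiling_ft"].isna() | (obs["ceiling_ft"] >= 3000)
    good_wind = obs["wind_spd_kt"] < 15
    return good_vis & good_ceiling & good_wind


# Made-up observations for two made-up stations
sample = pd.DataFrame({
    "station": ["KAAA", "KAAA", "KBBB", "KBBB"],
    "visibility_sm": [10.0, 10.0, 5.0, 10.0],
    "ceiling_ft": [None, 3500.0, 1400.0, 2000.0],
    "wind_spd_kt": [5.0, 20.0, 6.0, 7.0],
})
sample["good"] = meets_criteria(sample)

# Share of "good" observations per station, best first
share_good = sample.groupby("station")["good"].mean().sort_values(ascending=False)
print(share_good)
```

Ranking stations by the share of observations that pass all three checks is one possible way to pick the best and worst locations; other aggregations (for example, per-criterion medians) would also be reasonable.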

----

[Airport Data Link](https://data.humdata.org/dataset/ourairports-usa?)

[Kaggle Airport Data](https://www.kaggle.com/datasets/aravindram11/list-of-us-airports)

[US Cities Population](https://catalog.data.gov/dataset/500-cities-city-level-data-gis-friendly-format-2019-release)

### Imports and Reading in Data

In [1]:
import json
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium

import utils

In [2]:
airports = pd.read_csv('../data/airports.csv')
us_pop = pd.read_csv('../data/500_Cities__City-level_Data__GIS_Friendly_Format___2019_release.csv')


In [3]:
df = utils.parse_metar_file_fast("../data/metar_export.txt")

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7731241 entries, 0 to 7731240
Data columns (total 18 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   time            object 
 2   station         object 
 3   ddhhmmZ         object 
 4   is_auto         bool   
 5   wind_dir        object 
 6   wind_spd_kt     float64
 7   wind_gst_kt     float64
 8   visibility_raw  object 
 9   visibility_sm   float64
 10  sky_layers      object 
 11  ceiling_ft      float64
 12  temp_c          float64
 13  dewpoint_c      float64
 14  altimeter_inHg  float64
 15  remarks         object 
 16  precise_temp_c  float64
 17  precise_dew_c   float64
dtypes: bool(1), float64(9), object(8)
memory usage: 1010.1+ MB


In [5]:
df.head()

Unnamed: 0,date,time,station,ddhhmmZ,is_auto,wind_dir,wind_spd_kt,wind_gst_kt,visibility_raw,visibility_sm,sky_layers,ceiling_ft,temp_c,dewpoint_c,altimeter_inHg,remarks,precise_temp_c,precise_dew_c
0,2015-03-25,21:15:00,KCXP,252115Z,True,80,5.0,,10,10.0,"[{'cov': 'CLR', 'base_ft': None}]",,17.0,-0.0,30.33,AO2,,
1,2015-03-25,21:13:00,KWRB,252113Z,True,60,5.0,,10,10.0,"[{'cov': 'BKN', 'base_ft': 1400}, {'cov': 'OVC...",1400.0,20.0,17.0,30.09,AO2,,
2,2015-03-25,21:12:00,KTUL,252112Z,False,140,6.0,,10,10.0,"[{'cov': 'BKN', 'base_ft': 3500}, {'cov': 'BKN...",3500.0,24.0,16.0,29.75,AO2 OCNL LTGICCC VC SW-NW TS VC SW-NW MOV NE T...,23.9,16.1
3,2015-03-25,21:11:00,KDRT,252111Z,True,130,7.0,,10,10.0,"[{'cov': 'SCT', 'base_ft': 2300}]",,22.0,17.0,29.86,AO2 T02170167,21.7,16.7
4,2015-03-25,21:10:00,KBRL,252110Z,True,330,9.0,,8,8.0,"[{'cov': 'SCT', 'base_ft': 1600}]",,7.0,3.0,30.06,AO2 T00720033,7.2,3.3


In [None]:
# Takes roughly 40 seconds to export
df.to_parquet("../data/metar_data_cleaned.parquet", index=False)

In [7]:
# Checking population Data

print(us_pop.head())
print('='*40)
us_pop.info()

  StateAbbr   PlaceName  PlaceFIPS  Population2010  ACCESS2_CrudePrev  \
0        AL  Birmingham     107000          212237               20.3   
1        AL      Hoover     135896           81619               11.5   
2        AL  Huntsville     137000          180105               16.4   
3        AL      Mobile     150000          195111               18.1   
4        AL  Montgomery     151000          205764               19.0   

  ACCESS2_Crude95CI  ACCESS2_AdjPrev ACCESS2_Adj95CI  ARTHRITIS_CrudePrev  \
0      (19.9, 20.6)             20.6    (20.2, 21.0)                 29.8   
1      (11.0, 12.0)             11.8    (11.3, 12.3)                 25.3   
2      (16.0, 16.9)             16.6    (16.1, 17.0)                 28.9   
3      (17.8, 18.5)             18.5    (18.1, 18.9)                 30.2   
4      (18.5, 19.5)             19.1    (18.7, 19.6)                 28.3   

  ARTHRITIS_Crude95CI  ...  SLEEP_Adj95CI STROKE_CrudePrev  STROKE_Crude95CI  \
0        (29.6, 29

In [8]:
# Push to EDA folder
# Filtering only columns of interest
us_pop = us_pop.loc[:, ['StateAbbr', 'PlaceName', 'Population2010', 'Geolocation']].copy()
# Seeing top 20 Cities
us_pop.sort_values(by = 'Population2010', ascending = False).head(20)

Unnamed: 0,StateAbbr,PlaceName,Population2010,Geolocation
342,NY,New York,8175133,"(40.69496068900, -73.9313850409)"
74,CA,Los Angeles,3792621,"(34.11822778980, -118.408500088)"
222,IL,Chicago,2695598,"(41.83729506150, -87.6862308732)"
429,TX,Houston,2099451,"(29.78066913960, -95.3860033966)"
388,PA,Philadelphia,1526006,"(40.00931478080, -75.1333888571)"
13,AZ,Phoenix,1445632,"(33.57241386950, -112.088995222)"
451,TX,San Antonio,1327407,"(29.47214753330, -98.5246763525)"
111,CA,San Diego,1307402,"(32.83556394180, -117.119792061)"
421,TX,Dallas,1197816,"(32.79398040660, -96.7656929463)"
213,HI,Honolulu,953207,"(21.45880393050, -157.973296737)"


In [9]:
# cleaning up latitude and longitude in us_pop dataframe.
us_pop.head()

Unnamed: 0,StateAbbr,PlaceName,Population2010,Geolocation
0,AL,Birmingham,212237,"(33.52756637730, -86.7988174678)"
1,AL,Hoover,81619,"(33.37676027290, -86.8051937568)"
2,AL,Huntsville,180105,"(34.69896926710, -86.6387042882)"
3,AL,Mobile,195111,"(30.67762486480, -88.1184482714)"
4,AL,Montgomery,205764,"(32.34726453330, -86.2677059552)"


In [10]:
# Testing split for Geolocation column
us_pop['Geolocation'].str.split(pat = ',', expand = True)

Unnamed: 0,0,1
0,(33.52756637730,-86.7988174678)
1,(33.37676027290,-86.8051937568)
2,(34.69896926710,-86.6387042882)
3,(30.67762486480,-88.1184482714)
4,(32.34726453330,-86.2677059552)
...,...,...
495,(43.08098656940,-89.3915106344)
496,(43.06412589250,-87.9672412429)
497,(42.72745994940,-87.8134530240)
498,(43.00933322150,-88.2457679157)


In [11]:
us_pop = pd.merge(us_pop, us_pop['Geolocation'].str.split(pat = ',', expand = True), left_index = True, right_index = True)

us_pop = us_pop.iloc[:, :6].copy()

us_pop.head()

Unnamed: 0,StateAbbr,PlaceName,Population2010,Geolocation,0,1
0,AL,Birmingham,212237,"(33.52756637730, -86.7988174678)",(33.52756637730,-86.7988174678)
1,AL,Hoover,81619,"(33.37676027290, -86.8051937568)",(33.37676027290,-86.8051937568)
2,AL,Huntsville,180105,"(34.69896926710, -86.6387042882)",(34.69896926710,-86.6387042882)
3,AL,Mobile,195111,"(30.67762486480, -88.1184482714)",(30.67762486480,-88.1184482714)
4,AL,Montgomery,205764,"(32.34726453330, -86.2677059552)",(32.34726453330,-86.2677059552)


In [12]:
# Renaming all columns to lower case (including the new coordinate columns) for merging with METAR data
us_pop.columns = ['state', 'city', 'population2010', 'geolocation', 'latitude',
       'longitude']

us_pop.head()

Unnamed: 0,state,city,population2010,geolocation,latitude,longitude
0,AL,Birmingham,212237,"(33.52756637730, -86.7988174678)",(33.52756637730,-86.7988174678)
1,AL,Hoover,81619,"(33.37676027290, -86.8051937568)",(33.37676027290,-86.8051937568)
2,AL,Huntsville,180105,"(34.69896926710, -86.6387042882)",(34.69896926710,-86.6387042882)
3,AL,Mobile,195111,"(30.67762486480, -88.1184482714)",(30.67762486480,-88.1184482714)
4,AL,Montgomery,205764,"(32.34726453330, -86.2677059552)",(32.34726453330,-86.2677059552)


In [13]:
# Removing unnecessary characters in coordinate columns to convert to floats
us_pop['latitude'] = us_pop['latitude'].apply(lambda x: x.replace('(', '')).astype(float)
us_pop['longitude'] = us_pop['longitude'].apply(lambda x: x.replace(')', '')).astype(float)

us_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   state           500 non-null    object 
 1   city            500 non-null    object 
 2   population2010  500 non-null    int64  
 3   geolocation     500 non-null    object 
 4   latitude        500 non-null    float64
 5   longitude       500 non-null    float64
dtypes: float64(2), int64(1), object(3)
memory usage: 23.6+ KB
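
The split, rename, and strip steps above could also be collapsed into a single regular-expression extraction. The sketch below is an alternative under the assumption that every `Geolocation` value has the form `(lat, lon)`; it runs on two sample strings taken from the outputs above rather than on the full file.

```python
import pandas as pd

# Two sample values in the same "(lat, lon)" format as the Geolocation column
geo = pd.Series(
    ["(33.52756637730, -86.7988174678)", "(21.45880393050, -157.973296737)"],
    name="Geolocation",
)

# Named groups become column names in the DataFrame returned by str.extract
pattern = r"\(\s*(?P<latitude>-?\d+(?:\.\d+)?),\s*(?P<longitude>-?\d+(?:\.\d+)?)\s*\)"
coords = geo.str.extract(pattern).astype(float)
print(coords)
```

Rows that do not match the pattern would come back as `NaN`, which makes malformed coordinates easy to spot before converting.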


In [14]:
# checking first few rows to see if any changes need to be made
# Dropping Geolocation since coordinates columns are in float format
us_pop = us_pop.drop(columns = 'geolocation')

us_pop.head()

Unnamed: 0,state,city,population2010,latitude,longitude
0,AL,Birmingham,212237,33.527566,-86.798817
1,AL,Hoover,81619,33.37676,-86.805194
2,AL,Huntsville,180105,34.698969,-86.638704
3,AL,Mobile,195111,30.677625,-88.118448
4,AL,Montgomery,205764,32.347265,-86.267706


In [15]:
# Population looks good, exporting as a .csv
us_pop.to_csv('../data/us_pop.csv', index = False)

---

## Looking at .head() and .info() of airports dataset.

In [16]:
print(airports.head())
print('='*40) # line break
airports.info()

  iata                              airport           city state country  \
0  ABQ            Albuquerque International    Albuquerque    NM     USA   
1  ANC  Ted Stevens Anchorage International      Anchorage    AK     USA   
2  ATL    William B Hartsfield-Atlanta Intl        Atlanta    GA     USA   
3  AUS       Austin-Bergstrom International         Austin    TX     USA   
4  BDL                Bradley International  Windsor Locks    CT     USA   

    latitude   longitude  
0  35.040222 -106.609194  
1  61.174320 -149.996186  
2  33.640444  -84.426944  
3  30.194533  -97.669872  
4  41.938874  -72.683228  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 341 entries, 0 to 340
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   iata       341 non-null    object 
 1   airport    341 non-null    object 
 2   city       337 non-null    object 
 3   state      337 non-null    object 
 4   country    341 non-null    objec

In [17]:
print(f"Number of unique airports: {airports['iata'].nunique()}")
print('='*50)
airports['iata'].unique()

Number of unique airports: 341


array(['ABQ', 'ANC', 'ATL', 'AUS', 'BDL', 'BHM', 'BNA', 'BOS', 'BUF',
       'BUR', 'BWI', 'CHS', 'CLE', 'CLT', 'CMH', 'CVG', 'DAL', 'DCA',
       'DEN', 'DFW', 'DTW', 'ELP', 'EWR', 'FLL', 'HNL', 'HOU', 'IAD',
       'IAH', 'IND', 'JAX', 'JFK', 'LAS', 'LAX', 'LGA', 'LIT', 'MCI',
       'MCO', 'MDW', 'MEM', 'MIA', 'MKE', 'MSP', 'MSY', 'OAK', 'OGG',
       'OKC', 'OMA', 'ONT', 'ORD', 'ORF', 'PBI', 'PDX', 'PHL', 'PHX',
       'PIT', 'PVD', 'RDU', 'RIC', 'RNO', 'RSW', 'SAN', 'SAT', 'SDF',
       'SEA', 'SFO', 'SJC', 'SJU', 'SLC', 'SMF', 'SNA', 'STL', 'TPA',
       'TUL', 'TUS', 'ABE', 'ABI', 'ABR', 'ABY', 'ACK', 'ACT', 'ACV',
       'ACY', 'ADK', 'ADQ', 'AEX', 'AGS', 'AKN', 'ALB', 'ALO', 'AMA',
       'APN', 'ART', 'ASE', 'ATW', 'AVL', 'AVP', 'AZO', 'BET', 'BFL',
       'BGM', 'BGR', 'BIL', 'BIS', 'BJI', 'BLI', 'BMI', 'BOI', 'BPT',
       'BQK', 'BQN', 'BRD', 'BRO', 'BRW', 'BTM', 'BTR', 'BTV', 'BZN',
       'CAE', 'CAK', 'CDC', 'CDV', 'CEC', 'CHA', 'CHO', 'CIC', 'CID',
       'CIU', 'CLD',

In [18]:
airports.to_csv('../data/airports.csv', index = False)

Changed columns to be lower case. Some information will be redundant when merged with `us_pop`. 

---

## Cleaning the METAR Dataset (will require the most work)

In [20]:
df['station'].nunique()

702

In [19]:
df.head()

Unnamed: 0,date,time,station,ddhhmmZ,is_auto,wind_dir,wind_spd_kt,wind_gst_kt,visibility_raw,visibility_sm,sky_layers,ceiling_ft,temp_c,dewpoint_c,altimeter_inHg,remarks,precise_temp_c,precise_dew_c
0,2015-03-25,21:15:00,KCXP,252115Z,True,80,5.0,,10,10.0,"[{'cov': 'CLR', 'base_ft': None}]",,17.0,-0.0,30.33,AO2,,
1,2015-03-25,21:13:00,KWRB,252113Z,True,60,5.0,,10,10.0,"[{'cov': 'BKN', 'base_ft': 1400}, {'cov': 'OVC...",1400.0,20.0,17.0,30.09,AO2,,
2,2015-03-25,21:12:00,KTUL,252112Z,False,140,6.0,,10,10.0,"[{'cov': 'BKN', 'base_ft': 3500}, {'cov': 'BKN...",3500.0,24.0,16.0,29.75,AO2 OCNL LTGICCC VC SW-NW TS VC SW-NW MOV NE T...,23.9,16.1
3,2015-03-25,21:11:00,KDRT,252111Z,True,130,7.0,,10,10.0,"[{'cov': 'SCT', 'base_ft': 2300}]",,22.0,17.0,29.86,AO2 T02170167,21.7,16.7
4,2015-03-25,21:10:00,KBRL,252110Z,True,330,9.0,,8,8.0,"[{'cov': 'SCT', 'base_ft': 1600}]",,7.0,3.0,30.06,AO2 T00720033,7.2,3.3


---

Columns of interest are: 

- date
- time_collected
- station
- wind_speed
- visibility
- cloud cover

We need to strip the `KT` suffix from wind speed and the `SM` suffix from visibility, and encode cloud cover either with rule-based conditions or with scikit-learn's `OrdinalEncoder`.

---

### Wind_speed:

- first three digits are the true direction in degrees
- next two digits are the sustained speed in knots
- if gusts are present, a `G` followed by two digits gives the peak gust speed

Example METAR report: METAR KJFK 291953Z 17015G25KT 10SM FEW040 BKN070 28/18 A3002 RMK AO2 SLP166 T02830183

In this example, the wind group is `17015G25KT`. Here is what each part means:

Wind Direction: "170"

The first three digits give the wind direction in degrees true. Here the wind is coming from 170 degrees, which is just east of due south.

Wind Speed: "15"

The next two digits give the sustained wind speed, and the trailing `KT` marks the unit as knots. Here the sustained wind is 15 knots.

Wind Gusts: "G25"

When gusts are present, a `G` followed by two digits gives the gust speed in knots. Here the wind is gusting to 25 knots.

We aren't too concerned with gusts or wind direction. We only need the sustained wind speed so we can compute the mean and median for each station.
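
To make the breakdown concrete, here is a minimal sketch of parsing a wind group with a regular expression. It is not the logic inside `utils.parse_metar_file_fast` (which is not shown in this notebook); `WIND_RE` and `parse_wind` are hypothetical names, and the output keys simply mirror the parsed column names used here.

```python
import re
from typing import Optional

# ddd or VRB, then 2-3 digit speed, optional G + 2-3 digit gust, then KT
WIND_RE = re.compile(
    r"^(?P<dir>\d{3}|VRB)(?P<spd>\d{2,3})(?:G(?P<gst>\d{2,3}))?KT$"
)


def parse_wind(group: str) -> Optional[dict]:
    """Split a METAR wind group such as '17015G25KT' into its parts."""
    m = WIND_RE.match(group)
    if m is None:
        return None  # not a wind group in knots
    direction = None if m["dir"] == "VRB" else int(m["dir"])
    gust = int(m["gst"]) if m["gst"] else None
    return {"wind_dir": direction, "wind_spd_kt": int(m["spd"]), "wind_gst_kt": gust}


print(parse_wind("17015G25KT"))
print(parse_wind("VRB03KT"))
```

Variable winds (`VRB`) have no numeric direction, so the sketch returns `None` for `wind_dir` in that case.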

---


### Cloud_cover:

- CLR / SKC = Sky clear
- FEW = Few - more than 0 up to 2 eighths
- SCT = Scattered - 3-4 eighths
- BKN = Broken - 5-7 eighths
- OVC = Overcast - 8 eighths

For cloud cover, one option is a rules-based ordinal score, e.g.:

- clear = 5
- few = 4
- sct = 3
- bkn = 2
- else: 1
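
The scoring rules above could be applied to the `sky_layers` column roughly as follows. This is a sketch under two assumptions: each `sky_layers` entry is a Python list of dicts with a `cov` key (as in the `df.head()` output), and a report with several layers is scored by its worst layer. `COVER_SCORE` and `cloud_score` are hypothetical names.

```python
# Rule-based scores from the list above; anything not listed (e.g. OVC) scores 1
COVER_SCORE = {"CLR": 5, "SKC": 5, "FEW": 4, "SCT": 3, "BKN": 2}


def cloud_score(layers):
    """Score a list of sky layers by its worst (lowest-scoring) layer."""
    if not layers:
        return None  # no sky condition reported
    return min(COVER_SCORE.get(layer["cov"], 1) for layer in layers)


print(cloud_score([{"cov": "CLR", "base_ft": None}]))
print(cloud_score([{"cov": "BKN", "base_ft": 1400}, {"cov": "OVC", "base_ft": 2000}]))
```

On the DataFrame this would be applied with something like `metar_df['sky_layers'].apply(cloud_score)`, provided the column still holds lists rather than their string representation.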



In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7731241 entries, 0 to 7731240
Data columns (total 18 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   time            object 
 2   station         object 
 3   ddhhmmZ         object 
 4   is_auto         bool   
 5   wind_dir        object 
 6   wind_spd_kt     float64
 7   wind_gst_kt     float64
 8   visibility_raw  object 
 9   visibility_sm   float64
 10  sky_layers      object 
 11  ceiling_ft      float64
 12  temp_c          float64
 13  dewpoint_c      float64
 14  altimeter_inHg  float64
 15  remarks         object 
 16  precise_temp_c  float64
 17  precise_dew_c   float64
dtypes: bool(1), float64(9), object(8)
memory usage: 1010.1+ MB


In [24]:
df.head()

Unnamed: 0,date,time,station,ddhhmmZ,is_auto,wind_dir,wind_spd_kt,wind_gst_kt,visibility_raw,visibility_sm,sky_layers,ceiling_ft,temp_c,dewpoint_c,altimeter_inHg,remarks,precise_temp_c,precise_dew_c
0,2015-03-25,21:15:00,KCXP,252115Z,True,80,5.0,,10,10.0,"[{'cov': 'CLR', 'base_ft': None}]",,17.0,-0.0,30.33,AO2,,
1,2015-03-25,21:13:00,KWRB,252113Z,True,60,5.0,,10,10.0,"[{'cov': 'BKN', 'base_ft': 1400}, {'cov': 'OVC...",1400.0,20.0,17.0,30.09,AO2,,
2,2015-03-25,21:12:00,KTUL,252112Z,False,140,6.0,,10,10.0,"[{'cov': 'BKN', 'base_ft': 3500}, {'cov': 'BKN...",3500.0,24.0,16.0,29.75,AO2 OCNL LTGICCC VC SW-NW TS VC SW-NW MOV NE T...,23.9,16.1
3,2015-03-25,21:11:00,KDRT,252111Z,True,130,7.0,,10,10.0,"[{'cov': 'SCT', 'base_ft': 2300}]",,22.0,17.0,29.86,AO2 T02170167,21.7,16.7
4,2015-03-25,21:10:00,KBRL,252110Z,True,330,9.0,,8,8.0,"[{'cov': 'SCT', 'base_ft': 1600}]",,7.0,3.0,30.06,AO2 T00720033,7.2,3.3


In [25]:
# Subsetting to only columns of interest

metar_df = df.loc[:, ['date', 'time', 'station', 'wind_spd_kt', 'wind_gst_kt', 'visibility_raw', 'visibility_sm', 'sky_layers', 'ceiling_ft']].copy()

metar_df.head()

Unnamed: 0,date,time,station,wind_spd_kt,wind_gst_kt,visibility_raw,visibility_sm,sky_layers,ceiling_ft
0,2015-03-25,21:15:00,KCXP,5.0,,10,10.0,"[{'cov': 'CLR', 'base_ft': None}]",
1,2015-03-25,21:13:00,KWRB,5.0,,10,10.0,"[{'cov': 'BKN', 'base_ft': 1400}, {'cov': 'OVC...",1400.0
2,2015-03-25,21:12:00,KTUL,6.0,,10,10.0,"[{'cov': 'BKN', 'base_ft': 3500}, {'cov': 'BKN...",3500.0
3,2015-03-25,21:11:00,KDRT,7.0,,10,10.0,"[{'cov': 'SCT', 'base_ft': 2300}]",
4,2015-03-25,21:10:00,KBRL,9.0,,8,8.0,"[{'cov': 'SCT', 'base_ft': 1600}]",


In [26]:
metar_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7731241 entries, 0 to 7731240
Data columns (total 9 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   time            object 
 2   station         object 
 3   wind_spd_kt     float64
 4   wind_gst_kt     float64
 5   visibility_raw  object 
 6   visibility_sm   float64
 7   sky_layers      object 
 8   ceiling_ft      float64
dtypes: float64(4), object(5)
memory usage: 530.9+ MB


In [28]:
metar_df.isnull().sum()

date                    0
time                    0
station                 0
wind_spd_kt         33659
wind_gst_kt       6769076
visibility_raw      91264
visibility_sm       91264
sky_layers              0
ceiling_ft        4688764
dtype: int64

In [30]:
metar_df.isnull().mean()

date              0.000000
time              0.000000
station           0.000000
wind_spd_kt       0.004354
wind_gst_kt       0.875548
visibility_raw    0.011805
visibility_sm     0.011805
sky_layers        0.000000
ceiling_ft        0.606470
dtype: float64

With about 87.6% of the `wind_gst_kt` values missing, that column will be dropped.

In [None]:
metar_df = metar_df.drop(columns=['wind_gst_kt']) # originally had ceiling_ft in here

metar_df.isnull().mean()

date              0.000000
time              0.000000
station           0.000000
wind_spd_kt       0.004354
visibility_raw    0.011805
visibility_sm     0.011805
sky_layers        0.000000
dtype: float64

Dealing with `ceiling_ft`. A simplistic approach is to binarize the variable to indicate whether a ceiling was reported or not. Since the `sky_layers` column holds nested data, this is a simplifying assumption rather than a realistic one.

In [None]:
metar_df['has_ceiling'] = metar_df['ceiling_ft'].notna().astype(int)
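
A less simplified alternative would be to derive the ceiling height directly from `sky_layers`. By aviation convention the ceiling is the lowest broken (BKN) or overcast (OVC) layer, or the vertical visibility (VV) when the sky is obscured. The sketch below assumes each entry is a list of dicts with `cov` and `base_ft` keys, as in the `df.head()` output; whether the parser emits a `VV` code is an assumption, and `ceiling_from_layers` is a hypothetical helper.

```python
import math

# Coverage codes that constitute a ceiling ("VV" key is an assumption about the parser)
CEILING_COVERS = {"BKN", "OVC", "VV"}


def ceiling_from_layers(layers):
    """Return the lowest BKN/OVC/VV base in feet, or NaN if there is no ceiling."""
    bases = [
        layer["base_ft"]
        for layer in layers
        if layer["cov"] in CEILING_COVERS and layer["base_ft"] is not None
    ]
    return float(min(bases)) if bases else math.nan


print(ceiling_from_layers([{"cov": "BKN", "base_ft": 3500}, {"cov": "BKN", "base_ft": 5000}]))
print(ceiling_from_layers([{"cov": "SCT", "base_ft": 2300}]))
```

This matches the sample rows shown earlier, where a lone SCT layer has a missing `ceiling_ft` and a BKN layer at 1400 ft gives a ceiling of 1400.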

In [32]:
metar_df.dtypes

date               object
time               object
station            object
wind_spd_kt       float64
visibility_raw     object
visibility_sm     float64
sky_layers         object
dtype: object

In [38]:
metar_df['visibility_raw'].value_counts().sort_values()

visibility_raw
18          1
64          1
53          1
99          1
26          1
       ...   
5      128559
8      151831
9      190645
7      196565
10    6260972
Name: count, Length: 66, dtype: int64

In [42]:
from fractions import Fraction

In [43]:
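def parse_visibility(raw):
    """Convert a raw visibility value (e.g. "10", "3/4", "1 1/2", "P6") to a float in statute miles."""
    if pd.isna(raw):
        return None
    try:
        s = str(raw)
        # A leading "P" means "more than"; keep only the numeric part
        if s.startswith("P"):
            s = s[1:]
        # Mixed number such as "1 1/2": whole part plus fraction
        if " " in s:
            a, b = s.split()
            return float(a) + float(Fraction(b))
        # Plain fraction such as "3/4"
        if "/" in s:
            return float(Fraction(s))
        return float(s)
    except Exception as e:
        # Report values that cannot be parsed and treat them as missing
        print(e)
        return None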

In [44]:
metar_df['visibility_raw'] = metar_df['visibility_raw'].apply(parse_visibility)

In [46]:
metar_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7731241 entries, 0 to 7731240
Data columns (total 7 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   time            object 
 2   station         object 
 3   wind_spd_kt     float64
 4   visibility_raw  float64
 5   visibility_sm   float64
 6   sky_layers      object 
dtypes: float64(3), object(4)
memory usage: 412.9+ MB


Visibility data can sometimes come in as strings or fractions. The parsing step should have caught most of these, but the `parse_visibility` function was needed to correct a few that made it through.

---

Missing value imputation.

Since visibility and wind speed can vary widely from one location to another, missing values are imputed with the median for each station rather than a single global value.



In [49]:
metar_df.isnull().mean()

date              0.000000
time              0.000000
station           0.000000
wind_spd_kt       0.004354
visibility_raw    0.011805
visibility_sm     0.011805
sky_layers        0.000000
dtype: float64

In [50]:
metar_df['wind_spd_kt'] = metar_df.groupby('station')['wind_spd_kt'].transform(lambda x: x.fillna(x.median()))

In [51]:
metar_df['visibility_raw'] = metar_df.groupby('station')['visibility_raw'].transform(lambda x: x.fillna(x.median()))

In [52]:
metar_df['visibility_sm'] = metar_df.groupby('station')['visibility_sm'].transform(lambda x: x.fillna(x.median()))

In [53]:
metar_df.isnull().mean()

date              0.000000
time              0.000000
station           0.000000
wind_spd_kt       0.000000
visibility_raw    0.003633
visibility_sm     0.003633
sky_layers        0.000000
dtype: float64

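There are still a few missing values in the visibility columns, most likely from stations with no valid visibility readings at all, which have no median to fill from. They make up less than 0.5% of rows, which is small enough to simply drop.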
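
A toy example shows how per-station median imputation can leave gaps. When every visibility value for a station is missing, that station's median is itself `NaN`, so `fillna` has nothing to fill with. The station IDs and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# KAAA has some valid readings; KBBB (hypothetical) has none at all
toy = pd.DataFrame({
    "station": ["KAAA", "KAAA", "KAAA", "KBBB", "KBBB"],
    "visibility_sm": [10.0, np.nan, 8.0, np.nan, np.nan],
})

# Same pattern as the imputation cells above
toy["visibility_sm"] = toy.groupby("station")["visibility_sm"].transform(
    lambda x: x.fillna(x.median())
)
print(toy)

# Stations whose values are all still missing after imputation
still_empty = toy.groupby("station")["visibility_sm"].count() == 0
print(still_empty[still_empty].index.tolist())
```

Running the same `count() == 0` check on `metar_df` would confirm whether the remaining missing rows really do come from stations without any visibility readings.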

In [55]:
metar_df = metar_df.dropna()

metar_df.isnull().mean()

date              0.0
time              0.0
station           0.0
wind_spd_kt       0.0
visibility_raw    0.0
visibility_sm     0.0
sky_layers        0.0
dtype: float64

All missing values have been dealt with.

In [56]:
metar_df['sky_layers']

0                          [{'cov': 'CLR', 'base_ft': None}]
1          [{'cov': 'BKN', 'base_ft': 1400}, {'cov': 'OVC...
2          [{'cov': 'BKN', 'base_ft': 3500}, {'cov': 'BKN...
3                          [{'cov': 'SCT', 'base_ft': 2300}]
4                          [{'cov': 'SCT', 'base_ft': 1600}]
                                 ...                        
7731236    [{'cov': 'FEW', 'base_ft': 4800}, {'cov': 'OVC...
7731237                     [{'cov': 'OVC', 'base_ft': 800}]
7731238    [{'cov': 'BKN', 'base_ft': 1200}, {'cov': 'OVC...
7731239                    [{'cov': 'OVC', 'base_ft': 1000}]
7731240                    [{'cov': 'OVC', 'base_ft': 1600}]
Name: sky_layers, Length: 7703157, dtype: object