# Main script to clean wind data at the zip code, monthly level

Modules: N/A <br>
Author: Cornelia Ilin <br>
Email: cilin@wisc.edu <br>
Date created: May 14, 2021 <br>

**Citations (data sources)**

``Wind data:`` 

Download the "MERRA2_100.tavgM_2d_slv_Nx" product, which provides monthly averages of the U and V wind components:

1. https://search.earthdata.nasa.gov/search/granules?p=C1276812859-GES_DISC&pg[0][qt]=1991-01-01T00%3A00%3A00.000Z%2C2017-12-31T23%3A59%3A59.999Z&pg[0][gsk]=-start_date&q=MERRA-2%20tavgM&tl=1624239533!3!!&m=-0.0703125!0.0703125!2!1!0!0%2C2

The data dictionary is available here:

2. https://gmao.gsfc.nasa.gov/pubs/docs/Bosilovich785.pdf
3. https://disc.gsfc.nasa.gov/datasets/M2T1NXSLV_5.12.4/summary


``Shapefiles for California ZIP codes (2010 census):``

4. https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2010&layergroup=ZIP+Code+Tabulation+Areas

``Installation errors with Geopandas:``

5. https://stackoverflow.com/questions/54734667/error-installing-geopandas-a-gdal-api-version-must-be-specified-in-anaconda

``How to compute wind speed and direction:``

6. https://stackoverflow.com/questions/21484558/how-to-calculate-wind-direction-from-u-and-v-wind-components-in-r
7. https://github.com/blaylockbk/Ute_WRF/blob/master/functions/wind_calcs.py

``Wind speed and direction intuition:``

8. http://colaweb.gmu.edu/dev/clim301/lectures/wind/wind-uv
9. https://www.earthdatascience.org/courses/use-data-open-source-python/intro-vector-data-python/spatial-data-vector-shapefiles/intro-to-coordinate-reference-systems-python/

``To create maps of this wind data (also used to provide intuition for wind direction and wind speed):``

10. https://disc.gsfc.nasa.gov/information/howto?title=How%20to%20calculate%20and%20plot%20wind%20speed%20using%20MERRA-2%20wind%20component%20data%20using%20Python



**Citations (persons)**
1. N/A

**Preferred environment**
1. Code written in Jupyter Notebooks

### Step 1: Import packages

In [None]:
!pip install cartopy geopandas osmnx

In [None]:
import pandas as pd
import numpy as np
import netCDF4 as ncdf
import os
from datetime import date, timedelta
from math import pi

import matplotlib.pyplot as plt
import cartopy.crs as ccrs
from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER
import matplotlib.ticker as mticker

# geography
import geopandas as gpd
import osmnx as ox
import shapely
from shapely.geometry import Point
import sklearn.neighbors
dist = sklearn.neighbors.DistanceMetric.get_metric(
    'haversine'
)

# ignore warnings
import warnings
warnings.filterwarnings(
    'ignore'
)

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
data_path = '/content/gdrive/MyDrive/Classes/W210_capstone/data'

os.chdir(data_path)
os.listdir(data_path)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


['merra2_data', 'tl_2010_06_zcta510', 'tl_2010_06_zcta510.zip']

### Step 2: Define working directories

In [None]:
!ls merra2_data

2000-2004  2001  2002  2005-2009  2010-2019  2015-2019


In [None]:
# in_dir_zip_shapes = 'C:/Users/cilin/Research/CA_hospitals/Input/raw_data/census_geo/shapefiles_zcta/'
# in_dir = 'C:/Users/cilin/Research/CA_hospitals/Input/raw_data/winds/'
# in_health = 'C:/Users/cilin/Research/CA_hospitals/Input/final_data/health/'
# out_dir = 'C:/Users/cilin/Research/CA_hospitals/Input/final_data/winds/'

in_dir_zip_shapes = 'tl_2010_06_zcta510/'
in_dir = 'merra2_data/2002/'
in_health = 'C:/Users/cilin/Research/CA_hospitals/Input/final_data/health/'
out_dir = 'out_dir/'

### Step 3: Define functions

``read_clean wind``

In [None]:
!ls merra2_data/2002/

MERRA2_300.tavgM_2d_slv_Nx.200201.nc4  MERRA2_300.tavgM_2d_slv_Nx.200207.nc4
MERRA2_300.tavgM_2d_slv_Nx.200202.nc4  MERRA2_300.tavgM_2d_slv_Nx.200208.nc4
MERRA2_300.tavgM_2d_slv_Nx.200203.nc4  MERRA2_300.tavgM_2d_slv_Nx.200209.nc4
MERRA2_300.tavgM_2d_slv_Nx.200204.nc4  MERRA2_300.tavgM_2d_slv_Nx.200210.nc4
MERRA2_300.tavgM_2d_slv_Nx.200205.nc4  MERRA2_300.tavgM_2d_slv_Nx.200211.nc4
MERRA2_300.tavgM_2d_slv_Nx.200206.nc4  MERRA2_300.tavgM_2d_slv_Nx.200212.nc4
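
The granule names listed above carry the month as the third dot-separated field, which is the piece `read_clean_wind` later stores as `year_month`. A minimal sketch of that parsing, using only the standard library; the helper name `parse_year_month` is hypothetical and not part of this notebook:

```python
from datetime import date

def parse_year_month(filename):
    """Return the first day of the month encoded in a MERRA-2 monthly granule name."""
    # e.g. 'MERRA2_300.tavgM_2d_slv_Nx.200201.nc4' -> '200201'
    stamp = filename.split('.')[2]
    if len(stamp) != 6 or not stamp.isdigit():
        raise ValueError(f'unexpected granule name: {filename!r}')
    return date(int(stamp[:4]), int(stamp[4:]), 1)

first_day = parse_year_month('MERRA2_300.tavgM_2d_slv_Nx.200201.nc4')
print(first_day.isoformat())
```

Validating the field before using it guards against stray files in the directory that happen to start with `MERRA2`.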


In [None]:
data = ncdf.Dataset('merra2_data/2002/MERRA2_300.tavgM_2d_slv_Nx.200201.nc4', mode='r')

In [None]:
data

<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    History: Original file generated: Thu Jun 25 07:04:44 2015 GMT
    Filename: MERRA2_300.tavgM_2d_slv_Nx.200201.nc4
    Comment: GMAO filename: d5124_m2_jan00.tavg1_2d_slv_Nx.monthly.200201.nc4
    Conventions: CF-1
    Institution: NASA Global Modeling and Assimilation Office
    References: http://gmao.gsfc.nasa.gov
    Format: NetCDF-4/HDF-5
    SpatialCoverage: global
    VersionID: 5.12.4
    TemporalRange: 1980-01-01 -> 2016-12-31
    identifier_product_doi_authority: http://dx.doi.org/
    ShortName: M2TMNXSLV
    RangeBeginningDate: 2002-01-01
    RangeEndingDate: 2002-01-31
    GranuleID: MERRA2_300.tavgM_2d_slv_Nx.200201.nc4
    ProductionDateTime: Original file generated: Thu Jun 25 07:04:44 2015 GMT
    LongName: MERRA2 tavg1_2d_slv_Nx: 2d,1-Hourly,Time-Averaged,Single-Level,Assimilation,Single-Level Diagnostics Monthly Mean
    Title: MERRA2 tavg1_2d_slv_Nx: 2d,1-Hourly,Time-Averaged,S

In [None]:
lons = data.variables['lon']
lats = data.variables['lat']
# 2-meter eastward wind m/s
U2M = data.variables['U2M']
# 2-meter northward wind m/s
V2M = data.variables['V2M']

# Replace vals #
################
# replace _FillValue entries with NaNs:
U2M_nans = U2M[:]
V2M_nans = V2M[:]
_FillValueU2M = U2M._FillValue
_FillValueV2M = V2M._FillValue
U2M_nans[U2M_nans == _FillValueU2M] = np.nan
V2M_nans[V2M_nans == _FillValueV2M] = np.nan

# Add new vars #
################
# calculate wind speed
wspd = np.sqrt(U2M_nans**2+V2M_nans**2)

# calculate wind direction in radians
wdir = np.arctan2(V2M_nans, U2M_nans)

# transform wind direction from radians to degrees
# alternative: direction the wind is blowing FROM ('meteorological convention')
#dir_to_degrees = np.mod(180+np.rad2deg(np.arctan2(V2M_nans, U2M_nans)), 360)
# direction the wind is blowing TOWARDS ('oceanographic convention'), as an angle measured
# counterclockwise from east (0 = east, 90 = north), see here: https://www.esri.com/arcgis-blog/products/product/analytics/displaying-speed-and-direction-symbology-from-u-and-v-vectors/
wdir_to_degrees = np.mod(np.rad2deg(wdir), 360)


## transform to df ##
#####################
# create an empty df for wind speed and direction with size len(lats) x len(lons) 
df_wdir = pd.DataFrame(index=lats[:], columns=lons[:])   
df_wspd = pd.DataFrame(index=lats[:], columns=lons[:])

# create an empty df for u and v components with size len(lats) x len(lons) 
df_u = pd.DataFrame(index=lats[:], columns=lons[:])
df_v = pd.DataFrame(index=lats[:], columns=lons[:])
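
To make the direction convention concrete, here is a small scalar sketch using only the standard library. It computes the speed, the angle produced by `arctan2(v, u)` (the direction the wind blows towards, counterclockwise from east), and, as an assumption not used anywhere in this notebook, the compass bearing the wind blows from (clockwise from north). The function name `wind_from_uv` is hypothetical. The last call uses the first row of the wind output shown in Step 4 (u = 0.594468, v = -3.938878), for which the notebook reports wspd of about 3.98 and wdir of about 278.58.

```python
import math

def wind_from_uv(u, v):
    """Return (speed, math_angle_deg, met_from_deg) for eastward u and northward v."""
    speed = math.hypot(u, v)
    # angle of the vector the wind blows TOWARDS, counterclockwise from east
    math_angle = math.degrees(math.atan2(v, u)) % 360
    # compass bearing the wind blows FROM, clockwise from north (assumed convention)
    met_from = (270 - math_angle) % 360
    return speed, math_angle, met_from

for u, v in [(1, 0), (0, 1), (-1, 0), (3, 4), (0.594468, -3.938878)]:
    print((u, v), wind_from_uv(u, v))
```

A wind with u = 1 and v = 0 blows towards the east, so its mathematical angle is 0 while its compass "from" bearing is 270 (a westerly wind); keeping both numbers side by side helps avoid mixing the two conventions downstream.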

In [None]:
df_u

Unnamed: 0,-180.000,-179.375,-178.750,-178.125,-177.500,-176.875,-176.250,-175.625,-175.000,-174.375,...,173.750,174.375,175.000,175.625,176.250,176.875,177.500,178.125,178.750,179.375
-90.0,,,,,,,,,,,...,,,,,,,,,,
-89.5,,,,,,,,,,,...,,,,,,,,,,
-89.0,,,,,,,,,,,...,,,,,,,,,,
-88.5,,,,,,,,,,,...,,,,,,,,,,
-88.0,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88.0,,,,,,,,,,,...,,,,,,,,,,
88.5,,,,,,,,,,,...,,,,,,,,,,
89.0,,,,,,,,,,,...,,,,,,,,,,
89.5,,,,,,,,,,,...,,,,,,,,,,


In [None]:
x = data.variables['TOX']
x[:]

masked_array(
  data=[[[0.00602853, 0.00602853, 0.00602853, ..., 0.00602853,
          0.00602853, 0.00602853],
         [0.00601954, 0.0060196 , 0.00601966, ..., 0.00601936,
          0.00601942, 0.00601947],
         [0.00600773, 0.0060078 , 0.00600788, ..., 0.00600753,
          0.00600759, 0.00600766],
         ...,
         [0.00787354, 0.00787383, 0.00787411, ..., 0.00787264,
          0.00787295, 0.00787325],
         [0.00784678, 0.00784694, 0.0078471 , ..., 0.0078463 ,
          0.00784646, 0.00784663],
         [0.00782271, 0.00782271, 0.00782271, ..., 0.00782271,
          0.00782271, 0.00782271]]],
  mask=False,
  fill_value=1e+20,
  dtype=float32)

In [None]:
x = data.variables['CLDPRS']
x[:]

masked_array(
  data=[[[55455.671875, 55455.671875, 55455.671875, ..., 55455.671875,
          55455.671875, 55455.671875],
         [55286.66796875, 55287.01953125, 55287.12890625, ...,
          55285.4921875, 55285.8125, 55286.15234375],
         [55697.9140625, 55691.96875, 55687.15625, ..., 55713.31640625,
          55708.07421875, 55703.109375],
         ...,
         [68224.9609375, 68235.5078125, 68245.90625, ..., 68197.515625,
          68206.078125, 68215.4921875],
         [68773.828125, 68778.6328125, 68784.8671875, ...,
          68757.6640625, 68762.859375, 68768.4609375],
         [68158.8828125, 68158.8828125, 68158.8828125, ...,
          68158.8828125, 68158.8828125, 68158.8828125]]],
  mask=[[[False, False, False, ..., False, False, False],
         [False, False, False, ..., False, False, False],
         [False, False, False, ..., False, False, False],
         ...,
         [False, False, False, ..., False, False, False],
         [False, False, False, ..., False,

In [None]:
lons = data.variables['lon']
lons

<class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
    long_name: longitude
    units: degrees_east
    vmax: 1000000000000000.0
    vmin: -1000000000000000.0
    valid_range: [-1.e+15  1.e+15]
unlimited dimensions: 
current shape = (576,)
filling on, default _FillValue of 9.969209968386869e+36 used

In [None]:
U2M = data.variables['U2M']
U2M_nans = U2M[:]
U2M_nans

masked_array(
  data=[[[0.86191344, 0.83215487, 0.8022955 , ..., 0.9505544 ,
          0.9211084 , 0.8915578 ],
         [0.54720265, 0.5206443 , 0.49404448, ..., 0.62672794,
          0.60023975, 0.57373506],
         [0.07566843, 0.05541619, 0.03535309, ..., 0.13770711,
          0.11682366, 0.09615103],
         ...,
         [1.3206918 , 1.3432117 , 1.3658428 , ..., 1.2539389 ,
          1.2760417 , 1.2983309 ],
         [1.3445549 , 1.3663588 , 1.3881762 , ..., 1.2793579 ,
          1.3010587 , 1.3227874 ],
         [1.6106691 , 1.6322044 , 1.6537108 , ..., 1.5460885 ,
          1.5676283 , 1.5891727 ]]],
  mask=False,
  fill_value=1e+20,
  dtype=float32)
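
The slices above come back as NumPy masked arrays, and `mask=False` indicates that this granule has no missing cells, so the `_FillValue` replacement is effectively a no-op here. When a variable does contain masked cells, one possible way to get a plain float array with NaNs is `np.ma.filled`. The sketch below uses a small synthetic array (values invented for illustration) rather than the MERRA-2 file:

```python
import numpy as np

# synthetic stand-in for a netCDF4 variable slice (values chosen for illustration)
raw = np.ma.masked_array(
    data=np.array([[0.5, 1e20], [-1.25, 0.25]], dtype=np.float32),
    mask=[[False, True], [False, False]],
    fill_value=1e20,
)

# promote to float64, then replace every masked cell with NaN
clean = np.ma.filled(raw.astype(np.float64), np.nan)
print(type(clean).__name__)
print(clean)
```

The result is a regular `ndarray`, which tends to behave more predictably than a masked array when it is later written into a DataFrame.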

In [None]:
def read_clean_wind():
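    '''Read the monthly MERRA-2 files in in_dir and return a long-format wind df
    params:
    -------
    None (reads the global in_dir)
    
    return:
    -------
    df with lat, lon, u, v, wdir, wspd, year_month, restricted to the California bounding box
    '''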
    # create empty df
    df = pd.DataFrame()

    for file in os.listdir(in_dir):
        if file.startswith('MERRA2'):
            print(file.split('.')[2])

            ## read .nc file ##
            ###################
            data = ncdf.Dataset(
                in_dir + file, mode='r'
            )
            # print metadata
            #print(data)

            # grab vars of interest ##
            ##########################
            # longitude and latitude
            lons = data.variables['lon']
            lats = data.variables['lat']
            # 2-meter eastward wind m/s
            U2M = data.variables['U2M']
            # 2-meter northward wind m/s
            V2M = data.variables['V2M']

            # Replace vals #
            ################
            # replace _FillValue entries with NaNs:
            U2M_nans = U2M[:]
            V2M_nans = V2M[:]
            _FillValueU2M = U2M._FillValue
            _FillValueV2M = V2M._FillValue
            U2M_nans[U2M_nans == _FillValueU2M] = np.nan
            V2M_nans[V2M_nans == _FillValueV2M] = np.nan

            # Add new vars #
            ################
            # calculate wind speed
            wspd = np.sqrt(U2M_nans**2+V2M_nans**2)

            # calculate wind direction in radians
            wdir = np.arctan2(V2M_nans, U2M_nans)
            
            # transform wind direction from radians to degrees
            # alternative: direction the wind is blowing FROM ('meteorological convention')
            #dir_to_degrees = np.mod(180+np.rad2deg(np.arctan2(V2M_nans, U2M_nans)), 360)
            # direction the wind is blowing TOWARDS ('oceanographic convention'), as an angle measured
            # counterclockwise from east (0 = east, 90 = north), see here: https://www.esri.com/arcgis-blog/products/product/analytics/displaying-speed-and-direction-symbology-from-u-and-v-vectors/
            wdir_to_degrees = np.mod(np.rad2deg(wdir), 360)
            
            
            ## transform to df ##
            #####################
            # create an empty df for wind speed and direction with size len(lats) x len(lons) 
            df_wdir = pd.DataFrame(index=lats[:], columns=lons[:])   
            df_wspd = pd.DataFrame(index=lats[:], columns=lons[:])
            
            # create an empty df for u and v components with size len(lats) x len(lons) 
            df_u = pd.DataFrame(index=lats[:], columns=lons[:])
            df_v = pd.DataFrame(index=lats[:], columns=lons[:])

            # populate each row in the empty df above with the wdir_meteo and wspd data and u and v components
            for idx, idx_val in enumerate(df_wdir.index):
                df_wdir.loc[idx_val, :] = wdir_to_degrees[0][idx]
                df_wspd.loc[idx_val, :] = wspd[0][idx]
                df_u.loc[idx_val, :] = U2M_nans[0][idx]
                df_v.loc[idx_val, :] = V2M_nans[0][idx]

            # add index (latitude) as column
            df_wdir.reset_index(
                drop=False,
                inplace=True
            )
            
            df_wdir.rename(
                columns={'index':'lat'},
                inplace=True
            )
            
            
            df_wspd.reset_index(
                drop=False,
                inplace=True
            )
            
            df_wspd.rename(
                columns={'index':'lat'},
                inplace=True
            )
            
            df_u.reset_index(
                drop=False,
                inplace=True
            )
            
            df_u.rename(
                columns={'index':'lat'},
                inplace=True
            )
            
            df_v.reset_index(
                drop=False,
                inplace=True
            )
            
            df_v.rename(
                columns={'index':'lat'},
                inplace=True
            )

            # transform from wide to long
            df_wdir = pd.melt(
                df_wdir, id_vars='lat',
                var_name='lon',
                value_vars=lons[:],
                value_name='wdir'
            )
            
            df_wspd = pd.melt(
                df_wspd,
                id_vars='lat',
                var_name='lon',
                value_vars=lons[:],
                value_name='wspd'
            )
            
            df_u = pd.melt(
                df_u, id_vars='lat',
                var_name='lon',
                value_vars=lons[:],
                value_name='u'
            )
            
            df_v = pd.melt(
                df_v, id_vars='lat',
                var_name='lon',
                value_vars=lons[:],
                value_name='v'
            )

            # merge df_wdir and df_wspd on (lat, lon)
            df_temp1 = df_wdir.merge(
                df_wspd,
                on=['lat', 'lon'],
                how='left'
            )
            
            # merge df_u and df_v on (lat, lon)
            df_temp2 = df_u.merge(
                df_v,
                on=['lat', 'lon'],
                how='left'
            )
            
            # merge df_temp1 and df_temp2 on (lat, lon)
            df_temp = df_temp2.merge(
                df_temp1,
                on=['lat', 'lon'],
                how='left'
            )
            
            # add time stamp 
            df_temp['year_month'] = file.split('.')[2]

            df = pd.concat(
                [df_temp, df],
                axis=0
            )
   
    # keep values in min, max range of California geometry
    df = df[
        df.lon.ge(-125) & df.lon.le(-115) & df.lat.ge(32) & df.lat.le(42)
    ]
    
    # transform vars
    df['lat'] = df.lat.astype(float)
    df['lon'] = df.lon.astype(float)
    
    return df
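
The wide-to-long step in `read_clean_wind` builds one DataFrame per variable, fills it row by row, and melts it. As a possible alternative, the same long layout can be produced in one pass with `np.meshgrid`. This is only a sketch on a tiny synthetic grid; `grid_to_long` is a hypothetical helper and the row order may differ from the `pd.melt` output:

```python
import numpy as np
import pandas as pd

def grid_to_long(lats, lons, **fields):
    """Flatten 2-D (lat, lon) grids into one long DataFrame, one row per grid cell."""
    # indexing='ij' keeps axis 0 = latitude and axis 1 = longitude
    lat_grid, lon_grid = np.meshgrid(lats, lons, indexing='ij')
    columns = {'lat': lat_grid.ravel(), 'lon': lon_grid.ravel()}
    for name, grid in fields.items():
        grid = np.asarray(grid, dtype=float)
        if grid.shape != lat_grid.shape:
            raise ValueError(f'{name} has shape {grid.shape}, expected {lat_grid.shape}')
        columns[name] = grid.ravel()
    return pd.DataFrame(columns)

# synthetic 2 x 3 grid (values invented for illustration)
lats = np.array([32.0, 32.5])
lons = np.array([-125.0, -124.375, -123.75])
u = np.arange(6, dtype=float).reshape(2, 3)
long_df = grid_to_long(lats, lons, u=u, v=-u)
print(long_df)
```

With the MERRA-2 arrays, the inputs would be the first time slice of each variable, for example `U2M_nans[0]`, since the monthly files hold a single time step.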

``read census geom``

In [None]:
!ls 

tl_2010_06_zcta510.dbf	tl_2010_06_zcta510.shp	    tl_2010_06_zcta510.shx
tl_2010_06_zcta510.prj	tl_2010_06_zcta510.shp.xml


In [None]:
!ls {in_dir_zip_shapes}


In [None]:
df_gdf = gpd.read_file('tl_2010_06_zcta510/tl_2010_06_zcta510.shp')
#df_gdf['ZCTA5CE10'].value_counts()
df_gdf.shape

(1769, 12)

In [None]:

df_gdf

Unnamed: 0,STATEFP10,ZCTA5CE10,GEOID10,CLASSFP10,MTFCC10,FUNCSTAT10,ALAND10,AWATER10,INTPTLAT10,INTPTLON10,PARTFLG10,geometry
0,06,94601,0694601,B5,G6350,S,8410939,310703,+37.7755447,-122.2187049,N,"POLYGON ((-122.22717 37.79197, -122.22693 37.7..."
1,06,94501,0694501,B5,G6350,S,20539466,9005303,+37.7737968,-122.2781230,N,"POLYGON ((-122.29181 37.76301, -122.30661 37.7..."
2,06,94560,0694560,B5,G6350,S,35757865,60530,+37.5041413,-122.0323587,N,"POLYGON ((-122.05499 37.54959, -122.05441 37.5..."
3,06,94587,0694587,B5,G6350,S,51075108,0,+37.6031556,-122.0186382,N,"POLYGON ((-122.06515 37.60485, -122.06499 37.6..."
4,06,94580,0694580,B5,G6350,S,8929836,17052,+37.6757312,-122.1330170,N,"POLYGON ((-122.12999 37.68445, -122.12995 37.6..."
...,...,...,...,...,...,...,...,...,...,...,...,...
1764,06,95375,0695375,B5,G6350,S,7889388,102350,+38.1865950,-120.0262348,N,"POLYGON ((-120.00768 38.18764, -120.00771 38.1..."
1765,06,95627,0695627,B5,G6350,S,133169251,804522,+38.7345127,-122.0253459,N,"POLYGON ((-122.06793 38.64914, -122.06818 38.6..."
1766,06,95607,0695607,B5,G6350,S,347705376,655867,+38.8344702,-122.1271651,N,"POLYGON ((-122.23848 38.89009, -122.23844 38.8..."
1767,06,95919,0695919,B5,G6350,S,64005512,0,+39.4328816,-121.2611427,N,"POLYGON ((-121.30793 39.40862, -121.30805 39.4..."


In [None]:
def read_census_geom():
    """ Read Census (lat, lon) coordinates for California zip-codes
    parameters:
    -----------
    None
    
    return:
    -------
    Df with one row per unique polygon boundary vertex: lat, lon, ZCTA10, GEOID10
    """
    ### Step 1 ### 
    ##############
    # Read the shapefiles for California's ZIP codes
    for file in os.listdir(in_dir_zip_shapes):
        if file.endswith('.shp'):
            gdf = gpd.read_file(in_dir_zip_shapes + file)

    # keep only cols of interest 
    # ('ZCTA5CE10' = 2010 Census ZCTA (ZIP) code, 'GEOID10' = state FIPS code + ZCTA code)
    gdf = gdf[
        ['ZCTA5CE10',
         'GEOID10',
         'geometry']
    ]
    
    
    ### Step 2 ###
    ###############
    # For each zip code extract polygon with (lat, lon) info

    zip_poly = pd.DataFrame()

    for idx, multipoly in enumerate(gdf.geometry):
        if isinstance(multipoly, shapely.geometry.polygon.Polygon):
            temp_df = pd.DataFrame(
                {
                    'lat': multipoly.exterior.coords.xy[1], 
                    'lon': multipoly.exterior.coords.xy[0],
                    'ZCTA10': gdf.loc[idx, 'ZCTA5CE10'],
                    'GEOID10': gdf.loc[idx, 'GEOID10']
                }
            )
            zip_poly = pd.concat(
                [zip_poly, temp_df],
                axis=0
            )

        elif isinstance(multipoly, shapely.geometry.multipolygon.MultiPolygon):
            # iterate over .geoms; iterating a MultiPolygon directly fails in Shapely 2.x
            for poly in multipoly.geoms:
                temp_df = pd.DataFrame(
                    {
                        'lat': poly.exterior.coords.xy[1], 
                        'lon': poly.exterior.coords.xy[0],
                        'ZCTA10': gdf.loc[idx, 'ZCTA5CE10'],
                        'GEOID10': gdf.loc[idx, 'GEOID10']
                    }
                )
                zip_poly = pd.concat(
                    [zip_poly, temp_df],
                    axis=0
                )   
    

    # round (lat, lon) to 3 decimal places
    zip_poly['lat'] = zip_poly.lat.round(3)
    zip_poly['lon'] = zip_poly.lon.round(3)
    
    zip_poly.sort_values(
        by=['ZCTA10', 'lat', 'lon'],
        inplace=True
    )
    
    zip_poly.drop_duplicates(
        subset=['ZCTA10', 'lat', 'lon'],
        inplace=True
    )

    zip_poly.reset_index(
        drop=True,
        inplace=True
    )
    
    return zip_poly

``find zip (zcta) code for wind data``

In [None]:
def add_zcta_to_wind(df1, df2):
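    '''Assign to each ZCTA the wind values of the MERRA-2 grid point nearest to one of its boundary vertices
    params:
    -------
    df1: wind data
    df2: census geometry data
    
    return:
    -------
    df with lat, lon, ZCTA10, u, v, wdir, wspd, year_month (one row per ZCTA and year_month)
    '''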
    
    # create labels
    df1['wind_lat_lon'] = [str(xy) for xy in zip(df1.lat, df1.lon)]
    df2['census_lat_lon'] = [str(xy) for xy in zip(df2.lat, df2.lon)]

    ## for each point in wind data find the nearest point in the census data ##
    ###############
    # keep only unique points in wind data
    df1_unique = df1.drop_duplicates(
        ['wind_lat_lon']
    )
    
    df2_unique = df2.drop_duplicates(
        ['census_lat_lon']
    )
    
    df1_unique.reset_index(
        drop=True,
        inplace=True
    )
    
    df2_unique.reset_index(
        drop=True,
        inplace=True
    )

    # transform to radians
    df1_unique['lat_r'] = np.radians(df1_unique.lat)
    df1_unique['lon_r'] = np.radians(df1_unique.lon)
    df2_unique['lat_r'] = np.radians(df2_unique.lat)
    df2_unique['lon_r'] = np.radians(df2_unique.lon)


    # compute pairwise distance (in miles)
    dist_matrix = (dist.pairwise(
        df2_unique[['lat_r', 'lon_r']],
        df1_unique[['lat_r', 'lon_r']]
    ))*3959

    # create a df from dist_matrix
    dist_matrix = pd.DataFrame(
        dist_matrix,
        index=df2_unique['census_lat_lon'],
        columns=df1_unique['wind_lat_lon']
    )
    
    # for each row (census_lat_lon point) extract the closest column (wind_lat_lon point) 
    closest_point = pd.DataFrame(
        dist_matrix.idxmin(axis=1),
        columns=['closest_wind_lat_lon']
    )
    
    closest_point.reset_index(
        drop=False,
        inplace=True
    )

    # merge with census data
    df2_unique = df2_unique.merge(
        closest_point,
        on='census_lat_lon',
        how='left'
    )
    
    # re-expand df2_unique to the row count of df2 (a vertex shared by several ZCTAs is repeated)
    df2_unique = df2_unique.merge(
        df2[['census_lat_lon']],
        on=['census_lat_lon'],
        how='left'
    )
    
    # replicate df2_unique once per year_month in df1 and label each copy
    # (avoids the hard-coded block size and column position)
    year_month = np.sort(df1.year_month.unique())
    df2_unique = pd.concat(
        [df2_unique.assign(year_month=ym) for ym in year_month],
        axis=0
    )
    
    df2_unique.reset_index(
        drop=True,
        inplace=True
    )
            
            
    # from df1 keep only cols of interest
    df1 = df1[
        ['year_month',
         'u',
         'v',
         'wdir',
         'wspd',
         'wind_lat_lon']
    ]
    
    # merge df2_unique with df1
    df2_unique = df2_unique.merge(
        df1,
        left_on=['year_month', 'closest_wind_lat_lon'],
        right_on=['year_month', 'wind_lat_lon'],
        how='left'
    )
    # keep only cols of interest
    df2_unique = df2_unique[
        ['lat',
         'lon',
         'ZCTA10',
         'u',
         'v',
         'wdir',
         'wspd',
         'year_month']
    ]
    
    df2_unique.dropna(
        inplace=True
    )
    
    df2_unique.reset_index(
        drop=True,
        inplace=True
    )
    
    df2_unique.drop_duplicates(
    ['year_month', 'ZCTA10'],
    inplace=True
    )

    df2_unique.reset_index(
        drop=True,
        inplace=True
    )
    
    return df2_unique
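
`add_zcta_to_wind` relies on scikit-learn's haversine metric, which expects (lat, lon) pairs in radians and returns distances on the unit sphere, hence the multiplication by 3959 to get miles. The sketch below reproduces the same idea with the standard library only, for a single census vertex and three grid points; the helper names are hypothetical and the grid points are illustrative values on a 0.5 x 0.625 degree grid:

```python
import math

EARTH_RADIUS_MILES = 3959  # same constant the notebook multiplies by

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def nearest_grid_point(point, grid_points):
    """Return the grid point with the smallest haversine distance to `point`."""
    return min(grid_points, key=lambda g: haversine_miles(point[0], point[1], g[0], g[1]))

grid = [(37.0, -118.125), (37.5, -118.125), (37.5, -117.5)]
vertex = (37.465, -117.936)
print(nearest_grid_point(vertex, grid))
```

For the full data set the pairwise matrix in the notebook does the same comparison for every vertex at once; a tree-based nearest-neighbour search would be another option if memory becomes a constraint.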

### Step 4: Read data

``wind``

In [None]:
df = read_clean_wind()
df.head(2)

200212
200211
200210
200209
200208
200207
200206
200205
200204
200203
200202
200201


Unnamed: 0,lat,lon,u,v,wdir,wspd,year_month
32012,32.0,-125.0,0.594468,-3.938878,278.582489,3.983485,200201
32013,32.5,-125.0,0.824193,-3.972357,281.721558,4.056959,200201


``census geom``

In [None]:
zip_poly = read_census_geom()
zip_poly.head(2)

Unnamed: 0,lat,lon,ZCTA10,GEOID10
0,37.465,-117.936,89010,689010
1,37.465,-117.935,89010,689010


### Step 5: Find zip (zcta) code for wind data

In [None]:
df_final = add_zcta_to_wind(df, zip_poly)
df_final.head(2)

Unnamed: 0,lat,lon,ZCTA10,u,v,wdir,wspd,year_month
0,37.465,-117.936,89010,0.969241,-0.321199,341.665192,1.021076,200201
1,35.396,-116.322,89019,0.164038,-0.60478,285.175537,0.626631,200201


### Step 6: Export data

In [None]:
df_final.to_parquet(os.path.join('merra2_data/output', 'wind_2002.parquet'))

In [None]:
df_final['wdir'] = df_final.wdir.astype('float')

In [None]:
df2 = df_final[df_final['year_month'] == '200201']
df2

Unnamed: 0,lat,lon,ZCTA10,u,v,wdir,wspd,year_month
0,37.465,-117.936,89010,0.969241,-0.321199,341.665192,1.021076,200201
1,35.396,-116.322,89019,0.164038,-0.60478,285.175537,0.626631,200201
2,36.161,-116.139,89060,-0.177593,-0.478777,249.648666,0.510653,200201
3,35.957,-115.897,89061,-0.04495,-0.61467,265.817474,0.616311,200201
4,39.520,-120.032,89439,0.361939,0.456577,51.595329,0.582634,200201
...,...,...,...,...,...,...,...,...
1628,39.149,-120.248,96146,0.210386,0.539701,68.703171,0.579257,200201
1629,39.236,-120.062,96148,0.210386,0.539701,68.703171,0.579257,200201
1630,38.732,-120.033,96150,-0.068072,0.280101,103.659683,0.288254,200201
1631,39.184,-120.427,96161,-0.189069,0.143775,142.749359,0.237526,200201


In [None]:
df_final.wdir.describe()

count    19596.000000
mean       183.371045
std        134.578489
min          0.074351
25%         36.403069
50%        214.252823
75%        324.572968
max        359.919220
Name: wdir, dtype: float64
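
One caveat when reading the `mean` and `std` rows above: `wdir` is an angle, so an arithmetic mean treats 350 and 10 degrees as far apart even though they are 20 degrees from each other. A common alternative, shown here as a sketch and not something this notebook computes, is the circular mean obtained by averaging the unit vectors; `circular_mean_deg` is a hypothetical helper:

```python
import math

def circular_mean_deg(angles_deg):
    """Mean direction in degrees [0, 360), computed from summed unit vectors."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360

sample = [350.0, 10.0]
arithmetic = sum(sample) / len(sample)   # 180.0, pointing the opposite way
circular = circular_mean_deg(sample)     # close to 0 (equivalently 360)
print(arithmetic, circular)
```

The same concern applies to any later aggregation of `wdir`; averaging `u` and `v` first and deriving the direction afterwards is one way to sidestep it.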

In [None]:
df_final.wdir.describe()

count    599311.000000
mean        177.887713
std         133.092457
min           0.001104
25%          40.066475
50%         174.607162
75%         319.594696
max         359.999634
Name: wdir, dtype: float64

In [None]:
df_out = pd.read_parquet('merra2_data/output')

In [None]:
df_out

Unnamed: 0,lat,lon,ZCTA10,u,v,wdir,wspd,year_month
0,37.465,-117.936,89010,0.969241,-0.321199,341.665192,1.021076,200201
1,35.396,-116.322,89019,0.164038,-0.604780,285.175537,0.626631,200201
2,36.161,-116.139,89060,-0.177593,-0.478777,249.648666,0.510653,200201
3,35.957,-115.897,89061,-0.044950,-0.614670,265.817474,0.616311,200201
4,39.520,-120.032,89439,0.361939,0.456577,51.595329,0.582634,200201
...,...,...,...,...,...,...,...,...
176359,39.061,-120.210,96145,0.741713,1.044104,54.610687,1.280738,200212
176360,39.149,-120.248,96146,0.741713,1.044104,54.610687,1.280738,200212
176361,39.236,-120.062,96148,0.741713,1.044104,54.610687,1.280738,200212
176362,38.732,-120.033,96150,0.192210,0.466555,67.609436,0.504597,200212
