# Preprocess Data

Before running any analysis, let's clip the data to the contiguous United States using a [shapefile from the US Census](https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html) (2024 > 1 : 500,000 > cb_2024_us_state_500k > cb_2024_us_state_500k.shp). The state-level file is the one loaded in the code below, and it carries the `STUSPS` column used for filtering. We can then process the data as follows:

In [1]:
# import necessary packages
import geopandas as geop
import xarray as xarr
import rioxarray as riox

# define file names
dataset = "rough_CONUS_precipitation_temperature_1988.nc"
shapefile = "./data/cb_2024_us_state_500k/cb_2024_us_state_500k.shp"

# load datasets
roughData = xarr.open_dataset(dataset)
conus_shapefile = geop.read_file(shapefile)

# filter to the contiguous US using STUSPS, the two-letter postal abbreviation for states and territories
conusSeparate = conus_shapefile[~conus_shapefile['STUSPS'].isin(['AK', 'HI', 'GU', 'MP', 'PR', 'VI', 'AS'])]

# dissolve to make continuous boundary
conus_boundary = conusSeparate.dissolve()

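# name the spatial dimensions, then attach the boundary's coordinate reference system to the dataset
# (write_crs only records the CRS on the data; it does not reproject it)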
roughData = roughData.rio.set_spatial_dims(x_dim="longitude", y_dim='latitude')
cleanData = roughData.rio.write_crs(conus_boundary.crs)


#### Clipping (GeoPandas to GeoJSON), Saving

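Next, we clip the dataset to the border of the contiguous United States, passing the GeoPandas geometries as GeoJSON-like mappings. With `drop=False`, the grid keeps its original extent and cells outside the boundary are masked (set to NaN) rather than removed. Note that `rio.clip` by default keeps only cells whose centers fall inside the polygon; to also keep cells that merely overlap the border, pass `all_touched=True`.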

In [2]:
from shapely.geometry import mapping

cleanData_clipped = cleanData.rio.clip(conus_boundary.geometry.apply(mapping), conus_boundary.crs, drop=False)
cleanData_clipped.to_netcdf("CONUS_clipped_precipitation_temperature_1988.nc")
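
The effect of the clip options can be illustrated with a toy, pure-Python sketch. This is not how rioxarray is implemented: the boundary here is a hypothetical axis-aligned rectangle, and the helper names (`center_inside`, `touches`, `mask_grid`, `count_kept`) are made up for illustration. It contrasts the "cell center inside" rule (roughly the `all_touched=False` default) with the "any overlap" rule (`all_touched=True`), and keeps the full grid shape with NaN outside, as `drop=False` does.

```python
# Toy illustration of raster clipping rules; NOT rioxarray's implementation.
# The "boundary" is an axis-aligned rectangle (xmin, ymin, xmax, ymax).
NAN = float("nan")

def center_inside(cx, cy, box):
    """True if the cell center lies inside the box (like all_touched=False)."""
    xmin, ymin, xmax, ymax = box
    return xmin <= cx <= xmax and ymin <= cy <= ymax

def touches(cx, cy, half, box):
    """True if any part of the square cell overlaps the box (like all_touched=True)."""
    xmin, ymin, xmax, ymax = box
    return (cx - half < xmax and cx + half > xmin
            and cy - half < ymax and cy + half > ymin)

def mask_grid(values, xs, ys, box, all_touched=False):
    """Keep the grid shape and set outside cells to NaN (like drop=False)."""
    half = abs(xs[1] - xs[0]) / 2  # half a cell width, assuming a regular square grid
    masked = []
    for row, cy in zip(values, ys):
        new_row = []
        for value, cx in zip(row, xs):
            if all_touched:
                keep = touches(cx, cy, half, box)
            else:
                keep = center_inside(cx, cy, box)
            new_row.append(value if keep else NAN)
        masked.append(new_row)
    return masked

def count_kept(grid):
    """Count non-NaN cells (NaN is the only value not equal to itself)."""
    return sum(v == v for row in grid for v in row)

xs = [0.5, 1.5, 2.5, 3.5]          # cell-center x coordinates
ys = [3.5, 2.5, 1.5, 0.5]          # cell-center y coordinates, north to south
values = [[1.0] * len(xs) for _ in ys]  # dummy data
box = (0.8, 0.8, 3.2, 3.2)         # hypothetical boundary

strict = mask_grid(values, xs, ys, box)                   # centers inside only
loose = mask_grid(values, xs, ys, box, all_touched=True)  # any overlap counts

print(count_kept(strict), count_kept(loose))
```

In this toy grid the edge cells overlap the rectangle but their centers lie outside it, so the strict rule masks them while the overlap rule keeps them. Both results still have the original 4 by 4 shape.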

Let's now make a heatmap of the temperature on the first of August to check that the preprocessing worked.

In [None]:
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature

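# select August, then the first time step at the single pressure level
august_data = cleanData_clipped.where(cleanData_clipped.valid_time.dt.month == 8, drop=True)
august_temp = august_data['t'].isel(valid_time=0, pressure_level=0)  # 2-D (latitude, longitude), in Kelvin

# draw the heatmap on a latitude/longitude map (a sketch; the styling choices are assumptions)
ax = plt.axes(projection=ccrs.PlateCarree())
august_temp.plot(ax=ax, transform=ccrs.PlateCarree(), cbar_kwargs={"label": "Temperature (K)"})
ax.add_feature(cfeature.STATES, linewidth=0.5)
ax.coastlines()
plt.show()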




<xarray.Dataset> Size: 3MB
Dimensions:         (valid_time: 31, pressure_level: 1, latitude: 105,
                     longitude: 245)
Coordinates:
    number          int64 8B 0
  * pressure_level  (pressure_level) float64 8B 1.0
  * latitude        (latitude) float64 840B 50.0 49.75 49.5 ... 24.5 24.25 24.0
  * longitude       (longitude) float64 2kB -126.0 -125.8 ... -65.25 -65.0
  * valid_time      (valid_time) datetime64[ns] 248B 1988-08-01 ... 1988-08-31
    spatial_ref     int64 8B 0
Data variables:
    t               (valid_time, pressure_level, latitude, longitude) float32 3MB ...
Attributes:
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2025-08-05T23:30 GRIB to CDM+CF via cfgrib-0.9.1...
