# NASS Statistics by Federal Reserve Bank District

Which Federal Reserve Bank district has the most farmland? What commodities generate the most income in each district? Which district sells the most llamas? Using USDA 2022 Ag Census data and a nifty shapefile compiled by Colton Tousey of the Federal Reserve Bank of Kansas City, we can answer all of these questions, and more.

## 1. Gathering Data

***DON'T RUN, it will take 3.5 hrs with good wifi***

First, we need an API key from the USDA to query the NASS database. The key is read from an untracked file (loaded with `load_dotenv()`), so it is not in the GitHub repository for this project; anyone who wants to run this code must request their own key from the USDA.
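A minimal sketch of reading the key and failing early when it is missing. The variable name `NASS_api_key` matches the query cell in this notebook; the helper `get_nass_key` and its error message are hypothetical. It reads from a plain mapping so it can be tried without a real key; in the notebook, `load_dotenv()` would populate `os.environ` first.

```python
import os


def get_nass_key(env=None, name="NASS_api_key"):
    """Return the NASS API key from `env` (defaults to os.environ).

    Hypothetical helper, not part of the notebook.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; request a key from the USDA first")
    return key


# Example with a stand-in mapping instead of the real environment
print(get_nass_key({"NASS_api_key": "demo-key"}))
```

Raising here is cheaper than discovering a missing key partway into a multi-hour pull.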

FedCounties.csv records the Federal Reserve Bank district for every county in the United States, along with state and county FIPS codes. NASS statistics also include FIPS codes.
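To make the join key concrete: a county is identified by a 2-digit state FIPS code plus a 3-digit county FIPS code, both zero-padded. The sketch below assumes the CSV stores them as integers (as the numeric filter on `STATEFP` in the cell that reads the file suggests); `county_fips` is a hypothetical helper name.

```python
def county_fips(state_fp, county_fp):
    """Return zero-padded (state, county) FIPS strings, as used in NASS queries."""
    return str(state_fp).zfill(2), str(county_fp).zfill(3)


# Los Angeles County, California: state 6, county 37
state, county = county_fips(6, 37)
print(state, county)   # 06 037
print(state + county)  # 06037
```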

In [None]:
import polars as pl
import requests as req
import os
from dotenv import load_dotenv
import json
from alive_progress import alive_bar
from time import sleep
from wakepy import keep
import polars.selectors as cs
from great_tables import GT, style, loc

In [None]:
fed_counties_df = pl.read_csv("FedCounties.csv")
fed_counties_df = fed_counties_df.filter(
    (pl.col("STATEFP") != 78) & (pl.col("STATEFP") != 72)
)  # excl. PR and U.S. Virgin Islands
tuples = []

for dist in range(1, 13):
    filtered = fed_counties_df.filter(fed_counties_df["District"] == dist)
    tuples.extend(
        zip(
            filtered["District"].to_list(),
            filtered["STATEFP"].to_list(),
            filtered["COUNTYFP"].to_list(),
        )
    )

tuples = [(t[0], str(t[1]).zfill(2), str(t[2]).zfill(3)) for t in tuples]

Now we have the unique (district, state, county) triples. The next step is to gather ALL 2022 Census data for every county and add a new variable to the USDA data: "District".

***DON'T RUN, continue from 2***

Note: Puerto Rico and the U.S. Virgin Islands (both part of the N.Y. Fed district) are excluded from this pull because querying those state FIPS codes caused trouble.
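For reference, the loop below sends one GET request per county. This sketch builds the equivalent URL with the standard library so the parameters can be inspected without touching the network. `build_county_query` is a hypothetical helper and `DEMO_KEY` is a placeholder, not a working key.

```python
from urllib.parse import urlencode

BASE_URL = "https://quickstats.nass.usda.gov/api/api_GET"


def build_county_query(api_key, state, county, year=2022):
    """Return the Quick Stats URL for one county's Census records."""
    params = {
        "key": api_key,
        "state_fips_code": state,  # zero-padded, e.g. "06"
        "county_code": county,     # zero-padded, e.g. "037"
        "agg_level_desc": "COUNTY",
        "source_desc": "CENSUS",
        "year": year,
        "format": "json",
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_county_query("DEMO_KEY", "06", "037"))
```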

In [None]:
load_dotenv()
url = "https://quickstats.nass.usda.gov/api/api_GET"
api_key = os.getenv("NASS_api_key")

district_dfs = []
with keep.presenting():  # took approx. 3:29 hrs
    for dist in range(1, 13):
        pairs = [
            (state, county) for district, state, county in tuples if district == dist
        ]

        county_dfs = []
        with alive_bar(len(pairs), title="Pairs") as bar:
            for state, county in pairs:
                bar()
                raw = req.get(
                    url,
                    params={
                        "key": api_key,
                        "state_fips_code": state,
                        "county_code": county,
                        "agg_level_desc": "COUNTY",
                        "source_desc": "CENSUS",
                        "year": 2022,
                        "format": "json",
                    },
                ).text
                sleep(2)

                try:
                    content = json.loads(raw)
                except json.decoder.JSONDecodeError as e:
                    print(raw)
                    raise e

                if "error" in content:
                    print(state, county)
                    print(content["error"])
                    continue

                county_df = pl.DataFrame(content["data"])  # reuse the parsed response
                county_df = county_df.select(
                    [pl.col(c) for c in sorted(county_df.columns)]
                )
                county_dfs.append(county_df)

        district_df = pl.concat(county_dfs)
        district_df = district_df.with_columns(pl.lit(dist).alias("District"))
        district_dfs.append(district_df)

NASS_pull = pl.concat(district_dfs)
NASS_pull.write_parquet("NASS_pull.parquet")

Collect data on Puerto Rico in a separate query:

In [None]:
# load_dotenv()
# url = "https://quickstats.nass.usda.gov/api/api_GET"
# api_key = os.getenv("NASS_api_key")

raw = req.get(
    url,
    params={
        "key": api_key,
        "agg_level_desc": "PUERTO RICO & OUTLYING AREAS",
        "state_name": "PUERTO RICO",
        "source_desc": "CENSUS",
        "year": 2022,
        "format": "json",
    },
).text

try:
    content = json.loads(raw)
except json.decoder.JSONDecodeError as e:
    print(raw)
    raise e

pr_df = pl.DataFrame(content["data"])  # reuse the parsed response
pr_df = pr_df.select(
    [pl.col(c) for c in sorted(pr_df.columns)]
)
NASS_pull_pr = pr_df.with_columns(pl.lit(2).alias("District"))
NASS_pull_pr.write_parquet("NASS_pull_pr.parquet")

shape: (45_633, 40)
┌────────┬─────────────┬────────────────┬──────────┬───┬─────────────┬──────┬───────┬──────────┐
│ CV (%) ┆ Value       ┆ agg_level_desc ┆ asd_code ┆ … ┆ week_ending ┆ year ┆ zip_5 ┆ District │
│ ---    ┆ ---         ┆ ---            ┆ ---      ┆   ┆ ---         ┆ ---  ┆ ---   ┆ ---      │
│ str    ┆ str         ┆ str            ┆ str      ┆   ┆ str         ┆ i64  ┆ str   ┆ i32      │
╞════════╪═════════════╪════════════════╪══════════╪═══╪═════════════╪══════╪═══════╪══════════╡
│ 22.1   ┆ 28,813,951  ┆ PUERTO RICO &  ┆          ┆ … ┆             ┆ 2022 ┆       ┆ 2        │
│        ┆             ┆ OUTLYING AREAS ┆          ┆   ┆             ┆      ┆       ┆          │
│ 8.0    ┆ 48,301,595  ┆ PUERTO RICO &  ┆          ┆ … ┆             ┆ 2022 ┆       ┆ 2        │
│        ┆             ┆ OUTLYING AREAS ┆          ┆   ┆             ┆      ┆       ┆          │
│ 11.0   ┆ 41,205,033  ┆ PUERTO RICO &  ┆          ┆ … ┆             ┆ 2022 ┆       ┆ 2        │
│        ┆

The final dataset is stored as a .parquet file, which holds the same tabular data as a CSV file but takes up a fraction of the space. There are over 3 million rows in "NASS_pull.parquet"; a CSV file with that many rows costs actual money to upload to GitHub.

## 2. Cleaning

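Some entries in the "Value" column are placeholder codes rather than numbers: (D) marks a value withheld to avoid disclosing data for individual operations, and (Z) marks a value less than half the rounding unit. We filter these rows out, then aggregate by district to drop extraneous information, which at this point is every column except "short_desc", "Value", and "District".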
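The same cleaning rule in plain Python, as a sketch: entries carrying the `(D)` or `(Z)` codes are dropped, and thousands separators are stripped before converting to a number. `parse_value` is a hypothetical helper; the notebook cell below does this with polars expressions.

```python
import re

# Placeholder codes filtered out by the cleaning cell
SUPPRESSED = re.compile(r"\(D\)|\(Z\)")


def parse_value(raw):
    """Return the numeric value, or None when the entry is a placeholder code."""
    if SUPPRESSED.search(raw):
        return None
    return float(raw.replace(",", ""))


values = ["28,813,951", "(D)", "272", " (Z)"]
print([parse_value(v) for v in values])  # [28813951.0, None, 272.0, None]
```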

In [None]:
df_big = pl.read_parquet("NASS_pull.parquet")
df_pr = pl.read_parquet("NASS_pull_pr.parquet")
dfs = [df_big, df_pr]
df = pl.concat(dfs)
df = df.filter(~pl.col("Value").str.contains(r"\(D\)|\(Z\)"))
df = df.with_columns(pl.col("Value").str.replace_all(",", "").cast(pl.Float64))

district_dfs = []

for dist in df.partition_by("District"):
    district_df = dist.group_by("short_desc").agg(
        [
            pl.when(pl.col("short_desc").str.contains("PCT"))
            .then(pl.col("Value").median())
            .otherwise(pl.col("Value").sum())
            .alias("District_Total"),
            pl.mean("District").cast(pl.Int32),
        ]
    )
    district_dfs.append(district_df)

df = pl.concat(district_dfs)
print(df)

shape: (3_918_933, 40)
┌────────┬────────────┬────────────────┬──────────┬───┬─────────────┬──────┬───────┬──────────┐
│ CV (%) ┆ Value      ┆ agg_level_desc ┆ asd_code ┆ … ┆ week_ending ┆ year ┆ zip_5 ┆ District │
│ ---    ┆ ---        ┆ ---            ┆ ---      ┆   ┆ ---         ┆ ---  ┆ ---   ┆ ---      │
│ str    ┆ str        ┆ str            ┆ str      ┆   ┆ str         ┆ i64  ┆ str   ┆ i32      │
╞════════╪════════════╪════════════════╪══════════╪═══╪═════════════╪══════╪═══════╪══════════╡
│ (L)    ┆ 12,612,000 ┆ COUNTY         ┆ 10       ┆ … ┆             ┆ 2022 ┆       ┆ 1        │
│ 8.3    ┆ 272        ┆ COUNTY         ┆ 10       ┆ … ┆             ┆ 2022 ┆       ┆ 1        │
│ (D)    ┆ (D)        ┆ COUNTY         ┆ 10       ┆ … ┆             ┆ 2022 ┆       ┆ 1        │
│ (L)    ┆ 2          ┆ COUNTY         ┆ 10       ┆ … ┆             ┆ 2022 ┆       ┆ 1        │
│ (D)    ┆ (D)        ┆ COUNTY         ┆ 10       ┆ … ┆             ┆ 2022 ┆       ┆ 1        │
│ …      ┆ …     

Note: For percentage items we take the median across all counties in the district (robust to outliers) rather than a sum. The interpretation of these values is not very intuitive: each one is the "median county percent ___ in the district", not the percent ___ for the district as a whole.
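A small worked example of why the two numbers differ. The three counties and their acreages are made up for illustration only.

```python
from statistics import median

# Hypothetical counties: (irrigated acres, total acres)
counties = [(10, 100), (50, 100), (900, 1000)]

# Percent within each county, then the median across counties
county_pcts = [100 * part / total for part, total in counties]
median_pct = median(county_pcts)

# Percent for the district as a whole: pool the acres first
district_pct = 100 * sum(p for p, _ in counties) / sum(t for _, t in counties)

print(county_pcts)   # [10.0, 50.0, 90.0]
print(median_pct)    # 50.0
print(district_pct)  # 80.0
```

The median treats every county equally, while the pooled figure is dominated by the largest county.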

Also, the conditional aggregation creates a dataframe where "District_Total" is actually a column of lists. We resolve this in step 3.

## 3. Analyzing

***RUN FROM HERE***

To filter through the data and find commodities that we want to know more about, we can use a keyword search approach applied to the short description of the data item. Some examples are presented below:
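The matching rule can be sketched in plain Python: a description is kept when every include pattern is found and no exclude pattern is found, ignoring case. `matches` is a hypothetical helper, and the three descriptions are illustrative strings rather than rows taken from the dataset. Patterns are treated as regular expressions, so a literal dollar sign is written `\$` in a raw string.

```python
import re


def matches(short_desc, include, exclude):
    """True if all include patterns and no exclude patterns are found (case-insensitive)."""
    text = short_desc.lower()
    has_all = all(re.search(k.lower(), text) for k in include)
    has_excluded = any(re.search(k.lower(), text) for k in exclude)
    return has_all and not has_excluded


include = [r"sales, measured in \$"]
exclude = ["totals"]

print(matches("MILK - SALES, MEASURED IN $", include, exclude))              # True
print(matches("COMMODITY TOTALS - SALES, MEASURED IN $", include, exclude))  # False
print(matches("MILK - OPERATIONS WITH SALES", include, exclude))             # False
```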

In [14]:
districts = range(1, 13)
# districts = [10, 11, 12]

keyword_list = []
excl_keyword_list = []

# k1 = ["acres", "ag land"]
# keyword_list.append(k1)
# ek1 = [
#     "treated",
#     "wood",
#     "pasture",
#     "reserv",
#     "to",
#     "crop",
#     "pct",
#     "irrigated",
#     "organic",
# ]
# excl_keyword_list.append(ek1)

# k2 = ["number", "ag land"]
# keyword_list.append(k2)
# ek2 = ["wood", "pasture", "reserv", "to", "crop", "pct", "irrigated", "organic"]
# excl_keyword_list.append(ek2)

# k3 = ["number", "asset value", "\$"]
# keyword_list.append(k3)
# ek3 = []
# excl_keyword_list.append(ek3)

# k4 = ["income", "receipts", "\$"]
# keyword_list.append(k4)
# ek4 = ["operation", "other", "dividends", "insurance", "forest", "tourism"]
# excl_keyword_list.append(ek4)

# k5 = ["income", "net", "\$"]
# keyword_list.append(k5)
# ek5 = ["gain", "loss", "/ operation"]
# excl_keyword_list.append(ek5)

commodity_keywords = [r"sales, measured in \$"]  # raw string: "\$" is a regex-escaped dollar sign
keyword_list.append(commodity_keywords)
commodity_excl_keywords = ["totals"]
excl_keyword_list.append(commodity_excl_keywords)

dfs = []

for incl, excl in zip(keyword_list, excl_keyword_list):
    custom = df.filter(
        [pl.col("short_desc").str.to_lowercase().str.contains(k.lower()) for k in incl],
        *[
            ~pl.col("short_desc").str.to_lowercase().str.contains(exk.lower())
            for exk in excl
        ],
        pl.col("District").is_in(districts),
    )
    dfs.append(custom)

custom = pl.concat(dfs)
custom = custom.with_columns(pl.col("District_Total").list.unique().list.first())

custom.write_parquet("custom_df.parquet")

If we know exactly which data items we would like included in a final table, then we can move on to step 4. If some extra analysis is needed, go to step 3a first.

### 3a. More Analyzing

If there are some secondary characteristics we want more information on, such as which commodities generate the most cash sales in each district, then some more work needs to be done before a dataframe will be ready for final formatting. Below we find the top 10 highest value commodities in each district, per our earlier keyword search.
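The top-N-per-group idea, sketched without polars. The rows and totals are invented for illustration, and `top_n_per_district` is a hypothetical helper.

```python
from collections import defaultdict
from heapq import nlargest


def top_n_per_district(rows, n=10):
    """rows: iterable of (short_desc, district_total, district) tuples."""
    by_district = defaultdict(list)
    for desc, total, district in rows:
        by_district[district].append((desc, total))
    # Keep the n largest totals within each district, largest first
    return {
        district: nlargest(n, items, key=lambda item: item[1])
        for district, items in sorted(by_district.items())
    }


rows = [
    ("MILK", 300.0, 1),
    ("GRAIN", 100.0, 1),
    ("CATTLE", 200.0, 1),
    ("CORN", 500.0, 12),
    ("RICE", 400.0, 12),
]
print(top_n_per_district(rows, n=2))
# {1: [('MILK', 300.0), ('CATTLE', 200.0)], 12: [('CORN', 500.0), ('RICE', 400.0)]}
```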

In [15]:
df = pl.read_parquet("custom_df.parquet")

district_dfs = []

for dist in df.partition_by("District"):
    district_df = dist.sort("District_Total", descending=True).head(10)
    district_dfs.append(district_df)

df = pl.concat(district_dfs)
print(df)

df.write_parquet("custom_df.parquet")

shape: (120, 3)
┌─────────────────────────────────┬────────────────┬──────────┐
│ short_desc                      ┆ District_Total ┆ District │
│ ---                             ┆ ---            ┆ ---      │
│ str                             ┆ f64            ┆ i32      │
╞═════════════════════════════════╪════════════════╪══════════╡
│ MILK - SALES, MEASURED IN $     ┆ 1.4799e9       ┆ 1        │
│ FIELD CROPS, OTHER, INCL HAY -… ┆ 4.93656e8      ┆ 1        │
│ CATTLE, INCL CALVES - SALES, M… ┆ 2.11866e8      ┆ 1        │
│ MAPLE SYRUP - SALES, MEASURED … ┆ 1.37478e8      ┆ 1        │
│ GRAIN - SALES, MEASURED IN $    ┆ 8.7437e7       ┆ 1        │
│ …                               ┆ …              ┆ …        │
│ CORN - SALES, MEASURED IN $     ┆ 1.4037e9       ┆ 12       │
│ COTTON, LINT & SEED - SALES, M… ┆ 1.2953e9       ┆ 12       │
│ RICE - SALES, MEASURED IN $     ┆ 7.23593e8      ┆ 12       │
│ CUT CHRISTMAS TREES & SHORT TE… ┆ 4.21853e8      ┆ 12       │
│ GRAIN, OTHER - SALES, 

## 4. Table Formatting

In [None]:
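df = pl.read_parquet("custom_df.parquet")

# Fed district names in district order (1 = Boston, ..., 12 = San Francisco)
dist_cols = [
    "Boston",
    "New York (excl. PR and U.S. VI)",
    "Philadelphia",
    "Cleveland",
    "Richmond",
    "Atlanta",
    "Chicago",
    "St. Louis",
    "Minneapolis",
    "Kansas City",
    "Dallas",
    "San Francisco",
]

# One row per data item, one column per district
df = df.pivot("District", values=cs.starts_with("District_Total"))

# Assumes the pivot names its columns "1" through "12" after the District values
rename_map = {"short_desc": "Description"}
rename_map.update({str(i): name for i, name in enumerate(dist_cols, start=1)})
df = df.rename(rename_map)
df = df.with_columns(pl.col("Description").str.to_titlecase())
print(df)

# refer to NASS for units
gt_df = GT(df)

gt_df = gt_df.tab_spanner(label="District", columns=dist_cols).tab_style(
    style=style.text(size="9px", font="Helvetica"),
    locations=loc.body(columns="Description"),
)

gt_df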

  df_pivot = df.pivot(


ColumnNotFoundError: short_desc

Resolved plan until failure:

	---> FAILED HERE RESOLVING 'select' <---
DF ["Description", "Boston", "New York (excl. PR and U.S. VI)", "Philadelphia"]; PROJECT */13 COLUMNS

From here, I think a good amount of hard-coding is needed for table formatting; districts will have different top-production commodities, so how do we want to display that information? It's tougher to decide than when you are comparing particular commodity classes across districts...
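One possible layout, offered only as a sketch: instead of one row per commodity (which leaves most cells empty when districts have different top commodities), use one row per rank and put the commodity name in each district's cell. `rank_table` is a hypothetical helper and the commodity names are placeholders.

```python
def rank_table(top_by_district):
    """Turn {district: [commodity, ...]} into one row per rank."""
    depth = max(len(items) for items in top_by_district.values())
    rows = []
    for rank in range(depth):
        row = {"Rank": rank + 1}
        for district, items in top_by_district.items():
            # Districts with fewer items get an empty cell
            row[district] = items[rank] if rank < len(items) else None
        rows.append(row)
    return rows


top = {
    "Boston": ["Commodity A", "Commodity B"],
    "Dallas": ["Commodity C"],
}
for row in rank_table(top):
    print(row)
# {'Rank': 1, 'Boston': 'Commodity A', 'Dallas': 'Commodity C'}
# {'Rank': 2, 'Boston': 'Commodity B', 'Dallas': None}
```

A list of such rows can be passed to `pl.DataFrame` and then to `GT` in the same way as the pivoted table, though the dollar values would need their own column or a combined label.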