# NASS Statistics by Federal Reserve Bank District

Which Federal Reserve Bank district has the most farmland? What commodities generate the most income in each district? Which district sells the most llamas? Using USDA 2022 Ag Census data and a nifty shapefile compiled by Colton Tousey of the Federal Reserve Bank of Kansas City, we can answer all of these questions, and more.

## 1. Gathering Data

***DON'T RUN, it will take 3.5 hrs with good wifi***

First, we need an API key from the USDA to query the NASS database. The key lives in an untracked file (loaded with `load_dotenv`) that is not part of the GitHub repository for this project, so anyone who wants to run this code must request their own key from the USDA.
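Because `os.getenv` returns `None` when a variable is missing, a request could go out without a key and only fail later. A minimal sketch of a guard, assuming the key is stored under the name `NASS_api_key` as in the query cell of this notebook; the helper name `get_api_key` is hypothetical:

```python
import os


def get_api_key(env=os.environ, name="NASS_api_key"):
    """Return the NASS API key from the environment, or fail loudly.

    Hypothetical helper; `name` matches the variable read by the query
    cell in this notebook.
    """
    key = env.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; request a key from the USDA")
    return key
```

Calling it right after `load_dotenv()` would stop the run before any request is sent.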

FedCounties.csv records the Federal Reserve Bank district for every county in the United States, along with state and county FIPS codes. NASS statistics also include FIPS codes.
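A CSV reader will usually parse the FIPS columns as integers, which drops leading zeros, while this notebook sends them to the API as zero-padded strings (two digits for the state, three for the county). A small sketch of that padding rule; `county_fips` is a hypothetical helper, not part of the project code:

```python
def county_fips(state_fp, county_fp):
    """Build the 5-digit county FIPS code from numeric state and county parts."""
    return str(state_fp).zfill(2) + str(county_fp).zfill(3)


# State 1, county 1 (Autauga County, Alabama)
print(county_fips(1, 1))     # 01001
print(county_fips(20, 209))  # 20209
```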

In [2]:
import polars as pl
import requests as req
import os
from dotenv import load_dotenv
import json
from alive_progress import alive_bar
from time import sleep
from wakepy import keep
import polars.selectors as cs
from great_tables import GT

In [None]:
fed_counties_df = pl.read_csv("FedCounties.csv")
fed_counties_df = fed_counties_df.filter(
    (pl.col("STATEFP") != 78) & (pl.col("STATEFP") != 72)
)  # excl. PR and U.S. Virgin Islands
tuples = []

for dist in range(1, 13):
    filtered = fed_counties_df.filter(fed_counties_df["District"] == dist)
    tuples.extend(
        zip(
            filtered["District"].to_list(),
            filtered["STATEFP"].to_list(),
            filtered["COUNTYFP"].to_list(),
        )
    )

tuples = [(t[0], str(t[1]).zfill(2), str(t[2]).zfill(3)) for t in tuples]

Now we have the unique (district, state, county) tuples. The next step is to gather ALL 2022 Census data for every county and add a new variable to the USDA data: "District".

***DON'T RUN, continue from 2***

Note: Puerto Rico and the U.S. Virgin Islands (both part of the N.Y. Fed district) are excluded because querying their state FIPS codes caused trouble.
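One way to keep a single problematic county from stopping a multi-hour pull is to separate the parsing of a reply from the request itself. The sketch below is a hypothetical helper, not the project's code. It assumes, as the loop in this notebook does, that a successful reply carries a `"data"` key and a failed one an `"error"` key. The sample strings are made up for illustration.

```python
import json


def parse_nass_response(raw):
    """Return the list of records in a Quick Stats JSON reply.

    Hypothetical helper. Returns None for an error reply so the caller
    can log the county and move on.
    """
    content = json.loads(raw)  # raises json.JSONDecodeError on non-JSON text
    if "error" in content:
        return None
    return content["data"]


# Made-up replies, shaped like the two cases the loop distinguishes.
ok = '{"data": [{"short_desc": "MILK - SALES, MEASURED IN $", "Value": "1,234"}]}'
bad = '{"error": ["illustrative error message"]}'

print(parse_nass_response(ok)[0]["Value"])  # 1,234
print(parse_nass_response(bad))             # None
```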

In [None]:
load_dotenv()
url = "https://quickstats.nass.usda.gov/api/api_GET"
api_key = os.getenv("NASS_api_key")

district_dfs = []
with keep.presenting():  # took approx. 3:29 hrs
    for dist in range(1, 13):
        pairs = [
            (state, county) for district, state, county in tuples if district == dist
        ]

        county_dfs = []
        with alive_bar(len(pairs), title="Pairs") as bar:
            for state, county in pairs:
                bar()
                raw = req.get(
                    url,
                    params={
                        "key": api_key,
                        "state_fips_code": state,
                        "county_code": county,
                        "agg_level_desc": "COUNTY",
                        "source_desc": "CENSUS",
                        "year": 2022,
                        "format": "json",
                    },
                ).text
                sleep(2)

                try:
                    content = json.loads(raw)
                except json.decoder.JSONDecodeError as e:
                    print(raw)
                    raise e

                if "error" in content:
                    print(state, county)
                    print(content["error"])
                    continue

                county_df = pl.DataFrame(content["data"])  # reuse the parsed reply
                county_df = county_df.select(
                    [pl.col(c) for c in sorted(county_df.columns)]
                )
                county_dfs.append(county_df)

        district_df = pl.concat(county_dfs)
        district_df = district_df.with_columns(pl.lit(dist).alias("District"))
        district_dfs.append(district_df)

NASS_pull = pl.concat(district_dfs)
NASS_pull.write_parquet("NASS_pull.parquet")

The final dataset is stored as a .parquet file, a compressed columnar format that holds the same tabular data as a CSV file in a fraction of the space. There are over 3 million rows in "NASS_pull.parquet"; a CSV file with that many rows costs actual money to upload to GitHub.

## 2. Cleaning

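Some entries in the "Value" column are not numbers but NASS codes: `(D)` marks a figure withheld to avoid disclosing data for individual operations, and `(Z)` marks a figure smaller than half the rounding unit. We filter these rows out. Then we aggregate to district totals, which discards extraneous information: at this point, every column except "short_desc", the summed value, and "District".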
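The cleaning cell in this section does two things to the "Value" strings: it drops the `(D)` and `(Z)` codes and removes thousands separators before casting to a float. A plain-Python sketch of the same rule on single strings; `parse_value` is a hypothetical helper used only for illustration:

```python
def parse_value(raw):
    """Convert a NASS "Value" string to float, or None for (D)/(Z) codes."""
    text = raw.strip()
    if text in {"(D)", "(Z)"}:
        return None  # suppressed or too small to report: no usable number
    return float(text.replace(",", ""))


print(parse_value("1,234,000"))  # 1234000.0
print(parse_value(" (D)"))       # None
```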

In [None]:
df = pl.read_parquet("NASS_pull.parquet")
df = df.filter(~pl.col("Value").str.contains(r"\(D\)|\(Z\)"))
df = df.with_columns(pl.col("Value").str.replace_all(",", "").cast(pl.Float64))

district_dfs = []

for dist in df.partition_by("District"):
    district_df = dist.group_by("short_desc").agg(
        [pl.sum("Value").alias("District_Total"), pl.first("District")]
    )
    district_dfs.append(district_df)

df = pl.concat(district_dfs)

## 3. Analyzing

***RUN FROM HERE***

To filter through the data and find commodities that we want to know more about, we can use a keyword search approach applied to the short description of the data item. Some examples are presented below:
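The filter in this section keeps a row when every include keyword matches and no exclude keyword does, with each keyword treated as a regular expression. That is why `$` is written as `\$`: unescaped, it means "end of string" and matches everything. A plain-Python sketch of that logic using `re`; `matches` is a hypothetical helper and the descriptions are illustrative strings:

```python
import re


def matches(short_desc, incl, excl=()):
    """True if every `incl` pattern and no `excl` pattern is found.

    Hypothetical helper: patterns are regular expressions, compared
    case-insensitively by lowercasing both sides.
    """
    text = short_desc.lower()
    return all(re.search(k.lower(), text) for k in incl) and not any(
        re.search(s.lower(), text) for s in excl
    )


cattle = ["cattle", r"\$", "sales"]
print(matches("CATTLE, INCL CALVES - SALES, MEASURED IN $", cattle, ["excl"]))  # True
print(matches("CATTLE, (EXCL COWS) - SALES, MEASURED IN $", cattle, ["excl"]))  # False

# An unescaped "$" anchors to the end of the string, so it always matches.
print(bool(re.search("$", "acres operated")))    # True
print(bool(re.search(r"\$", "acres operated")))  # False
```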

In [None]:
districts = range(1, 13)
# districts = [10, 11, 12]

keyword_list = []
excl_keyword_list = []

# farmland_keywords = ["acres", "irrigated", "cropland"]
# keyword_list.append(farmland_keywords)
# farmland_excl_keywords = []
# excl_keyword_list.append(farmland_excl_keywords)

# cattle_keywords = ["cattle", r"\$", "sales"]
# keyword_list.append(cattle_keywords)
# cattle_excl_keywords = ["excl"]
# excl_keyword_list.append(cattle_excl_keywords)

commodity_keywords = [r"sales, measured in \$"]  # raw string: "\$" is a regex escape
keyword_list.append(commodity_keywords)
commodity_excl_keywords = []
excl_keyword_list.append(commodity_excl_keywords)

dfs = []

for incl, excl in zip(keyword_list, excl_keyword_list):
    custom = df.filter(
        *[pl.col("short_desc").str.to_lowercase().str.contains(k.lower()) for k in incl],
        *[
            ~pl.col("short_desc").str.to_lowercase().str.contains(s.lower())
            for s in excl
        ],
        pl.col("District").is_in(districts),
    )
    print(custom)
    dfs.append(custom)

custom = pl.concat(dfs)
print(custom)

custom.write_parquet("custom_df.parquet")

If we know exactly which data items we would like included in a final table, we can move on to section 4. If some extra analysis is needed, go to section 3a first.

### 3a. More Analyzing

If there are some secondary characteristics we want more information on, such as which commodities generate the most cash sales in each district, then some more work needs to be done before a dataframe will be ready for final formatting. Below we find the 10 highest-value data items in each district among those matched by our earlier keyword search. An item that falls outside a district's top 10 appears as an empty cell for that district in the final table.
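The selection rule is simply "group by district, then keep the n largest totals in each group". A dependency-free sketch of that rule on made-up rows; `top_n_per_district` is a hypothetical helper, and the notebook itself does this with polars:

```python
from collections import defaultdict
from heapq import nlargest


def top_n_per_district(rows, n):
    """Keep the n rows with the largest District_Total in each district."""
    by_district = defaultdict(list)
    for row in rows:
        by_district[row["District"]].append(row)
    return {
        dist: nlargest(n, group, key=lambda r: r["District_Total"])
        for dist, group in by_district.items()
    }


# Made-up totals for illustration only.
rows = [
    {"short_desc": "A", "District_Total": 3.0, "District": 1},
    {"short_desc": "B", "District_Total": 9.0, "District": 1},
    {"short_desc": "C", "District_Total": 5.0, "District": 1},
    {"short_desc": "A", "District_Total": 7.0, "District": 2},
]
top = top_n_per_district(rows, 2)
print([r["short_desc"] for r in top[1]])  # ['B', 'C']
```

Note that aggregate items such as "COMMODITY TOTALS" also match the sales keyword, so they compete for the top spots alongside individual commodities.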

In [None]:
df = pl.read_parquet("custom_df.parquet")

district_dfs = []

for dist in df.partition_by("District"):
    district_df = dist.sort("District_Total", descending=True).head(10)
    district_dfs.append(district_df)

df = pl.concat(district_dfs)
print(df)

df.write_parquet("custom_df.parquet")

## 4. Table Formatting

In [None]:
# display names for the pivoted columns (district numbers become column names)
district_names = {
    "short_desc": "Description",
    "1": "Boston",
    "2": "New York (excl. PR and U.S. VI)",
    "3": "Philadelphia",
    "4": "Cleveland",
    "5": "Richmond",
    "6": "Atlanta",
    "7": "Chicago",
    "8": "St. Louis",
    "9": "Minneapolis",
    "10": "Kansas City",
    "11": "Dallas",
    "12": "San Francisco",
}

df = pl.read_parquet("custom_df.parquet")

df = df.pivot("District", values=cs.starts_with("District_Total"))

df = df.rename(district_names)
print(df)

# refer to NASS for units
gt_df = GT(df)
gt_df

shape: (19, 13)

Description,Boston,New York (excl. PR and U.S. VI),Philadelphia,Cleveland,Richmond,Atlanta,Chicago,St. Louis,Minneapolis,Kansas City,Dallas,San Francisco
"COMMODITY TOTALS - SALES, MEASURED IN $",8980210000.0,21463889000.0,31657377000.0,43565874000.0,88704912000.0,124438788000.0,249329940000.0,117955835000.0,150242878000.0,200438140000.0,103928738000.0,324858738000.0
"CROP TOTALS - SALES, MEASURED IN $",2991141000.0,5468263000.0,6205892000.0,13378250000.0,14611633000.0,33156185000.0,86012923000.0,35148630000.0,54792952000.0,46397860000.0,17693612000.0,140863636000.0
"ANIMAL TOTALS, INCL PRODUCTS - SALES, MEASURED IN $",2200120000.0,7425384000.0,13226555000.0,11840145000.0,39790732000.0,45266618000.0,59487446000.0,33891797000.0,33098786000.0,70157946000.0,46219331000.0,65990283000.0
"MILK - SALES, MEASURED IN $",1479925000.0,5671875000.0,3306338000.0,2183427000.0,,,15773508000.0,,6097496000.0,,6864611000.0,34639441000.0
"HORTICULTURE TOTALS, (EXCL CUT TREES & VEGETABLE SEEDS & TRANSPLANTS) - SALES, MEASURED IN $",970083000.0,1354653000.0,1934063000.0,,2121563000.0,10217132000.0,,,,,2084317000.0,14632756000.0
"VEGETABLE TOTALS, INCL SEEDS & TRANSPLANTS, IN THE OPEN - SALES, MEASURED IN $",871449000.0,808249000.0,,,,4590846000.0,,,,,,37148585000.0
"FIELD CROPS, OTHER, INCL HAY - SALES, MEASURED IN $",493656000.0,,,,,5788718000.0,,,,,2143873000.0,13062889000.0
"FRUIT & TREE NUT TOTALS - SALES, MEASURED IN $",465274000.0,1153716000.0,,,,3551893000.0,,,,,,71234048000.0
"COMMODITY TOTALS, ORGANIC - SALES, MEASURED IN $",396126000.0,,996822000.0,,,,,,,,,
"COMMODITY TOTALS, INCL VALUE-ADDED, WHOLESALE, DIRECT TO RETAILERS & INSTITUTIONS & FOOD HUBS, LOCAL OR REGIONALLY BRANDED PRODUCTS, HUMAN CONSUMPTION - SALES, MEASURED IN $",375743000.0,,,,,,,,,,,
