# NASS Statistics by Federal Reserve Bank District

Which Federal Reserve Bank district has the most farmland? What commodities generate the most income in each district? Which district sells the most llamas? Using USDA 2022 Ag Census data and a nifty shapefile compiled by Colton Tousey of the Federal Reserve Bank of Kansas City, we can answer all of these questions, and more.

In [1]:
import polars as pl
import requests as req
import os
from dotenv import load_dotenv
import json
from alive_progress import alive_bar
from time import sleep
from wakepy import keep
import polars.selectors as cs
from great_tables import GT, style, loc

## 1. Gathering Data

***DON'T RUN, it will take 3.5 hrs with good wifi***

First, we need an API key from the USDA to query the NASS database. The key is stored in an untracked .env file, so it is not part of the GitHub repository for this project; anyone who wants to run this code must request their own key from the USDA.
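As a minimal sketch (not part of the original notebook), the key lookup can be made to fail early when the key is missing. The helper name require_api_key is hypothetical; NASS_api_key is the variable name this notebook reads with os.getenv.

```python
import os


def require_api_key(env=None, name="NASS_api_key"):
    """Return the NASS API key from the environment, or raise if it is absent."""
    # Hypothetical helper: `env` defaults to the real process environment.
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set; request a key from the USDA and add it to .env"
        )
    return key
```

Called after load_dotenv(), this would stop the run before any request is sent rather than partway through the pull.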

FedCounties.csv records the Federal Reserve Bank district for every county in the United States, along with state and county FIPS codes. NASS statistics also include FIPS codes.
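The FIPS codes in the CSV are handled as integers (the filter in the next cell compares STATEFP against integers), so leading zeros are lost. A small sketch of the zero-padding that restores the standard 2-digit state and 3-digit county codes; the helper name county_geoid is hypothetical:

```python
def county_geoid(statefp, countyfp):
    """Zero-pad integer FIPS codes into a 5-character state+county string."""
    return str(statefp).zfill(2) + str(countyfp).zfill(3)


print(county_geoid(1, 1))    # 01001 (Autauga County, Alabama)
print(county_geoid(20, 91))  # 20091 (Johnson County, Kansas)
```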

In [None]:
fed_counties_df = pl.read_csv("FedCounties.csv")
fed_counties_df = fed_counties_df.filter(~pl.col("STATEFP").is_in([78, 72, 69, 66, 60]))
tuples = []

for dist in range(1, 13):
    filtered = fed_counties_df.filter(fed_counties_df["District"] == dist)
    tuples.extend(
        zip(
            filtered["District"].to_list(),
            filtered["STATEFP"].to_list(),
            filtered["COUNTYFP"].to_list(),
        )
    )

tuples = [(t[0], str(t[1]).zfill(2), str(t[2]).zfill(3)) for t in tuples]

In [25]:
print(len(tuples))

3142


Now we have the unique (district, state, county) triples. The next step is to gather ALL 2022 Census data for every county and add a new variable to the USDA data: "District".

***DON'T RUN, continue from 2***

Note: Puerto Rico and the U.S. Virgin Islands (both part of the New York Fed district) are excluded from this pull because querying those state FIPS codes caused trouble.
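Each county request differs only in its state and county codes, so the query parameters can be built by one small function. This is a sketch only: the function name county_query_params is hypothetical, while the parameter names are the ones passed to the API in this notebook.

```python
def county_query_params(api_key, state, county, year):
    """Build the query parameters for one county-level Census request."""
    return {
        "key": api_key,
        "state_fips_code": state,   # zero-padded 2-character string
        "county_code": county,      # zero-padded 3-character string
        "agg_level_desc": "COUNTY",
        "source_desc": "CENSUS",
        "year": year,
        "format": "json",
    }


# "YOUR_KEY" is a placeholder, not a real key
params = county_query_params("YOUR_KEY", "20", "091", 2022)
print(params["state_fips_code"], params["county_code"], params["year"])
```

Keeping the year as an explicit argument makes it harder for the queried year and the output filename to drift apart.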

In [64]:
load_dotenv()
url = "https://quickstats.nass.usda.gov/api/api_GET"
api_key = os.getenv("NASS_api_key")

district_dfs = []
with keep.presenting():  # took approx. 3:29 hrs
    for dist in range(1, 13):
        pairs = [
            (state, county) for district, state, county in tuples if district == dist
        ]

        county_dfs = []
        with alive_bar(len(pairs), title="Pairs") as bar:
            for state, county in pairs:
                bar()
                raw = req.get(
                    url,
                    params={
                        "key": api_key,
                        "state_fips_code": state,
                        "county_code": county,
                        "agg_level_desc": "COUNTY",
                        "source_desc": "CENSUS",
                        "year": 2022,
                        "format": "json",
                    },
                ).text
                sleep(2)  # pause between requests

                try:
                    content = json.loads(raw)
                except json.decoder.JSONDecodeError as e:
                    print(raw)
                    raise e

                # the API reports problems in an "error" field; skip that county
                if "error" in content:
                    print(state, county)
                    print(content["error"])
                    continue

                county_df = pl.DataFrame(content["data"])
                # sort columns so every county frame has the same column order
                county_df = county_df.select(
                    [pl.col(c) for c in sorted(county_df.columns)]
                )
                county_dfs.append(county_df)

        district_df = pl.concat(county_dfs)
        district_df = district_df.with_columns(pl.lit(dist).alias("District"))
        district_dfs.append(district_df)

NASS_pull = pl.concat(district_dfs)
NASS_pull.write_parquet("NASS_pull_2022.parquet")

Pairs |█████▌⚠︎                                 | (!) 9/66 [14%] in 28.4s (0.32/s) 


KeyboardInterrupt: 

Collect data on Puerto Rico in a separate query.

In [None]:
# load_dotenv()
# url = "https://quickstats.nass.usda.gov/api/api_GET"
# api_key = os.getenv("NASS_api_key")

raw = req.get(
    url,
    params={
        "key": api_key,
        "agg_level_desc": "PUERTO RICO & OUTLYING AREAS",
        "state_name": "PUERTO RICO",
        "source_desc": "CENSUS",
        "year": 2022,
        "format": "json",
    },
).text

try:
    content = json.loads(raw)
except json.decoder.JSONDecodeError as e:
    print(raw)
    raise e

pr_df = pl.DataFrame(content["data"])
pr_df = pr_df.select([pl.col(c) for c in sorted(pr_df.columns)])
# Puerto Rico belongs to the New York Fed district (District 2)
NASS_pull_pr = pr_df.with_columns(pl.lit(2).alias("District"))
NASS_pull_pr.write_parquet("NASS_pull_pr_2022.parquet")

shape: (45_633, 40)
┌────────┬─────────────┬────────────────┬──────────┬───┬─────────────┬──────┬───────┬──────────┐
│ CV (%) ┆ Value       ┆ agg_level_desc ┆ asd_code ┆ … ┆ week_ending ┆ year ┆ zip_5 ┆ District │
│ ---    ┆ ---         ┆ ---            ┆ ---      ┆   ┆ ---         ┆ ---  ┆ ---   ┆ ---      │
│ str    ┆ str         ┆ str            ┆ str      ┆   ┆ str         ┆ i64  ┆ str   ┆ i32      │
╞════════╪═════════════╪════════════════╪══════════╪═══╪═════════════╪══════╪═══════╪══════════╡
│ 22.1   ┆ 28,813,951  ┆ PUERTO RICO &  ┆          ┆ … ┆             ┆ 2022 ┆       ┆ 2        │
│        ┆             ┆ OUTLYING AREAS ┆          ┆   ┆             ┆      ┆       ┆          │
│ 8.0    ┆ 48,301,595  ┆ PUERTO RICO &  ┆          ┆ … ┆             ┆ 2022 ┆       ┆ 2        │
│        ┆             ┆ OUTLYING AREAS ┆          ┆   ┆             ┆      ┆       ┆          │
│ 11.0   ┆ 41,205,033  ┆ PUERTO RICO &  ┆          ┆ … ┆             ┆ 2022 ┆       ┆ 2        │
│        ┆

The final dataset is stored as a .parquet file. Parquet holds the same tabular data as a CSV file, but in a compressed, columnar format that takes up a fraction of the space. There are over 3 million rows in "NASS_pull_2022.parquet"; a CSV file with that many rows costs actual money to upload to GitHub.

## 2. Cleaning

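Some entries in the "Value" column are not actual numbers: "(D)" marks a value withheld to avoid disclosing data for individual operations, and "(Z)" marks a value less than half the rounding unit. We filter these rows out. Then we aggregate by district to get rid of extraneous information, which at this point is every column except "short_desc", the value itself, and "District".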
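The cell below drops any "Value" string containing "(D)" or "(Z)" and strips thousands separators before casting to a float. A plain-Python sketch of that same rule for a single string; the function name parse_value is hypothetical, and any other non-numeric code would raise a ValueError here:

```python
def parse_value(raw):
    """Convert a NASS "Value" string to a float, or None if it is a (D)/(Z) code."""
    text = raw.strip()
    if "(D)" in text or "(Z)" in text:
        return None
    return float(text.replace(",", ""))


print(parse_value("28,813,951"))  # 28813951.0
print(parse_value(" (D)"))        # None
```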

In [None]:
df_big = pl.read_parquet("NASS_pull_2022.parquet")
df_pr = pl.read_parquet("NASS_pull_pr_2022.parquet")
dfs = [df_big, df_pr]
df = pl.concat(dfs)
df = df.filter(
    (~pl.col("Value").str.contains(r"\(D\)|\(Z\)"))
    & (pl.col("domain_desc") == "TOTAL")
    & ~(
        pl.col("short_desc").is_in(
            ["AG LAND - ACRES", "AG LAND - NUMBER OF OPERATIONS"]
        )
    )
)
df = df.with_columns(pl.col("Value").str.replace_all(",", "").cast(pl.Float64))

# want to know if there are multiple of the same short_desc entries per county...
# want to know if filtering by domain_desc == "TOTAL" ensures that there are ONLY unique short_descs for each county...
# grouped_df = df.group_by(['state_fips_code', 'county_code', 'short_desc']).agg(
#     pl.len().alias('count')
# )

# # Now find which state_fips_code and county_code combinations have duplicated short_desc values
# # We group again by state_fips_code and county_code, and filter where any short_desc appears more than once
# result = grouped_df.group_by(['state_fips_code', 'county_code']).agg([
#     pl.len().alias('unique_short_desc_count'),
#     (pl.col('count') > 1).any().alias('has_duplicate_short_desc')
# ]).filter(
#     pl.col('has_duplicate_short_desc') == True
# )

# print(result)
# print(result["unique_short_desc_count"].sum())

# thank you Claude

district_dfs = []

for dist in df.partition_by("District"):
    district_df = dist.group_by("short_desc").agg(
        [
            pl.when(pl.col("short_desc").str.contains("PCT") | pl.col("short_desc").str.contains("/ OPERATION") | pl.col("short_desc").str.contains("/ ACRE"))
            .then(pl.col("Value").mean())
            .otherwise(pl.col("Value").sum())
            .alias("District_Total"),
            pl.mean("District").cast(pl.Int32),
        ]
    )
    district_dfs.append(district_df)

df = pl.concat(district_dfs)
print(df)

shape: (20_611, 3)
┌─────────────────────────────────┬─────────────────────────────────┬──────────┐
│ short_desc                      ┆ District_Total                  ┆ District │
│ ---                             ┆ ---                             ┆ ---      │
│ str                             ┆ list[f64]                       ┆ i32      │
╞═════════════════════════════════╪═════════════════════════════════╪══════════╡
│ CABBAGE, CHINESE, FRESH MARKET… ┆ [113.0, 113.0, … 113.0]         ┆ 1        │
│ PEAS, GREEN, (EXCL SOUTHERN), … ┆ [3.0, 3.0, 3.0]                 ┆ 1        │
│ PARSNIPS, FRESH MARKET - ACRES… ┆ [15.0, 15.0, … 15.0]            ┆ 1        │
│ INCOME, FARM-RELATED, GOVT PRO… ┆ [564.0, 564.0, … 564.0]         ┆ 1        │
│ PRODUCERS, YEARS ON PRESENT OP… ┆ [11673.0, 11673.0, … 11673.0]   ┆ 1        │
│ …                               ┆ …                               ┆ …        │
│ PEAS, DRY EDIBLE, IRRIGATED - … ┆ [173.0, 173.0, … 173.0]         ┆ 12       │
│ TOMATOE

Note: For percentage and per-unit items (a short_desc containing "PCT", "/ OPERATION", or "/ ACRE"), we take the mean across all counties in the district instead of the sum. The interpretation of these values is not very intuitive: each mean percent is the "average percent ___ for all counties in the district", not the percent ___ for the district as a whole. We also make sure that domain_desc = "TOTAL", or else we would double-count some values.

Also, the conditional aggregation creates a dataframe where "District_Total" is actually a column of lists. We resolve this in step 3.
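To see why the mean of county percentages is not the district percentage, here is a small worked example. The two counties and their acreages are hypothetical and are not taken from the Census data.

```python
# Hypothetical counties: (acres in some category, total acres)
counties = [(50, 100), (90, 900)]

# Percent for each county, then the simple average of those percents
county_pcts = [100 * part / total for part, total in counties]
mean_of_pcts = sum(county_pcts) / len(county_pcts)

# Percent for the district as a whole: pool the acres first, then divide
district_pct = 100 * sum(p for p, _ in counties) / sum(t for _, t in counties)

print(county_pcts)   # [50.0, 10.0]
print(mean_of_pcts)  # 30.0
print(district_pct)  # 14.0
```

The small county pulls the unweighted mean up to 30 percent, while the pooled figure for the district is 14 percent.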

## 3. Analyzing

***RUN FROM HERE***

To filter through the data and find commodities that we want to know more about, we can use a keyword search approach applied to the short description of the data item. Some examples are presented below:
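The rule behind the keyword search is: keep a short description if it contains every "include" keyword and none of the "exclude" keywords, ignoring case. A minimal plain-Python sketch of that rule follows. The function name matches is hypothetical, and it uses plain substring tests, whereas the polars str.contains call in the notebook treats each keyword as a regular expression.

```python
def matches(short_desc, include, exclude):
    """True if every include keyword and no exclude keyword occurs in short_desc."""
    text = short_desc.lower()
    has_all = all(k.lower() in text for k in include)
    has_none = not any(k.lower() in text for k in exclude)
    return has_all and has_none


print(matches("AG LAND, OWNED, IN FARMS - ACRES", ["ag land, owned"], ["crop"]))
# True
print(matches("AG LAND, CROPLAND - NUMBER OF OPERATIONS", ["ag land"], ["crop"]))
# False
```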

In [91]:
districts = range(1, 13)

# income and expenses
# keyword_pairs = [
#     # expenses
#     (
#         ["expense totals, operating", "measured in \$"],
#         ["operation", "landlord"],
#     ),
#     (
#         ["taxes, property, real estate", "non-real estate", "measured in \$"], 
#         ["only", "operations"]),
#     (
#         ["rent, cash", "\$"], 
#         ["only", "operation"]
#     ),
#     (
#         ["interest", "expense", "\$"], 
#         ["pct", "/ operation", "operations", "for", "real"]
#     ),
#     (
#         ["depreciation", "expense", "\$"], 
#         ["operations"]
#     ),
#     #income
#     (
#         ["commodity totals", "sales", "\$"], 
#         ["operation", "marketed", "direct", "landlord", "organic", "retail", "value"]
#     ),
#     (
#         ["by-products", "receipts", "\$"], 
#         ["operations"]
#     ),
#     (
#         ["govt programs", "receipts", "\$"], 
#         ["operation", "conservation"]
#     ),
#     (
#         ["income, farm-related", "receipts", "\$"], 
#         ["operation", "associated", "services", "tourism", "payments", "programs", "products", "other", "dividends"]
#     ),
#     (
#         ["income, net cash farm", "operations", "net", "\$"], 
#         ["/ operation"]
#     ),
# ]

# income
# keyword_pairs = [
#     (
#         ["income", "receipts, measured in \$"], 
#         ["nada"]
#     ),
# ]

# ag land
keyword_pairs = [
    (
        ["ag land"], 
        ["crop", "buildings", "irrigated", "organic", "owned", "rented", "pasture", "wood"]
    ),
    (
        ["ag land, owned"], 
        ["crop"]
    ),
    (
        ["ag land, rented"], 
        ["crop"]
    ),
    (
        ["ag land, cropland"], 
        ["harvested", "pastured"]
    ),
    (
        ["ag land, pastureland"], 
        ["excl"]
    ),
    (
        ["ag land, woodland"], 
        ["pastured"]
    ),
    (
        ["ag land, incl buildings"], 
        ["operations"]
    ),
    (
        ["land area, incl non-ag"], 
        ["crop"]
    ),
]
# keyword_pairs = [
#     (
#         ["ag land", "acres"], 
#         ["nada"]
#     ),
# ]


keyword_list = [pair[0] for pair in keyword_pairs]
excl_keyword_list = [pair[1] for pair in keyword_pairs]

# # TOP COMMODITIES
# commodity_keywords = ["sales, measured in \$"]
# keyword_list.append(commodity_keywords)
# commodity_excl_keywords = ["totals"]
# excl_keyword_list.append(commodity_excl_keywords)

dfs = []

for incl, excl in zip(keyword_list, excl_keyword_list):
    custom = df.filter(
        [pl.col("short_desc").str.to_lowercase().str.contains(k.lower()) for k in incl],
        *[
            ~pl.col("short_desc").str.to_lowercase().str.contains(exk.lower())
            for exk in excl
        ],
        pl.col("District").is_in(districts),
    )
    dfs.append(custom)

custom = pl.concat(dfs)
custom = custom.with_columns(pl.col("District_Total").list.unique().list.first())

custom.write_parquet("custom_df.parquet")

If we know exactly which data items we would like included in a final table, then we can move on to step 4. If some extra analysis needs doing, then go to step 3a first.

### 3a. More Analyzing

If there are some secondary characteristics we want more information on, such as which commodities generate the most cash sales in each district, then some more work needs to be done before a dataframe will be ready for final formatting. Below we find the top 20 highest-value items in each district among those matched by our earlier keyword search.
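The "largest items per district" step is a group, sort, and truncate. A plain-Python sketch of that pattern is below; the function name top_n_per_district and the rows are hypothetical and are not taken from the Census data.

```python
from collections import defaultdict


def top_n_per_district(rows, n):
    """rows: (short_desc, district_total, district) tuples; keep the n largest per district."""
    by_district = defaultdict(list)
    for row in rows:
        by_district[row[2]].append(row)
    result = []
    for district in sorted(by_district):
        ranked = sorted(by_district[district], key=lambda r: r[1], reverse=True)
        result.extend(ranked[:n])
    return result


# Hypothetical rows for illustration only
rows = [
    ("ITEM A", 5.0, 1),
    ("ITEM B", 9.0, 1),
    ("ITEM C", 7.0, 1),
    ("ITEM D", 3.0, 2),
    ("ITEM E", 8.0, 2),
]
print(top_n_per_district(rows, 2))
# [('ITEM B', 9.0, 1), ('ITEM C', 7.0, 1), ('ITEM E', 8.0, 2), ('ITEM D', 3.0, 2)]
```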

In [75]:
df_2 = pl.read_parquet("custom_df.parquet")

district_dfs = []

for dist in df_2.partition_by("District"):
    district_df = dist.sort("District_Total", descending=True).head(20)
    district_dfs.append(district_df)

df_2 = pl.concat(district_dfs)
print(df_2)

df_2.write_parquet("custom_df.parquet")

shape: (134, 3)
┌─────────────────────────────────┬────────────────┬──────────┐
│ short_desc                      ┆ District_Total ┆ District │
│ ---                             ┆ ---            ┆ ---      │
│ str                             ┆ f64            ┆ i32      │
╞═════════════════════════════════╪════════════════╪══════════╡
│ INCOME, FARM-RELATED - RECEIPT… ┆ 3.26761e8      ┆ 1        │
│ INCOME, FARM-RELATED, OTHER - … ┆ 1.49519e8      ┆ 1        │
│ INCOME, FARM-RELATED, AG TOURI… ┆ 4.9575e7       ┆ 1        │
│ INCOME, FARM-RELATED, FOREST P… ┆ 4.264e7        ┆ 1        │
│ INCOME, FARM-RELATED, AG SERVI… ┆ 2.7924e7       ┆ 1        │
│ …                               ┆ …              ┆ …        │
│ INCOME, FARM-RELATED, AG TOURI… ┆ 1.93228e8      ┆ 12       │
│ INCOME, FARM-RELATED, FOREST P… ┆ 1.44229e8      ┆ 12       │
│ INCOME, FARM-RELATED, GOVT PRO… ┆ 1.5043e7       ┆ 12       │
│ INCOME, FARM-RELATED - RECEIPT… ┆ 41680.0        ┆ 12       │
│ INCOME, FARM-RELATED, 

## 4. Table Formatting

In [92]:
# map the pivoted district-number columns to Federal Reserve Bank city names
district_names = {
    "short_desc": "Description",
    "1": "Boston",
    "2": "New York",
    "3": "Philadelphia",
    "4": "Cleveland",
    "5": "Richmond",
    "6": "Atlanta",
    "7": "Chicago",
    "8": "St. Louis",
    "9": "Minneapolis",
    "10": "Kansas City",
    "11": "Dallas",
    "12": "San Francisco",
}

df_3 = pl.read_parquet("custom_df.parquet")

# long to wide: one row per short_desc, one column per district
df_3 = df_3.pivot("District", values=cs.starts_with("District_Total"))

df_3 = df_3.rename(district_names)
df_3 = df_3.with_columns(pl.col("Description").str.to_titlecase())
df_3.write_csv("custom_table.csv")
print(df_3)

# # refer to NASS for units
gt_df = GT(df_3)

dist_cols = [
    "Boston",
    "New York",
    "Philadelphia",
    "Cleveland",
    "Richmond",
    "Atlanta",
    "Chicago",
    "St. Louis",
    "Minneapolis",
    "Kansas City",
    "Dallas",
    "San Francisco",
]

gt_df = gt_df.tab_spanner(label="District", columns=dist_cols).tab_style(
    style=style.text(size="9px", font="Helvetica"),
    locations=loc.body(columns="Description"),
)

gt_df

shape: (22, 13)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ Descripti ┆ New York  ┆ Boston    ┆ Philadelp ┆ … ┆ Minneapol ┆ Kansas    ┆ Dallas    ┆ San Fran │
│ on        ┆ ---       ┆ ---       ┆ hia       ┆   ┆ is        ┆ City      ┆ ---       ┆ cisco    │
│ ---       ┆ f64       ┆ f64       ┆ ---       ┆   ┆ ---       ┆ ---       ┆ f64       ┆ ---      │
│ str       ┆           ┆           ┆ f64       ┆   ┆ f64       ┆ f64       ┆           ┆ f64      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ Ag Land,  ┆ 118429.0  ┆ null      ┆ null      ┆ … ┆ null      ┆ null      ┆ null      ┆ null     │
│ Agricultu ┆           ┆           ┆           ┆   ┆           ┆           ┆           ┆          │
│ ral       ┆           ┆           ┆           ┆   ┆           ┆           ┆           ┆          │
│ Reserve … ┆           ┆           ┆           ┆   ┆           ┆          

The columns Boston through San Francisco sit under the spanner label "District".

| Description | Boston | New York | Philadelphia | Cleveland | Richmond | Atlanta | Chicago | St. Louis | Minneapolis | Kansas City | Dallas | San Francisco |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ag Land, Agricultural Reserve - Cuerdas | | 118429.0 | | | | | | | | | | |
| Ag Land, Agricultural Reserve - Number Of Operations | | 1368.0 | | | | | | | | | | |
| Ag Land, Agricultural Reserve - Area, Measured In Pct Of Farm Operations | | 9.0 | | | | | | | | | | |
| Farm Operations - Area Operated, Measured In Pct Of Ag Land | | 100.0 | | | | | | | | | | |
| Ag Land, Agricultural Reserve - Area, Measured In Pct Of Ag Land | | 12.0 | | | | | | | | | | |
| Ag Land, Owned, In Farms - Number Of Operations | 28498.0 | 35429.0 | 38820.0 | 112612.0 | 132181.0 | 197168.0 | 241356.0 | 190528.0 | 148112.0 | 246949.0 | 236492.0 | 184837.0 |
| Ag Land, Owned, In Farms - Acres | 3048130.0 | 4989470.0 | 3836130.0 | 12637395.0 | 16629172.0 | 29332682.0 | 38618962.0 | 34492311.0 | 97416421.0 | 125067118.0 | 87404877.0 | 75529494.0 |
| Ag Land, Rented From Others, In Farms - Number Of Operations | 6512.0 | 9747.0 | 12295.0 | 28804.0 | 35088.0 | 47497.0 | 91475.0 | 53740.0 | 64370.0 | 85191.0 | 56313.0 | 39810.0 |
| Ag Land, Rented From Others, In Farms - Acres | 621059.0 | 1891420.0 | 2007551.0 | 7155919.0 | 8629610.0 | 14053354.0 | 38530047.0 | 25999136.0 | 70664105.0 | 83904617.0 | 62359480.0 | 28686423.0 |
| Ag Land, Cropland - Number Of Operations | 23038.0 | 45025.0 | 34470.0 | 97248.0 | 105084.0 | 132612.0 | 228187.0 | 151433.0 | 137330.0 | 188685.0 | 127625.0 | 137869.0 |


From here, I think a good amount of hard-coding is needed for table formatting; districts will have different top-production commodities, so how do we want to display that information? It's tougher to decide than when you are comparing particular commodity classes across districts...