Skip to content

Bash Project to handle creation, loading, and configuration of a Geospatial Database in Postgresql (POSTGIS) using US Census Tiger Geodata

Notifications You must be signed in to change notification settings

groenewt/BashTigerLoading

Repository files navigation

Bash and Tiger Maps!

Tiger Map

This project showcases a bash script that robostly interacts with the POSTGIS Tiger Extension to setup, load, and configure a geospatial database with various levels of depth Before continuing, the project structure is as follows: .

├── ./Logs : Directory for Log files generated during script (unique logs created for each distinct state loaded)
├── ./Resources : Directory containing various files utilized throughout the project
│ ├── ./Resources/Images : Images used
│ └── ./Resources/Scripts : Examples Scripts
│ └── ./Resources/mtfccs2022.csv : Raw Mtfccs Key table loaded into Database with Script ├── ./tiger_overview_with_ddl.md : Claude generated markdown incorpating DDL from Database ├── ./overview.md : Granular examination of the Bash Script and the ELT process
├── ./README.md : General overview of project with high level steps
└── ./tigerload.sh : Consolidated Bash script that will allow user to configure, load, and get detailed information of output

To run, clone the repository and run the following in the command line:

chmod +x tigerload.sh
./tigerload.sh

I.Getting Started

During the initial commit, this project has only been optimized for a Debian OS (Debian 12 Bookworm). Plans exists to showcase parallel processes via PS1 (powershell) and PY (python) With that being said, the first steps are to install postgres with postgis!

sudo apt-get update -y && sudo apt-get upgrade -y
sudo apt-get install -y wget systemctl
sudo apt-get install -y gnupg2 && wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt-get update -y
sudo apt-get install -y postgresql postgresql-contrib postgresql-client postgis

While a degree of familiarity with the command line will be necessary to fully follow along; the next steps I undertake after installing PostgreSql is configuring a Superuser profile with the same name as the local user (me)

#Create a super user
sudo psql -U postgres -c "CREATE USER $(whoami) WITH PASSWORD 'PASSWORD_HERE' SUPERUSER;"
# psql -d postgres -c "CREATE DATABASE tigercensus;" This step is automatically done in the script FYI

II. Tiger Flow

sequenceDiagram
    participant Census as US Census Bureau
    participant Staging as Staging Directory
    participant TempDB as tiger_staging Schema
    participant ProdDB as tiger_data Schema
    participant Geocoder as PostGIS Geocoder

    Note over Census, Geocoder: National Loader Process
    Census->>Staging: Download TIGER/Line Files (ZIP)
    Staging->>Staging: Extract ZIP files to temp directory
    Staging->>TempDB: Load shapefiles using shp2pgsql
    TempDB->>TempDB: Apply transformations & validations
    TempDB->>ProdDB: Move processed data via loader_load_staged_data()
    ProdDB->>Geocoder: Install and enable national geocoder support

    Note over Census, Geocoder: State Loader Process
    Census->>Staging: Download State-specific files
    Staging->>TempDB: Load state data to staging
    TempDB->>ProdDB: Transform & load state data
    ProdDB->>Geocoder: Rebuild geocoder indexes for updated state data
Loading

The above diagram provides a generalized overview of how the data ELT process will run. Refer to documentation: 2.4. Installing, Upgrading Tiger Geocoder, and loading data

To start of with, you will need to create the following extensions in a Database of your chosing (tigercensus from above).

CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS address_standardizer;
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
CREATE EXTENSION IF NOT EXISTS postgis_tiger_geocoder;

This will go ahead and populate the database with geospatial functions, defined structure via inheritance for Tiger components. To view all tables, run the following command in the console:

psql -d tigercensus -c "SELECT table_catalog, table_schema, table_name, table_type, is_typed FROM information_schema.tables as x1 where table_schema != 'pg_catalog' and table_schema != 'information_schema';"
#or Via the official documentation, you can test everything is setup correctly via:
psql -d tigercensus -c "SELECT na.address, na.streetname,na.streettypeabbrev, na.zip
	FROM normalize_address('1 Devonshire Place, Boston, MA 02109') AS na;"

III. Loading the Database (Nation Load)

At this point, you will create a 'loader' profile which can be done rather simply, or a little more confidently:

INSERT INTO tiger.loader_platform(os, declare_sect, pgbin, wget, unzip_command, psql, path_sep,
                                  loader, environ_set_command, county_process_command)
SELECT 'debbie', declare_sect, pgbin, wget, unzip_command, psql, path_sep,
       loader, environ_set_command, county_process_command
FROM tiger.loader_platform
WHERE os = 'sh';
UPDATE tiger.loader_variables
SET
    tiger_year = '${TIGER_YEAR}',
    website_root = '${TIGER_WEBSITE_ROOT}',
    staging_fold = '${STAGING_FOLDER}',
    data_schema = 'tiger_data',
    staging_schema = 'tiger_staging'

or, the shell version:

db_fnc_run_psql "${DB_NAME}" <<EOF
INSERT INTO tiger.loader_platform (
    os, declare_sect, pgbin, wget, unzip_command, psql, path_sep, loader, environ_set_command, county_process_command
)
VALUES (
    'debbie',
    E'TMPDIR=${STAGING_FOLDER}/temp/\\nUNZIPTOOL=unzip\\nWGETTOOL="/usr/bin/wget"\\nexport PGBIN=/usr/lib/postgresql/15/bin\\nexport PGPORT=5432\\nexport PGHOST=localhost\\nexport PGUSER=${PG_USER}\\nexport PGPASSWORD=${PGPASSWORD}\\nexport PGDATABASE=${DB_NAME}\\nPSQL=\${PGBIN}/psql\\nSHP2PGSQL=shp2pgsql\\ncd ${STAGING_FOLDER}',
    '/usr/lib/postgresql/15/bin',
    'wget',
    E'rm -f \${TMPDIR}/*.*\\n\${PSQL} -c "DROP SCHEMA IF EXISTS \${staging_schema} CASCADE;"\\n\${PSQL} -c "CREATE SCHEMA \${staging_schema};"\\nfor z in *.zip; do \$UNZIPTOOL -o -d \$TMPDIR \$z; done\\ncd \$TMPDIR;',
    '\${PSQL}',
    '/',
    '\${SHP2PGSQL}',
    'export',
    E'for z in *\${table_name}*.dbf; do\\n\${loader} -D -s 4269 -g the_geom -W "latin1" \$z \${staging_schema}.\${state_abbrev}_\${table_name} | \${psql}\\n\${PSQL} -c "SELECT loader_load_staged_data(lower(\'\${state_abbrev}_\${table_name}\'), lower(\'\${state_abbrev}_\${lookup_name}\'));"\\ndone'
)
EOF
db_fnc_run_psql "${DB_NAME}" <<EOF
UPDATE tiger.loader_variables
SET
    tiger_year = '${TIGER_YEAR}',
    website_root = '${TIGER_WEBSITE_ROOT}',
    staging_fold = '${STAGING_FOLDER}',
    data_schema = 'tiger_data',
    staging_schema = 'tiger_staging'
EOF

Additional loading configuration includes:

--Impacts initial setup
UPDATE tiger.loader_lookuptables SET load = true WHERE table_name = 'zcta520';
--Impacts State specific setup
UPDATE tiger.loader_lookuptables SET load = true WHERE load = false AND lookup_name IN('tract', 'bg', 'tabblock20');

The final touches are straightforward and just maintenance and generation of the 'nation_loading script.'

psql -d tigercensus -c "ALTER EXTENSION postgis UPDATE;"
psql -d tigercensus -c "ALTER EXTENSION postgis_tiger_geocoder UPDATE;"
nation_script_path="${STAGING_FOLDER}/nation_script_load.sh"
psql -d tigercensus  -tA -c "SELECT Loader_Generate_Nation_Script('debbie');" > "${nation_script_path}";
#and finally to run it:
chmod +x "${nation_script_path}"
#if you want to run it with debug: bash -x "${nation_script}" 
bash "${nation_script_path}"

IV.State Loader Script

At this point, individual state data with boundaries-from tablock level and up, place, and features — roads, utilities, rails, water. This is done with the following:

$STATE_TO_LOAD="KS"
state_script_path="${STAGING_FOLDER}/${STATE_TO_LOAD}_script_load.sh"
psql -d tigercensus -tA -c "SELECT Loader_Generate_Script(Array['${STATE_TO_LOAD}'],'debbie');" > "${state_script_path}"
chmod +x "${state_script_path}"
#if you want to run it with debug: bash -x "${state_script_path}" 
bash "${state_script_path}"
#Alternative use: psql -c "SELECT Loader_Generate_Script(Array['TX','NM','SC','NC','TN'],'debbie')" -d testdb -tA > ~/GISData/batch_02.sh
flowchart TD
    StartState([Start State Loading]) --> SelectState[Select State by Abbreviation<br/>e.g., 'ks', 'tx', 'ca']
    SelectState --> GenStateScript[Generate State Script<br/>Loader_Generate_Script]
    
    GenStateScript --> DownloadState[Download State-Specific Files<br/>from Census TIGER]
    DownloadState --> ExtractState[Extract State ZIP Files]
    
    ExtractState --> ProcessState{Process Each County/Feature}
    ProcessState --> LoadCounty[Load County Shapefiles<br/>to tiger_staging]
    LoadCounty --> TransformCounty[Transform County Data<br/>loader_load_staged_data]
    
    TransformCounty --> ProcessState
    ProcessState --> StateComplete([State Loading Complete])
    
    subgraph "State Data Types"
        Roads[Roads & Streets]
        Boundaries[County Boundaries]
        Census[Census Blocks]
        Water[Water Features]
        Landmarks[Landmarks]
    end
    
    LoadCounty --> Roads
    LoadCounty --> Boundaries
    LoadCounty --> Census
    LoadCounty --> Water
    LoadCounty --> Landmarks
    
    style SelectState fill:#e8f5e8
    style DownloadState fill:#e3f2fd
    style LoadCounty fill:#fff3e0
    style TransformCounty fill:#fce4ec
Loading

A diagram (TO BE FIXED) showcasing this process

V. Conclusion

Example Output

While by no means will this be a finalized script/product, but rather a way to continue developing skills and help plan additional projects on my end. With this document, capturing the components at play will hopefully allow for others to either replicate, critique, or offer suggestions! Tentative goals with this as a 'framework' are:

  1. Manipulation of Geospatial data across years and various sources (ex. introducing multiple years of census data, loading rasterized data to supplement census data)
  2. Connection to other data sources (Census,FRED,BLS,USDA,USGS,CDC) to allow for robust analysis/modeling of economic data with integration of geospatial data
  3. QGIS/Plotly Loading via python (Interacting with this database to create projects that visualize the data)

About

Bash Project to handle creation, loading, and configuration of a Geospatial Database in Postgresql (POSTGIS) using US Census Tiger Geodata

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages