# TCAD file exploration

We have received files from a client.  They are ....

# Shorten files for browsing

To shorten the files for browsing, we can run a short shell script. It unzips the archive we received, truncates each TXT file to its first 100 lines, and zips the shortened copies.

```{bash, eval=F}
# rm -rf shortened_appraisal_files
unzip original_data/Appraisal_Roll_History_1990.zip -d shortened_appraisal_files
find shortened_appraisal_files -name "*.TXT" -exec sed -i.full 100q {} \;
find shortened_appraisal_files -name "*.TXT.full" -exec rm {} \;
zip -r shortened_appraisal_files.zip shortened_appraisal_files
```
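
The same truncation can be sketched in Python for machines without `sed`. This is a minimal sketch rather than part of the original workflow: the helper names `truncate_file` and `truncate_tree` are illustrative, and the 100-line default simply mirrors the shell script above. Files are handled as bytes so that their encoding is left untouched.

```python
from itertools import islice
from pathlib import Path


def truncate_file(path: Path, max_lines: int = 100) -> None:
    """Rewrite `path` in place, keeping only its first `max_lines` lines."""
    with path.open("rb") as f:
        head = list(islice(f, max_lines))  # read at most max_lines lines
    path.write_bytes(b"".join(head))


def truncate_tree(root: Path, pattern: str = "*.TXT", max_lines: int = 100) -> int:
    """Truncate every file under `root` matching `pattern`; return how many."""
    count = 0
    for path in root.rglob(pattern):
        truncate_file(path, max_lines)
        count += 1
    return count
```

Calling `truncate_tree(Path("shortened_appraisal_files"))` after unzipping would then play the role of the two `find` commands.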

We can now attempt to load a shortened file using pandas.

In [1]:
import pandas as pd

df = pd.read_csv("shortened_appraisal_files/Appraisal_Roll_History_1990_A/TCBC_SUM_1990_JURIS.TXT", sep = "|")
df.head()

Unnamed: 0,0000000003,0000,1990,02,0.56950,CI,Unnamed: 6,275,0,2923,...,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,4098.00,0.00,0.00.1,12.23,11.11,23.34
0,3,0,1990,3,0.409,CO,,275,0,2923,...,,,,,4098.0,0.0,0.0,16.76,0.0,16.76
1,3,0,1990,4,0.0001,CR,,275,0,2923,...,,,,,4098.0,0.0,0.0,0.0,0.0,0.0
2,3,0,1990,8,1.641,SD,,275,0,2923,...,,,Y,,4098.0,0.0,0.0,50.24,17.01,67.25
3,7,0,1990,1,1.266,SD,,25500,0,35000,...,,,Y,,78000.0,0.0,0.0,836.55,150.93,987.48
4,7,0,1990,2,0.5695,CI,,25500,0,35000,...,,,,,78000.0,0.0,0.0,232.75,211.46,444.21
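
In the output above, pandas appears to have used the first data row as the column names (`0000000003`, `0000`, ...) and to have parsed the zero-padded identifiers as numbers, which suggests the TXT files have no header row. A minimal sketch of a read that avoids both problems follows. The three-column sample is a hypothetical excerpt (the real file has many more columns), `io.StringIO` stands in for the file path, and the column names are the first three reported for this table later in the notebook.

```python
import io

import pandas as pd

# Hypothetical three-column excerpt standing in for the real TXT file.
sample = io.StringIO("0000000003|0000|1990\n0000000007|0000|1990\n")

df = pd.read_csv(
    sample,
    sep="|",
    header=None,                             # the file has no header row
    names=["AcctNum", "SufxId", "TaxYear"],  # supply the names explicitly
    dtype=str,                               # keep leading zeros intact
)
print(df.head())
```

Reading everything as strings is a deliberate choice for exploration: identifiers such as `AcctNum` keep their padding, and numeric columns can be converted later once their meaning is clear.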


In [2]:
# extract zip folder into a new folder
import zipfile
import os

# zip_file_path = "shortened_appraisal_files.zip"
zip_file_path = "original_data/Appraisal_Roll_History_1990.zip"
extract_folder_path = "data"

# Create the extract folder if it doesn't exist
if not os.path.exists(extract_folder_path):
    os.makedirs(extract_folder_path)

# Open the zip file and extract its contents to the extract folder
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_folder_path)

The challenge now is to use the *.TDF files to create tables.  I can think of two approaches.

1. The TDF files are SQL `CREATE TABLE` statements, so if they are fed to duckdb they should create tables into which the pipe-separated TXT files can be loaded.  Some datatypes may not be valid in duckdb, in which case the type names in the TDF definitions would need to be mapped to duckdb equivalents.

2. Take the column names out of the TDF files and supply them as the column names when reading the matching TXT files into duckdb.  This relies on duckdb's automatic type detection, so it would run, but it might guess wrongly and truncate or change data.

I think we should explore approach 1 first.
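
Approach 2 depends on pulling the column names out of each TDF file. A minimal sketch of that step is below, assuming each TDF holds a single `CREATE TABLE name ( column type, ... );` statement with one column per line. The helper name `column_names_from_tdf` and the sample text are illustrative and not copied from the client files.

```python
import re


def column_names_from_tdf(create_table_sql: str) -> list[str]:
    """Return the column names of a single CREATE TABLE statement."""
    # Keep only the text between the first "(" and the last ")".
    start = create_table_sql.index("(") + 1
    end = create_table_sql.rindex(")")
    body = create_table_sql[start:end]

    names = []
    for line in body.splitlines():
        # A column line starts with an identifier followed by its type.
        match = re.match(r"\s*([A-Za-z_][A-Za-z0-9_]*)\s+\S", line)
        if match:
            names.append(match.group(1))
    return names


# Hypothetical TDF content used only to exercise the helper.
sample_tdf = """CREATE TABLE TCBC_SUM_1990_GRANT_EXMP (
AcctNum VARCHAR(10),
SufxId VARCHAR(4),
ExemNum VARCHAR(1));
"""
print(column_names_from_tdf(sample_tdf))
```

The resulting list could then be passed as the column names when reading the matching TXT file.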

## Creating tables using the TDF files

We have TDF files scattered through the \_A and \_B folders, and each folder contains both TCBC and TXBC files.  I have created one schema (a namespace) per folder and file type: folder_A_TCBC, folder_A_TXBC, folder_B_TCBC, and folder_B_TXBC.  Tables with the same name therefore exist in more than one schema, and you can reference them as folder_A_TCBC.TCBC_SUM_1990_JURIS and folder_B_TCBC.TCBC_SUM_1990_JURIS.

We can use Python to read each TDF file separately, create the table, and then try to load the matching TXT file.  Some guidance on processing a directory structure of files with Path and glob is available here:
http://howisonlab.github.io/datawrangling/faq.html#get-data-from-filenames

In [3]:
import csv
from pathlib import Path
import duckdb

con = duckdb.connect('duckdb-file.db')  # passing a file path persists the database to disk
cursor = con.cursor()

# file_directory = 'shortened_appraisal_files/'
file_directory = 'data/'
# limit_to_file = 'TCBC_SUM_1990_JURIS'
limit_to_file = '*' # all files

# create schemas
cursor.execute("CREATE SCHEMA IF NOT EXISTS folder_A_TCBC;")
cursor.execute("CREATE SCHEMA IF NOT EXISTS folder_A_TXBC;")
cursor.execute("CREATE SCHEMA IF NOT EXISTS folder_B_TCBC;")
cursor.execute("CREATE SCHEMA IF NOT EXISTS folder_B_TXBC;")
# drop schemas that were created previously
# cursor.execute("DROP SCHEMA IF EXISTS folder_A CASCADE")
# cursor.execute("DROP SCHEMA IF EXISTS folder_B CASCADE")

for filename in Path(file_directory).rglob(limit_to_file + '.TDF'):
    print(filename.parts)
    if "_A" in filename.parts[1] and "TCBC_" in filename.parts[2]:
        schema = "folder_A_TCBC"
    elif "_A" in filename.parts[1] and "TXBC_" in filename.parts[2]:
        schema = "folder_A_TXBC"
    elif "_B" in filename.parts[1] and "TCBC_" in filename.parts[2]:
        schema = "folder_B_TCBC"
    elif "_B" in filename.parts[1] and "TXBC_" in filename.parts[2]:
        schema = "folder_B_TXBC"
    else:
        raise ValueError(f"can't set schema for {filename}")

    table_name = schema + "." + Path(filename).stem # e.g., folder_A_TCBC.TCBC_SUM_1990_JURIS

    # read .TDF file into string
    create_table_sql = Path(filename).read_text()
    # Qualify the table name with its schema so that the _A and _B files load into separate tables
    create_table_sql = create_table_sql.replace(Path(filename).stem, table_name)
    
    # Here we have the table creation code in a string, so we can
    # swap datatypes out.
    # tried SMALLDATETIME --> DATETIME but was still giving errors
    # will need to fix this later.
    create_table_sql = create_table_sql.replace("SMALLDATETIME", "TEXT")
    create_table_sql = create_table_sql.replace("CREATE TABLE", "CREATE TABLE IF NOT EXISTS")    
    create_table_sql = f"DROP TABLE IF EXISTS {table_name}; "+ create_table_sql
    

    # Execute that SQL with duckdb; this drops any existing table and recreates it.
    # print(create_table_sql)  # uncomment to inspect the generated SQL
    cursor.execute(create_table_sql)

    # copy CSV into duckdb. CSV is the matching .TXT
    path_to_csvpipefile = Path(filename).with_suffix(".TXT")
    # duckdb copy documentation: https://duckdb.org/docs/sql/statements/copy.html
    query = f"COPY {table_name} FROM '{path_to_csvpipefile}' ( DELIMITER '|')"
    # print(query)
    cursor.execute(query)

('data', 'Appraisal_Roll_History_1990_A', 'TCBC_SUM_1990_GRANT_EXMP.TDF')
('data', 'Appraisal_Roll_History_1990_A', 'TXBC_SUM_1990_JURIS_EXMP.TDF')
('data', 'Appraisal_Roll_History_1990_A', 'TXBC_SUM_1990_SUSP_INIT.TDF')
('data', 'Appraisal_Roll_History_1990_A', 'TCBC_SUM_1990_SUSP.TDF')
('data', 'Appraisal_Roll_History_1990_A', 'TCBC_SUM_1990_JURIS.TDF')
('data', 'Appraisal_Roll_History_1990_A', 'TXBC_SUM_1990_JURIS.TDF')
('data', 'Appraisal_Roll_History_1990_A', 'TXBC_SUM_1990_SUSP.TDF')
('data', 'Appraisal_Roll_History_1990_A', 'TXBC_SUM_1990.TDF')
('data', 'Appraisal_Roll_History_1990_A', 'TCBC_SUM_1990_SUSP_INIT.TDF')
('data', 'Appraisal_Roll_History_1990_A', 'TCBC_SUM_1990_JURIS_EXMP.TDF')
('data', 'Appraisal_Roll_History_1990_A', 'TXBC_SUM_1990_GRANT_EXMP.TDF')
('data', 'Appraisal_Roll_History_1990_A', 'TCBC_SUM_1990.TDF')
('data', 'Appraisal_Roll_History_1990_A', 'TCBC_SUM_1990_LEGAL.TDF')
('data', 'Appraisal_Roll_History_1990_A', 'TCBC_SUM_1990_CFOR.TDF')
('data', 'Appraisal_R
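
The datatype swap in the loading cell uses plain `str.replace`, which would also rewrite a type name that happened to occur inside a longer identifier. One possible refinement, sketched here under the assumption that type names appear as whole words, is a small mapping applied with word-boundary regular expressions. The names `TYPE_MAP` and `map_datatypes` are illustrative; the only mapping taken from the notebook is `SMALLDATETIME` to `TEXT`.

```python
import re

# Source type name -> replacement type name. Further entries would be
# added as other incompatible types turn up.
TYPE_MAP = {"SMALLDATETIME": "TEXT"}


def map_datatypes(create_table_sql: str, type_map: dict[str, str]) -> str:
    """Replace whole-word type names in a CREATE TABLE statement."""
    for source_type, target_type in type_map.items():
        pattern = rf"\b{re.escape(source_type)}\b"
        create_table_sql = re.sub(pattern, target_type, create_table_sql)
    return create_table_sql


print(map_datatypes("RunDate SMALLDATETIME,", TYPE_MAP))
```

Keeping the mapping in one dictionary also makes it easier to revisit the `SMALLDATETIME` handling later without touching the loading loop.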

Convert the TDF table definitions into the table syntax used for dbdocs.

In [4]:
# set up sql for dbdocs
for filename in Path(file_directory).rglob(limit_to_file + '.TDF'):

    # read .TDF file into string
    dbdocs_create_table = Path(filename).read_text()

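    # Remove the comma that follows a closing parenthesis, e.g. "VARCHAR(10)," -> "VARCHAR(10)".
    # Plain string replacement, not a regular expression; types without
    # parentheses (such as SMALLDATETIME) keep their trailing comma.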
    dbdocs_create_table = dbdocs_create_table.replace("),", ")")

    # Replacements for dbdocs
    dbdocs_create_table = dbdocs_create_table.replace("CREATE TABLE", "TABLE")
    dbdocs_create_table = dbdocs_create_table.replace("SMALLDATETIME", "TEXT")
    dbdocs_create_table = dbdocs_create_table.replace(" (", "{ ")
    dbdocs_create_table = dbdocs_create_table.replace(");", " }")
    
    # Print the updated SQL table code
    print(dbdocs_create_table)


TABLE TCBC_SUM_1990_GRANT_EXMP{ 
AcctNum VARCHAR(10)
SufxId VARCHAR(4)
TaxYear VARCHAR(4)
ExemType VARCHAR(1)
ExemNum VARCHAR(1) }

TABLE TXBC_SUM_1990_JURIS_EXMP{ 
Parcel VARCHAR(10)
OwnrId VARCHAR(4)
TaxYear VARCHAR(4)
Juris VARCHAR(2)
ExemType VARCHAR(1)
ExemNum VARCHAR(1)
ExemAmt NUMERIC(11,0) }

TABLE TXBC_SUM_1990_SUSP_INIT{ 
Parcel VARCHAR(10)
OwnrId VARCHAR(4)
TaxYear VARCHAR(4)
ARBInit VARCHAR(3) }

TABLE TCBC_SUM_1990_SUSP{ 
AcctNum VARCHAR(10)
SufxId VARCHAR(4)
TaxYear VARCHAR(4)
InformalDate TEXT,
FormalDate TEXT,
HearingType VARCHAR(1)
HearingOrigType VARCHAR(1)
HearingReasonCode VARCHAR(2)
DocketYear VARCHAR(4)
DocketNum VARCHAR(6)
InformalArea VARCHAR(1)
InformalApprInit VARCHAR(3)
ValApprInit VARCHAR(3)
AgentARBTemp VARCHAR(4)
LateStatus VARCHAR(1)
SuppFlag VARCHAR(1)
HoldFlag VARCHAR(1)
AreaChgFlag VARCHAR(1)
PrintFlag VARCHAR(1)
UseInfoAddrFlag VARCHAR(1)
CtrlAcctNum VARCHAR(10)
CtrlSufxId VARCHAR(4)
CtrlTaxYear VARCHAR(4) }

TABLE TCBC_SUM_1990_JURIS{ 
AcctNum VARCHA
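
In the output above, the `TEXT` columns keep their trailing commas because the cell only removes a comma that directly follows a closing parenthesis. A sketch of one way to strip every end-of-line comma is below. It assumes one column per line, as in the output shown; the function name `tdf_to_dbdocs` and the sample statement are illustrative.

```python
import re


def tdf_to_dbdocs(create_table_sql: str) -> str:
    """Convert a CREATE TABLE statement to the brace style printed above."""
    # Drop any comma at the end of a line, whatever precedes it.
    text = re.sub(r",[ \t]*$", "", create_table_sql, flags=re.MULTILINE)
    text = text.replace("CREATE TABLE", "TABLE")
    text = text.replace("SMALLDATETIME", "TEXT")
    text = text.replace(" (", "{ ", 1)  # opening parenthesis of the table body
    text = text.replace(");", " }")     # closing parenthesis of the table body
    return text


# Hypothetical TDF content used only to exercise the function.
sample_tdf = """CREATE TABLE TCBC_SUM_1990_SUSP (
AcctNum VARCHAR(10),
InformalDate SMALLDATETIME,
HearingType VARCHAR(1));
"""
print(tdf_to_dbdocs(sample_tdf))
```

Commas inside a type such as `NUMERIC(11,0)` are left alone because they are not at the end of a line.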

# SQL for analysis

In [5]:
# setup from https://duckdb.org/docs/guides/python/jupyter.html
import duckdb
import pandas as pd
# No need to import duckdb_engine
#  jupysql will auto-detect the driver needed based on the connection string!

# Import jupysql Jupyter extension to create SQL cells
%load_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

In [6]:
%sql duckdb:///duckdb-file.db

In [7]:
%%sql
SHOW TABLES -- no schema name

Unnamed: 0,name
0,TCBC_SUM_1990
1,TCBC_SUM_1990
2,TCBC_SUM_1990_CFOR
3,TCBC_SUM_1990_CFOR
4,TCBC_SUM_1990_GRANT_EXMP
5,TCBC_SUM_1990_GRANT_EXMP
6,TCBC_SUM_1990_JURIS
7,TCBC_SUM_1990_JURIS
8,TCBC_SUM_1990_JURIS_EXMP
9,TCBC_SUM_1990_JURIS_EXMP


Helpfully, duckdb exposes Postgres-style catalog views such as pg_catalog.pg_tables, so the same queries used with Postgres can list the tables together with their schema names.

In [8]:
%%sql
SELECT schemaname AS schema_name, tablename AS table_name
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog'
AND schemaname != 'information_schema'
ORDER BY schemaname, tablename ASC;

Unnamed: 0,schema_name,table_name
0,folder_A_TCBC,TCBC_SUM_1990
1,folder_A_TCBC,TCBC_SUM_1990_CFOR
2,folder_A_TCBC,TCBC_SUM_1990_GRANT_EXMP
3,folder_A_TCBC,TCBC_SUM_1990_JURIS
4,folder_A_TCBC,TCBC_SUM_1990_JURIS_EXMP
5,folder_A_TCBC,TCBC_SUM_1990_LEGAL
6,folder_A_TCBC,TCBC_SUM_1990_SUSP
7,folder_A_TCBC,TCBC_SUM_1990_SUSP_INIT
8,folder_A_TXBC,TXBC_SUM_1990
9,folder_A_TXBC,TXBC_SUM_1990_CFOR


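TCBC_SUM_1990_JURIS should contain 134,933 rows in total. Earlier the rows accumulated every time the load cell was rerun; this is fixed now that the table is dropped and recreated before each load.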

JURIS probably means "jurisdiction" which means a legal area.  This makes sense because the columns are about tax rates (and metadata about tax status, like 'freeport').  So possibly this file is a list of jurisdictions to which a parcel can belong (and therefore holds the rates that would apply to the parcel?). It is surprising to have 134,933 different jurisdictions though!

In [9]:
%%sql
SELECT * FROM folder_A_TCBC.TCBC_SUM_1990_JURIS;

Unnamed: 0,AcctNum,SufxId,TaxYear,Juris,Rate,JurisType,JurisCED,MdseVal,FrptVal,FFEVal,...,ExmpStatFlag,JurisPctFlag,FreeportFlag,FreeportStatus,AssessVal,TaxFrzVal,TaxBeforeFrz,GenFundTax,SinkFundTax,TotTax
0,0000000003,0000,1990,02,0.56950,CI,,275,0,2923,...,,,,,4098.00,0.00,0.00,12.23,11.11,23.34
1,0000000003,0000,1990,03,0.40900,CO,,275,0,2923,...,,,,,4098.00,0.00,0.00,16.76,0.00,16.76
2,0000000003,0000,1990,04,0.00010,CR,,275,0,2923,...,,,,,4098.00,0.00,0.00,0.00,0.00,0.00
3,0000000003,0000,1990,08,1.64100,SD,,275,0,2923,...,,,Y,,4098.00,0.00,0.00,50.24,17.01,67.25
4,0000000007,0000,1990,01,1.26600,SD,,25500,0,35000,...,,,Y,,78000.00,0.00,0.00,836.55,150.93,987.48
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
134928,0000061017,0000,1990,01,1.26600,SD,,0,0,0,...,,,Y,,100653.00,0.00,0.00,1079.51,194.76,1274.27
134929,0000061017,0000,1990,02,0.56950,CI,,0,0,0,...,,,,,100653.00,0.00,0.00,300.35,272.87,573.22
134930,0000061017,0000,1990,03,0.40900,CO,,0,0,0,...,,,,,100653.00,0.00,0.00,411.67,0.00,411.67
134931,0000061017,0000,1990,04,0.00010,CR,,0,0,0,...,,,,,100653.00,0.00,0.00,0.10,0.00,0.10
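
The rows shown above repeat the same AcctNum with different Juris values, which suggests the table may hold one row per account and jurisdiction rather than one row per jurisdiction. A quick way to check is to compare the row count with the number of distinct jurisdictions and accounts. The sketch below uses Python's built-in `sqlite3` and six key combinations copied from the outputs in this notebook, purely so that it runs without the project database. The same `SELECT` should be usable against folder_A_TCBC.TCBC_SUM_1990_JURIS in duckdb, though that has not been run here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE TCBC_SUM_1990_JURIS (AcctNum TEXT, SufxId TEXT, Juris TEXT)"
)
# (AcctNum, SufxId, Juris) for the first six rows shown in this notebook.
rows = [
    ("0000000003", "0000", "02"),
    ("0000000003", "0000", "03"),
    ("0000000003", "0000", "04"),
    ("0000000003", "0000", "08"),
    ("0000000007", "0000", "01"),
    ("0000000007", "0000", "02"),
]
con.executemany("INSERT INTO TCBC_SUM_1990_JURIS VALUES (?, ?, ?)", rows)

n_rows, n_juris, n_accounts = con.execute(
    """
    SELECT COUNT(*),
           COUNT(DISTINCT Juris),
           COUNT(DISTINCT AcctNum || '-' || SufxId)
    FROM TCBC_SUM_1990_JURIS
    """
).fetchone()
print(n_rows, n_juris, n_accounts)
```

If the full table shows only a small number of distinct Juris values, the 134,933 rows would be account-jurisdiction pairs rather than 134,933 jurisdictions.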


The table without a suffix (TCBC_SUM_1990) has only 28,086 rows.  Perhaps these are accounts for individual taxpayers, although one taxpayer can have multiple account numbers.

In [10]:
%%sql
SELECT * FROM folder_A_TCBC.TCBC_SUM_1990;

Unnamed: 0,AcctNum,SufxId,TaxYear,RunDate,KeyCode,LoanCo,LoanNum,ExmpCode,LocStreet,LocHouse,...,Zip5,Zip4,Zip2,MailCnt,MailAddr1,MailAddr2,MailAddr3,MailAddr4,MailAddr5,ComboRate
0,0000000003,0000,1990,1992-07-06,,0,,,MO-PAC CI,001004,...,78767,0971,,4,A & A REALTY TAX SERVICE,INC,P O BOX 971,AUSTIN TX 78767-0971,,2.61950
1,0000000007,0000,1990,1992-07-06,,0,,,5 ST E,002811,...,78744,,,4,A & J CARPET/JANITORIAL,SERVICE INC,4122 TODD LANE,AUSTIN TX 78744,,2.29450
2,0000000014,0000,1990,1992-07-06,,0,,,KENTSHIRE CI,000603,...,78704,5615,,4,A A A COMMERCIAL,STRIPING,603 KENTSHIRE CIR #B,AUSTIN TX 78704-5615,,2.29450
3,0000000015,0000,1990,1992-07-06,,0,,,BEN WHITE BV E,004818,...,78759,,,4,A A A CONSTRUCTION,INSPECTIONS INC,8500 NORTH MOPAC #813,AUSTIN TX 78759,,2.08850
4,0000000018,0000,1990,1992-07-06,,0,,,BURNET RD,004402,...,78765,4674,,4,A A A FILTER SERVICE,CORP,P O BOX 4674,AUSTIN TX 78765-4674,,2.29450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28081,0000060425,0000,1990,1992-07-06,,0,,,RED RIVER ST,000912,...,78703,,,3,JOSEPH SALEM,1500 SCENIC DR #106,AUSTIN TX 78703,,,2.29450
28082,0000060456,0000,1990,1992-07-06,,0,,,WILLIAM CANNON DR W,000414,...,78745,5664,,3,FLOWERS BY HAND,414 W WILLIAM CANNON #8,AUSTIN TX 78745-5664,,,2.29450
28083,0000060832,0000,1990,1992-07-06,,0,,,AMERICAN DR,003404,...,78645,6500,,3,THE PRIME GROUP,3404 AMERICAN DR,LAGO VISTA TX 78645-6500,,,2.61850
28084,0000060999,0000,1990,1992-07-06,,0,,,YAGER LN W,000615,...,78753,,,4,CONCRETE CORING CO INC,ATTN: MARTHA TURNER,615 YAGER LANE WEST,AUSTIN TX 78753,,1.91400


Skip to the middle of the table to look at the TCBC summary file in more detail.

In [11]:
%%sql
SELECT * FROM folder_A_TCBC.TCBC_SUM_1990
LIMIT 100
OFFSET 20000;

Unnamed: 0,AcctNum,SufxId,TaxYear,RunDate,KeyCode,LoanCo,LoanNum,ExmpCode,LocStreet,LocHouse,...,Zip5,Zip4,Zip2,MailCnt,MailAddr1,MailAddr2,MailAddr3,MailAddr4,MailAddr5,ComboRate
0,0000046663,0000,1990,1992-07-06,,0,,,SHOAL CREEK BV,008900,...,78758,6840,,4,AUSTIN SHOE HOSPITAL,%TRAVIS CTY SHOE HOSP,8900 SHOAL CREEK BV #103,AUSTIN TX 78758-6840,,2.29450
1,0000046666,0000,1990,1992-07-06,,0,,,LA POSADA DR,001016,...,78752,3895,,4,T O A CREDIT UNION,% MANAGER,1016 LA POSADA DR #174,AUSTIN TX 78752-3895,,2.29450
2,0000046667,0000,1990,1992-07-06,,0,,,TOMANET TR,012412,...,78758,2412,,3,PARMER LANE DAY CARE,12412 TOMANET TRAIL,AUSTIN TX 78758-2412,,,1.75500
3,0000046668,0000,1990,1992-07-06,,0,,,ANDERSON LN W,001810,...,78757,1338,,3,BOOK EXCHANGE THE,1810 WEST ANDERSON LN,AUSTIN TX 78757-1338,,,2.29450
4,0000046680,0000,1990,1992-07-06,,0,,,HIDALGO ST,003411,...,78220,0243,,4,SEVEN UP LIKE BOTTLING,% GRANT LYDICK BEVERAGE,P O BOX 200243,SAN ANTONIO TX 78220-0243,,2.29450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0000046847,0002,1990,1992-07-06,,0,,,CAPITAL OF TX HY N,009020,...,75024,,,5,AMERICAN NETWORK LEASING,LEASE #882119,% EDS (S TAX) - PPT,5400 LEGACY DR,PLANO TX 75024,2.29450
96,0000046847,0003,1990,1992-07-06,,0,,,THERMAL DR,013804,...,75024,,,5,AMERICAN NETWORK LEASING,LEASE #325,% EDS (S TAX) - PPT,5400 LEGACY DR,PLANO TX 75024,1.91400
97,0000046847,0004,1990,1992-07-06,,0,,,BEE CAVES RD,004015,...,75024,,,5,AMERICAN NETWORK LEASING,LEASE # 692,% EDS (S TAX) - PPT,5400 LEGACY DR,PLANO TX 75024,2.35130
98,0000046847,0006,1990,1992-07-06,,0,,,CONGRESS AV S,007110,...,75024,,,5,AMERICAN NETWORK LEASING,LEASE #409,% EDS (S TAX) - PPT,5400 LEGACY DR,PLANO TX 75024,2.29450
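
Rows 95 to 98 above share one AcctNum but have different SufxId values, so an account can apparently carry several suffixed records. A sketch of a query that lists such accounts is below. It runs on Python's built-in `sqlite3` with a handful of key pairs copied from the output above, as a stand-in for the duckdb table; the table name `accounts` is illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (AcctNum TEXT, SufxId TEXT)")
# (AcctNum, SufxId) pairs copied from the output above.
pairs = [
    ("0000046663", "0000"),
    ("0000046666", "0000"),
    ("0000046847", "0002"),
    ("0000046847", "0003"),
    ("0000046847", "0004"),
    ("0000046847", "0006"),
]
con.executemany("INSERT INTO accounts VALUES (?, ?)", pairs)

multi_suffix = con.execute(
    """
    SELECT AcctNum, COUNT(*) AS n_records
    FROM accounts
    GROUP BY AcctNum
    HAVING COUNT(*) > 1
    ORDER BY n_records DESC
    """
).fetchall()
print(multi_suffix)
```

Against the real data, the `FROM` clause would name folder_A_TCBC.TCBC_SUM_1990 instead of the stand-in table.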


The table without a suffix (TXBC_SUM_1990) has 255,593 rows.  Perhaps these are records for individual taxpayers, although one taxpayer can have multiple parcels.

In [12]:
%%sql
SELECT * FROM folder_A_TXBC.TXBC_SUM_1990;

Unnamed: 0,Parcel,OwnrId,TaxYear,RunDate,KeyCode,LoanCo,LoanNum,ExmpCode,ExmpLandCode,ExmpImprCode,...,Zip5,Zip4,Zip2,MailCnt,MailAddr1,MailAddr2,MailAddr3,MailAddr4,MailAddr5,ComboRate
0,0100000003,0000,1990,1992-06-13,EX,990,21-042491,05,05,,...,77001,,,5,CITY OF AUSTIN,% SOUTHERN PACIFIC,TRANSPORTATION CO,P O BOX 1319,HOUSTON TX 77001,1.31600
1,0100000003,0001,1990,1992-06-13,EX,990,,05,05,,...,77001,,,5,CITY OF AUSTIN,% SOUTHERN PACIFIC,TRANSPORATION CO,P O BOX 1319,HOUSTON TX 77001,0.56950
2,0100000003,0002,1990,1992-06-13,EX,990,,05,05,,...,77001,,,5,CITY OF AUSTIN,% SOUTHERN PACIFIC,TRANSPORTATION CO,P O BOX 1319,HOUSTON TX 77001,0.40900
3,0100000003,0003,1990,1992-06-13,EX,990,,05,05,,...,77001,,,5,CITY OF AUSTIN,% SOUTHERN PACIFIC,TRANSPORTATION CO,P O BOX 1319,HOUSTON TX 77001,1.05460
4,0100000003,0004,1990,1992-06-13,EX,990,,05,05,,...,77001,,,5,CITY OF AUSTIN,% SOUTHERN PACIFIC,TRANSPORTATION CO,P O BOX 1319,HOUSTON TX 77001,0.49510
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255588,0667090302,0000,1990,1992-06-13,,0,,,,,...,78731,2901,,3,DAUGHERTY EDGAR S,3988 FAR WEST BLVD,AUSTIN TX 78731-2901,,,0.40900
255589,0667090303,0000,1990,1992-06-13,EX,980,,35,35,,...,78617,9638,,3,ONTIBEROS LEROY A (VLB),11 EDGEROCK DRIVE,DEL VALLE TX 78617-9638,,,0.40900
255590,0667090304,0000,1990,1992-06-13,,0,,,,,...,78763,5666,,3,SPIRES ALBERT B JR,P O BOX 5666,AUSTIN TX 78763-5666,,,0.40900
255591,0667190101,0000,1990,1992-06-13,,0,,,,,...,78615,0027,,3,GOETZ WILLIAM T,BOX 27,COUPLAND TX 78615-0027,,,1.79900


Looking for columns that may relate to location:

In [13]:
%%sql
SELECT * FROM information_schema.columns
WHERE column_name LIKE '%Loc%'
ORDER BY table_schema, table_name;

Unnamed: 0,table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length,character_octet_length,...,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_updatable
0,duckdb-file,folder_A_TCBC,TCBC_SUM_1990,LocStreet,9,,YES,VARCHAR,,,...,,,,,,,,,,
1,duckdb-file,folder_A_TCBC,TCBC_SUM_1990,LocHouse,10,,YES,VARCHAR,,,...,,,,,,,,,,
2,duckdb-file,folder_A_TCBC,TCBC_SUM_1990,LocFrac,11,,YES,VARCHAR,,,...,,,,,,,,,,
3,duckdb-file,folder_A_TCBC,TCBC_SUM_1990,LocAlpha,12,,YES,VARCHAR,,,...,,,,,,,,,,
4,duckdb-file,folder_A_TCBC,TCBC_SUM_1990,LocUnit,13,,YES,VARCHAR,,,...,,,,,,,,,,
5,duckdb-file,folder_A_TCBC,TCBC_SUM_1990,LocZip,14,,YES,VARCHAR,,,...,,,,,,,,,,
6,duckdb-file,folder_A_TCBC,TCBC_SUM_1990,FmtLoc,15,,YES,VARCHAR,,,...,,,,,,,,,,
7,duckdb-file,folder_A_TXBC,TXBC_SUM_1990,LocStreet,11,,YES,VARCHAR,,,...,,,,,,,,,,
8,duckdb-file,folder_A_TXBC,TXBC_SUM_1990,LocHouse,12,,YES,VARCHAR,,,...,,,,,,,,,,
9,duckdb-file,folder_A_TXBC,TXBC_SUM_1990,LocFrac,13,,YES,VARCHAR,,,...,,,,,,,,,,


Location information is important for the research, so we concentrate on it here. FmtLoc holds the formatted address of the record, and the other columns (LocStreet, LocHouse, LocFrac, LocAlpha, LocUnit, and LocZip) hold its individual components. This applies to both TCBC and TXBC records.

The sample below comes from the TCBC_SUM_1990 table in the folder_A_TCBC schema.

In [14]:
%%sql
SELECT FmtLoc, 
       LocStreet, 
       LocHouse, 
       LocFrac, 
       LocAlpha, 
       LocUnit, 
       LocZip 
       FROM folder_A_TCBC.TCBC_SUM_1990;

Unnamed: 0,FmtLoc,LocStreet,LocHouse,LocFrac,LocAlpha,LocUnit,LocZip
0,1004 MO-PAC CI 101,MO-PAC CI,001004,,,00101,78746
1,2811 5 ST E,5 ST E,002811,,,,MULTI
2,603 KENTSHIRE CI,KENTSHIRE CI,000603,,,,78704
3,4818 BEN WHITE BV E 202,BEN WHITE BV E,004818,,,00202,MULTI
4,4402 BURNET RD,BURNET RD,004402,,,,MULTI
...,...,...,...,...,...,...,...
28081,912 RED RIVER ST,RED RIVER ST,000912,,,,MULTI
28082,414 WILLIAM CANNON DR W 8,WILLIAM CANNON DR W,000414,,,00008,
28083,3404 AMERICAN DR,AMERICAN DR,003404,,,,78641
28084,615 YAGER LN W,YAGER LN W,000615,,,,78753
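
In the rows above, FmtLoc looks like the house number, street, and unit joined with spaces after the zero padding is removed. A minimal sketch of that reconstruction follows. The function name `format_location` is illustrative, and LocFrac and LocAlpha are left out because the sample rows do not show where they would go.

```python
from typing import Optional


def format_location(street: Optional[str], house: Optional[str],
                    unit: Optional[str] = None) -> str:
    """Join house number, street, and unit, dropping zero padding and blanks."""
    parts = [
        (house or "").lstrip("0"),  # "001004" -> "1004"
        street or "",
        (unit or "").lstrip("0"),   # "00101" -> "101"
    ]
    return " ".join(part for part in parts if part)


print(format_location("MO-PAC CI", "001004", "00101"))
```

Comparing the result with FmtLoc across the whole table would show whether the components are a reliable substitute for the formatted string.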


Searching for columns whose names include 'arcel' to find parcel numbers:

In the TCBC files, the only column that relates to parcels is LinkParcel.

The Parcel column itself appears throughout the TXBC files.

In [15]:
%%sql
SELECT * FROM information_schema.columns
WHERE column_name LIKE '%arcel%'
ORDER BY table_name;

Unnamed: 0,table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length,character_octet_length,...,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_updatable
0,duckdb-file,folder_A_TCBC,TCBC_SUM_1990,LinkParcel,29,,YES,VARCHAR,,,...,,,,,,,,,,
1,duckdb-file,folder_B_TCBC,TCBC_SUM_1990,LinkParcel,29,,YES,VARCHAR,,,...,,,,,,,,,,
2,duckdb-file,folder_B_TXBC,TXBC_SUM_1990,RefParcel1,73,,YES,VARCHAR,,,...,,,,,,,,,,
3,duckdb-file,folder_B_TXBC,TXBC_SUM_1990,RefParcel3,75,,YES,VARCHAR,,,...,,,,,,,,,,
4,duckdb-file,folder_B_TXBC,TXBC_SUM_1990,Parcel,1,,YES,VARCHAR,,,...,,,,,,,,,,
5,duckdb-file,folder_B_TXBC,TXBC_SUM_1990,RefParcel2,74,,YES,VARCHAR,,,...,,,,,,,,,,
6,duckdb-file,folder_A_TXBC,TXBC_SUM_1990,RefParcel3,75,,YES,VARCHAR,,,...,,,,,,,,,,
7,duckdb-file,folder_A_TXBC,TXBC_SUM_1990,RefParcel2,74,,YES,VARCHAR,,,...,,,,,,,,,,
8,duckdb-file,folder_A_TXBC,TXBC_SUM_1990,RefParcel1,73,,YES,VARCHAR,,,...,,,,,,,,,,
9,duckdb-file,folder_A_TXBC,TXBC_SUM_1990,Parcel,1,,YES,VARCHAR,,,...,,,,,,,,,,


The TCBC files carry no usable parcel link, because the only value in the LinkParcel column is None. Parcel numbers are therefore found only in the TXBC files.

In [16]:
%%sql
SELECT DISTINCT LinkParcel FROM folder_A_TCBC.TCBC_SUM_1990;

Unnamed: 0,LinkParcel
0,


Searching for columns that relate to use:

TCBC_SUM_1990_SUSP - UseInfoAddrFlag

TXBC_SUM_1990 - AgUseCode, AgUseMulti, UseCode, UseMulti, UseClass

TXBC_SUM_1990_SUSP - UseInfoAddrFlag

In [17]:
%%sql
SELECT * FROM information_schema.columns
WHERE column_name LIKE '%Use%'
ORDER BY table_name;

Unnamed: 0,table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length,character_octet_length,...,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_updatable
0,duckdb-file,folder_A_TCBC,TCBC_SUM_1990_SUSP,UseInfoAddrFlag,20,,YES,VARCHAR,,,...,,,,,,,,,,
1,duckdb-file,folder_B_TCBC,TCBC_SUM_1990_SUSP,UseInfoAddrFlag,20,,YES,VARCHAR,,,...,,,,,,,,,,
2,duckdb-file,folder_A_TXBC,TXBC_SUM_1990,AgUseCode,22,,YES,VARCHAR,,,...,,,,,,,,,,
3,duckdb-file,folder_A_TXBC,TXBC_SUM_1990,AgUseMulti,23,,YES,VARCHAR,,,...,,,,,,,,,,
4,duckdb-file,folder_A_TXBC,TXBC_SUM_1990,UseCode,68,,YES,VARCHAR,,,...,,,,,,,,,,
5,duckdb-file,folder_A_TXBC,TXBC_SUM_1990,UseMulti,69,,YES,VARCHAR,,,...,,,,,,,,,,
6,duckdb-file,folder_A_TXBC,TXBC_SUM_1990,UseClass,70,,YES,VARCHAR,,,...,,,,,,,,,,
7,duckdb-file,folder_B_TXBC,TXBC_SUM_1990,AgUseCode,22,,YES,VARCHAR,,,...,,,,,,,,,,
8,duckdb-file,folder_B_TXBC,TXBC_SUM_1990,AgUseMulti,23,,YES,VARCHAR,,,...,,,,,,,,,,
9,duckdb-file,folder_B_TXBC,TXBC_SUM_1990,UseCode,68,,YES,VARCHAR,,,...,,,,,,,,,,


None of the files with the SUSP suffix contain any data.

In [18]:
%%sql
SELECT * FROM folder_A_TCBC.TCBC_SUM_1990_SUSP;

In [19]:
%%sql
SELECT * FROM folder_A_TXBC.TXBC_SUM_1990_SUSP;

The column we are looking for on use is most likely UseCode in the TXBC_SUM_1990 table. It holds a two-digit code, which may simply be a key that matches information held elsewhere.

In [20]:
%%sql
SELECT DISTINCT AgUseCode, AgUseMulti, UseCode, UseMulti, UseClass FROM folder_A_TXBC.TXBC_SUM_1990;

Unnamed: 0,AgUseCode,AgUseMulti,UseCode,UseMulti,UseClass
0,,,,,
1,,,01,,
2,,,13,,
3,,,00,,
4,,,11,,
...,...,...,...,...,...
120,,,73,,
121,,,72,*,
122,,,89,,
123,,,45,*,


Searching for columns that might hold square footage ("sq ft"):

Both the TCBC and TXBC tables with no suffix (_SUM_1990) have a column named "TotSqft", which may be the data we are looking for.

In [21]:
%%sql
SELECT * FROM information_schema.columns
WHERE column_name LIKE '%ft%'
ORDER BY table_name;

Unnamed: 0,table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length,character_octet_length,...,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_updatable
0,duckdb-file,folder_A_TCBC,TCBC_SUM_1990,TotSqft,17,,YES,"DECIMAL(9,0)",,,...,,,,,,,,,,
1,duckdb-file,folder_B_TCBC,TCBC_SUM_1990,TotSqft,17,,YES,"DECIMAL(9,0)",,,...,,,,,,,,,,
2,duckdb-file,folder_A_TXBC,TXBC_SUM_1990,TotSqft,19,,YES,"DECIMAL(9,0)",,,...,,,,,,,,,,
3,duckdb-file,folder_B_TXBC,TXBC_SUM_1990,TotSqft,19,,YES,"DECIMAL(9,0)",,,...,,,,,,,,,,


Looking at the distinct values, the TCBC table contains only 0, while the TXBC table contains 8,866 distinct values. I therefore assume that records with no square footage are recorded as 0.

In [22]:
%%sql
SELECT DISTINCT TotSqft FROM folder_A_TCBC.TCBC_SUM_1990;

Unnamed: 0,TotSqft
0,0


In [23]:
%%sql
SELECT DISTINCT TotSqft FROM folder_A_TXBC.TXBC_SUM_1990;

Unnamed: 0,TotSqft
0,0
1,2199
2,2995
3,1744
4,1315
...,...
8861,80909
8862,48712
8863,426
8864,22690
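
The distinct values alone do not show how many records carry a 0. A sketch of a count that would test the assumption is below. It runs on Python's built-in `sqlite3` with made-up sample values as a stand-in for the duckdb table, so the numbers it prints say nothing about the real data; the table name `parcels` is illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE parcels (TotSqft INTEGER)")
# Hypothetical values, chosen only to exercise the query.
con.executemany(
    "INSERT INTO parcels VALUES (?)",
    [(0,), (0,), (0,), (2199,), (2995,)],
)

n_zero, n_total = con.execute(
    """
    SELECT SUM(CASE WHEN TotSqft = 0 THEN 1 ELSE 0 END),
           COUNT(*)
    FROM parcels
    """
).fetchone()
print(f"{n_zero} of {n_total} records have TotSqft = 0")
```

Against the real data, the `FROM` clause would name folder_A_TXBC.TXBC_SUM_1990, and a large share of zeros would be consistent with 0 being used as a placeholder for missing square footage.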
