# TCAD file exploration

We have received files from a client.  They are ....

# Shorten files for browsing

To make the files easier to browse, we can run a short shell script. It unzips the archive we received and truncates each `.TXT` file to its first 100 lines.

```{bash, eval=F}
# rm -rf shortened_appraisal_files
unzip original_data/Appraisal_Roll_History_1990.zip -d shortened_appraisal_files
find shortened_appraisal_files -name "*.TXT" -exec sed -i.full 100q {} \;
find shortened_appraisal_files -name "*.TXT.full" -exec rm {} \;
zip -r shortened_appraisal_files.zip shortened_appraisal_files
```
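
The same truncation can be sketched in Python for anyone who would rather not use the shell. The helper names `truncate_file` and `truncate_tree` are illustrative assumptions, not part of any existing tooling. Unlike `sed -i.full`, this sketch overwrites each file without keeping a backup copy.

```python
from itertools import islice
from pathlib import Path


def truncate_file(path, max_lines=100):
    """Keep only the first max_lines lines of a file, in place.

    Hypothetical helper. The file is handled as bytes so that whatever
    encoding the client files use is passed through untouched.
    """
    path = Path(path)
    with path.open("rb") as handle:
        head = list(islice(handle, max_lines))
    path.write_bytes(b"".join(head))
    return len(head)


def truncate_tree(root, pattern="*.TXT", max_lines=100):
    """Truncate every file under root that matches pattern; return how many."""
    count = 0
    for txt_path in Path(root).rglob(pattern):
        truncate_file(txt_path, max_lines)
        count += 1
    return count
```

Calling `truncate_tree("shortened_appraisal_files")` after unzipping would play the role of the two `find` commands.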

We can now attempt to load one of the shortened, pipe-separated files using pandas.

In [None]:
import pandas as pd

df = pd.read_csv("shortened_appraisal_files/Appraisal_Roll_History_1990_A/TCBC_SUM_1990_JURIS.TXT", sep = "|")
df.head()

FileNotFoundError: ignored
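
A `FileNotFoundError` means the path passed to `read_csv` does not exist relative to the current working directory. One way to check is to list what was actually extracted before reading it. This is a minimal sketch; `find_files` is a hypothetical helper, and the directory and file names are the ones used in the cell above.

```python
from pathlib import Path


def find_files(root, name):
    """Return every file under root whose name matches, sorted by path."""
    root = Path(root)
    if not root.is_dir():
        # Nothing to search: the directory itself is missing.
        return []
    return sorted(p for p in root.rglob(name) if p.is_file())


# An empty list means either the directory or the file is not where we expect.
matches = find_files("shortened_appraisal_files", "TCBC_SUM_1990_JURIS.TXT")
print(matches)
```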

The challenge now is to use the `*.TDF` files to create tables.  I can think of two approaches.

1. The TDF files are SQL, so feeding them to duckdb should create tables into which the pipe-separated TXT files can be read.  The datatypes may not all match, in which case we would need to map the datatype names used in the TDF column definitions to duckdb datatypes.

2. Take the column names out of the TDF files and supply them as the column names while reading the matching TXT files into duckdb.  This relies on duckdb's automatic detection of column datatypes, so it would run, but it might guess wrongly and truncate or change data.
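
For the second approach, the column names could be pulled out of the SQL text with a small parser. The sketch below assumes each TDF file holds a single `CREATE TABLE name ( ... )` statement whose column list sits between the first `(` and the last `)`. That layout has not been verified against the client's files, and `column_names_from_ddl` is a hypothetical helper.

```python
def column_names_from_ddl(ddl):
    """Extract column names from one CREATE TABLE statement.

    Assumption: the column list is the text between the first "(" and the
    last ")" in the statement.
    """
    body = ddl[ddl.index("(") + 1 : ddl.rindex(")")]
    pieces = []
    current = []
    depth = 0
    # Split on top-level commas only, so types such as DECIMAL(12,2)
    # are not cut in half.
    for char in body:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == "," and depth == 0:
            pieces.append("".join(current))
            current = []
        else:
            current.append(char)
    pieces.append("".join(current))
    # The column name is the first whitespace-separated token of each piece.
    return [piece.split()[0].strip('[]"') for piece in pieces if piece.strip()]
```

The resulting list could then be passed to pandas as `pd.read_csv(path, sep="|", names=names, header=None)`, or to the equivalent option of duckdb's CSV reader.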

I think we should explore approach 1 first.
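
The datatype mapping in approach 1 can be sketched as a lookup table applied to the SQL text. The only source type known so far to need mapping is `SMALLDATETIME`; the `TYPE_MAP` contents and the `map_datatypes` helper are illustrative assumptions, to be extended as other mismatches turn up.

```python
import re

# Hypothetical mapping from datatype names in the TDF files to duckdb names.
# Mapping to TEXT loads the raw string, which can be cast later.
TYPE_MAP = {
    "SMALLDATETIME": "TEXT",
}


def map_datatypes(ddl, type_map=TYPE_MAP):
    """Replace whole-word datatype names in a CREATE TABLE statement."""
    for source_type, target_type in type_map.items():
        # \b keeps a column called e.g. SMALLDATETIME_FLAG from being rewritten.
        pattern = r"\b" + re.escape(source_type) + r"\b"
        ddl = re.sub(pattern, target_type, ddl, flags=re.IGNORECASE)
    return ddl
```

Matching on word boundaries is slightly safer than a plain string replacement, because a datatype name that happens to appear inside a column name is left alone.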

## Creating tables using the TDF files

We have TDF files scattered through the \_A and \_B folders.  I have created one schema (a namespace) per folder and file prefix: "folder_A_TCBC", "folder_A_TXBC", "folder_B_TCBC" and "folder_B_TXBC". Tables with the same name can therefore exist in more than one schema.  You can reference them as folder_A_TCBC.TCBC_SUM_1990_JURIS and folder_B_TCBC.TCBC_SUM_1990_JURIS.

We can use Python to read each TDF file separately, create the table and then try to load the matching TXT file.  There is some guidance on how to process a directory structure of files using Path and glob here:
http://howisonlab.github.io/datawrangling/faq.html#get-data-from-filenames

In [None]:
from pathlib import Path
import duckdb

con = duckdb.connect('duckdb-file.db') #  string to persist to disk
cursor = con.cursor()

# file_directory = 'shortened_appraisal_files/'
file_directory = 'data/'
# limit_to_file = 'TCBC_SUM_1990_JURIS'
limit_to_file = '*' # all files

# create schemas
cursor.execute("CREATE SCHEMA IF NOT EXISTS folder_A_TCBC;")
cursor.execute("CREATE SCHEMA IF NOT EXISTS folder_A_TXBC;")
cursor.execute("CREATE SCHEMA IF NOT EXISTS folder_B_TCBC;")
cursor.execute("CREATE SCHEMA IF NOT EXISTS folder_B_TXBC;")
# drop schemas that were created previously
# cursor.execute("DROP SCHEMA IF EXISTS folder_A CASCADE")
# cursor.execute("DROP SCHEMA IF EXISTS folder_B CASCADE")

for filename in Path(file_directory).rglob(limit_to_file + '.TDF'):
    print(filename.parts)
    # parts[1] is the _A/_B folder, parts[2] is the file name
    if "_A" in filename.parts[1] and "TCBC_" in filename.parts[2]:
        schema = "folder_A_TCBC"
    elif "_A" in filename.parts[1] and "TXBC_" in filename.parts[2]:
        schema = "folder_A_TXBC"
    elif "_B" in filename.parts[1] and "TCBC_" in filename.parts[2]:
        schema = "folder_B_TCBC"
    elif "_B" in filename.parts[1] and "TXBC_" in filename.parts[2]:
        schema = "folder_B_TXBC"
    else:
        # skip files that match no schema instead of reusing the previous one
        continue

    table_name = schema + "." + Path(filename).stem # e.g., folder_A_TCBC.TCBC_SUM_1990_JURIS

    # read .TDF file into string
    create_table_sql = Path(filename).read_text()
    # Need to alter table name to read in both _A and _B files
    create_table_sql = create_table_sql.replace(Path(filename).stem, table_name)
    
    # Here we have the table creation code in a string, so we can
    # swap datatypes out.
    # tried SMALLDATETIME --> DATETIME but was still giving errors
    # will need to fix this later.
    create_table_sql = create_table_sql.replace("SMALLDATETIME", "TEXT")
    

    # execute that SQL with duckdb, this should create the table
    # (the tables were already created on an earlier run, so this is commented out)
    # cursor.execute(create_table_sql)

    # copy CSV into duckdb. CSV is the matching .TXT
    path_to_csvpipefile = Path(filename).with_suffix(".TXT")
    # duckdb copy documentation: https://duckdb.org/docs/sql/statements/copy.html
    # note: COPY appends to the table, so rerunning this cell loads the rows again
    query = f"COPY {table_name} FROM '{path_to_csvpipefile}' ( DELIMITER '|')"
    cursor.execute(query)
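
The schema selection in the loop above can also be written as a small function, which makes it easier to check against a few example paths. This is a sketch that mirrors the same substring tests on `parts[1]` and `parts[2]`; `schema_for` is a hypothetical helper name.

```python
from pathlib import Path


def schema_for(tdf_path):
    """Return the schema name for a TDF path, or None if nothing matches.

    Assumes the layout the loop relies on:
    <root>/<folder containing _A or _B>/<file containing TCBC_ or TXBC_>
    """
    parts = Path(tdf_path).parts
    if len(parts) < 3:
        return None
    folder, name = parts[1], parts[2]
    side = next((s for s in ("A", "B") if f"_{s}" in folder), None)
    prefix = next((p for p in ("TCBC", "TXBC") if f"{p}_" in name), None)
    if side is None or prefix is None:
        return None
    return f"folder_{side}_{prefix}"
```

Returning `None` for an unmatched path lets the caller decide whether to skip the file or stop with an error.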

In [None]:
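# set up sql for dbdocs
for filename in Path(file_directory).rglob(limit_to_file + '.TDF'):

    # Read this file's SQL table code. Reusing create_table_sql from the
    # cell above would print the last file's SQL on every iteration.
    dbdocs_create_table = Path(filename).read_text()
    dbdocs_create_table = dbdocs_create_table.replace("SMALLDATETIME", "TEXT")

    # Remove the comma after each closing parenthesis (plain string replacement)
    dbdocs_create_table = dbdocs_create_table.replace("),", ")")

    # Print the updated SQL table code
    print(dbdocs_create_table)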


In [None]:
# setup from https://duckdb.org/docs/guides/python/jupyter.html
import duckdb
import pandas as pd
# No need to import duckdb_engine
#  jupysql will auto-detect the driver needed based on the connection string!

# Import jupysql Jupyter extension to create SQL cells
%load_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

In [None]:
%sql duckdb:///duckdb-file.db

In [None]:
%%sql
SHOW TABLES -- no schema name

DuckDB implements many of the same catalog views as PostgreSQL, including `pg_catalog.pg_tables`, so the same queries can be used to list the tables together with their schema names.

In [None]:
%%sql
SELECT schemaname AS schema_name, tablename AS table_name
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog'
AND schemaname != 'information_schema'
ORDER BY schemaname, tablename ASC;

We expect a total of 134933 rows, but the row count grows every time the loading cell is rerun, because each `COPY` appends to the existing table.

In [None]:
%%sql
SELECT * FROM folder_A_TCBC.TCBC_SUM_1990_JURIS;

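Because the reruns duplicate the data, `SELECT DISTINCT` can be used to get the correct rows back. Note that this would also collapse any rows that are genuinely repeated in the source file.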

In [None]:
%%sql
SELECT DISTINCT * FROM folder_A_TCBC.TCBC_SUM_1990_JURIS;
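
An alternative to de-duplicating at query time is to make the load itself repeatable. This is a minimal sketch, assuming it is acceptable to drop and recreate each table on every run. `load_statements` is a hypothetical helper that only builds the SQL strings; `cursor`, `table_name`, `create_table_sql` and `path_to_csvpipefile` are the names used in the loading cell.

```python
def load_statements(table_name, create_table_sql, csv_path, delimiter="|"):
    """Build SQL that recreates a table before loading it.

    Dropping the table first means a rerun replaces the rows
    instead of appending a second copy of them.
    """
    return [
        f"DROP TABLE IF EXISTS {table_name};",
        create_table_sql,
        f"COPY {table_name} FROM '{csv_path}' (DELIMITER '{delimiter}');",
    ]


# Usage with the cursor from the loading cell (not run here):
# for statement in load_statements(table_name, create_table_sql, path_to_csvpipefile):
#     cursor.execute(statement)
```

With this in place, a plain `SELECT *` should return each source row once, and rows that are genuinely repeated in the TXT file are preserved.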