# Moving SQL Queries
At this point, I have a lot of SQL queries stored as string literals in a Python script. That isn't modular enough: someone can't work on the SQL while someone else works on the Python, and you have to execute Python code in order to run the SQL scripts, which is silly.

The goal here is a refactor: no new features, just better separation of concerns.

First, I'm just copying (not moving) the SQL to new files under `src/data/sql/`.

Now I'm starting with an example from [StackOverflow](https://stackoverflow.com/questions/19472922/reading-external-sql-script-in-python):

In [2]:
def executeScriptsFromFile(filename, c):
    # `c` is an open database cursor (the original snippet relied on a global `c`)
    # Open and read the file as a single buffer
    with open(filename, 'r') as fd:
        sqlFile = fd.read()

    # all SQL commands (split on ';')
    sqlCommands = sqlFile.split(';')

    # Execute every command from the input file
    for command in sqlCommands:
        # This will skip and report errors
        # For example, if the tables do not yet exist, this will skip over
        # the DROP TABLE commands
        try:
            c.execute(command)
        except Exception as msg:
            print("Command skipped: ", msg)
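
Side note: the helper above splits on every `;`, so a script that ends with `;` produces an empty final "command" that the driver may reject. Here's a minimal sketch of a stricter splitter. `split_sql_statements` is a hypothetical name (not part of this repo), and it assumes the scripts have no semicolons inside string literals or comments, which a naive split can't handle.

```python
def split_sql_statements(sql_text):
    """Split a SQL script into individual statements on ';'.

    Assumption: no ';' appears inside string literals or comments.
    Whitespace-only fragments (e.g. after the final ';') are dropped.
    """
    fragments = sql_text.split(";")
    return [fragment.strip() for fragment in fragments if fragment.strip()]


if __name__ == "__main__":
    script = "DROP TABLE IF EXISTS sales;\nCREATE TABLE sales (id INT);\n"
    for statement in split_sql_statements(script):
        print(repr(statement))
```

I don't end up needing this, since each of my `.sql` files gets passed to `execute` as a single string further down.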

In [3]:
import psycopg2
import pandas as pd

In [4]:
conn = psycopg2.connect(dbname="housing_data")
pd.read_sql_query("SELECT * FROM sales LIMIT 5;", conn)

Unnamed: 0,excisetaxnbr,major,minor,documentdate,saleprice,recordingnbr,volume,page,platnbr,plattype,...,propertytype,principaluse,saleinstrument,afforestland,afcurrentuseland,afnonprofituse,afhistoricproperty,salereason,propertyclass,salewarning
0,1923256,403720,880,2002-11-09,0,20021122000102,,,,,...,3,6,15,N,N,N,N,14,8,31 51 52
1,1443998,937230,60,1995-08-17,110000,199508230865,43.0,21.0,937230.0,P,...,3,6,2,N,N,N,N,0,8,
2,2737898,228520,320,2015-06-17,288700,20150619001358,,,,,...,3,2,3,N,N,N,N,1,3,
3,2501378,148930,126,2011-07-15,0,20110720000809,,,,,...,3,6,15,N,N,N,N,5,8,18 31 51
4,2838251,73150,180,2016-11-22,425000,20161207002286,,,,,...,11,6,3,,,,,1,8,


In [5]:
def open_sql_script(script_filename):
    # Read the whole SQL script into a single string
    with open(script_filename, 'r') as file_obj:
        return file_obj.read()

In [6]:
open_sql_script("../../src/data/sql/09_sales_df_query.sql")

FileNotFoundError: [Errno 2] No such file or directory: '../../src/data/sql/09_sales_df_query.sql'

In [7]:
! pwd

/Users/ehoffman/Development/DS/housing-price-statistical-analysis/notebooks/exploratory/data_collection


In [8]:
open_sql_script("../../../src/data/sql/09_sales_df_query.sql")

FileNotFoundError: [Errno 2] No such file or directory: '../../../src/data/sql/09_sales_df_query.sql'

In [9]:
! cat ../../../src/data/sql/09_sales_df_query.sql

cat: ../../../src/data/sql/09_sales_df_query.sql: No such file or directory


I had accidentally put this file under `data/` instead of `data/sql/`. After moving it, `cat` finds it:

In [10]:
! cat ../../../src/data/sql/09_sales_df_query.sql 

SELECT
    CONCAT(sales.Major, sales.Minor) AS PIN,        -- parcel id number
    sales.SalePrice,
    sales.DocumentDate,
    CASE
        WHEN parcels.WfntLocation > 0                -- 1-9 indicate particular bodies of water
            THEN TRUE
        ELSE                                         -- I infer that 0 means no waterfront
            FALSE
    END as WfntLocation,
    buildings.SqFtTotLiving
FROM sales                                           -- start the join with sales bc sale price is target
    INNER JOIN parcels ON (                          -- parcel major + minor is the unique identifier
            parcels.Major = sales.Major              -- (parcels are the things being sold in the sales)
        AND parcels.Minor = sales.Minor
    )
    INNER JOIN buildings ON (                        -- building belongs to one parcel
            buildings.Major = parcels.Major          -- parcel can have many buildings (unclear how often)
        AND buildings.Minor = parc

In [11]:
open_sql_script("../../../src/data/sql/09_sales_df_query.sql ")

FileNotFoundError: [Errno 2] No such file or directory: '../../../src/data/sql/09_sales_df_query.sql '

OK, looks like the relative path isn't going to work, sad.

In [12]:
__file__

NameError: name '__file__' is not defined

In [13]:
import os

In [14]:
__file__

NameError: name '__file__' is not defined

In [15]:
script_dir = os.path.dirname(__file__)

NameError: name '__file__' is not defined

Wait a second, I see an extra trailing space in the path I passed in `In [11]`. Trying the relative path again.
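
Side note: a stray space is hard to spot in a traceback, and `__file__` isn't defined in a notebook, so a slightly more defensive reader could help. This is only a sketch under assumptions: `read_sql_script` and its `base_dir` parameter are hypothetical names (not part of `sql_utils`), and relative paths are resolved against the current working directory unless `base_dir` is given.

```python
from pathlib import Path


def read_sql_script(script_filename, base_dir=None):
    """Return the contents of a SQL script as a string.

    Hypothetical helper: strips accidental surrounding whitespace from the
    filename and resolves it against base_dir (default: the current
    working directory, since notebooks have no __file__).
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    path = (base / str(script_filename).strip()).resolve()
    if not path.is_file():
        # Show the fully resolved path so a wrong directory is obvious
        raise FileNotFoundError(f"No SQL script found at {str(path)!r}")
    return path.read_text()
```

With something like this, both the wrong-directory mistake and the trailing space would have shown up as one absolute path in the error message.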

In [16]:
open_sql_script("../../../src/data/sql/09_sales_df_query.sql")

"SELECT\n    CONCAT(sales.Major, sales.Minor) AS PIN,        -- parcel id number\n    sales.SalePrice,\n    sales.DocumentDate,\n    CASE\n        WHEN parcels.WfntLocation > 0                -- 1-9 indicate particular bodies of water\n            THEN TRUE\n        ELSE                                         -- I infer that 0 means no waterfront\n            FALSE\n    END as WfntLocation,\n    buildings.SqFtTotLiving\nFROM sales                                           -- start the join with sales bc sale price is target\n    INNER JOIN parcels ON (                          -- parcel major + minor is the unique identifier\n            parcels.Major = sales.Major              -- (parcels are the things being sold in the sales)\n        AND parcels.Minor = sales.Minor\n    )\n    INNER JOIN buildings ON (                        -- building belongs to one parcel\n            buildings.Major = parcels.Major          -- parcel can have many buildings (unclear how often)\n        AND bui

In [17]:
def execute_sql_script(conn, script_filename):
    file_contents = open_sql_script(script_filename)
    cursor = conn.cursor()
    cursor.execute(file_contents)
    conn.commit()

In [18]:
execute_sql_script(conn, "../../../src/data/sql/09_sales_df_query.sql")

In [19]:
def return_result_of_sql_script(conn, script_filename):
    file_contents = open_sql_script(script_filename)
    result = pd.read_sql_query(file_contents, conn)
    return result

In [21]:
return_result_of_sql_script(conn, "../../../src/data/sql/09_sales_df_query.sql")

Unnamed: 0,pin,saleprice,documentdate,wfntlocation,sqfttotliving
0,2287300010,298633,2018-01-01,False,1810
1,8695200067,275000,2018-01-01,False,1250
2,8732160190,355000,2018-01-01,False,1580
3,1432401055,82886,2018-01-02,False,1170
4,5007500030,450000,2018-01-02,False,2540
...,...,...,...,...,...
30267,7202290630,705000,2018-12-31,False,1600
30268,8946720180,385000,2018-12-31,False,2130
30269,1796360480,47895,2018-12-31,False,1460
30270,9406520090,395000,2018-12-31,False,1654


In [22]:
from src.data import sql_utils

In [23]:
sql_utils.create_database()

FileNotFoundError: [Errno 2] No such file or directory: 'sql/01_drop_old_database.sql'

In [24]:
%load_ext autoreload

In [25]:
%autoreload 2

In [26]:
sql_utils.create_database()

ObjectInUse: database "housing_data" is being accessed by other users
DETAIL:  There is 1 other session using the database.
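
Side note: the "1 other session" here is presumably the `conn` I opened back in `In [4]`, which is still attached to `housing_data` while `create_database()` tries to replace it. One way to avoid leaving connections open is to wrap each use in `contextlib.closing`. A minimal sketch, where `run_with_short_lived_connection` is a hypothetical helper (not part of `sql_utils`); note that with psycopg2, `with conn:` only ends the transaction and does not close the connection.

```python
from contextlib import closing


def run_with_short_lived_connection(connect, work):
    """Open a connection, run work(conn), and always close the connection.

    connect: zero-argument callable returning a DB-API connection
    work:    callable that takes the connection and returns a result
    """
    # closing() calls conn.close() on exit, even if work() raises
    with closing(connect()) as conn:
        return work(conn)


# Hypothetical usage with the names from this notebook:
# df = run_with_short_lived_connection(
#     lambda: psycopg2.connect(dbname="housing_data"),
#     lambda conn: pd.read_sql_query("SELECT * FROM sales LIMIT 5;", conn),
# )
```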


In [27]:
conn.close()

In [28]:
sql_utils.create_database()

In [29]:
conn = psycopg2.connect(dbname="housing_data")

In [30]:
sql_utils.create_sales_table(conn)

In [31]:
sql_utils.create_buildings_table(conn)

In [32]:
from src.data import data_collection
sales_files, buildings_files, parcels_files = data_collection.collect_all_data_files()

In [33]:
sales_zip_file, sales_csv_file = sales_files

In [34]:
sql_utils.copy_csv_to_sales_table(conn, sales_csv_file)

In [35]:
pd.read_sql_query("SELECT * FROM sales LIMIT 5;", conn)

Unnamed: 0,excisetaxnbr,major,minor,documentdate,saleprice,recordingnbr,volume,page,platnbr,plattype,...,propertytype,principaluse,saleinstrument,afforestland,afcurrentuseland,afnonprofituse,afhistoricproperty,salereason,propertyclass,salewarning
0,1923256,403720,880,2002-11-09,0,20021122000102,,,,,...,3,6,15,N,N,N,N,14,8,31 51 52
1,1443998,937230,60,1995-08-17,110000,199508230865,43.0,21.0,937230.0,P,...,3,6,2,N,N,N,N,0,8,
2,2737898,228520,320,2015-06-17,288700,20150619001358,,,,,...,3,2,3,N,N,N,N,1,3,
3,2501378,148930,126,2011-07-15,0,20110720000809,,,,,...,3,6,15,N,N,N,N,5,8,18 31 51
4,2838251,73150,180,2016-11-22,425000,20161207002286,,,,,...,11,6,3,,,,,1,8,


In [36]:
buildings_zip_file, buildings_csv_file = buildings_files
parcels_zip_file, parcels_csv_file = parcels_files

In [37]:
sql_utils.copy_csv_to_buildings_table(conn, buildings_csv_file)

In [38]:
sql_utils.copy_csv_to_parcels_table(conn, parcels_csv_file)

UndefinedTable: relation "parcels" does not exist


In [39]:
conn.close()

In [40]:
conn = psycopg2.connect(dbname="housing_data")

In [41]:
sql_utils.create_parcels_table(conn)

In [42]:
parcels_zip_file, parcels_csv_file = data_collection.collect_parcels_data()

In [43]:
sql_utils.copy_csv_to_parcels_table(conn, parcels_csv_file)

In [44]:
sales_zip_file.close()
sales_csv_file.close()
buildings_zip_file.close()
buildings_csv_file.close()
parcels_zip_file.close()
parcels_csv_file.close()

In [45]:
conn.close()

In [46]:
conn = psycopg2.connect(dbname="housing_data")

In [47]:
pd.read_sql_query("SELECT * FROM buildings LIMIT 5;", conn)

Unnamed: 0,major,minor,bldgnbr,nbrlivingunits,address,buildingnumber,fraction,directionprefix,streetname,streettype,...,fpmultistory,fpfreestanding,fpadditional,yrbuilt,yrrenovated,pcntcomplete,obsolescence,pcntnetcondition,condition,addnlcost
0,180,10,1,1,1525 S SNOQUALMIE ST 98108,1525,,S,SNOQUALMIE,ST,...,0,0,0,1915,2007,0,0,0,3,8000
1,180,143,1,1,1518 S ANGELINE ST 98108,1518,,S,ANGELINE,ST,...,0,0,0,1988,0,0,0,0,3,0
2,180,154,1,1,1711 S COLUMBIAN WAY 98108,1711,,S,COLUMBIAN,WAY,...,0,0,0,1958,0,0,0,0,3,0
3,280,17,1,1,13955 56TH PL S 98168,13955,,,56TH,PL,...,0,0,0,1943,1990,0,0,0,3,0
4,280,25,1,1,13925 56TH PL S 98168,13925,,,56TH,PL,...,0,0,0,1930,0,0,0,0,5,0


In [48]:
pd.read_sql_query("SELECT * FROM parcels LIMIT 5;", conn)

Unnamed: 0,major,minor,propname,platname,platlot,platblock,range,township,section,quartersection,...,seismichazard,landslidehazard,steepslopehazard,stream,wetland,speciesofconcern,sensitiveareatract,waterproblems,transpconcurrency,otherproblems
0,889250,80,...,VELKOFF JOHN ADD ...,8,,5,22,8,SE,...,N,N,N,N,N,N,N,N,N,N
1,736360,275,...,ROBERTS JAY COUNTRY CLUB ESTATES ...,8,2.0,4,26,34,SE,...,N,N,N,N,N,N,N,N,N,N
2,600350,635,...,NAGLES 2ND ADD ...,6,28.0,4,25,32,NE,...,N,N,N,N,N,N,N,N,N,N
3,635260,760,...,OLD MILL POINT ...,TRACT I,,6,25,18,SE,...,N,N,N,N,N,N,N,N,N,N
4,333250,15,...,HILLMAN CITY DIV NO. 05 ...,3-4,1.0,4,24,22,SE,...,N,N,N,N,N,N,N,N,N,N


In [49]:
sql_utils.create_sales_df()

Unnamed: 0,pin,saleprice,documentdate,wfntlocation,sqfttotliving
0,8732160190,355000,2018-01-01,False,1580
1,2287300010,298633,2018-01-01,False,1810
2,8695200067,275000,2018-01-01,False,1250
3,3672000080,1029884,2018-01-02,False,3030
4,7228501490,860000,2018-01-02,False,2200
...,...,...,...,...,...
30267,3750606594,394000,2018-12-31,False,1920
30268,3395070110,520000,2018-12-31,False,1720
30269,1895450110,415000,2018-12-31,False,2060
30270,2025049183,1085000,2018-12-31,False,1870


In [50]:
conn.close()

OK, everything seems to be refactored.

In [51]:
# This bit makes more sense in the modeling branch so I'm gonna remove it for now
def return_result_of_sql_script(conn, script_filename):
    """
    Given a DB connection and a file path to a SQL script, run the query and
    return the results as a pandas dataframe
    """
    file_contents = open_sql_script(script_filename)
    result = pd.read_sql_query(file_contents, conn)
    return result

Run the whole data collection pipeline to make sure everything is working:

In [52]:
data_collection.download_data_and_load_into_sql()

And then the recently un-improved sales df query

In [53]:
sql_utils.create_sales_df()

Unnamed: 0,pin,saleprice,documentdate,wfntlocation,sqfttotliving
0,8732160190,355000,2018-01-01,False,1580
1,2287300010,298633,2018-01-01,False,1810
2,8695200067,275000,2018-01-01,False,1250
3,9414610310,525000,2018-01-02,False,1340
4,6884800015,660000,2018-01-02,False,1660
...,...,...,...,...,...
30267,7889500340,764950,2018-12-31,False,1960
30268,9347900210,305594,2018-12-31,False,880
30269,6852700555,1099950,2018-12-31,False,1470
30270,5028600460,363500,2018-12-31,False,1340


Looks refactored to me!