## Uploading Form Data into Postgres (Data2.ipynb)

The purpose of this notebook is to build on the data sources uploaded in the previous notebook. In this instance, we're taking the form data from the v1 V1FEB23 table and creating one table per form. As part of this work, we're also creating .csv files, which are stored in our `/dsa/groups/casestudy2023su/team03` directory.

#### 1. [Initial Connection to the database](#data_connect)
#### 2. [CSV Creation of patient forms](#data_creation)
#### 3. [Create tables and insert data](#data_insert)


### <a name="data_connect"></a>Initial connection to the database
Similar to the Data1 notebook, we're connecting to the database after importing all the necessary libraries.

As an important note, we're connecting as a specific user.
1. Username: dtfp3
2. Database: pgsql.dsa.lan/casestdysu23t03

In [1]:
# Import all necessary libraries for this notebook
import numpy as np
import pandas as pd
import binascii
import psycopg2
import sqlalchemy
import getpass
import os
import csv

In [2]:
##Connect to Postgres
user = "dtfp3"
host = "pgsql.dsa.lan"
database = "casestdysu23t03"
password = getpass.getpass()
connectionstring = "postgresql://" + user + ":" + password + "@" + host + "/" + database
engine = sqlalchemy.create_engine(connectionstring)
connection = None
schema = "public"

try:
    connection = engine.connect()
except Exception as err:
    print("An error has occurred trying to connect: {}".format(err))

del password

········
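
Concatenating the password directly into the connection string can break when the password contains characters such as `@`, `/` or `:`. A minimal sketch of one way around this, assuming a hypothetical helper named `build_pg_url` that is not part of the original notebook, is to percent-encode the credentials with the standard library before passing the URL to `sqlalchemy.create_engine`:

```python
from urllib.parse import quote_plus


def build_pg_url(user, password, host, database):
    """Build a postgresql:// URL with percent-encoded credentials (hypothetical helper)."""
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# Made-up password for illustration; the notebook reads the real one with getpass.getpass()
print(build_pg_url("dtfp3", "p@ss/word", "pgsql.dsa.lan", "casestdysu23t03"))
```

The resulting string can be used in place of `connectionstring` in the cell above.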


### <a name="data_creation"></a>CSV Creation of Patient Forms
We're creating one table per patient form found in the **v1feb23_raw** table. First, we select each form's columns with a regex that matches on the column-name prefix. The **list_of_forms** variable currently limits this to the following forms:
- DH - Diet History
- FF - Fracture History
- FV - Functional Vision
- GI - General Information
- GS - Grip Strength
- HW - Height, Weight, and Pulse
- MH - Medical History
- MU - Medication Use
- NF - Neuromuscular Function
- TU - Tobacco & Alcohol Use
- NP - Nottingham Power Rig

The dataset also offers the following additional forms, which we exclude because of lack of data, irrelevance, or lack of correlation indicated by the literature:
- PS - Prostate Health
- PA - Physical Activity
- QL - Lifestyle
- BH - Back and Joint Health
- TB - Trail Making Task B
- TM - Teng Mini-Mental
- SC - Specimen Collection
- DX - Bone Density Form
- XR - X-Ray Form
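
The prefix-based column selection described at the start of this section can be sketched with the standard `re` module. pandas' `DataFrame.filter(regex=...)` keeps the labels for which `re.search` finds a match, so the same pattern can be tried on a plain list of names. `select_form_columns` is a hypothetical helper, and the column names other than `ID`, `GISOC`, `GIEDUC` and `GIERACE` are made up for illustration.

```python
import re


def select_form_columns(columns, form):
    """Return the columns starting with "ID" plus those starting with the form prefix."""
    # Equivalent in effect to the notebook's pattern ^(ID)|(^{form})
    pattern = re.compile(rf"^(?:ID|{re.escape(form)})")
    return [col for col in columns if pattern.search(col)]


# "HWWGT" and "MHDIAB" are hypothetical column names
columns = ["ID", "GISOC", "GIEDUC", "GIERACE", "HWWGT", "MHDIAB"]
print(select_form_columns(columns, "GI"))
```

Note that the `ID` alternative matches any column whose name begins with `ID`, not only a column named exactly `ID`.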

To make the data in the **General Information** form more readable, we also convert several numeric codes to text labels through dictionary mapping. Similar conversions for other forms are handled separately, form by form, as part of our EDA steps.
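
One thing to watch with dictionary mapping is that `Series.map` turns any value missing from the dictionary into `NaN` without raising an error. A minimal sketch of a pre-check, using a hypothetical helper `find_unmapped_codes` and made-up codes, lists the values that would be lost:

```python
def find_unmapped_codes(values, mapping):
    """Return the sorted non-missing codes that have no label in `mapping`."""
    return sorted({v for v in values if v is not None and v not in mapping})


# Same dictionary as in the notebook; the sample codes are made up
ERACE_dict = {1: "1. WHITE", 2: "2. AFRICAN AMERICAN", 3: "3. ASIAN", 4: "4. HISPANIC", 5: "5. OTHER"}
sample_codes = [1, 2, 5, 7, None, 3]

print(find_unmapped_codes(sample_codes, ERACE_dict))
```

An empty result means every code has a label, so the mapped column will only contain `NaN` where the source value was already missing.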

In [3]:
# After opening the session, query the table
query = "SELECT * FROM public.v1feb23_raw"
v1feb23_df = pd.read_sql_query(query, con=connection)

In [4]:
# Define the list of the various forms 
list_of_forms = ["DH","FF","FV","GI","GS","HW","MH","MU","NF","TU","NP"]

In [5]:
## Dictionaries to convert form integers into corresponding categories 
ERACE_dict = {1:"1. WHITE",2:"2. AFRICAN AMERICAN",3:"3. ASIAN",4:"4. HISPANIC",5:"5. OTHER"}
SOC_dict = {11:"11. Management", 13:"13. Business and Financial", 15:"15. Computer and Mathematical", 17:"17. Architecture and Engineering",\
           19:"19. Life; Physical; and Social Science",21:"21. Community and Social Service",23:"23. Legal",\
            25:"25. Education; Training and Library",27:"27. Arts; Design; Entertainment; Sports and Media",\
            29:"29. Healthcare Practitioners and Technical" ,31:"31. Healthcare Support",33:"33. Protective Service",\
            35:"35. Food Preparation and Serving Related",37:"37. Building and Grounds Cleaning and Maintenance",\
            39:"39. Personal Care and Service",41:"41. Sales and Related",43:"43. Office and Administrative Support",\
            45:"45. Farming; Fishing and Forestry",47:"47. Construction and Extraction",49:"49. Installation; Maintenance and Repair"\
            ,51:"51. Production",53:"53. Transportation and Material Moving",55:"55. Military Specific"}
edu_dict = {1:"1. Some Elementary",2:"2. Elementary",3:"3. Some Highschool",4:"4. High School",5:"5. Some College",6:"6. College",7:"7. Some Grad",8:"8. Grad School"}

In [6]:
for form in list_of_forms:
    form_df = v1feb23_df.filter(regex=f"^(ID)|(^{form})")
    if form == "GI":
        form_df.GISOC = form_df.GISOC.map(SOC_dict)
        form_df.GIEDUC = form_df.GIEDUC.map(edu_dict)
        form_df.GIERACE = form_df.GIERACE.map(ERACE_dict)
    form_df.to_csv(f"/dsa/groups/casestudy2023su/team03/v1_form_{form}.csv",index=False)

# Directory that holds the form csvs
directory = "/dsa/groups/casestudy2023su/team03/"

# Get all files within the directory
files = os.listdir(directory)

# Retrieve ONLY the specified csv files
csv_files = [file for file in files if "_form_" in file and file.endswith(".csv")]

print(csv_files)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


['v1_form_DH.csv', 'v1_form_FF.csv', 'v1_form_FV.csv', 'v1_form_GI.csv', 'v1_form_GS.csv', 'v1_form_HW.csv', 'v1_form_MH.csv', 'v1_form_MU.csv', 'v1_form_NF.csv', 'v1_form_TU.csv', 'v1_form_NP.csv']
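
The warning in the output above is pandas' `SettingWithCopyWarning`. It is raised because the loop assigns to columns of a DataFrame returned by `filter`, which pandas treats as a possible view of `v1feb23_df`. The CSVs are still written, but one common way to silence the warning, sketched here with made-up rows and a hypothetical `HWWGT` column, is to take an explicit copy before assigning:

```python
import pandas as pd

# Made-up rows; "HWWGT" is a hypothetical column name
v1_df = pd.DataFrame({"ID": ["A1", "A2"], "GIEDUC": [4, 6], "HWWGT": [80.5, 72.0]})
edu_dict = {4: "4. High School", 6: "6. College"}

# .copy() makes form_df an independent DataFrame, so the assignment is unambiguous
form_df = v1_df.filter(regex="^(ID)|(^GI)").copy()
form_df["GIEDUC"] = form_df["GIEDUC"].map(edu_dict)

print(form_df)
```

The source DataFrame keeps its numeric codes, and only the copy carries the text labels.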


### <a name="data_insert"></a>Create tables and insert data
This process is nearly identical to what's performed in Data1. The difference is that instead of using the SAS7BDAT files, we're now loading the CSVs. For consistency, the details below outline the process:

After the files have been converted, we start the import process. An important first step is checking whether the tables already exist. Since we're starting from the raw data, the steps for each file are:
1. **SELECT** to determine whether the table exists
    1. If it does exist, the table is **DROP**ped
1. **CREATE** the table, with a few considerations
    1. The table name is the CSV filename, lowercased and without its extension (for example, `v1_form_dh`)
    2. We use the datatypes pandas infers from the CSV so everything isn't cast as a string
    3. Postgres has a limit of 1600 columns per table, so a file with 1600 or more columns is stored as a single text column that can be parsed later
1. **GRANT** all **PRIVILEGES** to the **PUBLIC** group so everyone can manipulate the tables
1. **INSERT** the data into the respective columns
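
The steps above can be sketched as a small function that builds the SQL strings for one file. This is a simplified illustration and not the notebook's exact code: `build_table_statements` is a hypothetical helper, it takes a plain dict of column names to pandas dtype names, and it swaps the separate SELECT check for `DROP TABLE IF EXISTS`.

```python
# Mapping copied from the notebook's load cell
dtype2SQL = {"object": "TEXT", "float64": "REAL", "int64": "INTEGER", "datetime64[ns]": "TEXT"}
MAX_COLUMNS = 1600  # Postgres per-table column limit


def build_table_statements(schema, csv_name, column_dtypes):
    """Return (drop, create, grant) SQL strings for one CSV file.

    `column_dtypes` maps column name -> pandas dtype name, e.g. {"ID": "object"}.
    """
    table = f"{schema}.{csv_name.split('.')[0].lower()}"
    if len(column_dtypes) >= MAX_COLUMNS:
        # Too wide for Postgres: keep each raw row in a single text column
        columns = "data_column text"
    else:
        columns = ", ".join(f'"{col}" {dtype2SQL[dtype]}' for col, dtype in column_dtypes.items())
    return (
        f"DROP TABLE IF EXISTS {table}",
        f"CREATE TABLE {table} ({columns})",
        f"GRANT ALL PRIVILEGES ON TABLE {table} TO PUBLIC",
    )


# Made-up dtypes for two of the General Information columns
for statement in build_table_statements("public", "v1_form_GI.csv", {"ID": "object", "GISOC": "float64"}):
    print(statement)
```

Building the statements separately from executing them makes it easy to print and review the SQL before anything is dropped.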
       
As a note, a user may hold a lock on a table, which prevents the table from being dropped. This can be resolved by finding the session with the active transaction and then cancelling or terminating it, either from the psql terminal or through another notebook connection:

**Identify the table lock** <br>
`select datname, pid, usename, application_name, state, query_start  from pg_stat_activity where datname = 'casestdysu23t03';`

**Terminate the PID** <br>
`SELECT pg_terminate_backend(23174);`
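
The activity query above lists every session on the database. A narrower check, sketched below under the assumption that `cursor` is an open DB-API cursor such as the one psycopg2 provides, joins `pg_locks` to `pg_stat_activity` to show only the sessions holding or waiting for a lock on one table. `sessions_locking` and `cancel_session` are hypothetical helper names. `pg_cancel_backend` cancels the session's current query, which is a gentler first step than `pg_terminate_backend`.

```python
# Sessions that hold or wait for a lock on a given table in the current database
LOCK_QUERY = """
    SELECT a.pid, a.usename, a.state, l.mode, l.granted
    FROM pg_locks AS l
    JOIN pg_stat_activity AS a ON a.pid = l.pid
    WHERE l.locktype = 'relation'
      AND l.relation = %s::regclass
      AND a.datname = current_database()
"""


def sessions_locking(cursor, qualified_table):
    """Return one row per session locking `qualified_table`, e.g. 'public.v1_form_dh'."""
    cursor.execute(LOCK_QUERY, (qualified_table,))
    return cursor.fetchall()


def cancel_session(cursor, pid):
    """Ask Postgres to cancel the running query of backend `pid`; returns True on success."""
    cursor.execute("SELECT pg_cancel_backend(%s)", (pid,))
    return cursor.fetchone()[0]
```

If cancelling is not enough, for example when the session is idle in a transaction, the `pg_terminate_backend` call shown above closes the session entirely.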

In [9]:
user = "dtfp3"
host = "pgsql.dsa.lan"
database = "casestdysu23t03"
password = getpass.getpass()
schema = "public" # Started with a different schema and then updated to public

dtype2SQL = {'object' : 'TEXT', 'float64' : 'REAL', 'int64' : "INTEGER","datetime64[ns]":'TEXT'}

# Connection setup
connection = None

try:
    connection = psycopg2.connect(user=user, host=host, database=database, password=password)
    cursor = connection.cursor()

    for csv_file in csv_files:
        df = pd.read_csv(directory+csv_file)

        # Name the table after the CSV file: extension removed and lowercased (e.g. v1_form_dh)
        table_name = csv_file.split(".")[0].lower()

        # Check if table exists
        check_table_query = f"SELECT EXISTS (SELECT 1 FROM information_schema.tables WHERE table_schema = '{schema}' AND table_name = '{table_name}')"
        cursor.execute(check_table_query)
        table_exists = cursor.fetchone()[0]

        if table_exists:
            print("Table already exists - blowing it away: {}".format(table_name))
            # Drop the table if it exists
            drop_table_query = f"DROP TABLE {schema}.{table_name}"
            cursor.execute(drop_table_query)
            connection.commit()

        # Create the table
        print("Creating Table: {}".format(table_name))
        
        if len(df.columns) >= 1600:
            # If the number of columns is equal to or more than 1600, create a table with a single "data_column"
            create_table_query = f"CREATE TABLE {schema}.{table_name} (data_column text)"
            cursor.execute(create_table_query)
        else:
            # If the number of columns is less than 1600, create a table with all the columns from the CSV
            columns = ', '.join([f'"{col}" {dtype2SQL[str(df[col].dtype)]}' for col in df.columns])
            create_table_query = f"CREATE TABLE {schema}.{table_name} ({columns})"
            cursor.execute(create_table_query)
        connection.commit()

        # Grant necessary privileges to all users
        grant_query = f"GRANT ALL PRIVILEGES ON TABLE {schema}.{table_name} TO PUBLIC"
        cursor.execute(grant_query)
        connection.commit()
        
        print(f"Table {schema}.{table_name} created successfully.")

        # Read in the CSV file and insert the data
        with open(directory+csv_file, 'r', newline='') as file:
            if len(df.columns) >= 1600:
                # We don't skip the header because we need that data if we're building another table.
                # If the number of columns is equal to or more than 1600, read in each row but don't parse the columns
                insert_query = f"INSERT INTO {schema}.{table_name} (data_column) VALUES (%s)"
                for line in file:
                    cursor.execute(insert_query, (line.strip(),))
            else:
                # Parse with the csv module so quoted fields that contain commas stay intact
                reader = csv.reader(file)

                # Skip header row
                next(reader)

                # If the number of columns is less than 1600, insert the data row by row
                for values in reader:
                    # Empty fields become SQL NULL
                    values = [None if x == "" else x for x in values]
                    insert_query = f"INSERT INTO {schema}.{table_name} VALUES ({','.join(['%s']*len(values))})"
                    cursor.execute(insert_query, values)

        connection.commit()

        print(f"File {csv_file} inserted successfully into table {schema}.{table_name}.")

    print("All files inserted successfully.")

except Exception as err:
    print("An error has occurred: {}".format(err))

finally:
    if connection:
        cursor.close()
        connection.close()

del password

········
Table already exists - blowing it away: v1_form_dh
Creating Table: v1_form_dh
Table public.v1_form_dh created successfully.
File v1_form_DH.csv inserted successfully into table public.v1_form_dh.
Table already exists - blowing it away: v1_form_ff
Creating Table: v1_form_ff
Table public.v1_form_ff created successfully.
File v1_form_FF.csv inserted successfully into table public.v1_form_ff.
Table already exists - blowing it away: v1_form_fv
Creating Table: v1_form_fv
Table public.v1_form_fv created successfully.
File v1_form_FV.csv inserted successfully into table public.v1_form_fv.
Table already exists - blowing it away: v1_form_gi
Creating Table: v1_form_gi
Table public.v1_form_gi created successfully.
File v1_form_GI.csv inserted successfully into table public.v1_form_gi.
Table already exists - blowing it away: v1_form_gs
Creating Table: v1_form_gs
Table public.v1_form_gs created successfully.
File v1_form_GS.csv inserted successfully into table public.v1_form_gs.
Table alrea