# Connect to the OCHIN DB through Python

This notebook walks you through connecting to the OCHIN DB using Python.
Before you begin, make sure that you have access to the data and that the `db-credentials.txt` file is in your home directory (the cells below read it from `/home/ec2-user/SageMaker/db-credentials.txt`).

## Install ODBC drivers as necessary

In [None]:
%%sh

#Download the repo file for your OS version
#Uncomment only ONE of the following, corresponding to your OS version
#"sudo tee" writes the repo file as root without opening a separate root shell

#RHEL 7 and Oracle Linux 7
curl https://packages.microsoft.com/config/rhel/7/prod.repo | sudo tee /etc/yum.repos.d/mssql-release.repo > /dev/null

#RHEL 8 and Oracle Linux 8
#curl https://packages.microsoft.com/config/rhel/8/prod.repo | sudo tee /etc/yum.repos.d/mssql-release.repo > /dev/null

#RHEL 9
#curl https://packages.microsoft.com/config/rhel/9.0/prod.repo | sudo tee /etc/yum.repos.d/mssql-release.repo > /dev/null

sudo yum remove -y unixODBC-utf16 unixODBC-utf16-devel #to avoid conflicts
sudo ACCEPT_EULA=Y yum install -y msodbcsql17
# optional: for bcp and sqlcmd
sudo ACCEPT_EULA=Y yum install -y mssql-tools
# Add the tools to PATH for future terminal sessions
# (sourcing ~/.bashrc here would only affect this cell's own shell)
echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bashrc
# optional: for unixODBC development headers
sudo yum install -y unixODBC-devel

## Import Python packages

In [None]:
import pyodbc
print("List of ODBC drivers:")
dlist = pyodbc.drivers()
for drvr in dlist:
    print('\t', drvr)

print("End of list")
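
The connection cells later in this notebook hard-code the driver name `ODBC Driver 17 for SQL Server`, so it can help to confirm that this exact name appears in the list before going further. The sketch below is one way to do that check; the helper name `find_sql_server_driver` and the example list are illustrative assumptions, not part of the original notebook.

```python
def find_sql_server_driver(available, wanted="ODBC Driver 17 for SQL Server"):
    """Return the wanted driver name if it is installed, otherwise None."""
    return wanted if wanted in available else None

# In the notebook you would pass pyodbc.drivers(); this hard-coded list
# is a made-up example so the sketch runs without pyodbc installed.
example_drivers = ["PostgreSQL", "ODBC Driver 17 for SQL Server"]

driver = find_sql_server_driver(example_drivers)
if driver is None:
    print("Driver not found: run the install cell above first.")
else:
    print("Found driver:", driver)
```

If the helper returns `None` for `pyodbc.drivers()`, the install step above has probably not completed on this machine.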

In [None]:
import sys
!{sys.executable} -m pip install pandasql
import pandas as pd
from pandasql import sqldf
import pyodbc

## Read and parse your db credentials


In [None]:
import re

file_path = '/home/ec2-user/SageMaker/db-credentials.txt'

with open(file_path, 'r') as file:
    # Read the whole credentials file into a single string
    file_contents = file.read()

# Remove newlines and surrounding whitespace
cleaned_string = file_contents.replace('\n', '').strip()

# Extract "key": "value" pairs using a regular expression
# (\s* tolerates any spacing around the colon)
pattern = r'"([^"]+)"\s*:\s*"([^"]+)"'
pairs = re.findall(pattern, cleaned_string)

# Collect the pairs into a dictionary, e.g. parsed_data['host']
parsed_data = dict(pairs)
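
The connection cells below read four keys from the parsed credentials: `host`, `port`, `username`, and `password`. A quick check that all four were found can catch a malformed credentials file before the connection attempt fails with a less obvious error. This is a minimal sketch; the helper names and the sample text are assumptions for illustration, and the real file's layout may differ.

```python
import re

# Keys that the connection cells in this notebook read from the credentials
REQUIRED_KEYS = ("host", "port", "username", "password")

def parse_credentials(text):
    """Extract "key": "value" pairs from the credentials text into a dict."""
    pairs = re.findall(r'"([^"]+)"\s*:\s*"([^"]+)"', text)
    return dict(pairs)

def missing_keys(creds, required=REQUIRED_KEYS):
    """Return the required keys that are absent from creds, in order."""
    return [key for key in required if key not in creds]

# Made-up sample text with the password deliberately left out
sample = '{"host": "db.example.org", "port": "1433", "username": "alice"}'
creds = parse_credentials(sample)

absent = missing_keys(creds)
if absent:
    print("Credentials file is missing:", ", ".join(absent))
```

In the notebook itself, `missing_keys(parsed_data)` should return an empty list before you continue.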


## Connect to the database
You may have access to more than one database: the read-only raw database and your read-write project database.


### Connecting to the "raw" data
The raw database contains the patient data. You have read-only access to this database.

- For AIM AHEAD Year 1 programs, the Raw database is named FellowsSample.
- For AIM AHEAD Year 2 programs, the Raw database is named AAOCHIN2023.
- For AIM AHEAD Year 3 programs, the Raw database is named AAOCHIN2024.

Within the Raw database, you must open the **View** specific to your project. Your OCHIN Project ID is provided to you when you receive confirmation of your OCHIN DB Login Activation. 

Once you have opened your View, you will see the Tables which contain the subset of data specific to your project. 

- Tables containing patient-level data specific to your project will be prefixed with your OCHIN Project ID.
- Tables that contain metadata shared amongst all projects are prefixed with "common".

In [None]:
### This example shows how to connect to the AAOCHIN2024 database.
raw_db = 'AAOCHIN2024' 
raw_connection_string = "DRIVER={ODBC Driver 17 for SQL Server};" + \
                    "SERVER=" + parsed_data['host'] + ',' + parsed_data['port'] + ';' + \
                    "DATABASE=" + raw_db + ';' + \
                    "UID=" + parsed_data['username'] + ';' + \
                    "PWD={" + parsed_data['password'] + "};"
raw_conn = pyodbc.connect(raw_connection_string, trusted_connection = 'no')
raw_cursor = raw_conn.cursor()
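
Because the same connection string is assembled again for the project database, it may be convenient to build it in one place and change only the database name. The sketch below is one possible helper, not part of the original notebook: the function name `build_connection_string` and the sample credentials are assumptions. It also doubles any `}` in the password, since ODBC treats `}}` inside a braced value as a literal closing brace.

```python
def build_connection_string(creds, database,
                            driver="ODBC Driver 17 for SQL Server"):
    """Assemble an ODBC connection string from parsed credentials."""
    # Inside a {braced} value, a literal "}" must be written as "}}"
    password = creds["password"].replace("}", "}}")
    parts = [
        "DRIVER={" + driver + "}",
        "SERVER=" + creds["host"] + "," + creds["port"],
        "DATABASE=" + database,
        "UID=" + creds["username"],
        "PWD={" + password + "}",
    ]
    return ";".join(parts) + ";"

# Made-up credentials for illustration only
example_creds = {"host": "db.example.org", "port": "1433",
                 "username": "alice", "password": "p}w"}
example_string = build_connection_string(example_creds, "AAOCHIN2024")
print(example_string)
```

With the real credentials, `pyodbc.connect(build_connection_string(parsed_data, raw_db))` would open the same connection as the cell above, and passing your project ID instead of `raw_db` would target the project database.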

In [None]:
### We can list all views we have access to in the database like so:
raw_cursor.execute('''
SELECT name AS TABLE_NAME
FROM sys.views
''')
for row in raw_cursor:
    print(row[0])

In [None]:
### We can view the top 10 entries in the CONCEPT_DIMENSION table like so:
### Note that the CONCEPT_DIMENSION table contains metadata shared amongst all projects, 
### and therefore must be prefixed with "common"
raw_cursor.execute('''
SELECT TOP 10 *
FROM common.CONCEPT_DIMENSION;

''')

results = raw_cursor.fetchall()

# Get the column names from the cursor description
columns = [column[0] for column in raw_cursor.description]

# Create a DataFrame from the fetched results and column names
results_df = pd.DataFrame.from_records(results, columns=columns)
results_df['NAME_CHAR'] = results_df['NAME_CHAR'].astype('category')
results_df
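
The execute, fetch, and read-column-names pattern above repeats for every query, so it can be wrapped in a small function. The sketch below is an assumption-laden illustration: the helper name `fetch_records` is made up, and an in-memory `sqlite3` database stands in for the OCHIN server so the example runs anywhere. Both `sqlite3` and `pyodbc` cursors expose `execute`, `fetchall`, and `description`, which is all the helper relies on.

```python
import sqlite3

def fetch_records(cursor, sql):
    """Run sql on a DB-API cursor and return (column_names, rows)."""
    cursor.execute(sql)
    # The first item of each description entry is the column name
    columns = [col[0] for col in cursor.description]
    rows = [tuple(row) for row in cursor.fetchall()]
    return columns, rows

# Stand-in database: a throwaway in-memory SQLite table, not OCHIN data
demo_conn = sqlite3.connect(":memory:")
demo_cursor = demo_conn.cursor()
demo_cursor.execute("CREATE TABLE demo (id INTEGER, label TEXT)")
demo_cursor.execute("INSERT INTO demo VALUES (1, 'a'), (2, 'b')")

columns, rows = fetch_records(demo_cursor, "SELECT id, label FROM demo ORDER BY id")
print(columns)
print(rows)
demo_conn.close()
```

In the notebook, the returned values can be passed to `pd.DataFrame.from_records(rows, columns=columns)` exactly as in the cell above.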

### Connecting to your project database
Your project database is unique to your project and named after your OCHIN Project ID. You have read-write access to this database.

Your project database is initially empty. All temporary tables or aggregate results for your project should be saved here. You can store up to 50GB in your project database. This is not a lot! The full Raw database has over 1TB of data. If you copy data directly from the Raw database to your project database, you will quickly run out of space.



In [None]:
### This example shows how to connect to your project-specific database.
proj_db = 'S000' # Change this to your project ID
proj_connection_string = "DRIVER={ODBC Driver 17 for SQL Server};" + \
                    "SERVER=" + parsed_data['host'] + ',' + parsed_data['port'] + ';' + \
                    "DATABASE=" + proj_db + ';' + \
                    "UID=" + parsed_data['username'] + ';' + \
                    "PWD={" + parsed_data['password'] + "};"
proj_conn = pyodbc.connect(proj_connection_string, trusted_connection = 'no')
proj_cursor = proj_conn.cursor()

In [None]:
### We can print all tables in the project database like so:
### Note that this database will be empty unless you have saved tables here. 
proj_cursor.execute('''
SELECT name AS TABLE_NAME
FROM sys.tables
''')
for row in proj_cursor:
    print(row[0])