# Data Filtering and Conversion Notebook

## In this notebook: we create a filtered dataset from a much larger database, extracting the columns and rows relevant for further analysis

## Input:
- Path to the SQLite database built by joining the input metadata and sequence data (code cell 2)
- Query that selects the columns of interest (code cell 7)
- User-defined filters for analyzing a subset of the data, e.g. Baseline vs Acute (code cells 4 to 6)

## Output:
- The filtered dataframes, exported as CSV or as a SQLite database (code cells 11 and 12)

In [1]:
# load needed libraries
import sqlite3
import os
import pandas as pd

# User input

## Database from Extract

Connect to SQLite database

In [2]:
def connect():
    '''function to return SQLite connection
    '''
    # create connection object representing database
    # point to local db (DB Path)
    return sqlite3.connect('/media/teamcovid/ForData/fdacovid3.db')

Create SQLite connection instance

In [3]:
conn = connect()

### Enter your query parameters below

In [4]:
disease_stage1 = "Acute"

In [5]:
disease_stage2 = "Baseline"

In [6]:
num_rows = 100000

## Generate query to select key columns

Generate SQL query

In [7]:
# Query selects needed columns
# Current example filters by column disease_stage and caps the number of rows
# Both values are left as "?" placeholders and bound when the query is executed
def query_str():
    ''' function takes input: None
                       output: parametrized SQL query string
    '''

    query = '''SELECT sequence_id, junction_aa, junction_aa_length,
    seqtable.sample_processing_id, metadata.subject_id, metadata.study_id, metadata.sex,
    metadata.disease_diagnosis, metadata.disease_stage, metadata.intervention
    FROM seqtable
    INNER JOIN metadata ON metadata.sample_processing_id = seqtable.sample_processing_id
    WHERE metadata.disease_stage = ? LIMIT ?'''

    return query

Generate a query string instance

In [8]:
query = query_str()
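To make the placeholder binding concrete, here is a minimal sketch that runs a cut-down version of the query against an in-memory database. The toy tables, their rows, and the names `demo_conn` and `demo_query` are made up for illustration and are not part of the real database.

```python
import sqlite3

# Hypothetical toy tables that mimic two of the notebook's tables (made-up rows).
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE metadata (sample_processing_id TEXT, disease_stage TEXT);
    CREATE TABLE seqtable (sequence_id TEXT, sample_processing_id TEXT);
    INSERT INTO metadata VALUES ('s1', 'Acute'), ('s2', 'Baseline');
    INSERT INTO seqtable VALUES ('q1', 's1'), ('q2', 's1'), ('q3', 's2');
""")

demo_query = """
    SELECT seqtable.sequence_id, metadata.disease_stage
    FROM seqtable
    INNER JOIN metadata
        ON metadata.sample_processing_id = seqtable.sample_processing_id
    WHERE metadata.disease_stage = ?
    ORDER BY seqtable.sequence_id
    LIMIT ?
"""

# Values are bound positionally: the first ? is the stage, the second ? is the row cap.
acute_rows = demo_conn.execute(demo_query, ("Acute", 10)).fetchall()
capped_rows = demo_conn.execute(demo_query, ("Acute", 1)).fetchall()
demo_conn.close()

print(acute_rows)
print(capped_rows)
```

Binding values this way keeps the query text fixed, so the same string can be reused for every disease stage without string formatting.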

## Constraints in the Query

Generate dataframe from query above

In [9]:
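# This creates a pandas dataframe from the database
# The query needs one "?" placeholder for the disease_stage value and one for the number of records
def getdata(query, disease_stage, num_rows):
    ''' function takes input:
                            query : SQL query string with two "?" placeholders,
                            disease_stage, num_rows : values bound to those placeholders, in order
                       output: pandas dataframe with the query result
    '''
    db = connect()  # open a dedicated connection for this call
    try:
        # use the query passed in rather than rebuilding it with query_str()
        df = pd.read_sql_query(query, db, params=(disease_stage, num_rows))
    finally:
        db.close()  # always release the connection
    return df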

Generate dataframe instance

In [10]:
df1 = getdata(query, disease_stage1, num_rows)  # Acute
df2 = getdata(query, disease_stage2, num_rows)  # Baseline
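Before exporting, it can be worth confirming that each dataframe holds only the requested stage and respects the row cap. The helper below is a hypothetical sketch (`check_stage_frame` is not part of the notebook), shown on a made-up toy frame; it only assumes a `disease_stage` column like the one selected by the query.

```python
import pandas as pd

def check_stage_frame(df, expected_stage, max_rows):
    """Return True if df has at most max_rows rows, all labelled expected_stage."""
    within_cap = len(df) <= max_rows
    single_stage = (df["disease_stage"] == expected_stage).all()
    return bool(within_cap and single_stage)

# Made-up rows standing in for a query result.
toy = pd.DataFrame({
    "sequence_id": ["q1", "q2"],
    "disease_stage": ["Acute", "Acute"],
})

ok_for_acute = check_stage_frame(toy, "Acute", 100000)
ok_for_baseline = check_stage_frame(toy, "Baseline", 100000)
print(ok_for_acute, ok_for_baseline)
```

In the notebook this would be called as `check_stage_frame(df1, disease_stage1, num_rows)` and likewise for `df2`.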

# Export the queried dataframes to CSV files

In [11]:
# uncomment to write one CSV file per dataframe (file names are examples)
#df1.to_csv("df_for_analysis1.csv")
#df2.to_csv("df_for_analysis2.csv")

# Export the queried dataframes to a SQLite db

In [12]:
# export queried data frames to a SQLite db
from sqlalchemy import create_engine

# engine for a local SQLite file in the working directory
disk_engine = create_engine('sqlite:///disease_stage.db')

# one table per disease stage; an existing table with the same name is replaced
df1.to_sql('tb1', disk_engine, if_exists = 'replace')  # Acute
df2.to_sql('tb2', disk_engine, if_exists = 'replace')  # Baseline
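A quick way to confirm an export like this worked is to read the table back and compare it with the source dataframe. The sketch below uses an in-memory `sqlite3` connection and a made-up frame instead of `disease_stage.db`; the table name `tb_demo` is hypothetical. It passes `index=False`, which is a choice made here so that the pandas index is not written as an extra column (the cell above keeps the default and stores it).

```python
import sqlite3
import pandas as pd

# Made-up frame standing in for df1 or df2.
source = pd.DataFrame({
    "sequence_id": ["q1", "q2"],
    "junction_aa_length": [13, 15],
})

demo_db = sqlite3.connect(":memory:")

# Write the frame, replacing any table with the same name.
source.to_sql("tb_demo", demo_db, if_exists="replace", index=False)

# Read it back and compare column names and values.
restored = pd.read_sql_query("SELECT * FROM tb_demo", demo_db)
demo_db.close()

same_columns = list(restored.columns) == list(source.columns)
same_ids = restored["sequence_id"].tolist() == source["sequence_id"].tolist()
same_lengths = restored["junction_aa_length"].tolist() == source["junction_aa_length"].tolist()
print(same_columns, same_ids, same_lengths)
```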

In [13]:
conn.close()

# Inspect the queried data

In [14]:
df1.iloc[1]

sequence_id                                    5f4c2670d06cd255bfdd1622
junction_aa                                               CSARGGRDYEQYF
junction_aa_length                                                   13
sample_processing_id                           5f4c16ad83a8226b3dd8db5a
subject_id                                                 ADIRP0000093
study_id                ImmuneCODE-COVID-Release-002: COVID-19-Adaptive
sex                                                              female
disease_diagnosis                                              COVID-19
disease_stage                                                     Acute
intervention                                                           
Name: 1, dtype: object
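The row above can also serve as a small consistency check. The sketch below assumes that `junction_aa_length` is meant to equal the number of characters in `junction_aa`, which matches the row shown but is not stated elsewhere in this notebook; the values are copied from that output.

```python
import pandas as pd

# Values copied from the row displayed above.
row = pd.Series({
    "junction_aa": "CSARGGRDYEQYF",
    "junction_aa_length": 13,
})

# Assumption: the length column counts amino acids in the junction string.
length_matches = len(row["junction_aa"]) == int(row["junction_aa_length"])
print(length_matches)
```

For a whole dataframe, the same comparison could be written as `(df1["junction_aa"].str.len() == df1["junction_aa_length"]).all()`.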