### Subprocesses

One of the biggest strengths of Python is that it can be used as a *glue* language. <br>
It can 'glue' together a series of programs into a flexible and highly extensible pipeline.

### Why subprocesses
One of the most common, yet complicated, tasks that most programming languages need to do is creating new processes. <br>
This could be as simple as seeing what files are present in the current working directory (`ls`) or as complicated as creating a program workflow that *pipes* output from one program into another program's input. <br/><br/>
Many such tasks are easily taken care of through the use of Python libraries and modules (`import`) that *wrap* the programs into Python code, effectively creating Application Programming Interfaces (API). <br/><br/>
However, there are many use cases that require the user to make calls to the terminal from ***within*** a Python program.

#### Operating System Conundrum

As many in this class have found out, Python can be installed on most operating systems, but doing the same thing in one operating system (Unix) may not always yield the same results in another (Windows).<br/><br/>
The first step toward making a program **"OS-agnostic"** is to use the `os` module.

In [None]:
import os

https://docs.python.org/3/library/os.html

In [None]:
#dir(os)

In [None]:
help(os.getcwd)

In [None]:
os.getcwd()

In [None]:
help(os.chdir)

In [None]:
# The name of the operating system dependent module imported.
# The following names have currently been registered: 'posix', 'nt', 'java'
# 'posix': Portable Operating System Interface, an IEEE standard designed to
#          facilitate application portability (Linux, macOS)
# 'nt':    Windows (the name comes from Windows NT, "New Technology")
os.name

In [None]:
# A list of strings that specifies the search path for modules. 
import sys
sys.path

In [None]:
# A mapping object that contains environment variables and their values.

os.environ

In [None]:
# Indexing os.environ raises KeyError if the variable is not set
# (HOME is usually undefined on Windows, for example).

print(os.environ['HOME'])

# os.getenv returns the value of the environment variable key if it exists,
# or default (None unless given) if it doesn't. key, default and the result are str.

print(os.getenv("HOME"))
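A lookup with a fallback avoids surprises when a variable is missing. The sketch below is only an illustration: the variable name `MY_UNSET_VARIABLE` is hypothetical and is assumed not to be set in your environment.

```python
import os

# Hypothetical variable name, assumed NOT to be set in the environment
name = "MY_UNSET_VARIABLE"
os.environ.pop(name, None)              # make sure the assumption holds

missing = os.getenv(name)               # None when the variable is absent
fallback = os.getenv(name, "unknown")   # the default is returned instead of None
home = os.path.expanduser("~")          # portable way to find the home directory

print(missing, fallback, home)
```

`os.path.expanduser("~")` is one way to find the home directory without depending on `HOME` being defined.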

In [None]:
print(os.getenv("PATH"))

# Returns the list of directories that will be searched for a named executable,
#similar to a shell, when launching a process. 
# env, when specified, should be an environment variable dictionary to lookup the PATH in. 
# By default, when env is None, environ is used.

os.get_exec_path()

The `os` module wraps OS-specific operations into a set of standardized commands. <br>
For instance, the Linux end-of-line (EOL) character is a `\n`, but `\r\n` in Windows. <br>
In Python, we can just use the following:

In [None]:
# EOL - for the current (detected) environment

'''
The string used to separate (or, rather, terminate) lines on the current platform. 
This may be a single character, such as '\n' for POSIX, or multiple characters, 
for example, '\r\n' for Windows. 
Do not use os.linesep as a line terminator when writing files opened in text mode (the default); 
use a single '\n' instead, on all platforms.
'''

os.linesep
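The advice in the docstring above can be checked with a small experiment. This is a minimal sketch that writes to a throwaway temporary file (the file name `lines.txt` is arbitrary): the text is written with `'\n'` only, and the raw bytes on disk still end up with the platform's own line ending.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lines.txt")
    with open(path, "w") as handle:         # text mode (the default)
        handle.write("first\nsecond\n")     # always write a single '\n'
    with open(path, "rb") as handle:        # binary mode shows the raw bytes
        raw = handle.read()
    with open(path) as handle:              # text mode translates back to '\n'
        text = handle.read()

expected_raw = ("first" + os.linesep + "second" + os.linesep).encode()
print(raw)
print(raw == expected_raw)
```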

As another example, in a Linux environment the following command lists the contents of a given directory:
```
ls -alh 
```

In Windows, the equivalent is as follows:
```
dir
```

Python lets users run a single command, regardless of the OS:

In [None]:
# List directory contents

os.listdir("ProjectCM")

However, the biggest issue when creating an OS-agnostic program is ***paths***: <br/>
Windows: `"C:\\Users\\MDS\\Documents"`<br/>
Linux: `/mnt/c/Users/MDS/Documents/`<br/><br/>
Enter Python:

In [None]:
# path joining from pwd
pwd = os.getcwd()
print(pwd)
print(os.path.dirname(pwd))
os.path.join(pwd,"ProjectCM","demoCM","test.py")
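The same idea can be tried without depending on the current directory. In this minimal sketch the path components are hypothetical and nothing needs to exist on disk, because `os.path` functions only manipulate strings.

```python
import os

# Hypothetical project layout, used only for illustration
parts = ["ProjectCM", "demoCM", "test.py"]

path = os.path.join(*parts)       # joined with the separator of the current OS
base = os.path.basename(path)     # last component of the path
parent = os.path.dirname(path)    # everything before the last component
components = path.split(os.sep)   # back to the individual components

print(path)
print(base, parent, components)
```

Because the separator comes from `os.sep`, the same code yields backslashes on Windows and forward slashes on Linux and macOS.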

### `subprocess`

If you search for how to run shell commands but don't specify Python 3.x, you will likely get an answer that includes `popen`, `popen2`, or `popen3`. These were once the most common ways to *open* a new *p*rocess. They were superseded by the `subprocess` module (available since Python 2.4), and since Python 3.5 the recommended entry point is the `run` function of that module.

In [None]:
# Import and alias
import subprocess as sp

#### `check_output`

In [None]:
help(sp.check_output)

In [None]:
# check_output returns a bytestring by default, so I set encoding to convert it to strings.
# [command, command line arguments]
# change from bytes to string using encoding

sp.check_output(["echo","test"],encoding='utf_8')

In [None]:
# demonstration, might not work if test.py does not have the parsing code
sp.check_output([os.path.join(pwd,"test.py"),"[1,2,3]"],encoding='utf_8')

The first examples we look at are trivial ones that demonstrate just capturing the *output* (stdout) of a program.

However, while the `check_output` function is still in the `subprocess` module, it can easily be converted into a more specific and/or flexible `run` function call.
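As a rough illustration of that conversion, the sketch below runs the same command both ways. It uses `sys.executable` (the running Python interpreter) instead of `echo` so that it behaves the same on every OS; that substitution is an assumption of this example, not part of the original cells.

```python
import subprocess as sp
import sys

# sys.executable is the running Python interpreter, so this works on any OS
cmd = [sys.executable, "-c", "print('test')"]

out_check = sp.check_output(cmd, encoding="utf_8")

# The same call expressed with run(): capture stdout and raise on failure
out_run = sp.run(cmd, encoding="utf_8", stdout=sp.PIPE, check=True).stdout

print(out_check == out_run)
```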

#### `run`

In [None]:
help(sp.run)

In [None]:
sub = sp.run(
    [
        'echo',             # The command we want to run
        'test'              # Arguments for the command
    ],
    encoding='utf_8',       # Decode the output from bytes to str
    stdout=sp.PIPE,         # Where to send the output
    check=True              # Whether to raise an error if the process fails
)  
sub

In [None]:
[elem for elem in dir(sub) if not elem.startswith("__")]

In [None]:
print(sub.stdout)

The main utility of `check_output` was to capture the output (stdout) of a program. <br>
By using the `stdout=subprocess.PIPE` argument, the output can easily be captured, along with its return code. <br>
A return code signifies the program's exit status: 0 for success, any other value for failure.

In [None]:
sub.returncode

With our `run` code above, our program ran to completion, exiting with status 0. The next example shows a different status.

In [None]:
sp.run(
        'exit 1',      # Command & arguments
        shell = True   # Run from the shell
        )


However, if the `check=True` argument is used, it will raise a `CalledProcessError` if your program exits with anything different than 0. This is helpful for detecting a pipeline failure, and exiting or correcting before attempting to continue computation.

In [None]:
sp.run(
        'exit 1',      # Command & arguments
        shell = True,  # Run from the shell
        check = True   # Check exit status
    )

In [None]:
sub = sp.run(
        'exit 1',      # Command & arguments
        shell = True,  # Run from the shell
        # check = True   # Check exit status
    )
if sub.returncode != 0:
    print(f"Exit code {sub.returncode}. Expected 0 when there is no error.")
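When `check=True` is used, the failure can also be handled instead of stopping the program. The following is a minimal sketch, again using `sys.executable` as a portable stand-in for the shell command; the exit status 3 is arbitrary.

```python
import subprocess as sp
import sys

# A child process that exits with status 3 (portable stand-in for 'exit 1')
cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]

try:
    sp.run(cmd, check=True)
    status = 0
except sp.CalledProcessError as err:
    status = err.returncode       # exit status of the failed command
    print(f"Command failed with exit status {status}")
```

In a pipeline, the `except` branch is where a cleanup step or a retry would go.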

#### Syntax when using `run`:
1. A list of arguments: `subprocess.run(['echo', 'test', ...], ...)` 
2. A string and `shell`: `subprocess.run('exit 1', shell = True, ...)`

The preferred way of using `run` is the first way. <br>
This preference is mainly for security reasons (it prevents shell injection attacks). <br>
It also lets the module take care of any required escaping and quoting of arguments, for a pseudo-OS-agnostic approach. 
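The security point can be illustrated with a small sketch. The "untrusted" string below is made up for the example, and the child process is just the Python interpreter printing its first argument.

```python
import subprocess as sp
import sys

# Made-up untrusted input that contains a shell metacharacter (;)
user_input = "hello; echo injected"

# Passed as a list element, the string reaches the program as ONE argument,
# and no shell ever gets the chance to interpret the ';'
result = sp.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", user_input],
    stdout=sp.PIPE,
    encoding="utf_8",
    check=True,
)
print(result.stdout)
```

Had the same text been pasted into a command string and run with `shell=True`, a Unix shell would have treated everything after the `;` as a second command.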

There are some guidelines though:
1. Sequence (list) of arguments is generally preferred
2. A str is appropriate if the user is just calling a program with no arguments
3. The user should use a str to pass arguments if `shell` is `True`

Your next question should be, "What is `shell`?"

The shell is the command interpreter that runs inside your terminal/command prompt (for example `bash` on Linux or `cmd.exe` on Windows). This is the environment in which you call `ls`/`dir`. It is also where users can define variables. More importantly, this is where your *environment variables*, such as `PATH`, are set.<br/><br/>
By using `shell = True`, the user can now use shell-based environment variable expansion from within a Python program.

In [None]:
sp.run(
        'echo $PATH',            # Command
        shell = True,            # Use the shell
        stdout=sp.PIPE,          # Where to send it
        encoding='utf_8'         # Convert from bytes to string
    )      # Look at the output


In [None]:
p1 = sp.run(
        'sleep 5; echo done1',   # Command
        shell = True,            # Use the shell
        stdout=sp.PIPE,          # Where to send it
        encoding='utf_8'         # Convert from bytes to string
    )
print(p1)
p2 = sp.run(
        'echo done2',            # Command
        shell = True,            # Use the shell
        stdout=sp.PIPE,          # Where to send it
        encoding='utf_8'         # Convert from bytes to string
    )
print(p2)

For the most part, you shouldn't need to use `shell`, simply because Python has modules in the standard library that can do what most shell commands do. For example, `mkdir` can be done with `os.mkdir()`, and `$PATH` can be retrieved using `os.getenv("PATH")` or `os.get_exec_path()` as shown above. 
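A few more standard-library replacements for common shell commands are sketched below. The directory name `results` is hypothetical, and everything happens inside a temporary directory so nothing on disk is disturbed.

```python
import os
import shutil
import tempfile

# Work inside a throwaway directory that is removed automatically
with tempfile.TemporaryDirectory() as tmp:
    new_dir = os.path.join(tmp, "results")   # hypothetical directory name
    os.mkdir(new_dir)                        # instead of: mkdir results
    created = os.listdir(tmp)                # instead of: ls (Unix) / dir (Windows)
    print(created)

# Instead of: which python (Unix) / where python (Windows)
print(shutil.which("python") or shutil.which("python3"))
```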

#### Blocking vs Non-blocking

The last topic of this lecture is "blocking". This is computer science lingo/jargon for whether or not a program ***waits*** until something is complete before moving on. Think of this like a really bad website that takes forever to load because it is waiting until it has rendered all its images first, versus the website that sets the formatting and text while it works on the images.

1. `subprocess.run()` is blocking (it waits until the process is complete)
2. `subprocess.Popen()` is non-blocking (it starts the command, then moves on without waiting for it to finish)

***Most*** use cases can be handled through the use of `run()`.<br> 
`run()` is just a *wrapped* version of `Popen()` that simplifies use. <br>
However, `Popen()` allows the user a more flexible control of the subprocess call. <br>
`Popen()` can be used in a similar way to `run()` (with more optional parameters).

An example use case for `Popen()` is if the user has some intermediate data that needs to get processed, but the output of that data doesn't necessarily affect the rest of the pipeline.
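A rough sketch of that pattern follows: the parent keeps working while the child runs, and checks on it with `poll()`. The child here is a Python one-liner that sleeps for half a second, a portable stand-in for a long-running program; the loop counter stands in for useful work.

```python
import subprocess as sp
import sys
import time

# Child process that takes about half a second (portable stand-in for 'sleep')
child = sp.Popen(
    [sys.executable, "-c", "import time; time.sleep(0.5); print('done1')"],
    stdout=sp.PIPE,
    encoding="utf_8",
)

ticks = 0
while child.poll() is None:      # poll() returns None while the child is running
    ticks += 1                   # stand-in for useful work in the parent
    time.sleep(0.1)

output, _ = child.communicate()  # collect stdout and reap the child
print(ticks, output.strip(), child.returncode)
```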

#### `Popen`

In [None]:
p1 = sp.Popen(
        'sleep 5; echo done1',               # Command
        shell = True,            # Use the shell
        stdout=sp.PIPE,          # Where to send it
        encoding='utf_8'         # Convert from bytes to string
    )
print(p1)
p2 = sp.Popen(
        'echo done2',             # Command
        shell = True,            # Use the shell
        stdout=sp.PIPE,          # Where to send it
        encoding='utf_8'         # Convert from bytes to string
    )
print(p2)
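print("processes started")   # Printed immediately: Popen does not wait

print(p1.stdout.read())      # read() blocks until p1 closes its stdout (about 5 s)
print(p2.stdout.read())
p1.wait()                    # Reap the finished child processes
p2.wait()
print("processes completed")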



In [None]:
# Use context manager to handle process while it is running,
# and gracefully close it
with sp.Popen(
    [
        'echo',         # Command
        'here we are'       # Command line arguments
    ],
    encoding='utf_8', # Convert from byte to string
    stdout=sp.PIPE    # Where to send it
) as proc:            # Enclose and alias the context manager
    print(
        proc.stdout.read() # Look at the output
    )

In [None]:
for elem in dir(proc):
    if not elem.startswith('_'):
        print(elem)

#### ***NOTE***: From here on out, there might be different commands used for **Linux** / **MacOS** or **Windows**

In [None]:
#test_pipe.txt - a file to be used to demonstrate pipe of cat and sort 
!echo testing > test_pipe.txt
!echo the >> test_pipe.txt
!echo subprocess >> test_pipe.txt
!echo pipe >> test_pipe.txt


In [None]:
# Linux / macOS
p1 = sp.Popen(['cat','test_pipe.txt'], stdout=sp.PIPE, encoding='utf_8')

# Windows ('type' is built into cmd.exe, so it needs shell=True)
# p1 = sp.Popen('type test_pipe.txt', shell=True, stdout=sp.PIPE, encoding='utf_8')

print(p1.stdout.read())

In [None]:
# Linux / macOS
p1 = sp.Popen(['cat','test_pipe.txt'], stdout=sp.PIPE, encoding='utf_8')

# Windows ('type' is built into cmd.exe, so it needs shell=True)
# p1 = sp.Popen('type test_pipe.txt', shell=True, stdout=sp.PIPE, encoding='utf_8')


p2 = sp.Popen(['sort'], stdin=p1.stdout, stdout=sp.PIPE, encoding='utf_8')
p1.stdout.close()  # Allow p1 to receive a SIGPIPE if p2 exits
output = p2.communicate()[0]
print(output)
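For readers who cannot run `cat` or `type`, the same pipeline can be sketched with Python one-liners on both ends of the pipe. The two child commands below are stand-ins written for this example, not part of the original lecture.

```python
import subprocess as sp
import sys

# Portable stand-ins for 'cat test_pipe.txt' and 'sort'
producer = [sys.executable, "-c", "print('testing\\nthe\\nsubprocess\\npipe')"]
consumer = [sys.executable, "-c",
            "import sys; print(''.join(sorted(sys.stdin)), end='')"]

p1 = sp.Popen(producer, stdout=sp.PIPE)
p2 = sp.Popen(consumer, stdin=p1.stdout, stdout=sp.PIPE, encoding="utf_8")
p1.stdout.close()                  # p2 now holds the only read end of the pipe
output = p2.communicate()[0]       # wait for p2 and collect its output
p1.wait()                          # reap the first child as well
print(output.split())
```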


`Popen` can create background processes: like running a command in the background of a shell, it does not block. <br>
`Popen` has a lot more functionality than `run`.

In [None]:
sub_popen = sp.Popen(
    [
        'echo',          # Command
        'test',        # Command line arguments
    ],
    encoding='utf_8',  # Convert from byte to string
    stdout=sp.PIPE     # Where to send it
)
for j in dir(sub_popen):
    if not j.startswith('_'):
        print(j)


In [None]:
# sub - returned by run
for j in dir(sub):
    if not j.startswith('_'):
        print(j)

In [None]:
sub_popen.kill()       # Kill the process if it is still running

Examples of creating child processes:<br>
https://pymotw.com/3/subprocess/

A collection of `Popen` examples: <br>
https://www.programcreek.com/python/example/50/subprocess.Popen

## SQL

#### What is a database? 
* An organized collection of data (files)
* A way to store and retrieve that information
* A relational database is structured to recognize relations between the data elements

E.g. NCBI Gene <br>
https://www.ncbi.nlm.nih.gov/gene/statistics




<img src = "https://www.researchgate.net/profile/Adam_Richards3/publication/282134102/figure/fig3/AS:289128232046602@1445944950296/Database-entity-diagram-Data-collected-from-NCBI-the-Gene-Ontology-and-UniProt-are.png" width = "700"/>

#### More database examples: 
* The Python dictionary qualifies
* A spreadsheet is a type of database – a table
* A FASTA file could be considered a database


#### Why use databases?
* Databases can handle very large data sets 
* Databases scale well
* Databases are concurrent 
* Databases are fault-tolerant
* Your data has a built-in structure to it
* Information of a given type is typically stored only once
* You can query the data in a database and easily create meaningful reports
* You can relate data from different tables


#### What is the Structured Query Language (SQL)?
* SQL is the standard language (standardized by ANSI and ISO) for relational database management systems
* SQL is used to communicate with a database
* SQL can be used to: add, remove, modify, request data 

* SQL is a declarative language - you describe what you want



#### Relational Database Management Systems
* Software programs such as Oracle, MySQL, SQL Server, Db2, and PostgreSQL are the backbone on which a specific database can be built 
* They are called RDBMS (relational database management systems)
* They handle the data storage, indexing, logging, tracking and security  
* They have a very fine-grained way of granting permissions to users at the level of commands that may be used
    * Create a database
    * Create a table
    * Update or insert data
    * View certain tables ... and many more
    
* An important part of learning databases is to understand the type of data which is stored in columns and rows.  
* Likewise when we get to the database design section, it is critically important to know what type of data you will be modeling and storing (and roughly how much, in traditional systems) 
* Exactly which types are available depends on the database system


#### SQLite 
* SQLite is a software library that implements a self-contained, serverless, zero-configuration, embedded high-reliability, full-featured, public-domain SQL database engine. SQLite is the most widely deployed database engine in the world (https://sqlite.org/)
* A SQLite database is a single file that is transportable
* Check out the Bioconductor (annotation) packages that come with SQLite databases
    * hgu133a.db
        * https://bioconductor.org/packages/release/data/annotation/html/hgu133a.db.html
    * org.Hs.eg.db - Genome wide annotation for Human, primarily based on mapping using Entrez Gene identifiers
        * https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html


##### SQLite uses a greatly simplified set of data types:
* INTEGER - signed whole numbers
* REAL - floating-point numbers
* TEXT – text of any length
    * There is no dedicated date type; dates are usually held as text
* BLOB – binary large objects
    * Such as images
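The storage class of a value can be inspected with SQLite's built-in `typeof()` function. The sketch below uses a temporary in-memory database, so it does not touch any file; the literal values are arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # temporary in-memory database
row = conn.execute(
    "SELECT typeof(42), typeof(3.14), typeof('ACGT'), typeof(x'00ff'), typeof(NULL)"
).fetchone()
conn.close()

print(row)
```

Besides the four types listed above, SQLite reports `null` for missing values.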


In [None]:
from sqlite3 import connect

# the file org.Hs.eg.sqlite should be in the datasets folder 
# if you pulled the info from the class github repo
# otherwise retrieve from the class github repo or canvas
conn = connect('../datasets/org.Hs.eg.sqlite')
curs = conn.cursor()

# close cursor and connection
curs.close()
conn.close()

In [None]:
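# same path as above; connect() silently creates an empty database if the file is missing
conn = connect('../datasets/org.Hs.eg.sqlite')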
curs = conn.cursor()

There is a special `sqlite_master` table that describes the contents of the database.
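To see what `sqlite_master` holds without opening the class database, here is a minimal sketch on an in-memory database. The table `toy_genes` is hypothetical and only loosely modeled on `gene_info`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy table, created only for this illustration
conn.execute("CREATE TABLE toy_genes (_id INTEGER PRIMARY KEY, symbol TEXT)")

rows = conn.execute("SELECT type, name, sql FROM sqlite_master").fetchall()
conn.close()

for row in rows:
    print(row)
```

Each row describes one object in the database: its type, its name, and the SQL statement that created it.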

Major SQL commands: SELECT, INSERT, DELETE, UPDATE

#### SELECT - Retrieves data from one or more tables and doesn’t change the data at all 

* SELECT  * (means all columns), or the comma separated names of the columns of data you wish to return
    * Columns are returned (left to right) in the order in which they are listed. 
* FROM is the table source or sources (comma separated)
* WHERE (optional) is the predicate clause: conditions for the query
    * Evaluates to True or False for each row
    * This clause almost always includes Column-Value pairs.
    * Omitting the WHERE clause returns ALL the records in that table.
    * Note: in SQLite, text comparison with `=` is case sensitive, while `LIKE` is case insensitive for ASCII letters by default
* ORDER BY (optional) indicates a sort order for the output data 
    * without it the row order is not guaranteed; in practice it is often the rowid order, which can be very non-intuitive  
    * ASCending or DESCending can be appended to change the sort order.  (ASC is default)
* In most SQL clients, the ";" indicates the end of a statement and requests execution


SELECT - which columns to include in the result, use * for all columns <br>
FROM - which tables to use <br>
WHERE (optional) - predicate clause, which rows to include

`*` selects ALL columns, in table column order. Without a WHERE clause, ALL rows are returned, typically in rowid order.
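The clauses can be tried out on a tiny in-memory table before running them against the class database. The table `toy_genes` and its four rows are made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy table and rows, used only for illustration
conn.execute("CREATE TABLE toy_genes (_id INTEGER, symbol TEXT)")
conn.executemany(
    "INSERT INTO toy_genes VALUES (?, ?)",
    [(1, "A1BG"), (2, "A2M"), (3, "NAT1"), (4, "NAT2")],
)

# Columns come back in the order listed; rows are filtered, then sorted
rows = conn.execute(
    "SELECT symbol, _id FROM toy_genes WHERE _id > 1 ORDER BY symbol DESC"
).fetchall()
conn.close()

print(rows)
```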

In [None]:
sql = '''SELECT * FROM sqlite_master;'''
curs.execute(sql)

See result header

In [None]:
curs.description

See result

In [None]:
for row in curs: print(row)

WHERE clause example

In [None]:
sql = '''
SELECT name
FROM sqlite_master 
WHERE type = 'table';
'''
curs.execute(sql)
for row in curs: print(row)

In [None]:
def get_header(cursor):
    '''Make a tab-delimited header row from the cursor description.

    Arguments:
        cursor: a cursor after a select query
    Returns:
        string: the column names separated by tabs, with no trailing newline
    '''
    return '\t'.join([row[0] for row in cursor.description])
#    colNames = []
#    for row  in cursor.description:
#        colNames.append(row[0])
#    return '\t'.join(colNames)
print(get_header(curs))

In [None]:
sql = '''
SELECT *
FROM go_bp LIMIT 10;
'''
curs.execute(sql)
print(get_header(curs))
for row in curs.fetchall():        
    print('\t'.join([str(elem) for elem in row ]))

The `evidence` column holds GO evidence codes (http://geneontology.org/docs/guide-go-evidence-codes/), for example:
* Inferred from Experiment (EXP)
* Inferred from Direct Assay (IDA)
* Inferred from Physical Interaction (IPI)
* Inferred from Mutant Phenotype (IMP)
* Inferred from Genetic Interaction (IGI)
* Inferred from Expression Pattern (IEP)

Aliasing column names to make them easier to understand 

In [None]:
sql = '''
SELECT * FROM gene_info LIMIT 5;
'''
curs.execute(sql)
for i in curs.description: print(i[0])
for row in curs: print(row)


In [None]:
sql = '''
SELECT _id AS "Gene Identifier", symbol AS "Gene Symbol"
FROM gene_info LIMIT 5;
'''
curs.execute(sql)
curs.description

In [None]:
curs.fetchall()

In [None]:
sql = '''
SELECT _id AS "ID", symbol AS "Symbol"
FROM gene_info LIMIT 10;
'''
curs.execute(sql)
print(get_header(curs))
for row in curs.fetchall():        
    print('\t'.join([str(elem) for elem in row ]))

In [None]:
#select all from go_bp



The full list of GO evidence codes (http://geneontology.org/docs/guide-go-evidence-codes/):
* Inferred from Experiment (EXP)
* Inferred from Direct Assay (IDA)
* Inferred from Physical Interaction (IPI)
* Inferred from Mutant Phenotype (IMP)
* Inferred from Genetic Interaction (IGI)
* Inferred from Expression Pattern (IEP)
* Inferred from High Throughput Experiment (HTP)
* Inferred from High Throughput Direct Assay (HDA)
* Inferred from High Throughput Mutant Phenotype (HMP)
* Inferred from High Throughput Genetic Interaction (HGI)
* Inferred from High Throughput Expression Pattern (HEP)
* Inferred from Biological aspect of Ancestor (IBA)
* Inferred from Biological aspect of Descendant (IBD)
* Inferred from Key Residues (IKR)
* Inferred from Rapid Divergence (IRD)
* Inferred from Sequence or structural Similarity (ISS)
* Inferred from Sequence Orthology (ISO)
* Inferred from Sequence Alignment (ISA)
* Inferred from Sequence Model (ISM)
* Inferred from Genomic Context (IGC)
* Inferred from Reviewed Computational Analysis (RCA)
* Traceable Author Statement (TAS)
* Non-traceable Author Statement (NAS)
* Inferred by Curator (IC)
* No biological Data available (ND)
* Inferred from Electronic Annotation (IEA)


SELECT - which columns to include in the result <br>
FROM - which tables to use <br>
WHERE (optional) - predicate clause, which rows to include <br>
ORDER BY (optional) - indicates a sort order for the output data

In [None]:
sql = '''
SELECT _id, go_id
FROM go_bp 
WHERE evidence = 'ND'
ORDER BY _id DESC
LIMIT 20;
'''
curs.execute(sql)
print(get_header(curs))
for row in curs.fetchall():        
    print('\t'.join([str(elem) for elem in row ]))
#curs.fetchall()
#for row in curs: print(row)

`COUNT(*)` returns a single number: the number of rows that match the query (all rows in the table when there is no WHERE clause)

In [None]:
sql = '''
SELECT count(*) FROM genes;
'''
curs.execute(sql)
curs.fetchall()

In [None]:
sql = '''
SELECT count(_id) AS "Number of genes" 
FROM genes;
'''
curs.execute(sql)
print(get_header(curs))
curs.fetchall()[0][0]

DISTINCT selects non-duplicated elements (rows)

In [None]:
sql = '''
SELECT _id FROM go_bp LIMIT 20;
'''
curs.execute(sql)
curs.fetchall()

In [None]:
sql = '''
SELECT DISTINCT _id FROM go_bp LIMIT 10;
'''
curs.execute(sql)
curs.fetchall()

In [None]:
#count the number of rows on go_bp



In [None]:
sql = '''
SELECT DISTINCT _id FROM go_bp;
'''
curs.execute(sql)
result = curs.fetchall()
len(result)
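The count of distinct values can also be computed by the database itself, which avoids fetching every row into Python. The sketch below uses a made-up in-memory table, `toy_go`, with placeholder GO identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy table: gene 1 appears twice, genes 2 and 3 once each
conn.execute("CREATE TABLE toy_go (_id INTEGER, go_id TEXT)")
conn.executemany(
    "INSERT INTO toy_go VALUES (?, ?)",
    [(1, "GO:A"), (1, "GO:B"), (2, "GO:A"), (3, "GO:C")],
)

n_rows, n_genes = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT _id) FROM toy_go"
).fetchone()
conn.close()

print(n_rows, n_genes)
```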

WHERE clause operators <br>
https://www.sqlite.org/lang_expr.html

`<>`, `!=`: inequality <br>
`<`: less than <br>
`<=`: less than or equal <br>
`=`: equal <br>
`>`: greater than <br>
`>=`: greater than or equal <br>
`BETWEEN v1 AND v2`: tests that a value lies in a given range <br>
`EXISTS`: tests for existence of rows matching a query <br>
`IN`: tests if a value falls within a given set or query <br>
`IS [ NOT ] NULL`: is or is not null <br>
`[ NOT ] LIKE`: tests whether a value matches (or does not match) a pattern <br>

`%` is the wildcard in SQL, used in conjunction with LIKE: it matches any sequence of zero or more characters (`_` matches exactly one character)
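Several of these operators can be compared side by side on a small in-memory table. The table `toy_go`, its rows, and the helper `ids` are all made up for this sketch; the `?` placeholders are the `sqlite3` module's way of passing values into a query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy table and rows, used only for illustration
conn.execute("CREATE TABLE toy_go (_id INTEGER, go_id TEXT, evidence TEXT)")
conn.executemany(
    "INSERT INTO toy_go VALUES (?, ?, ?)",
    [(1, "GO:0001", "IDA"), (5, "GO:0081", "ND"),
     (7, "GO:0812", "IEA"), (9, "GO:9999", "ND")],
)

def ids(sql, params=()):
    """Hypothetical helper: run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql, params)]

in_ids = ids("SELECT _id FROM toy_go WHERE _id IN (1, 5) ORDER BY _id")
between_ids = ids("SELECT _id FROM toy_go WHERE _id BETWEEN 5 AND 7 ORDER BY _id")
like_ids = ids("SELECT _id FROM toy_go WHERE go_id LIKE ? ORDER BY _id", ("%081%",))
not_nd_ids = ids("SELECT _id FROM toy_go WHERE evidence != ? ORDER BY _id", ("ND",))
conn.close()

print(in_ids, between_ids, like_ids, not_nd_ids)
```

Note that `BETWEEN` includes both end points, and that `%081%` matches the pattern anywhere in the string.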


In [None]:
sql = '''
SELECT * FROM go_bp 
WHERE _id = 1;
'''
curs.execute(sql)
print(get_header(curs))
for row in curs.fetchall():        
    print('\t'.join([str(elem) for elem in row ]))

In [None]:
sql = '''
SELECT * FROM go_bp 
WHERE _id IN (1,5,7);
'''
curs.execute(sql)
print(get_header(curs))
for row in curs.fetchall():        
    print('\t'.join([str(elem) for elem in row ]))

In [None]:
sql = '''
SELECT * FROM go_bp 
WHERE evidence = 'ND' AND _id BETWEEN 20 AND 2000 
LIMIT 10
'''
curs.execute(sql)
print(get_header(curs))
for row in curs.fetchall():        
    print('\t'.join([str(elem) for elem in row ]))

In [None]:
sql = '''
SELECT * 
FROM go_bp
WHERE go_id LIKE '%0081%' 
LIMIT 10;
'''
curs.execute(sql)
print(get_header(curs))
for row in curs.fetchall():        
    print('\t'.join([str(elem) for elem in row ]))

In [None]:
# Retrieve rows from go_bp where the go_id is GO:0008104 and evidence is IEA or IDA

SQLite also provides PRAGMA statements, <br>
an SQL extension specific to SQLite that is used to modify the operation of the SQLite library or to query it for internal (non-table) data. <br>
https://www.sqlite.org/pragma.html <br>
The code below shows how to get the schema (the columns and their properties) of a table.

In [None]:
sql = 'PRAGMA table_info("go_bp")'
curs.execute(sql)
curs.fetchall()

In [None]:
sql = '''SELECT * FROM pragma_table_info("go_bp")  '''
curs.execute(sql)
curs.fetchall()

In [None]:
sql = '''
SELECT _id, symbol, gene_name 
FROM gene_info
WHERE _id IN
    (SELECT DISTINCT _id 
    FROM go_bp
    WHERE go_id = 'GO:0008104'); 
'''
curs.execute(sql)
print(get_header(curs))
for row in curs.fetchall():        
    print('\t'.join([str(elem) for elem in row ]))

##### GROUP BY groups rows that share a column value and creates summary data for a different column

In [None]:
sql = '''
SELECT go_id, count(*) FROM go_bp GROUP BY go_id LIMIT 10;
'''
curs.execute(sql)
curs.fetchall()

In [None]:
sql = '''
SELECT go_id, count(_id) as gene_no FROM go_bp GROUP BY go_id LIMIT 10;
'''
curs.execute(sql)
print(get_header(curs))
for row in curs.fetchall():        
    print('\t'.join([str(elem) for elem in row ]))

##### HAVING restricts which groups are returned after aggregation (WHERE restricts rows before grouping)

In [None]:
sql = '''
SELECT go_id, count(_id) as gene_no FROM go_bp GROUP BY go_id
HAVING gene_no>500;
'''
curs.execute(sql)
print(get_header(curs))
for row in curs.fetchall():        
    print('\t'.join([str(elem) for elem in row ]))
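The difference between WHERE and HAVING is easiest to see when both appear in one query. The sketch below runs on a made-up in-memory table, `toy_go`, with placeholder GO identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy table and rows, used only for illustration
conn.execute("CREATE TABLE toy_go (_id INTEGER, go_id TEXT, evidence TEXT)")
conn.executemany(
    "INSERT INTO toy_go VALUES (?, ?, ?)",
    [(1, "GO:A", "IDA"), (2, "GO:A", "IEA"), (3, "GO:A", "ND"),
     (1, "GO:B", "IDA"), (2, "GO:B", "ND")],
)

rows = conn.execute(
    """
    SELECT go_id, COUNT(_id) AS gene_no
    FROM toy_go
    WHERE evidence != 'ND'      -- row filter, applied before grouping
    GROUP BY go_id
    HAVING gene_no >= 2         -- group filter, applied after counting
    """
).fetchall()
conn.close()

print(rows)
```

The `ND` rows are dropped first, so `GO:B` is left with a single gene and is then removed by HAVING.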

In [None]:
# Select gene ids with more than 100 biological processes associated




#### See the create table statement

In [None]:
sql = '''
SELECT name,sql
FROM sqlite_master 
WHERE type = 'table' AND name = 'go_bp'
LIMIT 2;
'''
curs.execute(sql)
print(get_header(curs))
for row in curs.fetchall():        
    print('\t'.join([str(elem) for elem in row ]))

In [None]:
print(row[1])

In [None]:
curs.close()
conn.close()