# Table of Contents
* [Encapsulation](#Encapsulation)
* [Classes](#Classes)
* [Python Modules](#Python-Modules)
* [Containers](#Containers)
* [APIs](#APIs)
* [OS](#The-os-module)


# Encapsulation

A very important concept implemented in most, if not all, programming languages is the principle of *encapsulation*.

The basic idea is that code should be written in a modular manner, each functionality being implemented in a piece of code that communicates with the rest of the code via an interface. This has many advantages, such as 

* each piece of code can be written and tested *independently* of other code
* variables in that code will not be inadvertently changed somewhere else in the program
* the code can be easily shared and reused

In other words, it becomes much easier to ensure that a particular functionality is implemented correctly, it can be optimized without affecting other parts of a program, and once the development of the code is complete, it does not have to be revisited every time the functionality needs to be reused.

Furthermore, encapsulated code can be conveniently reused without having to worry how it does what it does, just like a *black box*:
```
        ____________
       | BLACK  BOX |
       |            |
input ==> f(input) ==> output
       |            |
       |____________|
```

The output produced by the code in the black box is just a function of its input.

There are various levels or layers of encapsulation. Python's functions and methods are one example of encapsulation. 

## Scope

Scope is a related concept. Imagine that we have written a function `power_up()` that looks like this:

In [1]:
def power_up(x, y=1):
    """returns x^y, or x^1 if no value for y is provided"""
    return x**y

And we use this function in our code:

In [2]:
print("With 2 args:", power_up(3, 3))
print("With default:", power_up(3))
print("With y=", y, ": ", power_up(3))

With 2 args: 27
With default: 3


NameError: name 'y' is not defined

What has happened? 

Even though the function executes fine, at the last call to print we find out that `y` is not defined. Why is that? 

That's because `y` only exists in the **scope** of the function, not in the larger scope of the program which calls the function. This is important! Imagine what would happen if every function, module etc. that we use in our code, written by ourselves or others, defined variables that are visible everywhere in the code. Variables that we define would then start changing without us being aware. Not a good idea.

Thus, Python keeps track of variables, functions etc. in a `symbol table`, where each of these objects is associated with its scope. Objects from an *outer* scope are visible in an *inner* scope but not the other way around. Let's look at an example:

In [3]:
def power_up(x, y=1):
    """returns x^y, or x^1 if no value for y is provided"""
    print("In the inner scope x =", str(x))
    print("in the inner scope z = ", str(z))
    return x**y

x = 3
z = 10

print("Before function call x =", str(x))
print(power_up(4, 2))
print("After function call x =", str(x))

Before function call x = 3
In the inner scope x = 4
in the inner scope z =  10
16
After function call x = 3


As you can see, the `x` outside the function is not influenced by the argument `x` to the function. Furthermore, the `global` variable `z` is visible inside the function.
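The same separation holds for assignments. The minimal sketch below, which uses a hypothetical function `rescale()` that is not part of the example above, suggests that assigning to `z` inside a function creates a new local name instead of changing the global `z`:

```python
z = 10  # global variable

def rescale(x):
    # assigning to z here creates a *local* z; the global z is left untouched
    z = 2
    return x * z

result = rescale(4)
print("Result of rescale(4):", result)
print("Global z after the call:", z)
```

The function works with its own `z`, while the `z` of the calling program keeps its value.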

## Exercises

Use the built-in function `id()` to show that two variables with the same name, one defined in the global scope, the other defined in a local scope, are indeed different objects.

# Classes

Classes are nice abstractions that allow us to group data and functionality into objects, creating new *data types* or extending already available ones. Classes also define new *name scopes*, beyond the *global* scope of the entire program and the *local* scope of individual functions.

## Class definition

A class definition is similar to a function definition:

```
class ClassName:
    <statement 1>
    <statement 2>
    ...
    <statement n>
```
and should come before the first use of the class object. The statements inside the class definition implement class *variables* and *functions*, which can be accessed with the <kbd>.</kbd> notation. Let's look in more detail at an example.

In [85]:
# a class describing different cell types
class Cell:

    markers = []
    
    def __init__(self, kind):
        self.kind = kind
        
    def add_marker(self, marker):
        self.markers.append(marker)
        

c = Cell("B cell")
c.add_marker("CD19")
d = Cell("T cell")
d.add_marker("CD3")
print(c.kind + " : " + " ".join(c.markers))

B cell : CD19 CD3


We started by defining a `class variable` called *markers*, intended to hold the markers associated with each type of cell. We then defined two functions, one that has a special name, `__init__` and will be called every time we create an `instance` of the class and another that adds markers for the cell. 

The `__init__` function only sets the type, or kind, of the cell that has just been created. Note, however, that this function, like all other functions that we define for the class, takes a special parameter called *self*, which is a reference to the current instance of the class. The `add_marker` function allows us to construct the list of markers defining the cell type.

Now what happens when we start instantiating the class to generate various types of cells? We see that instead of a cell type-specific marker list, each cell ends up having the same markers. This is because *markers* was defined as a `class variable`, which all instances share and modify. In contrast, `kind` is an `instance variable`, its definition being linked to a specific instance of the class, which is why the full name of the variable is `self.kind`. We can modify this behavior by redefining the *markers* variable for each instance. Nevertheless, we still have access to the class variable as illustrated below. In contrast to some other programming languages, Python does not prevent us from accessing all members of a class, variables and methods, via the <kbd>.</kbd> notation, i.e. they are *public*, not *private* to the class. 

In [89]:
# a class describing different cell types
class Cell:

    markers = []
    
    def __init__(self, kind):
        self.kind = kind
        self.markers = []
        
    def add_marker(self, marker):
        self.markers.append(marker)
        
c = Cell("B cell")
c.add_marker("CD19")
d = Cell("T cell")
d.add_marker("CD3")
Cell.markers.append("TIA1")
print(c.kind + " : " + " ".join(c.markers))
print(d.kind + " : " + " ".join(d.markers))
print(Cell.markers)

B cell : CD19
T cell : CD3
['TIA1']


A more interesting use of class variables is when we do want to keep track of properties that are indeed shared by all instances of the class. In the example below we first define a dictionary that uses a data type we have not discussed before, which is the enumeration or `Enum` type. This basically consists of consecutive values, typically starting from 1, which in the code we would like to refer to by more informative names. Here we created an `Enum` variable called *errorType*, which can take one of three values, the names of these values being specified in a string argument.

In [87]:
from enum import Enum, auto

class ParameterError:
    errorType = Enum('errorType', 'unknown wrongType wrongValue')
    messageDict = {errorType.unknown.value : "Unexpected error",
                   errorType.wrongType.value : "Incorrect parameter type", 
                   errorType.wrongValue.value : "Incorrect parameter value"}
    
    def __init__(self, typeCode=errorType.unknown.value):
        self.t = typeCode
        self.m = self.messageDict[self.t]

    def shout(self):
        return (self.t, self.m)

e = ParameterError(1)
f = ParameterError(2)
print(e.m)
print(f.m)

Unexpected error
Incorrect parameter type


Here it makes sense to use the same `Enum` variable to deal with all errors, and then add instance variables to enhance the functionality of the class.
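To see what the functional `Enum` definition used above produces, here is a small stand-alone sketch. It repeats the `errorType` definition outside the class (an assumption made only to keep the example short) and lists the names together with their automatically assigned values:

```python
from enum import Enum

# same functional-style definition as in the class above
errorType = Enum('errorType', 'unknown wrongType wrongValue')

# list every member together with its automatically assigned value
members = [(member.name, member.value) for member in errorType]
print(members)

# a member can be looked up by value or by name
by_value = errorType(2)
by_name = errorType['wrongValue']
print(by_value, by_name)
```

This is why `ParameterError(1)` and `ParameterError(2)` pick up the first and second message in the dictionary.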

## Inheritance

Classes are important not only because of the encapsulation they provide, but also because they support `inheritance`, a concept that allows us to build objects in a modular, incremental manner. Let's look again at an example. 

In [None]:
class Transcript:
    
    def __init__(self, tid, name, kind):
        # give the transcript an identifier, name and function
        self.tid = tid
        self.name = name
        self.kind = kind
        
    def set_coords(self, chrom, strand, start, end):
        # save the coordinates of the transcript in the genome
        self.chrom = chrom
        self.strand = strand
        self.start = start
        self.end = end
        
    def set_sequence(self, seq):
        # save the transcript sequence
        self.seq = seq
        

class CodingTranscript(Transcript):
    
    def set_cds(self, start, end):
        self.cds_start = start
        self.cds_end = end
        

I defined a general `Transcript` class, with information that all transcripts should have, i.e. id, name, functional annotation, and genome coordinates. Then I wanted to have additional information for coding transcripts, namely where the coding region starts and ends in the transcript. To do that, I defined a `CodingTranscript` class that **inherits** from the `Transcript` class, i.e. has all the attributes of this class, but in addition, it has a method that can set the additional variables that I want. Let's use these classes now to create a coding and a non-coding transcript.

In [96]:
my_coding_transcript = CodingTranscript("1", "CT1", "coding")
my_coding_transcript.set_coords("chr1", "+", 231456, 232929)
my_coding_transcript.set_cds(142, 895)

my_nc_transcript = Transcript("2", "NCT1", "noncoding")
my_nc_transcript.set_coords("chr3", "-", 852314, 853100)
my_nc_transcript.set_cds(42, 604)

AttributeError: 'Transcript' object has no attribute 'set_cds'

As we can see, all works well when I try to set `cds_start` and `cds_end` in a transcript of the `CodingTranscript` class, which has this method and attributes, but not when I try to use the method for an instance of the `Transcript` base class, which does not have the `set_cds` method defined. 

Furthermore, the `__init__` function that I defined for a `Transcript` works also when I create an instance of the `CodingTranscript` class.

I can also override the `__init__` function in the subclass, which will work like this:

In [97]:
class OtherCodingTranscript(Transcript):
    
    def __init__(self, tid, name, kind, start, end):
        # give the transcript an identifier, name and function
        self.tid = tid
        self.name = name
        self.kind = kind
        self.cds_start = start
        self.cds_end = end

second_coding_transcript = OtherCodingTranscript("2", "CT2", "coding", 45, 500)
third_coding_transcript = OtherCodingTranscript("3", "CT3", "coding")

TypeError: __init__() missing 2 required positional arguments: 'start' and 'end'

There are also ways to do the initialization stepwise, cascading `__init__` calls to parent functions.
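One common way to cascade the initialization is the built-in `super()`. The sketch below is only an illustration: it repeats a reduced `Transcript` class so that it runs on its own, and the subclass name `CascadingCodingTranscript` and the default `kind="coding"` are assumptions, not part of the classes defined above.

```python
class Transcript:

    def __init__(self, tid, name, kind):
        # give the transcript an identifier, name and function
        self.tid = tid
        self.name = name
        self.kind = kind


class CascadingCodingTranscript(Transcript):
    # hypothetical subclass name, used only for this illustration

    def __init__(self, tid, name, start, end, kind="coding"):
        # let the parent class set tid, name and kind ...
        super().__init__(tid, name, kind)
        # ... and only add what is specific to coding transcripts
        self.cds_start = start
        self.cds_end = end


t = CascadingCodingTranscript("4", "CT4", 45, 500)
print(t.tid, t.name, t.kind, t.cds_start, t.cds_end)
```

Compared with the override above, the assignments of `tid`, `name` and `kind` are written only once, in the parent class.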

# Python Modules

Let's look now at another very useful application of encapsulation, which is the **module**. Documentation about modules can be found at https://docs.python.org/3/tutorial/modules.html

A **module** is a logical unit of code, a file containing Python definitions and statements. In its most basic form, it is just the set of commands inside one of the code cells here, saved as a text file with a `.py` extension (a Python *script* or program).

As soon as scripts get too big, it makes sense to distribute functional units across manageable modules or files, each dealing with a specific functionality that we need in our program. Some of the available modules are extremely general: they define objects and methods for those objects that are useful in very different fields, from physics to biology to humanities.

To use the content of a module within another module or program one needs to `import` the module. This will include the objects and methods of the module in the symbol table of our program. To avoid any confusion, the names of the objects and methods of the module will be accessible within the current program prefixed with the name of the module.
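As a small illustration of the prefix rule, the sketch below uses the standard `math` module (chosen here only as an example). It also shows two common variants of the `import` statement that change how the names are reached:

```python
import math                 # names are reached through the module prefix
from math import sqrt       # imports one name directly into our namespace
import math as m            # the prefix can be shortened with an alias

area = math.pi * 2 ** 2     # area of a circle with radius 2
root = sqrt(16)
same_module = m is math     # the alias refers to the very same module object
print(area, root, same_module)
```

Importing with the prefix keeps it obvious where a name comes from, which is why it is usually the safest choice.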

Innumerable modules have been written for Python, many of which have been further bundled into *packages* (another encapsulation layer) that one can *install* and use. Installation can be done via conda or pip:

```bash
conda install [package name]
pip install [package name]
```

Python also comes already with a considerable number of built-in modules. Modules need to be `import`ed before they can be used. We are going to look at a number of modules in the coming weeks, but let's take a look at an example to get an idea of how a module is composed. We will use a module that comes with the standard python distribution, namely `time`.

### What's in a module

It's not uncommon that we want to find out how long it takes to run various parts of our programs and for this, Python has a built-in `time` module that has a lot of relevant functions. We can access this module as follows:

In [4]:
import time

print(time)

<module 'time' (built-in)>


The `print` command does not tell us very explicitly what is inside the module. To find out, we can use the **`dir()`** function, using the module name as argument:

In [5]:
dir(time)

['_STRUCT_TM_ITEMS',
 '__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'altzone',
 'asctime',
 'clock',
 'ctime',
 'daylight',
 'get_clock_info',
 'gmtime',
 'localtime',
 'mktime',
 'monotonic',
 'monotonic_ns',
 'perf_counter',
 'perf_counter_ns',
 'process_time',
 'process_time_ns',
 'sleep',
 'strftime',
 'strptime',
 'struct_time',
 'time',
 'time_ns',
 'timezone',
 'tzname',
 'tzset']

What are these things? Modules typically contain one or more of the following objects:
* special functions (starting with double underscores)
* variables (may also be referred to as attributes)
* functions (may also be referred to as methods)
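One possible way to sort the names returned by `dir()` into these groups is sketched below. The split by leading underscore and by `callable()` is an assumption made for illustration; note that `callable()` is also true for classes such as `struct_time`.

```python
import time

# names that do not start with an underscore form the public interface
public = [name for name in dir(time) if not name.startswith("_")]

# split the public names into callables (functions, classes) and plain variables
callables = [name for name in public if callable(getattr(time, name))]
variables = [name for name in public if not callable(getattr(time, name))]

print("Callables:", callables)
print("Variables:", variables)
```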

To enforce good programming practices, Python has a number of conventions about underscores in names:

* Single Leading Underscore **\_var**: It indicates that the name is not part of the interface, it is for internal use in the module (similar to `private` names in other programming languages). 
* Double Leading Underscore **\_\_var**: Indicates that in the context of Python `classes` the name will be rewritten to prevent conflicts with names in subclasses.
* Single Trailing Underscore **var\_**: Used to avoid naming conflicts with Python keywords.
* Double Leading and Trailing Underscore **\_\_var\_\_**: Indicates special names defined by the Python language (as we see above in the `time` module).
* Underscore *\_*: Special name for temporary variables.
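The conventions above can be tried out with a toy class. The class `Sample` and its attribute names are hypothetical, introduced only for this sketch:

```python
class Sample:
    def __init__(self):
        self._internal = "internal use, by convention only"
        self.__mangled = "stored under a rewritten name"
        self.class_ = "trailing underscore avoids the keyword 'class'"

s = Sample()
attribute_names = sorted(vars(s))
print(attribute_names)
print(s._internal)            # still accessible: the underscore is only a convention
print(s._Sample__mangled)     # the double-underscore name was rewritten

# '_' as a throwaway name for a value we do not need
for _ in range(2):
    pass
```

The single underscore does not hide anything; only the double leading underscore changes the stored name, by prefixing it with the class name.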

Let's have a look at some of the objects, starting with the specially designated ones.

In [8]:
# Documentation text, also used when calling help() on the module
print(time.__doc__)

This module provides various functions to manipulate time values.

There are two standard representations of time.  One is the number
of seconds since the Epoch, in UTC (a.k.a. GMT).  It may be an integer
or a floating point number (to represent fractions of seconds).
The Epoch is system-defined; on Unix, it is generally January 1st, 1970.
The actual value can be retrieved by calling gmtime(0).

The other representation is a tuple of 9 integers giving local time.
The tuple items are:
  year (including century, e.g. 1998)
  month (1-12)
  day (1-31)
  hours (0-23)
  minutes (0-59)
  seconds (0-59)
  weekday (0-6, Monday is 0)
  Julian day (day in the year, 1-366)
  DST (Daylight Savings Time) flag (-1, 0 or 1)
If the DST flag is 0, the time is given in the regular time zone;
if it is 1, the time is given in the DST time zone;
if it is -1, mktime() should guess based on the date and time.



In [12]:
print('Loader: ', time.__loader__)
print('Package: ', time.__package__)
print('Spec: ', time.__spec__)
print('Name: ', time.__name__)

Loader:  <class '_frozen_importlib.BuiltinImporter'>
Package:  
Spec:  ModuleSpec(name='time', loader=<class '_frozen_importlib.BuiltinImporter'>, origin='built-in')
Name:  time


So these special objects give us more information about the module, including its name, where it resides and how it is loaded.

A special mention for the **\_\_name\_\_** variable. This allows us to write modules and packages that can be either run as stand-alone programs or be imported within other programs. How does it work? The whole idea rests on what the \_\_name\_\_ variable is set to in these two situations. 

When the module is run as stand-alone:
```bash
python my_module.py
```
\_\_name\_\_ is set to the value '\_\_main\_\_'


whereas when the module is imported within another program
\_\_name\_\_ is set to the name of the module (the file name without the `.py` extension), which in this case would be 'my_module'


Then we can do things like this:

```python
def my_function():
    print('The value of __name__ is ' + __name__)

#### finished implementing module functionality ####

#### additional code for executing the code of the module from the command line ####
def main():
    my_function()

if __name__ == '__main__':
    main()
```
This explicitly runs the function when the program is called from the command line, and only makes the function definition available, to be used as needed, when the module is imported.

Let us return to the `time` module and frequently used functions in it.

In [9]:
print(time.gmtime()) #coordinated universal time
print(time.localtime()) #local time

time.struct_time(tm_year=2021, tm_mon=8, tm_mday=14, tm_hour=21, tm_min=33, tm_sec=11, tm_wday=5, tm_yday=226, tm_isdst=0)
time.struct_time(tm_year=2021, tm_mon=8, tm_mday=14, tm_hour=23, tm_min=33, tm_sec=11, tm_wday=5, tm_yday=226, tm_isdst=1)


The `sleep` function is especially useful when dealing with web services, when we do not want to flood the service with requests. Then, we typically space the requests at appropriate time intervals, and the `sleep` function helps us do this. For example:

In [10]:
def sleep_n_seconds(n = 10):
    """This function waits for n (default = 10) seconds."""
    print('Starting to wait...')
    time.sleep(n)
    print('Done waiting.')
    return

sleep_n_seconds(10)

Starting to wait...
Done waiting.


Let's convince ourselves that it really waits for the specified number of seconds. We'll rewrite the function to tell us the time before it starts waiting and when it's finished:

In [11]:
def sleep_n_seconds(n = 10):
    """This function waits for n (default = 10) seconds and returns the start and end times."""
    print('Starting to wait...')
    start = time.time()
    time.sleep(n)
    end = time.time()
    print('Done waiting.')
    return start, end

(start, end) = sleep_n_seconds(3)

print('Started at: ', time.ctime(start))
print('Ended at: ', time.ctime(end))
print('Time elapsed:', end-start)

Starting to wait...
Done waiting.
Started at:  Sat Aug 14 23:34:56 2021
Ended at:  Sat Aug 14 23:34:59 2021
Time elapsed: 3.0006680488586426


### Difference between module and package

While discussing modules above, I sometimes referred to a *package*. Is there a difference?

A module is a single Python file. A package (https://docs.python.org/3/tutorial/modules.html#packages) is a collection of modules, organized in a hierarchical manner on the file system (directories-subdirectories-files), and having in addition an `__init__.py` file in each of these directories. The `__init__.py` file can just be empty, can contain code that will be executed upon import (initialization code), or set the `__all__` variable. This variable contains the list of module names that should be imported when the instruction `from package import *` is encountered. However, it is not recommended to use this instruction, but rather to explicitly import the desired modules. The items to be imported can be not only submodules, but also other names defined in the package, such as functions, classes or variables. 

Where does Python look for packages?

When the import statement is encountered, Python checks whether the item to be imported is defined in the package, and if not, it assumes it is a module and attempts to load it, using the directories specified in the list variable `sys.path`. This is initialized from the directory containing the input script, the environment variable PYTHONPATH, and an installation-dependent default. This variable can be modified, e.g. 
```
import sys
sys.path.append('/path/to/my/module')
```
An `ImportError` exception is raised if the module is not found. Packages support a special attribute, `__path__`, initialized to be a list containing the name of the directory holding the package's `__init__.py` before the code in that file is executed. This is sometimes used to extend the set of modules found in a package.
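To make the pieces concrete, the sketch below builds a throw-away package in a temporary directory, adds that directory to `sys.path`, and imports a module from it. The package name `toy_pkg`, the module `seqtools` and the function `gc_content()` are all hypothetical names invented for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # lay out the package on disk: toy_pkg/__init__.py and toy_pkg/seqtools.py
    pkg_dir = Path(tmp) / "toy_pkg"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text('__all__ = ["seqtools"]\n')
    (pkg_dir / "seqtools.py").write_text(
        "def gc_content(seq):\n"
        "    seq = seq.upper()\n"
        "    return (seq.count('G') + seq.count('C')) / len(seq)\n"
    )

    # make the parent directory of the package visible to the import system
    sys.path.append(tmp)
    importlib.invalidate_caches()

    import toy_pkg
    from toy_pkg import seqtools

    gc = seqtools.gc_content("ATGC")
    pkg_path = list(toy_pkg.__path__)   # directory holding __init__.py

    # leave sys.path as we found it
    sys.path.remove(tmp)

print("GC content:", gc)
print("Package path:", pkg_path)
```

Without the `sys.path.append()` call, the import would fail with an `ImportError`, because Python would not know where to look for `toy_pkg`.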

## Exercises

1. install the Biopython package on your laptop

2. use it to create a nucleotide sequence

3. use it to generate the reverse complement

4. translate your transcript into protein

# Containers

**Containers** are a really important concept for reproducible data science. A container, just as its name says, is an encapsulated piece of software that can be run on various hardware (with a Linux or Windows operating system), isolated from the host system. The container has code, environment variables, system libraries, a file system, etc., in principle all that is necessary to run the code that it contains. Containers can be created for various tools, versions of software etc., allowing reproducibility of analyses across architectures and users as well as the sharing of functionality. There are various container technologies. The most popular is **Docker**, which we will also use here. Rapidly gaining acceptance for managing containers on clusters is **Kubernetes**.

To use Docker to manage our software environments we will need to download and install it from https://www.docker.com/. Once the Docker engine is installed, we will be able to build, run and share Docker containers.

## Docker build

The `build` command generates a Docker `image` from a recipe, `Dockerfile`. These images can be uploaded and shared via a centralised repository, **Docker Hub**, https://hub.docker.com/. 

We will use a widely-used tool for RNA-seq analysis, `salmon` to illustrate how to build a Docker image.

### The Dockerfile

To create a Docker image we need a "recipe", containing the commands that are needed to assemble and configure all the necessary software packages. This recipe is contained in the Dockerfile text file. We will use the example from the `nextflow` tutorial (https://github.com/seqeralabs/nextflow-tutorial).

As Docker sends all the files from the current directory (the build context) to the Docker daemon when building the image, we will create a new directory and move into it to have a clean slate. Another option would be to create a .dockerignore file in the current directory, specifying which files to exclude from the build.

Using a text editor, we'll create and save a Dockerfile with the following content: 

```
FROM debian:jessie-slim

MAINTAINER <your name>

RUN apt-get update && apt-get install -y curl cowsay 

ENV PATH=$PATH:/usr/games/
```
We can now build a first image, with the following command:
```
docker build -t my-image .
```
which takes all the files in the current directory, `.`, and sends them to the Docker daemon to create an image which is tagged *my-image*. The build command has of course many more options (check https://docs.docker.com/engine/reference/commandline/build/). We can check that our image was built with `docker images`, which shows all the images our daemon knows about.

A Docker image has many layers. The first is either a parent image which is reused or a base image that is built from scratch with the build command. Layers are added to the image, viewable via the Docker history command, in our case
```
docker history my-image
```
Running a Docker image also creates a writable container layer, which stores changes made in the running container, including newly written or deleted files. Finally, a Docker image also contains a 'Docker manifest', a JSON-formatted file describing the image.

### Docker image commands (from https://searchitoperations.techtarget.com/definition/Docker-image)

* docker image build. Builds an image from a Dockerfile.
* docker image inspect. Displays information on one or more images.
* docker image load. Loads an image from a tar archive or streams for receiving or reading input (STDIN).
* docker image prune. Removes unused images.
* docker image pull. Pulls an image or a repository from a registry.
* docker image push. Pushes an image or a repository to a registry.
* docker image rm. Removes one or more images.
* docker image save. Saves one or more images to a tar archive (streamed to STDOUT by default).
* docker image tag. Creates a tag TARGET_IMAGE that refers to SOURCE_IMAGE.

### Docker commands
Beyond the build and history commands described above, frequently used Docker commands are:
* docker run. Run the specified image.
* docker update. Enables a user to update the configuration of containers.
* docker tag. Creates a tag, such as TARGET_IMAGE, which enables users to group and organize container images.
* docker search. Looks in Docker Hub for whatever the user needs.
* docker save. Enables a user to save images to an archive.
* docker compose. Defines and runs applications that consist of multiple containers.

Applying some of these to our example, let's run the image we just created:
```
docker run my-image cowsay Hello Docker!
```
This generates some text-based graphics, but the image is not yet very useful for running `salmon`, the tool used to quantify transcript abundances from RNA-seq data, for which we want to build a Docker image. To do that, we will update the Dockerfile to include the following code:
```
RUN curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v0.8.2/Salmon-0.8.2_linux_x86_64.tar.gz | tar xz \
 && mv /Salmon-*/bin/* /usr/bin/ \
 && mv /Salmon-*/lib/* /usr/lib/
```
and rerun the build command. If we then check the history of the image, we will see the additional layer that we just added. `salmon` should now be available in the container and we can check this with
```
docker run my-image salmon --version
```
That is, we run the image in a container and we invoke the program we want to run. **Important note:** we cannot simply pass parameters such as files that are on our host file system to the software that runs in the container, because the container doesn't know about the host file system structure. However, Docker provides ways to deal with this. One is to mount a host directory into the container. This can be done either with the `--volume` or with the `--mount` option. The difference is in how they handle files/directories that do not exist yet on the Docker host. `--volume` creates these **but always as directories**, while the latter gives an error. More about bind mounts at https://docs.docker.com/storage/bind-mounts/.

If we tried to run `salmon` indexing in a container that contains salmon, but with an input file on the host file system, it would look like this:
```
docker run --volume <absolute-path-host>:<absolute-path-container> my-image \
  salmon index -t <path-to-transcriptome-file> -i index 
```
  
Now I want to create a very simple image that contains my own code. I create a directory *MyDockerExample* and in it I make a *Code* subdirectory where I place a simple Python script, *arith.py*:
```
import sys

print("Simple python script illustration : addition of numbers ")

def validate(args):

    nums = []
    try:
        for v in args:
            nums.append(float(v))
    except ValueError:
        print("Cannot convert " + v + " to float")

    return nums

print(sum(validate(sys.argv[1:])))
```

After checking that the program runs correctly I create my Dockerfile in the *MyDockerExample* directory, with the following content:

```
FROM python:3.8-slim-buster

MAINTAINER Mihaela

ENV PATH="$PATH:/Code/"

COPY . .

WORKDIR /Code
``` 

This will take a base Python image to include it into my image, it will include my *Code* directory in the path where executables will be searched, it will copy the content of my *MyDockerExample* directory into the image root directory and will set the working directory of the image to be the one that contains my code.

I then create my Docker image:
```
docker build -t hello-image .
```
And run my program:
```
docker run hello-image python arith.py 2 4 5
```
Which generates the following output:
```
Simple python script illustration : addition of numbers 
11.0
```
                                                                                                                       
### Uploading image to Docker hub

Finally, if I really do something that is useful to others, I could upload my image to the Docker Hub. This requires the following steps:

1. creating an account on Docker hub (https://hub.docker.com web site)

2. use the credentials from Docker hub to login at the shell `docker login` ... 

3. tagging the image with Docker user name `docker tag my-image <user-name>/my-image` 

4. pushing image to Docker hub `docker push <user-name>/my-image`

This will make the image accessible to anyone, via `docker pull <user-name>/my-image`.

## Exercises

1. Write a program that reverse-complements a sequence passed as argument.

2. Build a docker container around this program.

3. Run the program in the container for a few nucleotide sequences.

# APIs

A really nice illustration of encapsulation is *application programming interfaces* (*APIs*), which are pieces of software that interact with other pieces of software via defined interfaces. They allow us to access/retrieve data stored on web servers/databases and then locally save or process and analyze that data. 

Many web services offer APIs for developers or interested users. For example, if you are interested in calculating some stats on your friends' tweets or the YouTube comments on your favorite artists' music videos, see here:
* [Twitter API documentation](https://developer.twitter.com/)
* [YouTube API documentation](https://developers.google.com/youtube/)

Or, since you are in this course, perhaps you are more interested in programmatically accessing data in the [Ensembl](http://www.ensembl.org/) or [NCBI](https://www.ncbi.nlm.nih.gov/) databases:
* [Ensembl REST API](https://rest.ensembl.org/)
* [NCBI public APIs](https://www.ncbi.nlm.nih.gov/home/develop/api/)

Note that APIs are interfaces for programmatic access to a web service in general, not just for data retrieval. Depending on the service, the API and your permissions, you might also use them to post or edit data, or embed a news or Twitter feed or your current location on your personal website.

While this may sound complicated and overwhelming at the beginning, APIs of big web services are typically well documented and easy to use and, moreover, there is generally plenty of help right at your fingertips. For example, search the internet for `Twitter API Python tutorial` and you will soon find yourself analyzing tweets.

But let's get to an example that is more relevant for our course.

## Retrieving data via Ensembl's REST API

**WARNING:** You will need an active internet connection from here on!

[Ensembl](http://www.ensembl.org/) is a public resource of genomes and a wealth of related information, developed and maintained by the European Bioinformatics Institute. While it is possible to browse it and manually retrieve information of interest to you, it also offers a pretty neat (and fairly new) [REST API](https://rest.ensembl.org/) that allows you to access a large fraction of their data programmatically via the programming language of your choice. This gives you access to the most recent version of a database/service and you do not have to worry about versions of the data (typically these are stored on the web server and can be accessed as well).

So what is a REST API? A proper explanation would include a lot of info on the architecture of the internet and, more specifically, the Hypertext Transfer Protocol (HTTP), which is way beyond the scope of this course. For now, a REST API gives you the opportunity to query remote databases via *uniform resource locators (URLs)* (i.e. web addresses). That is, by assembling a web address according to certain rules, we can get exactly the information we need.

REST APIs define different endpoints (i.e. different URL stems) to retrieve (or post, update...) data entries. These stems can then be modified by parameters (very similar to a function's or method's arguments) to get you what you want in the form that you want. Endpoints of Ensembl's REST API are listed here: https://rest.ensembl.org/. Unfortunately, not all databases are represented. Nevertheless, let's see how we can retrieve a genomic sequence via the API.

The endpoint we are using for our example is called "GET sequence/id/{sequenceID}".

If you look at that page, it gives
* a description of the required parameters (only `sequenceID` for this endpoint)
* description of the optional parameters
* examples with code in various programming language for retrieving specific types of information
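
To make the idea of "endpoint stem plus parameters" concrete, here is a minimal sketch of how such a URL could be assembled. The helper name `build_query_url` is hypothetical (it is not part of any library), and joining the parameters with semicolons is an assumption that simply mirrors the query used in this section.

```python
def build_query_url(server, endpoint, identifier, **params):
    """Assemble a REST query URL from its parts (illustrative helper)."""
    # Endpoint stem followed by the required identifier
    url = f"{server}/{endpoint.strip('/')}/{identifier}"
    # Optional parameters are appended after '?' as key=value pairs
    if params:
        query = ";".join(f"{key}={value}" for key, value in params.items())
        url = f"{url}?{query}"
    return url


url = build_query_url(
    "https://rest.ensembl.org", "sequence/id", "ENSG00000136997",
    expand_5prime=10, expand_3prime=10,
)
print(url)
```

This prints `https://rest.ensembl.org/sequence/id/ENSG00000136997?expand_5prime=10;expand_3prime=10`. Nothing is sent over the network here; the function only builds the string that a request would use.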

Let's try to use this API to get the genomic sequence of the MYC gene locus, extended by 10 nucleotides on both the 5' and 3' sides. For this, we have to make use of another handy Python module, *requests*. You can find more about this module's functionality at https://requests.readthedocs.io/en/master/user/quickstart/. 

In [59]:
import requests, sys
 
server = "https://rest.ensembl.org"
ext = "/sequence/id/ENSG00000136997?"
options = "expand_3prime=10;expand_5prime=10"

response = requests.get(server+ext+options, headers={ "Content-Type" : "application/json"})

# Stop here if the server reported an error
if not response.ok:
  response.raise_for_status()
  sys.exit()
 

# Let's inspect what we got back
print("Information about the returned object:")
print(response.headers)
print("\n")

print("Methods and attributes of a response object:")
print(dir(response), '\n')
print("\n")

# The information we actually asked for
decoded = response.json()
print(repr(decoded))

Information about the returned object:
{'Vary': 'Content-Type, Origin', 'Content-Type': 'application/json', 'Content-Length': '7678', 'X-Runtime': '0.121852', 'X-RateLimit-Limit': '55000', 'X-RateLimit-Reset': '2676', 'X-RateLimit-Period': '3600', 'X-RateLimit-Remaining': '54994', 'Date': 'Sun, 15 Aug 2021 13:15:24 GMT'}


Methods and attributes of a response object:
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect'

The information in the record seems to be organized similarly to a Python dictionary, as *key-value* pairs. In fact, it is not a Python dictionary but rather text in **json** format, which stands for JavaScript Object Notation. It is a format used to communicate between web services, and it is indeed very similar to Python's dictionary, including the possibility of nesting. 
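
The difference between JSON text and a Python dictionary can be illustrated with Python's built-in `json` module. This is only a small sketch: the record below is made up for illustration and is not actual Ensembl output.

```python
import json

# A small JSON text, similar in shape to what a web service might return
# (the values are invented for this example)
json_text = '{"id": "SEQ1", "molecule": "dna", "version": 1, "features": {"start": 10, "end": 20}}'

# json.loads() parses JSON text into Python objects (here: a nested dict)
record = json.loads(json_text)
print(type(json_text), type(record))
print(record["features"]["start"])

# json.dumps() goes the other way: Python objects back to JSON text
print(json.dumps(record, indent=2))
```

The first `print` shows that `json_text` is a `str` while `record` is a `dict`, and the nested value is then accessed just like in any other dictionary.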

As you may have noticed, one of the attributes of the response is `json()`, and we can check what it does:

In [60]:
help(response.json)

Help on method json in module requests.models:

json(**kwargs) method of requests.models.Response instance
    Returns the json-encoded content of a response, if any.
    
    :param \*\*kwargs: Optional arguments that ``json.loads`` takes.
    :raises ValueError: If the response body does not contain valid json.



**json** is a method of the response object, which returns the json-encoded content of the response, if present. If we call this method on our data, we get a dictionary-type of object, and we can inspect the keys:

In [61]:
print(response.json().keys())

dict_keys(['desc', 'molecule', 'seq', 'version', 'id', 'query'])


We can also check the type of the values associated with the keys, which in this case are strings, except for `version`, which is an integer:

In [62]:
data = dict(response.json())

for key in data.keys():
    print(type(data[key]))
print(data['molecule'])
print(data['desc'])

<class 'str'>
<class 'str'>
<class 'str'>
<class 'int'>
<class 'str'>
<class 'str'>
dna
chromosome:GRCh38:8:127735424:127742961:1


Where the API comes in handy is in retrieving **lots** of sequences. Let's try looking for a list of transcripts.

In [63]:
# Import the requests module (a third-party package) that handles HTTP requests
import requests

# We have a list of 10 Ensembl transcript IDs
transcripts = [
    'ENST00000580119', 'ENST00000607131', 'ENST00000507008', 'ENST00000596515', 'ENST00000621529', 
    'ENST00000516151', 'ENST00000464729', 'ENST00000544404', 'ENST00000462341', 'ENST00000450928'
    ]

# Fixed parameters
host = 'https://rest.ensembl.org'
prefix = '/sequence/id/'
suffix = '?type=cdna'

# Open file for writing the retrieved sequences
with open('ensembl_rest.fa', 'w') as f:

    # Iterate over transcripts
    for transcript in transcripts:

        # Build query URL
        url = ''.join([host, prefix, transcript, suffix])

        # Write status message
        print("Sending request to", url)
        
        # Request sequence
        response = requests.get(url, headers={ "Content-Type" : "text/plain"})

        # Write identifier in FASTA format
        f.write('>' + transcript + '\n')
        
        # Write cDNA sequence
        f.write(response.text + '\n')

# Write status message
print("All sequences fetched!")

Sending request to https://rest.ensembl.org/sequence/id/ENST00000580119?type=cdna
Sending request to https://rest.ensembl.org/sequence/id/ENST00000607131?type=cdna
Sending request to https://rest.ensembl.org/sequence/id/ENST00000507008?type=cdna
Sending request to https://rest.ensembl.org/sequence/id/ENST00000596515?type=cdna
Sending request to https://rest.ensembl.org/sequence/id/ENST00000621529?type=cdna
Sending request to https://rest.ensembl.org/sequence/id/ENST00000516151?type=cdna
Sending request to https://rest.ensembl.org/sequence/id/ENST00000464729?type=cdna
Sending request to https://rest.ensembl.org/sequence/id/ENST00000544404?type=cdna
Sending request to https://rest.ensembl.org/sequence/id/ENST00000462341?type=cdna
Sending request to https://rest.ensembl.org/sequence/id/ENST00000450928?type=cdna
All sequences fetched!


In [64]:
!head -2 ensembl_rest.fa

>ENST00000580119
TACATTTAAACACAATTCCTTTGTTCCCATATCCTGAAATCAGTAAATTTAAGCAATTTTATTTTTCATTCTGTGCTTCAGGGATCTATGAGGGTCAACACATATTTATCACACGCTGACCTTGTGCTCAGGCCTATAGGAAAATGAACCTCCCTGCCCGCAATGACCTTAGAACCTAACTAAGGATATGGTTAAAACAAGCGCAGGATTGGGTGAGACCTGCTCAAGGGTCAGGCGCTGGTTTGTCGGGGAGAAATGTTGCCCAGGAAGGACAGGTGTGAAAGAAAGAGAAGATCCTGCAAGACCGGAGGGCCCAGGAAAGTCTGGGGGACTCAAAGGAGAGTTTCAGGATTGACCATGGGTGGACAGGGCTCTGCAGACCAGGATCTTTTCTATGCACACTTGGGGCCAGAAAGTTCTTTGTGGTGGGGGCTGTCCTGTACAATGCAGGAAGCTCAGCAGCATCCCTGGCCTCCACCCACTCAATGCCAGTGGCACTCCCTCCCTAGTTGTAAATGTGTGCAGACATTAGTACTTGTCACCTGCAGGGAAGAATCACCCCACTGAGAACCACATGCTGAGTCAAGGCAACAGCATAGGAATTGGAGCACGCAAGAGAGCTGGTCAACACAGAGAGATGAGCTTCAGGGGCTGCTGGATGTACTCCACCTGGAAGCAGAGGCCGAGGACCCCTCCTCCTGGACTAGGGCTCACCTCGCCTCCTGACTGTATTCTGCCTGTGTTCTCCTCCAGTCCCACCCTTATTTCTCTTTGTCTAAATGGCAAACTCCCTTTAAGCAAGAATTCATGGCGTCTCCACATTCGGCCCTCAAGAAACTTTTATTGAGTTGAACTAAAACCAGGCATGTCTTAAGTGTTCATGGCTGATAATGCGGGATCCTTCAACGCTGGAAACGCCCAGATGGTAACGAGAGTTCTGATAGTGCCGGATCCTTCGACGCTGGAAACGCCCAGATGAAAC

# The os module

This module contains useful functions to deal with files and the operating system. Let us look at some of these functions.

In [65]:
import os

#let's check what operating system we have
print(os.name)

posix


In [66]:
#run system commands
os.system("pwd >currentWorkingDir")
!head currentWorkingDir

#remove the temporary file
os.system("rm currentWorkingDir")

#check if it's still there
os.access("currentWorkingDir",os.F_OK)

/Users/zavolan/Teaching/ProgrammingLifeSciences/2021


False

`F_OK` is the *mode* with which we test access to the file, in this case just to see if it exists. Other modes are `os.R_OK` for "readable", `os.W_OK` for "writable", and `os.X_OK` for "executable".
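
As a small sketch of these modes, the following creates a temporary file, checks a few of them and then removes the file again. It uses `os.remove()` instead of calling `rm` through `os.system()`; this is an alternative to the approach in the cell above, chosen here because it does not depend on an external command being available.

```python
import os
import tempfile

# Create a temporary file so the example does not touch existing files
handle, path = tempfile.mkstemp(suffix=".txt")
os.close(handle)

exists_before = os.access(path, os.F_OK)
readable = os.access(path, os.R_OK)
writable = os.access(path, os.W_OK)
print("exists:", exists_before)
print("readable:", readable)
print("writable:", writable)

# os.remove() deletes the file without calling an external command
os.remove(path)
exists_after = os.access(path, os.F_OK)
print("exists after removal:", exists_after)
```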

A very powerful use of this module is to automate file creation, writing, etc. during data processing, e.g. if we want to extract sequences from a file and write them into separate files. 
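
A minimal sketch of this idea is shown below. The function name `split_fasta`, the `.fa` extension of the output files and the tiny input file are all assumptions made for illustration; the sketch also assumes that every header line carries a non-empty identifier directly after the `>`.

```python
import os
import tempfile


def split_fasta(fasta_path, out_dir):
    """Write every record of a FASTA file to its own file in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # A header line starts a new record: switch output file
                if out is not None:
                    out.close()
                name = line[1:].split()[0]
                out_path = os.path.join(out_dir, name + ".fa")
                out = open(out_path, "w")
                written.append(out_path)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written


# Small made-up input so the example is self-contained
work_dir = tempfile.mkdtemp()
example = os.path.join(work_dir, "example.fa")
with open(example, "w") as f:
    f.write(">seq1\nACGT\n>seq2\nGGCC\nTTAA\n")

split_dir = os.path.join(work_dir, "split")
files = split_fasta(example, split_dir)
print(sorted(os.listdir(split_dir)))
```

Here `os.makedirs()` creates the output directory if needed, `os.path.join()` builds paths that work on any operating system, and `os.listdir()` confirms that one file per sequence was written.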

In [67]:
# you can always use help to find out what else there is in the module
help(os)

Help on module os:

NAME
    os - OS routines for NT or Posix depending on what system we're on.

MODULE REFERENCE
    https://docs.python.org/3.7/library/os
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This exports:
      - all functions from posix or nt, e.g. unlink, stat, etc.
      - os.path is either posixpath or ntpath
      - os.name is either 'posix' or 'nt'
      - os.curdir is a string representing the current directory (always '.')
      - os.pardir is a string representing the parent directory (always '..')
      - os.sep is the (or a most common) pathname separator ('/' or '\\')
      - os.extsep is the extension separator (always '.')
      - os.altsep is the alternate pathname se

# Exercises

(1) Generate file `ensembl_rest.fa`.

(2) Use the `.readline()` method to print the first 10 lines of the file 'test.fa'. Can you also do this with the `.readlines()` method instead? 

(3) Write a function that returns the number of sequences in a FASTA file when provided with a file name. 

Use the function to verify that the `ensembl_rest.fa` file is correct.

(4) Implement a script that writes a report for file `ensembl_rest.fa`. For every entry in the input file, a line with the following tab-separated (remember: tabs are encoded with '\t') fields should be written to an output file `report.tsv`:
  * length of the sequence
  * frequency of nucleotides A, C, G, T, N
  * position of the first start codon 

(5) Summarize the report file `report.tsv`: Calculate the total length of all sequences, the mean sequence length and the overall [GC-content](https://en.wikipedia.org/wiki/GC-content) across all the nucleotides of all sequences (note that this is *not* the same as the mean of the GC-contents of each sequence!). Print the results to the screen.