<small><small><i>
Introduction to Python for Bioinformatics - available at https://github.com/kipkurui/Python4Bioinformatics.
</i></small></small>

## Files, Scripting and Modules

So far, we have been writing all our Python code in Jupyter notebooks. However, if you want to use that code as part of a pipeline, you need to write scripts. Also, most of the time the data you need to analyse is in a file, which you need to read into Python and process.


### Reading Files

So far we have worked only with data held in memory. In bioinformatics, you will often need to read data from a file or write results to one. For both, we use the built-in `open` function.

In [9]:
myfile = open("../Data/test.txt", "w")
myfile.write("My first file written from Python Today\n")
myfile.write("---------------------------------\n")
myfile.write("Hello, world!")
myfile.close()

In [8]:
myfile.close()

The **mode** in which you open the file determines whether you write to it (`w`), read from it (`r`) or append to it (`a`).

Opening a file creates what we call a **file handle**, an object whose methods let you manipulate the file. In our case, `myfile` provides the `write` and `close` methods. Closing the file flushes any buffered text to disk and releases the handle, so the complete content is available to other programs.

Alternatively, you can open the file inside a `with` statement, which closes the file automatically when the indented block ends.

In [2]:
with open("../Data/test.txt", "w") as myfile:
    myfile.write("My first file written from Python \n")
    myfile.write("---------------------------------\n")
    myfile.write("Hello, world!\n")
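
As a small sketch of how the `w`, `a` and `r` modes differ, the example below writes to, appends to, and then reads a scratch file. The file name `modes_demo.txt` and the temporary directory are assumptions made for illustration, so that no course data is overwritten.

```python
import os
import tempfile

# Work in a throwaway directory (an assumption for this demo)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "modes_demo.txt")

# "w" creates the file, or empties it if it already exists
with open(demo_path, "w") as handle:
    handle.write("first line\n")

# "a" keeps the existing content and adds to the end
with open(demo_path, "a") as handle:
    handle.write("second line\n")

# "r" opens the file for reading only
with open(demo_path, "r") as handle:
    content = handle.read()

print(content)
```

Opening the same file with `w` a second time would have discarded `first line`, which is why `a` is the safer choice when adding to existing results.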

Let's check what else we can do with `open`.

In [39]:
#?open

#### Fetching file from the web
Download this [file](https://www.uniprot.org/docs/humchrx.txt), which we will use to explore file reading in Python.

In [40]:
import urllib.request

url = "https://www.uniprot.org/docs/humchrx.txt"
destination_filename = "../Data/humchrx.txt"
urllib.request.urlretrieve(url, destination_filename)

('../Data/humchrx.txt', <http.client.HTTPMessage at 0x7fdf08370810>)
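
Re-running the cell above downloads the file again each time. One possible way to avoid that is sketched below. The helper name `fetch_if_missing` is hypothetical and not part of `urllib`; it only wraps the same `urlretrieve` call in a check for an existing local copy.

```python
import os
import urllib.request

def fetch_if_missing(url, destination_filename):
    """Download url to destination_filename unless the file already exists.

    Returns True if a download happened, False if the local copy was reused.
    """
    if os.path.isfile(destination_filename):
        return False
    urllib.request.urlretrieve(url, destination_filename)
    return True

# Example call (commented out so the block runs without network access):
# fetch_if_missing("https://www.uniprot.org/docs/humchrx.txt", "../Data/humchrx.txt")
```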

#### Reading a file line-at-a-time

We can read the file line by line using `readline`. Each call returns the next line, and an empty string once the end of the file is reached. This is suitable for a large file that may not fit in memory.

In [41]:
humchrx = open('../Data/humchrx.txt', 'r')
line = humchrx.readline()
print(line)

----------------------------------------------------------------------------



In [42]:
line = humchrx.readline()
print(line)

        UniProt - Swiss-Prot Protein Knowledgebase



In [43]:
humchrx.close()

In [44]:
with open('../Data/test.txt', 'r') as myfile:
    while True:
        line = myfile.readline()
        if len(line) == 0: # If there are no more lines
            break
        print(line)
    

My first file written from Python Today

---------------------------------

Hello, world!
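
The blank lines in the output above appear because each line read from the file still ends with `\n`, and `print` adds a newline of its own. The sketch below shows a common alternative: looping directly over the file handle, which also reads one line at a time, and stripping the trailing newline. The scratch file is an assumption that stands in for `../Data/test.txt`.

```python
import os
import tempfile

# Scratch file standing in for ../Data/test.txt (an assumption for this demo)
demo_path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(demo_path, "w") as handle:
    handle.write("line one\nline two\nline three")

stripped_lines = []
with open(demo_path, "r") as handle:
    for line in handle:               # reads one line at a time
        clean = line.rstrip("\n")     # drop the trailing newline, if any
        stripped_lines.append(clean)
        print(clean)
```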


### Read the whole file

If the file is small, or your computer has enough memory, you can read the whole file into memory as a list of lines using `readlines`.

In [45]:
with open('../Data/test.txt', 'r') as myfile:
    lines = myfile.readlines()
    for line in lines:
        print(line)

My first file written from Python Today

---------------------------------

Hello, world!


Or read the whole file as a single string using `read`:

In [108]:
with open('../Data/test.txt', 'r') as myfile:
    whole_file = myfile.read()
    print(whole_file)

My first file written from Python Today
---------------------------------
Hello, world!


In [112]:
whole_file.split('\n')

['My first file written from Python Today',
 '---------------------------------',
 'Hello, world!']
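
Note that `split('\n')` produces a trailing empty string when the text ends with a newline, which is common in real files. A minimal sketch of the difference from `splitlines`, using a made-up two-line string:

```python
text_with_newline = "gene1\ngene2\n"

by_split = text_with_newline.split("\n")
by_splitlines = text_with_newline.splitlines()

print(by_split)        # ['gene1', 'gene2', '']
print(by_splitlines)   # ['gene1', 'gene2']
```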

### Exercise 1

Write a function that reads the file (`humchrx.txt`) and writes a clean list of gene names to another file (`gene_names.txt`).

In [46]:
humchrx = open('../Data/humchrx.txt', 'r')
line = humchrx.readline()

In [48]:
humchrx.close()

In [3]:
def write2file(gene_list, out_file):
    """
    Takes a gene list and writes the output to file
    """
    with open(out_file, 'w') as outfile:
        outfile.write('\n'.join(gene_list))

def remove_empty(gene_list):
    """
    Given a gene list, removes every item that is
    a lone dash ('-'), which marks an empty entry
    """
    tag = True
    while tag:
        try:
            gene_list.remove('-')
        except ValueError:
            tag = False
    return gene_list

def clean_genes(input_file, out_file):
    """
    Given a chromosome annotation file, extract the 
    genes and write them to another file
    """
    gene_list = []
    tag = False
    with open(input_file, 'r') as humchrx:
        for line in humchrx:
            if line.startswith('Gene'):
                tag=True
            if line == '\n':
                tag = False
            if tag:
                gene_list.append(line.split()[0])
    #clean the gene list
    gene_list.pop(2)
    gene_list[0] = gene_list[0]+"_"+gene_list[1]
    gene_list.pop(1)
    
    gene_list = remove_empty(gene_list)
    
    ## Writing to file
    write2file(gene_list, out_file)
clean_genes('../Data/humchrx.txt', 'testing.txt')
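
The `while`/`try` loop in `remove_empty` can also be expressed as a list comprehension. The sketch below is one possible alternative, not the course solution. The function name `remove_empty_entries` and the example list are made up for illustration, and unlike `remove_empty` it returns a new list instead of changing the original.

```python
def remove_empty_entries(gene_list):
    """Return a new list without the '-' placeholder entries."""
    return [gene for gene in gene_list if gene != '-']

# Illustrative input only; not taken from humchrx.txt
genes = ['MAGEA1', '-', 'XIST', '-']
cleaned = remove_empty_entries(genes)
print(cleaned)
```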

### Alternative print options

In [103]:
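# gene_list is assumed to hold the cleaned gene names from Exercise 1
with open('test.txt', 'w') as outfile:  # the with block closes the file for us
    print('\n'.join(gene_list), file=outfile, end='')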

In [105]:
with open('test3.txt', 'w') as outfile:
    for gene in gene_list:
        outfile.write(gene+'\n')

### Scripts and Modules

A script is a file containing Python definitions and statements for performing some analysis. Scripts are known as **modules** when they are intended for use in other Python programs. Many modules come with Python as part of the standard library.

You can get a list of available modules using `help('modules')` and explore them.

### Writing your own modules

All we need to do to create our own modules is to save our script as a file with a `.py` extension. Suppose, for example, this script is saved as a file named `seqtools.py`.

```python
def remove_at(pos, seq):
    return seq[:pos] + seq[pos+1:]
```
    
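We import a module with the `import` statement. This works the same way for a standard library module such as `math` and for our own `seqtools`: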

In [1]:
import math

In [3]:
?math

Type:        module
String form: <module 'math' from '/Users/ckibet/miniconda3/envs/bioinf/lib/python3.7/lib-dynload/math.cpython-37m-darwin.so'>
File:        ~/miniconda3/envs/bioinf/lib/python3.7/lib-dynload/math.cpython-37m-darwin.so
Docstring:
This module provides access to the mathematical functions
defined by the C standard.


In [1]:
import seqtools

In [3]:
?seqtools.remove_at()

Signature: seqtools.remove_at(pos, seq)
Docstring: <no docstring>
File:      ~/MScBioinfo/Intro2Programming/Python/Python4Bioinformatics2020/Notebooks/seqtools.py
Type:      function


In [2]:
s = "A string!"
seqtools.remove_at(4,s)

'A sting!'

In [13]:
'23,000,'.replace(',','')

'23000'

In [4]:
import seqtools

In [5]:
from seqtools import remove_at

In [None]:
remove_at(4, s)  # after the from-import, no "seqtools." prefix is needed

Modules are useful when you want to analyse large datasets on an HPC cluster, or to build your own library of handy functions.

#### Running scripts

When you have put your commands into a `.py` file, you can execute it on the command line by invoking the Python interpreter with `python script.py`.
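
A file can serve as both a module and a script. A common convention, sketched below, is to put the code that should run only on direct execution behind an `if __name__ == "__main__":` check. The `main` function here is an assumption added for illustration and is not part of the original `seqtools.py`.

```python
def remove_at(pos, seq):
    """Return seq with the character at index pos removed."""
    return seq[:pos] + seq[pos+1:]

def main():
    # Code that should run only when the file is executed as a script
    print(remove_at(4, "A string!"))

# True when run as `python seqtools.py`, False when imported with `import seqtools`
if __name__ == "__main__":
    main()
```

With this layout, `import seqtools` makes `remove_at` available without printing anything, while `python seqtools.py` runs `main`.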

### Exercise 2

1. Convert the function you wrote in Exercise 1 into a Python module. Then, import the module and use the function to read the `humchrx.txt` file and create a gene list file.
2. Create a stand-alone script that does all the above.


### Script that takes command line arguments
So far, we can create a script that does one thing. In this case, you have to edit the script if you have a new gene file to analyse or you want to use a different name for the output file.

#### sys.argv
`sys.argv` is a Python list that contains the command line arguments passed to the script. The first item, `sys.argv[0]`, is the name of the script itself. Let's add this to a script `sysargv.py` and run it on the command line.

```python
import sys
print("This is the name of the script: ", sys.argv[0])
print("Number of arguments: ", len(sys.argv))
print("The arguments are: " , str(sys.argv))
```

In [1]:
import get_gene_list

In [2]:
get_gene_list.clean_genes('../Data/humchrx.txt', '../Data/clean_genes.txt')

In [9]:
!python sysargv.py test

This is the name of the script:  sysargv.py
Number of arguments:  2
The arguments are:  ['sysargv.py', 'test']
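
Every item in `sys.argv` is a string, so numbers must be converted explicitly, and it helps to check that the expected number of arguments was given. The sketch below illustrates both points with a hypothetical script `repeat.py` that takes a word and a count. The function name `parse_repeat_args` is made up for this example, and a hand-built list stands in for `sys.argv` so that the block runs inside a notebook.

```python
def parse_repeat_args(argv):
    """Parse [script, word, count] into (word, count)."""
    if len(argv) != 3:
        # Stop with a usage message when arguments are missing or extra
        raise SystemExit("Usage: python repeat.py WORD COUNT")
    word = argv[1]
    count = int(argv[2])   # arguments arrive as strings, so convert explicitly
    return word, count

# In a real script you would pass sys.argv instead of this example list
example_argv = ["repeat.py", "ATG", "3"]
word, count = parse_repeat_args(example_argv)
print(word * count)   # ATGATGATG
```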


### Exercise 3

- Using the same concept, convert your script in exercise 1 to take command line arguments (input and output files)
- Using a DNA sequence read from file, answer the following questions:
    1. Show that the DNA string contains only four letters.
    2. In the DNA string there are regions that have a repeating letter. What is the letter and length of the longest repeating region?
    3. How many times does `ATG` occur in the DNA string?

### File handling with the os, shutil and path modules

Python can also interface directly with the operating system using the **os**, **shutil** and **os.path** modules.

First, let's import the `os` module.

In [10]:
import os

In [14]:
os.cpu_count()

4

In [11]:
os.getcwd()

'/Users/ckibet/MScBioinfo/Intro2Programming/Python/Python4Bioinformatics2020/Notebooks'

In [13]:
!pwd

/Users/ckibet/MScBioinfo/Intro2Programming/Python/Python4Bioinformatics2020/Notebooks


In [17]:
os.chdir('..')

In [18]:
os.getcwd()

'/home/user/Python4Bioinformatics'

In [19]:
os.chdir('INotebooks/')

In [15]:
?os

Type:        module
String form: <module 'os' from '/Users/ckibet/miniconda3/envs/bioinf/lib/python3.7/os.py'>
File:        ~/miniconda3/envs/bioinf/lib/python3.7/os.py
Docstring:
OS routines for NT or Posix depending on what system we're on.

This exports:
  - all functions from posix or nt, e.g. unlink, stat, etc.
  - os.path is either posixpath or ntpath
  - os.name is either 'posix' or 'nt'
  - os.curdir is a string representing the current directory (always '.')
  - os.pardir is a string representing the parent directory (always '..')
  - os.sep is the (or a most common) pathname separator ('/' or '\\')
  - os.extsep is the extension separator (always '.')
  - os.altsep is the alternate pathname separator (None or '/')
  - os.pathsep is the component separator used in $PATH etc
  - os.linesep is the line separator in text files ('\r' or '\n' or '\r\n')
  - os.defpath is the default search path for executables
  - os.devnull is the file

In [16]:
os.listdir()

['10.ipynb',
 '09.ipynb',
 '08.ipynb',
 '__init__.py',
 '__pycache__',
 'seqtools.py',
 '00.ipynb',
 '02.ipynb',
 '06.ipynb',
 '04.ipynb',
 '.ipynb_checkpoints',
 '03.ipynb',
 '01.ipynb',
 '05.ipynb',
 'output.txt',
 'sysargv.py',
 '07.ipynb']

In [17]:
os.path.isdir('../Scripts/bank.py')

False

In [18]:
os.path.isfile('../Scripts/bank.py')

False

In [19]:
os.path.isfile('seqtools.py')

True

### Path manipulation
The `path` module inside the `os` module contains functions for path manipulation. For example, you can use `path.join()` to join paths.
- `path.exists(path)`: Checks if a given path exists.
- `path.split(path)`: Returns a tuple of the rest of the path and the file or directory name at the end.
- `path.splitext(path)`: Splits off the extension of a file. It returns a tuple of the path up to the last dot and the dotted extension.
- `path.join(directory1,directory2,...)`: Joins two or more path name components, inserting the operating system path separator as needed.
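
The functions listed above can be tried on the data file used earlier. This is a minimal sketch; the relative path `../Data/humchrx.txt` is only built as a string here, so the `exists` result depends on whether the file is present on your machine.

```python
import os

# Build a path with the separator the operating system expects
path = os.path.join("..", "Data", "humchrx.txt")

# Split into the directory part and the final component
directory, filename = os.path.split(path)

# Split the final component into its root and dotted extension
root, extension = os.path.splitext(filename)

print(directory, filename, root, extension)
print(os.path.exists(path))   # True only if the file is present on this machine
```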

In [20]:
?os.path.join()

Signature: os.path.join(a, *p)
Docstring:
Join two or more pathname components, inserting '/' as needed.
If any component is an absolute path, all previous path components
will be discarded.  An empty last part will result in a path that
ends with a separator.
File:      ~/miniconda3/envs/bioinf/lib/python3.7/posixpath.py
Type:      function



Explore more in your own time.

### Shutil
The `shutil` module provides utility functions for copying and archiving files and directory trees.

In [25]:
import shutil

In [38]:
help('modules')


Please wait a moment while I gather a list of all available modules...





IPython             abc                 importlib           rmagic
OpenSSL             aifc                importlib_metadata  runpy
PyQt5               antigravity         inspect             sched
Scripts             appnope             io                  secrets
__future__          argon2              ipaddress           select
_abc                argparse            ipykernel           selectors
_ast                array               ipykernel_launcher  send2trash
_asyncio            ast                 ipython_genutils    setuptools
_bisect             async_generator     ipywidgets          shelve
_blake2             asynchat            itertools           shlex
_bootlocale         asyncio             jinja2              shutil
_bz2                asyncore            json                signal
_cffi_backend       atexit              json5               simplegeneric
_codecs             attr                jsonschema          sipconfig
_codecs_cn          audioop             jup

    Install tornado itself to use zmq with the tornado IOLoop.
    
  yield from walk_packages(path, info.name+'.', onerror)


In [27]:
os.mkdir('Scripts')

In [29]:
shutil.copy('seqtools.py', 'Scripts/')

'Scripts/seqtools.py'

In [32]:
os.remove('seqtools.py')

In [35]:
shutil.move('__init__.py', 'Scripts/')

'Scripts/__init__.py'
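
The cells above change files in the notebook directory and fail on a second run, because `os.mkdir` raises `FileExistsError` when `Scripts` already exists. The sketch below repeats the same create, copy, remove and move steps inside a temporary directory so it can be re-run safely. The file names `seqtools_demo.py` and `moved_demo.py` are placeholders invented for this example.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()                      # scratch directory
source = os.path.join(workdir, "seqtools_demo.py")
with open(source, "w") as handle:
    handle.write("# placeholder module\n")

scripts_dir = os.path.join(workdir, "Scripts")
os.makedirs(scripts_dir, exist_ok=True)           # no error if it already exists

copied = shutil.copy(source, scripts_dir)         # returns the new file's path
os.remove(source)                                 # delete the original
moved = shutil.move(copied, os.path.join(workdir, "moved_demo.py"))

print(os.listdir(workdir))
shutil.rmtree(workdir)                            # clean up the scratch directory
```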