# Text manipulation

Text manipulation is quite simple in Python thanks to a wide variety of packages. It is generally advisable not to reinvent the wheel, so only perform quick and dirty text parsing when it is really necessary.

### File IO, streaming, serialization

Here is a basic raw text file opening template in Python.

In [None]:
# f = open('path/to/file.txt', 'r')
# # lines = f.readlines()  # alternative: read all lines into a list
# for line in f:
#     # do stuff
#     print(line)
# f.close()

What if we want to read text from the standard input? (Useful for pipelining)

In [1]:
# use like this: cat file.txt | python script.py
import sys
for line in sys.stdin:
    # do stuff
    print(line, end='')  # each line already ends with a newline

# there is a dedicated module for text IO
# import io
# t = io.StringIO('some text')

### Pickling

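Pickling is Python jargon for object serialization, a very important feature that lets you save the contents of a Python data structure directly to disk and load it back later. The pickle format is a Python-specific binary byte stream. Only unpickle data you trust, because loading a pickle can execute arbitrary code.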

In [2]:
d = {'first': [1,"two"], 'second': set([3, 4, 'five'])}
import pickle
with open('dumpfile.pkl','wb') as fout:
    pickle.dump(d, fout)
with open('dumpfile.pkl','rb') as fin:
    d2 = pickle.load(fin)
print(d2)

{'first': [1, 'two'], 'second': {'five', 3, 4}}
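
The same round trip can also be done in memory, without a file. The following is a minimal sketch using `pickle.dumps` and `pickle.loads`; the variable names `blob` and `restored` are only illustrative.

```python
import pickle

d = {'first': [1, "two"], 'second': {3, 4, 'five'}}

# dumps() returns the serialized object as a bytes value instead of writing a file
blob = pickle.dumps(d)

# loads() rebuilds an equal object, including the set, from those bytes
restored = pickle.loads(blob)

print(type(blob).__name__)
print(restored == d)
print(type(restored['second']).__name__)
```

Unlike JSON, pickle preserves Python-specific types such as `set`, which is why the set survives the round trip here.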


### JSON

Short for JavaScript Object Notation, JSON has become ubiquitous as a simple data interchange format, mainly in remote Web API calls and microtransactions. JSON is easily loaded into native Python data structures. An example:

In [3]:
import json
#json_string = json.dumps([1, 2, 3, "a", "b", "c"])
d = {'first': [1,"two"], 'second': [3, 4, 'five']}
json_string = json.dumps(d)
print(json_string)


{"first": [1, "two"], "second": [3, 4, "five"]}
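
The cell above only serializes. A minimal sketch of the opposite direction, using `json.loads` on the string printed above, could look like this. It also shows why the `set` from the pickling example was replaced by a list: JSON has no set type. The name `set_error` is only illustrative.

```python
import json

json_string = '{"first": [1, "two"], "second": [3, 4, "five"]}'

# loads() turns JSON objects into dicts and JSON arrays into lists
data = json.loads(json_string)
print(type(data).__name__, data["second"][2])

# A set cannot be serialized to JSON, so dumps() raises TypeError
set_error = None
try:
    json.dumps({"second": {3, 4}})
except TypeError as err:
    set_error = str(err)
print(set_error)
```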


# Parsing and regular expressions

Parsing is needed for any raw text format in biology, such as FASTA, FASTQ, PDB, VCF, GFF and SAM.


Example: FASTA parsing

- Open the file containing all peptide sequences in the human body.
- How many unknown peptides does it contain?
- How many unique genes and transcripts are in there for the unknown peptides?
- Output a tab separated file containing the gene id and transcript id for each unknown peptide.

Observation:
Usage of Biopython and pandas modules.

Task:
Order the chromosomes by the number of unknown peptides versus the total number of peptides they translate.

>ENSP00000388523 pep:known chromosome:GRCh38:7:142300924:142301432:1 gene:ENSG00000226660 transcript:ENST00000455382 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene
MDTWLVCWAIFSLLKAGLTEPEVTQTPSHQVTQMGQEVILRCVPISNHLYFYWYRQILGQ
KVEFLVSFYNNEISEKSEIFDDQFSVERPDGSNFTLKIRSTKLEDSAMYFCASSE


In [19]:
import re

peptides = {}
with open('data/Homo_sapiens.GRCh38.pep.all.fa', 'r') as f:
    for l in f:
        if l[0] == '>':
            # header fields: >ID pep:STATUS chromosome:... gene:ID transcript:ID ...
            record = {}
            r = l.strip('\n').split()
            pepid = r[0][1:]
            record['pep'] = 1 if r[1].split(':')[1] == 'known' else 0
            record['gene'] = r[3].split(':')[1]
            record['transcript'] = r[4].split(':')[1]
            peptides[pepid] = record

# using a regular expression to match header lines that do NOT contain 'known'
nupep2 = 0
#pattern = re.compile('^>.*(known).*')
pattern = re.compile('^>((?!known).)*$')
with open('data/Homo_sapiens.GRCh38.pep.all.fa', 'r') as f:
    for l in f:
        if pattern.search(l) is not None:
            nupep2 += 1

npep = len(peptides)
upep = set(pepid for pepid in peptides if peptides[pepid]['pep'] == 0)  # unknown peptides
nunknown = len(upep)
genes = set(peptides[pepid]['gene'] for pepid in upep)
trans = set(peptides[pepid]['transcript'] for pepid in upep)
print(npep, nupep2, nunknown, len(genes), len(trans))


with open('unknown_peptides.txt', 'w') as f:
    for pepid in upep:
        f.write('\t'.join([pepid, peptides[pepid]['gene'], peptides[pepid]['transcript']]) + '\n')


99436 28828 28828 11116 70608
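
For the chromosome-ordering task, one possible approach is to pull the peptide status and the chromosome name out of each header with a regular expression using named groups, then rank chromosomes by their fraction of unknown peptides. The sketch below runs on a few made-up header lines that follow the layout of the sample record above. The ids `P1`..`P4`, the status values `novel` and `putative`, and the names `HEADER_RE`, `ratios` and `ranking` are assumptions for illustration. Headers in the real file that use a different location field would need a broader pattern.

```python
import re
from collections import Counter

# Hypothetical header lines in the style of the Ensembl peptide FASTA record above
HEADERS = [
    ">P1 pep:known chromosome:GRCh38:7:100:200:1 gene:G1 transcript:T1",
    ">P2 pep:novel chromosome:GRCh38:7:300:400:1 gene:G2 transcript:T2",
    ">P3 pep:putative chromosome:GRCh38:X:10:90:-1 gene:G3 transcript:T3",
    ">P4 pep:known chromosome:GRCh38:1:5:50:1 gene:G4 transcript:T4",
]

# Named groups capture the peptide status and the chromosome name
HEADER_RE = re.compile(
    r"^>\S+ pep:(?P<status>\S+) chromosome:[^:]+:(?P<chrom>[^:]+):"
)

total, unknown = Counter(), Counter()
for line in HEADERS:
    m = HEADER_RE.match(line)
    if m is None:
        continue  # not a header in the assumed layout
    total[m["chrom"]] += 1
    if m["status"] != "known":
        unknown[m["chrom"]] += 1

# Fraction of unknown peptides per chromosome, highest first
ratios = {chrom: unknown[chrom] / total[chrom] for chrom in total}
ranking = sorted(ratios, key=ratios.get, reverse=True)
print(ranking)
```

On the real file the same loop would read header lines from the open file instead of the `HEADERS` list, and the two counters could be loaded into a pandas DataFrame for further analysis.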


In [3]:
from Bio import SeqIO

i = 0
with open('data/Homo_sapiens.GRCh38.pep.all.fa', 'r') as f:
    # SeqIO.parse returns an iterator of SeqRecord objects, so loop over it
    for record in SeqIO.parse(f, 'fasta'):
        name, sequence = record.id, str(record.seq)
        if 20 < len(sequence) < 100:
            i += 1
            print(i)
            print("Name", name)
            print("Sequence", sequence)
            if i > 5: break

## XML parsing

XML is a general file format used for data interchange, especially among different applications. One of its most popular uses in biology is the SBML format, which aims to store a biological model specification, no matter how specific that model may be.

Task:

Download a curated SBML file from the BioModels database:
http://www.ebi.ac.uk/biomodels-main/

Find out how many reactions the file contains.

Extra task:

Make a simplified XML file of the reactants and their k-values for each reaction.

In [13]:
import sys

import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='data/curated_sbml.xml')
#tree = ET.parse(open('data/curated_sbml.xml'))
root = tree.getroot()
print(root.tag, root.attrib)
# walk the first two levels of the document
for child in root:
    print(child.tag, child.attrib)
    for child2 in child:
        print(child2.tag, child2.attrib)

#tree.write(sys.stdout)
for elem in root.iter('reaction'):
    print(elem.tag, elem.attrib)

for elem in root.iter('species'):
    print(elem.tag, elem.attrib)
    print(elem.get('id'))

print(tree.findall('.//reaction'))

sbml {'version': '4', 'metaid': '_000000', 'level': '2'}
model {'id': 'BIOMD0000000001', 'metaid': '_000001', 'name': 'Edelstein1996 - EPSP ACh event'}
notes {}
annotation {}
listOfCompartments {}
listOfSpecies {}
listOfParameters {}
listOfReactions {}
listOfEvents {}
reaction {'name': 'React0', 'id': 'React0', 'metaid': '_000016', 'sboTerm': 'SBO:0000177'}
reaction {'name': 'React1', 'id': 'React1', 'metaid': '_000017', 'sboTerm': 'SBO:0000177'}
reaction {'name': 'React2', 'id': 'React2', 'metaid': '_000018', 'sboTerm': 'SBO:0000181'}
reaction {'name': 'React3', 'id': 'React3', 'metaid': '_000019', 'sboTerm': 'SBO:0000177'}
reaction {'name': 'React4', 'id': 'React4', 'metaid': '_000020', 'sboTerm': 'SBO:0000177'}
reaction {'name': 'React5', 'id': 'React5', 'metaid': '_000021', 'sboTerm': 'SBO:0000181'}
reaction {'name': 'React6', 'id': 'React6', 'metaid': '_000022', 'sboTerm': 'SBO:0000181'}
reaction {'name': 'React7', 'id': 'React7', 'metaid': '_000023', 'sboTerm': 'SBO:0000177'}
rea
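
Counting the reactions and collecting the reactants of each one can be sketched as follows. The XML string here is a made-up, heavily trimmed SBML-like document, not the BioModels file. It declares a default namespace, which is an assumption: the output above shows bare tag names, but if a file does declare one, `ElementTree` requires the namespace in every path. The names `SAMPLE`, `NS` and `reactants` are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed SBML-like document (not the real BioModels file)
SAMPLE = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy">
    <listOfReactions>
      <reaction id="R0">
        <listOfReactants><speciesReference species="A"/></listOfReactants>
      </reaction>
      <reaction id="R1">
        <listOfReactants>
          <speciesReference species="B"/>
          <speciesReference species="C"/>
        </listOfReactants>
      </reaction>
    </listOfReactions>
  </model>
</sbml>"""

# Map a prefix of our choosing to the namespace declared in the document
NS = {"sbml": "http://www.sbml.org/sbml/level2/version4"}

root = ET.fromstring(SAMPLE)
reactions = root.findall(".//sbml:reaction", NS)

# For each reaction id, list the species named in its listOfReactants
reactants = {
    r.get("id"): [
        s.get("species")
        for s in r.findall("./sbml:listOfReactants/sbml:speciesReference", NS)
    ]
    for r in reactions
}

print(len(reactions))
print(reactants)
```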

### xmltodict

In [None]:
import xmltodict
with open('data/curated_sbml.xml','r') as fd:
    doc = xmltodict.parse(fd.read())

## Web scraping

Web scraping is concerned with automatically extracting and processing information from the Internet.

Task:
- Create your own web crawlers, to mine the relevant articles from your favorite journals. 

[BeautifulSoup](http://www.pythonforbeginners.com/python-on-the-web/beautifulsoup-4-python/) is loved by hackers. Aside from HTML it can also parse XML.

Here is a small script that will list all web anchors from the Reddit main page (an anchor is an HTML tag normally used to provide hyperlinks and reference points inside a web page).

In [4]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

# download the raw HTML of the page
redditFile = urlopen("http://www.reddit.com")
redditHtml = redditFile.read()
redditFile.close()

# parse it and print the target of every anchor tag
soup = BeautifulSoup(redditHtml, "html.parser")
redditAll = soup.find_all("a")
for link in redditAll:
    print(link.get('href'))

#content
http://www.reddit.com/r/Allsvenskan/
http://www.reddit.com/r/Art/
http://www.reddit.com/r/AskReddit/
http://www.reddit.com/r/askscience/
http://www.reddit.com/r/aww/
http://www.reddit.com/r/books/
http://www.reddit.com/r/creepy/
http://www.reddit.com/r/dataisbeautiful/
http://www.reddit.com/r/DIY/
http://www.reddit.com/r/Documentaries/
http://www.reddit.com/r/EarthPorn/
http://www.reddit.com/r/europe/
http://www.reddit.com/r/explainlikeimfive/
http://www.reddit.com/r/Fitness/
http://www.reddit.com/r/food/
http://www.reddit.com/r/funny/
http://www.reddit.com/r/Futurology/
http://www.reddit.com/r/gadgets/
http://www.reddit.com/r/gaming/
http://www.reddit.com/r/GetMotivated/
http://www.reddit.com/r/gifs/
http://www.reddit.com/r/history/
http://www.reddit.com/r/IAmA/
http://www.reddit.com/r/InternetIsBeautiful/
http://www.reddit.com/r/intresseklubben/
http://www.reddit.com/r/Jokes/
http://www.reddit.com/r/LifeProTips/
http://www.reddit.com/r/listentothis/
http://www.reddit.com/r/m