# PLM7 - the web

**Requests** is a package that simplifies accessing data from remote files over HTTP (among other tasks).

In [6]:
import requests

Before executing the cell, visit the *url* below in your browser to see what we are trying to get. The *get()* function of the *requests* package sends an HTTP request and returns the response.

In [7]:
url = 'http://www.uniprot.org/uniprot/?query=muscarinic+receptor&sort=score&format=tab'
req = requests.get(url)  
req

<Response [200]>

You should have gotten *response 200*, which is the standard response for successful HTTP requests
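Not every request succeeds, so it can help to check the status before using the text. The following is a minimal sketch of one way to do that; the helper names `check_response` and `fetch_text` and the 10-second timeout are illustrative assumptions, not part of the original notebook. It relies on `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses.

```python
import requests


def check_response(response):
    """Return the body text, or raise requests.HTTPError on a 4xx/5xx status."""
    response.raise_for_status()
    return response.text


def fetch_text(url, timeout=10):
    """Hypothetical helper: download a URL and return its text."""
    # The timeout (in seconds) is an arbitrary choice for this sketch.
    response = requests.get(url, timeout=timeout)
    return check_response(response)
```

Calling `fetch_text(url)` would then either return the text of the page or stop with an error that names the failing status code.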

The *response* object has an attribute *.text* that contains the text of the requested url

In [8]:
req.text

"Entry\tEntry name\tStatus\tProtein names\tGene names\tOrganism\tLength\nQ9U7D5\tACM3_CAEEL\treviewed\tMuscarinic acetylcholine receptor gar-3 (G-protein-linked acetylcholine receptor 3)\tgar-3 Y40H4A.1\tCaenorhabditis elegans\t611\nQ09388\tACM2_CAEEL\treviewed\tMuscarinic acetylcholine receptor gar-2 (G-protein-linked acetylcholine receptor 2)\tgar-2 F47D12.1\tCaenorhabditis elegans\t627\nP08483\tACM3_RAT\treviewed\tMuscarinic acetylcholine receptor M3\tChrm3 Chrm-3\tRattus norvegicus (Rat)\t589\nP08172\tACM2_HUMAN\treviewed\tMuscarinic acetylcholine receptor M2\tCHRM2\tHomo sapiens (Human)\t466\nP20309\tACM3_HUMAN\treviewed\tMuscarinic acetylcholine receptor M3\tCHRM3\tHomo sapiens (Human)\t590\nP11229\tACM1_HUMAN\treviewed\tMuscarinic acetylcholine receptor M1\tCHRM1\tHomo sapiens (Human)\t460\nP08482\tACM1_RAT\treviewed\tMuscarinic acetylcholine receptor M1\tChrm1 Chrm-1\tRattus norvegicus (Rat)\t460\nQ9ERZ3\tACM3_MOUSE\treviewed\tMuscarinic acetylcholine receptor M3 (Mm3 mAChR)\tC

With the *str* method *.splitlines()* we can split it into a list with one element per line

In [10]:
lines = req.text.splitlines()
lines

['Entry\tEntry name\tStatus\tProtein names\tGene names\tOrganism\tLength',
 'Q9U7D5\tACM3_CAEEL\treviewed\tMuscarinic acetylcholine receptor gar-3 (G-protein-linked acetylcholine receptor 3)\tgar-3 Y40H4A.1\tCaenorhabditis elegans\t611',
 'Q09388\tACM2_CAEEL\treviewed\tMuscarinic acetylcholine receptor gar-2 (G-protein-linked acetylcholine receptor 2)\tgar-2 F47D12.1\tCaenorhabditis elegans\t627',
 'P08483\tACM3_RAT\treviewed\tMuscarinic acetylcholine receptor M3\tChrm3 Chrm-3\tRattus norvegicus (Rat)\t589',
 'P08172\tACM2_HUMAN\treviewed\tMuscarinic acetylcholine receptor M2\tCHRM2\tHomo sapiens (Human)\t466',
 'P20309\tACM3_HUMAN\treviewed\tMuscarinic acetylcholine receptor M3\tCHRM3\tHomo sapiens (Human)\t590',
 'P11229\tACM1_HUMAN\treviewed\tMuscarinic acetylcholine receptor M1\tCHRM1\tHomo sapiens (Human)\t460',
 'P08482\tACM1_RAT\treviewed\tMuscarinic acetylcholine receptor M1\tChrm1 Chrm-1\tRattus norvegicus (Rat)\t460',
 'Q9ERZ3\tACM3_MOUSE\treviewed\tMuscarinic acetylcholine r

The first element contains the headings and the other elements contain actual data

In [11]:
print(lines[0])  # headings
print(lines[1])  # first row of data

Entry	Entry name	Status	Protein names	Gene names	Organism	Length
Q9U7D5	ACM3_CAEEL	reviewed	Muscarinic acetylcholine receptor gar-3 (G-protein-linked acetylcholine receptor 3)	gar-3 Y40H4A.1	Caenorhabditis elegans	611
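Each line is still a single string with tab-separated fields. As a small sketch of a possible next step (the variable names `headings`, `rows` and `first` are chosen here for illustration), we can split every line on the tab character and pair each value with its heading. The two lines below are copied from the output above so that the example runs without a network connection.

```python
# Header and first data row, copied from the output above.
lines = [
    'Entry\tEntry name\tStatus\tProtein names\tGene names\tOrganism\tLength',
    'Q9U7D5\tACM3_CAEEL\treviewed\tMuscarinic acetylcholine receptor gar-3 '
    '(G-protein-linked acetylcholine receptor 3)\tgar-3 Y40H4A.1\t'
    'Caenorhabditis elegans\t611',
]

headings = lines[0].split('\t')  # column names
# One dict per data row, mapping each heading to the value in that column.
rows = [dict(zip(headings, line.split('\t'))) for line in lines[1:]]

first = rows[0]
print(first['Entry'], first['Organism'], int(first['Length']))
# prints: Q9U7D5 Caenorhabditis elegans 611
```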


## BeautifulSoup

*BeautifulSoup* facilitates the task of extracting text from *HTML documents*

In [18]:
from bs4 import BeautifulSoup

Let's open an HTML document and store it as a *str*

In [19]:
with open('simple_page.html') as html_file:
    html_text = html_file.read()
html_text

'<!DOCTYPE html>\n<html>\n<head>\n  <title>Contrived Example</title>\n</head>\n<body>\n<p>I am the egg man</p>\n<p>I am the walrus</p>\n</body>\n</html>\n\n'

BeautifulSoup provides a parser that returns the *str* in an organized manner (a beautiful soup):

In [20]:
soup = BeautifulSoup(html_text, 'html.parser')
soup

<!DOCTYPE html>

<html>
<head>
<title>Contrived Example</title>
</head>
<body>
<p>I am the egg man</p>
<p>I am the walrus</p>
</body>
</html>

The *soup* object has a method to search for specific HTML tags. We will search for paragraphs (*\<p\>*)

In [21]:
soup.find_all('p')

[<p>I am the egg man</p>, <p>I am the walrus</p>]
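When only the first match is needed, *find()* can be used instead of *find_all()*. The sketch below rebuilds the same contrived page from an inline string (an assumption made so that the example does not need the file `simple_page.html`) and shows that *find()* returns a single tag, or `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Same content as the contrived page, written inline for this sketch.
html_text = ('<html><head><title>Contrived Example</title></head>'
             '<body><p>I am the egg man</p><p>I am the walrus</p></body></html>')
soup = BeautifulSoup(html_text, 'html.parser')

first_paragraph = soup.find('p')     # first match only, as a single Tag
missing = soup.find('table')         # there is no <table> here, so None
all_paragraphs = soup.find_all('p')  # every match, in document order

print(first_paragraph)      # <p>I am the egg man</p>
print(missing)              # None
print(len(all_paragraphs))  # 2
```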

Since we get a list, we can iterate through the elements on this list.

In [22]:
for paragraph in soup.find_all('p'):
    print(paragraph)

<p>I am the egg man</p>
<p>I am the walrus</p>


Each paragraph is a *tag* object with attributes

In [23]:
type(paragraph)

bs4.element.Tag

One of these attributes is *.text*, which returns only the text within the HTML tags (between the *\<p\>* and *\</p\>* tags in this case):

In [24]:
for paragraph in soup.find_all('p'):
    print(paragraph.text)

I am the egg man
I am the walrus
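A tag object carries more information than its text. As a brief sketch, the snippet below inspects a few other attributes of a tag; the `id="walrus"` attribute is a hypothetical addition to the HTML, included only so that there is something to show in *.attrs*.

```python
from bs4 import BeautifulSoup

# The id attribute is made up for this illustration.
soup = BeautifulSoup('<body><p id="walrus">I am the walrus</p></body>',
                     'html.parser')
paragraph = soup.find('p')

print(paragraph.name)         # p      (the tag name)
print(paragraph.attrs)        # {'id': 'walrus'}
print(paragraph['id'])        # walrus (a single HTML attribute)
print(paragraph.parent.name)  # body   (the enclosing tag)
print(paragraph.text)         # I am the walrus
```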


Clear? Then let's try a more realistic example. HTML pages are typically stored not on our computer but on remote servers, so we need to combine *requests* and *BeautifulSoup*. Open the following url in your browser before continuing.

In [30]:
url = 'http://aps.unmc.edu/AP/database/antiB.php'

This is very clunky HTML. It is an easy example for parsing data (that's what we call getting content from HTML pages) because the HTML code is quite simple. Things often become much more difficult with nice-looking pages!
As before, let's start by getting a response for the url using *requests*.

In [31]:
req = requests.get(url)
html_text = req.text
html_text

'\n\n\n<html>\n<br>\n<table width="100%" cellpadding="0" cellspacing="0" border="0">\n  <tbody>\n    <tr>\n      <td  bgcolor="blue" ><b><font size="4" color="#ffffff">\n          APD has 2666 antibacterial peptides </font></b>\n      </td>\n    </tr>\n  </tbody>\n</table>\n\n<div align = right>\n<a href = "query_input.php">Search Database</a> | \n<a href = "../main.php">Home </a>\n</div\n\n</html>\n\n\n\n<html>\n<b ody>\n\n<br>\n<a href=\'query_output.php?ID=00001 \' target = \'_blank\'> AP00001 </a>\n</html>\n</body>\n\n   : P31107 ,  Name: Dermaseptin-B2 (XXA, DRS-B2,  Dermaseptin B2, DRS B2, DS bII, ADENOREGULIN; UCLL1c;  frog, amphibians, animals)<br>\n<html>\n<b ody>\n\n<br>\n<a href=\'query_output.php?ID=00002 \' target = \'_blank\'> AP00002 </a>\n</html>\n</body>\n\n   : P15450 ,  Name: Abaecin (Pro-rich; insects, arthropods, invertebrates, animals)<br>\n<html>\n<b ody>\n\n<br>\n<a href=\'query_output.php?ID=00004 \' target = \'_blank\'> AP00004 </a>\n</html>\n</body>\n\n   : R

Now we are ready to access the HTML

In [32]:
soup = BeautifulSoup(html_text, 'html.parser')
soup


<html>
<br/>
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr>
<td bgcolor="blue"><b><font color="#ffffff" size="4">
          APD has 2666 antibacterial peptides </font></b>
</td>
</tr>
</tbody>
</table>
<div align="right">
<a href="query_input.php">Search Database</a> | 
<a href="../main.php">Home </a>
</div>
<html>
<b ody="">
<br/>
<a href="query_output.php?ID=00001 " target="_blank"> AP00001 </a>
</b></html>
</html>

   : P31107 ,  Name: Dermaseptin-B2 (XXA, DRS-B2,  Dermaseptin B2, DRS B2, DS bII, ADENOREGULIN; UCLL1c;  frog, amphibians, animals)<br/>
<html>
<b ody="">
<br/>
<a href="query_output.php?ID=00002 " target="_blank"> AP00002 </a>
</b></html>


   : P15450 ,  Name: Abaecin (Pro-rich; insects, arthropods, invertebrates, animals)<br/>
<html>
<b ody="">
<br/>
<a href="query_output.php?ID=00004 " target="_blank"> AP00004 </a>
</b></html>


   : Ref ,  Name: Ct-AMP1 (CtAMP1, C. ternatea-antimicrobial peptide 1; defensins; 4S=S, UCSS1a; plants)<br/>

We would like to get the names (the *AP...* identifiers) of all peptides displayed. This data is all within the *\<a\>* tags:

In [33]:
soup.find_all('a')

[<a href="query_input.php">Search Database</a>,
 <a href="../main.php">Home </a>,
 <a href="query_output.php?ID=00001 " target="_blank"> AP00001 </a>,
 <a href="query_output.php?ID=00002 " target="_blank"> AP00002 </a>,
 <a href="query_output.php?ID=00004 " target="_blank"> AP00004 </a>,
 <a href="query_output.php?ID=00005 " target="_blank"> AP00005 </a>,
 <a href="query_output.php?ID=00006 " target="_blank"> AP00006 </a>,
 <a href="query_output.php?ID=00007 " target="_blank"> AP00007 </a>,
 <a href="query_output.php?ID=00008 " target="_blank"> AP00008 </a>,
 <a href="query_output.php?ID=00009 " target="_blank"> AP00009 </a>,
 <a href="query_output.php?ID=00010 " target="_blank"> AP00010 </a>,
 <a href="query_output.php?ID=00011 " target="_blank"> AP00011 </a>,
 <a href="query_output.php?ID=00012 " target="_blank"> AP00012 </a>,
 <a href="query_output.php?ID=00013 " target="_blank"> AP00013 </a>,
 <a href="query_output.php?ID=00014 " target="_blank"> AP00014 </a>,
 <a href="query_outpu

In [34]:
for link in soup.find_all('a'):
    print(link.text)

Search Database
Home 
 AP00001 
 AP00002 
 AP00004 
 AP00005 
 AP00006 
 AP00007 
 AP00008 
 AP00009 
 AP00010 
 AP00011 
 AP00012 
 AP00013 
 AP00014 
 AP00015 
 AP00016 
 AP00017 
Next page 
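The link text above carries stray spaces, and each link also has an *href* attribute pointing to the entry page. The following sketch, run on two `<a>` tags copied from the output above, shows one possible way to clean the text and turn the relative *href* into a full address. The names `base_url`, `snippet` and `resolved` are illustrative, and using `urljoin` from the standard library is a choice made for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = 'http://aps.unmc.edu/AP/database/antiB.php'
# Two links copied from the page output, written inline for this sketch.
snippet = ('<a href="../main.php">Home </a>'
           '<a href="query_output.php?ID=00001 " target="_blank"> AP00001 </a>')
soup = BeautifulSoup(snippet, 'html.parser')

resolved = []
for link in soup.find_all('a'):
    label = link.get_text(strip=True)    # text without surrounding spaces
    href = link.get('href', '').strip()  # attribute value, '' if absent
    resolved.append((label, urljoin(base_url, href)))

for label, address in resolved:
    print(label, address)
# Home http://aps.unmc.edu/AP/main.php
# AP00001 http://aps.unmc.edu/AP/database/query_output.php?ID=00001
```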


Let's store the names:

In [35]:
names = [name.text for name in soup.find_all('a')]
names

['Search Database',
 'Home ',
 ' AP00001 ',
 ' AP00002 ',
 ' AP00004 ',
 ' AP00005 ',
 ' AP00006 ',
 ' AP00007 ',
 ' AP00008 ',
 ' AP00009 ',
 ' AP00010 ',
 ' AP00011 ',
 ' AP00012 ',
 ' AP00013 ',
 ' AP00014 ',
 ' AP00015 ',
 ' AP00016 ',
 ' AP00017 ',
 'Next page ']

Notice that there are some links to pages other than the peptides that we don't want:

In [36]:
print(names[0])
print(names[1])
print(names[-1])

Search Database
Home 
Next page 
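One simple way to drop those unwanted entries is sketched below. It assumes, based only on the output shown above, that every peptide entry starts with `AP` once the surrounding spaces are removed; the shortened `names` list and the variable `peptide_ids` are illustrative.

```python
# A shortened copy of the list above.
names = ['Search Database', 'Home ', ' AP00001 ', ' AP00002 ', ' AP00004 ',
         'Next page ']

# Keep only entries that look like peptide identifiers, without the spaces.
peptide_ids = [name.strip() for name in names
               if name.strip().startswith('AP')]
print(peptide_ids)  # ['AP00001', 'AP00002', 'AP00004']
```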


Now you are ready for **ambpdb_data.py**

## set
In Python a *set* is a collection of unique elements which is unordered and unindexed. We can create a set from a list:

In [37]:
a = ['P1', 'P2', 'P3', 'P1']
prot_a = set(a)
prot_a

{'P1', 'P2', 'P3'}
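To make "unordered and unindexed" concrete, the short sketch below shows what we can and cannot do with the set: membership tests and `len()` work, indexing does not, and `sorted()` gives a list with a predictable order. The flag `indexable` is only there for illustration.

```python
prot_a = set(['P1', 'P2', 'P3', 'P1'])

print('P2' in prot_a)  # True: membership test
print(len(prot_a))     # 3: the duplicate 'P1' was dropped
print(sorted(prot_a))  # ['P1', 'P2', 'P3']: a sorted list built from the set

indexable = True
try:
    prot_a[0]          # sets have no positions, so this fails
except TypeError:
    indexable = False
print(indexable)       # False
```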

We can also define a set using curly braces

In [38]:
a = {'P1', 'P2', 'P3', 'P1'}

We can add elements using the *.add()* method. Adding an element that is already present leaves the set unchanged

In [39]:
prot_a.add('P1')
prot_a

{'P1', 'P2', 'P3'}

Let's now create an empty set:

In [40]:
prot_b = set()

Notice that, despite the use of curly braces in *sets*, we cannot use {} to initialize an empty *set*

In [41]:
not_an_empty_set = {}
type(not_an_empty_set)

dict

Let's add an element to prot_b and get the *intersection* of both *sets*

In [42]:
prot_b.add('P1')
prot_b & prot_a  # the intersection of prot_b and prot_a

{'P1'}

In [43]:
prot_b

{'P1'}

We can also get differences (elements in *a* and not in *b* or vice versa)

In [44]:
print('a', prot_a)
print('b', prot_b)
print('a-b:', prot_a - prot_b)
print('b-a', prot_b - prot_a)

a {'P2', 'P1', 'P3'}
b {'P1'}
a-b: {'P2', 'P3'}
b-a set()
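Two related operations are worth knowing: the subset test and the symmetric difference (elements in exactly one of the two sets). The sketch below uses the same `prot_a` and `prot_b`, plus a hypothetical third set `prot_c` introduced only for this illustration.

```python
prot_a = {'P1', 'P2', 'P3'}
prot_b = {'P1'}
prot_c = {'P3', 'P9'}  # made-up set for this example

print(prot_b <= prot_a)           # True: every element of b is also in a
print(prot_a <= prot_b)           # False
print(sorted(prot_a ^ prot_c))    # ['P1', 'P2', 'P9']: in a or c, not both
print(prot_b.isdisjoint(prot_c))  # True: b and c share no element
```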


We use *.update()* when we want to add more than one element

In [45]:
prot_b.update(('P4', 'P5'))
prot_b

{'P1', 'P4', 'P5'}
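*.update()* accepts any iterable, which hides a common pitfall: a *str* is itself an iterable of characters. The following sketch illustrates both points; the set `letters` is a name made up for this example.

```python
prot_b = {'P1'}
prot_b.update(['P4', 'P5'])           # a list works as well as a tuple
prot_b.update({'P5', 'P6'}, ('P7',))  # several iterables at once
print(sorted(prot_b))                 # ['P1', 'P4', 'P5', 'P6', 'P7']

letters = set()
letters.update('P4')                  # adds 'P' and '4', not 'P4'
print(sorted(letters))                # ['4', 'P']
```

To add a single string as one element, use *.add()* or wrap it in a list or tuple.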

Now let's do *union*:

In [46]:
prot_a | prot_b

{'P1', 'P2', 'P3', 'P4', 'P5'}
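The operators `|` and `&` require a *set* on both sides, whereas the equivalent methods *.union()* and *.intersection()* accept any iterable. The sketch below shows the difference using a plain list; `new_ids` and `operator_failed` are names chosen for this illustration.

```python
prot_a = {'P1', 'P2', 'P3'}
new_ids = ['P4', 'P5', 'P1']  # a plain list, not a set

print(sorted(prot_a.union(new_ids)))         # ['P1', 'P2', 'P3', 'P4', 'P5']
print(sorted(prot_a.intersection(new_ids)))  # ['P1']
print(sorted(prot_a))                        # ['P1', 'P2', 'P3']: unchanged

operator_failed = False
try:
    prot_a | new_ids          # set | list is not supported
except TypeError:
    operator_failed = True
print(operator_failed)        # True
```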

And finally, let's remove elements from a *set*

In [47]:
print(prot_b)
prot_b.remove('P1')
prot_b

{'P5', 'P1', 'P4'}


{'P4', 'P5'}
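Note that *.remove()* raises a `KeyError` if the element is not in the set. When that is not wanted, *.discard()* does the same job but silently ignores missing elements, as this short sketch shows (the flag `remove_failed` is only illustrative).

```python
prot_b = {'P4', 'P5'}

prot_b.discard('P1')      # 'P1' is absent: nothing happens
remove_failed = False
try:
    prot_b.remove('P1')   # 'P1' is absent: KeyError
except KeyError:
    remove_failed = True

prot_b.discard('P4')      # present: removed
print(remove_failed)      # True
print(prot_b)             # {'P5'}
```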