## <font color = 'blue'> Autophagy, cell death, and lysosomes </font>
### Introduction  
#### April 1, 2020
- <font size = '3' color = 'red'>**Briefly**</font><br>
Autophagy is the process that delivers intracellular components to lysosomes for degradation. This process can promote or inhibit cell death. It can inhibit cell death because the degraded components can be used as an energy source in case of starvation. "*Much of this interest has been driven by the fact that manipulation of autophagy holds great promise for improving treatment of diverse diseases.*" Microautophagy, macroautophagy, and chaperone-mediated autophagy are distinct autophagic mechanisms that all depend on lysosomes (i.e. they rely on lysosomes to degrade the components they hold or sequester).<br>
**Cell death can be in many forms:** <br>
* Extrinsic apoptosis  
* Caspase dependent  
* Caspase independent intrinsic apoptosis (mitochondrial).
* Regulated necrosis  
* Autophagic   
* Mitotic catastrophic <br>
As for apoptotic cell death (programmed cell death), which is important in normal development and is an essential mechanism for removing damaged or dangerous cells, it's important to know that many signals trigger apoptosis (mitochondrial, Golgi apparatus, and lysosomal).<br>
**Lysosomal cell death is divided into:**
- Necrotic LCD 
- Autophagic LCD 
- Apoptotic LCD <br>
Lysosomal membrane permeabilization can predict which kind of cell death is taking place (e.g. partial selective permeabilization triggers apoptosis, while complete breakdown of the organelle can occur in unregulated necrosis).
As the brief overview above shows, the three areas (autophagy, cell death, and lysosomes) overlap heavily, so a gene/protein studied in one area may also be involved in another.<br>
Therefore, **one of the aims** of this project is to gather information about the genes related to autophagy, cell death, and lysosomes and find the overlap between them in order to understand their mechanisms better. <br>
#### <font color = 'blue'>Meeting with Sonja:</font>
**Tasks:**<br>
1. Write a script to automatically download all the databases.
2. Write a script to extract the information from the databases automatically (the download format doesn't matter). Extract gene IDs, gene names and their synonyms if available, UniProt IDs if available, and protein names.<br>
3. The above steps can be combined.<br>
4. Next step is to merge entries; those that do not have UniProt IDs will be assigned one (take into account that some genes, e.g. microRNA genes, are not protein-coding, so they won't have UniProt IDs).<br>
5. Then we will have a list of genes classified according to how they are reported in different databases, e.g. LAMP1 reported to be in lysosomes in “this database”, or reported to be important in cell death in “this database”, etc.<br>
6. One complicated thing that could be added is finding homologs in different species, e.g. one database reports galectin-3 in humans and another reports it in yeast. This could be part of the functional annotation.<br>
7. At the end we will have a list of genes known to be related to ACL (autophagy, cell death, lysosomes). We can look at the overlap between all of them and compare our results to the text mining results.<br>

8. They knocked down some genes and looked at the effect on cell count (if the count drops, the cells most probably died). There’s already a list of these genes (hits/screens). So when I get the database tables we can see how many genes on these lists are already reported to be involved in lysosomes, cell death, or autophagy. We also want to see which genes regulate cells that are dying but not dead yet.<br>

9. We don’t have data for what is dying.<br>
10. I will look at the genes whose knockdown we know kills the cells, and check whether they have something to do with lysosomes.<br>
11. Final step is to do pathway analysis, what type of genes are interacting with each other (draw pathway). <br><br>
***IN SUMMARY***<br>
1. First step is to get information from databases.
2. Second step is to compare the screens to the gene list I got. CellProfiler can be used in this step to compare the image results for the cells that are dying (after getting Malou’s results).
3. Do pathway analysis (e.g. 14 genes are interacting with each other so they may form a pathway).<br>
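
The overlap step described above can be sketched with plain Python sets. This is only a minimal illustration: the database names and the gene lists in `sources` are placeholders, not results from any real database, and `categories_per_gene` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical input: database name -> (category, gene symbols).
# Names and gene lists are placeholders, not real query results.
sources = {
    "autophagy_db": ("autophagy", {"BECN1", "ATG14", "SQSTM1", "LAMP2"}),
    "cell_death_db": ("cell death", {"BAX", "BECN1", "CTSB"}),
    "lysosome_db": ("lysosome", {"LAMP1", "LAMP2", "CTSB", "SQSTM1"}),
}


def categories_per_gene(sources):
    """Map each gene symbol to the set of categories that report it."""
    reported = defaultdict(set)
    for category, genes in sources.values():
        for gene in genes:
            # Upper-case the symbol so 'lamp1' and 'LAMP1' are merged.
            reported[gene.upper()].add(category)
    return dict(reported)


reported = categories_per_gene(sources)

# Genes reported in more than one of the three categories.
overlap = sorted(g for g, cats in reported.items() if len(cats) > 1)
print(overlap)

# Screen hits (also placeholders) that are already known from the databases.
screen_hits = {"LAMP1", "CTSB", "UNKNOWN1"}
known_hits = sorted(screen_hits & reported.keys())
print(known_hits)
```

Matching on upper-cased symbols is a simplification; once UniProt IDs are assigned, they would probably be a more reliable key than gene symbols.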


### April 2-8, 2020
### <font color = 'blue'>Tasks:</font>
- Learn how to use packages for downloading **zipped, gzipped, tar files as well as csv tables.**<br>
- Write code to download the databases.<br>
- Learn how to use jupyter notebook and jupyter markdown.<br>


<font size = '3'>**Downloading The Autophagy Database**</font>

In [25]:
import tarfile
import wget
import gzip
import zipfile
from zipfile import ZipFile
import csv
import pandas as pd
from owlready2 import *
import goatools
from goatools.obo_parser import GODag

#Downloading .tar.gz files from "The Autophagy database". It requires wget and tarfile libraries.
#save the url of the file to a variable
url = "http://www.tanpaku.org/autophagy/download/autophagyDB.tar.gz"
#download with wget and use the name of the file with its extension
wget.download(url,'autophagyDB.tar.gz')
#Open the archive with tarfile; the 'r:gz' mode states explicitly that it is gzip-compressed (the default mode 'r' would also detect this). No 'rb' is needed because tarfile opens the file itself.
AutophagyDB= tarfile.open('autophagyDB.tar.gz', "r:gz")
#extract the content
AutophagyDB.extractall()
AutophagyDB.close()
#The output was saved in /Volumes/LaCie/MasterThesis2020/The_Autophagy_database

<font size = '3'>**Downloading The Gene Ontology Database**</font>

In [11]:
#Downloading go.owl. This requires owlready2 library
#specify the directory you want to append the file to
onto_path.append('/Volumes/LaCie/MasterThesis2020/jupTest')
go_onto = get_ontology("http://purl.obolibrary.org/obo/go.owl").load()
#save the file 
go_onto.save()
#The output was saved in /Volumes/LaCie/MasterThesis2020/The_GO_Resource_database

In [16]:
#Downloading go-basic-obo. This requires goatools library and need to import GODag
url = 'http://purl.obolibrary.org/obo/go/go-basic.obo'
wget.download(url,'go-basic.obo')
go_obo = goatools.obo_parser.GODag('go-basic.obo')
#The output was saved in /Volumes/LaCie/MasterThesis2020/The_GO_Resource_database

go-basic.obo: fmt(1.2) rel(2020-03-23) 47,232 GO Terms


<font color = 'red' size = '4'>NOTE:<br> On the 7th of April, Sonja told me to try parsing the Gene Ontology database by searching the downloaded html file for specific GO terms concerning autophagy, cell death and lysosomes and getting the list of genes for those terms. So I might not use the following files, but I will keep them anyway.</font> <br>
**The following code downloads and gunzips gene annotation files for 5 species, whereby Homo sapiens has 4 different files (protein, isoform, complex, rna).** <br>



In [None]:
#These files require wget and gzip libraries
#I downloaded these files already using Atom and copied them to jupyter notebook.
#Downloading .gz files from "The Gene Ontology Database"
#Homo sapiens protein
url = 'http://geneontology.org/gene-associations/goa_human.gaf.gz'
wget.download(url, 'goa_human.gaf.gz')
#specify the reading mode for the file ('rb' = read binary). Two files will end up on disk, .gaf.gz and .gaf (I can name them whatever I want)
Go_human = gzip.open('goa_human.gaf.gz', 'rb')
#read the whole decompressed content, then close the handle to free resources.
Go_annotation_human = Go_human.read()
Go_human.close()
#Now write the decompressed content to another file. I specify 'wb' because the content is binary
output = open('goa_human.gaf', 'wb')
# I write the read file into the output file.
output.write(Go_annotation_human)
output.close()

#Homo sapiens complex
url = 'http://geneontology.org/gene-associations/goa_human_complex.gaf.gz'
wget.download(url, 'goa_human_complex.gaf.gz')
Go_human_complex = gzip.open('goa_human_complex.gaf.gz', 'rb')
Go_annotation_human_complex = Go_human_complex.read()
Go_human_complex.close()
output_complex = open('goa_human_complex.gaf', 'wb')
output_complex.write(Go_annotation_human_complex)
output_complex.close()

#Homo sapiens isoform
url = 'http://geneontology.org/gene-associations/goa_human_isoform.gaf.gz'
wget.download(url, 'goa_human_isoform.gaf.gz')
Go_human_isoform = gzip.open('goa_human_isoform.gaf.gz', 'rb')
Go_annotation_human_isoform = Go_human_isoform.read()
Go_human_isoform.close()
output_isoform = open('goa_human_isoform.gaf', 'wb')
output_isoform.write(Go_annotation_human_isoform)
output_isoform.close()

#Homo sapiens rna
url = 'http://geneontology.org/gene-associations/goa_human_rna.gaf.gz'
wget.download(url, 'goa_human_rna.gaf.gz')
Go_human_rna = gzip.open('goa_human_rna.gaf.gz', 'rb')
Go_annotation_human_rna = Go_human_rna.read()
Go_human_rna.close()
output_rna = open('goa_human_rna.gaf', 'wb')
output_rna.write(Go_annotation_human_rna)
output_rna.close()

#C.elegans
url = 'http://current.geneontology.org/annotations/wb.gaf.gz'
wget.download(url, 'wb.gaf.gz')
Go_elegans = gzip.open('wb.gaf.gz', 'rb')
Go_annotation_elegans = Go_elegans.read()
Go_elegans.close()
output = open('wb.gaf', 'wb')
output.write(Go_annotation_elegans)
output.close()

#Mus musculus
url = 'http://current.geneontology.org/annotations/mgi.gaf.gz'
wget.download(url, 'mgi.gaf.gz')
Go_mus = gzip.open('mgi.gaf.gz', 'rb')
Go_annotation_mus = Go_mus.read()
Go_mus.close()
output_mus = open('mgi.gaf', 'wb')
output_mus.write(Go_annotation_mus)
output_mus.close()

#Rattus norvegicus
url = 'http://current.geneontology.org/annotations/rgd.gaf.gz'
wget.download(url, 'rgd.gaf.gz')
Go_rattus = gzip.open('rgd.gaf.gz', 'rb')
Go_annotation_rattus = Go_rattus.read()
Go_rattus.close()
output_rattus = open('rgd.gaf', 'wb')
output_rattus.write(Go_annotation_rattus)
output_rattus.close()

#Saccharomyces cerevisiae
url = 'http://current.geneontology.org/annotations/sgd.gaf.gz'
wget.download(url, 'sgd.gaf.gz')
Go_sacch = gzip.open('sgd.gaf.gz', 'rb')
Go_annotation_sacch = Go_sacch.read()
Go_sacch.close()
output_sacch = open('sgd.gaf', 'wb')
output_sacch.write(Go_annotation_sacch)
output_sacch.close()
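
The cells above repeat the same read-and-write pattern for every annotation file and load each file fully into memory. A minimal sketch of a reusable alternative is shown here; `gunzip` is a hypothetical helper name, and it assumes the `.gz` file has already been downloaded.

```python
import gzip
import shutil


def gunzip(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress src_path (.gz) into dst_path in chunks.

    Copying in chunks avoids holding the whole decompressed file in
    memory, which matters for very large annotation files.
    """
    with gzip.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path


# Example call, assuming the archive is already in the working directory:
# gunzip('goa_human.gaf.gz', 'goa_human.gaf')
```

The `with` statement closes both handles even if decompression fails part-way, so no explicit `close()` calls are needed.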


<font size = '3'>**Downloading The Human Autophagy Modulator Database**</font>

In [None]:
#This file requires zipfile and wget libraries
#Downloading .zip files from "The Human Autophagy Modulator Database" HAMdb (proteins with autophagy information)
url = 'http://hamdb.scbdd.com/static/home/download/protein-role-csv.zip'
wget.download(url, 'protein-role-csv.zip')
#name the archive handle something other than zip so the built-in zip() stays usable in later cells
archive = zipfile.ZipFile('protein-role-csv.zip')
archive.printdir()
archive.extractall()
archive.close()
#when parsing the file specify the encoding = 'latin1' (e.g. with open('downloaded file', 'r' , encoding = 'latin1') as file)

In [None]:
#Downloading proteins with basic information (HAMdb)
url = 'http://hamdb.scbdd.com/static/home/download/protein-basic-csv.zip'
wget.download(url, 'protein-basic-csv.zip')
#name the archive handle something other than zip so the built-in zip() stays usable in later cells
archive = zipfile.ZipFile('protein-basic-csv.zip')
archive.printdir()
archive.extractall()
archive.close()
#when parsing the file specify the encoding = 'latin1' and low_memory = False # I set low_memory = False because when I read the file I got this warning on the terminal "DtypeWarning: Columns (19,24,30) have mixed types. Specify dtype option on import or set low_memory=False."
#keep in mind # df =  pd.read_csv('protein-role.csv', encoding = 'latin1', low_memory=False) if needed.

<font size = '3' color = 'red'> **I will check later on with Sonja on whether to use the following two files from HAMdb or not**</font>

In [None]:
#Downloading Micro-RNA with autophagy information (HAMdb)
url = 'http://hamdb.scbdd.com/static/home/download/RNA-role-csv.zip'
wget.download(url, 'RNA-role-csv.zip')
#name the archive handle something other than zip so the built-in zip() stays usable in later cells
archive = zipfile.ZipFile('RNA-role-csv.zip')
archive.printdir()
archive.extractall()
archive.close()
#when parsing the file specify the encoding = 'latin1'

In [None]:
#Downloading Micro-RNA with basic information (HAMdb)
url = 'http://hamdb.scbdd.com/static/home/download/RNA-basic-csv.zip'
wget.download(url, 'RNA-basic-csv.zip')
#name the archive handle something other than zip so the built-in zip() stays usable in later cells
archive = zipfile.ZipFile('RNA-basic-csv.zip')
archive.printdir()
archive.extractall()
archive.close()
#when parsing the file specify the encoding = 'latin1'
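
The four HAMdb cells differ only in the archive name, so the extraction step could be wrapped in one small function. This is a sketch under the assumption that the archives are already downloaded; `extract_zip` is a hypothetical helper name. Binding the archive to a name other than `zip` also keeps Python's built-in `zip()` available, which later cells in this notebook rely on.

```python
import zipfile


def extract_zip(archive_path, target_dir='.'):
    """Extract every member of a .zip archive and return the member names."""
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()
        archive.extractall(target_dir)
    return names


# Example calls, assuming the archives are in the working directory:
# for name in ['protein-role-csv.zip', 'protein-basic-csv.zip',
#              'RNA-role-csv.zip', 'RNA-basic-csv.zip']:
#     print(name, extract_zip(name))
```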

<font size = '3'>**Downloading Deathbase**</font>

In [None]:
#Downloading Deathbase (public list of proteins)
url = 'http://www.deathbase.org/docs/protein_list.txt'
wget.download(url, 'protein_list.txt')

<font size = '3'>**Downloading yeast CellDeath Database**</font> <br> <font color = 'red'>Note:</font> <br> **#good way to parse csv file.**<br>
with open('yApoptosis.csv', 'r', encoding = 'latin1') as csv_file:<br>
csv_reader = csv.DictReader(csv_file)<br>
for line in csv_reader:<br>
print(line['gene_name'])

In [None]:
#Downloading yeast cellDeath database (yeast apoptosis database):
url = 'http://ycelldeath.com/yapoptosis/download/yApoptosis.csv'
wget.download(url, 'yApoptosis.csv')
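
The `csv.DictReader` note above loses its indentation in the markdown cell, so here it is as a small runnable sketch. The in-memory sample stands in for `yApoptosis.csv`; only the `gene_name` column comes from the note, while the second column and both rows are made-up placeholders.

```python
import csv
import io

# Hypothetical two-row sample standing in for yApoptosis.csv.
sample = io.StringIO(
    "gene_name,description\n"
    "YCA1,placeholder text\n"
    "NUC1,placeholder text\n"
)


def gene_names(csv_file, column='gene_name'):
    """Return the non-empty values of one column from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return [row[column] for row in reader if row.get(column)]


names = gene_names(sample)
print(names)

# With the real file (encoding as recommended in the note above):
# with open('yApoptosis.csv', 'r', encoding='latin1', newline='') as csv_file:
#     names = gene_names(csv_file)
```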

### April 9, 2020
### <font color = 'blue'>Tasks:</font>
- Start reading previous student's notebook.<br>
- Figure out a way to fetch gene ontology data (product/gene name, synonyms (if any), uniprot ID (if any),GO terms). Explore GO database more.<br> 
- Start reading this [Extremely Helpful GO-Book](https://link.springer.com/content/pdf/10.1007%2F978-1-4939-3743-1.pdf)<br>
- I downloaded goa_uniprot_all.gaf.gz from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz. This file contains all GO annotations and information for proteins in the UniProt KnowledgeBase (UniProtKB) and for entities other than proteins, e.g., macromolecular complexes (Complex Portal identifiers) and RNAs (RNAcentral identifiers).

In [None]:
#was downloaded already on Atom
url = 'ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz'
wget.download(url, 'goa_uniprot_all.gaf.gz')
GO = gzip.open('goa_uniprot_all.gaf.gz', 'rb')
Go_annotation_uniprot = GO.read()
GO.close()
output = open('goa_uniprot_all.gaf', 'wb')
output.write(Go_annotation_uniprot)
output.close()

In [2]:
#counting the number of annotation lines in this file (header lines starting with '!' are skipped)

goa_uniprot = "goa_uniprot_all.gaf"
count = 0
with open(goa_uniprot, 'r') as handle:
    for line in handle:
        if not line.startswith('!'):
            count += 1
print("Total number of lines is:", count)

Total number of lines is: 770661751
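
With this many lines, the file is easier to handle as a stream than as one object in memory. The sketch below reads a GAF file line by line (gzipped or not) and keeps only the annotations for a given set of GO IDs. The column positions follow the GAF 2.x layout; `iter_gaf_hits` is a hypothetical helper name, and it matches direct annotations only, so annotations to child terms of a GO ID would need the ontology as well.

```python
import gzip

# 0-based positions of the GAF 2.x columns used here:
# DB Object ID, DB Object Symbol, GO ID, DB Object Name,
# DB Object Synonym, Taxon.
DB_OBJECT_ID, SYMBOL, GO_ID, NAME, SYNONYMS, TAXON = 1, 2, 4, 9, 10, 12


def iter_gaf_hits(path, go_ids):
    """Yield one dict per annotation line whose GO ID is in go_ids."""
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rt', encoding='utf-8') as handle:
        for line in handle:
            if line.startswith('!'):
                continue  # header/comment line
            cols = line.rstrip('\n').split('\t')
            if len(cols) < 13 or cols[GO_ID] not in go_ids:
                continue
            yield {
                'id': cols[DB_OBJECT_ID],
                'symbol': cols[SYMBOL],
                'go_id': cols[GO_ID],
                'name': cols[NAME],
                'synonyms': cols[SYNONYMS].split('|') if cols[SYNONYMS] else [],
                'taxon': cols[TAXON],
            }


# Example call, assuming the file is in the working directory:
# for hit in iter_gaf_hits('goa_uniprot_all.gaf.gz', {'GO:0006914'}):
#     print(hit['id'], hit['symbol'], hit['synonyms'])
```

Reading the `.gz` file directly also avoids keeping a second, decompressed copy on disk.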


### April 10, 2020
### <font color = 'blue'>Tasks:</font>
- Continue reading this [Extremely Helpful GO-Book](https://link.springer.com/content/pdf/10.1007%2F978-1-4939-3743-1.pdf)<br>
- Start learning about webscraping in order to try parsing [Gene Ontology amigo table](http://amigo.geneontology.org/amigo/search/bioentity?q=GO:0006914) as well as [CASBAH annotation table](http://bioinf.gen.tcd.ie/cgi-bin/casbah/casbah.pl)

<font color = 'red'> In the code cell below I was learning how to use BeautifulSoup and pandas for web scraping and how to extract features from html pages.</font><br> **Always remember to "inspect" the page before parsing; this gives info about the class, div, tr, td, etc.**

In [None]:
#Web scraping trial

import bs4
from bs4 import BeautifulSoup

url = 'http://bioinf.gen.tcd.ie/cgi-bin/casbah/casbah.pl'
wget.download(url, 'casbah.pl')

with open('casbah.pl') as html_file:
#there is BeautifulSoup documentation on parsers
    soup = BeautifulSoup(html_file, 'lxml')
    table = soup.find('table', border="1")
    print(table.text)
    
#This will print the whole html file. prettify method will show which tags are nested within each other to clean this up a bit
print(soup.prettify())

#this will access the title <title>The CASBAH</title> and text will access the text within the title i.e. THE CASBAH
match = soup.title.text
print(match)

#This will give the first div tag in the page and the child tags under div tag. If I add ".text" it will give the text within.
match = soup.div
print(match)

#This will print all the texts of the rows in the table
match = soup.find_all('tr')
for row in match:
    print(row.get_text())
    
d = pd.read_html('http://bioinf.gen.tcd.ie/cgi-bin/casbah/casbah.pl', index_col=0)

#This will tell me how many tables there are on the website
len(d)
#to print the table of interest, check for the index
df = d[1]
df # only the first 20 entries are printed
df.info()

#try to get the casbah table for all the pages.
with open('casbah.pl') as html_file:
#there is BeautifulSoup documentation on parsers
    soup = BeautifulSoup(html_file, 'lxml')
    table = soup.find('table', border="1")
    print(table.text) #only 20 entries are printed

### <font color = 'blue'>Problems faced:</font>
- The CASBAH table spans 20 pages, and apparently something hidden in the page's source code prevents me from printing more than the first 20 entries of the first page (I tried parsing with pandas and BS, as seen in the code above).<br><br>
- The Gene Ontology AmiGO table hasn't been downloaded because each time I run the last code in the cell above it gives me a "pending.." status. If I inspect the elements of the table, I see the word "pending..", so apparently the website protects its source code heavily (speculation).

### <font color = 'blue'>Potential Solutions:</font>

- For CASBAH, load all 777 entries and right-click to "View Page Source". This shows the whole html file, so I can download it and try to parse it on <font size = '3' color = 'red'>**April 14**</font>.<br>
- Try to download the Gene Ontology AmiGO table manually via custom download for each GO term (include Gene/product (bioentity), Gene/product (bioentity_label), Synonyms (synonym), Organism (taxon_label), Direct annotation (annotation_class_list), Source (source)). <br>
Copy the link and download with wget.
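
Instead of copying one long link per GO term, the link could be assembled from its parts. The sketch below assumes the custom-download link has the form of the GOlr link recorded in the April 14 cell of this notebook, and reuses only parameters that appear there. The facet and highlighting parameters are left out on the assumption that they do not affect the CSV rows, and this has not been checked against the server. `golr_csv_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

GOLR_SELECT = 'http://golr-aux.geneontology.io/solr/select'

# Columns requested in the custom download.
FIELDS = ['bioentity', 'bioentity_label', 'synonym',
          'taxon_label', 'annotation_class_list', 'source']

# Search fields and weights as they appear in the recorded link.
QUERY_FIELDS = [
    'bioentity^2', 'bioentity_label_searchable^2',
    'bioentity_name_searchable^1', 'bioentity_internal_id^1',
    'synonym_searchable^1', 'isa_partof_closure_label_searchable^1',
    'regulates_closure^1', 'regulates_closure_label_searchable^1',
    'panther_family_searchable^1', 'panther_family_label_searchable^1',
    'taxon_label_searchable^1',
]


def golr_csv_url(go_id, rows=100000):
    """Build a tab-separated bioentity download link for one GO term."""
    params = [
        ('defType', 'edismax'),
        ('qt', 'standard'),
        ('wt', 'csv'),
        ('rows', rows),
        ('start', 0),
        ('fl', ','.join(FIELDS)),
        ('csv.encapsulator', ''),
        ('csv.separator', '\t'),
        ('csv.header', 'false'),
        ('csv.mv.separator', '|'),
        ('fq', 'document_category:"bioentity"'),
        ('q', go_id),
    ]
    params += [('qf', field) for field in QUERY_FIELDS]
    return GOLR_SELECT + '?' + urlencode(params)


url = golr_csv_url('GO:0006914')  # autophagy
print(url)
```

The resulting string can be passed to `wget.download` in the same way as the hand-copied link.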

### April 11, 2020
### <font color = 'blue'>Tasks:</font>
- Parse [BCL2 database annotation tables](https://bcl2db.lyon.inserm.fr/BCL2DB/BCL2DBCellular) to get the tables as csv. 

In [237]:
#Download the database
url = 'https://bcl2db.lyon.inserm.fr/BCL2DB/BCL2DBCellular'
wget.download(url, 'BCL2DBCellular')

'BCL2DBCellular'

In [875]:
#Open the file because I want to get information on how many tables it has and their indexes
with open('BCL2DBCellular') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    
    #the class attribute is always represented with '_' because "class" is a reserved word in python
    table = soup.find_all('table', class_="tnomenclature")
    
#to check how many tables are there in the site
len(table) #4

#check the index of my table of interest. That's important because there are 4 tables
table[0] #this will give a long table

4

In [240]:
#TRYING SOMETHING. WON'T BE USED FOR THE FINAL CODE BUT GOOD INFO

#to print the rows in BCL2 table: Specify its index
for row in table[0].find_all('tr'):
    #for each row we need to print each td element
    for cell in row.find_all('td'):
        print(cell.text)

BCL-2
Bcl-2
-
inhibitor
P10415
Tsujimoto et al.,1984
BCL2L1
Bcl-xL
Bcl2l1
inhibitor
Q07817
Boise et al., 1993
BCL2L2
Bcl-w
Bcl2l2
inhibitor
Q92843
Gibson et al., 1996
MCL-1
Mcl-1
Bcl2l3
inhibitor
Q07820
Kozopas et al., 1993
BCL2L10
Bcl2l10
Bcl-B, Nrh, Nr-13, Diva, Boo
inhibitor
Q9HD36
Aouacheria et al., 2001
BCL2A1
Bfl1
Bcl2a1, Bcl2l5
inhibitor
Q16548
Lin et al., 1993


In [243]:
#TRYING SOMETHING. WON'T BE USED FOR THE FINAL CODE BUT GOOD INFO

#creating new text file to save all the contents
with open ('BCL2_table'+".csv", 'w') as out:
    for row in table[0].find_all('tr'):
        for cell in row.find_all('td'):
            out.write(cell.text)
        #to have each row on a new line
        out.write('\n')

In [244]:
#RELY ON THIS CONTINUATION OF THE CODE.

#I need the BS library and the code that reads the file (three cells above)
#in order to have each cell separated from the others (because they are all squeezed together at this step)


list_head = []
list_rows= []

#Create a csv
with open ('BCL2_table'+ ".csv", 'w') as out:
    for row in table[0].find_all('tr'):
        for head in row.find_all('th'):
            #append the headers to a list and join with commas (easier to parse)
            list_head.append(head.text)
            header = ",".join(list_head)
            #append the cells in each row to a list
        for cell in row.find_all('td'):
            #since the synonyms are separated by commas, replace the commas with "|" because if I specify the delimiter later on as a comma it will be problematic.
            list_rows.append(cell.text.replace(',', '|'))
    print(header, file = out)
    #This code will create a list of tuples whereby each tuple has 6 cells (i.e. 1 row)
    tuples = list(zip(*[iter(list_rows)]*6))
    
    #iterate over the tuple list in order to join the cells with commas (easier to parse)
    for tup in tuples:
        print(','.join(tup), file = out)

    
#output BCL2_table.csv in BCL2_database directory
        

    
            



In [247]:
table[1].text #that's the second table in the page

'Gene nameProtein nameSynonymsPrimary functionBCL2DB acc.ReferenceBAXBaxBcl2l4promoterQ07812Oltvai et al., 1993BAK1Bak1Bcl2l7promoterQ16611Chittenden et al., 1995BOKBokBcl2l9, MtdpromoterQ9UMX3Inohara et al., 1998BCL-WAVBcl-WAV-promoterD2Y5Q2Prudent et al., 2013'

In [270]:
#Printing table BAX
list_head = []
list_rows= []

#Create a csv
with open ('BAX_table'+ ".csv", 'w') as out:
    for row in table[1].find_all('tr'):
        for head in row.find_all('th'):
            #append the headers to a list and join with commas (easier to parse)
            list_head.append(head.text)
            header = ",".join(list_head)
            #append the cells in each row to a list
        for cell in row.find_all('td'):
            #since the synonyms are separated by commas, replace the commas with "|" because if I specify the delimiter later on as a comma it will be problematic.
            list_rows.append(cell.text.replace(',', '|'))
    print(header, file = out)
    #This code will create a list of tuples whereby each tuple has 6 cells (i.e. 1 row)
    tuples = list(zip(*[iter(list_rows)]*6))
    
    #iterate over the tuple list in order to join the cells with commas (easier to parse)
    for tup in tuples:
        print(','.join(tup), file = out)

In [271]:
#Printing table BID-like
list_head = []
list_rows= []

#Create a csv
with open ('BID_table'+ ".csv", 'w') as out:
    for row in table[2].find_all('tr'):
        for head in row.find_all('th'):
            #append the headers to a list and join with commas (easier to parse)
            list_head.append(head.text)
            header = ",".join(list_head)
            #append the cells in each row to a list
        for cell in row.find_all('td'):
            #since the synonyms are separated by commas, replace the commas with "|" because if I specify the delimiter later on as a comma it will be problematic.
            list_rows.append(cell.text.replace(',', '|'))
    print(header, file = out)
    #This code will create a list of tuples whereby each tuple has 6 cells (i.e. 1 row)
    tuples = list(zip(*[iter(list_rows)]*6))
    
    #iterate over the tuple list in order to join the cells with commas (easier to parse)
    for tup in tuples:
        print(','.join(tup), file = out)



In [273]:
#Printing other cellular homologs table

list_head = []
list_rows= []

#Create a csv
with open ('otherCellularHomologs_table'+ ".csv", 'w') as out:
    for row in table[3].find_all('tr'):
        for head in row.find_all('th'):
            #append the headers to a list and join with commas (easier to parse)
            list_head.append(head.text)
            header = ",".join(list_head)
            #append the cells in each row to a list
        for cell in row.find_all('td'):
            #since the synonyms are separated by commas, replace the commas with "|" because if I specify the delimiter later on as a comma it will be problematic.
            list_rows.append(cell.text.replace(',', '|'))
    print(header, file = out)
    #This code will create a list of tuples whereby each tuple has 6 cells (i.e. 1 row)
    tuples = list(zip(*[iter(list_rows)]*6))
    
    #iterate over the tuple list in order to join the cells with commas (easier to parse)
    for tup in tuples:
        print(','.join(tup), file = out)
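
The four table cells above are identical apart from the table index and the output name, so the grouping and writing steps could live in one helper. The sketch below is a variation on the cells above rather than the same method: it uses the standard `csv` module, which quotes fields containing commas, so the synonyms do not have to be rewritten with "|". `rows_from_cells` and `write_table` are hypothetical helper names; the sample row is taken from the BCL-2 table printed earlier.

```python
import csv


def rows_from_cells(cells, width):
    """Group a flat list of cell texts into rows of `width` cells."""
    if len(cells) % width != 0:
        raise ValueError('number of cells is not a multiple of the row width')
    return [cells[i:i + width] for i in range(0, len(cells), width)]


def write_table(path, header, cells):
    """Write header and grouped cells to a CSV file; return the row count."""
    rows = rows_from_cells(cells, len(header))
    with open(path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


HEADER = ['Gene name', 'Protein name', 'Synonyms',
          'Primary function', 'BCL2DB acc.', 'Reference']
CELLS = ['BCL2L10', 'Bcl2l10', 'Bcl-B, Nrh, Nr-13, Diva, Boo',
         'inhibitor', 'Q9HD36', 'Aouacheria et al., 2001']

print(rows_from_cells(CELLS, len(HEADER)))

# With BeautifulSoup tables as in the cells above, the inputs would be e.g.:
# header = [th.text for th in table[1].find_all('th')]
# cells = [td.text for td in table[1].find_all('td')]
# write_table('BAX_table.csv', header, cells)
```

Reading the file back with `csv.reader` or `pd.read_csv` restores the commas inside the synonym and reference fields.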

### April 13, 2020
### <font color = 'blue'>Tasks:</font>
- Download the genes associated with lysosomes in [The Human Protein Atlas](https://www.proteinatlas.org/about/download) and learn how to parse xmls.<br> 
- Type "lysosome" in the search box to get the list needed to be downloaded.
- Download xml and TSV formats.<br>
- Download and try to parse [The Hela Spatial Proteome](http://mapofthecell.biochem.mpg.de/index.html) and [The Human Lysosome Gene Database](http://lysosome.unipg.it/index.php#results)

In [291]:
#Download The Human Protein Atlas database (xml)
url = 'https://www.proteinatlas.org/search/Lysosome?format=xml'
wget.download(url, 'proteinAtlasLysosome.xml')

'proteinAtlasLysosome.xml'

In [292]:
#Download The Human Protein Atlas database (tsv)
url = 'https://www.proteinatlas.org/search/lysosome?format=tsv'
wget.download(url, 'proteinAtlasLysosome.tsv')

'proteinAtlasLysosome.tsv'

In [308]:
#Downloading The Human Lysosome Gene Database
url = 'http://lysosome.unipg.it/index.php#results'
wget.download(url, 'unipgLysosomesList')

'unipgLysosomesList'

In [315]:
#Trying to parse The Human Lysosome Gene Database
with open('unipgLysosomesList') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    #inspect the website
    table = soup.find_all('table')
print(table)
#we can notice that the table that has the genes is not printed.


[<table border="0" cellpadding="0" cellspacing="0" id="indextable">
<tr>
<td id="index_col1"><h2>Filter genes</h2>
<div>View only genes reported lysosomal by 
	<select class="selectbox" name="q_papers_logical">
<option selected="true" value="any">any (union)</option>
<option value="all">all (intersection)</option></select> of the following sources:</div><br/><fieldset class="foldable foldable_closed">
<legend><a onclick="toggleFoldable(this.parentNode);">Proteomics Studies (view)</a> - <span id="studies_sel"></span> (<a class="linklike" onclick="checkbox_select(this.parentNode.parentNode,true);">select all</a> | <a class="linklike" onclick="checkbox_select(this.parentNode.parentNode,false);">none</a>)</legend>
<div id="studies_cnt">
<p class="checkoption"><input checked="checked" class="checkbox" name="q_papers[]" onclick="checkedCounts();" type="checkbox" value="19556463"/> A gene network regulating lysosomal biogenesis and function.<br/>
<span class="reduced">Sardiello M, Science 200

In [363]:
#Downloading The HeLa Spatial Proteome database (will be removed)
url = 'http://mapofthecell.biochem.mpg.de/index.html'
wget.download(url, 'HeLaProteome_lyso.xlsx')

'HeLaProteome_lyso.xlsx'

In [317]:
#Trying to parse The HeLa Spatial Proteome database
with open('HeLaProteome_lyso.xlsx') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    #inspect the website
    div = soup.find_all('div')
print(div)
#No info about the genes is provided.

[<div id="main">
<div id="header">
<div id="banner">
<div id="welcome">
<h1>The HeLa Spatial Proteome</h1>
</div><!--close welcome-->
<div id="menubar">
<ul id="menu">
<li><a href="about.html">About</a></li>
<li class="current"><a href="index.html">Map your protein</a></li>
<li><a href="howto.html">How to</a></li>
</ul>
</div><!--close menubar-->
</div><!--close banner-->
</div><!--close header-->
<div id="site_content">
<div class="leftside">
<div class="selectionformarea">
<h3>How to use this website</h3>
<div class="DescriptionT">
<p align="justify">
				1.To highlight a selection on the map, enter gene names in the Protein mapping field and press the Select button. 
				</p>
<p align="justify">
				2. Detailed quantitative information is displayed for the first entry or upon clicking on a dot in the plot.
				</p>
<p align="justify">
				3. Click on the different maps to check the localization of your selection.
				</p>
<p align="justify">
				4. To hide a specific cluster, click 

### <font color = 'blue'>Problems faced:</font>
- Couldn't get any info about the genes by parsing the htmls with BeautifulSoup from The HeLa Spatial Proteome and The Human Lysosome Gene Database.<br><br>
- I contacted Sonja and informed her about the problems faced so far, and I'm currently waiting for a reply.

### <font color = 'blue'>Potential Solutions:</font>

- Download the HeLa Spatial Proteome directly via [the download link](http://mapofthecell.biochem.mpg.de/HeLa_Subcell_Localization_Summary.xlsx) and **NOTICE THAT** the Excel sheet has 3 tabs. We need the Organellar marker tab for lysosomal markers and the Compact HeLa Spatial Proteome tab for proteins that are predicted to be in lysosomes.<br>
- Do "View Page Source" for The Human Lysosome Gene Database.

In [364]:
url = 'http://mapofthecell.biochem.mpg.de/HeLa_Subcell_Localization_Summary.xlsx'
wget.download(url, 'Hela_Subcell_localization.xlsx')

'Hela_Subcell_localization.xlsx'

In [341]:
#parsing xml file trial
import os
import xml.etree.ElementTree as ET
file_name = 'proteinAtlasLysosome.xml'
full_file = os.path.abspath(os.path.join(file_name))
tree = ET.parse(full_file)
#in case I want to have the root element of my file
root = tree.getroot()
#<Element 'proteinAtlas' at 0x11dd20fb0>

In [343]:
#in the parent (entry) find all children (name): these are the protein names
names = tree.findall('entry/name')
for name in names:
    print(name.text)

SNAPIN
VPS33B
RAB7A
SCARB2
VPS11
VPS18
VPS35
BLOC1S2
TPP1
LAMP1
LAMP2
LRRK2
M6PR
SYT7
VPS16
VPS33A
VPS39
TMEM106B
HPS1
LAMTOR1
PLEKHM2
SNX16
VPS41
HGS
AP3M1
BLOC1S1
GAA
HPS4
HPS6
KPTN
NAGPA
SORT1
TPCN2
VAMP7
C12orf66
ACP2
ADRB2
AKTIP
AP4M1
ARL8B
ARSB
BECN1
BORCS5
BORCS6
C9orf72
CHMP5
CLN3
DNASE2
DTX3L
FAM160A2
GNPTAB
HEXB
HOOK1
HOOK2
HOOK3
KXD1
LAMTOR4
LAMTOR5
LAPTM4B
LARS
LDLR
LIPA
LYST
MANBA
MFSD8
NAGLU
NEU1
PLEKHM1
PPT1
RAB12
RAB34
RILP
SIAE
SYT11
SZT2
UBXN6
VPS4A
ZNRF1
CLTC
SNX1
STX7
AP1G1
AP3B1
AP3D1
ARSD
ARSG
BLOC1S4
CD1B
CD1D
CD1E
CD68
CHMP2B
CLEC16A
CLN5
CLN6
CLVS2
CTNS
CTSB
CTSK
CTSL
CTSS
EPDR1
GNPTG
GPRASP1
GUSB
HLA-DOA
KIF13A
LAMP3
LAMTOR2
LAPTM5
LGMN
MCOLN1
MYO7A
NCOA4
NPC1
PCSK9
RAB39A
SLC17A5
SPG11
SPPL2B
SQSTM1
TECPR1
TLR7
TLR9
TMEM175
TRIM23
UNC93B1
FUCA1
GRN
RPTOR
SNX2
SNX6
TMEM192
VPS26A
VPS36
ASAH1
DRAM2
ABCA1
ABCA2
ABCA5
ABCB9
AC001226.2
ACE
ACP5
ACPP
ADA
AGA
AKR1B10
ANK2
ANK3
ANKFY1
ANKRD27
AP2M1
AP5M1
AP5S1
ARL8A
ARRDC3
ARSA
ASS1
ATG14
ATP13A2
BBC3
BCL10
BECN2
BIN

In [351]:
synonyms = tree.findall('entry/synonym')
for synonym in synonyms:
    print(synonym.text)
    

BLOC1S7
BORCS3
SNAPAP
FLJ14848
RAB7
CD36L2
HLGP85
LIMP-2
LIMPII
SR-BII
PEP5
RNF108
KIAA1475
PEP3
FLJ10752
MEM3
PARK17
BLOS2
BORCS2
FLJ30135
MGC10120
CLN2
SCAR7
CD107a
CD107b
DKFZp434H2111
FLJ45829
PARK8
RIPK7
ROCO2
CD-M6PR
CD-MPR
IPCA-7
MGC150517
PCANAP7
SYT-VII
KIAA0770
VAM6
FLJ11273
MGC33727
BLOC3S1
HPS
C11orf59
FLJ20625
p18
p27RF-Rho
Pdro
Ragulator1
KIAA0842
HVSP41
Hrs
Vps27
ZFYVE8
BLOS1
BORCS1
GCN5L1
BLOC3S2
KIAA1667
LE
BLOC2S3
FLJ22501
2E4
APAA
UCE
Gp95
NT3
TPC2
SYBL1
TI-VAMP
VAMP-7
FLJ32549
LAP
ADRB2R
ADRBR
B2AR
BAR
FLJ13258
FTS
MU-4
MU-ARP2
SPG50
ARL10C
FLJ10702
Gie1
ATG6
VPS30
LOH12CR1
LOH1CR12
C17orf59
FLJ20014
DENNL72
MGC23980
C9orf83
CGI-34
HSPC177
SNF7DC2
Vps60
BTN1
BTS
JNCL
DNL
DNL2
BBAP
RNF143
C11orf56
DKFZP566M1046
FHIP
FLJ22665
KIAA1759
GNPTA
KIAA1208
MGC4170
HK1
HK2
HK3
BORCS4
C19orf50
FLJ25480
KXDL
MGC2749
C7orf59
HBXIP
MGC71071
XIP
LC27
FLJ10595
FLJ21788
HSPC192
LARS1
LEUS
RNTLS
LDLCQ2
CESD
LAL
CHS
CHS1
CLN7
MGC33302
NAG
NEU
KIAA0356
CLN1
INCL
PPT
NARR
RAB39
RAH
FLJ3

In [352]:
#I thought that the best way to parse xml is to save the contents into a dictionary.
#There could be better ways.
myDict = {}
with open('proteinAtlasLysosome.xml', 'r') as prot:
    for line in prot:
        line = line.strip()
        if line.startswith('<name>'):
            pre_name = line.split('>')[1].strip()
            name = pre_name.split('<')[0].strip()
            myDict[name] = {}
        elif line.startswith('<synonym>'):
            #the actual synonym will be my key and its value will be the string 'synonym'
            pre_synonym = line.split('>')[1].strip()
            synonym = pre_synonym.split('<')[0].strip()
            myDict[name][synonym] = 'synonym'
        elif line.startswith('<xref'):
            pre_id = line.split('"')[1].strip()
            myDict[name]['id'] = pre_id
    print(myDict)
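
The line-by-line parsing above only works when every tag starts its own line. A minimal sketch of the same idea with `xml.etree.ElementTree` follows. It assumes each `<entry>` holds one `<name>`, any number of `<synonym>` elements and an `<xref>` with an `id` attribute, which is the layout the code above implies. The sample XML, the function name `entries_to_dict` and all gene names and ids in it are placeholders, not real Protein Atlas data.

```python
import xml.etree.ElementTree as ET

# Placeholder document: element names follow the cells above,
# gene names, synonyms and ids are made up for illustration.
SAMPLE_XML = """
<proteinAtlas>
  <entry>
    <name>GENE_A</name>
    <synonym>SYN_A1</synonym>
    <synonym>SYN_A2</synonym>
    <xref id="ID_A" db="SomeDb"/>
  </entry>
  <entry>
    <name>GENE_B</name>
    <xref id="ID_B" db="SomeDb"/>
  </entry>
</proteinAtlas>
"""

def entries_to_dict(root):
    """Map each entry name to its synonyms and the id of its first xref."""
    result = {}
    for entry in root.findall('entry'):
        name = entry.findtext('name')
        if name is None:
            continue  # entry without a name: nothing to key on
        xref = entry.find('xref')
        result[name.strip()] = {
            'synonyms': [s.text.strip() for s in entry.findall('synonym') if s.text],
            'id': xref.get('id') if xref is not None else None,
        }
    return result

root = ET.fromstring(SAMPLE_XML.strip())
genes = entries_to_dict(root)
print(genes)
```

Unlike `myDict`, this keeps the synonyms in a list under one key and takes the first `<xref>` of an entry rather than the last one. For the real file, `ET.parse('proteinAtlasLysosome.xml').getroot()` would replace `ET.fromstring(...)`.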
            

<font size = 3>**Sonja said that [this link](https://www.cell.com/cms/10.1016/j.celrep.2017.08.063/attachment/235fbebe-76e4-48f9-a1a0-aa8c49d68426/mmc2.xlsx) contains more information about the genes available from the HeLa spatial proteome.**</font> 
<br>
The download wasn't working with Python, so I will download it manually and split the tables into different csv files.

### April 14, 2020
### <font color = 'blue'>Tasks:</font>
- Download the AmiGO GO files for each of the 3 GO terms.
- Try to parse the CASBAH database and the Human Lysosome Gene database.

In [None]:
#Start by downloading Amigo Autophagy GO:0006914
url = 'http://golr-aux.geneontology.io/solr/select?defType=edismax&qt=standard&indent=on&wt=csv&rows=100000&start=0&fl=bioentity,bioentity_label,synonym,taxon_label,annotation_class_list,source&facet=true&facet.mincount=1&facet.sort=count&json.nl=arrarr&facet.limit=25&hl=true&hl.simple.pre=%3Cem%20class=%22hilite%22%3E&hl.snippets=1000&csv.encapsulator=&csv.separator=%09&csv.header=false&csv.mv.separator=%7C&fq=document_category:%22bioentity%22&facet.field=source&facet.field=taxon_subset_closure_label&facet.field=type&facet.field=panther_family_label&facet.field=annotation_class_list_label&facet.field=regulates_closure_label&q=GO:0006914&qf=bioentity%5E2&qf=bioentity_label_searchable%5E2&qf=bioentity_name_searchable%5E1&qf=bioentity_internal_id%5E1&qf=synonym_searchable%5E1&qf=isa_partof_closure_label_searchable%5E1&qf=regulates_closure%5E1&qf=regulates_closure_label_searchable%5E1&qf=panther_family_searchable%5E1&qf=panther_family_label_searchable%5E1&qf=taxon_label_searchable%5E1'
wget.download(url, 'AmiGo_Autophagy_geneproduct')

In [356]:
#Download Amigo lysosome GO:0005764
url = 'http://golr-aux.geneontology.io/solr/select?defType=edismax&qt=standard&indent=on&wt=csv&rows=100000&start=0&fl=bioentity,bioentity_label,synonym,taxon_label,annotation_class_list,source&facet=true&facet.mincount=1&facet.sort=count&json.nl=arrarr&facet.limit=25&hl=true&hl.simple.pre=%3Cem%20class=%22hilite%22%3E&hl.snippets=1000&csv.encapsulator=&csv.separator=%09&csv.header=false&csv.mv.separator=%7C&fq=document_category:%22bioentity%22&facet.field=source&facet.field=taxon_subset_closure_label&facet.field=type&facet.field=panther_family_label&facet.field=annotation_class_list_label&facet.field=regulates_closure_label&q=GO:0005764&qf=bioentity%5E2&qf=bioentity_label_searchable%5E2&qf=bioentity_name_searchable%5E1&qf=bioentity_internal_id%5E1&qf=synonym_searchable%5E1&qf=isa_partof_closure_label_searchable%5E1&qf=regulates_closure%5E1&qf=regulates_closure_label_searchable%5E1&qf=panther_family_searchable%5E1&qf=panther_family_label_searchable%5E1&qf=taxon_label_searchable%5E1'
wget.download(url, 'AmiGo_lysosome_geneproduct')

'AmiGo_lysosome_geneproduct'

In [365]:
#Download Amigo cellDeath GO:0008219
url = 'http://golr-aux.geneontology.io/solr/select?defType=edismax&qt=standard&indent=on&wt=csv&rows=100000&start=0&fl=bioentity,bioentity_label,synonym,taxon_label,annotation_class_list,source&facet=true&facet.mincount=1&facet.sort=count&json.nl=arrarr&facet.limit=25&hl=true&hl.simple.pre=%3Cem%20class=%22hilite%22%3E&hl.snippets=1000&csv.encapsulator=&csv.separator=%09&csv.header=false&csv.mv.separator=%7C&fq=document_category:%22bioentity%22&facet.field=source&facet.field=taxon_subset_closure_label&facet.field=type&facet.field=panther_family_label&facet.field=annotation_class_list_label&facet.field=regulates_closure_label&q=GO:0008219&qf=bioentity%5E2&qf=bioentity_label_searchable%5E2&qf=bioentity_name_searchable%5E1&qf=bioentity_internal_id%5E1&qf=synonym_searchable%5E1&qf=isa_partof_closure_label_searchable%5E1&qf=regulates_closure%5E1&qf=regulates_closure_label_searchable%5E1&qf=panther_family_searchable%5E1&qf=panther_family_label_searchable%5E1&qf=taxon_label_searchable%5E1'
wget.download(url, 'AmiGo_cellDeath_geneproduct')

'AmiGo_cellDeath_geneproduct'
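
The three download URLs ask for tab-separated output (`csv.separator=%09`) without a header (`csv.header=false`), with `|` between multiple values (`csv.mv.separator=%7C`) and with the columns listed in `fl=`. A minimal sketch of reading such a file into dictionaries follows. The sample lines are placeholders, and treating `synonym` and `annotation_class_list` as the multi-valued columns is an assumption.

```python
import csv
import io

# Column order taken from the fl= parameter of the download URLs.
COLUMNS = ['bioentity', 'bioentity_label', 'synonym',
           'taxon_label', 'annotation_class_list', 'source']
# Assumption: these are the columns that can hold several '|'-separated values.
MULTI_VALUE = ('synonym', 'annotation_class_list')

def read_amigo(handle):
    """Return one dict per line of a header-less, tab-separated export."""
    records = []
    reader = csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(COLUMNS, fields))
        for key in MULTI_VALUE:
            record[key] = [value for value in record[key].split('|') if value]
        records.append(record)
    return records

# Placeholder lines in the expected layout (not real AmiGO output).
SAMPLE = (
    "DB:ID_A\tGENE_A\tSYN_A1|SYN_A2\tHomo sapiens\tGO:0006914|GO:0005764\tDB\n"
    "DB:ID_B\tGENE_B\t\tHomo sapiens\tGO:0006914\tDB\n"
)
records = read_amigo(io.StringIO(SAMPLE))
for record in records:
    print(record['bioentity_label'], record['synonym'])
```

For a downloaded file, `open('AmiGo_Autophagy_geneproduct', newline='')` would take the place of the `io.StringIO` object.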

In [3]:
#Try parsing CASBAH 
with open('The_CASBAH.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    table = soup.find_all('table')
    
#print(table)
len(table) #18 SO CHECK THE INDEX OF EACH TABLE OF INTEREST

#table[1] will give me 50 entries (i.e. 1-50 until BAD)
#table[2] will give me 50 entries (i.e. 51 - 100) ...etc

18

**I tried to loop through all the tables with Beautiful Soup but I was getting an error, so I parsed them one by one.**

In [7]:
list_head = []
list_rows= []

#Create a csv
with open ('CASBAH_table'+ ".csv", 'w') as out:
    for row in table[1].find_all('tr'):
        for head in row.find_all('th'):
            #This step was done because some lines before the headers started with "#" and they were hard to get rid of.
            if head.text.startswith('Name') or head.text.startswith('Uni Prot') or head.text.startswith('Synonyms') or head.text.startswith('Consequences') or head.text.startswith('PubMed') or head.text.startswith('Site(s)'):#print(head.text) 
            #append the headers to a list and join with commas (easier to parse)
                list_head.append(head.text)#print(list_head)
    #There is an extra column of empty cells at the end of the table (empty cells cause problems); they are replaced with 'unknown' below and dropped before writing.
    header = ",".join(list_head)
                
    #append the cells in the table to a list
    for cell in table[1].find_all('td'):
        #I don't want the first cell which is a number
        if not cell.text.isdigit():#print(cell.text)
            list_rows.append(cell.text.replace(',', '|').replace('\xa0','unknown').replace('&nbsp;', 'x')) 
            #to remove empty strings from a list
            list_rows = list(filter(None,list_rows))
    #print(list_rows)
    print(header, file = out)
    
    #This code will create a list of tuples whereby each tuple has 7 cells (i.e. the 6 columns of 1 row plus the trailing 'unknown')
    tuples = list(zip(*[iter(list_rows)]*7))
    #print(tuples)
    #iterate over the tuple list in order to join the cells with commas (easier to parse)
    for tup in tuples:
        tup = tup[:-1]
        print(','.join(tup), file = out)
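
The `zip(*[iter(list_rows)]*7)` idiom used above is compact but easy to misread, so here is a small worked example of how it groups a flat list. The helper names `group_cells` and `leftover` are only for illustration. The point to notice is that cells which do not fill a complete group are dropped without any warning.

```python
def group_cells(cells, size):
    """Split a flat list into tuples of `size` items, like zip(*[iter(cells)]*size)."""
    iterator = iter(cells)
    # The same iterator object is repeated `size` times, so zip pulls
    # `size` consecutive items from it for every tuple it builds.
    return list(zip(*[iterator] * size))

def leftover(cells, size):
    """Return the trailing cells that do not fill a complete group."""
    return cells[len(cells) - len(cells) % size:]

cells = ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1']
rows = group_cells(cells, 3)
print(rows)                # [('a1', 'a2', 'a3'), ('b1', 'b2', 'b3')]
print(leftover(cells, 3))  # ['c1']
```

Checking that `leftover(list_rows, 7)` is empty before writing a table is a cheap way to notice that a table does not have the expected number of cells per row.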

In [8]:
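#For the rest of the tables the headers are not needed because they were already written for the first table; all files will be concatenated at the end.
#Note that this cell still writes a header line for table[2] (see print(header, ...) below), which has to be skipped when concatenating.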

list_rows= []
list_head = []

#Create a csv
with open ('CASBAH_table2'+ ".csv", 'w') as out:
    for row in table[2].find_all('tr'):
        for head in row.find_all('th'):
            if head.text.startswith('Name') or head.text.startswith('Uni Prot') or head.text.startswith('Synonyms') or head.text.startswith('Consequences') or head.text.startswith('PubMed') or head.text.startswith('Site(s)'):#print(head.text)
                list_head.append(head.text)
    header = ",".join(list_head)
    print(header, file = out)
    for cell in table[2].find_all('td'):
        if not cell.text.isdigit():
            list_rows.append(cell.text.replace(',', '|').replace('\xa0','unknown'))
    
    #This code will create a list of tuples whereby each tuple has 7 cells (i.e. the 6 columns of 1 row plus the trailing 'unknown')
    tuples1 = list(zip(*[iter(list_rows)]*7))
    #I took the first 13 tuples separately because there's a problem in this table:
    #the first 13 rows and the last 21 rows had one empty column at the end while the others had 4.
    first13_tuples = tuples1[:13]
    for tup in first13_tuples:
        #remove the last "unknown" which I used to replace '\xa0'
        tup = tup[:-1]
        print(','.join(tup), file = out)
        
    #get the middle elements in the list and create another tuple: I counted the index of each element in list_rows and wrote the index accordingly.
    tuples2 = list(zip(*[iter(list_rows[91:283])]*12))
    for tup in tuples2:
        #remove the last 6 unknowns (this part of the table had more unknowns than the other tables)
        tup = tup[:-6]
        print(','.join(tup), file = out)
        
    #create another list of tuples that contains the last 147 elements of list_rows. These are the last 21 rows in the table.
    tuples3 = list(zip(*[iter(list_rows[-147:])]*7)) 
    for tup in tuples3:
        tup = tup[:-1]
        print(','. join(tup), file = out)
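
The hard-coded indices above are needed because the cells of the whole table are flattened into one list before they are grouped. A possible alternative is to collect the cells row by row, so that a row with more trailing empty cells cannot shift the rows after it. The sketch below uses the standard-library `html.parser` instead of Beautiful Soup so that it runs without extra packages. The class `TableRows`, the function `clean_rows` and the sample table are hypothetical and do not reproduce the real CASBAH markup.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, one list per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read
        self._cell = None   # text pieces of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # strip() also removes the non-breaking space of empty cells
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def clean_rows(rows, n_columns):
    """Drop the running number, keep n_columns cells, fill empty cells."""
    cleaned = []
    for row in rows:
        if row and row[0].isdigit():
            row = row[1:]                            # first cell is a number
        row = row[:n_columns]                        # ignore trailing filler cells
        row = row + [''] * (n_columns - len(row))    # pad short rows
        cleaned.append([cell.replace(',', '|') or 'unknown' for cell in row])
    return cleaned

# Placeholder table: the second data row has more trailing empty cells.
SAMPLE_HTML = """
<table>
<tr><th>Name</th><th>Synonyms</th><th></th></tr>
<tr><td>1</td><td>PROT_A</td><td>SYN_A1, SYN_A2</td><td>&nbsp;</td></tr>
<tr><td>2</td><td>PROT_B</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
</table>
"""
parser = TableRows()
parser.feed(SAMPLE_HTML)
cleaned = clean_rows(parser.rows, 2)
for row in cleaned:
    print(','.join(row))
```

The sketch does not handle nested tables. With Beautiful Soup the same row-wise idea would be a loop over `table[2].find_all('tr')` with an inner `row.find_all('td')`.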
        
   

In [9]:
list_rows= []

#Create a csv
with open ('CASBAH_table3'+ ".csv", 'w') as out:
    for cell in table[3].find_all('td'):
        #print(cell.text)
        if not cell.text.isdigit():#print(cell.text)
            list_rows.append(cell.text.replace(',', '|').replace('\xa0','unknown').replace('&nbsp;', 'x')) 
            #to remove empty strings from a list
            list_rows = list(filter(None,list_rows))
    #print(list_rows)
    
    
    #This code will create a list of tuples whereby each tuple has 7 cells (i.e. the 6 columns of 1 row plus the trailing 'unknown')
    tuples = list(zip(*[iter(list_rows)]*7))
    #print(tuples)
    #iterate over the tuple list in order to join the cells with commas (easier to parse)
    for tup in tuples:
        tup = tup[:-1]
        print(','.join(tup), file = out)

In [10]:
#Tables 4 to 16 have the same regular layout, so the same steps are applied to each of them.
for i in range(4, 17):
    #Start from an empty list for every table; otherwise the rows of the earlier tables
    #would be written again into each later csv.
    list_rows = []

    with open('CASBAH_table' + str(i) + ".csv", 'w') as out:
        for cell in table[i].find_all('td'):
            #I don't want the first cell which is a number
            if not cell.text.isdigit():
                list_rows.append(cell.text.replace(',', '|').replace('\xa0','unknown').replace('&nbsp;', 'x'))
        #to remove empty strings from the list
        list_rows = list(filter(None, list_rows))

        #This code will create a list of tuples whereby each tuple has 7 cells (i.e. the 6 columns of 1 row plus the trailing 'unknown')
        tuples = list(zip(*[iter(list_rows)]*7))
        #iterate over the tuple list in order to join the cells with commas (easier to parse)
        for tup in tuples:
            #remove the trailing 'unknown'
            tup = tup[:-1]
            print(','.join(tup), file = out)
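
Since the per-table csv files are meant to be concatenated at the end, here is a minimal sketch of that step. The function name `concatenate_csv` and the demo file names are hypothetical. It assumes the first non-empty line of the first file is the header, and it skips blank lines and any later line identical to that header.

```python
import os
import tempfile

def concatenate_csv(paths, out_path):
    """Append the files in `paths` into `out_path`, keeping the header once."""
    header = None
    with open(out_path, 'w') as out:
        for path in paths:
            with open(path) as part:
                for line in part:
                    line = line.rstrip('\n')
                    if not line:
                        continue            # blank line
                    if header is None:
                        header = line       # first line of the first file
                    elif line == header:
                        continue            # repeated header in a later file
                    print(line, file=out)

# Demo with three small placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    parts = {
        'part1.csv': 'Name,Synonyms\nPROT_A,SYN_A\n',
        'part2.csv': 'Name,Synonyms\n\nPROT_B,unknown\n',
        'part3.csv': 'PROT_C,SYN_C\n',
    }
    paths = []
    for filename, content in parts.items():
        path = os.path.join(folder, filename)
        with open(path, 'w') as handle:
            handle.write(content)
        paths.append(path)
    merged_path = os.path.join(folder, 'merged.csv')
    concatenate_csv(paths, merged_path)
    with open(merged_path) as handle:
        merged_lines = handle.read().splitlines()

print(merged_lines)
```

In this notebook the list of paths would be `'CASBAH_table.csv'` followed by `'CASBAH_table2.csv'` up to `'CASBAH_table16.csv'`.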

**I saved the source page of the HLG database in my working directory.**

In [872]:
with open('TheHumanLysosomeGene.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    table = soup.find_all('table')
len(table)

2

In [654]:
#Trying to get the links of genes in Human Lysosome Gene Database. But apparently it's not possible with this code.
#I am getting only the pubmed links.
for link in soup.find_all('a'):
    print(link.get('href'))
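
In the saved page the Symbol cells link to relative addresses of the form `gene.php?id=20`, while the source cells link to PubMed (both are visible in the `table[1]` output of this notebook). A minimal sketch of separating the two kinds of links follows. It uses the standard-library `html.parser` so that it runs on its own, the class name `LinkCollector` is hypothetical, and the snippet is a shortened piece of that output. Whether the gene pages actually contain the UniProt ID has not been checked here.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> element that has an href."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

# Shortened piece of the table markup shown in this notebook.
SNIPPET = (
    '<tr><td><a href="gene.php?id=20">ABCA2</a></td>'
    '<td><a style="display: none;">006</a>'
    '<a href="http://www.ncbi.nlm.nih.gov/pubmed/20957757" target="_blank">'
    'Schröder BA 2010</a></td></tr>'
)
collector = LinkCollector()
collector.feed(SNIPPET)

# Keep only the relative links that point to a gene page.
gene_links = [(href, text) for href, text in collector.links
              if href.startswith('gene.php')]
print(gene_links)
```

With Beautiful Soup the same filter can be written as a check on `link.get('href')` inside the loop above.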

In [656]:
table[1]

<table cellspacing="0" class="datatable" id="resultlist">
<thead>
<tr><th class="nosort"></th><th>Symbol</th><th>Name</th><th>Nb of sources (among selected)</th>
</tr>
</thead>
<tbody>
<tr><td><input name="q_genes[]" type="checkbox" value="20"/></td><td><a href="gene.php?id=20">ABCA2</a></td><td>ATP-binding cassette, sub-family A (ABC1), member 2</td><td><a style="display: none;">006</a><a href="http://www.ncbi.nlm.nih.gov/pubmed/20957757" target="_blank" title="The proteome of lysosomes.">Schröder BA 2010</a>, <a href="http://www.ncbi.nlm.nih.gov/pubmed/19556463" target="_blank" title="A gene network regulating lysosomal biogenesis and function.">Sardiello M 2009</a>, <a href="http://www.ncbi.nlm.nih.gov/pubmed/10802651" target="_blank" title="Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.">Ashburner M 2000</a>, <a href="http://www.ncbi.nlm.nih.gov/pubmed/18977398" target="_blank" title="Proteomics of the lysosome.">Lübke T 2009</a>, <a href="http://

In [None]:
list_head = []
list_rows= []

#Create a csv
with open ('HumanLysosomeGene_table'+ ".csv", 'w') as out:
    for row in table[1].find_all('tr'):
        for head in row.find_all('th'):
            #append the headers to a list and join with commas (easier to parse)
            list_head.append(head.text)
            list_head = list(filter(None,list_head))
            header = ";".join(list_head)
            #append the cells in each row to a list
        for cell in row.find_all('td'):
            #the cells contain commas, so the delimiter used below is ';' and the commas can stay as they are.
            list_rows.append(cell.text)
            list_rows = list(filter(None,list_rows))
    print(header, file = out)
    #This code will create a list of tuples whereby each tuple has 3 cells (i.e. 1 row; the empty checkbox cell was filtered out above)
    tuples = list(zip(*[iter(list_rows)]*3))
    
    #iterate over the tuple list in order to join the cells with ';' (easier to parse)
    for tup in tuples: #keep in mind that the delimiter is a ';'
        print(';'.join(tup), file = out)
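
Joining the cells with `';'.join(...)` works as long as no cell contains a `;` itself. A minimal sketch with the standard `csv` module, which quotes such cells automatically, is shown below. The first data cell and the name come from the table output in this notebook, while the third cell is a placeholder chosen to contain a `;`.

```python
import csv
import io

rows = [
    ('Symbol', 'Name', 'Sources'),
    # The last cell is a placeholder that contains the delimiter on purpose.
    ('ABCA2', 'ATP-binding cassette, sub-family A (ABC1), member 2', 'SOURCE_1; SOURCE_2'),
]

# Write to an in-memory buffer; a real file would be opened with newline=''.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=';', lineterminator='\n')
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Reading the text back gives the original cells, including the one with ';'.
read_back = list(csv.reader(io.StringIO(text), delimiter=';'))
print(read_back[1][2])
```

Only the cell that contains the delimiter is wrapped in double quotes; the commas inside the name are left untouched.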

### April 15-16, 2020
### <font color = 'blue'>Tasks:</font>
- Try again with table[2] in CASBAH.<br>
- Ask about accessing the UniProt ID from the Human Lysosome Gene database.<br>
- Get the table of the Human Autophagy database (fetch the source code) and ask about getting the synonyms.

### <font color = 'blue'>Answer:</font>
- For the time being, there is no need to follow any link inside the websites to get the UniProt ID and synonyms.

In [730]:
# Downloading and parsing The Human Autophagy Database
url = 'http://autophagy.lu/clustering/index.html'
wget.download(url, 'HumanAutophagydatabase.html')


'HumanAutophagydatabase.html'

In [None]:
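#THESE CODES WERE RUN IN ATOM BECAUSE THEY WERE GIVING THIS ERROR IN JUPYTER (I DON'T KNOW WHY YET).
#FOR THE tuples LINE: 'str' object is not callable
#(possible cause: one of the built-in names used on that line, list, zip or iter, was assigned a string earlier in the notebook session)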

mylist = [] # that is going to be a list of lists
myelement = []

#No need to write this with open statement each time I run the code here, just the first time. But write it in Atom.
with open('HumanAutophagydatabase.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    #check the index of all tables
    table = soup.find_all('table')
#len(table)
#index table[6]

#the file is named to be cleaned because I will remove the escape characters from it.
with open('tobecleaned', 'w') as tobecleaned:
    for row in table[6].find_all('tr'):
        print(row.text, file = tobecleaned)  

#row.text contains a lot of spaces and tabs. The file is a mess.
with open('tobecleaned','r') as tobecleaned, open('cleanTable6', 'w') as output:
    for line in tobecleaned:
        line=line.strip()
        #I split at the new lines and that gave empty lists instead of the new lines.
        line_list = line.split('\n')
        mylist.append(line_list)
        #I need to remove the empty lists from the list.
        mylist = [x for x in mylist if x != ['']]
    for lists in mylist:
        for element in lists:
            #append all the contents of the sub-lists to another list (this will make it easier to parse)
            myelement.append(element)
            ','. join(myelement)
            tuples = list(zip(*[iter(myelement)]*3))
    for tup in tuples:
        #The delimiter now is a ';'. That's better than',' because some words within same field are separated by ','.
        print(';'.join(tup), file = output) 
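
The cleaning done above through the intermediate `tobecleaned` file can also be written as a small function that works on the text of one row at a time. This is only a sketch under assumptions: the function names are hypothetical, the sample strings are placeholders for `row.text`, and rows that do not yield exactly 3 fields are skipped instead of being regrouped as in the cell above.

```python
def row_text_to_fields(row_text):
    """Split the text of one table row into its non-empty, stripped pieces."""
    return [piece.strip() for piece in row_text.split('\n') if piece.strip()]

def rows_to_lines(row_texts, n_fields=3, delimiter=';'):
    """Turn row texts into delimiter-joined lines, one line per complete row."""
    lines = []
    for text in row_texts:
        fields = row_text_to_fields(text)
        if len(fields) == n_fields:   # skip empty or incomplete rows
            lines.append(delimiter.join(fields))
    return lines

# Placeholder row texts with the kind of tabs, spaces and blank lines
# that row.text contains.
sample_rows = [
    '\n\t GENE_A \n\n  ID_A\n  Description A \n',
    '\n\n',
    '\n GENE_B\n ID_B \n Description, B\n',
]
lines = rows_to_lines(sample_rows)
for line in lines:
    print(line)
```

With the parsed page this would be called as `rows_to_lines(row.text for row in table[6].find_all('tr'))`, which also works because the function only iterates over its argument once.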

In [None]:
#IMP THE CODE WAS RUN ON ATOM 
#REMOVE HEADERS
with open('HumanAutophagydatabase.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    table = soup.find_all('table')


with open('tobecleaned8', 'w') as tobecleaned:
    for row in table[8].find_all('tr'):
        print(row.text, file = tobecleaned)

mylist = []
myelement = []
with open('tobecleaned8','r') as tobecleaned, open('cleanTable8', 'w') as output:
    for line in tobecleaned:
        line=line.strip()
        line_list = line.split('\n')
        mylist.append(line_list)
        mylist = [x for x in mylist if x != ['']]
    for lists in mylist:
        for element in lists:
            myelement.append(element)
            ','. join(myelement)
            tuples = list(zip(*[iter(myelement)]*3))
    #I don't want to include the header in the rest of the tuples because it's already present from the first table (I will concatenate all of them).
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        

        
        
        


with open('tobecleaned10', 'w') as tobecleaned:
    for row in table[10].find_all('tr'):
        print(row.text, file = tobecleaned)

mylist = []
myelement = []
with open('tobecleaned10','r') as tobecleaned, open('cleanTable10', 'w') as output:
    for line in tobecleaned:
        line=line.strip()
        line_list = line.split('\n')
        mylist.append(line_list)
        mylist = [x for x in mylist if x != ['']]
    for lists in mylist:
        for element in lists:
            myelement.append(element)
            ','. join(myelement)
            tuples = list(zip(*[iter(myelement)]*3))
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        

        
        


with open('tobecleaned12', 'w') as tobecleaned:
    for row in table[12].find_all('tr'):
        print(row.text, file = tobecleaned)

mylist = []
myelement = []
with open('tobecleaned12','r') as tobecleaned, open('cleanTable12', 'w') as output:
    for line in tobecleaned:
        line=line.strip()
        line_list = line.split('\n')
        mylist.append(line_list)
        mylist = [x for x in mylist if x != ['']]
    for lists in mylist:
        for element in lists:
            myelement.append(element)
            ','. join(myelement)
            tuples = list(zip(*[iter(myelement)]*3))
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        
with open('tobecleaned14', 'w') as tobecleaned:
    for row in table[14].find_all('tr'):
        print(row.text, file = tobecleaned)

mylist = []
myelement = []
with open('tobecleaned14','r') as tobecleaned, open('cleanTable14', 'w') as output:
    for line in tobecleaned:
        line=line.strip()
        line_list = line.split('\n')
        mylist.append(line_list)
        mylist = [x for x in mylist if x != ['']]
    for lists in mylist:
        for element in lists:
            myelement.append(element)
            ','. join(myelement)
            tuples = list(zip(*[iter(myelement)]*3))
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        

with open('tobecleaned16', 'w') as tobecleaned:
    for row in table[16].find_all('tr'):
        print(row.text, file = tobecleaned)

mylist = []
myelement = []
with open('tobecleaned16','r') as tobecleaned, open('cleanTable16', 'w') as output:
    for line in tobecleaned:
        line=line.strip()
        line_list = line.split('\n')
        mylist.append(line_list)
        mylist = [x for x in mylist if x != ['']]
    for lists in mylist:
        for element in lists:
            myelement.append(element)
            ','. join(myelement)
            tuples = list(zip(*[iter(myelement)]*3))
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        

with open('tobecleaned18', 'w') as tobecleaned:
    for row in table[18].find_all('tr'):
        print(row.text, file = tobecleaned)

mylist = []
myelement = []
with open('tobecleaned18','r') as tobecleaned, open('cleanTable18', 'w') as output:
    for line in tobecleaned:
        line=line.strip()
        line_list = line.split('\n')
        mylist.append(line_list)
        mylist = [x for x in mylist if x != ['']]
    for lists in mylist:
        for element in lists:
            myelement.append(element)
            ','. join(myelement)
            tuples = list(zip(*[iter(myelement)]*3))
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        
with open('tobecleaned20', 'w') as tobecleaned:
    for row in table[20].find_all('tr'):
        print(row.text, file = tobecleaned)

mylist = []
myelement = []
with open('tobecleaned20','r') as tobecleaned, open('cleanTable20', 'w') as output:
    for line in tobecleaned:
        line=line.strip()
        line_list = line.split('\n')
        mylist.append(line_list)
        mylist = [x for x in mylist if x != ['']]
    for lists in mylist:
        for element in lists:
            myelement.append(element)
            ','. join(myelement)
            tuples = list(zip(*[iter(myelement)]*3))
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        
with open('tobecleaned22', 'w') as tobecleaned:
    for row in table[22].find_all('tr'):
        print(row.text, file = tobecleaned)

mylist = []
myelement = []
with open('tobecleaned22','r') as tobecleaned, open('cleanTable22', 'w') as output:
    for line in tobecleaned:
        line=line.strip()
        line_list = line.split('\n')
        mylist.append(line_list)
        mylist = [x for x in mylist if x != ['']]
    for lists in mylist:
        for element in lists:
            myelement.append(element)
            ','. join(myelement)
            tuples = list(zip(*[iter(myelement)]*3))
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
 

        
with open('tobecleaned24', 'w') as tobecleaned:
    for row in table[24].find_all('tr'):
        print(row.text, file = tobecleaned)

mylist = []
myelement = []
with open('tobecleaned24','r') as tobecleaned, open('cleanTable24', 'w') as output:
    for line in tobecleaned:
        line=line.strip()
        line_list = line.split('\n')
        mylist.append(line_list)
        mylist = [x for x in mylist if x != ['']]
    for lists in mylist:
        for element in lists:
            myelement.append(element)
            ','. join(myelement)
            tuples = list(zip(*[iter(myelement)]*3))
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        

with open('tobecleaned26', 'w') as tobecleaned:
    for row in table[26].find_all('tr'):
        print(row.text, file = tobecleaned)

mylist = []
myelement = []
with open('tobecleaned26','r') as tobecleaned, open('cleanTable26', 'w') as output:
    for line in tobecleaned:
        line=line.strip()
        line_list = line.split('\n')
        mylist.append(line_list)
        mylist = [x for x in mylist if x != ['']]
    for lists in mylist:
        for element in lists:
            myelement.append(element)
            ','. join(myelement)
            tuples = list(zip(*[iter(myelement)]*3))
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        

with open('tobecleaned28', 'w') as tobecleaned:
    for row in table[28].find_all('tr'):
        print(row.text, file = tobecleaned)

with open('tobecleaned28','r') as tobecleaned, open('cleanTable28', 'w') as output:
    # Flatten the saved row text into cells, dropping blank lines
    myelement = [line.strip() for line in tobecleaned if line.strip()]
    # Group the cells into rows of three columns
    tuples = list(zip(*[iter(myelement)]*3))
    # Skip the header row and write the rest, semicolon-separated
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        
        
with open('tobecleaned30', 'w') as tobecleaned:
    for row in table[30].find_all('tr'):
        print(row.text, file = tobecleaned)

with open('tobecleaned30','r') as tobecleaned, open('cleanTable30', 'w') as output:
    # Flatten the saved row text into cells, dropping blank lines
    myelement = [line.strip() for line in tobecleaned if line.strip()]
    # Group the cells into rows of three columns
    tuples = list(zip(*[iter(myelement)]*3))
    # Skip the header row and write the rest, semicolon-separated
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        
        
with open('tobecleaned32', 'w') as tobecleaned:
    for row in table[32].find_all('tr'):
        print(row.text, file = tobecleaned)

with open('tobecleaned32','r') as tobecleaned, open('cleanTable32', 'w') as output:
    # Flatten the saved row text into cells, dropping blank lines
    myelement = [line.strip() for line in tobecleaned if line.strip()]
    # Group the cells into rows of three columns
    tuples = list(zip(*[iter(myelement)]*3))
    # Skip the header row and write the rest, semicolon-separated
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        
# Table 34 is empty
with open('tobecleaned36', 'w') as tobecleaned:
    for row in table[36].find_all('tr'):
        print(row.text, file = tobecleaned)

with open('tobecleaned36','r') as tobecleaned, open('cleanTable36', 'w') as output:
    # Flatten the saved row text into cells, dropping blank lines
    myelement = [line.strip() for line in tobecleaned if line.strip()]
    # Group the cells into rows of three columns
    tuples = list(zip(*[iter(myelement)]*3))
    # Skip the header row and write the rest, semicolon-separated
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        
# Table 38 is empty
with open('tobecleaned40', 'w') as tobecleaned:
    for row in table[40].find_all('tr'):
        print(row.text, file = tobecleaned)

with open('tobecleaned40','r') as tobecleaned, open('cleanTable40', 'w') as output:
    # Flatten the saved row text into cells, dropping blank lines
    myelement = [line.strip() for line in tobecleaned if line.strip()]
    # Group the cells into rows of three columns
    tuples = list(zip(*[iter(myelement)]*3))
    # Skip the header row and write the rest, semicolon-separated
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        
        
with open('tobecleaned42', 'w') as tobecleaned:
    for row in table[42].find_all('tr'):
        print(row.text, file = tobecleaned)

with open('tobecleaned42','r') as tobecleaned, open('cleanTable42', 'w') as output:
    # Flatten the saved row text into cells, dropping blank lines
    myelement = [line.strip() for line in tobecleaned if line.strip()]
    # Group the cells into rows of three columns
    tuples = list(zip(*[iter(myelement)]*3))
    # Skip the header row and write the rest, semicolon-separated
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        

with open('tobecleaned44', 'w') as tobecleaned:
    for row in table[44].find_all('tr'):
        print(row.text, file = tobecleaned)

with open('tobecleaned44','r') as tobecleaned, open('cleanTable44', 'w') as output:
    # Flatten the saved row text into cells, dropping blank lines
    myelement = [line.strip() for line in tobecleaned if line.strip()]
    # Group the cells into rows of three columns
    tuples = list(zip(*[iter(myelement)]*3))
    # Skip the header row and write the rest, semicolon-separated
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        
with open('tobecleaned46', 'w') as tobecleaned:
    for row in table[46].find_all('tr'):
        print(row.text, file = tobecleaned)

with open('tobecleaned46','r') as tobecleaned, open('cleanTable46', 'w') as output:
    # Flatten the saved row text into cells, dropping blank lines
    myelement = [line.strip() for line in tobecleaned if line.strip()]
    # Group the cells into rows of three columns
    tuples = list(zip(*[iter(myelement)]*3))
    # Skip the header row and write the rest, semicolon-separated
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        
with open('tobecleaned48', 'w') as tobecleaned:
    for row in table[48].find_all('tr'):
        print(row.text, file = tobecleaned)

with open('tobecleaned48','r') as tobecleaned, open('cleanTable48', 'w') as output:
    # Flatten the saved row text into cells, dropping blank lines
    myelement = [line.strip() for line in tobecleaned if line.strip()]
    # Group the cells into rows of three columns
    tuples = list(zip(*[iter(myelement)]*3))
    # Skip the header row and write the rest, semicolon-separated
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        
        
with open('tobecleaned50', 'w') as tobecleaned:
    for row in table[50].find_all('tr'):
        print(row.text, file = tobecleaned)

with open('tobecleaned50','r') as tobecleaned, open('cleanTable50', 'w') as output:
    # Flatten the saved row text into cells, dropping blank lines
    myelement = [line.strip() for line in tobecleaned if line.strip()]
    # Group the cells into rows of three columns
    tuples = list(zip(*[iter(myelement)]*3))
    # Skip the header row and write the rest, semicolon-separated
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
        
        
        
# Tables 52 and 54 are empty
with open('tobecleaned56', 'w') as tobecleaned:
    for row in table[56].find_all('tr'):
        print(row.text, file = tobecleaned)

with open('tobecleaned56','r') as tobecleaned, open('cleanTable56', 'w') as output:
    # Flatten the saved row text into cells, dropping blank lines
    myelement = [line.strip() for line in tobecleaned if line.strip()]
    # Group the cells into rows of three columns
    tuples = list(zip(*[iter(myelement)]*3))
    # Skip the header row and write the rest, semicolon-separated
    for tup in tuples[1:]:
        print(';'.join(tup), file = output)
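
The cleaning cells for each table repeat the same steps: flatten the row text into cells, group the cells three at a time, skip the header, and write semicolon-separated lines. A minimal sketch of how this could be folded into reusable functions is shown below. The names `rows_to_records`, `write_records`, `n_cols` and `skip_header` are hypothetical and not part of the original notebook, and the example rows are made up for illustration.

```python
def rows_to_records(row_texts, n_cols=3, skip_header=True):
    """Turn the text of each <tr> row into tuples of n_cols cells."""
    # Split every row's text on newlines and keep only non-blank cells
    cells = [
        line.strip()
        for text in row_texts
        for line in text.split('\n')
        if line.strip()
    ]
    # Group the flat cell list into consecutive chunks of n_cols
    records = list(zip(*[iter(cells)] * n_cols))
    return records[1:] if skip_header else records


def write_records(records, path):
    """Write one semicolon-separated line per record."""
    with open(path, 'w') as output:
        for record in records:
            print(';'.join(record), file=output)


# Made-up row text standing in for row.text from one scraped table
example_rows = [
    '\nGene\nName\nReference\n',
    '\nATG5\nautophagy related 5\n12345\n',
]
example_records = rows_to_records(example_rows)
print(example_records)
```

With the notebook's `table` list, the same function could presumably be called as `rows_to_records(row.text for row in table[i].find_all('tr'))` inside a loop over the non-empty table indices, which would also avoid the intermediate `tobecleaned` files.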

<font color = 'red' size = '3'>**NOTE**</font> <br>

**At this step I finished downloading the databases.** 

### April 17-19, 2020
### <font color = 'blue'>Tasks:</font>
- Try to finish as much as possible of the [linear algebra course](https://www.coursera.org/learn/linear-algebra-machine-learning/home/welcome).

In [None]:
https://www.cell.com/cms/10.1016/j.celrep.2017.08.063/attachment/235fbebe-76e4-48f9-a1a0-aa8c49d68426/mmc2.xlsx