# Scraping the NCBI Website

The NCBI website has a good overview of genome information by organism. From the summary page we can get several high-level genetic features (Kingdom, Size in nucleotide Mb, and number of chromosomes), as well as the URL for further genetic information on each organism. There are 167 pages of organisms, which we need to loop through.


In [5]:
import webbrowser
eurl = 'http://www.ncbi.nlm.nih.gov/genome/browse/#tabs-genomes' #url without selection
url1 = eurl
webbrowser.open(url1,new=2) #nice looking version which is clickable

#example: page 7 of the general query, which returns scrapable HTML rows
num_page = 7
url1 = 'http://www.ncbi.nlm.nih.gov/genomes/Genome2BE/genome2srv.cgi?action=GetGenomeList4Grid&mode=2&page='+str(num_page)+'&pageSize=100'

#open the raw data URL in the browser to inspect the actual page data
webbrowser.open(url1,new=2)

import requests
response = requests.get(url1)
HTML = response.text

print(HTML[:500])


<!--  7 100 16679 -->
<tr class="Even"><td><a href="/genome/30333" target="_blank">Aigarchaeota archaeon SCGC AAA471-J08</a></td><td>Archaea</td><td>TACK group</td><td>Thaumarchaeota</td><td>0.252228</td><td>-</td><td>-</td><td>-</td><td><a href="/genome/30333" target="_blank">1</a></td></tr>
<tr class="Odd"><td><a href="/genome/2702" target="_blank">Ailuropoda melanoleuca</a></td><td>Eukaryota</td><td>Animals</td><td>Mammals</td><td>2299.51</td><td>-</td><td>1</td><td>-</td><td><a href="/genome
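
The first line of the response, `<!--  7 100 16679 -->`, looks like it carries the page number, the page size, and the total record count. That reading is an assumption (the endpoint's format is not documented here), but it is consistent with the 167 pages mentioned earlier. A minimal sketch, written for Python 3, of deriving the page count from it; `parse_page_header` is a hypothetical helper, not part of the original notebook:

```python
import math
import re

def parse_page_header(html):
    """Return (page, page_size, total_records) from the leading HTML comment.

    Assumes the comment holds exactly three integers in that order.
    """
    match = re.search(r"<!--\s*(\d+)\s+(\d+)\s+(\d+)\s*-->", html)
    if match is None:
        raise ValueError("no page header comment found")
    page, page_size, total = (int(g) for g in match.groups())
    return page, page_size, total

# First line of the response shown above, followed by a placeholder row.
sample = '<!--  7 100 16679 -->\n<tr class="Even"><td>placeholder</td></tr>'
page, page_size, total = parse_page_header(sample)
num_pages = math.ceil(total / page_size)  # round up to cover the last partial page
print(page, page_size, total, num_pages)
```

If the assumption holds, the page count could be read from the first response instead of being hard coded.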


In [10]:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

import pandas as pd

Clicking an organism name on the Overview page leads to another page with further genetic information. The function below uses Scrapy to extract each organism's URL:

In [11]:
###SCRAPY###
# Get Link from HTML for Summary Pages for Organism

def p1_Link(HTML):
    site_all = "//tr/td/a/@href" #URL for Organism
    for_page = Selector(text=HTML).xpath(site_all)
    page_tot = for_page.extract() #list of href strings

    #each row links the organism twice (name cell and assemblies cell),
    #so keep every other href to drop the duplicates
    href_list = page_tot[::2]

    p1a_df = pd.DataFrame(href_list, columns=["url"])
    return p1a_df
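
Keeping every other link works because each row in the sample output links the organism twice. As a hedged alternative that does not depend on that pattern, the sketch below takes the first link of each row using only the Python 3 standard library. `RowLinkParser` and `first_link_per_row` are hypothetical names, and the sample rows are abbreviated (the second row is deliberately given a single link to show the difference):

```python
from html.parser import HTMLParser

class RowLinkParser(HTMLParser):
    """Collect the first href found inside each <tr>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._row_has_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row_has_link = False  # new row: no link captured yet
        elif tag == "a" and not self._row_has_link:
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)
                self._row_has_link = True

def first_link_per_row(html):
    parser = RowLinkParser()
    parser.feed(html)
    return parser.links

sample = (
    '<tr class="Even"><td><a href="/genome/30333" target="_blank">'
    'Aigarchaeota archaeon SCGC AAA471-J08</a></td><td>Archaea</td>'
    '<td><a href="/genome/30333" target="_blank">1</a></td></tr>'
    '<tr class="Odd"><td><a href="/genome/2702" target="_blank">'
    'Ailuropoda melanoleuca</a></td><td>Eukaryota</td><td>-</td></tr>'
)
print(first_link_per_row(sample))
```

With the even-index approach, a row carrying only one link would shift every later URL by one position; taking the first link per row avoids that.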


A second Scrapy-based function extracts the text of every table cell on a page into a single long column. A helper function then reshapes that column from long to wide format, with 9 columns per organism.

In [12]:
###SCRAPY###
# Get "Kingdom","Group","SubGroup","Size","Chr","Organelles","Plasmids" for 1 Page in dataframe

def p1_Cols(HTML):
    site_all = "//td//text()" #text of every table cell
    for_page = Selector(text=HTML).xpath(site_all)
    page_tot = for_page.extract() #list of cell strings

    p1_df = pd.DataFrame(page_tot) #one long column, one cell per row
    return p1_df


In [13]:
# Long Format, 9 cols, Dataframe formatting from long to wide
# Add Column Names, repeating list of 9

def format_Colnames(p1_df):
    col_names = ["Organism","Kingdom","Group","SubGroup","Size","Chr","Organelles","Plasmids","Assemblies"]
    col_num = len(col_names) #9 columns per organism

    x = len(p1_df) // col_num #number of complete rows (integer division)

    col_long = []

    i = 0
    for row in range(0, x):
        col_short = []

        for col in range(0, col_num):
            col_short.append(p1_df[0][i])
            i = i + 1

        col_long.append(col_short)

    pwide_df = pd.DataFrame(col_long, columns=col_names)

    return pwide_df
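
The long-to-wide step amounts to cutting a flat list of cell texts into chunks of nine. A minimal sketch with plain list slicing, using the cell values from the sample output above; `chunk_rows` is a hypothetical helper, not the notebook's function:

```python
COL_NAMES = ["Organism", "Kingdom", "Group", "SubGroup", "Size",
             "Chr", "Organelles", "Plasmids", "Assemblies"]

def chunk_rows(cells, width=len(COL_NAMES)):
    """Split a flat list of cell texts into rows of `width` cells.

    A trailing partial row is dropped, matching the integer division
    used in format_Colnames.
    """
    full = len(cells) - len(cells) % width
    return [cells[i:i + width] for i in range(0, full, width)]

cells = [
    "Aigarchaeota archaeon SCGC AAA471-J08", "Archaea", "TACK group",
    "Thaumarchaeota", "0.252228", "-", "-", "-", "1",
    "Ailuropoda melanoleuca", "Eukaryota", "Animals", "Mammals",
    "2299.51", "-", "1", "-",  # this row was cut short, so it is dropped
]
rows = chunk_rows(cells)
records = [dict(zip(COL_NAMES, row)) for row in rows]
print(len(rows), records[0]["Kingdom"], records[0]["Size"])
```

This approach relies on every cell containing some text. In the sample output, empty values appear as `-`, so the count of nine per row appears to hold, but a truly empty cell would shift the columns.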

Each page yields two dataframes from these functions (the cell values and the URLs). They are concatenated side by side and then appended to a single dataframe covering the scraped NCBI Overview pages (164 in the run below). The pandas dataframe format allows easy manipulation of the data for analysis.

In [15]:
big_df = pd.DataFrame()
num_pages = 164 #number of pages to scrape

for i in range(1, num_pages+1): #For each page, get Grouping Info for each Organism

    num_page = i
    url1 = 'http://www.ncbi.nlm.nih.gov/genomes/Genome2BE/genome2srv.cgi?action=GetGenomeList4Grid&mode=2&page='+str(num_page)+'&pageSize=100'

    response = requests.get(url1)
    HTML = response.text

    p1_df = p1_Cols(HTML) #cell text for each Organism
    p1_df_a = p1_Link(HTML) #URL for each Organism

    p1_df_format = format_Colnames(p1_df) #Add Column Names

    p1_df_format = pd.concat([p1_df_format, p1_df_a], axis=1)
    big_df = pd.concat([big_df, p1_df_format], axis=0)
    print(i) #progress: page number just scraped

#print big_df.shape
big_df.tail(10)


1
2
3
...
164


|    | Organism | Kingdom | Group | SubGroup | Size | Chr | Organelles | Plasmids | Assemblies | url |
|----|----------|---------|-------|----------|------|-----|------------|----------|------------|-----|
| 90 | Wellfleet Bay virus | Viruses | ssRNA viruses | Orthomyxoviridae | 0.011958 | 7 | - | - | 1 | /genome/35297 |
| 91 | Wenling shark virus | Viruses | ssRNA viruses | unclassified | 0.009653 | 1 | - | - | 1 | /genome/41580 |
| 92 | Wenxinia marina | Bacteria | Proteobacteria | Alphaproteobacteria | 4.17589 | - | - | 1 | 2 | /genome/13935 |
| 93 | Wenzhou virus | Viruses | ssRNA viruses | Arenaviridae | 0.010483 | 2 | - | - | 1 | /genome/35602 |
| 94 | Wenzhouxiangella marina | Bacteria | Proteobacteria | Gammaproteobacteria | 3.6752 | 1 | - | - | 1 | /genome/39107 |
| 95 | Wesselsbron virus | Viruses | ssRNA viruses | Flaviviridae | 0.010814 | 1 | - | - | 1 | /genome/6478 |
| 96 | West African Asystasia virus 1 | Viruses | ssDNA viruses | Geminiviridae | 0.005388 | 2 | - | - | 1 | /genome/12296 |
| 97 | West Caucasian bat virus | Viruses | ssRNA viruses | Rhabdoviridae | 0.012278 | 1 | - | - | 1 | /genome/34375 |
| 98 | West Nile virus | Viruses | ssRNA viruses | Flaviviridae | 0.010962 | 1 | - | - | 2 | /genome/10311 |
| 99 | Western equine encephalitis virus | Viruses | ssRNA viruses | Togaviridae | 0.011484 | 1 | - | - | 1 | /genome/5000 |
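
The scraping loop above sends its requests back to back and grows the dataframe on every iteration. The sketch below shows one way the loop could be structured with a pause between requests and the results collected in a list. It is an illustration under stated assumptions, not the notebook's code: `page_url`, `scrape_pages`, `fetch`, and `delay` are hypothetical names, and the fetcher is injected so the loop can be run without network access.

```python
import time

BASE_URL = ("http://www.ncbi.nlm.nih.gov/genomes/Genome2BE/genome2srv.cgi"
            "?action=GetGenomeList4Grid&mode=2&page={page}&pageSize={size}")

def page_url(page, size=100):
    """Build the URL for one page of the genome list."""
    return BASE_URL.format(page=page, size=size)

def scrape_pages(fetch, pages, delay=0.0):
    """Fetch each page with `fetch(url)` and return the results in order.

    `fetch` is injected so the loop can be exercised offline; with
    requests it could be `lambda u: requests.get(u).text`.
    """
    results = []
    for page in pages:
        results.append(fetch(page_url(page)))
        time.sleep(delay)  # pause between requests to be polite to the server
    return results

# Stand-in fetcher that just echoes the URL it was asked for.
fake_pages = scrape_pages(lambda url: "fetched " + url, range(1, 4))
print(len(fake_pages))
print(fake_pages[0])
```

In the pandas version, the per-page dataframes could be appended to a list in the same way and passed to a single `pd.concat` call after the loop, which avoids copying the accumulated data on every page.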


Because web data is dynamic and scraping many pages is slow, it is a good idea to save the scraped data to a CSV file, which can later be loaded back into a dataframe as below.

In [16]:
big_df.to_csv('./out_Z2.csv', encoding='utf-8')

In [17]:
df_164 = pd.read_csv('./out_Z2.csv')
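
A CSV file stores only text, so numeric columns have to be parsed again after loading. The sketch below illustrates that round trip with the standard library `csv` module rather than pandas (a deliberate substitution to keep the example self-contained), using two rows from the table above and an in-memory buffer in place of a file:

```python
import csv
import io

rows = [
    {"Organism": "West Nile virus", "Kingdom": "Viruses", "Size": "0.010962", "Chr": "1"},
    {"Organism": "Wenxinia marina", "Kingdom": "Bacteria", "Size": "4.17589", "Chr": "-"},
]

# Write to an in-memory buffer; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Organism", "Kingdom", "Size", "Chr"])
writer.writeheader()
writer.writerows(rows)

# Read the same text back: every value comes back as a string.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored == rows)
print(type(restored[0]["Size"]).__name__)
```

pandas infers column types when reading, but a column containing placeholders such as `-` stays as text until it is converted explicitly. Note also that `to_csv` writes the dataframe index as an extra column unless `index=False` is passed.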

Armed with some data, we can start to take a look. Calling value_counts on the Kingdom column shows how many organisms fall into each kingdom.

In [18]:
df_164['Kingdom'].value_counts()

Bacteria     8470
Viruses      5394
Eukaryota    1915
Archaea       573
Viroids        48
Name: Kingdom, dtype: int64
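
As a quick sanity check, the counts above should add up to the number of rows scraped: 164 pages of 100 organisms each. A small sketch of that arithmetic, with the counts copied from the output above:

```python
kingdom_counts = {
    "Bacteria": 8470,
    "Viruses": 5394,
    "Eukaryota": 1915,
    "Archaea": 573,
    "Viroids": 48,
}
num_pages, page_size = 164, 100

total = sum(kingdom_counts.values())
print(total, total == num_pages * page_size)

# Share of each kingdom, largest first.
for kingdom, count in sorted(kingdom_counts.items(), key=lambda kv: -kv[1]):
    print("{:<10s}{:6.1%}".format(kingdom, float(count) / total))
```

The total matches, which suggests no rows were lost or duplicated while reshaping the pages.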

In [19]:
#Convert to numeric so that the describe function works later on;
#non-numeric entries such as '-' become NaN

df_164['Size'] = pd.to_numeric(df_164['Size'], errors='coerce')
df_164['Chr'] = pd.to_numeric(df_164['Chr'], errors='coerce')

print(df_164['Size'].dtype)

float64




In [20]:
df_suba = df_164[['Kingdom','Size']]
print(df_suba.isnull().sum())

Kingdom    0
Size       3
dtype: int64


In [21]:
df_subb = df_164[['Kingdom','Chr']]
print(df_subb.isnull().sum())

Kingdom       0
Chr        8493
dtype: int64
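
The large number of missing Chr values presumably comes from the numeric conversion step: cells that hold the placeholder `-` cannot be parsed as numbers and end up as NaN. The plain-Python sketch below mimics that behaviour on the Chr values of the first five rows of the table shown earlier; `to_number` is a hypothetical helper, not a pandas function:

```python
import math

def to_number(text):
    """Parse a cell as float, returning NaN when it is not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return float("nan")

chr_cells = ["7", "1", "-", "2", "1"]  # Chr values from the table shown earlier
values = [to_number(c) for c in chr_cells]
missing = sum(1 for v in values if math.isnan(v))
print(values)
print("missing:", missing)
```

Counting NaN values per column this way corresponds to what `isnull().sum()` reports.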


In [22]:
bytreatment = df_suba.groupby('Kingdom')
bytreatment['Size'].describe()

Kingdom         
Archaea    count      573.000000
           mean         2.096091
           std          1.217596
           min          0.100212
           25%          1.227160
           50%          1.853160
           75%          2.937200
           max          6.451200
Bacteria   count     8470.000000
           mean         3.606051
           std          2.262490
           min          0.104827
           25%          1.958058
           50%          3.342230
           75%          4.798423
           max         16.377200
Eukaryota  count     1912.000000
           mean       463.160587
           std       1274.846636
           min          0.662517
           25%         30.715150
           50%         54.741200
           75%        375.169500
           max      27602.700000
Viroids    count       48.000000
           mean         0.000337
           std          0.000045
           min          0.000246
           25%          0.000305
           50%          0.

In [23]:
bytreatment = df_subb.groupby('Kingdom')
bytreatment['Chr'].describe()

Kingdom         
Archaea    count     192.000000
           mean        1.015625
           std         0.124344
           min         1.000000
           25%         1.000000
           50%         1.000000
           75%         1.000000
           max         2.000000
Bacteria   count    2056.000000
           mean        1.085603
           std         0.673060
           min         1.000000
           25%         1.000000
           50%         1.000000
           75%         1.000000
           max        27.000000
Eukaryota  count     220.000000
           mean       16.022727
           std        11.069908
           min         1.000000
           25%         8.000000
           50%        13.500000
           75%        22.000000
           max        64.000000
Viroids    count      48.000000
           mean        1.000000
           std         0.000000
           min         1.000000
           25%         1.000000
           50%         1.000000
           75%         

In [24]:
#work on an explicit copy so the assignment does not act on a slice of df_164
df_subb = df_subb.copy()
df_subb["Chr"] = df_subb["Chr"].fillna(0)


In [25]:
bytreatment = df_subb.groupby('Kingdom')
bytreatment['Chr'].describe()

Kingdom         
Archaea    count     573.000000
           mean        0.340314
           std         0.485162
           min         0.000000
           25%         0.000000
           50%         0.000000
           75%         1.000000
           max         2.000000
Bacteria   count    8470.000000
           mean        0.263518
           std         0.571474
           min         0.000000
           25%         0.000000
           50%         0.000000
           75%         0.000000
           max        27.000000
Eukaryota  count    1915.000000
           mean        1.840731
           std         6.335633
           min         0.000000
           25%         0.000000
           50%         0.000000
           75%         0.000000
           max        64.000000
Viroids    count      48.000000
           mean        1.000000
           std         0.000000
           min         1.000000
           25%         1.000000
           50%         1.000000
           75%         
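
Filling the missing values with 0 changes both the count and the mean, because the zeros enlarge the denominator without adding to the sum. The arithmetic can be checked against the Archaea figures from the two `describe()` outputs above:

```python
# Archaea figures taken from the two describe() outputs above.
n_known = 192          # organisms with a chromosome count
mean_known = 1.015625  # mean over those 192
n_total = 573          # all Archaea rows after filling with 0

total_chromosomes = n_known * mean_known  # zeros add nothing to the sum
mean_filled = total_chromosomes / n_total

print(total_chromosomes)
print(round(mean_filled, 6))
```

The result reproduces the mean of 0.340314 reported after the fill. The drop therefore reflects missing data being counted as zero chromosomes, so the statistics computed before the fill are likely the better description of organisms with a known chromosome count.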