# WEB SCRAPING GOOGLE SCHOLAR

## Libraries

Because some sources are inaccurate and several names are confusingly similar, a newcomer can easily mix up `request` with `requests`, and `urllib` with `urllib3`, as I did. Let's clarify:  

Python ships with the `urllib` package, which contains the **`request`** module (along with the less used `response`, `error`, `parse` and `robotparser` modules). `urllib.request` is mostly used to open URLs with `urlopen`, the URL equivalent of `open(file)`. The **`requests`** (plural!) library is a separate, third-party package with a much friendlier API for making HTTP requests. 

`requests` depends on `urllib3`, but you do not need to import `urllib3` yourself to make requests: `requests` already uses what it needs from it. <blockquote>"Under the hood, *requests* uses *urllib3* to do most of the http heavy lifting. When used properly they should be mostly the same unless you need more advanced configuration" (ref: [Stack Overflow](https://stackoverflow.com/questions/36937110/what-is-the-practical-difference-between-these-two-ways-of-making-web-connection))</blockquote> 

`urllib` is part of the Python standard library (Python 2's separate `urllib2` module was merged into `urllib.request` and `urllib.error` in Python 3), but `urllib3` is a completely separate, third-party library with a misleading name. It is not a newer version of `urllib`/`urllib2`; the library it actually aims to improve on is `httplib` (ref: [Github](https://github.com/urllib3/urllib3/issues/1065)).  

Here is what we need to import for our purposes:

In [44]:
import requests, json, pandas as pd 
from bs4 import BeautifulSoup 

### On `request`:
For purely instructional purposes, I'm importing both `urllib` and `urllib3` to look for the `request` (singular) module inside them. Again, `urllib` is the one included in the Python standard library:

In [45]:
import urllib, urllib3
print('\n', dir(urllib))
print(dir(urllib3)) # print() displays them more compactly


 ['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'error', 'parse', 'request', 'response']


________________
Even though the `urllib` package has been imported, typing  
`dir(request)`  
returns an error:  
`NameError: name 'request' is not defined`

So importing a package does not define its submodules as bare names. We need to import the module from the package:

In [46]:
from urllib import request

Now we can:

In [47]:
print(dir(request))
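
The same rule applies to any submodule. A minimal sketch using `urllib.parse` (picked here only as an illustration, because its bare name `parse` has not been bound in this notebook):

```python
import urllib

# `import urllib` binds only the name `urllib`; the bare name `parse` is unknown.
try:
    parse
    error_message = None
except NameError as exc:
    error_message = str(exc)
print(error_message)

# Option 1: import the submodule and reach it through the package name.
import urllib.parse
quoted = urllib.parse.quote("google scholar")
print(quoted)

# Option 2: bind the submodule itself to a bare name.
from urllib import parse
print(parse is urllib.parse)
```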



In the list you can see that it contains, among many others, the `urlopen` function, which is not included in `requests`:  

In [48]:
print(dir(requests))



After learning all of this, I also learnt that the `urllib` functions are the older, lower-level way of doing this. It is good to know and understand them so that they do not confuse you when you see them, but use `requests` instead. With that clear, let's continue with our request, which we are going to make **with the `requests` library**, using the `.get` method:

## Connection and access to the html with `requests`

In [49]:
url = 'https://scholar.google.es/citations?view_op=top_venues&hl=es'

With the `requests` library, we can use **`.text` or `.content`** to **get the body** of the response.  
`.text` returns the content as *unicode* (`str`), and `.content` as *bytes*.  
In Python 3 the bytes object is still displayed in a readable form, so the two look basically the same, except that `.content` is shown with a `b` prefix (for bytes).  
The `requests` documentation is not very explicit about when to use each, but a [Stack Overflow discussion](https://stackoverflow.com/questions/17011357/what-is-the-difference-between-content-and-text) mentions that 
> HTML and XML use declarations in the data to do their own decoding, and so they should be fed the raw `.content`  

However, the same discussion also mentions that `.text` should be used for text-like formats such as HTML and XML, and `.content` for binary ones such as *images* and *PDF*. Moreover, the [documentation](http://docs.python-requests.org/en/master/user/quickstart/#response-content) does say that 
> When you make a request, `Requests` makes educated guesses about the encoding of the response based on the HTTP headers.

So I'm using **`.text`**:

In [50]:
# reqContent = requests.get(url).content 
# reqContent

In [51]:
reqText = requests.get(url).text
reqText[:800] # printing just the first 800 chars for display convenience.

'<!doctype html><html><head><title>inglés - Estadísticas de Google Académico</title><meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta name="referrer" content="always"><meta name="viewport" content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2"><meta name="format-detection" content="telephone=no"><style>html,body,form,table,div,h1,h2,h3,h4,h5,h6,img,ol,ul,li,button{margin:0;padding:0;border:0;}table{border-collapse:collapse;border-width:0;empty-cells:show;}html,body{height:100%}#gs_top{position:relative;box-sizing:border-box;min-height:100%;min-width:964px;-webkit-tap-highlight-color:rgba(0,0,0,0);}#gs_top>*:not(#x){-webkit-tap-highlight-color:rgba(204,204,204,.5);}.gs_el_ph #gs_top,.gs_el_t'
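
The difference between the two attributes comes down to `bytes` versus `str`. The sketch below does not contact Google Scholar; it uses a hand-written byte string (an assumption for illustration) encoded as ISO-8859-1, the charset declared in the page's `<meta>` tag above, to mimic what `.content` and `.text` would hold.

```python
# Hypothetical stand-in for `response.content`: raw bytes in ISO-8859-1.
raw = "<title>inglés - Estadísticas de Google Académico</title>".encode("iso-8859-1")

# What `.text` does conceptually: decode the bytes with the detected encoding.
text = raw.decode("iso-8859-1")

# Decoding with the wrong encoding garbles the accented characters.
garbled = raw.decode("utf-8", errors="replace")

print(type(raw).__name__, raw[:20])
print(type(text).__name__, text[:20])
print(garbled[:20])
```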

___________

The same for *content* would be:  
`reqContent = requests.get(url).content`  

Or with the **`request`** module, which returns an `http.client.HTTPResponse` object:  
`request.urlopen(url)`  
Notice that this last instruction does not return the content yet, and its representation does not show the status the way `requests.get(url)` does (that one displays `<Response [200]>`).  
You can get the status like this:  
`>>> request.urlopen(url).code`  
`200`  
We can also call `.peek()` on the *HTTPResponse* object to look at just the initial part (similar to the slice we took above with `reqText`).


Back to **`requests`**. To **parse** the text, ***BeautifulSoup***'s **`html.parser`** is used, as it doesn't display certain text oddly the way `lxml` sometimes does (here, for example, it shows columns of certain characters).  

In [52]:
soup = BeautifulSoup(reqText, 'html.parser')
# soup
print(type(soup))

<class 'bs4.BeautifulSoup'>


Then it is **prettified** with BeautifulSoup's **`.prettify()`** method, which returns a string whose line breaks and indentation only show up when it is printed, not when it is displayed, and which makes the HTML much easier to read.  
Note that I can't write:  
`soup = BeautifulSoup(reqText, 'html.parser').prettify()`  
as `.prettify()` returns a *string*, on which the `.children` attribute could not be used later on.  
Printing a *string* also lets me slice the result for display purposes:

In [53]:
print(soup.prettify()[:1000]) 

<!DOCTYPE doctype html>
<html>
 <head>
  <title>
   inglés - Estadísticas de Google Académico
  </title>
  <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="always" name="referrer"/>
  <meta content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2" name="viewport"/>
  <meta content="telephone=no" name="format-detection"/>
  <style>
   html,body,form,table,div,h1,h2,h3,h4,h5,h6,img,ol,ul,li,button{margin:0;padding:0;border:0;}table{border-collapse:collapse;border-width:0;empty-cells:show;}html,body{height:100%}#gs_top{position:relative;box-sizing:border-box;min-height:100%;min-width:964px;-webkit-tap-highlight-color:rgba(0,0,0,0);}#gs_top>*:not(#x){-webkit-tap-highlight-color:rgba(204,204,204,.5);}.gs_el_ph #gs_top,.gs_el_ta #gs_top{min-width:320px;}#gs_top.gs_nscl{position:fixed;width:100%;}body,td,input,button{font-size:13px;font-family:Arial,sans-serif;line-height:1.24;}body
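
To see why `.prettify()` cannot be chained onto the constructor, here is a sketch on a tiny hand-written HTML string (an assumption, not the Scholar page). It shows that `.prettify()` returns a plain `str`, while the `BeautifulSoup` object keeps navigation attributes such as `.children`.

```python
from bs4 import BeautifulSoup

sample = "<html><head><title>demo</title></head><body><p>hi</p></body></html>"

soup_demo = BeautifulSoup(sample, "html.parser")
pretty = soup_demo.prettify()

print(type(soup_demo).__name__)  # the parsed tree
print(type(pretty).__name__)     # the formatted text
print(hasattr(soup_demo, "children"), hasattr(pretty, "children"))
print(pretty[:40])               # a string can be sliced for display
```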

_____________


**BeautifulSoup** lets us build a **list** of the soup's children, with one `bs4.element.Tag` object per top-level HTML tag. This way we can select the **`html` tag**, which is the second element:

In [54]:
html = list(soup.children)[1]

print(html.prettify()[:1000])
# print(html.children.prettify()[:1000])
# print(html.children)
print('\nsoup type: {}. html type: {}'.format(type(soup), type(html))) # uses str.format() for string formatting

<html>
 <head>
  <title>
   inglés - Estadísticas de Google Académico
  </title>
  <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="always" name="referrer"/>
  <meta content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2" name="viewport"/>
  <meta content="telephone=no" name="format-detection"/>
  <style>
   html,body,form,table,div,h1,h2,h3,h4,h5,h6,img,ol,ul,li,button{margin:0;padding:0;border:0;}table{border-collapse:collapse;border-width:0;empty-cells:show;}html,body{height:100%}#gs_top{position:relative;box-sizing:border-box;min-height:100%;min-width:964px;-webkit-tap-highlight-color:rgba(0,0,0,0);}#gs_top>*:not(#x){-webkit-tap-highlight-color:rgba(204,204,204,.5);}.gs_el_ph #gs_top,.gs_el_ta #gs_top{min-width:320px;}#gs_top.gs_nscl{position:fixed;width:100%;}body,td,input,button{font-size:13px;font-family:Arial,sans-serif;line-height:1.24;}body{background:#fff;color:#
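
Why the second element? On a page that starts with a doctype, the first child of the soup is the doctype declaration and the `<html>` tag comes next. A small sketch on a hand-written document (assumed for illustration); `soup_demo.html` is a shortcut that avoids relying on the position.

```python
from bs4 import BeautifulSoup
from bs4.element import Doctype, Tag

sample = "<!doctype html><html><head><title>demo</title></head><body></body></html>"
soup_demo = BeautifulSoup(sample, "html.parser")

# Top-level nodes of the document, in order.
children = list(soup_demo.children)
for child in children:
    print(type(child).__name__)

html_by_position = children[1]   # what `list(soup.children)[1]` picks
html_by_name = soup_demo.html    # first <html> tag, found by name
print(html_by_position is html_by_name)
```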

__________________

## Acquisition of data

#### **Column by column method**

**List of (top 100) journals**:

We are going to select all elements of the class that holds the title, mainly using `.select()`.   
`.select()` takes CSS selectors, which keeps queries concise, and returns a **list** (with the bs4 version used here), while `.find_all()` returns a **bs4.element.ResultSet**:

In [55]:
journalsHtml = html.select('.gsc_mvt_t') 

display(journalsHtml[:5], type(journalsHtml), len(journalsHtml)) # display (vs. print) lists each row in a new line

[<th class="gsc_mvt_t">Publicación</th>,
 <td class="gsc_mvt_t">Nature</td>,
 <td class="gsc_mvt_t">The New England Journal of Medicine</td>,
 <td class="gsc_mvt_t">Science</td>,
 <td class="gsc_mvt_t">The Lancet</td>]

list

101

To restrict the selection to the `<td>` cells, leaving out the `<th>` header:

In [73]:
journalsHtml = html.select('td.gsc_mvt_t')  # 'td + .gsc_mvt_t' (adjacent sibling) also works here, since each title cell follows a <td>
display(journalsHtml[:5], type(journalsHtml), len(journalsHtml)) 

[<td class="gsc_mvt_t">Nature</td>,
 <td class="gsc_mvt_t">The New England Journal of Medicine</td>,
 <td class="gsc_mvt_t">Science</td>,
 <td class="gsc_mvt_t">The Lancet</td>,
 <td class="gsc_mvt_t">Chemical Society reviews</td>]

list

100

The same with `find_all`:  
`journalsHtml = html.find_all('td', class_='gsc_mvt_t')`
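
A small sketch comparing both calls on a hand-written table fragment (the cells are copied from the output above; the wrapping `<table>` and the reduced number of rows are assumptions):

```python
from bs4 import BeautifulSoup

sample = """
<table>
<tr><th class="gsc_mvt_t">Publicación</th></tr>
<tr><td class="gsc_mvt_t">Nature</td></tr>
<tr><td class="gsc_mvt_t">Science</td></tr>
</table>
"""
table = BeautifulSoup(sample, "html.parser")

by_class = table.select(".gsc_mvt_t")                    # header cell included
by_select = table.select("td.gsc_mvt_t")                 # only <td> cells
by_find_all = table.find_all("td", class_="gsc_mvt_t")   # same cells via find_all

print(len(by_class), len(by_select), len(by_find_all))
print([cell.string for cell in by_select])
print(list(by_select) == list(by_find_all))
```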

**List of h5 index**:

`a[href^="/citations"]` finds all `<a>` tags whose `href` starts with '/citations'. Make sure to use double quotes (" ") around the value.  
Other operators to remember are:  
`$=` (ends with),  
`~=` (contains that whole word).  
Here the selector looks for those links beneath the `tr` tag:

In [57]:
# h5indexHtml = html.find_all(class_='gsc_mvt_n')
h5indexHtml = html.select('tr a[href^="/citations"]')

print(len(h5indexHtml))
h5indexHtml[:5]

100


[<a class="gs_ibl gsc_mp_anchor" href="/citations?hl=es&amp;oe=ASCII&amp;vq=en&amp;view_op=list_hcore&amp;venue=H--JoiVp8x8J.2018">362</a>,
 <a class="gs_ibl gsc_mp_anchor" href="/citations?hl=es&amp;oe=ASCII&amp;vq=en&amp;view_op=list_hcore&amp;venue=IKEvlTw-e8IJ.2018">358</a>,
 <a class="gs_ibl gsc_mp_anchor" href="/citations?hl=es&amp;oe=ASCII&amp;vq=en&amp;view_op=list_hcore&amp;venue=oY2eER5-jTUJ.2018">345</a>,
 <a class="gs_ibl gsc_mp_anchor" href="/citations?hl=es&amp;oe=ASCII&amp;vq=en&amp;view_op=list_hcore&amp;venue=dj7TIF9zE7gJ.2018">278</a>,
 <a class="gs_ibl gsc_mp_anchor" href="/citations?hl=es&amp;oe=ASCII&amp;vq=en&amp;view_op=list_hcore&amp;venue=UJChSoIuvTUJ.2018">256</a>]
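
The attribute operators can be tried on a couple of hand-written links. The `href` values below are shortened, hypothetical versions of the ones in the output above, and `*=` (contains the substring) is added for completeness.

```python
from bs4 import BeautifulSoup

sample = """
<table><tr>
<td><a class="gs_ibl gsc_mp_anchor" href="/citations?venue=aaa.2018">362</a></td>
<td><a class="gsc_mp_anchor gsc_mp_tgh" href="javascript:void(0)">Índice h5</a></td>
</tr></table>
"""
links = BeautifulSoup(sample, "html.parser")

starts_with = links.select('tr a[href^="/citations"]')  # href starts with
ends_with = links.select('a[href$=".2018"]')            # href ends with
has_word = links.select('a[class~="gsc_mp_tgh"]')       # class list contains the word
has_substring = links.select('a[href*="void"]')         # href contains the substring

for name, result in [("^=", starts_with), ("$=", ends_with),
                     ("~=", has_word), ("*=", has_substring)]:
    print(name, [a.get_text() for a in result])
```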

#### Content extraction:

**`.string`** or **`.get_text()`** extracts the text of each element. It has to be applied to each **individual element**, not to a *list* (from `select`) or a *bs4.element.ResultSet* (from `find_all`):

In [75]:
journal_1 = journalsHtml[0].string # test 1 journal
print('Journal title: {}. \nVarType: {}'.format(journal_1, type(journal_1)))

Journal title: Nature. 
VarType: <class 'bs4.element.NavigableString'>
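
The two are not interchangeable. As a sketch on a hand-written row (assumed for illustration): `.string` only returns a value when the tag holds a single string (possibly nested inside a single child), whereas `.get_text()` concatenates every piece of text below the tag.

```python
from bs4 import BeautifulSoup

row = BeautifulSoup(
    '<tr><td>1.</td><td>Nature</td><td><a href="#">362</a></td></tr>',
    "html.parser",
).tr

cells = row.find_all("td")

print(cells[1].string)     # a single string inside the cell
print(cells[2].string)     # a single child <a> holding a single string
print(row.string)          # several children: no unique string
print(row.get_text())      # all text joined together
print(row.get_text("|"))   # same, with a separator between the pieces
```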


In [None]:
# journals= []
# [journals.append(journal) for journal in journalsHtml[0].string]
# print(journals[0:])
# print('Journal tittle: {}. \nVarType: {}'.format(journal, type(journal_1)))
# journal_1 = journalsHtml[0].string # test 1 journal 
# print('Journal tittle: {}. \nVarType: {}'.format(journal_1, type(journal_1)))


To get all of them, we need **to apply it to every element of the list/set**:

In [59]:
# Beware: check which `journalsHtml` assignment was executed last before running this.
lJournals = [journal.string for journal in journalsHtml]
lJournals[:6]

['Nature',
 'The New England Journal of Medicine',
 'Science',
 'The Lancet',
 'Chemical Society reviews',
 'Cell']

In [60]:
lH5indexes = [h5.string for h5 in h5indexHtml]
lH5indexes[:6]

['362', '358', '345', '278', '256', '244']

In [67]:
print(journalsHtml[1].string)  # the result depends on which `journalsHtml` assignment ran last

362


But let's stop this line of work and explore another, more direct approach that might be more worthwhile to pursue:

#### **Selecting rows (more direct method)**:

To extract all content in a row together:

In [62]:
rowsHtml = html.select('tr')
display(rowsHtml[:3])
rows = [row.text.split("\n") for row in rowsHtml] # splits on newlines; the row text contains none here, so each row stays a single string
# you might want to use `.strip()` in this step.
print('\n')
display(rows[:6])

[<tr><th class="gsc_mvt_p"></th><th class="gsc_mvt_t">Publicación</th><th class="gsc_mvt_n"><a class="gsc_mp_anchor gsc_mp_tgh" data-tg="gsc_mphm_hidx" href="javascript:void(0)">Índice h5</a></th><th class="gsc_mvt_n"><a class="gsc_mp_anchor gsc_mp_tgh" data-tg="gsc_mphm_hmed" href="javascript:void(0)">Mediana h5</a></th></tr>,
 <tr><td class="gsc_mvt_p">1.</td><td class="gsc_mvt_t">Nature</td><td class="gsc_mvt_n"><a class="gs_ibl gsc_mp_anchor" href="/citations?hl=es&amp;oe=ASCII&amp;vq=en&amp;view_op=list_hcore&amp;venue=H--JoiVp8x8J.2018">362</a></td><td class="gsc_mvt_n"><span class="gs_ibl gsc_mp_anchor">542</span></td></tr>,
 <tr><td class="gsc_mvt_p">2.</td><td class="gsc_mvt_t">The New England Journal of Medicine</td><td class="gsc_mvt_n"><a class="gs_ibl gsc_mp_anchor" href="/citations?hl=es&amp;oe=ASCII&amp;vq=en&amp;view_op=list_hcore&amp;venue=IKEvlTw-e8IJ.2018">358</a></td><td class="gsc_mvt_n"><span class="gs_ibl gsc_mp_anchor">602</span></td></tr>]





[['PublicaciónÍndice h5Mediana h5'],
 ['1.Nature362542'],
 ['2.The New England Journal of Medicine358602'],
 ['3.Science345497'],
 ['4.The Lancet278417'],
 ['5.Chemical Society reviews256366']]
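
`str.split` only cuts where the separator literally occurs, and the text of each row contains no line breaks, so every row comes back as a one-element list. A small sketch with a string copied from the output above (the multi-line variant is a made-up contrast):

```python
row_text = "1.Nature362542"

# No newline in the text: the whole string comes back as a single item.
print(row_text.split("\n"))

# With real newlines between the cells, the same call separates them.
multiline_text = "1.\nNature\n362\n542"
print(multiline_text.split("\n"))

# "/n" (slash + n) is two ordinary characters, not the newline escape "\n".
print(len("/n"), len("\n"))
```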

Separate the 4 elements inside each list:

In [63]:
# Discarded test:
# for row in rows[1:3]: 
# #     print(row)
#     dotIndex = str(row).index('.')
# #     print(dotIndex)
#     print(col1[i].append(str(row)[2:dotIndex+1]))
#     i+=1
#     print(col1)
# col1
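
One way to get the four values of each row separately, shown here as a sketch rather than as the notebook's final method, is to read the cells (`<th>`/`<td>`) of every row instead of the text of the whole row. The HTML sample is a shortened, hand-written copy of the rows displayed above.

```python
from bs4 import BeautifulSoup

sample = """
<table>
<tr><th></th><th>Publicación</th><th><a>Índice h5</a></th><th><a>Mediana h5</a></th></tr>
<tr><td>1.</td><td>Nature</td><td><a>362</a></td><td><span>542</span></td></tr>
<tr><td>2.</td><td>The New England Journal of Medicine</td><td><a>358</a></td><td><span>602</span></td></tr>
</table>
"""
table = BeautifulSoup(sample, "html.parser")

# One inner list per row, one string per cell.
cell_rows = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in table.select("tr")
]

# The first row holds the column names, the rest hold the data.
header, records = cell_rows[0], cell_rows[1:]
print(header)
for record in records:
    print(record)
```

The resulting `header` and `records` could then be passed to `pd.DataFrame(records, columns=header)` in place of the single-column rows.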

Convert to **data frame**, separating the head from the data:

In [64]:
colnames = rows[0]
data = rows[1:]

df = pd.DataFrame(data, columns=colnames)
df

| | PublicaciónÍndice h5Mediana h5 |
|---|---|
| 0 | 1.Nature362542 |
| 1 | 2.The New England Journal of Medicine358602 |
| 2 | 3.Science345497 |
| 3 | 4.The Lancet278417 |
| 4 | 5.Chemical Society reviews256366 |
| 5 | 6.Cell244366 |
| 6 | 7.Nature Communications240318 |
| 7 | 8.Chemical Reviews239373 |
| 8 | 9.Journal of the American Chemical Society236309 |
| 9 | 10.Advanced Materials235336 |


In [65]:
# # Code split into steps
# journalsHtml = html.find_all('td')
# # len(journals)
# # print(journalsHtml)
# print(journalsHtml[3].string) # `
# print(type(journalsHtml))
# print(type(journalsHtml[0]))

542
<class 'bs4.element.ResultSet'>
<class 'bs4.element.Tag'>


Get a pandas **DataFrame**:

In [66]:
dfMainJour = pd.DataFrame({
    "Journal": lJournals,
    "h5index": lH5indexes
})
dfMainJour

| | Journal | h5index |
|---|---|---|
| 0 | Nature | 362 |
| 1 | The New England Journal of Medicine | 358 |
| 2 | Science | 345 |
| 3 | The Lancet | 278 |
| 4 | Chemical Society reviews | 256 |
| 5 | Cell | 244 |
| 6 | Nature Communications | 240 |
| 7 | Chemical Reviews | 239 |
| 8 | Journal of the American Chemical Society | 236 |
| 9 | Advanced Materials | 235 |
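
The scraped values are strings, so numeric operations on `h5index` would not behave as expected until the column is converted. A minimal sketch, assuming a data frame built like `dfMainJour` from the first rows shown above (`df_demo` is a hypothetical name used only here):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Journal": ["Nature", "The New England Journal of Medicine", "Science"],
    "h5index": ["362", "358", "345"],  # strings, as returned by `.string`
})

# Convert the text column to integers so it can be sorted and aggregated numerically.
df_demo["h5index"] = pd.to_numeric(df_demo["h5index"])

print(df_demo.dtypes)
print(df_demo["h5index"].max())
```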
