# Web scraping with Python


Web scraping is a technique for automatically accessing and extracting large amounts of information from a website, which can save a great deal of time and effort.


#### Important notes about web scraping:
Read through the website’s Terms and Conditions to understand how you can legally use the data. Many sites prohibit using the data for commercial purposes.

Make sure you are not downloading data too rapidly, because this may overload the website, and you may be blocked from the site as well (to avoid this, we will use the time module to pause between requests).
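
As a minimal sketch of the rate-limiting idea, the helper below pauses between consecutive requests using `time.sleep`. The names `fetch_politely` and `fake_fetch`, the example URLs, and the delay value are assumptions made for illustration; in real use `fetch` would typically be `requests.get`, and a suitable delay depends on the site.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Hypothetical stand-in for requests.get so the sketch runs without network access.
def fake_fetch(url):
    return f"fetched {url}"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(pages)
```

Passing the fetch function as an argument keeps the throttling logic separate from the HTTP library, so the same loop can be tried out offline.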


## First step: inspect the website

The first thing we need to do is figure out where the content we want (for example, links to files to download) sits inside the multiple levels of HTML tags.
On the website, right-click and choose “Inspect” (or “View page source”, depending on the browser). This lets you see the raw HTML behind the site.

You can also load the page and 'prettify' it in Python (with BeautifulSoup).




The following modules will be used:

#### Requests

A module for sending HTTP requests to websites and reading their responses.

requests.get(url)

When a website expects query arguments, you can pass them as a dictionary via the params argument:

payload = {'key1': 'value1', 'key2': ['value2', 'value3']}

r = requests.get('https://zzz.org/get', params=payload)

print(r.url)

https://zzz.org/get?key1=value1&key2=value2&key2=value3



The .text property returns the response body as text.

The .json() method returns the response body parsed as JSON.

requests can also handle forms, cookies, etc.
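
The query-string encoding described above can be checked without sending anything over the network. This is a small sketch that builds the request with `requests.Request(...).prepare()` and inspects the resulting URL; the host `zzz.org` is the placeholder used in the text, not a real endpoint.

```python
import requests

payload = {'key1': 'value1', 'key2': ['value2', 'value3']}

# Build the request without sending it, so no network access is needed.
prepared = requests.Request('GET', 'https://zzz.org/get', params=payload).prepare()

# A list value is expanded into one key=value pair per element.
print(prepared.url)
```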


#### BeautifulSoup (bs4)

A Python library for pulling data out of HTML and XML files.

It parses HTML trees and can display their nested structure:

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())  # shows the nested HTML tree

It works well for navigating tags:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)  # bs4.element.Tag

tag.name  # name of the tag ('b' in this example)
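
A self-contained version of the tag example, as a minimal sketch: it parses the same one-line HTML snippet and reads the tag's name, its `class` attribute, and its text. The `'html.parser'` argument is passed explicitly so the parser choice does not depend on which optional parsers are installed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

print(tag.name)      # the tag's name
print(tag['class'])  # class is a multi-valued attribute, so it comes back as a list
print(tag.string)    # the text inside the tag
```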

### Example

In this example we start with the Symantec virtual patent marking website:

https://www.symantec.com/en/uk/about/legal/virtual-patent




In [None]:
import requests
#import urllib.request
import time
from bs4 import BeautifulSoup

url = "https://www.symantec.com/en/uk/about/legal/virtual-patent"
response = requests.get(url)

response  # <Response [200]> means the request succeeded

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')
# skip the first 155050 characters of the prettified page and print the rest
print(soup.prettify()[155050:])



## Second step: choose the tags

Decide which tags to use for parsing the 'soup'.

The find_all(TAG) method (findAll is its older alias) returns a list of all the elements with a given tag.

One common task is extracting all the URLs found within a page’s <a> tags:

for link in soup.find_all('a'):

    print(link.get('href'))
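
The link-extraction loop above can be tried on a small inline document. This is a sketch with made-up HTML (the `example.com` URLs are placeholders); it also shows that `link.get('href')` returns `None` for an `<a>` tag without an `href` attribute, so such entries may need filtering.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, used only for illustration.
html_doc = """
<html><body>
  <p><a href="https://example.com/one">First</a></p>
  <p><a href="/two">Second</a> and <a>no href here</a></p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# .get('href') returns None when the attribute is missing.
hrefs = [link.get('href') for link in soup.find_all('a')]
print(hrefs)

# Keep only the links that actually carry a URL.
with_href = [h for h in hrefs if h is not None]
print(with_href)
```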
    
In our case we are looking for paragraphs that contain the patent list.    
    

In [None]:
soup.find_all('p')[43:]  # paragraphs from index 43 onward


Another useful property is .contents

It returns what is inside the tag as a list, to account for possibly nested tags.

The first element of the output above

[<p><b>Norton Core:</b> Protected by U.S. Patents D791,768, and 9,565,192. Additional patents may be pending in the U.S. and elsewhere.</p>,

will thus be split into two sub-elements in the list.
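
To make the split concrete, here is a minimal sketch that parses the paragraph quoted above on its own and inspects `.contents`. The variable names are illustrative; the HTML string is the one shown in the text.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p><b>Norton Core:</b> Protected by U.S. Patents D791,768, and 9,565,192. '
    'Additional patents may be pending in the U.S. and elsewhere.</p>'
)
p_tag = BeautifulSoup(html_doc, 'html.parser').p

# .contents holds the direct children: the <b> tag and the text that follows it.
print(len(p_tag.contents))

title, patent_text = p_tag.contents
print(title.get_text())         # text inside the <b> tag
print(str(patent_text).strip())  # the trailing text, without the leading space
```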





In [None]:
vps = []

# Collect [title, patent text] pairs from the 'p' (paragraph) tags.
# The first paragraph (index 0) is skipped.
for one_p_tag in soup.find_all('p')[1:]:
    if len(one_p_tag.contents) > 1:   # only lines with a title and a list of patents
        vps.append([one_p_tag.contents[0], one_p_tag.contents[1]])

vps

In [None]:
# now we remove html tags using the .text property
# element 0 is first turned into a string via the .join() method
# (the "lxml" parser requires the lxml package to be installed)

for vp in vps:
    if vp:
        vp[0] = BeautifulSoup(''.join(vp[0]), "lxml").text

vps[:2]   # first 2 elements only

In [None]:
# here we split each patent string into a list
for vp in vps:
    pats = vp[1].split(', ')
    print(pats)

In [None]:
# here we create a list splitting the patents and 
# a new list with item + patent

vp_split = []

for vp in vps:
    pats = vp[1].split(', ')
    for pat in pats:
        vp_split.append([vp[0], pat])
    
vp_split[:10]   # leading 10 records only
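
Splitting on `', '` keeps surrounding words such as "Protected by U.S. Patents" attached to the numbers. As a possible alternative (not the approach used in the cells above), a regular expression can pull out only the patent numbers. The pattern below is an assumption based on the two formats visible in the sample paragraph (`D791,768` and `9,565,192`) and may need adjusting for other number formats.

```python
import re

# Assumed formats: design patents like D791,768 and utility patents like 9,565,192.
PATENT_PATTERN = re.compile(r'\b(?:D\d{3},\d{3}|\d{1,2},\d{3},\d{3})\b')

text = (
    ' Protected by U.S. Patents D791,768, and 9,565,192. '
    'Additional patents may be pending in the U.S. and elsewhere.'
)

patents = PATENT_PATTERN.findall(text)
print(patents)

# Same [item, patent] row layout as vp_split.
rows = [['Norton Core:', patent] for patent in patents]
print(rows)
```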

In [None]:
# last step: put all in a dataframe for better management


import pandas as pd

feature_list = ['item', 'patents']

vps_df = pd.DataFrame(vp_split, columns=feature_list)

vps_df.head(5)
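
Once the rows are in a dataframe, summaries become one-liners. The sketch below counts patents per item on a tiny hand-made sample; the "Example Product:" row and its patent number are placeholders invented for illustration, not data from the website.

```python
import pandas as pd

vp_split = [
    ['Norton Core:', 'D791,768'],
    ['Norton Core:', '9,565,192'],
    ['Example Product:', '0,000,000'],  # placeholder row, not real data
]

vps_df = pd.DataFrame(vp_split, columns=['item', 'patents'])

# Number of patents listed for each item.
counts = vps_df.groupby('item')['patents'].count()
print(counts)
```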

### Other websites with virtual patent (VP) marking:

https://www.3m.com/3M/en_US/company-us/patent/

https://www.honeywellaidc.com/en-sg/working-with-us/patents 

https://www.pg.com/patents/brands.shtml