# Crash Web Scrape Notes (Python-BeautifulSoup)

### =======Syllabus=======
__1. Introduction to Web Scrape__

__2. Python Libraries__

__3. Checking The Status Of a Web Site__

    3.1. Status Codes
    3.2. Header Editing
__4. Editing URLs__
    
    4.1. Adding Standard Parameters To a URL
    4.2. Google Search URL Parameters
    4.3. Adding Variable-Parameters To a URL
    4.4. Finding The Parameters In URLs
__5. Gathering Content__
    
    5.1. General Content
    5.2. Specific Content
    5.3. CAPSTONE-1
    5.4. BONUS: Scrape Tables with Pandas

__6. Some Other Useful Information__
    
    6.1. Adding Cookies To Requests
    6.2. Scraping via Proxies

__7. CAPSTONE-2__



### 1. Introduction to Web Scrape

Web scraping is an efficient way to extract data from open sources (websites). It is an automated process that consists of:
- reaching a website (URL) through a browser (e.g. the Selenium library for Python) or directly over HTTP (e.g. the Requests library for Python, with BeautifulSoup to parse the response)
- getting past the CAPTCHA or any other security measures (if present)
- finding the relevant (wanted) data
- gathering the data
- saving the data in a structured form (a database, a .CSV file, etc.)
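
The steps above can be sketched end to end as follows. This is a minimal illustration, assuming the page has already been downloaded: the `html` string, the `cities` id and the column names are made up for the example, and the CSV is written to an in-memory buffer instead of a file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Stand-in for a downloaded page (assumption: in practice this would be r.text)
html = """
<ul id="cities">
  <li><a href="/amsterdam">Amsterdam</a></li>
  <li><a href="/utrecht">Utrecht</a></li>
</ul>
"""

# Find the relevant data: every link inside the list
soup = BeautifulSoup(html, 'html.parser')
city_list = soup.find('ul', {'id': 'cities'})
rows = [(a.text.strip(), a['href']) for a in city_list.find_all('a')]

# Save the data in a structured form (CSV text held in memory here)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['name', 'link'])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

In a real run, `html` would come from a `requests.get(url)` call and the buffer would be replaced by an opened file.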

<img src="https://github.com/msklc/crash_web_scrape_notes/blob/master/images/web_scrape_schema.jpg?raw=true">

__WARNING:__

__Web scraping is not illegal in itself. But before scraping any data, make sure to check the site's "terms of service". Scraping in violation of them may break the law.__

Two main issues in web scraping:
- Editing the URL
- Finding the content location for gathering 

<img src="https://github.com/msklc/crash_web_scrape_notes/blob/master/images/2main_issue.jpg?raw=true">

### 2. Python Libraries

__Directly via HTTP__
- [Requests](https://requests.readthedocs.io/en/master/)
- [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- urllib
- Scrapy
- LXML
- Pandas (yes, really: see 5.4)

__via Browser__
- Selenium

In [None]:
import requests
import bs4
from bs4 import BeautifulSoup

print('requests version:', requests.__version__)
print('BeautifulSoup version:', bs4.__version__)

### 3. Checking The Status Of a Web Site

__3.1. Status Codes__
- 200 : OK (Successful Connection)
- 3xx : Redirection
- 400 : Bad Request
- 401 : Unauthorized
- 403 : Forbidden
- 404 : Not Found
- 5xx : Server Error
    - 500 : Internal Server Error
    - 501 : Not Implemented
    - 502 : Bad Gateway
    - 503 : Service Unavailable
    - 504 : Gateway Timeout
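
The code families above can also be looked up in code. A small sketch using only the standard library's `http.HTTPStatus`; the helper name `describe_status` and the wording of the family labels are assumptions made for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    families = {2: 'success', 3: 'redirection', 4: 'client error', 5: 'server error'}
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that the standard library does not know
        phrase = 'Unknown'
    family = families.get(code // 100, 'other')
    return '{} {} ({})'.format(code, phrase, family)


for code in (200, 403, 404, 503):
    print(describe_status(code))
```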

__Example__

In [None]:
import requests
url='http://www.google.com'
r=requests.get(url)
r.status_code

__Another Example__

In [None]:
import requests
url_list=['http://www.deeploai.com', 'http://worldagnetwork.com', 'http://www.deeploai.com/notfound.php']
for url in url_list:
    r=requests.get(url)
    print('{} : {}'.format(url,r.status_code))
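
A loop like this stops at the first URL that cannot be reached at all (DNS failure, timeout, malformed URL), because `requests.get` raises an exception in those cases. A possible defensive variant is sketched below; the helper name `fetch_status` and the 10 second timeout are assumptions for this example.

```python
import requests


def fetch_status(url, timeout=10):
    """Return the HTTP status code, or None if the request could not be completed."""
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.exceptions.RequestException as error:
        # Covers malformed URLs, DNS failures, refused connections and timeouts
        print('{} : request failed ({})'.format(url, type(error).__name__))
        return None


# A URL without a scheme fails before anything is sent over the network
result = fetch_status('deeploai-without-scheme')
print(result)
```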

__Question 1:__
- What causes a 403 error? Is it possible to get around it?

__3.2. Header Editing__

In [None]:
import requests
url = 'http://worldagnetwork.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
r=requests.get(url, headers=headers)
print('Request via normal browser: {}'.format(r.status_code))
mobile_headers = {'User-Agent' : 'Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B137 Safari/601.1'}
r=requests.get(url, headers=mobile_headers)
print('Request via mobile browser: {}'.format(r.status_code))
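
To see why the header matters, it helps to look at what `requests` announces when no `User-Agent` is given, and at the headers a request would carry, without sending anything. A small sketch; the variable names are chosen for this example only.

```python
import requests

# The User-Agent that requests announces when none is supplied
default_ua = requests.utils.default_user_agent()
print(default_ua)  # begins with 'python-requests/'

# Prepare (but do not send) a request to inspect the headers that would go out
browser_ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36')
prepared = requests.Request('GET', 'http://worldagnetwork.com/',
                            headers={'User-Agent': browser_ua}).prepare()
print(prepared.headers['User-Agent'])

# A Session applies the same headers to every request made through it
session = requests.Session()
session.headers.update({'User-Agent': browser_ua})
```

Some sites reject the default `python-requests` agent, which is one common reason for a 403 response.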

### 4. Editing URLs

__4.1. Adding Standard Parameters To a URL__

In [None]:
import requests
url='http://www.domainname.com/'
payload = {'key1': 'value1', 'key2': 'value2'}
r=requests.get(url, params=payload)
print(r.url)
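
The final URL can also be inspected without contacting the server, which is handy while experimenting with parameters. A minimal sketch that prepares the request instead of sending it:

```python
import requests

url = 'http://www.domainname.com/'
payload = {'key1': 'value1', 'key2': 'value2'}

# prepare() builds the final URL but sends nothing over the network
prepared = requests.Request('GET', url, params=payload).prepare()
print(prepared.url)
```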

__Question-2:__

How can you build the URLs so that each one contains only one key-value pair from the payload?

In [None]:
import requests
url='http://www.domainname.com/'
payload = {'key1': 'value1', 'key2': 'value2'}
for k,v in payload.items():
    r=requests.get(url, params={k:v})
    print(r.url)

__4.2. Google Search URL Parameters__

Basic URL: http://www.google.com/search?

- Single Keyword Query: __q=__deeploai

- Multiple Keyword Query: __q=__deeploai+netherlands

- Keyword(s) in Quotes: __as_epq=__deeploai+netherlands

- Limit the Result Number: __num=__100

- File Type: __as_filetype=__pdf

- Search in Specific Web Site: __as_sitesearch=__deeploai.com

- Search in Specific Time Duration: 

    - The previous 24 hours : __as_qdr=d__
    - The previous seven days : __as_qdr=w__
    - The previous month : __as_qdr=m__
    - The previous 3 months : __as_qdr=m3__
    - Past year: __as_qdr=y__

[Detail For Google URL Parameters](https://moz.com/blog/the-ultimate-guide-to-the-google-search-parameters)


__Example__

In [None]:
import requests
url='http://www.google.com/search?'
payload = {'q': 'deeploai', 'as_qdr': 'w'}
r=requests.get(url, params=payload)
print(r.url)
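
The query string can also be built with the standard library, which shows how a multi-word keyword ends up joined by `+`. A short sketch; the parameter values are just picked from the list above for illustration.

```python
from urllib.parse import urlencode

base_url = 'http://www.google.com/search?'
payload = {'q': 'deeploai netherlands', 'as_filetype': 'pdf', 'num': 100}

# urlencode escapes each value and joins the pairs with '&'
query = urlencode(payload)
search_url = base_url + query
print(search_url)
```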

__4.3. Adding Variable-Parameters To a URL__

Get URLs with keywords from list

In [None]:
import requests
keywords=['data+scientist','data+engineer','data+analist']
for keyword in keywords:
    url='https://www.indeed.nl/job?q={}'.format(keyword)
    r=requests.get(url)
    print(r.url)

In [None]:
#or
import requests
keywords=['data+scientist','data+engineer','data+analist']
for n in range(len(keywords)):
    url='https://www.indeed.nl/job?q={}'.format(keywords[n])
    r=requests.get(url)
    print(r.url)

Get URLs with page numbers

In [None]:
import requests
keywords=['pandas']
for n in range(1,11):
    url='https://stackoverflow.com/questions/tagged/{}?tab=newest&page={}'.format(keywords[0],n)
    r=requests.get(url)
    print(r.url)
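
Keywords and page numbers can be combined in one pass. The sketch below only builds the URLs and sends nothing; the `example.com` URL pattern is hypothetical and stands in for whatever pattern the target site uses.

```python
from itertools import product
from urllib.parse import quote_plus

# Hypothetical URL pattern, used only to illustrate the combination
template = 'https://www.example.com/jobs?q={}&page={}'
keywords = ['data scientist', 'data engineer']
pages = range(1, 4)

# quote_plus turns the space in each keyword into '+'
urls = [template.format(quote_plus(keyword), page)
        for keyword, page in product(keywords, pages)]

for url in urls:
    print(url)
```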

__4.4. Finding The Parameters In URLs__

What are the URL parameters of a query on 'https://www.internationalparceltracking.com'?

We get the parameters from the browser Developer Console:
- Developer Console >> Network Tab >> Headers Tab >> Query String Parameters (For Chrome)
- Web Developer >> Network Tab >> Params Tab (For Firefox)


<img src="https://github.com/msklc/crash_web_scrape_notes/blob/master/images/browser_DeveloperConsole.jpg?raw=true">

__Question-3:__

What is the PostNL tracking URL for barcode '123456789' with postal code '99999' from France?

In [None]:
import requests
url='https://www.internationalparceltracking.com/api/shipment?barcode=123456789&checkIfValid=true&country=FR&language=en&postalCode=99999'
r=requests.get(url)
r.status_code
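
Going the other way, an existing URL can be split into its parameters with the standard library, as a complement to reading them in the browser console. A minimal sketch using the URL from the cell above:

```python
from urllib.parse import urlparse, parse_qs

url = 'https://www.internationalparceltracking.com/api/shipment?barcode=123456789&checkIfValid=true&country=FR&language=en&postalCode=99999'

parts = urlparse(url)
params = parse_qs(parts.query)  # maps each key to a list of values

print(parts.netloc)
print(parts.path)
for key, values in params.items():
    print(key, '=', values[0])
```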

### 5. Gathering Content

__5.1. General Content__

In [None]:
#List of municipalities of the Netherlands at wikipedia
import requests
url='https://en.wikipedia.org/wiki/List_of_municipalities_of_the_Netherlands'
r=requests.get(url)
r.content

__It is not easy to gather the relevant data from this raw output!__

So we prefer BeautifulSoup, which makes it easy to get the relevant data:

In [None]:
import requests
from bs4 import BeautifulSoup

url='https://en.wikipedia.org/wiki/List_of_municipalities_of_the_Netherlands'
r=requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')
soup

__5.2. Specific Content__

We need to understand the definition of __tags__ in HTML

<img src="https://github.com/msklc/crash_web_scrape_notes/blob/master/images/tags_of_HTML.jpg?raw=true">

| Tag | Description |
| --- | --- |
| __div__ | Division/Section in a page | 
| __table__ | Defines a table | 
| __th__ | Defines a header cell in a table | 
| __tr__ | Defines a row in a table | 
| __td__ | Defines a cell in a table | 	
| __span__ | Generic inline container |
| __a__ | Defines a hyperlink |
| __ul__ | Defines an unordered list |
| __li__ | Defines each list item |
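
The tags in the table can be tried out on a small snippet before touching a real page. In this sketch the `html` string is made up for the example; only the tag names come from the table.

```python
from bs4 import BeautifulSoup

# Made-up HTML that uses the tags from the table
html = """
<div class="content">
  <table>
    <tr><th>City</th><th>Province</th></tr>
    <tr><td><a href="/wiki/Utrecht">Utrecht</a></td><td>Utrecht</td></tr>
  </table>
  <ul>
    <li>first item</li>
    <li>second <span>item</span></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

headers = [th.text for th in soup.find_all('th')]  # header cells
cells = [td.text for td in soup.find_all('td')]    # data cells
link = soup.find('a')                              # first hyperlink
items = [li.text for li in soup.find_all('li')]    # list items

print(headers, cells, link['href'], items)
```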

<img src="https://github.com/msklc/crash_web_scrape_notes/blob/master/images/browser_inspector.jpg?raw=true">

__Example:__

Find all hyperlinks in the previous page

In [None]:
import requests
from bs4 import BeautifulSoup

url='https://en.wikipedia.org/wiki/List_of_municipalities_of_the_Netherlands'
r=requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')

links = soup.find_all('a')
print('Total hyperlinks in the page:',len(links))
#print(links)

__Example:__

Find all __http__ hyperlinks in the previous page

In [None]:
import requests
from bs4 import BeautifulSoup

url='https://en.wikipedia.org/wiki/List_of_municipalities_of_the_Netherlands'
r=requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')

links = soup.find_all('a')

for link in links:
    href=link.get('href')
    try:
        if 'http' in href:
            print(href)
    except TypeError:
        # raised when the membership test cannot be applied to href
        pass

__Question-3:__

Why do we need to use __try__ in the previous example?

__Example:__

Gather the specific string: __25,445__

In [30]:
#find vs find_all
import requests
from bs4 import BeautifulSoup

url='https://en.wikipedia.org/wiki/List_of_municipalities_of_the_Netherlands'
r=requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')

tag=soup.find('span',{'data-sort-value':"7004254450000000000♠"}).text
tag

'25,445'

__Example:__
    
(The population figure in the example above is updated often, so the following lookup is likely to keep working for longer.)

Gather the specific string: __'Aa en Hunze'__


In [43]:
#or we can pull another string
#find vs find_all
import requests
from bs4 import BeautifulSoup

url='https://en.wikipedia.org/wiki/List_of_municipalities_of_the_Netherlands'
r=requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')

tag=soup.find('th',{'scope':"row"}).find("a").text
tag

'Aa en Hunze'
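
The difference between `find` and `find_all`, mentioned in the cell comments, is easiest to see on a tiny made-up snippet (the `html` string below is an assumption for this example):

```python
from bs4 import BeautifulSoup

html = '<p class="note">one</p><p class="note">two</p>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('p', {'class': 'note'})      # a single Tag: the first match
every = soup.find_all('p', {'class': 'note'})  # a list-like result with all matches
missing = soup.find('table')                   # None when nothing matches
nothing = soup.find_all('table')               # empty when nothing matches

texts = [p.text for p in every]
print(first.text, texts, missing, len(nothing))
```

Because `find` returns `None` when there is no match, chaining `.text` or another `.find(...)` onto it fails once the page layout changes, so a check for `None` is worth adding in longer scripts.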

__Example:__

Gather all the municipality names

In [None]:
import requests
from bs4 import BeautifulSoup

url='https://en.wikipedia.org/wiki/List_of_municipalities_of_the_Netherlands'
r=requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')

table=soup.find('table',{'class':'wikitable plainrowheaders sortable'})
table

In [None]:
import requests
from bs4 import BeautifulSoup

url='https://en.wikipedia.org/wiki/List_of_municipalities_of_the_Netherlands'
r=requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')

table=soup.find('table',{'class':'wikitable plainrowheaders sortable'})

municipalities=table.find_all('th')

for row in municipalities:
    print(row.text.strip())

__5.3. CAPSTONE-1:__

- Gather __all detail data__ of Netherlands municipalities from [wikipedia](https://en.wikipedia.org/wiki/List_of_municipalities_of_the_Netherlands)
- Save them to a CSV file

Hint:

In [None]:
import requests
from bs4 import BeautifulSoup

url='https://en.wikipedia.org/wiki/List_of_municipalities_of_the_Netherlands'
r=requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')

table=soup.find('table',{'class':'wikitable plainrowheaders sortable'})
rows=table.find_all('tr')[1:]

for row in rows:
    municipality=row.find('th').text.strip()
    # Collect the data cells once instead of searching the row for every column
    cells=[td.text.strip() for td in row.find_all('td')]
    cbs_code=cells[0]
    province=cells[1]
    population=cells[2]
    pop_density=cells[3]
    land_area=cells[4]
    for value in (municipality,cbs_code,province,population,pop_density,land_area):
        print(value)
    print('=======')
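
For the saving step, the standard `csv` module is enough. A sketch with placeholder rows: the values below are invented stand-ins, not real municipality data, and `io.StringIO` is used here in place of a real file.

```python
import csv
import io

# Placeholder rows (assumption: in the capstone these come from the table loop)
fieldnames = ['municipality', 'cbs_code', 'province']
rows = [
    {'municipality': 'Municipality A', 'cbs_code': 'GM0001', 'province': 'Province X'},
    {'municipality': 'Municipality B', 'cbs_code': 'GM0002', 'province': 'Province Y'},
]

# io.StringIO stands in for open('municipalities.csv', 'w', newline='', encoding='utf-8')
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

lines = output.getvalue().splitlines()
print(output.getvalue())
```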


__5.4. BONUS: Scrape Tables with Pandas__

In [None]:
import pandas as pd
url='https://en.wikipedia.org/wiki/List_of_municipalities_of_the_Netherlands'
table=pd.read_html(url,match='Aa en Hunze')
table[0]

### 6. Some Other Useful Information

__6.1. Adding Cookies To Requests__

__Question-4:__

- What causes a 403 error? Is it possible to get around it?

We get the cookies from the browser Developer Console:
- Developer Console >> Application Tab >> Cookies Tab (For Chrome)
- Web Developer >> Storage Tab >> Cookies Tab (For Firefox)


<img src="https://github.com/msklc/crash_web_scrape_notes/blob/master/images/sessionID_chrome.jpg?raw=true">

In [None]:
import requests
url='https://www.deeploai.com'
cookie = {'PHPSESSID': 'XXXX','ZHE':'YYY'}
r = requests.get(url, cookies=cookie)
print(r.status_code)
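
When several requests need the same cookies, a `requests.Session` can hold them. The sketch below only prepares a request to show the `Cookie` header that would be sent; the cookie values are the same placeholders as above.

```python
import requests

session = requests.Session()
# Placeholder cookie values, copied from the browser in a real run
session.cookies.update({'PHPSESSID': 'XXXX', 'ZHE': 'YYY'})

# Build (but do not send) a request to see the Cookie header that would go out
prepared = session.prepare_request(requests.Request('GET', 'https://www.deeploai.com'))
cookie_header = prepared.headers['Cookie']
print(cookie_header)
```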

__6.2. Scraping via Proxies__

In [None]:
import requests
url='https://www.deeploai.com'
proxy = 'http://110.74.209.202:51491' #free proxy address
proxies = {'http': proxy, 'https': proxy} #the URL is https, so an 'https' entry is needed

r = requests.get(url, proxies=proxies)
print(r.status_code)
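
The keys of the `proxies` dictionary are matched against the scheme of the requested URL, so an `https://` URL is only routed through a proxy when there is an `https` entry. The helper below is a simplified illustration of that lookup, not the code `requests` itself runs (the real lookup also considers host-specific keys and environment variables); `proxy_for` and the `192.0.2.10` address are placeholders.

```python
from urllib.parse import urlparse


def proxy_for(url, proxies):
    """Simplified lookup: return the proxy registered for the URL's scheme, if any."""
    return proxies.get(urlparse(url).scheme)


# Placeholder proxy address (documentation range, not a working proxy)
http_only = {'http': 'http://192.0.2.10:8080'}
both = {'http': 'http://192.0.2.10:8080', 'https': 'http://192.0.2.10:8080'}

print(proxy_for('https://www.deeploai.com', http_only))  # None: no entry for https
print(proxy_for('https://www.deeploai.com', both))
```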

### 7. CAPSTONE-2

- Scrape all movies from Angola on IMDB
    - Movie Name
    - Production Year
    - IMDB Rank
- Save them to a CSV file

(Hint: Use the IMDB Advanced Search page https://www.imdb.com/search/title/)

Good Luck!!!