## 1. Get webpage using *requests*

In [1]:
import requests

req = requests.get('https://en.wikipedia.org/wiki/Data_science')


In [2]:
req

<Response [200]>

In [3]:
#req.content

In [4]:
req.encoding

'UTF-8'

In [5]:
webpage = req.text
type(webpage)

str

In [6]:
filename = 'test.txt'
# Save the page as UTF-8 bytes so it can be re-parsed later without downloading it again
with open(filename, "wb") as f:
    f.write(webpage.encode("utf-8"))

In [7]:
#print(webpage)

## 2. Get specific contents using BeautifulSoup

In [8]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'html.parser')

### 2.1 Prettify the webpage

In [9]:
#print(soup.prettify())

### 2.2 Get the first paragraph

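The first call below has no `attrs` filter and returns interface text rather than article text. Passing `attrs={"class": False}` keeps only `<p>` tags that have no `class` attribute; try removing it to see the difference.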

In [10]:
paragraph = soup.find('p')

In [11]:
paragraph

<p>
			Pages for logged out editors <a aria-label="Learn more about editing" data-mw="interface" href="/wiki/Help:Introduction"><span>learn more</span></a>
</p>

In [12]:
#paragraph = soup.find('p', attrs={"class":False})
paragraph = soup.find_all('p', attrs={"class":False})
paragraph = paragraph[1]

In [13]:
paragraph

<p><b>Data science</b> is an <a class="mw-redirect" href="/wiki/Interdisciplinary" title="Interdisciplinary">interdisciplinary</a> academic field <sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup> that uses <a href="/wiki/Statistics" title="Statistics">statistics</a>, <a class="mw-redirect" href="/wiki/Scientific_computing" title="Scientific computing">scientific computing</a>, <a href="/wiki/Scientific_method" title="Scientific method">scientific methods</a>, processes, <a href="/wiki/Algorithm" title="Algorithm">algorithms</a> and systems to extract or extrapolate <a href="/wiki/Knowledge" title="Knowledge">knowledge</a> and insights from noisy, structured, and <a href="/wiki/Unstructured_data" title="Unstructured data">unstructured data</a>.<sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup>
</p>

### 2.3 Get all the links in this paragraph which point to other webpages

In [14]:
paragraph.find_all('a')

[<a class="mw-redirect" href="/wiki/Interdisciplinary" title="Interdisciplinary">interdisciplinary</a>,
 <a href="#cite_note-1">[1]</a>,
 <a href="/wiki/Statistics" title="Statistics">statistics</a>,
 <a class="mw-redirect" href="/wiki/Scientific_computing" title="Scientific computing">scientific computing</a>,
 <a href="/wiki/Scientific_method" title="Scientific method">scientific methods</a>,
 <a href="/wiki/Algorithm" title="Algorithm">algorithms</a>,
 <a href="/wiki/Knowledge" title="Knowledge">knowledge</a>,
 <a href="/wiki/Unstructured_data" title="Unstructured data">unstructured data</a>,
 <a href="#cite_note-2">[2]</a>]

In [15]:
data = {"title":[], "href":[]}
for link in paragraph.find_all('a', attrs={"title":True}):
    data["title"].append(link["title"])
    data["href"].append(link["href"])

In [16]:
import pandas as pd
df = pd.DataFrame(data)

In [17]:
df

|   | title | href |
|---|---|---|
| 0 | Interdisciplinary | /wiki/Interdisciplinary |
| 1 | Statistics | /wiki/Statistics |
| 2 | Scientific computing | /wiki/Scientific_computing |
| 3 | Scientific method | /wiki/Scientific_method |
| 4 | Algorithm | /wiki/Algorithm |
| 5 | Knowledge | /wiki/Knowledge |
| 6 | Unstructured data | /wiki/Unstructured_data |


## 3. Get the contents from all the webpages

In [18]:
webpages = []
head = "https://en.wikipedia.org"
for href in data["href"]:
    link = head + href
    req = requests.get(link)
    webpage = req.text
    webpages.append(webpage)

In [19]:
len(webpage)

99121
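
Concatenating `head + href` works here because every `href` in `data` starts with `/wiki/`. As a more general sketch, `urllib.parse.urljoin` from the standard library resolves a relative link against the page it came from. The `base` URL and the two sample links are taken from the cells above.

```python
from urllib.parse import urljoin

# URL of the page the links were extracted from
base = "https://en.wikipedia.org/wiki/Data_science"

# Two kinds of href seen in the paragraph: a site-relative path and a fragment
hrefs = ["/wiki/Statistics", "#cite_note-1"]

absolute = [urljoin(base, href) for href in hrefs]
for url in absolute:
    print(url)
```

The fragment link resolves back to the same page, which is one reason the loop above only follows links that carry a `title` attribute.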

## 4. Further reading

### 4.1 robots.txt

Check the website's robots.txt to find out what crawlers are allowed to access.

In [20]:
req = requests.get("https://en.wikipedia.org/robots.txt")
webpage = req.text

In [21]:
# robots.txt is plain text, so it can be printed directly without an HTML parser
#print(webpage)
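
As a rough sketch of how these rules can be checked in code, the standard library's `urllib.robotparser` can answer whether a given URL may be fetched. The rules, the `my-crawler` user agent, and the `example.org` URLs below are made up for illustration; they are not Wikipedia's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

agent = "my-crawler"  # hypothetical user-agent name
allowed = parser.can_fetch(agent, "https://example.org/wiki/Data_science")
blocked = parser.can_fetch(agent, "https://example.org/private/page")
delay = parser.crawl_delay(agent)

print(allowed, blocked, delay)
```

For a live site, the text downloaded in the previous cell could be passed to `parse` as `webpage.splitlines()` instead of the made-up rules.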

### 4.2 Sleep

You may get banned if you scrape a website too fast. Let your crawler sleep for a while after each request.

In [22]:
import time

for i in range(5):
    time.sleep(3)
    print(i)

0
1
2
3
4


### 4.3 Randomness

Pausing for exactly three seconds after each request is too robotic. Let's add some randomness to make your crawler look more like a human.

In [23]:
from random import random

for i in range(5):
    t = 1 + 2 * random()
    time.sleep(t)
    print(i)

0
1
2
3
4
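
The same idea can be wrapped in a small helper. This is only a sketch: `jittered_delay` and `polite_pause` are hypothetical names, and the default range of 1 to 3 seconds mirrors the cell above.

```python
import random
import time

def jittered_delay(low=1.0, high=3.0):
    """Return a random pause length in seconds between low and high."""
    return random.uniform(low, high)

def polite_pause(low=1.0, high=3.0):
    """Sleep for a random interval and return how long the pause was."""
    delay = jittered_delay(low, high)
    time.sleep(delay)
    return delay

# Short range here so the demo finishes quickly; use seconds-long pauses when crawling
for i in range(3):
    waited = polite_pause(0.01, 0.05)
    print(i, round(waited, 3))
```

Calling `polite_pause()` with its defaults after each request gives the same 1 to 3 second spread as `1 + 2 * random()`.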


### 4.4 Separate the scraping code from the data extraction code

1. Scraping is the more fragile step. Nothing is more annoying than having your crawler break because of a bug in the data extraction part.
2. You never know what data you will need for modeling, so keep all the webpages you obtain.
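
One way to act on both points, sketched under assumptions: the crawler only writes raw HTML to disk, and a separate step reads the files back for extraction. The function names, the `raw_pages` directory, and the sample HTML are hypothetical.

```python
import tempfile
from pathlib import Path

def save_pages(pages, out_dir):
    """Scraping stage: write each raw page to disk without parsing it."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, html in pages.items():
        (out_dir / f"{name}.html").write_text(html, encoding="utf-8")

def load_pages(out_dir):
    """Extraction stage: read the saved pages back, independent of the crawler."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(out_dir).glob("*.html"))
    }

# Stand-in for downloaded pages (in the notebook these would come from req.text)
pages = {
    "Statistics": "<html><p>stats</p></html>",
    "Algorithm": "<html><p>algo</p></html>",
}

# A temporary directory keeps the demo self-contained
out_dir = Path(tempfile.mkdtemp()) / "raw_pages"
save_pages(pages, out_dir)
restored = load_pages(out_dir)
print(sorted(restored))
```

The extraction step can then be rerun with BeautifulSoup on the saved files as often as needed without sending new requests.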

### 4.5 Chrome Driver and Selenium

#### 4.5.1 Start Chrome Service

In [24]:
from selenium.webdriver.chrome.service import Service

In [25]:
service = Service(r"C:\Users\jingd\OneDrive\Documents\DownloadPrograms\chromedriver\chromedriver.exe")
service.start()

#### 4.5.2 Define driver

In [26]:
from selenium import webdriver

In [27]:
driver = webdriver.Remote(service.service_url)

#### 4.5.3 Get Webpage

In [28]:
driver.get("http://www.indeed.com/")

#### 4.5.4 Input position

In [29]:
elem = driver.find_element("id", "text-input-what")
elem.clear()
elem.send_keys("data scientist")

#### 4.5.5 Return

In [30]:
from selenium.webdriver.common.keys import Keys

In [31]:
elem.send_keys(Keys.RETURN)

#### 4.5.6 Get current link

These tools make your crawler act even more like a human.

In [32]:
print(driver.current_url)

https://www.indeed.com/jobs?q=data+scientist&l=Houston%2C+TX&from=searchOnHP&vjk=de98f6b1bc8d004a


#### 4.5.7 Quit Driver

In [33]:
driver.quit()

In [34]:
import time
from selenium import webdriver

# DeprecationWarning: executable_path has been deprecated, please pass in a Service object
#driver = webdriver.Chrome(r"C:\Users\jingd\OneDrive\Documents\DownloadPrograms\chromedriver\chromedriver.exe")  # Optional argument, if not specified will search path.

ser = Service(r"C:\Users\jingd\OneDrive\Documents\DownloadPrograms\chromedriver\chromedriver.exe")

op = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=ser, options=op)

driver.get('http://www.google.com/')

time.sleep(5) # Let the user actually see something!

search_box = driver.find_element("name", "q")

search_box.send_keys('Techlent')

search_box.submit()

time.sleep(5) # Let the user actually see something!

driver.quit()