# Conceptual Background

## Data Collection

In this lecture, we will look at examples of how to extract data from online resources.  
In particular, we will examine web scraping and API use cases.

## A General Pipeline

*Data Collection*
> The process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer queries, stated research questions, test hypotheses, and evaluate outcomes.

<div>
<img src="https://miro.medium.com/max/1200/1*ZWcBynyugbLpWcU3QWH7Tg.jpeg" alt="project-flow" width="500" height="600"/>
</div>


## Web Scraping & APIs

A data scientist doesn’t always get data handed to them in a CSV or an easily accessible database. In those cases, you need to manually extract the data from various resources. To this end, we have specialized tools. 

For instance, many web sources, such as IMDb, provide a set of protocols and methods that let outside clients interact with their database. These protocols and methods are exposed as an **API** (Application Programming Interface). APIs appear in many contexts, such as operating systems or, as here, web development. The idea is to offer an outward-facing interface to anyone who wishes to access a set of resources. In our case, that resource is a dataset.

However, there are cases in which no API exists. The desired data is then embedded in the raw HTML file, enclosed by various tags, and we need to parse the document to extract it. This is where **web scraping** comes in: the HTML file is parsed and stored as a tree, which preserves the hierarchical relationship between tags.



## Web Scraping

Web scraping is a technique to automatically access and extract large amounts of information from a website, which can save a huge amount of time and effort.


![](https://pbs.twimg.com/media/EGwqy2OWwAAi6-F?format=jpg&name=small)

## Working with APIs

The term API is an acronym, and it stands for "**Application Programming Interface**". Think of an API like a menu in a restaurant. The menu provides a list of dishes you can order, along with a description of each dish. When you specify what menu items you want, the restaurant’s kitchen does the work and provides you with some finished dishes. You don’t know exactly how the restaurant prepares that food, and you don’t really need to.

![](https://miro.medium.com/max/1200/1*3h95bN2_xe-eitwHh_Ygvw.png)

# 1. Requests Library

## HTTP Requests

HTTP stands for Hypertext Transfer Protocol and is used to structure requests and responses over the internet. It defines how data is transferred from one point to another over the network. You may think of it as the common language that the devices on both sides of the connection must follow in order to communicate.


|Command (HTTP method)|CRUD Operation|Sample Endpoint|Description|
|---|---|---|---|
|get (GET)|Read (Retrieve)|http://example.com/resources/item17|Retrieve a representation of the addressed member of the collection, expressed in an appropriate Internet media type.|
|post (POST)|Create|http://example.com/resources/|Create a new entry in the collection. The new entry's URL is assigned automatically and is usually returned by the operation.|
|put (PUT)|Update|http://example.com/resources/item17|Replace the addressed member of the collection, or if it doesn't exist, create it.|
|delete (DELETE)|Delete (Destroy)|http://example.com/resources/item17|Delete the addressed member of the collection.|
||||**Table 1 Methods and sample endpoints.**|
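
To make the mapping in Table 1 concrete, here is a minimal sketch that builds one request object per CRUD operation without sending anything over the network. It uses the standard library's `urllib.request.Request` purely for illustration; the JSON payload is a hypothetical body, and the endpoints are the sample ones from the table.

```python
import json
from urllib.request import Request

BASE = "http://example.com/resources/"  # sample endpoint from Table 1

# Hypothetical request body; a real API defines its own schema
payload = json.dumps({"name": "item17"}).encode("utf-8")

# One request object per CRUD operation; nothing is sent here
crud_requests = {
    "create": Request(BASE, data=payload, method="POST"),
    "read": Request(BASE + "item17", method="GET"),
    "update": Request(BASE + "item17", data=payload, method="PUT"),
    "delete": Request(BASE + "item17", method="DELETE"),
}

for operation, req in crud_requests.items():
    print(f"{operation:>6}: {req.get_method():6} {req.full_url}")
```

Note that creating a resource targets the collection URL, while reading, updating, and deleting target a specific member.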

Below, you may find a sample request from [Twitter's official API page](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets).

![](https://pbs.twimg.com/media/EGsWEYwX0AADYm8?format=jpg&name=large)

In return, the server responds with a set of matching tweets in **JSON** format.

JSON is short for JavaScript Object Notation, and is a way to store information in an organized, easy-to-access manner. In a nutshell, it gives us a human-readable collection of data that we can access in a logical manner. You may think of it as a generalized dictionary object that works across various languages.
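
The dictionary analogy can be sketched with Python's built-in `json` module. The payload below is a hypothetical, heavily simplified response invented for illustration; it is not the actual structure returned by any particular API.

```python
import json

# Hypothetical response body, made up for this example
raw = """
{
  "statuses": [
    {"id": 1, "text": "hello", "user": {"screen_name": "alice"}},
    {"id": 2, "text": "world", "user": {"screen_name": "bob"}}
  ]
}
"""

# JSON objects become dicts, JSON arrays become lists
data = json.loads(raw)

texts = [status["text"] for status in data["statuses"]]
first_author = data["statuses"][0]["user"]["screen_name"]

# Going the other way: serialize the Python object back to a JSON string
round_trip = json.dumps(data, indent=2)
print(texts, first_author)
```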



### Requests: Making HTTP Requests!

`requests` is the de facto standard library for making HTTP requests in Python. It abstracts the complexities of making requests behind a beautiful, simple API so that you can focus on interacting with services and consuming data in your application.

As mentioned before, the `GET` method indicates that you’re trying to get or retrieve data from a specified resource. To make a GET request with the `requests` library, just call `requests.get(url)`, where `url` is the target webpage.

In [None]:
# the library comes built-in with colab
import requests 

In [None]:
url = "http://www.google.com"
# making a GET request
res = requests.get(url)

In [None]:
# success code
res.status_code

200
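
The status code tells us how the server handled the request. As a small offline sketch, the standard library's `http.HTTPStatus` can translate a numeric code into its standard phrase; the helper name `describe_status` and the family labels are choices made for this example.

```python
from http import HTTPStatus

# The first digit of a status code identifies its family
FAMILIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found (client error)'."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return f"{code} {status.phrase} ({FAMILIES[code // 100]})"

for code in (200, 301, 404, 503):
    print(describe_status(code))
```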

In [None]:
# raw HTML of the page, returned as bytes
res.content

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\'re looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="zNCVoSUImKpOlGVI9hOz8Q">(function(){window.google={kEI:\'O4MgZIHTIrWw5NoP8_61qAg\',kEXPI:\'0,1303426,55983,6059,206,4804,2316,383,246,5,1129120,1197708,693,380089,16109,28690,22430,1362,12317,17582,4998,13228,3847,36218,2226,2872,2891,561,3365,8434,29843,30847,15324,432,3,346,1244,1,5445,11471,2652,4,1528,2304,11926,17136,13063,13660,2980,1457,9358,7428,5821,2536,4094,7596,1,42154,2,14022,2373,342,23024,5679,1020,31123,4568,6258,23418,1252,5835,14968,4332,7484,445,2,2,1,26632,

# 2. Data Retrieval via APIs

REST APIs, or Representational State Transfer APIs, are a type of web service that uses HTTP requests to access and manipulate data. REST APIs are based on a set of principles that make it easy for different systems to communicate with each other, and they are widely used in modern web development.

To use a REST API, we need to send an HTTP request to a specific endpoint, which is a URL that represents a particular resource or action. The response we receive from the server will usually be in a specific format, such as JSON or XML.

As an example, let's look at the Wikipedia API, which allows you to retrieve data from Wikipedia, including article content, metadata, and search results. To use the Wikipedia API, you need to send HTTP requests to specific endpoints and provide query parameters to specify the data you want to retrieve.

Let's start by importing the necessary libraries and defining the base URL for the Wikipedia API:

In [None]:
import requests
import json

# Define the base URL for the Wikipedia API endpoint
WIKIPEDIA_API_URL = "https://en.wikipedia.org/w/api.php"

In [None]:
# Define the parameters for the API request
title = "Python (programming language)"
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "exintro": "",
    "explaintext": "",
    "titles": title
}

In [None]:
# Send the API request
response = requests.get(WIKIPEDIA_API_URL, params=params)
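
As a rough offline illustration of what the `params` argument does in the call above, the dictionary is encoded into a query string and appended to the endpoint. This sketch uses the standard library's `urllib.parse.urlencode` rather than `requests` itself, so it shows the general idea; the exact encoding `requests` produces is assumed to be equivalent, not guaranteed identical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

WIKIPEDIA_API_URL = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "exintro": "",
    "explaintext": "",
    "titles": "Python (programming language)",
}

# Encode the dictionary as key=value pairs joined by '&'
query_string = urlencode(params)
full_url = f"{WIKIPEDIA_API_URL}?{query_string}"
print(full_url)

# Decoding the query string recovers the original parameters
decoded = parse_qs(urlsplit(full_url).query, keep_blank_values=True)
print(decoded["titles"])
```

Spaces and parentheses in the title are escaped so that the URL stays valid, which is why passing a dictionary is usually safer than concatenating strings by hand.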

In [None]:
# Parse the JSON response
data = json.loads(response.text)

In [None]:
data

{'batchcomplete': '',
 'query': {'pages': {'23862': {'pageid': 23862,
    'ns': 0,
    'title': 'Python (programming language)',
    'extract': 'Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.Python consistently ranks as one of the most popular pro

In [None]:
# Extract the article summary from the response
summary = ""
for page in data["query"]["pages"].values():
    summary = page["extract"]
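
The loop above assumes every page has an `extract` field. A slightly more defensive version can be sketched as follows, using hand-written sample dictionaries shaped like the response shown earlier. The helper name `extract_summaries` is made up for this example, and the `missing` marker for nonexistent pages is an assumption about the API's response format that should be checked against the actual output.

```python
def extract_summaries(data):
    """Map each page title to its extract, skipping pages without one."""
    summaries = {}
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        # Assumption: nonexistent pages carry a "missing" key and no extract
        if "missing" in page or "extract" not in page:
            continue
        summaries[page["title"]] = page["extract"].strip()
    return summaries

# Hand-written samples that mimic the structure of the real response
sample_found = {
    "query": {"pages": {"23862": {
        "pageid": 23862,
        "ns": 0,
        "title": "Python (programming language)",
        "extract": "Python is a high-level, general-purpose programming language.\n\n",
    }}}
}
sample_missing = {
    "query": {"pages": {"-1": {"ns": 0, "title": "No such page", "missing": ""}}}
}

print(extract_summaries(sample_found))
print(extract_summaries(sample_missing))
```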


In [None]:
summary

'Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.Python consistently ranks as one of the most popular programming languages.\n\n'

# 3. Standard Web Scraping using BeautifulSoup

### Beautiful Soup: Parsing Structured Documents!

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library for parsing HTML and XML documents, used mainly for web scraping. It builds a parse tree for each parsed page, from which data can be extracted.


Given an HTML file, our goal is to parse the content and store it in an easily accessible data structure, namely a document tree object. Whenever we pass HTML content to the Beautiful Soup parser, it returns the root of the resulting document tree.

![](https://dab1nmslvvntp.cloudfront.net/wp-content/uploads/2014/10/1413373269crp-1.png)
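
The tree structure described above can be explored on a small inline document before touching a real website. The HTML snippet below is invented for illustration (its tag names and classes are assumptions, not any site's actual markup), and the sketch assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only to illustrate tree navigation
html = """
<html>
  <body>
    <div id="books">
      <a class="bookTitle" href="/book/show/1.First_Book">First Book</a>
      <a class="bookTitle" href="/book/show/2.Second_Book">Second Book</a>
      <a class="author" href="/author/show/9.Someone">Someone</a>
    </div>
  </body>
</html>
"""

# Parsing returns the root of the document tree
soup = BeautifulSoup(html, "html.parser")

# Navigate the hierarchy: the <div> sits inside <body>
div = soup.find("div", id="books")
parent_name = div.parent.name

# Search the subtree for tags with a given class
title_tags = div.find_all("a", class_="bookTitle")
titles = [a.get_text(strip=True) for a in title_tags]
links = [a["href"] for a in title_tags]
print(parent_name, titles, links)
```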



To demonstrate web scraping with BeautifulSoup, let's extract the descriptions of the first 3 books from a list on goodreads.com.


In [76]:
import requests
from bs4 import BeautifulSoup as bs
import re

# Step 1: Download page content and extract links
allLinks = []
url = "https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once?page=1"
resp = requests.get(url)  # send a GET request to the specified URL and store the server's response in resp variable
soup = bs(resp.content, "html.parser")  # parse the HTML content of the server's response with an explicit parser and store the result in soup variable
for book in soup.find_all('a', href=True, class_='bookTitle'):
    link = book['href']  # get the 'href' attribute of the <a> tag
    allLinks.append("https://www.goodreads.com" + link)  # append the link to allLinks list


In [81]:
# Step 2: Download HTML content from links and save to file
for link in allLinks[0:5]:
    resp = requests.get(link)
    book_id = re.findall(r'\d+', link)[0]  # extract the book ID (first run of digits) from the link
    with open(f"{book_id}.html", "w", encoding="utf-8") as f:
        f.write(resp.text)

In [78]:
allLinks[0:5]

['https://www.goodreads.com/book/show/2657.To_Kill_a_Mockingbird',
 'https://www.goodreads.com/book/show/72193.Harry_Potter_and_the_Philosopher_s_Stone',
 'https://www.goodreads.com/book/show/1885.Pride_and_Prejudice',
 'https://www.goodreads.com/book/show/48855.The_Diary_of_a_Young_Girl',
 'https://www.goodreads.com/book/show/170448.Animal_Farm']
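
The book ID is extracted from these links by taking the first run of digits in the URL. A slightly stricter variant, sketched below, anchors the pattern to the `/book/show/` path segment so that digits elsewhere in a URL cannot be picked up by mistake. The helper name `book_id_from` is made up for this example, and the links are copied from the output above.

```python
import re

links = [
    "https://www.goodreads.com/book/show/2657.To_Kill_a_Mockingbird",
    "https://www.goodreads.com/book/show/72193.Harry_Potter_and_the_Philosopher_s_Stone",
    "https://www.goodreads.com/book/show/1885.Pride_and_Prejudice",
]

# Capture only the digits that directly follow /book/show/
BOOK_ID_PATTERN = re.compile(r"/book/show/(\d+)")

def book_id_from(link):
    """Return the numeric book ID as a string, or None if the link does not match."""
    match = BOOK_ID_PATTERN.search(link)
    return match.group(1) if match else None

ids = [book_id_from(link) for link in links]
print(ids)
```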

In [82]:
# Step 3: Parse HTML content to extract book description
for link in allLinks[0:3]:
    book_id = re.findall(r'\d+', link)[0]
    with open(f"{book_id}.html", "r", encoding="utf-8") as f:
        soup = bs(f.read(), "html.parser")
        try:
            description = soup.find("div", {"data-testid": "description"}).find_all("span")[0].text.strip()
        except (AttributeError, IndexError):
            # skip pages where the description block is not found
            continue
        print(f"{book_id}: {description}\n")

2657: The unforgettable novel of a childhood in a sleepy Southern town and the crisis of conscience that rocked it. "To Kill A Mockingbird" became both an instant bestseller and a critical success when it was first published in 1960. It went on to win the Pulitzer Prize in 1961 and was later made into an Academy Award-winning film, also a classic.Compassionate, dramatic, and deeply moving, "To Kill A Mockingbird" takes readers to the roots of human behavior - to innocence and experience, kindness and cruelty, love and hatred, humor and pathos. Now with over 18 million copies in print and translated into forty languages, this regional story by a young Alabama woman claims universal appeal. Harper Lee always considered her book to be a simple love story. Today it is regarded as a masterpiece of American literature.

72193: Harry Potter thinks he is an ordinary boy - until he is rescued by an owl, taken to Hogwarts School of Witchcraft and Wizardry, learns to play Quidditch and does battl

# 4. Automation using Selenium

Sometimes, we need to interact with websites in ways that are not possible with simple HTTP requests or web scraping. In such cases, we can use browser automation tools like Selenium.

To demonstrate automation with Selenium, let's automate the process of searching for a keyword on Google and extracting the URLs of the top 3 search results.

To run the automation code below, you need to install ChromeDriver from https://chromedriver.chromium.org/downloads. Download the version compatible with your Chrome version, which you can find in Chrome's settings under 'About Chrome'. Then write the path of the downloaded ChromeDriver below.

The code for this part will be demonstrated in a local environment instead of Google Colab.