# Webscraping 101

- What is webscraping?
- Packages used
- Making a request
- Getting and parsing the data
- A couple of tricks to get around tricky websites

#### What is webscraping?
Webscraping is the process of extracting data from a website. The data could be text, numbers, images, URLs, etc. It is mainly used for data collection and research.

#### Packages that are used

In [1]:
import requests  # used for making HTTP requests to a website and retrieving its content
from bs4 import BeautifulSoup  # used for parsing the HTML that comes back

#### Making a Request

In [2]:
response = requests.get('https://google.com')

In [3]:
# status code
print(response.status_code)

200


#### A Couple of Common Status Codes
- 200 (OK --> the server handled the request successfully)
- 403 (forbidden --> the server refuses to serve the request; usually means the site is doing a good job of blocking scrapers or you are missing credentials)
- 404 (not found --> usually means the page you are looking for does not exist anymore)
- Anything in the 500s is a server error

You can find a full list of HTTP response codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
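
The ranges above can be turned into a small helper. This is a minimal sketch that uses only the standard library; the function name describe_status is made up for this example and is not part of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short summary such as '404 Not Found (client error)'."""
    try:
        phrase = HTTPStatus(code).phrase  # standard reason phrase, e.g. 'OK'
    except ValueError:
        phrase = "Unknown"  # code is not in the standard library's table

    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {phrase} ({category})"


for code in (200, 403, 404, 500):
    print(describe_status(code))
```

In a scraper you would pass response.status_code to a helper like this before trying to parse the page.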

In [4]:
# page source
print(response.text)

# url sent
print(response.url)

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="F1hYOrRCPeGwMArqgAtSdA">(function(){window.google={kEI:'f-J5Yq7nDcK0tQbvw5GICg',kEXPI:'0,1302536,56872,6059,207,4804,2316,383,246,5,1354,4013,1238,1122515,1197718,683,380089,16115,19397,9287,17572,4858,1362,9290,3030,17579,4020,978,13228,3847,4192,6430,7432,15309,5081,1593,1279,2742,149,1103,841,1982,4,4310,3514,606,2023,1777,520,14670,3227,2845,7,17450,8101,8219,1851,2614,3784,8926,432,3,346,1244,1,5444,149,11323,991,1661,4,1528,2304,6463,576,13271,8752,3050,2658,7357,11443,2215,

Unfortunately, there's not much to scrape from Google's main page, so let's try Nike.

In [5]:
response = requests.get('https://www.nike.com/w/new-mens-shoes-3n82yznik1zy7ok')

In [6]:
response.status_code

200

### Parsing the webpage
To parse the webpage, we use BeautifulSoup to extract information from the page source that we just got.

In [7]:
soup = BeautifulSoup(response.text, 'html.parser') # first argument is the text, second is the type of parser we want to use

In [8]:
data = soup.select('div[class="product-card__body"]')

In [9]:
len(data)

24

In [10]:
for x in data:
    print(x.text)

Nike Go FlyEaseJust InNike Go FlyEaseShoes3 Colors$120
Nike Air Max DawnJust InNike Air Max DawnMen's Shoes4 Colors$110
Nike Air Max 90Just InNike Air Max 90Men's Shoes1 Color$140
Nike Air Max Pre-DayJust InNike Air Max Pre-DayMen's Shoes3 Colors$140
Nike Blazer Low JumboJust InNike Blazer Low JumboMen's Shoes1 Color$100
Nike Blazer Mid '77 PremiumNike Blazer Mid '77 PremiumMen's Shoes1 Color$100
Nike Air Max 270Best SellerNike Air Max 270Men's Shoes1 Color$170
Nike Air Max PlusJust InNike Air Max PlusShoes2 Colors$175
LeBron 19 LowJust InLeBron 19 LowBasketball Shoes1 Color$160
LeBron 19Best SellerLeBron 19Basketball Shoes2 Colors$200
Jordan .5 'Why Not?'Just InJordan .5 'Why Not?'Men's Basketball Shoes1 Color$130
Air Jordan XXXVIJust InAir Jordan XXXVIBasketball Shoes2 Colors$185
Jordan ADG 4Just InJordan ADG 4Men's Golf Shoes3 Colors$185
Jordan Series MidJust InJordan Series MidMen's Shoes1 Color$90
Nike OneontaNike OneontaMen's Sandals3 Colors$65
Nike Asuna 2Nike Asuna 2Men's Slide
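
The printed text above runs each card's fields together because .text joins every piece of text with no separator. One way to keep them apart is sketched below. The HTML is a made-up sample, and the inner class names (product-card__title, product-card__subtitle, product-card__price) are assumptions for illustration; check the real page source before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics one product card; the real page may differ.
sample_html = """
<div class="product-card__body">
  <div class="product-card__title">Nike Air Max 90</div>
  <div class="product-card__subtitle">Men's Shoes</div>
  <div class="product-card__price">$140</div>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
card = soup.select_one('div[class="product-card__body"]')

# separator is placed between text pieces; strip=True drops surrounding whitespace
print(card.get_text(separator=' | ', strip=True))

# stripped_strings yields each piece of text separately, ready to store in a list
fields = list(card.stripped_strings)
print(fields)
```

The same two calls work on each element of the data list from the loop above.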

### Getting around Tricky Websites
1. Headers
2. Proxies
3. Selenium

#### Headers
Some websites return an empty or blocked response to requests that do not look like they come from a browser. Sending a browser-like User-Agent header often gets around this.

In [11]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36'}

In [12]:
response = requests.get('https://www.tiktok.com/')

In [13]:
response.text

''

In [14]:
response = requests.get('https://www.tiktok.com/', headers=headers)

In [15]:
response.text
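
If you make many requests to the same site, passing headers= every time gets repetitive. A possible alternative, sketched here, is a requests.Session that carries the User-Agent on every request. The example only prepares the request so you can inspect the headers without touching the network; the timeout value in the commented line is an arbitrary choice.

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36'}

# A session remembers its headers and sends them with every request it makes
session = requests.Session()
session.headers.update(headers)

# Build the request without sending it, to see which headers would go out
prepared = session.prepare_request(requests.Request('GET', 'https://www.tiktok.com/'))
print(prepared.headers['User-Agent'])

# To actually send it:
# response = session.get('https://www.tiktok.com/', timeout=10)
```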



#### Proxies
Many proxy services are available, such as Scraper API. A proxy sends your requests through a different IP address, which helps when a website blocks or rate-limits your own.
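
With requests, a proxy is passed as a dictionary through the proxies argument. The sketch below only builds that dictionary. The helper name build_proxies, the host proxy.example.com, the port, and the credentials are all placeholders, not details of any real provider; each service documents its own connection format.

```python
from urllib.parse import quote


def build_proxies(host, port, username=None, password=None):
    """Build the mapping that requests accepts through its proxies= argument."""
    credentials = ""
    if username and password:
        # Percent-encode so special characters cannot break the proxy URL
        credentials = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    proxy_url = f"http://{credentials}{host}:{port}"
    # Use the same proxy for both plain HTTP and HTTPS traffic
    return {'http': proxy_url, 'https': proxy_url}


proxies = build_proxies('proxy.example.com', 8080, 'user', 'secret')
print(proxies)

# To actually use it:
# response = requests.get('https://www.nike.com/', proxies=proxies, timeout=10)
```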

#### Selenium
Selenium lets you automate an actual browser such as Chrome. It will be covered in another video.