# Webscraping

This notebook shows how to scrape a website with the `BeautifulSoup` and `Requests` libraries. The running example scrapes firearms classifieds from the [Armslist](https://www.armslist.com/classifieds) website.

**Import libraries**

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from requests import get

**Set URL**

In [3]:
baseURL = 'https://www.armslist.com/classifieds/search?location=new-york&category=all&page=1&posttype=7&ships=False'
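The search URL carries its filters in the query string (`location`, `category`, `page`, `posttype`, `ships`). As a minimal sketch, the same URL can be assembled from a dictionary with the standard library's `urlencode`, which makes it easier to change one parameter at a time. The helper name `build_search_url` and its defaults are illustrative and not part of the original notebook. The meaning of `posttype` and `ships` is not documented here, so the values are passed through unchanged.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://www.armslist.com/classifieds/search'

def build_search_url(page, location='new-york', category='all',
                     posttype=7, ships=False):
    """Assemble the search URL for one results page (hypothetical helper)."""
    params = {
        'location': location,
        'category': category,
        'page': page,
        'posttype': posttype,
        'ships': ships,   # str(False) gives 'False', as in the original URL
    }
    return SEARCH_ENDPOINT + '?' + urlencode(params)

print(build_search_url(1))
```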

**Query the website and explore the response**

In [4]:
response = get(baseURL)

In [5]:
response.status_code

200
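A status code of 200 means the request succeeded. Rather than checking the code by eye, a script can ask `requests` to raise an exception on a 4xx or 5xx response. The sketch below wraps that in a small helper. The name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page body as text, or raise if the request failed."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(baseURL)` would then either return the HTML or stop with an error that names the failing status.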

In [6]:
response.text



**Use `BeautifulSoup` to parse the HTML, then `prettify` to print it with indentation**

In [7]:
soup = BeautifulSoup(response.content, 'html.parser')

In [8]:
soup


<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>
    ARMSLIST    -
New York     
All Categories     
Classifieds
</title>
<link href="https://s3.amazonaws.com/mgm-content/sites/armslist/content/system/favicon.ico" rel="shortcut icon"/>
<link href="https://s3.amazonaws.com/mgm-content/static/r131903/gzip/shared/css/bootstrap.css" rel="stylesheet"/>
<link href="https://s3.amazonaws.com/mgm-content/static/r131903/gzip/shared/css/all.css" rel="stylesheet"/>
<link href="//fonts.googleapis.com/css?family=Oswald:400,300%7CLato:400,700" rel="stylesheet" type="text/css"/>
<link href="//maxcdn.bootstrapcdn.com/font-awesome/4.4.0/css/font-awesome.min.css" rel="stylesheet"/>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.11.4/jquery-ui.min.js"></script>
<script src="https://s3.amazonaws.com/mgm-content/static/r

In [10]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   ARMSLIST    -
New York     
All Categories     
Classifieds
  </title>
  <link href="https://s3.amazonaws.com/mgm-content/sites/armslist/content/system/favicon.ico" rel="shortcut icon"/>
  <link href="https://s3.amazonaws.com/mgm-content/static/r131903/gzip/shared/css/bootstrap.css" rel="stylesheet"/>
  <link href="https://s3.amazonaws.com/mgm-content/static/r131903/gzip/shared/css/all.css" rel="stylesheet"/>
  <link href="//fonts.googleapis.com/css?family=Oswald:400,300%7CLato:400,700" rel="stylesheet" type="text/css"/>
  <link href="//maxcdn.bootstrapcdn.com/font-awesome/4.4.0/css/font-awesome.min.css" rel="stylesheet"/>
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js">
  </script>
  <script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.11.4/jquery-ui.min.js">
  </script>
  <script src="https://s3.ama

**Get the children of the main object**

In [11]:
list(soup.children)

['\n', 'html', '\n', <html>
 <head>
 <meta charset="utf-8"/>
 <meta content="width=device-width, initial-scale=1" name="viewport"/>
 <title>
     ARMSLIST    -
 New York     
 All Categories     
 Classifieds
 </title>
 <link href="https://s3.amazonaws.com/mgm-content/sites/armslist/content/system/favicon.ico" rel="shortcut icon"/>
 <link href="https://s3.amazonaws.com/mgm-content/static/r131903/gzip/shared/css/bootstrap.css" rel="stylesheet"/>
 <link href="https://s3.amazonaws.com/mgm-content/static/r131903/gzip/shared/css/all.css" rel="stylesheet"/>
 <link href="//fonts.googleapis.com/css?family=Oswald:400,300%7CLato:400,700" rel="stylesheet" type="text/css"/>
 <link href="//maxcdn.bootstrapcdn.com/font-awesome/4.4.0/css/font-awesome.min.css" rel="stylesheet"/>
 <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
 <script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.11.4/jquery-ui.min.js"></script>
 <script src="https://s3.amazonaws.com/

In [12]:
list(soup.children)[3]

<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>
    ARMSLIST    -
New York     
All Categories     
Classifieds
</title>
<link href="https://s3.amazonaws.com/mgm-content/sites/armslist/content/system/favicon.ico" rel="shortcut icon"/>
<link href="https://s3.amazonaws.com/mgm-content/static/r131903/gzip/shared/css/bootstrap.css" rel="stylesheet"/>
<link href="https://s3.amazonaws.com/mgm-content/static/r131903/gzip/shared/css/all.css" rel="stylesheet"/>
<link href="//fonts.googleapis.com/css?family=Oswald:400,300%7CLato:400,700" rel="stylesheet" type="text/css"/>
<link href="//maxcdn.bootstrapcdn.com/font-awesome/4.4.0/css/font-awesome.min.css" rel="stylesheet"/>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.11.4/jquery-ui.min.js"></script>
<script src="https://s3.amazonaws.com/mgm-content/static/r131903/gzip/shared

In [13]:
for baby in list(soup.children):
    print(type(baby))

<class 'bs4.element.NavigableString'>
<class 'bs4.element.Doctype'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
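The type listing shows that the children include whitespace strings and the doctype, not just elements, so picking the `html` element by position depends on the exact layout of the page. A hedged alternative, sketched on a small made-up document, is to filter the children by type or to reach a tag by name (for example `soup.body`). The names `sample` and `demo_soup` are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

sample = """<!DOCTYPE html>
<html>
<head><title>Demo</title></head>
<body><p>Hello</p></body>
</html>
"""

demo_soup = BeautifulSoup(sample, 'html.parser')

# Keep only real elements, dropping whitespace strings and the doctype
tags = [child for child in demo_soup.children if isinstance(child, Tag)]
print([tag.name for tag in tags])

# Attribute-style access returns the first tag with that name
print(demo_soup.body.p.get_text())
```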


**Get the main `html` element from the children and then its own children**

In [16]:
myHtml = list(soup.children)[3]

In [17]:
type(myHtml)

bs4.element.Tag

In [18]:
list(myHtml.children)

['\n', <head>
 <meta charset="utf-8"/>
 <meta content="width=device-width, initial-scale=1" name="viewport"/>
 <title>
     ARMSLIST    -
 New York     
 All Categories     
 Classifieds
 </title>
 <link href="https://s3.amazonaws.com/mgm-content/sites/armslist/content/system/favicon.ico" rel="shortcut icon"/>
 <link href="https://s3.amazonaws.com/mgm-content/static/r131903/gzip/shared/css/bootstrap.css" rel="stylesheet"/>
 <link href="https://s3.amazonaws.com/mgm-content/static/r131903/gzip/shared/css/all.css" rel="stylesheet"/>
 <link href="//fonts.googleapis.com/css?family=Oswald:400,300%7CLato:400,700" rel="stylesheet" type="text/css"/>
 <link href="//maxcdn.bootstrapcdn.com/font-awesome/4.4.0/css/font-awesome.min.css" rel="stylesheet"/>
 <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
 <script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.11.4/jquery-ui.min.js"></script>
 <script src="https://s3.amazonaws.com/mgm-content/static/r13

In [19]:
list(myHtml.children)[3]

<body style="-webkit-overflow-scrolling: touch;">
<div class="modal fade" id="termsModal" role="dialog" style="color: #655D5B !important; z-index: 100000 !important;">
<div class="modal-dialog">
<!-- Modal content-->
<div class="modal-content">
<div class="modal-header">
<h4 style="color: #655d5b">TERMS OF USE</h4>
</div>
<div class="modal-body">
<div id="termsModalContainer"></div>
</div>
<div class="modal-footer">
<button class="btn btn-default" onclick="acceptTerms()" type="button">Accept</button>
</div>
</div>
</div>
</div>
<div class="modal fade" id="promotionModal" role="dialog" style="color: #655D5B !important; z-index: 100000 !important;">
<div class="modal-dialog">
<!-- Modal content-->
<div class="modal-content">
<div class="modal-header">
<h4 style="color: #655d5b">PROMOTIONAL LINK</h4>
</div>
<div class="modal-body">
<div id="promotionModalContainer"></div>
</div>
<div class="modal-footer">
<button class="btn btn-default" data-dismiss="modal" type="button">Close</button>
</

**Get the `body` element and explore its components**

In [20]:
myBody = list(myHtml.children)[3]

In [21]:
myBody.find_all('p')

[<p>
 <small>
 For Sale                </small>
 </p>,
 <p><small><i class="fa fa-map-marker fa-2"></i> Long Island</small></p>,
 <p style="font-size: 80%;"><small>Tuesday, 6/18 4:17 PM</small></p>,
 <p>
 <small>
 For Sale/Trade                </small>
 </p>,
 <p><small><i class="fa fa-map-marker fa-2"></i> Long Island</small></p>,
 <p style="font-size: 80%;"><small>Tuesday, 6/18 4:17 PM</small></p>,
 <p>
 <small>
 For Sale                </small>
 </p>,
 <p><small><i class="fa fa-map-marker fa-2"></i> Long Island</small></p>,
 <p style="font-size: 80%;"><small>Tuesday, 6/18 4:17 PM</small></p>,
 <p>
 <small>
 For Sale/Trade                </small>
 </p>,
 <p><small><i class="fa fa-map-marker fa-2"></i> Long Island</small></p>,
 <p style="font-size: 80%;"><small>Tuesday, 6/18 4:17 PM</small></p>,
 <p>
 <small>
 For Sale                </small>
 </p>,
 <p><small><i class="fa fa-map-marker fa-2"></i> Long Island</small></p>,
 <p style="font-size: 80%;"><small>Tuesday, 6/18 4:17 PM</small

**Once you've identified `col-md-7` as the class of the `div` elements that hold the data we want, we can use it to extract those elements from the parsed page.**

In [22]:
saleInfo = myBody.find_all(class_='col-md-7')

In [23]:
saleInfo

[<div class="col-md-7">
 <h4 style="margin: 0px 0px 10px 0px; padding: 0px;"><a href="/posts/9571384/long-island-new-york-optics-for-sale--vortex-viper">Vortex Viper</a></h4>
 <h4 style="color: #000000; font-weight: 400;">
                     $ 500
                 </h4>
 <p>
 <small>
 For Sale                </small>
 </p>
 <p><small><i class="fa fa-map-marker fa-2"></i> Long Island</small></p>
 <p style="font-size: 80%;"><small>Tuesday, 6/18 4:17 PM</small></p>
 </div>, <div class="col-md-7">
 <h4 style="margin: 0px 0px 10px 0px; padding: 0px;"><a href="/posts/9789750/long-island-new-york-rifles-for-sale-trade--chiappa-1892-alaskan-take-down">Chiappa 1892 Alaskan Take Down</a></h4>
 <h4 style="color: #000000; font-weight: 400;">
                     $ 1,250
                 </h4>
 <p>
 <small>
 For Sale/Trade                </small>
 </p>
 <p><small><i class="fa fa-map-marker fa-2"></i> Long Island</small></p>
 <p style="font-size: 80%;"><small>Tuesday, 6/18 4:17 PM</small></p>
 </d
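Because each `col-md-7` block holds one complete listing, the fields can also be read one listing at a time. The sketch below does this on a single listing adapted from the output above (the `p` elements are shortened). It assumes that the first `h4` is always the title and the second is always the price, which holds for the listings shown here but is not guaranteed by the site.

```python
from bs4 import BeautifulSoup

# One listing, adapted from the output above
sample = """
<div class="col-md-7">
<h4 style="margin: 0px 0px 10px 0px; padding: 0px;"><a href="/posts/9571384/long-island-new-york-optics-for-sale--vortex-viper">Vortex Viper</a></h4>
<h4 style="color: #000000; font-weight: 400;">
                    $ 500
                </h4>
<p><small>For Sale</small></p>
</div>
"""

demo_soup = BeautifulSoup(sample, 'html.parser')

for card in demo_soup.find_all(class_='col-md-7'):
    headings = card.find_all('h4')   # first h4: title, second h4: price
    title = headings[0].get_text(strip=True)
    price = headings[1].get_text(strip=True)
    print(title, '|', price)
```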

**We can also select items with a CSS selector that combines a class with nested tags (here, every `a` inside an `h4` inside a `col-md-7` element)**

In [24]:
myBody.select('.col-md-7 h4 a')

[<a href="/posts/9571384/long-island-new-york-optics-for-sale--vortex-viper">Vortex Viper</a>,
 <a href="/posts/9789750/long-island-new-york-rifles-for-sale-trade--chiappa-1892-alaskan-take-down">Chiappa 1892 Alaskan Take Down</a>,
 <a href="/posts/9572179/long-island-new-york-gun-parts-for-sale--rossi-coach-gun-overland-parts">Rossi Coach Gun Overland Parts</a>,
 <a href="/posts/9503782/long-island-new-york-shotguns-for-sale-trade--rossi-chrome-coach-gun-rare">Rossi Chrome Coach Gun Rare</a>,
 <a href="/posts/9504240/long-island-new-york-shotguns-for-sale--savage-24--22lr-over-20ga-">Savage 24  22lr Over 20ga.</a>,
 <a href="/posts/9874160/long-island-new-york-shotguns-for-sale--remington-coach-gun-new-old-stock">Remington Coach Gun New Old Stock</a>,
 <a href="/posts/10074126/long-island-new-york-rifles-for-sale--ruger-m77-hawkeye-tactical--308">Ruger m77 Hawkeye Tactical .308</a>,
 <a href="/posts/10075280/long-island-new-york-shotguns-for-sale--rossi-overland-coach-gun-">Rossi Over

In [25]:
for item in myBody.select('.col-md-7 h4 a'):
    print(item.get_text())

Vortex Viper
Chiappa 1892 Alaskan Take Down
Rossi Coach Gun Overland Parts
Rossi Chrome Coach Gun Rare
Savage 24  22lr Over 20ga.
Remington Coach Gun New Old Stock
Ruger m77 Hawkeye Tactical .308
Rossi Overland Coach Gun 
Remington .22 Win Mag case
Winchester .410 PDX1 Defender Ammo
Ruger Security Six .357
 GLOCK 45 9MM 10RD 3 MAGS FRT SER
Taurus 605
Ruger LCPII kydex holster and Ruger mag
Taylor & Co. (Uberti) Smoke Wagon .357
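The selected `a` elements carry the link to each listing as well as its title. As a small sketch, the `href` attribute can be read with dictionary-style access and turned into an absolute URL with `urljoin`. The constant `SITE_ROOT` and the trimmed sample markup are assumptions for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

SITE_ROOT = 'https://www.armslist.com'

sample = '''
<div class="col-md-7">
<h4><a href="/posts/9571384/long-island-new-york-optics-for-sale--vortex-viper">Vortex Viper</a></h4>
</div>
'''

demo_soup = BeautifulSoup(sample, 'html.parser')

links = []
for anchor in demo_soup.select('.col-md-7 h4 a'):
    # anchor['href'] reads the attribute; urljoin resolves it against the site root
    links.append((anchor.get_text(), urljoin(SITE_ROOT, anchor['href'])))

print(links)
```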


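**Use the `%` (modulo) operator to do something on every other iteration of a loop**

Each listing contains two `h4` elements: the title first, then the price. Counting the matches and keeping only the even-numbered ones therefore prints just the prices.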

In [26]:
counter = 0
for item in myBody.select('.col-md-7 h4'):
    counter += 1
    if counter%2 == 0:
        print(item.get_text().strip())

$ 500
$ 1,250
$ 300
$ 700
$ 700
$ 700
$ 1,500
$ 675
$ 100
$ 100
$ 450
$ 500
$ 280
$ 40
$ 600
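The same alternation can be expressed without a counter. A minimal sketch, using a hand-written list in place of the scraped `h4` texts: slicing with a step of 2 splits the sequence into titles and prices, and `zip` pairs them up again.

```python
# Text of the h4 elements in document order: title, price, title, price, ...
h4_texts = ['Vortex Viper', '$ 500',
            'Chiappa 1892 Alaskan Take Down', '$ 1,250']

titles = h4_texts[0::2]   # 1st, 3rd, 5th, ... items
prices = h4_texts[1::2]   # 2nd, 4th, 6th, ... items (the ones the counter check keeps)

for title, price in zip(titles, prices):
    print(title, price)
```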


**Create an empty dataframe to hold the scraped data, then loop through the items and add each row to it**

In [27]:
gunData = pd.DataFrame(columns=['description', 'price'])

In [28]:
counter = 0
description = ""
price = ""
for item in myBody.select('.col-md-7 h4'):
    counter += 1
    if counter % 2 == 0:
        # Even-numbered h4: the price, which completes the current row
        price = item.get_text().strip()
        print(description, price)
        gunData.loc[len(gunData)] = [description, price]
    else:
        # Odd-numbered h4: the title; hold it until the price arrives
        description = item.get_text()

Vortex Viper $ 500
Chiappa 1892 Alaskan Take Down $ 1,250
Rossi Coach Gun Overland Parts $ 300
Rossi Chrome Coach Gun Rare $ 700
Savage 24  22lr Over 20ga. $ 700
Remington Coach Gun New Old Stock $ 700
Ruger m77 Hawkeye Tactical .308 $ 1,500
Rossi Overland Coach Gun  $ 675
Remington .22 Win Mag case $ 100
Winchester .410 PDX1 Defender Ammo $ 100
Ruger Security Six .357 $ 450
 GLOCK 45 9MM 10RD 3 MAGS FRT SER $ 500
Taurus 605 $ 280
Ruger LCPII kydex holster and Ruger mag $ 40
Taylor & Co. (Uberti) Smoke Wagon .357 $ 600
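Adding rows one at a time with `.loc` works for a few dozen listings. For larger scrapes, a common alternative is to collect the rows in a plain list and build the dataframe once at the end. The sketch below uses two pairs from the output above as stand-in data. The names `scraped`, `rows`, and `demo_df` are illustrative.

```python
import pandas as pd

# (description, price) pairs as they come out of the loop above
scraped = [('Vortex Viper', '$ 500'),
           ('Chiappa 1892 Alaskan Take Down', '$ 1,250')]

rows = []
for description, price in scraped:
    rows.append({'description': description, 'price': price})

# Build the dataframe in one step from the collected rows
demo_df = pd.DataFrame(rows, columns=['description', 'price'])
print(demo_df)
```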


In [29]:
gunData.head()

| | description | price |
|---|---|---|
| 0 | Vortex Viper | $ 500 |
| 1 | Chiappa 1892 Alaskan Take Down | $ 1,250 |
| 2 | Rossi Coach Gun Overland Parts | $ 300 |
| 3 | Rossi Chrome Coach Gun Rare | $ 700 |
| 4 | Savage 24 22lr Over 20ga. | $ 700 |
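The prices are stored as strings such as `$ 1,250`, so they cannot be summed or sorted numerically yet. One possible cleanup, sketched here on two rows from the table, strips everything except digits and the decimal point and converts the result with `pd.to_numeric`. The column name `price_usd` is an assumption based on the dollar sign. With `errors='coerce'`, any value that still fails to parse becomes `NaN` instead of raising.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'description': ['Vortex Viper', 'Chiappa 1892 Alaskan Take Down'],
    'price': ['$ 500', '$ 1,250'],
})

# Drop everything except digits and the decimal point, then convert
cleaned = demo_df['price'].str.replace(r'[^0-9.]', '', regex=True)
demo_df['price_usd'] = pd.to_numeric(cleaned, errors='coerce')

print(demo_df['price_usd'].tolist())
```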


If you are going to do this several times in a row, for example to query multiple pages, put everything inside an outer loop that runs through the page numbers, and pause between requests so that the website doesn't return an error.

Make sure to import the `time` library.

In [30]:
import time

# The page number is inserted between these two URL fragments
baseURL = 'https://www.armslist.com/classifieds/search?location=new-york&category=all&page='
baseURL2 = '&posttype=7&ships=False'

# Start from an empty dataframe so rows from earlier cells are not duplicated
gunData = pd.DataFrame(columns=['description', 'price'])

for i in range(1, 11):  # pages 1 through 10
    request = baseURL + str(i) + baseURL2
    response = get(request)
    soup = BeautifulSoup(response.content, 'html.parser')
    myHtml = list(soup.children)[3]
    myBody = list(myHtml.children)[3]
    counter = 0
    description = ""
    price = ""
    # Each listing has two h4 tags: the title first, then the price
    for item in myBody.select('.col-md-7 h4'):
        counter += 1
        if counter % 2 == 0:
            price = item.get_text().strip()
            print(description, price)
            gunData.loc[len(gunData)] = [description, price]
        else:
            description = item.get_text()
    time.sleep(2)  # wait two seconds before requesting the next page
gunData.head()

Vortex Viper $ 500
Chiappa 1892 Alaskan Take Down $ 1,250
Rossi Coach Gun Overland Parts $ 300
Rossi Chrome Coach Gun Rare $ 700
Savage 24  22lr Over 20ga. $ 700
Remington Coach Gun New Old Stock $ 700
Ruger m77 Hawkeye Tactical .308 $ 1,500
Rossi Overland Coach Gun  $ 675
Remington .22 Win Mag case $ 100
Winchester .410 PDX1 Defender Ammo $ 100
Ruger Security Six .357 $ 450
 GLOCK 45 9MM 10RD 3 MAGS FRT SER $ 500
Taurus 605 $ 280
Ruger LCPII kydex holster and Ruger mag $ 40
Taylor & Co. (Uberti) Smoke Wagon .357 $ 600
Ruger Mini 14 $ 750
New Complete 5.56 M-lok Upper $ 260
New .625 Pencil Barrel/Gas system $ 80
Adjustable Gas Block $ 40
Cold Hammer Forged Barrel $ 240
Federal 5.56 XM193 55gr SALE  $ 149
CZ P10 C Clearance! $ 425
Shooting DVD's (Lot of 21 Titles) $ 25
Liberty Gun Safe -  Sale/Trade REDUCED $ 750
AmSec SF6030 Gun Safe $ 1,899
Lasermax Glock G43 Guide Rod Laser $ 240
Glock 42 .380acp Magazine w/ finger extension $ 33
EAA Witness 9mm 10 Round Mag Part 101922 $ 29
Double A

| | description | price |
|---|---|---|
| 0 | Vortex Viper | $ 500 |
| 1 | Chiappa 1892 Alaskan Take Down | $ 1,250 |
| 2 | Rossi Coach Gun Overland Parts | $ 300 |
| 3 | Rossi Chrome Coach Gun Rare | $ 700 |
| 4 | Savage 24 22lr Over 20ga. | $ 700 |


In [31]:
gunData.head(50)

| | description | price |
|---|---|---|
| 0 | Vortex Viper | $ 500 |
| 1 | Chiappa 1892 Alaskan Take Down | $ 1,250 |
| 2 | Rossi Coach Gun Overland Parts | $ 300 |
| 3 | Rossi Chrome Coach Gun Rare | $ 700 |
| 4 | Savage 24 22lr Over 20ga. | $ 700 |
| 5 | Remington Coach Gun New Old Stock | $ 700 |
| 6 | Ruger m77 Hawkeye Tactical .308 | $ 1,500 |
| 7 | Rossi Overland Coach Gun | $ 675 |
| 8 | Remington .22 Win Mag case | $ 100 |
| 9 | Winchester .410 PDX1 Defender Ammo | $ 100 |
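The multi-page loop above can also be packaged as a function, which keeps the paging, parsing, and pausing in one place and lets a failed page be skipped instead of stopping the run. This is only a sketch under stated assumptions: `scrape_pages` and its `fetch_page` argument are hypothetical names, and `fetch_page` is assumed to be any callable that takes a page number and returns that page's HTML, or `None` if the request failed.

```python
import time
from bs4 import BeautifulSoup

def scrape_pages(fetch_page, pages, delay=2.0):
    """Collect (description, price) pairs from several result pages.

    fetch_page is a hypothetical callable that takes a page number and
    returns that page's HTML, or None if the request failed.
    """
    results = []
    for page in pages:
        html = fetch_page(page)
        if html is None:
            continue                      # skip pages that could not be fetched
        page_soup = BeautifulSoup(html, 'html.parser')
        for card in page_soup.find_all(class_='col-md-7'):
            headings = card.find_all('h4')
            if len(headings) < 2:
                continue                  # skip cards without a title/price pair
            results.append((headings[0].get_text(strip=True),
                            headings[1].get_text(strip=True)))
        time.sleep(delay)                 # pause before the next request
    return results
```

With the notebook's own names, a call might look like `scrape_pages(lambda n: get(baseURL + str(n) + baseURL2).text, range(1, 11))`, and the returned list can be passed to `pd.DataFrame(..., columns=['description', 'price'])`.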
