# **Web Scraping**

**Importing Dependencies**

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

**HTTP Request**

In [2]:
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0 Safari/537.36"
}
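# Fail fast on network stalls and HTTP error codes instead of parsing an error page.
response = requests.get("https://www.scrapethissite.com/pages/", headers=headers, timeout=10)
response.raise_for_status()
res = response.text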

In [3]:
soup = BeautifulSoup(res, "lxml")

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Learn Web Scraping | Scrape This Site | A public sandbox for learning web scraping
  </title>
  <link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="Here are some practice pages you can scrape." name="description"/>
  <link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
  <link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
  <link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
 </head>
 <body>
  <nav id="site-nav">
   <div class="container">
    <div class="col-md-12">
     <ul class="nav nav-tabs">
  

**Extracting text**

In [5]:
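# Each practice page on the index is wrapped in a <div class="page">.
pages = soup.find_all('div', class_='page')
titles = []
descriptions = []

for page in pages:
    # The first link inside the block holds the page title.
    title = page.find('a').text.strip()
    titles.append(title)
    # The paragraph with class "lead" holds the short description.
    desc = page.find('p', class_='lead').text.strip()
    descriptions.append(desc)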

In [6]:
print(titles)

['Countries of the World: A Simple Example', 'Hockey Teams: Forms, Searching and Pagination', 'Oscar Winning Films: AJAX and Javascript', 'Turtles All the Way Down: Frames & iFrames', "Advanced Topics: Real World Challenges You'll Encounter"]


In [7]:
print(descriptions)

['A single page that lists information about all the countries in the world. Good for those just get started with web scraping.', 'Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components.', 'Click through a bunch of great films. Learn how content is added to the page asynchronously with Javascript and how you can scrape it.', 'Some older sites might still use frames to break up thier pages. Modern ones might be using iFrames to expose data. Learn about turtles as you scrape content inside frames.', "Scraping real websites, you're likely run into a number of common gotchas. Get practice with spoofing headers, handling logins & session cookies, finding CSRF tokens, and other common network errors."]


In [8]:
print("Length of pages: ", len(pages))
print("Length of titles: ", len(titles))
print("Length of descriptions: ", len(descriptions))

Length of pages:  5
Length of titles:  5
Length of descriptions:  5
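
The length check above confirms that the two lists stayed aligned. As a rough alternative sketch, the same extraction can collect one record per page, so a missing element becomes `None` rather than raising an `AttributeError` or shifting the lists out of step. The `extract_pages` helper, the `sample_html` markup, and the use of the built-in `html.parser` are assumptions made for illustration; they are not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the structure the loop above relies on;
# the second block deliberately lacks its description paragraph.
sample_html = """
<div class="page"><h3><a href="/pages/first/">First page</a></h3>
  <p class="lead">First description.</p></div>
<div class="page"><h3><a href="/pages/second/">Second page</a></h3></div>
"""


def extract_pages(soup):
    """Return one dict per div.page, using None for any missing field."""
    records = []
    for page in soup.find_all("div", class_="page"):
        link = page.find("a")
        lead = page.find("p", class_="lead")
        records.append({
            "Title": link.get_text(strip=True) if link else None,
            "Description": lead.get_text(strip=True) if lead else None,
        })
    return records


sample_records = extract_pages(BeautifulSoup(sample_html, "html.parser"))
print(sample_records)
```

A list of dicts like this can be passed directly to `pd.DataFrame(...)`, which would make the separate length check unnecessary.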


**Creating a DataFrame**

In [9]:
df = pd.DataFrame({'Title': titles, 'Description': descriptions})
df.shape

(5, 2)

In [10]:
df.head()

|   | Title | Description |
|---|-------|-------------|
| 0 | Countries of the World: A Simple Example | A single page that lists information about all... |
| 1 | Hockey Teams: Forms, Searching and Pagination | Browse through a database of NHL team stats si... |
| 2 | Oscar Winning Films: AJAX and Javascript | Click through a bunch of great films. Learn ho... |
| 3 | Turtles All the Way Down: Frames & iFrames | Some older sites might still use frames to bre... |
| 4 | Advanced Topics: Real World Challenges You'll ... | Scraping real websites, you're likely run into... |
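
The cells in the preview above end in `...` because pandas shortens long strings when it displays a frame. A minimal sketch of lifting that limit temporarily with `pd.option_context` follows; the `demo` frame is a stand-in invented for illustration, and the same pattern should apply to `df`.

```python
import pandas as pd

# Stand-in frame (an assumption for illustration) with one long cell.
demo = pd.DataFrame({"Title": ["Example"], "Description": ["x" * 80]})

# Default display: long strings are cut off and end in "...".
truncated = repr(demo)

# Inside the context manager the column-width limit is removed;
# the previous setting is restored when the block exits.
with pd.option_context("display.max_colwidth", None):
    full = repr(demo)

print(truncated)
print(full)
```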
