# Part 1: Loading Web Pages with 'requests'

The requests module allows you to send HTTP requests using Python.

Each request returns a Response object with all the response data (content, encoding, status, and so on).
Here is one example of getting the HTML of a page:

In [3]:
import requests

res = requests.get('https://codedamn.com')

print(res.text)
print(res.status_code)

<!DOCTYPE html><html lang="en" class="text-gray-500 antialiased bg-white js-focus-visible"><head><meta charSet="utf-8"/><meta name="twitter:card" content="summary_large_image"/><meta name="twitter:site" content="@codedamncom"/><meta name="twitter:creator" content="@codedamncom"/><meta property="fb:app_id" content="261251371039658"/><meta property="og:url" content="https://codedamn.com/"/><meta property="og:type" content="website"/><meta property="og:image" content="https://codedamn.com/assets/images/cover.jpg"/><meta property="og:image:alt" content="codedamn cover image"/><meta property="og:image:width" content="1280"/><meta property="og:image:height" content="720"/><script type="application/ld+json">{
    "@context": "https://schema.org",
    "@type": "Organization",
    "url": "https://codedamn.com",
    "logo": "https://codedamn.com/assets/images/blacklogo.jpg"
  }</script><title>Learn to code for free - frontend, backend, full-stack, web3 and more | codedamn</title><meta name="robo
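
In practice it helps to confirm that the request succeeded before using the response text. The sketch below is one way to do that. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration and are not part of the original lab.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the HTML of `url` (hypothetical helper, not from the lab)."""
    # timeout keeps the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://codedamn.com')` would return the same HTML as `res.text` above, or raise an exception if the server answered with an error status.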

# Part 2: Extracting title with BeautifulSoup

Throughout this classroom, you’ll be using a Python library called BeautifulSoup to do web scraping.
Some features that make BeautifulSoup a powerful solution are:

1. It provides a lot of simple methods and Pythonic idioms for navigating, searching, and modifying a DOM tree. It doesn't take much code to write an application.
2. Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.

In short, BeautifulSoup can parse almost any HTML or XML document you give it, including markup that is not perfectly well-formed.

Here’s a simple example of BeautifulSoup:

In [4]:
from bs4 import BeautifulSoup

In [5]:
page = requests.get("https://codedamn.com")

In [6]:
soup = BeautifulSoup(page.content, 'html.parser')

In [7]:
title = soup.title.text # gets you the text of the <title>(...)</title>

In [8]:
title

'Learn to code for free - frontend, backend, full-stack, web3 and more | codedamn'
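
Not every document has a `<title>`, and `soup.title` is `None` when it is missing, so calling `.text` on it would fail. The following is a minimal sketch of a guard for that case. The function name `get_title` and the inline HTML strings are made up for illustration.

```python
from bs4 import BeautifulSoup


def get_title(html):
    """Return the stripped <title> text, or None if there is no <title>."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return None
    return soup.title.text.strip()


with_title = get_title("<html><head><title> Demo page </title></head></html>")
without_title = get_title("<p>No title here</p>")
print(with_title)     # Demo page
print(without_title)  # None
```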

# Part 3: Soup-ed body and head

In the last lab, you saw how you can extract the title from the page.
It is equally easy to extract other sections of the page.

You also saw that you have to call .text on these to get the string. You can also print them without calling .text, which gives you the full markup.

Try to run the example below:

In [9]:
# Extract body of page and head of page
page_body = soup.body

page_head = soup.head

In [10]:
# print the result
print(page_body, page_head)

<body class="font-body"><div data-reactroot="" id="__next"><div id="root"><div class="relative" id="layout-conntainer"><header class="z-[51] relative cd-morph-dropdown text-gray-100 bg-gradient-to-r from-gray-900 via-gray-800 to-gray-900"><div class="jsx-b35e882c88056c79 relative py-4 max-w-7xl mx-auto flex items-center justify-between px-4 sm:px-6 group"><a class="jsx-b35e882c88056c79 flex flex-grow lg:flex-grow-0 sm:space-x-2 items-center" data-testid="logo" href="/"><div class="jsx-b35e882c88056c79"><span style="box-sizing:border-box;display:inline-block;overflow:hidden;width:initial;height:initial;background:none;opacity:1;border:0;margin:0;padding:0;position:relative;max-width:100%"><span style="box-sizing:border-box;display:block;width:initial;height:initial;background:none;opacity:1;border:0;margin:0;padding:0;max-width:100%"><img alt="" aria-hidden="true" src="data:image/svg+xml,%3csvg%20xmlns=%27http://www.w3.org/2000/svg%27%20version=%271.1%27%20width=%2735.13%27%20height=%27

In [11]:
page_body

<body class="font-body"><div data-reactroot="" id="__next"><div id="root"><div class="relative" id="layout-conntainer"><header class="z-[51] relative cd-morph-dropdown text-gray-100 bg-gradient-to-r from-gray-900 via-gray-800 to-gray-900"><div class="jsx-b35e882c88056c79 relative py-4 max-w-7xl mx-auto flex items-center justify-between px-4 sm:px-6 group"><a class="jsx-b35e882c88056c79 flex flex-grow lg:flex-grow-0 sm:space-x-2 items-center" data-testid="logo" href="/"><div class="jsx-b35e882c88056c79"><span style="box-sizing:border-box;display:inline-block;overflow:hidden;width:initial;height:initial;background:none;opacity:1;border:0;margin:0;padding:0;position:relative;max-width:100%"><span style="box-sizing:border-box;display:block;width:initial;height:initial;background:none;opacity:1;border:0;margin:0;padding:0;max-width:100%"><img alt="" aria-hidden="true" src="data:image/svg+xml,%3csvg%20xmlns=%27http://www.w3.org/2000/svg%27%20version=%271.1%27%20width=%2735.13%27%20height=%27

In [14]:
type(page_body)

bs4.element.Tag

In [12]:
page_head

<head><meta charset="utf-8"/><meta content="summary_large_image" name="twitter:card"/><meta content="@codedamncom" name="twitter:site"/><meta content="@codedamncom" name="twitter:creator"/><meta content="261251371039658" property="fb:app_id"/><meta content="https://codedamn.com/" property="og:url"/><meta content="website" property="og:type"/><meta content="https://codedamn.com/assets/images/cover.jpg" property="og:image"/><meta content="codedamn cover image" property="og:image:alt"/><meta content="1280" property="og:image:width"/><meta content="720" property="og:image:height"/><script type="application/ld+json">{
    "@context": "https://schema.org",
    "@type": "Organization",
    "url": "https://codedamn.com",
    "logo": "https://codedamn.com/assets/images/blacklogo.jpg"
  }</script><title>Learn to code for free - frontend, backend, full-stack, web3 and more | codedamn</title><meta content="index,follow" name="robots"/><meta content="index,follow" name="googlebot"/><meta content="L

In [13]:
type(page_head)

bs4.element.Tag
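
The difference between the full markup and `.text` is easier to see on a small document. This sketch uses a made-up inline HTML string instead of the live page, so the names `demo_html` and `demo_soup` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

demo_html = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

demo_body = demo_soup.body
markup = str(demo_body)  # the tag rendered back to HTML
text = demo_body.text    # only the text nodes, with all tags removed

print(markup)  # <body><p>Hello <b>world</b></p></body>
print(text)    # Hello world
```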

# Part 4: select with BeautifulSoup

Now that you have explored some parts of BeautifulSoup, let's look at how you can select DOM elements with BeautifulSoup methods.

Once you have the soup variable (as in the previous labs), you can call .select on it. This method takes a CSS selector and returns the matching elements.

That is, you can reach down the DOM tree just like you would select elements with CSS. Let's look at an example:

In [15]:
# Extract first <h1>(...)</h1> text

first_h1 = soup.select('h1')[0].text

In [16]:
first_h1

'Learn ProgrammingInteractively'

.select returns a Python list of all the matching elements.
This is why you selected only the first element here with the [0] index.

In [17]:
type(first_h1)

str

In [19]:
soup.select('h1')

[<h1 class="mt-4 text-4xl font-extrabold tracking-tight text-white sm:mt-5 sm:text-6xl lg:mt-6 xl:text-6xl"><span class="block">Learn Programming</span><span class="block pb-3 text-transparent bg-clip-text bg-gradient-to-r from-indigo-200 to-cyan-400">Interactively</span></h1>]

In [20]:
len(soup.select('h1'))

1

In [21]:
soup.select('h1')[0]

<h1 class="mt-4 text-4xl font-extrabold tracking-tight text-white sm:mt-5 sm:text-6xl lg:mt-6 xl:text-6xl"><span class="block">Learn Programming</span><span class="block pb-3 text-transparent bg-clip-text bg-gradient-to-r from-indigo-200 to-cyan-400">Interactively</span></h1>

In [23]:
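# .contents (plural) lists the tag's direct children; .content would look for a <content> child tag and return None
soup.select('h1')[0].contents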
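
The heading text above came out as 'Learn ProgrammingInteractively' because `.text` joins the two `<span>` strings with nothing between them. One possible way around this is `get_text` with a separator, shown here on a simplified copy of that heading. `select_one` is also shown as an alternative to `select(...)[0]`. The names `demo_html` and `demo_soup` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

demo_html = (
    '<h1><span class="block">Learn Programming</span>'
    '<span class="block">Interactively</span></h1>'
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

# select_one returns the first match directly, or None when nothing matches
heading = demo_soup.select_one("h1")
missing = demo_soup.select_one("h2")

joined = heading.text                                  # strings glued together
spaced = heading.get_text(separator=" ", strip=True)   # strings joined by a space

print(joined)   # Learn ProgrammingInteractively
print(spaced)   # Learn Programming Interactively
print(missing)  # None
```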

# Part 5: Top items being scraped right now

### Note that this is only one of the possible solutions; you can attempt this in a different way too. In this solution:

1. First, you select all the div.thumbnail elements, which gives you a list of individual products.
   Then you iterate over them. Because select can be called again on each element it returns, you can use select to get the title.
2. Because you're already inside a loop over div.thumbnail, the h4 > a.title selector gives you only one result, inside a list. You select that list's 0th element and extract the text.
3. Finally, you strip any extra whitespace and append it to your list.

Straightforward, right?

In [24]:
soup.select('div.thumbnail')

[]

In [26]:
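top_items = []
products = soup.select('div.thumbnail')
for elem in products:
    # h4 > a.title matches the product title link inside each thumbnail
    title = elem.select('h4 > a.title')[0].text.strip()
    top_items.append(title)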
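
The page loaded earlier has no div.thumbnail elements, which is why the selection above returns an empty list. To see the chained select at work, here is a minimal sketch on hypothetical inline markup. The HTML structure, product names, and the `demo_` variable names are all made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical product cards, shaped like the markup the steps describe
demo_html = """
<div class="thumbnail"><div class="caption">
  <h4><a class="title" href="/p/1">  Laptop A </a></h4>
</div></div>
<div class="thumbnail"><div class="caption">
  <h4><a class="title" href="/p/2">Laptop B</a></h4>
</div></div>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

demo_items = []
for product in demo_soup.select("div.thumbnail"):
    # select on a tag searches only inside that tag
    matches = product.select("h4 > a.title")
    if matches:  # skip cards that have no title link
        demo_items.append(matches[0].text.strip())

print(demo_items)  # ['Laptop A', 'Laptop B']
```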

# Part 6: Extracting Links

So far you have seen how you can extract the text, or rather the innerText, of elements.

Let's now see how you can extract attributes, working up to extracting links from the page.

Here’s an example of how to extract all the image information from the page:

In [27]:
soup.select('img')

[<img alt="facebook pixel" class="hidden" height="1" src="https://www.facebook.com/tr?id=471148730684170&amp;ev=PageView&amp;noscript=1" width="1"/>,
 <img alt="" aria-hidden="true" src="data:image/svg+xml,%3csvg%20xmlns=%27http://www.w3.org/2000/svg%27%20version=%271.1%27%20width=%2735.13%27%20height=%2758%27/%3e" style="display:block;max-width:100%;width:initial;height:initial;background:none;opacity:1;border:0;margin:0;padding:0"/>,
 ,
 <img alt="codedamn logo" data-nimg="intrinsic" decoding="async" loading="lazy" src="//images.weserv.nl/?url=https%3A%2F%2Fcodedamn.com%2Fassets%2Fimages%2Fwhite-logo.png&amp;w=70.26&

In [28]:
soup.select('img')[0]

<img alt="facebook pixel" class="hidden" height="1" src="https://www.facebook.com/tr?id=471148730684170&amp;ev=PageView&amp;noscript=1" width="1"/>

In [29]:
soup.select('img')[-1]

<img alt="codedamn footer logo" data-nimg="intrinsic" decoding="async" loading="lazy" src="//images.weserv.nl/?url=https%3A%2F%2Fcodedamn.com%2Fassets%2Fimages%2Fwhite-logo.png&amp;w=100&amp;q=75" srcset="//images.weserv.nl/?url=https%3A%2F%2Fcodedamn.com%2Fassets%2Fimages%2Fwhite-logo.png&amp;w=100&amp;q=75 1x, //images.weserv.nl/?url=https%3A%2F%2Fcodedamn.com%2Fassets%2Fimages%2Fwhite-logo.png&amp;w=100&amp;q=75 2x" style="position:absolute;top:0;left:0;bottom:0;right:0;box-sizing:border-box;padding:0;border:none;margin:auto;display:block;width:0;height:0;min-width:100%;max-width:100%;min-height:100%;max-height:100%"/>

In [32]:
# Create image_data as an empty list
image_data = []

# Collect the alt, class, and src attributes of every <img> on the page
images = soup.select('img')

for image in images:
    alt = image.get('alt')
    alt_class = image.get('class')  # class is multi-valued: a list, or None if absent
    src = image.get('src')
    image_data.append({"alt": alt, "alt_class": alt_class, "src": src})

print(image_data)

[{'alt': 'facebook pixel', 'alt_class': ['hidden'], 'src': 'https://www.facebook.com/tr?id=471148730684170&ev=PageView&noscript=1'}, {'alt': '', 'alt_class': None, 'src': 'data:image/svg+xml,%3csvg%20xmlns=%27http://www.w3.org/2000/svg%27%20version=%271.1%27%20width=%2735.13%27%20height=%2758%27/%3e'}, {'alt': 'codedamn logo', 'alt_class': None, 'src': 'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7'}, {'alt': 'codedamn logo', 'alt_class': None, 'src': '//images.weserv.nl/?url=https%3A%2F%2Fcodedamn.com%2Fassets%2Fimages%2Fwhite-logo.png&w=70.26&q=75'}, {'alt': '', 'alt_class': None, 'src': 'data:image/svg+xml,%3csvg%20xmlns=%27http://www.w3.org/2000/svg%27%20version=%271.1%27%20width=%2732%27%20height=%2732%27/%3e'}, {'alt': 'Language image', 'alt_class': None, 'src': 'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7'}, {'alt': 'Language image', 'alt_class': None, 'src': '//images.weserv.nl/?url=https%3A%2F%2Fcodedamn.com%2

In [34]:
import pandas as pd
df_images = pd.DataFrame(image_data)
df_images

| | alt | alt_class | src |
|---|---|---|---|
| 0 | facebook pixel | [hidden] | https://www.facebook.com/tr?id=471148730684170... |
| 1 | | | data:image/svg+xml,%3csvg%20xmlns=%27http://ww... |
| 2 | codedamn logo | | data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP//... |
| 3 | codedamn logo | | //images.weserv.nl/?url=https%3A%2F%2Fcodedamn... |
| 4 | | | data:image/svg+xml,%3csvg%20xmlns=%27http://ww... |
| ... | ... | ... | ... |
| 135 | amazon | [text-gray-800, hover:text-gray-600] | data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP//... |
| 136 | amazon | [text-gray-800, hover:text-gray-600] | //images.weserv.nl/?url=https%3A%2F%2Fcodedamn... |
| 137 | | | data:image/svg+xml,%3csvg%20xmlns=%27http://ww... |
| 138 | codedamn footer logo | | data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP//... |
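
The image output shows two behaviours of `.get` worth isolating: a missing attribute comes back as `None`, and `class` comes back as a list. The sketch below demonstrates both on a single made-up `<img>` tag, and contrasts `.get` with square-bracket access. The tag and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

demo_soup = BeautifulSoup(
    '<img alt="logo" class="hidden big" src="/logo.png"/>', "html.parser"
)
img = demo_soup.select_one("img")

classes = img.get("class")       # multi-valued attribute, returned as a list
title = img.get("title")         # attribute is absent, so this is None
fallback = img.get("title", "")  # a default can be supplied, as with dict.get

# Square brackets behave like a dict lookup and raise KeyError when absent
try:
    img["title"]
    raised = False
except KeyError:
    raised = True

print(classes)  # ['hidden', 'big']
print(title)    # None
print(raised)   # True
```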



### In this lab, your task is to extract the href attribute of links with their text as well.

Make sure of the following things:

1. You have to create a list called all_links. In this list, store all link dict information.
   It should be in the following format:
   
info = {
   "href": "<link here>",
   "text": "<link text here>"
}

2. Make sure your text is stripped of any whitespace
3. Make sure you check if your .text is None before you call .strip() on it.
4. Store all these dicts in the all_links
5. Print this list at the end

You are extracting the attribute values just like you extract values from a dict, using the get function. 

Let's take a look at the solution for this lab:

In [35]:
# Create all_links as an empty list
all_links = []

# Collect the href and the stripped text of every <a> on the page
links = soup.select('a')
for ahref in links:
    text = ahref.text
    text = text.strip() if text is not None else ''

    href = ahref.get('href')
    href = href.strip() if href is not None else ''
    all_links.append({"href": href, "text": text})

print(all_links)

[{'href': '/', 'text': 'codedamn'}, {'href': '/pricing', 'text': 'Pricing'}, {'href': '/contact', 'text': 'Contact Us'}, {'href': '/login', 'text': 'Sign in'}, {'href': '/register', 'text': 'Create Free Account'}, {'href': '/learning-paths/frontend', 'text': 'Frontend learning path Become a frontend React web developer by learning through interactive courses'}, {'href': '/learning-paths/backend', 'text': 'Backend learning path Become a backend developer by learning through interactive courses'}, {'href': '/learning-paths/fullstack', 'text': 'Full-stack learning path PopularBecome a full-stack web developer by learning through interactive courses'}, {'href': '/learning-paths/web3', 'text': 'Web 3.0 And Blockchain BetaStart your Web 3.0 journey building with ethereum, solidity, and more'}, {'href': '/learning-paths', 'text': 'Explore All Paths'}, {'href': '/courses', 'text': 'Explore All Courses'}, {'href': '/playgrounds?template=html', 'text': 'HTML/CSS'}, {'href': '/playgrounds?templat

### note
Here, you extract the href attribute just like you did in the image case.

The only extra step is checking whether the value is None.

If it is None, you set it to an empty string; otherwise, you strip the whitespace.

In [36]:
pd.DataFrame(all_links)

| | href | text |
|---|---|---|
| 0 | / | codedamn |
| 1 | /pricing | Pricing |
| 2 | /contact | Contact Us |
| 3 | /login | Sign in |
| 4 | /register | Create Free Account |
| ... | ... | ... |
| 129 | /news/product/write-for-codedamn | Write on codedamn |
| 130 | /campus | Campus Evangelist |
| 131 | https://partner.codedamn.com | Affiliate Program |
| 132 | /privacy-policy | Privacy |
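
Most of the extracted href values are relative (for example /pricing), and some image sources earlier were protocol-relative (starting with //). If absolute URLs are needed, one option is `urljoin` from the standard library. This is a sketch that goes beyond the lab's requirements. The helper name `absolutize`, the `BASE_URL` constant, and the sample image path are assumptions for illustration.

```python
from urllib.parse import urljoin

# Assumed base: the page that was scraped in the earlier cells
BASE_URL = "https://codedamn.com"


def absolutize(href, base=BASE_URL):
    """Resolve a possibly relative link against the page it came from."""
    return urljoin(base, href)


print(absolutize("/pricing"))                       # https://codedamn.com/pricing
print(absolutize("https://partner.codedamn.com"))   # https://partner.codedamn.com
print(absolutize("//images.weserv.nl/logo.png"))    # https://images.weserv.nl/logo.png
```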


# Part 7: Generating CSV from data

In [40]:
import csv

Finally, let's see how you can generate a CSV file from a set of data. You will create a CSV with the following headings:

Product Name

Price

Description

Reviews

Product Image

Each product is located in a div.thumbnail element. The CSV boilerplate is given below:

In [37]:
soup.select('div.thumbnail')

[]

In [50]:
# Create all_products as an empty list
all_products = []

# Extract the fields of each product card into a dict
products = soup.select('div.thumbnail')
for product in products:
    name = product.select('h4 > a')[0].text.strip()
    description = product.select('p.description')[0].text.strip()
    price = product.select('h4.price')[0].text.strip()
    reviews = product.select('div.ratings')[0].text.strip()
    image = product.select('img')[0].get('src')

    all_products.append({
        "name": name,
        "description": description,
        "price": price,
        "reviews": reviews,
        "image": image
    })

In [51]:
all_products

[]

In [52]:
# The selector matched nothing on this page, so add one sample row by hand
all_products.append({'name': 'Asus ROG Strix G...',
  'description': 'Apple MacBook Air 13.3", Core i5 1.8GHz, 8GB, 128GB\n\t\t\t\t\t\t\t\t\t\t\tSSD, Intel HD 4000, RUS',
  'price': '$1101.83',
  'reviews': '4 reviews',
  'image': '/webscraper-python-codedamn-classroom-website/cart2.png'})

In [53]:
all_products

[{'name': 'Asus ROG Strix G...',
  'description': 'Apple MacBook Air 13.3", Core i5 1.8GHz, 8GB, 128GB\n\t\t\t\t\t\t\t\t\t\t\tSSD, Intel HD 4000, RUS',
  'price': '$1101.83',
  'reviews': '4 reviews',
  'image': '/webscraper-python-codedamn-classroom-website/cart2.png'}]

In [56]:
keys = all_products[0].keys()

In [57]:
with open('products.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_products)

In [58]:
# dataframe

pd.read_csv('products.csv')

| | name | description | price | reviews | image |
|---|---|---|---|---|---|
| 0 | Asus ROG Strix G... | Apple MacBook Air 13.3", Core i5 1.8GHz, 8GB, ... | $1101.83 | 4 reviews | /webscraper-python-codedamn-classroom-website/... |


The for loop is the most interesting part here. It extracts all the elements and attributes using what you've learned so far in the labs.

When you run this code, you end up with a nice CSV file. And that's about all the basics of web scraping with BeautifulSoup!
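
The file written above uses the dict keys (name, description, and so on) as its header row. If the header should instead match the headings listed at the start of this part, one possible approach is to map each key to its heading before writing. The sketch below writes to an in-memory buffer rather than a file so it can be run anywhere. The sample row and the names `COLUMN_HEADINGS` and `sample_rows` are made up for illustration.

```python
import csv
import io

# Maps the dict keys used while scraping to the CSV headings from this part
COLUMN_HEADINGS = {
    "name": "Product Name",
    "price": "Price",
    "description": "Description",
    "reviews": "Reviews",
    "image": "Product Image",
}

# Hypothetical row, shaped like the dicts stored in all_products
sample_rows = [{
    "name": "Example laptop",
    "price": "$999.00",
    "description": '13.3" display, 8GB RAM',
    "reviews": "4 reviews",
    "image": "/images/example.png",
}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(COLUMN_HEADINGS.values()))
writer.writeheader()
for row in sample_rows:
    # Rename each key to its heading so it lines up with fieldnames
    writer.writerow({COLUMN_HEADINGS[key]: value for key, value in row.items()})

csv_text = buffer.getvalue()
print(csv_text)
```

The csv module quotes the description automatically because it contains both a comma and a double quote, so the value survives a round trip through `csv.DictReader`.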