# Web Scraping with BeautifulSoup

### BeautifulSoup is a Python package used for parsing HTML and XML documents. It provides a convenient way to extract data from web pages by navigating the HTML/XML tree structure and accessing specific elements based on their tags, attributes, or content.

## Imports and URL:

In [3]:
#Imports -

from bs4 import BeautifulSoup
import requests

In [4]:
#url - 

url = "http://codewithharry.com"

## Get the HTML:

In [6]:
#Fetching the content by using the requests module-

r = requests.get(url)

In [7]:
#For HTML content-

htmlContent = r.content

In [8]:
#Checking the response status (200 means the request succeeded)-

print(r)

<Response [200]>
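A request does not always succeed, so it can help to wrap the fetch in a small helper. The sketch below is only one possible approach; the name `fetch_html` and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML of url, raising on network or HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html_text = fetch_html("http://codewithharry.com")
```

Returning `response.text` gives a decoded string, while `r.content` used in the notebook gives raw bytes; BeautifulSoup accepts either.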


## Parse the HTML/Creating the Soup-

In [9]:
soup = BeautifulSoup(htmlContent, 'html.parser')

In [10]:
print(soup)

<!DOCTYPE html>
<html><head><meta content="width=device-width" name="viewport"/><meta charset="utf-8"/><link href="/img/favicon.ico" rel="shortcut icon" type="image/x-icon"/><script async="" crossorigin="anonymous" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-9655830461045889"></script><title>Learn to code online - CodeWithHarry</title><meta content="Generated by create next app" name="description"/><link href="/img/favicon.ico" rel="icon"/><meta content="7" name="next-head-count"/><meta name="next-font-preconnect"/><link as="style" href="/_next/static/css/ced2f5f753e05303.css" rel="preload"/><link data-n-g="" href="/_next/static/css/ced2f5f753e05303.css" rel="stylesheet"/><link as="style" href="/_next/static/css/470c5e8db7cdc7e9.css" rel="preload"/><link data-n-p="" href="/_next/static/css/470c5e8db7cdc7e9.css" rel="stylesheet"/><noscript data-n-css=""></noscript><script defer="" nomodule="" src="/_next/static/chunks/polyfills-5cd94c89d3acac5f.js">

In [12]:
#Just to make content readable-

print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta content="width=device-width" name="viewport"/>
  <meta charset="utf-8"/>
  <link href="/img/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <script async="" crossorigin="anonymous" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-9655830461045889">
  </script>
  <title>
   Learn to code online - CodeWithHarry
  </title>
  <meta content="Generated by create next app" name="description"/>
  <link href="/img/favicon.ico" rel="icon"/>
  <meta content="7" name="next-head-count"/>
  <meta name="next-font-preconnect"/>
  <link as="style" href="/_next/static/css/ced2f5f753e05303.css" rel="preload"/>
  <link data-n-g="" href="/_next/static/css/ced2f5f753e05303.css" rel="stylesheet"/>
  <link as="style" href="/_next/static/css/470c5e8db7cdc7e9.css" rel="preload"/>
  <link data-n-p="" href="/_next/static/css/470c5e8db7cdc7e9.css" rel="stylesheet"/>
  <noscript data-n-css="">
  </noscript>
  <script defer="" nomodule=

## HTML Tree Traversal:

In [14]:
#To get the title of HTML Page-

title = soup.title

In [15]:
print(title)

<title>Learn to code online - CodeWithHarry</title>


### Commonly Used Types of Objects:

#### 1. Tag

In [16]:
#Printing the type shows that title is a Tag object, not a plain string

print(type(title))

<class 'bs4.element.Tag'>


#### 2. NavigableString: Special type of string used by Beautiful Soup

In [17]:
#Using .string gives the text between the opening and closing title tags

print(type(title.string))

<class 'bs4.element.NavigableString'>
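As a small illustration of the Tag and NavigableString types, the sketch below parses an inline HTML string (made up for this example) and reads the tag name and its text. Calling `str()` on a NavigableString gives an ordinary Python string.

```python
from bs4 import BeautifulSoup

# Inline HTML used only for this example
demo = BeautifulSoup(
    "<html><head><title>Demo page</title></head></html>", "html.parser"
)

title_tag = demo.title              # a Tag object
tag_name = title_tag.name           # the tag's name
title_text = str(title_tag.string)  # NavigableString converted to str

print(tag_name)
print(title_text)
```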


#### 3. BeautifulSoup

In [19]:
print(type(soup))

<class 'bs4.BeautifulSoup'>


#### 4. Comments

In [49]:
markup = '<p><!-- this is comment --></p>'
soup2 = BeautifulSoup(markup, 'html.parser')  #name the parser explicitly to avoid a warning
print(soup2.p)

<p><!-- this is comment --></p>


In [50]:
print(soup2.p.string)

 this is comment 


In [51]:
print(type(soup2.p.string))

<class 'bs4.element.Comment'>
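Because Comment is a subclass of NavigableString, comments can be separated from ordinary text with `isinstance`. A minimal sketch, using a made-up snippet:

```python
from bs4 import BeautifulSoup, Comment

# Inline HTML used only for this example
snippet = BeautifulSoup("<p>Visible text<!-- hidden note --></p>", "html.parser")

# Collect every comment node anywhere in the parsed snippet
comments = [node for node in snippet.descendants if isinstance(node, Comment)]

# Keep only the non-comment children of the <p> tag
visible = [node for node in snippet.p.children if not isinstance(node, Comment)]

print(comments)
print(visible)
```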


## Scraping

In [21]:
#Trying to get all the paragraphs from the page-
#p is the name of the paragraph tag in HTML

paras = soup.find_all('p')

#### The find_all() method in BeautifulSoup searches an HTML or XML document and returns a list of every element that matches the given criteria, such as tag name, attributes, or CSS class. If nothing matches, it returns an empty list.
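A few common ways of filtering with find_all() are sketched below on a small made-up page: limiting the number of results, matching several tag names at once, and matching on attributes.

```python
from bs4 import BeautifulSoup

# Inline HTML used only for this example
page = BeautifulSoup(
    '<h1>Title</h1>'
    '<p class="intro">First</p><p>Second</p><p>Third</p>'
    '<a href="/home">Home</a><a>No link</a>',
    "html.parser",
)

first_two = page.find_all("p", limit=2)               # stop after two matches
headings_and_paras = page.find_all(["h1", "p"])       # any of several tag names
linked = page.find_all("a", href=True)                # only anchors that have an href
intro = page.find_all("p", attrs={"class": "intro"})  # match on an attribute value

print([tag.name for tag in headings_and_paras])
```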

In [24]:
print(paras)

[<p class="mt-2 text-sm text-gray-500 md:text-base dark:text-gray-400">Confused on which course to take? I have got you covered. Browse courses and find out the best course for you. Its free! Code With Harry is my attempt to teach basics and those coding techniques to people in short time which took me ages to learn.</p>, <p class="text-gray-700 text-base dark:text-gray-400">Python is one of the most demanded programming languages in the job market. Surprisingly, it is equally easy to learn and master Python. Let's commit our 100 days of code to python!</p>, <p class="text-gray-700 text-base dark:text-gray-400">This latest JavaScript course comes with premium curriculum that covers everything from basics to advance. On top of that, you will get my handwritten notes of JS for completely free. What are you waiting for? Just Enroll Buddy</p>, <p class="text-gray-700 text-base dark:text-gray-400">React is a free and open-source front-end JavaScript library. This series will cover React fro

In [25]:
#Now Anchor Tag-

anchors = soup.find_all('a')
print(anchors)

[<a href="/">CodeWithHarry</a>, <a href="/">Home</a>, <a href="/videos/">Courses</a>, <a href="/tutorials/">Tutorial</a>, <a href="/blog/">Blog</a>, <a href="/notes/">Notes</a>, <a href="/contact/">Contact</a>, <a href="/my-gear/">My Gear</a>, <a href="/work/">Work With Us</a>, <a href="/tutorial/html-home/">HTML</a>, <a href="/tutorial/css-home/">CSS</a>, <a href="/tutorial/js/">JS</a>, <a href="/tutorial/c/">C</a>, <a href="/tutorial/cplusplus/">C++</a>, <a href="/tutorial/java/">JAVA</a>, <a href="/tutorial/python/">PYTHON</a>, <a href="/tutorial/php/">PHP</a>, <a href="/tutorial/react-home/">REACT JS</a>, <a href="/">Home</a>, <a href="/videos/">Courses</a>, <a href="/tutorial/html-home/">HTML</a>, <a href="/tutorial/css-home/">CSS</a>, <a href="/tutorial/js/">JS</a>, <a href="/tutorial/c/">C</a>, <a href="/tutorial/cplusplus/">C++</a>, <a href="/tutorial/java/">JAVA</a>, <a href="/tutorial/python/">PYTHON</a>, <a href="/tutorial/php/">PHP</a>, <a href="/tutorial/react-home/">REACT

In [26]:
#To get the first <p> element of the HTML page-

print(soup.find('p'))

<p class="mt-2 text-sm text-gray-500 md:text-base dark:text-gray-400">Confused on which course to take? I have got you covered. Browse courses and find out the best course for you. Its free! Code With Harry is my attempt to teach basics and those coding techniques to people in short time which took me ages to learn.</p>


In [27]:
#To know the classes of any element in HTML page-

print(soup.find('p')['class'])

['mt-2', 'text-sm', 'text-gray-500', 'md:text-base', 'dark:text-gray-400']
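The output above is a list because `class` is a multi-valued attribute. The sketch below, on a made-up tag, contrasts it with a single-valued attribute and shows `get()` for attributes that may be absent.

```python
from bs4 import BeautifulSoup

# Inline HTML used only for this example
tag = BeautifulSoup('<p class="lead big" id="intro">Hello</p>', "html.parser").p

classes = tag["class"]       # multi-valued attribute: a list of class names
element_id = tag["id"]       # single-valued attribute: a plain string
missing = tag.get("title")   # get() returns None instead of raising KeyError
has_lead = "lead" in tag.get("class", [])

print(classes, element_id, missing, has_lead)
```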


In [32]:
#To find all the <p> elements that have the class text-gray-700-

print(soup.find_all('p', class_= 'text-gray-700'))

[<p class="text-gray-700 text-base dark:text-gray-400">Python is one of the most demanded programming languages in the job market. Surprisingly, it is equally easy to learn and master Python. Let's commit our 100 days of code to python!</p>, <p class="text-gray-700 text-base dark:text-gray-400">This latest JavaScript course comes with premium curriculum that covers everything from basics to advance. On top of that, you will get my handwritten notes of JS for completely free. What are you waiting for? Just Enroll Buddy</p>, <p class="text-gray-700 text-base dark:text-gray-400">React is a free and open-source front-end JavaScript library. This series will cover React from starting to the end. We will learn react from the ground up!</p>]


In [33]:
#When you use a class name which does not exist, you get an empty list-

print(soup.find_all('p', class_= 'lead'))

[]


In [34]:
#To get the text from the tags/soup

print(soup.find('p').get_text())

Confused on which course to take? I have got you covered. Browse courses and find out the best course for you. Its free! Code With Harry is my attempt to teach basics and those coding techniques to people in short time which took me ages to learn.


In [35]:
print(soup.get_text())

Learn to code online - CodeWithHarryCodeWithHarryMenuLoginHomeCoursesTutorialBlogNotesContactMy GearWork With UsLoginSignupHTMLCSSJSCC++JAVAPYTHONPHPREACT JSHomeCoursesTutorial HTMLCSSJSCC++JAVAPYTHONPHPREACT JSBlogNotesContactMy GearWork With UsWelcome to Learn Confused on which course to take? I have got you covered. Browse courses and find out the best course for you. Its free! Code With Harry is my attempt to teach basics and those coding techniques to people in short time which took me ages to learn.Free CoursesExplore BlogRecommended CoursesFREE COURSEPython Tutorials - 100 Days of Code    Python is one of the most demanded programming languages in the job market. Surprisingly, it is equally easy to learn and master Python. Let's commit our 100 days of code to python!  Start Watching FREE COURSEUltimate JavaScript CourseThis latest JavaScript course comes with premium curriculum that covers everything from basics to advance. On top of that, you will get my handwritten notes of JS

### Above you can see that all the text of the page is printed without tags
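The text above runs together because get_text() joins the strings with nothing in between. Passing a separator and `strip=True` keeps the pieces apart. A minimal sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

# Inline HTML used only for this example
block = BeautifulSoup("<div><h1>Title</h1><p>Body text</p></div>", "html.parser")

joined = block.get_text()                 # strings concatenated directly
spaced = block.get_text(" ", strip=True)  # separator between strings, whitespace trimmed

print(joined)
print(spaced)
```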

In [36]:
#To get all the links of Anchor tag-

for link in anchors:
    print(link.get('href'))

/
/
/videos/
/tutorials/
/blog/
/notes/
/contact/
/my-gear/
/work/
/tutorial/html-home/
/tutorial/css-home/
/tutorial/js/
/tutorial/c/
/tutorial/cplusplus/
/tutorial/java/
/tutorial/python/
/tutorial/php/
/tutorial/react-home/
/
/videos/
/tutorial/html-home/
/tutorial/css-home/
/tutorial/js/
/tutorial/c/
/tutorial/cplusplus/
/tutorial/java/
/tutorial/python/
/tutorial/php/
/tutorial/react-home/
/blog/
/notes/
/contact/
/my-gear/
/work/
https://www.facebook.com/codewithharry
https://www.twitter.com/codewithharry
https://www.instagram.com/codewithharry
https://www.github.com/codewithharry


### As you can see, most of these links are relative, so they can't be visited as they are. So I add the site URL in front of each one-

#### I used a set instead of a list because repeated links are then stored only once

In [48]:
all_links = set()
for link in anchors:
    if(link.get('href') != '#'):
        linkText = "http://codewithharry.com" + link.get('href')
        all_links.add(linkText)  #store the URL string, not the Tag object
        print(linkText)

http://codewithharry.com/
http://codewithharry.com/
http://codewithharry.com/videos/
http://codewithharry.com/tutorials/
http://codewithharry.com/blog/
http://codewithharry.com/notes/
http://codewithharry.com/contact/
http://codewithharry.com/my-gear/
http://codewithharry.com/work/
http://codewithharry.com/tutorial/html-home/
http://codewithharry.com/tutorial/css-home/
http://codewithharry.com/tutorial/js/
http://codewithharry.com/tutorial/c/
http://codewithharry.com/tutorial/cplusplus/
http://codewithharry.com/tutorial/java/
http://codewithharry.com/tutorial/python/
http://codewithharry.com/tutorial/php/
http://codewithharry.com/tutorial/react-home/
http://codewithharry.com/
http://codewithharry.com/videos/
http://codewithharry.com/tutorial/html-home/
http://codewithharry.com/tutorial/css-home/
http://codewithharry.com/tutorial/js/
http://codewithharry.com/tutorial/c/
http://codewithharry.com/tutorial/cplusplus/
http://codewithharry.com/tutorial/java/
http://codewithharry.com/tutorial
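Plain string concatenation also prefixes links that are already absolute (such as the social media links) and fails on anchors without an `href`. One possible alternative, sketched here on made-up anchors, is `urljoin` from the standard library, which leaves absolute URLs unchanged. The variable names are assumptions for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://codewithharry.com"

# Inline HTML used only for this example
sample = BeautifulSoup(
    '<a href="/">Home</a><a href="/videos/">Courses</a><a href="/">Logo</a>'
    '<a href="#">Menu</a><a>No href</a>'
    '<a href="https://www.github.com/codewithharry">GitHub</a>',
    "html.parser",
)

unique_links = set()
for anchor in sample.find_all("a"):
    href = anchor.get("href")
    if not href or href == "#":
        continue  # skip anchors with no href and placeholder links
    unique_links.add(urljoin(base_url, href))

# Print after the loop so each link appears only once
for link in sorted(unique_links):
    print(link)
```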

## Contents:

### In BeautifulSoup, you can access the contents of an HTML element using various methods and properties

In [23]:
#Here __next is the id

__next = soup.find(id='__next')
print(__next)

<div id="__next"><div class=""><div class="" style="position:fixed;top:0;left:0;height:2px;background:transparent;z-index:99999999999;width:100%"><div class="" style="height:100%;background:purple;transition:all 500ms ease;width:0%"><div style="box-shadow:0 0 10px purple, 0 0 10px purple;width:5%;opacity:1;position:absolute;height:100%;transition:all 500ms ease;transform:rotate(3deg) translate(0px, -4px);left:-10rem"></div></div></div><div class="w-full z-10 sticky bg-white top-0 border-b border-grey-light shadow-md dark:bg-gray-800 dark:border-black" id="imgpreview2"><div class="w-full flex flex-wrap items-center lg:justify-between mt-0 py-4"><div class="px-0 lg:pl-4 flex items-center lg:mx-4 cursor-pointer text-purple-700 text-xl font-bold mx-3 dark:text-purple-300"><a href="/">CodeWithHarry</a></div><div class="flex items-center md:hidden"><div class="text-purple-700 text-md font-semibold">Menu</div><svg class="text-purple-700 mt-1" fill="currentColor" height="1em" stroke="currentCo

In [24]:
#.children property: It returns an iterator over the direct children of the selected element

__next.children

<list_iterator at 0x24230d786a0>

In [28]:
#.contents property: It returns a list of all direct children of the selected element.

#Each item in the list can be accessed by index and explored further in the same way

__next.contents

[<div class=""><div class="" style="position:fixed;top:0;left:0;height:2px;background:transparent;z-index:99999999999;width:100%"><div class="" style="height:100%;background:purple;transition:all 500ms ease;width:0%"><div style="box-shadow:0 0 10px purple, 0 0 10px purple;width:5%;opacity:1;position:absolute;height:100%;transition:all 500ms ease;transform:rotate(3deg) translate(0px, -4px);left:-10rem"></div></div></div><div class="w-full z-10 sticky bg-white top-0 border-b border-grey-light shadow-md dark:bg-gray-800 dark:border-black" id="imgpreview2"><div class="w-full flex flex-wrap items-center lg:justify-between mt-0 py-4"><div class="px-0 lg:pl-4 flex items-center lg:mx-4 cursor-pointer text-purple-700 text-xl font-bold mx-3 dark:text-purple-300"><a href="/">CodeWithHarry</a></div><div class="flex items-center md:hidden"><div class="text-purple-700 text-md font-semibold">Menu</div><svg class="text-purple-700 mt-1" fill="currentColor" height="1em" stroke="currentColor" stroke-widt

In [29]:
__next = soup.find(id='__next')
for elem in __next.contents:
    print(elem)

<div class=""><div class="" style="position:fixed;top:0;left:0;height:2px;background:transparent;z-index:99999999999;width:100%"><div class="" style="height:100%;background:purple;transition:all 500ms ease;width:0%"><div style="box-shadow:0 0 10px purple, 0 0 10px purple;width:5%;opacity:1;position:absolute;height:100%;transition:all 500ms ease;transform:rotate(3deg) translate(0px, -4px);left:-10rem"></div></div></div><div class="w-full z-10 sticky bg-white top-0 border-b border-grey-light shadow-md dark:bg-gray-800 dark:border-black" id="imgpreview2"><div class="w-full flex flex-wrap items-center lg:justify-between mt-0 py-4"><div class="px-0 lg:pl-4 flex items-center lg:mx-4 cursor-pointer text-purple-700 text-xl font-bold mx-3 dark:text-purple-300"><a href="/">CodeWithHarry</a></div><div class="flex items-center md:hidden"><div class="text-purple-700 text-md font-semibold">Menu</div><svg class="text-purple-700 mt-1" fill="currentColor" height="1em" stroke="currentColor" stroke-width

In [30]:
__next = soup.find(id='__next')
for elem in __next.children:
    print(elem)

<div class=""><div class="" style="position:fixed;top:0;left:0;height:2px;background:transparent;z-index:99999999999;width:100%"><div class="" style="height:100%;background:purple;transition:all 500ms ease;width:0%"><div style="box-shadow:0 0 10px purple, 0 0 10px purple;width:5%;opacity:1;position:absolute;height:100%;transition:all 500ms ease;transform:rotate(3deg) translate(0px, -4px);left:-10rem"></div></div></div><div class="w-full z-10 sticky bg-white top-0 border-b border-grey-light shadow-md dark:bg-gray-800 dark:border-black" id="imgpreview2"><div class="w-full flex flex-wrap items-center lg:justify-between mt-0 py-4"><div class="px-0 lg:pl-4 flex items-center lg:mx-4 cursor-pointer text-purple-700 text-xl font-bold mx-3 dark:text-purple-300"><a href="/">CodeWithHarry</a></div><div class="flex items-center md:hidden"><div class="text-purple-700 text-md font-semibold">Menu</div><svg class="text-purple-700 mt-1" fill="currentColor" height="1em" stroke="currentColor" stroke-width

#### Both (.contents) and (.children) give access to the direct children of an element, including tags and non-tag nodes such as strings and comments. The difference is the form: (.contents) returns them as a list, which can be indexed and measured with len(), while (.children) returns an iterator over the same children that is consumed as you loop over it.
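To compare the two on a small case, the sketch below uses a made-up `div` that holds one string and one tag. It also shows one way to get only the child tags, using `find_all(recursive=False)`.

```python
from bs4 import BeautifulSoup

# Inline HTML used only for this example
box = BeautifulSoup('<div id="box">lead text<p>para</p></div>', "html.parser").div

as_list = box.contents              # a list: supports len() and indexing
from_iterator = list(box.children)  # an iterator, materialised here for comparison

kinds = [type(child).__name__ for child in as_list]
only_tags = box.find_all(recursive=False)  # direct child tags only, no strings

print(kinds)
print(only_tags)
```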

In [31]:
#To print all strings inside the element with id __next-

for item in __next.strings:
    print(item)

CodeWithHarry
Menu
Login
Home
Courses
Tutorial
Blog
Notes
Contact
My Gear
Work With Us
Login
Signup
HTML
CSS
JS
C
C++
JAVA
PYTHON
PHP
REACT JS
Home
Courses
Tutorial 
HTML
CSS
JS
C
C++
JAVA
PYTHON
PHP
REACT JS
Blog
Notes
Contact
My Gear
Work With Us
Welcome to 
Learn 
Confused on which course to take? I have got you covered. Browse courses and find out the best course for you. Its free! Code With Harry is my attempt to teach basics and those coding techniques to people in short time which took me ages to learn.
Free Courses
Explore Blog
Recommended Courses
FREE COURSE
Python Tutorials - 100 Days of Code    
Python is one of the most demanded programming languages in the job market. Surprisingly, it is equally easy to learn and master Python. Let's commit our 100 days of code to python!
  Start Watching 
FREE COURSE
Ultimate JavaScript Course
This latest JavaScript course comes with premium curriculum that covers everything from basics to advance. On top of that, you will get my handwrit

In [32]:
for item in __next.stripped_strings:
    print(item)

CodeWithHarry
Menu
Login
Home
Courses
Tutorial
Blog
Notes
Contact
My Gear
Work With Us
Login
Signup
HTML
CSS
JS
C
C++
JAVA
PYTHON
PHP
REACT JS
Home
Courses
Tutorial
HTML
CSS
JS
C
C++
JAVA
PYTHON
PHP
REACT JS
Blog
Notes
Contact
My Gear
Work With Us
Welcome to
Learn
Confused on which course to take? I have got you covered. Browse courses and find out the best course for you. Its free! Code With Harry is my attempt to teach basics and those coding techniques to people in short time which took me ages to learn.
Free Courses
Explore Blog
Recommended Courses
FREE COURSE
Python Tutorials - 100 Days of Code
Python is one of the most demanded programming languages in the job market. Surprisingly, it is equally easy to learn and master Python. Let's commit our 100 days of code to python!
Start Watching
FREE COURSE
Ultimate JavaScript Course
This latest JavaScript course comes with premium curriculum that covers everything from basics to advance. On top of that, you will get my handwritten notes 

#### The stripped_strings property in BeautifulSoup is a generator that yields every string within an element and its descendants with leading and trailing whitespace removed; strings that consist only of whitespace are skipped. It is useful when you want the text content of an HTML document without the blank lines and padding that (.strings) returns.
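The difference is easier to see on indented HTML, where the whitespace between tags shows up as separate strings. A minimal sketch on a made-up list:

```python
from bs4 import BeautifulSoup

# Inline HTML used only for this example; note the newlines and padding
listing = BeautifulSoup(
    "<ul>\n  <li> One </li>\n  <li>Two</li>\n</ul>", "html.parser"
).ul

raw = list(listing.strings)               # includes whitespace-only strings
clean = list(listing.stripped_strings)    # trimmed, whitespace-only strings dropped

print(raw)
print(clean)
```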

In [33]:
#.parent attribute-

#To get the parent of any element

print(__next.parent)

<body><div id="__next"><div class=""><div class="" style="position:fixed;top:0;left:0;height:2px;background:transparent;z-index:99999999999;width:100%"><div class="" style="height:100%;background:purple;transition:all 500ms ease;width:0%"><div style="box-shadow:0 0 10px purple, 0 0 10px purple;width:5%;opacity:1;position:absolute;height:100%;transition:all 500ms ease;transform:rotate(3deg) translate(0px, -4px);left:-10rem"></div></div></div><div class="w-full z-10 sticky bg-white top-0 border-b border-grey-light shadow-md dark:bg-gray-800 dark:border-black" id="imgpreview2"><div class="w-full flex flex-wrap items-center lg:justify-between mt-0 py-4"><div class="px-0 lg:pl-4 flex items-center lg:mx-4 cursor-pointer text-purple-700 text-xl font-bold mx-3 dark:text-purple-300"><a href="/">CodeWithHarry</a></div><div class="flex items-center md:hidden"><div class="text-purple-700 text-md font-semibold">Menu</div><svg class="text-purple-700 mt-1" fill="currentColor" height="1em" stroke="cur

In [34]:
print(__next.parents)

<generator object PageElement.parents at 0x00000242333B8660>


In [35]:
#Now iterating it-

for item in __next.parents:
    print(item.name)

body
html
[document]


#### In BeautifulSoup, the parents attribute is a generator that yields every ancestor of a given element, starting from the immediate parent and going up the hierarchy to the BeautifulSoup object itself, whose name is [document].
#### By using a for loop to iterate over __next.parents, you can access each ancestor one by one. The name attribute of an element gives its tag name.
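The same idea can be used to build the path from an element up to the document. The sketch below uses a made-up nested page and collects the names of all ancestors of a paragraph.

```python
from bs4 import BeautifulSoup

# Inline HTML used only for this example
tree = BeautifulSoup(
    '<html><body><div id="app"><p>hi</p></div></body></html>', "html.parser"
)

paragraph = tree.find("p")
direct_parent = paragraph.parent.name                     # immediate parent only
ancestors = [parent.name for parent in paragraph.parents]  # every level up to the document

print(direct_parent)
print(ancestors)
```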

In [36]:
#Sibling attributes-

print(__next.next_sibling)

<script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{},"__N_SSP":true},"page":"/","query":{},"buildId":"m_hyoqLdBgkk3xZsvW8xP","isFallback":false,"gssp":true,"scriptLoader":[]}</script>


In [38]:
print(__next.next_sibling.next_sibling)

None


In [37]:
print(__next.previous_sibling)

None


#### The next_sibling and previous_sibling attributes in BeautifulSoup are used to navigate to the nodes adjacent to a given element at the same level of the tree. (next_sibling) returns the node immediately after the current element and (previous_sibling) returns the node immediately before it; each is None when no such sibling exists. A sibling can be a string (for example the whitespace between two tags) as well as a tag.
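On HTML that contains line breaks between tags, the immediate sibling is often a whitespace string rather than a tag. The sketch below, on a made-up list, shows this and uses find_next_sibling() to skip to the next tag.

```python
from bs4 import BeautifulSoup

# Inline HTML used only for this example; the newlines become sibling strings
items = BeautifulSoup(
    '<ul>\n<li id="a">A</li>\n<li id="b">B</li>\n</ul>', "html.parser"
)

first = items.find(id="a")
raw_next = first.next_sibling                  # the newline string between the tags
next_item = first.find_next_sibling("li")      # the next <li> tag, skipping strings
prev_item = first.find_previous_sibling("li")  # None: there is no earlier <li>

print(repr(raw_next))
print(next_item)
print(prev_item)
```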

### CSS Selectors:

In [40]:
elem = soup.select('#__NEXT_DATA__')
print(elem)

[<script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{},"__N_SSP":true},"page":"/","query":{},"buildId":"m_hyoqLdBgkk3xZsvW8xP","isFallback":false,"gssp":true,"scriptLoader":[]}</script>]


In [41]:
#(.) means class and (#) means id; no element has this class, so the list is empty

elem = soup.select('.__NEXT_DATA__')
print(elem)

[]


In [42]:
elem = soup.select('#__next')
print(elem)

[<div id="__next"><div class=""><div class="" style="position:fixed;top:0;left:0;height:2px;background:transparent;z-index:99999999999;width:100%"><div class="" style="height:100%;background:purple;transition:all 500ms ease;width:0%"><div style="box-shadow:0 0 10px purple, 0 0 10px purple;width:5%;opacity:1;position:absolute;height:100%;transition:all 500ms ease;transform:rotate(3deg) translate(0px, -4px);left:-10rem"></div></div></div><div class="w-full z-10 sticky bg-white top-0 border-b border-grey-light shadow-md dark:bg-gray-800 dark:border-black" id="imgpreview2"><div class="w-full flex flex-wrap items-center lg:justify-between mt-0 py-4"><div class="px-0 lg:pl-4 flex items-center lg:mx-4 cursor-pointer text-purple-700 text-xl font-bold mx-3 dark:text-purple-300"><a href="/">CodeWithHarry</a></div><div class="flex items-center md:hidden"><div class="text-purple-700 text-md font-semibold">Menu</div><svg class="text-purple-700 mt-1" fill="currentColor" height="1em" stroke="currentC

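#### select() returns a list of all elements that match a CSS selector. Once an element has been found, whether by a selector or by find(), it can also be modified.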
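A few more selector forms are sketched below on a made-up page: select_one() for a single match, a class selector, and a child combinator.

```python
from bs4 import BeautifulSoup

# Inline HTML used only for this example
doc = BeautifulSoup(
    '<div id="content"><p class="note">A</p><p>B</p></div>'
    '<p class="note">C</p>',
    "html.parser",
)

container = doc.select_one("#content")    # first match, or None if nothing matches
notes = doc.select("p.note")              # <p> tags with class "note"
children = doc.select("#content > p")     # <p> tags directly inside #content
nothing = doc.select_one("#missing")      # no such id

print([p.get_text() for p in notes])
print([p.get_text() for p in children])
```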

In [43]:
#insert() method-

#The insert() method is used to insert a new element or string into an existing HTML document.
#It allows you to add content at a specific position within a tag's contents.

html = """
<html>
  <body>
    <div id="content">
      <p>Paragraph 1</p>
      <p>Paragraph 2</p>
      <p>Paragraph 3</p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find the div element
div_element = soup.find('div')

# Create a new paragraph element
new_paragraph = soup.new_tag('p')
new_paragraph.string = 'New Paragraph'

# Insert the new paragraph at the beginning of the div element
div_element.insert(0, new_paragraph)

print(soup.prettify())

<html>
 <body>
  <div id="content">
   <p>
    New Paragraph
   </p>
   <p>
    Paragraph 1
   </p>
   <p>
    Paragraph 2
   </p>
   <p>
    Paragraph 3
   </p>
  </div>
 </body>
</html>
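insert() takes a position in the tag's contents, while append() always adds at the end. The sketch below uses made-up HTML written without whitespace between tags, so that the positions count only the paragraph tags.

```python
from bs4 import BeautifulSoup

# Inline HTML used only for this example (no whitespace between tags)
page = BeautifulSoup(
    '<div id="content"><p>Paragraph 1</p><p>Paragraph 2</p></div>', "html.parser"
)
container = page.find("div")

# insert() places the new tag at the given index within the contents
middle = page.new_tag("p")
middle.string = "Inserted in the middle"
container.insert(1, middle)

# append() places the new tag after all existing children
last = page.new_tag("p")
last.string = "Appended last"
container.append(last)

order = [p.get_text() for p in container.find_all("p")]
print(order)
```

In indented HTML such as the example above, the whitespace between tags counts as a child too, so the index passed to insert() refers to positions among strings and tags together.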

