# Web Scraping

Web Scraping is the programming-based technique for extracting relevant information from websites and storing it in the local system for further use.

Today, web scraping has many applications in Data Science and Marketing. Web scrapers across the world gather large amounts of information for personal or professional use. Moreover, present-day tech giants rely on such web scraping methods to fulfill the needs of their consumer base.

## Some basic requirements:
In order to make a soup, we need proper ingredients. Similarly, our fresh web scraper requires certain components.
Python - Its ease of use and vast collection of libraries make Python a popular choice for scraping websites. If it is not already installed, it needs to be installed first.

Beautiful Soup - One of the many web scraping libraries for Python. Its easy and clean usage makes it a top contender for web scraping. After a successful installation of Python, the user can install Beautiful Soup with pip.

In [12]:
!pip install beautifulsoup4 --quiet
!pip install lxml --quiet


Basic Understanding of HTML Tags - Refer to this tutorial for gaining necessary information about HTML tags.
Web Browser - Since we have to toss out a lot of unnecessary information from a website, we need specific ids and tags for filtering. Therefore, a web browser like Google Chrome or Mozilla Firefox serves the purpose of discovering those tags.

In [1]:
from bs4 import BeautifulSoup
import requests

### Creating a User-Agent
Many websites take measures to block robots from accessing their data. Therefore, in order to extract data from a script, we send a User-Agent header with the request. The User-Agent is a string that tells the server what kind of client (for example, which browser and operating system) is sending the request.

In [2]:
HEADERS = ({'User-Agent':
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
            'Accept-Language': 'en-US, en;q=0.5'})
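
To see what this header changes, the sketch below compares the default User-Agent that requests would send with the one sent once HEADERS is applied to a session. Nothing goes over the network here; the request is only prepared so its headers can be inspected. The URL `https://example.com/` is a placeholder and not part of the original notebook.

```python
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36",
    "Accept-Language": "en-US, en;q=0.5",
}

# Without custom headers, requests identifies itself as "python-requests/<version>".
default_ua = requests.utils.default_user_agent()

# A Session applies the same headers to every request made through it.
session = requests.Session()
session.headers.update(HEADERS)

# Prepare (but do not send) a request to inspect the headers that would go out.
# The URL is a placeholder used only for illustration.
prepared = session.prepare_request(requests.Request("GET", "https://example.com/"))

print("Default User-Agent:", default_ua)
print("Custom User-Agent: ", prepared.headers["User-Agent"])
```

A default User-Agent that starts with `python-requests/` makes it obvious to the server that the request comes from a script, which is why the notebook replaces it with a browser-like string.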

### Sending a request to a URL
A webpage is accessed by its URL (Uniform Resource Locator). With the help of the URL, we will send the request to the webpage for accessing its data.

In [22]:
URL = "https://www.amazon.com/dp/B09V5R8RYK/ref=sspa_dk_detail_4?"
webpage = requests.get(URL, headers=HEADERS)

The requested webpage features an Amazon product. Hence, our Python script focuses on extracting product details like “The Name of the Product”, “The Current Price” and so on.
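
A request can fail or return an error page, so it may help to check the response before parsing it. The helper below is a minimal sketch and not part of the original notebook: `fetch_html` is a hypothetical name and the 10-second timeout is an arbitrary choice. It relies on the `timeout` parameter of `requests.get` and on `raise_for_status()`, which raises an exception for 4xx and 5xx responses.

```python
import requests


def fetch_html(url, headers, timeout=10):
    """Return the raw response body, raising if the request fails.

    fetch_html is a hypothetical helper; the default timeout is an assumption.
    """
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.content


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html(URL, HEADERS)
```

Note that a successful status code does not guarantee that the product page itself was returned. Some sites answer scripts with a different page, in which case later lookups simply find nothing.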

### Creating a soup of information
The webpage variable contains the response received from the website. We pass the content of the response and the name of the parser to the BeautifulSoup constructor.

In [23]:

soup = BeautifulSoup(webpage.content, "html.parser")

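'html.parser' is the parser that ships with Python; 'lxml', installed earlier, is a faster alternative that can be passed in its place. Either way, Beautiful Soup breaks the HTML page down into a tree of Python objects. Generally, there are four kinds of Python objects obtained: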

1. Tag - It corresponds to HTML or XML tags, which include names and attributes.
2. NavigableString - It corresponds to the text stored within a tag.
3. BeautifulSoup - It represents the entire parsed document.
4. Comment - A special type of NavigableString that holds an HTML comment.
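
The four object types can be seen on a small example. The HTML below is made up for illustration and is not taken from the product page.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# A made-up snippet, small enough to inspect by hand.
html = '<p id="intro">Hello<!-- a hidden note --></p>'
soup = BeautifulSoup(html, "html.parser")

paragraph = soup.find("p")        # a Tag
text, note = paragraph.contents   # a NavigableString and a Comment

# Print the class name of each object.
for obj in (soup, paragraph, text, note):
    print(type(obj).__name__)
```

This prints `BeautifulSoup`, `Tag`, `NavigableString` and `Comment`, one per line.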

### Discovering the exact tags for Object Extraction
One of the most tedious parts of this project is finding the ids and tags that store the relevant information. As mentioned before, we use a web browser to accomplish this task.

We open the webpage in the browser, right-click the relevant element and choose Inspect.
![image.png](attachment:4ef4af65-2db7-4be7-a26e-f3b7ba10a866.png)

As a result, a panel opens on the right-hand side of the screen as shown in the following figure.

![image.png](attachment:5428c987-a514-46db-b01e-5a60da260bb9.png)


Once we obtain the tag names and ids, extracting information becomes a piece of cake. However, we first need to learn a few of the methods that Beautiful Soup objects provide.
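
As a rough orientation, the sketch below shows three commonly used search methods, `find()`, `find_all()` and `select_one()`, on a made-up HTML fragment. The ids and class names are invented for illustration and do not come from the product page.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the ids and classes are illustrative only.
html = """
<div id="listing">
  <span id="name" class="label">Widget</span>
  <span class="label">Gadget</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("span", attrs={"id": "name"})     # first match, or None
labels = soup.find_all("span", class_="label")      # all matches
by_css = soup.select_one("#listing span.label")     # first match for a CSS selector
missing = soup.find("span", attrs={"id": "price"})  # no such element

print(first.string)
print(len(labels))
print(by_css.string)
print(missing)
```

The last lookup is worth noting: `find()` does not raise an error when nothing matches, it returns `None`.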

## Extracting the Product Title
Using the find() function, which searches for a specific tag with specific attributes, we locate the Tag object containing the title of the product.

In [24]:
# Outer Tag Object
title = soup.find("span", attrs={"id":'productTitle'})

In [25]:
print(title)

None


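Then, we take out the NavigableString object through the Tag's string attribute. This only works if find() returned a Tag. In the run shown above it returned None, meaning the response did not contain a span with the id productTitle, so the attribute access fails:

In [26]:
# Inner NavigableString Object
title_value = title.string

AttributeError: 'NoneType' object has no attribute 'string'

On a run where the product page is returned, title_value holds the title text: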

In [8]:
# Inner NavigableString Object
print(title_value)

        Sony PlayStation 4 Pro 1TB Console - Black (PS4 Pro)       


And finally, we strip extra spaces and convert the object to a string value.

In [9]:
# Title as a string value
title_string = title_value.strip()

In [10]:
print(title_string)

Sony PlayStation 4 Pro 1TB Console - Black (PS4 Pro)
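
The lookup, the check for a missing element and the stripping can be combined into one small function. This is a minimal sketch rather than the notebook's own code: `extract_title` is a hypothetical name, and the two HTML strings are made up to stand in for a product page and for a page that lacks the element.

```python
from bs4 import BeautifulSoup


def extract_title(soup):
    """Return the stripped product title, or None if the element is absent."""
    tag = soup.find("span", attrs={"id": "productTitle"})
    if tag is None:
        return None
    # get_text() also works when the span contains nested tags,
    # where .string would be None.
    return tag.get_text().strip()


# Made-up HTML standing in for a product page and for a page without the element.
product_page = BeautifulSoup(
    '<span id="productTitle">   Example Product   </span>', "html.parser"
)
other_page = BeautifulSoup("<p>Please verify you are human</p>", "html.parser")

print(extract_title(product_page))
print(extract_title(other_page))
```

The first call prints `Example Product` and the second prints `None` instead of raising an AttributeError.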


We can inspect the type of each variable using the type() function.

In [26]:
# Printing types of values for efficient understanding
print(type(title))
print(type(title_value))
print(type(title_string))
print()

# Printing Product Title
print("Product Title = ", title_string)

<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'str'>

Product Title =  Sony PlayStation 4 Pro 1TB Console - Black (PS4 Pro)
