# Web Scraping
#### A simple notebook that explains web scraping (WS) and how to do it with Python

<br>

##### 📌 What is web scraping?
    📍 Web scraping is the process of automatically extracting information from a website. A script can collect in minutes what would take hours to copy by hand.

##### 📌 Why do we use web scraping?
    📍 Because we sometimes need specific data from a web page to complete a task or solve a problem.
    
##### 📌 Why is it called web scraping?
    📍 Because we target a specific website or web page and extract ("scrape") the data we need from it.
##### 📌 Can Python be used for web scraping?
    📍 Yes, Python has several libraries for web scraping.
##### 📌 Libraries that can be used in Python for web scraping:
    📦 bs4 (beautifulsoup4): parses HTML so we can search it: pip install beautifulsoup4
    📦 requests: sends the HTTP requests for web pages: pip install requests
    📦 validators: checks whether a link is a valid URL: pip install validators
    📦 pandas: scrapes tables from a web page: pip install pandas
    📦 selenium: automates a browser and scrapes the pages it loads: pip install selenium
##### 📌 Do we need links for web scraping?
    📍 Yes, we need the URL of the target page to make requests.
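
Before making a request, it can help to check that a link at least looks like a web address. A minimal sketch using only the standard library; the helper name `looks_like_http_url` is hypothetical, and this check is deliberately looser than the `validators` package used later in this notebook:

```python
from urllib.parse import urlparse


def looks_like_http_url(link):
    """Return True if link has an http(s) scheme and a host name."""
    if not isinstance(link, str):
        return False
    parts = urlparse(link)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(looks_like_http_url("https://www.learnpython.org/"))  # True
print(looks_like_http_url("/about"))  # False: relative link, no scheme or host
```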

<br>

##### Note: This notebook is for learning purposes only. Use the links in it at your own responsibility.

### Dependencies

In [1]:
from bs4 import BeautifulSoup  # to parse HTML and search it
import requests as req  # to request web pages
import pandas as pd  # to create and manipulate DataFrames
import validators  # to check whether a link is a valid URL
import bs4  # to reference bs4.element.Tag in type checks

### Example

In [2]:
# URL used in the first request
URL = 'https://www.learnpython.org/'

In [3]:
# Making request using the get method
r = req.get(URL)
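
A request can fail (bad URL, timeout, HTTP error status), so it is worth checking the response before parsing it. A hedged sketch, assuming a hypothetical helper named `fetch_html` and a 10-second timeout chosen only for illustration:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text


# A URL without a scheme is rejected before any network traffic happens
print(fetch_html("not-a-url"))  # None
```

Returning `None` keeps the sketch short; in a longer script it may be clearer to let the exception propagate.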

In [4]:
# Create a BeautifulSoup object from the response
# The first parameter is the markup: the content of the web page we got from the request
# The second parameter is the parser to use, such as "html.parser"
soup = BeautifulSoup(r.content, features='html.parser')
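
Once a soup object exists, individual elements can be read from it. A small offline sketch that parses a short HTML string, modeled on what this page returns, instead of a live response (the names `sample_html` and `sample_soup` are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# A trimmed sample of a page head, used here in place of r.content
sample_html = """
<html lang="en">
<head>
<title>Learn Python - Free Interactive Python Tutorial</title>
<meta content="Learn,Python,Tutorial,Interactive,Free" name="keywords"/>
</head>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# The text inside the first <title> tag
page_title = sample_soup.title.string
# attrs= is needed here because "name" is already the first argument of find()
keywords_tag = sample_soup.find("meta", attrs={"name": "keywords"})
keywords = keywords_tag.get("content").split(",")

print(page_title)  # Learn Python - Free Interactive Python Tutorial
print(keywords)  # ['Learn', 'Python', 'Tutorial', 'Interactive', 'Free']
```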

In [5]:
# Show the whole parsed document held by the BeautifulSoup object
soup

<!DOCTYPE html>

<html lang="en">
<head>
<title>Learn Python - Free Interactive Python Tutorial</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="learnpython.org is a free interactive Python tutorial for people who want to learn Python, fast." name="description"/>
<meta content="Learn,Python,Tutorial,Interactive,Free" name="keywords"/>
<meta content="cXWj61RCtO3fVP24Y7CO-nX0ba30tgdJYY8GGBactLI" name="google-site-verification"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="Learn Python - Free Interactive Python Tutorial" property="og:title">
<meta content="website" property="og:type"/>
<meta content="https://www.learnpython.org" property="og:url"/>
<meta content="https://www.learnpython.org/static/img/share-logos/learnpython.org.png" property="og:image"/>
<link href="/static/img/favicons/learnpython.org.ico" rel="icon" type="image/x-icon"/>
<link href="/static/css/bootstrap.min.css" rel="stylesheet" type="

In [6]:
# attrs holds a tag's attributes, such as href, as a dict
# Here we get an empty dict because soup represents the whole document,
# not a single tag, so it has no attributes of its own
soup.attrs

{}
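
The attrs dict is filled in for individual tags rather than for the document object as a whole. A short sketch with a one-line HTML sample (made up for illustration):

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup('<a class="nav-link" href="/about">About</a>', "html.parser")

doc_attrs = sample_soup.attrs  # the document object itself
link_attrs = sample_soup.a.attrs  # the first a tag in the document

print(doc_attrs)  # {}
print(link_attrs)  # {'class': ['nav-link'], 'href': '/about'}
```

The class value is a list because an HTML element can carry several classes.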

In [7]:
# soup.currentTag

In [8]:
# With the find_all() method we can get every occurrence of any tag we want
# Here we get all 'a' tags
soup.find_all("a")

[<a class="navbar-brand" href="/">
 <img src="/static/img/favicons/learnpython.org.ico" style="height: 24px"/>
         learnpython.org
     </a>,
 <a class="nav-link" href="/">Home <span class="sr-only">(current)</span></a>,
 <a class="nav-link" href="/about">About</a>,
 <a class="nav-link" href="https://www.learnx.org">Certify</a>,
 <a aria-expanded="false" aria-haspopup="true" class="nav-link dropdown-toggle" data-toggle="dropdown" href="#" id="more-langs">
                     More Languages
                 </a>,
 <a class="dropdown-item disabled" href="https://www.learnpython.org">Python</a>,
 <a class="dropdown-item" href="https://www.learnjavaonline.org">Java</a>,
 <a class="dropdown-item" href="https://www.learn-html.org">HTML</a>,
 <a class="dropdown-item" href="https://www.learn-golang.org">Go</a>,
 <a class="dropdown-item" href="https://www.learn-c.org">C</a>,
 <a class="dropdown-item" href="https://www.learn-cpp.org">C++</a>,
 <a class="dropdown-item" href="https://www.lea

In [9]:
# Store all the a tags in a variable
a_tags = soup.find_all("a")
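
find_all() also accepts filters, so we do not always need every tag. A sketch on a small HTML sample modeled on the output above (the sample string is an assumption, not the live page):

```python
from bs4 import BeautifulSoup

sample_html = """
<a class="nav-link" href="/about">About</a>
<a class="dropdown-item" href="https://www.learnjavaonline.org">Java</a>
<a class="dropdown-item" href="https://www.learn-html.org">HTML</a>
<a class="nav-link dropdown-toggle" id="more-langs">More Languages</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Only a tags whose class list includes "dropdown-item"
dropdown_links = sample_soup.find_all("a", class_="dropdown-item")
# Only a tags that have an href attribute at all
with_href = sample_soup.find_all("a", href=True)
# The same kind of query written as a CSS selector
nav_links = sample_soup.select("a.nav-link")

print([a.get_text() for a in dropdown_links])  # ['Java', 'HTML']
print(len(with_href))  # 3
print([a.get_text() for a in nav_links])  # ['About', 'More Languages']
```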

In [10]:
# Index into the list to get one tag, then use contents to get its children
a_tags[0].contents[1]

<img src="/static/img/favicons/learnpython.org.ico" style="height: 24px"/>

In [11]:
# Call get() on a tag to read one of its attributes, such as src on an img tag
a_tags[0].contents[1].get('src')

'/static/img/favicons/learnpython.org.ico'

In [12]:
# Notice that each tag is a bs4.element.Tag object
type(a_tags[0].contents[1])

bs4.element.Tag

In [13]:
a_tags[0].get('href')

'/'

In [14]:
# Collect the href of every a tag, keeping only the ones that are valid URLs
all_links = []
for a in a_tags:
    href = a.get('href')  # None when the tag has no href attribute
    if href and validators.url(href):
        all_links.append(href)

In [15]:
# Show all links we got
all_links

['https://www.learnx.org',
 'https://www.learnpython.org',
 'https://www.learnjavaonline.org',
 'https://www.learn-html.org',
 'https://www.learn-golang.org',
 'https://www.learn-c.org',
 'https://www.learn-cpp.org',
 'https://www.learn-js.org',
 'https://www.learn-php.org',
 'https://www.learnshell.org',
 'https://www.learncs.org',
 'https://www.learn-perl.org',
 'https://www.learnrubyonline.org',
 'https://www.learnscala.org',
 'https://www.learnsqlonline.org',
 'https://github.com/ronreiter/interactive-tutorials',
 'https://github.com/ronreiter/interactive-tutorials/fork',
 'https://www.learnpython.org',
 'https://www.learnjavaonline.org',
 'https://www.learn-html.org',
 'https://www.learn-golang.org',
 'https://www.learn-c.org',
 'https://www.learn-cpp.org',
 'https://www.learn-js.org',
 'https://www.learn-php.org',
 'https://www.learnshell.org',
 'https://www.learncs.org',
 'https://www.learn-perl.org',
 'https://www.learnrubyonline.org',
 'https://www.learnscala.org',
 'https://w
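
The list above keeps only absolute URLs, so relative links such as '/' or '/about' are dropped, and some links appear more than once. One possible way to handle both, sketched with the standard library on a few hrefs taken from the outputs above:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.learnpython.org/"
hrefs = [
    "/",
    "/about",
    "https://www.learnx.org",
    "https://www.learn-c.org",
    "https://www.learn-c.org",
]

# Resolve relative links against the page URL; absolute links pass through unchanged
absolute_links = [urljoin(BASE_URL, href) for href in hrefs]
# dict.fromkeys removes duplicates while keeping the original order
unique_links = list(dict.fromkeys(absolute_links))

print(unique_links)
```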

### Another example

In [16]:
# URL
nurl = "https://marketplace.visualstudio.com/items?itemName=ms-python.python"
# GET request
nr = req.get(nurl)
# Initialize an object from bs4
nsoup = BeautifulSoup(nr.text, 'html.parser')

In [None]:
# Show what the nsoup variable contains. The output is cleared here
# because it is very large; run the cell to see it
nsoup

##### I want to get the table from the target URL and create a simple DataFrame containing the same data

<br>

![image.png](attachment:8b45d393-709b-4b69-b2d2-fd4329174b57.png)

In [18]:
# find all tr tags
table = nsoup.find_all('tr')

In [19]:
table

[<tr><td class="item-img" id="vss_2"><img alt="" class="image-display" src="https://ms-python.gallerycdn.vsassets.io/extensions/ms-python/python/2023.3.10341119/1675423947517/Microsoft.VisualStudio.Services.Icons.Default" style="top:0.5px;visibility:visible"/></td><td class="item-header"><div class="item-header-content dark"><h1><span class="ux-item-name">Python</span></h1><div class="ux-item-second-row-wrapper"><div class="ux-item-publisher"><h2 role="presentation"><a aria-label="More from Microsoft publisher" class="ux-item-publisher-link item-banner-focussable-child-item" href="publishers/Microsoft" style="color:#ffffff">Microsoft</a></h2></div><div class="ux-marketplace-verified-doamin-icon"><div class="verified-domain-icon"><i class="verified-domain-icon-background" role="presentation"></i><i class="verified-domain-icon-foreground" role="presentation" title="Microsoft has a verified ownership for the domain microsoft.com"></i></div></div><span class="divider"> | </span><div class=

In [20]:
# Keep only the last 7 rows, which hold the table data (1 header row and 6 data rows)
best_rows = table[-7:]

In [21]:
# Table header: the names of the columns
th = best_rows[0].find_all('th')
# Two lists to store the data of each column
command_col = []
desc_col = []
# Loop through the data rows, skipping the header row
for tr in best_rows[1:]:
    # Find all td tags in the row
    data_rows = tr.find_all('td')
    # Loop through the cells of the row
    for d in data_rows:
        # Check whether the first child of the cell is a tag, such as <code>, or plain text
        # If it is a tag, add its first child to the Command column
        # Otherwise add the text to the Description column
        if isinstance(d.contents[0], bs4.element.Tag):
            command_col.append(d.contents[0].contents[0])
        else:
            desc_col.append(d.contents[0])

In [22]:
desc_col

['Switch between Python interpreters, versions, and environments.',
 'Start an interactive Python REPL using the selected interpreter in the VS Code terminal.',
 'Runs the active Python file in the VS Code terminal. You can also run a Python file by right-clicking on the file and selecting ',
 'Switch from Pylint to Flake8 or other supported linters.',
 'Formats code using the provided ',
 'Select a test framework and configure it to display the Test Explorer.']
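
Note that two of the descriptions above end abruptly ('...selecting ' and '...the provided '). This is most likely because `d.contents[0]` returns only the first child of the cell, so any text after a nested tag is left out. A possible alternative, sketched on a made-up table, is to read the whole text of each cell with get_text():

```python
from bs4 import BeautifulSoup

# Made-up sample table; the cell text is illustrative, not copied from the page
sample_html = """
<table>
<tr><th>Command</th><th>Description</th></tr>
<tr><td><code>Format Document</code></td>
<td>Formats code using the provided <code>formatter</code> in the settings file.</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for tr in sample_soup.find_all("tr")[1:]:  # skip the header row
    cells = tr.find_all("td")
    # get_text() joins the text of every child; split/join normalizes whitespace
    rows.append([" ".join(td.get_text().split()) for td in cells])

print(rows)
```

Each inner list holds one row in column order, so the position of a cell decides its column rather than the type of its first child.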

In [23]:
command_col

['Python: Select Interpreter',
 'Python: Start REPL',
 'Python: Run Python File in Terminal',
 'Python: Select Linter',
 'Format Document',
 'Python: Configure Tests']

In [24]:
# Creating a dictionary for the data frame
table = dict(
    Command=command_col,
    Description=desc_col
)

In [25]:
table

{'Command': ['Python: Select Interpreter',
  'Python: Start REPL',
  'Python: Run Python File in Terminal',
  'Python: Select Linter',
  'Format Document',
  'Python: Configure Tests'],
 'Description': ['Switch between Python interpreters, versions, and environments.',
  'Start an interactive Python REPL using the selected interpreter in the VS Code terminal.',
  'Runs the active Python file in the VS Code terminal. You can also run a Python file by right-clicking on the file and selecting ',
  'Switch from Pylint to Flake8 or other supported linters.',
  'Formats code using the provided ',
  'Select a test framework and configure it to display the Test Explorer.']}

In [26]:
# Finally, create the DataFrame
df = pd.DataFrame(table)
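
pd.DataFrame expects every column list to have the same length, which the loop above does not guarantee if a row is laid out differently. A defensive sketch with shortened sample values (the helper name `build_frame` is hypothetical):

```python
import pandas as pd


def build_frame(commands, descriptions):
    """Build the DataFrame only when both columns line up."""
    if len(commands) != len(descriptions):
        raise ValueError(
            f"Column lengths differ: {len(commands)} commands, "
            f"{len(descriptions)} descriptions"
        )
    return pd.DataFrame({"Command": commands, "Description": descriptions})


# Shortened sample values, for illustration only
sample_df = build_frame(
    ["Python: Start REPL", "Format Document"],
    ["Start an interactive Python REPL.", "Formats code."],
)
print(sample_df.shape)  # (2, 2)
```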

In [27]:
# The target result
df

|   | Command | Description |
|---|---|---|
| 0 | Python: Select Interpreter | Switch between Python interpreters, versions, ... |
| 1 | Python: Start REPL | Start an interactive Python REPL using the sel... |
| 2 | Python: Run Python File in Terminal | Runs the active Python file in the VS Code ter... |
| 3 | Python: Select Linter | Switch from Pylint to Flake8 or other supporte... |
| 4 | Format Document | Formats code using the provided |
| 5 | Python: Configure Tests | Select a test framework and configure it to di... |


#### How to use pandas to scrape a table from a web page?
    
    📍 To scrape a table from a web page, we can use the pandas library in Python
    📍 We need a web page that contains a table
    📍 Then, with a few lines of pandas code, we can get the tables. Let's do it


In [28]:
# link
link = 'https://www.w3schools.com/python/python_ref_string.asp'

In [29]:
# Read the link and get every table in that web page
get_html_tables = pd.read_html(link)
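
pd.read_html returns a list of DataFrames, one per table it finds, and it needs an HTML parser library installed (lxml, or beautifulsoup4 together with html5lib). An offline sketch on a small sample table, assuming one of those parsers is available; the `match` argument keeps only tables whose text matches the given string:

```python
from io import StringIO

import pandas as pd

# Small sample modeled on the string-methods table; wrapped in StringIO
# because recent pandas versions expect a file-like object for literal HTML
sample_html = """
<table>
<tr><th>Method</th><th>Description</th></tr>
<tr><td>capitalize()</td><td>Converts the first character to upper case</td></tr>
<tr><td>casefold()</td><td>Converts string into lower case</td></tr>
</table>
"""

tables = pd.read_html(StringIO(sample_html), match="Method")
methods = tables[0]

print(len(tables))  # 1
print(list(methods.columns))  # ['Method', 'Description']
print(methods.shape)  # (2, 2)
```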

In [30]:
# Show list of tables
get_html_tables

[            Method                                        Description
 0     capitalize()         Converts the first character to upper case
 1       casefold()                    Converts string into lower case
 2         center()                          Returns a centered string
 3          count()  Returns the number of times a specified value ...
 4         encode()           Returns an encoded version of the string
 5       endswith()  Returns true if the string ends with the speci...
 6     expandtabs()                    Sets the tab size of the string
 7           find()  Searches the string for a specified value and ...
 8         format()               Formats specified values in a string
 9     format_map()               Formats specified values in a string
 10         index()  Searches the string for a specified value and ...
 11       isalnum()  Returns True if all characters in the string a...
 12       isalpha()  Returns True if all characters in the string a...
 13   

In [31]:
# show the table as a Data Frame
get_html_tables[0]

|   | Method | Description |
|---|---|---|
| 0 | capitalize() | Converts the first character to upper case |
| 1 | casefold() | Converts string into lower case |
| 2 | center() | Returns a centered string |
| 3 | count() | Returns the number of times a specified value ... |
| 4 | encode() | Returns an encoded version of the string |
| 5 | endswith() | Returns true if the string ends with the speci... |
| 6 | expandtabs() | Sets the tab size of the string |
| 7 | find() | Searches the string for a specified value and ... |
| 8 | format() | Formats specified values in a string |
| 9 | format_map() | Formats specified values in a string |


<center><h1>Good Luck</h1></center>