

# **Hands-on lab: Webscraping**


Estimated time needed: **30** minutes


## Objectives

By the end of this lab, you will be able to:

* Use the <code>requests</code> and <code>BeautifulSoup</code> libraries to extract the contents of a web page

* Analyze the <code>HTML</code> code of a web page to find the relevant information

* Extract the relevant information and save it in the required form


## Scenario

Suppose that a multiplex management organization has hired you to extract information on the top 50 movies with the best average rating from the web link shared below.

<code>https://web.archive.org/web/20230902185655/https://en.everybodywiki.com/100_Most_Highly-Ranked_Films</code>

The information required is <code>Average Rank</code>, <code>Film</code>, and <code>Year</code>.

You are required to write a Python script <code>webscraping_movies.py</code> that extracts the information and saves it to a <code>CSV</code> file <code>top_50_films.csv</code>.

You are also required to save the same information to a database <code>Movies.db</code> under the table name <code>Top_50</code>.



## Code Setup

## Imports

Import any additional libraries you may need here.


In [1]:
import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup

## Initialization of known entities

You must declare a few entities at the beginning. For example, you know the required <code>URL</code>, the <code>CSV</code> name for saving the record, the database name, and the table name for storing the record. You also know the entities to be saved. Additionally, since you require only the top 50 results, you will require a loop counter initialized to 0. You may initialize all these by using the following code in <code>webscraping_movies.py</code>:

In [12]:
url = 'https://web.archive.org/web/20230902185655/https://en.everybodywiki.com/100_Most_Highly-Ranked_Films'
db_name = 'Movies.db'
table_name = 'Top_50'
csv_path = 'top_50_films.csv'
df = pd.DataFrame(columns=["Average Rank","Film","Year"])
count = 0

## Loading the webpage for Webscraping
To access the required information from the web page, first load the entire page as <code>HTML</code> text in Python using <code>requests.get(url).text</code>, then parse that text with <code>BeautifulSoup</code> so that the relevant information can be extracted.

Add the following code to <code>webscraping_movies.py</code>:

In [3]:
html_page = requests.get(url).text
data = BeautifulSoup(html_page, 'html.parser')
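A request to an archived page can be slow or can fail, in which case <code>.text</code> would hold an error page rather than the table. The sketch below wraps the same two calls in a helper that sets a timeout and raises on an HTTP error status. The function name <code>load_page</code> and the 30-second default are assumptions made for illustration and are not part of the lab script.

```python
import requests
from bs4 import BeautifulSoup


def load_page(url, timeout=30):
    """Download a page and return it as a parsed BeautifulSoup object.

    The name `load_page` and the 30-second timeout are illustrative
    assumptions, not part of the original lab.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing them
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

With this helper, <code>data = load_page(url)</code> could stand in for the two lines above.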

## Scraping of required information
You now need to write the loop that extracts the appropriate information from the web page. The rows of the required table can be accessed by calling the <code>find_all()</code> method on the <code>BeautifulSoup</code> object, as in the statements below.

In [4]:
tables = data.find_all('tbody')
rows = tables[0].find_all('tr')

Here, the variable <code>tables</code> holds the <code>tbody</code> elements of all the tables in the web page, and the variable <code>rows</code> holds all the rows of the first table.
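To see what these two calls return without downloading anything, you can try them on a small inline document. The HTML string below is a made-up miniature of the page structure (an assumption for illustration), and the variable names are prefixed with <code>sample_</code> so they do not clash with the lab script.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: two tables, the first with a header row
sample_html = """
<table>
  <tbody>
    <tr><th>Average Rank</th><th>Film</th><th>Year</th></tr>
    <tr><td>1</td><td>The Godfather</td><td>1972</td></tr>
    <tr><td>2</td><td>Citizen Kane</td><td>1941</td></tr>
  </tbody>
</table>
<table>
  <tbody>
    <tr><td>unrelated</td></tr>
  </tbody>
</table>
"""

sample = BeautifulSoup(sample_html, "html.parser")
sample_tables = sample.find_all("tbody")         # one entry per table body
sample_rows = sample_tables[0].find_all("tr")    # rows of the first table only
header_cells = sample_rows[0].find_all("td")     # header row uses th, not td

print(len(sample_tables))  # 2
print(len(sample_rows))    # 3
print(len(header_cells))   # 0
```

The header row yields no <code>td</code> cells, which is why the extraction loop needs to skip rows where that list is empty.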

You can now iterate over the rows to find the required data. Use the code shown below to extract the information.

In [5]:
for row in rows:
    if count < 50:
        # Collect every data cell (td) in the current row
        col = row.find_all('td')
        # Skip rows that have no td cells, such as header rows
        if len(col) != 0:
            data_dict = {"Average Rank": col[0].contents[0],
                         "Film": col[1].contents[0],
                         "Year": col[2].contents[0]}
            df1 = pd.DataFrame(data_dict, index=[0])
            df = pd.concat([df, df1], ignore_index=True)
            count += 1
    else:
        break

The code functions as follows.

1. Iterate over the contents of the variable rows.
2. Check for the loop counter to restrict to 50 entries.
3. Extract all the td data objects in the row and save them to col.
4. Check that the length of col is not 0, that is, that the current row actually contains data. This is important because tables often contain header or merged rows that are not apparent from how the web page looks.
5. Create a dictionary data_dict whose keys match the columns of the dataframe created earlier for recording the output, with the corresponding values taken from the first three cells of the row.
6. Convert the dictionary to a dataframe and concatenate it with the existing one. This way, the data keeps getting appended to the dataframe with every iteration of the loop.
7. Increment the loop counter.
8. Once the counter hits 50, stop iterating over rows and break the loop.
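The steps above can also be sketched as a function that collects the rows in a list and builds the dataframe once at the end, which avoids calling <code>pd.concat</code> on every iteration. This is an alternative arrangement, not the lab's required solution. The function name <code>extract_top_rows</code>, the use of <code>get_text(strip=True)</code> in place of <code>contents[0]</code>, the check for at least three cells, and the sample HTML are all assumptions made for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup


def extract_top_rows(rows, limit=50):
    """Return a DataFrame built from the first `limit` data rows.

    Hypothetical helper: rows with fewer than three td cells are skipped.
    """
    records = []
    for row in rows:
        if len(records) >= limit:
            break
        col = row.find_all("td")
        if len(col) < 3:
            continue  # header or merged row without the expected cells
        records.append({
            "Average Rank": col[0].get_text(strip=True),
            "Film": col[1].get_text(strip=True),
            "Year": col[2].get_text(strip=True),
        })
    return pd.DataFrame(records, columns=["Average Rank", "Film", "Year"])


# Made-up sample rows that mirror the structure of the real table
sample_html = """<table><tbody>
<tr><th>Average Rank</th><th>Film</th><th>Year</th></tr>
<tr><td>1</td><td>The Godfather</td><td>1972</td></tr>
<tr><td>2</td><td>Citizen Kane</td><td>1941</td></tr>
<tr><td>3</td><td>Casablanca</td><td>1942</td></tr>
</tbody></table>"""

sample_rows = BeautifulSoup(sample_html, "html.parser").find_all("tr")
top_two = extract_top_rows(sample_rows, limit=2)
print(top_two)
```

In the lab script, the equivalent call would be <code>df = extract_top_rows(rows)</code>.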

Print the contents of the dataframe using the following:

In [6]:
print(df)

   Average Rank                                           Film  Year
0             1                                  The Godfather  1972
1             2                                   Citizen Kane  1941
2             3                                     Casablanca  1942
3             4                         The Godfather, Part II  1974
4             5                            Singin' in the Rain  1952
5             6                                         Psycho  1960
6             7                                    Rear Window  1954
7             8                                 Apocalypse Now  1979
8             9                          2001: A Space Odyssey  1968
9            10                                  Seven Samurai  1954
10           11                                        Vertigo  1958
11           12                                    Sunset Blvd  1950
12           13                                   Modern Times  1936
13           14                   
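The values taken from the table cells are most likely stored as text rather than numbers, even though they print like integers. If you want to sort or filter by rank or year, one option is to convert those two columns with <code>pd.to_numeric</code>. The sketch below uses a small hand-built dataframe (an assumption standing in for the scraped one) with three of the films shown above.

```python
import pandas as pd

# Stand-in for the scraped dataframe: every value is a string
sample_df = pd.DataFrame({
    "Average Rank": ["1", "2", "3"],
    "Film": ["The Godfather", "Citizen Kane", "Casablanca"],
    "Year": ["1972", "1941", "1942"],
})

# Convert the numeric-looking columns; unparseable values become NaN
for column in ["Average Rank", "Year"]:
    sample_df[column] = pd.to_numeric(sample_df[column], errors="coerce")

print(sample_df.dtypes)
```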

## Storing the data

After the dataframe has been created, you can save it to a CSV file using the following command:

In [13]:
df.to_csv(csv_path)
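By default, <code>to_csv</code> also writes the dataframe index as an unnamed first column. If you only want the three required columns in the file, you can pass <code>index=False</code>. The sketch below compares the two header lines using an in-memory buffer and a two-row stand-in dataframe, both of which are assumptions for illustration.

```python
import io

import pandas as pd

# Two-row stand-in for the scraped dataframe
sample_df = pd.DataFrame({
    "Average Rank": ["1", "2"],
    "Film": ["The Godfather", "Citizen Kane"],
    "Year": ["1972", "1941"],
})

# Write to in-memory buffers instead of files so nothing touches the disk
with_index = io.StringIO()
sample_df.to_csv(with_index)

without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]

print(header_with_index)     # ,Average Rank,Film,Year
print(header_without_index)  # Average Rank,Film,Year
```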

To store the required data in a database, you first need to initialize a connection to the database, save the dataframe as a table, and then close the connection. This can be done using the following code:

In [14]:
conn = sqlite3.connect(db_name)
df.to_sql(table_name, conn, if_exists='replace', index=False)
conn.close()
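To check that the table was written as expected, you can read it back with a query before closing the connection. The sketch below does this against an in-memory SQLite database with a two-row stand-in dataframe (both assumptions for illustration), so it does not touch <code>Movies.db</code>.

```python
import sqlite3

import pandas as pd

# Two-row stand-in for the scraped dataframe
sample_df = pd.DataFrame({
    "Average Rank": ["1", "2"],
    "Film": ["The Godfather", "Citizen Kane"],
    "Year": ["1972", "1941"],
})

# ":memory:" creates a temporary database that disappears on close
conn = sqlite3.connect(":memory:")
try:
    sample_df.to_sql("Top_50", conn, if_exists="replace", index=False)
    stored = pd.read_sql("SELECT * FROM Top_50", conn)
finally:
    # Close the connection even if the write or the query fails
    conn.close()

print(stored)
```

The <code>try</code>/<code>finally</code> arrangement is a design choice that keeps the connection from staying open if <code>to_sql</code> raises an error.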