
# Objectives
By the end of this lab, you will be able to:

*  Use the `requests` and `BeautifulSoup` libraries to extract the contents of a web page

*  Analyze the `HTML` code of a webpage to find the relevant information

*  Extract the relevant information and save it in the required form

Consider that you have been hired by a Multiplex management organization to extract the information of the top 50 movies with the best average rating from the web link shared below.

```
https://web.archive.org/web/20230902185655/https://en.everybodywiki.com/100_Most_Highly-Ranked_Films
```
The information required is `Average Rank`, `Film`, and `Year`.
You are required to write a Python script `webscraping_movies.py` that extracts the information and saves it to a `CSV` file `top_50_films.csv`. You are also required to save the same information to a database `Movies.db` under the table name `Top_50`.

In [None]:
#pip install bs4

In [1]:
import sys
import pandas as pd
import bs4 as bs

In [None]:
print(sys.version)
print(sys.executable)

3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
/usr/bin/python3


# Initial steps
You require the following libraries for this lab.

1. `pandas` library for data storage and manipulation.
2. `BeautifulSoup` library for interpreting the `HTML` document.
3. `requests` library to communicate with the web page.
4. `sqlite3` for creating the database instance.

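Of these, only `sqlite3` ships with Python 3 as part of the standard library. `requests`, `pandas`, and `BeautifulSoup` are third-party packages. `requests` is often preinstalled (for example, in Google Colab), but you may need to install `pandas` and `BeautifulSoup` yourself.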

For this, run the following commands in a terminal window.
```
python3.11 -m pip install pandas
python3.11 -m pip install bs4
```
Now, create a new file by the name of `webscraping_movies.py` in the path `/home/project/`.

You will write all of your code in this file.

# Code setup
Start the script by importing the required libraries and initializing the values you already know.

## Importing Libraries
Import the following four libraries by adding lines of code noted below to your `webscraping_movies.py` file.



In [2]:
import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup

## Initialization of known entities
You must declare a few entities at the beginning. For example, you know the required `URL`, the `CSV` name for saving the record, the database name, and the table name for storing the record. You also know the entities to be saved. Additionally, since you require only the top 50 results, you will require a loop counter initialized to 0. You may initialize all these by using the following code in `webscraping_movies.py`:




In [10]:
url = 'https://web.archive.org/web/20230902185655/https://en.everybodywiki.com/100_Most_Highly-Ranked_Films'
db_name = 'Movies.db'
table_name = 'Top_50'
#csv_path = '/home/project/top_50_films.csv'
csv_path = '/content/top_50_films.csv'
df = pd.DataFrame(columns=["Average Rank","Film","Year"])
count = 0

## Loading the webpage for Webscraping

To access the required information from the web page, you first need to load the entire web page as an `HTML` document in Python using `requests.get(url).text`, where `.text` is an attribute of the response that holds the page body as a string. You then parse this text with `BeautifulSoup` to enable extraction of the relevant information.

Add the following code to `webscraping_movies.py`:



In [5]:
html_page = requests.get(url).text
data = BeautifulSoup(html_page, 'html.parser')
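The one-line request above assumes the download always succeeds. As a minimal sketch of a more defensive variant, the helper below adds a timeout and a status check before parsing. The name `fetch_soup` and the 30-second timeout are illustrative assumptions, not part of the lab.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=30):
    """Download `url` and return it parsed as a BeautifulSoup document.

    `fetch_soup` and the default timeout are hypothetical choices made
    for this sketch; adjust them to your environment.
    """
    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

With this helper, `data = fetch_soup(url)` could stand in for the two statements above.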

# Analyzing the HTML code for relevant information
Open the web page in a browser and locate the required table by scrolling down to it. Right-click the table and click `Inspect` at the bottom of the menu, as shown in the image below.

This opens the `HTML` code for the page and takes you directly to the point where the definition of the table begins. To check, take your mouse pointer to the `tbody` tag in the `HTML` code and see that the table is highlighted in the page section.

Notice that all rows under this table are mentioned as `tr` objects under the table. Clicking one of them would show that the data in each row is further saved as a `td` object, as seen in the image above. You require the information under the first three headers of this stored data.

It is also important to note that this is the first table on the page. You must identify the required table when extracting information.



# Scraping of required information
You now need to write the loop to extract the appropriate information from the web page. You can access the rows of the required table by calling the `find_all()` method on the `BeautifulSoup` object, as in the statements below.


In [6]:
tables = data.find_all('tbody')
rows = tables[0].find_all('tr')



Here, the variable `tables` gets the body of all the tables in the web page and the variable `rows` gets all the rows of the first table.
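To see what these two calls return without depending on the live page, you can try them on a small inline document. The markup below is a hypothetical stand-in, not the actual page, and the `sample_*` names are chosen so that they do not overwrite the lab's variables.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page (hypothetical markup, not the actual site).
sample_html = """
<table>
  <tbody>
    <tr><th>Average Rank</th><th>Film</th><th>Year</th></tr>
    <tr><td>1</td><td>The Godfather</td><td>1972</td></tr>
    <tr><td>2</td><td>Citizen Kane</td><td>1941</td></tr>
  </tbody>
</table>
<table>
  <tbody>
    <tr><td>unrelated</td></tr>
  </tbody>
</table>
"""

sample = BeautifulSoup(sample_html, 'html.parser')
sample_tables = sample.find_all('tbody')       # one entry per table body
sample_rows = sample_tables[0].find_all('tr')  # rows of the first table only

print(len(sample_tables))                  # 2
print(len(sample_rows))                    # 3
print(len(sample_rows[0].find_all('td')))  # 0, because the header row uses th cells
```

The last line shows why a row can come back without any `td` cells, which matters when you iterate over the rows.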

You can now iterate over the rows to find the required data. Use the code shown below to extract the information.



In [7]:
for row in rows:
    if count<50:
        col = row.find_all('td')
        if len(col)!=0:
            data_dict = {"Average Rank": col[0].contents[0],
                         "Film": col[1].contents[0],
                         "Year": col[2].contents[0]}
            df1 = pd.DataFrame(data_dict, index=[0])
            df = pd.concat([df,df1], ignore_index=True)
            count+=1
    else:
        break

The code functions as follows.

1.   Iterate over the contents of the variable `rows`.
2.   Check for the loop counter to restrict to 50 entries.
3.   Extract all the `td` data objects in the row and save them to `col`.
4.   Check that `col` is not empty, that is, that the current row contains `td` data. This is important because some rows hold no `td` cells: a header row typically uses `th` cells, and there are often merged rows that are not apparent from the page's appearance.
5.   Create a dictionary `data_dict` whose keys match the columns of the dataframe created earlier for recording the output, with the corresponding values taken from the first three cells of the row.
6.   Convert the dictionary to a dataframe and concatenate it with the existing one. This way, the data keeps getting appended to the dataframe with every iteration of the loop.
7.   Increment the loop counter.
8.   Once the counter hits 50, stop iterating over rows and break the loop.
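The steps above can also be arranged differently. The sketch below is an alternative, not the lab's method: it collects one dictionary per row in a list and builds the dataframe once at the end, which avoids calling `pd.concat` on every iteration. It also uses `get_text(strip=True)`, which returns plain text even when a cell wraps its text in another tag (here a hypothetical link), whereas `.contents[0]` would return the tag itself. The inline markup and the names `sample_html`, `records`, `limit`, and `sample_df` are assumptions made for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the real page has many more rows and columns.
sample_html = """
<table><tbody>
  <tr><th>Average Rank</th><th>Film</th><th>Year</th></tr>
  <tr><td>1</td><td><a href="#">The Godfather</a></td><td>1972</td></tr>
  <tr><td>2</td><td>Citizen Kane</td><td>1941</td></tr>
  <tr><td>3</td><td>Casablanca</td><td>1942</td></tr>
</tbody></table>
"""

limit = 2  # the lab uses 50
records = []
for row in BeautifulSoup(sample_html, 'html.parser').find('tbody').find_all('tr'):
    cells = row.find_all('td')
    if not cells:  # skip rows without td cells, such as the header row
        continue
    records.append({
        "Average Rank": cells[0].get_text(strip=True),
        "Film": cells[1].get_text(strip=True),
        "Year": cells[2].get_text(strip=True),
    })
    if len(records) == limit:  # stop once enough rows have been collected
        break

# Build the dataframe in one step from the collected dictionaries.
sample_df = pd.DataFrame(records, columns=["Average Rank", "Film", "Year"])
print(sample_df)
```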

Print the contents of the dataframe using the following:


In [8]:
print(df)

   Average Rank                                           Film  Year
0             1                                  The Godfather  1972
1             2                                   Citizen Kane  1941
2             3                                     Casablanca  1942
3             4                         The Godfather, Part II  1974
4             5                            Singin' in the Rain  1952
5             6                                         Psycho  1960
6             7                                    Rear Window  1954
7             8                                 Apocalypse Now  1979
8             9                          2001: A Space Odyssey  1968
9            10                                  Seven Samurai  1954
10           11                                        Vertigo  1958
11           12                                    Sunset Blvd  1950
12           13                                   Modern Times  1936
13           14                   

# Storing the data
After the dataframe has been created, you can save it to a CSV file using the following command:


In [11]:
df.to_csv(csv_path)

Remember that you defined the variable `csv_path` earlier.
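Note that `to_csv()` writes the dataframe index as an unnamed first column by default. If you would rather keep only the three data columns, you can pass `index=False`. The following sketch compares the two header lines using an in-memory buffer instead of a file; the sample dataframe is illustrative.

```python
import io
import pandas as pd

sample_df = pd.DataFrame({"Average Rank": [1, 2],
                          "Film": ["The Godfather", "Citizen Kane"],
                          "Year": [1972, 1941]})

with_index = io.StringIO()
sample_df.to_csv(with_index)                   # default: index written as first column
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)   # index omitted

print(with_index.getvalue().splitlines()[0])     # ,Average Rank,Film,Year
print(without_index.getvalue().splitlines()[0])  # Average Rank,Film,Year
```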

To store the required data in a database, you first need to initialize a connection to the database, save the dataframe as a table, and then close the connection. This can be done using the following code:


In [12]:
conn = sqlite3.connect(db_name)
df.to_sql(table_name, conn, if_exists='replace', index=False)
conn.close()


This database can now be used to retrieve the relevant information using SQL queries. You will learn how to do that later in the course.

# Practice problems
Try the following practice problems to test your understanding of the lab. Please note that the solutions for the following are not shared. You are encouraged to use the discussion forums in case you need help.

1. Modify the code to extract `Film`, `Year`, and `Rotten Tomatoes' Top 100` headers.

2. Restrict the results to only the top 25 entries.

3. Filter the output to print only the films released in the 2000s (year 2000 included).

In [13]:
###############################################################################

# What is SQLite3?
SQLite is an in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. It is a popular choice as an embedded database for local/client storage in application software. Python's built-in `sqlite3` module provides the interface to it.

## How to connect to SQLite3?
You can connect to SQLite3 using the `connect()` function by passing the required database name as an argument.

In [14]:
import sqlite3
sql_connection = sqlite3.connect('database.db')

This makes the variable `sql_connection` a connection object for the database file `database.db`; if the file does not exist yet, SQLite creates it. You can then use this object to run the required queries on the database.
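As a minimal sketch of the connection lifecycle, the example below connects to a database file in a temporary folder, runs one statement, and closes the connection with `contextlib.closing`. The file name `example.db` and the table `demo` are hypothetical.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

# Use a temporary folder so the sketch does not leave files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, 'example.db')  # hypothetical file name
    existed_before = os.path.exists(db_path)

    # closing() guarantees conn.close() runs even if a statement raises.
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute('CREATE TABLE demo (id INTEGER)')
        conn.commit()
        connection_type = type(conn).__name__

    exists_after = os.path.exists(db_path)

print(existed_before, exists_after, connection_type)  # False True Connection
```

If you only need a scratch database, you can pass `':memory:'` instead of a file name to keep everything in memory.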

## How to create a database table using SQLite3 and Pandas?
You can directly load a Pandas dataframe to a SQLite3 database object using the following syntax.

In [15]:
df.to_sql(table_name, sql_connection, if_exists = 'replace', index = False)

0

Here, you use the `to_sql()` function to convert the `pandas` dataframe to an SQL table.

The `table_name` and `sql_connection` arguments specify the name of the required table and the database to which you should load the dataframe.

The `if_exists` parameter can take any one of three possible values:

*  `'fail'`: This denies the creation of a table if one with the same name exists in the database already.
*  `'replace'`: This overwrites the existing table with the same name.
*  `'append'`: This adds information to the existing table with the same name.

Keep the `index` parameter set to `True` only if the index of the data being sent holds some informational value. Otherwise, keep it as `False`.
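The effect of each `if_exists` value is easiest to see by writing the same dataframe several times. This sketch uses a throwaway in-memory database; the table name `Demo`, the sample data, and the helper `row_count` are assumptions made for illustration.

```python
import sqlite3
import pandas as pd

sample_df = pd.DataFrame({"Film": ["The Godfather", "Citizen Kane"],
                          "Year": [1972, 1941]})
conn = sqlite3.connect(':memory:')  # throwaway in-memory database


def row_count(connection, table):
    """Return the number of rows currently stored in `table`."""
    return connection.execute(f'SELECT COUNT(*) FROM {table}').fetchone()[0]


sample_df.to_sql('Demo', conn, if_exists='replace', index=False)
after_replace = row_count(conn, 'Demo')  # table created (or overwritten)

sample_df.to_sql('Demo', conn, if_exists='append', index=False)
after_append = row_count(conn, 'Demo')   # the same rows are added again

try:
    sample_df.to_sql('Demo', conn, if_exists='fail', index=False)
    fail_raised = False
except ValueError:
    fail_raised = True                   # 'fail' refuses to touch an existing table

conn.close()
print(after_replace, after_append, fail_raised)  # 2 4 True
```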

## How to query a database table using SQLite3 and Pandas?
You can use the Pandas function `read_sql()` to query a database table.

The function returns a Pandas dataframe with the output to the query. Use the function with the following syntax:

In [None]:
df = pd.read_sql(query_statement, sql_connection)

Here, the `query_statement` argument contains the complete query to the required table as a string.

## Example Queries
Some typical queries with their meanings are shown in the table below.

|Query statement|Purpose|
|:---|:---|
|`SELECT * FROM table_name`|Retrieve all entries of the table.|
|`SELECT COUNT(*) FROM table_name`|Retrieve total number of entries in the table.|
|`SELECT Column_name FROM table_name`|Retrieve all entries of a specific column in the table.|
|`SELECT * FROM table_name WHERE <condition>`|Retrieve all entries of the table that meet the specified condition.|
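
The queries in the table can be tried out as follows. This is a sketch under stated assumptions: it uses an in-memory database and only the first three films from the output shown earlier, and the condition `Year < 1950` is just an example. Note that a column name containing a space, such as `Average Rank`, has to be quoted inside the query.

```python
import sqlite3
import pandas as pd

sample_df = pd.DataFrame({"Average Rank": [1, 2, 3],
                          "Film": ["The Godfather", "Citizen Kane", "Casablanca"],
                          "Year": [1972, 1941, 1942]})
conn = sqlite3.connect(':memory:')
sample_df.to_sql('Top_50', conn, if_exists='replace', index=False)

everything = pd.read_sql('SELECT * FROM Top_50', conn)
total = pd.read_sql('SELECT COUNT(*) AS n FROM Top_50', conn)
films = pd.read_sql('SELECT Film FROM Top_50', conn)
# Column names containing spaces must be quoted in SQL.
ranks = pd.read_sql('SELECT "Average Rank" FROM Top_50', conn)
before_1950 = pd.read_sql('SELECT * FROM Top_50 WHERE Year < 1950', conn)
conn.close()

print(total['n'][0])                 # 3
print(before_1950['Film'].tolist())  # ['Citizen Kane', 'Casablanca']
```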