### Web Scraping in Python

#### Objectives

* Use the `requests` and `BeautifulSoup` libraries to extract the contents of a web page.
* Analyze the `HTML` code of a web page to locate the relevant information.
* Extract the relevant information and save it in the required form.

#### Scenario

Extract information on the 50 films with the best average rank from the following archived page:
https://web.archive.org/web/20230902185655/https://en.everybodywiki.com/100_Most_Highly-Ranked_Films

The information required is `Average Rank`, `Film`, and `Year`.   
Extract the information and save it to a `CSV` file `top_50_films.csv`.   
Save the same information to the SQLite database `Movies.db`, in a table named `Top_50`.

In [3]:
# Import relevant libraries
import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup

In [21]:
# Initialize known entities
url = 'https://web.archive.org/web/20230902185655/https://en.everybodywiki.com/100_Most_Highly-Ranked_Films'
db_name = 'Movies.db'
table_name = 'Top_50'
df = pd.DataFrame(columns=['Average Rank', 'Film', 'Year'])
count = 0  # a loop counter initialized to 0

In [5]:
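# Load the web page for scraping
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the server returns an HTTP error
html_page = response.text
data = BeautifulSoup(html_page, 'html.parser')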

In [6]:
# Find every table body on the page, then collect the rows of the first one
tables = data.find_all('tbody')
rows = tables[0].find_all('tr')
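
The lookup above can be tried offline on a small HTML string to see what each `find_all` call returns. The following is a minimal sketch: `sample_html` is a hypothetical miniature of the page's table, written for illustration only, and the `sample_`-prefixed names are assumptions chosen so they do not overwrite the notebook's own variables.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page's table, for illustration only
sample_html = """
<table>
  <tbody>
    <tr><th>Average Rank</th><th>Film</th><th>Year</th></tr>
    <tr><td>1</td><td>The Godfather</td><td>1972</td></tr>
    <tr><td>2</td><td>Citizen Kane</td><td>1941</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_rows = sample_soup.find_all('tbody')[0].find_all('tr')

# Count the <td> cells in each row: the header row uses <th>, so it has none
cells_per_row = [len(r.find_all('td')) for r in sample_rows]
print(cells_per_row)  # [0, 3, 3]

# Read the first child of each cell in the first data row
first_data_row = [str(td.contents[0]) for td in sample_rows[1].find_all('td')]
print(first_data_row)  # ['1', 'The Godfather', '1972']
```

In this sample, the header row yields no `<td>` cells, which is the situation a check on the number of cells is meant to filter out.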

In [23]:
# Iterate over the rows and collect the first 50 data rows
records = []
for row in rows:
    if len(records) == 50:
        break
    col = row.find_all('td')
    if len(col) != 0:  # skip rows without <td> cells (e.g. header rows)
        records.append({
            'Average Rank': col[0].contents[0],
            'Film': col[1].contents[0],
            'Year': col[2].contents[0]
        })

# Build the dataframe once instead of concatenating inside the loop
df = pd.DataFrame(records, columns=['Average Rank', 'Film', 'Year'])

In [24]:
df

|   | Average Rank | Film | Year |
|---|---|---|---|
| 0 | 1 | The Godfather | 1972 |
| 1 | 2 | Citizen Kane | 1941 |
| 2 | 3 | Casablanca | 1942 |
| 3 | 4 | The Godfather, Part II | 1974 |
| 4 | 5 | Singin' in the Rain | 1952 |
| 5 | 6 | Psycho | 1960 |
| 6 | 7 | Rear Window | 1954 |
| 7 | 8 | Apocalypse Now | 1979 |
| 8 | 9 | 2001: A Space Odyssey | 1968 |
| 9 | 10 | Seven Samurai | 1954 |


In [28]:
# Save the dataframe to a CSV file
df.to_csv('top_50_films.csv', index=False)
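
The scenario also asks for the data to be stored in `Movies.db` under the table name `Top_50`. One way to do this, sketched here under stated assumptions, is to pass a `sqlite3` connection to `DataFrame.to_sql`. To keep the example self-contained, `sample_df` is a stand-in built from the first three rows of the output shown earlier; in the notebook, the scraped `df` would take its place. The `if_exists='replace'` choice is an assumption that rerunning the cell should overwrite the table.

```python
import sqlite3
import pandas as pd

# Stand-in for the scraped dataframe (first three rows of the output above)
sample_df = pd.DataFrame({
    'Average Rank': ['1', '2', '3'],
    'Film': ['The Godfather', 'Citizen Kane', 'Casablanca'],
    'Year': ['1972', '1941', '1942'],
})

db_name = 'Movies.db'
table_name = 'Top_50'

conn = sqlite3.connect(db_name)
try:
    # Overwrite the table on reruns and leave out the dataframe's row index
    sample_df.to_sql(table_name, conn, if_exists='replace', index=False)
    # Read the table back to confirm that the rows were written
    check = pd.read_sql(f'SELECT * FROM {table_name}', conn)
finally:
    conn.close()

print(check)
```

Closing the connection in a `finally` block releases the database file even if the write fails.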