## Web Scraping
Web scraping is the extraction of data from a website. In this practical we use the Python library **Beautiful Soup**. The scraper loads the HTML of the page that holds the data, and then either extracts everything on the page or extracts only the specific data the user selects. That selection is made by inspecting the page's HTML and choosing the specific element or tag that contains the desired information.
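
As a minimal sketch of that selection step, the snippet below parses a small, made-up HTML fragment and pulls out elements by CSS selector. The fragment, its tag names, and the `movie`, `title`, and `year` classes are invented for illustration and do not come from any real site. It uses Python's built-in `html.parser`, so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used only to illustrate element selection.
html = """
<div class="movie">
  <h3 class="title">Example Movie</h3>
  <span class="year">(2023)</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every element matching the CSS selector.
titles = [tag.get_text().strip() for tag in soup.select(".movie .title")]

# select_one() returns the first match, or None when nothing matches.
year_tag = soup.select_one(".movie .year")
year = year_tag.get_text().strip() if year_tag is not None else None

print(titles, year)
```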

### Data to Scrape
In this practical we will scrape imdb.com to fetch information about movies of different genres, using the Python libraries Beautiful Soup and requests. IMDB (Internet Movie Database), which is owned by Amazon, is one of the best platforms for finding information about films, television shows, web series, etc.

The data that we want to extract from it are:
* Movie title
* Release date
* Genre
* Movie length
* Movie certification
* Rating
* Metascore
* Description
* Votes

To extract all of this data, our scraper will read each film's entry on the genre listing page. Now let's start scraping.

## Load Libraries
Before we begin, we need to import the libraries that will be used for this practical.

In [15]:
# Load packages
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Getting URLs of different pages
The first thing we need to do is build the URL for each movie genre. The genres include Adventure, Animation, Drama, Comedy, Horror, etc.


In [16]:

genres = ["Adventure","Animation", "Biography", "Comedy", "crime", "Drama", "Family", "Fantasy", "Film-Noir", "History", "Horror", "Thriller", "Music", "Musical", "Mystery", "Romance","Sci-Fi", "Sport","War","Western"]

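# Map each genre name to its IMDB search URL
url_dict = {}

# URL template; the genre name fills the {} placeholder
url = "https://www.imdb.com/search/title/?genres={}"

for genre in genres:
    formated_url = url.format(genre)
    url_dict[genre] = formated_url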


print(url_dict)


{'Adventure': 'https://www.imdb.com/search/title/?genres=Adventure', 'Animation': 'https://www.imdb.com/search/title/?genres=Animation', 'Biography': 'https://www.imdb.com/search/title/?genres=Biography', 'Comedy': 'https://www.imdb.com/search/title/?genres=Comedy', 'crime': 'https://www.imdb.com/search/title/?genres=crime', 'Drama': 'https://www.imdb.com/search/title/?genres=Drama', 'Family': 'https://www.imdb.com/search/title/?genres=Family', 'Fantasy': 'https://www.imdb.com/search/title/?genres=Fantasy', 'Film-Noir': 'https://www.imdb.com/search/title/?genres=Film-Noir', 'History': 'https://www.imdb.com/search/title/?genres=History', 'Horror': 'https://www.imdb.com/search/title/?genres=Horror', 'Thriller': 'https://www.imdb.com/search/title/?genres=Thriller', 'Music': 'https://www.imdb.com/search/title/?genres=Music', 'Musical': 'https://www.imdb.com/search/title/?genres=Musical', 'Mystery': 'https://www.imdb.com/search/title/?genres=Mystery', 'Romance': 'https://www.imdb.com/search
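
The same mapping can also be built with a dict comprehension. The sketch below is one possible alternative, not a required change: it passes each genre through `urlencode` from the standard library so that any special character in a genre name would be escaped. The names `BASE_URL` and `build_genre_urls` are introduced here for illustration only.

```python
from urllib.parse import urlencode

# Search endpoint used throughout this practical.
BASE_URL = "https://www.imdb.com/search/title/"


def build_genre_urls(genres):
    """Map each genre name to its search URL."""
    return {
        genre: f"{BASE_URL}?{urlencode({'genres': genre})}" for genre in genres
    }


sample_urls = build_genre_urls(["Adventure", "Sci-Fi", "Film-Noir"])
for genre, genre_url in sample_urls.items():
    print(genre, genre_url)
```

Hyphenated names such as `Sci-Fi` pass through unchanged, so for the genre list used here the resulting URLs match the ones printed above.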

## Parsing Movie Information
Now let's parse the movie information from IMDB. We will work with one genre first.

In [17]:
url = 'https://www.imdb.com/search/title/?genres=Adventure'
# Sending a request to the specified URL
resp = requests.get(url)
# Converting the response to Beautiful Soup Object
content = BeautifulSoup(resp.content, 'lxml')

In [21]:
# Iterating through the list of movies
for movie in content.select('.lister-item-content'):
    try:
        # Creating a Python dictionary
        data = {
            "title": movie.select('.lister-item-header')[0].get_text().strip(),
            "year": movie.select('.lister-item-year')[0].get_text().strip(),
            "certificate": movie.select('.certificate')[0].get_text().strip(),
            "time": movie.select(".runtime")[0].get_text().strip(),
            "genre": movie.select(".genre")[0].get_text().strip(),
            "rating": movie.select(".ratings-imdb-rating")[0].get_text().strip(),
            "metascore": movie.select(".metascore")[0].get_text().strip(),
            "simple_desc": movie.select(".text-muted")[2].get_text().strip(),
            "votes": movie.select(".sort-num_votes-visible")[0].get_text().strip(),
        }
    except IndexError:
        # Skip movies that are missing any of the fields above
        continue


    print(data)


{'title': '1.\nGuardians of the Galaxy Vol. 3\n(2023)', 'year': '(2023)', 'certificate': 'PG13', 'time': '150 min', 'genre': 'Action, Adventure, Comedy', 'rating': '8.3', 'metascore': '64', 'simple_desc': 'Still reeling from the loss of Gamora, Peter Quill rallies his team to defend the universe and one of their own - a mission that could mean the end of the Guardians if not successful.', 'votes': 'Votes:\n123,203'}
{'title': '2.\nDungeons & Dragons: Honor Among Thieves\n(2023)', 'year': '(2023)', 'certificate': 'PG13', 'time': '134 min', 'genre': 'Action, Adventure, Comedy', 'rating': '7.4', 'metascore': '72', 'simple_desc': 'A charming thief and a band of unlikely adventurers embark on an epic quest to retrieve a lost relic, but things go dangerously awry when they run afoul of the wrong people.', 'votes': 'Votes:\n99,139'}
{'title': '3.\nThe Super Mario Bros. Movie\n(2023)', 'year': '(2023)', 'certificate': 'PG', 'time': '92 min', 'genre': 'Animation, Adventure, Comedy', 'rating': '
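
The printed dictionaries show that several fields still carry extra text: the title includes the rank and the year, the year is wrapped in parentheses, and the votes start with a `Votes:` label. One possible cleaning step, sketched with the standard `re` module, is shown below. The `clean_movie` helper and its output keys are hypothetical, and the patterns only cover the formats visible in the output above.

```python
import re


def clean_movie(raw):
    """Return a tidied copy of one scraped movie dictionary."""
    # The title text has the form "rank.\nTitle\n(year)"; keep the middle line.
    title_lines = raw["title"].split("\n")
    title = title_lines[1].strip() if len(title_lines) > 1 else raw["title"]

    # Pull the four-digit year out of strings such as "(2023)" or "(I) (2023)".
    year_match = re.search(r"\d{4}", raw["year"])

    # Runtime looks like "150 min"; keep the leading number.
    time_match = re.match(r"\d+", raw["time"])

    # Votes look like "Votes:\n123,203", sometimes followed by "| Gross: ...".
    votes_match = re.search(r"Votes:\s*([\d,]+)", raw["votes"])

    return {
        "title": title,
        "year": int(year_match.group()) if year_match else None,
        "minutes": int(time_match.group()) if time_match else None,
        "rating": float(raw["rating"]),
        "votes": int(votes_match.group(1).replace(",", "")) if votes_match else None,
    }


# Sample values copied from the first movie in the output above.
raw = {
    "title": "1.\nGuardians of the Galaxy Vol. 3\n(2023)",
    "year": "(2023)",
    "time": "150 min",
    "rating": "8.3",
    "votes": "Votes:\n123,203",
}
cleaned = clean_movie(raw)
print(cleaned)
```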

## Creating a scraping function
Now let's create a function that does the same as above but can be reused for different URLs.

In [13]:
def get_movies(url):
    """Scrape one IMDB listing page and return its movies as a DataFrame."""
    resp = requests.get(url)
    content = BeautifulSoup(resp.content, 'lxml')

    movie_list = []

    for movie in content.select('.lister-item-content'):

        try:
            data = {
                "title": movie.select('.lister-item-header')[0].get_text().strip(),
                "year": movie.select('.lister-item-year')[0].get_text().strip(),
                "certificate": movie.select('.certificate')[0].get_text().strip(),
                "time": movie.select(".runtime")[0].get_text().strip(),
                "genre": movie.select(".genre")[0].get_text().strip(),
                "rating": movie.select(".ratings-imdb-rating")[0].get_text().strip(),
                "metascore": movie.select(".metascore")[0].get_text().strip(),
                "simple_desc": movie.select(".text-muted")[2].get_text().strip(),
                "votes": movie.select(".sort-num_votes-visible")[0].get_text().strip(),
            }

        except IndexError:
            continue

        movie_list.append(data)

    return pd.DataFrame(movie_list)
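
One consequence of the `try`/`except IndexError` block is that a film missing any single field (for example, one with no Metascore) is skipped entirely. If keeping such films is preferable, one option is to look up each field separately and store `None` when it is absent. The sketch below shows the idea on a made-up HTML fragment rather than a live page. `first_text` is a hypothetical helper, and the class names are the ones used in the function above.

```python
from bs4 import BeautifulSoup


def first_text(node, selector):
    """Return the stripped text of the first match, or None if there is none."""
    match = node.select_one(selector)
    return match.get_text().strip() if match is not None else None


# Made-up fragment: the second film has no Metascore element.
html = """
<div class="lister-item-content">
  <span class="runtime">150 min</span>
  <span class="metascore">64</span>
</div>
<div class="lister-item-content">
  <span class="runtime">92 min</span>
</div>
"""

content = BeautifulSoup(html, "html.parser")

rows = [
    {
        "time": first_text(movie, ".runtime"),
        "metascore": first_text(movie, ".metascore"),
    }
    for movie in content.select(".lister-item-content")
]
print(rows)
```

With this approach both films are kept, and the missing Metascore appears as `None`, which pandas treats as a missing value.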


## Scraping movies of different genres
The **get_movies()** function we wrote above parses movie details from the IMDB listing page of any genre URL and returns them as a DataFrame. By calling it for every genre, we can scrape all genres, combine the results, and save them to a CSV file. Let's see how this can be done.

In [14]:
url = 'https://www.imdb.com/search/title/?genres=adventure'
# Calling the function
get_movies(url)

Unnamed: 0,title,year,certificate,time,genre,rating,metascore,simple_desc,votes
0,1.\nGuardians of the Galaxy Vol. 3\n(2023),(2023),PG13,150 min,"Action, Adventure, Comedy",8.3,64,"Still reeling from the loss of Gamora, Peter Q...","Votes:\n123,203"
1,2.\nDungeons & Dragons: Honor Among Thieves\n(...,(2023),PG13,134 min,"Action, Adventure, Comedy",7.4,72,A charming thief and a band of unlikely advent...,"Votes:\n99,139"
2,3.\nThe Super Mario Bros. Movie\n(2023),(2023),PG,92 min,"Animation, Adventure, Comedy",7.2,46,The story of The Super Mario Bros. on their jo...,"Votes:\n111,060"
3,10.\nGuardians of the Galaxy\n(2014),(2014),PG13,121 min,"Action, Adventure, Comedy",8.0,76,A group of intergalactic criminals must pull t...,"Votes:\n1,212,863\n| Gross:\n$333.18M"
4,11.\nPeter Pan & Wendy\n(2023),(2023),PG,106 min,"Action, Adventure, Comedy",4.1,61,"Follow the adventures of Peter Pan, a boy who ...","Votes:\n20,047"
5,12.\nGhosted\n(I) (2023),(I) (2023),NC16,116 min,"Action, Adventure, Comedy",5.8,34,Cole falls head over heels for enigmatic Sadie...,"Votes:\n37,357"
6,14.\nGuardians of the Galaxy Vol. 2\n(2017),(2017),PG13,136 min,"Action, Adventure, Comedy",7.6,67,The Guardians struggle to keep together as a t...,"Votes:\n710,520\n| Gross:\n$389.81M"
7,15.\nFast X\n(2023),(2023),PG13,141 min,"Action, Adventure, Crime",6.4,55,Dom Toretto and his family are targeted by the...,"Votes:\n12,746"
8,19.\nAnt-Man and the Wasp: Quantumania\n(2023),(2023),PG,124 min,"Action, Adventure, Comedy",6.2,48,Scott Lang and Hope Van Dyne are dragged into ...,"Votes:\n145,150"
9,21.\nDune: Part One\n(2021),(2021),PG13,155 min,"Action, Adventure, Drama",8.0,74,A noble family becomes embroiled in a war for ...,"Votes:\n674,432\n| Gross:\n$108.33M"


In [7]:
# Scrape every genre and combine the results into one DataFrame
frames = [get_movies(url) for url in url_dict.values()]
df_data = pd.concat(frames)

# Save the combined data and preview the first rows
df_data.to_csv('movies.csv')
df_data.head()

Unnamed: 0,title,year,certificate,time,genre,rating,metascore,simple_desc,votes
0,1.\nGuardians of the Galaxy Vol. 3\n(2023),(2023),PG13,150 min,"Action, Adventure, Comedy",8.3,64,"Still reeling from the loss of Gamora, Peter Q...","Votes:\n123,203"
1,2.\nDungeons & Dragons: Honor Among Thieves\n(...,(2023),PG13,134 min,"Action, Adventure, Comedy",7.4,72,A charming thief and a band of unlikely advent...,"Votes:\n99,139"
2,3.\nThe Super Mario Bros. Movie\n(2023),(2023),PG,92 min,"Animation, Adventure, Comedy",7.2,46,The story of The Super Mario Bros. on their jo...,"Votes:\n111,060"
3,10.\nGuardians of the Galaxy\n(2014),(2014),PG13,121 min,"Action, Adventure, Comedy",8.0,76,A group of intergalactic criminals must pull t...,"Votes:\n1,212,863\n| Gross:\n$333.18M"
4,11.\nPeter Pan & Wendy\n(2023),(2023),PG,106 min,"Action, Adventure, Comedy",4.1,61,"Follow the adventures of Peter Pan, a boy who ...","Votes:\n20,047"
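
The combined table above does not record which genre search each row came from. A possible variant, sketched below with small stand-in DataFrames instead of live scraping, adds that label and renumbers the rows. The `searched_genre` column name and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for the per-genre tables that get_movies() would return.
frames_by_genre = {
    "Adventure": pd.DataFrame({"title": ["Film A", "Film B"]}),
    "Comedy": pd.DataFrame({"title": ["Film C"]}),
}

# Label each table with its genre, then concatenate all of them in one call.
labelled = [
    frame.assign(searched_genre=genre) for genre, frame in frames_by_genre.items()
]
combined = pd.concat(labelled, ignore_index=True)

print(combined)
```

Passing `index=False` to `to_csv` would also keep the row index out of the file, which avoids the extra `Unnamed: 0` column when the CSV is read back.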
