We can scrape IMDb movie ratings and their details using Python's **BeautifulSoup** library.

Read the documentation here: https://beautiful-soup-4.readthedocs.io/en/latest/

Below is the list of modules required to scrape IMDb.

* *requests:* Requests is a widely used third-party Python library for making HTTP requests to a specified URL. Whether you work with REST APIs or web scraping, it is worth learning before going further with these technologies. When you make a request to a URI, it returns a response.

* *html5lib:* A pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, which all major web browsers implement.

* *bs4:* The package name of Beautiful Soup, a Python library for pulling data out of HTML and XML documents. It provides the BeautifulSoup object. Web scraping is the process of extracting data from a website with automated tools to make the process faster.

**STEP 1. Import the libraries**

In [None]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

**STEP 2. Access the HTML content from the webpage by assigning the URL and creating a soup object.**

In [None]:
# Download the data for IMDb's top 250 movies
# The IMDb website checks the GET request for an Accept-Language header. If the request doesn't have one, it shows the Chinese version.
headers = {'Accept-Language': 'en-US,en;q=0.8'}
url = 'https://www.imdb.com/chart/top-english-movies'
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text, "html")
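
Before parsing, it can help to confirm that the header is really attached to the request and that the download succeeded. The following is a minimal sketch that reuses the same `url` and `headers`. The helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests

headers = {"Accept-Language": "en-US,en;q=0.8"}
url = "https://www.imdb.com/chart/top-english-movies"

# Build the request without sending it, to inspect what would go over the wire.
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.method, prepared.headers["Accept-Language"])


def fetch_html(url, headers, timeout=10):
    """Hypothetical helper: download a page and fail loudly on HTTP errors."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

Calling `fetch_html(url, headers)` would then return the page source as a string, or raise an exception instead of silently handing an error page to the parser.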


**STEP 3. Extract the movie ratings and their details. Here, we extract data from the BeautifulSoup object using HTML tags and attributes such as href and title.** This step requires basic HTML skills. Try to understand the page structure before you loop through all the data.

* Select the ```<td>``` tags with the class ```titleColumn```. Note that each result is still an HTML element, not plain text.

In [None]:
movies = soup.select('td.titleColumn')
print(movies[0])

<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
<span class="secondaryInfo">(1994)</span>
</td>


* Get the text between the paired HTML tags.

In [None]:
title = [t.get_text() for t in soup.select('td.titleColumn a')]
print(title[0])


The Shawshank Redemption
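
The difference between an element and its text can be tried offline. This is a minimal sketch, assuming the markup has the shape printed earlier; `sample_html` is a hand-copied fragment (with the `title` attribute omitted for brevity), and the built-in `html.parser` is used only so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hand-copied fragment of one table cell, used instead of a live download.
sample_html = """
<td class="titleColumn">
      1.
      <a href="/title/tt0111161/">The Shawshank Redemption</a>
<span class="secondaryInfo">(1994)</span>
</td>
"""
# "td.titleColumn" matches a <td> whose class list contains "titleColumn".
cell = BeautifulSoup(sample_html, "html.parser").select_one("td.titleColumn")

# .string only works when a tag has a single child; this <td> has several.
print(cell.string)
# get_text() joins all nested text, including the layout whitespace.
print(repr(cell.get_text()))
# stripped_strings yields each text piece with surrounding whitespace removed.
parts = list(cell.stripped_strings)
print(parts)
```

Selecting the inner ```<a>``` directly, as the cell above does, avoids the extra rank and year text altogether.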


* To get the link, notice that it sits in the ```<a>``` tag nested inside the ```<td>``` tag. In a CSS selector, separating two selectors with a space selects the inner element.

In [None]:
links = soup.select('td.titleColumn a')
print(links[0])

<a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
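
The effect of the space in the selector can be sketched on a small fragment. The second cell and its link are hypothetical markup, added only to show what the selector filters out.

```python
from bs4 import BeautifulSoup

# The ratingColumn link is invented for illustration; it is not IMDb markup.
sample_html = """
<table><tr>
<td class="titleColumn"><a href="/title/tt0111161/">The Shawshank Redemption</a></td>
<td class="ratingColumn"><a href="/other/">not a title link</a></td>
</tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

all_links = sample_soup.select("a")                      # every <a> in the fragment
title_links = sample_soup.select("td.titleColumn a")     # <a> nested anywhere inside td.titleColumn
direct_links = sample_soup.select("td.titleColumn > a")  # <a> that is a direct child of td.titleColumn

print(len(all_links), len(title_links), len(direct_links))
```

The space is the CSS descendant combinator, while `>` restricts the match to direct children. For this page structure both give the same links.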




*   Get the attribute inside the tag. 



In [None]:
# Equivalent one-liner:
# links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]

links = []
for a in soup.select('td.titleColumn a'):
    links.append(a.attrs.get('href'))

print(links[0])

/title/tt0111161/
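
Attributes can be read in a few equivalent ways. Here is a short sketch on the single link shown earlier. Using `urljoin` from the standard library to build the absolute URL is an alternative to string concatenation that this notebook does not itself use.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

anchor = BeautifulSoup(
    '<a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">'
    "The Shawshank Redemption</a>",
    "html.parser",
).a

print(anchor.attrs)      # dict with every attribute on the tag
print(anchor["href"])    # dictionary-style access; raises KeyError if the attribute is missing
print(anchor.get("id"))  # .get() returns None when the attribute is missing

# Resolve the relative path against the site root to get an absolute URL.
full_link = urljoin("https://www.imdb.com", anchor["href"])
print(full_link)
```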




*   Get the rating and the vote text, then clean the vote text with a regex.



In [None]:
ratings = soup.select('td.ratingColumn strong')
print(ratings[0].get_text())
vote = [a.attrs.get('title') for a in soup.select('td.ratingColumn strong')]
print(vote[0])


# apply a regex search to the first record

match_s = re.search("on", vote[0])
match_e = re.search("user", vote[0])
first_vote = vote[0][match_s.end():match_e.start()]  # slice out the text between "on" and "user"
print(first_vote.strip())  # remove surrounding whitespace
  



9.3
9.3 based on 2,643,803 user ratings
2,643,803
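
The same cleaning can be done with one pattern and a capture group. This is a sketch that assumes the vote text always has the form shown in the output above; it also converts the result to an integer, which the notebook itself does not do.

```python
import re

vote_title = "9.3 based on 2,643,803 user ratings"

# Capture the run of digits and commas that sits between "based on" and "user".
match = re.search(r"based on ([\d,]+) user", vote_title)
if match is None:
    raise ValueError(f"unexpected vote text: {vote_title!r}")

vote_text = match.group(1)                    # the number as printed, commas included
vote_count = int(vote_text.replace(",", ""))  # drop the thousands separators
print(vote_text, vote_count)
```

Anchoring on the longer phrase "based on" is less likely to match by accident than the bare "on", and the explicit `None` check gives a readable error if the page wording changes.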


**STEP 4. Loop over all movies and store the data.**
The vote counts need the same regex cleaning as in the previous step.

In [None]:
movie_list = []

# Run each selector once, outside the loop, instead of re-querying the page on every iteration
title_cells = soup.select('td.titleColumn')
title_links = soup.select('td.titleColumn a')
year_spans = soup.select('td.titleColumn span')
rating_tags = soup.select('td.ratingColumn strong')

# Iterate over the movies and extract each movie's details
for i in range(len(title_cells)):

    rank = title_cells[i].get_text()
    # clean the rank: keep only the digits before the first "." and strip whitespace
    rank = rank[0:rank.find('.')].strip()

    movie_title = title_links[i].string
    year = year_spans[i].string
    # clean the year: remove "(" and ")"
    year = year.replace('(', "").replace(')', "")

    crew = title_links[i].attrs.get('title')

    rating = rating_tags[i].get_text()
    vote = rating_tags[i].attrs.get('title')
    # clean the votes: keep the number between "on" and "user"
    match_s = re.search("on", vote)
    match_e = re.search("user", vote)
    vote = vote[match_s.end():match_e.start()].strip()

    link = title_links[i].attrs.get('href')
    # add the domain name
    link = "https://www.imdb.com" + link

    data = {"rank": rank,
            "movie_title": movie_title,
            "year": year,
            "crew": crew,
            "rating": rating,
            "link": link,
            "votes": vote}
    movie_list.append(data)

**STEP 5. Save as a DataFrame and store it as CSV for further analysis.** You can also store it in a SQL database if you prefer.

In [None]:

df = pd.DataFrame(movie_list, columns=['rank', 'movie_title', 'year', 'crew', 'rating', 'link', 'votes'])
print(movie_list[0])
print(df.head())

df.to_csv('250movie.csv', index=False)

{'rank': '1', 'movie_title': 'The Shawshank Redemption', 'year': '1994', 'crew': 'Frank Darabont (dir.), Tim Robbins, Morgan Freeman', 'rating': '9.3', 'link': 'https://www.imdb.com/title/tt0111161/', 'votes': '2,644,057'}
  rank               movie_title  year  \
0    1  The Shawshank Redemption  1994   
1    2             The Godfather  1972   
2    3           The Dark Knight  2008   
3    4     The Godfather Part II  1974   
4    5              12 Angry Men  1957   

                                                crew rating  \
0  Frank Darabont (dir.), Tim Robbins, Morgan Fre...    9.3   
1  Francis Ford Coppola (dir.), Marlon Brando, Al...    9.2   
2  Christopher Nolan (dir.), Christian Bale, Heat...    9.0   
3  Francis Ford Coppola (dir.), Al Pacino, Robert...    9.0   
4      Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb    9.0   

                                    link      votes  
0  https://www.imdb.com/title/tt0111161/  2,644,057  
1  https://www.imdb.com/title/tt00686
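
Every scraped value is a string, as the quoted numbers in the printed dictionary show. For further analysis, it may be useful to convert the numeric columns first. This is a minimal sketch on the first record only; `sample_df` is an illustrative name, and the conversions are a suggestion rather than part of the original notebook.

```python
import pandas as pd

# First record from the output above; all values are strings at this point.
rows = [{"rank": "1",
         "movie_title": "The Shawshank Redemption",
         "year": "1994",
         "rating": "9.3",
         "votes": "2,644,057"}]
sample_df = pd.DataFrame(rows)

sample_df["rank"] = sample_df["rank"].astype(int)
sample_df["year"] = sample_df["year"].astype(int)
sample_df["rating"] = sample_df["rating"].astype(float)
# Remove the thousands separators before converting the vote counts.
sample_df["votes"] = sample_df["votes"].str.replace(",", "", regex=False).astype(int)

print(sample_df.dtypes)
```

With numeric columns, operations such as sorting by `votes` or averaging `rating` per `year` behave as expected instead of comparing text.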

In [None]:
!pip install bs4

