# Step-by-step: Scraping Community (2009)'s IMDb Ratings

**Reference:**
https://www.dataquest.io/blog/web-scraping-beautifulsoup/

First, I break down all of the steps.

### Taking a look at the URL

In [1]:
from requests import get
url = 'https://www.imdb.com/title/tt1439629/episodes?season=1'
# Request the page content from the server with get() and store the server's response in `response`
response = get(url)
print(response.text[:500])

<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///title/tt1439629?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
  


### Use BeautifulSoup to parse the HTML content
Parse `response.text` by creating a `BeautifulSoup` object, and assign this object to `html_soup`. The `'html.parser'` argument indicates that we want to do the parsing using Python's built-in HTML parser.

In [2]:
from bs4 import BeautifulSoup

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

bs4.BeautifulSoup

All of the data we want sits inside `<div class="info" ...> </div>` containers. Each episode has exactly one, so we collect those.

In [67]:
episode_containers = html_soup.find_all('div', class_='info')

`find_all()` returns a `ResultSet` object, which behaves like a list and contains every episode `div` we are interested in (25 for season 1).

Now, the variables we will be getting are:
- Episode title
- Episode number
- Airdate
- IMDb rating 
- Total votes
- Episode description

Read [this part](https://www.dataquest.io/blog/web-scraping-beautifulsoup/#thenameofthemovie) of the Dataquest article to understand how accessing the `<a>` tags works.

### Episode title

In [4]:
episode_containers[0].a['title']

'Pilot'

### Episode number

In [5]:
episode_containers[0].meta['content']

'1'

### Airdate

In [6]:
episode_containers[0].find('div', class_='airdate').text.strip()

'17 Sep. 2009'

### IMDb rating & total votes

In [7]:
episode_containers[0].find('div', class_='ipl-rating-star').text.split()

['7.8', '(3,178)']
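
The two strings above still need converting before they can be used as numbers. The notebook does this later in pandas, but a minimal plain-Python sketch of the same idea may help; `parse_rating_votes` is a hypothetical helper (not part of the original notebook), and it assumes the input always has the shape `['7.8', '(3,178)']`:

```python
def parse_rating_votes(parts):
    """Turn ['7.8', '(3,178)'] into (7.8, 3178).

    Hypothetical helper; assumes exactly two strings: the rating and the
    vote count wrapped in parentheses with comma thousands separators.
    """
    rating_text, votes_text = parts
    rating = float(rating_text)
    # Drop the surrounding parentheses, then the thousands separators
    votes = int(votes_text.strip('()').replace(',', ''))
    return rating, votes


rating, votes = parse_rating_votes(['7.8', '(3,178)'])
print(rating, votes)
```

An episode without a rating would not match this shape, so a real scraper would probably need a guard before unpacking.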

### Episode description

In [8]:
episode_containers[0].find('div', class_='item_description').text.strip()


'An ex-lawyer is forced to return to community college to get a degree. However, he tries to use the skills he learned as a lawyer to get the answers to all his tests and pick up on a sexy woman in his Spanish class.'

In [9]:
episode_titles = []
episode_numbers = []
airdates = []
ratings_votes = []
descriptions = []

# find_all() always returns a ResultSet (never None), so no extra check is needed
for episode in episode_containers:
    title = episode.a['title']
    episode_titles.append(title)

    number = episode.meta['content']
    episode_numbers.append(number)

    airdate = episode.find('div', class_='airdate').text.strip()
    airdates.append(airdate)

    # Kept as a two-item list for now: [rating, '(votes)']
    rating_votes = episode.find('div', class_='ipl-rating-star').text.split()
    ratings_votes.append(rating_votes)

    desc = episode.find('div', class_='item_description').text.strip()
    descriptions.append(desc)

In [10]:
import pandas as pd 

In [11]:
test = pd.DataFrame({
    'episode_title': episode_titles,
    'episode_num': episode_numbers,
    'airdate': airdates,
    'rating_votes': ratings_votes,
    'description': descriptions,
})

test.head()

| | episode_title | episode_num | airdate | rating_votes | description |
|---|---|---|---|---|---|
| 0 | Pilot | 1 | 17 Sep. 2009 | [7.8, (3,178)] | An ex-lawyer is forced to return to community ... |
| 1 | Spanish 101 | 2 | 24 Sep. 2009 | [7.9, (2,752)] | Jeff takes steps to ensure that Brita will be ... |
| 2 | Introduction to Film | 3 | 1 Oct. 2009 | [8.3, (2,692)] | Brita comes between Abed and his father when s... |
| 3 | Social Psychology | 4 | 8 Oct. 2009 | [8.2, (2,468)] | Jeff and Shirley bond by making fun of Britta'... |
| 4 | Advanced Criminal Law | 5 | 15 Oct. 2009 | [7.9, (2,368)] | Señor Chang is on the hunt for a cheater and t... |


## Controlling the crawl-rate

In [13]:
from time import sleep
from random import randint
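
These two imports are typically combined into a randomized pause between requests, so the scraper does not hit the server at a fixed, rapid rhythm. A minimal sketch follows; `polite_pause` and its parameters are hypothetical names introduced for illustration only:

```python
from random import randint
from time import sleep


def polite_pause(min_seconds=1, max_seconds=3, sleeper=sleep):
    """Sleep for a random whole number of seconds and return the delay.

    Hypothetical helper; `sleeper` is injectable so the function can be
    exercised without actually waiting.
    """
    delay = randint(min_seconds, max_seconds)  # inclusive on both ends
    sleeper(delay)
    return delay


# Demo with a no-op sleeper so nothing actually waits here;
# omit the `sleeper` argument to pause for real between requests.
waited = polite_pause(1, 3, sleeper=lambda seconds: None)
print('Would have waited {} s'.format(waited))
```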

## Monitoring the loop as it's still going

Since we're scraping 6 pages (one per season), it helps to monitor the scraping process while it is still running. This is optional, but it is very useful for testing and debugging, and it matters more as the number of pages grows: if you scrape hundreds or thousands of pages in a single run, monitoring becomes a must.

For our script, we’ll make use of this feature, and monitor the following parameters:

- The **frequency (speed) of requests**, so we make sure our program is not overloading the server.
- The **number of requests**, so we can halt the loop in case the number of expected requests is exceeded.
- The **status code** of our requests, so we make sure the server is sending back the proper responses.
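
One way to picture these three checks together is a small tracker object. This is only a sketch under stated assumptions: `RequestMonitor`, its method names, and the injectable `clock` parameter are hypothetical and do not appear in the notebook, which keeps the same bookkeeping inline in its loop.

```python
from time import time
from warnings import warn


class RequestMonitor:
    """Track request count, frequency, and status codes (hypothetical helper)."""

    def __init__(self, max_requests, clock=time):
        self.max_requests = max_requests
        self.clock = clock
        self.start_time = clock()
        self.count = 0

    def record(self, status_code):
        """Register one request; return False when the loop should stop."""
        self.count += 1
        if status_code != 200:
            warn('Request: {}; Status code: {}'.format(self.count, status_code))
        if self.count > self.max_requests:
            warn('Number of requests was greater than expected.')
            return False
        return True

    def frequency(self):
        """Requests per second since the monitor was created."""
        elapsed = self.clock() - self.start_time
        return self.count / elapsed if elapsed > 0 else float('inf')


monitor = RequestMonitor(max_requests=6)
keep_going = monitor.record(200)  # a successful request within the limit
print(keep_going, monitor.count)
```

In a scraping loop, `record(response.status_code)` would be called once per request, and the loop would `break` as soon as it returns `False`.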

In [14]:
from time import time

In [15]:
from IPython.display import clear_output

In [16]:
start_time = time()

requests = 0

for _ in range(6):
    
    requests += 1
    sleep(randint(1,3))
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests,requests/elapsed_time))
    clear_output(wait=True)

Request: 6; Frequency: 0.49958881801791205 requests/s


# Final code

Putting all of the steps together to get the final dataset that contains the episode data for every season in the series:

In [17]:
from warnings import warn

# Initialize the lists that the loop will populate
episode_titles = []
episode_numbers = []
airdates = []
ratings_votes = []
descriptions = []
season = []

seasons = [str(i) for i in range(1,7)]

# Preparing the monitoring of the loop
start_time = time()
requests = 0

# For every season in the series
for sn in seasons:
    response = get('https://www.imdb.com/title/tt1439629/episodes?season=' + sn)

    # Pause the loop
    sleep(randint(8,15))

    # Monitor the requests
    requests += 1
    elapsed_time = time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True)

    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))
    
    # Break the loop if the number of requests is greater than expected (one per season)
    if requests > len(seasons):
        warn('Number of requests was greater than expected.')
        break

    # Parse the content of the request with BeautifulSoup
    page_html = BeautifulSoup(response.text, 'html.parser')

    # Select all the episode containers from a single season page
    episode_containers = page_html.find_all('div', class_='info')

    # For each episode in each season
    for episode in episode_containers:
        season.append(sn)
        title = episode.a['title']
        episode_titles.append(title)

        number = episode.meta['content']
        episode_numbers.append(number)

        airdate = episode.find('div', class_='airdate').text.strip()
        airdates.append(airdate)

        rating_votes = episode.find('div', class_='ipl-rating-star').text.split()
        ratings_votes.append(rating_votes)

        desc = episode.find('div', class_='item_description').text.strip()
        descriptions.append(desc)

Request:6; Frequency: 0.07929696231902113 requests/s


## Put all the lists together into a pandas DataFrame

In [57]:
community_episodes = pd.DataFrame({
    'episode_title': episode_titles,
    'episode_num': episode_numbers,
    'airdate': airdates,
    'rating_votes': ratings_votes,
    'description': descriptions,
    'season': season,
})

community_episodes.head()

| | episode_title | episode_num | airdate | rating_votes | description | season |
|---|---|---|---|---|---|---|
| 0 | Pilot | 1 | 17 Sep. 2009 | [7.8, (3,178)] | An ex-lawyer is forced to return to community ... | 1 |
| 1 | Spanish 101 | 2 | 24 Sep. 2009 | [7.9, (2,752)] | Jeff takes steps to ensure that Brita will be ... | 1 |
| 2 | Introduction to Film | 3 | 1 Oct. 2009 | [8.3, (2,692)] | Brita comes between Abed and his father when s... | 1 |
| 3 | Social Psychology | 4 | 8 Oct. 2009 | [8.2, (2,468)] | Jeff and Shirley bond by making fun of Britta'... | 1 |
| 4 | Advanced Criminal Law | 5 | 15 Oct. 2009 | [7.9, (2,368)] | Señor Chang is on the hunt for a cheater and t... | 1 |


# Data Cleaning

### Giving the ratings and total votes their own columns

In [58]:
community_episodes[['rating','total_votes']] = pd.DataFrame(community_episodes.rating_votes.tolist(),index=community_episodes.index)

Dropping the `rating_votes` column now that we don't need it.

In [59]:
community_episodes.drop('rating_votes',axis=1,inplace=True)

In [60]:
community_episodes.head()

| | episode_title | episode_num | airdate | description | season | rating | total_votes |
|---|---|---|---|---|---|---|---|
| 0 | Pilot | 1 | 17 Sep. 2009 | An ex-lawyer is forced to return to community ... | 1 | 7.8 | (3,178) |
| 1 | Spanish 101 | 2 | 24 Sep. 2009 | Jeff takes steps to ensure that Brita will be ... | 1 | 7.9 | (2,752) |
| 2 | Introduction to Film | 3 | 1 Oct. 2009 | Brita comes between Abed and his father when s... | 1 | 8.3 | (2,692) |
| 3 | Social Psychology | 4 | 8 Oct. 2009 | Jeff and Shirley bond by making fun of Britta'... | 1 | 8.2 | (2,468) |
| 4 | Advanced Criminal Law | 5 | 15 Oct. 2009 | Señor Chang is on the hunt for a cheater and t... | 1 | 7.9 | (2,368) |


## Fixing the data types

### Converting the total votes count to numeric

First, we create a function that removes the `,`, `(`, and `)` characters from `total_votes` so that the remaining digits can be converted to a number.

In [61]:
def remove_str(votes):
    for r in ((',',''), ('(',''),(')','')):
        votes = votes.replace(*r)
        
    return votes

Now we apply the function to strip those characters, then change the type to `int` using `.astype()`.

In [62]:
community_episodes['total_votes'] = community_episodes.total_votes.apply(remove_str).astype(int)
community_episodes.head()
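
As an aside, the same cleanup can be done in a single pass with the built-in `str.translate` instead of chained `replace` calls. This is a sketch of an alternative, not the notebook's method; `strip_vote_punctuation` is a hypothetical name:

```python
# Translation table that deletes ',', '(' and ')'
_VOTE_PUNCTUATION = str.maketrans('', '', ',()')


def strip_vote_punctuation(votes):
    """Hypothetical alternative to remove_str: '(3,178)' -> '3178'."""
    return votes.translate(_VOTE_PUNCTUATION)


print(strip_vote_punctuation('(3,178)'))
```

It could be passed to `.apply()` in the same way as `remove_str`.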

In [64]:
community_episodes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   episode_title  110 non-null    object
 1   episode_num    110 non-null    object
 2   airdate        110 non-null    object
 3   description    110 non-null    object
 4   season         110 non-null    object
 5   rating         110 non-null    object
 6   total_votes    110 non-null    int32 
dtypes: int32(1), object(6)
memory usage: 5.7+ KB


### Making `rating` numeric instead of a string

In [26]:
community_episodes['rating'] = community_episodes.rating.astype(float)

### Converting the `airdate` from string to datetime

In [44]:
community_episodes['airdate'] = pd.to_datetime(community_episodes.airdate)
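
`pd.to_datetime` infers the date layout on its own here. To make the expected layout explicit, the sketch below parses a single value with the standard library. `parse_airdate` is a hypothetical helper; it assumes English abbreviated month names (the `%b` directive is locale-dependent) and drops the period after the month so that a month written without one, such as `May`, would also parse:

```python
from datetime import date, datetime


def parse_airdate(text: str) -> date:
    """Parse strings such as '17 Sep. 2009' into a date (hypothetical helper).

    Assumes a day / abbreviated month / year layout with English month
    names; the period after the month is removed before parsing.
    """
    cleaned = text.replace('.', '').strip()
    return datetime.strptime(cleaned, '%d %b %Y').date()


print(parse_airdate('17 Sep. 2009'))
```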

In [45]:
community_episodes.head()

| | episode_title | episode_num | airdate | description | season | rating | total_votes |
|---|---|---|---|---|---|---|---|
| 0 | Pilot | 1 | 2009-09-17 | An ex-lawyer is forced to return to community ... | 1 | 7.8 | 3178.0 |
| 1 | Spanish 101 | 2 | 2009-09-24 | Jeff takes steps to ensure that Brita will be ... | 1 | 7.9 | 2752.0 |
| 2 | Introduction to Film | 3 | 2009-10-01 | Brita comes between Abed and his father when s... | 1 | 8.3 | 2692.0 |
| 3 | Social Psychology | 4 | 2009-10-08 | Jeff and Shirley bond by making fun of Britta'... | 1 | 8.2 | 2468.0 |
| 4 | Advanced Criminal Law | 5 | 2009-10-15 | Señor Chang is on the hunt for a cheater and t... | 1 | 7.9 | 2368.0 |


In [46]:
community_episodes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   episode_title  110 non-null    object        
 1   episode_num    110 non-null    object        
 2   airdate        110 non-null    datetime64[ns]
 3   description    110 non-null    object        
 4   season         110 non-null    object        
 5   rating         110 non-null    float64       
 6   total_votes    110 non-null    float64       
dtypes: datetime64[ns](1), float64(2), object(4)
memory usage: 6.1+ KB


Now the data is ready for analysis and visualization!