## Basic IMDb scraping

This is adapted from this well-made [tutorial/blog](https://www.dataquest.io/blog/web-scraping-beautifulsoup/).

A basic web scraper to get a list of top movies on IMDb, sorted by number of votes.

In [48]:
from requests import get 
import numpy as np 
import pandas as pd 
from bs4 import BeautifulSoup

#url = 'https://www.imdb.com/search/title/?title_type=feature,tv_movie,documentary&release_date=1800-01-01,&sort=num_votes,desc'
url = 'https://www.imdb.com/search/title/?title_type=feature,documentary&release_date=2000-01-01,&countries=in&languages=hi&sort=num_votes,desc'
response = get(url)

In [49]:
html_soup = BeautifulSoup(response.text, 'html.parser')

In [74]:
num_titles = int(''.join(html_soup.find_all('div', class_='desc')[0].find('span').text.split()[-2].split(',')))
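The one-liner above takes the second-to-last word of the result summary and strips its thousands separators. A minimal sketch of the same idea as a named helper is shown here. It assumes the summary text looks like `1-50 of 4,123 titles.`; both that sample string and the helper name `parse_title_count` are hypothetical and not part of the original notebook.

```python
import re


def parse_title_count(desc_text):
    """Return the total title count from a summary such as '1-50 of 4,123 titles.'.

    The expected text format is an assumption about the page, not a documented contract.
    """
    # Look for a number (optionally with thousands separators) right before "title(s)".
    match = re.search(r'([\d,]+)\s+titles?', desc_text)
    if match is None:
        raise ValueError('Could not find a title count in: {!r}'.format(desc_text))
    # Drop the commas before converting to an integer.
    return int(match.group(1).replace(',', ''))
```

With this helper, the cell above could be written as `num_titles = parse_title_count(html_soup.find('div', class_='desc').span.text)`, which fails with a readable error if the page layout changes.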

In [77]:
# Start offsets for result pages 2 to 9 (50 titles per page)
page_iter = np.arange(1, num_titles, 50)[1:9]
print(page_iter)

[ 51 101 151 201 251 301 351 401]


In [50]:
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
print(len(movie_containers))

7


In [51]:
first_movie = movie_containers[0]
name = first_movie.h3.a.text
year = first_movie.h3.find('span',class_='lister-item-year text-muted unbold').text.strip('()')
ratings = float(first_movie.strong.text)
try:
    metascore = float(first_movie.find('span', class_='metascore').text.strip())
except AttributeError:
    metascore = None
num_votes = float(''.join(first_movie.find('span', attrs = {'name':'nv'}).text.split(',')))
print(name, year, ratings, metascore, num_votes)

3 Idiots 2009 8.4 67.0 325503.0


## Do it for all the entries on the page

In [52]:
names, year, rating, metascore, num_votes = [], [], [], [], [] 

for container_entry in movie_containers:
    names.append(container_entry.h3.a.text)
    year.append(container_entry.h3.find('span',class_='lister-item-year text-muted unbold').text.strip('()|-'))
    rating.append(float(container_entry.strong.text))
    num_votes.append(float(''.join(container_entry.find('span', attrs = {'name':'nv'}).text.split(','))))
    
    try:
        metascore.append(float(container_entry.find('span', class_='metascore').text.strip()))
    except AttributeError:
        metascore.append(np.nan)


In [53]:
df_movies = pd.DataFrame({'name':names,'year':year,'rating':rating,'metascore':metascore,'num_votes':num_votes})
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   name       50 non-null     object 
 1   year       50 non-null     object 
 2   rating     50 non-null     float64
 3   metascore  13 non-null     float64
 4   num_votes  50 non-null     float64
dtypes: float64(3), object(2)
memory usage: 2.1+ KB
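The `year` column above is stored as `object` because the scraped values are strings. A minimal sketch of one way to turn them into integers follows. It assumes that some entries may carry extra markers around the year, for example `(I) (2016)`; that sample and the helper name `extract_year` are hypothetical.

```python
import re


def extract_year(raw):
    """Return the first four-digit year found in a raw string, or None if there is none."""
    match = re.search(r'(\d{4})', raw)
    return int(match.group(1)) if match else None
```

Applied to the frame, this could look like `df_movies['year'] = df_movies['year'].map(extract_year)`. Entries without a recognizable year become missing values instead of raising an error.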


In [33]:
df_movies.head()

                       name  year  rating  metascore  num_votes
0  The Shawshank Redemption  1994     9.3       80.0  2244831.0
1           The Dark Knight  2008     9.0       84.0  2213428.0
2                 Inception  2010     8.8       74.0  1967761.0
3                Fight Club  1999     8.8       66.0  1785572.0
4              Pulp Fiction  1994     8.9       94.0  1758111.0


To mimic human behavior, we'll vary the waiting time between requests using the `randint()` function from NumPy's `random` module. `np.random.randint(1, 4)` returns a random integer from 1 to 3 (the upper bound is exclusive), which is then passed to `time.sleep()`.

In [46]:
import time
from IPython.display import clear_output
requests = 0 
start_time = time.time()
for req_num in range(1,6):
    requests += 1.0
    time.sleep(np.random.randint(1,4))
    elapsed_time = time.time()-start_time
    print('Time taken for request {} : {:0.3f}'.format(req_num, elapsed_time))
    print('Request: {} | Frequency: {} requests / sec'.format(requests, requests/elapsed_time))
clear_output(wait = True)

Time taken for request 1 : 1.004
Request: 1.0 | Frequency: 0.9962350554789551 requests / sec
Time taken for request 2 : 4.006
Request: 2.0 | Frequency: 0.4992270516220966 requests / sec
Time taken for request 3 : 7.007
Request: 3.0 | Frequency: 0.42815030899650025 requests / sec
Time taken for request 4 : 8.012
Request: 4.0 | Frequency: 0.49926943689340264 requests / sec
Time taken for request 5 : 10.013
Request: 5.0 | Frequency: 0.4993336391024237 requests / sec
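The timing loop above only simulates the pauses. As a rough sketch, the same idea can be wrapped in a small generator that sleeps for a random interval before handing out each item. The name `throttled` and its parameters are hypothetical, and it uses the standard library's `random.uniform` rather than `np.random.randint`, so the delays are fractional.

```python
import random
import time


def throttled(items, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Yield each item after pausing for a random delay between min_delay and max_delay seconds.

    The sleep function is injectable so the pacing can be checked without actually waiting.
    """
    for item in items:
        sleep(random.uniform(min_delay, max_delay))
        yield item
```

A scraping loop could then read `for start in throttled(page_iter): ...`, with one request issued per iteration. Passing a different `sleep` callable is only meant for testing.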


In [47]:
pages = [str(i) for i in range(1,5)]
years_url = [str(i) for i in range(2000,2021)]

In [None]:
url_link = 'https://www.imdb.com/search/title/?title_type=feature,documentary&release_date=2000-01-01,&countries=in&languages=hi&sort=num_votes,desc'
response = get(url_link)

In [86]:
np.arange(1,300,50)[1:]

array([ 51, 101, 151, 201, 251])
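The offsets above can be turned into one URL per result page. The sketch below assumes that the search page accepts a `start` query parameter for the index of the first result; that parameter is an assumption about IMDb's URL scheme and does not appear in the original notebook, so it should be checked against the live site. The helper name `build_page_urls` is also hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.imdb.com/search/title/'

# Same filters as the search URL used earlier in this notebook.
SEARCH_PARAMS = {
    'title_type': 'feature,documentary',
    'release_date': '2000-01-01,',
    'countries': 'in',
    'languages': 'hi',
    'sort': 'num_votes,desc',
}


def build_page_urls(num_titles, page_size=50, max_pages=None):
    """Build one search URL per result page, starting at offsets 1, 51, 101, ...

    The 'start' parameter is an assumed pagination mechanism (see the note above).
    """
    urls = []
    for start in range(1, num_titles + 1, page_size):
        if max_pages is not None and len(urls) >= max_pages:
            break
        query = dict(SEARCH_PARAMS, start=start)
        # safe=',' keeps the commas in the filter values unescaped.
        urls.append(BASE_URL + '?' + urlencode(query, safe=','))
    return urls
```

For example, `build_page_urls(num_titles, max_pages=9)` would cover the first nine pages, which includes the offsets in `page_iter` plus the first page.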