# Web Scraping IMDb in Python


**Web Scraping intro:**

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

# Let's do it
---
*This activity walks through a general workflow for using web scraping to extract movie information from IMDb.*

*See [Overview of Colaboratory Features](https://colab.research.google.com/notebooks/basic_features_overview.ipynb) for an overview of using Colaboratory.*

## Set up the Python modules

In [0]:
from IPython.display import clear_output
from warnings import warn  # used by the scraper loop to flag bad responses
from time import time
from time import sleep
import pandas as pd
from requests import get
from bs4 import BeautifulSoup
from random import randint

In [0]:
pages = [str(i) for i in range(1,5)]
years_url = [str(i) for i in range(2000,2018)]
headers = {"Accept-Language": "en-US, en;q=0.5"}
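
The two lists above drive every request the scraper makes: each year is paired with each page number, and both are spliced into the search URL. A minimal sketch of that pairing, without touching the network, might look like the following. The helper name `build_url` and the use of `urlencode` are assumptions made for illustration; the scraper in this notebook builds the same string by concatenation.

```python
from urllib.parse import urlencode

# Same base URL as the scraper in this notebook
BASE_URL = 'http://www.imdb.com/search/title'

def build_url(year, page):
    """Hypothetical helper: return the search URL for one year and one results page."""
    query = {'release_date': year, 'sort': 'num_votes,desc', 'page': page}
    # safe=',' leaves the comma in 'num_votes,desc' unescaped
    return BASE_URL + '?' + urlencode(query, safe=',')

pages = [str(i) for i in range(1, 5)]
years_url = [str(i) for i in range(2000, 2018)]

# One URL per (year, page) pair: 18 years x 4 pages
urls = [build_url(year, page) for year in years_url for page in pages]
print(len(urls))
print(urls[0])
```

Listing the URLs first is a cheap way to confirm how many requests the loop will send (72 here) before any of them are made.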

### Declaring the lists to store the data in

In [0]:
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

## The Scraper
Okay club, this next part gets hairy... but don't worry, we will run through it step by step.

In [0]:
# Preparing the monitoring of the loop
start_time = time()
requests = 0

# For every year in the interval 2000-2017
for year_url in years_url:

    # For every page in the interval 1-4
    for page in pages:

        # Make a get request
        response = get('http://www.imdb.com/search/title?release_date=' + year_url +
        '&sort=num_votes,desc&page=' + page, headers = headers)

        # Pause the loop
        sleep(randint(8,15))

        # Monitor the requests
        requests += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
        clear_output(wait = True)

        # Throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(requests, response.status_code))

        # Stop requesting pages for this year if the number of requests is
        # greater than expected (18 years x 4 pages = 72)
        if requests > 72:
            warn('Number of requests was greater than expected.')
            break

        # Parse the content of the request with BeautifulSoup
        page_html = BeautifulSoup(response.text, 'html.parser')

        # Select all the 50 movie containers from a single page
        mv_containers = page_html.find_all('div', class_ = 'lister-item mode-advanced')

        # For every movie of these 50
        for container in mv_containers:
            # If the movie has a Metascore, then:
            if container.find('div', class_ = 'ratings-metascore') is not None:

                # Scrape the name
                name = container.h3.a.text
                names.append(name)

                # Scrape the year
                year = container.h3.find('span', class_ = 'lister-item-year').text
                years.append(year)

                # Scrape the IMDB rating
                imdb = float(container.strong.text)
                imdb_ratings.append(imdb)

                # Scrape the Metascore
                m_score = container.find('span', class_ = 'metascore').text
                metascores.append(int(m_score))

                # Scrape the number of votes
                vote = container.find('span', attrs = {'name':'nv'})['data-value']
                votes.append(int(vote))
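
To see what the inner loop does without waiting on live requests, the same selectors can be tried on a small hand-written HTML string. The sketch below is only that: `SAMPLE_HTML` is made-up markup shaped to match the selectors used above (it is not real IMDb output), and `parse_containers` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up markup (an assumption), shaped to match the selectors in the scraper
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header">
    <a href="#">Example Movie</a>
    <span class="lister-item-year">(2000)</span>
  </h3>
  <strong>8.5</strong>
  <div class="ratings-metascore"><span class="metascore favorable">67 </span></div>
  <span name="nv" data-value="1234567">1,234,567</span>
</div>
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header">
    <a href="#">No Metascore Movie</a>
    <span class="lister-item-year">(2001)</span>
  </h3>
  <strong>6.1</strong>
  <span name="nv" data-value="42">42</span>
</div>
"""

def parse_containers(html):
    """Hypothetical helper: extract one record per movie that has a Metascore."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for container in soup.find_all('div', class_='lister-item mode-advanced'):
        # Skip movies without a Metascore, as the scraper does
        if container.find('div', class_='ratings-metascore') is None:
            continue
        records.append({
            'movie': container.h3.a.text,
            'year': container.h3.find('span', class_='lister-item-year').text,
            'imdb': float(container.strong.text),
            'metascore': int(container.find('span', class_='metascore').text),
            'votes': int(container.find('span', attrs={'name': 'nv'})['data-value']),
        })
    return records

records = parse_containers(SAMPLE_HTML)
print(records)
```

Only the first sample movie ends up in `records`, because the second has no `ratings-metascore` block. If the real page layout differs from this sample, the selectors will return `None` and need adjusting.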

### Sanity Check
Make sure it worked the way we intended... *you should see 4 pages*

In [0]:
print(pages)

### Import specialized Python libraries for working with data
We will import two popular libraries for working with and graphing data in Python:

* **Pandas** - a data analysis library ([Pandas user guide](http://pandas.pydata.org/pandas-docs/stable/user_guide/index.html))
* **Matplotlib** - a Python 2D plotting library ([Matplotlib user guide](https://matplotlib.org/users/index.html))

In [0]:
import pandas as pd
import matplotlib.pyplot as plt

### Reformatting data for graphing
With a DataFrame we have access to functions to modify our dataset. We will remove unnecessary data fields (columns), edit existing fields, and create new data fields from existing fields.

In [0]:
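# Build the DataFrame from the lists filled in by the scraper
movie_ratings = pd.DataFrame({'movie': names,
                              'year': years,
                              'imdb': imdb_ratings,
                              'metascore': metascores,
                              'votes': votes
                              })
# info() prints its summary itself, so it does not need print()
movie_ratings.info()
movie_ratings.head(10)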

In [0]:
movie_ratings = movie_ratings[['movie', 'year', 'imdb', 'metascore', 'votes']]
movie_ratings.head()
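
Editing a field and creating a new one can be sketched on a tiny stand-in table. The rows below are made up, and the assumption that scraped years arrive as strings such as `(2000)` or `(I) (2001)` should be checked against your own `movie_ratings['year']` values. The column name `n_imdb` is also just an illustrative choice.

```python
import pandas as pd

# Made-up rows standing in for the scraped data
sample = pd.DataFrame({
    'movie': ['Example Movie', 'Another Movie'],
    'year': ['(2000)', '(I) (2001)'],
    'imdb': [8.5, 7.2],
    'metascore': [67, 80],
    'votes': [1234567, 7654],
})

# Edit an existing field: pull the four-digit year out of the string and store it as an integer
sample['year'] = sample['year'].str.extract(r'(\d{4})', expand=False).astype(int)

# Create a new field: rescale the 0-10 IMDB rating to the 0-100 Metascore range
sample['n_imdb'] = sample['imdb'] * 10

print(sample.dtypes)
print(sample)
```

Putting both ratings on the same 0-100 scale is what makes them directly comparable on a shared axis.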

In [0]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize = (16,4))
ax1, ax2, ax3 = fig.axes
ax1.hist(movie_ratings['imdb'], bins = 10, range = (0,10)) # bin range = 1
ax1.set_title('IMDB rating')
ax2.hist(movie_ratings['metascore'], bins = 10, range = (0,100)) # bin range = 10
ax2.set_title('Metascore')
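# Labels are required for the legend to have anything to show
ax3.hist(movie_ratings['imdb']*10, bins = 10, range = (0,100), histtype = 'step', label = 'IMDB rating x 10')
ax3.hist(movie_ratings['metascore'], bins = 10, range = (0,100), histtype = 'step', label = 'Metascore')
ax3.legend(loc = 'upper left')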
ax3.set_title('The Two Normalized Distributions')
for ax in fig.axes:
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
plt.show()

In [0]:
# Print the titles found on the last page that was scraped
for container in mv_containers:
    print(container.h3.a.text)