# Web Scraping

### Goal: 
Given a page from the IMDB website, compile a DataFrame of movie information (Title, Runtime, Genre, etc.)

### Website URL:
https://www.imdb.com/search/title/?release_date=2017&sort=num_votes,desc&page=1


## Import Statements

In [51]:
import requests # package for making the GET request and receiving the HTML
from bs4 import BeautifulSoup # package for parsing the HTML
import pandas as pd

## Getting URL

In [52]:
page = requests.get("https://www.imdb.com/search/title/?release_date=2017&sort=num_votes,desc&page=1")
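
A bare `requests.get` call succeeds even when the server answers with an error page, so it can help to check the response before parsing it. The helper below is a minimal sketch, not part of the original notebook: the name `fetch_page`, the `http` parameter, and the 10-second timeout are all assumptions chosen for illustration.

```python
import requests


def fetch_page(url, http=requests, timeout=10):
    """Fetch `url` and return the response, raising on HTTP error statuses.

    `http` is a hypothetical hook: anything with a requests-style `get`
    method (the `requests` module itself, or a `requests.Session`).
    """
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response


# Usage in this notebook would look like the following (not run here):
# page = fetch_page("https://www.imdb.com/search/title/?release_date=2017&sort=num_votes,desc&page=1")
```

Passing a `requests.Session` as `http` would reuse one connection across several result pages.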

## Generate HTML
- BeautifulSoup is a Python library for pulling data out of HTML and XML files
    - To install: pip install beautifulsoup4
- BeautifulSoup converts all incoming documents to Unicode and all outgoing documents to UTF-8
- If the original document doesn't specify its encoding and BeautifulSoup can't detect one, you have to specify the original encoding yourself with the `from_encoding` argument
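
The encoding behaviour described above can be sketched on a small in-memory document. The byte string below is made up for illustration and is not taken from the IMDB page; `from_encoding` is the constructor argument BeautifulSoup provides for naming the original encoding.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Latin-1 bytes with no encoding declared in the markup
raw = "<p>caf\u00e9</p>".encode("latin-1")

# Tell BeautifulSoup which encoding the bytes are in
doc = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = doc.p.get_text()  # incoming bytes have been decoded to a Unicode string
utf8_bytes = doc.encode()  # outgoing bytes are UTF-8 unless another encoding is requested
```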

In [53]:
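soup = BeautifulSoup(page.content, "html.parser") # name the parser explicitly so the result does not depend on which parsers are installed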

In [54]:
main = soup.find(id="main") # finds the section with the id "main"
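
`find` returns `None` when nothing matches, so a later call such as `main.prettify()` would fail with an `AttributeError` if the page had no element with that id. The sketch below shows one way to fail earlier with a clearer message; the sample markup and the `require` helper are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page
sample = BeautifulSoup('<div id="main"><p>hello</p></div>', "html.parser")

found = sample.find(id="main")       # a Tag object
missing = sample.find(id="sidebar")  # None, because no element has this id


def require(tag, description):
    """Return `tag`, or raise a descriptive error if the lookup found nothing."""
    if tag is None:
        raise ValueError(f"could not find {description}")
    return tag


main_section = require(found, 'element with id "main"')
```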

In [55]:
print(main.prettify())

<div id="main">
 <div class="article">
  <h1 class="header">
   Released between 2017-01-01 and 2017-12-31
(Sorted by Number of Votes Descending)
  </h1>
  <div class="nav">
   <br class="clear"/>
   <div class="display-mode float-right">
    View Mode:
    <a class="compact" href="/search/title/?release_date=2017-01-01,2017-12-31&amp;sort=num_votes,desc&amp;view=simple">
     Compact
    </a>
    <span class="ghost">
     |
    </span>
    <a class="detailed" href="/search/title/?release_date=2017-01-01,2017-12-31&amp;sort=num_votes,desc&amp;view=advanced">
     <strong>
      Detailed
     </strong>
    </a>
   </div>
   <div class="desc">
    <span>
     1-50 of 358,482 titles.
    </span>
    <span class="ghost">
     |
    </span>
    <a class="lister-page-next next-page" href="/search/title/?release_date=2017-01-01,2017-12-31&amp;sort=num_votes,desc&amp;start=51">
     Next »
    </a>
   </div>
  </div>
  <br class="clear"/>
  <div class="sorting">
   Sort by:
   <a href="/search

## Extracting Information
- Besides its title, each movie has 3 features of interest: the **Genre**, **Certificate**, and **Runtime**
- The following cell iterates through the list of movie headers and extracts the title and year of each movie

In [56]:
import re

body = main.find(class_='lister-list')
title = body.find_all(class_='lister-item-header')

titles = []
years = []
indexCount = 0 # counts the regex matches seen so far (three per movie: index, title, year)

# `title` holds one header tag per movie, so this loop visits the movies one at a time
for tag in title:
    # Each header has three child tags: the list index, the title, and the year
    for item in tag:
        # This gets rid of everything except for the text
        result = re.search(">(.+?)<", str(item))
        
        if result:
            found = result.group(1)
            if indexCount % 3 == 1:
                titles.append(found)
            if indexCount % 3 == 2:
                years.append(found.strip())
            indexCount += 1
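
The DataFrame in the next section reads from `genre`, `certificate`, and `runtime` lists, which the cell above does not build. The following is a minimal sketch of one way to collect them, not the notebook's original method. It assumes each movie sits in an element with the class `lister-item` (an assumption; only `lister-list` and `lister-item-header` appear in the code above), and it runs on made-up sample markup. Unlike the output shown later, it stores plain strings rather than one-element lists, and `None` where a field is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on entries of the results list
sample_html = """
<div class="lister-list">
  <div class="lister-item">
    <h3 class="lister-item-header"><a href="#">Logan</a></h3>
    <p>
      <span class="certificate">R</span>
      <span class="runtime">137 min</span>
      <span class="genre">
Action, Drama, Sci-Fi            </span>
    </p>
  </div>
  <div class="lister-item">
    <h3 class="lister-item-header"><a href="#">Untitled Example</a></h3>
    <p>
      <span class="runtime">90 min</span>
      <span class="genre">
Drama            </span>
    </p>
  </div>
</div>
"""


def text_or_none(parent, class_name):
    """Return the stripped text of the first descendant with this class, or None."""
    tag = parent.find(class_=class_name)
    return tag.get_text(strip=True) if tag is not None else None


def extract_details(container, item_class="lister-item"):
    """Collect genre, certificate, and runtime for every movie in `container`."""
    details = {"genre": [], "certificate": [], "runtime": []}
    for item in container.find_all(class_=item_class):
        for field in details:
            details[field].append(text_or_none(item, field))
    return details


sample_body = BeautifulSoup(sample_html, "html.parser").find(class_="lister-list")
details = extract_details(sample_body)
```

Appending `None` for a missing field keeps the three lists the same length as the list of movies, which `pd.DataFrame` requires when it is built from a dict of lists.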

## DataFrame Output
- Brackets should be removed from the last 3 columns in the DataFrame (will revisit if time permits)

In [71]:
imdb = pd.DataFrame({
        "title": titles, 
        "genre": genre, # The class "genre" is defined in the HTML
        "certificate": certificate, # The class "certificate" is defined in the HTML
        "runtime": runtime, # The class "runtime" is defined in the HTML
    })

imdb

| | title | genre | certificate | runtime |
|---|---|---|---|---|
| 0 | Logan | [\nAction, Drama, Sci-Fi ] | [R] | [137 min] |
| 1 | Thor: Ragnarok | [\nAction, Adventure, Comedy ] | [PG-13] | [130 min] |
| 2 | Guardians of the Galaxy Vol. 2 | [\nAction, Adventure, Comedy ] | [PG-13] | [136 min] |
| 3 | Star Wars: Episode VIII - The Last Jedi | [\nAction, Adventure, Fantasy ] | [PG-13] | [152 min] |
| 4 | Wonder Woman | [\nAction, Adventure, Fantasy ] | [PG-13] | [141 min] |
| 5 | Dunkirk | [\nAction, Drama, History ] | [PG-13] | [106 min] |
| 6 | Spider-Man: Homecoming | [\nAction, Adventure, Sci-Fi ] | [PG-13] | [133 min] |
| 7 | Get Out | [\nHorror, Mystery, Thriller ] | [R] | [104 min] |
| 8 | It | [\nHorror ] | [R] | [135 min] |
| 9 | Blade Runner 2049 | [\nAction, Drama, Mystery ] | [R] | [164 min] |
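
The brackets in the last three columns suggest that each cell holds a one-element list rather than a plain string. Under that assumption, the cleanup could look like the sketch below. The `unwrap` helper and the `runtime_min` column are hypothetical names, and the sample frame copies two rows from the output above.

```python
import pandas as pd


def unwrap(cell):
    """Turn a list cell such as ['\\nHorror '] into the clean string 'Horror'."""
    if isinstance(cell, (list, tuple)):
        return " ".join(str(part).strip() for part in cell).strip()
    # Fall back to trimming literal brackets if the cell is already a string
    return str(cell).strip().strip("[]").strip()


# Two rows from the output above, assuming each cell is a one-element list
sample = pd.DataFrame({
    "title": ["Logan", "It"],
    "genre": [["\nAction, Drama, Sci-Fi "], ["\nHorror "]],
    "certificate": [["R"], ["R"]],
    "runtime": [["137 min"], ["135 min"]],
})

for column in ["genre", "certificate", "runtime"]:
    sample[column] = sample[column].map(unwrap)

# Optional: keep the runtime as an integer number of minutes
sample["runtime_min"] = sample["runtime"].str.extract(r"(\d+)", expand=False).astype(int)
```

The same loop applied to `imdb` would clean the real columns, provided every cell is present; missing values would need handling before the integer conversion.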
