# <center> Web Scraping

## <center>How to extract content and data from websites

<img src='spider_web.jpg'> 

# The easy way

# <center> BeautifulSoup

<img src='soup.png'>

Import BeautifulSoup4, the Requests library, and Pandas.

In [None]:
import bs4
import requests
import pandas as pd

In honor of Spooky Season, let's look at a wiki page for the movie <i>Scream</i>.

<img src='scream.jpg'>

The URL is https://horror.fandom.com/wiki/Scream_(film)

#### Making a request to the URL and viewing the HTML content

In [None]:
url = 'https://horror.fandom.com/wiki/Scream_(film)'
page = requests.get(url)
page.content

#### Parsing the HTML with BeautifulSoup

In [None]:
soup = bs4.BeautifulSoup(page.content, 'html.parser')

In [None]:
soup

#### Scraping a brief description of the movie

In [None]:
soup.find_all('p')

In [None]:
soup.find_all('p')[0].text

In [None]:
scream_description = soup.find_all('p')[0].text.strip()
scream_description
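Taking index `[0]` assumes the first `<p>` on the page holds the description, which may not be true on every page (some start with an empty paragraph). Here is a minimal sketch of a more defensive version. It runs on a made-up HTML snippet rather than the real page, and `first_nonempty_paragraph` is a hypothetical helper name.

```python
import bs4

# Made-up HTML standing in for a page whose first <p> is empty.
sample_html = """
<div>
  <p>  </p>
  <p>Scream is a 1996 slasher film.</p>
  <p>It was followed by several sequels.</p>
</div>
"""

def first_nonempty_paragraph(soup):
    """Return the stripped text of the first <p> that has any text, or None."""
    for paragraph in soup.find_all('p'):
        text = paragraph.get_text().strip()
        if text:
            return text
    return None

sample_soup = bs4.BeautifulSoup(sample_html, 'html.parser')
description = first_nonempty_paragraph(sample_soup)
print(description)
```

Returning `None` instead of raising an `IndexError` makes it easier to spot pages where no paragraph was found.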

#### Finding and scraping the director's name

In [None]:
# `string=` is the current name for the older, deprecated `text=` argument
directed_by_element = soup.find('div', string='Directed By')
directed_by_element

In [None]:
director_element = directed_by_element.find_next('div')
director_element.text

In [None]:
scream_director = director_element.text.strip()
scream_director
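The label-then-value pattern used for the director can be wrapped in a small function so it works for any label and does not crash when a label is missing. This is only a sketch. The infobox markup below is made up for illustration and the real page's structure may differ, and `infobox_value` is a hypothetical helper name.

```python
import bs4

# Made-up infobox markup; the real page's structure may differ.
sample_html = """
<aside>
  <div class="label">Directed By</div>
  <div class="value"> Wes Craven </div>
  <div class="label">Written By</div>
  <div class="value"> Kevin Williamson </div>
</aside>
"""

def infobox_value(soup, label):
    """Return the text of the <div> that follows the <div> whose text is `label`."""
    label_element = soup.find('div', string=label)
    if label_element is None:
        return None  # label not present on this page
    value_element = label_element.find_next('div')
    if value_element is None:
        return None  # label found, but nothing follows it
    return value_element.get_text().strip()

sample_soup = bs4.BeautifulSoup(sample_html, 'html.parser')
print(infobox_value(sample_soup, 'Directed By'))
print(infobox_value(sample_soup, 'Release Date'))
```

The second call prints `None` because the sample markup has no 'Release Date' label.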

# <center> Activity

Using BeautifulSoup, find and scrape the release date of Scream as well as the cast list.

## Scraping tables and links

#### Scraping the deaths table and storing it in a DataFrame

In [None]:
scream_deaths_table = soup.find('table', attrs={"class":'mw-collapsible'})
scream_deaths_table

In [None]:
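from io import StringIO  # newer pandas versions expect a file-like object, not a raw HTML string
deaths_df = pd.read_html(StringIO(str(scream_deaths_table)))[0]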

In [None]:
deaths_df

Setting the column names from the first row.

In [None]:
deaths_df.columns = deaths_df.iloc[0]
deaths_df = deaths_df.iloc[1:]

In [None]:
deaths_df
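Promoting the first row to a header is a common cleanup step, so it can help to see it in isolation. The sketch below uses made-up rows instead of the scraped table, and `promote_first_row` is a hypothetical helper name. It also resets the row index so the data starts at 0 again.

```python
import pandas as pd

# Made-up rows; the first row holds the intended column names.
raw_df = pd.DataFrame([
    ['Name', 'Cause of Death'],
    ['Victim A', 'Example cause 1'],
    ['Victim B', 'Example cause 2'],
])

def promote_first_row(df):
    """Use the first row as column names and drop it from the data."""
    promoted = df.iloc[1:].copy()
    # A plain list avoids carrying the old row label over as the columns' name.
    promoted.columns = df.iloc[0].tolist()
    return promoted.reset_index(drop=True)

clean_df = promote_first_row(raw_df)
print(clean_df)
```

Working on a copy leaves `raw_df` unchanged, which is handy when re-running notebook cells out of order.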

#### Scraping the entire series

The series page links to the individual pages for all the movies in the franchise: https://horror.fandom.com/wiki/Scream_(series)

In [None]:
url = 'https://horror.fandom.com/wiki/Scream_(series)'
page = requests.get(url)
soup = bs4.BeautifulSoup(page.content, 'html.parser')

In [None]:
scream_series_table = soup.find('table', attrs={"class":'article-table'})
scream_series_table

The base URL of the website is https://horror.fandom.com

In [None]:
base_url = 'https://horror.fandom.com'

In [None]:
rows = scream_series_table.find_all('tr')

In [None]:
scream_links = []
for row in rows:
    link = row.find('a', href=True)
    if link:
        scream_links.append(base_url + link['href'])
scream_links
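Concatenating `base_url + link['href']` works when every `href` is relative. If a table mixes relative and absolute links, `urllib.parse.urljoin` from the standard library handles both. The sketch below assumes a made-up table (the second URL in it is illustrative, not taken from the site) and a hypothetical helper, `collect_links`, which also skips duplicates.

```python
from urllib.parse import urljoin

import bs4

base_url = 'https://horror.fandom.com'

# Made-up table rows mixing relative, absolute, missing, and repeated links.
sample_html = """
<table class="article-table">
  <tr><th>Film</th></tr>
  <tr><td><a href="/wiki/Scream_(film)">Scream</a></td></tr>
  <tr><td><a href="https://horror.fandom.com/wiki/Scream_2">Scream 2</a></td></tr>
  <tr><td>No link here</td></tr>
  <tr><td><a href="/wiki/Scream_(film)">Scream again</a></td></tr>
</table>
"""

def collect_links(table, base):
    """Return the first link of each row as an absolute URL, without duplicates."""
    links = []
    for row in table.find_all('tr'):
        link = row.find('a', href=True)
        if link is None:
            continue  # header rows and plain-text rows have no link
        absolute = urljoin(base, link['href'])
        if absolute not in links:
            links.append(absolute)
    return links

sample_table = bs4.BeautifulSoup(sample_html, 'html.parser').find('table')
links = collect_links(sample_table, base_url)
print(links)
```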

In [None]:
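descriptions = []
for movie_url in scream_links:
    # Use fresh names so the series-page `url`, `page`, and `soup` are not overwritten
    movie_page = requests.get(movie_url)
    movie_soup = bs4.BeautifulSoup(movie_page.content, 'html.parser')
    description = movie_soup.find_all('p')[0].text.strip()
    descriptions.append(description)
descriptions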
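When a loop requests many pages, it is polite to pause between requests, and useful to keep going if one page fails. The sketch below shows one possible shape for that. To keep it runnable offline, the download step is passed in as a `fetch` callable and replaced here by a fake that serves made-up pages. `scrape_descriptions`, `first_paragraph`, `fake_fetch`, and the example.com URLs are all hypothetical.

```python
import time

import bs4

def first_paragraph(html):
    """Return the stripped text of the first <p> in `html`, or None."""
    paragraph = bs4.BeautifulSoup(html, 'html.parser').find('p')
    return paragraph.get_text().strip() if paragraph else None

def scrape_descriptions(urls, fetch, delay_seconds=1.0):
    """Map each URL to its first paragraph, or None if fetching it fails.

    `fetch` is any callable that takes a URL and returns HTML text.
    """
    descriptions_by_url = {}
    for index, page_url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        try:
            html = fetch(page_url)
        except Exception:
            descriptions_by_url[page_url] = None  # record the failure and move on
            continue
        descriptions_by_url[page_url] = first_paragraph(html)
    return descriptions_by_url

# Stand-in for real HTTP: made-up pages keyed by made-up URLs.
fake_pages = {
    'https://example.com/a': '<p> First film. </p>',
    'https://example.com/b': '<p>Second film.</p>',
}

def fake_fetch(page_url):
    return fake_pages[page_url]  # raises KeyError for unknown URLs

result = scrape_descriptions(
    ['https://example.com/a', 'https://example.com/missing', 'https://example.com/b'],
    fake_fetch,
    delay_seconds=0,
)
print(result)
```

With Requests, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`, and `delay_seconds` left at a value that keeps the request rate low.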

# <center> Activity

<img src='michael_myers.jpg'>

Using https://horror.fandom.com/wiki/Halloween_(franchise) and web scraping, compile a DataFrame consisting of the following data on all the movies in the <i>Halloween</i> franchise:
 - The title
 - The length
 - The revenue
 - The total number of deaths

<b>Note:</b> Don't include the most recent film, <i>Halloween Kills</i>, as there won't be full info on it.

<b>Note:</b> If there is more than one release date, just get the earliest one.

<b>Hint:</b> Class attribute names may not be the same for the Halloween franchise table vs. the Scream one. Use 'Inspect Element' to double-check.
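As a complement to 'Inspect Element', the class names of every table on a page can also be listed from the parsed soup. The sketch below runs on made-up markup, so the class names shown are illustrative and not necessarily those used on the Halloween page. `table_classes` is a hypothetical helper name.

```python
import bs4

# Made-up markup with three tables; real class names may differ.
sample_html = """
<table class="wikitable sortable"><tr><td>A</td></tr></table>
<table class="article-table"><tr><td>B</td></tr></table>
<table><tr><td>C</td></tr></table>
"""

def table_classes(soup):
    """Return the class list of every <table>, in document order."""
    # BeautifulSoup returns `class` as a list because it can hold several values.
    return [table.get('class', []) for table in soup.find_all('table')]

sample_soup = bs4.BeautifulSoup(sample_html, 'html.parser')
for position, classes in enumerate(table_classes(sample_soup)):
    print(position, classes)
```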