## Basic Imports To Get Started:

In [1]:
import pandas as pd
import numpy as np

import requests

**Now to request the IMDB page of comedies sorted by US box office gross (the `asc` in the URL means ascending order):**

In [15]:
imdb_response = requests.get('https://www.imdb.com/search/title/?genres=comedy&sort=boxoffice_gross_us,asc&explore=title_type,genres&view=advanced')
imdb_response

<Response [200]>

In [16]:
imdb_response.ok

True

In [17]:
imdb_response.status_code

200

**^^A status code of 200 indicates that the request to the IMDB server was successful.**
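
For context on what other status codes would mean, here is a minimal sketch that uses only the standard library. The helper name `describe_status` is my own (it is not part of `requests`), and the grouping simply follows the usual HTTP ranges. As far as `requests` goes, `.ok` is `True` for any status code below 400.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, human-readable summary of an HTTP status code.

    Hypothetical helper for illustration only.
    """
    phrase = HTTPStatus(code).phrase  # standard reason phrase, e.g. 200 -> "OK"
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational"
    return f"{code} {phrase} ({category})"


for code in (200, 404, 503):
    print(describe_status(code))
```

Applied to the cell above, `describe_status(imdb_response.status_code)` would report the 200 as a success.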

In [18]:
imdb_response.text



**Now to get a better understanding of JSON APIs and how we interact with them:**

In [13]:
url = 'https://swapi.dev/api/people/5'
swapi_response = requests.get(url)
print(swapi_response.text)

{"name":"Leia Organa","height":"150","mass":"49","hair_color":"brown","skin_color":"light","eye_color":"brown","birth_year":"19BBY","gender":"female","homeworld":"http://swapi.dev/api/planets/2/","films":["http://swapi.dev/api/films/1/","http://swapi.dev/api/films/2/","http://swapi.dev/api/films/3/","http://swapi.dev/api/films/6/"],"species":[],"vehicles":["http://swapi.dev/api/vehicles/30/"],"starships":[],"created":"2014-12-10T15:20:09.791000Z","edited":"2014-12-20T21:17:50.315000Z","url":"http://swapi.dev/api/people/5/"}


**^^The response I got back from 'swapi.dev' is a JSON object.  Because of this, I can use the `.json()` method to get a data structure I can work with.**

In [14]:
data = swapi_response.json()
print(type(data))
data

<class 'dict'>


{'name': 'Leia Organa',
 'height': '150',
 'mass': '49',
 'hair_color': 'brown',
 'skin_color': 'light',
 'eye_color': 'brown',
 'birth_year': '19BBY',
 'gender': 'female',
 'homeworld': 'http://swapi.dev/api/planets/2/',
 'films': ['http://swapi.dev/api/films/1/',
  'http://swapi.dev/api/films/2/',
  'http://swapi.dev/api/films/3/',
  'http://swapi.dev/api/films/6/'],
 'species': [],
 'vehicles': ['http://swapi.dev/api/vehicles/30/'],
 'starships': [],
 'created': '2014-12-10T15:20:09.791000Z',
 'edited': '2014-12-20T21:17:50.315000Z',
 'url': 'http://swapi.dev/api/people/5/'}
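
Once the JSON is parsed, it behaves like any other Python dictionary. The sketch below uses a trimmed copy of the payload above (only a few keys, and only two of the film URLs) and parses it with the standard `json` module, which gives the same kind of structure that `.json()` returned. The variable names are mine.

```python
import json

# Trimmed copy of the SWAPI payload shown above (illustration only).
raw = (
    '{"name": "Leia Organa", "height": "150", '
    '"films": ["http://swapi.dev/api/films/1/", "http://swapi.dev/api/films/2/"], '
    '"species": []}'
)

person = json.loads(raw)  # str -> dict

name = person["name"]                  # plain key lookup
height_cm = int(person["height"])      # numbers arrive as strings in this API
film_count = len(person["films"])      # nested JSON arrays become Python lists
# An empty list is falsy, so guard before indexing into it.
first_species = person["species"][0] if person["species"] else None

print(name, height_cm, film_count, first_species)
```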

**The parsed JSON object comes back as a Python dictionary.  Now let's try the same approach on the IMDB page.**

In [22]:
imdb_response = requests.get('https://www.imdb.com/search/title/?genres=comedy&sort=boxoffice_gross_us,asc&explore=title_type,genres&view=advanced')

data = imdb_response.json()
data.keys()

JSONDecodeError: Expecting value: line 4 column 1 (char 3)

**After doing some homework on this JSONDecodeError, my working theory was the following:**

- 'requests' uses simplejson to parse the response; and

- because the IMDB page has complex JSON, I need to change this

In [25]:
try:
    from simplejson.errors import JSONDecodeError
except ImportError:
    from json.decoder import JSONDecodeError

imdb_response = requests.get('https://www.imdb.com/search/title/?genres=comedy&sort=boxoffice_gross_us,asc&explore=title_type,genres&view=advanced')
try:
    print(imdb_response.json())
except JSONDecodeError:
    print("Is NOT JSON")

Is NOT JSON


**Okay, so I was wrong: the response from IMDB is NOT in JSON format.  Now what?**
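
One way to tell the two cases apart without a traceback is to attempt the parse and fall back gracefully. This is a minimal sketch using the standard `json` module on two made-up bodies (one HTML, one JSON). The helper name `parse_json_or_none` is hypothetical. Checking the response's `Content-Type` header before parsing would be another reasonable approach.

```python
import json


def parse_json_or_none(text):
    """Return the parsed JSON value, or None if the text is not valid JSON.

    Hypothetical helper for illustration only.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Made-up bodies standing in for the two kinds of responses seen so far.
html_body = "\n\n\n<!DOCTYPE html>\n<html><head></head></html>"
json_body = '{"name": "Leia Organa"}'

for body in (html_body, json_body):
    parsed = parse_json_or_none(body)
    print("JSON" if parsed is not None else "Is NOT JSON")
```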

In [28]:
from requests import get # stand in for my browser
from bs4 import BeautifulSoup
import os

In [30]:
url = 'https://www.imdb.com/search/title/?genres=comedy&sort=boxoffice_gross_us,asc&explore=title_type,genres&view=advanced'

headers = {"User-Agent": "Codeup Data Science, Nick Joseph"} # identifies who is making the request (HTTP header names use hyphens, not underscores)

response = get(url, headers=headers) # sends my own User-Agent in place of the default one that requests would send

**^^Now that that's done, I need to take a look at what type of data I'm dealing with:**

In [32]:
# The first 200 characters of the response body
print(response.text[:200])




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatibl


**^^It's an HTML document, which describes the structure and content of the webpage.  Now I know what to do.**

- I will use command + option + u to view the page source in my Chrome browser

- From there, I will use BeautifulSoup to search the HTML for the tags that hold the content I want.  There are two element attributes I am looking for:

    1.) 'class' - the class(es) that are applied to the element; and 
    
    2.) 'id' - the unique identifier for the element on the page
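
To make the 'class' and 'id' lookups concrete, here is a small sketch on a hand-written HTML snippet. The markup, the `lister-item-header` class, and the `main` id are made up for illustration and are not taken from the real IMDB page.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration; not the real IMDB markup.
html = """
<div id="main">
  <h3 class="lister-item-header"><a href="/title/tt0000001/">First Movie</a></h3>
  <p class="text-muted">Comedy</p>
  <h3 class="lister-item-header"><a href="/title/tt0000002/">Second Movie</a></h3>
  <p class="text-muted">Comedy, Drama</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# An id is unique on the page, so find() (first match only) is enough.
main = soup.find(id="main")

# A class can be shared by many elements, so find_all() returns a list.
# The keyword is class_ because 'class' is a reserved word in Python.
headers = soup.find_all("h3", class_="lister-item-header")
titles = [h.get_text(strip=True) for h in headers]

print(main.name)
print(titles)
```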

In [33]:
# creating a 'soup' variable to hold the parsed HTML of the response

soup = BeautifulSoup(response.content, 'html.parser')

In [38]:
print(soup.title.string) # gets the page's title - this is the '<title>' element 
# and the text that appears in the browser tab
print("\n")
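print(soup.prettify) # NOTE: the call parentheses are missing, so this prints the bound method's repr (which embeds the raw HTML); soup.prettify() returns the indented HTML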

Comedy
(Sorted by US Box Office Ascending) - IMDb


<bound method Tag.prettify of 
<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Comedy
(Sorted by US Box Office Ascending) - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'funct
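
The `<bound method Tag.prettify of ...` at the top of that output appears because `prettify` was referenced rather than called. A minimal sketch of the difference, on a tiny hand-written document instead of the IMDB page:

```python
from bs4 import BeautifulSoup

# Tiny hand-written document for illustration.
soup = BeautifulSoup("<html><head><title>Comedy</title></head></html>", "html.parser")

# Without parentheses: a reference to the method itself, not its result.
method_repr = repr(soup.prettify)

# With parentheses: a string with one tag per line, indented by nesting depth.
pretty = soup.prettify()

print(method_repr[:30])
print(pretty)
```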

In [39]:
print(soup.find_all("a")) # finds all the anchor tags

[<a href="/?ref_=nv_home"><svg class="ipc-logo WNY8DBPCS1ZbiSd7NoqdP" height="56" version="1.1" viewbox="0 0 64 32" width="98" xmlns="http://www.w3.org/2000/svg"><g fill="#F5C518"><rect height="100%" rx="4" width="100%" x="0" y="0"></rect></g><g fill="#000000" fill-rule="nonzero" transform="translate(8.000000, 7.000000)"><polygon points="0 18 5 18 5 0 0 0"></polygon><path d="M15.6725178,0 L14.5534833,8.40846934 L13.8582008,3.83502426 C13.65661,2.37009263 13.4632474,1.09175121 13.278113,0 L7,0 L7,18 L11.2416347,18 L11.2580911,6.11380679 L13.0436094,18 L16.0633571,18 L17.7583653,5.8517865 L17.7707076,18 L22,18 L22,0 L15.6725178,0 Z"></path><path d="M24,18 L24,0 L31.8045586,0 C33.5693522,0 35,1.41994415 35,3.17660424 L35,14.8233958 C35,16.5777858 33.5716617,18 31.8045586,18 L24,18 Z M29.8322479,3.2395236 C29.6339219,3.13233348 29.2545158,3.08072342 28.7026524,3.08072342 L28.7026524,14.8914865 C29.4312846,14.8914865 29.8796736,14.7604764 30.0478195,14.4865461 C30.2159654,14.2165858 30.3021

**^^An 'anchor' (`<a>`) tag defines a hyperlink.  Its 'href' attribute holds the link's destination, and the content between the opening and closing tags is what the user clicks on.**
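
Since the raw list of anchor tags is hard to read, a common next step is to pull out just the link text and the `href` of each one. This is a sketch on hand-written HTML (the `example.com` URL and the link texts are placeholders). Passing `href=True` to `find_all` skips anchors that have no `href` attribute.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration; the URLs are placeholders.
html = (
    '<p>See <a href="/?ref_=nv_home">Home</a> and '
    '<a href="https://example.com/top">Top Rated</a> '
    '<a name="no-href">plain</a>.</p>'
)

soup = BeautifulSoup(html, "html.parser")

# Each Tag supports dictionary-style access to its attributes.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]

for text, href in links:
    print(f"{text}: {href}")
```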

In [40]:
print(soup.find("h1")) # finds the first h1 (top-level heading) tag; find_all("h1") would return every match

<h1 class="header">Comedy
(Sorted by US Box Office Ascending) </h1>


**^^The h1 tag is the page's top-level heading.  Here, it's "Comedy (Sorted by US Box Office Ascending)"**

In [41]:
print(soup.get_text()) # gets all the text inside the soup with the tags stripped out (as the output shows, this includes the contents of <script> tags)








var IMDbTimer={starttime: new Date().getTime(),pt:'java'};

    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }

(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);
Comedy
(Sorted by US Box Office Ascending) - IMDb
(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);

    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }


    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }




    if (typeof uet == 'function') {
      uet("bb", "LoadIcons", {wb: 1});
    }

(function(t){ (t.events = t.events || {})["csm_head_pre_icon"] = new Date().getTime(); })(IMDbTimer);










(function(t){ (t.events = t.events || {})["csm_head_post_icon"] = new Date().getTime(); })(IMDbTimer);

    if (typeof uet == 'function') {
      uet("be", "LoadIcons", {wb: 1});
    }


    if (typeof uex == 'function') {
      ue
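
Most of that output is JavaScript, because `get_text()` also returns the contents of `<script>` tags. One possible way to get only the visible text is to remove those tags from the soup first. A minimal sketch on a hand-written page (the markup is made up, and `decompose()` modifies the soup in place):

```python
from bs4 import BeautifulSoup

# Hand-written page for illustration; not the real IMDB markup.
html = (
    "<html><head><title>Comedy - IMDb</title>"
    "<script>var t = 1;</script><style>h1 {color: red;}</style></head>"
    '<body><h1 class="header">Comedy</h1><script>track();</script></body></html>'
)

soup = BeautifulSoup(html, "html.parser")

# Remove every <script> and <style> element (and its contents) from the tree.
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

# strip=True trims whitespace and drops empty strings; separator joins the rest.
text = soup.get_text(separator="\n", strip=True)
print(text)
```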

In [48]:
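print(soup.select("p#text-muted")) # takes in a CSS selector as a string and returns all the matching elements.  SUPER USEFUL.
# 'p#text-muted' asks for a <p> whose id is 'text-muted'; nothing on the page matches, so the result is empty.  A class is written with a dot instead: 'p.text-muted'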

[]


**^^A 'CSS selector' is a pattern that picks out elements of the HTML (originally so they could be styled; here, so I can extract them).  There are 5 types of CSS selectors:**
    
    1.) Simple Selectors - select elements (parts) based on name, id, or class;
    
    2.) Combinator Selectors - select elements based on specific relationships between them; 
    
    3.) Pseudo-Class Selectors - select elements based on a certain state; 
    
    4.) Pseudo-Element Selectors - select AND STYLE a part of an element; and 
    
    5.) Attribute Selectors - select elements based on an attribute or attribute value
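
Here is a small sketch of how several of these selector types look when passed to `soup.select`. The HTML is hand-written for illustration (the ids, classes, and URLs are made up). Pseudo-element selectors are left out because they target parts of an element for styling, so they have no counterpart when extracting elements.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration; not the real IMDB markup.
html = """
<div id="results">
  <p class="text-muted">Comedy</p>
  <ul>
    <li><a href="/title/tt0000001/">First</a></li>
    <li><a href="https://example.com/ext">Second</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Simple selectors: '.' matches a class, '#' matches an id.
by_class = soup.select("p.text-muted")
by_id = soup.select("p#text-muted")  # no <p> has this id, so nothing matches

# Combinator selector: '>' means "direct child of".
child_links = soup.select("div#results > ul > li > a")

# Pseudo-class selector: elements that are the first child of their parent.
first_item = soup.select("li:first-child")

# Attribute selector: href values that start with '/title/'.
internal = soup.select('a[href^="/title/"]')

print(len(by_class), len(by_id), len(child_links))
print(first_item[0].get_text(), [a["href"] for a in internal])
```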