<img src="files/title.png">

### Where to get the repo
`https://github.com/justinkreft/web_scraping_presentation.git`
### Note: 
* Following the README to build the environment will take a few minutes
* You do not need the environment to follow the presentation


In [2]:
# import required libraries
import logging
import requests
import re
import json
from scrapy.selector import Selector
import spacy
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.DEBUG, datefmt='%I:%M:%S')
logging.info("Preparing imports and log settings for presentation.")

01:00:39 INFO:Preparing imports and log settings for presentation.


<img src="files/intro.png">

<img src="files/1.png">

<img src="files/caveat.png">

### Ok Fantastic! We will be good citizens of the Internets.
### Now, gimmie all descriptions of all movies streaming from all the services!
<img src="files/all_the_things.gif">


<img src="files/2.png">

## Simplistic model of a web request
* A request is made by a client to a server,  
* is interpreted by the server,  
* which prepares -> delivers a complete static response  
* to the client for display  
<img src="files/xkcd-full.png">

In [3]:
import requests
# http://docs.python-requests.org/en/master/
response = requests.get('https://www.rottentomatoes.com/browse/dvd-streaming-all/')
print(response.status_code)
print(response.headers['content-type'])
print(response.encoding)
print(response.text[:1000] + " ...")

01:04:00 DEBUG:Starting new HTTPS connection (1): www.rottentomatoes.com:443
01:04:00 DEBUG:https://www.rottentomatoes.com:443 "GET /browse/dvd-streaming-all/ HTTP/1.1" 200 31092


200
text/html;charset=UTF-8
UTF-8
<!DOCTYPE html>
<html lang="en" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/" >
	<head prefix="og: http://ogp.me/ns# flixstertomatoes: http://ogp.me/ns/apps/flixstertomatoes#">
    <script src="//cdn.optimizely.com/js/594670329.js"></script>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width,initial-scale=1">

    <meta name="google-site-verification" content="VPPXtECgUUeuATBacnqnCm4ydGO99reF-xgNklSbNbc" />

    <meta name="msvalidate.01" content="034F16304017CA7DCF45D43850915323" />

    <link href="https://staticv2-4.rottentomatoes.com/static/images/iphone/apple-touch-icon.png" rel="apple-touch-icon" />
    <link href="https://staticv2-4.rottentomatoes.com/static/images/icons/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <link href="https://staticv2-4.rottentomatoes.com/static/styles/css/rt_main.css" rel="styleshee

### Dynamic Content
#### But it isn't the 2000s anymore. There is WAY more going on under the hood than just a call and response.

#### Examine Rotten Tomatoes in Chrome Inspector
https://www.rottentomatoes.com/browse/dvd-streaming-all/

Notes:
* Generating a complete HTML payload for every client on demand can strain a server at scale
* Modern servers tend to follow highly templatized patterns
* The contents of these templates are then populated by various network calls resulting in dynamic content that is
    * Either fed in bulk via Javascript on the page
    * or requested as necessary through XHR network calls
    
### So what you are more likely to encounter is
#### "Sure Dude. Here is a template and a bunch of instructions to request additional resources so you can populate the template yourself... because I'm too lazy"
<img src="files/xkcd-full (copy).png">


### If attempting to build a sustainable web scraper, we want to
* make as few calls as possible
* get as much of the data we are interested in as possible per call
* interact with content whose organizational structure is unlikely to change over time


## Parsing an HTML response
The most effective ways to parse content from a single static HTML response programmatically are:
* xpath queries
* regex pattern mining
* directly calling ajax/xhr network calls
* AVOIDING dynamic scraping like the plague
    * Calling the underlying network endpoint directly is ideal when one is available.

*Words on why dynamic crawlers are bad*

### Interactive demo
* Demonstrate xpath usage with xpathHelper tool in Chrome
    * `//a/@href`
    * `/html/body[@class='body  ']/div[@class='body_main container']/div[@id='main_container']/div[@id='main-row']/div/div[@id='content-column']/div[2]/div[@class='mb-movies']/div[@class='mb-movie'][1]/div[@class='poster_container']/a/@href`
    * `//div[@class='poster_container']/a/@href`
    * `//div[contains(@class, 'poster_container')]/a/@href`
* Look at script objects in the DOM for additional regex options
    * `jsonLdSchema">({.*})<`
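
The difference between the exact-match and `contains()` expressions above is easiest to see on a small fragment. This is a minimal sketch using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (no `contains()`), so the loose match is done with a small Python helper instead. The markup, the `lazy` class, and the movie paths are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for the real page.
SAMPLE = """
<html><body>
  <div class="poster_container"><a href="/m/first_movie">A</a></div>
  <div class="poster_container lazy"><a href="/m/second_movie">B</a></div>
  <div class="sidebar"><a href="/about">About</a></div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Exact attribute match: misses the div that carries a second class name.
exact = [a.get("href") for a in root.findall(".//div[@class='poster_container']/a")]


def has_class(element, name):
    """True if `name` is one of the whitespace-separated class tokens."""
    return name in (element.get("class") or "").split()


# Token-based match: tolerant of extra class names on the same element.
loose = [
    a.get("href")
    for div in root.iter("div")
    if has_class(div, "poster_container")
    for a in div.findall("a")
]

print(exact)
print(loose)
```

The exact match breaks as soon as a site adds one more class to the element, which is one reason long, fully qualified paths tend to be brittle.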

### Careful! 
*But what works in the client browser is not always what is delivered in a simple request that doesn't run the JavaScript*

*Plus we still have the problem of that pesky dynamic lazy-loaded data*

In [4]:
# Exploring the xpath options
# making a quick scrapy selector from response.text - Scrapy Response objects normally provide this for you
selector = Selector(text=response.text, type='html')
print(len(selector.xpath("//div[contains(@class, 'poster_container')]/a/@href")), " -- We expected 32 from xpath helper")
print(len(selector.xpath("//a/@href")), " -- We expected 228 from xpath helper")

0  -- We expected 32 from xpath helper
203  -- We expected 228 from xpath helper


In [5]:
# So what just happened???
print(selector.xpath("//div[@class='mb-movies list-view']").extract())
selector.xpath("//a/@href").extract()

[]


['http://www.facebook.com/rottentomatoes',
 'http://twitter.com/rottentomatoes',
 '/',
 '/lists/theater/',
 '/theaters/',
 '/lists/dvd/',
 '/lists/tv/',
 'https://editorial.rottentomatoes.com',
 '/about#whatisthetomatometer',
 '/critics/',
 '/',
 '/',
 '/',
 '/lists/theater/',
 '/lists/tv/',
 '/lists/dvd/',
 'javascript:void(0)',
 'https://editorial.rottentomatoes.com',
 'https://www.facebook.com/rottentomatoes',
 'https://twitter.com/rottentomatoes',
 '/browse/opening/',
 '/browse/opening/',
 '/browse/in-theaters/',
 '/browse/upcoming/',
 '/browse/box-office/',
 '/browse/cf-in-theaters/',
 '/dvd/',
 '/browse/dvd-streaming-all/?services=fandango_now',
 '/browse/dvd-streaming-all/?services=netflix_iw',
 '/browse/dvd-streaming-all/?services=itunes',
 '/browse/dvd-streaming-all/?services=amazon_prime;amazon',
 '/browse/top-dvd-streaming/',
 '/browse/dvd-streaming-new/',
 '/browse/dvd-streaming-upcoming/',
 '/browse/cf-dvd-streaming-all/',
 '/browse/dvd-streaming-all/',
 '/top/',
 '/traile

In [8]:
# Exploring the regex options
print(re.findall(r'Death House', response.text), " -- We expected 7 from Chrome inspector")

['Death House', 'Death House']  -- We expected 7 from Chrome inspector


In [9]:
print(re.search(r'jsonLdSchema">({.*})<', response.text).group(1))
# Now we are in business
json_obj = json.loads(re.search(r'jsonLdSchema">({.*})<', response.text).group(1))
json_obj['itemListElement'] 

{"@context":"http://schema.org","@type":"ItemList","name":"All DVDs/Streaming","itemListElement":[{"@type":"ListItem","position":0,"url":"/m/death_house"},{"@type":"ListItem","position":1,"url":"/m/the_domestics"},{"@type":"ListItem","position":2,"url":"/m/claires_camera"},{"@type":"ListItem","position":3,"url":"/m/the_delinquent_season"},{"@type":"ListItem","position":4,"url":"/m/the_third_murder"},{"@type":"ListItem","position":5,"url":"/m/outlaw_king"},{"@type":"ListItem","position":6,"url":"/m/gauguin_voyage_to_tahiti"},{"@type":"ListItem","position":7,"url":"/m/anchor_and_hope"},{"@type":"ListItem","position":8,"url":"/m/crazy_rich_asians"},{"@type":"ListItem","position":9,"url":"/m/kin_2018"},{"@type":"ListItem","position":10,"url":"/m/skate_kitchen"},{"@type":"ListItem","position":11,"url":"/m/the_last_race_2018"},{"@type":"ListItem","position":12,"url":"/m/we_the_animals"},{"@type":"ListItem","position":13,"url":"/m/the_long_dumb_road"},{"@type":"ListItem","position":14,"url":"

[{'@type': 'ListItem', 'position': 0, 'url': '/m/death_house'},
 {'@type': 'ListItem', 'position': 1, 'url': '/m/the_domestics'},
 {'@type': 'ListItem', 'position': 2, 'url': '/m/claires_camera'},
 {'@type': 'ListItem', 'position': 3, 'url': '/m/the_delinquent_season'},
 {'@type': 'ListItem', 'position': 4, 'url': '/m/the_third_murder'},
 {'@type': 'ListItem', 'position': 5, 'url': '/m/outlaw_king'},
 {'@type': 'ListItem', 'position': 6, 'url': '/m/gauguin_voyage_to_tahiti'},
 {'@type': 'ListItem', 'position': 7, 'url': '/m/anchor_and_hope'},
 {'@type': 'ListItem', 'position': 8, 'url': '/m/crazy_rich_asians'},
 {'@type': 'ListItem', 'position': 9, 'url': '/m/kin_2018'},
 {'@type': 'ListItem', 'position': 10, 'url': '/m/skate_kitchen'},
 {'@type': 'ListItem', 'position': 11, 'url': '/m/the_last_race_2018'},
 {'@type': 'ListItem', 'position': 12, 'url': '/m/we_the_animals'},
 {'@type': 'ListItem', 'position': 13, 'url': '/m/the_long_dumb_road'},
 {'@type': 'ListItem', 'position': 14, 'u
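
The cell above calls `re.search(...).group(1)` directly, which raises `AttributeError` if the page ever stops embedding the block. A slightly more defensive version might look like the sketch below. It reuses the same pattern, but `SAMPLE_HTML`, the surrounding `<script>` attributes, and the helper name `extract_movie_urls` are assumptions made for illustration.

```python
import json
import re

# Hypothetical stand-in for response.text; the real page is far larger.
SAMPLE_HTML = (
    '<script type="application/ld+json" id="jsonLdSchema">'
    '{"@type":"ItemList","itemListElement":['
    '{"@type":"ListItem","position":0,"url":"/m/death_house"},'
    '{"@type":"ListItem","position":1,"url":"/m/the_domestics"}]}'
    '</script>'
)

# Same pattern as in the cell above, compiled once for reuse.
JSON_LD_PATTERN = re.compile(r'jsonLdSchema">({.*})<')


def extract_movie_urls(html):
    """Return the movie URLs from the embedded JSON block, or [] if absent."""
    match = JSON_LD_PATTERN.search(html)
    if match is None:
        return []
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []
    return [item["url"] for item in payload.get("itemListElement", [])]


print(extract_movie_urls(SAMPLE_HTML))
print(extract_movie_urls("<html></html>"))
```

Returning an empty list keeps a long crawl running when one page deviates from the pattern; logging the miss would be a sensible addition in a real spider.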

In [10]:
# Likewise, we could take advantage of that network call
# https://www.rottentomatoes.com/api/private/v2.0/browse?maxTomato=100&maxPopcorn=100&services=amazon%3Bhbo_go%3Bitunes%3Bnetflix_iw%3Bvudu%3Bamazon_prime%3Bfandango_now&certified&sortBy=release&type=dvd-streaming-all&page=1
# Tip, you can use a nice json viewer like http://jsonviewer.stack.hu/ to explore the object
json_response = requests.get('https://www.rottentomatoes.com/api/private/v2.0/browse?maxTomato=100&maxPopcorn=100&services=amazon%3Bhbo_go%3Bitunes%3Bnetflix_iw%3Bvudu%3Bamazon_prime%3Bfandango_now&certified&sortBy=release&type=dvd-streaming-all&page=1')
json_data = json.loads(json_response.text)
print(json_data.keys())
print(json_data['counts'])
print(len(json_data['results']))
print(json_data['results'][0]['url'])

01:16:58 DEBUG:Starting new HTTPS connection (1): www.rottentomatoes.com:443
01:16:59 DEBUG:https://www.rottentomatoes.com:443 "GET /api/private/v2.0/browse?maxTomato=100&maxPopcorn=100&services=amazon%3Bhbo_go%3Bitunes%3Bnetflix_iw%3Bvudu%3Bamazon_prime%3Bfandango_now&certified&sortBy=release&type=dvd-streaming-all&page=1 HTTP/1.1" 200 10982


dict_keys(['counts', 'results', 'debugUrl'])
{'count': 32, 'total': 15792}
32
/m/death_house
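
With 32 results per call and 15792 in total, the counts above give a rough idea of how many requests a full crawl would need. The sketch below does that arithmetic and builds the per-page URLs from a parameter dict instead of a hand-written string. It drops the valueless `certified` flag from the original URL for simplicity, and it is an assumption that the endpoint tolerates that and serves every page. `pages_needed` and `page_url` are hypothetical helper names.

```python
import math
from urllib.parse import urlencode

BASE_URL = "https://www.rottentomatoes.com/api/private/v2.0/browse"


def pages_needed(total, per_page):
    """Number of page requests required to cover `total` results."""
    return math.ceil(total / per_page)


def page_url(page):
    """Build the browse URL for one page; urlencode escapes ';' as %3B."""
    params = {
        "maxTomato": 100,
        "maxPopcorn": 100,
        "services": "amazon;hbo_go;itunes;netflix_iw;vudu;amazon_prime;fandango_now",
        "sortBy": "release",
        "type": "dvd-streaming-all",
        "page": page,
    }
    return BASE_URL + "?" + urlencode(params)


print(pages_needed(15792, 32))
print(page_url(2))
```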


# Building a spider
For our purposes, we can use the network calls to get everything we need; however, spiders frequently combine all of the methods described above for different purposes. Also note that this particular spider is pretty straightforward: now that we have found a pattern, much of what we are going to do could be accomplished with curl requests. But then we would miss out on the other benefits that Scrapy provides, listed below.

## Enter Scrapy
* a highly extensible asynchronous framework
* generally low memory demand
* handles all request queueing and item-processing scheduling
* many built-in middlewares that simplify
   * proxy management
   * caching pages
   * retry logic
   * redirect management
   * autothrottling requests
   * user-agent string management
* it is maybe 20 times faster than Selenium (even without dynamic crawling)

"If you are building something robust and want to make it as efficient as possible with lots of flexibility and a bunch of functions, and a project use case requires long-term maintenance, then you should definitely use it."

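Most of the built-in features listed above are switched on through project settings rather than custom code. The fragment below is a sketch of what a `settings.py` might contain. The setting names are standard Scrapy settings, but every value, the bot name, and the contact URL are illustrative placeholders, not taken from the repo.

```python
# settings.py (fragment) - illustrative values only, not the repo's configuration

BOT_NAME = "movie_demo"  # hypothetical project name

# Identify the crawler honestly; the contact URL is a placeholder.
USER_AGENT = "movie_demo (+https://example.com/contact)"
ROBOTSTXT_OBEY = True

# Autothrottle adapts the delay between requests to server response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Retry transient failures and follow redirects.
RETRY_ENABLED = True
RETRY_TIMES = 2
REDIRECT_ENABLED = True

# Cache responses on disk so repeated development runs do not hit the site again.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600
```
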
### Demo Scrapy in repo
*References*
* Scrapy - https://doc.scrapy.org/en/latest/
* Xpath - https://doc.scrapy.org/en/xpath-tutorial/topics/xpath-tutorial.html
* Regex - https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285


In [11]:
#example stats output
"""
2018-11-25 11:06:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 4,
 'downloader/exception_type_count/twisted.internet.error.NoRouteError': 3,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 1,
 'downloader/request_bytes': 10470750,
 'downloader/request_count': 11635,
 'downloader/request_method_count/GET': 11635,
 'downloader/response_bytes': 381379582,
 'downloader/response_count': 11631,
 'downloader/response_status_count/200': 10417,
 'downloader/response_status_count/301': 1152,
 'downloader/response_status_count/404': 62,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 11, 25, 16, 6, 45, 912498),
 'httperror/response_ignored_count': 62,
 'httperror/response_ignored_status_count/404': 62,
 'item_scraped_count': 9922,
 'log_count/CRITICAL': 7,
 'log_count/DEBUG': 21558,
 'log_count/INFO': 276,
 'memusage/max': 1213353984,
 'memusage/startup': 992456704,
 'request_depth_max': 2,
 'response_received_count': 10479,
 'retry/count': 4,
 'retry/reason_count/twisted.internet.error.NoRouteError': 3,
 'retry/reason_count/twisted.internet.error.TimeoutError': 1,
 'scheduler/dequeued': 11634,
 'scheduler/dequeued/memory': 11634,
 'scheduler/enqueued': 11634,
 'scheduler/enqueued/memory': 11634,
 'start_time': datetime.datetime(2018, 11, 25, 12, 30, 18, 928327)}
2018-11-25 11:06:45 [scrapy.core.engine] INFO: Spider closed (finished)
"""

"\n2018-11-25 11:06:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:\n{'downloader/exception_count': 4,\n 'downloader/exception_type_count/twisted.internet.error.NoRouteError': 3,\n 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 1,\n 'downloader/request_bytes': 10470750,\n 'downloader/request_count': 11635,\n 'downloader/request_method_count/GET': 11635,\n 'downloader/response_bytes': 381379582,\n 'downloader/response_count': 11631,\n 'downloader/response_status_count/200': 10417,\n 'downloader/response_status_count/301': 1152,\n 'downloader/response_status_count/404': 62,\n 'finish_reason': 'finished',\n 'finish_time': datetime.datetime(2018, 11, 25, 16, 6, 45, 912498),\n 'httperror/response_ignored_count': 62,\n 'httperror/response_ignored_status_count/404': 62,\n 'item_scraped_count': 9922,\n 'log_count/CRITICAL': 7,\n 'log_count/DEBUG': 21558,\n 'log_count/INFO': 276,\n 'memusage/max': 1213353984,\n 'memusage/startup': 992456704,\n 'request_depth_max'

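The raw stats are easier to read once a few ratios are derived from them. The sketch below copies the relevant numbers from the dump above and computes the crawl duration, request rate, share of 200 responses, and memory figures in MiB. The variable names are ad hoc.

```python
import datetime

# Values copied from the stats dump above.
stats = {
    "downloader/request_count": 11635,
    "downloader/response_count": 11631,
    "downloader/response_status_count/200": 10417,
    "item_scraped_count": 9922,
    "memusage/startup": 992456704,
    "memusage/max": 1213353984,
    "start_time": datetime.datetime(2018, 11, 25, 12, 30, 18, 928327),
    "finish_time": datetime.datetime(2018, 11, 25, 16, 6, 45, 912498),
}

elapsed = (stats["finish_time"] - stats["start_time"]).total_seconds()
requests_per_second = stats["downloader/request_count"] / elapsed
items_per_second = stats["item_scraped_count"] / elapsed
ok_share = (
    stats["downloader/response_status_count/200"]
    / stats["downloader/response_count"]
)
peak_memory_mib = stats["memusage/max"] / 2**20
growth_mib = (stats["memusage/max"] - stats["memusage/startup"]) / 2**20

print(f"elapsed:          {elapsed / 3600:.2f} h")
print(f"requests/second:  {requests_per_second:.2f}")
print(f"items/second:     {items_per_second:.2f}")
print(f"200 responses:    {ok_share:.1%}")
print(f"peak memory:      {peak_memory_mib:.0f} MiB (+{growth_mib:.0f} MiB over startup)")
```
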
## So what do I do with this unstructured data? Structure it with NLP
Natural Language Processing is the practice of programming computers to process and analyze large amounts of natural language data.

NLP starts with data expressed in natural language. It is unstructured and very difficult for machines to parse. Typical steps in processing unstructured data are
* cleaning text
* stop word removal
* parsing sentences
* parsing tokens (tokenization)
* part of speech tagging (PoS)
* lemmatization (reducing words to their base form; a relative of stemming)
* n-gram parsing
* entity recognition
* word dependencies
* sense disambiguation
* sentiment/opinion analysis
* word embeddings
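
A few of the early steps in the list above (cleaning, tokenization, stop word removal, n-gram parsing) can be illustrated without any NLP library. This is a toy sketch in plain Python, not how spaCy does it: the stop-word list is deliberately tiny and the function names are made up.

```python
import re

# Tiny illustrative stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "into", "on", "from"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [token for token in tokens if token not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "A holiday adventure from producer Chris Columbus"
tokens = remove_stop_words(tokenize(text))
print(tokens)
print(ngrams(tokens, 2))
```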

We won't have time to delve into any of these topics, but one library worth exploring provides entry points to most of the parsing steps above: spaCy. https://spacy.io/

A thorough introduction to NLP would walk you through the concepts of
* corpus analysis
* bag of words representations
* TF-IDF (term frequency-inverse document frequency)
* document clustering
* similarity measurements
* various NLP specific models
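
To make the bag-of-words and TF-IDF ideas above concrete, here is a minimal sketch of the plain, unsmoothed formula: term frequency within a document multiplied by the log of the inverse document frequency. The three tiny "documents" are invented, and production libraries typically apply smoothing and normalization that this version omits.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized mini corpus.
docs = {
    "christmas": "santa saves christmas with kids".split(),
    "outlaw": "king fights army in scotland".split(),
    "incredibles": "family of superheroes saves kids".split(),
}


def tf_idf(corpus):
    """Score each term per document as tf * log(N / df)."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in corpus.values() for term in set(tokens))
    scores = {}
    for name, tokens in corpus.items():
        counts = Counter(tokens)  # the bag-of-words representation
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


scores = tf_idf(docs)
for term, score in sorted(scores["christmas"].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {score:.3f}")
```

Terms shared across documents ("saves", "kids") score lower than terms unique to one description ("santa"), which is the weighting effect TF-IDF is meant to provide.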

Using these models, we could create vectors of the text we scrape that could then be fed into
* Topic analysis algorithms
* Machine learning classifiers
* Features in a Neural Network
* Recommendation engines
* Tuning search engine applications

etc.

In [13]:
# Let's just say we wanted to generate some pre-trained vectors of the text_blob we extracted above 
# to use for a subsequent hackathon event
# requires the model: run `python -m spacy download en_core_web_md` in your environment first
nlp = spacy.load('en_core_web_md')

In [14]:
christmas_chronicles = nlp("""THE CHRISTMAS CHRONICLES, a holiday adventure from producer Chris Columbus ("Home Alone", "Harry Potter and the Sorcerer's Stone") and director Clay Kaytis ("The Angry Birds Movie"), tells the story of sister and brother, Kate (Darby Camp) and Teddy Pierce (Judah Lewis), whose Christmas Eve plan to catch Santa Claus (Kurt Russell) on camera turns into an unexpected journey that most kids could only dream about. After staking out Santa's arrival, they sneak into his sleigh, cause it to crash and nearly derail Christmas. As their wild night unfolds, Kate and Teddy work together with Santa - as you've never seen him before - and his loyal Elves to save Christmas before it's too late.
Rating: NR
Genre: Animation, Comedy, Kids & Family
Directed By: Clay Kaytis
Written By: 
On Disc/Streaming: Nov 22, 2018
Studio: Netflix""")
christmas_chronicles.vector

array([-4.50760797e-02,  1.28042862e-01, -1.14963941e-01, -7.86615312e-02,
        7.22312853e-02, -5.04514202e-03, -4.66659060e-03, -1.10909365e-01,
       -1.25242583e-02,  1.75411654e+00, -2.41286412e-01, -1.09182307e-02,
        1.46873621e-02, -4.09719460e-02, -1.07738502e-01, -3.65779102e-02,
       -4.12049657e-03,  8.62485051e-01, -1.12931140e-01, -6.73135817e-02,
       -7.77821848e-03, -1.74913201e-02, -4.45148125e-02, -3.92508088e-03,
        2.22117994e-02,  7.19443634e-02, -8.39163139e-02, -7.67638907e-03,
        2.66896449e-02, -6.23520613e-02, -1.41106462e-02,  1.17388600e-03,
       -1.23405717e-01,  8.68092179e-02,  9.75956544e-02, -4.59623970e-02,
       -2.59339008e-02,  6.70086518e-02,  5.92920417e-03,  1.45868184e-02,
        5.59102409e-02,  6.16589151e-02,  9.49327182e-03, -6.25116304e-02,
        1.71876734e-03,  2.88100950e-02, -1.06286928e-01, -1.37714623e-03,
        6.71938062e-02,  4.38196361e-02,  2.84838565e-02, -9.45013575e-03,
       -1.18088694e-02, -

In [15]:
[nlp.vocab.strings[x] for x in christmas_chronicles.to_array(['lemma'])]

['the',
 'christmas',
 'chronicles',
 ',',
 'a',
 'holiday',
 'adventure',
 'from',
 'producer',
 'chris',
 'columbus',
 '(',
 '"',
 'home',
 'alone',
 '"',
 ',',
 '"',
 'harry',
 'potter',
 'and',
 'the',
 'sorcerer',
 "'s",
 'stone',
 '"',
 ')',
 'and',
 'director',
 'clay',
 'kaytis',
 '(',
 '"',
 'the',
 'angry',
 'birds',
 'movie',
 '"',
 ')',
 ',',
 'tell',
 'the',
 'story',
 'of',
 'sister',
 'and',
 'brother',
 ',',
 'kate',
 '(',
 'darby',
 'camp',
 ')',
 'and',
 'teddy',
 'pierce',
 '(',
 'judah',
 'lewis',
 ')',
 ',',
 'whose',
 'christmas',
 'eve',
 'plan',
 'to',
 'catch',
 'santa',
 'claus',
 '(',
 'kurt',
 'russell',
 ')',
 'on',
 'camera',
 'turn',
 'into',
 'an',
 'unexpected',
 'journey',
 'that',
 'most',
 'kid',
 'could',
 'only',
 'dream',
 'about',
 '.',
 'after',
 'stake',
 'out',
 'santa',
 "'s",
 'arrival',
 ',',
 '-PRON-',
 'sneak',
 'into',
 '-PRON-',
 'sleigh',
 ',',
 'because',
 '-PRON-',
 'to',
 'crash',
 'and',
 'nearly',
 'derail',
 'christmas',
 '.',
 '

In [16]:
[nlp.vocab.strings[x] for x in christmas_chronicles.to_array(['pos'])]

['DET',
 'PROPN',
 'PROPN',
 'PUNCT',
 'DET',
 'NOUN',
 'NOUN',
 'ADP',
 'NOUN',
 'PROPN',
 'PROPN',
 'PUNCT',
 'PUNCT',
 'NOUN',
 'PROPN',
 'PUNCT',
 'PUNCT',
 'PUNCT',
 'PROPN',
 'PROPN',
 'CCONJ',
 'DET',
 'PROPN',
 'PART',
 'PROPN',
 'PUNCT',
 'PUNCT',
 'CCONJ',
 'NOUN',
 'PROPN',
 'PROPN',
 'PUNCT',
 'PUNCT',
 'DET',
 'PROPN',
 'PROPN',
 'PROPN',
 'PUNCT',
 'PUNCT',
 'PUNCT',
 'VERB',
 'DET',
 'NOUN',
 'ADP',
 'NOUN',
 'CCONJ',
 'NOUN',
 'PUNCT',
 'PROPN',
 'PUNCT',
 'PROPN',
 'PROPN',
 'PUNCT',
 'CCONJ',
 'PROPN',
 'PROPN',
 'PUNCT',
 'PROPN',
 'PROPN',
 'PUNCT',
 'PUNCT',
 'ADJ',
 'PROPN',
 'PROPN',
 'VERB',
 'PART',
 'VERB',
 'PROPN',
 'PROPN',
 'PUNCT',
 'PROPN',
 'PROPN',
 'PUNCT',
 'ADP',
 'NOUN',
 'VERB',
 'ADP',
 'DET',
 'ADJ',
 'NOUN',
 'ADJ',
 'ADJ',
 'NOUN',
 'VERB',
 'ADV',
 'VERB',
 'ADP',
 'PUNCT',
 'ADP',
 'VERB',
 'PART',
 'PROPN',
 'PART',
 'NOUN',
 'PUNCT',
 'PRON',
 'VERB',
 'ADP',
 'ADJ',
 'NOUN',
 'PUNCT',
 'VERB',
 'PRON',
 'PART',
 'VERB',
 'CCONJ',
 'ADV',


In [17]:
christmas_chronicles = nlp("""THE CHRISTMAS CHRONICLES, a holiday adventure from producer Chris Columbus ("Home Alone", "Harry Potter and the Sorcerer's Stone") and director Clay Kaytis ("The Angry Birds Movie"), tells the story of sister and brother, Kate (Darby Camp) and Teddy Pierce (Judah Lewis), whose Christmas Eve plan to catch Santa Claus (Kurt Russell) on camera turns into an unexpected journey that most kids could only dream about. After staking out Santa's arrival, they sneak into his sleigh, cause it to crash and nearly derail Christmas. As their wild night unfolds, Kate and Teddy work together with Santa - as you've never seen him before - and his loyal Elves to save Christmas before it's too late.
Rating: NR
Genre: Animation, Comedy, Kids & Family
Directed By: Clay Kaytis
Written By: 
On Disc/Streaming: Nov 22, 2018
Studio: Netflix""")

outlaw_king = nlp("""OUTLAW KING tells the untold, true story of Robert the Bruce who transforms from defeated nobleman to outlaw hero during the oppressive occupation of medieval Scotland by Edward I of England. Despite grave consequences, Robert seizes the Scottish crown and rallies an impassioned group of men to fight back against the mighty army of the tyrannical King and his volatile son, the Prince of Wales. Filmed in Scotland, OUTLAW KING reunites director David Mackenzie (Hell or High Water) with star Chris Pine alongside Aaron Taylor-Johnson, Florence Pugh and Billy Howle.
Rating: R (for sequences of brutal war violence some sexuality, language and brief nudity)
Genre: Action & Adventure, Drama
Directed By: David Mackenzie
Written By: Bathsheba Doran, James MacInnes, David Mackenzie, Mark Bomback, David Harrower
In Theaters: Nov 9, 2018  Limited
On Disc/Streaming: Nov 9, 2018
Runtime: 117 minutes
Studio: Netflix""")

incredibles = nlp("""Everyone's favorite family of superheroes is back in "Incredibles 2"--but this time Helen (voice of Holly Hunter) is in the spotlight, leaving Bob (voice of Craig T. Nelson) at home with Violet (voice of Sarah Vowell) and Dash (voice of Huck Milner) to navigate the day-to-day heroics of "normal" life. It's a tough transistion for everyone, made tougher by the fact that the family is still unaware of baby Jack-Jack's emerging superpowers. When a new villain hatches a brilliant and dangerous plot, the family and Frozone (voice of Samuel L. Jackson) must find a way to work together again--which is easier said than done, even when they're all Incredible.
Rating: PG (for action sequences and some brief mild language)
Genre: Action & Adventure, Animation, Kids & Family
Directed By: Brad Bird
Written By: Brad Bird
In Theaters: Jun 15, 2018  Wide
On Disc/Streaming: Oct 23, 2018
Runtime: 118 minutes
Studio: Disney/Pixar""")

In [18]:
# Calc some similarities. A higher score is more similar
print(christmas_chronicles.similarity(outlaw_king))
print(christmas_chronicles.similarity(incredibles))
assert christmas_chronicles.similarity(incredibles) > christmas_chronicles.similarity(outlaw_king)


0.9393766484038705
0.9739821201122889
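
By default, spaCy's `Doc.similarity` is the cosine similarity between the two document vectors, and each document vector is the average of its word vectors. The sketch below shows the cosine calculation itself on made-up three-dimensional vectors; the real `en_core_web_md` vectors have many more dimensions, and none of these numbers come from the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical document vectors, for illustration only.
doc_a = [1.0, 2.0, 0.5]
doc_b = [1.1, 1.9, 0.4]   # points in nearly the same direction as doc_a
doc_c = [-2.0, 0.5, 1.0]  # points in a very different direction

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Because only direction matters, two long descriptions that share a lot of common vocabulary can end up with averaged vectors pointing almost the same way.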


### Caveat: The extremely high similarity scores here are related to
* use of a pretrained model rather than one trained on the corpus we are working with
* no use of feature selection or reduction of noise in comparisons
* lack of weighted features (i.e. we would do even better if weighting Genre and Rating)

This was only for demonstration purposes. I would not use this without some significant feature selection.

*See the MovieNLPPipeline in this repo for an example application.*


<img src="files/end.png">