# Movie Review Sentiment Analysis

The project follows these steps:

* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

![Movie Review Sentiment Analysis](images/header.png "Movie Review Sentiment Analysis")

## Project Summary

This project models two facts, **artist review sentiment** and **film review sentiment**, based on the data provided by **[IMDb](https://www.imdb.com/)** and **[TMDb](https://www.themoviedb.org/)**.

The goal of our **Data Pipeline** is to publish a PDF report to **S3** with the following summarised information:

1. Top 10 Films
2. Worst 10 Films
3. Review Sentiment Distribution
4. Top 10 Actors/Actresses in Best Reviewed Films
5. IMDb Average Voting vs TMDb Sentiment Reviews through Years

Our data sources are:

* **[IMDB Large Movie Reviews Sentiment Dataset](https://www.kaggle.com/jcblaise/imdb-sentiments)**
* **[TMDB Movies (2000-2020) with imdb_id](https://www.kaggle.com/hudsonmendes/tmdb-movies-20002020-with-imdb-id)**
* **[TMDB Reviews, movies released from 2000 and 2020](https://www.kaggle.com/hudsonmendes/tmdb-reviews-movies-released-from-2000-and-2020)**
* **[IMDB Cast (Links)](https://datasets.imdbws.com/title.principals.tsv.gz)**
* **[IMDB Cast (Names)](https://datasets.imdbws.com/name.basics.tsv.gz)**

## Project Scope

This project focuses on implementing:

1. **Infrastructure as Code** that installs the entire end-to-end structure
2. **Review Sentiment Classifier Model Trainer DAG** responsible for training our classifier
3. **End-to-end Data Warehouse Building DAG** responsible for creating our Data Warehouse

## Data

In [1]:
!pip install -q -U requests
!pip install -q -U tqdm
!pip install -q -U pandas
!pip install -q -U s3fs

In [2]:
import os
import json
import requests
import pandas as pd
from zipfile import ZipFile
from urllib.request import urlopen
from tqdm.notebook import tqdm

### Shared Utilities

Here we prepare some utilities that will help us visualise the raw data presented in this section.

In [3]:
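def download_file(file_url, file_path, force=False):
    # Make sure the destination folder exists (skip when the path has no folder part).
    file_folder = os.path.dirname(file_path)
    if file_folder and not os.path.isdir(file_folder):
        os.makedirs(file_folder)
    # Skip the download when the file is already present, unless forced.
    if force or not os.path.isfile(file_path):
        # Total size as reported by the server (-1 when unknown).
        file_size = int(urlopen(file_url).info().get('Content-Length', -1))
        # When forcing over an existing file, resume from the bytes already on disk.
        if os.path.exists(file_path):
            first_byte = os.path.getsize(file_path)
        else:
            first_byte = 0
        if first_byte < file_size:
            # Request only the missing bytes; the file is opened in append mode below.
            header = {"Range": "bytes=%s-%s" % (first_byte, file_size)}
            pbar = tqdm(total=file_size, initial=first_byte, unit='B', unit_scale=True, desc=file_url.split('/')[-1])
            req = requests.get(file_url, headers=header, stream=True)
            with open(file_path, 'ab') as f:
                for chunk in req.iter_content(chunk_size=1024):
                    if chunk:
                        f.write(chunk)
                        # Advance by the bytes actually written (the last chunk may be shorter).
                        pbar.update(len(chunk))
            pbar.close()
    return file_path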

### [IMDB Large Movie Reviews Sentiment Dataset](https://www.kaggle.com/jcblaise/imdb-sentiments)

This dataset consists of movie reviews labelled as 1 (negative) or 0 (positive). Our goal is to use it to train a classifier model with TensorFlow, which will then classify the sentiment of the user reviews linked to the movies we want to analyse.

The original [Kaggle Dataset](https://www.kaggle.com/jcblaise/imdb-sentiments) cannot be downloaded directly. Therefore, we download a copy of it from an [S3 DataLake](https://hudsonmendes-datalake.s3.eu-west-2.amazonaws.com/kaggle/jcblaise/imdb-sentiment.zip) into `data/raw/imdb-sentiment.zip`.

**Important:** This data will not be expressed directly in our data warehouse's star schema. It will be used to train a TensorFlow model responsible for classifying sentiment as negative (-1) or positive (1). That model will then classify the other user reviews, and those results are what the facts tables actually present.
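
The raw dataset and the warehouse use different label conventions (0/1 versus 1/-1). Below is a minimal sketch of one possible mapping, assuming raw label 0 (positive) maps to 1 and raw label 1 (negative) maps to -1. The function name `to_sentiment_class` is hypothetical and not part of the project code.

```python
def to_sentiment_class(raw_label: int) -> int:
    """Map a raw dataset label (0 = positive, 1 = negative) to the
    warehouse convention (1 = positive, -1 = negative).

    The mapping itself is an assumption made for illustration.
    """
    mapping = {0: 1, 1: -1}
    if raw_label not in mapping:
        raise ValueError(f"unexpected label: {raw_label!r}")
    return mapping[raw_label]


raw_labels = [0, 1, 1, 0]
sentiment_classes = [to_sentiment_class(label) for label in raw_labels]
print(sentiment_classes)  # [1, -1, -1, 1]
```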

| Attribute  	| Value                                                                                                                     	|
|------------	|---------------------------------------------------------------------------------------------------------------------------	|
| Author     	| Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher 	|
| Title      	| Learning Word Vectors for Sentiment Analysis                                                                              	|
| Book Title 	| Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies      	|
| Month      	| June                                                                                                                      	|
| Year       	| 2011                                                                                                                      	|
| Address    	| Portland, Oregon, USA                                                                                                     	|
| Publisher  	| Association for Computational Linguistics                                                                                 	|
| Pages      	| 142--150                                                                                                                  	|
| URL        	| http://www.aclweb.org/anthology/P11-1015                                                                                  	|

In [4]:
imdb_sentiment_url  = 'https://hudsonmendes-datalake.s3.eu-west-2.amazonaws.com/kaggle/jcblaise/imdb-sentiment.zip'
imdb_sentiment_path = 'data/raw/imdb-sentiment.zip'
(imdb_sentiment_url, imdb_sentiment_path)

('https://hudsonmendes-datalake.s3.eu-west-2.amazonaws.com/kaggle/jcblaise/imdb-sentiment.zip',
 'data/raw/imdb-sentiment.zip')

In [5]:
download_file(imdb_sentiment_url, imdb_sentiment_path)

'data/raw/imdb-sentiment.zip'

In [6]:
df_raw_kaggle_train = None
with ZipFile(imdb_sentiment_path) as zip_file:
    df_raw_kaggle_train = pd.read_csv(
        zip_file.open('train.csv'),
        header=0,
        on_bad_lines='skip')  # replaces error_bad_lines=False, which was removed in pandas 2.0

In [7]:
pd.set_option('max_colwidth', 1024)
df_raw_kaggle_train[df_raw_kaggle_train.sentiment == 0].head(3)

Unnamed: 0,text,sentiment
0,"For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan ""The Skipper"" Hale jr. as a police Sgt.",0
1,"Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV's ""Flamingo Road"") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her attempted suicides by guarding the Gateway to Hell! The scenes with Raines modeling are very well captured, the mood music is perfect, Deborah Raffin is charming as Cristina's pal, but when Raines moves into a creepy Brooklyn Heights brownstone (inhabited by a blind priest on the top floor), things really start cooking. The neighbors, including a fantastically wicked Burgess Meredith and kinky couple Sylvia Miles & Beverly D'Angelo, are a diabolical lot, and Eli Wallach is great fun as a wily police detective. The movie is nearly a cross-pollination of ""Rosemary's Baby"" and ""The Exorcist""--but what a combination! Based on the best-seller by Jeffrey Konvitz, ""The Sentinel"" is entertainingly spooky, full of shocks brought off well by director Michael Winner, who mounts a thoughtfully downbeat ending with skill. ***...",0
2,"A solid, if unremarkable film. Matthau, as Einstein, was wonderful. My favorite part, and the only thing that would make me go out of my way to see this again, was the wonderful scene with the physicists playing badmitton, I loved the sweaters and the conversation while they waited for Robbins to retrieve the birdie.",0


In [8]:
pd.set_option('max_colwidth', 1024)
df_raw_kaggle_train[df_raw_kaggle_train.sentiment == 1].head(3)

Unnamed: 0,text,sentiment
12500,"Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience. Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form.",1
12501,"Well...tremors I, the original started off in 1990 and i found the movie quite enjoyable to watch. however, they proceeded to make tremors II and III. Trust me, those movies started going downhill right after they finished the first one, i mean, ass blasters??? Now, only God himself is capable of answering the question ""why in Gods name would they create another one of these dumpster dives of a movie?"" Tremors IV cannot be considered a bad movie, in fact it cannot be even considered an epitome of a bad movie, for it lives up to more than that. As i attempted to sit though it, i noticed that my eyes started to bleed, and i hoped profusely that the little girl from the ring would crawl through the TV and kill me. did they really think that dressing the people who had stared in the other movies up as though they we're from the wild west would make the movie (with the exact same occurrences) any better? honestly, i would never suggest buying this movie, i mean, there are cheaper ways to find things that burn ...",1
12502,"Ouch! This one was a bit painful to sit through. It has a cute and amusing premise, but it all goes to hell from there. Matthew Modine is almost always pedestrian and annoying, and he does not disappoint in this one. Deborah Kara Unger and John Neville turned in surprisingly decent performances. Alan Bates and Jennifer Tilly, among others, played it way over the top. I know that's the way the parts were written, and it's hard to blame actors, when the script and director have them do such schlock. If you're going to have outrageous characters, that's OK, but you gotta have good material to make it work. It didn't here. Run away screaming from this movie if at all possible.",1


### [TMDB Movies (2000-2020) with imdb_id](https://www.kaggle.com/hudsonmendes/tmdb-movies-20002020-with-imdb-id)

To deliver our data warehouse, we must write user review sentiment data to our facts tables, which requires linking films to reviews.

Now, **IMDb** does have reviews on its website; however, the [IMDb Conditions of Use do not allow scraping](https://www.imdb.com/conditions), as follows:

> **Robots and Screen Scraping:** You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

The alternative is **[TMDb (The Movie Database)](https://www.themoviedb.org/)**, which provides user reviews for films.

**TMDb** allows you to find movies by `imdb_id` through its API. However, downloading each film by its `imdb_id`, one by one, for more than 6 million titles would take approximately 35 days if done in series.
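
The 35-day estimate can be sanity-checked with simple arithmetic. The sketch below takes the two figures quoted above (6 million titles, 35 days) and derives the time per request they imply. It makes no claim about actual TMDb API latency or rate limits.

```python
SECONDS_PER_DAY = 24 * 60 * 60

titles = 6_000_000   # "more than 6 million titles" (lower bound from the text)
days_in_series = 35  # estimate quoted in the text

# Average time budget per request implied by the two figures above.
seconds_per_request = days_in_series * SECONDS_PER_DAY / titles
print(f"{seconds_per_request:.3f} s per request")  # 0.504 s per request
```

In other words, the estimate corresponds to roughly half a second per sequential request.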

As a solution, we use another [Kaggle Dataset](https://www.kaggle.com/hudsonmendes/tmdb-movies-20002020-with-imdb-id) that provides the same data in a form suited to batch consumption. Once again, Kaggle does not offer a public link for the dataset, so we download it from an [S3 DataLake](https://hudsonmendes-datalake.s3.eu-west-2.amazonaws.com/kaggle/hudsonmendes/tmdb-movies-with-imdb_id.zip).

| Attribute      	| Value                                                                                                                    	|
|----------------	|--------------------------------------------------------------------------------------------------------------------------	|
| Title          	| TMDb Movies (2000-2020) with `imdb_id`                                                                                   	|
| Original Link  	| [Kaggle Dataset](https://www.kaggle.com/hudsonmendes/tmdb-movies-20002020-with-imdb-id)                                  	|
| Data Lake Link 	| [S3 DataLake](https://hudsonmendes-datalake.s3.eu-west-2.amazonaws.com/kaggle/hudsonmendes/tmdb-movies-with-imdb_id.zip) 	|
| Author         	| [Hudson Mendes](https://www.kaggle.com/hudsonmendes)                                                                     	|
| Published      	| 2020-05-20                                                                                                               	|
| License        	| [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)                                                          	|

In [9]:
tmdb_movies_url  = 'https://hudsonmendes-datalake.s3.eu-west-2.amazonaws.com/kaggle/hudsonmendes/tmdb-movies-with-imdb_id.zip'
tmdb_movies_path = 'data/raw/tmdb-movies-with-imdb_id.zip'
(tmdb_movies_url, tmdb_movies_path)

('https://hudsonmendes-datalake.s3.eu-west-2.amazonaws.com/kaggle/hudsonmendes/tmdb-movies-with-imdb_id.zip',
 'data/raw/tmdb-movies-with-imdb_id.zip')

In [10]:
download_file(tmdb_movies_url, tmdb_movies_path)

'data/raw/tmdb-movies-with-imdb_id.zip'

In [11]:
df_raw_tmdb_movies = None
with ZipFile(tmdb_movies_path) as zip_file:
    df_raw_tmdb_movies = pd.read_json(zip_file.open('tmdb-movies-2020.json'), lines=True)
df_raw_tmdb_movies.head()

Unnamed: 0,id,video,vote_count,vote_average,title,release_date,original_language,original_title,genre_ids,backdrop_path,adult,overview,poster_path,popularity,id_imdb
0,634233,False,0,0.0,Class Action Park,,en,Class Action Park,[99],,False,"Discover the legacy of Action Park, a very real amusement park that used to exist that has something of a legendary story. Not only because the park was wild and far less structured than something like Disneyland, but because many people were seriously injured, and some even died. While it sounds made up, this place was very real.",/h21s0GnRiI9aL4hIFQWJGPZSwGg.jpg,0.6,tt11015214
1,436786,False,0,0.0,Clapboard Jungle: Surviving the Independent Film Business,2020-03-26,en,Clapboard Jungle: Surviving the Independent Film Business,[99],,False,A survival guide for the modern independent filmmaker.,/3ZslUuPHnv5EngMHbWyQzb9PX0L.jpg,0.926,tt4284084
2,688335,False,2,7.5,Close Encounters of the Fifth Kind,2020-04-07,en,Close Encounters of the Fifth Kind,[99],/efiDNmExd9GweIqmQKcwIt09hi7.jpg,False,"Dr. Steven Greer’s previous works, SIRIUS and UNACKNOWLEDGED, broke crowdfunding records and ignited a grassroots movement. CLOSE ENCOUNTERS OF THE FIFTH KIND features groundbreaking video and photographic evidence and supporting interviews from prominent figures such as Adam Curry of Princeton’s PEAR Lab; legendary civil rights attorney Daniel Sheehan, and Dr. Russell Targ, who headed the CIA’s top secret remote viewing program. Their message: For thousands of people, contact has begun. This is their story.",/R0rmE1wDbxMb2W0MyDE3HKPhOR.jpg,5.217,tt12108272
3,639103,False,1,4.0,Clover,2020-04-03,en,Clover,"[35, 53]",/hgvNpunbkYhPLKj46GotQqlTXng.jpg,False,Brothers Jackie and Mickey along with a teen witness Clover find themselves on the run from the city’s most notorious crime boss.,/wTp5SYAuTuHDGQGWePkF2z5sHTD.jpg,5.588,tt7801350
4,674334,False,0,0.0,Climbing Blind,2020-03-20,en,Climbing Blind,[99],/rjtXLRO0ixmbjoQM27Rj7rFRWe9.jpg,False,Blind climber Jesse Dufton's ascent of the Old Man of Hoy.,/rODi66xlhEIaHs2krHlj4A8da5f.jpg,2.586,tt11801494


### [TMDB Reviews, movies released from 2000 and 2020](https://www.kaggle.com/hudsonmendes/tmdb-reviews-movies-released-from-2000-and-2020)

The previous dataset allows us to link **IMDb movies** to **TMDb movies**. We can now bring in the **TMDb reviews** that feed our sentiment analysis. With these, we finally have everything we need to build the facts tables in our data warehouse.
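
This linkage can be sketched as a join between the two TMDb datasets. It is a minimal illustration, assuming that `movie_id` in the reviews refers to `id` in the movies dataset (as the column names in the dataset previews suggest). The tiny in-memory frames are hypothetical stand-ins, not real rows.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical rows, real column names).
df_movies = pd.DataFrame({
    "id": [1, 2],
    "title": ["Film A", "Film B"],
    "id_imdb": ["tt9000001", "tt9000002"],
})
df_reviews = pd.DataFrame({
    "id": ["r1", "r2", "r3"],
    "movie_id": [1, 1, 3],
    "content": ["Great.", "Dull.", "Review of a film we do not have."],
})

# Inner join: keep only reviews whose film is present in the movies dataset,
# and attach the IMDb id to each of them.
df_linked = df_reviews.merge(
    df_movies[["id", "id_imdb"]].rename(columns={"id": "movie_id"}),
    on="movie_id",
    how="inner",
)
print(df_linked[["id", "movie_id", "id_imdb"]])
```

An inner join drops reviews without a matching film (`r3` here). Whether the real pipeline drops or keeps such rows is not stated in this document.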

Unfortunately, **TMDb reviews** must also be [downloaded from the TMDb API one by one](https://developers.themoviedb.org/3/movies/get-movie-reviews), which is not an option for larger datasets. Therefore, we again rely on a [Kaggle dataset](https://www.kaggle.com/hudsonmendes/tmdb-reviews-movies-released-from-2000-and-2020) that provides the same information as JSON files.

| Attribute      	| Value                                                                                                        	|
|----------------	|--------------------------------------------------------------------------------------------------------------	|
| Title          	| TMDB Reviews, movies released from 2000 and 2020                                                             	|
| Original Link  	| [Kaggle Dataset](https://www.kaggle.com/hudsonmendes/tmdb-reviews-movies-released-from-2000-and-2020)        	|
| Data Lake Link 	| [S3 DataLake](https://hudsonmendes-datalake.s3.eu-west-2.amazonaws.com/kaggle/hudsonmendes/tmdb-reviews.zip) 	|
| Author         	| [Hudson Mendes](https://www.kaggle.com/hudsonmendes)                                                         	|
| Published      	| 2020-05-20                                                                                                   	|
| License        	| [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)                                              	|

In [12]:
tmdb_reviews_url  = 'https://hudsonmendes-datalake.s3.eu-west-2.amazonaws.com/kaggle/hudsonmendes/tmdb-reviews.zip'
tmdb_reviews_path = 'data/raw/tmdb-reviews.zip'
(tmdb_reviews_url, tmdb_reviews_path)

('https://hudsonmendes-datalake.s3.eu-west-2.amazonaws.com/kaggle/hudsonmendes/tmdb-reviews.zip',
 'data/raw/tmdb-reviews.zip')

In [13]:
download_file(tmdb_reviews_url, tmdb_reviews_path)

'data/raw/tmdb-reviews.zip'

In [14]:
df_raw_tmdb_reviews = None
with ZipFile(tmdb_reviews_path) as zip_file:
    df_raw_tmdb_reviews = pd.read_json(zip_file.open('tmdb-movies-2020-reviews.json'), lines=True)
df_raw_tmdb_reviews.head()

Unnamed: 0,author,content,id,url,movie_id
0,msbreviews,"If you enjoy reading my Spoiler-Free reviews, please follow my blog @\r\nhttps://www.msbreviews.com\r\n\r\nI always struggle to find reasons to dislike a Pixar film. One of the most annoying preconceived notions about genres is the one about animated movies. ""They're for children, how can you enjoy stuff like that, you're so childish"", people say. Little do they know that animated flicks have as much or more emotionally compelling narratives and characters than live-action films. The score is usually more important in the former genre, and the visuals always look stunning. The partnership Disney-Pixar is probably the best thing that could have happened to Hollywood.\r\n\r\nDan Scanlon delivered a surprisingly good sequel to Monsters Inc. back in 2013 with Monsters University. Making an efficient sequel twelve years after its original is a challenging task, and Scanlon succeed, so I had good expectations going in. Onward might not be one of Pixar's best movies, and I doubt that a lot of people will place i...",5e5d3685357c0000192c1b79,https://www.themoviedb.org/review/5e5d3685357c0000192c1b79,508439
1,Louisa Moore - Screen Zealots,"Unicorns, pixies, elves, and wizards have lost the magic of the past in “Onward,” a formulaic animated adventure from Pixar. Set in a suburban world of fantasy, the film tells the story of two teenage brothers who embark on a quest to find a magic gem that could hold the key to a little familial enchantment.\r\n\r\nIan (Tom Holland) and his older brother Barley (Chris Pratt) are still mourning the death of their father several years ago. Ian was just a baby when he died, and Barley barely has any memories of his dad. When mom (Julia Louis-Dreyfus) gives Ian a gift that his departed pop left for his 16th birthday, it turns out to be a wizard’s staff with instructions for a visitation spell that can bring the patriarch back for one day only. Something goes wrong, and the two siblings must set out on a danger-filled journey to find the missing piece of the puzzle before the sun sets.\r\n\r\nOnscreen adventures are supposed to be exciting, but this one is considerably dull. The story feels very personal yet u...",5e7d5b58fea0d710f5c933bd,https://www.themoviedb.org/review/5e7d5b58fea0d710f5c933bd,508439
2,SWITCH.,"Animation has such an important job. The messages I was talking about earlier are not just intended to teach kids things - we learn from them too. I think that's what made 'Onward' a little disappointing: I didn't come out having learned something new in the context of my adult life. Nonetheless, it's still a lovely allegorical tale that promotes brotherhood, adventure and generosity. Animation's got a big duty of shaping up future generations from a young age, and they can't always deliver the goods for all. With that, I sign off - eagerly awaiting Pixar's next drop where, fingers crossed, I get to discover something new too... don't forget the big kids, Pixar.\r\n- Lily Meek\r\n\r\nRead Lily's full article...\r\nhttps://www.maketheswitch.com.au/article/review-onward-not-the-fairytale-pixar-film-we-wanted",5e5f4a42357c0000132f80e0,https://www.themoviedb.org/review/5e5f4a42357c0000132f80e0,508439
3,Brandon,Onward is a family quest that delights (Boom Bastia!) and pulls at your heartstrings along the unpredictable Path of Peril. There is a delicate balance between tear-jerking moments and comic relief that Pixar and Disney have been known to master throughout many of their films. If you can look past the Pixar and Disney blueprint of the missing family member tragedy you’ll find many successful attempts to modernize the film to relate to many ages and families. \r\n\r\nRead the full review at FeaturedAnimation.com.,5e6aedae55c926001966fdd6,https://www.themoviedb.org/review/5e6aedae55c926001966fdd6,508439
4,msbreviews,"If you enjoy reading my Spoiler-Free reviews, please follow my blog @\r\nhttps://www.msbreviews.com\r\n\r\nFirst of all, no, I have no knowledge of the TV series this film is based upon, and after watching Jeff Wadlow's adaptation, I'm definitely not watching anything related to this story. I always follow this principle: the two pillars of any movie are its story and characters. It doesn't matter if everything else is absolutely perfect. If the two central pieces fail to convince the audience, then the whole movie falls to the ground. Sometimes, there are some crumbles that remain intact. Very rarely, there's a big segment that has no scratches whatsoever, which can help the film to survive and, even more rarely, actually make it ""okay"".\r\n\r\nFantasy Island has two little crumbles named Portia Doubleday (Sloane) and Maggie Q (Gwen), who deliver good performances, especially the former who everyone knows from one of the best TV shows of all-time, Mr. Robot. Everything else is completely destroyed. The G...",5e458dbedb8a0000179b3137,https://www.themoviedb.org/review/5e458dbedb8a0000179b3137,539537


### [IMDB Cast (Links)](https://datasets.imdbws.com/title.principals.tsv.gz)

As the source for the cast, we use the IMDb **Principals** dataset. **Principals** are links between names and titles, so they carry only the ids, not the names.

Fortunately, IMDb provides a link for direct download, which we use to download the `.tsv` file and process it. However, the IMDb endpoint offers very low QoS (bandwidth). For that reason, we have created a copy of the file in a data lake.

| Attribute     | Value |
|---------------|-------|
| File          | `title.principals.tsv.gz` |
| Original Link | [IMDb Dataset](https://datasets.imdbws.com/title.principals.tsv.gz) |
| Download Date | 2020-05-26 |
| License       | [IMDb Non-Commercial License](https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3aefe545-f8d3-4562-976a-e5eb47d1bb18&pf_rd_r=AMBK36FKKT9RE70H2AK5&pf_rd_s=center-1&pf_rd_t=60601&pf_rd_i=interfaces&ref_=fea_mn_lk1) |

In [15]:
imdb_cast_url = 'https://datasets.imdbws.com/title.principals.tsv.gz'
imdb_cast_path = 'data/raw/imdb-cast.tsv.gz'

In [16]:
download_file(imdb_cast_url, imdb_cast_path)

'data/raw/imdb-cast.tsv.gz'

In [17]:
df_imdb_cast = pd.read_csv(imdb_cast_path, sep='\t', compression='gzip')
df_imdb_cast.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Self""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0374658,cinematographer,director of photography,\N
3,tt0000002,1,nm0721526,director,\N,\N
4,tt0000002,2,nm1335271,composer,\N,\N


### [IMDB Cast (Names)](https://datasets.imdbws.com/name.basics.tsv.gz)

Now that we have the **Principals** (links between titles and artists), we can connect the IMDb movies to **Cast Names** by using the **IMDb Names Dataset**.
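
A minimal sketch of that connection, assuming the join key is `nconst` and that only the `actor` and `actress` categories count as cast (the filter is our assumption, not something the datasets prescribe). The rows are hypothetical and only reuse the real column names.

```python
import pandas as pd

# Hypothetical rows using the real column names of both IMDb files.
df_principals = pd.DataFrame({
    "tconst": ["tt9000001", "tt9000001", "tt9000002"],
    "nconst": ["nm9000001", "nm9000002", "nm9000003"],
    "category": ["actor", "director", "actress"],
})
df_names = pd.DataFrame({
    "nconst": ["nm9000001", "nm9000002", "nm9000003"],
    "primaryName": ["Person A", "Person B", "Person C"],
})

# Keep acting credits only, then attach the name through `nconst`.
is_cast = df_principals["category"].isin(["actor", "actress"])
df_cast = df_principals[is_cast].merge(df_names, on="nconst", how="inner")
print(df_cast[["tconst", "nconst", "primaryName"]])
```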

Fortunately, IMDb provides a link for direct download, which we use to download the `.tsv` file and process it. However, the IMDb endpoint offers very low QoS (bandwidth). For that reason, we have created a copy of the file in a data lake.

| Attribute     | Value |
|---------------|-------|
| File          | `name.basics.tsv.gz` |
| Original Link | [IMDb Dataset](https://datasets.imdbws.com/name.basics.tsv.gz) |
| Download Date | 2020-05-26 |
| License       | [IMDb Non-Commercial License](https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3aefe545-f8d3-4562-976a-e5eb47d1bb18&pf_rd_r=AMBK36FKKT9RE70H2AK5&pf_rd_s=center-1&pf_rd_t=60601&pf_rd_i=interfaces&ref_=fea_mn_lk1) |

In [18]:
imdb_names_url = 'https://datasets.imdbws.com/name.basics.tsv.gz'
imdb_names_path = 'data/raw/name.basics.tsv.gz'

In [19]:
download_file(imdb_names_url, imdb_names_path)

'data/raw/name.basics.tsv.gz'

In [20]:
df_imdb_names = pd.read_csv(imdb_names_path, sep='\t', compression='gzip')
df_imdb_names.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0072308,tt0050419,tt0031983,tt0053137"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0038355,tt0037382,tt0117057,tt0071877"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,music_department","tt0049189,tt0059956,tt0057345,tt0054452"
3,nm0000004,John Belushi,1949,1982,"actor,soundtrack,writer","tt0072562,tt0077975,tt0078723,tt0080455"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0083922,tt0050976,tt0050986,tt0060827"


## Schema

### Star Schema: Review Sentiments by Film

![Star Schema: Film Review Sentiments](images/star-film-review-sentiment.png 'Star Schema: Film Review Sentiments')

### Star Schema: Film Review Sentiments by Cast

![Star Schema: Cast Review Sentiments](images/star-cast-film-review-sentiment.png 'Star Schema: Cast Review Sentiments')

## Data Dictionary

### `fact_films_review_sentiments`

| Attribute                	| Type          	| Nullable   	| Value                                                             	|
|--------------------------	|---------------	|------------	|-------------------------------------------------------------------	|
| `date_id`                	| `timestamp`      	| `not null` 	| yyyy-mm-dd, foreign key to `dim_dates`                            	|
| `film_id`                	| `varchar(32)` 	| `not null` 	| foreign key to `dim_films`                                        	|
| `review_id`              	| `varchar(32)` 	| `not null` 	| foreign key to `dim_reviews`                                      	|
| `review_sentiment_class` 	| `short`       	| `null`     	| [-1, 1] value representing the sentiment, classified by our model 	|
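
A minimal sketch of how this fact table could feed the "Top 10 Films" and "Worst 10 Films" report items. It assumes films are ranked by the mean of `review_sentiment_class`; that ranking rule is our assumption, since the report definition is not given here. The fact rows are hypothetical.

```python
import pandas as pd

# Hypothetical fact rows: one row per classified review.
df_fact = pd.DataFrame({
    "film_id": ["tt9000001", "tt9000001", "tt9000002", "tt9000003"],
    "review_sentiment_class": [1, 1, -1, 1],
})

# Mean sentiment per film, best first; ties are broken by film_id.
df_ranking = (
    df_fact.groupby("film_id", as_index=False)["review_sentiment_class"]
    .mean()
    .sort_values(["review_sentiment_class", "film_id"], ascending=[False, True])
)
top_films = df_ranking.head(10)["film_id"].tolist()
worst_films = df_ranking.tail(10)["film_id"].tolist()[::-1]
print(top_films)
print(worst_films)
```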

### `fact_cast_review_sentiments`

| Attribute                	| Type          	| Nullable   	| Value                                                             	|
|--------------------------	|---------------	|------------	|-------------------------------------------------------------------	|
| `date_id`                	| `timestamp`      	| `not null` 	| yyyy-mm-dd, foreign key to `dim_dates`                            	|
| `cast_id`                	| `varchar(32)` 	| `not null` 	| foreign key to `dim_cast`                                         	|
| `review_id`              	| `varchar(32)` 	| `not null` 	| foreign key to `dim_reviews`                                      	|
| `review_sentiment_class` 	| `short`       	| `null`     	| [-1, 1] value representing the sentiment, classified by our model 	|
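
A minimal sketch of how rows for this table could be derived, assuming each review of a film is attributed to every cast member of that film. That attribution rule is an assumption made for illustration, and all rows are hypothetical.

```python
import pandas as pd

# Hypothetical classified reviews (one row per review of a film).
df_film_facts = pd.DataFrame({
    "film_id": ["tt9000001", "tt9000001", "tt9000002"],
    "review_id": ["r1", "r2", "r3"],
    "review_sentiment_class": [1, -1, 1],
})
# Hypothetical cast membership, shaped like `dim_cast`.
df_dim_cast = pd.DataFrame({
    "cast_id": ["nm9000001", "nm9000002", "nm9000001"],
    "film_id": ["tt9000001", "tt9000001", "tt9000002"],
})

# Each review is repeated once per cast member of the reviewed film.
df_cast_facts = df_film_facts.merge(df_dim_cast, on="film_id", how="inner")[
    ["cast_id", "review_id", "review_sentiment_class"]
]
print(len(df_cast_facts))  # 5
```

The row count grows with cast size: two reviews times two cast members for the first film, plus one review times one cast member for the second, gives five rows.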

### `dim_dates`

| Attribute                	| Type          	| Nullable   	| Value                             	|
|--------------------------	|---------------	|------------	|-----------------------------------	|
| `date_id`                	| `timestamp`      	| `not null` 	| primary key                       	|
| `year`                	| `int`         	| `not null` 	| `year` of timestamp in int format 	|
| `month`             		| `int`         	| `not null` 	| `month` of timestamp in int format	|
| `day`              		| `int`         	| `not null` 	| `day` of timestamp in int format  	|
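
The `year`, `month` and `day` columns are derived from `date_id`. A minimal sketch, assuming `date_id` holds a calendar date; the helper name `to_dim_date_row` is hypothetical and not part of the project code.

```python
from datetime import datetime


def to_dim_date_row(date_id: datetime) -> dict:
    """Build one `dim_dates` row from a timestamp."""
    return {
        "date_id": date_id,
        "year": date_id.year,
        "month": date_id.month,
        "day": date_id.day,
    }


row = to_dim_date_row(datetime(2020, 5, 20))
print(row["year"], row["month"], row["day"])  # 2020 5 20
```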

### `dim_films`

| Attribute                	| Type          	| Nullable   	| Value                             	|
|--------------------------	|---------------	|------------	|-----------------------------------	|
| `film_id`                	| `varchar(32)` 	| `not null` 	| `imdb_id`                           	|
| `title`                	| `varchar(256)` 	| `not null` 	| the original title of the film    	|
| `release_year`         	| `int`         	| `not null` 	| `year` in which the film was released	|

### `dim_cast`

| Attribute                	| Type          	| Nullable   	| Value                             	|
|--------------------------	|---------------	|------------	|-----------------------------------	|
| `cast_id`                	| `varchar(32)` 	| `not null` 	| `cast_id` in the IMDB database    	|
| `film_id`                	| `varchar(32)` 	| `not null` 	| `imdb_id`                         	|
| `full_name`            	| `varchar(256)` 	| `not null` 	| the name of the actor or actress   	|

### `dim_reviews`

| Attribute                	| Type          	| Nullable   	| Value                             	|
|--------------------------	|---------------	|------------	|-----------------------------------	|
| `review_id`            	| `varchar(32)` 	| `not null` 	| `review_id` in the TMDb database  	|
| `film_id`                	| `varchar(32)` 	| `not null` 	| `imdb_id`                         	|
| `text`                	| `varchar(32)` 	| `not null` 	| the text of the review            	|

## Data Pipeline


In this section we set out the procedures required to:

1. Create Data Engineering Pipeline (EC2, Redshift, Airflow)
2. Upload the DAG files
3. Provide Airflow Access Links
4. Allow teardown of the infrastructure when complete.

Once installed, the pipeline DAGs should be presented by Airflow as in the following images:

![Model Training DAG](images/dag-model_training.png "Model Training DAG")

![Model Data Warehouse DAG](images/dag-data_warehouse.png "Data Warehouse DAG")

![Sentiment Report](images/report.png "Sentiment Report")

### Infrastructure

#### Components

Our infrastructure is composed of:

| # 	| Component        	| Type       	| Tech       	| Description                                                                                                                                                                                        	|
|---	|------------------	|------------	|------------	|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------	|
| 1 	| Data Lake        	| Persistent 	| S3         	| Cached source of raw data, used for the purpose of this project.                                                                                                                                   	|
| 2 	| Airflow          	| Ephemeral  	| EC2        	| Backfills and schedules the DAGs that (a) run the sentiment analysis over TMDb reviews and (b) populate our Data Warehouse's dimension and fact tables. 	|
| 3 	| Classifier Model 	| Ephemeral  	| TensorFlow 	| Trained once every time we need to rebuild our Data Warehouse.                                                                                      	|
| 4 	| Data Warehouse   	| Ephemeral  	| Redshift   	| Populated with the dimensions (films, cast, reviews) and the review sentiment fact tables.                                                          	|

#### AWS

##### Dependencies

In [21]:
!pip install -q -U boto3

In [22]:
import boto3
import time

##### Settings

In [23]:
dw_bucket_name = 'hudsonmendes-dw'

In [24]:
dw_region = 'eu-west-2'

##### Key Pair

In [25]:
ec2 = boto3.client('ec2')

In [26]:
ec2_pem_name = 'udacity-data-eng-nano'
ec2_pem_path = f'./{ec2_pem_name}.pem'
if os.path.isfile(ec2_pem_path):
    os.remove(ec2_pem_path)
ec2_keypair = ec2.create_key_pair(KeyName=ec2_pem_name)
with open(ec2_pem_path, 'w+') as ec2_pem_file:
    ec2_pem_file.write(str(ec2_keypair['KeyMaterial']))
!chmod 400 {ec2_pem_path}
ec2_pem_path

'./udacity-data-eng-nano.pem'
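
The cell above writes the key material first and restricts the file permissions afterwards with `chmod 400`. As a minimal sketch, assuming a POSIX file system, the file can instead be created with owner-read-only permissions from the start. The helper name `write_private_key` is hypothetical and not part of the original notebook.

```python
import os


def write_private_key(path, key_material):
    """Write key material to `path`, creating the file as owner-read-only (0o400)."""
    # Remove any stale key first, mirroring the behaviour of the notebook cell.
    if os.path.exists(path):
        os.remove(path)
    # O_EXCL makes the call fail if the file reappears between remove and open.
    descriptor = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o400)
    with os.fdopen(descriptor, 'w') as key_file:
        key_file.write(key_material)
    return path
```

With the variables defined in the notebook, a call could look like `write_private_key(ec2_pem_path, str(ec2_keypair['KeyMaterial']))`.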

#### Airflow

##### EC2 Resources

In [27]:
ec2 = boto3.client('ec2')
iam = boto3.client('iam')

In [28]:
ec2_role = iam.create_role(
    Path='/',
    RoleName='hudsonmendes-dw-movie-sentiment-analysis-ec2-role',
    Description='',
    MaxSessionDuration=3600,
    AssumeRolePolicyDocument="""{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}""")['Role']
ec2_role['Arn']

'arn:aws:iam::673962978035:role/hudsonmendes-dw-movie-sentiment-analysis-ec2-role'

In [29]:
for ec2_policy in  [
        'arn:aws:iam::aws:policy/AmazonS3FullAccess']:
    assert iam.attach_role_policy(
        RoleName=ec2_role['RoleName'],
        PolicyArn=ec2_policy)['ResponseMetadata']['HTTPStatusCode'] == 200

In [30]:
ec2_instance_profile = iam.create_instance_profile(InstanceProfileName='hudsonmendes-dw-movie-sentiment-analysis-ec2-instance-profile')['InstanceProfile']
iam.get_waiter('instance_profile_exists').wait(InstanceProfileName='hudsonmendes-dw-movie-sentiment-analysis-ec2-instance-profile')
ec2_instance_profile['Arn']

'arn:aws:iam::673962978035:instance-profile/hudsonmendes-dw-movie-sentiment-analysis-ec2-instance-profile'

In [31]:
assert iam.add_role_to_instance_profile(
    InstanceProfileName=ec2_instance_profile['InstanceProfileName'],
    RoleName=ec2_role['RoleName'])['ResponseMetadata']['HTTPStatusCode'] == 200

In [32]:
ec2_sg = ec2.create_security_group(
    Description='Allows 22 traffic',
    GroupName='Airflow')
ec2.authorize_security_group_ingress(CidrIp='0.0.0.0/0', FromPort=22, ToPort=22, GroupId=ec2_sg['GroupId'], IpProtocol='TCP')
ec2.authorize_security_group_ingress(CidrIp='0.0.0.0/0', FromPort=8080, ToPort=8080, GroupId=ec2_sg['GroupId'], IpProtocol='TCP')
ec2.authorize_security_group_ingress(CidrIp='0.0.0.0/0', FromPort=5555, ToPort=5555, GroupId=ec2_sg['GroupId'], IpProtocol='TCP')
ec2.authorize_security_group_ingress(CidrIp='0.0.0.0/0', FromPort=3306, ToPort=3306, GroupId=ec2_sg['GroupId'], IpProtocol='TCP')
ec2_sg['GroupId']

'sg-011d6915a736cba3d'

In [33]:
time.sleep(30) #wait instance profile...
ec2_ami_id = 'ami-00f6a0c18edb19300'
ec2_spot = ec2.request_spot_instances(
    AvailabilityZoneGroup='eu-west-2',
    InstanceCount=1,
    LaunchSpecification={
        'SecurityGroupIds': [ec2_sg['GroupId']],
        'EbsOptimized': False,
        'KeyName': ec2_pem_name,
        'ImageId': ec2_ami_id,
        'InstanceType': 't3.large',
        'IamInstanceProfile': {
            'Arn': ec2_instance_profile['Arn']
        },
        "BlockDeviceMappings": [
            {
                "DeviceName": "/dev/sda1",
                "Ebs": {
                        "DeleteOnTermination": True,
                        "VolumeSize": 30,
                        "Encrypted": False,
                        "VolumeType": "gp2"
                }
            }
        ],
    },
    SpotPrice='0.04',
    Type='one-time',
    InstanceInterruptionBehavior='terminate'
)
ec2_spot_id = ec2_spot['SpotInstanceRequests'][0]['SpotInstanceRequestId']
ec2.get_waiter('spot_instance_request_fulfilled').wait(
    SpotInstanceRequestIds=[ec2_spot_id])
ec2_spot_id

'sir-i2trw3cj'
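
The fixed `time.sleep(30)` in the cell above gives the new instance profile time to become visible before the spot request is made. One alternative, sketched here under the assumption that a request made too early fails with an exception, is to retry the call with a growing delay. The `retry` helper and its parameters are hypothetical names introduced for illustration.

```python
import time


def retry(action, attempts=5, delay_seconds=2.0, backoff=2.0,
          retry_on=(Exception,), sleep=time.sleep):
    """Call `action` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    wait = delay_seconds
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            # Re-raise the last error once the attempts are used up.
            if attempt == attempts:
                raise
            sleep(wait)
            wait *= backoff  # wait longer before each new attempt
```

In the notebook this could wrap the spot request, for example `retry(lambda: ec2.request_spot_instances(...), retry_on=(botocore.exceptions.ClientError,))`, so the cell proceeds as soon as the profile is usable instead of always waiting 30 seconds.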

In [34]:
ec2_vm_id = ec2.describe_spot_instance_requests(SpotInstanceRequestIds=[ ec2_spot_id ]) \
    ['SpotInstanceRequests'] \
    [0] \
    ['InstanceId']
ec2.get_waiter('instance_status_ok').wait(InstanceIds=[ ec2_vm_id ])
ec2_vm_id

'i-04b7627debeca3b4e'

In [35]:
ec2_ip = ec2.allocate_address(Domain='vpc')
[ ec2_ip['PublicIp'], ec2_ip['AllocationId'] ]

['35.178.34.71', 'eipalloc-07695956a81fad65a']

In [36]:
ec2_vm_ip = ec2.associate_address(
     InstanceId = ec2_vm_id,
     AllocationId = ec2_ip["AllocationId"])
[ ec2_vm_ip['AssociationId'] ]

['eipassoc-0fdfa147657699b21']

##### SSH Installation

In [37]:
!pip install -q -U paramiko
!pip install -q -U scp

In [38]:
import paramiko
import scp
from tqdm.notebook import tqdm

In [39]:
def get_ssh(ip, pem_path):
    print(f"ssh -i udacity-data-eng-nano.pem ubuntu@{ip}")
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(hostname=ip, username='ubuntu', pkey=paramiko.RSAKey.from_private_key_file(pem_path))
    return ssh

In [40]:
def run_via_ssh(
        ip,
        pem_path,
        commands,
        display_output=False):
    
    ssh = get_ssh(ip, pem_path)
    try:
        for command in tqdm(commands):
            stdin, stdout, stderr = ssh.exec_command(command)
            exit_status = stdout.channel.recv_exit_status()
            if exit_status == 0:
                print(('ok', command))
                if display_output:
                    output_buffer = stdout.read().decode('utf-8')
                    if output_buffer:
                        print(f">>> {output_buffer}")
            else:
                error_buffer = stderr.read().decode('utf-8')
                print(('failed', command))
                print(f"!!! {error_buffer}")
    finally:
        ssh.close()

In [41]:
run_via_ssh(
    ip=ec2_ip['PublicIp'],
    pem_path=ec2_pem_path,
    commands=[
        'sudo apt-get -y update',

        'sudo apt-get install -y libmysqlclient-dev mysql-server',
        f"sudo mysql -e \"SET GLOBAL explicit_defaults_for_timestamp = 1;\"",
        f"sudo mysql -e \"CREATE DATABASE airflow;\"",
        f"sudo mysql -e \"CREATE USER 'airflow'@'localhost' IDENTIFIED BY 'airflow';\"",
        f"sudo mysql -e \"GRANT ALL PRIVILEGES ON airflow.* TO 'airflow'@'localhost';\"",

        'sudo apt-get install -y python3 python3-pip python3-setuptools',
        'sudo pip3 install -U pip',
        'sudo pip3 install -U apache-airflow',
        'sudo pip3 install -U apache-airflow[mysql]',
        'sudo pip3 install -U apache-airflow[celery]'
    ])

ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71


('ok', 'sudo apt-get -y update')
('ok', 'sudo apt-get install -y libmysqlclient-dev mysql-server')
('ok', 'sudo mysql -e "SET GLOBAL explicit_defaults_for_timestamp = 1;"')
('ok', 'sudo mysql -e "CREATE DATABASE airflow;"')
('ok', 'sudo mysql -e "CREATE USER \'airflow\'@\'localhost\' IDENTIFIED BY \'airflow\';"')
('ok', 'sudo mysql -e "GRANT ALL PRIVILEGES ON airflow.* TO \'airflow\'@\'localhost\';"')
('ok', 'sudo apt-get install -y python3 python3-pip python3-setuptools')
('ok', 'sudo pip3 install -U pip')
('ok', 'sudo pip3 install -U apache-airflow')
('ok', 'sudo pip3 install -U apache-airflow[mysql]')
('ok', 'sudo pip3 install -U apache-airflow[celery]')



In [42]:
run_via_ssh(ip=ec2_ip['PublicIp'],
    pem_path=ec2_pem_path,
    commands=[
        'airflow initdb',
        'sudo apt-get install -y crudini',
        "crudini --set ~/airflow/airflow.cfg core load_examples False",
        "crudini --set ~/airflow/airflow.cfg core load_default_connections False",
        "crudini --set ~/airflow/airflow.cfg core sql_alchemy_conn 'mysql://airflow:airflow@localhost/airflow'",
        "crudini --set ~/airflow/airflow.cfg core executor LocalExecutor",
        "crudini --set ~/airflow/airflow.cfg core sql_alchemy_schema airflow",
        "crudini --set ~/airflow/airflow.cfg scheduler min_file_process_interval 10",
        "crudini --set ~/airflow/airflow.cfg scheduler dag_dir_list_interval 60",
        'airflow initdb',
    ])

ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71


('ok', 'airflow initdb')
('ok', 'sudo apt-get install -y crudini')
('ok', 'crudini --set ~/airflow/airflow.cfg core load_examples False')
('ok', 'crudini --set ~/airflow/airflow.cfg core load_default_connections False')
('ok', "crudini --set ~/airflow/airflow.cfg core sql_alchemy_conn 'mysql://airflow:airflow@localhost/airflow'")
('ok', 'crudini --set ~/airflow/airflow.cfg core executor LocalExecutor')
('ok', 'crudini --set ~/airflow/airflow.cfg core sql_alchemy_schema airflow')
('ok', 'crudini --set ~/airflow/airflow.cfg scheduler min_file_process_interval 10')
('ok', 'crudini --set ~/airflow/airflow.cfg scheduler dag_dir_list_interval 60')
('ok', 'airflow initdb')



In [43]:
run_via_ssh(
    ip=ec2_ip['PublicIp'],
    pem_path=ec2_pem_path,
    commands=[
        'mkdir -p ~/airflow/dags'
    ])

ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71


('ok', 'mkdir -p ~/airflow/dags')



In [44]:
run_via_ssh(
    ip=ec2_ip['PublicIp'],
    pem_path=ec2_pem_path,
    commands=[
        'sudo pip3 install -U tensorflow',
        'sudo pip3 install -U pandas',
        'sudo pip3 install -U scikit-learn',
        'sudo pip3 install -U numpy',
        'sudo pip3 install -U psycopg2-binary',
        'sudo pip3 install -U requests',
        'sudo pip3 install -U boto3',
        'sudo pip3 install -U matplotlib',
        'sudo pip3 install -U reportlab'
    ])

ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71


('ok', 'sudo pip3 install -U tensorflow')
('ok', 'sudo pip3 install -U pandas')
('ok', 'sudo pip3 install -U scikit-learn')
('ok', 'sudo pip3 install -U numpy')
('ok', 'sudo pip3 install -U psycopg2-binary')
('ok', 'sudo pip3 install -U requests')
('ok', 'sudo pip3 install -U boto3')
('ok', 'sudo pip3 install -U matplotlib')
('ok', 'sudo pip3 install -U reportlab')



In [45]:
run_via_ssh(
    ip=ec2_ip['PublicIp'],
    pem_path=ec2_pem_path,
    commands=[
        'airflow scheduler -D',
        'airflow worker -D',
        'airflow flower -D',
        'airflow webserver -p 8080 -D'
    ])

ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71


('ok', 'airflow scheduler -D')
('ok', 'airflow worker -D')
('ok', 'airflow flower -D')
('ok', 'airflow webserver -p 8080 -D')



Run the next cell to get the **Airflow** web links for:

1. **Web Server**: the Airflow UI, used to manage your DAGs
2. **Flower Interface**: shows all workers running in this Airflow instance

**Important** these links may take a few minutes to become available.
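
Because the links take a while to come up, it can help to poll them rather than refresh the browser by hand. The following is a minimal sketch using only the standard library; the function names, the five-minute limit and the ten-second polling interval are assumptions, not part of the original notebook.

```python
import time
import urllib.error
import urllib.request


def is_reachable(url, timeout_seconds=5):
    """Return True if an HTTP request to `url` receives any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds):
            return True
    except urllib.error.HTTPError:
        # The server answered, even though the status code signals an error.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout or DNS failure: not up yet.
        return False


def wait_until_reachable(url, max_wait_seconds=300, poll_seconds=10,
                         probe=is_reachable, sleep=time.sleep,
                         clock=time.monotonic):
    """Poll `url` until `probe` succeeds or `max_wait_seconds` have passed."""
    deadline = clock() + max_wait_seconds
    while True:
        if probe(url):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

For example, `wait_until_reachable(f"http://{ec2_ip['PublicIp']}:8080")` would return `True` once the web server responds, or `False` after roughly five minutes without a response.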

##### Web Access

In [46]:
print(f"SSH      : ssh -i udacity-data-eng-nano.pem ubuntu@{ec2_ip['PublicIp']}")
print(f"WebServer: http://{ec2_ip['PublicIp']}:8080")
print(f"Flower   : http://{ec2_ip['PublicIp']}:5555")

SSH      : ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71
WebServer: http://35.178.34.71:8080
Flower   : http://35.178.34.71:5555


#### Redshift

##### Dependencies

In [47]:
!pip install -q -U ipython-sql
!pip install -q -U psycopg2-binary
!pip install -q -U boto3

In [48]:
import psycopg2
import boto3

##### Redshift Resources

In [49]:
ec2 = boto3.client('ec2')
iam = boto3.client('iam')
redshift = boto3.client('redshift')

In [50]:
redshift_role = iam.create_role(
    Path='/',
    RoleName='hudsonmendes-dw-movie-sentiment-analysis-redshift-role',
    Description='',
    MaxSessionDuration=3600,
    AssumeRolePolicyDocument="""{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}""")['Role']
redshift_role['Arn']

'arn:aws:iam::673962978035:role/hudsonmendes-dw-movie-sentiment-analysis-redshift-role'

In [51]:
for redshift_policy in  [
        'arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess']:
    assert iam.attach_role_policy(
        RoleName=redshift_role['RoleName'],
        PolicyArn=redshift_policy)['ResponseMetadata']['HTTPStatusCode'] == 200

In [52]:
redshift_sg = ec2.create_security_group(
    Description='Allows 5439 traffic',
    GroupName='Redshift')
ec2.authorize_security_group_ingress(CidrIp='0.0.0.0/0', FromPort=5439, ToPort=5439, GroupId=redshift_sg['GroupId'], IpProtocol='TCP')
redshift_sg['GroupId']

'sg-06a001c5d960e2b0f'

In [53]:
redshift_ip = ec2.allocate_address(Domain='vpc')
[ redshift_ip['PublicIp'], redshift_ip['AllocationId'] ]

['18.130.22.131', 'eipalloc-0fe91c84526b61174']

In [54]:
redshift_host = redshift_ip['PublicIp']
redshift_port = 5439
redshift_user = 'redshift'
redshift_pass = 'r4d$hifT'
(redshift_host, redshift_port, redshift_user, redshift_pass)

('18.130.22.131', 5439, 'redshift', 'r4d$hifT')

In [55]:
redshift_cluster = redshift.create_cluster(
    DBName='dw_movie_review_sentiment',
    ClusterIdentifier='hudsonmendes-redshift',
    ClusterType='multi-node',
    NodeType='ra3.4xlarge',
    NumberOfNodes=2,
    MasterUsername=redshift_user,
    MasterUserPassword=redshift_pass,
    VpcSecurityGroupIds=[ redshift_sg['GroupId'] ],
    IamRoles=[ redshift_role['Arn'] ],
    AvailabilityZone=f'{dw_region}a',
    ElasticIp=redshift_host,
    PubliclyAccessible=True,
    Encrypted=False)['Cluster']
redshift.get_waiter('cluster_available').wait(ClusterIdentifier=redshift_cluster['ClusterIdentifier'])
redshift_cluster['ClusterIdentifier']

'hudsonmendes-redshift'

##### Connecting

In [56]:
%load_ext sql

In [57]:
redshift_db = 'dw_movie_review_sentiment'

In [58]:
redshift_url = f'postgresql://{redshift_user}:{redshift_pass}@{redshift_host}:5439/{redshift_db}'
print(f"redshift_url = '{redshift_url}'")
%sql $redshift_url

redshift_url = 'postgresql://redshift:r4d$hifT@18.130.22.131:5439/dw_movie_review_sentiment'


##### Creating Tables

In [59]:
%%sql
create table if not exists public.dim_dates
(
    date_id timestamp without time zone not null,
    year    integer not null,
    month   integer not null,
    day     integer,
    constraint dim_dates_pkey primary key (date_id)
)

 * postgresql://redshift:***@18.130.22.131:5439/dw_movie_review_sentiment
Done.


[]

In [60]:
%%sql
create table if not exists public.dim_films
(
    film_id      varchar(32)  not null,
    date_id      timestamp    not null,
    title        varchar(256) not null,
    constraint dim_films_pkey primary key (film_id),
    constraint dim_films_fkey_dates foreign key (date_id) references dim_dates (date_id)
)

 * postgresql://redshift:***@18.130.22.131:5439/dw_movie_review_sentiment
Done.


[]

In [61]:
%%sql
create table if not exists public.dim_cast
(
    cast_id    varchar(32)  not null,
    film_id    varchar(32)  not null,
    full_name  varchar(256) not null,
    constraint dim_cast_pkey primary key (cast_id),
    constraint dim_cast_fkey_films foreign key (film_id) references dim_films (film_id)
)

 * postgresql://redshift:***@18.130.22.131:5439/dw_movie_review_sentiment
Done.


[]

In [62]:
%%sql
create table if not exists public.dim_reviews
(
    review_id varchar(32)    not null,
    film_id   varchar(32)    not null,
    text      varchar(40000) not null,
    constraint dim_reviews_pkey primary key (review_id),
    constraint dim_reviews_fkey_films foreign key (film_id) references dim_films (film_id)
)

 * postgresql://redshift:***@18.130.22.131:5439/dw_movie_review_sentiment
Done.


[]

In [63]:
%%sql
create table if not exists public.fact_film_review_sentiments
(
    date_id                timestamp without time zone not null,
    film_id                varchar(32) not null,
    review_id              varchar(32) not null,
    review_sentiment_class integer     not null,
    constraint fact_films_review_sentiments_pkey primary key (date_id, film_id, review_id),
    constraint fact_films_review_sentiments_fkey_dates foreign key (date_id) references dim_dates (date_id),
    constraint fact_films_review_sentiments_fkey_films foreign key (film_id) references dim_films (film_id),
    constraint fact_films_review_sentiments_fkey_reviews foreign key (review_id) references dim_reviews (review_id)
)

 * postgresql://redshift:***@18.130.22.131:5439/dw_movie_review_sentiment
Done.


[]

In [64]:
%%sql
create table if not exists public.fact_cast_review_sentiments
(
    date_id                timestamp without time zone not null,
    cast_id                varchar(32) not null,
    review_id              varchar(32) not null,
    review_sentiment_class integer     not null,
    constraint fact_cast_review_sentiments_pkey primary key (date_id, cast_id, review_id),
    constraint fact_cast_review_sentiments_fkey_dates foreign key (date_id) references dim_dates (date_id),
    constraint fact_cast_review_sentiments_fkey_cast foreign key (cast_id) references dim_cast (cast_id),
    constraint fact_cast_review_sentiments_fkey_reviews foreign key (review_id) references dim_reviews (review_id)
)

 * postgresql://redshift:***@18.130.22.131:5439/dw_movie_review_sentiment
Done.


[]
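
To illustrate how the fact and dimension tables combine for a report item such as "Top 10 Films", here is a small sketch that runs a ranking query against an in-memory SQLite database standing in for Redshift. Only the columns the query needs are kept (`date_id` is omitted), the rows are made up, and treating `review_sentiment_class` as `1` for positive and `0` for negative is an assumption.

```python
import sqlite3

# In-memory SQLite stands in for Redshift; every row below is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table dim_films (
    film_id varchar(32)  not null primary key,
    title   varchar(256) not null
);
create table fact_film_review_sentiments (
    film_id                varchar(32) not null references dim_films (film_id),
    review_id              varchar(32) not null,
    review_sentiment_class integer     not null,
    primary key (film_id, review_id)
);
""")
conn.executemany(
    "insert into dim_films values (?, ?)",
    [("tt0000001", "Film A"), ("tt0000002", "Film B"), ("tt0000003", "Film C")])
conn.executemany(
    "insert into fact_film_review_sentiments values (?, ?, ?)",
    [("tt0000001", "r1", 1), ("tt0000001", "r2", 1),
     ("tt0000002", "r3", 1), ("tt0000002", "r4", 0),
     ("tt0000003", "r5", 0)])


def top_films(connection, limit=10):
    """Rank films by their share of positive reviews, then by review count."""
    query = """
        select f.title,
               avg(s.review_sentiment_class) as positive_share,
               count(*)                      as review_count
        from fact_film_review_sentiments s
        join dim_films f on f.film_id = s.film_id
        group by f.film_id, f.title
        order by positive_share desc, review_count desc
        limit ?
    """
    return connection.execute(query, (limit,)).fetchall()


for title, positive_share, review_count in top_films(conn):
    print(title, positive_share, review_count)
```

Reversing the sort order of `positive_share` would give the "Worst 10 Films" list in the same way.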

### DAGs

#### Sentiment Classifier Training DAG

The **Sentiment Classifier Training DAG** is a simplified DAG that trains a simple model, devised in the **[model training notebook](Movie%20Review%20Classifier%20Trainer.ipynb)**, which reaches 90+% accuracy at classifying movie reviews as positive or negative.

The model training process has been coded into a single Airflow Operator. An obvious improvement would be to break the model-building process into separate steps with [TFX (TensorFlow Extended)](https://www.tensorflow.org/tfx), which is outside the scope of this project.

#### Data Warehouse Building DAG

The **Data Warehouse Building DAG** is a larger DAG put together to run through all the procedures described by the [data pipeline notebook](http://localhost:8888/notebooks/DAG%20-%20Movie%20Review%20Sentiment%20Data%20Warehouse.ipynb). It uses the *sentiment classifier model* previously trained by the other DAG in order to classify sentiments of reviews and build our dimension and fact tables.
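
As a rough illustration of the classification step, the sketch below turns classifier scores into rows shaped like `fact_film_review_sentiments`. It assumes the classifier outputs a probability that a review is positive, that a threshold of `0.5` separates the two classes, and that `1` means positive. None of this is confirmed by the notebook, and `ScoredReview` and `to_fact_rows` are hypothetical names.

```python
from datetime import datetime
from typing import NamedTuple


class ScoredReview(NamedTuple):
    """A review together with the (assumed) probability that it is positive."""
    date_id: datetime
    film_id: str
    review_id: str
    positive_probability: float


def to_fact_rows(scored_reviews, threshold=0.5):
    """Map each scored review to (date_id, film_id, review_id, review_sentiment_class)."""
    rows = []
    for review in scored_reviews:
        # Assumption: class 1 = positive, class 0 = negative.
        sentiment_class = 1 if review.positive_probability >= threshold else 0
        rows.append((review.date_id, review.film_id, review.review_id, sentiment_class))
    return rows
```

The resulting tuples follow the column order of the fact table created earlier, so they could be passed to a bulk insert or written to a staging file for loading.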

In [65]:
!pip install -q -U Pygments

In [66]:
def print_python_code(code):
    from pygments import highlight
    from pygments.lexers import PythonLexer
    from pygments.formatters import HtmlFormatter
    import IPython
    formatter = HtmlFormatter()
    return IPython.display.HTML('<style type="text/css">{}</style>{}'.format(
        formatter.get_style_defs('.highlight'),
        highlight(code, PythonLexer(), formatter)))

In [67]:
def upload_dag_file(ip, pem_path, file_name, family_dir='airflow', display_file=True):
    code = None
    file_path = f'{file_name}'
    file_dir = os.path.dirname(file_path)
    if display_file:
        with open(file_path) as f:
            code = f.read()
        if not code:
            return None
    
    ssh = get_ssh(ip, pem_path)
    try:
        remote_file_dir = f'~/{family_dir}/{file_dir}'
        print(f"scp -i udacity-data-eng-nano.pem '{file_path}' 'ubuntu@{ip}:{remote_file_dir}'")
        scp_client = scp.SCPClient(ssh.get_transport())
        scp_client.put(files=[file_path], remote_path=remote_file_dir, preserve_times=True)
    finally:
        ssh.close()
    
    if display_file:
        return print_python_code(code)
    else:
        return file_path

#### Airflow Folder Setup

In [68]:
run_via_ssh(
    ip=ec2_ip['PublicIp'],
    pem_path=ec2_pem_path,
    commands=[
        f"mkdir -p ~/pipe/imdb_sentiment_analysis/data",
        f"mkdir -p ~/pipe/imdb_sentiment_analysis/model",
        f"mkdir -p ~/pipe/imdb_sentiment_analysis/working",
        f"mkdir -p ~/pipe/imdb_sentiment_analysis/images"
    ])

ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71


('ok', 'mkdir -p ~/pipe/imdb_sentiment_analysis/data')
('ok', 'mkdir -p ~/pipe/imdb_sentiment_analysis/model')
('ok', 'mkdir -p ~/pipe/imdb_sentiment_analysis/working')
('ok', 'mkdir -p ~/pipe/imdb_sentiment_analysis/images')



#### Airflow Variables Setup

In [69]:
run_via_ssh(
    ip=ec2_ip['PublicIp'],
    pem_path=ec2_pem_path,
    commands=[
        f"airflow variables --set 'model_training_data_url' '{imdb_sentiment_url}'",
        f"airflow variables --set 'model_training_data_path' '/home/ubuntu/pipe/imdb_sentiment_analysis/data/imdb-sentiment.zip'",
        f"airflow variables --set 'model_output_path' '/home/ubuntu/pipe/imdb_sentiment_analysis/model'",
        f"airflow variables --set 'model_max_length' '30'",
        f"airflow variables --set 'model_vocab_size' '10000'",
        f"airflow variables --set 'model_emb_dims' '64'",
        f"airflow variables --set 'model_lstm_units' '128'",
        f"airflow variables --set 'model_training_batch_size' '50'",
        f"airflow variables --set 'model_training_epochs' '5'",
        f"airflow variables --set 's3_redshift_iam_role' '{redshift_role['Arn']}'",
        f"airflow variables --set 's3_redshift_region' '{dw_region}'",
        f"airflow variables --set 's3_movies_source_url' 'https://hudsonmendes-datalake.s3.eu-west-2.amazonaws.com/kaggle/hudsonmendes/tmdb-movies-with-imdb_id.zip'",
        f"airflow variables --set 's3_movie_reviews_source_url' 'https://hudsonmendes-datalake.s3.eu-west-2.amazonaws.com/kaggle/hudsonmendes/tmdb-reviews.zip'",
        f"airflow variables --set 'http_cast_source_url' 'https://datasets.imdbws.com/title.principals.tsv.gz'",
        f"airflow variables --set 'http_cast_names_source_url' 'https://datasets.imdbws.com/name.basics.tsv.gz'",
        f"airflow variables --set 's3_staging_bucket' '{dw_bucket_name}'",
        f"airflow variables --set 's3_staging_movies_folder' 'tmdb-movies-with-imdb_id'",
        f"airflow variables --set 's3_staging_movie_reviews_folder' 'tmdb-reviews'",
        f"airflow variables --set 's3_staging_cast_path' 'imdb-cast/imdb-cast.tsv.gz'",
        f"airflow variables --set 's3_staging_cast_names_path' 'imdb-cast/imdb-cast-names.tsv.gz'",
        f"airflow variables --set 'redshift_db' '{redshift_db}'",
        f"airflow variables --set 'path_images_dir' '/home/ubuntu/pipe/imdb_sentiment_analysis/images'",
        f"airflow variables --set 'path_working_dir' '/home/ubuntu/pipe/imdb_sentiment_analysis/working'",
        f"airflow variables --set 's3_report_bucket' '{dw_bucket_name}'",
        f"airflow variables --set 's3_report_folder' 'reports-sentiment'"
    ])

ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71


('ok', "airflow variables --set 'model_training_data_url' 'https://hudsonmendes-datalake.s3.eu-west-2.amazonaws.com/kaggle/jcblaise/imdb-sentiment.zip'")
('ok', "airflow variables --set 'model_training_data_path' '/home/ubuntu/pipe/imdb_sentiment_analysis/data/imdb-sentiment.zip'")
('ok', "airflow variables --set 'model_output_path' '/home/ubuntu/pipe/imdb_sentiment_analysis/model'")
('ok', "airflow variables --set 'model_max_length' '30'")
('ok', "airflow variables --set 'model_vocab_size' '10000'")
('ok', "airflow variables --set 'model_emb_dims' '64'")
('ok', "airflow variables --set 'model_lstm_units' '128'")
('ok', "airflow variables --set 'model_training_batch_size' '50'")
('ok', "airflow variables --set 'model_training_epochs' '5'")
('ok', "airflow variables --set 's3_redshift_iam_role' 'arn:aws:iam::673962978035:role/hudsonmendes-dw-movie-sentiment-analysis-redshift-role'")
('ok', "airflow variables --set 's3_redshift_region' 'eu-west-2'")
('ok', "airflow variables --set 's3_mo

#### Airflow Connections Setup

In [70]:
run_via_ssh(
    ip=ec2_ip['PublicIp'],
    pem_path=ec2_pem_path,
    commands=[
        f"""airflow connections --add \
            --conn_id 'redshift_dw' \
            --conn_type 'postgres' \
            --conn_host '{redshift_host}' \
            --conn_login '{redshift_user}' \
            --conn_password '{redshift_pass}' \
            --conn_schema 'public' \
            --conn_port 5439""",
    ])

ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71


('ok', "airflow connections --add             --conn_id 'redshift_dw'             --conn_type 'postgres'             --conn_host '18.130.22.131'             --conn_login 'redshift'             --conn_password 'r4d$hifT'             --conn_schema 'public'             --conn_port 5439")
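
The variable and connection commands above place values directly between single quotes in a shell string, which would break if a value ever contained a single quote. A possible safeguard, sketched here with the standard library's `shlex.quote`, is to build the commands from a dictionary. The helper name `airflow_variable_commands` is hypothetical; the `airflow variables --set` syntax is the one already used in this notebook.

```python
import shlex


def airflow_variable_commands(variables):
    """Build one `airflow variables --set` command per entry, shell-quoting key and value."""
    return [
        f"airflow variables --set {shlex.quote(str(key))} {shlex.quote(str(value))}"
        for key, value in variables.items()
    ]


commands = airflow_variable_commands({
    'model_max_length': 30,
    'model_vocab_size': 10000,
})
for command in commands:
    print(command)
```

The returned list can be passed as the `commands` argument of `run_via_ssh`. Values made only of safe characters are left unquoted, while anything containing spaces, `$` or quotes is wrapped so the remote shell receives it literally.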



#### Uploading Operators & DAGs

In [71]:
for filename in tqdm(os.listdir('./dags/')):
    if filename.endswith(".py"):
        upload_dag_file(
            ip=ec2_ip['PublicIp'],
            pem_path=ec2_pem_path,
            file_name=f'dags/{filename}',
            display_file=False)

ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71
scp -i udacity-data-eng-nano.pem 'dags/model_training_dag.py' 'ubuntu@35.178.34.71:~/airflow/dags'
ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71
scp -i udacity-data-eng-nano.pem 'dags/s3_pdf_sentiment_report_operator.py' 'ubuntu@35.178.34.71:~/airflow/dags'
ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71
scp -i udacity-data-eng-nano.pem 'dags/data_warehouse_dag.py' 'ubuntu@35.178.34.71:~/airflow/dags'
ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71
scp -i udacity-data-eng-nano.pem 'dags/tensorflow_postgres_model_classification_operator.py' 'ubuntu@35.178.34.71:~/airflow/dags'
ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71
scp -i udacity-data-eng-nano.pem 'dags/s3_download_and_upload_operator.py' 'ubuntu@35.178.34.71:~/airflow/dags'
ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71
scp -i udacity-data-eng-nano.pem 'dags/tensorflow_review_classifier_trainer_operator.py' 'ubuntu@35.178.34.71:~/airflow/dags'
ssh

#### Uploading DAG Assets

In [72]:
for filename in tqdm(os.listdir('./images/')):
    if filename.endswith(".png"):
        upload_dag_file(
            ip=ec2_ip['PublicIp'],
            pem_path=ec2_pem_path,
            file_name=f'images/{filename}',
            family_dir='pipe/imdb_sentiment_analysis',
            display_file=False)

ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71
scp -i udacity-data-eng-nano.pem 'images/dag-data_warehouse.png' 'ubuntu@35.178.34.71:~/pipe/imdb_sentiment_analysis/images'
ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71
scp -i udacity-data-eng-nano.pem 'images/star-schema.png' 'ubuntu@35.178.34.71:~/pipe/imdb_sentiment_analysis/images'
ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71
scp -i udacity-data-eng-nano.pem 'images/film_review_distro_fig.png' 'ubuntu@35.178.34.71:~/pipe/imdb_sentiment_analysis/images'
ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71
scp -i udacity-data-eng-nano.pem 'images/header.png' 'ubuntu@35.178.34.71:~/pipe/imdb_sentiment_analysis/images'
ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71
scp -i udacity-data-eng-nano.pem 'images/tmdb-logo.png' 'ubuntu@35.178.34.71:~/pipe/imdb_sentiment_analysis/images'
ssh -i udacity-data-eng-nano.pem ubuntu@35.178.34.71
scp -i udacity-data-eng-nano.pem 'images/robot-sentiment-analysis.png' 'ubuntu@35

### Teardown

In [74]:
question = """
Terminating Infrastructure.
Do you want to continue? [y/n]"""
if input(question).lower() == 'y':

    # clients
    ec2 = boto3.client('ec2')
    iam = boto3.client('iam')
    redshift = boto3.client('redshift')
    
    # redshift
    redshift.delete_cluster(ClusterIdentifier=redshift_cluster['ClusterIdentifier'], SkipFinalClusterSnapshot=True)
    redshift.get_waiter('cluster_deleted').wait(ClusterIdentifier=redshift_cluster['ClusterIdentifier'])
    ec2.delete_security_group(GroupId=redshift_sg['GroupId'])
    ec2.release_address(AllocationId=redshift_ip['AllocationId'])
    for attached_policy in iam.list_attached_role_policies(RoleName=redshift_role['RoleName'])['AttachedPolicies']:
        iam.detach_role_policy(RoleName=redshift_role['RoleName'], PolicyArn=attached_policy['PolicyArn'])
    for policy_name in iam.list_role_policies(RoleName=redshift_role['RoleName'])['PolicyNames']:
        iam.delete_role_policy(RoleName=redshift_role['RoleName'], PolicyName=policy_name)
    iam.delete_role(RoleName=redshift_role['RoleName'])
    
    # ec2
    ec2.cancel_spot_instance_requests(SpotInstanceRequestIds=[ ec2_spot_id ])
    ec2.terminate_instances(InstanceIds=[ ec2_vm_id ])
    ec2.get_waiter('instance_terminated').wait(InstanceIds=[ ec2_vm_id ])
    ec2.release_address(AllocationId=ec2_ip['AllocationId'])
    ec2.delete_security_group(GroupId=ec2_sg['GroupId'])
    ec2.delete_key_pair(KeyName=ec2_pem_name)
    for attached_policy in iam.list_attached_role_policies(RoleName=ec2_role['RoleName'])['AttachedPolicies']:
        iam.detach_role_policy(RoleName=ec2_role['RoleName'], PolicyArn=attached_policy['PolicyArn'])
    for policy_name in iam.list_role_policies(RoleName=ec2_role['RoleName'])['PolicyNames']:
        iam.delete_role_policy(RoleName=ec2_role['RoleName'], PolicyName=policy_name)
    iam.remove_role_from_instance_profile(InstanceProfileName=ec2_instance_profile['InstanceProfileName'], RoleName=ec2_role['RoleName'])
    iam.delete_instance_profile(InstanceProfileName=ec2_instance_profile['InstanceProfileName'])
    iam.delete_role(RoleName=ec2_role['RoleName'])
    print('Termination completed.')


Terminating Infrastructure.
Do you want to continue? [y/n]y
Termination completed.
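
In the teardown cell above, the first call that raises stops the cell and can leave the remaining resources in place. A minimal sketch of a more forgiving approach is to run each step independently and report what failed afterwards. The `run_teardown_steps` helper is a hypothetical name; steps that depend on an earlier one (for example deleting a security group that is still in use) may still fail and need a second run.

```python
def run_teardown_steps(steps):
    """Run every (label, action) pair and return the list of (label, error) failures."""
    failures = []
    for label, action in steps:
        try:
            action()
        except Exception as error:
            # Keep going so that later resources are still released.
            failures.append((label, error))
    return failures
```

A call could look like `run_teardown_steps([('release ec2 address', lambda: ec2.release_address(AllocationId=ec2_ip['AllocationId'])), ...])`, printing the returned failures so they can be cleaned up manually.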
