# MyMDB Analyzer
This notebook drives the IMDb scraping pipeline, which consists of the web scraper and the database connector, and provides a GUI for interactive visualization.

![IMDb Logo](https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/IMDB_Logo_2016.svg/440px-IMDB_Logo_2016.svg.png)

[Wikipedia IMDb Definition](https://en.wikipedia.org/wiki/IMDb):

IMDb (an acronym for `I`nternet `M`ovie `D`ata`b`ase) is an online database of information related to films, television series, podcasts, home videos, video games, and streaming content online – including cast, production crew and personal biographies [...]

This notebook visually explores rankings of movies and actors.

## USAGE
Step through this notebook manually, cell by cell, so that everything is executed in order.

#### OVERCOMING TECHNICAL DIFFICULTIES

In [1]:
# workaround for using event loops in notebooks (ipykernel already runs the global, non-reentrant one)
import nest_asyncio
nest_asyncio.apply()
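
To illustrate why this workaround is needed, the following standard-library sketch reproduces the underlying error: a blocking call on an event loop that is already running (as the ipykernel loop is) raises a `RuntimeError`. The coroutine names `fetch_value` and `notebook_cell` are made up for this illustration. `nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.

```python
import asyncio


async def fetch_value():
    # Hypothetical stand-in for an asynchronous scraping call.
    await asyncio.sleep(0)
    return 42


async def notebook_cell():
    # We are inside a running loop here, just like code in a notebook cell.
    loop = asyncio.get_running_loop()
    coro = fetch_value()
    try:
        # The stock event loop rejects a nested blocking call.
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(notebook_cell())
print(message)
```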

#### Imports

In [2]:
import db
import analyze
import ui
import numpy as np
import matplotlib.pyplot as plt

connecting to db via conn_str: DRIVER=SQL Server;SERVER=localhost;PORT=1433;DATABASE=MyMDB;UID=SA;PWD=Pr0dRdyPw!


### SCRAPING
The main challenges were respecting the site's scraping rules in robots.txt, handling timeouts (which occurred even though we followed those rules), dealing with missing or inhomogeneously formatted data (much of which was simply omitted), and asynchronous caching in Python.
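
As a rough sketch of how these concerns can fit together, the example below checks a URL against robots.txt rules, retries on timeouts with exponential backoff, and caches successful results. It is not the scraper used in this notebook: the robots.txt content, the URLs, and the names `fetch_politely`, `flaky_fetch` and `page_cache` are assumptions made for illustration, and the download is simulated so the example runs offline.

```python
import asyncio
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; the real ones live in the site's robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""

robots = RobotFileParser()
robots.parse(EXAMPLE_ROBOTS_TXT.splitlines())

page_cache = {}  # url -> page content, filled on the first successful fetch


async def fetch_politely(url, fetch, retries=3, base_delay=0.01):
    """Fetch `url` via `fetch`, honoring robots rules, the cache, and timeouts."""
    if not robots.can_fetch("*", url):
        raise PermissionError(f"robots.txt disallows {url}")
    if url in page_cache:
        return page_cache[url]
    for attempt in range(retries):
        try:
            page = await asyncio.wait_for(fetch(url), timeout=1.0)
        except asyncio.TimeoutError:
            # Back off exponentially before trying again.
            await asyncio.sleep(base_delay * 2 ** attempt)
        else:
            page_cache[url] = page
            return page
    raise TimeoutError(f"gave up on {url} after {retries} attempts")


calls = {"count": 0}


async def flaky_fetch(url):
    # Simulated download that times out twice before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise asyncio.TimeoutError
    return f"<html>{url}</html>"


async def demo():
    first = await fetch_politely("https://example.org/chart/top", flaky_fetch)
    second = await fetch_politely("https://example.org/chart/top", flaky_fetch)
    return first, second


first, second = asyncio.run(demo())
print(calls["count"], first == second)
```

The second call is served from the cache, so the simulated download runs only for the first request.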

In [3]:
movies = []
import_ui = ui.init_ui(movies)
import_ui

VBox(children=(Tab(children=(VBox(children=(ToggleButtons(description='playlist', options=('Top 250', 'Roulett…

In [4]:
movies = import_ui.movies
# try:
#     movies = import_ui.movies
# except AttributeError:
#     input("press enter to continue")
#     movies = import_ui.movies

AttributeError: 'VBox' object has no attribute 'movies'

### INSERTION
This part was done purely in the SQL Server backend: a stored procedure parses the insertion data and takes care of NaN values and the like.

A trigger updates the stored average recursively from its previous value, instead of recomputing it over several tables for each new insert.
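
The arithmetic behind such a recursive update of an average can be sketched as follows. The actual trigger lives in SQL Server and is not shown in this notebook, so this Python version with the hypothetical helper `update_average` only illustrates the idea: the new mean follows from the old mean, the old count, and the new value alone.

```python
def update_average(old_avg, old_count, new_rating):
    """Return the mean after adding one rating, without revisiting old ratings."""
    new_count = old_count + 1
    new_avg = old_avg + (new_rating - old_avg) / new_count
    return new_avg, new_count


# Example ratings, fed in one at a time as if they were successive inserts.
ratings = [9.2, 9.0, 8.3]
avg, count = 0.0, 0
for rating in ratings:
    avg, count = update_average(avg, count, rating)

print(round(avg, 4), count)
```

The result matches `sum(ratings) / len(ratings)` up to floating-point rounding, but each step needs only constant work.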

In [None]:
for movie in movies:
    try:
        db.insert_movie(**movie)
    except Exception as e:
        print(f"issue at {movie}\n{e}")
"no duplicates found" if db.check_no_duplicates() else "duplicates in actors detected"

### READING
For further analysis, the movies are queried back from the database and converted to dataframes.

In [None]:
df_movies = analyze.get_movie_dataframe()
df_actors = analyze.get_actors_dataframe()
df_movie_actors = analyze.get_movie_actors_dataframe()
df_actor_features = analyze.get_actor_feature_dataframe()
df_summary = (
    analyze.get_summary_dataframe()
    .sort_values(by="name")
)

display("movies", df_movies.head())
display("actors", df_actors.head())
display("movie_actors", df_movie_actors.head(5))
display("joined data", df_summary.head(10))
display(
    df_summary
    .drop(["movie_id", "actor_id"], axis=1)
    .describe()
)
display("interesting features for actor analysis", df_actor_features.head())
display(df_actor_features.describe())

'movies'

| | id | title | year | genre | rating |
|---|---|---|---|---|---|
| 0 | 2 | Der Pate | 1972 | Krimi | 9.2 |
| 1 | 3 | The Dark Knight | 2008 | Action | 9.0 |
| 2 | 4 | Der Pate 2 | 1974 | Krimi | 9.0 |
| 3 | 7 | Der Herr der Ringe: Die Rückkehr des Königs | 2003 | Action | 8.9 |
| 4 | 8 | Pulp Fiction | 1994 | Krimi | 8.8 |


'actors'

| | id | name | age | avg_rating |
|---|---|---|---|---|
| 0 | 11 | Marlon Brando | 99 | 8.6 |
| 1 | 12 | Al Pacino | 83 | 8.7 |
| 2 | 13 | James Caan | 83 | 9.2 |
| 3 | 14 | Diane Keaton | 77 | 9.1 |
| 4 | 15 | Richard S. Castellano | 89 | 9.2 |


'movie_actors'

| | movie_id | actor_id |
|---|---|---|
| 0 | 2 | 11 |
| 1 | 2 | 12 |
| 2 | 2 | 13 |
| 3 | 2 | 14 |
| 4 | 2 | 15 |


'joined data'

| | name | age | avg_rating | movie_id | actor_id | title | year | genre | rating |
|---|---|---|---|---|---|---|---|---|---|
| 883 | Aamir Khan | 58 | 8.3 | 299 | 683 | Taare Zameen Par: Ein Stern auf Erden | 2007 | Drama | 8.2 |
| 878 | Aamir Khan | 58 | 8.3 | 283 | 683 | 3 Idiots | 2009 | Komödie | 8.3 |
| 873 | Aamir Khan | 58 | 8.3 | 124 | 683 | Dangal: Die Hoffnung auf den großen Sieg | 2016 | Action | 8.2 |
| 48 | Aaron Eckhart | 55 | 9.0 | 3 | 21 | The Dark Knight | 2008 | Action | 9.0 |
| 1180 | Abdel Ahmed Ghili | -1 | 8.0 | 347 | 1536 | Hass | 1995 | Krimi | 8.0 |
| 1208 | Adam Baldwin | 61 | 8.2 | 375 | 1633 | Full Metal Jacket | 1987 | Drama | 8.2 |
| 243 | Ade | -1 | 8.2 | 119 | 659 | Snatch: Schweine und Diamanten | 2000 | Komödie | 8.2 |
| 1100 | Adolphe Menjou | 133 | 8.4 | 273 | 1272 | Wege zum Ruhm | 1957 | Drama | 8.4 |
| 493 | Adrien Brody | 50 | 8.3 | 33 | 215 | Der Pianist | 2002 | Biografie | 8.5 |
| 499 | Adrien Brody | 50 | 8.3 | 184 | 215 | Grand Budapest Hotel | 2014 | Abenteuer | 8.1 |


| | age | avg_rating | year |
|---|---|---|---|
| count | 1250.0 | 1250.0 | 1250.0 |
| mean | 51.2696 | 8.288 | 1986.736 |
| std | 42.490108 | 0.220778 | 25.167857 |
| min | -1.0 | 8.0 | 1921.0 |
| 25% | -1.0 | 8.1 | 1967.0 |
| 50% | 56.0 | 8.25 | 1994.0 |
| 75% | 84.0 | 8.4 | 2007.0 |
| max | 155.0 | 9.2 | 2023.0 |


'interesting features for actor analysis'

| | name | age | appearances | avg_rating |
|---|---|---|---|---|
| 0 | William Sadler | 73 | 1 | 9.2 |
| 1 | Tim Robbins | 64 | 1 | 9.2 |
| 2 | Richard S. Castellano | 89 | 1 | 9.2 |
| 3 | John Marley | 115 | 1 | 9.2 |
| 4 | James Caan | 83 | 1 | 9.2 |


| | age | appearances | avg_rating |
|---|---|---|---|
| count | 613.0 | 613.0 | 613.0 |
| mean | 76.482871 | 1.378467 | 8.37863 |
| std | 27.848947 | 0.861481 | 0.208911 |
| min | 18.0 | 1.0 | 8.0 |
| 25% | 55.0 | 1.0 | 8.2 |
| 50% | 73.0 | 1.0 | 8.3 |
| 75% | 97.0 | 1.0 | 8.5 |
| max | 155.0 | 9.0 | 9.2 |
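
The 'joined data' output above has the shape of a join of actors and movies over the `movie_actors` link table. The sketch below reproduces that shape with an in-memory SQLite database and a few rows copied from the output. The schema and the query are assumptions for illustration; they are not the contents of the `analyze` module or of the SQL Server backend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema, mirroring the columns visible in the dataframe output.
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, year INTEGER,
                         genre TEXT, rating REAL);
    CREATE TABLE actors (id INTEGER PRIMARY KEY, name TEXT, age INTEGER,
                         avg_rating REAL);
    CREATE TABLE movie_actors (movie_id INTEGER, actor_id INTEGER);
""")

conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?, ?, ?)",
    [(2, "Der Pate", 1972, "Krimi", 9.2),
     (3, "The Dark Knight", 2008, "Action", 9.0)],
)
conn.executemany(
    "INSERT INTO actors VALUES (?, ?, ?, ?)",
    [(11, "Marlon Brando", 99, 8.6),
     (12, "Al Pacino", 83, 8.7),
     (21, "Aaron Eckhart", 55, 9.0)],
)
conn.executemany(
    "INSERT INTO movie_actors VALUES (?, ?)",
    [(2, 11), (2, 12), (3, 21)],
)

# One row per (actor, movie) pair, sorted by actor name like df_summary.
rows = conn.execute("""
    SELECT a.name, a.age, a.avg_rating, ma.movie_id, ma.actor_id,
           m.title, m.year, m.genre, m.rating
    FROM movie_actors AS ma
    JOIN actors AS a ON a.id = ma.actor_id
    JOIN movies AS m ON m.id = ma.movie_id
    ORDER BY a.name
""").fetchall()

for row in rows:
    print(row)
```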


### VISUALIZATION

In [None]:
analyze.scatter_actor_features()

## OUTLOOK

- scrape random movies and try to classify whether a given actor could really make it, using the top 250 as a reference (the top 250 contains interesting outliers, e.g. actors who appear in only a single movie)
- try to predict the IMDb rating