### Notebook info:
> **Movie Streaming Data Visualization** <br/>
> *Movies_Streaming_Analysis.ipynb* Version 1.0 <br/>
> Launched on: September 15th, 2021; by Luiz Gustavo Fagundes Malpele. <br/>

> *Movies_Streaming_Analysis.ipynb* Version 1.1 <br/>
> Last updated on: October 19th, 2021; by Luiz Gustavo Fagundes Malpele. <br/>
> Corrects major problems with data preprocessing and organizes the code into modules. <br/>

<br/>
<div class="alert alert-block alert-success">

### To-Do:

**High-priority:**
- [ ] Generate a Data Visualization Template
- [ ] Generate Histograms for the main quantitative variables

**Streamlit:**
- [ ] Begin the User Interface


    
</div>
<br/><hr/>

<br/>

### Package/library dependencies:

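- **matplotlib**, for plots and graphs
- **numpy**, for floating-point ranges and array operations
- **plotly**, for interactive plots and plotting aesthetics
- **pandas**, for reading CSV files into data frames
- **datetime**, for time-related operations
- **scikit-learn**, for feature scaling, t-SNE, and k-means clustering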

In [1]:
#import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
from datetime import datetime, timedelta
import plotly.express as px 
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

<br/>

### Importing **Functions** library:

In [2]:
%run -i ../libraries/Preprocessing_Library.ipynb
%run -i ../libraries/Functionalities_Library.ipynb

<br/><hr/>
## **Initializations**

In [3]:
user_vector_path = '../data/user_vector_baseline.csv'
movies_data_path = '../data/movies_streaming_platforms.csv'
movies_cleaned_data_path = '../data/movies_streaming_platforms_cleaned.csv'

<br/><hr/>
## **Data Acquisition**

In [4]:
#movies_data = prepare_movies_dataframe(path = movies_data_path, to_csv = True)

In [5]:
movies_data = read_cleaned_movies_dataframe(path = movies_cleaned_data_path)

In [6]:
movies_data = filter_by_platforms(df = movies_data, hulu_display = True, netflix_display = None, 
                                  prime_video_display = None, disney_display= None, display_all = None)

In [7]:
movies_data = get_column_dummies_from_list(movies_data, column_name = 'genres', merge_dummies = True)
movies_data = get_column_dummies_from_list(movies_data, column_name = 'age', merge_dummies = True)
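
The `get_column_dummies_from_list` helper lives in the preprocessing library and is not shown in this notebook. As a rough illustration only, the sketch below shows one way a list-valued column such as `genres` could be turned into 0/1 indicator columns with plain pandas. The function name `dummies_from_list_column` and the toy data are assumptions made for this example, not the library's actual implementation.

```python
import pandas as pd

def dummies_from_list_column(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """Hypothetical helper: add one 0/1 column per distinct value in a list-valued column."""
    # explode() yields one row per list element and repeats the original index label
    exploded = df[column_name].explode()
    # One-hot encode each element, then collapse back to one row per original index label
    dummies = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
    # Join on the index so the original columns are kept next to the indicators
    return df.join(dummies)

# Made-up data for illustration
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': [['Drama', 'Crime'], ['Comedy'], ['Drama']],
})
encoded = dummies_from_list_column(toy, 'genres')
print(encoded[['title', 'Comedy', 'Crime', 'Drama']])
```

A plain string column such as `age` passes through `explode()` unchanged, so the same sketch also produces one indicator column per age rating.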

In [8]:
filter_by_age(df = movies_data, display_7 = None, display_13 = True, 
              display_16 = True, display_18 = None, display_pg = None)

index,title,year,age,imdb,rotten_tomatoes,netflix,hulu,prime_video,disney,directors,...,Sport,Talk-Show,Thriller,War,Western,13+,16+,18+,7+,all
80,The Social Network,2010,13+,77,84,True,True,False,False,[David Fincher],...,0,0,0,0,0,1,0,0,0,0
167,Hunt for the Wilderpeople,2016,13+,79,80,True,True,False,False,[Taika Waititi],...,0,0,0,0,0,1,0,0,0,0
170,The Artist,2011,13+,79,79,True,True,False,False,[Michel Hazanavicius],...,0,0,0,0,0,1,0,0,0,0
199,The Da Vinci Code,2006,13+,66,78,True,True,False,False,[Ron Howard],...,0,0,1,0,0,1,0,0,0,0
248,Angels & Demons,2009,13+,67,76,True,True,False,False,[Ron Howard],...,0,0,1,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4649,Fafner in the Azure: Dead Aggressor - Heaven a...,2010,16+,63,43,False,True,False,False,[Other Directors],...,0,0,0,0,0,0,1,0,0,0
4667,Women & Sometimes Men,2017,16+,43,41,False,True,False,False,[Lesley Demetriades],...,0,0,0,0,0,0,1,0,0,0
4668,Hard Romanticker,2011,13+,61,41,False,True,True,False,[Su-yeon Gu],...,0,0,0,0,0,1,0,0,0,0
4697,Man-Eating Python,2017,16+,47,31,False,True,False,False,[Mark Beech],...,0,0,0,0,0,0,1,0,0,0


|Variable|DataFrame|Description|Data Type|Example|
|:---|:---:|:----|:---:|:---:|
|title|movies_data|Movies' title|string|The Irishman|
|year|movies_data|Movie's release year|int64|2019|
|age|movies_data|Minimum suggested viewing age (parental guidance)|string|18+|
|imdb|movies_data|IMDb score|float64|7.8|
|rotten_tomatoes|movies_data|Rotten Tomatoes score|float64|98.0|
|netflix|movies_data|Movie is available on Netflix|bool|True|
|hulu|movies_data|Movie is available on Hulu|bool|False|
|prime_video|movies_data|Movie is available on Prime Video|bool|False|
|disney|movies_data|Movie is available on Disney+|bool|False|
|directors|movies_data|Movie's directors|object(list)|[Martin Scorsese]|
|genres|movies_data|Movie's genres|object(list)|[Biography, Crime, Drama]|
|language|movies_data|Movie's original languages|object(list)|[English, Italian, Latin, Spanish, German]|
|runtime|movies_data|Movie's length in minutes|float64|209.0|
|Group|tsne_df|Movie's cluster number|object(int)|18|
|X|tsne_df|The X-coordinate in the t-distributed stochastic neighbor embedding|object(float64)|-64.7094|
|Y|tsne_df|The Y-coordinate in the t-distributed stochastic neighbor embedding|object(float64)|43.7906|



In [20]:
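#Reads the single-column CSV and converts it to a Series
#(DataFrame.squeeze replaces the `squeeze` keyword, which was removed from read_csv in pandas 2.0)
user_vector = pd.read_csv(user_vector_path, index_col = 0).squeeze('columns')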
user_vector.update({'rotten_tomatoes':75, 'Drama': 1, '16+': 1, 'age': '16+'})

In [22]:
user_vector

index
title                              User Vector
year                                      2020
age                                        16+
imdb                                        80
rotten_tomatoes                             75
netflix                                   None
hulu                                      None
prime_video                               None
disney                                    None
directors          [Elizabeth Allen Rosenbaum]
genres                                 [Drama]
country                        [United States]
language                             [English]
runtime                                    120
Action                                       0
Adventure                                    0
Animation                                    0
Biography                                    0
Comedy                                       0
Crime                                        0
Documentary                                  0
Drama  

In [10]:
def get_features_column_list(df:pd.DataFrame):
    #Columns excluded from the clustering features (platform flags, text and list-valued columns)
    remove_list = ['hulu', 'disney', 'netflix', 'prime_video', 'title', 'country', 
                   'genres', 'runtime', 'age', 'directors', 'language']
    
    #Keeps every other column in its original order; names absent from the DataFrame are simply ignored
    column_list = [column for column in df.columns if column not in remove_list]
        
    #Returns column list which will serve as features
    return column_list

In [11]:
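def get_integer_features(df:pd.DataFrame):
    #Select the features on the basis of which you want to cluster
    column_list = get_features_column_list(df = df)
    features = df[column_list].astype(int)
    #Relies on the notebook-level user_vector Series
    features_user = user_vector[column_list].astype(int)
    #Adds the user vector as the last row (pd.concat replaces DataFrame.append, removed in pandas 2.0)
    features = pd.concat([features, features_user.to_frame().T])
    #Note: df itself is not modified; reassigning it here would not affect the caller's DataFrame
    return features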

In [13]:
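#Keeps the feature matrix in its own variable so movies_data retains title, genres and age
features = get_integer_features(df = movies_data)
#Adds the user vector as the last row of movies_data so its rows line up with features
movies_data = pd.concat([movies_data, user_vector.to_frame().T])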

In [None]:
def generate_tsne_transfomation(features:pd.DataFrame):
    #Scaling the data
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(features)

    #Using TSNE
    tsne = TSNE(n_components=2)
    transformed_genre = tsne.fit_transform(scaled_data)

    #Kmeans
    cluster = KMeans(n_clusters=23)
    group_pred = cluster.fit_predict(scaled_data)

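    #Combines coordinates, cluster labels and movie metadata
    #(relies on the notebook-level movies_data having the same row order as features)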
    tsne_df = pd.DataFrame(np.column_stack((transformed_genre, group_pred, movies_data['title'], 
                                            movies_data['genres'], movies_data['age'])),
                                            columns=['X','Y','Group','Title', 'Genres', 'Age'])
    
    return tsne_df
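
The function above scales the features, embeds them in two dimensions with t-SNE, and clusters the scaled data (not the embedding) with k-means. The sketch below runs the same three steps on made-up random data so the shapes of the intermediate results are easy to inspect. The toy matrix, the `perplexity` value, the number of clusters, and the fixed `random_state` are assumptions chosen for the example, not settings taken from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: 60 "movies" described by 5 numeric features
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(60, 5))

# Step 1: standardize each feature to zero mean and unit variance
scaled = StandardScaler().fit_transform(toy_features)

# Step 2: two-dimensional embedding, used only for plotting
# (perplexity must be smaller than the number of samples)
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(scaled)

# Step 3: cluster labels computed in the scaled feature space
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)

print(embedding.shape, labels.shape)
```

Fixing `random_state` keeps the layout and the cluster labels stable between runs, which helps when comparing recommendations across notebook sessions.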

In [None]:
tsne_df = generate_tsne_transfomation(features = features)

In [None]:
def get_recommendations(df:pd.DataFrame, refresher_counter:int = 0):
    '''
    Calculates the Euclidean distance from the User Vector to every other t-SNE point and returns the 10 closest movies.
    Each increment of refresher_counter skips the previous results and returns the next 10 closest points.
    '''
    #Selects the single User Vector row explicitly instead of calling float() on a Series
    user_row = df[df['Title'] == 'User Vector'].iloc[0]
    tsne_user_x = float(user_row['X'])
    tsne_user_y = float(user_row['Y'])
    #Works on a copy so that the caller's DataFrame does not gain a UserDistance column
    df = df.copy()
    df['UserDistance'] = ((df['X'].astype(float) - tsne_user_x)**2 + (df['Y'].astype(float) - tsne_user_y)**2)**0.5
    df = df.sort_values(by=['UserDistance'])
    #Row 0 is the User Vector itself (distance 0), so every page starts one row later
    start = 1 + 10*refresher_counter
    recommendations_df = df[start:start + 10]
    return recommendations_df
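
To make the paging logic concrete, here is a small worked example on made-up coordinates. The helper name `closest_page`, the `page_size` parameter, and the toy points are assumptions for illustration. Unlike the function above, this variant drops the anchor row by title instead of skipping the first sorted row.

```python
import pandas as pd

def closest_page(points: pd.DataFrame, anchor_title: str,
                 page: int = 0, page_size: int = 10) -> pd.DataFrame:
    """Hypothetical helper: return one page of the rows nearest to the anchor row."""
    anchor = points.loc[points['Title'] == anchor_title].iloc[0]
    # Euclidean distance from every point to the anchor
    distance = ((points['X'] - anchor['X'])**2 + (points['Y'] - anchor['Y'])**2)**0.5
    ranked = points.assign(UserDistance=distance).sort_values('UserDistance')
    # Remove the anchor itself, then slice out the requested page
    ranked = ranked[ranked['Title'] != anchor_title]
    start = page * page_size
    return ranked.iloc[start:start + page_size]

# Made-up points: distances to the origin are A=5, B=1, C=2, D=10
toy = pd.DataFrame({
    'Title': ['User Vector', 'A', 'B', 'C', 'D'],
    'X': [0.0, 3.0, 1.0, 0.0, 6.0],
    'Y': [0.0, 4.0, 0.0, 2.0, 8.0],
})
print(closest_page(toy, 'User Vector', page=0, page_size=2)['Title'].tolist())  # ['B', 'C']
print(closest_page(toy, 'User Vector', page=1, page_size=2)['Title'].tolist())  # ['A', 'D']
```

Note that distances in a t-SNE embedding mostly reflect local neighborhoods, so the ranking is most meaningful for the closest few points.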

In [None]:
get_recommendations(df = tsne_df, refresher_counter = 2)

In [None]:
def genrate_tsne_visualization(df:pd.DataFrame):
    #Uses the DataFrame passed as an argument rather than the notebook-level tsne_df
    tsne_user_x = df[df['Title'] == 'User Vector']['X']
    tsne_user_y = df[df['Title'] == 'User Vector']['Y']

    # Build figure
    fig = go.Figure()

    # Add scatter trace with medium sized markers
    fig.add_trace(
        go.Scatter(
            mode = 'markers',
            x = df['X'],
            y = df['Y'],
            customdata = df,
            marker = dict(
                color = df['Group'],
                colorscale='Viridis'
            ),
            hovertemplate =
                '<b>%{customdata[3]} </b><br><br>' +
                'Location: (%{customdata[0]:.2f},%{customdata[1]:.2f})<br>' +
                'Genres: %{customdata[4]}<br>' +
                'Age: %{customdata[5]}<br>' + 
                'Group: %{customdata[2]}<extra></extra>',
            showlegend = False))

    fig.add_trace(
        go.Scatter(
            mode = 'markers',
            marker_symbol = 'circle-open-dot',
            marker_line_width = 5,
            x = tsne_user_x,
            y = tsne_user_y,
            marker = dict(size=[40],
            color = 'red'),
            name = 'User Profile'
        ))

    #Standard Figure Layout for Data Visualization
    fig.update_layout(
        dict(
            height=600, 
            width=1000,
            plot_bgcolor = "#F1F1F3",
            paper_bgcolor = 'white',
            xaxis_title = 'Dimension 1',
            yaxis_title = 'Dimension 2',
            title={'text' : 't-SNE Results and User Vector Location',
                   'x':0.5,
                   'xanchor': 'center'})
    )

    return fig

In [None]:
genrate_tsne_visualization(df = tsne_df)

<br/><hr/>
## **Regressor Mapping Approach**

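In [None]:
#Converts the user vector into a single-row DataFrame
user_vector = user_vector.to_frame().T
#Converts the user's feature values into a single-row DataFrame (t-SNE input)
features_user = features_user.to_frame().T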