### Notebook info:
> **Movie Streaming Data Visualization** <br/>
> *Movies_Streaming_Analysis.ipynb* Version 1.0 <br/>
> Launched on: September 15th, 2021; by Luiz Gustavo Fagundes Malpele. <br/>

> *Movies_Streaming_Analysis.ipynb* Version 1.1 <br/>
> Last updated on: October 19th, 2021; by Luiz Gustavo Fagundes Malpele. <br/>
> Corrects major problems with data preprocessing and organizes the code in modules. <br/>

<br/>
<div class="alert alert-block alert-success">

### To-Do:

**High-priority:**
- [ ] Generate a Data Visualization Template
- [ ] Generate Histograms for the main quantitative variables

**Streamlit:**
- [ ] Begin the User Interface


    
</div>
<br/><hr/>

<br/>

### Package/library dependencies:

- **matplotlib**, for plots and graphs
- **numpy**, for floating-point ranges and array stacking
- **plotly**, for plotting aesthetics
- **pandas**, for reading CSV files into data frames
- **datetime**, for time-related operations
- **scikit-learn**, for feature scaling, t-SNE, and k-means clustering

In [1]:
#import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
from datetime import datetime, timedelta
import plotly.express as px 
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

<br/>

### Importing **Functions** library:

In [2]:
%run -i ../libraries/Preprocessing_Library.ipynb
%run -i ../libraries/Functionalities_Library.ipynb

<br/><hr/>
## **Initializations**

In [3]:
user_vector_path = '../data/user_vector_baseline.csv'
movies_data_path = '../data/movies_streaming_platforms.csv'
movies_cleaned_data_path = '../data/movies_streaming_platforms_cleaned.csv'

<br/><hr/>
## **Data Acquisition**

In [4]:
#movies_data = prepare_movies_dataframe(path = movies_data_path, to_csv = True)

In [19]:
movies_data = read_cleaned_movies_dataframe(path = movies_cleaned_data_path)

In [21]:
movies_data = filter_by_platforms(df = movies_data, hulu_display = True, netflix_display = None, 
                                  prime_video_display = None, disney_display= None, display_all = True)

In [22]:
movies_data = get_column_dummies_from_list(movies_data, column_name = 'genres', merge_dummies = True)
movies_data = get_column_dummies_from_list(movies_data, column_name = 'age', merge_dummies = True)
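
`get_column_dummies_from_list` is defined in the preprocessing library and is not shown here. As a rough illustration of the idea only, the sketch below one-hot encodes a list-valued column with `explode` and `get_dummies`. The helper name `dummies_from_list_column` and the toy frame are hypothetical, and the real function may behave differently.

```python
import pandas as pd

def dummies_from_list_column(df, column_name):
    """One-hot encode a column whose cells hold lists of labels (hypothetical helper)."""
    # explode() yields one row per (original index, label) pair
    exploded = df[column_name].explode()
    # One-hot encode every label, then collapse back to one row per original index
    dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()
    return df.join(dummies)

# Toy data, not taken from the movies dataset
toy = pd.DataFrame({
    'title': ['A', 'B'],
    'genres': [['Drama', 'Crime'], ['Drama']],
})
encoded = dummies_from_list_column(toy, 'genres')
```

Because `explode` leaves scalar cells unchanged, the same sketch also handles a plain string column such as `age`.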

In [23]:
movies_data

|index|title|year|age|imdb|rotten_tomatoes|netflix|hulu|prime_video|disney|directors|...|Sport|Talk-Show|Thriller|War|Western|13+|16+|18+|7+|all|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|The Irishman|2019|18+|7.8|98.0|True|False|False|False|[Martin Scorsese]|...|0|0|0|0|0|0|0|1|0|0|
|1|Dangal|2016|7+|8.4|97.0|True|False|False|False|[Nitesh Tiwari]|...|1|0|0|0|0|0|0|0|1|0|
|2|David Attenborough: A Life on Our Planet|2020|7+|9.0|95.0|True|False|False|False|[Alastair Fothergill, Jonathan Hughes, Keith S...|...|0|0|0|0|0|0|0|0|1|0|
|3|Lagaan: Once Upon a Time in India|2001|7+|8.1|94.0|True|False|False|False|[Ashutosh Gowariker]|...|1|0|0|0|0|0|0|0|1|0|
|4|Roma|2018|18+|7.7|94.0|True|False|False|False|[Other Directors]|...|0|0|0|1|0|0|0|1|0|0|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|9510|Most Wanted Sharks|2020|18+|6.3|14.0|False|False|False|True|[Other Directors]|...|0|0|0|0|0|0|0|1|0|0|
|9511|Doc McStuffins: The Doc Is In|2020|18+|6.3|13.0|False|False|False|True|[Chris Anthony Hamilton]|...|0|0|0|0|0|0|0|1|0|0|
|9512|Ultimate Viking Sword|2019|18+|6.3|13.0|False|False|False|True|[Other Directors]|...|0|0|0|0|0|0|0|1|0|0|
|9513|Hunt for the Abominable Snowman|2011|18+|6.3|10.0|False|False|False|True|[Dan Oliver]|...|0|0|0|0|0|0|0|1|0|0|


In [24]:
movies_data['rotten_tomatoes'] = movies_data['rotten_tomatoes'].astype(int)
movies_data['imdb'] = movies_data['imdb'].map(lambda x:x*10).astype(int)
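
`astype(int)` truncates toward zero rather than rounding, so a product that floating-point arithmetic leaves marginally below a whole number would lose a point. A minimal sketch of a more defensive variant, using toy values rather than the notebook's data:

```python
import pandas as pd

# Toy IMDb-style scores on a 0-10 scale
scores = pd.Series([7.8, 8.4, 9.0], name='imdb')

# Scale to 0-100, round to the nearest whole number, then cast to int
scaled = (scores * 10).round().astype(int)
```

The vectorized multiplication also avoids the per-element `map` with a lambda.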

In [25]:
movies_data

|index|title|year|age|imdb|rotten_tomatoes|netflix|hulu|prime_video|disney|directors|...|Sport|Talk-Show|Thriller|War|Western|13+|16+|18+|7+|all|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|The Irishman|2019|18+|78|98|True|False|False|False|[Martin Scorsese]|...|0|0|0|0|0|0|0|1|0|0|
|1|Dangal|2016|7+|84|97|True|False|False|False|[Nitesh Tiwari]|...|1|0|0|0|0|0|0|0|1|0|
|2|David Attenborough: A Life on Our Planet|2020|7+|90|95|True|False|False|False|[Alastair Fothergill, Jonathan Hughes, Keith S...|...|0|0|0|0|0|0|0|0|1|0|
|3|Lagaan: Once Upon a Time in India|2001|7+|81|94|True|False|False|False|[Ashutosh Gowariker]|...|1|0|0|0|0|0|0|0|1|0|
|4|Roma|2018|18+|77|94|True|False|False|False|[Other Directors]|...|0|0|0|1|0|0|0|1|0|0|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|9510|Most Wanted Sharks|2020|18+|63|14|False|False|False|True|[Other Directors]|...|0|0|0|0|0|0|0|1|0|0|
|9511|Doc McStuffins: The Doc Is In|2020|18+|63|13|False|False|False|True|[Chris Anthony Hamilton]|...|0|0|0|0|0|0|0|1|0|0|
|9512|Ultimate Viking Sword|2019|18+|63|13|False|False|False|True|[Other Directors]|...|0|0|0|0|0|0|0|1|0|0|
|9513|Hunt for the Abominable Snowman|2011|18+|63|10|False|False|False|True|[Dan Oliver]|...|0|0|0|0|0|0|0|1|0|0|


|Variable|DataFrame|Description|Data Type|Example|
|:---|:---:|:----|:---:|:---:|
|title|movies_data|Movie's title|string|The Irishman|
|year|movies_data|Movie's launch year|int64|2019|
|age|movies_data|Minimum suggested age (parental guidance)|string|18+|
|imdb|movies_data|IMDb score|float64|7.8|
|rotten_tomatoes|movies_data|Rotten Tomatoes score|float64|98.0|
|netflix|movies_data|Movie is available on Netflix|bool|True|
|hulu|movies_data|Movie is available on Hulu|bool|False|
|prime_video|movies_data|Movie is available on Prime Video|bool|False|
|disney|movies_data|Movie is available on Disney+|bool|False|
|directors|movies_data|Movie's directors|object(list)|[Martin Scorsese]|
|genres|movies_data|Movie's genres|object(list)|[Biography, Crime, Drama]|
|language|movies_data|Movie's original language|object(list)|[English, Italian, Latin, Spanish, German]|
|runtime|movies_data|Movie's length in minutes|float64|209.0|
|Group|tsne_df|Movie's cluster number|object(int)|18|
|X|tsne_df|The X-value on the t-distributed stochastic neighbor embedding|object(float64)|-64.7094|
|Y|tsne_df|The Y-value on the t-distributed stochastic neighbor embedding|object(float64)|43.7906|



In [10]:
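#The squeeze keyword was removed from read_csv in pandas 2.0; squeeze the single column explicitly
user_vector = pd.read_csv(user_vector_path, index_col = 0).squeeze('columns')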
user_vector.update({'rotten_tomatoes':75, 'Drama': 1, '16+': 1, 'age': '16+'})

In [11]:
#Select the features on the basis of which you want to cluster
remove_list = ['hulu', 'disney', 'netflix', 'prime_video', 'title', 'country', 
               'genres', 'runtime', 'age', 'directors', 'language']
column_list = movies_data.columns.to_list()
for remove_element in remove_list:
    column_list.remove(remove_element)
features = movies_data[column_list].astype(int)
features_user = user_vector[column_list].astype(int)
#DataFrame.append was removed in pandas 2.0; pd.concat adds the user row instead
features = pd.concat([features, features_user.to_frame().T])
movies_data = pd.concat([movies_data, user_vector.to_frame().T])
features

|index|year|imdb|rotten_tomatoes|Action|Adventure|Animation|Biography|Comedy|Crime|Documentary|...|Sport|Talk-Show|Thriller|War|Western|13+|16+|18+|7+|all|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|2019|78|98|0|0|0|1|0|1|0|...|0|0|0|0|0|0|0|1|0|0|
|1|2016|84|97|1|0|0|1|0|0|0|...|1|0|0|0|0|0|0|0|1|0|
|2|2020|90|95|0|0|0|1|0|0|1|...|0|0|0|0|0|0|0|0|1|0|
|3|2001|81|94|0|0|0|0|0|0|0|...|1|0|0|0|0|0|0|0|1|0|
|4|2018|77|94|1|0|0|0|0|0|0|...|0|0|0|1|0|0|0|1|0|0|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|9511|2020|63|13|0|0|1|0|0|0|0|...|0|0|0|0|0|0|0|1|0|0|
|9512|2019|63|13|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|1|0|0|
|9513|2011|63|10|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|1|0|0|
|9514|2019|63|10|0|0|0|0|0|0|1|...|0|0|0|0|0|0|0|0|1|0|


In [12]:
user_vector

index
title                              User Vector
year                                      2020
age                                        16+
imdb                                        80
rotten_tomatoes                             75
netflix                                   None
hulu                                      None
prime_video                               None
disney                                    None
directors          [Elizabeth Allen Rosenbaum]
genres                                 [Drama]
country                        [United States]
language                             [English]
runtime                                    120
Action                                       0
Adventure                                    0
Animation                                    0
Biography                                    0
Comedy                                       0
Crime                                        0
Documentary                                  0
Drama  

In [14]:
#Scaling the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(features)

#Using TSNE
tsne = TSNE(n_components=2)
transformed_genre = tsne.fit_transform(scaled_data)
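
scikit-learn's `TSNE` offers `fit_transform` but no `transform` for unseen rows, which is presumably why the user vector is appended to `features` before the embedding is computed. The sketch below reproduces the scale-then-embed step on synthetic data; the array sizes, `perplexity=10`, and `random_state=0` are illustrative assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix: 60 rows, columns on very different scales
toy_features = rng.normal(size=(60, 5)) * np.array([1, 10, 100, 1, 1])

# Standardize every column to zero mean and unit variance
toy_scaled = StandardScaler().fit_transform(toy_features)

# perplexity must stay below the number of rows; a fixed random_state
# keeps the layout repeatable between runs
toy_embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(toy_scaled)
```

Without a fixed `random_state`, the coordinates can differ from run to run, so plots from separate executions are not directly comparable.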

In [15]:
#Kmeans
cluster = KMeans(n_clusters=23)
group_pred = cluster.fit_predict(scaled_data)

#Consider adding the genre
tsne_df = pd.DataFrame(np.column_stack((transformed_genre, group_pred, movies_data['title'], 
                                        movies_data['genres'], movies_data['age'])),
                                        columns=['X','Y','Group','Title', 'Genres', 'Age'])
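
`np.column_stack` packs everything into a single array, and mixing floats with strings and lists forces the `object` dtype on every column (the data dictionary above lists `X` and `Y` as `object`). A possible alternative, sketched on stand-in values, builds the frame column by column so the numeric columns keep their types. All variable names here are illustrative.

```python
import numpy as np
import pandas as pd

coords = np.array([[36.6, -59.1], [78.9, 8.6]])  # stand-in for the t-SNE output
groups = np.array([1, 13])                       # stand-in for the KMeans labels
meta = pd.DataFrame({
    'title': ['The Irishman', 'Dangal'],
    'genres': [['Biography', 'Crime', 'Drama'], ['Action', 'Biography', 'Drama', 'Sport']],
    'age': ['18+', '7+'],
})

typed_df = pd.DataFrame({
    'X': coords[:, 0],
    'Y': coords[:, 1],
    'Group': groups,
    # .to_numpy() drops the index so rows align by position, as column_stack does
    'Title': meta['title'].to_numpy(),
    'Genres': meta['genres'].to_numpy(),
    'Age': meta['age'].to_numpy(),
})
```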

In [16]:
tsne_df

| |X|Y|Group|Title|Genres|Age|
|---|---|---|---|---|---|---|
|0|36.6|-59.1211|1|The Irishman|[Biography, Crime, Drama]|18+|
|1|78.8569|8.56701|13|Dangal|[Action, Biography, Drama, Sport]|7+|
|2|44.4041|-64.1767|7|David Attenborough: A Life on Our Planet|[Documentary, Biography]|7+|
|3|86.0075|25.0987|13|Lagaan: Once Upon a Time in India|[Drama, Musical, Sport]|7+|
|4|-0.126293|-75.8522|9|Roma|[Action, Drama, History, Romance, War]|18+|
|...|...|...|...|...|...|...|
|9511|25.3055|-11.7073|8|Doc McStuffins: The Doc Is In|[Animation]|18+|
|9512|4.41536|-105.57|3|Ultimate Viking Sword|[Other Genres]|18+|
|9513|12.358|-52.9384|1|Hunt for the Abominable Snowman|[Drama, History]|18+|
|9514|38.959|-74.3726|7|Women of Impact: Changing the World|[Documentary]|7+|


In [17]:
#Locate the user vector's coordinates in the embedding
user_row = tsne_df[tsne_df['Title'] == 'User Vector']
tsne_user_x = user_row['X']
tsne_user_y = user_row['Y']

In [18]:
# Build figure
fig = go.Figure()

# Add scatter trace with medium sized markers
fig.add_trace(
    go.Scatter(
        mode = 'markers',
        x = tsne_df['X'],
        y = tsne_df['Y'],
        customdata = tsne_df,
        marker = dict(
            color = tsne_df['Group'],
            colorscale='Viridis'
        ),
        hovertemplate =
            '<b>%{customdata[3]} </b><br><br>' +
            'Location: (%{customdata[0]:.2f},%{customdata[1]:.2f})<br>' +
            'Genres: %{customdata[4]}<br>' +
            'Age: %{customdata[5]}<br>' + 
            'Group: %{customdata[2]}<extra></extra>',
        showlegend = False))

# Add the user vector as a large open red marker
fig.add_trace(
    go.Scatter(
        mode = 'markers',
        x = tsne_user_x,
        y = tsne_user_y,
        marker = dict(
            symbol = 'circle-open-dot',
            size = [40],
            color = 'red',
            line = dict(width = 5)
        ),
        name = 'User Profile'
    ))

#Standard Figure Layout for Data Visualization
fig.update_layout(
    dict(
        height=600, 
        width=1000,
        plot_bgcolor = "#F1F1F3",
        paper_bgcolor = 'white',
        xaxis_title = 'Dimension 1',
        yaxis_title = 'Dimension 2',
        title={'text' : 't-SNE Results and User Vector Location',
               'x':0.5,
               'xanchor': 'center'})
)
 

fig.show()

<br/><hr/>
## **Regressor Mapping Approach**

user_vector = user_vector.to_frame().T #DataFrame row
features_user = features_user.to_frame().T #T-sne input
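
The two lines above only reshape the user data into single-row frames; the regressor itself is not shown in this section. One way such a mapping could work is sketched below, assuming a k-nearest-neighbours regressor is an acceptable choice (the notebook may use a different model). The embedding is fitted on the movies alone, and the regressor then learns to map scaled features to t-SNE coordinates so that a new user can be placed without re-running t-SNE. All data and names in the sketch are synthetic and illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
movie_features = rng.normal(size=(80, 6))  # synthetic stand-in for the movie rows
new_user = rng.normal(size=(1, 6))         # synthetic stand-in for the user row

toy_scaler = StandardScaler().fit(movie_features)
movies_scaled = toy_scaler.transform(movie_features)

# Embed the movies only; the user row takes no part in the t-SNE fit
movie_coords = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(movies_scaled)

# Learn a map from feature space to the 2-D embedding (multi-output regression)
mapper = KNeighborsRegressor(n_neighbors=5, weights='distance')
mapper.fit(movies_scaled, movie_coords)

# Place the user at the distance-weighted average of the nearest movies' coordinates
user_coords = mapper.predict(toy_scaler.transform(new_user))
```

The predicted point is only an approximation of where t-SNE would have placed the user, but it keeps the movie layout fixed when the user profile changes.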