### Notebook info:
> **Movie Streaming Data Visualization** <br/>
> *Movies_Streaming_Analysis.ipynb* Version 1.0 <br/>
> Launched on: September 15th, 2021; by Luiz Gustavo Fagundes Malpele. <br/>

> *Movies_Streaming_Analysis.ipynb* Version 1.1 <br/>
> Last updated on: October 19th, 2021; by Luiz Gustavo Fagundes Malpele. <br/>
> Corrects major problems with data preprocessing and organizes the code in modules. <br/>

<br/>
<div class="alert alert-block alert-success">

### To-Do:

**High-priority:**
- [ ] Generate a Data Visualization Template
- [ ] Generate Histograms for the main quantitative variables

**Streamlit:**
- [ ] Begin the User Interface


    
</div>
<br/><hr/>

<br/>

### Package/library dependencies:

- **matplotlib**, for plots and graphs
- **numpy**, for floating-point ranges
- **plotly**, for plotting aesthetics
- **pandas**, for reading CSV files into data frames
- **datetime**, for time-related operations
- **scikit-learn**, for feature scaling, t-SNE, and k-means clustering

In [1]:
#import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
from datetime import datetime, timedelta
import plotly.express as px 
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

<br/>

### Importing **Functions** library:

In [2]:
%run -i ../libraries/Preprocessing_Library.ipynb
%run -i ../libraries/Functionalities_Library.ipynb

<br/><hr/>
## **Initializations**

In [3]:
user_vector_path = '../data/user_vector_baseline.csv'
movies_data_path = '../data/movies_streaming_platforms.csv'
movies_cleaned_data_path = '../data/movies_streaming_platforms_cleaned.csv'

<br/><hr/>
## **Data Acquisition**

In [4]:
#movies_data = prepare_movies_dataframe(path = movies_data_path, to_csv = True)

In [5]:
movies_data = read_cleaned_movies_dataframe(path = movies_cleaned_data_path)

In [6]:
movies_data = filter_by_platforms(df = movies_data, hulu_display = True, netflix_display = None, 
                                  prime_video_display = None, disney_display= None, display_all = True)

In [7]:
movies_data = get_column_dummies_from_list(movies_data, column_name = 'genres', merge_dummies = True)
movies_data = get_column_dummies_from_list(movies_data, column_name = 'age', merge_dummies = True)

In [8]:
# Treat missing scores as 0 and rescale IMDb (0-10) to the 0-100 range used by Rotten Tomatoes
movies_data['rotten_tomatoes'] = movies_data['rotten_tomatoes'].fillna(0)
movies_data['imdb'] = movies_data['imdb'].fillna(0) * 10

|Variable|DataFrame|Description|Data Type|Example|
|:---|:---:|:----|:---:|:---:|
|title|movies_data|Movies' title|string|The Irishman|
|year|movies_data|Movies' launch year|int64|2019|
|age|movies_data|Parental Guidance Minimal Age Suggested|string|18+|
|imdb|movies_data|IMDB Score|float64|7.8|
|rotten_tomatoes|movies_data|Rotten Tomatoes Score|float64|98.0|
|netflix|movies_data|Movie is available on Netflix|bool|True|
|hulu|movies_data|Movie is available on Hulu|bool|False|
|prime_video|movies_data|Movie is available on Prime Video|bool|False|
|disney|movies_data|Movie is available on Disney+|bool|False|
|directors|movies_data|Movie's directors|object(list)|[Martin Scorsese]|
|genres|movies_data|Movie's genres|object(list)|[Biography, Crime, Drama]|
|language|movies_data|Movie's original language|object(list)|[English, Italian, Latin, Spanish, German]|
|runtime|movies_data|Movie's length in minutes|float64|209.0|
|group|tsne_df|Movie's cluster number|object(int)|18|
|X|tsne_df|The X-value on the t-distributed stochastic neighbor embedding|object(float64)|-64.7094|
|Y|tsne_df|The Y-value on the t-distributed stochastic neighbor embedding|object(float64)|43.7906|



In [16]:
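# Read the user vector and transpose it so that the features become columns
user_vector = pd.read_csv(user_vector_path, index_col = False).T
user_vector.columns = user_vector.iloc[0]  # promote the first row to column labels
user_vector = user_vector[1:]              # drop the row that held the labels
# user_vector is still a DataFrame here, so pd.Series(user_vector) raised
# "ValueError: The truth value of a DataFrame is ambiguous"; display the frame instead.
user_vector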
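
The reshaping done in the cell above can be tried on a small in-memory file. The sketch below assumes a hypothetical two-column layout (`feature`, `value`) for the user-vector CSV, which may not match the real file, and uses `DataFrame.squeeze()` as one possible way to obtain a Series from the resulting one-row frame.

```python
import io
import pandas as pd

# Hypothetical stand-in for user_vector_baseline.csv (the layout is an assumption)
csv_text = "feature,value\ntitle,User Vector\nyear,2015\nComedy,1\n"

user_frame = pd.read_csv(io.StringIO(csv_text), index_col=False).T
user_frame.columns = user_frame.iloc[0]   # feature names become column labels
user_frame = user_frame[1:]               # keep only the row of values

# A one-row DataFrame collapses to a Series indexed by the column labels
user_series = user_frame.squeeze()
print(type(user_frame).__name__, user_frame.shape)
print(type(user_series).__name__, user_series["title"])
```

Keeping the one-row DataFrame is convenient for concatenating with `movies_data`, while the Series form is easier to inspect value by value.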

In [15]:
# Select the features on which to cluster
remove_list = ['hulu', 'disney', 'netflix', 'prime_video', 'title', 'country', 
               'genres', 'runtime', 'age', 'directors', 'language']
column_list = movies_data.columns.to_list()
for remove_element in remove_list:
    column_list.remove(remove_element)
features = movies_data[column_list].astype(int)
features_user = user_vector[column_list].astype(int)
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
features = pd.concat([features, features_user])
movies_data = pd.concat([movies_data, user_vector])
features

KeyError: "['Adventure', 'Fantasy', 'rotten_tomatoes', 'Short', 'Other', 'all', 'Documentary', 'Talk-Show', 'Biography', 'Action', 'Reality-TV', 'Romance', 'Comedy', 'Game-Show', 'Sci-Fi', '7+', 'Mystery', 'Drama', 'War', 'Film-Noir', '13+', 'Western', 'Musical', 'Animation', 'Family', 'History', 'Horror', 'year', 'Sport', 'Music', 'Thriller', 'Crime', '16+', 'imdb', 'News'] not in index"
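
A `KeyError` of this shape is raised when the labels passed to `[]` are not among the frame's columns. One possible way to diagnose and guard against it is sketched below on made-up data (the frames and their values are hypothetical): list the missing columns first, then use `reindex` with a fill value before concatenating.

```python
import pandas as pd

# Hypothetical miniature versions of the notebook's frames
features = pd.DataFrame({"year": [2019, 2010], "imdb": [78, 65], "Comedy": [0, 1]})
user_vector = pd.DataFrame({"year": [2015], "Comedy": [1]}, index=["User Vector"])

column_list = features.columns.to_list()

# Report which expected columns the user vector lacks
missing = [col for col in column_list if col not in user_vector.columns]
print("Missing from user vector:", missing)

# reindex adds the absent columns (filled with 0) instead of raising KeyError
features_user = user_vector.reindex(columns=column_list, fill_value=0).astype(int)
combined = pd.concat([features, features_user])
print(combined)
```

Whether 0 is a sensible default for an absent feature depends on the feature; for one-hot genre columns it means "not selected", but for scores it may bias the user's position.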

In [None]:
%%time
# %%timeit discards variables assigned inside the cell; %%time keeps them for later cells

#Scaling the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(features)

#Using TSNE
tsne = TSNE(n_components=2)
transformed_genre = tsne.fit_transform(scaled_data)
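
As a minimal sketch of the scaling and embedding step, the example below runs the same two transformations on synthetic data. The array, the `perplexity` value, and the `random_state` seed are assumptions made for illustration; scikit-learn requires `perplexity` to be smaller than the number of samples, and fixing `random_state` makes the otherwise stochastic embedding repeatable.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: 60 rows, 5 numeric columns
rng = np.random.default_rng(0)
toy_features = rng.normal(loc=50.0, scale=20.0, size=(60, 5))

# Standardize each column to zero mean and unit variance
scaled = StandardScaler().fit_transform(toy_features)

# Project to two dimensions; the seed is fixed for repeatable output
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(scaled)

print(scaled.mean(axis=0).round(6))
print(embedding.shape)
```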

In [None]:
%%time
# %%time (unlike %%timeit) leaves group_pred and tsne_df defined for the cells below

#Kmeans
cluster = KMeans(n_clusters=23)
group_pred = cluster.fit_predict(scaled_data)

#Consider adding the genre
tsne_df = pd.DataFrame(np.column_stack((transformed_genre, group_pred, movies_data['title'], 
                                        movies_data['genres'], movies_data['age'])),
                                        columns=['X','Y','Group','Title', 'Genres', 'Age'])
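
`np.column_stack` on mixed numeric and string inputs yields a single object-dtype array, which is why the coordinates and the group are listed as `object` in the variable table. If numeric dtypes are preferred, one option is to build the frame from a dict of columns. A sketch on made-up values (the coordinates and titles below are illustrative only):

```python
import numpy as np
import pandas as pd

# Illustrative values standing in for the t-SNE output and cluster labels
embedding = np.array([[-64.7094, 43.7906], [12.1, -3.4]])
group_pred = np.array([18, 2])
titles = np.array(["The Irishman", "User Vector"], dtype=object)

# Stacking mixed types forces every column to object dtype
stacked = pd.DataFrame(np.column_stack((embedding, group_pred, titles)),
                       columns=["X", "Y", "Group", "Title"])

# Building from a dict keeps each column's own dtype
typed = pd.DataFrame({"X": embedding[:, 0], "Y": embedding[:, 1],
                      "Group": group_pred, "Title": titles})

print(stacked.dtypes)
print(typed.dtypes)
```

Numeric columns make later operations such as sorting by coordinate or mapping `Group` to a continuous colorscale behave predictably.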

In [None]:
tsne_df

In [None]:
# Locate the user's row in the embedding
is_user = tsne_df['Title'] == 'User Vector'
tsne_user_x = tsne_df.loc[is_user, 'X']
tsne_user_y = tsne_df.loc[is_user, 'Y']

In [None]:
# Build figure
fig = go.Figure()

# Add scatter trace with medium sized markers
fig.add_trace(
    go.Scatter(
        mode = 'markers',
        x = tsne_df['X'],
        y = tsne_df['Y'],
        customdata = tsne_df,
        marker = dict(
            color = tsne_df['Group'],
            colorscale='Viridis'
        ),
        hovertemplate =
            '<b>%{customdata[3]} </b><br><br>' +
            'Location: (%{customdata[0]:.2f},%{customdata[1]:.2f})<br>' +
            'Genres: %{customdata[4]}<br>' +
            'Age: %{customdata[5]}<br>' + 
            'Group: %{customdata[2]}<extra></extra>',
        showlegend = False))

fig.add_trace(
    go.Scatter(
        mode = 'markers',
        marker_symbol = 'circle-open-dot',
        marker_line_width = 5,
        x = tsne_user_x,
        y = tsne_user_y,
        marker = dict(
            size = [40],
            color = 'red'
        ),
        name = 'User Profile'
    ))

#Standard Figure Layout for Data Visualization
fig.update_layout(
    dict(
        height=600, 
        width=1000,
        plot_bgcolor = "#F1F1F3",
        paper_bgcolor = 'white',
        xaxis_title = 'Dimension 1',
        yaxis_title = 'Dimension 2',
        title={'text' : 't-SNE Results and User Vector Location',
               'x':0.5,
               'xanchor': 'center'})
)
 

fig.show()

<br/><hr/>
## **Regressor Mapping Approach**

user_vector = user_vector.to_frame().T #DataFrame row
features_user = features_user.to_frame().T #T-sne input