### Notebook info:
> **Movie Streaming Data Visualization** <br/>
> *Movies_Streaming_Analysis.ipynb* Version 1.0 <br/>
> Last updated on: September 15th, 2021, by Luiz Gustavo Fagundes Malpele. <br/>

<br/>
<div class="alert alert-block alert-success">

### To-Do:

**High-priority:**
- [ ] Generate a Data Visualization Template
- [ ] Generate Histograms for the main quantitative variables

**Streamlit:**
- [ ] Begin the User Interface


    
</div>
<br/><hr/>

<br/>

### Package/library dependencies:

- **matplotlib**, for plots and graphs
- **numpy**, for floating-point ranges
- **plotly**, for plotting aesthetics
- **pandas**, for reading CSV files into data frames
- **datetime**, for time-related operations

In [1]:
#import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
from datetime import datetime, timedelta
import plotly.express as px 
import plotly.graph_objects as go

<br/>

### Importing **Functions** library:

In [2]:
%run -i ../libraries/Preprocessing_Library.ipynb
%run -i ../libraries/Functionalities_Library.ipynb

<br/><hr/>
## **Initializations**

In [3]:
movies_data_path = '../data/movies_streaming_platforms.csv'
movies_cleaned_data_path = '../data/movies_streaming_platforms_cleaned.csv'

<br/><hr/>
## **Data Acquisition**

In [4]:
#movies_data = prepare_movies_dataframe(path = movies_data_path, to_csv = True)

In [5]:
movies_data = read_cleaned_movies_dataframe(path = movies_cleaned_data_path)

In [6]:
movies_data = filter_by_platforms(df = movies_data, hulu_display = None, netflix_display = None,
                                  prime_video_display = None, disney_display = True, display_all = True)

In [7]:
movies_data = get_column_dummies_from_list(movies_data, column_name = 'genres', merge_dummies = True)
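
`get_column_dummies_from_list` is defined in `Preprocessing_Library.ipynb`, which is not shown here. As a rough idea of what such a helper might do, the sketch below one-hot encodes a list-valued column with plain pandas. The function name `list_column_to_dummies` and the toy data are hypothetical, and the real helper may behave differently (for example in how it handles `merge_dummies`).

```python
import pandas as pd


def list_column_to_dummies(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """Hypothetical stand-in: add one 0/1 column per distinct list item."""
    # explode() yields one row per list item while keeping the original index
    exploded = df[column_name].explode()
    # One-hot encode the items, then collapse back to one row per original index
    dummies = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
    return df.join(dummies)


# Toy data, not taken from the real dataset
toy = pd.DataFrame({
    "title": ["A", "B"],
    "genres": [["Drama", "Sport"], ["Drama"]],
})
encoded = list_column_to_dummies(toy, "genres")
print(encoded[["title", "Drama", "Sport"]])
```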

In [8]:
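# Treat missing ratings as 0 so that later steps receive no NaN values
movies_data['rotten_tomatoes'] = movies_data['rotten_tomatoes'].fillna(0)
# Rescale IMDb from 0-10 to 0-100 to match the Rotten Tomatoes scale
movies_data['imdb'] = movies_data['imdb'].fillna(0) * 10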
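
The cell above puts both ratings on a 0-100 scale and replaces missing values with 0. The toy sketch below shows the effect on a few made-up rows. It also keeps a boolean flag for the originally missing IMDb values, since a filled-in 0 is otherwise indistinguishable from a genuine score of 0. The `imdb_missing` column is an illustrative assumption, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up ratings, not taken from the real dataset
toy = pd.DataFrame({
    "imdb": [7.8, np.nan, 8.4],               # IMDb uses a 0-10 scale
    "rotten_tomatoes": [98.0, 14.0, np.nan],  # Rotten Tomatoes uses 0-100
})

# Remember which ratings were missing before they are overwritten with 0
toy["imdb_missing"] = toy["imdb"].isna()

toy["rotten_tomatoes"] = toy["rotten_tomatoes"].fillna(0)
toy["imdb"] = toy["imdb"].fillna(0) * 10  # rescale to 0-100

print(toy)
```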

In [9]:
movies_data

index,title,year,age,imdb,rotten_tomatoes,netflix,hulu,prime_video,disney,directors,...,Other,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,The Irishman,2019,18+,78.0,98.0,True,False,False,False,[Martin Scorsese],...,0,0,0,0,0,0,0,0,0,0
1,Dangal,2016,7+,84.0,97.0,True,False,False,False,[Nitesh Tiwari],...,0,0,0,0,0,1,0,0,0,0
2,David Attenborough: A Life on Our Planet,2020,7+,90.0,95.0,True,False,False,False,"[Alastair Fothergill, Jonathan Hughes, Keith S...",...,0,0,0,0,0,0,0,0,0,0
3,Lagaan: Once Upon a Time in India,2001,7+,81.0,94.0,True,False,False,False,[Ashutosh Gowariker],...,0,0,0,0,0,1,0,0,0,0
4,Roma,2018,18+,77.0,94.0,True,False,False,False,[Other],...,0,0,1,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9510,Most Wanted Sharks,2020,,0.0,14.0,False,False,False,True,[Other],...,0,1,0,0,0,0,0,0,0,0
9511,Doc McStuffins: The Doc Is In,2020,,0.0,13.0,False,False,False,True,[Chris Anthony Hamilton],...,0,0,0,0,0,0,0,0,0,0
9512,Ultimate Viking Sword,2019,,0.0,13.0,False,False,False,True,[Other],...,1,0,0,0,0,0,0,0,0,0
9513,Hunt for the Abominable Snowman,2011,,0.0,10.0,False,False,False,True,[Dan Oliver],...,0,0,0,0,0,0,0,0,0,0


In [10]:
# Select the features to cluster on
features = movies_data[['Action', 'Adventure', 'Animation',
                        'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family',
                        'Fantasy', 'Film-Noir', 'Game-Show', 'History', 'Horror', 'Music',
                        'Musical', 'Mystery', 'News', 'Reality-TV', 'Romance', 'Sci-Fi',
                        'Short', 'Sport', 'Talk-Show', 'Thriller', 'War', 'Western', 
                        'year', 'imdb', 'rotten_tomatoes']].astype(int)

In [11]:
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE


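# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
scaled_data = scaler.fit_transform(features)

# Project the scaled features to 2-D with t-SNE (used only for plotting)
tsne = TSNE(n_components=2)
transformed_genre = tsne.fit_transform(scaled_data)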

In [12]:
from sklearn.cluster import KMeans

# KMeans elbow method: fit one model per k and record its inertia
# (the within-cluster sum of squared distances)
distortions = []
K = range(1, 100)
for k in K:
    kmean = KMeans(n_clusters=k)
    kmean.fit(scaled_data)
    distortions.append(kmean.inertia_)
fig = px.line(x=list(K), y=distortions, title='The Elbow Method Showing the Optimal K',
              labels={'x': 'Number of Clusters', 'y': 'Distortion'})
fig.show()
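
The elbow is usually read off the plot by eye. As a rough numeric cross-check, the sketch below picks the k whose point lies farthest from the straight line joining the first and last points of the curve. This is a common heuristic rather than something the notebook itself does; the function name `find_elbow` and the toy distortion values are hypothetical.

```python
import numpy as np


def find_elbow(ks, distortions):
    """Return the k farthest from the chord joining the curve's end points."""
    x = np.asarray(ks, dtype=float)
    y = np.asarray(distortions, dtype=float)
    # Normalize both axes to [0, 1] so neither scale dominates the distance
    x_n = (x - x.min()) / (x.max() - x.min())
    y_n = (y - y.min()) / (y.max() - y.min())
    # Perpendicular distance of each point to the line through the end points
    dx, dy = x_n[-1] - x_n[0], y_n[-1] - y_n[0]
    dist = np.abs(dy * (x_n - x_n[0]) - dx * (y_n - y_n[0])) / np.hypot(dx, dy)
    return int(x[int(np.argmax(dist))])


# Made-up distortion values that flatten out after k = 3
toy_ks = list(range(1, 9))
toy_distortions = [1000, 520, 260, 240, 225, 215, 208, 203]
print(find_elbow(toy_ks, toy_distortions))
```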

In [13]:
# Fit KMeans with 23 clusters on the scaled features
cluster = KMeans(n_clusters=23)
group_pred = cluster.fit_predict(scaled_data)

# Combine the t-SNE coordinates with the cluster labels, titles and genres
tsne_df = pd.DataFrame(np.column_stack((transformed_genre, group_pred, movies_data['title'], 
                                        movies_data['genres'])),columns=['X','Y','Group','Title', 'Genres'])

fig = px.scatter(tsne_df,x='X',y='Y',hover_data=['Title', 'Genres'],color='Group',
                 color_discrete_sequence=px.colors.cyclical.IceFire)
fig.show()
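
Besides hovering over points in the scatter plot, one way to interpret the clusters is to tabulate their size and average feature values. The sketch below does this with a pandas `groupby` on made-up data; the column selection and the names `n_titles`, `mean_imdb`, `drama_share` and `animation_share` are illustrative assumptions.

```python
import pandas as pd

# Made-up cluster labels and features, not taken from the real dataset
toy = pd.DataFrame({
    "Group": [0, 0, 1, 1, 1],
    "Drama": [1, 1, 0, 0, 1],
    "Animation": [0, 0, 1, 1, 0],
    "imdb": [78, 84, 60, 65, 70],
})

# One row per cluster: size, mean rating and share of titles in each genre
summary = toy.groupby("Group").agg(
    n_titles=("imdb", "size"),
    mean_imdb=("imdb", "mean"),
    drama_share=("Drama", "mean"),
    animation_share=("Animation", "mean"),
)
print(summary)
```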

In [14]:
tsne_df

,X,Y,Group,Title,Genres
0,42.6411,-75.0315,21,The Irishman,"[Biography, Crime, Drama]"
1,73.0988,-54.0957,16,Dangal,"[Action, Biography, Drama, Sport]"
2,49.4834,-55.9279,10,David Attenborough: A Life on Our Planet,"[Documentary, Biography]"
3,79.7262,-39.423,16,Lagaan: Once Upon a Time in India,"[Drama, Musical, Sport]"
4,-64.0659,-70.5384,18,Roma,"[Action, Drama, History, Romance, War]"
...,...,...,...,...,...
9510,38.9551,8.12214,11,Most Wanted Sharks,"[Crime, Reality-TV]"
9511,14.635,-8.57007,5,Doc McStuffins: The Doc Is In,[Animation]
9512,22.5897,-4.41292,5,Ultimate Viking Sword,[Other]
9513,13.3686,-41.4544,22,Hunt for the Abominable Snowman,"[Drama, History]"
