# **1. Introduction**

**TV Analytics** 

TV analytics measures how TV ads and programs perform across digital channels.

**For example**, a viewership analytics report covers the number of people who watched a program in a specific area over a week, compared with data from other districts. It helps TV advertisers measure the impact of their campaigns in real time. Advertisers observe the traffic peaks on their websites that are directly linked to their TV campaigns, so that they can work out the viewers’ preferences.


**Benefits of using TV Analytics:**

- Analyzing and understanding the audience
- Helping companies
  - Gain proper insight about their customers 
  - Optimize the marketing strategy
  - Deliver better services


Analytics provides quantitative data that businesses can use to improve their existing services.

**COVID-19 Impact on TV Analytics:**

With the outbreak of the pandemic, the TV industry has seen many changes in terms of performance, transparency and outcomes. At the same time, the growth of the sector has accelerated in more than one way.

- With more and more people staying at home, the demand for OTT TV channels has risen sharply.
- Advertisers, in the meantime, have started changing their existing creatives and implementing new strategies to win a larger share of the market.

**Market Research**

According to Allied Market Research, the global TV analytics market is expected to grow at a significant Compound Annual Growth Rate (CAGR) from 2019 to 2026. Rising demand for managing important data and an increasing need to obtain meaningful insights about consumer behaviour and advertisement preferences have resulted in the adoption of analytics solutions.


# **2. Background Information**

During the pandemic, OTT platforms have seen tremendous growth.

- The global over-the-top (OTT) market is expected to grow remarkably as more people subscribe to these entertainment platforms during the COVID-19 lockdown.

- It is expected to grow from USD 104.11 billion in 2019 to USD 161.37 billion in 2020, a compound annual growth rate of about 55%.

- This rapid growth is mainly due to the worldwide lockdown caused by the COVID-19 outbreak, during which subscriptions to OTT streaming channels and viewership have increased.

- Customers' viewing behavior is shifting from traditional subscriptions to broadcasting services toward over-the-top (OTT) on-demand video. This will drive the OTT streaming market at a very fast pace in the forecast period. Various segments of the population have started using video streaming services instead of regular television for entertainment, due to added benefits such as on-demand services and ease of access.

- The dramatic rise in demand for live streaming channels and ongoing creation of cloud-based OTT services would drive substantial market growth in the following years.
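
The growth rate quoted above can be checked with a few lines of arithmetic. The sketch below assumes the standard compound annual growth rate formula, (end / start)^(1 / years) - 1, and uses only the two market sizes given in the text. The function name `cagr` is illustrative and not part of the notebook.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate as a fraction (0.55 means 55%)."""
    return (end_value / start_value) ** (1 / years) - 1


# Market size in USD billions, as quoted above (2019 -> 2020 is one year).
ott_growth = cagr(104.11, 161.37, years=1)
print(f"OTT market growth 2019-2020: {ott_growth:.1%}")
```

Over a single year the CAGR is simply the year-over-year growth, which is why the 55% figure follows directly from the two market sizes.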

Audiences are shifting to online streaming platforms like Hotstar, Hulu, Netflix and Amazon Prime Video. Not only are users binge-watching movies and series on these platforms, but the businesses behind them are also making large profits from their exclusive content.


**Flicky**, an OTT service provider, wants to enhance the user experience by curating the content of all such providers.

With the help of Flicky you can:

- Pay only for the platform with the most popular movies/shows.
- Easily find relevant content based on what you watched on your current Netflix or Prime Video subscription.
- Get the best recommendations from your favorite genres.

# **3. Dataset Information**

 
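* __ID__: Movie ID

* __Title__: Movie title

* __Year__: Release year

* __Age__: Age restriction

* __IMDb__: IMDb rating

* __Rotten Tomatoes__: Rotten Tomatoes rating

* __Netflix, Hulu, Prime Video, Disney+__: OTT platforms (1 if the title is available on the platform, 0 otherwise)

* __Type__: Content type, stored as an integer flag

* __Directors__: Movie director(s)

* __Genres__: Movie genres

* __Country__: Countries of release

* __Language__: Languages of the title

* __Runtime__: Runtime in minutes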

# **4. Task to be Performed**


- Import required libraries
- Read the dataset and perform necessary changes
- Generate a ProfileReport using Pandas Profiling
- Perform exploratory data analysis over the data
- Process the data for recommendation engine
- Create a recommendation engine using CountVectorizer
- Create a WebApp using Streamlit which showcases both the engines and data analysis report

# **5. Libraries**

## **5.1. Install Required Libraries**

In [7]:
!pip install plotly
!pip install pandas-profiling
!pip install pycountry_convert
!pip install streamlit



## **5.2. Import Libraries**

In [8]:
import pandas as pd
import numpy as np
import pickle
import re

import pandas_profiling
from pandas_profiling import ProfileReport

import plotly 
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from plotly.colors import hex_to_rgb

from IPython.display import display
from ipywidgets import interactive, interact
import ipywidgets

import pycountry_convert as pc

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import LabelEncoder

## **5.3. Run Files**

In [9]:
%run functions.ipynb

# **6. Import Data**

In [10]:
file_name = 'dataset.csv'
df = pd.read_csv(file_name)
orig_df = df.copy()

In [11]:
df.head()

Unnamed: 0.1,Unnamed: 0,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime
0,0,1,Inception,2010,13+,8.8,87%,1,0,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
1,1,2,The Matrix,1999,18+,8.7,87%,1,0,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0
2,2,3,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0
3,3,4,Back to the Future,1985,7+,8.5,96%,1,0,0,0,0,Robert Zemeckis,"Adventure,Comedy,Sci-Fi",United States,English,116.0
4,4,5,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,1,0,1,0,0,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0


In [12]:
# Remove unnamed column / the first column.
# Set the ID column as index.
df = df.iloc[:,1:]
df.set_index('ID', inplace=True)

In [13]:
df.head()

ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime
1,Inception,2010,13+,8.8,87%,1,0,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
2,The Matrix,1999,18+,8.7,87%,1,0,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0
3,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0
4,Back to the Future,1985,7+,8.5,96%,1,0,0,0,0,Robert Zemeckis,"Adventure,Comedy,Sci-Fi",United States,English,116.0
5,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,1,0,1,0,0,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0


# **7. Exploratory Data Analysis**

## **7.1. Pandas Profiling**

In [14]:
# Generating Report using Pandas Profiling
report_df = ProfileReport(df)
report_df.to_notebook_iframe()



In [15]:
# Saving the report
report_df.to_file('Movie_Recommendation.html')



## **7.2. Null Values**

In [16]:
check_null_values(df)

Unnamed: 0,Null Values,Count_Nulls,Percentage_Nulls,Dtype
Rotten Tomatoes,True,11586,69.194935,object
Age,True,9390,56.07979,object
Directors,True,726,4.335882,object
Language,True,599,3.577401,object
Runtime,True,592,3.535595,float64
IMDb,True,571,3.410177,float64
Country,True,435,2.597946,object
Genres,True,275,1.642379,object
Title,False,0,0.0,object
Year,False,0,0.0,int64


#### Observations:
- 8 columns have missing data.
- Some columns are of object datatype.
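
`check_null_values` comes from `functions.ipynb`, which is not shown here. The sketch below is one possible way to build a summary with the same columns as the output above. The function name `check_null_values_sketch` and the sample data are assumptions, and the real helper may differ.

```python
import pandas as pd


def check_null_values_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing values per column (hypothetical stand-in for check_null_values)."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        'Null Values': counts > 0,                      # does the column contain any nulls?
        'Count_Nulls': counts,                          # number of missing entries
        'Percentage_Nulls': counts / len(frame) * 100,  # share of rows that are missing
        'Dtype': frame.dtypes.astype(str),
    })
    # Columns with the most missing values come first.
    return summary.sort_values(by='Count_Nulls', ascending=False)


# Tiny made-up sample, only to exercise the function.
sample = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Age': ['13+', None, None, '18+'],
    'Runtime': [148.0, None, 149.0, 116.0],
})
null_summary = check_null_values_sketch(sample)
print(null_summary)
```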

In [17]:
# Fill missing runtimes with -1
df['Runtime'] = df['Runtime'].fillna(-1)

# Fill the remaining nulls with the placeholder string 'NA'. This keeps missing
# entries explicit; note that it turns numeric columns such as IMDb into object dtype.
df = df.fillna('NA')

In [18]:
# Checking the missing data
check_null_values(df)

Unnamed: 0,Null Values,Count_Nulls,Percentage_Nulls,Dtype
Title,False,0,0.0,object
Year,False,0,0.0,int64
Age,False,0,0.0,object
IMDb,False,0,0.0,object
Rotten Tomatoes,False,0,0.0,object
Netflix,False,0,0.0,int64
Hulu,False,0,0.0,int64
Prime Video,False,0,0.0,int64
Disney+,False,0,0.0,int64
Type,False,0,0.0,int64


## **7.3. Backup Dataset**

In [19]:
# Keeping a copy of df
df_backup = df.copy()

## **7.4. Observations**

### **7.4.1. Age**

In [20]:
plot_value_counts_bar(df, 'Age')

#### Observations:
- Ignoring the missing values, most of the movies/series are targeted at an adult audience.

### **7.4.2. Rotten Tomatoes**


In [21]:
# 0  - 40  = Very Bad
# 41 - 55  = Bad
# 56 - 70  = Average
# 71 - 85  = Good
# 86 - 100 = Very Good

df['Rotten_Tomatoes_Rounded'] = df['Rotten Tomatoes'].apply(round_fix)

In [22]:
df.head()

ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime,Rotten_Tomatoes_Rounded
1,Inception,2010,13+,8.8,87%,1,0,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0,Very_Good
2,The Matrix,1999,18+,8.7,87%,1,0,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0,Very_Good
3,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0,Good
4,Back to the Future,1985,7+,8.5,96%,1,0,0,0,0,Robert Zemeckis,"Adventure,Comedy,Sci-Fi",United States,English,116.0,Very_Good
5,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,1,0,1,0,0,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0,Very_Good


In [23]:
plot_value_counts_bar(df, 'Rotten_Tomatoes_Rounded')

#### Observations:
- As per Rotten Tomatoes we have a lot of "Very_Good" movies to watch.
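
`round_fix` is defined in `functions.ipynb` and is not shown. A minimal sketch of the bucketing described in the comment table above could look like this. The function name is hypothetical, and the labels `Very_Bad`, `Bad` and `Average` are assumed by analogy with the `Good` and `Very_Good` labels visible in the output.

```python
def bucket_rotten_tomatoes(score: str) -> str:
    """Map a Rotten Tomatoes string such as '87%' to a rating label.

    Hypothetical stand-in for round_fix; bounds follow the comment table above.
    """
    if score == 'NA':  # placeholder used for missing ratings
        return 'NA'
    value = int(score.rstrip('%'))
    if value <= 40:
        return 'Very_Bad'
    if value <= 55:
        return 'Bad'
    if value <= 70:
        return 'Average'
    if value <= 85:
        return 'Good'
    return 'Very_Good'


for raw in ['87%', '84%', '96%', 'NA']:
    print(raw, '->', bucket_rotten_tomatoes(raw))
```

The same shape with different upper bounds (4.0, 5.5, 7.0, 8.5) would cover the IMDb scale.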

### **7.4.3. IMDB Ratings**

In [24]:
# 0   - 4.0  = Very Bad
# 4.1 - 5.5  = Bad
# 5.6 - 7.0  = Average
# 7.1 - 8.5  = Good
# 8.6 - 10.0 = Very Good

df['IMDB_Rounded'] = df['IMDb'].apply(round_fix_imdb)

In [25]:
plot_value_counts_bar(df, 'IMDB_Rounded')

#### Observations:
- Most of the content on streaming platforms has average IMDB ratings.

### **7.4.4. Highest IMDb Movies/Shows**

In [26]:
# Getting counts of Very_Good movie on all OTT Platforms
netflix_count = df[df['IMDB_Rounded']=='Very_Good']['Netflix'].sum()
hulu_count = df[df['IMDB_Rounded']=='Very_Good']['Hulu'].sum()
disney_count = df[df['IMDB_Rounded']=='Very_Good']['Disney+'].sum()
prime_count = df[df['IMDB_Rounded']=='Very_Good']['Prime Video'].sum()

indexes = ['Netflix', 'Hulu', 'Disney', 'Amazon Prime']
values = [netflix_count, hulu_count, disney_count, prime_count]

In [27]:
# `names` sets the slice labels; `labels` is reserved for renaming axis/legend titles
fig = px.pie(names=indexes, values=values, title='Top content on OTT')
fig.show()

#### Observations:
- Amazon Prime has the largest number of titles with a Very_Good IMDb rating.
- The chart compares raw counts, so this suggests that Amazon Prime offers more top-rated content to watch than the other OTT platforms.

### **7.4.5. Data Processing**

In [28]:
# Creating a temporary dataframe to keep the processed data safe.
temp_df = df.copy()

In [29]:
temp_df.head()

ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime,Rotten_Tomatoes_Rounded,IMDB_Rounded
1,Inception,2010,13+,8.8,87%,1,0,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0,Very_Good,Very_Good
2,The Matrix,1999,18+,8.7,87%,1,0,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0,Very_Good,Very_Good
3,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0,Good,Good
4,Back to the Future,1985,7+,8.5,96%,1,0,0,0,0,Robert Zemeckis,"Adventure,Comedy,Sci-Fi",United States,English,116.0,Very_Good,Good
5,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,1,0,1,0,0,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0,Very_Good,Very_Good


In [30]:
# Apply encoding to the columns ['Genres', 'Country', 'Language']
kata, temp_df = apply_encoding(temp_df, ['Genres', 'Country', 'Language'], get_kata=1)

In [31]:
temp_df.head()

ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,...,Language_Kriolu,Language_Dari,Language_Acholi,Language_Ukrainian,Language_Thai,Language_East-Greenlandic,Language_Gallegan,Language_Minangkabau,Language_Brazilian Sign Language,Language_Guarani
1,Inception,2010,13+,8.8,87%,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Matrix,1999,18+,8.7,87%,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Back to the Future,1985,7+,8.5,96%,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
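
The `Language_*` columns above are 0/1 flags derived from the comma-separated `Language` strings. `apply_encoding` and `get_counts` live in `functions.ipynb` and are not shown, so the following is only a sketch of the same idea using pandas' built-in `Series.str.get_dummies`. The sample data is made up for illustration.

```python
import pandas as pd

sample = pd.DataFrame(
    {'Language': ['English,Japanese,French', 'English', 'Italian']},
    index=pd.Index([1, 2, 3], name='ID'),
)

# One 0/1 column per distinct language, prefixed like the Language_* columns above.
encoded = sample['Language'].str.get_dummies(sep=',').add_prefix('Language_')

# Column sums give the number of titles per language.
language_counts = encoded.sum().sort_values(ascending=False)

print(encoded)
print(language_counts)
```

Summing the flag columns is one simple way to obtain per-category counts like those plotted in the following sections.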


### **7.4.6. Most Popular Genre**

In [32]:
# Get the base counts for each category and sort them by count
base_counts = get_counts(temp_df, 'Genres', kata['Genres'])
base_counts = pd.DataFrame(index=base_counts.keys(),
                           data=base_counts.values(),
                           columns=['Counts'])
base_counts.sort_values(by='Counts', inplace=True)

# Plot the chart which shows top genres and separate by color where genre<1000
colors=['#988D90' if i<1000 else '#F00045' for i in base_counts['Counts']]
fig = px.bar(x=base_counts.index,
             y=base_counts['Counts'],
             title='Most Popular Genre',
             color_discrete_sequence=colors,
             color=base_counts.index)

fig.show()     

#### Observations:
- Most popular genre is Drama.

### **7.4.7. Most Released Content**

In [33]:
# Get the base counts for the country and sort them
base_counts = get_counts(temp_df, 'Country', kata['Country'])
base_counts = pd.DataFrame(index=base_counts.keys(),
                           data=base_counts.values(),
                           columns=['Counts'])
base_counts.sort_values(by='Counts', ascending=False, inplace=True)

# Plot the top 10 countries
fig = px.bar(x=base_counts.index[:10],
             y=base_counts['Counts'][:10],
             color=base_counts['Counts'][:10],
             title='Most Released Content: Country')
             
fig.show()

#### Observations:
- Most of the content was released in the United States.

### **7.4.8. Most Released Language**

In [34]:
# Get the base counts for language and sort them
base_counts = get_counts(temp_df, 'Language', kata['Language'])
base_counts = pd.DataFrame(index=base_counts.keys(),
                           data=base_counts.values(),
                           columns=['Counts'])
base_counts.sort_values(by='Counts', ascending=False, inplace=True)

# Plot the top 5 languages
fig = px.bar(x=base_counts.index[:5],
             y=base_counts['Counts'][:5],
             color=base_counts['Counts'][:5],
             title='Most Released Content: Language')
             
fig.show()

#### Observations:
- We can work with the few genres that have more than 1000 titles and group the remaining genres as Others.
- It is important to keep country information, but at continent level for better clarity.
- Most of the content is in English.

# **8. OTT Platforms**

## **8.1. Content Releases**

In [35]:
# Get the counts
release_scores = get_ott_counts(temp_df,
                                ['Netflix', 'Hulu', 'Prime Video', 'Disney+'],
                                'Year')

>>>> Done: Netflix
>>>> Done: Hulu
>>>> Done: Prime Video
>>>> Done: Disney+
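
`get_ott_counts` is another helper from `functions.ipynb`. Judging from how `release_scores` is plotted later (columns `Year`, `Count` and `Platform`), it returns one row per platform and year. The sketch below shows one way to build such a table. The function name and sample data are assumptions.

```python
import pandas as pd


def ott_counts_sketch(frame: pd.DataFrame, platforms: list, by: str) -> pd.DataFrame:
    """Count titles per platform and per value of `by` (hypothetical stand-in)."""
    parts = []
    for name in platforms:
        # Keep only the titles available on this platform, then count per year.
        counts = frame.loc[frame[name] == 1, by].value_counts().sort_index()
        parts.append(pd.DataFrame({
            'Platform': name,
            by: counts.index,
            'Count': counts.values,
        }))
    return pd.concat(parts, ignore_index=True)


# Made-up sample with two platforms.
sample = pd.DataFrame({
    'Year': [2010, 2010, 2018, 1999],
    'Netflix': [1, 1, 0, 1],
    'Hulu': [0, 1, 1, 0],
})
release = ott_counts_sketch(sample, ['Netflix', 'Hulu'], 'Year')
print(release)
```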


## **8.2. Top Genres**

In [36]:
# Get the genres and platforms
genres = kata['Genres'].copy()
genres.extend(['All'])
platform = ['Netflix', 'Hulu', 'Prime Video', 'Disney+', 'All']

In [37]:
# Replace NA to -1 in IMDb column
temp_df['IMDb'] = temp_df['IMDb'].replace('NA', -1)

In [38]:
# change the datatype to float
temp_df['IMDb'] = temp_df['IMDb'].astype(float)

In [39]:
# An interactive chart that displays the top movies for each genre and platform
def color_platform(platform):
    #specify a color for each platform based on their color theme
    if platform == 'Netflix':
        return ['#6F0000', '#FF0000']
    elif platform == 'Prime Video':
        return ['#06064D', '#1616CD']
    elif platform == 'Hulu':
        return ['#00DE00', '#005800']
    else:
        return ['#00BDBD', '#004242']


@interact  # To make a function interactive, write "@interact"
# immediately before the function definition
def plot_genres(genres=genres, platform=platform):
    tg_df = temp_df.copy()
    if genres == 'All' and platform == 'All':
        title = 'Top 15 Movies/Series'
        tg_df.sort_values(by='IMDb', ascending=False, inplace=True)
        fig = px.bar(tg_df[:15],
                     y='Title',
                     x='IMDb',
                     color='IMDb',
                     title=title,
                     color_continuous_scale=['#E6009B', '#5E003F'],
                     orientation='h')
    elif genres == 'All' and platform != 'All':
        cequence = color_platform(platform)
        title = 'Top 15 Movies/Series on ' + platform
        tg_df = tg_df[tg_df[platform] == 1]
        tg_df.sort_values(by='IMDb', ascending=False, inplace=True)
        fig = px.bar(tg_df[:15],
                     y='Title',
                     x='IMDb',
                     color='IMDb',
                     title=title,
                     color_continuous_scale=cequence,
                     orientation='h')
    elif genres != 'All' and platform == 'All':
        title = 'Top 15 ' + genres + ' Movies/Series'
        tg_df = tg_df[(tg_df['Genres_' + genres] == 1)]
        tg_df.sort_values(by='IMDb', ascending=False, inplace=True)
        fig = px.bar(tg_df[:15],
                     y='Title',
                     x='IMDb',
                     color='IMDb',
                     title=title,
                     color_continuous_scale=['#F52668', '#6D0023'],
                     orientation='h')
    else:
        cequence = color_platform(platform)
        title = 'Top 15 ' + genres + ' Movies/Series on ' + platform
        tg_df = tg_df[(tg_df[platform] == 1)
                          & (tg_df['Genres_' + genres] == 1)]
        tg_df.sort_values(by='IMDb', ascending=False, inplace=True)
        fig = px.bar(tg_df[:15],
                     y='Title',
                     x='IMDb',
                     color='IMDb',
                     title=title,
                     color_continuous_scale=cequence,
                     orientation='h')
    fig.show()


## **8.3. Data Processing**

In [40]:
# Getting the processed data
mutated_df = df.copy()

### **8.3.1. Age Column**

In [41]:
# Apply the convertAge function to the Age column
df['AgeRestriction'] = df['Age'].apply(convertAge)

### **8.3.2. Genres Column**

In [42]:
# Keep only genres with more than 1000 titles; the rest are grouped as Others
base_counts = get_counts(temp_df, 'Genres', kata['Genres'])
base_counts = pd.DataFrame(index=base_counts.keys(),
                           data=base_counts.values(),
                           columns=['Counts'])
base_counts.sort_values(by='Counts', inplace=True)
keep_genres = list(base_counts[base_counts['Counts']>1000].index)
keep_genres.append('Others')

In [43]:
# Update the data with the new genres
updated_genres = encode_data(df['Genres'], keep_genres, col='Genres')
df = update_data(df,'Genres', updated_genres)

In [44]:
# Pickle Keep Genres
with open('keep_genres.pickle', 'wb') as f:
    pickle.dump(keep_genres, f)

### **8.3.3. Country Column**

In [45]:
# Get the continent info
continent_info = continentName(df['Country'], 'Continent', df.shape[0])

In [46]:
# Update the dataframe
df = update_data(df, 'Country', continent_info)

### **8.3.4. Language Column**

In [47]:
# If more than 500 titles in respective language, we keep it else it goes in others category
lang_counts = get_counts(temp_df, 'Language', kata['Language'])
lang_counts = pd.DataFrame(index=lang_counts.keys(),
                           data=lang_counts.values(),
                           columns=['Counts'])
lang_counts.sort_values(by='Counts',inplace=True)
keep_lang = list(lang_counts[lang_counts.Counts>500].index)
keep_lang.append('Others')

In [48]:
# Pickle Keep Lang
with open('keep_lang.pickle', 'wb') as f:
    pickle.dump(keep_lang, f)

In [49]:
# Update for each language
updated_lang = encode_data(df['Language'], keep_lang, col='Language')
df = update_data(df, 'Language', updated_lang)
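
The thresholding used for genres and languages can be illustrated without the notebook's helpers. The sketch below counts categories, keeps those above a threshold and maps everything else to `Others`. `collapse_rare` is a hypothetical name, and the threshold of 1 is chosen only to suit the toy data (the notebook uses 500 for languages and 1000 for genres).

```python
from collections import Counter


def collapse_rare(values: str, keep: list) -> list:
    """Replace every category not in `keep` with 'Others' (hypothetical helper)."""
    collapsed = {v if v in keep else 'Others' for v in values.split(',')}
    return sorted(collapsed)


languages = ['English,Japanese,French', 'English', 'English,French', 'Italian']

# Count how many titles mention each language.
counts = Counter(lang for entry in languages for lang in entry.split(','))

# Keep languages that appear in more than one title, plus the catch-all bucket.
keep = [lang for lang, n in counts.items() if n > 1] + ['Others']

print(keep)
print([collapse_rare(entry, keep) for entry in languages])
```

This matches the pattern visible in the encoded data, where a title in English, Japanese and French ends up flagged as English, French and Others.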

### **8.3.5. Year Column**

In [50]:
print('Released content timeline from', df['Year'].min(),'to', df['Year'].max())

Released content timeline from 1902 to 2020


In [51]:
# Era 1: (1900 - 1940) - Old
# Era 2: (1941 - 1970) - Vintage
# Era 3: (1971 - 1990) - Golden
# Era 4: (1991 - 2010) - Modern
# Era 5: (2011 - 2020) - Latest

# Create an Era column to categorize Year
df['Era'] = df['Year'].apply(yearConvert)
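
`yearConvert` is defined in `functions.ipynb`. A minimal sketch that follows the era boundaries listed in the comments above might look like this. The function name `year_to_era` is hypothetical.

```python
def year_to_era(year: int) -> str:
    """Map a release year to an era label, using the boundaries listed above."""
    if year <= 1940:
        return 'Old'
    if year <= 1970:
        return 'Vintage'
    if year <= 1990:
        return 'Golden'
    if year <= 2010:
        return 'Modern'
    return 'Latest'


for year in [1902, 1966, 1985, 2010, 2018]:
    print(year, '->', year_to_era(year))
```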

In [52]:
# Features to be removed or to be encoded
remove = ['Age','Year','Type','Rotten Tomatoes','IMDb','Rotten_Tomatoes_Rounded']
dummy = ['Era','AgeRestriction','IMDB_Rounded']

In [53]:
# Saving the data for further use in WebApp
df.to_csv('movies_data_encoded.csv')

## **8.4. Observations**

### **8.4.1. Age Restriction**

In [54]:
# Plot value counts for AgeRestriction
plot_value_counts_bar(df, 'AgeRestriction')

#### Observations:
- Most content is aimed at adults.

### **8.4.2. Genres**

In [55]:
# Plot the graph for them
plot_category_counts_bar('Genres')

#### Observation:
- Drama is the most popular genre.

### **8.4.3. Continent**

In [56]:
# Plot a graph for continents
plot_category_counts_bar('Continent')

#### Observations:
- Most movies/shows are from North America.

### **8.4.4. Language**

In [57]:
# Plot a graph for the language column
plot_category_counts_bar('Language')

#### Observations:
- Most movies/shows are in English.

### **8.4.5. Era**

In [58]:
# Get base counts for the era for plotting
# Convert sum to dataframe
base_counts_era = df['Era'].value_counts().to_frame(name='Counts').reset_index()

# Rename column name
base_counts_era.rename(columns={'index': 'Era'}, inplace=True)

In [59]:
# Plot the value counts for Era
fig = px.bar(base_counts_era, x='Era', y='Counts', color='Era')
fig.update_layout()

#### Observations:
- Most movies/shows fall in the Latest era (2011 - 2020).

### **8.4.6. Number of Movies Released/Year on Each Platform.**

In [60]:
# Plot scatter plot
fig = px.scatter(release_scores,
                 x='Year',
                 y='Count',
                 size='Count',
                 color='Platform',
                 title='Content Per OTT Apps released in consecutive years',
                 color_discrete_sequence=['#E50914', '#3DBB3D', '#00A8E1', '#048f70'])

fig.show()

#### Observations:
- Most of the released content is on Prime Video.


# **9. Content Based Recommendation: CountVectorizer**

In [61]:
df.head()

ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,...,Continent_Oceania,Continent_South America,Continent_NA,Language_NA,Language_Hindi,Language_French,Language_Spanish,Language_English,Language_Others,Era
1,Inception,2010,13+,8.8,87%,1,0,0,0,0,...,0,0,0,0,0,1,0,1,1,Modern
2,The Matrix,1999,18+,8.7,87%,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,Modern
3,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,Latest
4,Back to the Future,1985,7+,8.5,96%,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,Golden
5,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,1,0,1,0,0,...,0,0,0,0,0,0,0,0,1,Vintage


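CountVectorizer converts a collection of text documents into a matrix of token counts: one row per document, one column per word, and each cell holds how often that word occurs in that document.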

Taking an example with 3 sentences:
- I enjoy coding.
- I like python.
- I like java.

The count vectorizer creates a matrix that records how often each word occurs in each sentence:
![img](https://i.imgur.com/9CSw7ra.png)
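
The three example sentences can be run through scikit-learn's `CountVectorizer` directly. This is a small sketch with default settings. Note that the default tokenizer only keeps tokens of two or more characters, so `I` does not get a column here, which may differ from the illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['I enjoy coding.', 'I like python.', 'I like java.']

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(sentences)

# vocabulary_ maps each token to its column index; order the tokens by that index.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocabulary)
print(matrix.toarray())
```

Each row of the resulting array corresponds to one sentence and each column to one word of the vocabulary.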

In [62]:
cbr_df = df.copy()

In [63]:
cbr_df.drop(remove, axis=1, inplace=True)
cbr_df.drop('Runtime', axis=1, inplace=True)

In [64]:
cbr_df.head()

ID,Title,Netflix,Hulu,Prime Video,Disney+,Directors,IMDB_Rounded,AgeRestriction,Genres_Fantasy,Genres_Sci-Fi,...,Continent_Oceania,Continent_South America,Continent_NA,Language_NA,Language_Hindi,Language_French,Language_Spanish,Language_English,Language_Others,Era
1,Inception,1,0,0,0,Christopher Nolan,Very_Good,Teen,0,1,...,0,0,0,0,0,1,0,1,1,Modern
2,The Matrix,1,0,0,0,"Lana Wachowski,Lilly Wachowski",Very_Good,Adult,0,1,...,0,0,0,0,0,0,0,1,0,Modern
3,Avengers: Infinity War,1,0,0,0,"Anthony Russo,Joe Russo",Good,Teen,0,1,...,0,0,0,0,0,0,0,1,0,Latest
4,Back to the Future,1,0,0,0,Robert Zemeckis,Good,Non-Adult,0,1,...,0,0,0,0,0,0,0,1,0,Golden
5,"The Good, the Bad and the Ugly",1,0,1,0,Sergio Leone,Very_Good,Adult,0,0,...,0,0,0,0,0,0,0,0,1,Vintage


In [65]:
%%time
# Create the soup for count vectorizer
cbr_df['soup'] = cbr_df.apply(create_soup, axis=1)

Wall time: 1.9 s
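
`create_soup` is defined in `functions.ipynb` and is not shown. In content-based recommenders a "soup" is usually a single string that concatenates a title's descriptive features so that `CountVectorizer` can treat them as words. The sketch below is an assumption about what such a helper could do, not the notebook's actual implementation. It works on a plain dict standing in for one row.

```python
def create_soup_sketch(row: dict) -> str:
    """Join a title's descriptive features into one space-separated string.

    Hypothetical stand-in for create_soup; the real helper may differ.
    """
    tokens = []
    # Remove spaces inside names so 'Christopher Nolan' stays one token.
    for director in row['Directors'].split(','):
        tokens.append(director.replace(' ', '').lower())
    # Keep categorical labels as lowercase tokens.
    for column in ('IMDB_Rounded', 'AgeRestriction', 'Era'):
        tokens.append(str(row[column]).lower())
    # For 0/1 flag columns, use the column name as a token when the flag is set.
    for column, value in row.items():
        if value == 1:
            tokens.append(column.replace(' ', '').lower())
    return ' '.join(tokens)


# Made-up row with a subset of the columns shown above.
example_row = {
    'Title': 'Inception', 'Netflix': 1, 'Hulu': 0,
    'Directors': 'Christopher Nolan', 'IMDB_Rounded': 'Very_Good',
    'AgeRestriction': 'Teen', 'Genres_Sci-Fi': 1, 'Language_English': 1,
    'Era': 'Modern',
}
soup = create_soup_sketch(example_row)
print(soup)
```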


In [66]:
%%time
# Apply Countvectorizer
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(cbr_df['soup'])

Wall time: 445 ms


#### Finding Cosine Similarity

In [67]:
%%time
# Finding Cosine Similarity
cosine_sim2 = cosine_similarity(count_matrix)

Wall time: 932 ms
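
Cosine similarity measures the angle between two count vectors: the dot product divided by the product of their lengths. As a small worked sketch, the matrix below holds count vectors for the three example sentences used earlier in this section (with the single-character token `I` left out, which is an assumption about the tokenization), and the similarity is computed by hand with NumPy.

```python
import numpy as np

# Count vectors for: 'I enjoy coding.', 'I like python.', 'I like java.'
# Columns: coding, enjoy, java, like, python
counts = np.array([
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 0],
], dtype=float)

# Normalize each row to unit length, then take all pairwise dot products.
unit_rows = counts / np.linalg.norm(counts, axis=1, keepdims=True)
similarity = unit_rows @ unit_rows.T

print(similarity.round(2))
```

The second and third sentences share one of their two words, so their similarity is 0.5, while the first sentence shares no words with the others and scores 0.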


In [68]:
# Return the 10 titles most similar to the given title
def get_recommendations_new(title, data, o_data, cosine_sim=cosine_sim2):
    data = data.reset_index()
    # Map each title to its row position, using the data passed in
    indices = pd.Series(data.index, index=data['Title'])
    idx = indices[title]
    # If several rows share the title, use the first match
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping position 0 (the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Insert HTML line breaks into multi-valued fields for the table display
    o_data['Title'] = o_data['Title'].str.replace(': ', ':<br>')
    o_data['Genres'] = o_data['Genres'].str.replace(',', '<br>')
    o_data['Directors'] = o_data['Directors'].str.replace(',', '<br>')
    o_data['Country'] = o_data['Country'].str.replace(',', '<br>')
    o_data['Language'] = o_data['Language'].str.replace(',', '<br>')
    # Return the top 10 most similar movies
    return o_data[[
        'Title', 'IMDb', 'Genres', 'Directors', 'Country', 'Language'
    ]].iloc[movie_indices]

In [69]:
l = get_recommendations_new('The Avengers', cbr_df, orig_df.copy(), cosine_sim2)

In [70]:
# Plot the recommendations as a table using plotly
l.sort_values(by='IMDb', ascending=False, inplace=True)
colorscale = [[0, '#477BA8'], [.5, '#C9EEF2'], [1, '#D0F5F5']]
fig = ff.create_table(l, colorscale=colorscale, height_constant=70)

fig.show()