# Movie Recommendation Algorithm

As part of this task, you are expected to follow the instructions below and create a movie recommendation algorithm to make your users happy :)

### 1. We start off with the known dependencies we need

Pandas is a must. Pandas is a powerful data analysis library built for Python users. It helps us manipulate complicated data in a user-friendly manner. You will soon see how convenient it is and come to love it as much as we do.

Please use the run button on the right-hand side to execute the code block below. Once the dependencies are imported successfully, you will see a green tick at the bottom of the code block.

In [3]:
import pandas as pd # refer to pandas as pd
from movie_details import get_movies_by_id  

### 2. Then we need to get our movies dataset

The dataset includes thousands of movies and detailed information for each one of them. Let's see what it looks like, shall we?

Run the code block below to execute the code. Once the code block runs, the dataset will be displayed underneath the code block.

In [None]:
path = "./data/dataset2.csv"
df = pd.read_csv(path) # df stands for Data Frame
df.head(10) # display the first 10 rows

As you can see, the dataset doesn't look very pretty, does it? That's why we need Pandas to get exactly what we need from the dataset. We need the following three columns from the dataset:
* title
* imdb_id
* overview

Try running the code block below to see how to get your desired columns from a dataset.

In [None]:
df[['title', 'imdb_id', 'overview']]    

Let's create a function that returns the **title**, **imdb_id** and **overview** of all the movies in the dataset.

In [None]:
def get_dataset():
    path = "./data/dataset2.csv"
    df = pd.read_csv(path)
    df = df[['title', 'imdb_id', 'overview']].copy() # copy so the next line edits our own frame, not a view
    df['overview'] = df['overview'].fillna('') # replace NaN values with an empty string
    return df

df = get_dataset() # call the function and save it as a Dataframe
df.head(5) # Display the first 5 entries of the Dataframe
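To see the two Pandas steps in isolation, here is a minimal sketch on a tiny in-memory table. The `toy` frame and its `budget` column are made up for illustration and stand in for the real CSV, which is not loaded here.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real CSV; the extra "budget"
# column plays the role of the columns we do not need.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "imdb_id": ["tt0000001", "tt0000002", "tt0000003"],
    "overview": ["A heist goes wrong.", None, "Two friends travel."],
    "budget": [100, 200, 300],
})

# Double brackets select a list of columns and return a new DataFrame.
subset = toy[["title", "imdb_id", "overview"]].copy()

# Missing overviews become empty strings so text tools never see NaN.
subset["overview"] = subset["overview"].fillna("")

print(subset)
```

The `.copy()` call is a design choice: it makes `subset` independent of `toy`, so assigning to a column cannot trigger a chained-assignment warning.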

### 3. It's time to deliver our user's request

Our users have requested a feature: when they add a new movie to their favourites, they would like to get 2 new movie recommendations. The UI has already been prepared for this task. All that is left to do is:
* Get the movie details of the favourited movie from OMDB API using the IMDB ID of the movie
* Add the details to our dataset 
* If it already exists in the dataset, remove duplicates

#### 3.1 Get the details of the favourited movie

Luckily, another teammate has already created a function called 'get_movies_by_id' to get the movie details by IMDB ID. Run the code block below and test it out!

In [None]:
response = get_movies_by_id("tt0029927") # get movie details with IMDB ID
# response = pd.json_normalize(response) # uncomment to normalize response from OMDB API into a flat Data Frame
response

So it works! But as you can see the column names do not match the ones that the dataset has for the same information... This sort of mismatch happens often when we use different data sources.

| column info | column name in dataset | column name in API response |
| --- | --- | --- |
|Title |title|Title|
|IMDB ID|imdb_id| imdbID|
|Overview | overview| Plot |
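One way to reconcile the names in the table is a rename mapping. The sketch below assumes the API response is a flat dictionary containing the keys `Title`, `imdbID` and `Plot`; the `sample_response` values are invented for illustration and are not real API output.

```python
import pandas as pd

# Hypothetical, trimmed-down response; a real one is assumed to have more keys.
sample_response = {
    "Title": "Example Movie",
    "imdbID": "tt0000001",
    "Plot": "An example plot used only for illustration.",
    "Year": "1999",
}

# Flatten the dictionary into a one-row DataFrame.
row = pd.json_normalize(sample_response)

# Map API names to dataset names, then keep only the three columns we need.
column_map = {"Title": "title", "imdbID": "imdb_id", "Plot": "overview"}
row = row.rename(columns=column_map)[["title", "imdb_id", "overview"]]

print(row)
```

Keeping the mapping in one dictionary means a new mismatch between the two sources only needs one extra entry.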

#### 3.2 Let's create a function that adds the necessary details of the favourited movie to the dataset

Important points to consider:
* Function should be called `add_plot` (later code blocks call it by that name) and take an IMDB ID as an input
* The column names should match
* There shouldn't be any duplicates in the dataset
* Function should return the updated dataframe
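As a hint for the "no duplicates" point, here is a minimal sketch of appending a row and removing duplicates with Pandas. The `dataset` and `new_row` frames are made-up examples, and this is only one possible approach, not the expected solution.

```python
import pandas as pd

# Hypothetical two-movie dataset.
dataset = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "imdb_id": ["tt0000001", "tt0000002"],
    "overview": ["A heist goes wrong.", "Two friends travel."],
})

# A favourited movie that is already in the dataset (same imdb_id).
new_row = pd.DataFrame({
    "title": ["Movie B"],
    "imdb_id": ["tt0000002"],
    "overview": ["Two friends travel."],
})

# Stack the new row under the existing rows.
combined = pd.concat([dataset, new_row], ignore_index=True)

# Keep the first row for each IMDB ID, then renumber rows 0..n-1.
combined = combined.drop_duplicates(subset="imdb_id").reset_index(drop=True)

print(len(combined))
```

Resetting the index matters later on: the recommendation step uses the dataframe index as a row position in the similarity matrix, so the index should run from 0 to n-1 without gaps.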

In [13]:
# Write your code here    


### 4. Create the recommendation algorithm

We will be focusing on the plot of the movies and use NLP (Natural Language Processing) to find similarities between the plot of the favourited movie and all the other movie plots in our database.

First, we need to import the necessary Python machine learning libraries that we can use to complete the task. 

Scikit-learn (imported as `sklearn`) is a machine learning library for Python. It includes simple and efficient tools for predictive data analysis. [Check here for more](https://scikit-learn.org/stable/index.html).

We will use one of its functionalities that allows us to perform data analysis on text. We will compute pairwise similarity scores for all movies based on their plot descriptions and recommend movies based on that similarity score.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

Now if you are wondering how we will achieve that, one way of doing it is to create a Term Frequency-Inverse Document Frequency (TF-IDF) matrix... 

In human words, **TF (Term Frequency)** is the relative frequency of a word in a document and is given as (term instances/total instances). 

**IDF (Inverse Document Frequency)** measures how rare a term is across all documents and is given as log(number of documents/documents with term).

The overall importance of each word to the documents in which they appear is equal to **TF * IDF**.

Computing TF-IDF for every word in every overview gives a matrix where each row represents a movie and each column represents a word in the overview vocabulary (all the words that appear in at least one overview). Weighting by IDF reduces the importance of words that occur in many plot overviews, and therefore their influence on the final similarity score.
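The formulas above can be worked through by hand on three tiny made-up "overviews". This is a plain-Python sketch of the textbook definition given here; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalises each row by default, so its numbers will differ.

```python
import math

# Three invented one-line overviews, split into words.
docs = [
    "a detective hunts a killer",
    "a detective falls in love",
    "a robot falls in love",
]
tokenized = [doc.split() for doc in docs]

def tf(term, tokens):
    # Term instances divided by total instances in one document.
    return tokens.count(term) / len(tokens)

def idf(term, all_tokens):
    # log(number of documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for tokens in all_tokens if term in tokens)
    return math.log(len(all_tokens) / containing)

def tf_idf(term, tokens, all_tokens):
    return tf(term, tokens) * idf(term, all_tokens)

# Scores for three words of the first overview.
for word in ["a", "detective", "killer"]:
    print(word, round(tf_idf(word, tokenized[0], tokenized), 4))
```

The word "a" appears in every overview, so its IDF is log(1) = 0 and it contributes nothing, while "killer", which appears in only one overview, gets the highest score of the three.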

In [None]:
df = add_plot("tt0029927")

# Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df['overview'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape

We see that over 11,000 different words were used to describe the 1517 movies in our dataset. Now we can go ahead and calculate the cosine similarity score using our matrix and **linear_kernel()** functionality from the Scikit-learn library.

One explanation of a kernel is as follows: 

> a collection of distinct forms of pattern analysis algorithms, using a linear classifier, they solve an existing non-linear problem

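In this notebook we do not train a classifier. `linear_kernel()` simply returns the dot product of every pair of rows in the TF-IDF matrix, and we use that number as the similarity score between two movies.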
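A dot product can stand in for cosine similarity because `TfidfVectorizer` scales every row to unit L2 length by default, and for unit-length vectors the two quantities are the same. Here is a small plain-Python check; the vectors `u` and `v` are arbitrary illustrative values, not real TF-IDF rows.

```python
import math

def dot(u, v):
    # Sum of element-wise products.
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    # Scale the vector so that its Euclidean length is 1.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(u, v):
    # Dot product divided by the product of the two lengths.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [1.0, 2.0, 0.0]
v = [2.0, 1.0, 1.0]

cos = cosine_similarity(u, v)
lin = dot(l2_normalize(u), l2_normalize(v))

# The two values agree up to floating-point rounding.
print(cos, lin)
```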

In [15]:
# Similarity scores between all movies in the dataset
sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# We create a new Series where the index is the IMDB IDs and the values are the actual indices from the original dataframe
indices = pd.Series(df.index, index=df['imdb_id'])

# Let's try getting the index of one movie with its IMDB ID
idx = indices["tt0029927"]

# Get the pairwise similarity scores of all movies with that movie
sim_scores = list(enumerate(sim[idx]))

# Sort the movies based on the similarity scores
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

# Get the scores of the 2 most similar movies (position 0 is skipped because it is normally the movie itself)
sim_scores = sim_scores[1:3]

# Get the movie indices
movie_indices = [i[0] for i in sim_scores]
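The slice `[1:3]` relies on the favourited movie sorting first. That can fail when another movie has exactly the same score, for example a second entry with an identical overview. A slightly more defensive sketch filters the movie out by position instead. The helper name `top_k_similar` and the `row` values are hypothetical.

```python
def top_k_similar(scores, query_idx, k=2):
    """Return positions of the k highest scores, skipping the query itself."""
    ranked = sorted(
        (pair for pair in enumerate(scores) if pair[0] != query_idx),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [position for position, _ in ranked[:k]]

# Hypothetical similarity row for the movie at position 3.
# Position 1 has an identical plot, so it also scores 1.0.
row = [0.10, 1.00, 0.35, 1.00, 0.60]

# Slicing approach: the tie puts position 1 first, so the slice keeps position 3.
naive = [i for i, _ in sorted(enumerate(row), key=lambda x: x[1], reverse=True)[1:3]]

# Filtering approach: position 3 is excluded explicitly.
safe = top_k_similar(row, query_idx=3)

print(naive, safe)
```

Here the slicing approach returns the favourited movie as its own recommendation, while the filtering approach returns the duplicate-plot movie and the next best match.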


#### Final Challenge

Create a function called `get_recommendation` that takes an IMDB ID as an input and returns the IMDB IDs of the two recommended movies, then display the output.

In [None]:
# Write your code here


Now let's move all the functions you created to the recommend.py file. Functions to move:
* get_dataset()
* add_plot()
* get_recommendation()