<div style="text-align:center; font-size:45px; color: teal; letter-spacing: .1em;">
    MOVIES RECOMMENDATION SYSTEMS
</div>

<div id='tbl-contents'>
    <h2>TABLE OF CONTENTS</h2>
    <ol>
        <li>
            <h4>
                <a href='#introduction'>Introduction</a>
            </h4>
        </li>
        <li>
            <h4>
                <a href='#load-dependencies'>Load Dependencies</a>
            </h4>
        </li>
        <li>
            <h4>
                <a href='#load-data'>Load Data</a>
            </h4>
        </li> 
        <li>
            <h4>
                <a href='#sneak-peek'>Sneak Peek into Loaded Data</a>
            </h4>
        </li> 
        <li>
            <h4>
                <a href='#data-cleaning'>Data Cleaning</a>
            </h4>
        </li>  
    </ol>
</div>



<div>
    <h4><a href='#tbl-contents'>Back to table of contents</a></h4>
</div>
<div id='introduction'>
    <h2> INTRODUCTION </h2>
</div>


Intelligent algorithms can help viewers sift through tens of thousands of titles to find the best ones. Recommender systems are both socially and economically important in ensuring that people can make informed decisions about the content they consume on a daily basis. This is particularly true in the case of movie recommendations.

Providing an accurate and robust solution to this challenge has enormous economic potential, with users of the system receiving personalized recommendations, thereby enhancing platform affinity for the streaming services that best facilitate their audience's viewing.

### RECOMMENDER SYSTEM

A recommendation system is an information filtering system whose main goal is to predict the rating or preference a user might give to an item. This helps create personalized content and a better product search experience. One popular use is recommending which movie a user should watch next. Such predictions are possible because significant dependencies exist between user-centric and item-centric activity. For example, a user who is interested in a historical documentary is more likely to be interested in another historical documentary or an educational program than in an action movie.

A recommendation system can use either of these two techniques:

- Content based filtering
- Collaborative filtering

In content based filtering, the algorithm seeks to make recommendations based on how similar the properties or features of an item are to other items.
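As a rough illustration of the idea, the sketch below compares items by the cosine similarity of their feature vectors. The item names and the three feature columns are made up for this example and are not part of our datasets.

```python
import numpy as np

# Hypothetical item features: [documentary, history, action]
items = {
    "doc_a": np.array([1.0, 1.0, 0.0]),
    "doc_b": np.array([1.0, 0.8, 0.0]),
    "action_c": np.array([0.0, 0.1, 1.0]),
}

def cosine(u, v):
    """Cosine of the angle between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(name):
    """Return the other item whose features are closest to `name`."""
    scores = {
        other: cosine(items[name], vec)
        for other, vec in items.items()
        if other != name
    }
    return max(scores, key=scores.get)

print(most_similar("doc_a"))
```

Only item properties are used here; no user ratings are involved, which is what separates this approach from collaborative filtering.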

In collaborative filtering, we use similarities between users and items simultaneously to provide recommendations. This allows for serendipitous recommendations; that is, collaborative filtering models can recommend an item to user A based on the interests of a similar user B.
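A minimal sketch of the "similar user B" idea, assuming a tiny made-up rating matrix where 0 means "not rated". Treating unrated entries as zeros is a simplification made only to keep the example short.

```python
import numpy as np

# rows = users, columns = movies; values are hypothetical ratings, 0 = unrated
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],  # user A
    [5.0, 5.0, 4.0, 1.0],  # user B
    [1.0, 0.0, 2.0, 5.0],  # user C
])
movie_names = ["m0", "m1", "m2", "m3"]

def recommend_for(user_idx):
    """Find the most similar other user and suggest their best-rated unseen movie."""
    target = ratings[user_idx]
    norms = np.linalg.norm(ratings, axis=1)
    sims = ratings @ target / (norms * norms[user_idx])  # cosine similarity to every user
    sims[user_idx] = -1.0                                # never pick the user themselves
    neighbour = int(np.argmax(sims))
    unseen = np.where(target == 0)[0]                    # movies the target has not rated
    best = unseen[np.argmax(ratings[neighbour, unseen])]
    return neighbour, movie_names[best]

print(recommend_for(0))
```

User A has never rated `m2`, but user B, whose tastes are closest to A's, rated it highly, so it becomes the suggestion.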

Here we are going to explore both methods and assess which recommendation system gives us the best results. For merchants, the primary goal of a recommender system is to increase sales: by recommending carefully selected items, it brings relevant items to the attention of users, which in turn increases sales volumes and profits.

<div>
    <h4><a href='#tbl-contents'>Back to table of contents</a></h4>
</div>
<div id='load-dependencies'>
    <h2>LOAD DEPENDENCIES</h2>
</div>

In [1]:
import os # for os operations on kaggle

# for pattern searching and extraction 
import re

# libraries for data analysis and manipulation
import pandas as pd
import numpy as np

# libraries for numerical efficiencies
import scipy as sp
from scipy import stats

# libraries for data visualizations
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import figure

# libraries used during sorting procedures.
import operator
import heapq

# library to evaluate strings containing python literals
from ast import literal_eval

# libraries for natural language processing
from nltk.corpus import wordnet
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

# libraries for building and analyzing recommender systems that deal with explicit rating data.
from surprise import Reader, Dataset, SVD
from surprise import KNNBasic, BaselineOnly, SVDpp

# libraries for entity featurization and similarity computation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from surprise.model_selection import cross_validate, train_test_split

# to ignore whatever warnings that may arise
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 10) # defaulting all plots to a fixed size
plt.style.use('ggplot')
sns.set_palette(sns.dark_palette("#69d"))

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<div>
    <h4><a href='#tbl-contents'>Back to table of contents</a></h4>
</div>
<div id='load-data'>
    <h2>LOAD DATASETS</h2>
</div>

In [2]:
def convert_columns(data):
    """
    This function takes in a dataset and converts the 
    dtype of each column to a lesser version to reduce
    the size of the dataset for further operations.
    """
    
    for col in data.columns: # iterate over the columns in the dataset
        
        if data[col].dtype == 'object':
            data[col] = data[col].astype('category') # convert objects to categories
        
        if data[col].dtype == 'int64':
            data[col] = data[col].astype('int32') # convert int64 to int32
        
        if data[col].dtype == 'float64':
            data[col] = data[col].astype('float32') # convert float64 to float32
        
    return data # return converted data
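
To get a feel for what this kind of downcasting buys us, here is a small sketch on a made-up frame (the column names only mimic our data). It applies the same dtype mapping and compares memory usage before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for one of the larger CSV files
df = pd.DataFrame({
    "movieId": np.arange(1000, dtype="int64"),
    "relevance": np.linspace(0.0, 1.0, 1000),
    "tag": ["drama", "comedy"] * 500,
})

before = df.memory_usage(deep=True).sum()

# Same mapping as convert_columns: int64 -> int32, float64 -> float32, text -> category
small = df.astype({"movieId": "int32", "relevance": "float32", "tag": "category"})
after = small.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")

# int32 can only hold values up to this limit
print(np.iinfo("int32").max)
```

The trade-off is range and precision: values larger than the int32 limit would not survive the conversion, and float32 keeps only about 7 significant digits.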


In [3]:
imdb = pd.read_csv('/kaggle/input/movie-recommendation-data/imdb_data.csv')
movies = pd.read_csv('/kaggle/input/movie-recommendation-data/movies.csv')
meta_data = pd.read_csv('/kaggle/input/movies-metadata/movies_metadata.csv')
genome_scores = pd.read_csv('/kaggle/input/movie-recommendation-data/genome_scores.csv')
genome_scores = convert_columns(genome_scores)
genome_tags = pd.read_csv('/kaggle/input/movie-recommendation-data/genome_tags.csv')
genome_tags= convert_columns(genome_tags)
train = pd.read_csv('/kaggle/input/movie-recommendation-data/train.csv')
train = convert_columns(train)
test = pd.read_csv('/kaggle/input/movie-recommendation-data/test.csv')
test = convert_columns(test)
links = pd.read_csv('/kaggle/input/movie-recommendation-data/links.csv')
links = convert_columns(links)
tags = pd.read_csv('/kaggle/input/movie-recommendation-data/tags.csv')
tags = convert_columns(tags)

# movies = convert_columns(movies)
sample_submission = pd.read_csv('/kaggle/input/movie-recommendation-data/sample_submission.csv')

<div>
    <h4><a href='#tbl-contents'>Back to table of contents</a></h4>
</div>
<div id='sneak-peek'>
    <h2 style='text-transform: uppercase;'>Sneak Peek into Loaded Data</h2>
</div>

#### 1. IMDB DATA

In [4]:
# imdb data
imdb.head(3)

From the above output, we see that the IMDB Data is made up of __6__ columns - __movieId__, __title_cast__, __director__, __runtime__, __budget__, __plot_keywords__.

We can observe that the values in the __title_cast__ and __plot_keywords__ columns are separated by a pipe ('|'). This makes each row in these columns look like one long, complicated word, which will make further analysis difficult. We will treat this problem in the `Data Cleaning` section.

Also, values in the __budget__ column, which is meant to be numerical, are prefixed with currency symbols and use commas (,) as thousands separators. This format prevents numerical operations and needs to be taken care of in the `Data Cleaning` section.

In [5]:
# get the dimensions of the data
imdb.shape

IMDB data is made up of __27,278__ rows and __6__ columns.

How about some information about the data? 

In [6]:
imdb.info()

The IMDB data has __1__ column of dtype `float64` - __runtime__, __1__ column of dtype `int64` - __movieId__ and __4__ columns of dtype `object` - __title_cast, director, budget & plot_keywords__. 

The __budget__ column is meant to be numerical to aid aggregation. This will be taken care of in the `Data Cleaning` section.

We also have a case of missing data in all columns bar __movieId__. Let's see by how much.

In [7]:
# Extract the number of missing data and the percentage
# of missing data and concatenate into one dataframe
imdb_missing_data = pd.concat([imdb.isnull().sum(), round(imdb.isnull().sum()/imdb.shape[0] * 100)], axis=1)
imdb_missing_data.columns = ['missing_count', 'missing_percentage'] # rename columns
imdb_missing_data

`budget` has the most missing data, with __19,372__ missing rows making up __71%__ of the entire column. That is a huge amount!

`runtime` has the second highest share of missing values at __44%__, followed closely by `plot_keywords` at __41%__. `title_cast` and `director` also record missing values at __37%__ and __36%__ respectively.

That is a lot of missing data, and we will have to come up with creative ways to deal with it in the `Feature Engineering` section.
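
Since we repeat this missing-data check for every dataset, it could be wrapped in a small helper. The sketch below is one way to do it; `missing_summary` and the toy frame are names invented for this example.

```python
import pandas as pd

def missing_summary(df, decimals=0):
    """Count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percentages = (counts / len(df) * 100).round(decimals)
    return pd.DataFrame({
        "missing_count": counts,
        "missing_percentage": percentages,
    })

# Hypothetical frame with three of four budgets missing
toy = pd.DataFrame({
    "budget": ["$1,000", None, None, None],
    "movieId": [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```

Passing `decimals=3` gives the finer resolution that is useful when only a tiny fraction of rows is missing.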

Next, Movies data

#### 2. MOVIES DATA

In [8]:
# movies data
movies.head(3)

Here we can see that the Movies data is made up of __3__ columns - __movieId__, __title__, __genres__.

Similar to IMDB's __title_cast__ and __plot_keywords__, data in the __genres__ column are separated distinctly by a '|' symbol. As stated earlier, this will need to be taken care of in the `Data Cleaning` section.

The __title__ column holds both the _title_ of the movie and the _year of release_. Like conjoined twins, they need to be separated in the theatre of `Data Cleaning`.
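
As a sketch of how that separation could work, the example below uses a regular expression to pull a trailing "(YYYY)" out of a few made-up titles. It assumes the year, when present, always sits in parentheses at the end of the title.

```python
import pandas as pd

# Hypothetical titles; the last one has no year on purpose
titles = pd.Series(["Toy Story (1995)", "Jumanji (1995)", "Untitled Project"])

# capture the four digits inside the trailing parentheses; NaN when absent
year = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)

# remove the trailing "(YYYY)" and any surrounding whitespace
clean_title = titles.str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True).str.strip()

split = pd.DataFrame({"title": clean_title, "year": year})
print(split)
```

Unlike slicing a fixed number of characters off the end, this leaves titles without a year untouched instead of chopping off part of the name.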


Next, we will look at the dimensions of the data using the `.shape` attribute of a DataFrame.

In [9]:
# get the dimensions of the data
movies.shape

The movies dataset has __62,423__ rows of data and __3__ columns of features.

Fine, let's drill down a bit on the data by columns to gain a slightly better understanding using `.info` 

In [10]:
# get more information about the data
movies.info()

There are __2__ object columns (__title__ and __genres__) and __1__ numerical (_int64_) column, __movieId__.

We can safely say there are no missing values in any of the columns, judging from the shape of the dataset and the number of _non-null count_ for each column.

In [11]:
# Extract the number of missing data and the percentage
# of missing data and concatenate into one dataframe
movies_missing_data = pd.concat([movies.isnull().sum(), round(movies.isnull().sum()/movies.shape[0] * 100)], axis=1)
movies_missing_data.columns = ['missing_count', 'missing_percentage'] # rename columns
movies_missing_data

Our intuition was right after all...

Next on the list, Meta_data!

#### 3. META_DATA DATA

In [12]:
meta_data.head(3)

Wow! The meta_data dataset seems to be a richer, consolidated version of the IMDB and Movies datasets, with a lot more information about each movie. This will be very useful for our recommendation systems.

A notable drawback is that meta_data does not have a "movieId" column. We could use the "id" column instead, but it doesn't map correctly to the "movieId" of the other datasets. 

What to do? Let's keep that pending while we continue exploring the dataset.

Let's look at the dimensions of the data next.

In [13]:
# get the dimension of the data
meta_data.shape

This dataset has __45,466__ rows and __24__ columns.

More information please! `.info`

In [14]:
# get more information about the dataset
meta_data.info()

We have a lot more columns than we do for both `movies` and `IMDB` datasets combined. 

Question is, are they all useful for what we are trying to achieve? 

There are columns with missing values. It is difficult to know by how much, so let's break it down.

In [15]:
# Extract the number of missing data and the percentage
# of missing data and concatenate into one dataframe
meta_data_missing_data = pd.concat([meta_data.isnull().sum(), round(meta_data.isnull().sum()/meta_data.shape[0] * 100)], axis=1)
meta_data_missing_data.columns = ['missing_count', 'missing_percentage'] # rename columns
meta_data_missing_data

An astonishing __90%__ of the data is missing in the __belongs_to_collection__ column, but that figure can be misleading: not every movie is part of a franchise or collection, so most movies legitimately have no value here. Also, the __budget__ column appears to have no missing values, but from the initial sneak peek we can see that there are movies with a budget of **zero (0)**. That is practically impossible, so these zeros are really missing values in disguise and need to be dealt with.
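
One possible way to surface those hidden gaps is to coerce the column to numbers and turn zeros into NaN, so that they show up in the missing-data counts. This is only a sketch on a made-up frame; whether the real __budget__ column is stored as text or as numbers is an assumption the code is written to tolerate.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a real budget, a zero budget and an unparseable value
meta_toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "budget": ["30000000", "0", "not available"],
})

# anything that cannot be parsed as a number becomes NaN
budget = pd.to_numeric(meta_toy["budget"], errors="coerce")

# a budget of 0 is treated as "unknown" rather than as a real value
meta_toy["budget"] = budget.replace(0, np.nan)

print(meta_toy["budget"].isna().sum())
```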

__homepage__, on the other hand, has __83%__ of its data missing and is not useful to us in the context of a recommender system. It will therefore be removed during `Feature Engineering`.

__tagline__, while missing approximately half of its data, may still be of value to us and cannot be discarded so easily.

Honorable mentions in the missing data category include __overview__ (__2%__), __poster_path__ (__1%__) and __runtime__ (__1%__).

Up Next, Genome_scores.

#### 4. GENOME SCORES DATA

In [16]:
genome_scores.head(3)

This dataset is made up of __3__ columns, namely; __movieId__, __tagId__ and __relevance__.

Right now, we can only assume that __relevance__ indicates by how much a tag is of importance to a movie.

Let's look at the shape of the dataset

In [17]:
genome_scores.shape

There are over __15 Million__ rows of data.

In [18]:
genome_scores.info()

There's no information on the number of non-null rows.

There are __2__ *int32* columns and __1__ *float32* column, indicating it's an all-numeric dataset.

Let's see if there are any missing data

In [19]:
# Extract the number of missing data and the percentage
# of missing data and concatenate into one dataframe
gs_missing_data = pd.concat([genome_scores.isnull().sum(), round(genome_scores.isnull().sum()/genome_scores.shape[0] * 100)], axis=1)
gs_missing_data.columns = ['missing_count', 'missing_percentage'] # rename columns
gs_missing_data

There are no missing data.

Next, we look at Genome Tags.

#### 5. GENOME TAGS DATA 

In [20]:
genome_tags.head(5)

This dataset has __2__ columns; __tagId__ and __tag__. 

Movie tags are another way to relate movies to each other.

Next, The dimensions of the dataset

In [21]:
genome_tags.shape

There are __1,128__ rows of data.

In [22]:
genome_tags.info()

The dataset consists of __1__ categorical column (__tag__, which holds text and was converted from 'object' by `convert_columns`) and __1__ numerical column ('int32').

And there are no missing values, but let's double check.

In [23]:
# Extract the number of missing data and the percentage
# of missing data and concatenate into one dataframe
gt_missing_data = pd.concat([genome_tags.isnull().sum(), round(genome_tags.isnull().sum()/genome_tags.shape[0] * 100)], axis=1)
gt_missing_data.columns = ['missing_count', 'missing_percentage'] # rename columns
gt_missing_data

Our suspicion was correct after all. Let's trust our gut feelings next time. :-)

Up next, we will take a sneak peek at the train dataset. Stay tuned!

#### 6. TRAIN DATA

In [24]:
train.head(3)

At a glance, we see that the train dataset is made up of __4__ columns; __userId__, __movieId__, __rating__ and __timestamp__.

Here we have the rating each user gave a movie and a timestamp of when that rating occurred.

We will look at the shape of the data next

In [25]:
train.shape

The dataset is made up of about __10 Million__ rows of data. Pretty large.

Let's extract more information.

In [26]:
train.info()

There are __4__ columns, all of which are numerical, consisting of __3__ columns of dtype *int32* and __1__ column of dtype *float32*. I am tempted to say there are no missing values and trust my gut feelings, but just to double check again...

In [27]:
# Extract the number of missing data and the percentage
# of missing data and concatenate into one dataframe
train_missing_data = pd.concat([train.isnull().sum(), round(train.isnull().sum()/train.shape[0] * 100)], axis=1)
train_missing_data.columns = ['missing_count', 'missing_percentage'] # rename columns
train_missing_data

Sorry gut feelings!

#### 7. LINKS DATA

In [61]:
links.head()

Links, as the name suggests, contains only identifier columns (keys that uniquely identify each movie) that link this data to other datasets. 

It is made up of **3** columns; __movieId__, __imdbId__ & __tmdbId__

In [62]:
links.shape

There are __62,423__ rows in the dataset

In [63]:
links.info()

All columns are numeric columns; __2__ _int32_ and __1__ _float32_ column(s) respectively. 

__tmdbId__ seems to have missing values, let's check by how much

In [65]:
# Extract the number of missing data and the percentage
# of missing data and concatenate into one dataframe
links_missing_data = pd.concat([links.isnull().sum(), round(links.isnull().sum()/links.shape[0] * 100, 3)], axis=1)
links_missing_data.columns = ['missing_count', 'missing_percentage'] # rename columns
links_missing_data

The number of missing data here is very negligible at __0.171%__, which I believe won't create significant problems

For our final show in this section, we will take a sneak peek into the tags dataset

#### 8. TAGS DATA

In [28]:
tags.head()

This dataset also has __4__ columns; __userId__, __movieId__, __tag__ and __timestamp__.

__tag__ also features here as it did in `genome_tags` dataset. Is there a difference? or are they the same? This we will explore in the `Exploratory Data Analysis` section.

Next, we will take a look at the dimensions of the dataset

In [29]:
tags.shape

There are over __1 Million__ rows of data in this dataset.

Let's get some more information.

In [30]:
tags.info()

There are __3__ numerical columns and __1__ categorical column which is the __tag__ column. 

The tag column also seems to be missing some values. Let's confirm this.

In [31]:
# Extract the number of missing data and the percentage
# of missing data and concatenate into one dataframe
tags_missing_data = pd.concat([tags.isnull().sum(), round(tags.isnull().sum()/tags.shape[0] * 100, 3)], axis=1)
tags_missing_data.columns = ['missing_count', 'missing_percentage'] # rename columns
tags_missing_data

There's a very very negligible number of missing data at __0.001%__. That won't do us much harm.

And with that, we've come to the end of the `Sneak Peek into Loaded Data` section. 

Here, we had a brief overview of the datasets we intend to work with and what needs to be done to get our data ready for further analysis and modeling.


During the sneak peek, we noticed that some columns aren't quite right. In the next section, we will use a number of techniques to get the data into the right, usable format in a process known as `Data Cleaning`.

<div>
    <h4><a href='#tbl-contents'>Back to table of contents</a></h4>
</div>
<div id='data-cleaning'>
    <h2 style='text-transform: uppercase;'>data cleaning</h2>
</div>

#### WHAT IS DATA CLEANING?

The process of repairing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data from a dataset is known as `Data Cleaning`. There are numerous opportunities for data to be duplicated or mislabeled when combining multiple data sources. If the data is incorrect, the results and algorithms are untrustworthy, even if they appear to be correct.

This will be done for every dataset we intend to work with.

#### 1. IMDB DATA

First, let's make a copy of the dataset

In [32]:
# make a copy of the dataset
imdb_copy = imdb.copy(deep=True)

...then let's remind ourselves what messy data we have on our hands

In [33]:
imdb_copy.head(3)

For the *title_cast* column, we will split each row on '|'. We want to keep each actor's first name and last name together, so we join them with an underscore ('_'). The same goes for the directors.

For *plot_keywords*, we will replace '|' with a space (' ').

For the *budget* column, we will remove the commas (','), then extract the digits into one column and the currency symbols into another.

In [54]:
# check for duplicated data

imdb_copy.duplicated().sum()

There are no duplicated data in the IMDB Dataset

#### Agenda
1. Fill null values with '' (blank) for *title_cast*, *director* and *plot_keywords*
2. Split *title_cast* on '|', join the first and last names of each actor with '_'
3. Replace '|' with ' '(a space) in *plot_keywords*
4. Replace ',' in _budget_ column with ''(nothing)
5. Extract currency symbol into another column called 'symbol' and amount into 'budget_amount'
6. Convert the dtype of budget to 'float32'

In [42]:
# Agenda 1: Fill null values with ''(blank)

imdb_copy['title_cast'] = imdb_copy['title_cast'].fillna('')
imdb_copy['director'] = imdb_copy['director'].fillna('')
imdb_copy['plot_keywords'] = imdb_copy['plot_keywords'].fillna('')

In [43]:
# Agenda 2: Split title_cast on '|', join the first and last names of each actor with '_'

def clean_text(text):
    
    # split text on '|'
    text_split = text.split('|')
    
    # replace the space between the actors first name and
    # lastname with an underscore, convert to lowercase
    # and then join into a string.
    text_replace = ' '.join([x.replace(' ', '_') if len(x) > 0 else '' for x in text_split]).lower()
    
    # return transformed text
    return text_replace

# apply clean_text function to each row
imdb_copy['title_cast'] = imdb_copy['title_cast'].apply(clean_text) 

# replace the space between the directors' first name and 
# last names with an underscore, and convert to lowercase
imdb_copy['director'] = imdb_copy['director'].apply(lambda row: row.replace(' ', '_').lower())

In [44]:
# Agenda 3: Replace '|' with ' '(a space) in plot_keywords

# select column and use .apply() with the lambda function to replace "|" character
# with a space.
imdb_copy['plot_keywords'] = imdb_copy['plot_keywords']\
                            .apply(lambda row: row.replace('|', ' '))

In [46]:
# Agenda 4: Replace ',' in budget column with ''(nothing)

# replace commas in budget amount with blanks
# excluding rows with values np.Nan
imdb_copy['budget'] = imdb_copy['budget']\
                                .apply(lambda row: row.replace(',', '')\
                                      if type(row) == str else row)

Removing the commas before using regex to extract amounts and currency symbols is important. The digit pattern stops at the first non-digit character, so extracting first would capture only the digits before the first comma, leaving us with the millions figure without its zeros (i.e. 35 instead of 35000000).
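
The effect is easy to see on a single value. The sketch below uses the same two patterns on a made-up budget string; `parse_budget` is a helper invented for this illustration, not a function used elsewhere in this notebook.

```python
import re

def parse_budget(raw):
    """Split a budget string such as '$35,000,000' into (symbol, amount)."""
    if not isinstance(raw, str):
        return None, None  # missing values (NaN) are passed through as None
    cleaned = raw.replace(",", "")
    amount = re.search(r"[0-9]+", cleaned)
    symbol = re.search(r"[$A-Za-z]+", cleaned)
    return (
        symbol.group() if symbol else None,
        float(amount.group()) if amount else None,
    )

# with the commas still in place, the match stops at the first comma
print(re.search(r"[0-9]+", "$35,000,000").group())

# with the commas removed first, the whole amount is captured
print(parse_budget("$35,000,000"))
```

Checking that a match exists before calling `.group()` also guards against strings that contain no digits or no currency symbol at all.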

In [47]:
# Agenda 5: Extract currency symbol into another column called 'symbol' and amount into 'budget_amount'

pattern = r'[0-9]+' # pattern to search for digits
symbol = r'[$A-Za-z]+' # pattern to search for the currency symbol or code

# extract the budget amount from the budget column 
# and put it in the `budget_amount` column,
# excluding rows with values np.nan
imdb_copy['budget_amount'] = imdb_copy['budget']\
                            .apply(lambda row: re.search(pattern, row)\
                                   .group() if type(row) == str else row)

# extract the currency symbol from the budget column
# and put it in `symbol` column
# excluding rows with values np.Nan
imdb_copy['symbol'] = imdb_copy['budget']\
                            .apply(lambda row: re.search(symbol, row)\
                                   .group() if type(row) == str else row)

In [48]:
# Agenda 6: Convert the dtype of budget to 'float32'

imdb_copy['budget_amount'] = imdb_copy['budget_amount'].astype('float32')

In [50]:
imdb_copy.head(3)

So far, we have been able to clean the imdb data to a certain degree and I am currently happy with the result. 

The redundant columns will be handled during `Feature Engineering`

Next we will be cleaning the Movies Dataset

#### 2. MOVIES DATA

In [51]:
# make a copy

movies_copy = movies.copy(deep=True)

In [52]:
movies_copy.head(3)

In [56]:
# check for duplicated data

movies_copy.duplicated().sum()

There are no duplicated data

The Agenda for this dataset will be;

1. Split the title into 2 columns, title and year_released
2. Replace '|' with ' '(space) in the genres column

In [57]:
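# Agenda 1: Split the title into 2 columns, title and year_released (stored in `year`)

# work on the copy so that the raw `movies` frame stays intact for later cells
# pull the 4-digit year out of the trailing "(YYYY)"; titles without one get NaN
movies_copy['year'] = movies_copy['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)
# drop the trailing "(YYYY)", then trim and lowercase the title
movies_copy['title'] = movies_copy['title'].str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True).str.strip().str.lower()

In [58]:
# Agenda 2: Replace '|' with ' '(space) in the genres column

movies_copy['genres'] = movies_copy['genres'].apply(lambda row: row.replace('|', ' ').lower())

In [60]:
movies_copy.head(3)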

Everything seems to have worked out fine, and I am happy with the results


Next, Meta_data

#### 3. META DATA

Recall that meta_data is a richer, consolidated version of the IMDB and Movies datasets. It does not, however, contain values for some of the data points that are available in those datasets. For instance, some movie budgets available in the IMDB dataset are missing from meta_data.

After cleaning, we will take a look at how to maximally utilise all datasets involved

In [66]:
meta_copy = meta_data.copy(deep=True)

In [67]:
meta_copy.head(3)

In [71]:
# displaying the truncated columns
meta_copy.loc[:, 'overview': 'release_date'].head(3)

This is very messy data. Let's get to work.

Let's set our cleaning agenda for this dataset;

1. Column *belongs_to_collection* should be converted to a boolean field (True or False)
2. Extract the genres from column _genres_, as they are stored as dictionaries in a list
3. Extract the digits from the *imdb_id* column while renaming the column to _imdbId_
4. Compare *original_title* and *title*
5. Convert *popularity* to 'float32'
6. Extract *production_companies* and *production_countries*
7. Extract _year_ from *release_date*
8. Extract _language_ from *spoken_languages*
9. Compare _language_ with *original_language*
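
Agenda item 2 could be sketched as follows, using `literal_eval` to turn the text back into Python objects. The exact cell format, a string holding a list of dictionaries with `id` and `name` keys, is an assumption based on the sneak peek, and `extract_names` is a name made up for this example.

```python
from ast import literal_eval

def extract_names(cell):
    """Return the 'name' of every dictionary in a stringified list of dicts."""
    if not isinstance(cell, str):
        return []  # NaN or other non-text values
    try:
        parsed = literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # text that is not a valid Python literal
    if not isinstance(parsed, list):
        return []
    return [d["name"] for d in parsed if isinstance(d, dict) and "name" in d]

# hypothetical cell content in the assumed format
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(extract_names(sample))
```

The same helper could be reused for the production companies, production countries and spoken languages columns if they turn out to share this layout.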

In [77]:
# Agenda 1: Column belongs_to_collection should be converted to a boolean field (True or False)

# missing values are read in as NaN, so "not null" means the movie belongs to a collection
meta_copy['belongs_to_collection'] = meta_copy['belongs_to_collection'].notna()

In [None]:
# engineer a_list_actors, b_list_actors, c_list_actors

def classify(df, col, columns=[], threshold=[]):
    most_featured = feature_importance(df[col])
    most_featured_df = pd.DataFrame.from_dict(most_featured.items())
    most_featured_df.columns = columns

    print('feature extraction complete')

    a_list_actors = most_featured_df[columns[0]][most_featured_df[columns[1]] >= threshold[0]].tolist()
    b_list_actors = most_featured_df[columns[0]][(most_featured_df[columns[1]] >= threshold[1]) & (most_featured_df[columns[1]] < threshold[0])].tolist()
    c_list_actors = most_featured_df[columns[0]][most_featured_df[columns[1]] < threshold[1]].tolist()

    a_list_count = []
    b_list_count = []
    c_list_count = []

    for row in df[col]:
        a_counter = 0
        b_counter = 0
        c_counter = 0

        for item in row:
            if item in a_list_actors:
                a_counter += 1
            if item in b_list_actors:
                b_counter += 1
            if item in c_list_actors:
                c_counter += 1

        a_list_count.append(a_counter)
        b_list_count.append(b_counter)
        c_list_count.append(c_counter)

    return (a_list_count, b_list_count, c_list_count)

def feature_importance(series):
    items = aggregate(series)
    frequency = {}

    for item in items:
        if item in frequency:
            frequency[item] += 1
        else:
            frequency[item] = 1
    
    sorted_freq = {k: v for k, v in sorted(frequency.items(), reverse=True, key=lambda item: item[1])}

    return sorted_freq

def aggregate(series):
    # flatten a series of lists into one flat list
    flattened = []
    for i in series:
        for j in i:
            flattened.append(j)
  
    return flattened

def split_text(text, sep):
    split_text = text.split(sep)
    return split_text

def join_lists(list_):
    return ' '.join(list_)

def stringify(num):
    num = str(num)
    return num

def fill_na(df, cols=[]):
    for col in cols:
        df[col] = df[col].fillna(df[col].median())
  
    return df

In [None]:
movies.head()

In [None]:
tags.head()

In [None]:
imdb.head()

## Based On: Director and Title Cast | Genres and Plot Keywords

In [None]:
# genre, director, title_cast
pkt = movies.merge(imdb, how='left', on='movieId')

In [None]:
# fill missing text fields so the string operations below don't fail on NaN
text_cols = ['plot_keywords', 'genres', 'title_cast', 'director']
pkt[text_cols] = pkt[text_cols].fillna('')

# pipe-separated fields become space-separated, lowercase tokens
for col in ['plot_keywords', 'genres', 'title_cast']:
    pkt[col] = pkt[col].apply(lambda x: ' '.join(split_text(x, '|')).lower())

# repeat the director three times to give it more weight in the soup
pkt['director'] = pkt['director'].apply(lambda x: ' '.join([x, x, x]).lower())

# note the explicit separators: without the second " " the last director
# token and the first plot keyword would fuse into a single word
pkt['soup'] = pkt['title_cast'] + " " + pkt['director'] + " " + pkt['plot_keywords']
pkt.head()

### Title Cast and Director

In [None]:
pkt = pkt[['soup', 'movieId']]
pkt = convert_columns(pkt)
pkt_movies = pkt.merge(movies[['movieId', 'title']], on='movieId')
pkt_movies.head()

In [None]:
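# strip a trailing "(YYYY)" only when it is present, instead of always cutting 7 characters
pkt_movies['title'] = pkt_movies['title'].str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True)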
pkt_movies.head()

In [None]:
# pkt_movies['soup'] = pkt_movies['soup'].astype('O').fillna('')
pkt_movies = convert_columns(pkt_movies)

In [None]:
# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(pkt_movies['soup'])

In [None]:
tfidf_matrix.shape

In [None]:
tfidf_matrix = tfidf_matrix.astype('float32')

In [None]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
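
A note on this step: `TfidfVectorizer` L2-normalises each row by default, so the plain dot product from `linear_kernel` equals the cosine similarity. The full matrix is dense, though, and for 62,423 movies it would hold roughly 3.9 billion entries. The sketch below, on three made-up soups, checks the equivalence and shows one possible alternative: computing the similarities for a single movie on demand. `similar_to` is a helper invented for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# hypothetical soups in the same style as pkt['soup']
soups = [
    "tom_hanks tim_allen john_lasseter toy friendship",
    "tom_hanks robin_wright robert_zemeckis friendship war",
    "keanu_reeves wachowski simulation hacker",
]
matrix = TfidfVectorizer().fit_transform(soups)

# on L2-normalised rows the dot product and the cosine similarity coincide
full = linear_kernel(matrix, matrix)
same_as_cosine = np.allclose(full, cosine_similarity(matrix))

def similar_to(idx, top_n=2):
    """Rank the other rows by similarity to row `idx`, one row at a time."""
    scores = linear_kernel(matrix[idx], matrix).ravel()  # shape: (n_movies,)
    order = np.argsort(-scores)
    return [int(i) for i in order if i != idx][:top_n]

print(same_as_cosine, similar_to(0))
```

Computing one row per request keeps memory proportional to the number of movies rather than to its square, at the cost of repeating the multiplication for each query.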

In [None]:
pkt_movies  = pkt_movies.reset_index()
titles = pkt_movies['title']
indices = pd.Series(pkt_movies.index, index=pkt_movies['title'])

In [None]:
cosine_sim.shape

In [None]:
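def get_recommendations(title, n=30):
    idx = indices[title]
    # several movies can share a title; fall back to the first match
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # skip position 0, which is normally the movie itself
    sim_scores = sim_scores[1:n + 1]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]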

In [None]:
get_recommendations('Balto').head(10)

In [None]:
movies.title[movies.title.str.startswith('B')]

# Collaborative Filtering

In [None]:
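# assumes ratings run from 0.5 to 5.0; a bare Reader() defaults to a (1, 5) scale
reader = Reader(rating_scale=(0.5, 5.0))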

In [None]:
train.head()

In [None]:
train['userId'] = train['userId'].apply(stringify)
train['movieId'] = train['movieId'].apply(stringify)
# train['Id'] = train['userId'] + "_" + train['movieId']

In [None]:
data = Dataset.load_from_df(train[['userId', 'movieId', 'rating']], reader)

In [None]:
sim_options = {'name': 'cosine',
               'user_based': False  # compute  similarities between items
               }
# knn =KNNBasic(sim_options=sim_options)
svd = SVDpp()

In [None]:
trainset, testset = train_test_split(data, test_size=.25)

In [None]:
svd.fit(trainset)

In [None]:
predictions = svd.test(testset)

In [None]:
from surprise import accuracy

accuracy.rmse(predictions)
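
For reference, RMSE is the square root of the average squared difference between the predicted and the actual ratings. The sketch below spells that out in plain Python on made-up numbers; it is only meant to show what the reported figure measures.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long lists of ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# hypothetical true ratings and model estimates
true_ratings = [4.0, 3.0, 5.0]
estimates = [3.5, 3.0, 4.0]
print(round(rmse(true_ratings, estimates), 4))
```

Because the errors are squared before averaging, one prediction that is off by a full star hurts the score more than two predictions that are each off by half a star.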

In [None]:
test['userId'] = test['userId'].apply(stringify)
test['movieId'] = test['movieId'].apply(stringify)

In [None]:
from surprise import dump
import os

model_filename = "/kaggle/working/model.pickle"

print(">>> starting dump")
file_name = os.path.expanduser(model_filename)
dump.dump(file_name, algo=svd)
print('dump done')
print(file_name)

In [None]:
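def item_rating(user, item):
    # surprise looks raw ids up exactly as they were passed at training time,
    # so cast to str to match the stringified ids used to build the trainset
    uid = str(user)
    iid = str(item)
    prediction = svd.predict(uid, iid, verbose=False)
    rating = round(prediction.est, 1)
    
    return rating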

In [None]:
test['ratings'] = test.apply(lambda row: item_rating(row['userId'], row['movieId']), axis=1)

In [None]:
test.head()

In [None]:
# get relevant tags and merge with genome_tags

threshold = 0.80 # get genomes with relevance 80% and above

relevant_genomes = genome_scores[genome_scores.relevance >= threshold]

In [None]:
scores_tags = genome_tags.merge(relevant_genomes, on='tagId')

In [None]:
movie_count = scores_tags.groupby('tag')['movieId'].count().reset_index().rename(columns={'movieId':'movieId_counts'})

In [None]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

match_list = []
ratio_list = []

bad_tags = scores_tags.tag.values
good_tags = movie_count.tag[movie_count.movieId_counts > 100].values

threshold = 80

for b_tag in bad_tags:
    process_extract = process.extractOne(b_tag, good_tags, scorer=fuzz.token_sort_ratio)
    match_list.append(process_extract[0])
    ratio_list.append(process_extract[1])

scores_tags['matches'] = match_list
scores_tags['match_ratio'] = ratio_list
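
To make the matching step above easier to reason about, here is a rough standard-library stand-in for an order-insensitive string score. It is not fuzzywuzzy's implementation: `token_sort_similarity` and `best_match` are hypothetical helpers built on `difflib.SequenceMatcher`, and their scores may differ from `fuzz.token_sort_ratio`.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a, b):
    """Approximate, order-insensitive similarity on a 0-100 scale."""
    # Lowercase, split on whitespace and sort the tokens before comparing
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())


def best_match(query, choices):
    """Return the (choice, score) pair with the highest similarity to query."""
    scored = ((c, token_sort_similarity(query, c)) for c in choices)
    return max(scored, key=lambda pair: pair[1])


print(token_sort_similarity("world war ii", "ii world war"))
print(best_match("sci fi", ["sci-fi", "romance", "war"]))
```

Because the tokens are sorted first, two tags that contain the same words in a different order score as identical, which is the property the tag clean-up relies on.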

In [None]:
print(scores_tags.shape)
scores_tags.head()

In [None]:
filtered_tags_threshold = scores_tags[scores_tags.match_ratio >= threshold]
adjusted_tags = pd.merge(scores_tags[['movieId', 'tag']], filtered_tags_threshold[['tag', 'matches']], on='tag')
cleaned_tags = adjusted_tags.groupby(['movieId', 'matches'])['tag'].count().reset_index().rename(columns={'tag':'tag_count', 'matches':'tag'})

print(cleaned_tags.shape)
cleaned_tags.head()

In [None]:
cleaned_tags = cleaned_tags.groupby('movieId')['tag'].agg(list).reset_index()

cleaned_tags.head()

### Clean IMDb Data

In [None]:
imdb_data = movies.merge(imdb, on='movieId')

In [None]:
imdb_data.shape

In [None]:
# Strip thousands separators from budget strings; non-string values (e.g. NaN) pass through
imdb_data.budget = imdb_data.budget.apply(lambda x: x.replace(',', '') if isinstance(x, str) else x)

In [None]:
import re

pattern = r'[0-9]+'      # first run of digits in the budget string
symbol = r'[$A-Za-z]+'   # currency marker, e.g. '$' or a three-letter code
imdb_data['budget_amount'] = imdb_data.budget.apply(lambda x: re.search(pattern, x).group() if isinstance(x, str) else x)
imdb_data['symbol'] = imdb_data.budget.apply(lambda x: re.search(symbol, x).group() if isinstance(x, str) else x)
# Titles are assumed to end in " (YYYY)", so the last 7 characters hold the year
imdb_data['year'] = imdb_data['title'].apply(lambda x: x[-7:].replace('(', '').replace(')', ''))
imdb_data['title'] = imdb_data['title'].apply(lambda x: x[:-7].strip())
# imdb_data['year'] = imdb_data['year'].apply(lambda x: int(x) if x.isnumeric() else np.NaN)
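
Slicing a fixed seven characters off every title also truncates titles that carry no year. As a possible alternative, the sketch below only strips the suffix when a `(YYYY)` pattern is actually present. It assumes titles look like `Title (YYYY)`; `YEAR_SUFFIX` and `split_title_year` are hypothetical names, not part of the notebook.

```python
import re

# Matches a trailing four-digit year in parentheses, e.g. "Balto (1995)"
YEAR_SUFFIX = re.compile(r"\s*\((\d{4})\)\s*$")


def split_title_year(raw_title):
    """Return (title, year); year is None when no "(YYYY)" suffix is found."""
    match = YEAR_SUFFIX.search(raw_title)
    if match is None:
        return raw_title.strip(), None
    return raw_title[:match.start()].strip(), int(match.group(1))


print(split_title_year("Balto (1995)"))
print(split_title_year("Untitled Project"))
```

Returning `None` for a missing year plays the same role as the `errors='coerce'` conversion used in the notebook, while leaving the title text intact.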

In [None]:
imdb_data.head()

In [None]:
imdb_data.year = pd.to_numeric(imdb_data.year, errors='coerce')

In [None]:
imdb_data.isnull().sum()

In [None]:
imdb_data.budget_amount = imdb_data.budget_amount.astype('float')

In [None]:
country2curr = pd.read_csv('/kaggle/input/countries-currency/4680482-b61a5bdf5f3d5c69399f9d9e592c4896fd0dc53c/country-code-to-currency-code-mapping.csv')
curr_conv = pd.read_csv('/kaggle/input/exchange-rate-2020-cleaned/API_PA.NUS.FCRF_DS2_en_csv_v2_3930759.csv')

In [None]:
missing_list = [np.nan, 'ITL', 'BGL','TRL', 'FRF', 'VEB', 'BEF', 'PTE', 'DEM', 'ESP', '$', 'RUR', 'SIT', 'NLG', 'FIM', 'ATS']
map_list = [np.nan,'Italy', 'Bulgaria', 'Turkey', 'France', 'Venezuela', 'Belgium', 'Portugal', 'Germany', 'Spain', 'USA', 'Russia', 'Slovenia', 'Netherlands', 'Finland', 'Austria']

In [None]:
imdb_data['country'] = imdb_data['symbol'].map(dict(zip(country2curr.Code.tolist()+missing_list, country2curr.Country.tolist()+map_list)))

In [None]:
imdb_data.isnull().sum()

In [None]:
def replace(text):
    # Relabel 'Saint Pierre and Miquelon' as 'Europe'; leave every other value unchanged
    return 'Europe' if text == 'Saint Pierre and Miquelon' else text

In [None]:
imdb_data['country'] = imdb_data['country'].apply(replace)

In [None]:
imdb_data.tail()

In [None]:
directors = list(imdb_data['director'].unique())

In [None]:
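from scipy import stats
group = []
for k in directors:
    # Work on a copy so the fills below modify sub_data itself, not a view of imdb_data
    sub_data = imdb_data[imdb_data.director == k].copy()
    # Fill missing budgets with this director's mean budget
    sub_data['budget_amount'] = sub_data['budget_amount'].fillna(sub_data['budget_amount'].mean())
    # Fill missing countries with this director's most frequent country, if there is one
    country_mode = sub_data['country'].mode()
    mode = country_mode.iloc[0] if len(country_mode) > 0 else np.nan
    sub_data['country'] = sub_data['country'].fillna(mode)
    group.append(sub_data)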

In [None]:
imdb_data_remade = pd.concat(group).sort_index()

In [None]:
imdb_data_remade.info()

In [None]:
imdb_data_remade.isnull().sum()/imdb_data_remade.shape[0] * 100

In [None]:
imdb.shape, imdb_data.shape, imdb_data_remade.shape

In [None]:
imdb_data_remade.drop(['symbol', 'plot_keywords', 'budget', 'title'], axis=1, inplace=True)

In [None]:
# Fall back to dataset-wide statistics for values that are still missing
imdb_data_remade['budget_amount'] = imdb_data_remade['budget_amount'].fillna(imdb_data_remade['budget_amount'].median())
imdb_data_remade['country'] = imdb_data_remade['country'].fillna(imdb_data_remade['country'].mode().iloc[0])
# Missing runtimes get the mean runtime
imdb_data_remade['runtime'] = imdb_data_remade['runtime'].fillna(imdb_data_remade['runtime'].mean())
imdb_data_remade['year'] = imdb_data_remade['year'].fillna(imdb_data_remade['year'].median())

In [None]:
imdb_data_remade.isnull().sum()

In [None]:
imdb_data_remade['conversion_rate/USD'] = imdb_data_remade['country'].map(dict(zip(curr_conv['Country Name'].tolist()+['USA'], curr_conv['2020'].tolist()+[1])))

In [None]:
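# 0.88 is the fallback rate for countries that are absent from the exchange-rate table
imdb_data_remade['conversion_rate/USD'] = imdb_data_remade['conversion_rate/USD'].fillna(0.88)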

In [None]:
imdb_data_remade.isnull().sum()

In [None]:
imdb_data_remade['usd_budget'] = imdb_data_remade['budget_amount'] / imdb_data_remade['conversion_rate/USD']

In [None]:
imdb_data_remade.head()

In [None]:
imdb_data_remade.tail()

In [None]:
imdb_data_remade.drop(['budget_amount', 'conversion_rate/USD'], axis=1, inplace=True)
imdb_data_remade.drop(imdb_data_remade.runtime[imdb_data_remade.runtime > 420].index, inplace=True)
imdb_data_remade.drop(imdb_data_remade[imdb_data_remade.usd_budget > 300000000].index, inplace=True)

In [None]:
imdb_data_remade.isnull().sum()

In [None]:
# Rescale implausibly small budgets, which were presumably recorded in abbreviated units
imdb_data_remade.usd_budget = imdb_data_remade.usd_budget.apply(
    lambda x: 1000000 if x == 0 else
              x*1000000 if x <= 10 else
              x*100000 if x <= 100 else
              x*10000 if x <= 1000 else
              x*1000 if x <= 10000 else
              x*100 if x <= 100000 else x
)
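
The thresholds in the cell above are easier to check when written as a small standalone function and run on a few sample values. `rescale_budget` is a hypothetical name and the sample inputs are made up; the thresholds and factors mirror the ones used in the notebook.

```python
def rescale_budget(x):
    """Scale suspiciously small budget values up, band by band."""
    if x == 0:
        return 1_000_000
    # (upper bound of the band, multiplier applied inside that band)
    bands = [(10, 1_000_000), (100, 100_000), (1_000, 10_000),
             (10_000, 1_000), (100_000, 100)]
    for upper, factor in bands:
        if x <= upper:
            return x * factor
    # Values above 100,000 are kept as they are
    return x


for value in [0, 5, 50, 500, 5_000, 50_000, 500_000]:
    print(value, "->", rescale_budget(value))
```

With these bands, every value above 10 and up to 100,000 ends up between one and ten million, while anything larger is left unchanged.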

In [None]:
imdb_data_remade.isnull().sum()

In [None]:
imdb_data_remade.drop(list(imdb_data_remade[imdb_data_remade.runtime<20].index), inplace=True)

In [None]:
imdb_data_remade.shape

In [None]:
imdb_data_remade.dropna(inplace=True)

In [None]:
imdb_data_remade['title_cast'] = imdb_data_remade['title_cast'].apply(split_text, args=("|", ))

In [None]:
imdb_data_remade['genres'] = imdb_data_remade['genres'].apply(split_text, args=("|", ))

In [None]:
# Conditions are evaluated in order; np.select keeps the first one that matches
conditions = [
    (imdb_data_remade['usd_budget'] > imdb_data_remade['usd_budget'].quantile(.75)),
    (imdb_data_remade['usd_budget'] >= imdb_data_remade['usd_budget'].quantile(.25)),
    (imdb_data_remade['usd_budget'] < imdb_data_remade['usd_budget'].quantile(.25)),
    ]

# Label assigned for each condition, in the same order
values = ['blockbuster', 'average', 'low_budget']

# A string default keeps the result a pure string array; the integer default 0 would clash with the string labels
imdb_data_remade['level'] = np.select(conditions, values, default='unknown')
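
The bucketing above depends on `np.select` returning the label of the first condition that holds for each element. A minimal sketch on made-up budgets shows how the three labels fall out of the quartiles; `toy_budgets` and `toy_levels` are hypothetical names.

```python
import numpy as np

# Hypothetical budgets in USD
toy_budgets = np.array([1e6, 2e6, 5e6, 2e7, 1e8, 2e8, 3e8, 4e7])
q25, q75 = np.quantile(toy_budgets, [0.25, 0.75])

# The second condition only labels elements that failed the first one,
# because np.select stops at the first condition that is True
toy_levels = np.select(
    [toy_budgets > q75, toy_budgets >= q25, toy_budgets < q25],
    ['blockbuster', 'average', 'low_budget'],
    default='unknown',
)

for budget, level in zip(toy_budgets, toy_levels):
    print(f"{budget:>13,.0f}  {level}")
```

The values above the third quartile become `blockbuster`, those below the first quartile become `low_budget`, and everything in between is `average`; `unknown` would only appear for entries such as NaN that satisfy none of the conditions.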

In [None]:
imdb_data_remade.isnull().sum()

In [None]:
a, b, c = classify(imdb_data_remade, 'title_cast', columns=['actor', 'appearance'], threshold=[30, 15])

In [None]:
imdb_data_remade['alist'], imdb_data_remade['blist'], imdb_data_remade['clist'] = a, b, c

In [None]:
imdb_data_remade.head()

In [None]:
imdb_data_remade.drop('title_cast', axis=1, inplace=True)

In [None]:
imdb_data_remade.reset_index(inplace=True)

In [None]:
imdb_data_remade.drop('index',axis=1, inplace=True)

In [None]:
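from sklearn.feature_extraction.text import CountVectorizer
# One column per genre token; note that the default tokenizer splits hyphenated
# genre names (for example 'Sci-Fi') into separate tokens
cv = CountVectorizer()
count_vector = cv.fit_transform(' '.join(x) for x in imdb_data_remade.genres)
count_vect_df = pd.DataFrame(count_vector.toarray(), columns=cv.get_feature_names_out())
# Column-wise concat aligns on the index, which was reset in the cells above
imdb_data_rem = pd.concat([imdb_data_remade, count_vect_df], axis=1)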

In [None]:
imdb_data_rem.isnull().sum()

In [None]:
imdb_data_rem.drop('genres', axis=1, inplace=True)
imdb_data_rem.head()

In [None]:
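# Columns that are not small integer counts and must not be cast to int8
exclude_list = ['movieId', 'country', 'usd_budget', 'level', 'director', 'year', 'runtime']
include_list = [x for x in imdb_data_rem.columns if x not in exclude_list]

imdb_data_rem[include_list] = imdb_data_rem[include_list].astype('int8')
# runtime can exceed 127 minutes (the int8 maximum), so it gets a wider integer type
imdb_data_rem['runtime'] = imdb_data_rem['runtime'].astype('int16')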

In [None]:
imdb_data_rem = convert_columns(imdb_data_rem)

In [None]:
imdb_data_rem.info()