# Recommendation System 3 - Vectorizer and Cosine Similarity

**Using Bag of words and TF-IDF**

**TF-IDF**

In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a measure of importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general.

As its name implies, TF-IDF scores a word by multiplying the word's Term Frequency (TF) by its Inverse Document Frequency (IDF).

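Term Frequency: the TF of a term is the number of times the term appears in a document divided by the total number of words in that document.

Inverse Document Frequency: the IDF of a term is commonly defined as the logarithm of the total number of documents divided by the number of documents that contain the term, so a term that appears in most documents receives a low weight.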

https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/
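The definition above can be checked by hand on a toy corpus. The sketch below assumes the textbook form idf(t) = log(N / df(t)); the corpus and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant by default, so its numbers will differ.

```python
import math

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the hero saves the city",
    "the villain attacks the city",
    "a quiet love story",
]
tokenized = [doc.split() for doc in docs]

def tf(term, tokens):
    # Term frequency: occurrences of the term divided by document length.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Textbook inverse document frequency: log(N / document frequency).
    # Assumes the term occurs in at least one document.
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# "the" occurs twice in the first document but also in two of the three documents;
# "hero" occurs once and only in the first document, so it ends up with the higher score.
score_the = tf_idf("the", tokenized[0], tokenized)
score_hero = tf_idf("hero", tokenized[0], tokenized)
print(round(score_the, 3), round(score_hero, 3))
```

Although "the" is the most frequent word in the first document, its low IDF pulls its score below that of "hero".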

**Bag of words**

The bag-of-words model is a model of text which uses a representation of text that is based on an unordered collection of words. It is used in natural language processing and information retrieval. It disregards word order but captures multiplicity.

A bag of words is a representation of text that describes the occurrence of words within a document. We just keep track of word counts and disregard the grammatical details and the word order. It is called a “bag” of words because any information about the order or structure of words in the document is discarded.
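A minimal sketch of the idea, using only the standard library: two made-up sentences that contain the same words in a different order produce identical bags. The variable names are illustrative and not part of the notebook.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order.
doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

# A bag of words is just a mapping from word to count.
bag_a = Counter(doc_a.split())
bag_b = Counter(doc_b.split())

# Fixing a vocabulary order turns each bag into a count vector.
vocabulary = sorted(set(bag_a) | set(bag_b))
vector_a = [bag_a[word] for word in vocabulary]
vector_b = [bag_b[word] for word in vocabulary]

print(vocabulary)      # ['cat', 'chased', 'dog', 'the']
print(vector_a)        # [1, 1, 1, 2]
print(bag_a == bag_b)  # True: word order is lost, counts are kept
```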

In [1]:
import pandas as pd
import numpy as np

In [4]:
movies = pd.read_csv('tmdb_5000_movies.csv')

In [5]:
movies.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

In [7]:
# Keep only the 'id', 'title', 'overview' and 'genres' columns

movies = movies[['id', 'title', 'overview', 'genres']]

In [8]:
movies.head(2)

      id                                     title                                            overview                                              genres
0  19995                                    Avatar  In the 22nd century, a paraplegic Marine is di...  [{"id": 28, "name": "Action"}, {"id": 12, "nam...
1    285  Pirates of the Caribbean: At World's End  Captain Barbossa, long believed to be dead, ha...  [{"id": 12, "name": "Adventure"}, {"id": 14, "...


In [9]:
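# movies is a slice of the original frame, so this triggers the warning below; adding .copy() in In [7] avoids it.
movies['tags'] = movies['overview'] + movies['genres']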

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['tags'] = movies['overview']+movies['genres']


In [10]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        4803 non-null   int64 
 1   title     4803 non-null   object
 2   overview  4800 non-null   object
 3   genres    4803 non-null   object
 4   tags      4800 non-null   object
dtypes: int64(1), object(4)
memory usage: 187.7+ KB


In [11]:
movies['tags'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'
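The output above shows that the raw `genres` JSON string is appended directly to the overview, so JSON keys and numeric ids can end up among the tokens that are counted later. One possible cleanup, sketched here as an assumption rather than as what this notebook does, is to parse the JSON and keep only the genre names. The helper `genre_names` is hypothetical.

```python
import json

def genre_names(genres_json):
    # Parse a genres string like the one shown above and join the names with spaces.
    return " ".join(entry["name"] for entry in json.loads(genres_json))

raw = ('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, '
       '{"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')
cleaned = genre_names(raw)
print(cleaned)  # Action Adventure Fantasy Science Fiction
```

Such a helper could be applied to the `genres` column before the tags are built, for example with `movies['genres'].apply(genre_names)`. Doing so would change the vectors and therefore the similarity values shown later.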

In [14]:
movies.drop(['overview','genres'],axis=1,inplace=True)

In [15]:
movies.head(2)

      id                                     title                                                tags
0  19995                                    Avatar  In the 22nd century, a paraplegic Marine is di...
1    285  Pirates of the Caribbean: At World's End  Captain Barbossa, long believed to be dead, ha...


In [16]:
from sklearn.feature_extraction.text import CountVectorizer

In [19]:
cv = CountVectorizer(max_features=10000, stop_words='english')
cv

In [21]:
# astype('U') casts every tag to str, so the 3 missing tags (NaN) become the string 'nan'.
vector = cv.fit_transform(movies['tags'].values.astype('U')).toarray()

In [22]:
vector

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

**Cosine Similarity**

In [23]:
from sklearn.metrics.pairwise import cosine_similarity

In [24]:
similarity = cosine_similarity(vector)

In [25]:
similarity

array([[1.        , 0.5547002 , 0.41025641, ..., 0.31300428, 0.        ,
        0.10127394],
       [0.5547002 , 1.        , 0.40061681, ..., 0.30564977, 0.        ,
        0.09128709],
       [0.41025641, 0.40061681, 1.        , ..., 0.25431598, 0.        ,
        0.07595545],
       ...,
       [0.31300428, 0.30564977, 0.25431598, ..., 1.        , 0.05463584,
        0.09658343],
       [0.        , 0.        , 0.        , ..., 0.05463584, 1.        ,
        0.04714045],
       [0.10127394, 0.09128709, 0.07595545, ..., 0.09658343, 0.04714045,
        1.        ]])
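Each entry of this matrix is the cosine of the angle between two count vectors: their dot product divided by the product of their lengths. The pure-Python sketch below uses made-up vectors and an illustrative function name, `cosine`; scikit-learn's `cosine_similarity` computes the same quantity for every pair of rows.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the Euclidean norms.
    # Assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up count vectors over a four-word vocabulary.
a = [1, 1, 0, 2]
b = [1, 0, 1, 2]
c = [0, 0, 3, 0]

print(round(cosine(a, a), 4))  # 1.0    (a vector compared with itself)
print(round(cosine(a, b), 4))  # 0.8333 (many shared words)
print(round(cosine(a, c), 4))  # 0.0    (no shared words)
```

This is why the diagonal of the matrix above is all ones and why pairs of movies with no tokens in common score 0.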

In [28]:
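def recommend(movie_input):
    # Row index of the requested title; raises IndexError if the title is not in movies.
    index = movies[movies['title'] == movie_input].index[0]
    # Pair each movie index with its similarity to the query, most similar first.
    distance = sorted(enumerate(similarity[index]), reverse=True, key=lambda pair: pair[1])
    # The top entry is normally the query itself (similarity 1), so it is printed first.
    for i, _ in distance[0:5]:
        print(movies.iloc[i].title)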

In [29]:
recommend("Iron Man")

Iron Man
The Helix... Loaded
Godzilla 2000
Iron Man 3
Mad Max
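In the output above, the first title is the query itself, since a movie has similarity 1 with itself. A hedged variant that skips the query and returns the titles instead of printing them is sketched below on a toy similarity matrix. The names `recommend_titles`, `titles`, and `sim`, and all values, are illustrative and not part of the notebook.

```python
def recommend_titles(title, titles, sim, n=2):
    # Return up to n titles most similar to `title`, excluding the title itself.
    if title not in titles:
        return []
    index = titles.index(title)
    # Rank all movie indices by their similarity to the query, highest first.
    ranked = sorted(range(len(titles)), key=lambda j: sim[index][j], reverse=True)
    return [titles[j] for j in ranked if j != index][:n]

# Toy data (made up for illustration).
titles = ["A", "B", "C", "D"]
sim = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]

print(recommend_titles("A", titles, sim))  # ['C', 'D']
print(recommend_titles("Z", titles, sim))  # [] (unknown title)
```

Returning a list rather than printing makes the function easier to reuse, and the explicit membership check avoids an exception for titles that are not in the data.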


**Export files**

In [30]:
import pickle

In [31]:
with open('movies_list.pkl', 'wb') as f:
    pickle.dump(movies, f)
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)
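To reuse the saved objects elsewhere, the files can be read back with `pickle.load`. The round trip is sketched here on a small stand-in object written to a temporary directory, so it does not touch the files created above; the file name `example.pkl` and the data are made up for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the saved objects (made-up values).
data = {"titles": ["A", "B"], "similarity": [[1.0, 0.3], [0.3, 1.0]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")
    # Write the object in binary mode, then read it back.
    with open(path, "wb") as f:
        pickle.dump(data, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == data)  # True
```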