### Recommendation Systems

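#### Content-Based Filtering

Content-based filtering recommends items whose attributes resemble those of an item the user already likes. Here the attributes are movie genres.
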
#### Step-by-Step Approach

1.   Load dataset

2.   Feature engineering (text features)

3.   Vectorization using TF-IDF

4.   Compute similarity matrix

5.   Build recommendation function

6.   Test recommendations

### Step 1:  Load dataset

In [621]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity



data = {
    "movie_id": [1, 2, 3, 4, 5],
    "title": [
        "The Dark Knight",
        "Batman Begins",
        "Avengers Endgame",
        "Iron Man",
        "The Hangover"
    ],
    "genres": [
        "Action Crime Drama",
        "Action Crime Thriller",
        "Action Adventure Sci-Fi",
        "Action Adventure Sci-Fi",
        "Comedy"
    ]
}

df = pd.DataFrame(data)

In [622]:
df

| | movie_id | title | genres |
|---|---|---|---|
| 0 | 1 | The Dark Knight | Action Crime Drama |
| 1 | 2 | Batman Begins | Action Crime Thriller |
| 2 | 3 | Avengers Endgame | Action Adventure Sci-Fi |
| 3 | 4 | Iron Man | Action Adventure Sci-Fi |
| 4 | 5 | The Hangover | Comedy |


### Step 2:  Feature engineering (text features)


Recommendations here are based only on each movie's genres, so the `content` column is simply a copy of `genres`.

In [623]:
df['content'] = df['genres']

In [624]:
df

| | movie_id | title | genres | content |
|---|---|---|---|---|
| 0 | 1 | The Dark Knight | Action Crime Drama | Action Crime Drama |
| 1 | 2 | Batman Begins | Action Crime Thriller | Action Crime Thriller |
| 2 | 3 | Avengers Endgame | Action Adventure Sci-Fi | Action Adventure Sci-Fi |
| 3 | 4 | Iron Man | Action Adventure Sci-Fi | Action Adventure Sci-Fi |
| 4 | 5 | The Hangover | Comedy | Comedy |
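When more than one text field is available, the `content` column can be built by joining them. The sketch below is only an illustration under assumptions: the `director` column and the `movies` and `text_columns` names are hypothetical and not part of the dataset used in this notebook.

```python
import pandas as pd

# Toy frame; the "director" column is hypothetical and not in the original dataset.
movies = pd.DataFrame({
    "title": ["The Dark Knight", "Iron Man", "The Hangover"],
    "genres": ["Action Crime Drama", "Action Adventure Sci-Fi", "Comedy"],
    "director": ["Christopher Nolan", "Jon Favreau", None],
})

text_columns = ["genres", "director"]

# Replace missing values with empty strings, join the columns with a space,
# then trim the trailing space left behind by an empty field.
movies["content"] = (
    movies[text_columns]
    .fillna("")
    .agg(" ".join, axis=1)
    .str.strip()
)

print(movies[["title", "content"]])
```

Filling missing values first matters because joining a string with a missing value would otherwise raise an error.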


### Step 3:  Vectorization using TF-IDF

In [625]:
from sklearn.feature_extraction.text  import TfidfVectorizer


### create an object for tfidfvectorizer

tfidf = TfidfVectorizer(stop_words='english')

### create a tfidf matrix from the content of the dataset

tfidf_matrix = tfidf.fit_transform(df['content'])

print(tfidf_matrix)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 15 stored elements and shape (5, 8)>
  Coords	Values
  (0, 0)	0.4015651234424611
  (0, 3)	0.5750625560879445
  (0, 4)	0.7127752157729959
  (1, 0)	0.4015651234424611
  (1, 3)	0.5750625560879445
  (1, 7)	0.7127752157729959
  (2, 0)	0.373917935101458
  (2, 1)	0.5354703159558152
  (2, 6)	0.5354703159558152
  (2, 5)	0.5354703159558152
  (3, 0)	0.373917935101458
  (3, 1)	0.5354703159558152
  (3, 6)	0.5354703159558152
  (3, 5)	0.5354703159558152
  (4, 2)	1.0


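### OBSERVATIONS:

1. The `content` column is now a sparse 5 x 8 TF-IDF matrix: one row per movie and one column per distinct token.

2. The default tokenizer splits `Sci-Fi` into the two tokens `sci` and `fi`, which is why rows 2 and 3 hold four values for three genres.

3. The token in column 0 appears in four of the five movies, so it receives the lowest weight in each row; the single token of "The Hangover" gets weight 1.0.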

### Step 4:  Compute similarity matrix

In [626]:
from sklearn.metrics.pairwise  import cosine_similarity

### compute the pairwise cosine similarity between all rows (movies) of the tfidf matrix

cm = cosine_similarity(tfidf_matrix, tfidf_matrix)

print(cm)

[[1.         0.49195149 0.1501524  0.1501524  0.        ]
 [0.49195149 1.         0.1501524  0.1501524  0.        ]
 [0.1501524  0.1501524  1.         1.         0.        ]
 [0.1501524  0.1501524  1.         1.         0.        ]
 [0.         0.         0.         0.         1.        ]]


### OBSERVATIONS:

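1. Entry `cm[i][j]` is the cosine similarity between the TF-IDF vectors of movie `i` and movie `j`: 1 means the genre tokens are identical and 0 means no token is shared.

2. "Avengers Endgame" and "Iron Man" have the same genres, so their similarity is 1. "The Hangover" shares no genre with any other movie, so its similarity to all of them is 0.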
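One entry of the similarity matrix can be checked by hand. The sketch below recomputes the similarity between rows 0 and 1 with the usual formula, dot product divided by the product of the norms, and compares it with `cosine_similarity`. The helper name `cosine` is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

genres = ["Action Crime Drama", "Action Crime Thriller",
          "Action Adventure Sci-Fi", "Action Adventure Sci-Fi", "Comedy"]

X = TfidfVectorizer(stop_words="english").fit_transform(genres).toarray()

def cosine(u, v):
    """Cosine of the angle between two vectors (hypothetical helper)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

manual = cosine(X[0], X[1])           # The Dark Knight vs. Batman Begins
library = cosine_similarity(X)[0, 1]  # same pair from scikit-learn

print(manual, library)
```

`TfidfVectorizer` L2-normalizes each row by default, so both norms are 1 and the similarity reduces to the plain dot product of the two rows.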

### Create Index Mapping

In [627]:
df

| | movie_id | title | genres | content |
|---|---|---|---|---|
| 0 | 1 | The Dark Knight | Action Crime Drama | Action Crime Drama |
| 1 | 2 | Batman Begins | Action Crime Thriller | Action Crime Thriller |
| 2 | 3 | Avengers Endgame | Action Adventure Sci-Fi | Action Adventure Sci-Fi |
| 3 | 4 | Iron Man | Action Adventure Sci-Fi | Action Adventure Sci-Fi |
| 4 | 5 | The Hangover | Comedy | Comedy |


In [628]:
df.index

RangeIndex(start=0, stop=5, step=1)

In [629]:
df['title']

0     The Dark Knight
1       Batman Begins
2    Avengers Endgame
3            Iron Man
4        The Hangover
Name: title, dtype: object

In [630]:
### Map each movie title to its row position in df (the same position is its row in the similarity matrix)

indices = pd.Series(df.index, index = df['title']).drop_duplicates()

In [631]:
indices 


title
The Dark Knight     0
Batman Begins       1
Avengers Endgame    2
Iron Man            3
The Hangover        4
dtype: int64
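One caveat about the mapping: `Series.drop_duplicates()` compares the values (the row positions), not the titles, so it removes nothing when every position is unique. The sketch below uses a hypothetical frame with a repeated title to show the difference and one possible way to keep a single position per title; the `movies`, `unchanged`, and `unique_indices` names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with a repeated title, used only to illustrate the behaviour.
movies = pd.DataFrame({"title": ["Iron Man", "Batman Begins", "Iron Man"]})

indices = pd.Series(movies.index, index=movies["title"])

# drop_duplicates() looks at the values 0, 1, 2, which are all distinct,
# so the repeated title survives.
unchanged = indices.drop_duplicates()

# Deduplicate on the index labels (the titles) instead, keeping the first row.
unique_indices = indices[~indices.index.duplicated(keep="first")]

print(len(unchanged))
print(unique_indices["Iron Man"])
```

With unique titles, as in this notebook's dataset, both versions give the same mapping.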

### Step 5:  Build recommendation function

In [632]:
df['title'].values

array(['The Dark Knight', 'Batman Begins', 'Avengers Endgame', 'Iron Man',
       'The Hangover'], dtype=object)

In [633]:
def get_recommendations(title, cosine_sim=cm, df=df):
    ### Return a message if the movie does not exist in the dataset
    if title not in df['title'].values:
        return "Movie not found in database"

    ### Fetch the row index of the movie
    idx = indices[title]

    ### Pair every movie index with its similarity to the selected movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    ### Drop the selected movie itself; assuming it always sorts first fails when another movie also scores 1.0
    sim_scores = [x for x in sim_scores if x[0] != idx]

    ### Sort the remaining movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    ### Keep at most the five most similar movies
    sim_scores = sim_scores[:5]

    ### Collect the row indices of those movies
    movie_indices = [x[0] for x in sim_scores]

    ### Return the titles at those indices
    return df['title'].iloc[movie_indices]



### Step 6:  Test recommendations

In [634]:
res = get_recommendations("Avengers Endgame")

In [635]:
res

3           Iron Man
0    The Dark Knight
1      Batman Begins
4       The Hangover
Name: title, dtype: object
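It can also help to return the similarity score next to each title, so that ties and zero-similarity results are visible. The sketch below is one possible variant and not the notebook's own function: `recommend_with_scores` and its `top_n` parameter are hypothetical, and the data is rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

movies = pd.DataFrame({
    "title": ["The Dark Knight", "Batman Begins", "Avengers Endgame",
              "Iron Man", "The Hangover"],
    "genres": ["Action Crime Drama", "Action Crime Thriller",
               "Action Adventure Sci-Fi", "Action Adventure Sci-Fi", "Comedy"],
})
sim = cosine_similarity(
    TfidfVectorizer(stop_words="english").fit_transform(movies["genres"])
)

def recommend_with_scores(title, top_n=3):
    """Hypothetical helper: top_n most similar titles with their scores."""
    positions = np.flatnonzero(movies["title"].to_numpy() == title)
    if positions.size == 0:
        return pd.DataFrame(columns=["title", "score"])
    pos = int(positions[0])

    scores = sim[pos].copy()
    scores[pos] = -np.inf                      # push the query movie to the end
    order = np.argsort(-scores, kind="stable")[:top_n]
    order = order[order != pos]                # drop it if top_n covers every row

    return pd.DataFrame({
        "title": movies["title"].to_numpy()[order],
        "score": scores[order],
    })

result = recommend_with_scores("Iron Man")
print(result)
```

Masking the query row explicitly, instead of skipping the first sorted entry, keeps the result correct when another movie has a similarity of exactly 1.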

### Step 7: Test a Sample Data

In [636]:
def get_recommendation(title, cosine_sim = cm, df = df):
    ### Check if the movie exists in the dataset
    if title not in df['title'].values:
        return "Movie does not exist in the dataset"

    ### Get the row index of the movie
    idx = indices[title]

    ### Pair every movie index with its similarity score
    sim_scores = list(enumerate(cosine_sim[idx]))

    ### Drop the selected movie itself so it is never recommended
    sim_scores = [x for x in sim_scores if x[0] != idx]

    ### Sort by similarity, highest first; without this step the slice below just returns rows in dataset order
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)

    ### Keep at most the top five movies
    sim_scores = sim_scores[:5]

    ### Collect the row indices of those movies
    ans = [x[0] for x in sim_scores]

    ### Return the titles at those indices
    return df['title'].iloc[ans]

In [637]:
ans = get_recommendation("Iron Man")

In [638]:
ans

2    Avengers Endgame
0     The Dark Knight
1       Batman Begins
4        The Hangover
Name: title, dtype: object