<a href="https://colab.research.google.com/github/iamgirupashankar/recommendationSystemByContentBased/blob/main/recommendationSystemByContentBased.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommendation System using Python and TensorFlow

Recommendation systems are the invisible engine behind the success of platforms like Netflix, Amazon, Spotify, and YouTube. They personalize your experience by suggesting what to watch, buy, or listen to next.

Here, I use a real Netflix dataset containing titles, content types, languages, and viewing hours. By the end, you'll have a deep learning model that can answer questions like: "If someone liked Wednesday, what else might they enjoy?"

## Step 1: Load and Understand the Dataset


Here I'm using a Netflix 2023 dataset with the following fields:

1. Title
2. Available Globally?
3. Release Date
4. Hours Viewed
5. Language Indicator
6. Content Type

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/thecleverprogrammer/AI & ML Projects to Get Job Ready /Recommendation System using TensorFlow/netflix_content.csv')
df.head()

| | Title | Available Globally? | Release Date | Hours Viewed | Language Indicator | Content Type |
|---|---|---|---|---|---|---|
| 0 | The Night Agent: Season 1 | Yes | 2023-03-23 | 812100000 | English | Show |
| 1 | Ginny & Georgia: Season 2 | Yes | 2023-01-05 | 665100000 | English | Show |
| 2 | The Glory: Season 1 // 더 글로리: 시즌 1 | Yes | 2022-12-30 | 622800000 | Korean | Show |
| 3 | Wednesday: Season 1 | Yes | 2022-11-23 | 507700000 | English | Show |
| 4 | Queen Charlotte: A Bridgerton Story | Yes | 2023-05-04 | 503000000 | English | Movie |


Note: This data is rich for content-based filtering, even without user behaviour data.

## Step 2: Clean and Preprocess the Data

Before modelling, we need to convert the data into a numerical format. So, let’s clean and preprocess the data:

In [3]:
# strip thousands separators from 'Hours Viewed' and convert the text to integers
df['Hours Viewed'] = df['Hours Viewed'].str.replace(',', '', regex=False).astype('int64')


In [4]:
# drop rows with missing titles or duplicate titles
df.dropna(subset=['Title'], inplace=True)
df.drop_duplicates(subset=['Title'], inplace=True)
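To see what these cleaning steps do, here is a minimal sketch on a small hypothetical sample (the titles and numbers are made up for illustration, not taken from the dataset). It assumes 'Hours Viewed' arrives as text with comma separators, which is what the `.str.replace` call above implies.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    'Title': ['Show A', 'Show A', None, 'Movie B'],
    'Hours Viewed': ['1,200,000', '1,200,000', '300,000', '45,000'],
})

# Strip thousands separators, then cast the text to integers
sample['Hours Viewed'] = (
    sample['Hours Viewed'].str.replace(',', '', regex=False).astype('int64')
)

# Drop rows without a title, then keep the first row for each title
sample = sample.dropna(subset=['Title']).drop_duplicates(subset=['Title'])

print(sample)
print(sample.index.tolist())  # surviving row labels are no longer contiguous
```

Dropping rows leaves gaps in the row labels, which is why the next cell assigns fresh IDs instead of reusing the index.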

In [17]:
# create simple, contiguous content IDs (0 to n-1) for TensorFlow embeddings
df['Content_ID'] = df.reset_index().index.astype('int32')

In [18]:
# encode 'Language Indicator' and 'Content Type'
df['Language_ID'] = df['Language Indicator'].astype('category').cat.codes
df['ContentType_ID'] = df['Content Type'].astype('category').cat.codes

In [24]:
data = df[['Content_ID', 'Title', 'Hours Viewed', 'Language_ID', 'ContentType_ID']]

In [25]:
data.head()

| | Content_ID | Title | Hours Viewed | Language_ID | ContentType_ID |
|---|---|---|---|---|---|
| 0 | 0 | The Night Agent: Season 1 | 812100000 | 0 | 1 |
| 1 | 1 | Ginny & Georgia: Season 2 | 665100000 | 0 | 1 |
| 2 | 2 | The Glory: Season 1 // 더 글로리: 시즌 1 | 622800000 | 3 | 1 |
| 3 | 3 | Wednesday: Season 1 | 507700000 | 0 | 1 |
| 4 | 4 | Queen Charlotte: A Bridgerton Story | 503000000 | 0 | 0 |


Note: Embedding layers look up vectors by integer index, so they can't take raw strings. That's why we converted the content metadata into integer category codes.
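As a small illustration of what `cat.codes` produces, the sketch below encodes a hypothetical language column. Categories are sorted, so the codes follow alphabetical order, and missing values get the code `-1`, which an embedding layer cannot index. The sample values are assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical values, for illustration only
languages = pd.Series(['English', 'Korean', 'English', None, 'Hindi'])

as_category = languages.astype('category')
codes = as_category.cat.codes

# Map each integer code back to its label
code_to_label = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(code_to_label)

# Missing values are encoded as -1, which is not a valid embedding index
has_missing = bool((codes == -1).any())
print(has_missing)
```

Checking for `-1` before training is a cheap safeguard if the language or content type columns might contain gaps.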

## Step 3: Build a Neural Recommendation Model Using TensorFlow

We will use embeddings to capture complex relationships between features like language, type, and content ID:

In [29]:
import tensorflow as tf
from tensorflow.keras import layers, Model

In [27]:
num_contents = df['Content_ID'].nunique()
num_languages = df['Language_ID'].nunique()
num_types = df['ContentType_ID'].nunique()


In [30]:
content_input = layers.Input(shape=(1,), dtype=tf.int32, name='content_id')
language_input = layers.Input(shape=(1,), dtype=tf.int32, name='language_id')
type_input = layers.Input(shape=(1,), dtype=tf.int32, name='content_type')


In [31]:

content_embedding = layers.Embedding(input_dim=num_contents+1, output_dim=32)(content_input)
language_embedding = layers.Embedding(input_dim=num_languages+1, output_dim=8)(language_input)
type_embedding = layers.Embedding(input_dim=num_types+1, output_dim=4)(type_input)

content_vec = layers.Flatten()(content_embedding)
language_vec = layers.Flatten()(language_embedding)
type_vec = layers.Flatten()(type_embedding)

combined = layers.Concatenate()([content_vec, language_vec, type_vec])
x = layers.Dense(64, activation='relu')(combined)
x = layers.Dense(32, activation='relu')(x)
output = layers.Dense(num_contents, activation='softmax')(x)

model = Model(inputs=[content_input, language_input, type_input], outputs=output)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

Note: Embeddings map categorical values (like content IDs or languages) to dense, low-dimensional vectors that are learned during training, so that similar values can end up close together. This lets our model learn which content is similar.
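Conceptually, an embedding layer is a lookup table: a matrix with one row per ID, where looking up an ID returns its row. The NumPy sketch below mimics that lookup and the concatenation step with randomly initialised tables. In the real model these tables are trainable weights, and the vocabulary sizes used here are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed vocabulary sizes, for illustration only
num_contents, num_languages, num_types = 1000, 6, 2

# One row per ID; the widths match the output_dim values used in the model
content_table = rng.normal(size=(num_contents + 1, 32))
language_table = rng.normal(size=(num_languages + 1, 8))
type_table = rng.normal(size=(num_types + 1, 4))

def embed(content_id, language_id, type_id):
    """Look up one row per feature and concatenate them into one vector."""
    return np.concatenate([
        content_table[content_id],
        language_table[language_id],
        type_table[type_id],
    ])

combined = embed(content_id=3, language_id=0, type_id=1)
print(combined.shape)  # (44,) because 32 + 8 + 4 = 44
```

The 44-dimensional vector corresponds to what the `Concatenate` layer passes to the first `Dense` layer.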

## Step 4: Train the Recommendation Model

We’ll use each title's own Content_ID as the label, so the model learns to predict a title from its ID, language, and content type. No user ratings are needed, which makes this a self-supervised learning approach:

In [33]:
model.fit(
    x={
        'content_id': df['Content_ID'],
        'language_id': df['Language_ID'],
        'content_type': df['ContentType_ID']
    },
    y=df['Content_ID'],
    epochs=5,
    batch_size=64
)

Epoch 1/5
300/300 ━━━━━━━━━━━━━━━━━━━━ 17s 50ms/step - accuracy: 0.0000e+00 - loss: 9.8788
Epoch 2/5
300/300 ━━━━━━━━━━━━━━━━━━━━ 18s 42ms/step - accuracy: 0.0000e+00 - loss: 9.8649
Epoch 3/5
300/300 ━━━━━━━━━━━━━━━━━━━━ 13s 42ms/step - accuracy: 0.0018 - loss: 9.5690
Epoch 4/5
300/300 ━━━━━━━━━━━━━━━━━━━━ 21s 44ms/step - accuracy: 0.0175 - loss: 7.9803
Epoch 5/5
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 42ms/step - accuracy: 0.1528 - loss: 5.6380


<keras.src.callbacks.history.History at 0x7cef23c46910>

Note: Training structures the embedding space using the real metadata. Titles that share a language and content type should end up with similar embeddings, which allows recommendations based on closeness in vector space.
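A quick check helps interpret the first-epoch loss. A model that spreads probability evenly over N titles has a cross-entropy of ln(N). The log shows 300 steps per epoch at a batch size of 64, so N is at most 19,200. That figure is inferred from the log rather than stated in the dataset, so treat it as an estimate.

```python
import math

batch_size = 64
steps_per_epoch = 300

# Upper bound on the number of titles implied by the training log (an estimate)
max_titles = batch_size * steps_per_epoch

# Cross-entropy of a uniform guess over N classes is -ln(1/N) = ln(N)
uniform_loss = math.log(max_titles)
print(round(uniform_loss, 2))
```

The reported losses of about 9.88 and 9.86 in the first two epochs sit close to this value, which suggests the model starts out near a uniform guess and only begins to separate titles from epoch 3 onward.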

## Step 5: Recommend Similar Content

In [34]:
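import numpy as np

def recommend_similar(content_title, top_k=5):
    # use the first title that contains the search string (case-insensitive)
    matches = df[df['Title'].str.contains(content_title, case=False, na=False)]
    if matches.empty:
        raise ValueError(f"No title matches {content_title!r}")
    content_row = matches.iloc[0]
    content_id = content_row['Content_ID']
    language_id = content_row['Language_ID']
    content_type_id = content_row['ContentType_ID']

    # softmax scores over every Content_ID for this single query item
    predictions = model.predict({
        'content_id': np.array([content_id]),
        'language_id': np.array([language_id]),
        'content_type': np.array([content_type_id])
    })

    # IDs of the top_k + 1 highest scores, highest first
    top_indices = predictions[0].argsort()[-top_k-1:][::-1]
    # isin() returns the matching rows in DataFrame order, not score order
    recommendations = df[df['Content_ID'].isin(top_indices)]
    return recommendations[['Title', 'Language Indicator', 'Content Type', 'Hours Viewed']]

recommend_similar("Wednesday")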

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 135ms/step


| | Title | Language Indicator | Content Type | Hours Viewed |
|---|---|---|---|---|
| 2013 | iZombie: Season 2 | 0 | 1 | 10200000 |
| 2932 | The Magic School Bus Rides Again: Season 1 | 0 | 1 | 6500000 |
| 4933 | Queer Eye: Season 4 | 0 | 1 | 2900000 |
| 5137 | Skylanders Academy: Season 3 | 0 | 1 | 2700000 |
| 8811 | The Guardians of Justice: Season 1 | 0 | 1 | 800000 |
| 24807 | We Are Black and British: Season 1 | 0 | 1 | 100000 |
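
One detail worth noting: `isin` returns rows in DataFrame order, so the ranking by score is lost, and the query title itself can appear among the candidates. A possible variant, sketched here on a made-up score vector, drops the query ID and keeps the remaining IDs in descending score order. The helper name `top_k_excluding` is hypothetical.

```python
import numpy as np

def top_k_excluding(scores, query_id, top_k=5):
    """Return the top_k highest-scoring IDs, best first, skipping query_id."""
    ranked = np.argsort(scores)[::-1]       # all IDs, highest score first
    ranked = ranked[ranked != query_id]     # never recommend the query itself
    return ranked[:top_k]

# Made-up scores for six items, for illustration only
scores = np.array([0.05, 0.40, 0.10, 0.25, 0.15, 0.05])
print(top_k_excluding(scores, query_id=1, top_k=3))
```

In the notebook, the ordered IDs could then be looked up with something like `df.set_index('Content_ID').loc[ids]`, which keeps the rows in ranked order.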


The embeddings map each content item into a 32-dimensional space. Items that are closer in this space are likely to be similar in:

1. Language
2. Content Type
3. Viewership Pattern (only if Hours Viewed is added as a model input; the model above does not use it)

So, even without user feedback, your model can say: “Hey, these titles are kind of alike.”
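
Closeness in the embedding space can also be measured directly, for example with cosine similarity between rows of the content embedding matrix. The sketch below uses a small random matrix as a stand-in. In the notebook, the trained matrix could be read from the content `Embedding` layer with `get_weights()`, which would require keeping a reference to that layer. The function name `nearest_items` is hypothetical.

```python
import numpy as np

def nearest_items(embeddings, query_id, top_k=5):
    """Rank items by cosine similarity to the query row, best first."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)   # avoid division by zero
    similarity = unit @ unit[query_id]
    similarity[query_id] = -np.inf                    # exclude the query itself
    return np.argsort(similarity)[::-1][:top_k]

# Stand-in for a trained (num_items, 32) embedding matrix
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 32))
embeddings[7] = embeddings[2] * 3.0   # same direction as item 2, larger norm

print(nearest_items(embeddings, query_id=2, top_k=3))
```

Cosine similarity ignores vector length, so item 7 ranks first for item 2 even though its norm is three times larger.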