# Project: **The Social Catalog**

*A Hybrid Book & Movie Recommendation System*

Author: Haniya Sudheer

Date: *January 2026*

## Project Overview


This project implements a personalized recommendation system that combines Social Filtering (based on friends' ratings) and Content-Based Filtering (using AI-driven Semantic Embeddings).

Key Objectives:



*   Integrate datasets for Books and Movies.
*   Utilize Sentence-Transformers for semantic text analysis.
*   Build an interactive UI using Streamlit.



## 1. Environment Setup

In this section, we install the required libraries and import the modules used throughout the notebook. A GPU runtime, when available, speeds up embedding generation.




In [1]:
# Installing core libraries for Data Science, AI, and Deployment
!pip install pandas numpy scikit-learn streamlit sentence-transformers pyngrok


Collecting streamlit
  Downloading streamlit-1.53.0-py3-none-any.whl.metadata (10 kB)
Collecting pyngrok
  Downloading pyngrok-7.5.0-py3-none-any.whl.metadata (8.1 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Downloading streamlit-1.53.0-py3-none-any.whl (9.1 MB)
Downloading pyngrok-7.5.0-py3-none-any.whl (24 kB)
Downloading pydeck-0.9.1-py2.py3-none-any.whl (6.9 MB)
Installing collected packages: pyngrok, pydeck, streamlit
Successfully installed pydeck-0.9.1 pyngrok-7.5.0 streamlit-1.53.0
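
Before moving on, it can help to confirm that the installed packages are actually importable in the current runtime. The following is a minimal sketch using only the standard library; the helper name `find_missing` is hypothetical, and the module list assumes the usual import names (`sklearn` for scikit-learn, `sentence_transformers` for sentence-transformers).

```python
import importlib.util

# Import names for the packages installed above. scikit-learn and
# sentence-transformers are imported under different names than they are installed.
REQUIRED_MODULES = ["pandas", "numpy", "sklearn", "streamlit", "sentence_transformers", "pyngrok"]

def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

If the list is not empty, re-running the install cell (or restarting the runtime) is usually the first thing to try.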


In [35]:
import pandas as pd
import numpy as np
import torch

from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 2. Data Acquisition & Preprocessing

We load the raw datasets and perform feature engineering to unify the "Books" and "Movies" catalogs into a single searchable entity.



In [3]:
# CREATE USER & FRIEND DATA
data = {
    "user": ["You", "Friend1", "Friend2", "Friend3", "You", "Friend1", "Friend2"],
    "title": ["Inception", "Interstellar", "Harry Potter", "The Hobbit", "The Alchemist", "Inception", "The Alchemist"],
    "type": ["Movie", "Movie", "Book", "Book", "Book", "Movie", "Book"],
    "rating": [5.0, 4.0, 5.0, 4.0, 4.0, 5.0, 5.0],
    "description": [
        "dream within a dream sci-fi thriller",
        "space exploration and time relativity",
        "wizard magic friendship adventure",
        "fantasy adventure dwarves dragon",
        "spiritual journey self discovery",
        "dream manipulation thriller",
        "philosophical journey"
    ]
}
user_ratings_df = pd.DataFrame(data)


In [4]:
import pandas as pd
# Load datasets (uploaded files)
books_raw = pd.read_csv("GoodReads_100k_books.csv")
movies_raw = pd.read_csv("IMBD.csv")

In [5]:
books_raw.columns


Index(['author', 'bookformat', 'desc', 'genre', 'img', 'isbn', 'isbn13',
       'link', 'pages', 'rating', 'reviews', 'title', 'totalratings'],
      dtype='object')

In [6]:
movies_raw.columns


Index(['movie', 'genre', 'runtime', 'certificate', 'rating', 'stars',
       'description', 'votes', 'director'],
      dtype='object')

In [7]:
# CLEAN & PREPARE CATALOG DATA
# Process Books
books_df = books_raw.copy()
books_df["user"] = None
books_df["type"] = "Book"
books_df["description"] = books_df["author"].fillna("") + " " + books_df["desc"].fillna("")
books_df = books_df[["user", "title", "type", "rating", "description"]]


books_df.head()

| | user | title | type | rating | description |
|---|---|---|---|---|---|
| 0 | | Between Two Fires: American Indians in the Civ... | Book | 3.52 | Laurence M. Hauptman Reveals that several hund... |
| 1 | | Fashion Sourcebook 1920s | Book | 4.51 | Charlotte Fiell,Emmanuelle Dirix Fashion Sourc... |
| 2 | | Hungary 56 | Book | 4.15 | Andy Anderson The seminal history and analysis... |
| 3 | | All-American Anarchist: Joseph A. Labadie and ... | Book | 3.83 | Carlotta R. Anderson "All-American Anarchist" ... |
| 4 | | Les oiseaux gourmands | Book | 4.0 | Jean Leveille Aujourd’hui, l’oiseau nous i... |


In [8]:
# Process Movies
movies_df = movies_raw.copy()
movies_df["user"] = None
movies_df = movies_df.rename(columns={"movie": "title"})
movies_df["type"] = "Movie"
movies_df["rating"] = (movies_df["rating"] / 2).round(2)  # Normalize the 10-point scale to a 5-point scale
movies_df["description"] = movies_df["genre"].fillna("") + " " + movies_df["description"].fillna("")
movies_df = movies_df[["user", "title", "type", "rating", "description"]]

movies_df.head()

| | user | title | type | rating | description |
|---|---|---|---|---|---|
| 0 | | The Witcher | Movie | 4.05 | Action, Adventure, Drama Geralt of... |
| 1 | | Mission: Impossible - Dead Reckoning Part One | Movie | 4.0 | Action, Adventure, Thriller Ethan ... |
| 2 | | Sound of Freedom | Movie | 3.95 | Action, Biography, Drama The incre... |
| 3 | | Secret Invasion | Movie | 3.1 | Action, Adventure, Drama Fury and ... |
| 4 | | Special Ops: Lioness | Movie | 3.75 | Action, Drama, Thriller Joe attemp... |
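
The movie ratings are halved so that they share a 5-point scale with the book ratings. The arithmetic can be sketched as a small standalone function; `to_five_point` and its `source_max` parameter are hypothetical names used only for illustration, and the notebook itself does this with a vectorised pandas expression.

```python
def to_five_point(rating, source_max=10.0):
    """Rescale a rating from a 0..source_max scale to a 0..5 scale, rounded to 2 decimals."""
    return round(rating / source_max * 5.0, 2)

# For a 10-point source scale this is the same arithmetic as rating / 2.
print(to_five_point(8.1))
print(to_five_point(7.0))
```

Keeping the source maximum as a parameter makes it easy to reuse the same step if another catalog with a different rating scale is added later.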


In [9]:
# UNIFY DATASET
# Concat them all into one master dataframe
df = pd.concat([user_ratings_df, books_df, movies_df], ignore_index=True)
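# Keep only the first row per (title, type). User ratings come first in the concat,
# so a rated title wins over its catalog entry. Note that this also drops a second
# user's rating of the same title (e.g. Friend1's rating of "Inception").
df = df.drop_duplicates(subset=["title", "type"], keep="first")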
df = df.reset_index(drop=True)

df.head()

| | user | title | type | rating | description |
|---|---|---|---|---|---|
| 0 | You | Inception | Movie | 5.0 | dream within a dream sci-fi thriller |
| 1 | Friend1 | Interstellar | Movie | 4.0 | space exploration and time relativity |
| 2 | Friend2 | Harry Potter | Book | 5.0 | wizard magic friendship adventure |
| 3 | Friend3 | The Hobbit | Book | 4.0 | fantasy adventure dwarves dragon |
| 4 | You | The Alchemist | Book | 4.0 | spiritual journey self discovery |
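
Because `drop_duplicates` is applied to the combined frame with `keep="first"`, only one row per (title, type) pair survives, even when several users rated the same title. The toy example below illustrates that behaviour and sketches one possible alternative that deduplicates only catalog rows; the alternative is an assumption for illustration, not what the notebook does.

```python
import pandas as pd

# Toy data: two users rate the same title, and the catalog also lists it.
toy = pd.DataFrame({
    "user": ["You", "Friend1", None],
    "title": ["Inception", "Inception", "Inception"],
    "type": ["Movie", "Movie", "Movie"],
})

# Same call as in the notebook: only the first row per (title, type) survives.
deduped = toy.drop_duplicates(subset=["title", "type"], keep="first").reset_index(drop=True)
print(deduped)

# Hypothetical alternative: drop a row only if it is a catalog row (user missing)
# that repeats an earlier (title, type), so every user's rating is kept.
is_catalog = toy["user"].isna()
catalog_dupe = is_catalog & toy.duplicated(subset=["title", "type"], keep="first")
kept_ratings = toy[~catalog_dupe].reset_index(drop=True)
print(kept_ratings)
```

Switching to the alternative would change the number of rows in the master dataframe, so the embeddings would need to be regenerated to stay aligned with it.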


## 3. AI Feature Extraction (Embeddings)

We use the all-MiniLM-L6-v2 model to transform text descriptions into 384-dimensional vectors. This lets the system compare items by meaning rather than by exact keyword overlap.
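
Embeddings are compared later with cosine similarity, which measures the angle between two vectors and ignores their length. A minimal sketch with made-up 3-dimensional vectors (standing in for the real 384-dimensional embeddings) shows the idea; the helper `cosine` is a hypothetical name, and the notebook uses scikit-learn's `cosine_similarity` instead.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-D vectors standing in for real 384-D embeddings.
a = np.array([1.0, 0.0, 1.0])
b = np.array([2.0, 0.0, 2.0])   # same direction as a, different length
c = np.array([0.0, 1.0, 0.0])   # orthogonal to a

print(cosine(a, b))  # close to 1: same direction
print(cosine(a, c))  # close to 0: unrelated directions
```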


In [10]:
# GENERATE EMBEDDINGS (AI PART)
# Use the GPU when one is available; encoding is typically much faster there than on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)


In [11]:
# Generating semantic embeddings for the entire catalog
print("Encoding descriptions... this may take a few minutes on CPU.")
embeddings = model.encode(df["description"].fillna("").tolist(), show_progress_bar=True)


Encoding descriptions... this may take a few minutes on CPU.



## 4. Recommendation Logic

We define the algorithms for both social-based suggestions and content-similarity matching.

Two functions implement this: *get_friend_recommendations* (social filtering) and *get_content_recommendations* (content-based filtering).


In [12]:
# RECOMMENDATION FUNCTIONS
def get_friend_recommendations(df, user="You", min_rating=4):
    """Items your friends liked that you haven't seen."""
    friends_likes = df[(df["user"] != user) & (df["user"].notnull()) & (df["rating"] >= min_rating)]
    already_seen = df[df["user"] == user]["title"].tolist()
    recs = friends_likes[~friends_likes["title"].isin(already_seen)]
    return recs.drop_duplicates("title")
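
The same filtering rule can be traced by hand on a few rows. The sketch below re-expresses it in plain Python over a list of dictionaries; the ratings are made up for illustration and `friend_picks` is a hypothetical name, not part of the notebook.

```python
# Made-up ratings, used only to trace the rule.
ratings = [
    {"user": "You", "title": "Inception", "rating": 5.0},
    {"user": "Friend1", "title": "Interstellar", "rating": 4.0},
    {"user": "Friend2", "title": "Inception", "rating": 5.0},
    {"user": "Friend3", "title": "The Hobbit", "rating": 3.0},
]

def friend_picks(rows, user="You", min_rating=4):
    """Titles rated >= min_rating by others that `user` has not rated, in first-seen order."""
    seen = {r["title"] for r in rows if r["user"] == user}
    picks = []
    for r in rows:
        liked_by_friend = r["user"] != user and r["rating"] >= min_rating
        if liked_by_friend and r["title"] not in seen and r["title"] not in picks:
            picks.append(r["title"])
    return picks

print(friend_picks(ratings))
```

Here "Inception" is excluded because the user already rated it, and "The Hobbit" is excluded because its rating is below the threshold.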

In [13]:
def get_content_recommendations(df, embeddings, user="You", top_n=5):
    """Items similar to what YOU have rated highly."""
    user_data = df[df["user"] == user]
    if user_data.empty:
        return pd.DataFrame()

    # Get indices of things the user liked (rating >= 4)
    liked_indices = user_data[user_data["rating"] >= 4].index.tolist()
    if not liked_indices:
        return pd.DataFrame()

    # Calculate average similarity to all liked items
    sim_scores = cosine_similarity(embeddings[liked_indices], embeddings)
    avg_sim = sim_scores.mean(axis=0)

    # Add similarity to a temp copy of DF
    temp_df = df.copy()
    temp_df["similarity"] = avg_sim

    # Filter out things the user has already seen
    already_seen = user_data["title"].tolist()
    recommendations = temp_df[~temp_df["title"].isin(already_seen)]

    # Filter out the "Friend" entries to get pure catalog suggestions
    recommendations = recommendations[recommendations["user"].isnull()]

    return recommendations.sort_values(by="similarity", ascending=False).head(top_n)
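
The core of the function is the averaging step: each catalog item is scored by its mean cosine similarity to all liked items. A small sketch with made-up 2-dimensional unit vectors shows how the scores and the ranking come out; the values are illustrative only.

```python
import numpy as np

# Made-up, already L2-normalised 2-D "embeddings"; rows 0 and 1 are liked items.
emb = np.array([
    [1.0, 0.0],   # liked item A
    [0.0, 1.0],   # liked item B
    [0.6, 0.8],   # candidate close to both liked items
    [-1.0, 0.0],  # candidate pointing away from A
])
liked = [0, 1]

# For unit vectors the dot product equals the cosine similarity.
sims = emb[liked] @ emb.T      # shape (2, 4): one row per liked item
avg_sim = sims.mean(axis=0)    # shape (4,): mean similarity to the liked set

# Rank only the candidates (indices not in `liked`), best first.
candidates = [i for i in range(len(emb)) if i not in liked]
ranked = sorted(candidates, key=lambda i: avg_sim[i], reverse=True)
print(ranked, avg_sim[ranked])
```

Note that the average can be negative when a candidate points away from the liked items, which matters if the score is later shown as a percentage.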


In [14]:
# EXECUTION & SAVING
print("\n--- Testing Recommendations for 'You' ---")
f_recs = get_friend_recommendations(df)
c_recs = get_content_recommendations(df, embeddings)

print(f"Friend Recommendations Found: {len(f_recs)}")
print(f"Content Recommendations Found: {len(c_recs)}")



--- Testing Recommendations for 'You' ---
Friend Recommendations Found: 3
Content Recommendations Found: 5


## 5. Exporting Assets for UI

To ensure the Streamlit app runs efficiently without re-calculating embeddings, we save the processed data and vectors to disk.



In [15]:
# Saving processed data and numpy embeddings
df.to_csv("processed_data.csv", index=False)
np.save("embeddings.npy", embeddings)
print("\nSuccess: Data and Embeddings saved for Streamlit UI.")


Success: Data and Embeddings saved for Streamlit UI.
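
The saved embeddings are only useful if they stay row-aligned with the saved table, since the app looks vectors up by row position. The following sketch round-trips a small stand-in matrix through `np.save`/`np.load` and checks the row count; `check_alignment` is a hypothetical helper and the data is made up.

```python
import os
import tempfile
import numpy as np

def check_alignment(n_rows, embeddings):
    """Raise if the embedding matrix does not have one row per catalog row."""
    if embeddings.shape[0] != n_rows:
        raise ValueError(f"{n_rows} rows but {embeddings.shape[0]} embeddings")

# Stand-in data: 3 rows of 4-D vectors instead of the real matrix.
vectors = np.arange(12, dtype=np.float32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.npy")
    np.save(path, vectors)
    loaded = np.load(path)

check_alignment(3, loaded)
print(loaded.shape, np.array_equal(vectors, loaded))
```

A check like this could run right after loading both files, so a stale `embeddings.npy` is caught before any recommendation is computed.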


In [16]:
!pip install streamlit



## 6. Streamlit Deployment

The following block generates the app.py file, which creates the Social Catalog interface.

In [33]:
%%writefile app.py


import streamlit as st
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import os

# --- PAGE CONFIG ---
st.set_page_config(page_title="The Social Catalog", page_icon="🌿", layout="wide")

# --- STYLING ---
st.markdown("""
    <style>
    @import url('https://fonts.googleapis.com/css2?family=Playfair+Display:ital,wght@0,400;0,700;1,400&family=Lora:ital,wght@0,400;0,500;1,400&display=swap');

    /* Main background - Deeper Cream Parchment */
    .stApp {
        background-color: #f2eee3;
        background-image: radial-gradient(#d8debf 1px, #f2eee3 1px);
        background-size: 35px 35px;
    }

    /* Typography */
    h1, h2, h3 {
        font-family: 'Playfair Display', serif !important;
        color: #3a4d39 !important; /* Deep Moss Green */
    }

    p, span, div, label, .stMarkdown {
        font-family: 'Lora', serif !important;
        color: #4a3728 !important; /* Earthy Wood Brown */
    }

    /* Sidebar Styling - Significantly Darker Off-White */
    section[data-testid="stSidebar"] {
        background-color: #e5e0c5 !important;
        border-right: 2px solid #d1c7a7;
    }

    /* Container Differentiation for Sidebar Widgets */
    [data-testid="stSidebar"] .stSelectbox,
    [data-testid="stSidebar"] .stSlider {
        background-color: rgba(255, 255, 255, 0.25);
        padding: 15px;
        border-radius: 18px;
        border: 1px solid rgba(58, 77, 57, 0.15);
        margin-bottom: 12px;
    }

    /* Cards - Soft Sage and Rose */
    .card {
        background-color: rgba(255, 255, 255, 0.85);
        padding: 24px;
        border-radius: 24px;
        border: 1px solid #e2e8ce;
        margin-bottom: 20px;
        box-shadow: 0 10px 20px rgba(58, 77, 57, 0.05);
        transition: all 0.3s ease;
    }
    .card:hover {
        transform: translateY(-4px);
        border-color: #e8b4b8; /* Dusty Rose border on hover */
        box-shadow: 0 12px 24px rgba(232, 180, 184, 0.2);
    }

    /* Buttons */
    .stButton>button {
        width: 100%;
        border-radius: 30px;
        background-color: #798c56 !important; /* Sage/Moss Green */
        color: white !important;
        border: none;
        font-family: 'Playfair Display', serif;
        font-size: 18px;
        font-style: italic;
        padding: 12px;
        transition: all 0.3s ease;
        box-shadow: 0 4px 10px rgba(58, 77, 57, 0.15);
    }

    /* Sidebar Specific Button Distinction */
    [data-testid="stSidebar"] .stButton>button {
        background-color: #6b7a4a !important; /* Slightly deeper green for sidebar */
        border: 1px solid rgba(255, 255, 255, 0.2);
        margin-top: 10px;
    }

    .stButton>button:hover {
        background-color: #e8b4b8 !important; /* Dusty Rose on hover */
        border: none;
        transform: scale(1.02);
        box-shadow: 0 6px 12px rgba(232, 180, 184, 0.3);
    }

    /* Tags */
    .book-tag {
        background-color: #e8b4b8; /* Dusty Rose */
        color: white;
        padding: 4px 12px;
        border-radius: 20px;
        font-size: 11px;
        font-weight: bold;
        text-transform: uppercase;
        letter-spacing: 1px;
    }
    .movie-tag {
        background-color: #b2bd7e; /* Sage Green */
        color: white;
        padding: 4px 12px;
        border-radius: 20px;
        font-size: 11px;
        font-weight: bold;
        text-transform: uppercase;
        letter-spacing: 1px;
    }

    /* Tabs */
    .stTabs [data-baseweb="tab-list"] {
        gap: 24px;
        background-color: transparent;
    }
    .stTabs [data-baseweb="tab"] {
        height: 45px;
        background-color: transparent !important;
        color: #798c56 !important;
        font-family: 'Playfair Display', serif;
        font-size: 20px;
    }
    .stTabs [aria-selected="true"] {
        color: #3a4d39 !important;
        border-bottom: 2px solid #e8b4b8 !important;
    }

    /* Enhanced Slider Styling */
    [data-baseweb="slider"] > div {
        background-color: #d1c7a7 !important; /* Darker track for visibility */
        height: 8px;
    }
    [data-baseweb="slider"] div[role="slider"] {
        background-color: #e8b4b8 !important; /* Rose thumb */
        border: 2px solid #3a4d39 !important;
        width: 24px;
        height: 24px;
    }
    </style>
""", unsafe_allow_html=True)

# --- DATA LOADING ---
@st.cache_data
def load_data():
    if not os.path.exists("processed_data.csv") or not os.path.exists("embeddings.npy"):
        return None, None
    try:
        df = pd.read_csv("processed_data.csv")
        # read_csv already loads empty 'user' cells as NaN, which the isna()/notnull()
        # filters below rely on, so no string round-trip is needed here
        embeddings = np.load("embeddings.npy")
        return df, embeddings
    except Exception as e:
        st.error(f"Error loading files: {e}")
        return None, None

df_init, embeddings = load_data()

# --- IF FILES ARE MISSING ---
if df_init is None:
    st.error("### ⚠️ Data Files Missing")
    st.write("Please ensure `processed_data.csv` and `embeddings.npy` are available in the project directory.")
    st.stop()

# --- SESSION STATE ---
if 'user_ratings' not in st.session_state:
    initial_you = df_init[df_init['user'] == 'You'][['title', 'rating', 'type']].to_dict('records')
    st.session_state.user_ratings = initial_you

# --- RECOMMENDATION LOGIC ---
def get_recommendations():
    current_df = df_init.copy()
    rated_titles = [r['title'] for r in st.session_state.user_ratings]

    # Friend Recommendations
    friends_recs = current_df[
        (current_df['user'].notnull()) &
        (current_df['user'] != 'You') &
        (current_df['rating'] >= 4) &
        (~current_df['title'].isin(rated_titles))
    ].drop_duplicates('title').head(10)

    # AI Discovery
    liked_titles = [r['title'] for r in st.session_state.user_ratings if r['rating'] >= 4]
    liked_indices = current_df[current_df['title'].isin(liked_titles)].index.tolist()

    ai_recs = pd.DataFrame()
    if liked_indices:
        sim_scores = cosine_similarity(embeddings[liked_indices], embeddings).mean(axis=0)
        current_df['similarity'] = sim_scores
        ai_recs = current_df[
            (~current_df['title'].isin(rated_titles)) &
            (current_df['user'].isna())
        ].sort_values('similarity', ascending=False).head(10)

    return friends_recs, ai_recs

# --- SIDEBAR ---
with st.sidebar:
    st.markdown("## 👤 My Profile")
    st.write("Rate items to improve your personalized recommendations.")

    catalog = sorted(df_init[df_init['user'].isna()]['title'].dropna().unique())
    selected_item = st.selectbox("Search Catalog:", catalog)
    rating = st.slider("Rating:", 1.0, 5.0, 4.0, 0.5)

    if st.button("Submit Rating"):
        if any(r['title'] == selected_item for r in st.session_state.user_ratings):
            st.warning("You have already rated this item.")
        else:
            item_type = df_init[df_init['title'] == selected_item]['type'].values[0]
            st.session_state.user_ratings.append({
                "title": selected_item,
                "rating": rating,
                "type": item_type
            })
            st.success("Rating submitted successfully.")
            st.rerun()

    st.markdown("---")
    st.markdown("### Recent Ratings")
    for r in reversed(st.session_state.user_ratings[-5:]):
        st.caption(f"⭐ {r['rating']} — {r['title']}")

# --- MAIN CONTENT ---
st.title("The Social Catalog")
st.markdown("##### A Personalized Book & Movie Recommendation Platform")

f_recs, a_recs = get_recommendations()

tab1, tab2 = st.tabs(["👥 Friend Recommendations", "🤖 AI Discovery"])

with tab1:
    if f_recs.empty:
        st.info("No recommendations currently available from your friends.")
    else:
        for _, row in f_recs.iterrows():
            st.markdown(f"""
                <div class="card">
                    <span class="{'book-tag' if row['type'] == 'Book' else 'movie-tag'}">{row['type']}</span>
                    <h3 style="margin-top: 15px; margin-bottom: 5px;">{row['title']}</h3>
                    <p style="font-size: 0.9em;">Rated by <b>{row['user']}</b> with <b>{row['rating']} ⭐</b></p>
                    <p style="font-size: 0.85em; opacity: 0.8;">{row['description'][:180]}...</p>
                </div>
            """, unsafe_allow_html=True)

with tab2:
    if a_recs.empty:
        st.warning("Rate items with 4+ stars to activate AI discovery.")
    else:
        cols = st.columns(2)
        for i, (_, row) in enumerate(a_recs.iterrows()):
            with cols[i % 2]:
                match_pct = int(row['similarity'] * 100)
                st.markdown(f"""
                    <div class="card">
                        <span class="{'book-tag' if row['type'] == 'Book' else 'movie-tag'}">{row['type']}</span>
                        <h3 style="margin-top: 15px; margin-bottom: 5px;">{row['title']}</h3>
                        <p style="color: #798c56; font-weight: bold; font-size: 0.9em;">{match_pct}% Match Score</p>
                        <p style="font-size: 0.85em; opacity: 0.8;">{row['description'][:140]}...</p>
                    </div>
                """, unsafe_allow_html=True)

st.markdown("---")
st.markdown("<p style='text-align: center; opacity: 0.4; font-size: 0.7em;'>Social Recommendation Engine v1.0</p>", unsafe_allow_html=True)

Overwriting app.py
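
The cards interpolate titles and descriptions directly into HTML rendered with `unsafe_allow_html=True`, and the match score is derived from a cosine similarity that can be negative. The sketch below shows two small helpers that could guard against both; `match_percent` and `safe_snippet` are hypothetical names, and rounding (rather than truncating) the percentage is a choice made here, not the app's current behaviour.

```python
import html

def match_percent(similarity):
    """Convert a cosine similarity to a whole percentage clamped to 0..100."""
    return max(0, min(100, int(round(similarity * 100))))

def safe_snippet(text, limit=140):
    """HTML-escape a description and truncate it, tolerating missing values."""
    if not isinstance(text, str):
        return ""
    snippet = text[:limit] + ("..." if len(text) > limit else "")
    return html.escape(snippet)

print(match_percent(0.736), match_percent(-0.2))
print(safe_snippet("<b>Dune</b> & sequels"))
```

Escaping matters because catalog text comes from external CSV files and may contain characters such as `<` or `&` that would otherwise be interpreted as markup.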


In [34]:
import os
from pyngrok import ngrok

# 1. Forcefully kill any background ngrok or streamlit processes
!pkill -9 ngrok
!pkill -9 streamlit

# 2. Reset the ngrok library state
ngrok.kill()

# 3. SET YOUR AUTH TOKEN AGAIN
NGROK_AUTH_TOKEN = os.environ.get("NGROK_AUTH_TOKEN", "YOUR_NGROK_AUTH_TOKEN")  # Read the token from the environment; never hard-code it in a shared notebook
ngrok.set_auth_token(NGROK_AUTH_TOKEN)

# 4. Create a new tunnel
try:
    public_url = ngrok.connect(8501)
    print("\n✅ SUCCESS!")
    print(f"★ New Streamlit Link: {public_url} ★")

    # 5. Start Streamlit again in the background
    get_ipython().system_raw('streamlit run app.py --server.port 8501 --server.address 0.0.0.0 &')
    print("\nWaiting for the interface to initialize...")
    print("If you see 'Bad Gateway' at first, just refresh the page in 5 seconds.")

except Exception as e:
    print(f"❌ Error: {e}")
    print("\nIf you still see ERR_NGROK_334, go to https://dashboard.ngrok.com/tunnels/agents and manually stop the active session.")


✅ SUCCESS!
★ New Streamlit Link: NgrokTunnel: "https://vacillatorily-evolvable-art.ngrok-free.dev" -> "http://localhost:8501" ★

Waiting for the interface to initialize...
If you see 'Bad Gateway' at first, just refresh the page in 5 seconds.
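
The "Bad Gateway" message appears when the tunnel is opened before Streamlit has started listening on its port. One way to avoid guessing is to poll the port until it accepts connections. This is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical helper, and the throwaway listener stands in for the Streamlit server.

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False

# Example: a throwaway local listener stands in for the Streamlit server.
server = socket.socket()
server.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
server.listen(1)
free_port = server.getsockname()[1]
print(wait_for_port("127.0.0.1", free_port, timeout=2.0))
server.close()
```

In the notebook, a call such as `wait_for_port("localhost", 8501)` after launching Streamlit would indicate when the public link is ready to open.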
