# Movie Recommendation System Using Hyperspace Engine
This notebook demonstrates how to use the Hyperspace engine to build a movie recommendation system. It starts with a vector search and then moves to a hybrid search, which combines classic and vector search by pairing word embeddings with metadata filtering.

# The Dataset
The data is taken from [MovieLens Latest Datasets](https://grouplens.org/datasets/movielens/latest/) and was downloaded from the [Kaggle movie recommender system dataset](https://www.kaggle.com/code/rounakbanik/movie-recommender-systems). It includes 40,954 valid movies. The data is stored in tabular (SQL-style) format and is converted to a document (NoSQL) format. The preprocessing is given in the notebook titled "MovieRecommendationDataPrep", available in this repository.
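
As a minimal sketch of the table-to-document conversion, the snippet below turns two made-up rows into a list of dictionaries, one per movie. The rows, the `|` separator and the `rows_to_documents` helper are illustrative assumptions, not the contents of the MovieRecommendationDataPrep notebook.

```python
import csv
import io

# Two made-up rows standing in for the tabular movie metadata (not real dataset values)
TABLE = """id,title,genres
1,Example Movie A,Drama|Comedy
2,Example Movie B,Action
"""

def rows_to_documents(csv_text):
    """Convert each table row into a standalone document (a dict)."""
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        documents.append({
            "id": int(row["id"]),
            "title": row["title"],
            # A multi-valued cell becomes a list field in the document
            "genres": row["genres"].split("|"),
        })
    return documents

documents = rows_to_documents(TABLE)
print(documents)
```

Each document carries all of its own fields, which is the shape the ingestion step later sends to the engine.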

## Setting up the Hyperspace environment
Setting up the environment requires the following steps:


1. Download and install the client API
2. Connect to a server
3. Create data schema file
4. Create collection
5. Ingest data
6. Run query

In [None]:
data_path = 'drive/MyDrive/Demos/MovieRecommendation/Movie_Recommendation_Processed.csv'

### Install the Hyperspace client API

Installing the Hyperspace client is straightforward and can be done with standard Python tools, such as pip.

In [None]:
pip install git+https://github.com/hyper-space-io/hyperspace-py

Collecting git+https://github.com/hyper-space-io/hyperspace-py
  Cloning https://github.com/hyper-space-io/hyperspace-py to /tmp/pip-req-build-lf3mmgdz
  Running command git clone --filter=blob:none --quiet https://github.com/hyper-space-io/hyperspace-py /tmp/pip-req-build-lf3mmgdz
  Resolved https://github.com/hyper-space-io/hyperspace-py to commit e397b51e57fd6c3d83cdde8a8ed1b6b81d0509a7
  Preparing metadata (setup.py) ... done


### Connect to Server
The Hyperspace engine runs on a remote machine, and connecting to it requires pre-provided credentials.

In [None]:
import hyperspace

# `username` and `password` are the pre-provided credentials; define them before running this cell
hyperspace_client = hyperspace.HyperspaceClientApi(host='https://search-master-demo.development.hyper-space.xyz',
                                                   username=username, password=password)
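
The cell above expects `username` and `password` to exist already. One way to supply them without hard-coding secrets in the notebook is sketched below. The environment variable names `HYPERSPACE_USERNAME` and `HYPERSPACE_PASSWORD` and the `read_credentials` helper are hypothetical, chosen only for this example.

```python
import os
from getpass import getpass

def read_credentials(env=None, ask=input, ask_secret=getpass):
    """Return (username, password), prompting only for values missing from env."""
    env = os.environ if env is None else env
    # Hypothetical variable names; use whatever your deployment defines
    username = env.get("HYPERSPACE_USERNAME") or ask("Hyperspace username: ")
    password = env.get("HYPERSPACE_PASSWORD") or ask_secret("Hyperspace password: ")
    return username, password

# Demo with a stand-in environment; in the notebook, call read_credentials() with no arguments
demo_env = {"HYPERSPACE_USERNAME": "demo_user", "HYPERSPACE_PASSWORD": "demo_pass"}
username, password = read_credentials(env=demo_env)
print(username)
```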

Check that the cluster is live

In [None]:
cluster_status = hyperspace_client.cluster_status()
display(cluster_status)

### Create the data schema file

Like other search databases, Hyperspace requires a configuration file that outlines the data schema.

In [None]:
import json

config = {
  "configuration": {
    "adult": {
      "type": "boolean"
    },
    "belongs_to_collection": {
      "type": "keyword"
    },
    "budget": {
      "type": "integer"
    },
    "genres": {
      "struct_type": "list",
      "type": "keyword"
    },
    "id": {
      "type": "integer"
    },
    "original_language": {
      "type": "keyword"
    },
    "popularity": {
      "type": "float"
    },
    "production_companies": {
      "struct_type": "list",
      "type": "keyword"
    },
    "production_countries": {
      "struct_type": "list",
      "type": "keyword"
    },
    "rating": {
      "type": "float"
    },
    "release_date_unix_time": {
      "type": "date"
    },
    "revenue": {
      "type": "float"
    },
    "runtime_days": {
      "type": "integer"
    },
    "spoken_languages": {
      "struct_type": "list",
      "type": "keyword"
    },
    "title": {
      "type": "keyword"
    },
    "description embedding": {
      "type": "dense_vector",
      "dim": 2048,
      "index_type": "brute_force",
      "metric": "IP"
    }
  }
}

with open('MovieRecommendation_config.json', 'w') as f:
    f.write(json.dumps(config, indent=2))
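
Before ingesting, it can help to confirm that the keys of a document line up with the fields declared in the schema. The sketch below compares the two sets of names. The `compare_fields` helper and the tiny example schema are assumptions made for illustration, and how the engine treats fields that are absent from the schema is not covered here.

```python
def compare_fields(config, document):
    """Return (missing, extra): schema fields absent from the document, and the reverse."""
    schema_fields = set(config["configuration"])
    document_fields = set(document)
    missing = sorted(schema_fields - document_fields)
    extra = sorted(document_fields - schema_fields)
    return missing, extra

# A reduced schema and a made-up document, for illustration only
example_config = {
    "configuration": {
        "title": {"type": "keyword"},
        "rating": {"type": "float"},
        "description embedding": {"type": "dense_vector", "dim": 4, "metric": "IP"},
    }
}
example_document = {
    "title": "Example Movie",
    "description embedding": [0.0, 0.5, 0.5, 0.0],
    "description text": "example text",
}

missing, extra = compare_fields(example_config, example_document)
print("missing from document:", missing)
print("not in schema:", extra)
```

Running the same comparison on one record of the final DataFrame shows which columns were added during preprocessing but never declared in the schema.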




# The NoSQL Dataset Fields
The processed metadata includes the following fields:

1. **adult** [boolean] - states whether the movie is rated 18+
2. **belongs_to_collection** [Keyword] - name of the collection that includes the movie. If the movie is not part of a collection, the value is "None"
3. **budget** [integer] - the budget of the movie in USD
4. **genres** [list[Keyword]] - list of movie genres (e.g., drama)
5. **id** [integer] - unique id per movie
6. **original_language** [Keyword] - the original language in which the movie was produced
7. **popularity** [float] - the popularity of the movie, formulated as an unbounded score
8. **production_companies** [list[Keyword]] - list of production companies involved in the movie
9. **production_countries** [list[Keyword]] - list of all countries in which the movie was filmed
10. **rating** [float] - the movie's IMDb weighted average rating
11. **release_date_unix_time** [integer] - the movie release date in Unix time
12. **revenue** [float] - the movie revenue in USD
13. **runtime_days** [integer] - the number of days the movie ran in cinemas
14. **spoken_languages** [list[Keyword]] - list of all languages spoken in the movie
15. **title** [Keyword] - the movie title
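
For reference, a release date can be turned into Unix time as sketched below. The `YYYY-MM-DD` input format and the `to_unix_time` helper are assumptions for illustration; the actual conversion lives in the data preparation notebook.

```python
from datetime import datetime, timezone

def to_unix_time(date_string):
    """Convert a 'YYYY-MM-DD' date to seconds since 1970-01-01 00:00 UTC."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

print(to_unix_time("2000-01-01"))  # 946684800
```

Pinning the timezone to UTC keeps the result independent of the machine that runs the notebook.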

# **Loading the SQL data**
We first load the movie metadata from a CSV file using pandas. The data was preprocessed so that it includes only the relevant features.


In [None]:
import pandas as pd
import warnings

warnings.filterwarnings('ignore')
df = pd.read_csv(data_path)
df["runtime_days"] = df["runtime_days"].astype(int)
df.info()


# **Word Embedding**
The next step is to embed the movie overviews and taglines. We use a simple vectorization whose vocabulary is built from the text itself (in contrast to a more sophisticated embedding using, e.g., BERT or GPT), based on scikit-learn's TfidfVectorizer. We first normalize the text (strip punctuation, drop stopwords, lemmatize verbs) and then replace a few inflected words with their base form.

In [None]:
import nltk
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

lemmatizer = WordNetLemmatizer()
# Use a separate name so the nltk `stopwords` module is not shadowed when the cell is re-run
stop_words = set(stopwords.words('english')) | {'be', 'is', 'are', 'am', 'was', 'were', 'been', 'being'}
replacement_dict = {"Cheated":"Cheat","Photographs":"Photograph","Awfully":"Awful","Poisoner":"Poisoner","comix":"comics",
                    "embarrassingly":"embarrassing"}

def normalize_text(text):
    # Replace every non-word character with a space
    text = re.sub(r'\W', ' ', text)

    words = nltk.word_tokenize(text)
    # Drop stopwords, single characters and pure numbers, then lemmatize the remaining words as verbs
    normalized_words = [word for word in words if word.lower() not in stop_words]
    normalized_words = [lemmatizer.lemmatize(word, pos='v') for word in normalized_words if len(word) > 1 and not word.isdigit()]

    normalized_text = ' '.join(normalized_words)
    for key, value in replacement_dict.items():
        normalized_text = normalized_text.replace(key, value)
    return normalized_text

# Movies without a tagline get an empty string
df['tagline'] = df['tagline'].fillna('')
df['tagline'] = df['tagline'].apply(normalize_text)
df['overview'] = df['overview'].apply(normalize_text)

# Join with a space so the last word of the overview does not merge with the first word of the tagline
df["description text"] = df["overview"] + " " + df["tagline"]
del df["overview"]
del df["tagline"]


In [None]:
# Must match the "dim" of the "description embedding" field in the schema
embedding_vector_length = 2048

def embedded_text(df_data, min_word_count):
  # Fit TF-IDF on the full vocabulary, ignoring terms that appear in fewer than min_word_count documents
  tfidf = TfidfVectorizer(token_pattern=r'\b\w+\b', stop_words="english", min_df=min_word_count)
  tfidf.fit(df_data)
  # idf_ is aligned with the feature index, so look terms up by index rather than by dictionary order
  feature_names = tfidf.get_feature_names_out()
  top_terms_idx = tfidf.idf_.argsort()[::-1][:embedding_vector_length]
  top_terms = [feature_names[i] for i in top_terms_idx]
  # Re-compute TF-IDF restricted to the selected terms
  new_tfidf = TfidfVectorizer(token_pattern=r'\b\w+\b', vocabulary=top_terms)
  new_tfidf_matrix = new_tfidf.fit_transform(df_data)
  # Convert to a dense array and round to two decimals
  tfidf_matrix = new_tfidf_matrix.toarray().round(2)
  return list(tfidf_matrix)

df["description embedding"] = embedded_text(df["description text"], 10)
df["description embedding"] = df["description embedding"].map(lambda x: list(x))
df.reset_index(inplace=True)
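
The schema declares the embedding with `"metric": "IP"` (inner product). scikit-learn's `TfidfVectorizer` L2-normalizes each row by default, so for these vectors the inner product coincides with cosine similarity, up to the two-decimal rounding applied above. The sketch below illustrates this with a small pure-Python re-implementation of smoothed TF-IDF on three made-up descriptions. It is a stand-in for illustration only, not the code used in this notebook.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, the scikit-learn default
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocabulary, vectors

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

# Made-up, already normalized descriptions
docs = [
    "space crew fight alien",
    "alien hunt space marine",
    "wedding plan go wrong",
]
vocab, vecs = tfidf_vectors(docs)
print(round(inner_product(vecs[0], vecs[0]), 6))  # self-similarity of a unit vector
print(inner_product(vecs[0], vecs[1]) > inner_product(vecs[0], vecs[2]))
```

The first two descriptions share terms and therefore score higher against each other than against the third, which shares none.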

### Create Collection
A collection stores data of a similar context. It is created on the remote server from the configuration file prepared above, which outlines the data structure.

In [None]:
delete_collections = True
collection_name = 'Movies'

if delete_collections:
  if collection_name in cluster_status[0]['Collections size']:
    hyperspace_client.delete_collection(collection_name)

hyperspace_client.create_collection('MovieRecommendation_config.json', collection_name)
hyperspace_client.cluster_status()

# Data Ingestion

In [None]:

BATCH_SIZE = 500


def chunker(df, size):
    # Yield consecutive slices of the DataFrame with at most `size` rows each
    return (df.iloc[pos:pos + size] for pos in range(0, len(df), size))


i = 0  # number of documents ingested so far; also the id offset of the next batch
for chunk in chunker(df, BATCH_SIZE):
    # Document ids are the running row positions, stored as strings
    batch = [hyperspace.Document(str(i + j), row) for j, row in enumerate(chunk.to_dict('records'))]
    response = hyperspace_client.add_batch(batch, collection_name)
    i += len(batch)
    print(i, response)

hyperspace_client.commit(collection_name)
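
To see how the batching splits the rows, the sketch below computes the start and stop positions of each batch for a made-up row count. The `chunk_bounds` helper and the number 1203 are illustrative assumptions.

```python
def chunk_bounds(total_rows, batch_size):
    """Yield (start, stop) row positions for consecutive batches."""
    for start in range(0, total_rows, batch_size):
        yield start, min(start + batch_size, total_rows)

bounds = list(chunk_bounds(1203, 500))
print(bounds)  # [(0, 500), (500, 1000), (1000, 1203)]
```

The last batch is shorter than the others, and each batch's start position is the id given to its first document.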



# Creating The Query
Hyperspace queries are written in Python syntax and saved as strings.

# Loading the score function
The score function incorporates logic based on movie budget and rating, and gives a bonus to movies that share production companies. Only movies of the same genre are returned.

In [None]:
sf_file = 'movies_score_function.py'
hyperspace_client.set_function(sf_file, collection_name=collection_name, function_name='popular_films_recommendation')
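
The contents of `movies_score_function.py` are not shown in this notebook. To make the described logic concrete, here is a plain-Python sketch of a score with the same ingredients. It is not Hyperspace score function syntax, and the weights, the budget rule and the "shares at least one genre" reading of "same genre" are all assumptions made for illustration.

```python
def recommendation_score(query_movie, candidate):
    """Hypothetical score: 0.0 filters the candidate out, higher is better."""
    # Keep only candidates that share at least one genre with the query movie
    if not set(query_movie["genres"]) & set(candidate["genres"]):
        return 0.0
    score = candidate["rating"]
    # Illustrative budget rule: bonus when the two budgets are within a factor of two
    low, high = sorted([query_movie["budget"], candidate["budget"]])
    if low > 0 and high <= 2 * low:
        score += 1.0
    # Bonus for a shared production company
    if set(query_movie["production_companies"]) & set(candidate["production_companies"]):
        score += 2.0
    return score

# Made-up movies, for illustration only
query_movie = {"genres": ["Drama"], "budget": 10_000_000, "production_companies": ["Studio A"]}
same_studio = {"genres": ["Drama", "Crime"], "rating": 7.5, "budget": 15_000_000,
               "production_companies": ["Studio A"]}
other_genre = {"genres": ["Comedy"], "rating": 9.0, "budget": 10_000_000,
               "production_companies": ["Studio A"]}

print(recommendation_score(query_movie, same_studio))
print(recommendation_score(query_movie, other_genre))
```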

# **Running The Query**
The next step is to apply the query logic. We will run a vector search, followed by a hybrid search that adds analytic logic, such as boosts based on rating and genres. Let's start with the vector search.

In [None]:
# Use an already ingested movie as the query; its fields are passed as the query parameters
input_vector = hyperspace_client.get_document(document_id='47', collection_name=collection_name)

print(f"searching for matches for '{input_vector['title']}'")
print("-------------------------------------------------")

query = {
    'params': input_vector,
    "knn": {
        "query": {"boost": 0},  # boost = 0 means no metadata filtering
        "description embedding": {
            "boost": 10,
        }
    }
}

results = hyperspace_client.search(query,
                                        size=15,
                                        function_name='popular_films_recommendation',
                                        collection_name=collection_name)

candidates = results['candidates']

print(f"Query run time = {results['took_ms']}ms")
print("-------------------------------------------------")

for i, result in enumerate(results['similarity']):
  api_response = hyperspace_client.get_document(document_id=result['document_id'], collection_name=collection_name)
  print(i + 1, "id", result['document_id'],  ":", api_response['title'])



Now we repeat the search as a hybrid search, giving the metadata logic a high boost and the embedding a low one.

In [None]:
# Same query movie as in the vector search
input_vector = hyperspace_client.get_document(document_id='47', collection_name=collection_name)

print(f"searching for matches for '{input_vector['title']}'")
print("-------------------------------------------------")

query = {
    'params': input_vector,
    "knn": {
        "query": {"boost": 100},  # boost > 0 turns on the metadata logic
        "description embedding": {
            "boost": 1,
        }
    }
}

results = hyperspace_client.search(query,
                                        size=15,
                                        function_name='popular_films_recommendation',
                                        collection_name=collection_name)

candidates = results['candidates']

print(f"Query run time = {results['took_ms']}ms")
print("-------------------------------------------------")

for i, result in enumerate(results['similarity']):
  api_response = hyperspace_client.get_document(document_id=result['document_id'], collection_name=collection_name)
  print(i + 1, "id", result['document_id'],  ":", api_response['title'])
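
To build intuition for why the two boost settings return different lists, the toy example below ranks two made-up candidates under both settings. It assumes, purely for illustration, that the final score is a weighted sum of a metadata score and a vector similarity. The engine's actual scoring formula is not described in this notebook, and the `rank` helper and all numbers are hypothetical.

```python
def rank(candidates, metadata_boost, vector_boost):
    """Rank titles by an assumed weighted sum of metadata score and vector similarity."""
    scored = [
        (metadata_boost * c["metadata_score"] + vector_boost * c["vector_similarity"], c["title"])
        for c in candidates
    ]
    return [title for _, title in sorted(scored, reverse=True)]

# Made-up candidates: A is textually closer, B matches better on metadata
candidates = [
    {"title": "Movie A", "metadata_score": 0.2, "vector_similarity": 0.9},
    {"title": "Movie B", "metadata_score": 0.8, "vector_similarity": 0.4},
]

vector_only = rank(candidates, metadata_boost=0, vector_boost=10)
hybrid = rank(candidates, metadata_boost=100, vector_boost=1)
print(vector_only)
print(hybrid)
```

With the metadata boost at zero, the textually closer movie comes first. With a high metadata boost, the order flips.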

