<a href="https://colab.research.google.com/github/nimalankarthik/AnimeFinder/blob/main/Backend.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# *Anime Finder Application*
An application that accepts a plot description of an anime from the user and displays the anime the user is most likely describing, using vector embedding similarity scores and LLMs.

# Step 1: Install dependencies & load mandatory files

Make sure the files **mal_anime.csv** and **secure-connect-animeinfodb.zip** are in the same directory as this notebook before running it (the code below reads them from `/content/`, the default Colab working directory).

In [None]:
%pip install InstructorEmbedding
%pip install -U sentence-transformers
%pip install cassio cohere datasets langchain openai tiktoken

# Step 2: Configure API keys
The API keys are used to connect to Astra DB and OpenAI.
We use Astra DB, a Cassandra-based database, to store the generated vector embeddings of the anime dataset.
The OpenAI API is used to access their embedding model **text-embedding-ada-002** and their large language model **gpt-3.5-turbo**.

In [None]:
ASTRA_DB_SECURE_BUNDLE_PATH="/content/secure-connect-animeinfodb.zip"
ASTRA_DB_APPLICATION_TOKEN="AstraCS:ReplaceWithYourToken"
ASTRA_DB_CLIENT_ID="ReplaceWithYourId"
ASTRA_DB_CLIENT_SECRET="ReplaceWithYourSecret"
ASTRA_DB_KEYSPACE="ReplaceWithYourKeyspace"
OPENAI_API_KEY="ReplaceWithYourAPIKey"
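
Hard-coding secrets in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch of one alternative, the settings could be read from environment variables. The helper name `read_setting` is hypothetical and not part of this notebook; the value set below is a placeholder for demonstration only.

```python
import os

def read_setting(name, default=None):
    """Return the environment variable `name`, or `default` if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if value:
        return value
    if default is not None:
        return default
    raise RuntimeError(f"Missing required setting: {name}")

# Placeholder value set in-process purely so the example runs on its own.
os.environ["ASTRA_DB_KEYSPACE"] = "demo_keyspace"

keyspace = read_setting("ASTRA_DB_KEYSPACE")
print(keyspace)
```

With this approach the real values would be set outside the notebook (for example in the shell or a secrets manager), and a missing key fails early with a clear message.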

# Step 3: Import all dependencies of the program

In [None]:
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI as LangChainOpenAI  # aliased: the name OpenAI is rebound to the openai client below
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# import datasets as ds
import pandas as pd
import time
from openai import OpenAI

# Step 4: Data Preprocessing:
This involves combining all columns of each row into a single string and collecting those strings in a single DataFrame. The dataset includes the following 17 columns:

1. title: title of the anime
2. episodes: number of episodes
3. status: whether the anime is still airing or has finished airing
4. theme: the theme of the anime
5. demographic: the demographic of the anime (e.g. shonen, shojo, seinen and josei)
6. genres: genres of the anime
7. type: whether the anime is a TV show, movie, etc.
8. favorites: the number of authenticated users that favorited the anime
9. popularity: the ranking of the anime based on the total members count compared to other anime
10. rank: the ranking of the anime based on the score compared to other anime
11. score: the average score of all authenticated users that made a public vote on the anime
12. members: total number of people that added the anime to their personal anime list (e.g. completed, watching, on-hold, dropped)
13. synopsis: plot of the anime
14. aired: when the anime aired
15. duration: the duration of the anime, e.g. duration per episode
16. premiered: the season in which the anime premiered
17. studio: the studio that produced the anime

In [None]:
animeDf = pd.read_csv("/content/mal_anime.csv")
animeSynopsis = animeDf['synopsis']
animeName = animeDf['title']
animeAllInfoDf = animeDf.apply(lambda row: ', '.join(f"{col}: {val}" for col, val in row.items() if col != animeDf.columns[0]), axis=1)
animeAllInfoDf = pd.DataFrame(animeAllInfoDf)
animeAllInfoDf.head()

Unnamed: 0,0
0,"title: Fullmetal Alchemist: Brotherhood, episo..."
1,"title: Steins;Gate, episodes: 24, status: Fini..."
2,"title: Bleach: Sennen Kessen-hen, episodes: 13..."
3,"title: Gintama°, episodes: 51, status: Finishe..."
4,title: Kaguya-sama wa Kokurasetai: Ultra Roman...
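
The row-flattening step above can be illustrated without the dataset. The following is a sketch using only the standard library on a tiny made-up CSV (the titles and values are invented for illustration, and `flatten_row` is a hypothetical helper name). It mirrors the idea of the `apply` call: join every `column: value` pair of a row and skip the unnamed index column.

```python
import csv
import io

# Made-up sample with an unnamed first column, like the index column in mal_anime.csv.
SAMPLE_CSV = """,title,episodes,status
0,Example Show A,12,Finished Airing
1,Example Show B,24,Currently Airing
"""

def flatten_row(row, skip_column):
    """Join every 'column: value' pair of a row except the one named `skip_column`."""
    return ", ".join(f"{col}: {val}" for col, val in row.items() if col != skip_column)

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
index_column = reader.fieldnames[0]  # the unnamed index column (an empty string here)
documents = [flatten_row(row, index_column) for row in reader]

for doc in documents:
    print(doc)
```

Each resulting string carries the column names along with the values, which gives the embedding model some context about what each value means.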


# Step 5: Embedding Generation
OpenAI's text-embedding-ada-002 model was used to embed the anime dataset. This was done via batch processing with wait times to work around OpenAI's tokens-per-minute and requests-per-minute limits.
Once the embeddings are generated, they are written to Astra DB, a vector database based on Apache Cassandra.

In [None]:
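# Function to process data in batches
def process_in_batches(data, batch_size):
    # The start offset 14700 is hard-coded; set it to 0 to process the whole dataset.
    for i in range(14700, len(data), batch_size):
        calculateBatchValues = data.iloc[i:i + batch_size]
        # Sum the character counts cell by cell. DataFrame.apply would pass a whole
        # column to count_characters and measure the length of its printed summary.
        characterLengthInBatch = sum(count_characters(cell) for cell in calculateBatchValues.to_numpy().ravel())
        print(characterLengthInBatch)
        if characterLengthInBatch < characters_per_minute_limit:
            yield calculateBatchValues
        else:
            # An oversized batch is skipped, not retried.
            print(f"Character limit too high in this batch! (More than {characters_per_minute_limit} characters)")

def count_characters(cell):
    return len(str(cell))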


# OpenAI rate limits
characters_per_minute_limit = 160000
last_request_time = time.time()
# Initialize the OpenAI embedding model and the Cassandra connection.
# No LLM is needed for this step: only embeddings are generated here.
myEmbedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

cloudConfig = {"secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH}
authProvider = PlainTextAuthProvider(ASTRA_DB_CLIENT_ID, ASTRA_DB_CLIENT_SECRET)
cluster = Cluster(cloud=cloudConfig, auth_provider=authProvider)
astraSession = cluster.connect()
batchNumber = 1
RPM = 1

myCassandraVectorStore = Cassandra(embedding=myEmbedding, session=astraSession, keyspace=ASTRA_DB_KEYSPACE, table_name="AnimeCompleteInfoTable")

# Process data in batches
batch_size = 100  # Adjust the batch size based on your data processing capability

for batch_data in process_in_batches(animeAllInfoDf, batch_size):

    current_time = time.time()
    if RPM <= 3:
        print(batch_data)
        print(type(batch_data))
        # Make requests
        batch_string = batch_data.apply(lambda row: ', '.join(map(str, row)), axis=1)
        myCassandraVectorStore.add_texts(batch_string)
        RPM += 1
        print(f"Batch {batchNumber} successfully pushed to DB")
        batchNumber += 1

        # Update the last request time
        last_request_time = time.time()


    # Add a delay if approaching the rate limits

    elapsed_time = last_request_time - current_time

    if elapsed_time < 60 and RPM > 2:
        wait_time = 60 - elapsed_time
        print(f"Waiting for {wait_time} seconds to stay within rate limits...")
        time.sleep(wait_time)
        RPM = 1

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(132302812229408) 4e8b9911-2212-41c5-9a44-450c21f0778b-us-east-2.db.astra.datastax.com:29042:4d353b1e-55dc-4efd-94d9-8c4173a52315> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


684
                                                       0
14700  title: Annyeong Jadoo: Eondeo Deo Ssi, episode...
14701  title: Annyeong Jadoo: Hyonyeojadu & Eomjigong...
14702  title: Annyeong Jadoo: In-eogongju Pyeon, epis...
14703  title: Annyeong Jadoo: Jindallaeui Bimil, epis...
14704  title: Annyeong Jadoo: Seoyugi Eojjeoda Yeong-...
...                                                  ...
14795  title: AU: Mengxiang Xilie, episodes: 40, stat...
14796  title: AU: Shen Qihuan Jing, episodes: 26, sta...
14797  title: AU: Shenqi Luobo, episodes: 52, status:...
14798  title: AU: Shenqi Luobo 2nd Season, episodes: ...
14799  title: AU: Shenqi Luobo 3rd Season, episodes: ...

[100 rows x 1 columns]
<class 'pandas.core.frame.DataFrame'>
Batch 1 successfully pushed to DB
684
                                                       0
14800  title: AU: Tong Qu Xilie, episodes: 100, statu...
14801  title: AU: Tu Zhi Lai Le, episodes: 26, status...
14802  title: AU: Tu Zhi Lai Le 2nd Seaso
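
The notebook's loop skips any batch over the character limit and pauses after a fixed number of requests. As an alternative sketch (not the method used above), batches can be cut so that they always fit a character budget, and a small sliding-window limiter can space out requests. The names `batches_within_budget` and `RequestLimiter` are hypothetical, and the limits in the example are arbitrary.

```python
import time
from collections import deque

def batches_within_budget(texts, max_items, max_chars):
    """Yield lists of texts holding at most `max_items` items and `max_chars` characters."""
    batch, batch_chars = [], 0
    for text in texts:
        size = len(text)
        if size > max_chars:
            raise ValueError("A single text exceeds the character budget")
        # Close the current batch if adding this text would break either limit.
        if batch and (len(batch) >= max_items or batch_chars + size > max_chars):
            yield batch
            batch, batch_chars = [], 0
        batch.append(text)
        batch_chars += size
    if batch:
        yield batch

class RequestLimiter:
    """Allow at most `max_requests` calls to wait() per `window` seconds."""

    def __init__(self, max_requests, window=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.sent = deque()  # timestamps of recent requests, oldest first

    def wait(self):
        now = self.clock()
        # Forget requests that have already left the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) >= self.max_requests:
            # Sleep until the oldest request leaves the window, then drop it.
            self.sleep(self.window - (now - self.sent[0]))
            now = self.clock()
            self.sent.popleft()
        self.sent.append(now)

texts = ["a" * 40, "b" * 70, "c" * 30, "d" * 10]
for batch in batches_within_budget(texts, max_items=2, max_chars=100):
    print([len(t) for t in batch])
```

In an upload loop, `limiter.wait()` would be called once before each request to the embedding API. The clock and sleep functions are injectable so the limiter can be exercised without real delays.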

# Step 6: Establish connection to Database

Connect to the Astra DB vector database so that the user's query can be compared with the stored embeddings by similarity score.

In [None]:
# Start the connection to Cassandra
cloudConfig={"secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH}
authProvider = PlainTextAuthProvider(ASTRA_DB_CLIENT_ID, ASTRA_DB_CLIENT_SECRET)
cluster = Cluster(cloud=cloudConfig, auth_provider=authProvider)
astraSession = cluster.connect()

#llm = OpenAI(openai_api=OPENAI_API_KEY)
myEmbedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

myCassandraVectorStore = Cassandra(embedding=myEmbedding, session=astraSession, keyspace=ASTRA_DB_KEYSPACE, table_name="AnimeCompleteInfoTable")


ERROR:cassandra.connection:Closing connection <AsyncoreConnection(138227953203184) 4e8b9911-2212-41c5-9a44-450c21f0778b-us-east-2.db.astra.datastax.com:29042:d6a63e88-41d1-462e-91f9-bd53eed20a1e> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


# Step 7: User input handling, similarity score calculation and LLM results

The user enters a string describing the plot of an anime. This string is compared against the vector database and the 5 most similar documents are retrieved (`k=5` in the code below). These documents, along with the user's query, are passed to a large language model (gpt-3.5-turbo) through a prompt template. The LLM then outputs the anime the user is most likely describing, followed by 3 more anime with the highest similarity scores.
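
The vector store performs the similarity search internally, and the notebook does not show which metric it uses or how it scales the scores. As an illustration only, and assuming cosine similarity, a top-k search over a handful of toy 3-dimensional vectors could look like this. The names `cosine_similarity`, `top_k` and the sample vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, stored, k):
    """Return the k (title, score) pairs with the highest similarity to the query."""
    scored = [(title, cosine_similarity(query_vector, vec)) for title, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings; real embeddings have far more dimensions.
stored = {
    "show_a": [1.0, 0.0, 0.0],
    "show_b": [0.6, 0.8, 0.0],
    "show_c": [0.0, 0.0, 1.0],
}
query_vector = [1.0, 0.0, 0.0]

for title, score in top_k(query_vector, stored, k=2):
    print(title, round(score, 4))
```

Vectors pointing in the same direction as the query score close to 1, and orthogonal vectors score 0, which is why a higher score is read as a more probable match.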


In [None]:
client = OpenAI(api_key=OPENAI_API_KEY)

query = input("Enter the description of the anime: ")
docs = ""

# Collect the retrieved documents together with their similarity scores,
# since the prompt below asks the model to report those scores.
for doc, score in myCassandraVectorStore.similarity_search_with_score(query, k=5):
    docs += f"\n\n{doc.page_content}\nSimilarity score: {score:.4f}"

template = """Give me the name and details of the anime which has a plot similar to the following: %s

Use the following template strictly.

Answer:
I think the anime you are talking about is, anime_name (Replace this with the most likely anime)

Here are few other anime's I think you would like. (Give exactly 3 results and keep this description short and provide similarity score rounded off to the 4th decimal point)

Note:
You can use the following information, which I think are the most likely Anime.
The higher the similarity score, the more probable the Anime. \n %s
""" % (query, docs)

message = [{"role": "system", "content": template}]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=message
)

print(response.choices[0].message.content)


Enter the description of the anime: Giant human looking monsters eat humans




I think the anime you are talking about is Shingeki no Kyojin (Attack on Titan)

Here are few other anime's I think you would like. 

1. Aigan Kaijuu - Similarity Score: 0.5731
2. Godzilla 1: Kaijuu Wakusei - Similarity Score: 0.3533
3. Monster Musume no Oishasan - Similarity Score: 0.3183
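
The prompt asks the model to report similarity scores, so the scores need to appear in the text that is sent to it. The following sketch shows one way the prompt assembly could be factored into small functions. It is an assumption-laden variant, not the notebook's exact prompt: the helper names, the split into a system and a user message, and the sample matches are all hypothetical.

```python
def format_matches(matches):
    """Render (text, score) pairs as one text block, best match first."""
    ordered = sorted(matches, key=lambda m: m[1], reverse=True)
    return "\n\n".join(f"{text}\nSimilarity score: {score:.4f}" for text, score in ordered)

def build_messages(query, matches):
    """Build a chat message list from the user's query and the retrieved matches."""
    instructions = (
        "Identify the anime whose plot best matches the description. "
        "Then list exactly 3 other candidates with their similarity scores."
    )
    content = f"Plot description: {query}\n\nCandidates:\n{format_matches(matches)}"
    return [
        {"role": "system", "content": instructions},
        {"role": "user", "content": content},
    ]

# Made-up retrieval results for demonstration.
sample_matches = [
    ("title: Example Show B, synopsis: ...", 0.35331),
    ("title: Example Show A, synopsis: ...", 0.57312),
]
messages = build_messages("Giant human looking monsters eat humans", sample_matches)
print(messages[1]["content"])
```

The resulting list has the same shape as the `messages` argument passed to `client.chat.completions.create` above. Keeping the formatting in a separate function makes it easy to check that every candidate is sent with its score.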


In [None]:
# Close the Cassandra connection
cluster.shutdown()