In [None]:
import os

import pandas as pd
from tqdm.notebook import tqdm

# Generating Sentences

Because we can't use the CSV files directly, we first need to extract the most important fields from the data. Since we're working with application reviews and feedback, those fields are:
- 'review_rating'
- 'review_text'
- 'sentiment'

Using these fields, we can construct one sentence per review that is readable and can be embedded and stored in the vector store. There are several sentence formats we could use:

1. Simple Concatenation
```
{review_text} [Rating: {review_rating}, Sentiment: {sentiment}]
```
2. Structured Concatenation
```
Review: "{review_text}", Rating: {review_rating}, Sentiment: {sentiment}
```
3. Custom Separator
```
{review_text} || {review_rating} || {sentiment}
```
4. Structured Natural Report
```
A user provided a review for the application, giving it a {review_rating} rating. The sentiment expressed was {sentiment} with the following feedback: "{review_text}".
```

We're going to use the Structured Natural Report because it produces the most readable, natural-sounding sentences. We could also include other relevant fields, but keep in mind that each one adds more tokens to the prompt.
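
To compare the four formats side by side, here is a minimal sketch that renders one review with each template. The `sample` record and the `TEMPLATES` dictionary (including its key names) are made up for illustration; only the field names and template strings come from the list above.

```python
# Hypothetical sample record; the field names match the columns listed above.
sample = {
    "review_text": "love the offline mode",
    "review_rating": 5,
    "sentiment": "Positive",
}

# The four sentence formats from the list above, keyed by illustrative names.
TEMPLATES = {
    "simple": "{review_text} [Rating: {review_rating}, Sentiment: {sentiment}]",
    "structured": 'Review: "{review_text}", Rating: {review_rating}, Sentiment: {sentiment}',
    "separator": "{review_text} || {review_rating} || {sentiment}",
    "natural_report": (
        "A user provided a review for the application, giving it a {review_rating} rating. "
        'The sentiment expressed was {sentiment} with the following feedback: "{review_text}".'
    ),
}

# Fill every template with the same record.
rendered = {name: template.format(**sample) for name, template in TEMPLATES.items()}

for name, sentence in rendered.items():
    # Character count is only a rough proxy for token count.
    print(f"{name} ({len(sentence)} chars): {sentence}")
```

The natural report is the longest of the four formats, so its readability comes at the cost of extra characters per review.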

Now let's convert the previous CSV files into TXT files.

In [22]:
files = [
    "spotify_reviews_cleaned.csv",
    "spotify_reviews_cleaned_filtered.csv",
    "spotify_reviews_cleaned_filtered_balanced.csv",
]

In [23]:
for file in files:
    file_path = os.path.join("../data", file)
    df = pd.read_csv(file_path)
    df = df.sample(frac=1)  # shuffle the dataset (a single pass is enough)
    output_path = file_path.replace(".csv", ".txt")
    # write as UTF-8 so the encoding matches the one used when the file is read back
    with open(output_path, "w", encoding="utf-8") as f:
        for index, data in tqdm(df.iterrows(), total=df.shape[0]):
            combined_sentence = (
                "A user provided a review for the application, giving it a {} rating. "
                'The sentiment expressed was {} with the following feedback: "{}".'
            ).format(data["review_rating"], data["sentiment"], data["review_text"])
            f.write(combined_sentence)
            f.write("\n")

  0%|          | 0/2710400 [00:00<?, ?it/s]

  0%|          | 0/1784947 [00:00<?, ?it/s]

  0%|          | 0/1105624 [00:00<?, ?it/s]

# Generate and Store using Deep Lake Vector Store

We're going to generate the embeddings and store them in the [Deep Lake Vector Store](https://github.com/activeloopai/deeplake). We're using OpenAI's `text-embedding-ada-002` model to compute the embeddings.

## Text Splitting
For the text splitter, we're going to use `RecursiveCharacterTextSplitter` with a `chunk_size` of 1000 characters. You can experiment with different `chunk_size` values using [chunkviz](https://chunkviz.up.railway.app/).
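
To get a feel for how `chunk_size` affects the result, here is a simplified sketch that greedily packs whole lines into chunks of at most `chunk_size` characters. It is an illustration only, not LangChain's implementation: the `pack_lines` helper is hypothetical, and `RecursiveCharacterTextSplitter` additionally falls back to finer separators when a single piece is longer than the limit.

```python
def pack_lines(text, chunk_size):
    """Greedily merge consecutive lines into chunks of at most chunk_size characters.

    A line longer than chunk_size is kept whole as its own chunk in this sketch.
    """
    chunks, current = [], ""
    for line in text.splitlines():
        candidate = f"{current}\n{line}" if current else line
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # current chunk is full, start a new one
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Three made-up "review sentences" of 20, 30 and 40 characters.
toy_text = "\n".join(["a" * 20, "b" * 30, "c" * 40])

small_chunks = pack_lines(toy_text, chunk_size=50)
large_chunks = pack_lines(toy_text, chunk_size=1000)
print(len(small_chunks), len(large_chunks))
```

A larger `chunk_size` packs more reviews into each chunk, which means fewer embeddings to compute, but each retrieved document then mixes several unrelated reviews.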

In [24]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open("../data/spotify_reviews_cleaned_filtered_balanced.txt", encoding="utf-8") as f:
    sample_text = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    length_function=len,
)

docs = text_splitter.split_text(sample_text)

In [25]:
print(len(docs))
for doc in docs[:1]:
    print(doc)

341664
A user provided a review for the application, giving it a 1 rating. The sentiment expressed was Negative with the following feedback: "very bad , more ads, and every option in premium mode only.".
A user provided a review for the application, giving it a 5 rating. The sentiment expressed was Positive with the following feedback: "love the offline mode".
A user provided a review for the application, giving it a 2 rating. The sentiment expressed was Negative with the following feedback: "it has been very disappointing lately. the issues i have been facing a lot is the music timeline slider bug and the local music download bug on android. it has been utterly annoying since the last good update around half a year ago. ur programming team needs to sort it out fod a satisfaction experience.".


## Generating Embeddings

For now we only process a single data source. Feel free to try the other sources (**keep in mind that you'll have to pay for the OpenAI services**).

Because it would take roughly 1 hour to compute embeddings for all of the chunks (~340K), let's limit the number to just 50K, which should finish in about 10 minutes.
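
The 10-minute figure follows from scaling the 1-hour estimate linearly with the number of chunks. A quick sketch of that arithmetic, assuming throughput stays constant (actual speed depends on API rate limits and network conditions):

```python
TOTAL_CHUNKS = 341_664    # number of chunks produced by the text splitter above
FULL_RUN_MINUTES = 60     # rough estimate for embedding every chunk
MAX_DOCS = 50_000         # number of chunks we actually embed

# Assumption: embedding time grows linearly with the number of chunks.
chunks_per_minute = TOTAL_CHUNKS / FULL_RUN_MINUTES
estimated_minutes = MAX_DOCS / chunks_per_minute
print(f"~{estimated_minutes:.1f} minutes for {MAX_DOCS:,} chunks")
```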

In [26]:
from dotenv import load_dotenv
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

load_dotenv()

MAX_DOCS = 50000

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

my_activeloop_dataset_name = "spotify_reviews_cleaned_filtered_balanced_50K_docs_1000_chunk"
dataset_path = f"../deeplake/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding=embeddings, num_workers=4)

# Only compute the embeddings if the dataset is still empty
if len(db.vectorstore) == 0:
    # add documents to our Deep Lake dataset
    db.add_texts(docs[:MAX_DOCS])

Deep Lake Dataset in ../deeplake/spotify_reviews_cleaned_filtered_balanced_50K_docs_1000_chunk already exists, loading from the storage


## Test Similarity Search

In [27]:
# retrieve the documents most relevant to a query and print the second hit (index 1)
query = "enjoy awesome ui"
docs = db.similarity_search(query)
print(docs[1].page_content)

A user provided a review for the application, giving it a 1 rating. The sentiment expressed was Negative with the following feedback: "me parece una vergüenza que las personas que no son premium disfruten de una ui totalmente nueva y agradable a la vista mientras que los que pagamos por el servicio yles estamos dando dinero a la empresa tengamos una ui vieja y desactualizada".
A user provided a review for the application, giving it a 5 rating. The sentiment expressed was Positive with the following feedback: "great musical experience! love it.".
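
Conceptually, a similarity search embeds the query and ranks the stored chunks by how close their vectors are to the query vector. The sketch below illustrates that idea with made-up 3-dimensional vectors and cosine similarity. Real `text-embedding-ada-002` vectors have 1,536 dimensions, and the distance metric Deep Lake actually uses depends on its configuration, so treat this as an illustration rather than a description of the library's internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" invented for this example; not real model output.
stored = {
    "great ui revamp": [0.9, 0.1, 0.0],
    "too many ads": [0.0, 0.2, 0.9],
    "love the offline mode": [0.5, 0.8, 0.1],
}
query_vector = [1.0, 0.2, 0.0]  # stands in for the embedded query

# Rank stored texts from most to least similar to the query.
ranked = sorted(
    stored,
    key=lambda text: cosine_similarity(query_vector, stored[text]),
    reverse=True,
)
print(ranked)
```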


In [28]:
docs

[Document(page_content='A user provided a review for the application, giving it a 4 rating. The sentiment expressed was Positive with the following feedback: "edit: great ui revamp, lot has changed since the last time. things are less boxy and i love it. the only improvement i can think of is add a swipe down to minimize now playing.".'),
 Document(page_content='A user provided a review for the application, giving it a 1 rating. The sentiment expressed was Negative with the following feedback: "me parece una vergüenza que las personas que no son premium disfruten de una ui totalmente nueva y agradable a la vista mientras que los que pagamos por el servicio yles estamos dando dinero a la empresa tengamos una ui vieja y desactualizada".\nA user provided a review for the application, giving it a 5 rating. The sentiment expressed was Positive with the following feedback: "great musical experience! love it.".'),
 Document(page_content='A user provided a review for the application, giving it