In [None]:
import os

import pandas as pd
from tqdm.notebook import tqdm

In [None]:
df = pd.read_csv("../data/spotify_reviews_cleaned_filtered_balanced.csv")

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1105624 entries, 0 to 1105623
Data columns (total 9 columns):
 #   Column              Non-Null Count    Dtype 
---  ------              --------------    ----- 
 0   Unnamed: 0          1105624 non-null  int64 
 1   review_id           1105624 non-null  object
 2   author_name         1105624 non-null  object
 3   review_text         1105624 non-null  object
 4   review_rating       1105624 non-null  int64 
 5   review_likes        1105624 non-null  int64 
 6   author_app_version  1105624 non-null  object
 7   sentiment           1105624 non-null  object
 8   review_month        1105624 non-null  object
dtypes: int64(3), object(6)
memory usage: 75.9+ MB


In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,review_id,author_name,review_text,review_rating,review_likes,author_app_version,sentiment,review_month
0,0,9542f216-8238-4e77-9f86-2cc076176cdd,A Google user,free music sound turns out flat,5,0,8.5.9.737,Positive,2019-06
1,1,59399276-ddd8-4bbc-aa4a-cd6bcd500e90,A Google user,i love being able to find the music i like. i ...,5,0,8.5.50.916,Positive,2020-03
2,2,1d499252-5691-4a11-9b41-3dc84ba096f3,A Google user,i'm loving being able to listen to my music wi...,5,0,3.5.0.963,Positive,2015-08
3,3,61ec8879-aae8-43d6-bf8b-916a8192d581,A Google user,"i recommend premium personally, but it's still...",5,0,3.5.0.963,Positive,2015-08
4,4,6a508555-a056-4ff7-94a7-b588f0f36614,A Google user,my day wouldn't be complete without spotify.,5,0,8.4.81.558,Positive,2018-12


# Generating Sentences

Because we can't use the CSV files directly, we need to extract the most important parts of the data. Since we're working with application reviews and feedback, the most important columns are:
- 'review_rating'
- 'review_text'
- 'review_likes'
- 'review_month'
- 'sentiment'

Using these columns, we can construct a sentence that is readable and well suited to the vector store. There are several sentence formats we could use:

1. Simple Concatenation
```
{review_text} [Rating: {review_rating}, Sentiment: {sentiment}]
```
2. Structured Concatenation
```
Review: "{review_text}", Rating: {review_rating}, Sentiment: {sentiment}
```
3. Custom Separator
```
{review_text} || {review_rating} || {sentiment}
```
4. Structured Natural Report
```
As of {review_month}, a user provided a {sentiment} review with a rating of {review_rating} stars. They mentioned: '{review_text}'. This review received {review_likes} upvotes from others.
```
5. Structured Natural Report with App Version
```
As of {review_month}, a user provided a {sentiment} review for version {major_version}.{minor_version}.{patch_version} with a rating of {review_rating} stars. They mentioned: '{review_text}'. This review received {review_likes} upvotes from others.
```

We're going to use the Structured Natural Report (with the app version when the version columns are available) because it produces the most readable and understandable text. We can also add other relevant fields, but keep in mind that each one adds more tokens to the prompt.
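
To compare the formats side by side, here is a minimal sketch that renders four of the templates for a single row. The row values are taken from the `df.head()` output above (with the month already written out); the `templates` and `rendered` names are only illustrative and are not used elsewhere in this notebook.

```python
# One sample row, based on the df.head() output above.
row = {
    "review_text": "my day wouldn't be complete without spotify.",
    "review_rating": 5,
    "review_likes": 0,
    "review_month": "December 2018",
    "sentiment": "positive",
}

# Illustrative copies of the templates listed above.
templates = {
    "simple": "{review_text} [Rating: {review_rating}, Sentiment: {sentiment}]",
    "structured": 'Review: "{review_text}", Rating: {review_rating}, Sentiment: {sentiment}',
    "separator": "{review_text} || {review_rating} || {sentiment}",
    "natural_report": (
        "As of {review_month}, a user provided a {sentiment} review with a rating of "
        "{review_rating} stars. They mentioned: '{review_text}'. "
        "This review received {review_likes} upvotes from others."
    ),
}

# Fill every template with the same row and show how long each result is.
rendered = {name: template.format(**row) for name, template in templates.items()}
for name, sentence in rendered.items():
    print(f"{name} ({len(sentence)} chars): {sentence}")
```

The natural report is the longest of the four, which is the token cost mentioned above.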

Now let's convert the previous CSV files into TXT files.

In [None]:
files = [
    "spotify_reviews_cleaned.csv",
    "spotify_reviews_cleaned_filtered.csv",
    "spotify_reviews_cleaned_filtered_balanced.csv",
    "spotify_filter_version_ratings_count.csv",
]

In [None]:
def combine_sentence(data):
    """Build a natural-language sentence from one review row, including the app version when available."""
    if "major_version" in data:
        sentence = f"As of {data['review_month']}, a user provided a {data['sentiment'].lower()} review for version {data['major_version']}.{data['minor_version']}.{data['patch_version']} with a rating of {data['review_rating']} stars. They mentioned: '{data['review_text']}'. This review received {data['review_likes']} upvotes from others."
    else:
        sentence = f"As of {data['review_month']}, a user provided a {data['sentiment'].lower()} review with a rating of {data['review_rating']} stars. They mentioned: '{data['review_text']}'. This review received {data['review_likes']} upvotes from others."
    return sentence

In [None]:
for file in files:
    file_path = os.path.join("../data", file)
    df = pd.read_csv(file_path)
    df = df.sample(frac=1)  # shuffle the dataset
    df["review_month"] = pd.to_datetime(df["review_month"], format="%Y-%m").dt.strftime("%B %Y")

    output_path = file_path.replace(".csv", ".txt")
    with open(output_path, "w", encoding="utf-8") as f:
        for index, data in tqdm(df.iterrows(), total=df.shape[0]):
            combined_sentences = combine_sentence(data)
            f.write(combined_sentences)
            f.write("\n")
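
The month conversion in the loop relies on standard `strptime`/`strftime` format codes (`%Y-%m` in, `%B %Y` out). As a minimal sketch of the same conversion, here it is with Python's built-in `datetime` instead of pandas; the `format_month` helper is hypothetical and only for illustration.

```python
from datetime import datetime

def format_month(value):
    """Turn a 'YYYY-MM' string into '<Month name> <year>' (hypothetical helper)."""
    return datetime.strptime(value, "%Y-%m").strftime("%B %Y")

# Month values as they appear in the df.head() output.
for raw in ["2019-06", "2020-03", "2015-08"]:
    print(raw, "->", format_month(raw))
```

Note that `%B` follows the active locale, so the month names are English only under an English or default locale.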

# Generate and Store using Deep Lake Vector Store

We're going to generate the embeddings and store them using the [Deep Lake Vector Store](https://github.com/activeloopai/deeplake). We're using OpenAI's `text-embedding-ada-002` model to compute the embeddings.

For now we're only processing a single data source. Feel free to try the other sources (**keep in mind that you'll have to pay for the OpenAI services**).

For the text splitter, we're going to use `RecursiveCharacterTextSplitter` with a `chunk_size` of 1000. We can experiment with different `chunk_size` values using [chunkviz](https://chunkviz.up.railway.app/).
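
To build some intuition for what a `chunk_size` of 1000 means for a file with one review per line, here is a simplified pure-Python sketch. It is not LangChain's implementation: the hypothetical `pack_lines` helper only packs whole lines greedily, whereas the real splitter also breaks up lines that are longer than the chunk size. The reviews are synthetic.

```python
def pack_lines(text, chunk_size=1000):
    """Greedily pack whole lines into chunks of at most chunk_size characters.

    Simplified illustration only. A line longer than chunk_size is kept whole
    here; a real splitter would break it into smaller pieces.
    """
    chunks, current = [], ""
    for line in text.splitlines():
        candidate = f"{current}\n{line}" if current else line
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = line
    if current:
        chunks.append(current)
    return chunks

# Ten synthetic reviews of 297 characters each, one per line.
reviews = "\n".join(f"Review number {i}: " + "x" * 280 for i in range(10))
chunks = pack_lines(reviews, chunk_size=1000)
for chunk in chunks:
    print(len(chunk), "chars,", chunk.count("\n") + 1, "reviews")
```

With short reviews, several of them end up in the same chunk, so one stored document does not necessarily correspond to one review. Conversely, a review longer than the chunk size would be spread over more than one document.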

In [None]:
from dotenv import load_dotenv
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import DeepLake

load_dotenv()

True

### Original Filtered Balanced

Because computing all of the embeddings (~340K) would take roughly 1 hour, let's limit the number of embeddings to just 50K. That should finish in about 10-15 minutes.
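
The time estimate can be sanity-checked with a little arithmetic. The sketch below assumes the batch size of 500 and the rate of about 6.28 seconds per batch that appear in the progress output further down in this notebook. Actual throughput depends on the API and the network, so treat the numbers as rough.

```python
import math

BATCH_SIZE = 500          # batch size reported in the progress output
SECONDS_PER_BATCH = 6.28  # rate observed in that run; it will vary

def estimate_minutes(num_docs):
    """Estimate the embedding time in minutes for num_docs documents."""
    batches = math.ceil(num_docs / BATCH_SIZE)
    return batches * SECONDS_PER_BATCH / 60

for n in (50_000, 340_000):
    print(f"{n:>7} docs -> ~{estimate_minutes(n):.0f} minutes")
```

Under these assumptions, 50K documents take about 10 minutes and 340K take a little over an hour.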

In [None]:
MAX_DOCS = 50000
with open("../data/spotify_reviews_cleaned_filtered_balanced.txt", encoding="utf-8") as f:
    sample_text = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    length_function=len,
)

docs = text_splitter.split_text(sample_text)
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

my_activeloop_dataset_name = "spotify_reviews_cleaned_filtered_balanced_50K_docs_1000_chunk"
dataset_path = f"../deeplake/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding=embeddings, num_workers=4)

# Make sure to compute it once
if len(db.vectorstore) == 0:
    # add documents to our Deep Lake dataset
    db.add_texts(docs[:MAX_DOCS])

In [None]:
MAX_DOCS = 50000
with open("../data/spotify_filter_version_ratings_count.txt", encoding="utf-8") as f:
    sample_text = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    length_function=len,
)

docs = text_splitter.split_text(sample_text)
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

my_activeloop_dataset_name = "spotify_filter_version_ratings_count_50K_docs_1000_chunk"
dataset_path = f"../deeplake/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding=embeddings, num_workers=4)

# Make sure to compute it once
if len(db.vectorstore) == 0:
    # add documents to our Deep Lake dataset
    db.add_texts(docs[:MAX_DOCS])

Creating 50000 embeddings in 100 batches of size 500:: 100%|█████████████████████████████████████| 100/100 [10:28<00:00,  6.28s/it]

Dataset(path='../deeplake/spotify_filter_version_ratings_count_50K_docs_1000_chunk', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype        shape       dtype  compression
  -------    -------      -------     -------  ------- 
   text       text      (50000, 1)      str     None   
 metadata     json      (50000, 1)      str     None   
 embedding  embedding  (50000, 1536)  float32   None   
    id        text      (50000, 1)      str     None   





## Test Similarity Search

In [None]:
# let's see the top relevant documents to a specific query
query = "enjoy awesome ui"
docs = db.similarity_search(query)
print(docs[1].page_content)

through ... iv now got 60-90 so something to simplify the ui would be great for us drivers that need to keep hands on a wheel. keep up the good work.'. This review received 3 upvotes from others.
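
Conceptually, a similarity search embeds the query and ranks the stored vectors by how close they are to it. The sketch below shows that ranking step with made-up 3-dimensional vectors and cosine similarity. Both are assumptions for illustration: real `text-embedding-ada-002` vectors have 1536 dimensions, and the metric the vector store actually uses depends on its configuration. The `top_k` helper is hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, stored, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored.items()]
    return sorted(scored, reverse=True)[:k]

# Made-up "embeddings" for three short texts.
stored = {
    "great ui, love it": [0.9, 0.1, 0.0],
    "app keeps crashing": [0.0, 0.2, 0.9],
    "ui is clean and simple": [0.8, 0.3, 0.1],
}
query_vector = [1.0, 0.2, 0.0]

for score, text in top_k(query_vector, stored):
    print(f"{score:.3f}  {text}")
```

The two UI-related texts rank first because their vectors point in nearly the same direction as the query vector.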


In [None]:
docs

[Document(page_content="As of June 2023, a user provided a positive review for version 8.8.22 with a rating of 4 stars. They mentioned: 'edit: hey all! i'm a long-time user and premium-payer. for years, i've used this app as my sole music playing platform. a few months ago, they went through with a clunky, awkward, and honestly awful ui update. thankfully, they've seemed to listen to the reviews and have reverted the ui change, at least for the home screen. i'll give em' 4 stars again, mostly for the new ai dj feature, but only 4 because that awful ui persists in the new music, podcasts & shows, and audiobooks tabs at the top.'. This review received 4 upvotes from others."),
 Document(page_content="through ... iv now got 60-90 so something to simplify the ui would be great for us drivers that need to keep hands on a wheel. keep up the good work.'. This review received 3 upvotes from others."),
 Document(page_content="As of November 2018, a user provided a positive review for version 8.