In [2]:
import os

import pandas as pd
from tqdm.notebook import tqdm

In [4]:
df = pd.read_csv('../data/spotify_reviews_cleaned_filtered_balanced.csv')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1105624 entries, 0 to 1105623
Data columns (total 8 columns):
 #   Column              Non-Null Count    Dtype 
---  ------              --------------    ----- 
 0   Unnamed: 0          1105624 non-null  int64 
 1   review_id           1105624 non-null  object
 2   author_name         1105624 non-null  object
 3   review_text         1105624 non-null  object
 4   review_rating       1105624 non-null  int64 
 5   author_app_version  1105624 non-null  object
 6   sentiment           1105624 non-null  object
 7   review_month        1105624 non-null  object
dtypes: int64(2), object(6)
memory usage: 67.5+ MB


In [6]:
df.head()

Unnamed: 0.1,Unnamed: 0,review_id,author_name,review_text,review_rating,author_app_version,sentiment,review_month
0,0,bf5609a9-1982-4feb-9922-c4cef1c0d16f,A Google user,its a really good app. i can now listen to any...,5,8.1.0.785,Positive,2017-03
1,1,30ba3b60-3293-47cc-bc5c-4335155f9f57,ap************ta,it is the best music app ever used,5,8.6.58.994,Positive,2021-09
2,2,a01eccb8-b848-4e0f-bb68-85e3ac97448c,Sh**********an,everyday is amazing because of spotify. i got ...,5,8.7.4.1056,Positive,2022-02
3,3,ab926399-d93f-4950-8b8b-ba6d7bd13d40,Vi***********ms,very smooth to use.,5,8.5.70.868,Positive,2020-08
4,4,6a6f006d-0004-4496-8837-96ed2ca37098,Fa***************in,such a great apps. i love it.,5,8.5.80.1037,Positive,2020-10


# Generating Sentences

The CSV files can't be loaded into the vector store as-is, so we first extract the fields that matter. Since we're working with application reviews and feedback, the most important columns are:
- 'review_rating'
- 'review_text'
- 'review_likes'
- 'review_month'
- 'sentiment'

Using these fields, we can construct a sentence that is readable and works well as an entry in the vector store. Several sentence formats are possible:

1. Simple Concatenation
```
{review_text} [Rating: {review_rating}, Sentiment: {sentiment}]
```
2. Structured Concatenation
```
Review: "{review_text}", Rating: {review_rating}, Sentiment: {sentiment}
```
3. Custom Separator
```
{review_text} || {review_rating} || {sentiment}
```
4. Structured Natural Report
```
As of {review_month}, a user provided a {sentiment} review with a rating of {review_rating} stars. They mentioned: '{review_text}'. This review received {review_likes} upvotes from others.
```

We're going to use the Structured Natural Report because it reads most naturally. Other relevant fields could be added too, but keep in mind that each one adds tokens to the prompt.
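
To compare the four formats side by side, here is a minimal sketch that renders each one for a single review. The `row` dictionary, the `TEMPLATES` mapping, and the `render` helper are illustrative names, not part of the notebook, and the `review_likes` value is made up for the example.

```python
# Hypothetical sample row; the keys match the columns listed above.
# The review_likes value is invented for illustration.
row = {
    "review_text": "very smooth to use.",
    "review_rating": 5,
    "sentiment": "positive",
    "review_month": "August 2020",
    "review_likes": 0,
}

# The four candidate formats, written as str.format patterns.
TEMPLATES = {
    "simple": "{review_text} [Rating: {review_rating}, Sentiment: {sentiment}]",
    "structured": 'Review: "{review_text}", Rating: {review_rating}, Sentiment: {sentiment}',
    "separator": "{review_text} || {review_rating} || {sentiment}",
    "report": (
        "As of {review_month}, a user provided a {sentiment} review with a rating of "
        "{review_rating} stars. They mentioned: '{review_text}'. "
        "This review received {review_likes} upvotes from others."
    ),
}


def render(name, row):
    """Fill the named template with the values of one review row."""
    # str.format ignores keys that a template does not use.
    return TEMPLATES[name].format(**row)


for name in TEMPLATES:
    print(f"{name}: {render(name, row)}")
```

The report format is the longest of the four, which is the token cost mentioned above.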

Now let's convert the previous CSV files into TXT files.

In [1]:
files = [
    "spotify_reviews_cleaned.csv",
    "spotify_reviews_cleaned_filtered.csv",
    "spotify_reviews_cleaned_filtered_balanced.csv",
]

In [6]:
for file in files:
    file_path = os.path.join("../data", file)
    df = pd.read_csv(file_path)
    df = df.sample(frac=1)  # shuffle the dataset (a single pass is enough)
    # convert '2017-03' style values into 'March 2017'
    df['review_month'] = pd.to_datetime(df['review_month'], format='%Y-%m').dt.strftime('%B %Y')

    output_path = file_path.replace(".csv", ".txt")
    # write as UTF-8 so the file matches the encoding used when it is read back later
    with open(output_path, "w", encoding="utf-8") as f:
        for index, data in tqdm(df.iterrows(), total=df.shape[0]):
            combined_sentence = f"As of {data['review_month']}, a user provided a {data['sentiment'].lower()} review with a rating of {data['review_rating']} stars. They mentioned: '{data['review_text']}'. This review received {data['review_likes']} upvotes from others."
            f.write(combined_sentence)
            f.write("\n")

  0%|          | 0/2710400 [00:00<?, ?it/s]

  0%|          | 0/1784947 [00:00<?, ?it/s]

  0%|          | 0/1105624 [00:00<?, ?it/s]
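
The `review_month` conversion in the loop above turns values such as `2017-03` into a month name followed by the year. The same format codes can be checked in isolation with the standard library. This is a small sketch; `month_label` is a hypothetical helper name, and the month name assumes the default English (C) locale.

```python
from datetime import datetime


def month_label(value):
    """Convert a 'YYYY-MM' string into '<Month name> <year>'."""
    # %Y-%m parses the stored value; %B %Y renders full month name plus year.
    return datetime.strptime(value, "%Y-%m").strftime("%B %Y")


print(month_label("2017-03"))
print(month_label("2021-09"))
```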

# Generate and Store using Deep Lake Vector Store

We're going to generate the embeddings and store them in the [Deep Lake Vector Store](https://github.com/activeloopai/deeplake). We're using OpenAI's `text-embedding-ada-002` model to compute the embeddings.

## Text Splitting
For the text splitter, we're going to use `RecursiveCharacterTextSplitter` with a `chunk_size` of 1000. You can experiment with different `chunk_size` values using [chunkviz](https://chunkviz.up.railway.app/).
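
To build intuition for what a `chunk_size` of 1000 means for one-review-per-line text, here is a simplified sketch that greedily packs whole lines into chunks. It is not the `RecursiveCharacterTextSplitter` algorithm: the real splitter works through a list of separators and can fall back to finer ones, so a long review may be split mid-sentence. `pack_lines` and the toy text are hypothetical.

```python
def pack_lines(text, chunk_size=1000):
    """Greedily pack newline-separated lines into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for line in text.split("\n"):
        if not line:
            continue  # skip empty lines
        candidate = line if not current else current + "\n" + line
        if len(candidate) <= chunk_size or not current:
            # The line fits, or it is a single over-long line that we keep whole.
            current = candidate
        else:
            # The line would overflow the chunk: close it and start a new one.
            chunks.append(current)
            current = line
    if current:
        chunks.append(current)
    return chunks


# Toy input: ten short "reviews", one per line.
toy_text = "\n".join(f"review number {i}" for i in range(10))
toy_chunks = pack_lines(toy_text, chunk_size=40)

print(len(toy_chunks))
print(toy_chunks[0])
```

With `chunk_overlap=0`, as in this notebook, no text is shared between neighbouring chunks.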

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open("../data/spotify_reviews_cleaned_filtered_balanced.txt", encoding="utf-8") as f:
    sample_text = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    length_function=len,
)

docs = text_splitter.split_text(sample_text)

In [13]:
print(len(docs))
for doc in docs[:1]:
    print(doc)

349659
As of September 2023, a user provided a negative review with a rating of 1 stars. They mentioned: 'lyrics are only available if you have premium what kinda business is this'. This review received 0 upvotes from others.
As of September 2018, a user provided a negative review with a rating of 1 stars. They mentioned: 'stops playing music randomly.'. This review received 1 upvotes from others.
As of October 2015, a user provided a positive review with a rating of 5 stars. They mentioned: 'i love it but wish it was free'. This review received 0 upvotes from others.
As of October 2019, a user provided a positive review with a rating of 4 stars. They mentioned: 'wonderful opportunity to listen and download hymns, vocal and instrumental. also the possibility to follow the different artists. sportify gives me great listening pleasure, the hymns help strengthen my christian disposition in life. so grateful for sportify, thank you ever so much.'. This review received 0 upvotes from others

## Generating Embeddings

For now we only process a single data source. Feel free to try the other sources (**keep in mind that you'll have to pay for the OpenAI services**).

Because computing embeddings for all ~350K chunks would take over an hour, let's limit the number of embeddings to just 50K, which should finish in ~10 minutes.

In [14]:
from dotenv import load_dotenv
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

load_dotenv()

MAX_DOCS = 50000

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

my_activeloop_dataset_name = "spotify_reviews_cleaned_filtered_balanced_50K_docs_1000_chunk"
dataset_path = f"../deeplake/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding=embeddings, num_workers=4)

# Only compute embeddings when the dataset is empty, so reruns don't pay for them again
if len(db.vectorstore) == 0:
    # add documents to our Deep Lake dataset
    db.add_texts(docs[:MAX_DOCS])



Creating 50000 embeddings in 100 batches of size 500:: 100%|█████████████████████████████████████| 100/100 [10:44<00:00,  6.44s/it]

Dataset(path='../deeplake/spotify_reviews_cleaned_filtered_balanced_50K_docs_1000_chunk', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype        shape       dtype  compression
  -------    -------      -------     -------  ------- 
   text       text      (50000, 1)      str     None   
 metadata     json      (50000, 1)      str     None   
 embedding  embedding  (50000, 1536)  float32   None   
    id        text      (50000, 1)      str     None   
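
The progress bar above gives enough information for a rough estimate of how long the full set of chunks would take. This is back-of-the-envelope arithmetic that assumes throughput stays constant, which rate limits and network conditions may not allow.

```python
# Figures taken from the run above.
embedded_docs = 50_000
elapsed_seconds = 10 * 60 + 44   # the progress bar reports 10:44
total_chunks = 349_659           # number of chunks produced by the splitter

# Observed throughput, then a linear extrapolation to every chunk.
docs_per_second = embedded_docs / elapsed_seconds
full_run_minutes = total_chunks / docs_per_second / 60

print(f"throughput: {docs_per_second:.1f} chunks/s")
print(f"estimated full run: {full_run_minutes:.0f} minutes")
```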





## Test Similarity Search

In [15]:
# retrieve the documents most relevant to a query and print the second hit
# (note: this rebinds `docs`, replacing the list of text chunks from the splitter)
query = "enjoy awesome ui"
docs = db.similarity_search(query)
print(docs[1].page_content)

however has tremendous potential. their playlist are that good, and their ui is fun and delightful. beats music just needs to hurry up before it's too late.'. This review received 4 upvotes from others.
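
Conceptually, a similarity search embeds the query and ranks the stored vectors by how close they are to it. The sketch below illustrates the idea with made-up 3-dimensional vectors and cosine similarity. Both are assumptions for illustration: the stored embeddings here have 1536 dimensions, and the metric the vector store actually uses is not shown in this notebook. `cosine_similarity`, `top_k`, and the toy data are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "vector store": (text, embedding) pairs with invented vectors.
toy_store = [
    ("ui is fun and delightful", [0.9, 0.1, 0.0]),
    ("stops playing music randomly", [0.0, 0.2, 0.9]),
    ("love the playlists", [0.5, 0.8, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]  # stands in for the embedded query

results = top_k(toy_query, toy_store)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Because each stored chunk holds several reviews, a hit can contain unrelated reviews alongside the one that matched the query.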


In [16]:
docs

[Document(page_content="As of January 2016, a user provided a neutral review with a rating of 3 stars. They mentioned: 'ui is such a pain to navigate. download song should be on the ''now playing'' page. also while playing a song from a album, there's no quick view for th rest of the tracks/now playing. good overall, barely uses any battery at all. please improve ui.'. This review received 0 upvotes from others.\nAs of January 2015, a user provided a positive review with a rating of 4 stars. They mentioned: 'love it just wish their hip hop selection was bigger'. This review received 0 upvotes from others.\nAs of April 2020, a user provided a neutral review with a rating of 3 stars. They mentioned: 'i had just watched an ad for 30 minutes of uninterrupted music and right after the song that had just played it gave me 3 more ads right after that. and also i am not getting premium, okay.'. This review received 0 upvotes from others."),
 Document(page_content="however has tremendous potent