<a href="https://colab.research.google.com/github/linhle32/Interactive-Models-with-Widget/blob/main/text_regression_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Regression - Sentiment Analysis

This project performs a regression task on text data. The approach works with any text data that has numeric labels. In this example, we perform sentiment analysis in which the sentiment scores are actual numbers (1, 2, etc.).

I will use a small subset of the Google Play Store Review dataset available at https://www.kaggle.com/datasets/prakharrathi25/google-play-store-reviews.

For modeling, we will use the `sentence_transformers` library to create embedding vectors for each text, then build a simple regression model on top of that.

### Data Loading

In [None]:
!pip install transformers datasets evaluate
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np

data = pd.read_csv('reviews.csv', encoding = "ISO-8859-1")
data.head(3)

Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt,sortOrder,appId
0,gp:AOqpTOEhZuqSqqWnaKRgv-9ABYdajFUB0WugPGh-SG-...,Eric Tie,https://play-lh.googleusercontent.com/a-/AOh14...,I cannot open the app anymore,1,0,5.4.0.6,2020-10-27 21:24:41,,,newest,com.anydo
1,gp:AOqpTOH0WP4IQKBZ2LrdNmFy_YmpPCVrV3diEU9KGm3...,john alpha,https://play-lh.googleusercontent.com/a-/AOh14...,I have been begging for a refund from this app...,1,0,,2020-10-27 14:03:28,"Please note that from checking our records, yo...",2020-10-27 15:05:52,newest,com.anydo
2,gp:AOqpTOEMCkJB8Iq1p-r9dPwnSYadA5BkPWTf32Z1azu...,Sudhakar .S,https://play-lh.googleusercontent.com/a-/AOh14...,Very costly for the premium version (approx In...,1,0,,2020-10-27 08:18:40,,,newest,com.anydo


The data has multiple columns, but we only need `content` for the text body and `score` for the target. We will slice the data to these two columns.

In [None]:
data = data[['content','score']]
data.head(3)

Unnamed: 0,content,score
0,I cannot open the app anymore,1
1,I have been begging for a refund from this app...,1
2,Very costly for the premium version (approx In...,1


Next, we will install `sentence_transformers` and create embeddings. I will use the `all-mpnet-base-v2` model, but there are others that you can try.

The embedding data is stored in the `data_embs` variable.

In [None]:
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
data_embs = data['content'].apply(lambda x: model.encode(x))
data_embs = np.vstack(data_embs)
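
To make the stacking step above concrete, here is a minimal sketch that uses random vectors in place of real embeddings, so it runs without downloading the model. The `fake_encode` function is a hypothetical stand-in for `model.encode`, and the vector size of 768 is the output dimension of `all-mpnet-base-v2`.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_encode(text):
    # Hypothetical stand-in for model.encode: one 768-dimensional vector per text.
    # The values are random; only the shape matters for this illustration.
    return rng.normal(size=768)

texts = ["I cannot open the app anymore", "Great app", "Too expensive"]

per_row = [fake_encode(t) for t in texts]  # list of 1-D arrays, one per review
stacked = np.vstack(per_row)               # 2-D array with one row per review

print(stacked.shape)
```

The result is a single matrix with one row per review, which is the input layout `sklearn` estimators expect. Note that `model.encode` also accepts a list of texts, which may be faster than encoding row by row on larger datasets.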

### Model training

Now we can train a model. As the embedding model is already a deep language model, there is no need for complicated layers at this step. We will just fit a simple Ridge regression model with `sklearn`.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

train_embs, test_embs, train_labels, test_labels = train_test_split(data_embs, data['score'], test_size=0.2)
ridge = Ridge()
ridge.fit(train_embs, train_labels)
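
`Ridge()` uses its default regularization strength (`alpha=1.0`). If you want to tune it, one option is `RidgeCV`, which selects `alpha` by cross-validation. The sketch below is not part of the original notebook: it runs on synthetic data standing in for the embeddings, and the candidate `alphas` are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (embeddings, scores); replace with data_embs and data['score']
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w = rng.normal(size=20)
y = X @ w + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate regularization strengths (assumed values, adjust for your data)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X_train, y_train)

print("selected alpha:", ridge_cv.alpha_)
print("test R^2:", ridge_cv.score(X_test, y_test))
```

Passing `random_state` to `train_test_split` also makes the split, and therefore the reported scores, reproducible between runs.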

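Next, we check the model's performance to verify that there are no issues. For a regressor, `score` returns the coefficient of determination (R²). The training and test scores are close, so there is no sign of serious overfitting.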

In [None]:
ridge.score(train_embs, train_labels)

0.6256970037480142

In [None]:
ridge.score(test_embs, test_labels)

0.5755422160972778
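
The values returned by `ridge.score` are R² scores. As a small worked example, the sketch below computes R² by hand on made-up numbers, along with the mean absolute error, which is often easier to interpret for star ratings. The function names and the sample values are illustrative and not taken from the notebook.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mae_manual(y_true, y_pred):
    # Average absolute distance between prediction and true score
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up scores and predictions, for illustration only
y_true = [1, 2, 3, 4, 5]
y_pred = [1.5, 2.0, 2.5, 4.5, 4.0]

r2 = r2_manual(y_true, y_pred)    # 1 - 1.75 / 10 = 0.825
mae = mae_manual(y_true, y_pred)  # 2.5 / 5 = 0.5
print(r2, mae)
```

An R² of 0 corresponds to always predicting the mean score, and 1 corresponds to perfect predictions.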

### Save the Model

Finally, we save the model. This is a `sklearn` model, so we use `joblib`, and the saved object will be a single file instead of a folder like with `transformers`.

In [None]:
import joblib

# save
joblib.dump(ridge, ".../sentiment_ridge.pkl")

['/content/drive/MyDrive/IT7133/Week 6/sentiment_ridge.pkl']
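
Before relying on the saved file, it can be worth confirming that a reloaded model predicts the same values as the original. The sketch below shows one way to do that. It uses a small synthetic dataset and a temporary directory, so the file name `ridge_demo.pkl` and the data are assumptions for illustration, not the notebook's actual model or path.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Small synthetic problem standing in for the embedding data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = X.sum(axis=1)

ridge_demo = Ridge().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ridge_demo.pkl")  # hypothetical file name
    joblib.dump(ridge_demo, path)                    # writes a single file
    saved_as_file = os.path.isfile(path)
    reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(ridge_demo.predict(X), reloaded.predict(X))
print(saved_as_file, same_predictions)
```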

# Application

Now, let us write a small application that predicts the score of a new review. Everything is reloaded, so this section can be run without the first one.

In [None]:
model_path = '.../sentiment_ridge.pkl'

In [None]:
!pip install sentence_transformers
from google.colab import drive
drive.mount('/content/drive')
from sentence_transformers import SentenceTransformer
import joblib
import numpy as np

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
ridge = joblib.load(model_path)

In [None]:
import ipywidgets as widgets
from IPython.display import display

# Text box for the review to score
text_input = widgets.Textarea(
    value='',
    placeholder='Please type something',
    description='Text:',
    disabled=False,
    layout=widgets.Layout(height="auto", width="auto")
)
button_predict = widgets.Button(description='Classify')
output = widgets.Output()
display(text_input, button_predict, output)

# Send everything printed by the callback to the output widget
@output.capture()
def on_predict_clicked(b):
  output.clear_output()
  embedded = model.encode(text_input.value)
  # predict expects a 2-D input, so wrap the single embedding in a list
  label = ridge.predict([embedded])
  label = np.round(label, 2)[0]
  print('Predicted Sentiment Score: ' + str(label))

button_predict.on_click(on_predict_clicked)

Textarea(value='', description='Text:', layout=Layout(height='auto', width='auto'), placeholder='Please type s…

Button(description='Classify', style=ButtonStyle())

Output()
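
A linear model such as Ridge is not constrained to the range of the training labels, so a prediction can fall below the lowest score or above the highest one. The sketch below shows one possible post-processing step, assuming the scores run from 1 to 5 as Google Play star ratings do. The helper `to_star_rating` is hypothetical and is not part of the application above.

```python
import numpy as np

def to_star_rating(raw_score, low=1, high=5):
    # Clip the raw prediction into the valid range (assumed here to be 1..5)
    clipped = float(np.clip(raw_score, low, high))
    # Round half up to the nearest whole star
    stars = int(np.floor(clipped + 0.5))
    return clipped, stars

print(to_star_rating(5.37))
print(to_star_rating(0.4))
print(to_star_rating(3.6))
```

Whether to show the clipped value, the whole-star value, or the raw prediction depends on how the score is used downstream.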