[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/ml/tensorflow_mongodbcharts_horoscopes.ipynb)

[![View Article](https://img.shields.io/badge/View%20Article-blue)](https://www.mongodb.com/developer/products/mongodb/tensorflow-mongodb-charts/)

# Sentiment Analysis on my scraped horoscopes

Our first step is to run sentiment analysis on our .csv file of scraped horoscopes. Luckily, at this point in the tutorial we don't need to build a model (yet!). We can use a pre-trained model to figure out whether each of our horoscopes from the past six months is positive or negative.

I am following this tutorial from [Medium](https://medium.com/@sharma.tanish096/sentiment-analysis-using-pre-trained-models-and-transformer-28e9b9486641). Please feel free to take a look at it to better understand the code used below.

In [None]:
import numpy as np
import tensorflow as tf
from scipy.special import softmax
from transformers import AutoConfig, AutoTokenizer, TFAutoModelForSequenceClassification

Once everything is imported in, let's choose which pre-trained model we want to use. Since this is a TensorFlow tutorial, we can go ahead and use the ["distilbert-base-uncased-finetuned-sst-2-english"](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english) model since it's compatible with TensorFlow, but there are a ton of options out there to choose from if you would like to switch it up.

In [None]:
# distilbert model we are using
distilbert = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(distilbert)
config = AutoConfig.from_pretrained(distilbert)
model = TFAutoModelForSequenceClassification.from_pretrained(distilbert)


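def sentiment_finder(horoscope):
    # tokenize the text ("inputs" avoids shadowing the built-in input())
    inputs = tokenizer(
        horoscope, padding=True, truncation=True, max_length=512, return_tensors="tf"
    )
    output = model(inputs)
    # raw logits for the single horoscope, converted to probabilities
    scores = output.logits[0].numpy()
    scores = softmax(scores)
    # sort label indices from most to least likely
    ranking = np.argsort(scores)
    ranking = ranking[::-1]
    # return the name of the top-ranked label
    return config.id2label[ranking[0]]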


# test and see if works before we try on our csv file
horoscope = "Things might get a bit confusing for you today, Capricorn. Don't feel like you need to make sense of it all. In fact, this task may be impossible. Just be yourself. Let your creative nature shine through. Other people are quite malleable, and you should feel free to take the lead in just about any situation. Make sure, however, that you consider other people's needs."
sentiment = sentiment_finder(horoscope)
print(f"Horoscope is {sentiment}")

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


Horoscope is NEGATIVE
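
To make the post-processing inside `sentiment_finder` more concrete, here is a minimal sketch of the "softmax, then pick the top label" step in plain Python. The logits and the `id2label` mapping below are made-up stand-ins for `output.logits[0]` and `config.id2label`, so treat this as an illustration and not as the model's real output.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values, not real model output.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [2.0, -1.0]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability
label = id2label[best]
print(label, [round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the scores, the top label is the same whether you rank the logits or the probabilities. The softmax step mainly matters if you also want to report how confident the model is.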


This is great! The pre-trained model classified that Capricorn horoscope as negative, which matches how it reads. Now we need to make some changes, because we don't want to enter every horoscope manually. Instead, we want to run the pre-trained model over every horoscope in our .csv file and add a new "sentiment" column that holds 1 if the horoscope is positive and 0 if it is negative.

In [None]:
# function to apply sentiment against each horoscope
def apply_sentiment(horoscope):
    sentiment = sentiment_finder(horoscope)
    return 1 if sentiment == "POSITIVE" else 0

Now let's load our "anaiya-six-months-horoscopes.csv" file.

In [None]:
# import pandas
!pip install pandas
import pandas as pd



In [None]:
df = pd.read_csv("anaiya-six-months-horoscopes.csv")

In [None]:
# we want to apply each sentiment to our horoscopes with a new column to hold them
df["sentiment"] = df["horoscope"].apply(apply_sentiment)

In [None]:
# save our new dataframe to new csv file
df.to_csv("anaiya-six-months-horoscopes-sentiment.csv")

print("saved to new file called anaiya-six-months-horoscopes-sentiment.csv")

saved to new file called anaiya-six-months-horoscopes-sentiment.csv


In [None]:
df.head()

Unnamed: 0,date,zodiac,horoscope,sentiment
0,20240128,Aries,"Jan 28, 2024 - Drastic shifts in your emotions...",1
1,20240128,Taurus,"Jan 28, 2024 - Today is one of those days in w...",1
2,20240128,Gemini,"Jan 28, 2024 - You're most likely going to be ...",1
3,20240128,Cancer,"Jan 28, 2024 - You may find yourself staring a...",0
4,20240128,Leo,"Jan 28, 2024 - If you find yourself needing to...",1


## Save our new `.csv` file into MongoDB Atlas so we can visualize our data in MongoDB Charts


This part is done using MongoDB Compass and MongoDB Charts.
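
If you would rather load the file from code than through Compass, a rough sketch might look like the one below. It assumes a PyMongo-style collection object that exposes `insert_many`. The `rows_to_documents` and `save_rows` helpers, the sample CSV text, and the `FakeCollection` stand-in are all hypothetical and exist only so the example runs without a database.

```python
import csv
import io


def rows_to_documents(csv_text):
    """Parse CSV text into a list of dicts, casting sentiment to int."""
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["sentiment"] = int(row["sentiment"])
        documents.append(row)
    return documents


def save_rows(collection, documents):
    """Insert all documents in one call and return how many were sent."""
    if documents:  # PyMongo's insert_many rejects an empty list
        collection.insert_many(documents)
    return len(documents)


class FakeCollection:
    """Hypothetical stand-in for a real collection so the sketch runs offline."""

    def __init__(self):
        self.stored = []

    def insert_many(self, documents):
        self.stored.extend(documents)


# Made-up sample rows, not taken from the real file.
sample_csv = (
    "zodiac,horoscope,sentiment\n"
    "Aries,Example text one,1\n"
    "Cancer,Example text two,0\n"
)

docs = rows_to_documents(sample_csv)
fake = FakeCollection()
saved = save_rows(fake, docs)
print(saved, fake.stored[0])
```

With a real connection you would pass an actual collection in place of `FakeCollection()`. Sending the rows in one `insert_many` call avoids a round trip per row.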

# TRAIN AND TEST MODEL WITH TENSORFLOW

Now that our dataset is labeled with sentiment from the pre-trained model, we can train and test our own model on it, so that when we bring in new horoscopes we can predict whether they are negative or positive.

To learn how to do this, I watched this video from freeCodeCamp.org: https://www.youtube.com/watch?v=VtRLrQ3Ev-U, and I used the skeleton code from this TensorFlow tutorial: https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers

Feel free to go through both to get a better understanding of the code used below.

In [None]:
# let's first import tensorflow_hub since we need it for the embedding layer
import tensorflow_hub as hub

In [None]:
# now load in our new .csv file with our sentiment analysis
# but only keep the columns we want, which are "horoscope" and "sentiment"

df = pd.read_csv("anaiya-six-months-horoscopes-sentiment.csv")
df = df[["horoscope", "sentiment"]]

We want to split up our dataset into three sets. We need a training set, a validation set, and a test set.

# BALANCE DATASET
We need to balance our dataset so that our model is trained on the same number of negative and positive horoscopes; otherwise its predictions will be skewed toward the more common class. Note that only the training portion is undersampled below, so the test set keeps its original class mix. Check out these articles for help on how to balance your dataset: https://medium.com/@daniele.santiago/balancing-imbalanced-data-undersampling-and-oversampling-techniques-in-python-7c5378282290 and https://semaphoreci.com/blog/imbalanced-data-machine-learning-python

In [None]:
# shuffle
df = df.sample(frac=1, random_state=42)

In [None]:
# first split variables into x and y
X = df["horoscope"]
y = df["sentiment"]

In [None]:
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

# now do for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, shuffle=True, test_size=0.15, random_state=42
)

# create our RandomUnderSampler object
rus = RandomUnderSampler(random_state=42, sampling_strategy="majority")

# apply our RUS technique
X_resampled, y_resampled = rus.fit_resample(X_train.to_frame(), y_train)

# now convert it back to our dataframe
balanced_trained = pd.DataFrame(
    {"horoscope": X_resampled["horoscope"], "sentiment": y_resampled}
)

In [None]:
# print it out and make sure you have the exact same number of negative and positive
# horoscopes for our data.
sentiment_amount_training = balanced_trained["sentiment"].value_counts()
print(sentiment_amount_training)

sentiment
0    495
1    495
Name: count, dtype: int64
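
To show what random undersampling does conceptually, here is a small plain-Python sketch. It is not how `RandomUnderSampler` is implemented internally. The `undersample_majority` helper and the toy rows are hypothetical, and the idea is simply to keep a random subset of the larger class that matches the size of the smaller one.

```python
import random
from collections import Counter


def undersample_majority(rows, label_key, seed=42):
    """Randomly drop rows from larger classes until every class has the same size."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))  # sample without replacement
    rng.shuffle(balanced)  # avoid leaving the rows grouped by class
    return balanced


# Hypothetical toy data: 7 positive rows and 3 negative rows.
toy_rows = [{"horoscope": f"text {i}", "sentiment": 1} for i in range(7)]
toy_rows += [{"horoscope": f"text {i}", "sentiment": 0} for i in range(7, 10)]

balanced_rows = undersample_majority(toy_rows, "sentiment")
counts = Counter(row["sentiment"] for row in balanced_rows)
print(counts)
```

The trade-off is that undersampling throws data away: in this toy case four of the seven positive rows are discarded to match the three negative ones.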


# SPLIT UP OUR DATASET

In [None]:
# split our balanced dataset into our training and validation
train, val = train_test_split(
    balanced_trained,
    test_size=0.2,
    stratify=balanced_trained["sentiment"],
    random_state=42,
)

In [None]:
# view the sizes of each set
print("Training set:", len(train))
print("Validation set:", len(val))
print("Test set:", len(X_test))

Training set: 792
Validation set: 198
Test set: 330
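
As a quick sanity check on the training and validation sizes, the arithmetic can be reproduced by hand. This sketch assumes the held-out count is the requested fraction rounded up, which is my understanding of how `train_test_split` behaves. The `split_sizes` helper is hypothetical.

```python
import math


def split_sizes(n_samples, held_out_fraction):
    """Return (kept, held_out), rounding the held-out part up."""
    held_out = math.ceil(held_out_fraction * n_samples)
    return n_samples - held_out, held_out


balanced_total = 495 + 495  # rows per class after undersampling
train_size, val_size = split_sizes(balanced_total, 0.2)
print(train_size, val_size)
```

These numbers match the training and validation sizes printed above.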


In [None]:
# combine back our X_test and y_test to a df
test = pd.DataFrame({"horoscope": X_test, "sentiment": y_test})

# CONVERT TO TENSORFLOW DATASET

Now, let's convert our dataframes to TensorFlow datasets. We use this code from the documentation: https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers

It converts each of the train, validation, and test dataframes into a TensorFlow dataset, and it shuffles and batches the data for you.

In [None]:
# this is the code from the tutorial
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    df = dataframe.copy()
    labels = df.pop("target")
    df = {key: value.values[:, tf.newaxis] for key, value in dataframe.items()}
    ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(batch_size)
    return ds

In [None]:
# code changed to meet my specific needs
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    df = dataframe.copy()
    labels = df.pop("sentiment")
    df = df["horoscope"]
    ds = tf.data.Dataset.from_tensor_slices((df, labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(tf.data.AUTOTUNE)
    return ds

In [None]:
train_data = df_to_dataset(train)
val_data = df_to_dataset(val)
test_data = df_to_dataset(test)

In [None]:
list(train_data)[0]

(<tf.Tensor: shape=(32,), dtype=string, numpy=
 array([b'Feb 13, 2024 - Machines involved in financial transactions, such as ATMs, phone systems, or banking websites could malfunction today, Gemini, so you might have to resort to dealing with money in the old fashioned way: by going into the bank or writing checks. Electrical storms or solar flares could be interfering with satellite signals, so there isn\xe2\x80\x99t much you can do. Needless to say, this isn\xe2\x80\x99t a good day to make any major financial transactions.',
        b"Feb 26, 2024 - There could be a missing person very much on your mind these days. Is it possible that the relationship is over and you're the last one to know? Don't let your insecurities get the better of you, Sagittarius. It's likely that your friend merely needs some time alone to sort out some big life issues. He or she will seek out your warmth and friendship again soon.",
        b'Jun 5, 2024 - It\xe2\x80\x99s time to get in touch with the people

Now we want to set up our embedding layer and build our model.


# EMBEDDING LAYER

In [None]:
# using the embedding layer from tensorflow hub
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, dtype=tf.string, trainable=True)

# MODEL

Now, we need to build out our neural network model.
We are stacking layers with the Sequential model, since it adds layers one by one and is the easiest model to understand and visualize. We are also using Dropout layers, which are a common way to reduce overfitting. The dropout rates are 0.4, 0.3, and 0.2, so during training 40%, 30%, and 20% of the outputs of the preceding layer are randomly set to zero, which discourages the model from relying too heavily on any single neuron.

In [None]:
# model
model = tf.keras.Sequential()  # since layer by layer so sequential. most basic form
model.add(hub_layer)
model.add(tf.keras.layers.Dense(128, activation="relu"))  # first neural network layer
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(64, activation="relu"))  # second layer
model.add(tf.keras.layers.Dropout(0.3))  # another dropout layer
model.add(tf.keras.layers.Dense(32, activation="relu"))  # third layer
model.add(tf.keras.layers.Dropout(0.2))  # another dropout layer
# sigmoid is used for binary, so great for sentiment analysis
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))  # output layer
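
To illustrate what a Dropout layer does to its inputs, here is a small plain-Python sketch of "inverted" dropout. This is a simplification and not the Keras implementation. The `dropout` helper and the activations are hypothetical. The key points are that values are only dropped during training, and that the surviving values are scaled up by `1 / (1 - rate)` so their expected sum stays the same.

```python
import random


def dropout(values, rate, training, rng):
    """Zero each value with probability `rate` and rescale the survivors."""
    if not training or rate == 0.0:
        return list(values)  # at inference time values pass through unchanged
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)
activations = [1.0] * 10  # hypothetical outputs of a Dense layer

train_out = dropout(activations, rate=0.4, training=True, rng=rng)
infer_out = dropout(activations, rate=0.4, training=False, rng=rng)
print(train_out)
print(infer_out)
```

This is why calling `model.predict` or `model.evaluate` gives deterministic results: the dropout layers are inactive outside of training.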

Now we want to compile our model

In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=["accuracy"],
)

Let's now train our model on our training data

In [None]:
history = model.fit(train_data, epochs=5, validation_data=val_data)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In the training output, the loss may plateau in places but decreases overall. The val_accuracy also increases, which suggests that our model is learning from our dataset.

Now, evaluate our model on our test dataset

In [None]:
loss, accuracy = model.evaluate(test_data)
print(f"Loss: {loss}")
print(f"Accuracy: {accuracy}")

Loss: 0.5995222926139832
Accuracy: 0.6969696879386902
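
To put the loss value in context, here is a minimal sketch of binary cross-entropy in plain Python. The `binary_cross_entropy` helper and the sample labels and probabilities are hypothetical. A model that always predicts 0.5 scores ln 2, roughly 0.693, so a test loss of about 0.60 is only modestly better than that baseline.

```python
import math


def binary_cross_entropy(labels, probabilities, eps=1e-7):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


labels = [1, 0, 1, 0]  # made-up ground truth
coin_flip = binary_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])
confident = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])
print(coin_flip, confident)
```

The further the predicted probabilities move toward the correct labels, the lower the loss gets.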


# NEW HOROSCOPE PREDICTION

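https://www.tensorflow.org/api_docs/python/tf/squeeze

`model.predict` returns an array with one row and one column for a single horoscope. `tf.squeeze` removes those size-1 dimensions, which leaves us with the probability as a single number.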

In [None]:
def predict_sentiment(horoscope):
    # wrap the text in a tensor so TensorFlow can process it
    encoded_input = tf.constant([horoscope])

    prediction = model.predict(encoded_input)

    # get the probability out of the prediction
    probability = tf.squeeze(prediction).numpy()
    print(f"model probability: {probability}")

    # set it so that we can see if it's negative or positive
    sentiment = 1 if probability > 0.5 else 0

    return sentiment


# daily horoscope
positive_horoscope = "You're incredibly productive, with good business sense, Libra."
negative_horoscope = "This isn't the most cheerful time, Leo, because important issues are rearing their heads again and forcing you to address them."
pos_sentiment = predict_sentiment(positive_horoscope)
neg_sentiment = predict_sentiment(negative_horoscope)

print(f"This should be positive: {pos_sentiment}")
print(f"This should be negative: {neg_sentiment}")

model probability: 0.615037739276886
model probability: 0.43571737408638
This should be positive: 1
This should be negative: 0


## Let's see how our week will be going forward

In [None]:
file = "new-week-horoscopes2.csv"
df = pd.read_csv(file)

df["sentiment"] = df["horoscope"].apply(predict_sentiment)

for index, row in df.iterrows():
    zodiac = row["zodiac"]
    horoscope = row["horoscope"]
    sentiment = row["sentiment"]
    print(f"{zodiac} horoscope is {sentiment}")

model probability: 0.6883848905563354
model probability: 0.5126816034317017
model probability: 0.5237581729888916
model probability: 0.3824429214000702
model probability: 0.6003984212875366
model probability: 0.6250004172325134
model probability: 0.6173655390739441
model probability: 0.6848816275596619
model probability: 0.6913534998893738
model probability: 0.6552363634109497
model probability: 0.5873440504074097
model probability: 0.395006000995636
Aries horoscope is 1
Taurus horoscope is 1
Gemini horoscope is 1
Cancer horoscope is 0
Leo horoscope is 1
Virgo horoscope is 1
Libra horoscope is 1
Scorpio horoscope is 1
Sagittarius horoscope is 1
Capricorn horoscope is 1
Aquarius horoscope is 1
Pisces horoscope is 0
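
Several of the probabilities above sit close to the 0.5 cutoff, so it can be useful to flag predictions the model is not very sure about. The sketch below reuses the probabilities printed above, in zodiac order. The `label_with_margin` helper and the 0.05 margin are hypothetical choices for illustration and not part of the original workflow.

```python
# Probabilities copied from the output above, in zodiac order.
probabilities = {
    "Aries": 0.6883848905563354,
    "Taurus": 0.5126816034317017,
    "Gemini": 0.5237581729888916,
    "Cancer": 0.3824429214000702,
    "Leo": 0.6003984212875366,
    "Virgo": 0.6250004172325134,
    "Libra": 0.6173655390739441,
    "Scorpio": 0.6848816275596619,
    "Sagittarius": 0.6913534998893738,
    "Capricorn": 0.6552363634109497,
    "Aquarius": 0.5873440504074097,
    "Pisces": 0.395006000995636,
}


def label_with_margin(probability, threshold=0.5, margin=0.05):
    """Return 'uncertain' near the threshold, otherwise 1 or 0."""
    if abs(probability - threshold) < margin:
        return "uncertain"
    return 1 if probability > threshold else 0


labels = {sign: label_with_margin(p) for sign, p in probabilities.items()}
uncertain = [sign for sign, value in labels.items() if value == "uncertain"]
print(uncertain)
```

With this margin, Taurus and Gemini are flagged as uncertain, while the remaining signs keep the same 1 or 0 label as before.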


## Let's save these back into MongoDB Atlas so we can visualize them in Charts

In [None]:
pip install pymongo

Successfully installed dnspython-2.6.1 pymongo-4.8.0


In [None]:
# first connect to MongoDB Atlas
import getpass

from pymongo import MongoClient

# set up your MongoDB connection
connection_string = getpass.getpass(
    prompt="Enter connection string WITH USER + PASS here"
)
client = MongoClient(
    connection_string, appname="devrel.showcase.tensorflow_mongodbcharts"
)

# we are creating a new collection in the same database as before
database = client["horoscopes"]
collection = database["new_week_horoscope"]

for index, row in df.iterrows():
    # build one document per horoscope
    # (named "document" so we don't shadow Python's built-in dict)
    document = {
        "zodiac": row["zodiac"],
        "horoscope": row["horoscope"],
        "sentiment": int(row["sentiment"]),  # store as a plain Python int
    }

    collection.insert_one(document)


print("saved in! go check")

Enter connection string WITH USER + PASS here··········
saved in! go check
