# UCI Sentiment Analysis - Custom Keras Model

This notebook trains a custom sentiment analysis model on the UCI Sentiment Labelled Sentences dataset. The dataset contains sentences from Yelp, Amazon, and IMDb reviews, each labelled as positive (1) or negative (0).

## 1. Download the dataset and unzip it in Google Colab

In [None]:
# download dataset from the UCI website
!curl -o uci-labelled-sentences.zip https://archive.ics.uci.edu/static/public/331/sentiment+labelled+sentences.zip

# unzip dataset in Colab
!unzip uci-labelled-sentences.zip

## 2. Import Keras and other libraries

In [None]:
import pandas as pd
import pickle
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.callbacks import EarlyStopping

## 3. Load the datasets

In [None]:
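df_list = []

# Each file is tab-separated: one sentence and its label per line.
# quoting=3 (csv.QUOTE_NONE) stops pandas from treating double quotes inside a
# sentence as field delimiters, which could otherwise merge several rows into one.

# Yelp
df_yelp = pd.read_csv('sentiment labelled sentences/yelp_labelled.txt', names=['sentence', 'label'], sep='\t', quoting=3)
df_yelp['source'] = 'yelp'
df_list.append(df_yelp)

# Amazon
df_amazon = pd.read_csv('sentiment labelled sentences/amazon_cells_labelled.txt', names=['sentence', 'label'], sep='\t', quoting=3)
df_amazon['source'] = 'amazon'
df_list.append(df_amazon)

# IMDB
df_imdb = pd.read_csv('sentiment labelled sentences/imdb_labelled.txt', names=['sentence', 'label'], sep='\t', quoting=3)
df_imdb['source'] = 'imdb'
df_list.append(df_imdb)

# Concatenate the dataframes, renumbering the rows so the index is unique
df = pd.concat(df_list, ignore_index=True)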

df.head()

In [None]:
# Display dataset information
print(f"Total number of sentences: {len(df)}")
print(f"Number of positive labels: {len(df[df['label'] == 1])}")
print(f"Number of negative labels: {len(df[df['label'] == 0])}")
print(f"Sources: {df['source'].unique()}")

## 4. Tokenize the sentences

In [None]:
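# Keep only the most frequent words; rarer words are dropped from the sequences
max_features = 2000
tokenizer = Tokenizer(num_words=max_features, split=' ')
tokenizer.fit_on_texts(df['sentence'].values)

# Convert each sentence to a list of word indices, then left-pad with zeros
# so that every row has the length of the longest sentence
X = tokenizer.texts_to_sequences(df['sentence'].values)
X = pad_sequences(X)
y = df['label'].values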

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
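
The tokenization step above can be illustrated without Keras. The following is a minimal pure-Python sketch of what `Tokenizer(num_words=...)` followed by `pad_sequences` does with default settings, as far as the behavior is understood here: words are ranked by frequency starting at 1, words ranked `num_words` or higher are dropped, and shorter rows are left-padded with zeros. The helper names (`split_words`, `fit_word_index`, `to_sequences`, `pad_pre`) and the two example sentences are made up for illustration and are not part of the Keras API.

```python
from collections import Counter

# Characters replaced by spaces before splitting. This mirrors the Tokenizer's
# default filter set as an assumption; note that apostrophes are kept.
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    """Lowercase, turn punctuation into spaces, and split on whitespace."""
    cleaned = "".join(" " if ch in PUNCTUATION else ch for ch in text.lower())
    return cleaned.split()


def fit_word_index(texts):
    """Map each word to a rank: 1 for the most frequent word, 2 for the next."""
    counts = Counter(word for text in texts for word in split_words(text))
    ranked = counts.most_common()  # ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index, num_words):
    """Encode texts as lists of ranks, dropping words ranked num_words or higher."""
    return [
        [word_index[w] for w in split_words(t)
         if w in word_index and word_index[w] < num_words]
        for t in texts
    ]


def pad_pre(sequences, maxlen=None):
    """Left-pad with zeros (and left-truncate) so every row has maxlen entries."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:] if maxlen > 0 else []
        padded.append([0] * (maxlen - len(kept)) + kept)
    return padded


# Hypothetical example sentences, not taken from the dataset
texts = ["Good food, good service", "Bad food"]
word_index = fit_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=4)
padded = pad_pre(sequences)

print(word_index)  # {'good': 1, 'food': 2, 'service': 3, 'bad': 4}
print(padded)      # [[1, 2, 1, 3], [0, 0, 0, 2]]
```

With `num_words=4`, only ranks 1 to 3 survive, so "bad" (rank 4) disappears from the second sentence. The same thing happens in the notebook to every word outside the top 1,999.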

## 5. Split the dataset

In [None]:
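# Hold out 12% of the sentences for testing; random_state makes the split
# reproducible and stratify keeps the positive/negative ratio equal in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.12, random_state=42, stratify=y)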

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

## 6. Define the model

In [None]:
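def create_model():
    model = Sequential()
    # Map each word index to a 64-dimensional vector learned during training
    model.add(Embedding(max_features, 64, input_length=X.shape[1]))
    # Read the sentence word by word and summarize it in 16 values
    model.add(LSTM(16))
    # Output a single probability: close to 1 is positive, close to 0 is negative
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model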

model = create_model()
model.summary()
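
The parameter counts reported by `model.summary()` can be checked by hand. The sketch below recomputes them from the layer sizes used above (2,000-word vocabulary, 64-dimensional embedding, 16 LSTM units, 1 output unit), assuming the Keras defaults of one bias vector per LSTM gate and a bias in the dense layer. The helper function names are made up for illustration.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """A weight per input-output pair plus one bias per output unit."""
    return input_dim * units + units


max_features, embed_dim, lstm_units = 2000, 64, 16

counts = {
    "embedding": embedding_params(max_features, embed_dim),
    "lstm": lstm_params(embed_dim, lstm_units),
    "dense": dense_params(lstm_units, 1),
}
total = sum(counts.values())

for layer, n in counts.items():
    print(f"{layer}: {n}")
print(f"total: {total}")
```

The embedding layer accounts for 128,000 of the 133,201 parameters, so the vocabulary size and embedding width dominate the model size.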

## 7. Train the model

In [None]:
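# Note: the test set doubles as validation data here, so early stopping sees it
# and the test accuracy reported below is an optimistic estimate.
history = model.fit(X_train, y_train,
                    epochs=6,
                    batch_size=16,
                    validation_data=(X_test, y_test),
                    callbacks=[EarlyStopping(monitor='val_accuracy',
                                             min_delta=0.001,
                                             patience=2,
                                             verbose=1)])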
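
The early-stopping rule configured above can be traced by hand. The sketch below is a simplified re-implementation of the stopping logic for a metric that should increase, written under the assumption that an epoch counts as an improvement only when it beats the best value so far by more than `min_delta`. The function name `early_stopping_epoch` and the accuracy values are invented for illustration. They are not results from this notebook.

```python
def early_stopping_epoch(val_accuracies, min_delta=0.001, patience=2):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0  # consecutive epochs without a sufficient improvement
    for epoch, current in enumerate(val_accuracies):
        if current - min_delta > best:
            best = current
            wait = 0
            continue
        wait += 1
        if wait >= patience and epoch > 0:
            return epoch
    return None


# Made-up validation accuracies for six epochs
history_example = [0.70, 0.78, 0.7805, 0.779, 0.80, 0.81]
stop_epoch = early_stopping_epoch(history_example)
print(stop_epoch)  # 3, i.e. training ends after the fourth epoch
```

In this example the third epoch gains only 0.0005 over the second, which is below `min_delta`, and the fourth epoch is worse. With `patience=2`, training stops there and never reaches the better fifth and sixth epochs, which is the trade-off of a short patience.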

In [None]:
# Evaluate the model on test set
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
print(f"Test loss: {test_loss:.4f}")

## 8. Test the model with sample sentences

In [None]:
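def predict_sentiment(text):
    """Return the model's positive-sentiment probability (0 to 1) for one sentence."""
    # Preprocess the text the same way as the training data
    sequences = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(sequences, maxlen=X.shape[1])

    # The model returns an array of shape (1, 1); extract the single score
    prediction = model.predict(padded, verbose=0)
    return float(prediction[0][0])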

# Test with sample sentences
test_sentences = [
    "This product is amazing! I love it!",
    "This is the worst thing I've ever bought.",
    "The food was okay, nothing special.",
    "Excellent service and great quality!",
    "Terrible experience, would not recommend."
]

for sentence in test_sentences:
    score = predict_sentiment(sentence)
    sentiment = "Positive" if score > 0.5 else "Negative"
    print(f"'{sentence}' -> Score: {score:.4f} ({sentiment})")

## 9. Save the model and the tokenizer

In [None]:
# Save the trained model
model.save("uci_sentimentanalysis.h5")

# Save the tokenizer
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.DEFAULT_PROTOCOL)

print("Model and tokenizer saved successfully!")
print("Files created:")
print("- uci_sentimentanalysis.h5")
print("- tokenizer.pickle")

## 10. Download the files

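After running all the cells above, download `uci_sentimentanalysis.h5` and `tokenizer.pickle` to your computer (in Colab, use the Files panel in the left sidebar). The Flask application loads both files, and it must pad its input to the same sequence length used here (`X.shape[1]`).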
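
The two saved files could later be loaded along the following lines. This is only a sketch, not the Flask application's actual code: the function names `load_artifacts` and `score_sentence` are hypothetical, and `maxlen` is assumed to be the padded sequence length from this notebook, which is not stored in either file and has to be noted down separately.

```python
import pickle


def load_artifacts(model_path="uci_sentimentanalysis.h5",
                   tokenizer_path="tokenizer.pickle"):
    """Load the saved Keras model and the pickled tokenizer from disk."""
    # Imported inside the function so TensorFlow is only loaded when needed
    from tensorflow.keras.models import load_model

    model = load_model(model_path)
    with open(tokenizer_path, "rb") as handle:
        tokenizer = pickle.load(handle)
    return model, tokenizer


def score_sentence(text, model, tokenizer, maxlen):
    """Return the positive-sentiment probability for a single sentence."""
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    sequences = tokenizer.texts_to_sequences([text])
    # maxlen must match the padded length used when the model was trained
    padded = pad_sequences(sequences, maxlen=maxlen)
    return float(model.predict(padded, verbose=0)[0][0])


# Example usage (requires the two downloaded files and TensorFlow):
# model, tokenizer = load_artifacts()
# score = score_sentence("Great phone!", model, tokenizer, maxlen=MAXLEN)
```

Unpickling the tokenizer requires that the environment has a TensorFlow/Keras installation compatible with the one used in Colab, so it is worth recording the version printed by `tf.__version__` alongside the files.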