Tuazon, Francesca Marie A.
(BCS34)

In [1]:
!pip install nltk spacy



In [2]:
import nltk
import spacy
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from spacy.lang.en import English

# Download NLTK resources
nltk.download('punkt')  # Punkt tokenizer models, which NLTK's sentence and word tokenizers rely on
nltk.download('stopwords')  # lists of common words (is, and, the) that are usually filtered out

# Create a blank spaCy English pipeline (tokenizer only, no trained components)
nlp = English()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# CountVectorizer converts a collection of text documents into a matrix of token counts:
# it splits each text into individual words (tokens) and counts how often each one appears
from sklearn.feature_extraction.text import CountVectorizer
# MultinomialNB is a Naive Bayes classifier for multinomially distributed data such as
# word counts or frequencies, and is often used for text classification
from sklearn.naive_bayes import MultinomialNB
# make_pipeline creates a pipeline that chains several processing steps into a single object
from sklearn.pipeline import make_pipeline

# Sample Data
texts = [
    "The movie was fantastic, I loved every moment of it",
    "The food was terrible, I would never eat there again",
    "I had a great time at the concert",
    "The service at the restaurant was horrible",
    "I really enjoyed the book",
    "The hotel room was dirty and uncomfortable",
    "I am very satisfied with my purchase",
    "The delivery was late and the package was damaged",
    "The customer support was very helpful",
    "I am disappointed with the quality of the product"
]
labels = ["Positive", "Negative", "Positive", "Negative", "Positive",
          "Negative", "Positive", "Negative", "Positive", "Negative"]

# Create a pipeline for the classification
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Accept user input
user_input = input("Enter a sentence for sentiment analysis: ")

# Predict the sentiment of the user input
prediction = model.predict([user_input])
print("Predicted sentiment: ", prediction)

Enter a sentence for sentiment analysis: the movie was fantastic
Predicted sentiment:  ['Positive']


In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd  # used to tabulate the evaluation metrics


# Create CountVectorizer and transform the data
count_vectorizer = CountVectorizer(stop_words="english")
x_count = count_vectorizer.fit_transform(texts)

# Split data for CountVectorizer
x_train_count, x_test_count, y_train_count, y_test_count = train_test_split(
    x_count, labels, test_size=0.2, random_state=42
)

# Train and predict using Logistic Regression with CountVectorizer features
model_count = LogisticRegression(class_weight='balanced', max_iter=1000)
model_count.fit(x_train_count, y_train_count)
y_pred_count = model_count.predict(x_test_count)

# Calculate evaluation metrics for CountVectorizer
accuracy_count = accuracy_score(y_test_count, y_pred_count)
precision_count = precision_score(y_test_count, y_pred_count, pos_label='Positive')
recall_count = recall_score(y_test_count, y_pred_count, pos_label='Positive')
f1_count = f1_score(y_test_count, y_pred_count, pos_label='Positive')

# Create DataFrame for CountVectorizer evaluation
count_eval_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score'],
    'Score': [accuracy_count, precision_count, recall_count, f1_count]
})

print("CountVectorizer Evaluation:")
print(count_eval_df)

CountVectorizer Evaluation:
      Metric     Score
0   Accuracy  0.500000
1  Precision  0.500000
2     Recall  1.000000
3   F1-Score  0.666667


In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd  # used to tabulate the evaluation metrics


# Create CountVectorizer and transform the data
count_vectorizer = CountVectorizer(stop_words="english")
x_count = count_vectorizer.fit_transform(texts)

# Split data for CountVectorizer
x_train_count, x_test_count, y_train_count, y_test_count = train_test_split(
    x_count, labels, test_size=0.2, random_state=42
)

# Train and predict using Logistic Regression with CountVectorizer features
model_count = LogisticRegression(class_weight='balanced', max_iter=1000)
model_count.fit(x_train_count, y_train_count)
y_pred_count = model_count.predict(x_test_count)

# Calculate evaluation metrics for CountVectorizer
accuracy_count = accuracy_score(y_test_count, y_pred_count)
precision_count = precision_score(y_test_count, y_pred_count, pos_label='Positive')
recall_count = recall_score(y_test_count, y_pred_count, pos_label='Positive')
f1_count = f1_score(y_test_count, y_pred_count, pos_label='Positive')

# Generate a per-class classification report
from sklearn.metrics import classification_report
report_count = classification_report(y_test_count, y_pred_count)

# Create DataFrame for CountVectorizer evaluation
count_eval_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score'],
    'Score': [accuracy_count, precision_count, recall_count, f1_count]
})

print("CountVectorizer Evaluation:")
print(count_eval_df)
print("\nClassification Report:")
print(report_count)

CountVectorizer Evaluation:
      Metric     Score
0   Accuracy  0.500000
1  Precision  0.500000
2     Recall  1.000000
3   F1-Score  0.666667

Classification Report:
              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00         1
    Positive       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2





# Enhanced Model

In [6]:
import nltk
import spacy
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from spacy.lang.en import English
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Download 'wordnet' data before using WordNetLemmatizer
nltk.download('wordnet')

# Function to preprocess text
def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text.lower())

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return ' '.join(tokens)  # Join tokens back into a string

# Create a pipeline with preprocessing
model = make_pipeline(
    TfidfVectorizer(preprocessor=preprocess_text),
    MultinomialNB()
)
model.fit(texts, labels)

# Accept user input
user_input = input("Enter a sentence for sentiment analysis: ")

# Predict the sentiment. The pipeline's TfidfVectorizer already applies
# preprocess_text, so the raw input is passed in directly rather than
# being preprocessed a second time
prediction = model.predict([user_input])
print("Predicted sentiment: ", prediction)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Enter a sentence for sentiment analysis: the movie was fantastic
Predicted sentiment:  ['Positive']


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Uses 'texts' and 'labels' from In [3] and 'pd' from In [4]

# TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words="english")
x = vectorizer.fit_transform(texts)

# Assign labels
y = labels

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(x_train, y_train)

# Predict on test data
y_pred = model.predict(x_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label='Positive')
recall = recall_score(y_test, y_pred, pos_label='Positive')
f1 = f1_score(y_test, y_pred, pos_label='Positive')

# Generate a per-class classification report
from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred)

# Create a DataFrame for evaluation metrics
eval_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score'],
    'Score': [accuracy, precision, recall, f1]
})
print("TfidfVectorizer Evaluation:")
print(eval_df)
print("\nClassification Report:")
print(report)

TfidfVectorizer Evaluation:
      Metric  Score
0   Accuracy    0.5
1  Precision    0.0
2     Recall    0.0
3   F1-Score    0.0

Classification Report:
              precision    recall  f1-score   support

    Negative       0.50      1.00      0.67         1
    Positive       0.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2





# Discussion of Results

The basic CountVectorizer with MultinomialNB pipeline provides a simple baseline. Swapping in Logistic Regression gives a discriminative classifier that can weight individual words more flexibly, and the enhanced model adds stop-word removal, lemmatization, and TF-IDF weighting to the feature extraction step. Both interactive pipelines classified the sample input "the movie was fantastic" as Positive.
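To make the difference between raw counts and TF-IDF weighting concrete, the following is a minimal sketch of the inverse document frequency term, assuming the smoothed form `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as its default. The helper names `tokenize` and `smooth_idf` are illustrative, not part of any library, and stop words are deliberately kept here (unlike in the cells above) so that the contrast between a common and a rare word is visible.

```python
import math
import re

# Three of the sample sentences from the notebook
docs = [
    "The movie was fantastic, I loved every moment of it",
    "The food was terrible, I would never eat there again",
    "I had a great time at the concert",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Document frequency: in how many documents does each term occur?
n_docs = len(docs)
doc_freq = {}
for doc in docs:
    for term in set(tokenize(doc)):
        doc_freq[term] = doc_freq.get(term, 0) + 1

def smooth_idf(term):
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

for term in ("the", "was", "fantastic"):
    print(f"{term!r}: df={doc_freq[term]}, idf={smooth_idf(term):.3f}")
# 'the': df=3, idf=1.000
# 'was': df=2, idf=1.288
# 'fantastic': df=1, idf=1.693
```

A count-based representation would treat one occurrence of "the" and one occurrence of "fantastic" identically, whereas the IDF factor scales the rarer word up by roughly 1.7 times.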

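While combining TF-IDF with Logistic Regression often yields robust sentiment predictions, in this run it did not outperform CountVectorizer. Both held-out evaluations reached an accuracy of 0.50 on a test set of only two sentences: the CountVectorizer model labeled both test sentences Positive (precision 0.50, recall 1.00, F1 0.67 for the Positive class), while the TF-IDF model labeled both Negative (precision, recall, and F1 of 0.00 for the Positive class). With ten sentences in total and two in the test split, this gap is more likely an effect of the tiny sample than a real weakness of TF-IDF, whose emphasis on distinctive words has little to work with when most content words appear in only one sentence. Each approach still carries trade-offs, such as the simplicity of CountVectorizer with MultinomialNB versus the greater risk of overfitting with Logistic Regression on so few examples.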
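Because a single 80/20 split of ten sentences leaves only two for testing, one possible way to get a less noisy comparison is stratified k-fold cross-validation. The sketch below is an assumption about how that could look, not part of the original experiment: the `candidates` dictionary, the choice of five folds, and the `random_state` are all illustrative. Placing the vectorizer inside each pipeline also means the vocabulary is learned from the training folds only, which the cells above do not do.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "The movie was fantastic, I loved every moment of it",
    "The food was terrible, I would never eat there again",
    "I had a great time at the concert",
    "The service at the restaurant was horrible",
    "I really enjoyed the book",
    "The hotel room was dirty and uncomfortable",
    "I am very satisfied with my purchase",
    "The delivery was late and the package was damaged",
    "The customer support was very helpful",
    "I am disappointed with the quality of the product",
]
labels = ["Positive", "Negative"] * 5

# Hypothetical set of pipelines mirroring the three setups in the notebook
candidates = {
    "CountVectorizer + MultinomialNB": make_pipeline(
        CountVectorizer(), MultinomialNB()
    ),
    "CountVectorizer + LogisticRegression": make_pipeline(
        CountVectorizer(stop_words="english"),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "TfidfVectorizer + LogisticRegression": make_pipeline(
        TfidfVectorizer(stop_words="english"),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
}

# Five stratified folds: each test fold holds one Positive and one Negative sentence
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, pipeline in candidates.items():
    scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.2f} over {len(scores)} folds")
```

Even with cross-validation, ten sentences remain a very small sample, so the averaged scores should be read as a rough indication only.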

Model evaluation relied on classification reports and on accuracy, precision, recall, and F1-score. Comparing these metrics across models is how the best-performing approach would normally be identified, and the classification reports break the scores down by sentiment class. Here, however, each report rests on one Positive and one Negative test sentence, so the results illustrate the workflow rather than establish which technique is better. Overall, the analysis shows the trade-offs and considerations involved in selecting techniques for sentiment analysis.
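The reported scores can be checked by hand. The sketch below recomputes precision, recall, and F1 from true and predicted labels; `binary_metrics` is an illustrative helper, and the label lists are an assumption reconstructed from the CountVectorizer classification report (one sentence per class in the test set, both predicted Positive). Undefined ratios are set to 0.0, which matches the 0.00 values shown in the report for the class that was never predicted.

```python
def binary_metrics(y_true, y_pred, pos_label):
    """Return (precision, recall, f1) for one class, using 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Assumed reconstruction of the two-sentence test split (order is arbitrary)
y_true = ["Negative", "Positive"]
y_pred = ["Positive", "Positive"]

positive_scores = binary_metrics(y_true, y_pred, "Positive")
negative_scores = binary_metrics(y_true, y_pred, "Negative")

print("Positive:", [round(s, 2) for s in positive_scores])  # [0.5, 1.0, 0.67]
print("Negative:", [round(s, 2) for s in negative_scores])  # [0.0, 0.0, 0.0]
```

For the Negative class the precision denominator is zero because no sentence was predicted Negative, which is the situation scikit-learn flags with an undefined-metric warning.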

