# Sentiment Classifier
In this notebook we develop a sentiment classifier using multiple machine learning techniques and identify the best model for our sample data from The Hill news articles.

### Reading the newly created CSV file with sentiment labels

In [8]:
import pandas as pd

In [13]:
df=pd.read_csv('sample_sentences_final.csv')

In [19]:
# Rename column
df = df.rename(columns={'classification (positive, neutral, negative)': 'sentiment'})


### Mostly "Neutral" Sentiments: an Imbalanced Dataset

In [32]:
# Get the frequency of each value in the 'sentiment' column
sentiment_counts = df['sentiment'].value_counts()
print(sentiment_counts)

neutral     1347
negative     469
positive     184
Name: sentiment, dtype: int64


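The dataset is imbalanced: roughly 67% of the 2,000 sentences are neutral, 23% are negative, and 9% are positive. This can lead to biased model predictions and poor generalization on new data.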
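To make the imbalance concrete, here is a minimal sketch that recomputes the class shares from the counts printed above and derives "balanced" class weights using the common heuristic `n_samples / (n_classes * class_count)`. The variable names are our own, and the notebook itself does not apply these weights; this only illustrates how strongly the minority classes would need to be up-weighted.

```python
# Class counts copied from the value_counts() output above
counts = {"neutral": 1347, "negative": 469, "positive": 184}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of each class in the dataset
proportions = {label: count / n_samples for label, count in counts.items()}

# "Balanced" heuristic: rarer classes receive proportionally larger weights
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label in counts:
    print(f"{label:<9} share={proportions[label]:.3f}  weight={balanced_weights[label]:.3f}")
```

Under this heuristic a positive sentence would count several times as much as a neutral one during training.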

# Text Classifier

The cell below splits the dataset into training and testing sets, vectorizes the text data, trains several classifiers, and evaluates their performance metrics. Note that precision, recall, and F1 use 'weighted' averaging, which weights each class by its number of test samples to account for the imbalanced classes.

In [28]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['sentence'], df['sentiment'], test_size=0.2, random_state=42)

# Vectorize the text data with TF-IDF (CountVectorizer is an alternative)
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Define classifiers
classifiers = [
    ('Multinomial Naive Bayes', MultinomialNB()),
    ('Logistic Regression', LogisticRegression(max_iter=1000)),
    ('Linear SVM', LinearSVC()),
    ('Decision Tree', DecisionTreeClassifier()),
    ('Random Forest', RandomForestClassifier()),
    ('K-Nearest Neighbors', KNeighborsClassifier()),
    ('Neural Network (MLP)', MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42))
]

# Train and evaluate classifiers
for name, classifier in classifiers:
    # Train the classifier
    classifier.fit(X_train_vec, y_train)

    # Make predictions
    y_pred = classifier.predict(X_test_vec)

    # Calculate performance metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    # Print performance metrics
    print(f"{name}:\nAccuracy: {accuracy:.4f}\nPrecision: {precision:.4f}\nRecall: {recall:.4f}\nF1 Score: {f1:.4f}\n")


Multinomial Naive Bayes:
Accuracy: 0.6575
Precision: 0.6009
Recall: 0.6575
F1 Score: 0.5308

Logistic Regression:
Accuracy: 0.7200
Precision: 0.7655
Recall: 0.7200
F1 Score: 0.6554

Linear SVM:
Accuracy: 0.7625
Precision: 0.7585
Recall: 0.7625
F1 Score: 0.7417

Decision Tree:
Accuracy: 0.7075
Precision: 0.7008
Recall: 0.7075
F1 Score: 0.7011

Random Forest:
Accuracy: 0.7550
Precision: 0.7717
Recall: 0.7550
F1 Score: 0.7183

K-Nearest Neighbors:
Accuracy: 0.6850
Precision: 0.6661
Recall: 0.6850
F1 Score: 0.6647



Neural Network (MLP):
Accuracy: 0.7500
Precision: 0.7402
Recall: 0.7500
F1 Score: 0.7340
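
Because the scores above are 'weighted' averages, the large neutral class dominates them. The sketch below compares weighted and macro-averaged F1 in plain Python on a small set of hypothetical toy labels (not the notebook's data). The helper `per_class_f1` is our own illustration, not a scikit-learn function.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} computed from true and predicted labels."""
    result = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        result[label] = (f1, tp + fn)  # support = number of true instances
    return result

# Hypothetical toy labels: a classifier that almost always predicts "neutral"
y_true = ["neutral"] * 7 + ["negative"] * 2 + ["positive"]
y_pred = ["neutral"] * 7 + ["negative", "neutral"] + ["neutral"]

scores = per_class_f1(y_true, y_pred)
total_support = sum(support for _, support in scores.values())

# Macro: every class counts equally; weighted: classes count by support
macro_f1 = sum(f1 for f1, _ in scores.values()) / len(scores)
weighted_f1 = sum(f1 * support for f1, support in scores.values()) / total_support

print(f"Macro F1:    {macro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
```

In this toy case the positive class is never predicted, which pulls the macro score well below the weighted one. Reporting both can show whether minority classes are being ignored.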



### Best Performance

Based on the performance metrics, the Linear SVM is the best model among the classifiers. It has the highest accuracy and weighted F1 score, which are commonly used evaluation metrics for classification tasks, although Random Forest has slightly higher weighted precision (0.7717 vs. 0.7585).

Accuracy: 0.7625

Precision: 0.7585

Recall: 0.7625

F1 Score: 0.7417
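
To put these numbers in context, it helps to compare them with a majority-class baseline, that is, a model that always predicts the most frequent label. The sketch below computes that baseline from the class counts of the full dataset. The exact composition of the 20% test split is not shown in this notebook, so the baseline on the test set may differ slightly.

```python
# Class counts copied from the value_counts() output earlier in the notebook
counts = {"neutral": 1347, "negative": 469, "positive": 184}

# A "classifier" that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / sum(counts.values())

svm_accuracy = 0.7625  # Linear SVM accuracy reported above

print(f"Always predicting '{majority_label}': accuracy = {baseline_accuracy:.4f}")
print(f"Linear SVM improvement over baseline: {svm_accuracy - baseline_accuracy:.4f}")
```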

In [30]:
# Creating new column for SVM classification

# Vectorize the text data using TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['sentence'])
y = df['sentiment']

# Train the Linear SVM classifier on the entire dataset
svm_classifier = LinearSVC()
svm_classifier.fit(X, y)

# Make predictions on the same data the classifier was trained on
svm_predictions = svm_classifier.predict(X)

# Add the predictions as a new column in the DataFrame
df['SVM Classification'] = svm_predictions


      category                                           sentence sentiment  \
0     business               Someone forward you this newsletter?   neutral   
1      opinion  He was impeached (indicted, in a sense) for th...  negative   
2       policy    Crypto’s market cap is sitting right around ...  negative   
3      opinion  When it comes to fighting and ultimately defea...   neutral   
4      opinion  Weapons of war, such as AR-style firearms and ...   neutral   
...        ...                                                ...       ...   
1995    policy  Beyond seasonal mood changes, recent years hav...   neutral   
1996   opinion  Transmission is occurring mainly in health car...   neutral   
1997  business    ), whose state has legalized recreational ma...   neutral   
1998   opinion  Carolyn Kissane, Ph.D., is assistant dean of t...   neutral   
1999   opinion  In fact, data shows that women have been revie...   neutral   

     SVM Classification  
0               neutral  

### Finding average sentiment scores for each category of article

In [36]:
# Assign numerical scores to the sentiments
sentiment_scores = df['sentiment'].map({'negative': -1, 'neutral': 0, 'positive': 1})

# Add the sentiment scores as a new column in the DataFrame
df['Sentiment Scores'] = sentiment_scores

# Group the DataFrame by the 'category' column and calculate the mean sentiment score for each group
average_sentiment_by_category = df.groupby('category')['Sentiment Scores'].mean()

print(average_sentiment_by_category)

category
business   -0.008427
news       -0.036232
opinion    -0.255947
policy     -0.009464
Name: Sentiment Scores, dtype: float64


**Interpretation**: All category averages are negative. Business, news, and policy sit very close to zero (between about -0.04 and -0.01), indicating mostly neutral sentences, while opinion has the strongest negative average (about -0.26). This makes sense, since the opinion section does not attempt to stay unbiased, and it suggests that opinion pieces carry more negative sentiment in our sample.
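
With the -1/0/+1 mapping, the mean score of any group equals its share of positive sentences minus its share of negative sentences. As a worked check, the sketch below applies this to the overall class counts printed earlier in the notebook; per-category counts are not shown here, so only the overall figure is computed.

```python
# Overall class counts from the value_counts() output, and the score mapping
counts = {"negative": 469, "neutral": 1347, "positive": 184}
score_map = {"negative": -1, "neutral": 0, "positive": 1}

total = sum(counts.values())

# Mean of the mapped scores over all sentences
mean_score = sum(score_map[label] * n for label, n in counts.items()) / total

# Equivalent form: share of positives minus share of negatives
net_share = (counts["positive"] - counts["negative"]) / total

print(f"Mean sentiment score: {mean_score:.4f}")
print(f"Positive share minus negative share: {net_share:.4f}")
```

A mean near zero can therefore come either from mostly neutral sentences or from positives and negatives cancelling out, so it is worth checking the class shares alongside the mean.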

## Testing the SVM Classifier on 10 random sentences

In [37]:
# 10 random sentences with varying sentiments
random_sentences = [
    "I love this product! It's amazing.",
    "The movie was really boring and uninteresting.",
    "This restaurant serves the best pizza in town.",
    "The customer service was terrible and rude.",
    "The new policy changes seem to be reasonable.",
    "The weather today is quite pleasant.",
    "I'm extremely disappointed with the quality of the shoes.",
    "The event was well-organized and enjoyable.",
    "I had an average experience at the hotel.",
    "The book is not good, but not bad either."
]

# Transform the sentences using the same vectorizer used for training
random_sentences_vec = vectorizer.transform(random_sentences)

# Make predictions using the trained classifier
predicted_sentiments = svm_classifier.predict(random_sentences_vec)

# Print the predictions for the random sentences
for sentence, sentiment in zip(random_sentences, predicted_sentiments):
    print(f"Sentence: {sentence}\nPredicted Sentiment: {sentiment}\n")

Sentence: I love this product! It's amazing.
Predicted Sentiment: neutral

Sentence: The movie was really boring and uninteresting.
Predicted Sentiment: neutral

Sentence: This restaurant serves the best pizza in town.
Predicted Sentiment: neutral

Sentence: The customer service was terrible and rude.
Predicted Sentiment: neutral

Sentence: The new policy changes seem to be reasonable.
Predicted Sentiment: neutral

Sentence: The weather today is quite pleasant.
Predicted Sentiment: negative

Sentence: I'm extremely disappointed with the quality of the shoes.
Predicted Sentiment: negative

Sentence: The event was well-organized and enjoyable.
Predicted Sentiment: neutral

Sentence: I had an average experience at the hotel.
Predicted Sentiment: neutral

Sentence: The book is not good, but not bad either.
Predicted Sentiment: neutral



**Interpretation**: Since our dataset was very unbalanced, with mostly neutral sentiments, our model **overfit** to it and predicts neutral for most inputs: 8 of the 10 sentences above are labeled neutral, including clearly positive ones, and the remaining 2 are labeled negative. This shows that although our model has relatively high performance on our training/testing data, it will **not** perform well in a real-world setting.
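
One common way to reduce this kind of bias is to rebalance the training data. The notebook does not do this; the sketch below only illustrates random oversampling in plain Python on hypothetical toy rows, and the helper `oversample` is our own. In practice it should be applied to the training split only, so that duplicated rows do not leak into the test set.

```python
import random
from collections import Counter

def oversample(rows, label_key="sentiment", seed=42):
    """Randomly duplicate minority-class rows until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra copies (with replacement) to reach the target size
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy rows standing in for the training split
rows = (
    [{"sentence": f"neutral {i}", "sentiment": "neutral"} for i in range(6)]
    + [{"sentence": f"negative {i}", "sentiment": "negative"} for i in range(3)]
    + [{"sentence": "positive 0", "sentiment": "positive"}]
)

balanced = oversample(rows)
label_counts = Counter(row["sentiment"] for row in balanced)
print(label_counts)
```

Oversampling only repeats existing sentences, so it cannot add vocabulary the model has never seen. Collecting more positive and negative examples would likely help more.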