---Sentiment analysis---
This script performs sentiment analysis on news articles in the MIND dataset. It trains on the PerSenT dataset, in which articles are labeled positive, negative, or neutral, and applies the resulting classifier to the MIND articles. The classification is based on OpenAI embeddings (text-embedding-ada-002).
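
The embedding-based approach described above can be sketched as follows. This is a minimal illustration, not the script's actual code: it assumes the embeddings were computed beforehand and stored as stringified lists in a CSV-style column (which is what the `literal_eval` import suggests), and it uses a toy frame with 3-dimensional vectors in place of the 1536-dimensional text-embedding-ada-002 vectors. The names `toy`, `X_emb`, `y_emb`, and `clf` are hypothetical.

```python
from ast import literal_eval

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for precomputed embeddings stored as strings (assumption).
# Real text-embedding-ada-002 vectors have 1536 dimensions; 3 are used here.
toy = pd.DataFrame({
    "embedding": [
        "[0.9, 0.1, 0.0]", "[0.8, 0.2, 0.1]",
        "[0.1, 0.9, 0.0]", "[0.2, 0.8, 0.1]",
        "[0.0, 0.1, 0.9]", "[0.1, 0.2, 0.8]",
    ],
    "TRUE_SENTIMENT": ["Positive", "Positive", "Negative",
                       "Negative", "Neutral", "Neutral"],
})

# Parse each stringified list back into numbers and stack into a 2-D array.
X_emb = np.array(toy["embedding"].apply(literal_eval).tolist())
y_emb = toy["TRUE_SENTIMENT"]

# Fit a random forest on the embedding vectors.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_emb, y_emb)

# Classify a new (hypothetical) embedding vector.
pred = clf.predict(np.array([[0.85, 0.15, 0.05]]))
print(pred[0])
```

The same fitted classifier could then be applied to embeddings of the MIND articles, provided they come from the same embedding model.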


In [2]:
# imports
import pandas as pd
import openai
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from ast import literal_eval

In [159]:
# load data, keeping only the document text and its sentiment label
sent_train = pd.read_csv("sentiment_analysis/PerSenT/train.csv")
drop_cols = ['DOCUMENT_INDEX', 'TARGET_ENTITY', 'TITLE', 'MASKED_DOCUMENT'] + [f'Paragraph{i}' for i in range(16)]
df = sent_train.drop(columns=drop_cols)
df.head()

                                            DOCUMENT  TRUE_SENTIMENT
0  Germany's Landesbank Baden Wuertemberg won EU ...        Negative
1  The Philippine National Police (PNP) identifie...         Neutral
2    Sirleaf 70 acknowledged before the commissio...        Negative
3    Sawyer logged off and asked her sister Mari ...         Neutral
4  Candi Holyfield said in the protective order t...         Neutral


In [173]:
negative_rows = df[df['TRUE_SENTIMENT'] == 'Negative']
print(negative_rows)

                                               DOCUMENT TRUE_SENTIMENT
0     Germany's Landesbank Baden Wuertemberg won EU ...       Negative
2     Sirleaf  70  acknowledged before the commissio...       Negative
10    But could it be that U.S. Sen. Ted Cruz of Tex...       Negative
11    Soldiers will stop and check all motorists pas...       Negative
14    The cartoon caused a storm when the Times publ...       Negative
...                                                 ...            ...
3408  4. Geert Wilders Hate Speech Trials\nThe most ...       Negative
3409  Meanwhile  the city is still dealing with the ...       Negative
3410  We’ll get to the merits of the charges and c...       Negative
3411  Russia ’s president Vladimir Putin  wanted t...       Negative
3412  All five living former US presidents are teami...       Negative

[409 rows x 2 columns]


In [174]:
positive_rows = df[df['TRUE_SENTIMENT'] == 'Positive']
print(positive_rows)

                                               DOCUMENT TRUE_SENTIMENT
8     Rukodelnikova is fond of a lot things from Chi...       Positive
13    The first team of women who summit Mt. Qomolan...       Positive
18    The funding would subsidize tourist agencies o...       Positive
23    Sheikh Ahmad  who is tasked with implementing ...       Positive
26    This week actor Paul McGann stood amid the dir...       Positive
...                                                 ...            ...
3338  A complicated era is over.  Carmelo Anthony  f...       Positive
3342  Kara Alaimo  an assistant professor of public ...       Positive
3345  Carmelo Anthony  stepped off a private jet Sun...       Positive
3347  Pittsburgh Steelers offensive tackle Alejandro...       Positive
3353  What is known for sure about  American militar...       Positive

[1758 rows x 2 columns]
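
The two filters above show 409 negative rows against 1758 positive rows, so the training labels are clearly imbalanced. A quicker way to see all class sizes at once is `value_counts`. The sketch below uses a hypothetical toy frame rather than the PerSenT data, so the numbers it prints are not the real class counts.

```python
import pandas as pd

# Hypothetical toy labels standing in for df['TRUE_SENTIMENT'].
toy = pd.DataFrame({
    "TRUE_SENTIMENT": ["Positive", "Positive", "Positive",
                       "Neutral", "Neutral", "Negative"],
})

# Absolute number of rows per class.
counts = toy["TRUE_SENTIMENT"].value_counts()

# Share of each class, rounded to two decimals.
shares = toy["TRUE_SENTIMENT"].value_counts(normalize=True).round(2)

print(counts)
print(shares)
```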


In [190]:
sent_train = pd.read_csv("sentiment_analysis/PerSenT/train_bal.csv", sep=';')
drop_cols = ['DOCUMENT_INDEX', 'TARGET_ENTITY', 'TITLE', 'MASKED_DOCUMENT'] + [f'Paragraph{i}' for i in range(16)]
df = sent_train.drop(columns=drop_cols)
df.head()

                                            DOCUMENT  TRUE_SENTIMENT
0  Germany's Landesbank Baden Wuertemberg won EU ...        Negative
1  The Philippine National Police (PNP) identifie...         Neutral
2    Sirleaf 70 acknowledged before the commissio...        Negative
3    Sawyer logged off and asked her sister Mari ...         Neutral
4  Candi Holyfield said in the protective order t...         Neutral


In [191]:
# Keep only the first 500 rows where TRUE_SENTIMENT is "Positive" (head() takes them in file order, not at random)
positive_rows = df[df['TRUE_SENTIMENT'] == 'Positive'].head(500)

# Keep the rest of the DataFrame unchanged
other_rows = df[df['TRUE_SENTIMENT'] != 'Positive']

# Concatenate the two parts to get the final DataFrame
df = pd.concat([positive_rows, other_rows])

# Display the final DataFrame
print(df)

                                               DOCUMENT TRUE_SENTIMENT
8     Rukodelnikova is fond of a lot things from Chi...       Positive
13    The first team of women who summit Mt. Qomolan...       Positive
18    The funding would subsidize tourist agencies o...       Positive
23    Sheikh Ahmad  who is tasked with implementing ...       Positive
26    This week actor Paul McGann stood amid the dir...       Positive
...                                                 ...            ...
3408  4. Geert Wilders Hate Speech Trials\nThe most ...       Negative
3409  Meanwhile  the city is still dealing with the ...       Negative
3410  We’ll get to the merits of the charges and c...       Negative
3411  Russia ’s president Vladimir Putin  wanted t...       Negative
3412  All five living former US presidents are teami...       Negative

[2155 rows x 2 columns]
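
The cell above caps the positive class with `head(500)`, which keeps the first 500 positive rows in file order. If the file is ordered in some meaningful way, a random draw may give a more representative subset. The following is a sketch of that alternative, not what the notebook does; `downsample_class` and the toy frame are hypothetical names introduced for illustration.

```python
import pandas as pd


def downsample_class(frame, label_col, label, n, seed=0):
    """Keep at most n randomly chosen rows of `label`; leave other classes untouched."""
    is_label = frame[label_col] == label
    target = frame[is_label]
    # Draw without replacement; the fixed seed makes the draw reproducible.
    kept = target.sample(n=min(n, len(target)), random_state=seed)
    # Recombine and restore the original row order.
    return pd.concat([kept, frame[~is_label]]).sort_index()


# Hypothetical toy data: 6 positive, 3 neutral, 1 negative.
toy = pd.DataFrame({
    "DOCUMENT": [f"doc {i}" for i in range(10)],
    "TRUE_SENTIMENT": ["Positive"] * 6 + ["Neutral"] * 3 + ["Negative"],
})

balanced = downsample_class(toy, "TRUE_SENTIMENT", "Positive", n=3)
print(balanced["TRUE_SENTIMENT"].value_counts())
```

On the real frame the call would look like `downsample_class(df, 'TRUE_SENTIMENT', 'Positive', n=500)`.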


In [193]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Vectorize the text using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['DOCUMENT'])

# Use the sentiment labels as the target (SVC accepts string labels directly)
y = df['TRUE_SENTIMENT']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8)


# Initialize the SVM classifier with a sigmoid kernel (class_weight is left at its default, so classes are not reweighted)
classifier = SVC(kernel='sigmoid')

# Train the classifier
classifier.fit(X_train, y_train)

# Make predictions on the test set
predictions = classifier.predict(X_test)

# Evaluate the classifier
report = classification_report(y_test, predictions)

print("Classification Report:\n", report)


Classification Report:
               precision    recall  f1-score   support

    Negative       0.43      0.03      0.06        88
     Neutral       0.56      0.98      0.71       237
    Positive       0.83      0.05      0.09       106

    accuracy                           0.56       431
   macro avg       0.61      0.35      0.29       431
weighted avg       0.60      0.56      0.42       431
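
The report above shows a recall of 0.98 for Neutral but only 0.03 for Negative and 0.05 for Positive, so the classifier assigns almost every test article to the majority class. Two changes that may help are sketched below under stated assumptions: passing `class_weight='balanced'` to `SVC`, and wrapping the vectorizer and classifier in a pipeline so that TF-IDF is fitted on the training split only (the cell above fits it on all documents before splitting). The toy corpus is hypothetical, so the printed scores say nothing about PerSenT; whether these changes improve the real results would need to be checked.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for df['DOCUMENT'] and df['TRUE_SENTIMENT'].
docs = [
    "great win for the team", "praised for excellent work",
    "celebrated a great success", "won praise and an award",
    "charged with fraud", "criticized for the failure",
    "accused of a serious crime", "blamed for the fraud scandal",
    "the meeting is on monday", "the report was published today",
    "officials met on tuesday", "the office opens on monday",
]
labels = ["Positive"] * 4 + ["Negative"] * 4 + ["Neutral"] * 4

# Split the raw text first; stratify keeps class proportions roughly equal in both splits.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=8, stratify=labels
)

# The pipeline fits TF-IDF on the training documents only, and
# class_weight='balanced' weights each class inversely to its frequency.
model = make_pipeline(
    TfidfVectorizer(),
    SVC(kernel="sigmoid", class_weight="balanced"),
)
model.fit(docs_train, y_train)
predictions = model.predict(docs_test)

# zero_division=0 avoids warnings when a class receives no predictions.
print(classification_report(y_test, predictions, zero_division=0))
```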

