<a href="https://colab.research.google.com/github/natanael-santosd/Final-Project-Fusemachines/blob/main/Final_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **An Example of Emotion Detection and Topic Modeling Analysis of Crime News Comments on Instagram: The Paula Santana Case**
#### By Natanael Santos Delgado

In this project, I try to identify the most common reactions of Dominican Instagram (IG) users to crime news. I analyze the case of Paula Santana, who was killed at her workplace by two co-workers; she had reported one of them for harassment, but Human Resources did not take the report seriously. This kind of news always provokes a strong public reaction, so it is worth analyzing the comments to establish the public's position on how to address these issues, how the case affects security concerns (especially for women), and what it shows about which victims receive validation: people seem to care more about victims who are young, students, beautiful, etc. than about the nature of the crime.

For emotion detection, I use a Support Vector Machine (SVM) classifier trained on TF-IDF vectors of the comment text; for topic modeling, I use Latent Dirichlet Allocation (LDA). I also use a Multinomial Naive Bayes classifier to classify comments into sentiment categories (positive, negative, or neutral) or, alternatively, to measure how strong the negative reactions were.

One of the main challenges was data collection: scraping the Instagram web interface is difficult, which is why many researchers prefer to extract comments from Twitter. I decided to use Apify's Instagram scraper for the smaller news items and Excel cleaning for the mainstream media news. In addition, the applied methods might not fully answer my research questions.
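
Once the comments are collected, they usually need some normalization before vectorization. The sketch below shows one possible cleaning function using only the standard library. The function name `clean_comment`, the specific rules (dropping links, mentions, punctuation, and emoji while keeping hashtags), and the sample comment are assumptions for illustration, not the cleaning actually applied to the dataset.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^\w\s#]")
SPACE_RE = re.compile(r"\s+")


def clean_comment(text: str) -> str:
    """Normalize one raw comment (hypothetical helper, not from the notebook)."""
    # NFC keeps accented letters such as 'é' as single characters
    text = unicodedata.normalize("NFC", str(text)).lower()
    text = URL_RE.sub(" ", text)       # drop links
    text = MENTION_RE.sub(" ", text)   # drop @user mentions
    text = NON_WORD_RE.sub(" ", text)  # drop punctuation and emoji, keep letters, digits, '#'
    return SPACE_RE.sub(" ", text).strip()


# Made-up sample comment
raw = "@listindiario ¡Qué DOLOR! 😢 #JusticeForPaula https://example.com/x"
print(clean_comment(raw))  # → qué dolor #justiceforpaula
```

Dropping emoji loses information that may matter for emotion detection; converting them to text instead is a reasonable alternative when that signal is needed.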

# **1. Literature Review**

## 1.1. Discourse Analysis

The International Encyclopedia of Education (2023) defines **discourse analysis** as "*the epistemological framework for investigating discourse which allows it to approach the variety of discursive genres and to describe the complexity of the discourse and of the interaction*".

Not from me: *Discourse analysis is a field of research composed of multiple heterogeneous, largely qualitative, approaches to the study of relationships between language-in-use and the social world. Researchers in the field typically view language as a form of social practice that influences the social world, and vice versa. Many contemporary varieties of discourse analysis have, explicitly or implicitly, been influenced by Michel Foucault's theories related to power, knowledge, and discourse*.

However, discourse analysis has traditionally been a qualitative research approach that aims to extract social meaning from the study of language use. According to Melissa N.P. Johnson and Ethan McLean (2020), researchers in the field typically view language as a form of social practice that influences the social world, and vice versa.

https://www.sciencedirect.com/science/article/abs/pii/S0747563216305209

https://www.sciencedirect.com/science/article/abs/pii/S0010945220301854

https://aclanthology.org/P19-4003/


Not mine: Discourse analysis has been a fundamental problem in the ACL community, where the focus is to develop tools to automatically model language phenomena that go beyond the individual sentences. With the ongoing neural revolution, as the methods become more effective and flexible, analysis and interpretability beyond the sentence-level is of particular interests for many core language processing tasks like language modeling (Ji et al., 2016) and applications such as machine translation and its evaluation (Sennrich, 2018; Laubli et al., 2018; Joty et al., 2017), text categorization (Ji and Smith, 2017), and sentiment analysis (Nejat et al., 2017). With the advent of Internet technologies, new forms of discourse are emerging (e.g., emails and discussion forums) with novel set of challenges for the computational models.

## 1.2. Emotion Detection and Topic Modeling

https://arxiv.org/abs/1205.4944

In [None]:
!pip install gitpython
import git  # GitPython is imported as `git`, not `gitpython`
# Clone the project repository into the working directory
git.Repo.clone_from('https://github.com/natanael-santosd/Final-Project-Fusemachines', 'Final-Project-Fusemachines')

# Mainstream Media Coverage

**Author:** Listín Diario \
**Date:**

In [None]:
import pandas as pd
from textblob import TextBlob

# Load the dataset
df = pd.read_csv('your_dataset.csv')

# Preprocess the comment text
# Add your preprocessing steps here (e.g., removing special characters, converting text to lowercase)

# Perform sentiment analysis on each comment
sentiment_scores = []
for index, row in df.iterrows():
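    comment_text = row['data']  # Assuming 'data' is the column containing comment text
    blob = TextBlob(comment_text)
    # TextBlob's default analyzer uses an English lexicon and takes no language
    # argument, so Spanish words it does not know add nothing to the polarity
    sentiment_score = blob.sentiment.polarity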
    sentiment_scores.append(sentiment_score)

# Aggregate sentiment scores to get an overall sentiment score for the crime news post
overall_sentiment_score = sum(sentiment_scores) / len(sentiment_scores)

print("Overall Sentiment Score:", overall_sentiment_score)

In [None]:
!pip install emoji
import pandas as pd
import emoji
from textblob import TextBlob

In [None]:
# Load the dataset
df = pd.read_csv('diariolibre2.csv')

# Preprocess the comment text
# Add your preprocessing steps here (e.g., removing special characters, converting text to lowercase)

# Convert emojis to text
#def convert_emojis_to_text(text):
    #return emoji.demojize(text)

#df['data'] = df['data'].apply(convert_emojis_to_text)

# Perform sentiment analysis on each comment
sentiment_scores = []
for index, row in df.iterrows():
    comment_text = row['data']
    blob = TextBlob(comment_text)
    sentiment_score = blob.sentiment.polarity
    sentiment_scores.append(sentiment_score)

# Aggregate sentiment scores to get an overall sentiment score for the crime news post
overall_sentiment_score = sum(sentiment_scores) / len(sentiment_scores)

print("Overall Sentiment Score:", overall_sentiment_score)

Overall Sentiment Score: -0.010933333333333331
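
A mean polarity close to zero, like the one above, can come either from mostly neutral scores or from positive and negative scores cancelling each other out. One way to tell these apart is to bucket the per-comment scores into the positive, negative, and neutral categories mentioned in the introduction and count them. The sketch below assumes a cutoff of ±0.05 and uses made-up scores; both are illustrative choices, not values from the dataset.

```python
from collections import Counter


def polarity_to_label(score: float, threshold: float = 0.05) -> str:
    """Map a polarity in [-1, 1] to a coarse label (threshold is an assumption)."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Made-up polarity scores standing in for `sentiment_scores`
scores = [0.0, -0.4, 0.0, 0.25, 0.0, -0.1]

labels = [polarity_to_label(s) for s in scores]
label_counts = Counter(labels)
mean_score = sum(scores) / len(scores)

print(label_counts)
print(f"Mean polarity: {mean_score:.3f}")
```

Reporting the label counts next to the mean makes it clearer how many comments actually received a non-zero score.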


# First News Article

**Author:** Female, Instagram User

**Date:**

**Link:**

**Content:** Comment

**Original Language:** Spanish

She was being harassed and the company labeled it as a "bad-taste joke" #JusticeForPaula

Paula's case brings the issue of workplace harassment to the table, and with it:

- The absence of efficient mechanisms for preventing and monitoring workplace harassment

- The normalization of harassment, ignoring that it is violence and can escalate quickly

- The invalidation of victims' experiences

In [None]:
!pip install chardet
import chardet
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Detect the file encoding first, since the CSV may not be UTF-8
with open('data_journalist.csv', 'rb') as f:
    encoding = chardet.detect(f.read())['encoding']

# Load the dataset
df = pd.read_csv('data_journalist.csv', encoding=encoding)

# Separate features (comment text) and labels (emotion categories)
X = df['data']  # comment text
y = df[['sad', 'anger', 'fear', 'other']]  # emotion labels

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature extraction using TF-IDF
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a Support Vector Machine (SVM) classifier for each emotion category
svm_models = {}
for emotion in ['sad', 'anger', 'fear', 'other']:
    clf = SVC(kernel='linear')
    clf.fit(X_train_tfidf, y_train[emotion])
    svm_models[emotion] = clf

# Evaluate the models
for emotion in ['sad', 'anger', 'fear', 'other']:
    y_pred = svm_models[emotion].predict(X_test_tfidf)
    print(f"Emotion: {emotion}")
    print(classification_report(y_test[emotion], y_pred))

Emotion: sad
              precision    recall  f1-score   support

           0       0.57      0.80      0.67         5
           1       0.50      0.25      0.33         4

    accuracy                           0.56         9
   macro avg       0.54      0.53      0.50         9
weighted avg       0.54      0.56      0.52         9

Emotion: anger
              precision    recall  f1-score   support

           0       0.50      0.25      0.33         4
           1       0.57      0.80      0.67         5

    accuracy                           0.56         9
   macro avg       0.54      0.53      0.50         9
weighted avg       0.54      0.56      0.52         9

Emotion: fear
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         9

    accuracy                           1.00         9
   macro avg       1.00      1.00      1.00         9
weighted avg       1.00      1.00      1.00         9

Emotion: other
              pr

# Second News Article

**Author:** Female, Instagram User

**Date:**

**Link:**

**Content:** Comment

**Original Language:** Spanish

In [None]:
# Load the dataset
df = pd.read_csv('data_ricardo.csv')

# Separate features (comment text) and labels (reaction categories)
X = df['data']  # comment text
y = df[['agreement', 'life_imp','other']]  # reaction labels

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature extraction using TF-IDF
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a Support Vector Machine (SVM) classifier for each emotion category
svm_models = {}
for emotion in ['agreement', 'life_imp','other']:
    clf = SVC(kernel='linear')
    clf.fit(X_train_tfidf, y_train[emotion])
    svm_models[emotion] = clf

# Evaluate the models
for emotion in ['agreement', 'life_imp','other']:
    y_pred = svm_models[emotion].predict(X_test_tfidf)
    print(f"Emotion: {emotion}")
    print(classification_report(y_test[emotion], y_pred))

Emotion: agreement
              precision    recall  f1-score   support

           0       0.50      0.50      0.50         2
           1       0.86      0.86      0.86         7

    accuracy                           0.78         9
   macro avg       0.68      0.68      0.68         9
weighted avg       0.78      0.78      0.78         9

Emotion: life_imp
              precision    recall  f1-score   support

           0       0.88      1.00      0.93         7
           1       1.00      0.50      0.67         2

    accuracy                           0.89         9
   macro avg       0.94      0.75      0.80         9
weighted avg       0.90      0.89      0.87         9

Emotion: other
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         9

    accuracy                           1.00         9
   macro avg       1.00      1.00      1.00         9
weighted avg       1.00      1.00      1.00         9
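
The per-label models above can also be used to label a new comment. The sketch below repeats the same one-classifier-per-label setup on a tiny made-up training set so that it runs on its own. The comments, the labels, and the helper name `predict_emotions` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Made-up training comments with one binary column per label
train_comments = [
    "que tristeza tan grande",
    "esto me da mucha rabia",
    "que dolor tan triste",
    "que rabia con esa empresa",
    "descansa en paz paula",
    "justicia ya basta de impunidad",
]
train_labels = {
    "sad":   [1, 0, 1, 0, 1, 0],
    "anger": [0, 1, 0, 1, 0, 1],
}

# Fit the TF-IDF vocabulary on the training comments only
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(train_comments)

# One linear SVM per label, as in the cells above
svm_models = {}
for emotion, y in train_labels.items():
    clf = SVC(kernel="linear")
    clf.fit(X_train_tfidf, y)
    svm_models[emotion] = clf


def predict_emotions(comment: str) -> dict:
    """Return a 0/1 prediction per label for one comment (hypothetical helper)."""
    features = vectorizer.transform([comment])  # reuse the fitted vocabulary
    return {
        emotion: int(model.predict(features)[0])
        for emotion, model in svm_models.items()
    }


print(predict_emotions("que tristeza y que dolor"))
```

Each label needs both classes (0 and 1) in its training column, because `SVC` raises an error when fit on a single class. Note also that with only nine test comments per article, the scores in the reports above can change substantially with a different train/test split.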

