<a href="https://colab.research.google.com/github/natanael-santosd/Final-Project-Fusemachines/blob/main/Final_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **An Example of Emotion Detection and Topic Modeling Analysis of Crime News Comments on Instagram: The Paula Santana Case**
#### By Natanael Santos Delgado

## Introduction

In this project, I try to find out the most common reactions of Dominican Instagram (IG) users to crime news. I analyze the case of Paula Santana, who was killed at her workplace by two co-workers; she had reported one of them for harassment, but Human Resources did not take the report seriously. This kind of news always provokes a strong public reaction, so I think it is worth analyzing these comments to establish the public's position on how to address these issues and how they affect security concerns, especially for women. I also want to reflect on victim validation: people seem to care more about victims who are young, students, beautiful, etc. than about the nature of the crime.

For emotion detection, I use a Support Vector Machine (SVM) classifier trained on TF-IDF vectors of the comment text; for topic modeling, I use Latent Dirichlet Allocation (LDA). I also use a Multinomial Naive Bayes classifier to classify comments into sentiment categories such as positive, negative, or neutral, or simply to measure how strong the negative reactions were.
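As a rough illustration of the Multinomial Naive Bayes idea mentioned above, here is a minimal sketch that trains such a classifier on TF-IDF features. The toy comments, their sentiment labels, and all variable names are invented for this example and are not taken from the Instagram dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: NOT from the Instagram dataset.
toy_comments = [
    "que dolor tan grande",
    "justicia para paula",
    "esto es una basura",
    "descansa en paz",
    "gracias por informar",
    "buen reportaje",
]
toy_labels = ["negative", "negative", "negative", "neutral", "positive", "positive"]

# TF-IDF features feed directly into the Naive Bayes classifier.
nb_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb_pipeline.fit(toy_comments, toy_labels)

# Predict a sentiment category for two unseen toy comments.
toy_predictions = nb_pipeline.predict(["que dolor", "buen trabajo"])
print(list(toy_predictions))
```

With only six training comments the predictions carry no real meaning; the point is the shape of the pipeline, which could be swapped in wherever the SVM is used.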

One of the main challenges was data collection: scraping the IG web interface seems to be very difficult, which is why many researchers prefer to extract comments from Twitter. I decided to use Apify's Instagram scraper for the smaller news posts and manual cleaning in Excel for the mainstream media posts. In addition, the applied methods might not fully answer my research questions.

# **1. Literature Review**

## 1.1. Discourse Analysis

The International Encyclopedia of Education (2023) defines **discourse analysis** as "*the epistemological framework for investigating discourse which allows it to approach the variety of discursive genres and to describe the complexity of the discourse and of the interaction*".

Not from me: *Discourse analysis is a field of research composed of multiple heterogeneous, largely qualitative, approaches to the study of relationships between language-in-use and the social world. Researchers in the field typically view language as a form of social practice that influences the social world, and vice versa. Many contemporary varieties of discourse analysis have, explicitly or implicitly, been influenced by Michel Foucault's theories related to power, knowledge, and discourse*.

However, discourse analysis has traditionally been a qualitative research approach that aims to extract social meaning from the study of language use. According to Melissa N.P. Johnson and Ethan McLean (2020), researchers in the field typically view language as a form of social practice that influences the social world, and vice versa.

https://www.sciencedirect.com/science/article/abs/pii/S0747563216305209

https://www.sciencedirect.com/science/article/abs/pii/S0010945220301854

https://aclanthology.org/P19-4003/


Not mine: Discourse analysis has been a fundamental problem in the ACL community, where the focus is to develop tools to automatically model language phenomena that go beyond the individual sentences. With the ongoing neural revolution, as the methods become more effective and flexible, analysis and interpretability beyond the sentence-level is of particular interests for many core language processing tasks like language modeling (Ji et al., 2016) and applications such as machine translation and its evaluation (Sennrich, 2018; Laubli et al., 2018; Joty et al., 2017), text categorization (Ji and Smith, 2017), and sentiment analysis (Nejat et al., 2017). With the advent of Internet technologies, new forms of discourse are emerging (e.g., emails and discussion forums) with novel set of challenges for the computational models.

## 1.2 Emotion Detection and Topic Modeling

Emotion can be expressed in many observable ways, such as facial expressions and gestures, speech, and written text. Emotion detection in text documents is essentially a content-based classification problem involving concepts from both Natural Language Processing and Machine Learning.

https://arxiv.org/abs/1205.4944

## 2.

# Mainstream Media Coverage

**Author:** Listín Diario \
**Date:**

In [None]:
import pandas as pd
from textblob import TextBlob
import demoji
import emoji

In [None]:
comment_text = "Que pena, ahí debe de estar preso hasta el seguridad hasta que todo se aclare y se haga justicia!"
blob = TextBlob(comment_text)
sentiment_score = blob.sentiment.polarity
print(sentiment_score)

0.0


First, I want to convert all the emojis into text. This is a source of bias, since I am the one defining each emoji's meaning, and commenters may not have intended exactly that meaning. However, there are comments that contain only emojis, and I would rather not delete them.
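One practical detail when mapping emojis to text: some emojis consist of several code points (for example a base emoji plus a skin-tone modifier), so a character-by-character lookup can miss them. The sketch below shows one possible way to handle this, replacing whole substrings with the longest keys first. The three-entry mapping and the function name `replace_emojis` are hypothetical and only stand in for the full mapping.

```python
PRAY = "\U0001F64F"                   # folded hands
PRAY_LIGHT = "\U0001F64F\U0001F3FB"   # folded hands + light skin tone (2 code points)
BROKEN_HEART = "\U0001F494"           # broken heart

# Hypothetical mini-mapping for illustration only.
demo_mapping = {PRAY: "orar", PRAY_LIGHT: "esperanza", BROKEN_HEART: "corazon roto"}

def replace_emojis(text, mapping):
    """Replace each mapped emoji with its word, trying the longest keys first."""
    for key in sorted(mapping, key=len, reverse=True):
        # Pad with spaces so the word does not fuse with neighbouring text.
        text = text.replace(key, f" {mapping[key]} ")
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(text.split())

demo_text = f"Dios mío{PRAY_LIGHT} que dolor{BROKEN_HEART}"
demo_result = replace_emojis(demo_text, demo_mapping)
print(demo_result)  # Dios mío esperanza que dolor corazon roto
```

Sorting by key length ensures the skin-tone variant is matched before the bare emoji, so the modifier is not left behind as a stray character.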

In [None]:
df = pd.read_csv('diariolibre1.csv')
df['data'] = df['data'].astype('str')

# Function to extract emojis from a string
def extract_distinct_emojis(text):
    return emoji.distinct_emoji_list(text)

# Apply the function to each value in the 'data' column
emojis_list = df['data'].apply(extract_distinct_emojis)

# Aggregate all extracted emojis into a single list
all_emojis = [emoji for sublist in emojis_list for emoji in sublist]

# Obtain unique emojis
unique_emojis = list(set(all_emojis))

print("Unique List of Emojis:")
print(unique_emojis)

emoji_mapping = {
    '🙌':'',
    '🫂':'',
    '😩':'',
    '😂':'',
    '🌹':'',
    '🥲':'',
    '💜':'',
    '☹️':'',
    '😳':'',
    '🔥':'',
    '🖤':'',
    '😔':'',
    '💰':'',
    '👀':'',
    '😄':'',
    '🙏🏻':'',
    '😖':'',
    '🇩🇴':'',
    '👏':'',
    '😞':'',
    '🙈':'',
    '⚖️':'',
    '😒':'',
    '\U0001f979':'',
    '👿':'',
    '😢':'',
    '❤️':'',
    '😇':'',
    '😮':'',
    '🙃':'',
    '😦':'',
    '🕯':'',
    '😍':'',
    '🤬':'',
    '🕊':'',
    '🙊':'',
    '😪':'',
    '🧑\u200d🎨':'',
    '🙏🏼':'',
    '❤':'',
    '🌎':'',
    '✝️':'',
    '😭':'',
    '🙏':'',
    '🐀':'',
    '🎥':'',
    '🙉':'',
    '🥺':'',
    '😡':'',
    '🕊️':'',
    '😥':'',
    '🤔':'',
    '👎':'',
    '💔':''
}

def remove_emojis(text):
  """Replace every emoji listed in emoji_mapping with its mapped string.

  Keys are matched as whole substrings, longest first, because several
  emojis span more than one code point (skin tones, variation selectors,
  ZWJ sequences) and would never match a single character.
  """
  for key in sorted(emoji_mapping, key=len, reverse=True):
    text = text.replace(key, emoji_mapping[key])
  return text

df['data'] = df['data'].apply(remove_emojis)
print(df)

df_emoji = pd.read_csv('diariolibre1.csv')
df_emoji['data'] = df_emoji['data'].astype('str')

emoji_mapping_2 = {
    '🙌':'esperanza',
    '🫂':'abrazo',
    '😩':'angustia',
    '😂':'',
    '🌹':'rosa',
    '🥲':'llorar',
    '💜':'esperanza',
    '☹️':'llorar',
    '😳':'asombro',
    '🔥':'',
    '🖤':'esperanza',
    '😔':'pena',
    '💰':'dinero',
    '👀':'observar',
    '😄':'',
    '🙏🏻':'esperanza',
    '😖':'pena',
    '🇩🇴':'',
    '👏':'aplauso',
    '😞':'dolor',
    '🙈':'',
    '⚖️':'justicia',
    '😒':'cansancio',
    '\U0001f979':'',
    '👿':'enojo',
    '😢':'dolor',
    '❤️':'esperanza',
    '😇':'angel',
    '😮':'sorpresa',
    '🙃':'enojo',
    '😦':'sorpresa',
    '🕯':'paz',
    '😍':'',
    '🤬':'enojo',
    '🕊':'descansa en paz',
    '🙊':'enojo',
    '😪':'dolor',
    '🧑\u200d🎨':'',
    '🙏🏼':'orar',
    '❤':'esperanza',
    '🌎':'mundo',
    '✝️':'Dios',
    '😭':'llorar',
    '🙏':'orar',
    '🐀':'raton',
    '🎥':'pelicula',
    '🙉':'',
    '🥺':'llorar',
    '😡':'enojo',
    '🕊️':'descansa en paz',
    '😥':'tristeza',
    '🤔':'pensar',
    '👎':'disgusto',
    '💔':'corazon roto'
}

def emojis_to_words(text):
  """Replace every emoji listed in emoji_mapping_2 with its Spanish word.

  Keys are matched longest first so multi-code-point emojis are replaced
  whole; words are padded with spaces so they do not fuse with the
  neighbouring text.
  """
  for key in sorted(emoji_mapping_2, key=len, reverse=True):
    text = text.replace(key, f" {emoji_mapping_2[key]} ")
  return " ".join(text.split())

df_emoji['data'] = df_emoji['data'].apply(emojis_to_words)
print(df_emoji)

Unique List of Emojis:
['🙌', '🫂', '😩', '😂', '🌹', '🥲', '💜', '☹️', '😳', '🔥', '🖤', '😔', '💰', '👀', '😄', '🙏🏻', '😖', '🇩🇴', '👏', '😞', '🙈', '⚖️', '😒', '\U0001f979', '👿', '😢', '❤️', '😇', '😮', '🙃', '😦', '🕯', '😍', '🤬', '🕊', '🙊', '😪', '🧑\u200d🎨', '🙏🏼', '❤', '🌎', '✝️', '😭', '🙏', '🐀', '🎥', '🙉', '🥺', '😡', '🕊️', '😥', '🤔', '👎', '💔']
      id                                               data
0      0  No podemos dejar de hacer sonar el caso, hay q...
1      1  Y según la empresa, las cámaras del entorno do...
2      2     Dios mio, como apagaron el sueño de esa joven,
3      3  A ley de 4 meses paulita para graduarnos y mir...
4      4  Que raro que no se han filtrado fotos del acos...
..   ...                                                ...
495  495  Lo malo de este país que orita suertan.ese hij...
496  496  Esa muchacha faja trabajando y pagándose su ca...
497  497                             Hay Dios mío que dolor
498  498  Yo solo digo que eso está bien turbio, alguien...
499  499  Y la justici

In [None]:
# Load the dataset
# df = pd.read_csv('diariolibre1.csv')

# Perform sentiment analysis on each comment
sentiment_scores = []
for index, row in df.iterrows():
    comment_text = row['data']
    blob = TextBlob(comment_text)
    # Note: no language is set here; TextBlob's default analyzer uses an English lexicon
    sentiment_score = blob.sentiment.polarity
    sentiment_scores.append(sentiment_score)

# Aggregate sentiment scores to get an overall
# sentiment score for the crime news post
overall_sentiment_score = sum(sentiment_scores) / len(sentiment_scores)

print("Overall Sentiment Score:", overall_sentiment_score)

Overall Sentiment Score: -0.011333333333333334


After receiving a close-to-zero overall sentiment score, I suspect that the algorithm does not work well on this dataset. Hence, I will check different comments individually to judge whether the result is acceptable.
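One likely reason for the near-zero score is that TextBlob's default analyzer scores words against an English lexicon, so most Spanish words contribute nothing. The sketch below illustrates that mechanism with a tiny lexicon-based scorer. The Spanish lexicon, its weights, and the function name `lexicon_polarity` are all invented for illustration; this is not how TextBlob is implemented.

```python
import re

# Hypothetical hand-made lexicon; words and weights are illustrative only.
toy_lexicon_es = {"pena": -0.6, "dolor": -0.7, "preso": -0.4, "justicia": 0.3}
# Hypothetical English-only lexicon, standing in for an English resource.
toy_lexicon_en = {"sad": -0.5, "pain": -0.7, "justice": 0.3}

def lexicon_polarity(text, lexicon):
    """Average the weights of the words found in the lexicon (0.0 if none match)."""
    words = re.findall(r"\w+", text.lower())
    hits = [lexicon[w] for w in words if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

spanish_comment = "Que pena, ahí debe de estar preso hasta que se haga justicia!"
score_es = lexicon_polarity(spanish_comment, toy_lexicon_es)
score_en = lexicon_polarity(spanish_comment, toy_lexicon_en)
print(score_es)  # negative: three Spanish words match
print(score_en)  # 0.0: no English word matches
```

Under this assumption, a Spanish sentiment resource or a translation step would be needed before the polarity scores become informative.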

In [None]:
# This comment translates to: "What a tragedy, everyone involved should be in jail, even the security officer, until everything is clear and justice is done!"
comment_text = "Que pena, ahí debe de estar preso hasta el seguridad hasta que todo se aclare y se haga justicia!"
blob = TextBlob(comment_text)
sentiment_score = blob.sentiment.polarity
print(sentiment_score)
# We get a sentiment score of 0.0, which indicates a neutral result.

0.0


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

X = df['data']

# Initialize the CountVectorizer
count_vectorizer = CountVectorizer()

# Fit and transform the data
X_count = count_vectorizer.fit_transform(X)

# Initialize the LDA model
lda_model = LatentDirichletAllocation(n_components=15, random_state=42)  # 15 topics

# Fit the LDA model
lda_model.fit(X_count)

# Get the top words for each topic
feature_names = count_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda_model.components_):
    print(f"Top words for Topic {topic_idx}:")
    top_words_idx = topic.argsort()[:-10 - 1:-1]  # Top 10 words
    top_words = [feature_names[i] for i in top_words_idx]
    print(top_words)
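The slice `argsort()[:-10 - 1:-1]` used above to pick the top words can be hard to read. The sketch below reproduces it on a made-up five-word vocabulary with invented weights, so that the selection can be followed by hand; `toy_vocab`, `toy_topic`, and `n_top` are hypothetical names.

```python
import numpy as np

# Hypothetical vocabulary and topic weights, for illustration only.
toy_vocab = np.array(["justicia", "paula", "dios", "empresa", "dolor"])
toy_topic = np.array([0.9, 0.1, 0.5, 0.7, 0.3])
n_top = 3

# argsort() orders indices by ascending weight; the reversed slice
# [:-n_top - 1:-1] walks back from the end and keeps the n_top largest.
top_idx = toy_topic.argsort()[:-n_top - 1:-1]
top_words = toy_vocab[top_idx].tolist()
print(top_words)  # ['justicia', 'empresa', 'dios']
```

An equivalent and arguably clearer form is `toy_topic.argsort()[::-1][:n_top]`.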

In [105]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the labeled comments: a 'data' text column plus one column per keyword label
df = pd.read_csv('svm_data.csv')
df['data'] = df['data'].astype('str')
encoded_df = pd.DataFrame()
encoded_df['data'] = df['data']
encoded_df['Justice'] = df['Justicia'] | df['Empresa'] | df['Pobre/Dinero']
encoded_df['Victim'] = df['Joven/Muchacha/Chica'] | df['Paula'] | df['Sueño']
encoded_df['Harassment'] = df['Acoso']
print(encoded_df)

# The combined label columns ('Justice', 'Victim', 'Harassment') were built above in encoded_df

# Preprocess comments and extract features
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # You can adjust max_features as needed
X = tfidf_vectorizer.fit_transform(df['data'])

# Concatenate label columns into a single column for y
y = encoded_df[['Justice', 'Victim', 'Harassment']].astype(str).agg(''.join, axis=1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the SVM classifier
svm_classifier = SVC(kernel='linear')

# Train the SVM classifier
svm_classifier.fit(X_train, y_train)

# Predict on the testing data
y_pred = svm_classifier.predict(X_test)

# Evaluate the classifier
print(classification_report(y_test, y_pred))

                                                  data  Justice  Victim  \
0    No podemos dejar de hacer sonar el caso, hay q...        1       0   
1    Y según la empresa, las cámaras del entorno do...        1       0   
2     😢Dios mio, como apagaron el sueño de esa joven,😢        0       1   
3    A ley de 4 meses paulita para graduarnos y mir...        0       1   
4    Que raro que no se han filtrado fotos del acos...        1       0   
..                                                 ...      ...     ...   
495  Lo malo de este país que orita suertan.ese hij...        0       0   
496  Esa muchacha faja trabajando y pagándose su ca...        0       1   
497                             Hay Dios mío que dolor        0       0   
498  Yo solo digo que eso está bien turbio, alguien...        0       0   
499  Y la justicia de aquí es una basura con un rég...        1       0   

     Harassment  
0             0  
1             0  
2             0  
3             0  
4        

  _warn_prf(average, modifier, msg_start, len(result))


Each class is a three-character string whose positions correspond to 'Justice', 'Victim', and 'Harassment', in that order.

Class '000': All three labels are absent (0) for a given comment, i.e. none of 'Justice', 'Victim', or 'Harassment' applies.

Class '001': 'Harassment' is present (1) while 'Justice' and 'Victim' are absent (0).

Class '010': 'Victim' is present (1) while 'Justice' and 'Harassment' are absent (0).

Class '100': 'Justice' is present (1) while 'Victim' and 'Harassment' are absent (0).

Class '101': Both 'Justice' and 'Harassment' are present (1) while 'Victim' is absent (0).

Class '110': Both 'Justice' and 'Victim' are present (1) while 'Harassment' is absent (0).
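To make the class strings easier to read in a report, they can be decoded back into label names. This is a minimal sketch that assumes the position order 'Justice', 'Victim', 'Harassment' used when the columns were concatenated; `LABEL_NAMES` and `decode_label` are hypothetical helper names.

```python
# Position order assumed to match the concatenation of the label columns.
LABEL_NAMES = ("Justice", "Victim", "Harassment")

def decode_label(code):
    """Return the label names whose position holds a '1' in the class string."""
    return [name for name, bit in zip(LABEL_NAMES, code) if bit == "1"]

for code in ["000", "001", "101", "110"]:
    print(code, decode_label(code))
```

For example, `decode_label("101")` returns `['Justice', 'Harassment']`, and `decode_label("000")` returns an empty list.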

In [None]:
import pandas as pd
import emoji
from textblob import TextBlob

In [None]:
# Load the dataset
df = pd.read_csv('diariolibre2.csv')

# Perform sentiment analysis on each comment
sentiment_scores = []
for index, row in df.iterrows():
    comment_text = row['data']
    blob = TextBlob(comment_text)
    sentiment_score = blob.sentiment.polarity
    sentiment_scores.append(sentiment_score)

# Aggregate sentiment scores to get an overall sentiment score for the crime news post
overall_sentiment_score = sum(sentiment_scores) / len(sentiment_scores)

print("Overall Sentiment Score:", overall_sentiment_score)

Overall Sentiment Score: -0.010933333333333331


# First News Article

**Author:** Female, Instagram User

**Date:**

**Link:**

**Content:** Comment

**Original Language:** Spanish

She was being harassed and the company labeled it as a "bad-taste joke" #JusticeForPaula

Paula's case brings the issue of workplace harassment to the table, and with it:

- The absence of efficient mechanisms for preventing and monitoring workplace harassment

- The normalization of harassment, ignoring that it is violence and can escalate quickly

- The invalidation of victims' experiences

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report
!pip install chardet
import chardet
with open('data_journalist.csv', 'rb') as f:
    encoding = chardet.detect(f.read())['encoding']

# Load the dataset
df = pd.read_csv('data_journalist.csv', encoding=encoding)

# Separate features (comment text) and labels (emotion categories)
X = df['data']  # comment text
y = df[['sad', 'anger', 'fear', 'other']]  # emotion labels

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature extraction using TF-IDF
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a Support Vector Machine (SVM) classifier for each emotion category
svm_models = {}
for emotion in ['sad', 'anger', 'fear', 'other']:
    clf = SVC(kernel='linear')
    clf.fit(X_train_tfidf, y_train[emotion])
    svm_models[emotion] = clf

# Evaluate the models
for emotion in ['sad', 'anger', 'fear', 'other']:
    y_pred = svm_models[emotion].predict(X_test_tfidf)
    print(f"Emotion: {emotion}")
    print(classification_report(y_test[emotion], y_pred))

Emotion: sad
              precision    recall  f1-score   support

           0       0.57      0.80      0.67         5
           1       0.50      0.25      0.33         4

    accuracy                           0.56         9
   macro avg       0.54      0.53      0.50         9
weighted avg       0.54      0.56      0.52         9

Emotion: anger
              precision    recall  f1-score   support

           0       0.50      0.25      0.33         4
           1       0.57      0.80      0.67         5

    accuracy                           0.56         9
   macro avg       0.54      0.53      0.50         9
weighted avg       0.54      0.56      0.52         9

Emotion: fear
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         9

    accuracy                           1.00         9
   macro avg       1.00      1.00      1.00         9
weighted avg       1.00      1.00      1.00         9

Emotion: other
              pr

# Second News Article

**Author:** Female, Instagram User

**Date:**

**Link:**

**Content:** Comment

**Original Language:** Spanish

In [None]:
# Load the dataset
df = pd.read_csv('data_ricardo.csv')

# Separate features (comment text) and labels (emotion categories)
X = df['data']  # comment text
y = df[['agreement', 'life_imp','other']]  # emotion labels

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature extraction using TF-IDF
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a Support Vector Machine (SVM) classifier for each emotion category
svm_models = {}
for emotion in ['agreement', 'life_imp','other']:
    clf = SVC(kernel='linear')
    clf.fit(X_train_tfidf, y_train[emotion])
    svm_models[emotion] = clf

# Evaluate the models
for emotion in ['agreement', 'life_imp','other']:
    y_pred = svm_models[emotion].predict(X_test_tfidf)
    print(f"Emotion: {emotion}")
    print(classification_report(y_test[emotion], y_pred))

Emotion: agreement
              precision    recall  f1-score   support

           0       0.50      0.50      0.50         2
           1       0.86      0.86      0.86         7

    accuracy                           0.78         9
   macro avg       0.68      0.68      0.68         9
weighted avg       0.78      0.78      0.78         9

Emotion: life_imp
              precision    recall  f1-score   support

           0       0.88      1.00      0.93         7
           1       1.00      0.50      0.67         2

    accuracy                           0.89         9
   macro avg       0.94      0.75      0.80         9
weighted avg       0.90      0.89      0.87         9

Emotion: other
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         9

    accuracy                           1.00         9
   macro avg       1.00      1.00      1.00         9
weighted avg       1.00      1.00      1.00         9

