### Introduction

In the evolving landscape of online communication, the proliferation of automated bots and the prevalence of cyberbullying present significant challenges to maintaining healthy digital environments. These phenomena not only distort genuine user interactions but also contribute to the spread of toxicity and misinformation. Addressing these issues necessitates sophisticated detection mechanisms capable of discerning complex human sentiments and behaviors. This paper proposes an advanced approach to detect online bots and bullying by leveraging deep sentiment analysis and knowledge graphs, utilizing cutting-edge technologies such as VADER (Valence Aware Dictionary and sEntiment Reasoner), the RoBERTa (Robustly Optimized BERT Pretraining Approach) pretrained model from Hugging Face, and the Hugging Face pipeline.


### Data Acquisition: Extracting Reddit Comments

The foundation of our analysis lies in the acquisition of a comprehensive dataset of Reddit comments, which will serve as the primary source for detecting online bots and bullying. Reddit, a vast online platform with a rich tapestry of user interactions, offers a diverse range of language use, sentiments, and behaviors, making it an ideal environment for our study. This section outlines the methodology for downloading and preparing Reddit comments for analysis.

**1. Choosing a Reddit Data Extraction Tool:**
   - **Pushshift API**: A widely used resource for accessing historical Reddit data, Pushshift provides an extensive archive of Reddit posts and comments. It allows for querying specific subreddits, time frames, and other parameters, making it a versatile tool for data collection.
   - **PRAW (Python Reddit API Wrapper)**: For real-time data extraction, PRAW is a Python library that interfaces with Reddit's official API, enabling the extraction of recent comments and posts.

**2. Defining Parameters for Data Collection:**
   - **Subreddit Selection**: Identify and select subreddits relevant to the study. This could include general subreddits or those known for heightened activities of bots and bullying.
   - **Time Frame**: Determine the time period from which to extract comments. This could range from specific dates to a continuous real-time feed.
   - **Volume and Diversity**: Ensure a large and diverse dataset to capture a wide range of sentiments and behaviors.

**3. Data Extraction Process:**
   - **Using Pushshift**: Leverage Pushshift to download historical comments. Utilize its querying capabilities to filter data based on date, subreddit, and other relevant criteria.
   - **Using PRAW**: For real-time data, use PRAW to stream comments. This involves setting up a script that continuously fetches new comments as they are posted.

**4. Data Preprocessing:**
   - **Cleaning**: Remove irrelevant content (e.g., URLs, non-textual elements) and standardize text (e.g., lowercasing, removing excessive whitespace).
   - **Anonymization**: Ensure privacy by anonymizing user data.
   - **Structuring**: Organize the data into a structured format suitable for analysis, such as JSON or CSV, containing fields like comment text, timestamp, subreddit, and user ID (if relevant).

**5. Storage and Management:**
   - **Database Storage**: Store the extracted data in a database system, considering scalability and ease of access. Options include relational databases like PostgreSQL or NoSQL databases like MongoDB.
   - **Backup and Security**: Implement regular backups and ensure data security, especially when handling large volumes of user-generated content.

**6. Ethical Considerations:**
   - **Compliance with Reddit's Terms of Service**: Adhere to Reddit's API terms of service and guidelines.
   - **User Privacy**: Respect user privacy and confidentiality, especially when dealing with sensitive content.

By meticulously following these steps, we can acquire a rich dataset of Reddit comments, which will serve as the cornerstone for our subsequent sentiment analysis and bot detection efforts. This data, once processed and analyzed through our proposed methodologies, will provide valuable insights into the dynamics of online interactions and the prevalence of bots and bullying on Reddit.

In [5]:
import praw

# Initialize PRAW with your client credentials
reddit = praw.Reddit(client_id='jxUeaoD8b_JZhpd0s2Q-UA',
                     client_secret='JtudGESVw0Bl1eWoybiwnw5-Rfrzvw',
                     user_agent='python:ZeAnalyst:v1.0 (by /u/U_HIT_MY_DOG)')

# Choose the subreddit
subreddit = reddit.subreddit('india')

# Open a file to save the submission titles and URLs
with open('submissions.txt', 'w', encoding='utf-8') as file:
    # Get the top submissions from the subreddit
    for submission in subreddit.top(limit=10):  # You can change the limit and the time filter
        # Write the title and URL of each submission to the file
        file.write(submission.title + '\n' + submission.url + '\n\n')


In [6]:
import datetime

# Initialize PRAW with your client credentials
reddit = praw.Reddit(client_id='jxUeaoD8b_JZhpd0s2Q-UA',
                     client_secret='JtudGESVw0Bl1eWoybiwnw5-Rfrzvw',
                     user_agent='python:ZeAnalyst:v1.0 (by /u/U_HIT_MY_DOG)')

# Choose the subreddit
subreddit = reddit.subreddit('india')

# Calculate the time 24 hours ago from now
one_day_ago = datetime.datetime.utcnow() - datetime.timedelta(days=1)

# Open a file to save the comments
with open('comments.txt', 'w', encoding='utf-8') as file:
    # Get the submissions from the last 24 hours
    for submission in subreddit.new(limit=None):  # Fetch new submissions
        submission_time = datetime.datetime.utcfromtimestamp(submission.created_utc)
        if submission_time > one_day_ago:
            submission.comments.replace_more(limit=0)
            for comment in submission.comments.list():
                file.write(comment.body + '\n')  # Write each comment to the file


In [7]:
import pandas as pd

# Initialize an empty list to store the comment data
comments_data = []

# Open the text file and read lines
with open('comments.txt', 'r', encoding='utf-8') as file:
    for line in file:
        # Assuming each comment is on a new line
        comments_data.append({'comment': line.strip()})  # strip() removes any leading/trailing whitespace

# Convert the list of dictionaries into a DataFrame
comments_df = pd.DataFrame(comments_data)


In [9]:
comments_df

Unnamed: 0,comment
0,"* If your image is not OC (Original Content), ..."
1,"* If your image is a camera photo, please prov..."
2,"* If your image is an Infographic, please prov..."
3,* Screenshots of social media posts / comments...
4,
...,...
1573,
1574,I think this year's was the second one.
1575,They also have women's creator program (since ...
1576,


In [10]:
import praw
import pandas as pd

# Initialize PRAW with your client credentials
reddit = praw.Reddit(client_id='jxUeaoD8b_JZhpd0s2Q-UA',
                     client_secret='JtudGESVw0Bl1eWoybiwnw5-Rfrzvw',
                     user_agent='python:ZeAnalyst:v1.0 (by /u/U_HIT_MY_DOG)')

# List of subreddits you want to scrape
subreddits = ['india', 'indiaspeaks', 'unitedstatesofindia']

# Initialize a list to store comment data
comments_data = []

for subreddit_name in subreddits:
    subreddit = reddit.subreddit(subreddit_name)
    for submission in subreddit.new(limit=100):  # Adjust the limit as needed
        submission.comments.replace_more(limit=0)
        post_text = submission.selftext
        post_id = submission.id  # Get the ID of the post
        for comment in submission.comments.list():
            # Skip moderator comments
            if comment.distinguished:
                continue
            # Collect comment data
            comments_data.append({
                'comment_id': comment.id,
                'author': str(comment.author),
                'body': comment.body,
                'score': comment.score,
                'subreddit': str(comment.subreddit),
                'post_text': post_text,
                'post_id': post_id  # Add the post ID to the data
            })

# Convert to DataFrame
comments_df = pd.DataFrame(comments_data)


In [11]:
comments_df

Unnamed: 0,comment_id,author,body,score,subreddit,post_text,post_id
0,k7z26rh,Barely_Excited,You didn't flush the toilet. ¯⁠\⁠_⁠ಠ⁠_⁠ಠ⁠_⁠/⁠¯,1,india,,17oiuix
1,k7yhhpb,Mahatma_F_Gandhi,How much net worth is required to enter this p...,45,india,,17ofvte
2,k7y3u2z,serLundry,Looks posh. \nTangential point but most of the...,29,india,,17ofvte
3,k7yfub0,oneinmanybillion,None of my shoes are new enough or clean enoug...,27,india,,17ofvte
4,k7ygyfo,iphone4Suser,Andar jaane ka ticket hai kya? I don't know 90...,19,india,,17ofvte
...,...,...,...,...,...,...,...
10185,k7gpeyt,ProbabilisticPotato,Fortunately,1,unitedstatesofindia,Hindi is the soul of Indian culture; Hindi can...,17l5s4j
10186,k7d862j,soldierbones,Yep so what's with those quotes? Don't they go...,6,unitedstatesofindia,Hindi is the soul of Indian culture; Hindi can...,17l5s4j
10187,k7hjyvq,Fast_Deoxy,Who TF speaks like that?,0,unitedstatesofindia,Hindi is the soul of Indian culture; Hindi can...,17l5s4j
10188,k7d9a4f,CommunicationCold650,> Don't they go against the spirit of the nati...,-8,unitedstatesofindia,Hindi is the soul of Indian culture; Hindi can...,17l5s4j


In [15]:
from textblob import TextBlob

# Assuming comments_data is your DataFrame and it has 'body' for the comment text
def get_sentiment(text):
    # This function returns the polarity of the text
    return TextBlob(text).sentiment.polarity

# Apply the function to the 'body' column to get sentiment scores
comments_data['sentiment'] = comments_data['body'].apply(get_sentiment)

# Classify sentiment into positive, neutral, or negative based on the polarity score
comments_data['sentiment_label'] = comments_data['sentiment'].apply(lambda x: 'positive' if x > 0 else ('neutral' if x == 0 else 'negative'))


In [16]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Assuming comments_data is your DataFrame and it has 'body' for comment text and 'sentiment' as 0 or 1
comments_data = comments_df # your DataFrame from previous steps

# Tokenize the comments
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(comments_data['body'])
sequences = tokenizer.texts_to_sequences(comments_data['body'])

# Pad sequences to ensure uniform input size
data = pad_sequences(sequences, maxlen=200)

# Prepare labels
labels = np.array(comments_data['sentiment'])

# Split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(0.2 * data.shape[0])

x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

# Build the model
model = Sequential()
model.add(Embedding(5000, 128, input_length=200))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train,
          batch_size=128,
          epochs=10,
          validation_data=(x_val, y_val))

2023-11-05 15:46:56.094406: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x163a06bb0>

In [17]:
import praw
import pandas as pd

# Initialize PRAW with your client credentials
reddit = praw.Reddit(client_id='jxUeaoD8b_JZhpd0s2Q-UA',
                     client_secret='JtudGESVw0Bl1eWoybiwnw5-Rfrzvw',
                     user_agent='python:ZeAnalyst:v1.0 (by /u/U_HIT_MY_DOG)')

# List of subreddits you want to scrape
subreddits = ['worldnews']

# Initialize a list to store comment data
test_comments_data = []

for subreddit_name in subreddits:
    subreddit = reddit.subreddit(subreddit_name)
    for submission in subreddit.new(limit=100):  # Adjust the limit as needed
        submission.comments.replace_more(limit=0)
        post_text = submission.selftext
        post_id = submission.id  # Get the ID of the post
        for comment in submission.comments.list():
            # Skip moderator comments
            if comment.distinguished:
                continue
            # Collect comment data
            test_comments_data.append({
                'comment_id': comment.id,
                'author': str(comment.author),
                'body': comment.body,
                'score': comment.score,
                'subreddit': str(comment.subreddit),
                'post_text': post_text,
                'post_id': post_id  # Add the post ID to the data
            })

# Convert to DataFrame
test_comments_df = pd.DataFrame(test_comments_data)

In [21]:

test_comments_df['body']


0                                      Since the US left?
1                       Good. Fucking dirtbag antisemite.
2       Everyday, I make sure to think about the pligh...
3                                 This is not good is it?
4       Everyone was so crtical when Putin kept threat...
                              ...                        
7484                         Can you say more about that?
7485                            And exterminating uighurs
7486    Excerpt:\n\n> President of the European Commis...
7487    > **Hundreds forced to evacuate homes and othe...
7488            What the heck, this storm is still going?
Name: body, Length: 7489, dtype: object

In [25]:
# Initialize a list to hold predictions
predictions = []

for t in test_comments_df['body']:
    text_to_predict = t
    # Tokenize the text
    sequence = tokenizer.texts_to_sequences([text_to_predict])
    # Pad the sequence
    padded_sequence = pad_sequences(sequence, maxlen=200)
    # Predict the sentiment
    prediction = model.predict(padded_sequence)
    # Add the prediction to the list (assuming binary classification with one output neuron)
    predictions.append(prediction[0][0])

# Once all predictions are made, add them as a column to the DataFrame
test_comments_df['prediction'] = predictions





In [28]:
test_comments_df



Unnamed: 0,comment_id,author,body,score,subreddit,post_text,post_id,prediction
0,k801uph,scooterbike1968,Since the US left?,1,worldnews,,17opfv1,1.048661e-18
1,k801io9,Impressive_Alarm_817,Good. Fucking dirtbag antisemite.,1,worldnews,,17opeqh,6.219468e-04
2,k800w9q,lovo17,"Everyday, I make sure to think about the pligh...",1,worldnews,,17op6d7,6.594091e-02
3,k7zzi3y,Abracadabra__,This is not good is it?,1,worldnews,,17ooq5q,4.200956e-01
4,k8007xd,atari101103,Everyone was so crtical when Putin kept threat...,1,worldnews,,17ooq5q,2.208344e-01
...,...,...,...,...,...,...,...,...
7484,k7vsnpf,itemNineExists,Can you say more about that?,2,worldnews,,17nxblf,4.604762e-01
7485,k7wxjja,the_CCP_is_evil,And exterminating uighurs,3,worldnews,,17nxblf,4.207214e-10
7486,k7ul894,Geschichtsklitterung,Excerpt:\n\n> President of the European Commis...,7,worldnews,,17nx846,2.141242e-17
7487,k7un842,Throwaway_Blueberry,> **Hundreds forced to evacuate homes and othe...,2,worldnews,,17nx5j2,4.849846e-19


## Transformers model

In [29]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import requests
from bs4 import BeautifulSoup
import re
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

Downloading (…)okenizer_config.json: 100%|██████████| 39.0/39.0 [00:00<00:00, 4.37kB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 953/953 [00:00<00:00, 350kB/s]
Downloading (…)solve/main/vocab.txt: 100%|██████████| 872k/872k [00:00<00:00, 2.54MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 112/112 [00:00<00:00, 76.9kB/s]
Downloading pytorch_model.bin: 100%|██████████| 669M/669M [00:43<00:00, 15.3MB/s] 


In [32]:
def sentiment_score(review):
    tokens = tokenizer.encode(review, return_tensors='pt')
    result = model(tokens)
    return int(torch.argmax(result.logits))+1

sentiment_score(test_comments_df['body'].iloc[1])
test_comments_df['sentiment'] = test_comments_df['body'].apply(lambda x: sentiment_score(x[:512]))



In [33]:
test_comments_df

Unnamed: 0,comment_id,author,body,score,subreddit,post_text,post_id,prediction,sentiment
0,k801uph,scooterbike1968,Since the US left?,1,worldnews,,17opfv1,1.048661e-18,1
1,k801io9,Impressive_Alarm_817,Good. Fucking dirtbag antisemite.,1,worldnews,,17opeqh,6.219468e-04,4
2,k800w9q,lovo17,"Everyday, I make sure to think about the pligh...",1,worldnews,,17op6d7,6.594091e-02,5
3,k7zzi3y,Abracadabra__,This is not good is it?,1,worldnews,,17ooq5q,4.200956e-01,1
4,k8007xd,atari101103,Everyone was so crtical when Putin kept threat...,1,worldnews,,17ooq5q,2.208344e-01,1
...,...,...,...,...,...,...,...,...,...
7484,k7vsnpf,itemNineExists,Can you say more about that?,2,worldnews,,17nxblf,4.604762e-01,3
7485,k7wxjja,the_CCP_is_evil,And exterminating uighurs,3,worldnews,,17nxblf,4.207214e-10,5
7486,k7ul894,Geschichtsklitterung,Excerpt:\n\n> President of the European Commis...,7,worldnews,,17nx846,2.141242e-17,1
7487,k7un842,Throwaway_Blueberry,> **Hundreds forced to evacuate homes and othe...,2,worldnews,,17nx5j2,4.849846e-19,1



**VADER: A Lexicon and Rule-Based Sentiment Analysis Tool**
- VADER stands out for its effectiveness in interpreting the nuances of social media language, including slang, emojis, and colloquial expressions. It employs a bag-of-words approach, which, despite its simplicity, is remarkably efficient in capturing the emotional valence of text. This makes it an ideal tool for initial sentiment assessments, providing a foundational layer for identifying potential negative or aggressive content indicative of bullying or bot-like behavior.

**RoBERTa Pretrained Model from Hugging Face**
- Building upon the capabilities of VADER, we integrate the RoBERTa model to delve deeper into the contextual understanding of text. RoBERTa, an optimized iteration of the BERT (Bidirectional Encoder Representations from Transformers) model, excels in capturing the subtleties of language context, making it highly effective in discerning complex sentiment and intent. This model's proficiency in understanding the intricacies of human language enables a more nuanced detection of bullying and bot-generated content, which often requires contextual interpretation beyond mere word-level analysis.

**Hugging Face Pipeline: Streamlining Model Application**
- To operationalize these models efficiently, we utilize the Hugging Face pipeline. This framework simplifies the deployment of machine learning models, allowing for seamless integration and application of advanced NLP (Natural Language Processing) tools. The pipeline facilitates the application of RoBERTa and other models to large datasets, handling preprocessing, model inference, and output generation. This not only enhances the scalability of our approach but also ensures consistency and accuracy in analysis across diverse data sets.

Through this technical amalgamation, our research aims to establish a robust, scalable, and accurate system for detecting online bots and bullying. By harnessing the strengths of both rule-based and deep learning approaches, complemented by the efficiency of the Hugging Face pipeline, we endeavor to contribute a significant advancement in the realm of online safety and digital well-being.

## Vader Transformer

In [43]:
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [51]:
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm
import certifi
import nltk
nltk.download('vader_lexicon', download_dir=certifi.where())


##sia = SentimentIntensityAnalyzer()


[nltk_data] Downloading package vader_lexicon to /Library/Frameworks/P
[nltk_data]     ython.framework/Versions/3.9/lib/python3.9/site-
[nltk_data]     packages/certifi/cacert.pem...


NotADirectoryError: [Errno 20] Not a directory: '/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/certifi/cacert.pem/sentiment'

In [52]:
res = {}
for i, row in tqdm(test_comments_df.iterrows(), total=len(test_comments_df)):
    text = row['Text']
    myid = row['Id']
    test_comments_df['Vader_model'] = sia.polarity_scores(test_comments_df['body'])

ImportError: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html

In [47]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax

In [None]:
MODEL = f"cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

Downloading pytorch_model.bin:   2%|▏         | 10.5M/499M [00:26<09:35, 849kB/s]

ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Read timed out.

Downloading pytorch_model.bin: 100%|██████████| 499M/499M [00:08<00:00, 55.6MB/s]


# Roberta Pretrained Model

- Use a model trained of a large corpus of data.
- Transformer model accounts for the words but also the context related to other words.

In [None]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax

MODEL = f"cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# VADER results on example
print(example)
sia.polarity_scores(example)


# Run for Roberta Model
encoded_text = tokenizer(example, return_tensors='pt')
output = model(**encoded_text)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
scores_dict = {
    'roberta_neg' : scores[0],
    'roberta_neu' : scores[1],
    'roberta_pos' : scores[2]
}
print(scores_dict)
def polarity_scores_roberta(example):
    encoded_text = tokenizer(example, return_tensors='pt')
    output = model(**encoded_text)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    scores_dict = {
        'roberta_neg' : scores[0],
        'roberta_neu' : scores[1],
        'roberta_pos' : scores[2]
    }
    return scores_dict

res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    try:
        text = row['Text']
        myid = row['Id']
        vader_result = sia.polarity_scores(text)
        vader_result_rename = {}
        for key, value in vader_result.items():
            vader_result_rename[f"vader_{key}"] = value
        roberta_result = polarity_scores_roberta(text)
        both = {**vader_result_rename, **roberta_result}
        res[myid] = both
    except RuntimeError:
        print(f'Broke for id {myid}')

results_df = pd.DataFrame(res).T
results_df = results_df.reset_index().rename(columns={'index': 'Id'})
results_df = results_df.merge(df, how='left')