# Sentiment Analysis

Research Question: What linguistic patterns and sentiment markers are most indicative of suicidal intent in social media text?

Note: This notebook exports analysis results to `.jsonl` files under `N-Gram Analysis` directory.

## Import Relevant Modules

In [32]:
import os
import sys
DATA_PATH = os.path.abspath(os.path.join(os.path.dirname(os.curdir), 'data'))
ASSIGNMENT_1_PATH = os.path.abspath(
    os.path.join(
        os.path.dirname(os.curdir), 
        "..",
        'Assignment 1',
        "src"
        )
)
sys.path.append(DATA_PATH)
sys.path.append(ASSIGNMENT_1_PATH)
import zipfile
import shutil
import re
from collections import Counter
import numpy as np
import pandas as pd
import json
from textblob import TextBlob
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
print(f"Installation of nltk.sentiment.vader success?: {'Yes' if nltk.download('vader_lexicon') else 'No'}")

Installation of nltk.sentiment.vader success?: Yes


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\kelvi\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [33]:
stopwords = open(
    os.path.join(
        ASSIGNMENT_1_PATH, 
        "StopWords.txt"
    ), "r").read().splitlines()
print(f"Number of Stopwords in consideration: {len(stopwords)}")

Number of Stopwords in consideration: 779


## Fetch the Texts from the Datasets

In [34]:
depression = pd.read_json(
    os.path.join(
        DATA_PATH, 
        'Depression_Tweets', 
        'depression_json'
    )
)
reddit = pd.read_csv(
    os.path.join(
        DATA_PATH, 
        'Reddit_SuicideWatch', 
        'reddit_suicidewatch.csv'
    ), 
    encoding='utf-8'
)
social_media_sentiment_analysis = pd.read_csv(
    os.path.join(
        DATA_PATH, 
        'Social_Media_Sentiments_Analysis_Dataset', 
        'sentimentdataset_annotated_binary.csv'
    ), 
    encoding='utf-8'
)
twitter_suicidal_data = pd.read_csv(
    os.path.join(
        DATA_PATH, 
        'Twitter_Suicidal_Data', 
        'twitter-suicidal_data.csv'
    ), 
    encoding='utf-8'
)


In [35]:
depression.columns

Index(['content'], dtype='object')

In [36]:
reddit.columns

Index(['subreddit', 'selftext', 'author_fullname', 'title', 'hide_score',
       'name', 'upvote_ratio', 'ups', 'author_flair_template_id', 'score',
       'edited', 'author_flair_css_class', 'created', 'selftext_html',
       'no_follow', 'over_18', 'id', 'author', 'num_comments',
       'author_flair_text_color', 'permalink', 'url', 'created_utc'],
      dtype='object')

In [37]:
social_media_sentiment_analysis.columns

Index(['Unnamed: 0.1', 'Unnamed: 0', 'Text', 'Sentiment', 'Timestamp', 'User',
       'Platform', 'Hashtags', 'Retweets', 'Likes', 'Country', 'Year', 'Month',
       'Day', 'Hour', 'Annotation'],
      dtype='object')

In [38]:
twitter_suicidal_data.columns

Index(['tweet', 'intention'], dtype='object')

## N-Gram Analysis

Performs an n-gram analysis (from 2-gram to 5-gram) and exports results to corresponding `.jsonl` files.

In [39]:
def get_n_gram(n: int=2, corpus: str="", stopwords: list=[]):
    pattern = re.compile(r'\b\w+\b')
    # filtered corpus should have no punctuations AND stopwords
    filtered_corpus = [word for word in corpus if pattern.match(word) and word not in stopwords]
    # Compute bigrams from the filtered corpus
    ngrams = [tuple(filtered_corpus[i:i+n]) for i in range(len(filtered_corpus)-n+1)]
    return ngrams

In [40]:
# Ensure the output directory exists
output_dir = 'N-Gram Analysis'
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
os.makedirs(output_dir)

datasets = {
    "depression_ngram.json": [],
    "reddit_ngram.json": [],
    "social_media_sentiment_analysis_ngram.json": [],
    "twitter_suicidal_ngram.json": []
}

# Print and Export top 20 n-grams for each dataset
for N in range(2, 6):  # bigram to 5-gram combinations
    # Depression
    print(f"{N}-gram from Depression JSON dataset:\n")

    ngrams_depression = get_n_gram(
        n=N,
        corpus=depression['content'][0].split(),
        stopwords=stopwords
    )

    print(f"Number of {N}-gram from Depression JSON dataset: {len(ngrams_depression)}")

    ngrams_depression_top_20 = Counter(ngrams_depression).most_common(20)
    print(f"First 20 {N}-grams from Depression JSON dataset: {ngrams_depression_top_20}")
        
    # Transform to desired format
    formatted_ngrams = [
        {"content": " ".join(ngram), "N-gram": N, "frequency": freq}
        for ngram, freq in ngrams_depression_top_20
    ]
    datasets["depression_ngram.json"].extend(formatted_ngrams)

    # Reddit
    print(f"{N}-gram from Reddit SuicideWatch dataset:\n")

    reddit['combined_text'] = reddit['title'] + " " + reddit['selftext']
    ngrams_reddit = get_n_gram(
        n=N,
        corpus=reddit['combined_text'][0].split(),
        stopwords=stopwords
    )

    print(f"Number of {N}-gram from Reddit SuicideWatch dataset: {len(ngrams_reddit)}")

    ngrams_reddit_top_20 = Counter(ngrams_reddit).most_common(20)
    print(f"First 20 {N}-grams from Reddit SuicideWatch dataset: {ngrams_reddit_top_20}")
        
    # Transform to desired format
    formatted_ngrams = [
        {"content": " ".join(ngram), "N-gram": N, "frequency": freq}
        for ngram, freq in ngrams_reddit_top_20
    ]
    datasets["reddit_ngram.json"].extend(formatted_ngrams)

    # Social Media Sentiment Analysis
    print(f"{N}-gram from Social Media Sentiment Analysis dataset:\n")

    social_media_sentiment_analysis['combined_text'] = social_media_sentiment_analysis['Text'] + " " + social_media_sentiment_analysis['Hashtags']
    ngrams_social_media_sentiment_analysis = get_n_gram(
        n=N,
        corpus=social_media_sentiment_analysis['combined_text'][0].split(),
        stopwords=stopwords
    )

    print(f"Number of {N}-gram from Social Media Sentiment Analysis dataset: {len(ngrams_social_media_sentiment_analysis)}")

    ngrams_social_media_sentiment_analysis_top_20 = Counter(ngrams_social_media_sentiment_analysis).most_common(20)
    print(f"First 20 {N}-grams from Social Media Sentiment Analysis dataset: {ngrams_social_media_sentiment_analysis_top_20}")
        
    # Transform to desired format
    formatted_ngrams = [
        {"content": " ".join(ngram), "N-gram": N, "frequency": freq}
        for ngram, freq in ngrams_social_media_sentiment_analysis_top_20
    ]
    datasets["social_media_sentiment_analysis_ngram.json"].extend(formatted_ngrams)

    # Twitter Suicidal Data
    print(f"{N}-gram from Twitter Suicidal dataset:\n")

    ngrams_twitter_suicidal_data = get_n_gram(
        n=N,
        corpus=twitter_suicidal_data['tweet'][0].split(),
        stopwords=stopwords
    )

    print(f"Number of {N}-gram from Twitter Suicidal dataset: {len(ngrams_twitter_suicidal_data)}")

    ngrams_twitter_suicidal_data_top_20 = Counter(ngrams_twitter_suicidal_data).most_common(20)
    print(f"First 20 {N}-grams from Twitter Suicidal dataset: {ngrams_twitter_suicidal_data_top_20}")
        
    # Transform to desired format
    formatted_ngrams = [
        {"content": " ".join(ngram), "N-gram": N, "frequency": freq}
        for ngram, freq in ngrams_twitter_suicidal_data_top_20
    ]
    datasets["twitter_suicidal_ngram.json"].extend(formatted_ngrams)

# Export to JSON files
for filename, data in datasets.items():
    with open(os.path.join(output_dir, filename), 'a') as f:
        json.dump(data, f, indent=4)

2-gram from Depression JSON dataset:

Number of 2-gram from Depression JSON dataset: 7
First 20 2-grams from Depression JSON dataset: [(('Yea', 'typically'), 1), (('typically', 'crying'), 1), (('crying', 'sign'), 1), (('sign', 'uncontrolled'), 1), (('uncontrolled', 'depression'), 1), (('depression', 'struggled'), 1), (('struggled', "life'"), 1)]
2-gram from Reddit SuicideWatch dataset:

Number of 2-gram from Reddit SuicideWatch dataset: 106
First 20 2-grams from Reddit SuicideWatch dataset: [(('I', 'left'), 2), (('Help?', 'I'), 1), (('I', 'reached'), 1), (('reached', 'point'), 1), (('point', 'today.'), 1), (('today.', 'I'), 1), (('I', 'struggling'), 1), (('struggling', 'lot'), 1), (('lot', 'financially'), 1), (('financially', 'trying'), 1), (('trying', 'support'), 1), (('support', 'kids.'), 1), (('kids.', 'I'), 1), (('I', 'lose'), 1), (('lose', 'house.'), 1), (('house.', 'I'), 1), (('I', 'work'), 1), (('work', 'full'), 1), (('full', 'I'), 1), (('I', "it's"), 1)]
2-gram from Social Medi

### N-gram Analysis based on Prediction Results

In [41]:
RESULTS_PATH = os.path.abspath(
    os.path.join(
        "NLP Training", 
        'Results'
        )
)

In [42]:
results = {}
for jsonFile in os.listdir(RESULTS_PATH):
    if jsonFile.endswith('.json') or jsonFile.endswith('.jsonl'):
        with open(os.path.join(RESULTS_PATH, jsonFile), 'r') as f:
            data = [json.loads(line) for line in f]
            print(f"Number of records in {jsonFile}: {len(data)}")
            for N in range(2, 6):
                ngram_results: list[tuple] = get_n_gram(
                    n=N,
                    corpus=data[0]['raw_text'].split(),
                    stopwords=stopwords
                )
                print(f"Number of {N}-gram from {jsonFile}: {len(ngram_results)}")
                ngram_results_top_20 = Counter(ngram_results).most_common(20)
                print(f"First 20 {N}-grams from {jsonFile}: {ngram_results_top_20}")
                print("\n")
                # For exporting
                if jsonFile not in results:
                    results[jsonFile] = []
                results[jsonFile].extend([
                    {"content": " ".join(ngram), "N-gram": N, "frequency": freq, "predicted_label": data[0]['predicted_label']}
                    for ngram, freq in ngram_results_top_20
                ])
# Export to JSON files
with open(os.path.join(output_dir, 'pred_results.json'), 'a') as f:
    json.dump(results, f, indent=4)

Number of records in Result_deep_learning_customized_standalone.jsonl: 18679
Number of 2-gram from Result_deep_learning_customized_standalone.jsonl: 7
First 20 2-grams from Result_deep_learning_customized_standalone.jsonl: [(('Yea', 'typically'), 1), (('typically', 'crying'), 1), (('crying', 'sign'), 1), (('sign', 'uncontrolled'), 1), (('uncontrolled', 'depression'), 1), (('depression', 'struggled'), 1), (('struggled', "life'"), 1)]


Number of 3-gram from Result_deep_learning_customized_standalone.jsonl: 6
First 20 3-grams from Result_deep_learning_customized_standalone.jsonl: [(('Yea', 'typically', 'crying'), 1), (('typically', 'crying', 'sign'), 1), (('crying', 'sign', 'uncontrolled'), 1), (('sign', 'uncontrolled', 'depression'), 1), (('uncontrolled', 'depression', 'struggled'), 1), (('depression', 'struggled', "life'"), 1)]


Number of 4-gram from Result_deep_learning_customized_standalone.jsonl: 5
First 20 4-grams from Result_deep_learning_customized_standalone.jsonl: [(('Yea', 't

## Sentiment Analysis with TextBlob

Analyze the Sentiments (i.e., Polarity and Subjectivity) of the top 20 n-grams in each dataset and updates the files.

In [43]:
sia = SentimentIntensityAnalyzer()
def analyze_sentiment(record):
    text = str(record['content'])
    scores = sia.polarity_scores(text)
    # Map VADER scores to a similar structure
    record['polarity_VADER'] = scores['compound']  # Ranges from -1 to 1
    record['positive'] = scores['pos']
    record['negative'] = scores['neg']
    record['neutral'] = scores['neu']
    return record

In [44]:
ngram_analysis_sets:dict = {}

for ngram_result_file in os.listdir(os.path.join(output_dir)):
    with open(os.path.join(output_dir, ngram_result_file), 'r') as f:
        data = json.load(f)
        ngram_analysis_sets[ngram_result_file] = data
        if (isinstance(data, dict)):
            for key, value in data.items():
                for record in value:
                    # TextBlob
                    wiki = TextBlob(str(record['content']))
                    record['polarity_TextBlob'] = wiki.sentiment.polarity
                    record['subjectivity'] = wiki.sentiment.subjectivity
                    # VADER
                    wiki = analyze_sentiment(record)
                    ngram_analysis_sets[ngram_result_file][key].append(wiki)
        if (isinstance(data, list)):
            for idx, record in enumerate(data):
                # TextBlob
                wiki = TextBlob(str(record['content']))
                record['polarity_TextBlob'] = wiki.sentiment.polarity
                record['subjectivity'] = wiki.sentiment.subjectivity
                # VADER
                wiki = analyze_sentiment(record)
                ngram_analysis_sets[ngram_result_file][idx] = wiki
        print(f"✅ {ngram_result_file.split('.json')[0].split('_ngram')[0].upper()} dataset: Polarity and Subjectivity are computed.")

    # Export updates to JSON files
    with open(os.path.join(output_dir, ngram_result_file), 'w') as f:
        json.dump(ngram_analysis_sets[ngram_result_file], f, indent=4)
        print(f"✅ {ngram_result_file} updated with polarity and subjectivity")


✅ DEPRESSION dataset: Polarity and Subjectivity are computed.
✅ depression_ngram.json updated with polarity and subjectivity


KeyboardInterrupt: 

**Polarity** from TextBlob:
* A float within the range [-1.0, 1.0]. A score of -1 means the words are super negative, like “disgusting” or “awful.” A score of 1 means the words are super positive, like “excellent” or “best.”

**Subjectivity** from TextBlob:
* A float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

Source: <url>https://textblob.readthedocs.io/en/dev/quickstart.html</url>

**Polarity** from VADER:
* A float within the range [-1.0, 1.0]. A score of -1 means the words are super negative. A score of 1 means the words are super positive, like “excellent” or “best.”

**Positivity** from VADER:
* A float within the range [0.0, 1.0] indicating the proportion of Positive sentiments.

**Negativity** from VADER:
* A float within the range [0.0, 1.0] indicating the proportion of Negative sentiments.

**Neutral Score** from VADER:
* A float within the range [0.0, 1.0] indicating the proportion of Neutral sentiments.

Source: <url>https://www.nltk.org/api/nltk.html</url>