# COGS 118B - Final Project

# Names


- Audrey Liang
- Geena Limfat
- Nate Mead
- Neha Sharma
- Daphne Wu

# Abstract 

# Background

The task of identifying tweets that are indicative of suicidal intent through machine learning techniques intersects with several research areas, notably the fields of sentiment analysis, natural language processing (NLP), and mental health information. The use of machine learning for sentiment analysis has been a subject of extensive research, aiming to understand, interpret, and classify emotions expressed in text. This work lays the groundwork for more specialized applications, including our case of detecting suicidal intent in social media posts.

Sentiment analysis itself is a well-established field within NLP, focusing on determining the attitude, emotions, and opinions expressed in text data. The approach has been applied across various domains, from customer feedback analysis to political sentiment assessment. Early works in this particular area utilized simple lexicon-based methods and gradually evolved to incorporate sophisticated machine learning algorithms, like support vector machines and Naive Bayes. These advancements have significantly improved the accuracy and depth of sentiment analysis.

When it comes to detecting distress or suicidal ideation in text, researchers have explored several methods, including keyword detection, machine learning models, and, more recently, deep learning approaches. One of the primary difficulties in this area is the nuanced and context-dependent nature of the language used to express distress or suicidal thoughts, which requires models to understand not only the semantic content of the text but also its emotional undertones. Studies have shown that machine learning algorithms can be trained to detect undertones of psychological distress with reasonable accuracy by analyzing patterns in text data derived from social media platforms.

In the realm of mental health information, leveraging social media data for mental health surveillance and intervention has become an increasingly researched topic. The ubiquity of social media usage provides a vast dataset of user-generated content that can offer insight into the mental health status of individuals in real time. This has led to the development of predictive models that aim to identify signs of mental health issues, such as depression and anxiety, as well as suicidal ideation, by analyzing social media posts (4).

However, ethical considerations remain. Privacy and consent, along with the technical challenges of ensuring accuracy and reducing false positives, are significant hurdles that research in this area continues to face. As models become more sophisticated and datasets grow larger, there must be ongoing work to improve the sensitivity and specificity of detection algorithms aimed at sensitive groups of the population. Additionally, interdisciplinary collaboration between computer scientists, psychologists, and other mental health professionals is crucial to address the complex ethical issues surrounding the surveillance of individuals for signs of suicidal intent without their consent. It is a very sensitive topic, and people deserve some level of privacy, regardless of what an algorithm might predict.



# Problem Statement

The problem we are addressing involves identifying tweets that are indicative of suicidal intent using machine learning techniques. This task can be more broadly categorized under sentiment analysis, but with a specific focus on detecting signals of distress or suicidal thoughts. Our overall goal is to leverage natural language processing and unsupervised machine learning algorithms to uncover patterns in the data that correlate with the ground truth labels from our dataset. We want to develop a machine learning model that can identify tweets with suicidal intent based on their text content.

Our dataset consists of tweets, each labeled as either indicative of suicidal intent (1.0) or not (0.0). The tweets will have undergone basic data cleaning. Our feature set will be generated primarily through the Bag of Words approach and sentiment analysis via VADER. Bag of Words will transform the tweet text into a vector of word counts, capturing the frequency of each word within the body of text. The VADER tool will be used to analyze the sentiment of each tweet, producing a compound score that reflects the overall sentiment—positive or negative. We will apply Principal Component Analysis to reduce the dimensionality of the feature space, aiming to retain the most informative aspects while reducing computational complexity. We will implement unsupervised machine learning algorithms like KMeans and Gaussian Mixture Models to cluster the tweets based on the reduced feature set. The goal is to see if clusters formed align with the ground truth labels of suicidal intent.

This process makes the problem quantifiable, as it involves quantifying the presence of words and sentiment in tweets, which are then expressed numerically. The problem is measurable: the cohesion and separation of the clusters can be assessed with the elbow method, and the alignment of the clustering results with the ground truth labels can be measured with metrics such as the F1 score. Our problem setup is replicable, as it relies on structured approaches like Bag of Words, VADER, Principal Component Analysis, and clustering algorithms that can be systematically applied to the tweet dataset or to similar datasets with known labels in the future. Our approach seeks not just to detect suicidal intent in tweets based on analysis of their textual content, but also to explore and validate the effectiveness of combining natural language processing with unsupervised machine learning for this critical application.


# Data

We obtained the tweet data from the Twitter API using the Tweepy Python library. The dataset consists of approximately 10,000 tweets related to various topics, including those indicating suicide risk. Before analysis, we performed data cleaning to remove duplicates, irrelevant tweets, and retweets.

### Dataset Source: Kaggle Twitter Dataset
- Dataset Size: The dataset comprises approximately 10,000 tweets.
- Observation Description: Each observation represents a single tweet retrieved from the Twitter API. An observation consists of several variables, including the tweet text, user ID, timestamp, and metadata such as retweet count and favorite count.
- Critical Variables:
    - Tweet Text: Contains the actual content of the tweet.
    - Suicide Classification: Binary label indicating whether the tweet is classified as suicidal (1.0) or not (0.0).

### Data Cleaning Process:
- Removing Empty Rows: We dropped any empty rows present in the dataset.
- Punctuation Removal: We removed punctuation from the tweet text to ensure consistency in text processing.

The detailed code for data cleaning is kept in a separate notebook, while the summary of the cleaning process is provided here for readability.
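The two cleaning steps summarized above can be sketched roughly as follows. This is a minimal, dependency-free illustration rather than the code from our cleaning notebook: the rows in `raw_rows` are made up, and treating whitespace-only text as an "empty row" is an assumption.

```python
import string

# Hypothetical raw rows of (tweet text, label); None or blank text marks an empty row.
raw_rows = [
    ("I can't sleep again...", 1.0),
    (None, 0.0),
    ("Great game tonight!!!", 0.0),
    ("   ", 0.0),
]

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_rows(rows):
    """Drop empty rows, then strip punctuation from the remaining tweet text."""
    cleaned = []
    for text, label in rows:
        if text is None or not text.strip():
            continue  # assumption: whitespace-only text counts as empty
        cleaned.append((text.translate(PUNCT_TABLE).strip(), label))
    return cleaned


cleaned_rows = clean_rows(raw_rows)
```

Note that removing punctuation also removes apostrophes, so contractions such as "can't" become "cant" before vectorization.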


# Proposed Solution

The approach we are using to identify tweets that indicate suicide risk is a multi-step process that combines sentiment analysis techniques with clustering algorithms. After preprocessing the tweets using TF-IDF (Term Frequency-Inverse Document Frequency) to extract features, we apply the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis tool to assign sentiment scores to each tweet. Then, we reduce the dimensionality of the feature space using Principal Component Analysis (PCA) to capture the most significant information while minimizing computational complexity.

After the preprocessing steps, we will use two clustering algorithms, K-means and DBSCAN, to partition the tweets into distinct groups based on their sentiment and content similarities. K-means partitions the data into K clusters, where K is determined through experimentation or domain knowledge. DBSCAN, on the other hand, identifies dense regions of tweets in the feature space, allowing for the detection of outliers and noise.

To incorporate sentiment analysis with a bag-of-words approach using VADER, we assign sentiment scores to individual words in each tweet based on their presence in lists of positive and negative words. These sentiment scores are then aggregated to derive an overall sentiment score for each tweet.

### Implementation:

- Preprocessing: We'll use Python's scikit-learn library to perform TF-IDF transformation and PCA for dimensionality reduction.
- Sentiment Analysis: VADER sentiment analysis will be implemented using the nltk library in Python.
- Clustering Algorithms: K-means and DBSCAN will be implemented using the scikit-learn library.
- Bag-of-Words Approach: We'll use custom scripts to assign sentiment scores to words based on predefined lists of positive and negative words.
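The word-list scoring described in the last bullet could look something like the sketch below. The word lists and the averaging rule are hypothetical placeholders chosen for illustration; VADER itself uses a much larger weighted lexicon plus rules for negation and intensifiers, so this is a simplified stand-in and not VADER.

```python
# Hypothetical word lists; a real run would rely on a full lexicon such as VADER's.
POSITIVE_WORDS = {"happy", "love", "great", "hope"}
NEGATIVE_WORDS = {"sad", "hate", "alone", "hopeless"}


def word_score(word):
    """Return +1 for a positive word, -1 for a negative word, 0 otherwise."""
    if word in POSITIVE_WORDS:
        return 1
    if word in NEGATIVE_WORDS:
        return -1
    return 0


def tweet_sentiment(text):
    """Aggregate word scores into one tweet-level score in [-1, 1] (mean over words)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word_score(w) for w in words) / len(words)


example_scores = [tweet_sentiment(t) for t in ["I love this great day", "so sad and alone"]]
```

Averaging over all words keeps the score comparable across tweets of different lengths; summing instead would favor longer tweets.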

### Evaluation:
To evaluate the effectiveness of our solution, we'll use labeled datasets to assess the accuracy of identifying tweets indicating suicide risk. We'll compare the clustering results obtained from K-means and DBSCAN against the ground truth labels to measure clustering performance. Additionally, we'll assess the correlation between sentiment scores assigned to tweets and their corresponding ground truth labels.
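As a rough illustration of the correlation check mentioned above, the Pearson correlation between a continuous score and a 0/1 label (the point-biserial correlation) can be computed with NumPy. The sentiment values and labels below are invented for the example and are not taken from our dataset.

```python
import numpy as np

# Invented sentiment scores (e.g. VADER compound values) and matching 0/1 labels.
sentiment = np.array([-0.8, -0.6, -0.7, 0.5, 0.3, 0.1, -0.2, 0.6])
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])

# Pearson correlation with a binary variable is the point-biserial correlation.
correlation = np.corrcoef(sentiment, labels)[0, 1]
```

A strongly negative value would mean that tweets labeled as suicidal tend to carry more negative sentiment scores.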

### Benchmark Model:
As a benchmark model, we'll compare our solution's performance against a baseline model that employs basic text processing techniques without sentiment analysis. This baseline model may involve simple clustering algorithms applied directly to the TF-IDF transformed data without incorporating sentiment scores.
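A baseline of this kind might be set up as in the sketch below, assuming scikit-learn is available. The example tweets, `max_features=1000`, and `n_clusters=2` are illustrative assumptions rather than settings taken from our experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example tweets standing in for the cleaned dataset.
tweets = [
    "feeling hopeless and alone tonight",
    "nobody would miss me anyway",
    "great workout at the gym today",
    "excited for the concert this weekend",
    "everything feels hopeless lately",
    "love this sunny weather today",
]

# TF-IDF features only: no sentiment column is appended in the baseline.
vectorizer = TfidfVectorizer(max_features=1000)
X_tfidf = vectorizer.fit_transform(tweets)

# K-Means applied directly to the TF-IDF matrix.
baseline = KMeans(n_clusters=2, n_init=10, random_state=0)
baseline_labels = baseline.fit_predict(X_tfidf)

# By default TfidfVectorizer L2-normalizes each row, so every tweet has unit length.
row_norms = np.linalg.norm(X_tfidf.toarray(), axis=1)
```

Comparing the metrics of this baseline with those of the full pipeline would show how much the sentiment feature contributes.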

With this approach, our goal is to provide a robust and interpretable solution for identifying tweets indicating suicide risk, using the power of sentiment analysis and clustering techniques.

# Evaluation Metrics



We use three metrics, each comparing cluster assignments against the ground truth labels. For all three, scores closer to 1 indicate better model performance: an F1 score of 1 means perfect precision and recall, and a Rand index of 1 means the clustering is in perfect agreement with the true labels. The values below are for K-Means:


- **F1 Score**: 0.9311377245508982 - the harmonic mean of the precision and recall of the model's predictions. It is particularly useful in binary classification (which we are doing). K-means produced the highest F1 score compared to GMM and DBSCAN.
- **K-Means Rand**: 0.6050382484832498 - measures the similarity between two clusterings as the fraction of point pairs on which they agree. K-means produced the second-highest score, 0.605, which indicates moderate agreement between the k-means clusters and the truth labels.
- **K-Means Adjusted Rand**: 0.1793769407924069 - the Rand index corrected for chance. As it is relatively low, the agreement between clustering labels and truth labels is weak, and some of that agreement may be due to chance.
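The sketch below shows one way these three metrics can be computed with scikit-learn. Because cluster ids are arbitrary, F1 needs a mapping from clusters to labels; the majority-vote mapping in `align_clusters` is an assumption made for this example, and the toy labels are invented. The toy data is deliberately imbalanced to show that a high F1 score and a low adjusted Rand index can occur together.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, f1_score, rand_score


def align_clusters(y_true, clusters):
    """Map each cluster id to the majority true label among its members."""
    y_true = np.asarray(y_true)
    clusters = np.asarray(clusters)
    mapped = np.empty_like(y_true)
    for c in np.unique(clusters):
        mask = clusters == c
        values, counts = np.unique(y_true[mask], return_counts=True)
        mapped[mask] = values[np.argmax(counts)]
    return mapped


# Invented, imbalanced toy data: 90 positive tweets and 10 negative tweets.
y_true = np.array([1] * 90 + [0] * 10)
# Invented cluster assignment that barely separates the two classes.
clusters = np.array([0] * 85 + [1] * 5 + [0] * 8 + [1] * 2)

y_pred = align_clusters(y_true, clusters)

f1 = f1_score(y_true, y_pred)                 # needs aligned labels
ri = rand_score(y_true, clusters)             # invariant to cluster ids
ari = adjusted_rand_score(y_true, clusters)   # chance-corrected Rand index
```

In this toy case both clusters map to the majority class, so F1 is high even though the clustering carries little information about the labels, which the adjusted Rand index reflects.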


# Results:
## Analysis of Tweet Data for Identifying Suicidal Tendencies

The primary goal of our research is to detect and uncover patterns within tweet data that could suggest a risk of suicide. We do this using unsupervised machine learning techniques. We take the textual content of tweets and transform it into numerical data through Bag of Words (BoW) and sentiment analysis using VADER. Following this, we apply dimensionality reduction, clustering, and model evaluation techniques in order to analyze our data effectively.

### Data Preprocessing and Feature Extraction
We began by loading our dataset, which includes tweet texts and a binary indicator of suicidal tendency (whether or not the post might be suicidal). Our clustering used only the text: we dropped the label column from the feature set and kept it aside for evaluation. After excluding missing data (dropping empty rows), we extracted 1,000 features using BoW and appended sentiment scores from VADER to our feature set.
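A minimal sketch of this feature construction, assuming scikit-learn's `CountVectorizer` for the BoW step, is shown below. The tweets are made up, and the `compound` array holds placeholder numbers standing in for VADER output; in a real run each value would come from something like `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` in `nltk`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up cleaned tweets.
tweets = [
    "i feel so alone tonight",
    "best day ever with friends",
    "cant take this anymore",
    "so happy about the new job",
]

# Bag of Words: one column per word, capped at the 1,000 most frequent words.
vectorizer = CountVectorizer(max_features=1000)
X_bow = vectorizer.fit_transform(tweets).toarray()

# Placeholder sentiment scores (assumption: stand-ins for VADER compound values).
compound = np.array([-0.6, 0.8, -0.4, 0.7])

# Append the sentiment score as one extra feature column.
features = np.hstack([X_bow, compound[:, None]])
```

Because the word counts and the sentiment score live on different scales, it may be worth standardizing the columns before PCA; whether that helps here is untested.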

### Dimensionality Reduction with PCA
Principal Component Analysis (PCA) was used to reduce the dimensionality of our feature set, projecting the high-dimensional BoW data onto two principal components for easier visualization and clustering.
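The reduction step can be sketched as follows with scikit-learn's `PCA`. The random count matrix is a synthetic stand-in for the real feature set, so the retained-variance number it produces says nothing about our data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the BoW feature matrix (200 tweets, 50 word counts).
rng = np.random.default_rng(0)
features = rng.poisson(0.3, size=(200, 50)).astype(float)

# Project onto the first two principal components.
pca = PCA(n_components=2, random_state=0)
reduced = pca.fit_transform(features)

# Fraction of the total variance kept by the two components.
retained_variance = pca.explained_variance_ratio_.sum()
```

Checking `retained_variance` is a quick way to gauge how much information a two-component projection discards.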

### Determining Optimal Clusters with Elbow Method
We used the Elbow Method to identify the optimal number of clusters for K-Means clustering. Our analysis found a distinct "elbow" at $k=2$, which suggested that two clusters were the best choice for our data (see the figure below).

<img src="pics/kmeansElbow.png" alt="Elbow Method" width="400" height="300"/>

The graph plots the Within-Cluster Sum of Squares (WCSS) against the number of clusters. WCSS serves as a measure of the variance within each cluster; a lower WCSS signifies that data points are more tightly grouped around the cluster centroids. The graph indicates that the optimal number of clusters is two: beyond this point the decrease in WCSS levels off, suggesting that adding more clusters yields minimal benefit. This finding is consistent with our dataset, which requires categorizing tweets into two distinct categories: suicidal and non-suicidal.
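An elbow curve of this kind can be produced roughly as in the sketch below, where scikit-learn's `inertia_` attribute is the WCSS. The two synthetic blobs stand in for the PCA-reduced tweets, and the range of `k` is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the PCA-reduced data: two well-separated blobs.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal((-3.0, 0.0), 0.5, size=(150, 2)),
    rng.normal((3.0, 0.0), 0.5, size=(150, 2)),
])

# Fit K-Means for each candidate k and record the WCSS (inertia_).
wcss = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    wcss[k] = model.inertia_

# Drop in WCSS gained by going from k-1 to k clusters; the elbow is where it collapses.
drops = {k: wcss[k - 1] - wcss[k] for k in range(2, 7)}
elbow_k = max(drops, key=drops.get)
```

Picking the `k` with the largest single drop is only a simple heuristic; on real data the elbow is usually judged by eye from the plot.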

### Clustering with K-Means
We applied K-Means with the chosen number of clusters, which revealed two distinct groupings in our PCA-reduced feature space, as shown in the figure below. The centroid of each cluster, the central point around which its tweets are grouped, is marked with a red x.

<img src="pics/kmeansCluster.png" alt="K-Means Clustering with PCA" width="400" height="300"/>

**Analysis/explanation of graph**: The scatter plot shows the K-Means clusters in the PCA-reduced space, with the two axes representing the first two principal components. Each tweet is color-coded by cluster, corresponding to potentially suicidal or non-suicidal sentiment. Red crosses mark the cluster centroids.
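To make the role of the centroids concrete, the sketch below fits K-Means with two clusters and checks that each centroid is the mean of the points assigned to it. The data is synthetic and the parameters are illustrative assumptions, not our exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the PCA-reduced tweets.
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal((-3.0, 0.0), 0.5, size=(100, 2)),
    rng.normal((3.0, 0.0), 0.5, size=(100, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
assignments = kmeans.labels_          # cluster id for every point
centroids = kmeans.cluster_centers_   # the points drawn as red crosses

# After convergence, each centroid is the mean of its assigned points.
manual_centroids = np.array(
    [points[assignments == c].mean(axis=0) for c in range(2)]
)
```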

#### Evaluation metrics for K-means clustering: 
- **F1 Score**: 0.9311377245508982 - K-means produced the highest F1 score compared to GMM and DBSCAN.
- **K-Means Rand**: 0.6050382484832498 - K-means produced the second-highest score, 0.605, which indicates moderate agreement between the k-means clusters and the truth labels.
- **K-Means Adjusted Rand**: 0.1793769407924069 - K-means produced a relatively low score. The agreement between clustering labels and truth labels is weak, and some of that agreement may be due to chance.

The main objective is to closely reflect the labeled classification of tweets into "suicidal" or "not suicidal". In this case, the high F1 Score suggests success and indicates that the clustering effectively discerns sentiment. However, the low adjusted Rand index cautions that this success may be overstated.

### Model Selection with Gaussian Mixture Model (GMM)
To verify the clustering results and explore alternative models, we fitted a Gaussian Mixture Model (GMM). We used the Akaike Information Criterion (AIC) to identify the best-fitting model complexity. The AIC suggested a preference for a model with fewer components, as depicted in the figure below, which aligned with our K-Means results. The resulting GMM clusters, visualized further below, also suggested a meaningful segmentation of the data.

<img src="pics/AICGMM.png" alt="AIC for Model Selection" width="400" height="300"/>

**Analysis/explanation of graph**: To determine the number of components for GMM clustering, we plotted the Akaike Information Criterion (AIC) against the number of GMM components, where a lower AIC signifies a better fit after penalizing model complexity. There is a noticeable decline from 1 to 3 components, followed by a plateau, indicating minimal improvement beyond three. We selected two components, the simpler model, which is consistent with the K-Means result and with the binary labels in our dataset. This segmentation divides tweets into positive and negative sentiment categories, corresponding to non-suicidal and potentially suicidal sentiments. Visualizations confirm the presence of two distinct groups within the dataset.
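The AIC comparison can be sketched as below using scikit-learn's `GaussianMixture`, whose `aic` method returns the criterion for a fitted model. The data is synthetic and the candidate range of 1 to 5 components is an assumption made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the PCA-reduced tweets: two Gaussian blobs.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal((-3.0, 0.0), 0.5, size=(150, 2)),
    rng.normal((3.0, 0.0), 0.5, size=(150, 2)),
])

# Fit one GMM per candidate component count and record its AIC (lower is better).
aic = {}
for n in range(1, 6):
    gmm = GaussianMixture(n_components=n, random_state=0).fit(points)
    aic[n] = gmm.aic(points)

best_n = min(aic, key=aic.get)
```

AIC can keep improving slightly as components are added, so the count with the strictly lowest value is not always the one worth choosing; looking for where the curve flattens is a common alternative.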

<img src="pics/GMMCluster.png" alt="Gaussian Mixture Model Clustering" width="400" height="300"/>

**Analysis/explanation of graph**: The scatter plot displays GMM clustering results on PCA-reduced tweet data. Each point represents a tweet, positioned by its values on the first two principal components. Clustering reveals two sentiment groups, possibly indicating suicidal ideation.

#### Evaluation metrics for GMM: 
- **GMM F1 Score**: 0.9304973037747154 - produced the second-highest score behind k-means, suggesting that the clusters produced by GMM align with the true binary labels in the tweet dataset.
- **GMM Rand**: 0.6040120083908003 - produced the lowest score of the three models, 0.604. Still, it indicates moderate agreement between the GMM clusters and the ground truth labels.
- **GMM Adjusted Rand**: 0.177437612438209 - produced a relatively low score. The agreement between clustering labels and ground truth labels is weak, and some of that agreement may be due to chance.

As mentioned, the main objective is to closely reflect the labeled classification of tweets into "suicidal" or "not suicidal". Similar to K-means clustering, GMM clustering produces a high F1 Score, suggesting success. However, the low adjusted Rand index indicates that this success may be overstated.

### Alternative Clustering with DBSCAN
We also tried Density-Based Spatial Clustering of Applications with Noise (DBSCAN) as an alternative method that identifies clusters based on density. However, the results, shown in the figure below, did not yield clusters as distinct as those from K-Means or GMM.

<img src="pics/DBSCANCluster.png" alt="DBSCAN Clustering with PCA" width="400" height="300"/>

**Analysis/explanation of graph**: This scatter plot shows DBSCAN clustering results on PCA-reduced tweet data. DBSCAN identifies clusters based on density, without presetting the number of clusters. Each point represents a tweet, colored by cluster. Noise points are labeled with -1. Algorithm parameters were optimized for silhouette score. Clusters may indicate varying levels of concern for suicidal ideation.
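The sketch below illustrates how DBSCAN marks noise with the label -1, using scikit-learn's `DBSCAN`. The data is synthetic, and `eps=0.5` and `min_samples=5` are illustrative values, not the parameters selected by our silhouette-based search.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data: two dense blobs plus two isolated points far from both.
rng = np.random.default_rng(0)
dense_a = rng.normal((-3.0, 0.0), 0.3, size=(100, 2))
dense_b = rng.normal((3.0, 0.0), 0.3, size=(100, 2))
isolated = np.array([[0.0, 8.0], [0.0, -8.0]])
points = np.vstack([dense_a, dense_b, isolated])

# eps is the neighborhood radius; min_samples is the density threshold.
db = DBSCAN(eps=0.5, min_samples=5).fit(points)
labels = db.labels_

# Noise points get label -1 and do not count as a cluster.
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
```

The number of clusters is an output of the algorithm rather than an input, which is why DBSCAN can return a different count than the two clusters used for K-Means and GMM.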

#### Evaluation metrics for DBSCAN: 
- **DBSCAN F1 Score**: 0.908590645062672 - produced a high score, but the lowest of the three models used. Still, it suggests that the clusters produced by DBSCAN align with the true binary labels in the tweet dataset.
- **DBSCAN Rand**: 0.632388740249463 - produced the highest score of the three models, 0.632. It indicates somewhat more, but still moderate, agreement between the DBSCAN clusters and the truth labels.
- **DBSCAN Adjusted Rand**: 0.2432863426143979 - produced the highest score of the three models. Compared to the other models, there is more agreement between clustering labels and ground truth labels, though some agreement may still be due to chance.

While the quantitative metrics suggest that DBSCAN's performance is on par with, if not slightly better than, K-Means and GMM, the visual comparison of the clusters shows that DBSCAN created clusters that are less cohesive and labeled many points as outliers. This is not desirable for the kind of analysis we want to do on our tweet data. Therefore, GMM and K-means are preferable to DBSCAN.


# Discussion

### Interpreting the Result
Our analysis, using Principal Component Analysis (PCA), K-Means clustering, and Gaussian Mixture Models (GMM), and guided by the Elbow Method and the Akaike Information Criterion (AIC) for model selection, reveals two primary clusters within the tweet dataset. These clusters are likely associated with tweets with and without indications of suicidal ideation, based on sentiment valences derived from the data, which can be seen in the `sentiment_labels` text files.

**Main Point**: 
The combination of PCA with K-Means clustering effectively segregated the tweets into two distinct categories. This is further supported by a high F1 Score, indicating that our model is proficient in distinguishing between tweets that may or may not be related to suicidal ideation.

**Secondary Points**:
1. Validation through Elbow Method and AIC: The Elbow Method clearly indicated an optimal number of clusters, further corroborated by the AIC during the application of GMM. This dual-validation approach ensures that our clustering choices are empirically sound and methodologically robust.
   
2. Algorithm Selection: The comparative analysis reveals that DBSCAN was less effective than K-Means and GMM for this particular dataset; with DBSCAN, we were unable to obtain clusters as well defined as those from the other two methods. This underscores the importance of selecting clustering algorithms that align with the data's characteristics and the analysis objectives.

3. Evaluation Metrics Insight: The F1 Score supports the effectiveness of our clustering approach, while the Rand Index and Adjusted Rand Index provide a more nuanced view of the clustering quality. These metrics support our model selections but also highlight the complexities within the tweet data that warrant further investigation. While the F1 Scores were very high, the Adjusted Rand Indices were considerably lower, which indicates that part of the agreement between the clusters and the labels could be due to chance.

4. Algorithm Exploration: Our exploratory approach, which included K-Means, GMM, and DBSCAN, demonstrates the value of assessing multiple clustering methods. The performance alignment between GMM and K-Means reinforces our findings and supports a comprehensive examination of the data. This is also what we expected, since the ground truth labels split the data into only two groups: Potential Suicide Posts and Not Suicide Posts.

### Limitations

While our results are promising, the approach we used has limitations. Our results depend heavily on the range of the data and on the hyperparameters chosen for the clustering methods, so exploring a broader hyperparameter space may enhance model performance. Additionally, although dimensionality reduction through PCA is beneficial for visualization, it could discard important information that would improve clustering accuracy. More data could also strengthen the model's robustness and allow for a more thorough understanding of the nuanced language patterns associated with suicide risk in text.

### Ethics & Privacy

It’s critical when examining large, sensitive groups of users on any platform to avoid singling out any one person because of their online activity. Suicide as a topic is extremely sensitive and personal, and many people post on Twitter about suicidal ideation and depression as a way to vent, rather than with the intention of following through. It’s important to keep the activities and identities of individuals private, and to focus on overall trends within populations rather than a specific person’s expressions of personal issues and feelings. Twitter profiles illustrate individual activity that can be linked to personal emails and that sits alongside posts irrelevant to our study. We have tried to anonymize profiles as best we can to minimize identifying information that could out specific users for the content they produce and discuss. Additionally, we would like to avoid unintended consequences such as the misuse of these results or the potential stigmatization of people based on this model's output.

### Conclusion

Our analysis indicated that machine learning can aid in providing valuable insights into identifying patterns associated with suicidal tendencies in social media texts. Our study's main takeaway is the potential of machine learning in identifying tweets with suicidal content, with the F1 Score substantiating the effectiveness of our clustering approach. The work fits within the broader context of mental health monitoring through social media, a growing field that leverages data analytics to offer timely interventions. Future work should look into richer feature sets, alternative dimensionality reduction techniques, and the ethical deployment of such models. Enhanced model tuning and validation, alongside collaboration with mental health professionals, could see these techniques contributing significantly to preventive healthcare measures.

