# COGS 118B - Project Proposal

# Names

- Sukhman Virk
- Athira Rajiv
- Neil Bajaj
- Yash Sharma
- Lucas Fasting

# Abstract 
In this project, we aim to improve customer satisfaction within the US airline industry using customer feedback in the form of tweets. The dataset consists of approximately 14,600 observations across 15 variables and includes the airline mentioned, the tweet text, the sentiment of the tweet, and other relevant fields. We plan to convert the customer feedback into numerical form using NLP techniques such as TF-IDF. We will then use K-Means or GMM clustering to group the feedback into clusters, each of which represents a different sentiment. Next, we will cluster within each sentiment group to identify areas for the airlines to work on (negative feedback) or to maintain (positive feedback). Our goal is to gain insight into sentiment regarding various aspects of airline service in order to improve overall customer satisfaction. We will measure the performance of the clustering using the silhouette score and PCA visualization to see how well separated our clusters are.

# Background
Social media platforms, particularly Twitter, have become a popular source of customer feedback in recent years. Because Twitter is a real-time platform used by millions of people, it is a valuable source for analyzing customer sentiment, which in turn can be a powerful tool for measuring customer satisfaction and identifying areas of improvement in industries such as the airline industry. NLP techniques can be used to preprocess and analyze Twitter text data to gather information about customers' opinions of different products and services<a name="note1"></a>[<sup>[1]</sup>](#note1).

Previous studies in this field have concentrated on a number of important factors. A significant amount of research has been done on the preprocessing and analysis of Twitter text data using natural language processing (NLP) techniques like TF-IDF, word input, and emotional vocabulary<a name="note2"></a>[<sup>[2]</sup>](#note2), which convert the text into a numerical format that is appropriate for machine learning algorithms.

Moreover, existing studies have classified tweets into various sentiments and used clustering algorithms to find themes or patterns within those sentiments. K-Means clustering, hierarchical clustering, and Gaussian mixture model (GMM) clustering are frequently employed for this purpose. One study examined several clustering approaches for sentiment analysis and found that clustering algorithms quickly and effectively separated tweets based on their sentiment scores<a name="note3"></a>[<sup>[3]</sup>](#note3).

As other studies have shown, sentiment analysis of tweets can help us infer people's opinions, sentiments, attitudes, and emotions from written text alone. Barreto, Moura, Carvalho, Paes, and Plastino address the sentiment classification of tweets, a task made challenging by the informal style of language, use of slang, and the presence of misspellings and grammatical inconsistencies<a name="note4"></a>[<sup>[4]</sup>](#note4). By evaluating not only traditional vector space models like Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) but also more advanced methods like Word2Vec, FastText, and GloVe, the study offers a more comprehensive understanding of the effectiveness of different word representation models in the context of Twitter sentiment analysis.

# Problem Statement

The problem we are trying to solve is how to improve the satisfaction of customers flying with US airlines. We will use clustering algorithms to group customer feedback pulled from Twitter and directed at various airlines operating in America. We then aim to identify clusters of different sentiments, which we can use to provide insight into which airlines have the most satisfied customers and which areas each airline needs to work on to increase customer satisfaction.

This problem is quantifiable because we can apply NLP techniques such as TF-IDF to convert the text into numerical form, which can then be used as input to clustering algorithms such as K-Means or GMM. The problem is measurable because we can assess how well the clustering has worked using metrics such as the silhouette score, which measures the cohesion and separation of the clusters. The performance of the clustering algorithm can also be assessed visually using PCA, by observing how well it groups customer feedback into distinct sentiments and into themes within those sentiments. The problem is also replicable: customer feedback keeps accumulating, so the same analysis can be performed on newer data to monitor changes in customer sentiment and satisfaction over time. As new data is collected, the clustering model can be refit to identify shifts in customer sentiment, helping airlines stay up to date with changing customer expectations.
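The text-to-clusters pipeline described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn is used (the proposal does not name a library); the example tweets are made up, and `k=2` is an arbitrary choice rather than a tuned value.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical stand-ins for cleaned tweet text; the real input would be
# the tweet text column of the dataset.
tweets = [
    "flight delayed three hours no update",
    "delayed again missed my connection",
    "flight cancelled and delayed rebooking",
    "great crew friendly service thank you",
    "thank you for the friendly service",
    "friendly crew great flight thank you",
]

# Convert each tweet into a TF-IDF weighted term vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(tweets)

# Group the vectors into k clusters (k=2 is an assumption for this sketch).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Mean silhouette coefficient over all tweets, in the range [-1, 1].
score = silhouette_score(X, labels)
print(labels, round(score, 3))
```

Swapping `KMeans` for `sklearn.mixture.GaussianMixture` would give the GMM variant, although that estimator expects a dense array, so the TF-IDF matrix would first need to be densified or reduced in dimension.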

# Data

https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment/data

This dataset has 15 variables and approximately 14,600 observations.

Each observation includes the tweet ID, the text of the tweet, the sentiment of the tweet towards the airline, the confidence that this sentiment label is correct, the reason for the negative feedback (if present), the confidence in that negative reason, and the name of the airline, among other fields.

The critical variables include the text of the tweet, the airline company, the retweet count, the time zone, and the time the tweet was created. The time is retained because there might be a relationship between the sentiment of a tweet and the time it was posted.

We will drop columns that are irrelevant to our analysis, such as the name of the tweet author, tweet coordinates, tweet location, airline sentiment gold, and negative reason gold, and we will remove rows whose negative reason confidence equals 0. We will also replace all non-alphanumeric characters, punctuation, and repeated whitespace with a single space.
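The cleaning steps above could look roughly like the sketch below, which uses only the Python standard library and treats each row as a dictionary (as `csv.DictReader` would produce). The column names are assumptions based on the Kaggle CSV header and should be checked against the file; keeping rows with a missing confidence value is also an assumption, since the text only specifies removing rows where the value is 0.

```python
import re

# Columns to drop; these names are assumed, not confirmed by the proposal.
DROP_COLUMNS = {
    "name", "tweet_coord", "tweet_location",
    "airline_sentiment_gold", "negativereason_gold",
}

def clean_text(text):
    """Replace each run of non-alphanumeric characters with one space."""
    # Leading and trailing spaces are also trimmed for convenience.
    return re.sub(r"[^A-Za-z0-9]+", " ", text).strip()

def keep_row(row):
    """Return False for rows whose negative-reason confidence is exactly 0."""
    value = row.get("negativereason_confidence", "")
    # Assumption: rows with a missing value are kept rather than dropped.
    return value == "" or float(value) != 0.0

def clean_row(row):
    """Drop the unused columns and normalise the tweet text."""
    kept = {k: v for k, v in row.items() if k not in DROP_COLUMNS}
    kept["text"] = clean_text(kept["text"])
    return kept

# Hypothetical example rows, not taken from the dataset.
rows = [
    {"airline": "United", "name": "some_user",
     "text": "@united   flight #123 was LATE!!! :(",
     "negativereason_confidence": "0.7"},
    {"airline": "Delta", "name": "other_user",
     "text": "thanks @delta",
     "negativereason_confidence": "0"},
]
cleaned = [clean_row(r) for r in rows if keep_row(r)]
print(cleaned[0]["text"])  # united flight 123 was LATE
```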

# Proposed Solution

Our proposed solution is a multi-stage clustering pipeline, as outlined in the abstract and problem statement. First, we clean the tweet text and convert it into numerical feature vectors using TF-IDF. Second, we apply K-Means or GMM clustering to group the tweets into clusters that represent different sentiments. Third, we cluster the tweets within each sentiment group again to surface themes: areas to improve (negative feedback) and areas to maintain (positive feedback) for each airline.

We will test the solution by computing the silhouette score of the resulting clusters and by inspecting a PCA projection of the clustered data.

# Evaluation Metrics

Since clustering is our main algorithm, the silhouette score will be the main evaluation metric for our project. The silhouette coefficient measures how well each data point fits into its assigned cluster, capturing both cohesion (how close the point is to the other points in its cluster) and separation (how far it is from the points in other clusters). Its value lies between -1 and 1. A score of 1 is the best case: data point i is very close to the other members of its own cluster and far away from the clusters it does not belong to. Scores near 0 indicate overlapping clusters, and negative scores suggest that the point may have been assigned to the wrong cluster. This metric will help us determine how good our clustering is and whether we are ready to move on to the next step of our project (topic modeling), in which we identify the main theme, positive points, and negative points for each cluster and data point. A silhouette score of exactly 1 is not always ideal, because it can be caused by overfitting the data; we will therefore aim for a silhouette score of around 0.8.
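To make the metric concrete, the sketch below computes the silhouette coefficient from its standard definition, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from point i to the other members of its cluster and b(i) is the smallest mean distance from i to the members of any other cluster. It is a plain-Python illustration on made-up 2D points, assuming Euclidean distance and at least two clusters; in practice a library routine would be used on the TF-IDF vectors.

```python
from math import dist

def silhouette_samples(points, labels):
    """Return s(i) = (b - a) / max(a, b) for every point."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from the point to itself contributes 0 to the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return scores

# Hypothetical points forming two well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]

good = silhouette_samples(points, [0, 0, 1, 1])
bad = silhouette_samples(points, [0, 1, 0, 1])
print(sum(good) / len(good))  # about 0.90: compact, well-separated clusters
print(sum(bad) / len(bad))    # negative: points grouped with the wrong neighbours
```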

# Ethics & Privacy

Several ethical and privacy issues are raised by the proposed project, which aims to summarize customer satisfaction for US airlines by analyzing customer feedback from Twitter. These issues center on data biases, the responsible and reasonable use of social media data, and the protection of both individual and corporate privacy.

To protect people's privacy, we will make sure that any customer feedback data we use from the Twitter dataset is anonymized. This means that any names, Twitter handles, or other information that could be used to link a tweet to a specific person will be removed or obscured. The dataset already anonymizes the users' identities, but we will double-check this while performing our EDA. This step is essential to complying with data protection laws and safeguarding the privacy of individuals and companies.

We might encounter a data bias in which most of the data concerns a single airline. To handle this, we will make sure each airline is equally represented, either by shrinking the dataset (which is not ideal) or by using SMOTE to create synthetic data so that each airline contributes a comparable number of observations.
Furthermore, the analysis may be affected by biases introduced by the use of machine learning methods such as clustering and Natural Language Processing (NLP). Biases might originate from the way the data is interpreted, how the algorithms are applied, or the way the data was gathered. Sentiment analysis, for example, may unintentionally favor statements or attitudes that are more prevalent in particular groups of people, which could distort the findings. To prevent proper nouns from skewing the sentiment analysis, we will replace all proper nouns with whitespace or anonymous placeholder words. Additionally, to reduce these risks, we will thoroughly assess and preprocess the data to detect and minimize bias and make our analysis as impartial and inclusive as feasible.
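The "shrink the dataset" option for balancing airlines can be sketched as random downsampling of every airline to the size of the smallest group. This is an illustrative standard-library version under assumed names (`downsample_by_airline` and the `airline` key are hypothetical); the SMOTE alternative is not shown because it relies on a separate library.

```python
import random
from collections import Counter, defaultdict

def downsample_by_airline(rows, key="airline", seed=0):
    """Randomly keep the same number of rows for every airline."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Every airline is reduced to the size of the smallest group.
    target = min(len(members) for members in groups.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    balanced = []
    for members in groups.values():
        balanced.extend(rng.sample(members, target))
    return balanced

# Hypothetical imbalanced input: five tweets for airline A, two for B.
rows = [{"airline": "A", "text": f"a{i}"} for i in range(5)]
rows += [{"airline": "B", "text": f"b{i}"} for i in range(2)]

balanced = downsample_by_airline(rows)
print(Counter(row["airline"] for row in balanced))
```

The trade-off is that downsampling discards data from the over-represented airlines, which is why the text above calls it not ideal.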

We also recognize the value of being open and honest about our process, findings, and outcomes. The actions taken to address privacy and ethical issues, including the identification and resolution of biases, will be recorded and made available. Finally, we recognize that although clustering algorithms can offer insight into customer sentiment, they cannot fully represent the complexities of human emotions and experiences.

# Team Expectations 

* Team Expectation 1: The team is expected to communicate through a group chat or Discord, holding meetings twice weekly on Tuesdays and Fridays. If a team member cannot attend a meeting, they must notify the group, who will then update them on the meeting's outcomes and gather their feedback to keep everyone informed.

* Team Expectation 2: Should any internal disputes arise, we'll convene all team members to discuss and resolve the issue through open communication, with the group collectively deciding on the necessary steps to address the problem.

* Team Expectation 3: During our meetings, we'll collaborate on the project and distribute any remaining tasks evenly among team members. Anyone wanting to propose changes to the project must share their ideas with the rest of the team.

* Team Expectation 4: We expect all of the work to be completed to a high standard, with each team member taking responsibility for their contributions. Work submitted by any member will be reviewed by everyone else on the team before it is added to the Jupyter notebook.

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/12  |  3 PM |  N/A  | Determine best form of communication for our team; Discuss and decide on final project topic ideas; discuss potential datasets. | 
| 2/18  |  5 PM |  Background Research for our decided project topic. | Finalize Project dataset, project question and proposed solution. | 
| 2/20  | 2 PM  | Edit, finalize, and complete proposal. | Go over our project proposal to make sure everything is in order; submit proposal. |
| 2/25  | 4 PM  | Import & wrangle data; do some EDA. | Complete EDA; discuss clustering ideas. |
| 3/4  | 2:30 PM  | Work on clustering and programming for the project; start uncovering initial solutions. | Discuss findings and solutions; plan the last few steps of the project. |
| 3/11  | 1 PM  | Complete analysis; draft results/conclusion/discussion. | Edit, finalize, check, and complete the project; list all solutions and conclusions. |
| 3/19  | Before 11:59 PM  | N/A | Turn in Final Project  |

# Footnotes
<a name="note1"></a>1.[^](#note1):  Gohil, S., Vuik, S., & Darzi, A. (2018). Sentiment Analysis of Health Care Tweets: Review of the Methods Used. JMIR Public Health. https://pubmed.ncbi.nlm.nih.gov/29685871/

<a name="note2"></a>2.[^](#note2):  Hasan, M. R., Maliha, M., & Arifuzzaman, M. (2019). Sentiment Analysis with NLP on Twitter Data. 2019 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering. https://ieeexplore.ieee.org/abstract/document/9036670

<a name="note3"></a>3.[^](#note3):  Ahuja, S., & Dubey, G. (2017). Clustering and sentiment analysis on Twitter data. 2017 2nd International Conference on Telecommunication and Networks (TEL-NET). https://ieeexplore.ieee.org/abstract/document/8343568

<a name="note4"></a>4.[^](#note4): Barreto, S. et al. (2023). Sentiment analysis in tweets: an assessment study from classical to modern word representation models. Data Mining and Knowledge Discovery, 37, 318–380. https://link.springer.com/article/10.1007/s10618-022-00853-0<br>