# Business Understanding

Our company wants to factor in sentiment about our products expressed on social media platforms, namely X (formerly Twitter), where millions of users organically share positive, negative, or neutral opinions about the products they use. While reviews left through official platforms such as Yelp or Amazon are important to consider, the pool of reviewers is limited, as not everyone will go the extra mile to leave a review through those "official" channels. Taking the more organic feedback expressed on platforms like X into consideration can give us an extra layer of insight into how the public is interacting with and perceiving our products.

To do this, we will create a model that is trained and tested on a dataset of tweets in which users expressed various emotions or reactions to other products. Training on tweets about other products is acceptable here because the essential issue is how well our model can parse and categorize expressed sentiments.

While many tweets will fall easily under either Positive or Negative categories, we also need to be able to handle more neutral sentiments expressed. In other words, our model will need to address the following questions:

1. How well can we differentiate between Positive and Negative sentiments?
2. How well does our model handle Neutral sentiments?

For our purposes, Positive and Negative sentiments are equally important in different ways: Positive sentiments tell us which aspects of our products are well received by customers and why, while Negative sentiments call our attention to the aspects that turn customers away.

By relying on both Negative and Positive sentiments, we can better understand which changes need to be made, and which aspects of our products should be developed further. In other words, **neither False Positives nor False Negatives are inherently more important for us**.

# Data Understanding

As this problem is about analyzing and categorizing sentiments expressed through text, we will need to build a model capable of Natural Language Processing, or NLP for short. This model needs to be adept at processing and parsing text, and at categorizing that text as 'Positive', 'Negative', or 'Neutral'.

We will first build a binary classifier capable of differentiating between 'Positive' and 'Negative' sentiments, and then expand it into a multi-class classifier by adding the 'Neutral' class.

### Dataset
For these purposes, we will work with a [dataset we retrieved from data.world](https://data.world/crowdflower/brands-and-product-emotions) that contains more than 9,000 tweets expressing Positive, Negative, or Neutral sentiments towards Apple or Google products.

The dataset contains information across three columns:
1. `tweets_text`, which contains the text of the collected tweets themselves.
2. `emotion_in_tweet_is_directed_at`, which indicates which product the tweet is speaking to. This column contains a number of values, but they all fall under either Apple or Google products (with the exception of a number of rows that contain 'No data').
3. `is_there_an_emotion_directed_at_a_brand_or_product`, which categorizes the collected tweets according to either 'Positive emotion', 'Negative emotion', or 'No emotion toward brand or product'. There is a fourth category, 'I can't tell', which we need to investigate more before deciding whether or not to remove these entries.
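
As a minimal sketch of how the four label values described above might be handled, the snippet below renames the three usable categories and sets the 'I can't tell' rows aside for review instead of silently discarding them. The toy rows and the names `LABEL_MAP` and `split_labels` are invented for illustration and are not part of the dataset or the final pipeline.

```python
from collections import Counter

# Raw label strings as described in the column list above,
# mapped to the short class names used in this write-up.
LABEL_MAP = {
    "Positive emotion": "Positive",
    "Negative emotion": "Negative",
    "No emotion toward brand or product": "Neutral",
}


def split_labels(rows):
    """Split (text, raw_label) rows into usable rows and rows needing review."""
    usable, needs_review = [], []
    for text, raw_label in rows:
        if raw_label in LABEL_MAP:
            usable.append((text, LABEL_MAP[raw_label]))
        else:
            # For example "I can't tell": keep these aside until we decide.
            needs_review.append((text, raw_label))
    return usable, needs_review


# Toy rows, invented for illustration (not taken from the dataset).
toy_rows = [
    ("Loving the new phone", "Positive emotion"),
    ("Battery died again", "Negative emotion"),
    ("Heading to the keynote", "No emotion toward brand or product"),
    ("Hmm, not sure about this one", "I can't tell"),
]

usable, needs_review = split_labels(toy_rows)
label_counts = Counter(label for _, label in usable)
print(label_counts)
print(len(needs_review), "row(s) set aside for review")
```

Keeping the ambiguous rows in a separate list makes it easy to inspect them before deciding whether to remove them.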

### Features and Labels
The `tweets_text` column will serve as our features, or X, while `is_there_an_emotion_directed_at_a_brand_or_product` will serve as our labels, or y. 

To be more specific, we will use the `tweets_text` column to generate **TF-IDF (Term Frequency-Inverse Document Frequency)** scores, which assign numeric values to key terms by weighing their frequency within a given text against their frequency across all texts. This will help our model pick up signal from significant words while reducing noise from frequent, insignificant words. These scores will be our features, at least in the initial baseline model.
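
To make the weighting concrete, here is a minimal sketch of the textbook form of TF-IDF, `tf * log(N / df)`, written in plain Python. The toy tweets and the function name `tf_idf` are invented for illustration. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed and normalized variant by default, so their numbers will differ from the ones computed here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy tweets, invented for illustration.
toy_tweets = [
    "the new phone is great",
    "the battery is terrible",
    "the keynote is today",
]

scores = tf_idf(toy_tweets)
for tweet, tweet_scores in zip(toy_tweets, scores):
    print(tweet, {term: round(score, 3) for term, score in tweet_scores.items()})
```

Words that appear in every toy tweet ("the", "is") receive a score of zero, while words unique to one tweet ("great", "terrible") receive the highest scores, which is the noise-reduction effect described above.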

The second column, `emotion_in_tweet_is_directed_at`, is not relevant for our purposes, as our task is to build a model that can categorize the sentiments expressed, and not to determine *which product* the sentiments are directed at.

### Class Imbalance
Our labels have significant class imbalance, with Negative sentiments comprising only 6% of the data compared to 33% for Positive and 59% for Neutral (the small remainder falls under 'I can't tell'). This will cause issues for us in both training the model and evaluating its performance according to success metrics.

To compensate for this imbalance we will deploy two strategies:
1. **Data Augmentation**, which essentially gives our model more examples of the minority class to learn from by synthetically creating entries for the minority class through techniques such as 'back-translation' or 'synonym replacement.'
2. **Class Weighting**, a strategy supported by certain models that gives higher importance to minority classes during training.
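
As a rough illustration of the second strategy, the sketch below computes weights that are inversely proportional to class frequency, using the formula `n_samples / (n_classes * class_count)`. This is the same heuristic scikit-learn applies when a model is given `class_weight="balanced"`. The toy label list only mirrors the approximate proportions quoted above, and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Toy labels that mirror the rough 59% / 33% / 6% split described above.
toy_labels = ["Neutral"] * 59 + ["Positive"] * 33 + ["Negative"] * 6

weights = balanced_class_weights(toy_labels)
for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label}: {weight:.2f}")
```

With these proportions, a misclassified Negative example costs the model roughly ten times as much as a misclassified Neutral one, which is how weighting pushes the model to pay attention to the minority class.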

### Success Metrics
As described above, we need to rely on both Positive and Negative sentiments equally as they both convey insights about our products in different ways and provide different signals as to which changes need to be made in the future.

For this reason, we will rely on the two following variations of the **F1-Score**, which balances both False Negatives and False Positives since neither is inherently more important for us:
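1. **Macro F1-Score**, which averages the per-class F1-Scores with equal weight, so that in this multi-class task the small Negative class counts as much as the larger classes.
2. **Weighted F1-Score**, which averages the per-class F1-Scores in proportion to each class's share of the data, and so accounts for any class imbalance that remains after we deploy the imbalance compensation strategies listed above.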
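
The difference between the two averages can be sketched in plain Python. The labels and predictions below are invented: the hypothetical model gets every Neutral and Positive example right but misses the single Negative one. The function names are illustrative, not part of any library.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by true class counts."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy example: the one Negative tweet is predicted as Neutral.
y_true = ["Neutral"] * 6 + ["Positive"] * 3 + ["Negative"]
y_pred = ["Neutral"] * 6 + ["Positive"] * 3 + ["Neutral"]

macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
print(f"macro F1:    {macro:.3f}")
print(f"weighted F1: {weighted:.3f}")
```

In this toy case the macro score is much lower than the weighted score, because the missed Negative class drags down one third of the macro average but only one tenth of the weighted one. Reporting both shows whether a good overall score is hiding poor minority-class performance.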

### Model Selection
We will start with a simple baseline model as an initial performance check before moving on to more complex models. Since we have a relatively large dataset and significant class imbalance, we will use **Logistic Regression**, which can help compensate for this imbalance through class weighting. This model can also help us determine the relative importance of features, which we can then use when tuning later, more complex models.

Logistic Regression needs numeric inputs, so we will feed it the **TF-IDF (Term Frequency-Inverse Document Frequency)** scores described above. However, this limits how well the model will apply to future tweets, which will undoubtedly contain new slang and terms that are not part of the vocabulary computed for the current model.

After establishing a baseline, we will move on to testing more complex models, namely **BERTweet**, a Deep Learning model based on BERT (Bidirectional Encoder Representations from Transformers). It is well-suited for analyzing tweets in particular, as it was trained on 850 million English tweets and can process special characters such as emojis, hashtags, etc. It is also capable of determining contextual meaning from limited text, which matters because tweets had a limit of 140 characters back in 2013, when our dataset was compiled.

## Data Preparation

In [1]:
import pandas as pd


# Exploratory Data Analysis

# Conclusion

## Limitations

## Recommendations

## Next Steps