<center>
    <h1>🍶 <b style="color: lightblue">"Churning"</b> the data 🍶</h1>
    <h3>An exploratory data analysis of yogurt brands on <s>Twi...</s> <span style="color: cornflowerblue;">𝕏</span>.com</h3>
</center>

- <b><span style="color:#0074D9">Main Objective:</span></b> to analyze data collected from 𝕏 in detail and equip our client with actionable, evidence-based insights.
  - Provide a clean dataset, separated by company type
  - Gauge tweet <i>(or xeet, or whatever they're called now)</i> sentiment using a pretrained model from Hugging Face 🤗
  - Identify tweets that are complaints and could be detrimental to the brand.
  - Report on the broad themes discovered from our analysis.


### Let's start by importing our dependencies

In [None]:
# import all dependencies
import os
import sys
from pathlib import Path

import pandas as pd
from IPython.display import display

os.chdir('..')
if os.getcwd() not in sys.path:
    sys.path.append(os.getcwd())

from src.processing.preprocess import *
from src.resources.brands_data import Brand, brands
from src.resources.word_lists import *
from src.processing.sentiment_analysis import analyze

## `Step 0: Set up the environment`

1. Set up the project using conda
2. Set up tiered device selection that uses MPS (Metal Performance Shaders) or CUDA when available, depending on the OS environment, and falls back to the CPU otherwise
    ```python
    import torch

    # Pick the best available device: MPS first, then CUDA, then CPU
    if torch.backends.mps.is_available():
        device = torch.device("mps")
        print("Using MPS (Metal GPU) device.")
    elif torch.cuda.is_available():
        device = torch.device("cuda")
        print("Using CUDA device.")
    else:
        device = torch.device("cpu")
        print("Using CPU device.")
    ```
3. Chose dependencies
    * [Twitter-roBERTa-base](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) for Sentiment Analysis
    * [NLTK](https://www.nltk.org/) for tokenization and frequency analysis
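
A minimal sketch of how the sentiment model might be wired up, assuming it is loaded through the `transformers` `pipeline` helper. The names `preprocess` and `build_sentiment_pipeline` are illustrative and may not match what `src.processing.sentiment_analysis` actually does. The mention and link masking follows the preprocessing suggested on the model card.

```python
def preprocess(text: str) -> str:
    """Mask user mentions and links before scoring a tweet."""
    tokens = []
    for token in text.split(" "):
        if token.startswith("@") and len(token) > 1:
            token = "@user"
        elif token.startswith("http"):
            token = "http"
        tokens.append(token)
    return " ".join(tokens)


def build_sentiment_pipeline():
    """Load the model lazily (hypothetical helper).

    Requires the `transformers` package and downloads the model on first use.
    """
    from transformers import pipeline

    model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
    return pipeline("sentiment-analysis", model=model_name, tokenizer=model_name)


sample = "@chobani your new flavor is great https://t.co/abc123"
print(preprocess(sample))

# Usage, left commented out because it needs the model download:
# classifier = build_sentiment_pipeline()
# classifier(preprocess(sample))
```

Keeping the model load inside a function means the preprocessing can be tried out without a GPU or a network connection.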

## `Step 1: Clean the data`


### Data Preparation: 3 CSVs ➡ 1 DataFrame

* We started our analysis by combining the separate raw CSV files into a single DataFrame and removing any Twitter links.

In [None]:
# preprocessing.preprocess_data()
csv_list = get_csv_files()

combined_data_frame = combine_csv_data(csv_list)

# remove twitter links - regex test here https://regex101.com/r/wZ0dAP/1
combined_data_frame["text"] = combined_data_frame["text"].str.replace(
    r"https?://t\.\S*", "", regex=True
)

### Display datatable ###
print(f"Number of csv files ::: {len(csv_list)}")
print(f"Combined data frame length ::: {len(combined_data_frame)} rows")
pd.set_option("display.max_columns", 15)
pd.set_option("display.max_colwidth", 100)
display(combined_data_frame.iloc[10000:].head(3))
### for display purposes ###
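
The link removal in the cell above can be checked on a few sample strings before it touches the real data. This is a small sketch using the standard `re` module. The pattern here is an assumption that targets only Twitter's `t.co` short links, and `strip_tco_links` is a hypothetical helper, not part of `src.processing.preprocess`.

```python
import re

# Assumed pattern: only t.co short links are removed; other URLs are left alone.
TCO_LINK = re.compile(r"https?://t\.co/\S+")


def strip_tco_links(text: str) -> str:
    """Remove t.co links and collapse the whitespace they leave behind."""
    without_links = TCO_LINK.sub("", text)
    return " ".join(without_links.split())


samples = [
    "Best yogurt ever https://t.co/AbC123 #breakfast",
    "Recipe at https://example.com/yogurt",
]
for sample in samples:
    print(strip_tco_links(sample))
```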

### Data Organization: Creation of a `Brand` Class

<div style="display: flex; justify-content: space-between">
    <div style="margin: 2%">

```python
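from dataclasses import dataclass, field
from typing import List


@dataclass
class Brand:
    """
    A yogurt brand and the metadata used to match and filter its tweets.
    """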
    twitter_handles: List[str]
    brand_name: str
    alternate_names: List[str] = field(default_factory=list)
    negative_keywords: List[str] = field(default_factory=list)
    is_nonspecific_name: bool = False
    has_food_related_name: bool = False
```

</div>
<div style="margin-left: 3%; margin-right: 3%;">

#### Example Brand:

```python
"Activia": Brand(
    twitter_handles=[
        "@activia",
        "@activiauk",
    ],
    brand_name="Activia",
    negative_keywords=[
        "activia benz",
        "mens-rights-activia",
    ],
),
```

</div>
</div>

We structured our brand data using a dataclass called `Brand`, which helps us:

- <b><span style="color:#0074D9">Categorize Brands:</span></b> Using Twitter handles and brand names.
- <b><span style="color:#0074D9">Handle Ambiguities:</span></b> Some brand names are similar to everyday words. Including alternate names helps us capture tweets from different regions and catch common misspellings.
- <b><span style="color:#0074D9">Filter Out Noise:</span></b> Using negative keywords, we can eliminate irrelevant tweets, ensuring our insights are grounded in relevant data.
- <b><span style="color:#0074D9">Address Special Cases:</span></b> Flags mark brands whose names are non-specific or closely associated with food.
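
To make these points concrete, here is a minimal sketch of how a `Brand` could drive a first-pass match. The dataclass fields come from the definition above, but `search_terms` and `mentions_brand` are hypothetical helpers written for illustration, and the substring matching is an assumption rather than the project's actual logic.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Brand:
    twitter_handles: List[str]
    brand_name: str
    alternate_names: List[str] = field(default_factory=list)
    negative_keywords: List[str] = field(default_factory=list)
    is_nonspecific_name: bool = False
    has_food_related_name: bool = False


def search_terms(brand: Brand) -> List[str]:
    """Hypothetical helper: every lowercase string that could identify the brand."""
    terms = [brand.brand_name, *brand.alternate_names, *brand.twitter_handles]
    return [term.lower() for term in terms]


def mentions_brand(text: str, brand: Brand) -> bool:
    """Hypothetical helper: case-insensitive substring match on any search term."""
    lowered = text.lower()
    return any(term in lowered for term in search_terms(brand))


activia = Brand(
    twitter_handles=["@activia", "@activiauk"],
    brand_name="Activia",
    negative_keywords=["activia benz", "mens-rights-activia"],
)

print(mentions_brand("Loving my @ActiviaUK breakfast", activia))
print(mentions_brand("Plain yogurt is underrated", activia))
```

A plain substring match like this is deliberately loose. It is the reason the negative keywords and the two boolean flags are needed to tighten the result afterwards.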


### Data Cleansing:

<h4 style="color: green">We will show the journey of a single company's data: <code>"Greek Gods"</code> from raw data into processed data.</h4>

<div>We took several steps in cleansing the data:</div>

1. <b><span style="color:#0074D9">Created Company List:</span></b> Given the unique names under the `file` column, we were left with [the following companies.](../notes/companies_list.txt)
    * We <span style="color:crimson">removed</span> <i>Vanilla Bean</i> because it did not appear to be a valid company. Keeping it also muddied the analysis because of its association with food.
    <br><br>
2. <b><span style="color:#0074D9">Filtered Dataset:</span></b> We used several criteria to do this:
    * <u>[Word Association Lists](../src/resources/word_lists.py)</u> - given to brands based on their "uniqueness"
        * A brand like <span style="color: goldenrod">Chobani</span> can be categorized as more unique than <span style="color: goldenrod">Liberté</span>, whose name is also an everyday French word.
        * <span style="color: goldenrod">Greek Gods</span> is an example of a non-specific company name
    * <u>Removing Negative Keywords</u> - A short list of known exceptions for each brand
        * e.g., a proper name like <span style="color: goldenrod">Activia Jones</span> would otherwise match the normally sufficiently unique <span style="color: goldenrod">Activia</span> brand

In [None]:
# preprocess.prepare_data_for_filtering()
brand = "Greek Gods"

if brand in brands:
    filtered_data_frame = filter_irrelevant_data(combined_data_frame, brand=brands[brand], relevancy_threshold=0)
    filtered_data_frame = remove_tweets_with_negative_keywords(filtered_data_frame, brands[brand])

    pd.set_option("display.max_columns", 10)
    pd.set_option("display.max_colwidth", 100)
    display(filtered_data_frame.head(3))
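
The two filtering calls above could work roughly as follows. This is a sketch on plain lists of strings rather than a DataFrame, and everything in it is an assumption: the word list, the `"costume party"` negative keyword, the helper names, and the convention that a tweet is kept when its score is strictly above `relevancy_threshold`. The real lists live in `src/resources/word_lists.py`.

```python
import re
from typing import Iterable, List

# Hypothetical word association list, for illustration only.
YOGURT_WORDS = {"yogurt", "yoghurt", "dairy", "probiotic", "creamy"}


def relevancy_score(text: str, related_words: Iterable[str]) -> int:
    """Count how many distinct related words appear as whole words in the text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return len(tokens & set(related_words))


def keep_relevant(
    texts: List[str], related_words: Iterable[str], relevancy_threshold: int = 0
) -> List[str]:
    """Keep texts scoring strictly above the threshold (an assumed convention)."""
    related = set(related_words)
    return [t for t in texts if relevancy_score(t, related) > relevancy_threshold]


def drop_negative(texts: List[str], negative_keywords: Iterable[str]) -> List[str]:
    """Drop texts containing any negative keyword (case-insensitive substring)."""
    lowered_keywords = [k.lower() for k in negative_keywords]
    return [t for t in texts if not any(k in t.lower() for k in lowered_keywords)]


tweets = [
    "The Greek Gods honey yogurt is so creamy",
    "Reading about the Greek gods of Olympus tonight",
    "Greek gods costume party, bring yogurt dip",
]
relevant = keep_relevant(tweets, YOGURT_WORDS)
cleaned = drop_negative(relevant, ["costume party"])
print(cleaned)
```

For a non-specific name like Greek Gods, the word list removes the mythology tweet, and the negative keyword then removes a tweet that mentions yogurt but is not about the brand.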

* We did not deduplicate the data.

In [None]:
analyze(
    data_frame=filtered_data_frame,
    company_name=brand.lower().replace(" ", "_"),
)

In [None]:
print(f"\n\nBefore filtering ::: {len(combined_data_frame)} items.")
print(f"After filtering ::: {len(filtered_data_frame)} items.")

## `Potential Improvements`
* Use NER (Named Entity Recognition) to extract yogurt- and company-related keywords
* Utilize NLTK for initial data filtering