<center>
    <h1>🍶 <b style="color: lightblue">"Churning"</b> the data 🍶</h1>
    <h3>An exploratory data analysis of yogurt brand <s>Twi...</s> <b style="font-size: 2rem; color: lightblue;">𝕏</b></h3>
</center>

- <b><span style="color:#0074D9">Main Objective:</span></b> to provide a detailed analysis of data acquired from 𝕏 and equip our client with actionable, evidence-based insights.
  - Provide a clean dataset, separated by company type.
  - Gauge tweet <i>(or xeet or whatever they're called now)</i> sentiment using a pretrained model from Hugging Face 🤗.
  - Identify tweets that are complaints and could be detrimental to the brand.
  - Report on the broad themes discovered from our analysis.


### Let's start by importing our dependencies

In [None]:
# import all dependencies
import os
from pathlib import Path
from IPython.display import display
import sys
import pandas as pd

# Ensure the root of your project is in sys.path
project_root = os.path.dirname(os.getcwd())
if project_root not in sys.path:
    sys.path.append(project_root)

from src.processing.preprocess import *
from src.resources.brands_data import Brand, brands

## `Step 1: Clean the data`


### Data Preparation: 3 CSVs ➡ 1 DataFrame

We started our analysis by combining the three raw CSV files into a single DataFrame.


In [None]:
csv_list = get_csv_files()
print(f"Number of csv files ::: {len(csv_list)}")
combined_data_frame = combine_csv_data(csv_list)
# remove twitter links - regex test here https://regex101.com/r/wZ0dAP/1
combined_data_frame["text"] = combined_data_frame["text"].str.replace(
    r"http[s]?://t\.[^\s]*|[^[$]]", "", regex=True
)
print(f"Combined data frame length ::: {len(combined_data_frame)}")

pd.set_option("display.max_columns", 15)
pd.set_option("display.max_colwidth", 100)
display(combined_data_frame.iloc[10000:].head(3))
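
The link-stripping step above can be tried on plain strings. This is a minimal sketch, not the notebook's exact behaviour: it uses only the `http[s]?://t\.[^\s]*` branch of the pattern (the second alternative is left out), and the `strip_links` helper, including its whitespace clean-up, is a hypothetical addition for illustration.

```python
import re

# Simplified pattern: only the t.co-style link branch of the expression used above.
TCO_LINK = re.compile(r"https?://t\.\S*")


def strip_links(text: str) -> str:
    """Hypothetical helper: remove t.co-style links, then collapse leftover whitespace."""
    without_links = TCO_LINK.sub("", text)
    # Splitting and re-joining removes the double space a deleted link leaves behind.
    return " ".join(without_links.split())


print(strip_links("Love my @chobani https://t.co/abc123 so much"))
# Love my @chobani so much
```
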

### Data Organization: Creation of a `Brand` Class

<div style="display: flex; justify-items: stretch;">
    <div>

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Brand:
    """
    Brand object
    """
    twitter_handles: List[str]
    brand_name: str
    alternate_names: List[str] = field(default_factory=list)
    negative_keywords: List[str] = field(default_factory=list)
    is_nonspecific_name: bool = False
    has_food_related_name: bool = False
```

</div>
<div style="justify-self: start;">

#### Example Brand:

```python
"Activia": Brand(
    twitter_handles=[
        "@activia",
        "@activiauk",
    ],
    brand_name="Activia",
    negative_keywords=[
        "activia benz",
        "mens-rights-activia",
    ],
),
```

</div>
</div>

We structured our brand data using a dataclass called `Brand` which helps us:

- <b><span style="color:#0074D9">Categorize Brands:</span></b> Using Twitter handles and brand names.
- <b><span style="color:#0074D9">Handle Ambiguities:</span></b> Some brand names are similar to everyday words. Including alternate names helps us capture tweets from different regions and catch common misspellings.
- <b><span style="color:#0074D9">Filter Out Noise:</span></b> Using negative keywords, we can eliminate irrelevant tweets, ensuring our insights are grounded in relevant data.
- <b><span style="color:#0074D9">Address Special Cases:</span></b> Some brands have non-specific names or names closely associated with food; the two boolean flags mark these cases.
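
To make the role of these fields concrete, here is a minimal sketch of how a `Brand` could be matched against tweet text. The `mentions_brand` helper is hypothetical and is not taken from the project's code; it uses plain substring matching, which is an assumption, and it ignores the two boolean flags.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Brand:
    """Copy of the Brand dataclass shown above, so the sketch runs on its own."""
    twitter_handles: List[str]
    brand_name: str
    alternate_names: List[str] = field(default_factory=list)
    negative_keywords: List[str] = field(default_factory=list)
    is_nonspecific_name: bool = False
    has_food_related_name: bool = False


def mentions_brand(text: str, brand: Brand) -> bool:
    """Hypothetical helper: True if the text names the brand and hits no negative keyword."""
    lowered = text.lower()
    # Any negative keyword disqualifies the tweet outright.
    if any(keyword.lower() in lowered for keyword in brand.negative_keywords):
        return False
    # Otherwise accept a handle, the brand name, or any alternate name.
    candidates = brand.twitter_handles + [brand.brand_name] + brand.alternate_names
    return any(candidate.lower() in lowered for candidate in candidates)


activia = Brand(
    twitter_handles=["@activia", "@activiauk"],
    brand_name="Activia",
    negative_keywords=["activia benz", "mens-rights-activia"],
)

print(mentions_brand("Breakfast with @activia again", activia))            # True
print(mentions_brand("Just saw an Activia Benz parked outside", activia))  # False
```
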


### Data Cleansing:

<h4 style="color: green">We will show the journey of a single company's data: <code>"Greek Gods"</code> from raw data into processed data.</h4>

<div>We took several steps in cleansing the data:</div>

1. <b><span style="color:#0074D9">Determined Company List:</span></b> From the unique names in the `file` column, we derived [the following list of companies](../notes/companies_list.txt).
    * We <span style="color:crimson">removed</span> <i>Vanilla Bean</i> because it did not appear to be a valid company. Keeping it also muddied the analysis because of its association with food.
2. <b><span style="color:#0074D9">Filtered Dataset:</span></b> We used several criteria to do this:
    * <u>[Word Association Lists](../src/resources/word_lists.py)</u> - assigned to brands based on how "unique" their names are
        * A brand like `Chobani` can be categorized as more unique than `Liberté`, which is also an everyday French word.
        * `Greek Gods` is an example of a non-specific company name
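
One way to picture the uniqueness idea is to require extra context words only for non-specific names. The sketch below is an assumption about how such a filter could work, not the project's implementation: the `is_relevant` helper and the `YOGURT_WORDS` set are hypothetical, and the real lists live in `word_lists.py`.

```python
from typing import Iterable

# Hypothetical association list, invented for this example only.
YOGURT_WORDS = {"yogurt", "yoghurt", "dairy", "probiotic"}


def is_relevant(
    text: str,
    brand_name: str,
    is_nonspecific_name: bool,
    association_words: Iterable[str] = YOGURT_WORDS,
) -> bool:
    """Hypothetical filter: non-specific names also need an associated word nearby."""
    lowered = text.lower()
    if brand_name.lower() not in lowered:
        return False
    # A unique name such as "Chobani" is accepted on the name alone.
    if not is_nonspecific_name:
        return True
    # A non-specific name such as "Greek Gods" needs supporting context.
    return any(word in lowered for word in association_words)


print(is_relevant("Chobani for lunch", "Chobani", is_nonspecific_name=False))
# True
print(is_relevant("Reading about the Greek gods of Olympus", "Greek Gods", is_nonspecific_name=True))
# False
print(is_relevant("Greek Gods honey yogurt is so good", "Greek Gods", is_nonspecific_name=True))
# True
```
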

3. <b><span style="color: #0074D9">Removed Negative Keywords:</span></b> We dropped tweets containing a brand's negative keywords, which mark common irrelevant matches that need specific filtering.

In [None]:
print(f"\n\nBefore filtering ::: {len(combined_data_frame)} items.")

remove_tweets_with_negative_keywords(combined_data_frame, brands["Fage"])
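
The body of `remove_tweets_with_negative_keywords` is not shown in this notebook. As a rough illustration of the idea, the sketch below filters a small DataFrame with pandas. The `drop_negative_keyword_rows` function is a hypothetical stand-in, the sample tweets are invented, and the keywords are the `Activia` ones from the example brand above.

```python
import re
from typing import List

import pandas as pd


def drop_negative_keyword_rows(frame: pd.DataFrame, negative_keywords: List[str]) -> pd.DataFrame:
    """Hypothetical stand-in: return a copy without rows whose text hits a negative keyword."""
    if not negative_keywords:
        return frame.copy()
    # Escape each keyword so characters such as "-" are matched literally.
    pattern = "|".join(re.escape(keyword) for keyword in negative_keywords)
    hits = frame["text"].str.contains(pattern, case=False, regex=True, na=False)
    return frame[~hits].reset_index(drop=True)


# Invented sample tweets, for illustration only.
tweets = pd.DataFrame(
    {
        "text": [
            "Trying the new @activia flavour today",
            "That Activia Benz joke never gets old",
            "activia with granola is my go-to breakfast",
        ]
    }
)
filtered = drop_negative_keyword_rows(tweets, ["activia benz", "mens-rights-activia"])
print(f"Before filtering ::: {len(tweets)} items.")
print(f"After filtering ::: {len(filtered)} items.")
```

Returning a new DataFrame rather than mutating the input is a design choice of this sketch; the project's function may behave differently.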

