<center>
    <h1>🍶 <b style="color: lightblue">"Churning"</b> the data 🍶</h1>
    <h3>An exploratory data analysis of yogurt brands on <s>Twi...</s> <span style="color: cornflowerblue;">𝕏</span>.com</h3>
</center>

- <b><span style="color:#0074D9">Main Objective:</span></b> to provide a detailed analysis of data acquired from 𝕏 in order to equip our client with actionable, evidence-based insights.
  - Provide a clean dataset, separated by company type.
  - Garner tweet <i>(or xeet or whatever they're called now)</i> sentiment using a pretrained model from Hugging Face 🤗.
  - Classify tweets to flag complaints that could be detrimental to the brand.
  - Report on the broad themes discovered from our analysis.


### Let's start by importing our dependencies

In [1]:
# import all dependencies
import os
import sys
from pathlib import Path
import pandas as pd
from IPython.display import display

os.chdir('..')
if os.getcwd() not in sys.path:
    sys.path.append(os.getcwd())

from src.processing.preprocess import *
from src.resources.brands_data import Brand, brands
from src.resources.word_lists import *
from src.processing.sentiment_analysis import analyze
from src.processing.theme_analyzer import theme_analyzer_main

Using MPS (Metal GPU) device.


    PyTorch 2.0.1 with CUDA None (you have 2.1.0.dev20230803)
    Python  3.11.4 (you have 3.11.4)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
[nltk_data] Downloading package punkt to /Users/joshualawrence/anacond
[nltk_data]     a3/envs/ds_project/lib/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/joshualawrence/ana
[nltk_data]     conda3/envs/ds_project/lib/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package twitter_samples to /Users/joshualawren
[nltk_data]     ce/anaconda3/envs/ds_project/lib/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


## `Step 0: Set up the environment`

1. Set up project using conda
2. Set up tiered, multi-environment approach utilizing either CUDA or MPS (Metal Performance Shaders) based on OS environment
    ```python
    import torch

    # Check if MPS is available, then CUDA, then fall back to the CPU
    if torch.backends.mps.is_available():
        device = torch.device("mps")
        print("Using MPS (Metal GPU) device.")
    elif torch.cuda.is_available():
        device = torch.device("cuda")
        print("Using CUDA device.")
    else:
        device = torch.device("cpu")
        print("Using CPU device.")
    ```
3. Chose dependencies
    * [Twitter-roBERTa-base](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) for Sentiment Analysis
    * [NLTK](https://www.nltk.org/) for tokenization and frequency analysis

## `Step 1: Clean the data`


### Data Preparation: 3 CSVs ➡ 1 DataFrame

* We started our analysis by combining the separate raw CSV files into a single DataFrame and removing any Twitter links.

In [2]:
# preprocessing.preprocess_data()
csv_list = get_csv_files()

combined_data_frame = combine_csv_data(csv_list)

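# remove twitter links - regex test here https://regex101.com/r/wZ0dAP/1
# note: the second alternative, [^[$]], also deletes any character other than "[" or "$"
# that is directly followed by "]", together with that "]"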
combined_data_frame["text"] = combined_data_frame["text"].str.replace(
    r"http[s]?://t\.[^\s]*|[^[$]]", "", regex=True
)

### Display datatable ###
print(f"Number of csv files ::: {len(csv_list)}")
print(f"Combined data frame length ::: {len(combined_data_frame)} rows")
pd.set_option("display.max_columns", 15)
pd.set_option("display.max_colwidth", 100)
display(combined_data_frame.iloc[10000:].head(3))
### for display purposes ###

Number of csv files ::: 3
Combined data frame length ::: 31567 rows


Unnamed: 0,Key,text,lang,created_at,created_day,timeonly,created_dateonly,...,user_geo_enabled,user_listed_count,user_location,user_statuses_count,user_time_zone,user_screen_name,file
10000,10001,"RT @ananavarro: Dear @chobani, thanks for the love. But I think you went a little over-board! 😜",en,Wed Jul 05 18:47:55,Wed,18:47:55,5/7/2017,...,True,399,"Sacramento, CA",59561,,clarisevail1,jsonfile_chobani.json
10001,10002,"[Batman ""WHERE ARE THEY?!"" voic CHOBANI",en,Wed Jul 05 18:29:14,Wed,18:29:14,5/7/2017,...,False,2,,1256,,milk_death,jsonfile_chobani.json
10002,10003,Omg Chobani Is Coming Back To Whole Foods Just In Time For The Amazon Acquis (pls RT↺❤️) ️️ #P,en,Wed Jul 05 18:07:19,Wed,18:07:19,5/7/2017,...,False,27,,242544,,KnowYourLeaker,jsonfile_chobani.json
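
The link-removal step can be checked in isolation. The sketch below uses only the standard library `re` module and a simplified pattern that targets `t.co` short links. The pattern, the helper name `strip_twitter_links`, and the sample string are assumptions for illustration, not the exact expression used in the cell above.

```python
import re

# Simplified pattern (an assumption): http/https links on the t.co shortener.
TCO_LINK = re.compile(r"https?://t\.co/\S+")


def strip_twitter_links(text: str) -> str:
    """Remove t.co short links and collapse the whitespace left behind."""
    without_links = TCO_LINK.sub("", text)
    return re.sub(r"\s{2,}", " ", without_links).strip()


sample = "Chobani is back at Whole Foods https://t.co/abc123 #yogurt"
print(strip_twitter_links(sample))
```

The same pattern could be passed to `Series.str.replace(..., regex=True)` to apply it to the whole `text` column, as the cell above does with its own expression.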


### Data Organization: Creation of a `Brand` Class

<div style="display: flex; justify-content: space-between">
    <div style="margin: 2%">

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Brand:
    """
    Brand object
    """
    twitter_handles: List[str]
    brand_name: str
    alternate_names: List[str] = field(default_factory=list)
    negative_keywords: List[str] = field(default_factory=list)
    is_nonspecific_name: bool = False
    has_food_related_name: bool = False
```

</div>
<div style="margin-left: 3%; margin-right: 3%;">

#### Example Brand:

```python
"Activia": Brand(
    twitter_handles=[
        "@activia",
        "@activiauk",
    ],
    brand_name="Activia",
    negative_keywords=[
        "activia benz",
        "mens-rights-activia",
    ],
),
```

</div>
</div>

We structured our brand data using a dataclass called `Brand` which helps us:

- <b><span style="color:#0074D9">Categorize Brands:</span></b> Using Twitter handles and brand names.
- <b><span style="color:#0074D9">Handle Ambiguities:</span></b> Some brand names are similar to everyday words. Including alternate names helps us capture tweets from different regions and catch common misspellings.
- <b><span style="color:#0074D9">Filter Out Noise:</span></b> Using negative keywords, we can eliminate irrelevant tweets, ensuring our insights are grounded in relevant data.
- <b><span style="color:#0074D9">Address Special Cases:</span></b> Some brands have non-specific names (e.g., a name made of everyday words or one closely associated with food), which we flag with `is_nonspecific_name` and `has_food_related_name`.
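
To make the roles of these fields concrete, here is a minimal sketch of how a `Brand` could drive tweet matching. The `Brand` below is a trimmed copy of the dataclass shown above, and `mentions_brand` is a hypothetical helper written for this illustration; it is not the project's `filter_irrelevant_data` or `remove_tweets_with_negative_keywords`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Brand:
    """Trimmed-down copy of the Brand dataclass shown above."""
    twitter_handles: List[str]
    brand_name: str
    alternate_names: List[str] = field(default_factory=list)
    negative_keywords: List[str] = field(default_factory=list)


def mentions_brand(text: str, brand: Brand) -> bool:
    """Hypothetical helper: True if the text names the brand and hits no negative keyword."""
    lowered = text.lower()
    # A negative keyword vetoes the match outright (e.g. "activia benz").
    if any(keyword.lower() in lowered for keyword in brand.negative_keywords):
        return False
    # Otherwise any handle, the brand name, or an alternate name counts as a mention.
    candidates = brand.twitter_handles + [brand.brand_name] + brand.alternate_names
    return any(candidate.lower() in lowered for candidate in candidates)


activia = Brand(
    twitter_handles=["@activia", "@activiauk"],
    brand_name="Activia",
    negative_keywords=["activia benz", "mens-rights-activia"],
)

print(mentions_brand("Loving my @Activia breakfast", activia))      # True
print(mentions_brand("Activia Benz dropped a new track", activia))  # False
```

Checking negative keywords first keeps the veto independent of how many positive matches a tweet contains.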


### Data Cleansing:

<h4 style="color: green">We will show the journey of a single company's data: <code>"Greek Gods"</code> from raw data into processed data.</h4>

<div>We took several steps in cleansing the data:</div>

* We did not deduplicate the text data, as we wanted to track the frequency of tweets.

1. <b><span style="color:#0074D9">Created Company List:</span></b> Given the unique names under the `file` column, we were left with [the following companies.](../notes/companies_list.txt)
    * We <span style="color:crimson">removed</span> <i>Vanilla Bean</i> as it did not appear to be a valid company. Keeping it also muddied the analysis because of its association with food.
    <br><br>
2. <b><span style="color:#0074D9">Filtered Dataset:</span></b> We used several criteria to do this:
    * <u>[Word Association Lists](../src/resources/word_lists.py)</u> - given to brands based on their "uniqueness"
        * A brand like <span style="color: goldenrod">Chobani</span> can be categorized as more unique than <span style="color: goldenrod">Liberté</span>, whose name is also an ordinary French word.
        * <span style="color: goldenrod">Greek Gods</span> is an example of a non-specific company name
    * <u>Removing Negative Keywords</u> - A short list of known exceptions for each brand
        * e.g., a proper name like <span style="color: goldenrod">Activia Jones</span>, which would otherwise match the normally sufficiently unique <span style="color: goldenrod">Activia</span> brand
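
For a non-specific name such as Greek Gods, a word association list can act as a tie-breaker. The sketch below shows one plausible reading of a relevancy threshold: count the association words in a tweet and keep it only if the count exceeds the threshold. The word list, the helper names, and the strict `>` comparison are all assumptions; the project's real lists live in `word_lists.py` and its scoring may differ.

```python
import re
from typing import Iterable

# Hypothetical association list, for illustration only.
YOGURT_WORDS = {"yogurt", "yoghurt", "greek-style", "probiotic", "dairy", "honey"}


def relevancy_score(text: str, association_words: Iterable[str] = YOGURT_WORDS) -> int:
    """Count how many distinct association words appear in the text."""
    tokens = set(re.findall(r"[a-z][a-z'-]*", text.lower()))
    return len(tokens & set(association_words))


def is_relevant(text: str, brand_name: str, relevancy_threshold: int = 0) -> bool:
    """Keep a tweet that names the brand and scores above the threshold."""
    if brand_name.lower() not in text.lower():
        return False
    return relevancy_score(text) > relevancy_threshold


print(is_relevant("Greek Gods Yogurt $1.79 at @Safeway after deal", "Greek Gods"))
print(is_relevant("The Greek gods of Olympus were famously petty", "Greek Gods"))
```

With a threshold of 0, a single yogurt-related word is enough to keep a tweet, which favors recall over precision on a small dataset.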

In [3]:
# preprocess.prepare_data_for_filtering()
brand = "Greek Gods"

if brand in brands:
    filtered_data_frame = filter_irrelevant_data(combined_data_frame, brand=brands[brand], relevancy_threshold=0)
    filtered_data_frame = remove_tweets_with_negative_keywords(filtered_data_frame, brands[brand])

    pd.set_option("display.max_columns", 10)
    pd.set_option("display.max_colwidth", 100)
    display(filtered_data_frame.head(3))

Number of relevant Greek Gods tweets found: 10


Unnamed: 0,Key,text,lang,created_at,created_day,...,user_location,user_statuses_count,user_time_zone,user_screen_name,file
163,164,"Get hip wth fermentation. ""Oui"" goes up against the Greek gods and goddesses of yogurt.",en,Mon Jun 26 20:04:26,Mon,...,"NYC, Chicago, DC, Montreal",465,Central Time (US & Canada),nasermu,jsonfile_today_greek_gods_ny.json
401,402,"Greek Gods Yogurt mission is ""to deliver authentic Greek-style products that embody those flavor...",en,Thu Jun 22 23:20:07,Thu,...,"Stanton, CA",1221,Pacific Time (US & Canada),SocialSampling,jsonfile_today_greek_gods_ny.json
419,420,"Greek Gods Yogurt $1.79 at @Safeway after deal, save 55%!\n\n#Safeway #Coupon #SafewayDeals…",en,Thu Jun 22 17:19:04,Thu,...,Colorado,3471,Mountain Time (US & Canada),SuperSafeway,jsonfile_today_greek_gods_ny.json


## `Step 2: Perform Sentiment Analysis`

In [4]:
brand_name_snake_case = brand.lower().replace(" ", "_")
analyze(
    data_frame=filtered_data_frame,
    company_name=brand_name_snake_case,
)
pd.read_csv(f"data/processed/companies/{brand_name_snake_case}/{brand_name_snake_case}_relevant_tweets.csv")

Analyzing sentiments: 100%|██████████| 1/1 [00:00<00:00,  1.15it/s]


Unnamed: 0,Key,text,lang,created_at,created_day,...,user_screen_name,file,sentiment_score,sentiment,company_name
0,164,"Get hip wth fermentation. ""Oui"" goes up against the Greek gods and goddesses of yogurt.",en,Mon Jun 26 20:04:26,Mon,...,nasermu,jsonfile_today_greek_gods_ny.json,0.691087,neutral,greek_gods
1,402,"Greek Gods Yogurt mission is ""to deliver authentic Greek-style products that embody those flavor...",en,Thu Jun 22 23:20:07,Thu,...,SocialSampling,jsonfile_today_greek_gods_ny.json,0.510627,positive,greek_gods
2,420,"Greek Gods Yogurt $1.79 at @Safeway after deal, save 55%!\n\n#Safeway #Coupon #SafewayDeals…",en,Thu Jun 22 17:19:04,Thu,...,SuperSafeway,jsonfile_today_greek_gods_ny.json,0.827519,positive,greek_gods
3,2786,"Get hip wth fermentation. ""Oui"" goes up against the Greek gods and goddesses of yogurt.",en,Mon Jun 26 20:04:26,Mon,...,nasermu,jsonfile_today_yoplait_ny.json,0.691087,neutral,greek_gods
4,4090,Do you know how people binge eat ice cream when they're emotional. That's how I am but with hone...,en,Mon Jul 03 14:57:22,Mon,...,desteniemarie,jsonfile_greek gods.json,0.650827,positive,greek_gods
5,4160,@LiberteUSA I still purchase Liberte but now it's not the only brand I purchase. Greek Gods is w...,en,Sun Jul 02 05:59:11,Sun,...,nastywmnlabmngr,jsonfile_greek gods.json,0.785935,positive,greek_gods
6,4611,@LiberteUSA I still purchase Liberte but now it's not the only brand I purchase. Greek Gods is w...,en,Sun Jul 02 05:59:11,Sun,...,nastywmnlabmngr,jsonfile_liberte.json,0.785935,positive,greek_gods
7,10791,I'm such a fiend for my s'mores chobani and my strawberry honey Greek gods yogurt. Like yes. All...,en,Tue Jul 25 23:03:43,Tue,...,AyeStoney,jsonfile_chobani.json,0.954481,positive,greek_gods
8,13001,"RT @nasermu: Get hip wth fermentation. ""Oui"" goes up against the Greek gods and goddesses of yo...",en,Tue Jul 25 09:25:42,Tue,...,MonikaMckay8,jsonfile_yoplait.json,0.79818,neutral,greek_gods
9,17032,"(please read previous tweet) @browncowyogurt @FAGEUSA @Stonyfield, @LiberteUSA @YoCrunch @Powerf...",en,Tue Aug 08 20:35:21,Tue,...,sweetbob,jsonfile_Stonyfield.json,0.817065,neutral,greek_gods
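
The `sentiment` and `sentiment_score` columns are consistent with a label plus the probability of that label, although the source does not confirm how `analyze` computes them. As a minimal sketch under that assumption, the code below turns three raw logits into a label and score using a softmax. The label order matches the model card for `twitter-roberta-base-sentiment-latest`; the logits are made up.

```python
import math
from typing import List, Tuple

# Label order used by cardiffnlp/twitter-roberta-base-sentiment-latest.
LABELS = ["negative", "neutral", "positive"]


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits to probabilities, shifting by the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def label_and_score(logits: List[float]) -> Tuple[str, float]:
    """Return the most probable label together with its probability."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return LABELS[best], probabilities[best]


# Made-up logits, for illustration only.
label, score = label_and_score([-1.2, 0.3, 2.1])
print(label, round(score, 3))
```

Under this reading, a `neutral` row with a score near 0.69 means the model put roughly 69% of its probability on the neutral class, not that the tweet is 69% positive.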


## `Step 3: Get the top words and bigrams for each company`

In [None]:
theme_analyzer_main(filtered_data_frame)
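
The internals of `theme_analyzer_main` are not shown here, so the following is only a sketch of the general technique: tokenize each tweet, drop stopwords, and count adjacent word pairs. It uses the standard library instead of NLTK, and the stopword list, helper names, and sample tweets are assumptions for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Tiny illustrative stopword list; NLTK's English stopword corpus is much larger.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "for", "at"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]


def top_bigrams(texts: Iterable[str], n: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Count adjacent token pairs within each text and return the n most common."""
    counts: Counter = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


tweets = [
    "Jamie Lee Curtis loves Activia",
    "Is Jamie Lee Curtis still the face of Activia?",
]
print(top_bigrams(tweets, n=2))
```

Pairs are counted within a tweet only, so the last word of one tweet never forms a bigram with the first word of the next.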

## `Step 4: Plot data related to top words and bigrams`

In [None]:
# not completed yet

## `Step 5: Present findings`

* `Wallaby`, `Maple Hill`, and `Smari` did not have sufficient data to draw any conclusions
* `Greek Gods` had limited data, but some themes still emerged
    * Based on the limited data, there was a very positive reaction to the brand, with some customers even switching from other brands like Chobani

## Other brands
### Activia
* The top words and bigrams show that a large number of tweets focused on spokesperson Jamie Lee Curtis

```csv
bigram,frequency
"('jamie', 'lee')",20
"('lee', 'curtis')",18
"('👌', '👌')",10
```

### Chobani
* The top words and bigrams show that a large number of tweets focused on CEO Hamdi Ulukaya's appearance on the cover of Fast Company magazine
* The title was "How Chobani's Hamdi Ulukaya Is Winning America's Culture War"
```csv
bigram,frequency
"('guy', 'got')",271
"('got', 'war')",271
"('war', 'lost')",271
```

### Liberte
* Tweets focused on a recall of Liberte products that was under way at the time
```csv
bigram,frequency
"('recall', 'liberte')",280
"('expands', 'recall')",278
"('canadian', 'food')",256
"('food', 'inspection')",254
```

## `Potential Improvements`
* Use NER (Named Entity Recognition) for getting yogurt and company related keywords
* Utilize NLTK for initial data filtering
* Obtain a more complete dataset of yogurt brands and twitter handles
* Obtain a list of hashtags related to yogurt brands
* Multi-lingual analysis
    * Could remove false captures like "activia" as a brand
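
As a starting point for the hashtag idea above, candidate hashtags could be harvested from the tweets already collected. This is a minimal sketch with the standard library; the helper name `hashtag_counts` and the second sample tweet are assumptions for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

HASHTAG = re.compile(r"#(\w+)")


def hashtag_counts(texts: Iterable[str], n: int = 5) -> List[Tuple[str, int]]:
    """Return the n most frequent hashtags, lowercased, without the leading '#'."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts.most_common(n)


tweets = [
    "Greek Gods Yogurt $1.79 at @Safeway after deal! #Safeway #Coupon #SafewayDeals",
    "Stocking up again #coupon #yogurt",
]
print(hashtag_counts(tweets, n=1))
```

Lowercasing before counting merges variants such as `#Coupon` and `#coupon` into one entry.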