# **BUSI/COMP488-001 Data Science in the Business World.**
### **Final Project**: Gauging Brand Perception: How Luxury Brands Can Stay One Step Ahead.
### **Team Name:** Team D.
### **Team Members:** Carley Wiley, Eldar Utiushev, Bek Tukhtasinov, Mira Mohan, Aryonna Rice, and Tammy Duong.

### **Exposition**
**1. Focused Stakeholder:** Luxury Fashion Brands. Our model specifically analyzes the real-time public sentiment towards the following luxury clothing brands: The Row, Courrèges, Khaite, Chrome Hearts, Alaïa, Bottega Veneta, Versace, Balenciaga, Gucci, Chanel, Louis Vuitton, Saint Laurent, Christian Dior, Cartier, Celine, Burberry, Rick Owens, Givenchy, Hermès, Fendi, Prada, Valentino, and Armani.

**2. Our Question:** How can we leverage web scraping and social media analytics to identify real-time shifts in public sentiment toward designer fashion brands, enabling proactive reputation management and strategic marketing decisions?

**3. Data We Used to Answer our Question:**
Using Apify's TikTok scrapers, we collected text data (hashtags and comments) that mention the luxury brands listed above. Our synthetic expert then performs a sentiment analysis of each brand along the six pillars of brand perception that we defined at the start of our Data Science Pipeline process: product quality, reputation & heritage, customer service, social impact, ethical practices, and sustainability.

**4. Approach and Methods**:
- Data Cleaning and Preprocessing:
  
  - **Initial Setup and Data Loading:** Data is initially loaded from social media platforms and news sources, specifically focusing on comments that provide insights into brand perception. This data is processed using Python libraries such as Pandas for data manipulation. Necessary libraries for language detection and natural language processing, such as NLTK and LangDetect, are installed. Resources like stop words, tokenizers, and lemmatizers from NLTK are also downloaded to assist in text processing.

  - **Text Preprocessing:** Text data undergoes several cleaning steps:
    - URLs are removed to ensure the text reflects only content related to brand sentiment.
    - Special characters and punctuation are stripped to simplify the text and focus on meaningful words.
    - Text is converted to lowercase to standardize the data and facilitate comparison and analysis.
    - Non-English comments are filtered out to maintain consistency in language for sentiment analysis.
    - Stop words are removed, and text is tokenized to focus on significant words that contribute to sentiment analysis.

- Sentiment Analysis Model Training:

  - **Data Preparation:** The cleaned data is split into training, validation, and testing sets using sklearn's train_test_split function. This step ensures the model is tested on unseen data, validating its predictive power.
  - **Model Training:** A sentiment analysis model is trained using the OpenAI API, which utilizes large-scale language models to predict sentiment based on text input. The model is fine-tuned to classify sentiment into specific categories relevant to brand perception, such as product quality, customer service, and sustainability.
  - **Error Handling and Model Deployment:** The integration with OpenAI's API includes handling potential errors like rate limits and ensuring the model can continuously classify new input by implementing retries and backoff strategies. Predictions from the model are used to assign labels to each text entry, indicating the sentiment towards various aspects of brand perception.

- Implementation and Evaluation:

  - **Continuous Learning:** The model is designed to update its learning as new data becomes available, ensuring that the sentiment analysis remains relevant over time and reflects current consumer opinions. Regular evaluations are conducted to compare predicted sentiments against actual brand outcomes, allowing adjustments to the model as necessary.
  - **Compliance and Ethical Considerations:** The project adheres to privacy laws and data usage policies, ensuring that data collection and analysis respect user privacy and data integrity. Steps are taken to validate data sources and ensure the accuracy and reliability of the information used for training and predictions.
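
The evaluation step described above can be made concrete with a small multi-label scoring routine. The following is a minimal sketch, not the project's actual evaluation code: it assumes a set of hand-labeled comments is available, and the function name `per_label_scores` and the sample labels are hypothetical.

```python
# Six pillars of brand perception used as labels in this project.
LABELS = [
    "product quality",
    "reputation & heritage",
    "customer service",
    "social impact",
    "ethical practices",
    "sustainability",
]


def per_label_scores(y_true, y_pred, labels=LABELS):
    """Return {label: (precision, recall)} for multi-label predictions.

    y_true and y_pred are equal-length lists of label sets, one per comment.
    """
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (precision, recall)
    return scores


# Hypothetical hand labels and model predictions for three comments.
y_true = [
    {"product quality"},
    {"sustainability", "ethical practices"},
    {"customer service"},
]
y_pred = [
    {"product quality"},
    {"sustainability"},
    {"product quality"},
]

scores = per_label_scores(y_true, y_pred)
for label, (precision, recall) in scores.items():
    print(f"{label}: precision={precision:.2f}, recall={recall:.2f}")
```

Scoring each pillar separately shows which categories the labeler confuses, which a single overall accuracy number would hide.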

**5. Actionable Implications to Luxury Fashion Designers (i.e., how it serves their purpose):**
*5.1. Implications.*

- We provide brands with detailed insights on public sentiment across multiple dimensions of brand perception, enabling them to tailor their marketing strategies, product development, and customer engagement to better align with current public sentiments.
- Our model offers predictive insights based on sentiment trends, aiding brands in anticipating potential shifts in public perception and adjusting their strategies proactively.

*5.2. Risks and Mitigations.*

As with any new development plan, these decisions come with certain risks:
1. **Data Accuracy and Reliability**: We ensure that the data collected is from credible sources and employ robust data validation techniques to maintain accuracy.
2. **Privacy and Compliance**: We adhere to data privacy laws and social platform policies during data collection and analysis, ensuring ethical data usage.
3. **Rapid Response Requirement**: We prepare brands for quick response strategies as real-time data might indicate the need for swift action to manage emerging brand perception issues. 

## 1. Data Cleaning and Processing

1. Load Raw Data

In [4]:
%pip install langdetect
%pip install scikit-learn

In [1]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from langdetect import detect

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/eldarutiushev/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/eldarutiushev/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/eldarutiushev/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
# loading data
df = pd.read_csv('tiktok_comments.csv')

# we only need a text column in tiktok_comments dataset since it's the only
# thing in this dataset that can help us gauge the brand perception. 
df = df[['text']]

df.head()

Unnamed: 0,text
0,Maybe the simpsons are real and we are the car...
1,I think the designers do this on purpose
2,not me waiting for the model to do the back fl...
3,this was a collab they did with balenciaga yal...
4,the videos were created after 😂


2. Data Cleaning

In [7]:
import pandas as pd
import re
from langdetect import detect
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

def clean_text(text):
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove special characters and punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    return text

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)

def detect_language(text):
    # langdetect raises on empty or symbol-only strings; label those 'unknown'
    try:
        return detect(text)
    except Exception:
        return 'unknown'

# Drop rows with missing text first so the string functions below never see NaN
df = df.dropna(subset=['text']).copy()

# Preprocess and clean data
df['cleaned_text'] = df['text'].apply(clean_text)
df['filtered_text'] = df['cleaned_text'].apply(remove_stopwords)
df['language'] = df['filtered_text'].apply(detect_language)
df = df[df['language'] == 'en']  # Keep only English language comments

df.head()
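
A quick check on a single comment helps confirm what the cleaning step does. The sketch below repeats the two regular expressions from `clean_text` on a made-up comment; the `collapse_whitespace` helper is an assumption added for illustration and is not part of the notebook.

```python
import re


def clean_text(text):
    # Same steps as the notebook: drop URLs, drop punctuation, lowercase
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text.lower()


def collapse_whitespace(text):
    # Hypothetical extra step: squeeze runs of whitespace left by removals
    return re.sub(r'\s+', ' ', text).strip()


sample = "LOVE this bag!! https://example.com/x so chic"  # made-up comment
cleaned = clean_text(sample)
normalized = collapse_whitespace(cleaned)

print(repr(cleaned))
print(repr(normalized))
```

Removing a URL leaves a double space behind. The tokenize-and-join in `remove_stopwords` rejoins words with single spaces, so the extra helper only matters if `cleaned_text` is used directly.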

In [52]:
# Keep only the cleaned, filtered text and rename it to the more general 'text'.
# The guard makes the cell safe to re-run: once the rename has happened,
# 'filtered_text' no longer exists and selecting it again raises a KeyError.
if 'filtered_text' in df.columns:
    df = df[['filtered_text']].rename(columns={'filtered_text': 'text'})

# Now 'df' contains only the column with the cleaned and filtered text
df.head()

In [53]:
df.to_csv('filtered_data.csv', index=False)

3. Splitting into Train, Test and Validation 


In [55]:
from sklearn.model_selection import train_test_split

# First, separate out a random 500 rows for validation
validation_df = df.sample(n=500, random_state=42)
remaining_df = df.drop(validation_df.index)  # Drop the validation rows from the original dataset

# Now, split the remaining data into training and testing sets.
# The 0.7 / 0.85 ratio (about 82.4% train) comes from a 70/15/15 plan in which
# validation takes 15% of the data. Validation here is a fixed 500 rows, so the
# final shares of the full dataset differ from 70/15/15.
remaining_train_percentage = 0.7 / 0.85
train_df, test_df = train_test_split(remaining_df, test_size=(1 - remaining_train_percentage), random_state=42)

# Print the sizes of each dataset to verify
print(f"Training set size: {len(train_df)}")
print(f"Validation set size: {len(validation_df)}")  # Should be exactly 500
print(f"Testing set size: {len(test_df)}")

Training set size: 6772
Validation set size: 500
Testing set size: 1452
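
The printed sizes can be reproduced by hand, which also shows the effective share of each subset. This is a worked sketch of the arithmetic only. It assumes, as I understand scikit-learn's behavior, that a fractional `test_size` is rounded up and the training set takes the remainder.

```python
import math

n_total = 8724        # 6772 + 500 + 1452, taken from the printed sizes
n_validation = 500    # fixed-size validation sample
n_remaining = n_total - n_validation

# Same fraction as in the notebook cell
test_fraction = 1 - 0.7 / 0.85

# Assumed rounding rule: test size rounded up, train gets what is left
n_test = math.ceil(test_fraction * n_remaining)
n_train = n_remaining - n_test

shares = {
    "train": round(n_train / n_total, 3),
    "validation": round(n_validation / n_total, 3),
    "test": round(n_test / n_total, 3),
}

print(n_train, n_validation, n_test)
print(shares)
```

Because the validation set is a fixed 500 rows rather than 15% of the data, the training share ends up well above the nominal 70%.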


In [None]:
### Splitting data into training, validation and testing sets
filtered_df = pd.read_csv('filtered_data.csv')

# First, hold out the last 500 rows for validation
validation_df = filtered_df.iloc[-500:]

# The remaining rows are kept together for the later train/test split
train_test_df = filtered_df.iloc[:-500]


In [4]:
filtered_df

Unnamed: 0,text
0,maybe simpsons real cartoon
1,think designers purpose
2,waiting model back flip like ones simpsons
3,collab balenciaga yall looks thing simpson put...
4,videos created
...,...
8719,wheres hood scarf
8720,cus dubai thats nissan altima
8721,park toyota dubai switch mercedes
8722,south africa everything would gone including p...


In [5]:
validation_df

Unnamed: 0,text
8224,opium founding father
8225,yall trippin fit clean
8226,might destroy lonely
8227,alr show us women
8228,bad think jeans ripped pull bad
...,...
8719,wheres hood scarf
8720,cus dubai thats nissan altima
8721,park toyota dubai switch mercedes
8722,south africa everything would gone including p...


In [6]:
train_test_df

Unnamed: 0,text
0,maybe simpsons real cartoon
1,think designers purpose
2,waiting model back flip like ones simpsons
3,collab balenciaga yall looks thing simpson put...
4,videos created
...,...
8219,true philippines fashion shorts tsinelas sando...
8220,philippines fashion thats underrated
8221,average fashion art music major
8222,wheres bag omg


In [7]:
# Save the datasets to CSV files:
train_test_df.to_csv('train_test_data.csv', index=False)
validation_df.to_csv('validation_data.csv', index=False)

In [52]:
train_test_df.to_json('train_test_data.json', orient='records')


In [63]:
# Token count (rough estimate, assuming about 4 characters per token):
import math

print("Average text length in characters: " + str(round(train_test_df['text'].str.len().mean(), 2)))

input_max = train_test_df['text'].str.len().max()

categories = "'product quality', 'reputation & heritage', 'customer service', 'social impact', 'ethical practices', 'sustainability'"

output_max = len(categories)

print("Max text length: " + str(input_max))
print("Max input tokens required for one text: " + str(input_max/4))
print("Max output tokens required for one text: " + str(math.ceil(output_max/4)))


Average text length in characters: 34.04
Max text length: 143
Max input tokens required for one text: 35.75
Max output tokens required for one text: 30
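
The same characters-per-token heuristic can be wrapped in a small helper to budget a whole request. This is a rough sketch: the 4-characters-per-token ratio is only an approximation for English text, and the helper names and the shortened system prompt are hypothetical. An exact count would need the model's real tokenizer.

```python
import math

CHARS_PER_TOKEN = 4  # rough heuristic, not an exact tokenizer


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Estimate the token count of a string, rounding up."""
    return math.ceil(len(text) / chars_per_token)


def estimate_request_tokens(system_prompt, comment):
    """Estimate input tokens for one request: system prompt plus comment."""
    return estimate_tokens(system_prompt) + estimate_tokens(comment)


# The longest comment in the dataset is 143 characters long
longest_comment = "x" * 143
system_prompt = "You are a helpful label assistant."  # shortened placeholder

print(estimate_tokens(longest_comment))
print(estimate_request_tokens(system_prompt, longest_comment))
```

The system prompt and the function schema are sent with every request, so the per-request input is larger than the comment alone suggests.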


## 2. OpenAI API Labeling

In [64]:
# Install the OpenAI library if not already updated
!pip install --upgrade openai

In [65]:
# Import the OpenAI library
import os
import openai
from openai import OpenAIError
from openai import OpenAI

import pandas as pd
import time
import json

# Read the API key from the OPENAI_API_KEY environment variable.
# Never paste the key itself into the notebook.
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"))

In [66]:
# Data set for training and testing the model
train_test_df = pd.read_csv('train_test_data.csv')
len(train_test_df)
train_test_df.head()


Unnamed: 0,text
0,maybe simpsons real cartoon
1,think designers purpose
2,waiting model back flip like ones simpsons
3,collab balenciaga yall looks thing simpson put...
4,videos created


In [67]:
# Building an OpenAI function:


api_function = {
   "type": "function",
   "function": {
        "name": "predict_label",
        "description": "Identify the brand perception category/categories based on the text",
        "parameters": {
            "type": "object",
            "properties": {
                "prediction": {
                    "type": "array",
                    "items": {
                        "type": "string",
                        "enum": [
                            'product quality', 
                            'reputation & heritage', 
                            'customer service', 
                            'social impact', 
                            'ethical practices',
                            'sustainability'
                        ]
                    },
                    "description": "Brand Labels for the social media comments."
                }
            },
            "required": [
                "prediction"
            ]
        }
    }
}
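
Even with an `enum` in the schema, it can be worth validating the returned arguments before storing them. The sketch below is an assumption about how such a check might look, not part of the notebook: `parse_prediction` is a hypothetical helper that keeps only labels from the allowed list.

```python
import json

ALLOWED_LABELS = {
    "product quality",
    "reputation & heritage",
    "customer service",
    "social impact",
    "ethical practices",
    "sustainability",
}


def parse_prediction(arguments_json):
    """Parse function-call arguments and keep only allowed, unique labels."""
    try:
        payload = json.loads(arguments_json)
    except json.JSONDecodeError:
        return []  # malformed JSON: treat as no labels
    if not isinstance(payload, dict):
        return []
    labels = payload.get("prediction", [])
    if not isinstance(labels, list):
        return []
    kept = []
    for label in labels:
        if isinstance(label, str) and label in ALLOWED_LABELS and label not in kept:
            kept.append(label)
    return kept


# Made-up argument strings, including an unknown label and a duplicate
print(parse_prediction('{"prediction": ["sustainability", "luxury", "sustainability"]}'))
print(parse_prediction("not json"))
```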

In [71]:
# Exponential backoff on rate-limit errors
import backoff


@backoff.on_exception(backoff.expo, openai.RateLimitError)
def classifier(user_input):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful labeling assistant. Label the element of brand perception using these labels: 'product quality', 'reputation & heritage', 'customer service', 'social impact', 'ethical practices', and 'sustainability'. You can choose more than one."},
            {"role": "user", "content": user_input}],
        # api_function uses the {"type": "function", ...} wrapper, which is the
        # format expected by `tools`, not by the legacy `functions` parameter
        tools=[api_function],
        tool_choice={"type": "function", "function": {"name": "predict_label"}},
        max_tokens=500,
        temperature=0,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
        )
    # The forced tool call carries the labels as a JSON string
    arguments = response.choices[0].message.tool_calls[0].function.arguments
    return json.loads(arguments)["prediction"]
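
For readers unfamiliar with the `backoff.on_exception(backoff.expo, ...)` decorator, the idea behind it can be sketched in plain Python. This is an illustrative sketch of exponential backoff with jitter, not the `backoff` library's implementation. The `call_with_backoff` helper, its parameters, and the `flaky` function are all hypothetical.

```python
import random
import time


def call_with_backoff(func, *args, max_tries=5, base_delay=1.0,
                      max_delay=60.0, retry_on=(Exception,), sleep=time.sleep):
    """Call func, retrying on the given exceptions with growing random waits."""
    for attempt in range(max_tries):
        try:
            return func(*args)
        except retry_on:
            if attempt == max_tries - 1:
                raise  # out of retries: surface the last error
            # Upper bound doubles each attempt: 1s, 2s, 4s, ... capped at max_delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter spreads out retries


# Demo with a stand-in for a rate-limited API call that fails twice
attempts = {"count": 0}
sleeps = []  # record waits instead of actually sleeping


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


result = call_with_backoff(flaky, retry_on=(RuntimeError,), sleep=sleeps.append)
print(result, attempts["count"], len(sleeps))
```

Passing the sleep function as a parameter keeps the demo fast. In real use the default `time.sleep` would pause between attempts.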


In [72]:
# Add an empty column to hold the predicted labels:
train_test_df['predicted_labels'] = None

In [73]:
# Classification process on 100 rows:

for row in range(100):
    # Use the classifier function to predict the labels for each row in the dataset
    try:
        user_input = train_test_df.iloc[row]['text']
        train_test_df.at[row, 'predicted_labels'] = classifier(user_input)
        print(f"Row {row} processed successfully, labels: {train_test_df.iloc[row]['predicted_labels']}")
    except OpenAIError as e:
        # Printing all OpenAI API errors
        print(f"Error: {e}")


KeyboardInterrupt: 
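
Since the labeling run above was interrupted, one option is to make the loop resumable so that finished rows are not paid for twice. The following is a minimal sketch under stated assumptions: rows are plain dictionaries rather than a DataFrame, and `label_rows`, `fake_classify`, and the checkpoint file name are hypothetical stand-ins.

```python
import json
import os
import tempfile


def label_rows(rows, classify, checkpoint_path, save_every=25):
    """Label rows that have no labels yet, saving progress to a JSON file."""

    def save():
        with open(checkpoint_path, "w", encoding="utf-8") as fh:
            json.dump(rows, fh, ensure_ascii=False)

    try:
        for i, row in enumerate(rows):
            if row.get("predicted_labels") is not None:
                continue  # already labeled in an earlier run
            try:
                row["predicted_labels"] = classify(row["text"])
            except Exception as exc:
                print(f"Row {i} failed: {exc}")
                continue
            if (i + 1) % save_every == 0:
                save()
    finally:
        # Runs even on KeyboardInterrupt, so partial work is kept
        save()
    return rows


def fake_classify(text):
    # Hypothetical stand-in for the real classifier call
    return ["product quality"] if "bag" in text else []


rows = [
    {"text": "wheres bag omg", "predicted_labels": None},
    {"text": "videos created", "predicted_labels": ["social impact"]},
]
checkpoint = os.path.join(tempfile.mkdtemp(), "labels_checkpoint.json")
label_rows(rows, fake_classify, checkpoint)

with open(checkpoint, encoding="utf-8") as fh:
    saved = json.load(fh)
print(saved)
```

On a restart, loading the checkpoint back into `rows` lets the loop skip every comment that already has labels.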