# Final Project - Gauging Brand Perception: How Luxury Brands Can Stay One Step Ahead.

- Please read README.md before running the project.
- Team members: Carley Wiley, Eldar Utiushev, Bek Tukhtasinov, Mira Mohan, Tammy Duong, and Aryonna Rice.

## 1. Importing and downloading the necessary libraries

Before running the analysis, it's necessary to install specific Python libraries that our scripts depend on. Below are the commands used to install these libraries:

- **langdetect**: This library is used to detect the language of a given text automatically. It supports detection of several languages and is very useful for preprocessing steps in NLP tasks where language-specific processing might be necessary.

- **scikit-learn**: A powerful and versatile machine learning library in Python. It includes a wide range of tools for modeling and transforming data, including classification, regression, clustering, and dimensionality reduction. We use this library for various machine learning tasks throughout our analysis.

- **openai**: Provides API access to OpenAI's GPT models, allowing for easy integration of AI-powered natural language processing into applications. The install command below uses `--upgrade` to move to the latest version.

In [3]:
!pip install langdetect
!pip install scikit-learn
!pip install --upgrade openai


[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: pip install --upgrade pip


The next cell imports the libraries and downloads the resources used throughout the project for text processing, language detection, machine learning, and deep learning. Each component and its purpose is explained below:

### Library Imports:
- **Pandas**: Used for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series.
- **re**: Provides regular expression matching operations similar to those found in Perl, used for string searching and manipulation.
- **nltk (Natural Language Toolkit)**: A suite of libraries and programs for symbolic and statistical natural language processing. It includes facilities for tokenizing, part-of-speech tagging, stemming, and more.
- **langdetect**: A library for detecting the language of text.
- **sklearn (Scikit-learn)**: A machine learning library for Python. It features various classification, regression, and clustering algorithms.
- **openai**: A library to interface with OpenAI's GPT models and other AI tools provided by OpenAI.
- **numpy**: Fundamental package for scientific computing with Python, providing support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- **torch (PyTorch)**: An open-source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing.
- **transformers**: Provides thousands of pre-trained models for text tasks such as classification, information extraction, question answering, summarization, translation, and text generation in 100+ languages.
- **datasets**: A library for easily accessing and sharing datasets for machine learning tasks.

### Resource Downloads:
- **nltk resources**: Downloading necessary datasets for tokenization (`punkt`), stopwords (`stopwords`), and lemmatization (`wordnet`) which are essential for text preprocessing.

### Environment Setup:
- The code ensures all necessary libraries and packages are imported for tasks ranging from data manipulation, model training, and evaluation to working with state-of-the-art NLP models. This setup is crucial for handling complex workflows in data science projects that involve text analysis and machine learning.

### Use Cases:
- **Train/Test Split**: The `train_test_split` function from `sklearn` is used to divide the dataset into training and testing sets, which is a common practice in machine learning to evaluate model performance.
- **Model Training and Evaluation**: Using libraries like `torch` and `transformers` for training deep learning models and evaluating them using metrics like precision, recall, and F1-score.
- **Krippendorff's alpha**: A statistical measure of the agreement level among multiple raters for qualitative (categorical) items, indicating the reliability of the raters.

This code is typically used in a setup phase of a project where machine learning or deep learning models are trained for tasks such as sentiment analysis, language detection, or any other form of textual data analysis.

In [4]:
# Standard library
import json
import os
import random
import re
import time
from datetime import datetime

# Data handling
import numpy as np
import pandas as pd

# Text preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from langdetect import detect

# Machine learning and evaluation
import krippendorff
import torch
from datasets import Dataset, DatasetDict
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from transformers import TrainingArguments, Trainer, EvalPrediction, AutoConfig, AutoTokenizer, AutoModelForSequenceClassification, IntervalStrategy

# OpenAI API
import openai
from openai import OpenAI, OpenAIError

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/ulugsali/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ulugsali/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ulugsali/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## 2. Loading the scraped data from TikTok related to fashion brands described in the video submission and presentation.

In [15]:
# loading data
df = pd.read_csv('tiktok_comments.csv')

# we only need a text column in tiktok_comments dataset since it's the only
# thing in this dataset that can help us gauge the brand perception. 
df = df[['text']]

df.head()

Unnamed: 0,text
0,Maybe the simpsons are real and we are the car...
1,I think the designers do this on purpose
2,not me waiting for the model to do the back fl...
3,this was a collab they did with balenciaga yal...
4,the videos were created after 😂


## 3. Preparing the loaded text data for NLP

### Text Preprocessing Workflow

This section of the code prepares the text data for natural language processing (NLP) through several preprocessing steps. These steps clean and standardize the text to improve the performance of NLP models. Each function and its role in the pipeline is explained below:

### Functions Defined:

- **`clean_text`**:
  - **Purpose**: Cleans the text by removing URLs, special characters, punctuation, and converting all text to lowercase.
  - **Implementation**:
    - URLs are removed using a regular expression that identifies strings that start with `http` or `www`.
    - Punctuation, emoji, and other symbols are removed with a regex that matches any character that is neither a word character (`\w`: letters, digits, and underscore) nor whitespace.
    - Converts all characters in the text to lowercase to standardize the case.

- **`remove_stopwords`**:
  - **Purpose**: Filters out stopwords from the text, which are commonly used words (such as "the", "a", "an", "in") that may not be useful for some NLP tasks.
  - **Implementation**:
    - Utilizes the `stopwords` list from NLTK library, which is a well-curated list of stopwords for the English language.
    - Tokenizes the text into individual words and filters out any words that are in the stopwords list.
    - Rejoins the filtered words into a single string.

- **`detect_language`**:
  - **Purpose**: Detects the language of the text using the `langdetect` library, which can recognize multiple languages based on the textual input.
  - **Implementation**:
    - Attempts to detect the language and returns the language code (e.g., "en" for English).
    - If detection fails (possibly due to insufficient text), it returns 'unknown'.

### Data Cleaning Process:

- **Applying Functions**:
  - The `clean_text` function is applied to the raw text data in the DataFrame to perform initial cleaning.
  - The `remove_stopwords` function is then applied to the cleaned text to filter out stopwords.
  - The `detect_language` function is applied last to determine the language of the filtered text.

- **Filtering Non-English Comments**:
  - Only rows where the detected language is English ('en') are retained. This is crucial for tasks that are specifically designed for English language data.

- **Handling Missing Values**:
  - Drops any rows where the 'filtered_text' column is null. Note that `dropna` does not remove empty strings; comments that become empty after preprocessing are instead removed by the language filter, because `detect_language` returns 'unknown' for them.

### Usage:

This preprocessing pipeline is typically used in the initial stages of a text analysis project to ensure that the text data is clean and uniform, making it more amenable to analysis and modeling. It lays the foundation for reliable and accurate insights from subsequent NLP tasks.
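
To make the cleaning rules concrete, here is a small standalone sketch that applies the same two regular expressions described above to a few made-up comments (the sample strings are illustrative and not taken from the dataset). It also shows one edge case worth knowing about: a comment made only of emoji or punctuation becomes an empty string, which is not a null value, so a null check alone would not catch it.

```python
import re

def clean_text(text):
    # Same three steps as described above: strip URLs, strip non-word symbols, lowercase
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text.lower()

# Hypothetical sample comments, for illustration only
samples = [
    "LOVE this bag!!! https://example.com/bag",
    "not_my_style...",
    "😂😂😂",
]

cleaned = [clean_text(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))

# Empty results are strings, not None/NaN, so they need an explicit emptiness check
non_empty = [c for c in cleaned if c.strip()]
```

The underscore in the second sample survives because `\w` includes it, and removing a URL can leave trailing whitespace behind.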

In [1]:
# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove special characters and punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Convert to lowercase
    return text.lower()

def remove_stopwords(text):
    words = word_tokenize(text)
    filtered_words = [word for word in words if word not in STOP_WORDS]
    return ' '.join(filtered_words)

def detect_language(text):
    try:
        return detect(text)
    except Exception:
        # langdetect raises when the text has no usable features (e.g. an empty string)
        return 'unknown'

# Preprocess and clean data
df['cleaned_text'] = df['text'].apply(clean_text)
df['filtered_text'] = df['cleaned_text'].apply(remove_stopwords)
df['language'] = df['filtered_text'].apply(detect_language)
# Keep only English-language comments; .copy() avoids modifying a view of the original frame
df = df[df['language'] == 'en'].copy()

# Handle missing values
df = df.dropna(subset=['filtered_text'])

df.head()


In [None]:
# Keep only the cleaned, stopword-filtered text and rename the column to 'text',
# which is the column name the later cells expect
df = df[['filtered_text']].rename(columns={'filtered_text': 'text'})

# Now 'df' contains only the column with the cleaned and filtered text
df.head()


In [None]:
df.to_csv('filtered_data.csv', index=False)

### Splitting into Train, Test, and Validation Sets


In [None]:
# First, separate out a random 500 rows for validation
validation_df = df.sample(n=500, random_state=42)
remaining_df = df.drop(validation_df.index)  # Drop the validation rows from the original dataset

# Now, split the remaining data into training and testing sets.
# 0.7 / 0.85 (about 0.82) is the training share of the remaining rows, i.e. a 70:15
# train:test ratio applied to whatever is left after removing the validation rows.
remaining_train_percentage = 0.7 / 0.85
train_df, test_df = train_test_split(remaining_df, test_size=(1 - remaining_train_percentage), random_state=42)

# Print the sizes of each dataset to verify
print(f"Training set size: {len(train_df)}")
print(f"Validation set size: {len(validation_df)}")  # Should be exactly 500
print(f"Testing set size: {len(test_df)}")

Training set size: 6772
Validation set size: 500
Testing set size: 1452
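
The printed sizes can be reproduced by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to the next whole row and gives the training set the rest; the total of 8,724 rows is taken from the sum of the three sizes printed above.

```python
import math

total_rows = 6772 + 500 + 1452      # sum of the sizes printed above
validation_rows = 500
remaining_rows = total_rows - validation_rows

train_share = 0.7 / 0.85            # share of the remaining rows used for training
test_share = 1 - train_share

# Assumption: a fractional test_size is rounded up, and training gets the rest
n_test = math.ceil(test_share * remaining_rows)
n_train = remaining_rows - n_test

print(f"train={n_train}, test={n_test}, validation={validation_rows}")
print(f"train share of all rows: {n_train / total_rows:.1%}")
```

Because the validation set is a fixed 500 rows rather than 15% of the data, the overall proportions come out at roughly 77.6% / 16.6% / 5.7% instead of 70 / 15 / 15.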


In [None]:
# Splitting data into a validation set and a combined train/test set
filtered_df = pd.read_csv('filtered_data.csv')

# Hold out the last 500 rows for validation (a positional slice, not a random sample)
validation_df = filtered_df.iloc[-500:]

# Keep the remaining rows as the combined train/test set
train_test_df = filtered_df.iloc[:-500]


In [None]:
filtered_df

Unnamed: 0,text
0,maybe simpsons real cartoon
1,think designers purpose
2,waiting model back flip like ones simpsons
3,collab balenciaga yall looks thing simpson put...
4,videos created
...,...
8719,wheres hood scarf
8720,cus dubai thats nissan altima
8721,park toyota dubai switch mercedes
8722,south africa everything would gone including p...


In [None]:
validation_df

Unnamed: 0,text
8224,opium founding father
8225,yall trippin fit clean
8226,might destroy lonely
8227,alr show us women
8228,bad think jeans ripped pull bad
...,...
8719,wheres hood scarf
8720,cus dubai thats nissan altima
8721,park toyota dubai switch mercedes
8722,south africa everything would gone including p...


In [None]:
train_test_df

Unnamed: 0,text
0,maybe simpsons real cartoon
1,think designers purpose
2,waiting model back flip like ones simpsons
3,collab balenciaga yall looks thing simpson put...
4,videos created
...,...
8219,true philippines fashion shorts tsinelas sando...
8220,philippines fashion thats underrated
8221,average fashion art music major
8222,wheres bag omg


In [None]:
# Save the datasets to CSV files:
train_test_df.to_csv('train_test_data.csv', index=False)
validation_df.to_csv('validation_data.csv', index=False)

In [None]:
train_test_df.to_json('train_test_data.json', orient='records')

In [16]:
train_test = pd.read_csv('train_test_data.csv')
len(train_test)
train_test.head()

Unnamed: 0,text
0,maybe simpsons real cartoon
1,think designers purpose
2,waiting model back flip like ones simpsons
3,collab balenciaga yall looks thing simpson put...
4,videos created


## 4. Converting to an array of comments

In [22]:
# Format each comment by appending a trailing comma
def convert_to_array(text):
    return f'{text},'

# Apply the function to the DataFrame
comments = train_test['text'].apply(convert_to_array)
comments.describe()

count       8224
unique      8014
top       first,
freq          17
Name: text, dtype: object

## 5. OpenAI API Labeling

In [32]:
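# 1. API Configuration
# Read the key from the OPENAI_API_KEY environment variable; the placeholder is only a fallback.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY", "enter_your_api"))
# We removed our API key for security reasons!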

# 2. Define a function to query OpenAI's models via the API
def ask_gpt(System_Prompt, User_Query, tokens, temp=1.0, top_p=1.0, frequency_penalty=0.0, presence_penalty=0.0, model="gpt-4-turbo"):
    """Function that Queries OpenAI API"""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": System_Prompt},
            {"role": "user", "content": User_Query}],
        max_tokens=tokens,
        temperature=temp,
        top_p=top_p,
        frequency_penalty=frequency_penalty,
        presence_penalty=presence_penalty
    )
    return response

In [114]:
# 3. Define your Prompt:

## System: Role, task, and format
System_Prompt = """
You are an AI expert trained to analyze social media comments for insights into fashion brand perceptions and emotional sentiments. Your task is to interpret each comment, even when the references to brands or sentiments are indirect, and classify the comment into applicable categories based on the information provided.

Brand perception categories:
1. Product Quality: Indicate if the comment suggests any perception about the quality of fashion products.
2. Reputation & Heritage: Determine if the comment reflects on the brand's history or reputation in the fashion industry.
3. Customer Service: Assess if there are any mentions or implications regarding customer service experiences with fashion brands.
4. Social Impact: Consider if the comment discusses the brand's role or actions in social issues.
5. Ethical Practices: Evaluate any mentions of ethical practices by the fashion brand.
6. Sustainability: Look for any discussions related to environmental sustainability of the brand.

Emotional sentiments to identify based on the presence of specific keywords:
- 'love', 'great', 'amazing', 'impressed', 'thrilled', 'excited'
- 'horrible', 'bad', 'disappointed', 'terrible', 'worse', 'hate'

Analyze the text, assign it to relevant brand perception categories, and determine the emotional sentiment based on the keywords or overall tone of the comment. If the comment does not contain explicit keywords, use your judgment to infer the sentiment from the context. Provide your findings in a structured format, listing both the identified brand perception categories and the emotional sentiment.
If the comment even slightly or vaguely implies something, categorize it into the categories I provided. If it's a reference or a very short phrase, still categorize it. Don't leave any comment unlabeled in terms of brand perception and emotional sentiment.

For each comment, determine which brand perception categories and emotional sentiments apply. Provide your analysis in a structured format, with two columns: one for 'brand_labels' and one for 'emotion_labels'. List all applicable categories and sentiments in their respective columns, formatted as arrays. For example, a comment that is classified under both 'product quality' and 'reputation & heritage' for brand perception, and 'love' and 'admiration' for emotional sentiment, should be formatted as follows:

Brand_labels: ["product quality", "reputation & heritage"]
Emotion_labels: ["love", "admiration"]

Please ensure to apply this structured approach to all comments, making judgment calls as necessary based on even slight or indirect implications within the comments.
"""

User_Query = comments[0]
display(comments)

0                            maybe simpsons real cartoon,
1                                think designers purpose,
2             waiting model back flip like ones simpsons,
3       collab balenciaga yall looks thing simpson put...
4                                         videos created,
                              ...                        
8219    true philippines fashion shorts tsinelas sando...
8220                philippines fashion thats underrated,
8221                     average fashion art music major,
8222                                      wheres bag omg,
8223                              blud think willy wonka,
Name: text, Length: 8224, dtype: object

In [115]:
# 3. Use API to connect with AI model and get model response
response = ask_gpt(System_Prompt, User_Query, tokens=1000, temp=0, model="gpt-3.5-turbo")

# 4. Show response
display(response)

# 5. Get Response Content
answer = response.choices[0].message.content
print(answer)

ChatCompletion(id='chatcmpl-9KvM6med0bfBDNGBAYL9T8gXLmRqb', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Brand_labels: ["Reputation & Heritage"]\nEmotion_labels: ["Neutral"]', role='assistant', function_call=None, tool_calls=None))], created=1714774062, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=16, prompt_tokens=540, total_tokens=556))

Brand_labels: ["Reputation & Heritage"]
Emotion_labels: ["Neutral"]


In [141]:
# 6. Define an OpenAI function
myfunction = {
  "type": "function",
  "function": {
    "name": "predict_label",
    "description": "Identify the brand perception category/categories and emotional sentiments based on the text",
    "parameters": {
      "type": "object",
      "properties": {
        "brand_labels": {
          "type": "array",
          "items": {
            "type": "string",
            "enum": [
              "product quality", 
              "reputation & heritage", 
              "customer service", 
              "social impact", 
              "ethical practices",
              "sustainability"
            ]
          },
          "description": "Brand perception labels for the social media comments."
        },
        "emotion_labels": {
          "type": "array",
          "items": {
            "type": "string",
            "enum": [
              "love", 
              "admiration", 
              "disgust", 
              "disapproval", 
              "excitement", 
              "optimism", 
              "disappointment", 
              "approval", 
              "pride"
            ]
          },
          "description": "Emotional sentiment labels for the social media comments."
        }
      },
      "required": [
        "brand_labels", 
        "emotion_labels"
      ]
    }
  }
}
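
In the cells shown here, this schema is defined but not passed to the API call, so the model is free to return labels outside these lists (the system prompt, for example, mentions keywords such as 'great' that are not in the emotion enum). One way to use such a schema is to supply it through the `tools` parameter of `client.chat.completions.create`. Another, sketched below without any API call, is to check returned labels against the enum values after the fact. The helper name `filter_to_schema` and the sample output are illustrative assumptions, not part of the original notebook.

```python
import json

# Copy of the enum lists from the schema (assumption: same values as myfunction)
schema = {
    "brand_labels": ["product quality", "reputation & heritage", "customer service",
                     "social impact", "ethical practices", "sustainability"],
    "emotion_labels": ["love", "admiration", "disgust", "disapproval", "excitement",
                       "optimism", "disappointment", "approval", "pride"],
}

def filter_to_schema(arguments_json, allowed=schema):
    """Split each label list into values inside and outside the allowed enum."""
    arguments = json.loads(arguments_json)
    kept, rejected = {}, {}
    for field, allowed_values in allowed.items():
        # Normalize case so "Reputation & Heritage" matches "reputation & heritage"
        labels = [str(label).strip().lower() for label in arguments.get(field, [])]
        kept[field] = [label for label in labels if label in allowed_values]
        rejected[field] = [label for label in labels if label not in allowed_values]
    return kept, rejected

# Hypothetical model output, for illustration only
raw = '{"brand_labels": ["Reputation & Heritage"], "emotion_labels": ["love", "great"]}'
kept, rejected = filter_to_schema(raw)
print(kept)
print(rejected)
```

Keeping the rejected labels, rather than silently dropping them, makes it easier to see how often the model strays from the intended label set.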

In [162]:
# 7. Define a Python Function to query an OpenAI Model
def classify_gpt(text, model="gpt-3.5-turbo", tokens=200, temp=0.0, top_p=1.0, frequency_penalty=0.0, presence_penalty=0.0):
    """Function that uses OpenAI's API to classify text based on brand perception and emotional sentiment"""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": System_Prompt},  # System prompt that explains the task to the model
                {"role": "user", "content": text}  # User input that needs to be classified
            ],
            max_tokens=tokens,
            temperature=temp,
            top_p=top_p,
            frequency_penalty=frequency_penalty,
            presence_penalty=presence_penalty
        )
        # Correct way to access the response content and tokens
        content = response.choices[0].message.content  # Directly accessing the 'content' attribute of the 'message' object
        total_tokens = response.usage.total_tokens  # Directly accessing 'total_tokens' attribute of the 'usage' object
        return content, total_tokens
    except Exception as e:
        return f"Error: {str(e)}", 0

In [165]:
start_time = time.time()

# 8. Get all texts classified
results = []
tokens_used = 0
review_count = 0

# Query for each text the OpenAI model
for text in comments:

    # Error handling matters when working with APIs. OpenAIError is imported so that API errors can be caught and reported
    try:
        response, tokens = classify_gpt(text, top_p=0.1, model="gpt-4-turbo")
        results.append((text, response, tokens))
        tokens_used += tokens
        review_count += 1
        if review_count % 50 == 0:
            print(f"Processed {review_count} comments")  # Report progress every 50 comments
    except OpenAIError as e:
        # Handle all OpenAI API errors
        print(f"Error: {e}")

# Calculate and print the elapsed time
elapsed_time = time.time() - start_time
if elapsed_time > 60:
    print(f"Total time: {elapsed_time / 60:.2f} minutes")
else:
    print(f"Total time: {elapsed_time:.2f} seconds")

Processed 50 comments
Processed 100 comments
Processed 150 comments
Processed 200 comments
Processed 250 comments
Processed 300 comments
Processed 350 comments
Processed 400 comments
Processed 450 comments
Processed 500 comments
Processed 550 comments
Processed 600 comments
Processed 650 comments
Processed 700 comments
Processed 750 comments
Processed 800 comments
Processed 850 comments
Processed 900 comments
Processed 950 comments
Processed 1000 comments
Processed 1050 comments
Processed 1100 comments
Processed 1150 comments
Processed 1200 comments
Processed 1250 comments
Processed 1300 comments
Processed 1350 comments
Processed 1400 comments
Processed 1450 comments
Processed 1500 comments
Processed 1550 comments
Processed 1600 comments
Processed 1650 comments
Processed 1700 comments
Processed 1750 comments
Processed 1800 comments
Processed 1850 comments
Processed 1900 comments
Processed 1950 comments
Processed 2000 comments
Processed 2050 comments
Processed 2100 comments
Processed 21
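
Long labeling runs like this one can hit transient failures such as rate limits or timeouts. Note also that `classify_gpt` catches every exception itself and returns an `"Error: ..."` string, so a failed call ends up in `results` as if it were a label. A common pattern is to retry with exponential backoff and let the final failure surface. The sketch below is generic and makes no API calls; `call_with_retries` and `flaky_classifier` are hypothetical names, and the delays are arbitrary.

```python
import random
import time

def call_with_retries(fn, *args, max_attempts=4, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn, retrying on any exception with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            # Wait 1x, 2x, 4x ... the base delay, plus up to 10% random jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, 0.1 * delay))

# Hypothetical stand-in for an API call that fails twice before succeeding
attempts = {"count": 0}

def flaky_classifier(text):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated transient error")
    return f"labeled: {text}"

# sleep is replaced with a no-op here so the demo finishes instantly
result = call_with_retries(flaky_classifier, "wheres bag omg", sleep=lambda s: None)
print(result, "after", attempts["count"], "attempts")
```

In practice it is usually better to retry only on error types that are known to be transient, rather than on every `Exception` as this sketch does.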

In [169]:
print(results[0])

# 9. Create a dataframe from the results
df = pd.DataFrame(results, columns=['Text', 'labels','Tokens_used'])
df.head()
print(df['labels'])

('maybe simpsons real cartoon,', 'Brand_labels: ["product quality", "reputation & heritage", "customer service", "social impact", "ethical practices", "sustainability"]\nEmotion_labels: ["love", "great", "amazing", "impressed", "thrilled", "excited", "horrible", "bad", "disappointed", "terrible", "worse", "hate"]', 613)
0       Brand_labels: ["product quality", "reputation ...
1       Brand_labels: ["reputation & heritage"]\nEmoti...
2       Brand_labels: ["reputation & heritage"]\nEmoti...
3       Brand_labels: ["product quality"]\nEmotion_lab...
4       Brand_labels: ["reputation & heritage"]\nEmoti...
                              ...                        
8219    Brand_labels: ["product quality"]\nEmotion_lab...
8220    Brand_labels: ["reputation & heritage"]\nEmoti...
8221    Brand_labels: ["reputation & heritage"]\nEmoti...
8222    Brand_labels: ["customer service"]\nEmotion_la...
8223    Brand_labels: ["reputation & heritage"]\nEmoti...
Name: labels, Length: 8224, dtype: object

In [170]:
# 10. saving labeled dataset
df.to_csv('labeled_data.csv', index=False)

## 6. Cleaning the labeled dataset

In [171]:
def _extract_list(label_string, marker):
    """Return the items of the bracketed list that follows `marker`, or [] if it is absent."""
    start = label_string.find(marker)
    if start == -1:
        return []  # marker missing, e.g. an "Error: ..." response
    start += len(marker)
    end = label_string.find("]", start)
    if end == -1:
        return []  # list was never closed, e.g. a truncated response
    items = label_string[start:end]
    if not items.strip():
        return []
    return [label.strip().strip('"') for label in items.split(",")]

def parse_labels(label_string):
    # Extract brand labels and emotion labels from the model's text response
    brand_labels = _extract_list(label_string, "Brand_labels: [")
    emotion_labels = _extract_list(label_string, "Emotion_labels: [")
    return brand_labels, emotion_labels

# Apply the parsing function to the labels column
df[['brand_label', 'emotion_label']] = df['labels'].apply(lambda x: pd.Series(parse_labels(x)))

In [173]:
df = df.drop(['labels', 'Tokens_used'], axis=1)

In [176]:
df.head()

Unnamed: 0,Text,brand_label,emotion_label
count,8224,8224,8224
unique,8014,34,111
top,"first,",[reputation & heritage],[]
freq,17,3291,2455


We remove comments with empty emotion label lists because they likely don't represent any kind of brand perception.

In [180]:
# Filter out rows with empty emotion labels
df = df[df['emotion_label'].apply(lambda x: len(x) > 0)]

# Preview the filtered data
df.head()

Unnamed: 0,Text,brand_label,emotion_label
count,5769,5769,5769
unique,5641,32,110
top,"first,",[reputation & heritage],[love]
freq,17,2056,2414
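
Because each comment can carry several labels, the `top` and `freq` rows above count whole label lists rather than individual labels. A per-label tally can be sketched with `collections.Counter`; the rows below are made up for illustration, and with the real data the same function could be given `df['emotion_label']`.

```python
from collections import Counter

def count_labels(label_lists):
    """Count how often each individual label occurs across a sequence of label lists."""
    counts = Counter()
    for labels in label_lists:
        counts.update(labels)
    return counts

# Hypothetical parsed labels, for illustration only
emotion_labels = [["love"], ["love", "admiration"], ["disapproval"], ["love", "excitement"]]

emotion_counts = count_labels(emotion_labels)
for label, n in emotion_counts.most_common():
    print(f"{label}: {n}")
```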


In [179]:
# saving cleaned labeled data
df.to_csv('labeled_data_cleaned.csv', index=False)
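
One caveat when this file is read back: CSV has no list type, so each label list is written as its string representation, and `pd.read_csv` returns those strings unchanged. A sketch of converting them back with the standard library's `ast.literal_eval` follows. The sample value is illustrative, and `parse_list_cell` is a hypothetical helper; with pandas, a function like it could be supplied through the `converters` argument of `pd.read_csv`.

```python
import ast

def parse_list_cell(cell):
    """Turn a stringified Python list such as "['love', 'admiration']" back into a list."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return value

# What a label list looks like after a round trip through CSV (illustrative value)
stored = str(["love", "admiration"])
print(repr(stored))

restored = parse_list_cell(stored)
print(restored)
```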