# UBS Challenge Submission

Group Name: "" (empty string)

Group Participants: Maxim Huber, Elias Mbarek, Michal Mikuta, Noah Stäuble

The culmination of your submission should be a technical report that outlines your analytical journey, highlighting the methodologies employed, any obstacles encountered along the way, and the strategies adopted to overcome these challenges.

## Brief Challenge Description:

We were tasked with ...

## Report Overview

1. Data Understanding
   1. Data Cleaning and Preprocessing
   2. Assumptions on Data
   3. Feature Engineering and Data Augmentation
2. Modelling Approach
   1. Model Selection
   2. Recommendation for Model Enhancement
3. Actionable Insights !! (not sure if we have all of them)
   1. Brand-choice -> data-driven decision
   2. Industry-Level
   3. Market-Level

In [9]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

path = os.path.join(os.getcwd(), os.pardir, "data")
print(path)
file = os.path.join(path, "skylab_instagram_datathon_dataset.csv")
data = pd.read_csv(file, delimiter=";")

/Users/maximhuber/Developer/datathon/docs/../data


## 1. Data Understanding

Provide a detailed account of the initial steps taken to prepare the data for analysis.

This should include a description of how data quality issues, such as missing values or outliers, were addressed.

1. Exploratory Data Analysis
   - Examination and understanding of the dataset's structure and content
   - Performing exploratory data analysis to understand:
     - data patterns, 
     - outliers, and
     - relationships between variables

2. Data Cleaning
   - Data preprocessing, include, but not limited to:
     - handling missing values, 
     - data conversions, and 
     - normalization

### Data Cleaning and Preprocessing

Data cleaning is a common hurdle in data science, so we invested effort in understanding and cleaning our dataset.

The biggest issues we faced:

1. NaNs
2. Unordered data (time-series, group by company...)
3. Duplicate rows
4. Mappings between compset_groups, compsets, business entities, and so on.
5. Company Acquisitions
6. Data normalization -> maybe explain a bit (only for k-means)
7. Removing the "All brands" entries

#### Grouping and sorting data

Before doing anything else, we brought the dataset into a consistent order.

We grouped the data by `business_entity_doing_business_as_name` and, within each business entity, sorted the rows by increasing `period_end_date` (after first converting it to a pandas datetime, as far as we recall).

In [None]:
# not sure if code is necessary for this, it's really straightforward
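
A minimal sketch of the ordering step described above. The two column names come from the text; the `followers` column and the toy values are assumptions used only for illustration, and the function name is hypothetical.

```python
import pandas as pd

def sort_by_brand_and_date(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy sorted by brand name, then by increasing period end date."""
    out = df.copy()
    # Parse the date column so that sorting is chronological, not lexicographic.
    out["period_end_date"] = pd.to_datetime(out["period_end_date"])
    return out.sort_values(
        ["business_entity_doing_business_as_name", "period_end_date"]
    ).reset_index(drop=True)

# Toy data (made up) with rows in arbitrary order.
toy = pd.DataFrame({
    "business_entity_doing_business_as_name": ["B", "A", "A"],
    "period_end_date": ["2023-02-04", "2023-03-04", "2023-01-07"],
    "followers": [10, 30, 20],
})
sorted_toy = sort_by_brand_and_date(toy)
print(sorted_toy)
```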

#### Addressing NaNs

We found that ... % of the data are NaN values. Given this large share of missing data, we decided that imputing values makes more sense than dropping incomplete rows.

Because we work with time-series data, we considered it suitable to linearly interpolate missing numerical values between adjacent observations. This aligns with the goal of finding outliers: linear interpolation continues existing trends and should not introduce artificial jumps, which might be observed when imputing with the mean over a longer time span.

In [None]:
# .... Copy in Michal's code for interpolating
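
Until the original code is copied in, here is a sketch of per-brand linear interpolation. It is not the team's actual implementation: the function name, the `followers` column, and the choice to leave leading and trailing gaps untouched (`limit_area="inside"`) are assumptions. Rows are assumed to be sorted by date within each brand already.

```python
import numpy as np
import pandas as pd

BRAND = "business_entity_doing_business_as_name"

def interpolate_per_brand(df: pd.DataFrame, value_cols: list) -> pd.DataFrame:
    """Linearly interpolate gaps in value_cols separately for every brand."""
    out = df.copy()
    # transform keeps the original row order and never mixes values across brands.
    out[value_cols] = out.groupby(BRAND)[value_cols].transform(
        lambda s: s.interpolate(method="linear", limit_area="inside")
    )
    return out

# Toy data (made up): brand A has an inner gap, brand B a leading and an inner gap.
toy = pd.DataFrame({
    BRAND: ["A", "A", "A", "B", "B", "B", "B"],
    "followers": [1.0, np.nan, 3.0, np.nan, 10.0, np.nan, 30.0],
})
filled = interpolate_per_brand(toy, ["followers"])
print(filled)
```

Leading gaps stay NaN in this sketch because there is no earlier value to interpolate from; whether to drop or otherwise fill them is a separate decision.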

#### Nearly-Duplicate Rows

Next, we realized that for some rows, we had multiple entries for ....

We therefore merged rows that differ only in the `compset` column into a single row that holds a list in the `compset` column. This step is important because, if we compute means over whole industries or the entire market, the computation would otherwise be biased towards companies that are present in multiple competitive sets. As a side effect, this also reduced the size of our dataset.

In [None]:
#### Code where maxim transformed `compset` into lists
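
As a placeholder for the original code, the following sketch shows one way to collapse rows that differ only in `compset` into a single row with a list. The function name and the toy columns other than `compset` are assumptions.

```python
import pandas as pd

def collapse_compsets(df: pd.DataFrame, list_col: str = "compset") -> pd.DataFrame:
    """Merge rows that are identical except for list_col into one row holding a list."""
    key_cols = [c for c in df.columns if c != list_col]
    return (
        # dropna=False keeps rows whose key columns still contain NaN.
        df.groupby(key_cols, sort=False, dropna=False)[list_col]
        .agg(list)
        .reset_index()
    )

# Toy data (made up): brand A appears in two competitive sets for the same period.
toy = pd.DataFrame({
    "business_entity_doing_business_as_name": ["A", "A", "B"],
    "period_end_date": ["2023-01-07", "2023-01-07", "2023-01-07"],
    "followers": [100, 100, 50],
    "compset": ["Soda", "Beverages", "Coffee"],
})
collapsed = collapse_compsets(toy)
print(collapsed)
```

After this step every brand contributes one row per period, so industry or market means are no longer weighted by the number of competitive sets a brand belongs to.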

### Exploratory Data Analysis

In the next step, we tried to understand how our data is structured and what it reveals about the real world. For this, we

- first needed to generate mappings between `compset_group`, `compset`, business entity name, etc.

Afterwards, we could
- compare industries to the whole market
- compare compsets within compset_groups
- compare companies within a compset
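
The mappings mentioned above could be generated along these lines. This is a sketch under assumptions: the function name is hypothetical and the toy rows are made up; only the column names `compset_group` and `compset` come from the dataset description.

```python
import json
import pandas as pd

def build_mapping(df: pd.DataFrame, parent_col: str, child_col: str) -> dict:
    """Map each parent value to the sorted list of child values observed with it."""
    pairs = df[[parent_col, child_col]].dropna().drop_duplicates()
    return {
        parent: sorted(group[child_col])
        for parent, group in pairs.groupby(parent_col)
    }

# Toy data (made up) with a repeated pair.
toy = pd.DataFrame({
    "compset_group": ["Beverages", "Beverages", "Beverages", "Restaurants"],
    "compset": ["Soda", "Alcohol", "Soda", "Coffee"],
})
mapping = build_mapping(toy, "compset_group", "compset")
print(json.dumps(mapping, indent=2))
```

The same helper works for any pair of columns, for example from `compset` to `business_entity_doing_business_as_name`, and the resulting dictionaries can be stored as JSON.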

#### Industries and Sub-Industries

First, we made sure we understood what industries we are working with. We iterated over the csv file and found the following (shortened) hierarchical structure:

In [7]:
# Show part of the industry hierarchy. A tree plot that really impresses the reader could go here.

import json
from itertools import islice

# Map from compset_group -> compset
map_path = '../out/compset_group_to_compset_map.json'
with open(map_path, 'r') as f:
    map_data = json.load(f)

# Print only the first three compset groups to keep the output short
for key, value in islice(map_data.items(), 3):
    print(key)
    print(value)
    print()  # Empty line after each key's values


Luxury & Premium & Mainstream
['US Softlines Analyst Interest List', 'Luxury & Premium & Mainstream', 'Footwear', 'Premium Brands', 'Soft Luxury', 'Hard Luxury', 'Mid-Range Watch & Jewelry', 'Mainstream Brands', 'Global Luxury Analysts Interest List']

Restaurants
['Fast Casual', 'Restaurants', 'Casual Dining', 'QSR', 'Coffee']

Beverages
['Energy drinks', 'Alcohol', 'Beverages', 'Soda', 'Sports drinks']



### Getting a feel for the market

Next, we made sure we understood broad trends between industries. For visualization purposes, we restrict ourselves to 
five industries. They are ... . They exhibit these ... special features, which make them interesting for plotting.

TODO: add plots that show e.g. follower count means over industries, as box-plots and time-series

What to plot:
- which brands have the highest follower counts & growth (time-series plot)
- which industries (nice box plot)
- maybe aggregate across regions (maybe some bubble plot idk)

### ii. Assumptions

Clearly articulate any assumptions that were made during the data preparation phase.

Here, we need to think a little more. Short and concise, but it's also important to have good stuff here.

Assumptions made so far:
- assuming that the data is representative of all the major brands or that the followers count directly impacts brand popularity.
- Post data is not only advertised stuff but also general posts on the company profiles
Impact: more accurate representation of customer sentiment towards the brand because people tend to like and comment on content that is marked as sponsored.
- COVID ??
- no company acquisitions (which holds for most of the data, but not for all (e.g. LVMH)) -> make some plots confirming this (I think Dior was acquired by LVMH, and maybe we see something in the time-series plot after the buy date...). the least we could do would be to identify all holding changes. i think we can do this based on the maps I created or create some more 
- for the k-means model, we work (!!) with 2023 data


### iii. Feature Engineering and Data Augmentation

Describe any techniques employed to enhance the dataset, whether through the creation of new features or augmentation of the existing data.

To work with more relevant data, we transformed our raw data into widely used marketing measures that better represent customer engagement with a company's social media presence. To this end, we employ the following metrics:

- `followers_delta`, `followers_second_delta`: first and second difference of followers
- `followers_spike`: 1 if `followers_delta` differs from its median by more than one standard deviation, 0 otherwise
- `average_engagement_per_post`: engagements (likes + comments) / posts (pictures or videos)
- `per_post_aquisition`: `followers_delta` / engagement; represents the sensitivity of follower growth with respect to engagement. Large values may indicate an effective engagement strategy
- `engagement_rate`: engagements / followers
- `relative_growth`: `followers_delta` / followers
- `growth_per_post`: `relative_growth` / posts; represents the sensitivity of relative growth with respect to the number of posts. Large values may indicate that growth was generated with an efficient number of posts
- `growth_per_engagement`: `relative_growth` / engagement
- `virality_index`: engagement * `relative_growth`; engagement scaled by relative growth


This is the full list of industry-standard metrics that we could adapt to our type of data. In the later outlier analysis, we will see which features contribute most to the extraordinary behavior of a company.
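
The metric definitions above can be sketched in pandas as follows. The raw column names `followers`, `likes`, `comments`, `pictures` and `videos`, the function name, and the choice to compute the spike flag per brand are assumptions; the toy values are made up.

```python
import numpy as np
import pandas as pd

BRAND = "business_entity_doing_business_as_name"

def add_engagement_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the engineered metrics; raw column names are assumptions."""
    out = df.sort_values([BRAND, "period_end_date"]).copy()

    # First and second difference of followers, computed within each brand.
    out["followers_delta"] = out.groupby(BRAND)["followers"].diff()
    out["followers_second_delta"] = out.groupby(BRAND)["followers_delta"].diff()

    # Spike flag (assumed per brand): delta is more than one std away from its median.
    delta = out.groupby(BRAND)["followers_delta"]
    deviation = (out["followers_delta"] - delta.transform("median")).abs()
    out["followers_spike"] = (deviation > delta.transform("std")).astype(int)

    engagement = out["likes"] + out["comments"]
    posts = out["pictures"] + out["videos"]
    out["average_engagement_per_post"] = engagement / posts
    out["per_post_aquisition"] = out["followers_delta"] / engagement
    out["engagement_rate"] = engagement / out["followers"]
    out["relative_growth"] = out["followers_delta"] / out["followers"]
    out["growth_per_post"] = out["relative_growth"] / posts
    out["growth_per_engagement"] = out["relative_growth"] / engagement
    out["virality_index"] = engagement * out["relative_growth"]

    # Divisions by zero (e.g. no posts in a period) become NaN instead of inf.
    return out.replace([np.inf, -np.inf], np.nan)

# Toy data (made up) for a single brand.
toy = pd.DataFrame({
    BRAND: ["A", "A", "A"],
    "period_end_date": ["2023-01-07", "2023-01-14", "2023-01-21"],
    "followers": [100, 110, 130],
    "likes": [8, 9, 18],
    "comments": [2, 1, 2],
    "pictures": [1, 1, 3],
    "videos": [1, 1, 1],
})
featured = add_engagement_features(toy)
print(featured[["followers_delta", "engagement_rate", "relative_growth"]])
```

The first row of each brand has no previous value, so its difference-based features are NaN.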
  



## 2. Model Approach

### i. Model Selection

Explain the choice(s) of statistical or machine learning model(s) utilized in your analysis. Provide a compelling justification for each model selected, emphasizing how they align with the objectives of the challenge.

How could we go about solving this challenge?

In the end, we came up with three models.

1. Baseline: Best ranked companies according to engagement rate
2. k-means: ?
3. ?

In the following, we detail why our model choices make sense in the broader setting.

#### 1. Baseline Model

How it works:
- for each month, we ranked companies according to their mean engagement rates
- in the end, we averaged the company rankings and took the top and bottom three as outliers
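
The two steps above can be sketched as follows. The function name and toy data are assumptions, as is the rank direction: we rank in ascending order so that a higher rank number means a higher engagement rate.

```python
import pandas as pd

BRAND = "business_entity_doing_business_as_name"

def average_monthly_rank(df: pd.DataFrame) -> pd.DataFrame:
    """Rank brands by mean engagement rate per month, then average the ranks."""
    month = pd.to_datetime(df["period_end_date"]).dt.to_period("M")
    monthly = (
        df.assign(month=month)
        .groupby(["month", BRAND])["engagement_rate"]
        .mean()
        .reset_index()
    )
    # Ascending rank within each month: the highest engagement gets the highest rank.
    monthly["rank"] = monthly.groupby("month")["engagement_rate"].rank(ascending=True)
    return (
        monthly.groupby(BRAND)["rank"]
        .agg(rank="mean", n_months="count")
        .sort_values("rank")
    )

# Toy data (made up): three brands, two months, brand A has two rows in January.
toy = pd.DataFrame({
    BRAND: ["A", "A", "B", "C", "A", "B", "C"],
    "period_end_date": ["2023-01-07", "2023-01-14", "2023-01-07", "2023-01-07",
                        "2023-02-04", "2023-02-04", "2023-02-04"],
    "engagement_rate": [0.01, 0.03, 0.05, 0.03, 0.02, 0.06, 0.04],
})
table = average_monthly_rank(toy)
print(table)
```

With the real data, `table.head(3)` and `table.tail(3)` would then give the bottom and top three candidates.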

Reasons:
- need something to start with
- measure consistency and social media engagement (relative to followers), which is one of the most relevant metrics used in social media marketing.
- gives insight into best-performing and worst-performing companies, which can serve as indicators for investment opportunities and risk management, which we will explore more **in a later section**

Results:
Looking at the top-ranked company in each month yields 28 different companies that occupied this spot. This indicates that the space for online engagement is heavily contested and that user engagement is not very steady. Nonetheless, engagement is distributed very unequally: a small subset of companies that have "figured out" social media compete for the top spots. To analyze what makes a company over- or underperform on social media metrics, we further examine the top and bottom three companies of all time. These are determined by averaging each company's rank over all months in which it was present on social media. This yields the following table, where `rank` is the average rank by engagement rate out of 704 companies (a higher rank corresponds to a higher engagement rate) and `n_months` is the number of months included. Here we display the top 10 companies:

|                 |    rank |   n_months |
|:----------------|--------:|-----------:|
| Dr. Martens     | 614.388 |         67 |
| Inov-8          | 615.493 |         67 |
| Crocs           | 616.493 |         67 |
| JD Sports       | 622.299 |         67 |
| ShopGoodwill    | 623.5   |          2 |
| Funko           | 643.429 |         63 |
| SHEIN           | 649.701 |         67 |
| Pop Mart        | 679.182 |         11 |
| Fashion Nova    | 683.944 |         36 |
| Finding Unicorn | 691.182 |         11 |

And analogously the bottom 10:

|           |    rank |   n_months |
|:----------|--------:|-----------:|
| Boca      |  5.06   |         50 |
| Helix     | 20.5309 |         81 |
| Powerade  | 24.8587 |         92 |
| Bobstore  | 29.9841 |         63 |
| Leesa     | 36.9529 |         85 |
| Purina    | 37.5821 |         67 |
| City Chic | 38.254  |         63 |
| INDIO     | 40.4688 |         64 |
| Natura    | 42.9674 |         92 |
| Ragu      | 43.6071 |         84 |

What is immediately noteworthy is that the companies with the lowest engagement have nonetheless been on the platform for some time, while some very new companies are among the top. This highlights the fast-paced nature of the social media environment for companies.

"Finding Unicorn" seems to be taking the best advantage of this. It is a toy manufacturer producing cute figurine art toys, and it self-categorizes as a trend-art company. Its products are very visual in nature, making for very engaging social media posts. Additionally, the company is rather small and seems to put a lot of effort into its friendly image, which may encourage users to engage with its content rather than just consume it.

"Pop Mart" is a very similar business. However, in contrast to "Finding Unicorn", it seems to place less value on appearing like a small local company harboring creativity and artistic expression. This indicates that such an image, while useful, is not necessary to succeed on social media. Such user behavior is perhaps also explained by the demographic, since young people might interact on social media more readily; however, our dataset includes no information that would allow us to study this further.

"Funko" is another business in the same branch. However, it has been around for a lot longer and is the main producer of such figurines.

Thus, having a visually appealing product appears to be a strong predictor of social media success, independent of several other variables.

Another branch that is well represented among the top social media performers is fashion. In addition to the previously examined factor of visual appeal, we can extract further important traits by examining the social media presence of these companies. Many of them make occasional posts that seem unrelated to their product, which seems to differentiate them from companies with less engagement.

Among the low performers, "Powerade" stood out to us, since it is relatively well known and belongs to The Coca-Cola Company.

|              | rank  | n_months |
|--------------|-------|----------|
| SPRITE       | 99.8  | 64       |
| Costa Coffee | 406.4 | 67       |
| Schweppes    | 238.5 | 17       |
| Coca-Cola    | 113.9 | 92       |
| Powerade     | 24.9  | 92       |

Even though Coca-Cola is an extremely popular brand, only the Costa Coffee account is somewhat successful on Instagram. This is most likely due to its more familiar approach to social media. The other accounts focus mostly on design rather than relatability, which seems to hinder engagement. Nonetheless, the Coca-Cola share price does not reflect slowed growth in the age of social media. Thus there is room for a company to thrive without overperforming on social media, although this is most likely due to Coca-Cola's established popularity.



#### 2. K-Means Model
How it works:
We consider our engineered features for the year 2023 only, since we also want to study the data on a shorter time frame, without the effects of COVID.
Since our baseline model is based on monthly rankings and we now have at most 12 months, we decided to use a different methodology here.
The number of clusters was chosen via silhouette scoring. We label a brand an outlier if its distance to the in-class median exceeds k times the in-class standard deviation.
Choosing k = 2 reliably yielded an outlier rate of about 8%; varying k by ±20% changes this to about 12% (smaller k) and 6% (larger k). k = 2 was the smallest choice for which we got a good improvement without being too restrictive.

To better understand which of our engineered features are good indicators of an outlier in this context, we trained a surrogate decision tree on our predictions.
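
The outlier rule can be sketched as below, given cluster labels from any clustering step. This is one possible reading, not necessarily the exact rule we implemented: we assume Euclidean distance to the coordinate-wise cluster median and take the "in-class standard deviation" to be the square root of the summed per-feature variances. The function name and toy points are made up.

```python
import numpy as np

def flag_outliers(X, labels, k: float = 2.0) -> np.ndarray:
    """Flag points whose distance to their cluster median exceeds k * cluster spread."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    is_outlier = np.zeros(len(X), dtype=bool)
    for cluster in np.unique(labels):
        members = labels == cluster
        center = np.median(X[members], axis=0)  # coordinate-wise median
        dist = np.linalg.norm(X[members] - center, axis=1)
        # Assumed spread measure: root of the summed per-feature variances.
        spread = np.sqrt(X[members].var(axis=0).sum())
        is_outlier[members] = dist > k * spread
    return is_outlier

# Toy data (made up): cluster 0 contains one far-away point, cluster 1 is compact.
X = np.array([
    [0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
    [1, 1], [-1, -1], [1, -1], [-1, 1], [20, 20],
    [100, 100], [102, 100], [100, 102], [102, 102],
])
labels = np.array([0] * 10 + [1] * 4)
flags = flag_outliers(X, labels, k=2.0)
print(flags.nonzero()[0], flags.mean())
```

`flags.mean()` is the outlier rate, which is the quantity we compared across different values of k.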

Reasons (standalone):
- Easily applicable and explainable
- Unsupervised
- detects outliers, while taking into account the variety of different companies

Why we chose this model over other models:
- Other unsupervised analysis methods are more technical

Results:

The above method yields the following outliers: 'Antarctica', 'Bang', 'Boca', 'Bulgari', 'Calvin Klein', 'Chaumet', 'Dior', 'Fiever', 'Finding Unicorn', 'Flywheel Sports', 'Foot Action', 'Hylete', 'KFC', 'Kenzo Beauty', 'Lilys Kitchen', 'Louis Vuitton', 'Meow Mix', 'Montejo', 'Pure Farmland', 'Richard Mille', 'Rolife', 'SK-II', 'Sleep Number', 'Sofina', 'Starwars', 'Superdown', 'Tecate', 'Temu', 'The Meatless Farm', 'The Very Good Butchers', 'Tiffany & Co.', 'Tobi', 'Trask', 'Uniqlo', 'Wolaco'. Analysing these with the baseline engagement rate model yields a variety of results, indicating that this method extracts a more complex set of outliers. To interpret these, we employ a surrogate model that expresses the decision boundaries between outliers and regular points.

##### 2.1 Surrogate Model

`surrogate_final.png` shows a plot of our decision tree. Blue hues indicate outliers.


In [8]:
from IPython.display import Image
# Display the surrogate tree plot as the cell output
Image("surrogate_final.png")



![title](surrogate_final.png)

Having done so, we found that large values of `growth_per_post` and `per_post_aquisition` are a good indicator of an outlier in about 96.31% of the outlier predictions based on k-means. Thus, large values of these features are the most expressive among our engineered columns when it comes to detecting outliers in the k-means sense defined above (after normalization).
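
A surrogate tree of this kind can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not our actual pipeline: the function name, the tree depth, and the toy outlier labels are assumptions; only the two feature names are taken from the report.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def fit_surrogate(X, is_outlier, max_depth: int = 3):
    """Fit a shallow tree that imitates the outlier flags and report its fidelity."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X, is_outlier)
    # Fidelity: share of points where the tree reproduces the original flag.
    fidelity = tree.score(X, is_outlier)
    return tree, fidelity

# Synthetic data (made up): the flag depends only on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
is_outlier = X[:, 0] > 1.5
feature_names = ["growth_per_post", "per_post_aquisition"]

tree, fidelity = fit_surrogate(X, is_outlier)
print(export_text(tree, feature_names=feature_names))
print("fidelity:", fidelity)
print("importances:", dict(zip(feature_names, tree.feature_importances_)))
```

The printed rules and the feature importances are what make the surrogate useful: they show which features and thresholds separate the flagged points from the rest.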


##### 2.2 Shapley Values



#### 3. ??


### ii. Recommendation for Model Enhancement

Conclude with a thoughtful reflection on potential avenues for further improving your model. Propose specific modifications or additional analyses that could refine your predictions and insights.

More features would be possible if we had:

- customer lifetime value (CLV) in social media terms (ask gpt)
- customer acquisition cost (CAC) in social media terms (ask gpt)
- ratio CLV / CAC
- some more metrics, like reach ....

With these features, we would have a broader feature set, which could improve model accuracy and our understanding of company performance.

## 3. Actionable Insights

In the process of finding outliers within the given dataset, we came to an understanding of the trends that govern social media performance of the given firms.

We need insights into:

1. identify outlier firms that are performing exceptionally well on social media
  
    -> this could indicate strong brand engagement and customer loyalty
    -> might translate into future profitability
    -> investment opportunities

    find why they are well-performing?
    
    possible causes:
    - compare to industry?
    - long & strong brand history
    - strong brand ambassadors
    - good ethics of company (planting trees etc)
    - funny social media accounts

2. identify outlier firms on lower end

    -> signals issues with brand perception, customer engagement, or emerging crises
    -> poses risk to company's long-term stability 
    -> requires proactive (risk) management.

    why are they not well-performing?
    - industry?
    - scandals?
    - stock development?

3. identify market trends, industry trends

    - this is actually more of the data exploration part, but we need to synthesize the findings into the insights report because other groups might not do it, so it gives us an edge
  
    -> can indicate broader market trends / shifts in consumer behaviour
    -> can help more informed broadcasting & market analysis
    -> more of a tool to use as input to more complex market models
    -> can probably work well together with recent LLMs that consume market news/sentiment because social media is a good representation of what people think

4. identify actual weird outlier firms -> can generate solid investment advice, probably

    -> idea with k-means clustering: have them clustered according to performance features
    -> then find "bad stock performance" companies in "good social media performance" companies and vice versa

5. General UBS use of our analysis: advising clients on why their competitors are outliers, which can help them refine their competitive strategies. Identifying the key factors that lead to higher engagement can inform better marketing strategies. Product and service development can also benefit: insights derived from social media performance can influence decisions about product innovations or adjustments, focusing on areas that resonate with or displease the market.