# UBS Challenge Submission

Group Name: "" (empty string)

Group Participants: ...

The culmination of your submission should be a technical report that outlines your analytical journey, highlighting the methodologies employed, any obstacles encountered along the way, and the strategies adopted to
overcome these challenges.

## Brief Challenge Description:

We were tasked with ...

## Report Overview

1. Data Understanding
   1. Data Cleaning and Preprocessing
   2. Assumptions on Data
   3. Feature Engineering and Data Augmentation
2. Modelling Approach
   1. Model Selection
   2. Recommendation for Model Enhancement
3. Actionable Insights
   1. Company-Level
   2. Industry-Level
   3. Market-Level

In [9]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

path = os.path.join(os.getcwd(), os.pardir, "data")
print(path)
file = os.path.join(path, "skylab_instagram_datathon_dataset.csv")
data = pd.read_csv(file, delimiter=";")

/Users/maximhuber/Developer/datathon/docs/../data


## 1. Data Understanding

Provide a detailed account of the initial steps taken to prepare the data for analysis.

This should include a description of how data quality issues, such as missing values or outliers, were addressed.

1. Exploratory Data Analysis
   - Examination and understanding of the dataset's structure and content
   - Performing exploratory data analysis to understand:
     - data patterns, 
     - outliers, and
     - relationships between variables

2. Data Cleaning
   - Data preprocessing, including but not limited to:
     - handling missing values, 
     - data conversions, and 
     - normalization

### Data Cleaning and Preprocessing

Data cleaning is a problem in almost every data science project, so we invested time in understanding and cleaning our dataset before modelling.

The biggest issues we faced:

1. Unordered data (time series, grouping by company, ...)
2. NaNs
3. Duplicate rows for `compset`
4. Mappings between compset groups, compsets, business entities and so on
5. Company acquisitions
6. Data normalization (only needed for k-means; TODO: explain briefly)
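On the normalization item: k-means relies on distances, so features on very different scales (for example raw follower counts versus rates) would otherwise dominate the clustering. The following is a minimal sketch of z-score standardization, not the code used for the report. The helper `zscore` and the column names `followers` and `engagement_rate` are hypothetical.

```python
import pandas as pd


def zscore(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of df with the given columns scaled to mean 0 and std 1."""
    out = df.copy()
    for col in columns:
        std = out[col].std(ddof=0)
        # A constant column carries no information for clustering; map it to 0.
        out[col] = 0.0 if std == 0 else (out[col] - out[col].mean()) / std
    return out


# Toy data with hypothetical column names on very different scales.
toy = pd.DataFrame({
    "followers": [1_000, 50_000, 2_000_000],
    "engagement_rate": [0.05, 0.02, 0.01],
})
scaled = zscore(toy, ["followers", "engagement_rate"])
print(scaled.round(3))
```

After scaling, each column contributes on a comparable scale to the Euclidean distance that k-means minimizes.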

#### Grouping and sorting data

Before doing anything else, we brought the dataset into a consistent order.

We grouped the data by `business_entity_doing_business_as_name` and then sorted the rows of each business entity by increasing `period_end_date` (converting it to a pandas datetime first, where needed).

In [None]:
# not sure if code is necessary for this, it's really straightforward
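A minimal sketch of this step on toy data. The two column names come from the dataset; the values are made up, and the actual notebook code may differ.

```python
import pandas as pd

# Toy rows in arbitrary order.
toy = pd.DataFrame({
    "business_entity_doing_business_as_name": ["B", "A", "B", "A"],
    "period_end_date": ["2023-02-04", "2023-01-14", "2023-01-07", "2023-01-07"],
})

# Parse the dates so that sorting is chronological rather than string-based.
toy["period_end_date"] = pd.to_datetime(toy["period_end_date"])

# Sort by company first, then by date within each company.
ordered = toy.sort_values(
    ["business_entity_doing_business_as_name", "period_end_date"]
).reset_index(drop=True)
print(ordered)
```

Sorting on both keys at once leaves every company's rows contiguous and in time order, which later per-company operations such as interpolation rely on.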

#### Addressing NaNs

We found that ... % of the data consists of NaN values. Given the large amount of missing data, we decided that imputing values makes more sense than dropping incomplete rows.

Because we work with time-series data, we linearly interpolate missing numerical values between adjacent observations. This fits the goal of finding outliers: linear interpolation continues the existing trend and should not introduce artificial jumps, which might appear when imputing with the mean over a longer time span.

In [None]:
# .... Copy in Michal's code for interpolating
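Until the original code is pasted in, here is a minimal sketch of per-company linear interpolation on toy data. It is not the team's implementation. The metric column `followers` is a hypothetical name, and the rows are assumed to be sorted by company and date.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "business_entity_doing_business_as_name": ["A"] * 4 + ["B"] * 3,
    "period_end_date": pd.to_datetime(
        ["2023-01-07", "2023-01-14", "2023-01-21", "2023-01-28",
         "2023-01-07", "2023-01-14", "2023-01-21"]
    ),
    # Hypothetical metric column with gaps.
    "followers": [100.0, np.nan, np.nan, 130.0, 10.0, np.nan, 30.0],
})

toy = toy.sort_values(["business_entity_doing_business_as_name", "period_end_date"])

# Interpolate within each company only, so values never leak between companies.
# limit_area="inside" fills only gaps that have a valid value on both sides.
toy["followers"] = (
    toy.groupby("business_entity_doing_business_as_name")["followers"]
       .transform(lambda s: s.interpolate(method="linear", limit_area="inside"))
)
print(toy["followers"].tolist())
```

With `method="linear"` the observations are treated as equally spaced, which matches regularly spaced reporting periods. Leading or trailing gaps stay NaN and need a separate decision.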

#### Nearly-Duplicate Rows

Next, we realized that for some rows, we had multiple entries for ....

We therefore merged rows that differ only in the `compset` column into a single row whose `compset` column holds a list. This step is important because means computed over whole industries or the entire market would otherwise be biased towards companies that appear in multiple competitive sets. As a side effect, it also reduced the size of our dataset.

In [None]:
#### Code where maxim transformed `compset` into lists
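A minimal sketch of how such a merge could look on toy data. It is an illustration under assumptions, not the original code: `followers` is a hypothetical metric column, and all columns other than `compset` are treated as the row identity.

```python
import pandas as pd

toy = pd.DataFrame({
    "business_entity_doing_business_as_name": ["A", "A", "B"],
    "period_end_date": ["2023-01-07", "2023-01-07", "2023-01-07"],
    "followers": [100, 100, 50],  # hypothetical metric column
    "compset": ["Soft Luxury", "Premium Brands", "Coffee"],
})

# Rows that agree on every column except `compset` are collapsed into one row.
key_cols = [c for c in toy.columns if c != "compset"]
merged = (
    toy.groupby(key_cols, dropna=False, sort=False)["compset"]
       .agg(lambda s: sorted(set(s)))  # one de-duplicated list per merged row
       .reset_index()
)
print(merged)
```

`dropna=False` keeps rows whose key columns contain NaN. Without it, `groupby` would silently drop them, so in this sketch the merge is best run with that flag or after imputation.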

### Exploratory Data Analysis

In the next step, we tried to understand how our data is structured and what it reveals about the real world. For this, we

- first needed to generate mappings between `compset_group`, `compset`, business entity name, etc.

After that, we could
- compare industries to the whole market
- compare compsets within compset groups
- compare companies within a compset
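A minimal sketch of how a mapping from `compset_group` to its compsets could be derived from the data frame. The toy values are illustrative, and the actual mapping files may have been built differently.

```python
import json

import pandas as pd

toy = pd.DataFrame({
    "compset_group": ["Restaurants", "Restaurants", "Beverages", "Restaurants"],
    "compset": ["Coffee", "QSR", "Soda", "Coffee"],
})

# One entry per group, listing each of its competitive sets exactly once.
group_to_compsets = {
    group: sorted(sub["compset"].dropna().unique().tolist())
    for group, sub in toy.groupby("compset_group")
}
print(json.dumps(group_to_compsets, indent=2))
```

The resulting dictionary holds only strings and lists, so it can be written to disk with `json.dump` and reloaded later.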

#### Industries and Sub-Industries

First, we made sure we understood which industries we are working with. We iterated over the CSV file and found the following (shortened) hierarchical structure:

In [7]:
# Show part of the industry hierarchy. A tree plot could make this more impressive.

import json
from itertools import islice

# Map from compset_group -> list of compsets
map_path = '../out/compset_group_to_compset_map.json'
with open(map_path, 'r') as f:
    map_data = json.load(f)

# Print only the first three groups to keep the output short
for key, value in islice(map_data.items(), 3):
    print(key)
    print(value)
    print()  # Empty line after each group's compsets


Luxury & Premium & Mainstream
['US Softlines Analyst Interest List', 'Luxury & Premium & Mainstream', 'Footwear', 'Premium Brands', 'Soft Luxury', 'Hard Luxury', 'Mid-Range Watch & Jewelry', 'Mainstream Brands', 'Global Luxury Analysts Interest List']

Restaurants
['Fast Casual', 'Restaurants', 'Casual Dining', 'QSR', 'Coffee']

Beverages
['Energy drinks', 'Alcohol', 'Beverages', 'Soda', 'Sports drinks']



### Getting a feel for the market

Next, we looked at broad trends across industries. For visualization purposes, we restrict ourselves to five industries. They are ... . They exhibit these ... special features, which make them interesting for plotting.

TODO: add plots that show e.g. follower count means over industries, as box-plots and time-series
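As a starting point for these plots, the sketch below computes the mean of a metric per industry and date on toy data. The column `followers` is a hypothetical name and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "compset_group": ["Restaurants", "Restaurants", "Beverages", "Beverages"] * 2,
    "period_end_date": pd.to_datetime(["2023-01-07"] * 4 + ["2023-01-14"] * 4),
    "followers": [100, 300, 50, 150, 120, 320, 60, 160],  # hypothetical metric column
})

# Rows: dates, columns: industries, values: mean follower count across companies.
industry_means = toy.pivot_table(
    index="period_end_date",
    columns="compset_group",
    values="followers",
    aggfunc="mean",
)
print(industry_means)
```

From here, `industry_means.plot()` would draw one time series per industry, and `toy.boxplot(column="followers", by="compset_group")` would give the corresponding box plots.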

### ii. Assumptions

Clearly articulate any assumptions that were made during the data preparation phase.

TODO: think this through further. This section should stay short and concise, but the assumptions need to be well founded.

Assumptions made so far:
- Post data covers not only advertised content but also general posts on the company profiles.
  Impact: a more accurate representation of customer sentiment towards the brand, because people may engage differently with content that is marked as sponsored.
- COVID ??
- No company acquisitions. This holds for most of the data, but not for all of it (e.g. LVMH). TODO: make some plots confirming this (we believe Dior was acquired by LVMH, so the time-series plot might show an effect after the acquisition date). At a minimum, we should identify all ownership changes, which should be possible with the existing mappings or some additional ones.
- For the k-means model, we work with 2023 data.

### iii. Feature Engineering and Data Augmentation

Describe any techniques employed to enhance the dataset, whether through the creation of new features or augmentation of the existing data.

Goal: create relevant features for identifying deviations

TODO: flesh this section out substantially

## 2. Modelling Approach

### i. Model Selection

Explain the choice(s) of statistical or machine learning model(s) utilized in your analysis. Provide a compelling justification for each model selected, emphasizing how they align with the objectives of the challenge.

How could we go about solving this challenge?

In the end, we came up with three models.

1. Baseline: Best ranked companies according to engagement rate
2. k-means: ?
3. ?

In the following, we detail why our model choices make sense in the broader setting.

#### 1. Baseline Model

How it works:
- For each month, we ranked companies by their mean engagement rate.
- We then averaged each company's monthly ranks and flagged the top three and bottom three as outliers.

Reasons:
- It gives us a simple starting point.
- It measures how consistently a company engages its audience relative to its follower count, and engagement rate is one of the most relevant metrics in social media marketing.
- It gives insight into the best- and worst-performing companies, which can serve as indicators for investment opportunities and risk management; we explore this **in a later section**.
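The two steps under "How it works" can be sketched as follows on toy data. This is an illustration under assumptions, not the exact code behind the report: `engagement_rate` is a hypothetical column (assumed here to be interactions divided by followers), and the toy example flags one company per end instead of three.

```python
import pandas as pd

name = "business_entity_doing_business_as_name"
toy = pd.DataFrame({
    name: list("ABCD") * 2,
    "period_end_date": pd.to_datetime(["2023-01-31"] * 4 + ["2023-02-28"] * 4),
    # Hypothetical column; assumed to be (likes + comments) / followers.
    "engagement_rate": [0.05, 0.02, 0.01, 0.03, 0.06, 0.015, 0.005, 0.03],
})

month = toy["period_end_date"].dt.to_period("M").rename("month")

# Step 1: mean engagement rate per company and month.
monthly = toy.groupby([name, month])["engagement_rate"].mean().reset_index()

# Step 2: rank companies within each month (rank 1 = highest engagement).
monthly["rank"] = monthly.groupby("month")["engagement_rate"].rank(ascending=False)

# Step 3: average the monthly ranks and take both ends of the ordering.
avg_rank = monthly.groupby(name)["rank"].mean().sort_values()
n = 1  # the report uses 3; 1 keeps this toy example readable
top, bottom = avg_rank.head(n), avg_rank.tail(n)
print(top.index.tolist(), bottom.index.tolist())
```

Averaging ranks rather than raw engagement rates makes the result less sensitive to a single extreme month, which is in line with the consistency argument above.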

#### 2. K-Means Model

How it works:
...


Reasons (standalone):
- ...

Why we chose this model over other models:
- ...
- ...
- ...

#### 3. ??


### ii. Recommendation for Model Enhancement

Conclude with a thoughtful reflection on potential avenues for further improving your model. Propose specific modifications or additional analyses that could refine your predictions and insights.

More features would be possible if we had:

- customer lifetime value (CLV) in social media terms
- customer acquisition cost (CAC) in social media terms
- the ratio CLV / CAC
- further metrics, such as reach ....

These would give us a broader feature set, which could improve model accuracy and our understanding of company performance.

## 3. Actionable Insights

In the process of finding outliers within the given dataset, we came to an understanding of the trends that govern social media performance of the given firms.

We need insights into:

1. Identify outlier firms that are performing exceptionally well on social media

    - this could indicate strong brand engagement and customer loyalty
    - might translate into future profitability
    - investment opportunities

    Why are they performing well? Possible causes:
    - industry effects (compare to the industry)?
    - long and strong brand history
    - strong brand ambassadors
    - good company ethics (e.g. planting trees)
    - entertaining social media accounts

2. Identify outlier firms at the lower end

    - signals issues with brand perception, customer engagement, or emerging crises
    - poses a risk to the company's long-term stability
    - requires proactive (risk) management

    Why are they not performing well?
    - industry?
    - scandals?
    - stock development?

3. Identify market trends and industry trends

    - this belongs more to the data exploration part, but we should synthesize the findings into the insights report; other groups might not do so, which gives us an edge
    - can indicate broader market trends or shifts in consumer behaviour
    - can support more informed broadcasting and market analysis
    - more of a tool to use as input to more complex market models
    - can probably work well together with recent LLMs that consume market news and sentiment, because social media is a good representation of what people think

4. Identify genuinely unusual outlier firms, which can probably generate solid investment advice

    - idea with k-means clustering: cluster the firms according to performance features
    - then find companies with bad stock performance among those with good social media performance, and vice versa

5. General UBS use of our analysis

    - advise their customers on why their competitors are outliers, which can help the clients refine their competitive strategies
    - identifying key factors that lead to higher engagement can inform better marketing strategies
    - product and service development: insights derived from social media performance can influence decisions about product innovations or adjustments, focusing on areas that resonate with or displease the market