In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Data Understanding

## 1. Exploratory Data Analysis

- Examination and understanding of the dataset's structure and content.
- Performing exploratory data analysis to understand data patterns, outliers, and relationships between variables.

In [2]:
from utilities import clean_data

In [3]:
df, df_brands, df_allbrands, brands, compsets, compset_groups, groups_bycompset = clean_data()

print(df.shape)
df.head(n=5)

(298040, 7)


Unnamed: 0,period_end_date,business_entity_doing_business_as_name,followers,pictures,videos,comments,likes
0,2017-05-06,24S,,,,,
1,2017-05-13,24S,,6.0,3.0,57.0,1765.0
2,2017-05-20,24S,,6.0,3.0,57.0,1765.0
3,2017-05-27,24S,,6.0,3.0,57.0,1765.0
4,2017-06-03,24S,,24.0,3.0,109.0,3922.0


In [4]:
entries_per_business = df['business_entity_doing_business_as_name'].value_counts()

print(entries_per_business)

Loewe                  455
Michael Kors           455
Muji                   455
Mountain Dew           455
Vacheron Constantin    455
                      ... 
Sculptra               129
Temu                    55
Finding Unicorn         46
Pop Mart                46
ShopGoodwill            26
Name: business_entity_doing_business_as_name, Length: 705, dtype: int64
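
The counts above show that businesses have histories of very different lengths. As a minimal sketch of how that spread could be summarized, the snippet below runs on a small toy frame rather than the real data; only the column name is taken from the notebook, and `toy`, `length_distribution`, and `short_histories` are illustrative names.

```python
import pandas as pd

NAME = "business_entity_doing_business_as_name"

# Toy stand-in for df (assumption: the real frame has one row per business and week)
toy = pd.DataFrame({NAME: ["A"] * 5 + ["B"] * 3 + ["C"] * 5})

# Number of rows per business, as in the cell above
counts = toy[NAME].value_counts()

# How many businesses share each history length
length_distribution = counts.value_counts().sort_index()

# Businesses whose history is shorter than the longest one
short_histories = counts[counts < counts.max()]

print(length_distribution)
print(short_histories)
```

Applied to `entries_per_business`, the same two lines would show how many of the 705 businesses have the full 455 entries and which ones started reporting later.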


## 2. Data Cleaning

### Handling Missing Values

In [5]:
from utilities import missing_df, missing_values

To get a sense of the distribution and magnitude of the missing values, we created a new DataFrame *missing_df* containing the total number of entries per business and the number of missing values per business for each metric.

In [6]:
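# Note: this assignment rebinds the imported function name missing_df to its result,
# so the import in cell In [5] must be re-run before this cell can be run again.
missing_df = missing_df(df)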

print(missing_df.shape)
missing_df.head()

(705, 7)


Unnamed: 0,Business,Total Entries:,Missing followers:,Missing pictures:,Missing videos:,Missing comments:,Missing likes:
0,24S,333,20,1,1,1,1
1,3.1 Phillip Lim,455,18,0,0,0,0
2,3CE,455,131,0,0,0,0
3,A. Lange & Soehne,403,8,3,2,2,2
4,ANIMALE,403,131,0,2,0,0
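
The implementation of `missing_df` in `utilities` is not shown in this notebook. As a rough sketch of how a per-business summary of this shape could be built, the following uses a hypothetical helper `summarize_missing` on toy data; the column names follow the notebook, everything else is an assumption.

```python
import numpy as np
import pandas as pd

NAME = "business_entity_doing_business_as_name"
METRICS = ["followers", "pictures", "videos", "comments", "likes"]

def summarize_missing(frame):
    """Hypothetical helper: total rows and missing values per business."""
    total = frame.groupby(NAME).size().rename("Total Entries")
    # Count NaNs per metric within each business
    missing = frame[METRICS].isna().groupby(frame[NAME]).sum()
    missing.columns = [f"Missing {col}" for col in METRICS]
    summary = pd.concat([total, missing], axis=1)
    return summary.rename_axis("Business").reset_index()

# Toy data standing in for df
toy = pd.DataFrame({
    NAME: ["A", "A", "A", "B", "B"],
    "followers": [np.nan, np.nan, 10.0, 5.0, np.nan],
    "pictures": [np.nan, 1.0, 2.0, 3.0, 4.0],
    "videos": [np.nan, 1.0, 1.0, 0.0, 0.0],
    "comments": [np.nan, 2.0, 3.0, 1.0, 1.0],
    "likes": [np.nan, 9.0, 9.0, 7.0, 8.0],
})

summary = summarize_missing(toy)
print(summary)
```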


We ordered the DataFrame *df* by date and business, which revealed that many businesses begin with a series of missing values. Removing these initial series reduced the number of rows containing at least one missing value from 65,868 to 4,378.

By identifying the lengths and locations of all remaining missing-value series in the data, we further reduced the number of rows containing at least one missing value to 4,145. Relative to the original 65,868 rows, this is a decrease of about $94\%$.
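
The removal of initial missing-value series can be sketched as follows. This is not the code of `missing_values` from `utilities`, which is not shown here. It assumes that an "initial series" means all rows of a business before its first row in which every metric is present; `drop_leading_incomplete` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

NAME = "business_entity_doing_business_as_name"
METRICS = ["followers", "pictures", "videos", "comments", "likes"]

def drop_leading_incomplete(frame):
    """Hypothetical helper: drop rows before each business's first complete row."""
    ordered = frame.sort_values([NAME, "period_end_date"])
    complete = ordered[METRICS].notna().all(axis=1).astype(int)
    # The running maximum per business switches to 1 at the first complete row
    seen_complete = complete.groupby(ordered[NAME]).cummax().astype(bool)
    return ordered[seen_complete].reset_index(drop=True)

dates = pd.to_datetime(["2017-05-06", "2017-05-13", "2017-05-20", "2017-05-27"])
toy = pd.DataFrame({
    "period_end_date": list(dates) + list(dates[:2]),
    NAME: ["A"] * 4 + ["B"] * 2,
    "followers": [np.nan, np.nan, 10.0, 11.0, 5.0, 6.0],
    "pictures": [np.nan, 6.0, 6.0, 7.0, 1.0, 1.0],
    "videos": [np.nan, 3.0, 3.0, 3.0, 0.0, 0.0],
    "comments": [np.nan, 57.0, 57.0, 60.0, 2.0, 2.0],
    "likes": [np.nan, 1765.0, 1765.0, np.nan, 9.0, 9.0],
})

trimmed = drop_leading_incomplete(toy)
rows_with_nan = int(trimmed[METRICS].isna().any(axis=1).sum())
print(len(trimmed), rows_with_nan)
```

In the toy data, the first two rows of business A are dropped, while the later gap in `likes` is kept because it is not part of the initial series.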

In [7]:
cleaned_df = missing_values(df)

Number of rows with at least one NaN before cleaning: 65868
Number of rows with at least one NaN after dropping series of Nan's at beginning of businesses: 4378

 Remaining number of rows with Nan that are not at beginning or end:
Number of rows with 4 NaNs: 3214
Number of rows with 3 NaNs: 0
Number of rows with 2 NaNs: 7
Number of rows with 1 NaNs: 924
Total remaining rows with at least one Nan:  4145

 Total number of rows after cleaning: 236317
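
The breakdown above relies on two quantities: the number of NaNs per row, and the lengths and locations of consecutive NaN series. A minimal sketch of both, on toy data and with a hypothetical helper `nan_runs` (the actual logic in `utilities` may differ):

```python
import numpy as np
import pandas as pd

def nan_runs(series):
    """Return (start_position, length) for each run of consecutive NaNs."""
    flags = series.isna().to_numpy()
    runs = []
    start = None
    for pos, is_missing in enumerate(flags):
        if is_missing and start is None:
            start = pos  # a new run begins
        elif not is_missing and start is not None:
            runs.append((start, pos - start))  # the run ended just before pos
            start = None
    if start is not None:
        runs.append((start, len(flags) - start))  # run reaches the end
    return runs

toy = pd.DataFrame({
    "comments": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, np.nan],
    "likes": [1.0, np.nan, np.nan, 4.0, np.nan, 6.0, np.nan],
})

runs = nan_runs(toy["likes"])

# Number of rows with 0, 1, 2, ... missing values
nans_per_row = toy.isna().sum(axis=1).value_counts().sort_index()

print(runs)
print(nans_per_row)
```

On the real data, `nan_runs` would be applied per business and per metric, for example inside a `groupby` on the business name.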


Based on the following facts:
- Most of the remaining missing values occur in series longer than 20 entries, spread randomly through the data.
- Only 4,145 of the 236,317 remaining rows (about $1.8\%$) contained at least one missing value.

We made the following **assumption**:
- We assumed that the remaining missing values would not affect the performance of our model significantly.

Based on this, we decided to leave the remaining missing values in place for now and to check after implementing the model whether our assumption was correct.

 

### Normalization

In [8]:
#from utilities import normalization

In [9]:
#include normalization function here
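
The normalization function is still a placeholder here, so the method is not fixed. One common option, shown purely as an assumption and not as the authors' choice, is to standardize each metric within each business so that brands of very different size become comparable. `zscore_by_business` is a hypothetical name.

```python
import numpy as np
import pandas as pd

NAME = "business_entity_doing_business_as_name"

def zscore_by_business(frame, columns):
    """Hypothetical helper: per-business z-score of the given columns."""
    out = frame.copy()
    grouped = frame.groupby(NAME)[columns]
    mean = grouped.transform("mean")
    std = grouped.transform("std")  # sample standard deviation (ddof=1)
    # A constant series has std 0; map it to NaN instead of dividing by zero
    out[columns] = (frame[columns] - mean) / std.replace(0, np.nan)
    return out

toy = pd.DataFrame({
    NAME: ["A", "A", "A", "B", "B"],
    "likes": [1.0, 2.0, 3.0, 10.0, 10.0],
})

normalized = zscore_by_business(toy, ["likes"])
print(normalized)
```

Missing values in the input stay missing in the output, which is consistent with the decision above to leave the remaining NaNs in place.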

# Modeling Approach

## 1. Feature Engineering

- Feature engineering to create relevant features for identifying deviations.

In [10]:
from utilities import derivatives_data

In [11]:
#include the calculation of first and second derivatives here
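
The derivative calculation is not included yet. For equally spaced weekly data, first and second derivatives are often approximated by first and second differences within each business. The sketch below assumes that approach; `add_differences` and the `_d1`/`_d2` column suffixes are hypothetical and not taken from `derivatives_data`.

```python
import pandas as pd

NAME = "business_entity_doing_business_as_name"

def add_differences(frame, column):
    """Hypothetical helper: per-business first and second differences."""
    ordered = frame.sort_values([NAME, "period_end_date"]).copy()
    # First difference: change from one period to the next
    ordered[f"{column}_d1"] = ordered.groupby(NAME)[column].diff()
    # Second difference: change of the change
    ordered[f"{column}_d2"] = ordered.groupby(NAME)[f"{column}_d1"].diff()
    return ordered

dates = pd.to_datetime(["2017-05-06", "2017-05-13", "2017-05-20", "2017-05-27"])
toy = pd.DataFrame({
    "period_end_date": list(dates) + list(dates[:2]),
    NAME: ["A"] * 4 + ["B"] * 2,
    "followers": [1.0, 4.0, 9.0, 16.0, 10.0, 10.0],
})

result = add_differences(toy, "followers")
print(result)
```

Differencing within each group keeps the first value of one business from being compared with the last value of another. The first row of each business has no first difference, and the first two rows have no second difference.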

## 2. Development of Model

- Development of a model to identify significant deviations from observed trends.

## 3. Evaluation of Model

- Evaluation of the model's performance and its ability to identify deviations.

# Evaluation of Results

## 1. Analysis of deviations

- Analysis of deviations detected by our model, understanding potential causes.

## 2. Interpretation of the results

- Interpretation of the results, providing strategic insights based on deviations.

## 3. Use of alternative approaches

- Use of alternative approaches (potential enhancement of dataset with external data).