# Anomaly detection
### Using the isolation forest technique

The isolation forest technique was applied using the IsolationForest class from scikit-learn. It detects outliers (anomalies) in the dataset, which can then be removed to produce a cleaned version of the dataset. The correlations between the response variable, likes, and the other variables are then computed within each brand category to find the variables with the strongest correlations.
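
As a minimal sketch of how the technique behaves, the example below fits an isolation forest to synthetic data (not the fashion dataset). The planted extreme points and the random seed are assumptions made purely for illustration; the IsolationForest parameters mirror the ones used in the cells that follow.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data: 200 ordinary points near the origin plus 5 planted extremes
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
extremes = rng.uniform(low=8.0, high=10.0, size=(5, 2))
X = np.vstack([inliers, extremes])

forest = IsolationForest(n_estimators=100, contamination='auto', random_state=42)
labels = forest.fit_predict(X)  # 1 = normal point, -1 = anomaly

n_flagged = int((labels == -1).sum())
extremes_flagged = int((labels[-5:] == -1).sum())
print(f"{n_flagged} of {len(X)} points flagged as anomalies")
print(f"{extremes_flagged} of the 5 planted extremes flagged")
```

Rows labelled -1 are the ones that would be dropped to build a cleaned dataset.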

---

### Creating a cleaned dataset after removing the anomalies detected using the isolation forest method

In [14]:
## Note: the anomalies are detected with the isolation forest method based on the variables Followers, Comments and Likes
## Steps: scale the data, fit the isolation forest to the scaled data, then store and display the cleaned data and the outliers for each brand category

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

fashiondata = 'fashiondata.csv'
df = pd.read_csv(fashiondata)
print(df.columns)
categories = ['High street', 'Small couture', 'Mega couture', 'Designer']
cleaned_data = {}
outlier_data = {}

# First split the data by brand category, then detect anomalies within each category

for category in categories:
    category_data = df[df['BrandCategory'] == category].copy()
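    # Note: the CSV header is 'Comments ' (with a trailing space, see the printed
    # column index), so this check is False and only Likes and Followers are used
    # unless the column names are stripped first.
    if 'Comments' in df.columns:
        numerical_data = category_data[['Likes', 'Followers', 'Comments']]
    else:
        numerical_data = category_data[['Likes', 'Followers']]

    # Drop rows with missing values and keep category_data aligned with the
    # remaining rows, so that each row receives exactly one prediction below
    numerical_data = numerical_data.dropna()
    category_data = category_data.loc[numerical_data.index].copy()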
    
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(numerical_data)
    isolation_forest = IsolationForest(n_estimators=100, contamination='auto', random_state=42)
    preds = isolation_forest.fit_predict(scaled_data)
    category_data.loc[:, 'Anomaly'] = preds
    anomalies = category_data[category_data['Anomaly'] == -1]
    normal_data = category_data[category_data['Anomaly'] == 1]
    cleaned_data[category] = normal_data.drop(columns=['Anomaly'])
    outlier_data[category] = anomalies.drop(columns=['Anomaly'])
for category, data in cleaned_data.items():
    print(f"Cleaned data for {category}:")
    print(data.describe())

for category, data in outlier_data.items():
    print(f"Outliers in {category}:")
    print(data.describe())

Index(['UserId', 'Followings', 'Followers', 'MediaCount', 'BrandName',
       'BrandCategory', 'Hashtags', 'Caption', 'ImgURL', 'Likes', 'Comments ',
       'CreationTime', 'Link', 'Selfie', 'BodySnap', 'Marketing',
       'ProductOnly', 'NonFashion', 'Face', 'Logo', 'BrandLogo', 'Smile',
       'Outdoor', 'NumberOfPeople', 'NumberOfFashionProduct', 'Anger',
       'Contempt', 'Disgust', 'Fear', 'Happiness', 'Neutral', 'Sadness',
       'Surprise'],
      dtype='object')
Cleaned data for High street:
        Followings    Followers    MediaCount        Likes    Comments   \
count  5173.000000   5173.00000   5173.000000  5173.000000  5173.000000   
mean   1618.624203   2377.23874   3335.073652    33.605258     2.091823   
std    2286.256995   3588.53693   7957.672306    40.790293     5.164565   
min       0.000000      0.00000      0.000000     0.000000     0.000000   
25%     209.000000    339.00000    235.000000     5.000000     0.000000   
50%     529.000000    984.00000    628.00000
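
The scale, fit, and filter steps of this cell are repeated in the later cells, so they could be folded into one helper. The sketch below is one possible shape for it; the function name `split_anomalies` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler


def split_anomalies(frame, feature_cols, random_state=42):
    """Return (normal_rows, anomalous_rows) of `frame`, judged on `feature_cols`."""
    # Rows with missing feature values are excluded from both outputs
    features = frame[feature_cols].dropna()
    scaled = StandardScaler().fit_transform(features)
    forest = IsolationForest(n_estimators=100, contamination='auto',
                             random_state=random_state)
    # Index the labels like the feature rows so they stay aligned with `frame`
    labels = pd.Series(forest.fit_predict(scaled), index=features.index)
    normal = frame.loc[labels.index[labels == 1]]
    anomalous = frame.loc[labels.index[labels == -1]]
    return normal, anomalous


# Toy data (hypothetical values) with one missing entry
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'Likes': rng.poisson(30, 300).astype(float),
    'Followers': rng.normal(2000, 300, 300),
})
toy.loc[0, 'Likes'] = np.nan

normal, anomalous = split_anomalies(toy, ['Likes', 'Followers'])
print(len(normal), len(anomalous))
```

Indexing the predictions by the rows that survive `dropna()` avoids a length mismatch when a category contains missing values.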

### Correlations for likes vs all other variables under each brand category in the cleaned version of the dataset

In [11]:
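import numpy as np

## Follow the same steps again; here the isolation forest is fitted on all numerical columns
## Then print the correlations between Likes and all the other variables in the cleaned data (after anomaly detection and removal) for each brand category

correlations = {}  # brand category -> correlations of Likes with the other numerical columns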

for category in categories:
    category_data = df[df['BrandCategory'] == category].copy()
    numerical_columns = category_data.select_dtypes(include=[np.number]).columns.tolist()
    if 'Likes' not in numerical_columns:
        print(f"'Likes' not found in numerical columns for {category}. Skipping this category.")
        continue

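    numerical_data = category_data[numerical_columns]
    # Drop rows with missing values and keep category_data aligned with the remaining rows
    numerical_data = numerical_data.dropna()
    category_data = category_data.loc[numerical_data.index].copy()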
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(numerical_data)
    isolation_forest = IsolationForest(n_estimators=100, contamination='auto', random_state=42)
    preds = isolation_forest.fit_predict(scaled_data)
    category_data.loc[:, 'Anomaly'] = preds
    normal_data = category_data[category_data['Anomaly'] == 1].drop(columns=['Anomaly'])
    cleaned_data[category] = normal_data
    normal_data_numeric = normal_data.select_dtypes(include=[np.number])
    if 'Likes' in normal_data_numeric.columns:
        correlations[category] = normal_data_numeric.corr()['Likes'].drop('Likes')  # Calculate correlation

for category, corr_data in correlations.items():
    print('')
    print(f"Correlations in {category}:")
    print('')
    print(corr_data)


Correlations in High street:

Followings               -0.094736
Followers                 0.448979
MediaCount               -0.077368
Comments                  0.432710
CreationTime             -0.059124
Selfie                   -0.011907
BodySnap                  0.079994
Marketing                -0.017298
ProductOnly              -0.062694
NonFashion               -0.015148
Face                      0.033731
Logo                     -0.120049
BrandLogo                -0.035380
Smile                     0.002259
Outdoor                   0.136631
NumberOfPeople            0.037889
NumberOfFashionProduct    0.052558
Anger                     0.003882
Contempt                 -0.006837
Disgust                  -0.008509
Fear                     -0.004998
Happiness                 0.002164
Neutral                   0.027914
Sadness                  -0.004237
Surprise                 -0.003179
Name: Likes, dtype: float64

Correlations in Small couture:

Followings               -0.12363

### Printing the top four correlating variables with likes under each brand category

In [13]:
top_four_correlations = {}

## This makes it easier to identify which variables are the most suitable predictors of Likes in each brand category

for category in categories:
    category_data = df[df['BrandCategory'] == category].copy()
    numerical_columns = category_data.select_dtypes(include=[np.number]).columns.tolist()
    if 'Likes' not in numerical_columns:
        print(f"'Likes' not found in numerical columns for {category}. Skipping this category.")
        continue

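    numerical_data = category_data[numerical_columns]
    # Drop rows with missing values and keep category_data aligned with the remaining rows
    numerical_data = numerical_data.dropna()
    category_data = category_data.loc[numerical_data.index].copy()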
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(numerical_data)
    isolation_forest = IsolationForest(n_estimators=100, contamination='auto', random_state=42)
    preds = isolation_forest.fit_predict(scaled_data)
    category_data.loc[:, 'Anomaly'] = preds
    normal_data = category_data[category_data['Anomaly'] == 1].drop(columns=['Anomaly'])
    cleaned_data[category] = normal_data
    normal_data_numeric = normal_data.select_dtypes(include=[np.number])
    if 'Likes' in normal_data_numeric.columns:
        correlation_series = normal_data_numeric.corr()['Likes'].drop('Likes')  # Calculate correlation
        correlations[category] = correlation_series
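        # Rank by absolute value; the stored values are magnitudes, so negative
        # correlations (e.g. Logo for High street) are printed as positive numbers
        top_four = correlation_series.abs().sort_values(ascending=False).head(4)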
        top_four_correlations[category] = top_four

for category, top_corr in top_four_correlations.items():
    print(f"Top four correlations for {category}:")
    print(top_corr)

Top four correlations for High street:
Followers    0.448979
Comments     0.432710
Outdoor      0.136631
Logo         0.120049
Name: Likes, dtype: float64
Top four correlations for Small couture:
Followers     0.693111
Comments      0.349417
Logo          0.146613
MediaCount    0.124994
Name: Likes, dtype: float64
Top four correlations for Mega couture:
Comments       0.473899
Followers      0.406086
ProductOnly    0.140632
Followings     0.124736
Name: Likes, dtype: float64
Top four correlations for Designer:
Followers    0.555847
Comments     0.376767
BodySnap     0.132217
Logo         0.103182
Name: Likes, dtype: float64
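
Because the ranking above is printed as absolute values, the direction of each relationship is not visible in the output. A possible variation, sketched below, ranks by magnitude but reports the signed coefficients. The series holds a subset of the High street values printed earlier; the variable names are illustrative only.

```python
import pandas as pd

# Subset of the High street correlations with Likes, copied from the earlier output
corr = pd.Series(
    {
        'Followers': 0.448979,
        'Comments': 0.432710,
        'Outdoor': 0.136631,
        'Logo': -0.120049,
        'Followings': -0.094736,
        'BodySnap': 0.079994,
    },
    name='Likes',
)

# Rank by absolute value, then look up the original signed values
top_idx = corr.abs().sort_values(ascending=False).head(4).index
top_signed = corr.loc[top_idx]
print(top_signed)
```

With this variation, Logo still ranks fourth for High street but is shown with its negative sign.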


Based on the correlations obtained above, we decided to keep the raw dataset for the data analysis aimed at gaining insights into marketing strategies.

So far, the top four correlations between likes and the other predictor variables have been obtained for three versions of the dataset:
> 1. Raw dataset
> 2. Cleaned version (using IQR)
> 3. Cleaned version (using anomaly detection)

These numbers are used to refer to each version in the analysis and conclusion that follow.

For most brand categories, the correlation between followers and likes changed noticeably across the three versions. For instance, in the Small couture brand category the correlation was 0.946084 in 1, 0.301372 in 2, and 0.693111 in 3. Similarly, in the Mega couture brand category the correlation between followers and likes was 0.678127 in 1 and 0.406086 in 3, while in 2, where outliers were removed using the IQR method, followers was not even listed among the top four correlations. A similar trend was observed for the High street brand category. These observations highlighted an important detail for the analysis of our chosen dataset.
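
One way to see why the correlation can move so much between versions is that Pearson correlation is very sensitive to a few extreme points. The sketch below uses synthetic, hypothetical numbers (not the fashion dataset): likes and followers are unrelated for ordinary accounts, and three very large accounts are then added.

```python
import numpy as np

rng = np.random.default_rng(0)

# 500 ordinary accounts whose likes are generated independently of followers
followers = rng.normal(1000, 200, size=500)
likes = rng.normal(50, 10, size=500)
r_without = np.corrcoef(followers, likes)[0, 1]

# Add three very large accounts that also receive very many likes
followers_all = np.append(followers, [500_000, 800_000, 1_000_000])
likes_all = np.append(likes, [20_000, 35_000, 40_000])
r_with = np.corrcoef(followers_all, likes_all)[0, 1]

print(f"Pearson r without the extreme accounts: {r_without:.3f}")
print(f"Pearson r with the extreme accounts:    {r_with:.3f}")
```

In this toy setting the three extreme accounts alone produce a strong correlation, which is the kind of effect that removing outliers would erase.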

In a fashion dataset of this kind, many brand marketing strategies are aimed at specific target audiences, so for the dataset as a whole it can be difficult to determine exactly what causes a given outlier. The outliers detected using the IQR method in the previous files might in fact have been driving the high correlation between followers and likes. Removing them with IQR may be why followers no longer appears among the top four correlations with likes for High street and Mega couture.
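
For reference, an IQR filter is commonly written as below. This is only a sketch: it assumes the usual 1.5 × IQR fences, which may differ from the rule used in the previous files, and the helper name `iqr_mask` and the follower counts are hypothetical.

```python
import pandas as pd


def iqr_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)


# Hypothetical follower counts with one very large account
followers = pd.Series([120, 340, 560, 980, 1500, 2100, 950_000])
kept = followers[iqr_mask(followers)]
print(kept.tolist())
```

Here the single very large account falls outside the upper fence and is dropped, which is how accounts with unusually many followers can disappear from an IQR-cleaned dataset.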

## Conclusion: Decision to use raw data for further analysis

Overall, the top factors correlated with likes, namely followers and comments, remained consistent across most of the brand categories.

Anomaly detection is particularly useful in datasets where understanding 'typical' user engagement is important, and in a study with that aim, removing anomalies can be helpful. In our study, however, we want to understand the full range of behaviours and trends, including the extremes. Analysing the raw dataset can therefore give us better insights than its cleaned counterparts, in which outliers were removed using either the IQR method or anomaly detection.

We will therefore continue with the raw dataset for the rest of our evaluation, to capture the characteristics of the data and how they can inform the strategic decisions of profiles posting on social media platforms such as Instagram.