# CS 421 PROJECT
---

Group: Empirical Risk Minimisers  
Members:
- Lai Wan Xuan Joanne (joanne.lai.2021)
- Ryan Miguel Moralde Sia (ryansia.2022)
- Dhruv Benegal (benegalda.2022)
- Benedict Lee Zi Le (benedictlee.2022)

### 1. Background & Objective

In this project, you will be working with data extracted from well-known recommender-system datasets: you are provided with a large set of interactions between users (persons) and items (movies). Whenever a user "interacts" with an item, they watch the movie and give it a "rating". There are 5 possible ratings, expressed as a "number of stars": 1, 2, 3, 4, or 5. 

In this exercise, we will **not** be performing the recommendation task per se. Instead, you will try to identify the amount of noise/corruption which was injected in each user. Indeed, for each of the users you have been given, an anomaly/noise generation procedure was applied to corrupt the sample. The noise generation procedure depends on two variables: the noise level $p\in [0,1]$ and the noise type $X\in\{0,1,2\}$.  Each user has been randomly assigned a noise level $p$ and anomaly/noise type $X$, and subsequently been corrupted with the associated noise generation procedure. 

You have two tasks: first, you must predict the noise level $p$ associated to each test user. This is a **supervised regression task**. Second, you must try to identify the noise generation type for each user. This is a classification task with three classes, with the possibility of including more classes later depending on class performance. This task will be semi-supervised: only a very small number of labels is provided. You will therefore need to combine supervised and unsupervised approaches for this component. 

### 2. Data

You are provided with three data frames: the first one ("X") contains the interactions provided to you, and the second one ("yy") contains the continuous labels (noise levels) for the users. The third data frame ("yy_cat") contains the anomaly/noise type for 15 users. The idea is to use these users to disambiguate the category types, but the task will mostly be unsupervised. 
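
One way to read "use these users to disambiguate the category types" is to cluster all users without labels and then name each cluster after the labelled users that fall into it. The sketch below illustrates that idea on synthetic, well-separated data. The choice of `KMeans`, the two-dimensional features, and every variable name are assumptions made for illustration, not part of the project brief or of this notebook's pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for per-user feature vectors: three well-separated groups.
features, true_type = make_blobs(
    n_samples=150,
    centers=[[-6, 0], [0, 6], [6, 0]],
    cluster_std=0.5,
    random_state=0,
)

# Pretend only five users per type are labelled (15 labels in total).
labelled_idx = np.concatenate([np.where(true_type == k)[0][:5] for k in range(3)])

# Unsupervised step: cluster every user, ignoring the labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Supervised step: name each cluster after the most common label
# among the labelled users assigned to it.
cluster_to_type = {}
for c in np.unique(clusters):
    members = labelled_idx[clusters[labelled_idx] == c]
    if len(members) > 0:
        cluster_to_type[int(c)] = int(np.bincount(true_type[members]).argmax())

# Clusters containing no labelled user are marked -1 (unknown).
predicted_type = np.array([cluster_to_type.get(int(c), -1) for c in clusters])
accuracy = float((predicted_type == true_type).mean())
print(f"Accuracy on the synthetic data: {accuracy:.2f}")
```

On real user features the clusters are unlikely to be this clean, so the mapping step may leave some clusters unnamed or assign two clusters to the same type.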

As you can see, the three columns in "X" correspond to the user ID, the item ID and the rating (encoded into numerical form). Thus, each row of "X" contains a single interaction. For instance, if the row "$142, 152, 5$" is present, this means that the user with ID $142$ has given the movie $152$ a positive rating of $5$.

The dataframe "yy" has two columns. In the first column we have the user IDs, whilst the second column contains the continuous label. A label of $0.01$ indicates a very low anomaly level, whilst a label of $0.99$ indicates a very high amount of noise/corruption. 

### 3. Evaluation

Your task is to regress the noise level $p$ for each new user and to predict the anomaly type $X$. The first (regression) task will be easier due to the larger amount of supervision, and will form the main basis of the evaluation. The second task will be more important for showcasing each team's creativity and differentiating between top performers. 

The **evaluation metrics** are:  

1. The Mean Absolute Error (MAE) for the regression task. 
2. The accuracy for the classification task. 
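
As a quick illustration of the two metrics, the sketch below computes them with scikit-learn on a handful of made-up values. The arrays are toy numbers chosen only so that the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Toy noise levels, made up purely for illustration.
p_true = np.array([0.10, 0.50, 0.90])
p_pred = np.array([0.20, 0.40, 0.60])
mae = mean_absolute_error(p_true, p_pred)  # mean of 0.1, 0.1 and 0.3

# Toy anomaly types, also made up.
type_true = np.array([0, 1, 2, 2])
type_pred = np.array([0, 1, 1, 2])
acc = accuracy_score(type_true, type_pred)  # 3 of 4 predictions are correct

print(f"MAE: {mae:.4f}, accuracy: {acc:.2f}")
```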

Every few weeks, we will evaluate the performance of each team (on a *test set with unseen labels* that I will provide) in terms of both metrics.

The difficulty implied by **the generation procedure of the anomalies MAY CHANGE as the project evolves: depending on how well the teams are doing, I may generate easier or harder anomaly classes, which would change the number of labels in the classification task**. However, the regression task will still be the same (with a different distribution).

### 4. Deliverables

Together with this file, you are provided with a first batch of examples "`first_batch_regression_labelled.npz`" which are labelled in terms of noise level. You are also provided with the test samples to rank by the next round (without labels) in the file "`second_batch_regression_unlabelled.npz`".

The **first round** will take place after recess (week 9): you must hand in your scores for the second batch before **Wednesday, 15th of October, at noon**. We will then look at the results together on the Friday.  

We will check everyone's performance in this way every week (once in week 10, once in week 11 and once in week 12). 

---

To summarise, the project deliverables are as follows:

- Before every checkpoint's deadline, you need to submit **a `.csv` file** containing a dataframe of size $\text{number of test batch users} \times 3$.
    - The first column should be the user IDs of the test batch.
    - The second column should contain the estimated noise level $p$ for each sample.
    - The final column should contain the estimated class (it should be a natural number in $\{0,1,2\}$).
- The order of rows should correspond to the user IDs. For example, if the test batch contains users 1100-2200, scores for user 1100 should be the first row (row 0), scores for user 1101 should be the second row (row 1), and so on.
- On Week 12-13 (schedule to be decided), you need to present your work in class. The presentation duration is **10 minutes** with 5 minutes of Q&A.
- On Week 12, you need to submit your **Jupyter Notebook** (with comments in Markdown) and the **slides** for your presentation. 
- On week 13 you need to submit your **final report**. The final report should be 2-3 pages long (consisting of problem statement, literature review, and motivation of algorithm design) with unlimited references/appendix.
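
The submission format described in the list above can be sketched as follows. The user IDs, predictions, and column names (`user`, `label`, `anomtype`) are hypothetical placeholders. The brief only fixes the column order and the row order, so the names are an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions for three test users (values made up).
user_ids = np.array([1102, 1100, 1101])
p_hat = np.array([1.07, -0.02, 0.45])
type_hat = np.array([2, 0, 1])

submission = pd.DataFrame({"user": user_ids, "label": p_hat, "anomtype": type_hat})

# Noise levels live in [0, 1], so clip any out-of-range predictions.
submission["label"] = submission["label"].clip(0.0, 1.0)

# Rows must follow the order of the user IDs.
submission = submission.sort_values("user").reset_index(drop=True)

# index=False keeps the file to exactly three columns.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file path as the first argument of `to_csv` writes the same content to disk instead of returning it as a string.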

Whilst performance (expressed in terms of MAE and accuracy) at **each of the check points** (weeks 9 to 12 inclusive) is an **important component** of your **final grade**, the **final report** and the detail of the various methods you will have tried will **also** be very **important**. Ideally, to get perfect marks (A+), you should try at least **two supervised methods** and **two unsupervised methods**, as well as be ranked the **best team** in terms of performance. 


In addition, I will be especially interested in your **reasoning**. Especially high marks will be awarded to any team that is able to **qualitatively describe** the difference between the two anomaly types. You are also encouraged to compute statistics related to each class and describe what is different about them. 

## Imports

In [83]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import kurtosis
from sklearn.linear_model import (
    LinearRegression,
    LogisticRegression
    )
from sklearn.metrics import (
    mean_absolute_error,
    accuracy_score
)
from sklearn.model_selection import (
    cross_val_score,
    train_test_split,
    GridSearchCV,
    StratifiedKFold
)
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

%matplotlib inline

## Data Loading and Cleaning

### Data Loading

In [84]:
data  = np.load("data/Week1/first_batch_regression_labelled.npz")
X     = data["X"]
y     = data["yy"]
y_cat = data["yy_cat"]

# Load dataframes
X     = pd.DataFrame(X, columns=["user", "item", "rating"])
y     = pd.DataFrame(y, columns=["user", "label"])
y_cat = pd.DataFrame(y_cat, columns=["user", "label", "anomtype"])

# Parse to correct types
y     = y.astype({"user": int, "label": float})
y_cat = y_cat.astype({"user": int, "label": float, "anomtype": int})

In [85]:
XX    = np.load("data/Week1/second_batch_regression_unlabelled.npz")['X']
XX    = pd.DataFrame(XX, columns=["user", "item", "rating"])

In [86]:
# contains interactions provided
# has 288205 rows

X

Unnamed: 0,user,item,rating
0,0,94,2
1,0,90,1
2,0,97,2
3,0,100,4
4,0,101,2
...,...,...,...
288200,899,515,3
288201,899,522,1
288202,899,526,4
288203,899,592,2


In [87]:
# contains the noise level p
# has 900 rows corresponding to users

y

print("---To check if number of users in X corresponds to number of rows in y---")
print(f"Number of unique users in X: {X['user'].nunique()}")
print(f"Number of rows in y: {len(y)}")

---To check if number of users in X corresponds to number of rows in y---
Number of unique users in X: 900
Number of rows in y: 900


In [88]:
# contains the anomaly/noise type, which is in {0, 1, 2}
# only has 15 rows

y_cat

Unnamed: 0,user,label,anomtype
0,561,0.383316,1
1,202,0.925028,2
2,205,0.38086,2
3,424,0.255181,1
4,284,0.055162,2
5,667,0.558745,0
6,730,0.311928,1
7,469,0.233492,2
8,199,0.165112,1
9,699,0.261752,2


In [89]:
# contains test data that we predict anomaly and noise on

XX

Unnamed: 0,user,item,rating
0,900,0,2
1,900,388,2
2,900,389,3
3,900,390,0
4,900,401,5
...,...,...,...
282441,1799,319,4
282442,1799,318,5
282443,1799,316,3
282444,1799,814,4


### Feature Engineering
For further explanation, see file `EDA.ipynb`.

In [90]:
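# Define the engineer_features function to add additional features

def engineer_features(df_X, df_y=None):
    """Build one row per user: 1000 item-rating columns plus summary statistics.

    If df_y is given, its label columns are merged in on "user".
    """
    # Keep only the last rating for each (user, item) pair, then pivot to a
    # user x item matrix in which -1 marks "no rating"
    df_X_no_dupes = df_X.drop_duplicates(subset=["user", "item"], keep="last")
    df_ratings = df_X_no_dupes.pivot(index='user', columns='item', values='rating').fillna(-1)
    # Ensure every item in range(0, 1000) appears as a column
    all_items = range(0, 1000)
    df_ratings = df_ratings.reindex(columns=all_items, fill_value=-1)

    # Basic user features (computed on all interactions, duplicates included)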
    df_user_features = df_X.groupby("user").agg(
        mean_rating=("rating", "mean"),
        median_rating=("rating", "median"),
        std_rating=("rating", "std"),
        count_dislike=("rating", lambda x: ((x == 1) | (x == 2)).sum()),
        count_neutral=("rating", lambda x: (x == 3).sum()),
        count_like=("rating", lambda x: ((x == 4) | (x == 5)).sum()),
        total_interactions=("rating", "count"),
    )

    # Ratio features
    df_user_features["like_ratio"] = (
        df_user_features["count_like"] / df_user_features["total_interactions"]
    )
    df_user_features["dislike_ratio"] = (
        df_user_features["count_dislike"] / df_user_features["total_interactions"]
    )
    df_user_features["neutral_ratio"] = (
        df_user_features["count_neutral"] / df_user_features["total_interactions"]
    )

    # Distribution features
    df_user_features["rating_kurtosis"] = df_X.groupby("user")["rating"].apply(
        lambda x: kurtosis(x)
    )

    # Fill NaN values in std_rating and rating_kurtosis
    df_user_features["std_rating"] = df_user_features["std_rating"].fillna(0)
    df_user_features["rating_kurtosis"] = df_user_features["rating_kurtosis"].fillna(0)

    final_df = pd.merge(df_ratings.reset_index(), df_user_features, on='user')
    
    if df_y is not None:
        df_merged = pd.merge(final_df.reset_index(), df_y, on="user", how="inner")
        return df_merged.drop(columns=["index"]).set_index("user")
    else:
        return final_df.set_index("user")
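
To make the first step of `engineer_features` concrete, the sketch below applies the same deduplicate, pivot, and reindex sequence to a tiny made-up table. The toy data and the four-item range are assumptions for illustration. The notebook itself uses 1000 item columns.

```python
import pandas as pd

# Toy interactions (made up): user 0 rates item 2 twice, user 1 never rates item 0.
toy = pd.DataFrame(
    {"user": [0, 0, 0, 1, 1], "item": [0, 2, 2, 1, 2], "rating": [4, 1, 5, 3, 2]}
)

# Keep the last rating for each (user, item) pair, as engineer_features does.
toy_no_dupes = toy.drop_duplicates(subset=["user", "item"], keep="last")

# One row per user, one column per item; -1 marks "no rating".
ratings = toy_no_dupes.pivot(index="user", columns="item", values="rating").fillna(-1)

# Add columns for items nobody rated (4 items here, 1000 in the notebook).
ratings = ratings.reindex(columns=range(0, 4), fill_value=-1)

print(ratings)
```

User 0's second rating of item 2 (a 5) replaces the earlier 1, and item 3, which nobody rated, becomes a column filled with -1.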

## Evaluation Functions

In [91]:
# to print MAE for regression task
def evaluate_linear_predictions(y_test, y_pred):
    mae = mean_absolute_error(y_test, y_pred)
    print(f"Test MAE: {mae:.4f}")

# to print accuracy for classification task
def evaluate_classification_accuracy(y_test, y_pred):
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Test Accuracy: {accuracy:.4f}")
    return accuracy

## Supervised Learning method 1: Linear Regression

### Feature Engineering

In [92]:
# Create the initial dataframe for linear regression, using X and y
final_df = engineer_features(X, y)

display(final_df.head())

user,0,1,2,3,4,5,6,7,8,9,...,std_rating,count_dislike,count_neutral,count_like,total_interactions,like_ratio,dislike_ratio,neutral_ratio,rating_kurtosis,label
0,4.0,-1.0,-1.0,-1.0,4.0,-1.0,-1.0,-1.0,-1.0,4.0,...,1.14532,108,45,49,202,0.242574,0.534653,0.222772,-1.32912,0.962817
1,-1.0,-1.0,2.0,3.0,-1.0,2.0,-1.0,-1.0,3.0,-1.0,...,0.838,138,154,43,335,0.128358,0.41194,0.459701,0.48319,0.031248
2,4.0,-1.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,0.57607,1,54,207,262,0.790076,0.003817,0.206107,0.281598,0.068668
3,4.0,2.0,4.0,-1.0,1.0,3.0,-1.0,3.0,5.0,4.0,...,1.081526,33,53,213,302,0.705298,0.109272,0.175497,1.077512,0.349012
4,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,4.0,-1.0,-1.0,...,0.633128,7,6,329,342,0.961988,0.020468,0.017544,2.194315,0.917704


In [93]:
# Prepare and split dataset into train and val

# Step 1: Extract features and labels
X_lr = final_df.drop(columns=["label"]).values # Features
y_lr = final_df["label"].values # Labels

# Step 2: Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_lr, y_lr, test_size = 0.2, random_state=42
)

print("Shapes (regression):")
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)

Shapes (regression):
(720, 1011) (720,) (180, 1011) (180,)


In [94]:
# Standardise features (for regression)

scaler_lr = StandardScaler().fit(X_train)
X_train_std = scaler_lr.transform(X_train)
X_val_std = scaler_lr.transform(X_val)

### Linear Regression

In [95]:
# We now perform linear regression to predict label

# Train model
lr = LinearRegression()
lr.fit(X_train_std, y_train)

y_pred= lr.predict(X_val_std)
evaluate_linear_predictions(y_val, y_pred)

Test MAE: 0.3284
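
To put this number in context, it can help to compare it with a constant baseline and with the training error. The design matrix has 1011 columns but only 720 training rows, so unregularised least squares can fit the training set almost exactly without generalising. As a reference point, if the labels were roughly uniform on $[0,1]$ (an assumption, not something checked here), always predicting 0.5 would give an expected MAE of 0.25. The sketch below shows such a check on synthetic data. The data, the `DummyRegressor` baseline, and all variable names are illustrative assumptions rather than part of the notebook's pipeline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: more feature columns than training rows, as in the notebook.
X_demo = rng.normal(size=(200, 300))
y_demo = np.clip(0.5 + 0.1 * X_demo[:, 0] + rng.normal(scale=0.05, size=200), 0, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Constant baseline: always predict the training median.
baseline = DummyRegressor(strategy="median").fit(X_tr, y_tr)
linreg = LinearRegression().fit(X_tr, y_tr)

mae_baseline = mean_absolute_error(y_va, baseline.predict(X_va))
mae_train = mean_absolute_error(y_tr, linreg.predict(X_tr))
mae_val = mean_absolute_error(y_va, linreg.predict(X_va))

print(f"Baseline validation MAE: {mae_baseline:.4f}")
print(f"Linear regression train MAE: {mae_train:.4f}")
print(f"Linear regression validation MAE: {mae_val:.4f}")
```

A training MAE near zero next to a much larger validation MAE is a sign of overfitting, which regularisation or a smaller feature set might reduce.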


## Supervised learning method 2: Logistic Regression

### Feature Engineering

This requires a few extra steps compared with the regression pipeline, because the training frame must also include the `anomtype` column.

In [96]:
# Combine dataframe with anomtype

final_df_log = engineer_features(X, y_cat)

# we convert all column names to strings so it does not throw an error later
final_df_log.columns = final_df_log.columns.astype(str)
display(final_df_log)
final_df_log["anomtype"].value_counts()

user,0,1,2,3,4,5,6,7,8,9,...,count_dislike,count_neutral,count_like,total_interactions,like_ratio,dislike_ratio,neutral_ratio,rating_kurtosis,label,anomtype
26,2.0,-1.0,-1.0,4.0,3.0,-1.0,4.0,4.0,4.0,-1.0,...,80,60,98,244,0.401639,0.327869,0.245902,-0.604327,0.558222,0
199,5.0,-1.0,3.0,-1.0,-1.0,-1.0,-1.0,-1.0,5.0,-1.0,...,28,83,95,211,0.450237,0.132701,0.393365,0.710936,0.165112,1
202,5.0,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,1,1,212,214,0.990654,0.004673,0.004673,0.128747,0.925028,2
205,2.0,3.0,2.0,1.0,-1.0,-1.0,-1.0,4.0,4.0,-1.0,...,49,97,220,367,0.599455,0.133515,0.264305,0.032162,0.38086,2
231,3.0,-1.0,4.0,3.0,-1.0,1.0,1.0,-1.0,2.0,1.0,...,182,94,82,358,0.22905,0.50838,0.26257,-1.347485,0.951103,0
284,4.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,4.0,-1.0,...,1,23,199,223,0.892377,0.004484,0.103139,0.53499,0.055162,2
424,4.0,1.0,-1.0,-1.0,-1.0,3.0,-1.0,-1.0,4.0,4.0,...,34,73,93,200,0.465,0.17,0.365,-0.210772,0.255181,1
459,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,4.0,-1.0,...,91,57,82,230,0.356522,0.395652,0.247826,-1.203525,0.7393,0
469,4.0,-1.0,3.0,3.0,-1.0,3.0,-1.0,-1.0,3.0,4.0,...,34,111,124,269,0.460967,0.126394,0.412639,0.029708,0.233492,2
561,3.0,-1.0,-1.0,-1.0,4.0,-1.0,-1.0,-1.0,-1.0,5.0,...,39,57,109,205,0.531707,0.190244,0.278049,-0.376356,0.383316,1


anomtype
0    5
1    5
2    5
Name: count, dtype: int64

In [97]:
# Separate the dataframe into input features X_log and target y_log
# TODO (Ben): decide whether to keep the "label" column as a feature

X_log = final_df_log.drop(columns=["label", "anomtype"])
y_log = final_df_log["anomtype"]

# Note: these calls show the raw X and y frames, not X_log and y_log
display(X)
display(y)

Unnamed: 0,user,item,rating
0,0,94,2
1,0,90,1
2,0,97,2
3,0,100,4
4,0,101,2
...,...,...,...
288200,899,515,3
288201,899,522,1
288202,899,526,4
288203,899,592,2


Unnamed: 0,user,label
0,0,0.962817
1,1,0.031248
2,2,0.068668
3,3,0.349012
4,4,0.917704
...,...,...
895,895,0.962911
896,896,0.606888
897,897,0.334323
898,898,0.726156


In [98]:
# Scale features in X

scaler_log = StandardScaler().fit(X_log)
X_log_std = scaler_log.transform(X_log)

With only 15 labelled users, we do not hold out a separate test set. Instead, we tune and assess the model with cross-validation on all labelled users before predicting on the second batch.

For reference, see the scikit-learn `LogisticRegression` documentation:  
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [99]:
# Do logistic regression here

log_reg = LogisticRegression(
    penalty="l2", # L2 (ridge-style) regularisation
    C=0.01,
    solver="liblinear", # efficient for small-medium datasets
    max_iter=1000
)

# wrap log_reg in OneVsRest classifier so the liblinear solver works for our multiclass classifier
log_classifier = OneVsRestClassifier(log_reg)

# Using StratifiedKFold, for n_splits=5, we train on 4 folds and validate on the remaining fold
# then compute accuracy based on that fold
cv = StratifiedKFold(n_splits=5)

# Code for GridSearchCV: to find the best value of C
param_grid = {'estimator__C': [0.01, 0.1, 1, 10]}
grid = GridSearchCV(
    estimator=log_classifier,
    param_grid=param_grid,
    cv=cv,
    scoring='accuracy'
)

grid.fit(X_log_std, y_log)

print("Best C:", grid.best_params_)
print("Best CV Accuracy: %.4f" % grid.best_score_)

cv_results = pd.DataFrame(grid.cv_results_)
display(cv_results[['param_estimator__C', 'mean_test_score', 'std_test_score']])

# Scores will be an array of 5 numbers: one accuracy per fold
# scores = cross_val_score(log_classifier, X_std, y, cv=cv, scoring="accuracy")

# display(scores)
# print("CV Accuracy: %.4f ± %.4f" % (scores.mean(), scores.std()))

Best C: {'estimator__C': 0.1}
Best CV Accuracy: 0.6000


Unnamed: 0,param_estimator__C,mean_test_score,std_test_score
0,0.01,0.533333,0.339935
1,0.1,0.6,0.326599
2,1.0,0.6,0.326599
3,10.0,0.6,0.326599


In [100]:
# To get final logistic regression model, we choose the best value of C

log_reg = LogisticRegression(
    penalty="l2",
    C=grid.best_params_['estimator__C'],
    solver="liblinear",
    max_iter=1000
)

log_classifier = OneVsRestClassifier(log_reg)

final_model = log_classifier.fit(X_log_std, y_log)
final_model

OneVsRestClassifier(estimator=LogisticRegression(C=0.1, max_iter=1000, solver='liblinear'))

LogisticRegression parameters: penalty='l2', dual=False, tol=0.0001, C=0.1, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=1000

**With GridSearchCV:** the best CV accuracy is still low (0.6, with a standard deviation of about 0.33 across folds), most likely because only 15 labelled samples are available.  

**Without GridSearchCV:**  
[liblinear] CV accuracy with or without the "label" column: 0.5333 ± 0.2667  
[lbfgs] CV accuracy with or without the "label" column: 0.4000 ± 0.3266  
liblinear scores higher here, but the sample is too small to draw any meaningful conclusion about which solver is better.
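
The large spread across folds follows directly from the fold sizes. The sketch below reproduces the split on placeholder data with the same class balance (five users per type) to show how few validation samples each fold contains. The feature matrix is a dummy array, since only the labels affect a stratified split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 15 labelled users, five per anomaly type, matching the value counts above.
y_demo = np.repeat([0, 1, 2], 5)
X_demo = np.zeros((15, 1))  # placeholder features; the split depends only on y

cv = StratifiedKFold(n_splits=5)
val_sizes = [len(val_idx) for _, val_idx in cv.split(X_demo, y_demo)]

# With so few validation samples, per-fold accuracy takes only a few values.
n_val = val_sizes[0]
possible_fold_accuracies = sorted({k / n_val for k in range(n_val + 1)})

print("Validation samples per fold:", val_sizes)
print("Possible per-fold accuracies:", possible_fold_accuracies)
```

Each fold validates on three users, so a single misclassification moves that fold's accuracy by a third. A mean CV accuracy of 0.6 corresponds to 9 of the 15 users being classified correctly.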

## Making predictions on second batch

### Data cleaning

In [101]:
XX

Unnamed: 0,user,item,rating
0,900,0,2
1,900,388,2
2,900,389,3
3,900,390,0
4,900,401,5
...,...,...,...
282441,1799,319,4
282442,1799,318,5
282443,1799,316,3
282444,1799,814,4


In [102]:
# Check for duplicate rows (if a user rated an item more than once)

duplicates = XX[XX.duplicated(subset=["user", "item"], keep=False)]
print(duplicates.sort_values(by=["user", "item"]))

        user  item  rating
488      901    73       4
606      901    73       4
499      901   157       5
570      901   157       4
484      901   172       4
...      ...   ...     ...
282425  1799    26       4
282267  1799   866       4
282372  1799   866       1
282223  1799   930       5
282398  1799   930       5

[41033 rows x 3 columns]


In [103]:
# We found that there are quite a few duplicates (i.e. a user rated an item more than once)
# We assume that a user's final rating is the final decision, and we keep that

XX_no_dupes = XX.drop_duplicates(subset=["user", "item"], keep="last")

print(XX_no_dupes.shape)

(258465, 3)


### Feature Engineering

In [104]:
# # Create the initial dataframe
# # Step 1: Pivot the dataframe, so that the cell in (i,j) is user i's rating of the movie j
# XX_df = XX_no_dupes.pivot(index="user", columns="item", values="rating")

# # Step 2: Fill missing values with -1 (to show that it stands for no rating, instead of 0 = hated it)
# XX_df = XX_df.fillna(-1)

# # Ensure all items appear as columns (in case there is a movie within range(0, 1000) not inserted)
# all_items = range(0, 1000)
# XX_df = XX_df.reindex(columns=all_items, fill_value=-1)

XX_df = engineer_features(XX)
XX_df.columns = XX_df.columns.astype(str)

display(XX_df.head())
print(XX_df.shape)

user,0,1,2,3,4,5,6,7,8,9,...,median_rating,std_rating,count_dislike,count_neutral,count_like,total_interactions,like_ratio,dislike_ratio,neutral_ratio,rating_kurtosis
900,2.0,-1.0,-1.0,3.0,-1.0,2.0,-1.0,-1.0,4.0,-1.0,...,3.0,0.982688,166,176,66,414,0.15942,0.400966,0.425121,-0.149735
901,5.0,4.0,4.0,-1.0,-1.0,4.0,-1.0,-1.0,-1.0,-1.0,...,4.0,0.589878,2,31,258,291,0.886598,0.006873,0.106529,0.632708
902,4.0,-1.0,3.0,-1.0,-1.0,2.0,-1.0,-1.0,4.0,2.0,...,2.0,1.148131,118,45,61,224,0.272321,0.526786,0.200893,-1.419371
903,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,5.0,...,4.0,1.087456,33,62,179,275,0.650909,0.12,0.225455,-0.268205
904,2.0,-1.0,3.0,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,0.0,...,3.0,1.02409,200,186,74,472,0.15678,0.423729,0.394068,-0.579728


(900, 1011)


### Noise level prediction (linear regression)

In [105]:
# Generate predictions using the trained linear regression model

XX_df_lr = scaler_lr.transform(XX_df)

yy_label_pred = lr.predict(XX_df_lr)
print(yy_label_pred.shape)

(900,)




### Anomaly type prediction (logistic regression)

In [106]:
# Generate predictions using the trained logistic regression model

XX_df_log = scaler_log.transform(XX_df)

yy_label_anomtype = log_classifier.predict(XX_df_log)
print(yy_label_anomtype.shape)

(900,)


### Saving the result

In [107]:
# combine dataframe
result_df = XX_df.reset_index()
result_df = result_df[["user"]].copy()
result_df["label"] = yy_label_pred
result_df["anomtype"] = yy_label_anomtype
display(result_df.head(10))

# clip label column to [0, 1], as some predictions are <0 or >1
result_df["label"] = result_df["label"].clip(lower=0, upper=1)

# save as csv; index=False keeps the file to the three required columns
result_df.to_csv('second_batch_output.csv', index=False)
print("Result successfully saved")

Unnamed: 0,user,label,anomtype
0,900,0.627882,0
1,901,0.553428,0
2,902,0.679889,0
3,903,-0.021223,0
4,904,0.663203,0
5,905,0.565031,0
6,906,0.933601,2
7,907,0.283465,2
8,908,0.449125,2
9,909,-0.319913,2


Result successfully saved
