# Isolation Forest for Anomaly Detection

In this notebook, we apply the **Isolation Forest (IF)** algorithm to detect spam reviews.  

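Isolation Forest is an unsupervised anomaly detection technique that works by **randomly partitioning data**: each tree repeatedly picks a random feature and a random split value until points end up alone in their partitions.  
- Anomalies are isolated in fewer splits because they are rare and lie far from the majority, so their average path length across the trees is short.  
- The algorithm scales well to large datasets with many features, which makes it a reasonable choice for the engineered text and metadata features used here.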
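
The isolation idea can be illustrated with a small, simplified sketch. It is not scikit-learn's implementation: it works on a single hypothetical 1-D feature, skips subsampling, and the helper names `isolation_depth` and `mean_depth` are made up for this example. It counts how many random splits are needed before a chosen value is alone in its partition.

```python
import random

def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:  # remaining points are identical and cannot be split
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that still contains the target
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

rng = random.Random(0)
# Hypothetical 1-D data: 200 typical values near 0 plus one extreme value
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [12.0]

def mean_depth(target, trials=300):
    """Average isolation depth of `target` over many random partitionings."""
    return sum(isolation_depth(data, target, rng) for _ in range(trials)) / trials

outlier_depth = mean_depth(12.0)
typical_depth = mean_depth(data[0])
print(f"outlier: {outlier_depth:.1f} splits, typical point: {typical_depth:.1f} splits")
```

The extreme value should need noticeably fewer splits on average than a value inside the bulk of the data, which is the signal Isolation Forest turns into an anomaly score.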

We will train the IF model on our processed dataset, extract anomaly scores, and label potential spam reviews.


## Step 1: Importing Libraries

We first load the Python libraries needed for:  
- **Data handling and export:** pandas (reading Parquet, writing CSV and Parquet)  
- **Preprocessing:** `MinMaxScaler` from scikit-learn  
- **Modeling:** `IsolationForest` from scikit-learn  
- **Utilities:** `sys`, used to exit early if no usable features are found  

These libraries form the backbone of our anomaly detection workflow.


In [1]:
import sys

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import MinMaxScaler

## Step 2: Loading Processed Dataset

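Next, we load the **cleaned and pre-processed dataset**.  
This dataset contains the features engineered from the raw Google review data, such as:  
- Text-derived and metadata features (for example `text_len`, `sentiment_polarity`, `sentiment_subjectivity`, `excessive_exclaim`)  
- Reviewer and business-level statistics (for example `rating_deviation`, `avg_rating`, `log_num_reviews`)  
- Time and category indicators (`year`, `month`, `weekday`, `hour`, and the one-hot `cat_*` columns)  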

These features will serve as input to the Isolation Forest model for anomaly detection.


In [2]:
df = pd.read_parquet('final_dataset.parquet')

# Define the features to be used in the model
features = [
    'rating',
    'text_len',
    'rating_deviation',
    'sentiment_polarity',
    'sentiment_subjectivity',
    'excessive_exclaim',
    'avg_rating',
    'log_num_reviews',
    'price_encoded',
    'year',
    'month',
    'weekday',
    'hour',
    'cat_American restaurant',
    'cat_Coffee shop',
    'cat_Department store',
    'cat_Fast food restaurant',
    'cat_Grocery store',
    'cat_Hotel',
    'cat_Mexican restaurant',
    'cat_Other',
    'cat_Pizza restaurant',
    'cat_Restaurant',
    'cat_Shopping mall'
]

# Check if all required features exist in the DataFrame
missing_features = [f for f in features if f not in df.columns]
if missing_features:
    print(f"Warning: Missing required features in the dataset: {missing_features}")
    print("Please ensure 'final_dataset.parquet' contains these columns.")
    features = [f for f in features if f in df.columns]  # Proceed with available
    if not features:
        print("No valid features remaining. Exiting.")
        sys.exit(1)
    else:
        print(f"Proceeding with available features: {features}")

# Use a temporary DataFrame for scaling to avoid modifying the original
X = df[features].copy()

# Handle potential NaNs in features (fill with mean)
for col in features:
    if X[col].isnull().any():
        X[col] = X[col].fillna(X[col].mean())
        print(f"Filled NaN values in '{col}' with its mean.")
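
The column-by-column fill above can also be written as a single vectorized call. The sketch below uses a hypothetical toy frame (`X_demo`, with two column names borrowed from the feature list) rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for X; values are made up for illustration
X_demo = pd.DataFrame({
    "rating": [5.0, np.nan, 3.0, 4.0],
    "text_len": [120.0, 80.0, np.nan, 40.0],
})

# Record which columns contain gaps before filling them
cols_with_nan = X_demo.columns[X_demo.isnull().any()].tolist()

# Passing a Series to fillna fills each column with its own mean
X_filled = X_demo.fillna(X_demo.mean())

print(cols_with_nan)
print(X_filled)
```

For binary indicator columns such as the `cat_*` flags, a mean fill produces fractional values, so a constant fill of 0 may be worth considering for those columns.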

## Step 3: Data Preprocessing

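Before training, we scale the features to a common range.  
- **Why scale?** Distance- and gradient-based models are sensitive to feature ranges: unscaled features with larger numerical ranges can dominate.  
  - Isolation Forest itself is tree-based and splits on one feature at a time, so it is largely insensitive to per-feature rescaling. Scaling is harmless here and keeps the inputs consistent for other models that may consume the same features.  
- We use `MinMaxScaler` to transform all features into the **[0,1] range**.  

This puts every feature on the same numerical footing.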
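
The transformation applied by `MinMaxScaler` is `(x - min) / (max - min)`, computed per column. A minimal sketch on hypothetical values (the array `X_demo` is made up for illustration) compares the manual formula with the scikit-learn result:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values for two features with very different ranges
X_demo = np.array([
    [1.0, 100.0],
    [3.0, 400.0],
    [5.0, 1000.0],
])

# Manual min-max scaling: (x - min) / (max - min), column by column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_manual = (X_demo - col_min) / (col_max - col_min)

# The same transformation via scikit-learn
X_sklearn = MinMaxScaler().fit_transform(X_demo)
print(X_sklearn)
```

Note that the manual formula divides by zero for a constant column, so it is only a sketch of the idea and not a drop-in replacement for the scaler.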


In [3]:
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

## Step 4: Building the Isolation Forest Model

We now initialize and train the Isolation Forest model:  
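- **n_estimators:** number of trees in the forest (not set below, so the scikit-learn default of 100 applies)  
- **contamination:** expected proportion of anomalies; it sets the score threshold used for labeling  
- **random_state:** ensures reproducibility  

The model assigns each review:  
- An **anomaly score** from `decision_function` (lower means more isolated; negative values fall below the threshold)  
- An **anomaly label** (1 = normal, -1 = anomaly)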

These outputs help us flag suspicious or spam reviews.
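
To make the roles of `contamination` and the two outputs concrete, here is a minimal sketch on synthetic data. The array `X_demo` and its sizes are assumptions for illustration, not the review dataset. It relies on scikit-learn's documented behaviour that `decision_function` equals `score_samples` minus the fitted `offset_`, and that points with a negative decision value are labeled -1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical data: 950 typical points plus 50 scattered far from the bulk
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(950, 2)),
    rng.uniform(-8.0, 8.0, size=(50, 2)),
])

model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
labels = model.fit_predict(X_demo)        # 1 = inlier, -1 = outlier
scores = model.decision_function(X_demo)  # lower = more anomalous

# decision_function is the raw score shifted by the fitted threshold offset_
raw = model.score_samples(X_demo)
shifted = raw - model.offset_

# Share of points labeled as outliers; contamination controls this share
flagged_share = (labels == -1).mean()
print(f"share flagged: {flagged_share:.3f}")
```

Because the threshold is a percentile of the training scores, roughly the `contamination` share of the training data is flagged regardless of how many true anomalies exist, so the value is a modeling assumption, not an estimate.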



In [4]:
# Initialize the Isolation Forest model
# 'contamination' is set to 0.05 on the assumption that about 5% of the reviews are anomalous
model = IsolationForest(contamination=0.05, random_state=42)

# Fit the model and get the predictions.
# A prediction of -1 indicates an outlier, and 1 indicates an inlier.
df['is_outlier'] = model.fit_predict(X_scaled)

# Get the anomaly score. The lower the score, the more anomalous the point.
df['anomaly_score'] = model.decision_function(X_scaled)
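
Once the two columns exist, the flagged reviews can be pulled out and ranked. The sketch below uses a hypothetical four-row frame (`scored`, with made-up texts and scores) in place of `df`; the added `flagged` column is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Hypothetical scored output with the same two columns the notebook adds
scored = pd.DataFrame({
    "text": ["great food", "BUY NOW!!!", "ok service", "visit my site"],
    "is_outlier": [1, -1, 1, -1],
    "anomaly_score": [0.12, -0.08, 0.05, -0.02],
})

# Convert the -1/1 label into a boolean flag that is easier to filter on
scored["flagged"] = scored["is_outlier"].eq(-1)

# Most anomalous first: ascending order, because lower scores are more anomalous
suspects = scored[scored["flagged"]].sort_values("anomaly_score")
print(suspects[["text", "anomaly_score"]])
```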

In [6]:
# df.info() prints its summary directly and returns None, so no print() is needed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26622 entries, 0 to 26621
Data columns (total 38 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   user_id                   26622 non-null  object 
 1   name_review_user          26622 non-null  object 
 2   time                      26622 non-null  int64  
 3   rating                    26622 non-null  int64  
 4   text                      26622 non-null  object 
 5   gmap_id                   26622 non-null  object 
 6   latitude                  26622 non-null  float64
 7   longitude                 26622 non-null  float64
 8   category                  26616 non-null  object 
 9   avg_rating                26622 non-null  float64
 10  num_of_reviews            26622 non-null  int64  
 11  price                     14747 non-null  object 
 12  state                     15483 non-null  object 
 13  category_main             26622 non-null  object 
 14  cat_Am

## Step 5: Saving Results

Finally, we export the full DataFrame, now including the anomaly scores and labels, in two formats:  
- **CSV:** widely used, easy for inspection and sharing  
- **Parquet:** optimized for speed and storage efficiency, ideal for large datasets  

This ensures the results can be easily re-used in downstream tasks, such as **fusion with Autoencoder scores**.


In [5]:
df.to_csv('IF_final_dataset_with_scores.csv', index=False)
df.to_parquet('IF_final_dataset_with_scores.parquet', index=False)
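
A downstream step can read the export back and filter on the label. The sketch below round-trips a hypothetical three-row frame through an in-memory CSV buffer so that it has no file-system side effects; the real files would be read with `pd.read_csv` or `pd.read_parquet` on the file names used above.

```python
import io

import pandas as pd

# Hypothetical miniature of the exported frame; values are made up
out = pd.DataFrame({
    "rating": [5, 1, 4],
    "is_outlier": [1, -1, 1],
    "anomaly_score": [0.10, -0.04, 0.07],
})

# Write to an in-memory buffer instead of a file
buffer = io.StringIO()
out.to_csv(buffer, index=False)
buffer.seek(0)

# Read the export back and keep only the rows labeled as anomalies
restored = pd.read_csv(buffer)
flagged = restored[restored["is_outlier"] == -1]
print(flagged)
```

CSV stores everything as text, so column types are inferred again on load, whereas Parquet stores the column types alongside the data.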