# Computational Final Project

## Setup and Imports

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## Loading and Preprocessing the Data

In [36]:
data = pd.read_csv("../data/fake reviews dataset.csv")

data.head()

Unnamed: 0,category,rating,label,text_
0,Home_and_Kitchen_5,5,CG,"Love this! Well made, sturdy, and very comfor..."
1,Home_and_Kitchen_5,5,CG,"love it, a great upgrade from the original. I..."
2,Home_and_Kitchen_5,5,CG,This pillow saved my back. I love the look and...
3,Home_and_Kitchen_5,1,CG,"Missing information on how to use it, but it i..."
4,Home_and_Kitchen_5,5,CG,Very nice set. Good quality. We have had the s...


In [37]:
data.dtypes

category    object
rating       int64
label       object
text_       object
dtype: object

## Feature Engineering

#### Text-based Features
1 - Text Length: Calculate the length of each review in characters.
2 - Sentiment Score: Use TextBlob to get the polarity of each review, which ranges from -1 (negative) to 1 (positive).

In [38]:
from textblob import TextBlob

# Feature: Text length
data['text_length'] = data['text_'].apply(len)

# Feature: Sentiment score
data['sentiment_score'] = data['text_'].apply(lambda x: TextBlob(x).sentiment.polarity)


##### Summary of text_length and sentiment_score in Anomaly Detection

High Variability in Genuine Reviews: Original (OG) reviews are likely to vary in both text_length and sentiment_score, as they reflect diverse user experiences.

Patterns in Fake Reviews: Fake (CG) reviews might follow patterns, such as being consistently short, overly positive, or similarly structured, which these features can capture.

Detecting Anomalies: Using both text_length and sentiment_score as features lets an anomaly detection model such as Isolation Forest flag reviews that deviate from the bulk of the data in length, sentiment, or both.

Together these features capture a structural aspect (text length) and an emotional aspect (sentiment) of each review. They only help if CG reviews are actually unusual in these dimensions; if CG reviews are as common and as varied as genuine ones, an unsupervised detector has little to separate them on.
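
One way to check whether the two features differ between the classes is to summarise them per label before fitting any model. The sketch below uses a tiny made-up DataFrame (the `toy` name and all of its values are invented for illustration, not taken from the dataset) with the same `label`, `text_length`, and `sentiment_score` columns as above.

```python
import pandas as pd

# Toy stand-in for `data`; every value here is invented for illustration only.
toy = pd.DataFrame({
    "label": ["CG", "CG", "CG", "OG", "OG", "OG"],
    "text_length": [80, 85, 90, 20, 150, 400],
    "sentiment_score": [0.8, 0.7, 0.9, -0.5, 0.1, 0.6],
})

# Mean and standard deviation of each feature within each label
summary = toy.groupby("label")[["text_length", "sentiment_score"]].agg(["mean", "std"])
print(summary)
```

Running the same `groupby` on `data` would show the per-label spread on the real reviews. If the standard deviations turn out similar across labels, the variability argument is weak for that feature.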

#### Rating-based Features
1 - Rating Deviation: Subtract the mean rating of the review's category from the review's own rating.

In [39]:
# Calculate the mean rating per category
category_avg_rating = data.groupby('category')['rating'].transform('mean')

# Feature: Rating deviation
data['rating_deviation'] = data['rating'] - category_avg_rating


#### Category Encoding
Encode the category column to use in the model.

In [40]:
# One-hot encoding for category
data = pd.get_dummies(data, columns=['category'], drop_first=True)


#### Final Feature Set

In [41]:
# Prepare the feature set for anomaly detection
category_cols = [col for col in data.columns if col.startswith('category_')]
features = data[['text_length', 'sentiment_score', 'rating', 'rating_deviation'] + category_cols]


## Standardize the Data
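Standardize the features to zero mean and unit variance. Isolation Forest splits on one feature at a time and is largely insensitive to feature scale, but standardized features are needed for scale-sensitive methods such as PCA.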

In [42]:
# Standardize the features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)


## Isolation Forest
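The Isolation Forest isolates points with random axis-aligned splits and scores points that need few splits as anomalies. It is straightforward to implement and works well with tabular data.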

#### 1. Initialize and Train the Isolation Forest

In [43]:
# Initialize the Isolation Forest model
isolation_forest = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)

# Fit the model to the feature data
isolation_forest.fit(features_scaled)


#### 2. Anomaly Prediction

In [None]:
# Predict anomalies
anomaly_labels_if = isolation_forest.predict(features_scaled)
data['anomaly_label_if'] = anomaly_labels_if  # Store predictions in the dataset

# Calculate anomaly scores (the lower the score, the more anomalous)
anomaly_scores_if = isolation_forest.decision_function(features_scaled)
data['anomaly_score_if'] = anomaly_scores_if

#### 3. Evaluate and Interpret Isolation Forest Results

In [45]:
# Display a sample of predicted anomalies
anomalies_if = data[data['anomaly_label_if'] == -1]
print("Number of anomalies detected by Isolation Forest:", anomalies_if.shape[0])

# Filter for rows where label is 'CG' and count them
cg_count = data[data['label'] == 'CG'].shape[0]

print(f"Number of computer-generated (CG) reviews: {cg_count}")



Number of anomalies detected by Isolation Forest: 4044
Number of computer-generated (CG) reviews: 20216


The Isolation Forest flags 4044 reviews as anomalies. This count follows largely from the contamination=0.1 setting, which places the decision threshold so that roughly 10% of the training rows are flagged, so the number alone does not show that these reviews are suspicious. Isolation Forest detects anomalies without using the labels, so the flagged reviews may or may not overlap with the actual CG reviews.

The dataset contains 20216 reviews labeled CG, about five times the number of flagged reviews. Even if every flagged review were CG, at least 16172 CG reviews (20216 - 4044) would remain undetected.
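
To illustrate how the `contamination` setting relates to the number of flagged rows, here is a minimal sketch on synthetic data. The array `X` and its size are assumptions for illustration. It checks the share of rows flagged and how `predict` relates to the sign of `decision_function`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))  # synthetic stand-in for features_scaled

model = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
labels = model.fit_predict(X)          # -1 = anomaly, 1 = inlier
scores = model.decision_function(X)    # negative values fall below the threshold

flagged_share = np.mean(labels == -1)
print(f"Share flagged: {flagged_share:.3f}")
print("predict == -1 wherever decision_function < 0:",
      np.array_equal(labels == -1, scores < 0))
```

The flagged share stays close to the requested `contamination` even though this synthetic data contains no planted anomalies, which is why the raw count of flagged reviews says little on its own.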

## Next Steps for Analysis and Model Improvement

#### Overlap Analysis
Compare the Isolation Forest’s detected anomalies with the CG-labeled reviews to understand the overlap. This will help you see how well the Isolation Forest performed in detecting CG reviews:

In [46]:
# Check overlap between Isolation Forest anomalies and CG reviews
isolation_forest_anomalies = data[data['anomaly_label_if'] == -1]
cg_reviews = data[data['label'] == 'CG']

# Count reviews that are both anomalies and labeled as CG
overlap_count = isolation_forest_anomalies[isolation_forest_anomalies['label'] == 'CG'].shape[0]
print(f"Number of reviews flagged as anomalies and labeled as CG: {overlap_count}")


Number of reviews flagged as anomalies and labeled as CG: 1674


Interpretation: 1674 of the 4044 flagged reviews (about 41%) are labeled CG, and these account for about 8% of the 20216 CG reviews. Most CG reviews are therefore not flagged, which suggests that the features used by the Isolation Forest aren't capturing the characteristics of CG reviews, or that CG reviews are not anomalous in this feature space.
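
The overlap can be expressed as precision and recall, treating CG as the positive class. The sketch below uses only the three counts printed in the outputs above; the variable names are illustrative.

```python
# Counts taken from the outputs printed above
flagged = 4044      # reviews flagged by the Isolation Forest
cg_total = 20216    # reviews labeled CG
overlap = 1674      # reviews that are both flagged and labeled CG

precision = overlap / flagged   # share of flagged reviews that are CG
recall = overlap / cg_total     # share of CG reviews that were flagged

print(f"Precision: {precision:.1%}")
print(f"Recall:    {recall:.1%}")
```

The same two ratios can be recomputed after each change to the features or parameters to see whether the detector is actually improving.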

#### Adjusting Isolation Forest Parameters:

Experiment with different values for contamination (proportion of anomalies) and n_estimators (number of trees) to see if you can improve the overlap with CG labels.

In [47]:
# Example: Adjust contamination to increase/decrease anomaly detection sensitivity
isolation_forest = IsolationForest(n_estimators=150, contamination=0.15, random_state=42)
isolation_forest.fit(features_scaled)
data['anomaly_label_if'] = isolation_forest.predict(features_scaled)
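
In scikit-learn's implementation, `contamination` only moves the decision threshold, so an alternative to refitting for each value is to rank rows by `score_samples` and evaluate that ranking against the labels. The sketch below does this on synthetic data with `roc_auc_score`. The arrays `X` and `is_cg`, and the widely scattered "CG-like" minority, are assumptions made for illustration. The labels are used only for evaluation, never for fitting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
normal = rng.normal(0, 1, size=(950, 3))
scattered = rng.uniform(-6, 6, size=(50, 3))   # hypothetical "CG-like" minority
X = np.vstack([normal, scattered])
is_cg = np.r_[np.zeros(950), np.ones(50)].astype(int)

model = IsolationForest(n_estimators=150, random_state=42).fit(X)

# score_samples: lower means more anomalous, so negate it to get an anomaly score
anomaly_score = -model.score_samples(X)
auc = roc_auc_score(is_cg, anomaly_score)
print(f"ROC AUC of the anomaly ranking: {auc:.3f}")

# Flagged count and overlap at several thresholds, without refitting the forest
for c in (0.05, 0.10, 0.15):
    threshold = np.quantile(anomaly_score, 1 - c)
    flagged = anomaly_score > threshold
    print(c, int(flagged.sum()), int((flagged & (is_cg == 1)).sum()))
```

A threshold-free metric such as ROC AUC separates two questions: whether the ranking is informative at all, and where to place the cutoff.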


#### Exploring Feature Importance:

Since Isolation Forests work by splitting data on feature values, identifying which features drive the anomaly scores could provide insight, for example by comparing the feature distributions of flagged and unflagged reviews.
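
scikit-learn's `IsolationForest` does not expose a `feature_importances_` attribute, so an indirect method is needed. One possible approach, sketched here under assumptions, is to fit a shallow surrogate classifier to the forest's own anomaly labels and read the surrogate's importances. The synthetic data, the reuse of three feature names from above, and the choice of a `DecisionTreeClassifier` surrogate are all illustrative and not part of the original analysis.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
feature_names = ["text_length", "sentiment_score", "rating_deviation"]
X = rng.normal(size=(1000, 3))
X[:50, 0] = rng.uniform(6, 20, size=50)  # extreme values in the first feature only

forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
labels = forest.fit_predict(X)  # -1 = anomaly, 1 = inlier

# Surrogate: learn to reproduce the anomaly labels from the features
surrogate = DecisionTreeClassifier(max_depth=3, random_state=42)
surrogate.fit(X, labels == -1)

importances = dict(zip(feature_names, surrogate.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

The importances describe the surrogate, not the forest itself, so they are best read as a rough indication of which features separate flagged from unflagged rows.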

#### Combining Isolation Forest with PCA or Clustering:

If the overlap between Isolation Forest anomalies and CG reviews isn’t satisfactory, consider combining this method with PCA-based anomaly detection or clustering to capture more patterns indicative of CG reviews.
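
As a rough illustration of the PCA-based option, the sketch below scores each row by its reconstruction error after projecting onto a few principal components, then flags the rows with the largest errors. The synthetic data, the number of components, and the 10% cutoff are assumptions chosen for the example, not values from this analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Synthetic data: 5 features driven by 2 latent factors plus small noise
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 5))
X[:20] += rng.normal(0, 3, size=(20, 5))  # rows that break the low-rank structure

X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reconstructed = pca.inverse_transform(pca.fit_transform(X_scaled))

# Reconstruction error: squared distance between each row and its projection
recon_error = np.sum((X_scaled - X_reconstructed) ** 2, axis=1)

# Flag the 10% of rows with the largest reconstruction error
threshold = np.quantile(recon_error, 0.90)
pca_flag = recon_error > threshold
print("Rows flagged by PCA:", int(pca_flag.sum()))
```

On the real features, `pca_flag` could be compared with `anomaly_label_if`. Rows flagged by both methods might be treated as higher-confidence candidates, though that choice would need checking against the CG labels.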

In [None]:
# Assuming the required libraries are already imported
# Additional imports if needed
from textblob import TextBlob
from nltk.corpus import stopwords
import nltk

# Ensure NLTK stopwords are downloaded
nltk.download('stopwords')

# Initialize stopwords
stop_words = set(stopwords.words('english'))

## Step 1: Adding New Features

# Feature: Subjectivity score
data['subjectivity_score'] = data['text_'].apply(lambda x: TextBlob(x).sentiment.subjectivity)

# Feature: Unique word count
data['unique_word_count'] = data['text_'].apply(lambda x: len(set(x.split())))

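# Feature: Stopword ratio (share of whitespace-separated tokens that are stopwords)
def stopword_ratio(text):
    words = text.split()
    if not words:
        return 0
    # NLTK stopwords are lowercase, so lowercase each token before the lookup
    return sum(word.lower() in stop_words for word in words) / len(words)

data['stopword_ratio'] = data['text_'].apply(stopword_ratio)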

# The new features `subjectivity_score`, `unique_word_count`, and `stopword_ratio` are now added to the DataFrame.

## Step 2: Updating the Feature Set with New Features

# Prepare the feature set for anomaly detection including the new features
features = data[['text_length', 'sentiment_score', 'subjectivity_score', 'unique_word_count', 'stopword_ratio',
                 'rating', 'rating_deviation'] + [col for col in data.columns if col.startswith('category_')]]

## Step 3: Standardize the Data

from sklearn.preprocessing import StandardScaler

# Standardize the features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

## Step 4: Running Isolation Forest

from sklearn.ensemble import IsolationForest

# Initialize and train the Isolation Forest model
isolation_forest = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
isolation_forest.fit(features_scaled)

# Predict anomalies
anomaly_labels_if = isolation_forest.predict(features_scaled)
data['anomaly_label_if'] = anomaly_labels_if  # Store predictions in the dataset

# Calculate anomaly scores (the lower the score, the more anomalous)
anomaly_scores_if = isolation_forest.decision_function(features_scaled)
data['anomaly_score_if'] = anomaly_scores_if


# Display a sample of predicted anomalies
anomalies_if = data[data['anomaly_label_if'] == -1]
print("Number of anomalies detected by Isolation Forest:", anomalies_if.shape[0])

# Filter for rows where label is 'CG' and count them
cg_count = data[data['label'] == 'CG'].shape[0]

print(f"Number of computer-generated (CG) reviews: {cg_count}")

print(f"Anomaly Score (The lower the score, the more anomalous): {anomaly_scores_if}")



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Number of anomalies detected by Isolation Forest: 8087
Number of computer-generated (CG) reviews: 20216
Anomaly Score (The lower the score, the more anomalous): [ 0.00497232  0.04852512  0.02963544 ... -0.10163876 -0.10544148
 -0.04114876]


In [54]:
# Check overlap between Isolation Forest anomalies and CG reviews
isolation_forest_anomalies = data[data['anomaly_label_if'] == -1]
cg_reviews = data[data['label'] == 'CG']

# Count reviews that are both anomalies and labeled as CG
overlap_count = isolation_forest_anomalies[isolation_forest_anomalies['label'] == 'CG'].shape[0]
print(f"Number of reviews flagged as anomalies and labeled as CG: {overlap_count}")

Number of reviews flagged as anomalies and labeled as CG: 3493


In [50]:
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [100, 150, 200], 'contamination': [0.05, 0.1, 0.15]}
grid_search = GridSearchCV(IsolationForest(random_state=42), param_grid, scoring='accuracy', cv=5)
grid_search.fit(features_scaled)
print("Best Parameters:", grid_search.best_params_)


Traceback (most recent call last):
  File "c:\Users\Lenovo\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 811, in _score
    scores = scorer(estimator, X_test)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'

Best Parameters: {'contamination': 0.05, 'n_estimators': 100}
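
The `TypeError` above arises because `scoring='accuracy'` needs ground-truth labels, but `fit` was called without `y`. Every fold therefore fails to score, and the reported best parameters are not the result of a meaningful comparison. A possible fix, sketched on synthetic data, is to pass labels mapped to the Isolation Forest's output convention (-1 for anomalies, 1 for inliers). Treating the minority class as the anomalies, the `y_true` array, the toy data, and the shuffled `StratifiedKFold` splitter are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(size=(450, 3)), rng.uniform(-8, 8, size=(50, 3))])
# Hypothetical labels in IsolationForest's convention: -1 = anomaly, 1 = inlier
y_true = np.r_[np.ones(450), -np.ones(50)].astype(int)

param_grid = {"n_estimators": [50, 100], "contamination": [0.05, 0.1, 0.15]}
grid_search = GridSearchCV(
    IsolationForest(random_state=42),
    param_grid,
    scoring="accuracy",  # compares predict() output with y_true on each test fold
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid_search.fit(X, y_true)  # y is used for scoring only; the forest ignores it when fitting
print("Best parameters:", grid_search.best_params_)
print("Best CV accuracy:", round(grid_search.best_score_, 3))
```

On the notebook's data, the labels could be built with something like `np.where(data['label'] == 'CG', -1, 1)`, assuming CG reviews are the class the forest is meant to flag.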
