# Hello! 👋  
Welcome to **Compile & Conquer's** workspace.

Our project combines OpenAI, VADER, and CatBoost to detect spam reviews.  
Below is a brief overview of our pipeline:

---

## 🔑 Training vs Inference

👉 **Training Pipeline (done by us):**
- Standardize raw datasets into workable CSV format
- Preprocess and clean the data
- Use both OpenAI and VADER for pseudolabelling
- Train CatBoost with 10-fold cross-validation
- Save models + metadata
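
To make the 10-fold cross-validation step more concrete, here is a minimal sketch of how the folds can be formed. It uses only the standard library; `k_fold_indices` is a hypothetical helper written for illustration, not the function our training code uses, and the CatBoost fitting itself is left out.

```python
import random

def k_fold_indices(n_rows, k=10, seed=42):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Slicing with a stride of k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

# Train and validation rows never overlap within a fold.
for train_idx, val_idx in k_fold_indices(n_rows=25, k=10):
    assert not set(train_idx) & set(val_idx)
```

Every row ends up in the validation set exactly once, so each of the 10 models is scored on data it never saw during fitting.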

👉 **Inference Pipeline (for judges / new data):**
- Standardize and preprocess new input
- Add VADER sentiment scores as extra features
- Load pre-trained CatBoost models
- Predict spam vs ham  
- Save results to output folder

In [None]:
import sys
import os
import json
import jsonlines
import random
sys.path.append(os.path.join(os.getcwd(), "src"))
from src.parse_file import *
from src.ucsd_json_standardization import *
from src.standardization import *
from src.helpers import *
from src.preprocess import *
from src.Vader_function import *
from src.inference import *


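If this cell fails with `ModuleNotFoundError: No module named 'ucsd_json_standardization'`, the `src` folder is probably not on `sys.path`. The `sys.path.append` call above resolves `src` relative to the current working directory, so make sure that path points at the project's `src` folder.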

Next, we set the folder paths that the pipeline reads from and writes to.

In [None]:
PROJECT_ROOT = os.path.abspath("..")
INPUT_FOLDER = os.path.join(PROJECT_ROOT, "input")
DATA_FOLDER = os.path.join(PROJECT_ROOT, "data")
OUTPUT_FOLDER = os.path.join(PROJECT_ROOT, "outputs")
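
The cell above only builds the paths; it does not create the folders. A minimal sketch of creating any that are missing could look like this. `ensure_folders` is a hypothetical helper, not part of our `src` package, and the folder names simply mirror the ones used above.

```python
import os
import tempfile

def ensure_folders(project_root, names=("input", "data", "outputs")):
    """Create each folder under project_root if it is missing; return name -> path."""
    paths = {}
    for name in names:
        path = os.path.join(project_root, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths[name] = path
    return paths

# Demo in a throwaway directory so nothing in the real project is touched.
with tempfile.TemporaryDirectory() as tmp_root:
    folders = ensure_folders(tmp_root)
    print(sorted(folders))
```

In the notebook itself you would pass `PROJECT_ROOT` instead of a temporary directory.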

In [None]:
input_file = "test.json"  # Feel free to change! The file must be in the project's input folder.
output_file = standardize_file(input_file)
output_path = os.path.join(DATA_FOLDER, output_file)

# Preprocessing

You’re welcome to review the standardization step above; next, we move on to preprocessing.

Our project supports two approaches to standardization, depending on the type of file we receive:

- **UCSD Google Review Dataset:** this dataset is already well-structured, so we use a streamlined, faster pipeline.
- **Other datasets:** for robustness, we use OpenAI to assist with standardization. OpenAI handles many kinds of files, albeit with a slower runtime... 🥺

This dual approach lets the pipeline run efficiently on clean data while staying flexible enough to process less polished real-world datasets.
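
As a rough illustration of how a file could be routed between the two approaches, the sketch below samples the first lines of an input and checks whether they look like UCSD-style JSON records. Everything here is an assumption made for the example: the field names in `UCSD_REQUIRED_KEYS`, the helper names, and the route labels are hypothetical and do not describe how `standardize_file` actually decides.

```python
import json

# Assumed field names for a UCSD Google review record; adjust to your data.
UCSD_REQUIRED_KEYS = {"user_id", "gmap_id", "rating", "text", "time"}

def looks_like_ucsd(lines, sample_size=20):
    """Return True if the sampled non-empty lines are JSON objects with the expected keys."""
    checked = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return False
        if not isinstance(record, dict) or not UCSD_REQUIRED_KEYS.issubset(record):
            return False
        checked += 1
        if checked >= sample_size:
            break
    return checked > 0  # an empty input does not count as a match

def choose_route(lines):
    """Pick the fast rule-based route or the slower LLM-assisted route."""
    return "ucsd_fast_path" if looks_like_ucsd(lines) else "llm_assisted"
```

Checking only a small sample keeps the decision cheap even for large files, at the cost of missing malformed records further down.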

In [None]:
base_name = os.path.splitext(output_file)[0]  # standardized base name
csv_path = os.path.join(DATA_FOLDER, base_name + ".csv")
json_to_csv_from_data(output_path, csv_path)
_, preprocessed_csv = preprocess_file(base_name)
_, final_csv = VADER_Sentiment_Score(base_name)

# Inference

Our dataset is now fully standardized and ready to be plugged into the model!
In the preprocessing stage, we removed potentially problematic rows, such as duplicate entries and nonsensical values.

We also enrich the dataset with VADER sentiment scores, adding new attributes that can help the model capture more nuance in the reviews. This should improve both the robustness and overall accuracy of our predictions.
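
The row-cleaning idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the column names (`user_id`, `text`, `rating`), the 1 to 5 rating range, and the duplicate rule are hypothetical, and `clean_reviews` is not the function behind `preprocess_file`.

```python
def clean_reviews(rows, min_rating=1, max_rating=5):
    """Drop duplicate reviews and rows with empty text or out-of-range ratings."""
    seen = set()
    cleaned = []
    for row in rows:
        text = (row.get("text") or "").strip()
        try:
            rating = float(row.get("rating"))
        except (TypeError, ValueError):
            continue  # rating missing or not numeric
        if not text or not (min_rating <= rating <= max_rating):
            continue  # empty review or implausible rating
        key = (row.get("user_id"), text.lower())
        if key in seen:
            continue  # same user, same text: treat as a duplicate
        seen.add(key)
        cleaned.append(row)
    return cleaned
```

The first occurrence of a duplicated review is kept and later copies are dropped, so the original row order is preserved.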

In [None]:
# Final Step!
_, results_csv = run_inference(base_name)

# Display

In [None]:
import pandas as pd
from IPython.display import display

result_name = base_name + "_results.csv"
result_path = os.path.join(OUTPUT_FOLDER, result_name)

# Preview the first few predictions
df = pd.read_csv(result_path)
display(df.head())