# Hello! 👋  
Welcome to **Compile & Conquer's** workspace.

Our project combines several libraries and APIs to detect spam reviews.  
Below is a brief overview of our pipeline:

---

## 🔑 Training vs Inference

👉 **Training Pipeline (done by us):**
- Standardize raw datasets into workable CSV format
- Preprocess and clean the data
- Use both OpenAI and VADER for pseudolabelling
- Train CatBoost with 10-fold cross-validation
- Save models + metadata
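
To make the 10-fold cross-validation step more concrete, here is a minimal sketch of how the rows could be partitioned into folds. It uses only the standard library and does not call CatBoost. The function name `k_fold_indices`, the seed, and the row count are illustrative assumptions, not our actual training code.

```python
import random


def k_fold_indices(n_rows, k=10, seed=42):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; not part of the project code.
    """
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Spread the remainder so fold sizes differ by at most one row.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


# Assumed example size: 105 rows split into 10 folds.
folds = list(k_fold_indices(n_rows=105, k=10))
```

Each row lands in the validation set of exactly one fold, so every row is used for validation once and for training in the other nine folds. In a real run, one model would be fitted per `(train_idx, valid_idx)` pair.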

👉 **Inference Pipeline (for judges / new data):**
- Standardize and preprocess new input
- Add VADER sentiment as an extra feature
- Load pre-trained CatBoost models
- Predict spam vs ham  
- Save results to output folder

In [35]:
import sys
import os
import json
import jsonlines
import random
from parse_file import *
from ucsd_json_standardization import *
from standardization import *
from helpers import *
from preprocess import *
from vader_function import *
from inference import *

We need to set some directories so that our program can access all the files.

In [36]:
PROJECT_ROOT = os.path.abspath("..")
INPUT_FOLDER = os.path.join(PROJECT_ROOT, "input")
DATA_FOLDER = os.path.join(PROJECT_ROOT, "data")
OUTPUT_FOLDER = os.path.join(PROJECT_ROOT, "output")
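
If any of these folders might not exist yet on a fresh checkout, a small helper like the one below could create them before the pipeline runs. This is a sketch under assumptions: `ensure_folders` is a hypothetical name, and we have not confirmed that the project's own functions do not already create these folders.

```python
import os


def ensure_folders(project_root, names=("input", "data", "output")):
    """Create any missing project folders and return their absolute paths.

    Hypothetical helper for illustration; not part of the project code.
    """
    paths = {}
    for name in names:
        path = os.path.join(os.path.abspath(project_root), name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths[name] = path
    return paths
```

Calling `ensure_folders("..")` from the notebook's folder would return the same three paths that the cell above builds by hand.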

In [46]:
input_file = "test.json"  # Change as needed; the file must be in the input folder of the project directory.
output_file = standardize_file(input_file)
output_path = os.path.join(DATA_FOLDER, output_file)

File is successfully standardized!


Feel free to take a look at the standardized file; we will now continue to the preprocessing stage.
Our project takes one of two approaches to standardization, depending on the file we receive.
Because the JSON from the UCSD Google Review dataset is so well formatted, we have a faster way of handling it.
However, because we want scalability and understand that not every dataset is formatted the same way, we also use OpenAI to help us standardize other inputs, albeit at the cost of some runtime.
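
The choice between the two approaches can be pictured as a simple schema check. The sketch below is an assumption about how such a dispatch could work, not a copy of `standardize_file`: the name `choose_standardizer`, the `"fast"` / `"llm"` labels, and the use of these particular keys as the required schema are all hypothetical. The key names themselves appear in the "Final columns" output printed by this notebook.

```python
# Keys treated as the expected schema for this sketch (an assumption).
EXPECTED_KEYS = {"user_id", "text", "rating", "time", "gmap_id"}


def choose_standardizer(records, expected_keys=EXPECTED_KEYS):
    """Return "fast" if every record already has the expected keys, else "llm".

    Hypothetical helper for illustration; not part of the project code.
    """
    if records and all(expected_keys.issubset(record) for record in records):
        return "fast"  # well-formatted input: direct field mapping is enough
    return "llm"       # unknown layout: fall back to the slower OpenAI-assisted path


well_formed = [{"user_id": "1", "text": "Great food", "rating": 5,
                "time": 0, "gmap_id": "g1"}]
unknown_layout = [{"review": "Great food", "stars": 5}]
route_a = choose_standardizer(well_formed)
route_b = choose_standardizer(unknown_layout)
```

Checking the schema first keeps the slower path for the files that actually need it.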

In [49]:
base_name = os.path.splitext(output_file)[0]  # standardized base name
csv_path = os.path.join(DATA_FOLDER, base_name + ".csv")
json_to_csv_from_data(output_path, csv_path)
_, preprocessed_csv = preprocess_file(base_name)
_, final_csv = VADER_Sentiment_Score(base_name)

✅ Converted 10 rows into C:\Users\hanya\Desktop\TikTok TechJam\TikTok-TechJam-2025\data\test_standardized.csv
✅ Preprocessing complete. Saved to C:\Users\hanya\Desktop\TikTok TechJam\TikTok-TechJam-2025\data\test_standardized_preprocessed.csv
Final columns: ['business_name', 'user_name', 'time', 'user_id', 'text', 'rating', 'sentiment_category', 'gmap_id']
✅ VADER sentiment scoring complete. Saved to C:\Users\hanya\Desktop\TikTok TechJam\TikTok-TechJam-2025\data\test_standardized_final.csv


Now our file is ready to be plugged into the model!
In the preprocessing stage above, we remove potentially problematic rows, such as duplicate entries and nonsensical values.
We then add VADER sentiment to increase the number of attributes used for training, which we hope will improve the applicability and accuracy of our model.
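
As a rough illustration of the row filtering described above, the sketch below drops duplicates and rows with empty text or an out-of-range rating. The exact rules are assumptions: `clean_reviews` is a hypothetical name, the 1 to 5 rating range and the `(user_id, gmap_id, text)` duplicate key are guesses, and the real `preprocess_file` may apply different or additional checks.

```python
def clean_reviews(rows):
    """Drop duplicate reviews and rows with empty text or invalid ratings.

    Hypothetical helper for illustration; not part of the project code.
    """
    seen = set()
    cleaned = []
    for row in rows:
        text = (row.get("text") or "").strip()
        try:
            rating = float(row.get("rating"))
        except (TypeError, ValueError):
            continue  # rating is missing or not numeric
        if not text or not 1 <= rating <= 5:
            continue  # empty review or rating outside the assumed 1-5 range
        key = (row.get("user_id"), row.get("gmap_id"), text)
        if key in seen:
            continue  # same user, same place, same text: treat as a duplicate
        seen.add(key)
        cleaned.append(row)
    return cleaned


sample = [
    {"user_id": "u1", "gmap_id": "g1", "text": "Great food", "rating": 5},
    {"user_id": "u1", "gmap_id": "g1", "text": "Great food", "rating": 5},  # duplicate
    {"user_id": "u2", "gmap_id": "g1", "text": "Too loud", "rating": 9},    # bad rating
    {"user_id": "u3", "gmap_id": "g2", "text": "   ", "rating": 4},         # empty text
]
cleaned = clean_reviews(sample)
```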

In [52]:
# Final Step!
_, results_csv = run_inference(base_name)

FileNotFoundError: [Errno 2] No such file or directory: 'model\\features.pkl'
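
The error above refers to the relative path `model\features.pkl`, which is resolved against the current working directory. The file may therefore be missing, or the notebook may be running from a different folder than the one that contains `model`. A quick check like the sketch below can confirm which files are absent before calling `run_inference`. The helper name `missing_artifacts` is hypothetical, and only `features.pkl` is known from the error message; any other required files would need to be added to the list.

```python
from pathlib import Path


def missing_artifacts(model_dir, required=("features.pkl",)):
    """Return the required file names that are absent from model_dir.

    Hypothetical helper for illustration; not part of the project code.
    """
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]
```

For example, `missing_artifacts("model")` checks the same relative location as the error message. An empty list means every listed file was found.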