# Model Background

The Support Vector Machine (SVM) classifier finds the hyperplane that separates the two classes (fake vs. real reviews) in feature space with the largest possible margin. Three kernel functions were evaluated: linear, polynomial (degree 3), and radial basis function (RBF). The linear kernel fits a flat decision boundary in the original feature space, which is effective when the classes are approximately linearly separable. The polynomial kernel implicitly maps the input features into a higher-dimensional space of polynomial combinations, allowing the model to capture non-linear relationships. The RBF kernel corresponds to an infinite-dimensional feature space and uses a Gaussian similarity measure, providing the most flexible decision boundary of the three.
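To make the three kernels concrete, the sketch below evaluates each one on a pair of toy vectors using only the standard library. The function names and the example vectors are illustrative assumptions, not code from this project; the formulas follow the parameterization scikit-learn documents for `SVC` (with `coef0=0` as the default for the polynomial kernel).

```python
import math


def dot(x, z):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    # K(x, z) = <x, z>
    return dot(x, z)


def poly_kernel(x, z, degree=3, gamma=1.0, coef0=0.0):
    # K(x, z) = (gamma * <x, z> + coef0) ** degree
    return (gamma * dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Toy vectors (hypothetical, not review features).
x = [1.0, 2.0]
z = [2.0, 0.0]

k_lin = linear_kernel(x, z)   # inner product: 1*2 + 2*0 = 2.0
k_poly = poly_kernel(x, z)    # (1.0 * 2.0 + 0.0) ** 3 = 8.0
k_rbf = rbf_kernel(x, z)      # exp(-1.0 * 5.0), since ||x - z||^2 = 5
```

The RBF value decays toward zero as two points move apart and equals 1 for identical points, which is why it behaves like a local similarity measure, whereas the linear and polynomial kernels grow with the inner product.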

Feature selection is performed first using SelectKBest with the ANOVA F-test (f_classif), which ranks each feature by its F-statistic, the ratio of between-class variance to within-class variance. The number of features to retain (k) is searched over the values 3 through 10. Features with high F-scores show large differences in their means between the fake and real review classes relative to the spread within each class, making them more discriminative for classification. Features with low F-scores are more likely to contribute noise than signal, which is particularly problematic on small datasets where the model can easily memorize irrelevant patterns. Because the F-test scores each feature on its own, it can miss features that are informative only in combination with others.
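The selection step can be sketched as follows. The data here is a synthetic stand-in generated with `make_classification` (an assumption; the real review features are not reproduced), and `k=4` is chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: 208 samples, 10 features, 4 of them informative (assumption).
X, y = make_classification(
    n_samples=208, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

# Score every feature with the ANOVA F-test and keep the k highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=4).fit(X, y)

f_scores = selector.scores_                  # one F-statistic per input feature
p_values = selector.pvalues_                 # matching p-values
kept = selector.get_support(indices=True)    # column indices of the retained features
X_selected = selector.transform(X)           # data reduced to the retained columns
```

Inspecting `f_scores` and `p_values` side by side gives the kind of per-feature ranking reported in the evaluation section.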

After feature selection, grid search is performed with 5-fold stratified cross-validation on the combined training and validation subsets to find the best hyperparameter combination. The hyperparameter grid includes the regularization parameter C (controlling the trade-off between maximizing the margin and minimizing classification errors), the kernel type, and kernel-specific parameters such as gamma (the kernel coefficient: in the RBF kernel it sets how far the influence of a single training example reaches, and in the polynomial kernel it scales the inner product) and the polynomial degree. The best model is selected according to F1 score so that the chosen configuration balances precision and recall instead of optimizing accuracy alone. Once the best model is identified, it is saved using the joblib library, and evaluation metrics are reported for both training and test sets.
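A minimal sketch of this tuning loop is shown below. Several parts are assumptions rather than the project's actual code: the data is synthetic, the selector and the SVM are placed in a single `Pipeline` (so that k is tuned together with the SVM and the F-test is refit inside each fold), the C and gamma values are illustrative because the full grid is not listed here, and the file name `svm_best.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the combined training and validation subsets (assumption).
X, y = make_classification(
    n_samples=208, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

# Selector and classifier share one pipeline, so each CV fold refits the F-test
# on its own training portion only.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("svm", SVC()),
])

k_values = list(range(3, 11))  # k = 3..10, as described in the text
# Illustrative C and gamma values; one sub-grid per kernel so that
# kernel-specific parameters are only searched where they apply.
param_grid = [
    {"select__k": k_values, "svm__kernel": ["linear"], "svm__C": [0.1, 1]},
    {"select__k": k_values, "svm__kernel": ["rbf"], "svm__C": [0.1, 1],
     "svm__gamma": [0.1, 1]},
    {"select__k": k_values, "svm__kernel": ["poly"], "svm__degree": [3],
     "svm__C": [0.1, 1], "svm__gamma": [0.1]},
]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

# Persist the refitted best pipeline and load it back.
model_path = os.path.join(tempfile.mkdtemp(), "svm_best.joblib")
joblib.dump(search.best_estimator_, model_path)
reloaded = joblib.load(model_path)
```

After fitting, `search.best_params_` holds the selected k, kernel, C, and gamma, and `search.best_score_` is the mean cross-validated F1 of that configuration.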

# Evaluation Results

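The SVM classifier was trained on a dataset split 60:20:20 into training, validation, and test subsets (156, 52, and 52 of the 260 reviews). The split is stratified, which preserves the class proportions of the full dataset, so each subset contains equal numbers of real and fake reviews. Two models were trained: one using all 10 input features and one using the best feature subset identified by SelectKBest.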
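One way to obtain such a split is two successive stratified calls to `train_test_split`, sketched below. The labels are a hypothetical balanced stand-in (130 real, 130 fake), and the random seed is arbitrary; the project's actual splitting code is not shown here.

```python
from sklearn.model_selection import train_test_split

# Hypothetical balanced labels: 130 real (0) and 130 fake (1), 260 reviews in total.
y = [0] * 130 + [1] * 130
X = [[i] for i in range(260)]  # placeholder feature rows

# First hold out 20% as the test set, stratifying on the label.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Then split the remaining 80% into 60% train and 20% validation (0.25 of the remainder).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)

sizes = (len(X_train), len(X_val), len(X_test))  # (156, 52, 52)
```

The train and validation parts together give the 208 samples used for cross-validated tuning.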

The model trained on all 10 features selected the polynomial kernel with C=100 and gamma=1 as its best configuration. This model achieved a training accuracy of 0.7837 and a training F1 score of 0.7783, but only a test accuracy of 0.5769 and a test F1 score of 0.5417. The drop from training to test performance (over 20 percentage points in accuracy) indicates significant overfitting. A likely cause is that the degree-3 polynomial kernel implicitly maps the 10 input features into a much higher-dimensional space: there are 286 monomials of degree 3 or lower in 10 variables (220 of exactly degree 3), more than the 208 training samples can reliably constrain. In addition, C=100 is a weak regularization setting: C multiplies the penalty on margin violations, so a large value forces the model to fit the training data tightly at the cost of a narrower margin, and the model picks up training-specific noise rather than generalizable patterns. The polynomial kernel is particularly prone to this on small datasets because it can produce complex, highly curved decision boundaries that wrap closely around training points without capturing the true class structure.
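The size of the implicit feature space can be checked with a short counting argument. The sketch below counts the distinct monomials a degree-3 polynomial kernel can represent for 10 inputs; whether the constant and lower-degree terms are present depends on the kernel's `coef0` setting, which is not stated here, so both cases are computed.

```python
from math import comb

n_features = 10
degree = 3
n_train = 208  # combined training and validation samples

# Monomials of total degree exactly 3 (homogeneous kernel, coef0 = 0).
exact_degree = comb(n_features + degree - 1, degree)

# Monomials of total degree 0 through 3 (inhomogeneous kernel, coef0 > 0),
# including the constant term.
up_to_degree = comb(n_features + degree, degree)

# In both cases the implicit dimension exceeds the number of training samples.
exceeds_samples = exact_degree > n_train and up_to_degree > n_train
```

Either count (220 or 286) is larger than 208, which is consistent with the overfitting explanation above.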

With SelectKBest feature selection, the best model retained k=4 features: NAJ (number of adjectives, F-score=27.91, p=3.2e-7), NST (number of stopwords, F-score=17.59, p=4.1e-5), NOW (number of words, F-score=15.73, p=1.0e-4), and NVB (number of verbs, F-score=8.91, p=3.2e-3). These are the only four features with statistically significant differences (p < 0.01) between fake and real reviews. The remaining six features, NTP (F=2.36, p=0.13), TPR (F=1.81, p=0.18), AWL (F=1.33, p=0.25), ASL (F=0.93, p=0.34), NPV (F=0.11, p=0.74), and CDV (F=0.005, p=0.95), do not reach statistical significance and were discarded. This is consistent with the linguistic intuition that fake reviews differ from real ones primarily in content quantity and descriptive language rather than in superficial stylistic measures like average word length or sentence length.
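As a quick consistency check, the reported scores can be tabulated and filtered. The snippet below uses only the F-scores and p-values quoted above; the 0.01 threshold is the one stated in the text, and the variable names are illustrative.

```python
# F-scores and p-values as reported in the text above: name -> (F, p).
anova = {
    "NAJ": (27.91, 3.2e-7),
    "NST": (17.59, 4.1e-5),
    "NOW": (15.73, 1.0e-4),
    "NVB": (8.91, 3.2e-3),
    "NTP": (2.36, 0.13),
    "TPR": (1.81, 0.18),
    "AWL": (1.33, 0.25),
    "ASL": (0.93, 0.34),
    "NPV": (0.11, 0.74),
    "CDV": (0.005, 0.95),
}

alpha = 0.01

# Features whose p-value falls below the significance threshold.
significant = [name for name, (f_score, p_value) in anova.items() if p_value < alpha]

# The four features with the highest F-scores, i.e. what SelectKBest(k=4) keeps.
top4 = sorted(anova, key=lambda name: anova[name][0], reverse=True)[:4]
```

Both lists contain the same four features, so the k=4 selection coincides with the p < 0.01 cut-off.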

The feature-selected model chose the RBF kernel with C=0.1 and gamma=1. It achieved a training accuracy of 0.6731 and a training F1 score of 0.6458, with a test accuracy of 0.7115 and a test F1 score of 0.6939. The fact that test performance exceeds training performance is notable and can plausibly be attributed to two factors: (1) the low C=0.1 value produces a wide-margin classifier that tolerates some training misclassifications in exchange for better generalization, and (2) the small test set (52 samples) introduces sampling variance: each test review accounts for about 1.9 percentage points of accuracy, so the test split may by chance favor the learned boundary. The switch from the polynomial to the RBF kernel, combined with the reduction in C from 100 to 0.1, suggests that the model benefits from both a more localized kernel and much stronger regularization.
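The sampling-variance argument can be made quantitative with a confidence interval for the test accuracy. The sketch below uses the Wilson score interval, which is one reasonable choice for a small sample and an assumption of this example rather than something the project reports; the count of 37 correct predictions is inferred from 0.7115 × 52.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


n_test = 52
n_correct = round(0.7115 * n_test)  # 37 correct test predictions

low, high = wilson_interval(n_correct, n_test)

# Does the training accuracy fall inside the test-accuracy interval?
train_accuracy = 0.6731
train_within_interval = low <= train_accuracy <= high
```

Under these assumptions the interval spans roughly 0.58 to 0.82 and contains the training accuracy, so the test-above-training gap is well within what sampling noise alone could produce.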

Despite these improvements, the feature-selected model still falls short of the benchmark targets (accuracy 79.09%, F1 76.99%). The gap of approximately 8 percentage points in accuracy and 7.6 in F1 is most plausibly explained by the fundamental limitation of working with only 260 total samples (52 in the test set), where even the best feature selection and tuning cannot fully compensate for the limited training signal. The training accuracy of 67.31% suggests the model is slightly underfitting, indicating that the four selected features, while statistically significant, may not capture all the complexity needed to fully distinguish fake from real reviews.

Comparing across all three models in this project, the SVM with feature selection (accuracy 0.7115, F1 0.6939) slightly outperforms both the gradient-boosted tree classifier (accuracy 0.6923, F1 0.6882) and the random forest classifier (accuracy 0.6923, F1 0.6522). On a 52-review test set, however, the accuracy difference amounts to a single review (37 vs. 36 correct), so the ranking should be read with caution. The SVM achieves its result with the fewest features (4 vs. 5 for GB and 7 for RF), suggesting that most of the usable signal sits in a few highly discriminative features. All three models converge around the 69-71% accuracy range, which likely reflects an inherent ceiling for these 10 features on this dataset size rather than a deficiency of any single model architecture.