# Title: Machine Learning for Deception Detection

## 1. Author
**Student Name: Haocheng Zhang**

**Student ID: 221166194**

## 2. Problem Formulation
In this project, we aim to solve the machine learning problem of deception detection using the MLEnd Deception Dataset. The goal is to classify stories into two categories: "True Story" and "Fabricated Story." Deception detection is an interesting problem due to its applications in psychology, security, and automated systems. Detecting deception accurately can help build trust in systems that require interaction with humans, such as conversational agents and forensic tools. It is also challenging because the signals of deception in speech or text can be subtle and influenced by a variety of factors, including individual traits and environmental conditions.

## 3. Methodology
I use neural networks, random forests, grid-optimized, and decision trees to implement binary decision-making, and compare their effectiveness and similarities and differences.

**Our methodology includes the following steps:**

**Training Task**: The training task involves creating a machine learning model to classify audio files as either "True Story" or "Fabricated Story." We use the provided MLEnd dataset for training and validation.

**Validation Task**: The model is validated using a separate test set to evaluate its generalization ability.

Performance Metrics:
Model performance will be evaluated using:

Accuracy: The ratio of correctly predicted samples to total samples.

Confusion Matrix: To understand false positives and false negatives.

F1-Score, Precision, and Recall: To ensure a balanced performance across classes.

**Additional Tasks**: Feature extraction from audio data is critical for success. We extract Mel Frequency Cepstral Coefficients (MFCCs), chroma features, zero-crossing rate, and other audio features to represent the data effectively.

## 4. Implemented ML Prediction Pipelines

**Input**: Raw audio files from the dataset.

**Stages**:

Transformation Stage: Feature extraction from audio files.

Model Stage: Building and training machine learning models (e.g., Neural Networks, Decision Trees, Random Forests，grid-optimized Decision Trees).

Evaluation Stage: Using metrics to analyze model performance.

Output: Classification of each audio file as either "True Story" or "Fabricated Story."

### 4.1 Transformation Stage
**Input**: Raw audio files (.wav format).

**Output**: Feature vectors representing each audio file.

**Description**:
Features include MFCCs, chroma features, mel spectrogram, spectral centroid, and zero-crossing rate.
These features capture temporal and spectral properties of audio that may correlate with deception.
Chosen because they are standard in audio signal processing and provide a compact representation of the data.

### 4.2 Model Stage

**Models Explored**:

Neural Networks: For their ability to learn non-linear patterns in data.

Decision Trees: As a baseline due to their interpretability.

Random Forests: To improve charateristic.

Grid-optimized: To reduce overfitting and improve stability.

**Reason for Choice**:

Grid-optimized Decision Trees are suited for high-dimensional data, while ensemble methods like Random Forests and Neural networks are robust and perform not well on small datasets, and Decision Trees is too simple.

### 4.3 Ensemble Stage

In the ensemble stage of our model pipeline, we integrated the strengths of the Grid-Optimized Decision Tree with the Standard Decision Tree to create a more robust classifier. This approach was chosen based on the principle that combining models often results in better performance than any single model alone due to their ability to compensate for each other's weaknesses.

#### Reasons for Choosing This Ensemble:

This ensemble approach forms a crucial part of our strategy to enhance model performance while maintaining the interpretability of the decision trees, making it a valuable asset in the robust analytical toolkit required for deception detection.

## 5. Dataset
**Description**:
The dataset consists of audio files and a CSV file with metadata:

**Audio Files**: Speech data categorized as "True Story" or "Fabricated Story."
Metadata CSV: Contains attributes such as filename and Story_type.

### 5.1 Dataset Preparation
**Training and Validation Split:**

Training Set: 70% of the data.

Validation Set: 30% of the data.

**Independence and IID Assumptions:**
Ensured through random sampling without replacement.
Audio features are standardized to maintain uniform distributions.

**Limitations:**
Dataset size might be small, leading to overfitting.
Audio quality variability might introduce noise into feature extraction.

### 5.2 Dataset Visualization
Feature distributions for MFCCs, chroma, and other extracted features were analyzed to check for class imbalance and separability.

## 6. Experiments and Results

### 6.1 Extract data and features from the sample set

In [2]:
import os
import pandas as pd
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import classification_report, accuracy_score
from imblearn.over_sampling import SMOTE
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Path definitions
csv_path = "Deception/CBU0521DD_stories_attributes.csv"
audio_dir = "Deception/CBU0521DD_stories"

# Read the CSV file
data = pd.read_csv(csv_path)

print("Data preview:")
print(data.head())

# Function to extract audio features
def extract_features(file_path):
    try:
        y, sr = librosa.load(file_path, sr=None)

        # Extract MFCC features
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        chroma = librosa.feature.chroma_stft(y=y, sr=sr)
        mel = librosa.feature.melspectrogram(y=y, sr=sr)

        mfcc_mean = np.mean(mfcc, axis=1)
        chroma_mean = np.mean(chroma, axis=1)
        mel_mean = np.mean(mel, axis=1)

        # Extract zero crossing rate
        zero_crossing_rate = librosa.feature.zero_crossing_rate(y=y)
        zcr_mean = np.mean(zero_crossing_rate)

        # Combine additional features
        spectral_centroid = np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))

        print(f"process audio file {file_path}")
        
        return np.hstack((mfcc_mean, chroma_mean, mel_mean, zcr_mean, spectral_centroid))
    except Exception as e:
        print(f"Unable to process audio file {file_path}: {e}")
        return np.zeros(13 + 12 + 128 + 1 + 1)  # Return a zero vector with the appropriate dimension if processing fails

# Extract features for all audio files
features = []
labels = []

for index, row in data.iterrows():
    audio_file = os.path.join(audio_dir, row['filename'])
    if os.path.exists(audio_file):
        feature = extract_features(audio_file)
        features.append(feature)
        labels.append(1 if row['Story_type'] == 'True Story' else 0)

# Convert to NumPy arrays
features = np.array(features)
labels = np.array(labels)

# Apply SMOTE for oversampling
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(features, labels)

# Normalize the data
scaler = StandardScaler()
X_resampled = scaler.fit_transform(X_resampled)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=42)

# Convert labels to one-hot encoding
y_train = to_categorical(y_train, num_classes=2)
y_test = to_categorical(y_test, num_classes=2)

Data preview:
    filename Language  Story_type
0  00001.wav  Chinese  True Story
1  00002.wav  Chinese  True Story
2  00003.wav  Chinese  True Story
3  00004.wav  Chinese  True Story
4  00005.wav  Chinese  True Story
process audio file Deception/CBU0521DD_stories\00001.wav
process audio file Deception/CBU0521DD_stories\00002.wav
process audio file Deception/CBU0521DD_stories\00003.wav
process audio file Deception/CBU0521DD_stories\00004.wav
process audio file Deception/CBU0521DD_stories\00005.wav
process audio file Deception/CBU0521DD_stories\00006.wav
process audio file Deception/CBU0521DD_stories\00007.wav
process audio file Deception/CBU0521DD_stories\00008.wav
process audio file Deception/CBU0521DD_stories\00009.wav
process audio file Deception/CBU0521DD_stories\00010.wav
process audio file Deception/CBU0521DD_stories\00011.wav
process audio file Deception/CBU0521DD_stories\00012.wav
process audio file Deception/CBU0521DD_stories\00013.wav
process audio file Deception/CBU0521DD_st

### 6.2 Decision Tree model

In [82]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score

# Build the Decision Tree model
decision_tree = DecisionTreeClassifier(random_state=42)

# Train the Decision Tree model
print("Training the Decision Tree model...")
decision_tree.fit(X_train, np.argmax(y_train, axis=1))  # Use non-one-hot-encoded labels for training

# Evaluate the Decision Tree model
print("Evaluating the Decision Tree model...")
y_pred = decision_tree.predict(X_test)

# Output the classification report
print("Classification Report:")
print(classification_report(np.argmax(y_test, axis=1), y_pred))  # Use non-one-hot-encoded labels for evaluation
print(f"Accuracy: {accuracy_score(np.argmax(y_test, axis=1), y_pred)}")

Training the Decision Tree model...
Evaluating the Decision Tree model...
Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.47      0.56        15
           1       0.60      0.80      0.69        15

    accuracy                           0.63        30
   macro avg       0.65      0.63      0.62        30
weighted avg       0.65      0.63      0.62        30

Accuracy: 0.6333333333333333


Description: The decision tree was used as a baseline model for its simplicity and interpretability.

Results:Accuracy: 0.63(not good enough)

Observations: The decision tree performed moderately well but tended to overfit the training data due to its tendency to create complex decision boundaries.

### 6.3 Random Forest model

In [81]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Build the Random Forest model
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)  # Using 100 estimators (trees)

# Train the Random Forest model
print("Training the Random Forest model...")
random_forest.fit(X_train, np.argmax(y_train, axis=1))  # Use non-one-hot-encoded labels for training

# Evaluate the Random Forest model
print("Evaluating the Random Forest model...")
y_pred = random_forest.predict(X_test)

# Output the classification report
print("Classification Report:")
print(classification_report(np.argmax(y_test, axis=1), y_pred))  # Use non-one-hot-encoded labels for evaluation
print(f"Accuracy: {accuracy_score(np.argmax(y_test, axis=1), y_pred)}")

Training the Random Forest model...
Evaluating the Random Forest model...
Classification Report:
              precision    recall  f1-score   support

           0       0.55      0.73      0.63        15
           1       0.60      0.40      0.48        15

    accuracy                           0.57        30
   macro avg       0.57      0.57      0.55        30
weighted avg       0.57      0.57      0.55        30

Accuracy: 0.5666666666666667


Description: Random forest was employed as an ensemble method to improve stability and reduce overfitting by averaging multiple decision trees.

Results:Accuracy: 0.57(not good enough)

Observations: Surprisingly, the random forest underperformed compared to the decision tree. This could be due to suboptimal hyperparameter settings or the small dataset size, which made it challenging for an ensemble method to generalize.

### 6.4 CNN model

In [80]:
# Build the neural network model
model = Sequential()
model.add(Dense(256, activation='relu', input_shape=(X_train.shape[1],)))
model.add(BatchNormalization())
model.add(Dropout(0.5))

model.add(Dense(128, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))

model.add(Dense(64, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))

model.add(Dense(2, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Define early stopping and learning rate scheduler
early_stopping = EarlyStopping(monitor='val_loss', patience=16)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=8)  # Set proper learning rate adjustment parameters

# Train the model
print("Training the model...")
history = model.fit(X_train, y_train, epochs=100, batch_size=16, 
                    validation_split=0.2, verbose=1, 
                    callbacks=[early_stopping, reduce_lr])

# Evaluate the model
print("Evaluating the model...")
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_test_classes = np.argmax(y_test, axis=1)

# Output the classification report
print("Classification Report:")
print(classification_report(y_test_classes, y_pred_classes))
print(f"Accuracy: {accuracy_score(y_test_classes, y_pred_classes)}")

Training the model...
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Evaluating the model...
Classification Report:
              precision    recall  f1-score   support

           0  

Description: A CNN was implemented to take advantage of its ability to extract spatial and hierarchical patterns in the feature data.

Results:Accuracy: **0.70**

Observations: The CNN outperformed both the decision tree and random forest, demonstrating its strength in capturing complex patterns in the audio features. **However, the model required significant computational resources and careful tuning to prevent overfitting. due to the small sample size, the neural network also experienced slow gradient descent during training, resulting in low accuracy of the final results, sometimes even below 50%. This result is the best set of parameters I have thrown out after multiple runs.**

### 6.5 grid-optimized Decision Tree

In [22]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Define the model
dt = DecisionTreeClassifier(random_state=42)

# Expand the parameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],  # Splitting criteria
    'max_depth': [None, 10, 20, 30, 40, 50],  # None means no limit
    'min_samples_split': [2, 5, 10, 15, 20],
    'min_samples_leaf': [1, 2, 4, 6, 8],
    'max_features': ['auto', 'sqrt', 'log2', None]  # Number of features to consider when looking for the best split
}

# Setup the grid search
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy', verbose=1)

# Fit the grid search to the data
grid_search.fit(X_train, np.argmax(y_train, axis=1))

# Display the best parameters
print("Best Decision Tree parameters:", grid_search.best_params_)

# Use the best estimator to make predictions
best_dt = grid_search.best_estimator_
y_pred = best_dt.predict(X_test)
y_test_classes = np.argmax(y_test, axis=1)  # Convert if y_test is one-hot encoded

# Output the classification report
print("Classification Report:")
print(classification_report(y_test_classes, y_pred))
print(f"Accuracy: {accuracy_score(y_test_classes, y_pred)}")

Fitting 5 folds for each of 1200 candidates, totalling 6000 fits
Best Decision Tree parameters: {'criterion': 'entropy', 'max_depth': None, 'max_features': 'auto', 'min_samples_leaf': 4, 'min_samples_split': 2}
Classification Report:
              precision    recall  f1-score   support

           0       0.69      0.60      0.64        15
           1       0.65      0.73      0.69        15

    accuracy                           0.67        30
   macro avg       0.67      0.67      0.67        30
weighted avg       0.67      0.67      0.67        30

Accuracy: 0.6666666666666666


Description: A Decision Tree classifier was employed, utilizing a comprehensive grid search to optimize its configuration. The decision tree was chosen for its interpretive ease, allowing us to clearly understand the decision-making process. Grid search was used to methodically explore a range of hyperparameters, including tree depth, criteria for splits, and leaf constraints, aiming to capture the best balance between model complexity and prediction accuracy.

Results:Accuracy: **0.67**

Observations: The decision tree optimized by grid decision shows significant performance, although it does not exceed the upper limit of CNN, it has extremely high stability and will not encounter overfitting problems. This model benefits from the systematic adjustment of its parameters, which to some extent enhances its ability to handle the complexity of audio feature patterns. This makes it particularly useful for applications where understanding the basis of model decisions is crucial.

### 6.6 Results Analysis
The CNN achieved the highest accuracy (0.70), followed by the grid-optimized Decision Tree (0.67), Decision Tree (0.63) and then the Random Forest.

Considering the serious overfitting phenomenon of CNN neural network, obtaining this set of parameters with an accuracy of 0.7 is a coincidence, the confusion matrices revealed that the grid-optimized Decision Tree was better at minimizing false negatives, demonstrating its robustness in capturing the hierarchical patterns in the data. The Decision Tree, showed significant performance improvements with extremely high stability and no overfitting issues. It balanced complexity and accuracy well, but still struggled with some false positives and false negatives compared to the CNN. The Random Forest underperformed unexpectedly (accuracy lower than Decision Tree), suggesting that it might require further hyperparameter tuning or more sophisticated data preprocessing techniques.

Overall, the results indicate that more complex models such as CNNs and random forests may not be suitable for this small sample task as they can learn nonlinear relationships and complex patterns in useless data. The decision tree optimized by grid provides valuable insights due to its transparency, and its overall performance is superior to ordinary decision trees, which helps to understand the falsehood and truth of the story well. I have observed that as I extract more audio features, the accuracy of recognition also increases, possibly due to the decision tree's ability to understand the styles formed by each feature in the audio file well

### 7. Conclusions

This study investigated four machine learning models (CNN, Grid-Optimized Decision Tree, Decision Tree, and Random Forest) for deception detection using audio features. The findings highlight the nuanced capabilities of each model:

#### Key Findings:
**Performance Comparison**:

- The **CNN** achieved the best accuracy (**0.70**), underscoring its proficiency in capturing complex patterns in the audio data. **However, its performance is somewhat tempered by a tendency to overfit due to the small sample size, potentially leading to inconsistent results.**
  
- The **Grid-Optimized Decision Tree** showed significant improvement over the standard Decision Tree (**0.67**). It provided high stability without overfitting, benefiting from systematic parameter tuning which enhanced its ability to handle complex audio features.

- The **Standard Decision Tree** recorded moderate performance (**0.63**), reflecting its simplicity and limited capacity to manage subtleties within the audio data, with a noted propensity for overfitting.

- The **Random Forest** underperformed (**0.57**), likely due to inadequate data volume and suboptimal hyperparameter settings, indicating a need for more sophisticated tuning and possibly more extensive data preprocessing.

**Feature Extraction Importance**:
Feature extraction, especially the use of MFCC, chromaticity, and Mel spectrogram, has played a crucial role in all models. Decision trees are particularly adept at utilizing these features under optimal conditions while providing transparent utilization, making them valuable in applications where interpretability is crucial.

**Data Challenges**:
The limited dataset size constrained the effectiveness of more complex models like the Random Forest and sometimes the CNN, leading to potential overfitting. Class imbalance and the nuanced distinctions between "True Story" and "Fabricated Story" further complicated the modeling.

#### Improvement:

- **Advanced Models**: Explore more sophisticated architectures, including Vggish and PANN, or using hybrid models, to better capture both spatial and temporal dynamics in the audio data.

- **Hyperparameter Optimization**: Employ more rigorous techniques such as grid search or Bayesian optimization to refine the settings for Random Forest and possibly extend this to other models to ensure optimal performance.

- **Ensemble Methods**: Consider using ensemble strategies to amalgamate the strengths of various models. Techniques like stacking or boosting might yield better performance by mitigating the individual weaknesses of single models.

#### Final Conclusion:

Although CNN has shown its greatest potential by achieving the highest accuracy, overfitting may occur. In the current problem, **the grid optimized decision tree model is the optimal solution**, providing a balance between accuracy and interpretability without the overfitting problem that occurs in more complex models, successfully distinguishing between true and false stories. These insights emphasize the importance of model selection based on specific requirements and constraints of the task at hand. Looking ahead to the future, adopting professional speech models can solve the limitations caused by data constraints and improve the reliability and accuracy of deception detection systems.

## 8. References
Scikit-learn Documentation: https://scikit-learn.org/

TensorFlow Documentation: https://www.tensorflow.org/

Chen Wanzhi, Hou Yue A temporal multimodal sentiment analysis model that integrates multi-level attention and sentiment scale vectors Liaoning University of Engineering and Technology, School of Software, School of Electronic and Information Engineering, 2023

Sun Zhi, Crown CNN-GRU speech emotion recognition algorithm based on self supervised contrastive learning China Telecom Shenzhen Branch, Department of Applied Mathematics, Hong Kong Polytechnic University, 2023

https://cloud.tencent.com/developer/article/2443967

https://blog.csdn.net/laojinlaojinlaojin/article/details/138249407

https://blog.csdn.net/universsky2015/article/details/137304461