Import Required Libraries and Load Data

In [8]:
import pandas as pd
import os
import librosa
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Define file paths
csv_file = r"E:\code\truth_teller\Deception-main\Deception-main\CBU0521DD_stories_attributes.csv"
audio_folder = r"E:\code\truth_teller\Deception-main\Deception-main\CBU0521DD_stories"

# Load the CSV file
df = pd.read_csv(csv_file, sep=",")
print("First five rows of the dataset:")
display(df.head())

print(f"Total number of records: {len(df)}")

First five rows of the dataset:


Unnamed: 0,filename,Language,Story_type
0,00001.wav,Chinese,True Story
1,00002.wav,Chinese,True Story
2,00003.wav,Chinese,True Story
3,00004.wav,Chinese,True Story
4,00005.wav,Chinese,True Story


Total number of records: 100


Preprocess Audio Files and Extract Features

In [54]:
# Function to extract MFCC features
def extract_features(audio_path, n_mfcc=13):
    try:
        y, sr = librosa.load(audio_path, duration=180)  # Limit duration to 3 minutes (180 seconds)
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        mfccs_mean = np.mean(mfccs, axis=1)  # Compute mean of each MFCC
        return mfccs_mean
    except Exception as e:
        print(f"Error processing {audio_path}: {e}")
        return None
# Extract MFCC features from all audio files
features = []
labels = []

for index, row in df.iterrows():
    file_path = os.path.join(audio_folder, row['filename'])
    label = row['Story_type']
    mfcc_features = extract_features(file_path)

    if mfcc_features is not None:
        features.append(mfcc_features)
        labels.append(label)

print("Feature extraction complete!")

Feature extraction complete!
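Averaging each coefficient over time, as `extract_features` does, reduces every recording to 13 numbers and discards how the coefficients vary during the story. A minimal sketch of a slightly richer summary is shown below. The helper name `summarize_mfcc` is hypothetical, and a random matrix stands in for the output of `librosa.feature.mfcc` so that the snippet runs without audio files.

```python
import numpy as np

def summarize_mfcc(mfccs):
    """Summarize an (n_mfcc, n_frames) MFCC matrix as one fixed-length vector.

    Concatenates the per-coefficient mean and standard deviation over time,
    so the result has 2 * n_mfcc entries.
    """
    mfccs = np.asarray(mfccs, dtype=float)
    return np.concatenate([mfccs.mean(axis=1), mfccs.std(axis=1)])

# Synthetic stand-in for librosa.feature.mfcc output: 13 coefficients x 500 frames
rng = np.random.default_rng(0)
fake_mfccs = rng.normal(size=(13, 500))

feature_vector = summarize_mfcc(fake_mfccs)
print(feature_vector.shape)
```

Whether the extra 13 standard-deviation features help on this dataset is an open question; with only 100 recordings, adding features also raises the risk of overfitting.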


Data Preparation: Encode Labels and Scale Features

In [55]:
# Convert features and labels to numpy arrays
X = np.array(features)
y = np.array(labels)

# Encode labels to numeric values
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_encoded, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}, Testing samples: {len(X_test)}")


Training samples: 80, Testing samples: 20
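Note that the cell above fits `StandardScaler` on all 100 samples before splitting, so the test samples influence the scaling statistics. A minimal sketch of the alternative order (split first, then estimate the mean and standard deviation from the training rows only) is shown below using plain NumPy on synthetic data. The array names and the manual 80/20 split are illustrative assumptions, not the notebook's code.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 13))  # synthetic stand-in for MFCC means

# Shuffle indices and split 80/20 before computing any statistics
indices = rng.permutation(len(X))
train_idx, test_idx = indices[:80], indices[80:]
X_train, X_test = X[train_idx], X[test_idx]

# Estimate scaling statistics from the training rows only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Apply the same training statistics to both splits
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.shape, X_test_scaled.shape)
```

With scikit-learn the same order is obtained by calling `scaler.fit_transform(X_train)` followed by `scaler.transform(X_test)`.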


Train a Random Forest Classifier

In [56]:
# Train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the test set: {accuracy:.2f}")

# Display confusion matrix and classification report
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))

Accuracy on the test set: 0.55

Confusion Matrix:
[[6 3]
 [6 5]]

Classification Report:
                 precision    recall  f1-score   support

Deceptive Story       0.50      0.67      0.57         9
     True Story       0.62      0.45      0.53        11

       accuracy                           0.55        20
      macro avg       0.56      0.56      0.55        20
   weighted avg       0.57      0.55      0.55        20



## Results and Summary

In this section, we summarize the results of the machine learning model for predicting whether a narrated story is true or not based on audio recordings.

### 1. Model Performance

- **Accuracy on the test set**: The model achieved an accuracy of **0.55** on the test set, i.e. it correctly classified 11 of the 20 test stories. On a two-class problem with a test set this small, that is only marginally above the 50% expected from guessing.

### 2. Confusion Matrix

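The confusion matrix for the test set is shown below. Rows are the actual classes and columns are the predicted classes:

|                      | Predicted Deceptive | Predicted True |
|----------------------|---------------------|----------------|
| **Actual Deceptive** | 6                   | 3              |
| **Actual True**      | 6                   | 5              |

This matrix indicates that the model has difficulty distinguishing between **True Stories** and **Deceptive Stories**. There are 9 misclassifications among the 20 test stories: 6 **True Stories** were labeled **Deceptive** and 3 **Deceptive Stories** were labeled **True**.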

### 3. Classification Report

The classification report provides additional details on the precision, recall, and F1-score for each category:

|                   | Precision | Recall | F1-score | Support |
|-------------------|-----------|--------|----------|---------|
| **Deceptive Story** | 0.50      | 0.67   | 0.57     | 9       |
| **True Story**      | 0.62      | 0.45   | 0.53     | 11      |
| **Accuracy**        |           |        | 0.55     | 20      |
| **Macro avg**       | 0.56      | 0.56   | 0.55     | 20      |
| **Weighted avg**    | 0.57      | 0.55   | 0.55     | 20      |

#### Key Insights:
- **Precision for Deceptive Story**: The model has a precision of **0.50** for the **Deceptive Story** category, meaning that when it classifies a story as deceptive, it's correct 50% of the time.
- **Recall for Deceptive Story**: The recall for **Deceptive Story** is **0.67**, indicating that the model successfully identifies 67% of all deceptive stories.
- **Precision for True Story**: The model has a precision of **0.62** for **True Story**, meaning that when it classifies a story as true, it is correct 62% of the time.
- **Recall for True Story**: The recall for **True Story** is **0.45**, which means the model correctly identifies 45% of all true stories.

The **F1-score** is a balance between precision and recall, and the model has relatively balanced F1-scores for both categories, but still shows room for improvement, especially in recall for **True Stories**.
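The values in the report follow directly from the confusion matrix. The sketch below recomputes them by hand for the Random Forest result; the helper `per_class_metrics` is an illustrative name, and it assumes every class is predicted at least once (otherwise the precision would divide by zero).

```python
# Confusion matrix from the Random Forest run above:
# rows = actual class, columns = predicted class,
# class order = [Deceptive Story, True Story]
conf_matrix = [[6, 3],
               [6, 5]]
class_names = ["Deceptive Story", "True Story"]

def per_class_metrics(matrix, class_index):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    true_positives = matrix[class_index][class_index]
    predicted_as_class = sum(row[class_index] for row in matrix)  # column sum
    actually_class = sum(matrix[class_index])                     # row sum
    precision = true_positives / predicted_as_class
    recall = true_positives / actually_class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for i, name in enumerate(class_names):
    precision, recall, f1 = per_class_metrics(conf_matrix, i)
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

Precision divides the diagonal entry by its column sum (everything predicted as that class), while recall divides it by its row sum (everything that actually belongs to that class).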

### 4. Challenges and Limitations

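- **Small and Possibly Imbalanced Data**: The dataset is small (100 recordings, of which only 20 are in the test set), so every metric above carries wide uncertainty: a single misclassified story moves the accuracy by 5 percentage points. The test split is only mildly imbalanced (9 deceptive vs. 11 true stories), so class imbalance alone is unlikely to explain the low recall for the **True Story** category.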

- **Feature Extraction**: Feature extraction from audio can be noisy and challenging, especially with the variability in story narration styles, voice quality, and background noise in the audio files. The current features may not fully capture the complexities of the audio data.

- **Model Choice**: While a **Random Forest** model was a reasonable first choice, further model tuning and trying other algorithms such as **SVM** or **Deep Learning** could potentially improve the results.
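
To make the small-test-set limitation concrete, one can ask how often pure guessing would reach the observed score. The sketch below computes the probability of at least 11 correct answers out of 20 under a coin-flip model. Treating each prediction as an independent 50/50 guess is a simplifying assumption, and the function name is illustrative.

```python
from math import comb

def prob_at_least(k_correct, n_trials, p_guess=0.5):
    """Probability of k_correct or more successes under Binomial(n_trials, p_guess)."""
    return sum(
        comb(n_trials, k) * p_guess**k * (1 - p_guess) ** (n_trials - k)
        for k in range(k_correct, n_trials + 1)
    )

# 11 of 20 test stories were classified correctly (accuracy 0.55)
p_value = prob_at_least(11, 20)
print(f"P(at least 11/20 correct by guessing) = {p_value:.3f}")
```

The result is roughly 0.41, meaning that guessing alone would do at least this well in about four out of ten test sets of this size.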

### 5. Conclusion

The model achieved an accuracy of **0.55**, which is close to the 50% chance level for a two-class problem, so it shows little evidence of predictive power and leaves significant room for improvement. Future work could include:
- Trying different machine learning algorithms or models,
- Collecting more data to balance the dataset and enhance generalization,
- Experimenting with more advanced feature extraction techniques like **deep learning-based audio features**.

Despite the challenges, this project serves as a good starting point for further refinement and exploration in the area of audio-based story classification.


# Change the Model to Support Vector Machine (SVM)
In this section, we will replace the Random Forest classifier with a Support Vector Machine (SVM) to see if it improves the model's accuracy. SVM is a powerful classifier that works well in high-dimensional spaces, which could be useful for the audio classification task.


In [57]:
# Import necessary libraries for SVM
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

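# Load the data
# 'features' is the list of per-file MFCC mean vectors and 'labels' the matching
# story types, both built in the extraction loop above. Using 'labels' instead of
# df['Story_type'] keeps X and y aligned even if some audio files failed to load.
# Note: these are the raw, unscaled features (not X_scaled).
X = np.array(features)  # features of the audio, shape (n_samples, 13)
y = np.array(labels)    # target labels: 'True Story' or 'Deceptive Story'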

# Split the data into training and test sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Support Vector Machine (SVM) model
svm_model = SVC(kernel='linear')  # Using a linear kernel for simplicity
svm_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svm_model.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Output the results
print(f"Accuracy on the test set: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

Accuracy on the test set: 0.30
Confusion Matrix:
[[3 6]
 [8 3]]
Classification Report:
                 precision    recall  f1-score   support

Deceptive Story       0.27      0.33      0.30         9
     True Story       0.33      0.27      0.30        11

       accuracy                           0.30        20
      macro avg       0.30      0.30      0.30        20
   weighted avg       0.31      0.30      0.30        20



### Results and Summary

In this section, we evaluate the performance of the Support Vector Machine (SVM) model on the test dataset. The results from the model's evaluation are as follows:

#### Accuracy:

- **Accuracy on the test set**: 0.30

This indicates that the model correctly predicted the story type (True or Deceptive) for only 6 of the 20 test stories. On a two-class problem this is below the 50% expected from guessing, which suggests the model has not learned a useful decision boundary from these features.

#### Confusion Matrix:

[[3 6] [8 3]]

The confusion matrix (rows are actual classes, columns are predicted classes) reveals the following:

- **Deceptive Story**: 3 correct predictions and 6 stories misclassified as True Story.
- **True Story**: 3 correct predictions and 8 stories misclassified as Deceptive Story.

This suggests a large number of misclassifications between the two categories, with the model having difficulty distinguishing between the "Deceptive" and "True" stories.

#### Classification Report:

The classification report provides more detailed performance metrics such as precision, recall, and F1-score for each class (Deceptive Story and True Story):


- **Deceptive Story**: Precision is 0.27, recall is 0.33, and F1-score is 0.30. These results indicate that the model is not very good at correctly identifying deceptive stories.
- **True Story**: Precision is 0.33, recall is 0.27, and F1-score is 0.30. Similar to the deceptive story class, the model struggles to correctly identify true stories.

#### Key Observations:

- The model shows poor performance across both classes with precision, recall, and F1-score all being around 0.30.
- **Accuracy** is low, and the model does not distinguish effectively between "True" and "Deceptive" stories.
- There are **misclassifications** across both categories, with the model incorrectly classifying both types of stories at a high rate.

### Conclusion:

The SVM model, with the current feature set and preprocessing, does not perform well for the task of classifying stories as "True" or "Deceptive." The accuracy of 30% indicates that the model struggles to make correct predictions.

#### Potential Improvements:
To improve the model's performance, consider the following strategies:

1. **Feature Engineering**: Extract additional or more relevant features from the audio data. Features such as speech patterns, tone, and pitch could be more informative.
2. **Hyperparameter Tuning**: Perform hyperparameter optimization using grid search or random search to improve the model's performance, particularly by adjusting the `C` parameter and kernel type for SVM.
3. **Feature Scaling**: The SVM above was trained on the raw `features` rather than on `X_scaled`. SVMs are sensitive to feature scale, so applying **StandardScaler** or **MinMaxScaler** (fitted on the training split only) could help improve the model’s performance.
4. **Model Selection**: Consider trying other models, such as **Random Forest**, **Logistic Regression**, or **Neural Networks**, to see if they can handle the audio data more effectively.

By exploring these adjustments, we can attempt to improve classification performance and create a more reliable model for predicting whether a narrated story is true or deceptive.
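
Two of the strategies above, feature scaling and a more reliable evaluation, can be combined in a few lines. The following is a minimal sketch on synthetic data and not a rerun of the experiment: the random feature matrix, the RBF kernel, `C=1.0` and the 5-fold setup are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 100 x 13 MFCC feature matrix and the two story types
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 13))
y_demo = np.array(["Deceptive Story", "True Story"] * 50)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold only
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# Stratified 5-fold cross-validation keeps the class ratio the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(svm_pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {np.round(scores, 2)}")
print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

Because the demo labels are unrelated to the random features, the mean accuracy should hover around chance. The spread across folds gives a sense of how much a single 20-sample test split can vary.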


# Convolutional Neural Network (CNN) for audio classification

## PyTorch Implementation of CNN
In this step, we will use PyTorch to build and train a Convolutional Neural Network (CNN). The features extracted from the audio files (MFCCs) will serve as the input to the CNN. The goal is to improve upon previous performance metrics and better classify the stories.



In [71]:
import pandas as pd
import os
import librosa
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
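import torch
import torch.nn as nn              # layers and loss functions used by the CNN cells below
import torch.nn.functional as F    # functional ops such as F.relu
import torch.optim as optim        # optimizers such as optim.Adam
from torch.utils.data import DataLoader, TensorDataset, random_split

# Select the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")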

# Define file paths
csv_file = r"E:\code\truth_teller\Deception-main\Deception-main\CBU0521DD_stories_attributes.csv"
audio_folder = r"E:\code\truth_teller\Deception-main\Deception-main\CBU0521DD_stories"

# Load the CSV file
df = pd.read_csv(csv_file, sep=",")

# Clean column names (if needed)
df.columns = df.columns.str.strip()  # Strip spaces in column names

# Function to extract MFCC features
def extract_features(audio_path, n_mfcc=13):
    try:
        y, sr = librosa.load(audio_path, duration=180)  # Limit duration to 3 minutes (180 seconds)
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        mfccs_mean = np.mean(mfccs, axis=1)  # Compute mean of each MFCC
        return mfccs_mean
    except Exception as e:
        print(f"Error processing {audio_path}: {e}")
        return None

# Extract MFCC features from all audio files
features = []
labels = []

for index, row in df.iterrows():
    file_path = os.path.join(audio_folder, row['filename'])
    label = row['Story_type']

    # Extract features from the audio file
    mfcc_features = extract_features(file_path)

    if mfcc_features is not None:
        features.append(mfcc_features)
        labels.append(label)

print("Feature extraction complete!")

# Convert features and labels to numpy arrays
X = np.array(features)
y = np.array(labels)

# Encode labels to numeric values (Story_type)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert data to PyTorch tensors
X_tensor = torch.tensor(X_scaled, dtype=torch.float32)
y_tensor = torch.tensor(y_encoded, dtype=torch.long)

# Create a PyTorch dataset
dataset = TensorDataset(X_tensor, y_tensor)

# Split dataset into training and testing sets (seeded so the split is reproducible)
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = random_split(
    dataset, [train_size, test_size], generator=torch.Generator().manual_seed(42)
)

# DataLoaders for batching
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

print("Data prepared successfully!")





Feature extraction complete!
Data prepared successfully!


## Define the CNN Model
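Here we define the Convolutional Neural Network architecture. Each sample is a 1D vector of 13 time-averaged MFCC coefficients, so a channel dimension is added in the training loop to give the shape (batch, 1, 13) expected by `Conv1d`. The convolutions therefore slide over the coefficient index, not over time.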

In [79]:
class AudioCNN(nn.Module):
    def __init__(self, input_size):
        super(AudioCNN, self).__init__()
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=1)  # Conv Layer 1
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)  # MaxPooling
        self.conv2 = nn.Conv1d(16, 32, kernel_size=3, stride=1, padding=1)  # Conv Layer 2
        self.fc1 = nn.Linear(32 * (input_size // 4), 128)  # Fully Connected Layer 1
        self.fc2 = nn.Linear(128, 2)  # Output Layer (2 classes: True Story, Deceptive Story)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # Conv1 -> ReLU -> Pool
        x = self.pool(F.relu(self.conv2(x)))  # Conv2 -> ReLU -> Pool
        x = x.view(-1, 32 * (x.shape[2]))  # Flatten for FC Layer
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Instantiate the model
input_size = X_tensor.shape[1]  # Input feature size
model = AudioCNN(input_size=input_size).to(device)
print(model)


AudioCNN(
  (conv1): Conv1d(1, 16, kernel_size=(3,), stride=(1,), padding=(1,))
  (pool): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv1d(16, 32, kernel_size=(3,), stride=(1,), padding=(1,))
  (fc1): Linear(in_features=96, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=2, bias=True)
)
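
The `in_features=96` of `fc1` in the printout can be checked by tracing the sequence length through the layers. The sketch below does this with two small helper functions (hypothetical names) that apply the standard output-length formulas for a convolution and for max pooling without padding.

```python
def conv1d_out_length(length, kernel_size, stride=1, padding=0):
    """Output length of a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1

def maxpool1d_out_length(length, kernel_size, stride):
    """Output length of 1D max pooling without padding (ceil_mode=False)."""
    return (length - kernel_size) // stride + 1

length = 13  # number of MFCC coefficients per sample
length = conv1d_out_length(length, kernel_size=3, stride=1, padding=1)  # conv1 keeps 13
length = maxpool1d_out_length(length, kernel_size=2, stride=2)          # pool -> 6
length = conv1d_out_length(length, kernel_size=3, stride=1, padding=1)  # conv2 keeps 6
length = maxpool1d_out_length(length, kernel_size=2, stride=2)          # pool -> 3

flattened_features = 32 * length  # 32 output channels from conv2
print(flattened_features)  # 96
```

This agrees with `32 * (input_size // 4)` for `input_size = 13`. After two pooling steps only 3 positions remain per channel, so the convolutional part has very little spatial structure left to work with.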


## Train the CNN Model
We train the CNN using cross-entropy loss and the Adam optimizer. The training loop runs for 20 epochs and prints the average training loss per epoch; accuracy is not tracked during training.


In [80]:
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training the model
epochs = 20
for epoch in range(epochs):
    model.train()
    running_loss = 0.0

    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        inputs = inputs.unsqueeze(1)  # Add channel dimension for CNN (batch, channels, features)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    print(f"Epoch [{epoch+1}/{epochs}], Loss: {running_loss / len(train_loader):.4f}")
print("Training complete!")


Epoch [1/20], Loss: 0.7125
Epoch [2/20], Loss: 0.6872
Epoch [3/20], Loss: 0.6813
Epoch [4/20], Loss: 0.6730
Epoch [5/20], Loss: 0.6651
Epoch [6/20], Loss: 0.6619
Epoch [7/20], Loss: 0.6499
Epoch [8/20], Loss: 0.6389
Epoch [9/20], Loss: 0.6273
Epoch [10/20], Loss: 0.6182
Epoch [11/20], Loss: 0.6016
Epoch [12/20], Loss: 0.5832
Epoch [13/20], Loss: 0.5639
Epoch [14/20], Loss: 0.5372
Epoch [15/20], Loss: 0.5169
Epoch [16/20], Loss: 0.5014
Epoch [17/20], Loss: 0.4920
Epoch [18/20], Loss: 0.4695
Epoch [19/20], Loss: 0.4266
Epoch [20/20], Loss: 0.4124
Training complete!
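
The log above reports training loss only, so it cannot show whether the model generalizes. One common remedy is to hold out a validation split and stop training when its loss stops improving. A minimal sketch of that bookkeeping follows; the class name `EarlyStopping`, the `patience` value and the validation losses are all made up for illustration.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

# Hypothetical validation losses: improving at first, then rising (overfitting)
val_losses = [0.70, 0.68, 0.67, 0.69, 0.71, 0.74, 0.78]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses, start=1):
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(f"Stopped at epoch {stopped_at}, best validation loss {stopper.best_loss:.2f}")
```

In the training loop, `stopper.step(...)` would be called once per epoch with the loss computed on the held-out split under `model.eval()` and `torch.no_grad()`.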


## Evaluate the Model
Here we evaluate the model's performance on the test set using metrics like accuracy, confusion matrix, and classification report.

In [81]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

model.eval()
all_preds = []
all_targets = []

with torch.no_grad():
    for inputs, targets in test_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        inputs = inputs.unsqueeze(1)  # Add channel dimension
        outputs = model(inputs)
        _, preds = torch.max(outputs, 1)
        all_preds.extend(preds.cpu().numpy())
        all_targets.extend(targets.cpu().numpy())

# Compute evaluation metrics
accuracy = accuracy_score(all_targets, all_preds)
conf_matrix = confusion_matrix(all_targets, all_preds)
class_report = classification_report(all_targets, all_preds, target_names=["Deceptive Story", "True Story"])

print(f"Accuracy on the test set: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)


Accuracy on the test set: 0.50
Confusion Matrix:
[[3 4]
 [6 7]]

Classification Report:
                 precision    recall  f1-score   support

Deceptive Story       0.33      0.43      0.38         7
     True Story       0.64      0.54      0.58        13

       accuracy                           0.50        20
      macro avg       0.48      0.48      0.48        20
   weighted avg       0.53      0.50      0.51        20



# Evaluation Summary

## Accuracy on the Test Set: 0.50
The model achieved an accuracy of 50%, meaning it correctly predicted 10 of the 20 test samples. For a two-class problem this is no better than random guessing, and it is below the 65% that always predicting "True Story" would reach on this test split (13 of 20 samples).

## Confusion Matrix:
[[3 4]
[6 7]]
- **True Positives (TP)**: 7 (True Story correctly predicted as True Story)
- **False Positives (FP)**: 4 (Deceptive Story incorrectly predicted as True Story)
- **False Negatives (FN)**: 6 (True Story incorrectly predicted as Deceptive Story)
- **True Negatives (TN)**: 3 (Deceptive Story correctly predicted as Deceptive Story)

The confusion matrix shows that while the model has a higher number of correct predictions for the "True Story" class, it still misclassifies a significant number of instances, particularly for the "Deceptive Story" class.
- **Precision**:
    - For "Deceptive Story", the precision is 0.33, meaning only 33% of the predictions for "Deceptive Story" are correct.
    - For "True Story", the precision is higher at 0.64, indicating a better proportion of correct predictions for this class.

- **Recall**:
    - For "Deceptive Story", the recall is 0.43, meaning the model correctly identified 43% of the actual "Deceptive Story" instances.
    - For "True Story", the recall is 0.54, meaning the model identified 54% of the actual "True Story" instances.

- **F1-Score**:
    - For "Deceptive Story", the F1-score is 0.38, reflecting a poor balance between precision and recall for this class.
    - For "True Story", the F1-score is 0.58, indicating a better balance for the "True Story" class.

- **Support**: The number of true instances in the test set is 7 for "Deceptive Story" and 13 for "True Story".
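
The support values also give a useful reference point: a classifier that ignores the audio and always outputs the most frequent class. The sketch below computes that majority-class baseline for this test split; the label list is reconstructed from the support column, not read from the dataset.

```python
from collections import Counter

# Actual labels of the 20 test samples, reconstructed from the support column
test_labels = ["Deceptive Story"] * 7 + ["True Story"] * 13

# A majority-class baseline always predicts the most frequent label
majority_label, majority_count = Counter(test_labels).most_common(1)[0]
baseline_accuracy = majority_count / len(test_labels)

model_accuracy = 0.50  # CNN accuracy reported above
print(f"Always predicting '{majority_label}': accuracy = {baseline_accuracy:.2f}")
print(f"CNN accuracy: {model_accuracy:.2f}")
```

Any model should be compared against this baseline as well as against the 50% chance level.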

## Macro Average:
- The macro average precision, recall, and F1-score are all around 0.48, which is relatively low and indicates that the model is not performing well across both classes equally.

## Weighted Average:
- The weighted average precision, recall, and F1-score are higher, reflecting the model’s better performance on the "True Story" class due to its larger proportion in the dataset.
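
The difference between the two averages can be reproduced from the confusion matrix. The sketch below does so for precision: the macro average treats both classes equally, while the weighted average scales each class by its support.

```python
# CNN confusion matrix: rows = actual, columns = predicted,
# class order = [Deceptive Story, True Story]
conf_matrix = [[3, 4],
               [6, 7]]

supports = [sum(row) for row in conf_matrix]  # actual samples per class: [7, 13]
total = sum(supports)

# Per-class precision: diagonal entry divided by its column sum
precisions = [
    conf_matrix[i][i] / sum(row[i] for row in conf_matrix)
    for i in range(len(conf_matrix))
]

# Macro average: every class counts equally
macro_precision = sum(precisions) / len(precisions)

# Weighted average: every class counts in proportion to its support
weighted_precision = sum(p * s for p, s in zip(precisions, supports)) / total

print(f"macro precision:    {macro_precision:.2f}")
print(f"weighted precision: {weighted_precision:.2f}")
```

The same pattern applies to recall and F1-score. For recall, the weighted average always equals the overall accuracy.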

# Insights and Next Steps
- **Deceptive Story Misclassification**: The model is struggling with predicting "Deceptive Story" accurately, as shown by its low precision and recall for that class.
- **Better Performance on True Story**: The model performs better with "True Story", but there is still room for improvement.

# Suggestions for Improvement:
1. **Feature Engineering**:
   - Consider using additional audio features (e.g., Chroma, Zero Crossing Rate) to improve the model's understanding of the audio data.

2. **Data Augmentation**:
   - Apply augmentation techniques like pitch shifting, time stretching, or adding background noise to the training data to make the model more robust.

3. **Model Tuning**:
   - Experiment with different architectures or more complex models, such as **LSTM** or **GRU**, which are well-suited for sequential data like audio. These require the per-frame MFCC sequence as input instead of the time-averaged vector used here.

4. **Hyperparameter Optimization**:
   - Fine-tune hyperparameters like learning rate, batch size, or optimizer choice to improve model performance.

5. **Balanced Dataset**:
   - The dataset may be imbalanced, and techniques like **oversampling** or **class weighting** could help improve performance, particularly for the underrepresented class.
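
As an illustration of the class-weighting idea, the sketch below computes inverse-frequency weights using the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. The function name and the example label list are hypothetical; in practice the weights would be computed from the training labels only.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical label list with a 7 / 13 class split
example_labels = ["Deceptive Story"] * 7 + ["True Story"] * 13
weights = balanced_class_weights(example_labels)

for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

The rarer class receives the larger weight. In PyTorch such weights can be passed as a tensor, in class-index order, to the `weight` argument of `nn.CrossEntropyLoss`.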

# Conclusion
The current model performs at chance level, with an accuracy of 50%. While it performs better on "True Story", it still struggles with "Deceptive Story". Further work on data preprocessing, model complexity, and data augmentation could help improve the overall performance.

