
# ECS7020P mini-project submission


## What is the problem?

This year's mini-project considers the problem of predicting whether a narrated story is true or not. Specifically, you will build a machine learning model that takes as an input an audio recording of **30 seconds** of duration and predicts whether the story being narrated is **true or not**.


## Which dataset will I use?

A total of 100 samples consisting of a complete audio recording, a *Language* attribute and a *Story Type* attribute have been made available for you to build your machine learning model. The audio recordings can be downloaded from:

https://github.com/MLEndDatasets/Deception/tree/main/MLEndDD_stories_small

A CSV file recording the *Language* attribute and *Story Type* of each audio file can be downloaded from:

https://github.com/MLEndDatasets/Deception/blob/main/MLEndDD_story_attributes_small.csv




## What will I submit?

Your submission will consist of **one single Jupyter notebook** that should include:

*   **Text cells**, describing in your own words, rigorously and concisely, your approach, each implemented step and the results that you obtain,
*   **Code cells**, implementing each step,
*   **Output cells**, i.e. the output from each code cell.

Your notebook **should have the structure** outlined below. Please make sure that you **run all the cells** and that the **output cells are saved** before submission.

Please save your notebook as:

* ECS7020P_miniproject_2425.ipynb


## How will my submission be evaluated?

This submission is worth 16 marks. We will value:

*   Conciseness in your writing.
*   Correctness in your methodology.
*   Correctness in your analysis and conclusions.
*   Completeness.
*   Originality and efforts to try something new.

**The final performance of your solutions will not influence your grade**. We will grade your understanding. If you have a good understanding, you will be using the right methodology, selecting the right approaches, correctly assessing the quality of your solutions, sometimes acknowledging that despite your attempts your solutions are not good enough, and critically reflecting on your work to suggest what you could have done differently.

Note that **the problem that we are intending to solve is very difficult**. Do not despair if you do not get good results: **difficulty is precisely what makes it interesting** and **worth trying**.

## Show the world what you can do

Why don't you use **GitHub** to manage your project? GitHub can be used as a presentation card that showcases what you have done and gives evidence of your data science skills, knowledge and experience. **Potential employers are always looking for this kind of evidence**.





-------------------------------------- PLEASE USE THE STRUCTURE BELOW THIS LINE --------------------------------------------

# [Your title goes here]

# 1 Author

**Student Name**:  Maria Martinez
**Student ID**:  240990523



# 2 Problem formulation

Describe the machine learning problem that you want to solve and explain what's interesting about it.

- This mini-project addresses the problem of predicting whether a narrated story is true or not. The goal is to build a machine learning model that takes a **30-second** audio recording as input and predicts whether the story being narrated is **true or not**.



- Factors
  - The lie is prepared (thought out in advance).
  - Take silent periods of 0.1 s or longer into account.
  - Identify vocalizations, inspirations and expirations, tongue noises, laughs and giggles, and hiccoughs.
  - Possible measures: rate of speech (the number of syllables in the statement relative to its overall duration) and rate of articulation (based on the length of the speech excluding filled and unfilled pauses).
  - Fundamental frequency (F0) was extracted using the spectral clipping method and zero-crossing analysis. Calculate its mean, range and SD.
  - Verbal measures: (1) the number of arguments (1 = hair; 2 = age); (2) the total number of words; (3) the eloquency index (the ratio between the number of words and the number of arguments); and (4) the disfluency index (the sum of interrupted and repeated words).
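Two of the measures listed above can be sketched as follows. This is a minimal illustration, assuming a mono waveform `y` sampled at `sr` Hz is already available as a NumPy array (for example from `librosa.load`) and that an F0 contour has been produced by some pitch tracker, with unvoiced frames marked as NaN. The function names, the 25 ms frame length and the relative energy threshold are assumptions made for illustration and would need tuning; they are not taken from the study summarised here.

```python
import numpy as np

def count_silent_pauses(y, sr, min_pause_s=0.1, frame_s=0.025, rel_threshold=0.1):
    """Count runs of low-energy frames that last at least min_pause_s seconds.

    frame_s and rel_threshold are illustrative assumptions, not reference values.
    """
    frame_len = int(round(frame_s * sr))
    n_frames = len(y) // frame_len
    if n_frames == 0:
        return 0
    # Short-time RMS energy over non-overlapping frames
    frames = np.asarray(y[: n_frames * frame_len], dtype=float).reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # A frame is "silent" if its energy is well below the loudest frame
    silent = rms < rel_threshold * rms.max()
    # Minimum number of consecutive silent frames that make up a pause
    min_frames = max(1, int(np.ceil(min_pause_s * sr / frame_len - 1e-9)))
    pauses, run = 0, 0
    for is_silent in silent:
        if is_silent:
            run += 1
        else:
            if run >= min_frames:
                pauses += 1
            run = 0
    if run >= min_frames:  # a pause that reaches the end of the recording
        pauses += 1
    return pauses

def f0_summary(f0):
    """Mean, range and SD of an F0 contour, ignoring unvoiced (NaN) frames."""
    voiced = np.asarray(f0, dtype=float)
    voiced = voiced[~np.isnan(voiced)]
    if voiced.size == 0:
        return {"mean": np.nan, "range": np.nan, "sd": np.nan}
    return {
        "mean": float(voiced.mean()),
        "range": float(voiced.max() - voiced.min()),
        "sd": float(voiced.std()),
    }
```

Both functions return one value (or a small dictionary) per recording, so their outputs could be used directly as entries of a feature vector.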

1.  31 male university students were asked to raise doubts in an expert in law about a picture. The subjects were required to describe the picture in three experimental conditions: telling the truth (T) and lying to a speaker when acquiescent (L1) and when suspicious (L2). The utterances were then subjected to a digitized acoustic analysis in order to measure nonverbal vocal variables.

**Evidence**
  - Supported predictors: the number of pauses (F = 4.67; p < .012), the number of syllables, and F0.
  - Suggestion: look at extreme values. Outliers might be tells for lies.
      - Calculate the number of outliers in each sample.
  - By contrast, mean pause duration (considering filled and unfilled pauses separately), mean phrase and speech duration, and mean rate of articulation and language speed showed no significant overall effect.
- Calculate the number of pauses. Why:
    - The complexity of the cognitive task lies in the discrepancy between private knowledge and public statement: the liar (L) knows the truth (which he/she does not tell), but publicly tells a lie (which he/she does not believe, but has to make the H think that he/she does believe it). The cognitive conflict between the knowledge of truth and the encoded message denying (or concealing) it, as well as the need "to disguise" one's true beliefs, imply a saturation of the L's mental capacities making the task of deception all the more complex. This complexity, in fact, came to light in the research through the results of the experiment in the L1 condition especially. These showed a rise in the number of shorter and more recurrent pauses which,
2. There are 3 types of liars: good liars, tense liars (more numerous in L1), and overcontrolled liars (more numerous in L2).
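The suggestion noted under Evidence, counting the outliers in each sample, could be sketched as below. It assumes a per-frame feature contour (for instance F0) is available as a NumPy array. The 1.5 × IQR rule and the function name are illustrative assumptions; the notes above do not prescribe a particular outlier definition.

```python
import numpy as np

def count_outliers(values, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring NaN entries.

    The IQR rule with k=1.5 is an assumed convention, chosen only for illustration.
    """
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]
    if v.size == 0:
        return 0
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(np.sum((v < lower) | (v > upper)))
```

Because the count depends on the length of the recording, dividing it by the number of valid frames may give a more comparable feature across samples.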

# 3 Methodology

Describe your methodology. Specifically, describe your training task and validation task, and how model performance is defined (i.e. accuracy, confusion matrix, etc). Any other tasks that might help you build your model should also be described here.
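To make the performance definitions mentioned above concrete, here is a minimal sketch of accuracy and a 2 × 2 confusion matrix for binary labels. It assumes labels are encoded as 0 and 1; which class each value stands for is an assumption that should be checked against the label encoding of the dataset. The function names are hypothetical.

```python
import numpy as np

def confusion_matrix_binary(y_true, y_pred):
    """2x2 confusion matrix: rows are true classes (0, 1), columns are predicted classes (0, 1)."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

With only 100 samples, the confusion matrix is worth reporting alongside accuracy, since it shows whether errors are concentrated in one class.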

# 4 Implemented ML prediction pipelines

Describe the ML prediction pipelines that you will explore. Clearly identify their input and output, stages and format of the intermediate data structures moving from one stage to the next. It's up to you to decide which stages to include in your pipeline. After providing an overview, describe in more detail each one of the stages that you have included in their corresponding subsections (i.e. 4.1 Transformation stage, 4.2 Model stage, 4.3 Ensemble stage).

## 4.1 Transformation stage

Describe any transformations, such as feature extraction. Identify input and output. Explain why you have chosen this transformation stage.

## 4.2 Model stage

Describe the ML model(s) that you will build. Explain why you have chosen them.

## 4.3 Ensemble stage

Describe any ensemble approach you might have included. Explain why you have chosen them.

# 5 Dataset

Describe the datasets that you will create to build and evaluate your models. Your datasets need to be based on our MLEnd Deception Dataset. After describing the datasets, build them here. You can explore and visualise the datasets here as well.

If you are building separate training and validation datasets, do it here. Explain clearly how you are building such datasets, how you are ensuring that they serve their purpose (i.e. they are independent and consist of IID samples) and any limitations you might think of. It is always important to identify any limitations as early as possible. The scope and validity of your conclusions will depend on your ability to understand the limitations of your approach.
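One way to keep training and validation data independent can be sketched as follows, under the assumption that each recording is cut into several 30-second segments. Splitting by recording rather than by segment ensures that segments from the same story never land on both sides of the split. The helper names, the 20% validation fraction and the fixed seed are illustrative assumptions.

```python
import numpy as np

def segment_audio(y, sr, segment_s=30.0):
    """Cut a waveform into non-overlapping segments, dropping the incomplete tail."""
    seg_len = int(round(segment_s * sr))
    n_segments = len(y) // seg_len
    return [y[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

def split_by_recording(recording_ids, val_fraction=0.2, seed=0):
    """Return boolean masks (train, val) such that no recording appears in both sets.

    recording_ids holds one identifier per segment, naming its source recording.
    """
    recording_ids = np.asarray(recording_ids)
    unique_ids = np.unique(recording_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_ids)
    # Hold out a fraction of whole recordings (at least one) for validation
    n_val = max(1, int(round(val_fraction * len(unique_ids))))
    val_mask = np.isin(recording_ids, shuffled[:n_val])
    return ~val_mask, val_mask
```

Dropping the incomplete tail loses some audio from every recording, which is a limitation worth stating; the segments of one recording are also not independent of each other, which is why the split is done at the recording level.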

If you are exploring different datasets, create different subsections for each dataset and give them a name (e.g. 5.1 Dataset A, 5.2 Dataset B, 5.3 Dataset C).



In [None]:
# Import necessary packages
from google.colab import drive

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os, sys, re, pickle, glob
import urllib.request
import zipfile

import IPython.display as ipd
from tqdm import tqdm
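import librosa

# Mount Google Drive (assumed from the "Mounted at /content/drive" output of this cell)
drive.mount('/content/drive')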



Mounted at /content/drive


In [5]:
%%capture
# Install the library - make sure you have version 1.0.0.4
# (the %%capture cell magic must be the first line of the cell)
!pip install mlend==1.0.0.4


In [6]:
#Import library and functions
import mlend
from mlend import download_deception_small, deception_small_load



In [29]:
#Download small data
datadir = download_deception_small(save_to='MLEnd', subset={}, verbose=1, overwrite=False)


Downloading 100 stories (audio files) from https://github.com/MLEndDatasets/Deception
100%|[92m▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓[0m|100\100|00100.wav
Done!


'MLEnd/deception'

In [31]:
#Read file paths
TrainSet, TestSet, MAPs = deception_small_load(datadir_main=datadir, train_test_split=0.2, verbose=1, encode_labels=True)

Total 100 found in MLEnd/deception/MLEndDD_stories_small/


In [41]:
# Inspect the data: TrainSet and TestSet are returned as dictionaries
# MAPs contains the meaning of the encoded labels
# Length of the training set: 80, i.e. 80% of the 100 samples
len(TrainSet['X_paths'])
# Display the full training-set dictionary
TrainSet



{'X_paths': ['MLEnd/deception/MLEndDD_stories_small/00001.wav',
  'MLEnd/deception/MLEndDD_stories_small/00007.wav',
  'MLEnd/deception/MLEndDD_stories_small/00014.wav',
  'MLEnd/deception/MLEndDD_stories_small/00019.wav',
  'MLEnd/deception/MLEndDD_stories_small/00023.wav',
  'MLEnd/deception/MLEndDD_stories_small/00027.wav',
  'MLEnd/deception/MLEndDD_stories_small/00041.wav',
  'MLEnd/deception/MLEndDD_stories_small/00048.wav',
  'MLEnd/deception/MLEndDD_stories_small/00051.wav',
  'MLEnd/deception/MLEndDD_stories_small/00054.wav',
  'MLEnd/deception/MLEndDD_stories_small/00057.wav',
  'MLEnd/deception/MLEndDD_stories_small/00060.wav',
  'MLEnd/deception/MLEndDD_stories_small/00061.wav',
  'MLEnd/deception/MLEndDD_stories_small/00066.wav',
  'MLEnd/deception/MLEndDD_stories_small/00072.wav',
  'MLEnd/deception/MLEndDD_stories_small/00081.wav',
  'MLEnd/deception/MLEndDD_stories_small/00082.wav',
  'MLEnd/deception/MLEndDD_stories_small/00084.wav',
  'MLEnd/deception/MLEndDD_stories_

In [28]:
#To read the documentation on the given functions run:
help(download_deception_small)
help(deception_small_load)

Help on function download_deception_small in module mlend.downloader:

download_deception_small(save_to='../MLEnd', subset={}, verbose=1, overwrite=False, pbar_style='colab')
    Download Deception Dataset - small size.
    
    
    Parameters
    ----------
    save_to: str, default ('../MLEnd')
       - local path where you want to store the data
         relative to `../MLEnd/deception/`).
    
    subset: dict, default={}
    -  subset of data to download. {'attribute_name':list_of_values to select}
        subset = {'Language':['English','Hindi']}, will download stories of English and Hindi langauge only 
        subset = {'Language':['English'], 'Story_type':['true_story']} will download true stories narrated in English only
        subset = {} will download entire small dataset (100 stories)
    
    subset of data
    
    verbose: bool, (default=1)
       - verbosity level, show progress
    
    overwrite: bool, default=False
       - to avoid downloaing files, if already ex

# 6 Experiments and results

Carry out your experiments here. Analyse and explain your results. Unexplained results are worthless.

# 7 Conclusions

Your conclusions, suggestions for improvements, etc should go here.

# 8 References

Acknowledge others here (books, papers, repositories, libraries, tools)

# Final check: confirm everything is done. Come on, you've got this! Final check with you know who.
- Text cells contain a description in your own words, written **rigorously** and **concisely**, of:
    1. your approach
    2. each implemented step
    3. the results that you obtain