# Data Preparation

## Understanding the data

The following acoustic properties of each voice are measured and included within the CSV:

* `meanfreq`: mean frequency (in kHz)
* `sd`: standard deviation of frequency
* `median`: median frequency (in kHz)
* `Q25`: first quartile (in kHz)
* `Q75`: third quartile (in kHz)
* `IQR`: interquartile range, `Q75 - Q25` (in kHz)
* `skew`: skewness
* `kurt`: kurtosis
* `sp.ent`: spectral entropy
* `sfm`: spectral flatness
* `mode`: mode frequency
* `centroid`: frequency centroid (see `specprop`)
* `peakf`: peak frequency (frequency with the highest energy)
* `meanfun`: average of fundamental frequency measured across acoustic signal
* `minfun`: minimum fundamental frequency measured across acoustic signal
* `maxfun`: maximum fundamental frequency measured across acoustic signal
* `meandom`: average of dominant frequency measured across acoustic signal
* `mindom`: minimum of dominant frequency measured across acoustic signal
* `maxdom`: maximum of dominant frequency measured across acoustic signal
* `dfrange`: range of dominant frequency measured across acoustic signal
* `modindx`: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
* `label`: male or female
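
The `modindx` definition above can be sketched in a few lines of Python. This is only an illustration of the wording in the list, not the code that produced the CSV: the function name `modulation_index` is hypothetical, the "frequency range" is assumed to be the maximum minus the minimum of the same measurements, and returning `0.0` for a constant signal is an assumption made to avoid dividing by zero.

```python
def modulation_index(fund_freqs):
    """Accumulated absolute difference between adjacent measurements,
    divided by the frequency range (assumed here to be max - min)."""
    freq_range = max(fund_freqs) - min(fund_freqs)
    if freq_range == 0:
        # Assumption: a constant signal has no modulation.
        return 0.0
    # Sum of |f[i+1] - f[i]| over all adjacent pairs.
    accumulated = sum(abs(b - a) for a, b in zip(fund_freqs, fund_freqs[1:]))
    return accumulated / freq_range


# Hypothetical measurements, chosen only to make the arithmetic easy to follow:
# |3 - 1| + |2 - 3| + |5 - 2| = 6, range = 5 - 1 = 4.
print(modulation_index([1, 3, 2, 5]))  # 1.5
```

A monotonically rising or falling track gives exactly 1 under this definition; values above 1 indicate that the fundamental frequency moves back and forth within its range.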

### Skewness and Kurtosis

Skewness $S$ and kurtosis $K$ are given by:
$$
\begin{aligned}
S &= \frac{1}{n - 1}\cdot\frac{\sum_{i=1}^n{\left(x_i - \overline{x}\right)^3}}{\sigma^3}\\
K &= \frac{1}{n - 1}\cdot\frac{\sum_{i=1}^n{\left(x_i - \overline{x}\right)^4}}{\sigma^4}
\end{aligned}
$$
where $n$ is the number of observations, $x_i$ is the $i$-th observation, $\overline{x}$ is the mean, and $\sigma$ is the standard deviation.

- $S < 0$ when the spectrum is skewed to the left,
- $S = 0$ when the spectrum is symmetric,
- $S > 0$ when the spectrum is skewed to the right.

Spectrum asymmetry increases with $|S|$.

- $K < 3$ when the spectrum is platykurtic, i.e. it has fewer items at the center and at the tails than the normal curve but more items in the shoulders,
- $K = 3$ when the spectrum shows a normal shape,
- $K > 3$ when the spectrum is leptokurtic, i.e. it has more items near the center and at the tails, with fewer items in the shoulders, relative to a normal distribution with the same mean and variance.
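
As a minimal sketch of the two formulas in this section, the following Python function computes $S$ and $K$ with the same $\frac{1}{n - 1}$ normalisation. The function name `skewness_kurtosis` is hypothetical, and $\sigma$ is assumed to be the sample standard deviation (also with an $n - 1$ denominator), which the text does not state explicitly. The input values are made up to show the sign of $S$.

```python
import math


def skewness_kurtosis(x):
    """Return (S, K) using the 1/(n - 1) normalisation from the formulas."""
    n = len(x)
    mean = sum(x) / n
    # Assumption: sigma is the sample standard deviation (n - 1 denominator).
    sigma = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    s = sum((v - mean) ** 3 for v in x) / (n - 1) / sigma ** 3
    k = sum((v - mean) ** 4 for v in x) / (n - 1) / sigma ** 4
    return s, k


# Hypothetical values: one symmetric set and one with a long right tail.
s_sym, k_sym = skewness_kurtosis([1, 2, 3, 4, 5])
s_right, k_right = skewness_kurtosis([1, 1, 1, 1, 10])
print(s_sym)        # 0.0, the deviations cancel exactly
print(s_right > 0)  # True, skewed to the right
```

Mirroring the second list around zero flips the sign of $S$ and leaves $K$ unchanged, which matches the interpretation given in the lists above.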

### Flatness

The spectral flatness is a measure of the spectral shape of a signal. It is defined as the ratio of the geometric mean to the arithmetic mean of the signal's power spectrum:
$$
F = \frac{\sqrt[n]{\prod_{i=1}^n{p_i}}}{\frac{1}{n}\sum_{i=1}^n{p_i}}
$$
where $n$ is the number of frequency bins, $p_i$ is the power in the $i$-th frequency bin, and $F$ is the spectral flatness.

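$F$ ranges from 0 to 1. The greater $F$ is, the flatter the spectrum: values near 1 indicate a noise-like signal with power spread evenly across the bins, while values near 0 indicate a tonal signal with power concentrated in a few bins.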
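
The flatness ratio can be sketched as follows. The function name `spectral_flatness` is hypothetical and the power values are made up for illustration. The geometric mean is computed through logarithms, a common way to avoid overflow or underflow in the product, and a bin with zero power is assumed to give $F = 0$ because the product is then zero.

```python
import math


def spectral_flatness(power):
    """Ratio of the geometric mean to the arithmetic mean of a power spectrum."""
    if any(p < 0 for p in power):
        raise ValueError("power values must be non-negative")
    arithmetic_mean = sum(power) / len(power)
    if arithmetic_mean == 0 or min(power) == 0:
        # Assumption: any empty bin makes the product, and hence F, zero.
        return 0.0
    # exp(mean(log p)) equals the n-th root of the product of the p values.
    geometric_mean = math.exp(sum(math.log(p) for p in power) / len(power))
    return geometric_mean / arithmetic_mean


# Hypothetical spectra: equal power in every bin vs. one dominant bin.
flat = spectral_flatness([2.0, 2.0, 2.0, 2.0])
peaky = spectral_flatness([1000.0, 0.001, 0.001, 0.001])
print(flat > 0.99, peaky < 0.01)  # True True
```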

## References

* [Data Set from Kaggle](https://www.kaggle.com/datasets/primaryobjects/voicegender)
* [Skewness and Kurtosis](https://cran.r-project.org/web/packages/seewave/seewave.pdf)