# Group Project: Identifying Gender from Voice Features

### Project Overview:
This project analyzes human voice samples to build several predictive models that classify each speaker as male or female.


### The Dataset
We train our models on a dataset from Kaggle. It includes 3,168 voice samples that have been preprocessed with R's seewave and tuneR packages, with an analyzed frequency range of 0 Hz-280 Hz (human vocal range). We use 75% of the original data for training and the rest for testing.
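
The 75/25 split described above can be sketched with a seeded shuffle of the row indices. This is a minimal illustration using only the standard library; `split_indices` and its parameters are hypothetical names, and in practice a library helper such as scikit-learn's `train_test_split` would likely be used instead.

```python
import random


def split_indices(n_samples, train_fraction=0.75, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


# 3,168 is the number of samples in the dataset.
train_idx, test_idx = split_indices(3168)
print(len(train_idx), len(test_idx))  # 2376 792
```

The resulting index lists could then be passed to `voice_data.iloc[...]` to select the training and test rows.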

Each sample consists of 20 acoustic parameters and a label, either male or female:
    1. meanfreq: mean frequency (in kHz)
    2. sd: standard deviation of frequency
    3. median: median frequency (in kHz)
    4. Q25: first quartile (in kHz)
    5. Q75: third quartile (in kHz)
    6. IQR: interquartile range (in kHz)
    7. skew: skewness (see note in specprop description)
    8. kurt: kurtosis (see note in specprop description)
    9. sp.ent: spectral entropy
    10. sfm: spectral flatness
    11. mode: mode frequency
    12. centroid: frequency centroid (see specprop)
    13. meanfun: average of fundamental frequency measured across acoustic signal
    14. minfun: minimum fundamental frequency measured across acoustic signal
    15. maxfun: maximum fundamental frequency measured across acoustic signal
    16. meandom: average of dominant frequency measured across acoustic signal
    17. mindom: minimum of dominant frequency measured across acoustic signal
    18. maxdom: maximum of dominant frequency measured across acoustic signal
    19. dfrange: range of dominant frequency measured across acoustic signal
    20. modindx: modulation index. 
    

    


The models we chose are as follows:
    1. Neural Network
    2. KNN
    3. Gaussian Naive Bayes
    4. Logistic Regression
    5. Random Forest
    6. SVM
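
As a rough sketch, the six models listed above could be set up with scikit-learn as follows. The choice of scikit-learn, the specific estimator classes (for example `MLPClassifier` for the neural network), and every hyperparameter shown are assumptions for illustration, not the configuration used in this project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Hyperparameters are library defaults or placeholders, not tuned values.
models = {
    "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
    "KNN": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}

# Every estimator exposes the same fit/predict interface,
# so the models can be trained and compared in a single loop.
for name, model in models.items():
    print(name, type(model).__name__)
```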


### This notebook focuses on understanding and visualizing the data before we train the machine learning models.

In [9]:
import pandas as pd  
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

In [10]:
path = '/Users/Shirley/Desktop/Gender_Recognition_by_Voice/voice.csv'
voice_data = pd.read_csv(path)

FileNotFoundError: File b'/Users/Shirle/Desktop/Gender_Recognition_by_Voice/voice.csv' does not exist

In [None]:
voice_data["label"].value_counts()

In [None]:
voice_data.info()

In [None]:
voice_data.head()

In [None]:
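# Part 2: Data Visualization
# Plot the per-class density of every feature (the last column is the label).
for col in voice_data.columns[:-1]:
    # `height` replaces the `size` argument, which newer seaborn versions no longer accept.
    sns.FacetGrid(voice_data, hue="label", height=3).map(sns.kdeplot, col).add_legend()
    plt.show()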

### Our observation: 
At first glance, the most significant features are IQR and meanfun. As one would expect, the fundamental frequencies of male voices are much lower than those of female voices, while the interquartile range of male voices is much higher than that of female voices.

In [None]:
sns.FacetGrid(voice_data, hue="label", height=7).map(plt.scatter, "IQR", "meanfun").add_legend()
plt.show()

###### As we can see, some male samples fall in the region of the plot occupied mostly by female samples.

### Are these two features enough to make predictions?
To answer this question, we will train our models with 2,376 training samples and 792 test samples:

1) with all features

2) with the 2 most significant features (IQR and meanfun)

3) with low-dimensional approximations to the data (PCA)
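
The three setups above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual pipeline: the data is a synthetic stand-in for `voice.csv`, logistic regression stands in for any of the six models, the features are standardized first, and the number of PCA components (2) is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for voice.csv: 3,168 rows, 20 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(3168, 20))
noise = rng.normal(scale=0.5, size=3168)
y = (X[:, 5] - X[:, 12] + noise > 0).astype(int)

# 75% / 25% split, giving 2,376 training and 792 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Zero-based positions of IQR and meanfun in the feature list above.
two_cols = [5, 12]


def evaluate(pipeline, train, test):
    """Fit on the training rows and return accuracy on the test rows."""
    pipeline.fit(train, y_train)
    return pipeline.score(test, y_test)


def classifier():
    return LogisticRegression(max_iter=1000)


scores = {
    "all features": evaluate(
        make_pipeline(StandardScaler(), classifier()), X_train, X_test
    ),
    "IQR + meanfun": evaluate(
        make_pipeline(StandardScaler(), classifier()),
        X_train[:, two_cols],
        X_test[:, two_cols],
    ),
    "PCA (2 components)": evaluate(
        make_pipeline(StandardScaler(), PCA(n_components=2), classifier()),
        X_train,
        X_test,
    ),
}

for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```

Because the data here is synthetic, the printed accuracies say nothing about the real voice dataset; the sketch only shows how the three feature sets can be compared on the same train/test split.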
