
Link: https://physionet.org/content/voiced/1.0.0/

https://wfdb.readthedocs.io/en/latest/wfdb.html#wfdb.rdsamp

https://www.nature.com/articles/s41598-023-34461-9#Bib1


What is Dysphonia? Dysphonia is an alteration of the sound structure of the voice, due to structural or functional changes of one or more organs involved in voice production.

## VOICED Database Capstone Research
The VOICED (Voice Disorder Database) is a comprehensive collection of voice samples designed to facilitate the study of voice disorders. The database is publicly available on PhysioNet and contains a significant number of voice samples from both pathological and healthy subjects.

### Database Composition
The VOICED database contains a total of 208 voice samples: 150 from individuals with voice pathologies and 58 from healthy subjects. This mix allows voice characteristics to be compared across different health statuses.

### Demographic and Clinical Information
In addition to voice samples, the database provides detailed demographic and clinical information. This includes the gender, age, and pathology of the participants, as well as lifestyle factors such as smoking, alcohol, and coffee consumption. Occupational status is also documented. Such comprehensive data can help researchers identify patterns and correlations between these variables and voice disorders.

### Participant Criteria
The study that contributed to the VOICED database had specific inclusion criteria. Participants were required to be adults, aged between 18 and 70 years, and capable of following the study protocol. This age range ensures that the database covers a wide spectrum of adult voices.

### Recording Methodology
Voice samples in the VOICED database were captured using Vox4Health, an m-health system. This system is designed to record voice signals in real-time using the microphone of a mobile device, ensuring that the recordings are easily accessible and can be made in various settings.

### Technical Specifications
All samples in the database were recorded at a sampling rate of 8000 Hz with 32-bit resolution. At this rate, the recordings represent frequency content up to 4000 Hz (the Nyquist limit).

### Data Format
The VOICED database provides the voice recordings in two formats: the WFDB format and text format. The WFDB format is particularly useful for researchers familiar with PhysioNet's tools, as it allows for easy manipulation and analysis of the data.

### Recording Environment
To ensure consistency and quality of the voice samples, all recordings were made in a controlled environment. The room was kept quiet, with less than 30 dB of background noise, and had a humidity level between 30% and 40%. Such conditions help to minimize external factors that could affect the voice recordings.

## Conclusion:
The VOICED database is a valuable resource for researchers interested in voice disorders. It provides a wealth of high-quality voice samples along with extensive demographic and clinical information, all of which are essential for a thorough analysis of voice health and pathology.

In [106]:
##!pip install wget

In [2]:
import wget

In [4]:
url= 'https://physionet.org/physiobank/database/voiced/'
fileDownloadLocation = './Database/'
filename = 'RECORDS'
print('Download records file.')
wget.download(url+filename, fileDownloadLocation+filename)

Download records file.
100% [............................................................] 1871 / 1871

'./Database/RECORDS'

In [5]:
# Read all record names from the RECORDS file; the with-block closes the file automatically
with open(fileDownloadLocation+filename, "r") as fileHandler:
    listOfLines = fileHandler.readlines()

In [6]:
print('First file:',listOfLines[0])
print('Last file:',listOfLines[-1])

First file: voice001

Last file: voice208


In [7]:
print('Beginning file download.')
for line in listOfLines:
    filename = line.strip()
    print(filename)
    # Each record has a text signal, a binary signal, an info file, and a WFDB header
    for suffix in ('.txt', '.dat', '-info.txt', '.hea'):
        wget.download(url+filename+suffix, fileDownloadLocation+filename+suffix)
print('Done.')

Beginning file download.
voice001
100% [..............................................................] 175 / 175voice002
100% [..............................................................] 158 / 158voice003
...
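
Re-running the download cell fetches every file again. A minimal sketch of one way to skip files that already exist locally is given here, using only the standard library. The helper names `plan_downloads` and `run_downloads` are hypothetical and not part of the original notebook; the URL and file suffixes are the ones used in the cell above.

```python
import os
import urllib.request

# Same base URL and file suffixes as the download cell above
BASE_URL = 'https://physionet.org/physiobank/database/voiced/'
SUFFIXES = ('.txt', '.dat', '-info.txt', '.hea')


def plan_downloads(record_names, target_dir, base_url=BASE_URL):
    """Return (url, local_path) pairs for files not yet present in target_dir."""
    plan = []
    for name in record_names:
        name = name.strip()
        if not name:
            continue  # skip blank lines in RECORDS
        for suffix in SUFFIXES:
            local_path = os.path.join(target_dir, name + suffix)
            if not os.path.exists(local_path):
                plan.append((base_url + name + suffix, local_path))
    return plan


def run_downloads(plan):
    """Fetch every planned file (hypothetical helper; needs network access)."""
    for url, local_path in plan:
        os.makedirs(os.path.dirname(local_path) or '.', exist_ok=True)
        urllib.request.urlretrieve(url, local_path)
```

Calling `run_downloads(plan_downloads(listOfLines, './Database/'))` would then only fetch the files that are still missing.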

In [14]:
import numpy as np
import pandas as pd

In [17]:
df = pd.DataFrame({'filename': listOfLines})

In [21]:
print(df.head(5))

     filename
0  voice001\n
1  voice002\n
2  voice003\n
3  voice004\n
4  voice005\n


In [26]:
first_filename = df['filename'].iloc[0].strip()
first_filename

'voice001'

In [28]:
import pandas as pd
import re

## Opening File

In [29]:
# Read record file
fileDownloadLocation = './Database/'
filename = 'RECORDS'

# Get the list of all record names; the with-block closes the file automatically
with open(fileDownloadLocation+filename, "r") as fileHandler:
    listOfRecords = fileHandler.readlines()

## Creating Features

In [30]:
# Create header of features from the first record's info file
filename = listOfRecords[0].strip()
with open(fileDownloadLocation+filename+'-info.txt', "r") as fileHandler:
    listOfData = fileHandler.readlines()

In [31]:
# The field name is the text before the first tab, comma, or newline on each line
headerList = [re.split('\t|,|\n', line)[0] for line in listOfData]
# Append one trailing empty column name
headerList.append("")
# Remove ':' from the names
newheaderList = [element.replace(':', '') for element in headerList]
newheaderList

['ID',
 '',
 'Age',
 'Gender',
 'Diagnosis',
 'Occupation status',
 '',
 '',
 'Voice Handicap Index (VHI) Score',
 'Reflux Symptom Index (RSI) Score',
 '',
 '',
 'Smoker',
 'Number of cigarettes smoked per day',
 '',
 'Alcohol consumption',
 'Number of glasses containing alcoholic beverage drinked in a day',
 "Amount of water's litres drink every day",
 '',
 'Eating habits',
 'Carbonated beverages',
 'Amount of glasses drinked in a day',
 'Tomatoes',
 'Coffee',
 'Number of cups of coffee drinked in a day',
 'Chocolate',
 'Gramme of chocolate eaten in  a day',
 'Soft cheese',
 'Gramme of soft cheese eaten in a day',
 'Citrus fruits',
 'Number of citrus fruits eaten in a day',
 '']

## Building Dataframe

In [32]:

data = []
for line in listOfRecords:
    record = line.strip()
    filename = record+'-info.txt'
    # Read all lines of this record's info file
    with open(fileDownloadLocation+filename, "r") as fileHandler:
        dataLine = fileHandler.readlines()

    # The value is the second field on each line (after the first tab or comma)
    dataString = [re.split('\t|,|\n', row)[1] for row in dataLine]
    data.append(dataString)
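
The loop above relies on every info file listing its fields in the same order. As a sketch of an alternative, each file could be parsed into a dictionary keyed by field name, so that values stay attached to their labels. The function `parse_info_lines` and the sample lines are hypothetical; the sample assumes a `name:<tab>value` layout, which is inferred from the split pattern rather than confirmed against the files.

```python
import re


def parse_info_lines(lines):
    """Map each non-empty field name to its value for one -info.txt file."""
    fields = {}
    for line in lines:
        parts = re.split(r'\t|,|\n', line)
        key = parts[0].replace(':', '').strip()
        if not key:
            continue  # blank separator line
        # Lines without a separator get an empty value instead of raising IndexError
        fields[key] = parts[1].strip() if len(parts) > 1 else ''
    return fields


# Hypothetical sample in the assumed "name:<tab>value" layout
sample = ['ID:\tvoice001\n', '\n', 'Age:\t32\n', 'Gender:\tm\n']
info = parse_info_lines(sample)
print(info)
```

A list of such dictionaries can be passed directly to `pd.DataFrame`, which aligns the columns by key.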

In [34]:
df = pd.DataFrame(data)
df.columns = newheaderList
# Drop all columns with name ''
df = df.drop('', axis=1)
df.head()

| | ID | Age | Gender | Diagnosis | Occupation status | Voice Handicap Index (VHI) Score | Reflux Symptom Index (RSI) Score | Smoker | Number of cigarettes smoked per day | Alcohol consumption | ... | Amount of glasses drinked in a day | Tomatoes | Coffee | Number of cups of coffee drinked in a day | Chocolate | Gramme of chocolate eaten in a day | Soft cheese | Gramme of soft cheese eaten in a day | Citrus fruits | Number of citrus fruits eaten in a day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | voice001 | 32 | m | hyperkinetic dysphonia | Researcher | 15 | 5 | no | NU | casual drinker | ... | NU | sometimes | almost always | 4 | almost never | NU | sometimes | NU | sometimes | NU |
| 1 | voice002 | 55 | m | healthy | Employee | 17 | 12 | casual smoker | 2 | habitual drinker | ... | 3 | sometimes | sometimes | 3 | sometimes | NU | almost always | 50 gr | almost always | 2 |
| 2 | voice003 | 34 | m | hyperkinetic dysphonia (nodule) | Researcher | 42 | 26 | no | NU | casual drinker | ... | 1 | sometimes | almost always | NU | sometimes | 20 gr | almost always | 200 gr | almost never | NU |
| 3 | voice004 | 28 | f | hypokinetic dysphonia | Researcher | 20 | 9 | casual smoker | NU | casual drinker | ... | NU | sometimes | always | 3 | sometimes | NU | almost always | NU | sometimes | NU |
| 4 | voice005 | 54 | f | hypokinetic dysphonia | Researcher | 39 | 23 | no | NU | casual drinker | ... | NU | sometimes | never | NU | sometimes | 150 gr | sometimes | 200 gr | almost always | 1 |


In [35]:
df.shape

(208, 24)

In [36]:
list(df.columns.values)

['ID',
 'Age',
 'Gender',
 'Diagnosis',
 'Occupation status',
 'Voice Handicap Index (VHI) Score',
 'Reflux Symptom Index (RSI) Score',
 'Smoker',
 'Number of cigarettes smoked per day',
 'Alcohol consumption',
 'Number of glasses containing alcoholic beverage drinked in a day',
 "Amount of water's litres drink every day",
 'Eating habits',
 'Carbonated beverages',
 'Amount of glasses drinked in a day',
 'Tomatoes',
 'Coffee',
 'Number of cups of coffee drinked in a day',
 'Chocolate',
 'Gramme of chocolate eaten in  a day',
 'Soft cheese',
 'Gramme of soft cheese eaten in a day',
 'Citrus fruits',
 'Number of citrus fruits eaten in a day']

In [38]:
##!pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.1.2-py2.py3-none-any.whl (249 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.2


In [45]:
from openpyxl import Workbook
##import xlsxwriter

In [46]:
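# Write to excel file (the ./Datasets/ directory must already exist)
excelFile = './Datasets/dataset_InfoTxtFile.xlsx'
with pd.ExcelWriter(excelFile) as writer:
    # Pass the sheet name by keyword; the positional form triggers a pandas warning
    df.to_excel(writer, sheet_name='Sheet1')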


## Machine Learning

In [61]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import mean_squared_error

In [49]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)

Missing Values:
 ID                                                                  0
Age                                                                 0
Gender                                                              0
Diagnosis                                                           0
Occupation status                                                   0
Voice Handicap Index (VHI) Score                                    0
Reflux Symptom Index (RSI) Score                                    0
Smoker                                                              0
Number of cigarettes smoked per day                                 0
Alcohol consumption                                                 0
Number of glasses containing alcoholic beverage drinked in a day    0
Amount of water's litres drink every day                            0
Eating habits                                                       0
Carbonated beverages                                                0
Amo

In [50]:
# Data Preprocessing
# Drop rows with missing values (if needed)
df.dropna(inplace=True)
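
Because every value was read from text, `isnull()` reports zero missing values even where a cell holds the `NU` placeholder. The sketch below shows one way to expose those cells, assuming `NU` means "not applicable or not recorded" (an interpretation, not something the notebook states) and that unit suffixes such as ` gr` can be dropped. The `demo` frame is made-up data in the same shape as a few of the real columns.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame mimicking the string-typed columns above
demo = pd.DataFrame({
    'Age': ['32', '55', '34'],
    'Number of cigarettes smoked per day': ['NU', '2', 'NU'],
    'Gramme of soft cheese eaten in a day': ['NU', '50 gr', '200 gr'],
})

# Treat the 'NU' placeholder as missing (an assumption about its meaning)
demo = demo.replace('NU', np.nan)

# Strip the ' gr' unit suffix, then convert to numbers; unparseable values become NaN
for column in demo.columns:
    cleaned = demo[column].str.replace(' gr', '', regex=False)
    demo[column] = pd.to_numeric(cleaned, errors='coerce')

print(demo.isnull().sum())
```

With the placeholders converted, a subsequent `dropna()` or imputation step acts on the cells that are actually empty.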

In [57]:
# Encode categorical variables
label_encoder = LabelEncoder()
categorical_columns = ['Gender', 'Diagnosis', 'Occupation status', 'Smoker', 'Alcohol consumption', 'Eating habits']
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column])
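
Reusing one `LabelEncoder` for every column works, but after the loop it only remembers the classes of the last column. A possible variant, sketched on a small made-up frame, keeps one encoder per column so the integer codes can be mapped back to their labels later. The `encoders` dictionary and the `demo` frame are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for two of the categorical columns
demo = pd.DataFrame({
    'Gender': ['m', 'f', 'm'],
    'Smoker': ['no', 'casual smoker', 'no'],
})

# One fitted encoder per column, kept for later decoding
encoders = {}
for column in ['Gender', 'Smoker']:
    encoders[column] = LabelEncoder()
    demo[column] = encoders[column].fit_transform(demo[column])

# Recover the original labels from the integer codes
original_gender = encoders['Gender'].inverse_transform(demo['Gender'])
print(list(original_gender))
```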


In [58]:
# Split data into features and target variable
X = df.drop(columns=['ID', 'Voice Handicap Index (VHI) Score'])  # Features
y = df['Voice Handicap Index (VHI) Score']  # Target variable


In [59]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [63]:
# Machine Learning Model
# Train a RandomForestRegressor
#rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
#rf_regressor.fit(X_train, y_train)
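
The regressor above is commented out; fitting it needs a feature matrix that is entirely numeric. As a rough sketch of what the training and evaluation step could look like, the example below fits a `RandomForestRegressor` on synthetic data and compares its test error with a predict-the-mean baseline. The data is randomly generated and has no relation to VOICED; only the model settings mirror the commented-out cell.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 5 numeric features, target driven by feature 0
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same settings as the commented-out cell above
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_tr, y_tr)

mse = mean_squared_error(y_te, model.predict(X_te))
# Baseline: always predict the training-set mean
baseline_mse = mean_squared_error(y_te, np.full_like(y_te, y_tr.mean()))
print(f'Test MSE: {mse:.3f} (mean-baseline MSE: {baseline_mse:.3f})')
```

Comparing against a trivial baseline gives the error figure some context, which matters on a dataset as small as 208 records.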

In [86]:
#!pip install wfdb

In [65]:
import wfdb

In [85]:
##record = wfdb.rdrecord('DataBase/voice002')

In [94]:
# Name of the WFDB record to load (without a file extension)
filename = 'voice001'

In [115]:
# Load the WFDB record; pn_dir tells wfdb to read it from the 'voiced' database on PhysioNet
record = wfdb.rdrecord(filename, pn_dir='voiced')

In [112]:
# Print record information
print(record.__dict__)

{'record_name': 'voice001', 'n_sig': 1, 'fs': 8000, 'counter_freq': None, 'base_counter': None, 'sig_len': 38080, 'base_time': None, 'base_date': None, 'comments': ['<age>: 32  <sex>: M <diagnoses>: hyperkinetic dysphonia <medications>: none'], 'sig_name': ['voice'], 'p_signal': array([[ 0.        ],
       [ 0.        ],
       [ 0.        ],
       ...,
       [ 0.00354004],
       [-0.03735352],
       [-0.02871704]]), 'd_signal': None, 'e_p_signal': None, 'e_d_signal': None, 'file_name': ['voice001.dat'], 'fmt': ['32'], 'samps_per_frame': [1], 'skew': [None], 'byte_offset': [None], 'adc_gain': [4079702243.3775], 'baseline': [-260023747], 'units': ['NU'], 'adc_res': [0], 'adc_zero': [0], 'init_value': [-260023747], 'checksum': [14973], 'block_size': [0]}


In [113]:
# Access signal data
signals = record.p_signal
print(signals)

[[ 0.        ]
 [ 0.        ]
 [ 0.        ]
 ...
 [ 0.00354004]
 [-0.03735352]
 [-0.02871704]]


In [98]:
# Access signal labels
signal_labels = record.sig_name
print(signal_labels)

['voice']
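
The record metadata printed above (`fs` of 8000 and `sig_len` of 38080) is enough to derive the recording's duration and a time axis for plotting. A minimal sketch, with both values hard-coded from that output rather than read from `record`:

```python
import numpy as np

fs = 8000        # sampling rate in Hz, as reported by record.fs
sig_len = 38080  # number of samples, as reported by record.sig_len

duration_s = sig_len / fs            # length of the recording in seconds
nyquist_hz = fs / 2                  # highest frequency the samples can represent
time_axis = np.arange(sig_len) / fs  # time stamp of every sample, in seconds

print(f'Duration: {duration_s:.2f} s, Nyquist limit: {nyquist_hz:.0f} Hz')
```

For `voice001` this gives a recording of 4.76 seconds.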


In [123]:
##wfdb.rdsamp(filename, sampfrom=0, sampto=None, channels=None, pn_dir=None, channel_names=None, warn_empty=False, return_res=64)