<a href="https://colab.research.google.com/github/nikorose87/TechChallengeSamay/blob/main/%5BNP%5D_Technical_assesment__Senior_ML_engineer_Part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://uploads-ssl.webflow.com/632f50e291252dcd1cf0c08b/63dc0565438858025d69245a_Samay-logo-gradient.png" width="30%">

# Comprehensive Technical Test:
## Lung Sound Analysis for Respiratory Health

# Part 2

---

The primary goal of this technical test is to design, develop, and optimize a machine learning solution for detecting and classifying *respiratory diseases* from lung sounds recorded with an electronic stethoscope. The delivered IPython notebook should demonstrate the *candidate's proficiency* in analyzing sensor data for health applications, specifically in the context of pulmonary diseases.

## Dataset description

Advances in stethoscope technology have made it possible to record high-quality lung sounds from both healthy individuals and those with various pulmonary conditions. This dataset contains audio recordings from patients with seven ailments (asthma, heart failure, pneumonia, bronchitis, pleural effusion, lung fibrosis, and COPD), alongside normal breathing sounds. Recordings were taken from multiple positions on the chest, as determined by a specialist physician. Each sound was recorded three times, each time with a different frequency filter, to highlight specific bodily sounds. The dataset supports the development of automated tools for diagnosing pulmonary diseases through lung sound analysis and can be extended to heart sound studies.

<img src="https://ars.els-cdn.com/content/image/1-s2.0-S2352340921001979-gr1.jpg" width="50%">

The dataset comprises audio recordings of lung sounds from 112 subjects, captured using an electronic stethoscope: 35 healthy individuals and 77 subjects with various respiratory diseases [1, 2, 3]. With three filtered versions per subject, this gives the 336 recordings loaded in this notebook.
- **Content:** The audio recordings have been filtered through Bell, Diaphragm, and Extended modes to ensure clarity and precision in sound quality.
- **Annotations:** Each audio file is annotated with comprehensive details including the type of lung sound, the disease diagnosis, recording location on the subject's chest, as well as the age and gender of the subjects. This information is crucial for the analysis and classification tasks.
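
The loading code in this notebook reads the diagnosis from each file name by splitting on commas and then on underscores. As a minimal sketch of that idea, a file name could be unpacked as shown here. The helper `parse_recording_name` and the example file name are hypothetical, and the meaning and order of the fields after the diagnosis are an assumption, so they are kept as an unlabeled list.

```python
import os

def parse_recording_name(filename):
    """Split a recording file name into its annotation fields.

    Assumed layout (hypothetical): <id>_<diagnosis>,<field>,<field>,...<ext>
    """
    # Drop any directory part and the file extension
    stem, _ = os.path.splitext(os.path.basename(filename))
    fields = [part.strip() for part in stem.split(',')]
    # The first field holds the recording id and the diagnosis
    recording_id, _, diagnosis = fields[0].partition('_')
    return {
        'recording_id': recording_id,
        'diagnosis': diagnosis.lower(),
        # Remaining fields are kept in file order; their meaning is assumed
        'extra': fields[1:],
    }

# Hypothetical file name following the assumed layout
example = parse_recording_name('BP1_Asthma,I E W,P L L,70,M.wav')
print(example)
```

Keeping the extra fields as a plain list avoids committing to a column order that should first be checked against the actual file names.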

In [None]:
# Data Manipulation and Analysis
import pandas as pd
import numpy as np

# File System and Path Handling
import os
import pathlib
import warnings
warnings.filterwarnings('ignore')

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Deep Learning (TensorFlow/Keras)
import tensorflow as tf
from tensorflow import keras

# Signal Processing
import scipy
import scipy.io.wavfile
from scipy import signal
from scipy.fft import fftshift
from scipy.io import wavfile

# Image Manipulation
import matplotlib.image as mpimg

# Miscellaneous
import ntpath
import random
from IPython import display
import time

# Audio Processing (Librosa)
import librosa
import librosa.display
import soundfile as sf

# Importing required libraries for handling HTTP requests and zip files
import requests  # Library for making HTTP requests
import zipfile   # Library for handling zip files
import io        # Library for handling binary data streams

---

# 1. Loading the dataset

In [None]:
# Obtain the current working directory
current_path = os.getcwd()

# URL of the ZIP file to be downloaded
url = "https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/jwyy9np4gv-3.zip"

# Download the ZIP file from the specified URL (time out if the server stalls)
response = requests.get(url, timeout=60)

# Check if the HTTP response status code is 200 (OK)
if response.status_code == 200:
    # Retrieve the content of the ZIP file
    zip_content = response.content

    # Extract the content of the ZIP file into a specific folder
    with zipfile.ZipFile(io.BytesIO(zip_content), 'r') as zip_ref:
        zip_ref.extractall("inputdata")  # Change this to the desired directory path
    print("File downloaded and extracted successfully.")
else:
    print(f"Error downloading the file (HTTP status {response.status_code}).")

File downloaded and extracted successfully.


In [None]:
def extract_nested_zip(zip_file_path, extraction_path):
    """
    Extract a zip file, including any nested zip files.
    Nested zip files are deleted after extraction; the top-level file is kept.
    """
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        # Extract all the contents into the directory
        zip_ref.extractall(extraction_path)

        # Iterate through each file in the extracted files
        for file in zip_ref.namelist():
            # Check if the file is a zip file
            if file.endswith('.zip'):
                # Construct full path to the nested zip file
                nested_zip_path = os.path.join(extraction_path, file)

                # Recursively extract the nested zip file
                extract_nested_zip(nested_zip_path, extraction_path)

                # Remove the nested zip file after extraction
                os.remove(nested_zip_path)

In [None]:
dataset_path = os.path.join(current_path, "inputdata")  # Change this to the extraction path of the main ZIP file

# Look for ZIP files in the extraction directory and extract any nested ZIP files
for file in os.listdir(dataset_path):
    if file.endswith('.zip'):
        zip_file_path = os.path.join(dataset_path, file)
        extract_nested_zip(zip_file_path, dataset_path)

# Define the path to the dataset directory
data_dir = pathlib.Path(dataset_path)

# List the files in the directory
commands = np.array(os.listdir(data_dir))

# Filter and remove 'README.md' from the list of files
commands = commands[commands != 'README.md']

# Split each element in the list by commas, and then by underscores
a = [line.split(',') for line in commands]
b = [x[0].split('_') for x in a]

# Extract the label only if the element has at least two parts
label = [c[1] for c in b if len(c) > 1]

# Convert labels to lowercase
label = [x.lower() for x in label]

# Consolidate combined diagnoses into a single primary label
# (iterate over every label instead of a hardcoded count of 336)
for i, name in enumerate(label):
    if name == 'asthma and lung fibrosis':
        label[i] = 'asthma'
    # The trailing space in the second name is kept as written in the original check
    elif name in ('heart failure + copd', 'heart failure + lung fibrosis '):
        label[i] = 'heart failure'

def return_unique_labels(labels):
    # Remove duplicates while preserving first-seen order
    return list(dict.fromkeys(labels))

labels = return_unique_labels(label)

# Initialize an empty array to store the full paths of WAV files
wav_files = []

# Iterate through the files in the directory
for file in os.listdir(dataset_path):
    # Check if the file is a WAV file
    if file.endswith(".wav"):
        # Add the full path of the file to the array
        wav_files.append(os.path.join(dataset_path, file))

# Initialize an array to store the data of each WAV file
wav_data = []

# Iterate through the WAV file paths in the wav_files array
for filepath in wav_files:
    # Read the WAV file using scipy.io.wavfile
    sample_rate, data = scipy.io.wavfile.read(filepath)

    # Add the data, sample rate, file path, and data type to the wav_data array
    wav_data.append({
        "file_path": filepath,
        "sample_rate": sample_rate,
        "data": data,
        "data_type": data.dtype
    })


# Ensure the number of labels matches the number of WAV files
assert len(label) == len(wav_data)

# Extracting only the 'data' from each WAV file
data_values = [item['data'] for item in wav_data]

# Creating a DataFrame with 'data' and 'label'
sound_df = pd.DataFrame({
    'data': data_values,  # Column 'data' containing waveform data arrays
    'label': label        # Column 'label' containing labels
})
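
Before modeling, it may help to check how many recordings fall into each class after consolidation, since class imbalance affects how results should be validated. The following is a minimal sketch using only the standard library. The list `example_labels` is made up for illustration and stands in for the notebook's `label` list.

```python
from collections import Counter

# Hypothetical labels standing in for the notebook's `label` list
example_labels = ['asthma', 'normal', 'copd', 'asthma',
                  'heart failure', 'normal', 'normal']

class_counts = Counter(example_labels)
total = sum(class_counts.values())

# Print classes from most to least frequent with their share of the data
for name, count in class_counts.most_common():
    print(f"{name:15s} {count:3d}  ({count / total:.0%})")
```

In the notebook itself, `Counter(label)` or `sound_df['label'].value_counts()` would give the same kind of summary for the real data.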

---

# Machine Learning Algorithm Design and Validation


In [None]:
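# Determine the maximum length of audio data in the dataset
max_length = max(len(x) for x in sound_df['data'])

# Pad the audio data
# Assumption: shorter recordings are zero-padded at the end up to max_length
padded_data = np.array([
    np.pad(x, (0, max_length - len(x)), mode='constant')
    for x in sound_df['data']
])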

In [None]:
print(padded_data.shape)

(336, 120000)
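
The dataset description states that every sound was recorded three times with different filters, so the 336 rows come from 112 subjects. One consideration for validation, sketched here under assumptions, is to split by subject rather than by recording so that the same subject never appears in both the training and the test data. The helper `split_by_group` and the subject IDs are hypothetical. How a subject ID is recovered from a file name is not shown in this notebook and would need to be checked against the data.

```python
import random

def split_by_group(groups, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no group spans both sets."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # Hold out at least one group for testing
    n_test = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# Hypothetical subject IDs: 10 subjects with three recordings each
subject_ids = [f"P{n}" for n in range(1, 11) for _ in range(3)]
train_idx, test_idx = split_by_group(subject_ids, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```

The returned index lists can be used to select rows from `padded_data` and the label list. This sketch does not stratify by diagnosis, which may also matter given the small number of subjects per class.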


---

- Add your code here