
<div style="text-align:center">

<div style="background-image: url('background.jpg'); background-size: cover; padding: 50px; color: white;">

# Acoustic Loggers 

© Sand Technologies

<img src="sand.JPEG" alt="Example Image" width="100" height="100">

### Team 1

</div>
</div>


### **Acoustic Loggers for Leak Detection**


Water distribution networks play a crucial role in ensuring clean and safe drinking water is delivered to consumers. However, leaks in the pipes of these networks lead to significant loss of water, posing challenges to water utilities. Not only are these leaks wasteful, but they also result in large fines for water utilities. 

To mitigate wastage, acoustic loggers have been attached to water pipes to record the sound profile in each pipe at night. These recordings can then be used to determine whether there is a leak present. 

The goal of this project is to produce a model that can classify each of these recordings as either 'leak' or 'no leak', aiding in the early detection and prevention of water loss in distribution networks. 

<img src="water.JPG" alt="Example Image" width="1700" height="400">

### **Problem Statement:**




Water leaks in pipes, though seemingly minor, pose a significant threat to communities worldwide.

On an environmental level, water leaks contribute to water scarcity by wasting a precious resource. Clean, treated water is a finite resource, and leaks deplete supplies needed for human consumption, sanitation, and agriculture.

Economically, leaky pipes lead to increased water bills for residents and lost revenue for water utilities. The constant need to repair and replace aging infrastructure strains budgets and diverts funds from other essential services.

Socially, water leaks can exacerbate existing inequalities. Communities with limited access to clean water are disproportionately affected by leaks, further jeopardizing their health and well-being. Additionally, leaks can lead to structural damage to buildings and roads, impacting infrastructure and safety.

Water utilities want to reduce the loss of water, minimise infrastructure damages due to leaks, improve operational efficiencies of their pipe networks, and enhance public health and safety from contamination of drinking water. 

Water utilities have received significant fines from wasted water and damages on properties, they also received some complaints regarding dirty water that comes out of taps and have a problem of efficiency in locating leaking pipes and getting them fixed.

Our data science and engineering team is pioneering a data-driven approach to water network leak detection and localization.  This initiative leverages acoustic loggers and cutting-edge machine learning algorithms to achieve prompt and accurate leak identification. The system we are developing offers a robust solution for minimizing water loss within distribution networks.





<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Libraries</a>

<a href=#two>2. Loading the data</a>

<a href=#four>4. Exploratory Data Analysis</a>

<a href=#five>5. Data Cleaning & Preprocessing

<a href=#six>6. Modelling</a>

<a href=#seven>7. Model Evaluation</a>

<a href=#ten>10. Conclusion</a>

## <div style="text-align: center;"><u/> **Let's Get Started!!!**.</u></div>

 <a id="two"></a>
## **1. Importing Libraries**
<a href=#cont>Back to Table of Contents</a>

---


In [5]:
!pip install requests
!pip install librosa
import requests
import zipfile
import os
import pandas as pd
import os
import io
import matplotlib.pyplot as plt
import numpy as np
import librosa
import librosa.display
import IPython.display as ipd
import boto3
from io import BytesIO
import datetime
from IPython.display import Audio
import librosa

!pip install librosa
import concurrent.futures




[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip

[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip





[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


 <a id="three"></a>
## **2. Downloading Data**
<a href=#cont>Back to Table of Contents</a>

---


#### **3.1 About the dataset**

**<u/>Dataset Description</u>**

**Data Overview**.

The dataset comprises unstructured audio files spanning from 2017 to 2022, accompanied by corresponding metadata in an Excel spreadsheet. These audio files capture the sound of water flow within pipes, distinguishing between those with leaks and those without. Themas Waters has graciously provided us access to this data via an API..

**Source**.

The Excel spreadsheet contains a comprehensive set of columns or features, serving as metadata for each audio file as follows:


- **datetime:** This column represents the date and time when the audio recording was captured.

- **siteid:** The site ID is a unique identifier assigned to each location where the recording took place. It helps in tracking the geographical location associated with the audio data.

- **recording_id:** The recording ID is a unique identifier assigned to each audio recording. It distinguishes one recording from another and aids in organizing and referencing the audio files.

- **file_name:** The file name column contains the name of the audio file. It helps in identifying and accessing the corresponding audio recording file.

- **postcodedistrict:** This column contains the postal code district associated with the location where the recording was made. It provides additional geographical context to the data.

- **dmacode:** The DMA (Distribution Management Area) code is a unique identifier used in water management systems. It helps in categorizing and managing water distribution networks.

- **leak found:** This column indicates whether a leak was detected in the corresponding audio recording. It serves as a binary flag, where "leak found" signifies the presence of a leak, and "no leak found" indicates the absence of a leak.

- **noise:** The noise column represents the characteristics of the recorded sound, providing insights into the acoustic properties of the audio data.

- **spread:** The spread column refer to the spread or distribution of sound frequencies within the audio recording. It could provide information about the variability or range of sound frequencies captured in the recording.

#### **3.2 Downloading files from the API**

- API Credentials

In [None]:
clientID = 'c70b57fc939d4c4eb3b32bc256fe451f'
clientSecret = '515600b3BB9547A580760B29007c6E73'

- modify this url as desired to access the different end points. e.g. 

In [None]:
# Replace DischargeCurrentStatus at the end of the resource URL
api_root = 'https://prod-tw-opendata-app.uk-e1.cloudhub.io'
api_resource = '/data/AcousticLogger/v1/SoundFiles'
url = api_root + api_resource
params = 'data filters' # Parameter

- Reqesting Data from the URL

In [None]:
r = requests.get(url, headers={'client_id':clientID, 'client_secret': clientSecret}, params=params)
print("Requesting from " + r.url)

- Checking request status to validate the request.

In [None]:
if r.status_code == 200:
    response = r.json()
    df = pd.json_normalize(response, 'items')
else:
    raise Exception("Request failed with status code {0}, and error message: {1}".format(r.status_code, r.json()))

- We have retrieved the data Lets take what we want from this data.

In [None]:
print(df.tail())
a = df.loc[0, 'FileURL']
a
response = requests.get(a)

save_path = r'C:\Users\Percy\OneDrive\Desktop\Acoustic\0404'

if response.status_code == 200:
    with io.BytesIO(response.content) as zip_data:
        
        with zipfile.ZipFile(zip_data, 'r') as zip_ref:
            zip_ref.extractall(os.path.abspath(r'C:\Users\Percy\OneDrive\Desktop\Acoustic\0404'))  # Specify the destination folder
                
    print("Zip file downloaded successfully.")
else:
    print(f"Failed to download zip file. Status code: {response.status_code}")

<a id="four"></a>
## **4. Connecting Audio with Metadata**
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    

#### **4.1 Retrieving metadata.**

4.1.1 <u/>Lets take the metadeta from the s3 bucket into the notebook</u>

In [None]:
s3 = boto3.client('s3')
bucket_name = '2307-01-acoustic-loggers-for-leak-detection-a'
object_key = 'Metadata/labelled_acoustic_logger_leaks.xlsx'

response = s3.get_object(Bucket=bucket_name, Key=object_key)
excel_data = response['Body'].read()

metadata_df = pd.read_excel(BytesIO(excel_data))
print('Metadata Loaded Successfully into the notebook')

4.1.2 <u/>Lets Observe the first 5 raws of the metadata</u>

In [None]:
metadata_df.head()

In [None]:
metadata_df.loc[0,'file']

#### **4.2 Retrieving Audio.**

4.2.1 <u/>Adding Feature in metadata table that will help us retrieve the audios</u>

In [None]:
metadata_df['datetime'] = pd.to_datetime(metadata_df['datetime'])

In [None]:
print(metadata_df['datetime'].dtypes)

In [None]:
metadata_df['year'] = metadata_df['datetime'].dt.year
metadata_df['month'] = metadata_df['datetime'].dt.month.apply(lambda x: '{:02d}'.format(x))
metadata_df['day'] = metadata_df['datetime'].dt.day.apply(lambda x: '{:02d}'.format(x))

4.2.2 <u/>Retrieving list of Audio URL and audio Keys</u>

In [None]:
url = []
key_ = []
for i in range(len(metadata_df)):
    year = metadata_df.loc[i, 'year']
    month = metadata_df.loc[i, 'month']
    day = metadata_df.loc[i, 'day']
    sideid = metadata_df.loc[i, 'siteid']
    rec_id =metadata_df.loc[i, 'recording_id']
    file = metadata_df.loc[i,'file']
    value = file[62:-4]
    
    
    aws_access_key_id = 'AKIATNJHRXAPQBHVQARV'
    aws_secret_access_key = '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'
    region_name = 'eu-west-1'

    # Initialize S3 client
    s3 = boto3.client('s3', 
                     aws_access_key_id=aws_access_key_id, 
                     aws_secret_access_key=aws_secret_access_key, 
                     region_name=region_name)

    # S3 bucket and object details
    bucket_name = '2307-01-acoustic-loggers-for-leak-detection-a'
    key = f'Unstructured audio files/{year}/{month}/{day}/recordings_{sideid}_{rec_id}_{year}{month}{day}_{value}.wav'  # Path to the audio file in the S3 bucket

    key_url = f'Unstructured+audio+files/{year}/{month}/{day}/recordings_{sideid}_{rec_id}_{year}{month}{day}_{value}.wav'
    # Generate a URL to the audio file
    link = f"https://{bucket_name}.s3.{region_name}.amazonaws.com/{key_url}"
    link = [link]
    url = url + link
    key = [key]
    key_ = key_ + key

In [None]:
key

4.2.3 <u/>Connecting Metadata with corresponding Audio links</u>

In [None]:
metadata_df['Audio_Links'] = url
metadata_df['Audio_key'] = key_

4.2.4 <u/>Observing the first 5 raws of the connected metadata with its audio data</u>

In [None]:
metadata_df.head()

4.2.4 <u/>Downloading The connected Data into my Device</u>

In [None]:
metadata_df.to_excel("Connected_data.xlsx", index=False)

<a id="four"></a>
## **6. Exploratory Data Analysis**
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---

#### **4.1 Audio Sound.**

6.1.1 <u/>Sound With leak</u>

In [None]:
Leaking = metadata_df[metadata_df['leak_found'] == 'Yes']
Leaking.tail()


In [None]:
Leaking.loc[38329, 'Audio_key']

In [None]:
# S3 bucket name and audio file path
bucket_name = '2307-01-acoustic-loggers-for-leak-detection-a'
audio_file_key_l= Leaking.loc[38326, 'Audio_key']  # Path to the audio file in the S3 bucket

# Download the audio file from S3
response = s3.get_object(Bucket=bucket_name, Key=audio_file_key)
audio_data_Leak = response['Body'].read()

# Display the audio file using the IPython Audio widget
Audio(audio_data_Leak)

6.1.2 <u/>Sound With No leak</u>

In [None]:
Not_Leaking = metadata_df[metadata_df['leak_found'] == 'No']
Not_Leaking.head(2)

In [None]:
# S3 bucket name and audio file path
bucket_name = '2307-01-acoustic-loggers-for-leak-detection-a'
audio_file_key_n = Not_Leaking.loc[4, 'Audio_key']  # Path to the audio file in the S3 bucket

# Download the audio file from S3
response = s3.get_object(Bucket=bucket_name, Key=audio_file_key)
audio_data_No_L = response['Body'].read()

# Display the audio file using the IPython Audio widget
Audio(audio_data_No_L)

#### **4.2 Ex.**


#### **4.3 Amplitude Envelope.**

6.1.1 <u/>Amplitude with time signal</u>

In [None]:
# Initialize the S3 client
s3 = boto3.client('s3')

# Define your S3 bucket name and audio file key
bucket_name = '2307-01-acoustic-loggers-for-leak-detection-a'

# Create a file-like object to store the audio data
audio_data = io.BytesIO()

# Stream the audio file from S3 into the file-like object
s3.download_fileobj(bucket_name, audio_file_key_l, audio_data)

# Reset the file-like object's position to the beginning
audio_data.seek(0)

# Load the audio data using Librosa
y, sr = librosa.load(audio_data)

In [None]:
# Initialize the S3 client
s3 = boto3.client('s3')

# Define your S3 bucket name and audio file key
bucket_name = '2307-01-acoustic-loggers-for-leak-detection-a'

# Create a file-like object to store the audio data
audio_data = io.BytesIO()

# Stream the audio file from S3 into the file-like object
s3.download_fileobj(bucket_name, audio_file_key_n, audio_data)

# Reset the file-like object's position to the beginning
audio_data.seek(0)

# Load the audio data using Librosa
z, srn = librosa.load(audio_data)

In [None]:
# Extracting Mel-frequency cepstral coefficients (MFCCs)
mfccs = librosa.feature.mfcc(y=y, sr=sr)

# Computing the spectral centroid
spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

# Computing the chromagram
chromagram = librosa.feature.chroma_stft(y=y, sr=sr)

In [None]:
# Extracting Mel-frequency cepstral coefficients (MFCCs)
mfccs = librosa.feature.mfcc(y=z, sr=srn)

# Computing the spectral centroid
spectral_centroid = librosa.feature.spectral_centroid(y=z, sr=srn)

# Computing the chromagram
chromagram = librosa.feature.chroma_stft(y=z, sr=srn)

In [None]:
import matplotlib.pyplot as plt

# Visualizing the MFCCs
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfccs, x_axis='time')
plt.colorbar()
plt.title('MFCC')
plt.tight_layout()

# Visualizing the chromagram
plt.figure(figsize=(10, 4))
librosa.display.specshow(chromagram, x_axis='time')
plt.colorbar()
plt.title('Chromagram')
plt.tight_layout()

plt.show()

In [None]:
plt.figure(figsize=(15, 17))

time = librosa.times_like(y)

plt.figure(figsize=(10, 4))
plt.plot(time, y)
plt.title('Leaking Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.tight_layout()

time = librosa.times_like(z)

plt.figure(figsize=(10, 4))
plt.plot(time, z)
plt.title('No Leak Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.tight_layout()

plt.show()  


#### **4.3 Fourier Transform.**

6.1.1 <u/>Deriving frequency</u>

In [None]:
import scipy as sp

In [None]:
# derive spectrum using FT
ft = sp.fft.fft(y)
magnitude = np.absolute(ft)
frequency = np.linspace(0, sr, len(magnitude)) 

ft = sp.fft.fft(z)
magnituden = np.absolute(ft)
frequencyn = np.linspace(0, srn, len(magnitude)) 

In [None]:
# plot spectrum
plt.figure(figsize=(18, 5))
plt.plot(frequency[:5000], magnitude[:5000]) # magnitude spectrum
plt.title("Leaking")
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude")

plt.figure(figsize=(18, 5))
plt.plot(frequencyn[:5000], magnituden[:5000]) # magnitude spectrum
plt.title("No Leak")
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude")
plt.show()

4.1.2 <u/>Missing values in Train dataset </u>

In [None]:
# zomm in to the waveform
samples = range(len(y))
t = librosa.samples_to_time(samples, sr=sr)

plt.figure(figsize=(18, 5))
plt.plot(t[10000:10400], y[10000:10400]) 
plt.title("Leaking")
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()

samplesn = range(len(z))
t = librosa.samples_to_time(samples, sr=srn)

plt.figure(figsize=(18, 5))
plt.plot(t[10000:10400], z[10000:10400])
plt.title("No Leaking") 
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()

In [None]:
# compare signal and sinusoids
samples = range(len(y))
t = librosa.samples_to_time(samples, sr=sr)

f = 55
phase = 30

sin = 0.1 * np.sin(2*np.pi * (f * t - phase))

plt.figure(figsize=(18, 5))
plt.plot(t[10000:10400], y[10000:10400]) 
plt.plot(t[10000:10400], sin[10000:10400], color="r")

plt.fill_between(t[10000:10400], sin[10000:10400]*y[10000:10400], color="y")

plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()

In [None]:
# superimposing pure tones
f = 1
t = np.linspace(0, 10, 10000)

sin = np.sin(2*np.pi * (f * t))
sin2 = np.sin(2*np.pi * (2*f * t))
sin3 = np.sin(2*np.pi * (3*f * t))

sum_signal = sin + sin2 + sin3

plt.figure(figsize=(15, 10))

plt.subplot(4, 1, 1)
plt.plot(t, sum_signal, color="r")

plt.subplot(4, 1, 2)
plt.plot(t, sin)

plt.subplot(4, 1, 3)
plt.plot(t, sin2)

plt.subplot(4, 1, 4)
plt.plot(t, sin3)

plt.show()

**Comment:** Now we can see that they are no missing values within each column, from this dataframe lets see the length of this dataframe, and observe the timber of distinct userid and movieid.

#### **4.3 Spectrograms.**

6.1.1 <u/>Spectrograms Visualization</u>

In [None]:
def plot_spectrogram(Y, sr, hop_length, y_axis="linear"):
    plt.figure(figsize=(18, 5))
    librosa.display.specshow(Y, 
                             sr=sr, 
                             hop_length=hop_length, 
                             x_axis="time", 
                             y_axis=y_axis)
    plt.colorbar(format="%+2.f")

In [None]:
FRAME_SIZE = 2048
HOP_SIZE = 512
# Extracting short time fourier transform
leak = librosa.stft(y, n_fft=FRAME_SIZE, hop_length=HOP_SIZE)
nleak = librosa.stft(z , n_fft=FRAME_SIZE, hop_length=HOP_SIZE)

In [None]:
# Calculating the spectrogram

leak = np.abs(leak) ** 2
nleak = np.abs(nleak) ** 2

In [None]:
plot_spectrogram(leak, sr, HOP_SIZE)
plot_spectrogram(nleak, srn, HOP_SIZE)

4.1.3 <u/>log Aplitude spectrogram </u>

In [None]:
Y_log_scale = librosa.power_to_db(leak)
z_log_scale = librosa.power_to_db(nleak)
plot_spectrogram(Y_log_scale, sr, HOP_SIZE)
plot_spectrogram(z_log_scale, srn, HOP_SIZE)

4.1.4 <u/>Log Frequency Spectrogram </u>

In [None]:
plot_spectrogram(Y_log_scale, sr, HOP_SIZE, y_axis="log")
plot_spectrogram(z_log_scale, srn, HOP_SIZE, y_axis="log")

#### **4.3 Mel Spectrograms.**

6.1.1 <u/>Mel filter banks</u>

In [None]:
mel_spectrogram_leak = librosa.feature.melspectrogram(y, sr=sr)
mel_spectrogram_nleak = librosa.feature.melspectrogram(z, sr=srn, n_fft=2048, hop_length=512, n_mels=10)

<a id="five"></a>
## **5. Data Processing**
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---


In [53]:
#Loading the data into the notebook
s3 = boto3.client('s3')
bucket_name = '2307-01-acoustic-loggers-for-leak-detection-a'
object_key = 'Metadata_Audio_Connected/Connected_data.xlsx'

response = s3.get_object(Bucket=bucket_name, Key=object_key)
excel_data = response['Body'].read()

df = pd.read_excel(BytesIO(excel_data))
print('Connected Loaded Successfully into the notebook')

Connected Loaded Successfully into the notebook


In [54]:
df.head()

Unnamed: 0,datetime,siteid,recording_id,file,postcodedistrict,dmacode,leak_found,noise,spread,repaired_as,year,month,day,Audio_Links,Audio_key
0,2018-12-12 04:00:00,1668120,26577010,../recordings/2018/12/12/recordings_1668120_26...,NW10,ZSUHIL25,No,15.0,4.0,,2018,12,12,https://2307-01-acoustic-loggers-for-leak-dete...,Unstructured audio files/2018/12/12/recordings...
1,2018-12-12 04:00:00,1742872,26592074,../recordings/2018/12/12/recordings_1742872_26...,E15,ZWOODF113,No,22.0,9.0,,2018,12,12,https://2307-01-acoustic-loggers-for-leak-dete...,Unstructured audio files/2018/12/12/recordings...
2,2018-12-12 04:00:00,1616760,26593071,../recordings/2018/12/12/recordings_1616760_26...,HP12,ZWIDDN02,No,21.0,7.0,,2018,12,12,https://2307-01-acoustic-loggers-for-leak-dete...,Unstructured audio files/2018/12/12/recordings...
3,2018-12-12 04:00:00,1630929,26593758,../recordings/2018/12/12/recordings_1630929_26...,SL1,ZSTKWD30,No,14.0,5.0,,2018,12,12,https://2307-01-acoustic-loggers-for-leak-dete...,Unstructured audio files/2018/12/12/recordings...
4,2018-12-12 04:00:00,6896951,26596303,../recordings/2018/12/12/recordings_6896951_26...,SL7,ZMARLC01,No,10.0,3.0,,2018,12,12,https://2307-01-acoustic-loggers-for-leak-dete...,Unstructured audio files/2018/12/12/recordings...


In [None]:
# Constant Values
FRAME_SIZE = 2048
HOP_SIZE = 572

In [55]:
df_modeling = df[['recording_id', 'leak_found']]
df_modeling.head()

Unnamed: 0,recording_id,leak_found
0,26577010,No
1,26592074,No
2,26593071,No
3,26593758,No
4,26596303,No


#### **5.1 Time Domain Features.**
5.1.1 <u/>Amplitude Envelope </u>

In [56]:
def amplitude_envelope(signal, frame_size, hop_length):
    """Calculate the amplitude envelope of a signal with a given frame size nad hop length."""
    amplitude_envelope = []
    
    # calculate amplitude envelope for each frame
    for i in range(0, len(signal), frame_size): 
        amplitude_envelope_current_frame = max(signal[i:i+frame_size]) 
        amplitude_envelope.append(amplitude_envelope_current_frame)
    
    return amplitude_envelope  

In [8]:
# Initialize S3 client
s3 = boto3.client('s3')

def process_audio(i):
    try:
        bucket_name = '2307-01-acoustic-loggers-for-leak-detection-a'
        audio_file_key = df.loc[i, 'Audio_key']

        # Download audio file from S3
        response = s3.get_object(Bucket=bucket_name, Key=audio_file_key)
        audio_data = response['Body'].read()

        # Display audio file
        Audio(audio_data)

        # Create file-like object
        audio_data_io = io.BytesIO()

        # Stream audio file from S3
        s3.download_fileobj(bucket_name, audio_file_key, audio_data_io)
        audio_data_io.seek(0)

        # Load audio data
        signal, sr = librosa.load(audio_data_io)

        # Compute amplitude envelope
        amplitude_e = amplitude_envelope(signal, FRAME_SIZE, HOP_SIZE)

        # Create a DataFrame with the extracted features
        df_features = pd.DataFrame(amplitude_e, columns=['Amplitude_Env'])

        return df_features

    except Exception as e:
        print(f"Error processing audio {i}: {e}")
        return None

if __name__ == "__main__":
    # Initialize S3 client
    s3 = boto3.client('s3')

    # Define constants
    FRAME_SIZE = 1024
    HOP_SIZE = 512

    # Initialize list to store DataFrames
    df_list = []

    # Process audio files in parallel
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(process_audio, i) for i in range(len(df))]
        for future in concurrent.futures.as_completed(futures):
            result = future.result()
            df_list.append(result)

    # Concatenate DataFrames into a single DataFrame
    df_features = pd.concat(df_list, ignore_index=True)

Error processing audio 86: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Error processing audio 87: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Error processing audio 89: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Error processing audio 88: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Error processing audio 90: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Error processing audio 91: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Error processing audio 92: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Error processing audio 93: An error occurred (NoSuchKey) when calling the GetObject operation: Th

5.1.1 <u/>Root Mean Square Energy </u>

In [57]:
def rmse(signal, frame_size, hop_length):
    rmse = []
    
    # calculate rmse for each frame
    for i in range(0, len(signal), hop_length): 
        rmse_current_frame = np.sqrt(sum(signal[i:i+frame_size]**2) / frame_size)
        rmse.append(rmse_current_frame)
    return rmse

In [None]:
# Define your amplitude_envelope function here if not already defined
#s3 = boto3.client('s3')
def process_audio(i):
    try:
        bucket_name = '2307-01-acoustic-loggers-for-leak-detection-a'
        audio_file_key = df.loc[i, 'Audio_key']

        # Download audio file from S3
        response = s3.get_object(Bucket=bucket_name, Key=audio_file_key)
        audio_data = response['Body'].read()

        # Display audio file
        Audio(audio_data)

        # Initialize S3 client
        

        # Create file-like object
        audio_data_io = io.BytesIO()

        # Stream audio file from S3
        s3.download_fileobj(bucket_name, audio_file_key, audio_data_io)
        audio_data_io.seek(0)

        # Load audio data
        signal, sr = librosa.load(audio_data_io)

        # Compute amplitude envelope
        rms = rmse(signal, FRAME_SIZE, HOP_SIZE)

        return rms

    except Exception as e:
        print(f"Error processing audio {i}: {e}")
        return None

if __name__ == "__main__":
    # Initialize S3 client
    s3 = boto3.client('s3')

    # Define constants
    FRAME_SIZE = 1024
    HOP_SIZE = 512

    # Initialize list to store results
    RMS_Energy = []

    # Process audio files in parallel
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(process_audio, i) for i in range(len(df))]
        for future in concurrent.futures.as_completed(futures):
            result = future.result()
            RMS_Energy.append(result)

    print(RMS_Energy)

4.1.3 <u/>Zero Crossing Rate </u>

In [59]:

def zcr(signal, frame_size, hop_lenth):
    zcr_signal = librosa.feature.zero_crossing_rate(signal, frame_length=FRAME_SIZE, hop_length=hop_lenth)[0]
    
    return zcr_signal


#### **5.1 Frequency Domain Features.**
5.1.1 <u/>Band Energy Ratio </u>

In [None]:
def calculate_split_frequency_bin(split_frequency, sample_rate, num_frequency_bins):
    """Infer the frequency bin associated to a given split frequency."""
    
    frequency_range = sample_rate / 2
    frequency_delta_per_bin = frequency_range / num_frequency_bins
    split_frequency_bin = math.floor(split_frequency / frequency_delta_per_bin)
    return int(split_frequency_bin)

In [None]:
for i in range(len(metadata_df)):
    

5.1.2 <u/>Spectral Centroid </u>

5.1.1 <u/>Bandwidth </u>

**Comment:** From the above results we can see that they no duplicated values within our dataset.

In [45]:
df_modeling = df_modeling.head(10)

#### **6.2 Creating Features.**
6.2.1 <u/> Feature Creator </u>

In [60]:
# Initialize S3 client
s3 = boto3.client('s3')

def process_audio(i):
    try:
        bucket_name = '2307-01-acoustic-loggers-for-leak-detection-a'
        audio_file_key = df.loc[i, 'Audio_key']

        # Download audio file from S3
        response = s3.get_object(Bucket=bucket_name, Key=audio_file_key)
        audio_data = response['Body'].read()

        # Display audio file
        Audio(audio_data)

        # Create file-like object
        audio_data_io = io.BytesIO()

        # Stream audio file from S3
        s3.download_fileobj(bucket_name, audio_file_key, audio_data_io)
        audio_data_io.seek(0)

        # Load audio data
        signal, sr = librosa.load(audio_data_io)

        # Compute amplitude envelope
        amplitude_e = amplitude_envelope(signal, FRAME_SIZE, HOP_SIZE)

        return amplitude_e

    except Exception as e:
        print(f"Error processing audio {i}: {e}")
        return None


def process_audio_r(i):
    try:
        bucket_name = '2307-01-acoustic-loggers-for-leak-detection-a'
        audio_file_key = df.loc[i, 'Audio_key']

        # Download audio file from S3
        response = s3.get_object(Bucket=bucket_name, Key=audio_file_key)
        audio_data = response['Body'].read()

        # Display audio file
        Audio(audio_data)

        # Create file-like object
        audio_data_io = io.BytesIO()

        # Stream audio file from S3
        s3.download_fileobj(bucket_name, audio_file_key, audio_data_io)
        audio_data_io.seek(0)

        # Load audio data
        signal, sr = librosa.load(audio_data_io)

        # Compute amplitude envelope
        rms_energy = rmse(signal, FRAME_SIZE, HOP_SIZE)
    
        return rms_energy

    except Exception as e:
        print(f"Error processing audio {i}: {e}")
        return None

def process_audio_z(i):
    try:
        bucket_name = '2307-01-acoustic-loggers-for-leak-detection-a'
        audio_file_key = df.loc[i, 'Audio_key']

        # Download audio file from S3
        response = s3.get_object(Bucket=bucket_name, Key=audio_file_key)
        audio_data = response['Body'].read()

        # Display audio file
        Audio(audio_data)

        # Create file-like object
        audio_data_io = io.BytesIO()

        # Stream audio file from S3
        s3.download_fileobj(bucket_name, audio_file_key, audio_data_io)
        audio_data_io.seek(0)

        # Load audio data
        signal, sr = librosa.load(audio_data_io)

        # Compute amplitude envelope
        zcr_0 = zcr(signal, FRAME_SIZE, HOP_SIZE)

        # Create a DataFrame with the extracted features

        return zcr_0

    except Exception as e:
        print(f"Error processing audio {i}: {e}")
        return None


if __name__ == "__main__":
    # Initialize S3 client
    s3 = boto3.client('s3')

    # Define constants
    FRAME_SIZE = 1024
    HOP_SIZE = 512

    # Initialize lists to store DataFrames
    df_list = []
    df_list_r = []
    df_list_z = []

    # Process audio files in parallel
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(process_audio, i) for i in range(len(df))]
        futures_r = [executor.submit(process_audio_r, i) for i in range(len(df))]
        futures_e = [executor.submit(process_audio_z, i) for i in range(len(df))]

        # Collect results from futures
        for future in concurrent.futures.as_completed(futures):
            result = future.result()
            df_list.append(result)
        
        for future in concurrent.futures.as_completed(futures_r):
            result = future.result()
            df_list_r.append(result)
        
        for future in concurrent.futures.as_completed(futures_e):
            result = future.result()
            df_list_e.append(result)

    # Concatenate DataFrames into a single DataFrame
    df_modeling['Amplitude_Envelope'] = df_list
    df_modeling['RMS_Energy'] = df_list_r 
    df_modeling['Zero_Crossing'] = df_list_e

Error processing audio 86: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Error processing audio 87: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Error processing audio 88: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Error processing audio 89: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Error processing audio 90: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Error processing audio 91: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Error processing audio 92: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Error processing audio 93: An error occurred (NoSuchKey) when calling the GetObject operation: Th

**Comment:** This function will remove character '|' when applied to a dataset with a name movies and have a column called genres. 

In [38]:
df_modeling.head()

Unnamed: 0,recording_id,leak_found
0,26577010,No
1,26592074,No
2,26593071,No
3,26593758,No
4,26596303,No
...,...,...
38325,129718949,Yes
38326,74345861,Yes
38327,76414165,Yes
38328,143471783,Yes


<a id="six"></a>
## **6. Modelling**
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---


In [None]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assuming X_train and y_train are your feature matrix and corresponding labels
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize SVM classifier
svm_classifier = SVC(kernel='linear', C=1.0)

# Train the SVM classifier
svm_classifier.fit(X_train_scaled, y_train)

# Predict on the testing set
y_pred = svm_classifier.predict(X_test_scaled)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("SVM Accuracy:", accuracy)

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assuming X_train and y_train are your feature matrix and corresponding labels
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize neural network model
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train_scaled, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the model on the testing set
_, accuracy = model.evaluate(X_test_scaled, y_test)
print("Neural Network Accuracy:", accuracy)

## **Authors**

| Name | Surname | Position |
| :----------- | :------------: | ------------: |
| Percy  | Mmutle       | None       |
|  Lesego  | 88888888      | Project Maneger       |
| Aphiwe | 888888      | None   |
| Tonia | 88888 | None|
|Ntsako| 888888 | None |
| Tumi | 888888 | None |
| Victoria | 888888 | Team Lead |
| Ndivho | 888888 | None |