
<div style="text-align:center">

<div style="background-image: url('background.jpg'); background-size: cover; padding: 50px; color: white;">

# Acoustic Loggers 

© Sand Technologies

<img src="sand.JPEG" alt="Example Image" width="100" height="100">

### Team 1

</div>
</div>


### **Acoustic Loggers for Leak Detection**


Water distribution networks play a crucial role in ensuring clean and safe drinking water is delivered to consumers. However, leaks in the pipes of these networks lead to significant loss of water, posing challenges to water utilities. Not only are these leaks wasteful, but they also result in large fines for water utilities. 

To mitigate wastage, acoustic loggers have been attached to water pipes to record the sound profile in each pipe at night. These recordings can then be used to determine whether there is a leak present. 

The goal of this project is to produce a model that can classify each of these recordings as either 'leak' or 'no leak', aiding in the early detection and prevention of water loss in distribution networks. 

<img src="water.JPG" alt="Example Image" width="1700" height="400">

### **Problem Statement:**




<a id="cont"></a>

## Table of Contents

<a href=#two>1. Importing Libraries</a>

<a href=#three>2. Downloading Data</a>

<a href=#four>4. Exploratory Data Analysis</a>

<a href=#five>5. Data Cleaning & Preprocessing</a>

<a href=#six>6. Modelling</a>

<a href=#seven>7. Model Evaluation</a>

<a href=#ten>10. Conclusion</a>

## <div style="text-align: center;"><u>**Let's Get Started!**</u></div>

 <a id="two"></a>
## **1. Importing Libraries**
<a href=#cont>Back to Table of Contents</a>

---


In [3]:
!pip install requests
import io
import os
import zipfile

import pandas as pd
import requests






 <a id="three"></a>
## **2. Downloading Data**
<a href=#cont>Back to Table of Contents</a>

---


#### **2.1 About the dataset**

**<u>Dataset Description</u>**

**Data Overview**

The dataset comprises unstructured audio files recorded between 2017 and 2022, accompanied by metadata in an Excel spreadsheet. The audio files capture the sound of water flowing through pipes, some with leaks and some without. Thames Water has kindly provided access to this data via an API.

**Source**

The Excel spreadsheet contains a comprehensive set of columns or features, serving as metadata for each audio file as follows:


- **datetime:** This column represents the date and time when the audio recording was captured.

- **siteid:** The site ID is a unique identifier assigned to each location where the recording took place. It helps in tracking the geographical location associated with the audio data.

- **recording_id:** The recording ID is a unique identifier assigned to each audio recording. It distinguishes one recording from another and aids in organizing and referencing the audio files.

- **file_name:** The file name column contains the name of the audio file. It helps in identifying and accessing the corresponding audio recording file.

- **postcodedistrict:** This column contains the postal code district associated with the location where the recording was made. It provides additional geographical context to the data.

- **dmacode:** The DMA (District Metered Area) code is a unique identifier for a discrete, metered zone of the water distribution network. It helps in categorizing and managing water distribution networks.

- **leak found:** This column indicates whether a leak was detected in the corresponding audio recording. It serves as a binary flag, where "leak found" signifies the presence of a leak, and "no leak found" indicates the absence of a leak.

- **noise:** The noise column represents the characteristics of the recorded sound, providing insights into the acoustic properties of the audio data.

- **spread:** The spread column refers to the spread or distribution of sound frequencies within the audio recording. It could provide information about the variability or range of sound frequencies captured in the recording.
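As a minimal sketch of how this metadata could feed a classifier, the snippet below builds a tiny, made-up table with a few of the columns listed above and converts the `leak found` text flag into a 0/1 label. The sample rows, the `.wav` file names, the `label` column and the `encode_leak_label` helper are illustrative assumptions, not part of the provided dataset.

```python
import pandas as pd

# Made-up rows that mimic a few of the metadata columns described above.
metadata = pd.DataFrame(
    {
        "recording_id": [101, 102, 103],
        "file_name": ["a.wav", "b.wav", "c.wav"],
        "leak found": ["leak found", "no leak found", "leak found"],
    }
)


def encode_leak_label(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a binary `label` column (1 = leak, 0 = no leak)."""
    out = frame.copy()
    # Normalise whitespace and case before comparing the text flag.
    flag = out["leak found"].str.strip().str.lower()
    out["label"] = (flag == "leak found").astype(int)
    return out


labelled = encode_leak_label(metadata)
print(labelled[["recording_id", "label"]])
```

Working on a copy keeps the original metadata untouched, so the text flag stays available for inspection.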

#### **2.2 Downloading files from the API**

- API Credentials

In [None]:
clientID = 'c70b57fc939d4c4eb3b32bc256fe451f'
clientSecret = '515600b3BB9547A580760B29007c6E73'
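Hard-coded credentials are easy to leak when a notebook is shared. One common alternative, sketched here, is to read them from environment variables. The variable names `TW_CLIENT_ID` and `TW_CLIENT_SECRET` and the `load_credentials` helper are hypothetical.

```python
import os


def load_credentials(env=os.environ):
    """Read the API credentials from a mapping of environment variables.

    Raises KeyError (naming the missing variable) if one is not set.
    """
    # Hypothetical variable names; choose whatever suits your setup.
    client_id = env["TW_CLIENT_ID"]
    client_secret = env["TW_CLIENT_SECRET"]
    return client_id, client_secret


# Stand-in mapping so the sketch runs without real secrets.
demo_env = {"TW_CLIENT_ID": "my-id", "TW_CLIENT_SECRET": "my-secret"}
demo_id, demo_secret = load_credentials(demo_env)
print(demo_id)
```

In real use, `load_credentials()` would be called with no argument so that it reads from the process environment.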

- Modify this URL as needed to access a different endpoint, for example:

In [None]:
# Replace the resource path to target a different endpoint
api_root = 'https://prod-tw-opendata-app.uk-e1.cloudhub.io'
api_resource = '/data/AcousticLogger/v1/SoundFiles'
url = api_root + api_resource
params = 'data filters'  # Placeholder query string; replace with the filters you need

- Requesting data from the URL

In [None]:
r = requests.get(url, headers={'client_id':clientID, 'client_secret': clientSecret}, params=params)
print("Requesting from " + r.url)

- Checking request status to validate the request.

In [None]:
if r.status_code == 200:
    response = r.json()
    df = pd.json_normalize(response, 'items')
else:
    raise Exception("Request failed with status code {0}, and error message: {1}".format(r.status_code, r.text))  # r.text works even when the error body is not JSON
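To make the `pd.json_normalize(response, 'items')` step concrete, here is a small sketch using a hand-written payload. Its shape (an `items` list of records with `FileName`, `FileURL` and `Date`) is inferred from the columns printed in the notebook output; the values themselves are placeholders.

```python
import pandas as pd

# Placeholder payload; only the overall shape mirrors the API response.
sample_response = {
    "items": [
        {
            "FileName": "example.zip",
            "FileURL": "https://example.com/example.zip",
            "Date": "2024-01-01T00:00:00+00:00",
        }
    ]
}

# The second argument is the record path: each entry under 'items' becomes a row.
sample_df = pd.json_normalize(sample_response, "items")
print(sample_df.columns.tolist())
```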

- The data has been retrieved. Next, extract the download URL and fetch the archive.

In [4]:
print(df.tail())
file_url = df.loc[0, 'FileURL']  # URL of the zipped audio archive
response = requests.get(file_url)

save_path = r'C:\Users\Percy\OneDrive\Desktop\Acoustic\0404'  # Destination folder

if response.status_code == 200:
    with io.BytesIO(response.content) as zip_data:
        with zipfile.ZipFile(zip_data, 'r') as zip_ref:
            zip_ref.extractall(os.path.abspath(save_path))
    print("Zip file downloaded successfully.")
else:
    print(f"Failed to download zip file. Status code: {response.status_code}")

Requesting from https://prod-tw-opendata-app.uk-e1.cloudhub.io/data/AcousticLogger/v1/SoundFiles?data%20filters
            FileName                                            FileURL  \
0  acoustic_data.zip  https://saseuwdevdsplat.blob.core.windows.net/...   

                        Date  
0  2024-02-06T11:34:43+00:00  
Zip file downloaded successfully.
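The download cell wraps the response bytes in `io.BytesIO` so that `zipfile` can read the archive without first writing it to disk. The sketch below reproduces that pattern offline: it builds a small archive in memory (standing in for `response.content`) and extracts it to a temporary directory. The member name and contents are made up.

```python
import io
import os
import tempfile
import zipfile

# Build a tiny zip archive in memory to stand in for response.content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("recordings/sample.txt", "placeholder audio data")
zip_bytes = buffer.getvalue()

# Same pattern as the download cell: bytes -> BytesIO -> ZipFile -> extractall.
with tempfile.TemporaryDirectory() as target_dir:
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zip_ref:
        member_names = zip_ref.namelist()  # inspect contents before extracting
        zip_ref.extractall(target_dir)
    extracted = os.path.exists(os.path.join(target_dir, "recordings", "sample.txt"))

print(member_names, extracted)
```

Calling `namelist()` before `extractall()` is a cheap way to confirm the archive contains what is expected.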


<a id="four"></a>
## **4. Exploratory Data Analysis**
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    

#### **4.1 Train Dataset Exploratory Data Analysis.**

4.1.1 <u>The first 3 rows of the Train dataset</u>

**Comment:** The train dataset has 4 features. Let's check whether there are any missing values within each feature.

4.1.2 <u>Missing values in the Train dataset</u>

**Comment:** There are no missing values in any column. Next, let's look at the length of this dataframe and the number of distinct userid and movieid values.

4.1.3 <u>Length of the dataframe and number of distinct elements in movieid and userid</u>

**Comment:** There are more users than movies. Let's plot the frequency of user ratings for the movies.
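The checks described in 4.1.2 and 4.1.3 can be sketched with pandas as follows. The toy frame and its values are assumptions for illustration only; the column names follow the ones mentioned in the comments above.

```python
import pandas as pd

toy_train = pd.DataFrame(
    {
        "userid": [1, 1, 2, 3],
        "movieid": [10, 20, 10, 10],
        "rating": [4.0, 3.5, 5.0, 4.0],
    }
)

missing_per_column = toy_train.isnull().sum()  # count of NaN values in each column
n_rows = len(toy_train)                        # length of the dataframe
n_users = toy_train["userid"].nunique()        # number of distinct users
n_movies = toy_train["movieid"].nunique()      # number of distinct movies

print(missing_per_column)
print(n_rows, n_users, n_movies)
```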

4.1.4 <u>Frequency of rating categories</u>

**Comment:** From the above distribution we can see that a rating of 4 makes up 26% of the dataset, which suggests that 4 is the rating people give most often. Now let's look at the summary statistics of the ratings.

4.1.5 <u>Rating statistical summary</u>

**Comment:** The average rating is 3.533 and the maximum rating is 5. Let's visualize the ratings to see their distribution.

4.1.6 <u>Ratings visuals</u>
- Box-and-whisker plot.

**Comment:** The skewness of the ratings is -0.7, which tells us that the distribution is left-skewed (tail to the left).
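As a minimal sketch of how the sign of the skewness relates to the direction of the tail, the following uses synthetic values (not the project's data) and pandas' `Series.skew()`.

```python
import pandas as pd

# Synthetic values: most observations are high, with one low outlier.
left_tailed = pd.Series([1, 4, 4, 5, 5, 5, 5])
# Opposite case: most observations are low, with one high outlier.
right_tailed = pd.Series([1, 1, 1, 1, 2, 2, 5])

left_skew = left_tailed.skew()    # negative: tail extends to the left
right_skew = right_tailed.skew()  # positive: tail extends to the right
print(round(left_skew, 2), round(right_skew, 2))
```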

#### **4.2 Movie Dataset Exploratory Data Analysis.**

4.2.1 <u>The first 3 rows of the Movies dataset</u>

**Comment:** In the movies dataset, the features we have not yet explored are text data. Let's check whether there are missing values.

4.2.2 <u>NaN values in the movies dataset</u>

**Comment:** From the above results, there appear to be no missing values in the movies dataset. Now let's look at the frequency of movie genres.

4.2.3 <u>Frequency of genres in the movies dataset</u>

**Comment:** Drama, comedy and documentary are the top 3 genres. Now let's visualize the most frequent genres in a word cloud.

4.2.4 <u>Frequent genres</u>
- Word cloud visualization.

**Comment:** The word cloud shows that Drama, Sci-Fi, Romance and Comedy are the most frequent genres in our dataset. Let's look at the distribution.

4.2.5 <u>Genre distributions</u>

**Comment:** The above distribution confirms that most of the movies in the dataset are drama and romance movies.

#### **4.3 Tag Dataset Exploratory Data Analysis.**

4.3.1 <u>The first 3 rows of the Tag dataset</u>

**Comment:** The dataframe has two new features, the tag and the timestamp. For this project we are interested in the tag, which is provided by the user.

4.3.2 <u>Checking for missing values</u>

**Comment:** There are no missing values in this dataset. Let's look at a word cloud of how people classify the genres of the movies.

4.3.3 <u>Genres as tagged by viewers</u>

**Comment:** Based on the tags viewers assigned to the movies they watched, the word cloud shows that they classify most movies as Comedy and Drama, which matches what the movies dataset shows.

#### **4.4 Genome_scores Dataset Exploratory Data Analysis.**

4.4.1 <u>The first 3 rows of the Genome_scores dataset</u>

**Comment:** In the genome_scores dataset the new feature is the relevance column. Let's check for missing values and perform some statistical analysis on the relevance column.

4.4.2 <u>NaN values in the genome_scores dataset</u>

**Comment:** There are no missing values in the genome_scores dataset.

4.4.3 <u>Statistical description</u>

**Comment:** The summary above shows the mean, maximum and minimum relevance. Let's plot these values to see them in visual form.

4.4.4 <u>Statistical summary plot</u>
- Box-and-whisker plot.

**Comment:** The skewness of the relevance is 2.6, which tells us that the distribution is right-skewed (tail to the right).

#### **4.5 Rating Dataset Exploratory Data Analysis.**

4.5.1 <u>The first 3 rows of the rating dataset</u>

**Comment:** Since we have already analysed the userid, movieid and rating features, let's only check for missing values.

4.5.2 <u>NaN values</u>

**Comment:** There are no missing values in the ratings dataframe and all of its features have been explored, so we move on to the next dataframe.

#### **4.6 imdb Dataset Exploratory Data Analysis.**

4.6.1 <u>The first 3 rows of the imdb dataset</u>

**Comment:** This dataset has 5 new features that we have not yet explored. Let's explore them and see what insights they offer.

4.6.2 <u>Statistical summary</u>

<a id="five"></a>
## **5. Data Cleaning & Preprocessing**
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---


#### **5.1 Duplicates within the dataset.**
- We only examine the important datasets: train, test, tags, imdb and movies.

**Comment:** The results above show that there are no duplicated values in our datasets.
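A duplicate check of this kind can be sketched with pandas' `duplicated()` and `drop_duplicates()`. The toy frame below is made up and deliberately contains one repeated row so that the effect is visible.

```python
import pandas as pd

toy_movies = pd.DataFrame({"movieid": [1, 2, 2], "title": ["A", "B", "B"]})

# duplicated() flags rows that are identical to an earlier row.
n_duplicates = int(toy_movies.duplicated().sum())
deduplicated = toy_movies.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```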

#### **5.2 Processing the data to remove characters.**
- A function to remove a specific character and replace it with an empty space.

**Comment:** This function removes the character '|' when applied to a dataset named movies that has a column called genres.
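The function itself is not shown in the notebook, so the following is only a sketch of what such a helper might look like. The name `remove_character`, its parameters and the sample rows are assumptions.

```python
import pandas as pd


def remove_character(frame, column, char="|"):
    """Return a copy of `frame` with `char` in `column` replaced by a space."""
    out = frame.copy()
    # regex=False treats '|' as a literal character, not regex alternation.
    out[column] = out[column].str.replace(char, " ", regex=False)
    return out


movies = pd.DataFrame({"genres": ["Comedy|Drama", "Sci-Fi"]})
cleaned = remove_character(movies, "genres")
print(cleaned["genres"].tolist())
```

Passing `regex=False` matters here because `|` has a special meaning in regular expressions.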

<a id="six"></a>
## **6. Modelling**
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---


## **Authors**

| Name | Surname | Position |
| :----------- | :------------: | ------------: |
| Percy | Mmutle | None |
| Lesego | 88888888 | Project Manager |
| Aphiwe | 888888 | None |
| Tonia | 88888 | None |
| Ntsako | 888888 | None |
| Tumi | 888888 | None |
| Victoria | 888888 | Team Lead |
| Ndivho | 888888 | None |