# **DATA CLEANING**

## Objectives

* Ensure data quality and usability

## Inputs

* outputs/datasets/collection/data-melborne_f.csv

## Outputs

* Cleaned data ready for use




---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory must be changed when running the notebook in the editor.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/Nod-to-the-COD/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/Nod-to-the-COD'
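
Re-running the `os.chdir` cell moves up one more level each time it is executed. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path printed above); the helper name `move_to_project_root` is hypothetical and not part of the original notebook:

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Move up one level only if we are still inside the notebooks folder."""
    cwd = Path.cwd()
    # Assumption: the notebooks folder is called "jupyter_notebooks".
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()


# Safe to call repeatedly: the second call leaves the directory unchanged.
print(move_to_project_root())
```

Because the check is on the folder name, running the cell twice should not leave the project root.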

# Data Cleaning

I have already dropped the index column and the year, month, and day columns, as I have created a timestamp column to replace the latter three. A single timestamp column makes time-based operations easier, which matters because I am interested in time series data.
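
The step that built the timestamp is not shown in this notebook. One way it could be done is sketched below, assuming the raw columns were named `year`, `month`, and `day`; the small `raw` frame is illustrative only and reuses three dates and COD values from the dataset preview:

```python
import pandas as pd

# Illustrative rows; the column names year/month/day are an assumption.
raw = pd.DataFrame({
    "year": [2014, 2014, 2014],
    "month": [1, 1, 1],
    "day": [1, 2, 5],
    "Chemical Oxygen Demand": [730.0, 740.0, 836.0],
})

# pd.to_datetime can assemble a datetime from columns named year, month, day.
raw["date"] = pd.to_datetime(raw[["year", "month", "day"]])

# The three source columns are now redundant and can be dropped.
cleaned = raw.drop(columns=["year", "month", "day"])
print(cleaned.dtypes)
```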

Upgrade numpy and pandas, then load the data sheet

In [6]:
%pip install --upgrade numpy pandas



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [4]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/Data-Melbourns_F_fixed.csv")
df.head()

Unnamed: 0,Average Outflow,Average Inflow,Energy Consumption,Ammonia,Biological Oxygen Demand,Chemical Oxygen Demand,Total Nitrogen,Average Temperature,Maximum temperature,Minimum temperature,Atmospheric pressure,Average humidity,Total rainfall,Average visibility,Average wind speed,Maximum wind speed,date
0,2.941,2.589,175856.0,27.0,365.0,730.0,60.378,19.3,25.1,12.6,0.0,56.0,1.52,10.0,26.9,53.5,2014-01-01
1,2.936,2.961,181624.0,25.0,370.0,740.0,60.026,17.1,23.6,12.3,0.0,63.0,0.0,10.0,14.4,27.8,2014-01-02
2,2.928,3.225,202016.0,42.0,418.0,836.0,64.522,16.8,27.2,8.8,0.0,47.0,0.25,10.0,31.9,61.1,2014-01-05
3,2.928,3.354,207547.0,36.0,430.0,850.0,63.0,14.6,19.9,11.1,0.0,49.0,0.0,10.0,27.0,38.9,2014-01-06
4,2.917,3.794,202824.0,46.0,508.0,1016.0,65.59,13.4,19.1,8.0,0.0,65.0,0.0,10.0,20.6,35.2,2014-01-07
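
Since the objective is data quality, a quick per-column summary can help before profiling. The sketch below is one possible check, not the notebook's own method; the helper `quality_summary` is hypothetical and the `sample` frame is made up (including the missing value) purely to show the output shape:

```python
import pandas as pd


def quality_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing count, distinct-value count, and dtype."""
    return pd.DataFrame({
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),
        "dtype": frame.dtypes.astype(str),
    })


# Made-up sample; in the notebook this would be called on df instead.
sample = pd.DataFrame({
    "Ammonia": [27.0, 25.0, None],
    "Atmospheric pressure": [0.0, 0.0, 0.0],
})

summary = quality_summary(sample)
print(summary)
print("Duplicate rows:", sample.duplicated().sum())
```

A column with a single distinct value carries no information for modelling. In the five rows previewed above, `Atmospheric pressure` is 0.0 throughout, so it may be worth checking whether that holds for the full column.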


# Data Exploration

Install ydata-profiling

In [5]:
%pip install ydata-profiling

Collecting numpy<2.2,>=1.16.0 (from ydata-profiling)
  Using cached numpy-2.1.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Using cached numpy-2.1.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.0 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.2.6
    Uninstalling numpy-2.2.6:
      Successfully uninstalled numpy-2.2.6
Successfully installed numpy-2.1.3

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [7]:
%pip install --upgrade numpy ydata-profiling


Collecting numpy
  Using cached numpy-2.2.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [5]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()


---

In [18]:
df['date'] = pd.to_datetime(df['date'])


In [19]:
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year


---

In [21]:
corr_spearman = df.corr(method='spearman')['Chemical Oxygen Demand'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

Total Nitrogen              0.766683
Biological Oxygen Demand    0.456210
Ammonia                     0.296885
year                        0.276204
date                        0.268945
Average humidity           -0.165520
Total rainfall             -0.159646
Maximum temperature         0.114882
Average Temperature         0.102299
Minimum temperature         0.069284
Name: Chemical Oxygen Demand, dtype: float64

In [22]:
corr_pearson = df.corr(method='pearson')['Chemical Oxygen Demand'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

Total Nitrogen              0.681225
Biological Oxygen Demand    0.524189
Ammonia                     0.284552
year                        0.231927
date                        0.227455
Average humidity           -0.152655
Maximum temperature         0.090018
Average Temperature         0.083946
Total rainfall             -0.055335
Minimum temperature         0.052716
Name: Chemical Oxygen Demand, dtype: float64

In [23]:
top_n = 5
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

{'Ammonia', 'Biological Oxygen Demand', 'Total Nitrogen', 'date', 'year'}
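
The selection above takes the union of the top-ranked features under both correlation methods. A self-contained sketch of the same idea on synthetic data follows; the column names, the helper `top_correlated`, and the data are all hypothetical. It drops the target by label rather than with the positional `[1:]` slice used above, which is a small variation and not the notebook's exact code:

```python
import numpy as np
import pandas as pd

# Synthetic data: one linear and one monotonic (non-linear) relationship.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
demo = pd.DataFrame({
    "target": x,
    "linear": 2 * x + rng.normal(scale=0.5, size=n),
    "monotonic": np.exp(x),
    "noise": rng.normal(size=n),
})


def top_correlated(frame, target, method, top_n):
    """Return the top_n features ranked by absolute correlation with target."""
    corr = frame.corr(method=method)[target].drop(target)
    return corr.sort_values(key=abs, ascending=False).head(top_n)


top_n = 2
selected = set(top_correlated(demo, "target", "pearson", top_n).index) | set(
    top_correlated(demo, "target", "spearman", top_n).index
)
print(selected)
```

Pearson measures linear association while Spearman measures monotonic association, so taking the union keeps features that rank highly under either view.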

# Push files to Repo

* If you do not need to push files to the repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is filled in
except Exception as e:
  print(e)
