
# Notebook 03: Data Cleaning and Preprocessing

## Objective
The purpose of this notebook is to clean and preprocess the concrete compressive strength dataset
in preparation for exploratory data analysis and machine learning modelling.

This includes:
- Loading the dataset
- Checking for missing values
- Inspecting data types
- Detecting duplicates
- Renaming columns for consistency
- Basic statistical validation

In [5]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
pd.set_option("display.width", 1000)

In [2]:
from google.colab import files

uploaded = files.upload()

Saving concrete_data.csv to concrete_data.csv


In [7]:
# Load dataset
df = pd.read_csv("concrete_data.csv")

print("Dataset loaded successfully")
print("Shape:", df.shape)

df.head()

Dataset loaded successfully
Shape: (1030, 9)


,cement,blast_furnace_slag,fly_ash,water,superplasticizer,coarse_aggregate,fine_aggregate,age,concrete_compressive_strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


In [8]:
# Dataset overview
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   cement                         1030 non-null   float64
 1   blast_furnace_slag             1030 non-null   float64
 2   fly_ash                        1030 non-null   float64
 3   water                          1030 non-null   float64
 4   superplasticizer               1030 non-null   float64
 5   coarse_aggregate               1030 non-null   float64
 6   fine_aggregate                 1030 non-null   float64
 7   age                            1030 non-null   int64  
 8   concrete_compressive_strength  1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.6 KB
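The `df.info()` output above shows every column as numeric. As a minimal sketch of how that inspection could be automated, the helper below lists any column whose dtype is not numeric and coerces it. The function name `non_numeric_columns` and the small `sample` frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def non_numeric_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns whose dtype is not numeric."""
    return [
        col for col in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[col])
    ]


# Hypothetical stand-in frame in which one column was parsed as text.
sample = pd.DataFrame({
    "cement": [540.0, 332.5],
    "water": ["162.0", "228.0"],  # strings, as if the CSV was parsed incorrectly
    "age": [28, 270],
})

bad_columns = non_numeric_columns(sample)
print("Non-numeric columns:", bad_columns)

# Coerce the offending columns; entries that cannot be parsed become NaN.
for col in bad_columns:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")
```

On the real dataset this helper would be expected to return an empty list, since all nine columns are already `float64` or `int64`.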


In [9]:
# Check for missing values
missing_values = df.isnull().sum()

print("Missing values per column:")
missing_values

Missing values per column:


cement,0
blast_furnace_slag,0
fly_ash,0
water,0
superplasticizer,0
coarse_aggregate,0
fine_aggregate,0
age,0
concrete_compressive_strength,0
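When a dataset does contain gaps, a count alone can be hard to interpret. The following sketch extends the check above with a percentage column and keeps only the affected columns. The helper name `missing_report` and the `sample` values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise missing values per column as a count and a percentage."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually contain missing values.
    return report[report["missing"] > 0]


# Hypothetical frame with a single gap in the "cement" column.
sample = pd.DataFrame({
    "cement": [540.0, np.nan, 198.6, 266.0],
    "age": [28, 28, 360, 90],
})

report = missing_report(sample)
print(report)
```

For the dataset in this notebook the report would be empty, which matches the zero counts shown above.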


In [14]:
# Check for duplicate rows
duplicates = df.duplicated().sum()

print(f"Number of duplicate rows: {duplicates}")
df = df.drop_duplicates()

Number of duplicate rows: 0
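Because the cell above overwrites `df` with the deduplicated frame, running it a second time will always report zero duplicates. A minimal sketch of a variant that records the row count before and after the drop is shown below. The `sample` frame and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical frame in which one mix design appears three times.
sample = pd.DataFrame({
    "cement": [540.0, 540.0, 332.5, 540.0],
    "water": [162.0, 162.0, 228.0, 162.0],
    "age": [28, 28, 270, 28],
})

n_before = len(sample)

# keep=False marks every member of a duplicate group, which helps inspection.
duplicate_groups = sample[sample.duplicated(keep=False)]

# Drop the repeats into a new frame so the original stays available.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
n_removed = n_before - len(deduplicated)

print(f"Removed {n_removed} duplicate rows ({n_before} -> {len(deduplicated)})")
```

Writing the result to a new variable keeps the cell safe to re-run, and the before/after counts document how many rows were actually removed.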


In [12]:
# Rename columns for consistency
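# regex=False treats "(" and ")" as literal characters; with regex matching
# enabled (the default before pandas 2.0) a lone "(" is an invalid pattern.
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
)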

df.head()

,cement,blast_furnace_slag,fly_ash,water,superplasticizer,coarse_aggregate,fine_aggregate,age,concrete_compressive_strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3
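The column names in this file are already in snake case, so the renaming step leaves them unchanged. To illustrate what the chain does on less tidy headers, here is a small sketch applied to hypothetical raw names (these example headers are assumptions, not taken from the dataset).

```python
import pandas as pd

# Hypothetical raw headers with stray spaces, capitals and parentheses.
raw_columns = pd.Index([" Cement ", "Blast Furnace Slag", "Age (day)"])

clean_columns = (
    raw_columns
    .str.strip()                         # remove leading/trailing whitespace
    .str.lower()                         # lower-case everything
    .str.replace(" ", "_", regex=False)  # spaces -> underscores
    .str.replace("(", "", regex=False)   # drop literal parentheses
    .str.replace(")", "", regex=False)
)

print(list(clean_columns))
# ['cement', 'blast_furnace_slag', 'age_day']
```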


In [13]:
# Basic statistical summary
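# Validation: no numeric column should contain negative values
assert (df.select_dtypes(include="number") >= 0).all().all(), "Negative values detected"

# Keep describe() as the last expression so the notebook displays the summary
df.describe()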
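The non-negativity check can be generalised to per-column plausibility ranges. The sketch below counts values that fall outside a set of bounds. The limits in `LIMITS` are purely illustrative assumptions and would need to be replaced with values justified for the dataset. The helper name `out_of_range_counts` is also hypothetical.

```python
import pandas as pd

# Hypothetical plausibility limits (inclusive); illustrative only.
LIMITS = {
    "water": (0.0, 500.0),
    "age": (1, 365),
}


def out_of_range_counts(frame: pd.DataFrame, limits: dict) -> dict:
    """Count values per column that fall outside the inclusive [low, high] range."""
    counts = {}
    for col, (low, high) in limits.items():
        # between() is inclusive on both ends; NaN counts as out of range here.
        counts[col] = int((~frame[col].between(low, high)).sum())
    return counts


# Hypothetical frame with one implausible value in each column.
sample = pd.DataFrame({
    "water": [162.0, 228.0, -5.0],
    "age": [28, 270, 400],
})

violations = out_of_range_counts(sample, LIMITS)
print(violations)
```

A dictionary of counts makes it easy to fail loudly, for example with `assert not any(violations.values())`, once suitable limits have been chosen.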

## Summary

In this notebook, the concrete compressive strength dataset (1,030 rows, 9 columns) was checked and
prepared for further analysis. The dataset contains no missing values, and the duplicate check
reported no duplicate rows.

Column names were standardised for clarity. Saving the cleaned dataset for use in subsequent
notebooks keeps the project reproducible and consistent.
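A minimal sketch of how the cleaned data could be written to disk for the next notebook is shown below. The helper `save_clean`, the file name `concrete_data_clean.csv` and the temporary output directory are assumptions for illustration. The project's actual path may differ.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_clean(frame: pd.DataFrame, path) -> Path:
    """Write the frame to CSV without the index and return the output path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # index=False avoids an extra "Unnamed: 0" column when the file is reloaded.
    frame.to_csv(path, index=False)
    return path


# Hypothetical stand-in for the cleaned DataFrame.
sample = pd.DataFrame({
    "cement": [540.0, 332.5],
    "age": [28, 270],
    "concrete_compressive_strength": [79.99, 40.27],
})

# Hypothetical output location; a temporary directory keeps the sketch side-effect free.
out_path = save_clean(sample, Path(tempfile.mkdtemp()) / "concrete_data_clean.csv")

# Reload to confirm the round trip preserves shape and column names.
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```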