# 02 – Data Cleaning & Quality Assurance  
### Smart City Energy Dataset (Asian Grid Standards)

This notebook continues from the data import step, so it expects the DataFrame `df` to be loaded already. It runs a **full cleaning pipeline** to prepare the Smart City dataset for analytics and ML workflows.

The objectives of this notebook:  
- Assess data quality  
- Clean invalid country names  
- Handle missing values  
- Remove duplicate entries  
- Detect voltage/current/consumption outliers  
- Remove outlier rows  
- Produce a fully clean dataset: `df_final`  


In [None]:
print("\n=== DATA QUALITY CHECK ===")
df.info()

print("\nMissing values:")
missing = df.isnull().sum()
print(missing[missing > 0])

missing_percent = (missing/len(df))*100
print("\nMissing %:")
print(missing_percent[missing_percent > 0])

empty_cols = missing_percent[missing_percent == 100].index.tolist()
print("\nEmpty columns:", empty_cols)


## 1. Data Quality Assessment  
This step performs an initial scan of the dataset including:

- Data structure (`df.info()`)
- Missing values count and percentages
- Detection of completely empty columns  
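
The same scan can be wrapped in a small reusable function. The following is a minimal sketch on an invented toy frame; the helper name `summarize_missing` and the `Notes` column are hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Smart City data; all values are invented.
toy = pd.DataFrame({
    "Country": ["India", "Japan", None, "China"],
    "Voltage (V)": [230.0, np.nan, 220.0, 225.0],
    "Notes": [np.nan, np.nan, np.nan, np.nan],  # completely empty column
})

def summarize_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    percent = counts / len(frame) * 100
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only columns that actually have gaps, worst first.
    return summary[summary["missing"] > 0].sort_values("percent", ascending=False)

summary = summarize_missing(toy)
# Columns in which every single value is missing.
empty_cols = toy.columns[toy.isnull().all()].tolist()

print(summary)
print("Empty columns:", empty_cols)
```

Using `isnull().all()` to find empty columns avoids comparing a floating-point percentage against 100.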


In [None]:
non_countries = ['SolarPark','Substation','Site','Industrial','LoadHub','Center']
print("\nRows before:", len(df))
df_cleaned = df[~df['Country'].isin(non_countries)].copy()
print("Rows after:", len(df_cleaned))
print("\nRemaining countries:")
print(df_cleaned['Country'].value_counts())


## 2. Cleaning the Country Column  
The dataset contains facility names mistakenly stored in the `Country` column.  
This step removes non-country labels:

- SolarPark  
- Substation  
- Site  
- Industrial  
- LoadHub  
- Center  

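The goal is to keep only rows whose `Country` value is a valid Asian country. The filter is a blocklist, so any label that is not in the list above is retained.  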
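
As a rough illustration of the filter, the sketch below applies the same blocklist to a few invented rows. It also strips surrounding whitespace before matching; whether the real data contains padded labels such as `' Substation'` is an assumption.

```python
import pandas as pd

# Invented sample rows; only the column names come from the notebook.
toy = pd.DataFrame({
    "Country": ["India", "SolarPark", "Japan", " Substation", "LoadHub", "India"],
    "Power Consumption": [1.2, 3.4, 0.9, 2.2, 5.1, 1.1],
})

non_countries = ["SolarPark", "Substation", "Site", "Industrial", "LoadHub", "Center"]

# Strip padding first so ' Substation' is still matched by isin()
# (the presence of such padding in the real data is an assumption).
labels = toy["Country"].str.strip()
is_facility = labels.isin(non_countries)

kept = toy[~is_facility].copy()
print("Rows before:", len(toy), "after:", len(kept))
print(kept["Country"].value_counts())
```

Checking `value_counts()` afterwards is still worthwhile, because a blocklist cannot catch facility labels it does not know about.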


In [None]:
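# Fill gaps in each numeric column with that column's median.
numeric_cols = df_cleaned.select_dtypes(include='number').columns
for col in numeric_cols:
    if df_cleaned[col].isnull().sum() > 0:
        med = df_cleaned[col].median()
        # Assign back: fillna(inplace=True) on a column selection may not update df_cleaned.
        df_cleaned[col] = df_cleaned[col].fillna(med)
        print(f"Filled missing in {col} with median {med:.2f}")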


## 3. Missing Value Treatment (Median Imputation)  
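Missing values in numeric columns are filled with the column **median**; non-numeric columns are left unchanged. The median is preferred over the mean because it:

- Is robust against outliers  
- Stays at a typical reading even when the column is skewed by extreme values  

Any single-value imputation reduces the column's variance, since every gap receives the same value.  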
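
To see why the median is the safer fill value, consider this small sketch with invented current readings, one of which is extreme.

```python
import numpy as np
import pandas as pd

# Invented current readings with one extreme value and one gap.
current = pd.Series([10.0, 11.0, 12.0, 13.0, 500.0, np.nan], name="Current (A)")

mean_fill = current.mean()      # pulled upward by the 500 A reading -> 109.2
median_fill = current.median()  # middle of the observed values -> 12.0

# Assign the result to a new Series rather than modifying in place.
filled = current.fillna(median_fill)

print(f"mean={mean_fill:.1f}, median={median_fill:.1f}")
print(filled.tolist())
```

Filling the gap with the mean would insert a reading of about 109 A, far from any typical value in the series, whereas the median inserts 12 A.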


In [None]:
dups = df_cleaned.duplicated().sum()
print("Duplicates:", dups)
df_cleaned = df_cleaned.drop_duplicates()
print("New shape:", df_cleaned.shape)


## 4. Duplicate Row Removal  
Duplicate rows over-weight repeated readings in summary statistics and distort time-series patterns. Only rows that match on every column count as duplicates here.  
This step:

- Detects duplicate rows  
- Removes them  
- Confirms new dataset size  
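
The behaviour of `duplicated()` and `drop_duplicates()` can be checked on a tiny invented frame:

```python
import pandas as pd

# Invented rows: row 1 repeats row 0 exactly, row 3 differs in voltage.
toy = pd.DataFrame({
    "Country": ["India", "India", "Japan", "India"],
    "Voltage (V)": [230, 230, 100, 231],
})

# duplicated() flags each repeat of an earlier row; the first occurrence is not flagged.
dup_mask = toy.duplicated()
print("Duplicates:", int(dup_mask.sum()))

deduped = toy.drop_duplicates()  # keep="first" is the default
print("New shape:", deduped.shape)
```

Row 3 is kept because it differs from row 0 in one column, which shows that near-duplicates are not removed by this step.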


In [None]:
# Accepted voltage band (in volts) used by this notebook for Asian grids.
ASIA_VOLT = {'min': 90, 'max': 250}

def detect_voltage_outliers_asia(data, col='Voltage (V)'):
    """Return rows whose voltage lies outside the ASIA_VOLT range."""
    low, high = ASIA_VOLT['min'], ASIA_VOLT['max']
    out = data[(data[col] < low) | (data[col] > high)]
    print(f"Voltage outliers: {len(out)}")
    return out

def detect_outliers_iqr(data, col):
    """Return rows outside the 1.5 * IQR fences, plus the two fence values."""
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    low = Q1 - 1.5 * IQR
    high = Q3 + 1.5 * IQR
    out = data[(data[col] < low) | (data[col] > high)]
    print(f"{col} outliers: {len(out)}")
    return out, low, high

voltage_out = detect_voltage_outliers_asia(df_cleaned)
current_out, _, _ = detect_outliers_iqr(df_cleaned, 'Current (A)')
power_out, _, _ = detect_outliers_iqr(df_cleaned, 'Power Consumption')

# Union of row indices, so a row flagged by several checks is counted once.
all_out_idx = set(voltage_out.index) | set(current_out.index) | set(power_out.index)
print("Total unique outlier rows:", len(all_out_idx))


## 5. Outlier Detection  
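Outliers are detected in three columns:  

### Voltage (V): Asian Grid Standard  
Valid range: **90 V – 250 V** (inclusive). Readings outside this range are flagged.

### Current (A): IQR Method  
### Power Consumption: IQR Method  
For both columns, a value is flagged when it falls below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`, where `IQR = Q3 - Q1`.

Rows flagged by any of the three checks are collected into one set of indices, so a row that fails several checks is counted once.  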
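
A worked example may make the two rules concrete. The readings below are invented; the sketch computes the IQR fences with pandas' default (linear) quantile interpolation and applies the fixed 90 to 250 V band to a second series.

```python
import pandas as pd

# Invented current readings; 40.0 is the suspicious value.
current = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 40.0])

q1 = current.quantile(0.25)   # 12.75
q3 = current.quantile(0.75)   # 16.5
iqr = q3 - q1                 # 3.75
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 7.125 and 22.125

current_outliers = current[(current < low) | (current > high)]
print(f"IQR fences: [{low}, {high}]")
print("Current outliers:", current_outliers.tolist())

# Fixed-range rule for voltage: anything outside 90..250 V is flagged.
voltage = pd.Series([230.0, 85.0, 110.0, 260.0])
voltage_outliers = voltage[(voltage < 90) | (voltage > 250)]
print("Voltage outliers:", voltage_outliers.tolist())
```

The IQR fences adapt to each column's spread, while the voltage band is a fixed domain rule that does not depend on the data.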


In [None]:
df_final = df_cleaned.drop(index=list(all_out_idx))
print("Final shape after removing outliers:", df_final.shape)


### Cleaning complete. Dataset stored in `df_final`. Ready for EDA.
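
Before moving on, it can be useful to confirm that the cleaning goals actually hold. The helper below is a hypothetical sketch (the name `check_clean` is not part of the notebook), shown on an invented frame; in the notebook it would be called on `df_final`.

```python
import pandas as pd

def check_clean(frame: pd.DataFrame, volt_col: str = "Voltage (V)",
                low: float = 90, high: float = 250) -> dict:
    """Hypothetical helper: count remaining violations of the cleaning goals."""
    numeric = frame.select_dtypes(include="number")
    # between() is inclusive on both ends, so 90 V and 250 V count as valid.
    in_range = frame[volt_col].between(low, high)
    return {
        "missing_numeric": int(numeric.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "voltage_out_of_range": int((~in_range).sum()),
    }

# Invented stand-in for df_final.
toy_final = pd.DataFrame({
    "Country": ["India", "Japan", "China"],
    "Voltage (V)": [230.0, 100.0, 220.0],
    "Current (A)": [10.0, 12.0, 11.0],
})

report = check_clean(toy_final)
print(report)
```

All three counts should be zero for a frame that passed every step above.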