# Electricity tariffs analysis and compliance.
---
## 02_Notebook: Dataset repackaging.
---   

When the five original CSV files were loaded into dataframes (`tariffs_df1` to `tariffs_df5`), two pairs turned out to share a schema. `tariffs_df1` and `tariffs_df2` both describe clients (`tariffs_df2` additionally carries a `target` column), and `tariffs_df3` and `tariffs_df4` both describe invoices. This notebook, `02_dataset_repakaging.ipynb`, therefore concatenates each pair into a single dataframe and then merges the two results on the `client_id` column.

## Dataset repackaging workflow.

---

### Tec1: Load and inspect cleaned dataframes.
- Read cleaned files into memory and verify structure.
---
### Tec2: Package data into a unified dataset.
#### Tec2.1: Combine client-related files.
- Concatenate `df1` and `df2` into `df_combined_clients`.
---
#### Tec2.2: Combine invoice-related files.
- Concatenate `df3` and `df4` into `df_combined_invoices`.
---
#### Tec2.3: Merge clients and invoices.
- Join on `client_id` to form a unified dataset.
---
#### Tec2.4: Handle missing targets.
- Address null values in the `target` column.
---
### Tec3: Summarize and export cleaned dataset.
- Confirm structure and save for downstream workflows.
---

## __Tec1__: Load and inspect cleaned dataframes.
---


### Load libraries.

In [1]:
import pandas as pd
import numpy as np

In [2]:
import os
# Define full path to fragmented data.
fragmented_data_path = r'C:\Users\Lenovo\OneDrive\Desktop\4IR_DataScience\DataScienceEnvironment\my_projects\electricity_tariffs_revenue_protection\electricity_tariffs_analysis_compliance\data\processed\fragmented_data'

# Load all tariff fragments.
tariff_dfs = []
for i in range(1, 6):
    file_path = os.path.join(fragmented_data_path, f"tariffs_df{i}.csv")
    df = pd.read_csv(file_path)
    tariff_dfs.append(df)
    
# Display a few rows from each fragment.
for i, df in enumerate(tariff_dfs, start=1):
    print(f"\ntariffs_df{i}.head():")                   # Display first 5 rows.
    display(df.head())          
    print(f"tariffs_df{i},shape:")                      # Display shape of the dataframe.
    print(df.shape)
    print(f'Columns in tariffs_df{i}:')                 # Display columns in the dataframe.
    print(df.columns.tolist())
                              




tariffs_df1.head():


Unnamed: 0,district,client_id,client_catg,region,creation_date
0,62,test_Client_0,11,307,28/05/2002
1,69,test_Client_1,11,103,06/08/2009
2,62,test_Client_10,11,310,07/04/2004
3,60,test_Client_100,11,101,08/10/1992
4,62,test_Client_1000,11,301,21/07/1977


tariffs_df1,shape:
(58069, 5)
Columns in tariffs_df1:
['district', 'client_id', 'client_catg', 'region', 'creation_date']

tariffs_df2.head():


Unnamed: 0,district,client_id,client_catg,region,creation_date,target
0,60,train_Client_0,11,101,31/12/1994,0.0
1,69,train_Client_1,11,107,29/05/2002,0.0
2,62,train_Client_10,11,301,13/03/1986,0.0
3,69,train_Client_100,11,105,11/07/1996,0.0
4,62,train_Client_1000,11,303,14/10/2014,0.0


tariffs_df2,shape:
(135493, 6)
Columns in tariffs_df2:
['district', 'client_id', 'client_catg', 'region', 'creation_date', 'target']

tariffs_df3.head():


Unnamed: 0,client_id,invoice_date,tariff_type,meter_number,meter_status,meter_code,reading_remark,meter_coefficient,consumption_level_1,consumption_level_2,consumption_level_3,consumption_level_4,old_reading,new_reading,number_months,meter_type
0,test_Client_0,2018-03-16,11,651208,0,203,8,1,755,0,0,0,19145,19900,8,ELEC
1,test_Client_0,2014-03-21,11,651208,0,203,8,1,1067,0,0,0,13725,14792,8,ELEC
2,test_Client_0,2014-07-17,11,651208,0,203,8,1,0,0,0,0,14792,14792,4,ELEC
3,test_Client_0,2015-07-13,11,651208,0,203,9,1,410,0,0,0,16122,16532,4,ELEC
4,test_Client_0,2016-07-19,11,651208,0,203,9,1,412,0,0,0,17471,17883,4,ELEC


tariffs_df3,shape:
(1939722, 16)
Columns in tariffs_df3:
['client_id', 'invoice_date', 'tariff_type', 'meter_number', 'meter_status', 'meter_code', 'reading_remark', 'meter_coefficient', 'consumption_level_1', 'consumption_level_2', 'consumption_level_3', 'consumption_level_4', 'old_reading', 'new_reading', 'number_months', 'meter_type']

tariffs_df4.head():


Unnamed: 0,client_id,invoice_date,tariff_type,meter_number,meter_status,meter_code,reading_remark,meter_coefficient,consumption_level_1,consumption_level_2,consumption_level_3,consumption_level_4,old_reading,new_reading,number_months,meter_type
0,train_Client_0,2014-03-24,11,1335667,0,203,8,1,82,0,0,0,14302,14384,4,ELEC
1,train_Client_0,2013-03-29,11,1335667,0,203,6,1,1200,184,0,0,12294,13678,4,ELEC
2,train_Client_0,2015-03-23,11,1335667,0,203,8,1,123,0,0,0,14624,14747,4,ELEC
3,train_Client_0,2015-07-13,11,1335667,0,207,8,1,102,0,0,0,14747,14849,4,ELEC
4,train_Client_0,2016-11-17,11,1335667,0,207,9,1,572,0,0,0,15066,15638,12,ELEC


tariffs_df4,shape:
(4476738, 16)
Columns in tariffs_df4:
['client_id', 'invoice_date', 'tariff_type', 'meter_number', 'meter_status', 'meter_code', 'reading_remark', 'meter_coefficient', 'consumption_level_1', 'consumption_level_2', 'consumption_level_3', 'consumption_level_4', 'old_reading', 'new_reading', 'number_months', 'meter_type']

tariffs_df5.head():


Unnamed: 0,client_id,target
0,test_Client_0,0.957281
1,test_Client_1,0.996425
2,test_Client_10,0.612359
3,test_Client_100,0.776933
4,test_Client_1000,0.571046


tariffs_df5,shape:
(58069, 2)
Columns in tariffs_df5:
['client_id', 'target']
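
The loading cell above relies on an absolute, machine-specific Windows path. As a minimal sketch, assuming the notebook is run from a project root that contains `data/processed/fragmented_data`, the same five file paths could be built with `pathlib`. The helper name `fragment_paths` and the choice of project root are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def fragment_paths(project_root, n_fragments=5):
    """Return the expected CSV paths for tariffs_df1 .. tariffs_df{n_fragments}."""
    data_dir = Path(project_root) / "data" / "processed" / "fragmented_data"
    return [data_dir / f"tariffs_df{i}.csv" for i in range(1, n_fragments + 1)]


# Hypothetical usage: "." stands in for the project root and is an assumption.
paths = fragment_paths(Path("."))
print([p.name for p in paths])
```

Each returned path can be passed directly to `pd.read_csv`, which accepts `Path` objects.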


---
### Tariff fragment loading summary.

- `tariffs_df1.csv` loaded — shape: **(58,069 rows × 5 columns)**
- `tariffs_df2.csv` loaded — shape: **(135,493 rows × 6 columns)**
- `tariffs_df3.csv` loaded — shape: **(1,939,722 rows × 16 columns)**  
(tariffs_df3.csv raised a `DtypeWarning`: *Columns (4) have mixed types*. Consider specifying `dtype` or using `low_memory=False`).
- `tariffs_df4.csv` loaded — shape: **(4,476,738 rows × 16 columns)**
- `tariffs_df5.csv` loaded — shape: **(58,069 rows × 2 columns)**

> Total rows across all fragments: **6,668,091**

---



### Fix the DtypeWarning in `tariffs_df3`.

In [3]:
import pandas as pd
import os

# Define path.
fragmented_data_path = r'C:\Users\Lenovo\OneDrive\Desktop\4IR_DataScience\DataScienceEnvironment\my_projects\electricity_tariffs_revenue_protection\electricity_tariffs_analysis_compliance\data\processed\fragmented_data'
file_path = os.path.join(fragmented_data_path, "tariffs_df3.csv")

# First, inspect column names.
df_temp = pd.read_csv(file_path, nrows=5)
print("Column 4 name:", df_temp.columns[4])

# Explicitly set dtype for column 4 (index starts at 0).
df3 = pd.read_csv(file_path, dtype={'meter_status': str})


Column 4 name: meter_status
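
The notebook does not show which values in `meter_status` triggered the mixed-type warning. The sketch below shows one way to inspect such a column after reading it as `str`. The toy CSV and its values are made up for illustration and are not taken from `tariffs_df3.csv`.

```python
import io

import pandas as pd

# Toy CSV (illustrative only): 'meter_status' mixes digits and a letter.
csv_text = "client_id,meter_status\nc1,0\nc2,1\nc3,A\n"

# Reading the column as str gives every value the same Python type.
df = pd.read_csv(io.StringIO(csv_text), dtype={"meter_status": str})
print(df["meter_status"].map(type).value_counts())

# Values that are not purely numeric are the likely cause of mixed-type inference.
non_numeric = df.loc[~df["meter_status"].str.isdigit(), "meter_status"].unique()
print(non_numeric)
```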


## __Tec2__: Package data into a unified dataset.
---
### __Tec2.1__: Combine client-related files.
---

In [4]:
import pandas as pd
import os

# Load client files.
df1 = pd.read_csv(os.path.join(fragmented_data_path, "tariffs_df1.csv"))
df2 = pd.read_csv(os.path.join(fragmented_data_path, "tariffs_df2.csv"))

# Drop 'target' from df2 before combining.
df2_no_target = df2.drop(columns=['target'])

# Combine client datasets.
df_combined_clients = pd.concat([df1, df2_no_target], ignore_index=True)

# Merge 'target' back using client_id.
target_map = df2[['client_id', 'target']]
df_combined_clients = df_combined_clients.merge(target_map, on='client_id', how='left')
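
The cell above drops `target`, concatenates, and then merges `target` back. The sketch below, on made-up stand-ins for `df1` and `df2`, suggests that a direct `pd.concat` gives the same table, because `concat` aligns on column names and fills the missing `target` with `NaN`. It also adds `validate="many_to_one"`, which makes the merge fail if `client_id` is duplicated in the label table. Whether `client_id` is unique in the real `tariffs_df2.csv` is an assumption that this check would test.

```python
import pandas as pd

# Toy stand-ins for df1 (unlabeled) and df2 (labeled); values are illustrative.
df1 = pd.DataFrame({"client_id": ["test_Client_0", "test_Client_1"],
                    "region": [307, 103]})
df2 = pd.DataFrame({"client_id": ["train_Client_0", "train_Client_1"],
                    "region": [101, 107],
                    "target": [0.0, 1.0]})

# Route used in the notebook: drop target, concatenate, merge target back.
combined = pd.concat([df1, df2.drop(columns=["target"])], ignore_index=True)
via_merge = combined.merge(df2[["client_id", "target"]], on="client_id",
                           how="left", validate="many_to_one")

# Direct route: concat fills 'target' with NaN for rows that come from df1.
via_concat = pd.concat([df1, df2], ignore_index=True)

# Raises an AssertionError if the two routes disagree.
pd.testing.assert_frame_equal(via_merge, via_concat)
print(via_concat)
```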


### __Tec2.2__: Combine invoice-related files.
---

In [5]:
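# Load invoice files, reading 'meter_status' as str so the mixed-type fix from In [3] is kept.
df3 = pd.read_csv(os.path.join(fragmented_data_path, "tariffs_df3.csv"), dtype={'meter_status': str}, low_memory=False)
df4 = pd.read_csv(os.path.join(fragmented_data_path, "tariffs_df4.csv"), dtype={'meter_status': str}, low_memory=False)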

# Combine invoice datasets.
df_combined_invoices = pd.concat([df3, df4], ignore_index=True)
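
`pd.concat` does not complain when the two frames have different columns. It adds the missing ones and fills them with `NaN`. A small guard before concatenating, sketched here on made-up frames with a reduced set of invoice columns, could catch a schema mismatch early.

```python
import pandas as pd

# Toy stand-ins for df3 and df4 with a reduced column set (illustrative values).
df3 = pd.DataFrame({"client_id": ["test_Client_0"],
                    "old_reading": [19145], "new_reading": [19900]})
df4 = pd.DataFrame({"client_id": ["train_Client_0"],
                    "old_reading": [14302], "new_reading": [14384]})

# Fail early if the schemas differ instead of silently gaining NaN columns.
assert list(df3.columns) == list(df4.columns), "invoice fragments have different columns"

invoices = pd.concat([df3, df4], ignore_index=True)

# The combined frame should hold exactly the rows of both inputs.
assert len(invoices) == len(df3) + len(df4)
print(invoices.shape)
```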


### __Tec2.3__: Merge clients and invoices.
---


In [6]:
# Merge combined clients with combined invoices.
df_packaged = pd.merge(df_combined_invoices, df_combined_clients, on='client_id', how='left')

# Preview results.
print("Packaged dataset shape:", df_packaged.shape)
print("Target column value counts:\n", df_packaged['target'].value_counts(dropna=False))


Packaged dataset shape: (6416460, 21)
Target column value counts:
 target
0.0    4123629
NaN    1939722
1.0     353109
Name: count, dtype: int64
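
A left join keeps every invoice row, but it can duplicate rows if `client_id` repeats in the client table, and it leaves `NaN` where no client matches. As a sketch on made-up data, pandas' `validate` and `indicator` options can make both conditions explicit. The frames and values below are hypothetical.

```python
import pandas as pd

# Toy data: client 'c' has invoices but no entry in the client table.
invoices = pd.DataFrame({"client_id": ["a", "a", "b", "c"],
                         "new_reading": [10, 20, 30, 40]})
clients = pd.DataFrame({"client_id": ["a", "b"], "target": [0.0, 1.0]})

merged = invoices.merge(clients, on="client_id", how="left",
                        validate="many_to_one",  # raises if clients has duplicate ids
                        indicator=True)          # adds a '_merge' column

# A many-to-one left join must not change the number of invoice rows.
assert len(merged) == len(invoices)

# Invoices whose client_id was not found in the client table.
unmatched = merged.loc[merged["_merge"] == "left_only", "client_id"].unique()
print(unmatched)
```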


---

### Packaged dataset summary.

| Metric                         | Value     |
|--------------------------------|-----------|
| Total rows                     | 6,416,460 |
| Invoice rows with `target = 0` | 4,123,629 |
| Invoice rows with `target = 1` | 353,109   |
| Invoice rows with `NaN` target | 1,939,722 |

- The `NaN` values belong to invoices of `df1` clients, which carry no label. Only `df2` clients have a `target`.
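
The counts reported in this notebook can be cross-checked with plain arithmetic. The sketch below uses only numbers printed above: the shapes of `tariffs_df3` and `tariffs_df4`, the packaged shape, and the `target` value counts.

```python
# Row counts copied from the outputs above.
rows_df3 = 1_939_722
rows_df4 = 4_476_738
packaged_rows = 6_416_460
target_counts = {"0.0": 4_123_629, "1.0": 353_109, "NaN": 1_939_722}

# The left join kept every invoice row and added none.
assert rows_df3 + rows_df4 == packaged_rows

# The value counts cover the whole packaged dataset.
assert sum(target_counts.values()) == packaged_rows

# Unlabeled rows match tariffs_df3 in number, labeled rows match tariffs_df4.
assert target_counts["NaN"] == rows_df3
assert target_counts["0.0"] + target_counts["1.0"] == rows_df4

print("row counts are consistent")
```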

---


## __Tec2.4__: Handle missing targets.
---

The merged dataset contains **1,939,722 rows** with `NaN` values in the `target` column. These rows are invoices of `df1` clients, who have no label in `df2` and therefore no supervised outcome.

### Why these NaNs exist.
- The original five source files had **0 missing values**.
- The `NaN` targets were introduced during the **merge step** in Tec2.1, when the combined client table was left-joined with the labels from `df2`: `df1` clients have no matching label.
- These missing labels are not data quality issues — they reflect **genuine absence of labeling** in the source system.

### Justification for filtering.
To ensure modeling integrity and ethical clarity:
- **Supervised learning requires labeled outcomes**. Including unlabeled rows would introduce noise and bias.
- **Filtering preserves signal quality** and avoids misleading performance metrics.
- **The exclusion is reversible**—these rows are retained in the full dataset for future use in unsupervised tasks or operational reporting.
- **Segregating labeled vs. unlabeled clients** aligns with stakeholder expectations and supports transparent decision-making.

### Final decision.
Filter out rows with missing `target` values to prepare a clean, reliable training set for supervised modeling.

---

In [7]:
# Encode missing targets as -1 so unlabeled rows stay identifiable in df_packaged.
df_packaged['target'] = df_packaged['target'].fillna(-1)

# Filter only labeled clients (target in [0.0, 1.0]).
df_labeled = df_packaged[df_packaged['target'].isin([0.0, 1.0])].copy()

# Confirm shape.
print(df_labeled.shape)  

# Preview first few rows.
df_labeled.head()


(4476738, 21)


Unnamed: 0,client_id,invoice_date,tariff_type,meter_number,meter_status,meter_code,reading_remark,meter_coefficient,consumption_level_1,consumption_level_2,...,consumption_level_4,old_reading,new_reading,number_months,meter_type,district,client_catg,region,creation_date,target
1939722,train_Client_0,2014-03-24,11,1335667,0,203,8,1,82,0,...,0,14302,14384,4,ELEC,60,11,101,31/12/1994,0.0
1939723,train_Client_0,2013-03-29,11,1335667,0,203,6,1,1200,184,...,0,12294,13678,4,ELEC,60,11,101,31/12/1994,0.0
1939724,train_Client_0,2015-03-23,11,1335667,0,203,8,1,123,0,...,0,14624,14747,4,ELEC,60,11,101,31/12/1994,0.0
1939725,train_Client_0,2015-07-13,11,1335667,0,207,8,1,102,0,...,0,14747,14849,4,ELEC,60,11,101,31/12/1994,0.0
1939726,train_Client_0,2016-11-17,11,1335667,0,207,9,1,572,0,...,0,15066,15638,12,ELEC,60,11,101,31/12/1994,0.0
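
As an alternative sketch, the labeled and unlabeled rows could be separated with a `notna()` mask, without first writing a `-1` sentinel into `target`. This is not the notebook's method. The toy frame and the names `labeled` and `unlabeled` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy packaged frame: one unlabeled row and two labeled rows.
packaged = pd.DataFrame({
    "client_id": ["test_Client_0", "train_Client_0", "train_Client_1"],
    "target": [np.nan, 0.0, 1.0],
})

# Boolean mask of rows that carry a label.
labeled_mask = packaged["target"].notna()

labeled = packaged[labeled_mask].copy()     # supervised training rows
unlabeled = packaged[~labeled_mask].copy()  # kept aside for other uses

print(len(labeled), len(unlabeled))
```

With this variant `target` stays `NaN` for unlabeled rows, so a later `isnull().sum()` check would still report them as missing.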


### Cast all object columns to string.

In [8]:
# Cast all object columns to string.
for col in df_packaged.select_dtypes(include='object').columns:
    df_packaged[col] = df_packaged[col].astype(str)
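
One caveat with `astype(str)`: in pandas 2.x it turns a missing value into the literal text `'nan'`, which `isnull()` no longer detects. Newer pandas versions may behave differently. The source files here are reported to have no missing values, so this is a hypothetical pitfall rather than an observed problem. The sketch compares it with the nullable `"string"` dtype on a made-up series.

```python
import numpy as np
import pandas as pd

# Toy object column with one missing value (illustrative only).
s = pd.Series(["ELEC", np.nan], dtype=object)

# In pandas 2.x the NaN becomes the text 'nan'; other versions may keep it missing.
as_str = s.astype(str)

# The nullable string dtype keeps the value missing.
as_string = s.astype("string")

print("missing after astype(str):     ", as_str.isna().sum())
print("missing after astype('string'):", as_string.isna().sum())
```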


## __Tec3__: Summarize and export cleaned dataset.
---


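- `df_labeled` contains only labeled entries (4,476,738 rows) for supervised modeling. `df_packaged`, which is exported below, keeps all rows, with unlabeled targets encoded as `-1`.
- Class distribution is imbalanced: about 7.9% of labeled invoice rows (353,109 of 4,476,738) are in the positive class.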
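
The positive-class share at invoice-row level can be computed from the `target` value counts printed in Tec2.3. This is a row-level figure. A per-client share is not shown in this notebook and may differ.

```python
# Labeled value counts copied from the Tec2.3 output.
counts = {0.0: 4_123_629, 1.0: 353_109}

labeled_rows = sum(counts.values())
positive_share = counts[1.0] / labeled_rows

print(f"{positive_share:.1%} of {labeled_rows:,} labeled invoice rows are positive")
```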


In [9]:
# Save the packaged dataframe to Parquet, Pickle, and CSV.
import os

# Define path for packaged data.
packaged_path = r'C:\Users\Lenovo\OneDrive\Desktop\4IR_DataScience\DataScienceEnvironment\my_projects\electricity_tariffs_revenue_protection\electricity_tariffs_analysis_compliance\data\processed\packaged_data'

# Ensure directory exists.
os.makedirs(packaged_path, exist_ok=True)

# Save to Parquet.
df_packaged.to_parquet(os.path.join(packaged_path, "df_packaged.parquet"), index=False)

# Save to Pickle.
df_packaged.to_pickle(os.path.join(packaged_path, "df_packaged.pkl"))

# Save to CSV.
df_packaged.to_csv(os.path.join(packaged_path, "df_packaged.csv"), index=False) 

# Confirm there are no missing values in the packaged dataframe
# (unlabeled targets were encoded as -1 in Tec2.4, so 'target' shows 0 missing).
print("Missing values summary:\n", df_packaged.isnull().sum())


Missing values summary:
 client_id              0
invoice_date           0
tariff_type            0
meter_number           0
meter_status           0
meter_code             0
reading_remark         0
meter_coefficient      0
consumption_level_1    0
consumption_level_2    0
consumption_level_3    0
consumption_level_4    0
old_reading            0
new_reading            0
number_months          0
meter_type             0
district               0
client_catg            0
region                 0
creation_date          0
target                 0
dtype: int64
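
Of the three export formats, CSV does not store column dtypes, so the string casts applied earlier are re-inferred when the file is read back. The sketch below shows the effect on a made-up two-column frame, using an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Toy frame: 'meter_status' holds digit strings, as after the cast to str.
df = pd.DataFrame({"meter_status": ["0", "1"], "target": [0.0, -1.0]})
csv_text = df.to_csv(index=False)

# Without a dtype hint, pandas infers integers from the digit strings.
inferred = pd.read_csv(io.StringIO(csv_text))

# Passing dtype restores the string values.
typed = pd.read_csv(io.StringIO(csv_text), dtype={"meter_status": str})

print(inferred["meter_status"].tolist(), typed["meter_status"].tolist())
```

Downstream notebooks that read `df_packaged.csv` would need the same `dtype` argument. The Pickle file keeps the dtypes as saved.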


---

### Summary and next steps.

The data packaging process is complete. Key steps included:
- Ingesting and validating five source files with **0 missing values**.
- Merging feature and label datasets, introducing `NaN` targets where labels were absent.
- Encoding missing `target` values as `-1` in `df_packaged` and filtering them out into `df_labeled` to prepare a clean supervised training set.
- Saving the final packaged dataset to: `data/processed/packaged_data/df_packaged.parquet` (also saved as `.pkl` and `.csv`).

#### This packaged dataset serves as the foundation for downstream tasks:
- **Exploratory Data Analysis (EDA)**: Assess feature distributions, correlations, and class balance.
- **Feature Engineering**: Apply transformations, flagging, binning, and enrichment strategies.
- **Modeling**: Train and evaluate supervised models using the labeled subset.

The full dataset — including unlabeled rows — is retained for unsupervised analysis and operational reporting.

---

## Next steps: `exploratory_data_analysis.ipynb`

---
