## Data Assessment of Waterflow Historical Data

**Metadata Summary**  
- 📅 **Date of Retrieval:** JULY 1, 2025  
- 🌐 **Source of Data:** LGU San Jacinto Treasury Records
- 📄 **License/Permission:**  
- 🧑‍💼 **Prepared by:** MARK JUNE E. ALMOJUELA

This notebook splits compiled record files that cover more than one month into separate record files, one per month.

In [59]:
# Initialization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os 

# Split MAR_APR2020 record to create MAR2020 and APR2020

In [60]:
# Initialize df as None at the start
df = None

# Define the file path
file_path = os.path.normpath("../../dataset/raw/2020/compiled/MAR_APR_2020.csv")

# Print the full path for verification
print(f"Attempting to load file from: {os.path.abspath(file_path)}")

try:
    if not os.path.exists(file_path):
        print("Error: File not found at the specified location.")
        dir_path = os.path.dirname(file_path)
        if not os.path.exists(dir_path):
            print(f"Error: Directory not found: {os.path.abspath(dir_path)}")
        else:
            print("Files in directory:")
            print(os.listdir(dir_path))
    else:
        # Try UTF-8 encoding first
        try:
            df = pd.read_csv(file_path)
            print("File loaded successfully with UTF-8 encoding!")
        except UnicodeDecodeError:
            print("Trying with 'latin1' encoding...")
            df = pd.read_csv(file_path, encoding='latin1')
            print("File loaded successfully with 'latin1' encoding!")
        
        # Display info if df was loaded
        if df is not None:
            print(f"\nNumber of rows: {len(df)}")
            print("\nFirst few rows:")
            print(df.head())
            print("\nColumns in the dataset:")
            print(df.columns.tolist())
            
except Exception as e:
    print(f"An error occurred: {e}")

# The df variable is now available for use in subsequent cells

Attempting to load file from: c:\Users\Mark June Almojuela\OneDrive - Bicol University\WaterFlow\AI\Model Training\dataset\raw\2020\compiled\MAR_APR_2020.csv
Trying with 'latin1' encoding...
File loaded successfully with 'latin1' encoding!

Number of rows: 1633

First few rows:
   Control Number      Consumer's Name       Address Water Meter Serial #  \
0        501549.0       Albaño, Lilane  Alicante St.                  NaN   
1        500750.0  Aljecera, Marcelino  Alicante St.                  NaN   
2        500990.0       Almiñana, Irus  Alicante St.                  NaN   
3        500505.0       Almiñe, Edison  Alicante St.             95022096   
4        501542.0       Almiñe, Filben  Alicante St.                  NaN   

  Previous Present  Cons.    Amount  
0      218     247   29.0    87.00   
1     3030    3051   21.0    63.00   
2      471     537   66.0   198.00   
3        2      63   61.0   183.00   
4     3271    3314   43.0   129.00   

Columns in the dataset:
['Con

In [61]:
# Count of null/NaN values in each column
null_counts = df.isnull().sum()
print("Count of null/NaN values per column:")
print(null_counts[null_counts > 0])  # Only show columns with null values

# Count of rows with any null/NaN values
rows_with_nulls = df[df.isnull().any(axis=1)]
print(f"\nNumber of rows with any null/NaN values: {len(rows_with_nulls)}")

Count of null/NaN values per column:
Control Number            1
Water Meter Serial #    698
Previous                201
Present                 403
Cons.                   548
Amount                  416
dtype: int64

Number of rows with any null/NaN values: 1026


Creating MAR2020 AND APR2020 records
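
The split below rests on one assumption: consumption over the two-month span is divided evenly, so the March reading is taken as the midpoint between the recorded previous and present readings. A minimal sketch of that arithmetic follows; the helper name `split_two_month_reading` is hypothetical and not part of the notebook.

```python
def split_two_month_reading(previous, present):
    """Split one two-month reading span into two monthly (previous, present) pairs.

    Assumes consumption is spread evenly over both months. Python's round()
    rounds halves to the nearest even integer, so a gap of 29 gives 14 + 15.
    """
    midpoint = previous + round((present - previous) / 2)
    first_month = (previous, midpoint)
    second_month = (midpoint, present)
    return first_month, second_month


mar, apr = split_two_month_reading(218, 247)
print(mar, mar[1] - mar[0])  # March span and consumption
print(apr, apr[1] - apr[0])  # April span and consumption
```

For the first row of the file (previous 218, present 247) this gives 218 to 232 for March and 232 to 247 for April.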

In [62]:
# Logic test for MAR_APR2020.csv record split
for index, row in df.iterrows():
    try:
        control_number = row["Control Number"]
        consumer_name = row["Consumer's Name"]
        address = row["Address"]
        serial_number = row["Water Meter Serial #"]
        try:
            previous_reading = int(row["Previous"])
        except ValueError:
            previous_reading = 0
        
        try:
            present_reading = int(row["Present"])
        except ValueError:
            if previous_reading > 0:
                present_reading = previous_reading
            else:
                present_reading = 0
        
        current_reading = present_reading - ((present_reading - previous_reading) / 2)
        
        total_consumption = present_reading - previous_reading
        total_amount = total_consumption * 10

        print(control_number, consumer_name, address, serial_number, 
              previous_reading, current_reading, total_consumption, total_amount)
              
    except Exception as e:
        print(f"Error processing row {index}: {e}")

501549.0 Albaño, Lilane Alicante St. nan 218 232.5 29 290
500750.0 Aljecera, Marcelino Alicante St. nan 3030 3040.5 21 210
500990.0 Almiñana, Irus Alicante St. nan 471 504.0 66 660
500505.0 Almiñe, Edison Alicante St. 95022096 2 32.5 61 610
501542.0 Almiñe, Filben Alicante St. nan 3271 3292.5 43 430
500431.0 Almiñe, Franchie Alicante St. 121006093 0 0.0 0 0
500263.0 Almodal, Arna Alicante St. 9588526 5228 5240.5 25 250
501240.0 Almocera, Owen Alicante St. nan 67 102.5 71 710
500484.0 Almodal, Erlinda Alicante St. 028086-02 0 0.0 0 0
500739.0 Almodal, Jolly Alicante St. 017902-02 1795 1861.5 133 1330
500544.0 Almodal, Noe Alicante St. nan 2418 2418.0 0 0
500187.0 Almodiel, Arles Alicante St. 9074313 3210 3210.0 0 0
501447.0 Almodiel, Mary Grace Alicante St. nan 238 240.5 5 50
501453.0 Alcantara, Hilda Alicante St. nan 183 189.5 13 130
501317.0 Almoete, Ike Alicante St. nan 595 603.0 16 160
501280.0 Almojuela, Arlic Alicante St. nan 424 448.0 48 480
500248.0 Almojuela, Rogelio Alicante S

In [63]:
import csv

# Create the output directory if it doesn't exist
mar_output_dir = os.path.dirname("../../dataset/raw/2020/MAR2020.csv")
apr_output_dir = os.path.dirname("../../dataset/raw/2020/APR2020.csv")
os.makedirs(mar_output_dir, exist_ok=True)
os.makedirs(apr_output_dir, exist_ok=True)

with open("../../dataset/raw/2020/MAR2020.csv", "w", newline="", encoding='latin-1') as mar_file \
    , open("../../dataset/raw/2020/APR2020.csv", "w", newline="", encoding='latin-1') as apr_file:
    mar_csv_writer = csv.writer(mar_file)
    apr_csv_writer = csv.writer(apr_file)
    # Write header
    mar_csv_writer.writerow([
        "Control Number", "Consumer's Name", "Address", 
        "Water Meter Serial #", "Previous", "Present", 
        "Cons.", "Amount", "Connection Status"
    ])
    apr_csv_writer.writerow([
        "Control Number", "Consumer's Name", "Address", 
        "Water Meter Serial #", "Previous", "Present", 
        "Cons.", "Amount", "Connection Status"
    ])

    for index, row in df.iterrows():
        try:
            control_number = row["Control Number"]
            consumer_name = row["Consumer's Name"]
            address = row["Address"]
            serial_number = row["Water Meter Serial #"]
            connection_status = None
            
            # Handle Previous Reading
            try:
                mar_previous_reading = int(float(str(row["Previous"]).strip()))
                connection_status = "Connected"
            except (ValueError, TypeError):
                # Missing or non-numeric reading: fall back to 0 so the value
                # from the previous row is never reused by accident
                mar_previous_reading = 0
                prev_status = str(row['Previous']).strip().upper() if pd.notna(row['Previous']) else ""
                if prev_status in ["DISC", "DISC."]:
                    connection_status = "Disconnected"
                elif prev_status:
                    connection_status = prev_status.capitalize()
            
            # Handle Present Reading
            try:
                mar_present_reading = int(float(str(row["Present"]).strip()))
                connection_status = "Connected" if connection_status is None else connection_status
            except (ValueError, TypeError):
                mar_present_reading = mar_previous_reading if mar_previous_reading is not None else 0
                connection_status = "Unknown" if connection_status is None else connection_status
            
            # Calculate values for March
            mar_current_reading = mar_previous_reading + round((mar_present_reading - mar_previous_reading) / 2)
            mar_total_consumption = mar_current_reading - mar_previous_reading
            mar_total_amount = mar_total_consumption * 10 

            new_record_mar = [
                control_number, consumer_name, address, serial_number,
                mar_previous_reading, round(mar_current_reading),
                mar_total_consumption, mar_total_amount, connection_status
            ]

            # Calculate values April
            apr_previous_reading = mar_current_reading
            
            # Handle Present Reading
            try:
                apr_current_reading = int(float(str(row["Present"]).strip()))
            except (ValueError, TypeError):
                apr_current_reading = apr_previous_reading if apr_previous_reading is not None else 0
            
            # Calculate values for April
            apr_total_consumption = apr_current_reading - apr_previous_reading
            apr_total_amount = apr_total_consumption * 10 

            new_record_apr = [
                control_number, consumer_name, address, serial_number,
                apr_previous_reading, round(apr_current_reading),
                apr_total_consumption, apr_total_amount, connection_status
            ]            
            # Print Record
            print(f"Processed MAR {index} rows: {new_record_mar}")
            print(f"Processed APR {index} rows: {new_record_apr}")
            
            # Write row
            mar_csv_writer.writerow(new_record_mar)
            apr_csv_writer.writerow(new_record_apr) 

            # Reset connection status
            connection_status = None
               
        except Exception as e:
            print(f"Error processing row {index}: {e}")
            continue

print("Processing complete!")

Processed MAR 0 rows: [501549.0, 'Albaño, Lilane', 'Alicante St.', nan, 218, 232, 14, 140, 'Connected']
Processed APR 0 rows: [501549.0, 'Albaño, Lilane', 'Alicante St.', nan, 232, 247, 15, 150, 'Connected']
Processed MAR 1 rows: [500750.0, 'Aljecera, Marcelino', 'Alicante St.', nan, 3030, 3040, 10, 100, 'Connected']
Processed APR 1 rows: [500750.0, 'Aljecera, Marcelino', 'Alicante St.', nan, 3040, 3051, 11, 110, 'Connected']
Processed MAR 2 rows: [500990.0, 'Almiñana, Irus', 'Alicante St.', nan, 471, 504, 33, 330, 'Connected']
Processed APR 2 rows: [500990.0, 'Almiñana, Irus', 'Alicante St.', nan, 504, 537, 33, 330, 'Connected']
Processed MAR 3 rows: [500505.0, 'Almiñe, Edison', 'Alicante St.', '95022096', 2, 32, 30, 300, 'Connected']
Processed APR 3 rows: [500505.0, 'Almiñe, Edison', 'Alicante St.', '95022096', 32, 63, 31, 310, 'Connected']
Processed MAR 4 rows: [501542.0, 'Almiñe, Filben', 'Alicante St.', nan, 3271, 3293, 22, 220, 'Connected']
Processed APR 4 rows: [501542.0, 'Almiñ

In [64]:
# Read the data with optimized dtypes
dtypes = {
    'Control Number': 'str',
    "Consumer's Name": 'str',
    'Address': 'str',
    'Water Meter Serial #': 'str',
    'Previous': 'float64',
    'Present': 'float64',
    'Current': 'float64',
    'Cons.': 'float64',
    'Amount': 'float64'
}

# Read the CSV
new_df = pd.read_csv("../../dataset/raw/2020/APR2020.csv", 
                    encoding='latin-1',
                    dtype=dtypes)

# Check for negative consumption
print("=== Negative Consumption Summary ===")
neg_consumption = new_df[new_df['Cons.'] < 0]
print(f"Total rows with negative consumption: {len(neg_consumption)}")
if not neg_consumption.empty:
    print("\nSample of rows with negative consumption:")
    print(neg_consumption[['Control Number', 'Previous', 'Present', 'Cons.']].head())

# Check for negative amount
print("\n=== Negative Amount Summary ===")
neg_amount = new_df[new_df['Amount'] < 0]
print(f"Total rows with negative amount: {len(neg_amount)}")
if not neg_amount.empty:
    print("\nSample of rows with negative amount:")
    print(neg_amount[['Control Number', 'Cons.', 'Amount']].head())

# Additional checks
print("\n=== Additional Data Quality Checks ===")
print(f"Total rows: {len(new_df)}")
print(f"Rows with zero consumption: {len(new_df[new_df['Cons.'] == 0])}")
print(f"Rows with missing values: {new_df.isnull().any(axis=1).sum()}")

=== Negative Consumption Summary ===
Total rows with negative consumption: 30

Sample of rows with negative consumption:
    Control Number  Previous  Present  Cons.
33        500330.0     177.0     32.0 -145.0
171       500741.0      58.0     14.0  -44.0
298       500957.0      81.0      5.0  -76.0
299       501002.0     109.0     61.0  -48.0
382       501654.0     210.0     47.0 -163.0

=== Negative Amount Summary ===
Total rows with negative amount: 30

Sample of rows with negative amount:
    Control Number  Cons.  Amount
33        500330.0 -145.0 -1450.0
171       500741.0  -44.0  -440.0
298       500957.0  -76.0  -760.0
299       501002.0  -48.0  -480.0
382       501654.0 -163.0 -1630.0

=== Additional Data Quality Checks ===
Total rows: 1633
Rows with zero consumption: 544
Rows with missing values: 698
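
One likely source of negative April consumption is a recorded present reading that is lower than the previous reading in the compiled file. A minimal sketch for listing such rows before splitting follows; the function name `find_reading_rollbacks` is hypothetical, and the default column names are assumed to match the compiled CSV.

```python
import pandas as pd


def find_reading_rollbacks(df, previous_col="Previous", present_col="Present"):
    """Return rows whose numeric present reading is lower than the previous one."""
    previous = pd.to_numeric(df[previous_col], errors="coerce")
    present = pd.to_numeric(df[present_col], errors="coerce")
    # Comparisons involving NaN are False, so non-numeric rows are skipped
    return df[present < previous]


# Illustrative data only, not taken from the treasury records
sample = pd.DataFrame({
    "Control Number": ["A", "B", "C", "D"],
    "Previous": ["100", "250", "DISC", "40"],
    "Present": ["120", "30", "DISC", None],
})
rollbacks = find_reading_rollbacks(sample)
print(rollbacks)
```

Rows flagged this way can then be reviewed by hand instead of being carried into the monthly files with a negative amount.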


# Create records for AUG2022 and SEP2022

In [65]:
# Review of consumption difference
import pandas as pd
from IPython.display import display

july_df = pd.read_csv("../../dataset/clean/training/semi_clean/2022/JUL_2022_semi_clean.csv", encoding='latin-1')
oct_df = pd.read_csv("../../dataset/clean/training/semi_clean/2022/OCT_2022_semi_clean.csv", encoding='latin-1')
nov_df = pd.read_csv("../../dataset/clean/training/semi_clean/2022/NOV_2022_semi_clean.csv", encoding='latin-1')
dec_df = pd.read_csv("../../dataset/clean/training/semi_clean/2022/DEC_2022_semi_clean.csv", encoding='latin-1')

july_mini_df = july_df[['Control Number', 'Account Name', 'Cleaned Previous Reading', 'Cleaned Present Reading']]
oct_mini_df = oct_df[['Control Number', 'Account Name', 'Cleaned Previous Reading', 'Cleaned Present Reading']]
nov_mini_df = nov_df[['Control Number', 'Account Name', 'Cleaned Previous Reading', 'Cleaned Present Reading']]
dec_mini_df = dec_df[['Control Number', 'Account Name', 'Cleaned Previous Reading', 'Cleaned Present Reading']]

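# pandas applies suffixes only to column names that clash between the two frames.
# The first merge therefore labels the July and October columns; the November
# columns keep their plain names in the second merge and are labeled, together
# with the December columns, by the third merge.
merged_df = pd.merge(july_mini_df, oct_mini_df, on=['Control Number', 'Account Name'], suffixes=(' July', ' October'))
merged_df = pd.merge(merged_df, nov_mini_df, on=['Control Number', 'Account Name'], suffixes=(' October', ' November'))
merged_df = pd.merge(merged_df, dec_mini_df, on=['Control Number', 'Account Name'], suffixes=(' November', ' December'))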

merged_df['Consumption Difference'] = pd.to_numeric(merged_df['Cleaned Present Reading October']) - pd.to_numeric(merged_df['Cleaned Present Reading July'])

negative_df = merged_df[merged_df['Consumption Difference'] < 0]
print("Negative Consumption Difference Count: ", len(negative_df))
display(negative_df[['Control Number', 'Account Name', 'Cleaned Previous Reading July','Cleaned Present Reading July', 'Cleaned Previous Reading October', 'Cleaned Present Reading October', 'Consumption Difference']].head())
display(negative_df)

result = merged_df.loc[merged_df['Consumption Difference'] < 0, 'Control Number'].unique()
invalid_record = [control_number.astype(int) for control_number in result]
print("Invalid Record Count: ", len(invalid_record))
print(invalid_record)

Negative Consumption Difference Count:  14


Unnamed: 0,Control Number,Account Name,Cleaned Previous Reading July,Cleaned Present Reading July,Cleaned Previous Reading October,Cleaned Present Reading October,Consumption Difference
842,500606,"Almoete, Oscar",2571.0,2571.0,2548.0,2555.0,-16.0
927,500375,"Mira, Noe",3514.0,3514.0,3512.0,3512.0,-2.0
976,500682,"Almosara, Celin",3009.0,3024.0,43.0,43.0,-2981.0
1012,501355,"Bocboc, Evelyn",1000.0,1030.0,6.0,6.0,-1024.0
1013,500881,"Bocboc, Lily",1456.0,1478.0,74.0,74.0,-1404.0


Unnamed: 0,Control Number,Account Name,Cleaned Previous Reading July,Cleaned Present Reading July,Cleaned Previous Reading October,Cleaned Present Reading October,Cleaned Previous Reading November,Cleaned Present Reading November,Cleaned Previous Reading December,Cleaned Present Reading December,Consumption Difference
842,500606,"Almoete, Oscar",2571.0,2571.0,2548.0,2555.0,2555.0,2601.0,2601.0,2605.0,-16.0
927,500375,"Mira, Noe",3514.0,3514.0,3512.0,3512.0,3512.0,3514.0,3514.0,3516.0,-2.0
976,500682,"Almosara, Celin",3009.0,3024.0,43.0,43.0,43.0,43.0,95.0,95.0,-2981.0
1012,501355,"Bocboc, Evelyn",1000.0,1030.0,6.0,6.0,6.0,18.0,18.0,23.0,-1024.0
1013,500881,"Bocboc, Lily",1456.0,1478.0,74.0,74.0,74.0,74.0,245.0,245.0,-1404.0
1028,501592,"Dejino, Evangeline",1818.0,1881.0,55.0,55.0,55.0,72.0,72.0,92.0,-1826.0
1075,501158,"Gupalao, Roger",609.0,622.0,8.0,8.0,8.0,40.0,40.0,47.0,-614.0
1101,501688,"Pinaranda, Maricel",347.0,350.0,31.0,31.0,31.0,59.0,59.0,83.0,-319.0
1127,500066,"Almodal, Glenda",4755.0,4755.0,19.0,22.0,22.0,25.0,25.0,27.0,-4733.0
1287,500960,"Almojuela, Nila",973.0,973.0,26.0,35.0,35.0,48.0,48.0,63.0,-938.0


Invalid Record Count:  14
[np.int64(500606), np.int64(500375), np.int64(500682), np.int64(501355), np.int64(500881), np.int64(501592), np.int64(501158), np.int64(501688), np.int64(500066), np.int64(500960), np.int64(501087), np.int64(500917), np.int64(500426), np.int64(501358)]
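
`pd.merge` applies `suffixes` only to column names present in both frames, so chaining merges across several months can leave some columns unlabeled or labeled by a later call. A sketch of one alternative renames each month's columns before merging; `label_month`, `merge_months`, and the sample values are hypothetical.

```python
from functools import reduce

import pandas as pd

KEYS = ["Control Number", "Account Name"]


def label_month(df, month, keys=KEYS):
    """Append the month name to every non-key column."""
    return df.rename(columns={c: f"{c} {month}" for c in df.columns if c not in keys})


def merge_months(frames_by_month, keys=KEYS):
    """Inner-join monthly frames on the key columns, one labeled set per month."""
    labeled = [label_month(df, month, keys) for month, df in frames_by_month.items()]
    return reduce(lambda left, right: left.merge(right, on=keys, how="inner"), labeled)


def make_month(present):
    """Build a tiny illustrative monthly frame."""
    return pd.DataFrame({
        "Control Number": [1, 2],
        "Account Name": ["A", "B"],
        "Cleaned Present Reading": present,
    })


merged = merge_months({
    "July": make_month([10.0, 20.0]),
    "October": make_month([16.0, 18.0]),
    "November": make_month([19.0, 21.0]),
})
print(merged.columns.tolist())
```

Labeling up front keeps each column's month explicit regardless of the order in which the frames are merged.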


In [66]:
# Drop the records with negative consumption difference from July and October Records
filtered_jul_df = july_df[~july_df['Control Number'].isin(invalid_record)]
jul_df_count = len(july_df)
filtered_jul_df_count = len(filtered_jul_df)
filtered_oct_df = oct_df[~oct_df['Control Number'].isin(invalid_record)] 
oct_df_count = len(oct_df)
filtered_oct_df_count = len(filtered_oct_df)

columns = ['Control Number', 
            'Account Name', 
            'Service Address', 
            'Previous Reading', 
            'Present Reading', 
            'Cleaned Previous Reading', 
            'Cleaned Present Reading', 
            'Cleaned Consumption']

filtered_jul_df = filtered_jul_df[columns]
filtered_oct_df = filtered_oct_df[columns]

if jul_df_count - len(invalid_record) == len(filtered_jul_df):
    print("July records are valid")
if oct_df_count - len(invalid_record) == len(filtered_oct_df):
    print("October records are valid")

July records are valid
October records are valid


In [67]:
# Null Checks
# Double-quoted f-strings keep the nested single quotes valid before Python 3.12
print(f"Null July Present Readings Count: {filtered_jul_df['Cleaned Present Reading'].isnull().sum()}")
print(f"Null July Previous Readings Count: {filtered_jul_df['Cleaned Previous Reading'].isnull().sum()}")

print(f"Null Oct Present Readings Count: {filtered_oct_df['Cleaned Present Reading'].isnull().sum()}")
print(f"Null Oct Previous Readings Count: {filtered_oct_df['Cleaned Previous Reading'].isnull().sum()}")

# Coerce to numeric and create masks for invalid entries
jul_present_numeric = pd.to_numeric(filtered_jul_df['Cleaned Present Reading'], errors="coerce")
jul_previous_numeric = pd.to_numeric(filtered_jul_df['Cleaned Previous Reading'], errors="coerce")

oct_present_numeric = pd.to_numeric(filtered_oct_df['Cleaned Present Reading'], errors="coerce")
oct_previous_numeric = pd.to_numeric(filtered_oct_df['Cleaned Previous Reading'], errors="coerce")

# Boolean masks where coercion failed (i.e., non-numeric values)
jul_invalid_present_mask = jul_present_numeric.isna() & filtered_jul_df['Cleaned Present Reading'].notna()
jul_invalid_previous_mask = jul_previous_numeric.isna() & filtered_jul_df['Cleaned Previous Reading'].notna()

oct_invalid_present_mask = oct_present_numeric.isna() & filtered_oct_df['Cleaned Present Reading'].notna()
oct_invalid_previous_mask = oct_previous_numeric.isna() & filtered_oct_df['Cleaned Previous Reading'].notna()

# Extract invalid entries
jul_invalid_present_values = filtered_jul_df.loc[jul_invalid_present_mask, "Cleaned Present Reading"].unique()
jul_invalid_previous_values = filtered_jul_df.loc[jul_invalid_previous_mask, "Cleaned Previous Reading"].unique()

oct_invalid_present_values = filtered_oct_df.loc[oct_invalid_present_mask, "Cleaned Present Reading"].unique()
oct_invalid_previous_values = filtered_oct_df.loc[oct_invalid_previous_mask, "Cleaned Previous Reading"].unique()

# Report results
print('\n\nJULY\n')
print(f"Non-Numeric Present Readings Count: {jul_invalid_present_mask.sum()}")
print(f"Values: {jul_invalid_present_values.tolist()}")

print(f"Non-Numeric Previous Readings Count: {jul_invalid_previous_mask.sum()}")
print(f"Values: {jul_invalid_previous_values.tolist()}")

print('\n\nOCTOBER\n')
print(f"Non-Numeric Present Readings Count: {oct_invalid_present_mask.sum()}")
print(f"Values: {oct_invalid_present_values.tolist()}")

print(f"Non-Numeric Previous Readings Count: {oct_invalid_previous_mask.sum()}")
print(f"Values: {oct_invalid_previous_values.tolist()}")


Null July Present Readings Count: 529
Null July Previous Readings Count: 529
Null Oct Present Readings Count: 527
Null Oct Previous Readings Count: 527


JULY

Non-Numeric Present Readings Count: 0
Values: []
Non-Numeric Previous Readings Count: 0
Values: []


OCTOBER

Non-Numeric Present Readings Count: 0
Values: []
Non-Numeric Previous Readings Count: 0
Values: []


In [68]:
merged_df = filtered_jul_df.merge(filtered_oct_df, on=['Control Number', 'Account Name'], how='inner', suffixes=(' July', ' October'))
display(merged_df.head())
display(merged_df.tail())
display(merged_df.info())

Unnamed: 0,Control Number,Account Name,Service Address July,Previous Reading July,Present Reading July,Cleaned Previous Reading July,Cleaned Present Reading July,Cleaned Consumption July,Service Address October,Previous Reading October,Present Reading October,Cleaned Previous Reading October,Cleaned Present Reading October,Cleaned Consumption October
0,501549,"Albano, Lilane",Alicante St.,544.0,544.0,544.0,544.0,0,Alicante St.,585.0,610.0,585.0,610.0,25
1,500750,"Aljecera, Marcelino",Alicante St.,3274.0,3274.0,3274.0,3274.0,0,Alicante St.,3327.0,3359.0,3327.0,3359.0,32
2,500990,"Alminana, Irus",Alicante St.,1401.0,1401.0,1401.0,1401.0,0,Alicante St.,1581.0,1631.0,1581.0,1631.0,50
3,501704,"Alminana, Violeta",Alicante St.,147.0,147.0,147.0,147.0,0,Alicante St.,184.0,,184.0,184.0,0
4,500505,"Almine, Edison",Alicante St.,894.0,894.0,894.0,894.0,0,Alicante St.,983.0,1023.0,983.0,1023.0,40


Unnamed: 0,Control Number,Account Name,Service Address July,Previous Reading July,Present Reading July,Cleaned Previous Reading July,Cleaned Present Reading July,Cleaned Consumption July,Service Address October,Previous Reading October,Present Reading October,Cleaned Previous Reading October,Cleaned Present Reading October,Cleaned Consumption October
1886,500641,"Moya, Concepcion",Villamor St.,,,,,0,Villamor St.,,,,,0
1887,500021,"Nacino, Christopher",Villamor St.,1829.0,1829.0,1829.0,1829.0,0,Villamor St.,1832.0,1834.0,1832.0,1834.0,2
1888,500091,"Ragasa, Fe",Villamor St.,3369.0,3379.0,3369.0,3379.0,10,Villamor St.,3428.0,3453.0,3428.0,3453.0,25
1889,501109,"Almocera, Ricky","Puro, Calipat-an",,,,,0,"Puro, Calipat-an",,,,,0
1890,501381,"Gaurano, Clutario","Puro, Calipat-an",,,,,0,"Puro, Calipat-an",,,,,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1891 entries, 0 to 1890
Data columns (total 14 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Control Number                    1891 non-null   int64  
 1   Account Name                      1891 non-null   object 
 2   Service Address July              1891 non-null   object 
 3   Previous Reading July             1358 non-null   float64
 4   Present Reading July              1355 non-null   float64
 5   Cleaned Previous Reading July     1365 non-null   float64
 6   Cleaned Present Reading July      1365 non-null   float64
 7   Cleaned Consumption July          1891 non-null   int64  
 8   Service Address October           1891 non-null   object 
 9   Previous Reading October          1348 non-null   float64
 10  Present Reading October           1306 non-null   float64
 11  Cleaned Previous Reading October  1381 non-null   float64
 12  Cleane

None

In [69]:
import numpy as np

def process_consumption(df):
    df['Consumption Difference'] = df['Cleaned Present Reading October'] - df['Cleaned Present Reading July']
    df['Consumption Difference'] = df['Consumption Difference'].fillna(0)
    df['August Consumption'] = np.ceil(df['Consumption Difference'].clip(lower=0) / 2)
    df['September Consumption'] = df['Consumption Difference'] - df['August Consumption']
    df['September Consumption'] = df['September Consumption'].clip(lower=0)

    df['Cleaned Present Reading August'] = df['Cleaned Present Reading July'] + df['August Consumption']
    df['Cleaned Present Reading September'] = df['Cleaned Present Reading August'] + df['September Consumption']
    return df

merged_df = process_consumption(merged_df)
merged_df.head()
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1891 entries, 0 to 1890
Data columns (total 19 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Control Number                     1891 non-null   int64  
 1   Account Name                       1891 non-null   object 
 2   Service Address July               1891 non-null   object 
 3   Previous Reading July              1358 non-null   float64
 4   Present Reading July               1355 non-null   float64
 5   Cleaned Previous Reading July      1365 non-null   float64
 6   Cleaned Present Reading July       1365 non-null   float64
 7   Cleaned Consumption July           1891 non-null   int64  
 8   Service Address October            1891 non-null   object 
 9   Previous Reading October           1348 non-null   float64
 10  Present Reading October            1306 non-null   float64
 11  Cleaned Previous Reading October   1381 non-null   float
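
The rule in `process_consumption` gives August the larger half when the July-to-October difference is odd and leaves the remainder to September. A minimal sketch of that split on plain numbers follows; the helper name `split_difference` is hypothetical.

```python
import math


def split_difference(difference):
    """Split a reading difference into (first_month, second_month) consumption.

    Negative differences are treated as zero, mirroring the clip(lower=0) steps.
    """
    difference = max(difference, 0)
    first_month = math.ceil(difference / 2)
    second_month = difference - first_month
    return first_month, second_month


print(split_difference(66))  # even difference
print(split_difference(85))  # odd difference: the first month gets the extra unit
```

With a July reading of 3274 and an October reading of 3359, the difference of 85 splits into 43 for August and 42 for September.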

In [70]:
# Accounts with an October reading but no July reading cannot be interpolated
# and are set aside for manual review
for_review_mask = (
    merged_df['Cleaned Present Reading July'].isna()
    & merged_df['Cleaned Present Reading October'].notna()
)
for_review_df = merged_df[for_review_mask]
display(for_review_df.head())
print(len(for_review_df))

Unnamed: 0,Control Number,Account Name,Service Address July,Previous Reading July,Present Reading July,Cleaned Previous Reading July,Cleaned Present Reading July,Cleaned Consumption July,Service Address October,Previous Reading October,Present Reading October,Cleaned Previous Reading October,Cleaned Present Reading October,Cleaned Consumption October,Consumption Difference,August Consumption,September Consumption,Cleaned Present Reading August,Cleaned Present Reading September
90,500623,"Grencio, Maribel",Alicante St.,,,,,0,Alicante St.,782.0,819.0,782.0,819.0,37,0.0,0.0,0.0,,
106,500717,"Letada, Gener",Alicante St.,,,,,0,Alicante St.,1455.0,,1455.0,1455.0,0,0.0,0.0,0.0,,
128,501878,"Ramiro, Geny",Alicante St.,,,,,0,Alicante St.,42.0,54.0,42.0,54.0,12,0.0,0.0,0.0,,
138,500487,"Sese, Robert",Alicante St.,,,,,0,Alicante St.,74.0,79.0,74.0,79.0,5,0.0,0.0,0.0,,
143,500381,UCCP,Alicante St.,,,,,0,Alicante St.,7388.0,7393.0,7388.0,7393.0,5,0.0,0.0,0.0,,


74


In [71]:
filtered_jul_df.head()

aug_columns = ['Control Number', 'Account Name', 'Service Address July', 'Cleaned Present Reading July', 'Cleaned Present Reading August', 'August Consumption']
sep_columns = ['Control Number', 'Account Name', 'Service Address July', 'Cleaned Present Reading August', 'Cleaned Present Reading September', 'September Consumption']

aug_df = merged_df[aug_columns]
aug_df = aug_df.rename(columns={
    'Control Number': 'Control Number', 
    'Account Name': 'Account Name', 
    'Service Address July': 'Service Address', 
    'Cleaned Present Reading July': 'Previous Reading', 
    'Cleaned Present Reading August': 'Present Reading', 
    'August Consumption': 'Consumption'})

sep_df = merged_df[sep_columns]
sep_df = sep_df.rename(columns={
    'Control Number': 'Control Number', 
    'Account Name': 'Account Name', 
    'Service Address July': 'Service Address', 
    'Cleaned Present Reading August': 'Previous Reading', 
    'Cleaned Present Reading September': 'Present Reading', 
    'September Consumption': 'Consumption'})

display(aug_df.head())
display(sep_df.head())

Unnamed: 0,Control Number,Account Name,Service Address,Previous Reading,Present Reading,Consumption
0,501549,"Albano, Lilane",Alicante St.,544.0,577.0,33.0
1,500750,"Aljecera, Marcelino",Alicante St.,3274.0,3317.0,43.0
2,500990,"Alminana, Irus",Alicante St.,1401.0,1516.0,115.0
3,501704,"Alminana, Violeta",Alicante St.,147.0,166.0,19.0
4,500505,"Almine, Edison",Alicante St.,894.0,959.0,65.0


Unnamed: 0,Control Number,Account Name,Service Address,Previous Reading,Present Reading,Consumption
0,501549,"Albano, Lilane",Alicante St.,577.0,610.0,33.0
1,500750,"Aljecera, Marcelino",Alicante St.,3317.0,3359.0,42.0
2,500990,"Alminana, Irus",Alicante St.,1516.0,1631.0,115.0
3,501704,"Alminana, Violeta",Alicante St.,166.0,184.0,18.0
4,500505,"Almine, Edison",Alicante St.,959.0,1023.0,64.0


In [72]:
from pathlib import Path

output_dir = Path("../../dataset/raw/2022/compiled")
output_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
aug_df.to_csv(output_dir / 'AUG_2022.csv', index=False)
sep_df.to_csv(output_dir / 'SEP_2022.csv', index=False)

print(f"August 2022 records: {aug_df.shape[0]}")
print(f"September 2022 records: {sep_df.shape[0]}")

August 2022 records: 1891
September 2022 records: 1891
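
Since August and September are interpolated, a quick consistency check before relying on the files may be worthwhile: September should start where August ends, and consumption should equal present minus previous in each file. The sketch below assumes both frames share the column layout shown above and the same row order; `check_interpolated_months` is a hypothetical helper.

```python
import pandas as pd


def check_interpolated_months(first_df, second_df):
    """Return a dict of boolean checks for two consecutive interpolated months."""
    # Ignore rows without readings; NaN never compares equal
    valid = first_df["Present Reading"].notna() & second_df["Present Reading"].notna()
    first, second = first_df[valid], second_df[valid]
    return {
        "continuous": bool((second["Previous Reading"] == first["Present Reading"]).all()),
        "first_consistent": bool(
            (first["Present Reading"] - first["Previous Reading"] == first["Consumption"]).all()
        ),
        "second_consistent": bool(
            (second["Present Reading"] - second["Previous Reading"] == second["Consumption"]).all()
        ),
    }


# Values copied from the first two displayed rows of the August and September frames
aug = pd.DataFrame({
    "Previous Reading": [544.0, 3274.0],
    "Present Reading": [577.0, 3317.0],
    "Consumption": [33.0, 43.0],
})
sep = pd.DataFrame({
    "Previous Reading": [577.0, 3317.0],
    "Present Reading": [610.0, 3359.0],
    "Consumption": [33.0, 42.0],
})
checks = check_interpolated_months(aug, sep)
print(checks)
```

Any check that comes back `False` points to rows worth inspecting before the files are used downstream.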


# Create records for NOV2023 and DEC2023

In [73]:
# Review of consumption difference
import pandas as pd
from IPython.display import display

def enforce_schema(df):
    df['Control Number'] = df['Control Number'].astype(str)
    df['Account Name'] = df['Account Name'].astype(str)
    return df

def sanitize_control_numbers(df):
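    # Remove only a trailing ".0"; an unanchored or regex-interpreted '.0' could match elsewhere
    df['Control Number'] = df['Control Number'].str.replace(r'\.0$', '', regex=True)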
    return df

sep_df = pd.read_csv("../../dataset/clean/training/semi_clean/2023/SEP_2023_semi_clean.csv", encoding='utf-8')
sep_df = enforce_schema(sep_df)
sep_df = sanitize_control_numbers(sep_df)

oct_df = pd.read_csv("../../dataset/clean/training/semi_clean/2023/OCT_2023_semi_clean.csv", encoding='utf-8')
oct_df = enforce_schema(oct_df)
oct_df = sanitize_control_numbers(oct_df)

jan_df = pd.read_csv("../../dataset/clean/training/semi_clean/2024/JAN_2024_semi_clean.csv", encoding='utf-8')
jan_df = enforce_schema(jan_df)
jan_df = sanitize_control_numbers(jan_df)

sep_df = sep_df[['Control Number', 'Account Name', 'Cleaned Previous Reading', 'Cleaned Present Reading']]
sep_df = sep_df.rename(columns={'Cleaned Previous Reading': 'Cleaned Previous Reading September', 'Cleaned Present Reading': 'Cleaned Present Reading September'})

oct_df = oct_df[['Control Number', 'Account Name','Service Address', 'Cleaned Previous Reading', 'Cleaned Present Reading']]
oct_df = oct_df.rename(columns={'Cleaned Previous Reading': 'Cleaned Previous Reading October', 'Cleaned Present Reading': 'Cleaned Present Reading October'})

jan_df = jan_df[['Control Number', 'Account Name', 'Service Address', 'Cleaned Previous Reading', 'Cleaned Present Reading']]
jan_df = jan_df.rename(columns={'Cleaned Previous Reading': 'Cleaned Previous Reading January', 'Cleaned Present Reading': 'Cleaned Present Reading January'})

merged_df = pd.merge(oct_df, jan_df, on=['Control Number', 'Account Name'], how='inner')

merged_df['Consumption Difference'] = pd.to_numeric(merged_df['Cleaned Present Reading January']) - pd.to_numeric(merged_df['Cleaned Present Reading October'])

negative_df = merged_df[merged_df['Consumption Difference'] < 0]
print("Negative Consumption Difference Count: ", len(negative_df))
display(negative_df[['Control Number', 'Account Name', 'Cleaned Previous Reading October','Cleaned Present Reading October', 'Cleaned Previous Reading January', 'Cleaned Present Reading January', 'Consumption Difference']].head())

result = merged_df.loc[merged_df['Consumption Difference'] < 0, 'Control Number'].unique()
invalid_record = [int(control_number) for control_number in result]
print("Invalid Record Count: ", len(invalid_record))
print(invalid_record)

Negative Consumption Difference Count:  25


Unnamed: 0,Control Number,Account Name,Cleaned Previous Reading October,Cleaned Present Reading October,Cleaned Previous Reading January,Cleaned Present Reading January,Consumption Difference
104,501291,"Herato, Jovelyn",1605.0,1605.0,10.0,20.0,-1585.0
135,501088,"Pamotillo, Eva",4339.0,4340.0,2457.0,2457.0,-1883.0
194,500215,"Almojuela, Junie",752.0,752.0,15.0,15.0,-737.0
197,500149,"Atabay, Elisa",6198.0,6198.0,41.0,68.0,-6130.0
230,500156,"Delariarte, Vibiana",3258.0,3258.0,36.0,57.0,-3201.0


Invalid Record Count:  25
[501291, 501088, 500215, 500149, 500156, 501629, 501427, 500147, 500297, 500222, 501223, 500721, 501987, 501413, 501618, 501525, 501711, 501253, 501326, 501678, 501591, 501186, 501215, 500092, 501257]
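
The comparison above can be wrapped in a small helper so the same check is reusable for other month pairs. This is a minimal sketch on made-up rows; `find_negative_differences` and its parameters are hypothetical names, not part of the notebook.

```python
import pandas as pd

def find_negative_differences(earlier, later, key, reading_col):
    """Return control numbers whose later reading is below the earlier one."""
    merged = earlier.merge(later, on=key, how="inner", suffixes=(" earlier", " later"))
    diff = (pd.to_numeric(merged[f"{reading_col} later"], errors="coerce")
            - pd.to_numeric(merged[f"{reading_col} earlier"], errors="coerce"))
    # NaN differences (a missing reading on either side) compare as False and are not flagged
    return merged.loc[diff < 0, key].astype(int).unique().tolist()

# Toy data (illustrative values only)
oct_toy = pd.DataFrame({"Control Number": ["1", "2", "3"],
                        "Cleaned Present Reading": [100.0, 250.0, None]})
jan_toy = pd.DataFrame({"Control Number": ["1", "2", "3"],
                        "Cleaned Present Reading": [120.0, 15.0, 40.0]})

invalid = find_negative_differences(oct_toy, jan_toy, "Control Number", "Cleaned Present Reading")
print(invalid)
```

Only account 2 is flagged here: its reading drops from 250 to 15, while account 3 has no earlier reading and is skipped rather than treated as negative.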


In [74]:
# Drop the records with a negative consumption difference from the merged October and January records
filtered_merged_df = merged_df[~merged_df['Control Number'].astype(int).isin(invalid_record)]
print("Valid Record Count: ", filtered_merged_df.shape[0])

columns = ['Control Number', 
            'Account Name', 
            'Service Address_y', 
            'Cleaned Previous Reading October', 
            'Cleaned Present Reading October', 
            'Cleaned Previous Reading January', 
            'Cleaned Present Reading January', 
            'Consumption Difference']

filtered_merged_df = filtered_merged_df[columns]

if merged_df.shape[0] - len(invalid_record) == len(filtered_merged_df):
    print("Merged records are valid")
    display(filtered_merged_df.head())
else:
    print("Merged records are invalid")

Valid Record Count:  2013
Merged records are valid


Unnamed: 0,Control Number,Account Name,Service Address_y,Cleaned Previous Reading October,Cleaned Present Reading October,Cleaned Previous Reading January,Cleaned Present Reading January,Consumption Difference
0,501549,"Albano, Lilane",Alicante St.,767.0,780.0,790.0,797.0,17.0
1,501453,"Alcantara, Hilda",Alicante St.,374.0,377.0,382.0,386.0,9.0
2,500750,"Aljecera, Marcelino",Alicante St.,3498.0,3523.0,3558.0,3565.0,42.0
3,500990,"Alminana, Irus",Alicante St.,2228.0,2287.0,2367.0,2412.0,125.0
4,501704,"Alminana, Violeta",Alicante St.,301.0,314.0,352.0,358.0,44.0


In [75]:
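import numpy as np

def process_consumption(df):
    # Treat a missing difference as zero consumption across the two missing months
    df['Consumption Difference'] = df['Consumption Difference'].fillna(0)
    # November takes the larger half (ceil) of the non-negative difference
    df['November Consumption'] = np.ceil(df['Consumption Difference'].clip(lower=0) / 2)
    # December takes the remainder; a negative difference leaves a negative remainder, clipped to zero
    df['December Consumption'] = df['Consumption Difference'] - df['November Consumption']
    df['December Consumption'] = df['December Consumption'].clip(lower=0)

    # Rebuild the missing present readings by accumulating from October
    df['Cleaned Present Reading November'] = df['Cleaned Present Reading October'] + df['November Consumption']
    df['Cleaned Present Reading December'] = df['Cleaned Present Reading November'] + df['December Consumption']
    return df

# Note: this runs on merged_df (every merged row), not on filtered_merged_df
merged_df = process_consumption(merged_df)
merged_df.head()
merged_df.info()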

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2038 entries, 0 to 2037
Data columns (total 13 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Control Number                    2038 non-null   object 
 1   Account Name                      2038 non-null   object 
 2   Service Address_x                 2038 non-null   object 
 3   Cleaned Previous Reading October  1577 non-null   float64
 4   Cleaned Present Reading October   1577 non-null   float64
 5   Service Address_y                 2038 non-null   object 
 6   Cleaned Previous Reading January  1564 non-null   float64
 7   Cleaned Present Reading January   1564 non-null   float64
 8   Consumption Difference            2038 non-null   float64
 9   November Consumption              2038 non-null   float64
 10  December Consumption              2038 non-null   float64
 11  Cleaned Present Reading November  1577 non-null   float64
 12  Cleane
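
To make the split rule concrete, here is a small worked example on toy rows. It is a sketch that clips the difference before halving, which gives the same result as the function above; the first two rows reuse figures from the October and January sample rows, and the third is an illustrative negative case.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Cleaned Present Reading October": [780.0, 377.0, 1605.0],
    "Consumption Difference": [17.0, 9.0, -1585.0],
})

# Negative differences contribute no consumption to either month
total = toy["Consumption Difference"].clip(lower=0)
toy["November Consumption"] = np.ceil(total / 2)           # larger half
toy["December Consumption"] = total - toy["November Consumption"]  # remainder

# Readings accumulate forward from October
toy["Cleaned Present Reading November"] = (
    toy["Cleaned Present Reading October"] + toy["November Consumption"]
)
toy["Cleaned Present Reading December"] = (
    toy["Cleaned Present Reading November"] + toy["December Consumption"]
)
print(toy)
```

A difference of 17 is split into 9 for November and 8 for December, so the December reading lands exactly on the January present reading of 797. The negative row keeps its October reading for both months.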

In [76]:
import numpy as np

def calculate_values(df):
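    # Mask rows with a usable difference. If process_consumption() has already run, NaN
    # differences were filled with 0, so every row passes, including rows whose difference
    # is negative (their November and December consumption then come out negative).
    valid_mask = df['Consumption Difference'].notna()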

    # November consumption: ceil half of the total difference
    df.loc[valid_mask, 'Consumption November'] = np.ceil(df.loc[valid_mask, 'Consumption Difference'] / 2)

    # December consumption: remainder
    df.loc[valid_mask, 'Consumption December'] = df.loc[valid_mask, 'Consumption Difference'] - df.loc[valid_mask, 'Consumption November']

    # November readings
    df.loc[valid_mask, 'Previous Reading November'] = df.loc[valid_mask, 'Cleaned Present Reading October']
    df.loc[valid_mask, 'Present Reading November'] = df.loc[valid_mask, 'Previous Reading November'] + df.loc[valid_mask, 'Consumption November']

    # December readings
    df.loc[valid_mask, 'Previous Reading December'] = df.loc[valid_mask, 'Present Reading November']
    df.loc[valid_mask, 'Present Reading December'] = df.loc[valid_mask, 'Previous Reading December'] + df.loc[valid_mask, 'Consumption December']

    return df


merged_df = calculate_values(merged_df)
merged_df.head()
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2038 entries, 0 to 2037
Data columns (total 19 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Control Number                    2038 non-null   object 
 1   Account Name                      2038 non-null   object 
 2   Service Address_x                 2038 non-null   object 
 3   Cleaned Previous Reading October  1577 non-null   float64
 4   Cleaned Present Reading October   1577 non-null   float64
 5   Service Address_y                 2038 non-null   object 
 6   Cleaned Previous Reading January  1564 non-null   float64
 7   Cleaned Present Reading January   1564 non-null   float64
 8   Consumption Difference            2038 non-null   float64
 9   November Consumption              2038 non-null   float64
 10  December Consumption              2038 non-null   float64
 11  Cleaned Present Reading November  1577 non-null   float64
 12  Cleane

In [77]:
nov_columns = {
    'Control Number': 'Control Number', 
    'Account Name': 'Account Name', 
    'Service Address_y': 'Service Address', 
    'Previous Reading November': 'Previous Reading', 
    'Present Reading November': 'Present Reading', 
    'Consumption November': 'Consumption'
}

dec_columns = {
    'Control Number': 'Control Number', 
    'Account Name': 'Account Name', 
    'Service Address_y': 'Service Address', 
    'Previous Reading December': 'Previous Reading', 
    'Present Reading December': 'Present Reading', 
    'Consumption December': 'Consumption'
}

valid_columns = [
    'Control Number', 
    'Account Name', 
    'Service Address', 
    'Previous Reading', 
    'Present Reading', 
    'Consumption'
]

nov_df = merged_df[['Control Number', 'Account Name', 'Service Address_y', 'Previous Reading November', 'Present Reading November', 'Consumption November']]
dec_df = merged_df[['Control Number', 'Account Name', 'Service Address_y', 'Previous Reading December', 'Present Reading December', 'Consumption December']]

nov_df = nov_df.rename(columns=nov_columns)
dec_df = dec_df.rename(columns=dec_columns)

nov_df = nov_df[valid_columns]
dec_df = dec_df[valid_columns]

display(nov_df.head())
display(dec_df.head())

Unnamed: 0,Control Number,Account Name,Service Address,Previous Reading,Present Reading,Consumption
0,501549,"Albano, Lilane",Alicante St.,780.0,789.0,9.0
1,501453,"Alcantara, Hilda",Alicante St.,377.0,382.0,5.0
2,500750,"Aljecera, Marcelino",Alicante St.,3523.0,3544.0,21.0
3,500990,"Alminana, Irus",Alicante St.,2287.0,2350.0,63.0
4,501704,"Alminana, Violeta",Alicante St.,314.0,336.0,22.0


Unnamed: 0,Control Number,Account Name,Service Address,Previous Reading,Present Reading,Consumption
0,501549,"Albano, Lilane",Alicante St.,789.0,797.0,8.0
1,501453,"Alcantara, Hilda",Alicante St.,382.0,386.0,4.0
2,500750,"Aljecera, Marcelino",Alicante St.,3544.0,3565.0,21.0
3,500990,"Alminana, Irus",Alicante St.,2350.0,2412.0,62.0
4,501704,"Alminana, Violeta",Alicante St.,336.0,358.0,22.0


In [78]:
from pathlib import Path

output_dir = Path("../../dataset/raw/2023/compiled")
nov_df.to_csv(output_dir / 'NOV_2023.csv', index=False)
dec_df.to_csv(output_dir / 'DEC_2023.csv', index=False)

print(f"November 2023 records: {nov_df.shape[0]}")
print(f"December 2023 records: {dec_df.shape[0]}")

November 2023 records: 2038
December 2023 records: 2038
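
Before relying on the exported files, it may be worth checking that each month is internally consistent. The sketch below is one possible check, not part of the notebook: `summarize_month` is a hypothetical helper, and the third toy row is an illustrative negative case.

```python
import pandas as pd

def summarize_month(df):
    """Basic consistency figures for one monthly frame (hypothetical helper)."""
    recomputed = df["Present Reading"] - df["Previous Reading"]
    return {
        "rows": len(df),
        "negative_consumption": int((df["Consumption"] < 0).sum()),
        # Rows where both readings exist but do not add up to the stored consumption
        "inconsistent": int((recomputed.notna() & (recomputed != df["Consumption"])).sum()),
    }

toy_nov = pd.DataFrame({
    "Previous Reading": [780.0, 377.0, 1605.0],
    "Present Reading": [789.0, 382.0, 813.0],
    "Consumption": [9.0, 5.0, -792.0],
})
summary = summarize_month(toy_nov)
print(summary)
```

A non-zero `negative_consumption` count would indicate accounts whose two-month difference was negative and was split without clipping.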


# Create records for MAR2023

In [79]:
# Review of consumption difference
import pandas as pd
from IPython.display import display

feb_df = pd.read_csv("../../dataset/clean/training/semi_clean/2023/FEB_2023_semi_clean.csv", encoding='latin-1')
apr_df = pd.read_csv("../../dataset/clean/training/semi_clean/2023/APR_2023_semi_clean.csv", encoding='latin-1')

feb_mini_df = feb_df[['Control Number', 'Account Name', 'Cleaned Previous Reading', 'Cleaned Present Reading']]
apr_mini_df = apr_df[['Control Number', 'Account Name', 'Cleaned Previous Reading', 'Cleaned Present Reading']]

merged_df = pd.merge(feb_mini_df, apr_mini_df, on=['Control Number', 'Account Name'], suffixes=(' February', ' April'))

merged_df['Consumption Difference'] = pd.to_numeric(merged_df['Cleaned Present Reading April']) - pd.to_numeric(merged_df['Cleaned Previous Reading February'])

negative_df = merged_df[merged_df['Consumption Difference'] < 0]
print("Negative Consumption Difference Count: ", len(negative_df))
display(negative_df[['Control Number', 'Account Name', 'Cleaned Previous Reading February','Cleaned Present Reading February', 'Cleaned Previous Reading April', 'Cleaned Present Reading April', 'Consumption Difference']].head())
display(negative_df)

result = merged_df.loc[merged_df['Consumption Difference'] < 0, 'Control Number'].unique()
invalid_record = [result.astype(int) for result in result]
print("Invalid Record Count: ", len(invalid_record))
print(invalid_record)

Negative Consumption Difference Count:  3


Unnamed: 0,Control Number,Account Name,Cleaned Previous Reading February,Cleaned Present Reading February,Cleaned Previous Reading April,Cleaned Present Reading April,Consumption Difference
475,500333,"Pensader, Elvie",4395.0,4395.0,0.0,28.0,-4367.0
1155,500103,"Almojuela, Flora",3584.0,3584.0,3576.0,3576.0,-8.0
1193,500586,"Espenilla, Elmer Jr.",624.0,624.0,14.0,14.0,-610.0


Unnamed: 0,Control Number,Account Name,Cleaned Previous Reading February,Cleaned Present Reading February,Cleaned Previous Reading April,Cleaned Present Reading April,Consumption Difference
475,500333,"Pensader, Elvie",4395.0,4395.0,0.0,28.0,-4367.0
1155,500103,"Almojuela, Flora",3584.0,3584.0,3576.0,3576.0,-8.0
1193,500586,"Espenilla, Elmer Jr.",624.0,624.0,14.0,14.0,-610.0


Invalid Record Count:  3
[np.int64(500333), np.int64(500103), np.int64(500586)]


In [80]:
# Drop the records with a negative consumption difference from the February and April records
filtered_feb_df = feb_df[~feb_df['Control Number'].isin(invalid_record)]
feb_df_count = len(feb_df)
filtered_feb_df_count = len(filtered_feb_df)
filtered_apr_df = apr_df[~apr_df['Control Number'].isin(invalid_record)]
apr_df_count = len(apr_df)
filtered_apr_df_count = len(filtered_apr_df)

columns = ['Control Number', 
            'Account Name', 
            'Service Address', 
            'Previous Reading', 
            'Present Reading', 
            'Cleaned Previous Reading', 
            'Cleaned Present Reading', 
            'Cleaned Consumption']

filtered_feb_df = filtered_feb_df[columns]
filtered_apr_df = filtered_apr_df[columns]

if feb_df_count - len(invalid_record) == len(filtered_feb_df):
    print("February records are valid")
if apr_df_count - len(invalid_record) == len(filtered_apr_df):
    print("April records are valid")

February records are valid
April records are valid


In [81]:
# Null checks on the two readings that bound March:
# February's present reading and April's previous reading
print(f"Null February Present Readings Count: {filtered_feb_df['Cleaned Present Reading'].isnull().sum()}")
print(f"Null April Previous Readings Count: {filtered_apr_df['Cleaned Previous Reading'].isnull().sum()}")

# Coerce to numeric and create masks for invalid entries
# (each series must come from the same frame as the mask it is compared with)
feb_present_numeric = pd.to_numeric(filtered_feb_df['Cleaned Present Reading'], errors="coerce")
feb_previous_numeric = pd.to_numeric(filtered_feb_df['Cleaned Previous Reading'], errors="coerce")

apr_present_numeric = pd.to_numeric(filtered_apr_df['Cleaned Present Reading'], errors="coerce")
apr_previous_numeric = pd.to_numeric(filtered_apr_df['Cleaned Previous Reading'], errors="coerce")

# Boolean masks where coercion failed (i.e., non-numeric values)
feb_invalid_present_mask = feb_present_numeric.isna() & filtered_feb_df['Cleaned Present Reading'].notna()
feb_invalid_previous_mask = feb_previous_numeric.isna() & filtered_feb_df['Cleaned Previous Reading'].notna()

apr_invalid_present_mask = apr_present_numeric.isna() & filtered_apr_df['Cleaned Present Reading'].notna()
apr_invalid_previous_mask = apr_previous_numeric.isna() & filtered_apr_df['Cleaned Previous Reading'].notna()

# Extract invalid entries
feb_invalid_present_values = filtered_feb_df.loc[feb_invalid_present_mask, "Cleaned Present Reading"].unique()
feb_invalid_previous_values = filtered_feb_df.loc[feb_invalid_previous_mask, "Cleaned Previous Reading"].unique()

apr_invalid_present_values = filtered_apr_df.loc[apr_invalid_present_mask, "Cleaned Present Reading"].unique()
apr_invalid_previous_values = filtered_apr_df.loc[apr_invalid_previous_mask, "Cleaned Previous Reading"].unique()

# Report results
print('\n\nFEBRUARY\n')
print(f"Non-Numeric Present Readings Count: {feb_invalid_present_mask.sum()}")
print(f"Values: {feb_invalid_present_values.tolist()}")

print(f"Non-Numeric Previous Readings Count: {feb_invalid_previous_mask.sum()}")
print(f"Values: {feb_invalid_previous_values.tolist()}")

print('\n\nAPRIL\n')
print(f"Non-Numeric Present Readings Count: {apr_invalid_present_mask.sum()}")
print(f"Values: {apr_invalid_present_values.tolist()}")

print(f"Non-Numeric Previous Readings Count: {apr_invalid_previous_mask.sum()}")
print(f"Values: {apr_invalid_previous_values.tolist()}")


Null February Present Readings Count: 486
Null April Previous Readings Count: 481


FEBRUARY

Non-Numeric Present Readings Count: 0
Values: []
Non-Numeric Previous Readings Count: 0
Values: []


APRIL

Non-Numeric Present Readings Count: 0
Values: []
Non-Numeric Previous Readings Count: 0
Values: []
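
The coercion check is easier to follow on a tiny series. The sketch below uses made-up string values (the real columns may already be numeric) to show how `pd.to_numeric(..., errors="coerce")` separates genuinely missing entries from entries that are present but not numeric.

```python
import pandas as pd

# Toy column: two valid numbers, one missing value, two non-numeric strings
raw = pd.Series(["120", "abc", None, "45.5", "n/a"], name="Cleaned Present Reading")

# Values that cannot be parsed become NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

# Invalid = NaN after coercion but not missing in the raw column
invalid_mask = numeric.isna() & raw.notna()

print(int(invalid_mask.sum()))
print(raw[invalid_mask].unique().tolist())
```

The mask and the column it filters share the same index here. Combining a coerced series from one frame with a column from another would align rows by index label and could flag valid numbers by mistake.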


In [82]:
merged_df = filtered_feb_df.merge(filtered_apr_df, on=['Control Number', 'Account Name'], how='inner', suffixes=(' February', ' April'))
display(merged_df.head())
display(merged_df.tail())
display(merged_df.info())

Unnamed: 0,Control Number,Account Name,Service Address February,Previous Reading February,Present Reading February,Cleaned Previous Reading February,Cleaned Present Reading February,Cleaned Consumption February,Service Address April,Previous Reading April,Present Reading April,Cleaned Previous Reading April,Cleaned Present Reading April,Cleaned Consumption April
0,501549,"Albano, Lilane",Alicante St.,670.0,,670.0,670.0,0,Alicante St.,,,,,0
1,501453,"Alcantara, Hilda",Alicante St.,330.0,347.0,330.0,347.0,17,Alicante St.,347.0,347.0,347.0,347.0,0
2,500750,"Aljecera, Marcelino",Alicante St.,3414.0,3419.0,3414.0,3419.0,5,Alicante St.,3419.0,3425.0,3419.0,3425.0,6
3,500990,"Alminana, Irus",Alicante St.,1768.0,1812.0,1768.0,1812.0,44,Alicante St.,1812.0,1854.0,1812.0,1854.0,42
4,501704,"Alminana, Violeta",Alicante St.,213.0,,213.0,213.0,0,Alicante St.,213.0,213.0,213.0,213.0,0


Unnamed: 0,Control Number,Account Name,Service Address February,Previous Reading February,Present Reading February,Cleaned Previous Reading February,Cleaned Present Reading February,Cleaned Consumption February,Service Address April,Previous Reading April,Present Reading April,Cleaned Previous Reading April,Cleaned Present Reading April,Cleaned Consumption April
2011,500641,"Moya, Concepcion",Villamor St.,,,,,0,Villamor St.,,,,,0
2012,500021,"Nacino, Christopher",Villamor St.,1834.0,1836.0,1834.0,1836.0,2,Villamor St.,1836.0,1839.0,1836.0,1839.0,3
2013,500091,"Ragasa, Fe",Villamor St.,3523.0,3541.0,3523.0,3541.0,18,Villamor St.,3541.0,3564.0,3541.0,3564.0,23
2014,501109,"Almocera, Ricky","Puro, Calipat-an",,,,,0,"Puro, Calipat-an",,,,,0
2015,501381,"Gaurano, Clutario","Puro, Calipat-an",,,,,0,"Puro, Calipat-an",,,,,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2016 entries, 0 to 2015
Data columns (total 14 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Control Number                     2016 non-null   int64  
 1   Account Name                       2016 non-null   object 
 2   Service Address February           2016 non-null   object 
 3   Previous Reading February          1489 non-null   float64
 4   Present Reading February           1491 non-null   float64
 5   Cleaned Previous Reading February  1527 non-null   float64
 6   Cleaned Present Reading February   1527 non-null   float64
 7   Cleaned Consumption February       2016 non-null   int64  
 8   Service Address April              2016 non-null   object 
 9   Previous Reading April             1503 non-null   float64
 10  Present Reading April              1499 non-null   float64
 11  Cleaned Previous Reading April     1531 non-null   float

None

In [83]:
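# Rows where both readings exist but February's present reading differs from April's previous reading
both_present = merged_df['Cleaned Present Reading February'].notna() & merged_df['Previous Reading April'].notna()
mismatched_readings = merged_df[both_present & (merged_df['Cleaned Present Reading February'] != merged_df['Previous Reading April'])]
print(mismatched_readings.info())
display(mismatched_readings.head())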

<class 'pandas.core.frame.DataFrame'>
Index: 1 entries, 1261 to 1261
Data columns (total 14 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Control Number                     1 non-null      int64  
 1   Account Name                       1 non-null      object 
 2   Service Address February           1 non-null      object 
 3   Previous Reading February          1 non-null      float64
 4   Present Reading February           1 non-null      float64
 5   Cleaned Previous Reading February  1 non-null      float64
 6   Cleaned Present Reading February   1 non-null      float64
 7   Cleaned Consumption February       1 non-null      int64  
 8   Service Address April              1 non-null      object 
 9   Previous Reading April             1 non-null      float64
 10  Present Reading April              1 non-null      float64
 11  Cleaned Previous Reading April     1 non-null      float64
 1

Unnamed: 0,Control Number,Account Name,Service Address February,Previous Reading February,Present Reading February,Cleaned Previous Reading February,Cleaned Present Reading February,Cleaned Consumption February,Service Address April,Previous Reading April,Present Reading April,Cleaned Previous Reading April,Cleaned Present Reading April,Cleaned Consumption April
1261,501500,"Bonino, Albena",Gutierrez St.,313.0,314.0,313.0,314.0,1,Gutierrez St.,315.0,319.0,315.0,319.0,4
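
A gap column can show how far apart the two readings are for each mismatched account. This is a minimal sketch on three toy rows whose figures are taken from the sample rows shown in this notebook; the `Gap` column name is an assumption introduced here.

```python
import pandas as pd

toy = pd.DataFrame({
    "Control Number": [501453, 501500, 501549],
    "Cleaned Present Reading February": [347.0, 314.0, 670.0],
    "Cleaned Previous Reading April": [347.0, 315.0, None],
})

# Difference between where February ended and where April started
gap = toy["Cleaned Previous Reading April"] - toy["Cleaned Present Reading February"]

# NaN gaps (a missing reading on either side) are excluded
mismatched = toy[gap.notna() & (gap != 0)].assign(Gap=gap)
print(mismatched)
```

A positive gap is consumption that happened between the two recorded readings, which is a plausible candidate for March's consumption.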


In [84]:
import numpy as np

def calculate_present_reading(df):
    # March's present reading is April's previous reading when that exists.
    # Otherwise, for rows that have a February present reading, fall back to
    # April's present reading; all remaining rows stay NaN.
    df['Present Reading March'] = np.where(
        df['Cleaned Previous Reading April'].notna(),
        df['Cleaned Previous Reading April'],
        np.where(
            df['Cleaned Present Reading February'].notna(),
            df['Cleaned Present Reading April'],
            np.nan
        )
    )
    return df

def calculate_previous_reading(df):
    # March's previous reading is February's present reading; when that is missing,
    # reuse March's present reading (NaN if that is missing too).
    df['Previous Reading March'] = df['Cleaned Present Reading February'].fillna(df['Present Reading March'])
    return df

def calculate_consumption(df):
    # Consumption is present minus previous; negative or undefined values become 0
    df['Consumption March'] = df['Present Reading March'] - df['Previous Reading March']
    df['Consumption March'] = df['Consumption March'].clip(lower=0).fillna(0)
    return df

merged_df = calculate_present_reading(merged_df)
merged_df = calculate_previous_reading(merged_df)
merged_df = calculate_consumption(merged_df)
merged_df.head()
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2016 entries, 0 to 2015
Data columns (total 17 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Control Number                     2016 non-null   int64  
 1   Account Name                       2016 non-null   object 
 2   Service Address February           2016 non-null   object 
 3   Previous Reading February          1489 non-null   float64
 4   Present Reading February           1491 non-null   float64
 5   Cleaned Previous Reading February  1527 non-null   float64
 6   Cleaned Present Reading February   1527 non-null   float64
 7   Cleaned Consumption February       2016 non-null   int64  
 8   Service Address April              2016 non-null   object 
 9   Previous Reading April             1503 non-null   float64
 10  Present Reading April              1499 non-null   float64
 11  Cleaned Previous Reading April     1531 non-null   float

In [85]:
mar_columns = {
    'Control Number': 'Control Number', 
    'Account Name': 'Account Name', 
    'Service Address February': 'Service Address', 
    'Previous Reading March': 'Previous Reading', 
    'Present Reading March': 'Present Reading', 
    'Consumption March': 'Consumption'
}

valid_columns = [
    'Control Number', 
    'Account Name', 
    'Service Address', 
    'Previous Reading', 
    'Present Reading', 
    'Consumption'
]

# Rename on a copy so merged_df keeps its month-suffixed column names
mar_df = merged_df.rename(columns=mar_columns)[valid_columns]

display(mar_df.head())

Unnamed: 0,Control Number,Account Name,Service Address,Previous Reading,Present Reading,Consumption
0,501549,"Albano, Lilane",Alicante St.,670.0,,0.0
1,501453,"Alcantara, Hilda",Alicante St.,347.0,347.0,0.0
2,500750,"Aljecera, Marcelino",Alicante St.,3419.0,3419.0,0.0
3,500990,"Alminana, Irus",Alicante St.,1812.0,1812.0,0.0
4,501704,"Alminana, Violeta",Alicante St.,213.0,213.0,0.0
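
The March rules can be traced on a few toy rows. This sketch expresses the same fallbacks with `where` and `fillna` instead of nested `np.where`, as an alternative formulation rather than the notebook's own code; rows 0, 1 and 3 reuse figures from the February and April sample rows, and row 2 is made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Cleaned Present Reading February": [347.0, 670.0, np.nan, 314.0],
    "Cleaned Previous Reading April":   [347.0, np.nan, 50.0, 315.0],
    "Cleaned Present Reading April":    [347.0, np.nan, 60.0, 319.0],
})

# Fallback for the present reading: April's present reading, but only
# for rows that have a February present reading
fallback = toy["Cleaned Present Reading April"].where(
    toy["Cleaned Present Reading February"].notna()
)
toy["Present Reading March"] = toy["Cleaned Previous Reading April"].fillna(fallback)

# Previous reading: February's present reading, else March's present reading
toy["Previous Reading March"] = toy["Cleaned Present Reading February"].fillna(
    toy["Present Reading March"]
)

# Negative or undefined consumption becomes 0
toy["Consumption March"] = (
    toy["Present Reading March"] - toy["Previous Reading March"]
).clip(lower=0).fillna(0)
print(toy[["Previous Reading March", "Present Reading March", "Consumption March"]])
```

Row 1 reproduces the first March record above: a previous reading of 670, no present reading, and zero consumption. Only row 3, where April starts one unit above February's end, yields non-zero consumption.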


In [86]:
from pathlib import Path

output_dir = Path("../../dataset/raw/2023/compiled")
mar_df.to_csv(output_dir / 'MAR_2023.csv', index=False)

print(f"March 2023 records: {mar_df.shape[0]}")

March 2023 records: 2016
