<a href="https://colab.research.google.com/github/pginjupalli/Verizon-EnergyPriceForecasting/blob/main/Verizon_2_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Part 1: Understanding the Data**

Here, we look at the dataset and note the features, their correlations with the label, and whether any data is missing. We then come up with ideas to clean and prepare the data for modeling.

In [18]:
import pandas as pd
import numpy as np
import os

In [19]:
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('/content/drive/My Drive/BTT Verizon 2/electricity_prices.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [20]:
print(df.isnull().sum()) # number of null values in each column
df[df['customers'].notnull()] # rows where 'customers' is not null

year                    0
month                   0
stateDescription        0
sectorName              0
customers           26040
price                   0
revenue                 0
sales                   0
dtype: int64


Unnamed: 0,year,month,stateDescription,sectorName,customers,price,revenue,sales
26040,2008,1,Washington,all sectors,3145488.0,6.64,574.73235,8658.35972
26041,2008,1,Rhode Island,transportation,0.0,0.00,0.00000,0.00000
26042,2008,1,South Carolina,transportation,0.0,0.00,0.00000,0.00000
26043,2008,1,Massachusetts,industrial,14142.0,13.18,100.40468,761.89017
26044,2008,1,Massachusetts,residential,2611383.0,16.81,315.43624,1876.02831
...,...,...,...,...,...,...,...,...
85865,2024,1,Arkansas,all sectors,1717720.0,9.63,442.98773,4598.63147
85866,2024,1,Arkansas,commercial,208669.0,10.26,97.79467,953.02154
85867,2024,1,Arkansas,industrial,34951.0,7.08,109.92656,1553.02838
85868,2024,1,Arkansas,residential,1474098.0,11.24,235.26399,2092.56172


As the output above shows, the 'customers' feature is null in 26,040 examples, close to a third of the dataset. That is too much data to discard, so we replace the null values instead. Common choices for the replacement are the mean, median, or mode of the observed values.
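Before choosing a replacement, it can help to check how much is missing and where. The sketch below uses a small invented frame (`toy`, not the real dataset) to show one way to compute the share of nulls per column and per year; the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (all values are invented).
toy = pd.DataFrame({
    "year": [2006, 2006, 2007, 2007, 2008, 2008],
    "customers": [np.nan, np.nan, np.nan, np.nan, 120.0, 80.0],
    "price": [5.1, 6.0, 5.4, 6.2, 5.9, 6.5],
})

# Share of missing values per column (the mean of a boolean mask).
missing_share = toy.isnull().mean()

# Share of missing 'customers' values within each year.
missing_by_year = toy["customers"].isnull().groupby(toy["year"]).mean()

print(missing_share)
print(missing_by_year)
```

If the nulls turn out to be concentrated in particular years, a single global replacement value may distort those years more than others, which is worth keeping in mind when comparing strategies.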

We compute these statistics for the 'customers' feature to see which works best as a replacement for the null values.

In [21]:
df_with_customers = df[df['customers'].notnull()] # rows where 'customers' is not null

print("Customers Mean: " + str(df_with_customers['customers'].mean()))
print("Customers Median: " + str(df_with_customers['customers'].median()))
print("Customers Mode: " + str(df_with_customers['customers'].mode()))

df_with_customers.select_dtypes(['int64', 'float64']).corr()

Customers Mean: 2916013.4194216947
Customers Median: 299754.0
Customers Mode: 0    0.0
Name: customers, dtype: float64


Unnamed: 0,year,month,customers,price,revenue,sales
year,1.0,-0.014991,0.009542,0.139038,0.02312,0.004143
month,-0.014991,1.0,0.000577,0.016232,0.005551,0.002567
customers,0.009542,0.000577,1.0,0.083902,0.909009,0.876248
price,0.139038,0.016232,0.083902,1.0,0.066214,0.026643
revenue,0.02312,0.005551,0.909009,0.066214,1.0,0.987114
sales,0.004143,0.002567,0.876248,0.026643,0.987114,1.0


These are the current statistics for the 'customers' column and its correlations with the other features. After the null values are replaced, the full dataset should match them as closely as possible.

Let's replace the null values with the mean, median, and mode in turn, and see which one most closely reproduces these correlations.

In [22]:
df_with_mean = df.copy()
df_with_mean['customers'] = df['customers'].fillna(value = df_with_customers['customers'].mean(), inplace = False)

print("Correlations with the mean as a replacement")
df_with_mean.select_dtypes(['int64', 'float64']).corr()

Correlations with the mean as a replacement


Unnamed: 0,year,month,customers,price,revenue,sales
year,1.0,-0.010297,0.005547,0.261512,0.045368,0.008315
month,-0.010297,1.0,0.000482,0.019868,0.006237,0.002925
customers,0.005547,0.000482,1.0,0.074034,0.821476,0.743986
price,0.261512,0.019868,0.074034,1.0,0.074662,0.02855
revenue,0.045368,0.006237,0.821476,0.074662,1.0,0.979835
sales,0.008315,0.002925,0.743986,0.02855,0.979835,1.0


In [23]:
df_with_median = df.copy()
df_with_median['customers'] = df['customers'].fillna(value = df_with_customers['customers'].median(), inplace = False)

print("Correlations with the median as a replacement")
df_with_median.select_dtypes(['int64', 'float64']).corr()

Correlations with the median as a replacement


Unnamed: 0,year,month,customers,price,revenue,sales
year,1.0,-0.010297,0.100436,0.261512,0.045368,0.008315
month,-0.010297,1.0,2.8e-05,0.019868,0.006237,0.002925
customers,0.100436,2.8e-05,1.0,0.100938,0.820243,0.739499
price,0.261512,0.019868,0.100938,1.0,0.074662,0.02855
revenue,0.045368,0.006237,0.820243,0.074662,1.0,0.979835
sales,0.008315,0.002925,0.739499,0.02855,0.979835,1.0


In [24]:
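df_with_mode = df.copy()
# Note: mode() returns a Series, and fillna aligns a Series argument by index label,
# so this call fills only the row at index 0; pass mode().iloc[0] to fill every null.
df_with_mode['customers'] = df['customers'].fillna(value = df_with_customers['customers'].mode(), inplace = False)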

print("Correlations with the mode as a replacement")
df_with_mode.select_dtypes(['int64', 'float64']).corr()

Correlations with the mode as a replacement


Unnamed: 0,year,month,customers,price,revenue,sales
year,1.0,-0.010297,0.009554,0.261512,0.045368,0.008315
month,-0.010297,1.0,0.000584,0.019868,0.006237,0.002925
customers,0.009554,0.000584,1.0,0.083905,0.909009,0.876248
price,0.261512,0.019868,0.083905,1.0,0.074662,0.02855
revenue,0.045368,0.006237,0.909009,0.074662,1.0,0.979835
sales,0.008315,0.002925,0.876248,0.02855,0.979835,1.0


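Of these three replacements, the **mode** produced correlations that most closely resemble the original, expected ones. One caveat: `mode()` returns a Series, and `fillna` aligns a Series argument by index label, so the mode cell fills only the row at index 0 and leaves the other nulls in place. The near-identical 'customers' correlations (0.909009 with revenue, 0.876248 with sales) therefore reflect the unchanged non-null rows rather than a full mode fill. We carry the mode forward as the replacement value, but the comparison should be rerun with a scalar, for example `mode().iloc[0]`, before relying on it.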
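Comparing correlation tables by eye can be made more concrete with a single number per strategy. The following is a minimal sketch, not the notebook's own method: `corr_gap` is a hypothetical helper, and the toy data is randomly generated for illustration only.

```python
import numpy as np
import pandas as pd

def corr_gap(reference: pd.DataFrame, candidate: pd.DataFrame, column: str) -> float:
    """Mean absolute difference between one column's correlations in two frames."""
    ref = reference.select_dtypes("number").corr()[column]
    cand = candidate.select_dtypes("number").corr()[column]
    return float((ref - cand).abs().mean())

# Toy data (invented): 'customers' tracks 'sales', with the first 60 rows missing.
rng = np.random.default_rng(0)
sales = rng.uniform(100, 1000, size=200)
toy = pd.DataFrame({
    "sales": sales,
    "customers": sales * 3 + rng.normal(0, 50, size=200),
})
toy.loc[:59, "customers"] = np.nan

# Reference correlations come from the rows where 'customers' is observed.
reference = toy.dropna(subset=["customers"])
observed = reference["customers"]

# Scalar replacement candidates; mode() returns a Series, so take one value.
candidates = {
    "mean": observed.mean(),
    "median": observed.median(),
    "mode": observed.mode().iloc[0],
}

# Smaller gap means the filled frame stays closer to the reference correlations.
gaps = {
    name: corr_gap(reference, toy.fillna({"customers": value}), "customers")
    for name, value in candidates.items()
}
print(gaps)
```

On the real data, `toy` would be replaced by `df` and `reference` by `df_with_customers`; the strategy with the smallest gap is the one that disturbs the 'customers' correlations least.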

In [None]:
# Replace null values in 'customers' with the mode.
# mode() returns a Series, so take its first value to fill every null with a scalar.
df['customers'] = df['customers'].fillna(df_with_customers['customers'].mode().iloc[0])
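Because `Series.mode()` returns a Series rather than a scalar, how it is passed to `fillna` matters. A minimal sketch on an invented toy Series shows the difference between passing the Series itself and passing its first value:

```python
import numpy as np
import pandas as pd

# Toy Series with two nulls (values are invented for illustration).
s = pd.Series([np.nan, 2.0, 2.0, np.nan, 5.0])

mode_series = s.mode()  # a Series holding the mode(s), indexed from 0

# Passing a Series: fillna aligns on index labels, so only label 0 can be filled.
filled_by_series = s.fillna(mode_series)

# Passing a scalar: every null receives the same value.
filled_by_scalar = s.fillna(mode_series.iloc[0])

print(filled_by_series.isnull().sum())
print(filled_by_scalar.isnull().sum())
```

With a Series argument, the null at label 3 has no matching label in `mode_series` and stays null; the scalar form fills both nulls.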