# "EnergyConsumption"-Data Exploration

### Dataset:
The dataset is a time series of electrical power consumption measurements.<br>
It consists of the following columns:<br>
    Date and Time<br>
    Global active power (in kilowatts)<br>
    Global reactive power (in kilowatts)<br>
    Voltage (in volts)<br>
    Global intensity (in amperes)<br>
    Sub metering 1 (in watt-hours of active energy)<br>
    Sub metering 2 (in watt-hours of active energy)<br>
    Sub metering 3 (in watt-hours of active energy)<br>

## (1) Defining Problem Statement and Analyzing basic Metrics

The main objective of this project is to analyze the electricity consumption dataset and generate insights that help energy providers and policymakers make informed decisions about energy production and consumption. The analysis is data-driven, focusing on basic metrics and visualizations to support the findings.


## (2) Import the libraries and load the dataset

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
data=pd.read_csv('household_power_consumption.txt',sep=";")
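
As a hedged alternative to the plain load above, the sketch below parses a few rows in the same semicolon-separated layout and asks `read_csv` to treat `?` as missing, so the measurement columns arrive as floats. The inline sample (including the made-up row of `?` placeholders) and the names `sample`, `df` and `DateTime` are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Small inline sample in the file's layout; the third row is made up
# to illustrate how "?" placeholders are handled.
sample = io.StringIO(
    "Date;Time;Global_active_power;Global_reactive_power;Voltage;"
    "Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3\n"
    "16/12/2006;17:24:00;4.216;0.418;234.84;18.4;0.0;1.0;17.0\n"
    "16/12/2006;17:25:00;5.36;0.436;233.63;23.0;0.0;1.0;16.0\n"
    "16/12/2006;17:26:00;?;?;?;?;?;?;\n"
)

# na_values="?" converts the placeholder to NaN during parsing.
df = pd.read_csv(sample, sep=";", na_values="?")

# Combine Date and Time into one timestamp column (day-first format).
df["DateTime"] = pd.to_datetime(
    df["Date"] + " " + df["Time"], format="%d/%m/%Y %H:%M:%S"
)

print(df.dtypes)
print(df.isnull().sum())
```

Loading this way would make the later numeric conversion unnecessary, at the cost of hiding the placeholder from the first inspection of the data.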



##### <u>Displaying the first 5 records of the dataset</u>

In [5]:
data.head()

Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
0,16/12/2006,17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
1,16/12/2006,17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0
2,16/12/2006,17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
3,16/12/2006,17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
4,16/12/2006,17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0


##### <u>Displaying the last 5 records of the dataset</u>

In [6]:
data.tail()

Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
2075254,26/11/2010,20:58:00,0.946,0.0,240.43,4.0,0.0,0.0,0.0
2075255,26/11/2010,20:59:00,0.944,0.0,240.0,4.0,0.0,0.0,0.0
2075256,26/11/2010,21:00:00,0.938,0.0,239.82,3.8,0.0,0.0,0.0
2075257,26/11/2010,21:01:00,0.934,0.0,239.7,3.8,0.0,0.0,0.0
2075258,26/11/2010,21:02:00,0.932,0.0,239.55,3.8,0.0,0.0,0.0


## (3) Data Exploration and Pre-processing

#### Check basic metrics and data types

##### <u>Displaying the number of rows and columns</u>

In [7]:
data.shape

(2075259, 9)

##### <u>Summarizing the dataset columns with their indices and datatypes</u>

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 9 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Date                   object 
 1   Time                   object 
 2   Global_active_power    object 
 3   Global_reactive_power  object 
 4   Voltage                object 
 5   Global_intensity       object 
 6   Sub_metering_1         object 
 7   Sub_metering_2         object 
 8   Sub_metering_3         float64
dtypes: float64(1), object(8)
memory usage: 142.5+ MB


Observations:<br>
<ul>
<li>The dataset contains 2075259 rows and 9 columns</li>
<li>The columns "Date", "Time", "Global_active_power", "Global_reactive_power", "Voltage", "Global_intensity", "Sub_metering_1", and "Sub_metering_2" contain string values, which are represented by the "object" datatype in this dataframe.</li>
<li>The column "Sub_metering_3" has the "float64" datatype.</li>

</ul>


##### <u>Describing the statistical summary of numerical type data </u>

In [9]:
data.describe()

Unnamed: 0,Sub_metering_3
count,2049280.0
mean,6.458447
std,8.437154
min,0.0
25%,0.0
50%,1.0
75%,17.0
max,31.0


Observations:<br>
<ul>
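<li>The summary covers only "Sub_metering_3", the sole numeric column at this stage.</li>
<li>The count (2049280) is 25979 lower than the number of rows, so this column has missing values.</li>
<li>The 25th percentile is 0, so at least a quarter of the readings are 0.</li>
<li>The median (50th percentile) is 1 and the 75th percentile is 17.</li>
<li>The maximum value is 31.</li>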
</ul>

#### Statistical Summary of categorical type data

In [10]:
data.describe(include='object')

Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2
count,2075259,2075259,2075259,2075259.0,2075259,2075259.0,2075259.0,2075259.0
unique,1442,1440,6534,896.0,5168,377.0,153.0,145.0
top,6/12/2008,17:24:00,?,0.0,?,1.0,0.0,0.0
freq,1440,1442,25979,472786.0,25979,169406.0,1840611.0,1408274.0


Observations:<br>
<ul>
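<li>The most frequent value in "Global_active_power" and "Voltage" is "?" (25979 occurrences each), which indicates that missing readings are stored as a "?" placeholder rather than as null values.</li>
<li>All "object" columns except "Date" and "Time" need to be converted to the "float" datatype.</li>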
</ul>
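
The placeholder can be counted directly, since `isnull()` only recognizes real NaN values. The sketch below uses a toy frame with made-up values (the names `toy` and `placeholder_counts` are assumptions for illustration) to contrast the two counts.

```python
import pandas as pd

# Toy frame mimicking the object-typed columns; values are illustrative.
toy = pd.DataFrame({
    "Global_active_power": ["4.216", "?", "5.374"],
    "Voltage": ["234.84", "?", "233.29"],
    "Sub_metering_3": [17.0, None, 17.0],
})

# isnull() only sees real NaN values, not the "?" strings.
null_counts = toy.isnull().sum()
print(null_counts)

# Comparing against "?" counts the placeholder per column.
placeholder_counts = (toy == "?").sum()
print(placeholder_counts)
```

On the full dataset the same comparison could be applied to `data` before any conversion.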

#### Checking for missing values 
Checking for missing values is an essential part of data cleaning and preprocessing: incomplete records must be identified before they can be addressed.

In [11]:
data.isnull().any()

Date                     False
Time                     False
Global_active_power      False
Global_reactive_power    False
Voltage                  False
Global_intensity         False
Sub_metering_1           False
Sub_metering_2           False
Sub_metering_3            True
dtype: bool

##### <u>Displaying the count of null values in each column</u>

In [12]:
data.isnull().sum()

Date                         0
Time                         0
Global_active_power          0
Global_reactive_power        0
Voltage                      0
Global_intensity             0
Sub_metering_1               0
Sub_metering_2               0
Sub_metering_3           25979
dtype: int64

##### <u>Displaying missing value percentage of each column</u>

In [13]:
missing_values_percentage = (data.isnull().mean() * 100).round(2)
print("Missing Percentage:",missing_values_percentage)

Missing Percentage: Date                     0.00
Time                     0.00
Global_active_power      0.00
Global_reactive_power    0.00
Voltage                  0.00
Global_intensity         0.00
Sub_metering_1           0.00
Sub_metering_2           0.00
Sub_metering_3           1.25
dtype: float64


Observations:<br>
<ul>
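<li>Only the "Sub_metering_3" column reports null values (25979 rows, 1.25%). The "?" placeholders in the other measurement columns are strings, so <code>isnull()</code> does not count them; all of these gaps have to be handled before analysis.</li></ul>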

#### Conversion of Datatype

##### <u>Converting the columns with "object" datatype to "float" datatype except date and time columns</u>

In [14]:
# Convert the measurement columns to float; non-numeric entries such as "?" become NaN
cols=['Global_active_power','Global_reactive_power','Voltage',
      'Global_intensity','Sub_metering_1','Sub_metering_2']
for col in cols:
    data[col]=pd.to_numeric(data[col],errors='coerce')

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 9 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Date                   object 
 1   Time                   object 
 2   Global_active_power    float64
 3   Global_reactive_power  float64
 4   Voltage                float64
 5   Global_intensity       float64
 6   Sub_metering_1         float64
 7   Sub_metering_2         float64
 8   Sub_metering_3         float64
dtypes: float64(7), object(2)
memory usage: 142.5+ MB


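Observation:<br>
<ul>
<li>The columns that had the "object" datatype, except "Date" and "Time", have been converted to "float64".</li>
<li>Because of <code>errors='coerce'</code>, any entry that could not be parsed as a number (such as "?") is now NaN.</li>
</ul>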
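
The effect of `errors='coerce'` can be seen on a tiny series. This is a minimal sketch with made-up values; `raw` and `converted` are illustrative names.

```python
import pandas as pd

# Three string readings, one of them the "?" placeholder.
raw = pd.Series(["4.216", "?", "5.374"], name="Global_active_power")

# errors="coerce" turns anything that is not a number into NaN
# instead of raising an error.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)
print(converted.isnull().sum())
```

This means the number of nulls per column should be checked again after the conversion, not only before it.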

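#### Handling null values

##### <u>We are filling the missing values in each numeric column with the mean value of that column</u>

In [16]:
# Fill each numeric column with its own mean. A single scalar passed to
# data.fillna() would also be written into the NaNs of every other column.
means=data.mean(numeric_only=True)
data.fillna(means,inplace=True)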

In [17]:
data.isnull().sum()

Date                     0
Time                     0
Global_active_power      0
Global_reactive_power    0
Voltage                  0
Global_intensity         0
Sub_metering_1           0
Sub_metering_2           0
Sub_metering_3           0
dtype: int64

Observation:<br>
<ul>
<li>The missing values have been replaced through mean imputation.</li>
<li>After the replacement, every column reports 0 null values.</li>
</ul>
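
`DataFrame.fillna` behaves differently depending on what it receives: a single scalar is written into the NaNs of every column, while a Series of column means fills each column with its own mean. The sketch below shows the difference on a toy frame; the values and the names `toy`, `scalar_filled` and `per_column_filled` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap in each column; values are illustrative.
toy = pd.DataFrame({
    "Voltage": [234.0, np.nan, 236.0],
    "Sub_metering_3": [17.0, np.nan, 16.0],
})

# A scalar fills every NaN in the frame with the same number,
# so the Voltage gap receives the Sub_metering_3 mean.
scalar_filled = toy.fillna(toy["Sub_metering_3"].mean())

# A Series of column means fills each column with its own mean.
per_column_filled = toy.fillna(toy.mean())

print(scalar_filled)
print(per_column_filled)
```

Mean imputation is only one option; for minute-level time series, other strategies may preserve the signal better, but that choice is outside the scope of this sketch.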