## Importing the Libraries and the Data

In [2]:
import pandas as pd
import numpy as np

In [3]:
steeldf = pd.read_csv("Steel_industry_data.csv")

We import pandas and NumPy for data manipulation and load the dataset into steeldf.

## Understanding the Data and Explaining the Columns

In [6]:
steeldf.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35040 entries, 0 to 35039
Data columns (total 11 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   date                                  35040 non-null  object 
 1   Usage_kWh                             35040 non-null  float64
 2   Lagging_Current_Reactive.Power_kVarh  35040 non-null  float64
 3   Leading_Current_Reactive_Power_kVarh  35040 non-null  float64
 4   CO2(tCO2)                             35040 non-null  float64
 5   Lagging_Current_Power_Factor          35040 non-null  float64
 6   Leading_Current_Power_Factor          35040 non-null  float64
 7   NSM                                   35040 non-null  int64  
 8   WeekStatus                            35040 non-null  object 
 9   Day_of_week                           35040 non-null  object 
 10  Load_Type                             35040 non-null  object 
dtypes: float64(6), 

Our dataframe is small, only 9.8 MB, so there is little to gain from memory optimization. We can still make a few small adjustments, as long as they don’t alter the data in a way that compromises the dataframe.
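
As a rough illustration of the kind of adjustment meant here, the sketch below compares memory usage before and after downcasting floats and encoding repeated strings as categories. It runs on a small synthetic frame, not the real dataset, and the column names `value` and `label` are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: one float column and one
# low-cardinality string column (names are illustrative only).
rng = np.random.default_rng(0)
n = 10_000
demo = pd.DataFrame({
    "value": rng.random(n) * 100,  # float64 by default
    "label": rng.choice(["Light_Load", "Medium_Load", "Maximum_Load"], size=n),
})

before = demo.memory_usage(deep=True).sum()

# Downcast floats and encode repeated strings as categories.
slim = demo.copy()
slim["value"] = slim["value"].astype("float32")
slim["label"] = slim["label"].astype("category")

after = slim.memory_usage(deep=True).sum()
print(f"before: {before / 1e6:.2f} MB, after: {after / 1e6:.2f} MB")
```

The saving comes mostly from the string column: a category stores each distinct label once and keeps only small integer codes per row.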

In [8]:
steeldf.head(15)

Unnamed: 0,date,Usage_kWh,Lagging_Current_Reactive.Power_kVarh,Leading_Current_Reactive_Power_kVarh,CO2(tCO2),Lagging_Current_Power_Factor,Leading_Current_Power_Factor,NSM,WeekStatus,Day_of_week,Load_Type
0,01/01/2018 00:15,3.17,2.95,0.0,0.0,73.21,100.0,900,Weekday,Monday,Light_Load
1,01/01/2018 00:30,4.0,4.46,0.0,0.0,66.77,100.0,1800,Weekday,Monday,Light_Load
2,01/01/2018 00:45,3.24,3.28,0.0,0.0,70.28,100.0,2700,Weekday,Monday,Light_Load
3,01/01/2018 01:00,3.31,3.56,0.0,0.0,68.09,100.0,3600,Weekday,Monday,Light_Load
4,01/01/2018 01:15,3.82,4.5,0.0,0.0,64.72,100.0,4500,Weekday,Monday,Light_Load
5,01/01/2018 01:30,3.28,3.56,0.0,0.0,67.76,100.0,5400,Weekday,Monday,Light_Load
6,01/01/2018 01:45,3.6,4.14,0.0,0.0,65.62,100.0,6300,Weekday,Monday,Light_Load
7,01/01/2018 02:00,3.6,4.28,0.0,0.0,64.37,100.0,7200,Weekday,Monday,Light_Load
8,01/01/2018 02:15,3.28,3.64,0.0,0.0,66.94,100.0,8100,Weekday,Monday,Light_Load
9,01/01/2018 02:30,3.78,4.72,0.0,0.0,62.51,100.0,9000,Weekday,Monday,Light_Load


In [9]:
steeldf.describe()

Unnamed: 0,Usage_kWh,Lagging_Current_Reactive.Power_kVarh,Leading_Current_Reactive_Power_kVarh,CO2(tCO2),Lagging_Current_Power_Factor,Leading_Current_Power_Factor,NSM
count,35040.0,35040.0,35040.0,35040.0,35040.0,35040.0,35040.0
mean,27.386892,13.035384,3.870949,0.011524,80.578056,84.36787,42750.0
std,33.44438,16.306,7.424463,0.016151,18.921322,30.456535,24940.534317
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.2,2.3,0.0,0.0,63.32,99.7,21375.0
50%,4.57,5.0,0.0,0.0,87.96,100.0,42750.0
75%,51.2375,22.64,2.09,0.02,99.0225,100.0,64125.0
max,157.18,96.91,27.76,0.07,100.0,100.0,85500.0


In [10]:
print(f'Our dataframe has {steeldf.shape[0]} rows and {steeldf.shape[1]} columns')

Our dataframe has 35040 rows and 11 columns


- **Date**: Continuous; timestamp of each reading, recorded at 15-minute intervals
- **Usage_kWh**: Continuous; industry energy consumption, in kWh
- **Lagging Current Reactive Power**: Continuous, in kVarh
- **Leading Current Reactive Power**: Continuous, in kVarh
- **CO2**: Continuous; the column name CO2(tCO2) indicates tonnes of CO2
- **Lagging Current Power Factor**: Continuous, in %
- **Leading Current Power Factor**: Continuous, in %
- **NSM**: Continuous; number of seconds from midnight, in s
- **Week status**: Categorical (Weekend or Weekday)
- **Day of week**: Categorical (Monday through Sunday)
- **Load Type**: Categorical (Light Load, Medium Load, Maximum Load)
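
The meaning of NSM can be checked against the date column. The sketch below recomputes seconds from midnight for a few rows copied from the head of the dataset shown earlier; it is only a spot check on those rows, not a validation of the full file.

```python
import pandas as pd

# A few rows copied from the head of the dataset shown earlier.
sample = pd.DataFrame({
    "date": ["01/01/2018 00:15", "01/01/2018 00:30", "01/01/2018 01:00"],
    "NSM": [900, 1800, 3600],
})

ts = pd.to_datetime(sample["date"], format="%d/%m/%Y %H:%M")

# Seconds elapsed since midnight, recomputed from the timestamp.
nsm_from_date = ts.dt.hour * 3600 + ts.dt.minute * 60

matches = (nsm_from_date == sample["NSM"]).all()
print(matches)
```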

## Verifying nulls and duplicates

In [13]:
steeldf.isna().sum()

date                                    0
Usage_kWh                               0
Lagging_Current_Reactive.Power_kVarh    0
Leading_Current_Reactive_Power_kVarh    0
CO2(tCO2)                               0
Lagging_Current_Power_Factor            0
Leading_Current_Power_Factor            0
NSM                                     0
WeekStatus                              0
Day_of_week                             0
Load_Type                               0
dtype: int64

In [14]:
steeldf.duplicated().sum()

0

Our dataframe contains no nulls or duplicates.
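
If a single number is preferred over a per-column listing, both checks can be reduced to scalars. The sketch below uses a tiny made-up frame with one missing value and one repeated row, so the counts are nonzero and easy to verify by eye.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with one missing value and one repeated row.
demo = pd.DataFrame({
    "Usage_kWh": [3.17, 4.00, np.nan, 4.00],
    "Load_Type": ["Light_Load", "Light_Load", "Light_Load", "Light_Load"],
})

nulls_per_column = demo.isna().sum()        # Series: one count per column
total_nulls = int(demo.isna().sum().sum())  # scalar: both sum calls need ()
duplicate_rows = int(demo.duplicated().sum())

print(nulls_per_column)
print(total_nulls, duplicate_rows)
```

Leaving the parentheses off the last `sum` returns the bound method itself rather than a count, which is an easy slip to make in a notebook.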

## Looking at the Values of Our Object Columns

In [17]:
steeldf['Load_Type'].unique()

array(['Light_Load', 'Medium_Load', 'Maximum_Load'], dtype=object)

In [18]:
steeldf['WeekStatus'].unique()

array(['Weekday', 'Weekend'], dtype=object)

In [19]:
steeldf['Day_of_week'].unique()

array(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
       'Sunday'], dtype=object)

## Transforming the Data Columns

In [21]:
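# Parse the date strings (day/month/year hour:minute) into datetimes
steeldf['date'] = pd.to_datetime(steeldf['date'], format="%d/%m/%Y %H:%M")
# Extract calendar features from the timestamp
steeldf['Hour'] = steeldf['date'].dt.hour
steeldf['Day'] = steeldf['date'].dt.day
steeldf['Month'] = steeldf['date'].dt.month
# Select the remaining text columns and the float64 columns
change = steeldf.select_dtypes('object').columns
change2 = steeldf.select_dtypes('float64').columns
# Store repeated strings as categories and downcast floats to save memory
steeldf[change] = steeldf[change].astype('category')
steeldf[change2] = steeldf[change2].astype('float32')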

In [22]:
steeldf.head(3)

Unnamed: 0,date,Usage_kWh,Lagging_Current_Reactive.Power_kVarh,Leading_Current_Reactive_Power_kVarh,CO2(tCO2),Lagging_Current_Power_Factor,Leading_Current_Power_Factor,NSM,WeekStatus,Day_of_week,Load_Type,Hour,Day,Month
0,2018-01-01 00:15:00,3.17,2.95,0.0,0.0,73.209999,100.0,900,Weekday,Monday,Light_Load,0,1,1
1,2018-01-01 00:30:00,4.0,4.46,0.0,0.0,66.769997,100.0,1800,Weekday,Monday,Light_Load,0,1,1
2,2018-01-01 00:45:00,3.24,3.28,0.0,0.0,70.279999,100.0,2700,Weekday,Monday,Light_Load,0,1,1
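
Values such as 73.209999 in the output above, where the raw data had 73.21, are a side effect of the float32 conversion: float32 cannot represent 73.21 exactly, so the nearest representable value is stored. A minimal sketch of the effect with NumPy alone:

```python
import numpy as np

original = 73.21               # value as read from the CSV (float64)
as_f32 = np.float32(original)  # nearest representable float32

# Widening back to float64 exposes the rounding introduced by float32.
print(float(as_f32))
print(abs(float(as_f32) - original))
```

The difference is tiny relative to the two-decimal precision of the readings, but it is worth keeping in mind when comparing values for exact equality.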


In [23]:
steeldf

Unnamed: 0,date,Usage_kWh,Lagging_Current_Reactive.Power_kVarh,Leading_Current_Reactive_Power_kVarh,CO2(tCO2),Lagging_Current_Power_Factor,Leading_Current_Power_Factor,NSM,WeekStatus,Day_of_week,Load_Type,Hour,Day,Month
0,2018-01-01 00:15:00,3.17,2.95,0.00,0.0,73.209999,100.000000,900,Weekday,Monday,Light_Load,0,1,1
1,2018-01-01 00:30:00,4.00,4.46,0.00,0.0,66.769997,100.000000,1800,Weekday,Monday,Light_Load,0,1,1
2,2018-01-01 00:45:00,3.24,3.28,0.00,0.0,70.279999,100.000000,2700,Weekday,Monday,Light_Load,0,1,1
3,2018-01-01 01:00:00,3.31,3.56,0.00,0.0,68.089996,100.000000,3600,Weekday,Monday,Light_Load,1,1,1
4,2018-01-01 01:15:00,3.82,4.50,0.00,0.0,64.720001,100.000000,4500,Weekday,Monday,Light_Load,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35035,2018-12-31 23:00:00,3.85,4.86,0.00,0.0,62.099998,100.000000,82800,Weekday,Monday,Light_Load,23,31,12
35036,2018-12-31 23:15:00,3.74,3.74,0.00,0.0,70.709999,100.000000,83700,Weekday,Monday,Light_Load,23,31,12
35037,2018-12-31 23:30:00,3.78,3.17,0.07,0.0,76.620003,99.980003,84600,Weekday,Monday,Light_Load,23,31,12
35038,2018-12-31 23:45:00,3.78,3.06,0.11,0.0,77.720001,99.959999,85500,Weekday,Monday,Light_Load,23,31,12


In [24]:
steeldf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35040 entries, 0 to 35039
Data columns (total 14 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   date                                  35040 non-null  datetime64[ns]
 1   Usage_kWh                             35040 non-null  float32       
 2   Lagging_Current_Reactive.Power_kVarh  35040 non-null  float32       
 3   Leading_Current_Reactive_Power_kVarh  35040 non-null  float32       
 4   CO2(tCO2)                             35040 non-null  float32       
 5   Lagging_Current_Power_Factor          35040 non-null  float32       
 6   Leading_Current_Power_Factor          35040 non-null  float32       
 7   NSM                                   35040 non-null  int64         
 8   WeekStatus                            35040 non-null  category      
 9   Day_of_week                           35040 non-null  category      
 10

In [25]:
steeldf.to_csv("AnalysisSteel")
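
One thing to keep in mind when saving to CSV is that the format stores no dtype information, so the category, float32, and datetime types set earlier are not preserved on a plain reload. The sketch below shows the round trip on a two-row made-up frame, using an in-memory buffer instead of a file; `index=False` is a choice made for this example and is not what the cell above does.

```python
import io
import pandas as pd

# Small illustrative frame with the optimized dtypes used above.
demo = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01 00:15", "2018-01-01 00:30"]),
    "Usage_kWh": pd.Series([3.17, 4.00], dtype="float32"),
    "Load_Type": pd.Series(["Light_Load", "Light_Load"], dtype="category"),
})

buffer = io.StringIO()  # in-memory stand-in for a file on disk
demo.to_csv(buffer, index=False)

# A plain read falls back to default dtypes.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing dtype and parse_dates restores the intended types.
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    dtype={"Usage_kWh": "float32", "Load_Type": "category"},
    parse_dates=["date"],
)

print(plain.dtypes)
print(restored.dtypes)
```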