Here is some context on the additional features engineered in this notebook, for future reference:

First, we have the average price change across periods. This measures, for each company, the average difference in price between the peak, mid-peak and off-peak periods.
We then take this idea one step further with a similar feature that, instead of the average price difference, captures the maximum price difference across periods and months. This gives another view of how prices change from month to month.
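
As a rough illustration of these two features, the sketch below assumes a hypothetical long-format price table with one row per company per month. The column names `price_date`, `price_off_peak`, `price_peak` and `price_mid_peak`, and all the toy values, are placeholders rather than the notebook's actual data.

```python
import pandas as pd

# Hypothetical monthly price data: one row per company (id) per month.
prices = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b", "b"],
    "price_date": pd.to_datetime(["2015-01-01", "2015-02-01", "2015-03-01"] * 2),
    "price_off_peak": [0.10, 0.12, 0.11, 0.20, 0.20, 0.20],
    "price_peak": [0.15, 0.18, 0.14, 0.25, 0.27, 0.26],
    "price_mid_peak": [0.12, 0.13, 0.12, 0.22, 0.22, 0.23],
})

# Price differences between periods, computed for every company and month.
diffs = pd.DataFrame({
    "id": prices["id"],
    "peak_minus_off_peak": prices["price_peak"] - prices["price_off_peak"],
    "peak_minus_mid_peak": prices["price_peak"] - prices["price_mid_peak"],
    "mid_peak_minus_off_peak": prices["price_mid_peak"] - prices["price_off_peak"],
})

# Average difference per company across months.
mean_diffs = diffs.groupby("id").mean().add_suffix("_mean")
# Maximum difference per company across months.
max_diffs = diffs.groupby("id").max().add_suffix("_max")

# One row per company, ready to merge onto the main customer table by id.
price_features = mean_diffs.join(max_diffs).reset_index()
print(price_features)
```

The mean captures the typical gap between periods, while the maximum highlights the single worst month, which is closer to what a customer would notice on a bill.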

These two features could be useful because they are another way of representing the variance of prices throughout the year. Imagine your utility bill increased massively over winter: as a consumer, you’d be annoyed and want to find a better deal!

After this, we continue feature engineering with a few more steps, including column transformations.
Most statistical and machine learning algorithms require numeric input, so all of the data must be converted to numeric data types.
Therefore, we convert the date columns into months and remove the raw date columns, as we cannot use them in their original form.
We also convert boolean columns into binary values.
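
A minimal sketch of these two conversions follows. It assumes a reference date of 2016-01-01 for counting months and assumes that `has_gas` is encoded as the strings `'t'` and `'f'`; both are illustrative choices, and the helper `months_since` is hypothetical rather than taken from the notebook.

```python
import pandas as pd

# Toy frame mimicking the raw column types (values are illustrative only).
toy = pd.DataFrame({
    "date_activ": ["2013-06-15", "2009-08-21"],
    "date_end": ["2016-06-15", "2016-08-30"],
    "has_gas": ["t", "f"],
})

# Assumed reference date; choose one that matches the data's time frame.
reference_date = pd.Timestamp("2016-01-01")

def months_since(dates, reference):
    """Whole calendar months from each date to `reference` (day of month ignored)."""
    return (reference.year - dates.dt.year) * 12 + (reference.month - dates.dt.month)

activ = pd.to_datetime(toy["date_activ"])
end = pd.to_datetime(toy["date_end"])

# Months the contract has been active, and months left until it ends.
toy["months_activ"] = months_since(activ, reference_date)
toy["months_to_end"] = -months_since(end, reference_date)

# The raw date strings are no longer needed once the month counts exist.
toy = toy.drop(columns=["date_activ", "date_end"])

# Map the assumed 't'/'f' flags to 1/0.
toy["has_gas"] = toy["has_gas"].map({"t": 1, "f": 0})
print(toy)
```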

We then convert categorical columns into dummy variables. A dummy variable is a binary flag that indicates whether a row takes a particular value of the categorical column it was created from.
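
For example, dummy variables can be created with `pd.get_dummies`. The sketch below uses a toy `channel_sales` column with made-up category values; the `channel` prefix is an arbitrary choice.

```python
import pandas as pd

# Toy categorical column (category values are made up for illustration).
toy = pd.DataFrame({"channel_sales": ["foo", "MISSING", "foo", "bar"]})

# One binary column per category; the original column is replaced.
dummies = pd.get_dummies(toy, columns=["channel_sales"], prefix="channel", dtype=int)
print(dummies)
```

Each row has exactly one flag set to 1. Categories with very few occurrences produce sparse columns, so it can be worth dropping or grouping them.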

As we saw during exploratory data analysis, the distribution of some columns was skewed. This is important to identify because, depending on the technique or algorithm we use for prediction, there are sometimes assumptions about the data that should hold.

One common assumption is that the columns within the data are approximately normally distributed. Hence, if we find columns that are not, we should try to transform them into a distribution that is closer to normal.
Therefore, the next thing we do is transform some columns using the logarithm function. As you can see from the visualisations, the transformed columns are much closer to a normal distribution than they were before.
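
A small sketch of such a transformation is shown below. It assumes a base-10 logarithm and adds 1 before taking the log so that zero values stay defined; the text above does not specify either detail, so treat both as illustrative choices.

```python
import numpy as np
import pandas as pd

# A handful of consumption values spanning several orders of magnitude.
toy = pd.DataFrame({"cons_12m": [0, 544, 4660, 54946, 6207104]})

# log10(x + 1): the +1 keeps zero consumption at 0 instead of -infinity.
toy["cons_12m_log"] = np.log10(toy["cons_12m"] + 1)

# Skewness should move closer to 0 after the transform.
print("skew before:", toy["cons_12m"].skew())
print("skew after: ", toy["cons_12m_log"].skew())
```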

Finally, we plot the correlations between all the columns to see if we can identify any columns to remove. Pairs of columns with very high correlation are worth a closer look. In such cases, you may want to remove one of the two columns, since they likely hold very similar information.
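
One way to list highly correlated pairs is sketched below on synthetic data. The 0.95 threshold is an assumption for illustration, not a value taken from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is almost a rescaled copy of "a", "c" is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # assumed cut-off for "very high" correlation
pairs = upper.stack()
high_pairs = pairs[pairs.abs() > threshold]
print(high_pairs)
```

The same `corr` matrix can also be passed to `sns.heatmap` for a visual overview before deciding which column of each pair to drop.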

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.simplefilter("ignore")

In [31]:
df = pd.read_csv('clean_data_after_eda.csv')

In [32]:
df.head()

Unnamed: 0,id,channel_sales,cons_12m,cons_gas_12m,cons_last_month,date_activ,date_end,date_modif_prod,date_renewal,forecast_cons_12m,...,var_6m_price_off_peak_var,var_6m_price_peak_var,var_6m_price_mid_peak_var,var_6m_price_off_peak_fix,var_6m_price_peak_fix,var_6m_price_mid_peak_fix,var_6m_price_off_peak,var_6m_price_peak,var_6m_price_mid_peak,churn
0,24011ae4ebbe3035111d65fa7c15bc57,foosdfpfkusacimwkcsosbicdxkicaua,0,54946,0,2013-06-15,2016-06-15,2015-11-01,2015-06-23,0.0,...,0.000131,4.100838e-05,0.0009084737,2.086294,99.530517,44.235794,2.086425,99.53056,44.2367,1
1,d29c2c54acc38ff3c0614d0a653813dd,MISSING,4660,0,0,2009-08-21,2016-08-30,2009-08-21,2015-08-31,189.95,...,3e-06,0.001217891,0.0,0.009482,0.0,0.0,0.009485,0.001217891,0.0,0
2,764c75f661154dac3a6c254cd082ea7d,foosdfpfkusacimwkcsosbicdxkicaua,544,0,0,2010-04-16,2016-04-16,2010-04-16,2015-04-17,47.96,...,4e-06,9.45015e-08,0.0,0.0,0.0,0.0,4e-06,9.45015e-08,0.0,0
3,bba03439a292a1e166f80264c16191cb,lmkebamcaaclubfxadlmueccxoimlema,1584,0,0,2010-03-30,2016-03-30,2010-03-30,2015-03-31,240.04,...,3e-06,0.0,0.0,0.0,0.0,0.0,3e-06,0.0,0.0,0
4,149d57cf92fc41cf94415803a877cb4b,MISSING,4425,0,526,2010-01-13,2016-03-07,2010-01-13,2015-03-09,445.75,...,1.1e-05,2.89676e-06,4.86e-10,0.0,0.0,0.0,1.1e-05,2.89676e-06,4.86e-10,0


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14606 entries, 0 to 14605
Data columns (total 44 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              14606 non-null  object 
 1   channel_sales                   14606 non-null  object 
 2   cons_12m                        14606 non-null  int64  
 3   cons_gas_12m                    14606 non-null  int64  
 4   cons_last_month                 14606 non-null  int64  
 5   date_activ                      14606 non-null  object 
 6   date_end                        14606 non-null  object 
 7   date_modif_prod                 14606 non-null  object 
 8   date_renewal                    14606 non-null  object 
 9   forecast_cons_12m               14606 non-null  float64
 10  forecast_cons_year              14606 non-null  int64  
 11  forecast_discount_energy        14606 non-null  float64
 12  forecast_meter_rent_12m         

In [34]:
df.describe()

Unnamed: 0,cons_12m,cons_gas_12m,cons_last_month,forecast_cons_12m,forecast_cons_year,forecast_discount_energy,forecast_meter_rent_12m,forecast_price_energy_off_peak,forecast_price_energy_peak,forecast_price_pow_off_peak,...,var_6m_price_off_peak_var,var_6m_price_peak_var,var_6m_price_mid_peak_var,var_6m_price_off_peak_fix,var_6m_price_peak_fix,var_6m_price_mid_peak_fix,var_6m_price_off_peak,var_6m_price_peak,var_6m_price_mid_peak,churn
count,14606.0,14606.0,14606.0,14606.0,14606.0,14606.0,14606.0,14606.0,14606.0,14606.0,...,14606.0,14606.0,14606.0,14606.0,14606.0,14606.0,14606.0,14606.0,14606.0,14606.0
mean,159220.3,28092.38,16090.269752,1868.61488,1399.762906,0.966726,63.086871,0.137283,0.050491,43.130056,...,2.5e-05,3.304264e-05,1.406015e-05,0.92281,1.460247,0.634413,0.922835,1.46028,0.6344275,0.097152
std,573465.3,162973.1,64364.196422,2387.571531,3247.786255,5.108289,66.165783,0.024623,0.049037,4.485988,...,0.000204,0.0002231791,0.000127317,18.218286,13.760009,5.895346,18.218452,13.7601,5.89547,0.296175
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5674.75,0.0,0.0,494.995,0.0,0.0,16.18,0.11634,0.0,40.606701,...,3e-06,0.0,0.0,0.0,0.0,0.0,4e-06,0.0,0.0,0.0
50%,14115.5,0.0,792.5,1112.875,314.0,0.0,18.795,0.143166,0.084138,44.311378,...,4e-06,9.45015e-08,0.0,0.0,0.0,0.0,1.1e-05,9.45015e-08,0.0,0.0
75%,40763.75,0.0,3383.0,2401.79,1745.75,0.0,131.03,0.146348,0.098837,44.311378,...,1.1e-05,2.89676e-06,4.86e-10,0.007962,0.0,0.0,0.007973,2.89676e-06,4.86e-10,0.0
max,6207104.0,4154590.0,771203.0,82902.83,175375.0,30.0,599.31,0.273963,0.195975,59.266378,...,0.005543,0.004791197,0.002415288,525.988881,399.471115,80.876029,525.994268,399.4725,80.87765,1.0


In [35]:
df.describe(include=['object'])

Unnamed: 0,id,channel_sales,date_activ,date_end,date_modif_prod,date_renewal,has_gas,origin_up
count,14606,14606,14606,14606,14606,14606,14606,14606
unique,14606,8,1796,368,2129,386,2,6
top,24011ae4ebbe3035111d65fa7c15bc57,foosdfpfkusacimwkcsosbicdxkicaua,2009-08-01,2016-02-01,2015-11-01,2015-06-23,f,lxidpiddsbxsbosboudacockeimpuepw
freq,1,6754,95,145,721,587,11955,7097


In [36]:
df.columns

Index(['id', 'channel_sales', 'cons_12m', 'cons_gas_12m', 'cons_last_month',
       'date_activ', 'date_end', 'date_modif_prod', 'date_renewal',
       'forecast_cons_12m', 'forecast_cons_year', 'forecast_discount_energy',
       'forecast_meter_rent_12m', 'forecast_price_energy_off_peak',
       'forecast_price_energy_peak', 'forecast_price_pow_off_peak', 'has_gas',
       'imp_cons', 'margin_gross_pow_ele', 'margin_net_pow_ele', 'nb_prod_act',
       'net_margin', 'num_years_antig', 'origin_up', 'pow_max',
       'var_year_price_off_peak_var', 'var_year_price_peak_var',
       'var_year_price_mid_peak_var', 'var_year_price_off_peak_fix',
       'var_year_price_peak_fix', 'var_year_price_mid_peak_fix',
       'var_year_price_off_peak', 'var_year_price_peak',
       'var_year_price_mid_peak', 'var_6m_price_off_peak_var',
       'var_6m_price_peak_var', 'var_6m_price_mid_peak_var',
       'var_6m_price_off_peak_fix', 'var_6m_price_peak_fix',
       'var_6m_price_mid_peak_fix', 'var_6m_p

In [37]:
# Drop the identifier, categorical, and raw date columns in place
df.drop(columns=['id','channel_sales','origin_up','has_gas','date_end','date_activ','date_modif_prod','date_renewal'], inplace=True)

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14606 entries, 0 to 14605
Data columns (total 36 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   cons_12m                        14606 non-null  int64  
 1   cons_gas_12m                    14606 non-null  int64  
 2   cons_last_month                 14606 non-null  int64  
 3   forecast_cons_12m               14606 non-null  float64
 4   forecast_cons_year              14606 non-null  int64  
 5   forecast_discount_energy        14606 non-null  float64
 6   forecast_meter_rent_12m         14606 non-null  float64
 7   forecast_price_energy_off_peak  14606 non-null  float64
 8   forecast_price_energy_peak      14606 non-null  float64
 9   forecast_price_pow_off_peak     14606 non-null  float64
 10  imp_cons                        14606 non-null  float64
 11  margin_gross_pow_ele            14606 non-null  float64
 12  margin_net_pow_ele              