# **Milestone 1**

##**Context**
 
 - Why is this problem important to solve?
 1. Global warming is a major worldwide crisis, and the emission of greenhouse gases is the primary factor responsible for it. CO2 is one of the biggest contributors to the greenhouse effect.
 2. Forecasting CO2 emissions by energy source can inform decisions about which methods of electricity production to favor, and thereby reduce emissions.
 3. Hence, solving this problem would help fight global warming.

##**Objective**

 - What is the intended goal?
 1. The goal is to forecast the carbon emissions value for the natural gas (NNEIEUS) fuel type for the next 12 months and to propose measures that can be adopted as policies to reduce these emissions.

##**Key questions**

- What are the key questions that need to be answered?

1. What is the frequency of the data in the time series of interest?
2. Are there any trends and/or seasonal patterns in the dataset? Or is the series stationary?
3. Are there any missing values?
4. Which model best fits the data and how will you evaluate its success?
5. What policies would help reduce the emissions?

##**Problem Formulation**:

- What is it that we are trying to solve using data science?

1. Using data science, we aim to support the reduction of greenhouse gas emissions by forecasting the amount of CO2 emitted due to natural gas usage over the next 12 months.
2. Having reliable forecasts would help us recommend business strategies to better combat global warming.


##**Attributes Information:**

This dataset contains historical monthly carbon dioxide emissions from electricity generation, published by the US Energy Information Administration and categorized by fuel type (coal, natural gas, etc.).

**MSN:-** Reference to Mnemonic Series Names (U.S. Energy Information Administration Nomenclature)

**YYYYMM:-** The year and month in which these emissions were observed (from 1973 to 2016)

**Value:-** Amount of CO2 Emissions in Million Metric Tons of Carbon Dioxide

**Description:-** The category of electricity production through which CO2 is emitted:
1. Coal Electric Power Sector CO2 Emissions
2. Natural Gas Electric Power Sector CO2 Emissions
3. Distillate Fuel, Including Kerosene-Type Jet Fuel, Oil Electric Power Sector CO2 Emissions
4. Petroleum Coke Electric Power Sector CO2 Emissions
5. Residual Fuel Oil Electric Power Sector CO2 Emissions
6. Petroleum Electric Power Sector CO2 Emissions
7. Geothermal Energy Electric Power Sector CO2 Emissions
8. Non-Biomass Waste Electric Power Sector CO2 Emissions
9. Total Energy Electric Power Sector CO2 Emissions

## **Important Notes**

- This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the Rubric shared for each Milestone. Unlike previous courses, it does not follow the pattern of the graded questions in different sections. This notebook would give you a direction on what steps need to be taken in order to get a viable solution to the problem. Please note that this is just one way of doing this. There can be other 'creative' ways to solve the problem and we urge you to feel free and explore them as an 'optional' exercise. 

- In the notebook, there are markdown cells called - Observations and Insights. It is a good practice to provide observations and extract insights from the outputs.

- The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.

- All the outputs in the notebook are just for reference and can be different if you follow a different approach.

- There are sections called **Think About It** in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they want to explore different techniques. 

###**Loading the libraries**

**Please note that we are downgrading the statsmodels library to version 0.12.1.** Due to some variation, the latest version of the library might not give us the desired results. You can run the code below to downgrade the library and avoid any issues in the output. Once the code runs successfully, either restart the kernel or restart the Jupyter Notebook before importing the statsmodels library. It is enough to run the statsmodels install cell once. To be sure you are using the correct version of the library, you can use the code in the Version check cell of the notebook.

In [None]:
#!pip install statsmodels==0.12.1

In [None]:
# Version check 
import statsmodels
statsmodels.__version__

In [None]:
#Import basic libraries
import pandas as pd
import warnings
import itertools
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

###**Loading the data**

In [None]:
df = pd.read_excel('MER_T12_06.xlsx')
df.head()

In [None]:
#to ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
#conversion of the "YYYYMM" column into standard datetime format & making it the index
# We are using errors='coerce'. It replaces every value that cannot be parsed as a date with NaT.

dateparse = lambda x: pd.to_datetime(x, format='%Y%m', errors = 'coerce')
df = pd.read_excel('MER_T12_06.xlsx', parse_dates=['YYYYMM'], index_col='YYYYMM', date_parser=dateparse) 
df.head(15)

**The arguments can be explained as:**

- **parse_dates:** Names the column(s) to parse as dates. Here, the column name is 'YYYYMM'.
- **index_col:** Tells pandas to use the given column (the date time column) as the index.
- **date_parser:** A function that converts an input string into a datetime value.
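As a minimal sketch of what the parser does, the snippet below applies the same `format='%Y%m'` and `errors='coerce'` settings to a few hand-written strings. The sample values are assumptions chosen for illustration (including `197313`, which has no valid month); they are not read from the Excel file.

```python
import pandas as pd

# Hand-written sample values (assumption); "197313" is not a valid year-month
raw = pd.Series(["197301", "197302", "197313", "197401"])

# Same settings as the dateparse lambda: unparseable entries become NaT
parsed = pd.to_datetime(raw, format="%Y%m", errors="coerce")

# Keep only the entries that parsed to a real date
valid = parsed[parsed.notna()]
print(parsed)
print(len(valid), "valid dates")
```

Filtering on `notna()` is the same idea the notebook uses when it removes the rows whose index is NaT.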

- Let us first identify and **drop the rows whose index is not a valid datetime**. To do so, we convert the index to datetime with errors coerced, then filter out the NaT entries.

In [None]:
df.count()

In [None]:

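# Keep only the rows whose index parsed to a valid date (drop the NaT rows);
# .copy() lets us modify ts later without a SettingWithCopyWarning
ts = df[pd.Series(pd.to_datetime(df.index, errors='coerce')).notnull().values].copy()
ts.head()
ts.describe().T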

**Observations**
1. The number of observations drops to 4707 after filtering out the NaT rows
2. There are 9 unique categories in the MSN and Description columns
3. The 'Value' column contains 384 missing values. The rows with these missing values should be eliminated

In [None]:
#Check the datatypes of each column. Hint: Use dtypes method
ts.dtypes

In [None]:
#convert the emission values to numeric; entries that are not numbers become NaN
ts['Value'] = pd.to_numeric(ts['Value'], errors='coerce')

In [None]:
#Check total number of missing values of each column. Hint: Use isnull() method
ts.isnull().sum()
#ts.isna().sum()

In [None]:
#Drop the missing value using dropna(inplace = True)
ts.dropna(inplace = True)
ts.describe().T

**Observations**
1. The datatype of the 'Value' column has changed from object to numeric. After dropping the missing values, the number of observations has dropped to 4323
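The conversion and clean-up steps can be illustrated on a toy column. This is only a sketch: the placeholder string `"Not Available"` and the numbers are assumptions made for the example, not values taken from the dataset.

```python
import pandas as pd

# Toy object-typed column with one non-numeric placeholder (assumption)
values = pd.Series(["10.5", "Not Available", "12.0"])

# errors="coerce" turns anything that is not a number into NaN
numeric = pd.to_numeric(values, errors="coerce")

# Dropping NaN leaves only the usable observations
clean = numeric.dropna()
print(numeric.dtype, "| missing:", numeric.isna().sum(), "| kept:", len(clean))
```

Counting the NaN entries before calling `dropna()` is a quick way to confirm how many rows will be removed.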

###**Dataset visualization**

- The dataset has 8 energy sources of CO2 emission, plus a series for their total.
- Group the CO2 Emission dataset based on the type of energy source.
- Visualize the trend of CO2 emission from each energy source

In [None]:
#t=ts.groupby('Description')
coal=ts[ts['MSN']=='CLEIEUS']
coal.plot(figsize=(16, 8))
plt.xlabel("Month")
plt.ylabel("CO2 Emission value")
plt.title('Coal Electric Power Sector CO2 Emissions')
plt.legend(['Emission value'])  # labels must be a list; a bare string is split into characters

####**Visualize the dependency of the emission in the power generation with time.**

In [None]:
import statsmodels
import statsmodels.api as sm
from statsmodels.tsa.stattools import coint, adfuller
import matplotlib.pyplot as plt

In [None]:
natural=ts[ts['MSN']=='NNEIEUS']
# DataFrame.plot returns a matplotlib Axes; the figure is reached through ax.figure
ax=natural.plot(figsize=(16, 8))
plt.xlabel("Month")
plt.ylabel("CO2 Emission value")
plt.title('Natural Gas Electric Power Sector CO2 Emissions')
plt.legend(['Emission value'])
# The file name has no extension, so matplotlib saves in its default format
ax.figure.savefig("Natural gas_emissions", bbox_inches='tight', dpi=600)

In [None]:
distillate=ts[ts['MSN']=='DKEIEUS']
distillate.plot(figsize=(16, 8))
plt.xlabel("Month")
plt.ylabel("CO2 Emission value")
plt.title('Distillate Fuel, Including Kerosene-Type Jet Fuel, Oil Electric Power Sector CO2 Emissions')
plt.legend(['Emission value'])

In [None]:
petroleum=ts[ts['MSN']=='PCEIEUS']
petroleum.plot(figsize=(16, 8))
plt.xlabel("Month")
plt.ylabel("CO2 Emission value")
plt.title('Petroleum Coke Electric Power Sector CO2 Emissions')
plt.legend(['Emission value'])

In [None]:
Residual=ts[ts['MSN']=='RFEIEUS']
Residual.plot(figsize=(16, 8))
plt.xlabel("Month")
plt.ylabel("CO2 Emission value")
plt.title('Residual Fuel Oil Electric Power Sector CO2 Emissions')
plt.legend(['Emission value'])

In [None]:
pelectric=ts[ts['MSN']=='PAEIEUS']
pelectric.plot(figsize=(16, 8))
plt.xlabel("Month")
plt.ylabel("CO2 Emission value")
plt.title('Petroleum Electric Power Sector CO2 Emissions')
plt.legend(['Emission value'])

In [None]:
geothermal=ts[ts['MSN']=='GEEIEUS']
geothermal.plot(figsize=(16, 8))
plt.xlabel("Month")
plt.ylabel("CO2 Emission value")
plt.title('Geothermal Energy Electric Power Sector CO2 Emissions')
plt.legend(['Emission value'])

In [None]:
nonbio=ts[ts['MSN']=='NWEIEUS']
nonbio.plot(figsize=(16, 8))
plt.xlabel("Month")
plt.ylabel("CO2 Emission value")
plt.title('Non-Biomass Waste Electric Power Sector CO2 Emissions')
plt.legend(['Emission value'])

In [None]:
total=ts[ts['MSN']=='TXEIEUS']
total.plot(figsize=(16, 8))
plt.xlabel("Month")
plt.ylabel("CO2 Emission value")
plt.title('Total Energy Electric Power Sector CO2 Emissions')
plt.legend(['Emission value'])

- **Observations and Insights:**
1. The time series of CO2 emissions from every energy source appears to be non-stationary
2. The time series of CO2 emissions due to 'Geothermal Energy Electric Power Sector' and 'Non-Biomass Waste Electric Power Sector' and 'Natural Gas Electric Power Sector' appear to have seasonality
3. The time series of 'Residual Fuel Oil Electric Power Sector CO2 Emissions' and 'Petroleum Electric Power Sector CO2 Emissions' show a very similar trend indicating interdependency
4. The time series of 'Total Energy Electric Power Sector CO2 Emissions' seems to have a seasonality component too
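The per-source plotting cells repeat the same steps for each MSN code. One possible alternative, sketched below under assumptions, is to reshape the long table into one column per MSN code and draw all series at once. The tiny frame and its numbers are made up for illustration, and `plot_all` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Made-up long-format frame mimicking the MSN and Value columns of ts (assumption)
idx = pd.to_datetime(["1973-01-01", "1973-02-01"] * 2)
long = pd.DataFrame(
    {
        "MSN": ["CLEIEUS", "CLEIEUS", "NNEIEUS", "NNEIEUS"],
        "Value": [72.1, 64.4, 12.2, 11.7],
    },
    index=idx,
)

# One row per month, one column per MSN code
wide = long.pivot(columns="MSN", values="Value")


def plot_all(wide_df):
    """Hypothetical helper: draw one subplot per energy source."""
    axes = wide_df.plot(subplots=True, figsize=(16, 3 * wide_df.shape[1]), sharex=True)
    for ax in axes:
        ax.set_ylabel("CO2 Emission value")
    return axes


print(wide)
```

Calling `plot_all(wide)` in the notebook would then produce a stacked panel of plots with a shared time axis, which makes trends and seasonality easier to compare across sources.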

####**Bar chart of CO2 Emissions per energy source**

In [None]:
# Total CO2 emissions per energy source, sorted from lowest to highest
CO2_per_source = ts.groupby('Description')[['Value']].sum().sort_values(by = 'Value')

# Short labels for the bar chart, assumed to follow the same ascending order
cols = ['Geothermal Energy', 'Non-Biomass Waste', 'Petroleum Coke', 'Distillate Fuel',
        'Residual Fuel Oil', 'Petroleum', 'Natural Gas', 'Coal', 'Total Emissions']
CO2_per_source.head()

In [None]:
# Bar chart of total emissions per source; plt.xticks swaps the long
# Description labels for the short names in cols
plt.figure(figsize=(16, 8))
plt.bar(CO2_per_source.index, CO2_per_source['Value'])
plt.xticks(range(len(cols)), cols, rotation=45, ha='right')
plt.ylabel("CO2 Emission value")
plt.show()





**Observations and insights**
1. CO2 emissions from coal-fired electric power are significantly higher than those from any other source, and coal is the major contributor to Total Energy Electric Power Sector CO2 emissions. It is followed by natural gas, petroleum, and residual fuel oil.
2. In the future, CO2 emissions could be reduced by shifting generation to sources such as geothermal, non-biomass waste, petroleum coke, and distillate fuel, as they contribute the least CO2 emissions here. Note that these totals also reflect how much each source is used, not only how much CO2 it emits per unit of electricity.


####**For developing the time series model and forecasting, use the natural gas CO2 emissions from electrical power generation**


In [None]:
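# Drop the first column; Description identifies each series from here on
mte = ts.iloc[:,1:]
# Monthly total emissions (mte): one column per Description, one row per month
ab = mte.groupby(['Description', pd.Grouper(freq="M")])['Value'].sum().unstack(level = 0)
# Natural gas series used for modelling
nat = ab['Natural Gas Electric Power Sector CO2 Emissions']
nat.head()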

- **Potential techniques -** What different techniques should be explored?
1. The first step is to decompose the time series (for natural gas CO2 emissions) into trend, seasonal, and residual components.
2. The time series can be tested for stationarity using the Augmented Dickey-Fuller test. If the time series is non-stationary, we need to make it stationary by differencing the data.
3. The order of the model can then be estimated from the ACF and PACF plots.
4. If the ACF decays exponentially and the PACF cuts off after lag p, we can choose an AR(p) model. If instead the ACF itself cuts off after lag q, we can pick an MA(q) model.
5. If the time series follows neither a pure AR nor a pure MA pattern, an ARMA(p, q) or ARIMA(p, d, q) model is likely more appropriate. In that case, we need to run a hyperparameter search to find the optimal values of p, d, and q.

- **Overall solution design -** What is the potential solution design?
1. We would split the dataset into training and test data. The split is chronological (no shuffling), so that the lag structure of the series is preserved. Training data: 197301-201012; test data: 201101-201607.
2. The training data should be tested for stationarity.
3. We could perform differencing of successive orders until stationarity is obtained. This order defines the integration parameter (d) in ARIMA modelling.
4. ACF and PACF plots are used to identify the model's order and the potential models for fitting the data.
5. We would then try multiple modelling techniques (AR, MA, ARMA, ARIMA) with different lags, fit each to the training data, and evaluate it on the test data. This would help identify the best parameters of the model for fitting the data.
6. After identifying the best parameters (p, d, and q) for our data, we would train the model with the same parameters on the full 'Natural gas CO2 emissions' data and get the forecasts for the next 12 months, i.e. from 201608 to 201707.

- **Measures of success -** What are the key measures of success to compare different techniques?
1. We need to select which evaluation metric to optimize when building the model. We would use AIC and RMSE to compare the models against one another; for both metrics, lower values are better.
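For reference, both metrics can be written in a few lines. The helper names `rmse` and `aic` and the input numbers are illustrative assumptions; in practice, statsmodels reports AIC on a fitted model, and RMSE is computed on the test-period forecasts.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Errors are -1, 0 and 2, so the mean squared error is 5/3
example_rmse = rmse([10, 12, 14], [11, 12, 12])
# 2 * 3 - 2 * (-100) = 206
example_aic = aic(log_likelihood=-100.0, n_params=3)
print(round(example_rmse, 4), example_aic)
```

AIC penalizes extra parameters and is computed on the data the model was fitted to, whereas RMSE measures forecast error on held-out data, so the two metrics complement each other.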

##**Proposed approach**

- **Potential techniques -** What different techniques should be explored?
- **Overall solution design -** What is the potential solution design?
- **Measures of success -** What are the key measures of success to compare different techniques?