# **Milestone 1**

## **Context**
 
 - Why is this problem important to solve?

## **Objective**

 - What is the intended goal?

## **Key questions**

- What are the key questions that need to be answered?

## **Problem Formulation**

- What is it that we are trying to solve using data science?

## **Attributes Information**

This dataset contains past monthly carbon dioxide (CO2) emissions from electricity generation, published by the US Energy Information Administration and categorized by fuel type, such as coal and natural gas.

**MSN:-** Reference to Mnemonic Series Names (U.S. Energy Information Administration Nomenclature)

**YYYYMM:-** The year and month in which the emissions were observed

**Value:-** Amount of CO2 emissions in million metric tons of carbon dioxide

**Description:-** The category of electricity production (fuel type) that produced the emissions

## **Important Notes**

- This notebook is a guide to refer to while solving the problem. The evaluation will follow the Rubric shared for each Milestone. Unlike previous courses, it does not follow the pattern of graded questions in different sections. The notebook gives you a direction on the steps needed to reach a viable solution. Please note that this is just one way of doing it. There can be other 'creative' ways to solve the problem, and we urge you to explore them as an 'optional' exercise.

- In the notebook, there are markdown cells called Observations and Insights. It is good practice to record observations and extract insights from the outputs.

- The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.

- All the outputs in the notebook are just for reference and can be different if you follow a different approach.

- There are sections called **Think About It** in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they want to explore different techniques. 

### **Loading the libraries**

In [None]:
# Uncomment to upgrade statsmodels
#!pip install statsmodels --upgrade

Collecting statsmodels
  Downloading statsmodels-0.13.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.8 MB)
Collecting patsy>=0.5.2
  Downloading patsy-0.5.2-py2.py3-none-any.whl (233 kB)
Installing collected packages: patsy, statsmodels
  Attempting uninstall: patsy
    Found existing installation: patsy 0.5.1
    Uninstalling patsy-0.5.1:
      Successfully uninstalled patsy-0.5.1
  Attempting uninstall: statsmodels
    Found existing installation: statsmodels 0.10.2
    Uninstalling statsmodels-0.10.2:
      Successfully uninstalled statsmodels-0.10.2
Successfully installed patsy-0.5.2 statsmodels-0.13.0


In [None]:
#Import basic libraries
import pandas as pd
import warnings
import itertools
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

### **Loading the data**

In [None]:
df = pd.read_excel('MER_T12_06.xlsx')
df.head()

| | MSN | YYYYMM | Value | Description |
|---|---|---|---|---|
| 0 | CLEIEUS | 197301 | 72.076 | Coal Electric Power Sector CO2 Emissions |
| 1 | CLEIEUS | 197302 | 64.442 | Coal Electric Power Sector CO2 Emissions |
| 2 | CLEIEUS | 197303 | 64.084 | Coal Electric Power Sector CO2 Emissions |
| 3 | CLEIEUS | 197304 | 60.842 | Coal Electric Power Sector CO2 Emissions |
| 4 | CLEIEUS | 197305 | 61.798 | Coal Electric Power Sector CO2 Emissions |


In [None]:
# Suppress warnings to keep the output readable
warnings.filterwarnings("ignore")

In [None]:
# Convert the "YYYYMM" column to standard datetime format and make it the index.
# errors='coerce' replaces every value that cannot be parsed as a date with NaT.
# Note: date_parser is deprecated as of pandas 2.0 (date_format is the suggested replacement).

dateparse = lambda x: pd.to_datetime(x, format='%Y%m', errors='coerce')
df = pd.read_excel('MER_T12_06.xlsx', parse_dates=['YYYYMM'], index_col='YYYYMM', date_parser=dateparse)
df.head(15)

**The arguments can be explained as:**

- **parse_dates:** Names the column(s) to parse as dates. Here, the column is 'YYYYMM'.
- **index_col:** Tells pandas to use the given column, here the datetime column, as the index.
- **date_parser:** A function that converts the input values into datetime values.

- Let us first identify and **drop the rows whose index is not a valid datetime**. Convert the index to datetime with errors coerced to NaT, then filter out the NaT entries.

In [None]:
ts = df[pd.Series(pd.to_datetime(df.index, errors='coerce'))._______().values]
ts.head()

| YYYYMM | MSN | Value | Description |
|---|---|---|---|
| 1973-01-01 | CLEIEUS | 72.076 | Coal Electric Power Sector CO2 Emissions |
| 1973-02-01 | CLEIEUS | 64.442 | Coal Electric Power Sector CO2 Emissions |
| 1973-03-01 | CLEIEUS | 64.084 | Coal Electric Power Sector CO2 Emissions |
| 1973-04-01 | CLEIEUS | 60.842 | Coal Electric Power Sector CO2 Emissions |
| 1973-05-01 | CLEIEUS | 61.798 | Coal Electric Power Sector CO2 Emissions |
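
To illustrate the coerce-and-filter idea on its own, here is a minimal sketch on a toy frame. The values are hypothetical, and the string `"197313"` stands in for any entry that cannot be read as a year and month. It shows one possible way to build the mask and is not necessarily the approach expected in the cell above.

```python
import pandas as pd

# Toy data (hypothetical): the last YYYYMM entry is not a valid year-month
toy = pd.DataFrame(
    {"YYYYMM": ["197301", "197302", "197313"], "Value": [72.076, 64.442, 811.791]}
)

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(toy["YYYYMM"], format="%Y%m", errors="coerce")
toy = toy.assign(YYYYMM=parsed).set_index("YYYYMM")

# Boolean mask that is False wherever the index is NaT
mask = toy.index.notna()
monthly = toy[mask]
print(monthly)
```

Only the rows with a valid monthly timestamp remain in `monthly`.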


In [None]:
#Check the datatype of each column. Hint: Use the dtypes attribute

In [None]:
#Convert the emission values to numeric
 

In [None]:
#Check the total number of missing values in each column. Hint: Use the isnull() method


In [None]:
#Drop the missing values using dropna(inplace = True)
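
The four cleaning steps above can be sketched on a small stand-in frame. This is only an illustration under assumptions: the frame `toy` and the placeholder string `"Not Available"` are hypothetical, and the real data may mark missing values differently.

```python
import pandas as pd

# Toy data (hypothetical): numbers stored as strings, one non-numeric entry
toy = pd.DataFrame(
    {"Value": ["72.076", "Not Available", "64.084"]},
    index=pd.to_datetime(["1973-01-01", "1973-02-01", "1973-03-01"]),
)

print(toy.dtypes)  # Value is stored as object (strings) at this point

# Convert to numeric; entries that cannot be converted become NaN
toy["Value"] = pd.to_numeric(toy["Value"], errors="coerce")

# Count missing values per column
n_missing = toy.isnull().sum()
print(n_missing)

# Drop the rows that contain missing values
toy.dropna(inplace=True)
```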


### **Dataset visualization**

- The dataset covers CO2 emissions from 8 energy sources.
- Group the CO2 emission dataset by type of energy source.

In [None]:
(_______) = (_______).groupby('Description')
(_______).head()
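
As a rough sketch of what grouping by `Description` produces, the snippet below uses a toy frame with shortened, hypothetical category names. The variable names are illustrative and not the ones expected in the cell above.

```python
import pandas as pd

# Toy data (hypothetical values and shortened category names)
toy = pd.DataFrame(
    {
        "Value": [72.0, 12.0, 64.0, 11.0],
        "Description": ["Coal", "Natural Gas", "Coal", "Natural Gas"],
    },
    index=pd.to_datetime(["1973-01-01", "1973-01-01", "1973-02-01", "1973-02-01"]),
)

groups = toy.groupby("Description")

# head(n) on a GroupBy returns the first n rows of every group
print(groups.head(1))

# Iterating yields (group name, sub-frame) pairs
for name, group in groups:
    print(name, len(group))

sizes = groups.size()
```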

#### **Visualize the dependency of the emission in the power generation with time.**

In [None]:
cols = ['Geothermal Energy', 'Non-Biomass Waste', 'Petroleum Coke','Distillate Fuel ',
        'Residual Fuel Oil', 'Petroleum', 'Natural Gas', 'Coal', 'Total Emissions']

In [None]:
## Code here
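
One possible way to draw all sources on a single set of axes is sketched below. It assumes a frame with a datetime index and `Value` and `Description` columns; the toy values and shortened labels are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (hypothetical): three months for two sources
idx = pd.date_range("1973-01-01", periods=3, freq="MS")
toy = pd.DataFrame(
    {
        "Value": [72.0, 64.0, 61.0, 12.0, 11.0, 13.0],
        "Description": ["Coal"] * 3 + ["Natural Gas"] * 3,
    },
    index=idx.append(idx),
)

# One line per energy source on shared axes
fig, ax = plt.subplots(figsize=(8, 4))
for name, group in toy.groupby("Description"):
    ax.plot(group.index, group["Value"], label=name)

ax.set_xlabel("Time (monthly)")
ax.set_ylabel("CO2 emissions (million metric tons)")
ax.legend()
```

Putting every source on the same axes makes relative magnitudes easy to compare, though smaller sources can be hard to see next to large ones.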

- **Observations and Insights: _____**

#### **Visualize the trend of CO2 emission from each energy source individually**

In [None]:
###Code here
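
For per-source trends, a grid with one subplot per group is one option. The sketch below uses the same kind of hypothetical toy frame; each panel gets its own y-axis scale, so small sources stay readable.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (hypothetical): three months for two sources
idx = pd.date_range("1973-01-01", periods=3, freq="MS")
toy = pd.DataFrame(
    {
        "Value": [72.0, 64.0, 61.0, 12.0, 11.0, 13.0],
        "Description": ["Coal"] * 3 + ["Natural Gas"] * 3,
    },
    index=idx.append(idx),
)

groups = toy.groupby("Description")
n = groups.ngroups

# One row per source; squeeze=False keeps axes two-dimensional even when n == 1
fig, axes = plt.subplots(nrows=n, ncols=1, sharex=True,
                         figsize=(8, 2.5 * n), squeeze=False)
for ax, (name, group) in zip(axes[:, 0], groups):
    ax.plot(group.index, group["Value"])
    ax.set_title(name)
    ax.set_ylabel("Million metric tons")

fig.tight_layout()
```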

#### **Observations and Insights: ______**

#### **Bar chart of CO2 Emissions per energy source**

In [None]:
CO2_per_source = _______.groupby('Description')['Value'].sum().sort_values()

In [None]:
cols = ['Geothermal Energy', 'Non-Biomass Waste', 'Petroleum Coke','Distillate Fuel ',
        'Residual Fuel Oil', 'Petroleum', 'Natural Gas', 'Coal', 'Total Emissions']

In [None]:
##Code here
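
A minimal sketch of a bar chart built from per-source totals follows. The toy frame and its labels are hypothetical; with the real data, the summed series would come from the `groupby('Description')['Value'].sum().sort_values()` step shown earlier.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (hypothetical): three months for two sources
toy = pd.DataFrame(
    {
        "Value": [72.0, 64.0, 61.0, 12.0, 11.0, 13.0],
        "Description": ["Coal"] * 3 + ["Natural Gas"] * 3,
    }
)

# Total emissions per source, smallest first
totals = toy.groupby("Description")["Value"].sum().sort_values()

fig, ax = plt.subplots(figsize=(6, 4))
positions = range(len(totals))
ax.bar(positions, totals.values)
ax.set_xticks(positions)
ax.set_xticklabels(totals.index, rotation=90)
ax.set_ylabel("Total CO2 emissions (million metric tons)")
fig.tight_layout()
```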

#### **For developing the time series model and forecasting, use the natural gas CO2 emissions from electrical power generation**


In [None]:
_______ = ts.iloc[:, 1:]   # Keep the Value and Description columns
_______ = _______.groupby(['Description', pd.Grouper(freq="M")])['Value'].sum().unstack(level = 0)   # Monthly totals, one column per source
_______ = _______['Natural Gas Electric Power Sector CO2 Emissions']   # Monthly total emissions (mte) from natural gas
_______.head()
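
The reshape in the cell above (group by source and month, sum, then unstack) can be sketched on toy data as follows. The values and shortened category names are hypothetical. The sketch uses `freq="MS"`, which labels each month by its first day; this is a deliberate substitution for `"M"`, which labels by month end and is deprecated in recent pandas releases.

```python
import pandas as pd

# Toy data (hypothetical): three months for two sources, long format
idx = pd.date_range("1973-01-01", periods=3, freq="MS")
toy = pd.DataFrame(
    {
        "Value": [72.0, 64.0, 61.0, 12.0, 11.0, 13.0],
        "Description": ["Coal"] * 3 + ["Natural Gas"] * 3,
    },
    index=idx.append(idx),
)

# Sum per (source, month), then move the source level into columns
wide = (
    toy.groupby(["Description", pd.Grouper(freq="MS")])["Value"]
    .sum()
    .unstack(level=0)
)

# Select a single source as a monthly time series
gas = wide["Natural Gas"]
print(gas.head())
```

The result `wide` has one row per month and one column per source, so any single column is a ready-made univariate series for modelling.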

#### **Observations & insights: _____**

## **Proposed approach**

- **Potential techniques -** What different techniques should be explored?
- **Overall solution design -** What is the potential solution design?
- **Measures of success -** What are the key measures of success?