# Project: Investigate a Dataset (Non-Renewable Energy Sources)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction
> <ul>
> <li>This project sheds light on the use of non-renewable energy sources by country, along with the associated CO2 emissions.</li>
> <li>It explores the recent history of two non-renewable energy sources, oil and coal.</li>
> <li>Alongside energy usage, it explores the pollution caused in the form of CO2 emissions.</li>
> <li>The data was collected from www.gapminder.org/data</li>
> <li><b>This project was done as part of Udacity India's MLFND course by Mayur Selukar (@mrselukar)</b></li>
> </ul>
> <br><br>
> <b>Cheers</b><br><br>
> <b>NOTE</b> Resources are cited in comments near where they are used.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as ff
py.offline.init_notebook_mode(connected=True)
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
# https://www.youtube.com/watch?v=XUNaGFa9xCM
# resource for plotly
def print_df(data_f):
    """
    Nicely prints a dataframe
    """
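    # title for the index column, taken from the frame that was passed in
    index_name = data_f.index.name
    # insert a line break after every 15 characters so that long titles wrap
    i = 15
    while i < len(index_name):
        index_name = index_name[:i] + '<br>' + index_name[i:]
        i += 15 + 4  # skip over the 4 characters of the inserted '<br>'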
    table = ff.create_table(data_f,index=True,index_title = index_name,height_constant=60)
    table.layout.width=145*len(data_f.columns)
    py.offline.iplot(table)
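
The title wrapping that `print_df` aims for can be sketched on its own, without plotly. The helper name `wrap_title` and the default width are hypothetical and are not part of the notebook; this is only one way to break a long title into fixed-width chunks.

```python
def wrap_title(title, width=15):
    """Hypothetical helper: split a title into chunks of `width`
    characters and join them with the HTML line break '<br>'."""
    chunks = [title[i:i + width] for i in range(0, len(title), width)]
    return '<br>'.join(chunks)

# A 26-character title is split after its first 15 characters
wrapped = wrap_title('Oil Consumption per capita')
print(wrapped)
```

Slicing the string once into chunks avoids modifying it while looping over its positions.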

<a id='wrangling'></a>
## Data Wrangling

### General Properties
> The data comes from BP and CDIAC. In both cases each row represents a country and each column gives that country's reading for one year.  
> A missing reading appears either as NaN or as "-".
#### What's done
> All the data is loaded into separate data frames  
> The column labels (years) are converted to integers  
> The data is converted to a numeric type  
> The index is set to the country name

In [3]:
"""
data_links maps each dataset name to the path of its CSV file
"""
data_links = {'coal_consumption_per_capita' : './csv_data/Coal Consumption per capita.xls.csv',
             'coal_consumption_total':'./csv_data/Coal Consumption.xls.csv',
             'co2_per_capita' : './csv_data/carbon_dioxide_emissions_per_capita.csv',
             'co2_total':'./csv_data/carbon_dioxide_total_emissions.csv',
             'oil_consumption_per_capita':'./csv_data/Oil Consumption per capita.xls.csv',
             'oil_consumption_total':'./csv_data/Oil Consumption.xls.csv',
             'oil_production_total':'./csv_data/Oil Production.xls.csv',
             'oil_production_per_capita':'./csv_data/Oil Production per capita.xls.csv',
             'oil_reserves_per_capita':'./csv_data/Oil Proved reserves per capita.csv',
             'oil_reserves_total':'./csv_data/Oil Proved reserves.xls.csv'}

In [4]:
df = {} # a dictionary of data frames, keyed by dataset name
"""
Loading the data into data frames
"""
for name,link in data_links.items():
    df[name] = pd.read_csv(link)
    # listing the column names 
    col_headers = list(df[name].columns.values)
    # setting the index of each row to be the country name 
    df[name].set_index(col_headers[0],inplace=True)    
## All the data is in tonnes per year 
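
As a minimal sketch of the loading step, the snippet below reads a small CSV from an in-memory string instead of a file. The CSV text, country names and readings are made up for illustration. The only assumption carried over from the notebook is that the first column holds the country names.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for one of the files in ./csv_data
csv_text = u"Oil Consumption,1980,1981\nAland,1.5,-\nBland,2.0,2.5\n"

frame = pd.read_csv(io.StringIO(csv_text))
# The first column holds the country names, so it becomes the index
first_column = frame.columns[0]
frame = frame.set_index(first_column)

print(frame)
```

At this point the year labels are still strings and the "-" placeholder is still text, which is why a separate cleaning step is needed.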

### Data Cleaning (Removing the empty data and filling NaNs for unexpected entries)
#### Also type casting the data to a numeric type for operating on it 

In [5]:
for name,d_frame in df.items():
    # every column is converted to float64; entries that cannot be parsed become NaN
    # the unparseable entries in this case are the empty records written as '-'
    d_frame = d_frame.apply(pd.to_numeric,errors='coerce')
    # zero readings are treated as missing as well
    d_frame = d_frame.replace(0,np.nan)
    # converting the column labels from strings to ints
    d_frame.columns = d_frame.columns.astype(int)
    # this reassignment is necessary, otherwise df[name] keeps the unconverted frame
    df[name] = d_frame
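
To illustrate the cleaning step, here is a minimal sketch on a made-up two-country frame. The names and readings are invented and are not taken from the Gapminder files. It assumes, as described under General Properties, that a missing reading is written as "-".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: year labels are strings and '-' marks a missing reading
toy = pd.DataFrame(
    {'1980': ['1.5', '-'], '1981': ['0', '2.5']},
    index=pd.Index(['Aland', 'Bland'], name='country'),
)

# Parse every column as numbers; anything unparseable ('-') becomes NaN
clean = toy.apply(pd.to_numeric, errors='coerce')
# Treat zero readings as missing too
clean = clean.replace(0, np.nan)
# Year labels become integers so they can be compared with numbers later
clean.columns = clean.columns.astype(int)

print(clean)
```

Each step returns a new frame, so the result has to be assigned back for the conversion to take effect.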

### Data Cleaning 
> Trimming the data  
> Adding missing columns  
> Dropping countries where fewer than 20 years of data are present

In [6]:
for name,d_frame in df.items():
    # keeping only the years 1980 to 2010
    drop_list = [x for x in d_frame.columns if x < 1980 or x > 2010 ]
    d_frame.drop(drop_list,axis = 1,inplace =True)
    d_frame.replace(0,np.nan, inplace =True)
    # Dropping countries with fewer than 20 years of data
    d_frame.dropna(thresh=20, inplace = True)
    # adding any year missing from the source as an all-NaN column
    col_names = d_frame.columns
    req_columns = range(1980,2011,1)
    for x in req_columns:  
        if x not in col_names:
            d_frame[x] = np.nan
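
The trimming logic can be sketched on a toy frame. To keep the example small it uses a three-year window and `thresh=2` rather than the notebook's 1980 to 2010 window and `thresh=20`. It also uses `reindex` to add the missing year, which is a substitute for the explicit loop, not the notebook's own method. All data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with integer year columns
toy = pd.DataFrame(
    {1979: [1.0, 1.0], 1980: [2.0, np.nan], 1982: [3.0, np.nan]},
    index=pd.Index(['Aland', 'Bland'], name='country'),
)

# Keep only the years inside the window of interest (here 1980 to 1982)
toy = toy.drop(columns=[y for y in toy.columns if y < 1980 or y > 1982])
# Keep rows with at least 2 non-missing readings
toy = toy.dropna(thresh=2)
# Add any year that is absent from the source as an all-NaN column
toy = toy.reindex(columns=list(range(1980, 1983)))

print(toy)
```

`dropna(thresh=n)` keeps a row only if it has at least `n` non-missing values, so "Bland", with no readings left inside the window, is removed.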

### Adding units to the index column name

In [7]:
df['co2_per_capita'].index.name = df['co2_per_capita'].index.name+"<br>(Tonnes per Year)"
df['co2_total'].index.name = df['co2_total'].index.name+"<br>(Tonnes per Year)"

### Helper Functions
> Before exploring the questions, some helper functions need to be created.  
> <ol>
> <li>add_averages(data_f)  
>> Adds an average column (labeled 0) and an average row (labeled 'Global Average') to the data frame  
>> http://www.datadan.io/python-pandas-pitfalls-hard-lessons-learned-over-time/ 
> </li>
> <li>plot_line_graph(data_f)  
>> Plots a line graph for the given data frame  
>> The df.columns are the points on the x axis  
>> The quantity is represented on the y axis 
> </li>
> </ol>


In [8]:
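"""
Adds an average column (labeled 0) and an average row (labeled 'Global Average') to the data frame
http://www.datadan.io/python-pandas-pitfalls-hard-lessons-learned-over-time/
"""
def add_averages(data_f):
    # mean of every year across countries
    new_row = data_f.mean(axis=0)
    # mean of every country across years
    new_col = data_f.mean(axis=1)
    data_f.loc['Global Average'] = new_row
    # new_col has no 'Global Average' entry, so that row gets NaN in column 0
    data_f[0] = new_col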
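
The effect of `add_averages` can be checked on a tiny made-up frame. The sketch below repeats the same four steps inline; the country names and values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {1980: [1.0, 3.0], 1981: [2.0, 5.0]},
    index=pd.Index(['Aland', 'Bland'], name='country'),
)

row_of_means = toy.mean(axis=0)   # one mean per year, across countries
col_of_means = toy.mean(axis=1)   # one mean per country, across years
toy.loc['Global Average'] = row_of_means
toy[0] = col_of_means

print(toy)
```

Because the per-country means are computed before the 'Global Average' row exists, that row ends up with NaN in column 0. This matters for the sorting step, which treats NaN specially.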

In [9]:
def plot_line_graph(data_f):
    # Adding graph data
    years = list(data_f.columns)
    # removing the zero column (holding the average)
    try:
        years.remove(0)
    except ValueError:
        print("Warning: the zero column was not found")
    # each row corresponds to one trace on the line graph
    trace_data = []
    for country,row in data_f.iterrows():
        trace = go.Scatter(
                        x=years,
                        # only the year columns, so that x and y have the same length
                        y=row[years], 
                        # axis titles are set in the layout below
                        mode='lines+markers',
                        name=country,
                        marker = dict(
                            size = 7,
                            line = dict(
                            width = 0,
                                )
                            )
                        )
        trace_data.append(trace)
    my_layout = dict(
                    title = 'Top Countries '+ data_f.index.name,
                    hovermode= 'closest',
                    yaxis = dict(
                                    zeroline = True,
                                    title= data_f.index.name,
                                    ticklen= 5
                                ),
                    xaxis = dict(
                                    zeroline = True,
                                    title= 'Years',
                                    ticklen= 5
                                )
             )

    figure = dict(data=trace_data, layout=my_layout)
    py.offline.iplot(figure)

### Reorganizing the data
> 1. Adding the average columns
> 2. Sorting the data by each country's average over the 1980 to 2010 period

In [10]:
"""
This cell adds the averages to each data frame
"""
for name,d_frame in df.items():
    add_averages(d_frame)
    

In [11]:
"""
This cell sorts the data on the column labeled 0,
i.e. the column holding each country's average over the 1980 to 2010 period
The Global Average row has NaN in column 0, so na_position = 'first' keeps it at the top
"""
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
for name,d_frame in df.items():
    d_frame.sort_values(by = [0],ascending = False,inplace = True,na_position = 'first')

<a id='eda'></a>
## Exploratory Data Analysis
### Q1. Who are the top consumers of oil, and how do they compare to the global average?
> This explores the top consumers both per capita and as nations.

In [12]:
def explore_df(data_f):
    # the first 7 rows after sorting, selected by position
    print_df(data_f.iloc[:7,:])
    plot_line_graph(data_f.iloc[:7,:])
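
Selecting the first rows by position can be sketched with `.iloc`, the positional indexer in current pandas (the older `.ix` indexer was removed in pandas 1.0). The frame below is made up and takes the top 2 rows rather than 7.

```python
import pandas as pd

toy = pd.DataFrame(
    {1980: [5.0, 4.0, 3.0, 2.0]},
    index=pd.Index(['A', 'B', 'C', 'D'], name='country'),
)

# First two rows and all columns, chosen by position rather than by label
top_two = toy.iloc[:2, :]
print(top_two)
```

The slice keeps the index name, so a title derived from `index.name` still works on the subset.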

In [13]:
explore_df(df['oil_consumption_per_capita'])

#### Answer: Oil Consumption/Capita
> As seen, the top consumer of crude oil per capita is Singapore, and has been since 1993; before that the UAE was the top consumer per capita.  
> However, the UAE's consumption is falling.

In [14]:
explore_df(df['oil_consumption_total'])

#### Answer: Total Oil Consumption
> As seen, the top consumer of crude oil is the USA, and has been since 1980.  
> Japan's consumption is falling, while China's is rising.  
> All of the top 6 are above the global per-country consumption average.

In [15]:
explore_df(df['coal_consumption_per_capita'])

#### Answer: Coal Consumption/Person
> As seen, the top consumer of coal per person is Australia, and has been since 1996; before that it was the Czech Republic.  
> However, the consumption of the top countries is decreasing.  
> All of the top 6 are above the global per-country consumption average.  
> Global consumption per person is almost constant.  

In [16]:
explore_df(df['coal_consumption_total'])

#### Answer: Coal Consumption Total
> As seen, the top consumer of coal is China, and has been since 1988; before that it was the USA.  
> India's coal consumption is also increasing.  
> All other countries show a drop in total coal consumption.

### Q2. Where is the oil coming from?

In [17]:
explore_df(df['oil_production_per_capita'])

In [18]:
explore_df(df['oil_production_total'])

> Although total global oil production has increased slightly, oil production per capita has decreased.  
> Russia, Saudi Arabia and the USA are the top producers of oil.  
> Per capita, Qatar, Kuwait and the UAE are the top producers.

### How much oil is left, and where is it?

In [19]:
explore_df(df['oil_reserves_per_capita'])

In [20]:
explore_df(df['oil_reserves_total'])

> Per-person reserves are on a downward trend, but the total proven reserves are increasing.  
> This indicates that proven reserves are increasing but the population is growing faster.  
> Most of the oil reserves are in Saudi Arabia, whereas Kuwait and the UAE have the most oil reserves per person.
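
The arithmetic behind this observation can be checked with made-up numbers. The figures below are hypothetical and chosen only to show that a per-person value falls whenever the population grows faster than the total.

```python
# Hypothetical figures, chosen only to illustrate the arithmetic
reserves_start, reserves_end = 100.0, 120.0      # total reserves grow by 20%
population_start, population_end = 10.0, 15.0    # population grows by 50%

per_capita_start = reserves_start / population_start
per_capita_end = reserves_end / population_end

print(per_capita_start, per_capita_end)
```

Here the total rises from 100 to 120 while the per-person figure drops from 10 to 8, which is the same pattern as in the reserves data.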

### Who is polluting the most?

In [21]:
explore_df(df['co2_per_capita'])

In [22]:
explore_df(df['co2_total'])

>> Per person, Qatar is releasing the most CO2  
>> In total CO2 emitted, the USA was at the top until 2005, when China took over; China is currently releasing the most CO2  
>> The global average was nearly constant until 2009 but has since been rising

<a id='conclusions'></a>
## Conclusions

> The report visualizes the pattern of crude oil and coal consumption over the period 1980 to 2010.  
> It answers several questions on the consumption and production of non-renewable energy sources.  
> It also explores the remaining proven reserves and the CO2 emissions by country.

> Despite various environmental conservation efforts, the consumption of non-renewable energy is on the rise and the proven reserves per person are on the decline.  
> Meanwhile, CO2 emissions are rising again after a long period of being nearly constant.  
> This raises the question of where all the environmental conservation efforts have gone.