# Energy Indicators Analysis

### Reading the data
The energy data from the file `Energy Indicators.xls`, which is a list of indicators of [energy supply and renewable electricity production](Energy%20Indicators.xls) from the [United Nations](http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls) for the year 2013, is read and stored into a DataFrame with the variable name of **energy**.

This is an Excel file, not a comma-separated values file. The header and footer information in the data file is excluded. The first two columns are unnecessary, so I drop them and relabel the remaining columns as:

`['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']`

I convert `Energy Supply` to gigajoules (there are 1,000,000 gigajoules in a petajoule). Missing data (e.g. entries containing "...") are stored as `np.nan` values.
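
The two cleaning steps above can be tried on a small made-up table. This is only a sketch: the values are hypothetical, and `pd.to_numeric(..., errors="coerce")` is used here as a compact stand-in for replacing "..." explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for the real spreadsheet.
sample = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Energy Supply": [1.0, "...", 2.5],  # petajoules; "..." marks missing data
})

# Anything that is not a number (here the "..." marker) becomes NaN.
sample["Energy Supply"] = pd.to_numeric(sample["Energy Supply"], errors="coerce")

# 1 petajoule = 1,000,000 gigajoules.
sample["Energy Supply"] = sample["Energy Supply"] * 1_000_000

print(sample)
```

Missing entries stay `NaN` after the multiplication, so they are still excluded from later means and sums.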

I rename the following list of countries:

```
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"
```

There are also several countries with numbers and/or parentheses in their names. I remove these, e.g.

`'Bolivia (Plurinational State of)'` becomes `'Bolivia'`,

`'Switzerland17'` becomes `'Switzerland'`.
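
As a minimal sketch of this clean-up, the two example names can be run through a pair of regular expressions. The third name is an added control case that should pass through unchanged.

```python
import pandas as pd

names = pd.Series([
    "Bolivia (Plurinational State of)",
    "Switzerland17",
    "Japan",  # control case: nothing to strip
])

# Remove footnote digits, then any parenthesised suffix with its leading space.
cleaned = (
    names.str.replace(r"\d+", "", regex=True)
         .str.replace(r"\s*\(.*\)", "", regex=True)
         .str.strip()
)

print(cleaned.tolist())  # ['Bolivia', 'Switzerland', 'Japan']
```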

<br>

Next, I load the GDP data from the file `world_bank.csv`, a CSV file containing countries' GDP from 1960 to 2015 from the [World Bank](http://data.worldbank.org/indicator/NY.GDP.MKTP.CD). This DataFrame is called **GDP**.

I skip the header, and rename the following list of countries:

```
"Korea, Rep.": "South Korea",
"Iran, Islamic Rep.": "Iran",
"Hong Kong SAR, China": "Hong Kong"
```

<br>

Finally, I load the [SCImago Journal and Country Rank data for Energy Engineering and Power Technology](http://www.scimagojr.com/countryrank.php?category=2102) from the file `scimagojr-3.xlsx`, which ranks countries based on their journal contributions in the aforementioned area. This DataFrame is called **ScimEn**.

I join the three datasets: GDP, Energy, and ScimEn into a new dataset (using the intersection of country names). I use only the last 10 years (2006-2015) of GDP data and only the top 15 countries by Scimagojr 'Rank' (Rank 1 through 15). 

The index of this DataFrame is the name of the country, and the columns are `['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']`.
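
The join described above can be illustrated on three toy frames. The frame names, values and the rank cutoff of 2 are made up for the illustration; only the pattern (inner joins on `Country`, then indexing by country) mirrors the text.

```python
import pandas as pd

# Hypothetical miniature versions of the three datasets.
energy_toy = pd.DataFrame({"Country": ["A", "B", "C"], "% Renewable": [10.0, 20.0, 30.0]})
gdp_toy = pd.DataFrame({"Country": ["B", "C", "D"], "2015": [1.0, 2.0, 3.0]})
rank_toy = pd.DataFrame({"Country": ["C", "B", "E"], "Rank": [1, 2, 3]})

# how="inner" keeps only countries present in every frame.
merged_toy = (
    rank_toy.merge(energy_toy, on="Country", how="inner")
            .merge(gdp_toy, on="Country", how="inner")
            .set_index("Country")
)

# Keep the best-ranked rows (cutoff of 2 here instead of 15).
top_toy = merged_toy[merged_toy["Rank"] <= 2]
print(top_toy)
```

Only `B` and `C` survive, because `A`, `D` and `E` are each missing from at least one frame.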

In [1]:
import os

import numpy as np
import pandas as pd

energy_file = os.path.join("Resources", 'Energy Indicators.xls')
scimagojr_file = os.path.join("Resources", 'scimagojr-3.xlsx')
gdp_file = os.path.join("Resources", 'world_bank.csv')

# Energy data: skip the 17 header rows and the footer below sheet row 244.
xlsx = pd.ExcelFile(energy_file)
total_rows = xlsx.book.sheet_by_index(0).nrows
skiprows = 17
nrows = 244 - 17
skipfooter = total_rows - nrows - skiprows - 1
# Drop the first two columns (as described above) and name the rest by position,
# so the result does not depend on auto-generated header labels.
energy = xlsx.parse(0, skiprows=skiprows, skipfooter=skipfooter, usecols=[2, 3, 4, 5])
energy.columns = ['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
# "..." marks missing data.
energy = energy.replace('...', np.nan)
# Strip footnote digits and parenthesised suffixes first, so the renames below match.
energy['Country'] = energy['Country'].replace([r'\s*\d+', r'\s*\(.*\)'], '', regex=True)
energy = energy.replace({'Country': {
    "Republic of Korea": "South Korea",
    "United States of America": "United States",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "China, Hong Kong Special Administrative Region": "Hong Kong"}})
# Petajoules to gigajoules.
energy['Energy Supply'] = energy['Energy Supply'] * 1000000

# GDP data: skip the 4 header lines and keep only 2006-2015.
GDP = pd.read_csv(gdp_file, skiprows=4).rename(columns={'Country Name': 'Country'})
GDP = GDP.replace({'Country': {
    "Korea, Rep.": "South Korea",
    "Iran, Islamic Rep.": "Iran",
    "Hong Kong SAR, China": "Hong Kong"}})
columns_to_keep = ['Country', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']
GDP = GDP[columns_to_keep]

ScimEn = pd.read_excel(scimagojr_file)

# Inner joins keep only the countries present in all three datasets.
Merged = (pd.merge(pd.merge(energy, GDP, how='inner', on='Country'),
                   ScimEn, how='inner', on='Country')
          .set_index('Country'))
Merged.head()

### Question
What are the top 15 countries by Scimagojr 'Rank' (Rank 1 through 15)?

In [None]:
top15 = Merged.nsmallest(15,'Rank')
def top_15(df):
    result = df
    result = result.reset_index()
    result = result.set_index("Rank")
    return result[["Country"]]
top_15(top15)

### Question
When the datasets were joined using the intersection of the Countries, how many entries were lost?

In [None]:
def lost_entries():
    Merged1=pd.merge(pd.merge(energy, GDP, how='outer', left_on='Country', right_on='Country'),ScimEn, how='outer', left_on='Country', right_on='Country').set_index('Country')
    return len(Merged1)-len(Merged)

lost_entries()
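
To see which entries are lost rather than only how many, one option is to compare the sets of country names directly. This is a sketch on hypothetical toy frames and assumes each country appears at most once per dataset; under that assumption the count equals the difference between the outer and inner join lengths.

```python
import pandas as pd

# Hypothetical miniature datasets, one row per country.
frame_a = pd.DataFrame({"Country": ["A", "B", "C"]})
frame_b = pd.DataFrame({"Country": ["B", "C", "D"]})
frame_c = pd.DataFrame({"Country": ["C", "B", "E"]})

name_sets = [set(df["Country"]) for df in (frame_a, frame_b, frame_c)]
in_any = set.union(*name_sets)         # rows of the outer join
in_all = set.intersection(*name_sets)  # rows of the inner join
lost = sorted(in_any - in_all)

print(len(lost), lost)  # 3 ['A', 'D', 'E']
```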

## Working with the top 15 Countries

### Question
What is the average GDP over the last 10 years for each country? (Missing values are excluded from this calculation.)


In [None]:
import numpy as np

def avg_gdp(df):
    my_data = df
    rows  = ['2006','2007','2008','2009','2010','2011','2012','2013','2014','2015']
    my_data['avgGDP']=my_data.apply(lambda x: np.mean(x[rows].dropna()), axis=1)
    return my_data['avgGDP'].sort_values(axis=0, ascending=False)

avg_gdp(top15)

### Question
By how much had the GDP changed over the 10 year span for the country with the 6th largest average GDP?

In [None]:
def gdp_change_6th(df):
    my_data = df
    rows  = ['2006','2007','2008','2009','2010','2011','2012','2013','2014','2015']
    my_data['avgGDP']=my_data.apply(lambda x: np.average(x[rows].dropna()), axis=1)
    my_data['DeltaGDP']=my_data.apply(lambda x: x['2015']-x['2006'], axis=1)
    my_data=my_data.sort_values(by=['avgGDP'],ascending=False)
    return my_data.reset_index().loc[5][["Country",'DeltaGDP']]

gdp_change_6th(top15)

### Question
What is the mean `Energy Supply per Capita`?


In [None]:
def Avg_energy(df):
    return df['Energy Supply per Capita'].mean()

Avg_energy(top15)

### Question
What country has the maximum % Renewable and what is the percentage?

In [None]:
def max_renew(df):
    my_data = df
    max_percent=my_data['% Renewable'].max()
    max_country = my_data[my_data['% Renewable'] == max_percent]
    return (max_country.index[0],max_percent)

max_renew(top15)

### Question
Create a new column that is the ratio of Self-Citations to Total Citations. 
What is the maximum value for this new column, and what country has the highest ratio?


In [None]:
def top_citation_ratio():
    global top15
    top15['Ratio']=top15['Self-citations']/top15['Citations']
    Topratio=top15['Ratio'].max()
    top_country = top15[top15['Ratio'] == Topratio]
    return (top_country.index[0],Topratio)

top_citation_ratio()

### Question

Estimate the population using Energy Supply and Energy Supply per capita. 
What is the third most populous country according to this estimate?


In [None]:
def top_pop():
    global top15
    top15['PopEst']=top15['Energy Supply']/top15['Energy Supply per Capita']
    my_data = top15.sort_values(by=['PopEst'], ascending=False)
    my_data=my_data.reset_index()
    return my_data.loc[2]['Country']

top_pop()

### Question
Create a column that estimates the number of citable documents per person. 
What is the correlation between the number of citable documents per capita and the energy supply per capita?


In [None]:
def corr_cite_energ():
    global top15
    top15['Citable docs per capita']=top15['Citable documents']/top15['PopEst']
    return top15['Citable docs per capita'].corr(top15['Energy Supply per Capita'], method='pearson')

corr_cite_energ()

In [None]:
def plot9():
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    # Scatter plot of citable documents per capita against energy supply per capita.
    top15.plot(x='Citable docs per capita', y='Energy Supply per Capita', kind='scatter', xlim=[0, 0.0006])
    
plot9()

### Classifying
Create a new column with a 1 if the country's % Renewable value is at or above the median for all countries in the top 15, and a 0 if the country's % Renewable value is below the median.


In [None]:
def renew_class():
    global top15
    top15['HighRenew']=top15.apply(lambda x: 0 if x['% Renewable']<np.median(top15['% Renewable']) else 1, axis=1 )
    return top15['HighRenew']

renew_class()
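
The same classification can also be written without a row-wise `apply`. The sketch below uses made-up percentages and assumes the column has no missing values.

```python
import pandas as pd

# Hypothetical % Renewable values for four countries.
renewable = pd.Series([5.0, 10.0, 15.0, 20.0], index=["A", "B", "C", "D"])

# The comparison yields booleans; astype(int) turns True/False into 1/0.
high_renew = (renewable >= renewable.median()).astype(int)

print(high_renew.tolist())  # [0, 0, 1, 1]
```

The median here is 12.5, so the two larger values are flagged with 1.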

### Summary per continent
Group the countries by continent, then create a DataFrame that displays the sample size (the number of countries in each continent bin), and the sum, mean, and standard deviation of the estimated population of each country.


In [None]:
def continent_group():
    global top15
    ContinentDict  = {'China':'Asia', 
                  'United States':'North America', 
                  'Japan':'Asia', 
                  'United Kingdom':'Europe', 
                  'Russian Federation':'Europe', 
                  'Canada':'North America', 
                  'Germany':'Europe', 
                  'India':'Asia',
                  'France':'Europe', 
                  'South Korea':'Asia', 
                  'Italy':'Europe', 
                  'Spain':'Europe', 
                  'Iran':'Asia',
                  'Australia':'Australia', 
                  'Brazil':'South America'}
    top15 = top15.reset_index()
    top15["Continent"] = top15.apply(lambda x: ContinentDict[x["Country"]], axis =1)
    my_data = top15.groupby("Continent")
    continent_data = pd.DataFrame(my_data["Country"].count())
    continent_data["Population"] = my_data["PopEst"].sum()
    continent_data["Average Population/Country"] = my_data["PopEst"].mean()
    continent_data["Standard deviation"] = my_data["PopEst"].std()
    top15 = top15.set_index("Country")
    return  continent_data

continent_group()

### Question
Cut `% Renewable` into 5 bins. Group `top15` by Continent as well as by these new `% Renewable` bins. How many countries are in each of these groups?

In [None]:
def continent_renew():
    global top15
    top15['Bin%Renew']=pd.cut(top15['% Renewable'],5)
    my_data = top15.reset_index()
    result = pd.DataFrame(my_data.groupby(["Continent", 'Bin%Renew'])["Country"].count())
    return result

continent_renew()
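
When given an integer, `pd.cut` splits the observed range into that many equal-width, right-closed intervals. A small sketch with made-up values shows which bin each value falls into:

```python
import pandas as pd

# Hypothetical percentages spanning 0 to 50.
values = pd.Series([0.0, 10.0, 25.0, 40.0, 50.0])

# Five equal-width bins, each 10 wide; the lowest edge is nudged down
# slightly so that the minimum value is included.
binned = pd.cut(values, 5)

print(binned.cat.categories)     # the five intervals
print(binned.cat.codes.tolist()) # [0, 0, 2, 3, 4]
```

Because intervals are closed on the right, 10.0 lands in the first bin, and the second bin stays empty for this data.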


### Question
Convert the Population Estimate series to strings with a thousands separator (using commas).


In [None]:
def pop_convert():
    global top15
    top15['Convert PopEst']=top15['PopEst'].map('{:,.2f}'.format)
    return top15['Convert PopEst']

pop_convert()

### Visualization

In [None]:
def plot_visual():
    import matplotlib.pyplot as plt
    %matplotlib inline
    global top15
    ax = top15.plot(x='Rank', y='% Renewable', kind='scatter', 
                    c=['#e41a1c','#377eb8','#e41a1c','#4daf4a','#4daf4a','#377eb8','#4daf4a','#e41a1c',
                       '#4daf4a','#e41a1c','#4daf4a','#4daf4a','#e41a1c','#dede00','#ff7f00'], 
                    xticks=range(1,16), s=6*top15['2014']/10**10, alpha=.75, figsize=[16,6]);

    for i, txt in enumerate(top15.index):
        # Positional access: the index holds country names, not integers.
        ax.annotate(txt, [top15['Rank'].iloc[i], top15['% Renewable'].iloc[i]], ha='center')

    print("This is an example of a visualization that can be created to help understand the data. \
This is a bubble chart showing % Renewable vs. Rank. The size of the bubble corresponds to the countries' \
2014 GDP, and the color corresponds to the continent.")
    
plot_visual()
