# Introduction #
This notebook is my first attempt at exploratory data analysis. It uses the World Bank's World Development Indicators database to explore energy use and how it relates to development indicators such as GDP and value added. After exploring the data with a group of visualizations, it runs simple linear regressions on a few of the relationships.

This first section imports the packages needed for rudimentary data analysis. NumPy handles numerical and linear algebra operations. Pandas handles data processing and provides the data frames into which we load the SQL tables, which makes the work easier. sqlite3 is used only to open the database connection; from there on we work in Pandas. scipy.stats provides the simple linear regression. SciPy is built on NumPy and, although NumPy has its own fitting routines, I prefer using scipy.stats. Finally, matplotlib.pyplot is our visualization tool.

In [1]:
import numpy as np
import pandas as pd
import sqlite3 as sql
import scipy.stats as sp
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')

conn = sql.connect('../input/database.sqlite')

# Section 1 #
This short section starts our exploratory analysis. I decided to work from the SQLite database in order to practice SQL alongside Python, since the two are often used together in data analysis.

The first cell reads the Series table to see which indicators we could use. We create two helper lists and loop through the table to pair the indicator names with the codes we will use to query the Indicators table. This index of names and codes can be printed, but it is not displayed here to save space.

In [2]:
Series = pd.read_sql('''
                     SELECT IndicatorName
                            ,SeriesCode
                            ,UnitOfMeasure
                            ,DevelopmentRelevance
                     FROM   Series
                     ''', con = conn)

names, codes = [], []
for v in Series['IndicatorName']:
    if v not in names:
        names.append(v)
for v in Series['SeriesCode']:
    if v not in codes:
        codes.append(v)
index = list(zip(names, codes))
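
As an alternative sketch, the same index of names and codes could be built directly in pandas. Deduplicating the two columns together keeps each name aligned with its own code even if a name repeats. The small `toy_series` frame below is made-up data standing in for the Series table, and `toy_index` is a hypothetical name used only for this illustration.

```python
import pandas as pd

# Made-up stand-in for the Series table (not real WDI rows)
toy_series = pd.DataFrame({
    'IndicatorName': ['Indicator A', 'Indicator B', 'Indicator A'],
    'SeriesCode':    ['AA.ONE',      'BB.TWO',      'AA.ONE'],
})

# Drop repeated (name, code) pairs together so the pairs stay aligned
pairs = toy_series[['IndicatorName', 'SeriesCode']].drop_duplicates()

# Convert each remaining row into a plain (name, code) tuple
toy_index = list(pairs.itertuples(index=False, name=None))
print(toy_index)
```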

# Section 2 #
This section is straightforward. From our previous index I decided to analyze energy usage against gross domestic product and a number of other indicators. I also pulled the total population indicator so that energy use per capita can be converted into total energy use. In total, the indicators queried below are: energy use per capita, total population, gross domestic product, final consumption expenditure, manufacturing value added, gross fixed capital formation, CO2 emissions, and high-technology exports. The GDP, consumption, manufacturing, and capital formation series are in constant 2005 USD; high-technology exports are in current USD and CO2 emissions are in kilotons. Population is a plain head count. Energy use is in kilograms of oil equivalent per capita.

Each query uses the pandas read_sql function to read from the SQLite database. Although the Value column is the only thing really needed from the database, I also select CountryCode and Year so that rows can be matched across the various data frames.

In [3]:
nrg = pd.read_sql('''SELECT CountryCode, Year, Value FROM Indicators
                  WHERE  IndicatorCode IS 'EG.USE.PCAP.KG.OE' ''', con = conn)
pop = pd.read_sql('''SELECT CountryCode, Year, Value FROM Indicators
                  WHERE IndicatorCode IS 'SP.POP.TOTL' ''', con = conn)
gdp = pd.read_sql('''SELECT CountryCode, Year, Value FROM Indicators
                  WHERE IndicatorCode IS 'NY.GDP.MKTP.KD' ''', con = conn)
#Final consumption expenditure
fce = pd.read_sql('''SELECT CountryCode, Year, Value FROM Indicators
                  WHERE  IndicatorCode IS 'NE.CON.TOTL.KD' ''', con = conn)
#Manufacturing value added
mva = pd.read_sql('''SELECT CountryCode, Year, Value FROM Indicators
                  WHERE  IndicatorCode IS 'NV.IND.MANF.KD' ''', con = conn)
#Gross fixed capital formation
gfc = pd.read_sql('''SELECT CountryCode, Year, Value FROM Indicators
                  WHERE IndicatorCode IS 'NE.GDI.FTOT.KD' ''', con = conn)
#CO2 emissions (kt)
co2 = pd.read_sql('''SELECT CountryCode, Year, Value FROM Indicators
                  WHERE IndicatorCode IS 'EN.ATM.CO2E.KT' ''', con = conn)
#High-technology exports (current US$)
hit = pd.read_sql('''SELECT CountryCode, Year, Value FROM Indicators
                  WHERE IndicatorCode IS 'TX.VAL.TECH.CD' ''', con = conn)

#Country codes that refer to regions or income groups; keeping them would double count their member countries
names = ['EAS','EAP','EMU','ECS','EUU','HPC','HIC','OEC','LCN','LAC','LMY','LMC','MIC','NAX','OED',
        'SAS','SSF','SSA','UMC','WLD','DZA']
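
The queries above differ only in the indicator code, so they could be wrapped in a small helper. The sketch below is illustrative rather than part of the notebook: `read_indicator` is a hypothetical name, and the in-memory table holds made-up rows so that the example runs without the real database.

```python
import sqlite3
import pandas as pd

def read_indicator(con, code):
    """Return the CountryCode, Year and Value rows for one indicator code."""
    query = '''SELECT CountryCode, Year, Value FROM Indicators
               WHERE IndicatorCode = ?'''
    # The ? placeholder lets sqlite3 bind the code as a parameter
    return pd.read_sql(query, con=con, params=(code,))

# Made-up rows in an in-memory database, standing in for database.sqlite
demo = sqlite3.connect(':memory:')
demo.execute('''CREATE TABLE Indicators
                (CountryCode TEXT, IndicatorCode TEXT, Year INTEGER, Value REAL)''')
demo.executemany('INSERT INTO Indicators VALUES (?, ?, ?, ?)', [
    ('AAA', 'SP.POP.TOTL',    2000, 100.0),
    ('AAA', 'NY.GDP.MKTP.KD', 2000, 5000.0),
    ('BBB', 'SP.POP.TOTL',    2000, 250.0),
])
demo.commit()

demo_pop = read_indicator(demo, 'SP.POP.TOTL')
print(demo_pop)
```

With a helper like this, each data frame would come from one short call per indicator code instead of a repeated query string.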

# Section 3 #
Here we move into the meat and potatoes of the project. First, the code merges nrg (energy use per capita) and pop (total population) and multiplies the two values to get total energy use rather than energy use per capita. It then drops the rows whose country codes appear in the names list above. Those codes correspond to regional or world aggregates, so keeping them would double count the underlying countries.

Next we have a function called model that takes two data frames, the first of which is assumed to be the energy data frame. It merges the two on the CountryCode and Year columns so that the values line up correctly: each read_sql call above builds its own index starting from 0, so row positions in different SQL reads do not necessarily correspond to one another. After merging, the code drops the key columns so that only the two value columns go into the regression. It then drops any rows containing NaN, which would otherwise cause problems in the regression, and takes the natural logarithm of both columns. The log-log transform makes the relationship appear linear rather than curved, and it lets us compare countries whose values differ by orders of magnitude.
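
The merge and drop steps can be seen on a toy example. The two frames below contain made-up values and deliberately list their rows in different orders; the names `left`, `right`, `merged` and `clean` are chosen only for this sketch.

```python
import pandas as pd

# Made-up frames whose rows are in different orders
left = pd.DataFrame({'CountryCode': ['AAA', 'BBB', 'CCC'],
                     'Year':        [2000, 2000, 2000],
                     'energy':      [10.0, 20.0, 30.0]})
right = pd.DataFrame({'CountryCode': ['CCC', 'AAA'],
                      'Year':        [2000, 2000],
                      'Value':       [300.0, 100.0]})

# Merging on the keys pairs each energy value with its own country and year
merged = pd.merge(left, right, how='left', on=['CountryCode', 'Year'])
print(merged)

# 'BBB' has no partner in `right`, so its Value is NaN and the row is dropped
clean = merged.dropna(axis=0, how='any')
print(clean)
```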

We use the scipy.stats linear regression function, linregress, because I find it easier to use than NumPy's equivalent. We show a scatter plot of the log-transformed data and then draw the regression line on the same graph using the results from linregress. In addition, we print the Beta coefficient (the slope), the Alpha coefficient (the y-intercept), the correlation coefficient r, and the R-squared value.
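
For readers who want to see the regression call in isolation, here is a minimal sketch on made-up data. It reads the fitted values through the named fields of the linregress result and checks them against NumPy's polyfit, which is one way to do the same fit in NumPy. The data, seed and variable names are assumptions for illustration only.

```python
import numpy as np
import scipy.stats as sp

# Made-up data: a straight line with a little random noise
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.1, size=x.size)

# linregress returns a result with named fields
fit = sp.linregress(x, y)
print('Beta (slope):     ', fit.slope)
print('Alpha (intercept):', fit.intercept)
print('r:                ', fit.rvalue)
print('R-squared:        ', fit.rvalue ** 2)

# A degree-1 polyfit returns the same line, highest power first
np_slope, np_intercept = np.polyfit(x, y, 1)
print('polyfit slope and intercept:', np_slope, np_intercept)
```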

In [4]:
energy = pd.merge(nrg, pop, how='left', on=['CountryCode','Year'])
energy['energy'] = energy.Value_x * energy.Value_y
energy = energy.drop(['Value_x','Value_y'], axis=1)
#Drop every row whose country code is one of the aggregates listed in names
energy = energy[~energy.CountryCode.isin(names)]

def model(df1, df2):
    #Align the two frames on country and year, then keep only the value columns
    df = pd.merge(df1,  df2, how='left', on=['CountryCode','Year'])
    df = df.drop(['CountryCode','Year'], axis=1)
    df = df.dropna(axis=0, how='any')
    df = np.log(df)

    #Pass x and y explicitly: energy is the regressor, Value is the response
    results = sp.linregress(df.energy, df.Value)
    plt.scatter(df.energy, df.Value)
    plt.plot(   df.energy, (df.energy*results.slope)+results.intercept, color='orange')
    print('Beta:      {0} \n'
          'Alpha:     {1} \n'
          'R-Coef:    {2} \n'
          'R-Squared: {3}'.format(results.slope,results.intercept,results.rvalue,results.rvalue**2))

model(energy, gdp)

Here we see our first results, and probably among the clearest of this data exploration. The model shows a clear correlation between energy use and GDP. The linear regression gives an R-Squared of 0.85, which, to a relative novice at statistical analysis, seems rather decent for a first attempt. The fitted model is:

    ln(y) = Alpha + Beta * ln(x) 

where y is GDP and x is total energy use. Exponentiating both sides gives the equivalent power-law form:

    y = e^Alpha * x ^ Beta
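
Exponentiating both sides of a log-log fit turns it into a power law whose multiplier is e raised to Alpha and whose exponent is Beta. The sketch below checks this on made-up, noise-free data, so the fit should recover the chosen constants almost exactly; the numbers 3 and 0.8 are arbitrary assumptions.

```python
import numpy as np
import scipy.stats as sp

# Made-up power law y = 3 * x^0.8 with no noise
x = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 100.0])
y = 3.0 * x ** 0.8

# Fit a straight line to the logs of both variables
fit = sp.linregress(np.log(x), np.log(y))
beta = fit.slope               # exponent of the power law
scale = np.exp(fit.intercept)  # multiplier, e raised to Alpha

# Back-transformed prediction on the original scale
y_hat = scale * x ** beta
print(beta, scale)
```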

With the Energy vs. GDP correlation out of the way, I run some exploratory analysis on a few more variables in the World Development Indicators set.

First is fce, or final consumption expenditure, a component of GDP. This results in an R-Squared of 0.86, slightly higher than the GDP R-Squared and higher than the manufacturing and capital formation models that follow.


In [5]:
model(energy, fce)

Next is mva, Manufacturing Value Added, which is a component of GDP. I picked this indicator because manufacturing is considered an energy-intensive industry, but here we see a lower R-Squared of 0.811 compared to Final Consumption Expenditure. After that I run the model on gfc, Gross Fixed Capital Formation, and we see an R-Squared of 0.82, very slightly better than Manufacturing Value Added but still lower than consumption.

In [6]:
model(energy, mva)

In [7]:
model(energy, gfc)

In [8]:
model(energy, co2)

Here we see the model run against CO2 emissions. It gives the strongest fit so far, with an R-Squared of 0.89, which indicates a strong correlation between energy usage and CO2 emissions.

Below is a fun model of energy usage versus patent applications, with residents and nonresidents combined. This model has the lowest R-Squared of the set, 0.69.

In [12]:
pat = pd.read_sql('''SELECT CountryCode, Year, Value FROM Indicators
                     WHERE IndicatorCode IS 'IP.PAT.RESD' ''', con = conn)
pat2 = pd.read_sql('''SELECT CountryCode, Year, Value FROM Indicators
                      WHERE IndicatorCode IS 'IP.PAT.NRES' ''', con = conn)
pat3 = pd.merge(pat, pat2, how='left', on=['CountryCode','Year'])
patents = pd.DataFrame({'CountryCode':pat3.CountryCode,
                        'Year':pat3.Year,
                        'Value':pat3.Value_x + pat3.Value_y})
model(energy, patents)
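
One thing to keep in mind with the sum above: adding two columns yields NaN wherever either side is missing, and model then drops that row. The sketch below shows this on made-up numbers. `Series.add` with `fill_value=0` appears only as one possible alternative that treats a missing count as zero; it is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up patent counts; the second nonresident value is missing
resident = pd.Series([10.0, 20.0, 30.0])
nonresident = pd.Series([1.0, np.nan, 3.0])

# Plain addition gives NaN wherever either side is missing
strict = resident + nonresident

# fill_value=0 treats a single missing side as zero instead
lenient = resident.add(nonresident, fill_value=0)

print(strict.tolist())
print(lenient.tolist())
```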

# Conclusion #
This statistical analysis is the first such project I have ever undertaken. Using my knowledge of SQL, Python, and various statistical packages, I conducted a cursory exploratory analysis of a number of economic variables I thought might correlate with total energy usage. Although the project was not extremely rigorous, it does demonstrate an interesting correlation between total energy use and gross domestic product. Furthermore, the models show a stronger correlation between energy use and final consumption expenditure than between energy use and either gross fixed capital formation (investment) or manufacturing value added (a sector commonly noted as energy intensive).

Any comments would be greatly appreciated to make my next project much better. Thanks for taking the time to look through this project, examine the code, and even read these sections!