## The Story

Use Markdown cells to write a brief summary of the data analysis you are planning to undertake:

  - What is the goal of this work?
    
  - What kind of data is analyzed in this work? 
    
  - What summary statistics are obtained in this work?

  
This part is worth 3 marks. I recommend writing this part once you have completed all the remaining parts of this assignment.

## Data Preparation

In this part you need to construct two Pandas DataFrames using [World Bank Data API](https://pypi.org/project/wbgapi/). To install this API, run ``pip install wbgapi`` in the command line.


This part is worth 12 marks overall. A detailed breakdown of the marks if given below.

### Countries

Choose 10 or more countries and split them into at least three different groups. For instance:

  - Continents: Europe, Asia, Africa, ...
    
  - Economic development: high, medium, low.
    
  - Population size: large, medium, small.
    
  - Area size: large, medium, small.
    
  - Any other splitting.
  
Then create three variables:

  - ``country_codes`` - a list of the country codes of the chosen countries. Use this [link](https://wits.worldbank.org/wits/wits/witshelp/content/codes/country_codes.htm) to find the country codes, e.g. ``'GBR'``.
  
  - ``country_names`` - a dictionary with keys being country codes and values being country names, e.g. ``'GBR':'United Kingdom'``.
  
  - ``country_groups`` - a dictionary with keys being country codes and values being country groups, e.g. ``'GBR':'Europe'``.
  
This part is worth 2 marks: 1 mark for Python code and 1 mark for comments and explanations.

In [1]:
# write your code here
import wbgapi as wb
import pandas as pd

In [2]:
# simple list of few countries.
countries = ["Russia","Germany","United Kingdom","china","india","indonasia","afghanistan","nigeria","kenya","algeria"]

# three groups according to continent 
europe = countries[0:3]
asia = countries[3:7]
africa = countries[7:10]

# generating codes of given countries 
country_codes = []
for i in range(len(countries)):
    code = wb.economy.coder(countries[i])
    country_codes.append(code)


# make dictionary of names and codes 
country_name = {}
i = 0
while i < len(countries):
    country_name[country_codes[i]] = countries[i]
    i+=1

    
# grouping each country 
country_groups = dict()
country_groups["europe"] = europe
country_groups["asia"] = asia
country_groups["africa"] = africa




### Indicators

In [10]:
# overview 
wb.series.info(q = "energy")
wb.series.info(q = "co2 emission")


id,value
EN.ATM.CO2E.GF.KT,CO2 emissions from gaseous fuel consumption (kt)
EN.ATM.CO2E.GF.ZS,CO2 emissions from gaseous fuel consumption (% of total)
EN.ATM.CO2E.KD.GD,CO2 emissions (kg per 2015 US$ of GDP)
EN.ATM.CO2E.KT,CO2 emissions (kt)
EN.ATM.CO2E.LF.KT,CO2 emissions from liquid fuel consumption (kt)
EN.ATM.CO2E.LF.ZS,CO2 emissions from liquid fuel consumption (% of total)
EN.ATM.CO2E.PC,CO2 emissions (metric tons per capita)
EN.ATM.CO2E.PP.GD,CO2 emissions (kg per PPP $ of GDP)
EN.ATM.CO2E.PP.GD.KD,CO2 emissions (kg per 2017 PPP $ of GDP)
EN.ATM.CO2E.SF.KT,CO2 emissions from solid fuel consumption (kt)


Explore [The World Bank Data](https://data.worldbank.org/indicator) website and choose two categories of indicators, for instance:

  - Economy and Education
  - Health and Poverty

Choose four or more indicators from each category (eight or more in total). At least two of them should be multi-level indicators. For instance:

  - Gross domestic product (GDP): 
    
      - total, in billions usd
      
      - total, in billions usd adjusted to purchasing power parity (PPP) 
      

  - Population:
  
      - male
      
      - female
      
      - total

  
You can choose indicators from different World Bank data categories, if that suits your story.      
      
You will need indicator IDs to access the data via the World Bank API. There are two ways to find the IDs:

  - Find the wanted indicator on the World Bank website and read its ID from the web address. For instance, the "Population, total", indicator web address is:
  
    ```
    https://data.worldbank.org/indicator/SP.POP.TOTL
    ```
    The indicator ID is thus: ``SP.POP.TOTL``
    

  - Use ``wb.search()`` method to find the wanted indicator ID. For instance:

    ````python
    import wbgapi as wb
    wb.search("population, total")
    ````
      
  - Use ``wb.series.info()`` method to find the indicator name from its ID. For instance:
  
    ````python
    import wbgapi as wb
    wb.series.info("SP.POP.TOTL")
    ````  

Collect indicator IDs to two lists, say ``indicator_ids_1`` and ``indicator_ids_2``, one for each category.
      
**Important:** Choose indicators wisely to be able to tell a story. You will need to summarize these indicators in the next section on this assignment. 

This part is worth 3 marks: 2 marks for Python code and 1 mark for comments and explanations of the indicators.

In [12]:
# write your code here

# indicator for energy and production or usage of energy
indicator_ids_1 = ["EG.USE.COMM.FO.ZS","EG.USE.ELEC.KH.PC","EG.USE.PCAP.KG.OE","EG.ELC.PETR.ZS"]

# indicator for co2 emission from different sources
indicator_ids_2 = ["EN.ATM.CO2E.KT","EN.ATM.CO2E.LF.ZS","EN.ATM.CO2E.SF.KT","EN.ATM.CO2E.GF.KT"]


### DataFrames

Use [World Bank Data API](https://pypi.org/project/wbgapi/) to get data for each indicator and each country for the most recent 10-20 years, subject to data availability.

Then create two Pandas DataFrames each having the following structure:

  - MultiIndexed columns consisting of indicators you have chosen above. You should choose short but informative column names to label the indicators.
  
  - MultiIndexed rows consisting of country codes and years. 


The necessary Pandas techniques are explained in Notebooks 2.5, 2.6, and 2.7.


To get the World Bank data use ``wb.data.DataFrame()`` method. For instance:

````python
import wbgapi as wb
indicator_ids = ['NY.GDP.PCAP.CD', 'SP.POP.TOTL']
country_codes = ['FRA','GBR','IRL']
my_dataframe  = wb.data.DataFrame(indicator_ids, country_codes, mrv=5) # most recent 5 years 
````

Here you should use indicators and country codes you have constructed above.


The resulting DataFrames should have structure similar to this one:


<table border="1" class="dataframe">
  <thead>
    <tr>
      <th></th>
      <th></th>
      <th colspan="2" halign="left">GDP (bln)</th>
      <th colspan="3" halign="left">Population (mln)</th>
    </tr>
    <tr>
      <th></th>
      <th></th>
      <th>Gross</th>
      <th>PPP</th>
      <th>Total</th>
      <th>Female</th>
      <th>Male</th>
    </tr>
    <tr>
      <th>country</th>
      <th>year</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th rowspan="3" valign="top">FRA</th>
      <th>2001</th>
      <td>1376.47</td>
      <td>1686.72</td>
      <td>61.36</td>
      <td>31.61</td>
      <td>29.75</td>
    </tr>
    <tr>
      <th>2002</th>
      <td>1494.29</td>
      <td>1762.93</td>
      <td>61.81</td>
      <td>31.85</td>
      <td>29.95</td>
    </tr>
    <tr>
      <th>2003</th>
      <td>1840.48</td>
      <td>1753.61</td>
      <td>62.24</td>
      <td>32.09</td>
      <td>30.15</td>
    </tr>
    <tr>
      <th rowspan="3" valign="top">GBR</th>
      <th>2001</th>
      <td>1643.91</td>
      <td>1643.95</td>
      <td>59.12</td>
      <td>30.29</td>
      <td>28.83</td>
    </tr>
    <tr>
      <th>2002</th>
      <td>1784.08</td>
      <td>1725.43</td>
      <td>59.37</td>
      <td>30.39</td>
      <td>28.98</td>
    </tr>
    <tr>
      <th>2003</th>
      <td>2057.09</td>
      <td>1810.62</td>
      <td>59.65</td>
      <td>30.50</td>
      <td>29.15</td>
    </tr>
    <tr>
      <th rowspan="3" valign="top">IRL</th>
      <th>2001</th>
      <td>109.25</td>
      <td>125.91</td>
      <td>3.87</td>
      <td>1.94</td>
      <td>1.92</td>
    </tr>
    <tr>
      <th>2002</th>
      <td>127.99</td>
      <td>138.49</td>
      <td>3.93</td>
      <td>1.97</td>
      <td>1.96</td>
    </tr>
    <tr>
      <th>2003</th>
      <td>164.31</td>
      <td>145.01</td>
      <td>4.00</td>
      <td>2.00</td>
      <td>1.99</td>
    </tr>
  </tbody>
</table>

<br>


This part is worth 7 marks: 3 marks for Python code for each DataFrame and 1 mark for comments explaining the Python code.

In [21]:
# write your code here
# make pandas dataframe using wbgapi library built in function of different countries using this builtin function
df1  = wb.data.DataFrame(indicator_ids_1, country_codes, mrv=4)

# 2nd data frame 
df2  = wb.data.DataFrame(indicator_ids_2, country_codes, mrv=4)


## Data Analysis 

Use Pandas ``groupby()`` and ``pivot_table()`` methods to construct 8 different summary statistics. They must include the following Pandas techniques:

- ``groupby()`` combined with ``aggregate()``, ``filter()``, ``transform()``, and ``apply()`` methods.


- ``groupby()`` using an external key, the dictionary ``country_groups`` you have constructed above.


- at least one summary statistics must use the ``pivot_table()`` method. 


- at least two summary statistics must use data from both DataFrames.

The necessary Pandas techniques are explained in Notebooks 2.8 and 2.9.

**Important:** Make sure your summary statistics make sense and tell a story. This story must be summarized in the first part of this assignment, "The Story".


This part is worth 10 marks: 1 mark for Python code for each summary statistic and 2 marks for comments explaining the Python code and the summary statistics.

In [24]:
# write your code here

# gropu by economy of dataframe 2
df2.groupby('economy').mean()



Unnamed: 0_level_0,YR2015,YR2016,YR2017,YR2018
economy,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AFG,3076.682,3146.377,7380.0,7440.0
CHN,4367580.0,4292267.0,10017770.0,10313460.0
DEU,300013.9,301709.6,732200.0,709540.0
DZA,56335.9,55532.21,145100.0,151670.0
GBR,158042.4,145726.9,366380.0,358800.0
IND,934348.8,948936.2,2301440.0,2434520.0
KEN,4621.776,5036.529,18890.0,18400.0
NGA,35199.94,35115.11,112920.0,130670.0
RUS,711225.3,705753.8,1557190.0,1607550.0


In [39]:
# aggrigation to max 
# just make simple lambda func and apply on data set of two years 
my_dataframe2[['YR2015', 'YR2016']].apply(lambda x: x/10)

Unnamed: 0_level_0,Unnamed: 1_level_0,YR2015,YR2016
economy,series,Unnamed: 2_level_1,Unnamed: 3_level_1
AFG,EN.ATM.CO2E.GF.KT,28.2359,31.9029
AFG,EN.ATM.CO2E.KT,799.0,739.0
AFG,EN.ATM.CO2E.LF.ZS,5.934207,4.704081
AFG,EN.ATM.CO2E.SF.KT,397.5028,482.9439
CHN,EN.ATM.CO2E.GF.KT,36694.9356,40309.1308
CHN,EN.ATM.CO2E.KT,983043.0,981431.0
CHN,EN.ATM.CO2E.LF.ZS,1.351583,1.361683
CHN,EN.ATM.CO2E.SF.KT,727292.6115,695165.2911
DEU,EN.ATM.CO2E.GF.KT,14922.4898,16108.0309
DEU,EN.ATM.CO2E.KT,74231.0,74715.0


In [37]:
grouped = df2.groupby('YR2016')

# filtering the data only less than 1000000 in 2017
grouped.filter(lambda x: x['YR2017'].mean() < 1000000.)

Unnamed: 0_level_0,Unnamed: 1_level_0,YR2015,YR2016,YR2017,YR2018
economy,series,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AFG,EN.ATM.CO2E.KT,7990.0,7390.0,7380.0,7440.0
DEU,EN.ATM.CO2E.KT,742310.0,747150.0,732200.0,709540.0
DZA,EN.ATM.CO2E.KT,145970.0,143350.0,145100.0,151670.0
GBR,EN.ATM.CO2E.KT,400370.0,378890.0,366380.0,358800.0
KEN,EN.ATM.CO2E.KT,17090.0,18770.0,18890.0,18400.0
NGA,EN.ATM.CO2E.KT,108150.0,108420.0,112920.0,130670.0


In [38]:
import numpy as np

# get group of 2018 data 
newgroup = df2.groupby('YR2018')

# transform that group of that according to mean value
newgroup.transform(lambda val: (np.mean(val)))

Unnamed: 0_level_0,Unnamed: 1_level_0,YR2015,YR2016,YR2017
economy,series,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AFG,EN.ATM.CO2E.KT,7990.0,7390.0,7380.0
CHN,EN.ATM.CO2E.KT,9830430.0,9814310.0,10017770.0
DEU,EN.ATM.CO2E.KT,742310.0,747150.0,732200.0
DZA,EN.ATM.CO2E.KT,145970.0,143350.0,145100.0
GBR,EN.ATM.CO2E.KT,400370.0,378890.0,366380.0
IND,EN.ATM.CO2E.KT,2150220.0,2183280.0,2301440.0
KEN,EN.ATM.CO2E.KT,17090.0,18770.0,18890.0
NGA,EN.ATM.CO2E.KT,108150.0,108420.0,112920.0
RUS,EN.ATM.CO2E.KT,1557530.0,1530900.0,1557190.0


---