
# Project: Investigate Carbon Per Capita


## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline



<a id='wrangling'></a>
## Data Wrangling


### General Properties

In [2]:
# read data
co2 = pd.read_csv("data/indicator CDIAC carbon_dioxide_emissions_per_capita.csv")

# how big is the dataset?
co2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Columns: 254 entries, CO2 per capita to 2012
dtypes: float64(253), object(1)
memory usage: 466.4+ KB


In [3]:
# what does it look like?
print(co2.head())

          CO2 per capita  1751  1755  1762  1763  1764  1765  1766  1767  \
0               Abkhazia   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1            Afghanistan   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2  Akrotiri and Dhekelia   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3                Albania   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4                Algeria   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   

   1768    ...         2003      2004      2005      2006      2007      2008  \
0   NaN    ...          NaN       NaN       NaN       NaN       NaN       NaN   
1   NaN    ...     0.022704  0.027472  0.036780  0.047090  0.068312  0.131602   
2   NaN    ...          NaN       NaN       NaN       NaN       NaN       NaN   
3   NaN    ...     1.382066  1.332966  1.353789  1.224310  1.279420  1.297753   
4   NaN    ...     2.899236  2.762220  3.257010  3.113135  3.312875  3.328945   

       2009      2010      2011      2012  
0       NaN 

#### Observations

* The data goes back a long way, but there are a lot of NaNs in those early years.
* There are NaNs in more recent years too, which could be informative.
* The country column heading should be changed.
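To put numbers on those observations, here is a minimal sketch of how coverage per year could be measured. The `toy` frame and its values are invented for illustration; only the column layout (one label column followed by year columns) is taken from the real table.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table (values are invented for illustration)
toy = pd.DataFrame({
    'CO2 per capita': ['A', 'B', 'C'],
    '1751': [1.1, np.nan, np.nan],
    '1900': [2.0, 0.5, np.nan],
    '2012': [7.5, np.nan, np.nan],
})

# every column except the first holds one year of data
year_cols = toy.columns[1:]

# number of countries reporting a value in each year
coverage = toy[year_cols].notnull().sum()

# share of missing cells across all year columns
missing_share = toy[year_cols].isnull().values.mean()

print(coverage)
```

Applied to `co2`, the same two lines would show how quickly coverage grows after the early years and how much is missing in the most recent ones.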

In [4]:
# summary stats
co2.describe()

Unnamed: 0,1751,1755,1762,1763,1764,1765,1766,1767,1768,1769,...,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012
count,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,200.0,200.0,200.0,200.0,201.0,201.0,201.0,201.0,65.0,65.0
mean,1.131253,1.105193,1.244749,1.239133,1.233543,1.227977,1.363424,1.357226,1.351055,1.344913,...,5.160597,5.266567,5.242741,5.250673,5.387444,5.293425,5.064836,5.189779,8.395975,8.391032
std,,,,,,,,,,,...,6.89271,7.222377,7.198735,6.95014,7.281876,6.858804,6.47927,6.538857,7.796528,8.04149
min,1.131253,1.105193,1.244749,1.239133,1.233543,1.227977,1.363424,1.357226,1.351055,1.344913,...,0.022704,0.027472,0.021237,0.025019,0.003482,0.008618,0.011942,0.008443,0.401268,0.426003
25%,1.131253,1.105193,1.244749,1.239133,1.233543,1.227977,1.363424,1.357226,1.351055,1.344913,...,0.61023,0.68239,0.730938,0.681245,0.645242,0.613104,0.609647,0.640119,4.196912,4.141014
50%,1.131253,1.105193,1.244749,1.239133,1.233543,1.227977,1.363424,1.357226,1.351055,1.344913,...,3.216404,3.242589,3.302545,3.301199,3.16437,3.277726,3.106366,3.263605,6.722385,6.506759
75%,1.131253,1.105193,1.244749,1.239133,1.233543,1.227977,1.363424,1.357226,1.351055,1.344913,...,7.5143,7.530032,7.296806,7.46316,7.555909,7.293089,6.782996,6.983513,9.513115,9.19422
max,1.131253,1.105193,1.244749,1.239133,1.233543,1.227977,1.363424,1.357226,1.351055,1.344913,...,55.322622,62.069377,63.187436,57.986895,57.066817,48.702062,41.378843,40.098333,41.220928,46.643197


In [5]:
# give better name to first column
co2.rename(columns = {'CO2 per capita' : 'country'}, inplace = True)

In [6]:
co2.head()

Unnamed: 0,country,1751,1755,1762,1763,1764,1765,1766,1767,1768,...,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012
0,Abkhazia,,,,,,,,,,...,,,,,,,,,,
1,Afghanistan,,,,,,,,,,...,0.022704,0.027472,0.03678,0.04709,0.068312,0.131602,0.213325,0.262174,,
2,Akrotiri and Dhekelia,,,,,,,,,,...,,,,,,,,,,
3,Albania,,,,,,,,,,...,1.382066,1.332966,1.353789,1.22431,1.27942,1.297753,1.215055,1.336544,,
4,Algeria,,,,,,,,,,...,2.899236,2.76222,3.25701,3.113135,3.312875,3.328945,3.564361,3.480977,3.562504,3.785654


In [7]:
# how many countries have data from 1751?
co2['1751'].describe()

count    1.000000
mean     1.131253
std           NaN
min      1.131253
25%      1.131253
50%      1.131253
75%      1.131253
max      1.131253
Name: 1751, dtype: float64

Not surprisingly, only one country has data going that far back. Which one?

In [8]:
# Which country has data for 1751?
co2.loc[pd.notnull(co2['1751'])]

Unnamed: 0,country,1751,1755,1762,1763,1764,1765,1766,1767,1768,...,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012
219,United Kingdom,1.131253,1.105193,1.244749,1.239133,1.233543,1.227977,1.363424,1.357226,1.351055,...,9.069663,9.022278,9.001868,8.952901,8.684595,8.526467,7.705539,7.954469,7.324996,7.502533


But of course, the foremost colonial power of the era!

In [9]:
# How many countries do not have data in the last year?
missing_recent = co2.loc[pd.isnull(co2['2012'])]
len(missing_recent)

170

### Sanity checks

Which countries have the highest cumulative CO2 per capita (summed over all available years)?


In [10]:
# change index for simpler aggregating
co2 = co2.set_index('country')

In [11]:
co2.sum(1).sort_values(ascending = False)

country
Qatar                       3005.390701
United States               2043.289976
Brunei                      2006.882965
United Kingdom              1784.762577
Luxembourg                  1758.837129
United Arab Emirates        1703.118653
Kuwait                      1624.240671
Bahrain                     1558.783103
Canada                      1410.204984
Belgium                     1379.276171
Trinidad and Tobago         1264.426704
Germany                     1210.906480
Australia                   1088.688431
Netherlands                  889.863537
Slovak Republic              845.036499
Denmark                      805.823307
Russia                       792.381253
Poland                       780.967275
France                       732.874690
New Caledonia                718.242462
Netherlands Antilles         706.418690
Austria                      697.931600
Saudi Arabia                 686.246180
Bahamas                      685.328774
South Africa                 643

#### Observations

Most of the top 10 are unsurprising: rich and/or oil-producing countries.

However, there are surprises too: Belgium, Trinidad & Tobago, and Luxembourg. Perhaps this reflects data that goes back further for those countries.
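One way to test that hunch is to count how many years of data each country has and put that next to the sum and the mean. This is only a sketch on an invented two-row frame; the names `toy` and `summary` are mine, not part of the analysis.

```python
import numpy as np
import pandas as pd

# Invented example: one country with a long record of low values,
# one with a single, higher value
toy = pd.DataFrame(
    {'1900': [5.0, np.nan], '1950': [5.0, np.nan], '2000': [5.0, 10.0]},
    index=['Long record', 'Short record'])

summary = pd.DataFrame({
    'years_with_data': toy.count(axis=1),  # non-null values per row
    'total': toy.sum(axis=1),              # grows with the length of the record
    'mean': toy.mean(axis=1),              # ignores NaNs, so record length matters less
})

print(summary)
```

Here the long record wins on the total while the short record wins on the mean, which is the kind of effect that could push countries with long time series up the cumulative ranking.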

Let's compare with the means.

In [12]:
# which countries have the highest mean CO2?
co2.mean(1).sort_values(ascending = False).head(10)

country
Qatar                   46.959230
United Arab Emirates    31.539234
Netherlands Antilles    28.256748
Luxembourg              26.649047
Brunei                  24.776333
Kuwait                  24.242398
Aruba                   20.710109
Bahrain                 19.984399
Nauru                   12.399117
Kazakhstan              12.058241
dtype: float64

#### Observations

Now the top 10 is nearly all oil-rich countries. Nonetheless, we still have Luxembourg in the top 10, along with Netherlands Antilles. And other Caribbean island countries are near the top.

Also, the bottom of the sorted results shows that many rows have no values at all.
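A note on detecting such rows: in recent pandas releases the sum of an all-NaN row is 0 by default rather than NaN, so testing the row sum for nulls may silently find nothing. The sketch below, on an invented frame, shows two checks that should not depend on that default.

```python
import numpy as np
import pandas as pd

# Invented example: one row with data, one row that is entirely NaN
toy = pd.DataFrame(
    {'2010': [1.0, np.nan], '2011': [2.0, np.nan]},
    index=['Has data', 'No data'])

# direct test: is every cell in the row missing?
all_missing = toy.isnull().all(axis=1)
n_empty = int(all_missing.sum())

# alternative: min_count=1 makes the sum of an all-NaN row NaN instead of 0
row_sums = toy.sum(axis=1, min_count=1)

print(n_empty)
```

The first form is the one I would reach for, since it says exactly what is being tested.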

In [13]:
# how many rows without values?
co2.isnull().all(axis=1).sum()

34

In [14]:
# remove rows without any data
co2 = co2.dropna(how = 'all')

# verify: no all-NaN rows should remain
co2.loc[co2.isnull().all(axis=1)]

country,1751,1755,1762,1763,1764,1765,1766,1767,1768,1769,...,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012


### Trim & add region data

CO2 patterns, I suspect, vary a lot by region and not just by country. 

For my analysis, I'd like to primarily focus on Europe since that's where I'm currently living. 

In [15]:
co2.head()

country,1751,1755,1762,1763,1764,1765,1766,1767,1768,1769,...,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012
Afghanistan,,,,,,,,,,,...,0.022704,0.027472,0.03678,0.04709,0.068312,0.131602,0.213325,0.262174,,
Albania,,,,,,,,,,,...,1.382066,1.332966,1.353789,1.22431,1.27942,1.297753,1.215055,1.336544,,
Algeria,,,,,,,,,,,...,2.899236,2.76222,3.25701,3.113135,3.312875,3.328945,3.564361,3.480977,3.562504,3.785654
Andorra,,,,,,,,,,,...,7.414281,7.49969,7.390955,6.83994,6.622435,6.527241,6.17852,6.0921,,
Angola,,,,,,,,,,,...,0.58781,1.17761,1.161662,1.308849,1.435044,1.474353,1.500054,1.593918,,


In [16]:
# reset index so that 'country' is a column
co2 = co2.reset_index()

In [17]:
# verify
co2.head()

Unnamed: 0,country,1751,1755,1762,1763,1764,1765,1766,1767,1768,...,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012
0,Afghanistan,,,,,,,,,,...,0.022704,0.027472,0.03678,0.04709,0.068312,0.131602,0.213325,0.262174,,
1,Albania,,,,,,,,,,...,1.382066,1.332966,1.353789,1.22431,1.27942,1.297753,1.215055,1.336544,,
2,Algeria,,,,,,,,,,...,2.899236,2.76222,3.25701,3.113135,3.312875,3.328945,3.564361,3.480977,3.562504,3.785654
3,Andorra,,,,,,,,,,...,7.414281,7.49969,7.390955,6.83994,6.622435,6.527241,6.17852,6.0921,,
4,Angola,,,,,,,,,,...,0.58781,1.17761,1.161662,1.308849,1.435044,1.474353,1.500054,1.593918,,


In [18]:
# use later?
def prep_df(df):
    # Make a copy & reshape df so it can 
    # be converted to long (tidy) format.
    # Assume 'country' is a column, not index
    df_copy = df.copy() 
    df_copy = df_copy.set_index('country')
    df_yr_ix = df_copy.T
    df_yr_ix = df_yr_ix.reset_index()
    df_yr_ix.rename(columns = {"index" : "year"}, inplace = True)
    
    # create list of country names for use as value columns
    cols = (list(df_yr_ix.columns.values))[1:]
    
    # convert to long format
    df_yr_long = pd.melt(df_yr_ix, id_vars = 'year', value_vars = cols)

    
    return df_yr_long

co2_long = prep_df(co2)

# verify by looking at country with data for all years
#co2_long[co2_long.country == 'United Kingdom']

In [19]:
#co2_long.tail()
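For comparison with `prep_df`, here is a sketch of the same wide-to-long reshape done with a single `pd.melt` call, keeping `country` as the id column instead of transposing first. The `wide` frame is invented; the result has the same three pieces of information (`country`, `year`, `value`), though the column and row order differ from what `prep_df` returns.

```python
import numpy as np
import pandas as pd

# Invented wide table: one row per country, one column per year
wide = pd.DataFrame({
    'country': ['A', 'B'],
    '2011': [1.0, 3.0],
    '2012': [2.0, np.nan],
})

# every non-id column becomes a (year, value) pair
long_df = pd.melt(wide, id_vars='country', var_name='year', value_name='value')

print(long_df)
```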

### Add region & sub-region columns

Data from http://www.internetworldstats.com/list1.htm

In [20]:
# add continent and region info for
# better grouping

regions = pd.read_json('data/all_countries.json')
regions = regions[['name', 'region', 'sub-region']]

In [21]:
regions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 249 entries, 0 to 248
Data columns (total 3 columns):
name          249 non-null object
region        240 non-null object
sub-region    240 non-null object
dtypes: object(3)
memory usage: 7.8+ KB


In [22]:
# what are the 9 rows with null region & sub-region?
regions.loc[regions.region.isnull()]

Unnamed: 0,name,region,sub-region
8,Antarctica,,
30,Bouvet Island,,
32,British Indian Ocean Territory,,
46,Christmas Island,,
47,Cocos (Keeling) Islands,,
78,French Southern Territories,,
96,Heard Island and McDonald Islands,,
206,South Georgia and the South Sandwich Islands,,
236,United States Minor Outlying Islands,,


In [23]:
# give better name to country column
# also easier for merging with gapminder df

regions = regions.rename(columns = {'name':'country'})

In [24]:
# drop countries without any region data
regions = regions.dropna(how = 'any')

# verify
regions.loc[regions.region.isnull()]

Unnamed: 0,country,region,sub-region


The merge will be on `country`, which is why the `name` column in `regions` was renamed to `country` above.

In [25]:
# how many different regions?
regions.region.unique()

array([u'Asia', u'Europe', u'Africa', u'Oceania', u'Americas'], dtype=object)

In [26]:
# how many countries in the regions df?
len(regions.country)

240

In [27]:
# examine the join field for the merge
list(regions.country)

[u'Afghanistan',
 u'\xc5land Islands',
 u'Albania',
 u'Algeria',
 u'American Samoa',
 u'Andorra',
 u'Angola',
 u'Anguilla',
 u'Antigua and Barbuda',
 u'Argentina',
 u'Armenia',
 u'Aruba',
 u'Australia',
 u'Austria',
 u'Azerbaijan',
 u'Bahamas',
 u'Bahrain',
 u'Bangladesh',
 u'Barbados',
 u'Belarus',
 u'Belgium',
 u'Belize',
 u'Benin',
 u'Bermuda',
 u'Bhutan',
 u'Bolivia (Plurinational State of)',
 u'Bonaire, Sint Eustatius and Saba',
 u'Bosnia and Herzegovina',
 u'Botswana',
 u'Brazil',
 u'Brunei Darussalam',
 u'Bulgaria',
 u'Burkina Faso',
 u'Burundi',
 u'Cambodia',
 u'Cameroon',
 u'Canada',
 u'Cabo Verde',
 u'Cayman Islands',
 u'Central African Republic',
 u'Chad',
 u'Chile',
 u'China',
 u'Colombia',
 u'Comoros',
 u'Congo',
 u'Congo (Democratic Republic of the)',
 u'Cook Islands',
 u'Costa Rica',
 u"C\xf4te d'Ivoire",
 u'Croatia',
 u'Cuba',
 u'Cura\xe7ao',
 u'Cyprus',
 u'Czech Republic',
 u'Denmark',
 u'Djibouti',
 u'Dominica',
 u'Dominican Republic',
 u'Ecuador',
 u'Egypt',
 u'El Sa

In [28]:
# how many countries remain in co2 df?
#len(co2_long.country.unique())
len(co2)

201

In [29]:
# examine co2 join field
list(co2.country.unique())

['Afghanistan',
 'Albania',
 'Algeria',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'British Virgin Islands',
 'Brunei',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cape Verde',
 'Cayman Islands',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Congo, Dem. Rep.',
 'Congo, Rep.',
 'Costa Rica',
 "Cote d'Ivoire",
 'Croatia',
 'Cuba',
 'Cyprus',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Faeroe Islands',
 'Fiji',
 'Finland',
 'France',
 'French Guiana',
 'French Polynesia',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Gibraltar',
 'Greece',
 'Greenland',
 'Grenada',
 '

So the `regions` df has 39 more countries than the `co2` df (240 vs. 201). `co2` will be the left (driving) table for the merge, which is easier to work with than the other way around.

Obvious problems:

* accents (in regions df only)
* long-winded names (United States of America, United Kingdom of Great Britain and Northern Ireland)
* parentheses (Virgin Islands)
* extra spaces
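The problems listed above could all be handled by one normalizing function. The sketch below is an assumption of mine rather than the cleanup used in this notebook: `normalize_name` is a hypothetical helper, and it uses the standard library's `unicodedata` instead of the `unidecode` package, so characters that have no accent decomposition may pass through unchanged.

```python
import re
import unicodedata


def normalize_name(name):
    """Strip accents, punctuation and extra spaces from a country name."""
    # decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize('NFKD', name)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # replace punctuation (parentheses, commas, apostrophes) with spaces
    no_punct = re.sub(r'[^\w\s]', ' ', stripped)
    # collapse runs of whitespace and trim both ends
    return re.sub(r'\s+', ' ', no_punct).strip()


print(normalize_name(u"C\xf4te d'Ivoire"))
print(normalize_name(u'Bolivia (Plurinational State of)'))
```

Trimming the ends matters for matching: a name that keeps a trailing space will not compare equal to the same name without one.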

### Clean up 'country' column in both dataframes

In [30]:
# remove accents from countries in regions df
import unidecode
regions.country = regions.country.apply(unidecode.unidecode)

In [31]:
# verify
list(regions.country)

['Afghanistan',
 'Aland Islands',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia (Plurinational State of)',
 'Bonaire, Sint Eustatius and Saba',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cabo Verde',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Congo',
 'Congo (Democratic Republic of the)',
 'Cook Islands',
 'Costa Rica',
 "Cote d'Ivoire",
 'Croatia',
 'Cuba',
 'Curacao',
 'Cyprus',
 'Czech Republic',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'F

In [32]:
# remove special chars
import re

def remove_special_char(s):
    # replace each run of non-word characters with a single space
    clean_s = re.sub(r'\W+', ' ', s)
    return clean_s

In [33]:
co2.country = co2.country.apply(remove_special_char)

In [34]:
co2.country

0                   Afghanistan
1                       Albania
2                       Algeria
3                       Andorra
4                        Angola
5                      Anguilla
6           Antigua and Barbuda
7                     Argentina
8                       Armenia
9                         Aruba
10                    Australia
11                      Austria
12                   Azerbaijan
13                      Bahamas
14                      Bahrain
15                   Bangladesh
16                     Barbados
17                      Belarus
18                      Belgium
19                       Belize
20                        Benin
21                      Bermuda
22                       Bhutan
23                      Bolivia
24       Bosnia and Herzegovina
25                     Botswana
26                       Brazil
27       British Virgin Islands
28                       Brunei
29                     Bulgaria
                 ...           
171     

In [35]:
regions.country = regions.country.apply(remove_special_char)

In [36]:
regions.country

0                                            Afghanistan
1                                          Aland Islands
2                                                Albania
3                                                Algeria
4                                         American Samoa
5                                                Andorra
6                                                 Angola
7                                               Anguilla
9                                    Antigua and Barbuda
10                                             Argentina
11                                               Armenia
12                                                 Aruba
13                                             Australia
14                                               Austria
15                                            Azerbaijan
16                                               Bahamas
17                                               Bahrain
18                             

### Merge `co2` with `regions`

In [37]:
# merge co2 df with region df

co2_regions = co2.merge(regions, how = 'left', on='country')

# rearrange col order: move 'region' and 'sub-region' next to 'country'

region_cols = co2_regions[['region', 'sub-region']]

co2_regions = co2_regions.drop(labels=['region', 'sub-region'], axis=1)
co2_regions.insert(1, 'region', region_cols['region'])
co2_regions.insert(2, 'sub-region', region_cols['sub-region'])

In [38]:
# still have 201 countries?
len(co2_regions.country.unique())

201

In [39]:
co2_regions.head()

Unnamed: 0,country,region,sub-region,1751,1755,1762,1763,1764,1765,1766,...,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012
0,Afghanistan,Asia,Southern Asia,,,,,,,,...,0.022704,0.027472,0.03678,0.04709,0.068312,0.131602,0.213325,0.262174,,
1,Albania,Europe,Southern Europe,,,,,,,,...,1.382066,1.332966,1.353789,1.22431,1.27942,1.297753,1.215055,1.336544,,
2,Algeria,Africa,Northern Africa,,,,,,,,...,2.899236,2.76222,3.25701,3.113135,3.312875,3.328945,3.564361,3.480977,3.562504,3.785654
3,Andorra,Europe,Southern Europe,,,,,,,,...,7.414281,7.49969,7.390955,6.83994,6.622435,6.527241,6.17852,6.0921,,
4,Angola,Africa,Middle Africa,,,,,,,,...,0.58781,1.17761,1.161662,1.308849,1.435044,1.474353,1.500054,1.593918,,


In [40]:
# how many countries didn't match up?
len(co2_regions.loc[co2_regions.region.isnull()])

25
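Another way to get at the unmatched rows is to let the merge itself report them through its `indicator` option. This is a sketch on invented two-column frames (`left` and `right` are stand-ins, not the real tables); the country names are borrowed from the lists above.

```python
import pandas as pd

# Stand-ins for the co2 and regions tables
left = pd.DataFrame({'country': ['Albania', 'Brunei', 'Iran']})
right = pd.DataFrame({
    'country': ['Albania', 'Brunei Darussalam', 'Iran Islamic Republic of'],
    'region': ['Europe', 'Asia', 'Asia'],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
merged = left.merge(right, how='left', on='country', indicator=True)

# rows from the left table that found no partner in the right table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'country'].tolist()

print(unmatched)
```

This avoids relying on `region` being null as the signal, which would also flag rows that matched a `regions` entry with a missing region.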

In [41]:
# which countries were they?
co2_regions.loc[co2_regions.region.isnull()]

Unnamed: 0,country,region,sub-region,1751,1755,1762,1763,1764,1765,1766,...,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012
23,Bolivia,,,,,,,,,,...,1.602519,1.456394,1.347341,1.601935,1.347951,1.431829,1.474029,1.556418,,
27,British Virgin Islands,,,,,,,,,,...,3.590413,3.882571,4.001819,4.448838,4.560733,4.671529,4.781154,5.047681,,
28,Brunei,,,,,,,,,,...,15.359649,15.060464,14.116429,13.020376,26.952714,27.507506,23.206929,22.960326,,
35,Cape Verde,,,,,,,,,,...,0.549963,0.565572,0.620308,0.643992,0.645242,0.631962,0.633957,0.717071,,
42,Congo Dem Rep,,,,,,,,,,...,0.031381,0.034723,0.038761,0.040894,0.042838,0.045133,0.042375,0.046079,,
43,Congo Rep,,,,,,,,,,...,0.32252,0.342643,0.411999,0.368833,0.385281,0.38422,0.478165,0.501538,,
59,Faeroe Islands,,,,,,,,,,...,16.374565,16.605795,15.879923,15.808836,15.928244,14.788293,13.724079,14.604035,,
81,Hong Kong China,,,,,,,,,,...,5.872271,5.642391,5.953476,5.641943,5.814144,5.568527,5.293808,5.144529,5.515206,5.341218
86,Iran,,,,,,,,,,...,6.153554,6.494674,6.729846,7.2234,7.555637,7.892211,7.895172,7.726537,7.824127,7.977037
105,Macao China,,,,,,,,,,...,3.343809,3.673475,3.816033,3.307877,2.776959,2.29846,2.45045,1.895194,,


In [42]:
#co2_regions.head()
#missing_region = set(co2_regions[co2_regions.region.isnull()].country.unique())
#no_reg_no_vals = missing_region.intersection(all_nulls)
#list(no_reg_no_vals)

In [43]:
# countries missing regions but with co2 values
#no_reg_with_vals = co2_regions[co2_regions.country.isin(missing_region)]
#no_reg_with_vals = no_reg_with_vals[no_reg_with_vals.value.notnull()].country.unique()
#no_reg_with_vals

### Match countries using partial string matching

When checking search queries, spellcheckers put a lot of weight on the first character: it is so important that users are assumed to be far less likely to misspell it.

By the same reasoning, I am guessing that we can match up most countries by assuming the key word is the **first** word in the `country` field.

In [49]:
# create Series of countries that didn't have any matches in regions df

no_match = co2_regions.loc[co2_regions.region.isnull()].country

In [50]:
# verify
len(no_match)

25

In [182]:
# loop through countries without region data
# and look for partial string matches

matches = []
for country in no_match:
    poss_match = {'co2_country': country}
    country_split = str.split(country)
    
    # does the field start with country name?
    if regions.country.str.startswith(country).any():
        poss_match['regions_country'] = dict(regions[regions.country.str.startswith(country)].country)
    
    # does the field contain the first word of the country?
    elif regions.country.str.contains(country_split[0]).any():
        poss_match['regions_country'] = dict(regions[regions.country.str.contains(country_split[0])].country)
            
    # does the field contain the second word of the country?
    elif len(country_split) > 1: 
        if regions.country.str.contains(country_split[1]).any():
            poss_match['regions_country'] = dict(regions[regions.country.str.contains(country_split[1])].country)
        else: 
            poss_match['regions_country'] = None
    else:
        poss_match['regions_country'] = None
        
    matches.append(poss_match)

In [176]:
import pprint
pp = pprint.PrettyPrinter()
pp.pprint(matches)

[{'co2_country': 'Bolivia',
  'regions_country': {26: 'Bolivia Plurinational State of '}},
 {'co2_country': 'British Virgin Islands',
  'regions_country': {242: 'Virgin Islands British '}},
 {'co2_country': 'Brunei', 'regions_country': {33: 'Brunei Darussalam'}},
 {'co2_country': 'Cape Verde', 'regions_country': {40: 'Cabo Verde'}},
 {'co2_country': 'Congo Dem Rep ',
  'regions_country': {50: 'Congo', 51: 'Congo Democratic Republic of the '}},
 {'co2_country': 'Congo Rep ',
  'regions_country': {50: 'Congo', 51: 'Congo Democratic Republic of the '}},
 {'co2_country': 'Faeroe Islands',
  'regions_country': {1: 'Aland Islands',
                      41: 'Cayman Islands',
                      52: 'Cook Islands',
                      71: 'Falkland Islands Malvinas ',
                      72: 'Faroe Islands',
                      138: 'Marshall Islands',
                      164: 'Northern Mariana Islands',
                      203: 'Solomon Islands',
                      229: 'Turks

In [184]:
# confirm matches when there is only one possible match found
for dct in matches:
    if dct['regions_country'] is None:
        dct['matched'] = 'n'
    elif len(dct['regions_country']) > 1:
        dct['matched'] = 'n'
    elif len(dct['regions_country']) == 1:
        # exactly one candidate: ask for confirmation (Enter defaults to 'y')
        print(dct)
        dct['matched'] = raw_input('Is this a match? ([y] or n) ') or 'y'
        if dct['matched'] == 'y':
            matching_key = list(dct['regions_country'].keys())[0]
            dct['regions_country'] = list(dct['regions_country'].values())[0]
            
            # add region & sub-region values from regions df
            dct['region'] = regions.region.at[int(matching_key)]
            dct['sub-region'] = regions['sub-region'].at[int(matching_key)]
    else:
        dct['matched'] = 'n'
        

{'co2_country': 'Bolivia', 'regions_country': {26: 'Bolivia Plurinational State of '}}
Is this a match? ([y] or n) 
{'co2_country': 'British Virgin Islands', 'regions_country': {242: 'Virgin Islands British '}}
Is this a match? ([y] or n) 
{'co2_country': 'Brunei', 'regions_country': {33: 'Brunei Darussalam'}}
Is this a match? ([y] or n) 
{'co2_country': 'Cape Verde', 'regions_country': {40: 'Cabo Verde'}}
Is this a match? ([y] or n) 
{'co2_country': 'Hong Kong China', 'regions_country': {99: 'Hong Kong'}}
Is this a match? ([y] or n) 
{'co2_country': 'Iran', 'regions_country': {104: 'Iran Islamic Republic of '}}
Is this a match? ([y] or n) 
{'co2_country': 'Macao China', 'regions_country': {130: 'Macao'}}
Is this a match? ([y] or n) 
{'co2_country': 'Macedonia FYR', 'regions_country': {131: 'Macedonia the former Yugoslav Republic of '}}
Is this a match? ([y] or n) 
{'co2_country': 'Micronesia Fed Sts ', 'regions_country': {144: 'Micronesia Federated States of '}}
Is this a match? ([y] 

In [185]:
# for multiple matches (stored as dict), select correct country

for dct in matches:
    if isinstance(dct['regions_country'], dict):
        pp.pprint(dct)
        matching_key = raw_input('Enter the key number of the matching country. If none match, type \'n\'.')
        if matching_key == 'n':
            dct['regions_country'] = None
            dct['matched'] = matching_key

        else:
            # keep asking until a valid key number is entered
            # (the reply must be an integer; anything else raises ValueError)
            while int(matching_key) not in dct['regions_country'].keys():
                print("Error: you entered " + matching_key)
                print("This is not a valid key number. Please try again.")
                print("Valid keys are:")
                print(dct['regions_country'].keys())
                matching_key = raw_input('Enter the key number of the matching country. If none match, type \'n\'.')

            print("You selected " + dct['regions_country'][int(matching_key)] + ".")
            dct['regions_country'] = dct['regions_country'][int(matching_key)]
            dct['matched'] = 'y'
            
            # add region & sub-region values from regions df
            dct['region'] = regions.region.at[int(matching_key)]
            dct['sub-region'] = regions['sub-region'].at[int(matching_key)]

{'co2_country': 'Congo Dem Rep ',
 'matched': 'n',
 'regions_country': {50: 'Congo', 51: 'Congo Democratic Republic of the '}}
Enter the key number of the matching country. If none match, type 'n'.51
You selected Congo Democratic Republic of the .
{'co2_country': 'Congo Rep ',
 'matched': 'n',
 'regions_country': {50: 'Congo', 51: 'Congo Democratic Republic of the '}}
Enter the key number of the matching country. If none match, type 'n'.50
You selected Congo.
{'co2_country': 'Faeroe Islands',
 'matched': 'n',
 'regions_country': {1: 'Aland Islands',
                     41: 'Cayman Islands',
                     52: 'Cook Islands',
                     71: 'Falkland Islands Malvinas ',
                     72: 'Faroe Islands',
                     138: 'Marshall Islands',
                     164: 'Northern Mariana Islands',
                     203: 'Solomon Islands',
                     229: 'Turks and Caicos Islands',
                     242: 'Virgin Islands British ',
           

In [186]:
len(matches)

25

In [257]:
matches_df = pd.DataFrame(matches)

In [258]:
matches_df

Unnamed: 0,co2_country,matched,region,regions_country,sub-region
0,Bolivia,y,Americas,Bolivia Plurinational State of,South America
1,British Virgin Islands,y,Americas,Virgin Islands British,Caribbean
2,Brunei,y,Asia,Brunei Darussalam,South-Eastern Asia
3,Cape Verde,y,Africa,Cabo Verde,Western Africa
4,Congo Dem Rep,y,Africa,Congo Democratic Republic of the,Middle Africa
5,Congo Rep,y,Africa,Congo,Middle Africa
6,Faeroe Islands,y,Europe,Faroe Islands,Northern Europe
7,Hong Kong China,y,Asia,Hong Kong,Eastern Asia
8,Iran,y,Asia,Iran Islamic Republic of,Southern Asia
9,Macao China,y,Asia,Macao,Eastern Asia


## Less straightforward matching

In [259]:
# what's left?
matches_df.loc[matches_df.matched == 'n']

Unnamed: 0,co2_country,matched,region,regions_country,sub-region
13,Netherlands Antilles,n,,,
23,West Bank and Gaza,n,,,
24,Vietnam,n,,,


### Netherlands Antilles

This reveals a gap in my matching code: it never considers the first word and the second word together. Once the first word finds a candidate, the second word is never checked, so 'Antilles' was never searched for. (As the next cell shows, no entry in `regions` contains 'Antilles' anyway.)

On the other hand, the logic matched 22 out of 25 countries, so the additional lines of code do not seem worth it for one more country.
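For the record, looking at every word at once would not take many lines. The sketch below is hypothetical and not what was run in this notebook: `rank_by_shared_words` is a made-up helper that ranks candidates by how many whole words they share with the name, and the `choices` list is a small hand-picked sample.

```python
def rank_by_shared_words(name, choices):
    """Return the choices sharing at least one whole word with name,
    ordered by the number of shared words (most first)."""
    words = set(name.split())
    scored = [(len(words & set(choice.split())), choice) for choice in choices]
    scored = [(n, choice) for n, choice in scored if n > 0]
    # list.sort is stable, so ties keep the order of `choices`
    scored.sort(key=lambda pair: -pair[0])
    return [choice for n, choice in scored]


choices = ['Cook Islands', 'Virgin Islands British', 'Congo']

ranked = rank_by_shared_words('British Virgin Islands', choices)
nothing = rank_by_shared_words('Netherlands Antilles', choices)

print(ranked)
```

Ranking by overlap puts 'Virgin Islands British' ahead of the other 'Islands' entries, which would also have trimmed the long candidate list for 'Faeroe Islands'.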

In [260]:
regions.loc[regions['country'].str.contains('Antilles')]

Unnamed: 0,country,region,sub-region


In [262]:
regions.loc[regions['country'].str.contains('Dutch')]

Unnamed: 0,country,region,sub-region
200,Sint Maarten Dutch part,Americas,Caribbean


In [264]:
def create_reg_dict(country, df = regions): 
    # use for manual matching:
    # look up one row of the regions df by country name and
    # return it as a flat dict keyed like matches_df
    reg_dict = df.loc[df.country == country].to_dict()
    reg_dict['regions_country'] = reg_dict.pop('country')
    for key in reg_dict.keys():
        reg_dict[key] = list(reg_dict[key].values())[0]

    return reg_dict
    
def manual_match(i, values_dict, df = matches_df):
    # fill the missing fields of row i from values_dict
    # (the 'matched' flag is left unchanged)
    df.iloc[i] = df.iloc[i].fillna(values_dict)
    return df

In [265]:
# add missing info for Netherlands Antilles
reg_dict = create_reg_dict(regions.loc[200].country)
matches_df = manual_match(13, reg_dict)


In [266]:
# verify
matches_df.loc[matches_df.matched=='n']

Unnamed: 0,co2_country,matched,region,regions_country,sub-region
13,Netherlands Antilles,n,Americas,Sint Maarten Dutch part,Caribbean
23,West Bank and Gaza,n,,,
24,Vietnam,n,,,


### West Bank and Gaza

For 'West Bank and Gaza', I needed to confirm whether this would be considered part of Israel or if it's Palestine. According to Wikipedia, it's a contested issue (at least to Israel) but the general consensus seems to be that 'West Bank and Gaza' are synonymous with 'Palestine'.

In [267]:
regions.loc[regions['country'].str.contains('Palestine')]

Unnamed: 0,country,region,sub-region
169,Palestine State of,Asia,Western Asia


In [270]:
# Add missing info for West Bank & Gaza
reg_dict = create_reg_dict(regions.loc[169].country)
matches_df = manual_match(23, reg_dict)


In [271]:
matches_df.loc[matches_df.matched == 'n']

Unnamed: 0,co2_country,matched,region,regions_country,sub-region
13,Netherlands Antilles,n,Americas,Sint Maarten Dutch part,Caribbean
23,West Bank and Gaza,n,Asia,Palestine State of,Western Asia
24,Vietnam,n,,,


### Vietnam

This is purely a data-entry discrepancy: early on I had noticed that the name is spelled `Viet Nam` in the regions df.
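
Spelling differences of this kind (extra spaces, hyphens, capitalization) could be caught automatically by comparing normalized names. A small sketch, assuming that stripping everything except letters and digits is an acceptable normalization; `normalize`, `regions_names`, and `lookup` are hypothetical names used only for this example.

```python
import re


def normalize(name):
    """Lowercase a name and drop everything except letters and digits."""
    return re.sub(r'[^a-z0-9]', '', name.lower())


# a few names as they are spelled in the regions df
regions_names = ['Viet Nam', 'Thailand', 'Timor Leste']

# map each normalized spelling back to the original one
lookup = {normalize(n): n for n in regions_names}

print(lookup.get(normalize('Vietnam')))
```

Names that differ by more than punctuation or spacing (for example 'Cape Verde' and 'Cabo Verde') would still need a manual match.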

In [237]:
# In which sub-region will I find Vietnam? 

regions.loc[regions.region=='Asia']['sub-region'].unique()

array([u'Southern Asia', u'Western Asia', u'South-Eastern Asia',
       u'Eastern Asia', u'Central Asia'], dtype=object)

In [238]:
regions.loc[regions['sub-region']=='South-Eastern Asia']
    

Unnamed: 0,country,region,sub-region
33,Brunei Darussalam,Asia,South-Eastern Asia
37,Cambodia,Asia,South-Eastern Asia
103,Indonesia,Asia,South-Eastern Asia
121,Lao People s Democratic Republic,Asia,South-Eastern Asia
134,Malaysia,Asia,South-Eastern Asia
152,Myanmar,Asia,South-Eastern Asia
174,Philippines,Asia,South-Eastern Asia
199,Singapore,Asia,South-Eastern Asia
220,Thailand,Asia,South-Eastern Asia
221,Timor Leste,Asia,South-Eastern Asia


In [273]:
# add missing info for Vietnam (row 241 of regions is 'Viet Nam')
reg_dict = create_reg_dict(regions.loc[241].country)
matches_df = manual_match(24, reg_dict)


In [274]:
matches_df.loc[matches_df.matched=='n']

Unnamed: 0,co2_country,matched,region,regions_country,sub-region
13,Netherlands Antilles,n,Americas,Sint Maarten Dutch part,Caribbean
23,West Bank and Gaza,n,Asia,Palestine State of,Western Asia
24,Vietnam,n,Asia,Viet Nam,South-Eastern Asia


## Now we're ready to merge


In [282]:
# make sure source & target df's have same index vals
matches_df = matches_df.set_index('co2_country')
co2_regions = co2_regions.set_index('country')

In [285]:
# create dict of {index_val: value} for 'region' & 'sub-region'
reg_dict = matches_df.region.to_dict()
sub_reg_dict = matches_df['sub-region'].to_dict()

In [310]:
# fill missing values using these dicts
co2_regions.region = co2_regions.region.fillna(reg_dict)
co2_regions['sub-region'] = co2_regions['sub-region'].fillna(sub_reg_dict)
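
The fill above relies on `Series.fillna` accepting a dict keyed by index label: only missing entries whose label appears in the dict are filled, and existing values are left alone. A toy sketch of that behavior, assuming a frame indexed by country; the frame and the `fill` dict are made up for illustration.

```python
import numpy as np
import pandas as pd

# toy stand-in for co2_regions, indexed by country
toy = pd.DataFrame(
    {'region': ['Asia', np.nan, np.nan]},
    index=['Japan', 'Vietnam', 'Bolivia'])

# the 'Japan' entry is ignored because that row is not missing
fill = {'Vietnam': 'Asia', 'Bolivia': 'Americas', 'Japan': 'IGNORED'}

toy['region'] = toy['region'].fillna(fill)
print(toy)
```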

In [317]:
# verify the countries with missing region info have been filled in
co2_regions.loc[matches_df.index.values][['region', 'sub-region']]

country,region,sub-region
Bolivia,Americas,South America
British Virgin Islands,Americas,Caribbean
Brunei,Asia,South-Eastern Asia
Cape Verde,Africa,Western Africa
Congo Dem Rep,Africa,Middle Africa
Congo Rep,Africa,Middle Africa
Faeroe Islands,Europe,Northern Europe
Hong Kong China,Asia,Eastern Asia
Iran,Asia,Southern Asia
Macao China,Asia,Eastern Asia


In [318]:
# verify no more missing values
co2_regions.loc[co2_regions.region.isnull()]

country,region,sub-region,1751,1755,1762,1763,1764,1765,1766,1767,...,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012


In [158]:
matches_df.loc[matches_df.matched != 'y']

Unnamed: 0,co2_country,regions_country,matched
1,British Virgin Islands,,
3,Cape Verde,,
4,Congo Dem Rep,Congo,n
5,Congo Rep,Congo,n
6,Faeroe Islands,,
13,Netherlands Antilles,Netherlands,n
19,United Kingdom,United Arab Emirates,n
20,United States,United Arab Emirates,n
23,West Bank and Gaza,Western Sahara,n
24,Vietnam,,


In [148]:
regions.loc[regions.country.str.startswith('Congo')]

Unnamed: 0,country,region,sub-region
50,Congo,Africa,Middle Africa
51,Congo Democratic Republic of the,Africa,Middle Africa


In [155]:
# tediously fill in remaining missing values
from collections import defaultdict
manual_match = defaultdict(dict)
manual_match['Congo Dem Rep']['region'] = 'Africa'
manual_match['Congo Dem Rep']['sub-region'] = 'Middle Africa'

In [157]:
regions.loc[regions.country.str.startswith('Virgin')]

Unnamed: 0,country,region,sub-region
242,Virgin Islands British,Americas,Caribbean
243,Virgin Islands U S,Americas,Caribbean


In [159]:
manual_match['British Virgin Islands']['region'] = 'Americas'
manual_match['British Virgin Islands']['sub-region'] = 'Caribbean'

In [161]:
regions.loc[regions.country.str.contains('Verde')]

Unnamed: 0,country,region,sub-region
40,Cabo Verde,Africa,Western Africa


In [162]:
manual_match['Cape Verde']['region'] = 'Africa'
manual_match['Cape Verde']['sub-region'] = 'Western Africa'

In [163]:
# look for Faeroe Islands
regions.loc[regions.country.str.contains('Islands')]

Unnamed: 0,country,region,sub-region
1,Aland Islands,Europe,Northern Europe
41,Cayman Islands,Americas,Caribbean
52,Cook Islands,Oceania,Polynesia
71,Falkland Islands Malvinas,Americas,South America
72,Faroe Islands,Europe,Northern Europe
138,Marshall Islands,Oceania,Micronesia
164,Northern Mariana Islands,Oceania,Micronesia
203,Solomon Islands,Oceania,Melanesia
229,Turks and Caicos Islands,Americas,Caribbean
242,Virgin Islands British,Americas,Caribbean


In [169]:
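def manually_match(country, reg, sub_reg, d = manual_match):
    # debug stub: print the default `d`; the output below shows that
    # `manual_match` is bound to a function here, not the defaultdict created above
    print d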

In [170]:
manually_match('Faeroe Islands', 'Europe', 'Northern Europe')

<function manual_match at 0x10e9fef50>


In [None]:
# manually searched for correct match for 'British Indian Ocean Territory':
co2_regions[co2_regions.country.str.contains('Ocean')].country.unique()

# no match found
matches.pop(1)
pp.pprint(matches)

In [None]:
# look for matches to 'Netherlands': 'Netherlands Antilles'
regions[regions.country.str.contains('Curacao')].country
matches.append({'Curacao': matches[6].values()[0]})

pp.pprint(matches)


In [None]:
# remove wrong entry for Netherlands Antilles
matches.pop(6)
pp.pprint(matches)

In [None]:
# fix the 'united' countries
print "From co2_regions:"
print co2_regions[co2_regions.country.str.startswith('United')].country.unique()
print 
print "From regions:"
(regions[regions.country.str.startswith('United')].country.values)

In [None]:
matches.append({'United States of America': 'United States'})
matches.append({'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom'})
pp.pprint(matches)

In [None]:
# remove UAE (it's ok in merged df)
matches.pop(12)
pp.pprint(matches)

In [None]:
# find match for 'West Bank and Gaza'

regions[regions.country.str.contains('Palestin')].country

In [None]:
len(matches)
matches.pop(14)

In [None]:
matches.append({'Palestine, State of': 'West Bank and Gaza'})
pp.pprint(matches)

In [None]:
for l in matches:
    print l.keys()[0]

In [None]:
%%time
# copy region and sub-region from `regions` into co2_regions for each manual match
# each entry of `matches` is a one-item dict: {regions name: co2 name}
for d in matches:
    reg_name = d.keys()[0]
    co2_name = d.values()[0]
    print co2_name
    r = regions.loc[regions.country == reg_name].region.item()
    sr = regions.loc[regions.country == reg_name]['sub-region'].item()
    co2_regions.loc[co2_regions.country == co2_name, 'region'] = r
    co2_regions.loc[co2_regions.country == co2_name, 'sub-region'] = sr

In [None]:
# how many countries are still missing regions?
missing_region = set(co2_regions[co2_regions.region.isnull()].country.unique())


> **Tip**: You should _not_ perform too many operations in each cell. Create cells freely to explore your data. One option that you can take with this project is to do a lot of explorations in an initial notebook. These don't have to be organized, but make sure you use enough comments to understand the purpose of each code cell. Then, after you're done with your analysis, create a duplicate notebook where you will trim the excess and organize your steps so that you have a flowing, cohesive report.

> **Tip**: Make sure that you keep your reader informed on the steps that you are taking in your investigation. Follow every code cell, or every set of related code cells, with a markdown cell to describe to the reader what was found in the preceding cell(s). Try to make it so that the reader can then understand what they will be seeing in the following cell(s).

### Data Cleaning (Replace this with more specific notes!)

In [None]:
# plot

import seaborn as sns

sns.set(color_codes = True)
#sns.set_style({'axes.linewidth': '0.5'})


co2_regions.year = pd.Categorical(co2_regions.year, ordered = True)

#p = sns.factorplot(x = 'year', y = 'value', hue = 'country', 
#               col = 'sub-region', sharey = False, 
#                   data = co2_regions[co2_regions.region == 'Europe'])

#p.set_xticklabels(rotation=45)
#p.set(ylim=(0,None))

eur_subreg = co2_regions[co2_regions.region == 'Europe']['sub-region'].unique()


co2_regions.loc[co2_regions['sub-region'].isin(eur_subreg)].country.unique()

np.arange(int(co2_regions.year.min()), int(co2_regions.year.max()), 10.)

eur_subreg

co2_regions[co2_regions['sub-region'].isin(eur_subreg)].groupby(['sub-region', 'country']).sum()

# get a better legend?

for subreg in eur_subreg:
    df = co2_regions[co2_regions['sub-region'] == subreg]
    fignum = pd.Index(eur_subreg).get_loc(subreg)
    plt.figure(fignum)
    
    # pointplot draws on a single Axes and has no `col` facet argument
    ax = sns.pointplot(x = 'year', y = 'value', hue = 'country', 
                   data = df, scale = .5)
    ax.legend(loc = 0)
    ax.set(ylim=(0,None))
    ax.set(xlim=(1950,None))
    #plt.xticks(np.arange(int(co2_regions.year.min()), int(co2_regions.year.max()), 10.))
    #ax.set_xticklabels(rotation=45)
    plt.title(subreg)
    

# which countries had no corresponding regional info
co2_totals = co2_regions[co2_regions.region.isnull()].groupby('country').sum()
co2_totals

In [None]:
# which countries do not have a match in regions?
countries = list(co2[co2.columns[0]])
regions_list = list(regions['name'].values)

In [None]:
no_region = [i for i in countries if i not in regions_list]
len(no_region)

In [None]:
no_region

In [None]:
# and vice-versa

no_country = [i for i in regions_list if i not in countries]
len(no_country)

In [None]:
no_country

In [None]:
# how many more matches did we get?
print "Before:"
print len(no_country)
print ""
print "After:"

regions_list = list(regions['name'].values)
no_country = [i for i in regions_list if i not in countries]
len(no_country)


In [None]:
# more cleanup

bvi = 'British Virgin Islands'
vib = 'Virgin Islands (British)'

bvi = bvi.split()
vib = vib.split()
vib_clean = [i.strip("()") for i in vib]

for word in bvi:
    if word in vib_clean:
        print "match"
    else:
        print "no match"



In [None]:
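# TODO: remove words in ()
# str.replace(r"\(.*\)","")

import re

vib = 'Virgin Islands (British)'
# replace each run of non-word characters with a single space
# (note: this leaves a trailing space where the closing parenthesis was)
#print re.sub(r'[^A-Za-z0-9 ]+', '', vib)
print re.sub(r'\W+', ' ', vib)


#vib.replace("(","")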
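
The two cleanup options considered here behave differently, so a side-by-side sketch may help when choosing between them. It assumes the raw names look like 'Virgin Islands (British)'; the function names `strip_punctuation` and `drop_parenthetical` are made up for this example.

```python
import re


def strip_punctuation(name):
    """Replace runs of non-word characters with one space; keep every word."""
    return re.sub(r'\W+', ' ', name).strip()


def drop_parenthetical(name):
    """Remove any text in parentheses, together with the parentheses."""
    return re.sub(r'\s*\(.*?\)', '', name).strip()


raw = 'Virgin Islands (British)'
print(strip_punctuation(raw))
print(drop_parenthetical(raw))
```

Keeping the words preserves the distinction between the British and U S Virgin Islands, while dropping the parenthetical text would collapse both to the same name.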

In [None]:
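# drop connector words; build a new list rather than removing while iterating
connector_words = ['and', 'et', 'of', 'the']
test = ['South', 'Georgia', 'the', 'South', 'Sandwich', 'Islands']
test = [i for i in test if i not in connector_words]
print test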
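
Removing items from a list inside a `for` loop over that same list is easy to get wrong: after each removal the remaining items shift left, so the loop skips the item that follows. The sketch below shows the effect on a made-up word list with two connector words in a row, next to a list comprehension that avoids the problem.

```python
connector_words = ['and', 'et', 'of', 'the']

# removing while iterating: the item after each removal is skipped
skipped = ['Bonaire', 'and', 'of', 'Saba']
for w in skipped:
    if w in connector_words:
        skipped.remove(w)
print(skipped)  # 'of' is still in the list

# building a new list visits every item exactly once
words = ['Bonaire', 'and', 'of', 'Saba']
filtered = [w for w in words if w not in connector_words]
print(filtered)
```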

In [None]:
import re

def remove_special_char(s):
    # replace each run of non-word characters with a single space
    clean_s = re.sub(r'\W+', ' ', s)
    return clean_s
    

names_cleaned = regions['name'].apply(remove_special_char)
names_cleaned_list = names_cleaned.str.split()
connector_words = ['and', 'et', 'of', 'the']

# drop connector words from each name, collecting the results in new_list
new_list = []
for l in names_cleaned_list:
    kept = [word for word in l if word not in connector_words]
    new_list.append(kept)
    print kept
       


In [None]:
# After discussing the structure of the data and any problems that need to be
#   cleaned, perform those cleaning steps in the second part of this section.


<a id='eda'></a>
## Exploratory Data Analysis

> **Tip**: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

### Research Question 1 (Replace this header name!)

In [None]:
import seaborn as sns

sns.set(color_codes = True)
#sns.set_style({'axes.linewidth': '0.5'})


co2_my_countries_long.year = pd.Categorical(co2_my_countries_long.year, ordered = True)



p = sns.factorplot(x = 'year', y = 'value', hue = 'country', 
               row = 'region', size = 6, data = co2_my_countries_long)

p.set_xticklabels(rotation=45)
p.set(ylim=(0,None))


co2_my_countries_long.groupby(['sub-region', 'country']).describe()

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed. Make sure that you are clear with regards to the limitations of your exploration. If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work, you should save a copy of the report in HTML or PDF form via the **File** > **Download as** submenu. Before exporting your report, check over it to make sure that the flow of the report is complete. You should probably remove all of the "Tip" quotes like this one so that the presentation is as tidy as possible. Congratulations!