# Home task: pandas 

## Question 1

- Load the energy data from the file [Energy Indicators.xls](http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls).
It is a list of indicators of energy supply and renewable electricity production from the United Nations for the year 2013.


- It should be put into a DataFrame with the variable name of "energy"


- Make sure to exclude the footer and header information from the datafile.


- The first two columns are unnecessary, so you should get rid of them, and you should change the column labels so that the columns are:<br>
`['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']`


- Convert `Energy Supply` to gigajoules (there are 1,000,000 gigajoules in a petajoule).


- For all countries which have missing data (e.g. data with `...`) make sure this is reflected as `np.NaN` values.


- Rename the following list of countries (for use in later questions):
    - `Republic of Korea`: `South Korea`,
    - `United States of America`: `United States`,
    - `United Kingdom of Great Britain and Northern Ireland`: `United Kingdom`,
    - `China, Hong Kong Special Administrative Region`: `Hong Kong`


- There are also several countries with numbers and/or parentheses in their names. Be sure to remove these, e.g.:
    - `Bolivia (Plurinational State of)` should be `Bolivia`,
    - `Switzerland17` should be `Switzerland`.


- Next, load the GDP data from the file ["world_bank.csv"](http://data.worldbank.org/indicator/NY.GDP.MKTP.CD).
It is a CSV file containing countries' GDP from 1960 to 2015 from the World Bank. Call this DataFrame "GDP".


- Make sure to skip the header, and rename the following list of countries:
    - `Korea, Rep.`: `South Korea`,
    - `Iran, Islamic Rep.`: `Iran`,
    - `Hong Kong SAR, China`: `Hong Kong`


- Finally, load the [Scimago Journal and Country Rank data for Energy Engineering and Power Technology](http://www.scimagojr.com/countryrank.php?category=2102). It ranks countries based on their journal contributions in the aforementioned area. Call this DataFrame "ScimEn".


- Join the three datasets: Energy, GDP, and ScimEn into a new dataset (using the intersection of country names). Use only the 10 years (2006-2015) of GDP data and only the top 15 countries by Scimagojr 'Rank' (Rank 1 through 15).


- The index of this DataFrame should be the name of the country, and the columns should be<br>
`['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']`

The function `answer_one` should return the resulting DataFrame (20 columns and 15 entries).

## Answer the following questions in the context of only the top 15 countries by Scimagojr Rank (aka the DataFrame returned by `answer_one()`)

In [1]:
import pandas as pd

# File names
ENERGY_FILE = "Energy Indicators.xls"
GDP_FILE = "world_bank.csv"
SCIM_ENERGY_FILE = "scimagojr.xlsx"

In [2]:
# Country names to rename
rename = {
    "Republic of Korea": "South Korea",
    "Democratic People's Republic of Korea": "North Korea",
    "United States of America": "United States",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "China, Hong Kong Special Administrative Region": "Hong Kong",
    "China, Macao Special Administrative Region": "Macao",
    "The former Yugoslav Republic of Macedonia": "Republic of North Macedonia"
}

# Read first DataFrame from XLS file
energy = pd.read_excel(
    ENERGY_FILE,
    sheet_name="Energy",
    skiprows=17,      # Exclude header
    skipfooter=38,    # Exclude footer
    usecols="B,D:F",  # Get necessary columns
    names=["Country", "Energy Supply", "Energy Supply per Capita", "% Renewable"],
    converters={
        "Country": lambda x: (
            rename[str(x)] if str(x) in rename else            # Rename chosen country names
            str(x)[:str(x).find(" (")] if " (" in str(x) else  # Remove parenthesized suffix from country names
            str(x)
        ),
        "Energy Supply": lambda x: int(x) * 1000000,  # Convert values "Energy Supply" to gigajoules
        "Energy Supply per Capita": lambda x: int(x),
    },
    na_values="...",  # Convert "..." to NaN values
)

# Set "Country" column as DataFrame index
energy.set_index("Country", inplace=True)

# Check if DataFrame is read correctly
energy.head()

Country,Energy Supply,Energy Supply per Capita,% Renewable
Afghanistan,321000000.0,10.0,78.66928
Albania,102000000.0,35.0,100.0
Algeria,1959000000.0,51.0,0.55101
American Samoa,,,0.641026
Andorra,9000000.0,121.0,88.69565
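
The converter in the cell above strips parenthesized suffixes, but it does not remove trailing footnote digits such as `Switzerland17`, which the task also asks for. A minimal sketch of doing the cleanup after loading instead, on a small made-up frame (the helper name `clean_country` and the sample rows are illustrative assumptions, not part of the original solution):

```python
import re

import pandas as pd

def clean_country(name: str) -> str:
    """Strip a parenthesized suffix and any digits from a country name."""
    name = re.sub(r"\s*\(.*\)", "", name)  # "Bolivia (Plurinational State of)" -> "Bolivia"
    name = re.sub(r"\d+", "", name)        # "Switzerland17" -> "Switzerland"
    return name.strip()

# Made-up sample rows; "Energy Supply" is in petajoules, "..." marks missing data
sample = pd.DataFrame({
    "Country": ["Bolivia (Plurinational State of)", "Switzerland17", "American Samoa"],
    "Energy Supply": [336, 1113, "..."],
})

sample["Country"] = sample["Country"].map(clean_country)
# Non-numeric markers such as "..." become NaN, then petajoules -> gigajoules
sample["Energy Supply"] = pd.to_numeric(sample["Energy Supply"], errors="coerce") * 1_000_000
```

If names are cleaned this way, the rename dictionary would presumably be applied afterwards, so that a value like `United States of America20` first loses its digits and then matches its dictionary key.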


In [3]:
# Country names to rename
rename = {
    "Czechia": "Czech Republic",
    "Egypt, Arab Rep.": "Egypt",
    "Micronesia, Fed. Sts.": "Micronesia",
    "Hong Kong SAR, China": "Hong Kong",
    "Macao SAR, China": "Macao",
    "Iran, Islamic Rep.": "Iran",
    "Korea, Rep.": "South Korea",
    "Korea, Dem. People's Rep.": "North Korea",
    "Venezuela, RB": "Venezuela",
}

# Read second DataFrame from CSV file
gdp = pd.read_csv(
    GDP_FILE,
    skiprows=4,  # Exclude header
    converters={
        "Country Name": lambda x: (
            rename[str(x)] if str(x) in rename else                    # Rename chosen country names
            str(x)[:str(x).find(", The")] if ", The" in str(x) else    # Remove ", The" from country names
            str(x)[:str(x).find(", Rep.")] if ", Rep." in str(x) else  # Remove ", Rep." from country names
            str(x)
        )
    }
)

# Set "Country Name" column as DataFrame index
gdp.set_index("Country Name", inplace=True)

# Check if DataFrame is read correctly
gdp.head()

Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,Unnamed: 67
Aruba,ABW,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,,...,2791061000.0,2963128000.0,2983799000.0,3092179000.0,3276188000.0,3395794000.0,2610039000.0,3126019000.0,,
Africa Eastern and Southern,AFE,GDP (current US$),NY.GDP.MKTP.CD,21125020000.0,21616230000.0,23506280000.0,28048360000.0,25920670000.0,29472100000.0,32014370000.0,...,1006526000000.0,927348500000.0,885176400000.0,1021043000000.0,1007196000000.0,1000834000000.0,927593300000.0,1081998000000.0,1169484000000.0,
Afghanistan,AFG,GDP (current US$),NY.GDP.MKTP.CD,537777800.0,548888900.0,546666700.0,751111200.0,800000000.0,1006667000.0,1400000000.0,...,20550580000.0,19998140000.0,18019550000.0,18896350000.0,18418860000.0,18904500000.0,20143450000.0,14583140000.0,,
Africa Western and Central,AFW,GDP (current US$),NY.GDP.MKTP.CD,10447640000.0,11173210000.0,11990530000.0,12727690000.0,13898110000.0,14929790000.0,15910840000.0,...,894322500000.0,768644700000.0,691363400000.0,684898800000.0,767025700000.0,822538400000.0,786460000000.0,844459700000.0,877863300000.0,
Angola,AGO,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,,...,137244400000.0,87219300000.0,49840490000.0,68972770000.0,77792940000.0,69309110000.0,50241370000.0,65685440000.0,106713600000.0,


In [4]:
# Read third DataFrame from XLSX file
scim_en = pd.read_excel(
    SCIM_ENERGY_FILE,
    sheet_name="Sheet1"
)

# Set "Country" column as DataFrame index
scim_en.set_index("Country", inplace=True)

# Check if DataFrame is read correctly
scim_en.head()

Country,Rank,Region,Documents,Citable documents,Citations,Self-citations,Citations per document,H index
China,1,Asiatic Region,360468,358777,3947871,2705774,10.95,308
United States,2,Northern America,199442,195042,3068926,881789,15.39,422
India,3,Asiatic Region,76103,74167,760964,280893,10.0,217
Japan,4,Asiatic Region,56249,55680,633294,136132,11.26,217
United Kingdom,5,Western Europe,52572,51156,909276,151672,17.3,267


In [5]:
def answer_one():
    # Column names for the returned DataFrame
    columns = [
        "Rank", "Documents", "Citable documents", "Citations", "Self-citations", "Citations per document", "H index",
        "Energy Supply", "Energy Supply per Capita", "% Renewable",
        "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015"
    ]

    # Build a new DataFrame by inner-joining the three DataFrames on their indexes
    top15 = energy.join([gdp, scim_en], how="inner")

    # Set name of DataFrame index as "Country"
    top15.index.name = "Country"

    # Sort DataFrame by "Rank" column and get first 15 rows
    return top15[columns].sort_values("Rank").head(15)

# Test the function
answer_one()

Country,Rank,Documents,Citable documents,Citations,Self-citations,Citations per document,H index,Energy Supply,Energy Supply per Capita,% Renewable,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
China,1,360468,358777,3947871,2705774,10.95,308,127191000000,93,19.75491,2752119000000.0,3550328000000.0,4594337000000.0,5101691000000.0,6087192000000.0,7551545000000.0,8532186000000.0,9570471000000.0,10475620000000.0,11061570000000.0
United States,2,199442,195042,3068926,881789,15.39,422,90838000000,286,11.57098,13815590000000.0,14474230000000.0,14769860000000.0,14478060000000.0,15048960000000.0,15599730000000.0,16253970000000.0,16843190000000.0,17550680000000.0,18206020000000.0
India,3,76103,74167,760964,280893,10.0,217,33195000000,26,14.96908,940259900000.0,1216736000000.0,1198895000000.0,1341888000000.0,1675616000000.0,1823052000000.0,1827638000000.0,1856721000000.0,2039126000000.0,2103588000000.0
Japan,4,56249,55680,633294,136132,11.26,217,18984000000,149,10.23282,4601663000000.0,4579750000000.0,5106679000000.0,5289494000000.0,5759072000000.0,6233147000000.0,6272363000000.0,5212328000000.0,4896994000000.0,4444931000000.0
United Kingdom,5,52572,51156,909276,151672,17.3,267,7920000000,124,10.60047,2709978000000.0,3092996000000.0,2931684000000.0,2417566000000.0,2491397000000.0,2666403000000.0,2706341000000.0,2786315000000.0,3065223000000.0,2934858000000.0
Germany,6,47781,46767,641717,133693,13.43,230,13261000000,165,17.90153,2994704000000.0,3425578000000.0,3745264000000.0,3411261000000.0,3399668000000.0,3749315000000.0,3527143000000.0,3733805000000.0,3889093000000.0,3357586000000.0
Russian Federation,7,43567,43290,175721,79765,4.03,103,30709000000,214,17.28868,989932100000.0,1299703000000.0,1660848000000.0,1222646000000.0,1524917000000.0,2045923000000.0,2208294000000.0,2292470000000.0,2059242000000.0,1363482000000.0
Canada,8,39036,38276,787010,125333,20.16,263,10431000000,296,61.94543,1319265000000.0,1468820000000.0,1552990000000.0,1374625000000.0,1617343000000.0,1793327000000.0,1828366000000.0,1846597000000.0,1805750000000.0,1556509000000.0
Italy,9,35991,34424,529459,123042,14.71,192,6530000000,109,33.66723,1949552000000.0,2213102000000.0,2408655000000.0,2199929000000.0,2136100000000.0,2294994000000.0,2086958000000.0,2141924000000.0,2162010000000.0,1836638000000.0
South Korea,10,35294,35005,503147,87529,14.26,182,11007000000,221,2.279353,1053217000000.0,1172614000000.0,1047339000000.0,943941900000.0,1144067000000.0,1253223000000.0,1278428000000.0,1370795000000.0,1484318000000.0,1465773000000.0
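
`answer_one` joins all three tables and then keeps the first 15 rows by `Rank`. That yields ranks 1 through 15 only if every top-15 country survives the inner join. A hedged alternative is to filter on `Rank` before joining. The sketch below uses made-up miniature tables (`energy_toy`, `gdp_toy`, `scim_toy` are illustrative names) and a cutoff of 2 instead of 15:

```python
import pandas as pd

# Made-up miniature versions of the three tables, indexed by country
energy_toy = pd.DataFrame({"Energy Supply": [10.0, 20.0, 30.0]}, index=["A", "B", "C"])
gdp_toy = pd.DataFrame({"2006": [1.0, 2.0], "2007": [1.5, 2.5]}, index=["A", "B"])
scim_toy = pd.DataFrame({"Rank": [2, 1, 3]}, index=["A", "B", "D"])

# Keep only the top-ranked rows first, then intersect on the index
top = scim_toy[scim_toy["Rank"] <= 2]
joined = (
    top.join(energy_toy, how="inner")
       .join(gdp_toy, how="inner")
       .sort_values("Rank")
)
```

With this ordering, a country missing from one table shrinks the result instead of silently pulling in a lower-ranked country.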


### Question 2
What is the average GDP over the last 10 years for each country? (Exclude missing values from this calculation.)

*This function should return a Series named `avgGDP` with 15 countries and their average GDP sorted in descending order.*

In [6]:
def answer_two():
    # Get DataFrame from answer_one function
    top15 = answer_one()

    # Select columns from "2006" to "2015", for each row calculate mean
    # among values in selected columns, and sort values in descending order
    avg_gdp = top15.loc[:, "2006":"2015"].mean(axis="columns").sort_values(ascending=False)
    return avg_gdp

# Test the function
answer_two()

Country
United States         1.570403e+13
China                 6.927707e+12
Japan                 5.239642e+12
Germany               3.523342e+12
United Kingdom        2.780276e+12
France                2.691337e+12
Italy                 2.142986e+12
Brazil                1.988889e+12
Russian Federation    1.666746e+12
Canada                1.616359e+12
India                 1.602352e+12
Spain                 1.400886e+12
South Korea           1.221372e+12
Australia             1.207513e+12
Iran                  4.563261e+11
dtype: float64
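
The row-wise mean skips missing values by default (`skipna=True`), which is what the question asks for. The task also asks for a Series named `avgGDP`, while the output above carries no name. A small sketch on made-up numbers, showing both the NaN handling and one way to set the name:

```python
import numpy as np
import pandas as pd

# Made-up GDP values for two countries; "A" has a missing year
toy = pd.DataFrame(
    {"2006": [1.0, 4.0], "2007": [np.nan, 6.0], "2008": [3.0, 8.0]},
    index=["A", "B"],
)

# Label-based column slice, mean per row (NaN ignored), sorted descending
avg = toy.loc[:, "2006":"2008"].mean(axis="columns").sort_values(ascending=False)
avg.name = "avgGDP"
```

Here country `A` averages over its two available years only, so its mean is 2.0 and not NaN.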

### Question 3
By how much did the GDP change over the 10-year span for the country with the 6th largest average GDP?

*This function should return a single number.*

In [7]:
def answer_three():
    # Get Series with average GDPs from answer_two function
    avg_gdps = answer_two()

    # answer_two already returns the averages in descending order; sorting again is
    # only a safeguard. Get the name of the country in 6th position
    country_6th_gdp = avg_gdps.sort_values(ascending=False).index[5]

    # Get DataFrame from answer_one function
    top15 = answer_one()

    # Get the GDP values for 2006 and 2015 for the country in 6th position
    country_gdp = top15.loc[country_6th_gdp, ["2006", "2015"]]

    # Calculate the difference between GDP value in 2015 and in 2006
    return country_gdp["2015"] - country_gdp["2006"]

# Test the function
answer_three()

118652421857.7998

### Question 4

Create a new column that is the ratio of Self-Citations to Total Citations. 
What is the maximum value for this new column, and what country has the highest ratio?

*This function should return a tuple with the name of the country and the ratio.*

In [8]:
def answer_four():
    # Get DataFrame from answer_one function
    top15 = answer_one()

    # Create "Self-citations ratio" column by dividing the values of
    # "Self-citations" column by the values of "Citations" column
    top15["Self-citations ratio"] = top15["Self-citations"] / top15["Citations"]

    # Select the row(s) whose ratio equals the maximum and keep only the "Self-citations ratio" column
    max_ratio = top15[top15["Self-citations ratio"] == top15["Self-citations ratio"].max()]["Self-citations ratio"]

    # Return tuple which contains the country name (Series index) and the maximum ratio (Series value)
    return (max_ratio.index.item(), max_ratio.item())

# Test the function
answer_four()

('China', 0.6853754846599598)
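
The same lookup can be written more compactly with `Series.idxmax`, which returns the index label of the first maximum. A minimal sketch on made-up citation counts:

```python
import pandas as pd

# Made-up citation counts for three countries
toy = pd.DataFrame(
    {"Citations": [100, 200, 50], "Self-citations": [10, 120, 20]},
    index=["A", "B", "C"],
)

ratio = toy["Self-citations"] / toy["Citations"]
# idxmax gives the label of the largest ratio, max gives the value itself
result = (ratio.idxmax(), ratio.max())
```

Unlike the boolean-mask approach, this does not raise an error when two countries tie for the maximum. It simply returns the first one.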

### Question 5

Create a column that estimates the population using Energy Supply and Energy Supply per capita. 
What is the third most populous country according to this estimate?

*This function should return a single string value.*

In [9]:
def answer_five():
    # Get DataFrame from answer_one function
    top15 = answer_one()

    # Population estimation is computed by integer dividing the values of
    # "Energy Supply" column by the values of "Energy Supply per Capita" column
    top15["Population"] = top15["Energy Supply"].floordiv(top15["Energy Supply per Capita"])

    # Sort rows by "Population" column in descending order and get the name of the country in 3rd position
    return top15.sort_values("Population", ascending=False).index[2]

# Test the function
answer_five()

'United States'
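
The estimate rests on the identity population ≈ Energy Supply / Energy Supply per Capita. As a quick check, the sketch below repeats the calculation for three countries, using the `Energy Supply` and `Energy Supply per Capita` values shown in the `answer_one()` output above:

```python
import pandas as pd

# Values copied from the answer_one() output above (gigajoules, gigajoules per person)
toy = pd.DataFrame(
    {
        "Energy Supply": [127191000000, 90838000000, 33195000000],
        "Energy Supply per Capita": [93, 286, 26],
    },
    index=["China", "United States", "India"],
)

# Integer division, as in answer_five
population = toy["Energy Supply"].floordiv(toy["Energy Supply per Capita"])
ranking = population.sort_values(ascending=False)
```

This gives roughly 1.37 billion for China, 1.28 billion for India, and 318 million for the United States, which is consistent with the United States being third.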

### Question 6
Create a column that estimates the number of citable documents per person. 
What is the correlation between the number of citable documents per capita and the energy supply per capita? Use the `.corr()` method (Pearson's correlation).

*This function should return a single number.*


In [10]:
def answer_six():
    # Get DataFrame from answer_one function
    top15 = answer_one()

    # Create "Population" column by integer dividing the values of "Energy Supply"
    # column by the values of "Energy Supply per Capita" column
    top15["Population"] = top15["Energy Supply"].floordiv(top15["Energy Supply per Capita"])

    # Create "Citable documents per Capita" by dividing the values of
    # "Citable documents" column by the values of "Population" column
    top15["Citable documents per Capita"] = top15["Citable documents"] / top15["Population"]

    # Create the correlation matrix between values of "Citable documents per Capita" and
    # "Energy Supply per Capita" columns, and return the correlation coefficient
    return top15[["Citable documents per Capita", "Energy Supply per Capita"]].corr().iloc[0, 1]

# Test the function
answer_six()

0.7114342519843205
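
Instead of building the full correlation matrix and indexing into it, `Series.corr` computes the coefficient for a single pair of columns directly (Pearson is its default method). A minimal sketch on made-up data, showing that both routes agree:

```python
import pandas as pd

# Made-up, roughly linear data
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 9.0]})

r_matrix = toy[["x", "y"]].corr().iloc[0, 1]  # off-diagonal entry of the 2x2 matrix
r_series = toy["x"].corr(toy["y"])            # same coefficient, computed pairwise
```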

### Question 7
Use the following dictionary to group the countries by continent, then create a DataFrame that displays the sample size (the number of countries in each continent bin), and the sum, mean, and standard deviation of the estimated population of each country.

```python
ContinentDict  = {'China':'Asia', 
                  'United States':'North America', 
                  'Japan':'Asia', 
                  'United Kingdom':'Europe', 
                  'Russian Federation':'Europe', 
                  'Canada':'North America', 
                  'Germany':'Europe', 
                  'India':'Asia',
                  'France':'Europe', 
                  'South Korea':'Asia', 
                  'Italy':'Europe', 
                  'Spain':'Europe', 
                  'Iran':'Asia',
                  'Australia':'Australia', 
                  'Brazil':'South America'}
```

*This function should return a DataFrame with index named Continent `['Asia', 'Australia', 'Europe', 'North America', 'South America']` and columns `['size', 'sum', 'mean', 'std']`*

In [11]:
def answer_seven():
    # Maps the country names to their continents
    continent_dict = {
        "China": "Asia",
        "United States": "North America",
        "Japan": "Asia",
        "United Kingdom": "Europe",
        "Russian Federation": "Europe",
        "Canada": "North America",
        "Germany": "Europe",
        "India": "Asia",
        "France": "Europe",
        "South Korea": "Asia",
        "Italy": "Europe",
        "Spain": "Europe",
        "Iran": "Asia",
        "Australia": "Australia",
        "Brazil": "South America"
    }

    # Get DataFrame from answer_one function, and set its index as a column
    top15 = answer_one().reset_index()

    # Create "Continent" column by mapping the values of "Country" column through continent_dict
    top15["Continent"] = top15["Country"].map(continent_dict)
    
    # Create "Population" column by integer dividing the values of "Energy Supply"
    # column by the values of "Energy Supply per Capita" column
    top15["Population"] = top15["Energy Supply"].floordiv(top15["Energy Supply per Capita"])

    # Group rows by "Continent", get values of "Country" and "Population" columns, and perform
    # aggregate functions: count on "Country" column, sum, mean and std on "Population" column
    continents = top15.groupby("Continent")[["Country", "Population"]].agg(
        {"Country": "count", "Population": ["sum", "mean", "std"]}
    )

    # Rename the column labels
    continents.columns = ["size", "sum", "mean", "std"]
    return continents

# Test the function
answer_seven()

Continent,size,sum,mean,std
Asia,5,2898666384,579733276.8,679097900.0
Australia,1,23316017,23316017.0,
Europe,6,457929664,76321610.666667,34647670.0
North America,2,352855248,176427624.0,199669600.0
South America,1,205915254,205915254.0,
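
The `std` column is empty for Australia and South America because pandas computes the sample standard deviation (`ddof=1`), which is undefined for a single observation. The aggregation can also be written by passing a list of function names to `agg`, which yields the required column labels without a separate rename step. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up countries, continents and populations
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Continent": ["Asia", "Asia", "Europe"],
    "Population": [100, 300, 50],
})

# One result column per listed aggregation, labelled with the function name
stats = toy.groupby("Continent")["Population"].agg(["size", "sum", "mean", "std"])
```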
