# Elements Of Data Processing - Week 2

## Preliminaries


## IPython

IPython is an interactive computing and development environment.
Let's start with some basic examples in IPython. Launch IPython on the command line with the `ipython` command.

### Basics
Python objects are formatted to be more readable in IPython:

In [None]:
import numpy as np

samples = {i:np.random.randn() for i in range(10)}

In [None]:
samples

### Object Introspection
You can find general information about an object, e.g., a variable, a function or an instance method, using a question mark (?). A double question mark (??) also shows the source code when it is available.

In [None]:
student_names = []

In [None]:
student_names?

In [None]:
def add_integers(x, y):
    '''
    (int, int) -> int
    -----------
    Takes two integers as input
    Returns the sum of the input arguments
    '''
    return x + y

In [None]:
add_integers?

In [None]:
add_integers??
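
Outside IPython, roughly the same information can be obtained from the standard library. The sketch below redefines `add_integers` so that it is self-contained and uses `inspect.signature` and `inspect.getdoc`. It only illustrates the kind of information `?` reports and is not a description of how IPython produces it.

```python
import inspect


def add_integers(x, y):
    '''
    (int, int) -> int
    -----------
    Takes two integers as input
    Returns the sum of the input arguments
    '''
    return x + y


# The call signature of the function
signature = str(inspect.signature(add_integers))
print(signature)

# The docstring with its indentation cleaned up
doc = inspect.getdoc(add_integers)
print(doc)
```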

### Magic Commands

Magic commands are special IPython commands, prefixed with %, that are designed to facilitate common tasks.

In [None]:
# run a .py file
%run hello.py

In [None]:
# check the execution time of a statement
import numpy as np
a = np.random.randn(10,10)
%timeit np.dot(a,a)
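
`%timeit` is only available inside IPython and Jupyter. In a plain Python script, a similar measurement can be sketched with the standard `timeit` module. The number of runs (`n_runs`) is an arbitrary choice for illustration, and the result will not match `%timeit` exactly because `%timeit` chooses its own repeat counts.

```python
import timeit

import numpy as np

a = np.random.randn(10, 10)

# Run np.dot(a, a) n_runs times and measure the total elapsed time in seconds
n_runs = 1000
total_seconds = timeit.timeit(lambda: np.dot(a, a), number=n_runs)

# Average time per call, in seconds
print(total_seconds / n_runs)
```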

## Jupyter Notebook

The advantage of notebooks over the terminal is that you can interactively read the questions,
write the answers and see the results. The notebook application runs as a server process started from the command line with the `jupyter notebook` command.

## Pandas

Pandas is a library that provides high-level data structures and manipulation tools for faster analysis.

In [None]:
import pandas as pd

### Series
A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

<img src="images/series1.jpg">

The basic method to create a Series:

    s = pd.Series(data, index=index)

Here, `data` can be different things, including:

- a list
- an array
- a dictionary
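
As a minimal sketch of the three cases listed above, the following builds one Series from each kind of input. The variable names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a list, with an explicit index
from_list = pd.Series([4, 3, -5], index=['a', 'b', 'c'])

# From a NumPy array; without an index, the labels default to 0, 1, 2, ...
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# From a dictionary; the keys become the index
from_dict = pd.Series({'x': 10, 'y': 20})

print(from_list, from_array, from_dict, sep='\n\n')
```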

#### Example 1 : Create a Basic Series Object

In [None]:
# series constructor with data as a list of integers

l = [4,3,-5,9,1,7]
s = pd.Series(l)

In [None]:
# the default indexing starts from zero
s.index

In [None]:
# retrieve the values of the series
s.values

In [None]:
# create your own index using lists
newIndex = ['a','b','c','d','e','f']
s.index  = newIndex

In [None]:
# verify the index
s

In [None]:
# Creating a series from a python dict

Aus_Emission = {'1990':15.45288167, '2000':17.20060983, '2007':17.86526004,
                '2008':18.16087566,'2009':18.20018196,'2010':16.92095367,
                '2011':16.86260095, '2012':16.51938578, '2013':16.34730205}

co2_Emission = pd.Series(Aus_Emission)

In [None]:
# retrieve the values of the series
co2_Emission.values

In [None]:
# verify the series object
co2_Emission

### Slicing
You can **select** sections of list-like types (arrays, tuples, NumPy arrays) by using various slice notations:

In [None]:
# slicing the series using a boolean array operation 
co2_Emission[co2_Emission>16.0]

In [None]:
# slicing the series by index label: everything up to and including '2000'
co2_Emission[:'2000']
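
Unlike ordinary Python slices, a slice by label includes its end label. The sketch below contrasts this with a position-based slice through `.iloc`, using a shortened copy of the emission values from above (the name `co2` is introduced here only for the example).

```python
import pandas as pd

# A shortened copy of the emission values used above
co2 = pd.Series({'1990': 15.45288167, '2000': 17.20060983, '2007': 17.86526004})

# Label-based slice: the end label '2000' is included
by_label = co2[:'2000']

# Position-based slice: the end position 2 is excluded
by_position = co2.iloc[:2]

# Both select the rows labelled '1990' and '2000'
print(by_label)
print(by_position)
```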

In [None]:
# double the values of the series object
doubled = co2_Emission*2
doubled

In [None]:
# finding the average value of the series
co2_Emission.mean()

In [None]:
# defining the name of the series
co2_Emission.name = 'CO2 Emission'

In [None]:
# defining the name of the index
co2_Emission.index.name = 'Year'

In [None]:
# verify the series object
co2_Emission

### Exercise 1

Pandas Series objects have both <i>ndarray-like</i> and <i>dict-like</i> properties. Given the co2_Emission series object, do the following:

- As with the average of the series object, retrieve the maximum, median and cumulative sum of CO2 emission between 1990 and 2013 (max(), median() and cumsum() methods).
- Retrieve the CO2 emissions in Australia between 2000 and 2010.
- Given that the population of Australia in 2013 is 23117353, retrieve the CO2 emission per capita for that year.



### DataFrames

A DataFrame:

- represents a tabular data structure
- can be thought of as a dictionary of Series objects
- has both row and column indices

<img src="images/DF.jpg">


In [None]:
# create a new series of the population
Aus_Population = {'1990':17065100, '2000':19153000, '2007':20827600,
                 '2008':21249200,'2009':21691700,'2010':22031750,
                 '2011':22340024, '2012':22728254, '2013':23117353}
population = pd.Series(Aus_Population)

In [None]:
# verify the values in the series
population

In [None]:
# create a DataFrame object from the series objects
australia = pd.DataFrame({'co2_emission':co2_Emission, 'Population':population})
australia
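
When a DataFrame is built from several Series, pandas aligns them on their index labels. The sketch below uses a few of the values from above with deliberately mismatched years to show that labels missing from one Series are filled with NaN. The names `co2`, `pop` and `df` are introduced only for this example.

```python
import pandas as pd

# Two small Series whose indexes only partly overlap (values taken from above)
co2 = pd.Series({'2012': 16.51938578, '2013': 16.34730205})
pop = pd.Series({'2011': 22340024, '2013': 23117353})

# The resulting index is the union of both indexes;
# entries missing from one Series become NaN
df = pd.DataFrame({'co2_emission': co2, 'Population': pop})
print(df)

# Each column of the DataFrame is itself a Series
print(type(df['Population']))
```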

In [None]:
# create a DataFrame from a csv file
countries = pd.read_csv('countries.csv',encoding = 'ISO-8859-1')

In [None]:
# show the first 10 rows of the DataFrame
countries.head(10) # head() shows 5 rows by default

In [None]:
# count the number of countries in each region
countries.Region.value_counts()

In [None]:
# set the country names as the index
# (set_index returns a new DataFrame; countries itself is unchanged)
countries.set_index('Country')
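
To keep the re-indexed table, assign the result of `set_index` to a name. A minimal sketch, using a tiny hypothetical stand-in for the countries table (the rows and the names `countries_demo` and `indexed` are made up for illustration):

```python
import pandas as pd

# A tiny stand-in for the countries table (hypothetical rows)
countries_demo = pd.DataFrame({
    'Country': ['Australia', 'Canada'],
    'Region': ['R1', 'R2'],
})

# set_index returns a new DataFrame and leaves the original untouched
indexed = countries_demo.set_index('Country')

print(countries_demo.index.tolist())  # still the default integer index
print(indexed.index.tolist())         # the country names
```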


In [None]:
# create a new DataFrame for the CO2 emission from a csv file
emission = pd.read_csv('emission.csv',encoding = 'ISO-8859-1')
emission.head()

In [None]:
# Create a subset of emission dataset for Year 2010
yr2010 = emission['2010']
names  = emission['Country']
yr2010.index = names
type(yr2010)

In [None]:
# Sort column values using sort_values 
yr2010.sort_values()


In [None]:
# Sort column values in descending order to find the top countries
yr2010.sort_values(ascending = False)
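
When only the first few entries of the sorted result are needed, `Series.nlargest` and `Series.nsmallest` are a possible shortcut. The sketch below uses hypothetical values (not taken from emission.csv) and checks that `nlargest` agrees with sorting in descending order and taking the head.

```python
import pandas as pd

# Hypothetical emission values, for illustration only
demo = pd.Series({'A': 3.0, 'B': 9.5, 'C': 1.2, 'D': 7.1})

# The two largest and the two smallest values
top2 = demo.nlargest(2)
bottom2 = demo.nsmallest(2)

# The same top two, obtained by sorting first
same_top2 = demo.sort_values(ascending=False).head(2)

print(top2)
print(bottom2)
```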

### Exercise 2

- Retrieve the mean and median of the CO2 emission generated in 2012 by all countries.
- Retrieve the top 5 countries with the most CO2 emission in 2012. How about the 5 countries with the least emission? (remember that sort_values has an **ascending** parameter that is set to True by default).
- Retrieve the sum of CO2 emission for all years and find the 2 years with the maximum CO2 emission.





#### More Sort Operations

In [None]:
# Sort column values of a DataFrame
sorted2012 = emission.sort_values( by = '2012',ascending = False )
sorted2012

In [None]:
# Sort column values using two columns
sorted2012 = emission.sort_values( by = ['2012','2013'],ascending = [False, True] )
sorted2012
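
The second column in `by` only matters when rows tie on the first one. A small sketch with hypothetical values, where two rows share the same '2012' value and are then ordered by '2013' in ascending order:

```python
import pandas as pd

# Hypothetical values: A and B tie on '2012'
demo = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    '2012': [5.0, 5.0, 2.0],
    '2013': [4.0, 1.0, 9.0],
})

# '2012' descending first; ties are broken by '2013' ascending
ordered = demo.sort_values(by=['2012', '2013'], ascending=[False, True])
print(ordered)
```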

#### Groupby
<img src="files/images/groupby1.jpg">

In [None]:
# Groupby using the Region column
grouped = countries.groupby('Region')
type(grouped)


In [None]:
# find the size of each group
for k, group in grouped:
    print(k)
    print(group.shape[0])

In [None]:
# Find the number of high income and low income countries by region
for k, group in grouped:
    print(k)
    print(group['IncomeGroup'].value_counts())
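
The same counts can also be obtained without an explicit loop, by calling `size()` and `value_counts()` on the grouped object. A minimal sketch on a hypothetical table with the same column names (the rows are made up for illustration):

```python
import pandas as pd

# Hypothetical rows with the same columns as the countries table
demo = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Region': ['R1', 'R1', 'R2', 'R1'],
    'IncomeGroup': ['High income', 'Low income', 'High income', 'High income'],
})
grouped_demo = demo.groupby('Region')

# Number of rows in each group
sizes = grouped_demo.size()

# Count of each IncomeGroup within each Region (indexed by Region and IncomeGroup)
income_counts = grouped_demo['IncomeGroup'].value_counts()

print(sizes)
print(income_counts)
```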

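#### Slicing using .loc and .iloc

The `.ix` indexer has been removed from pandas (since version 1.0). Use `.iloc` to select by integer position and `.loc` to select by label.

In [None]:
# Slicing using a range of rows and a range of columns, by position
# (the end of an .iloc slice is excluded, so this selects rows 2-5 and columns 2-5)
emission.iloc[2:6,2:6]

In [None]:
# Slicing using specific rows and specific columns, by label
emission.loc[[3,5],['Country','1990']]

In [None]:
# Specific rows and all columns

emission.loc[[3,5],:]

In [None]:
# All rows and specific columns
emission.loc[:,['Country','1990']]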
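
The difference between selecting by label and selecting by position is easiest to see when the index is not the default 0, 1, 2, ... The sketch below uses a hypothetical table with row labels 10 to 40 and makes the same selection with `.loc` (labels) and `.iloc` (positions).

```python
import pandas as pd

# Hypothetical table whose row labels differ from the row positions
demo = pd.DataFrame(
    {'Country': ['A', 'B', 'C', 'D'], '1990': [1.0, 2.0, 3.0, 4.0]},
    index=[10, 20, 30, 40],
)

# By label: rows labelled 10 to 30 (end label included)
by_label = demo.loc[10:30, ['Country', '1990']]

# By position: rows at positions 0 to 2 (end position 3 excluded)
by_position = demo.iloc[0:3, 0:2]

print(by_label)
print(by_position)
```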

### Exercise 3

- Create a DataFrame object that has the name, region and IncomeGroup of the top 10 emitting countries in 2012.






### If time permits, put it all together:

Study the effect of population size on CO2 emission.

- Create a new DataFrame object using `pd.read_csv('population.csv', index_col = 'Country', encoding = 'ISO-8859-1')`.
- Select ['Canada', 'United States', 'China', 'Australia'] and compare their population growth. Use the following formula:

$$ \text{growth} = \frac{1}{\text{period}} \times \frac{(\text{value}_e - \text{value}_s) \times 100}{\text{value}_e} $$

- Compute the sum and mean of CO2 emission for the same countries.
- Does an increase in population lead to an increase in the CO2 emission for all of these countries?
- Find the top 10 emitting (per capita) countries in each region for 2010.
- Is there any interesting trend in these countries with regard to their IncomeGroup?
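
The growth formula above can be sketched as a small function. This assumes that `value_s` and `value_e` are the values at the start and end of the period and that `period` is the number of years between them; the source does not define these symbols, so treat that reading as an assumption. The example applies it to the 1990 and 2013 population figures for Australia from the series defined earlier.

```python
def growth(value_s, value_e, period):
    """Average percentage growth per period, following the formula above.

    Assumes value_s is the starting value, value_e the ending value
    and period the number of years between them.
    """
    return (1 / period) * ((value_e - value_s) * 100 / value_e)


# Population of Australia, taken from the series defined earlier
pop_1990 = 17065100
pop_2013 = 23117353
period_years = 2013 - 1990

aus_growth = growth(pop_1990, pop_2013, period_years)
print(round(aus_growth, 3))
```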

### If you are bored:
- Create a Series object named **count** whose index holds the letters of the alphabet and whose values are the number of countries whose names start with each letter.
    