<p><a name="sections"></a></p>


# Sections

- <a href="#DS">Data Structure</a><br>
- <a href="#DM">Data Manipulation</a><br>
- <a href="#miss">Handling Missing Data</a><br>
- <a href="#grouping">Grouping and aggregation</a><br>
- <a href="#time">Time Series</a><br>
- <a href="#sol">Solutions</a><br>

# Pandas

<p><a name="DS"></a></p>
### Data Structure

- Pandas is a Python package built on top of NumPy.  It is particularly strong in the area of handling spreadsheet structures, dealing with missing data, and processing time series data.

- We will talk about three data structure objects in today's lecture: Series, DataFrame and Time Series.

These are the new data types introduced by pandas:

- **Series**: 1D labeled homogeneously-typed array.
- **DataFrame**: General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns.
- **Time Series**: Series with index containing datetimes.

Import the package, as follows:

In [1]:
import numpy as np
import pandas as pd

<p><a name="series"></a></p>
## Series

- A series is a one-dimensional array-like object containing homogeneously typed elements.   
- Each element has an associated data label, called its index. By default, the index consists of ordinary array indices, i.e. consecutive integers starting from zero.

In [2]:
obj = pd.Series(['a', 'b', 'c', 'd'])
obj

0    a
1    b
2    c
3    d
dtype: object

In [3]:
obj.index  #this is the default index

RangeIndex(start=0, stop=4, step=1)

- An entry can be retrieved using the index, as follows:

In [4]:
obj[0]

'a'

- Often it will be more desirable to create a series with a custom index. 
- Here the index is set manually to run from 1 to 4, with 4 repeated. Note that two rows share the index 4.

In [5]:
obj2 = pd.Series(['a', 'b', 'c', 'd','e'], index=[1, 2, 3, 4, 4])
obj2

1    a
2    b
3    c
4    d
4    e
dtype: object

- Calling that entry gives both values.  In this way a series is different from a dictionary.

In [8]:
obj2.index #custom index

Int64Index([1, 2, 3, 4, 4], dtype='int64')

In [7]:
obj2[4]

4    d
4    e
dtype: object

- The index value may also be a string.  A new entry with a string index is written:

In [9]:
obj2['something']=660
obj2

1              a
2              b
3              c
4              d
4              e
something    660
dtype: object

- Note the entries are not retrievable by their place but the value of their index.

In [19]:
print(obj2[5]) # raises KeyError: no entry has the label 5

KeyError: 5
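
Since plain brackets look up labels here, retrieving an entry by its place needs a different accessor. The following is a minimal sketch, assuming a freshly built series `s` (an illustrative name, not part of the lecture code), that contrasts `.loc` (label-based) with `.iloc` (position-based):

```python
import pandas as pd

# Illustrative series with a repeated label, mirroring obj2 before the string entry was added
s = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[1, 2, 3, 4, 4])

by_label = s.loc[1]       # the entry whose label is 1
by_position = s.iloc[0]   # the entry in the first position
last = s.iloc[-1]         # the entry in the last position
repeated = s.loc[4]       # a repeated label returns a Series with both entries

print(by_label, by_position, last)
print(repeated.tolist())
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about whether an integer is meant as a label or as a position.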

- The attribute `values` returns all the values.

In [20]:
obj2.values

array(['a', 'b', 'c', 'd', 'e', 660], dtype=object)

In [21]:
obj2.values[1]   # obj.values is simply an array 

'b'

- The **Series** object is similar to a **dictionary**: `Series.index` is like `dictionary.keys()`, and `Series.values` is like `dictionary.values()`. A dictionary can be converted directly to a Series, as follows:

In [22]:
dict_ = {1: 'a', 2: 'b', 3: 'c', 4: 'd'}
obj3 = pd.Series(dict_)
obj3

1    a
2    b
3    c
4    d
dtype: object

Convert a Series back to a dictionary.

In [23]:
obj3.to_dict()

{1: 'a', 2: 'b', 3: 'c', 4: 'd'}

Note what happens in translating a series with repeated index values.  Only the last entry for the repeated index is included in the dictionary.

In [24]:
obj2.to_dict()

{1: 'a', 2: 'b', 3: 'c', 4: 'e', 'something': 660}
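
If losing the earlier entries is not acceptable, one possible workaround is to collect the values for each label into a list before converting. This is only a sketch; the series `s` and the name `grouped` are illustrative, and grouping is covered properly in a later section.

```python
import pandas as pd

# Illustrative series with the label 4 repeated
s = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[1, 2, 3, 4, 4])

# Check for repeated labels before converting
print(s.index.is_unique)

# Group by the index labels and gather each group's values into a list
grouped = s.groupby(level=0).agg(list).to_dict()
print(grouped)
```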

<p><a name="DF"></a></p>
## DataFrame

- A data frame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns.
- Each column can be a different type (integers, strings, floating point numbers, Python objects, etc.).  
- All columns must be the same length, to give the data frame a defined shape.

In [37]:
data = {'commodity': ['Gold', 'Gold', 'Silver', 'Silver'],
        'year': [2013, 2014, 2014, 2015],
        'production_Moz': [107.6, 109.7, 868.3, 886.7]} #world wide in million oz

# convert to DataFrame
df = pd.DataFrame(data)
df

Unnamed: 0,commodity,year,production_Moz
0,Gold,2013,107.6
1,Gold,2014,109.7
2,Silver,2014,868.3
3,Silver,2015,886.7


In [31]:
df.columns 

Index(['year', 'production_Moz'], dtype='object')

In [32]:
df.index #standard index

Index(['Gold', 'Gold', 'Silver', 'Silver'], dtype='object', name='commodity')

In [33]:
df.index=([4,5,6,7])
df

Unnamed: 0,year,production_Moz
4,2013,107.6
5,2014,109.7
6,2014,868.3
7,2015,886.7


In [34]:
df.index #custom integer index

Int64Index([4, 5, 6, 7], dtype='int64')

- The index may be set using the method `set_index`, as follows:

In [38]:
df=df.set_index('commodity')
df

Unnamed: 0_level_0,year,production_Moz
commodity,Unnamed: 1_level_1,Unnamed: 2_level_1
Gold,2013,107.6
Gold,2014,109.7
Silver,2014,868.3
Silver,2015,886.7


In [39]:
df.index #custom string index

Index(['Gold', 'Gold', 'Silver', 'Silver'], dtype='object', name='commodity')

In [44]:
#floats can also be an index
print(df.set_index('production_Moz').index)

Float64Index([107.6, 109.7, 868.3, 886.7], dtype='float64', name='production_Moz')


In [43]:
df

Unnamed: 0_level_0,year,production_Moz
commodity,Unnamed: 1_level_1,Unnamed: 2_level_1
Gold,2013,107.6
Gold,2014,109.7
Silver,2014,868.3
Silver,2015,886.7


In [49]:
df['year'] #this yields a pandas series

0    2013
1    2014
2    2014
3    2015
Name: year, dtype: int64

In [50]:
df[['year']] #this yields a pandas data frame

Unnamed: 0,year
0,2013
1,2014
2,2014
3,2015


- The dataframe can restore the original index using the method `reset_index`, as follows:

In [51]:
df = df.reset_index()
df

Unnamed: 0,index,commodity,year,production_Moz
0,0,Gold,2013,107.6
1,1,Gold,2014,109.7
2,2,Silver,2014,868.3
3,3,Silver,2015,886.7
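
By default `reset_index` moves the current index into a column, which is where a column named `index` can come from. The sketch below rebuilds the commodity frame (the names `indexed`, `restored` and `dropped` are illustrative) to show the round trip with `set_index`, and the `drop=True` option that discards the old index instead of keeping it:

```python
import pandas as pd

df = pd.DataFrame({'commodity': ['Gold', 'Gold', 'Silver', 'Silver'],
                   'year': [2013, 2014, 2014, 2015],
                   'production_Moz': [107.6, 109.7, 868.3, 886.7]})

indexed = df.set_index('commodity')        # 'commodity' becomes the index
restored = indexed.reset_index()           # 'commodity' returns as a column
dropped = indexed.reset_index(drop=True)   # old index is discarded entirely

print(restored.columns.tolist())
print(dropped.columns.tolist())
```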


- A data frame can also be created with a nested list. The two ways are equivalent.

In [52]:
df_2=pd.DataFrame([[107.6, 'Gold', 2013],
                   [109.7, 'Gold', 2014],
                   [868.3, 'Silver', 2014],
                   [886.7, 'Silver', 2015]], 
                    columns=['production_Moz','commodity','year'])
df_2

Unnamed: 0,production_Moz,commodity,year
0,107.6,Gold,2013
1,109.7,Gold,2014
2,868.3,Silver,2014
3,886.7,Silver,2015


- A data frame has an attribute **values**, which is of the multidimensional array type.

In [53]:
print(type(df.values))

<class 'numpy.ndarray'>


In [55]:
print(df.values)
print('-'*55)
print(df_2.values)

[[0 'Gold' 2013 107.6]
 [1 'Gold' 2014 109.7]
 [2 'Silver' 2014 868.3]
 [3 'Silver' 2015 886.7]]
-------------------------------------------------------
[[107.6 'Gold' 2013]
 [109.7 'Gold' 2014]
 [868.3 'Silver' 2014]
 [886.7 'Silver' 2015]]


- A data frame is to a series what a 2D array is to a 1D array. A data frame has column names for the additional dimension.

In [56]:
print(type(df.columns))

<class 'pandas.core.indexes.base.Index'>


In [57]:
print(df.columns)  # column name

Index(['index', 'commodity', 'year', 'production_Moz'], dtype='object')


In [58]:
df.columns.tolist()

['index', 'commodity', 'year', 'production_Moz']

- Each column in a DataFrame can be retrieved as a Series. 
- There are two ways to get the column: to retrieve by attribute and to retrieve by dictionary-like notation.

In [59]:
df.year         # retrieve by attribute

0    2013
1    2014
2    2014
3    2015
Name: year, dtype: int64

In [60]:
df['year']  # retrieve by dictionary-like notation

0    2013
1    2014
2    2014
3    2015
Name: year, dtype: int64

- The name of an individual column may be changed as follows:

In [63]:
df.columns=['index','commodity', 'com','production']
df

Unnamed: 0,index,commodity,com,production
0,0,Gold,2013,107.6
1,1,Gold,2014,109.7
2,2,Silver,2014,868.3
3,3,Silver,2015,886.7


In [64]:
df.columns = df.columns.str.replace('com','metal')
df

Unnamed: 0,index,metalmodity,metal,production
0,0,Gold,2013,107.6
1,1,Gold,2014,109.7
2,2,Silver,2014,868.3
3,3,Silver,2015,886.7
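
Note that `str.replace` acts on every column name containing the substring, which is why 'commodity' became 'metalmodity' above. To rename one column only, `rename` with a mapping is a possible alternative. A minimal sketch on an illustrative two-column frame:

```python
import pandas as pd

# Illustrative frame with the same column names as above
frame = pd.DataFrame({'commodity': ['Gold', 'Silver'], 'com': [2013, 2014]})

# Only the keys of the mapping are renamed; other columns are untouched
renamed = frame.rename(columns={'com': 'metal'})
print(renamed.columns.tolist())
```

`rename` returns a new data frame, so the result must be assigned to keep the change.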


- Indexing a pandas data frame is similar to indexing a numpy array. In pandas the first index retrieves a column and the second index retrieves the row.  
- To return the first element of the metal column, use the following:

In [66]:
df['metal'][0]

2013

- Slicing a pandas data frame is also similar to slicing a numpy array.  The following code returns the second and third elements of the production column.

In [67]:
df['production'][1:3]

1    109.7
2    868.3
Name: production, dtype: float64

- In order to slice multiple columns pass a list of column names.  The following represents the world production of gold and silver in 2014.

In [68]:
df[['metal','production']][1:3]

Unnamed: 0,metal,production
1,2014,109.7
2,2014,868.3
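
The same selection can be sketched with `loc`, which takes the rows and the columns in a single call. One assumption to keep in mind: a label slice in `loc` includes its end point, so `1:2` here covers the same rows as the positional slice `1:3`. The frame below is rebuilt for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'metal': [2013, 2014, 2014, 2015],
                      'production': [107.6, 109.7, 868.3, 886.7]})

by_position = frame[['metal', 'production']][1:3]   # positional slice, end excluded
by_label = frame.loc[1:2, ['metal', 'production']]  # label slice, end included

print(by_label)
```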


**Exercise 1**

Create a Pandas DataFrame, named 'NYC', whose columns are 'boro', 'pop' and 'area'. The frame represents the five boroughs of New York City, including the 2010 census population (in millions) and land area in square miles.  The rows represent the following:

- The Bronx is 42 square miles.  In the 2010 census, the Bronx had 1.39 million people.
- Manhattan, with 2010 population 1.59 million, has an area of 23 square miles.
- Brooklyn is 71 square miles.  The 2010 population was 2.47 million.
- In 2010, Staten Island had 0.44 million inhabitants.  It is 59 square miles.
- 2.23 million people lived across the 109 square miles of Queens, in 2010.

Create a new column representing the population density using:
```
NYC['density']=NYC['pop']/NYC['area']
```

Now set the index of NYC to be the borough names using `set_index` function of a data frame. Make sure to update the data frame.

In [117]:
NYC = {}

In [124]:
#### Your code here
NYC = {'boro': ['The Bronx','Manhattan','Brooklyn','Staten Island','Queens'],'pop': [1.39,1.59,2.47,0.44,2.23], 
       'area':[42,23,71,59,109]}
NYC = pd.DataFrame(NYC)
NYC['density']=NYC['pop']/NYC['area']
NYC =NYC.set_index('boro')

NYC

Unnamed: 0_level_0,pop,area,density
boro,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
The Bronx,1.39,42,0.033095
Manhattan,1.59,23,0.06913
Brooklyn,2.47,71,0.034789
Staten Island,0.44,59,0.007458
Queens,2.23,109,0.020459



<p><a name="IO"></a></p>
## I/O tools

- Pandas has a number of functions for reading tabular data as a data frame object.

In [79]:
# `!cat` is a shell command and may not be supported on this platform; if it is not, open the file directly to view its contents
!cat foo.csv

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


In [80]:
pd.read_csv('foo.csv')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


- In some cases, there is no header in the file. By setting `header = None`, the column names will be filled with incremental numbers.

In [81]:
# `!cat` is a shell command and may not be supported on this platform; if it is not, open the file directly to view its contents
!cat foo_noheader.csv

1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


In [82]:
my_df=pd.read_csv('foo_noheader.csv', header = None)
my_df

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


- The column names may then be set, as follows:

In [83]:
my_df.columns=['a', 'b', 'c', 'd', 'message']
my_df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


- Another way is to pass the column names (as a list of strings) to the `names` parameter in `read_csv`.  
- Note the parameter is called "names" instead of "columns".

In [84]:
# Set the names manually
pd.read_csv('foo_noheader.csv', 
             names=['a', 'b', 'c', 'd', 'message'])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
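
The two ways of reading a headerless file can be tried without any file on disk. The sketch below feeds the same three rows to `read_csv` through `io.StringIO`, which stands in for `foo_noheader.csv`; the variable names are illustrative.

```python
import io
import pandas as pd

# Same content as foo_noheader.csv, held in memory
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# header=None: columns are numbered 0, 1, 2, ...
no_header = pd.read_csv(io.StringIO(raw), header=None)

# names=...: columns are labeled while reading
named = pd.read_csv(io.StringIO(raw), names=['a', 'b', 'c', 'd', 'message'])

print(no_header.columns.tolist())
print(named.columns.tolist())
```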


- Importing files has been covered. This exercise demonstrates file exporting.

**Exercise 2** 

- Write the data frame, `NYC`, to a file, NYC.csv. The function `to_csv` is useful for this task.
- Now load 'NYC.csv' to a data frame named NYC2.

In [86]:
#### Your code here
NYC.to_csv('NYC.csv')
!cat 'NYC.csv'

boro,pop,area,density
The Bronx,1.39,42,0.033095238095238094
Manhattan,1.59,23,0.0691304347826087
Brooklyn,2.47,71,0.0347887323943662
Staten Island,0.44,59,0.007457627118644068
Queens,2.23,109,0.020458715596330276


In [87]:
NYC

Unnamed: 0_level_0,pop,area,density
boro,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
The Bronx,1.39,42,0.033095
Manhattan,1.59,23,0.06913
Brooklyn,2.47,71,0.034789
Staten Island,0.44,59,0.007458
Queens,2.23,109,0.020459


In [89]:
NYC2=pd.read_csv('NYC.csv', index_col=0) # use the first column as index
NYC2

Unnamed: 0_level_0,pop,area,density
boro,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
The Bronx,1.39,42,0.033095
Manhattan,1.59,23,0.06913
Brooklyn,2.47,71,0.034789
Staten Island,0.44,59,0.007458
Queens,2.23,109,0.020459


<p><a name="DM"></a></p>
# Data Manipulation in Pandas

- Like numpy, pandas defines many broadcast operations, as well as numerous methods of manipulating data.

<p><a name="concat"></a></p>
### concat
Pandas DataFrames can be expanded in both directions. First create two data frames.

In [100]:
df1 = pd.DataFrame(np.arange(9).reshape((3, 3)), 
                   columns=['a', 'b', 'c'],
                   index=['one', 'two', 'three'])
df2 = pd.DataFrame(np.arange(6).reshape((3, 2)), 
                   columns=['d','e'],
                   index=['three', 'two','one'])
df1

Unnamed: 0,a,b,c
one,0,1,2
two,3,4,5
three,6,7,8


In [101]:
df2

Unnamed: 0,d,e
three,0,1
two,2,3
one,4,5


- Since the two data frames have the same number of rows, it is natural to combine them "horizontally".  
- Note the concatenation takes place on the name of the index and not the order.

In [None]:
pd.concat([df1, df2], axis = 1)

- The argument "axis = 1" means expanding along the column indices. Setting "axis = 0" will combine two data frames with same number of columns vertically. 

- Now changing the name of row 'one' to 'One' gives it a different index.  In this case the concatenation will use all the rows, filling in missing values with NaN.

In [None]:
df1 = pd.DataFrame(np.arange(9).reshape((3, 3)), 
                   columns=['a', 'b', 'c'],
                   index=['One', 'two', 'three'])
df2 = pd.DataFrame(np.arange(6).reshape((3, 2)), 
                   columns=['d','e'],
                   index=['three', 'two','one'])
print(df1,'\n\n',df2)

In [102]:
pd.concat([df1, df2], axis = 1)


Unnamed: 0,a,b,c,d,e
One,0.0,1.0,2.0,,
one,,,,4.0,5.0
three,6.0,7.0,8.0,0.0,1.0
two,3.0,4.0,5.0,2.0,3.0


- To include only the shared rows, set the join parameter to 'inner', as follows:

In [103]:
pd.concat([df1, df2], axis = 1, join='inner')

Unnamed: 0,a,b,c,d,e
two,3,4,5,2,3
three,6,7,8,0,1
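
The vertical case, `axis = 0`, is mentioned above but not shown. A minimal sketch, assuming two small illustrative frames (`top` and `bottom`) that share the same columns:

```python
import pandas as pd

top = pd.DataFrame({'a': [0, 1], 'b': [2, 3]})
bottom = pd.DataFrame({'a': [4], 'b': [5]})

# axis=0 stacks the rows; ignore_index=True renumbers them 0, 1, 2, ...
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)
print(stacked)
```

Without `ignore_index=True` the original row labels are kept, so the label 0 would appear twice in the result.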


**Exercise 3**

In the iPython notebook, create the data frame below.  Observe that this is a data frame with new features.  

There are three new features:

- **high_point** is the location of highest elevation 
- **geography** indicates if the borough is an island, on an island, or mainland
- **inception** indicates the year of incorporation into the City of New York

Combine 'new_features' with the old NYC data frame to make a new data frame named 'NYC3'.

In [107]:
new_features = pd.DataFrame({'high_point': ['Battle Hill', 'Chapel Farm', 'North Glen Oaks', 'Bennett Park','Todt Hill'],\
                            'geography':['on island','on mainland','on island','is an island','is an island'],\
                           'inception':['1634','1898','1683','1624','1683']},\
                            index=['Brooklyn', 'Bronx', 'Queens', 'Manhattan',"Staten Island"])
new_features

Unnamed: 0,high_point,geography,inception
Brooklyn,Battle Hill,on island,1634
Bronx,Chapel Farm,on mainland,1898
Queens,North Glen Oaks,on island,1683
Manhattan,Bennett Park,is an island,1624
Staten Island,Todt Hill,is an island,1683


In [125]:
#### Your code here
NYC3 = pd.concat([NYC,new_features],axis = 1)
NYC3


Unnamed: 0,pop,area,density,high_point,geography,inception
Bronx,,,,Chapel Farm,on mainland,1898.0
Brooklyn,2.47,71.0,0.034789,Battle Hill,on island,1634.0
Manhattan,1.59,23.0,0.06913,Bennett Park,is an island,1624.0
Queens,2.23,109.0,0.020459,North Glen Oaks,on island,1683.0
Staten Island,0.44,59.0,0.007458,Todt Hill,is an island,1683.0
The Bronx,1.39,42.0,0.033095,,,


<p><a name="sort"></a></p>
### sort
- It is possible to order the rows of data frames using `sort_values()`.  This object method takes a column name as an argument.
- It is used on the new_features data frame to order by date of inception, as follows:


In [126]:
new_features.sort_values('inception')

Unnamed: 0,high_point,geography,inception
Manhattan,Bennett Park,is an island,1624
Brooklyn,Battle Hill,on island,1634
Queens,North Glen Oaks,on island,1683
Staten Island,Todt Hill,is an island,1683
Bronx,Chapel Farm,on mainland,1898


By default the sort is done in ascending order.  To apply the sort in descending order, set the `ascending` parameter to `False`.

In [127]:
new_features.sort_values('inception',ascending=False)

Unnamed: 0,high_point,geography,inception
Bronx,Chapel Farm,on mainland,1898
Queens,North Glen Oaks,on island,1683
Staten Island,Todt Hill,is an island,1683
Brooklyn,Battle Hill,on island,1634
Manhattan,Bennett Park,is an island,1624
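
`sort_values` also accepts a list of columns, which gives control over how ties are broken. The sketch below uses three of the rows from above in an illustrative frame named `peaks`, sorting by year of inception first and then by high point in reverse alphabetical order:

```python
import pandas as pd

peaks = pd.DataFrame({'high_point': ['Battle Hill', 'North Glen Oaks', 'Todt Hill'],
                      'inception': [1634, 1683, 1683]},
                     index=['Brooklyn', 'Queens', 'Staten Island'])

# One ascending flag per sort column
ordered = peaks.sort_values(['inception', 'high_point'], ascending=[True, False])
print(ordered)
```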


<p><a name="merge"></a></p>
### merge
Merging is the most common way to combine multiple data frames. Create two data frames first.

In [128]:
df3 = pd.DataFrame([['a','b','c'],['d','e','f'],['g','h','i']]\
                   ,columns=['col1','col2','col3'])
df4 = pd.DataFrame({'col2':['x','e','b','z'],'col4':[1,2,3,4],'col5':['i','f','e','h']})
df3

Unnamed: 0,col1,col2,col3
0,a,b,c
1,d,e,f
2,g,h,i


In [129]:
df4

Unnamed: 0,col2,col4,col5
0,x,1,i
1,e,2,f
2,b,3,e
3,z,4,h


- Merging will use the **`on`** column as a key for the merge.  The code below identifies the column ‘col2’ from both data frames. 
- The argument **`how`** set to 'inner' makes the merge only keep rows occurring in both data frames.

In [133]:
pd.merge(df3, df4, how='inner', on ='col2')

Unnamed: 0,col1,col2,col3,col4,col5
0,a,b,c,3,e
1,d,e,f,2,f


- The default value of the parameter `how` is 'inner'. The following code performs the same task as above.

In [134]:
pd.merge(df3, df4, on ='col2')

Unnamed: 0,col1,col2,col3,col4,col5
0,a,b,c,3,e
1,d,e,f,2,f


- To keep every row of the left data frame (df3), set the parameter `how` = 'left'.

In [135]:
pd.merge(df3, df4, how='left', on ='col2')

Unnamed: 0,col1,col2,col3,col4,col5
0,a,b,c,3.0,e
1,d,e,f,2.0,f
2,g,h,i,,


- To keep all rows from both df3 and df4, set the parameter `how` = 'outer'.

In [136]:
pd.merge(df3, df4, how='outer', on ='col2')

Unnamed: 0,col1,col2,col3,col4,col5
0,a,b,c,3.0,e
1,d,e,f,2.0,f
2,g,h,i,,
3,,x,,1.0,i
4,,z,,4.0,h


- If the key column does not have the same name in the two data frames, use `left_on` and `right_on` to indicate how to perform the merge.  
- Note that columns with the same name in the two data frames are distinguished by the suffixes `_x` (left) and `_y` (right).

In [137]:
pd.merge(df3, df4, left_on='col2', right_on='col5')

Unnamed: 0,col1,col2_x,col3,col2_y,col4,col5
0,d,e,f,b,3,e
1,g,h,i,z,4,h
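
Two optional parameters of `pd.merge` can make the results above easier to read. The sketch below rebuilds df3 and df4 and uses `indicator=True` to record where each row of an outer merge came from, and `suffixes` to replace the default `_x` and `_y`. The suffix strings chosen here are illustrative.

```python
import pandas as pd

df3 = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
                   columns=['col1', 'col2', 'col3'])
df4 = pd.DataFrame({'col2': ['x', 'e', 'b', 'z'],
                    'col4': [1, 2, 3, 4],
                    'col5': ['i', 'f', 'e', 'h']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
tracked = pd.merge(df3, df4, how='outer', on='col2', indicator=True)
origin = tracked.set_index('col2')['_merge'].astype(str).to_dict()
print(origin)

# suffixes replaces the default '_x' and '_y' on clashing column names
labeled = pd.merge(df3, df4, left_on='col2', right_on='col5',
                   suffixes=('_left', '_right'))
print(labeled.columns.tolist())
```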


**Exercise 4**

- Run the following code to create a data frame, 'Elevations'. It contains NYC locations and their elevation in feet.  How is this related to the NYC data frame? Why separate this information into another data frame?
- Combine this data with the full NYC3 data frame to make a new data frame named NYC4.
- Note that NYC3 is indexed by borough; keep that index after the merge.
- Change the name of the elevation column to 'peak_elevation'.
- Order the data frame by highest to lowest 'peak_elevation'.

In [138]:
Elevations = pd.DataFrame([['Battle Hill',220],['Marcus Garvey Park',103],['Bennett Park',265],\
                           ['Todt Hill',410],['Washington Square Park',27],['Chapel Farm',280],\
                           ['Bryant Park',58],['North Glen Oaks',258],['St Marys Park',47]],
                      columns=['location', 'elevation'])
Elevations

Unnamed: 0,location,elevation
0,Battle Hill,220
1,Marcus Garvey Park,103
2,Bennett Park,265
3,Todt Hill,410
4,Washington Square Park,27
5,Chapel Farm,280
6,Bryant Park,58
7,North Glen Oaks,258
8,St Marys Park,47


In [141]:
NYC3

Unnamed: 0,pop,area,density,high_point,geography,inception
Bronx,,,,Chapel Farm,on mainland,1898.0
Brooklyn,2.47,71.0,0.034789,Battle Hill,on island,1634.0
Manhattan,1.59,23.0,0.06913,Bennett Park,is an island,1624.0
Queens,2.23,109.0,0.020459,North Glen Oaks,on island,1683.0
Staten Island,0.44,59.0,0.007458,Todt Hill,is an island,1683.0
The Bronx,1.39,42.0,0.033095,,,


In [151]:
# Your code here
NYC4 = pd.merge(NYC3.reset_index(), Elevations, how = 'outer', left_on = 'high_point', right_on = 'location').set_index('index')
NYC4.columns=NYC4.columns.str.replace('elevation','peak_elevation')
NYC4.sort_values('peak_elevation', ascending = False)

Unnamed: 0_level_0,pop,area,density,high_point,geography,inception,location,peak_elevation
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Staten Island,0.44,59.0,0.007458,Todt Hill,is an island,1683.0,Todt Hill,410.0
Bronx,,,,Chapel Farm,on mainland,1898.0,Chapel Farm,280.0
Manhattan,1.59,23.0,0.06913,Bennett Park,is an island,1624.0,Bennett Park,265.0
Queens,2.23,109.0,0.020459,North Glen Oaks,on island,1683.0,North Glen Oaks,258.0
Brooklyn,2.47,71.0,0.034789,Battle Hill,on island,1634.0,Battle Hill,220.0
,,,,,,,Marcus Garvey Park,103.0
,,,,,,,Bryant Park,58.0
,,,,,,,St Marys Park,47.0
,,,,,,,Washington Square Park,27.0
The Bronx,1.39,42.0,0.033095,,,,,


<p><a name="SF"></a></p>
### selection and filter

- The `loc` method provides purely label (index/columns)-based indexing. 
- This method allows selection from a data frame by index and columns. 

In [152]:
df1

Unnamed: 0,a,b,c
One,0,1,2
two,3,4,5
three,6,7,8


The following returns a single column of df1.

In [153]:
df1['a'] #gives series

One      0
two      3
three    6
Name: a, dtype: int64

In [155]:
df1['b'] #also gives a series

One      1
two      4
three    7
Name: b, dtype: int64

In [156]:
type(df1['a'])

pandas.core.series.Series

In [157]:
df1[['a']] #gives a single column data frame

Unnamed: 0,a
One,0
two,3
three,6


In [158]:
df1[['a','c']] #gives a multi column data frame

Unnamed: 0,a,c
One,0,2
two,3,5
three,6,8


- The following uses `loc` to return a single row of df1, using the index string name.

In [159]:
df1.loc['two'] # the row that has index two

a    3
b    4
c    5
Name: two, dtype: int64

In [161]:
df1.loc[['two']] # the row that has index two, as a one-row data frame

Unnamed: 0,a,b,c
two,3,4,5


- A second parameter is passed to loc to specify the chosen column. For example:

In [162]:
df1.loc['two', 'b'] # the row that has index two and column b

4

- Note the two ways to accomplish this:

In [163]:
print(df1.loc['two', 'b'])
print(df1['b'].iloc[1])

4
4


- Fancy indexing can be done with `loc` in pandas, as was done in Numpy. Select a row with a condition, as follows. 
- The code below returns all columns for the rows in which column 'a' is zero.

In [164]:
df1.loc[df1.a==0,:]

Unnamed: 0,a,b,c
One,0,1,2


- Columns are selected in a similar way.  The code below returns all rows for the columns in which row 'One' is zero.

In [165]:
df1.loc[:, df1.loc['One']==0]

Unnamed: 0,a
One,0
two,3
three,6
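
Conditions can also be combined. The following sketch rebuilds df1 and selects the rows where column 'a' is positive and column 'c' is below 8, keeping only those two columns. Each condition needs its own parentheses, and `&` is used rather than the Python keyword `and`.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=['a', 'b', 'c'],
                   index=['One', 'two', 'three'])

# Row mask built from two conditions, then a list of column labels
result = df1.loc[(df1.a > 0) & (df1.c < 8), ['a', 'c']]
print(result)
```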


- Note: `loc` only accepts labels as input. Because the index and column labels of df1 are strings, passing integers gives an error. For example:

In [166]:
df1.loc[1, 2]

TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1] of <class 'int'>

- To select data by position number, use iloc. The iloc method provides a purely position based indexing.

In [167]:
# select as a matrix 
# row 2, col 3
df1.iloc[1, 2]

5

In [168]:
# first row, first two columns
# return a Series
row1 = df1.iloc[0,:2]
row1

a    0
b    1
Name: One, dtype: int64

- You can also use a list to slice the dataframe.

In [169]:
df1.iloc[[0,2], :2]

Unnamed: 0,a,b
One,0,1
three,6,7


In [170]:
NYC4 #look at NYC4 again

Unnamed: 0_level_0,pop,area,density,high_point,geography,inception,location,peak_elevation
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Bronx,,,,Chapel Farm,on mainland,1898.0,Chapel Farm,280.0
Brooklyn,2.47,71.0,0.034789,Battle Hill,on island,1634.0,Battle Hill,220.0
Manhattan,1.59,23.0,0.06913,Bennett Park,is an island,1624.0,Bennett Park,265.0
Queens,2.23,109.0,0.020459,North Glen Oaks,on island,1683.0,North Glen Oaks,258.0
Staten Island,0.44,59.0,0.007458,Todt Hill,is an island,1683.0,Todt Hill,410.0
The Bronx,1.39,42.0,0.033095,,,,,
,,,,,,,Marcus Garvey Park,103.0
,,,,,,,Washington Square Park,27.0
,,,,,,,Bryant Park,58.0
,,,,,,,St Marys Park,47.0


In [171]:
#get the i=2 and i=4 column of NYC4
NYC4.iloc[:,[2,4]]

Unnamed: 0_level_0,density,geography
index,Unnamed: 1_level_1,Unnamed: 2_level_1
Bronx,,on mainland
Brooklyn,0.034789,on island
Manhattan,0.06913,is an island
Queens,0.020459,on island
Staten Island,0.007458,is an island
The Bronx,0.033095,
,,
,,
,,
,,


- DataFrame’s apply method applies a function on 1D arrays to each column or row.

In [175]:
df1.apply(lambda x: max(x), axis=0) # 0 stands for apply to each column

a    6
b    7
c    8
dtype: int64

In [174]:
df1

Unnamed: 0,a,b,c
One,0,1,2
two,3,4,5
three,6,7,8


In [173]:
df1.apply(lambda x: min(x), axis=1) # 1 stands for apply to each row

One      0
two      3
three    6
dtype: int64

- To apply a function to a single column, extract that series first and then call its `map()` method, which works like the built-in `map` function in Python. 

In [176]:
df1.a.map(lambda x: x+1)

One      1
two      4
three    7
Name: a, dtype: int64
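
For common reductions and arithmetic, pandas already provides vectorized methods that give the same results as the `apply` and `map` calls above. A short sketch on a rebuilt df1 (the result names are illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=['a', 'b', 'c'],
                   index=['One', 'two', 'three'])

col_max = df1.max(axis=0)   # same values as df1.apply(lambda x: max(x), axis=0)
row_min = df1.min(axis=1)   # same values as df1.apply(lambda x: min(x), axis=1)
a_plus_one = df1['a'] + 1   # same values as df1.a.map(lambda x: x+1)

print(col_max.tolist(), row_min.tolist(), a_plus_one.tolist())
```

`apply` and `map` remain useful when no built-in method covers the function to be applied.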

### Removing data

In [177]:
#this removes row 'two'
df1.loc[df1.index != 'two']

Unnamed: 0,a,b,c
One,0,1,2
three,6,7,8


In [178]:
df1

Unnamed: 0,a,b,c
One,0,1,2
two,3,4,5
three,6,7,8


In [179]:
#this removes column 'a'
df1.drop('a', axis=1)

Unnamed: 0,b,c
One,1,2
two,4,5
three,7,8


In [180]:
df1

Unnamed: 0,a,b,c
One,0,1,2
two,3,4,5
three,6,7,8


- Rows and columns may also be removed using fancy indexing or `drop()`.

In [181]:
#this removes column b
df1.loc[:,df1.columns != 'b']

Unnamed: 0,a,c
One,0,2
two,3,5
three,6,8


In [182]:
# remember the following expression is a boolean and acts as a mask
df1.columns != 'b'

array([ True, False,  True])

In [183]:
df1

Unnamed: 0,a,b,c
One,0,1,2
two,3,4,5
three,6,7,8


In [184]:
#this removes column 'b'
df1.drop('b', axis=1)

Unnamed: 0,a,c
One,0,2
two,3,5
three,6,8


In [185]:
df1

Unnamed: 0,a,b,c
One,0,1,2
two,3,4,5
three,6,7,8
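
`drop` also accepts the keyword arguments `index` and `columns`, which state directly whether rows or columns are being removed. A minimal sketch on a rebuilt df1; as the repeated displays of df1 above suggest, `drop` returns a new data frame and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=['a', 'b', 'c'],
                   index=['One', 'two', 'three'])

no_row = df1.drop(index='two')            # remove the row labeled 'two'
no_cols = df1.drop(columns=['a', 'b'])    # remove two columns at once

print(no_row.index.tolist(), no_cols.columns.tolist())
print(df1.shape)  # the original is untouched
```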


<p><a name="miss"></a></p>
# Handling Missing Data

- Missing or corrupt data is an unavoidable reality in processing large data sets.  There are various ways of dealing with it, depending upon the circumstances:
 - Discard it, and all related data.
 - Interpolate values from surrounding data.
 - Isolate it and analyze it separately.

- Which approach to use is a scientific question.  Whatever approach is chosen, pandas has computational methods to carry it out.
- Read a csv file that contains NaNs. **Note:** index_col is set to 0.  This means the first column is used as the index.

In [186]:
df_miss = pd.read_csv('missing.csv',index_col=0)
df_miss

Unnamed: 0,one,two,three,four
a,-1.250699,-0.573801,0.705961,-1.015682
b,,-0.217766,0.655179,1.379276
c,-0.860359,-1.313747,0.676174,1.034417
d,,,,
e,0.079169,0.029138,0.239183,-0.492039
f,-1.14906,,,-0.160499


- To figure out where the missing data is, use the `isnull()` method.

In [187]:
df_miss.isnull()

Unnamed: 0,one,two,three,four
a,False,False,False,False
b,True,False,False,False
c,False,False,False,False
d,True,True,True,True
e,False,False,False,False
f,False,True,True,False


- How many missing values are there in the dataframe?

In [188]:
sum(df_miss.isnull()) # Built-in sum doesn't work

TypeError: unsupported operand type(s) for +: 'int' and 'str'

- Summing up the boolean array reports how many missing values are in each column.

In [189]:
np.sum(df_miss.isnull())

one      2
two      2
three    2
four     1
dtype: int64

- The same is possible for rows by setting the axis parameter to 1.

In [190]:
np.sum(df_miss.isnull(), axis=1)

a    0
b    1
c    0
d    4
e    0
f    2
dtype: int64
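
The same counts are available from the data frame's own `sum` method, and summing twice gives the total for the whole frame. The sketch below uses a small made-up frame (`frame` and its values are illustrative, not the contents of `missing.csv`):

```python
import numpy as np
import pandas as pd

# Made-up data with three missing entries
frame = pd.DataFrame({'one': [1.0, np.nan, np.nan],
                      'two': [np.nan, 2.0, 3.0]},
                     index=['a', 'b', 'c'])

per_column = frame.isnull().sum()          # missing values in each column
per_row = frame.isnull().sum(axis=1)       # missing values in each row
total = int(frame.isnull().sum().sum())    # missing values in the whole frame

print(per_column.to_dict(), per_row.to_dict(), total)
```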

- To isolate the rows in which there are null values, aggregate the `df.isnull()` boolean data frame along rows, using `any` with `axis=1`.

In [193]:
mask=df_miss.isnull().any(axis=1)
mask

a    False
b     True
c    False
d     True
e    False
f     True
dtype: bool

- Passing the boolean Series to the first position of the `loc` method of the DataFrame selects the rows that have value equal to True:

In [195]:
df_miss.loc[mask,:]

Unnamed: 0,one,two,three,four
b,,-0.217766,0.655179,1.379276
d,,,,
f,-1.14906,,,-0.160499


- Similarly, to locate the rows that contain only missing values, use `all()`.

In [196]:
mask=df_miss.isnull().all(axis=1)
mask

a    False
b    False
c    False
d     True
e    False
f    False
dtype: bool

**Exercise 5**
- Handling NaNs has been shown. However, not all missing values are NaNs. Consider the example below:

```
Employee = pd.read_csv('Employee_continue.csv')
```

Print the data frame, looking for missing values by inspection. How many missing values are there? Some of the missing values might not be NaNs.

In [197]:
Employee = pd.read_csv('Employee_continue.csv')
Employee

Unnamed: 0,Department,Education,Sex,Title,Year,Name,Salary
0,IT,Bachelor,M,analyst,1.0,Bob,90.0
1,IT,Master,M,analyst,2.0,Jake,90.0
2,HR,Master,M,analyst,2.0,John,90.0
3,HR,Bachelor,F,analyst,2.0,Judy,90.0
4,Trade,PHD,M,associate,3.0,Sam,120.0
5,?,PHD,F,associate,5.0,Amy,120.0
6,Trade,Master,F,associate,,Jennifer,120.0
7,HR,Master,M,VP,8.0,Peter,262.5
8,IT,?,F,VP,9.0,Mary,262.5


- See '?' in the data frame. For a small data frame like this the '?' may be replaced by `np.nan` manually. In dealing with a large data frame, it is more efficient to use the function `replace`. Use `replace` to swap '?' with `np.nan`.

In [198]:
#### Your code here
Employee = Employee.replace('?',np.nan)
Employee

Unnamed: 0,Department,Education,Sex,Title,Year,Name,Salary
0,IT,Bachelor,M,analyst,1.0,Bob,90.0
1,IT,Master,M,analyst,2.0,Jake,90.0
2,HR,Master,M,analyst,2.0,John,90.0
3,HR,Bachelor,F,analyst,2.0,Judy,90.0
4,Trade,PHD,M,associate,3.0,Sam,120.0
5,,PHD,F,associate,5.0,Amy,120.0
6,Trade,Master,F,associate,,Jennifer,120.0
7,HR,Master,M,VP,8.0,Peter,262.5
8,IT,,F,VP,9.0,Mary,262.5
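
When the placeholder is known before loading, `read_csv` can do the conversion while reading, through its `na_values` parameter. The sketch below is an assumption-laden illustration: it reads a three-row excerpt held in memory rather than the real `Employee_continue.csv`.

```python
import io
import pandas as pd

# Abbreviated, in-memory stand-in for the employee file
raw = "Department,Education,Year\nIT,Bachelor,1\n?,PHD,5\nIT,?,9\n"

# Every '?' is read as NaN
employees = pd.read_csv(io.StringIO(raw), na_values='?')
missing = employees.isnull().sum().to_dict()
print(missing)
```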


- How many missing values are there in each row? How many in each column?

In [213]:
#### Your code here
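# any() flags which rows and columns contain at least one missing value;
# replacing any() with sum() would give the counts instead
rows = Employee.isnull().any(axis=1)
cols = Employee.isnull().any(axis=0)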
cols

Department     True
Education      True
Sex           False
Title         False
Year           True
Name          False
Salary        False
dtype: bool

- Print the rows with missing values.

In [214]:
#### Your code here
Employee.loc[rows,:]

Unnamed: 0,Department,Education,Sex,Title,Year,Name,Salary
5,,PHD,F,associate,5.0,Amy,120.0
6,Trade,Master,F,associate,,Jennifer,120.0
8,IT,,F,VP,9.0,Mary,262.5


- Print the columns with missing values.

In [216]:
#### Your code here
Employee.loc[:,cols]

Unnamed: 0,Department,Education,Year
0,IT,Bachelor,1.0
1,IT,Master,2.0
2,HR,Master,2.0
3,HR,Bachelor,2.0
4,Trade,PHD,3.0
5,,PHD,5.0
6,Trade,Master,
7,HR,Master,8.0
8,IT,,9.0


Once all the missing values are represented by `NaN`s, Pandas provides various methods for handling them:

<p><a name="dropna"></a></p>
## dropna
- One option is to discard the rows with missing values. Below the arguments `axis=0` and `how='any'` indicate dropping *rows* with a NaN in *any* position.

In [217]:
df_miss

Unnamed: 0,one,two,three,four
a,-1.250699,-0.573801,0.705961,-1.015682
b,,-0.217766,0.655179,1.379276
c,-0.860359,-1.313747,0.676174,1.034417
d,,,,
e,0.079169,0.029138,0.239183,-0.492039
f,-1.14906,,,-0.160499


In [218]:
df_miss.dropna(axis=0, how='any')

Unnamed: 0,one,two,three,four
a,-1.250699,-0.573801,0.705961,-1.015682
c,-0.860359,-1.313747,0.676174,1.034417
e,0.079169,0.029138,0.239183,-0.492039


- Another option is to drop rows full of NaNs. This can be done with `how='all'`.

In [219]:
df_miss.dropna(axis=0, how='all')

Unnamed: 0,one,two,three,four
a,-1.250699,-0.573801,0.705961,-1.015682
b,,-0.217766,0.655179,1.379276
c,-0.860359,-1.313747,0.676174,1.034417
e,0.079169,0.029138,0.239183,-0.492039
f,-1.14906,,,-0.160499


- Calls to `dropna()` can be chained. After the all-NaN rows are dropped, the argument `axis=1` drops every column that still contains a NaN.

In [220]:
df_miss.dropna(axis=0, how='all').dropna(axis=1, how='any')

Unnamed: 0,four
a,-1.015682
b,1.379276
c,1.034417
e,-0.492039
f,-0.160499
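
Because `df_miss` holds random values, the three `dropna` calls are easier to verify on fixed data. The following sketch uses a hypothetical `toy` frame with a known NaN pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 'c' is entirely NaN, rows 'b' and 'd' partly.
toy = pd.DataFrame({'one': [1.0, np.nan, np.nan, 4.0],
                    'two': [1.0, 2.0, np.nan, np.nan],
                    'three': [1.0, 2.0, np.nan, 4.0]},
                   index=list('abcd'))

any_rows = toy.dropna(axis=0, how='any')       # keeps only complete rows
all_rows = toy.dropna(axis=0, how='all')       # drops only rows that are all NaN
chained = all_rows.dropna(axis=1, how='any')   # then drops columns with any NaN

print(any_rows.index.tolist())
print(all_rows.index.tolist())
print(chained.columns.tolist())
```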


<p><a name="fillna"></a></p>
## fillna

- An alternative to discarding information is to **impute** the data. 
- This can be done with the `fillna()` method, passing the value to impute as the argument.

In [221]:
df_miss.fillna(0)

Unnamed: 0,one,two,three,four
a,-1.250699,-0.573801,0.705961,-1.015682
b,0.0,-0.217766,0.655179,1.379276
c,-0.860359,-1.313747,0.676174,1.034417
d,0.0,0.0,0.0,0.0
e,0.079169,0.029138,0.239183,-0.492039
f,-1.14906,0.0,0.0,-0.160499


- Another common way to impute is by the mean of the column.

In [222]:
df_miss['one'].fillna(df_miss['one'].mean())

a   -1.250699
b   -0.795237
c   -0.860359
d   -0.795237
e    0.079169
f   -1.149060
Name: one, dtype: float64
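
Mean imputation can be applied to every column at once by passing the Series of column means to `fillna()`, which matches the fill values to columns by label. A minimal sketch on a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one NaN per column.
toy = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                    'two': [np.nan, 4.0, 8.0]})

# toy.mean() skips NaNs and returns one mean per column;
# fillna() then fills each column with its own mean.
filled = toy.fillna(toy.mean())
print(filled)
```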

<p><a name="interpolate"></a></p>
## Interpolate

- Interpolation is the insertion of new data between preexisting fixed points. Linear interpolation uses a linear function to create the new data points.
- In pandas this is accomplished using `interpolate()` with the `method` parameter set to linear, `method='linear'`.  
- This will fill in missing data points with a linear interpolation between the data points bordering the missing values.

In [223]:
df_miss

Unnamed: 0,one,two,three,four
a,-1.250699,-0.573801,0.705961,-1.015682
b,,-0.217766,0.655179,1.379276
c,-0.860359,-1.313747,0.676174,1.034417
d,,,,
e,0.079169,0.029138,0.239183,-0.492039
f,-1.14906,,,-0.160499


In [224]:
df_miss.interpolate(method='linear')

Unnamed: 0,one,two,three,four
a,-1.250699,-0.573801,0.705961,-1.015682
b,-1.055529,-0.217766,0.655179,1.379276
c,-0.860359,-1.313747,0.676174,1.034417
d,-0.390595,-0.642305,0.457679,0.271189
e,0.079169,0.029138,0.239183,-0.492039
f,-1.14906,0.029138,0.239183,-0.160499


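- Since `df_miss.loc['b','one']` is a NaN between `df_miss.loc['a','one']` and `df_miss.loc['c','one']`, the value inserted is the mean of those two values.
- Note how this technique treats NaN values at the bottom of a column: they are filled with the last valid value above them, as in row `f` of columns `two` and `three`.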

In [225]:
(df_miss.loc['a', 'one'] + df_miss.loc['c', 'one'])/2

-1.055528936449
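
The edge behaviour can be seen on a short hypothetical Series `s` with NaNs at the start, middle and end. With the default settings, a leading NaN is left untouched while a trailing NaN repeats the last valid value.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with NaNs at the start, in the middle and at the end.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan], index=list('abcde'))

filled = s.interpolate(method='linear')
print(filled)
# 'c' lies between 1.0 and 3.0 and becomes 2.0;
# 'a' has no earlier value and stays NaN;
# 'e' has no later value and repeats the last valid value, 3.0.
```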

<p><a name="grouping"></a></p>

# Grouping and Aggregation

Grouping and aggregation are critical components of data analysis which involve:

- **Splitting** data into groups based on some features.
- **Applying** a function to each group independently.
- **Combining** the results into a data structure.
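
The three steps can be written out by hand before relying on pandas to do them. This is a sketch on hypothetical toy data; the names `pieces`, `means` and `manual` are illustrative, and the last line shows the equivalent `groupby` call covered in the rest of this section.

```python
import pandas as pd

# Hypothetical toy data, not taken from countries.csv.
toy = pd.DataFrame({'continent': ['Asia', 'Africa', 'Asia', 'Africa'],
                    'birthrate': [10.0, 30.0, 20.0, 40.0]})

# Split: one sub-frame per distinct continent.
pieces = {key: toy[toy['continent'] == key]
          for key in sorted(toy['continent'].unique())}

# Apply: compute the mean birthrate of each piece independently.
means = {key: part['birthrate'].mean() for key, part in pieces.items()}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(means, name='birthrate')

# The same three steps in one call.
builtin = toy.groupby('continent')['birthrate'].mean()
print(manual)
```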

### Grouping

Grouping splits a data frame into categorical groups, according to a given variable, or set of variables.   Consider the following data frame.

In [226]:
Country = pd.read_csv('countries.csv', delimiter=';')
Country.head()

Unnamed: 0,Country,Country (de),Country (local),Country code,Continent,Capital,Population,Area,Coastline,Government form,Currency,Currency code,Dialing prefix,Birthrate,Deathrate,Life expectancy,Url
0,Afghanistan,Afghanistan,Afganistan/Afqanestan,AF,Asia,Kabul,33332025,652230,0,Presidential islamic republic,Afghani,AFN,93,38.3,13.7,51.3,https://www.laenderdaten.info/Asien/Afghanista...
1,Egypt,Ägypten,Misr,EG,Africa,Cairo,94666993,1001450,2450,Presidential republic,Pfund,EGP,20,30.3,4.7,72.7,https://www.laenderdaten.info/Afrika/Aegypten/...
2,Åland Islands,Ålandinseln,Åland,AX,Europe,Mariehamn,29013,1580,0,Autonomous region of Finland,Euro,EUR,358,0.0,0.0,0.0,https://www.laenderdaten.info/Europa/Aland/ind...
3,Albania,Albanien,Shqipëria,AL,Europe,Tirana,3038594,28748,362,parliamentary republic,Lek,ALL,355,13.1,6.7,78.3,https://www.laenderdaten.info/Europa/Albanien/...
4,Algeria,Algerien,Al-Jaza’ir/Algérie,DZ,Africa,Algiers,40263711,2381741,998,Presidential republic,Dinar,DZD,213,23.0,4.3,76.8,https://www.laenderdaten.info/Afrika/Algerien/...


In [227]:
#choose the desired columns
Country=Country[['Country','Continent','Population','Area','Coastline',
                'Currency','Birthrate','Deathrate','Life expectancy']]
#change the column names to lowercase
Country.columns = Country.columns.str.lower()
#drop rows with missing values (Antarctica)
Country=Country.dropna(axis=0, how='any')
#limit to countries with populations over 20 million
Country=Country[Country['population']>20e6]
#add a boolean valued column that is true for countries with coastlines
Country['coastal']=Country['coastline']!=0
Country

Unnamed: 0,country,continent,population,area,coastline,currency,birthrate,deathrate,life expectancy,coastal
0,Afghanistan,Asia,33332025,652230,0,Afghani,38.3,13.7,51.3,False
1,Egypt,Africa,94666993,1001450,2450,Pfund,30.3,4.7,72.7,True
4,Algeria,Africa,40263711,2381741,998,Dinar,23.0,4.3,76.8,True
7,Angola,Africa,20172332,1246700,1600,Kwanza,38.6,11.3,56.0,True
12,Argentina,South America,43886748,2780400,4989,Peso,17.0,7.5,77.1,True
16,Ethiopia,Africa,102374044,1104300,0,Birr,36.9,7.9,62.2,False
17,Australia,Australia,22992654,7741220,25760,Dollar,12.1,7.2,82.2,True
20,Bangladesh,Asia,156186882,143998,580,Taka,19.0,5.3,73.2,True
32,Brazil,South America,205823665,8514877,7491,Real,14.3,6.6,73.8,True
39,China,Asia,1373541278,9596960,14500,Yuan,12.4,7.7,75.5,True


- This will calculate the mean population.

In [228]:
Country['population'].mean()

112995987.12068966

- This will calculate the sum of all populations in Asia.

In [229]:
Country[Country['continent']=='Asia']['population'].sum()

4207906053

- This will return the number of land-locked countries in the data frame.

In [230]:
Country[Country['coastal']==False].country.count()

5

- To group the countries by continent, do the following:

In [231]:
group = Country.groupby('continent')

- `group` is assigned the value returned by the `groupby` function, whose type is:

In [232]:
print(type(group))


<class 'pandas.core.groupby.generic.DataFrameGroupBy'>


In [234]:
print(group)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11e7a0fd0>


- The `DataFrameGroupBy` object is an `iterable`. Iterate over the object and print the contents, as follows:

In [233]:
for item in group:
    print(item)

('Africa',                               country continent  population     area  \
1                               Egypt    Africa    94666993  1001450   
4                             Algeria    Africa    40263711  2381741   
7                              Angola    Africa    20172332  1246700   
16                           Ethiopia    Africa   102374044  1104300   
44   Democratic Republic of the Congo    Africa    81331050  2344858   
51                        Ivory Coast    Africa    23740424   322463   
65                              Ghana    Africa    26908262   238533   
99                           Cameroon    Africa    24360803   475440   
104                             Kenya    Africa    46790758   580367   
125                        Madagascar    Africa    24430325   587041   
131                           Morocco    Africa    33655786   446550   
145                        Mozambique    Africa    25930150   799380   
156                           Nigeria    Africa   186

- Each `item` printed is a two-element `tuple`. The first element is the group key (here, the continent).
- The second element is a data frame containing the rows of that group. There is an alternative way of iterating through the groupby object, as follows:

In [235]:
for key, values in group:
    print(key) #this indicates the grouping
    print('-'*70)
    print(values) #this is a dataframe for that group
    print('\n')

Africa
----------------------------------------------------------------------
                              country continent  population     area  \
1                               Egypt    Africa    94666993  1001450   
4                             Algeria    Africa    40263711  2381741   
7                              Angola    Africa    20172332  1246700   
16                           Ethiopia    Africa   102374044  1104300   
44   Democratic Republic of the Congo    Africa    81331050  2344858   
51                        Ivory Coast    Africa    23740424   322463   
65                              Ghana    Africa    26908262   238533   
99                           Cameroon    Africa    24360803   475440   
104                             Kenya    Africa    46790758   580367   
125                        Madagascar    Africa    24430325   587041   
131                           Morocco    Africa    33655786   446550   
145                        Mozambique    Africa    2593015

- This is a great way to print and inspect a `DataFrameGroupBy` object. Above was an example of **splitting**.
- The `size()` method returns the number of rows in each group.

In [236]:
group.size()

continent
Africa           17
Asia             23
Australia         1
Europe            9
North America     3
South America     5
dtype: int64

- The following will return a data frame with the mean of every numeric column in every group.

In [237]:
group.mean(numeric_only=True)  # recent pandas versions raise an error on string columns otherwise

Unnamed: 0_level_0,population,area,coastline,birthrate,deathrate,life expectancy,coastal
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Africa,53677070.0,983617.8,1302.235294,31.588235,8.394118,62.770588,0.882353
Asia,182952400.0,1103215.0,7791.652174,18.256522,6.691304,72.56087,0.869565
Australia,22992650.0,7741220.0,25760.0,12.1,7.2,82.2,1.0
Europe,63249840.0,2256001.0,7989.888889,10.155556,11.1,78.044444,1.0
North America,160841700.0,7258573.0,77111.333333,13.766667,7.333333,79.2,1.0
South America,71716930.0,2926291.0,4180.4,16.96,6.14,75.22,1.0


- To get the mean of one specific column, select that column before aggregating:

In [238]:
group.birthrate.mean()
# group[['birthrate']].mean() # This returns a data frame

continent
Africa           31.588235
Asia             18.256522
Australia        12.100000
Europe           10.155556
North America    13.766667
South America    16.960000
Name: birthrate, dtype: float64
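
The difference between the two selection styles is the type of the result. A short sketch on hypothetical toy data: single brackets give a Series, double brackets give a one-column data frame.

```python
import pandas as pd

# Hypothetical toy data, not taken from countries.csv.
toy = pd.DataFrame({'continent': ['Asia', 'Africa', 'Asia', 'Africa'],
                    'birthrate': [10.0, 30.0, 20.0, 40.0],
                    'deathrate': [5.0, 9.0, 7.0, 11.0]})
group = toy.groupby('continent')

as_series = group['birthrate'].mean()    # single brackets -> Series
as_frame = group[['birthrate']].mean()   # double brackets -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```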

### Aggregation

- The data frame can be grouped by multiple keys:

In [239]:
group2 = Country.groupby(['continent', 'currency'])

In [240]:
for key, values in group2:
    print(key)
    print('-'*70)
    print(values)
    print('\n')

('Africa', 'Ariary')
----------------------------------------------------------------------
        country continent  population    area  coastline currency  birthrate  \
125  Madagascar    Africa    24430325  587041       4828   Ariary       32.1   

     deathrate  life expectancy  coastal  
125        6.7             65.9     True  


('Africa', 'Birr')
----------------------------------------------------------------------
     country continent  population     area  coastline currency  birthrate  \
16  Ethiopia    Africa   102374044  1104300          0     Birr       36.9   

    deathrate  life expectancy  coastal  
16        7.9             62.2    False  


('Africa', 'Dinar')
----------------------------------------------------------------------
   country continent  population     area  coastline currency  birthrate  \
4  Algeria    Africa    40263711  2381741        998    Dinar       23.0   

   deathrate  life expectancy  coastal  
4        4.3             76.8     True  


240       12.5        8.2             79.8     True  


('North America', 'Peso')
----------------------------------------------------------------------
    country      continent  population     area  coastline currency  \
138  Mexico  North America   123166749  1964375       9330     Peso   

     birthrate  deathrate  life expectancy  coastal  
138       18.5        5.3             75.9     True  


('South America', 'Bolívar Fuerte')
----------------------------------------------------------------------
       country      continent  population    area  coastline        currency  \
238  Venezuela  South America    30912302  912050       2800  Bolívar Fuerte   

     birthrate  deathrate  life expectancy  coastal  
238       19.2        5.2             75.8     True  


('South America', 'Nuevo Sol')
----------------------------------------------------------------------
    country      continent  population     area  coastline   currency  \
171    Peru  South America    30741062  1

In [245]:
group2.mean(numeric_only=True)  # recent pandas versions raise an error on string columns otherwise

Unnamed: 0_level_0,Unnamed: 1_level_0,population,area,coastline,birthrate,deathrate,life expectancy,coastal
continent,currency,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Africa,Ariary,24430320.0,587041.0,4828.0,32.1,6.7,65.9,1.0
Africa,Birr,102374000.0,1104300.0,0.0,36.9,7.9,62.2,0.0
Africa,Dinar,40263710.0,2381741.0,998.0,23.0,4.3,76.8,1.0
Africa,Dirham,33655790.0,446550.0,1835.0,18.0,4.8,76.9,1.0
Africa,Franc,43144090.0,1047587.0,318.0,32.733333,9.733333,58.166667,1.0
Africa,Ghana Cedi,26908260.0,238533.0,539.0,30.8,7.1,66.6,1.0
Africa,Kwanza,20172330.0,1246700.0,1600.0,38.6,11.3,56.0,1.0
Africa,Metical,25930150.0,799380.0,2470.0,38.3,11.9,53.3,1.0
Africa,Naira,186053400.0,923768.0,853.0,37.3,12.7,53.4,1.0
Africa,Pfund,65698250.0,1431467.0,1651.5,29.4,6.1,68.4,1.0


- Apply multiple functions to each group with the method `agg()`:

In [246]:
group.agg(['count', 'sum', 'min', 'max', 'mean', 'std']).T

Unnamed: 0,continent,Africa,Asia,Australia,Europe,North America,South America
population,count,17,23,1,9,3,5
population,sum,912510196,4207906053,22992654,569248535,482525182,358584633
population,min,20172332,22235000,22992654,21599736,35362905,30741062
population,max,186053386,1373541278,22992654,142355415,323995528,205823665
population,mean,5.36771e+07,1.82952e+08,2.29927e+07,6.32498e+07,1.60842e+08,7.17169e+07
population,std,4.23622e+07,3.64416e+08,,3.44658e+07,1.47959e+08,7.5338e+07
area,count,17,23,1,9,3,5
area,sum,16721503,25373946,7741220,20304011,21775720,14631453
area,min,238533,35980,7741220,238391,1964375,912050
area,max,2381741,9596960,7741220,17098242,9984670,8514877


- To look at a single column of the aggregate analysis, use the column indexing:

In [248]:
group.agg(['count', 'sum', 'min', 'max', 'mean', 'std'])['birthrate']

Unnamed: 0_level_0,count,sum,min,max,mean,std
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Africa,17,537.0,18.0,43.4,31.588235,7.007129
Asia,23,419.9,7.8,38.3,18.256522,7.272665
Australia,1,12.1,12.1,12.1,12.1,
Europe,9,91.4,8.5,12.3,10.155556,1.45268
North America,3,41.3,10.3,18.5,13.766667,4.244212
South America,5,84.8,14.3,19.2,16.96,1.844722


- The `std` column is missing a value above because Australia has only one row in the data frame, and the sample standard deviation is undefined for a single observation.
- Also note that `sum` is not a meaningful function to apply to a rate such as birthrate, whereas it is meaningful for population.
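
The missing `std` value can be reproduced directly. pandas computes the sample standard deviation (`ddof=1`) by default, which divides by n - 1 and is therefore undefined for one observation. The sketch below uses a hypothetical one-element Series.

```python
import numpy as np
import pandas as pd

# A group with a single observation, like Australia above.
single = pd.Series([12.1])

sample_std = single.std()             # default ddof=1: undefined, returns NaN
population_std = single.std(ddof=0)   # divides by n instead: 0.0

print(sample_std, population_std)
```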

- Different aggregating functions can be applied to different columns. This can be done with a dictionary.

In [249]:
colFun = {'country':['count'],
          'population': ['sum','min', 'max','mean','std'], 
          'area': ['sum','min', 'max','mean'],
          'coastline':['sum','min', 'max'],
          'birthrate':['min', 'max','mean','std'],
          'deathrate':['min', 'max','mean','std'],
          'life expectancy':['min', 'max','mean','std']}
analysis=group.agg(colFun)
analysis.T

Unnamed: 0,continent,Africa,Asia,Australia,Europe,North America,South America
country,count,17.0,23.0,1.0,9.0,3.0,5.0
population,sum,912510200.0,4207906000.0,22992654.0,569248500.0,482525200.0,358584600.0
population,min,20172330.0,22235000.0,22992654.0,21599740.0,35362900.0,30741060.0
population,max,186053400.0,1373541000.0,22992654.0,142355400.0,323995500.0,205823700.0
population,mean,53677070.0,182952400.0,22992654.0,63249840.0,160841700.0,71716930.0
population,std,42362230.0,364416300.0,,34465820.0,147958600.0,75337960.0
area,sum,16721500.0,25373950.0,7741220.0,20304010.0,21775720.0,14631450.0
area,min,238533.0,35980.0,7741220.0,238391.0,1964375.0,912050.0
area,max,2381741.0,9596960.0,7741220.0,17098240.0,9984670.0,8514877.0
area,mean,983617.8,1103215.0,7741220.0,2256001.0,7258573.0,2926291.0
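
A dictionary passed to `agg()` produces columns with two levels: the original column name and the function name. The sketch below, on hypothetical toy data, shows how to read a single entry from such a result with a `(column, function)` tuple.

```python
import pandas as pd

# Hypothetical toy data, not taken from countries.csv.
toy = pd.DataFrame({'continent': ['Asia', 'Africa', 'Asia', 'Africa'],
                    'country': ['A', 'B', 'C', 'D'],
                    'population': [10, 30, 20, 40],
                    'birthrate': [12.0, 30.0, 18.0, 34.0]})

# Each key is a column; each value lists the functions to apply to it.
analysis = toy.groupby('continent').agg({'country': ['count'],
                                         'population': ['sum', 'max'],
                                         'birthrate': ['mean']})

print(analysis.columns.tolist())
print(analysis.loc['Asia', ('population', 'sum')])
```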


- Custom aggregation functions may also be applied. In the previous examples, aggregation functions are applied to each **column** in a data frame. 
- Keep this in mind when defining a custom function. For example, build a function that computes the mean after removing maxima (truncated mean).

In [252]:
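def trunc_mean(x):    # x has to be a 'vector' (1d array or pandas Series)
    sec=x[x!=x.max()]    # drop every entry equal to the maximum
    if sec.shape[0]!=0:
        return np.mean(sec)
    return np.nan        # nothing is left once the maxima are removed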

In [253]:
Country[Country['country']!='Australia' ].groupby('continent').agg(trunc_mean)

Unnamed: 0_level_0,population,area,coastline,birthrate,deathrate,life expectancy,coastal
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Africa,45403550.0,896235.1,1081.875,30.85,8.125,61.8875,0.0
Asia,128834800.0,717135.7,5658.727273,17.345455,6.372727,71.995455,0.0
Europe,53361640.0,400721.1,4282.0,9.8875,10.6875,77.525,
North America,79264830.0,5895525.0,14627.0,11.4,6.75,77.85,
South America,38190240.0,1529144.0,3352.75,16.4,5.8,74.75,
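
The NaN entries in the `coastal` column arise when every value in a group equals the maximum, so nothing remains to average. The sketch below reproduces both cases on hypothetical toy data, using an equivalent variant of `trunc_mean` restricted to a single numeric column.

```python
import numpy as np
import pandas as pd

def trunc_mean(x):
    """Mean of x after removing every entry equal to its maximum."""
    rest = x[x != x.max()]
    return rest.mean() if len(rest) else np.nan

# Hypothetical toy data: all 'Europe' values are equal to the maximum.
toy = pd.DataFrame({'continent': ['Asia', 'Asia', 'Asia', 'Europe', 'Europe'],
                    'population': [10.0, 20.0, 90.0, 5.0, 5.0]})

result = toy.groupby('continent')['population'].agg(trunc_mean)
print(result)
```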


**Exercise 6**

Using the data frame Country:

- Find the minimum and maximum of the 'life expectancy' for coastal and landlocked countries.
- How many coastal and landlocked countries are in each continent? Determine the mean deathrate for coastal and landlocked countries in each continent.
- Create a new column measuring population density.  Determine the mean population density of all the countries.  Create a new column 'dense' that is True for countries with higher than average density and False for lower than average density.
- For each category of 'dense' compute the mean birthrate and the difference between the maximal birthrate and the minimal birthrate.

In [None]:
#### Your code here


<p><a name="time"></a></p>
# Time Series

We will use the NYSE data from [Kaggle](https://www.kaggle.com/dgawlik/nyse) to illustrate how to use the time series data structure in pandas.

There are two time series related csv files:
- prices.csv: raw, as-is daily prices. Most of the data spans from 2010 to the end of 2016; for companies new to the stock market the date range is shorter. There were approximately 140 stock splits in that time, and this set does not account for them.
- prices-split-adjusted.csv: same as prices.csv, but with adjustments added for splits.

In [None]:
url_adjusted = 'https://graderdata.s3.amazonaws.com/prices-split-adjusted.csv'

### Your code here
split_adjusted_price = pd.read_csv(url_adjusted)

In [None]:
split_adjusted_price.head()

- The date column contains timestamps so we can use it as the index of our dataframe.

In [None]:
split_adjusted_price['date'] = pd.to_datetime(split_adjusted_price['date'])
split_adjusted_price = split_adjusted_price.set_index('date')

In [None]:
split_adjusted_price.head()

- Once the index has the datetime data type, we can easily extract the day, month, year, or even the weekday.

In [None]:
split_adjusted_price.index.day

In [None]:
split_adjusted_price.index.month

In [None]:
split_adjusted_price.index.dayofweek
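
The same conversion and attribute access can be tried without downloading the data. The sketch below uses a hypothetical three-row frame with made-up closing prices; `dayofweek` counts from Monday = 0.

```python
import pandas as pd

# Hypothetical toy prices; the values are made up for illustration.
toy = pd.DataFrame({'date': ['2016-05-02', '2016-05-03', '2016-06-01'],
                    'close': [93.6, 95.2, 98.5]})

# Parse the strings into timestamps and use them as the index.
toy['date'] = pd.to_datetime(toy['date'])
toy = toy.set_index('date')

days = toy.index.day.tolist()            # day of the month
months = toy.index.month.tolist()        # month number
weekdays = toy.index.dayofweek.tolist()  # Monday = 0, ..., Sunday = 6
print(days, months, weekdays)
```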

- You can also extract the observations within any time range.

In [None]:
# View all observations that occurred in May 2016
split_adjusted_price.loc['5/2016']

In [None]:
# Observations between May 3rd and May 4th
split_adjusted_price['5/3/2016':'5/4/2016']
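
Date-string selection on a datetime index includes both endpoints of a slice, unlike ordinary positional slicing. A minimal sketch on a hypothetical Series with a sorted `DatetimeIndex`, using ISO-formatted date strings:

```python
import pandas as pd

# Hypothetical toy series indexed by sorted dates.
idx = pd.to_datetime(['2016-04-29', '2016-05-03', '2016-05-04',
                      '2016-05-05', '2016-06-01'])
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

may = toy.loc['2016-05']                     # every observation in May 2016
window = toy.loc['2016-05-03':'2016-05-04']  # both end dates are included

print(may.tolist())
print(window.tolist())
```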

- Once the dataframe has a datetime index, you can easily plot a time series for any of its columns.
- For example, the closing price of Apple Inc.:

In [None]:
%matplotlib inline
split_adjusted_price[split_adjusted_price['symbol'] == 'AAPL'].close.plot()

<p><a name="sol"></a></p>
# Solutions

**Exercise 1**

In [None]:
NYC=pd.DataFrame([['Bronx',1.39,42],
                ['Manhattan',1.59,23],
                ['Brooklyn',2.47,71],
                ['Staten Island',0.44,59],
                ['Queens',2.23,109]],columns=['boro','pop','area'])

NYC['density']=NYC['pop']/NYC['area']
NYC=NYC.set_index('boro')
NYC

**Exercise 3**

In [None]:
NYC3=pd.concat([NYC, new_features], axis = 1)
NYC3

**Exercise 4**

In [None]:
NYC4=pd.merge(NYC3.reset_index(),Elevations,left_on='high_point',right_on='location').set_index('index')
NYC4.columns=NYC4.columns.str.replace('elevation','peak_elevation')
NYC4=NYC4.sort_values('peak_elevation',ascending=False)

**Exercise 5**

In [None]:
Employee=Employee.replace('?',np.nan)
Employee
np.sum(Employee.isnull(), axis=1)
np.sum(Employee.isnull(), axis=0)
Employee.loc[Employee.isnull().any(axis=1),:]
Employee.loc[:,Employee.isnull().any(axis=0) ]

**Exercise 6**

In [None]:
Country.groupby('coastal').agg({'life expectancy':['min','max']})
Country.groupby(['continent','coastal']).agg({'deathrate':['count','mean']})
Country['density']=Country['population']/Country['area']
Country['dense']=Country['density']>Country['density'].mean()
def diff_(x):
    return max(x)-min(x)
Country.groupby('dense').agg({'birthrate':['mean',diff_]})