This is the fourth session on basic Python 3 usage. It covers more of pandas.

## Course Structure

### Session 4 [1]

**Even More with pandas**
* Sorting and ranking

* Unique Values, Value Counts

* Data Wrangling
    - Clean
    - Transform
    - Merge
    - Reshape
    
* Data Transformation
    - Removing duplicates
    - Using a Function or Mapping
    - Replacing Values
    - Discretization and Binning
    - Detecting and Filtering Outliers

* Combining and Merging Data Sets
    - Database-style DataFrame Merges
    - Merging on Index
    - Concatenating Along an Axis

**Homework**


## Sorting and ranking

In [19]:
import pandas as pd
import numpy as np

In [12]:
list(range(0, 12,2))

[0, 2, 4, 6, 8, 10]

In [14]:
df1 = pd.DataFrame({'var_1':range(0, 12,2), 'var_2':range(1,13,2)}, index=['d', 'a', 'b', 'c', 'e', 'f'])
df1

Unnamed: 0,var_1,var_2
d,0,1
a,2,3
b,4,5
c,6,7
e,8,9
f,10,11


<font color = 'blue'>
Sorting a data set by some criterion is another important built-in operation. <br> To sort
lexicographically by row or column index, use the sort_index method, which returns
a new, sorted object: </font>

In [15]:
df1.sort_index()

Unnamed: 0,var_1,var_2
a,2,3
b,4,5
c,6,7
d,0,1
e,8,9
f,10,11


<font color = 'blue'> You can sort by index on either axis, in ascending or descending order:

In [21]:
df2 = pd.DataFrame(np.arange(12).reshape((3, 4)), index=['three', 'one', 'two'], columns=['d', 'a', 'b', 'c'])
df2

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7
two,8,9,10,11


In [23]:
df2.sort_index(axis = 1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4
two,9,10,11,8


In [24]:
df2.sort_index(axis = 1, ascending = False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5
two,8,11,10,9


<font color = 'blue'>   You may also want to sort by the values in one or more columns. To do so,
pass one or more column names to the by option of sort_values (the by argument of sort_index is deprecated and has been removed from recent pandas releases):

In [26]:
df2.sort_values(by='b', ascending = False)

Unnamed: 0,d,a,b,c
two,8,9,10,11
one,4,5,6,7
three,0,1,2,3
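
Sorting by more than one column, and the ranking named in this section's title, can be sketched as follows. The frame and its values are made up for illustration; `sort_values` and `rank` are the pandas methods assumed here.

```python
import pandas as pd

# Hypothetical frame with ties in column 'a', so the second sort key matters.
frame = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [4, 7, -3, 2]})

# Sort by 'a' first, then break ties using 'b'.
by_two = frame.sort_values(by=['a', 'b'])
print(by_two)

# rank gives each value its position in sorted order;
# by default, tied values share the mean of their ranks.
ranks = frame['a'].rank()
print(ranks)
```

The row labels travel with the rows, so the sorted frame keeps the original index values in their new order.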


## Unique Values, Value Counts, and Membership

In [34]:
S1 = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [35]:
S1.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

In [38]:
S1.value_counts()

a    3
c    3
b    2
d    1
dtype: int64

In [39]:
S1.value_counts(sort = False)

c    3
a    3
d    1
b    2
dtype: int64
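
The membership part of this section's title is not shown above. A minimal sketch, assuming the same Series and using `isin` to build a boolean mask:

```python
import pandas as pd

S1 = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# Boolean mask: True where the value is one of the listed labels.
mask = S1.isin(['b', 'c'])

# Use the mask to keep only the matching entries.
subset = S1[mask]
print(subset)
```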

## Combining and Merging Data Sets

<font color = 'blue'>  **pandas.merge**<br>
connects rows in DataFrames based on one or more keys. <br> This will
be familiar to users of SQL or other relational databases, as it implements database
join operations.<br>
<br>
**pandas.concat** <br>
glues or stacks together objects along an axis.

In [50]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

In [51]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [52]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,d,2


In [53]:
pd.merge(df1, df2)

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


<font color = 'blue'>  If not specified, merge uses the
overlapping column names as the keys. It’s good practice to specify them explicitly, though:

In [54]:
pd.merge(df1, df2, on='key')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


<font color = 'blue'>  If the column names are different in each object, you can specify them separately:

In [56]:
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})

In [57]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Unnamed: 0,lkey,data1,rkey,data2
0,b,0,b,1
1,b,1,b,1
2,b,6,b,1
3,a,2,a,0
4,a,4,a,0
5,a,5,a,0


<font color = 'blue'>  Notice that the 'c' and 'd' values and associated data are missing from
the result. <br>
By default, merge does an 'inner' join; the keys in the result are the intersection of the keys in both tables.
 <br>
Other possible options are 'left', 'right', and 'outer'. <br>
The outer join takes the union of the keys, combining the effect of applying both left and right joins:

In [58]:
pd.merge(df1, df2, how='outer')

Unnamed: 0,key,data1,data2
0,b,0.0,1.0
1,b,1.0,1.0
2,b,6.0,1.0
3,a,2.0,0.0
4,a,4.0,0.0
5,a,5.0,0.0
6,c,3.0,
7,d,,2.0
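
To compare the join types side by side, the sketch below rebuilds the same two frames and runs 'left', 'right', and 'outer' merges. The `indicator=True` argument is an addition not used elsewhere in this session; it adds a `_merge` column recording which side each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# 'left' keeps every key from df1; 'right' keeps every key from df2.
left_join = pd.merge(df1, df2, on='key', how='left')
right_join = pd.merge(df1, df2, on='key', how='right')

# 'outer' keeps keys from both sides; _merge is 'both', 'left_only', or 'right_only'.
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)

print(sorted(set(left_join['key'])))
print(sorted(set(right_join['key'])))
print(outer[['key', '_merge']])
```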


<font color = 'blue'>  Many-to-many merges form the Cartesian product of the rows that share a key:

In [60]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

In [61]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [62]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,a,2
3,b,3
4,d,4


In [63]:
pd.merge(df1, df2, on='key', how='left')

Unnamed: 0,key,data1,data2
0,b,0,1.0
1,b,0,3.0
2,b,1,1.0
3,b,1,3.0
4,a,2,0.0
5,a,2,2.0
6,c,3,
7,a,4,0.0
8,a,4,2.0
9,b,5,1.0
10,b,5,3.0


<font color = 'blue'> To merge with multiple keys, pass a list of column names:

In [65]:
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'], 'key2': ['one', 'two', 'one'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],  'key2': ['one', 'one', 'one', 'two'], 'rval': [4, 5, 6, 7]})

In [66]:
left

Unnamed: 0,key1,key2,lval
0,foo,one,1
1,foo,two,2
2,bar,one,3


In [67]:
right

Unnamed: 0,key1,key2,rval
0,foo,one,4
1,foo,one,5
2,bar,one,6
3,bar,two,7


In [68]:
pd.merge(left, right, on=['key1', 'key2'], how='outer')

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1.0,4.0
1,foo,one,1.0,5.0
2,foo,two,2.0,
3,bar,one,3.0,6.0
4,bar,two,,7.0


<font color = 'blue'> To determine which key combinations will appear in the result depending on the choice
of merge method, think of the multiple keys as forming an array of tuples to be used
as a single join key.
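
The tuple view can be checked by hand. The sketch below is only a mental model built with plain Python sets, not a description of how merge is implemented: it pairs up the two key columns on each side and compares the set intersection with the keys an inner merge returns.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'], 'key2': ['one', 'two', 'one'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'], 'key2': ['one', 'one', 'one', 'two'], 'rval': [4, 5, 6, 7]})

# Treat each (key1, key2) pair as one composite key.
left_keys = set(zip(left['key1'], left['key2']))
right_keys = set(zip(right['key1'], right['key2']))

inner_keys = left_keys & right_keys  # combinations an inner join keeps
outer_keys = left_keys | right_keys  # combinations an outer join keeps

# Compare with the key combinations pandas actually produces.
inner = pd.merge(left, right, on=['key1', 'key2'], how='inner')
merged_keys = set(zip(inner['key1'], inner['key2']))
print(inner_keys == merged_keys)
```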

<font color = 'blue'> merge has a suffixes option for specifying strings to append to overlapping
names in the left and right DataFrame objects:

In [69]:
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

Unnamed: 0,key1,key2_left,lval,key2_right,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


### Merging on Index

<font color = 'blue'> You can pass left_index=True or right_index=True (or both) to indicate that the
index should be used as the merge key:

In [72]:
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

In [73]:
left1

Unnamed: 0,key,value
0,a,0
1,b,1
2,a,2
3,a,3
4,b,4
5,c,5


In [74]:
right1

Unnamed: 0,group_val
a,3.5
b,7.0


In [75]:
pd.merge(left1, right1, left_on='key', right_index=True)

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0


### Concatenating Along an Axis

<font color = 'blue'> Because pandas objects such as Series and DataFrame have labeled axes,
they let you generalize array concatenation further. In particular, you have a number
of additional things to think about: <br>
• If the objects are indexed differently on the other axes, should the collection of
axes be unioned or intersected? <br>
• Do the groups need to be identifiable in the resulting object? <br>
• Does the concatenation axis matter at all? <br>
The concat function in pandas provides a consistent way to address each of these concerns.

In [77]:
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

In [80]:
s1

a    0
b    1
dtype: int64

In [81]:
s2

c    2
d    3
e    4
dtype: int64

In [82]:
s3

f    5
g    6
dtype: int64

<font color = 'blue'>  Calling concat with these objects in a list glues together the values and indexes:

In [83]:
pd.concat([s1, s2, s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

<font color = 'blue'> By default concat works along axis=0, producing another Series. If you pass axis=1, the
result will instead be a DataFrame (axis=1 is the columns):

In [84]:
pd.concat([s1, s2, s3], axis =1 )


Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0
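
Two of the questions listed at the start of this section (union versus intersection, and identifiable groups) map onto the `join` and `keys` arguments of concat. A minimal sketch, where `s4` is a hypothetical Series that overlaps `s1`:

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
# Hypothetical Series sharing the labels 'a' and 'b' with s1.
s4 = pd.Series([0, 1, 5, 6], index=['a', 'b', 'f', 'g'])

# join='inner' keeps only the labels present in every input (the intersection).
inner = pd.concat([s1, s4], axis=1, join='inner')
print(inner)

# keys labels each piece, producing a hierarchical index on the concatenation axis.
labeled = pd.concat([s1, s4], keys=['first', 'second'])
print(labeled)
```

With `keys`, each original piece can be pulled back out by its label, for example `labeled.loc['second']`.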


<font color = 'blue'> The same logic extends to DataFrame objects:

In [86]:
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'], columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'], columns=['three', 'four'])

In [87]:
df1

Unnamed: 0,one,two
a,0,1
b,2,3
c,4,5


In [88]:
df2

Unnamed: 0,three,four
a,5,6
c,7,8


In [89]:
pd.concat([df1, df2], sort=True)


Unnamed: 0,four,one,three,two
a,,0.0,,1.0
b,,2.0,,3.0
c,,4.0,,5.0
a,6.0,,5.0,
c,8.0,,7.0,


In [90]:
pd.concat([df1, df2], axis =1)


Unnamed: 0,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


<font color = 'blue'> When the row index is not meaningful in
the context of the analysis, pass ignore_index=True to discard it and number the rows afresh:

In [91]:
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

In [92]:
df1

Unnamed: 0,a,b,c,d
0,0.866563,-0.857063,-1.035841,-2.808718
1,0.006293,0.247702,-2.112872,1.011676
2,-1.729583,-0.543642,-1.102728,0.321743


In [93]:
df2

Unnamed: 0,b,d,a
0,-0.99998,-1.071108,0.753299
1,-0.742733,0.427064,0.376217


In [94]:
pd.concat([df1, df2], ignore_index=True)


Unnamed: 0,a,b,c,d
0,0.866563,-0.857063,-1.035841,-2.808718
1,0.006293,0.247702,-2.112872,1.011676
2,-1.729583,-0.543642,-1.102728,0.321743
3,0.753299,-0.99998,,-1.071108
4,0.376217,-0.742733,,0.427064


## Data Transformation

### Removing Duplicates

In [96]:
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 3, 3, 4, 4]})

In [97]:
data

Unnamed: 0,k1,k2
0,one,1
1,one,1
2,one,2
3,two,3
4,two,3
5,two,4
6,two,4


In [99]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
2,one,2
3,two,3
5,two,4
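
By default drop_duplicates compares whole rows and keeps the first occurrence. The sketch below, on the same data, shows the boolean flags from `duplicated` and the `subset` and `keep` arguments, which are not used elsewhere in this session:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 3, 3, 4, 4]})

# True for every row that repeats an earlier row.
flags = data.duplicated()
print(flags)

# Judge duplicates on 'k1' only, keeping the last row of each group.
by_k1_last = data.drop_duplicates(subset=['k1'], keep='last')
print(by_k1_last)
```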


### Transforming Data Using a Function or Mapping

In [106]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami','corned beef', 'Bacon', 'pastrami', 'honey ham', 'nova lox'],
        'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [107]:
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


<font color = 'blue'> Suppose you wanted to add a column indicating the type of animal that each food came
from. Let’s write down a mapping of each distinct meat type to the kind of animal:

In [102]:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}

In [103]:
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)

In [104]:
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon
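
Two variations on the mapping step, as a sketch: the `.str.lower()` accessor in place of `map(str.lower)`, and a single function that does both steps. The shortened dictionary and the 'tofu' entry are hypothetical; they are there to show that a value missing from the mapping comes back as a missing value rather than raising an error.

```python
import pandas as pd

# 'tofu' is a hypothetical entry with no mapping.
foods = pd.Series(['bacon', 'Pastrami', 'Bacon', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# Normalize case with the string accessor, then look each value up in the dict.
animals = foods.str.lower().map(meat_to_animal)
print(animals)

# The same result from one function; dict.get returns None for unknown keys.
same = foods.map(lambda x: meat_to_animal.get(x.lower()))
print(same)
```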


## Homework

1. When using sort_index to sort a DataFrame:
    - how are NULL values sorted?
2. For a categorical feature, count the frequency of each category and list the counts in descending order of the index.