* The main difference between pandas and NumPy: pandas is designed for tabular, heterogeneous data, while NumPy is best suited to homogeneous numerical array data.
* NumPy does not provide row or column labels, or built-in missing-data handling.
* pandas provides high-level data structures and functions designed to make working with structured and tabular data fast, easy and expressive.
* The primary objects in pandas are `DataFrame` (a tabular, column-oriented data structure with both row and column labels) and `Series` (a 1D labeled array object).
* It combines the high-performance array-computing ideas of NumPy with the flexible data manipulation of spreadsheets and relational databases.
* It provides functionality to reshape, slice, aggregate and select subsets of data.
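
The contrast in the first bullets can be sketched as follows. This is a minimal illustration with made-up data; the names `arr` and `df` and the row labels `r1`, `r2` are assumptions for the example only.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype: mixing strings and numbers
# coerces every element to a string.
arr = np.array([["Ohio", 35000], ["Texas", 25000]])
print(arr.dtype)

# A DataFrame keeps one dtype per column and carries row and column labels.
df = pd.DataFrame(
    {"state": ["Ohio", "Texas"], "population": [35000, 25000]},
    index=["r1", "r2"],  # hypothetical row labels
)
print(df.dtypes)
print(df.loc["r2", "population"])  # select by row label and column label
```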

In [1]:
import pandas as pd
import numpy as np
from pandas import Series

# `Series`
* A 1-D array-like object containing a sequence of values and an associated array of data labels, called its index.

In [2]:
s1 = Series([4,7,-5,3])

In [3]:
s1

0    4
1    7
2   -5
3    3
dtype: int64

In [4]:
s1.values # array representation of Series

array([ 4,  7, -5,  3], dtype=int64)

In [5]:
s1.index # index object of Series

RangeIndex(start=0, stop=4, step=1)

#### Series with index
* Useful for identifying each data point with a label.

In [6]:
s2 = Series([4,7,-5,3], index = ['a', 'b', 'c', 'd'])

In [7]:
s2

a    4
b    7
c   -5
d    3
dtype: int64

In [8]:
s2['b']

7

In [9]:
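s2[1] # integer is treated as a position because the index holds strings; recent pandas versions deprecate this in favor of s2.iloc[1]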

7

In [10]:
s2[['c', 'b']] # list of labels

c   -5
b    7
dtype: int64

In [11]:
s2.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [12]:
list(s2.items())

[('a', 4), ('b', 7), ('c', -5), ('d', 3)]

* Operations such as scalar multiplication, boolean filtering and math functions preserve the index-value link.

In [13]:
s2[s2 > 0]

a    4
b    7
d    3
dtype: int64

In [14]:
s2 * 2

a     8
b    14
c   -10
d     6
dtype: int64

In [15]:
np.exp(s2) # NumPy ufuncs work with Series

a      54.598150
b    1096.633158
c       0.006738
d      20.085537
dtype: float64

* A Series can also be thought of as a fixed-length, ordered dictionary that maps index labels to values.

In [16]:
'b' in s2

True

In [17]:
'1' in s2

False

In [18]:
d1 = {'Ohio':35000, 'Texas':25000, 'Oregon':16000, 'Utah': 5000}

In [19]:
s3 = Series(d1) 

In [20]:
s3

Ohio      35000
Texas     25000
Oregon    16000
Utah       5000
dtype: int64

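* By default the index holds the dictionary keys in insertion order, as the output above shows (older pandas versions sorted the keys).
* To specify a different order, pass the desired labels as `index`.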

In [21]:
states = ['California', 'Ohio', 'Utah', 'Texas']

In [22]:
s4 = Series(d1, index = states)

In [23]:
s4

California        NaN
Ohio          35000.0
Utah           5000.0
Texas         25000.0
dtype: float64

* `NaN` (not a number) denotes missing or NA values. `California` has no entry in `d1`, so its value is `NaN`; `Oregon` is excluded because it is not in `states`.
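
A side effect visible in the output above is the dtype change from `int64` to `float64`: `NaN` is a floating-point value, so the integers are upcast. A minimal sketch reusing the same data:

```python
import pandas as pd

d1 = {"Ohio": 35000, "Texas": 25000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Utah", "Texas"]

s3 = pd.Series(d1)                # all labels have values
s4 = pd.Series(d1, index=states)  # 'California' has no value

print(s3.dtype)        # integer dtype
print(s4.dtype)        # float dtype, because NaN is a float
print("Oregon" in s4)  # False: the label was not requested
```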

----------
* `pd.isnull` and `pd.notnull` are used to detect missing data.

In [24]:
pd.isnull(s4)

California     True
Ohio          False
Utah          False
Texas         False
dtype: bool

In [25]:
pd.notnull(s4)

California    False
Ohio           True
Utah           True
Texas          True
dtype: bool

* Series also provides the same functions as instance methods.

In [26]:
s4.isnull()

California     True
Ohio          False
Utah          False
Texas         False
dtype: bool

In [27]:
s4.notnull()

California    False
Ohio           True
Utah           True
Texas          True
dtype: bool

--------------------

In [28]:
s3

Ohio      35000
Texas     25000
Oregon    16000
Utah       5000
dtype: int64

In [29]:
s4

California        NaN
Ohio          35000.0
Utah           5000.0
Texas         25000.0
dtype: float64

In [30]:
s3 + s4

California        NaN
Ohio          70000.0
Oregon            NaN
Texas         50000.0
Utah          10000.0
dtype: float64

---------------

* Both the Series object and its index have a `name` attribute.

In [31]:
s4.name = 'population'

In [32]:
s4.index.name = 'states'

In [33]:
s4

states
California        NaN
Ohio          35000.0
Utah           5000.0
Texas         25000.0
Name: population, dtype: float64

In [34]:
s1

0    4
1    7
2   -5
3    3
dtype: int64

* A Series index can be changed in place by assignment.

In [35]:
s1.index = ['Bob', 'Joe', 'Dave', 'Jeff']

In [36]:
s1

Bob     4
Joe     7
Dave   -5
Jeff    3
dtype: int64

------------

In [37]:
s1['Joe':'Jeff'] # slicing with labels includes the end point

Joe     7
Dave   -5
Jeff    3
dtype: int64

In [38]:
s5 = Series([7.3,-2.5,3.4,1.5], index = ['a', 'c', 'd', 'e'])

In [39]:
s6 = Series([-2.1,3.6,-1.5,4,3.1], index = ['a', 'c', 'e', 'f', 'g'])

In [40]:
s5 + s6

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

* Arithmetic aligns on index labels; labels that appear in only one Series produce missing values in the result.
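
If `NaN` for non-overlapping labels is not what you want, one option is the `add` method with `fill_value`, which substitutes a default for the side that lacks the label. A sketch using the same `s5` and `s6`; choosing `0` as the fill value is an assumption that depends on what the data means.

```python
import pandas as pd

s5 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s6 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

# A label missing from one Series is treated as 0 instead of yielding NaN.
filled = s5.add(s6, fill_value=0)
print(filled)
```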

In [41]:
s7 = Series(np.random.randn(5))

In [42]:
s7

0    0.039462
1    0.848554
2    1.206744
3    0.210562
4   -1.006886
dtype: float64

In [43]:
s7.map(abs)

0    0.039462
1    0.848554
2    1.206744
3    0.210562
4    1.006886
dtype: float64

In [44]:
s17 = Series(["pair", "oranje", "bananna", "oranje", "oranje", "oranje", "purvil"])

In [45]:
s17

0       pair
1     oranje
2    bananna
3     oranje
4     oranje
5     oranje
6     purvil
dtype: object

In [46]:
correction = {
    "pair":"pear",
    "oranje" : "orange",
    "bananna":"banana"
}

In [47]:
s17.map(correction) # handy for changing several values in a column: each dict key is an original value, each dict value its replacement
# Important: any original value that is not a key in the dict becomes NaN.

0      pear
1    orange
2    banana
3    orange
4    orange
5    orange
6       NaN
dtype: object
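
To keep values that have no entry in the dict (such as `purvil` above), two common options are `Series.replace` or `map` followed by `fillna`. A minimal sketch on a shortened copy of the data:

```python
import pandas as pd

s17 = pd.Series(["pair", "oranje", "bananna", "oranje", "purvil"])
correction = {"pair": "pear", "oranje": "orange", "bananna": "banana"}

# Option 1: replace only touches values that appear as dict keys.
fixed_replace = s17.replace(correction)

# Option 2: map, then fill the unmapped (NaN) positions from the original.
fixed_map = s17.map(correction).fillna(s17)

print(fixed_replace)
```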

In [48]:
s6

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [49]:
s8 = s6.reindex(['c', 'g', 'f', 'a', 'e'])

In [50]:
s8

c    3.6
g    3.1
f    4.0
a   -2.1
e   -1.5
dtype: float64

In [51]:
s8.sort_index()

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [52]:
s8.sort_index(ascending=False)

g    3.1
f    4.0
e   -1.5
c    3.6
a   -2.1
dtype: float64

In [53]:
s8.sort_values() # sort series by its values

a   -2.1
e   -1.5
g    3.1
c    3.6
f    4.0
dtype: float64

* By default, missing values are sorted to the end of the Series.
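
The series sorted above contains no missing values, so here is a small sketch with made-up data. It also shows the `na_position` parameter, which moves the missing values to the front.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, np.nan, 7, np.nan, -3, 2])

na_last = s.sort_values()                      # default: NaN at the end
na_first = s.sort_values(na_position="first")  # NaN at the start

print(na_last)
print(na_first)
```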

In [54]:
s9 = Series([7,-5,7,4,2,0,4])

In [55]:
s9.rank() # breaks ties by assigning each tied group its mean rank

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

* To break ties by the order in which the values are observed:

In [56]:
s9.rank(method = "first")

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [57]:
s9.rank(ascending=False, method = 'max') # maximum rank in the group

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

![ranking methods](images/ranking.jpg)
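
As a text alternative to the image, the tie-breaking methods accepted by `rank` can be compared side by side on `s9`. This is only a sketch; the `ranks` DataFrame is built for illustration.

```python
import pandas as pd

s9 = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, all ranking in ascending order.
methods = ["average", "min", "max", "first", "dense"]
ranks = pd.DataFrame({m: s9.rank(method=m) for m in methods})
ranks.insert(0, "value", s9)
print(ranks)
```

`min` and `max` give every member of a tied group the lowest or highest rank of that group, while `dense` works like `min` but increases the rank by exactly 1 between groups.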

In [58]:
s10 = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

In [59]:
s10.index.is_unique

False

In [60]:
s10['a']

a    0
a    1
dtype: int64

In [61]:
s10['c']

4

In [85]:
s11 = Series(['a', 'a', 'b', 'c'] * 4)

In [86]:
s11

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

In [63]:
s11.describe()

count     16
unique     3
top        a
freq       8
dtype: object

In [64]:
s12 = s11.unique()

In [65]:
s12

array(['a', 'b', 'c'], dtype=object)

In [66]:
s11.value_counts()

a    8
c    4
b    4
dtype: int64

In [67]:
s13 = s11.isin(['b', 'c'])

In [84]:
s13

0     False
1     False
2      True
3      True
4     False
5     False
6      True
7      True
8     False
9     False
10     True
11     True
12    False
13    False
14     True
15     True
dtype: bool

In [68]:
s11[s13]

2     b
3     c
6     b
7     c
10    b
11    c
14    b
15    c
dtype: object

* `Index.get_indexer` takes an array of possibly repeated values and returns, for each one, its integer position in an index of distinct values.

In [69]:
s14 = Series(['c', 'a', 'b', 'b', 'c', 'a'])

In [70]:
s15 = Series(['c', 'b', 'a'])

In [71]:
pd.Index(s15).get_indexer(s14)

array([0, 2, 1, 1, 0, 2], dtype=int64)
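
One detail worth knowing: values that are absent from the index are reported as `-1`. A minimal sketch, where `'z'` is a made-up value that does not occur in the index:

```python
import pandas as pd

categories = pd.Index(["c", "b", "a"])
values = pd.Series(["c", "a", "z", "b"])  # 'z' is not in categories

codes = categories.get_indexer(values)
print(codes)  # -1 marks values missing from the index

# Keep only the values that were found.
found = codes != -1
print(values[found])
```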

### `Series.str.contains()`  `Series.str.endswith()` `Series.str.startswith()`
* Vectorized counterparts of Python's `in` operator and the `str.endswith` / `str.startswith` methods, applied to each element.

In [72]:
s16 = Series(['purvil dave', 'bhavika joshi', 'japan dave', 'kamil patel'])
s16.str.contains('dave')

0     True
1    False
2     True
3    False
dtype: bool

In [73]:
s16.str.endswith('patel')

0    False
1    False
2    False
3     True
dtype: bool
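
`startswith` works the same way. The sketch below also shows two parameters that are often useful: `case=False` for case-insensitive matching with `contains`, and `na=False` so that missing entries yield `False` rather than a missing value. The data, including the `NaN` entry and the capitalized name, is made up for the example.

```python
import numpy as np
import pandas as pd

names = pd.Series(["purvil dave", "bhavika joshi", np.nan, "Kamil Patel"])

# Missing entries count as "no match" because of na=False.
starts = names.str.startswith("bhavika", na=False)

# Case-insensitive substring test.
has_patel = names.str.contains("patel", case=False, na=False)

print(starts)
print(has_patel)
```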

In [74]:
s18 = pd.Series(['a', 'b', 'c'], index = [1,3,5])

In [75]:
s18

1    a
3    b
5    c
dtype: object

In [76]:
s18[1] # explicit (label-based) index is used for a single element

'a'

In [77]:
s18[1:3] # implicit (position-based) index is used for slices

3    b
5    c
dtype: object

In [78]:
s18.loc[1] # explicit indexing

'a'

In [79]:
s18.loc[1:3] # explicit indexing

1    a
3    b
dtype: object

In [80]:
s18.iloc[1] # implicit indexing

'b'

In [81]:
s18.iloc[1:3] # implicit indexing

3    b
5    c
dtype: object

#### `Series.sample()`
* Performs simple random sampling: it draws random numbers and uses them to select observations from the Series. It also works on a DataFrame to select rows or columns.

```
series.sample(n=30, random_state=1)  # `series` stands for any Series with at least 30 elements
```
* `random_state` seeds the pseudorandom number generator, so the same sample is drawn on every run.
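
A small sketch of reproducible sampling, using made-up data. The names `s`, `df`, `rows` and `cols` are assumptions for this example.

```python
import pandas as pd

s = pd.Series(range(100))

# The same seed yields the same sample.
a = s.sample(n=5, random_state=1)
b = s.sample(n=5, random_state=1)
print(a.equals(b))

df = pd.DataFrame({"x": range(10), "y": range(10, 20), "z": range(20, 30)})
rows = df.sample(frac=0.3, random_state=1)      # 30% of the rows
cols = df.sample(n=2, axis=1, random_state=1)   # 2 of the 3 columns
print(rows.shape, cols.shape)
```

Sampling is without replacement by default, so `n` cannot exceed the number of elements unless `replace=True` is passed.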