## Introduction
#### pandas contains data structures and data manipulation tools to make data cleaning and analysis fast and easy.

#### It can be used in tandem with numerical computing tools (NumPy and SciPy), analytical libraries (statsmodels, scikit-learn) and data visualisation libraries (matplotlib).

#### It adopts NumPy's idiomatic style of array-based computing. 

### pandas is designed to work with heterogeneous data, whereas NumPy is designed for homogeneous numerical data.

## pandas Data Structures
#### There are 2 workhorse data structures in pandas - Series and DataFrame.
#### They are not a universal solution to every problem, but they provide a solid, easy-to-use basis for most applications.

### Series
#### It's a 1-D array-like object containing a sequence of values and an associated array of labels, called its index.
#### The output representation of a Series shows the index on the left and the values on the right.
#### If we do not specify an index, a default one consisting of the integers 0 to (n-1) is created.
#### You can access the index and the values of a Series separately through its 'index' and 'values' attributes respectively.

In [9]:
import pandas as pd

In [10]:
print(pd.__version__)

2.1.1


In [11]:
import pandas as pd

obj = pd.Series([4,7,-5,3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [12]:
obj2 = pd.Series([4,7,-5,3], index=['d','b','a','c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [13]:
ob = pd.Series([2,35,2],['w','e','s'])
ob

w     2
e    35
s     2
dtype: int64

In [14]:
ob.values

array([ 2, 35,  2], dtype=int64)

In [15]:
type(ob.values)

numpy.ndarray

In [16]:
type(ob)

pandas.core.series.Series

In [17]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [18]:
obj2.values

array([ 4,  7, -5,  3], dtype=int64)

#### You can use labels in the index to select single values or a set of values. To select a set of values, pass a list of labels.
#### We can use NumPy-like operations (filtering with a boolean array, scalar multiplication or applying math functions). These preserve the index-value link.

In [19]:
obj2['a']

-5

In [20]:
obj2['d'] = 6
obj2[['c','a','d']]

c    3
a   -5
d    6
dtype: int64

In [21]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [22]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [23]:
import numpy as np

np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

#### Another way to think of a Series is as a fixed-length, ordered dict.
#### This is because it maps index values to data values, much as a dict maps keys to values.
#### If you have a dict, you can create a Series from it. If you pass only the dict, its keys become the index, in the dict's order.
#### You can override this by passing an index containing the keys in the order you want them to appear.
#### If a label in that index is not a key in the dict, its value will be NaN (Not a Number), which pandas uses to mark missing or NA values. Dict keys that are not in the index are dropped. Missing values in a Series can be detected with the 'isnull' and 'notnull' functions.

In [24]:
'b' in obj2

True

In [25]:
'e' in obj2

False

In [26]:
sdata = {'Ohio': 3500, 'Texas':71000, 'Oregon':16000, 'Utah':5000}

obj3 = pd.Series(sdata)
obj3

Ohio       3500
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [27]:
states = ['California','Ohio','Oregon','Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio           3500.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [28]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [29]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

#### A useful feature of Series is that it automatically aligns by index label in arithmetic operations.
#### This alignment of indexes is similar to a join operation in databases.
#### Both the Series object and its index have a name attribute, which integrates with other pandas functionality.
#### A Series's index can be altered in-place by assignment.

In [30]:
obj3

Ohio       3500
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [31]:
obj4

California        NaN
Ohio           3500.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [32]:
obj3 + obj4

California         NaN
Ohio            7000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

In [33]:
ob= obj3 + obj4
ob.fillna(-1)

California        -1.0
Ohio            7000.0
Oregon         32000.0
Texas         142000.0
Utah              -1.0
dtype: float64

In [34]:
obj4.name = 'population'
obj4.index.name = 'state'

obj4

state
California        NaN
Ohio           3500.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [35]:
obj.index = ['Bob','Steve','Jeff','Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

### DataFrame
#### It represents a rectangular table of data containing an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, etc.).
#### It has both a row and a column index and can be thought of as a dict of Series, all sharing the same index.
#### Under the hood, the data in a DataFrame is stored as one or more 2-D blocks.
#### NOTE - Even though a DataFrame is 2-D, it can represent higher-dimensional data in tabular form using hierarchical indexing (multiple index levels on one axis).
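
#### A minimal sketch of hierarchical indexing, using made-up population figures. The names pop_by_state_year, ohio and wide are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical data: population keyed by (state, year) on a two-level row index.
index = pd.MultiIndex.from_tuples(
    [("Ohio", 2000), ("Ohio", 2001), ("Nevada", 2001), ("Nevada", 2002)],
    names=["state", "year"],
)
pop_by_state_year = pd.DataFrame({"pop": [1.5, 1.7, 2.4, 2.9]}, index=index)

# Selecting on the outer level returns that state's rows, indexed by year.
ohio = pop_by_state_year.loc["Ohio"]

# unstack() pivots the inner level into columns: a state x year table,
# with NaN where a (state, year) pair has no data.
wide = pop_by_state_year["pop"].unstack("year")
print(wide)
```

#### Three dimensions (state, year, value) are held here in an ordinary 2-D table; unstack only changes how the same data is laid out.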

#### There are many ways to create a DataFrame; the most common is from a dict of equal-length lists or NumPy arrays.
#### The resulting DataFrame has its index assigned automatically, and the columns are placed in the order of the keys in the dict.
#### The head() method shows the first 5 rows of the DataFrame by default.

In [36]:
ob1= pd.Series({'A':2,'B':4,'C':6,'D':7,'E':8})
ob2 =pd.Series({'A':78,'B':45,'C':98,'D':56})
data = pd.DataFrame({"Marks": ob1 ,"Grades": ob2 })
data

Unnamed: 0,Marks,Grades
A,2,78.0
B,4,45.0
C,6,98.0
D,7,56.0
E,8,


In [37]:
b=data.fillna(0).astype(int)

In [38]:
b.T

Unnamed: 0,A,B,C,D,E
Marks,2,4,6,7,8
Grades,78,45,98,56,0


In [39]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
       'year' : [2000, 2001, 2002, 2001, 2002, 2003],
       'pop' : [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame = pd.DataFrame(data)

In [40]:
frame["state"]

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [41]:
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [42]:
frame.set_index('state')  # returns a new DataFrame indexed by 'state'; the original frame is unchanged

state,year,pop
Ohio,2000,1.5
Ohio,2001,1.7
Ohio,2002,3.6
Nevada,2001,2.4
Nevada,2002,2.9
Nevada,2003,3.2


In [43]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [44]:
frame.set_index('state', inplace=True)

In [45]:
frame

state,year,pop
Ohio,2000,1.5
Ohio,2001,1.7
Ohio,2002,3.6
Nevada,2001,2.4
Nevada,2002,2.9
Nevada,2003,3.2


In [46]:
frame.index

Index(['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], dtype='object', name='state')

In [47]:
frame.loc['Ohio']  # select every row whose index label is 'Ohio'

state,year,pop
Ohio,2000,1.5
Ohio,2001,1.7
Ohio,2002,3.6


In [48]:
frame.loc['Ohio' ,'year' ] 

state
Ohio    2000
Ohio    2001
Ohio    2002
Name: year, dtype: int64

In [49]:
frame.loc[0]  # fails: the default integer index was replaced by 'state'

KeyError: 0

In [None]:
frame.reset_index(inplace=True)  # restore the default integer index

In [None]:
frame.loc[0]

In [None]:
frame

In [None]:
frame.iloc[0]

#### To arrange the columns in a particular order, pass the sequence of column names.
#### If you pass a column that isn't present in the dict, it will appear with missing values (NaN).
#### A column can be retrieved as a Series using either dict-like notation or attribute access.
#### The returned Series has the same index as the DataFrame.
#### NOTE - Attribute-like access and tab completion of column names are provided as a convenience in IPython. Attribute access only works if the column name is a valid Python variable name and does not clash with an existing DataFrame method name.
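
#### A small sketch of where attribute access works and where it does not. The frame df is a made-up example; its 'pop' column happens to share its name with the DataFrame.pop method.

```python
import pandas as pd

# Hypothetical frame: 'year' is safe as an attribute, 'pop' collides with a method.
df = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})

year_by_key = df["year"]   # dict-like notation always returns the column
year_by_attr = df.year     # attribute access returns the same column
pop_by_key = df["pop"]     # the 'pop' column
pop_by_attr = df.pop       # the DataFrame.pop method, not the column

print(year_by_key.equals(year_by_attr))  # True
print(callable(pop_by_attr))             # True
```

#### Dict-like notation is therefore the safer choice in code that has to work for arbitrary column names.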

In [None]:
pd.DataFrame(data, columns=['year', 'state','pop'])

In [None]:
frame2 = pd.DataFrame(data, columns = ['year', 'state', 'pop', 'debt'],
                     index = ['one', 'two', 'three', 'four', 'five', 'six'])
frame2

In [None]:
# changing column name
frame2.columns = ['Year', 'State', 'Pop', 'Debt']
frame2.columns

In [None]:
frame2.columns = [ x.upper() for x in frame2.columns]
frame2.columns

In [None]:
# replace a character in every column name with '_' (use ' ' as the first argument to replace spaces)
frame2.columns = frame2.columns.str.replace('A', "_" )
frame2.columns

In [None]:
frame2.columns = ['year', 'state', 'pop', 'debt']
frame2['state']

In [None]:
frame2.loc['four'] = [2001, 'India', 2.4, 3.9]  # assign a whole row; values follow the column order and keep numeric types
frame2

In [None]:
frame2.loc['four', ['state', 'year']] = ['Nevada', 1991]  # pass a list of column labels to set several columns in one row
frame2

In [None]:
frame2.year

#### Rows can be retrieved by label using 'loc', or by integer position using 'iloc'.

In [None]:
frame2.loc['three']

#### Columns can be modified by assignment. We can assign a scalar value or a list or array of values.
#### When assigning a list or array, its length must match the length of the DataFrame.
#### If we assign a Series, its labels are realigned exactly to the DataFrame's index, inserting NaN for any labels that are missing.
#### Assigning to a non-existent column creates a new column. New columns cannot be created with attribute syntax (frame2.eastern = ...).
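
#### A short sketch of the difference between the two assignment styles. The frame df and the column names 'eastern' and 'western' are illustrative assumptions.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]})

# Dict-like assignment to a new name creates a column.
df["eastern"] = df["state"] == "Ohio"

# Attribute assignment to a new name only sets a Python attribute on the
# object; pandas emits a warning about this, silenced here for the demo.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.western = [False, True]

print(list(df.columns))  # ['state', 'pop', 'eastern']
```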

In [None]:
frame2['pop'] = 16.5
frame2

In [None]:
frame2['debt'] = np.arange(6.0)
frame2

In [None]:
val = pd.Series([-1.2, -1.5, -1.7], index = ['two','four','five'])
frame2['debt'] = val
frame2

In [None]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

#### The 'del' keyword will delete columns, as with a dict.
#### NOTE - The column returned from indexing a DataFrame can be a view on the underlying data rather than a copy, so in-place modifications to the Series may be reflected in the DataFrame. Use the Series's 'copy' method when you need an independent copy.

In [None]:
del frame2['eastern']
frame2.columns
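
#### Whether a selected column shares memory with its DataFrame depends on the pandas version and settings, so the sketch below shows only the explicit copy, which behaves the same everywhere. The names df and pop_copy are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001, 2002], "pop": [1.5, 1.7, 3.6]})

# An explicit copy is independent of the DataFrame it came from.
pop_copy = df["pop"].copy()
pop_copy.iloc[0] = 99.0

print(df.loc[0, "pop"])   # 1.5, the DataFrame is unchanged
print(pop_copy.iloc[0])   # 99.0
```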

#### Another common form of input is a nested dict of dicts.
#### If it is passed to DataFrame, pandas interprets the outer dict keys as the columns and the inner keys as the row index.
#### The keys of the inner dicts are combined to form the index. This can be overridden by specifying the index explicitly.
#### The same rules apply to a dict of Series.

In [None]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
      'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)
frame3

In [None]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

In [None]:
pdata = {'Ohio': frame3['Ohio'][:-1],
        'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)
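
#### The cells above are shown without output, so here is a self-contained sketch of the same nested-dict construction with the properties worth checking. The variable explicit is an illustrative name.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys become columns; the inner keys of all dicts together form the index.
frame3 = pd.DataFrame(pop)
print(frame3)

# Nevada has no entry for 2000, so that cell is NaN.
print(frame3["Nevada"].isna())

# An explicit index selects and orders the rows; unknown labels give all-NaN rows.
explicit = pd.DataFrame(pop, index=[2001, 2002, 2003])
print(explicit)
```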

#### You can transpose a DataFrame (swap rows and columns) with syntax similar to that of a NumPy array.
#### If the index and columns of a DataFrame have their name attributes set, these will also be displayed.
#### The values attribute returns the data in a DataFrame as a 2-D ndarray.
#### If the dtypes of the columns differ, the dtype of the returned array will be chosen to accommodate all the columns.

In [None]:
frame3.T

In [None]:
frame3.index.name = 'year'
frame3.columns.name = 'state'

frame3

In [None]:
frame3.values  # the index and column labels are not part of the returned array

In [None]:
frame2.values

### Index Objects
#### They are responsible for holding the axis labels and other metadata (such as the axis name or names).
#### Any array or sequence of labels used when constructing a Series or DataFrame is internally converted to an Index.
#### Index objects are immutable (they cannot be modified by the user). This makes it safer to share them among data structures.
#### They behave like a fixed-size set. But unlike a set, they can contain duplicate labels.
#### Selecting a duplicated label selects all occurrences of that label.

In [None]:
obj = pd.Series(range(3), index=['a','b','b'])
index = obj.index
index

In [None]:
index[1] = 'd'  # raises TypeError: Index objects are immutable

In [None]:
labels = pd.Index(np.arange(3))
labels

In [None]:
obj2 = pd.Series([1.5, -2.5, 0], index = labels)
obj2

In [None]:
obj2.index is labels

In [None]:
frame3

In [None]:
frame3.columns

In [None]:
'Ohio' in frame3.columns

In [None]:
2003 in frame3.columns

In [None]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels
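
#### A compact sketch of the two Index properties described above: immutability and support for duplicate labels. The flag mutated and the names both_b and only_a are illustrative.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "b"])
index = obj.index

# Item assignment on an Index is rejected.
try:
    index[1] = "d"
    mutated = True
except TypeError:
    mutated = False

both_b = obj["b"]  # duplicated label: a Series holding both entries
only_a = obj["a"]  # unique label: a scalar

print(mutated, index.is_unique, len(both_b), only_a)
```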

### Dropping Entries from an Axis
#### Dropping entries from an axis is easy if you already have an index array or list without those entries.
#### The 'drop' method returns a new object with the indicated values deleted from an axis.
#### With a DataFrame, index values can be deleted from either axis using drop.
#### By default drop deletes labels from the row axis (axis=0). To delete from the columns, pass axis=1 or axis='columns'.

In [None]:
obj = pd.Series(np.arange(5) + 1, index=['a','b','c','d','e'])
obj

In [None]:
new_obj = obj.drop('c')  # returns a new Series without the entry labelled 'c'
new_obj

In [None]:
obj.drop(['d','c'])

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
                   index = ['Ohio','Colorado','Utah','New York'],
                   columns = ['one','two','three','four'])

data

In [None]:
data.drop(['Colorado', 'Ohio'], axis=0)  # pass a list of row labels; axis=0 (rows) is the default

In [None]:
data.drop('two', axis=1)  # axis=1 or 'columns' drops columns; axis=0 or 'index' (the default) drops rows

In [None]:
data.drop(['two','four'], axis='columns')

#### drop, like many other functions which modify the size or shape of a Series or DataFrame, can manipulate an object in-place without returning a new object.
#### Be careful with inplace, as it destroys any data that is dropped.

In [None]:
obj = pd.Series(np.arange(5.), index=['a','b','c','d','e'])
obj.drop('c', inplace=True)  # modify original one with inplace = true
obj

### Indexing, Selection and Filtering
#### Series indexing works similarly to NumPy array indexing, except that you can use the Series's index values instead of only integers.
#### Slicing with labels differs from normal Python slicing because the endpoint is included in the slice.
#### Assigning to a slice modifies the corresponding section of the Series.

In [None]:
obj = pd.Series(np.arange(4.), index = ['a','b','c','d'])
obj

In [None]:
obj['b']

In [None]:
obj.iloc[1]  # select by position; plain obj[1] on a non-integer index is deprecated since pandas 2.1

In [None]:
obj[2:4]

Passing a list inside the square brackets, as in obj[['b', 'a', 'd']], selects the entries labelled 'b', 'a' and 'd' and returns them as a new Series in that order.

The same syntax applied to a DataFrame selects the listed columns and returns a DataFrame.

The outer square brackets are the indexing operator, and the inner square brackets create the list of labels.

In [None]:
obj[['b', 'a', 'd']]  # 'b', 'a', 'd' are index labels

In [None]:
obj.iloc[[1, 3]]  # select the second and fourth entries by integer position

In [None]:
obj[obj < 2]

In [None]:
obj['b':'c']

In [None]:
obj['b':'c'] = 5
obj
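
#### The inclusive endpoint applies to label slices only. A minimal sketch contrasting it with positional slicing (by_label and by_position are illustrative names):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

by_label = obj["b":"c"]      # label slice: both 'b' and 'c' are included
by_position = obj.iloc[1:2]  # positional slice: the end position is excluded

print(list(by_label.index))     # ['b', 'c']
print(list(by_position.index))  # ['b']
```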

#### DataFrame indexing is for retrieving 1 or more columns through single value or sequence. 

In [None]:
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4,4)),
                   index = ['Ohio','Colorado','Utah','New York'],
                   columns = ['one','two','three','four'])

data

In [None]:
data['two']

In [None]:
data[['three', 'one']]

#### There are some special cases in this indexing - 
#### 1. Selecting data with Boolean array - This filters rows based on the value of a column and selects all the columns.

In [None]:
data.index

In [None]:
data['three'] > 5  # boolean Series, True where column 'three' exceeds 5

In [None]:
data[data['three'] > 5] # Note select all columns

In [None]:
data

In [None]:
filt = data['three'] > 5
data.loc[ ~filt ]

In [None]:
data.loc[ ~filt , 'four']

In [None]:
# set column 'four' to 55 in the rows where the filter is False
data.loc[ ~filt , 'four'] = 55
data

#### 2. Indexing with a Boolean DataFrame, usually produced by a scalar comparison - This makes the DataFrame behave more like a 2-D NumPy array.

In [None]:
data < 5

In [None]:
data

In [None]:
data[data < 5] = 0
data
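
#### Both special cases can be sketched on the same 4x4 frame used above. The names mask, filtered and zeroed are illustrative; a copy is modified so the original frame stays intact.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Case 1: a boolean Series filters rows and keeps every column.
mask = data["three"] > 5
filtered = data[mask]
print(list(filtered.index))  # ['Colorado', 'Utah', 'New York']

# Case 2: a boolean DataFrame selects individual cells for assignment.
zeroed = data.copy()
zeroed[zeroed < 5] = 0
print(zeroed)
```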

### Selection with loc and iloc
#### These are special indexing operators that select a subset of the rows and columns of a DataFrame with NumPy-like notation.

#### loc selects rows (and/or columns) by label.
#### iloc selects rows (and/or columns) by integer position.

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
                   index = ['Ohio','Colorado','Utah','New York'],
                   columns = ['one','two','three','four'])

data

In [None]:
data.index

In [None]:
data.iloc[2, :]  # third row by integer position, whatever its label (here 'Utah')

In [None]:
data.loc[2, :]  # raises KeyError: 2 is not one of the row labels

In [None]:
data.iloc[2, [3, 0, 1]]  # third row; fourth, first and second columns by integer position

In [None]:
data.loc[2, [3, 0, 1]]  # raises KeyError: loc needs labels, not integer positions

In [None]:
data.loc['Colorado', ['two','three']]

#### Both the operators work with slices as well as single labels and list of labels.

In [None]:
data.loc[:'Utah', 'two']  # label slices include the endpoint, so 'Utah' is selected

In [None]:
data

In [None]:
data.iloc[:, :3][data.three > 5]

In [None]:
A = pd.Series(['one','two' , 'three'], index =[1,3,5])
A

In [None]:
A[1:3]  # integer slices with [] are positional, like iloc

In [None]:
A.loc[1:3]  # label-based: selects labels 1 through 3, endpoint included

In [None]:
A.iloc[1:3]
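
#### With integer labels that differ from the positions, the same slice gives different results under loc and iloc. A self-contained sketch (by_label and by_position are illustrative names):

```python
import pandas as pd

A = pd.Series(["one", "two", "three"], index=[1, 3, 5])

by_label = A.loc[1:3]      # labels 1 through 3, endpoint included
by_position = A.iloc[1:3]  # positions 1 and 2, endpoint excluded

print(by_label.tolist())     # ['one', 'two']
print(by_position.tolist())  # ['two', 'three']
```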

### Integer Indexes
#### Working with pandas objects indexed by integers differs slightly from built-in Python data structures such as lists and tuples in its indexing semantics.
#### This trips up new users. For example, the code below raises an error.

In [None]:
ser = pd.Series(np.arange(3.))
ser[-1]

#### Here pandas could 'fall back' on integer indexing, but that is difficult to do in general without introducing subtle bugs.
#### With an index containing 0, 1, 2, inferring whether the user wants label-based or position-based indexing is difficult. With a non-integer index there is no such ambiguity.
#### Hence, to keep things consistent, indexing on an axis with integer labels is always label oriented.
#### Use loc (labels) or iloc (integer positions) for precise handling.

In [None]:
ser

In [None]:
ser2 = pd.Series(np.arange(3.), index=['a','b','c'])
ser2[-1]

In [None]:
ser

In [None]:
ser[:1]

In [None]:
ser.loc[:1]

In [None]:
ser.iloc[:1]
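
#### The cells above have no recorded output, so here is a sketch of the behaviour they illustrate. The flag raised and the names last, by_label and by_position are illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))  # default integer index 0, 1, 2

# -1 is looked up as a label, and no such label exists.
try:
    ser[-1]
    raised = False
except KeyError:
    raised = True

last = ser.iloc[-1]         # positional access is unambiguous
by_label = ser.loc[:1]      # labels 0 and 1, endpoint included
by_position = ser.iloc[:1]  # position 0 only

print(raised, last, len(by_label), len(by_position))
```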

#### For DataFrames, alignment is performed on both the rows and the columns.
#### So the result of adding 2 DataFrames is a DataFrame whose index and columns are the unions of those of the 2 DataFrames.
#### Missing values will appear at the row and column labels that are not common to both objects.
#### If the 2 DataFrames have no row or column labels in common, the result will contain only NaNs.

In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape((3,3)), columns=list('bcd'), 
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'), 
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [None]:
df1

In [None]:
df2

In [None]:
df1 + df2

In [None]:
df1 = pd.DataFrame({'A': [1,2]})
df2 = pd.DataFrame({'B': [3,4]})

In [None]:
df1

In [None]:
df2

In [None]:
df1 - df2
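
#### A self-contained sketch of the first example above, with the properties of the result spelled out. The name total is illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
                   index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                   index=["Utah", "Ohio", "Texas", "Oregon"])

total = df1 + df2
print(total)

# Only labels present in both frames get a value:
# rows Ohio and Texas, columns b and d. Everything else is NaN.
print(total.loc[["Ohio", "Texas"], ["b", "d"]])
```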

### Arithmetic Methods with Fill Values
#### In arithmetic operations between differently indexed objects, you may want to fill with a special value, such as 0, when an axis label is found in one object but not the other.
#### This involves using the 'add' method with the 'fill_value' argument.

In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3,4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4,5)), columns=list('abcde'))

In [None]:
df2

In [None]:
df1 + df2

In [None]:
df1.add(df2, fill_value=0)  # a value missing from one frame is treated as 0

#### Each arithmetic method of Series and DataFrame has a counterpart starting with the letter r, with the arguments flipped. So 1 / df1 is equivalent to df1.rdiv(1).
#### We can also specify a fill value when reindexing a Series or DataFrame.

In [None]:
df1

In [None]:
1 / df1

In [None]:
df1.rdiv(1)

In [None]:
df1

In [None]:
df1.reindex(columns = df2.columns, fill_value=0)
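
#### A sketch that puts the plain operator, the fill_value variant and the flipped method side by side on the frames used above. The names plain, filled and same are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list("abcde"))

plain = df1 + df2                    # NaN wherever df1 has no matching cell
filled = df1.add(df2, fill_value=0)  # missing df1 cells count as 0

print(plain.loc[0, "e"], filled.loc[0, "e"])  # nan 4.0

# The r-prefixed methods flip the operands: df1.rdiv(1) computes 1 / df1.
same = (1 / df1).equals(df1.rdiv(1))
print(same)  # True
```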

### Operations between DataFrame and Series
#### In NumPy, when you run an operation between a 2-D and a 1-D array, the operation is applied once for each row of the 2-D array.
#### This is referred to as broadcasting.
#### Operations between a DataFrame and a Series are similar.
#### By default, arithmetic between a DataFrame and a Series matches the index of the Series on the DataFrame's columns, broadcasting down the rows.
#### If an index value is not found in either the DataFrame's columns or the Series's index, the objects will be reindexed to form the union.

In [None]:
# For NumPy

arr = np.arange(12.).reshape((3,4))
arr

In [None]:
arr - arr[0]  # the first row is subtracted from every row (broadcasting)

In [51]:
# For DataFrame and Series
frame = pd.DataFrame(np.arange(12.).reshape((4,3)),
                    columns=list('bde'),
                    index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]

In [52]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [53]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [54]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [55]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [56]:
series2 = pd.Series(range(3), index=['b','e','f'])
series2

b    0
e    1
f    2
dtype: int64

In [57]:
frame_D = pd.DataFrame(np.arange(3.).reshape(1,3) ,
                    columns=['b','e','f'])
frame_D

Unnamed: 0,b,e,f
0,0.0,1.0,2.0


In [58]:
series2 = pd.Series(range(3), index=['b','e','f'])
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


#### If you want to match on the rows and broadcast over the columns instead, use one of the arithmetic methods.
#### The axis you pass is the axis to match on. To match on the row index, pass axis='index' or axis=0.

In [59]:
series3 = frame['d']
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [61]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [62]:
frame.sub(series3, axis='index')

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


### Function Application and Mapping
#### NumPy ufuncs can also work with pandas objects.
#### We can also apply user defined functions for 1-D arrays to each row or column using DataFrame's 'apply' method.
#### The function will be invoked once on each column. Result will be Series having Columns of frame as its index.
#### By passing axis='columns', the function will be invoked once per row.

In [63]:
frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'),  # list('bde') is ['b', 'd', 'e']
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,-1.649331,-0.471404,0.009118
Ohio,0.25062,0.4936,1.562551
Texas,0.632421,0.763509,1.906187
Oregon,0.364213,0.346768,0.327184


In [64]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,1.649331,0.471404,0.009118
Ohio,0.25062,0.4936,1.562551
Texas,0.632421,0.763509,1.906187
Oregon,0.364213,0.346768,0.327184


In [65]:
f = lambda x: x.max() - x.min()  # a lambda is a small anonymous function
frame.apply(f)

b    2.281753
d    1.234914
e    1.897069
dtype: float64

In [66]:
frame.apply(f, axis='columns')

Utah      1.658449
Ohio      1.311931
Texas     1.273765
Oregon    0.037030
dtype: float64

#### Many of the most common array statistics (such as sum and mean) are DataFrame methods, so using apply is not necessary.
#### The function passed to apply need not return a scalar value; it can also return a Series with multiple values.
#### Element-wise Python functions can be applied to a DataFrame using 'applymap' (renamed to 'map' in pandas 2.1, where 'applymap' is deprecated).
#### The name 'applymap' comes from the Series method 'map', which applies a function to each element.

In [67]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min','max'])

frame.apply(f)

Unnamed: 0,b,d,e
min,-1.649331,-0.471404,0.009118
max,0.632421,0.763509,1.906187


In [68]:
format = lambda x: '%.2f' % x
frame.map(format)  # DataFrame.applymap is deprecated since pandas 2.1; DataFrame.map is its replacement

Unnamed: 0,b,d,e
Utah,-1.65,-0.47,0.01
Ohio,0.25,0.49,1.56
Texas,0.63,0.76,1.91
Oregon,0.36,0.35,0.33


In [69]:
frame['e'].map(format)

Utah      0.01
Ohio      1.56
Texas     1.91
Oregon    0.33
Name: e, dtype: object

### Sorting and Ranking
#### To sort a dataset lexicographically by row or column label, use the sort_index method, which returns a new, sorted object.
#### With a DataFrame, you can sort by index on either axis.
#### Data is sorted in ascending order by default, but can be sorted in descending order too.

In [70]:
obj = pd.Series(range(4), index=['d','a','b','c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [71]:
frame = pd.DataFrame(np.arange(8).reshape((2,4)),
                    index=['three','one'],
                    columns = ['d','a','b','c'])

frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [72]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [73]:
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


#### To sort a Series by its values, use 'sort_values'.
#### Missing values are sorted to the end of the Series by default.
#### When sorting a DataFrame, you can use the data in 1 or more columns as the sort keys. Do this by passing a column name or a list of column names to the 'by' argument of sort_values.

In [74]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [75]:
frame = pd.DataFrame({'b': [4,7,-3,2], 'a':[0,1,0,1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [76]:
frame.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [77]:
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [78]:
frame.sort_values(by=['a','b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1


#### Ranking assigns ranks from 1 through the number of valid data points in an array.
#### By default, 'rank' breaks ties by assigning each tied group the mean rank.
#### Ranks can also be assigned according to the order in which values are observed, using method='first'. This breaks ties by giving the lower rank number to the value observed first.
#### Ranks can be assigned in descending order too.
#### For a DataFrame, ranks can be computed over the rows or the columns.

In [79]:
obj = pd.Series([7,-5,7,4,2,0,4])
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [80]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [81]:
# Assign tie values the max rank in the group
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

In [82]:
frame = pd.DataFrame({'b': [4.3,7,-3,2], 'a':[0,1,0,1],
                     'c': [-2,5,8,-2.5]})
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [83]:
frame.rank(axis='columns')

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


### Axis Indexes with duplicate Labels
#### All the previous examples have had unique axis labels (index values).
#### Many pandas functions (eg - reindex) require unique labels, but uniqueness is not mandatory.
#### The 'is_unique' property of the index tells you whether its labels are unique.
#### Data selection behaves differently with duplicate labels. Indexing a label with multiple entries returns a Series, while a label with a single entry returns a scalar value.
#### This can make code more complicated, because the output type of indexing varies with whether the label is repeated.
#### The same logic applies to indexing rows in a DataFrame.

In [84]:
obj = pd.Series(range(5), index=['a','a','b','b','c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [85]:
obj.index.is_unique

False

In [86]:
obj['a']

a    0
a    1
dtype: int64

In [87]:
obj['c']

4

In [88]:
df = pd.DataFrame(np.random.randn(4,3), index=['a','a','b','b'])
df

Unnamed: 0,0,1,2
a,-0.650991,0.77874,-0.759846
a,1.142544,0.025798,1.289416
b,-0.884021,-0.827511,-0.851084
b,0.353877,0.559365,-0.525894


In [89]:
df.loc['b']

Unnamed: 0,0,1,2
b,-0.884021,-0.827511,-0.851084
b,0.353877,0.559365,-0.525894


## Summarizing and Computing Descriptive Statistics
#### pandas objects are equipped with a set of common mathematical and statistical methods.
#### Most of these fall into the category of 'reductions' or 'summary statistics', i.e. methods that extract a single value from a Series, or a Series of values from the rows or columns of a DataFrame.
#### Unlike the equivalent NumPy array methods, they have built-in handling for missing data.

#### The sum method of a DataFrame returns a Series containing column sums.
#### Passing axis='columns' or axis=1 sums across the columns instead.
#### NA values are excluded by default; a row or column that is entirely NA sums to 0. This can be disabled with skipna=False, in which case any NA value makes the result NA.

In [90]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                  [np.nan, np.nan], [0.75,-1.3]],
                 index=['a','b','c','d'],
                 columns=['one','two'])

df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [91]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [92]:
df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [93]:
df.mean(axis='columns', skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

#### Some methods (eg - idxmin, idxmax) return indirect statistics, such as the index value where the minimum or maximum value is attained.

In [94]:
df.idxmax()

one    b
two    d
dtype: object

#### Accumulations are another class of methods.
#### Rather than reducing to a single value, they return the running (cumulative) result at each position, as cumsum does below.

In [95]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


#### Some methods are neither reductions nor accumulations.
#### 'describe' is one such method. It produces multiple summary statistics in one shot.
#### For non-numeric data, describe produces alternative summary statistics (count, unique, top and freq).

In [97]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [98]:
obj = pd.Series(['a','a','b','c'] * 4)

In [99]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

### Correlation and Covariance
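#### Some summary statistics (correlation, covariance) are computed from pairs of arguments.
#### This example uses the pandas-datareader package to fetch stock prices and volumes from Quandl. The package is an add-on and must be installed separately (the import below fails without it).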

In [100]:
import pandas_datareader.data as web
import datetime

start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2013, 1, 27)
# use the loop variable 'ticker' as the symbol to download
all_data = {ticker: web.DataReader(ticker, 'quandl', start, end)
        for ticker in ['WIKI/FB','WIKI/AAPL']}

ModuleNotFoundError: No module named 'pandas_datareader'

In [None]:
all_data

In [None]:
price = pd.DataFrame({ticker:data['AdjClose']
                     for ticker, data in all_data.items()})
price.columns = ['FB', 'AAPL']
price.head()

In [None]:
volume = pd.DataFrame({ticker:data['Volume']
                     for ticker, data in all_data.items()})
volume.columns = ['FB', 'AAPL']
volume.head()

#### We can compute percent changes, a time-series operation, using pct_change.
#### The 'corr' method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in 2 Series.
#### Similarly, 'cov' computes the covariance between 2 Series.
#### Since the column names here are valid Python identifiers, the columns can also be selected with attribute syntax (returns.FB). Calling corr or cov on a DataFrame returns a full correlation or covariance matrix.
#### Using 'corrwith' we can compute pairwise correlations between a DataFrame's columns or rows and another Series or DataFrame.
#### Passing a Series to 'corrwith' returns a Series with the correlation value for each column. Passing a DataFrame computes the correlations of matching column names.
#### Passing axis='columns' does things row-by-row instead.

In [None]:
returns = price.pct_change()
returns.tail()

In [None]:
returns['FB'].corr(returns['AAPL'])

In [None]:
returns.FB.corr(returns.AAPL)

In [None]:
returns.corr()

In [None]:
returns.cov()

In [None]:
returns.corrwith(returns.AAPL)

In [None]:
returns.corrwith(volume)
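
#### Because the download above did not run, the following sketch exercises the same methods on synthetic data. The prices are random walks generated with NumPy, and the column names stock_a and stock_b are made up; none of the numbers relate to real securities.

```python
import numpy as np
import pandas as pd

# Synthetic prices: two random walks over 50 business days (assumed data).
rng = np.random.default_rng(0)
dates = pd.date_range("2012-06-01", periods=50, freq="B")
price = pd.DataFrame({
    "stock_a": 30 + rng.normal(0, 1, 50).cumsum(),
    "stock_b": 80 + rng.normal(0, 1, 50).cumsum(),
}, index=dates)

returns = price.pct_change()  # the first row is NaN: there is no previous price

pair_corr = returns["stock_a"].corr(returns["stock_b"])  # one number
corr_matrix = returns.corr()                             # full 2x2 matrix
cov_matrix = returns.cov()
with_a = returns.corrwith(returns["stock_a"])            # one value per column

print(corr_matrix)
print(with_a)
```

#### The off-diagonal entry of corr_matrix equals pair_corr, and the correlation of stock_a with itself in with_a is 1 up to floating-point error.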

### Unique Values, Value Counts, and Membership
#### Another class of related methods extracts information about the values contained in a 1-D Series.
#### 'unique' gives you an array of the unique values in a Series. The values are not necessarily returned in sorted order, but can be sorted afterwards if needed.
#### 'value_counts' returns a Series of value frequencies. By default it is sorted by count in descending order for convenience; this can be turned off with sort=False.
#### It is also available as a top-level pandas function, pd.value_counts, that works with any array or sequence (deprecated since pandas 2.1 in favour of the Series method).

In [None]:
obj = pd.Series(['c','d','a','a','e','b','c','c'])

In [None]:
uniques = obj.unique()

uniques

In [None]:
uniques.sort()
uniques

In [None]:
obj.value_counts()

In [None]:
pd.Series(obj.values).value_counts(sort=False)  # top-level pd.value_counts is deprecated since pandas 2.1
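
#### A self-contained sketch of unique and value_counts on the Series used above, with the results one would expect to see:

```python
import pandas as pd

obj = pd.Series(["c", "d", "a", "a", "e", "b", "c", "c"])

# unique() keeps the order of first appearance rather than sorting.
uniques = obj.unique()
print(list(uniques))  # ['c', 'd', 'a', 'e', 'b']

# value_counts() puts the most frequent value first.
counts = obj.value_counts()
print(counts["c"], counts["a"])  # 3 2
```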

#### 'isin' performs a vectorized set membership check. It is useful for filtering a dataset down to a subset of values in a Series or a column of a DataFrame.
#### The related Index.get_indexer method gives an index array from an array of possibly non-distinct values into another array of distinct values.

In [None]:
mask = obj.isin(['b','c'])
mask

In [None]:
to_match = pd.Series(['c','a','b','b','c','a'])
unique_vals = pd.Series(['c','b','a'])

pd.Index(unique_vals).get_indexer(to_match)
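
#### A sketch of both methods with their results. The name positions is illustrative; each entry is the position in unique_vals of the corresponding value in to_match.

```python
import pandas as pd

obj = pd.Series(["c", "d", "a", "a", "e", "b", "c", "c"])

# isin() yields a boolean mask that can be used to filter the Series.
mask = obj.isin(["b", "c"])
print(obj[mask].tolist())  # ['c', 'b', 'c', 'c']

to_match = pd.Series(["c", "a", "b", "b", "c", "a"])
unique_vals = pd.Series(["c", "b", "a"])

# 'c' is at position 0, 'b' at 1 and 'a' at 2 in unique_vals.
positions = pd.Index(unique_vals).get_indexer(to_match)
print(positions.tolist())  # [0, 2, 1, 1, 0, 2]
```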

#### To compute a histogram on multiple related columns of a DataFrame, we can pass value_counts to the DataFrame's 'apply' method.
#### The row labels of the result are the distinct values occurring in any of the columns, and the values are the counts of those values in each column.

In [None]:
data = pd.DataFrame({'Qu1': [1,3,4,3,4],
                    'Qu2': [2,3,1,2,3],
                    'Qu3': [1,5,2,4,4]})

data

In [None]:
result = data.apply(pd.Series.value_counts).fillna(0)  # pd.Series.value_counts avoids the deprecated top-level pd.value_counts
result
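
#### A self-contained sketch of the per-column histogram, with a few of the counts checked by hand. It uses pd.Series.value_counts as the applied function.

```python
import pandas as pd

data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})

# Count each distinct value per column; values absent from a column become 0.
result = data.apply(pd.Series.value_counts).fillna(0)
print(result)

# The value 3 appears twice in Qu1, twice in Qu2 and never in Qu3.
print(result.loc[3].tolist())  # [2.0, 2.0, 0.0]
```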