# PANDAS 

## Series 

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index.

In [1]:
import pandas as pd

In [2]:
obj = pd.Series([4, 7, -5, 3]);obj

0    4
1    7
2   -5
3    3
dtype: int64

You can get the array representation and index object of the Series via its values and index attributes, respectively:

In [3]:
type(obj)

pandas.core.series.Series

In [8]:
obj.values

array([ 4,  7, -5,  3])

In [26]:
obj.index

RangeIndex(start=0, stop=4, step=1)

You can assign your own index labels to the data points:

In [10]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [11]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

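You can use labels in the index to select a single value or a set of values. A set of labels must be passed as a list; comma-separated labels form a tuple, which pandas treats as a single key, so the first cell below raises a KeyError: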

In [13]:
obj2['a','b']

KeyError: 'key of type tuple not found and not a MultiIndex'

In [14]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64
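
The difference between the two calls above can be sketched as follows, using the same values as obj2. This is a minimal illustration; the variable names single, subset, and tuple_lookup_failed are introduced here and are not part of the original notebook.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

single = obj2['a']           # one label -> scalar value
subset = obj2[['a', 'b']]    # list of labels -> Series, in the requested order

try:
    obj2['a', 'b']           # comma-separated labels form the tuple ('a', 'b')
    tuple_lookup_failed = False
except KeyError:
    tuple_lookup_failed = True   # the tuple is looked up as a single, missing key
```

Wrapping the labels in a list is what tells pandas that several entries are requested.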

Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [17]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [18]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

In [16]:
obj2 > 0

d     True
b     True
a    False
c     True
dtype: bool

In [19]:
obj2*2

d     8
b    14
a   -10
c     6
dtype: int64
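
As a small sketch of what "preserving the index-value link" means, the snippet below applies a NumPy function, a boolean filter, and arithmetic to the same data and keeps the labels attached throughout. The variable names are chosen for this illustration only.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

absolute = np.abs(obj2)      # NumPy function: returns a Series with the same labels
positive = obj2[obj2 > 0]    # boolean filter: surviving values keep their labels
total = obj2 + obj2 * 2      # arithmetic between Series matches values by label
```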

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. 
<br> It can be used in many contexts where you might use a dict:

In [20]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [21]:
 'b' in obj2

True

In [22]:
 'e' in obj2

False

In [23]:
4 in obj2

False
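
The last result may look surprising, since 4 is one of the values. Like a dict, the `in` operator on a Series checks the index labels, not the data. A short sketch, with variable names introduced only for this example:

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

label_present = 'b' in obj2                      # True: 'b' is an index label
value_as_label = 4 in obj2                       # False: 4 is a value, not a label
value_present = 4 in obj2.values                 # True: searches the underlying array
value_present_isin = bool(obj2.isin([4]).any())  # same check, staying within pandas
```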

If you have a dictionary, you can create a Series from it:

In [28]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [29]:
obj3 = pd.Series(sdata)

In [30]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When you pass only a dict, the index of the resulting Series is taken from the dict's keys. In the output above they appear in insertion order; older pandas versions sorted them instead. 
<br> You can override this by passing the dict keys in the order you want them to appear in the resulting Series:

In [31]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [32]:
obj4 = pd.Series(sdata, index=states);obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Here, three values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which pandas uses to mark missing or NA values. 
<br> Since 'Utah' was not included in states, it is excluded from the resulting object.

The isnull and notnull functions in pandas should be used to detect missing data:

In [None]:
array.sum()
np.sum(array)

In [33]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [34]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series also has these as instance methods:

In [35]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
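
A common use of these boolean results is to count or filter missing entries. The sketch below rebuilds obj4 from the same data; isna and notna are aliases of isnull and notnull in pandas, and the variable names are illustrative.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)   # 'California' has no value -> NaN

missing = obj4.isna()                   # same result as obj4.isnull()
n_missing = int(missing.sum())          # True counts as 1, so this counts the NaNs
present = obj4[obj4.notna()]            # keep only the labels that have data
```

Note that the single NaN is also why obj4 has dtype float64 even though every value in sdata is an integer.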

Both the Series object itself and its index have a name attribute, which can be important in other parts of pandas:

In [36]:
obj4.name = 'population'

In [37]:
obj4.index.name = 'state'

In [38]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

## DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

To create a DataFrame, call the pd.DataFrame() constructor:

In [39]:
df=pd.DataFrame([[1,2,3],[4,5,6]]);df

Unnamed: 0,0,1,2
0,1,2,3
1,4,5,6


Since we did not explicitly set the index and columns, pandas assigned integer labels starting from 0 to both.

In [40]:
df=pd.DataFrame([[1,2,3],[4,5,6]],index=["a","b"],columns=list("klm"));df

Unnamed: 0,k,l,m
a,1,2,3
b,4,5,6


There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

In [60]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
            'year': [2000, 2001, 2002, 2001, 2002, 2003],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [61]:
data

{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002, 2003],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [62]:
frame = pd.DataFrame(data);frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [59]:
frame["state"]

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [54]:
frame["pop"]

0    1.5
1    1.7
2    3.6
3    2.4
4    2.9
5    3.2
Name: pop, dtype: float64

# Getting columns of a dataframe

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:

In [48]:
frame['state']

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [45]:
frame.state

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

Get the year column:

In [46]:
frame["year"]

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

In [47]:
frame.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

frame[column] works for any column name, but frame.column only works when the column name is a valid Python variable name.
<br> For example, if the column name contains a space, frame.column will not work.
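
The sketch below illustrates this rule and one further pitfall: a column whose name matches an existing DataFrame attribute, such as 'pop', is hidden by that attribute when accessed with dot notation. The small frame and the 'state new' column are hypothetical and exist only for this example.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'pop': [1.5, 2.4],
                      'state new': ['OH', 'NV']})   # hypothetical column with a space

same = frame.state.equals(frame['state'])   # both forms return the same Series
spaced = frame['state new']                 # only bracket access can express this name
valid_name = 'state new'.isidentifier()     # False, so frame.<name> cannot be written
shadowed = callable(frame.pop)              # True: frame.pop is the DataFrame.pop method
pop_col = frame['pop']                      # bracket access returns the actual column
```

For this reason bracket notation is the safer default; attribute access is mainly a convenience for interactive use.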

Columns can be modified by assignment

In [63]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [64]:
frame['debt'] = 16.5 ; frame

Unnamed: 0,state,year,pop,debt
0,Ohio,2000,1.5,16.5
1,Ohio,2001,1.7,16.5
2,Ohio,2002,3.6,16.5
3,Nevada,2001,2.4,16.5
4,Nevada,2002,2.9,16.5
5,Nevada,2003,3.2,16.5


When you are assigning lists or arrays to a column, the value’s length must match the length of the DataFrame

In [65]:
import numpy as np

In [70]:
frame['debt'] = np.arange(10,16) ; frame

Unnamed: 0,state,year,pop,debt
0,Ohio,2000,1.5,10
1,Ohio,2001,1.7,11
2,Ohio,2002,3.6,12
3,Nevada,2001,2.4,13
4,Nevada,2002,2.9,14
5,Nevada,2003,3.2,15


Try the cell above with np.arange(10, 17): it raises a ValueError because 7 values cannot be assigned to 6 rows.
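
The length rule can be checked with a short sketch. It uses a reduced, illustrative frame with six rows and catches the error so that the snippet runs to completion.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002, 2001, 2002, 2003]})

frame['debt'] = np.arange(10, 16)        # 6 values for 6 rows: accepted

try:
    frame['debt'] = np.arange(10, 17)    # 7 values for 6 rows
    error = None
except ValueError as exc:
    error = exc                          # pandas rejects the mismatched length
```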

If you assign a Series, its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any holes:

In [71]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five']); val

two    -1.2
four   -1.5
five   -1.7
dtype: float64

First, create a DataFrame with a labeled index and name it frame2:

In [72]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                          index=['one', 'two', 'three', 'four',
                                     'five', 'six'])

In [79]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,

Because 'debt' is not a key in data, the new column contains only missing values.


Assign np.arange(1,7) to the debt column:

In [75]:
frame2.debt=np.arange(1,7)

Now assign the val Series to the debt column:

In [77]:
val

two    -1.2
four   -1.5
five   -1.7
dtype: float64

In [78]:
frame2.debt=val

In [80]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,
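
To make the alignment rule concrete, the following sketch contrasts assigning a Series (matched by label) with assigning its underlying array (matched by position). The three-row frame and the label 'ten' are hypothetical and chosen only to keep the example small.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame({'year': [2000, 2001, 2002]},
                      index=['one', 'two', 'three'])
val = pd.Series([-1.2, -1.5, 9.9], index=['two', 'three', 'ten'])

frame2['debt'] = val                      # by label: 'one' gets NaN, 'ten' is ignored
frame2['debt_by_position'] = val.values   # plain array: assigned row by row instead
```

Labels in val that do not occur in the DataFrame's index are dropped rather than added as new rows.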


The del keyword will delete columns as with a dict.

In [81]:
del frame2["debt"];frame2

Unnamed: 0,year,state,pop
one,2000,Ohio,1.5
two,2001,Ohio,1.7
three,2002,Ohio,3.6
four,2001,Nevada,2.4
five,2002,Nevada,2.9
six,2003,Nevada,3.2


To create a new column, assign to a column name that does not exist yet:

In [82]:
frame2["size"]=np.arange(100,700,100);frame2

Unnamed: 0,year,state,pop,size
one,2000,Ohio,1.5,100
two,2001,Ohio,1.7,200
three,2002,Ohio,3.6,300
four,2001,Nevada,2.4,400
five,2002,Nevada,2.9,500
six,2003,Nevada,3.2,600


frame2.columns and frame2.index return the column labels and the row index, and frame2.values returns the data as a two-dimensional array:

In [86]:
frame2.columns

Index(['year', 'state', 'pop', 'size'], dtype='object')

In [84]:
frame2.index

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

In [85]:
frame2.values

array([[2000, 'Ohio', 1.5, 100],
       [2001, 'Ohio', 1.7, 200],
       [2002, 'Ohio', 3.6, 300],
       [2001, 'Nevada', 2.4, 400],
       [2002, 'Nevada', 2.9, 500],
       [2003, 'Nevada', 3.2, 600]], dtype=object)

If the DataFrame’s columns have different dtypes, the dtype of the values array is chosen to accommodate all of the columns. That is why the array above has dtype=object: it has to hold integers, strings, and floats.
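
A minimal sketch of this behavior, using two small illustrative frames: one with only numeric columns and one that mixes numbers and strings.

```python
import pandas as pd

numeric = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5]})
mixed = pd.DataFrame({'a': [1, 2], 'state': ['Ohio', 'Nevada']})

numeric_dtype = numeric.values.dtype   # float64: the integers are upcast to floats
mixed_dtype = mixed.values.dtype       # object: no numeric type can hold the strings
column_dtypes = mixed.dtypes           # the frame itself still tracks a dtype per column
```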

## Possible data inputs to DataFrame constructor

[Figure: table of possible data inputs to the DataFrame constructor (screenshot)]

## Reindexing

An important method on pandas objects is reindex, which means to create a new object with the data conformed to a new index.

In [87]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c']);obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [89]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e',100]);obj2

a     -5.3
b      7.2
c      3.6
d      4.5
e      NaN
100    NaN
dtype: float64

With DataFrame, reindex can alter either the (row) index, columns, or both. 
<br> When passed only a sequence, it reindexes the rows in the result:

In [90]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                             index=['a', 'c', 'd'],
                            columns=['Ohio', 'Texas', 'California'])

In [91]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [92]:
frame2 = frame.reindex(['a', 'b', 'c', 'd']); frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


The columns can be reindexed with the columns keyword:

In [93]:
states = ['Texas', 'Utah', 'California']

In [94]:
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8
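
Rows and columns can also be reindexed in a single call by passing both the index and columns keywords. The sketch below rebuilds the same frame; the name both is introduced here for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# New row 'b' and new column 'Utah' have no data, so they are filled with NaN.
both = frame.reindex(index=['a', 'b', 'c', 'd'],
                     columns=['Texas', 'Utah', 'California'])
```

reindex returns a new object, so frame itself keeps its original three rows and columns.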


## Dropping Entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. 
<br> The drop method returns a new object with the indicated value or values deleted from an axis:

In [95]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e']);obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [96]:
 new_obj = obj.drop('c') ; new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [97]:
 data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                           index=['Ohio', 'Colorado', 'Utah', 'New York'],
                           columns=['one', 'two', 'three', 'four'])


In [98]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [99]:
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


You can drop values from the columns by passing axis=1 or axis='columns':

In [101]:
data.drop('two',axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


Drop columns "one" and "four":

In [102]:
data.drop(["one","four"],axis=1)

Unnamed: 0,two,three
Ohio,1,2
Colorado,5,6
Utah,9,10
New York,13,14


In [103]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Many functions, like drop, which modify the size or shape of a Series or DataFrame, can manipulate an object in-place without returning a new object:

In [104]:
data.drop(['two', 'three'], axis='columns',inplace=True)

In [105]:
data

Unnamed: 0,one,four
Ohio,0,3
Colorado,4,7
Utah,8,11
New York,12,15
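
The difference between the two modes can be sketched as follows: without inplace, drop leaves the original untouched and returns the result; with inplace=True it modifies the original and returns None. Variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

without_two = data.drop('two', axis=1)          # new object; data still has 'two'
returned = data.drop(['Ohio'], inplace=True)    # data is modified; nothing is returned
```

Be careful with inplace=True: the dropped data cannot be recovered from the object afterwards.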


## Indexing, Selection, and Filtering

In [106]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                           index=['Ohio', 'Colorado', 'Utah', 'New York'],
                           columns=['one', 'two', 'three', 'four'])

In [114]:
data[4]="test"

In [115]:
data

Unnamed: 0,one,two,three,four,4
Ohio,0,1,2,3,test
Colorado,4,5,6,7,test
Utah,8,9,10,11,test
New York,12,13,14,15,test


In [108]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [117]:
data[:4]

Unnamed: 0,one,two,three,four,4
Ohio,0,1,2,3,test
Colorado,4,5,6,7,test
Utah,8,9,10,11,test
New York,12,13,14,15,test


The row selection syntax data[:n] (data[:4] above returns all four rows) is provided as a convenience. 
<br> Passing a single label or a list of labels to the [] operator selects columns.

## Boolean indexing

In [118]:
data

Unnamed: 0,one,two,three,four,4
Ohio,0,1,2,3,test
Colorado,4,5,6,7,test
Utah,8,9,10,11,test
New York,12,13,14,15,test


In [119]:
data['three'] > 5

Ohio        False
Colorado     True
Utah         True
New York     True
Name: three, dtype: bool

In [124]:
data[(data['three'] > 5) & (data['three'] < 10)]

Unnamed: 0,one,two,three,four,4
Colorado,4,5,6,7,test


## Selection with loc and iloc

We can select a subset of rows or columns:
<br> iloc: selection by integer position
<br> loc: selection by label

In [128]:
del data[4]

In [129]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [126]:
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int64

In [130]:
 data.loc['Colorado']

one      4
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [133]:
 data.loc[['Colorado',"Utah"]]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11


In [134]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [135]:
 data.loc[:,['two',"four"]]

Unnamed: 0,two,four
Ohio,1,3
Colorado,5,7
Utah,9,11
New York,13,15


Get the Ohio and Utah rows:

In [141]:
data.loc[["Ohio","Utah"]]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Utah,8,9,10,11


Get only the "one" and "three" columns:

In [143]:
data[["one","three"]]

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


In [142]:
data.loc[:,["one","three"]]

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


Get the Ohio and Utah rows and the "one" and "three" columns:

In [144]:
data.loc[["Ohio","Utah"],["one","three"]]

Unnamed: 0,one,three
Ohio,0,2
Utah,8,10


In [145]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Get the first two rows:

In [149]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [150]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

Get the last two columns:

In [151]:
data.iloc[:,-2:]

Unnamed: 0,three,four
Ohio,2,3
Colorado,6,7
Utah,10,11
New York,14,15


Get the first two rows and the last two columns:

In [152]:
data.iloc[:2,-2:]

Unnamed: 0,three,four
Ohio,2,3
Colorado,6,7


# Function Application and Mapping

frame.apply() applies a function that operates on one-dimensional arrays to each column (the default) or, with axis='columns', to each row.

In [153]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                         index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [154]:
frame

Unnamed: 0,b,d,e
Utah,-0.424612,0.800143,0.447373
Ohio,0.060922,0.329245,0.707019
Texas,-1.46712,0.587612,0.152058
Oregon,0.201901,-1.822695,-0.147912


In [155]:
frame.min()

b   -1.467120
d   -1.822695
e   -0.147912
dtype: float64

In [156]:
frame.max()

b    0.201901
d    0.800143
e    0.707019
dtype: float64

In [159]:
frame

Unnamed: 0,b,d,e
Utah,-0.424612,0.800143,0.447373
Ohio,0.060922,0.329245,0.707019
Texas,-1.46712,0.587612,0.152058
Oregon,0.201901,-1.822695,-0.147912


In [157]:
f = lambda x: x.max() - x.min()

In [161]:
f(frame.b)

1.6690215318917496

In [158]:
frame.apply(f)

b    1.669022
d    2.622838
e    0.854930
dtype: float64

In [165]:
frame.apply(lambda x: x.max() - x.min())

b    1.669022
d    2.622838
e    0.854930
dtype: float64

In [166]:
frame

Unnamed: 0,b,d,e
Utah,-0.424612,0.800143,0.447373
Ohio,0.060922,0.329245,0.707019
Texas,-1.46712,0.587612,0.152058
Oregon,0.201901,-1.822695,-0.147912


In [167]:
frame.apply(f,axis="columns")

Utah      1.224755
Ohio      0.646097
Texas     2.054732
Oregon    2.024596
dtype: float64

In [168]:
frame.apply(f,axis=1)

Utah      1.224755
Ohio      0.646097
Texas     2.054732
Oregon    2.024596
dtype: float64
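
Because the frame above contains random numbers, the effect of the axis argument is easier to verify on fixed values. The following sketch uses a small hand-written frame; spread is an illustrative name for the same max-minus-min function.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, 4.0], 'd': [3.0, -2.0]},
                     index=['Utah', 'Ohio'])

spread = lambda x: x.max() - x.min()

per_column = frame.apply(spread)           # one result per column: b -> 3.0, d -> 5.0
per_row = frame.apply(spread, axis=1)      # one result per row: Utah -> 2.0, Ohio -> 6.0
builtin = frame.max() - frame.min()        # same as per_column, without apply
```

For common reductions such as this one, the built-in methods are usually simpler and faster than apply.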

The function passed to apply need not return a scalar value; it can also return a Series with multiple values:

In [169]:
frame

Unnamed: 0,b,d,e
Utah,-0.424612,0.800143,0.447373
Ohio,0.060922,0.329245,0.707019
Texas,-1.46712,0.587612,0.152058
Oregon,0.201901,-1.822695,-0.147912


In [172]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

In [173]:
f(frame.b)

min   -1.467120
max    0.201901
dtype: float64

In [174]:
f(frame.d)

min   -1.822695
max    0.800143
dtype: float64

In [175]:
f(frame.e)

min   -0.147912
max    0.707019
dtype: float64

In [176]:
frame.apply(f)

Unnamed: 0,b,d,e
min,-1.46712,-1.822695,-0.147912
max,0.201901,0.800143,0.707019


## To apply a function to every element, use applymap()

In [177]:
frame

Unnamed: 0,b,d,e
Utah,-0.424612,0.800143,0.447373
Ohio,0.060922,0.329245,0.707019
Texas,-1.46712,0.587612,0.152058
Oregon,0.201901,-1.822695,-0.147912


In [178]:
myfunct=lambda x: x**2

In [185]:
def f(x,y=2):
    return x**y
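
The cells above define the functions but do not call applymap, so here is a minimal sketch of the element-wise call on a small frame with fixed values. In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map; the hasattr check below is one possible way to support both and is an assumption of this example, not part of the original notebook.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, -2.0], 'd': [3.0, 0.5]},
                     index=['Utah', 'Ohio'])
myfunct = lambda x: x ** 2

# Use DataFrame.map where it exists (pandas 2.1+), otherwise fall back to applymap.
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
squared = elementwise(myfunct)     # myfunct is called once per cell

vectorized = frame ** 2            # same result using vectorized arithmetic
```

When an equivalent vectorized expression exists, as it does here, it is generally preferable to calling a Python function per element.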

## To apply a function element-wise on a Series, use map()

In [190]:
 frame.e.map(f,y=3)

TypeError: map() got an unexpected keyword argument 'y'

In [189]:
 frame.e.apply(f,y=3)

Utah      0.089538
Ohio      0.353422
Texas     0.003516
Oregon   -0.003236
Name: e, dtype: float64
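
If you want to stay with map but still need the extra argument, one option is to wrap the call in a lambda so that map receives a one-argument function. This is a sketch with fixed illustrative values rather than the random column used above.

```python
import pandas as pd

def f(x, y=2):
    return x ** y

e = pd.Series([2.0, -1.0, 0.5], index=['Utah', 'Ohio', 'Texas'])

via_apply = e.apply(f, y=3)                 # apply forwards keyword arguments to f
via_map = e.map(lambda x: f(x, y=3))        # map gets a function of one argument
```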

# Sorting

## Sorting index 

In [191]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

In [192]:
obj

d    0
a    1
b    2
c    3
dtype: int64

In [193]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

Try sorting an index with mixed data types (strings and integers):

In [194]:
obj = pd.Series(range(8), index=['d', 'a', 'b', 'c',4,2,3,1])

In [195]:
obj

d    0
a    1
b    2
c    3
4    4
2    5
3    6
1    7
dtype: int64

In [196]:
obj.sort_index()

TypeError: '<' not supported between instances of 'int' and 'str'
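
The error arises because Python cannot compare an int with a str. One possible workaround, sketched below, is to convert all labels to strings before sorting; whether that ordering is appropriate depends on your data, and the variable names are illustrative.

```python
import pandas as pd

obj = pd.Series(range(8), index=['d', 'a', 'b', 'c', 4, 2, 3, 1])

as_text = obj.copy()
as_text.index = as_text.index.map(str)   # every label becomes a string
sorted_obj = as_text.sort_index()        # strings sort without error; digits come first
```

Note that the integer labels are now the strings '1' to '4', so later lookups must use the string form.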

In [197]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                            index=['three', 'one'],
                         columns=['d', 'a', 'b', 'c'])

In [198]:
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [199]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [200]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [201]:
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


By default, sort_index returns a new object and leaves the original unchanged. Pass inplace=True to sort in place:

In [202]:
frame.sort_index(axis=1,inplace=True)

In [203]:
frame

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


## Sorting by values 

In [204]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

In [205]:
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [206]:
frame.sort_values(by="a")

Unnamed: 0,b,a
0,4,0
2,-3,0
1,7,1
3,2,1


To sort by multiple columns, pass a list of names:

In [207]:
frame.sort_values(by=['a', 'b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
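
The sort direction can be controlled with the ascending argument, which accepts either a single boolean or one boolean per column. A short sketch on the same frame, with illustrative variable names:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

descending = frame.sort_values(by='b', ascending=False)    # largest b first
mixed_order = frame.sort_values(by=['a', 'b'],
                                ascending=[True, False])   # a ascending, then b descending
```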


## Axis Indexes with Duplicate Labels

In [208]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

In [209]:
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

The index’s is_unique property can tell you whether its labels are unique or not:

In [213]:
obj.index.is_unique

False

Indexing a label with multiple entries returns a Series, while single entries return a scalar value:

In [214]:
obj['a']

a    0
a    1
dtype: int64

In [215]:
obj['c']

4

This can make your code more complicated, as the output type from indexing can
vary based on whether a label is repeated or not.
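
One way to avoid the varying return type is to pass the label inside a list, which always yields a Series regardless of how often the label occurs. A minimal sketch with illustrative variable names:

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

repeated = obj['a']              # label occurs twice -> Series
single = obj['c']                # label occurs once -> scalar
always_series = obj.loc[['c']]   # list of labels -> Series, even for one match
```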

In [216]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])

In [217]:
df

Unnamed: 0,0,1,2
a,-0.786853,-1.465117,0.534873
a,0.012777,-0.585877,0.308355
b,1.017513,0.884971,0.069208
b,-0.959367,-2.077474,0.196765


In [218]:
df.loc['b']

Unnamed: 0,0,1,2
b,1.017513,0.884971,0.069208
b,-0.959367,-2.077474,0.196765


# Descriptive Statistics

In [219]:
df = pd.DataFrame([[2.0, np.nan], [7, -4], [np.nan, np.nan], [1, -2]],
index=['a', 'b', 'c', 'd'],columns=['one', 'two'])

In [220]:
df

Unnamed: 0,one,two
a,2.0,
b,7.0,-4.0
c,,
d,1.0,-2.0


Calling DataFrame’s sum method returns a Series containing column sums:

In [221]:
df.sum()

one    10.0
two    -6.0
dtype: float64

Passing axis='columns' or axis=1 sums across the columns instead:

In [222]:
df.sum(axis=1)

a    2.0
b    3.0
c    0.0
d   -1.0
dtype: float64

In [223]:
df

Unnamed: 0,one,two
a,2.0,
b,7.0,-4.0
c,,
d,1.0,-2.0


NA values are skipped by default. For sum, a row or column that is entirely NA therefore totals 0.0 (row c above), whereas mean returns NaN for it. 
<br> Skipping can be disabled with the skipna option:

In [224]:
 df.mean(axis='columns', skipna=False)

a    NaN
b    1.5
c    NaN
d   -0.5
dtype: float64
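
The interaction between skipna and all-NA rows can be sketched on the same data. The variable names are introduced for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[2.0, np.nan], [7, -4], [np.nan, np.nan], [1, -2]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

row_sum = df.sum(axis=1)                    # NaN skipped; all-NaN row 'c' sums to 0.0
row_mean = df.mean(axis=1)                  # NaN skipped; all-NaN row 'c' stays NaN
strict_sum = df.sum(axis=1, skipna=False)   # any NaN in a row makes the result NaN
```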

describe produces multiple summary statistics in one shot:

In [226]:
df

Unnamed: 0,one,two
a,2.0,
b,7.0,-4.0
c,,
d,1.0,-2.0


In [225]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.333333,-3.0
std,3.21455,1.414214
min,1.0,-4.0
25%,1.5,-3.5
50%,2.0,-3.0
75%,4.5,-2.5
max,7.0,-2.0


### Descriptive and summary statistics

[Figure: table of descriptive and summary statistics methods (screenshot)]

Example for pct_change:

In [227]:
df = pd.DataFrame({
  'FR': [4.0405, 4.0963, 4.3149],
   'GR': [1.7246, 1.7482, 1.8519],
    'IT': [804.74, 810.01, 860.13]},
   index=['1980-01-01', '1980-02-01', '1980-03-01'])

In [228]:
df

Unnamed: 0,FR,GR,IT
1980-01-01,4.0405,1.7246,804.74
1980-02-01,4.0963,1.7482,810.01
1980-03-01,4.3149,1.8519,860.13


In [229]:
df.pct_change()

Unnamed: 0,FR,GR,IT
1980-01-01,,,
1980-02-01,0.01381,0.013684,0.006549
1980-03-01,0.053365,0.059318,0.061876


The periods argument sets how many rows to shift back when forming the percent change:

In [230]:
df.pct_change(periods=2)

Unnamed: 0,FR,GR,IT
1980-01-01,,,
1980-02-01,,,
1980-03-01,0.067912,0.073814,0.06883
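
To see what pct_change computes, the sketch below reproduces it by dividing each row by the row a given number of periods earlier and subtracting 1. The names manual_1 and manual_2 are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'FR': [4.0405, 4.0963, 4.3149],
                   'GR': [1.7246, 1.7482, 1.8519],
                   'IT': [804.74, 810.01, 860.13]},
                  index=['1980-01-01', '1980-02-01', '1980-03-01'])

manual_1 = df / df.shift(1) - 1    # change relative to the previous row
manual_2 = df / df.shift(2) - 1    # change relative to the row two steps back

matches_1 = np.allclose(manual_1.values, df.pct_change().values, equal_nan=True)
matches_2 = np.allclose(manual_2.values, df.pct_change(periods=2).values, equal_nan=True)
```

The leading rows are NaN because there is no earlier row to compare against.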


In [231]:
np.random.seed(1)
df=pd.DataFrame(np.random.randint(5,10,(4,4)),index=list("abcd"),columns=list("xyzt"));df

Unnamed: 0,x,y,z,t
a,8,9,5,6
b,8,5,5,6
c,9,9,6,7
d,9,7,9,8


In [232]:
df.diff()

Unnamed: 0,x,y,z,t
a,,,,
b,0.0,-4.0,0.0,0.0
c,1.0,4.0,1.0,1.0
d,0.0,-2.0,3.0,1.0


In [233]:
df.diff(axis="columns")

Unnamed: 0,x,y,z,t
a,,1,-4,1
b,,-3,0,1
c,,0,-3,1
d,,-2,2,-1


# Unique Values, Value Counts, and Membership

In [234]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c']);obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [235]:
uniques = obj.unique()

In [236]:
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [237]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

isin performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series or column in a DataFrame

In [240]:
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [238]:
mask = obj.isin(['b', 'c'])

In [241]:
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [242]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
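
The mask can also be inverted to keep everything except the listed values, and value_counts results can be looked up by label. A brief sketch on the same data, with illustrative variable names:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

counts = obj.value_counts()      # frequency of each distinct value
mask = obj.isin(['b', 'c'])      # True where the value is 'b' or 'c'
kept = obj[mask]                 # rows matching the set
excluded = obj[~mask]            # rows not matching the set
```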