https://github.com/wesm/pydata-book/tree/1st-edition

Tab --> autocompletes a function or attribute name
<br>
Shift+Tab --> shows information about a variable or function

In [2]:
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import Series, DataFrame
import seaborn as sns

# used for example for random
from numpy import *
# for matplot
%matplotlib inline

In [3]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

================================================================================================================================
# <font color='green'>File Input and Output with Arrays</font>

```python
np.save
np.load 
```
are the two workhorse functions for efficiently saving and loading
array data on disk. Arrays are saved by default in an uncompressed raw binary format
with the file extension **.npy**.
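
A minimal round-trip sketch of the two functions. The file name `some_array.npy` and the temporary directory are assumptions made only for illustration:

```python
import os
import tempfile

import numpy as np

arr = np.arange(10)

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "some_array.npy")  # hypothetical file name
    np.save(path, arr)      # writes an uncompressed binary .npy file
    loaded = np.load(path)  # reads the array back into memory

print(np.array_equal(arr, loaded))
```

The loaded array should have the same values and dtype as the one that was saved.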

<img src="set_ops.PNG">




# <font color='green'>SLICE and DICE</font>

<font color="#000000">
<ol type="1">
An important first distinction
from lists is that array slices are views on the original array. This means that
<font color="red">the data is not copied, and any modifications to the **view** will be reflected in the source array</font>

As NumPy has
been designed with large data use cases in mind, you could imagine performance and
memory problems if NumPy insisted on copying data left and right.
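
A small sketch of the difference, assuming nothing beyond NumPy itself. The variable names are made up for the illustration:

```python
import numpy as np

arr = np.arange(10)
arr_slice = arr[5:8]   # a view: no data is copied
arr_slice[:] = 12      # writes through to arr

independent = arr[5:8].copy()  # an explicit copy
independent[:] = 0             # arr is not affected

py_list = list(range(10))
list_slice = py_list[5:8]  # slicing a list copies the elements
list_slice[0] = 99         # py_list is not affected

print(arr)
print(py_list)
```

If a copy is wanted instead of a view, it has to be requested explicitly with `.copy()`.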

**<font color='green'>DTYPE</font>**

supported dtypes:
<img src="data_type1.png"> 
<img src="data_type2.png"> 

<font color='green'>**FANCY INDEXING** </font>

# <font color='green'>**==============================================================**</font>

## Difference between DataFrame and Series

Quoting the Pandas docs

```python
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
```

*Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure
(Emphasis mine, sentence fragment not mine)*

So the Series is the data structure for a single column of a DataFrame: a DataFrame behaves like a dict-like collection of Series that share one index, and selecting a single column returns a Series. (Internally, pandas may store columns of the same dtype together in 2D blocks, so the collection-of-Series picture is conceptual rather than a literal memory layout.)

Analogously: we need both lists and matrices, because matrices are built from lists. Single-row matrices, while equivalent to lists in functionality, still cannot exist without the list(s) they're composed of.

They both have extremely similar APIs, but you'll find that DataFrame methods always cater to the possibility that you have more than one column. And of course, you can always add another Series (or equivalent object) to a DataFrame as a new column, while combining two Series as columns involves creating a DataFrame.
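
A short sketch of the relationship described above. The column names and values are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})
col = df["year"]            # selecting one column gives a Series
print(type(col).__name__)

s1 = pd.Series([1, 2, 3], name="a")
s2 = pd.Series([4, 5, 6], name="b")
combined = pd.concat([s1, s2], axis=1)  # two Series side by side give a DataFrame
print(type(combined).__name__)
```

The selected column keeps the DataFrame's index, which is what lets pandas align it again when it is assigned back.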

# <font color='green'>SERIES Data structure</font>

A Series is a one-dimensional array-like object containing an array of data (of any
NumPy data type) and an associated array of data labels, called its <font color="red">index</font>. The simplest
Series is formed from only an array of data:

In [5]:
obj= Series([4,7,-5,3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [6]:
obj.values
obj.index

array([ 4,  7, -5,  3], dtype=int64)

RangeIndex(start=0, stop=4, step=1)

In [7]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

<font color='green'>**SERIES operations** </font>

In [8]:
obj2['a']

-5

In [9]:
#using list of values

obj2[["a","b", "c"]]

a   -5
b    7
c    3
dtype: int64

In [10]:
obj2[obj2>0]

d    4
b    7
c    3
dtype: int64

In [11]:
obj2*2


d     8
b    14
a   -10
c     6
dtype: int64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping
of index values to data values. It can be substituted into many functions that expect a
dict:

In [12]:
'b' in obj2
"e" in obj

True

False

In [13]:
#converting dictionary to Series
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3


Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

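When a dict is passed together with an explicit index, a label with no matching key (such as 'California' in the example that follows) appears as NaN (not a number), which pandas uses to mark missing or NA values. I will use the terms “missing” or “NA”
to refer to missing data. The 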
```python
isnull
notnull 
```

functions in pandas should be used to
detect missing data:

In [14]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4

pd.isnull(obj4)
obj4.isnull()

pd.notnull(obj4)
obj4.notnull()

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

A critical Series feature for many applications is that it automatically aligns differently indexed
data in arithmetic operations:

In [15]:
obj3
obj4
obj3+obj4

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Both the Series object itself and its index have a 
```python
name
```
attribute, which integrates with
other key areas of pandas functionality:

In [16]:
obj4.name = 'population'

obj4.index.name = 'state'

obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [17]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

# <font color='green'>DataFrame Data structure</font>

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered
collection of columns, each of which can be a different value type (numeric,
string, boolean, etc.). The DataFrame has both a row and column index; it can be
thought of as a dict of Series (all sharing the same index). Compared with other
such DataFrame-like structures you may have used before (like R’s data.frame),
<font color="red">row-oriented and column-oriented operations in DataFrame are treated roughly symmetrically</font>.
Columns can be modified by assignment.

<font color='green'>**DataFrame creation** </font>

There are numerous ways to construct a DataFrame, though one of the most common
is <font color='red'>from a dict of equal-length lists or NumPy arrays</font>



In [18]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

frame = DataFrame(data)

frame

Unnamed: 0,pop,state,year
0,1.5,Ohio,2000
1,1.7,Ohio,2001
2,3.6,Ohio,2002
3,2.4,Nevada,2001
4,2.9,Nevada,2002


In [19]:
#You can change column sequence

DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


In [20]:
#As with Series, if you pass a column that isn’t contained in data, it will appear with NA
#values in the result:

frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
    index=['one', 'two', 'three', 'four', 'five'])
frame2

frame2.columns

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


Index([u'year', u'state', u'pop', u'debt'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by
attribute:

In [21]:
frame2['state']
frame2.year

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

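Rows can also be retrieved by position or by name with a couple of methods, such as the

```python
ix
```
indexing field (much more on this later). Note that ix was deprecated in pandas 0.20 and removed in 1.0; current versions use loc for labels and iloc for integer positions.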
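
A minimal sketch of row retrieval with the `loc` and `iloc` indexers available in current pandas. The small frame is a cut-down, made-up variant of `frame2`:

```python
import pandas as pd

frame2 = pd.DataFrame(
    {"year": [2000, 2001, 2002], "state": ["Ohio", "Ohio", "Nevada"]},
    index=["one", "two", "three"],
)

by_label = frame2.loc["three"]  # row selected by index label
by_position = frame2.iloc[2]    # row selected by integer position
print(by_label)
```

Both calls return the same row as a Series whose index is the DataFrame's columns.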

<font color='green'>**Column deletion** </font>

The best way to do this in pandas is to use drop:

```python
df = df.drop('column_name', axis=1)  # axis must be passed by keyword in pandas 2.0+
```
where 1 is the axis number (0 for rows and 1 for columns.)

To delete the column without having to reassign df you can do:

```python
df.drop('column_name', axis=1, inplace=True)
```
Finally, to drop by column number instead of by column label, try this to delete, e.g. the 1st, 2nd and 4th columns:
```python
df.drop(df.columns[[0, 1, 3]], axis=1)  # df.columns is zero-based pd.Index 
```
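
The three variants can be tried on a toy frame. The column names `a` to `d` are placeholders, and the `columns=` keyword form is an equivalent spelling that the notes above do not mention:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

no_b = df.drop("b", axis=1)      # returns a new frame; df is unchanged
no_b_alt = df.drop(columns="b")  # equivalent keyword form
by_position = df.drop(df.columns[[0, 1, 3]], axis=1)  # drops a, b, d

df_copy = df.copy()
df_copy.drop("a", axis=1, inplace=True)  # modifies df_copy in place

print(list(by_position.columns))
```

Without `inplace=True` the original frame is left untouched, so the result has to be assigned to a name to be kept.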

In [22]:
frame2.loc['three']  # .loc replaces .ix, which was removed in pandas 1.0
# a column can be modified by assignment
frame2['debt'] = 16.5

###!!!When assigning lists or arrays to a column, the value’s length must match the length of the DataFrame.

frame2["D"] = np.arange(5)
#frame2["D"] = np.arange(7)  # incorrect as len of range is longer than len of df
frame2
frame2 = frame2.drop(frame2.columns[4], axis=1)
frame2

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

Unnamed: 0,year,state,pop,debt,D
one,2000,Ohio,1.5,16.5,0
two,2001,Ohio,1.7,16.5,1
three,2002,Ohio,3.6,16.5,2
four,2001,Nevada,2.4,16.5,3
five,2002,Nevada,2.9,16.5,4


Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5


In [23]:
###If you assign a Series, it will be instead conformed exactly to the DataFrame’s index, inserting missing values in any holes:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
val

frame2['debt'] = val

frame2

two    -1.2
four   -1.5
five   -1.7
dtype: float64

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7


In [24]:
#Assigning a column that doesn’t exist will create a new column
frame2['eastern'] = frame2.state == 'Ohio'
frame2['Wojtas faflun'] = 0
frame2


Unnamed: 0,year,state,pop,debt,eastern,Wojtas faflun
one,2000,Ohio,1.5,,True,0
two,2001,Ohio,1.7,-1.2,True,0
three,2002,Ohio,3.6,,True,0
four,2001,Nevada,2.4,-1.5,False,0
five,2002,Nevada,2.9,-1.7,False,0


In [25]:
del frame2['eastern']
del frame2['Wojtas faflun']
frame2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')

<font color='green'>**NESTED DICT** </font>

If a nested dict of dicts is passed to DataFrame, pandas interprets the outer dict keys as the columns and the inner
keys as the row indices.

<img src="IMG\df_creation.PNG">


In [26]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
    'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [27]:
### you can always transpose the result

frame3.T

### Like Series, the values attribute returns the data contained in the DataFrame as a 2D ndarray
frame3.values

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

<font color='green'>**INDEX OBJECTS** </font>

pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels used when constructing
a Series or DataFrame is internally converted to an Index

<font color='red'>Index objects are immutable</font> and thus can’t be modified by the user.
Immutability is important so that Index objects can be safely shared among data structures:

In addition to being array-like, an Index also functions as a <font color='red'>fixed-size set</font>

<img src="IMG\pd_index.PNG">
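
A brief sketch of the set-like behaviour, using made-up year labels. Each operation returns a new Index, so the original is never modified:

```python
import pandas as pd

left = pd.Index([2000, 2001, 2002])
right = pd.Index([2002, 2003])

print(2002 in left)  # membership test, as with a set

both = left.intersection(right)     # labels present in both
either = left.union(right)          # labels present in at least one
only_left = left.difference(right)  # labels present only in left

print(list(either))
```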

In [28]:
obj = Series(range(3), index=['a', 'b', 'c'])

obj.index
obj.index[2:]

###!!!Index objects are immutable and thus can’t be modified by the user:
obj.index[2] = 'just checking'  # raises TypeError

Index([u'a', u'b', u'c'], dtype='object')

Index([u'c'], dtype='object')

TypeError: Index does not support mutable operations

In [30]:
index = pd.Index(np.arange(3))

obj2 = Series([1.5, -2.5, 0], index=index)

obj2.index is index

True

<img src = "df_ix_fun.png">

In [31]:
frame3

'Ohio' in frame3.columns

2002 in frame3.index
2003 in frame3.index
frame3.index.append(pd.Index([2003]))


Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


True

True

False

Int64Index([2000, 2001, 2002, 2003], dtype='int64')

=========================================================================================================================
# <font color='green'>Essential functionalities</font>


<font color='green'>**REINDEXING** </font>

A critical method on pandas objects is reindex, which means to create a new object
with the data conformed to a new index.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html

In [34]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

obj


d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [37]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or filling
of values when reindexing. The method option allows us to do this, using a method such
as 
```python
ffill
```
which forward fills the values.

<img src = "IMG\pd_reindex.png">

In [43]:
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
obj3.reindex(range(6), method='ffill')

0      blue
2    purple
4    yellow
dtype: object

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [49]:
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California'])
frame
frame.dtypes

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


Ohio          int32
Texas         int32
California    int32
dtype: object

In [59]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])

#frame2.astype(int)  # it's not possible to change to int, see explanation below
frame2.dtypes


Ohio          float64
Texas         float64
California    float64
dtype: object

<img src = "IMG\pd_Nan.png">

In [61]:
states = ['Texas', 'Utah', 'California']
frame
frame.reindex(columns=states)

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


<img src = "IMG\pd_reix.png">

In [62]:
###!!!Both can be reindexed in one shot, though interpolation will only apply row-wise (axis 0)
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill',
    columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
b,1,,2
c,4,,5
d,7,,8


In [63]:
frame
frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)  # .ix was removed in pandas 1.0; .loc raises KeyError for missing labels

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0


<font color='green'>**Dropping entries from axis** </font>

In [67]:
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
obj.drop('c')


a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [78]:
a = np.arange(16)
a

a.reshape(4,4)


data = DataFrame(np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four'])
data

# removing rows (axis = 0)
data.drop('Colorado')

# removing columns 
data.drop('two', axis = 1)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Utah,8,9,10,11
New York,12,13,14,15


Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


<font color='green'>**Indexing, selection, and filtering** </font>

Slicing with labels behaves differently than normal Python slicing in that the <font color='red'>**endpoint
is inclusive**</font>

In [83]:
# Series indexing works like NumPy array indexing, except that index labels can be used as well as integers
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
obj['b']
obj[1]

obj[2:4]
obj[['b', 'a', 'd']]

obj[obj < 2]

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

1.0

1.0

c    2.0
d    3.0
dtype: float64

b    1.0
a    0.0
d    3.0
dtype: float64

a    0.0
b    1.0
dtype: float64

In [86]:
#end point is inclusive
obj['b':'c']

obj['b':'c'] = 5
obj

b    5.0
c    5.0
dtype: float64

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

In [89]:
data = DataFrame(np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four'])

data
data['two']
data[['three', 'one']]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


<font color='green'>**Indexing with booleans** </font>

In [92]:
data[:2]

data[data['three'] > 6]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [93]:
data < 5
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [100]:
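# .ix was removed in pandas 1.0: use .loc for labels and .iloc for positions
data.loc['Colorado', ['two', 'three']]

data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]]

data.iloc[2]
data.loc[:'Utah', 'two']

data.loc[data.three > 6, data.columns[:3]]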

two      5
three    6
Name: Colorado, dtype: int32

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

Unnamed: 0,one,two,three
Utah,8,9,10
New York,12,13,14


# <font color='green'>Arithmetic and data alignment</font>
One of the most important pandas features is the behavior of arithmetic between objects
with different indexes. When adding together objects, if any index pairs are not
the same, the respective index in the result will be the union of the index pairs.

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [102]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1
s2

s1 + s2

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

The internal data alignment introduces NA values in the indices that don’t overlap.
Missing values propagate in arithmetic computations.

In [105]:
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
    index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon'])

df1
df2
### Adding these together returns a DataFrame whose index and columns are the unions of the ones in each DataFrame
df1 +df2

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


<font color='green'>**Arithmetic methods with fill values**</font>
<br>In arithmetic operations between differently-indexed objects, you might want to fill
with a special value, like 0, when an axis label is found in one object but not the other:

<img src="IMG\pd_flex_arith.png">
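
A small sketch of one of the flexible arithmetic methods, `sub`, with and without a fill value. The two frames are invented and deliberately tiny:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4.).reshape((2, 2)), columns=list("ab"))
df2 = pd.DataFrame(np.arange(1., 7.).reshape((2, 3)), columns=list("abc"))

plain = df1 - df2                    # column "c" exists only in df2, so it becomes NaN
filled = df1.sub(df2, fill_value=0)  # missing df1 entries are treated as 0 first

print(plain)
print(filled)
```

The fill value is substituted before the operation, so column `c` of `filled` holds `0 - df2["c"]` rather than NaN.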

In [109]:
df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

df1
df2

df1 + df2

df1.add(df2, fill_value=0)

df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


<font color='green'>**Operations between DataFrame and Series** </font>

In [112]:
arr = np.arange(12.).reshape((3, 4))
arr

arr[0]
arr - arr[0]

array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

array([ 0.,  1.,  2.,  3.])

array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

In [116]:
frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame 

series = frame.iloc[0]  # first row by position (.ix was removed in pandas 1.0)
series

frame - series

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


<ol type="1">
<li>By default, arithmetic between DataFrame and Series matches the index of the Series
on the DataFrame's columns, broadcasting down the rows.</li>
<li>If an index value is not found in either the DataFrame’s columns or the Series’s index,
the objects will be reindexed to form the union.</li>
</ol>

In [118]:
series2 = Series(range(3), index=['b', 'e', 'f'])
series2
frame + series2

b    0
e    1
f    2
dtype: int64

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [121]:
series3 = frame['d']
frame
series3

frame.sub(series3, axis=0)

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0


# <font color='green'>Function application and mapping</font>
<br> NumPy ufuncs (element-wise array methods) work fine with pandas objects

In [4]:
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

frame
# np.abs returns the element-wise absolute value

np.abs(frame)

Unnamed: 0,b,d,e
Utah,-0.250496,0.708899,-1.056736
Ohio,-1.027431,1.150337,0.391049
Texas,-1.046852,0.142367,-0.521215
Oregon,-0.703335,-0.390194,0.902028


Unnamed: 0,b,d,e
Utah,0.250496,0.708899,1.056736
Ohio,1.027431,1.150337,0.391049
Texas,1.046852,0.142367,0.521215
Oregon,0.703335,0.390194,0.902028


Another frequent operation is applying a function on 1D arrays to each column or row.
DataFrame’s 
```python
apply()
```
method does exactly this:

```python
lambda
```
The lambda operator or lambda function is a way to create small anonymous functions, i.e. functions without a name. These functions are throw-away functions, i.e. they are just needed where they have been created. Lambda functions are mainly used in combination with the functions filter(), map() and reduce(). The lambda feature was added to Python due to the demand from Lisp programmers. 

https://www.python-course.eu/lambda.php
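
A quick sketch of lambdas used with `map()`, `filter()` and `reduce()`. The list of numbers is arbitrary, and in Python 3 `reduce` has to be imported from `functools`:

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

squares = list(map(lambda x: x * x, numbers))        # apply to every element
evens = list(filter(lambda x: x % 2 == 0, numbers))  # keep matching elements
total = reduce(lambda acc, x: acc + x, numbers)      # fold into a single value

print(squares, evens, total)
```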

In [6]:
f = lambda x: x.max() - x.min()
frame.apply(f)

frame.apply(f, axis=1)

b    0.796356
d    1.540531
e    1.958765
dtype: float64

Utah      1.765635
Ohio      2.177767
Texas     1.189219
Oregon    1.605363
dtype: float64

The function passed to apply need not return a scalar value; it can also return a Series
with multiple values:

In [10]:
def f(x): 
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

Unnamed: 0,b,d,e
min,-1.046852,-0.390194,-1.056736
max,-0.250496,1.150337,0.902028


In [11]:
format = lambda x: '%.2f' % x
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,-0.25,0.71,-1.06
Ohio,-1.03,1.15,0.39
Texas,-1.05,0.14,-0.52
Oregon,-0.7,-0.39,0.9


The reason for the name 
```python
applymap()
```
is that Series has a 
```python
map
```
method for applying an element-wise function:

In [16]:
frame['e'].map(format)

Utah      -1.06
Ohio       0.39
Texas     -0.52
Oregon     0.90
Name: e, dtype: object

# <font color='green'>Sorting and ranking</font>
<br> Sorting a data set by some criterion is another important built-in operation. To sort
lexicographically by row or column index, use the 
```python
sort_index
```
method, which returns
a new, sorted object:

In [21]:
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj
obj.sort_index()

d    0
a    1
b    2
c    3
dtype: int64

a    1
b    2
c    3
d    0
dtype: int64

With a DataFrame, you can sort by index on either axis:
<br><br>The data is sorted in ascending order by default, but can be sorted in descending order,
too:

In [23]:
frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])
frame
frame.sort_index()
frame.sort_index(axis = 1)
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


To sort a Series by its values, use its 
```python
sort_values
```
method (called order in pandas releases before 0.17):

In [26]:
obj = Series([4, 7, -3, 2])
obj
obj.sort_values()  # Series.order() was removed; sort_values() replaces it

#Any missing values are sorted to the end of the Series by default:
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

0    4
1    7
2   -3
3    2
dtype: int64

2   -3
3    2
0    4
1    7
dtype: int64



4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

On DataFrame, you may want to sort by the values in one or more columns. To do so,
pass one or more column names to the by option of sort_values:

In [27]:
frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame 
frame.sort_values(by='b')  # sort_index(by=...) was removed; sort_values is the replacement

Unnamed: 0,a,b
0,0,4
1,1,7
2,0,-3
3,1,2


Unnamed: 0,a,b
2,0,-3
3,1,2
0,0,4
1,1,7


In [28]:
#To sort by multiple columns, pass a list of names:
frame.sort_values(by=['a', 'b'])

Unnamed: 0,a,b
2,0,-3
0,0,4
3,1,2
1,1,7


Ranking is closely related to sorting, assigning ranks from one through the number of
valid data points in an array. It is similar to the indirect sort indices produced by
numpy.argsort, except that ties are broken according to a rule. The 

```python
rank()
```
methods for
Series and DataFrame are the place to look; by default, rank breaks ties by assigning
each group the mean rank.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rank.html

<img src="IMG\pd_rank.PNG">

In [35]:
# returns the rank of each element, i.e. 1 for the item that should come first
obj = Series([7, -5, 7, 4, 2, 0, 4])
obj
obj.rank()
#Ranks can also be assigned according to the order they’re observed in the data:
obj.rank(method='first') #ranks assigned in order they appear in the array
obj.rank(method='min') #lowest rank in group
obj.rank(method='average') #average rank in group

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

0    6.0
1    1.0
2    6.0
3    4.0
4    3.0
5    2.0
6    4.0
dtype: float64

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [36]:
#Naturally, you can rank in descending order, too:
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

In [37]:
frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})
frame 
frame.rank(axis=1)

Unnamed: 0,a,b,c
0,0,4.3,-2.0
1,1,7.0,5.0
2,0,-3.0,8.0
3,1,2.0,-2.5


Unnamed: 0,a,b,c
0,2.0,3.0,1.0
1,1.0,3.0,2.0
2,2.0,1.0,3.0
3,2.0,3.0,1.0


# <font color='green'>Axis indexes with duplicate values</font>
Up until now all of the examples I’ve shown you have had unique axis labels (index
values). While many pandas functions (like reindex) require that the labels be unique,
it’s not mandatory. Let’s consider a small Series with duplicate indices:

In [39]:
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

#The index’s is_unique property can tell you whether its values are unique or not:
obj.index.is_unique

#Data selection is one of the main things that behaves differently with duplicates. 
#Indexing a value with multiple entries returns a Series while single entries return a scalar value:
obj['a']
obj['c']

a    0
a    1
b    2
b    3
c    4
dtype: int64

False

a    0
a    1
dtype: int64

4

In [40]:
#The same logic extends to indexing rows in a DataFrame:
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
df.loc['b']

Unnamed: 0,0,1,2
a,0.393275,0.987598,0.545625
a,0.224793,-3.188334,-0.184531
b,0.043456,1.424314,0.077632
b,0.345386,0.099457,-0.499891


Unnamed: 0,0,1,2
b,0.043456,1.424314,0.077632
b,0.345386,0.099457,-0.499891


# <font color='green'>Summarizing and Computing Descriptive Statistics</font>
