# Basic datatypes in Pandas

Documentation sources:
* https://www.tutorialspoint.com/python_pandas/
* http://www.datasciencemadesimple.com/hierarchical-indexing-multiple-indexing-python-pandas/
* https://riptutorial.com/pandas/example/9041/reading-financial-data--for-multiple-tickers--into-pandas-panel---demo


There are two basic datatypes in Pandas that are essential for machine learning:
* `Series` is a one-dimensional array that can contain elements of the same type
* `DataFrame` is a two-dimensional array that has row and column indices for more natural indexing

Both datatypes use indices to provide named locations instead of non-mnemonic integer locations. 
Usually indices are lists of names, but it is also possible to use multi-indices that provide hierarchical access to data, i.e., allow you to select whole blocks of rows or columns. 
This is initially hard to master but is very useful:
  * https://jakevdp.github.io/PythonDataScienceHandbook/03.05-hierarchical-indexing.html
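
As a rough illustration of hierarchical access, the sketch below builds a small Series with a two-level index and selects one block and one cell. The level names `outer` and `inner` and all labels are made up for this example.

```python
import pandas as pd

# Hypothetical two-level index: the first level groups the rows into blocks.
index = pd.MultiIndex.from_tuples(
    [('x', 'a'), ('x', 'b'), ('y', 'a'), ('y', 'b')],
    names=['outer', 'inner'],
)
series = pd.Series([1, 2, 3, 4], index=index)

# Selecting by the first level returns the whole block of rows under 'x'.
block = series.loc['x']
print(block.tolist())  # [1, 2]

# Selecting by a full tuple returns a single cell.
cell = series.loc[('y', 'b')]
print(cell)  # 4
```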

In [1]:
import numpy as np
import pandas as pd

## I. Series datatype

### Creation

* Series is a one-dimensional array that can contain elements of the same type. 
* The elements of the array can be named by providing a list of cell names.
* Series can be constructed from a list of values, numpy array or from a dictionary.
* Give an explicit data type when creating an empty Series: the default `dtype` of an empty Series has changed between pandas versions, so stating it makes the code future-proof.  

In [2]:
print(pd.Series(dtype='float64'))
print(pd.Series(['a', 'b', 'c']))
print(pd.Series(np.array(['a','b','c','d'])))
print(pd.Series({'a' : 0., 'b' : 1., 'c' : 2.}))

Series([], dtype: float64)
0    a
1    b
2    c
dtype: object
0    a
1    b
2    c
3    d
dtype: object
a    0.0
b    1.0
c    2.0
dtype: float64


#### Prefilled Series and technical tidbits 

* By specifying a single cell value and an index it is possible to create prefilled Series.
* **Be cautious:** This works only if the value is numeric or a string as the constructor does conversions.
* Internally, the underlying data structure holding the data is a `numpy` array that can contain 64-bit types:
  * 64-bit integers
  * 64-bit IEEE floats
  * 64-bit pointers to complex data objects
* If you want to create a Series filled with `None` values you have to use one of two idioms:
  * force data representation of 64-bit pointers by string initialisation and assign `None` values to all entries
  * create appropriate `numpy` array filled with `None` values and convert it to Series
* All other methods are bound to create unexpected results.

In [3]:
print(pd.Series(1, index = [0, 1, 2]))
print(pd.Series(1.0, index = [0, 1, 2]))
print(pd.Series('1.0', index = [0, 1, 2]))

# Proper ways to get None series
series = pd.Series('', index = [0, 1, 2], dtype = object)
series[:] = None
print(series)

print(pd.Series(np.empty(3, dtype = object)))

# Incorrect ways to get None series 
print(pd.Series(None, index = [0, 1, 2], dtype = np.float64))
print(pd.Series(None, index = [0, 1, 2], dtype = object))

series = pd.Series(1, index = [0, 1, 2])
series[:] = None
print(series)

0    1
1    1
2    1
dtype: int64
0    1.0
1    1.0
2    1.0
dtype: float64
0    1.0
1    1.0
2    1.0
dtype: object
0    None
1    None
2    None
dtype: object
0    None
1    None
2    None
dtype: object
0   NaN
1   NaN
2   NaN
dtype: float64
0    NaN
1    NaN
2    NaN
dtype: object
0   NaN
1   NaN
2   NaN
dtype: float64
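
A minimal sketch of the second idiom, assuming `np.full` is an acceptable substitute for `np.empty` when the `None` fill should be explicit. The name `none_series` is only illustrative.

```python
import numpy as np
import pandas as pd

# Build an object array explicitly filled with None, then wrap it in a Series.
none_series = pd.Series(np.full(3, None, dtype=object), index=[0, 1, 2])

print(none_series.dtype)         # object
print(none_series.isna().all())  # True: None counts as a missing value
```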


### Cell naming
* Names of the cells can be set during construction by providing an index.
* If the index contains names that are not keys of the dictionary, the corresponding cells are filled with missing values (`NaN`).
* Elements are always ordered according to the index.
* Dictionary entries whose keys are not in the index are omitted.
* The same name can appear more than once in the index.

In [4]:
print(pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']))
print(pd.Series({'a' : 0., 'b' : 1., 'c' : 2.}, index=['b','c','d','a']))
print(pd.Series({'a' : 0., 'b' : 1., 'c' : 2.}, index=['c','b','a']))
print(pd.Series({'a' : 0., 'b' : 1., 'c' : 2.}, index=['c','b']))
print(pd.Series({'a' : 0., 'b' : 1., 'c' : 2.}, index=['c','b','c']))

a    1
b    2
c    3
d    4
e    5
dtype: int64
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
c    2.0
b    1.0
a    0.0
dtype: float64
c    2.0
b    1.0
dtype: float64
c    2.0
b    1.0
c    2.0
dtype: float64
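
When the index and the dictionary disagree, it can help to list the names that ended up without a value. The following is one possible way to do that; the variable names are illustrative.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}
series = pd.Series(data, index=['b', 'c', 'd', 'a'])

# Names requested in the index but absent from the dictionary hold NaN.
missing_labels = series[series.isna()].index.tolist()
print(missing_labels)  # ['d']
```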


### Accessing a single cell

* Cells can be referenced by using raw integer locations with `Series.iloc[location]`.
* Cells can be referenced by using names specified by the index with `Series.loc[location]`.
* Raw integer locations start from `0`, and `-1` is a shorthand for the last element.
* The shorthand `-1` does not work for `Series.loc[location]` even if the index is undefined. 
* Both indexing methods can create out of bounds errors.

In [5]:
series = pd.Series([1, 2, 3, 4, 5])
print(series.iloc[0], series.iloc[-1])
print(series.loc[0], series.loc[4])


1 5
1 5
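
The out-of-bounds behaviour mentioned above can be sketched as follows. The helper `safe_get` is a hypothetical function written for this example, not part of pandas.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5])

def safe_get(indexer, key):
    """Return the looked-up value, or the name of the exception raised."""
    try:
        return indexer[key]
    except (IndexError, KeyError) as error:
        return type(error).__name__

print(safe_get(series.iloc, 10))  # IndexError: position 10 does not exist
print(safe_get(series.loc, 10))   # KeyError: no cell is named 10
print(safe_get(series.loc, -1))   # KeyError: -1 is not a name in the index
print(safe_get(series.iloc, -1))  # 5: the last element by position
```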


* Operator `Series[location]` is a clever shorthand:
  * if `location` has the same type as the index then it is equivalent to `Series.loc[location]` 
  * else if `location` is an integer then it is equivalent to `Series.iloc[location]` (recent pandas releases deprecate this positional fallback) 
* Things can go **terribly wrong**:
  * if the named index has equal elements
  * if the named index elements have integer type
* If you did not create the Series yourself and want to bullet-proof your code, always use `iloc` to avoid these problems.

In [6]:
print('--standard-usage--')
series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print(series[0], series[4])
print(series['a'], series['e'])

print('\n--reverse-indexing--')
series = pd.Series([1, 2, 3, 4, 5], index=[4, 3, 2, 1, 0])
print(series[0], series[4])

print('\n--double-index--')
series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'a', 'd', 'e'])
print(series['a'])
print(series['e'])

print('\n--unexpected-behaviour--')
series = pd.Series([1, 2, 3, 4, 5], index=[0, 1, 2, 3, 0])
print(series[0])
print(series.iloc[0])

--standard-usage--
1 5
1 5

--reverse-indexing--
5 1

--double-index--
a    1
a    3
dtype: int64
5

--unexpected-behaviour--
0    1
0    5
dtype: int64
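
One way to sidestep the ambiguity is to state the intent explicitly and to check the index for duplicates before relying on label lookups. This is a small sketch with made-up data.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5], index=[4, 3, 2, 1, 0])

by_label = series.loc[0]      # the cell named 0, which is stored last
by_position = series.iloc[0]  # the first cell, whatever its name
print(by_label, by_position)  # 5 1

# A duplicated name makes label lookups return a Series instead of a scalar.
duplicated = pd.Series([1, 2, 3], index=['a', 'b', 'a'])
print(duplicated.index.is_unique)  # False
```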
1


### Index vectors and slices

* It is possible to select a subset of the Series by specifying index vectors:
  * `Series.loc[index]` for specifying elements in terms of names
  * `Series.iloc[index]` for specifying elements in terms of raw integer locations
* The call `Series[index]` still works as a clever shorthand.
* The resulting object `Series[index]` is always a Series even if it is empty or contains a single element.
* It is possible to select a subset of Series by specifying a Boolean index vector `mask` of the same size.
* The resulting object `Series[mask]` is always a Series even if it is empty or contains a single element.
* The call `Series[mask]` is never converted to `Series.loc[mask]`, even if the index is a Boolean vector. 
* Slicing is an easy way to create continuous sub-Series:
  * `Series.iloc[:]` – select all 
  * `Series.iloc[a:]` – select from the integer position `a`
  * `Series.iloc[:b]` – select up to the integer position `b`, do not take `b`
  * `Series.iloc[a:b]` – select from the integer position `a` up to the integer position `b`, do not take `b`
* Slicing also works with `Series.loc` but the semantics are different, as the end label is included:
  * `Series.loc[:]` – select all 
  * `Series.loc[a:]` – select from the label `a` 
  * `Series.loc[:b]` – select up to the label `b`, take `b`
  * `Series.loc[a:b]` – select from the label `a` up to the label `b`, take `b`
* Labels must be unique in slice definitions.
* Slicing is always guaranteed to return a Series even if it is empty or contains a single element.
* Although the shorthand `-1` works as the maximal index in slices, do not use it.
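
The difference between the two slicing conventions is easiest to see side by side. The sketch below uses the same kind of small labelled Series as the rest of this section.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Positional slice: the end position 3 is excluded.
positional = series.iloc[1:3]
print(positional.index.tolist())  # ['b', 'c']

# Label slice: the end label 'd' is included.
labelled = series.loc['b':'d']
print(labelled.index.tolist())  # ['b', 'c', 'd']
```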

In [7]:
print('--standard-indexing--')
series = pd.Series([1, 2, 3, 4, 5], index = ['a', 'b', 'c', 'd', 'e'])
print(series.loc[['a', 'b', 'c']])
print(series.loc[['a']])
print(series.loc[[]], '\n')
print(series.iloc[[0, 1, 2]])
print(series.iloc[[0]])
print(series.iloc[[]], '\n')

print('--boolean-indexing--')
series = pd.Series([1, 2, 3, 4, 5])
print(series[[True, False, True, False, False]])
print(series[[True, False, False, False, False]])
print(type(series[[True, False, False, False, False]]))
print(series[[False, False, False, False, False]])
print(type(series[[False, False, False, False, False]]),'\n')

print('--unexpected-behaviour--')
series = pd.Series([1, 2, 3, 4, 5], index = [True, False, True, False, False])
print(series[[True, True, True, False, False]], '\n')
print(series[True],'\n')

print('--normal-slicing--')
series = pd.Series([1, 2, 3, 4, 5])
print(series[:])
print(series[:1])
print(series[1:])
print(series[1:3], '\n')

print('--corner-cases--')
print(series[1:2])
print(series[1:1])
print(series[2:1], '\n')

print('--non-numerical-slicing--')
series = pd.Series([1, 2, 3, 4, 5], index = ['a', 'b', 'c', 'd', 'e'])
print(series.loc[:])
print(series.loc[:'b'])
print(series.loc['b':])
print(series.loc['b':'d'])

--standard-indexing--
a    1
b    2
c    3
dtype: int64
a    1
dtype: int64
Series([], dtype: int64) 

a    1
b    2
c    3
dtype: int64
a    1
dtype: int64
Series([], dtype: int64) 

--boolean-indexing--
0    1
2    3
dtype: int64
0    1
dtype: int64
<class 'pandas.core.series.Series'>
Series([], dtype: int64)
<class 'pandas.core.series.Series'> 

--unexpected-behaviour--
True     1
False    2
True     3
dtype: int64 

True    1
True    3
dtype: int64 

--normal-slicing--
0    1
1    2
2    3
3    4
4    5
dtype: int64
0    1
dtype: int64
1    2
2    3
3    4
4    5
dtype: int64
1    2
2    3
dtype: int64 

--corner-cases--
1    2
dtype: int64
Series([], dtype: int64)
Series([], dtype: int64) 

--non-numerical-slicing--
a    1
b    2
c    3
d    4
e    5
dtype: int64
a    1
b    2
dtype: int64
b    2
c    3
d    4
e    5
dtype: int64
b    2
c    3
d    4
dtype: int64
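
In practice a Boolean mask is usually computed from the Series itself rather than typed in by hand. The sketch below shows this common pattern; the thresholds are arbitrary.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# A comparison yields a Boolean Series of the same size, usable as a mask.
mask = series > 3
selected = series[mask]
print(selected.index.tolist())  # ['d', 'e']

# The result stays a Series even for a single match or no match at all.
single = series[series == 1]
empty = series[series > 10]
print(type(single).__name__, len(single))  # Series 1
print(type(empty).__name__, len(empty))    # Series 0
```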


## II. DataFrame datatype

* DataFrame is a two-dimensional array that has row and column indices for more natural indexing.
* DataFrame indexing works analogously to Series indexing with slight differences outlined below.
* Selecting a part of a DataFrame may return a view that shares data with the original rather than an independent copy, so call `.copy()` when you need an independent object.
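
Since a selection is not guaranteed to be independent of the original DataFrame, an explicit copy is the usual safeguard. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alex', 'Bob', 'Clarke'], 'Age': [10, 12, 13]},
    index=['A', 'B', 'C'],
)

# Take an explicit copy of the selected rows before modifying them.
subset = df.loc[['A', 'B']].copy()
subset.loc['A', 'Age'] = 99

print(subset.loc['A', 'Age'])  # 99
print(df.loc['A', 'Age'])      # 10: the original is untouched
```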

### Creation

* DataFrame is a two-dimensional array, each column must contain elements of the same type.
* The rows and columns of the array can be named by providing corresponding lists:
  * `index` for row names
  * `columns` for column names
* DataFrame can be constructed from a list of values or from a dictionary:
  * construction from a list of lists lets you specify the DataFrame row by row 
  * construction from a dictionary of lists lets you specify the DataFrame column by column
* Objects in lists or dictionaries can be lists or Series:
  * By default, lists and Series are split among cells.
  * Wrapping a list or Series in an extra list inserts it into an individual cell.
* DataFrame can be created also from `numpy` arrays.
* Datatypes of columns can be specified with `dtype` argument:
  * It must be one of `numpy` datatypes.
  * Only one datatype can be specified.
  * It is meant for solving ambiguities in numeric values.

In [8]:
print(pd.DataFrame())
display(pd.DataFrame([1,2,3,4,5]))
display(pd.DataFrame([['Alex',10],['Bob',12],['Clarke',13]], columns=['Name','Age'], index = ['A', 'B', 'C']))

# There can be only one dtype. You cannot specify different ones for each column 
display(pd.DataFrame([['Alex',10],['Bob',12],['Clarke',13]], columns=['Name','Age'], dtype = str))

display(pd.DataFrame({'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28, 34, 29, 42]}))
print('\n')

# Bracketing Series makes all the difference in the constructor 
display(pd.DataFrame([pd.Series([1, 2, 3]), pd.Series([4, 5, 6])]))
display(pd.DataFrame([[pd.Series([1, 2, 3])], [pd.Series([4, 5, 6])]]))
display(pd.DataFrame({1: pd.Series([1, 2, 3]),2: pd.Series([4, 5, 6])}))
display(pd.DataFrame({1:[pd.Series([1, 2, 3])], 2:[pd.Series([4, 5, 6])]}))

Empty DataFrame
Columns: []
Index: []


|   | 0 |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |


|   | Name | Age |
|---|---|---|
| A | Alex | 10 |
| B | Bob | 12 |
| C | Clarke | 13 |


|   | Name | Age |
|---|---|---|
| 0 | Alex | 10 |
| 1 | Bob | 12 |
| 2 | Clarke | 13 |


|   | Name | Age |
|---|---|---|
| 0 | Tom | 28 |
| 1 | Jack | 34 |
| 2 | Steve | 29 |
| 3 | Ricky | 42 |






|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |


|   | 0 |
|---|---|
| 0 | 0 1 1 2 2 3 dtype: int64 |
| 1 | 0 4 1 5 2 6 dtype: int64 |


|   | 1 | 2 |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |


|   | 1 | 2 |
|---|---|---|
| 0 | 0 1 1 2 2 3 dtype: int64 | 0 4 1 5 2 6 dtype: int64 |
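
Because the constructor accepts only one `dtype`, per-column types have to be set afterwards. One way to do this, sketched here under the assumption that `DataFrame.astype` with a dictionary suits the use case, is:

```python
import pandas as pd

df = pd.DataFrame(
    [['Alex', 10], ['Bob', 12], ['Clarke', 13]],
    columns=['Name', 'Age'],
)

# astype accepts a mapping from column name to the desired type
# and returns a new DataFrame; columns not mentioned keep their type.
typed = df.astype({'Age': 'float64'})

print(df['Age'].dtype)     # int64
print(typed['Age'].dtype)  # float64
```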


### Accessing  rows, columns and cells

* DataFrame is a series of Series with convenient access to individual cells.
* Individual columns can be accessed with `df[column]` access method.
* Individual rows can be accessed with `df.iloc[row_nr]` access method.
* Both methods return Series as the output.
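* There are separate methods `df.iterrows()`, `df.itertuples()` and `df.items()` (called `df.iteritems()` in older pandas versions) for iterating over rows and columns.
* `df.iterrows()` and `df.items()` return pairs where the first component is a name and the second is a Series; `df.itertuples()` returns one named tuple per row.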
* A cell can be accessed indirectly using elements of a column or a row. 
* A cell can be accessed directly using double indexing `df.loc[row, column]` and `df.iloc[row_nr, col_nr]`.
* **Do not use** `df.ix[row, column]`, which allowed both index types and created unexpected errors; it has been removed from recent pandas versions.
* **Important:** You can safely modify only cells that are directly accessed!
* Slicing and Boolean indexing work as before, although they can be confusing.
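
The iteration methods can be sketched as follows, using the same small table as the examples in this section. The method names `iterrows`, `items` and `itertuples` are the ones available in current pandas releases.

```python
import pandas as pd

df = pd.DataFrame(
    [['Alex', 10], ['Bob', 12], ['Clarke', 13]],
    columns=['Name', 'Age'],
    index=['A', 'B', 'C'],
)

# iterrows yields (row name, row as Series) pairs.
row_labels = [label for label, row in df.iterrows()]
print(row_labels)  # ['A', 'B', 'C']

# items yields (column name, column as Series) pairs.
column_labels = [name for name, column in df.items()]
print(column_labels)  # ['Name', 'Age']

# itertuples yields one named tuple per row; columns become attributes.
names = [record.Name for record in df.itertuples()]
print(names)  # ['Alex', 'Bob', 'Clarke']
```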

In [9]:
df = pd.DataFrame([['Alex',10],['Bob',12],['Clarke',13]], columns=['Name','Age'], index = ['A', 'B', 'C'])
print(df['Name'])
print(df.iloc[1],'\n')

# Indirect access: pick the column or row first, then the cell by position
print(df['Name'].iloc[1])
print(df.iloc[1].iloc[0])
print(df.loc['B','Name'])
print(df.iloc[1,0])

display(df.loc['A':'C', ['Age','Name']])
display(df.iloc[0:3, [1,0]])

display(df.loc['B':'C', 'Age':'Name'])
display(df.iloc[1:3, 1:1])

df.iloc[0,0] = 'Alexa' 
# Not safe df.iloc[0].iloc[0] = 'Alex'
df.loc['A', 'Name'] = 'Alex'
# Not safe df['Name']['A'] = 'Alex'

A      Alex
B       Bob
C    Clarke
Name: Name, dtype: object
Name    Bob
Age      12
Name: B, dtype: object 

Bob
Bob
Bob
Bob


|   | Age | Name |
|---|---|---|
| A | 10 | Alex |
| B | 12 | Bob |
| C | 13 | Clarke |


|   | Age | Name |
|---|---|---|
| A | 10 | Alex |
| B | 12 | Bob |
| C | 13 | Clarke |


B
C


B
C
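
To make the rule about safe modification concrete, the sketch below contrasts a direct assignment with a change made to a separately extracted row. The values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [['Alex', 10], ['Bob', 12], ['Clarke', 13]],
    columns=['Name', 'Age'],
    index=['A', 'B', 'C'],
)

# Direct access: one indexing call names both the row and the column.
df.loc['A', 'Name'] = 'Alexa'
df.iloc[1, 1] = 20
print(df.loc['A', 'Name'], df.loc['B', 'Age'])  # Alexa 20

# Indirect access: changing an extracted copy of a row leaves df as it was.
row = df.iloc[0].copy()
row['Name'] = 'Changed'
print(df.loc['A', 'Name'])  # Alexa
```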
