#### Requirements

In [4]:
import numpy as np

# seed the random number generator so every run produces the same numbers
np.random.seed(42)

## 8. Pandas: Introduction 

![](../images/pandas_logo.svg)

# Pandas

Pandas is a popular, user-friendly library for data science projects, particularly for the data exploration phase.

- *Series* is the data structure for one-dimensional data

- *DataFrame* is the data structure for two-dimensional, tabular data

- Tools for reading and writing data between in-memory data structures and different formats
    - CSV and text files
    - Microsoft Excel
    - SQL databases
    - fast HDF5 format

- Integrated handling of missing data

- Label-based slicing, fancy indexing, and subsetting of large data sets

- Split-apply-combine: aggregating or transforming data

- High-performance merging and joining of data sets
Like NumPy, Pandas is highly optimized for performance, with critical code paths written in Cython or C.
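
Two of the features listed above, split-apply-combine and merging, can be sketched in a few lines. This is only a rough illustration: the `sales` and `countries` tables and their column names are invented for the example and do not come from the course material.

```python
import pandas as pd

# Hypothetical sales records, invented purely for illustration
sales = pd.DataFrame({
    'city': ['Berlin', 'Paris', 'Berlin', 'Paris'],
    'amount': [10, 20, 30, 40],
})

# Split the rows by city, apply a sum to each group, combine into one Series
totals = sales.groupby('city')['amount'].sum()

# Hypothetical lookup table, joined to the records on the shared 'city' column
countries = pd.DataFrame({
    'city': ['Berlin', 'Paris'],
    'country': ['Germany', 'France'],
})
merged = sales.merge(countries, on='city')

print(totals)
print(merged)
```

Here `totals` holds one summed amount per city, and `merged` carries the matching country next to every sales record.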

In [40]:
import pandas as pd 

## Data Structures

### Series

A Series is a one-dimensional labeled array. The basic command to create a Series is

                               s = pd.Series(your_data)
                                    
*your_data* can be:

- a Python list
- a Python dictionary
- a NumPy ndarray
- a single value (like 5)

**with a Python list**

In [6]:
your_data = [1,2,3]

pd.Series(your_data)

0    1
1    2
2    3
dtype: int64

**with a NumPy ndarray**

In [41]:
your_data = np.arange(1,4)

pd.Series(your_data)

0    1
1    2
2    3
dtype: int64

**with a single value**

In [42]:
pd.Series(5)

0    5
dtype: int64

### Using Labels

The output shows the index, the values, and the data type. **What sets Pandas apart is that you can address values with meaningful labels instead of integer indices.** You can either pass a Python dictionary directly...

In [9]:
# example with a dictionary

pd.Series({'a':1, 'b':2, 'c':3})

a    1
b    2
c    3
dtype: int64

...or pass the labels through the second argument, *index*, as shown below.
                   
                               pd.Series(your_data, index=labels)

*your_data* and *labels* must have the same length; otherwise a ValueError is raised.

In [10]:
# use the index argument

values = [1,2,3] # a list or an ndarray works equally well
labels = ['a','b','c']

pd.Series(values, index=labels)

a    1
b    2
c    3
dtype: int64
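
The length requirement can be checked with a short sketch. It assumes a list as data and deliberately passes one label too few; the `raised` flag is only a helper introduced for this illustration.

```python
import pandas as pd

values = [1, 2, 3]
labels = ['a', 'b']  # deliberately one label short

try:
    pd.Series(values, index=labels)
    raised = False
except ValueError as err:
    # Pandas refuses to build a Series when the lengths differ
    raised = True
    print(f"ValueError: {err}")
```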

#### Using Labels with Python Dictionaries

When you pass a Python dictionary as data together with a list of labels as index, the Series contains only the values whose keys appear in that list.

Example

In [11]:
pd_series = pd.Series({'a':1, 'b':2, 'c':3}, index=['b', 'c'])

pd_series

b    2
c    3
dtype: int64

If a label does not exist in the dictionary, its value is NaN (Not a Number). Because NaN is a floating-point value, the data type of the Series is upcast to 'float64'.

In [12]:
pd.Series({'a':1, 'b':2, 'c':3}, index=['a', 'b', 'x'])

a    1.0
b    2.0
x    NaN
dtype: float64
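
Missing entries like the NaN above can be detected, dropped, or replaced. The following is a minimal sketch; the fill value 0 is an arbitrary choice made for this example.

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 3}, index=['a', 'b', 'x'])

missing = s.isna()    # boolean Series: True where the value is NaN
cleaned = s.dropna()  # keep only the entries that have a value
filled = s.fillna(0)  # replace NaN with a default (0 is arbitrary here)

print(missing)
print(cleaned)
print(filled)
```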

**Single values are duplicated for every label**

In [13]:
s_scalar = pd.Series(5, index=['a','b','c'])

s_scalar

a    5
b    5
c    5
dtype: int64

In [14]:
type(s_scalar)

pandas.core.series.Series
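
Once a Series carries labels, its values can be addressed by label as well as by integer position. A small sketch, with variable names chosen only for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

by_label = s['b']          # address a value by its label
by_label_loc = s.loc['b']  # the same lookup, made explicit with .loc
by_position = s.iloc[1]    # address the same value by integer position
subset = s[['a', 'c']]     # a list of labels returns a smaller Series

print(by_label, by_label_loc, by_position)
print(subset)
```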

### DataFrame
The Pandas DataFrame is a two-dimensional labeled data structure with columns that can hold different data types. It's a little like an Excel spreadsheet. The basic command to construct a DataFrame is

                        pd.DataFrame(your_data)

You can construct it from:

- a dict of 1-D ndarrays, lists, dicts, or Series
- a list of dicts
- a 2-D ndarray
- a Series
- another DataFrame
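
The last two sources in the list are not demonstrated further in this section, so here is a brief sketch. The Series name 'one' and the replacement value 100 are assumptions made for the example.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='one')

# A named Series becomes a single-column DataFrame; its name is the column label
df_from_series = pd.DataFrame(s)

# Passing a DataFrame builds a new one; copy=True makes its data independent
df_copy = pd.DataFrame(df_from_series, copy=True)
df_copy.loc['a', 'one'] = 100  # changes only the copy

print(df_from_series)
print(df_copy)
```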



#### From a dictionary of Series

In [15]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
pd.DataFrame(d)

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


#### From a dictionary of lists / ndarrays

In [16]:
d = {'one' : [1., 2., 3., 4.],
     'two' : [4., 3., 2., 1.]}
pd.DataFrame(d)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


#### From a list of dictionaries

In [17]:
d = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

pd.DataFrame(d)

   a   b     c
0  1   2   NaN
1  5  10  20.0


#### From a two-dimensional ndarray

In [18]:
array = np.array([[9,8,7], [5,6,7]])
pd.DataFrame(array)

   0  1  2
0  9  8  7
1  5  6  7


The *index* argument sets the row labels, and the *columns* argument sets the column labels.

In [44]:
# specify row and column labels
df = pd.DataFrame(array, index=['x', 'y'], columns=['a', 'b', 'c'])

df

   a  b  c
x  9  8  7
y  5  6  7


In [45]:
type(df)

pandas.core.frame.DataFrame

### Indexing


|Operation | Syntax | Result|
|----------|----------|----------|
|Select column by label | df[c_label] | Series|
|Select row by label | df.loc[r_label] | Series|
|Select row by index | df.iloc[r_idx] | Series|
|Select column by index | df.iloc[:, c_idx] | Series|
|Slice rows by indices | df[5:10] | DataFrame|
|Select rows by boolean vector | df[bool_vec] | DataFrame|
|Select row and column by label | df.loc[r_label, c_label] | Value|
|Select row and column by index | df.iloc[r_idx, c_idx] | Value|

In [27]:
d = {'high' : [26, 22, 23, 29, 25, 23, 23],
     'low' : [18, 16, 15, 18, 18, 16, 16],
    }

df = pd.DataFrame(data=d, index=['mon','tue','wed','thu','fri', 'sat', 'sun'])

df

     high  low
mon    26   18
tue    22   16
wed    23   15
thu    29   18
fri    25   18
sat    23   16
sun    23   16


In [28]:
# Select column by label

df['high']

mon    26
tue    22
wed    23
thu    29
fri    25
sat    23
sun    23
Name: high, dtype: int64

In [29]:
# this is a series

type(df['high'])

pandas.core.series.Series

In [30]:
# select row by label

df.loc['mon']

high    26
low     18
Name: mon, dtype: int64

In [31]:
# select row by index

df.iloc[0]

high    26
low     18
Name: mon, dtype: int64

In [32]:
# select column by index

df.iloc[:, 0]

mon    26
tue    22
wed    23
thu    29
fri    25
sat    23
sun    23
Name: high, dtype: int64

In [33]:
# slice rows by indices
# you can also use negative indices or no indices

df[-2:]

     high  low
sat    23   16
sun    23   16


In [34]:
# select rows by boolean vector

boolean_vector = [True, False, True, True, True, False, False]

df[boolean_vector]

     high  low
mon    26   18
wed    23   15
thu    29   18
fri    25   18
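
The last two rows of the indexing table, selecting a single value by label or by index, can be sketched as follows. The example uses a shortened, three-day version of the temperature data so that it runs on its own; it also shows that label slices with .loc include the end label, while position slices with .iloc exclude it.

```python
import pandas as pd

df = pd.DataFrame(
    {'high': [26, 22, 23], 'low': [18, 16, 15]},
    index=['mon', 'tue', 'wed'],
)

by_label = df.loc['tue', 'high']   # row label, column label
by_position = df.iloc[1, 0]        # row index, column index
label_slice = df.loc['mon':'tue']  # label slices include the end label
position_slice = df.iloc[0:2]      # position slices exclude the end, as in Python

print(by_label, by_position)
print(label_slice)
print(position_slice)
```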


#### Masks

Boolean vectors, also called masks, can be created by applying a comparison operator (<, ==, >) to a column. Several masks can be combined element-wise with the & operator.

In [39]:
# Give me all days with a high temperature above 22 degrees
# and a low temperature above 16 degrees

con1 = df['high'] > 22
con2 = df['low'] > 16

df[con1 & con2]

     high  low
mon    26   18
thu    29   18
fri    25   18
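
Besides &, masks can be combined with | (or) and inverted with ~ (not). The sketch below rebuilds the temperature data so that it runs on its own; the thresholds are chosen only for illustration. Python's own `and`, `or`, and `not` keywords do not work element-wise on a Series, so the operators are needed here.

```python
import pandas as pd

df = pd.DataFrame(
    {'high': [26, 22, 23, 29, 25, 23, 23],
     'low': [18, 16, 15, 18, 18, 16, 16]},
    index=['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun'],
)

# Parentheses are required: & and | bind more tightly than the comparisons
warm_or_mild = df[(df['high'] > 25) | (df['low'] > 16)]

# ~ inverts a mask: every day whose high is not above 22
not_warm = df[~(df['high'] > 22)]

print(warm_or_mild)
print(not_warm)
```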
