# Introduction to Statistics with Python -- Part 2

## Pandas

As mentioned in the previous section, Pandas provides a convenient and efficient way to work with data of mixed types. The Pandas library must also be installed in your conda environment (it depends on the NumPy library, so NumPy has to be installed as well). You can import it in your code as follows:

import numpy as np
import pandas as pd

pd.__version__

Pandas objects are, in essence, multi-dimensional arrays whose rows and columns can be labeled and which offer a lot of flexibility regarding data types and missing elements. There are three basic types of Pandas objects:
- Series
- DataFrame
- Index

### Pandas Series

Let's start with the Series object. A Pandas Series is a one-dimensional array in which data is stored by index. It can be created by using a list or an array as an input:

In [3]:
test_series = pd.Series([1.5, 2.3, 3.1, 4.7])
test_series

0    1.5
1    2.3
2    3.1
3    4.7
dtype: float64

The Series output has two columns: the left one shows the index and the right one the corresponding stored values. Both can be accessed directly:

In [4]:
print('all series values: ', test_series.values)
print('all series indices: ', test_series.index)

all series values:  [1.5 2.3 3.1 4.7]
all series indices:  RangeIndex(start=0, stop=4, step=1)


As with NumPy arrays, individual values can be accessed via the index:

In [5]:
print('first element: ', test_series[0])
print('first three elements: ', test_series[0:3])

first element:  1.5
first three elements:  0    1.5
1    2.3
2    3.1
dtype: float64


However, the big difference between a Pandas Series and a one-dimensional NumPy array is that the indices can be chosen freely for the Series (for NumPy arrays they are implicitly defined as a consecutive list of integers):

In [6]:
test_series2 = pd.Series([1.5, 2.3, 3.1, 4.7], index = ['alpha', 'beta', 'gamma', 'delta'])
print(test_series2)
print('\nEntry with index gamma: ', test_series2['gamma'])

alpha    1.5
beta     2.3
gamma    3.1
delta    4.7
dtype: float64

Entry with index gamma:  3.1


As you can see, a Pandas Series behaves much like a dictionary: it maps specific keys (the indices) to specific values.
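The dictionary analogy can be made explicit by building a Series directly from a Python dictionary. The following is a minimal sketch; the name `greek` is only chosen for illustration. It also shows that, unlike a plain dictionary, a Series can be addressed by position, using the explicit `.loc` (label) and `.iloc` (position) accessors:

```python
import pandas as pd

# Keys of the dictionary become the index, values become the data.
greek = pd.Series({'alpha': 1.5, 'beta': 2.3, 'gamma': 3.1, 'delta': 4.7})

has_beta = 'beta' in greek                # membership is tested against the index
keys = list(greek.keys())                 # the same labels as greek.index

# Unlike a plain dictionary, a Series also supports positional access:
by_label = greek.loc['gamma']             # explicit label-based lookup
by_position = greek.iloc[2]               # explicit position-based lookup (third entry)
label_slice = greek.loc['beta':'gamma']   # label slices include the end label
position_slice = greek.iloc[1:3]          # positional slices exclude the stop

print(has_beta, keys)
print(by_label, by_position)
print(label_slice.equals(position_slice))
```

Using `.loc` and `.iloc` makes it unambiguous whether a label or a position is meant, which matters as soon as the labels are themselves integers.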

### Pandas DataFrame

Most of the time, you will be working with Pandas DataFrames when doing a statistical analysis. A DataFrame shares the great advantage of the Pandas Series that the indices of the data can be chosen flexibly. In addition, it is capable of handling multiple dimensions and can be thought of as a generalized NumPy array with flexible row and column indices, or as an extension of a Pandas Series.

Consequently, it can be constructed from a NumPy array as well as a Pandas Series:

In [7]:
# by NumPy array:
arraytest1 = np.random.rand(3,2)
frametest1 = pd.DataFrame(arraytest1, columns = ['black', 'white'], index = ['low', 'middle', 'high'])

frametest1

           black     white
low     0.931437  0.949640
middle  0.390291  0.850047
high    0.836920  0.162704


In [8]:
# by Pandas Series:
seriestest1 = pd.Series([3500, 2310, 4300], index = ['left', 'straight', 'right'])
seriestest2 = pd.Series([0.1, 0.6, 0.5], index = ['left', 'straight', 'right'])
frametest2 = pd.DataFrame({'wager': seriestest1, 'probability': seriestest2})

frametest2

          wager  probability
left       3500          0.1
straight   2310          0.6
right      4300          0.5


As with the Series object, the indices and values can be accessed separately:

In [9]:
print(frametest2.index)
print('\n', frametest2.values)

Index(['left', 'straight', 'right'], dtype='object')

 [[3.50e+03 1.00e-01]
 [2.31e+03 6.00e-01]
 [4.30e+03 5.00e-01]]
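The scientific notation in the `.values` output above arises because a NumPy array holds a single data type. The following sketch rebuilds the frame from plain lists (so that the block runs on its own) and shows the per-column types next to the type of the converted array:

```python
import pandas as pd

frame = pd.DataFrame(
    {'wager': [3500, 2310, 4300], 'probability': [0.1, 0.6, 0.5]},
    index=['left', 'straight', 'right'],
)

print(frame.columns)   # the column labels are stored in an Index as well
print(frame.dtypes)    # each column keeps its own data type

# .to_numpy() (like .values) returns one array with a single common dtype,
# so the integer wagers are converted to floats here.
as_array = frame.to_numpy()
print(as_array.dtype)
```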


As with NumPy arrays, a DataFrame can also be created very easily from an external file:

In [46]:
readset = pd.read_csv('data/IntroTestfile.csv')
readset

    1   2   3   4
0   5   6   7   8
1   9  10  11  12
2  13  14  15  16
3  17  18  19  20
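In the output above, the first line of the file was used for the column labels. The sketch below illustrates this behavior without needing a file on disk: `io.StringIO` stands in for the file, and the contents of `csv_text` are a made-up example, not the actual test file.

```python
import io
import pandas as pd

# Hypothetical file contents; io.StringIO stands in for a file on disk.
csv_text = "1,2,3,4\n5,6,7,8\n9,10,11,12\n"

# Default: the first line becomes the column labels.
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: every line is treated as data and the columns are numbered.
without_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header)
print(without_header)
```

If a file has no header line, passing `header=None` prevents the first data row from being consumed as labels.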


### Pandas Index

You have already seen the special use of indices in Series and DataFrames. The Pandas Index can, however, be seen as an object in its own right, and it basically behaves like an ordered, immutable array. It has its own constructor:

In [10]:
indextest1 = pd.Index([1, 2, 3, 4])
indextest1

Index([1, 2, 3, 4], dtype='int64')

Furthermore, it shares many of the functionalities of a NumPy array, e.g. accessing an element of the Pandas Index by its "index" (this time, only a standard numbering scheme) or slicing. You can also access the size, shape, dimension and type in the very same way as for NumPy arrays. What doesn't work, however, is changing an element of a Pandas Index:

In [11]:
indextest1[0] = 2

TypeError: Index does not support mutable operations

Rather than being an inconvenience, this serves a purpose: it protects DataFrame and Series objects against the side effects that a modification of their indices would have.
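The array-like behavior and the immutability described above can be checked together in one short sketch (the variable names are chosen for illustration only):

```python
import pandas as pd

idx = pd.Index([1, 2, 3, 4])

print(idx[1])                                     # single element by position
print(idx[::2])                                   # slicing returns a new Index
print(idx.size, idx.shape, idx.ndim, idx.dtype)   # same attributes as a NumPy array

# Assignment is rejected, so an Index shared between objects cannot change underneath them.
try:
    idx[0] = 2
    modified = True
except TypeError as err:
    modified = False
    print('TypeError:', err)
```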

### Working with DataFrames

Now that you have a small overview of the functionalities provided by Pandas, let's dive deeper into the handling of DataFrames. Looking at one of our example DataFrames from above, we can, for example, add a new column by doing:

In [12]:
print(frametest1)

frametest1['ratio'] = frametest1['black'] / frametest1['white']

frametest1

           black     white
low     0.931437  0.949640
middle  0.390291  0.850047
high    0.836920  0.162704


           black     white     ratio
low     0.931437  0.949640  0.980833
middle  0.390291  0.850047  0.459141
high    0.836920  0.162704  5.143811


Columns can also be individually accessed as Series:

In [13]:
ratioseries = frametest1['ratio']
ratioseries

low       0.980833
middle    0.459141
high      5.143811
Name: ratio, dtype: float64

Note that as long as the column name is a valid Python identifier and doesn't coincide with the name of any DataFrame attribute or method, you can also access columns by attribute:

In [14]:
blackseries = frametest1.black
blackseries

low       0.931437
middle    0.390291
high      0.836920
Name: black, dtype: float64
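The restriction on attribute access is easy to run into. The sketch below uses a hypothetical frame with a column named `count`, which collides with the `DataFrame.count()` method:

```python
import pandas as pd

# Hypothetical frame with one column name that collides with a DataFrame method.
frame = pd.DataFrame({'black': [0.9, 0.4], 'count': [3, 7]})

black_column = frame.black        # works: 'black' is not a DataFrame attribute
count_attribute = frame.count     # this is the count() method, not the column
count_column = frame['count']     # bracket access always returns the column

print(callable(count_attribute))
print(isinstance(count_column, pd.Series))
```

Bracket access is therefore the safer choice in scripts, while attribute access is mainly a convenience for interactive work.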

Regarding mathematical operations, Pandas DataFrames support a wide variety of element-wise and matrix operations, many of which are inherited from NumPy's ufuncs. The big difference is that Pandas preserves the index and column labels for unary operations and automatically aligns them for binary operations such as addition and multiplication. Here are two examples of unary operations:

In [15]:
print(frametest1)

# unary operations: index and column label preservation!
print('\ntransposed frame: ', frametest1.T)
print('\nexponential: ', np.exp(frametest1))

           black     white     ratio
low     0.931437  0.949640  0.980833
middle  0.390291  0.850047  0.459141
high    0.836920  0.162704  5.143811

transposed frame:              low    middle      high
black  0.931437  0.390291  0.836920
white  0.949640  0.850047  0.162704
ratio  0.980833  0.459141  5.143811

exponential:             black     white       ratio
low     2.538155  2.584778    2.666676
middle  1.477411  2.339757    1.582713
high    2.309244  1.176689  171.367606
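To see what label preservation means in practice, the same ufunc can be applied once to a DataFrame and once to the bare NumPy array behind it. This is a small sketch with made-up values:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'black': [0.0, 1.0], 'white': [2.0, 3.0]}, index=['low', 'high'])

exp_frame = np.exp(frame)               # ufunc applied to the DataFrame: labels are kept
exp_array = np.exp(frame.to_numpy())    # same numbers on the bare array: labels are gone

print(type(exp_frame).__name__, list(exp_frame.index), list(exp_frame.columns))
print(type(exp_array).__name__)
```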


To illustrate the automatic label alignment for binary operations, let's introduce a DataFrame similar to our test frame:

In [16]:
arraytest1b = np.random.rand(3,2)

frametest1b = pd.DataFrame(arraytest1b, columns = ['black', 'red'], index = ['low', 'middle', 'high'])
frametest1b['ratio'] = frametest1b.black / frametest1b.red
frametest1b

           black       red     ratio
low     0.926374  0.192017  4.824431
middle  0.212364  0.022229  9.553439
high    0.280100  0.488220  0.573717


If we now perform a binary calculation with our original frame, we can observe that Pandas aligns the two frames by label and takes the differently labeled columns into account:

In [17]:
frametest1 / frametest1b

           black     ratio  red  white
low     1.005466  0.203305  NaN    NaN
middle  1.837841  0.048060  NaN    NaN
high    2.987932  8.965767  NaN    NaN


Here we see that Pandas has filled the columns that exist in only one of the two frames with not-a-number (NaN). This is how missing data is represented by default, as we saw in the NumPy overview.
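If NaN is not the desired result for unmatched labels, the method form of the operator accepts a `fill_value` argument. The following sketch uses two small made-up frames and `DataFrame.add()`; treating a missing entry as 0 is only a sensible assumption for some analyses, not a general recommendation:

```python
import pandas as pd

a = pd.DataFrame({'black': [1.0, 2.0], 'white': [3.0, 4.0]}, index=['low', 'high'])
b = pd.DataFrame({'black': [10.0, 20.0], 'red': [30.0, 40.0]}, index=['low', 'high'])

plain_sum = a + b                      # 'red' and 'white' exist in one frame only -> NaN
filled_sum = a.add(b, fill_value=0)    # missing entries are treated as 0 before adding

print(plain_sum)
print(filled_sum)
```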

NumPy arrays can be concatenated or split. The same is true for Pandas DataFrames, but with some more subtleties. The most straightforward way to concatenate DataFrames is actually very similar to the NumPy syntax:

In [20]:
combined_frame = pd.concat([frametest1, frametest1b], axis=0)
combined_frame

           black     white     ratio       red
low     0.931437  0.949640  0.980833       NaN
middle  0.390291  0.850047  0.459141       NaN
high    0.836920  0.162704  5.143811       NaN
low     0.926374       NaN  4.824431  0.192017
middle  0.212364       NaN  9.553439  0.022229
high    0.280100       NaN  0.573717  0.488220


Note that Pandas preserved all the row and column labels -- even if that means duplicating the row indices. One way to circumvent this problem is to add a second index level for the rows by using hierarchical indices:

In [22]:
combined_frame2 = pd.concat([frametest1, frametest1b], keys = ['1', '1b'], axis=0)
combined_frame2

              black     white     ratio       red
1  low     0.931437  0.949640  0.980833       NaN
   middle  0.390291  0.850047  0.459141       NaN
   high    0.836920  0.162704  5.143811       NaN
1b low     0.926374       NaN  4.824431  0.192017
   middle  0.212364       NaN  9.553439  0.022229
   high    0.280100       NaN  0.573717  0.488220


Now each value is uniquely identifiable again. The missing data has again been marked as NaN. If we only want to keep the columns present in both DataFrames, we can do so with:

In [23]:
combined_frame3 = pd.concat([frametest1, frametest1b], axis=0, join='inner')
combined_frame3

           black     ratio
low     0.931437  0.980833
middle  0.390291  0.459141
high    0.836920  5.143811
low     0.926374  4.824431
middle  0.212364  9.553439
high    0.280100  0.573717
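The concatenation options discussed above can be tried on small deterministic frames. The sketch below shows how rows are looked up once a hierarchical index is in place. It also shows `ignore_index=True`, an option of `pd.concat()` not used in the text above, which simply renumbers the rows when the original labels are not needed:

```python
import pandas as pd

first = pd.DataFrame({'black': [1.0, 2.0], 'white': [3.0, 4.0]}, index=['low', 'high'])
second = pd.DataFrame({'black': [5.0, 6.0], 'red': [7.0, 8.0]}, index=['low', 'high'])

# Hierarchical row index: the outer level names the source frame.
stacked = pd.concat([first, second], keys=['1', '1b'], axis=0)
rows_of_second = stacked.loc['1b']                    # all rows that came from `second`
single_value = stacked.loc[('1b', 'low'), 'black']    # one cell, addressed by both levels

# Alternative when the original row labels are not needed: renumber the rows.
renumbered = pd.concat([first, second], axis=0, ignore_index=True)

print(rows_of_second)
print(single_value)
print(renumbered)
```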


Maybe you have datasets with complementary information that you need to join. In this case, the pd.merge() function might be helpful:

In [26]:
lectures = pd.DataFrame({'student': ['Sarah', 'Marcus', 'Daniel', 'Emma'], 'lecture': ['Quantum mechanics', 'Optics', 'Cosmology', 'Quantum mechanics']})
grades = pd.DataFrame({'student': ['Sarah', 'Marcus', 'Daniel', 'Emma'], 'grades': [1.3, 2.0, 2.7, 2.3]})

term1 = pd.merge(lectures, grades)
term1

  student            lecture  grades
0   Sarah  Quantum mechanics     1.3
1  Marcus             Optics     2.0
2  Daniel          Cosmology     2.7
3    Emma  Quantum mechanics     2.3


What happened is that Pandas recognized that the "student" column is present in both DataFrames. The merge therefore matched the rows on the values of that column, which is referred to as the "key" column. You can even include data from a frame with a different structure, where only the values of one column correspond to the original frame:

In [28]:
professors = pd.DataFrame({'lecture': ['Quantum mechanics', 'Cosmology', 'Optics', 'Thermodynamics'], 'professor': ['Feynman', 'Huygens', 'Zwicky', 'Ising']})

term2 = pd.merge(term1, professors)
term2

  student            lecture  grades professor
0   Sarah  Quantum mechanics     1.3   Feynman
1    Emma  Quantum mechanics     2.3   Feynman
2  Marcus             Optics     2.0    Zwicky
3  Daniel          Cosmology     2.7   Huygens


Now, Pandas matched the values in the "lecture" columns and added the corresponding information from the "professor" column to the original DataFrame. Note that while the row order has changed, the column labels are preserved. Also, the value that was only present in one of the DataFrames ('Thermodynamics') has been omitted from the merge by default.
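Whether unmatched values such as 'Thermodynamics' are dropped is controlled by the `how` argument of `pd.merge()`, whose default is `'inner'`. A minimal sketch with shortened, made-up versions of the frames compares the default with `how='outer'`; the optional `indicator=True` adds a `_merge` column that records where each row came from:

```python
import pandas as pd

lectures = pd.DataFrame({'student': ['Sarah', 'Marcus'],
                         'lecture': ['Quantum mechanics', 'Optics']})
professors = pd.DataFrame({'lecture': ['Quantum mechanics', 'Optics', 'Thermodynamics'],
                           'professor': ['Feynman', 'Zwicky', 'Ising']})

# Default (how='inner'): only lectures present in both frames survive.
inner = pd.merge(lectures, professors)

# how='outer': unmatched rows are kept and padded with NaN.
outer = pd.merge(lectures, professors, how='outer', indicator=True)

print(inner)
print(outer)
```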

The pd.merge() function also allows you to specify the key column directly with the "on" argument. Of course, this only works if the column exists in both frames.

In [32]:
term1b = pd.merge(lectures, professors, on='lecture')
term1b

  student            lecture professor
0   Sarah  Quantum mechanics   Feynman
1    Emma  Quantum mechanics   Feynman
2  Marcus             Optics    Zwicky
3  Daniel          Cosmology   Huygens


pd.merge() also handles key columns that are named differently in the two frames, via the left_on and right_on arguments:

In [36]:
name = pd.DataFrame({'first name': ['Daniel', 'Emma', 'Sarah', 'Marcus'], 'last name': ['Gruber', 'Breitner', 'Neumann', 'Wagner']})
term3 = pd.merge(term2, name, left_on='student', right_on='first name')
term3

  student            lecture  grades professor first name last name
0   Sarah  Quantum mechanics     1.3   Feynman      Sarah   Neumann
1    Emma  Quantum mechanics     2.3   Feynman       Emma  Breitner
2  Marcus             Optics     2.0    Zwicky     Marcus    Wagner
3  Daniel          Cosmology     2.7   Huygens     Daniel    Gruber


This gives us two columns with redundant information. We can easily drop one of them:

In [37]:
term4 = pd.merge(term2, name, left_on='student', right_on='first name').drop('first name', axis=1)
term4

  student            lecture  grades professor last name
0   Sarah  Quantum mechanics     1.3   Feynman   Neumann
1    Emma  Quantum mechanics     2.3   Feynman  Breitner
2  Marcus             Optics     2.0    Zwicky    Wagner
3  Daniel          Cosmology     2.7   Huygens    Gruber


### Statistics with Pandas

Similar to NumPy, Pandas provides methods on its objects to calculate statistical figures of merit, for example:

In [45]:
print('number of items: ', term4.count())
print('\nmean grade: ', term4['grades'].mean())
print('\nmedian grade: ', term4['grades'].median())
print('\nminimum grade: ', term4['grades'].min())
print('\nmaximum grade: ', term4['grades'].max())
print('\nstandard deviation of grades: ', term4['grades'].std())
print('\nvariance of grades: ', term4['grades'].var())
print('\nproduct of all items: ', term4['grades'].prod())
print('\nsum of all items: ', term4['grades'].sum())

number of items:  student      4
lecture      4
grades       4
professor    4
last name    4
dtype: int64

mean grade:  2.075

median grade:  2.15

minimum grade:  1.3

maximum grade:  2.7

standard deviation of grades:  0.5909032633745278

variance of grades:  0.3491666666666666

product of all items:  16.146

sum of all items:  8.3
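Two practical additions to the list above, sketched with the same four grades: `describe()` collects several of these figures in one call, and the standard deviation deserves a second look, because Pandas and NumPy use different defaults. Pandas divides by n - 1 (`ddof=1`), whereas `np.std()` divides by n (`ddof=0`):

```python
import numpy as np
import pandas as pd

grades = pd.Series([1.3, 2.3, 2.0, 2.7])

summary = grades.describe()              # count, mean, std, min, quartiles, max in one call

sample_std = grades.std()                # Pandas default: ddof=1 (divides by n - 1)
population_std = grades.std(ddof=0)      # divides by n
numpy_std = np.std(grades.to_numpy())    # NumPy default: ddof=0

print(summary)
print(sample_std, population_std, numpy_std)
```

When comparing results between the two libraries, it is worth setting `ddof` explicitly so that both compute the same quantity.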


Of course, the Pandas library provides many more functionalities. If you'd like to read up on those, have a look at the online [Pandas documentation](https://pandas.pydata.org/).

In the next part of the introduction, we will look at the possibilities for visualizing data with the Matplotlib library.