# Pandas
### Karl N. Kirschner

"...providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python." -- http://pandas.pydata.org/pandas-docs/stable/

- Tabular data with heterogeneously-typed columns (CSV, SQL, or Excel spreadsheet)
- Ordered and unordered time series data
- Arbitrary matrix data with row and column labels


**Significant things to note**:
- Data structures
    - Series - 1 dimensional data
    - DataFrame - 2 dimensional data
- Missing data - NaN


Additional source:
1. Wes McKinney, Python for Data Analysis; Data Wrangling with Pandas, Numpy and Ipython, O'Reilly, Second Edition, 2018.
***

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

## Pandas Series

Series contain two components:
1. one-dimensional array-like object that contains a sequence of data values
2. an associated array of data labels (i.e. 'index')

Note: the default integer index starts at `0`

#### Creating

Create a series that contains 5 integers:

In [None]:
series_data = pd.Series([5, 10, 15, 20, 25])
series_data

#### Indexes
Now, let us add some index labels to help identify the integers:

In [None]:
series_data = pd.Series([5, 10, 15, 20, 25], index=['d', 'e', 'a', 'simulation', 'average'])
series_data

We can alter these indexes at any time.

In [None]:
series_data.index = ['Norway', 'Italy', 'Germany', 'simulation', 'average']
series_data

#### Accessing the series

Access only the values:

In [None]:
series_data.values

Access the data via an index label:

In [None]:
series_data['simulation']

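Or by position, using `iloc` (plain `series_data[3]` on a label-indexed series is deprecated in recent pandas versions):

In [None]:
series_data.iloc[3]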

#### Using operators

In [None]:
series_data ** 2

In [None]:
series_data + series_data

What happens when one of the series has missing data?

Let's create an alternate series that has the Italian data missing, and then add them to the original series:

In [None]:
series_data_alt = pd.Series([5, 10, 20, 25], index=['Germany', 'Norway', 'simulation', 'average'])

series_data + series_data_alt
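The addition above aligns the two series on their labels, so a label that exists in only one series yields `NaN`. The following is a minimal sketch of that behavior using made-up values (the names `full` and `partial` are illustrative, not part of the notebook); it also shows `Series.add` with `fill_value`, which is one way to treat the missing entry as zero instead.

```python
import pandas as pd

# Hypothetical values, loosely mirroring the series used above
full = pd.Series([5, 10, 15], index=['Norway', 'Italy', 'Germany'])
partial = pd.Series([5, 10], index=['Germany', 'Norway'])  # no 'Italy' entry

# Plain addition aligns on labels; 'Italy' has no partner, so it becomes NaN
aligned_sum = full + partial

# Series.add with fill_value substitutes 0 for the missing entry before adding
filled_sum = full.add(partial, fill_value=0)

print(aligned_sum)
print(filled_sum)
```

Whether a missing value should count as zero depends on the data, so `fill_value` is a choice rather than a default recommendation.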

#### Filtering and Sorting

Filter the data:

In [None]:
series_data[series_data >= 15]

Sorting a series by its index:

In [None]:
series_data.sort_index()

Sorting a series by data values:

In [None]:
series_data.sort_values()

---
## Dataframes
- a dataframe represents a **rectangular, ordered** table of data (numbers, strings, etc.)

- just like the tables you are familiar with from a spreadsheet

Let's create a simple user function that will allow us to reset our example dataframe as needed
1. First create a dictionary
2. Convert the dictionary to a dataframe

In [None]:
def dict2dataframe():
    '''Create a dataframe 'by hand' from a dictionary whose lists have equal lengths'''
    
    data = {'group': ['Deichkind', 'Die Fantastischen Vier', 'Seeed', 'Paul van Dyk'],
            'year': [2015, 2016, 2017, 2018],
            'attendence (x1000)': [50, 60, 70, 90]}

    dataframe = pd.DataFrame(data)  # convert the dictionary to a pandas' dataframe
    
    return dataframe

In [None]:
example_df = dict2dataframe()
example_df

Alter these indexes in the same way we did for the series:

In [None]:
example_df.index = ['band 1', 'band 2', 'band 3', 'band 4']
example_df

Note that index labels don't need to be unique for each row, but duplicates can cause problems (for example, later we will delete rows based on the index label).

Assign `band 1` to the first two index positions:

In [None]:
example_df.index = ['band 1', 'band 1', 'band 3', 'band 4']
example_df
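To make the consequence of duplicate labels concrete, here is a small sketch on a made-up three-row table (the name `bands` is only for illustration). It checks `Index.is_unique` and shows that `loc` returns a different kind of object depending on whether the label is duplicated.

```python
import pandas as pd

# Hypothetical table with a duplicated index label
bands = pd.DataFrame({'group': ['Deichkind', 'Seeed', 'Paul van Dyk'],
                      'year': [2015, 2017, 2018]},
                     index=['band 1', 'band 1', 'band 3'])

labels_are_unique = bands.index.is_unique  # False, because 'band 1' appears twice

both_rows = bands.loc['band 1']   # duplicated label: a DataFrame with two rows
single_row = bands.loc['band 3']  # unique label: a Series holding one row

print(labels_are_unique)
print(type(both_rows), type(single_row))
```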

#### Inserting columns and rows

Insert columns (simple):

In [None]:
example_df['quality'] = ['good', 'excellent', 'out of control good', 'get your techno on']
example_df

Insert a column and fill it with `NaN`:

In [None]:
example_df['number of total concerts'] = float('nan')  # a real NaN, not the string 'NaN'
example_df

Include a new row:

(Notice how `NaN` is added to the columns that are not specified.)

In [None]:
new_row = pd.DataFrame([{'group': 'Scorpions', 'year': 1965, 'attendence (x1000)': 100}])
example_df = pd.concat([example_df, new_row], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
example_df

#### Removing columns
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
- `axis` can be `'columns'` or `1`

In [None]:
example_df = example_df.drop(['attendence (x1000)', 'number of total concerts'], axis='columns')
example_df

---
## Accessing, selecting and filtering data
- there are many ways to do this (df: dataframe)
    - `df[val]` and `df[[]]`
    - `df.loc[val]`: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
    - `df.loc[row_val, col_val]`
    - `df.iloc[row_index, col_index]`: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html#pandas.DataFrame.iloc
    - and more
    
**Suggestion** - choose one method like `df.loc` and learn it first

Reset the example

In [None]:
example_df = dict2dataframe()
example_df

Reindex the dataframe:

In [None]:
example_df.index = ['band 1', 'band 2', 'band 3', 'band 4']
example_df

#### Accessing/Selecting rows (by the index)

Single row:
- Using slicing `:`

via index names:

In [None]:
example_df['band 1':'band 1']

via index numbers:

In [None]:
example_df[0:1]

Alternative
- `loc` with double `[[ ]]` (passing a list)

In [None]:
example_df.loc[['band 1']]

Multiple rows
- Using slicing `:`

via index names:

In [None]:
example_df['band 1':'band 3']

via index numbers:

In [None]:
example_df[0:3]

Alternative approaches:
- `loc` with double `[[ ]]`

In [None]:
example_df.loc[['band 1', 'band 3']]

#### Access a specific cell (index, labels)

In [None]:

example_df.loc['band 3', 'group']

Or by index number
- `iloc`

In [None]:
example_df.iloc[2, 0]
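One detail of the label-based and position-based selections above is easy to miss: a label slice includes its end label, while a positional slice excludes its end position. The sketch below illustrates this on a small stand-in table (`bands` and its values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical stand-in for the example dataframe
bands = pd.DataFrame({'group': ['Deichkind', 'Seeed', 'Paul van Dyk'],
                      'year': [2015, 2017, 2018]},
                     index=['band 1', 'band 2', 'band 3'])

by_label = bands.loc['band 1':'band 2']  # label slice: 'band 2' is included
by_position = bands.iloc[0:2]            # positional slice: position 2 is excluded

# Both selections contain the same two rows
print(by_label.equals(by_position))

# The same cell addressed by labels and by positions
cell_by_label = bands.loc['band 3', 'group']
cell_by_position = bands.iloc[2, 0]
print(cell_by_label, cell_by_position)
```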

#### Substitute a value at a specific cell

In [None]:
example_df.loc['band 3', 'number of total concerts'] = 10000
example_df

### Accessing/Selecting columns

#### Accessing columns (by label)

Single column:

In [None]:
example_df['group']

Multiple columns:

- the double `[[ ]]` (passing a list to the dataframe)

In [None]:
example_df[['group', 'year']]

Alternative approaches
- the `df.columns` command

In [None]:
example_df[example_df.columns[0:2]]

- `loc`

Notice that the rows designation is left as `:`, followed by a `,` and then the columns

In [None]:
example_df.loc[:, 'group':'attendence (x1000)']

In [None]:
example_df

Now, let's put everything together:
- slicing for rows (e.g. `'band 1':'band 3'`) and
- slicing the columns (e.g. `'group':'attendence (x1000)'`)

In [None]:
example_df.loc['band 1':'band 3', 'group':'attendence (x1000)']

---
## Essential Functions

### Reordering the rows
- `reindex`

In [None]:
example_df.reindex(['band 3', 'band 4', 'band 1', 'band 2'])

---
### Dropping data entries
- `DataFrame.drop` will **drop columns** and **rows** using the **axis** keyword
    - axis='rows' ; axis='index' ; axis=0
    - axis='columns' ; axis=1
    
    
Let's remind ourselves of what the dataframe looks like.

In [None]:
example_df

#### Removing Rows

Remove a single row

In [None]:
example_df.drop('band 1', axis='rows')

Remove multiple rows

In [None]:
example_df.drop(['band 2', 'band 3'], axis='rows')

What happens if you have rows with the same index?

Let's reset, and set two rows as `band 3`:

In [None]:
example_df = dict2dataframe()
example_df.index = ['band 1', 'band 3', 'band 3', 'band 4']
example_df

In [None]:
example_df.drop(['band 3'])

Notice that we have not reassigned the result to `example_df`, so the original dataframe remains untouched:

In [None]:
example_df
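The point that `drop` returns a new dataframe rather than changing the original can be checked directly. This is a minimal sketch on a made-up table (`bands` is an illustrative name); the change only persists once the result is assigned back.

```python
import pandas as pd

# Hypothetical table with two rows sharing the label 'band 3'
bands = pd.DataFrame({'group': ['Deichkind', 'Seeed', 'Paul van Dyk']},
                     index=['band 1', 'band 3', 'band 3'])

dropped = bands.drop('band 3')  # removes every row labeled 'band 3'
rows_before = len(bands)        # the original still has all of its rows

bands = bands.drop('band 3')    # reassigning makes the change persistent
rows_after = len(bands)

print(rows_before, rows_after)
```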

#### Deleting columns

Let's reset the example using the function we created above:

In [None]:
example_df = dict2dataframe()
example_df

Remove a single column
- `drop`

In [None]:
example_df.drop('attendence (x1000)', axis='columns')

In [None]:
example_df.drop(['year', 'attendence (x1000)'], axis='columns')

### Modifying data entries

In [None]:
example_df.loc[3, 'group'] = 'Die Toten Hosen'
example_df

---

## Combining dataframes
- take the columns from different dataframes and put them together into a single column
    1. concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat
    2. append (only available in pandas versions before 2.0)
    

In [None]:
homework_1_grades = pd.DataFrame({'student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
                                  'homework 1': [63.0, 76.0, 76.0,
                                                 76.0, 0.0, 0.0, 
                                                 88.0, 86.0, 76.0,
                                                 86.0, 70.0, 0.0, 80.0]})
homework_1_grades

In [None]:
homework_2_grades = pd.DataFrame({'student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
                                  'homework 2': [70.0, 73.0, 91.0,
                                                 89.0, 58.0, 0.0,
                                                 77.0, 91.0, 86.0,
                                                 78.0, 100.0, 61.5, 71.0]})
homework_2_grades

In [None]:
new_df_concat = pd.concat([ homework_1_grades['homework 1'], homework_2_grades['homework 2'] ], axis='rows')
new_df_concat

In [None]:
type(new_df_concat)

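Older pandas versions (before 2.0) also offered `Series.append` for this. It has since been removed, so with current pandas the same result again comes from `pd.concat`:

In [None]:
new_df_append = pd.concat([homework_1_grades['homework 1'], homework_2_grades['homework 2']])
new_df_append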

In [None]:
type(new_df_append)

- Combine two dataframe based on their common keys.
    1. merge: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html#pandas.merge
    
(This is just one example from many ways to do this, including when the keys might not be shared.)

In [None]:
pd.merge(homework_1_grades, homework_2_grades, on='student')
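For the case mentioned above where the keys are not all shared, the `how` parameter of `pd.merge` controls which rows are kept. The sketch below uses two shortened, made-up grade tables (`hw1` and `hw2` are illustrative names) to contrast the default inner merge with an outer merge.

```python
import pandas as pd

# Hypothetical grade tables; student 1 and student 4 each appear in only one
hw1 = pd.DataFrame({'student': [1, 2, 3], 'homework 1': [63.0, 76.0, 76.0]})
hw2 = pd.DataFrame({'student': [2, 3, 4], 'homework 2': [73.0, 91.0, 89.0]})

# Default (how='inner'): only students present in both tables
inner = pd.merge(hw1, hw2, on='student')

# how='outer': all students, with NaN where a grade is missing
outer = pd.merge(hw1, hw2, on='student', how='outer')

print(inner)
print(outer)
```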

---
## Math operators

Let's perform some math on a dataframe.

Dataframe:
- 5 rectangles that are defined by
    - length
    - height

In [None]:
rectangles_dict = {'length': [0.1, 9.4, 6.2, 3.8, 1.5],
                   'height': [8.7, 6.2, 9.4, 5.6, 3.3]}

rectangles_data = pd.DataFrame(rectangles_dict)
rectangles_data

#### Operate on all columns (e.g. dividing by 10)

In [None]:
rectangles_data/10

#### Operation using two columns (e.g. for the area of a rectangle)

In [None]:
rectangles_data['length'] * rectangles_data['height']

#### Create a new column based on math using other columns

In [None]:
rectangles_data['area'] = rectangles_data['length'] * rectangles_data['height']
rectangles_data

### Descriptive statistics
- Using **python built-in functions** (e.g. max, min)

In [None]:
max(rectangles_data['area'])

In [None]:
min(rectangles_data['area'])

- Using **pandas functions**
    - count - number of non-NA values
    - sum, median, std, var
    - max, min
    - and many more


- Notice how the dataframe is given first, followed by the function (e.g. `df.max()`)

On all dataframe columns:

In [None]:
rectangles_data.max()

On a specific column:

In [None]:
rectangles_data['area'].max()

In [None]:
rectangles_data['area'].count()

In [None]:
rectangles_data['area'].mean()

In [None]:
rectangles_data['area'].std()

#### Moving averages (data smoothing)
- https://en.wikipedia.org/wiki/Moving_average

- rolling mean of data via pandas

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html?highlight=rolling#pandas.DataFrame.rolling

In [None]:
rectangles_data['area'].rolling(window=2, win_type=None).mean()
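A rolling mean with `window=2` has no complete window for the first entry, so that position is `NaN`. The following sketch, on a made-up series, shows this and the `min_periods` parameter, which allows partial windows if that is acceptable for the analysis.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])  # hypothetical data

# Each result is the mean of the current and the previous value;
# the first entry has no previous value, so it is NaN
strict = values.rolling(window=2).mean()

# min_periods=1 accepts a window that contains only one value
lenient = values.rolling(window=2, min_periods=1).mean()

print(strict)
print(lenient)
```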

### Unique values

- Unique values

In [None]:
rectangles_data['area'].unique()

- Unique values and a count of their occurrences

In [None]:
rectangles_data['area'].value_counts()

#### Using other libraries (e.g. statistics)
- make sure you have a good reason to do this (i.e. be consistent)

In [None]:
import statistics
statistics.mean(rectangles_data['area'])

### Sorting dataframes
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
- `df.sort_values()`
    
Our original, unsorted dataframe:

In [None]:
rectangles_data

- sort by a single column's values

In [None]:
rectangles_data.sort_values(by='area')

- sort by multiple columns
    - the columns are applied in order: ties in the first column are broken by the second

In [None]:
## rows with index 1 and 2 have equal areas, so they switch places based on their length values
rectangles_data.sort_values(by=['area', 'length'])

### Filter by boolean operators

In [None]:
rectangles_data

In [None]:
rectangles_data['area'] > 7.0

#### Return a dataframe based on one boolean condition

In [None]:
rectangles_data[rectangles_data['area'] > 7.0]

#### Return a dataframe based on multiple boolean conditions

In [None]:
rectangles_data[ (rectangles_data['area'] > 7.0) & (rectangles_data['area'] < 50.0) ]
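Besides `&` (and), boolean masks can be combined with `|` (or) and negated with `~` (not). Each condition needs its own parentheses because these operators bind more tightly than the comparisons. A minimal sketch on a shortened, made-up version of the rectangle table (`rects` is an illustrative name):

```python
import pandas as pd

# Hypothetical, shortened rectangle data
rects = pd.DataFrame({'length': [0.1, 9.4, 3.8],
                      'height': [8.7, 6.2, 5.6]})
rects['area'] = rects['length'] * rects['height']

# Rows whose area is either very small OR very large
small_or_large = rects[(rects['area'] < 1.0) | (rects['area'] > 50.0)]

# Rows whose area is NOT very small
not_small = rects[~(rects['area'] < 1.0)]

print(small_or_large)
print(not_small)
```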

---
## Data from a csv-formatted file

- The example CSV data file used below can be found at https://github.com/karlkirschner/2020_Scientific_Programming/blob/master/data_3d.csv

In [None]:
## For Colabs

## In order to upload data

#from google.colab import files
#uploaded = files.upload()

In [None]:
!head data_3d.csv --lines=10

For files without a header you can:
1. have pandas assign integer column labels as the header (e.g. 0 1 2)

In [None]:
df = pd.read_csv('data_3d.csv', header=None, sep=',')
df

2. Read in a csv file, using the first row (i.e. 0) as the header, with a comma separator


In [None]:
df = pd.read_csv('data_3d.csv', header=0, sep=',')
df

3. Assign the headers yourself
     - use `skiprows` if the first row labels are present, as in this example

In [None]:
df = pd.read_csv('data_3d.csv', sep=',', skiprows=1, names=['header 1', 'header 2', 'average'])
df
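The three header options above can be compared without the data file by reading from an in-memory string. This is a sketch under the assumption that the file has the columns `Time`, `Exp` and `Theory` (the names used in the plotting cells); the numeric values are invented.

```python
import io
import pandas as pd

# Hypothetical CSV content with a header row
csv_text = "Time,Exp,Theory\n0.0,1.1,1.0\n1.0,2.1,2.0\n"

# 1. header=None: the header row is treated as data; columns are labeled 0, 1, 2
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# 2. header=0: the first row supplies the column names
first_row = pd.read_csv(io.StringIO(csv_text), header=0)

# 3. skiprows=1 plus names: discard the existing header and assign new names
renamed = pd.read_csv(io.StringIO(csv_text), skiprows=1,
                      names=['header 1', 'header 2', 'average'])

print(no_header.shape, first_row.shape, renamed.shape)
```

Note that with `header=None` the header text ends up in the first data row, so the columns are read as strings rather than numbers.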

#### Save data to a new csv file, writing floats to one decimal place

In [None]:
df.to_csv('pandas_out.csv',
          sep=',', float_format='%.1f',
          index=False, encoding='utf-8')
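As a quick check of what `float_format='%.1f'` does, here is a minimal sketch that writes a small, made-up dataframe to a string rather than to a file (calling `to_csv` without a path returns the CSV text). The name `demo` and its values are assumptions for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'x': [1.234, 5.678]})  # hypothetical values

# Without a file path, to_csv returns the CSV content as a string
csv_text = demo.to_csv(index=False, float_format='%.1f')
lines = csv_text.splitlines()

print(lines)
```

The values are rounded only in the written output; the dataframe itself keeps its full precision.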

## Visualizing the data via Pandas plotting

https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.plot.html

In [None]:
df = pd.read_csv('data_3d.csv', header=0, sep=',')

In [None]:
## kind = line, box, hist, kde

df.plot(x='Time', y=['Exp', 'Theory'], kind='line',
        title='Example Plot', fontsize=16)

## Reminder about using Pandas statistics

In [None]:
df['Exp'].mean()

In [None]:
df['Theory'].mean()

In [None]:
df['Exp Rolling'] =  df['Exp'].rolling(window=4, win_type=None).mean()
df['Theory Rolling'] =  df['Theory'].rolling(window=4, win_type=None).mean()
df

In [None]:
df.plot(x='Time', y=['Theory', 'Theory Rolling'], kind='line',
        title='Example Plot', fontsize=16)

---
# Side Topics

## Pandas to Latex
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_latex.html

In [None]:
print(df.to_latex(index=False))

***
## Import Data from a European data csv file
(e.g. decimal usage: 10.135,11)

In [None]:
## CSV data file can be found at
## https://github.com/karlkirschner/2020_Scientific_Programming/blob/master/data_eu.csv

## For Colabs

## In order to upload data

#from google.colab import files
#uploaded = files.upload()

In [None]:
!head data_eu.csv --lines=10

In [None]:
df = pd.read_csv('data_eu.csv', decimal=',', thousands='.', sep=';')
print(df.columns)  # only a cell's last expression is displayed automatically
df['Value']
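The effect of the `decimal` and `thousands` arguments can be verified on an in-memory string. This sketch assumes a semicolon-separated file with a `Value` column, as in the cell above; the `Label` column and the numbers are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical European-style content: '.' groups thousands, ',' marks decimals
eu_text = "Label;Value\nA;10.135,11\nB;2,5\n"

eu = pd.read_csv(io.StringIO(eu_text), decimal=',', thousands='.', sep=';')

# The column is parsed as floats: 10.135,11 is read as 10135.11
print(eu['Value'])
print(eu['Value'].dtype)
```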