# Pandas
### Karl N. Kirschner

"...providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python." -- http://pandas.pydata.org/pandas-docs/stable/

- Tabular data with heterogeneously-typed columns (CSV, SQL, or Excel spreadsheet)
- Ordered and unordered time series data
- Arbitrary matrix data with row and column labels


**Significant things to note**:
- Data structures
    - Series - 1 dimensional data
    - DataFrame - 2 dimensional data
- Missing data - NaN


Additional source:
1. Wes McKinney, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, O'Reilly, Second Edition, 2018.
***

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

## Pandas Series

Series contain two components:
1. one-dimensional array-like object that contains a sequence of data values
2. an associated array of data labels (i.e. 'index')

Note: indexes start at '0'

In [None]:
## Note the capital 'S' in 'Series'
series_data_1 = pd.Series([5, 10, 15, 20, 25])
series_data_1

In [None]:
## access only the values
series_data_1.values

In [None]:
series_data_2 = pd.Series([5, 10, 15, 20, 25], index=['d', 'e', 'a', 'simulation', 'average'])
series_data_2

In [None]:
## Alter the name of the indexes
series_data_2.index = ['Norway', 'Italy', 'Germany', 'simulation', 'average']
series_data_2

In [None]:
## Access data via index label
series_data_2['simulation']

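In [None]:
## Access data via integer position (use .iloc; a plain [3] on a label-indexed Series is deprecated in recent pandas)
series_data_2.iloc[3]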

In [None]:
## Filter the data
series_data_2[series_data_2 >= 15]

In [None]:
## Using operators
series_data_2 ** 2

In [None]:
series_data_2 + series_data_2

In [None]:
## Sorting by index
series_data_2.sort_index()

What happens when one of the series has missing data (e.g. the Italian data is not present)?

In [None]:
series_data_3 = pd.Series([5, 10, 20, 25], index=['Germany', 'Norway', 'simulation', 'average'])

series_data_2 + series_data_3
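
The addition above aligns the two series on their labels, so the unmatched `Italy` entry comes out as `NaN`. A minimal sketch of two common ways to handle this follows; it rebuilds the series under the hypothetical names `full` and `partial`, and it assumes that treating the missing entry as 0 is acceptable for the data at hand.

```python
import pandas as pd

# Same labels as the series above; 'Italy' is missing from the second one.
full = pd.Series([5, 10, 15, 20, 25],
                 index=['Norway', 'Italy', 'Germany', 'simulation', 'average'])
partial = pd.Series([5, 10, 20, 25],
                    index=['Germany', 'Norway', 'simulation', 'average'])

# Plain addition aligns on labels; the unmatched 'Italy' label becomes NaN.
plain_sum = full + partial

# Assumption: a missing entry may be treated as 0 for this data.
filled_sum = full.add(partial, fill_value=0)

# Alternatively, discard the labels that could not be matched.
matched_only = plain_sum.dropna()

print(plain_sum)
print(filled_sum)
print(matched_only)
```

Which option is appropriate depends on what a missing value means in the data; silently filling with 0 can distort sums and averages.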

---
## Dataframes
- dataframes represent a **rectangular, ordered** table of data (numbers, strings, etc.)

- just like the tables you are familiar with in a spreadsheet

In [None]:
def dict2dataframe():
    '''Create a dataframe 'by hand' using a dictionary that has equal lengths'''
    
    data = {'group': ['Deichkind', 'Die Fantastischen Vier', 'Seeed', 'Paul van Dyk'],
            'year': [2015, 2016, 2017, 2018],
            'attendence (x1000)': [50, 60, 70, 90]}

    dataframe = pd.DataFrame(data)  # convert the dictionary to a pandas' dataframe
    return dataframe

In [None]:
example_df = dict2dataframe()
example_df

In [None]:
example_df.index = ['band 1', 'band 2', 'band 3', 'band 4']
example_df

Note that index labels don't need to be unique for each row, but duplicates can cause problems (for example, later we will delete rows based on the index label).

In [None]:
example_df.index = ['band 1', 'band 1', 'band 3', 'band 4']
example_df
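
One way to detect repeated labels before they cause trouble is sketched below. The small dataframe `df` is a hypothetical stand-in for `example_df`, reduced to one column.

```python
import pandas as pd

# Hypothetical stand-in for example_df with a repeated 'band 1' label.
df = pd.DataFrame({'group': ['Deichkind', 'Die Fantastischen Vier',
                             'Seeed', 'Paul van Dyk']},
                  index=['band 1', 'band 1', 'band 3', 'band 4'])

# False as soon as any label occurs more than once.
labels_are_unique = df.index.is_unique

# Boolean mask marking the second and later occurrences of each label.
repeat_mask = df.index.duplicated()

# A label lookup now returns every matching row (a DataFrame, not a single row).
band_1_rows = df.loc['band 1']

print(labels_are_unique)
print(repeat_mask)
print(band_1_rows)
```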

In [None]:
## reset the example
example_df = dict2dataframe()
example_df.index = ['band 1', 'band 2', 'band 3', 'band 4']
example_df

In [None]:
## Adding a new column and filling it with NaN (a float missing-value marker, not the string 'NaN')
example_df['Number of total concerts'] = float('nan')
example_df

---
## Accessing, selecting and filtering data
- there are many ways to do this (df: dataframe)
    - df[val] - selects a single column (but not always -- see below for filtering rows)
    - df.loc[val]
    - df.loc[row_val, col_val]
    - df.iloc[row_index, col_index]
    - and more

In [None]:
## accessing a single column (by a column label)
example_df['group']

In [None]:
## accessing multiple columns (passing a list to the dataframe)
## note the double [[ ]]
##  (more ways shown below)
example_df[['group', 'year']]

In [None]:
example_df[example_df.columns[0:2]]

In [None]:
## accessing a row (index label)
example_df.loc['band 3']

In [None]:
## access a specific cell - by labels
example_df.loc['band 3', 'group']

In [None]:
## substitute a value at a specific cell
example_df.loc['band 3', 'Number of total concerts'] = 10000
example_df

In [None]:
## access a specific cell - by integer position (row 2, column 0)
example_df.iloc[2, 0]

---
### Selecting a range of rows

In [None]:
example_df

In [None]:
## filter the table based on a row range
## Note: this can get confusing because of the way pandas was set up
##    - multiple ways to do the same thing (and unclear logic behind it, IMO)

## Method 1
example_df[0:3]

In [None]:
## Method 2
example_df['band 1':'band 3']

In [None]:
## Method 3 (using the idea of .loc to access rows - seems most logical)
example_df.loc['band 1':'band 3']
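
One detail that is easy to miss in the three methods above: slicing by integer position excludes the stop value, while slicing by label includes it. A minimal sketch, using a hypothetical one-column dataframe `df` with the same row labels:

```python
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016, 2017, 2018]},
                  index=['band 1', 'band 2', 'band 3', 'band 4'])

# Positional slice: rows 0, 1 and 2; position 3 is excluded.
by_position = df.iloc[0:3]

# Label slice: 'band 3' itself is included.
by_label = df.loc['band 1':'band 3']

print(by_position)
print(by_label)
```

Both selections return the same three rows here, even though the positional stop (3) points one row past the last row returned.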

### Selecting a range of columns

Notice that the rows designation is left as `:`, followed by a comma and then the columns

In [None]:
example_df.loc[:, 'group':'attendence (x1000)':1]

In [None]:
example_df

In [None]:
example_df.loc[:, 'year':'Number of total concerts':1]

Combining the row slice (e.g. `'band 1':'band 3'`) with the column slice (e.g. `'group':'attendence (x1000)'`):

In [None]:
example_df.loc['band 1':'band 3', 'group':'attendence (x1000)':1]

---
## Essential Functions

### Reindexing
- reordering the data rows

In [None]:
## create new dataframe
example_df_new = example_df.reindex(['band 3', 'band 4', 'band 1', 'band 2'])
example_df_new

In [None]:
## overwrite existing dataframe
#example_df = example_df.reindex(['band 3', 'band 4', 'band 1', 'band 2'])
#example_df
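
Reindexing is not limited to reordering. If a requested label does not exist, pandas adds a row of missing values for it. The sketch below assumes a hypothetical label `'band 9'` and a small stand-in dataframe `df`.

```python
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016]}, index=['band 1', 'band 2'])

# 'band 9' is a hypothetical label that does not exist in df,
# so its row is filled with NaN.
with_gap = df.reindex(['band 2', 'band 9', 'band 1'])

# fill_value replaces the NaN that reindex would otherwise insert.
with_fill = df.reindex(['band 2', 'band 9', 'band 1'], fill_value=0)

print(with_gap)
print(with_fill)
```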

---
### Dropping data entries
- `DataFrame.drop` will **drop columns** and **rows** using the **axis** keyword
    - axis='index' (or 'rows') ; axis=0
    - axis='columns' ; axis=1

In [None]:
## remove a row (both of the following work)

#example_df_new = example_df.drop('band 1')
example_df_new = example_df.drop('band 1', axis='rows') # axis=0 also works
example_df_new

In [None]:
## remove multiple rows (pass a list of index labels)

example_df_new = example_df.drop(['band 1', 'band 3'])
example_df_new

What happens if you have rows with the same index?

Let's reset, and add a duplicated `band 3` row:

In [None]:
## first add a duplicate row using reindex - two indexes named 'band 3'
example_df_new = example_df.reindex(['band 3', 'band 3', 'band 4', 'band 1', 'band 2'])
example_df_new

In [None]:
example_df_new.drop(['band 3'])
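
As the cell above shows, `drop` removes every row that carries the label. If the goal is instead to keep one copy of each label, one possible approach (a sketch on a hypothetical small dataframe `df`) filters on `Index.duplicated`:

```python
import pandas as pd

df = pd.DataFrame({'group': ['Seeed', 'Seeed', 'Paul van Dyk']},
                  index=['band 3', 'band 3', 'band 4'])

# drop removes all rows labelled 'band 3', not just the first one.
dropped = df.drop('band 3')

# Keep only the first occurrence of each index label.
first_only = df[~df.index.duplicated(keep='first')]

print(dropped)
print(first_only)
```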

In [None]:
example_df

Let's reset the example using the function we created above:

In [None]:
example_df = dict2dataframe()
example_df

In [None]:
## delete a column - method 1
del example_df['year']
example_df

In [None]:
## delete a column - method 2
example_df = example_df.drop('attendence (x1000)', axis='columns') # axis=1 also works
example_df

### Modifying data entries

In [None]:
example_df.loc[3, 'group'] = 'Die Toten Hosen'
example_df

### Combining dataframes
- take the columns from different dataframes and put them together into a single column
    1. concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat
    2. append (deprecated in pandas 1.4 and removed in pandas 2.0)
    

In [None]:
homework_1_grades = pd.DataFrame({'student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
                                  'homework 1': [63.0, 76.0, 76.0,
                                                 76.0, 0.0, 0.0, 
                                                 88.0, 86.0, 76.0,
                                                 86.0, 70.0, 0.0, 80.0]})
print(homework_1_grades)
print()
print('Length: {0}'.format(len(homework_1_grades)))

In [None]:
homework_2_grades = pd.DataFrame({'student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
                                  'homework 2': [70.0, 73.0, 91.0,
                                                 89.0, 58.0, 0.0,
                                                 77.0, 91.0, 86.0,
                                                 78.0, 100.0, 61.5, 71.0]})
print(homework_2_grades)
print()
print('Length: {0}'.format(len(homework_2_grades)))

In [None]:
new_df_concat = pd.concat([ homework_1_grades['homework 1'],
                            homework_2_grades['homework 2'] ])

print(new_df_concat)
print()
print('Length: {0}'.format(len(new_df_concat)))

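Alternative approach using 'append' (pandas < 2.0 only: `Series.append` was deprecated in pandas 1.4 and removed in 2.0, so `pd.concat` is used here to give the same result)

In [None]:
## Older pandas: homework_1_grades['homework 1'].append(homework_2_grades['homework 2'])
new_df_append = pd.concat([homework_1_grades['homework 1'], homework_2_grades['homework 2']])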

print(new_df_append)
print()
print('Length: {0}'.format(len(new_df_append)))
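
Note that the stacked result keeps the original index labels, so each label appears twice. The sketch below, on shortened hypothetical grade series `hw1` and `hw2`, shows two `pd.concat` options that may be useful here: `ignore_index=True` to renumber the rows, and `axis='columns'` to place the series side by side instead of stacking them.

```python
import pandas as pd

hw1 = pd.Series([63.0, 76.0, 76.0], name='homework 1')
hw2 = pd.Series([70.0, 73.0, 91.0], name='homework 2')

# Default: stacked end to end, original labels kept (0, 1, 2, 0, 1, 2).
stacked = pd.concat([hw1, hw2])

# Renumber the result from 0 to 5.
renumbered = pd.concat([hw1, hw2], ignore_index=True)

# Align on the index and build a two-column dataframe.
side_by_side = pd.concat([hw1, hw2], axis='columns')

print(stacked)
print(renumbered)
print(side_by_side)
```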

- Combine two dataframe based on their common keys.
    1. merge: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html#pandas.merge
    
(This is just one example from many ways to do this, including when the keys might not be shared.)

In [None]:
new_df_merge = pd.merge(homework_1_grades, homework_2_grades, on='student')
new_df_merge
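
For the case where the keys are not fully shared, a minimal sketch follows. The shortened dataframes `hw1` and `hw2` are hypothetical: student 1 only handed in homework 1 and student 4 only homework 2.

```python
import pandas as pd

hw1 = pd.DataFrame({'student': [1, 2, 3], 'homework 1': [63.0, 76.0, 76.0]})
hw2 = pd.DataFrame({'student': [2, 3, 4], 'homework 2': [73.0, 91.0, 89.0]})

# Default how='inner': only students present in both tables survive.
inner = pd.merge(hw1, hw2, on='student')

# how='outer' keeps every student; missing grades become NaN.
# indicator=True adds a '_merge' column stating where each row came from.
outer = pd.merge(hw1, hw2, on='student', how='outer', indicator=True)

print(inner)
print(outer)
```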

---
## Math operators

In [None]:
## 5 rectangles defined by their length and height
rectangles_dict = {'length': [0.1, 9.4, 6.2, 3.8, 1.5],
                   'height': [8.7, 6.2, 9.4, 5.6, 3.3]}

rectangles_data = pd.DataFrame(rectangles_dict)
rectangles_data

In [None]:
## math on all columns
rectangles_data/10

In [None]:
## math between columns (e.g. for the area of a rectangle)
rectangles_data['length'] * rectangles_data['height']

In [None]:
## Create a new column based on math using other columns
rectangles_data['area'] = rectangles_data['length'] * rectangles_data['height']

In [None]:
rectangles_data

### Descriptive statistics
- Using **python built-in functions** (e.g. max, min)

In [None]:
max(rectangles_data['area'])

In [None]:
min(rectangles_data['area'])

- Using **pandas functions**
    - count - number of non-NA values
    - sum, median, std, var
    - max, min
    - and many more

- Notice how the dataframe is given first, followed by the function.

In [None]:
rectangles_data.max()

In [None]:
rectangles_data['area'].max()

In [None]:
rectangles_data['area'].count()

In [None]:
rectangles_data['area'].mean()

In [None]:
rectangles_data['area'].std()
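
Several of these statistics can also be requested in one call. The sketch below uses a hypothetical series `areas` holding the rectangle areas from above, typed in by hand and rounded to two decimals.

```python
import pandas as pd

# Rounded copies of the areas computed above (typed in by hand).
areas = pd.Series([0.87, 58.28, 58.28, 21.28, 4.95], name='area')

# count, mean, std, min, quartiles and max in a single Series.
summary = areas.describe()

# A chosen subset of statistics, requested by name.
selected = areas.agg(['min', 'max', 'mean'])

print(summary)
print(selected)
```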

- rolling mean of data via pandas

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html?highlight=rolling#pandas.DataFrame.rolling

In [None]:
rectangles_data['area'].rolling(window=2, win_type=None).mean()
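
With `window=2`, the first entry of the result is `NaN` because a full window is not yet available. A small sketch of this behaviour, and of the `min_periods` parameter that relaxes it, on a hypothetical series `values`:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])

# The first entry is NaN: only one value is available for a window of 2.
strict = values.rolling(window=2).mean()

# min_periods=1 allows a result as soon as one value is in the window.
relaxed = values.rolling(window=2, min_periods=1).mean()

print(strict)
print(relaxed)
```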

### Unique values

In [None]:
## unique values and count their occurrence
rectangles_data['area'].value_counts()

In [None]:
## unique values
rectangles_data['area'].unique()

- using **other libraries** (e.g. statistics)

In [None]:
import statistics
statistics.mean(rectangles_data['area'])

### Sorting dataframes
- sorting by index
    - specify the axis whose labels are sorted (the default, axis=0, sorts the row labels)

In [None]:
## original, unsorted dataframe
rectangles_data
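
A minimal sketch of sorting by index labels, using a hypothetical two-row dataframe `df` whose row labels are deliberately out of order:

```python
import pandas as pd

df = pd.DataFrame({'length': [0.1, 9.4], 'height': [8.7, 6.2]},
                  index=['b', 'a'])

# Default axis=0: sort the rows by their index labels.
rows_sorted = df.sort_index()

# axis='columns' (or axis=1): sort the columns by their labels.
cols_sorted = df.sort_index(axis='columns')

# Descending order of the row labels.
rows_desc = df.sort_index(ascending=False)

print(rows_sorted)
print(cols_sorted)
print(rows_desc)
```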

- sort by column values

In [None]:
rectangles_data.sort_values(by='area')

- sort by multiple columns
    - applied in the order listed (later columns break ties in earlier ones)

In [None]:
## rows index 1 and 2 should switch due to length value
rectangles_data.sort_values(by=['area', 'length'])

### Filtering by boolean operators

In [None]:
rectangles_data

In [None]:
rectangles_data['area'] > 7.0

In [None]:
## single boolean condition
rectangles_data[rectangles_data['area'] > 7.0]

In [None]:
## multiple boolean conditions
rectangles_data[ (rectangles_data['area'] > 7.0) & (rectangles_data['area'] < 50.0) ]
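
Besides `&` (and), conditions can be combined with `|` (or) and negated with `~` (not). Each condition needs its own parentheses. The sketch below uses a shortened, hypothetical copy `df` of the rectangle data, and also shows `Series.between` as a compact way to write a range test.

```python
import pandas as pd

df = pd.DataFrame({'length': [0.1, 9.4, 1.5], 'height': [8.7, 6.2, 3.3]})
df['area'] = df['length'] * df['height']

# Either very small or very large areas.
either = df[(df['area'] < 1.0) | (df['area'] > 50.0)]

# Everything that is NOT larger than 7.0.
negated = df[~(df['area'] > 7.0)]

# Range test; both bounds are included by default.
in_range = df[df['area'].between(1.0, 50.0)]

print(either)
print(negated)
print(in_range)
```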

---
## Data from a 3D csv-formatted file

- The CSV data file used below can be found at https://github.com/karlkirschner/2020_Scientific_Programming/blob/master/data_3d.csv

In [None]:
## For Colab

## In order to upload data

#from google.colab import files
#uploaded = files.upload()

In [None]:
!head data_3d.csv --lines=10

For files without a header you can:
1. have pandas assign integer positions as the column headers (e.g. 0 1 2)
2. assign the headers yourself
    - df = pd.read_csv('data_file.csv', sep=',', names=['header 1', 'header 2', 'average'])
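
Both options can be tried without a file on disk. The sketch below reads from an in-memory string; the two data rows are made up for illustration and do not come from `data_3d.csv`, and the column names are only example choices.

```python
import io
import pandas as pd

# Hypothetical three-column data without a header row.
raw = io.StringIO("0.0,1.2,1.0\n0.5,1.9,2.0\n")

# Option 1: pandas numbers the columns 0, 1, 2.
auto = pd.read_csv(raw, header=None)

# Option 2: supply the column names yourself.
raw.seek(0)  # rewind the in-memory buffer before reading it again
named = pd.read_csv(raw, header=None, names=['Time', 'Exp', 'Theory'])

print(auto)
print(named)
```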

In [None]:
df = pd.read_csv('data_3d.csv', header=None, sep=',')
df

In [None]:
## Read in a csv file, using the first row (i.e. 0) as the header, with a comma separator
df = pd.read_csv('data_3d.csv', header=0, sep=',')
df

In [None]:
## Save data to a new csv file, printing out to the first decimal place
df.to_csv('pandas_out.csv',
          sep=',', float_format='%.1f',
          index=False, encoding='utf-8')

## Visualizing the data via Pandas plotting

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html

In [None]:
## kind = line, box, hist, kde

df.plot(x='Time', y=['Exp', 'Theory'], kind='line',
        title='Example Plot', fontsize=16)

## Revisiting Pandas Statistics

In [None]:
df['Exp'].mean()

In [None]:
df['Theory'].mean()

In [None]:
df['Exp Rolling'] = df['Exp'].rolling(window=4, win_type=None).mean()
df['Theory Rolling'] = df['Theory'].rolling(window=4, win_type=None).mean()
df

In [None]:
df.plot(x='Time', y=['Theory', 'Theory Rolling'], kind='line',
        title='Example Plot', fontsize=16)

---
# Diverse Topics

## Pandas to LaTeX
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_latex.html

In [None]:
print(df.to_latex(index=False))

***
## Import Data from a European data csv file
(e.g. decimal usage: 10.135,11)

In [None]:
## CSV data file can be found at
## https://github.com/karlkirschner/2020_Scientific_Programming/blob/master/data_eu.csv

## For Colab

## In order to upload data

#from google.colab import files
#uploaded = files.upload()

In [None]:
!head data_eu.csv --lines=10

In [None]:
df = pd.read_csv('data_eu.csv', decimal=',', thousands='.', sep=';')
df.columns
df['Value']
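
To see what the `decimal` and `thousands` arguments do without needing the data file, here is a minimal sketch that parses an in-memory string. The two rows and the `Label` column are made up for illustration; only the `Value` column name is taken from the cell above.

```python
import io
import pandas as pd

# Hypothetical European-formatted data: ';' separates the columns,
# '.' groups thousands and ',' marks the decimal point.
raw = io.StringIO("Label;Value\nA;10.135,11\nB;2,5\n")

eu = pd.read_csv(raw, sep=';', decimal=',', thousands='.')

print(eu)
print(eu['Value'].dtype)
```

Without these two arguments, pandas would read the `Value` column as text rather than as numbers.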