<div class="alert block alert-info alert">

# <center> Scientific Programming in Python
## <center>Karl N. Kirschner<br>Bonn-Rhein-Sieg University of Applied Sciences<br>Sankt Augustin, Germany

# <center> Pandas
#### <center> (Reading in, manipulating, analyzing and visualizing datasets.)</center>

<br><br>

"...providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for practical, real-world data analysis in Python." -- http://pandas.pydata.org/pandas-docs/stable/

- Tabular data with heterogeneously-typed columns (e.g., <b>CSV</b>, SQL, LibreOffice Calc, MS <b>Excel</b>)
- Ordered and unordered time series data.
- Arbitrary matrix data with row and column labels


<b>Significant things to note</b>:
- Allows you to <b>operate in any direction on your data</b> (i.e., by rows or by columns)
    - Database experts will find this interesting
        - SQL: manipulate data by rows (i.e., <b>row-focused</b>)
        - Columnar databases: manipulate data by columns (i.e., <b>column-focused</b>)
    - Operate on data using 1-2 lines of code


- Data structures
    - <b>Series</b> - 1-dimensional data
    - <b>DataFrame</b> - 2-dimensional data


- <b>Index data</b>
    - can organize your data quickly and logically (e.g., based on calendar dates)
    - can handle missing data


- <b>Missing data</b>
    - NaN
    - mean
    - fill forward and backward
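
As a minimal sketch of these options (the series `temperatures` and its values are made up for illustration), missing entries can be replaced by the mean, or filled forward or backward from neighboring values:

```python
import pandas as pd

# Hypothetical data with two missing entries (NaN)
temperatures = pd.Series([20.0, float('nan'), 24.0, float('nan'), 30.0])

filled_mean = temperatures.fillna(temperatures.mean())  # mean() skips NaN by default
filled_forward = temperatures.ffill()    # copy the last valid value forward
filled_backward = temperatures.bfill()   # copy the next valid value backward

print(filled_mean.tolist())
print(filled_forward.tolist())
print(filled_backward.tolist())
```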

#### Basic Functionalities to Know

1. Basics: https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html
1. Head and tail: (https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#head-and-tail)
1. Attributes and underlying data (relevant for the numpy lecture): (https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#attributes-and-underlying-data)
1. Descriptive statistics: (https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#descriptive-statistics)
1. Reindexing and altering labels: (https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#reindexing-and-altering-labels)
1. Iteration: (https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#iteration)
1. Sorting: (https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#sorting)
1. Copying: (https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#copying)
1. dtypes: (https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes)

#### Underlying libraries (dependencies used but not clearly seen)
1. Numpy
2. Matplotlib

<br>
    
#### Note about citations (i.e. referencing):

<b>For citing Pandas</b>: (via https://pandas.pydata.org/about/citing.html - modify for your Pandas version)

<b>Bibtex</b>

@software{reback2020pandas,  
    author       = {The pandas development team},  
    title        = {pandas-dev/pandas: Pandas},  
    month        = feb,  
    year         = 2020,  
    publisher    = {Zenodo},  
    version      = {latest},  
    doi          = {10.5281/zenodo.3509134},  
    url          = {https://doi.org/10.5281/zenodo.3509134}  
}

@InProceedings{mckinney-proc-scipy-2010,  
  author    = {{W}es {M}c{K}inney},  
  title     = {{D}ata {S}tructures for {S}tatistical {C}omputing in {P}ython},  
  booktitle = {{P}roceedings of the 9th {P}ython in {S}cience {C}onference},  
  pages     = {56 - 61},  
  year      = {2010},  
  editor    = {{S}t\'efan van der {W}alt and {J}arrod {M}illman},  
  doi       = {10.25080/Majora-92bf1922-00a}  
}

<br>
    
#### Sources
1. The pandas development team, pandas-dev/pandas: Pandas, Zenodo, 2020, https://doi.org/10.5281/zenodo.3509134, visited on May 15, 2023

2. McKinney, W., 2010, June. Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference, van der Walt, S. & Millman, J. (Eds.), vol. 445, pp. 56-61.

3. Pandas contributors, https://pandas.pydata.org. Online; accessed on May 15, 2023.

4. Wes McKinney, Python for Data Analysis; Data Wrangling with Pandas, Numpy and Ipython, O'Reilly, Second Edition, 2018.

## <center><font color='dodgerblue'>Top Python Libraries in the Field of Computational Chemistry (as of 2023)</font><br><br>(# of projects from 176 surveyed)</center>

<div> <img src="00_images/top_compchem_libraries.png" width="1000"/> </div>

In [None]:
## For extra information given within the lectures

from IPython.display import HTML, display


def set_code_background(color: str):
    ''' Set the background color for code cells.

        Source: psychemedia via https://stackoverflow.com/questions/49429585/
                how-to-change-the-background-color-of-a-single-cell-in-a-jupyter-notebook-jupy

        To match Jupyter's dev class colors:
            "alert alert-block alert-warning" = #fcf8e3

        Args:
            color: HTML color, rgba, hex
    '''

    script = ("var cell = this.closest('.code_cell');"
              "var editor = cell.querySelector('.input_area');"
              f"editor.style.background='{color}';"
              "this.parentNode.removeChild(this)")
    display(HTML(f'<img src onerror="{script}">'))


#set_code_background(color='#fcf8e3')

In [None]:
set_code_background(color='#fcf8e3')
print('Test for background color.')

<hr style="border:2px solid gray"></hr>

In [None]:
import pandas as pd

## Pandas Series

A series contains two components:
1. a <b>one-dimensional array-like object</b> that contains a sequence of data values, and
2. an associated array of <b>data labels</b> (i.e., an <b>'index'</b> that starts at zero by default)

#### Creating
Create a series that contains 5 integers, with index values from 0-4, using `Series`:

In [None]:
series_data = pd.Series([5, 10, 15, 20, 25], index=None)
series_data

Instead, let's manually assign the index values:

In [None]:
series_data = pd.Series([5, 10, 15, 20, 25], index=['d', 'e', 'a', 'simulation 1', 'simulation 2'])
series_data

We can alter these indexes at any time using `index`.

In [None]:
series_data.index = ['Norway', 'Italy', 'Germany', 'simulation 1', 'simulation 2']
series_data

#### Accessing the series

Access only the values, using `values`:

In [None]:
series_data.values

Access only the index, using `index`:

In [None]:
series_data.index

Access the data via an index label (i.e., <b>human readable</b>):

In [None]:
series_data['simulation 1']

Or by a position (more on this below in <b>Accessing and selecting data</b>):

<b>Note</b>: The <b>deprecated</b> way that you will often see is

`series_data[3]`

In [None]:
series_data.iloc[3]

#### Using operators

In [None]:
series_data**2

What happens when one of the series has <b>missing data</b>?

Let's create an <b>alternate series</b> that has the <b>Italian data missing</b>, and then <b>add them</b> to the original series:

In [None]:
series_data_missing = pd.Series([5, 10, 20, 25], index=['Germany', 'Norway', 'simulation 1', 'simulation 2'])
series_data_missing

In [None]:
series_data_missing + series_data

<font color='dodgerblue'>Notice</font>:
1. The values are <b>correctly summed together</b> even though the two <b>index sequences are ordered differently</b> (see the <b>Germany</b> values)
    - ['Norway', 'Italy', <b>'Germany'</b>, 'simulation 1', 'simulation 2']  versus
    - [<b>'Germany'</b>, 'Norway', 'simulation 1', 'simulation 2']


2. The <b>missing index</b> results in a <b>`NaN`</b>
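
If a `NaN` is not wanted, one option is the `add` method with `fill_value`, which treats a label missing from one series as the given value. A small sketch that rebuilds the two series from above:

```python
import pandas as pd

series_data = pd.Series([5, 10, 15, 20, 25],
                        index=['Norway', 'Italy', 'Germany', 'simulation 1', 'simulation 2'])
series_data_missing = pd.Series([5, 10, 20, 25],
                                index=['Germany', 'Norway', 'simulation 1', 'simulation 2'])

# 'Italy' is absent from series_data_missing, so it is treated as 0 instead of NaN
filled_sum = series_data.add(series_data_missing, fill_value=0)
print(filled_sum)
```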

Converting a Pandas series to a regular list:
- use `tolist()`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.tolist.html

In [None]:
series_data.tolist()

<div class="alert alert-block alert-warning">
<hr style="border:1.5px dashed gray"></hr>

## Extra Information
### dtype
- https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes

Pandas will default to `int64` and `float64` dtypes for numbers.

However, notice that `series_data` and `series_data_missing` were `int64`, while `series_data + series_data_missing` resulted in `float64`. This is due to Pandas' built-in <b>upcasting</b>:
    
    "Types can potentially be upcasted when combined with other types, meaning they are promoted from
    the current type (e.g., int to float)."
    
`NaN` cannot be stored in an `int64`, only in a `float64` (a NumPy limitation, since `NaN` is a floating-point value).
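
As a hedged aside, Pandas also offers a nullable integer dtype, `'Int64'` (with a capital I), which stores missing entries as `pd.NA` rather than `NaN`. A short sketch with made-up values:

```python
import pandas as pd

default_series = pd.Series([5, None, 20])                  # upcast to float64, missing value is NaN
nullable_series = pd.Series([5, None, 20], dtype='Int64')  # stays integer, missing value is pd.NA

print(default_series.dtype)
print(nullable_series.dtype)
```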
<hr style="border:1.5px dashed gray"></hr>

#### Filtering and Sorting

Filter the data, using a comparison operator (a boolean expression: `series_data >= 15`):

In [None]:
series_data >= 15

To return a filtered series:

In [None]:
series_data[series_data >= 15]

Sorting a series by its index, using `sort_index()`:

In [None]:
series_data.sort_index()

<font color='dodgerblue'>Notice</font> the sorting goes by:
1. <b>Capital</b> letters (i.e., Germany, Italy, Norway), and then by
1. <b>Lowercase</b> letters (i.e., simulation 1, simulation 2)
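
If a case-insensitive ordering is preferred, `sort_index` accepts a `key` function (available in Pandas 1.1.0 and later, I believe). A sketch with a hypothetical lowercase label `austria` added for contrast:

```python
import pandas as pd

# Hypothetical series mixing capitalized and lowercase labels
mixed_case = pd.Series([1, 2, 3, 4], index=['Norway', 'austria', 'Germany', 'Italy'])

default_order = mixed_case.sort_index()                               # capitals first
ignore_case = mixed_case.sort_index(key=lambda idx: idx.str.lower())  # case-insensitive

print(default_order.index.tolist())
print(ignore_case.index.tolist())
```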

Sorting a series by data values, using `sort_values()`:

In [None]:
series_data.sort_values()

<hr style="border:2px solid gray"></hr>

## DataFrames
- DataFrames represent a <b>rectangular, ordered</b> table of data (numbers, strings, etc.)

- Conceptually like a spreadsheet

Let's create a simple user function that allows us to <font color='dodgerblue'>reset our example dataframe</font> as needed
1. First create a dictionary
2. Convert the <b>dictionary</b> to a <b>dataframe</b>

In [None]:
def dict2dataframe():
    '''Create a dataframe 'by hand' using a dictionary whose lists have equal lengths.'''

    data_dict = {'group': ['Deichkind', 'Die Fantastischen Vier', 'Seeed', 'Paul van Dyk'],
                 'year': [2015, 2016, 2017, 2018],
                 'attendence (x1000)': [50, 60, 70, 90]}

    dataframe = pd.DataFrame(data_dict, index=['band 1', 'band 2', 'band 3', 'band 4'])

    return dataframe

In [None]:
dict2dataframe()

In [None]:
example_df = dict2dataframe()
example_df

<b>Alter the indexes</b> as done for series, using `index`. <font color='dodgerblue'>Notice</font> that index values <b>do not need to be unique</b> for each row, but this <b>can cause problems</b> (e.g., deleting rows using the index label).

Assign `band 1` to the first two index positions

In [None]:
example_df.index = ['band 1', 'band 1', 'band 3', 'band 4']
example_df

<b>Alter the header names</b> using Pandas' `rename`.
- Done using a dictionary (key: value)
    - key = old name
    - value = new name

In [None]:
example_df.rename({'attendence (x1000)': 'attendence'}, axis='columns', inplace=True)
example_df

#### Inserting columns

Insert a column with specified values:

In [None]:
example_df['quality'] = ['good', 'excellent', 'good', 'average']
example_df

Insert a column and fill it with `NaN`:

In [None]:
# Assigning a scalar NaN fills every row of the new column
example_df['number of total concerts'] = float('nan')
example_df

List the column labels using `columns`:

In [None]:
example_df.columns

List the row labels using `index`:

In [None]:
example_df.index

<b>Inserting rows</b>:

1. Create a Pandas Series

2. Use `to_frame()` to convert a series to a dataframe: https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html

3. Use `transpose()` to transpose (think about like a matrix operation) the dataframe: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html?highlight=transpose#pandas.DataFrame.transpose

4. Use `concat()` (concatenate) to combine them: https://pandas.pydata.org/docs/reference/api/pandas.concat.html?highlight=concat#pandas.concat
    - axis='rows' ; axis=0

In [None]:
new_band = pd.Series({'group':'Scorpions',
                      'year':1965,
                      'attendence':100})

example_df = pd.concat([example_df, new_band.to_frame().transpose()], axis='rows', ignore_index=True)

example_df

<font color='dodgerblue'>Notice</font>:
1. how the index changes to integers (due to `ignore_index=True`).
1. how `NaN` is added to the columns not specified (i.e., to `quality` and `number of total concerts`)
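
If the existing row labels should be kept, a single row can alternatively be added by assigning to a new label with `loc`. A minimal sketch using a reduced, stand-alone version of the example data:

```python
import pandas as pd

bands = pd.DataFrame({'group': ['Deichkind', 'Seeed'],
                      'year': [2015, 2017]},
                     index=['band 1', 'band 2'])

# Assigning to a label that does not exist yet appends a new row;
# the list must supply one value per column, in column order
bands.loc['band 3'] = ['Scorpions', 1965]
print(bands)
```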

### Dropping data entries
- `DataFrame.drop` will <b>drop columns</b> and <b>rows</b> using the <b>axis</b> keyword
    - `axis='rows'` ;`axis=0` ; `axis='index'`
    - `axis='columns'` ; `axis=1`

#### Removing columns
- Use `drop()`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
- axis='columns' ; axis=1

In [None]:
example_df = example_df.drop(['year', 'attendence'], axis='columns')
example_df

#### Removing rows
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
- `axis='rows'` ; `axis=0` ; `axis='index'`

In [None]:
example_df = example_df.drop([0, 1], axis='index')
example_df

<font color='dodgerblue'><b>Note</b></font>: if the index were strings (e.g., `band 1`, `band 2`), then you would do something like `example_df.drop(['band 1', 'band 2'], axis='index')`.

What happens if you have rows with the same index?

Let's reset, and set <b>two indexes</b> to `band 3`:

In [None]:
example_df = dict2dataframe()
example_df.index = ['band 1', 'band 3', 'band 3', 'band 4']
example_df

In [None]:
example_df = example_df.drop(['band 3'])
example_df
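
Both rows labeled `band 3` are removed. A sketch of how duplicated labels could be detected beforehand, and how only the first of them could be kept (using a reduced stand-alone dataframe):

```python
import pandas as pd

bands = pd.DataFrame({'group': ['Deichkind', 'Die Fantastischen Vier', 'Seeed', 'Paul van Dyk']},
                     index=['band 1', 'band 3', 'band 3', 'band 4'])

print(bands.index.is_unique)              # False: a label occurs more than once
dropped_all = bands.drop(['band 3'])      # removes every row labeled 'band 3'

# Keep the first row for each label and discard later repeats
first_only = bands[~bands.index.duplicated(keep='first')]
print(first_only)
```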

<hr style="border:2px solid gray"></hr>

## Accessing and selecting data
There are many ways to do this (df: dataframe)
- slicing using `df[:]` (rows)


- `df.loc[val]`: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
- `df.loc[row_val, col_val]`


- `df.iloc[row_index, col_index]`: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html#pandas.DataFrame.iloc


- `df[val]` and `df[[]]`
- and more

Reset the example dataframe:

In [None]:
example_df = dict2dataframe()
example_df

#### Accessing/Selecting rows (by the index)

- Using slicing `:`

via index names:

(Note if you only wanted a single row you would do: `example_df['band 1':'band 1']`)

In [None]:
example_df['band 1':'band 3']

via index numbers:

In [None]:
example_df[0:3]
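
One difference worth keeping in mind: slicing by labels includes the end label, while slicing by positions excludes the end position. A sketch with a stand-alone copy of the example data, using the explicit `loc` and `iloc` accessors:

```python
import pandas as pd

bands = pd.DataFrame({'group': ['Deichkind', 'Die Fantastischen Vier', 'Seeed', 'Paul van Dyk']},
                     index=['band 1', 'band 2', 'band 3', 'band 4'])

by_label = bands.loc['band 1':'band 3']   # 'band 3' is included
by_position = bands.iloc[0:2]             # position 2 ('band 3') is excluded

print(len(by_label), len(by_position))
```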

via a specified list:
- `loc` with double `[[ ]]`

<font color='dodgerblue'>Notice</font> how we skip `band 2` in the following, and thus it is not a range.

In [None]:
example_df.loc[['band 1', 'band 3']]

#### Access a specific cell (row index, column labels)

By label
- `loc`

In [None]:
example_df.loc['band 3', 'group']

Or by index number
- `iloc`

In [None]:
example_df.iloc[2, 0]

#### Substitute a value at a specific cell

In [None]:
example_df.loc['band 3', 'year'] = 2024
example_df

<font color='dodgerblue'><b>Notice</b></font>: Assigning with `loc` to a column that does not exist yet also adds a new column!

In [None]:
example_df.loc['band 3', 'number of total concerts'] = 10000
example_df

### Accessing/Selecting columns

#### Accessing columns (by label)

<font color='dodgerblue'>Single column:</font>

- the single `[ ]` (i.e., returns a Pandas series)

In [None]:
example_df['group']

<font color='dodgerblue'>Multiple columns:</font>

- the double `[[ ]]` (i.e., passing a list of column labels, which returns a Pandas dataframe)

In [None]:
example_df[['group', 'year']]

- `loc[row , column]`

<font color='dodgerblue'><b>Notice</b></font>: the <b>rows</b> designation is left as `:`, followed by a `,` and then the <b>columns</b>

In [None]:
example_df.loc[:, 'group':'attendence (x1000)']

Now, let's put everything together
- slice the rows (e.g. `'band 1':'band 3'`), and
- slice the columns (e.g. `'group':'year'`)

In [None]:
example_df.loc['band 1':'band 3', 'group':'year']

#### Filtering Dataframes

- one condition

In [None]:
example_df[(example_df['year'] > 2015)]

Note: the parentheses are not needed in the statement above, but they are required around each condition in the two-condition example below, because `&` binds more tightly than the comparison operators.
<br>

- two conditions

In [None]:
example_df[ (example_df['year'] > 2015) & (example_df['attendence (x1000)'] <= 70) ]
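
Conditions can likewise be combined with `|` (or) and negated with `~` (not). A sketch on a stand-alone dataframe modeled on the example data (the years are chosen for illustration):

```python
import pandas as pd

bands = pd.DataFrame({'year': [2015, 2016, 2017, 2018],
                      'attendence (x1000)': [50, 60, 70, 90]},
                     index=['band 1', 'band 2', 'band 3', 'band 4'])

# Rows that satisfy at least one of the two conditions
either = bands[(bands['year'] > 2017) | (bands['attendence (x1000)'] <= 50)]

# Rows that do NOT satisfy the condition
not_recent = bands[~(bands['year'] > 2015)]

print(either.index.tolist())
print(not_recent.index.tolist())
```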

<hr style="border:2px solid gray"></hr>

## Essential Functions

### Reminder about reordering the rows by their indexes

- demonstrates what happens to a dataframe with multiple columns

- `reindex`: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reindex.html?highlight=reindex#pandas.DataFrame.reindex

In [None]:
example_df = dict2dataframe()
example_df

In [None]:
example_df = example_df.reindex(['band 3', 'band 4', 'band 1', 'band 2'])
example_df

### Factorize categorical data
- This is something that is sometimes done when performing data analysis
    - e.g., <b>machine learning</b>
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html

In [None]:
example_df = dict2dataframe()

example_df['quality'] = ['good', 'excellent', 'good', 'average']

example_df

In [None]:
codes, uniques = example_df['quality'].factorize()

In [None]:
codes

In [None]:
uniques

In [None]:
example_df

In [None]:
example_df['quality_numeric'] = codes
example_df
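
Note that `factorize` numbers the categories in order of first appearance, which carries no ranking. If the codes should reflect an ordering (e.g., average < good < excellent), one option is an explicit mapping. A sketch, where the dictionary `quality_rank` is an assumption made for illustration:

```python
import pandas as pd

quality = pd.Series(['good', 'excellent', 'good', 'average'])

codes, uniques = quality.factorize()
print(codes)     # codes follow the order of first appearance
print(uniques)

# Hypothetical ranking chosen by hand
quality_rank = {'average': 0, 'good': 1, 'excellent': 2}
ranked = quality.map(quality_rank)
print(ranked.tolist())
```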

### Iterate over rows
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html#pandas-dataframe-iterrows

In [None]:
for index, row in example_df.iterrows():
    print(f"Index: {index} ; Group: {row['group']}\n")
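
`iterrows` builds a Series for every row, which can be slow for large dataframes. One commonly used alternative is `itertuples`, sketched here on a reduced stand-alone dataframe:

```python
import pandas as pd

bands = pd.DataFrame({'group': ['Deichkind', 'Seeed'], 'year': [2015, 2017]},
                     index=['band 1', 'band 2'])

lines = []
for row in bands.itertuples():
    # The row label is available as row.Index, the columns as attributes
    lines.append(f"Index: {row.Index} ; Group: {row.group}")

print(lines)
```

Column names that are not valid Python identifiers (e.g., names containing spaces) are replaced by positional names in the tuples, so attribute access works best with simple column names.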

<hr style="border:2px solid gray"></hr>

## Combining dataframes
Take columns from different dataframes and put them together, either stacked into a single column or side by side as multiple columns

### Using `concat`
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat

<b>Example</b> - student grades on homework 1 and 2

First create the dataframes:

In [None]:
homework_1_grades = pd.DataFrame({'student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
                                  'homework 1': [63.0, 76.0, 76.0,
                                                 76.0, 0.0, 0.0, 
                                                 88.0, 86.0, 76.0,
                                                 86.0, 70.0, 0.0, 80.0]})
homework_1_grades

- <b>Extra</b>: `sample()` is used to randomize the rows
    - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html?highlight=sample#pandas.DataFrame.sample

In [None]:
homework_2_grades = pd.DataFrame({'student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
                                  'homework 2': [70.0, 73.0, 91.0,
                                                 89.0, 58.0, 0.0,
                                                 77.0, 91.0, 86.0,
                                                 78.0, 100.0, 61.5, 71.0]})

homework_2_grades = homework_2_grades.sample(frac=1)

homework_2_grades

#### Combine to create one column

Now bring the two dataframes together using `concat`.

In [None]:
new_df_concat = pd.concat([ homework_1_grades['homework 1'], homework_2_grades['homework 2'] ], axis='rows')
new_df_concat

<b>Result</b>: Creates a single column
- A Pandas <b>Series</b> is made
- Each input keeps its own <b>index ordering</b>, so the index labels repeat from one dataset to the next

In [None]:
type(new_df_concat)

#### Combine to create two columns

In [None]:
new_df_concat = pd.concat([ homework_2_grades['homework 2'], homework_1_grades['homework 1'] ], axis='columns')
new_df_concat

- Results in a Pandas <b>DataFrame</b>
- Indexes are <b>ordered</b> using the <b>first dataframe</b>
- <font color='dodgerblue'>Notice</font> how the data is <b>aligned</b> by the <b>indexes</b>.

You can do this for any <b>number of dataframes</b>:

In [None]:
## New dataframe, but with one extra student homework added

homework_3_grades = pd.DataFrame({'student': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
                                  'homework 3': [89.0, 58.0, 0.0,
                                                 70.0, 73.0, 91.0,
                                                 78.0, 100.0, 61.5,
                                                 77.0, 91.0, 86.0, 71.0, 99.0]})

homework_3_grades = homework_3_grades.sample(frac=1)

homework_3_grades

- Also sort by the index now

In [None]:
new_df_concat = pd.concat([homework_2_grades['homework 2'],
                           homework_1_grades['homework 1'],
                           homework_3_grades['homework 3']],
                          axis='columns')

new_df_concat.sort_index(ascending=True, inplace=True)

new_df_concat

### Using `merge`
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html#pandas.merge

- Combines only <b>2 dataframes</b>
- Combine based on <b>any specified column</b> or the index
- By default (`how='inner'`), <b>drops</b> the <b>rows</b> that are <b>not in common</b> (i.e., student 14 from homework 3)

In [None]:
pd.merge(homework_1_grades, homework_3_grades, on='student')
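
To keep the rows that appear in only one of the dataframes, `merge` can be given `how='outer'`. A sketch with shortened, made-up grade tables:

```python
import pandas as pd

hw_a = pd.DataFrame({'student': [1, 2, 3], 'homework 1': [63.0, 76.0, 76.0]})
hw_b = pd.DataFrame({'student': [1, 2, 3, 4], 'homework 3': [89.0, 58.0, 0.0, 99.0]})

inner = pd.merge(hw_a, hw_b, on='student')                # default: only students in both
outer = pd.merge(hw_a, hw_b, on='student', how='outer')   # all students, NaN where missing

print(inner)
print(outer)
```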

<hr style="border:2px solid gray"></hr>

## Math operators

Let's perform some math on a dataframe.

Dataframe:
- 5 rectangles that are defined by
    - length (m)
    - height (m)

In [None]:
rectangles_dict = {'length': [0.1, 9.4, 6.2, 3.8, 9.4],
                   'height': [8.7, 6.2, 9.4, 5.6, 3.3]}

rectangles_data = pd.DataFrame(rectangles_dict,
                               index=['Rect. 1', 'Rect. 2', 'Rect. 3', 'Rect. 4', 'Rect. 5'])

rectangles_data

#### Operate on all columns
- convert them to centimeters
- returns a dataframe

In [None]:
rectangles_data*100

#### Operate on a single column
- returns a series

In [None]:
rectangles_data['length']*100

#### Operation using two columns (e.g., for the area of a rectangle)

In [None]:
rectangles_data['length'] * rectangles_data['height']

#### Create a new dataframe column based on math using other columns

In [None]:
rectangles_data['area'] = rectangles_data['length'] * rectangles_data['height']
rectangles_data

### Descriptive statistics

Using <b>Python built-in functions</b> (e.g., max, min) on a Pandas dataframe column:

In [None]:
max(rectangles_data['area'])

In [None]:
min(rectangles_data['area'])

Notice above: how the dataframe is given within the parentheses.

Using <b>pandas functions</b>

- count (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html)
- sum (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html)
- median (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html)
- std (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html)
- var (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html)
- max (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html)
- min (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.min.html)
- correlation analysis (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html)
- and many more
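
Several of these statistics can also be obtained at once. A sketch using `describe` and `agg` on a stand-alone copy of the rectangle data:

```python
import pandas as pd

rectangles = pd.DataFrame({'length': [0.1, 9.4, 6.2, 3.8, 9.4],
                           'height': [8.7, 6.2, 9.4, 5.6, 3.3]})

summary = rectangles.describe()   # count, mean, std, min, quartiles, max per column
selected = rectangles['length'].agg(['min', 'max', 'median'])

print(summary)
print(selected)
```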



<b><font color='dodgerblue'>Notice</font> below</b> how <b>the dataframe is given first</b>, followed by the function (e.g., `df.max()`)

On all dataframe columns:

In [None]:
rectangles_data.max()

On a specific column:

In [None]:
rectangles_data['area'].max()

`idxmin` and `idxmax`

"Return <b>index</b> of the first occurrence of maximum over requested axis."[1]

1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmax.html

In [None]:
rectangles_data

In [None]:
maximum_index = rectangles_data['area'].idxmax()
maximum_index

Using this index value, let's see the entire row as a dataframe:
- slice approach
- `loc` approach

In [None]:
# rectangles_data[maximum_index:maximum_index]

rectangles_data.loc[maximum_index:maximum_index]

<font color='dodgerblue'>Notice</font> it is the <b>FIRST OCCURRENCE</b>
- Returns the row with a length=9.4, height=6.2 and an area=58.28 (i.e., index = <b>'Rect. 2'</b>)
- It does <b>NOT</b> return values for the rows that contain
    - length=6.2, height=9.4 and an area=58.28 (i.e., index = <b>'Rect. 3'</b>)
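
To retrieve every row that shares the maximum, rather than only the first, one option is to compare the column against its maximum. A sketch on a stand-alone copy of the rectangle data:

```python
import pandas as pd

rectangles = pd.DataFrame({'length': [0.1, 9.4, 6.2, 3.8, 9.4],
                           'height': [8.7, 6.2, 9.4, 5.6, 3.3]},
                          index=['Rect. 1', 'Rect. 2', 'Rect. 3', 'Rect. 4', 'Rect. 5'])
rectangles['area'] = rectangles['length'] * rectangles['height']

# Boolean mask that is True for all rows holding the maximum area
all_maxima = rectangles[rectangles['area'] == rectangles['area'].max()]
print(all_maxima)
```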

In [None]:
rectangles_data['area'].count()

In [None]:
rectangles_data['area'].mean()

In [None]:
rectangles_data['area'].std()

In [None]:
print(f"Mean area (with proper significant figures): "\
      f"{rectangles_data['area'].mean():0.1e} ± {rectangles_data['area'].std():0.1e}")

#### Sidenote: How to use other libraries (e.g., statistics)
- Make sure you have a good reason to do this (i.e., be consistent and concise) since it adds overhead to your code.

In [None]:
import statistics
statistics.mean(rectangles_data['area'])

### Unique values
- using Pandas' `unique` function

In [None]:
rectangles_data['area'].unique()

- Find the <b>unique values</b> and <b>count</b> their occurrences

In [None]:
rectangles_data['area'].value_counts()

### Sorting dataframes
- similar to how the series was done above, but with a twist
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
- `df.sort_values()`

Our original, unsorted dataframe:

In [None]:
rectangles_data

- sort by a single column's values

In [None]:
rectangles_data.sort_values(by='area')

- sort by multiple columns
    - done consecutively: rows are sorted by the first column, and ties are then ordered by the next column

In [None]:
rectangles_data.sort_values(by=['area', 'length'])
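
The sort direction can be set per column by passing a list to `ascending`. A sketch on a stand-alone copy of the rectangle data, sorting by decreasing area and breaking ties by increasing length:

```python
import pandas as pd

rectangles = pd.DataFrame({'length': [0.1, 9.4, 6.2, 3.8, 9.4],
                           'height': [8.7, 6.2, 9.4, 5.6, 3.3]},
                          index=['Rect. 1', 'Rect. 2', 'Rect. 3', 'Rect. 4', 'Rect. 5'])
rectangles['area'] = rectangles['length'] * rectangles['height']

# Largest area first; rows with equal area are ordered by the shorter length first
ordered = rectangles.sort_values(by=['area', 'length'], ascending=[False, True])
print(ordered)
```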

<hr style="border:2px solid gray"></hr>

## Data from a csv-formatted file

- The example CSV data file used below can be found at https://github.com/karlkirschner/2020_Scientific_Programming/blob/master/data_3d.csv

In [None]:
## For Colabs

## In order to upload data

#from google.colab import files
#uploaded = files.upload()

In [None]:
!head data_3d.csv --lines=10

Options for handling the header row:
1. For files without a header, have Pandas assign <b>integer labels</b> as the <b>header</b> (e.g., 0 1 2)
 - for this example file, which has a header, this will cause 'Time', 'Exp' and 'Theory' to be placed in the first row

In [None]:
df = pd.read_csv('data_3d.csv', header=None, sep=',')
df

2. Read in a csv file, using the <b>first row</b> (i.e., 0) as the <b>header</b>

In [None]:
df = pd.read_csv('data_3d.csv', header=0, sep=',')
df

3. Assign the headers yourself
     - use `skiprows` if the first row labels are present, as in this example

In [None]:
df = pd.read_csv('data_3d.csv', skiprows=1, names=['header 1', 'header 2', 'average'], sep=',')
df
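
Since `data_3d.csv` may not be at hand, the three options can be tried on an in-memory stand-in. The sketch below assumes a small made-up file with the same column names (`Time`, `Exp`, `Theory`); the numbers are placeholders:

```python
import io
import pandas as pd

# Made-up CSV content standing in for data_3d.csv
csv_text = "Time,Exp,Theory\n0.0,0.10,0.12\n1.0,0.20,0.21\n"

no_header = pd.read_csv(io.StringIO(csv_text), header=None)   # integer labels; names land in row 0
first_row = pd.read_csv(io.StringIO(csv_text), header=0)      # first row becomes the header
own_names = pd.read_csv(io.StringIO(csv_text), skiprows=1,
                        names=['header 1', 'header 2', 'average'])

print(no_header.columns.tolist())
print(first_row.columns.tolist())
print(own_names.columns.tolist())
```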

####  Save data to a new csv file, printing out to the first decimal place

In [None]:
df.to_csv('pandas_out.csv',
          sep=',', float_format='%.1f',
          index=False, encoding='utf-8')

<hr style="border:2px solid gray"></hr>

## Visualizing the data via Pandas plotting

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html


#### Types of plots

The type of plot is specified through the pandas.DataFrame.plot's `kind` keyword.

1. ‘line’ : line plot (default)
1. ‘bar’ : vertical bar plot
1. ‘barh’ : horizontal bar plot
1. ‘hist’ : histogram
1. ‘box’ : boxplot
1. ‘kde’ : Kernel Density Estimation plot
1. ‘density’ : same as ‘kde’
1. ‘area’ : area plot
1. ‘pie’ : pie plot
1. ‘scatter’ : scatter plot
1. ‘hexbin’ : hexbin plot

In [None]:
df = pd.read_csv('data_3d.csv', header=0, sep=',')
df

In Pandas v. 1.1.0, `xlabel` and `ylabel` were introduced:

In [None]:
df.plot(x='Time', y='Exp', kind='scatter',
        xlabel='Time (unit)', ylabel='Exp. (unit)',
        title='Example Plot', fontsize=16)

In [None]:
## The following is usable when `kind` = line, box, hist, kde, but not for scatter

df.plot(x='Time', y=['Exp', 'Theory'], kind='line',
        xlabel='X-Label', ylabel='Y-Label',
        title=['Example Plot: Exp', 'Example Plot: Theory'], fontsize=16, subplots=True)

An <b>alternative way</b> (also usable with older Pandas versions) that gives you a bit <b>more control</b> over, for example,
1. the fontsize of different elements
    - axis label
    - title
1. legend location

This is similar to how matplotlib works.

In [None]:
graphs = df.plot(x='Time', y=['Exp', 'Theory'], kind='line', fontsize=16, subplots=True)

graphs[0].set_title("Example Title 1", fontsize=16)
graphs[0].set_ylabel("Exp. (unit)", fontsize=16)
graphs[0].legend(loc='upper left')

graphs[1].set_title("Example Title 2", fontsize=16)
graphs[1].set_xlabel("Time (unit)", fontsize=16)
graphs[1].set_ylabel("Theory (unit)", fontsize=16)
graphs[1].legend(loc='upper center')

<div class="alert alert-block alert-warning">
<hr style="border:1.5px dashed gray"></hr>

## Moving averages (data smoothing)
- https://en.wikipedia.org/wiki/Moving_average

- rolling mean of data via pandas

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html?highlight=rolling#pandas.DataFrame.rolling

In [None]:
rectangles_data['area moving avg'] = rectangles_data['area'].rolling(window=2, win_type=None).mean()
rectangles_data
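
The first entry of the moving average is `NaN` because a window of two values is not yet full. If that is undesired, `min_periods` can relax the requirement. A sketch with made-up numbers:

```python
import pandas as pd

values = pd.Series([1.0, 3.0, 5.0, 7.0])

strict = values.rolling(window=2).mean()                  # first entry is NaN
relaxed = values.rolling(window=2, min_periods=1).mean()  # first entry uses the one available value

print(strict.tolist())
print(relaxed.tolist())
```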

<div class="alert alert-block alert-warning">
<hr style="border:1.5px dashed gray"></hr>

## Pandas to LaTeX
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_latex.html

In [None]:
print(df.to_latex(index=False))

<div class="alert alert-block alert-warning">
<hr style="border:1.5px dashed gray"></hr>

## Import Data from a European data csv file
(e.g. decimal usage: 10.135,11)

In [None]:
## CSV data file can be found at
## https://github.com/karlkirschner/2020_Scientific_Programming/blob/master/data_eu.csv

## For Colabs

## In order to upload data

#from google.colab import files
#uploaded = files.upload()

In [None]:
!head data_eu.csv --lines=10

In [None]:
df = pd.read_csv('data_eu.csv', decimal=',', thousands='.', sep=';')
df.columns
df['Value']
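
As a self-contained check of these keywords, the sketch below parses a made-up, in-memory stand-in for `data_eu.csv`; the column name `Name` and the numbers are assumptions for illustration:

```python
import io
import pandas as pd

# Made-up content using European number formatting
eu_text = "Name;Value\nA;10.135,11\nB;1.000,5\n"

# '.' is read as the thousands separator and ',' as the decimal mark
eu_df = pd.read_csv(io.StringIO(eu_text), decimal=',', thousands='.', sep=';')
print(eu_df['Value'])
```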