# Data Analysis in Python with pandas

[pandas](https://pandas.pydata.org/docs/getting_started/overview.html) is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

Therefore, it is the tool of choice in what follows.

Let's start by importing this package (as well as the universally useful NumPy).

In [None]:
import pandas as pd
import numpy as np

## Content

1. [Accessing data files](#Accessing-data-files)
1. [Reading data files](#Reading-data-files)
1. [Manipulating data](#Manipulating-data)
1. [Data analysis](#Data-analysis)
    1. [Basic statistics](#Basic-statistics)
    1. [Smoothing](#Smoothing)
    1. [Tangent to a curve](#Tangent-to-a-curve)


## Accessing data files

How to access data files depends on whether they are stored locally, i.e. within the file system on the computer executing the Jupyter notebook, or remotely, i.e. hosted on another computer.

### Remotely hosted files

Any file that has a uniform resource locator (URL) can be accessed through that URL.
Examples are:
* `https://sample-videos.com/csv/Sample-Spreadsheet-100-rows.csv`
* `https://drive.google.com/uc?id=1zO8ekHWx9U7mrbx_0Hoxxu6od7uxJqWw&export=download`


### Locally stored files

Any local file is accessible based on its path in the file system.
Examples would be:
* `test.csv` resides in the same folder as the executed notebook
* `sub/dir/data.csv` is in the `sub/dir` folder relative to the notebook location
* `../../prices.csv` is stored two directory levels above the notebook location
* `/Users/joe/Documents/sample-data/chart.csv` is a specific file located at this absolute path
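The difference between relative and absolute paths can be checked programmatically. The following is a minimal sketch: the notebook location `/home/joe/project/notebooks` and the helper `describe` are hypothetical, and POSIX-style paths are assumed so that the result does not depend on the operating system running the example.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical notebook location, used only for illustration
notebook_dir = '/home/joe/project/notebooks'

def describe(path_str, base=notebook_dir):
    """Classify a path and resolve it against the notebook folder."""
    kind = 'absolute' if PurePosixPath(path_str).is_absolute() else 'relative'
    # join() keeps an absolute path unchanged; normpath() collapses any '..'
    resolved = posixpath.normpath(posixpath.join(base, path_str))
    return kind, resolved

for example in ['test.csv',
                'sub/dir/data.csv',
                '../../prices.csv',
                '/Users/joe/Documents/sample-data/chart.csv']:
    kind, resolved = describe(example)
    print(f'{example} is {kind} and resolves to {resolved}')
```

A relative path therefore only identifies a file once the notebook location is known, whereas an absolute path is unambiguous on its own.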

## Reading data files

For demonstration purposes, consider an example output file of an Instron tensile testing frame.
The file is called "force-displacement.csv", is stored in the "Data" folder next to the folder containing this notebook, and contains comma-separated data like this:
```
Results Table 1
,Specimen label,Width,Thickness,Length,Tensile strain (Displacement) at Break (Standard),Maximum Tensile stress
,,(mm),(mm),(mm),(%),(MPa)
"1","BrassS4_01","12.73","1.52","78.49","34.28","427.03"

1,Time,Displacement,Force
,(s),(mm),(N)
"","0.0000","0.0000","39.3799"
"","0.1000","0.0042","43.7967"
"","0.2000","0.0175","45.0747"
...
```



pandas offers a convenient [`read_csv` function](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) that is used to parse comma-separated values (csv files) into a data frame that we will call `df`.

* Since the first seven lines of the "force-displacement.csv" file contain information that we do not want to keep, we skip them with the `skiprows=7` option. (They would also confuse the parser, which relies on the first line to infer what the rest of the file looks like.)
* Similarly, because the first column is always empty, we read only the remaining three columns with `usecols=[1,2,3]`.
* Lastly, to specify our own (more meaningful) names for the columns of interest, we provide those with the `names=['Time/s','Displacement/mm','Force/N']` argument.


In [None]:
df = pd.read_csv('../Data/force-displacement.csv',
                 skiprows=7,
                 usecols=[1,2,3],
                 names=['Time/s','Displacement/mm','Force/N'],
                )
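To see what each option does without needing the file on disk, the following sketch parses the excerpt shown above from an in-memory string. Using `io.StringIO` as a stand-in for the file and the names `sample` and `demo` are assumptions made for illustration; the excerpt contains only the three data rows listed above.

```python
from io import StringIO

import pandas as pd

# Excerpt of the file content shown above (three data rows only)
sample = '''\
Results Table 1
,Specimen label,Width,Thickness,Length,Tensile strain (Displacement) at Break (Standard),Maximum Tensile stress
,,(mm),(mm),(mm),(%),(MPa)
"1","BrassS4_01","12.73","1.52","78.49","34.28","427.03"

1,Time,Displacement,Force
,(s),(mm),(N)
"","0.0000","0.0000","39.3799"
"","0.1000","0.0042","43.7967"
"","0.2000","0.0175","45.0747"
'''

demo = pd.read_csv(StringIO(sample),
                   skiprows=7,          # skip the seven header lines (incl. the blank one)
                   usecols=[1,2,3],     # ignore the empty first column
                   names=['Time/s','Displacement/mm','Force/N'],
                  )

print(demo)
print(demo.dtypes)   # the quoted numbers are parsed as floating-point values
```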


In [None]:
df.head(n=3)          # Display the first 3 rows of the data frame

One can access values in the data frame by first selecting the column and (optionally) a range of rows.
In the example below, we extract the time column for rows 10 to 20 (eleven rows in total); as usual in Python, the end of the slice, 21, is excluded.
The pandas documentation provides a more detailed [explanation of how to access parts of data frames](https://pandas.pydata.org/docs/user_guide/indexing.html) in case you want to learn more.

In [None]:
df['Time/s'][10:21]
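The same selection can be written more explicitly with the `.iloc` (position-based) and `.loc` (label-based) accessors. The sketch below uses a small stand-in data frame called `demo` (an assumption, so that it runs without the data file) with the default integer row labels. Note that `.loc` includes the end label, whereas positional slicing excludes the end position.

```python
import numpy as np
import pandas as pd

# Stand-in data frame with 30 rows and default row labels 0, 1, 2, ...
demo = pd.DataFrame({'Time/s': np.arange(30) * 0.1})

by_slice    = demo['Time/s'][10:21]        # positional slice, end excluded
by_position = demo['Time/s'].iloc[10:21]   # explicit positional indexing
by_label    = demo.loc[10:20, 'Time/s']    # label-based indexing, end included

print(len(by_slice), len(by_position), len(by_label))
```

All three expressions select the same eleven rows here because the row labels coincide with the row positions.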

## Manipulating data

### Definition of constants

Any constants should be declared with a descriptive name. 
Using these names (in contrast to an explicit value) in any subsequent calculations is good practice because it makes the formulas general and easy to understand.
Let's start by defining some named constants:

In [None]:
width = 12.73
thickness = 1.52
length = 78.49

inch_to_mm = 25.4

### Definition of functions

Similarly, explicitly defining (more complex) functions that one wants to apply to data is good practice compared to writing an inline evaluation.

In [None]:
def LinearMap(x,a=0.0,b=1.0):
    return a+b*x
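As a quick check of how such a function behaves, the sketch below repeats the definition (so that it runs on its own) and calls it with a scalar and with a NumPy array. The input values are made up for illustration.

```python
import numpy as np

def LinearMap(x,a=0.0,b=1.0):
    return a+b*x

inch_to_mm = 25.4

# With the default parameters the input is returned unchanged
unchanged = LinearMap(3.0)

# Because only + and * are used, the function also works element-wise on arrays
lengths_mm = np.array([0.0, 25.4, 50.8])
lengths_in = LinearMap(lengths_mm, b=1/inch_to_mm)

print(unchanged)
print(lengths_in)
```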

### Adding and deleting columns

Columns are added to a data frame by specifying the data that goes into it.
This can be a constant scalar, a given list of numbers (of the correct length), or the result of a calculation involving other columns.

In [None]:
df['zero'] = 0.0
df['range'] = np.arange(len(df))
df['prod'] = df['Time/s'] * df['Force/N']

In [None]:
df.head()

Columns can be deleted by dropping them from the data frame either in-place, or by returning a copy without the dropped columns.

In [None]:
df.drop(columns=['zero','range','prod'],
        inplace=True)

In [None]:
df.head()
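The cell above drops the columns in-place. The alternative, returning a copy, is sketched below on a small stand-in data frame `demo` (an assumption for illustration): without `inplace=True`, `drop` leaves the original untouched and returns a new data frame.

```python
import pandas as pd

demo = pd.DataFrame({'Time/s': [0.0, 0.1],
                     'Force/N': [39.3799, 43.7967],
                     'zero': 0.0})

trimmed = demo.drop(columns=['zero'])   # returns a copy without the column

print(list(demo.columns))      # original still contains 'zero'
print(list(trimmed.columns))   # copy does not
```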

### Example: unit conversions

Suppose you need to perform a unit conversion of `Displacement/mm` from mm to inch and `Force/N` from N to kN.
The results are stored as two new columns in the existing data frame `df`.

In [None]:
df['Displacement/in'] = df['Displacement/mm'] / inch_to_mm     # Converting displacements from mm to inch
df['Force/kN'] = df['Force/N'] * 1e-3                          # Converting forces from N to kN

In [None]:
df.head(n=3)

The same transformation can be achieved by applying the `LinearMap` function with suitable parameters `a` and `b`.

In [None]:
df['apply:Displacement/in'] = df['Displacement/mm'].apply(LinearMap,args=(0,1/inch_to_mm))     # Converting displacements from mm to inch
df['apply:Force/kN'] = df['Force/N'].apply(LinearMap,args=(0,1e-3))                            # Converting forces from N to kN

In [None]:
df.head(n=3)
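Since `LinearMap` only uses arithmetic that pandas supports for whole columns, it can also be called directly on a Series instead of going through `apply`. The sketch below compares the three variants on a short stand-in Series `disp` (an assumption for illustration). The comparison uses `np.allclose` because dividing by `inch_to_mm` and multiplying by `1/inch_to_mm` may differ in the last floating-point digit.

```python
import numpy as np
import pandas as pd

def LinearMap(x,a=0.0,b=1.0):
    return a+b*x

inch_to_mm = 25.4
disp = pd.Series([0.0, 0.0042, 0.0175], name='Displacement/mm')

direct     = disp / inch_to_mm                            # inline calculation
applied    = disp.apply(LinearMap,args=(0,1/inch_to_mm))  # element by element
vectorized = LinearMap(disp,b=1/inch_to_mm)               # whole column at once

print(np.allclose(direct, applied), np.allclose(direct, vectorized))
```

For large data frames, the direct call on the whole column is typically faster than `apply`, which invokes the Python function once per element.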

## Data analysis

### Basic statistics

A `pandas` "Series", i.e. a column of a data frame such as `df['Force/kN']`, offers multiple methods to extract basic statistical information.
A few useful ones are: 

* `Series.max()` to find the maximum
* `Series.min()` to find the minimum
* `Series.mean()` to calculate the mean (average)
* `Series.median()` to calculate the median
* `Series.mode()` to calculate the most frequent value(s)
* `Series.std()` to calculate the standard deviation 

In [None]:
print(f"""
Maximum displacement: {df['Displacement/mm'].max()} mm
Maximum force: {df['Force/kN'].max()} kN
Average force: {df['Force/kN'].mean()} kN
""")

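The methods listed above can be tried on a small Series with made-up values (the name `force` and the numbers are assumptions for illustration). Two details are worth knowing: `mode()` returns a Series because several values can be equally frequent, and `std()` calculates the sample standard deviation (normalized by N-1) by default.

```python
import pandas as pd

force = pd.Series([1.0, 2.0, 2.0, 3.0, 7.0], name='Force/kN')

print(f"""
Maximum: {force.max():.2f} kN
Minimum: {force.min():.2f} kN
Mean:    {force.mean():.2f} kN
Median:  {force.median():.2f} kN
Mode(s): {force.mode().tolist()} kN
Std:     {force.std():.2f} kN
""")

# describe() collects several of these statistics in one call
print(force.describe())
```

The format specifier `:.2f` limits the output to two decimal places, which is usually easier to read than the full floating-point representation.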
### Smoothing

In cases where the data is too noisy to be useful, it can be helpful to smooth it.

Please note that smoothing is different from curve fitting.
Curve fitting adjusts the parameters of a given function until it matches the observed values as closely as possible based on statistical criteria; the fitted function can then be used to extrapolate outside of the data interval.
Smoothing, on the other hand, only reduces the weight of outlying points and makes the trends in the data more obvious, with very limited possibilities for extrapolation.
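The distinction can be illustrated with synthetic data. The sketch below is not part of the tensile test analysis: it assumes a noisy straight line, uses `np.polyfit` as one example of curve fitting, and uses a centered rolling mean as a deliberately simple stand-in for a smoothing method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=x.size)   # noisy line

# Curve fitting: yields parameters of a function ...
slope, intercept = np.polyfit(x, y, deg=1)
# ... that can be evaluated anywhere, also outside of the data interval [0, 1]
y_at_2 = slope * 2.0 + intercept

# Smoothing: yields one smoothed value per data point, but no function
smooth = pd.Series(y).rolling(window=5, center=True).mean()

print(f'fitted slope: {slope:.3f}, intercept: {intercept:.3f}')
print(f'extrapolated value at x=2: {y_at_2:.3f}')
print(f'smoothed points: {smooth.notna().sum()} of {len(smooth)}')
```

The rolling mean is undefined for the first and last two points because the window does not fit there, which is one way to see why smoothing offers little help beyond the data interval.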

One possibility, demonstrated here, is to smooth a data series with the [Savitzky–Golay filter](https://docs.scipy.org/doc/scipy-1.14.0/reference/generated/scipy.signal.savgol_filter.html) from the `scipy.signal` module.

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt
from scipy.signal import savgol_filter


df['Smooth Force/kN'] = savgol_filter(x=df['Force/kN'],
                                      window_length=101, # larger window results in greater smoothing
                                      polyorder=2,
                                     )

fig,ax = plt.subplots()
sns.lineplot(data=df,
             x='Displacement/mm',
             y='Force/kN',
             color='blue',
             ax=ax,
            )
sns.lineplot(data=df,
             x='Displacement/mm',
             y='Smooth Force/kN',
             color='orange',
             ax=ax,
            )
_ = plt.show()
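To get a feeling for the `window_length` parameter, the following sketch applies the filter to synthetic data for which the noise-free signal is known. The sine signal, the noise level, and the chosen window lengths are assumptions for illustration, not values derived from the tensile test.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 500)
clean = np.sin(x)                                        # known signal
noisy = clean + rng.normal(scale=0.1, size=x.size)       # signal plus noise

def rmse(a, b):
    """Root-mean-square deviation between two arrays."""
    return np.sqrt(np.mean((a - b) ** 2))

rmse_noisy = rmse(noisy, clean)
print(f'unfiltered: deviation from clean signal = {rmse_noisy:.4f}')

results = {}
for window_length in (11, 51, 101):                      # odd window lengths
    smooth = savgol_filter(noisy,
                           window_length=window_length,
                           polyorder=2,
                          )
    results[window_length] = rmse(smooth, clean)
    print(f'window_length={window_length:3d}: deviation = {results[window_length]:.4f}')
```

A window that is too short leaves much of the noise in place, while a window that is too long starts to flatten real features of the signal, so the value usually has to be tuned for each data set.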