In [None]:
# apply Jupyter notebook style
from IPython.core.display import HTML

from custom.styles import style_string

HTML(style_string)

<div style="text-align:center;">
  <img src="custom/molssi_main_horizontal.png" style="display: block; margin: 0 auto; max-height:200px;">
</div>

# Linear Fitting

<div class="overview admonition"> 
<p class="admonition-title">Overview</p>

Questions:

* How can I use Python to fit a linear model?

Objectives:

* Fit a linear model using statsmodels

</div>


A common exercise in introductory chemistry is fitting a linear model to spectrophotometric measurements of absorbance at a particular wavelength vs. the concentration of an analyte.
In these cases, the absorbance of a sample is described by the [Beer-Lambert law](https://en.wikipedia.org/wiki/Beer%E2%80%93Lambert_law):

$$
A = \varepsilon \cdot l \cdot c
$$

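where $A$ is the absorbance, $\varepsilon$ is the absorptivity (a constant for a given analyte and wavelength), $l$ is the path length, and $c$ is the concentration of the sample.
Because $\varepsilon$ and $l$ are fixed during an experiment, a plot of $A$ against $c$ is expected to be a straight line through the origin with slope $\varepsilon \cdot l$.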
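
As a quick numerical illustration of the equation above, here is a minimal sketch. The function name `absorbance` and all of the numeric values are hypothetical and were chosen only to keep the arithmetic easy to follow.

```python
def absorbance(epsilon, path_length, concentration):
    """Return the absorbance predicted by the Beer-Lambert law, A = epsilon * l * c."""
    return epsilon * path_length * concentration

# Hypothetical values, for illustration only.
epsilon = 5000.0   # absorptivity in L/(mol cm)
path_length = 1.0  # path length in cm
conc = 2.0e-5      # concentration in mol/L

A = absorbance(epsilon, path_length, conc)

# Rearranging the law recovers the concentration from a measured absorbance.
recovered = A / (epsilon * path_length)
print(A, recovered)
```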


In this notebook, we will read in absorbance data, fit a linear model, and calculate the concentrations of unknowns using our model.
We will utilize skills learned in our previous lessons, including reading, accessing, and analyzing data using `pandas` and visualizing data using `plotly`.
We will also add a new skill: fitting a linear model with a Python library called [statsmodels](https://www.statsmodels.org/stable/index.html).

## Importing Libraries and Reading Data

Note that when fitting a model in Python, there are a number of libraries you might pick, including NumPy, SciPy, scikit-learn, and statsmodels.
The library that you pick might be based on personal preference or on features that a particular library offers.
In this notebook, we show fitting with `statsmodels` because it makes it easy to see fit statistics and to define a formula for fitting.
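
For comparison, the sketch below fits a straight line with NumPy's `np.polyfit`, one of the alternatives mentioned above. The data are synthetic and the variable names are only for illustration. `np.polyfit` returns the coefficients but none of the fit statistics that `statsmodels` reports.

```python
import numpy as np

# Synthetic data on an exact line, y = 2x + 1 (illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit returns the coefficients from highest power to lowest.
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```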

For `statsmodels`, we will use the formula API. 
This will allow us to define the relationship we'd like to fit using a formula as a string.

To start this notebook, we will import the libraries we need. 
We will import the following:

Library Name | Purpose
-------------|--------
[pandas](https://pandas.pydata.org/docs/)       | reading and processing data
[plotly](https://plotly.com/python/getting-started/)       | interactive plots and visualization
[statsmodels](https://www.statsmodels.org/stable/index.html)  | data fitting


The cell below imports the libraries we will use for our analysis. 

In [None]:
# imports 
import pandas as pd # for reading data from a file and putting in a table
import plotly.express as px # for plotting

import statsmodels.formula.api as smf # for fitting

For the example, we will use data stored in the file `data/protein_assay.csv`.
This file contains absorbance data recorded in a Bradford assay for determining protein concentration.

After we have imported our libraries, we will use `pandas` to read in our data.
Our data is stored in a comma-separated values (CSV) file, though pandas can also read from Excel files.

In [None]:
# Read in file
df = pd.read_csv("data/protein_assay.csv")

# View the first five rows.
df.head()
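
If you would like to experiment with `pd.read_csv` without the data file, the sketch below reads CSV text from an in-memory string. The column names match those used later in this notebook, but the numbers are made up and are not the contents of `data/protein_assay.csv`. The name `example_df` is also hypothetical.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk.
csv_text = """concentration,absorbance
0.0,0.00
0.5,0.27
1.0,0.55
1.5,0.81
"""

# read_csv accepts a file path or any file-like object.
example_df = pd.read_csv(io.StringIO(csv_text))

print(example_df.shape)   # (number of rows, number of columns)
print(example_df.dtypes)  # data type that pandas inferred for each column
```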

### Visualizing Data

After reading in our data, we might wish to inspect it visually. 
We can do this using plotly express (imported as `px`).

In [None]:
# Create a figure using px.scatter.
fig = px.scatter(df, x="concentration", y="absorbance")

# Show the figure.
fig.show()

Note that you can change the axis labels using the following syntax.

In [None]:
# Create a figure using px.scatter.
fig = px.scatter(df, x="concentration", y="absorbance",
                 labels={"concentration": "Concentration (mg/mL)", "absorbance": "Absorbance at 595 nm"})

# Show the figure.
fig.show()

## Fitting a Linear Model

To fit a linear equation to our data, we can use a library called [statsmodels](https://www.statsmodels.org/stable/index.html). 
`statsmodels` is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

While there are numerous options for fitting data in Python, `statsmodels` is particularly useful for linear models.
Note that "linear model" covers more than straight-line equations: the model only has to be linear in its fitted coefficients, so it can include terms such as $x^2$.
This library is especially handy when you need straightforward access to fit parameters and detailed statistics about the fit.

In particular, we are using a part of `statsmodels` called the formula API.
The formula API lets you define a formula as a string for fitting and is specifically designed to work with dataframes.
When defining a formula, you use the column names in a string to define the relationship.
Note that the column names should not contain spaces; names with spaces have to be quoted inside the formula, which makes the relationship more cumbersome to write.
For example, if we had data representing pressure and temperature at constant volume and expected them to follow a linear relationship, such as that for an ideal gas,

$$
P = \frac{n R T}{V}
$$

we would write `"P ~ T"` when using the `statsmodels` formula API.
This assumes that our dataframe has columns named `P` and `T`.
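
To make the `"P ~ T"` formula concrete, here is a minimal sketch that fits synthetic ideal-gas data. The amount of gas, the volume, and the names `gas` and `gas_fit` are assumptions made for this illustration; the fitted slope should match $nR/V$.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic ideal-gas data (hypothetical amounts): n = 1 mol in V = 10 L.
R = 0.082057  # gas constant in L atm / (mol K)
n_mol, volume = 1.0, 10.0

gas = pd.DataFrame({"T": np.linspace(280.0, 380.0, 6)})  # temperature in K
gas["P"] = n_mol * R * gas["T"] / volume                  # pressure in atm

# "P ~ T" reads as "P modeled as a function of T"; an intercept is included by default.
gas_fit = smf.ols("P ~ T", data=gas).fit()

print(gas_fit.params["T"])   # fitted slope
print(n_mol * R / volume)    # expected slope, n R / V
```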


As a slightly more complicated example, one could also fit a polynomial using `"y ~ np.power(x, 2) + x"`, provided NumPy had been imported (`import numpy as np`).
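
The polynomial formula can be tried out on synthetic data, as in the sketch below. The data and the names `poly`, `poly_fit`, `new_points`, and `poly_pred` are hypothetical. The model is still "linear" in the sense that each term is multiplied by a single fitted coefficient.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data on an exact parabola, y = 3x^2 + 2x + 1 (illustration only).
poly = pd.DataFrame({"x": np.linspace(-2.0, 2.0, 9)})
poly["y"] = 3.0 * poly["x"] ** 2 + 2.0 * poly["x"] + 1.0

# np.power is looked up in the calling namespace, so NumPy must be imported as np.
poly_fit = smf.ols("y ~ np.power(x, 2) + x", data=poly).fit()

# One coefficient each for the intercept, the squared term, and the linear term.
print(poly_fit.params)

# Predict y at a new x value using the fitted polynomial.
new_points = pd.DataFrame({"x": [3.0]})
poly_pred = poly_fit.predict(new_points)
print(poly_pred)
```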

To use the formula API, we will use `smf` (imported in the imports cell above).
We will use ordinary least squares (`ols`) for our fit, though [a number of other options are offered](https://www.statsmodels.org/dev/api.html#statsmodels-formula-api).


In [None]:
regression = smf.ols("absorbance ~ concentration", data=df).fit()

We can see a summary of the fit including the `R-squared` by using the `.summary()` method.

In [None]:
regression.summary()
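
The values shown in the summary table can also be read directly from the fit result. The sketch below uses made-up calibration data (not the assay file) and the hypothetical names `demo` and `demo_fit`; `rsquared`, `bse`, and `conf_int()` are attributes and methods of the `statsmodels` results object.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up calibration data with a little scatter (not the assay file).
demo = pd.DataFrame({
    "concentration": [0.0, 0.25, 0.5, 0.75, 1.0, 1.25],
    "absorbance": [0.01, 0.14, 0.27, 0.43, 0.54, 0.69],
})
demo_fit = smf.ols("absorbance ~ concentration", data=demo).fit()

print(demo_fit.rsquared)    # coefficient of determination
print(demo_fit.bse)         # standard error of each coefficient
print(demo_fit.conf_int())  # 95% confidence intervals by default
```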

If we would like to force the model to not have an intercept, we use a special `statsmodels` formula syntax.
Adding `- 1` to our formula removes the intercept term, which forces the fitted line through the origin.

In [None]:
# force the intercept to be 0
regression = smf.ols("absorbance ~ concentration - 1", data=df).fit()
regression.summary()
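
The sketch below, using made-up data and hypothetical variable names, compares a fit with an intercept to two fits without one. In the formula syntax, writing `0 +` at the start of the right-hand side has the same effect as `- 1`. Note also that `statsmodels` reports an uncentered R-squared when the intercept is omitted, so that value should not be compared directly with the R-squared of a fit that includes an intercept.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up calibration data (not the assay file).
demo = pd.DataFrame({
    "concentration": [0.25, 0.5, 0.75, 1.0, 1.25],
    "absorbance": [0.14, 0.27, 0.43, 0.54, 0.69],
})

with_intercept = smf.ols("absorbance ~ concentration", data=demo).fit()
minus_one = smf.ols("absorbance ~ concentration - 1", data=demo).fit()
zero_plus = smf.ols("absorbance ~ 0 + concentration", data=demo).fit()

print(list(with_intercept.params.index))  # includes an "Intercept" entry
print(list(minus_one.params.index))       # only the concentration coefficient
print(minus_one.params["concentration"], zero_plus.params["concentration"])
```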

The model parameters are stored in the `.params` attribute of the fit result.

In [None]:
regression.params

To get the slope, we select the coefficient of the `concentration` variable.

In [None]:
slope = regression.params["concentration"]
print(slope)

We can see what our model predicts for our input concentration values by using the `predict` method.
When called with no arguments, `predict` returns the fitted values for the data used in the fit.
In the cell below, we save the results in a new column in our dataframe.

In [None]:
df["predicted"] = regression.predict()

In [None]:
df.head()
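
The `predict` method can also evaluate the model at values that were not part of the fit. The sketch below uses made-up data and hypothetical names (`demo`, `demo_fit`, `new_conc`, `new_pred`); the new values are passed as a dataframe whose column name matches the one used in the formula.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up calibration data (not the assay file).
demo = pd.DataFrame({
    "concentration": [0.25, 0.5, 0.75, 1.0, 1.25],
    "absorbance": [0.14, 0.27, 0.43, 0.54, 0.69],
})
demo_fit = smf.ols("absorbance ~ concentration - 1", data=demo).fit()

# New concentrations must be supplied under the same column name used in the formula.
new_conc = pd.DataFrame({"concentration": [0.6, 0.9]})
new_pred = demo_fit.predict(new_conc)
print(new_pred)
```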

In [None]:
fig = px.scatter(x=df["concentration"], y=df["absorbance"])
fig.add_scatter(x=df["concentration"], y=df["predicted"], mode="lines", name="model")
fig.show()

## Fitting unknowns

Now that we have our model and slope, we can use it to calculate the protein concentrations for a set of unknowns.
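
As a hint for this step: the last fit above was forced through the origin, so the model has the form $A = m \cdot c$, where $m$ (a symbol introduced here for illustration) is the fitted slope. Solving for the concentration gives

$$
c = \frac{A}{m}
$$

If the intercept had been kept in the model, with fitted value $b$, the corresponding expression would be $c = (A - b)/m$.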

<div class="exercise admonition">
<p class="admonition-title">Exercise</p>

Use pandas to read in the file `data/protein_samples.csv`.

Next, use our calculated slope to predict concentration based on measured absorbance.

</div>


## Exercise - Perform a Linear Fit

<div class="exercise admonition">
<p class="admonition-title">Exercise</p>

Now that you have completed a simple linear regression exercise with protein assay data, here is a problem with a slightly larger dataset, taken from a ground water survey of wells in Texas kindly provided by Houghton-Mifflin. 
The data for this exercise is in the file `data/ground_water.csv`.
Using the skills you have learned with pandas and `statsmodels`, get the linear regression statistics for the relationship between pH (dependent variable) and bicarbonate level in the well water (in ppm; independent variable).

</div>



