<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Case-study-description" data-toc-modified-id="Case-study-description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Case study description</a></span></li><li><span><a href="#Read-in-the-data-set" data-toc-modified-id="Read-in-the-data-set-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Read in the data set</a></span></li><li><span><a href="#Explore-data:-show-some-tables-and-plots" data-toc-modified-id="Explore-data:-show-some-tables-and-plots-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Explore data: show some tables and plots</a></span></li><li><span><a href="#Load-another-dataset;-merge-the-two-data-tables" data-toc-modified-id="Load-another-dataset;-merge-the-two-data-tables-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Load another dataset; merge the two data tables</a></span></li><li><span><a href="#Choose-a-column-and-predict-another-column,-based-on-a-least-squares-model" data-toc-modified-id="Choose-a-column-and-predict-another-column,-based-on-a-least-squares-model-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Choose a column and predict another column, based on a least-squares model</a></span></li><li><span><a href="#Quantify-how-good-the-predictions-are" data-toc-modified-id="Quantify-how-good-the-predictions-are-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Quantify how good the predictions are</a></span></li><li><span><a href="#User-interface-to-test-our-predictions" data-toc-modified-id="User-interface-to-test-our-predictions-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>User-interface to test our predictions</a></span></li><li><span><a href="#Try-a-different-prediction-model" data-toc-modified-id="Try-a-different-prediction-model-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Try a different prediction model</a></span></li><li><span><a href="#Compare-the-two-models" data-toc-modified-id="Compare-the-two-models-9"><span 
class="toc-item-num">9&nbsp;&nbsp;</span>Compare the two models</a></span></li><li><span><a href="#Discussion-here-about-the-two-models" data-toc-modified-id="Discussion-here-about-the-two-models-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Discussion here about the two models</a></span></li><li><span><a href="#Export-the-results" data-toc-modified-id="Export-the-results-11"><span class="toc-item-num">11&nbsp;&nbsp;</span>Export the results</a></span></li><li><span><a href="#Try-some-Python-commands-yourself" data-toc-modified-id="Try-some-Python-commands-yourself-12"><span class="toc-item-num">12&nbsp;&nbsp;</span>Try some Python commands yourself</a></span></li><li><span><a href="#Next-steps-in-learning-Python" data-toc-modified-id="Next-steps-in-learning-Python-13"><span class="toc-item-num">13&nbsp;&nbsp;</span>Next steps in learning Python</a></span></li></ul></div>

# Details about this notebook

In this notebook we demonstrate how Python is used for data analysis. We cover tasks that you would perform on a regular basis:

* Reading in the data
* Getting an understanding of the data format
* Merging it with another data set
* Building a prediction model from your data
* Exporting the results and sharing them with your colleagues

## Case study description

Goes here

In [None]:
import numpy as np
import pandas as pd
import plotly
#from ipysheet import sheet, column, to_dataframe, row
import ipywidgets as widgets
pd.options.plotting.backend = "plotly"
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "notebook" # jupyterlab
from IPython.display import display, HTML  # 'IPython.core.display' is deprecated for these imports
display(HTML("<style>.container { width:90% !important; }</style>"))

## Read in the data set

In [None]:
spectra = pd.read_excel("https://github.com/kgdunn/process-improve/raw/main/notebooks_examples/Tablets.xlsx", sheet_name="Spectra").set_index("Sample")
print(f"Data shape = {spectra.shape}")

## Explore data: show some tables and plots

In [None]:
# Show the top of the data set
spectra.head()

In [None]:
# Show the end of the data set
spectra.tail()

In [None]:
# Randomly select and show 10 rows
spectra.sample(10)

In [None]:
# Show a randomly selected row; and plot its spectrum
spectra.sample(1).iloc[0].plot(title="Plot of a randomly selected spectrum")

In [None]:
# Improve the figure: leave out the first column and label the axes
fig = spectra.sample(1).iloc[0, 1:].plot(title="Plot of a randomly selected spectrum")
fig.update_layout(xaxis_title_text="Wavelength [nm]")
fig.update_layout(yaxis_title_text="Absorbance")

## Load another dataset; merge the two data tables

In [None]:
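# Read the 'Hardness' sheet from the same workbook that was used for the spectra
output = pd.read_excel("https://github.com/kgdunn/process-improve/raw/main/notebooks_examples/Tablets.xlsx", sheet_name="Hardness").set_index("Sample")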
print(f"The 'outputs' data frame has shape = {output.shape}")

In [None]:
output.head()

In [None]:
# Explore the outputs
display(output['Hardness'].plot.line())
display(output['Hardness'].plot.hist(nbins=30))
display(output['Hardness'].plot.box())

In [None]:
# Summary statistics for each column
display(output.mean())
display(output.median())
display(output.std())
display(output.min())

In [None]:
# Get a complete summary
output.describe()

In [None]:
spectra.index

In [None]:
output.index

In [None]:
# Join the two data sets on their shared 'Sample' index
joined = spectra.merge(output, left_index=True, right_index=True)
joined.columns
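
The index-based merge above keeps only the samples that appear in both tables, because `merge` defaults to `how="inner"`. The sketch below illustrates this on two small, made-up tables (the values and the reduced column set are assumptions for illustration, not the tablet data):

```python
import pandas as pd

# Hypothetical toy tables; the values are invented for illustration only.
left = pd.DataFrame(
    {"1664nm": [0.51, 0.48, 0.55]},
    index=pd.Index([1, 2, 3], name="Sample"),
)
right = pd.DataFrame(
    {"Hardness": [170.0, 182.0]},
    index=pd.Index([2, 3], name="Sample"),
)

# Default how="inner": keep only the samples present in both tables.
inner = left.merge(right, left_index=True, right_index=True)

# how="left": keep every row of `left`; a missing Hardness becomes NaN.
keep_left = left.merge(right, left_index=True, right_index=True, how="left")

print(inner)
print(keep_left)
```

Comparing `joined.shape` with `spectra.shape` and `output.shape` is a quick way to check whether any samples were dropped by the join.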


In [None]:
# Plot the output, sorted from lowest to highest
joined["Hardness"].sort_values().plot()

In [None]:
# See the 4 groups, based on 'Category', for the output variable
joined.groupby(["Category"])["Hardness"].mean()

In [None]:
# Now, repeat, but for the spectra
joined.groupby(["Category"]).mean()

In [None]:
spectra.groupby(["Category"]).mean().T.plot()

In [None]:
# Select a column for a particular wavelength
wavelength  = '1884nm'
fig=joined.loc[:, wavelength].plot(title=f"Plot of absorbances for all tablets at wavelength {wavelength}")
fig.update_layout(xaxis_title_text="Sample number")
fig.update_layout(yaxis_title_text=f"Absorbance at {wavelength}")

## Choose a column and predict another column, based on a least-squares model

In [None]:
# Correlation plot at a particular wavelength against Hardness
# Choose a column as x, to predict another column as y, based on a least-squares model

wavelength  = "1666nm"
two_columns = joined.loc[:, [wavelength, 'Hardness']]
# display(two_columns)
fig=two_columns.plot.scatter(x=wavelength, y='Hardness')
fig.update_layout(xaxis_title_text=f"Absorbance at {wavelength}")
fig.update_layout(yaxis_title_text="Hardness")

In [None]:
all_correlations = joined.corr()['Hardness']

In [None]:
all_correlations.apply(lambda x: x**2).plot()


In [None]:
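# The overall maximum is the correlation of 'Hardness' with itself
display(all_correlations.max())
# Exclude the last entry ('Hardness'): largest correlation and its position
display(all_correlations[0:-1].max())
display(all_correlations[0:-1].argmax())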

In [None]:
# The column in position 533 has the largest correlation; look up its wavelength label
maximum_column = 533
joined.columns[maximum_column]

In [None]:
best_wavelength = "1664nm"
joined.loc[:, best_wavelength]
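
The position found with `argmax` had to be translated back to a column label by hand. As a possible shortcut, `idxmax` returns the label directly. The sketch below uses a small, made-up correlation series (the values are assumptions). It also shows that `max` ignores strong negative correlations, whereas the squared-correlation plot above treats both signs alike:

```python
import pandas as pd

# Hypothetical correlations; the last entry plays the role of 'Hardness' itself.
corr = pd.Series(
    [0.20, -0.95, 0.80, 1.00],
    index=["1660nm", "1662nm", "1664nm", "Hardness"],
)

candidates = corr.drop("Hardness")      # exclude the trivial self-correlation
best_label = candidates.idxmax()        # label of the largest correlation
best_position = candidates.argmax()     # integer position of that same entry
strongest = candidates.abs().idxmax()   # strongest relationship, regardless of sign

print(best_label, best_position, strongest)
```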

In [None]:
# Build a least-squares model with scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

mymodel = LinearRegression()

In [None]:
X = joined.loc[:, [best_wavelength]]
mymodel.fit(X, y=joined["Hardness"]);

In [None]:
# The coefficients
print(f'Intercept = {mymodel.intercept_:.4g} and slope = {mymodel.coef_[0]:.4g}')

# The prediction errors (residuals) and their histogram
actual_y_values = joined["Hardness"]
predicted_y_values = mymodel.predict(X)
prediction_error = actual_y_values - predicted_y_values
prediction_error.hist(nbins=40)
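
For a single input column, the least-squares intercept and slope have a closed-form solution. The sketch below computes them by hand with NumPy on made-up numbers (an illustration of the formulas, not of the tablet data):

```python
import numpy as np

# Made-up data for illustration, roughly following y = 3 - 2*x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 0.9, -1.2, -2.9, -5.1])

x_mean, y_mean = x.mean(), y.mean()

# slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# The fitted line passes through the point (x_mean, y_mean)
intercept = y_mean - slope * x_mean

# Residuals: actual values minus predicted values
residuals = y - (intercept + slope * x)

print(f"Intercept = {intercept:.4g} and slope = {slope:.4g}")
```

When the model includes an intercept, the residuals of a least-squares fit sum to zero (up to rounding), which is a useful sanity check.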

In [None]:
# coef_ is an array with one entry per input column; take the single slope value
joined["Predicted hardness"] = mymodel.intercept_ + mymodel.coef_[0] * joined[best_wavelength]

In [None]:
# Plot the regression model, and fit
three_columns = joined.loc[:, [best_wavelength, 'Hardness', "Predicted hardness"]]
fig=three_columns.plot.scatter(x=best_wavelength, y='Hardness')
fig.update_layout(xaxis_title_text=f"Absorbance at {best_wavelength}")
fig.update_layout(yaxis_title_text="Hardness")
fig.add_scatter(x=three_columns[best_wavelength], y=three_columns['Predicted hardness'], name="Prediction")

## Quantify how good the predictions are

In [None]:
# The root mean squared error (RMSE), in the same units as Hardness
print(f'Root mean squared error: {np.sqrt(mean_squared_error(actual_y_values, predicted_y_values)):.4g}')

# The coefficient of determination: (R^2)
print(f'Coefficient of determination = R^2 = {r2_score(actual_y_values, predicted_y_values):.3f}')
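
Both metrics follow directly from the prediction errors. The sketch below computes them from their definitions on a few made-up numbers, so the calculation can be checked by hand:

```python
import numpy as np

# Made-up values for illustration
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 7.0, 9.0])

errors = actual - predicted

# Mean squared error and its square root (RMSE)
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)

# R^2 = 1 - (sum of squared errors) / (total sum of squares about the mean)
total_sum_of_squares = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - np.sum(errors ** 2) / total_sum_of_squares

print(f"MSE = {mse:.4g}, RMSE = {rmse:.4g}, R^2 = {r2:.3f}")
```

The RMSE is often easier to interpret than the MSE, because it is expressed in the same units as the variable being predicted.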

## User-interface to test our predictions

Build an interactive tool to find an ideal wavelength to make the predictions from.

In [None]:
def correlation_plot(wavelength):
    fig=joined.plot.scatter(x=wavelength, y='Hardness')
    fig.update_layout(xaxis_title_text=f"Absorbance at {wavelength}")
    fig.update_layout(yaxis_title_text="Hardness")    
    fig.add_scatter(x=joined[wavelength], y=joined['Predicted hardness'], name="Prediction", line_color="darkgreen")
    fig.update_layout(
        width = 800,
        height = 500,
        title = "Training data, with prediction line"
    )
    fig.show()
    
def update_correlation_plot(wavelength_selected):
    wavelength = f"{int(wavelength_selected)}nm"    
    lsmodel = LinearRegression()
    X = joined.loc[:, [wavelength]]
    lsmodel.fit(X, y=joined["Hardness"]);
    predicted_y_values = lsmodel.predict(X)
    joined['Predicted hardness'] = predicted_y_values
    actual_y_values = joined["Hardness"]
    prediction_error = actual_y_values - predicted_y_values    
    correlation_plot(wavelength=wavelength)
    print(f'Intercept = {lsmodel.intercept_:.3g} and slope = {lsmodel.coef_[0]:.3g}')
    print(f'Root mean squared error: {np.sqrt(mean_squared_error(actual_y_values, predicted_y_values)):.4g}')
    print(f'Coefficient of determination = R^2 = {r2_score(actual_y_values, predicted_y_values):.3f}')

    
    # TODO: show 4 plots: scatter plot, error histogram, spectra with vertical line
    # TODO: do a plot update, to avoid 'flashing' figures
        
wavelength_selected = widgets.FloatSlider(min=1600, 
                                          max=1898, 
                                          step=2, 
                                          value=1600, 
                                          readout_format="d",
                                          continuous_update=False,
                                          description='Wavelength')
ui = widgets.VBox([wavelength_selected])
out = widgets.interactive_output(update_correlation_plot, {'wavelength_selected': wavelength_selected});
display(ui,out);
wavelength_selected.value += 2

## Try a different prediction model

Use multiple columns to predict the tablet hardness.

In [None]:
# Select several columns around the best column to use as model inputs
spectra.loc[:, "1654nm":"1674nm"]

In [None]:
avg_model = LinearRegression()
# Take X from 'joined', so that its rows match the rows of joined["Hardness"]
X = joined.loc[:, "1654nm":"1674nm"]
avg_model.fit(X, y=joined["Hardness"]);

In [None]:
# The coefficients: one per wavelength column
print(f'Intercept = {avg_model.intercept_} and coefficients = {avg_model.coef_}')

# The prediction errors (residuals) and their histogram
predicted_y_values_avgmodel = avg_model.predict(X)
prediction_error_avgmodel = actual_y_values - predicted_y_values_avgmodel
prediction_error_avgmodel.hist(nbins=40)

## Compare the two models

In [None]:
# The root mean squared error (RMSE) of each model
print(f'Root mean squared error (single): {np.sqrt(mean_squared_error(actual_y_values, predicted_y_values)):.4g}')
print(f'Root mean squared error (multiple): {np.sqrt(mean_squared_error(actual_y_values, predicted_y_values_avgmodel)):.4g}')

# The coefficient of determination: (R^2)
print(f'Coefficient of determination (single) = R^2 = {r2_score(actual_y_values, predicted_y_values):.3f}')
print(f'Coefficient of determination (multiple) = R^2 = {r2_score(actual_y_values, predicted_y_values_avgmodel):.3f}')

In [None]:
# Show the coefficient of each wavelength in the multiple-column model
pd.Series(avg_model.coef_, index=X.columns).plot.bar()

## Discussion here about the two models

Undesirable aspects of the multiple linear regression.
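
One commonly cited concern with multiple linear regression on spectra is that neighbouring wavelengths are highly correlated, which makes the individual coefficients poorly determined. The sketch below illustrates this on synthetic data (an assumption for illustration; it does not use the tablet spectra). Nearly identical columns give a large condition number, meaning that small changes in the data can cause large changes in the coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Three nearly identical columns, mimicking absorbances at neighbouring wavelengths
X = np.column_stack([base + 1e-3 * rng.normal(size=200) for _ in range(3)])
X_centered = X - X.mean(axis=0)

# Pairwise correlations between the columns (all close to 1)
corr = np.corrcoef(X, rowvar=False)

# Condition number: ratio of the largest to the smallest singular value
cond = np.linalg.cond(X_centered)

print(f"Smallest pairwise correlation = {corr.min():.6f}")
print(f"Condition number = {cond:.0f}")
```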

## Export the results

In [None]:
## Export image to PNG or PDF
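
Besides figures, the table of predictions can be exported for sharing. The sketch below writes a small, made-up results table to a CSV file in a temporary folder and reads it back (the file name, folder and values are assumptions for illustration). For Plotly figures, `fig.write_html(...)` saves an interactive page, while `fig.write_image(...)` saves PNG or PDF files but requires the separate `kaleido` package to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical results table; the values are invented for illustration only.
results = pd.DataFrame(
    {"Hardness": [170.0, 182.0], "Predicted hardness": [171.2, 180.5]},
    index=pd.Index([1, 2], name="Sample"),
)

# Write to a temporary folder, so that no existing file is overwritten
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "predictions.csv"
results.to_csv(csv_path)

# Read the file back to confirm that it was written as expected
reloaded = pd.read_csv(csv_path, index_col="Sample")
print(reloaded)
```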

## Try some Python commands yourself

In [None]:
print('Hi, my name is ____.')

In [None]:
# Creating variables:

temperature_in_F = 212.0
temperature_in_C = ...


## Next steps in learning Python