## Machine Learning and prediction
Elements of Data Science   
In this laboratory we will use training data to predict outcomes. We will first test these ideas using our Old Faithful data again. Next we will look at data on the iris flower to classify irises based on sepal width and length. In our culminating activity we will predict molecular acidity using data computed by [Prof. Vince Voelz](http://www.voelzlab.org) in the Temple Chemistry department and a graduate student, Robert Raddi. See their paper: [Stacking Gaussian processes to improve pKa predictions in the SAMPL7 challenge](https://link.springer.com/epdf/10.1007/s10822-021-00411-8?sharing_token=yLV8dMXdxg40M_Ds_2Rhsfe4RwlQNchNByi7wbcMAY6fCl3bMLQiAhJzS2zZw-SwUkz490heLLZu1bPJ8T5LHXo1WvZkp0AJmWzXo71rszl8UaPxjqtqR-oARfxWGrTiCV0rNXy0C7IVzX6yoMYTPv2ZJfnQS-zF1pYvL8ESsUI%3D).

In [None]:
Your_name = ...

### Learning from training data
A key concept in machine learning is using a subset of a dataset to train an algorithm, which then makes estimates on a separate set of test data. The quality of the trained algorithm can be assessed by the accuracy of its predictions on the test data. Many algorithms also have parameters, sometimes termed hyper-parameters, which can be optimized iteratively against test or validation data. In practice a dataset is randomly split into training and test sets using sampling.
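As a rough illustration of the random split described above, the sketch below shuffles row indices with NumPy and keeps the first 80% for training. The dataset size `n_rows` and the variable names are assumptions made for this example; the lab itself uses the datascience Table `.split` method.

```python
import numpy as np

rng = np.random.default_rng(seed=0)    # fixed seed so the split is repeatable
n_rows = 10                            # hypothetical dataset size
shuffled = rng.permutation(n_rows)     # row indices in random order

n_train = int(0.8 * n_rows)            # 80% of the rows go to training
train_idx = shuffled[:n_train]         # indices of training rows
test_idx = shuffled[n_train:]          # remaining indices form the test set

print(len(train_idx), "training and", len(test_idx), "test rows")
```

Every row lands in exactly one of the two sets, so the test rows are never seen during training.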

<div class="alert alert-info">
  <strong>Nearest neighbor</strong>
</div>

### k nearest neighbor
We will examine one machine learning algorithm in the laboratory, k nearest neighbor. Many of the concepts are applicable to the broad range of machine learning algorithms available.

In [None]:
from gofer.ok import check

import numpy as np
from datascience import *
import pandas as pd
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import warnings
warnings.simplefilter('ignore', UserWarning)
#from IPython.display import Image
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
from jupyterquiz import display_quiz
import json
from IPython.core.display import HTML
from ipywidgets import interact, interactive, fixed,IntSlider
import ipywidgets as widgets
import EDS
import os
user = os.getenv('JUPYTERHUB_USER')

### Nearest neighbor concept
The algorithm examines the characteristics of the *k* nearest neighbors of the data point for which a prediction will be made. Nearness can be measured using several different [metrics](https://www.nhm.uio.no/english/research/infrastructure/past/help/similarity.html), with Euclidean distance being a common choice for numerical attributes.  
Euclidean distance:   
1-D $$ d(p,q) = \sqrt{(p-q)^{2}} $$   
 2-D $$ d(p,q) = \sqrt{(p_1-q_1)^{2}+(p_2-q_2)^{2}} $$
 
 For multiple points (rows):
 2-D $$ d(p,q) = \sum{\sqrt{(p_1-q_1)^{2}+(p_2-q_2)^{2}}} $$
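The 1-D and 2-D formulas extend to any number of attributes by summing the squared differences over every attribute before taking the square root. The sketch below is one way to write that with NumPy; the function name `euclidean` is chosen for this example only.

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Square each attribute difference, add them up, then take the square root
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean([2], [5]))          # 1-D case: the absolute difference |2 - 5|
print(euclidean([0, 0], [3, 4]))    # 2-D case: hypotenuse of a 3-4-5 right triangle
```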

#### An example in 2-D Cartesian coordinates

In [None]:
x = np.array([3, 8, 5, 6, 1, 9, 8])
y = np.array([4, 5, 7, 4, 6, 9, 4])
testx = np.array([4])
testy = np.array([8])
n = list(np.arange(1,8))
color = "red"
plt.scatter(x, y, c = color, s=30,label = 'training points')
plt.scatter(testx, testy, c = 'blue', s=100, label = 'test point')
for i, txt in enumerate(n):
    plt.annotate(txt, (x[i]+.1, y[i]+.1))
plt.legend()
plt.show()

### Compute Euclidean distances
$ d(x,y) = \sqrt{(x_{test}-x_{train})^{2}+(y_{test}-y_{train})^{2}} $

In [None]:
distance = np.sqrt((testx - x)**2 + (testy - y)**2)  # NumPy array of distances from the test point to each training point

Execute the following code...

In [None]:
print("training point\t distance")
for i, txt in enumerate(n):
    print(f'{txt:d}  \t\t {distance[i]:.2f}')

Now sort to see the nearest in order...

In [None]:
print("training point\t distance")
for i, dist in zip(np.argsort(distance)+1, np.sort(distance)) :
    print(f'{i:d}   \t\t {dist:.2f}')

**Training point 3 is the nearest neighbor**
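When more than one neighbor is needed, the same sorted order gives the *k* nearest points directly. The sketch below repeats the example coordinates and test point (4, 8) from above and uses `np.argsort` to pick the labels of the three closest training points; the variable name `nearest_k` is an assumption for this illustration.

```python
import numpy as np

x = np.array([3, 8, 5, 6, 1, 9, 8])
y = np.array([4, 5, 7, 4, 6, 9, 4])
dist = np.sqrt((4 - x) ** 2 + (8 - y) ** 2)   # distance from the test point (4, 8)

k = 3
# argsort returns 0-based positions ordered by distance; +1 converts to point labels
nearest_k = np.argsort(dist)[:k] + 1
print(nearest_k)
```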

##### Try different attribute values in the 2-D Euclidean distance example code below to get a feel for the computation

In [None]:
# Example code to compute a Euclidean distance between two 2-D points
d_p_q = np.sqrt(sum((make_array(2,3)-make_array(4,3))**2))
d_p_q

#### A couple of quick review questions about nearest neighbor follow; select the best answer (multiple tries are OK). Execute the cell below to reveal the self-check quiz.

In [None]:
with open("questions.json", "r") as file:
    questions=json.load(file)    
display_quiz(questions)

### k nearest  neighbor regression
We will use the k nearest neighbor algorithm to make predictions of wait time in minutes following an eruption duration of a given number of minutes (independent variable).
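To make the idea concrete before working with the Table-based functions, here is a minimal NumPy sketch of k nearest neighbor regression with a single input. The numbers are made up for illustration and are not the Old Faithful data, and `knn_regress` is a hypothetical helper, not part of the lab code.

```python
import numpy as np

def knn_regress(train_x, train_y, new_x, k):
    """Average the outputs of the k training points whose input is closest to new_x."""
    dist = np.abs(train_x - new_x)        # Euclidean distance in 1-D
    nearest = np.argsort(dist)[:k]        # positions of the k smallest distances
    return np.mean(train_y[nearest])

# Made-up (duration, wait) pairs for illustration only
train_x = np.array([1.8, 2.0, 2.2, 4.0, 4.3, 4.6])
train_y = np.array([52.0, 55.0, 58.0, 78.0, 82.0, 86.0])

print(knn_regress(train_x, train_y, new_x=4.2, k=3))
```

The prediction is simply the mean output of the closest training points, so it can never fall outside the range of outputs seen in training.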


In [None]:
faithful = Table.read_table("data/faithful.csv")
faithful.scatter(0, 1, fit_line=True)

The cell below shows how to get values from a row in a Table as an array as is done in row_distance. Note in the faithful data case we will only consider the duration column in nearest neighbor computation but in examples below we will use a 2-D array of attributes with the iris data and a 10-D array in the chemistry and molecular acidity case.

#### <font color=blue> **Question 1.** </font>
Use the datascience `.split(n)` Table method to split the dataset into 80% training and 20% test. The argument `n` of `.split(n)` must be an integer. [See datascience documentation](https://datascience.readthedocs.io/en/master/_autosummary/datascience.tables.Table.split.html#datascience.tables.Table.split)

In [None]:
trainf, testf = ...
print(trainf.num_rows, 'training and', testf.num_rows, 'test instances.')

In [None]:
check('tests/q1.py')

#### <font color=blue> **Question 2.** </font>
Define a function which is the Euclidean distance between two values. Use the last two example code cells above as inspiration. This is where we will compute the distance between two *duration* values.

In [None]:
def distance(pt1, pt2):
    """The distance between two points, represented as arrays."""
    return ...

In [None]:
check('tests/q2.py')

#### Rest of the nearest neighbor algorithm
Execute these cells to create the complete algorithm

In [None]:
def row_distance(row1, row2):
    """The distance between two rows of a table."""
    return distance(np.array(row1), np.array(row2)) # Need to convert rows into arrays

def distances(training, example, output):
    """Compute the distance from example for each row in training."""
    dists = []
    attributes = training.drop(output)
    for row in attributes.rows:
        dists.append(row_distance(row, example))
    return training.with_column('Distance', dists)

def closest(training, test, k, output):
    """Return a table of the k closest neighbors to example."""
    return distances(training, test, output).sort('Distance').take(np.arange(k))

#### <font color=blue> **Question 3.** </font>
We will look at the Table containing the test data set and then take a specific row using `.take(5)`, which takes the 6th row of the test data (rows are zero-indexed). This specific row will be used to test the prediction of a wait time given the duration specified in the test row.

In [None]:
testf

In [None]:
testf.take(5)

The example below creates a test row without the `wait` data. The value of `wait` will be predicted by the nearest neighbor machine learning algorithm.

In [None]:
testf.take(5).drop('wait').row(0) 

Take a test row from the test data (testf), drop the prediction column and use the closest function to see the top 10 closest points to the target in the training data. 

In [None]:
test_row = testf...row(...)
test_row   # This should display data contained in selected row in testf table.

In [None]:
k = ... # Number of nearest neighbors
closest(...,test_row,...,'wait')

In [None]:
check('tests/q3.py')

#### <font color=blue> **Question 4.** </font>
Predict the value for this row using the defined predict_nn function below and compare to the value reported for wait in the test data. How do they compare?

In [None]:
def predict_nn(test):
    """Return the average wait of the k nearest neighbors in the training set."""
    k = 5
    return np.average(closest(trainf, test, k , 'wait').column('wait'))

In [None]:
predictionf = ...  # This is the value predicted for wait using the average of the k nearest neighbors in the training set
actual = ...
print(predictionf,actual)

<font color='blue'> Answer here  </font>
***  

In [None]:
check('tests/q4.py')

#### <font color=blue> **Question 5. Predictions** </font>
Now we will make predictions for the whole test set using the apply Table method. We will then look at the root mean squared error (RMSE) for the nearest neighbor fit and a scatter plot. Try adjusting the value of k in the predict_nn function to see its effect on the quality of fit by rerunning these cells. Are the predicted points in a **perfect** straight line? Why or why not?
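The RMSE is the square root of the average squared difference between actual and predicted values. The sketch below shows the calculation on three made-up wait times; the helper name `rmse` and the numbers are assumptions for this example.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Made-up values: the errors are -2, 0 and 4 minutes
actual_wait = [70, 80, 90]
predicted_wait = [72, 80, 86]
print(round(rmse(actual_wait, predicted_wait), 2))
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones.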

In [None]:
testf = testf.with_columns("predict",testf.apply(predict_nn,"duration"))
nn_test_predictions = testf.column("predict")
test_wait = testf.column("wait")
rmse_nn = np.mean((test_wait - nn_test_predictions) ** 2) ** 0.5

print('Test set RMSE for nearest neighbor regression:', round(rmse_nn,2))

In [None]:
testf.scatter("duration")

<font color='blue'> Answer here  </font>
***  

### Classify iris flower data with machine learning
Next we will take on the problem of classifying iris data into three categories, setosa, versicolor, and virginica. Here we will also learn the basics of the k nearest neighbor algorithm.

The first data set we will look at consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured for each observation: the length and the width of the sepals and petals, in centimeters.
<br><center><img src='iris.png' width=150 height=150><br>Iris stainglass, J.R. Smith</center>

In [None]:
n_neighbors = 15
# Load iris data
iris = datasets.load_iris()
# We only take the first two features. 
iris_table = Table().with_columns("Name",iris.target,iris.feature_names[0],iris.data[:,0],iris.feature_names[1],iris.data[:,1])
iris_table

In [None]:
iris.target_names

#### <font color=blue> **Question 6.** </font>
Split iris_table into training and test sets, using 80% for training as above.

In [None]:
train_i, test_i = ...
print(train_i.num_rows, 'training and', test_i.num_rows, 'test instances.')

In [None]:
check('tests/q6.py')

#### <font color=blue> **Question 7.** </font>
With classification we need to use training data to decide how to classify data given a set of attributes, sepal length and sepal width in this case. Create a function which returns the majority classification among three possibilities in "Name" coded as 0, 1, 2 (setosa, versicolor, and virginica respectively). The `and` below combines two conditionals. For example,  (twos > ones) and ...
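For comparison, the majority vote itself can be sketched on a plain array of integer labels with `np.bincount`. This takes a different route from the Table-based function requested above, and `majority_label` is a hypothetical name used only here.

```python
import numpy as np

def majority_label(labels):
    """Return the most frequent non-negative integer label; ties go to the smallest label."""
    counts = np.bincount(np.asarray(labels))   # counts[i] = number of times label i appears
    return int(np.argmax(counts))              # argmax returns the first maximum

# Labels of five hypothetical nearest neighbors (0 = setosa, 1 = versicolor, 2 = virginica)
print(majority_label([2, 1, 2, 0, 2]))
```

Note that the tie-breaking rule here (smallest label wins) is a property of this sketch and need not match the if/elif/else version.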

In [None]:
def majority(topkclasses):
    virginica = topkclasses.where('Name', are.equal_to(2)).num_rows
    versicolor = ...
    setosa = ...
    # Now test to see which class holds the majority among the k nearest neighbors
    if ... and ...:
        return 2
    elif ... and ...:
        return 1
    else:
        return 0

In [None]:
check('tests/q7.py')

In [None]:
def classify(training, new_point, k):
    closestk = closest(training, new_point, k,"Name")
    topkclasses = closestk.select('Name')
    return majority(topkclasses)

In [None]:
test_row_num = ...
k = ...
print("Prediction: ",classify(train_i,test_i.drop(...).row(test_row_num),k)," Actual: ",test_i.select("Name").row(test_row_num))

In [None]:
def predict(train, test_attributes, k):
    pred = []
    for i in np.arange(test_attributes.num_rows):
        pred.append(classify(train,test_attributes.row(i),k))
    return pred

#### <font color=blue> **Question 8.** </font>
Make a new table called prediction which includes original columns of test Table but also includes a "predict" column.

In [None]:
k = ...
prediction = test_i.with_columns("predict",...)
prediction.show(30)

In [None]:
check('tests/q8.py')

### Plot decision outcomes for test set
#### <font color=blue> **Question 9.** </font>
Use the prediction Table above to make a scatter plot of the color coded predictions based on the two attributes (use group="predict" in the scatter plot after specifying the x and y axes based on the attributes).

In [None]:
prediction.drop("Name").scatter(...,...,group='predict')

### Fancy plot showing color coded decision boundaries
We can make a more informative plot by predicting on a grid of attribute values as shown below. Seaborn is an add-on to the Matplotlib plotting we have been using which provides more control of plotting. Execute (this may take a minute+) and study the below input and resulting output for your information.
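The grid approach can be hard to picture, so here is a small sketch of the mechanics on a tiny 3 by 2 grid: `np.meshgrid` builds the coordinates, `ravel` flattens them into a list of points to classify, and `reshape` puts the predicted labels back into the grid shape for plotting. The threshold rule standing in for the classifier is purely hypothetical.

```python
import numpy as np

h = 1.0                                            # step size in the mesh
xx, yy = np.meshgrid(np.arange(0.0, 3.0, h),       # 3 values along x
                     np.arange(0.0, 2.0, h))       # 2 values along y

# One row per grid point: column 0 is x, column 1 is y
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Stand-in for a classifier: label 1 where x >= 1, else 0 (hypothetical rule)
labels = (grid_points[:, 0] >= 1).astype(int)

Z = labels.reshape(xx.shape)                       # back to the grid layout for contourf
print(Z)
```

With a step of 0.1 over the iris attribute ranges the grid holds thousands of points, which is why the prediction cell takes a while to run.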

In [None]:
def make_colors(iris, y, cmap):
    colors = []
    cdict = {'setosa':0, 'virginica':2, 'versicolor':1}
    for x in iris.target_names[y]:
        colors.append(cmap[cdict[x]])
    return colors

In [None]:
import seaborn as sns

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
h = 0.1  # step size in the mesh
k = 10
x_min, x_max = iris.data[:, 0].min() - 1, iris.data[:, 0].max() + 1
y_min, y_max = iris.data[:, 1].min() - 1, iris.data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
## Create a grid of predictions in a Table
attribute_grid = Table().with_columns(
    iris.feature_names[0],
    np.c_[xx.ravel(), yy.ravel()][:, 0],
    iris.feature_names[1],
    np.c_[xx.ravel(), yy.ravel()][:, 1],
)

Z = np.array(predict(train_i, attribute_grid, k))

# Create color maps
cmap_light = ListedColormap(["orange", "cyan", "cornflowerblue"])
cmap_bold = ["darkorange", "c", "darkblue"]
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, cmap=cmap_light, alpha=.3)

# Plot the test points but convert to numpy arrays
predictions = prediction.column("predict")
attribute1 = prediction.column(1)
attribute2 = prediction.column(2)
plt.scatter(
    x=attribute1,
    y=attribute2,
    c=make_colors(iris, predictions, cmap_bold),
    alpha=.7,
    edgecolor="black",
)

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = %i)" % (k))
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()

#### <font color=blue> **Question 10.** </font>
Comment on the quality of the predictions by <font color='blue'>
1. <font color='green'>Your nearest neighbor algorithm</font></font>

#### <font color='blue'> Answers here </font>
***

## Molecules and predicting acidity measured by pKa
Within the Jupyter notebook we can also analyze molecules and their molecular data using the library RDKit. RDKit adds the ability to visualize 2D and 3D molecular structures. We can apply many of the data science tools we have learned to molecular data as well. <br>First we will briefly look at acid-base chemistry and how acidity is defined. pH is a measure of the acidity of a water-based (aqueous) solution. A pH of 1 is acidic, a pH of 7 is neutral and a pH of 14 is basic.  Next we will use some computed attributes of a large set of molecules to train a k nearest neighbor model to predict acidity. We will use a range of attributes including the partial charges on atoms adjacent to the acidic proton, molecular weight, solvent accessible surface area (SASA), carbon-oxygen bond order, and some thermochemistry measures, all of which may help predict acidity, with a lower pKa indicating a stronger acid.
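As a reminder of why a lower pKa means a stronger acid, pKa is the negative base-10 logarithm of the acid dissociation constant Ka, so a larger Ka (more dissociation) gives a smaller pKa. The sketch below assumes an approximate textbook Ka of 1.8e-5 for acetic acid and a made-up Ka of 1e-3 for a hypothetical stronger acid.

```python
import math

def pKa_from_Ka(Ka):
    """pKa is defined as -log10(Ka)."""
    return -math.log10(Ka)

acetic = pKa_from_Ka(1.8e-5)      # approximate textbook Ka for acetic acid
stronger = pKa_from_Ka(1.0e-3)    # hypothetical acid that dissociates more readily

print(round(acetic, 2), round(stronger, 2))
```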

#### Acid-base and pKa background
A very brief background on acid - base equilibria demonstrated for glycine. See [OpenStax Chemistry](https://openstax.org/books/chemistry-2e/pages/14-introduction) for details based on interest.
<br><center><img src='acid_base_pKa.png' width=900></center>

### RDKit
[RDKit](https://www.rdkit.org/docs/Cookbook.html) is a specialized library to handle the complexities of molecules within Python. 

In [None]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole #Needed to show molecules
from rdkit.Chem.Draw.MolDrawing import MolDrawing, DrawingOptions #Only needed if modifying defaults
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit import DataStructs
# Options
DrawingOptions.bondLineWidth=1.8

#### <font color='magenta'> Load detailed molecular data for 600 molecules

In [None]:
molecules = Table().read_table('pKa_med.csv')
molecules

#### <font color=blue> **Question 11.** </font>Select an amino acid 
 Use the Table above to view data for an amino acid of your selection from the 21 amino acids which are building blocks of proteins. Note: not all amino acids are in the data set, try another if missing. See [web page](https://i.pinimg.com/originals/a2/fd/dd/a2fddd4ad8b9067bfeb0d6f51cf28e71.jpg) for possible choices. Hint: use are.containing within the .where() Table method. For example below we can find compounds which contain a methyl group (CH$_3$). We get 55 rows (records).

In [None]:
methyl = molecules.where("Name",are.containing("methyl"))
methyl

In [None]:
amino = molecules.where("Name",are.containing("..."))
amino

In [None]:
check('tests/q11.py')

### Display molecular structure
[SMILES](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system) is a shorthand language to describe molecular structure. Execute each structure below.

In [None]:
Chem.MolFromSmiles("[NH2+]CC(O)=O") # Glycine an amino acid

In [None]:
EDS.smiles3D("[NH2+]CC(O)=O")

In [None]:
Chem.MolFromSmiles("[H]-O-[H]") #Water

In [None]:
Chem.MolFromSmiles("[CH3]") #Methyl radical

In [None]:
Chem.MolFromSmiles("C-C-O") #Ethanol

In [None]:
EDS.smiles3D("C-C-O")

### pKa data examination
Now we will look at a data set derived from the above data but with computed molecular attributes for our machine learning. This data set is computed and described by  [Prof. Vince Voelz](http://www.voelzlab.org) in the Temple Chemistry department and a graduate student, Robert Raddi. See their paper: [Stacking Gaussian processes to improve pKa predictions in the SAMPL7 challenge](https://link.springer.com/epdf/10.1007/s10822-021-00411-8?sharing_token=yLV8dMXdxg40M_Ds_2Rhsfe4RwlQNchNByi7wbcMAY6fCl3bMLQiAhJzS2zZw-SwUkz490heLLZu1bPJ8T5LHXo1WvZkp0AJmWzXo71rszl8UaPxjqtqR-oARfxWGrTiCV0rNXy0C7IVzX6yoMYTPv2ZJfnQS-zF1pYvL8ESsUI%3D).

### Load molecular data set with additional descriptors

In [None]:
molecules = Table().read_table('pKa_small.csv')
molecules

### Histogram of pKa, molecular weight (g/mol), and ∆H.

In [None]:
molecules.hist('pKa',bins=25, edgecolor="black", linewidth=1.2)
molecules.hist('Weight',bins=25, edgecolor="black", linewidth=1.2)
molecules.hist('∆H',bins=25, edgecolor="black", linewidth=1.2)

### <font color='purple'>Sample molecules in our data set</font>
Take them for a spin by clicking and dragging on them.

In [None]:
for rowid in np.random.choice(np.arange(molecules.num_rows),4):
    print(molecules['Name'][rowid],' pKa: ',molecules['pKa'][rowid], ' Molecular Weight (g/mol): ',molecules['Weight'][rowid])
    EDS.smiles3D(molecules['smiles'][rowid])

### <font color='purple'>Pharmaceuticals in our data set</font>
Below is a sampling of the many pharmaceuticals in our data set with 2D views.

In [None]:
celecoxib = Chem.MolFromSmiles(molecules.where('Name','celecoxib')['smiles'][0])
celecoxib.SetProp('Name',molecules.where('Name','celecoxib')['Name'][0]+' (NSAID) pKa: '+str(molecules.where('Name','celecoxib')['pKa'][0]))
sertraline = Chem.MolFromSmiles(molecules.where('Name','Sertraline')['smiles'][0])
sertraline.SetProp('Name',molecules.where('Name','Sertraline')['Name'][0]+' (anti-depressant, SSRI) pKa: '+str(molecules.where('Name','Sertraline')['pKa'][0]))
losartan = Chem.MolFromSmiles(molecules.where('Name','Losartan')['smiles'][0])
losartan.SetProp('Name',molecules.where('Name','Losartan')['Name'][0]+' (hypertension,  angiotensin II receptor blocker) pKa: '+str(molecules.where('Name','Losartan')['pKa'][0]))
Draw.MolsToGridImage(
    [celecoxib, sertraline, losartan],
    legends=[celecoxib.GetProp("Name"),sertraline.GetProp("Name"), losartan.GetProp("Name")],
    molsPerRow=3,
    subImgSize=(400, 250),
    useSVG=True,)

##### We can look at the structures and data on several derivatives of acetic acid by executing the code below

### Display molecular structure from dataset
[SMILES](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system) is a shorthand language to describe molecular structure.

##### Example from smiles string from molecules Table

In [None]:
print(molecules['Name'][22],'smiles string: ', molecules['smiles'][22])

Try your own... 2D and 3D structures using `Chem.MolFromSmiles()` and `EDS.smiles3D`

#### Selected amino acid 2D molecular structure
Try it out for fun! Use the same syntax as above and the SMILES string from your Table above to display a 2D amino acid structure from your selection. Even if there is no rdkit, try your hand at the SMILES molecular description. The 21 basic amino acid protein building blocks are given in a [helpful table](https://www.aatbio.com/data-sets/amino-acid-reference-chart-table) along with their smiles string to use below.

In [None]:
print("My chosen amino acid is:",...)
smile_struct = '...'

Chem.MolFromSmiles(smile_struct)

In [None]:
EDS.smiles3D(...)

###  Nearest Neighbor
For machine learning we will use the 5 features available in the molecules Table after dropping the smiles molecular structures string.

#### Selection of attributes/features for training and prediction
We need to select the features that we will use in the training. These will include the molecular weight (g/mol), Change in Enthalpy (kJ/mol) (prot-deprot), ∆G_solv (kJ/mol) (prot-deprot), the bond order to the atom to which the acidic proton is bound, and the solvent accessible surface area (SASA). We also keep the labels and smiles string as well as the pKa we will train on. In a second round we will add features corresponding to partial charges on nearby atoms.

Let's look at caprylic acid, a fatty acid found in palm oil and coconut oil, to understand the various features we are using in our machine learning.

In [None]:
molecules.where('Name','Caprylic acid')

In [None]:
molecules.where('Name','Caprylic acid')['smiles'][0]

In [None]:
Chem.MolFromSmiles(molecules.where('Name','Caprylic acid')['smiles'][0])

### Standardize, train, test split
#### <font color=blue> **Question 12.** </font>
First we need to convert all numerical values to standard units. This is because features with different magnitudes would otherwise contribute unequally to the computed Euclidean distances. You will split the molecular Table into train and test data using 80% for training, remembering that the argument to split must be an integer (use the int() function). Again we will select certain columns as attributes.
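To see why standard units matter, consider two features on very different scales. In the sketch below (with made-up values, not rows from the pKa data) the raw distance between the first two molecules is almost entirely the difference in weight, while after standardizing both features contribute on a comparable scale. The helper mirrors the idea of `standard_units`.

```python
import numpy as np

def standard_units(values):
    """Convert an array of numbers to standard units (mean 0, standard deviation 1)."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

# Hypothetical features on very different scales
weight = np.array([60.0, 120.0, 180.0, 240.0])    # g/mol
bond_order = np.array([1.0, 1.2, 1.8, 2.0])       # dimensionless

# Distance between molecule 0 and molecule 1 using the raw values
raw_d = np.sqrt((weight[0] - weight[1]) ** 2 + (bond_order[0] - bond_order[1]) ** 2)

# Same distance after converting each feature to standard units
w_su, b_su = standard_units(weight), standard_units(bond_order)
std_d = np.sqrt((w_su[0] - w_su[1]) ** 2 + (b_su[0] - b_su[1]) ** 2)

print(round(raw_d, 3), round(std_d, 3))
```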

In [None]:
def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers))/np.std(any_numbers)  

In [None]:
molecules.labels[2:-1]

In [None]:
pKa_s = molecules.select('Name','pKa','smiles')
for label in molecules.labels[2:-1]:
    print('Standardizing: ',label)
    pKa_s = pKa_s.with_columns(label,standard_units(molecules[label]))
pKa_s   

Now it is time to split the data, your turn... remember to use the standardized data in `pKa_s`

In [None]:
train, test = pKa_s.split(...)
print(train.num_rows, 'training and', test.num_rows, 'test instances.')

train.show(3)

In [None]:
check('tests/q12.py')

### Our k nearest neighbors code

Remember the k nearest neighbor code from above, which we will use again here.
<code>
def row_distance(row1, row2):
    """The distance between two rows of a table."""
    return distance(np.array(row1), np.array(row2))

def distances(training, example, output):
    """Compute the distance from example for each row in training."""
    dists = []
    attributes = training.drop(output)
    for row in attributes.rows:
        dists.append(row_distance(row, example))
    return training.with_column('Distance', dists)

def closest(training, example, k, output):
    """Return a table of the k closest neighbors to example."""
    return distances(training, example, output).sort('Distance').take(np.arange(k))
</code>

### Test algorithm
Execute these cells to define the predict_nn function for pKa, pick a  row, predict and compare.

In [None]:
def predict_nn(test):
    """Return the average pKa of the k nearest neighbors in the training set."""
    k = 10
    return np.average(closest(train.drop('Name','smiles'), test, k, 'pKa').column('pKa'))

Examine one row in the test set whose pKa we will try to predict

In [None]:
test.drop().take(90)

#### Look at closest in training set to test row, need to drop pKa and smiles string from test

In [None]:
test.take(90)

In [None]:
print('Target pKa: ',test.take(90)['pKa'][0])
k = 25
closest(train.drop('Name','smiles'), test.drop('Name','smiles','pKa').row(90), k, 'pKa').select('pKa','Distance')

In [None]:
pKa_pre = closest(train.drop('Name','smiles'), test.drop('Name','smiles','pKa').row(90), k, 'pKa')['pKa'].mean()
print(f'Prediction: {pKa_pre:.2f}')

#### If we search the test data for the neighbors of a test row, we get an exact match (Distance = 0): no training is involved, so this is matching rather than machine learning!

In [None]:
closest(test.drop('Name','smiles'), test.drop('Name','smiles','pKa').row(100), k, 'pKa').select('pKa','Distance')

### Histogram of experimental acidity to be predicted
#### *Question*  
Make two histograms of acidity measured by pKa in the training data and test data. Compare the distribution.

In [None]:
...

### <font color=blue> **Question 13.** </font>Prediction time
Now predict the pKa of the 10th molecule in the test dataset using `predict_nn`. We need to drop the experimental pKa, the smiles string, and the name to create a test_nn_row with the attributes for the k nearest neighbor. Discuss the quality of the fit and the name of the molecule from column 1. Repeat for two more rows and discuss the prediction quality. Keep in mind that the prediction of pKa is a very challenging task for machine learning.

In [None]:
test_nn_row = ....row(9)
test_nn_row

In [None]:
predict_nn(test_nn_row)

In [None]:
print('Experimental pKa:', test.column('pKa').item(9))
print('Predicted pKa using nearest neighbors:', round(predict_nn(test_nn_row),2))

In [None]:
check('tests/q13.py')

### Now let's plot knn prediction success
Execute the next three cells

In [None]:
def predict_knn(test, k=5):
    """Return the average pKa of the k nearest neighbors in the training set."""
    return np.mean(closest(train.drop('Name','smiles'), test, k, 'pKa').column('pKa'))

In [None]:
exp_pKa = np.array([])
predict_pKa = np.array([])
for i in np.arange(test.num_rows):
    exp_pKa = np.append(exp_pKa,test.column('pKa').item(i))
    test_nn_row = test.drop('Name','pKa','smiles').row(i)
    predict_pKa = np.append(predict_pKa,predict_knn(test_nn_row, k))  # pass k so the prediction matches the k shown in the plot legend

In [None]:
len(exp_pKa), len(predict_pKa)

In [None]:
plt.scatter(exp_pKa, predict_pKa)
# calculate equation for regression line
z = np.polyfit(exp_pKa, predict_pKa, 1)
p = np.poly1d(z)
# add trendline to plot
plt.plot(
    exp_pKa, p(exp_pKa), "blue", label="{}".format(p)+' k:'+str(k)
)  # Equation of line placed in legend from label
plt.xlabel("Experimental pKa")
plt.ylabel("Predicted pKa")
plt.legend(fontsize="small")
plt.show()

### Conclusions on our k nearest neighbor model
#### <font color=blue> **Question 14.** </font>
What is the interpretation of the slope and the intercept in the above plot? What would the slope and intercept be in the case of a perfect match between `Predicted pKa` and `Experimental pKa`?

<font color='blue'>Your discussion </font>
***   

Evaluate the overall quality of our machine learning prediction based on the above plot.

<font color='blue'>Your discussion </font>
***   

### Examining dependence on k parameter.
#### <font color=blue> **Question 15.** </font>
Now we will try a few values of k to decide which is best; k is known as a hyperparameter. We need a version of `predict_nn`, here called `predict_knn`, that also takes k as an argument with a default value of 5.
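One way to compare values of k numerically, rather than only by eye from the plots, is to record an error measure such as RMSE for each k. The sketch below does this on synthetic one-feature data generated for the example; the data, the helper `knn_predict`, and the list of k values are all assumptions, and the best k found here says nothing about the best k for the pKa data.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1.0, 200)      # synthetic data: a line plus noise
train_x, test_x = x[:160], x[160:]         # 80% training, 20% test
train_y, test_y = y[:160], y[160:]

def knn_predict(new_x, k):
    """Average the training outputs of the k nearest training inputs."""
    nearest = np.argsort(np.abs(train_x - new_x))[:k]
    return np.mean(train_y[nearest])

rmse_by_k = {}
for k in [1, 3, 5, 10, 25]:
    preds = np.array([knn_predict(v, k) for v in test_x])
    rmse_by_k[k] = np.sqrt(np.mean((test_y - preds) ** 2))

best_k = min(rmse_by_k, key=rmse_by_k.get)   # k with the smallest test RMSE
print(rmse_by_k, best_k)
```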

In [None]:
def predict_knn(test, k=5):
    """Return the average pKa of the k nearest neighbors in the training set."""
    return np.mean(closest(train.drop('Name','smiles'), test, k, 'pKa').column('pKa'))

Make a list of 4 to 7 values of k to test with the same plot as is in Question 13 above.

In [None]:
for k in [...]:
    exp_pKa = make_array()
    predict_pKA = make_array()
    for i in np.arange(test.num_rows):
        exp_pKa = np.append(exp_pKa, test.column("pKa").item(i))
        example_nn_row = test.drop('Name','pKa','smiles').row(i)
        predict_pKA = np.append(predict_pKA, predict_knn(example_nn_row, k))
    plt.scatter(exp_pKa, predict_pKA)
    z = np.polyfit(exp_pKa, predict_pKA, 1)
    p = np.poly1d(z)
    plt.plot(
        exp_pKa, p(exp_pKa), label="{}".format(p), color='teal', alpha=0.7)  # Equation of line placed in legend from label
    plt.xlabel("Experimental pKa")
    plt.ylabel("Predicted pKa")
    plt.title("k = " + str(k))
    plt.legend(fontsize="small")
    plt.savefig('k-plots.png')
    plt.show()


<font color=blue> *Question:* Which value of `k` makes the best estimation?</font>

In [None]:
k = ...

In [None]:
check('tests/q14.py')

### Different knn weighting schemes
So far, all of the k nearest neighbors have received the same consideration in determining the prediction. It makes sense that points that are 'nearer' may be more important or weightier. Below are figures showing the unweighted approach we have been using, a weighting scheme based on inverse distance (1/distance), and an exponential weighting scheme for 4-ethylphenol.<br>
<img src='ethylphenol.png'>
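The three schemes differ only in the weight given to each neighbor before averaging. The sketch below compares them on three made-up neighbors using `np.average` with its `weights` argument; the distances and pKa values are hypothetical.

```python
import numpy as np

# Three hypothetical neighbors: distance to the test point and known pKa
distances = np.array([0.5, 1.0, 2.0])
pKa_values = np.array([4.0, 5.0, 7.0])

equal = np.mean(pKa_values)                                       # every neighbor counts the same
inverse = np.average(pKa_values, weights=1 / distances)           # weight = 1/distance
exponential = np.average(pKa_values, weights=np.exp(-distances))  # weight = exp(-distance)

print(round(equal, 3), round(inverse, 3), round(exponential, 3))
```

Both weighted predictions are pulled toward the pKa of the closest neighbor. Inverse-distance weighting is undefined when a distance is exactly zero, which is a case the exponential scheme handles without special treatment.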

---
### <center>**knn weighting plots**
---
<center>k = 5 with neighboring 5 molecular structures<br><img src='knn_equal_weighting_molecule.png'></center><br>

---
<center>k = 10<br>
<img src='knn_inverse_weighting.png'><img src='knn_exponential_weighting.png'></center>

<font color='green'>Repeat the above, replacing `predict_knn` first with `predict_knn_weighted`, which uses 1/distance weighting, and then with `predict_knn_weighted_exp`, which uses exponential weighting.</font>

In [127]:
def predict_knn_weighted(example,k):
    """Return the inverse-distance-weighted mean pKa of the k nearest neighbors."""
    dist_table = closest(train.drop('Name','smiles'), example, k, 'pKa')    
    total_inverse = np.sum(1/dist_table['Distance'])
    # Weights are proportional to 1/distance; the constant factor cancels when dividing by sum_weight
    dist_table=dist_table.with_columns('knn_weighting',(1/dist_table['Distance'])*total_inverse)
    sum_weight = np.sum(dist_table['knn_weighting'])
    weighted_mean_pKa = np.sum(dist_table['pKa']*dist_table['knn_weighting']/sum_weight)
    return weighted_mean_pKa

In [128]:
def predict_knn_weighted_exp(example,k):
    """Return the exponentially weighted mean pKa of the k nearest neighbors."""
    dist_table = closest(train.drop('Name','smiles'), example, k, 'pKa')    
    total_exp = np.sum(np.exp(-dist_table['Distance']))
    # Weights are proportional to exp(-distance); the constant factor cancels when dividing by sum_weight
    dist_table=dist_table.with_columns('knn_weighting',(np.exp(-dist_table['Distance']))*total_exp)
    sum_weight = np.sum(dist_table['knn_weighting'])
    weighted_mean_pKa = np.sum(dist_table['pKa']*dist_table['knn_weighting']/sum_weight)
    return weighted_mean_pKa

In [None]:
for k in [...]:
    exp_pKa = make_array()
    predict_pKA = make_array()
    for i in np.arange(test.num_rows):
        exp_pKa = np.append(exp_pKa, test.column("pKa").item(i))
        example_nn_row = test.drop('Name','pKa','smiles').row(i)
        predict_pKA = np.append(predict_pKA, ...(example_nn_row, k))  # PLACE TO PUT NEW FUNCTION
    plt.scatter(exp_pKa, predict_pKA)
    z = np.polyfit(exp_pKa, predict_pKA, 1)
    p = np.poly1d(z)
    plt.plot(
        exp_pKa, p(exp_pKa), label="{}".format(p), color='teal', alpha=0.7)  # Equation of line placed in legend from label
    plt.xlabel("Experimental pKa")
    plt.ylabel("Predicted pKa")
    plt.title("k = " + str(k))
    plt.legend(fontsize="small")
    plt.savefig('k-plots.png')
    plt.show()
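
To compare predictions with a number rather than by eye, one option is to compute an error metric between the experimental and predicted values. The sketch below is only an illustration: `rmse`, `mae`, and the toy arrays are hypothetical names and values, not part of the lab's toolkit or data.

```python
import numpy as np

def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((observed - predicted) ** 2))

def mae(observed, predicted):
    """Mean absolute error between two equal-length sequences."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(observed - predicted))

# Hypothetical toy values standing in for experimental and predicted pKa.
toy_observed = [4.0, 6.0, 8.0]
toy_predicted = [5.0, 6.0, 6.0]

print("RMSE:", rmse(toy_observed, toy_predicted))
print("MAE: ", mae(toy_observed, toy_predicted))
```

A lower value means the predictions sit closer to the experimental values. Computing the same metric for each prediction function at the same k gives a like-for-like comparison.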


### <font color=blue>Which weighting scheme works best?</font>

...

### <font color=blue> **Question 16.** </font>
At the end of each lab, please include a reflection. 
* How did this lab go? 
* Can you think of other applications of k nearest neighbors?
* Were there questions you found especially challenging you would like your instructor to review in class? 
* How long did the lab take you to complete?

Share your feedback so we can continue to improve this class!

**Insert a markdown cell below this one and write your reflection on this lab.**

### <font color='green'>Draw a 3D structure of your favorite molecule encountered in this lab using its SMILES string and the `EDS.smiles3D()` function. Why is it your favorite?</font>

In [None]:
EDS.smiles3D(...)

...

## All finished...
Run the checks, then download and submit the .html and .ipynb files.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import glob
from gofer.ok import check
correct = 0
checks = [1,2,3,4,6,7,8,11,12,13,14,15]
total = len(checks)
for x in checks:
    print('Testing question {}: '.format(str(x)))
    g = check('tests/q{}.py'.format(str(x)))
    if g.grade == 1.0:
        print("Passed")
        correct += 1
    else:
        print('Failed')
        display(g)

print('Grade:  {}'.format(str(correct/total)))
print("Nice work ",Your_name, user)
import time
localtime = time.asctime(time.localtime(time.time()))
print("Submitted @ ", localtime)

---
---
### <font color='brown'>REFERENCE: k nearest neighbors toolkit</font>
---

In [None]:
trainf, testf = table.split(int(0.80*table.num_rows))
print(trainf.num_rows, 'training and', testf.num_rows, 'test instances.')
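
The cell above splits a table at random into 80% training rows and 20% test rows. As a rough illustration of the idea behind such a split, the sketch below shuffles row indices with NumPy. The seed, `n_rows`, and the variable names are assumptions chosen for the example, not values from the lab.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the example is repeatable
n_rows = 10                      # hypothetical number of rows in a table

# Shuffle the row indices, then cut the shuffled list at the 80% mark.
shuffled = rng.permutation(n_rows)
n_train = int(0.80 * n_rows)
train_idx = shuffled[:n_train]
test_idx = shuffled[n_train:]

print(len(train_idx), 'training and', len(test_idx), 'test instances.')
```

Every row lands in exactly one of the two sets, so the test rows are never seen during training.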

In [None]:
def distance(pt1, pt2):
    """The Euclidean distance between two points, represented as arrays."""
    return np.sqrt(np.sum((pt1 - pt2) ** 2))


def row_distance(row1, row2):
    """The distance between two rows of a table."""
    return distance(np.array(row1), np.array(row2))  # Need to convert rows into arrays


def distances(training, example, output):
    """Compute the distance from example for each row in training."""
    dists = []
    attributes = training.drop(output)  # The output column is not a feature
    for row in attributes.rows:
        dists.append(row_distance(row, example))
    return training.with_column('Distance', dists)


def closest(training, test, k, output):
    """Return a table of the k rows of training closest to the example row test."""
    return distances(training, test, output).sort('Distance').take(np.arange(k))
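
The same neighbor search can be sketched with plain NumPy arrays, which may help when checking the table-based functions by hand. This is an illustration under assumed toy data: `k_closest`, `toy_train`, and `toy_example` are hypothetical names, not part of the toolkit.

```python
import numpy as np

def k_closest(train_features, example, k):
    """Return the indices and distances of the k training rows closest to example."""
    train_features = np.asarray(train_features, dtype=float)
    example = np.asarray(example, dtype=float)
    # Euclidean distance from the example to every training row at once.
    dists = np.sqrt(np.sum((train_features - example) ** 2, axis=1))
    order = np.argsort(dists)[:k]   # indices of the k smallest distances
    return order, dists[order]

# Hypothetical toy data: four training points with two features each.
toy_train = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [6.0, 8.0]])
toy_example = np.array([0.0, 0.0])

idx, d = k_closest(toy_train, toy_example, k=2)
print(idx, d)
```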

#### Classification

In [None]:
def majority(topkclasses):
    """Return the most common class (0, 1, or 2) in the 'Name' column; ties go to the lower label."""
    two = topkclasses.where('Name', are.equal_to(2)).num_rows
    one = topkclasses.where('Name', are.equal_to(1)).num_rows
    zero = topkclasses.where('Name', are.equal_to(0)).num_rows
    if (two > one) and (two > zero):
        return 2
    elif one > zero:
        return 1
    else:
        return 0


def classify(training, new_point, k):
    """Classify new_point by majority vote among its k nearest training rows."""
    closestk = closest(training, new_point, k, "Name")
    topkclasses = closestk.select('Name')
    return majority(topkclasses)


def predict(train, test_attributes, k):
    """Return a list of predicted classes, one per row of test_attributes."""
    pred = []
    for i in np.arange(test_attributes.num_rows):
        pred.append(classify(train, test_attributes.row(i), k))
    return pred
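
The `majority` function above is written for exactly three classes coded 0, 1, and 2. A possible generalization to any set of labels is sketched below using `collections.Counter` from the standard library. `majority_label` is a hypothetical helper and the label lists are made-up examples.

```python
from collections import Counter

def majority_label(labels):
    """Return the most common label; ties go to the label that appears first."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical neighbor labels, listed from nearest to farthest.
print(majority_label([2, 1, 2, 0, 1, 2]))
print(majority_label(['setosa', 'virginica', 'virginica']))
```

Tie-breaking differs from `majority`: here a tie goes to whichever tied label appears first in the list, which is the label of the nearest tied neighbor when the labels are ordered by distance.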

#### Regression

Standard Units for features/attributes

In [None]:
def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers))/np.std(any_numbers)  
pKa_s = molecules.select('Name','pKa','smiles')
for label in molecules.labels[2:-1]:
    print('Standardizing: ',label)
    pKa_s = pKa_s.with_columns(label,standard_units(molecules[label]))
pKa_s  
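
As a quick check of what standard units do, the sketch below applies the same formula to a small made-up array and confirms that the result has mean 0 and standard deviation 1. The array `toy_feature` is a hypothetical example, not a column from the molecule data.

```python
import numpy as np

def standard_units(any_numbers):
    """Convert any array of numbers to standard units."""
    return (any_numbers - np.mean(any_numbers)) / np.std(any_numbers)

# Hypothetical feature values on an arbitrary scale.
toy_feature = np.array([100.0, 200.0, 300.0])
toy_su = standard_units(toy_feature)

print(toy_su)
print("mean:", np.mean(toy_su), "std:", np.std(toy_su))
```

Putting every feature on this common scale keeps a feature with large raw values from dominating the Euclidean distance used to find neighbors.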

In [None]:
def predict_knn(test, k=5):
    """Return the mean pKa of the k nearest neighbors."""
    return np.mean(closest(train.drop('Name', 'smiles'), test, k, 'pKa').column('pKa'))

Alternate weighting of neighbors

In [None]:
def predict_knn_weighted(example, k):
    """Return the 1/distance-weighted mean pKa of the k nearest neighbors."""
    # Drop both non-numeric columns so that only features and pKa remain.
    dist_table = closest(train.drop('Name', 'smiles'), example, k, 'pKa')
    # Weight each neighbor by 1/distance so that closer neighbors count more.
    dist_table = dist_table.with_columns('knn_weighting', 1 / dist_table['Distance'])
    # Dividing by the sum of the weights normalizes them to sum to 1.
    sum_weight = np.sum(dist_table['knn_weighting'])
    weighted_mean_pKa = np.sum(dist_table['pKa'] * dist_table['knn_weighting'] / sum_weight)
    return weighted_mean_pKa

In [None]:
def predict_knn_weighted_exp(example, k):
    """Return the exp(-distance)-weighted mean pKa of the k nearest neighbors."""
    # Drop both non-numeric columns so that only features and pKa remain.
    dist_table = closest(train.drop('Name', 'smiles'), example, k, 'pKa')
    # Weight each neighbor by exp(-distance) so that closer neighbors count more.
    dist_table = dist_table.with_columns('knn_weighting', np.exp(-dist_table['Distance']))
    # Dividing by the sum of the weights normalizes them to sum to 1.
    sum_weight = np.sum(dist_table['knn_weighting'])
    weighted_mean_pKa = np.sum(dist_table['pKa'] * dist_table['knn_weighting'] / sum_weight)
    return weighted_mean_pKa