# Optimization for Data Science

In this notebook, we will look at how optimization is used in various machine learning applications.

Basic elements of optimization:
- Variables
- Constraints
- Objective function (the function to minimize or maximize)
- Optimization method
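
The four elements above can be seen together in a small toy problem. The sketch below is illustrative only: the objective, the constraint, and the names `objective`, `constraints`, and `v0` are assumptions made up for this example, and SciPy's `minimize` with the SLSQP method stands in for "an optimization method".

```python
import numpy as np
from scipy.optimize import minimize

# Objective function: squared distance from the point (2, 1).
def objective(v):
    x, y = v
    return (x - 2.0) ** 2 + (y - 1.0) ** 2

# Constraint: x + y <= 2, written as g(v) >= 0 (SciPy's "ineq" convention).
constraints = [{"type": "ineq", "fun": lambda v: 2.0 - v[0] - v[1]}]

# Variables: two numbers, started from an arbitrary initial guess.
v0 = np.array([0.0, 0.0])

# Optimization method: SLSQP, one of several methods that accept constraints.
result = minimize(objective, v0, method="SLSQP", constraints=constraints)
print("solution:", result.x)
print("objective value:", result.fun)
```

Without the constraint the minimum would sit at (2, 1); with it, the solver has to settle for the closest point that satisfies x + y <= 2.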

Because the sklearn estimators come with built-in objective functions and optimizers, we will focus on applications for regression and classification while noting how optimization happens behind the scenes.

## Wine Data:
<img src="Resource/wine.jpg" width="250">
Winemaking is an interesting process in which many different factors determine the properties of the wine. This dataset presents a chemical and physical analysis of wines from 3 different sources in Italy.<sup>1</sup>

<sup>1. Forina, M. et al., PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.</sup>

#### Brief description of features:
1. Cultivar: source of wine
2. Alcohol: alcohol content
3. Malic acid (C4H6O5): found in fruits, contributes a sour taste
4. Ash: inorganic matter
5. Alkalinity of ash: how basic the ash is
6. Magnesium: magnesium content; magnesium is a cofactor in many enzyme systems that regulate biochemical reactions in the body
7. Total phenols: natural compounds containing a phenol group that contribute to the color and texture of wine
8. Flavanoids: a type of phenol; most of the phenols in wine are flavanoids
9. Nonflavanoid phenols: all the other phenols
10. Proanthocyanidins: polyphenols, composed of flavanoid oligomers
11. Color intensity: measurement made with a spectrophotometer/colorimeter to determine the light transmission properties of the wine
12. Hue: a property of color of the wine
13. OD280/OD315 of diluted wines: ratio of the optical density at 280 nm to that at 315 nm. Optical density is like absorbance, except that it also accounts for the scattering of light. Used to determine protein concentration
14. Proline (C5H9NO2): the most abundant amino acid in wine

In [None]:
# Imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

In [None]:
df = pd.read_csv('wine.data',names = ['Cultivar','Alcohol','Malic_acid','Ash','Alkalinity_of_ash','Magnesium',
                                 'Total_phenols','Flavanoids','Nonflavanoid_phenols','Proanthocyanidins','Color_intensity',
                                 'Hue','OD280/OD315_of_diluted_wines','Proline'])
df

What properties of wine are you interested in? Let's use a heatmap to find relationships between the features!

In [None]:
# Create a heatmap using seaborn:
# Helpful settings: square = True, annot = True
correlation = df.corr()
fig, ax = plt.subplots(figsize=(11,11))   # plt.subplots returns a (figure, axes) pair
sb.heatmap(correlation, square = True, annot = True, ax = ax)

## Linear Regression:

Optimization for linear regression involves finding the parameters that minimize the mean squared error. For ordinary least squares, this minimum has a closed-form solution given by the normal equation.
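
For a design matrix $A$ (a column of ones plus the feature column) and targets $y$, the normal equation reads $A^T A \, \theta = A^T y$. The sketch below solves it directly with NumPy on synthetic data; the data, the random seed, and the names `A`, `theta`, and `theta_lstsq` are assumptions for illustration and are not part of the wine dataset.

```python
import numpy as np

# Synthetic one-feature data: y is roughly 1.5 + 0.6 * x plus noise.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 5, size=50)
y_demo = 1.5 + 0.6 * x_demo + rng.normal(scale=0.3, size=50)

# Design matrix with an intercept column.
A = np.column_stack([np.ones_like(x_demo), x_demo])

# Normal equation: solve (A^T A) theta = A^T y for theta.
theta = np.linalg.solve(A.T @ A, A.T @ y_demo)

# Cross-check against NumPy's least-squares routine.
theta_lstsq, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print("intercept, slope (normal equation):", theta)
print("intercept, slope (lstsq):          ", theta_lstsq)
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse, which is generally the more numerically stable choice.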

In [None]:
# Visualize our data:
x = df[['Flavanoids']]
y = df[['OD280/OD315_of_diluted_wines']]
fig = plt.figure(figsize = (8,8))
ax = fig.add_axes([.1,.1,.8,.8])
ax.scatter(x,y)
ax.set_xlabel('Flavanoids')
ax.set_ylabel('OD280/OD315_of_diluted_wines')

In [None]:
# Split into training and testing sets
x_train,x_test,y_train,y_test = train_test_split(x,y)

# Perform linear regression
linreg = LinearRegression()
linreg.fit(x_train,y_train)
y_pred = linreg.predict(x_test)

# Check how we did (for a regressor, score() returns R^2, not accuracy):
r2 = linreg.score(x_test,y_test)
print("R^2 score:          ", r2)

mse = mean_squared_error(y_test,y_pred)
print("mean squared error: ", mse)

# plot our results:
fig = plt.figure(figsize = (8,8))
ax = fig.add_axes([.1,.1,.8,.8])
ax.scatter(x,y)
ax.plot(x_test,y_pred,'r')
ax.set_xlabel('Flavanoids')
ax.set_ylabel('OD280/OD315_of_diluted_wines')

## Logistic Regression:

In [None]:
# Show which of the three sources the wine is from
fig = plt.figure(figsize = (8,8))
ax = fig.add_axes([.1,.1,.8,.8])
ax.scatter(df.Flavanoids, df['OD280/OD315_of_diluted_wines'], c=df.Cultivar, edgecolors='k', cmap=plt.cm.Paired)
ax.set_xlabel('Flavanoids')
ax.set_ylabel('OD280/OD315_of_diluted_wines')

In [None]:
# Split into training and testing sets:
X = df[['Flavanoids','OD280/OD315_of_diluted_wines']]
Y = df['Cultivar']
X_train,X_test,Y_train,Y_test = train_test_split(X,Y)

# Fit the data:
logreg = LogisticRegression(C=1e5, solver='lbfgs', multi_class='multinomial')   #Notice the solver settings
logreg.fit(X_train, Y_train)

# Results:
training_score = logreg.score(X_train, Y_train)
print("training score: ",training_score)
test_score = logreg.score(X_test, Y_test)
print("test score:     ", test_score)

In [None]:
# Visualize the data:
# Plot the decision boundary in a mesh:
x_min, x_max = X['Flavanoids'].min() - .5, X['Flavanoids'].max() + .5
y_min, y_max = X['OD280/OD315_of_diluted_wines'].min() - .5, X['OD280/OD315_of_diluted_wines'].max() + .5
h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(8, 8))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot the points:
plt.scatter(df.Flavanoids, df['OD280/OD315_of_diluted_wines'], c=df.Cultivar, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Flavanoids')
plt.ylabel('OD280/OD315_of_diluted_wines')

Let's observe the different types of solvers:
1. Newton-Conjugate Gradient: 'newton-cg'
2. Limited-memory Broyden–Fletcher–Goldfarb–Shanno: 'lbfgs'
3. A Library for Large Linear Classification: 'liblinear'
4. Stochastic Average Gradient: 'sag'
5. SAGA (a variant of SAG): 'saga'
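
One way to compare the solvers listed above is to fit the same model with each and look at the test scores. The sketch below is a self-contained illustration, assuming that scikit-learn's bundled `load_wine` copy of the data can stand in for `wine.data`; the split seed, `max_iter` value, and variable names are choices made for this example. Because 'liblinear' only handles binary problems directly, it is wrapped in a one-vs-rest classifier here.

```python
import warnings
from sklearn.datasets import load_wine
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

# Load the bundled wine data and keep the two features used in this notebook.
wine = load_wine()
names = list(wine.feature_names)
cols = [names.index("flavanoids"), names.index("od280/od315_of_diluted_wines")]
X_tr, X_te, y_tr, y_te = train_test_split(
    wine.data[:, cols], wine.target, random_state=0
)

# Standardize using training statistics only; 'sag' and 'saga' converge
# much faster on scaled features.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

solver_scores = {}
for solver in ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]:
    base = LogisticRegression(solver=solver, max_iter=5000)
    # liblinear is a binary solver, so wrap it for the three-class problem.
    model = OneVsRestClassifier(base) if solver == "liblinear" else base
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ConvergenceWarning)
        model.fit(X_tr, y_tr)
    solver_scores[solver] = model.score(X_te, y_te)
    print(f"{solver:10s} test score: {solver_scores[solver]:.3f}")
```

On a small, well-scaled problem like this one the solvers are expected to reach very similar solutions, since they minimize the same convex objective; the differences show up mainly in speed and scalability on larger data.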



## SVM:

In [None]:
# Fit the data:
clf = svm.SVC(C = 1, gamma = 'auto')   # SVC is built on libsvm; LinearSVC is the variant built on liblinear
clf.fit(X_train, Y_train)

# Results:
training_score = clf.score(X_train, Y_train)
print("training score: ",training_score)
test_score = clf.score(X_test, Y_test)
print("test score:     ", test_score)

In [None]:
# Plot the decision boundary in a mesh:
x_min, x_max = X['Flavanoids'].min() - .5, X['Flavanoids'].max() + .5
y_min, y_max = X['OD280/OD315_of_diluted_wines'].min() - .5, X['OD280/OD315_of_diluted_wines'].max() + .5
h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(8, 8))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot the points:
plt.scatter(df.Flavanoids, df['OD280/OD315_of_diluted_wines'], c=df.Cultivar, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Flavanoids')
plt.ylabel('OD280/OD315_of_diluted_wines')

## Decision Tree:

In [None]:
# Fit the data:
decision_tree_classifier = DecisionTreeClassifier(random_state=1)    #optimization is done by greedy algorithm (usually)
decision_tree_classifier.fit(X_train, Y_train)  

# Results:
training_score = decision_tree_classifier.score(X_train, Y_train)
print("training score: ",training_score)
test_score = decision_tree_classifier.score(X_test, Y_test)
print("test score:     ", test_score)
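
The "greedy algorithm" mentioned in the comment above means that the tree picks the best split at each node without looking ahead. The sketch below is a simplified illustration of one such step using Gini impurity (scikit-learn's default criterion); it is not scikit-learn's actual implementation, and the helpers `gini` and `best_split` and the toy arrays are hypothetical names made up for this example.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Try every feature and midpoint threshold; keep the lowest weighted impurity."""
    n, d = X.shape
    best = (None, None, gini(y))  # (feature index, threshold, impurity)
    for j in range(d):
        values = np.unique(X[:, j])
        thresholds = (values[:-1] + values[1:]) / 2.0
        for t in thresholds:
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# Toy data: feature 0 separates the two classes at 2.5.
X_toy = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 6.0]])
y_toy = np.array([0, 0, 1, 1])

feature, threshold, impurity = best_split(X_toy, y_toy)
print("split on feature", feature, "at threshold", threshold, "impurity", impurity)
```

A full tree repeats this search recursively on each side of the chosen split. Because each choice is only locally optimal, the resulting tree is not guaranteed to be the best possible tree overall.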

In [None]:
# Plot the decision boundary in a mesh:
x_min, x_max = X['Flavanoids'].min() - .5, X['Flavanoids'].max() + .5
y_min, y_max = X['OD280/OD315_of_diluted_wines'].min() - .5, X['OD280/OD315_of_diluted_wines'].max() + .5
h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = decision_tree_classifier.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(8, 8))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot the points:
plt.scatter(df.Flavanoids, df['OD280/OD315_of_diluted_wines'], c=df.Cultivar, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Flavanoids')
plt.ylabel('OD280/OD315_of_diluted_wines')

## Neural Network:

In [None]:
# Fit the data:
mlp = MLPClassifier()   # solver options: 'lbfgs', 'sgd', 'adam' (the default)
mlp.fit(X_train,Y_train)

# Results:
training_score = mlp.score(X_train, Y_train)
print("training score: ",training_score)
test_score = mlp.score(X_test, Y_test)
print("test score:     ", test_score)

In [None]:
# Plot the decision boundary in a mesh:
x_min, x_max = X['Flavanoids'].min() - .5, X['Flavanoids'].max() + .5
y_min, y_max = X['OD280/OD315_of_diluted_wines'].min() - .5, X['OD280/OD315_of_diluted_wines'].max() + .5
h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = mlp.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(8, 8))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot the points:
plt.scatter(df.Flavanoids, df['OD280/OD315_of_diluted_wines'], c=df.Cultivar, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Flavanoids')
plt.ylabel('OD280/OD315_of_diluted_wines')

How does changing the solver affect the end results of the MLP classifier? How does this compare with changing solvers for the logistic regression?

Adjust the learning rate (`learning_rate_init`, used by the 'sgd' and 'adam' solvers) and the momentum (`momentum`, used only by 'sgd') to see what happens!
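
As a starting point for that experiment, the sketch below trains the MLP with the 'sgd' solver over a small grid of learning rates and momentum values. It is a self-contained illustration, assuming scikit-learn's bundled `load_wine` copy of the data can stand in for `wine.data`; the grid values, `max_iter`, the seeds, and the variable names are arbitrary choices for this example.

```python
import warnings
from sklearn.datasets import load_wine
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Same two features as in the rest of the notebook.
wine = load_wine()
names = list(wine.feature_names)
cols = [names.index("flavanoids"), names.index("od280/od315_of_diluted_wines")]
X_tr, X_te, y_tr, y_te = train_test_split(
    wine.data[:, cols], wine.target, random_state=0
)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

mlp_results = {}
for lr in (0.001, 0.1):
    for momentum in (0.0, 0.9):
        net = MLPClassifier(
            solver="sgd",            # momentum only has an effect with 'sgd'
            learning_rate_init=lr,
            momentum=momentum,
            max_iter=500,
            random_state=0,
        )
        with warnings.catch_warnings():
            # Slow settings may hit max_iter before converging.
            warnings.simplefilter("ignore", ConvergenceWarning)
            net.fit(X_tr, y_tr)
        mlp_results[(lr, momentum)] = (net.n_iter_, net.score(X_te, y_te))
        print(f"lr={lr:<6} momentum={momentum:<4} "
              f"iterations={net.n_iter_:<4} test score={mlp_results[(lr, momentum)][1]:.3f}")
```

Comparing the iteration counts alongside the scores shows how these two settings influence how quickly training stops, not only where it ends up.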