# Imports

In [1]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, linear_model
from sklearn import tree
import sklearn.cluster as cluster
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression
import plotly.express as px
import plotly.graph_objects as go
from scipy.optimize import minimize

np.random.seed(1234)

In [None]:
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)
plt.rcParams.update({"axes.grid": True, "figure.figsize": (8, 4)})

# sns.set()

# Various Data Loads

In [6]:
tips = sns.load_dataset("tips")
X = tips.drop(columns=["tip"])
y = tips["tip"]
display(X)
display(y)

Unnamed: 0,total_bill,sex,smoker,day,time,size
0,16.99,Female,No,Sun,Dinner,2
1,10.34,Male,No,Sun,Dinner,3
2,21.01,Male,No,Sun,Dinner,3
3,23.68,Male,No,Sun,Dinner,2
4,24.59,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...
239,29.03,Male,No,Sat,Dinner,3
240,27.18,Female,Yes,Sat,Dinner,2
241,22.67,Male,Yes,Sat,Dinner,2
242,17.82,Male,No,Sat,Dinner,2


0      1.01
1      1.66
2      3.50
3      3.31
4      3.61
       ... 
239    5.92
240    2.00
241    2.00
242    1.75
243    3.00
Name: tip, Length: 244, dtype: float64

In [2]:
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names).join(
    pd.Series(iris["target"], name="species")
)

df["species"] = df["species"].map({0: "setosa", 1: "versicolor", 2: "virginica"})

df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [None]:
df = pd.read_csv("./data/housing.csv")
df

# Overview - History of Neural Networks

The first general, working learning algorithms for multilayer perceptrons were developed for feedforward networks and did not rely on backpropagation. Alexey Ivakhnenko and V. G. Lapa published this work in 1967 in Cybernetics and Forecasting Techniques. Then, in 1971, Ivakhnenko published a paper called "Polynomial Theory of Complex Systems," in which he described a deep network with eight layers that was trained by the Group Method of Data Handling (GMDH).

After researching backpropagation for his 1974 dissertation, Paul Werbos was the first person in the United States to propose using it for artificial neural networks. Backpropagation measures the error at the output (not at the input) and propagates it backward through the network's layers so the weights of each layer can be adjusted during training. It has become the standard method for training deep neural networks.

Lastly, deep learning became practical in 1989, when Yann LeCun and his colleagues applied the standard backpropagation algorithm (whose origins date to 1970) to a neural network. They aimed to teach the computer to recognize handwritten ZIP codes on mail. The system worked, and it marked the beginning of deep learning in practice.

# 22.1 Introduction to Neural Networks

Neural networks were first used successfully for image classification. Consider an M x N image with 3 color channels
- It has M x N x 3 total features, i.e. a dimensionality of M x N x 3
- You could fit decision trees on the raw pixels, but the data set would have to be huge to avoid overfitting
- A good first step is to pre-process the image to capture its most important features and reduce its dimensionality
- Example: histogram of oriented gradients (HOG)
- For some time, the focus was on developing the best such pre-processing steps
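
To make the dimensionality point concrete, the sketch below counts the raw pixel features of a hypothetical 64 x 64 RGB image and then reduces it to a 9-bin gradient-orientation histogram with NumPy. This is a heavily simplified stand-in for HOG (no cells, blocks, or block normalization); the image size, the random pixel values, and the bin count are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 64 x 64 RGB image with random pixel values (illustration only)
M, N = 64, 64
image = rng.random((M, N, 3))

n_raw_features = image.size  # M * N * 3 raw pixel features

# Simplified gradient-orientation histogram (NOT the full HOG algorithm)
gray = image.mean(axis=2)                 # collapse the 3 color channels
gy, gx = np.gradient(gray)                # gradients along rows and columns
magnitude = np.hypot(gx, gy)              # gradient strength per pixel
orientation = np.degrees(np.arctan2(gy, gx)) % 180  # unsigned angle in degrees

n_bins = 9
hist, _ = np.histogram(
    orientation, bins=n_bins, range=(0, 180), weights=magnitude
)
hist = hist / hist.sum()                  # normalize so the bins sum to 1

print(n_raw_features, hist.shape)
```

The 12,288 raw features shrink to 9 numbers here; a real HOG descriptor keeps more spatial detail but follows the same idea of summarizing gradients instead of feeding raw pixels to the classifier.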

Standard data sets are important for comparing classifier performance, for example ImageNet
- Big database of labeled images
- A contest was held for years in which people entered their best classifiers
- A step change in performance came with AlexNet, which was a neural network (vs the traditional computer vision approach)

![image.png](attachment:image.png)

# 22.2 Foundations of NN

Terminology recap

![image.png](attachment:image.png)

Recapping linear regression
- Find parameters a, b (a is a vector) such that the loss function is minimized
- Remember phi(xi) is a transformation applied to the input features xi

![image.png](attachment:image.png)
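
As a minimal sketch of the recap above, the code below fits a vector a and a scalar b so that a . phi(x) + b approximates y under a squared-error loss. The choice phi(x) = [x, x^2], the synthetic data, and the use of `np.linalg.lstsq` as the solver are assumptions made for illustration, not the method shown in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Hypothetical feature transformation: x -> [x, x^2]."""
    return np.column_stack([x, x**2])

# Synthetic data generated from known parameters (illustration only)
x = rng.uniform(-2, 2, size=200)
true_a, true_b = np.array([1.5, -0.7]), 0.3
y = phi(x) @ true_a + true_b + rng.normal(0, 0.05, size=x.size)

# Append a column of ones so the intercept b is fit alongside the vector a
design = np.column_stack([phi(x), np.ones_like(x)])
params, *_ = np.linalg.lstsq(design, y, rcond=None)
a_hat, b_hat = params[:2], params[2]

# Mean squared error of the fitted model on the training data
mse = np.mean((design @ params - y) ** 2)
print(a_hat, b_hat, mse)
```

The recovered `a_hat` and `b_hat` should land close to the parameters used to generate the data, since the model family matches the data-generating process.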

Depicting the above in network form

![image.png](attachment:image.png)

When switching to a NN, several things are generalized. First, the phi functions become activation functions; the most common are
- sigmoid: squashes its input into the range (0, 1), so it acts like a soft switch, either opening or closing a channel
- tanh: puts saturation limits on the output of a neuron; small values pass through basically unchanged, but large positive or negative values saturate toward +1 or -1
- relu: passes positive values through unchanged and sets any negative value to 0
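
The three activation functions can be sketched in a few lines of NumPy. The sample inputs are arbitrary and chosen only to show the behavior near zero and at large magnitudes.

```python
import numpy as np

def sigmoid(z):
    """Squash z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Keep positive values, replace negative values with 0."""
    return np.maximum(0.0, z)

# Arbitrary sample inputs: large negative, small, zero, small, large positive
z = np.array([-10.0, -0.1, 0.0, 0.1, 10.0])

s = sigmoid(z)   # near 0 for very negative z, 0.5 at z = 0, near 1 for large z
t = np.tanh(z)   # roughly z for small z, saturates toward -1 / +1
r = relu(z)      # zero for negative z, identity for positive z

print(s, t, r, sep="\n")
```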

![image.png](attachment:image.png)

Another generalization
- NNs add coefficients (weights) to all edges of the network graph, not just the ones on the output side

![image.png](attachment:image.png)

Generalization
- Allow for many layers to be used, not just a single layer
- L total layers, each of which has its own phi(prior layer) transformation
- For L layers: L-1 hidden layers, plus the Lth output (not hidden) layer
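
A forward pass through such a stack of layers might look like the sketch below. The layer sizes, the random weights, and the choice of tanh for phi are assumptions for illustration; the output layer is left linear here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical architecture: 4 inputs, hidden layers of 5 and 3 neurons, 1 output
layer_sizes = [4, 5, 3, 1]
L = len(layer_sizes) - 1  # layers with weights: L-1 hidden plus 1 output

# One weight matrix A and one bias vector b per layer (random, illustration only)
weights = [
    rng.normal(size=(n_out, n_in))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [rng.normal(size=n_out) for n_out in layer_sizes[1:]]

def forward(x, weights, biases, phi=np.tanh):
    """Apply phi(A @ h + b) on each hidden layer; keep the output layer linear."""
    h = x
    for A, b in zip(weights[:-1], biases[:-1]):
        h = phi(A @ h + b)
    return weights[-1] @ h + biases[-1]

x = rng.normal(size=layer_sizes[0])
y_hat = forward(x, weights, biases)
print(L, y_hat.shape)
```

Each hidden layer takes the previous layer's output as its input, which is the phi(prior layer) structure described in the list above.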

![image.png](attachment:image.png)

Final generalization
- By adding an activation function on the output of the last layer, you can create classifiers
- For example, with a sigmoid on the output you get a binary classifier

The training problem is then to find all A, b such that the loss function is minimized
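
As a minimal sketch of that training problem, the code below fits the smallest possible case: a single sigmoid output neuron with no hidden layers, trained by plain gradient descent. The binary cross-entropy loss, the synthetic data, the learning rate, and the iteration count are all assumptions for illustration; a deeper network would compute the same kind of gradients via backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y, eps=1e-12):
    """Binary cross-entropy loss (assumed loss function)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic data: class 1 when the two features sum to a positive number
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

a = np.zeros(2)  # weight vector
b = 0.0          # bias
lr = 0.5         # hypothetical learning rate

initial_loss = bce(sigmoid(X @ a + b), y)
for _ in range(500):
    p = sigmoid(X @ a + b)
    grad_a = X.T @ (p - y) / len(y)  # gradient of the loss w.r.t. a
    grad_b = np.mean(p - y)          # gradient of the loss w.r.t. b
    a -= lr * grad_a
    b -= lr * grad_b

final_loss = bce(sigmoid(X @ a + b), y)
accuracy = np.mean((sigmoid(X @ a + b) > 0.5) == y)
print(initial_loss, final_loss, accuracy)
```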

# 22.3 NN Playground

Full walkthrough of the calculation, given a 20-year-old passenger who paid $8
- consider layers before final layer as generating new features for the next layer
- the generated features are anonymous

![image.png](attachment:image.png)

Multiple layers with a programmable number of neurons per layer are fine too

![image.png](attachment:image.png)

Example: classes separated by quadrant in the x/y plane. Come up with a feature that predicts blue

![image.png](attachment:image.png)
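
One candidate feature for the quadrant example is the product x * y. The sketch below assumes that blue points are the ones where x and y share a sign (the XOR-style layout); the figure may use the opposite coloring, in which case the test on the product simply flips.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random points in the plane; ASSUMED labeling: blue when x and y share a sign
x = rng.uniform(-1, 1, size=500)
y = rng.uniform(-1, 1, size=500)
is_blue = (x > 0) == (y > 0)

# Thresholding a single raw coordinate (here x) is no better than guessing
acc_x_only = np.mean((x > 0) == is_blue)

# The engineered feature x * y separates the classes: blue when x * y > 0
feature = x * y
acc_product = np.mean((feature > 0) == is_blue)

print(acc_x_only, acc_product)
```

This is the kind of feature a hidden layer can learn on its own, which is why the earlier layers are described as generating new features for the next layer.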

Additional examples are shown using the [TensorFlow Playground](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=xor&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=&seed=0.05622&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=true&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false)

![image.png](attachment:image.png)