## Fundamentals of Models and Data   

Welcome to Data Science at Georgia Tech! Our goal is to teach you about Data Science and Machine Learning in a way that is useful. We mix theory and hands-on coding -- because it's cool when you can do stuff with your own hands. 

This notebook accompanies the Fundamental Understandings of Models and Data. Use this notebook to follow along!  

Goal: Learn how to think about **models** and **data**

Instructions for People New to Notebooks:
- To run a cell, click on it and press Shift+Enter.



#### Importing Important Toolkits  

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import pandas as pd

In [None]:
# Silence library warnings so the notebook output stays readable
import warnings
warnings.filterwarnings('ignore')

### There are 3 main ways to consider ML models. 
1) Pictorially  
2) Using decision boundaries  
3) Mathematically    
Let's walk through each of them 

### 1) Pictorially
#### Can you guess which picture corresponds to which algorithm?
a) KNN b) Neural Network c) Decision Tree d) SVM  
 
<img src="./materials/pics.png" style="height:330px">

In [None]:
with open('./materials/picture_answers.txt') as f:
    for line in f:
        print(line, end='')  # each line already ends with a newline

### 2) Decision Boundaries
####  Now we can plot decision boundaries  
Let's quickly fit a decision tree, a KNN, an SVM, and a simple neural network   
These are all very popular algorithms in industry and in academia!
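
Before plotting in 2-D, it may help to see what a decision boundary is in the simplest case: the place where a classifier's predicted label changes. The sketch below uses made-up 1-D points and a hand-rolled 1-nearest-neighbor rule in NumPy (an illustration only, not scikit-learn's `KNeighborsClassifier`).

```python
import numpy as np

# Made-up 1-D training set: class 0 on the left, class 1 on the right
train_x = np.array([0.0, 1.0, 3.0, 4.0])
train_y = np.array([0, 0, 1, 1])

# Grid of query points covering the range of the data
grid = np.linspace(0.05, 3.95, 40)

# 1-nearest-neighbor: each grid point takes the label of its closest training point
nearest = np.abs(grid[:, None] - train_x[None, :]).argmin(axis=1)
pred = train_y[nearest]

# The decision boundary lies where the predicted label flips
flip = np.flatnonzero(np.diff(pred) != 0)
boundary = (grid[flip] + grid[flip + 1]) / 2
print(boundary)
```

The 2-D plots in this section follow the same idea: predict a label for every point on a grid, then color the grid by predicted class.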

In [None]:
from itertools import product

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier


In [None]:
# Load data
iris = datasets.load_iris()
X, y = iris.data[:, [0, 1]], iris.target
data1 = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])
data1.head()

In [None]:
# Create the classifiers
dt = DecisionTreeClassifier(max_depth=4)
knn = KNeighborsClassifier(n_neighbors=7)
svc = SVC(gamma=.1, kernel='rbf', probability=True)
mlp = MLPClassifier()

In [None]:
# Train the classifiers
dt.fit(X, y)  
knn.fit(X, y)
svc.fit(X, y) 
mlp.fit(X,y)

In [None]:
# Plotting decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(10, 8))

for idx, clf, tt in zip(product([0, 1], [0, 1]),
                        [dt, knn, svc, mlp],
                        ['Decision Tree (depth=4)', 'KNN (k=7)',
                         'Kernel SVM', 'MLP']):

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.4)
    axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y,
                                  s=20, edgecolor='k')
    axarr[idx[0], idx[1]].set_title(tt)

plt.show()
# https://scikit-learn.org/stable/auto_examples/ensemble/plot_voting_decision_regions.html

### 3) Mathematical Formulations  
Let's focus on a regression model and look at its mathematical formula.

#### Intuition: linear regression looks something like this: y = mx + b
<img src="./materials/linear_regression_result-1.png" style="height:250px">  
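
As a minimal sketch of the y = mx + b intuition, the snippet below fits a straight line to a handful of made-up points using NumPy's `np.polyfit`. The data values are invented for illustration and lie exactly on a line, so the fit should recover the slope and intercept used to generate them.

```python
import numpy as np

# Made-up points lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 least-squares fit returns [slope, intercept]
m, b = np.polyfit(x, y, deg=1)
print(f"m = {m:.2f}, b = {b:.2f}")
```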
  


Now let's move up to higher dimensions. We will use thirteen features to predict Boston housing prices.

In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston = datasets.load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
print(df.shape)
df.head()

Now we can fit a linear regression model to our data.   
Because there is more than one feature, y = mx + b is not enough: each feature needs its own coefficient.
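
One way to write the extended formula, assuming we keep the names from y = mx + b and simply index them (the symbols $m_i$ and $x_i$ are notation chosen here for illustration), is:

$$
y = b + m_1 x_1 + m_2 x_2 + \dots + m_{13} x_{13} = b + \sum_{i=1}^{13} m_i x_i
$$

In the fitted scikit-learn model, the values $m_1, \dots, m_{13}$ correspond to `lm.coef_` and $b$ corresponds to `lm.intercept_`.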

In [None]:
from sklearn.model_selection import train_test_split
X, Y = df, boston.target

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)

In [None]:
from sklearn.linear_model import LinearRegression

lm = LinearRegression().fit(X_train, Y_train)

Y_pred = lm.predict(X_test)
print("Score is:",lm.score(X_test,Y_test))

# Pair each coefficient with its feature index so every term reads m_i*x_i
terms = ['%.2f*x%d' % (coef, i) for i, coef in enumerate(lm.coef_, start=1)]
print("\n \nFormula is:\n y = ", ' + '.join(terms), ' + ', '%.2f' % lm.intercept_)

### Understanding Data  
Data comes in many forms. Three main categories are sound, image, and tabular data.   
Individual features can be numerical (discrete or continuous) or categorical.
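
To make these feature types concrete, here is a small sketch with a toy table. The column names and values are hypothetical and chosen only to show one feature of each kind.

```python
import pandas as pd

# Hypothetical toy table; column names and values are made up for illustration
toy = pd.DataFrame({
    "rooms": [3, 4, 2],                # discrete: whole-number counts
    "price": [210.5, 340.0, 155.25],   # continuous: any real value in a range
    "style": pd.Categorical(["ranch", "condo", "ranch"]),  # categorical: a fixed set of labels
})

# pandas stores each kind of feature with a different dtype
print(toy.dtypes)
```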

In [None]:
# Example: Tabular

from sklearn import datasets
data = datasets.load_breast_cancer()
X = data.data
df = pd.DataFrame(data.data, columns=data.feature_names)
print(df.shape)
print(data.target_names)
df.head()

Question: Are these features continuous or discrete? What about the output? 

In [None]:
print(df.shape)
df.describe()

Understanding data: let's see how an image is represented.
<img src="./materials/buzz.jpg" style="height:200px">
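
As a rough sketch of the idea, an image can be stored as a grid of pixels where each pixel holds a few channel values. The tiny 2x2 RGB image below is made up for illustration and is not one of the course files.

```python
import numpy as np

# A made-up 2x2 RGB image: the shape is (height, width, channels)
tiny = np.array([
    [[255, 0, 0], [0, 255, 0]],      # red pixel, green pixel
    [[0, 0, 255], [255, 255, 255]],  # blue pixel, white pixel
], dtype=np.uint8)

print(tiny.shape)
print(tiny[0, 1])  # channel values of the top-right pixel
```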


In [None]:
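import matplotlib.pyplot as plt
# scipy.misc.imread was removed from SciPy; matplotlib's imread loads the file into a NumPy array
pixel_vals = plt.imread('./materials/pics.png')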

print("size is", pixel_vals.shape)
print(pixel_vals)


Fun Question: Why does the last dimension have a size of four? 

In [None]:
# Answer: PNG files can store an alpha (transparency) channel alongside red, green, and blue

### Slide deck continues with discussion of overfitting and underfitting if time permits 
  

<img src="./materials/fitting.png" style="height:120px">
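
As a small, hedged illustration of underfitting and overfitting, the sketch below fits polynomials of increasing degree to made-up noisy data. The data, the noise level, and the degrees are all assumptions chosen for illustration. Training error can only go down as the degree grows, so it is the held-out error that shows whether the extra flexibility is actually helping.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Synthetic data: a sine curve plus noise (made up for illustration)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0.025, 0.975, 20)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.2, size=x_test.size)

train_mse, test_mse = {}, {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree
    poly = Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = np.mean((poly(x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((poly(x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse[degree]:.3f}, "
          f"test MSE {test_mse[degree]:.3f}")
```

A straight line (degree 1) is too simple to follow a sine curve, which is the underfitting case. A model whose training error keeps shrinking while its held-out error stops improving is starting to overfit.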

# Questions?  
Slack a content member or email me at hwhittaker6@gatech.edu