
## Notes
* TensorFlow
    - Developed for internal use at Google
    - Based on neural networks
* Keras
    - open source
    - can be run on top of TensorFlow
    - specialized for neural networks?
    - designed by google engineer?
    - [Getting Started with Keras](https://youtu.be/J6Ok8p463C4)
* PyTorch
    - ease of use
    - Developed at Facebook
    - computer vision, natural language processing
    - based on torch C library
* Scikit-learn
    - Built on NumPy and SciPy
    - popular for classical algorithms
    - [Getting started](https://youtu.be/rvVkVsG49uU)



## Introduction
Machine learning consists of letting a computer find a way to solve a problem using data. This contrasts with traditional programming, where the programmer defines the solution in code. In this chapter, we will take an overview of some of the more popular libraries used for machine learning. These libraries implement the algorithms used to create and train machine learning models. These models have various uses, depending on the type of problem. Some models are useful for predicting future values, others for classifying data into groups or categories.



## Popular Libraries

Four of the most popular libraries are TensorFlow, Keras, PyTorch, and Scikit-learn. TensorFlow was developed by Google for internal use. It is a powerful library used to solve problems using deep learning. This involves defining layers that transform the data and that are tuned as the solution is fit to the data. Keras is an open source library designed to work with TensorFlow, and it is now included in the TensorFlow library.

PyTorch is Facebook's contribution to production-worthy machine learning libraries. It is based on the Torch library, which makes use of GPUs in solving parallel problems.

Scikit-learn is a popular library for getting started with machine learning. It is built on top of NumPy and SciPy. It has classes for most of the traditional algorithms. We will take a closer look at Scikit-learn, but first let's talk about a general approach to solving a problem using machine learning.

## High Level Process

Machine learning algorithms can be divided into two types: unsupervised and supervised learning. Unsupervised learning involves discovering insights about data without pre-existing results to test against. In supervised learning, you use known data to train and test a model. Generally, the steps to training a supervised model are:

1. Transform data
2. Separate out test data
3. Train the model
4. Test accuracy

Scikit-learn has tools to simplify each of these steps.



## Transformations

For some algorithms, it is advantageous to transform the data before training a model. For example, you might want to take a continuous variable, such as age, and turn it into discrete categories, such as age ranges. Scikit-learn includes many types of transformers, including ones for cleaning, feature extraction, reduction, and expansion. These are represented as classes, which generally use a .fit() method to determine the transformation and a .transform() method to modify data. In figure ... we use a MinMaxScaler. This transformer scales values to fit in a defined range, between 0 and 1 by default.

In [None]:
import numpy as np

data = np.array([[100,  34,  4],
                 [ 90,   2,  0],
                 [ 78, -12, 16],
                 [ 23,  45,  4]])

data

array([[100,  34,   4],
       [ 90,   2,   0],
       [ 78, -12,  16],
       [ 23,  45,   4]])

In [None]:
from sklearn.preprocessing import MinMaxScaler

minMax = MinMaxScaler()
scaler = minMax.fit(data)

scaler.transform(data)

array([[1.        , 0.80701754, 0.25      ],
       [0.87012987, 0.24561404, 0.        ],
       [0.71428571, 0.        , 1.        ],
       [0.        , 1.        , 0.25      ]])
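
The age example mentioned earlier, turning a continuous variable into discrete ranges, can be sketched with the same .fit() and .transform() pattern. This is a minimal illustration using Scikit-learn's KBinsDiscretizer; the ages, the number of bins, and the variable names are assumptions made up for the example, not part of the chapter's data.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical ages, one sample per row (values invented for illustration)
ages = np.array([[3.0], [17.0], [25.0], [42.0], [68.0], [90.0]])

# Three equal-width bins, encoded as the ordinal labels 0, 1, and 2
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
binner.fit(ages)                     # work out the bin edges from the data

age_groups = binner.transform(ages)  # replace each age with its bin label
print(binner.bin_edges_[0])          # the edges chosen during fitting
print(age_groups.ravel())
```

With the uniform strategy the edges are evenly spaced between the smallest and largest age, so the bins here cover roughly 3 to 32, 32 to 61, and 61 to 90.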

There may be times you wish to separate your data before fitting the transformer. In this way, the transformer settings will not be affected by the test data. Since fitting and transforming are separate methods, it is easy to fit to the training data and use that fit to transform the test data.
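
Fitting on the training data only might look like the following sketch. The two small arrays are invented stand-ins for data that has already been split; only MinMaxScaler and its .fit() and .transform() methods come from the text above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up arrays standing in for an already-split data set
train = np.array([[10.0, 1.0],
                  [20.0, 3.0],
                  [30.0, 5.0]])
test = np.array([[25.0, 2.0],
                 [40.0, 6.0]])

scaler = MinMaxScaler()
scaler.fit(train)                      # learn each column's min and max from the training data only

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuse the training min and max on the test data
print(test_scaled)
```

Because the scaler never saw the test rows, test values larger than the training maximum come out greater than 1. That is expected, and it reflects what the model would face with genuinely new data.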

## Splitting test and training data

One important pitfall to avoid when training a model is overfitting. This is when a model perfectly predicts the data used to train it, but has little predictive power with new data. In the simplest sense, we guard against overfitting by not testing the model with the data that it was trained on. Scikit-learn offers helper functions to make splitting data easy. In figure 12.2 we use the Scikit-learn function train_test_split() to split the iris data set provided with the library into train and test data sets.

Loading a sample dataset

In [None]:
from sklearn import datasets

In [None]:
data, target = datasets.load_iris(return_X_y=True)

In [None]:
print(type(data))
print(data.shape)

<class 'numpy.ndarray'>
(150, 4)


In [None]:
print(type(target))
print(target.shape)

<class 'numpy.ndarray'>
(150,)


Splitting data into training and test

In [None]:
from sklearn.model_selection import train_test_split

train_data, test_data, train_target, test_target = train_test_split(data, target)
print(train_data.shape)
print(train_target.shape)
print(test_data.shape)
print(test_target.shape)

(112, 4)
(112,)
(38, 4)
(38,)
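
By default, train_test_split() shuffles the data differently on each run, so the split above is not repeatable. The sketch below shows three optional parameters that may be useful; the seed value and the 20% test size are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

data, target = datasets.load_iris(return_X_y=True)

train_data, test_data, train_target, test_target = train_test_split(
    data, target,
    test_size=0.2,     # hold out 20% of the samples for testing
    random_state=0,    # arbitrary seed so the same split is produced every run
    stratify=target,   # keep the class proportions the same in both sets
)

print(train_data.shape, test_data.shape)
print(np.bincount(test_target))   # number of test samples in each class
```

The iris data set has 50 samples of each of its three classes, so a stratified 20% test set contains 10 samples per class.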


## Training a model

Scikit-learn offers many classes representing various machine learning algorithms. These are referred to as estimators. Many estimators can be tuned using parameters during instantiation. Each estimator has a .fit() method, which trains the model. Most of the .fit() methods take two arguments. The first is the training data, referred to as samples. The second is the results, or targets, for those samples. Both arguments should be array-like objects, such as NumPy arrays. Once the training is done, the model can predict results using its .predict() method. The accuracy of this prediction can be checked using functions from the metrics module. Figure 12. shows a simple example using the KNeighborsClassifier estimator.

In [None]:
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)

knn.fit(train_data, train_target)
test_prediction = knn.predict(test_data)
metrics.accuracy_score(test_target, test_prediction)

1.0
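
A single accuracy number hides which classes were confused with which. As one possible follow-up, the metrics module also provides confusion_matrix(). The sketch below repeats the load, split, and fit steps so that it runs on its own; the seed is an arbitrary assumption, and the scores you see will depend on the split.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data, target = datasets.load_iris(return_X_y=True)
train_data, test_data, train_target, test_target = train_test_split(
    data, target, random_state=0)   # arbitrary seed for a repeatable split

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_data, train_target)
test_prediction = knn.predict(test_data)

# Rows are the true classes, columns are the predicted classes
matrix = metrics.confusion_matrix(test_target, test_prediction, labels=[0, 1, 2])
accuracy = metrics.accuracy_score(test_target, test_prediction)
print(matrix)
print(accuracy)
```

Correct predictions fall on the diagonal of the matrix, so the accuracy equals the sum of the diagonal divided by the total number of test samples.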

## Next Steps

We have only scratched the surface of Scikit-learn's capabilities. Other important features are tools for cross-validation, where a data set is split multiple times to avoid overfitting on test data, and Pipelines, which wrap up transformers, estimators, and cross-validation together. If you want to learn more about Scikit-learn, there are great tutorials on the official website, [scikit-learn tutorials](https://scikit-learn.org/stable/tutorial/index.html#tutorial-menu).
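
As a rough taste of those two features, the sketch below chains a MinMaxScaler and a KNeighborsClassifier into a pipeline and scores it with five-fold cross-validation. The choice of five folds and three neighbors is an assumption for illustration, not a recommendation.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

data, target = datasets.load_iris(return_X_y=True)

# Chain the transformer and the estimator so the scaler is
# refit on the training portion of every fold
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))

# Five folds: train on four, score on the remaining one, five times over
scores = cross_val_score(model, data, target, cv=5)
print(scores)
print(scores.mean())
```

Wrapping the scaler in the pipeline means the test portion of each fold never influences the scaling, which is the same concern raised earlier about fitting transformers on training data only.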

## Summary

Many of the algorithms used to create machine learning models are represented in the major Python machine learning libraries. TensorFlow is a deep learning library created by Google. PyTorch is a library built on Torch by Facebook. Scikit-learn is a popular library for getting started with machine learning. It has modules and functions to perform the steps of creating and analysing a model.

## Questions

1. In which step of training a supervised estimator would a Scikit-learn Transformer be useful?

2. Why is it important to separate training and test data?

3. Once you have transformed your data and trained your model, what should you do next?