
## Four branches of machine learning
  

* We have seen three specific types of machine learning problems: binary classification, multiclass classification, and scalar regression.
* All three are instances of *supervised learning*.
* Machine learning algorithms generally fall into four broad categories, described below.

> ### Supervised learning

* The most common case

* It consists of learning to map input data to known targets (also called *annotations*), given a set of examples (often annotated by humans).

* Generally, almost all applications of deep learning that are in the spotlight these days belong in this category.
  * E.g., optical character recognition, speech recognition, image classification, and language translation

* Although supervised learning mostly consists of classification and regression, there are more variants as well.
  * Sequence generation: Given a picture, predict a caption describing it.
      
    <img src="https://drive.google.com/uc?id=13W2tZ27kqVmNEOa8ygs17jjVMAGebWKO" width="800">
    
    *https://cs.stanford.edu/people/karpathy/sfmltalk.pdf*
  
  * Syntax tree prediction: Given a sentence, predict its decomposition into a syntax tree.
  
  * Object detection: Given a picture, draw a bounding box around certain objects inside the picture.
  
    <img src="https://drive.google.com/uc?id=13WuTOydqvbLAuQ15C5P0Gj68srmAZ9NF" width="800">
    
    *https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088*
    
  * Object segmentation: Given a picture, draw a pixel-level mask on a specific object.
  
    <img src="https://drive.google.com/uc?id=13_-4dEBvtg9xRD-sqM1eyaLMNUz4Gvee" width="800">
    
    *https://medium.com/@jonathan_hui/object-detection-speed-and-accuracy-comparison-faster-r-cnn-r-fcn-ssd-and-yolo-5425656ae359* 
  
  

> ### Unsupervised learning

* Finding interesting transformations of the input data without the help of any targets for the purpose of:
  * data visualization
  * data compression
  * data denoising
  * to better understand the correlations present in the data at hand
  
* *Dimensionality reduction* and *clustering* are well-known categories of unsupervised learning.
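
As a small illustration of dimensionality reduction, the sketch below projects data onto its leading principal components using NumPy's SVD. The helper name `pca_project` and the toy data are assumptions made for this example; note that no targets are used anywhere.

```python
import numpy as np

def pca_project(x, num_components):
    """Project the rows of x onto the top principal components (hypothetical helper)."""
    x_centered = x - x.mean(axis=0)  # PCA works on zero-mean features
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(x_centered, full_matrices=False)
    return x_centered @ vt[:num_components].T

rng = np.random.default_rng(0)
# Toy data: 100 points that lie close to a line in 3-D space.
t = rng.normal(size=(100, 1))
x = t @ np.array([[1.0, 2.0, 3.0]]) + 0.01 * rng.normal(size=(100, 3))

x_reduced = pca_project(x, num_components=1)
print(x_reduced.shape)
```

Because the toy points lie almost on a line, a single component retains nearly all of the variance.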


> ### Self-supervised Learning

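Traditional deep learning requires every example to be labeled. As I understand it, second-generation deep learning can look at a single labeled example, pick up on similar features by itself, and label other examples (e.g., as "elephant") on its own. Going a step further, the labels are generated automatically, so no human labeling (annotation) is needed.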

* A specific instance of supervised learning
* Self-supervised learning is supervised learning without human-annotated labels.
* There are still labels involved, but they are generated from the input data.
* Examples
  * autoencoders where the generated targets are the input
  * predicting the next frame in a video, given past frames
  * predicting the next word in a text, given previous words
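
To make "labels generated from the input data" concrete, here is a minimal sketch that builds next-word training pairs from raw text. The function name `next_word_pairs` and the `context_size` parameter are assumptions for illustration.

```python
def next_word_pairs(text, context_size=3):
    """Build (context, target) pairs from raw text; no human labels are needed."""
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = words[i - context_size:i]  # previous words = model input
        target = words[i]                    # next word = generated label
        pairs.append((context, target))
    return pairs

pairs = next_word_pairs("the cat sat on the mat", context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

Every target here comes from the text itself, which is what separates this setup from ordinary supervised learning with human annotations.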


> ### Reinforcement learning

This is not covered in class, but it is a very important part of machine learning.<br>
It seems to be about handling long-term outcomes.

* Reinforcement learning started to get a lot of attention after Google DeepMind successfully applied it to learning to play Atari games.
  * https://www.youtube.com/watch?v=V1eYniJ0Rnk&vl=en
  
* In RL, an *agent* receives information about its environment and learns to choose actions that will maximize some reward.
  * For example, a neural network that "looks" at a video-game screen and outputs game actions in order to maximize its score can be trained via RL.
  
* It can be applied to a large range of real-world applications:
  * self-driving cars, robotics, resource management, education, and so on.

## Evaluating machine-learning models

* In the previous examples, we split the data into a training set, a validation set, and a test set.

* In machine learning, the goal is to achieve models that *generalize* - that perform well on never-before-seen data - and overfitting is the central obstacle.

* Here, we will focus on how to measure generalization: how to evaluate machine-learning models

> ### Training, validation, and test sets

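Deep learning usually uses hold-out data.

Why validate? <br>
Simply to find the best hyperparameters.<br>
The test step then checks how well the model, whose weights are optimized automatically through the loss function, finally performs.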

* Splitting the available data into three sets: training, validation, and test.
  * We train on the training data and evaluate our model on the validation data.
  * Once the model is ready, we test it one final time on the test data.
  
* Why not have just two sets: a training set and a test set?

* The reason is that developing a model always involves tuning its configuration.
  * For example, choosing the number of layers or the size of the layers
    * They are called the *hyperparameters* of the model, to distinguish them from the *parameters*, which are the network's weights.
  * We do this tuning by using the performance of the model on the validation data.
  * This tuning is a form of *learning*: a search for a good configuration in some parameter space.
  * As a result, tuning the configuration of the model can quickly result in *overfitting to the validation set*, even though your model is never directly trained on it.
  
* Central to this phenomenon is the notion of *information leak*.
  * Every time you tune a hyperparameter of your model based on the model's performance on the validation set, some information about the validation data leaks into the model.
  
* A model that performs artificially well on the validation set does not guarantee similar performance on the test set.

* **Simple hold-out validation**
 
    <img src="https://drive.google.com/uc?id=13fgXDda-SVywl2lhsyIhpdCs7wbMFqdK" width="800">
  
  * The simplest evaluation protocol
    * If little data is available, then the validation and test sets may contain too few samples to be statistically representative of the data at hand.
      * This is easy to recognize: if different random shuffles of the data before splitting yield very different measures of model performance, you have this issue.
  
  * Code example
  
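  ```python
    import numpy as np

    # Hold-out validation (schematic).
    # `data`, `test_data`, and `get_model()` are placeholders for your own
    # dataset arrays and model factory.
    num_validation_samples = 10000

    # Shuffle so that both splits are representative of the data.
    np.random.shuffle(data)

    # Set aside the first samples for validation; train on the rest.
    validation_data = data[:num_validation_samples]
    training_data = data[num_validation_samples:]

    # Train on the training data and evaluate on the validation data.
    model = get_model()
    model.train(training_data)
    validation_score = model.evaluate(validation_data)

    # At this point you can tune your model:
    # retrain it, evaluate it, tune it again, ...

    # Once the hyperparameters are tuned, train the final model from scratch
    # on all non-test data, then evaluate it once on the test set.
    model = get_model()
    model.train(np.concatenate([training_data, validation_data]))
    test_score = model.evaluate(test_data)
  ```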

* **k-fold validation**

  * Split the data into *k* partitions of equal size.
  * For each partition *i*, train a model on the remaining *k-1* partitions, and evaluate it on partition *i*.
  * Then, the average of the *k* scores is obtained as the final score.
  
    <img src="https://drive.google.com/uc?id=13glqM37H-fU8KZhZudfo35XTi-floJWi" width="800">
    
    
  * Code example
  
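  ```python
    import numpy as np

    # k-fold validation (schematic).
    # `data`, `test_data`, and `get_model()` are placeholders.
    k = 4
    num_validation_samples = len(data) // k

    np.random.shuffle(data)

    validation_scores = []
    for fold in range(k):
        # Select the validation partition for this fold.
        start = num_validation_samples * fold
        stop = num_validation_samples * (fold + 1)
        validation_data = data[start:stop]
        # Use everything else for training. np.concatenate joins the two
        # slices; `+` would add NumPy arrays element-wise instead.
        training_data = np.concatenate([data[:start], data[stop:]])

        # Train a fresh (untrained) model for every fold.
        model = get_model()
        model.train(training_data)
        validation_score = model.evaluate(validation_data)
        validation_scores.append(validation_score)

    # The final validation score is the average over the k folds.
    validation_score = np.average(validation_scores)

    # Train the final model on all non-test data and evaluate on the test set.
    model = get_model()
    model.train(data)
    test_score = model.evaluate(test_data)
  ```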

* **Iterated k-fold validation with shuffling** 
  
  * Applying k-fold validation multiple times, shuffling the data every time before splitting it *k* ways
  * The final score is the average of the scores obtained at each run of k-fold validation.

This method is used when very little data is available.<br>
It simply runs k-fold validation several times.
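
Iterated k-fold validation can be sketched as follows. This is a minimal version under stated assumptions: `iterated_k_fold` and its `train_and_evaluate` callback are hypothetical names, and the "model" is a stand-in that predicts the training mean so that the example runs without a deep-learning library.

```python
import numpy as np

def iterated_k_fold(data, train_and_evaluate, k=4, iterations=3, seed=0):
    """Average the validation score over several shuffled k-fold runs."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(iterations):
        shuffled = data[rng.permutation(len(data))]  # reshuffle before each run
        folds = np.array_split(shuffled, k)          # k nearly equal partitions
        for i in range(k):
            validation_data = folds[i]
            training_data = np.concatenate(folds[:i] + folds[i + 1:])
            scores.append(train_and_evaluate(training_data, validation_data))
    return float(np.mean(scores)), len(scores)

# Stand-in "model" (an assumption for the demo): predict the training mean
# and report the mean absolute error on the validation fold.
def mean_baseline_error(training_data, validation_data):
    return float(np.mean(np.abs(validation_data - training_data.mean())))

data = np.arange(20, dtype=float)
score, num_models = iterated_k_fold(data, mean_baseline_error, k=4, iterations=3)
print(num_models, score)
```

The cost is the main trade-off: `iterations * k` models are trained and evaluated, so this is practical mostly when the dataset, and therefore each training run, is small.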

> ### Things to keep in mind

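Is the data representative? --> Shuffle before splitting. <br>
Time-series data must never be shuffled; make sure the split is representative in some other way.<br>
Duplicates must also be removed carefully.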

* Data representativeness
  * What if you sort the data according to their classes?
  * *Random shuffling* is usually applied to the data before splitting it.
  
* The arrow of time
  * If you are trying to predict the future given the past, you should *not* randomly shuffle the data before splitting it.
  
* Redundancy in your data
  * If some data points appear twice, they may end up in both the training and validation sets, and the performance will then be overestimated.
  * Make sure your training set and validation set are disjoint.
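
The last two points can be sketched in a few lines. The helper names `chronological_split` and `remove_overlap` are hypothetical, and the samples are assumed to be hashable and already sorted by time.

```python
def chronological_split(samples, validation_fraction=0.2):
    """Split time-ordered samples without shuffling: the past trains, the future validates."""
    split_index = len(samples) - int(len(samples) * validation_fraction)
    return samples[:split_index], samples[split_index:]

def remove_overlap(training_set, validation_set):
    """Drop validation samples that also appear in the training set."""
    seen = set(training_set)
    return [sample for sample in validation_set if sample not in seen]

# Samples are assumed to be sorted by time already.
series = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
train, val = chronological_split(series, validation_fraction=0.2)
print(train, val)

print(remove_overlap(["a", "b", "c"], ["c", "d"]))
```

With a chronological split, every validation sample is more recent than every training sample, which matches how the model will be used to predict the future.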

## Data preprocessing, feature engineering, and feature learning

* How do we prepare the input data and targets before feeding them into a neural network?
* Many data-preprocessing and feature-engineering techniques are domain specific.

> ### Data preprocessing for neural networks

* Vectorization
  * (input, target) --> tensors of floating-point data
  
* Value normalization
  * Normalize each feature independently so that it has a mean of 0 and a standard deviation of 1.
 
* Handling missing values
  * With neural networks, it is generally safe to input missing values as 0, provided that 0 is not already a meaningful value.
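
Value normalization might look like the sketch below. The helper names `fit_normalizer` and `normalize` are assumptions; the key design choice is that the mean and standard deviation are computed on the training data only and then reused for any other split.

```python
import numpy as np

def fit_normalizer(train_x):
    """Compute the per-feature mean and std on the training data only."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    return mean, std

def normalize(x, mean, std):
    """Shift and scale each feature with the given statistics."""
    return (x - mean) / std

train_x = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_x = np.array([[2.0, 250.0]])

mean, std = fit_normalizer(train_x)
train_norm = normalize(train_x, mean, std)
test_norm = normalize(test_x, mean, std)  # reuse the training statistics
print(train_norm.mean(axis=0), train_norm.std(axis=0))
```

Reusing the training statistics keeps information about the test data out of the preprocessing step.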

> ### Feature engineering

Feature engineering is a way of extracting the information needed to perform the task.<br>
Deep learning probably needs little of it.

* The process of using our own knowledge about the data to make the algorithm work better by applying hardcoded (nonlearned) transformations to the data.

* Reading the time on a clock

     <img src="https://drive.google.com/uc?id=15Lw3owKBH2UgnwEUz-o-8AX2J9L2-V6H" width="500">

* Before deep learning, feature engineering used to be critical.
  * Because classical shallow algorithms did not have hypothesis spaces rich enough to learn useful features by themselves.
  * E.g., MNIST --> the number of loops, the height of each digit, a histogram of pixel values, etc.
  
* Modern deep learning removes the need for most feature engineering.
  * Because neural networks are capable of automatically extracting useful features from raw data.
  
* However, this is still important for two reasons:
  * Good features allow us to solve problems more elegantly while using fewer resources.
  * Good features let us solve a problem with far less data.

Even so, features still matter: if the right features can be identified directly, there may be no need to resort to deep learning.
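
One way to picture the clock example: suppose that, instead of raw pixels, we are given the (x, y) tip coordinates of each hand relative to the clock center. That input format is an assumption of this sketch, as are the helper names `hand_angle` and `read_clock`. With such features, reading the time needs no learning at all.

```python
import math

def hand_angle(x, y):
    """Clockwise angle in degrees from 12 o'clock for a hand tip at (x, y).

    Coordinates are relative to the clock center, with y pointing up.
    """
    return math.degrees(math.atan2(x, y)) % 360

def read_clock(hour_tip, minute_tip):
    """Turn two hand-tip coordinates into an (hour, minute) reading."""
    minute = round(hand_angle(*minute_tip) / 6) % 60  # 6 degrees per minute
    hour = int(hand_angle(*hour_tip) // 30) % 12      # 30 degrees per hour
    return (12 if hour == 0 else hour), minute

# At 3:30 the hour hand sits at 105 degrees and the minute hand points at 6.
hour_angle = math.radians(105)
hour_tip = (math.sin(hour_angle), math.cos(hour_angle))
minute_tip = (0.0, -1.0)
print(read_clock(hour_tip, minute_tip))
```

This is the sense in which a good feature makes the problem easier: a simple rounding rule replaces a model trained on pixels.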

## Overfitting and underfitting

* The fundamental issue in machine learning is the tension between optimization and generalization.
  * *Optimization* refers to the process of adjusting a model to get the best performance on the training data.
  * *Generalization* refers to how well the trained model performs on data it has never seen before.
  * The goal is to get good generalization, but we can only adjust the model based on the training data.
  
* At the beginning of training, optimization and generalization are correlated.
  * The lower the loss on training data, the lower the loss on test data.
  * While this is happening, the model is said to be *underfit*.
  
* After a certain number of iterations, generalization stops improving.
  * The model is starting to *overfit*.
  
* To prevent overfitting, the best solution is to get *more training data*.

* When that isn't possible, the next-best solution is to either:
  * modulate the quantity of information that the model is allowed to store, or
  * add constraints on what information it is allowed to store.

* The process of fighting overfitting this way is called *regularization*.
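
The transition from underfitting to overfitting can be located from the loss curves. In the sketch below, the loss values are made up for illustration (they are not measured results), and `best_epoch` is a hypothetical helper.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i]) + 1

# Made-up loss curves for illustration only.
train_losses = [0.60, 0.45, 0.35, 0.28, 0.22, 0.17, 0.13]
val_losses = [0.62, 0.50, 0.43, 0.41, 0.42, 0.45, 0.49]

epoch = best_epoch(val_losses)
print(f"validation loss is lowest at epoch {epoch}")
```

Up to that epoch both losses fall together (underfitting); afterwards the training loss keeps falling while the validation loss rises (overfitting).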

> ### Reducing the network's size

One way to prevent overfitting is to reduce the model's complexity.

* The simplest way to prevent overfitting is to reduce the size of the model: the number of learnable parameters in the model.

* In deep learning, the number of learnable parameters in a model is often referred to as the model's *capacity*.

* There is a compromise to be found between *too much capacity* and *not enough capacity*.

* Unfortunately, there is no magical formula to determine the right number of layers or the right size for each layer.
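
To make "capacity" concrete, the sketch below counts the learnable parameters of a stack of fully connected layers. It uses a 10,000-dimensional input with hidden sizes 4, 16, and 512, the sizes used by the movie-review networks in this section. The helper `dense_param_count` is a hypothetical name.

```python
def dense_param_count(input_dim, layer_sizes):
    """Count the weights and biases of a stack of fully connected layers."""
    total = 0
    previous = input_dim
    for units in layer_sizes:
        total += previous * units + units  # weight matrix + bias vector
        previous = units
    return total

for name, sizes in [("smaller", [4, 4, 1]),
                    ("original", [16, 16, 1]),
                    ("bigger", [512, 512, 1])]:
    print(name, dense_param_count(10000, sizes))
```

Almost all of the parameters sit in the first layer, because the input has 10,000 dimensions.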

* Let's revisit the movie-review classification network.
  * The original model
  
    ```python
      from tensorflow.keras import models 
      from tensorflow.keras import layers

      model = models.Sequential() 
      model.add(layers.Dense(16, activation='relu', input_shape=(10000,))) 
      model.add(layers.Dense(16, activation='relu')) 
      model.add(layers.Dense(1, activation='sigmoid'))
    ```
  
  * Smaller network (low capacity)
  
    ```python
      model = models.Sequential() 
      model.add(layers.Dense(4, activation='relu', input_shape=(10000,))) 
      model.add(layers.Dense(4, activation='relu')) 
      model.add(layers.Dense(1, activation='sigmoid'))
    ```
  
  * A comparison of the validation losses of the original network and the smaller network
  
    <img src="https://drive.google.com/uc?id=15O-efTcA7xWc77HuvniLf9zNGgdLAXKL" width="500">
  
  * Bigger model (high capacity)
  
    ```python
      model = models.Sequential() 
      model.add(layers.Dense(512, activation='relu', input_shape=(10000,))) 
      model.add(layers.Dense(512, activation='relu')) 
      model.add(layers.Dense(1, activation='sigmoid'))
    ```
  
  * A comparison between the original network and the bigger network
  
    <img src="https://drive.google.com/uc?id=15OWpylDoU1zY4dAbrsWUUwXHP_NHFTuM" width="500">

    <img src="https://drive.google.com/uc?id=15eF8nKy2OLIKPTukUSeevTRbeOG_BEUB" width="500">
  

> ### Adding weight regularization

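Overfitting can be prevented by reducing the number of model parameters, but regularization can achieve this as well.<br>
Basically, when the parameters (the coefficients) are very large, the fitted curve tends to overfit.<br>
So constraining those coefficients can prevent overfitting.<br>
L1 regularization keeps the absolute values of the coefficients small;<br>
L2 regularization keeps the sum of their squares small.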

* The principle of *Occam's razor*
  * Given two explanations for something, the explanation most likely to be correct is the simplest one - the one that makes fewer assumptions.
  
* A *simple model* in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters).

* A common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more *regular*.

* This is called *weight regularization*.
  * It is done by adding to the loss function of the network a *cost* associated with having large weights.
  * *L1 regularization*
    * The cost added is proportional to the absolute value of the weight coefficients.
    * The *L1 norm* of the weights
  * *L2 regularization*
    * The cost added is proportional to the square of the value of the weight coefficients.
    * The squared *L2 norm* of the weights
    * It is also called *weight decay* in the context of neural networks.
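
The two penalties can be written out directly in NumPy. This is a sketch with made-up numbers: the weight matrix, the `data_loss` value, and the helper names `l1_penalty` and `l2_penalty` are all assumptions for illustration.

```python
import numpy as np

def l1_penalty(weights, coefficient):
    """Cost proportional to the sum of absolute weight values."""
    return coefficient * np.sum(np.abs(weights))

def l2_penalty(weights, coefficient):
    """Cost proportional to the sum of squared weight values."""
    return coefficient * np.sum(np.square(weights))

weights = np.array([[0.5, -2.0], [1.0, 0.0]])
data_loss = 0.30  # made-up loss value for illustration

total_l1 = data_loss + l1_penalty(weights, 0.001)
total_l2 = data_loss + l2_penalty(weights, 0.001)
print(total_l1, total_l2)
```

The large weight (-2.0) dominates the L2 penalty because it is squared, which is why L2 regularization pushes hardest against the largest weights.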
    
* L2 weight regularization in Keras

  ```python
    from tensorflow.keras import regularizers

    model = models.Sequential() 
    model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001), activation='relu', input_shape=(10000,))) 
    # 0.001 is the coefficient that multiplies the L2 penalty (the sum of squared weights) added to the loss.
    model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001), activation='relu')) 
    model.add(layers.Dense(1, activation='sigmoid'))
  ```

  * The impact of the L2 regularization
  
    <img src="https://drive.google.com/uc?id=15hDvTnLTCtd-jl6p4k4VQwYLWLEw-E8Y" width="500">
  
  

> ### Adding dropout

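Dropout is a regularization technique specific to neural networks.<br>Naturally, it is used to prevent overfitting.<br>
During training, hidden nodes are randomly switched off and on.<br>
The probability that each node is switched off is a hyperparameter.<br>
The important point is that at every iteration a different random subset of nodes survives.<br>
Also important: dropout is turned off at test time, because leaving it on would give a different result every time the model makes a prediction. <br>
Simply removing dropout at test time without any correction would make the activations larger than those seen during training, so the test-time outputs are multiplied by the expected keep probability.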

* *Dropout* is one of the most effective and most commonly used regularization techniques for neural networks.

* It consists of randomly dropping out (setting to zero) a number of output features of the layer during training.
  * E.g., [0.2, 0.5, 1.3, 0.8, 1.1] --> (dropout) --> [0, 0.5, 1.3, 0, 1.1]
  
* The *dropout rate* is the fraction of the features that are zeroed out.

* At test time, no units are dropped out.
  * Instead, the layer's output values are scaled down by a factor equal to *(1-the dropout rate)* to balance for the fact that more units are active than at training time.
  ><img src="https://drive.google.com/uc?id=1nfP0HxqbBcW-isMbCDuDkD-Bu_x7w65_" width="800">
  
* Implementation using Numpy

  ```python
    # At training time, we zero out 50% of activations.
    layer_output *= np.random.randint(0, high=2, size=layer_output.shape)

    # At test time, we scale down the output.
    layer_output *= 0.5
  ```

* Another implementation (in practice)

  ```python
    # At training time
    layer_output *= np.random.randint(0, high=2, size=layer_output.shape) 
    layer_output /= 0.5
  ```
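
For a general dropout rate, the same idea can be sketched as a small runnable function. `dropout_train` is a hypothetical helper, and the all-ones activations are toy data chosen so that the effect on the mean is easy to see.

```python
import numpy as np

def dropout_train(layer_output, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    keep_mask = rng.random(layer_output.shape) >= rate
    return layer_output * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
layer_output = np.ones((1000, 100))

dropped = dropout_train(layer_output, rate=0.5, rng=rng)
print("fraction zeroed:", np.mean(dropped == 0))
print("mean activation:", dropped.mean())  # stays close to the original mean of 1.0
```

Because the surviving activations are scaled up during training, the layer output can be left unchanged at test time.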

* In TensorFlow (Keras), dropout is added as a layer:

  ```python
    model.add(layers.Dropout(0.5))
  ```

* Adding dropout to the IMDB network

  ```python
    model = models.Sequential() 
    model.add(layers.Dense(16, activation='relu', input_shape=(10000,))) 
    model.add(layers.Dropout(0.5)) 
    model.add(layers.Dense(16, activation='relu')) 
    model.add(layers.Dropout(0.5)) 
    model.add(layers.Dense(1, activation='sigmoid'))
  ```

><img src="https://drive.google.com/uc?id=15i6-m5KQvETAyj2BOjMoAZvO9jFqKYuT" width="500">



## The universal workflow of machine learning

> ### Defining the problem and assembling a dataset

* First, we must define the problem at hand:
  * What will your input data be?
  * What are you trying to predict?
  * What type of problem are you facing? 

* The hypotheses you make at this stage:
  * The outputs can be predicted given the inputs.
  * The available data is sufficiently informative to learn the relationship between inputs and outputs.
  
* Not all problems can be solved: a dataset (X, Y) doesn't mean X contains enough information to predict Y.

* Keep in mind that machine learning can only be used to learn patterns that are present in the training data.
  * For instance, using machine learning trained on past data to predict the future is making the assumption that the future will behave like the past.

> ### Deciding on an evaluation protocol

* Decide how you will measure your current progress:
  * Maintaining a hold-out validation set
  * Doing k-fold cross validation
  * Doing iterated k-fold validation
 

> ### Preparing the data

* Once you know what you’re training on, what you’re optimizing for, and how to evaluate your approach, you’re almost ready to begin training models.

* Formatting the data
  * The data should be formatted as tensors.
  * The values taken by these tensors should be scaled to small values.
  * If different features take values in different ranges, then the data should be normalized.
  * Some feature engineering may be needed, especially for small-data problems.

The data should be normalized to values near 0.

> ### Developing a model that does better than a baseline

* Developing a small model that is capable of beating a dumb baseline

* Three key choices to build the network:
  * Last-layer activation
  * Loss function
  * Optimization configuration
  
><img src="https://drive.google.com/uc?id=15lpFAc2T95-g4GiugmRdYnuUVK_24c9H" width="700">

Which loss function and last-layer activation to use is mostly determined by the type of task.
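
A "dumb baseline" for classification can be as simple as always predicting the most frequent class. The sketch below computes its accuracy; the labels are made up, and `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority_label)
    return hits / len(eval_labels)

train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
val_labels = [0, 1, 0, 0]
baseline = majority_baseline_accuracy(train_labels, val_labels)
print(f"baseline accuracy: {baseline:.2f}")
```

A model has statistical power only if its validation accuracy clearly exceeds this number; on imbalanced data the baseline can be surprisingly high.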

> ### Scaling up: developing a model that overfits

Underfitting is far more critical; it is much better to overfit first and then scale the model back.

* Once you’ve obtained a model that has statistical power, the question becomes, is your model sufficiently powerful?

* Developing a model that overfits:
  * Add more layers
  * Make the layers bigger
  * Train for more epochs
  
* Always monitor the training loss and validation loss, as well as the training and validation values for any metrics you care about.

> ### Regularizing the model and tuning the hyperparameters

* Repeatedly modify the model, train it, evaluate on the validation data, again and again.

* We can try:
  * Add dropout
  * Try different architectures
  * Add regularization terms
  * Try different hyperparameters
  * Optionally, iterate on feature engineering
  
* Keep in mind that every time you use feedback from your validation process to tune your model, you leak information about the validation process into the model.

* Once you’ve developed a satisfactory model configuration, you can train your final production model on all the available data (training and validation) and evaluate it one last time on the test set.