#### Types of Machine Learning

- Based on the type of training (with or without human supervision)
   - supervised, unsupervised, semisupervised, and reinforcement learning
- Whether or not they can learn incrementally on the fly
   - online vs batch learning
- Whether they work by comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model
   - instance-based vs model-based learning

These criteria are not exclusive; you can combine them in any way you like. For example, a state-of-the-art
spam filter may learn on the fly using a deep neural network model trained using examples of spam and
ham; this makes it an online, model-based, supervised learning system.

#### Supervised Learning

In supervised learning, the training data includes the desired solutions (labels).

Types of task:

- Classification:
  - predicting a class, such as whether an email is spam or ham
- Regression:
  - predicting a *target* numeric value, such as the price of a house, given a set of *features* (size, accessories, garden size, etc.) called *predictors*
  - some regression algorithms can be used for classification
    - *Logistic Regression* is often used for classification, as it can output a value that corresponds to the probability of belonging to a given class (e.g., 20% chance of being spam)
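
To make the Logistic Regression remark concrete, here is a minimal sketch of how a weighted sum of features can be turned into a class probability. The feature meanings, the `weights`, and the `bias` are hypothetical, hand-picked values used only for illustration; a real model would learn them from labeled examples.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def spam_probability(features, weights, bias):
    """Weighted sum of the features, passed through the logistic function."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical parameters (assumed for this sketch, not learned from data).
weights = [1.2, 0.8]   # e.g. number of suspicious words, number of links
bias = -3.0

p = spam_probability([1, 0], weights, bias)
label = "spam" if p >= 0.5 else "ham"   # threshold the probability to classify
print(f"P(spam) = {p:.2f} -> {label}")
```

The regression output is a probability; the classification comes from thresholding it (0.5 here, which is itself a choice).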

*Most important algorithms*

- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- Neural Networks (can also be semisupervised or unsupervised)


#### Unsupervised Learning

The training data is unlabeled; the system tries to learn without a teacher.

*Most important algorithms*

- Clustering
  - k-Means
  - Hierarchical Cluster Analysis (HCA)
  - Expectation Maximization
- Visualization and dimensionality reduction
  - Principal Component Analysis (PCA)
  - Kernel PCA
  - Locally-Linear Embedding (LLE)
  - t-distributed Stochastic Neighbor Embedding (t-SNE)
- Association rule learning
  - Apriori
  - Eclat
  
In ML, an *attribute* is a data type (e.g., 'Car Mileage'), while a *feature* usually means an attribute plus its value (e.g., 'Car Mileage = 15,000 km').

**Dimensionality** **reduction** - the goal is to simplify the data without losing too much information.
  - for example, merge several correlated features into one (a car's mileage may be strongly correlated with its age, so a dimensionality reduction algorithm will merge them into one feature that represents the car's wear and tear). This is called *feature* *extraction*.
  
**Anomaly** **Detection**
  - detecting unusual credit card transactions, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm

**Association** **rule** **learning**
  - sift through large amounts of data and discover interesting relations between attributes
    - for example, running an association rule algorithm on your sales logs may reveal that people who purchase barbecue sauce and potato chips also tend to buy steak, so you may want to place these items close together
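
As a rough illustration of the barbecue sauce example, the sketch below computes two quantities commonly used to rank association rules: *support* (how often the items occur together) and *confidence* (how often the rule holds when its left-hand side is present). The `transactions` list is made-up data, and this is a brute-force calculation for a single rule, not an implementation of Apriori or Eclat.

```python
# Made-up sales log: each transaction is the set of items in one purchase.
transactions = [
    {"barbecue sauce", "potato chips", "steak"},
    {"barbecue sauce", "potato chips", "steak", "soda"},
    {"barbecue sauce", "potato chips"},
    {"potato chips", "soda"},
    {"steak", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) = support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

rule_from = {"barbecue sauce", "potato chips"}
rule_to = {"steak"}
print("support:   ", support(rule_from | rule_to, transactions))
print("confidence:", confidence(rule_from, rule_to, transactions))
```

Note that `confidence` divides by the support of the antecedent, so it assumes the antecedent appears in at least one transaction.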
    
#### Semisupervised learning

#### Reinforcement Learning





### Batch and Online Learning

#### Batch learning
- The system is trained using all available data
  - this takes a lot of computing resources
    - so it is typically done offline
  - after training, the system is put into production and runs without learning anymore; it just applies what it has learned
  - this is called **Offline** **Learning**
    - to take new data into account, a new version must be trained from scratch on the full dataset (new and old data)
    - then the old system is stopped and replaced with the new one
      - this process can be automated


#### Online(Incremental) Learning

<img src= 'notes_img\online_learning.png'>

- training is done incrementally by feeding the system data instances sequentially
  - either individually, or in small groups called *mini-batches*
- great for systems that receive data as a continuous flow and need to adapt to change rapidly and/or autonomously
- also works well if you have limited computing resources
  - once an online learning system has learned from new data instances, it can discard them
    - it is still a good idea to save them, to be able to roll back to a previous state and 'replay' the data
- *out-of-core* *learning*
  - used when the data cannot fit in one machine's main memory
    - algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data
- *Learning* *rate*
  - tells the algorithm how fast it should adapt to changing data
    - high learning rate > the system adapts rapidly to new data
      - but it also tends to forget old data quickly
    - low learning rate > the system has more inertia
      - it learns more slowly, but is less sensitive to noise in the data or to nonrepresentative data points
  - it is important to monitor an online learning system closely, because bad data will gradually degrade its performance
    - in an extreme case, switch learning off
    - if possible, monitor the input data and react to abnormal data (e.g., using an anomaly detection algorithm)
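
The ideas above (mini-batches, discarding data after use, and the learning rate) can be sketched with a tiny linear model trained by stochastic gradient descent. Everything here is an assumption made for illustration: `stream_batches` is a hypothetical data source that simulates a continuous flow, and `partial_fit` is a hand-written update step, not a library function.

```python
import random

def stream_batches(n_batches, batch_size, rng):
    """Hypothetical data source: yields mini-batches of (x, y), y close to 3x + 2."""
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            x = rng.uniform(-1.0, 1.0)
            y = 3.0 * x + 2.0 + rng.gauss(0.0, 0.1)   # small amount of noise
            batch.append((x, y))
        yield batch

def partial_fit(w, b, batch, learning_rate):
    """One gradient step on the mean squared error of a single mini-batch."""
    grad_w = grad_b = 0.0
    for x, y in batch:
        error = (w * x + b) - y
        grad_w += 2.0 * error * x
        grad_b += 2.0 * error
    n = len(batch)
    return w - learning_rate * grad_w / n, b - learning_rate * grad_b / n

rng = random.Random(0)
w, b = 0.0, 0.0
for batch in stream_batches(n_batches=500, batch_size=10, rng=rng):
    # Each mini-batch is used for one update and then dropped, so memory use
    # stays constant no matter how much data flows through.
    w, b = partial_fit(w, b, batch, learning_rate=0.1)

print(f"w = {w:.2f}, b = {b:.2f}")
```

A larger `learning_rate` makes each new batch pull the parameters harder (faster adaptation, more sensitivity to noise); a smaller one gives the model more inertia. The same loop structure applies to out-of-core learning if the batches are read from disk in chunks instead of generated.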


<img src= 'notes_img\out_of_core.png'>  


 
    





### Instance-based vs Model-based Learning

#### Instance-based
- based on a measure of similarity
  - the system learns the examples by heart, then generalizes to new cases using a similarity measure
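
A minimal sketch of this idea is a k-nearest-neighbors classifier: it stores the training examples as they are and labels a new point by a majority vote among the most similar stored examples. The two-dimensional `training_set` below is made-up data, and Euclidean distance is just one possible choice of similarity measure.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Smaller distance means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(training_set, new_point, k=3):
    """Label new_point by majority vote among its k closest stored examples."""
    neighbors = sorted(
        training_set, key=lambda example: euclidean_distance(example[0], new_point)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# "Learning by heart": the model is simply the stored examples (made-up data).
training_set = [
    ((1.0, 1.0), "ham"),
    ((1.5, 2.0), "ham"),
    ((2.0, 1.5), "ham"),
    ((6.0, 6.5), "spam"),
    ((7.0, 6.0), "spam"),
    ((6.5, 7.5), "spam"),
]

print(knn_predict(training_set, (6.2, 6.8)))
```

There is no training step at all here; all the work happens at prediction time, when the new point is compared against every stored example.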

<img src= 'notes_img\instance_based.png'>    


#### Model-based learning
- study the data
- select a model
    - train it on the training data (the learning algorithm searches for the model parameter values that minimize a cost function)
- apply the model to make predictions on new cases (*inference*)
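
The steps above can be sketched with the simplest possible model, a straight line `y = theta0 + theta1 * x`. The data points are made up, and the closed-form least-squares solution is used here in place of an iterative learning algorithm; the cost function being minimized is the mean squared error.

```python
def fit_linear(xs, ys):
    """Closed-form least squares for y = theta0 + theta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

def predict(theta0, theta1, x):
    """Inference: apply the trained model to a new case."""
    return theta0 + theta1 * x

def mse(theta0, theta1, xs, ys):
    """Cost function: mean squared error of the model on the given data."""
    return sum((predict(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up training data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

theta0, theta1 = fit_linear(xs, ys)          # training
print("parameters:", theta0, theta1)
print("training cost:", mse(theta0, theta1, xs, ys))
print("prediction for x=6:", predict(theta0, theta1, 6.0))   # inference
```

Unlike the instance-based approach, the training data can be thrown away after fitting; only the two parameters are needed to make predictions.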

<img src= 'notes_img\model_based.png'>   






### Main Challenges of Machine Learning

#### Insufficient Quantity of Training Data
- even for very simple problems, you typically need thousands of examples
- for complex problems (image or speech recognition), you may need millions of examples (unless you can reuse parts of an existing model)

#### Nonrepresentative Training Data
- in order to generalize well, it is crucial that the training data be representative of the new cases you want to generalize to
  - true for both instance-based and model-based learning
- **Sampling** **noise**
  - nonrepresentative data as a result of chance (if the sample is too small)
- **Sampling** **bias**
  - even large samples can be nonrepresentative if the sampling method is flawed
#### Poor Quality Data
- data full of outliers, errors, and noise makes it harder for the system to detect the underlying patterns
  - it is often well worth the time to clean up your dataset
- if some instances are clearly outliers, it may help to simply discard them or to try to fix the errors manually
- if some instances are missing a few features (e.g., 5% of your customers did not specify their age), you can:
  - ignore this attribute altogether
  - ignore these instances
  - fill in the missing values (e.g., with the median age)
  - or train one model with the feature and one model without it
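
Two of the options above, dropping incomplete instances and filling in the median, can be sketched in a few lines. The `ages` list is made-up data in which `None` marks a customer who did not specify an age.

```python
from statistics import median

# Made-up feature column; None marks a missing value.
ages = [34, None, 27, 45, None, 51, 38]

# Option: ignore the instances with a missing age.
known_ages = [a for a in ages if a is not None]

# Option: fill in the missing values with the median of the known ages.
median_age = median(known_ages)
imputed_ages = [median_age if a is None else a for a in ages]

print("kept instances:", known_ages)
print("median age:", median_age)
print("imputed column:", imputed_ages)
```

If the data is later split into training and test sets, a reasonable practice is to compute the median on the training set only and reuse that same value for any other data.
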
#### Irrelevant features
**Garbage** **In** **Garbage** **Out**

- Feature engineering
  - the process of coming up with a good set of features to train on
  - Feature selection
    - selecting the most useful features to train on among the existing features
  - Feature extraction
    - combining existing features to produce a more useful one (a dimensionality reduction algorithm can help)
  - Creating new features by gathering data

#### Overfitting the Training Data 
- i.e. overgeneralizing
- it means that the model performs well on the training data but does not generalize well to new cases

- a complex model can detect subtle patterns in the data, but if the training set is too noisy or too small (which introduces sampling noise), the model is likely to detect patterns in the noise itself

- **Solutions**
  - simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model
    - constraining a model to make it simpler is called **Regularization**
      - the amount of regularization can be controlled by a *hyperparameter*
        - a hyperparameter is a parameter of the learning algorithm (not of the model); it is set before training and stays constant during training
  - gather more training data
  - reduce noise in the training data (e.g. fix data errors and remove outliers)
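
To illustrate how a regularization hyperparameter constrains a model, here is a sketch of a one-feature linear model with an L2 (ridge-style) penalty on the slope. The data is made up, the name `alpha` for the hyperparameter is an assumption of this sketch, and the intercept is left unpenalized.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Fit y = theta0 + theta1 * x while penalizing alpha * theta1**2.

    alpha is the regularization hyperparameter: it is chosen before training
    and is not one of the learned parameters (theta0, theta1).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / (sxx + alpha)   # alpha = 0 gives ordinary least squares
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

# Made-up training data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slopes = []
for alpha in (0.0, 1.0, 10.0, 100.0):
    theta0, theta1 = fit_ridge_1d(xs, ys, alpha)
    slopes.append(theta1)
    print(f"alpha={alpha:>5}: theta0={theta0:.2f}, theta1={theta1:.2f}")
```

As `alpha` grows, the slope is pushed toward zero and the model approaches a flat line through the mean: a simpler model that fits the training data less closely. Too little regularization risks overfitting, too much risks underfitting.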

#### Underfitting the Training Data
- occurs when the model is too simple to learn the underlying structure of the data
- Solutions:
  - Selecting a more powerful model, with more parameters
  - Feeding better features to the learning algorithm (feature engineering)
  - Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)

#### Testing and Validating

- split the data into a *training* set and a *test* set
  - the error rate on new cases is called the *generalization* *error*; by evaluating your model on the test set, you get an estimate of this error
  - it is common to use 80% of the data for training and *hold* *out* 20% for testing (or 70/30)
- a common pitfall is to measure the generalization error on the test set many times, adapting the model and hyperparameters to that particular set each time (which makes the model unlikely to perform as well on new data)
  - a common solution is to have a second hold-out set (the *validation* *set*)
    - train multiple models with various hyperparameters on the training set
    - select the model that performs best on the validation set
    - run a single final test on the test set to get an estimate of the generalization error
- to avoid wasting too much training data in validation sets, a technique called **cross-validation** is often used
  - the training set is split into complementary subsets, and each model is trained on a different combination of these subsets and validated on the remaining parts
  - once a model type and hyperparameters have been selected, a final model is trained using these hyperparameters on the full training set, and the generalization error is measured on the test set
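
The cross-validation procedure can be sketched as follows. The helper names (`k_fold_indices`, `cross_validate`) and the deliberately trivial "predict the mean" model are assumptions made for this illustration; any `fit` and `score` functions with the same shape could be plugged in.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, validate on the remaining fold, average the k scores."""
    folds = k_fold_indices(len(xs), k)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(score(model, [xs[j] for j in val_idx], [ys[j] for j in val_idx]))
    return sum(scores) / k

# Deliberately trivial model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)

# Made-up data set.
xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]

print("mean validation MSE:", cross_validate(xs, ys, k=5, fit=fit_mean, score=mse))
```

Every instance is used for validation exactly once and for training `k - 1` times, so no data is permanently set aside for validation. The test set should still be kept out of this loop entirely and used only for the final estimate.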
  
#### No Free Lunch Theorem

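- if you make absolutely no assumptions about the data, then there is no reason to prefer one model over any other
  - the only way to know for sure which model is best would be to evaluate them all, which is not possible in practice due to time constraints
    - it is necessary to make some reasonable assumptions about the data and make an educated guess about which model to start with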