Most machine learning problems deal with _predicting a target value_.

These problems share two characteristics:
* A well-defined target of interest: a binary class (yes or no), one of several classes, or a numerical value (e.g., a predicted real estate price).
* Historical data in which the target is known, which can be used for training (_supervised learning_).

The role of the ML algorithm is to use the training set to determine how the set of input features can predict the target variable. The result of this 'learning' is encoded in a machine learning model.

#### Terms
* target = response = dependent variables
* input features = explanatory = independent variables

### Data Collection

#### Feature Selection
As a rule of thumb, include a feature if you suspect it has an effect on the target. Conversely, exclude features that have nothing to do with the target (e.g., a user's unique identification number) to minimize noise. Be careful when excluding features, though: ML algorithms can detect even subtle relationships between features and the target. At the same time, throwing in every available feature can reduce the accuracy of the model, because the model may be unable to distinguish the important features from the noise.

Steps for feature selection:
1. Include all the features that you **suspect** to be predictive of the target variable. Fit an ML model. If the accuracy of the model is sufficient, stop.
2. Otherwise, expand the feature set by including other features that are less obviously related to the target. Fit another model and assess the accuracy. If performance is sufficient, stop.
3. Otherwise, starting from the expanded feature set, run an _ML feature selection algorithm_ to choose the best, most predictive subset of your expanded feature set.
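
Step 3 can be sketched as a greedy forward selection, which is only one of several possible feature selection algorithms. This is a minimal illustration: `forward_select`, `toy_score`, and the feature names are hypothetical, and the `score` callback stands in for whatever "fit a model and assess the accuracy" procedure you actually use.

```python
def forward_select(candidates, score, min_gain=1e-6):
    """Greedy forward selection (illustrative sketch).

    candidates: iterable of feature names.
    score: hypothetical callback that takes a list of feature names and
           returns a validation accuracy (higher is better).
    """
    selected = []
    best = float("-inf")
    remaining = list(candidates)
    while remaining:
        # Score each remaining feature when added to the current subset.
        trials = [(score(selected + [f]), f) for f in remaining]
        trial_score, feature = max(trials)
        if trial_score - best < min_gain:
            break  # no remaining feature improves the score enough
        selected.append(feature)
        remaining.remove(feature)
        best = trial_score
    return selected, best


# Toy scorer standing in for "fit a model and measure accuracy":
# two features carry signal, the identifier only adds noise.
def toy_score(features):
    gains = {"age": 0.20, "fare": 0.10, "passenger_id": -0.05}
    return 0.5 + sum(gains[f] for f in features)


selected, accuracy = forward_select(["passenger_id", "age", "fare"], toy_score)
# selected is ['age', 'fare']; the identifier is never added
```

With a real model, each call to `score` means fitting and evaluating a model, so the cost grows quickly with the number of candidate features.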

#### Amount of Training Data
These factors determine the amount of training data needed:

* The complexity of the problem. Does the relationship between the input features and target variable follow a simple pattern, or is it complex and nonlinear?
* The requirements for accuracy. If you require only a 60% success rate for your problem, less training data is required than if you need to achieve a 95% success rate.
* The dimensionality of the feature space. If only two input features are available, less training data will be required than if there were 2,000 features.

#### Training Data Representation
Training data must be representative of the data the model will see in practice so that the model can make accurate predictions.

A training sample could be nonrepresentative for several reasons:
* It was possible to obtain ground truth for the target variable for only a certain, biased subsample of data.
* The properties of the instances have changed over time. In fraud detection, for example, policy changes over time can alter user behavior.
* The input feature set has changed over time. For example, earlier data collection included ZIP code and state, but these have since been replaced by IP address.

### Data Preprocessing

Some features may be in an unsuitable format and some values may be missing. The data may need preprocessing to fix these issues.

#### Categorical Features
A feature is categorical if values can be placed in buckets and the order of values isn’t important (e.g., marital status).

Some machine-learning algorithms use categorical features natively, but most need data in numerical form. A categorical feature can be converted to numerical form by creating one column for each category value. Each column holds a 1 if the instance belongs to that category and a 0 otherwise.

| Gender        | Female           | Male  |
| ------------- |-------------| -----|
| Female      | 1 | 0 |
| Male      | 0      |   1 |



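```python
import numpy as np

def cat_to_num(data):
    """Convert a list of category values into one 0/1 list per category."""
    categories = np.unique(data)  # sorted distinct category values
    features = []
    print(categories)
    for cat in categories:
        # 1 where the instance belongs to this category, 0 otherwise
        features.append([int(c == cat) for c in data])

    return features

cat_to_num(['male', 'female', 'female'])
```

```
['female' 'male']
[[0, 1, 1], [1, 0, 0]]
```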

#### Missing Data

Data can be missing in two ways: the fact that a value is missing may itself be informative for the algorithm, or the measurement was simply impossible and the absence carries no meaning. When the missingness is informative, missing numerical values can be filled with a value at the end of the spectrum (e.g., -1 or -9999). For categorical data, a new category can be created to indicate the missing value (e.g., Missing, None).

![image.png](attachment:image.png)

The rules are not clear-cut, but the following can serve as a guide for filling in missing data:
* If missing data involves only a few records, consider dropping these records.
* If the data is temporal in nature, consider sorting the records by time and carrying forward the value from the previous record.
* If missing data is numerical, the mean can be used, unless there are outliers, in which case the median can be used.
* If the missing values depend on other features in a complex way, another ML model can be used to predict them.
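
The mean, median, and previous-record guidelines above can be sketched with the standard library. This is a minimal illustration that assumes missing values are represented as `None`; the helper names `impute_numeric` and `forward_fill` and the sample ages are made up for the example.

```python
from statistics import mean, median

def impute_numeric(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Replace None entries with the most recent observed value.

    Assumes the records are already sorted in time order.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None until the first observed value
    return filled

ages = [22, None, 30, 28, 250]          # 250 is an outlier
by_mean = impute_numeric(ages, "mean")      # missing value becomes 82.5
by_median = impute_numeric(ages, "median")  # missing value becomes 29.0
carried = forward_fill([1.0, None, None, 2.5])  # [1.0, 1.0, 1.0, 2.5]
```

The outlier pulls the mean far above every typical age, while the median stays close to the bulk of the data, which is why the median is the safer fill value here.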

#### Feature Engineering
Feature engineering means using the existing features to create new features that increase the value of the original data, by applying our knowledge of the data or the domain in question.

In the Titanic example, the raw cabin numbers add little value to the ML model because the feature has a large number of distinct values. The cabin number becomes useful if it is used to determine the section of the ship in which the passenger was staying.
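
One way to derive the ship section is sketched below. It assumes that cabin values look like `C85`, with the leading letter identifying the deck; that format, the helper name `cabin_to_deck`, and the sample values are assumptions made for illustration.

```python
def cabin_to_deck(cabin):
    """Map a cabin string such as 'C85' to its deck letter.

    Assumes the first character of the cabin value identifies the deck.
    Missing or empty values map to the separate category 'Unknown'.
    """
    cabin = (cabin or "").strip()
    if not cabin:
        return "Unknown"
    return cabin[0].upper()

decks = [cabin_to_deck(c) for c in ["C85", "E46", None, "B57 B59"]]
# decks is ['C', 'E', 'Unknown', 'B']
```

The result is a categorical feature with only a handful of values, which can then be converted to numerical form like any other categorical feature.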

#### Data Normalization

Normalization rescales numeric data to lie within a fixed range. This matters because, for some algorithms, a feature that ranges from 1 to 10 can have more influence than a feature that ranges from 1 to 2 simply because of its scale. Data are usually normalized to the range 0 to 1 or -1 to 1.

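```python
def normalize_feature(data, f_min=-1.0, f_max=1.0):
    """Linearly rescale data to the range [f_min, f_max]."""
    d_min, d_max = min(data), max(data)
    if d_max == d_min:
        # A constant feature has no range to rescale
        raise ValueError("cannot normalize a constant feature")
    # Scaling factor that maps the data range onto the target range
    factor = (f_max - f_min) / (d_max - d_min)
    normalized = [(f_min + (d - d_min) * factor) for d in data]
    return normalized, factor

normalize_feature([1, 0, 10])
```

```
([-0.8, -1.0, 1.0], 0.2)
```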

### Data Visualization

Data visualization can be used to discover relationships between the features and the target.

![image.png](attachment:image.png)

#### Mosaic Plots

Mosaic plots allow you to visualize the relationship between two or more categorical variables.

![image.png](attachment:image.png)

If a strong relationship exists, the horizontal splits will be far apart.
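
The split positions in a mosaic plot correspond to the conditional proportions in a cross-tabulation of the two variables. The sketch below computes those proportions; the function name `row_proportions` is hypothetical and the data are made-up toy values, not the actual Titanic figures.

```python
from collections import Counter

def row_proportions(xs, ys):
    """For each value of xs, the share of each value of ys (rows sum to 1)."""
    pair_counts = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_values = sorted(set(ys))
    return {
        x: {y: pair_counts[(x, y)] / x_counts[x] for y in y_values}
        for x in sorted(x_counts)
    }

# Made-up toy data for illustration only.
gender = ["female", "female", "female", "male", "male", "male", "male", "male"]
survived = ["yes", "yes", "no", "no", "no", "no", "yes", "no"]

table = row_proportions(gender, survived)
# table["male"] is {'no': 0.8, 'yes': 0.2}
```

When the proportions differ a lot between rows, as they do in this toy data, the splits in the corresponding mosaic plot sit far apart.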

#### Box Plots

Box plots are used to visualize the distribution of a numerical variable. For a single variable, a box plot depicts the quartiles of its distribution: the minimum, 25th percentile, median, 75th percentile, and maximum of the values. 
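
The five values a box plot depicts can be computed directly. This is a minimal sketch using `statistics.quantiles` from the standard library (Python 3.8+); the `inclusive` method is one of several quartile conventions, so plotting libraries may place the box edges slightly differently.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, 25th percentile, median, 75th percentile, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
# summary is (1, 3.0, 5.0, 7.0, 9)
```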

![image.png](attachment:image.png)

From the image, it's not clear whether the age of the passenger had an effect on the survival rate. It is important to note, however, that an unclear relationship does not mean that the feature should be excluded. ML algorithms should be able to detect whether this feature, in combination with other features, has an effect on the outcome.

Features can be compared by comparing distributions in parallel. 

![image.png](attachment:image.png)

If the distribution of the data is highly skewed, the data can be transformed by taking the square root of the values. As can be seen from the parallel box plots, those who paid higher fares had higher survival rates.
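
A square root transform is short to write. The sketch below assumes non-negative values; the function name `sqrt_transform` and the fare values are made up for illustration.

```python
import math

def sqrt_transform(values):
    """Square-root transform for non-negative, right-skewed data."""
    if any(v < 0 for v in values):
        raise ValueError("square root transform needs non-negative values")
    return [math.sqrt(v) for v in values]

fares = [0.0, 4.0, 9.0, 16.0, 400.0]   # made-up values with one very large fare
roots = sqrt_transform(fares)
# roots is [0.0, 2.0, 3.0, 4.0, 20.0]
```

The largest value is 25 times the next largest before the transform and only 5 times afterwards, so the bulk of the distribution is no longer squeezed into a thin sliver of the plot.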

#### Density Plot

Density plots display the distribution of a single variable in more detail than a box plot. First, a smoothed estimate of the probability distribution of the variable is computed (typically using a technique called kernel smoothing). Next, that distribution is plotted as a curve depicting the values that the variable is likely to have. By creating a separate density plot of the response variable for each category that the input feature takes, you can easily visualize any discrepancies in the values of the response variable across the categories of the input feature.
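
Kernel smoothing can be sketched as a Gaussian kernel density estimate: each sample contributes a small bell curve, and the curves are averaged. This is a minimal illustration, not what any particular plotting library does internally; the bandwidth is hand-picked and the ages are made-up values.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian bumps centered on each sample
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

ages = [22, 25, 26, 30, 31, 35, 58]      # made-up values
density = gaussian_kde(ages, bandwidth=4.0)
curve = [(x, density(x)) for x in range(0, 81, 5)]  # points to plot as a curve
```

A smaller bandwidth gives a bumpier curve that follows individual samples, while a larger one gives a smoother curve that can hide real structure.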

![image.png](attachment:image.png)

#### Scatter Plot

In a scatter plot, the value of the feature is plotted versus the value of the response variable, with each instance represented as a dot. Though simple, scatter plots can reveal both linear and nonlinear relationships between the input and response variables.
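
One reason to look at the plot rather than a single summary number: a linear correlation coefficient can be zero even when the response is completely determined by the feature. The sketch below uses the Pearson correlation coefficient, which is brought in here only for illustration, on made-up data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]      # straight-line relationship
quadratic = [x * x for x in xs]       # nonlinear, symmetric around 0

r_linear = pearson(xs, linear)        # approximately 1.0
r_quadratic = pearson(xs, quadratic)  # 0.0, although y depends entirely on x
```

A scatter plot of the quadratic data shows the U shape immediately, while the correlation coefficient alone suggests no relationship at all.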

![image.png](attachment:image.png)