# K-Nearest Neighbor

Used to classify data points based on their "distance" to known, labelled data: find the K nearest neighbors according to a distance metric and let them vote on the classification.

![k.png](attachment:551aa41d-0cec-436c-9e06-4b57f8ed4c0f.png)

We want K to be small enough that we don't have to reach too far from the point, but big enough that there are enough data points to get a meaningful sample.

It's still supervised learning, because labelling is still needed. 
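A minimal sketch of the voting idea, assuming Euclidean distance as the metric and a simple majority vote. The function name `knn_classify` and the toy data are made up for illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train: list of (features, label) pairs; query: a feature tuple."""
    # Sort the labelled points by Euclidean distance to the query point
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # The k closest points vote; the most common label wins
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters labelled "a" and "b"
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]

prediction = knn_classify(train, (1.1, 1.0), k=3)
print(prediction)
```

Using an odd K with two classes avoids tied votes.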

In [None]:
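import operator

# Assumes movieDict (movieID -> record, with the rating at index 3) and
# computeDistance(a, b) are already defined in an earlier cell.
def getNeighbors(movieID, K):
    # Distance from the target movie to every other movie
    distances = []
    for movie in movieDict:
        if movie != movieID:
            dist = computeDistance(movieDict[movieID], movieDict[movie])
            distances.append((movie, dist))
    # Sort by distance (ascending) and keep the IDs of the K closest movies
    distances.sort(key=operator.itemgetter(1))
    return [movie for movie, _ in distances[:K]]

K = 10
avgRating = 0
neighbors = getNeighbors(1, K)
for neighbor in neighbors:
    avgRating += movieDict[neighbor][3]
    print(str(movieDict[neighbor]) + " " + str(movieDict[neighbor][3]))
# Divide by the number of neighbors actually found (can be fewer than K)
avgRating /= float(len(neighbors))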

### Dealing with real world data

#### Bias / Variance tradeoff
Bias is how far off the mean of our predictions is from the correct value.
Variance is how spread out our predictions are around that mean.

#### Low bias, high variance
![image.png](attachment:1a0fbeeb-0f8a-459a-8d04-79229808c929.png)
#### High bias, low variance
![image.png](attachment:b94c2e55-2de6-4b59-9774-596ca1c268d7.png)
#### High bias, high variance
![image.png](attachment:455d6d6d-5cd7-44f2-81ba-97244cd591e9.png)
#### Ideal distribution
![image.png](attachment:5b9c1572-2f15-436d-b36b-6101da28065d.png)


In reality we often have to choose between having low variance or low bias.

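We can express the error as **Error = Bias² + Variance** (plus any irreducible noise in the data), so what we actually want to minimize is the error, not bias or variance on their own. With KNN, a small K gives low bias but high variance, while a large K gives lower variance but higher bias.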
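A quick numeric check of that formula, as a sketch: for a fixed true value, the mean squared error of a set of predictions equals bias squared plus variance. The prediction values below are made up.

```python
from statistics import fmean, pvariance

true_value = 10.0
# Hypothetical predictions, e.g. from models trained on different samples
predictions = [11.0, 12.5, 9.5, 13.0, 12.0]

# Bias: how far the average prediction is from the true value
bias = fmean(predictions) - true_value
# Variance: how spread out the predictions are around their own mean
variance = pvariance(predictions)
# Mean squared error measured directly against the true value
mse = fmean([(p - true_value) ** 2 for p in predictions])

print(bias ** 2 + variance, mse)
```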


## Data Cleaning and Normalization
Raw data is polluted and will skew the results.
### Things to look out for in data
1. Outliers.
2. Missing data: think about what the right thing to do is when dealing with it.
3. Malicious data: identify it and filter it out.
4. Wrong data: faulty data can skew results.
5. Irrelevant data: data that has no effect on the result or nothing to do with what we actually care about.
6. Inconsistent data: figure out how the same information varies across the dataset and normalize it.
7. Formatting: the same thing can be written differently in different places.


## Garbage in, garbage out

Make sure to have enough data, and high quality data.
The quality of the data often dictates the quality of the results you're going to get.

## Normalizing Numerical Data

It's important to normalize the data before running it through the model so that attributes on different scales carry comparable weight. Otherwise attributes with large values can dominate and introduce variance or bias in the results.
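Two common ways to do this, sketched in plain Python: min-max scaling to the range 0 to 1, and standardizing to zero mean and unit standard deviation. The helper names and the sample columns are assumptions for illustration.

```python
from statistics import fmean, pstdev

def min_max_scale(values):
    # Rescale to [0, 1]; assumes the column is not constant (hi != lo)
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    # Rescale to mean 0 and standard deviation 1; assumes sigma != 0
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical columns on very different scales
ages = [20, 30, 40, 60]
incomes = [20_000, 35_000, 50_000, 120_000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
z_incomes = standardize(incomes)
print(scaled_ages, scaled_incomes)
```

After scaling, a distance metric like the one KNN uses no longer treats income as thousands of times more important than age.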

## Dealing with outliers

Sometimes it's important to remove outliers from the training data, and looking at what gets filtered out can reveal issues with the data we're trying to model. However, depending on the situation, the outliers may be exactly what matters in the analysis.

## Detecting outliers

Standard deviation provides a principled way to classify outliers.

We can choose to filter out data points that lie more than some multiple of the standard deviation away from the mean.
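A small sketch of that filter, assuming a cutoff of two standard deviations. The cutoff, the function name and the income figures are all illustrative choices.

```python
from statistics import fmean, pstdev

def split_outliers(values, n_std=2.0):
    mu, sigma = fmean(values), pstdev(values)
    # Keep points within n_std standard deviations of the mean
    kept = [v for v in values if abs(v - mu) <= n_std * sigma]
    outliers = [v for v in values if abs(v - mu) > n_std * sigma]
    return kept, outliers

# Hypothetical incomes with one extreme value
incomes = [28_000, 31_000, 29_500, 30_500, 27_000,
           32_000, 30_000, 29_000, 31_500, 1_000_000]

kept, outliers = split_outliers(incomes)
print(outliers)
```

Note that the outlier itself inflates the mean and the standard deviation, so on small samples a strict cutoff such as three standard deviations may not flag anything.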

## Feature Engineering

Selecting which features are important to what I'm trying to predict.
Things to keep in mind:
- Which features should I use?
- Do I need to transform the features in some way?
- How do I handle missing data?
- Should I create new features from existing ones?

Too many features can lead to sparse data. Most of feature engineering is identifying and selecting the features most relevant to a given problem.


##### PCA - Principal Component Analysis
##### K-Means Clustering




## Imputing missing data

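A simple solution for missing data: replace the missing values with the mean of that column.
The median can be better when the column has many outliers.
Neither can be used with categorical data.

Probably not the best solution, since it only looks at one column and ignores relationships with the other features.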
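For reference, a sketch of mean and median imputation on a single column, assuming missing values are represented as `None`. The `impute` helper and the sample column are hypothetical.

```python
from statistics import fmean, median

def impute(column, strategy="mean"):
    # Compute the fill value from the known entries only
    known = [v for v in column if v is not None]
    fill = fmean(known) if strategy == "mean" else median(known)
    return [fill if v is None else v for v in column]

# Hypothetical age column with one missing value and one outlier (90)
ages = [25, 30, None, 35, 90]

mean_filled = impute(ages, "mean")
median_filled = impute(ages, "median")
print(mean_filled, median_filled)
```

The outlier pulls the mean fill value up to 45, while the median stays at 32.5.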

When deciding whether to drop rows with missing data, make sure that dropping them doesn't skew the results.

### The best way might be using Machine Learning

- KNN: average the values from the K most similar rows and use that to replace the missing data. Mostly applicable to numerical data rather than categorical data.
- Deep learning: works really well for categorical data.
- Regression: find linear or non-linear relationships between the missing feature and the other features.
- **MICE (Multiple Imputation by Chained Equations)**
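A rough sketch of the KNN approach from the list above, assuming numeric rows, `None` for missing values, Euclidean distance on the other columns, and that only one column has gaps. `knn_impute` is an illustrative helper, not a library function.

```python
import math
from statistics import fmean

def knn_impute(rows, target_col, k=2):
    """Fill None in target_col with the mean of that column over the
    k rows that are closest on the remaining columns."""
    def other(row):
        return [v for i, v in enumerate(row) if i != target_col]

    # Only rows where the target column is known can act as neighbors
    complete = [r for r in rows if r[target_col] is not None]
    filled = []
    for row in rows:
        new_row = list(row)
        if new_row[target_col] is None:
            nearest = sorted(
                complete,
                key=lambda r: math.dist(other(r), other(row)),
            )[:k]
            new_row[target_col] = fmean([r[target_col] for r in nearest])
        filled.append(new_row)
    return filled

# Hypothetical rows; the last one is missing its third value
rows = [[1.0, 1.0, 10.0], [1.1, 0.9, 12.0],
        [5.0, 5.0, 50.0], [5.1, 4.9, 52.0],
        [1.05, 1.0, None]]

filled = knn_impute(rows, target_col=2, k=2)
print(filled[-1])
```

The missing value is filled from the two rows in the nearby cluster rather than from the column-wide mean.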

Best of the bestest ways: gather more data.

## Handling unbalanced data

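A "positive" case is one where the thing the model is looking for is present, and a "negative" case is one where it is not. The data is unbalanced when one of the two is much rarer than the other.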

## Oversampling

Take samples from the minority class and copy them until the classes are balanced.
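As a sketch, random oversampling with replacement could look like this. The function name, the fixed seed and the sample data are assumptions for illustration.

```python
import random

def oversample(minority, target_size, seed=0):
    rng = random.Random(seed)
    # Draw random copies of existing minority samples until we reach target_size
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

# Hypothetical dataset: 3 positive cases against 10 negative ones
positives = ["p1", "p2", "p3"]
negatives = ["n%d" % i for i in range(10)]

balanced_positives = oversample(positives, len(negatives))
print(len(balanced_positives), len(negatives))
```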

## Undersampling
Remove samples from the majority class to balance out the classes. This throws information away, so it is usually only worth doing to avoid scaling issues with very large datasets. A better solution is getting more computational power.
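A minimal sketch of random undersampling, assuming we simply keep a random subset of the majority class. Names and data are illustrative.

```python
import random

def undersample(majority, target_size, seed=0):
    rng = random.Random(seed)
    # Keep a random subset of the majority class, without replacement
    return rng.sample(majority, target_size)

# Hypothetical dataset: 10 negative cases against 3 positive ones
negatives = ["n%d" % i for i in range(10)]
positives = ["p1", "p2", "p3"]

reduced_negatives = undersample(negatives, len(positives))
print(reduced_negatives)
```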

## SMOTE - Synthetic Minority Oversampling Technique

Artificially generates new samples of the minority class using its nearest neighbors.
New samples are interpolated between a minority sample and one of its neighbors, and the technique is often combined with undersampling the majority class.
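A simplified sketch of the core idea: pick a minority sample, pick one of its nearest minority neighbors, and create a new point somewhere on the line between them. This is only an illustration under those assumptions, not the reference SMOTE implementation; the function name and data are made up.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding itself
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbors)
        t = rng.random()  # random position on the segment base -> other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical minority-class points
minority = [(1.0, 1.0), (1.5, 1.2), (1.2, 1.8), (2.0, 1.5)]

new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Because each new point is an interpolation, it always falls within the region already covered by the minority class.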

## Adjusting the thresholds
Raising the probability threshold for predicting a positive is usually a good way to iron out false positives. It comes at the cost of more false negatives.
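To see the tradeoff, here is a sketch that counts false positives and false negatives at two thresholds. The scores and labels are made-up model outputs, and `confusion_counts` is an illustrative helper.

```python
def confusion_counts(scores, labels, threshold):
    # A case is predicted positive when its score reaches the threshold
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn

# Hypothetical predicted probabilities and true labels (1 = positive)
scores = [0.10, 0.35, 0.45, 0.55, 0.60, 0.70, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

default_counts = confusion_counts(scores, labels, threshold=0.5)
strict_counts = confusion_counts(scores, labels, threshold=0.75)
print(default_counts, strict_counts)
```

Moving the threshold from 0.5 to 0.75 takes the false positives from 2 to 0, but the false negatives go from 1 to 2.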


# Binning
Transform numerical data into categorical data by "putting them into bins" (basically buckets of similar data).
Useful when the measurements are not precise.
## Quantile binning
Categorizes the data by its place in the data distribution, so that each bin ends up with the same number of samples.
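A sketch of quantile binning by rank, assuming the number of values divides evenly into the number of bins. The helper name and the ages are illustrative.

```python
def quantile_bins(values, n_bins):
    # Indices of the values in ascending order
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # Position in the sorted order decides the bin, not the raw value
        bins[i] = rank * n_bins // len(values)
    return bins

# Hypothetical ages, unevenly spread
ages = [22, 25, 31, 38, 45, 52, 67, 80]

age_bins = quantile_bins(ages, 4)
print(age_bins)
```

Each of the four bins gets two values even though the age ranges they cover have very different widths.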
# Transforming Data
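Applying a function to a feature (for example a logarithm or a square) is useful when its relationship with the target is non-linear.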
# Encoding
Transforming data into some new representation required by the model.
## One-hot encoding
Creates a separate binary column for every category we have, with a 1 in the column for the sample's category and 0 everywhere else.
Very common in deep learning models.
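A minimal sketch of one-hot encoding a categorical column. The `one_hot` helper and the color values are made up for illustration.

```python
def one_hot(values):
    # One column per distinct category, in a stable (sorted) order
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

# Hypothetical categorical column
colors = ["red", "green", "blue", "green"]

categories, encoded = one_hot(colors)
print(categories)
print(encoded)
```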
# Shuffling
Eliminates residual signals in the training data from the order it was collected.
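When shuffling, features and labels have to stay paired. A small sketch, assuming they live in two parallel lists (the helper name and data are illustrative):

```python
import random

def shuffle_together(features, labels, seed=0):
    # Pair each feature row with its label so they move together
    paired = list(zip(features, labels))
    random.Random(seed).shuffle(paired)
    shuffled_features, shuffled_labels = zip(*paired)
    return list(shuffled_features), list(shuffled_labels)

# Hypothetical data collected in label order
features = [[1], [2], [3], [4], [5], [6]]
labels = [0, 0, 0, 1, 1, 1]

shuffled_features, shuffled_labels = shuffle_together(features, labels)
print(shuffled_features, shuffled_labels)
```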