# Programmer's Guide to Data Mining

In [21]:
# Dependencies
import math

## Collaborative Filtering
Collaborative filtering is a type of recommendation system that bases its suggestions on the preferences of other people who are similar to you.

![](../resources/euclidian_distance.jpeg)

$c = \sqrt{a^2 + b^2}$

For more than one book, create a matrix of ratings (columns are users, rows are books). For any two people, take the difference between their ratings for each book and square it. The square root of the sum of those squared differences is the Euclidean distance between them.

$\sqrt{\sum (x_i-y_i)^2}$
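
The formula above can be sketched for any number of books as follows. This is a minimal illustration, not the book's own code: the function name `euclidean_distance_nd` and the sample ratings are made up for the example, and it assumes both people have rated every book in the same order.

```python
import math

def euclidean_distance_nd(ratings_x, ratings_y):
    """Euclidean distance between two equal-length lists of ratings."""
    if len(ratings_x) != len(ratings_y):
        raise ValueError("both people must have rated the same books")
    # Sum the squared per-book differences, then take the square root.
    sum_squared_diffs = sum((x - y) ** 2 for x, y in zip(ratings_x, ratings_y))
    return math.sqrt(sum_squared_diffs)

# Hypothetical ratings for three books.
print(euclidean_distance_nd([5, 3, 4], [2, 3, 0]))  # sqrt(9 + 0 + 16) = 5.0
```

With only two books this reduces to the two-dimensional case, where the distance is the hypotenuse $c$ of a right triangle.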

If two people have reviewed only a couple of the same books, the distance between them can be misleading, because it rests on very little evidence. You need a decent number of books in common before the result means much.
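
One way to guard against this, sketched below, is to compare only the books both people have rated and to refuse to answer when the overlap is too small. The function name `shared_euclidean_distance`, the `min_shared` threshold of 3, and the sample data are all assumptions made for illustration; a sensible threshold depends on your data.

```python
import math

def shared_euclidean_distance(ratings_a, ratings_b, min_shared=3):
    """Distance over books rated by both people, or None if too few overlap.

    ratings_a and ratings_b map book title -> rating.
    min_shared is an arbitrary illustrative cutoff, not a recommended value.
    """
    shared_books = ratings_a.keys() & ratings_b.keys()
    if len(shared_books) < min_shared:
        return None  # not enough books in common to trust the distance
    sum_squared_diffs = sum(
        (ratings_a[book] - ratings_b[book]) ** 2 for book in shared_books
    )
    return math.sqrt(sum_squared_diffs)

# Hypothetical users and ratings.
alice = {"Dune": 5, "Emma": 3, "Ulysses": 4, "Neuromancer": 1}
bob = {"Dune": 2, "Emma": 3, "Ulysses": 4, "Walden": 5}
carol = {"Dune": 5, "Walden": 1}

print(shared_euclidean_distance(alice, bob))    # 3 shared books -> 3.0
print(shared_euclidean_distance(alice, carol))  # 1 shared book  -> None
```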

## Distance Metrics

In [None]:
def manhattan_distance(x1, y1, x2, y2):
    distance = abs(x1-x2) + abs(y1-y2)
    return distance

def euclidian_distance(x1, y1, x2, y2):
    x_diff = (x1-x2)
    y_diff = (y1-y2)
    sum_squared_diffs = math.pow(x_diff, 2) + math.pow(y_diff, 2)
    distance = math.sqrt(sum_squared_diffs)
    return distance

## Pearson Correlation

People rate on different scales: one person may like everything and rate it all 4-5, while another avoids extremes and rates everything 2-4. The Pearson correlation coefficient compensates for this. If you plot one person's scores against the other's and the points fall on a rising straight line, the two people agree, even if their scales differ. If the points are scattered all over the place, they don't agree. The coefficient ranges from -1 to 1: 1 means perfect agreement, -1 means perfect disagreement, and values near 0 mean no linear relationship.

$$r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}}
{\sqrt{\sum_{i=1}^{n} x_i^2 - \frac{(\sum_{i=1}^{n} x_i)^2}{n}} \sqrt{\sum_{i=1}^{n} y_i^2 - \frac{(\sum_{i=1}^{n} y_i)^2}{n}}}$$

In [None]:
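def pearson_cor(person1_ratings, person2_ratings):
    """Pearson correlation of two equal-length rating lists (single-pass formula)."""
    n = len(person1_ratings)
    if n == 0 or n != len(person2_ratings):
        raise ValueError("ratings must be non-empty and of equal length")

    summed_ratings_multiplied = 0
    sum_person1_ratings = 0
    sum_person2_ratings = 0
    sum_person1_squared = 0
    sum_person2_squared = 0
    for x, y in zip(person1_ratings, person2_ratings):
        summed_ratings_multiplied += x * y
        sum_person1_ratings += x
        sum_person2_ratings += y
        sum_person1_squared += x ** 2
        sum_person2_squared += y ** 2

    summed_totals_multiplied_divided_n = sum_person1_ratings * sum_person2_ratings / n
    numerator = summed_ratings_multiplied - summed_totals_multiplied_divided_n

    # The two terms under the square roots in the denominator.
    person1_term = sum_person1_squared - sum_person1_ratings ** 2 / n
    person2_term = sum_person2_squared - sum_person2_ratings ** 2 / n
    if person1_term <= 0 or person2_term <= 0:
        # Someone gave every item the same rating, so the correlation is
        # undefined; returning 0 here is a convention chosen for this sketch.
        return 0
    return numerator / (math.sqrt(person1_term) * math.sqrt(person2_term))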
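
As a sanity check on the claim that Pearson correlation ignores differences in rating scale, here is a small self-contained sketch. It uses the mean-centred definition of the coefficient, which is algebraically equivalent to the single-pass formula above but easier to verify by hand. The name `pearson_mean_centred` and the three sample raters are hypothetical, and the function assumes neither person gave every item the same rating.

```python
import math

def pearson_mean_centred(xs, ys):
    """Pearson correlation computed from deviations around each mean."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    # Assumes both lists vary; a constant list would make this zero.
    denominator = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return numerator / denominator

generous = [4, 5, 4, 5]    # rates everything 4-5
cautious = [2, 4, 2, 4]    # same preferences, lower and wider scale
contrarian = [4, 2, 4, 2]  # prefers exactly the items the others dislike

print(pearson_mean_centred(generous, cautious))    # 1.0
print(pearson_mean_centred(generous, contrarian))  # -1.0
```

The generous and cautious raters never give the same score to any item, so a distance metric would treat them as dissimilar, yet their coefficient is 1 because their scores rise and fall together.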