- Run 'python main.py' to run the program
- Change the algorithm in the commented section inside main.py
- Change K (the number of nearest neighbors used) in main.py
In this project, I developed different algorithms to make recommendations for movies. There is a given training set with 200 users and 1000 movies. Movie ratings are on a scale of 1 to 5, and 0 means no rating was given. Based on this training set, we should be able to predict ratings for new users given some of their movie preferences.
This whole project is centered around K Nearest Neighbors, which is used in both Collaborative Filtering algorithms. It produces a ranked list of the nearest users according to a similarity metric; the two metrics used in this project were Cosine Similarity and Pearson Correlation. The final prediction is then calculated as a weighted average over the ranked list. User based filtering rests on the assumption that similar users will rate similarly.
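The prediction step can be sketched as follows (a minimal illustration with hypothetical names, not the exact code in main.py): given the (similarity, rating) pairs for users who rated the target movie, keep the k most similar and take a similarity-weighted average.

```python
def predict_rating(neighbors, k):
    """neighbors: list of (similarity, rating) pairs for users who rated the target movie."""
    top_k = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    weight_sum = sum(sim for sim, _ in top_k)
    if weight_sum == 0:
        return 3.0  # neutral fallback when no neighbor carries any weight
    return sum(sim * rating for sim, rating in top_k) / weight_sum

# Three neighbors rated the movie 4, 5, and 2 with similarities 0.9, 0.7, and 0.2.
print(predict_rating([(0.9, 4), (0.7, 5), (0.2, 2)], k=2))  # ~4.44
```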
- User Based Cosine Similarity: Cosine Similarity finds the angle between two vectors and is independent of vector length. The angle is used to compare how similar two vectors are. The range for Cosine Similarity is [0,1]. Note that we only care about the similarity between users that have rated the movie we are trying to predict. The pro of Cosine Similarity is that it provides a single metric for comparing users. Its weakness is its inability to differentiate between different dimensions. By this, I mean that it cannot recognize that the pair [1,2,3], [1,2,3] is more similar than the pair [1,2], [1,2]; both pairs produce a cosine similarity of 1. In addition, Cosine Similarity cannot handle one dimensional cases, since all comparisons between one dimensional vectors produce the same angle.
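A rough sketch of the metric on the vectors of co-rated movies (hypothetical helper, not the code in main.py); the second call shows the dimension weakness described above:

```python
import math

def cosine_similarity(u, v):
    """u, v: equal-length lists of ratings restricted to co-rated movies."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 2, 3], [1, 2, 3]))  # ~1.0
print(cosine_similarity([1, 2], [1, 2]))        # ~1.0 -- same score despite fewer co-rated movies
```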
- Pearson: The problem with Cosine Similarity is that users may be different types of raters. One user may always give high ratings, and another low ratings. Under Cosine Similarity, a rating vector of [7,8,9] is very different from [1,2,3]. However, the correlation between these two vectors is very strong. In fact, these two ratings are identical if we look at each rating relative to the user's average (subtracting the average): both vectors become [-1,0,1]. The Pearson Coefficient measures how well two users' ratings are correlated and ranges over [-1,1]. We take the absolute value of this coefficient and use that as our basis for similarity. The pro of Pearson Correlation is its focus on correlation and trends: it can find similar user preferences even when their absolute values are different. Another advantage of Pearson over Cosine Similarity is its ability to capture inversely related users, cases that would have produced a low cosine similarity. The con of Pearson is that it still cannot differentiate between different dimensions or handle one dimensional cases.
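A minimal sketch of the Pearson metric on co-rated movies (hypothetical helper name, not the code in main.py), showing that a harsh rater and a generous rater with the same trend correlate perfectly:

```python
import math

def pearson_similarity(u, v):
    """Pearson correlation on co-rated movies; returns 0 when either vector is constant."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    denom = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(du, dv)) / denom

# Both vectors become [-1, 0, 1] after subtracting their averages.
print(pearson_similarity([1, 2, 3], [7, 8, 9]))  # ~1.0
```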
- IUF: IUF (Inverse User Frequency) measures how informative a movie is, which is useful when determining the similarity between users. If two users rate a unique movie (a movie that has very few ratings) similarly, those two users will have a higher similarity. IUF allows us to weigh movies differently.
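A sketch of the standard IUF weighting, log(total users / users who rated the movie); I am assuming this common formulation, and the exact form used in main.py may differ:

```python
import math

def iuf(total_users, users_who_rated):
    """Inverse User Frequency: rarely rated movies receive a larger weight."""
    return math.log(total_users / users_who_rated)

# With 200 users, a movie rated by only 5 users outweighs one rated by 150.
print(iuf(200, 5))    # ~3.69
print(iuf(200, 150))  # ~0.29
```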
- Case Amplification: Case amplification is used to increase the weights of similar users and decrease the weights of less similar users, so that the most similar users count more heavily in the prediction than those that are not as similar.
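A sketch of the usual case amplification rule, w' = w * |w|^(ρ-1); the exponent ρ = 2.5 shown here is the commonly cited value and is an assumption, not necessarily the one used in main.py:

```python
def amplify(weight, rho=2.5):
    """Raise a similarity weight to a power greater than 1 while preserving its sign."""
    return weight * abs(weight) ** (rho - 1)

print(amplify(0.9))  # ~0.77 -- strong similarities shrink only a little
print(amplify(0.3))  # ~0.05 -- weak similarities shrink sharply
```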
- *Note: Fixing One Dimensional Cases: When comparing angles, the one dimensional case will always produce the same angle, so using cosine similarity on any one dimensional vector is not useful. For these cases I determine similarity with the following equation: Sim = 1 / (|v1 - v2| + 1.05), where 1.05 is used for smoothing. 1.05 is chosen because it sets a maximum threshold on the one dimensional similarity (the best case is 1/1.05 = 0.95), which gives vectors with more dimensions an advantage.
- Adjusted Cosine Similarity: This follows the same process as the Cosine Similarity for User Based Collaborative Filtering described above, except that we compare movies rather than users, and we care about the magnitude of each rating relative to the user who gave it. For every movie rating, before we calculate the cosine similarity, we subtract out that user's average rating. This filtering algorithm depends on the following assumption: movies that are similar will receive similar ratings.
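A minimal sketch of the adjusted cosine idea, assuming movies are stored as user-to-rating dictionaries (hypothetical structure and names, not the ones in main.py):

```python
import math

def adjusted_cosine(item_a, item_b, user_means):
    """item_a, item_b: dicts mapping user id -> rating for two movies.
    user_means: dict mapping user id -> that user's average rating."""
    common = set(item_a) & set(item_b)  # users who rated both movies
    if not common:
        return 0.0
    da = [item_a[u] - user_means[u] for u in common]
    db = [item_b[u] - user_means[u] for u in common]
    denom = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    return 0.0 if denom == 0 else sum(x * y for x, y in zip(da, db)) / denom

means = {"u1": 3.0, "u2": 4.0}
print(adjusted_cosine({"u1": 4, "u2": 5}, {"u1": 2, "u2": 3}, means))  # ~-1.0
```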
- Dimensions: As mentioned above, the biggest problem with Cosine Similarity and Pearson is the inability to compare their metrics across different dimensions. If we had two pairs of users, [1,2,3], [1,2,3] and [1,2,3,4], [1,2,3,4], both pairs would receive a Pearson Correlation and Cosine Similarity of 1. However, it is intuitive that the [1,2,3,4] pair should be deemed "more closely related." In addition, Pearson Correlation and Cosine Similarity will favor lower dimensional similarities. This is the major issue, and my custom algorithm attempts to solve this problem.
- Scale By Dimension: To ensure that the top k users are not mostly one dimensional vectors, I multiply the final cosine value by the log of the dimension (the number of co-rated movies). For example, if the vectors [2,2,2] and [2,3,4] were being compared, their cosine similarity would be multiplied by log(3) to get the final similarity. Of course, this method does not work in the one dimensional case, since log(1) is 0, so I hardcode a fixed value for that case. However, even after trials of changing the log base, the top k users were still vectors with large dimensions and relatively low similarity rates (before scaling). As a result, the error rate got worse, as shown below.
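A sketch of the scaling, assuming `dimension` is the number of co-rated movies; the hardcoded one dimensional value shown here is a placeholder, since the note above does not state the exact constant used:

```python
import math

def scaled_similarity(cosine, dimension, one_dim_value=0.5):
    """Multiply the cosine score by log(dimension); one dimensional overlaps get a fixed value."""
    if dimension <= 1:
        return one_dim_value  # log(1) is 0, so a hardcoded placeholder value is used instead
    return cosine * math.log(dimension)

# Two users overlapping on 3 movies with cosine similarity 0.98:
print(scaled_similarity(0.98, 3))  # ~1.08
```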
- Manhattan Distance: The distance from vector A to B, moving across one dimension at a time. This metric is useful when divided by the number of dimensions in the two vectors, which gives the average distance per dimension. From this I created the following similarity equation: m = 1 / (ManhattanDistance(V1, V2) + 1), where the +1 is used for smoothing. In addition, I added the condition that if two similarities were equal, the vector with more dimensions would be considered more similar. Although I had high hopes for this method, it did not produce favorable results.
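Taking the equation as written, a minimal sketch with a hypothetical helper name (the tie-breaking rule on dimension is omitted here):

```python
def manhattan_similarity(u, v):
    """Similarity from the Manhattan distance over co-rated movies: 1 / (distance + 1)."""
    distance = sum(abs(a - b) for a, b in zip(u, v))
    return 1.0 / (distance + 1)

print(manhattan_similarity([2, 2, 2], [2, 3, 4]))  # distance 3 -> 0.25
```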
- Top K Per Dimension: After the failure of the past two methods, I believed it was too difficult to compare vectors of different dimensions. As such, this method works under the assumption that we cannot compare similarities between vectors of different dimensions. I created a Top K for every dimension, and the final result is a weighted average of each dimension's result, where the weight for each dimension is the sum of the similarity weights within that dimension. I implemented this concept for both User Based Pearson and Cosine Similarity. Surprisingly, the Cosine Similarity Top K per Dimension did significantly better than the Pearson version, even though User Based Pearson had a lower error rate by itself. This Top K per Dimension with Cosine Similarity produced the best result that I got throughout this project.
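A sketch of the per-dimension idea, assuming each candidate neighbor carries the number of co-rated movies ("dimension") its similarity was computed on; names and structure are illustrative, not the code in main.py:

```python
def predict_per_dimension(neighbors, k):
    """neighbors: list of (dimension, similarity, rating) triples, where dimension is the
    number of co-rated movies used to compute that neighbor's similarity."""
    by_dim = {}
    for dim, sim, rating in neighbors:
        by_dim.setdefault(dim, []).append((sim, rating))

    total_weight = 0.0
    weighted_prediction = 0.0
    for dim, pairs in by_dim.items():
        top_k = sorted(pairs, reverse=True)[:k]        # top K inside this dimension only
        weight = sum(sim for sim, _ in top_k)          # dimension weight = sum of its similarities
        if weight == 0:
            continue
        prediction = sum(sim * r for sim, r in top_k) / weight
        weighted_prediction += weight * prediction
        total_weight += weight
    return weighted_prediction / total_weight if total_weight else 3.0

# One neighbor found on a 1-movie overlap, two neighbors found on a 3-movie overlap.
print(predict_per_dimension([(1, 0.9, 5), (3, 0.8, 4), (3, 0.7, 2)], k=2))  # ~3.79
```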