# 1. Introduction

>Your task is to build a mixture model for collaborative filtering. You are given a data matrix containing movie ratings made by users where the matrix is extracted from a much larger Netflix database. Any particular user has rated only a small fraction of the movies so the data matrix is only partially filled. The goal is to predict all the remaining entries of the matrix.

>You will use mixtures of Gaussians to solve this problem. The model assumes that each user's rating profile is a sample from a mixture model. In other words, we have $K$ possible types of users and, in the context of each user, we must sample a user type and then the rating profile from the Gaussian distribution associated with the type. We will use the Expectation Maximization (EM) algorithm to estimate such a mixture from a partially observed rating matrix. The EM algorithm proceeds by iteratively assigning (softly) users to types (E-step) and subsequently re-estimating the Gaussians associated with each type (M-step). Once we have the mixture, we can use it to predict values for all the missing entries in the data matrix.
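
As a sketch of the model described above, the likelihood of one user's observed ratings can be written as below. The notation is assumed for illustration and does not come from the quoted text: $C_u$ is the set of movies rated by user $u$, $\pi_j$ is the mixing weight of type $j$, and each type is taken to be a spherical Gaussian with mean $\mu^{(j)}$ and variance $\sigma_j^2$ (the quote does not state the covariance structure).

$$
p\left(x^{(u)}_{C_u} \mid \theta\right) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}\left(x^{(u)}_{C_u};\ \mu^{(j)}_{C_u},\ \sigma_j^2 I\right)
$$

Under the same assumptions, the soft assignment of user $u$ to type $j$ in the E-step is the posterior

$$
p(j \mid u) = \frac{\pi_j \, \mathcal{N}\left(x^{(u)}_{C_u};\ \mu^{(j)}_{C_u},\ \sigma_j^2 I\right)}{\sum_{k=1}^{K} \pi_k \, \mathcal{N}\left(x^{(u)}_{C_u};\ \mu^{(k)}_{C_u},\ \sigma_k^2 I\right)}
$$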

## What to do

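- Build a mixture model for collaborative filtering
- We are given a data matrix made up of each user's movie ratings. The matrix is extracted from the much larger Netflix database.
- Each user has rated only a small fraction of the movies, so the matrix is only partially filled.
- The goal is to predict the remaining entries of the matrix
- Estimate the mixture (the mixing proportions?) using the Expectation Maximization (EM) algorithm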
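
To make the prediction step concrete, here is a minimal sketch of filling in missing entries once a mixture has been estimated. It is not the project's `em.py`. The function name `fill_missing`, the spherical-Gaussian parameters (`mu`, `var`, `p`), and the convention that a missing rating is stored as `0` are all assumptions made for this illustration.

```python
import numpy as np

def fill_missing(X, mu, var, p):
    """Fill missing entries (assumed to be encoded as 0) with posterior-weighted means.

    X:   (n, d) partially observed rating matrix
    mu:  (K, d) component means
    var: (K,)   spherical variance of each component
    p:   (K,)   mixing weights
    """
    mask = X != 0                      # True where a rating is observed
    n_obs = mask.sum(axis=1)           # number of observed ratings per user
    n, _ = X.shape
    K = mu.shape[0]

    # Log of (mixing weight * Gaussian density), using observed coordinates only
    log_post = np.zeros((n, K))
    for j in range(K):
        sq_dist = (((X - mu[j]) ** 2) * mask).sum(axis=1)
        log_post[:, j] = (np.log(p[j])
                          - 0.5 * n_obs * np.log(2 * np.pi * var[j])
                          - 0.5 * sq_dist / var[j])

    # Normalize in log space for numerical stability (soft assignment)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Expected rating under the posterior; keep observed ratings as they are
    X_pred = post @ mu
    return np.where(mask, X, X_pred)
```

Observed ratings are left unchanged; only the entries treated as missing are replaced by the weighted average of the component means.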

## Approach

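- Use mixtures of Gaussians to solve this problem
- The model assumes that each user's rating profile is a sample from the probability distribution of this mixture model
- In other words, we assume $K$ possible types of users; for each user, we must sample a user type and then a rating profile from the Gaussian distribution associated with that type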
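
The generative assumption above can be sketched in a few lines of NumPy. This is only an illustration: `sample_ratings`, its parameters, the spherical covariance, and the use of `0` to mark unrated movies are hypothetical choices, not part of the project code.

```python
import numpy as np

def sample_ratings(n_users, mu, var, p, observed_frac=0.3, seed=0):
    """Sample a partially observed rating matrix from a spherical Gaussian mixture.

    mu:  (K, d) mean rating profile of each user type
    var: (K,)   spherical variance of each user type
    p:   (K,)   probability of each user type
    """
    rng = np.random.default_rng(seed)
    K, d = mu.shape

    # Step 1: sample a user type for every user
    types = rng.choice(K, size=n_users, p=p)

    # Step 2: sample a full rating profile from that type's Gaussian
    noise = rng.standard_normal((n_users, d))
    X_full = mu[types] + np.sqrt(var[types])[:, None] * noise

    # Step 3: hide most entries, since each user rates only a few movies
    mask = rng.random((n_users, d)) < observed_frac
    X = np.where(mask, X_full, 0.0)   # 0 marks "not rated" in this sketch
    return X, types, mask
```

The task in this project is the reverse direction: given only a matrix like `X`, recover the types and their Gaussians, then fill in the hidden entries.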

## setup

>- kmeans where we have implemented a baseline using the K-means algorithm
>- naive_em.py where you will implement a first version of the EM algorithm (tabs 3-4)
>- em.py where you will build a mixture model for collaborative filtering (tabs 7-8)
>- common.py where you will implement the common functions for all models (tab 5)
>- main.py where you will write code to answer the questions for this project
>- test.py where you will write code to test your implementation of EM for a given test case

> Additionally, you are provided with the following data files:

>- toy_data.txt a 2D dataset that you will work with in tabs 2-5
>- netflix_incomplete.txt the netflix dataset with missing entries to be completed
>- netflix_complete.txt the netflix dataset with missing entries completed
>- test_incomplete.txt a test dataset for you to test your code against our implementation
>- test_complete.txt a test dataset for you to test your code against our implementation
>- test_solutions.txt a test dataset for you to test your code against our implementation

# 2. K-means




- This part compares clustering with K-means against clustering with EM
    - The K-means used here differs slightly from the version covered so far
- The `GaussianMixture` class is defined in common.py. It looks like we will not use this class directly
- It seems to be instantiated inside the functions defined in common.py
- common.init(X, K, seed) initializes K-means. Is this a random initialization?
    - K: number of clusters
    - seed: random seed used to randomly initialize the parameters
- Run with K=[1, 2, 3, 4] and plot each result using `common.plot`
- Try seeds 0, 1, 2, 3, 4 and keep the one with the lowest cost
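
The seed-selection procedure in the last two bullets can be sketched as follows. Because the signatures of the project's `kmeans` module are not shown in these notes, the sketch uses a small self-contained stand-in (`kmeans_cost`, a plain Lloyd's-algorithm K-means) rather than `common.init` or the provided baseline; both function names are hypothetical.

```python
import numpy as np

def kmeans_cost(X, K, seed, n_iter=50):
    """Minimal K-means stand-in. Returns (cost, centers, labels)."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick K distinct data points as starting centers
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(K):
            if np.any(labels == j):          # leave empty clusters unchanged
                centers[j] = X[labels == j].mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    cost = d2[np.arange(len(X)), labels].sum()
    return cost, centers, labels

def best_over_seeds(X, K, seeds=(0, 1, 2, 3, 4)):
    """Run K-means once per seed and keep the run with the lowest cost."""
    results = {seed: kmeans_cost(X, K, seed) for seed in seeds}
    best_seed = min(results, key=lambda s: results[s][0])
    return best_seed, results[best_seed]
```

A typical use would loop `for K in [1, 2, 3, 4]`, call `best_over_seeds(X, K)`, and plot the returned run for each `K`.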