In [1]:
import pandas as pd
import numpy as np

import itertools

**Using feather file rather than txt files**

In [2]:
df = pd.read_feather('../ratings.fth')
df.shape

(100480507, 3)

In [3]:
df.describe()

| | movie | user | rating |
|---|---|---|---|
| count | 100480500.0 | 100480500.0 | 100480500.0 |
| mean | 9070.915 | 1322489.0 | 3.60429 |
| std | 5131.891 | 764536.8 | 1.085219 |
| min | 1.0 | 6.0 | 1.0 |
| 25% | 4677.0 | 661198.0 | 3.0 |
| 50% | 9051.0 | 1319012.0 | 4.0 |
| 75% | 13635.0 | 1984455.0 | 4.0 |
| max | 17770.0 | 2649429.0 | 5.0 |


In [4]:
# df = df.pivot(index='user', columns='movie', values='rating')
# df

In [5]:
# df.fillna(0, inplace=True)
# df

In [6]:
# R = np.array(df)
# R.shape

In [7]:
df

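| | movie | user | rating |
|---|---|---|---|
| 0 | 1 | 1488844 | 3 |
| 1 | 1 | 822109 | 5 |
| 2 | 1 | 885013 | 4 |
| 3 | 1 | 30878 | 4 |
| 4 | 1 | 823519 | 3 |
| 5 | 1 | 893988 | 3 |
| 6 | 1 | 124105 | 4 |
| 7 | 1 | 1248029 | 3 |
| 8 | 1 | 1842128 | 4 |
| 9 | 1 | 2238063 | 3 |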


In [8]:
# Randomly sample elements from the dataframe

# df_sample = df.sample(n=100)
# df = df_sample

In [9]:
# df.fillna(0, inplace=True)
# df

**Creating a dummy small dataframe to implement formulas**

In [10]:
# df = pd.DataFrame({'user': np.random.choice(list(range(1,49)), size=150), 'movie': np.random.choice(list(range(1,18)), size=150), 'rating': np.random.choice([1, 2, 3, 4, 5, np.nan], size=150)})

# subset = df[['user', 'movie', 'rating']]
# review_tuples = [tuple(x) for x in subset.values]

# unique_reviews = set(review_tuples)
# set_100 = list(set(itertools.islice(unique_reviews, 100)))

# df = pd.DataFrame(set_100, columns=['user', 'movie', 'rating'])
# df

### Section 4.2 on Paper

**Global Effects**   
We start with a few global effects that are easy to measure and publish accurately without incurring substantial privacy cost. We first measure and publish the number of ratings present for each movie, and the sum of ratings for each movie, with random noise added for privacy.   
We use these to derive a global average, G = GSum/GCnt.

In [11]:
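noise = 0  # placeholder: no noise is actually added to the global sum and count here
GSum = df["rating"].fillna(0).sum() + noise  # vectorized sum instead of Python's built-in sum
GCnt = df["rating"].agg(['count']) + noise  # one-element Series indexed by 'count'
G = GSum/GCnt  # therefore also a one-element Series
G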

count    3.60429
Name: rating, dtype: float64
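
The cell above leaves `noise = 0`, so G is the exact mean rather than a noisy release. As a minimal sketch of what perturbing the two quantities could look like, the example below adds independent Gaussian noise to the sum and to the count on toy data. The noise scale `sigma_global`, the seed, and the choice of Gaussian noise are assumptions made for illustration; they are not values taken from the paper.

```python
import numpy as np
import pandas as pd

# Toy ratings; the notebook itself works on the full `df` loaded earlier.
toy = pd.DataFrame({"rating": [5, 3, 4, 1, 4, 5]})

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
sigma_global = 0.1              # hypothetical noise scale (assumption)

# Independent noise draws for the sum and for the count.
gsum_noisy = toy["rating"].sum() + rng.normal(0, sigma_global)
gcnt_noisy = toy["rating"].count() + rng.normal(0, sigma_global)

g_noisy = gsum_noisy / gcnt_noisy
g_exact = toy["rating"].mean()
print(g_exact, g_noisy)
```

With 100 million ratings, noise of this size would barely move G, which is consistent with the text's remark that global effects can be published accurately.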

**Movie Effects**

In [12]:
# Count the distinct movies and users in the ratings table
number_of_movies = df["movie"].nunique()
number_of_users = df["user"].nunique()
print("Number of Movies: ", number_of_movies)
print("Number of Users: ", number_of_users)

Number of Movies:  17770
Number of Users:  480189


Next, we sum and count the number of ratings for each movie, using d-dimensional vector sums.

In [13]:
# Generate vectors of noise for each movie
sigma = .1
noisemovies = np.random.normal(0, sigma, number_of_movies)
noisemovies

array([-0.05089098,  0.03471242,  0.05586272, ...,  0.11665861,
        0.13388247,  0.02741122])

In [14]:
# Number of fictitious ratings introduced in the per-movie average
betam = 15.0
# Number of fictitious ratings introduced in the per-user average
betap = 20.0
# Bound of the interval [-B, B] used to clamp the centered ratings, which limits sensitivity
B = 1.0

We produce a stabilized per-movie average rating by introducing βm fictitious ratings at value G for each movie.
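
To make the effect of the fictitious ratings concrete, here is a small worked example. The helper `stabilized_average` is a hypothetical name introduced for this sketch, and G is rounded to 3.6 for readability; the formula is the same one applied in the next cell, (sum + βm·G) / (count + βm).

```python
G = 3.6       # global average, rounded from the value computed earlier
betam = 15.0  # number of fictitious ratings, as set in the notebook

def stabilized_average(rating_sum, rating_count, global_avg, beta):
    """Average after adding `beta` fictitious ratings at `global_avg`."""
    return (rating_sum + beta * global_avg) / (rating_count + beta)

# A movie with only two 5-star ratings is pulled strongly toward G ...
few = stabilized_average(10.0, 2, G, betam)
# ... while a movie with 2000 5-star ratings stays close to 5.
many = stabilized_average(10000.0, 2000, G, betam)
print(round(few, 3), round(many, 3))  # prints 3.765 4.99
```

The fewer ratings a movie has, the more its average is dominated by the fictitious ratings, which also dampens the influence of the added noise on sparsely rated movies.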

In [15]:
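# Per-movie rating sums and counts, each perturbed with its own noise vector
by_movie = df.groupby('movie')["rating"]
MSum = by_movie.sum() + noisemovies
MCnt = by_movie.count() + np.random.normal(0, sigma, number_of_movies)
# Stabilize each average with betam fictitious ratings at the global average G
Mavg = (MSum + betam * float(G.iloc[0])) / (MCnt + betam)
Mavg.name = None  # leave the Series unnamed; the column is named in the next cell
Mavg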

movie
1        3.745915
2        3.562346
3        3.640808
4        2.821646
5        3.915160
6        3.091794
7        2.334434
8        3.190216
9        2.756785
10       3.203643
11       3.070972
12       3.422736
13       4.450840
14       3.089948
15       3.301692
16       3.101283
17       2.904670
18       3.784111
19       3.332156
20       3.196835
21       3.473029
22       2.338331
23       3.557712
24       3.001058
25       3.965972
26       2.795827
27       3.530695
28       3.823166
29       3.599010
30       3.761825
           ...   
17741    3.289195
17742    2.822067
17743    3.106113
17744    3.517708
17745    3.830095
17746    3.338709
17747    3.528086
17748    3.803953
17749    3.506663
17750    2.970259
17751    3.936013
17752    3.004666
17753    2.566486
17754    3.255211
17755    3.213554
17756    3.770158
17757    3.793734
17758    2.918661
17759    2.735091
17760    2.834166
17761    2.920802
17762    3.645717
17763    3.413268
17764    3.867054
1776

In [16]:
# Turn the Series into a two-column frame: movie, avg_rating
Mavg = Mavg.rename('avg_rating').reset_index()
del MCnt, MSum  # only the released averages are needed from here on

With these averages now released, they can be incorporated arbitrarily into subsequent computation with no additional privacy cost. In particular, we can subtract the corresponding averages from every rating to remove the per-movie global effects.
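
A minimal sketch of that subtraction on toy data follows. The frames `ratings` and `movie_avg` are small stand-ins for the notebook's `df` and `Mavg`, and the `centered` column name is an assumption made for this example.

```python
import pandas as pd

# Toy stand-ins for the notebook's `df` and `Mavg` frames.
ratings = pd.DataFrame({
    "movie": [1, 1, 2, 2, 2],
    "user": [10, 11, 10, 12, 13],
    "rating": [4, 5, 2, 3, 1],
})
movie_avg = pd.DataFrame({"movie": [1, 2], "avg_rating": [4.0, 2.5]})

# Look up each rating's movie average, then subtract it.
avg_by_movie = movie_avg.set_index("movie")["avg_rating"]
ratings["centered"] = ratings["rating"] - ratings["movie"].map(avg_by_movie)
print(ratings)
```

Using `Series.map` with an indexed lookup avoids materializing a full merge, which may matter at the scale of 100 million ratings.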

In [17]:
Mavg

| | movie | avg_rating |
|---|---|---|
| 0 | 1 | 3.745915 |
| 1 | 2 | 3.562346 |
| 2 | 3 | 3.640808 |
| 3 | 4 | 2.821646 |
| 4 | 5 | 3.915160 |
| 5 | 6 | 3.091794 |
| 6 | 7 | 2.334434 |
| 7 | 8 | 3.190216 |
| 8 | 9 | 2.756785 |
| 9 | 10 | 3.203643 |


### Section 4.3 on Paper

**User Effects**   
Having published the average rating for each movie, we will subtract these averages from each rating before continuing. We then center the ratings for each user, taking an average again with a number βp of fictitious ratings at the recomputed global average.
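
The user-effects step might be sketched as follows on toy data. This is an illustration under stated assumptions, not the paper's exact procedure: noise is omitted, the column names are hypothetical, and the final clamp to [-B, B] is inferred from the `B` parameter declared earlier in this notebook.

```python
import pandas as pd

betap = 20.0  # fictitious ratings per user, as set in the notebook
B = 1.0       # clamping bound, as set in the notebook

# Toy movie-centered ratings (rating minus the movie's average).
centered = pd.DataFrame({
    "user": [10, 10, 11, 12, 12, 12],
    "centered": [0.5, 1.5, -2.0, 0.0, 1.0, -1.0],
})

# Global average recomputed on the movie-centered ratings (noise omitted).
g_centered = centered["centered"].mean()

# Stabilized per-user average with betap fictitious ratings at g_centered.
per_user = centered.groupby("user")["centered"].agg(["sum", "count"])
user_avg = (per_user["sum"] + betap * g_centered) / (per_user["count"] + betap)

# Subtract each user's average and clamp the result to [-B, B].
residual = centered["centered"] - centered["user"].map(user_avg)
centered["clamped"] = residual.clip(-B, B)
print(centered)
```

With βp = 20 and only a handful of ratings per user, the per-user averages stay close to the recomputed global average, so most of the adjustment in this toy example comes from the clamp.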