In [104]:
import pandas as pd
import numpy as np

import itertools

**Using feather file rather than txt files**

In [121]:
df = pd.read_feather('ratings.fth')
df.shape

(100480507, 3)

In [122]:
df.describe()

Unnamed: 0,movie,user,rating
count,100480500.0,100480500.0,100480500.0
mean,9070.915,1322489.0,3.60429
std,5131.891,764536.8,1.085219
min,1.0,6.0,1.0
25%,4677.0,661198.0,3.0
50%,9051.0,1319012.0,4.0
75%,13635.0,1984455.0,4.0
max,17770.0,2649429.0,5.0


In [107]:
# df = df.pivot(index='user', columns='movie', values='rating')
# df

In [108]:
# df.fillna(0, inplace=True)
# df

In [109]:
# R = np.array(df)
# R.shape

In [110]:
df

Unnamed: 0,movie,user,rating
0,1,1488844,3
1,1,822109,5
2,1,885013,4
3,1,30878,4
4,1,823519,3
5,1,893988,3
6,1,124105,4
7,1,1248029,3
8,1,1842128,4
9,1,2238063,3


In [111]:
# Randomly sample elements from the dataframe

# df_sample = df.sample(n=100)
# df = df_sample

In [112]:
# df.fillna(0, inplace=True)
# df

**Creating a dummy small dataframe to implement formulas**

In [113]:
# df = pd.DataFrame({'user': np.random.choice(list(range(1,49)), size=150), 'movie': np.random.choice(list(range(1,18)), size=150), 'rating': np.random.choice([1, 2, 3, 4, 5, np.nan], size=150)})

# subset = df[['user', 'movie', 'rating']]
# review_tuples = [tuple(x) for x in subset.values]

# unique_reviews = set(review_tuples)
# set_100 = list(set(itertools.islice(unique_reviews, 100)))

# df = pd.DataFrame(set_100, columns=['user', 'movie', 'rating'])
# df

### Section 4.2 on Paper

**Global Effects**   
We start with a few global effects that are easy to measure and publish accurately without incurring substantial privacy cost. We first measure and publish the number of ratings present for each movie, and the sum of ratings for each movie, with random noise added for privacy.   
We use these to derive a global average, G = GSum/GCnt

In [114]:
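# Noise is left at 0 here as a placeholder; a private release would add random noise
noise = 0
GSum = df["rating"].sum() + noise  # vectorized sum; missing ratings are skipped
GCnt = df["rating"].agg(['count']) + noise  # one-element Series labeled 'count'
G = GSum/GCnt
G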

count    3.60429
Name: rating, dtype: float64
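
The cell above leaves `noise` at 0. As a minimal sketch of what a noisy release of G = GSum/GCnt could look like, the example below adds independent Gaussian noise to the sum and to the count. The helper name `noisy_global_average`, the toy ratings, and the value of `sigma` are assumptions made for illustration; they are not calibrated to any privacy budget.

```python
import numpy as np
import pandas as pd

def noisy_global_average(ratings, sigma, rng):
    """Return (GSum, GCnt, G) with independent Gaussian noise on sum and count."""
    g_sum = ratings.sum() + rng.normal(0.0, sigma)    # noisy sum of ratings
    g_cnt = ratings.count() + rng.normal(0.0, sigma)  # noisy number of ratings
    return g_sum, g_cnt, g_sum / g_cnt

# Toy data, not the Netflix ratings
toy = pd.Series([3, 5, 4, 4, 3], dtype=float)
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
g_sum, g_cnt, g = noisy_global_average(toy, sigma=0.1, rng=rng)
g_exact = toy.sum() / toy.count()  # noise-free reference value
```

With small `sigma` and many ratings, `g` stays close to `g_exact`, which is why the global average is cheap to publish.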

**Movie Effects**

In [115]:
number_of_movies = df["movie"].nunique()
number_of_users = df["user"].nunique()
print("Number of Movies: ", number_of_movies)
print("Number of Users: ", number_of_users)

Number of Movies:  17770
Number of Users:  480189


Next, we sum and count the number of ratings for each movie, using d-dimensional vector sums.

In [116]:
# Generate vectors of noise for each movie
sigma = .1
noisemovies = np.random.normal(0, sigma, number_of_movies)
noisemovies

array([ 0.07771005, -0.12405949,  0.03323336, ...,  0.07325741,
       -0.09719451,  0.05212507])

In [117]:
# number of fictitious ratings to introduce in the movie average calculation
betam = 15.0 
# number of fictitious ratings to introduce in the user average calculation
betap = 20.0
# bound of the interval used to clamp the resulting centered ratings, to limit sensitivity
B = 1.0 

We produce a stabilized per-movie average rating by introducing βm fictitious ratings at value G for each movie.
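
To make the stabilization concrete, here is a small sketch on toy data. The function name `stabilized_movie_average` and the toy ratings are hypothetical, and the noise terms are set to zero so the arithmetic can be checked by hand.

```python
import pandas as pd

def stabilized_movie_average(df, G, betam, noise_sum, noise_cnt):
    """Compute (MSum + betam*G) / (MCnt + betam) for each movie."""
    grouped = df.groupby("movie")["rating"]
    m_sum = grouped.sum() + noise_sum    # (noisy) per-movie sum of ratings
    m_cnt = grouped.count() + noise_cnt  # (noisy) per-movie number of ratings
    return (m_sum + betam * G) / (m_cnt + betam)

# Toy data: movie 1 has three ratings of 5, movie 2 has a single rating of 1
toy = pd.DataFrame({
    "movie":  [1, 1, 1, 2],
    "rating": [5.0, 5.0, 5.0, 1.0],
})
G = toy["rating"].mean()  # 4.0 on this toy data
mavg = stabilized_movie_average(toy, G, betam=15.0, noise_sum=0.0, noise_cnt=0.0)
```

Movie 2 has a raw average of 1.0, but with 15 fictitious ratings at G = 4.0 its stabilized average is 61/16 ≈ 3.81. Movies with few ratings are pulled strongly toward G, which also limits how much noise can distort them.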

In [118]:
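# Noisy per-movie rating sums and counts, each with its own independent noise draw
MSum = df.groupby('movie')['rating'].sum() + noisemovies
MCnt = df.groupby('movie')['rating'].agg(['count'])
MCnt = MCnt.iloc[:, 0] + np.random.normal(0, sigma, number_of_movies)
# Stabilize with betam fictitious ratings at the global average G
Mavg = (MSum + betam*float(G.iloc[0]))/(MCnt + betam)
Mavg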

movie
1        3.745287
2        3.564891
3        3.640837
4        2.822094
5        3.915616
6        3.091947
7        2.333709
8        3.190195
9        2.754264
10       3.204908
11       3.072145
12       3.422599
13       4.449738
14       3.091955
15       3.302463
16       3.101343
17       2.904721
18       3.784103
19       3.332056
20       3.196399
21       3.471885
22       2.339156
23       3.556699
24       3.000678
25       3.965993
26       2.795815
27       3.531008
28       3.823172
29       3.598466
30       3.761825
           ...   
17741    3.289083
17742    2.821749
17743    3.106078
17744    3.517925
17745    3.829231
17746    3.338484
17747    3.527844
17748    3.804292
17749    3.506298
17750    2.974474
17751    3.936168
17752    3.004472
17753    2.566492
17754    3.256164
17755    3.216535
17756    3.770213
17757    3.794483
17758    2.918544
17759    2.735203
17760    2.834322
17761    2.920704
17762    3.645722
17763    3.413341
17764    3.867041
1776

In [119]:
# Turn the Series of averages into a two-column DataFrame (movie, avg_rating)
Mavg = Mavg.to_frame('avg_rating').reset_index()
del MCnt, MSum

With these averages now released, they can be incorporated arbitrarily into subsequent computation with no additional privacy cost. In particular, we can subtract the corresponding averages from every rating to remove the per-movie global effects.

In [120]:
Mavg

Unnamed: 0,movie,avg_rating
0,1,3.745287
1,2,3.564891
2,3,3.640837
3,4,2.822094
4,5,3.915616
5,6,3.091947
6,7,2.333709
7,8,3.190195
8,9,2.754264
9,10,3.204908
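
Removing the per-movie effect amounts to a join on `movie` followed by a subtraction. The sketch below uses small made-up stand-ins for `df` and `Mavg`; the column name `centered_rating` is an assumption introduced here for illustration.

```python
import pandas as pd

# Toy stand-ins for the notebook's `df` and `Mavg` (values are made up)
ratings = pd.DataFrame({
    "movie":  [1, 1, 2, 2],
    "user":   [10, 11, 10, 12],
    "rating": [4.0, 2.0, 5.0, 3.0],
})
movie_avg = pd.DataFrame({"movie": [1, 2], "avg_rating": [3.5, 3.0]})

# Attach each movie's released average to its ratings, then subtract it
centered = ratings.merge(movie_avg, on="movie", how="left")
centered["centered_rating"] = centered["rating"] - centered["avg_rating"]
```

Because the averages are already published, this step only post-processes released values and the raw ratings of each row.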


### Section 4.3 on Paper

**User Effects**   
Having published the average rating for each movie, we will subtract these averages from each rating before continuing. We then center the ratings for each user, taking an average again with a number βp of fictitious ratings at the recomputed global average
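
One possible reading of this step, sketched on toy data: recompute the global average over the movie-centered ratings, form a per-user average stabilized by βp fictitious ratings at that value, subtract it, and clamp the result to [-B, B] as the comment on `B` earlier suggests. The function name `center_per_user` and the column names are hypothetical, and no noise is added, so this illustrates the arithmetic only.

```python
import pandas as pd

def center_per_user(centered, betap, B):
    """Subtract a stabilized per-user average, then clamp to [-B, B]."""
    # Global average recomputed on the movie-centered ratings
    g_centered = centered["centered_rating"].mean()
    by_user = centered.groupby("user")["centered_rating"]
    # betap fictitious ratings at g_centered stabilize each user's average
    user_avg = (by_user.sum() + betap * g_centered) / (by_user.count() + betap)
    out = centered.copy()
    out["user_avg"] = out["user"].map(user_avg)
    out["final_rating"] = (out["centered_rating"] - out["user_avg"]).clip(-B, B)
    return out

# Toy movie-centered ratings (made up): their overall mean is 0
toy = pd.DataFrame({
    "user": [10, 10, 11],
    "centered_rating": [2.0, 2.0, -4.0],
})
result = center_per_user(toy, betap=2.0, B=1.0)
```

Here user 10 gets a stabilized average of 4/4 = 1.0 and user 11 gets -4/3; after subtraction and clamping with B = 1.0, every final rating lies in [-1, 1].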