# Monoid

M = (T, f, Zero) is a monoid, where

T is a data type

f() is a binary operation: f: (T, T) -> T

Zero: T (an instance of T)

The Zero is an identity (neutral) element of type T and does not necessarily mean the number zero. With the properties specified below, the triple (T, f, Zero) is called a monoid. Here are the monoid properties:

Let a, b, c, and Zero be values of type T.

Then the following properties must hold:

# Binary operation:

$f: (T, T) \rightarrow T$

# Neutral element:

for all a in T:

$f(Zero, a) = a$

$f(a, Zero) = a$

# Associativity:

for all a, b, c in T:

$f(f(a, b), c) = f(a, f(b, c))$
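The two laws above can be spot-checked in code. The following is a minimal sketch, assuming a hypothetical helper named `is_monoid` (not part of any library) that tests the neutral-element and associativity laws on a finite list of sample values. Passing the check is only evidence, not a proof, because the laws must hold for every value of T.

```python
from itertools import product

def is_monoid(f, zero, samples):
    """Spot-check the monoid laws for (f, zero) on a finite list of samples."""
    # Neutral element: f(zero, a) == a and f(a, zero) == a
    identity_ok = all(f(zero, a) == a and f(a, zero) == a for a in samples)
    # Associativity: f(f(a, b), c) == f(a, f(b, c)) for every triple of samples
    assoc_ok = all(
        f(f(a, b), c) == f(a, f(b, c))
        for a, b, c in product(samples, repeat=3)
    )
    return identity_ok and assoc_ok

samples = [-2, 0, 1, 5]
print(is_monoid(lambda a, b: a + b, 0, samples))  # True
print(is_monoid(lambda a, b: a - b, 0, samples))  # False
```

Subtraction fails already on the neutral-element law, since f(0, a) = -a, which differs from a whenever a is not 0.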

# Examples

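Before looking at monoids, consider a counter-example. The mean of two numbers is not associative, so it cannot be the binary operation of a monoid:

mean(10, mean(30, 50)) != mean(mean(10, 30), 50)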

where

   mean(10, mean(30, 50))

      = mean (10, 40)

      = 25

   mean(mean(10, 30), 50)

      = mean (20, 50)

      = 35

   25 != 35
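The arithmetic above can be reproduced with a few lines of Python. This is a small sketch in which `mean` is defined here as the two-argument average used in the example:

```python
def mean(a, b):
    # Average of exactly two numbers, as in the worked example
    return (a + b) / 2

left = mean(10, mean(30, 50))    # mean(10, 40.0)
right = mean(mean(10, 30), 50)   # mean(20.0, 50)

print(left, right)    # 25.0 35.0
print(left == right)  # False
```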

# Integers with addition:

1)  f: (Integer, Integer) -> Integer


2)  Zero element : 0
    
    $f(a,0)=f(0,a)=a$

3)  Associativity
    
    $f(a,b) = a + b$

    $f(f(a,b),c)= f(a+b,c)= (a+b)+c$

    $f(a,f(b,c))=f(a,b+c)= a+(b+c)$

    Both sides are equal because $(a+b)+c = a+(b+c)$ for all integers.




# Integers with multiplication

1)  f: (Integer, Integer) -> Integer


2)  Zero element : 1
    
    $f(a,1)=f(1,a)=a$

3)  Associativity
    
    $f(a,b) = a * b$

    $f(f(a,b),c)= f(a*b,c)= (a*b)*c$

    $f(a,f(b,c))=f(a,b*c)= a*(b*c)$

    Both sides are equal because $(a*b)*c = a*(b*c)$ for all integers.
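One practical consequence of the two integer monoids is that a list can be folded in pieces and the partial results combined afterwards. The sketch below uses `functools.reduce` and the `operator` module from the standard library. The sample list and the way it is split into two "partitions" are arbitrary choices made for illustration.

```python
from functools import reduce
import operator

numbers = [2, 3, 4, 5]

# Fold left to right, starting from the Zero element of each monoid.
total = reduce(operator.add, numbers, 0)    # (((0 + 2) + 3) + 4) + 5
product = reduce(operator.mul, numbers, 1)  # (((1 * 2) * 3) * 4) * 5

# Split into two "partitions", fold each one, then combine the partial results.
left, right = numbers[:2], numbers[2:]
total_split = reduce(operator.add, left, 0) + reduce(operator.add, right, 0)
product_split = reduce(operator.mul, left, 1) * reduce(operator.mul, right, 1)

print(total, total_split)      # 14 14
print(product, product_split)  # 120 120

# An empty partition folds to the Zero element, so it does not affect the result.
print(reduce(operator.add, [], 0), reduce(operator.mul, [], 1))  # 0 1
```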

# Strings with concatenation

(a + (b + c)) = ((a + b) + c)

"" + s = s

s + "" = s

The Zero element for concatenation is an empty string of size 0.

# Lists with concatenation
List(a, b) + List(c, d) = List(a, b, c, d)

The Zero element is the empty list List().

# Sets with their union:

Set(1,2,3) + Set(2,4,5)
    
    = Set(1,2,3,2,4,5)
    = Set(1,2,3,4,5)

Zero element:

S + {} = S

{} + S = S

The Zero element is an empty set {}.
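The string, list, and set monoids above map directly onto Python built-ins. The following sketch folds a few sample values with `functools.reduce`, using `+` for strings and lists and `|` for set union. The sample data is made up for illustration.

```python
from functools import reduce

words = ["mon", "oid", "s"]
lists = [[1, 2], [3], [4, 5]]
sets = [{1, 2, 3}, {2, 4, 5}]

# Each fold starts from the Zero element of its monoid.
joined = reduce(lambda a, b: a + b, words, "")    # Zero: empty string
flat = reduce(lambda a, b: a + b, lists, [])      # Zero: empty list
union = reduce(lambda a, b: a | b, sets, set())   # Zero: empty set

print(joined)         # monoids
print(flat)           # [1, 2, 3, 4, 5]
print(sorted(union))  # [1, 2, 3, 4, 5]
```

Set union is also commutative, while string and list concatenation are not: "ab" + "cd" differs from "cd" + "ab". A monoid only requires associativity and a neutral element.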


# Non-Monoid Examples

Integers with mean function:

   mean(mean(a,b),c) != mean(a, mean(b,c))
   
Integers with subtraction:

   ((a - b) -c) != (a - (b - c))

Integers with division:

   ((a / b) / c) != (a / (b / c))

Integers with mode:

mode(mode(a, b), c) != mode(a, mode(b, c))

Integers with median:

median(median(a, b), c) != median(a, median(b, c))
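A single counter-example is enough to show that an operation is not associative. The sketch below picks arbitrary values for a, b, and c and evaluates both groupings for subtraction, division, and the two-argument median (using `statistics.median` from the standard library).

```python
from statistics import median

a, b, c = 20, 10, 2

# Subtraction: the two groupings disagree.
print((a - b) - c, a - (b - c))  # 8 12

# Division: the two groupings disagree.
print((a / b) / c, a / (b / c))  # 1.0 4.0

# Median of two numbers behaves like their mean, so it is not associative either.
left = median([median([10, 30]), 50])
right = median([10, median([30, 50])])
print(left, right)  # 35.0 25.0
```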

# Coding

To compute the mean of ratings per user, we can use a commutative monoid (a monoid whose operation is also commutative) such as a pair of (sum, count), where sum is the total of all ratings visited so far (per partition) and count is the number of ratings visited so far. The mean itself is not a monoid, but it can be derived from the pair at the end:

$$mean(pair(sum, count)) = sum / count$$


1)  Type

    $A = (sum, count)$

    $f(A_1, A_2) = (A_1.sum + A_2.sum, A_1.count + A_2.count)$, where $A_1$ and $A_2$ are of type $A$


2)  The Zero element is (0.0, 0)
    
    $f(A, Zero) = A$
    
    $f(Zero, A) = A$
    
    f(A, Zero)
    = (sum+0.0, count+0)
    = (sum, count)
    = A

    f(Zero, A)
    = (0.0+sum, 0+count)
    = (sum, count)
    = A

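3)  Associativity

    For X, Y, Z of type A:

    $f(f(X, Y), Z) = f((X.sum+Y.sum, X.count+Y.count), Z) = ((X.sum+Y.sum)+Z.sum, (X.count+Y.count)+Z.count)$

    $f(X, f(Y, Z)) = f(X, (Y.sum+Z.sum, Y.count+Z.count)) = (X.sum+(Y.sum+Z.sum), X.count+(Y.count+Z.count))$

    Both results are equal because addition of numbers is associative.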


4)  Commutativity

    $f(X, Y) = (X.sum + Y.sum, X.count + Y.count)$

    $f(Y, X) = (Y.sum + X.sum, Y.count + X.count)$

    $f(X, Y) = f(Y, X)$
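The (sum, count) monoid can be tried out in plain Python before moving to Spark. This is a minimal sketch: the names `ZERO`, `combine`, and `mean` are chosen here for illustration, and the ratings and their split into three partitions are hypothetical.

```python
from functools import reduce

ZERO = (0.0, 0)  # the Zero element (sum, count)

def combine(x, y):
    """f((s1, c1), (s2, c2)) = (s1 + s2, c1 + c2)"""
    return (x[0] + y[0], x[1] + y[1])

def mean(pair):
    total, count = pair
    return total / count

# Hypothetical ratings for one user, spread over three partitions.
partitions = [[4.0, 3.5], [5.0], [2.0, 3.0, 4.5]]

# Inside a partition: fold each rating r into a (sum, count) pair, starting from ZERO.
partials = [
    reduce(lambda acc, r: combine(acc, (r, 1)), part, ZERO)
    for part in partitions
]

# Across partitions: combine the partial pairs in either order.
forward = reduce(combine, partials, ZERO)
backward = reduce(combine, reversed(partials), ZERO)

print(partials)           # [(7.5, 2), (5.0, 1), (9.5, 3)]
print(forward, backward)  # (22.0, 6) (22.0, 6)
print(mean(forward))
```

Floating-point addition is only approximately associative, so with real data the sums from different groupings may differ in the last digits. The sample values here were picked so that every sum is exact.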






# aggregateByKey()

First, take a look at the signature of aggregateByKey() in simplified form:

aggregateByKey(zero_value, seq_func, comb_func)


1) Create an A from zero_value (the so-called initial value) for each key within a partition

2) Merge a V and an A into a single A (inside a partition), using seq_func

3) Combine two A's into a single A (combining two partitions), using comb_func

V is a value of the source RDD (here, a rating) and A is the combined data structure, which in our case is a pair of (sum, count).
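The three steps can be modeled without a cluster. The sketch below is a toy, single-machine function named `aggregate_by_key`, written here for illustration. It is not Spark's implementation, and the (userID, rating) pairs are hypothetical. It only mirrors the roles of zero_value, seq_func, and comb_func.

```python
def aggregate_by_key(partitions, zero_value, seq_func, comb_func):
    """Toy model of aggregateByKey over a list of partitions of (key, value) pairs."""
    # Steps 1 and 2: inside each partition, start every key at zero_value
    # and merge each value V into that key's accumulator A.
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = seq_func(acc.get(key, zero_value), value)
        per_partition.append(acc)

    # Step 3: combine the per-partition accumulators key by key.
    result = {}
    for acc in per_partition:
        for key, a in acc.items():
            result[key] = comb_func(result[key], a) if key in result else a
    return result

# Hypothetical (userID, rating) pairs split across two partitions.
partitions = [
    [("u1", 4.0), ("u2", 3.0), ("u1", 5.0)],
    [("u2", 2.0), ("u1", 3.0)],
]

sum_count = aggregate_by_key(
    partitions,
    (0.0, 0),
    lambda A, V: (A[0] + V, A[1] + 1),              # seq_func
    lambda A1, A2: (A1[0] + A2[0], A1[1] + A2[1]),  # comb_func
)
print(sum_count)  # {'u1': (12.0, 3), 'u2': (5.0, 2)}
```

Because (sum, count) with pairwise addition is a commutative monoid, the result does not depend on how the records are split into partitions or on the order in which partial results are combined.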

In [1]:
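def create_pair(rating_record):
    # record format: userId,movieId,rating,timestamp (MovieLens ratings.csv)
    tokens = rating_record.split(",")
    userID = tokens[0]
    rating = float(tokens[2])
    # key-value pair used by aggregateByKey()
    return (userID, rating)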

In [5]:
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName("monoid").master("local[*]").getOrCreate()

In [7]:
ratings_path = "C:/Users/malam/development/data/spark/ml-25m/ratings.csv.no.header"
rdd = spark.sparkContext.textFile(ratings_path)
# parse each record into a (userID, rating) pair
ratings = rdd.map(create_pair)
ratings.count()

25000095

In [10]:
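sum_count = ratings.aggregateByKey(
    # zero value: the (sum, count) pair before any rating is seen
    (0.0, 0),
    # seq_func: merge one rating V into an accumulator C (inside a partition)
    (lambda C, V: (C[0]+V, C[1]+1)),
    # comb_func: combine two accumulators from different partitions
    (lambda C1, C2: (C1[0]+C2[0], C1[1]+C2[1]))
)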

sum_count.take(10)

[('2', (668.0, 184)),
 ('13', (1466.5, 412)),
 ('24', (100.5, 25)),
 ('33', (93.0, 23)),
 ('70', (652.0, 196)),
 ('76', (731.5, 182)),
 ('77', (140.0, 45)),
 ('111', (101.5, 23)),
 ('113', (469.5, 133)),
 ('119', (398.0, 124))]

In [9]:
# mean rating per user = sum / count
average_rating = sum_count.mapValues(lambda x: x[0]/x[1])