# Counting mutual friends

For each user whose ID appears in the column userId, count the number of friends he or she has in common with every other user whose ID appears in that column.

Print the 49 pairs of users with the largest number of common friends, sorted in descending order by the common friends count, then by the ID of user1, and finally by the ID of user2. The format is as follows: "count user1 user2"

Example:

    234 54719 767867
    120 54719 767866
    97 50787 327676
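
The required ordering can be written as a single sort key. Below is a minimal sketch in plain Python, assuming the results are already available as `(count, user1, user2)` tuples. The helper names `top_pairs` and `format_row` and the sample rows are hypothetical and only serve as an illustration.

```python
def top_pairs(rows, n=49):
    """Return the n best rows: count descending, then user1 and user2 ascending."""
    # Negating the count makes the sort descending for that field only;
    # the user IDs still sort in ascending order.
    return sorted(rows, key=lambda r: (-r[0], r[1], r[2]))[:n]


def format_row(row):
    """Render one row in the 'count user1 user2' output format."""
    return "%d %d %d" % row


# Hypothetical rows; the two rows with count 120 exercise the tie-break.
sample = [(120, 54719, 767866), (234, 54719, 767867), (120, 50787, 327676)]
lines = [format_row(r) for r in top_pairs(sample)]
```

With equal counts, the row with the smaller user1 comes first, which is the behaviour the task asks for.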

To solve this task use the algorithm described in the last video of lesson 1. The overall plan could look like this:

1. Create a new column “friend” by exploding the column “friends” (as in the demo IPython notebook).
2. Group the resulting dataframe by the column “friend” (as in the demo IPython notebook).
3. Create a column “users” by collecting together all users that share the same ID in the column “friend” (as in the demo IPython notebook).
4. Sort the elements in the column “users” with the function sort_array.
5. Keep only the rows which have more than 1 element in the column “users”.
6. For each row, emit all possible ordered pairs of users from the column “users” (tip: write a user-defined function for this).
7. Count the number of times each pair has appeared.
8. Using a window function (as in the demo IPython notebook), select the 49 pairs of users with the largest number of common friends.
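
Steps 1 to 7 of the plan can be tried out on a tiny in-memory graph before moving to Spark. The following is a rough sketch in plain Python, not the Spark solution itself. The function name `count_common_friends` and the toy graph are hypothetical, and the sketch assumes each user lists a given friend at most once.

```python
from collections import Counter, defaultdict


def count_common_friends(graph):
    """Count common friends per user pair.

    `graph` maps a user ID to the list of that user's friend IDs.
    Returns a Counter keyed by (user1, user2) with user1 < user2.
    """
    # Steps 1-3: invert the adjacency lists (friend -> users listing that friend).
    users_by_friend = defaultdict(list)
    for user, friends in graph.items():
        for friend in friends:
            users_by_friend[friend].append(user)

    pair_counts = Counter()
    for users in users_by_friend.values():
        # Step 5: a friend listed by a single user produces no pair.
        if len(users) < 2:
            continue
        # Step 4: sort so every emitted pair has the smaller ID first.
        users = sorted(users)
        # Steps 6-7: emit each pair once and count it.
        for i in range(len(users)):
            for j in range(i + 1, len(users)):
                pair_counts[(users[i], users[j])] += 1
    return pair_counts


# Hypothetical toy graph: users 1, 2 and 3 share some friends, user 4 shares none.
toy_graph = {1: [10, 11, 12], 2: [10, 11], 3: [11, 12], 4: [13]}
counts = count_common_friends(toy_graph)
```

Grouping by friend first means a pair is only ever generated for users who actually share that friend, so pairs with no common friends are never materialised.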

The sample dataset is located at /data/graphDFSample.

The part of the result on the sample dataset:

    ...
    3044 21864412 51640390
    3021 17139850 51640390
    3010 14985079 51640390
    2970 17139850 21864412
    2913 20158643 27967558
    ...


In [1]:
import sys
if sys.version_info.major == 3:
    import os
    os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python'
    os.environ['PYTHONHASHSEED'] = '42'

In [2]:
%run /usr/local/spark/python/pyspark/shell.py

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Python version 3.6.2 (default, Jul 23 2017 22:59:30)
SparkSession available as 'spark'.


In [3]:
from pyspark.sql import SparkSession
sparkSession = SparkSession.builder.enableHiveSupport().master("local[2]").getOrCreate()

In [4]:
graphPath = "/data/graphDFSample"
db = sparkSession.read.parquet(graphPath)

In [5]:
db.limit(5).toPandas()

Unnamed: 0,user,friends
0,22991438,"[20699, 175973, 533235, 584091, 610338, 652317..."
1,37586597,"[83616, 139192, 165978, 184552, 228332, 277633..."
2,56325000,"[504270, 645333, 933904, 1137277, 1209847, 172..."
3,12862761,"[234344, 5991561, 6039721, 6832532, 19429321, ..."
4,38989299,"[47992, 83113, 709903, 716694, 839792, 1276790..."


In [6]:
# Create a new column “friend” by exploding the column “friends”

from pyspark.sql.functions import explode

db10 = db.withColumn("friend", explode("friends"))
db10.limit(5).toPandas()

Unnamed: 0,user,friends,friend
0,22991438,"[20699, 175973, 533235, 584091, 610338, 652317...",20699
1,22991438,"[20699, 175973, 533235, 584091, 610338, 652317...",175973
2,22991438,"[20699, 175973, 533235, 584091, 610338, 652317...",533235
3,22991438,"[20699, 175973, 533235, 584091, 610338, 652317...",584091
4,22991438,"[20699, 175973, 533235, 584091, 610338, 652317...",610338


In [7]:
# group the resulting dataframe by the column “friend”
# create a column “users” by collecting all users with the same id in the column “friend” together 

from pyspark.sql.functions import collect_list

db15 = db10.groupBy("friend").agg(collect_list("user").alias("users"))
db15.limit(5).toPandas()

Unnamed: 0,friend,users
0,148,"[65051219, 14631101, 3195315, 14957568]"
1,5518,[58573511]
2,9900,[36844066]
3,10362,[65278216]
4,11458,[39169321]


In [8]:
# Filter only the rows which have more than 1 element in the column “users”.
# The sort_array step is skipped here: the UDF in the next cell only emits
# pairs with u1 < u2, so every pair already has a canonical order.

from pyspark.sql.functions import size, col

db20 = db15.withColumn("users_size", size("users")).filter(col("users_size") > 1)
db20.limit(5).toPandas()

Unnamed: 0,friend,users,users_size
0,148,"[65051219, 14631101, 3195315, 14957568]",4
1,36538,"[57354452, 20686207, 41660921, 63987222, 63305...",32
2,41751,"[60873111, 41811068]",2
3,49331,"[45058971, 58571716]",2
4,73470,"[49852791, 37445156]",2


In [9]:
# for each row emit all possible ordered pairs of users from the column “users”

from pyspark.sql.types import ArrayType, IntegerType, StructType, StructField
from pyspark.sql.functions import udf

upairType = StructType([
    StructField("u1", IntegerType()), StructField("u2", IntegerType())
])

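def user_pairs_func(users):
    # Emit each unordered pair exactly once, smaller user ID first,
    # so that (a, b) and (b, a) are counted as the same pair.
    pairs = []
    for u1 in users:
        for u2 in users:
            if u1 < u2:
                pairs.append([u1, u2])
    return pairs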

user_pairs_udf = udf(user_pairs_func, ArrayType(upairType))

db25 = db20.withColumn('uparis', user_pairs_udf(col("users")))
db25.persist()
db25.limit(5).toPandas()

Unnamed: 0,friend,users,users_size,uparis
0,148,"[65051219, 14631101, 3195315, 14957568]",4,"[(14631101, 65051219), (14631101, 14957568), (..."
1,36538,"[57354452, 20686207, 41660921, 63987222, 63305...",32,"[(57354452, 63987222), (57354452, 63305254), (..."
2,41751,"[60873111, 41811068]",2,"[(41811068, 60873111)]"
3,49331,"[45058971, 58571716]",2,"[(45058971, 58571716)]"
4,73470,"[49852791, 37445156]",2,"[(37445156, 49852791)]"
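
The nested loops in `user_pairs_func` compare every user with every other user and keep half of the comparisons. One possible alternative, sketched here in plain Python, is `itertools.combinations` over a sorted list, which yields each pair once with the smaller ID first. The name `user_pairs_sorted` is hypothetical, and the sketch assumes the IDs within one list are distinct.

```python
from itertools import combinations


def user_pairs_sorted(users):
    """Return all (u1, u2) pairs with u1 < u2 from a list of distinct user IDs."""
    # combinations() preserves input order, so sorting first guarantees
    # that the smaller ID comes first in every pair.
    return list(combinations(sorted(users), 2))


# The "users" list of friend 148 from the output above.
pairs = user_pairs_sorted([65051219, 14631101, 3195315, 14957568])
```

For a list of n users this produces n * (n - 1) / 2 pairs, the same set as the nested loops, but without visiting the discarded half.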


In [10]:
db30 = db25.withColumn("upair", explode("uparis"))
db30.limit(5).toPandas()

Unnamed: 0,friend,users,users_size,uparis,upair
0,148,"[65051219, 14631101, 3195315, 14957568]",4,"[(14631101, 65051219), (14631101, 14957568), (...","(14631101, 65051219)"
1,148,"[65051219, 14631101, 3195315, 14957568]",4,"[(14631101, 65051219), (14631101, 14957568), (...","(14631101, 14957568)"
2,148,"[65051219, 14631101, 3195315, 14957568]",4,"[(14631101, 65051219), (14631101, 14957568), (...","(3195315, 65051219)"
3,148,"[65051219, 14631101, 3195315, 14957568]",4,"[(14631101, 65051219), (14631101, 14957568), (...","(3195315, 14631101)"
4,148,"[65051219, 14631101, 3195315, 14957568]",4,"[(14631101, 65051219), (14631101, 14957568), (...","(3195315, 14957568)"


In [11]:
# count the number of times each pair has appeared

from pyspark.sql.functions import count, lit

db35 = db30.groupBy("upair").agg(count(lit(1)).alias("pairCnt"))
db35.limit(5).toPandas()

Unnamed: 0,upair,pairCnt
0,"(1823, 359870)",270
1,"(1823, 675948)",70
2,"(3094, 31823602)",202
3,"(3094, 37723514)",747
4,"(3094, 39021271)",398


In [12]:
db35.persist()

DataFrame[upair: struct<u1:int,u2:int>, pairCnt: bigint]

In [13]:
# with the help of the window function (like in the demo python notebook)
# select 49 pairs of users who have the biggest amount of common friends

# The window-function version from the demo notebook is kept below for
# reference only (it is not executed); a plain orderBy with take(49) is enough here.
"""
window = Window.orderBy(col("pairCnt").desc())
top49 = db35.withColumn("row_number", row_number().over(window)) \
            .filter(col("row_number") < 50) \
            .select(col("friend"), col("users_size")) \
            .orderBy(col("users_size").desc()) \
            .collect()
"""

# Break ties on the count by the user IDs, as the task requires.
top49 = db35.orderBy(col("pairCnt").desc(), col("upair.u1"), col("upair.u2")).take(49)

In [14]:
for val in top49:
    ((u1, u2), cnt) = val
    print('%s %s %s' % (cnt, u1, u2))

3206 27967558 42973992
3130 20158643 42973992
3066 22582764 42973992
3044 21864412 51640390
3021 17139850 51640390
3010 14985079 51640390
2970 17139850 21864412
2913 20158643 27967558
2903 22280814 51151280
2870 23848749 51640390
2855 20158643 22582764
2849 20158643 44996025
2846 22280814 42973992
2784 21864412 23848749
2779 31964081 51640390
2776 39205988 51640390
2754 17139850 23848749
2749 22582764 27967558
2728 50561859 51640390
2724 15485897 51640390
2700 28135661 42973992
2655 22280814 27967558
2653 42973992 43548989
2639 26755857 51640390
2621 14635589 51640390
2608 15485897 17139850
2606 17139850 26755857
2601 21864412 39205988
2600 8406745 51640390
2599 37735419 51640390
2597 20158643 28135661
2585 21864412 31964081
2585 40003405 42973992
2581 27967558 43548989
2579 23848749 31964081
2578 15485897 21864412
2578 27967558 28135661
2577 42973992 64755069
2574 51151280 57077210
2573 20158643 43548989
2566 21864412 26755857
2564 22280814 64755069
2561 42973992 44996025
2556 1713985