# Assignment 2: Mining Itemsets (Part V)


## Itemset Similarity
Recall from Week 1 that patterns and similarities are two basic outputs of data mining. So far, we have been working with patterns: frequent itemsets and association rules can both be seen as "patterns". In this last part, let's work on itemset similarity.

### Exercise 6. Jaccard Similarity (20 pts)
Jaccard similarity is a simple but powerful measurement of itemset similarity, defined as follows:
$$\text{Jaccard\_similarity}(A, B) = \frac{|A\cap B|}{|A\cup B|}$$
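
As a quick worked example of the definition, take two illustrative sets (they are assumptions for this example, not part of the assignment data), $A = \{a, b, c\}$ and $B = \{b, c, d\}$. They share two items out of four distinct items overall:

```latex
\[
\frac{|A\cap B|}{|A\cup B|}
  = \frac{|\{b, c\}|}{|\{a, b, c, d\}|}
  = \frac{2}{4}
  = 0.5
\]
```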

#### Exercise 6(a) (10 pts)
Complete the following function to calculate the Jaccard similarity between two sets. You may assume that at least one of the sets is not empty.

In [None]:
def jaccard_similarity(set_a, set_b):
    # Sizes of the intersection and the union of the two sets.
    i = len(set_a.intersection(set_b))
    u = len(set_a.union(set_b))
    # The union is non-empty because at least one set is non-empty.
    return i / u
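
The exercise lets you assume that at least one set is non-empty, so the division is always defined. If that assumption were dropped, the 0/0 case would need a convention. The sketch below is one possible defensive variant; the name `jaccard_similarity_safe` and the `empty_value` parameter are hypothetical, and returning 1.0 for two empty sets is an assumed convention rather than something the assignment specifies.

```python
def jaccard_similarity_safe(set_a, set_b, empty_value=1.0):
    """Jaccard similarity that also handles two empty sets.

    `empty_value` is an assumed convention for the undefined 0/0 case.
    """
    union_size = len(set_a | set_b)
    if union_size == 0:
        # Both sets are empty, so the ratio is undefined; fall back to the convention.
        return empty_value
    return len(set_a & set_b) / union_size


print(jaccard_similarity_safe({"a", "b"}, {"b", "c"}))
print(jaccard_similarity_safe(set(), set()))
```

Whether two empty sets should count as identical (1.0) or as sharing nothing (0.0) depends on the application, which is why the value is left as a parameter here.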

In [11]:
# This code block tests whether the `jaccard_similarity` function works as expected.
# We hide some tests, so passing all the displayed assertions does not guarantee the bonus points.

assert abs(jaccard_similarity(set(['🍇', '🍈', '🍉']), set(['🍊', '🍋', '🍌'])) - 0) < 1e-8
assert abs(jaccard_similarity(set(['🍇', '🍈', '🍎']), set(['🍊', '🍋', '🍎'])) - 0.2) < 1e-8
assert abs(jaccard_similarity(set(['🍇', '🍒', '🍎']), set(['🍊', '🍒', '🍎'])) - 0.5) < 1e-8
assert abs(jaccard_similarity(set(['🍓', '🍒', '🍎']), set(['🍒', '🍎', '🍓'])) - 1) < 1e-8




#### Exercise 6(b) (10 pts)
A few questions to help you better understand the properties of Jaccard similarity:
1. What is the range of Jaccard similarity?
2. When is the max/min Jaccard similarity achieved?
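
One way to build intuition for these questions is to check them empirically. The sketch below is a minimal illustration, not part of the graded code: the helper `jaccard` is a hypothetical stand-in that assumes at least one set is non-empty, and the random sets are drawn from an arbitrary universe of ten integers.

```python
import random


def jaccard(set_a, set_b):
    # Hypothetical helper; assumes at least one of the sets is non-empty.
    return len(set_a & set_b) / len(set_a | set_b)


random.seed(0)
universe = range(10)
scores = []
for _ in range(1000):
    a = set(random.sample(universe, random.randint(1, 5)))
    b = set(random.sample(universe, random.randint(1, 5)))
    scores.append(jaccard(a, b))

# Every score lies in [0, 1] because 0 <= |A ∩ B| <= |A ∪ B|.
assert all(0.0 <= s <= 1.0 for s in scores)
# Disjoint sets give the smallest score; identical sets give the largest.
assert jaccard({1, 2}, {3, 4}) == 0.0
assert jaccard({1, 2}, {2, 1}) == 1.0
```

A random check like this only illustrates the bounds; the argument in the comment (the intersection is never larger than the union) is what actually establishes them.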

Set the following variables to the minimum and maximum values that a Jaccard similarity score can take (5 pts each).

In [22]:
# Please change min_value and max_value to your answers

min_value = -100
max_value = -100

# YOUR CODE HERE
min_value = 0.0
max_value = 1.0


In [23]:
# This code block evaluates whether min_value is assigned the correct value

In [24]:
# This code block evaluates whether max_value is assigned the correct value

With the Jaccard similarity function in place, we can now calculate the Jaccard similarity between any given Tweet and all other Tweets, and find the Tweets that are most similar in terms of the set of food/drink emojis used. (Not graded.)

In [25]:
import pandas as pd
import numpy as np

tweets_df = pd.read_csv("assets/food_drink_emoji_tweets.txt", sep="\t", header=None)
tweets_df.columns = ['text']

emoji_list = "🍇🍈🍉🍊🍋🍌🍍🥭🍎🍏🍐🍑🍒🍓🥝🍅🥥🥑🍆🥔🥕🌽🌶🥒🥬🥦🍄🥜🌰🍞🥐🥖🥨🥯🥞🧀🍖🍗🥩🥓🍔🍟🍕🌭🥪🌮🌯🥙🥚🍳🥘🍲🥣🥗🍿🧂🥫🍱🍘🍙🍚🍛🍜🍝🍠🍢🍣🍤🍥🥮🍡🥟🥠🥡🦀🦞🦐🦑🍦🍧🍨🍩🍪🎂🍰🧁🥧🍫🍬🍭🍮🍯🍼🥛☕🍵🍶🍾🍷🍸🍹🍺🍻🥂🥃"
emoji_set = set(emoji_list)

def extract_uniq_emojis(text):
    # Return the sorted unique food/drink emojis that appear in the text.
    return np.unique([ch for ch in text if ch in emoji_set])

tweets_df['emojis'] = tweets_df.text.apply(extract_uniq_emojis)

tweets_df['jaccard'] = tweets_df.emojis.apply(lambda x:jaccard_similarity(set(tweets_df.loc[0].emojis), set(x)))
tweets_df.sort_values('jaccard',ascending=False).head(n=10)

| | text | emojis | jaccard |
|---|---|---|---|
| 0 | RT @CalorieFixess: 🍗🌯🍔🍒 400 Calories https://t... | [🌯, 🍒, 🍔, 🍗] | 1.0 |
| 6800 | RT @levelscafeabuja: Chow! 🤩💦🍗🌯🍔 #LevelsCafeAb... | [🌯, 🍔, 🍗] | 0.75 |
| 9158 | RT @AStateRedWolves: ✅ Countertops: Installed ... | [🍔, 🍗] | 0.5 |
| 777 | @SunnyAnderson @rosannascotto I don’t think KF... | [🍔, 🍗] | 0.5 |
| 6226 | RT @yooojax: 3. Free Food 🍗🍔 will ready by 2PM... | [🍔, 🍗] | 0.5 |
| 7428 | Kicking off the weekend with a cheeky BBQ? Her... | [🍔, 🍗] | 0.5 |
| 7877 | @tafarireid07 Did you say bbq? 🔥🍔🍗🚙 | [🍔, 🍗] | 0.5 |
| 5328 | RT @MAPSTTU: @EtaUpAlphas is starting the seme... | [🍔, 🍗] | 0.5 |
| 5334 | RT @thatssochioma: You don’t want to miss this... | [🍔, 🍗] | 0.5 |
| 7788 | I’m hungry for chicken 🍗 wings or burrito 🌯 | [🌯, 🍗] | 0.5 |
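
To see how this ranking behaves without the dataset file, here is a minimal pure-Python sketch on toy data. The helper `jaccard`, the identifiers `t0` to `t3`, and the emoji sets are all hypothetical stand-ins chosen for illustration; they are not rows from `food_drink_emoji_tweets.txt`.

```python
def jaccard(set_a, set_b):
    # Hypothetical helper; assumes at least one of the sets is non-empty.
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical stand-ins for the emoji sets extracted from each Tweet.
tweet_emojis = {
    "t0": {"🍗", "🌯", "🍔", "🍒"},
    "t1": {"🍗", "🌯", "🍔"},
    "t2": {"🍔", "🍗"},
    "t3": {"🍕"},
}

# Score every Tweet against the query Tweet, then sort from most to least similar.
query = tweet_emojis["t0"]
ranked = sorted(
    ((jaccard(query, emojis), tweet_id) for tweet_id, emojis in tweet_emojis.items()),
    reverse=True,
)

for score, tweet_id in ranked:
    print(tweet_id, score)
```

The query Tweet always ranks first with a score of 1.0 because it is compared with itself, which is also why index 0 tops the table from the notebook cell.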


**🍾🍾 Congratulations 🎉🎉, you have now finished all the exercises in this assignment!**