# ITCS 3162 - Assignment 2

Frequent Itemset Mining - A Priori Algorithm

### Name: Harsh Patel

### Submission instructions

- Enter your name in the space above.  
- Save your completed notebook as *itcs3162_assignment_2_**\<uncc username>**.ipynb*.
- Upload **both** the **ipynb** file and the **html** file version of your completed notebook to Canvas.  

You can download the notebook in html format by going to *File --> Download as --> HTML*

***

## Mining Frequent Hashtags on Twitter

For this assignment, you will implement the *a priori* algorithm to mine pairs of hashtags that co-occur frequently in a set of tweets scraped using Twitter's API.

The "baskets" of hashtags are provided in the included data file **hashtag_baskets.csv**. Please download the file from Canvas and put the file in the **same directory** as this Jupyter notebook.

The function provided below will read the data file in as a list of sets of hashtags. Each set of hashtags corresponds to all of the hashtags used in a single tweet.

In [1]:
def load_hashtags():
    with open('hashtag_baskets.csv') as f:
        return [
            set(line.strip().split(','))
            for line in f
        ]

In [2]:
hashtag_baskets = load_hashtags()
hashtag_baskets[:10]  # Print the first few hashtags

[{'COVID19', 'StayHomeSaveLives'},
 {'recallgavinnewsom'},
 {'stayhome'},
 {'MondayMotivation'},
 {'Economy', 'POTUS'},
 {'BackToWorkNOW', 'NaziNationState', 'OperationGridlock', 'overreach'},
 {'STFH'},
 {'COVID19', 'GavinNewsom', 'KamalaHarris', 'coronavirus'},
 {'protectLA', 'protectcalifornua', 'thankyouforyourservice'},
 {'CARESAct', 'edd', 'eddpenaltyweeks'}]

## Part 1 - Frequent items (25 pts)

In the cells below:
1. Define a function that takes in a *list of item baskets* and a *support threshold*. The function should **return the frequent items**  
    (You may also want to return the count of each hashtag along with the tag itself)
2. Use your function to find the frequently occurring hashtags in our dataset
3. Print out your support threshold and the total number of hashtags that occur frequently
4. Print the **top 10** most frequent hashtags sorted by support
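
The steps above can be sketched compactly with `collections.Counter`. This is only an illustrative sketch on made-up toy baskets; the names `frequent_items_sketch` and `toy_baskets` are hypothetical and are not part of the assignment data or the solution below.

```python
from collections import Counter

def frequent_items_sketch(baskets, support_threshold):
    """Return {item: count} for items found in at least support_threshold baskets."""
    # Each basket is a set, so an item is counted at most once per basket.
    counts = Counter(item for basket in baskets for item in basket)
    return {item: n for item, n in counts.items() if n >= support_threshold}

# Toy data (an assumption for illustration, not the real hashtag baskets)
toy_baskets = [{'a', 'b'}, {'a', 'c'}, {'a', 'b', 'c'}, {'d'}]
frequent = frequent_items_sketch(toy_baskets, support_threshold=2)

print(f'Support threshold: 2, number of frequent items: {len(frequent)}')
# Sort by support, highest first, and keep at most ten entries
top = sorted(frequent.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top)
```

Passing the real `hashtag_baskets` and a chosen threshold in place of the toy values should follow the same pattern.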

In [3]:
def freq_items(baskets, support_threshold=2):  # default threshold = 2
    basket_dict = {}
    # count the number of baskets in which each item appears
    for basket in baskets:
        for item in basket:
            if item in basket_dict:
                basket_dict[item] += 1
            else:
                basket_dict[item] = 1
            # option 2: use collections.defaultdict(int) so that `+= 1`
            # works without first assigning 1

    # keep only the items that meet the support threshold
    freq_dict = {}
    for key, value in basket_dict.items():
        if value >= support_threshold:
            freq_dict[key] = value

    return freq_dict

In [4]:
# frequent hashtags stored in freq_hashtags
freq_hashtags = freq_items(hashtag_baskets, 24)  # support threshold = 24

In [5]:
print('Frequently occurring (24 or more times) hashtags:\n')
print(freq_hashtags)

Frequently occurring (24 or more times) hashtags:

{'COVID19': 577, 'StayHomeSaveLives': 138, 'recallgavinnewsom': 24, 'stayhome': 24, 'GavinNewsom': 62, 'coronavirus': 169, 'RecallGavinNewsom': 409, 'Reopencalifornia': 46, 'ReopenCA': 26, 'opencalifornianow': 195, 'California': 171, 'COVID': 65, 'californialockdown': 25, 'OPENAMERICANOW': 57, 'covid19': 70, 'Covid19': 48, 'ReopenAmerica': 24, 'FlattenTheCurve': 42, 'COVID-19': 107, 'ReOpenCalifornia': 33, 'endthelockdown': 26, 's': 33, 'StayHome': 71, 'MedicareForAll': 39, 'OpenCaliforniaNow': 38, 'ConstitutionOverCoronavirus': 32, 'RecallNewsom': 49, 'SARSCoV2': 26, 'OpenCalifornia': 201, 'COVIDIOTS': 29, 'Covid_19': 53, 'AB5': 132, 'RepealAB5': 47, 'CancelRent': 28, 'RentFreezeNow': 25, 'Coronavirus': 49, 'StayAtHome': 34, 'opencalifornia': 29, 'MAGA': 30, 'CA': 25, 'RecallGavin2020': 38, 'ObamaGate': 24}


In [6]:
sorted_freq_tags = sorted(freq_hashtags.items(), key = lambda item: item[1], reverse = True)
print('Top 10 frequent hashtags: \n')
print(sorted_freq_tags[:10])

Top 10 frequent hashtags: 

[('COVID19', 577), ('RecallGavinNewsom', 409), ('OpenCalifornia', 201), ('opencalifornianow', 195), ('California', 171), ('coronavirus', 169), ('StayHomeSaveLives', 138), ('AB5', 132), ('COVID-19', 107), ('StayHome', 71)]


## Part 2 - Generate candidates (15 pts)

Using your results from Part 1, generate the candidate pairs of hashtags
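
One way to sketch candidate generation is shown below, on toy counts. In *a priori*, a pair can only be frequent if both of its items are frequent, so candidates are built from the frequent items alone. Sorting the items first is an added assumption that gives every pair a single canonical order; `candidate_pairs_sketch` and `toy_frequent` are hypothetical names.

```python
from itertools import combinations

def candidate_pairs_sketch(frequent_items):
    """Build all 2-item candidates from frequent items, each in sorted order."""
    # sorted() over a dict iterates its keys; combinations keeps that order,
    # so each pair appears exactly once as (smaller, larger).
    return list(combinations(sorted(frequent_items), 2))

# Toy frequent-item counts (illustrative only)
toy_frequent = {'a': 3, 'c': 2, 'b': 2}
candidates = candidate_pairs_sketch(toy_frequent)
print(candidates)
```

With `n` frequent items this yields `n * (n - 1) / 2` candidates.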

In [7]:
from itertools import combinations

hashtags = list(freq_hashtags.keys())  # all frequent hashtags from Part 1 (support threshold = 24)

In [8]:
hashtag_pair = list(combinations(hashtags, 2))# list to hold pairs
print(hashtag_pair)

[('COVID19', 'StayHomeSaveLives'), ('COVID19', 'recallgavinnewsom'), ('COVID19', 'stayhome'), ('COVID19', 'GavinNewsom'), ('COVID19', 'coronavirus'), ('COVID19', 'RecallGavinNewsom'), ('COVID19', 'Reopencalifornia'), ('COVID19', 'ReopenCA'), ('COVID19', 'opencalifornianow'), ('COVID19', 'California'), ('COVID19', 'COVID'), ('COVID19', 'californialockdown'), ('COVID19', 'OPENAMERICANOW'), ('COVID19', 'covid19'), ('COVID19', 'Covid19'), ('COVID19', 'ReopenAmerica'), ('COVID19', 'FlattenTheCurve'), ('COVID19', 'COVID-19'), ('COVID19', 'ReOpenCalifornia'), ('COVID19', 'endthelockdown'), ('COVID19', 's'), ('COVID19', 'StayHome'), ('COVID19', 'MedicareForAll'), ('COVID19', 'OpenCaliforniaNow'), ('COVID19', 'ConstitutionOverCoronavirus'), ('COVID19', 'RecallNewsom'), ('COVID19', 'SARSCoV2'), ('COVID19', 'OpenCalifornia'), ('COVID19', 'COVIDIOTS'), ('COVID19', 'Covid_19'), ('COVID19', 'AB5'), ('COVID19', 'RepealAB5'), ('COVID19', 'CancelRent'), ('COVID19', 'RentFreezeNow'), ('COVID19', 'Corona

## Part 3 - Frequent pairs (30 pts)

1. Using your candidate pairs from Part 2, count the occurrences of each candidate pair in our dataset
2. Filter based on your support threshold to find the frequent pairs of hashtags
3. Print the **top 10** frequent hashtag pairs sorted by support
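
A minimal sketch of the counting step, assuming toy baskets and a canonical (sorted) pair order so that `('a', 'b')` and `('b', 'a')` are never counted under separate keys. Restricting each basket to its frequent items before pairing is equivalent to counting only the candidate pairs. The function and variable names here are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs_sketch(baskets, frequent_items, support_threshold):
    """Count co-occurrences of frequent-item pairs and keep the frequent ones."""
    frequent = set(frequent_items)
    counts = Counter()
    for basket in baskets:
        # Keep only frequent items, sorted so each pair has one canonical order
        kept = sorted(frequent & set(basket))
        counts.update(combinations(kept, 2))
    return {pair: n for pair, n in counts.items() if n >= support_threshold}

# Toy data (illustrative only)
toy_baskets = [{'a', 'b'}, {'a', 'c'}, {'a', 'b', 'c'}, {'d'}]
toy_frequent = {'a', 'b', 'c'}
pairs = frequent_pairs_sketch(toy_baskets, toy_frequent, support_threshold=2)

top_pairs = sorted(pairs.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top_pairs)
```

Because Python sets have no guaranteed iteration order, pairing the raw sets without sorting can split one logical pair across two tuple keys, which lowers the apparent support of each.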

In [9]:
freq_hashtag_pair = {}
hashtag_pair_dataset = []

# collect every pair of hashtags that co-occurs in a basket
for basket in hashtag_baskets:
    temp_pair = list(combinations(basket, 2))  # generate the pairs for this basket

    for pairs in temp_pair:
        hashtag_pair_dataset.append(pairs)  # add each pair individually

# count how many times each pair appears across all baskets
# note: pairs are ordered tuples, so (a, b) and (b, a) are counted under separate keys,
# and the pairs are not restricted to the Part 2 candidates
for pair in hashtag_pair_dataset:
    if pair in freq_hashtag_pair:
        freq_hashtag_pair[pair] += 1
    else:
        freq_hashtag_pair[pair] = 1

print(freq_hashtag_pair)

{('COVID19', 'StayHomeSaveLives'): 30, ('POTUS', 'Economy'): 1, ('NaziNationState', 'overreach'): 1, ('NaziNationState', 'BackToWorkNOW'): 1, ('NaziNationState', 'OperationGridlock'): 1, ('overreach', 'BackToWorkNOW'): 1, ('overreach', 'OperationGridlock'): 1, ('BackToWorkNOW', 'OperationGridlock'): 1, ('GavinNewsom', 'KamalaHarris'): 1, ('GavinNewsom', 'coronavirus'): 1, ('GavinNewsom', 'COVID19'): 4, ('KamalaHarris', 'coronavirus'): 1, ('KamalaHarris', 'COVID19'): 1, ('coronavirus', 'COVID19'): 58, ('protectLA', 'protectcalifornua'): 1, ('protectLA', 'thankyouforyourservice'): 1, ('protectcalifornua', 'thankyouforyourservice'): 1, ('edd', 'eddpenaltyweeks'): 5, ('edd', 'CARESAct'): 5, ('eddpenaltyweeks', 'CARESAct'): 6, ('OpenCA', 'RecallGavinNewsom'): 3, ('OpenCA', 'Reopencalifornia'): 2, ('RecallGavinNewsom', 'Reopencalifornia'): 6, ('RecallGavinNewsom', 'ReopenCA'): 1, ('Biden2020', 'VoteBlue2020'): 1, ('Covid19isReal', 'WereAllInThisTogether'): 1, ('reopen', 'Reopencalifornia'): 

In [10]:
# filtering frequent hashtag pairs based on support threshold
filtered_hashtag_pair = {}

for key, value in freq_hashtag_pair.items():
    if (value >= 24): # support threshold = 24
        filtered_hashtag_pair.update({key: value})

print(filtered_hashtag_pair)

{('COVID19', 'StayHomeSaveLives'): 30, ('coronavirus', 'COVID19'): 58, ('opencalifornianow', 'RecallGavinNewsom'): 30, ('OpenCalifornia', 'opencalifornianow'): 29, ('OpenCalifornia', 'RecallGavinNewsom'): 26}


In [11]:
sorted_freq_pairs = sorted(filtered_hashtag_pair.items(), key = lambda item: item[1], reverse = True)
print('Top 10 frequent hashtag pairs: \n')
print(sorted_freq_pairs[:10])

Top 10 frequent hashtag pairs: 

[(('coronavirus', 'COVID19'), 58), (('COVID19', 'StayHomeSaveLives'), 30), (('opencalifornianow', 'RecallGavinNewsom'), 30), (('OpenCalifornia', 'opencalifornianow'), 29), (('OpenCalifornia', 'RecallGavinNewsom'), 26)]


## Part 4 - Association Rules (30 pts)

Using your results from Part 1 and Part 3, find the association rules with high confidence.

1. For each frequent pair, derive the **two** association rules from that pair and compute the **confidence** of each rule
2. Filter the association rules based on a confidence threshold of your choosing
3. Print each association rule and its confidence value
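
For a frequent pair {A, B}, the two rules are A → B and B → A, where confidence(A → B) = support({A, B}) / support(A). The sketch below illustrates that computation on toy counts. It assumes item counts and pair counts are stored in dictionaries like those from Part 1 and Part 3; `association_rules_sketch` and the toy values are hypothetical.

```python
def association_rules_sketch(item_counts, pair_counts, min_confidence):
    """Derive both rules from each pair and keep those meeting min_confidence."""
    rules = {}
    for (a, b), pair_support in pair_counts.items():
        # Each pair {a, b} yields two rules: a -> b and b -> a
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / item_counts[lhs]
            if confidence >= min_confidence:
                rules[(lhs, rhs)] = confidence
    return rules

# Toy supports (illustrative only): 'a' occurs 4 times, 'b' twice, together twice
toy_item_counts = {'a': 4, 'b': 2}
toy_pair_counts = {('a', 'b'): 2}
rules = association_rules_sketch(toy_item_counts, toy_pair_counts, min_confidence=0.6)

for (lhs, rhs), conf in rules.items():
    print(f'{lhs} -> {rhs}: confidence = {conf:.2f}')
```

In this toy case a → b has confidence 2/4 = 0.5 and is filtered out, while b → a has confidence 2/2 = 1.0 and is kept, which shows that the two rules from one pair can differ sharply.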

In [12]:
asso_rule = []  # empty list to hold (pair, item) tuples: a frequent pair and an item that occurs with it

for basket in hashtag_baskets: # going through all the baskets 
    for pairs in filtered_hashtag_pair.keys(): # part-3
        for item in freq_hashtags.keys():
            if ((pairs[0] in basket) and (pairs[1] in basket) and (item in basket) and (item not in pairs)):
                asso_rule.append((pairs, item))

print(len(asso_rule))
print(asso_rule)

83
[(('coronavirus', 'COVID19'), 'GavinNewsom'), (('coronavirus', 'COVID19'), 'COVID'), (('opencalifornianow', 'RecallGavinNewsom'), 'OpenCalifornia'), (('OpenCalifornia', 'opencalifornianow'), 'RecallGavinNewsom'), (('OpenCalifornia', 'RecallGavinNewsom'), 'opencalifornianow'), (('opencalifornianow', 'RecallGavinNewsom'), 'OpenCalifornia'), (('OpenCalifornia', 'opencalifornianow'), 'RecallGavinNewsom'), (('OpenCalifornia', 'RecallGavinNewsom'), 'opencalifornianow'), (('coronavirus', 'COVID19'), 'ConstitutionOverCoronavirus'), (('coronavirus', 'COVID19'), 'COVIDIOTS'), (('opencalifornianow', 'RecallGavinNewsom'), 'OpenCalifornia'), (('OpenCalifornia', 'opencalifornianow'), 'RecallGavinNewsom'), (('OpenCalifornia', 'RecallGavinNewsom'), 'opencalifornianow'), (('opencalifornianow', 'RecallGavinNewsom'), 'OpenCalifornia'), (('OpenCalifornia', 'opencalifornianow'), 'RecallGavinNewsom'), (('OpenCalifornia', 'RecallGavinNewsom'), 'opencalifornianow'), (('coronavirus', 'COVID19'), 'Constituti

In [13]:
# confidence = (times the item appears with the pair) / (times the pair appears)
# first, compute the support of each (pair, item) combination

supo_dict = {}

for asso in asso_rule:
    supo_dict[asso[0], asso[1]] = asso_rule.count(asso)
    
print(len(supo_dict))
print(supo_dict)

28
{(('coronavirus', 'COVID19'), 'GavinNewsom'): 1, (('coronavirus', 'COVID19'), 'COVID'): 2, (('opencalifornianow', 'RecallGavinNewsom'), 'OpenCalifornia'): 10, (('OpenCalifornia', 'opencalifornianow'), 'RecallGavinNewsom'): 10, (('OpenCalifornia', 'RecallGavinNewsom'), 'opencalifornianow'): 10, (('coronavirus', 'COVID19'), 'ConstitutionOverCoronavirus'): 3, (('coronavirus', 'COVID19'), 'COVIDIOTS'): 4, (('opencalifornianow', 'RecallGavinNewsom'), 'OPENAMERICANOW'): 5, (('opencalifornianow', 'RecallGavinNewsom'), 'opencalifornia'): 1, (('opencalifornianow', 'RecallGavinNewsom'), 'ConstitutionOverCoronavirus'): 3, (('OpenCalifornia', 'opencalifornianow'), 'Reopencalifornia'): 2, (('OpenCalifornia', 'opencalifornianow'), 'OPENAMERICANOW'): 3, (('OpenCalifornia', 'RecallGavinNewsom'), 'OPENAMERICANOW'): 5, (('OpenCalifornia', 'RecallGavinNewsom'), 'OpenCaliforniaNow'): 1, (('OpenCalifornia', 'RecallGavinNewsom'), 'Reopencalifornia'): 2, (('OpenCalifornia', 'opencalifornianow'), 'GavinNew

In [14]:
# 3. each association rule and its confidence value
# calculating confidence
# confidence = support of (pair + item) / support of the pair
# conf_dict = values from supo_dict / values from filtered_hashtag_pair

conf_dict = {}

for assoKey, assoVal in supo_dict.items():
    conf_dict[assoKey] = (assoVal / filtered_hashtag_pair[assoKey[0]]) 

print(len(conf_dict))
print(conf_dict)

28
{(('coronavirus', 'COVID19'), 'GavinNewsom'): 0.017241379310344827, (('coronavirus', 'COVID19'), 'COVID'): 0.034482758620689655, (('opencalifornianow', 'RecallGavinNewsom'), 'OpenCalifornia'): 0.3333333333333333, (('OpenCalifornia', 'opencalifornianow'), 'RecallGavinNewsom'): 0.3448275862068966, (('OpenCalifornia', 'RecallGavinNewsom'), 'opencalifornianow'): 0.38461538461538464, (('coronavirus', 'COVID19'), 'ConstitutionOverCoronavirus'): 0.05172413793103448, (('coronavirus', 'COVID19'), 'COVIDIOTS'): 0.06896551724137931, (('opencalifornianow', 'RecallGavinNewsom'), 'OPENAMERICANOW'): 0.16666666666666666, (('opencalifornianow', 'RecallGavinNewsom'), 'opencalifornia'): 0.03333333333333333, (('opencalifornianow', 'RecallGavinNewsom'), 'ConstitutionOverCoronavirus'): 0.1, (('OpenCalifornia', 'opencalifornianow'), 'Reopencalifornia'): 0.06896551724137931, (('OpenCalifornia', 'opencalifornianow'), 'OPENAMERICANOW'): 0.10344827586206896, (('OpenCalifornia', 'RecallGavinNewsom'), 'OPENAMER

In [15]:
# 1. Association Rule:
sorted_conf_list = sorted(conf_dict.items(), key = lambda item: item[1], reverse = True)
print('Two Association Rules and their Confidence: \n')
print(sorted_conf_list[:2])

Two Association Rules and their Confidence: 

[((('OpenCalifornia', 'RecallGavinNewsom'), 'opencalifornianow'), 0.38461538461538464), ((('OpenCalifornia', 'opencalifornianow'), 'RecallGavinNewsom'), 0.3448275862068966)]


In [16]:
# 2. Filter the association rules based on a confidence threshold of your choosing
# confidence threshold = 0.25
filtered_conf_list = {}

for key, value in conf_dict.items():
    if (value >= 0.25): # confidence threshold = 0.25
        filtered_conf_list.update({key: value})

print(filtered_conf_list)

{(('opencalifornianow', 'RecallGavinNewsom'), 'OpenCalifornia'): 0.3333333333333333, (('OpenCalifornia', 'opencalifornianow'), 'RecallGavinNewsom'): 0.3448275862068966, (('OpenCalifornia', 'RecallGavinNewsom'), 'opencalifornianow'): 0.38461538461538464}


## Bonus (5 pts)

Repeat the above experiments after applying some text preprocessing to the data
1. Convert all text to lowercase
2. Remove all punctuation
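
The two preprocessing steps can be sketched with `str.translate`, which removes every punctuation character in one pass. This version returns new baskets rather than modifying the input; `normalize_baskets_sketch` and the toy baskets are hypothetical names used only for illustration.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character
_STRIP_PUNCT = str.maketrans('', '', punctuation)

def normalize_baskets_sketch(baskets):
    """Return new baskets with punctuation removed and text lowercased."""
    return [
        {tag.translate(_STRIP_PUNCT).lower() for tag in basket}
        for basket in baskets
    ]

# Toy data (illustrative only): both variants collapse to the same tag
toy = [{'COVID-19', 'Covid_19'}, {'StayHome'}]
normalized = normalize_baskets_sketch(toy)
print(normalized)
```

Because each basket is a set, variants that normalize to the same string merge into one item, so the per-tag supports generally rise after preprocessing. A tag made up entirely of punctuation would become an empty string and may need to be dropped.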

In [17]:
# all the data in above program ^ is sourced from hashtag_baskets 
# therefore only hashtag_baskets needs to be updated for all the data to be changed
# run this part of code after loading hashtags from .csv to hashtag_baskets but before part-1 to see results 

from string import punctuation

def textPreprocessing(set_list):
    for basket in range(len(set_list)):
        temp_set = set_list[basket].copy() # creates a copy of the original set 
        for item in set_list[basket]:
            word = item
            temp_set.remove(word)
        
            for letter in item:
                if letter in punctuation:
                    word = word.replace(letter, "") # removing punctuation
            temp_set.add(word.lower()) # lowercasing
        set_list[basket] = temp_set
 
    
textPreprocessing(hashtag_baskets)

print(hashtag_baskets)

[{'stayhomesavelives', 'covid19'}, {'recallgavinnewsom'}, {'stayhome'}, {'mondaymotivation'}, {'economy', 'potus'}, {'backtoworknow', 'overreach', 'nazinationstate', 'operationgridlock'}, {'stfh'}, {'covid19', 'gavinnewsom', 'coronavirus', 'kamalaharris'}, {'protectcalifornua', 'thankyouforyourservice', 'protectla'}, {'caresact', 'edd', 'eddpenaltyweeks'}, {'reopencalifornia', 'openca', 'recallgavinnewsom'}, {'covid19'}, {'recallgavinnewsom', 'reopenca'}, {'liberatecalifornia'}, {'biden2020', 'voteblue2020'}, {'coronavirus'}, {'stayhome'}, {'recallgavinnewsom'}, {'stayhome'}, {'shortsighted'}, {'covid19isreal', 'wereallinthistogether'}, {'reopencalifornia', 'reopen'}, {'unpresidented', 'floridamorons', 'unamerican', 'michiganmorons'}, {'californiastrong'}, {'plandemic', 'sb276', 'fake', 'vaccineskill'}, {'reopenca'}, {'dotherightthing', 'stayhome', 'maskup'}, {'sciencematters'}, {'coronavirus'}, {'nocurve'}, {'nocurve', 'backtoaork'}, {'noumemployment', 'nomoneyforus', 'helpus'}, {'cov