# CITY A

The goal is to analyze point-of-interest (POI) co-occurrence patterns within the grid cells of each city. The code uses the Apriori algorithm to find POI categories that frequently occur together in the same grid cell, which matches that objective.

The code builds a grid_id from the x, y coordinates of each POI, which assigns every POI to a grid cell.
The POI categories in each grid cell are then collected into a list (grid_baskets), which serves as the "basket" (transaction) for the Apriori algorithm.

The one-hot encoding step turns each grid cell's POI list into a binary vector with one column per category, which is the input format the apriori function expects.
The apriori function then finds frequent itemsets above a minimum support threshold (0.25 for Cities A and C, 0.1 for City B). From these itemsets we generate association rules with a minimum lift of 1 and display the top rules sorted by confidence, then lift, in descending order.
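To make the notion of support concrete, here is a minimal sketch on hypothetical toy baskets (the category names and the `support` and `frequent_itemsets` helpers are illustrative, not part of the notebook). It enumerates itemsets by brute force; Apriori should return the same sets but prunes the search instead of testing every combination.

```python
from itertools import combinations

# Hypothetical toy baskets: each set holds the POI categories present in one grid cell.
baskets = [
    {"Park", "Cafe", "Bank"},
    {"Park", "Cafe"},
    {"Park", "Bank"},
    {"Cafe"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def frequent_itemsets(baskets, min_support):
    """Brute-force enumeration of all itemsets whose support is >= min_support."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, baskets)
            if s >= min_support:
                result[candidate] = s
    return result

found = frequent_itemsets(baskets, min_support=0.5)
for itemset, s in found.items():
    print(itemset, s)
```

Lowering `min_support` in this sketch admits more (and larger) itemsets, which is the same trade-off the notebook makes when it picks a threshold per city.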

To check the support values returned by apriori, we recompute each itemset's support as the fraction of grid cells that contain all of its items and store it next to the original value. Lift measures how much more often the antecedent and consequent co-occur than would be expected if they were independent; values greater than 1 indicate a positive association, and the further above 1, the stronger it is. Finally, we list the rules with the highest lift values to evaluate patterns of POI co-occurrence.
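As a worked check, the rule metrics can be recomputed from three support values. The sketch below uses the numbers this notebook reports for the City A rule (Laundry) -> (Heavy Industry) and applies the standard definitions of confidence, lift, leverage and conviction; the variable names are illustrative.

```python
# Values copied from the City A rule (Laundry) -> (Heavy Industry) reported in this notebook.
antecedent_support = 0.318624   # share of grid cells containing Laundry
consequent_support = 0.559069   # share of grid cells containing Heavy Industry
joint_support = 0.251266        # share of grid cells containing both

# Confidence: how often the consequent appears in cells that contain the antecedent.
confidence = joint_support / antecedent_support

# Lift: confidence relative to the consequent's baseline frequency.
lift = confidence / consequent_support

# Leverage: observed joint support minus the joint support expected under independence.
leverage = joint_support - antecedent_support * consequent_support

# Conviction: ratio of expected to observed frequency of the rule being wrong.
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.4f} "
      f"leverage={leverage:.4f} conviction={conviction:.4f}")
```

The results should agree with the corresponding row of the City A rules table up to rounding of the inputs.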

The code ignores how many POIs of each category a grid cell contains and records only whether the category is present, in line with the recommendation to focus on presence rather than quantity.
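The difference between the quantity view and the presence view can be sketched on a single hypothetical grid cell (the category IDs below are made up for illustration). Sorting is used here only to make the printed basket deterministic; the notebook itself keeps whatever order `list(set(...))` yields.

```python
from collections import Counter

# Hypothetical POI records for one grid cell: category IDs, with repeats.
cell_categories = [48, 48, 48, 79, 81, 79]

# Quantity view: how many POIs of each category the cell contains.
counts = Counter(cell_categories)

# Presence view: each category appears at most once, as in the baskets used for Apriori.
presence = sorted(set(cell_categories))

print(counts)
print(presence)
```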


In [1]:
import numpy as np 
import pandas as pd 
from mlxtend.frequent_patterns import apriori, association_rules


In [2]:
# Load data
POI_data = pd.read_csv('POIdata_cityA.csv')
categories = pd.read_csv('POI_datacategories.csv', header=None)


In [3]:
# Create grid_id for each POI based on its x, y coordinates
POI_data['grid_id'] = POI_data['x'].astype(str) + "_" + POI_data['y'].astype(str)

# Group POI data by grid_id and collect the categories for each grid
grid_baskets = POI_data.groupby('grid_id')['category'].apply(list)

# Convert each basket to a set to remove duplicates
grid_baskets = grid_baskets.apply(lambda x: list(set(x)))
grid_baskets

grid_id
100_10                                  [65, 76, 79, 81, 62]
100_100    [4, 13, 18, 38, 41, 48, 52, 53, 54, 55, 56, 57...
100_101    [4, 5, 14, 17, 18, 21, 36, 38, 39, 45, 47, 48,...
100_102    [66, 4, 69, 36, 5, 9, 11, 13, 14, 48, 81, 50, ...
100_103             [71, 14, 79, 81, 82, 51, 84, 59, 62, 63]
                                 ...                        
9_6        [33, 66, 36, 4, 70, 46, 47, 79, 81, 17, 51, 20...
9_7        [66, 39, 73, 76, 79, 47, 48, 53, 54, 59, 60, 6...
9_8                                 [47, 53, 54, 55, 56, 59]
9_9                                          [73, 34, 67, 4]
9_97                                                    [56]
Name: category, Length: 20146, dtype: object
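Because grid_id is a string, the groupby result is ordered lexicographically, which is why 100_10 appears before 9_6 in the output. A minimal sketch of recovering numeric order, assuming the x and y coordinates are integers (`parse_grid_id` is a hypothetical helper, not part of the notebook):

```python
def parse_grid_id(grid_id):
    """Split an 'x_y' grid_id string back into an (x, y) tuple of integers."""
    x, y = grid_id.split("_")
    return int(x), int(y)

grid_ids = ["9_97", "100_100", "9_6", "100_10"]

# Default string sort: compares character by character.
print(sorted(grid_ids))

# Numeric sort: compares x first, then y.
print(sorted(grid_ids, key=parse_grid_id))
```

The ordering does not affect the Apriori results, since each grid cell is treated as an independent basket.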

In [4]:
# Create a one-hot encoding matrix for the categories in each grid
one_hot_list = []

for basket in grid_baskets:
    row = [0] * len(categories)
    for category in basket:
        row[category - 1] = 1
    one_hot_list.append(row)

# Create a DataFrame for one-hot encoded categories
one_hot_df = pd.DataFrame(one_hot_list, columns=categories[0])
one_hot_df

,Food,Shopping,Entertainment,Japanese restaurant,Western restaurant,Eat all you can restaurant,Chinese restaurant,Indian restaurant,Ramen restaurant,Curry restaurant,...,Accountant Office,IT Office,Publisher Office,Building Material,Gardening,Heavy Industry,NPO,Utility Copany,Port,Research Facility
0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,1,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,...,1,0,0,1,0,1,0,0,1,0
2,0,0,0,1,1,0,0,0,0,0,...,1,1,1,1,0,1,0,0,0,1
3,0,0,0,1,1,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20141,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
20142,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
20143,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20144,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
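The encoding loop above relies on category IDs being 1-based and no larger than the number of categories. In Python, an ID of 0 would index position -1 and silently set the last column, so a defensive variant may be worth considering. The sketch below is one way to do that; `one_hot` and its `n_categories` parameter are hypothetical names.

```python
def one_hot(baskets, n_categories):
    """Encode baskets of 1-based category IDs as 0/1 rows, rejecting out-of-range IDs."""
    rows = []
    for basket in baskets:
        row = [0] * n_categories
        for category in basket:
            # Guard against IDs that would wrap around (0) or overflow the row.
            if not 1 <= category <= n_categories:
                raise ValueError(f"category ID {category} outside 1..{n_categories}")
            row[category - 1] = 1
        rows.append(row)
    return rows

encoded = one_hot([[1, 3], [2]], n_categories=4)
print(encoded)
```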


In [5]:
# Preview: DataFrame with each grid_id and its basket of categories (the join with the one-hot encoding happens in the next cell)
grid_baskets_df = pd.DataFrame({'grid_id': grid_baskets.index, 'categories': grid_baskets})
grid_baskets_df

grid_id (index),grid_id,categories
100_10,100_10,"[65, 76, 79, 81, 62]"
100_100,100_100,"[4, 13, 18, 38, 41, 48, 52, 53, 54, 55, 56, 57..."
100_101,100_101,"[4, 5, 14, 17, 18, 21, 36, 38, 39, 45, 47, 48,..."
100_102,100_102,"[66, 4, 69, 36, 5, 9, 11, 13, 14, 48, 81, 50, ..."
100_103,100_103,"[71, 14, 79, 81, 82, 51, 84, 59, 62, 63]"
...,...,...
9_6,9_6,"[33, 66, 36, 4, 70, 46, 47, 79, 81, 17, 51, 20..."
9_7,9_7,"[66, 39, 73, 76, 79, 47, 48, 53, 54, 59, 60, 6..."
9_8,9_8,"[47, 53, 54, 55, 56, 59]"
9_9,9_9,"[73, 34, 67, 4]"


In [6]:
# Create a DataFrame for grid_baskets and merge it with the one-hot encoding
grid_baskets_df = pd.DataFrame({'grid_id': grid_baskets.index, 'categories': grid_baskets})
# Ensure the index of one_hot_df matches the grid_baskets_df index
one_hot_df.index = grid_baskets_df.index

# Join the one-hot encoding DataFrame with grid_baskets_df
grid_baskets_df = grid_baskets_df.drop(['categories'], axis=1).join(one_hot_df)

# Set grid_id as the index
grid_baskets_df.set_index('grid_id', inplace=True)
grid_baskets_df

grid_id,Food,Shopping,Entertainment,Japanese restaurant,Western restaurant,Eat all you can restaurant,Chinese restaurant,Indian restaurant,Ramen restaurant,Curry restaurant,...,Accountant Office,IT Office,Publisher Office,Building Material,Gardening,Heavy Industry,NPO,Utility Copany,Port,Research Facility
100_10,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,1,0,0,0,0
100_100,0,0,0,1,0,0,0,0,0,0,...,1,0,0,1,0,1,0,0,1,0
100_101,0,0,0,1,1,0,0,0,0,0,...,1,1,1,1,0,1,0,0,0,1
100_102,0,0,0,1,1,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
100_103,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9_6,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
9_7,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
9_8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9_9,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
# Run apriori algorithm with a minimum support of 0.25
frequent_itemsets = apriori(grid_baskets_df, min_support=0.25, use_colnames=True)
# Display the frequent itemsets
print("Frequent Itemsets with Support:")
print(frequent_itemsets)

Frequent Itemsets with Support:
     support                              itemsets
0   0.255932                                (Café)
1   0.335153                                (Park)
2   0.406979                     (Transit Station)
3   0.267001                            (Hospital)
4   0.372679                         (Real Estate)
5   0.458255                     (Home Appliances)
6   0.318624                            (Laundry )
7   0.313511                      (Driving School)
8   0.251861                                (Bank)
9   0.374814                          (Hair Salon)
10  0.284424                    (Community Center)
11  0.428522                              (Church)
12  0.332324                   (Accountant Office)
13  0.453291                   (Building Material)
14  0.559069                      (Heavy Industry)
15  0.254740  (Transit Station, Building Material)
16  0.272163     (Heavy Industry, Transit Station)
17  0.268291        (Home Appliances, Real Estate)



In [8]:
# Run apriori algorithm with a minimum support of 0.25
frequent_itemsets = apriori(grid_baskets_df, min_support=0.25, use_colnames=True)

# Generate association rules with a minimum lift of 1
association_rules_df = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

# Sort the rules by confidence and lift, in descending order
sorted_rules = association_rules_df.sort_values(by=['confidence', 'lift'], ascending=[False, False])

# Display the top association rules
print("Top Association Rules:")
print(sorted_rules.head())


Top Association Rules:
            antecedents          consequents  antecedent support  \
18           (Laundry )     (Heavy Industry)            0.318624   
11        (Real Estate)     (Heavy Industry)            0.372679   
26  (Accountant Office)     (Heavy Industry)            0.332324   
23         (Hair Salon)     (Heavy Industry)            0.374814   
9         (Real Estate)  (Building Material)            0.372679   

    consequent support   support  confidence      lift  leverage  conviction  \
18            0.559069  0.251266    0.788596  1.410553  0.073133    2.085731   
11            0.559069  0.292316    0.784363  1.402982  0.083963    2.044788   
26            0.559069  0.258414    0.777595  1.390876  0.072622    1.982562   
23            0.559069  0.286062    0.763210  1.365145  0.076515    1.862121   
9             0.453291  0.277921    0.745738  1.645164  0.108989    2.150179   

    zhangs_metric  
18       0.427163  
11       0.457872  
26       0.420906  
23     



In [9]:
frequent_itemsets_copy = frequent_itemsets.copy()

# Recompute each itemset's support manually and add it as a new column
support_counts = []
for index, row in frequent_itemsets_copy.iterrows():
    itemset = list(row["itemsets"])  # Convert frozenset of items to list
    items = one_hot_df[itemset]  # Get the columns in the itemset
    mask = items.all(axis=1)  # True for grid cells that contain every item in the itemset
    support_count = mask.sum() / len(one_hot_df)  # Support as a fraction of grid cells (not a raw count)
    support_counts.append(support_count)

frequent_itemsets_copy["check_support"] = support_counts

# Sort by the apriori support (descending) and reset the index;
# the sorted result is stored in freq_itemsets_copy, frequent_itemsets_copy keeps its original order
freq_itemsets_copy = frequent_itemsets_copy.sort_values("support", ascending=False).reset_index(drop=True)

# Display the updated frequent itemsets with manually calculated support
print("Frequent Itemsets with Manually Verified Support:")
print(freq_itemsets_copy.head())

Frequent Itemsets with Manually Verified Support:
    support             itemsets  check_support
0  0.559069     (Heavy Industry)       0.559069
1  0.458255    (Home Appliances)       0.458255
2  0.453291  (Building Material)       0.453291
3  0.428522             (Church)       0.428522
4  0.406979    (Transit Station)       0.406979


In [10]:
frequent_itemsets_copy.to_csv("frequent_itemsets_cityA.csv")

In [11]:
# Create a new DataFrame to store selected columns from the sorted association rules
rules_summary = sorted_rules[["antecedents", "consequents", "lift"]].copy()

# Combine antecedents and consequents into a single column representing the rule
rules_summary["rule"] = (
    rules_summary["antecedents"].apply(lambda items: ", ".join(map(str, items)))
    + " -> "
    + rules_summary["consequents"].apply(lambda items: ", ".join(map(str, items)))
)

# Drop the original antecedents and consequents columns
rules_summary.drop(columns=["antecedents", "consequents"], inplace=True)

# Reorder columns for better readability (placing the rule column first)
rules_summary = rules_summary[["rule", "lift"]]

# Sort the DataFrame by the lift value in descending order and reset the index
rules_summary = rules_summary.sort_values(by="lift", ascending=False).reset_index(drop=True)

# Display the top rules with the highest lift
print("Top Rules with Lift Values:")
print(rules_summary.head())


Top Rules with Lift Values:
                               rule      lift
0         Real Estate -> Hair Salon  1.911089
1         Hair Salon -> Real Estate  1.911089
2  Real Estate -> Building Material  1.645164
3  Building Material -> Real Estate  1.645164
4   Building Material -> Hair Salon  1.601028
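The rules in this table come in pairs with identical lift because lift is symmetric in antecedent and consequent, whereas confidence is not. A short sketch using the supports this notebook reports for Real Estate and Building Material in City A (the function names are illustrative):

```python
def lift(support_a, support_b, support_ab):
    """Lift of A -> B; swapping A and B leaves the value unchanged."""
    return support_ab / (support_a * support_b)

def confidence(support_antecedent, support_ab):
    """Confidence of a rule; depends on which side is the antecedent."""
    return support_ab / support_antecedent

# City A supports reported in this notebook.
s_real_estate = 0.372679
s_building = 0.453291
s_both = 0.277921

lift_forward = lift(s_real_estate, s_building, s_both)    # Real Estate -> Building Material
lift_backward = lift(s_building, s_real_estate, s_both)   # Building Material -> Real Estate

conf_forward = confidence(s_real_estate, s_both)
conf_backward = confidence(s_building, s_both)

print(lift_forward, lift_backward)
print(conf_forward, conf_backward)
```

This is why ranking by lift alone cannot tell the two directions of a rule apart; the earlier sort by confidence, then lift, does distinguish them.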


# CITY B

In [12]:
import numpy as np 
import pandas as pd 
from mlxtend.frequent_patterns import apriori, association_rules 


In [13]:
POI_data = pd.read_csv('POIdata_cityB.csv')
categories = pd.read_csv('POI_datacategories.csv', header=None)

In [14]:
# Create grid_id for each POI based on its x, y coordinates
POI_data['grid_id'] = POI_data['x'].astype(str) + "_" + POI_data['y'].astype(str)

# Group POI data by grid_id and collect the categories for each grid
grid_baskets = POI_data.groupby('grid_id')['category'].apply(list)

# Convert each basket to a set to remove duplicates
grid_baskets = grid_baskets.apply(lambda x: list(set(x)))
grid_baskets

grid_id
100_102                     [34, 38, 74, 76, 48, 83, 62, 63]
100_103    [35, 67, 41, 74, 43, 79, 48, 47, 83, 52, 51, 5...
100_104    [65, 37, 74, 76, 79, 48, 81, 54, 55, 58, 59, 6...
100_117                                             [81, 79]
100_118    [66, 35, 36, 4, 74, 79, 48, 51, 52, 54, 55, 62...
                                 ...                        
9_40                                            [73, 74, 54]
9_41                                            [73, 60, 69]
9_42                 [4, 73, 74, 41, 51, 56, 57, 58, 59, 63]
9_43       [4, 7, 11, 18, 21, 23, 34, 36, 38, 40, 41, 43,...
9_44             [69, 9, 74, 41, 79, 48, 82, 54, 58, 59, 60]
Name: category, Length: 9124, dtype: object

In [15]:
# Create a one-hot encoding matrix for the categories in each grid
one_hot_list = []

for basket in grid_baskets:
    row = [0] * len(categories)
    for category in basket:
        row[category - 1] = 1
    one_hot_list.append(row)

# Create a DataFrame for one-hot encoded categories
one_hot_df = pd.DataFrame(one_hot_list, columns=categories[0])
one_hot_df

,Food,Shopping,Entertainment,Japanese restaurant,Western restaurant,Eat all you can restaurant,Chinese restaurant,Indian restaurant,Ramen restaurant,Curry restaurant,...,Accountant Office,IT Office,Publisher Office,Building Material,Gardening,Heavy Industry,NPO,Utility Copany,Port,Research Facility
0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,1,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
4,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9119,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9120,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9121,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9122,0,0,0,1,0,0,1,0,0,0,...,1,0,1,1,0,1,1,0,0,0


In [16]:
# Preview: DataFrame with each grid_id and its basket of categories (the join with the one-hot encoding happens in the next cell)
grid_baskets_df = pd.DataFrame({'grid_id': grid_baskets.index, 'categories': grid_baskets})
grid_baskets_df

grid_id (index),grid_id,categories
100_102,100_102,"[34, 38, 74, 76, 48, 83, 62, 63]"
100_103,100_103,"[35, 67, 41, 74, 43, 79, 48, 47, 83, 52, 51, 5..."
100_104,100_104,"[65, 37, 74, 76, 79, 48, 81, 54, 55, 58, 59, 6..."
100_117,100_117,"[81, 79]"
100_118,100_118,"[66, 35, 36, 4, 74, 79, 48, 51, 52, 54, 55, 62..."
...,...,...
9_40,9_40,"[73, 74, 54]"
9_41,9_41,"[73, 60, 69]"
9_42,9_42,"[4, 73, 74, 41, 51, 56, 57, 58, 59, 63]"
9_43,9_43,"[4, 7, 11, 18, 21, 23, 34, 36, 38, 40, 41, 43,..."


In [17]:
# Create a DataFrame for grid_baskets and merge it with the one-hot encoding
grid_baskets_df = pd.DataFrame({'grid_id': grid_baskets.index, 'categories': grid_baskets})
# Ensure the index of one_hot_df matches the grid_baskets_df index
one_hot_df.index = grid_baskets_df.index

# Join the one-hot encoding DataFrame with grid_baskets_df
grid_baskets_df = grid_baskets_df.drop(['categories'], axis=1).join(one_hot_df)

# Set grid_id as the index
grid_baskets_df.set_index('grid_id', inplace=True)
grid_baskets_df

grid_id,Food,Shopping,Entertainment,Japanese restaurant,Western restaurant,Eat all you can restaurant,Chinese restaurant,Indian restaurant,Ramen restaurant,Curry restaurant,...,Accountant Office,IT Office,Publisher Office,Building Material,Gardening,Heavy Industry,NPO,Utility Copany,Port,Research Facility
100_102,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
100_103,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
100_104,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,1,0,0,0,0
100_117,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
100_118,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9_40,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9_41,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9_42,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9_43,0,0,0,1,0,0,1,0,0,0,...,1,0,1,1,0,1,1,0,0,0


In [18]:
# Run apriori algorithm with a minimum support of 0.1
frequent_itemsets = apriori(grid_baskets_df, min_support=0.1, use_colnames=True)
# Display the frequent itemsets
print("Frequent Itemsets with Support:")
print(frequent_itemsets)

Frequent Itemsets with Support:
     support                                           itemsets
0   0.130864                              (Japanese restaurant)
1   0.107080                                    (Interior Shop)
2   0.140728                                    (Grocery Store)
3   0.180294                                             (Park)
4   0.359711                                  (Transit Station)
..       ...                                                ...
66  0.110259  (Transit Station, Home Appliances, Building Ma...
67  0.103354  (Home Appliances, Heavy Industry, Transit Stat...
68  0.100395   (Transit Station, Hair Salon, Building Material)
69  0.108176  (Transit Station, Heavy Industry, Building Mat...
70  0.107628  (Home Appliances, Heavy Industry, Building Mat...

[71 rows x 2 columns]
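Since the cities have very different numbers of grid cells, the same relative threshold corresponds to different absolute counts. The sketch below converts each city's min_support into the smallest number of grid cells an itemset must appear in, using the basket counts from the `Length:` lines of the grid_baskets outputs. It assumes the threshold is inclusive (support >= min_support).

```python
import math

# (number of grid cells, min_support passed to apriori) per city, as shown in this notebook.
cities = {"A": (20146, 0.25), "B": (9124, 0.1), "C": (3251, 0.25)}

# Smallest cell count whose share of all cells reaches the threshold.
min_cells = {
    name: math.ceil(n_cells * min_support)
    for name, (n_cells, min_support) in cities.items()
}
print(min_cells)
```

Read this way, an itemset in City B needs to appear in far fewer cells than one in City A, which is consistent with City B's lower threshold returning many more itemsets (71 rows versus the 18 listed for City A).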




In [19]:
# Run apriori algorithm with a minimum support of 0.1
frequent_itemsets = apriori(grid_baskets_df, min_support=0.1, use_colnames=True)

# Generate association rules with a minimum lift of 1
association_rules_df = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

# Sort the rules by confidence and lift, in descending order
sorted_rules = association_rules_df.sort_values(by=['confidence', 'lift'], ascending=[False, False])

# Display the top association rules
print("Top Association Rules:")
print(sorted_rules.head())

Top Association Rules:
                           antecedents          consequents  \
102    (Hair Salon, Building Material)    (Transit Station)   
16                              (Bank)    (Transit Station)   
12                          (Laundry )    (Transit Station)   
37                       (Real Estate)  (Building Material)   
112  (Home Appliances, Heavy Industry)  (Building Material)   

     antecedent support  consequent support   support  confidence      lift  \
102            0.123520            0.359711  0.100395    0.812777  2.259531   
16             0.133056            0.359711  0.105984    0.796540  2.214392   
12             0.162210            0.359711  0.121219    0.747297  2.077496   
37             0.163744            0.270934  0.121328    0.740964  2.734852   
112            0.145989            0.270934  0.107628    0.737237  2.721097   

     leverage  conviction  zhangs_metric  
102  0.055963    3.419934       0.635988  
16   0.058123    3.147009       0.632



In [20]:
frequent_itemsets_copy = frequent_itemsets.copy()

# Recompute each itemset's support manually and add it as a new column
support_counts = []
for index, row in frequent_itemsets_copy.iterrows():
    itemset = list(row["itemsets"])  # Convert frozenset of items to list
    items = one_hot_df[itemset]  # Get the columns in the itemset
    mask = items.all(axis=1)  # True for grid cells that contain every item in the itemset
    support_count = mask.sum() / len(one_hot_df)  # Support as a fraction of grid cells (not a raw count)
    support_counts.append(support_count)

frequent_itemsets_copy["check_support"] = support_counts

# Sort by the apriori support (descending) and reset the index;
# the sorted result is stored in freq_itemsets_copy, frequent_itemsets_copy keeps its original order
freq_itemsets_copy = frequent_itemsets_copy.sort_values("support", ascending=False).reset_index(drop=True)

# Display the updated frequent itemsets with manually calculated support
print("Frequent Itemsets with Manually Verified Support:")
print(freq_itemsets_copy.head())

Frequent Itemsets with Manually Verified Support:
    support             itemsets  check_support
0  0.375055             (Church)       0.375055
1  0.359711    (Transit Station)       0.359711
2  0.303924     (Heavy Industry)       0.303924
3  0.270934  (Building Material)       0.270934
4  0.269838    (Home Appliances)       0.269838


In [21]:
frequent_itemsets_copy.to_csv("frequent_itemsets_cityB.csv")

In [22]:
# Create a new DataFrame to store selected columns from the sorted association rules
rules_summary = sorted_rules[["antecedents", "consequents", "lift"]].copy()

# Combine antecedents and consequents into a single column representing the rule
rules_summary["rule"] = (
    rules_summary["antecedents"].apply(lambda items: ", ".join(map(str, items)))
    + " -> "
    + rules_summary["consequents"].apply(lambda items: ", ".join(map(str, items)))
)

# Drop the original antecedents and consequents columns
rules_summary.drop(columns=["antecedents", "consequents"], inplace=True)

# Reorder columns for better readability (placing the rule column first)
rules_summary = rules_summary[["rule", "lift"]]

# Sort the DataFrame by the lift value in descending order and reset the index
rules_summary = rules_summary.sort_values(by="lift", ascending=False).reset_index(drop=True)

# Display the top rules with the highest lift
print("Top Rules with Lift Values:")
print(rules_summary.head())


Top Rules with Lift Values:
                                               rule      lift
0                            Hair Salon -> Laundry   3.490690
1                            Laundry  -> Hair Salon  3.490690
2                         Hair Salon -> Real Estate  3.380272
3                         Real Estate -> Hair Salon  3.380272
4  Hair Salon -> Building Material, Transit Station  3.261651


# CITY C

In [23]:
import numpy as np 
import pandas as pd 
from mlxtend.frequent_patterns import apriori, association_rules 

In [24]:
POI_data = pd.read_csv('POIdata_cityC.csv')
categories = pd.read_csv('POI_datacategories.csv', header=None)

In [25]:
# Create grid_id for each POI based on its x, y coordinates
POI_data['grid_id'] = POI_data['x'].astype(str) + "_" + POI_data['y'].astype(str)

# Group POI data by grid_id and collect the categories for each grid
grid_baskets = POI_data.groupby('grid_id')['category'].apply(list)

# Convert each basket to a set to remove duplicates
grid_baskets = grid_baskets.apply(lambda x: list(set(x)))
grid_baskets

grid_id
101_200                                        [74]
102_188                                        [18]
108_161                                    [80, 81]
109_162                                        [48]
10_105                                     [80, 76]
                             ...                   
9_197      [69, 76, 47, 48, 79, 53, 54, 58, 59, 63]
9_4                                            [73]
9_5                    [66, 69, 72, 73, 41, 82, 61]
9_6                                            [62]
9_90                                           [74]
Name: category, Length: 3251, dtype: object

In [26]:
# Create a one-hot encoding matrix for the categories in each grid
one_hot_list = []

for basket in grid_baskets:
    row = [0] * len(categories)
    for category in basket:
        row[category - 1] = 1
    one_hot_list.append(row)

# Create a DataFrame for one-hot encoded categories
one_hot_df = pd.DataFrame(one_hot_list, columns=categories[0])
one_hot_df

,Food,Shopping,Entertainment,Japanese restaurant,Western restaurant,Eat all you can restaurant,Chinese restaurant,Indian restaurant,Ramen restaurant,Curry restaurant,...,Accountant Office,IT Office,Publisher Office,Building Material,Gardening,Heavy Industry,NPO,Utility Copany,Port,Research Facility
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3246,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
3247,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3248,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3249,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [27]:
# Preview: DataFrame with each grid_id and its basket of categories (the join with the one-hot encoding happens in the next cell)
grid_baskets_df = pd.DataFrame({'grid_id': grid_baskets.index, 'categories': grid_baskets})
grid_baskets_df

grid_id (index),grid_id,categories
101_200,101_200,[74]
102_188,102_188,[18]
108_161,108_161,"[80, 81]"
109_162,109_162,[48]
10_105,10_105,"[80, 76]"
...,...,...
9_197,9_197,"[69, 76, 47, 48, 79, 53, 54, 58, 59, 63]"
9_4,9_4,[73]
9_5,9_5,"[66, 69, 72, 73, 41, 82, 61]"
9_6,9_6,[62]


In [28]:
# Create a DataFrame for grid_baskets and merge it with the one-hot encoding
grid_baskets_df = pd.DataFrame({'grid_id': grid_baskets.index, 'categories': grid_baskets})
# Ensure the index of one_hot_df matches the grid_baskets_df index
one_hot_df.index = grid_baskets_df.index

# Join the one-hot encoding DataFrame with grid_baskets_df
grid_baskets_df = grid_baskets_df.drop(['categories'], axis=1).join(one_hot_df)

# Set grid_id as the index
grid_baskets_df.set_index('grid_id', inplace=True)
grid_baskets_df

grid_id,Food,Shopping,Entertainment,Japanese restaurant,Western restaurant,Eat all you can restaurant,Chinese restaurant,Indian restaurant,Ramen restaurant,Curry restaurant,...,Accountant Office,IT Office,Publisher Office,Building Material,Gardening,Heavy Industry,NPO,Utility Copany,Port,Research Facility
101_200,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
102_188,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
108_161,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
109_162,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10_105,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9_197,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
9_4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9_5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9_6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
# Run apriori algorithm with a minimum support of 0.25
frequent_itemsets = apriori(grid_baskets_df, min_support=0.25, use_colnames=True)
# Display the frequent itemsets
print("Frequent Itemsets with Support:")
print(frequent_itemsets)

Frequent Itemsets with Support:
     support                                           itemsets
0   0.250384                                (Convenience Store)
1   0.253153                                    (Interior Shop)
2   0.482621                                             (Park)
3   0.571516                                  (Transit Station)
4   0.301446                                         (Hospital)
..       ...                                                ...
76  0.255306   (Transit Station, Hair Salon, Building Material)
77  0.264226  (Accountant Office, Transit Station, Building ...
78  0.262688  (Transit Station, Heavy Industry, Building Mat...
79  0.253768  (Home Appliances, Building Material, Real Estate)
80  0.254998  (Transit Station, Park, Building Material, Rea...

[81 rows x 2 columns]




In [30]:
# Run apriori algorithm with a minimum support of 0.25
frequent_itemsets = apriori(grid_baskets_df, min_support=0.25, use_colnames=True)

# Generate association rules with a minimum lift of 1
association_rules_df = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

# Sort the rules by confidence and lift, in descending order
sorted_rules = association_rules_df.sort_values(by=['confidence', 'lift'], ascending=[False, False])

# Display the top association rules
print("Top Association Rules:")
print(sorted_rules.head())

Top Association Rules:
                          antecedents        consequents  antecedent support  \
184   (Hair Salon, Building Material)  (Transit Station)            0.271301   
159         (Hair Salon, Real Estate)  (Transit Station)            0.285451   
165  (Accountant Office, Real Estate)  (Transit Station)            0.278376   
147  (Elderly Care Home, Real Estate)  (Transit Station)            0.272224   
21                         (Hospital)  (Transit Station)            0.301446   

     consequent support   support  confidence      lift  leverage  conviction  \
184            0.571516  0.255306    0.941043  1.646572  0.100253    7.267740   
159            0.571516  0.267302    0.936422  1.638487  0.104163    6.739538   
165            0.571516  0.258382    0.928177  1.624060  0.099286    5.965809   
147            0.571516  0.252230    0.926554  1.621220  0.096650    5.833968   
21             0.571516  0.278683    0.924490  1.617608  0.106402    5.674512   

     zhan



In [31]:
frequent_itemsets_copy = frequent_itemsets.copy()

# Recompute each itemset's support manually and add it as a new column
support_counts = []
for index, row in frequent_itemsets_copy.iterrows():
    itemset = list(row["itemsets"])  # Convert frozenset of items to list
    items = one_hot_df[itemset]  # Get the columns in the itemset
    mask = items.all(axis=1)  # True for grid cells that contain every item in the itemset
    support_count = mask.sum() / len(one_hot_df)  # Support as a fraction of grid cells (not a raw count)
    support_counts.append(support_count)

frequent_itemsets_copy["check_support"] = support_counts

# Sort by the apriori support (descending) and reset the index;
# the sorted result is stored in freq_itemsets_copy, frequent_itemsets_copy keeps its original order
freq_itemsets_copy = frequent_itemsets_copy.sort_values("support", ascending=False).reset_index(drop=True)

# Display the updated frequent itemsets with manually calculated support
print("Frequent Itemsets with Manually Verified Support:")
print(freq_itemsets_copy.head())

Frequent Itemsets with Manually Verified Support:
    support             itemsets  check_support
0  0.571516    (Transit Station)       0.571516
1  0.482621               (Park)       0.482621
2  0.472778  (Building Material)       0.472778
3  0.434943    (Home Appliances)       0.434943
4  0.426946     (Heavy Industry)       0.426946


In [32]:
frequent_itemsets_copy.to_csv("frequent_itemsets_cityC.csv")

In [33]:
# Create a new DataFrame to store selected columns from the sorted association rules
rules_summary = sorted_rules[["antecedents", "consequents", "lift"]].copy()

# Combine antecedents and consequents into a single column representing the rule
rules_summary["rule"] = (
    rules_summary["antecedents"].apply(lambda items: ", ".join(map(str, items)))
    + " -> "
    + rules_summary["consequents"].apply(lambda items: ", ".join(map(str, items)))
)

# Drop the original antecedents and consequents columns
rules_summary.drop(columns=["antecedents", "consequents"], inplace=True)

# Reorder columns for better readability (placing the rule column first)
rules_summary = rules_summary[["rule", "lift"]]

# Sort the DataFrame by the lift value in descending order and reset the index
rules_summary = rules_summary.sort_values(by="lift", ascending=False).reset_index(drop=True)

# Display the top rules with the highest lift
print("Top Rules with Lift Values:")
print(rules_summary.head())


Top Rules with Lift Values:
                                         rule      lift
0  Hair Salon -> Transit Station, Real Estate  2.242471
1  Transit Station, Real Estate -> Hair Salon  2.242471
2             Hair Salon -> Park, Real Estate  2.215365
3             Park, Real Estate -> Hair Salon  2.215365
4             Real Estate -> Park, Hair Salon  2.187702


# CITY D

In [34]:
import numpy as np 
import pandas as pd 
from mlxtend.frequent_patterns import apriori, association_rules

In [35]:
# Load data
POI_data = pd.read_csv('POIdata_cityD.csv')
categories = pd.read_csv('POI_datacategories.csv', header=None)

In [36]:
# Create grid_id for each POI based on its x, y coordinates
POI_data['grid_id'] = POI_data['x'].astype(str) + "_" + POI_data['y'].astype(str)

# Group POI data by grid_id and collect the categories for each grid
grid_baskets = POI_data.groupby('grid_id')['category'].apply(list)

# Deduplicate the categories in each basket so that only presence, not quantity, is recorded
grid_baskets = grid_baskets.apply(lambda x: list(set(x)))
grid_baskets

grid_id
100_1                                                   [81]
100_100    [1, 3, 4, 5, 7, 8, 9, 10, 12, 13, 14, 16, 17, ...
100_101    [4, 5, 6, 7, 8, 9, 10, 13, 14, 17, 18, 19, 20,...
100_102    [1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
100_103    [1, 4, 5, 9, 10, 11, 13, 14, 18, 19, 33, 34, 3...
                                 ...                        
9_29                                        [82, 66, 69, 14]
9_33                                            [80, 17, 69]
9_59                                                    [81]
9_60                                                [48, 63]
9_61                                            [48, 74, 47]
Name: category, Length: 10988, dtype: object

In [37]:
# Create a one-hot encoding matrix for the categories in each grid
one_hot_list = []

for basket in grid_baskets:
    row = [0] * len(categories)
    for category in basket:
        row[category - 1] = 1
    one_hot_list.append(row)

# Create a DataFrame for one-hot encoded categories
one_hot_df = pd.DataFrame(one_hot_list, columns=categories[0])
one_hot_df

,Food,Shopping,Entertainment,Japanese restaurant,Western restaurant,Eat all you can restaurant,Chinese restaurant,Indian restaurant,Ramen restaurant,Curry restaurant,...,Accountant Office,IT Office,Publisher Office,Building Material,Gardening,Heavy Industry,NPO,Utility Copany,Port,Research Facility
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,1,0,1,1,1,0,1,1,1,1,...,1,0,1,1,1,1,1,1,1,1
2,0,0,0,1,1,1,1,1,1,1,...,1,1,1,1,0,1,1,1,1,0
3,1,0,1,1,1,1,1,1,1,1,...,1,1,1,1,0,1,1,1,0,0
4,1,0,0,1,1,0,0,0,1,1,...,1,0,1,1,0,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10983,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
10984,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
10985,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
10986,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
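
The one-hot step above can also be sketched without an explicit loop. This is a minimal alternative on toy data, not the notebook's method: `toy_baskets` and its grid IDs are hypothetical, and the columns here are category IDs rather than category names. Unlike the loop above, it only creates columns for categories that actually occur.

```python
import pandas as pd

# Toy baskets keyed by a hypothetical grid_id; values are category IDs.
toy_baskets = pd.Series(
    {"1_1": [3, 1], "1_2": [2], "2_1": [1, 2, 3]},
    name="category",
)
toy_baskets.index.name = "grid_id"

# One row per (grid_id, category) pair.
pairs = toy_baskets.explode().reset_index()
pairs["category"] = pairs["category"].astype(int)

# Cross-tabulate grid cells against categories, then reduce counts to presence flags.
toy_one_hot = pd.crosstab(pairs["grid_id"], pairs["category"]).gt(0)

print(toy_one_hot)
```

Reducing the counts with `.gt(0)` keeps the analysis focused on presence rather than quantity, in the same spirit as the set-based deduplication of the baskets.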


In [38]:
# Create a DataFrame holding each grid_id and its list of categories
grid_baskets_df = pd.DataFrame({'grid_id': grid_baskets.index, 'categories': grid_baskets})
grid_baskets_df

grid_id,grid_id,categories
100_1,100_1,[81]
100_100,100_100,"[1, 3, 4, 5, 7, 8, 9, 10, 12, 13, 14, 16, 17, ..."
100_101,100_101,"[4, 5, 6, 7, 8, 9, 10, 13, 14, 17, 18, 19, 20,..."
100_102,100_102,"[1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1..."
100_103,100_103,"[1, 4, 5, 9, 10, 11, 13, 14, 18, 19, 33, 34, 3..."
...,...,...
9_29,9_29,"[82, 66, 69, 14]"
9_33,9_33,"[80, 17, 69]"
9_59,9_59,[81]
9_60,9_60,"[48, 63]"


In [39]:
# Create a DataFrame for grid_baskets and merge it with the one-hot encoding
grid_baskets_df = pd.DataFrame({'grid_id': grid_baskets.index, 'categories': grid_baskets})
# Ensure the index of one_hot_df matches the grid_baskets_df index
one_hot_df.index = grid_baskets_df.index

# Join the one-hot encoding DataFrame with grid_baskets_df
grid_baskets_df = grid_baskets_df.drop(['categories'], axis=1).join(one_hot_df)

# Set grid_id as the index
grid_baskets_df.set_index('grid_id', inplace=True)
grid_baskets_df

grid_id,Food,Shopping,Entertainment,Japanese restaurant,Western restaurant,Eat all you can restaurant,Chinese restaurant,Indian restaurant,Ramen restaurant,Curry restaurant,...,Accountant Office,IT Office,Publisher Office,Building Material,Gardening,Heavy Industry,NPO,Utility Copany,Port,Research Facility
100_1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
100_100,1,0,1,1,1,0,1,1,1,1,...,1,0,1,1,1,1,1,1,1,1
100_101,0,0,0,1,1,1,1,1,1,1,...,1,1,1,1,0,1,1,1,1,0
100_102,1,0,1,1,1,1,1,1,1,1,...,1,1,1,1,0,1,1,1,0,0
100_103,1,0,0,1,1,0,0,0,1,1,...,1,0,1,1,0,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9_29,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9_33,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
9_59,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
9_60,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [40]:
# Run apriori algorithm with a minimum support of 0.1
frequent_itemsets = apriori(grid_baskets_df, min_support=0.1, use_colnames=True)
# Display the frequent itemsets
print("Frequent Itemsets with Support:")
print(frequent_itemsets)

Frequent Itemsets with Support:
     support                                           itemsets
0   0.138151                              (Japanese restaurant)
1   0.112668                                    (Interior Shop)
2   0.189843                                    (Grocery Store)
3   0.151074                                             (Park)
4   0.308063                                  (Transit Station)
..       ...                                                ...
82  0.172370                (Heavy Industry, Building Material)
83  0.101656   (Transit Station, Hair Salon, Building Material)
84  0.104933  (Transit Station, Heavy Industry, Building Mat...
85  0.111303  (Home Appliances, Heavy Industry, Building Mat...
86  0.102111    (Heavy Industry, Hair Salon, Building Material)

[87 rows x 2 columns]




In [41]:
# Run apriori algorithm with a minimum support of 0.1
frequent_itemsets = apriori(grid_baskets_df, min_support=0.1, use_colnames=True)

# Generate association rules with a minimum lift of 1
association_rules_df = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

# Sort the rules by confidence and lift, in descending order
sorted_rules = association_rules_df.sort_values(by=['confidence', 'lift'], ascending=[False, False])

# Display the top association rules
print("Top Association Rules:")
print(sorted_rules.head())

Top Association Rules:
                       antecedents          consequents  antecedent support  \
35                      (Hospital)         (Hair Salon)            0.143975   
11                      (Hospital)    (Transit Station)            0.143975   
136   (Heavy Industry, Hair Salon)  (Building Material)            0.139880   
118  (Hair Salon, Transit Station)  (Building Material)            0.142064   
53                   (Real Estate)  (Building Material)            0.174827   

     consequent support   support  confidence      lift  leverage  conviction  \
35             0.230251  0.105934    0.735777  3.195543  0.072783    2.913260   
11             0.308063  0.105752    0.734513  2.384293  0.061398    2.606295   
136            0.322625  0.102111    0.729993  2.262671  0.056983    2.508737   
118            0.322625  0.101656    0.715567  2.217955  0.055823    2.381493   
53             0.322625  0.125046    0.715252  2.216980  0.068642    2.378863   

     zhangs_met
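
To make the columns in the rule table easier to read, the sketch below recomputes confidence, lift, leverage and conviction for the top rule, (Hospital) -> (Hair Salon), from the standard definitions. The three input values are copied from the printed output above; the variable names are illustrative only. Because the inputs are rounded to six decimals, the results agree closely with the table but not to the last digit.

```python
# Values copied from the top rule above: (Hospital) -> (Hair Salon).
antecedent_support = 0.143975  # share of grid cells containing Hospital
consequent_support = 0.230251  # share of grid cells containing Hair Salon
rule_support = 0.105934        # share of grid cells containing both

# Confidence: how often the consequent appears when the antecedent is present.
confidence = rule_support / antecedent_support

# Lift: confidence relative to the consequent's baseline frequency.
lift = confidence / consequent_support

# Leverage: observed joint support minus the support expected under independence.
leverage = rule_support - antecedent_support * consequent_support

# Conviction: ratio of expected to observed frequency of the rule failing.
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.4f} "
      f"leverage={leverage:.4f} conviction={conviction:.4f}")
```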



In [42]:
frequent_itemsets_copy = frequent_itemsets.copy()

# Recompute each itemset's support by hand and store it in a new column
support_counts = []
for index, row in frequent_itemsets_copy.iterrows():
    itemset = list(row["itemsets"])  # Convert the frozenset of items to a list of column names
    items = one_hot_df[itemset]  # Select the one-hot columns for the items in the itemset
    mask = items.all(axis=1)  # True for each grid cell that contains every item in the itemset
    support_count = mask.sum() / len(one_hot_df)  # Support = matching grid cells / total grid cells
    support_counts.append(support_count)

frequent_itemsets_copy["check_support"] = support_counts

# Sort by support in descending order and reset the index
freq_itemsets_copy = frequent_itemsets_copy.sort_values("support", ascending=False).reset_index(drop=True)

# Display the updated frequent itemsets with manually calculated support
print("Frequent Itemsets with Manually Verified Support:")
print(freq_itemsets_copy.head())

Frequent Itemsets with Manually Verified Support:
    support             itemsets  check_support
0  0.360939             (Church)       0.360939
1  0.334274     (Heavy Industry)       0.334274
2  0.331361    (Home Appliances)       0.331361
3  0.322625  (Building Material)       0.322625
4  0.308063    (Transit Station)       0.308063
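
The manual check above compares the two support columns by eye. A minimal sketch of an automated comparison follows, using toy data rather than the notebook's variables: `toy_one_hot` and `toy_itemsets` are hypothetical stand-ins for `one_hot_df` and `frequent_itemsets_copy`.

```python
import numpy as np
import pandas as pd

# Toy one-hot matrix: rows are grid cells, columns are POI categories.
toy_one_hot = pd.DataFrame(
    {"Park": [1, 1, 0, 1], "Church": [1, 0, 0, 1], "Port": [0, 1, 1, 1]}
)

# Toy itemsets with the support values a miner would be expected to report.
toy_itemsets = pd.DataFrame(
    {
        "itemsets": [frozenset({"Park"}), frozenset({"Park", "Church"})],
        "support": [0.75, 0.5],
    }
)

# The mean of the row-wise "all items present" mask is the itemset's support.
toy_itemsets["check_support"] = toy_itemsets["itemsets"].apply(
    lambda items: toy_one_hot[list(items)].all(axis=1).mean()
)

# Compare with a tolerance, since both columns hold floating-point values.
supports_match = np.allclose(toy_itemsets["support"], toy_itemsets["check_support"])
print(supports_match)
```

A single boolean like `supports_match` is easier to check across several cities than inspecting the first five rows of each table.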


In [43]:
frequent_itemsets_copy.to_csv("frequent_itemsets_cityD.csv")

In [44]:
# Create a new DataFrame to store selected columns from the sorted association rules
rules_summary = sorted_rules[["antecedents", "consequents", "lift"]].copy()

# Combine antecedents and consequents into a single column representing the rule
rules_summary["rule"] = (
    rules_summary["antecedents"].apply(lambda items: ", ".join(map(str, items)))
    + " -> "
    + rules_summary["consequents"].apply(lambda items: ", ".join(map(str, items)))
)

# Drop the original antecedents and consequents columns
rules_summary.drop(columns=["antecedents", "consequents"], inplace=True)

# Reorder columns for better readability (placing the rule column first)
rules_summary = rules_summary[["rule", "lift"]]

# Sort the DataFrame by the lift value in descending order and reset the index
rules_summary = rules_summary.sort_values(by="lift", ascending=False).reset_index(drop=True)

# Display the top rules with the highest lift
print("Top Rules with Lift Values:")
print(rules_summary.head())


Top Rules with Lift Values:
                                               rule      lift
0                            Hospital -> Hair Salon  3.195543
1                            Hair Salon -> Hospital  3.195543
2                           Laundry  -> Real Estate  2.986421
3                           Real Estate -> Laundry   2.986421
4  Building Material, Transit Station -> Hair Salon  2.860391
