# Frequently bought together

Instead of using raw sales (which can be skewed by products with few sales), damping pulls each product's sales mean closer to the global mean. This reduces the effect of products with very few sales.
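A minimal sketch of one common form of damping (a shrunken mean), to make the idea concrete. The function name, the damping strength `k`, and the toy numbers are illustrative assumptions, not values from this project.

```python
def damped_mean(values, global_mean, k=5):
    """Shrink the mean of `values` toward `global_mean`.

    Equivalent to adding k pseudo-observations equal to the global mean.
    `k` is a hypothetical damping strength chosen for illustration.
    """
    return (sum(values) + k * global_mean) / (len(values) + k)


global_mean = 10.0
few_sales = [30.0]           # one observation, raw mean 30.0
many_sales = [30.0] * 100    # 100 observations, raw mean 30.0

print(damped_mean(few_sales, global_mean))   # pulled strongly toward 10
print(damped_mean(many_sales, global_mean))  # stays close to 30
```

The fewer observations a product has, the more the `k` pseudo-observations dominate, which is the behaviour described above.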

Content-based filtering isn’t just “automatic.” It depends heavily on understanding the domain and user preferences in that domain.

Association rules (support, confidence, lift) are computed at the order level. Keep only associations with lift > 1 (preferably much higher) so that popular items, which co-occur with almost everything, do not dominate.
Max popularity: sometimes items that appear in more than, say, 50% of baskets are dropped because they’re “too common.”
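The max-popularity rule can be sketched as follows. The function name, the 50% default, and the toy baskets are assumptions for illustration; the notebook below does not apply this filter.

```python
from collections import Counter


def drop_too_common(baskets, max_share=0.5):
    """Remove items that appear in more than `max_share` of all baskets."""
    n = len(baskets)
    # Count each item at most once per basket.
    freq = Counter(item for basket in baskets for item in set(basket))
    too_common = {item for item, c in freq.items() if c / n > max_share}
    return [set(basket) - too_common for basket in baskets], too_common


toy_baskets = [
    {"socks", "bra"},
    {"socks", "trousers"},
    {"socks", "bra", "t-shirt"},
    {"t-shirt"},
]
filtered, removed = drop_too_common(toy_baskets, max_share=0.5)
print(removed)  # {'socks'}: present in 3 of 4 baskets
```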

### Lift formula

$$
\text{Lift}(A \rightarrow B) = \frac{P(A \land B)}{P(A) \cdot P(B)}
$$

* $P(A \land B)$ = fraction of baskets with both A and B
* $P(A)$ = fraction of baskets with A
* $P(B)$ = fraction of baskets with B

---

### Example with **trosor → bh** (briefs → bra)

* total baskets = 10
* baskets with **trosor** = 7 → $P(A) = 7/10 = 0.70$
* baskets with **bh** = 5 → $P(B) = 5/10 = 0.50$
* baskets with both = 4 → $P(A \land B) = 4/10 = 0.40$

$$
\text{Lift} = \frac{0.40}{0.70 \times 0.50} = \frac{0.40}{0.35} \approx 1.14
$$
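The worked example above can be reproduced with a few lines of Python. The basket contents are constructed to match the stated counts (10 baskets, 7 with trosor, 5 with bh, 4 with both); the `"other"` item is a hypothetical filler.

```python
def lift(baskets, a, b):
    """Lift(a -> b) = P(a and b) / (P(a) * P(b)) over a list of baskets."""
    n = len(baskets)
    p_a = sum(a in basket for basket in baskets) / n
    p_b = sum(b in basket for basket in baskets) / n
    p_ab = sum(a in basket and b in basket for basket in baskets) / n
    return p_ab / (p_a * p_b)


baskets = (
    [{"trosor", "bh"}] * 4   # both items
    + [{"trosor"}] * 3       # trosor only
    + [{"bh"}] * 1           # bh only
    + [{"other"}] * 2        # neither (hypothetical filler item)
)
print(round(lift(baskets, "trosor", "bh"), 2))  # 1.14
```

Swapping the two arguments gives the same value, since the formula is symmetric in A and B.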

---

### Interpretation

* **Lift = 1** → they are independent (no real relation).
* **Lift > 1** → they occur together **more often than chance** (positive association).
* **Lift < 1** → they occur together **less often than chance** (negative association).



## EDA on Status = Active. Most popular items?

In [10]:
import pandas as pd
tx = pd.read_parquet("/workspace/data/processed/transactions_clean.parquet")

In [13]:
articles = pd.read_parquet("/workspace/data/processed/articles_clean.parquet")
tx['groupId'] = tx['groupId'].astype(str)
articles['groupId'] = articles['groupId'].astype(str)

most_popular = (
    tx.groupby('groupId', as_index=False)
      .agg(
          count=('groupId', 'size'),
          avg_price_sek=('price_sek', 'mean'),
          avg_age=('Age', 'mean'),
          name=('name', lambda x: x.mode().iat[0] if not x.mode().empty else x.iloc[0])
      )
      .sort_values('count', ascending=False)
      .head(10)
)

most_popular['avg_price_sek'] = most_popular['avg_price_sek'].round().astype(int)
most_popular['avg_age'] = most_popular['avg_age'].round().astype(int)

# Bring in the brand and category columns from the articles table
most_popular = most_popular.merge(
    articles[['groupId', 'brand', 'category']].drop_duplicates('groupId'),
    on='groupId',
    how='left'
)

most_popular

Unnamed: 0,groupId,count,avg_price_sek,avg_age,name,brand,category
0,261637,5396,70,78,Ankelsocka VID,Locköstrumpan,"Stödstrumpor,Strumpor,Underkläder"
1,260695,5314,162,75,Seamless bh-topp,Louise,"Bh-toppar,Bh,Bh utan kupstorlek,Underkläder"
2,240187,4239,324,76,Fritidsbyxa,Åshild,"Mjukisbyxor,Mysplagg,Byxor"
3,210338,4015,234,76,T-shirt 2-pack,Åshild,"Toppar,Överdelar,T-shirts"
4,210695,3221,204,77,Långärmad T-shirt,Åshild,"Toppar,Överdelar"
5,260646,3154,211,79,Trosa 3-pack,Åshild,"Underkläder,Trosor"
6,210186,3123,219,78,Polojumper,Åshild,"Toppar,Överdelar"
7,241562,3080,535,77,Velourbyxa,Åshild,"Byxor,Mjukisbyxor,Nederdelar,Mysplagg"
8,260596,2979,468,72,Bh utan bygel Stars,Swegmark,"Bh utan bygel,Bh,Underkläder"
9,260513,2488,513,74,Bh utan bygel med Magic Lift-funktion,Glamorise,"Bh utan bygel,Bh,Underkläder"


## Support: fraction of orders that contain both items (count / total_orders)

In [14]:
import pandas as pd
from itertools import combinations

# 1. make each order a basket of unique items
baskets = (
    tx.groupby('orderId')['groupId']
      .apply(set)  # one set of items per order
)

total_orders = len(baskets)

# 2. count co-occurrences
pair_counts = {}

for items in baskets:
    for a, b in combinations(sorted(items), 2):
        pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1

# 3. build dataframe
support_df = pd.DataFrame(
    [(a, b, count/total_orders, count) for (a, b), count in pair_counts.items()],
    columns=['itemA', 'itemB', 'support', 'count']  # count = number of baskets that contain the pair
)

support_df = support_df[support_df['count'] >= 20] # at least 20 orders

support_df.sort_values(by='count', ascending=False)

Unnamed: 0,itemA,itemB,support,count
16464,240187,261637,0.004363,462
20363,210695,240187,0.002824,299
61309,210727,240187,0.002597,275
1932,260646,261637,0.002512,266
9780,260646,260695,0.002238,237
...,...,...,...,...
32778,240187,264242,0.000189,20
32673,260572,262287,0.000189,20
32593,210186,291088,0.000189,20
67295,210759,240176,0.000189,20
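The pair-counting loop above can also be expressed with `collections.Counter`. This is a small stand-alone sketch on toy orders (the item labels are made up), not a replacement for the cell above.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return {(a, b): (count, support)} for every unordered item pair."""
    total = len(baskets)
    counts = Counter(
        pair
        for items in baskets
        # set() collapses repeated items; sorted() gives a canonical (a, b) order
        for pair in combinations(sorted(set(items)), 2)
    )
    return {pair: (c, c / total) for pair, c in counts.items()}


toy_orders = [
    ["A", "B", "C"],
    ["A", "B"],
    ["B", "C", "B"],  # the repeated item counts once per basket
    ["A"],
]
result = pair_support(toy_orders)
print(result[("A", "B")])  # (2, 0.5)
```

Sorting each basket before calling `combinations` is what guarantees that `(A, B)` and `(B, A)` land on the same key.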


## Compute **confidence**

For a rule $A \rightarrow B$:

$$
\text{confidence}(A \rightarrow B) = \frac{\#(A \land B)}{\#(A)}
$$

* $\#(A \land B)$ = number of orders containing both A and B
* $\#(A)$ = number of orders containing A
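Unlike lift, confidence depends on the direction of the rule. A short sketch with toy counts (illustrative numbers, not taken from the data):

```python
def confidence(count_ab, count_a):
    """confidence(A -> B) = #(A and B) / #(A)."""
    if count_a == 0:
        raise ValueError("item A never occurs, so confidence is undefined")
    return count_ab / count_a


# Toy counts: A is in 200 orders, B in 50, both in 20.
orders_with_a, orders_with_b, orders_with_both = 200, 50, 20

conf_a_to_b = confidence(orders_with_both, orders_with_a)  # 20 / 200
conf_b_to_a = confidence(orders_with_both, orders_with_b)  # 20 / 50
print(conf_a_to_b, conf_b_to_a)  # 0.1 0.4
```

The rarer item has the higher confidence toward the more common one, because the same co-occurrence count is divided by a smaller denominator.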



In [15]:
MIN_CONF = 0.05  # If a customer buys A, there’s at least a 5% chance they also buy B.

# count how many orders each item appears in
item_counts = (
    tx.groupby('groupId')['orderId']
      .nunique()
)

# map counts for each side of the pair
support_df['orders_with_item_A'] = support_df['itemA'].map(item_counts)
support_df['orders_with_item_B'] = support_df['itemB'].map(item_counts)

# compute confidence for both directions
support_df['conf_A_to_B'] = support_df['count'] / support_df['orders_with_item_A']
support_df['conf_B_to_A'] = support_df['count'] / support_df['orders_with_item_B']

# filter by minimum confidence in either direction
support_df = support_df[
    (support_df['conf_A_to_B'] >= MIN_CONF) | (support_df['conf_B_to_A'] >= MIN_CONF)
]

# keep only the clean columns
support_df = support_df[['itemA','itemB','count','support',
                         'orders_with_item_A','orders_with_item_B',
                         'conf_A_to_B','conf_B_to_A']]


In [16]:
support_df.sort_values(by='count', ascending=False)

Unnamed: 0,itemA,itemB,count,support,orders_with_item_A,orders_with_item_B,conf_A_to_B,conf_B_to_A
16464,240187,261637,462,0.004363,3684,3873,0.125407,0.119287
20363,210695,240187,299,0.002824,2285,3684,0.130853,0.081162
61309,210727,240187,275,0.002597,1300,3684,0.211538,0.074647
1932,260646,261637,266,0.002512,3102,3873,0.085751,0.068681
9780,260646,260695,237,0.002238,3102,4098,0.076402,0.057833
...,...,...,...,...,...,...,...,...
1743,210634,210668,20,0.000189,217,410,0.092166,0.048780
19490,200400,503376,20,0.000189,1855,305,0.010782,0.065574
19910,260174,262287,20,0.000189,342,453,0.058480,0.044150
17582,291088,293738,20,0.000189,716,380,0.027933,0.052632


## Add **lift**

$$
\text{Lift}(A \to B) = \frac{\text{conf}(A \to B)}{P(B)} 
= \frac{\text{count} \cdot \text{total orders}}{\text{orders with A} \cdot \text{orders with B}}
$$
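The second equality follows by substituting counts for the probabilities. The symbols $N$ (total orders), $n_A$, $n_B$ (orders with A, with B) and $n_{AB}$ (orders with both) are shorthand introduced here for the step:

$$
\text{Lift}(A \to B) = \frac{\text{conf}(A \to B)}{P(B)}
= \frac{n_{AB} / n_A}{n_B / N}
= \frac{n_{AB} \cdot N}{n_A \cdot n_B}
$$

This agrees with the earlier definition, since $\frac{P(A \land B)}{P(A) \cdot P(B)} = \frac{n_{AB}/N}{(n_A/N)(n_B/N)}$ reduces to the same expression.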

In [17]:
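MIN_LIFT = 2  # lift of 2 = the pair co-occurs twice as often as expected by chance (100% more)
# NOTE: MIN_LIFT is only defined here; no filter on it is applied in the cells shown.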

total_orders = tx['orderId'].nunique()

support_df['lift_A_to_B'] = (
    (support_df['count'] * total_orders) /
    (support_df['orders_with_item_A'] * support_df['orders_with_item_B'])
)

# since lift is symmetric, lift_A_to_B == lift_B_to_A
# keep one column
support_df = support_df[['itemA','itemB','count','support',
                         'orders_with_item_A','orders_with_item_B',
                         'conf_A_to_B','conf_B_to_A','lift_A_to_B']]
support_df = support_df.rename(columns={'lift_A_to_B': 'lift'})

In [18]:
support_df.sort_values(by='lift', ascending=False).head(10)

Unnamed: 0,itemA,itemB,count,support,orders_with_item_A,orders_with_item_B,conf_A_to_B,conf_B_to_A,lift
3908,270304,270305,21,0.000198,26,30,0.807692,0.7,2851.046154
100147,200965,200970,24,0.000227,36,30,0.666667,0.8,2353.244444
701,270307,270308,68,0.000642,108,81,0.62963,0.839506,823.151349
10049,290104,290207,30,0.000283,146,66,0.205479,0.454545,329.688667
6547,390504,390505,21,0.000198,129,55,0.162791,0.381818,313.434249
2760,260345,266882,25,0.000236,59,150,0.423729,0.166667,299.141243
27638,200312,200313,28,0.000264,169,79,0.16568,0.35443,222.087334
6332,291724,291732,25,0.000236,110,118,0.227273,0.211864,203.959938
21101,530335,530341,21,0.000198,167,69,0.125749,0.304348,192.989326
4014,210615,210622,26,0.000246,130,115,0.2,0.226087,184.166957
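`MIN_LIFT` is defined above, but the cells shown do not filter on it. A hedged sketch of what such a filter could look like, in plain Python on toy rows (the item labels and counts are invented for illustration):

```python
MIN_LIFT = 2  # same threshold value as in the notebook cell


def lift_from_counts(count_ab, count_a, count_b, total_orders):
    """Lift = count_ab * total_orders / (count_a * count_b)."""
    return count_ab * total_orders / (count_a * count_b)


# Toy rows: (itemA, itemB, count, orders_with_A, orders_with_B)
toy_total_orders = 1000
toy_pairs = [
    ("X", "Y", 30, 100, 100),  # lift 3.0, kept
    ("X", "Z", 25, 100, 250),  # lift 1.0, independent, dropped
    ("Y", "Z", 10, 100, 250),  # lift 0.4, negative association, dropped
]

kept = [
    (a, b, lift_from_counts(c, n_a, n_b, toy_total_orders))
    for a, b, c, n_a, n_b in toy_pairs
    if lift_from_counts(c, n_a, n_b, toy_total_orders) >= MIN_LIFT
]
print(kept)  # [('X', 'Y', 3.0)]
```

On the notebook's DataFrame the equivalent would presumably be a boolean mask such as `support_df[support_df['lift'] >= MIN_LIFT]`. Note also that the highest lifts in the table come from pairs with only 21 to 68 co-purchases, so a minimum count threshold matters alongside lift.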


## Create a df with recs

In [28]:
import pandas as pd

# 1) Build directional rules (A→B and B→A) and KEEP 'count'
rules_AB = support_df[['itemA','itemB','count','lift','conf_A_to_B']].rename(
    columns={'itemA':'src_groupId','itemB':'rec_groupId','conf_A_to_B':'conf'}
)
rules_BA = support_df[['itemB','itemA','count','lift','conf_B_to_A']].rename(
    columns={'itemB':'src_groupId','itemA':'rec_groupId','conf_B_to_A':'conf'}
)
rules = pd.concat([rules_AB, rules_BA], ignore_index=True)

# 2) Clean & sort by strength (so dedup keeps strongest)
rules['src_groupId'] = rules['src_groupId'].astype(str)
rules['rec_groupId'] = rules['rec_groupId'].astype(str)
rules['count'] = rules['count'].astype(int)
rules = rules[rules['src_groupId'] != rules['rec_groupId']]
rules = rules.sort_values(['lift','conf','count'], ascending=False)
rules = rules.drop_duplicates(['src_groupId','rec_groupId'])  # drop repeated (src, rec) rows, keeping the strongest; A→B and B→A are both kept

# 3) Cap to top-10 per source
rules_top10 = rules.groupby('src_groupId', group_keys=False).head(10)

# 4) Pack into requested structure, adding copurchase count
def pack_group(g: pd.DataFrame):
    # already sorted; preserve order
    return [
        {"groupId": rec_id, "source": "assoc", "count": int(cnt)}
        for rec_id, cnt in zip(g['rec_groupId'].tolist(), g['count'].tolist())
    ]

recs_df = (
    rules_top10.groupby('src_groupId', as_index=False)
               .apply(lambda g: pd.Series({'recs': pack_group(g)}))
               .reset_index(drop=True)
)

recs_df.head()




Unnamed: 0,src_groupId,recs
0,200304,"[{'groupId': '281410', 'source': 'assoc', 'count': 22}, {'groupId': '240276', 'source': 'assoc', 'count': 42}, {'groupId': '210338', 'source': 'assoc', 'count': 104}, {'groupId': '240187', 'source': 'assoc', 'count': 69}]"
1,200400,"[{'groupId': '530330', 'source': 'assoc', 'count': 20}, {'groupId': '242024', 'source': 'assoc', 'count': 55}]"
2,210186,"[{'groupId': '210695', 'source': 'assoc', 'count': 59}]"
3,210338,"[{'groupId': '210756', 'source': 'assoc', 'count': 103}, {'groupId': '210729', 'source': 'assoc', 'count': 57}, {'groupId': '200304', 'source': 'assoc', 'count': 104}, {'groupId': '240276', 'source': 'assoc', 'count': 74}, {'groupId': '241687', 'source': 'assoc', 'count': 85}, {'groupId': '220028', 'source': 'assoc', 'count': 25}, {'groupId': '241653', 'source': 'assoc', 'count': 43}, {'groupId': '281410', 'source': 'assoc', 'count': 29}, {'groupId': '240012', 'source': 'assoc', 'count': 26}, {'groupId': '210754', 'source': 'assoc', 'count': 25}]"
4,210695,"[{'groupId': '210186', 'source': 'assoc', 'count': 59}, {'groupId': '210338', 'source': 'assoc', 'count': 63}, {'groupId': '240187', 'source': 'assoc', 'count': 58}, {'groupId': '261637', 'source': 'assoc', 'count': 48}]"
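For readers who want to see steps 2 to 4 without pandas, here is a minimal plain-Python sketch of the same idea: rank rules by strength, cap the list per source, and pack each recommendation as a dict. The function name and the toy rules are assumptions for illustration.

```python
from collections import defaultdict


def top_n_recs(rules, n=10):
    """rules: iterable of (src, rec, lift, conf, count) tuples.

    Returns {src: [{"groupId": rec, "source": "assoc", "count": count}, ...]}
    with at most n recs per source, strongest (lift, conf, count) first.
    """
    ranked = sorted(rules, key=lambda r: (r[2], r[3], r[4]), reverse=True)
    out = defaultdict(list)
    for src, rec, _lift, _conf, count in ranked:
        if src != rec and len(out[src]) < n:
            out[src].append({"groupId": rec, "source": "assoc", "count": int(count)})
    return dict(out)


# Toy rules (illustrative values only).
toy_rules = [
    ("A", "B", 5.0, 0.20, 40),
    ("A", "C", 9.0, 0.10, 22),
    ("B", "A", 5.0, 0.30, 40),
    ("A", "D", 2.5, 0.05, 60),
]
recs = top_n_recs(toy_rules, n=2)
print(recs["A"])  # C (lift 9.0) first, then B (lift 5.0); D is cut by n=2
```

Because ranking is by lift first, a recommendation with a low co-purchase count can appear ahead of one with a much higher count, which matches the ordering visible in the output above.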


## Add the remaining active groupIds that have no associated products

In [31]:
# Load articles and filter to active only
articles = pd.read_csv("../data/processed/articles_clean.csv", dtype={"groupId": str}, low_memory=False)
active_articles = articles.loc[articles['status'] == 'active', 'groupId'].astype(str)

# 1) find which active groupIds are already in recs_df
already_have = set(recs_df['src_groupId'])

# 2) build rows for missing ones with empty recs
missing = [
    {"src_groupId": gid, "recs": []}
    for gid in active_articles if gid not in already_have
]

# 3) append and rebuild final DataFrame
if missing:
    recs_df = pd.concat([recs_df, pd.DataFrame(missing)], ignore_index=True)

recs_df_nonempty = recs_df[recs_df['recs'].apply(lambda x: isinstance(x, list) and len(x) > 0)]
recs_df_nonempty.head()



Unnamed: 0,src_groupId,recs
722,200304,"[{'groupId': '281410', 'source': 'assoc', 'count': 22}, {'groupId': '240276', 'source': 'assoc', 'count': 42}, {'groupId': '210338', 'source': 'assoc', 'count': 104}, {'groupId': '240187', 'source': 'assoc', 'count': 69}]"
724,200400,"[{'groupId': '530330', 'source': 'assoc', 'count': 20}, {'groupId': '242024', 'source': 'assoc', 'count': 55}]"
753,210186,"[{'groupId': '210695', 'source': 'assoc', 'count': 59}]"
754,210338,"[{'groupId': '210756', 'source': 'assoc', 'count': 103}, {'groupId': '210729', 'source': 'assoc', 'count': 57}, {'groupId': '200304', 'source': 'assoc', 'count': 104}, {'groupId': '240276', 'source': 'assoc', 'count': 74}, {'groupId': '241687', 'source': 'assoc', 'count': 85}, {'groupId': '220028', 'source': 'assoc', 'count': 25}, {'groupId': '241653', 'source': 'assoc', 'count': 43}, {'groupId': '281410', 'source': 'assoc', 'count': 29}, {'groupId': '240012', 'source': 'assoc', 'count': 26}, {'groupId': '210754', 'source': 'assoc', 'count': 25}]"
765,210695,"[{'groupId': '210186', 'source': 'assoc', 'count': 59}, {'groupId': '210338', 'source': 'assoc', 'count': 63}, {'groupId': '240187', 'source': 'assoc', 'count': 58}, {'groupId': '261637', 'source': 'assoc', 'count': 48}]"


In [34]:
num_empty = recs_df['recs'].apply(lambda x: isinstance(x, list) and len(x) == 0).sum()
num_with_recs = recs_df['recs'].apply(lambda x: isinstance(x, list) and len(x) > 0).sum()
print(f"Number of empty recs: {num_empty}")
print(f"Number of non-empty recs: {num_with_recs}")


Number of empty recs: 27192
Number of non-empty recs: 146


In [32]:
# Write out the predictions DataFrame to CSV
recs_df.to_csv("../data/predictions/assoc_recommendations.csv", index=False)


## We now have many empty and incomplete recs lists; fill them with similar items from vector similarity

In [21]:
# copurchased_10plus_df (not defined in the cells shown here) has many src_groupIds
# with completely empty recs, and some with incomplete lists (<10).
# We fill all to exactly 10 using vector similarity recs.
# If there are more than 10, we trim to 10.

import json
import ast

# Load vector similarity recommendations
vecsim_df = pd.read_csv("../data/predictions/vector_similarity_recommendations.csv")

# Parse the vector similarity recs (stringified lists) into lists
def parse_recs(x):
    try:
        if isinstance(x, str):
            return ast.literal_eval(x)
        return []
    except Exception:
        return []

# Build a mapping from src_groupId to its vector recs (as list of str)
vecsim_map = dict(
    zip(
        vecsim_df['src_groupId'].astype(str),
        vecsim_df['recs'].apply(parse_recs)
    )
)

recs_rows = []
for idx, row in copurchased_10plus_df.iterrows():
    src_group = str(row['src_groupId'])
    # CF recs: list of str, or empty list
    cf_recs = row['recs'] if isinstance(row['recs'], list) else []
    cf_recs = [str(x) for x in cf_recs if str(x) != ""]
    recs_annotated = []

    # Vector recs: list of str, or empty list
    vecsim_recs = vecsim_map.get(src_group, [])
    vecsim_recs = [str(x) for x in vecsim_recs if str(x) != ""]

    # Add all CF recs first, annotated
    for group in cf_recs:
        recs_annotated.append({"groupId": group, "source": "cf"})

    # Fill up to 10 with vector recs, skipping any already in CF
    for group in vecsim_recs:
        if group not in cf_recs and len(recs_annotated) < 10:
            recs_annotated.append({"groupId": group, "source": "vector"})
        if len(recs_annotated) >= 10:
            break

    # If still fewer than 10, pad with None placeholders. Every vector rec was
    # already considered in the loop above, so there is nothing left to add.
    while len(recs_annotated) < 10:
        recs_annotated.append({"groupId": None, "source": "vector"})

    # If more than 10, trim to 10
    recs_annotated = recs_annotated[:10]

    recs_rows.append({
        "src_groupId": src_group,
        "recs": json.dumps(recs_annotated, ensure_ascii=False)
    })

recs_df = pd.DataFrame(recs_rows)

# Show the result, wide columns
with pd.option_context('display.max_colwidth', None, 'display.width', 2000):
    display(recs_df)

Unnamed: 0,src_groupId,recs
0,261873,"[{""groupId"": ""261574"", ""source"": ""vector""}, {""groupId"": ""261585"", ""source"": ""vector""}, {""groupId"": ""175701"", ""source"": ""vector""}, {""groupId"": ""261463"", ""source"": ""vector""}, {""groupId"": ""261567"", ""source"": ""vector""}, {""groupId"": ""167301"", ""source"": ""vector""}, {""groupId"": ""261591"", ""source"": ""vector""}, {""groupId"": ""266536"", ""source"": ""vector""}, {""groupId"": ""266551"", ""source"": ""vector""}, {""groupId"": ""261626"", ""source"": ""vector""}]"
1,261745,"[{""groupId"": ""261294"", ""source"": ""vector""}, {""groupId"": ""261740"", ""source"": ""vector""}, {""groupId"": ""260912"", ""source"": ""vector""}, {""groupId"": ""261379"", ""source"": ""vector""}, {""groupId"": ""261280"", ""source"": ""vector""}, {""groupId"": ""262038"", ""source"": ""vector""}, {""groupId"": ""261656"", ""source"": ""vector""}, {""groupId"": ""261010"", ""source"": ""vector""}, {""groupId"": ""261938"", ""source"": ""vector""}, {""groupId"": ""267117"", ""source"": ""vector""}]"
2,265298,"[{""groupId"": ""200304"", ""source"": ""cf""}, {""groupId"": ""200400"", ""source"": ""cf""}, {""groupId"": ""210186"", ""source"": ""cf""}, {""groupId"": ""210338"", ""source"": ""cf""}, {""groupId"": ""210695"", ""source"": ""cf""}, {""groupId"": ""210756"", ""source"": ""cf""}, {""groupId"": ""218982"", ""source"": ""cf""}, {""groupId"": ""240144"", ""source"": ""cf""}, {""groupId"": ""240184"", ""source"": ""cf""}, {""groupId"": ""240187"", ""source"": ""cf""}]"
3,260596,"[{""groupId"": ""200304"", ""source"": ""cf""}, {""groupId"": ""210186"", ""source"": ""cf""}, {""groupId"": ""210338"", ""source"": ""cf""}, {""groupId"": ""210695"", ""source"": ""cf""}, {""groupId"": ""210756"", ""source"": ""cf""}, {""groupId"": ""218982"", ""source"": ""cf""}, {""groupId"": ""221416"", ""source"": ""cf""}, {""groupId"": ""240184"", ""source"": ""cf""}, {""groupId"": ""240187"", ""source"": ""cf""}, {""groupId"": ""240276"", ""source"": ""cf""}]"
4,260951,"[{""groupId"": ""200304"", ""source"": ""cf""}, {""groupId"": ""200400"", ""source"": ""cf""}, {""groupId"": ""210338"", ""source"": ""cf""}, {""groupId"": ""218982"", ""source"": ""cf""}, {""groupId"": ""240184"", ""source"": ""cf""}, {""groupId"": ""240187"", ""source"": ""cf""}, {""groupId"": ""240276"", ""source"": ""cf""}, {""groupId"": ""241091"", ""source"": ""cf""}, {""groupId"": ""241687"", ""source"": ""cf""}, {""groupId"": ""260313"", ""source"": ""cf""}]"
...,...,...
1118,341155,"[{""groupId"": ""432060"", ""source"": ""vector""}, {""groupId"": ""432058"", ""source"": ""vector""}, {""groupId"": ""341151"", ""source"": ""vector""}, {""groupId"": ""432063"", ""source"": ""vector""}, {""groupId"": ""432057"", ""source"": ""vector""}, {""groupId"": ""341157"", ""source"": ""vector""}, {""groupId"": ""430222"", ""source"": ""vector""}, {""groupId"": ""432059"", ""source"": ""vector""}, {""groupId"": ""341140"", ""source"": ""vector""}, {""groupId"": ""341154"", ""source"": ""vector""}]"
1119,270518,"[{""groupId"": ""270204"", ""source"": ""vector""}, {""groupId"": ""270212"", ""source"": ""vector""}, {""groupId"": ""261851"", ""source"": ""vector""}, {""groupId"": ""270119"", ""source"": ""vector""}, {""groupId"": ""270213"", ""source"": ""vector""}, {""groupId"": ""270519"", ""source"": ""vector""}, {""groupId"": ""270210"", ""source"": ""vector""}, {""groupId"": ""261840"", ""source"": ""vector""}, {""groupId"": ""270211"", ""source"": ""vector""}, {""groupId"": ""270122"", ""source"": ""vector""}]"
1120,260290,"[{""groupId"": ""260279"", ""source"": ""vector""}, {""groupId"": ""260278"", ""source"": ""vector""}, {""groupId"": ""260271"", ""source"": ""vector""}, {""groupId"": ""261806"", ""source"": ""vector""}, {""groupId"": ""261803"", ""source"": ""vector""}, {""groupId"": ""260237"", ""source"": ""vector""}, {""groupId"": ""260239"", ""source"": ""vector""}, {""groupId"": ""260284"", ""source"": ""vector""}, {""groupId"": ""260285"", ""source"": ""vector""}, {""groupId"": ""260268"", ""source"": ""vector""}]"
1121,260993,"[{""groupId"": ""260926"", ""source"": ""vector""}, {""groupId"": ""260305"", ""source"": ""vector""}, {""groupId"": ""260994"", ""source"": ""vector""}, {""groupId"": ""261187"", ""source"": ""vector""}, {""groupId"": ""261597"", ""source"": ""vector""}, {""groupId"": ""261870"", ""source"": ""vector""}, {""groupId"": ""267591"", ""source"": ""vector""}, {""groupId"": ""260989"", ""source"": ""vector""}, {""groupId"": ""261470"", ""source"": ""vector""}, {""groupId"": ""263525"", ""source"": ""vector""}]"
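The fill logic can be isolated into a small pure function, which makes it easier to test. This is a sketch under assumptions: the function name and parameters are hypothetical, and unlike the cell above it returns a shorter list instead of padding with `None` when both sources run out.

```python
def fill_recs(primary, fallback, k=10, primary_source="cf", fallback_source="vector"):
    """Take primary recs first, then top up from fallback, skipping duplicates.

    Returns at most k annotated recs; does not pad with None.
    """
    seen = set()
    out = []
    for source, candidates in ((primary_source, primary), (fallback_source, fallback)):
        for gid in candidates:
            gid = str(gid)
            if gid and gid not in seen and len(out) < k:
                seen.add(gid)
                out.append({"groupId": gid, "source": source})
    return out


print(fill_recs(["1", "2"], ["2", "3", "4"], k=4))
# "2" is already a cf rec, so the vector recs contribute only "3" and "4"
```

Returning a short list rather than `None` placeholders avoids rows like the all-null one that has to be removed later, at the cost of lists that are not always exactly `k` long.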


## Explore the final combined recs

In [22]:
# Show rows where any groupId in recs is None
def has_null_groupid(rec_json):
    try:
        recs = json.loads(rec_json)
        if isinstance(recs, list):
            for r in recs:
                if isinstance(r, dict) and r.get("groupId") is None:
                    return True
        return False
    except Exception:
        return False

null_groupid_df = recs_df[recs_df['recs'].apply(has_null_groupid)]

with pd.option_context('display.max_colwidth', None, 'display.width', 2000):
    display(null_groupid_df)


Unnamed: 0,src_groupId,recs
156,12025SE,"[{""groupId"": null, ""source"": ""vector""}, {""groupId"": null, ""source"": ""vector""}, {""groupId"": null, ""source"": ""vector""}, {""groupId"": null, ""source"": ""vector""}, {""groupId"": null, ""source"": ""vector""}, {""groupId"": null, ""source"": ""vector""}, {""groupId"": null, ""source"": ""vector""}, {""groupId"": null, ""source"": ""vector""}, {""groupId"": null, ""source"": ""vector""}, {""groupId"": null, ""source"": ""vector""}]"


In [23]:
# Remove the row where src_groupId is '12025SE'
recs_df = recs_df[recs_df['src_groupId'] != '12025SE']


In [24]:
# Find rows where both CF and Vector recommendations are present in the recs list
def has_both_cf_and_vector(rec_json):
    try:
        recs = json.loads(rec_json)
        if isinstance(recs, list):
            sources = set()
            for r in recs:
                if isinstance(r, dict) and "source" in r:
                    sources.add(r["source"])
            return "cf" in sources and "vector" in sources
        return False
    except Exception:
        return False

both_cf_and_vector_df = recs_df[recs_df['recs'].apply(has_both_cf_and_vector)]

with pd.option_context('display.max_colwidth', None, 'display.width', 2000):
    display(both_cf_and_vector_df)


Unnamed: 0,src_groupId,recs
6,260141,"[{""groupId"": ""265298"", ""source"": ""cf""}, {""groupId"": ""260893"", ""source"": ""vector""}, {""groupId"": ""260018"", ""source"": ""vector""}, {""groupId"": ""290250"", ""source"": ""vector""}, {""groupId"": ""291877"", ""source"": ""vector""}, {""groupId"": ""292771"", ""source"": ""vector""}, {""groupId"": ""291815"", ""source"": ""vector""}, {""groupId"": ""290033"", ""source"": ""vector""}, {""groupId"": ""201675"", ""source"": ""vector""}, {""groupId"": ""293340"", ""source"": ""vector""}]"
17,290134,"[{""groupId"": ""200304"", ""source"": ""cf""}, {""groupId"": ""291229"", ""source"": ""vector""}, {""groupId"": ""522698"", ""source"": ""vector""}, {""groupId"": ""290017"", ""source"": ""vector""}, {""groupId"": ""294728"", ""source"": ""vector""}, {""groupId"": ""293100"", ""source"": ""vector""}, {""groupId"": ""294611"", ""source"": ""vector""}, {""groupId"": ""294918"", ""source"": ""vector""}, {""groupId"": ""291880"", ""source"": ""vector""}, {""groupId"": ""293746"", ""source"": ""vector""}]"
22,260484,"[{""groupId"": ""260223"", ""source"": ""cf""}, {""groupId"": ""260646"", ""source"": ""cf""}, {""groupId"": ""264275"", ""source"": ""cf""}, {""groupId"": ""261303"", ""source"": ""vector""}, {""groupId"": ""260485"", ""source"": ""vector""}, {""groupId"": ""260486"", ""source"": ""vector""}, {""groupId"": ""260551"", ""source"": ""vector""}, {""groupId"": ""260550"", ""source"": ""vector""}, {""groupId"": ""260703"", ""source"": ""vector""}, {""groupId"": ""260552"", ""source"": ""vector""}]"
27,290183,"[{""groupId"": ""261637"", ""source"": ""cf""}, {""groupId"": ""293365"", ""source"": ""vector""}, {""groupId"": ""291855"", ""source"": ""vector""}, {""groupId"": ""290271"", ""source"": ""vector""}, {""groupId"": ""292821"", ""source"": ""vector""}, {""groupId"": ""293225"", ""source"": ""vector""}, {""groupId"": ""294868"", ""source"": ""vector""}, {""groupId"": ""291757"", ""source"": ""vector""}, {""groupId"": ""295105"", ""source"": ""vector""}, {""groupId"": ""290273"", ""source"": ""vector""}]"
30,261916,"[{""groupId"": ""200304"", ""source"": ""cf""}, {""groupId"": ""241562"", ""source"": ""cf""}, {""groupId"": ""260646"", ""source"": ""cf""}, {""groupId"": ""260931"", ""source"": ""cf""}, {""groupId"": ""261436"", ""source"": ""cf""}, {""groupId"": ""261637"", ""source"": ""cf""}, {""groupId"": ""261699"", ""source"": ""cf""}, {""groupId"": ""261920"", ""source"": ""cf""}, {""groupId"": ""261924"", ""source"": ""cf""}, {""groupId"": ""261703"", ""source"": ""vector""}]"
...,...,...
814,503392,"[{""groupId"": ""503380"", ""source"": ""cf""}, {""groupId"": ""503373"", ""source"": ""vector""}, {""groupId"": ""503397"", ""source"": ""vector""}, {""groupId"": ""522383"", ""source"": ""vector""}, {""groupId"": ""528417"", ""source"": ""vector""}, {""groupId"": ""569744"", ""source"": ""vector""}, {""groupId"": ""569694"", ""source"": ""vector""}, {""groupId"": ""522391"", ""source"": ""vector""}, {""groupId"": ""562744"", ""source"": ""vector""}, {""groupId"": ""582111"", ""source"": ""vector""}]"
822,507707,"[{""groupId"": ""503380"", ""source"": ""cf""}, {""groupId"": ""507871"", ""source"": ""cf""}, {""groupId"": ""503386"", ""source"": ""vector""}, {""groupId"": ""599983"", ""source"": ""vector""}, {""groupId"": ""590195"", ""source"": ""vector""}, {""groupId"": ""582599"", ""source"": ""vector""}, {""groupId"": ""509665"", ""source"": ""vector""}, {""groupId"": ""503407"", ""source"": ""vector""}, {""groupId"": ""503402"", ""source"": ""vector""}, {""groupId"": ""537323"", ""source"": ""vector""}]"
867,546181,"[{""groupId"": ""218982"", ""source"": ""cf""}, {""groupId"": ""260646"", ""source"": ""cf""}, {""groupId"": ""431017"", ""source"": ""cf""}, {""groupId"": ""546180"", ""source"": ""vector""}, {""groupId"": ""541438"", ""source"": ""vector""}, {""groupId"": ""541420"", ""source"": ""vector""}, {""groupId"": ""541429"", ""source"": ""vector""}, {""groupId"": ""541430"", ""source"": ""vector""}, {""groupId"": ""541424"", ""source"": ""vector""}, {""groupId"": ""541423"", ""source"": ""vector""}]"
942,530341,"[{""groupId"": ""530330"", ""source"": ""cf""}, {""groupId"": ""530335"", ""source"": ""cf""}, {""groupId"": ""563495"", ""source"": ""vector""}, {""groupId"": ""563503"", ""source"": ""vector""}, {""groupId"": ""599963"", ""source"": ""vector""}, {""groupId"": ""599961"", ""source"": ""vector""}, {""groupId"": ""570953"", ""source"": ""vector""}, {""groupId"": ""552845"", ""source"": ""vector""}, {""groupId"": ""581240"", ""source"": ""vector""}, {""groupId"": ""585338"", ""source"": ""vector""}]"


In [25]:
# Find rows where all recommendations are from 'cf' source
def all_recs_are_cf(rec_json):
    try:
        recs = json.loads(rec_json)
        if isinstance(recs, list) and recs:
            return all(isinstance(r, dict) and r.get("source") == "cf" for r in recs)
        return False
    except Exception:
        return False

all_cf_df = recs_df[recs_df['recs'].apply(all_recs_are_cf)]

with pd.option_context('display.max_colwidth', None, 'display.width', 2000):
    display(all_cf_df)


Unnamed: 0,src_groupId,recs
2,265298,"[{""groupId"": ""200304"", ""source"": ""cf""}, {""groupId"": ""200400"", ""source"": ""cf""}, {""groupId"": ""210186"", ""source"": ""cf""}, {""groupId"": ""210338"", ""source"": ""cf""}, {""groupId"": ""210695"", ""source"": ""cf""}, {""groupId"": ""210756"", ""source"": ""cf""}, {""groupId"": ""218982"", ""source"": ""cf""}, {""groupId"": ""240144"", ""source"": ""cf""}, {""groupId"": ""240184"", ""source"": ""cf""}, {""groupId"": ""240187"", ""source"": ""cf""}]"
3,260596,"[{""groupId"": ""200304"", ""source"": ""cf""}, {""groupId"": ""210186"", ""source"": ""cf""}, {""groupId"": ""210338"", ""source"": ""cf""}, {""groupId"": ""210695"", ""source"": ""cf""}, {""groupId"": ""210756"", ""source"": ""cf""}, {""groupId"": ""218982"", ""source"": ""cf""}, {""groupId"": ""221416"", ""source"": ""cf""}, {""groupId"": ""240184"", ""source"": ""cf""}, {""groupId"": ""240187"", ""source"": ""cf""}, {""groupId"": ""240276"", ""source"": ""cf""}]"
4,260951,"[{""groupId"": ""200304"", ""source"": ""cf""}, {""groupId"": ""200400"", ""source"": ""cf""}, {""groupId"": ""210338"", ""source"": ""cf""}, {""groupId"": ""218982"", ""source"": ""cf""}, {""groupId"": ""240184"", ""source"": ""cf""}, {""groupId"": ""240187"", ""source"": ""cf""}, {""groupId"": ""240276"", ""source"": ""cf""}, {""groupId"": ""241091"", ""source"": ""cf""}, {""groupId"": ""241687"", ""source"": ""cf""}, {""groupId"": ""260313"", ""source"": ""cf""}]"
7,260513,"[{""groupId"": ""200304"", ""source"": ""cf""}, {""groupId"": ""200400"", ""source"": ""cf""}, {""groupId"": ""210186"", ""source"": ""cf""}, {""groupId"": ""210338"", ""source"": ""cf""}, {""groupId"": ""210695"", ""source"": ""cf""}, {""groupId"": ""210756"", ""source"": ""cf""}, {""groupId"": ""218982"", ""source"": ""cf""}, {""groupId"": ""240184"", ""source"": ""cf""}, {""groupId"": ""240187"", ""source"": ""cf""}, {""groupId"": ""241091"", ""source"": ""cf""}]"
8,265249,"[{""groupId"": ""200304"", ""source"": ""cf""}, {""groupId"": ""200400"", ""source"": ""cf""}, {""groupId"": ""210186"", ""source"": ""cf""}, {""groupId"": ""210338"", ""source"": ""cf""}, {""groupId"": ""210695"", ""source"": ""cf""}, {""groupId"": ""218982"", ""source"": ""cf""}, {""groupId"": ""240184"", ""source"": ""cf""}, {""groupId"": ""240187"", ""source"": ""cf""}, {""groupId"": ""240276"", ""source"": ""cf""}, {""groupId"": ""241562"", ""source"": ""cf""}]"
...,...,...
321,241653,"[{""groupId"": ""200304"", ""source"": ""cf""}, {""groupId"": ""210338"", ""source"": ""cf""}, {""groupId"": ""210756"", ""source"": ""cf""}, {""groupId"": ""218982"", ""source"": ""cf""}, {""groupId"": ""240012"", ""source"": ""cf""}, {""groupId"": ""240184"", ""source"": ""cf""}, {""groupId"": ""240187"", ""source"": ""cf""}, {""groupId"": ""241091"", ""source"": ""cf""}, {""groupId"": ""241562"", ""source"": ""cf""}, {""groupId"": ""241687"", ""source"": ""cf""}]"
353,261012,"[{""groupId"": ""200258"", ""source"": ""cf""}, {""groupId"": ""200400"", ""source"": ""cf""}, {""groupId"": ""210186"", ""source"": ""cf""}, {""groupId"": ""210695"", ""source"": ""cf""}, {""groupId"": ""210726"", ""source"": ""cf""}, {""groupId"": ""218982"", ""source"": ""cf""}, {""groupId"": ""240187"", ""source"": ""cf""}, {""groupId"": ""241562"", ""source"": ""cf""}, {""groupId"": ""242289"", ""source"": ""cf""}, {""groupId"": ""260557"", ""source"": ""cf""}]"
358,281410,"[{""groupId"": ""200304"", ""source"": ""cf""}, {""groupId"": ""210338"", ""source"": ""cf""}, {""groupId"": ""240184"", ""source"": ""cf""}, {""groupId"": ""240187"", ""source"": ""cf""}, {""groupId"": ""241091"", ""source"": ""cf""}, {""groupId"": ""241562"", ""source"": ""cf""}, {""groupId"": ""241687"", ""source"": ""cf""}, {""groupId"": ""260695"", ""source"": ""cf""}, {""groupId"": ""261595"", ""source"": ""cf""}, {""groupId"": ""261610"", ""source"": ""cf""}]"
362,240012,"[{""groupId"": ""210338"", ""source"": ""cf""}, {""groupId"": ""218982"", ""source"": ""cf""}, {""groupId"": ""240184"", ""source"": ""cf""}, {""groupId"": ""240187"", ""source"": ""cf""}, {""groupId"": ""241091"", ""source"": ""cf""}, {""groupId"": ""241653"", ""source"": ""cf""}, {""groupId"": ""241687"", ""source"": ""cf""}, {""groupId"": ""242024"", ""source"": ""cf""}, {""groupId"": ""260596"", ""source"": ""cf""}, {""groupId"": ""261637"", ""source"": ""cf""}]"
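The three checks above (null groupIds, mixed sources, cf only) can be folded into one classifier that labels each row. The label strings and the function name are hypothetical choices for this sketch.

```python
import json
from collections import Counter


def source_mix(rec_json):
    """Classify a JSON-encoded recs list by the sources it contains.

    Returns "cf_only", "vector_only", "mixed", "has_null" or "empty/invalid".
    """
    try:
        recs = json.loads(rec_json)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return "empty/invalid"
    if not isinstance(recs, list) or not recs:
        return "empty/invalid"
    if any(isinstance(r, dict) and r.get("groupId") is None for r in recs):
        return "has_null"
    sources = {r.get("source") for r in recs if isinstance(r, dict)}
    if sources == {"cf"}:
        return "cf_only"
    if sources == {"vector"}:
        return "vector_only"
    if {"cf", "vector"} <= sources:
        return "mixed"
    return "empty/invalid"


toy_rows = [
    json.dumps([{"groupId": "1", "source": "cf"}]),
    json.dumps([{"groupId": "1", "source": "cf"}, {"groupId": "2", "source": "vector"}]),
    json.dumps([{"groupId": None, "source": "vector"}]),
    "not json",
]
summary = Counter(source_mix(row) for row in toy_rows)
print(summary)
```

On the notebook's DataFrame, something like `recs_df['recs'].apply(source_mix).value_counts()` would give the same breakdown in one call.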


In [26]:
# Save the recs_df DataFrame to a CSV file in the predictions directory
recs_df.to_csv("../data/predictions/cf_and_vector_recs.csv", index=False)
