# Clustering Grocery Items
## Goal
Online shops often sell a huge range of items, and the catalog can become messy very quickly.
Data science can help by automatically organizing products into categories so
that customers can find them easily.
The goal of this challenge is to use customer purchase history to create categories of items that
are likely to be bought together and should therefore belong to the same section.
## Challenge Description
Company XYZ is an online grocery store. In the current version of the website, they have
manually grouped the items into a few categories based on their experience.
However, they now have a lot of data about user purchase history, and they would like to
put it to use.
### This is what they asked you to do:
- The company founder wants to meet with some of the best customers to run a focus group with them. You are asked to send the founder the IDs of the following customers:
  - the customer who bought the most items overall in their lifetime
  - for each item, the customer who bought that product the most
- Cluster items based on user co-purchase history. That is, create clusters of products that have the highest probability of being bought together. The goal is to replace the old, manually created categories with these new ones. Each item can belong to just one cluster.
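
The clustering task can be made concrete with a small sketch. The toy data, the choice of cosine distance, average-linkage hierarchical clustering from SciPy, and the cluster count of 2 are all assumptions made for illustration; the challenge itself does not prescribe a method. The idea is to describe each item by which users bought it, so that items bought by the same users end up close together.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data in the same shape as purchase_history.csv:
# one row per purchase, with item IDs stored as a comma-separated string.
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "id": ["1,2", "1,2", "3,4", "3,4", "1,2,3"],
})

# One row per (user, item) pair.
pairs = toy.assign(item_id=toy["id"].str.split(",")).explode("item_id")
pairs["item_id"] = pairs["item_id"].str.strip().astype(int)

# User-by-item purchase counts; each column is one item's purchase profile.
user_item = pd.crosstab(pairs["user_id"], pairs["item_id"])

# Cosine distance between item profiles (assumed metric):
# items bought by the same users have a small distance.
item_distances = pdist(user_item.T.to_numpy(), metric="cosine")

# Average-linkage hierarchical clustering, cut into an assumed 2 clusters.
tree = linkage(item_distances, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")

# Hard assignment: every item gets exactly one cluster label.
item_clusters = pd.Series(labels, index=user_item.columns, name="cluster")
print(item_clusters)
```

Cutting the tree with `criterion="maxclust"` gives each item a single label, which matches the requirement that an item belongs to just one cluster. On the real data the number of clusters would need to be chosen by inspection or a quality metric.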

In [61]:
import pandas as pd

In [62]:
purvhase = pd.read_csv('purchase_history.csv')
item = pd.read_csv('item_to_id.csv')

In [63]:
purvhase.head()

Unnamed: 0,user_id,id
0,222087,2726
1,1343649,64717
2,404134,1812232227433820351
3,1110200,923220264737
4,224107,"31,18,5,13,1,21,48,16,26,2,44,32,20,37,42,35,4..."


In [64]:
purvhase.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39474 entries, 0 to 39473
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  39474 non-null  int64 
 1   id       39474 non-null  object
dtypes: int64(1), object(1)
memory usage: 616.9+ KB


In [65]:
purvhase.describe()

Unnamed: 0,user_id
count,39474.0
mean,752014.9
std,433725.8
min,47.0
25%,373567.2
50%,753583.5
75%,1124939.0
max,1499974.0


In [66]:
item.head()

Unnamed: 0,Item_name,Item_id
0,coffee,43
1,tea,23
2,juice,38
3,soda,9
4,sandwich loaves,39


In [67]:
print(len(item['Item_name'].unique()))

48


In [68]:
item.describe()

Unnamed: 0,Item_id
count,48.0
mean,24.5
std,14.0
min,1.0
25%,12.75
50%,24.5
75%,36.25
max,48.0


In [69]:
item.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Item_name  48 non-null     object
 1   Item_id    48 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 900.0+ bytes


In [70]:
# Split the comma-separated item IDs and expand into one row per (user, item) pair
expanded_purchases = []
for _, row in purvhase.iterrows():
    user_id = row['user_id']
    item_list = str(row['id']).split(',')
    for item_id in item_list:
        expanded_purchases.append({
            'user_id': user_id,
            'item_id': item_id.strip()  # item IDs are kept as strings here
        })
expanded_df = pd.DataFrame(expanded_purchases)
# Each row is one item bought, so counting rows per user gives the lifetime item total
user_item_counts = expanded_df['user_id'].value_counts()
# value_counts sorts in descending order, so the first entry is the top customer
top_customer_id = user_item_counts.index[0]
total_items_bought = user_item_counts.iloc[0]
print(f"Top customer ID: {top_customer_id}")
print(f"Total items bought: {total_items_bought}")

Top customer ID: 269335
Total items bought: 72
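
The same long-format table can also answer the second focus-group question: for each item, which customer bought it the most. Below is a minimal sketch on hypothetical toy data, using `explode` as a vectorized alternative to the loop above. The names `toy`, `pairs`, `counts`, and `top_per_item` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data in the same shape as purchase_history.csv.
toy = pd.DataFrame({
    "user_id": [10, 10, 20, 20, 30, 30],
    "id": ["1,2", "1,3", "2,3", "2", "3", "3"],
})

# Vectorized expansion: one row per (user, item) pair.
pairs = toy.assign(item_id=toy["id"].str.split(",")).explode("item_id")
pairs["item_id"] = pairs["item_id"].str.strip().astype(int)

# Customer with the most items overall.
top_customer = pairs["user_id"].value_counts().idxmax()

# How many times each user bought each item.
counts = (
    pairs.groupby(["item_id", "user_id"])
    .size()
    .rename("n_bought")
    .reset_index()
)

# For each item, keep the row with the largest count.
# idxmax returns the first maximum, so ties go to whichever row comes first.
top_rows = counts.loc[counts.groupby("item_id")["n_bought"].idxmax()]
top_per_item = top_rows.set_index("item_id")["user_id"]

print(top_customer)
print(top_per_item)
```

Ties are resolved arbitrarily in this sketch. If several customers share the top count for an item, reporting all of them may be more appropriate for the focus group.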
