In [None]:
import numpy as np 
import pandas as pd 
orders = pd.read_csv(r'orders.csv')
orders.head()

In [None]:
prior = pd.read_csv(r'order_products__prior.csv')
prior.head()

In [None]:
train = pd.read_csv(r'order_products__train.csv')
train.head()

In [None]:
order_prior = pd.merge(prior, orders, on='order_id')
order_prior = order_prior.sort_values(by=['user_id','order_id'])
order_prior.head()

In [None]:
products = pd.read_csv(r'products.csv')
products.head()

In [None]:
aisles = pd.read_csv(r'aisles.csv')
aisles.head()

In [None]:
_mt = pd.merge(prior, products, on='product_id')
_mt = pd.merge(_mt, orders, on='order_id')
mt = pd.merge(_mt, aisles, on='aisle_id')
mt.head(10)

In [None]:
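# Ten most frequently purchased products (printed, since only the last expression of a cell is displayed)
print(mt['product_name'].value_counts().head(10))
# Number of distinct products
len(mt['product_name'].unique())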

Fresh fruits and fresh vegetables are the best-selling aisles, as the aisle counts in the next cell show.

In [None]:
mt['aisle'].value_counts()[0:10]

The first step is to build a DataFrame with all the purchases made by each user. The resulting 'cust_prod' DataFrame has 'user_id' as its index and 'aisle' as its columns, and each value is the frequency count of that combination of 'user_id' and 'aisle'. In other words, it shows how many times each user has purchased products from each aisle. This type of table is useful for understanding customer behavior and preferences, as well as for identifying popular categories.


In [None]:
cust_prod = pd.crosstab(mt['user_id'], mt['aisle'])
cust_prod.head(10)
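
To make the shape of 'cust_prod' concrete, here is a minimal sketch of the same 'pd.crosstab' call on a tiny purchase log. The 'toy' DataFrame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy purchase log: one row per purchased item (made-up values, not the real data).
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'aisle': ['fresh fruits', 'fresh fruits', 'yogurt',
              'milk', 'yogurt', 'fresh fruits'],
})

# One row per user, one column per aisle; each value counts that user's purchases in that aisle.
toy_counts = pd.crosstab(toy['user_id'], toy['aisle'])
print(toy_counts)
```

User 1 bought from 'fresh fruits' twice, so that cell holds 2, and any aisle a user never bought from holds 0.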

We can then apply Principal Component Analysis (PCA) to the resulting DataFrame. This reduces the number of features from the number of aisles to 6, the number of principal components I have chosen. The 'PCA' class is instantiated with the parameter 'n_components=6', which sets the number of principal components to 6.


In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=6)
pca.fit(cust_prod)
pca_samples = pca.transform(cust_prod)
ps = pd.DataFrame(pca_samples)
ps.head()
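
One way to sanity-check the choice of 6 components is to look at how much variance each component explains. The sketch below uses scikit-learn's 'explained_variance_ratio_' attribute on synthetic counts; 'toy_counts' is a hypothetical stand-in for 'cust_prod', so the printed numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for cust_prod: 200 "users" x 10 "aisles" of purchase counts.
toy_counts = pd.DataFrame(rng.poisson(lam=3.0, size=(200, 10)))

toy_pca = PCA(n_components=6).fit(toy_counts)

# Fraction of the total variance captured by each component, and the running total.
ratios = toy_pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)
for i, (r, c) in enumerate(zip(ratios, cumulative)):
    print(f"component {i}: {r:.3f} of variance, {c:.3f} cumulative")
```

Running the same loop on 'pca' after 'pca.fit(cust_prod)' would show how much of the variation across aisles the 6 retained components keep.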

It's important to note that the choice of which principal components to use for clustering can have a significant impact on the results, and there is no one "correct" choice. It's often a good idea to explore multiple combinations of principal components and clustering parameters to find the best approach for your specific dataset and problem. We tried a range of different pairs of principal components, such as (PC1, PC2), (PC3, PC4), and (PC1, PC3), and compared the resulting clustering performance using evaluation metrics. The (PC4, PC1) pair, which corresponds to columns 4 and 1 of 'ps' (whose column labels start at 0), was chosen as a suitable pair for k-means clustering.


In [None]:
from matplotlib import pyplot as plt

# Keep only the two selected components (columns 4 and 1 of ps)
tocluster = ps[[4, 1]]
print(tocluster.shape)
print(tocluster.head())
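
The comparison of component pairs described above could look roughly like the following sketch, which scores every pair with scikit-learn's 'silhouette_score'. This is an assumption about the procedure, not the exact code used for this notebook, and 'toy_counts' and 'toy_ps' are synthetic stand-ins for 'cust_prod' and 'ps'.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-ins for cust_prod and ps (made-up data).
toy_counts = pd.DataFrame(rng.poisson(lam=3.0, size=(300, 10)))
toy_ps = pd.DataFrame(PCA(n_components=6).fit_transform(toy_counts))

# Cluster each pair of components and record its silhouette score (higher is better).
pair_scores = {}
for a, b in combinations(toy_ps.columns, 2):
    X = toy_ps[[a, b]]
    labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
    pair_scores[(a, b)] = silhouette_score(X, labels)

best_pair = max(pair_scores, key=pair_scores.get)
print(best_pair, round(pair_scores[best_pair], 3))
```

On the full dataset the silhouette computation can be slow, so passing 'sample_size' to 'silhouette_score' may be worthwhile. Scores computed on different pairs are only a rough guide, since each pair lives in a different feature space.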

In [None]:
fig = plt.figure(figsize=(8,8))
plt.plot(tocluster[4], tocluster[1], 'o', markersize=2, color='blue', alpha=0.5, label='customers')

plt.xlabel('component 4')
plt.ylabel('component 1')
plt.legend()
plt.show()

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Fit k-means with 4 clusters on the two selected components
clusterer = KMeans(n_clusters=4, random_state=42).fit(tocluster)
centers = clusterer.cluster_centers_
# Cluster label assigned to each user
c_preds = clusterer.predict(tocluster)
print(centers)

print(c_preds[0:100])
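
The value 'n_clusters=4' is a choice, and one common way to check such a choice is the elbow heuristic: fit k-means for several values of k and watch where the inertia (within-cluster sum of squares) stops dropping sharply. The sketch below is only an illustration on synthetic points around four made-up centers; 'toy_points' is hypothetical and the result does not carry over to 'tocluster'.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 2-D points scattered around four made-up centers.
true_centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]])
toy_points = np.vstack(
    [c + rng.normal(scale=0.5, size=(50, 2)) for c in true_centers]
)

# Fit k-means for a range of k and record the inertia of each fit.
inertias = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(toy_points)
    inertias[k] = model.inertia_
    print(k, round(model.inertia_, 1))
```

On this toy data the inertia falls steeply up to k=4 and only slowly afterwards. Replacing 'toy_points' with 'tocluster' would give the corresponding curve for the real data.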

In [None]:
fig = plt.figure(figsize=(8, 8))
colors = ['orange', 'blue', 'purple', 'green']
# Color each point by its assigned cluster
colored = [colors[k] for k in c_preds]
print(colored[0:10])
plt.scatter(tocluster[4], tocluster[1], color=colored)
# Mark the cluster centers in red, labeled by cluster index
for ci, c in enumerate(centers):
    plt.plot(c[0], c[1], 'o', markersize=8, color='red', alpha=0.9, label=str(ci))

plt.xlabel('component 4')
plt.ylabel('component 1')
plt.legend()
plt.show()

The 'c_preds' array obtained from the KMeans algorithm in the previous code block is used to populate the 'cluster' column of the 'clust_prod' DataFrame.

In [None]:
clust_prod = cust_prod.copy()
clust_prod['cluster'] = c_preds

clust_prod.head(10)

The next cell first filters the rows of the clust_prod DataFrame where the cluster column has a value of 0. This creates a new DataFrame that contains only the users in cluster 0. Then we drop the cluster column from this new DataFrame using the drop method. Finally, we calculate the mean of each remaining column using the mean method. The same steps are repeated for clusters 1, 2, and 3, and each result is drawn as a bar chart.


In [None]:
print(clust_prod.shape)
f, arr = plt.subplots(2, 2, sharex=True, figsize=(15, 15))

# Number of users in cluster 0
c0_count = len(clust_prod[clust_prod['cluster'] == 0])

# Mean purchase count per aisle within each cluster, one bar chart per cluster
aisle_counts = clust_prod.drop('cluster', axis=1)
x = range(len(aisle_counts.columns))
c0 = aisle_counts[clust_prod['cluster'] == 0].mean()
arr[0, 0].bar(x, c0)
c1 = aisle_counts[clust_prod['cluster'] == 1].mean()
arr[0, 1].bar(x, c1)
c2 = aisle_counts[clust_prod['cluster'] == 2].mean()
arr[1, 0].bar(x, c2)
c3 = aisle_counts[clust_prod['cluster'] == 3].mean()
arr[1, 1].bar(x, c3)
plt.show()
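
The filter, drop, and mean steps described above can also be written as a single 'groupby'. The sketch below shows on a tiny made-up table that both routes give the same per-cluster means; 'toy' and its values are hypothetical.

```python
import pandas as pd

# Tiny stand-in for clust_prod: two aisles plus the cluster label (made-up values).
toy = pd.DataFrame({
    'fresh fruits': [4, 2, 0, 1],
    'yogurt': [0, 2, 5, 3],
    'cluster': [0, 0, 1, 1],
})

# One row per cluster, one column per aisle, values are the per-cluster means.
toy_means = toy.groupby('cluster').mean()

# The filter / drop / mean route for cluster 0, for comparison.
c0_filtered = toy[toy['cluster'] == 0].drop('cluster', axis=1).mean()

print(toy_means)
print(c0_filtered)
```

Here 'toy_means.loc[0]' holds the same values as 'c0_filtered', so 'clust_prod.groupby('cluster').mean()' would be a compact alternative to building c0 through c3 one by one.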

Let's check the top 10 goods bought by the people in each cluster. We will rely first on the absolute data and then on percentages among the top 8 goods for each cluster.


In [None]:
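# Print each cluster's top 10; bare expressions would only display the last one
print(c0.sort_values(ascending=False).head(10))
print(c1.sort_values(ascending=False).head(10))
print(c2.sort_values(ascending=False).head(10))
print(c3.sort_values(ascending=False).head(10))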

In [None]:
from IPython.display import HTML

# The 8 aisles compared across clusters
top_aisles = ['fresh fruits', 'fresh vegetables', 'packaged vegetables fruits', 'yogurt',
              'packaged cheese', 'milk', 'water seltzer sparkling water', 'chips pretzels']
# One row per cluster, one column per aisle
cluster_means = pd.DataFrame([c[top_aisles].values for c in (c0, c1, c2, c3)], columns=top_aisles)
HTML(cluster_means.to_html())


The following table shows each of these goods as a percentage of the top 8 total within each cluster. It is easy to see some interesting differences among the clusters.

It seems that people in cluster 1 buy more fresh vegetables than people in the other clusters. As the absolute data shows, cluster 1 also includes the customers who buy far more goods than any others.

People in cluster 2 buy more yogurt than people in the other clusters.

The absolute data shows that people in cluster 3 buy a lot of 'Baby Food Formula', which is not even listed in the top 8 goods but is what mainly characterizes this cluster. Consistently (I think) with this observation, they buy more milk than the others.

In [None]:
# Divide each row by its own total so every cluster's 8 values sum to 100
cluster_perc = cluster_means.div(cluster_means.sum(axis=1), axis=0) * 100
HTML(cluster_perc.to_html())
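
The remark above about 'Baby Food Formula' suggests that an aisle can characterize a cluster without being among the most purchased overall. One possible way to surface such aisles is to divide each cluster's mean by the overall mean, so values well above 1 mark over-represented aisles. This ratio is an addition for illustration and is not computed elsewhere in this notebook; the 'toy' table and its values are made up.

```python
import pandas as pd

# Tiny stand-in for clust_prod (made-up values and cluster labels).
toy = pd.DataFrame({
    'fresh fruits': [10, 12, 11, 9],
    'milk': [2, 3, 2, 6],
    'baby food formula': [0, 0, 0, 4],
    'cluster': [0, 0, 1, 1],
})
aisle_cols = [c for c in toy.columns if c != 'cluster']

# Mean purchases per aisle over all users, and within each cluster.
overall_mean = toy[aisle_cols].mean()
cluster_mean = toy.groupby('cluster')[aisle_cols].mean()

# Ratio of cluster mean to overall mean: > 1 means the cluster buys more than average.
lift = cluster_mean.div(overall_mean, axis=1)
print(lift.round(2))
```

In this toy table 'fresh fruits' has the largest raw mean in both clusters, yet 'baby food formula' has the highest ratio in cluster 1, which is the kind of pattern described for cluster 3 above.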
