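Welcome, everyone, and good luck in the Instacart Market Basket competition!
Here is a first exploratory analysis of the competition data set. Instacart has a recommendation feature on its website that suggests items users may want to buy again. Our task is to predict which items will be reordered in the next order.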

The data set consists of information about 3.4 million grocery orders, distributed across 6 CSV files.
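
The cells below assume that the six tables already exist as `orders`, `products`, `aisles`, `departments`, `order_products` and `order_products_prior`, but the loading step is not shown. A minimal sketch of that step could look as follows. The `../input` directory, the file names and the helper `load_instacart` are assumptions for illustration, and mapping `order_products` to the train file is inferred from how the later cells use it. The later cells also rely on dplyr, ggplot2, knitr (`kable`) and DT (`datatable`).

```r
library(data.table)  # provides fread() for fast CSV reading

# Hypothetical helper: reads the six competition files into a named list.
# ASSUMPTION: file names follow the Kaggle download; adjust if yours differ.
load_instacart <- function(data_dir = "../input") {
  files <- c(
    orders               = "orders.csv",
    products             = "products.csv",
    aisles               = "aisles.csv",
    departments          = "departments.csv",
    order_products       = "order_products__train.csv",  # train-set order lines
    order_products_prior = "order_products__prior.csv"   # prior order lines
  )
  # lapply keeps the names of `files`, so the result is a named list of tables
  lapply(files, function(f) fread(file.path(data_dir, f)))
}
```

Calling `list2env(load_instacart(), envir = globalenv())` would then create the six objects under the names used in the cells below.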

In [None]:
#Peek at the first 12 rows of the orders table
kable(head(orders,12))

In [None]:
glimpse(orders)

In [None]:
#We should do some recoding and convert character variables to factors.
orders <- orders %>%
  mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set))
products <- products %>%
  mutate(product_name = as.factor(product_name))
aisles <- aisles %>%
  mutate(aisle = as.factor(aisle))
departments <- departments %>%
  mutate(department = as.factor(department))

In [None]:
#Let’s have a look when people buy groceries online.
#There is a clear effect of hour of day on order volume. Most orders are placed between 8.00 and 18.00.
orders %>%
  ggplot(aes(x=order_hour_of_day)) +
  geom_histogram(stat="count",fill="red")

In [None]:
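#Day of week
#There is a clear effect of day of the week. Most orders are placed on days 0 and 1.
#Unfortunately there is no information about which value represents which day, but one would assume that these two days are the weekend.
orders %>%
  ggplot(aes(x=order_dow)) +
  geom_histogram(stat="count",fill="red")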

In [None]:
#When do they order again? (reorders: 1 week)
#People seem to order more often after exactly 1 week.
orders %>%
  ggplot(aes(x=days_since_prior_order)) +
  geom_histogram(stat="count",fill="red")

In [None]:
#How many prior orders are there? (prior orders: at least 3)
#We can see that there are always at least 3 prior orders.
orders %>%
  filter(eval_set=="prior") %>%
  count(order_number) %>%
  ggplot(aes(order_number,n)) + geom_line(color="red", size=1)+geom_point(size=2, color="red")

In [None]:
#How many items do people buy? (about 5 items)
#Let’s have a look how many items are in the orders.
#We can see that people most often order around 5 items. The distributions are comparable between the train and prior order set.
order_products %>%
  group_by(order_id) %>%
  summarize(n_items = last(add_to_cart_order)) %>%
  ggplot(aes(x=n_items))+
  geom_histogram(stat="count",fill="red") + geom_rug()+ coord_cartesian(xlim=c(0,80))

In [None]:
#Bestsellers (the best-selling products)
#Let’s have a look which products are sold most often (top 10). And the clear winner is: Bananas
tmp <- order_products %>%
  group_by(product_id) %>%
  summarize(count = n()) %>%
  top_n(10, wt = count) %>%
  left_join(select(products,product_id,product_name),by="product_id") %>%
  arrange(desc(count))
kable(tmp)

tmp %>%
  ggplot(aes(x=reorder(product_name,-count), y=count))+
  geom_bar(stat="identity",fill="red")+
  theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())


In [None]:
#How often do people order the same items again? (reorder rate)
#59% of the ordered items are reorders.
tmp <- order_products %>%
  group_by(reordered) %>%
  summarize(count = n()) %>%
  mutate(reordered = as.factor(reordered)) %>%
  mutate(proportion = count/sum(count))

kable(tmp)

tmp %>%
  ggplot(aes(x=reordered,y=count,fill=reordered))+
  geom_bar(stat="identity")

In [None]:
#Most often reordered (top 10 reorders)
#Now here it becomes really interesting. These 10 products have the highest probability of being reordered.
#Only products with more than 40 order lines are considered.
tmp <- order_products %>%
  group_by(product_id) %>%
  summarize(proportion_reordered = mean(reordered), n=n()) %>%
  filter(n>40) %>%
  top_n(10,wt=proportion_reordered) %>%
  arrange(desc(proportion_reordered)) %>%
  left_join(products,by="product_id")

kable(tmp)

tmp %>%
  ggplot(aes(x=reorder(product_name,-proportion_reordered), y=proportion_reordered))+
  geom_bar(stat="identity",fill="red")+
  theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())+coord_cartesian(ylim=c(0.85,0.95))

In [None]:
#Which item do people put into the cart first? (which item goes into the cart first)
#People seem to be quite certain about Multifold Towels: if they buy them, they put them into their cart first 66% of the time.
#pct is computed within each product, because the data is still grouped by product_id after summarize().
tmp <- order_products %>%
  group_by(product_id, add_to_cart_order) %>%
  summarize(count = n()) %>%
  mutate(pct=count/sum(count)) %>%
  filter(add_to_cart_order == 1, count>10) %>%
  arrange(desc(pct)) %>%
  left_join(products,by="product_id") %>%
  select(product_name, pct, count) %>%
  ungroup() %>%
  top_n(10, wt=pct)

kable(tmp)

tmp %>%
  ggplot(aes(x=reorder(product_name,-pct), y=pct))+
  geom_bar(stat="identity",fill="red")+
  theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())+coord_cartesian(ylim=c(0.4,0.7))

In [None]:
#Association between time of last order and probability of reorder (same-day orders repeat the same products)
#This is interesting: We can see that if people order again on the same day,
#    they order the same product more often. Whereas when 30 days have passed, they tend to try out new things in their order.

order_products %>%
  left_join(orders,by="order_id") %>%
  group_by(days_since_prior_order) %>%
  summarize(mean_reorder = mean(reordered)) %>%
  ggplot(aes(x=days_since_prior_order,y=mean_reorder))+
  geom_bar(stat="identity",fill="red")

In [None]:
#Association between number of orders and probability of reordering (frequently ordered products are reordered more)
#Products with a high number of orders are naturally more likely to be reordered. However, there seems to be a ceiling effect.
order_products %>%
  group_by(product_id) %>%
  summarize(proportion_reordered = mean(reordered), n=n()) %>%
  ggplot(aes(x=n,y=proportion_reordered))+
  geom_point()+
  geom_smooth(color="red")+
  coord_cartesian(xlim=c(0,2000))
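
The organic plots in this notebook rely on an `organic` column in `products` and on a `tmp` table with per-group counts, but no cell shown here creates either. One plausible way to derive them is sketched below, assuming a product counts as organic when its name contains the word "organic". That rule, the toy tables and the `toy_` names are assumptions made for illustration, not the author's confirmed method.

```r
library(dplyr)
library(stringr)

# Toy stand-ins for the real tables (invented values, illustration only)
toy_products <- tibble(
  product_id   = c(1L, 2L, 3L, 4L),
  product_name = c("Banana", "Organic Banana", "Organic Whole Milk", "Soda")
)
toy_order_products <- tibble(
  product_id = c(1L, 2L, 2L, 3L, 4L, 3L),
  reordered  = c(1, 1, 0, 1, 0, 1)
)

# ASSUMPTION: "organic" anywhere in the lower-cased name marks an organic product
toy_products <- toy_products %>%
  mutate(organic = if_else(str_detect(str_to_lower(product_name), "organic"),
                           "organic", "not organic"),
         organic = as.factor(organic))

# Count ordered items per group and convert the counts to proportions
tmp_organic <- toy_order_products %>%
  left_join(toy_products, by = "product_id") %>%
  group_by(organic) %>%
  summarize(count = n()) %>%
  mutate(proportion = count / sum(count))
```

Applying the same two steps to the real `products` and `order_products` tables, and storing the result in `tmp`, would give the columns `organic` and `count` that the bar chart needs.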

#Number of ordered items by organic flag (requires tmp to have the columns organic and count)
tmp %>%
  ggplot(aes(x=organic,y=count, fill=organic))+
  geom_bar(stat="identity")

In [None]:
#Reordering Organic vs Non-Organic (organic products are reordered more)
#People reorder organic products more often than non-organic products.
tmp <- order_products %>%
  left_join(products,by="product_id") %>%
  group_by(organic) %>%
  summarize(mean_reordered = mean(reordered))
kable(tmp)

tmp %>%
  ggplot(aes(x=organic,fill=organic,y=mean_reordered))+geom_bar(stat="identity")

In [None]:
#Visualizing the Product Portfolio (portfolio visualization)
#Here I use the treemap package to visualize the structure of
#Instacart's product portfolio. In total there are 21 departments containing 134 aisles.
library(treemap)

#Number of products per department/aisle, with readable names attached
tmp <- products %>% group_by(department_id, aisle_id) %>% summarize(n=n())
tmp <- tmp %>% left_join(departments,by="department_id")
tmp <- tmp %>% left_join(aisles,by="aisle_id")

#Number of sales per department/aisle; onesize gives every aisle the same box size
tmp2 <- order_products %>%
  group_by(product_id) %>%
  summarize(count=n()) %>%
  left_join(products,by="product_id") %>%
  ungroup() %>%
  group_by(department_id,aisle_id) %>%
  summarize(sumcount = sum(count)) %>%
  left_join(tmp, by = c("department_id", "aisle_id")) %>%
  mutate(onesize = 1)

In [None]:
#How are aisles organized within departments?
treemap(tmp2,index=c("department","aisle"),vSize="onesize",vColor="department",palette="Set3",title="",
        sortID="-sumcount", border.col="#FFFFFF",type="categorical", fontsize.legend = 0,bg.labels = "#FFFFFF")

In [None]:
#How many unique products are offered in each department/aisle?
#The size of the boxes shows the number of products in each category.
treemap(tmp,index=c("department","aisle"),vSize="n",title="",palette="Set3",border.col="#FFFFFF")

In [None]:
#How often are products from the department/aisle sold?
#The size of the boxes shows the number of sales.
treemap(tmp2,index=c("department","aisle"),vSize="sumcount",title="",palette="Set3",border.col="#FFFFFF")

In [None]:
#Exploring Customer Habits (customers who always reorder)
#Here I look for customers who just reorder the same products again all the time. To find them I look at all orders
#(excluding the first order), where the percentage of reordered items is exactly 1
#(This can easily be adapted to look at more lenient thresholds).
#We can see there are in fact 3,487 customers who always just reorder products.
tmp <- order_products_prior %>%
  group_by(order_id) %>%
  summarize(m = mean(reordered),n=n()) %>%
  right_join(filter(orders,order_number>2), by="order_id")

tmp2 <- tmp %>%
  filter(eval_set =="prior") %>%
  group_by(user_id) %>%
  summarize(n_equal = sum(m==1,na.rm=TRUE), percent_equal = n_equal/n()) %>%
  filter(percent_equal == 1) %>%
  arrange(desc(n_equal))

datatable(tmp2, class="table-condensed", style="bootstrap", options = list(dom = 'tp'))
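
The comment above notes that the strict filter can be adapted to more lenient thresholds. A minimal sketch of one way to do that follows, using two separate cut-offs: one for how large the reordered share of a single order must be, and one for what fraction of a user's orders must meet it. The helper `habitual_users`, its parameter names and the toy data are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper generalizing the strict "always reorders" filter.
# order_stats needs one row per order with user_id and m (share of reordered items).
habitual_users <- function(order_stats, order_threshold = 1, user_threshold = 1) {
  order_stats %>%
    group_by(user_id) %>%
    summarize(
      n_equal = sum(m >= order_threshold, na.rm = TRUE),  # orders meeting the bar
      percent_equal = n_equal / n()                       # share of such orders
    ) %>%
    filter(percent_equal >= user_threshold) %>%
    arrange(desc(n_equal))
}

# Toy data (invented for illustration only)
toy <- tibble(
  user_id = c(1, 1, 1, 2, 2, 3, 3),
  m       = c(1, 1, 1, 1, 0.5, 1, NA)
)

strict  <- habitual_users(toy)                         # same rule as the cell above
lenient <- habitual_users(toy, order_threshold = 0.5)  # half-reordered orders count too
```

With both thresholds left at 1 the helper applies the same rule as the strict filter, so `habitual_users(filter(tmp, eval_set == "prior"))` should match `tmp2`. Orders with a missing `m` still count in the denominator, as they do in the original cell.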

In [None]:
#The customer with the strongest habit
#The coolest customer is id #99753,
#having 97 orders with only reordered items. That’s what I call a strong habit. She/he seems to like Organic Milk :-)
uniqueorders <- filter(tmp, user_id == 99753)$order_id
tmp <- order_products_prior %>%
  filter(order_id %in% uniqueorders) %>%
  left_join(products, by="product_id")

datatable(select(tmp,-aisle_id,-department_id,-organic), style="bootstrap", class="table-condensed", options = list(dom = 'tp'))

In [None]:
#Let’s look at this customer's order in the train set. One would assume that she/he would buy
#“Organic Whole Milk” and “Organic Reduced Fat Milk”:

tmp <- orders %>% filter(user_id==99753, eval_set == "train")
tmp2 <- order_products %>%
  filter(order_id == tmp$order_id) %>%
  left_join(products, by="product_id")

datatable(select(tmp2,-aisle_id,-department_id,-organic), style="bootstrap", class="table-condensed", options = list(dom = 't'))