## Amazon Consumer Behaviour Dataset

Link to Kaggle: https://www.kaggle.com/datasets/swathiunnikrishnan/amazon-consumer-behaviour-dataset

##### Data Dictionary:

- 1. age = age
- 2. Gender = gender
- 3. Purchase_Frequency = How frequently do you make purchases on Amazon?
- 4. Purchase_Categories = What product categories do you typically purchase on Amazon?
- 5. Personalized_Recommendation_Frequency = Have you ever made a purchase based on personalized product recommendations from Amazon?
- 6. Browsing_Frequency = How often do you browse Amazon's website or app?
- 7. Product_Search_Method = How do you search for products on Amazon?
- 8. Search_Result_Exploration = Do you tend to explore multiple pages of search results or focus on the first page?
- 9. Customer_Reviews_Importance = How important are customer reviews in your decision-making process?
- 10. Add_to_Cart_Browsing = Do you add products to your cart while browsing on Amazon?
- 11. Cart_Completion_Frequency = How often do you complete the purchase after adding products to your cart?
- 12. Cart_Abandonment_Factors = What factors influence your decision to abandon a purchase in your cart?
- 13. Saveforlater_Frequency = Do you use Amazon's "Save for Later" feature, and if so, how often?
- 14. Review_Left = Have you ever left a product review on Amazon?
- 15. Review_Reliability = How much do you rely on product reviews when making a purchase?
- 16. Review_Helpfulness = Do you find helpful information from other customers' reviews?
- 17. Personalized_Recommendation_Frequency = How often do you receive personalized product recommendations from Amazon? (Same name as item 5; pandas loads this second column as Personalized_Recommendation_Frequency.1.)
- 18. Recommendation_Helpfulness = Do you find the recommendations helpful?
- 19. Rating_Accuracy = How would you rate the relevance and accuracy of the recommendations you receive?
- 20. Shopping_Satisfaction = How satisfied are you with your overall shopping experience on Amazon?
- 23. Service_Appreciation = What aspects of Amazon's services do you appreciate the most?
- 24. Improvement_Areas = Are there any areas where you think Amazon can improve?

First, we are going to inspect how the data is stored in this dataset, so we can decide which analyses and data transformations to apply.

In [1]:
# Importing libraries

import pandas as pd
import numpy as np

In [2]:
## Importing dataset

df = pd.read_csv('archive/Amazon Customer Behavior Survey.csv')

In [3]:
## Understanding df

df.head(5)

Unnamed: 0,Timestamp,age,Gender,Purchase_Frequency,Purchase_Categories,Personalized_Recommendation_Frequency,Browsing_Frequency,Product_Search_Method,Search_Result_Exploration,Customer_Reviews_Importance,...,Saveforlater_Frequency,Review_Left,Review_Reliability,Review_Helpfulness,Personalized_Recommendation_Frequency.1,Recommendation_Helpfulness,Rating_Accuracy,Shopping_Satisfaction,Service_Appreciation,Improvement_Areas
0,2023/06/04 1:28:19 PM GMT+5:30,23,Female,Few times a month,Beauty and Personal Care,Yes,Few times a week,Keyword,Multiple pages,1,...,Sometimes,Yes,Occasionally,Yes,2,Yes,1,1,Competitive prices,Reducing packaging waste
1,2023/06/04 2:30:44 PM GMT+5:30,23,Female,Once a month,Clothing and Fashion,Yes,Few times a month,Keyword,Multiple pages,1,...,Rarely,No,Heavily,Yes,2,Sometimes,3,2,Wide product selection,Reducing packaging waste
2,2023/06/04 5:04:56 PM GMT+5:30,24,Prefer not to say,Few times a month,Groceries and Gourmet Food;Clothing and Fashion,No,Few times a month,Keyword,Multiple pages,2,...,Rarely,No,Occasionally,No,4,No,3,3,Competitive prices,Product quality and accuracy
3,2023/06/04 5:13:00 PM GMT+5:30,24,Female,Once a month,Beauty and Personal Care;Clothing and Fashion;...,Sometimes,Few times a month,Keyword,First page,5,...,Sometimes,Yes,Heavily,Yes,3,Sometimes,3,4,Competitive prices,Product quality and accuracy
4,2023/06/04 5:28:06 PM GMT+5:30,22,Female,Less than once a month,Beauty and Personal Care;Clothing and Fashion,Yes,Few times a month,Filter,Multiple pages,1,...,Rarely,No,Heavily,Yes,4,Yes,2,2,Competitive prices,Product quality and accuracy


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 602 entries, 0 to 601
Data columns (total 23 columns):
 #   Column                                  Non-Null Count  Dtype 
---  ------                                  --------------  ----- 
 0   Timestamp                               602 non-null    object
 1   age                                     602 non-null    int64 
 2   Gender                                  602 non-null    object
 3   Purchase_Frequency                      602 non-null    object
 4   Purchase_Categories                     602 non-null    object
 5   Personalized_Recommendation_Frequency   602 non-null    object
 6   Browsing_Frequency                      602 non-null    object
 7   Product_Search_Method                   600 non-null    object
 8   Search_Result_Exploration               602 non-null    object
 9   Customer_Reviews_Importance             602 non-null    int64 
 10  Add_to_Cart_Browsing                    602 non-null    object
 11  Cart_C

### Notes so far:

- There are 602 rows in this dataset. We need to check whether any of them are duplicated;
- There are 2 missing values in the 'Product_Search_Method' column. We might need to treat those;
- age, Customer_Reviews_Importance, Personalized_Recommendation_Frequency.1 (the second column with that name), Rating_Accuracy and Shopping_Satisfaction are integers;
- Purchase_Frequency, Personalized_Recommendation_Frequency, Browsing_Frequency, Saveforlater_Frequency, Review_Reliability and Recommendation_Helpfulness are objects, but their values express a frequency, which means that we can turn them into ordinal integers;
- Gender, Review_Left and Review_Helpfulness are objects, but can be turned into dummy variables;
- Purchase_Categories, Product_Search_Method, Search_Result_Exploration, Service_Appreciation and Improvement_Areas are objects with categorical values. We can further explore the possibility of turning these values into dummy variables.
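
As a rough sketch of the dummy-variable idea in the notes above, the snippet below uses a small toy frame (made-up rows that only mimic the format shown by `df.head()`, not the real data). It assumes that multi-answer cells such as Purchase_Categories are always separated by `;`, which is what the preview suggests but has not been verified for every row.

```python
import pandas as pd

# Toy rows that mimic the format seen in df.head(); NOT the real dataset
toy = pd.DataFrame({
    'Gender': ['Female', 'Prefer not to say', 'Female'],
    'Purchase_Categories': [
        'Beauty and Personal Care',
        'Groceries and Gourmet Food;Clothing and Fashion',
        'Beauty and Personal Care;Clothing and Fashion',
    ],
})

# Single-answer column: one indicator column per distinct value
gender_dummies = pd.get_dummies(toy['Gender'], prefix='Gender')

# Multi-answer column: split each cell on ';' (assumed separator)
# and build one 0/1 column per category
category_dummies = toy['Purchase_Categories'].str.get_dummies(sep=';')

print(gender_dummies.columns.tolist())
print(category_dummies)
```

`pd.get_dummies` fits columns where each respondent gives a single answer, while `Series.str.get_dummies` handles cells that hold several answers at once.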

In [5]:
## Checking duplicated values

df[df.duplicated()]

Unnamed: 0,Timestamp,age,Gender,Purchase_Frequency,Purchase_Categories,Personalized_Recommendation_Frequency,Browsing_Frequency,Product_Search_Method,Search_Result_Exploration,Customer_Reviews_Importance,...,Saveforlater_Frequency,Review_Left,Review_Reliability,Review_Helpfulness,Personalized_Recommendation_Frequency.1,Recommendation_Helpfulness,Rating_Accuracy,Shopping_Satisfaction,Service_Appreciation,Improvement_Areas


In [6]:
## Checking nulls on Product_Search_Method

df[df['Product_Search_Method'].isnull()]

Unnamed: 0,Timestamp,age,Gender,Purchase_Frequency,Purchase_Categories,Personalized_Recommendation_Frequency,Browsing_Frequency,Product_Search_Method,Search_Result_Exploration,Customer_Reviews_Importance,...,Saveforlater_Frequency,Review_Left,Review_Reliability,Review_Helpfulness,Personalized_Recommendation_Frequency.1,Recommendation_Helpfulness,Rating_Accuracy,Shopping_Satisfaction,Service_Appreciation,Improvement_Areas
119,2023/06/06 2:07:12 PM GMT+5:30,21,Female,Once a month,Clothing and Fashion,Sometimes,Few times a week,,Multiple pages,3,...,Often,No,Moderately,Sometimes,3,Sometimes,3,3,User-friendly website/app interface,Customer service responsiveness
382,2023/06/08 5:49:59 PM GMT+5:30,47,Female,Few times a month,Beauty and Personal Care;Clothing and Fashion;...,No,Multiple times a day,,Multiple pages,1,...,Often,No,Moderately,No,2,No,3,2,Wide product selection,Shipping speed and reliability


In [7]:
df['Product_Search_Method'].value_counts()

categories    223
Keyword       214
Filter        127
others         36
Name: Product_Search_Method, dtype: int64

In this case, it is hard to infer a cause for the null values in the dataset. It is possible that these consumers frequently use more than one search method and couldn't give a single answer, or simply haven't noticed which method they use when searching. Either way, we are going to impute the missing values with the most frequent answer, which is 'categories'.

See 'Handling Missing Data' by Analytics Vidhya: https://www.analyticsvidhya.com/blog/2021/10/handling-missing-value/
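
A variation on this idea, shown here on a toy Series rather than the real column, is to look up the most frequent value with `mode()` instead of typing the label by hand. The toy values below are invented for illustration; only the column name and labels come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the column, with two missing answers; NOT the real data
toy = pd.Series(
    ['categories', 'Keyword', np.nan, 'categories', 'Filter', np.nan],
    name='Product_Search_Method',
)

# mode() ignores NaN by default and returns a Series (ties are possible),
# so take the first entry as the fill value
most_frequent = toy.mode().iloc[0]

# Replace the missing answers with the most frequent one
imputed = toy.fillna(most_frequent)

print(most_frequent)
print(imputed.isnull().sum())
```

Looking the value up keeps the imputation correct if the counts change, at the cost of silently picking the alphabetically first label when two values tie.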

In [8]:
# Imputing missing data

df['Product_Search_Method'] = df['Product_Search_Method'].fillna('categories')
df[df['Product_Search_Method'].isnull()]

Unnamed: 0,Timestamp,age,Gender,Purchase_Frequency,Purchase_Categories,Personalized_Recommendation_Frequency,Browsing_Frequency,Product_Search_Method,Search_Result_Exploration,Customer_Reviews_Importance,...,Saveforlater_Frequency,Review_Left,Review_Reliability,Review_Helpfulness,Personalized_Recommendation_Frequency.1,Recommendation_Helpfulness,Rating_Accuracy,Shopping_Satisfaction,Service_Appreciation,Improvement_Areas


Now, we are going to transform the columns Purchase_Frequency, Personalized_Recommendation_Frequency, Browsing_Frequency, Saveforlater_Frequency, Review_Reliability and Recommendation_Helpfulness into integers. To do that, I'm going to print the existing values of each column to see how they should be ordered.

In [14]:
ordinal_dimensions = ['Purchase_Frequency', 
                      'Personalized_Recommendation_Frequency',
                      'Browsing_Frequency',
                      'Saveforlater_Frequency',
                      'Review_Reliability',
                      'Recommendation_Helpfulness']

for dimension in ordinal_dimensions:
    print(dimension, '\n', df[dimension].unique(), '\n')

Purchase_Frequency 
 ['Few times a month' 'Once a month' 'Less than once a month'
 'Multiple times a week' 'Once a week'] 

Personalized_Recommendation_Frequency 
 ['Yes' 'No' 'Sometimes'] 

Browsing_Frequency 
 ['Few times a week' 'Few times a month' 'Rarely' 'Multiple times a day'] 

Saveforlater_Frequency 
 ['Sometimes' 'Rarely' 'Never' 'Often' 'Always'] 

Review_Reliability 
 ['Occasionally' 'Heavily' 'Moderately' 'Never' 'Rarely'] 

Recommendation_Helpfulness 
 ['Yes' 'Sometimes' 'No'] 



- Personalized_Recommendation_Frequency and Recommendation_Helpfulness have a 3-value frequency scale;
- Browsing_Frequency has a 4-value frequency scale;
- Purchase_Frequency, Saveforlater_Frequency and Review_Reliability have a 5-value frequency scale.

So, the values will be converted into a numerical scale with 3, 4 or 5 levels, respectively (for example 0 to 2, 0 to 3 or 0 to 4 when counting from 0), where a higher value means 'more frequent' on each scale.
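
One possible way to do this conversion is sketched below on a toy frame. The labels are the ones printed above, but the orderings (least to most frequent), the choice to start counting at 0, and the names `ordinal_orders` and `encoded` are assumptions made for this example, not something the dataset defines.

```python
import pandas as pd

# Assumed orderings, from least to most frequent (labels taken from the
# printed unique values; the ranking itself is a judgment call)
ordinal_orders = {
    'Browsing_Frequency': ['Rarely', 'Few times a month',
                           'Few times a week', 'Multiple times a day'],
    'Saveforlater_Frequency': ['Never', 'Rarely', 'Sometimes', 'Often', 'Always'],
    'Recommendation_Helpfulness': ['No', 'Sometimes', 'Yes'],
}

# Toy rows for illustration; NOT the real dataset
toy = pd.DataFrame({
    'Browsing_Frequency': ['Few times a week', 'Rarely', 'Multiple times a day'],
    'Saveforlater_Frequency': ['Sometimes', 'Never', 'Always'],
    'Recommendation_Helpfulness': ['Yes', 'No', 'Sometimes'],
})

encoded = toy.copy()
for column, order in ordinal_orders.items():
    # Position in the ordered list becomes the integer code (0 = least frequent)
    mapping = {label: rank for rank, label in enumerate(order)}
    encoded[column] = toy[column].map(mapping)

print(encoded)
```

Any label missing from a mapping would become NaN after `map`, so checking `encoded.isnull().sum()` afterwards is a cheap way to catch a misspelled or unexpected value.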
