# Project - Cart Abandonment Analysis

StyleHub is an e-commerce platform that sells a wide range of clothing and accessories.

The website allows users to browse products, add items to their cart, and proceed to checkout to complete their purchases.

However, the analytics team has noticed a significant number of users adding items to their cart but not completing the checkout process.

They need your help to understand the reasons behind cart abandonment so they can optimize conversion rates and improve the user experience.


--- 

## Data Information

The dataset contains customer information and shopping behavior data, collected over the last 5 weeks through a combination of behavior tracking and customer feedback.


| Column Name            | Details |
|------------------------|---------|
| customer_id            | Nominal. Unique identifier for each customer. |
| first_item_added       | Continuous. The timestamp when the first item was added to the cart. |
| last_item_added        | Continuous. The timestamp when the last item was added to the cart. |
| discount_applied       | Nominal. Whether the customer applied a discount code or took advantage of a promotional offer during the purchase. |
| item_category          | Nominal. The category or categories of the items added to the cart. |
| number_of_items        | Discrete. The number of items added to the cart. |
| shipping_region        | Nominal. The region or country to which the items will be shipped. |
| total_price            | Continuous. The total price of all the items in the cart. |
| abandoned_checkout     | Nominal. Whether the customer abandoned the checkout (1) or completed the purchase (0) within a specific timeframe (e.g., 24 hours). |
| reason_for_abandonment | Nominal. The reason cited by the customer for abandoning the checkout, such as "High Shipping Costs," "Payment Issues," "Price Comparison," etc. |



## Data Generation

Using the `faker`, `pandas`, `random`, `math`, and `datetime` libraries, a dataset can be generated with the following specifications:

- The dataset consists of customer checkout information, where about 75% of the customers have abandoned the checkout process.
- The likelihood of a customer abandoning the checkout increases when fewer items are added to the cart.
- The likelihood of a customer abandoning the checkout increases when no discount is applied.
- The total price of the cart increases as the number of items added to the cart increases.
- The timestamp of the first item added is identical to the timestamp of the last item added when only one item is in the cart.
- A cart may contain multiple categories if it holds more than one item.
- Some numerical and categorical columns contain missing values and inconsistencies.
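
The discount rule is implemented indirectly: the generation code in this section first draws the abandonment flag (probability 0.75) and then draws the discount flag conditional on it (30% discounted when abandoned, 50% otherwise). The sketch below applies Bayes' rule to those weights to show what this implies for the abandonment rate with and without a discount. The constant and function names are illustrative and do not appear in the generation code.

```python
from fractions import Fraction

# Weights taken from the generation code in this section.
P_ABANDON = Fraction(3, 4)                   # P(abandoned)
P_DISCOUNT_GIVEN_ABANDON = Fraction(3, 10)   # P(discount | abandoned)
P_DISCOUNT_GIVEN_COMPLETE = Fraction(1, 2)   # P(discount | completed)


def p_abandon_given(discount: bool) -> Fraction:
    """Bayes' rule: P(abandoned | discount flag)."""
    if discount:
        like_abandon = P_DISCOUNT_GIVEN_ABANDON
        like_complete = P_DISCOUNT_GIVEN_COMPLETE
    else:
        like_abandon = 1 - P_DISCOUNT_GIVEN_ABANDON
        like_complete = 1 - P_DISCOUNT_GIVEN_COMPLETE
    numerator = P_ABANDON * like_abandon
    return numerator / (numerator + (1 - P_ABANDON) * like_complete)


with_discount = p_abandon_given(True)      # 9/14, roughly 0.64
without_discount = p_abandon_given(False)  # 21/26, roughly 0.81
print(float(with_discount), float(without_discount))
```

Under these weights, roughly 81% of carts without a discount are abandoned, compared with roughly 64% of carts with one, which is the gap an analysis of the generated data should recover.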

import math
import random
from datetime import datetime, timedelta

import pandas as pd
from faker import Faker

# Instantiate Faker object
fake = Faker()

# Set the seed values for reproducibility (both Faker and the random module are used below)
Faker.seed(20230712)
random.seed(20230712)

# Set the number of rows in the dataset
num_rows = 7012

# Create an empty dictionary to store the data
data = {
    'customer_id': [],
    'first_item_added': [],
    'last_item_added': [],
    'discount_applied': [],
    'item_category': [],
    'number_of_items': [],
    'total_price': [],
    'shipping_region': [],
    'abandoned_checkout': [],
    'reason_for_abandonment': []
}


# Generate random data for each column
for _ in range(num_rows):
    # Generate a unique customer ID
    customer_id = fake.unique.random_number(digits=6)

    # Generate the abandoned_checkout flag; each customer abandons with probability 0.75
    abandoned_checkout = random.choices([True, False], weights=[0.75, 0.25])[0]

    # Generate a random boolean for discount applied, with a higher chance of False when abandoned_checkout is True
    if abandoned_checkout:
        discount_applied = random.choices([True, False], weights=[0.3, 0.7])[0]
    else:
        discount_applied = random.choice([True, False])
    
    # Generate random error factors for the number of items in the cart
    error_factor_higher = fake.random.uniform(1, 1.5)
    error_factor_lower = fake.random.uniform(0.5, 1)

    # Generate a random number of items in the cart
    # Smaller number_of_items when abandoned_checkout is True
    if abandoned_checkout:
        number_of_items = int(math.ceil(fake.random_int(min=1, max=5) * error_factor_higher))
    else:
        number_of_items = int(math.ceil(fake.random_int(min=1, max=10) * error_factor_lower))
        
    # Generate a random error factor for the total price
    error_factor = fake.random.uniform(0.8, 1.2)

    # Generate a random total price with error
    base_price = number_of_items * fake.random.uniform(10, 100)
    total_price = round(base_price * error_factor, 2)

    # Generate timestamps for the first item added
    start_time = fake.date_time_between(start_date='-30d', end_date='-1d')
    first_item_added = start_time.strftime('%Y-%m-%d %H:%M:%S')
    
    # Generate timestamps for the last item added
    if number_of_items == 1:
        last_item_added = first_item_added
    else:
        last_item_added = (start_time + timedelta(minutes=random.randint(1, 60))).strftime('%Y-%m-%d %H:%M:%S')
        
    # Generate random item categories for number_of_items larger than 1 with a 50% chance of multiple categories
    if number_of_items > 1 and random.random() < 0.5:
        item_categories = fake.random_elements(
            elements=['Clothing', 'Shoes', 'Accessories'], unique=True, length=random.choice([2, 3])
        )
        item_category = ', '.join(item_categories)
    else:
        item_category = fake.random_element(elements=['Clothing', 'Shoes', 'Accessories'])

    # Generate a random shipping region
    shipping_region = random.choice(['North America', 'Europe', 'Asia', 'Latin America', 'Africa', 'Others'])

    # Generate a random reason for abandonment
    reasons = ['High Shipping Costs', 'Payment Issues', 'Price Comparison', 'Changed Mind']
    reason_for_abandonment = random.choices(reasons, weights=[0.1, 0.3, 0.5, 0.1])[0] if abandoned_checkout else ''

    # Store total_price as a string with a currency symbol (a deliberate type inconsistency)
    total_price = "$" + str(total_price)

    # Introduce missing values in total_price
    if random.random() < 0.03:  # 3% chance of a missing value
        total_price = None

    # Introduce missing values in shipping_region
    if random.random() < 0.05:  # 5% chance of a missing value
        shipping_region = None


    # Append the generated data to the dictionary
    data['customer_id'].append(customer_id)
    data['first_item_added'].append(first_item_added)
    data['last_item_added'].append(last_item_added)
    data['discount_applied'].append(discount_applied)
    data['item_category'].append(item_category)
    data['number_of_items'].append(number_of_items)
    data['total_price'].append(total_price)
    data['shipping_region'].append(shipping_region)
    data['abandoned_checkout'].append(abandoned_checkout)
    data['reason_for_abandonment'].append(reason_for_abandonment)

    
# Create a DataFrame from the generated data
df = pd.DataFrame(data)

# Save the dataset (the data/ directory must already exist)
df.to_csv("data/customer.csv", index=False)
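
Because the generation code stores `total_price` as a string with a leading `$` and leaves some prices and regions missing, the file needs light cleaning before analysis. The following is a minimal sketch of one way to do that, run on a hypothetical three-row sample with a subset of the columns rather than on `data/customer.csv` itself. The `minutes_in_cart` column is an illustrative derived feature, not part of the dataset.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as a subset of data/customer.csv.
sample_csv = io.StringIO(
    "customer_id,first_item_added,last_item_added,total_price,shipping_region,abandoned_checkout\n"
    "101,2023-07-01 10:00:00,2023-07-01 10:00:00,$45.50,Europe,True\n"
    "102,2023-07-02 09:15:00,2023-07-02 09:40:00,,Asia,False\n"
    "103,2023-07-03 18:30:00,2023-07-03 19:05:00,$120.00,,True\n"
)

df_sample = pd.read_csv(sample_csv, parse_dates=["first_item_added", "last_item_added"])

# Strip the leading "$" and convert to float; missing prices stay NaN.
df_sample["total_price"] = pd.to_numeric(
    df_sample["total_price"].str.replace("$", "", regex=False), errors="coerce"
)

# Minutes between the first and last item added to the cart.
df_sample["minutes_in_cart"] = (
    df_sample["last_item_added"] - df_sample["first_item_added"]
).dt.total_seconds() / 60

# Count the missing values introduced on purpose by the generator.
missing_counts = df_sample[["total_price", "shipping_region"]].isna().sum()
print(df_sample.dtypes)
print(missing_counts)
```

Note that `pd.read_csv` treats empty fields as missing by default, so the empty `reason_for_abandonment` strings written for completed purchases will be read back as `NaN`.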