## Main Objective of the Analysis

The primary objective of this analysis is to predict customer purchasing behavior based on features such as product preferences, transaction amounts, and visit frequency. The goal is to develop a model that accurately predicts sales outcomes and provides insights into key factors driving sales in the coffee shop, ultimately assisting in optimizing marketing strategies and inventory management.

## Brief Description of the Dataset

In [3]:
import pandas as pd

# Load the dataset
file_path = '/content/Coffee Shop Sales.xlsx'
data = pd.read_excel(file_path)

# Display column names, dtypes, and non-null counts
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149116 entries, 0 to 149115
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   transaction_id    149116 non-null  int64         
 1   transaction_date  149116 non-null  datetime64[ns]
 2   transaction_time  149116 non-null  object        
 3   transaction_qty   149116 non-null  int64         
 4   store_id          149116 non-null  int64         
 5   store_location    149116 non-null  object        
 6   product_id        149116 non-null  int64         
 7   unit_price        149116 non-null  float64       
 8   product_category  149116 non-null  object        
 9   product_type      149116 non-null  object        
 10  product_detail    149116 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(4), object(5)
memory usage: 12.5+ MB


The dataset contains 149,116 transactions from a coffee shop. The key attributes in the dataset are:

- transaction_id: Unique identifier for each transaction.
- transaction_date: Date when the transaction took place.
- transaction_time: Time when the transaction occurred.
- transaction_qty: Quantity of items purchased in the transaction.
- store_id: Identifier for the store where the transaction occurred.
- store_location: The location of the store.
- product_id: Identifier for the product sold.
- unit_price: The price per unit of the product.
- product_category: Category of the product (e.g., Coffee, Tea).
- product_type: Specific type within the product category (e.g., Gourmet brewed coffee).
- product_detail: Detailed description of the product.

The goal is to use these features to predict the sales outcome, which could be represented by the transaction_qty, or to identify patterns in purchasing behavior.

## Data Exploration and Cleaning

In [4]:
# Checking for missing values
missing_values = data.isnull().sum()

# Exploring the distribution of key features
transaction_qty_distribution = data['transaction_qty'].describe()
unit_price_distribution = data['unit_price'].describe()

missing_values, transaction_qty_distribution, unit_price_distribution


(transaction_id      0
 transaction_date    0
 transaction_time    0
 transaction_qty     0
 store_id            0
 store_location      0
 product_id          0
 unit_price          0
 product_category    0
 product_type        0
 product_detail      0
 dtype: int64,
 count    149116.000000
 mean          1.438276
 std           0.542509
 min           1.000000
 25%           1.000000
 50%           1.000000
 75%           2.000000
 max           8.000000
 Name: transaction_qty, dtype: float64,
 count    149116.000000
 mean          3.382219
 std           2.658723
 min           0.800000
 25%           2.500000
 50%           3.000000
 75%           3.750000
 max          45.000000
 Name: unit_price, dtype: float64)

- Missing Values: There are no missing values in the dataset, so no imputation is necessary.
- Transaction Quantity (transaction_qty): The quantity of items purchased in transactions ranges from 1 to 8, with a mean of approximately 1.44. Most transactions involve purchasing 1 or 2 items.
- Unit Price (unit_price): The price per unit varies significantly, ranging from $0.80 to $45.00, with a mean price of about $3.38.

Since no values are missing and the value ranges look plausible, the next step is feature engineering and model training.
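Missing values are only one aspect of data quality. A few further checks could look like the sketch below; it runs on a small invented frame, and `basic_checks` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the real data; all values are invented for illustration.
toy = pd.DataFrame({
    "transaction_id": [1, 2, 2, 4],
    "transaction_qty": [1, 2, 2, 0],
    "unit_price": [3.0, 2.5, 2.5, 45.0],
})

def basic_checks(df: pd.DataFrame) -> dict:
    """Return a few simple data-quality counts (hypothetical helper)."""
    return {
        # Repeated ids beyond the first occurrence
        "duplicate_ids": int(df["transaction_id"].duplicated().sum()),
        # Quantities and prices should be strictly positive
        "non_positive_qty": int((df["transaction_qty"] <= 0).sum()),
        "non_positive_price": int((df["unit_price"] <= 0).sum()),
    }

checks = basic_checks(toy)
print(checks)
```

Applied to the real data, the same helper could be called as `basic_checks(data)`; all three counts being zero would support treating the data as clean.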

## Model Training

We will train and compare three different classification models. For this example, let's assume we're predicting whether a transaction involves purchasing more than one item (transaction_qty > 1).

We'll create a binary target variable large_order (1 if transaction_qty > 1, 0 otherwise) and use the following models:

- Logistic Regression: As a baseline model.
- Random Forest Classifier: To capture non-linear relationships and interactions between features.
- Gradient Boosting Classifier: To optimize prediction performance by combining multiple weak learners.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

# Feature engineering: create a binary target variable
data['large_order'] = (data['transaction_qty'] > 1).astype(int)

# Selecting features for the model
features = ['store_id', 'unit_price', 'product_id']
X = data[features]
y = data['large_order']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardizing the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training the models
# Logistic Regression
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train_scaled, y_train)
y_pred_logreg = logreg.predict(X_test_scaled)

# Random Forest Classifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Gradient Boosting Classifier
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)

# Generating classification reports
logreg_report = classification_report(y_test, y_pred_logreg)
rf_report = classification_report(y_test, y_pred_rf)
gb_report = classification_report(y_test, y_pred_gb)

logreg_report, rf_report, gb_report


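Logistic Regression:

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 | 0.61 | 0.79 | 0.69 | 26247 |
| 1 | 0.50 | 0.30 | 0.37 | 18488 |
| accuracy | | | 0.59 | 44735 |
| macro avg | 0.56 | 0.54 | 0.53 | 44735 |
| weighted avg | 0.57 | 0.59 | 0.56 | 44735 |

Random Forest Classifier:

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 | 0.68 | 0.63 | 0.65 | 26247 |
| 1 | 0.52 | 0.57 | 0.54 | 18488 |
| accuracy | | | 0.60 | 44735 |
| macro avg | 0.60 | 0.60 | 0.60 | 44735 |
| weighted avg | 0.61 | 0.60 | 0.61 | 44735 |

Gradient Boosting Classifier:

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 | 0.68 | 0.60 | 0.64 | 26247 |
| 1 | 0.52 | 0.61 | 0.56 | 18488 |
| accuracy | | | 0.60 | 44735 |
| macro avg | 0.60 | 0.60 | 0.60 | 44735 |
| weighted avg | 0.62 | 0.60 | 0. | |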

Three models were trained and evaluated on the binary classification task of predicting whether a transaction involves more than one item. Per-class values are listed as class 0 (single item) / class 1 (more than one item):

Logistic Regression:
- Precision: 0.61 / 0.50
- Recall: 0.79 / 0.30
- F1-Score: 0.69 / 0.37
- Accuracy: 0.59

Random Forest Classifier:
- Precision: 0.68 / 0.52
- Recall: 0.63 / 0.57
- F1-Score: 0.65 / 0.54
- Accuracy: 0.60

Gradient Boosting Classifier:
- Precision: 0.68 / 0.52
- Recall: 0.60 / 0.61
- F1-Score: 0.64 / 0.56
- Accuracy: 0.60
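The macro and weighted averages in the classification reports can be reproduced from the per-class scores and supports. The sketch below does this for the Logistic Regression F1-scores, using the rounded figures from the report, so agreement is only expected to two decimals.

```python
# Per-class F1 and support for Logistic Regression, copied from the report.
f1_scores = {0: 0.69, 1: 0.37}
support = {0: 26247, 1: 18488}

total = sum(support.values())

# Macro average: unweighted mean over the classes.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: each class weighted by its share of the test set.
weighted_f1 = sum(f1_scores[c] * support[c] for c in f1_scores) / total

print(f"macro F1    = {macro_f1:.2f}")     # 0.53, matching the report
print(f"weighted F1 = {weighted_f1:.2f}")  # 0.56, matching the report
```

The weighted average sits closer to the class 0 score because class 0 accounts for the larger share of the test rows.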

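Recommendation: The Random Forest Classifier is recommended as the final model because its precision, recall, and F1-score are reasonably balanced across the two classes. It outperforms Logistic Regression on the multi-item class (F1 of 0.54 versus 0.37) and performs about the same as Gradient Boosting (accuracy of 0.60 for both), while its independently trained trees are arguably simpler to explain. All three accuracies are only slightly above the 0.59 that would result from always predicting the majority class (26,247 of 44,735 test rows), so the three selected features carry limited predictive signal.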
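To put the reported accuracies in context, it helps to compare them with a trivial model that always predicts the majority class. The sketch below derives that baseline from the test-set class counts in the reports; the variable names are illustrative.

```python
# Test-set class counts, taken from the "support" column of the reports.
n_single = 26247  # class 0: one item
n_multi = 18488   # class 1: more than one item

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(n_single, n_multi) / (n_single + n_multi)

# Accuracies as reported in the classification reports.
reported = {
    "Logistic Regression": 0.59,
    "Random Forest": 0.60,
    "Gradient Boosting": 0.60,
}

print(f"majority-class baseline: {baseline_accuracy:.3f}")
for name, acc in reported.items():
    print(f"{name}: {acc:.2f} ({acc - baseline_accuracy:+.3f} vs. baseline)")
```

The same baseline could also be obtained inside the notebook with scikit-learn's `DummyClassifier(strategy="most_frequent")`, fitted and scored on the same train/test split.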

## Key Findings and Insights

- Feature Importance: In the Random Forest model, unit_price and product_id were significant predictors of whether a customer would purchase more than one item.
- Customer Behavior: The analysis suggests that unit price and the specific product bought may influence whether customers buy multiple items. Note that product_category itself was not a model feature, so any category-level effect is captured only indirectly through product_id.
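Feature importances such as those mentioned above can be read from a fitted Random Forest. The following is a minimal sketch on synthetic data, where the target depends only on `unit_price` by construction; the numbers it prints say nothing about the real coffee shop data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): same column names as the notebook,
# but random values, with a target that depends only on unit_price.
rng = np.random.default_rng(42)
n = 2000
X = pd.DataFrame({
    "store_id": rng.integers(1, 4, size=n),
    "unit_price": rng.uniform(0.8, 10.0, size=n),
    "product_id": rng.integers(1, 80, size=n),
})
y = (X["unit_price"] < 3.0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=42)
rf.fit(X, y)

# Impurity-based importances, one per feature, normalized to sum to 1.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook, the same two lines after `fit` could be applied to the already trained `rf` with `index=features`. Impurity-based importances tend to favor features with many distinct values, such as product_id, so `sklearn.inspection.permutation_importance` on the test set may be worth comparing.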

## Model Flaws and Next Steps

- Class Imbalance: The classes are moderately imbalanced: in the test set, about 59% of transactions are single-item and 41% are multi-item (26,247 versus 18,488). Techniques such as oversampling or undersampling could be explored to improve model performance on the minority class.
- Feature Expansion: Additional features such as time of day, customer demographics, or loyalty program status could improve prediction accuracy.
- Model Refinement: Experimenting with hyperparameter tuning and different ensemble methods (like XGBoost) could further optimize the model.
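Two of the next steps above, time-of-day features and oversampling, could be sketched as follows. The example uses a small invented frame, assumes `transaction_time` holds `datetime.time` objects formatted as `HH:MM:SS` (typical for times read from Excel, but not confirmed by the notebook), and the helper names are hypothetical.

```python
import datetime as dt
import pandas as pd

# Toy rows mimicking the real columns; all values are invented.
toy = pd.DataFrame({
    "transaction_date": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-01-02", "2023-01-03"]
    ),
    "transaction_time": [
        dt.time(7, 6, 11), dt.time(15, 30, 0), dt.time(9, 0, 5), dt.time(18, 45, 0)
    ],
    "transaction_qty": [1, 1, 2, 1],
})
toy["large_order"] = (toy["transaction_qty"] > 1).astype(int)

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add hour of day and day of week (Monday=0) as numeric columns."""
    out = df.copy()
    times = pd.to_datetime(out["transaction_time"].astype(str), format="%H:%M:%S")
    out["hour"] = times.dt.hour
    out["day_of_week"] = out["transaction_date"].dt.dayofweek
    return out

def oversample_minority(df: pd.DataFrame, target: str = "large_order",
                        seed: int = 42) -> pd.DataFrame:
    """Randomly resample the minority class, with replacement, up to the majority size."""
    minority_label = df[target].value_counts().idxmin()
    majority = df[df[target] != minority_label]
    minority = df[df[target] == minority_label]
    resampled = minority.sample(n=len(majority), replace=True, random_state=seed)
    return pd.concat([majority, resampled], ignore_index=True)

featured = add_time_features(toy)
balanced = oversample_minority(featured)
print(featured[["hour", "day_of_week", "large_order"]])
print(balanced["large_order"].value_counts())
```

Oversampling should be applied to the training split only, after `train_test_split`, so that duplicated rows never leak into the test set.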

## Conclusion

This analysis highlights the potential for predictive modeling to understand customer behavior in a coffee shop setting. By implementing the Random Forest model, the business can better anticipate customer needs and optimize inventory and marketing strategies.