A fictional e-commerce store has provided us with their sales data for the past year. The data includes information on sales revenue, products sold, customer demographics, and shipping information. Our task is to analyze the data and provide insights on the following questions:
- What is the overall revenue for the year?
- What were the top-selling products?
- What was the average order value?
- What is the customer demographics like?
- What was the shipping cost distribution?
- Can we identify any trends or patterns in the sales data?
For this project, I will be using the following libraries:
- Pandas: for data manipulation and analysis
- NumPy: for numerical computing
- Matplotlib: for data visualization
I will start by loading the data into a Pandas DataFrame using the read_csv()
function:
import pandas as pd
sales_data = pd.read_csv("sales_data.csv")
To calculate the overall revenue for the year, we can use the sum() function to sum up the revenue column:
overall_revenue = sales_data["revenue"].sum()
To identify the top-selling products, we can use the groupby() function to group the data by product, and then use the sum() function to calculate the total revenue for each product:
product_revenue = sales_data.groupby("product")["revenue"].sum()
top_products = product_revenue.sort_values(ascending=False).head(10)
To calculate the average order value, we can divide the overall revenue by the number of orders:
avg_order_value = overall_revenue / sales_data["order_id"].nunique()
To analyze the customer demographics, we can use the groupby() function to group the data by customer age and gender, and then calculate the average order value for each group:
customer_demographics = sales_data.groupby(["age", "gender"])["revenue"].mean()
To analyze the shipping cost distribution, we can use the hist() function from Matplotlib to create a histogram of the shipping costs:
import matplotlib.pyplot as plt
plt.hist(sales_data["shipping_cost"], bins=20) plt.xlabel("Shipping Cost") plt.ylabel("Frequency") plt.show()
To identify trends and patterns in the sales data, we can use various data visualization techniques such as line charts, scatter plots, and heat maps:
sales_data["date"] = pd.to_datetime(sales_data["date"])
sales_data.set_index("date", inplace=True)
sales_data.resample("M")["revenue"].sum().plot()
plt.xlabel("Date") plt.ylabel("Revenue") plt.show()
plt.scatter(sales_data["age"], sales_data["revenue"]) plt.xlabel("Age") plt.ylabel("Revenue")