# Supermarket Dataset Analysis

## 1. Introduction
This analysis uses the `supermarket_sales.csv` dataset, which contains transactional records from a supermarket: sales figures, product details, customer demographics, and payment methods. The objective is to analyze sales performance, customer behavior, and product trends, and to apply time series forecasting methods to predict future sales.

---

## 2. Dataset Overview

The dataset consists of the following columns:

- **`Invoice ID`**: Unique identifier for each transaction.
- **`Branch`**: Branch of the supermarket where the transaction occurred.
- **`City`**: City in which the branch is located.
- **`Customer type`**: Type of customer (e.g., Member, Normal).
- **`Gender`**: Gender of the customer.
- **`Product line`**: Category of products purchased.
- **`Unit price`**: Price per unit of the product.
- **`Quantity`**: Number of units purchased.
- **`Tax 5%`**: Tax amount applied to the transaction.
- **`Total`**: Total amount spent in the transaction.
- **`Date`**: Date of the transaction.
- **`Time`**: Time of the transaction.
- **`Payment`**: Payment method used.
- **`cogs`**: Cost of goods sold.
- **`gross margin percentage`**: Percentage of gross margin.
- **`gross income`**: Gross income from the transaction.
- **`Rating`**: Customer rating for the transaction.

---

## 3. Data Quality Checks

### 3.1 Missing Values

Check for missing values and handle them appropriately.

```python
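import pandas as pd
# Load the dataset (assumes the CSV is in the working directory)
data = pd.read_csv('supermarket_sales.csv')

# Count missing values per column
missing_values = data.isnull().sum()
print("Missing values in each column:")
print(missing_values)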
```
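
What counts as appropriate handling depends on the column. The following is a minimal sketch, assuming a small hypothetical frame `sample` that stands in for `data` with gaps injected for illustration: rows missing an essential value are dropped, and gaps in a less critical numeric column are filled with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; the gaps are injected for illustration.
sample = pd.DataFrame({
    "Invoice ID": ["A-1", "A-2", "A-3", "A-4"],
    "Total": [100.0, np.nan, 250.0, 80.0],
    "Rating": [7.0, 8.0, np.nan, 5.0],
})

# Drop rows that lack a value the analysis cannot do without.
handled = sample.dropna(subset=["Invoice ID", "Total"]).copy()

# Fill gaps in a non-critical numeric column with the column median.
handled["Rating"] = handled["Rating"].fillna(handled["Rating"].median())

print(handled)
```

Median imputation is only one option; dropping the row or leaving the gap in place may be the better choice when a column feeds directly into the forecast.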

### 3.2 Data Types

Ensure that the data types are appropriate for analysis.

```python
# Check data types
print("Data types:")
print(data.dtypes)
```
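
If the check shows numeric or date columns stored as text, they can be converted explicitly. This sketch assumes a hypothetical frame `raw` whose values arrived as strings; the column names follow the dataset, but the rows are invented.

```python
import pandas as pd

# Hypothetical rows as they might look if every column were read as text.
raw = pd.DataFrame({
    "Date": ["1/5/2019", "3/8/2019"],
    "Quantity": ["7", "5"],
    "Total": ["548.97", "80.22"],
    "Payment": ["Ewallet", "Cash"],
})

typed = raw.copy()
# Parse month/day/year strings into datetimes.
typed["Date"] = pd.to_datetime(typed["Date"], format="%m/%d/%Y")
# Convert numeric text; values that cannot be parsed become missing.
typed["Quantity"] = pd.to_numeric(typed["Quantity"], errors="coerce").astype("Int64")
typed["Total"] = pd.to_numeric(typed["Total"], errors="coerce")
# Low-cardinality labels are stored compactly as categories.
typed["Payment"] = typed["Payment"].astype("category")

print(typed.dtypes)
```

Using `errors="coerce"` turns unparseable entries into missing values, so rerunning the missing-value check afterwards shows how many conversions failed.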

### 3.3 Outliers

Detect and handle outliers in key numerical columns.

```python
import numpy as np
from scipy import stats

# Absolute z-score of each transaction total
z_scores = np.abs(stats.zscore(data['Total']))

# Keep rows within 3 standard deviations of the mean (|z| < 3)
data_cleaned = data[z_scores < 3]

print("Number of outliers removed:", len(data) - len(data_cleaned))
```
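
Z-scores work best on roughly symmetric data. For skewed transaction totals, an interquartile-range (IQR) rule is a common alternative. The sketch below uses a hypothetical series `totals` and flags values outside 1.5 × IQR rather than removing them, so they can be inspected first.

```python
import pandas as pd

# Hypothetical totals; the last value is deliberately extreme.
totals = pd.Series([120.0, 95.0, 130.0, 110.0, 105.0, 125.0, 980.0], name="Total")

# Quartiles and the interquartile range
q1, q3 = totals.quantile(0.25), totals.quantile(0.75)
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask marking values outside the fences
is_outlier = (totals < lower) | (totals > upper)
print(totals[is_outlier])
```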

---

## 4. Exploratory Data Analysis (EDA)

### 4.1 Total Sales and Profit

Calculate total sales and total gross income (treated here as profit) to understand overall performance.

```python
# Calculate total sales and total profit
total_sales = data['Total'].sum()
total_profit = data['gross income'].sum()

print(f"Total Sales: {total_sales}")
print(f"Total Profit: {total_profit}")
```

### 4.2 Sales by Branch

Analyze sales distribution across different branches.

```python
# Aggregate sales by branch
sales_by_branch = data.groupby('Branch')['Total'].sum()

print("Sales by Branch:")
print(sales_by_branch)
```

### 4.3 Sales by Product Line

Explore sales distribution across different product lines.

```python
# Aggregate sales by product line
sales_by_product_line = data.groupby('Product line')['Total'].sum()

print("Sales by Product Line:")
print(sales_by_product_line)
```

### 4.4 Customer Demographics

Analyze sales performance by customer type and gender.

```python
# Aggregate sales by customer type
sales_by_customer_type = data.groupby('Customer type')['Total'].sum()

print("Sales by Customer Type:")
print(sales_by_customer_type)

# Aggregate sales by gender
sales_by_gender = data.groupby('Gender')['Total'].sum()

print("Sales by Gender:")
print(sales_by_gender)
```
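
The two groupings above can also be combined into a single table. This is a small sketch on a hypothetical frame `sample` that reuses the dataset's column names; the rows are invented.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names.
sample = pd.DataFrame({
    "Customer type": ["Member", "Member", "Normal", "Normal", "Member"],
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "Total": [100.0, 50.0, 70.0, 30.0, 20.0],
})

# Rows: customer type, columns: gender, cells: summed sales.
sales_by_type_and_gender = sample.pivot_table(
    index="Customer type", columns="Gender", values="Total", aggfunc="sum"
)
print(sales_by_type_and_gender)
```

A cross-tabulation like this shows whether a difference between customer types holds for both genders or is driven by one group.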

### 4.5 Time Series Analysis

Convert the `Date` column to datetime and aggregate sales by date for time series analysis.

```python
import pandas as pd

# Convert 'Date' column to datetime format
data['Date'] = pd.to_datetime(data['Date'], format='%m/%d/%Y')

# Aggregate total sales by date
daily_sales = data.groupby('Date').agg({'Total': 'sum'}).reset_index()

print("Daily Sales Summary:")
print(daily_sales.head())
```
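
Grouping by date only produces rows for days that have transactions, while most forecasting models expect an evenly spaced series. The sketch below assumes a hypothetical frame `daily` with one day missing, fills the gap, and adds a rolling mean. Treating a missing day as zero sales is an assumption; interpolation may fit better if the gap is a data issue rather than a closed store.

```python
import pandas as pd

# Hypothetical daily totals with 2019-01-03 missing.
daily = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-04"]),
    "Total": [100.0, 200.0, 300.0],
})

# Reindex onto a complete daily calendar; days without sales become 0.
series = daily.set_index("Date")["Total"].asfreq("D", fill_value=0.0)

# Three-day rolling mean to smooth day-to-day noise.
rolling = series.rolling(window=3).mean()

print(series)
print(rolling)
```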

### 4.6 Visualizations

Visualize key metrics and trends for better insights.

```python
import matplotlib.pyplot as plt

# Plot total sales over time
plt.figure(figsize=(10, 6))
plt.plot(daily_sales['Date'], daily_sales['Total'], label='Total Sales')
plt.title('Total Sales Over Time')
plt.xlabel('Date')
plt.ylabel('Total Sales')
plt.legend()
plt.show()
```

---

## 5. Forecasting

### 5.1 ARIMA Model

Use ARIMA for forecasting future sales based on historical data.

```python
from pmdarima import auto_arima
from statsmodels.tsa.stattools import adfuller

# Check for stationarity
result = adfuller(daily_sales['Total'])
print('ADF Statistic:', result[0])
print('p-value:', result[1])

# Fit ARIMA model
arima_model = auto_arima(daily_sales['Total'], seasonal=False, stepwise=True)
print(arima_model.summary())
```
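
The ADF test's null hypothesis is that the series is non-stationary, so a p-value above the chosen threshold (often 0.05) suggests differencing is needed. `auto_arima` selects the differencing order itself, but doing it by hand shows what `d=1` means. The series `y` below is hypothetical and stands in for `daily_sales['Total']`.

```python
import pandas as pd

# Hypothetical trending series standing in for daily_sales['Total'].
y = pd.Series([100.0, 110.0, 125.0, 135.0, 150.0])

# First difference: y[t] - y[t-1]; this is the transformation d=1 applies.
y_diff = y.diff().dropna()
print(y_diff.tolist())

# Differencing is reversible: cumulative sums from the first value restore the levels.
y_restored = y.iloc[0] + y_diff.cumsum()
print(y_restored.tolist())
```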

### 5.2 LSTM Model

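Use an LSTM (long short-term memory) network as a neural alternative to ARIMA. The model learns to predict the next day's total from a sliding window of the preceding days.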

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout

# Normalize the data
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(daily_sales[['Total']])

# Create sequences
def create_sequences(data, time_step=1):
    X, y = [], []
    for i in range(len(data) - time_step):
        X.append(data[i:i+time_step, 0])
        y.append(data[i + time_step, 0])
    return np.array(X), np.array(y)

time_step = 10
X, y = create_sequences(scaled_data, time_step)
X = X.reshape(X.shape[0], X.shape[1], 1)

# Build LSTM model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(time_step, 1)))
model.add(Dropout(0.2))
model.add(LSTM(units=50, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(units=1))
model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
history = model.fit(X, y, epochs=50, batch_size=64, validation_split=0.2)
```
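
To see what the windowing step produces, it helps to run it on a tiny array. `make_windows` below is a hypothetical helper that mirrors `create_sequences` for one-dimensional input; it is illustrative only and not part of the pipeline above.

```python
import numpy as np

def make_windows(values, time_step):
    """Pair each run of `time_step` consecutive values with the value that follows it."""
    X, y = [], []
    for i in range(len(values) - time_step):
        X.append(values[i:i + time_step])
        y.append(values[i + time_step])
    return np.array(X), np.array(y)

toy = np.arange(6, dtype=float)  # [0, 1, 2, 3, 4, 5]
X_toy, y_toy = make_windows(toy, time_step=3)

# Keras LSTM layers expect input shaped (samples, time steps, features).
X_toy = X_toy.reshape(X_toy.shape[0], X_toy.shape[1], 1)

print(X_toy.shape)
print(y_toy)
```

With six values and a window of three, only three training samples remain, which illustrates how quickly a short daily series shrinks once it is windowed.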

### 5.3 Facebook Prophet

Facebook Prophet is a forecasting library that fits an additive model with trend, seasonality, and holiday components, which suits daily sales data with weekly patterns.

#### Preprocessing and Forecasting with Prophet

```python
from prophet import Prophet  # the package was named `fbprophet` before version 1.0

# Prophet expects a date column named 'ds' and a value column named 'y'
daily_sales_prophet = daily_sales.rename(columns={'Date': 'ds', 'Total': 'y'})

# Initialize the Prophet model
prophet_model = Prophet()
# 'US' is a placeholder; use the country where the branches operate, or drop this line
prophet_model.add_country_holidays(country_name='US')

# Fit the model with the daily sales data
prophet_model.fit(daily_sales_prophet)

# Create future dates for forecasting (next 30 days)
future = prophet_model.make_future_dataframe(periods=30)

# Make the forecast
forecast = prophet_model.predict(future)

# Display the forecasted data
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())
```

#### Visualization

```python
import matplotlib.pyplot as plt

# Plot the forecast
fig = prophet_model.plot(forecast)
plt.title('Facebook Prophet Forecast')
plt.show()
```

### 5.4 Comparison of Forecasting Methods

After fitting the ARIMA, LSTM, and Facebook Prophet models, compare them on the same held-out period (for example, the final weeks of the data) using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). The model with the lowest error on that period provides the most accurate forecasts for the supermarket sales data.
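
Both metrics can be computed with NumPy alone. The sketch below uses invented numbers: `actual` and the two entries in `forecasts` are hypothetical placeholders, not results from the models above.

```python
import numpy as np

def mae(actual, predicted):
    """Mean Absolute Error: average size of the errors."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(actual - predicted)))

def rmse(actual, predicted):
    """Root Mean Squared Error: penalizes large errors more heavily than MAE."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical held-out actuals and forecasts from two placeholder models.
actual = [100.0, 120.0, 90.0, 110.0]
forecasts = {
    "model_a": [110.0, 115.0, 95.0, 100.0],
    "model_b": [100.0, 140.0, 90.0, 110.0],
}

# Score every model against the same held-out values.
scores = {
    name: {"MAE": mae(actual, pred), "RMSE": rmse(actual, pred)}
    for name, pred in forecasts.items()
}
for name, metrics in scores.items():
    print(name, metrics)
```

In this toy case the two metrics disagree: `model_b` has the lower MAE, but its single large miss gives it the higher RMSE. Reporting both makes that trade-off visible.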