# 🚀 **Advanced Data Analysis Assignment**

Welcome to the next-level assignment! We’ll build on the two previous datasets:
1. A **region-based** dataset containing `Region`, `Sales`, and `Transactions`.
2. A **time-series** dataset containing daily `Sales` from 2020-01-01 to 2020-12-31.

In this notebook, you will:
1. Load and explore both datasets.
2. Perform advanced grouping and pivoting on the regional data.
3. Check correlations and detect potential outliers.
4. Conduct advanced time-series analysis (monthly averages, rolling means, and day-of-week seasonality).
5. Provide concise insights from your findings.

Let's get started! 🎉


## 🧩 **Part A: Advanced Analysis on Regional Sales Data**
We'll begin by re-generating (or reloading) the regional sales data from your previous assignment.

In [1]:
# === Part A: Data Generation (Regional) ===
import pandas as pd
import numpy as np

# Seed for reproducibility
np.random.seed(0)

# Generate random data
data_regional = {
    'Region': np.random.choice(['North', 'South', 'East', 'West'], size=100),
    'Sales': np.random.rand(100) * 1000,  # Sales figures between 0 and 1000
    'Transactions': np.random.randint(1, 100, size=100)  # Transactions from 1 to 99 (upper bound is exclusive)
}

# Create DataFrame
df_regional = pd.DataFrame(data_regional)
df_regional.head()

### 🔍 **Task A1: Exploratory Data Analysis**
1. Display basic summary statistics for `Sales` and `Transactions`.
2. Identify the number of unique regions.
3. Check for any missing values.


In [2]:
# === SOLUTION for Task A1 ===

# 1) Basic summary statistics
# 2) Number of unique regions
# 3) Check for missing values

              Sales  Transactions
count   100.000000    100.000000
mean    496.458891     49.010000
std     285.310573     29.334691
min       4.695476      1.000000
25%     257.938042     22.000000
50%     499.449874     50.500000
75%     730.710578     76.250000
max     998.847007     99.000000


Number of unique regions:  4

Are there any missing values?
Region          0
Sales           0
Transactions    0
dtype: int64
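One possible way to fill in the Task A1 stub is sketched below. It rebuilds the regional data with the same seed and generation code as the cell above so that it runs on its own; the variable names `summary`, `n_regions`, and `missing` are illustrative choices, not part of the original assignment.

```python
import numpy as np
import pandas as pd

# Rebuild the regional data as in the generation cell (same seed, same draw order).
np.random.seed(0)
df_regional = pd.DataFrame({
    "Region": np.random.choice(["North", "South", "East", "West"], size=100),
    "Sales": np.random.rand(100) * 1000,
    "Transactions": np.random.randint(1, 100, size=100),
})

# 1) Basic summary statistics for the two numeric columns
summary = df_regional[["Sales", "Transactions"]].describe()
print(summary)

# 2) Number of unique regions
n_regions = df_regional["Region"].nunique()
print("Number of unique regions: ", n_regions)

# 3) Missing values per column
missing = df_regional.isnull().sum()
print("Are there any missing values?")
print(missing)
```

`describe()` covers count, mean, standard deviation, and quartiles in one call, and `isnull().sum()` returns a per-column count, which is why the output lists each column separately.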


### 💹 **Task A2: Pivot Table & Group Analysis**
1. Create a pivot table showing the **average Sales** and **average Transactions** by `Region`.
2. Sort the pivot table by the highest average Sales.


In [3]:
# === SOLUTION for Task A2 ===

# Sort by highest average Sales


             Sales  Transactions
Region                          
North   529.015150     52.857143
East    529.009741     48.777778
South   471.744216     47.125000
West    456.467124     48.074074
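A minimal sketch of how a table like the one above could be produced with `pivot_table`. The data are regenerated with the same seed so the block is self-contained; the name `pivot` is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Rebuild the regional data as in the generation cell.
np.random.seed(0)
df_regional = pd.DataFrame({
    "Region": np.random.choice(["North", "South", "East", "West"], size=100),
    "Sales": np.random.rand(100) * 1000,
    "Transactions": np.random.randint(1, 100, size=100),
})

# Average Sales and Transactions per region
pivot = df_regional.pivot_table(
    index="Region",
    values=["Sales", "Transactions"],
    aggfunc="mean",
)

# Sort by highest average Sales
pivot = pivot.sort_values("Sales", ascending=False)
print(pivot)
```

An equivalent result could also be obtained with `df_regional.groupby("Region")[["Sales", "Transactions"]].mean()`; the pivot table is used here only because the task asks for one.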

### ⚗️ **Task A3: Correlation & Outlier Detection** ⚠️ Optional Challenge
1. Calculate the correlation between `Sales` and `Transactions`. Do they appear to be correlated?
2. Detect potential outliers in `Sales` using the **IQR** (Interquartile Range) method.


In [4]:
# === SOLUTION for Task A3 ===
# 1) Correlation


# 2) Outlier Detection using IQR


Correlation between Sales and Transactions: 0.02608647690406796
Interquartile Range (IQR):  472.77253628765936
Lower Bound:  -451.67072297512066
Upper Bound:  1440.3193439957407

Number of outliers in 'Sales':  0
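The sketch below shows one way to approach Task A3. It assumes the conventional 1.5 × IQR fences (Q1 − 1.5·IQR and Q3 + 1.5·IQR) and the default Pearson correlation; the original solution cell is not shown, so its exact quantile method may differ and the printed bounds need not match the output above digit for digit.

```python
import numpy as np
import pandas as pd

# Rebuild the regional data as in the generation cell.
np.random.seed(0)
df_regional = pd.DataFrame({
    "Region": np.random.choice(["North", "South", "East", "West"], size=100),
    "Sales": np.random.rand(100) * 1000,
    "Transactions": np.random.randint(1, 100, size=100),
})

# 1) Pearson correlation between Sales and Transactions
corr = df_regional["Sales"].corr(df_regional["Transactions"])
print("Correlation between Sales and Transactions:", corr)

# 2) Outlier detection using the IQR rule (assumed multiplier: 1.5)
q1 = df_regional["Sales"].quantile(0.25)
q3 = df_regional["Sales"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Rows whose Sales fall outside the fences count as potential outliers
outliers = df_regional[
    (df_regional["Sales"] < lower_bound) | (df_regional["Sales"] > upper_bound)
]
print("Interquartile Range (IQR): ", iqr)
print("Lower Bound: ", lower_bound)
print("Upper Bound: ", upper_bound)
print("Number of outliers in 'Sales': ", len(outliers))
```

Because `Sales` is drawn uniformly from 0 to 1000, the fences land well outside that range, so finding no outliers is the expected result rather than a sign of a bug.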


---
## 📈 **Part B: Advanced Time-Series Analysis**
Now let's work with the **time-series** dataset from your second assignment. We'll generate (or reload) the data below.

In [5]:
# === Part B: Data Generation (Time-Series) ===
dates = pd.date_range(start="2020-01-01", end="2020-12-31", freq="D")
data_timeseries = {
    "Date": dates,
    "Sales": (
        np.random.rand(len(dates)) * 200
        + np.sin(np.linspace(-3, 3, len(dates))) * 50
        + 100
    ),
}

df_timeseries = pd.DataFrame(data_timeseries)
df_timeseries.set_index("Date", inplace=True)

### 🔎 **Task B1: Quick Exploration**
1. Display the first 5 rows.
2. Show a statistical summary of the `Sales` column.

In [6]:
# === SOLUTION for Task B1 ===
# 1) Display first 5 rows

                Sales
Date                
2020-01-01  152.405767
2020-01-02  139.541474
2020-01-03  143.935398
2020-01-04  155.644670
2020-01-05  158.496538

In [7]:
# 2) Statistical summary of the 'Sales' column

count    366.000000
mean     179.211239
std       24.897467
min      115.418960
25%      160.485854
50%      179.913413
75%      197.478627
max      233.690058
Name: Sales, dtype: float64
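A short sketch covering both steps of Task B1. The time-series data are regenerated so the block runs on its own; the seed used here is an assumption made only for repeatability (the original cell continues from the random state left by Part A), so the numbers it prints will not match the output shown above.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: seeded only so this sketch is repeatable
dates = pd.date_range(start="2020-01-01", end="2020-12-31", freq="D")
df_timeseries = pd.DataFrame({
    "Date": dates,
    "Sales": (
        np.random.rand(len(dates)) * 200
        + np.sin(np.linspace(-3, 3, len(dates))) * 50
        + 100
    ),
}).set_index("Date")

# 1) Display first 5 rows
first_rows = df_timeseries.head()
print(first_rows)

# 2) Statistical summary of the 'Sales' column
sales_summary = df_timeseries["Sales"].describe()
print(sales_summary)
```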

### 📆 **Task B2: Monthly & Rolling Analysis**
1. Calculate monthly average `Sales`.
2. Compute a 7-day rolling average to smooth out short-term fluctuations.


In [8]:
# === SOLUTION for Task B2 ===
# 1) Monthly average Sales

Month
2020-01    174.699406
2020-02    175.596625
2020-03    183.505658
2020-04    176.384074
2020-05    177.133933
2020-06    183.014605
2020-07    176.209166
2020-08    178.908697
2020-09    184.011963
2020-10    181.076129
2020-11    183.828993
2020-12    180.918229
Freq: M, Name: Sales, dtype: float64
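The monthly averages can be sketched by grouping on a monthly period derived from the index. The `Month` label mirrors the output above, but the exact approach of the original solution is not shown, so treat this as one plausible version. The seed is an assumption for repeatability, so values will differ from those printed above.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: seeded only so this sketch is repeatable
dates = pd.date_range(start="2020-01-01", end="2020-12-31", freq="D")
df_timeseries = pd.DataFrame({
    "Date": dates,
    "Sales": (
        np.random.rand(len(dates)) * 200
        + np.sin(np.linspace(-3, 3, len(dates))) * 50
        + 100
    ),
}).set_index("Date")

# Convert each date to its calendar month (e.g. 2020-01) and average within it
month = df_timeseries.index.to_period("M")
monthly_avg = df_timeseries.groupby(month)["Sales"].mean()
monthly_avg.index.name = "Month"
print(monthly_avg)
```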

In [9]:
# 2) 7-day rolling average

Date
2020-01-01           NaN
2020-01-02           NaN
2020-01-03           NaN
2020-01-04           NaN
2020-01-05           NaN
                  ...    
2020-12-27    177.084961
2020-12-28    176.899915
2020-12-29    176.411027
2020-12-30    177.033434
2020-12-31    177.649389
Name: Sales, Length: 366, dtype: float64
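A minimal sketch of the 7-day rolling average using `Series.rolling`. With the default settings, the first six values are `NaN` because a full 7-day window is not yet available, which matches the leading `NaN` rows in the output above. The seed is an assumption for repeatability, so the numeric values will differ.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: seeded only so this sketch is repeatable
dates = pd.date_range(start="2020-01-01", end="2020-12-31", freq="D")
df_timeseries = pd.DataFrame({
    "Date": dates,
    "Sales": (
        np.random.rand(len(dates)) * 200
        + np.sin(np.linspace(-3, 3, len(dates))) * 50
        + 100
    ),
}).set_index("Date")

# Trailing 7-day window: each value averages the current day and the 6 before it
rolling_7 = df_timeseries["Sales"].rolling(window=7).mean()
print(rolling_7)
```

Passing `min_periods=1` would fill the leading values with partial-window averages instead of `NaN`, at the cost of those early values being noisier.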

### 🔬 **Task B3: Day-of-Week Seasonality Analysis (Using Pandas Only)**

1. **Extract the day of the week** from the index and store it in a new column (e.g., `DayOfWeek`).
2. **Group by** this `DayOfWeek` column to get the **average Sales** for each day of the week.
3. **Compare** these daily averages to see if certain days have higher or lower sales.


In [10]:
# === SOLUTION for Task B3 with Pandas Only ===
# 1) Extract day of the week: Monday=0, Sunday=6
# 2) Group by the day of the week to compute average sales
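The two stubbed steps could be sketched as follows, using only pandas. The column name `DayOfWeek` follows the task description; `dow_avg` is an illustrative name. The seed is an assumption for repeatability.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: seeded only so this sketch is repeatable
dates = pd.date_range(start="2020-01-01", end="2020-12-31", freq="D")
df_timeseries = pd.DataFrame({
    "Date": dates,
    "Sales": (
        np.random.rand(len(dates)) * 200
        + np.sin(np.linspace(-3, 3, len(dates))) * 50
        + 100
    ),
}).set_index("Date")

# 1) Extract day of the week from the DatetimeIndex: Monday=0, Sunday=6
df_timeseries["DayOfWeek"] = df_timeseries.index.dayofweek

# 2) Average Sales for each day of the week
dow_avg = df_timeseries.groupby("DayOfWeek")["Sales"].mean()
print(dow_avg)

# 3) Compare: spread between the best and worst weekday
print("Highest:", dow_avg.idxmax(), "Lowest:", dow_avg.idxmin())
print("Range of weekday averages:", dow_avg.max() - dow_avg.min())
```

Since the generated data contain no weekday-dependent term, any gap between weekdays here reflects random noise; on real sales data the same grouping is a quick first check for weekly seasonality.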


### 📝 **Observations & Insights**
1. **Regional Data**
   - The correlation between `Sales` and `Transactions` is about 0.026, so the two show essentially no linear relationship in this sample.
   - The pivot table shows North with the highest average `Sales` (about 529.02), narrowly ahead of East (about 529.01), and the IQR method flags no outliers in `Sales`.

2. **Time-Series Data**
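   - The monthly averages vary only slightly, from about 174.7 (January) to about 184.0 (September).
   - The 7-day rolling average smooths out daily noise.
   - Grouping by day of the week is a simple check for weekly seasonality. In the generating code, the `np.sin()` term is a slow wave spanning the whole year (just under one full cycle), not a weekly pattern, so any differences between weekday averages should be small and attributable to random noise.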

---
## 🏁 **Assignment Wrap-Up**

🎉 **Congratulations!** You’ve:
- Built pivot tables and looked for regional trends.
- Analyzed correlation and outliers.
- Explored monthly averages in time-series data.
- Investigated rolling averages and day-of-week seasonality.

These techniques will provide a solid foundation for more advanced analytical work, including forecasting, anomaly detection, and deeper business intelligence. Keep exploring!
