# 📘 Phase 1: Data Loading and Preparation

> This phase focuses on preparing the dataset for analysis by cleaning and enriching it with time-based features.



1. 🔧 **Clean column names**  
   Standardize column headers for clarity and consistency.

2. 🗓️ **Convert "Month" to datetime format**  
   Ensure the "Month" column is in proper datetime format for time-based operations.

3. 🔍 **Check for missing or duplicate values**  
   Detect and handle any nulls or repeated records that could affect analysis.

4. 🏷️ **Add time-based features**  
   Derive Year, Month number, Month name, Quarter, and Season from the "Month" column for deeper insights.
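
As a rough illustration of steps 1 and 2, the sketch below normalizes headers and parses month labels on a tiny in-memory frame. The sample rows, the `"<year index>-<month>"` label format, the base year 2001, and the helper names `clean_columns` and `parse_month` are all assumptions made for illustration; they are not taken from the dataset itself.

```python
import pandas as pd


def clean_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with trimmed, lowercase, underscore-separated headers."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)
    )
    return out


def parse_month(labels: pd.Series, base_year: int = 2001) -> pd.Series:
    """Parse labels such as '1-01' (year index, month) into month-start timestamps.

    The label format and base_year are assumptions for this sketch.
    """
    parts = labels.str.split("-", expand=True).astype(int)
    years = base_year + parts[0] - 1
    return pd.to_datetime(pd.DataFrame({"year": years, "month": parts[1], "day": 1}))


# Illustrative rows only, not real data
raw = pd.DataFrame({" Month ": ["1-01", "1-02", "2-01"], "Sales Total": [10.0, 20.0, 30.0]})
tidy = clean_columns(raw)
tidy["month"] = parse_month(tidy["month"])
print(tidy)
```

Parsing the labels, rather than generating a fresh date range, keeps each date tied to its own row, so a missing or out-of-order month would show up instead of being silently relabeled.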



# Phase 1: Data Preparation
import pandas as pd

# Load dataset
df = pd.read_csv("C:/Users/kkang/Downloads/sales-of-shampoo-over-a-three-ye.csv")

# Rename columns
df.columns = ['Month', 'Sales']

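# Replace 'Month' with consecutive month-start dates beginning 2001-01-01.
# Note: this overwrites the raw labels instead of parsing them, so it assumes
# the rows are already in chronological order with no missing months.
df['Month'] = pd.date_range(start='2001-01-01', periods=len(df), freq='MS')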

# Check missing and duplicates
print("Missing values:\n", df.isnull().sum())
print("\nDuplicate rows:\n", df.duplicated().sum())

# Add time features
df['Year'] = df['Month'].dt.year
df['Month_Num'] = df['Month'].dt.month
df['Month_Name'] = df['Month'].dt.strftime('%b')
df['Quarter'] = df['Month'].dt.quarter
# Season code: 1 = Dec-Feb, 2 = Mar-May, 3 = Jun-Aug, 4 = Sep-Nov
df['Season'] = df['Month'].dt.month % 12 // 3 + 1

# Show data
df.head()


Missing values:
 Month    0
Sales    0
dtype: int64

Duplicate rows:
 0
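
Both checks come back clean for this dataset, so no handling is needed here. If they had not, one possible approach is sketched below on a hypothetical sample; the sample rows are made up, and linear interpolation is just one option among several (dropping the affected rows is another).

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one duplicated row and one missing value
sample = pd.DataFrame({
    "Month": pd.to_datetime(
        ["2001-01-01", "2001-02-01", "2001-02-01", "2001-03-01", "2001-04-01"]
    ),
    "Sales": [266.0, 145.9, 145.9, np.nan, 119.3],
})

# Drop exact duplicate rows, keeping the first occurrence
deduped = sample.drop_duplicates().reset_index(drop=True)

# Fill the gap by linear interpolation between the neighbouring months
deduped["Sales"] = deduped["Sales"].interpolate(method="linear")

print(deduped)
```

Interpolation keeps the monthly index complete, which matters for time-series methods that expect evenly spaced observations.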


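|   | Month | Sales | Year | Month_Num | Month_Name | Quarter | Season |
|---|---|---|---|---|---|---|---|
| 0 | 2001-01-01 | 266.0 | 2001 | 1 | Jan | 1 | 1 |
| 1 | 2001-02-01 | 145.9 | 2001 | 2 | Feb | 1 | 1 |
| 2 | 2001-03-01 | 183.1 | 2001 | 3 | Mar | 1 | 2 |
| 3 | 2001-04-01 | 119.3 | 2001 | 4 | Apr | 2 | 2 |
| 4 | 2001-05-01 | 180.3 | 2001 | 5 | May | 2 | 2 |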
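
The numeric `Season` codes are compact but not self-explanatory. The sketch below tabulates the same formula for all twelve months and attaches readable labels. The `SEASON_NAMES` mapping is an assumption (Northern Hemisphere meteorological seasons) and is not part of the original notebook.

```python
import pandas as pd

# Assumed labels for the codes produced by month % 12 // 3 + 1
# (Northern Hemisphere meteorological seasons)
SEASON_NAMES = {1: "Winter", 2: "Spring", 3: "Summer", 4: "Autumn"}

months = pd.Series(range(1, 13), name="Month_Num")

# December maps to 0 after the modulo, so Dec-Feb share code 1
season_code = months % 12 // 3 + 1
season_name = season_code.map(SEASON_NAMES)

lookup = pd.DataFrame(
    {"Month_Num": months, "Season": season_code, "Season_Name": season_name}
)
print(lookup)
```

Applied to the notebook's frame, the equivalent would be `df['Season'].map(SEASON_NAMES)`, which is handy for plot legends and group-by summaries.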
