#  Practical Data Preprocessing Task – Sales Dataset

##  Task Title:
**"Will This Sale Be Successful?"**

##  Objective:
You are provided with a real-world retail dataset. Your task is to **prepare the data for a machine learning classification model** that predicts whether a customer will purchase **more than 3 items in a single transaction**.

You will build a complete preprocessing pipeline, from data exploration to handling class imbalance, with visualizations and a clear justification for each step.

---

##  Dataset Description:

Dataset link: [Customer Shopping Dataset – Retail Sales Data](https://www.kaggle.com/datasets/mehmettahiraslan/customer-shopping-dataset)

The dataset contains sales transactions from a retail environment with the following columns:

| Column Name       | Description                                                              |
|------------------|---------------------------------------------------------------------------|
| `invoice_no`      | Unique identifier for the invoice                                         |
| `customer_id`     | Unique identifier for each customer                                       |
| `gender`          | Gender of the customer (`Male`, `Female`)                                |
| `age`             | Age of the customer                                                      |
| `category`        | Category of the purchased product                                        |
| `quantity`        | Number of items purchased in the transaction                            |
| `price`           | Unit price of the product in Turkish Lira                                |
| `payment_method`  | Payment method used (e.g., `Cash`, `Credit Card`, `Debit Card`)         |
| `invoice_date`    | Date of the transaction                                                  |
| `shopping_mall`   | Name of the mall where the transaction occurred                         |

---

##  Task Instructions:

You must complete the following steps **in order**, and write clear explanations (in Markdown cells) alongside your code in a Jupyter Notebook.

### 1. Load & Explore the Data
- Load the dataset.
- Display the first few rows and general information (column types, non-null counts).
- Compute summary statistics for the numerical and categorical features.
- Visualizations:
  - Histograms of `age`, `price`, and `quantity`.
  - Count plots for `gender`, `category`, and `payment_method`.
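
The exploration steps above could look roughly like the sketch below. The inline CSV rows are hypothetical sample values standing in for the Kaggle file, and the helper name `plot_overview` is invented for illustration; in your notebook, pass the path of the downloaded CSV to `pd.read_csv` instead. Bar charts of `value_counts()` are used here as a plain-pandas equivalent of a count plot.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample rows; replace this object with the real CSV path.
SAMPLE_CSV = io.StringIO(
    "invoice_no,customer_id,gender,age,category,quantity,price,"
    "payment_method,invoice_date,shopping_mall\n"
    "I0001,C0001,Female,28,Clothing,5,1500.40,Credit Card,05/08/2022,Mall A\n"
    "I0002,C0002,Male,21,Shoes,3,1800.51,Debit Card,12/12/2021,Mall B\n"
    "I0003,C0003,Male,20,Clothing,1,300.08,Cash,09/11/2021,Mall A\n"
    "I0004,C0004,Female,66,Shoes,5,3000.85,Credit Card,16/05/2021,Mall C\n"
    "I0005,C0005,Female,53,Books,4,60.60,Cash,24/10/2021,Mall B\n"
    "I0006,C0006,Male,45,Books,2,30.30,Debit Card,03/01/2023,Mall C\n"
)

df = pd.read_csv(SAMPLE_CSV)

print(df.head())                        # first few rows
df.info()                               # dtypes and non-null counts
print(df.describe())                    # numerical summary
print(df.describe(exclude="number"))    # categorical summary


def plot_overview(frame):
    """Draw histograms (top row) and category counts (bottom row)."""
    fig, axes = plt.subplots(2, 3, figsize=(15, 8))
    for ax, col in zip(axes[0], ["age", "price", "quantity"]):
        frame[col].plot(kind="hist", bins=20, ax=ax, title=f"Distribution of {col}")
        ax.set_xlabel(col)
    for ax, col in zip(axes[1], ["gender", "category", "payment_method"]):
        frame[col].value_counts().plot(kind="bar", ax=ax, title=f"Count by {col}")
        ax.set_ylabel("count")
    fig.tight_layout()
    return fig


fig = plot_overview(df)
```

In a notebook the figure renders when the cell finishes; titles and axis labels are set explicitly so each plot is readable on its own.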

---

### 2. Clean the Data
- Remove duplicate records, if any.
- Identify and remove or fix unrealistic values (e.g., negative or zero prices, implausible ages).
- Provide a rationale for cleaning decisions.
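
A minimal sketch of these cleaning rules is shown below. The function name `clean_sales`, the sample rows, and the age bounds (18 to 100) are assumptions chosen for illustration; pick thresholds you can justify from your own exploration.

```python
import pandas as pd

# Hypothetical rows containing one duplicate and three unrealistic records.
df = pd.DataFrame({
    "invoice_no": ["I1", "I2", "I2", "I3", "I4", "I5"],
    "age": [25, 40, 40, 130, 33, 51],
    "price": [300.0, 1200.0, 1200.0, 60.0, -15.0, 0.0],
    "quantity": [2, 4, 4, 1, 3, 5],
})


def clean_sales(frame, min_age=18, max_age=100):
    """Drop exact duplicates, then rows with non-positive price or implausible age."""
    out = frame.drop_duplicates()
    valid = out["price"].gt(0) & out["age"].between(min_age, max_age)
    return out[valid].reset_index(drop=True)


cleaned = clean_sales(df)
print(f"Removed {len(df) - len(cleaned)} of {len(df)} rows")
```

Reporting how many rows each rule removes makes it easier to write the rationale for your cleaning decisions.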

---

### 3. Handle Missing Data
- Check for missing values.
- Apply at least two different strategies to handle missing data:
  - Drop
  - Fill (e.g., with median, mode, etc.)
- Justify why each method was chosen.
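
As one possible illustration of combining the two strategies, the sketch below drops rows where `price` is missing and fills `age` with the median and `payment_method` with the mode. The sample rows are hypothetical, and which column gets which strategy is an assumption, not a requirement of the task.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in three columns.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, 58.0, 40.0],
    "payment_method": ["Cash", "Credit Card", None, "Cash", "Debit Card", "Cash"],
    "price": [300.0, 1200.0, 60.0, np.nan, 600.0, 900.0],
    "quantity": [2, 4, 1, 3, 5, 2],
})

print(df.isna().sum())  # missing values per column

# Strategy 1 (drop): discard rows without a price.
filled = df.dropna(subset=["price"]).copy()

# Strategy 2 (fill): median for numeric columns, mode for categorical ones.
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["payment_method"] = filled["payment_method"].fillna(
    filled["payment_method"].mode()[0]
)

print(filled.isna().sum())
```

Dropping suits columns where a guessed value would distort derived features; filling keeps the row when the missing field is less critical.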

---

### 4. Feature Engineering
- Create a new column: `total_spent = quantity * price`
- Create a new binary target column:
  - 1 if `quantity > 3`, otherwise 0
  - Name the new column `target`
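
Both new columns can be derived with vectorized pandas operations, as in this short sketch on hypothetical sample values:

```python
import pandas as pd

# Hypothetical sample values.
df = pd.DataFrame({
    "quantity": [1, 3, 4, 5],
    "price": [300.0, 60.0, 1200.0, 15.0],
})

df["total_spent"] = df["quantity"] * df["price"]

# Strictly more than 3 items counts as the positive class.
df["target"] = (df["quantity"] > 3).astype(int)

print(df)
```

Because `target` is computed from `quantity`, and `total_spent / price` recovers `quantity`, keep this relationship in mind when you later decide which columns to use as model inputs.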

---

### 5. Encode Categorical Variables
- Encode all applicable categorical features using suitable encoding techniques.
- Explain your choice (e.g., label encoding vs. one-hot encoding).
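
One way to approach this, sketched below on hypothetical sample values, is to map the two-valued `gender` column to 0/1 and one-hot encode the nominal columns with `pd.get_dummies`. The choice of which columns receive which encoding is an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample values.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "category": ["Clothing", "Shoes", "Books"],
    "payment_method": ["Cash", "Credit Card", "Debit Card"],
})

encoded = df.copy()

# Binary column: a single 0/1 indicator is enough.
encoded["gender"] = encoded["gender"].map({"Female": 0, "Male": 1})

# Nominal columns with no natural order: one indicator column per level.
encoded = pd.get_dummies(
    encoded, columns=["category", "payment_method"], dtype=int
)

print(encoded.columns.tolist())
```

One-hot encoding avoids implying an order among categories, at the cost of extra columns; label encoding is more compact but can mislead models that treat the codes as magnitudes.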

---

### 6. Detect & Handle Outliers
- Use visualizations (e.g., boxplots) to detect outliers in numerical columns like `age`, `price`, and `total_spent`.
- Apply IQR or another statistical method to handle outliers.
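
The IQR rule can be sketched as follows. The helper name `iqr_bounds`, the multiplier `k=1.5`, and the sample values are assumptions; capping with `clip` is shown as one option, and removing the flagged rows is an equally valid choice if you can justify it.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical values with one extreme transaction.
df = pd.DataFrame({"total_spent": [100.0, 120.0, 130.0, 150.0, 160.0, 5000.0]})

low, high = iqr_bounds(df["total_spent"])
outliers = df[(df["total_spent"] < low) | (df["total_spent"] > high)]
print(f"Fences: [{low}, {high}], flagged rows: {len(outliers)}")

# Cap values at the fences instead of dropping the rows.
df["total_spent_capped"] = df["total_spent"].clip(lower=low, upper=high)
```

A boxplot, for example `df.boxplot(column="total_spent")`, draws its whiskers with the same 1.5 × IQR convention by default, so the points it marks should match the rows flagged here.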

---

### 7. Feature Scaling
- Apply feature scaling to numerical columns like `age`, `price`, and `total_spent`.
- Use either `StandardScaler` or `MinMaxScaler`.
- Show distributions before and after scaling.
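
A minimal sketch with `StandardScaler` on hypothetical sample values is shown below. It fits the scaler on the whole frame to keep the example short; when a train/test split is involved, a common practice is to fit on the training portion only and reuse the fitted scaler on the test portion.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

num_cols = ["age", "price", "total_spent"]

# Hypothetical sample values.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, 50.0],
    "price": [100.0, 200.0, 300.0, 400.0],
    "total_spent": [100.0, 400.0, 900.0, 1600.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df[num_cols]), columns=num_cols, index=df.index
)

# Compare location and spread before and after scaling.
print(df[num_cols].agg(["mean", "std"]))
print(scaled.agg(["mean", "std"]))
```

For the required before/after plots, histograms of `df[num_cols]` and `scaled` side by side show that the shape of each distribution is unchanged while the axis range shifts.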

---

### 8. Train-Test Split
- Split the data into training and testing sets (80/20).
- Pass `stratify=y` to `train_test_split` so that both sets preserve the class distribution of `target`.
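
The split could be written as in the sketch below. The sample frame is hypothetical, and `random_state=42` is an arbitrary seed added only to make the result reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: 20 rows, 25% positive class.
df = pd.DataFrame({
    "age": range(20, 40),
    "price": [100.0 * i for i in range(1, 21)],
    "target": [1] * 5 + [0] * 15,
})

X = df.drop(columns="target")
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both splits should show the same share of positives.
print(y_train.mean(), y_test.mean())
```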

---

### 9. Deal with Imbalanced Classes
- Check class distribution of your `target` variable.
- If the classes are imbalanced, apply a resampling technique to the training set only:
  - Random undersampling or SMOTE
- Show class distribution before and after resampling.
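
Random undersampling can be sketched with plain pandas, as below, on a hypothetical training frame. The `imbalanced-learn` package offers `RandomUnderSampler` and `SMOTE` classes for the same purpose if you prefer a dedicated library; `random_state=42` is an arbitrary seed.

```python
import pandas as pd

# Hypothetical training frame: 5 positives, 15 negatives.
train = pd.DataFrame({
    "age": range(20, 40),
    "target": [1] * 5 + [0] * 15,
})

print("Before:", train["target"].value_counts().to_dict())

# Sample every class down to the size of the smallest class.
minority_size = int(train["target"].value_counts().min())
balanced = train.groupby("target").sample(n=minority_size, random_state=42)

print("After:", balanced["target"].value_counts().to_dict())
```

Resampling is applied to the training frame so that the test set keeps the original class distribution and still reflects real-world performance.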

---
    
## Deliverables:
At the end of the session, you should submit:
- A complete Jupyter Notebook with:
  - Well-structured code
  - Clear explanations using Markdown
  - Visualizations embedded
- A short reflection answering:
  - What challenges did you face?
  - What did you learn from this task?
  - What would you do differently if the dataset were larger or messier?

---

## Hints:
- Clean code is important! Keep your notebook readable.
- Justify your preprocessing decisions.
- Label your visualizations properly.

---

> **Good luck! This is your chance to practice real-world data preparation and analysis.**
