# Overview

### Description

Welcome to the **Regression** Challenge: **Predicting Market Costs**.
- This competition tasks you with building **regression models** that accurately predict the **costs of products within a diverse market landscape**.
- Using a comprehensive dataset whose factors range from market dynamics to consumer attributes, your goal is to **develop models that uncover hidden patterns in the data and provide precise cost predictions**.

#### Note :
Because the data comes from varied sources, participants need to merge the training dataframes carefully before building models that predict costs.

---

### Evaluation

#### Evaluation Metric : 
The evaluation metric for this competition is the **Root Mean Squared Error (RMSE)**.
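
As a rough illustration of the metric, RMSE is the square root of the mean squared difference between predicted and true values. The sketch below uses NumPy only. The function name `rmse` and the sample numbers are made up for illustration and are not part of the competition code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only (not competition data)
print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.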

#### Submission Format :
Teams should submit a **CSV file with exactly 19942 rows and 2 columns**. A submission with extra rows or columns will return an Invalid Score.
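
A quick shape check before uploading can catch the Invalid Score case early. The following is a minimal sketch: the helper `check_submission_shape` and the column names `id` and `Cost` in the demo frame are hypothetical, so take the real column names from `sample_submission.csv`.

```python
import pandas as pd

EXPECTED_ROWS = 19942  # from the submission rules
EXPECTED_COLS = 2      # from the submission rules

def check_submission_shape(submission: pd.DataFrame) -> None:
    """Raise ValueError if the frame does not have the required shape."""
    rows, cols = submission.shape
    if rows != EXPECTED_ROWS or cols != EXPECTED_COLS:
        raise ValueError(
            f"expected ({EXPECTED_ROWS}, {EXPECTED_COLS}), got ({rows}, {cols})"
        )

# Hypothetical column names, used only to build a frame of the right shape
demo = pd.DataFrame({"id": range(EXPECTED_ROWS), "Cost": 0.0})
check_submission_shape(demo)  # passes silently
```

When writing the file with `DataFrame.to_csv`, passing `index=False` keeps pandas from adding the row index as an extra column.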

---

#### Notes :
- The public leaderboard is evaluated on 75% of the test data.
- Do not shuffle the row order of the test set.

## Data

### Files and Folders
- `IEEE_Victoris2_Filtration_train_data` - folder containing the training set
- `test.csv` - the test set
- `sample_submission.csv` - a sample submission file in the correct format

**Please be aware that this dataset is composed of numerous files, which could result in minor variations in certain column names or even the internal structure of the data.**
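
One way to spot such naming differences before merging is to compare each file's columns against the union of all columns. This is only a sketch: the helper `column_differences` is hypothetical, and the two toy frames stand in for the real training batches.

```python
import pandas as pd

def column_differences(frames: dict) -> dict:
    """For each named frame, list the columns it lacks relative to the union of all columns."""
    all_cols = set()
    for df in frames.values():
        all_cols |= set(df.columns)
    return {name: sorted(all_cols - set(df.columns)) for name, df in frames.items()}

# Toy frames standing in for the real batches
a = pd.DataFrame(columns=["Store Kind", "Min. Yearly Income"])
b = pd.DataFrame(columns=["Store Kind", "Yearly Income"])
print(column_differences({"a": a, "b": b}))
```

Columns reported as missing in one frame but present under a similar name in another are candidates for renaming.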

---

### Features in test data
- `Person Description` - Description of the person visiting the market
- `Place Code` - Code for each place, made up of two city code parts separated by "_"
- `Customer Order` - Order of each customer in the market
- `Additional Features in market` - A list of features that are found in the market
- `Promotion Name` - Name of the promotion the market ran in the media
- `Store Kind` - The type of store
- `Store Cost` - Cost of the Store
- `Gross Weight` - Weight of the bought item
- `Net Weight` - Weight of the bought item without the package
- `Package Weight` - Weight of the package
- `Is Recyclable?` - Whether the item is recyclable or not
- `Yearly Income` - Minimum yearly income of the consumer
- `Store Area` - Area of the store
- `Grocery Area` - Area of grocery department in the store
- `Frozen Area` - Area of frozen food department in the store
- `Meat Area` - Area of meat department in the store
- `Cost` - The target variable
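
Since `Place Code` is described as two parts joined by "_", it can be split into separate columns. The sketch below assumes every value contains exactly one separator. The output column names `Place Code 1` and `Place Code 2` are hypothetical, and the toy values only follow the documented two-part form.

```python
import pandas as pd

# Toy frame with codes in the documented two-part form
df = pd.DataFrame({"Place Code": ["H11go_ZA", "S04ne_WA"]})

# Split once on "_" and expand the pieces into two new columns
df[["Place Code 1", "Place Code 2"]] = df["Place Code"].str.split("_", n=1, expand=True)
print(df)
```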

# Import libraries

In [525]:
import numpy as np
import pandas as pd

# Load Dataset

In [526]:
train_1 = pd.read_csv("./data/IEEE_Victoris2_Filtration_train_data/Train_Batch_1.csv")
train_2 = pd.read_csv("./data/IEEE_Victoris2_Filtration_train_data/Train_Batch_2.csv")
train_3 = pd.read_csv("./data/IEEE_Victoris2_Filtration_train_data/Train_Batch_3.csv")
test = pd.read_csv("./data/test.csv")

## Train-1

In [527]:
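# Parse the combined weights string into separate lists.
# Package_Weight is declared but not filled here; only gross and net weights are extracted.
Gross_Weight = []
Net_Weight = []
Package_Weight = []
for row in train_1["Product Weights Data in (KG)"]:
    # Entries are comma-separated "name:value" pairs; keep the value part
    Gross_Weight.append(row.split(",")[0].split(":")[1])
    Net_Weight.append(row.split(",")[1].split(":")[1])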

In [528]:
product_weight = pd.DataFrame({"Gross Weight": Gross_Weight, "Net Weight": Net_Weight})

In [529]:
# Attach the parsed weight columns, then drop the original combined column
train_1 = pd.concat([train_1, product_weight], axis=1)
train_1.drop("Product Weights Data in (KG)", axis=1, inplace=True)
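
The loop above can also be written with pandas string methods, which additionally makes it easy to convert the extracted text to floats. This is only a sketch: the raw string layout in `raw` is an assumption (the notebook does not show the original column contents), and the sample values are illustrative.

```python
import pandas as pd

# Hypothetical raw values; the real column layout may differ
raw = pd.Series([
    "Gross Weight: 28.1997, Net Weight: 26.6008, Package Weight: 1.5989",
    "Gross Weight: 16.571, Net Weight: 14.972, Package Weight: 1.599",
])

# Same split logic as the loop, applied to the whole column at once
parts = raw.str.split(",", expand=True)
weights = pd.DataFrame({
    "Gross Weight": parts[0].str.split(":").str[1].str.strip(),
    "Net Weight": parts[1].str.split(":").str[1].str.strip(),
})

# Convert the extracted strings to floats; unparseable entries become NaN
weights = weights.apply(pd.to_numeric, errors="coerce")
print(weights.dtypes)
```

Converting to a numeric dtype matters because values taken from `str.split` are strings, which would otherwise mix with the float weight columns of the other batches.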

In [530]:
# Align the income column name with the test set
train_1.rename(columns={"Min. Yearly Income": "Yearly Income"}, inplace=True)

In [531]:
train_1.head(2)

| Unnamed: 0.1 | Unnamed: 0 | Person Description | Place Code | Customer Order | Additional Features in market | Promotion Name | Store Kind | Store Sales | Store Cost | Is Recyclable? | Yearly Income | Store Area | Grocery Area | Frozen Area | Meat Area | Cost | Gross Weight | Net Weight |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | mc_ID_0 | Single Female with four children, education: b... | H11go_ZA | Cleaning Supplies from Household department, O... | ['Video Store', 'Florist', 'Ready Food', 'Coff... | Dimes Off | Deluxe | 8.76 Millions | 4.2924 Millions | recyclable | 10K+ | 2842.23 | 2037.64 | 481.98 | 323.0 | 602.7575 | 28.1997 | 26.6008 |
| 1 | mc_ID_1 | Single Female with three children, education: ... | S04ne_WA | Snack Foods from Snack Foods department, Order... |  | Budget Bargains | Supermarket | 6.36 Millions | 1.9716 Millions | non recyclable | 50K+ | 2814.95 | 2049.72 | 457.36 |  | 708.665 | 16.571 | 14.972 |


## Train-2

In [532]:
# Align the income column name with the test set
train_2.rename(columns={"Min. Person Yearly Income": "Yearly Income"}, inplace=True)

## Train-3

In [533]:
# Column names in this batch contain URL-encoded spaces ("%20"); restore plain spaces
new_cols = {col: col.replace("%20", " ") for col in train_3.columns}
train_3.rename(columns=new_cols, inplace=True)
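
If other percent-escapes might appear in the headers, the standard library's `urllib.parse.unquote` decodes all of them, not just `%20`. The sketch below is an alternative under that assumption and not the notebook's method. The toy column names only imitate the encoded style.

```python
from urllib.parse import unquote

import pandas as pd

# Toy frame with percent-encoded headers (illustrative names)
df = pd.DataFrame(columns=["Store%20Kind", "Gross%20Weight", "Cost"])

# rename accepts a callable that is applied to every column name
df = df.rename(columns=unquote)
print(list(df.columns))
```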

In [534]:
# Drop the raw weights column; this batch already has separate Gross Weight and Net Weight columns
train_3.drop("Weights Data", axis=1, inplace=True)

In [535]:
train_3.head(2)

| Unnamed: 0.1 | Unnamed: 0 | Person Description | Place Code | Customer Order | Additional Features in market | Promotion Name | Store Kind | Store Sales | Store Cost | Gross Weight | Net Weight | Is Recyclable? | Yearly Income | Store Area | Grocery Area | Frozen Area | Meat Area | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | mc_ID_0 | Single Female with two children, education: pa... | T02ma_WA | Meat from Deli department, Ordered Brand : Red... | ['Coffee Bar', 'Florist', 'Ready Food', 'Bar F... | Sale : Double Down | Deluxe | 7.12 Millions | 2.5632 Millions | 23.2575 | 20.3503 | yes | 90K+ | 3145.51 | 2056.79 | 654.13 | 436.09 | 500.7202 |
| 1 | mc_ID_1 | Single Female with five children, education: p... | M10da_YU | Specialty from Produce department, Ordered Bra... | ['Coffee Bar', 'Florist', 'Bar For Salad', 'Vi... | GLD | Deluxe | 14.72 Millions | 7.0656 Millions | 16.7163 | 12.3555 | yes | 30K+ | 2856.68 | 1871.16 | 595.93 | 395.51 | 484.1411 |


In [536]:
# Stack the three cleaned batches row-wise; pandas aligns them by column name
train = pd.concat([train_1, train_2, train_3])
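
Two `pd.concat` behaviors are worth checking after stacking the batches: row labels repeat unless `ignore_index=True` is passed, and a column missing from one batch is filled with NaN for that batch's rows. The sketch below shows both on toy frames. The frame names and values are illustrative only.

```python
import pandas as pd

# Toy batches: the second one lacks the "Net Weight" column
batch_a = pd.DataFrame({"Cost": [1.0, 2.0], "Net Weight": [0.5, 0.7]})
batch_b = pd.DataFrame({"Cost": [3.0]})

# Default concat keeps each frame's own index, so labels repeat
stacked = pd.concat([batch_a, batch_b])

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([batch_a, batch_b], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
# Count rows where the column was absent in the source batch
print(int(renumbered["Net Weight"].isna().sum()))
```

Counting NaN values per column after the merge is a cheap way to confirm that the renaming steps left no mismatched column names.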