# 1_Preprocess

In this notebook I carry out a preliminary analysis of the dataset: variable types, missing values, basic descriptive statistics, and possible inconsistencies.

# Step 1 | Import libraries
Import the necessary libraries that we are going to use further on.

In [1]:
import pandas as pd
import function_utils as futils
pd.set_option('display.float_format', '{:.2f}'.format) # disable scientific notation
pd.set_option('display.max_rows', None) # display all rows
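
The two `pd.set_option` calls above change the display settings for the whole session. As a minimal sketch of an alternative, assuming the settings are only needed for a single output, `pd.option_context` applies them temporarily and restores the previous values afterwards. The `demo` frame holds made-up values, not the case-study data.

```python
import pandas as pd

# Small illustrative frame (made-up values, not the case-study data)
demo = pd.DataFrame({"price": [14784.314, 9408.8, 1234567.891]})

# Remember the current setting so we can confirm it is restored afterwards
rows_before = pd.get_option("display.max_rows")

# Options set inside the context manager are restored on exit
with pd.option_context("display.float_format", "{:.2f}".format,
                       "display.max_rows", None):
    formatted = demo.to_string()

print(formatted)
print(pd.get_option("display.max_rows") == rows_before)
```

Global options are convenient in a short notebook like this one; the scoped form mainly helps when one very long output should not affect every other cell.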

# Step 2 | Import Dataset
Import the dataset

In [2]:
# Read the Excel file into a DataFrame
data = pd.read_excel("BCA_EU_CaseStudy_MarketIntelligence.xlsx")

# Display the first few rows of the DataFrame
data.head()

Unnamed: 0,vehicle_ID,selling_country,buyer_country,make,model,mileage,body_type,fuel_type,colour,registration_date,selling_date,selling_price,condition_grade,conversion_rate,b2b_price,b2c_price,stock_turnover
0,V000001,IT,IT,Toyota,4Runner,138366,SUV,Electric,Blue,2023-12-16,2024-12-15,14784.31,,100.0,12898.78,15797.47,26
1,V000002,PT,PT,Chevrolet,Equinox,111565,SUV,Diesel,Silver,2020-05-11,2024-05-10,9408.8,4.0,83.32,8379.68,10743.26,53
2,V000003,ES,ES,Chevrolet,Equinox,139601,SUV,Electric,Black,2023-03-21,2024-03-20,14209.81,5.0,95.13,12666.16,16305.35,46
3,V000004,FR,FR,Porsche,Boxster,144300,Convertible,CNG,White,2015-10-02,2024-09-29,15423.71,4.0,60.15,17111.19,18029.74,98
4,V000005,DE,DE,Subaru,Impreza,144016,Sedan,CNG,Grey,2014-03-15,2024-03-12,4668.59,4.0,75.64,4519.55,5539.8,50


# 1. Brief description

The dataset contains 10,003 vehicles and 17 features (also called attributes or columns).

Column descriptions:<br>
*vehicle_ID*: ID of each vehicle<br>
*selling_country*: The country that hosted the auction<br>
*buyer_country*: The country where the buyer is located<br>
*make*: Brand of the vehicle's manufacturer<br>
*model*: Specific model within the make of the vehicle<br>
*mileage*: Total distance the vehicle has traveled, in kilometers<br>
*body_type*: Category associated with the shape and structure of the vehicle (e.g., sedan, SUV, hatchback)<br>
*fuel_type*: The type of fuel the vehicle uses to operate (e.g., petrol, diesel)<br>
*colour*: The exterior color of the vehicle<br>
*registration_date*: Date when the vehicle was first registered<br>
*selling_date*: Date when the vehicle was sold<br>
*selling_price*: The price at which the vehicle was sold<br>
*condition_grade*: A score reflecting the vehicle's overall condition, based on physical and mechanical assessments<br>
*conversion_rate*: Number of vehicles sold divided by the number of vehicles that entered a sale since January 2024, expressed as a percentage<br>
*b2b_price*: B2B benchmark value given by an external provider<br>
*b2c_price*: B2C benchmark value given by an external provider<br>
*stock_turnover*: Average number of days it takes for a vehicle with those characteristics (e.g., model, fuel type, age) to be sold<br>

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10003 entries, 0 to 10002
Data columns (total 17 columns):
vehicle_ID           10003 non-null object
selling_country      10003 non-null object
buyer_country        10003 non-null object
make                 10003 non-null object
model                10003 non-null object
mileage              10003 non-null int64
body_type            10003 non-null object
fuel_type            10003 non-null object
colour               10003 non-null object
registration_date    10003 non-null datetime64[ns]
selling_date         10003 non-null datetime64[ns]
selling_price        10003 non-null float64
condition_grade      9283 non-null float64
conversion_rate      10003 non-null float64
b2b_price            10003 non-null float64
b2c_price            10003 non-null float64
stock_turnover       10003 non-null int64
dtypes: datetime64[ns](2), float64(5), int64(2), object(8)
memory usage: 1.3+ MB


In [4]:
print("Unique values by feature:\n", data.nunique().sort_values())

Unique values by feature:
 condition_grade          5
fuel_type                5
colour                   6
selling_country          7
buyer_country            7
body_type                7
make                    15
model                   68
stock_turnover         128
selling_date           367
conversion_rate       2836
registration_date     3392
mileage               9763
b2b_price             9980
selling_price         9982
b2c_price             9986
vehicle_ID           10003
dtype: int64


# 2.  Preliminary adjustments

## 2.1. Missing values

In [5]:
rm_df_mis = futils.missing_reporter(data, True)
display(rm_df_mis)

Unnamed: 0,Nmissings,Pmissings
condition_grade,720,0.07


The only variable with missing values is *condition_grade* (720 rows, about 7%). I replace them with the label "missing" so that they form a category of their own.

In [6]:
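# Replace missing values with an explicit "missing" category.
# Assign the result back instead of calling fillna(inplace=True) on a column selection.
data["condition_grade"] = data["condition_grade"].astype("object").fillna("missing")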

display("After imputing:", futils.missing_reporter(data))

'After imputing:'

Unnamed: 0,Nmissings,Pmissings
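
The helper `futils.missing_reporter` is not shown in this notebook. The sketch below is a hypothetical stand-in (the name `missing_report` and its behaviour are assumptions based only on the `Nmissings` and `Pmissings` columns in the output above). It illustrates the before and after imputation check on made-up data.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for futils.missing_reporter: count and share
    of missing values, listed only for columns that have at least one."""
    n_missing = df.isna().sum()
    report = pd.DataFrame({
        "Nmissings": n_missing,
        "Pmissings": n_missing / len(df),
    })
    return report[report["Nmissings"] > 0].sort_values("Nmissings", ascending=False)

# Made-up example data
demo = pd.DataFrame({
    "condition_grade": [4.0, np.nan, 5.0, np.nan],
    "mileage": [1000, 2000, 3000, 4000],
})
before = missing_report(demo)

# Cast to object first so that mixing numbers and a string label is explicit
demo["condition_grade"] = demo["condition_grade"].astype("object").fillna("missing")
after = missing_report(demo)

print(before)
print(after)
```

An empty report after imputation matches the empty table shown above.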


# 3. Explore the data

## 3.1 Categorical Data

In [7]:
# Selects only object (categorical) features 
categorical_features = data.select_dtypes(include=['object']).columns.to_list()

# Remove vehicle_ID: it is an identifier, not a feature worth exploring
categorical_features.remove("vehicle_ID")

In [8]:
# for every column in the categorical features print the name and value counts of each category
for col in categorical_features:
    print('\n Feature', col)
    print('\n', data[col].value_counts())


 Feature selling_country

 NL    1496
IT    1470
FR    1459
DE    1421
PT    1393
ES    1386
PL    1378
Name: selling_country, dtype: int64

 Feature buyer_country

 IT    1469
NL    1469
FR    1446
ES    1433
DE    1424
PL    1402
PT    1360
Name: buyer_country, dtype: int64

 Feature make

 Honda         699
Kia           696
Nissan        691
Lexus         685
Hyundai       685
Audi          674
Chevrolet     665
Mercedes      664
Toyota        662
Subaru        660
Ford          655
Porsche       655
Volkswagen    649
Mazda         646
BMW           617
Name: make, dtype: int64

 Feature model

 Accord        185
NX            183
Altima        179
CX-30         179
Outback       178
Maxima        177
Focus         176
Malibu        174
Sonata        174
ES            174
IS            173
Impreza       173
CR-V          173
Equinox       173
Tucson        173
Sentra        172
Pilot         171
Boxster       170
Civic         170
Elantra       170
Explorer      170
Golf          

The categorical data shows no apparent problems. There could have been typos, inconsistent capitalization (e.g., "BMW" vs "Bmw"), or misclassified and undefined categories, but none of these occur.
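
Reading value counts by eye works for a handful of categories but is easy to get wrong for a column like *model* with 68 values. As a minimal sketch, assuming that spellings differing only by case or surrounding whitespace should count as the same label, the hypothetical helper below groups such variants automatically. The example data is made up.

```python
import pandas as pd

def find_label_variants(series: pd.Series) -> dict:
    """Group raw labels that collapse to the same value after trimming
    whitespace and case-folding; return only groups with several spellings."""
    raw = series.dropna().astype(str)
    normalized = raw.str.strip().str.casefold()
    groups = raw.groupby(normalized).unique()
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}

# Made-up labels with two deliberate inconsistencies
demo = pd.Series(["BMW", "Bmw", "Audi", "audi ", "Kia"])
variants = find_label_variants(demo)
print(variants)
```

An empty dictionary for every categorical column would support the conclusion above.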

## 3.2 Numerical Data

In [9]:
# Selects only numerical features 
numerical_features = data.select_dtypes(include=['int','float']).columns

In [10]:
# Get summary statistics for numerical variables
data[numerical_features].describe() 

Unnamed: 0,selling_price,conversion_rate,b2b_price,b2c_price
count,10003.0,10003.0,10003.0,10003.0
mean,13388.15,85.84,12635.46,15131.86
std,9458.86,8.18,9219.51,10793.56
min,1111.08,55.05,1089.82,1000.0
25%,6786.75,80.81,6359.7,7675.18
50%,10841.43,86.23,10152.46,12195.18
75%,17157.82,91.69,16039.89,19324.09
max,87741.24,100.0,92548.5,103033.82


The numerical data shows no apparent problems either. There are no negative values, which would not make sense for these features, and conversion_rate never exceeds 100%.
These summary statistics already give a feel for the data. For example, selling prices are on average higher than the B2B benchmark but lower than the B2C benchmark.
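
The checks described above can also be made explicit rather than read off the summary table. The sketch below runs them on made-up rows, two of which are deliberately invalid. Note that the summary table above lists neither *mileage* nor *stock_turnover*, although both are `int64`; one possible cause is that the `'int'` selector did not match 64-bit integers in that environment. Selecting with `include="number"` is one way to cover all integer and float columns.

```python
import pandas as pd

# Made-up rows, two of them deliberately invalid
demo = pd.DataFrame({
    "mileage": [138366, 111565, -5],
    "selling_price": [14784.31, 9408.80, 4668.59],
    "conversion_rate": [100.0, 83.32, 104.2],
})

# include="number" selects every integer and float width
numeric_cols = demo.select_dtypes(include="number").columns

# Count negative values per numeric column
negatives = (demo[numeric_cols] < 0).sum()

# conversion_rate is a percentage, so it should stay within [0, 100]
bad_rate = ~demo["conversion_rate"].between(0, 100)

print(negatives[negatives > 0])
print(demo[bad_rate])
```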

## 3.3 Date time variables 

### 3.3.1 selling_date 

In [11]:
# Extract year, month, and day components from the selling_date column
data['selling_year'] = data['selling_date'].dt.year
data['selling_month'] = data['selling_date'].dt.month
data['selling_day'] = data['selling_date'].dt.day

#### Year

In [12]:
# Check years present in the dataset
data.groupby(['selling_year'])['vehicle_ID'].count()

selling_year
2024    10002
2099        1
Name: vehicle_ID, dtype: int64

A selling year of 2099 does not make sense.
In a real-world situation I would ask for this typo to be corrected; since that is not possible here, I simply remove the row.

In [13]:
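# Remove rows where the selling year is later than 2024.
# .copy() makes the result an independent DataFrame, so adding columns later is safe.
data = data[data['selling_year'] <= 2024].copy()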

In [14]:
# Check years present in the dataset after removing 2099
data.groupby(['selling_year'])['vehicle_ID'].count()

selling_year
2024    10002
Name: vehicle_ID, dtype: int64
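
The same clean-up can be expressed as a date-window check that also keeps a record of what was rejected. This is a minimal sketch on made-up rows; the window of calendar year 2024 is an assumption taken from the year counts above, not a documented rule of the dataset.

```python
import pandas as pd

# Made-up rows, one with an implausible selling date
demo = pd.DataFrame({
    "vehicle_ID": ["V1", "V2", "V3"],
    "selling_date": pd.to_datetime(["2024-03-20", "2099-01-01", "2024-12-15"]),
})

# Assumed valid window: the calendar year covered by the data
start, end = pd.Timestamp("2024-01-01"), pd.Timestamp("2024-12-31")
in_window = demo["selling_date"].between(start, end)

# Keep the rejected rows for review instead of dropping them silently
rejected = demo[~in_window]
clean = demo[in_window].copy()

print(f"dropped {len(rejected)} row(s):", rejected["vehicle_ID"].tolist())
```

Unlike a filter on the year alone, the window also catches dates that are too early.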

#### Month

In [15]:
# Check months present in the dataset
data.groupby(['selling_month'])['vehicle_ID'].count()

selling_month
1     838
2     777
3     832
4     821
5     804
6     819
7     853
8     840
9     878
10    856
11    846
12    838
Name: vehicle_ID, dtype: int64

#### Day

In [16]:
# Check days present in the dataset, by month
data.groupby(['selling_month', 'selling_day'])['vehicle_ID'].count()

selling_month  selling_day
1              1              27
               2              26
               3              22
               4              33
               5              18
               6              30
               7              30
               8              26
               9              25
               10             25
               11             20
               12             21
               13             29
               14             23
               15             27
               16             31
               17             38
               18             25
               19             19
               20             25
               21             29
               22             25
               23             29
               24             25
               25             28
               26             26
               27             28
               28             21
               29             26
               3

### 3.3.2 registration_date     

In [17]:
# Extract year, month, and day components from the registration_date column
data['registration_year'] = data['registration_date'].dt.year
data['registration_month'] = data['registration_date'].dt.month
data['registration_day'] = data['registration_date'].dt.day

#### Year

In [18]:
# Check years present in the dataset
data.groupby(['registration_year'])['vehicle_ID'].count()

registration_year
2009       1
2014    1008
2015    1006
2016    1046
2017    1024
2018     966
2019    1068
2020     955
2021     940
2022    1019
2023     966
2024       3
Name: vehicle_ID, dtype: int64

#### Month

In [19]:
# Check months present in the dataset
data.groupby(['registration_month'])['vehicle_ID'].count()

registration_month
1     854
2     766
3     827
4     819
5     812
6     819
7     845
8     847
9     880
10    839
11    863
12    831
Name: vehicle_ID, dtype: int64

#### Day

In [20]:
# Check days present in the dataset, by month
data.groupby(['registration_month', 'registration_day'])['vehicle_ID'].count()

registration_month  registration_day
1                   1                   32
                    2                   39
                    3                   34
                    4                   21
                    5                   25
                    6                   25
                    7                   28
                    8                   34
                    9                   30
                    10                  25
                    11                  20
                    12                  14
                    13                  27
                    14                  28
                    15                  24
                    16                  28
                    17                  31
                    18                  35
                    19                  25
                    20                  22
                    21                  26
                    22                  31
                 

## 3.4 Possible inconsistencies
Data inconsistencies are common: records that make no sense in the context of the problem and that can bias the analysis. In this subsection I look for them in more depth.

- It would not make sense for a vehicle to have a selling date earlier than its registration date.

In [21]:
data[data['selling_date'] < data['registration_date']]

Unnamed: 0,vehicle_ID,selling_country,buyer_country,make,model,mileage,body_type,fuel_type,colour,registration_date,...,conversion_rate,b2b_price,b2c_price,stock_turnover,selling_year,selling_month,selling_day,registration_year,registration_month,registration_day
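
An empty result here means no vehicle was sold before it was registered. As a minimal sketch of a slightly richer version of the same check, computing the age at sale in days makes the size of any violation visible. The `age_days` column name and the example rows are made up for illustration.

```python
import pandas as pd

# Made-up rows: the second one is sold before it is registered
demo = pd.DataFrame({
    "vehicle_ID": ["V1", "V2"],
    "registration_date": pd.to_datetime(["2023-12-16", "2024-06-01"]),
    "selling_date": pd.to_datetime(["2024-12-15", "2024-05-10"]),
})

# Age at sale in days; a negative value means the sale precedes registration
demo["age_days"] = (demo["selling_date"] - demo["registration_date"]).dt.days
sold_before_registered = demo[demo["age_days"] < 0]

print(sold_before_registered)
```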


- Next, I compare each vehicle's actual selling price (internal transaction data) with its benchmark B2B price (external reference data). This helps measure how well the selling price aligns with the market benchmark for similar vehicles. I do not expect large deviations.

In [22]:
# Compute deviations
data['deviation_b2B'] = (data['selling_price'] - data['b2b_price']) / data['b2b_price']
data['deviation_b2C'] = (data['selling_price'] - data['b2c_price']) / data['b2c_price']

In [23]:
data[(data['deviation_b2B'] > 1) | (data['deviation_b2B'] < -1)].T

Unnamed: 0,10002
vehicle_ID,V010003
selling_country,PT
buyer_country,PT
make,BMW
model,X5
mileage,20000
body_type,SUV
fuel_type,Diesel
colour,Black
registration_date,2021-03-02 00:00:00


This vehicle was sold for far more than the B2B benchmark price: a deviation above 1 means more than twice the benchmark. Still, I will leave it in the dataset.

In [24]:
data[(data['deviation_b2C'] > 1) | (data['deviation_b2C'] < -1)].T

Unnamed: 0,10002
vehicle_ID,V010003
selling_country,PT
buyer_country,PT
make,BMW
model,X5
mileage,20000
body_type,SUV
fuel_type,Diesel
colour,Black
registration_date,2021-03-02 00:00:00
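
The deviation used in these cells is the relative difference (selling price minus benchmark, divided by benchmark). The sketch below reproduces the flagging step on made-up prices; the helper name `relative_deviation` is introduced here only for illustration.

```python
import pandas as pd

def relative_deviation(actual: pd.Series, benchmark: pd.Series) -> pd.Series:
    """(actual - benchmark) / benchmark, the same formula as in the cells above."""
    return (actual - benchmark) / benchmark

# Made-up prices: the second vehicle sells at 2.5 times its benchmark
demo = pd.DataFrame({
    "selling_price": [9000.0, 30000.0, 12000.0],
    "b2b_price": [10000.0, 12000.0, 12000.0],
})
demo["deviation"] = relative_deviation(demo["selling_price"], demo["b2b_price"])

# With positive prices the deviation cannot fall below -1, so in practice
# this threshold only catches sales above twice the benchmark
flagged = demo[demo["deviation"].abs() > 1]

print(flagged)
```

Because the lower bound of -1 is never reached, a tighter lower threshold (for example -0.5) would be needed to surface vehicles sold far below the benchmark.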


# 4. Export dataset

In [25]:
# Check columns and rows of dataset after all the changes
data.shape

(10002, 25)

In [26]:
# Export dataset as pkl to preserve the data types
data.to_pickle('data_preprocessed.pkl')
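
As a quick check of the claim that pickling preserves data types, the sketch below writes a small made-up frame to a temporary file and reads it back. The file name `demo.pkl` and the example values are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with a datetime, a mixed object column and an integer column
demo = pd.DataFrame({
    "selling_date": pd.to_datetime(["2024-03-20", "2024-12-15"]),
    "condition_grade": [4.0, "missing"],
    "mileage": [139601, 138366],
})

# Round-trip through a pickle file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")
    demo.to_pickle(path)
    restored = pd.read_pickle(path)

# Same values and the same dtype for every column
same_dtypes = bool((restored.dtypes == demo.dtypes).all())
print(same_dtypes, restored.equals(demo))
```

A CSV export, by contrast, would store the dates as text, so they would need to be parsed again on load.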