CURRENT DILEMMA

"Next Inspection" needs to be a string during null filling. If it is treated as numeric, the nulls are filled with the group average, and an average of YYYYMM values is not a meaningful month. What is supposed to happen: take the most common "Next Inspection" for the corresponding make_model, then reformat it to YYYYMM and convert it to an integer.

But so far my attempts to convert the column seem to be ignored: the nulls are still there after the fill (see In [11]).

In [1]:
import pandas as pd
import numpy as np

In [2]:
scout_car = pd.read_csv('Step 1 - Cleaned v1.csv')
scout_car

Unnamed: 0,URL,make_model,Short Description,Body Type,Price,VAT,KM,registration,Horsepower (kW),Type,...,Lane departure warning system,Night view assist,Passenger-side airbag,Power steering,Rear airbag,Side airbag,Tire pressure monitoring system,Traction control,Traffic sign recognition,Xenon headlights
0,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 1.4 TDI S-tronic Xenon Navi Klima,Sedans,15770,VAT deductible,56013.0,01/2016,66.0,Used,...,0,0,1,1,0,1,1,1,0,1
1,https://www.autoscout24.com//offers/audi-a1-1-...,Audi A1,1.8 TFSI sport,Sedans,14500,Price negotiable,80000.0,03/2017,141.0,Used,...,0,0,1,1,0,1,1,1,0,1
2,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 1.6 TDI S tronic Einparkhilfe plus+m...,Sedans,14640,VAT deductible,83450.0,02/2016,85.0,Used,...,0,0,1,1,0,1,1,1,0,0
3,https://www.autoscout24.com//offers/audi-a1-1-...,Audi A1,1.4 TDi Design S tronic,Sedans,14500,,73000.0,08/2016,66.0,Used,...,0,0,1,1,0,1,1,0,0,0
4,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 1.4 TDI S-Tronic S-Line Ext. admired...,Sedans,16790,,16200.0,05/2016,66.0,Used,...,0,0,1,1,0,1,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15914,https://www.autoscout24.com//offers/renault-es...,Renault Espace,Blue dCi 200CV EDC Executive 4Control,Van,39950,VAT deductible,,,147.0,New,...,1,0,1,1,1,1,1,1,1,0
15915,https://www.autoscout24.com//offers/renault-es...,Renault Espace,"TCe 225 EDC GPF LIM Deluxe Pano,RFK",Van,39885,VAT deductible,9900.0,01/2019,165.0,Used,...,1,0,1,1,0,1,1,1,1,0
15916,https://www.autoscout24.com//offers/renault-es...,Renault Espace,Blue dCi 200 EDC Initiale Paris Leder LED Navi...,Van,39875,VAT deductible,15.0,03/2019,146.0,Pre-registered,...,1,0,1,1,0,1,0,1,1,0
15917,https://www.autoscout24.com//offers/renault-es...,Renault Espace,"Blue dCi 200CV EDC Business , NUOVA DA IMMATRI...",Van,39700,VAT deductible,10.0,06/2019,147.0,Pre-registered,...,0,0,1,1,0,1,1,0,1,0


In [3]:
scout_car.isnull().sum()[scout_car.isnull().sum() > 0]

Short Description                     46
Body Type                             60
VAT                                 4513
KM                                  1024
registration                        1597
Horsepower (kW)                       88
Type                                   2
Previous Owners                     6640
Next Inspection (YYYYMM)           12384
Inspection new                     11987
Warranty (months)                  13123
Offer Number                        3175
First Registration                  1597
Body Color                           597
Paint Type                          5772
Body Color Original                 3774
# of Doors                           212
# of Seats                           977
Model Code                         10941
Displacement (cc)                    496
Cylinders                           5680
Weight (kg)                         6974
Drive chain                         6858
CO2 Emission (g CO2/km (comb))      2436
Emission Class  

In [4]:
scout_car.dtypes

URL                                object
make_model                         object
Short Description                  object
Body Type                          object
Price                               int64
                                    ...  
Side airbag                         int64
Tire pressure monitoring system     int64
Traction control                    int64
Traffic sign recognition            int64
Xenon headlights                    int64
Length: 135, dtype: object

# Columns not necessary or irrelevant in machine learning?

These columns are complicated strings.

'URL'

'Short Description'

'Offer Number'

'description' - Maybe I could run sentiment analysis on it and assign a numeric score accordingly. The text would have to be translated first, though, and I am not sure how reliable the translations would be. I'll drop it for now but may come back and reincorporate it later.

In [5]:
# Drop the free-text / identifier columns in a single call
scout_car.drop(columns=['URL', 'Short Description', 'Offer Number', 'description'], inplace=True)

# Have to convert this column temporarily

In [6]:
scout_car['Next Inspection (YYYYMM)'] = scout_car['Next Inspection (YYYYMM)'].astype(str)

In [7]:
print(scout_car['Next Inspection (YYYYMM)'].dtype)

object
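
A likely reason the conversion looks ignored: in pandas 2.x, `astype(str)` also stringifies missing values, so NaN becomes the literal text 'nan' and there is nothing left for `fillna` to fill. Below is a minimal sketch on a made-up toy Series (the values are assumptions, not taken from the dataset), together with one null-preserving alternative. It is an illustration, not necessarily the right fix for this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'Next Inspection (YYYYMM)' column (hypothetical values)
s = pd.Series([202106.0, np.nan, 202201.0])

# Plain astype(str): on pandas 2.x the NaN is turned into the string 'nan',
# so isna() no longer flags it and fillna() has nothing to do
as_str = s.astype(str)
print(as_str.tolist())
print(as_str.isna().sum())

# Alternative: cast to str, then blank out the positions that were null originally
keeps_nulls = s.astype(str).where(s.notna())
print(keeps_nulls.tolist())
print(keeps_nulls.isna().sum())
```

With the second form the column still holds real nulls, so a later group-wise `fillna` can see them.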


In [8]:
scout_car[['Next Inspection (YYYYMM)']]

Unnamed: 0,Next Inspection (YYYYMM)
0,202106.0
1,
2,
3,
4,
...,...
15914,
15915,202201.0
15916,
15917,


# Handling missing values

For non-numeric columns, find the most common value (before converting to dummies).

In [9]:
# Identify non-numeric columns
non_numeric_cols = scout_car.select_dtypes(exclude=[np.number]).columns

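# Fill nulls in each non-numeric column using the mode within each make_model group.
# Groups with no non-null values have no mode and are left unchanged.
def fill_with_group_mode(x):
    mode = x.mode()
    return x.fillna(mode.iloc[0]) if not mode.empty else x

for col in non_numeric_cols:
    scout_car[col] = scout_car.groupby('make_model')[col].transform(fill_with_group_mode)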




In [10]:
scout_car

Unnamed: 0,make_model,Body Type,Price,VAT,KM,registration,Horsepower (kW),Type,Previous Owners,Next Inspection (YYYYMM),...,Lane departure warning system,Night view assist,Passenger-side airbag,Power steering,Rear airbag,Side airbag,Tire pressure monitoring system,Traction control,Traffic sign recognition,Xenon headlights
0,Audi A1,Sedans,15770,VAT deductible,56013.0,01/2016,66.0,Used,\n2\n,202106.0,...,0,0,1,1,0,1,1,1,0,1
1,Audi A1,Sedans,14500,Price negotiable,80000.0,03/2017,141.0,Used,\n1\n,,...,0,0,1,1,0,1,1,1,0,1
2,Audi A1,Sedans,14640,VAT deductible,83450.0,02/2016,85.0,Used,\n1\n,,...,0,0,1,1,0,1,1,1,0,0
3,Audi A1,Sedans,14500,VAT deductible,73000.0,08/2016,66.0,Used,\n1\n,,...,0,0,1,1,0,1,1,0,0,0
4,Audi A1,Sedans,16790,VAT deductible,16200.0,05/2016,66.0,Used,\n1\n,,...,0,0,1,1,0,1,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15914,Renault Espace,Van,39950,VAT deductible,,01/2016,147.0,New,\n1\n,,...,1,0,1,1,1,1,1,1,1,0
15915,Renault Espace,Van,39885,VAT deductible,9900.0,01/2019,165.0,Used,"['\n1\n', '\n', '7.4 l/100 km (comb)', '\n', '...",202201.0,...,1,0,1,1,0,1,1,1,1,0
15916,Renault Espace,Van,39875,VAT deductible,15.0,03/2019,146.0,Pre-registered,"['\n1\n', '\n139 g CO2/km (comb)\n']",,...,1,0,1,1,0,1,0,1,1,0
15917,Renault Espace,Van,39700,VAT deductible,10.0,06/2019,147.0,Pre-registered,\n1\n,,...,0,0,1,1,0,1,1,0,1,0


In [11]:
scout_car[['Next Inspection (YYYYMM)']]

Unnamed: 0,Next Inspection (YYYYMM)
0,202106.0
1,
2,
3,
4,
...,...
15914,
15915,202201.0
15916,
15917,
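
The table above still shows nulls after the mode fill. Since the mode is just the most frequent value and not an average, one option is to skip the string detour and apply the group-mode fill to this column directly, then convert to a nullable integer. A minimal sketch under that assumption, on a made-up toy frame (column names mirror the notebook, the values and the helper name `fill_with_group_mode` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    'make_model': ['Audi A1', 'Audi A1', 'Audi A1', 'Renault Espace', 'Renault Espace'],
    'Next Inspection (YYYYMM)': [202106.0, 202106.0, np.nan, np.nan, np.nan],
})
col = 'Next Inspection (YYYYMM)'

def fill_with_group_mode(x):
    """Fill nulls with the group's most common value; leave all-null groups alone."""
    mode = x.mode()  # mode() ignores nulls by default
    return x.fillna(mode.iloc[0]) if not mode.empty else x

# Fill while the column still holds real nulls (not the string 'nan')
filled = toy.groupby('make_model')[col].transform(fill_with_group_mode)

# Convert to a nullable integer so rows that are still missing stay <NA>
toy[col] = pd.to_numeric(filled, errors='coerce').astype('Int64')
print(toy)
```

Groups where every value is missing (the toy 'Renault Espace' rows here) have no mode to borrow from, so they stay null and would need a separate decision.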


For numeric columns, find the *median* value. Or maybe find the mean and force rounding in the case of integers.

In [12]:
# --- Fill numeric columns with mean per make_model ---
numeric_cols = scout_car.select_dtypes(include=[np.number]).columns

for col in numeric_cols:
    mean_filled = scout_car.groupby('make_model')[col]\
                           .transform(lambda x: x.fillna(x.mean()))
    
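    # If integer dtype, round and cast back.
    # Note: a column that contains NaN is float64 in pandas, so this branch
    # only runs for integer columns without nulls (or nullable Int64 columns).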
    if pd.api.types.is_integer_dtype(scout_car[col]):
        scout_car[col] = mean_filled.round().astype('Int64')  # Keeps nullable int type
    else:
        scout_car[col] = mean_filled

In [13]:
#scout_car.to_csv('Validation csv 4.csv', index=False)

# Final cleanup

In [14]:
# Convert to datetime
scout_car['registration'] = pd.to_datetime(
    scout_car['registration'],
    format='%m/%Y',
    errors='coerce'
)

# Convert to YYYYMM format
scout_car['registration'] = scout_car['registration'].dt.strftime('%Y%m')

# Convert to INT
scout_car['registration'] = pd.to_numeric(scout_car['registration'], errors='coerce').astype('Int64')

# Rename column
scout_car.rename(columns={'registration': 'Registration (YYYYMM)'}, inplace=True)

# Fill numerics again?

In [19]:
# --- Fill numeric columns with mean per make_model ---
numeric_cols = scout_car.select_dtypes(include=[np.number]).columns

for col in numeric_cols:
    mean_filled = scout_car.groupby('make_model')[col]\
                           .transform(lambda x: x.fillna(x.mean()))
    
    # If integer dtype, round and cast back
    if pd.api.types.is_integer_dtype(scout_car[col]):
        scout_car[col] = mean_filled.round().astype('Int64')  # Keeps nullable int type
    else:
        scout_car[col] = mean_filled
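
One caveat with re-running the mean fill at this point: 'Registration (YYYYMM)' is now a nullable integer, so it is selected as numeric and its nulls would be filled with a rounded average of YYYYMM codes. Such an average is not guaranteed to be a valid month. The sketch below shows this with two made-up values, plus one possible workaround that averages a running month count instead. The helper names `to_month_index` and `from_month_index` are hypothetical, and a mode fill would sidestep the issue entirely.

```python
import pandas as pd

# Two made-up registrations four months apart: October 2016 and February 2017
codes = pd.Series([201610, 201702], dtype='Int64')

# Averaging the raw codes gives a number that is not a real month
naive = int(round(codes.mean()))
print(naive, '-> month part is', naive % 100)

def to_month_index(yyyymm):
    """Months elapsed since year 0, so consecutive months differ by exactly 1."""
    return (yyyymm // 100) * 12 + (yyyymm % 100 - 1)

def from_month_index(idx):
    """Inverse of to_month_index: back to a YYYYMM integer."""
    return (idx // 12) * 100 + (idx % 12 + 1)

# Average on the month scale, then convert back (int() truncates any fraction)
mean_idx = int(to_month_index(codes).mean())
print(from_month_index(mean_idx))
```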

In [20]:
scout_car.to_csv('Validation csv 5.csv', index=False)

In [13]:
scout_car['Registration (YYYYMM)'].value_counts(dropna=False)

Registration (YYYYMM)
NaN       1597
201803     695
201902     585
201805     572
201903     543
201901     541
201804     541
201802     539
201603     536
201604     532
201806     532
201801     511
201904     506
201602     472
201703     471
201605     459
201606     452
201905     440
201706     409
201705     404
201807     396
201704     380
201601     376
201702     368
201701     306
201808     285
201906     224
201707     215
201711     180
201607     176
201610     160
201710     154
201709     149
201611     142
201809     141
201609     141
201612     134
201712     123
201708     114
201811     110
201812     103
201810      97
201608      94
201907       6
201909       5
201908       1
201911       1
201912       1
Name: count, dtype: int64

# Problems

If I am sticking with this method of filling nulls in numeric columns:

#

Next Inspection (YYYYMM)

Is a special case because of how I converted months in the previous notebook. Maybe that step should wait: the column would stay a string, be filled as a string here, and only then be converted to YYYYMM.

#

First Registration

Has some ugly decimals. But since this is only measuring years, it is probably safe to round to int.

Or maybe make it a string for null filling, then convert after filling.


#

These:

\# of Doors

\# of Seats

Have decimals to be removed. Just some .0s. Maybe not crucial, as this is not a data integrity issue.

#

These:

Warranty (months)

Cylinders

Weight (kg)

CO2 Emission (g CO2/km (comb))

All have ugly decimals and need to be rounded to whole numbers.

#

These:

Consumption columns (L /100 km) comb

Consumption columns (L /100 km) city

Consumption columns (L /100 km) country

All need to be rounded to 1 decimal place.
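
The rounding rules listed above could be applied in one pass after the fill. This is a minimal sketch on a made-up toy frame: the values are invented, `consumption_comb` is a placeholder rather than the real column name, and the two column lists are assumptions that would need to be filled in from the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame as it might look after a mean fill; values are made up
toy = pd.DataFrame({
    '# of Doors': [5.0, 4.6, 3.0],
    'Weight (kg)': [1180.0, 1234.567, np.nan],
    'consumption_comb': [3.8, 4.2333, 5.0],  # placeholder name, not the real column
})

whole_number_cols = ['# of Doors', 'Weight (kg)']
one_decimal_cols = ['consumption_comb']

# Round to whole numbers and store as nullable integers so remaining nulls survive
for col in whole_number_cols:
    toy[col] = toy[col].round().astype('Int64')

# Round to one decimal place; the dtype stays float
toy[one_decimal_cols] = toy[one_decimal_cols].round(1)
print(toy)
```

Using the nullable 'Int64' dtype also removes the trailing .0s without forcing a decision about rows that are still missing.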