# AutoScout Car Price Prediction Exploratory Data Analysis (EDA) Project


<p><b>AutoScout</b> data used in this project was scraped from the website of an online car trading company in 2022 and contains many features of 13 different car makes, covering 594 models. In this project, you will apply many commonly used techniques for data cleaning and exploratory data analysis using a variety of Python libraries, such as NumPy, pandas, Matplotlib, Seaborn, and SciPy. The result is a clean dataset for your analysis and predictive modelling in the Machine Learning Path, so you will have the chance to use all the skills you have already learned in the Data Analysis and Visualization courses.</p>

<p dir="ltr">In this context, the project consists of three main parts:</p>

<ol>
    <li> 'Data Cleaning'. It deals with incorrect headers, incorrect formats, anomalies, and dropping useless columns.</li>
    <li> 'Filling Data', in other words 'Imputation'. It deals with missing values. Categorical-to-numeric transformation (encoding) is done as well.</li>
    <li> 'Handling Outliers of Data' via visualization libraries, from which some insights will be extracted.</li>
</ol>



# PART 1 (Data Cleaning)

In [5]:
pip install skimpy




## Importing necessary Libraries

In [7]:
import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
import re
from skimpy import clean_columns

import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")

%matplotlib inline
# %matplotlib notebook

plt.rcParams["figure.figsize"] = (10,6)
# plt.rcParams['figure.dpi'] = 100

sns.set_style("whitegrid")
pd.set_option('display.float_format', lambda x: '%.2f' % x)

#pd.options.display.max_rows = 100
#pd.options.display.max_columns = 100

## Reading the JSON file into a DataFrame

In [8]:
df0=pd.read_json('as24_cars.json')

## Making a working copy of the raw df (df0)

In [9]:
df=df0.copy()

## Getting the first impressions

In [10]:
df.shape

(29480, 58)

In [11]:
df.head().T

Unnamed: 0,0,1,2,3,4
make_model,Mercedes-Benz A 160,Mercedes-Benz EQE 350,Mercedes-Benz A 45 AMG,Mercedes-Benz A 35 AMG,Mercedes-Benz A 45 AMG
short_description,CDi,350+,S 4Matic+ 8G-DCT,4Matic+ 7G-DCT,200CDI BE Line 4M 7G-DCT
make,\nMercedes-Benz\n,\nMercedes-Benz\n,\nMercedes-Benz\n,\nMercedes-Benz\n,\nMercedes-Benz\n
model,"[\n, A 160 ,\n]","[\n, EQE 350 ,\n]","[\n, A 45 AMG ,\n]","[\n, A 35 AMG ,\n]","[\n, A 45 AMG ,\n]"
location,"P.I. EL PALMAR C/FORJA 6, 11500 PUERTO DE SAN...","APARTADO DE CORREOS 1032, 26140 LOGROÑO, ES","PORT. TARRACO, MOLL DE LLEVANT, Nº 5, LOC. 6-8...","Carrer de Provença, 31 Local, 8029 BARCELONA, ES","CARRIL ARAGONES 4, 30007 CASILLAS, ES"
price,"€ 16,950.-","€ 80,900.-","€ 69,900.-","€ 46,990.-","€ 16,800.-"
Body type,"[\n, Compact, \n]","[\n, Compact, \n]","[\n, Compact, \n]","[\n, Compact, \n]","[\n, Compact, \n]"
Type,"[\n, Used, \n]","[\n, Pre-registered, \n]","[\n, Used, \n]","[\n, Used, \n]","[\n, Used, \n]"
Doors,"[\n, 5, \n]","[\n, 4, \n]","[\n, 5, \n]","[\n, 5, \n]","[\n, 5, \n]"
Country version,"[\n, Spain, \n]","[\n, Spain, \n]","[\n, Spain, \n]","[\n, Spain, \n]","[\n, Spain, \n]"


In [12]:
df.sample(5).T

Unnamed: 0,13848,2091,8764,14011,16873
make_model,Skoda Citigo,Mercedes-Benz S 500,Peugeot 2008,Skoda Felicia,Dacia Sandero
short_description,1.0 Style Klima Sitzheizung Einparkhilfe,4M L AMG PANO BURM TV HUD 360° MASSAGE,1.2PureTech GT-Line S&S Carplay/PDC/Camera ***...,1.9D GLX,SL Aniversario TCe 1.0 74kW (100CV)
make,\nSkoda\n,\nMercedes-Benz\n,\nPeugeot\n,\nSkoda\n,\nDacia\n
model,"[\n, Citigo ,\n]","[\n, S 500 ,\n]","[\n, 2008 ,\n]","[\n, Felicia ,\n]","[\n, Sandero ,\n]"
location,"Ludwigsluster Chaussee 1A, 19061 Schwerin, DE","Bismarckstr. 26, 10625 Berlin, DE","Emiel Clauslaan 88, 9800 Deinze, BE","Carretera vieja de Santiago, N360, 27004 LUGO...","Gernika Kalea, 48, 48960 Galdakano, ES"
price,"€ 12,990.-","€ 130,900.-","€ 25,950.-","€ 2,500.-","€ 13,990.-"
Body type,"[\n, Compact, \n]","[\n, Sedan, \n]","[\n, Off-Road/Pick-up, \n]","[\n, Compact, \n]","[\n, Sedan, \n]"
Type,"[\n, Used, \n]","[\n, Used, \n]","[\n, Used, \n]","[\n, Used, \n]","[\n, Used, \n]"
Doors,"[\n, 5, \n]","[\n, 4, \n]","[\n, 5, \n]","[\n, 5, \n]","[\n, 5, \n]"
Country version,"[\n, Germany, \n]",,,"[\n, Spain, \n]","[\n, Spain, \n]"


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29480 entries, 0 to 29479
Data columns (total 58 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   make_model                28630 non-null  object 
 1   short_description         28630 non-null  object 
 2   make                      28630 non-null  object 
 3   model                     28630 non-null  object 
 4   location                  28630 non-null  object 
 5   price                     28630 non-null  object 
 6   Body type                 28630 non-null  object 
 7   Type                      28630 non-null  object 
 8   Doors                     28271 non-null  object 
 9   Country version           16889 non-null  object 
 10  Offer number              23100 non-null  object 
 11  Warranty                  15784 non-null  object 
 12  Mileage                   28629 non-null  object 
 13  First registration        28628 non-null  object 
 14  Gearbo

## Checking the null values

In [14]:
df.isnull().sum().sort_values(ascending=True)

make_model                     850
short_description              850
make                           850
model                          850
location                       850
price                          850
Body type                      850
Type                           850
seller                         850
Mileage                        851
First registration             852
Gearbox                       1098
Doors                         1209
Power                         1422
desc                          1433
Engine size                   2253
Colour                        2574
Fuel type                     2637
Seats                         3975
\nComfort & Convenience\n     4047
\nSafety & Security\n         4065
\nEntertainment & Media\n     5836
\nExtras\n                    6000
Fuel consumption              6095
Offer number                  6380
Manufacturer colour           7693
Gears                        10526
Cylinders                    10628
Upholstery          

## Checking & Cleaning the column names

In [15]:
df.columns

Index(['make_model', 'short_description', 'make', 'model', 'location', 'price',
       'Body type', 'Type', 'Doors', 'Country version', 'Offer number',
       'Warranty', 'Mileage', 'First registration', 'Gearbox', 'Fuel type',
       'Colour', 'Paint', 'desc', 'seller', 'Seats', 'Power', 'Engine size',
       'Gears', 'CO₂-emissions', 'Manufacturer colour', 'Drivetrain',
       'Cylinders', 'Fuel consumption', '\nComfort & Convenience\n',
       '\nEntertainment & Media\n', '\nSafety & Security\n', '\nExtras\n',
       'Empty weight', 'Model code', 'General inspection', 'Last service',
       'Full service history', 'Non-smoker vehicle', 'Emission class',
       'Emissions sticker', 'Upholstery colour', 'Upholstery',
       'Production date', 'Previous owner', 'Other fuel types',
       'Power consumption', 'Energy efficiency class', 'CO₂-efficiency',
       'Fuel consumption (WLTP)', 'CO₂-emissions (WLTP)', 'Available from',
       'Taxi or rental car', 'Availability', 'Last timing b

In [16]:
df = clean_columns(df)
df.columns

Index(['make_model', 'short_description', 'make', 'model', 'location', 'price',
       'body_type', 'type', 'doors', 'country_version', 'offer_number',
       'warranty', 'mileage', 'first_registration', 'gearbox', 'fuel_type',
       'colour', 'paint', 'desc', 'seller', 'seats', 'power', 'engine_size',
       'gears', 'co_emissions', 'manufacturer_colour', 'drivetrain',
       'cylinders', 'fuel_consumption', 'comfort_&_convenience',
       'entertainment_&_media', 'safety_&_security', 'extras', 'empty_weight',
       'model_code', 'general_inspection', 'last_service',
       'full_service_history', 'non_smoker_vehicle', 'emission_class',
       'emissions_sticker', 'upholstery_colour', 'upholstery',
       'production_date', 'previous_owner', 'other_fuel_types',
       'power_consumption', 'energy_efficiency_class', 'co_efficiency',
       'fuel_consumption_wltp', 'co_emissions_wltp', 'available_from',
       'taxi_or_rental_car', 'availability', 'last_timing_belt_change',
       'el

In [17]:
df.rename(columns={'comfort_&_convenience': 'comfort_convenience', 'entertainment_&_media': 'entertainment_media', 'safety_&_security': 'safety_security'}, inplace=True)

In [18]:
df.columns

Index(['make_model', 'short_description', 'make', 'model', 'location', 'price',
       'body_type', 'type', 'doors', 'country_version', 'offer_number',
       'warranty', 'mileage', 'first_registration', 'gearbox', 'fuel_type',
       'colour', 'paint', 'desc', 'seller', 'seats', 'power', 'engine_size',
       'gears', 'co_emissions', 'manufacturer_colour', 'drivetrain',
       'cylinders', 'fuel_consumption', 'comfort_convenience',
       'entertainment_media', 'safety_security', 'extras', 'empty_weight',
       'model_code', 'general_inspection', 'last_service',
       'full_service_history', 'non_smoker_vehicle', 'emission_class',
       'emissions_sticker', 'upholstery_colour', 'upholstery',
       'production_date', 'previous_owner', 'other_fuel_types',
       'power_consumption', 'energy_efficiency_class', 'co_efficiency',
       'fuel_consumption_wltp', 'co_emissions_wltp', 'available_from',
       'taxi_or_rental_car', 'availability', 'last_timing_belt_change',
       'electric
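The two renaming steps above (`clean_columns` followed by the manual `rename`) can be approximated with a small regex helper. The sketch below is not skimpy's implementation; `to_snake_case` and `raw_headers` are hypothetical names introduced only for illustration, and the rule used (collapse every run of non-alphanumeric characters into one underscore) is an assumption that happens to match the headers shown here.

```python
import re

def to_snake_case(name: str) -> str:
    """Hypothetical helper: lower-case a header and collapse every run of
    non-alphanumeric characters into a single underscore."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()

# A few raw headers copied from the df.columns output above.
raw_headers = ["Body type", "\nComfort & Convenience\n", "Fuel consumption (WLTP)"]
print([to_snake_case(h) for h in raw_headers])
```

With a helper like this, `df.columns = [to_snake_case(c) for c in df.columns]` would be one way to apply it to a whole frame.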

## Dropping the rows with all missing values

In [19]:
df.dropna(axis=0, how="all", inplace=True)
df.shape

(28630, 58)
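To make the effect of `how="all"` concrete, here is a minimal sketch on a toy frame (the frame `toy` and its values are invented for illustration). Only rows in which every column is missing are removed; partially filled rows survive, which is why the 850 fully empty rows disappear above while the other gaps remain.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration: row 1 is entirely empty, row 2 only partly.
toy = pd.DataFrame({
    "make": ["Volvo", np.nan, "Fiat"],
    "price": ["€ 9,990.-", np.nan, np.nan],
})

dropped_all = toy.dropna(axis=0, how="all")  # removes only the fully empty row
dropped_any = toy.dropna(axis=0, how="any")  # removes every row with at least one gap

print(dropped_all.shape)
print(dropped_any.shape)
```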

# Univariate Analysis (Checking the Features One by One)

## Make Model

In [20]:
df['make_model'].value_counts(dropna=False)

make_model
Renault Megane     863
SEAT Leon          787
Volvo V40          740
Dacia Sandero      730
Hyundai i30        706
                  ... 
Toyota GR86          1
Toyota Tacoma        1
Toyota Tundra        1
Toyota 4-Runner      1
Volvo 244            1
Name: count, Length: 611, dtype: int64

## Short Description

In [21]:
df['short_description'].value_counts(dropna=False)

short_description
                                                      213
D2 Momentum 120                                        88
D2 Kinetic 120                                         87
Cabrio 1.4T S&S Excellence                             85
Extreme+ 7-Sitzer TCe 110                              57
                                                     ... 
PureTech  EAT8 Allure Pack...DISP. PER NOLEGGIO         1
BlueHDi 130 S&S EAT8 Business                           1
2.0 BlueHDi 180ch S\u0026S GT Line EAT8                 1
1.2 PureTech Première AUT. NAVI PANO                    1
2.9 Executive G. NETTE AUTO! LEER! NAVI! CRUISE! L      1
Name: count, Length: 20947, dtype: int64

## Make

In [22]:
df.make.value_counts()

make
\nVolvo\n            3659
\nMercedes-Benz\n    2398
\nOpel\n             2385
\nPeugeot\n          2360
\nRenault\n          2351
\nFiat\n             2338
\nFord\n             2324
\nNissan\n           2064
\nToyota\n           2038
\nHyundai\n          1867
\nSEAT\n             1743
\nSkoda\n            1566
\nDacia\n            1537
Name: count, dtype: int64

In [23]:
df["make"] = df["make"].str.strip("\n")
df.make.value_counts(dropna=False)

make
Volvo            3659
Mercedes-Benz    2398
Opel             2385
Peugeot          2360
Renault          2351
Fiat             2338
Ford             2324
Nissan           2064
Toyota           2038
Hyundai          1867
SEAT             1743
Skoda            1566
Dacia            1537
Name: count, dtype: int64

In [24]:
df['make'].isna().sum()

0

## Model

In [25]:
df.model.value_counts(dropna=False)

model
[\n, Megane ,\n]        863
[\n, Leon ,\n]          787
[\n, V40 ,\n]           740
[\n, Sandero ,\n]       730
[\n, i30 ,\n]           706
                       ... 
[\n, GLA 35 AMG ,\n]      1
[\n, G 55 AMG ,\n]        1
[\n, Ariya ,\n]           1
[\n, 105 ,\n]             1
[\n, 244 ,\n]             1
Name: count, Length: 594, dtype: int64

In [26]:
df['model']=df['model'].apply(lambda item : item[0] if type(item)==list else item)

In [27]:
df['model']=df['model'].str.strip('\n, ')

In [28]:
df.model.value_counts(dropna=False)

model
Megane        863
Leon          787
V40           740
Sandero       730
i30           706
             ... 
GLA 35 AMG      1
G 55 AMG        1
Ariya           1
105             1
244             1
Name: count, Length: 594, dtype: int64

In [29]:
df['model'].isna().sum()

0
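The same two steps (unwrap the single-element list, then strip the padding) recur for most of the columns that follow. As a sketch only, they could be factored into one helper; `unwrap_single` is a hypothetical name, and the toy values assume, as the outputs above suggest, that each cell holds one string wrapped in a list.

```python
import pandas as pd

def unwrap_single(item):
    """Hypothetical helper: return the first element of a non-empty list,
    or the value itself (e.g. NaN) when it is not a list."""
    return item[0] if isinstance(item, list) and item else item

# Toy values shaped like the scraped cells: one string wrapped in a list.
raw = pd.Series([["\n, Megane ,\n"], ["\n, A 160 ,\n"], float("nan")])

# Unwrap, then strip newlines, commas and spaces from both ends.
cleaned = raw.apply(unwrap_single).str.strip("\n, ")
print(cleaned.tolist())
```

Using `isinstance` rather than `type(item) == list` also covers list subclasses, and missing values pass through untouched.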

## Bucketing the location column into country

In [30]:
df["location"].value_counts(dropna=False)

location
Av. Laboral, 10,  28021 MADRID, ES                          306
Luckenwalder Berg 5,  14913 Jüterbog, DE                    170
Ctra. del Mig, 96,,  08097 L'Hospitalet de Llobregat, ES    146
9 boulevard Jules Ferry,  75011 Paris, FR                   142
Neuenhofstr. 77,  52078 Aachen, DE                          135
                                                           ... 
2727CT ZOETERMEER, NL                                         1
Rosendaalsestraat 437-439,  6824 CK ARNHEM, NL                1
5751VH DEURNE, NL                                             1
00148 roma, IT                                                1
Sur rendez-vous,  5060 Sambreville, BE                        1
Name: count, Length: 8181, dtype: int64

In [31]:
df['country'] = df['location'].str.split(',').str[-1]
df['city'] = df['location'].str.split(',').str[-2]

In [32]:
df['country'].value_counts(dropna=False)

country
 DE    12643
 ES     6517
 NL     2929
 IT     2497
 BE     1873
 FR     1473
 AT      660
 LU       35
 DK        1
 EE        1
 BG        1
Name: count, dtype: int64

## City

In [33]:
df['city'].value_counts(dropna=False)

city
  28021 MADRID                       333
  14913 Jüterbog                     170
  08097 L'Hospitalet de Llobregat    147
  75011 Paris                        142
  52078 Aachen                       135
                                    ... 
4663 Laakirchen                        1
67250 aschbach                         1
  3530 Houthalen-Helchteren            1
47290 Cancon                           1
1000 Brussel                           1
Name: count, Length: 6807, dtype: int64

In [34]:
df['city']= df['city'].apply(lambda x: re.sub(r'\d+', '', x))
df['city'].value_counts(dropna=False)

city
   MADRID         736
   Berlin         424
   SEVILLA        210
   Dresden        209
   München        200
                 ... 
 Strasswalchen      1
   Spielfeld        1
   NORMANVILLE      1
   Wilthen          1
 Brussel            1
Name: count, Length: 5733, dtype: int64
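The counts above still carry leading spaces (for example `" DE"` and `"   MADRID"`), so values that differ only in padding are counted separately. A minimal sketch of one way to trim them, using two toy addresses in the same shape as the data (`loc`, `country` and `city` are illustrative names, not columns of `df`):

```python
import pandas as pd

# Toy addresses in the same shape as the location values above.
loc = pd.Series([
    "Av. Laboral, 10,  28021 MADRID, ES",
    "1000 Brussel, BE",
])

parts = loc.str.split(",")
country = parts.str[-1].str.strip()
# Remove the postal code digits, then trim the leftover spaces.
city = parts.str[-2].str.replace(r"\d+", "", regex=True).str.strip()

print(country.tolist())
print(city.tolist())
```

This only removes digits, so alphanumeric postal codes such as the Dutch `6824 CK ARNHEM` would still leave the letter part in front of the city name.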

## Price

In [35]:
df["price"].value_counts(dropna=False)

price
€ 14,990.-    222
€ 12,990.-    219
€ 16,990.-    186
€ 19,990.-    166
€ 9,990.-     160
             ... 
€ 19,112.-      1
€ 30,465.-      1
€ 18,461.-      1
€ 22,649.-      1
€ 4,440.-       1
Name: count, Length: 5021, dtype: int64

In [36]:
df["price"] = df["price"].str.replace(r'\D+', '', regex=True)  # keep digits only
df["price"] = df["price"].str.strip(' ')

In [37]:
df["price"].value_counts(dropna=False)

price
14990    222
12990    219
16990    186
19990    166
9990     160
        ... 
19112      1
30465      1
18461      1
22649      1
4440       1
Name: count, Length: 5021, dtype: int64

In [38]:
df.price.isna().sum()

0

In [39]:
df["price"].apply(lambda x:type(x)).value_counts()

price
<class 'str'>    28630
Name: count, dtype: int64

In [40]:
# Preview only: the result is not assigned back, so df["price"] keeps its object dtype
df["price"].astype('int')

0        16950
1        80900
2        69900
3        46990
4        16800
         ...  
29474    37600
29475     5499
29476     7300
29477    29900
29478     4440
Name: price, Length: 28630, dtype: int32

In [41]:
df["price"].info()

<class 'pandas.core.series.Series'>
Index: 28630 entries, 0 to 29478
Series name: price
Non-Null Count  Dtype 
--------------  ----- 
28630 non-null  object
dtypes: object(1)
memory usage: 447.3+ KB
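The `info()` output above still reports `object` because `astype` returns a new Series instead of modifying the column in place. A minimal sketch of persisting the conversion, on toy prices in the scraped format (`price`, `digits` and `price_int` are illustrative names):

```python
import pandas as pd

# Toy prices in the scraped format.
price = pd.Series(["€ 16,950.-", "€ 4,440.-"])

digits = price.str.replace(r"\D+", "", regex=True)  # keep digits only
price_int = digits.astype("int64")  # keep the returned Series; astype does not act in place

print(price_int.tolist())
print(price_int.dtype)
```

On the real frame the equivalent would be `df["price"] = df["price"].astype("int64")`, assuming every value is a digit string as the checks above indicate.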


## Body Type

In [42]:
df.body_type.value_counts(dropna=False)

body_type
[\n, Station wagon, \n]       5448
[\n, Off-Road/Pick-up, \n]    5415
[\n, Compact, \n]             5387
[\n, Sedan, \n]               5043
[\n, Coupe, \n]               4009
[\n, Convertible, \n]         3328
Name: count, dtype: int64

In [43]:
df["body_type"] = df["body_type"].apply(lambda item : item[0] if type(item)==list else item)
df["body_type"].value_counts(dropna=False)

body_type
\n, Station wagon, \n       5448
\n, Off-Road/Pick-up, \n    5415
\n, Compact, \n             5387
\n, Sedan, \n               5043
\n, Coupe, \n               4009
\n, Convertible, \n         3328
Name: count, dtype: int64

In [44]:
df["body_type"] = df["body_type"].str.extract(", ([^,]+),")
df["body_type"].value_counts()

body_type
Station wagon       5448
Off-Road/Pick-up    5415
Compact             5387
Sedan               5043
Coupe               4009
Convertible         3328
Name: count, dtype: int64

## Type

In [45]:
df["type"].value_counts(dropna=False)

type
[\n, Used, \n]              25251
[\n, Demonstration, \n]      1433
[\n, Pre-registered, \n]     1377
[\n, Employee's car, \n]      569
Name: count, dtype: int64

In [46]:
df["type"] = df["type"].apply(lambda item : item[0] if type(item)==list else item)


In [47]:
df["type"] = df["type"].str.extract(r',\s*([^,]+),')

In [48]:
df["type"].value_counts(dropna=False)

type
Used              25251
Demonstration      1433
Pre-registered     1377
Employee's car      569
Name: count, dtype: int64

## Doors

In [49]:
df.doors.value_counts(dropna=False)

doors
[\n, 5, \n]    17481
[\n, 2, \n]     5523
[\n, 4, \n]     3001
[\n, 3, \n]     2259
NaN              359
[\n, 6, \n]        5
[\n, 1, \n]        2
Name: count, dtype: int64

In [50]:
df["doors"] = df["doors"].apply(lambda item : item[0] if type(item)==list else item)

In [51]:
df["doors"] = df["doors"].str.extract(r"(\d)")

In [52]:
df["doors"].value_counts(dropna=False) 

doors
5      17481
2       5523
4       3001
3       2259
NaN      359
6          5
1          2
Name: count, dtype: int64

In [53]:
df["doors"].apply(lambda x: type(x) ).value_counts()

doors
<class 'str'>      28271
<class 'float'>      359
Name: count, dtype: int64
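The type check above shows digit strings mixed with float `NaN`, so a plain `astype(int)` would fail on the missing values. One possible approach, sketched here on toy values rather than on `df`, is pandas' nullable `Int64` dtype, which keeps the gaps as `<NA>`:

```python
import numpy as np
import pandas as pd

# Toy door counts: digit strings plus one missing value.
doors = pd.Series(["5", "2", np.nan, "4"])

# to_numeric yields floats with NaN; "Int64" (capital I) is the nullable integer dtype.
doors_int = pd.to_numeric(doors, errors="coerce").astype("Int64")

print(doors_int.dtype)
print(doors_int.isna().sum())
```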

## Country Version

In [54]:
df["country_version"].value_counts(dropna=False)

country_version
NaN                         11741
[\n, Germany, \n]            7939
[\n, Spain, \n]              6376
[\n, Italy, \n]               679
[\n, Belgium, \n]             641
[\n, European Union, \n]      340
[\n, Netherlands, \n]         306
[\n, Austria, \n]             266
[\n, France, \n]              101
[\n, United States, \n]        57
[\n, Czechia, \n]              47
[\n, Poland, \n]               31
[\n, Hungary, \n]              21
[\n, Denmark, \n]              21
[\n, Romania, \n]              12
[\n, Japan, \n]                12
[\n, Switzerland, \n]           9
[\n, Luxembourg, \n]            9
[\n, Sweden, \n]                6
[\n, Slovenia, \n]              5
[\n, Slovakia, \n]              4
[\n, Croatia, \n]               3
[\n, Bulgaria, \n]              1
[\n, Malta, \n]                 1
[\n, Canada, \n]                1
[\n, Mexico, \n]                1
Name: count, dtype: int64

In [55]:
df["country_version"] = df["country_version"].apply(lambda item : item[0] if type(item)==list else item)

In [56]:
df["country_version"] = df["country_version"].str.extract(r',\s*([^,]+),')

In [57]:
df["country_version"].value_counts(dropna=False)

country_version
NaN               11741
Germany            7939
Spain              6376
Italy               679
Belgium             641
European Union      340
Netherlands         306
Austria             266
France              101
United States        57
Czechia              47
Poland               31
Hungary              21
Denmark              21
Romania              12
Japan                12
Switzerland           9
Luxembourg            9
Sweden                6
Slovenia              5
Slovakia              4
Croatia               3
Bulgaria              1
Malta                 1
Canada                1
Mexico                1
Name: count, dtype: int64

In [58]:
df["country_version"].info()

<class 'pandas.core.series.Series'>
Index: 28630 entries, 0 to 29478
Series name: country_version
Non-Null Count  Dtype 
--------------  ----- 
16889 non-null  object
dtypes: object(1)
memory usage: 447.3+ KB


## Offer Number

In [59]:
df["offer_number"].value_counts(dropna=False)

offer_number
NaN                                5530
[\n, 1, \n]                          28
[\n, L-Vorlauf 2023, \n]             10
[\n, 30, \n]                          9
[\n, 20, \n]                          9
                                   ... 
[\n, ggp-EP-880-ZS, \n]               1
[\n, 7475319, \n]                     1
[\n, abci-EZ-260-PS_130291, \n]       1
[\n, 7407611, \n]                     1
[\n, 43-JR-LR, \n]                    1
Name: count, Length: 20946, dtype: int64

In [60]:
df["offer_number"] = df["offer_number"].apply(lambda item : item[0] if type(item)==list else item)

In [61]:
df["offer_number"].value_counts(dropna=False)

offer_number
NaN                              5530
\n, 1, \n                          28
\n, L-Vorlauf 2023, \n             10
\n, 30, \n                          9
\n, 20, \n                          9
                                 ... 
\n, ggp-EP-880-ZS, \n               1
\n, 7475319, \n                     1
\n, abci-EZ-260-PS_130291, \n       1
\n, 7407611, \n                     1
\n, 43-JR-LR, \n                    1
Name: count, Length: 20946, dtype: int64

In [62]:
df["offer_number"] = df["offer_number"].str.strip("\n, ")

In [63]:
df["offer_number"].value_counts(dropna=False)

offer_number
NaN                      5530
1                          28
L-Vorlauf 2023             10
30                          9
20                          9
                         ... 
ggp-EP-880-ZS               1
7475319                     1
abci-EZ-260-PS_130291       1
7407611                     1
43-JR-LR                    1
Name: count, Length: 20946, dtype: int64

## Warranty

In [64]:
df.warranty.value_counts(dropna=False)

warranty
NaN                     12846
[\n, 12 months, \n]      9545
[\n, Yes, \n]            2319
[\n, 24 months, \n]      1515
[\n, 60 months, \n]       968
                        ...  
[\n, 55 months, \n]         1
[\n, 99 months, \n]         1
[\n, 122 months, \n]        1
[\n, 44 months, \n]         1
[\n, 4 months, \n]          1
Name: count, Length: 66, dtype: int64

In [65]:
df["warranty"] = df["warranty"].apply(lambda item : item[0] if type(item)==list else item)

In [66]:
df["warranty"]=df["warranty"].str.strip("\n, ")

In [67]:
df["warranty"].value_counts(dropna=False)    

warranty
NaN           12846
12 months      9545
Yes            2319
24 months      1515
60 months       968
              ...  
55 months         1
99 months         1
122 months        1
44 months         1
4 months          1
Name: count, Length: 66, dtype: int64

## Mileage

In [68]:
df['mileage'].value_counts(dropna=False)

mileage
10 km         586
1 km          172
50 km         133
100 km        119
5,000 km      118
             ... 
141,589 km      1
59,821 km       1
123,500 km      1
29,781 km       1
230,047 km      1
Name: count, Length: 14184, dtype: int64

In [69]:
df['mileage'].isna().sum()

1

In [70]:
df['mileage'] = df['mileage'].str.replace(',', '').str.findall(r'\d+').str[0]

In [71]:
df['mileage'].value_counts(dropna=False)

mileage
10        586
1         172
50        133
100       119
5000      118
         ... 
141589      1
59821       1
123500      1
29781       1
230047      1
Name: count, Length: 14184, dtype: int64

## First Registration

In [72]:
df['first_registration'].value_counts(dropna=False)

first_registration
08/2022    454
06/2022    428
05/2019    420
06/2019    418
07/2019    416
          ... 
06/1967      1
06/1980      1
09/1970      1
06/1963      1
10/1979      1
Name: count, Length: 656, dtype: int64

In [73]:
df['first_registration'] = pd.to_datetime(df['first_registration'], format='%m/%Y')

In [74]:
df['first_registration']

0       2016-06-01
1       2022-06-01
2       2020-07-01
3       2020-01-01
4       2015-09-01
           ...    
29474   2019-08-01
29475   2004-06-01
29476   2011-04-01
29477   2017-11-01
29478   2002-07-01
Name: first_registration, Length: 28630, dtype: datetime64[ns]

In [75]:
df['first_registration'].info()

<class 'pandas.core.series.Series'>
Index: 28630 entries, 0 to 29478
Series name: first_registration
Non-Null Count  Dtype         
--------------  -----         
28628 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 447.3 KB


In [76]:
df['first_registration'].isna().sum()

2

## Gearbox

In [77]:
df['gearbox'].value_counts(dropna=False)

gearbox
[\nManual\n]            17023
[\nAutomatic\n]         11287
NaN                       248
[\nSemi-automatic\n]       72
Name: count, dtype: int64

In [78]:
df['gearbox']=df['gearbox'].apply(lambda item : item[0] if type(item)==list else item)

In [79]:
df['gearbox']=df['gearbox'].str.strip('\n')
df['gearbox'].value_counts(dropna=False)

gearbox
Manual            17023
Automatic         11287
NaN                 248
Semi-automatic       72
Name: count, dtype: int64

In [80]:
df['gearbox'].info()

<class 'pandas.core.series.Series'>
Index: 28630 entries, 0 to 29478
Series name: gearbox
Non-Null Count  Dtype 
--------------  ----- 
28382 non-null  object
dtypes: object(1)
memory usage: 447.3+ KB



## Bucketing the Fuel Type Feature

In [82]:
df['fuel_type'].value_counts(dropna=False)

fuel_type
Gasoline                                                                                              8532
Diesel                                                                                                5911
Super 95                                                                                              3557
Diesel (Particle filter)                                                                              2816
Regular/Benzine 91                                                                                    2065
NaN                                                                                                   1787
Super E10 95                                                                                          1016
Regular/Benzine 91 (Particle filter)                                                                   555
Super 95 (Particle filter)                                                                             537
Super E10 95 (Particle filt

In [83]:
benzine = ['Gasoline','Regular/Benzine 91','Regular/Benzine 91 (Particle filter)', 'Super 95', 'Super E10 95', 'Gasoline (Particle filter)','Super 95 (Particle filter)','Super Plus 98',\
           'Super E10 95 (Particle filter)','Regular/Benzine E10 91','Super Plus E10 98', 'Super Plus E10 98 (Particle filter)', 'Super Plus 98 (Particle filter)', 'Ethanol', \
           'Regular/Benzine E10 91 (Particle filter)', 'Super 95 (Particle filter) / Super E10 95 / Ethanol']
diesel  = ['Diesel','Diesel (Particle filter)']
LPG =  ['LPG', 'Liquid petroleum gas (LPG)','Liquid petroleum gas (LPG) / Super E10 95 / Regular/Benzine 91 / Super 95 / Super Plus 98 / Biogas', 'CNG',\
        'Liquid petroleum gas (LPG) (Particle filter) / Super 95 / Super E10 95','Liquid petroleum gas (LPG) / Super 95 / Super E10 95',\
        'LPG (Particle filter)',  'Domestic gas L','Liquid petroleum gas (LPG) / Super 95 / Super Plus 98 / Super Plus E10 98 / Super E10 95',\
        'CNG (Particle filter)', 'Domestic gas H', 'Domestic gas L (Particle filter)', 'Biogas',\
        'Domestic gas H / Super E10 95 / Super Plus E10 98 / Super 95 / Super Plus 98 / Domestic gas L',\
        'Domestic gas L / Super 95 / Domestic gas H','Super 95 / Super Plus 98 / Liquid petroleum gas (LPG)',\
        'Liquid petroleum gas (LPG) (Particle filter)', 'Liquid petroleum gas (LPG) / Super 95',\
        'Liquid petroleum gas (LPG) / Super 95 / Super E10 95 / Super Plus 98', 'Liquid petroleum gas (LPG) / Super 95 / Super Plus 98',\
        'Liquid petroleum gas (LPG) / Super E10 95', 'Liquid petroleum gas (LPG) / Super E10 95 / Super Plus E10 98 / Super Plus 98 / Super 95', 'Super 95 / Liquid petroleum gas (LPG)']
electric = ['Electric','Electric (Particle filter)']
others =  ['Others', 'Others (Particle filter)',"Hydrogen"]

def fuel_types(fuel_type):
    if fuel_type in benzine:
        return "benzine"
    elif fuel_type in diesel:
        return "diesel"
    elif fuel_type in LPG:
        return "LPG"
    elif fuel_type in electric:
        return "electric"
    elif fuel_type in others:
        return "others"
    else:
        return np.nan

df['fuel_type'] = df['fuel_type'].map(fuel_types) 
df['fuel_type'].value_counts(dropna=False)


fuel_type
benzine     17334
diesel       8727
NaN          1787
LPG           378
others        216
electric      188
Name: count, dtype: int64
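The `if`/`elif` chain above scans each list in turn for every row. An equivalent lookup can be sketched with a single dictionary that maps each raw label to its bucket. The `buckets` dictionary below is deliberately abbreviated for illustration, and `label_to_bucket` is a hypothetical name; the full label lists are the ones defined in the cell above.

```python
# Abbreviated buckets for illustration; the full lists are in the cell above.
buckets = {
    "benzine": ["Gasoline", "Super 95", "Super E10 95"],
    "diesel": ["Diesel", "Diesel (Particle filter)"],
    "electric": ["Electric"],
}

# Invert to {raw label: bucket} so each lookup is a single dict access.
label_to_bucket = {
    label: bucket for bucket, labels in buckets.items() for label in labels
}

print(label_to_bucket.get("Super 95"))
print(label_to_bucket.get("Unknown label"))
```

Passing such a dictionary to `Series.map` leaves labels that are not in the dictionary (and missing values) as `NaN`, which matches the `else` branch of `fuel_types`.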


## Colour

In [85]:
df['colour'].value_counts(dropna=False)

colour
Black     6473
Grey      5998
White     5185
Blue      3478
Red       2242
NaN       1724
Silver    1622
Green      450
Brown      437
Orange     288
Beige      278
Yellow     230
Violet      98
Bronze      65
Gold        62
Name: count, dtype: int64

## Paint

In [86]:
df['paint'].value_counts(dropna=False)

paint
Metallic     14494
NaN          14135
Uni/basic        1
Name: count, dtype: int64

In [87]:
df["paint"] = df["paint"].apply(lambda x: 1 if x == "Metallic" else 0)

In [88]:
df["paint"].value_counts(dropna=False)

paint
1    14494
0    14136
Name: count, dtype: int64

## Desc

In [89]:
df['desc'].value_counts()

desc
[]

## Seller

In [90]:
df['seller'].value_counts(dropna=False)

seller
Dealer            26318
Private seller     2312
Name: count, dtype: int64

In [91]:
df['seller'].info()

<class 'pandas.core.series.Series'>
Index: 28630 entries, 0 to 29478
Series name: seller
Non-Null Count  Dtype 
--------------  ----- 
28630 non-null  object
dtypes: object(1)
memory usage: 447.3+ KB


In [92]:
df['seller']=df['seller'].astype('string')

In [93]:
df['seller'].info()

<class 'pandas.core.series.Series'>
Index: 28630 entries, 0 to 29478
Series name: seller
Non-Null Count  Dtype 
--------------  ----- 
28630 non-null  string
dtypes: string(1)
memory usage: 447.3 KB


## Seats

In [94]:
df['seats'].value_counts(dropna=False)

seats
[\n, 5, \n]     18308
[\n, 4, \n]      5390
NaN              3125
[\n, 2, \n]      1186
[\n, 7, \n]       488
[\n, 8, \n]        43
[\n, 9, \n]        35
[\n, 3, \n]        25
[\n, 6, \n]        12
[\n, 0, \n]         9
[\n, 1, \n]         7
[\n, 17, \n]        2
Name: count, dtype: int64

In [95]:
df['seats']=df['seats'].apply(lambda item : item[0] if type(item)==list else item)

In [96]:
df['seats']=df['seats'].str.strip('\n, ')

In [97]:
df['seats'].value_counts(dropna=False)

seats
5      18308
4       5390
NaN     3125
2       1186
7        488
8         43
9         35
3         25
6         12
0          9
1          7
17         2
Name: count, dtype: int64
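
The same "take the first list element, then strip the wrapper characters" pattern recurs for several columns below. It could be wrapped in a small helper. This is only a sketch: `unwrap_and_strip` is a hypothetical name, and the toy cells assume each scraped value is a one-element list holding a string such as `"\n, 5, \n"`, which is consistent with the outputs above but not confirmed by them.

```python
import numpy as np
import pandas as pd


def unwrap_and_strip(series: pd.Series, chars: str = "\n, ") -> pd.Series:
    """Return the first element of list cells, with wrapper characters stripped."""
    unwrapped = series.apply(lambda item: item[0] if isinstance(item, list) else item)
    return unwrapped.str.strip(chars)


# Toy cells shaped like the scraped 'seats' values (hypothetical data).
raw = pd.Series([["\n, 5, \n"], ["\n, 4, \n"], np.nan])

seats_demo = unwrap_and_strip(raw)                           # strings, NaN kept
seats_numeric = pd.to_numeric(seats_demo, errors="coerce")   # floats, NaN kept

print(seats_numeric.tolist())
```

Converting to a numeric dtype with `pd.to_numeric` is an extra step not performed in the cells above; it is shown only because counts such as seats are usually analysed as numbers.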

## Power

In [98]:
df['power'].value_counts(dropna=False)

power
[\n110 kW (150 hp)\n]      1992
[\n96 kW (131 hp)\n]       1356
[\n88 kW (120 hp)\n]       1182
[\n81 kW (110 hp)\n]       1166
[\n66 kW (90 hp)\n]        1110
                           ... 
[\n746 kW (1,014 hp)\n]       1
[\n570 kW (775 hp)\n]         1
[\n471 kW (640 hp)\n]         1
[\n179 kW (243 hp)\n]         1
[\n26 kW (35 hp)\n]           1
Name: count, Length: 352, dtype: int64

In [99]:
df.power.isna().sum() 

572

In [100]:
df['power']=df['power'].apply(lambda item : item[0] if type(item)==list else item)

In [101]:
df["power"] = df["power"].str.extract(r"\n(\d+) ")

In [102]:
df.power.isna().sum() 

572

In [103]:
df['power']

0        NaN
1        215
2        310
3        225
4        100
        ... 
29474    288
29475    125
29476     84
29477    187
29478    147
Name: power, Length: 28630, dtype: object
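
The extracted kW values are still strings (`dtype: object`). A possible follow-up, sketched on toy strings that mimic the raw format (the `power_demo` series is hypothetical), is to request a Series from `str.extract` with `expand=False` and convert it with `pd.to_numeric`:

```python
import numpy as np
import pandas as pd

# Toy strings in the same format as the unwrapped 'power' cells (hypothetical data).
power_demo = pd.Series(["\n110 kW (150 hp)\n", "\n746 kW (1,014 hp)\n", np.nan])

# expand=False returns a Series instead of a one-column DataFrame.
kw = power_demo.str.extract(r"\n(\d+) ", expand=False)

# Convert the digit strings to numbers; unparseable or missing cells become NaN.
kw_numeric = pd.to_numeric(kw, errors="coerce")

print(kw_numeric.tolist())
```

Only the kW figure is captured, so the thousands separator in the hp figure does not interfere.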

## Engine Size

In [104]:
df['engine_size'].value_counts(dropna=False)

engine_size
[\n1,598 cc\n]    2099
[\n999 cc\n]      2068
[\n1,969 cc\n]    1929
NaN               1403
[\n1,461 cc\n]    1110
                  ... 
[\n2,753 cc\n]       1
[\n2,495 cc\n]       1
[\n200 cc\n]         1
[\n3,224 cc\n]       1
[\n2,473 cc\n]       1
Name: count, Length: 468, dtype: int64

In [105]:
df['engine_size']=df['engine_size'].apply(lambda item : item[0] if type(item)==list else item)

In [106]:
df['engine_size']=df['engine_size'].str.replace(',', '').str.extract(r'(\d+)')

In [107]:
df['engine_size'].value_counts(dropna=False) 

engine_size
1598    2099
999     2068
1969    1929
NaN     1403
1461    1110
        ... 
2753       1
2495       1
200        1
3224       1
2473       1
Name: count, Length: 468, dtype: int64

## Gears

In [108]:
df['gears'].value_counts(dropna=False)

gears
NaN         9676
[\n6\n]     8412
[\n5\n]     5335
[\n7\n]     1738
[\n8\n]     1690
[\n1\n]      712
[\n9\n]      642
[\n4\n]      256
[\n10\n]     112
[\n0\n]       31
[\n3\n]       24
[\n2\n]        2
Name: count, dtype: int64

In [109]:
df['gears']=df['gears'].apply(lambda item : item[0] if type(item)==list else item)

In [110]:
df['gears']=df['gears'].str.strip('\n')

In [111]:
df['gears'].value_counts(dropna=False)  

gears
NaN    9676
6      8412
5      5335
7      1738
8      1690
1       712
9       642
4       256
10      112
0        31
3        24
2         2
Name: count, dtype: int64

## Co Emissions

In [112]:
df['co_emissions'].value_counts(dropna=False)

co_emissions
NaN                  10036
0 g/km (comb.)        1038
119 g/km (comb.)       393
124 g/km (comb.)       340
129 g/km (comb.)       319
                     ...  
7 g/km (comb.)           1
80 g/km (comb.)          1
196  g/km (comb.)        1
100  g/km (comb.)        1
53 g/km (comb.)          1
Name: count, Length: 348, dtype: int64

In [113]:
df['co_emissions']=df['co_emissions'].str.extract(r'(\d+)')

In [114]:
df['co_emissions'].value_counts(dropna=False) 

co_emissions
NaN    10036
0       1040
119      393
124      340
129      319
       ...  
332        1
70         1
338        1
342        1
53         1
Name: count, Length: 327, dtype: int64
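
The number of distinct values drops from 348 to 327 after extraction. This is consistent with spacing variants of the same figure (for example `196  g/km (comb.)` with two spaces) collapsing into one value, although the full output is not shown here. A small sketch of that effect on toy strings (hypothetical data):

```python
import pandas as pd

# Two spellings of the same emission figure plus one other value (hypothetical data).
co_demo = pd.Series(["0 g/km (comb.)", "196 g/km (comb.)", "196  g/km (comb.)"])

# Keep only the leading run of digits.
co_clean = co_demo.str.extract(r"(\d+)", expand=False)

n_before = co_demo.nunique()   # distinct raw strings
n_after = co_clean.nunique()   # distinct extracted numbers

print(n_before, n_after)
```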

## Manufacturer Colour

In [115]:
df['manufacturer_colour'].value_counts(dropna=False)

manufacturer_colour
NaN                               6843
Blanco                            1235
Gris                               671
Azul                               552
Negro                              546
                                  ... 
Colore esterno (snowflake whit       1
Pompeigraumetallic                   1
Cararragrau                          1
ICE WHITE (wit metallic)             1
Denim Blue metallic (blauw met       1
Name: count, Length: 4964, dtype: int64

## Drive Train

In [116]:
df['drivetrain'].value_counts(dropna=False)

drivetrain
[\n, Front, \n]    12066
NaN                11737
[\n, 4WD, \n]       3252
[\n, Rear, \n]      1575
Name: count, dtype: int64

In [117]:
df['drivetrain']=df['drivetrain'].apply(lambda item : item[0] if type(item)==list else item)

In [118]:
df['drivetrain']=df['drivetrain'].str.strip('\n, ')

In [119]:
df['drivetrain'].value_counts(dropna=False)

drivetrain
Front    12066
NaN      11737
4WD       3252
Rear      1575
Name: count, dtype: int64

## Cylinders

In [120]:
df['cylinders'].value_counts(dropna=False)

cylinders
[\n4\n]     13068
NaN          9778
[\n3\n]      3258
[\n6\n]      1013
[\n5\n]       799
[\n8\n]       539
[\n2\n]       103
[\n0\n]        35
[\n1\n]        21
[\n12\n]        8
[\n7\n]         6
[\n26\n]        1
[\n16\n]        1
Name: count, dtype: int64

In [121]:
df['cylinders']=df['cylinders'].apply(lambda item : item[0] if type(item)==list else item)

In [122]:
df['cylinders']=df['cylinders'].str.strip('\n')

In [123]:
df['cylinders'].value_counts(dropna=False) 

cylinders
4      13068
NaN     9778
3       3258
6       1013
5        799
8        539
2        103
0         35
1         21
12         8
7          6
26         1
16         1
Name: count, dtype: int64

## Fuel Consumption

In [124]:
df['fuel_consumption'].value_counts(dropna=False)

fuel_consumption
NaN                                                                            5245
[[0 l/100 km (comb.)]]                                                          330
[[0 l/100 km (comb.)], [0 l/100 km (city)], [0 l/100 km (country)]]             306
[[3.4 l/100 km (comb.)], [3.7 l/100 km (city)], [3.2 l/100 km (country)]]       170
[[4 l/100 km (comb.)]]                                                          149
                                                                               ... 
[[14.2 l/100 km (comb.)], [21.3 l/100 km (city)], [10 l/100 km (country)]]        1
[[3.5 l/100 km (comb.)], [4.4 l/100 km (city)], [2.9 l/100 km (country)]]         1
[[4.5 l/100 km (comb.)], [6.2 l/100 km (city)], [3.5 l/100 km (country)]]         1
[[4.9 l/100 km (comb.)], [99.9 l/100 km (city)], [98 l/100 km (country)]]         1
[[10.4 l/100 km (comb.)], [15.3 l/100 km (city)], [7.6 l/100 km (country)]]       1
Name: count, Length: 3453, dtype: int64

In [125]:
df['fuel_consumption'] = [item[0] if isinstance(item, list) else item for item in df['fuel_consumption']]
df['fuel_consumption']

0                            NaN
1                            NaN
2         [8.4 l/100 km (comb.)]
3         [7.3 l/100 km (comb.)]
4         [4.9 l/100 km (comb.)]
                  ...           
29474       [2 l/100 km (comb.)]
29475     [9.1 l/100 km (comb.)]
29476     [3.8 l/100 km (comb.)]
29477     [6.5 l/100 km (comb.)]
29478    [10.4 l/100 km (comb.)]
Name: fuel_consumption, Length: 28630, dtype: object

In [126]:
df['fuel_consumption']= df['fuel_consumption'].apply(lambda item : item[0] if type(item)==list else item)
df['fuel_consumption']

0                          NaN
1                          NaN
2         8.4 l/100 km (comb.)
3         7.3 l/100 km (comb.)
4         4.9 l/100 km (comb.)
                 ...          
29474       2 l/100 km (comb.)
29475     9.1 l/100 km (comb.)
29476     3.8 l/100 km (comb.)
29477     6.5 l/100 km (comb.)
29478    10.4 l/100 km (comb.)
Name: fuel_consumption, Length: 28630, dtype: object
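
At this point the column holds strings such as `8.4 l/100 km (comb.)`. If a numeric value is needed, one option is to capture the leading number only. The sketch below anchors the pattern at the start of the string so that the `100` in `l/100 km` is never picked up; `fuel_demo` is hypothetical toy data, and this step is not part of the cells above.

```python
import numpy as np
import pandas as pd

# Toy strings in the same format as the cleaned column (hypothetical data).
fuel_demo = pd.Series(["8.4 l/100 km (comb.)", "2 l/100 km (comb.)", np.nan])

# Anchor at the start: an integer or decimal number before the unit.
litres_text = fuel_demo.str.extract(r"^\s*(\d+(?:\.\d+)?)", expand=False)

# Convert to floats; missing cells stay NaN.
litres = pd.to_numeric(litres_text, errors="coerce")

print(litres.tolist())
```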

## Comfort Convenience

In [127]:
df['comfort_convenience'].value_counts(dropna=False)

comfort_convenience
NaN                                                                                                                     3197
[Air conditioning, Automatic climate control, Electrical side mirrors, Multi-function steering wheel, Power windows]     244
[Air conditioning, Automatic climate control, Cruise control]

In [128]:
df["comfort_convenience"] = [",".join(item) if isinstance(item, list) else item for item in df["comfort_convenience"]]

In [129]:
# Note: astype("str") turns missing values into the literal string "nan" (see the output below).
df["comfort_convenience"] = df["comfort_convenience"].astype("str")

In [130]:
df["comfort_convenience"] = df["comfort_convenience"].str.lower()

In [131]:
df['comfort_convenience'].value_counts(dropna=False)

comfort_convenience
nan                                                                                                                   3197
air conditioning, automatic climate control, electrical side mirrors, multi-function steering wheel, power windows     244
air conditioning, automatic climate control, cruise control
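
The lowercase `nan` row above is a string, not a missing value, so a later `isnull()` check would not count those rows as missing. If the missing values should stay missing, the string methods can be applied directly, because `.str` accessors pass NaN through. A minimal sketch on toy cells (the `demo` series is hypothetical):

```python
import numpy as np
import pandas as pd

# One list cell and one missing cell (hypothetical data).
demo = pd.Series([["Air conditioning", "Cruise control"], np.nan])

# Join list cells into one string; leave everything else untouched.
joined = demo.apply(lambda item: ",".join(item) if isinstance(item, list) else item)

# .str methods skip missing values, so NaN is preserved here.
lowered = joined.str.lower()

print(lowered.tolist())
print(lowered.isna().sum())
```

This is an alternative to the `astype("str")` step, not a description of what the cells above do.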

## Entertainment Media

In [132]:
df["entertainment_media"] = [",".join(item) if isinstance(item, list) else item for item in df["entertainment_media"]]

In [133]:
df["entertainment_media"] = df["entertainment_media"].astype("str")
df["entertainment_media"] = df["entertainment_media"].str.lower()

In [134]:
df["entertainment_media"].value_counts(dropna=False)

entertainment_media
nan                                    4986
bluetooth, usb                          753
bluetooth                               665
on-board computer                       662
cd player, on-board computer, radio     644

## Safety & Security

In [135]:
df["safety_security"].value_counts(dropna=False)

safety_security
NaN         3215
[Isofix]

In [136]:
df["safety_security"] = [",".join(item) if isinstance(item, list) else item for item in df["safety_security"]]

In [137]:
df["safety_security"] = df["safety_security"].astype("str")
df["safety_security"] = df["safety_security"].str.lower()

In [138]:
df["safety_security"].value_counts(dropna=False)

safety_security
nan       3215
isofix

## Extras

In [139]:
df["extras"].value_counts(dropna=False)

extras
NaN                            5150
[Alloy wheels]                 3408
[Alloy wheels, Roof rack]       380
[Alloy wheels, Sport seats]

In [140]:
df["extras"] = [",".join(item) if isinstance(item, list) else item for item in df["extras"]]
df["extras"] = df["extras"].astype("str")
df["extras"] = df["extras"].str.lower()

In [141]:
df["extras"].value_counts(dropna=False)

extras
nan                          5150
alloy wheels                 3408
alloy wheels, roof rack       380
alloy wheels, sport seats

## Empty Weight

In [142]:
df['empty_weight'].value_counts(dropna=False)

empty_weight
NaN               10872
[\n1,395 kg\n]      233
[\n1,055 kg\n]      224
[\n1,423 kg\n]      216
[\n1,165 kg\n]      200
                  ...  
[\n1,877 kg\n]        1
[\n1,011 kg\n]        1
[\n1,069 kg\n]        1
[\n983 kg\n]          1
[\n1,391 kg\n]        1
Name: count, Length: 1219, dtype: int64

In [143]:
df['empty_weight']=df['empty_weight'].apply(lambda item : item[0] if type(item)==list else item)

In [144]:
df['empty_weight']=df['empty_weight'].str.replace(',', '').str.extract(r'(\d+)')

In [145]:
df['empty_weight'].value_counts(dropna=False)

empty_weight
NaN     10872
1395      233
1055      224
1423      216
1165      200
        ...  
1877        1
1011        1
1069        1
983         1
1391        1
Name: count, Length: 1219, dtype: int64

## Model Code

In [146]:
df["model_code"].value_counts(dropna= False)

model_code
NaN                   20263
[\n, 8212/AFJ, \n]       75
[\n, 1727/AAM, \n]       64
[\n, 1349/AGI, \n]       61
[\n, 1889/ABU, \n]       55
                      ...  
[\n, 7593/ANL, \n]        1
[\n, 1727/ABC, \n]        1
[\n, 4136/AEC, \n]        1
[\n, 4136/668, \n]        1
[\n, 9101/449, \n]        1
Name: count, Length: 2187, dtype: int64

In [147]:
df["model_code"]=df["model_code"].apply(lambda item : item[0] if type(item)==list else item)

In [148]:
df["model_code"]=df["model_code"].str.strip('\n, ')
df["model_code"].value_counts(dropna= False)

model_code
NaN         20263
8212/AFJ       75
1727/AAM       64
1349/AGI       61
1889/ABU       55
            ...  
7593/ANL        1
1727/ABC        1
4136/AEC        1
4136/668        1
9101/449        1
Name: count, Length: 2187, dtype: int64

## General Inspection

In [149]:
df['general_inspection'].value_counts(dropna =False)

general_inspection
NaN        16376
New         5883
05/2023      286
08/2023      280
03/2023      268
           ...  
09/2017        1
08/2013        1
08/2020        1
08/2018        1
03/2021        1
Name: count, Length: 92, dtype: int64

## Last Service

In [150]:
df['last_service'].value_counts(dropna =False)

last_service
NaN        26627
09/2022      220
08/2022      196
06/2022      164
07/2022      155
           ...  
02/2018        1
02/2011        1
10/2013        1
08/2017        1
08/2019        1
Name: count, Length: 62, dtype: int64

## Full Service History

In [151]:
df['full_service_history'].value_counts(dropna =False)

full_service_history
NaN    16065
Yes    12565
Name: count, dtype: int64

## Non Smoker Vehicle

In [152]:
df["non_smoker_vehicle"].value_counts(dropna=False)

non_smoker_vehicle
NaN    17036
Yes    11594
Name: count, dtype: int64

## Emission Class

In [153]:
df["emission_class"].value_counts(dropna = False)

emission_class
NaN             10771
Euro 6           6418
Euro 6d-TEMP     3399
Euro 6d          2858
Euro 5           2389
Euro 4           1743
Euro 3            523
Euro 2            217
Euro 1            172
Euro 6c           140
Name: count, dtype: int64

In [154]:
df["emission_class"] = ["Euro 6" if pd.notna(x) and "Euro 6" in x else x for x in df["emission_class"]]

In [155]:
df["emission_class"].value_counts(dropna = False)

emission_class
Euro 6    12815
NaN       10771
Euro 5     2389
Euro 4     1743
Euro 3      523
Euro 2      217
Euro 1      172
Name: count, dtype: int64
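
The grouping of the Euro 6 variants can also be written without a Python-level loop. The sketch below uses `str.startswith` with `Series.mask`; note that it tests for a prefix, whereas the list comprehension above tests for a substring, so the two agree only when "Euro 6" appears at the start of the label (true for every value shown above). `emis_demo` is hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Toy labels covering Euro 6 variants, another class, and a missing value.
emis_demo = pd.Series(["Euro 6d-TEMP", "Euro 6c", "Euro 5", np.nan])

# na=False makes missing cells evaluate to False instead of NaN.
is_euro6 = emis_demo.str.startswith("Euro 6", na=False)

# Replace every flagged cell with the common label; others are left as they are.
emis_grouped = emis_demo.mask(is_euro6, "Euro 6")

print(emis_grouped.tolist())
```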

## Emissions Sticker

In [156]:
df["emissions_sticker"].value_counts(dropna = False)

emissions_sticker
NaN               19216
4 (Green)          9230
1 (No sticker)      176
3 (Yellow)            6
2 (Red)               2
Name: count, dtype: int64

## Upholstery Colour

In [157]:
df["upholstery_colour"].value_counts(dropna = False)

upholstery_colour
NaN       14061
Black     10416
Grey       2038
Other      1003
Beige       466
Brown       275
Red         159
White        93
Blue         82
Orange       23
Green         8
Yellow        6
Name: count, dtype: int64

## Upholstery

In [158]:
df["upholstery"].value_counts(dropna = False) 

upholstery
NaN             10020
Cloth            8736
Full leather     5439
Part leather     2835
alcantara         764
Other             628
Velour            208
Name: count, dtype: int64

## Production Date

In [159]:
df["production_date"].value_counts(dropna=False)

production_date
NaN        22722
2019.00     1179
2021.00      980
2022.00      894
2020.00      763
2018.00      762
2017.00      391
2016.00      239
2015.00      118
2014.00       90
2013.00       76
2010.00       53
2012.00       43
2011.00       43
2008.00       34
2009.00       26
2007.00       20
2006.00       17
2002.00       15
2004.00       13
2003.00       13
1967.00       12
2001.00       12
1966.00       11
1999.00        9
2005.00        9
1965.00        9
1970.00        9
1997.00        8
1994.00        8
1968.00        7
2000.00        7
1998.00        5
1990.00        3
1973.00        3
1987.00        3
1993.00        3
1991.00        3
1996.00        3
1995.00        2
1969.00        2
1978.00        2
1982.00        2
1961.00        1
1954.00        1
1988.00        1
1962.00        1
1981.00        1
1985.00        1
1976.00        1
Name: count, dtype: int64

In [160]:
df["production_date"] = pd.to_datetime(df["production_date"], format="%Y")


In [161]:
df["production_date"].value_counts(dropna=False)

production_date
NaT           22722
2019-01-01     1179
2021-01-01      980
2022-01-01      894
2020-01-01      763
2018-01-01      762
2017-01-01      391
2016-01-01      239
2015-01-01      118
2014-01-01       90
2013-01-01       76
2010-01-01       53
2012-01-01       43
2011-01-01       43
2008-01-01       34
2009-01-01       26
2007-01-01       20
2006-01-01       17
2002-01-01       15
2004-01-01       13
2003-01-01       13
1967-01-01       12
2001-01-01       12
1966-01-01       11
1999-01-01        9
2005-01-01        9
1965-01-01        9
1970-01-01        9
1997-01-01        8
1994-01-01        8
1968-01-01        7
2000-01-01        7
1998-01-01        5
1990-01-01        3
1973-01-01        3
1987-01-01        3
1993-01-01        3
1991-01-01        3
1996-01-01        3
1995-01-01        2
1969-01-01        2
1978-01-01        2
1982-01-01        2
1961-01-01        1
1954-01-01        1
1988-01-01        1
1962-01-01        1
1981-01-01        1
1985-01-01        1
1976

## Previous Owner

In [162]:
df["previous_owner"].value_counts(dropna=False) 

previous_owner
NaN                           14615
[[50 km, 06/2022], 1]            64
[[10 km, 08/2022], 1]            59
[[10 km, 07/2022], 1]            45
[[10 km, 09/2022], 1]            38
                              ...  
[[358,000 km, 10/2010], 2]        1
[[165,400 km, 09/2010], 1]        1
[[65,000 km, 10/2006], 1]         1
[[71,000 km, 04/2013], 1]         1
[[230,047 km, 07/2002], 5]        1
Name: count, Length: 11734, dtype: int64

In [163]:
df["previous_owner"] = [item[-1] if isinstance(item, list) else item for item in df["previous_owner"]]
df["previous_owner"].value_counts(dropna=False)

previous_owner
NaN    14615
1       9746
2       3221
3        699
4        184
5         69
6         37
7         22
8         16
9         14
12         3
10         2
14         1
13         1
Name: count, dtype: int64

## Other Fuel Types

In [164]:
df["other_fuel_types"].value_counts(dropna=False)

other_fuel_types
NaN             26317
Electricity      2301
Hydogen            11
Super E10 95        1
Name: count, dtype: int64

## Power Consumption

In [165]:
df['power_consumption'].value_counts(dropna=False)

power_consumption
NaN                        28115
0 kWh/100 km (comb.)         101
15.2 kWh/100 km (comb.)       30
15.7 kWh/100 km (comb.)       19
17.7 kWh/100 km (comb.)       17
                           ...  
18 kWh/100 km (comb.)          1
25.3 kWh/100 km (comb.)        1
12.7 kWh/100 km (comb.)        1
22.7 kWh/100 km (comb.)        1
20.4 kWh/100 km (comb.)        1
Name: count, Length: 105, dtype: int64

In [166]:
df["power_consumption"] = df["power_consumption"].str.extract(r"(\d+\.?\d*)")

In [167]:
df["power_consumption"].value_counts(dropna=False)

power_consumption
NaN     28115
0         101
15.2       30
15.7       19
17.7       17
        ...  
18          1
25.3        1
12.7        1
22.7        1
20.4        1
Name: count, Length: 105, dtype: int64

## Energy Efficiency Class

In [168]:
df["energy_efficiency_class"].value_counts(dropna=False)

energy_efficiency_class
NaN     20826
B        2090
A        1687
C        1133
A+       1089
D         636
A+++      375
G         309
E         271
F         147
A++        67
Name: count, dtype: int64

## Co Efficiency

In [169]:
df["co_efficiency"].value_counts(dropna=False)  # can be dropped

co_efficiency
NaN                                                                                           20826
Calculated on basis of measured CO₂-emissions taking into account the mass of the vehicle.     7804
Name: count, dtype: int64

## Fuel Consumption wltp

In [170]:
df["fuel_consumption_wltp"].value_counts(dropna=False)  # can be dropped

fuel_consumption_wltp
NaN              28530
5.5 l/100 km         9
5 l/100 km           8
5.4 l/100 km         5
6.8 l/100 km         5
6.5 l/100 km         4
5.7 l/100 km         4
7.3 l/100 km         4
5.2 l/100 km         3
4.9 l/100 km         3
6.6 l/100 km         3
6.7 l/100 km         3
5.9 l/100 km         2
6.1 l/100 km         2
5.6 l/100 km         2
4.7 l/100 km         2
4.4 l/100 km         2
5.8 l/100 km         2
6.3 l/100 km         2
6.2 l/100 km         2
8.4 l/100 km         2
16 l/100 km          2
7.6 l/100 km         2
12.6 l/100 km        2
9.3 l/100 km         2
1.5 l/100 km         1
12 l/100 km          1
5.3 l/100 km         1
8 l/100 km           1
8.2 l/100 km         1
4.8 l/100 km         1
4.1 l/100 km         1
4.2 l/100 km         1
4.3 l/100 km         1
1.1 l/100 km         1
3.9 l/100 km         1
6.4 l/100 km         1
8.6 l/100 km         1
10.3 l/100 km        1
12.2 l/100 km        1
12.3 l/100 km        1
8.3 l/100 km         1
7.2 l/100 km

In [171]:
df["fuel_consumption_wltp"] = df["fuel_consumption_wltp"].str.extract(r"(\d+\.?\d*)")

In [172]:
df["fuel_consumption_wltp"].value_counts(dropna=False)

fuel_consumption_wltp
NaN     28530
5.5         9
5           8
5.4         5
6.8         5
6.5         4
5.7         4
7.3         4
5.2         3
4.9         3
6.6         3
6.7         3
5.9         2
6.1         2
5.6         2
4.7         2
4.4         2
5.8         2
6.3         2
6.2         2
8.4         2
16          2
7.6         2
12.6        2
9.3         2
1.5         1
12          1
5.3         1
8           1
8.2         1
4.8         1
4.1         1
4.2         1
4.3         1
1.1         1
3.9         1
6.4         1
8.6         1
10.3        1
12.2        1
12.3        1
8.3         1
7.2         1
9.6         1
9.1         1
12.9        1
10.4        1
7.8         1
Name: count, dtype: int64

## Co Emissions wltp

In [173]:
df["co_emissions_wltp"].value_counts(dropna=False)  # can be dropped

co_emissions_wltp
NaN                 28514
0 g/km (comb.)         14
125 g/km (comb.)        8
130 g/km (comb.)        4
129 g/km (comb.)        4
                    ...  
211 g/km (comb.)        1
218 g/km (comb.)        1
159 g/km (comb.)        1
115 g/km (comb.)        1
97 g/km (comb.)         1
Name: count, Length: 69, dtype: int64

## Available From

In [174]:
df['available_from'].value_counts(dropna=False)  # can be dropped. Look again; it did not work.

available_from
NaN                     28237
[\n, 01/03/2023, \n]       34
[\n, 01/10/2026, \n]       23
[\n, 31/03/2023, \n]       17
[\n, 08/10/2022, \n]       15
                        ...  
[\n, 18/11/2022, \n]        1
[\n, 24/10/2022, \n]        1
[\n, 03/02/2023, \n]        1
[\n, 27/10/2022, \n]        1
[\n, 12/10/2022, \n]        1
Name: count, Length: 126, dtype: int64

In [175]:
df['available_from']=df['available_from'].apply(lambda item : item[0] if type(item)==list else item)

In [176]:
df['available_from']=df['available_from'].str.strip('\n, ')

In [177]:
df['available_from'].value_counts(dropna=False)

available_from
NaN           28237
01/03/2023       34
01/10/2026       23
31/03/2023       17
08/10/2022       15
              ...  
18/11/2022        1
24/10/2022        1
03/02/2023        1
27/10/2022        1
12/10/2022        1
Name: count, Length: 126, dtype: int64

In [178]:
df['available_from'] = pd.to_datetime(df['available_from'], format='%d/%m/%Y', errors='coerce')  # %m is the month directive (%M means minutes)

In [179]:
df['available_from'].value_counts(dropna=False)

available_from
NaT           28237
2023-03-01       34
2026-10-01       23
2023-03-31       17
2022-10-08       15
              ...  
2022-11-18        1
2022-10-24        1
2023-02-03        1
2022-10-27        1
2022-10-12        1
Name: count, Length: 126, dtype: int64
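
The case of the format directive matters for this column: `%m` is the month, while `%M` is the minute. With `%M`, the middle field of a `DD/MM/YYYY` string is read as minutes and the month silently defaults to January. A quick check on toy strings (the `dates_demo` series is hypothetical):

```python
import pandas as pd

# Two dates written as day/month/year (hypothetical data).
dates_demo = pd.Series(["01/03/2023", "31/03/2023"])

# %M parses the middle field as minutes; the month falls back to 1.
as_minutes = pd.to_datetime(dates_demo, format="%d/%M/%Y", errors="coerce")

# %m parses the middle field as the month, which is what is intended.
as_months = pd.to_datetime(dates_demo, format="%d/%m/%Y", errors="coerce")

print(as_minutes.tolist())
print(as_months.tolist())
```

Because neither call raises an error, inspecting a few parsed values (or `dt.month`) after conversion is a cheap sanity check.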

## Taxi or Rental Car

In [180]:
df["taxi_or_rental_car"].value_counts(dropna=False)

taxi_or_rental_car
NaN    28208
Yes      422
Name: count, dtype: int64

## Availability 

In [181]:
df["availability"].value_counts(dropna=False)

availability
NaN                                  28329
[\n, in 1 day after order, \n]          64
[\n, in 5 days after order, \n]         56
[\n, in 7 days after order, \n]         31
[\n, in 3 days after order, \n]         25
[\n, in 14 days after order, \n]        22
[\n, in 60 days after order, \n]        21
[\n, in 42 days after order, \n]        15
[\n, in 180 days after order, \n]       12
[\n, in 90 days after order, \n]        10
[\n, in 120 days after order, \n]       10
[\n, in 6 days after order, \n]          9
[\n, in 270 days after order, \n]        8
[\n, in 28 days after order, \n]         5
[\n, in 2 days after order, \n]          5
[\n, in 4 days after order, \n]          4
[\n, in 21 days after order, \n]         2
[\n, in 360 days after order, \n]        1
[\n, in 150 days after order, \n]        1
Name: count, dtype: int64

In [182]:
df["availability"] = df["availability"].apply(lambda item : item[0] if type(item)==list else item)
df["availability"] = df["availability"].str.extract(r"(\d+)")

In [183]:
df["availability"].value_counts(dropna=False)

availability
NaN    28329
1         64
5         56
7         31
3         25
14        22
60        21
42        15
180       12
90        10
120       10
6          9
270        8
28         5
2          5
4          4
21         2
360        1
150        1
Name: count, dtype: int64

## Last Timing Belt Change

In [184]:
df["last_timing_belt_change"].value_counts(dropna=False)

last_timing_belt_change
NaN        28058
04/2022       27
05/2021       26
08/2022       25
07/2022       24
           ...  
07/2018        1
10/2015        1
12/2019        1
11/2018        1
07/2016        1
Name: count, Length: 87, dtype: int64

In [185]:
df["last_timing_belt_change"] = pd.to_datetime(df["last_timing_belt_change"], format = "%m/%Y", errors = "coerce")
df["last_timing_belt_change"].value_counts(dropna=False)

last_timing_belt_change
NaT           28058
2022-04-01       27
2021-05-01       26
2022-08-01       25
2022-07-01       24
              ...  
2018-07-01        1
2015-10-01        1
2019-12-01        1
2018-11-01        1
2016-07-01        1
Name: count, Length: 87, dtype: int64
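
Because `errors="coerce"` silently turns unparseable strings into `NaT`, it can be worth checking that no real values were lost in the conversion. A small sketch on a hypothetical sample (the value `"13/2021"` is made up to illustrate a malformed month):

```python
import pandas as pd

# Hypothetical sample: one valid value, one malformed value, one missing value.
sample = pd.Series(["04/2022", "13/2021", None])

parsed = pd.to_datetime(sample, format="%m/%Y", errors="coerce")

# Values that were present in the input but became NaT after parsing.
lost = sample.notna() & parsed.isna()
print(sample[lost])
```

In the notebook output above, the `NaN` count before parsing (28058) equals the `NaT` count afterwards, which suggests that no existing value failed to parse.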

## Electric Range WLTP

In [186]:
df["electric_range_wltp"].value_counts(dropna = False) # can be dropped

electric_range_wltp
NaN                           28614
426 km492 km (within city)        2
389 km                            2
402 km484 km (within city)        1
50 km50 km (within city)          1
614 km681 km (within city)        1
573 km573 km (within city)        1
691 km691 km (within city)        1
351 km351 km (within city)        1
48 km48 km (within city)          1
450 km450 km (within city)        1
402 km402 km (within city)        1
540 km540 km (within city)        1
360 km                            1
384 km                            1
Name: count, dtype: int64
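
If this column were kept rather than dropped, the fused strings could be split into numeric columns. The sketch below is one possible approach, not the notebook's method. The column names `range_km` and `range_city_km` are hypothetical, and reading the first, unlabeled number as the overall range is an assumption:

```python
import pandas as pd

# Hypothetical sample in the same shape as the values shown above.
sample = pd.Series(["426 km492 km (within city)", "389 km", None])

# First number: unlabeled range (assumed overall); optional second number: within-city range.
pattern = r"^(?P<range_km>\d+) km(?:(?P<range_city_km>\d+) km \(within city\))?$"

# Named groups become column names; unmatched groups and missing rows become NaN.
ranges = sample.str.extract(pattern).apply(pd.to_numeric)
print(ranges)
```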

## Power Consumption WLTP

In [187]:
df["power_consumption_wltp"].value_counts(dropna=False) # can be dropped

power_consumption_wltp
NaN                28614
21.9 kWh/100 km        2
17.2 kWh/100 km        2
20.6 kWh/100 km        2
18.9 kWh/100 km        1
19.2 kWh/100 km        1
18.4 kWh/100 km        1
19.3 kWh/100 km        1
15.9 kWh/100 km        1
148 kWh/100 km         1
16.9 kWh/100 km        1
18.7 kWh/100 km        1
18.1 kWh/100 km        1
18.6 kWh/100 km        1
Name: count, dtype: int64

## Battery Ownership

In [188]:
df["battery_ownership"].value_counts(dropna=False)

battery_ownership
NaN         28623
Included        7
Name: count, dtype: int64

## Checking the Null Values

In [189]:
null_counts = df.isnull().sum()
total_rows = len(df)

null_info = pd.DataFrame({
    'Null_count': null_counts,
    'Percentage': (null_counts / total_rows) * 100
}).sort_values(by='Null_count', ascending=False)

print(null_info)

                         Null_count  Percentage
battery_ownership             28623       99.98
power_consumption_wltp        28614       99.94
electric_range_wltp           28614       99.94
fuel_consumption_wltp         28530       99.65
co_emissions_wltp             28514       99.59
availability                  28329       98.95
available_from                28237       98.63
taxi_or_rental_car            28208       98.53
power_consumption             28115       98.20
last_timing_belt_change       28058       98.00
last_service                  26627       93.00
other_fuel_types              26317       91.92
production_date               22722       79.36
co_efficiency                 20826       72.74
energy_efficiency_class       20826       72.74
model_code                    20263       70.78
emissions_sticker             19216       67.12
non_smoker_vehicle            17036       59.50
general_inspection            16376       57.20
full_service_history          16065     

In [190]:
null_info.head(10)

                         Null_count  Percentage
battery_ownership             28623       99.98
power_consumption_wltp        28614       99.94
electric_range_wltp           28614       99.94
fuel_consumption_wltp         28530       99.65
co_emissions_wltp             28514       99.59
availability                  28329       98.95
available_from                28237       98.63
taxi_or_rental_car            28208       98.53
power_consumption             28115       98.20
last_timing_belt_change       28058       98.00


## Removing the columns with a high number of missing values

In [191]:
# Drop the selected sparse columns in a single call.
cols_to_drop = [
    'battery_ownership',
    'power_consumption_wltp',
    'electric_range_wltp',
    'last_timing_belt_change',
    'availability',
    'available_from',
    'fuel_consumption_wltp',
    'co_emissions_wltp',
]
df.drop(columns=cols_to_drop, inplace=True)
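
The columns above were picked by hand. An alternative is to select columns by their share of missing values. The sketch below uses a hypothetical toy frame and an assumed cutoff of 70%; it is not the rule applied in this notebook, which keeps some columns that are more than 98% empty (for example `taxi_or_rental_car`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "price": [10000, 12000, 9000, 15000],
    "battery_ownership": [np.nan, np.nan, np.nan, "Included"],
    "gearbox": ["Manual", np.nan, "Automatic", "Manual"],
})

threshold = 0.70  # assumed cutoff, chosen only for illustration

# isna().mean() gives the fraction of missing values per column.
missing_share = toy.isna().mean()
to_drop = missing_share[missing_share > threshold].index.tolist()

cleaned = toy.drop(columns=to_drop)
print(to_drop)
print(cleaned.columns.tolist())
```

A fixed cutoff is easy to reproduce, but it can discard sparse columns that still carry signal, so the list is worth reviewing before dropping.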

## Checking the Columns again

In [192]:
df.columns

Index(['make_model', 'short_description', 'make', 'model', 'location', 'price',
       'body_type', 'type', 'doors', 'country_version', 'offer_number',
       'warranty', 'mileage', 'first_registration', 'gearbox', 'fuel_type',
       'colour', 'paint', 'desc', 'seller', 'seats', 'power', 'engine_size',
       'gears', 'co_emissions', 'manufacturer_colour', 'drivetrain',
       'cylinders', 'fuel_consumption', 'comfort_convenience',
       'entertainment_media', 'safety_security', 'extras', 'empty_weight',
       'model_code', 'general_inspection', 'last_service',
       'full_service_history', 'non_smoker_vehicle', 'emission_class',
       'emissions_sticker', 'upholstery_colour', 'upholstery',
       'production_date', 'previous_owner', 'other_fuel_types',
       'power_consumption', 'energy_efficiency_class', 'co_efficiency',
       'taxi_or_rental_car', 'country', 'city'],
      dtype='object')

In [193]:
df.shape

(28630, 52)

## Exporting the dataframe into a csv file

In [194]:
df.to_csv("df_cleaned.csv")
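
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read without `index_col`. A minimal sketch of the difference, using in-memory buffers and a hypothetical toy frame instead of the real file:

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the cleaned dataset.
toy = pd.DataFrame({"make": ["Audi", "BMW"], "price": [10000, 12000]})

# Default export: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'make', 'price']
print(cols_without_index)  # ['make', 'price']
```

Either export with `index=False` or read the file back with `index_col=0`; what matters is that the later parts of the project use the matching convention.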