# Germany's Used Car Market - Data Cleaning & E.D.A.

In this project we take a closer look at the state of Germany's used car market. Using the dataset found on [Kaggle](https://www.kaggle.com/datasets/wspirat/germany-used-cars-dataset-2023), we clean, analyse and explore the data to learn as much as possible about this market and to answer various questions about it.

## Part 0 - Importing the Libraries & Taking a Glance at the Data

In [18]:
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
#import seaborn as sns

raw_df = pd.read_csv('Data - German_Used_Cars.csv')
df = raw_df.copy()

In [19]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,brand,model,color,registration_date,year,price_in_euro,power_kw,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,fuel_consumption_g_km,mileage_in_km,offer_description
0,0,alfa-romeo,Alfa Romeo GTV,red,10/1995,1995,1300,148,201,Manual,Petrol,"10,9 l/100 km",260 g/km,160500.0,2.0 V6 TB
1,1,alfa-romeo,Alfa Romeo 164,black,02/1995,1995,24900,191,260,Manual,Petrol,,- (g/km),190000.0,"Q4 Allrad, 3.2L GTA"
2,2,alfa-romeo,Alfa Romeo Spider,black,02/1995,1995,5900,110,150,Unknown,Petrol,,- (g/km),129000.0,ALFA ROME 916
3,3,alfa-romeo,Alfa Romeo Spider,black,07/1995,1995,4900,110,150,Manual,Petrol,"9,5 l/100 km",225 g/km,189500.0,2.0 16V Twin Spark L
4,4,alfa-romeo,Alfa Romeo 164,red,11/1996,1996,17950,132,179,Manual,Petrol,"7,2 l/100 km",- (g/km),96127.0,"3.0i Super V6, absoluter Topzustand !"
5,5,alfa-romeo,Alfa Romeo Spider,red,04/1996,1996,7900,110,150,Manual,Petrol,"9,5 l/100 km",225 g/km,47307.0,2.0 16V Twin Spark
6,6,alfa-romeo,Alfa Romeo 145,red,12/1996,1996,3500,110,150,Manual,Petrol,"8,8 l/100 km",210 g/km,230000.0,Quadrifoglio
7,7,alfa-romeo,Alfa Romeo 164,black,07/1996,1996,5500,132,179,Manual,Petrol,"13,4 l/100 km",320 g/km,168000.0,(3.0) V6 Super
8,8,alfa-romeo,Alfa Romeo Spider,black,07/1996,1996,8990,141,192,Manual,Petrol,11 l/100 km,265 g/km,168600.0,|HU:neu|Klimaanlage|Youngtimer|
9,9,alfa-romeo,Alfa Romeo Spider,black,01/1996,1996,6976,110,150,Manual,Petrol,"9,2 l/100 km",220 g/km,99000.0,2.0 T.Spark L *Klima *2.Hand *Zahnriemen


In [20]:
df.shape

(251079, 15)

In [21]:
df.columns

Index(['Unnamed: 0', 'brand', 'model', 'color', 'registration_date', 'year',
       'price_in_euro', 'power_kw', 'power_ps', 'transmission_type',
       'fuel_type', 'fuel_consumption_l_100km', 'fuel_consumption_g_km',
       'mileage_in_km', 'offer_description'],
      dtype='object')

In [22]:
df.dtypes

Unnamed: 0                    int64
brand                        object
model                        object
color                        object
registration_date            object
year                         object
price_in_euro                object
power_kw                     object
power_ps                     object
transmission_type            object
fuel_type                    object
fuel_consumption_l_100km     object
fuel_consumption_g_km        object
mileage_in_km               float64
offer_description            object
dtype: object

In [23]:
df.isna().sum()

Unnamed: 0                      0
brand                           0
model                           0
color                         166
registration_date               4
year                            0
price_in_euro                   0
power_kw                      134
power_ps                      129
transmission_type               0
fuel_type                       0
fuel_consumption_l_100km    26873
fuel_consumption_g_km           0
mileage_in_km                 152
offer_description               1
dtype: int64

After our brief glance at the data, there are some things we need to keep in mind moving forward. More specifically:

- There is an extra column, namely 'Unnamed: 0', which does not serve any significant purpose and thus must be *dropped*.

- There is a considerable number of missing values relative to the number of rows in the dataset, and we have to treat them carefully so that we lose as little data as possible during the 'cleaning' phase.

- The majority of the columns in the dataset have the object data type (in particular **strings**), and some should be converted to their appropriate data types.
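
The three observations above can be gathered into a single per-column overview. The following is a minimal sketch on a made-up toy frame; the helper name `column_overview` and the toy values are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd


def column_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count and missing share (%) for every column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "missing_pct": (frame.isna().mean() * 100).round(2),
    })


# toy data (hypothetical), loosely shaped like a few of the dataset's columns
toy = pd.DataFrame({
    "color": ["red", None, "black", "red"],
    "power_ps": ["201", "260", None, None],
    "mileage_in_km": [160500.0, 190000.0, 129000.0, 189500.0],
})

overview = column_overview(toy)
print(overview)
```

Calling `column_overview(df)` on the real dataframe would show, side by side, which columns are still stored as strings and how many values each one is missing.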

## Part 1 - Data Cleaning

To improve the results of our analysis, we must first clean our data, making it more accurate and removing unwanted clutter. The following steps take the necessary actions to do that.

In [24]:
# dropping the columns we won't be using
df.drop(['Unnamed: 0', 'offer_description', 'power_kw'], axis=1, inplace=True)

In [25]:
df.fuel_type.unique()

array(['Petrol', 'Diesel', 'Hybrid', 'LPG', 'Other', '07/2004',
       '74.194 km', '110.250 km', '06/2014', 'CNG', 'Diesel Hybrid',
       '12/2016', 'Automatic', 'Electric', '12/2019', 'Unknown',
       '06/2023', 'Ethanol', 'Manual', '300.000 km', '264.000 km',
       'KETTE NEUE', '108.313 km', '05/2009', '180.000 km', '04/2013',
       '03/2014', '08/2014', '01/2016', '03/2017', '04/2008', '07/2007',
       '145.500 km', '12/2012', '25890', '10/2022', '06/2004', '09/2009',
       '12/2014', '02/2017', '12890', '11/2018', '08/2018', '03/2019',
       '19450', '11/2021', '20.600 km', 'Hydrogen', '07/2022', '05/2015',
       '03/2018', '04/2022', '160.629 km', '144.919 km', '02/1996',
       '04/2000', '200.000 km', '06/2009', '185.500 km', '13000',
       '05/2012', '11/2014', '10/2015', '350.000 km', '49.817 km',
       '34900', '35.487 km', '03/2021', '26890', '26990', '4.000 km',
       '11/2005', '07/2005', '08/2011', '02/2011', '03/2011', '10/2013',
       '09/2015', '02/2018',

In [26]:
#filtering the dataframe to include only the following fuels in the 'fuel_type' column.
fuels = ['Petrol', 'Diesel', 'Electric', 'Hybrid', 'LPG', 'CNG', 'Diesel Hybrid', 'Ethanol', 'Hydrogen']
df = df[df.fuel_type.isin(fuels)]

In [27]:
df.shape

(250606, 12)
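
The row count drops from 251079 to 250606, so the filter removes 473 rows. Before discarding rows like these, it can help to look at what exactly is being rejected. The sketch below does this on a small toy column; the toy values are taken from the `unique()` output above, and the variable names are assumptions for illustration.

```python
import pandas as pd

# toy column mixing valid fuel types with the kind of shifted values seen above
toy = pd.DataFrame({
    "fuel_type": ["Petrol", "Diesel", "07/2004", "74.194 km", "Electric", "Unknown"]
})
fuels = ['Petrol', 'Diesel', 'Electric', 'Hybrid', 'LPG', 'CNG',
         'Diesel Hybrid', 'Ethanol', 'Hydrogen']

is_valid = toy["fuel_type"].isin(fuels)

# values that the filter would throw away, with their frequencies
rejected = toy.loc[~is_valid, "fuel_type"].value_counts()
# rows that survive the filter
kept = toy[is_valid]

print(rejected)
print(len(kept), "rows kept")
```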

In [28]:
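# extracting the numeric part of 'fuel_consumption_l_100km' (e.g. "10,9 l/100 km" -> "10,9")
# note: the pattern requires a decimal comma, so whole-number entries such as "11 l/100 km" become NaN
df['fuel_consumption_l_100km'] = df['fuel_consumption_l_100km'].str.extract(r'(\d+,\d+)', expand=False)
# swapping the decimal comma for a decimal point so the values can be cast to float
df['fuel_consumption_l_100km'] = df['fuel_consumption_l_100km'].str.replace(',', '.', regex=False)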
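
To see what the extraction pattern does and does not capture, here is a small sketch on sample strings taken from the `head()` output. The second pattern is a hypothetical alternative that makes the decimal part optional and ties the number to the unit. It is shown only for comparison and is not what the notebook uses, so the statistics below would differ if it were swapped in.

```python
import pandas as pd

samples = pd.Series(["10,9 l/100 km", "11 l/100 km", None])

# pattern used in the notebook: digits, a decimal comma, digits
strict = samples.str.extract(r"(\d+,\d+)", expand=False)

# hypothetical alternative: optional decimal part, anchored to the "l/100 km" unit
lenient = samples.str.extract(r"^(\d+(?:,\d+)?)\s*l/100 km", expand=False)

# decimal comma -> decimal point, then cast to float
strict_values = strict.str.replace(",", ".", regex=False).astype(float)
lenient_values = lenient.str.replace(",", ".", regex=False).astype(float)

print(strict_values.tolist())
print(lenient_values.tolist())
```

With the strict pattern, the whole-number entry is treated as missing; the alternative keeps it as 11.0.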

Converting the fuel consumption column to float now lets us summarise it numerically and gives us more insight into our data.

In [29]:
df['fuel_consumption_l_100km'] = df['fuel_consumption_l_100km'].astype(float)
df.fuel_consumption_l_100km.describe()

count    200567.000000
mean          6.062741
std           1.971583
min           0.100000
25%           4.900000
50%           5.700000
75%           6.800000
max          99.900000
Name: fuel_consumption_l_100km, dtype: float64

From the summary above, we can see that there are some outliers in our fuel consumption data. A vehicle cannot realistically consume as little as 0.1 l/100 km on fuel alone; values that low are plausible only for vehicles that also rely on a second energy source (Hybrids), so we must assume that some entries contain incorrect information. At the other extreme, the maximum value in the column is 99.9 l/100 km, far above what any normal vehicle would consume.

Let's examine those outliers in detail.


In [30]:
df[df.fuel_consumption_l_100km < 2.5]['fuel_type'].value_counts()

Hybrid           2795
Diesel Hybrid     142
Petrol            123
Hydrogen           22
Diesel              7
Name: fuel_type, dtype: int64

As we predicted, the majority of vehicles with such low fuel consumption values do rely on an additional energy source, which for Hybrids is the battery. For Diesels and Petrols, such low values are virtually impossible. For now, let's simply set these Petrol and Diesel values to 0.

In [31]:
# Petrol and Diesel rows with an implausibly low fuel consumption
cond = (df.fuel_consumption_l_100km < 2.5) & (df.fuel_type.isin(['Petrol', 'Diesel']))
df.loc[cond, 'fuel_consumption_l_100km'] = 0
df[cond]

Unnamed: 0,brand,model,color,registration_date,year,price_in_euro,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,fuel_consumption_g_km,mileage_in_km
14396,audi,Audi A3,white,08/2017,2017,19790,150,Automatic,Petrol,0.0,40 g/km,69999.0
15882,audi,Audi A3,silver,08/2018,2018,23870,150,Automatic,Petrol,0.0,38 g/km,53500.0
15901,audi,Audi A3,silver,05/2018,2018,24420,150,Automatic,Petrol,0.0,40 g/km,17400.0
18779,audi,Audi A4,grey,05/2020,2020,29440,190,Automatic,Diesel,0.0,120 g/km,115314.0
19559,audi,Audi A6,white,12/2020,2020,38890,299,Automatic,Petrol,0.0,34 g/km,44798.0
...,...,...,...,...,...,...,...,...,...,...,...,...
249698,volvo,Volvo XC90,grey,12/2020,2020,65900,303,Automatic,Petrol,0.0,55 g/km,20000.0
250097,volvo,Volvo XC90,white,10/2020,2020,51840,392,Automatic,Petrol,0.0,52 g/km,69987.0
250109,volvo,Volvo V90,white,07/2020,2020,44950,303,Automatic,Petrol,0.0,43 g/km,17360.0
250586,volvo,Volvo XC90,blue,05/2022,2022,81850,455,Automatic,Petrol,0.0,34 g/km,9500.0


As for the other end of the spectrum, there is only a handful of entries with implausibly high fuel consumption values, so we will set these to 0 as well.

In [32]:
# rows with an implausibly high fuel consumption (above 40 l/100 km)
filt = (df.fuel_consumption_l_100km < 100) & (df.fuel_consumption_l_100km > 40)
df.loc[filt, 'fuel_consumption_l_100km'] = 0
df[df.fuel_consumption_l_100km == 0]

Unnamed: 0,brand,model,color,registration_date,year,price_in_euro,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,fuel_consumption_g_km,mileage_in_km
14396,audi,Audi A3,white,08/2017,2017,19790,150,Automatic,Petrol,0.0,40 g/km,69999.0
15882,audi,Audi A3,silver,08/2018,2018,23870,150,Automatic,Petrol,0.0,38 g/km,53500.0
15901,audi,Audi A3,silver,05/2018,2018,24420,150,Automatic,Petrol,0.0,40 g/km,17400.0
18779,audi,Audi A4,grey,05/2020,2020,29440,190,Automatic,Diesel,0.0,120 g/km,115314.0
19559,audi,Audi A6,white,12/2020,2020,38890,299,Automatic,Petrol,0.0,34 g/km,44798.0
...,...,...,...,...,...,...,...,...,...,...,...,...
249698,volvo,Volvo XC90,grey,12/2020,2020,65900,303,Automatic,Petrol,0.0,55 g/km,20000.0
250097,volvo,Volvo XC90,white,10/2020,2020,51840,392,Automatic,Petrol,0.0,52 g/km,69987.0
250109,volvo,Volvo V90,white,07/2020,2020,44950,303,Automatic,Petrol,0.0,43 g/km,17360.0
250586,volvo,Volvo XC90,blue,05/2022,2022,81850,455,Automatic,Petrol,0.0,34 g/km,9500.0


In [33]:
df.fuel_consumption_l_100km.describe()

count    200567.000000
mean          6.059277
std           1.925611
min           0.000000
25%           4.900000
50%           5.700000
75%           6.800000
max          31.400000
Name: fuel_consumption_l_100km, dtype: float64
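
The two masking steps above can be reproduced on a small toy frame to check that only the intended rows are touched. This is a minimal sketch with made-up values; the thresholds (2.5 and 40 l/100 km) are the ones used in the cells above.

```python
import pandas as pd

# toy data (hypothetical values)
toy = pd.DataFrame({
    "fuel_type": ["Petrol", "Hybrid", "Diesel", "Petrol", "Diesel"],
    "fuel_consumption_l_100km": [1.5, 1.2, 0.1, 6.4, 99.9],
})

# implausibly low values, but only for pure Petrol/Diesel vehicles
too_low = (toy["fuel_consumption_l_100km"] < 2.5) & toy["fuel_type"].isin(["Petrol", "Diesel"])
# implausibly high values, regardless of fuel type
too_high = toy["fuel_consumption_l_100km"] > 40

# a boolean mask can be passed to .loc directly
toy.loc[too_low | too_high, "fuel_consumption_l_100km"] = 0

print(toy)
```

The Hybrid row keeps its low value, while the two low Petrol/Diesel rows and the 99.9 row are set to 0.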

Now, let's take the time to address the missing values in our dataframe. During our brief look at the dataset at the beginning of the project we noted that most of them are located in the Fuel Consumption column. Before we proceed with any data manipulation, let's take a moment to question why some of these values are missing.

- Electric vehicles do not consume fuel, but instead rely on a battery to move.

- The dataset is a collection of online car listings, so many entries are subject to human error. The fuel consumption entry is no exception, as many car owners may simply not know this figure for their car.

In [34]:
# number of missing fuel consumption values per fuel type
missing_values_df = pd.DataFrame(df[df.fuel_consumption_l_100km.isna()]['fuel_type'].value_counts())
missing_values_df.columns = ['Missing Values']
# total number of rows per fuel type (aligned on the fuel type index)
missing_values_df['Total Rows'] = df['fuel_type'].value_counts()
missing_values_df['% of Missing Values'] = round(
    missing_values_df['Missing Values'] / missing_values_df['Total Rows'] * 100, 2)
missing_values_df.sort_values('% of Missing Values', ascending=False)

Unnamed: 0,Missing Values,Total Rows,% of Missing Values
Electric,5951,5967,99.73
Hydrogen,58,82,70.73
Ethanol,6,10,60.0
Diesel Hybrid,150,476,31.51
Hybrid,3884,12607,30.81
CNG,137,508,26.97
LPG,290,1255,23.11
Diesel,15289,86421,17.69
Petrol,24274,143280,16.94


What does the above table tell us? Almost all of the rows for electric vehicles have a missing value in the Fuel Consumption column. Again, this makes sense, because electric vehicles do not have a fuel consumption value.

Dropping these rows from our dataset now would be disastrous for our later analysis. On the other hand, we can drop the Ethanol rows: there are very few of them compared to the total row count, and the share of missing values among them is high.

So, for now, let's simply replace the missing values of all the fuel types with zero and drop the entries which we do not need.

In [35]:
# removing the 'Ethanol' fuel type rows
df = df[df['fuel_type'] != 'Ethanol']

# replacing the nan values with zero for the rest of the entries and printing the results
df['fuel_consumption_l_100km'] = df['fuel_consumption_l_100km'].fillna(0)
na_fuel_consumption_rows = df['fuel_consumption_l_100km'].isna().sum()
print('Number of rows with missing values in the Fuel Consumption column after cleaning: ', na_fuel_consumption_rows)


Number of rows with missing values in the Fuel Consumption column after cleaning:  0


In [36]:
df

Unnamed: 0,brand,model,color,registration_date,year,price_in_euro,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,fuel_consumption_g_km,mileage_in_km
0,alfa-romeo,Alfa Romeo GTV,red,10/1995,1995,1300,201,Manual,Petrol,10.9,260 g/km,160500.0
1,alfa-romeo,Alfa Romeo 164,black,02/1995,1995,24900,260,Manual,Petrol,0.0,- (g/km),190000.0
2,alfa-romeo,Alfa Romeo Spider,black,02/1995,1995,5900,150,Unknown,Petrol,0.0,- (g/km),129000.0
3,alfa-romeo,Alfa Romeo Spider,black,07/1995,1995,4900,150,Manual,Petrol,9.5,225 g/km,189500.0
4,alfa-romeo,Alfa Romeo 164,red,11/1996,1996,17950,179,Manual,Petrol,7.2,- (g/km),96127.0
...,...,...,...,...,...,...,...,...,...,...,...,...
251074,volvo,Volvo XC40,white,04/2023,2023,57990,261,Automatic,Hybrid,0.0,43 km Reichweite,1229.0
251075,volvo,Volvo XC90,white,03/2023,2023,89690,235,Automatic,Diesel,7.6,202 g/km,4900.0
251076,volvo,Volvo V60,white,05/2023,2023,61521,197,Automatic,Diesel,4.7,125 g/km,1531.0
251077,volvo,Volvo XC40,white,05/2023,2023,57890,179,Automatic,Hybrid,0.0,45 km Reichweite,1500.0


We still need to address the electric vehicles, whose missing fuel consumption values we replaced with zero in the previous step, so let's do that now.

In [37]:
df[df.fuel_type == 'Electric']

Unnamed: 0,brand,model,color,registration_date,year,price_in_euro,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,fuel_consumption_g_km,mileage_in_km
16552,audi,Audi e-tron,beige,09/2019,2019,51888,408,Automatic,Electric,0.0,359 km Reichweite,84800.0
16559,audi,Audi e-tron,beige,07/2019,2019,53990,408,Automatic,Electric,0.0,359 km Reichweite,51000.0
16561,audi,Audi e-tron,beige,11/2019,2019,54870,408,Automatic,Electric,0.0,0 g/km,82814.0
16571,audi,Audi e-tron,beige,12/2019,2019,61989,408,Automatic,Electric,0.0,0 g/km,55990.0
16579,audi,Audi e-tron,blue,02/2019,2019,32930,408,Automatic,Electric,0.0,359 km Reichweite,84300.0
...,...,...,...,...,...,...,...,...,...,...,...,...
251033,volvo,Volvo C40,black,05/2023,2023,52890,231,Automatic,Electric,0.0,400 km Reichweite,8.0
251037,volvo,Volvo XC40,black,04/2023,2023,49900,231,Automatic,Electric,0.0,0 g/km,14900.0
251048,volvo,Volvo C40,black,01/2023,2023,51990,231,Automatic,Electric,0.0,0 g/km,2106.0
251056,volvo,Volvo C40,black,05/2023,2023,60520,231,Automatic,Electric,0.0,400 km Reichweite,3000.0


For many electric cars, the 'fuel_consumption_g_km' column contains the battery range (*Reichweite* is German for *range*), so we can extract that information into a new column with the battery range values.

In [38]:
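el_vehicles = df[df.fuel_type == 'Electric'].index.values  # index values of all electric vehicles
# taking the first run of digits for electric vehicles only; all other rows get NaN
# note: entries such as "0 g/km" also match and yield a range of 0
df = df.assign(range_in_km=df.loc[el_vehicles, 'fuel_consumption_g_km'].str.extract(pat=r'(\d+)', expand=False))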
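
The extraction can be tried out on a few sample strings from the table above. The sketch below contrasts the notebook's pattern, which takes the first run of digits whatever the unit, with a hypothetical stricter pattern that only matches values followed by "km Reichweite". The stricter pattern is an assumption shown for comparison, not the notebook's method.

```python
import pandas as pd

# sample strings taken from the outputs above
toy = pd.DataFrame({
    "fuel_type": ["Electric", "Electric", "Petrol"],
    "fuel_consumption_g_km": ["359 km Reichweite", "0 g/km", "260 g/km"],
})

is_electric = toy["fuel_type"] == "Electric"
electric_values = toy.loc[is_electric, "fuel_consumption_g_km"]

# pattern from the notebook: first run of digits, regardless of the unit
first_digits = electric_values.str.extract(r"(\d+)", expand=False)

# hypothetical stricter pattern: digits only when followed by "km Reichweite"
range_only = electric_values.str.extract(r"(\d+)\s*km Reichweite", expand=False)

# assign() aligns on the index, so non-electric rows receive NaN
toy = toy.assign(range_in_km=pd.to_numeric(range_only))

print(first_digits.tolist())
print(toy)
```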

In [39]:
# dropping unnecessary columns and rearranging
df = df[['brand', 'model', 'color', 'registration_date', 'year', 'power_ps',
         'transmission_type', 'fuel_type', 'fuel_consumption_l_100km', 'range_in_km',
         'mileage_in_km', 'price_in_euro']]
df.sample(10)

Unnamed: 0,brand,model,color,registration_date,year,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,range_in_km,mileage_in_km,price_in_euro
110426,mercedes-benz,Mercedes-Benz C 350,grey,01/2011,2011,231,Automatic,Diesel,0.0,,187520.0,9950
191279,skoda,Skoda Octavia,blue,02/2015,2015,179,Automatic,Petrol,5.7,,213000.0,11995
229883,volkswagen,Volkswagen Golf Sportsvan,grey,09/2015,2015,150,Manual,Petrol,5.4,,104102.0,13870
212979,toyota,Toyota Proace,silver,05/2023,2023,177,Automatic,Diesel,5.4,,10.0,46960
104682,mercedes-benz,Mercedes-Benz E 200,silver,08/2004,2004,163,Manual,Petrol,0.0,,218395.0,5850
204099,smart,smart forFour,red,06/2016,2016,71,Automatic,Petrol,0.0,,56000.0,13490
163226,peugeot,Peugeot 508,blue,08/2022,2022,131,Automatic,Petrol,6.4,,9088.0,37890
84967,jaguar,Jaguar E-Pace,red,05/2019,2019,249,Automatic,Petrol,8.3,,39200.0,34990
192823,skoda,Skoda Fabia,grey,11/2017,2017,75,Manual,Petrol,4.8,,110000.0,9900
58118,ford,Ford Fiesta,grey,10/2005,2005,80,Manual,Petrol,6.5,,154030.0,2000


Since most of our columns still contain string-based values, it would be beneficial to convert them into their appropriate data types.

In [40]:
df.dtypes

brand                        object
model                        object
color                        object
registration_date            object
year                         object
power_ps                     object
transmission_type            object
fuel_type                    object
fuel_consumption_l_100km    float64
range_in_km                  object
mileage_in_km               float64
price_in_euro                object
dtype: object

In [41]:
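# casting the numeric columns to float (those still stored as strings get parsed)
columns_to_numeric = ['power_ps', 'fuel_consumption_l_100km', 'range_in_km', 'price_in_euro']
for column in columns_to_numeric:
    df[column] = df[column].astype(float)

# parsing the 'MM/YYYY' strings into datetimes and deriving the year from them
df['registration_date'] = pd.to_datetime(df['registration_date'], format='%m/%Y')
df['year'] = df['registration_date'].dt.year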
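
The conversions can be illustrated on a small toy frame. The sketch below uses `pd.to_numeric` with `errors="coerce"` as a more forgiving alternative to `astype(float)`: unparseable strings become NaN instead of raising an error. The string "not a number" is a made-up value for illustration.

```python
import pandas as pd

# toy data: the dates come from the head() output, "not a number" is hypothetical
toy = pd.DataFrame({
    "power_ps": ["201", "260", "not a number"],
    "registration_date": ["10/1995", "02/1995", "07/1996"],
})

# errors="coerce" turns unparseable entries into NaN instead of raising
toy["power_ps"] = pd.to_numeric(toy["power_ps"], errors="coerce")

# 'MM/YYYY' strings become timestamps set to the first day of that month
toy["registration_date"] = pd.to_datetime(toy["registration_date"], format="%m/%Y")
toy["year"] = toy["registration_date"].dt.year

print(toy.dtypes)
print(toy)
```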

In [42]:
df.dtypes

brand                               object
model                               object
color                               object
registration_date           datetime64[ns]
year                                 int64
power_ps                           float64
transmission_type                   object
fuel_type                           object
fuel_consumption_l_100km           float64
range_in_km                        float64
mileage_in_km                      float64
price_in_euro                      float64
dtype: object

In [43]:
df.isna().sum()

brand                            0
model                            0
color                          166
registration_date                0
year                             0
power_ps                       115
transmission_type                0
fuel_type                        0
fuel_consumption_l_100km         0
range_in_km                 244645
mileage_in_km                   62
price_in_euro                    0
dtype: int64

In [44]:
# dropping the rows that have a missing value in any of these columns
cols = ['color', 'power_ps', 'mileage_in_km']
df = df.dropna(subset=cols)
df.sample(10)

Unnamed: 0,brand,model,color,registration_date,year,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,range_in_km,mileage_in_km,price_in_euro
66641,ford,Ford Kuga,white,2017-02-01,2017,150.0,Manual,Petrol,6.2,,79200.0,16490.0
9939,audi,Audi A3,grey,2015-11-01,2015,150.0,Manual,Petrol,4.7,,85079.0,15820.0
36410,bmw,BMW 420,grey,2017-11-01,2017,190.0,Automatic,Diesel,4.5,,104492.0,27970.0
88792,kia,Kia Sportage,white,2016-06-01,2016,177.0,Manual,Petrol,7.3,,84232.0,17979.0
70170,ford,Ford Fiesta,silver,2019-04-01,2019,86.0,Manual,Petrol,5.1,,55225.0,12700.0
186162,seat,SEAT Arona,blue,2022-03-01,2022,110.0,Manual,Petrol,0.0,,250.0,23770.0
85237,jaguar,Jaguar F-Type,grey,2020-06-01,2020,381.0,Automatic,Petrol,0.0,,19000.0,60750.0
144143,opel,Opel Zafira,grey,2014-02-01,2014,110.0,Manual,Diesel,0.0,,143512.0,8500.0
179635,seat,SEAT,black,2017-07-01,2017,150.0,Automatic,Diesel,0.0,,69900.0,18890.0
31159,bmw,BMW 135,grey,2013-09-01,2013,320.0,Automatic,Petrol,0.0,,82000.0,25999.0


We have successfully cleaned our dataset of NaN and "garbage" values. At this point, we can export our dataset back to a CSV file so that we can build visualizations using either Tableau or Power BI.

In [45]:
df.to_csv('New - German_Used_Cars.csv', index=False)
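
One thing to keep in mind when exporting: a CSV file stores no data types, so the datetime column is written as text and has to be parsed again on the way back in. The sketch below shows a round trip through an in-memory buffer instead of a file; the toy values and variable names are assumptions for illustration.

```python
import io

import pandas as pd

# toy data (hypothetical rows with the same column types as the cleaned dataset)
toy = pd.DataFrame({
    "brand": ["alfa-romeo", "volvo"],
    "registration_date": pd.to_datetime(["10/1995", "05/2023"], format="%m/%Y"),
    "price_in_euro": [1300.0, 57890.0],
})

# write to an in-memory buffer rather than to disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when reading the data back
restored = pd.read_csv(buffer, parse_dates=["registration_date"])

print(restored.dtypes)
```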