# German Used Cars - Data Cleaning & E.D.A.

In this project we take a closer look at the state of Germany's used car market. Using the dataset available on [Kaggle](https://www.kaggle.com/datasets/wspirat/germany-used-cars-dataset-2023), we clean, analyse and explore the data in order to learn as much as possible about this market and to answer various questions about it.

## Part 0 - Importing the Libraries & Taking a Glance at the Data

In [336]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

raw_df = pd.read_csv('Data - German_Used_Cars.csv')
df = raw_df.copy()

In [337]:
df.sample(10)

Unnamed: 0.1,Unnamed: 0,brand,model,color,registration_date,year,price_in_euro,power_kw,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,fuel_consumption_g_km,mileage_in_km,offer_description
100986,100986,mazda,Mazda CX-5,red,09/2022,2022,40090,135,184,Manual,Diesel,"4,9 l/100 km",147 g/km,1853.0,SKYACTIV-D 184
122661,122661,mercedes-benz,Mercedes-Benz C 220,silver,06/2018,2018,30990,143,194,Automatic,Diesel,"4,4 l/100 km",117 g/km,56226.0,d Exclusive PTS COMAND LED STANDHZ PANO
95216,95216,land-rover,Land Rover Range Rover Evoque,grey,06/2019,2019,64850,147,200,Automatic,Hybrid,"7,8 l/100 km",176 g/km,24500.0,R-Dynamic SE*Navi *Led
141303,141303,opel,Opel Corsa,black,08/2009,2009,5990,59,80,Manual,Petrol,"6,1 l/100 km",145 g/km,56000.0,D Edition wenig KM!/Klima/8-Fach/TÜV Neu
60176,60176,ford,Ford Mondeo,black,03/2011,2011,7900,147,200,Manual,Diesel,6 l/100 km,159 g/km,191000.0,2.2 TDCI Turnier Titanium 200 PS 2.HAND
166412,166412,porsche,Porsche 991,black,04/2017,2017,59950,272,370,Automatic,Petrol,"7,4 l/100 km",169 g/km,57040.0,Carrera Sportabgas Schiebe-Hubdach Chrono
102289,102289,mercedes-benz,Mercedes-Benz E 320,silver,12/1997,1997,9500,162,220,Automatic,Petrol,"11,9 l/100 km",283 g/km,50000.0,Elegance Top Zustand Tüv Neu 1 Hand
85154,85154,jaguar,Jaguar E-Pace,silver,03/2019,2019,33579,177,241,Automatic,Diesel,"6,8 l/100 km",175 g/km,94620.0,D240 S R-Dynamic *Leder*Navi*Pano
99783,99783,mazda,Mazda CX-5,grey,03/2019,2019,17900,110,150,Automatic,Diesel,"6,5 l/100 km",171 g/km,190486.0,"2,2d/BI-Xenon/Sitzheizung/Automatik/Navi/"
195945,195945,skoda,Skoda Superb,grey,05/2019,2019,22950,110,150,Manual,Petrol,,143 g/km,63400.0,"Sportline 1,5 TSI Limousine * Garantie *"


In [338]:
df.shape

(251079, 15)

In [339]:
df.columns

Index(['Unnamed: 0', 'brand', 'model', 'color', 'registration_date', 'year',
       'price_in_euro', 'power_kw', 'power_ps', 'transmission_type',
       'fuel_type', 'fuel_consumption_l_100km', 'fuel_consumption_g_km',
       'mileage_in_km', 'offer_description'],
      dtype='object')

In [340]:
df.dtypes

Unnamed: 0                    int64
brand                        object
model                        object
color                        object
registration_date            object
year                         object
price_in_euro                object
power_kw                     object
power_ps                     object
transmission_type            object
fuel_type                    object
fuel_consumption_l_100km     object
fuel_consumption_g_km        object
mileage_in_km               float64
offer_description            object
dtype: object

In [341]:
df.isna().sum()

Unnamed: 0                      0
brand                           0
model                           0
color                         166
registration_date               4
year                            0
price_in_euro                   0
power_kw                      134
power_ps                      129
transmission_type               0
fuel_type                       0
fuel_consumption_l_100km    26873
fuel_consumption_g_km           0
mileage_in_km                 152
offer_description               1
dtype: int64

After our brief glance at the data, there are some things that we need to keep in mind moving forward. More specifically:

- There is an extra column, namely 'Unnamed: 0', which does not serve any significant purpose and thus must be *dropped*.

- There is a considerable number of missing values relative to the number of rows in the dataset, and we have to treat them carefully so that we lose as little data as possible during the 'cleaning' phase.

- Most columns in the dataset have the object data type (in particular **strings**), and some should be converted to their appropriate data types.
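
One way to decide which object columns deserve a numeric type is to measure how many of their values parse as numbers. The following is a minimal sketch on a toy frame; the sample values and the names `toy` and `numeric_share` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy frame standing in for the raw listings (values are illustrative only).
toy = pd.DataFrame({
    "year": ["2019", "2011", "Petrol"],
    "price_in_euro": ["17900", "7900", "12/2016"],
    "brand": ["mazda", "ford", "opel"],
})

# Share of values in each column that parse cleanly as numbers.
# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric_share = toy.apply(
    lambda col: pd.to_numeric(col, errors="coerce").notna().mean()
)
print(numeric_share)
```

Columns with a share close to 1 are good candidates for conversion, while the remaining values point to misplaced or garbage entries.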

## Part 1 - Data Cleaning

To improve the results of our analysis, we must first clean our data, making it more accurate and removing unwanted clutter. The following steps take the necessary actions to do that.

In [342]:
# dropping the columns we won't be using
df.drop(['Unnamed: 0', 'offer_description', 'power_kw'], axis=1, inplace=True)

In [343]:
df.fuel_type.unique()

array(['Petrol', 'Diesel', 'Hybrid', 'LPG', 'Other', '07/2004',
       '74.194 km', '110.250 km', '06/2014', 'CNG', 'Diesel Hybrid',
       '12/2016', 'Automatic', 'Electric', '12/2019', 'Unknown',
       '06/2023', 'Ethanol', 'Manual', '300.000 km', '264.000 km',
       'KETTE NEUE', '108.313 km', '05/2009', '180.000 km', '04/2013',
       '03/2014', '08/2014', '01/2016', '03/2017', '04/2008', '07/2007',
       '145.500 km', '12/2012', '25890', '10/2022', '06/2004', '09/2009',
       '12/2014', '02/2017', '12890', '11/2018', '08/2018', '03/2019',
       '19450', '11/2021', '20.600 km', 'Hydrogen', '07/2022', '05/2015',
       '03/2018', '04/2022', '160.629 km', '144.919 km', '02/1996',
       '04/2000', '200.000 km', '06/2009', '185.500 km', '13000',
       '05/2012', '11/2014', '10/2015', '350.000 km', '49.817 km',
       '34900', '35.487 km', '03/2021', '26890', '26990', '4.000 km',
       '11/2005', '07/2005', '08/2011', '02/2011', '03/2011', '10/2013',
       '09/2015', '02/2018',

In [344]:
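# filtering the dataframe to include only these fuels in the 'fuel_type' column;
# .copy() makes the result an independent dataframe, so later column assignments do not raise SettingWithCopyWarning
fuels = ['Petrol', 'Diesel', 'Electric', 'Hybrid', 'LPG', 'CNG', 'Diesel Hybrid', 'Ethanol', 'Hydrogen']
df = df[df.fuel_type.isin(fuels)].copy()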
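
Before discarding rows, it can help to look at what the whitelist rejects. This is a small sketch on made-up rows; `toy`, `valid`, `rejected` and `kept` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy rows: two valid fuel types and two shifted values (a date, a mileage).
toy = pd.DataFrame({
    "brand": ["opel", "ford", "bmw", "audi"],
    "fuel_type": ["Petrol", "07/2004", "Diesel", "74.194 km"],
})
fuels = ["Petrol", "Diesel", "Electric", "Hybrid", "LPG", "CNG",
         "Diesel Hybrid", "Ethanol", "Hydrogen"]

valid = toy["fuel_type"].isin(fuels)
rejected = toy[~valid]      # rows whose fuel_type holds a date, a mileage, etc.
kept = toy[valid].copy()    # independent copy, safe to modify later

print(f"kept {len(kept)} rows, rejected {len(rejected)}")
```

Inspecting `rejected` shows whether the dropped rows are truly misaligned listings or whether a legitimate category is missing from the list.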

In [345]:
# extracting the 'fuel_consumption_l_100km' numeric values
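# the raw string (r'...') keeps '\d' from being treated as an invalid escape sequence
df['fuel_consumption_l_100km'] = df['fuel_consumption_l_100km'].str.extract(r'(\d+,\d+)', expand=False)
df['fuel_consumption_l_100km'] = df['fuel_consumption_l_100km'].str.replace(',', '.', regex=False)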
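
The pattern above requires a decimal comma, so a whole-number entry such as `6 l/100 km` is not matched and becomes a missing value. The sketch below compares it with a hypothetical, more permissive pattern on a few sample strings; the alternative pattern is an assumption for illustration and is not what the notebook uses.

```python
import pandas as pd

raw = pd.Series(["4,9 l/100 km", "6 l/100 km", "11,9 l/100 km", None])

# Pattern used in the notebook: a decimal comma is mandatory.
strict = raw.str.extract(r"(\d+,\d+)", expand=False)

# Hypothetical alternative: the decimal part is optional and the number
# must be followed by "l/100".
loose = raw.str.extract(r"(\d+(?:,\d+)?)\s*l/100", expand=False)

# Convert the German decimal comma to a dot and parse as float.
as_float = pd.to_numeric(loose.str.replace(",", ".", regex=False))
print(pd.DataFrame({"strict": strict, "loose": loose, "as_float": as_float}))
```

Switching patterns would change the missing-value counts reported below, so the choice should be made before those counts are interpreted.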

Now, let's take the time to address the missing values in our dataframe. During our brief look at the dataset at the beginning of the project we noted that most of them are located in the Fuel Consumption column. Before we proceed with any data manipulation, let's take a moment to question why some of these values are missing.

- Electric vehicles do not consume fuel, but instead rely on a battery to move.

- The dataset is a collection of online car listings, so many entries are subject to human error. The fuel consumption entry is no exception, as many car owners may simply not know this figure for their car.

In [346]:
# counting the missing fuel consumption values per fuel type and their share of all rows of that type
missing_values_df = pd.DataFrame(df[df.fuel_consumption_l_100km.isna()]['fuel_type'].value_counts())
missing_values_df.columns = ['Missing Values']
missing_values_df['Total Rows'] = df['fuel_type'].value_counts()
missing_values_df['% of Missing Values'] = round(
    missing_values_df['Missing Values'] / missing_values_df['Total Rows'] * 100, 2)
missing_values_df.sort_values('% of Missing Values', ascending=False)

fuel_type,Missing Values,Total Rows,% of Missing Values
Electric,5951,5967,99.73
Hydrogen,58,82,70.73
Ethanol,6,10,60.0
Diesel Hybrid,150,476,31.51
Hybrid,3884,12607,30.81
CNG,137,508,26.97
LPG,290,1255,23.11
Diesel,15289,86421,17.69
Petrol,24274,143280,16.94


What does the table above tell us? Almost all of the rows for electric vehicles have a missing value in the Fuel Consumption column. This makes sense, because electric vehicles do not have a fuel consumption value.
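
The same kind of summary can also be built with a single `groupby`. The following is a minimal sketch on toy data; the column names `missing_values`, `total_rows` and `pct_missing` are illustrative choices, not names from the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "fuel_type": ["Electric", "Electric", "Petrol", "Petrol", "Petrol", "Diesel"],
    "fuel_consumption_l_100km": [np.nan, np.nan, "6.1", np.nan, "7.4", "4.4"],
})

# Flag missing values, then count them and all rows per fuel type.
summary = (
    toy.assign(missing=toy["fuel_consumption_l_100km"].isna())
       .groupby("fuel_type")["missing"]
       .agg(missing_values="sum", total_rows="count")
)
summary["pct_missing"] = (
    summary["missing_values"] / summary["total_rows"] * 100
).round(2)
summary = summary.sort_values("pct_missing", ascending=False)
print(summary)
```

Fuel types without any missing values also appear in this version, with a share of zero.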

Dropping these rows from our dataset now would be disastrous for our later analysis. On the other hand, we can drop the rows for Ethanol: there are very few of them compared to the total row count, and the share of missing values among them is still high.

So, for now, let's simply replace the missing values of all the fuel types with zero and drop the entries which we do not need.

In [347]:
# removing the 'Ethanol' fuel type rows
df = df[df['fuel_type'] != 'Ethanol'].copy()

# replacing the NaN values with zero for the rest of the entries and printing the results
df['fuel_consumption_l_100km'] = df['fuel_consumption_l_100km'].fillna(0)
na_fuel_consumption_rows = df['fuel_consumption_l_100km'].isna().sum()
print('Number of rows with missing values in the Fuel Consumption column after cleaning: ', na_fuel_consumption_rows)


Number of rows with missing values in the Fuel Consumption column after cleaning:  0


In [348]:
df

Unnamed: 0,brand,model,color,registration_date,year,price_in_euro,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,fuel_consumption_g_km,mileage_in_km
0,alfa-romeo,Alfa Romeo GTV,red,10/1995,1995,1300,201,Manual,Petrol,10.9,260 g/km,160500.0
1,alfa-romeo,Alfa Romeo 164,black,02/1995,1995,24900,260,Manual,Petrol,0,- (g/km),190000.0
2,alfa-romeo,Alfa Romeo Spider,black,02/1995,1995,5900,150,Unknown,Petrol,0,- (g/km),129000.0
3,alfa-romeo,Alfa Romeo Spider,black,07/1995,1995,4900,150,Manual,Petrol,9.5,225 g/km,189500.0
4,alfa-romeo,Alfa Romeo 164,red,11/1996,1996,17950,179,Manual,Petrol,7.2,- (g/km),96127.0
...,...,...,...,...,...,...,...,...,...,...,...,...
251074,volvo,Volvo XC40,white,04/2023,2023,57990,261,Automatic,Hybrid,0,43 km Reichweite,1229.0
251075,volvo,Volvo XC90,white,03/2023,2023,89690,235,Automatic,Diesel,7.6,202 g/km,4900.0
251076,volvo,Volvo V60,white,05/2023,2023,61521,197,Automatic,Diesel,4.7,125 g/km,1531.0
251077,volvo,Volvo XC40,white,05/2023,2023,57890,179,Automatic,Hybrid,0,45 km Reichweite,1500.0


We still need to address the electric vehicles, whose missing values we replaced with zero in the previous step, so let's do that now.

In [349]:
df[df.fuel_type == 'Electric']

Unnamed: 0,brand,model,color,registration_date,year,price_in_euro,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,fuel_consumption_g_km,mileage_in_km
16552,audi,Audi e-tron,beige,09/2019,2019,51888,408,Automatic,Electric,0,359 km Reichweite,84800.0
16559,audi,Audi e-tron,beige,07/2019,2019,53990,408,Automatic,Electric,0,359 km Reichweite,51000.0
16561,audi,Audi e-tron,beige,11/2019,2019,54870,408,Automatic,Electric,0,0 g/km,82814.0
16571,audi,Audi e-tron,beige,12/2019,2019,61989,408,Automatic,Electric,0,0 g/km,55990.0
16579,audi,Audi e-tron,blue,02/2019,2019,32930,408,Automatic,Electric,0,359 km Reichweite,84300.0
...,...,...,...,...,...,...,...,...,...,...,...,...
251033,volvo,Volvo C40,black,05/2023,2023,52890,231,Automatic,Electric,0,400 km Reichweite,8.0
251037,volvo,Volvo XC40,black,04/2023,2023,49900,231,Automatic,Electric,0,0 g/km,14900.0
251048,volvo,Volvo C40,black,01/2023,2023,51990,231,Automatic,Electric,0,0 g/km,2106.0
251056,volvo,Volvo C40,black,05/2023,2023,60520,231,Automatic,Electric,0,400 km Reichweite,3000.0


The column named 'fuel_consumption_g_km' contains information about the cars' battery range (*Reichweite* is German for *range*), so we can extract that information and make a new column with the battery range values.

In [350]:
el_vehicles = df[df.fuel_type == 'Electric'].index.values #index values of all electric vehicles
df = df.assign(range_in_km = df.loc[el_vehicles, 'fuel_consumption_g_km'].str.extract(pat=r'(\d+)', expand=False))
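
Note that the pattern `(\d+)` also matches entries such as `0 g/km`, which then end up as a range of 0. A stricter variant could accept a number only when it is followed by `km Reichweite`. This is a sketch on toy rows and an assumption about how the column is formatted, not the method used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "fuel_type": ["Electric", "Electric", "Petrol"],
    "fuel_consumption_g_km": ["359 km Reichweite", "0 g/km", "147 g/km"],
})

is_electric = toy["fuel_type"] == "Electric"

# Only accept a number that is directly followed by "km Reichweite".
# Rows that are not electric, or do not match, are left as NaN.
toy["range_in_km"] = pd.to_numeric(
    toy.loc[is_electric, "fuel_consumption_g_km"]
       .str.extract(r"(\d+)\s*km Reichweite", expand=False)
)
print(toy)
```

With this variant, an electric car listed with `0 g/km` gets a missing range instead of a range of zero.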

In [351]:
# dropping unnecessary columns and rearranging
df = df[['brand', 'model', 'color', 'registration_date', 'year', 'power_ps',
         'transmission_type', 'fuel_type', 'fuel_consumption_l_100km',
         'range_in_km', 'mileage_in_km', 'price_in_euro']]
df.sample(10)

Unnamed: 0,brand,model,color,registration_date,year,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,range_in_km,mileage_in_km,price_in_euro
100409,mazda,Mazda CX-30,grey,08/2020,2020,122,Automatic,Petrol,5.5,,15290.0,25990
159419,peugeot,Peugeot 207,grey,11/2010,2010,120,Manual,Petrol,6.5,,112231.0,5900
96308,land-rover,Land Rover Range Rover Velar,grey,06/2021,2021,300,Automatic,Diesel,6.6,,42200.0,68880
147087,opel,Opel Corsa,blue,01/2017,2017,90,Manual,Petrol,5.1,,58700.0,10990
24861,bmw,BMW 318,black,06/2004,2004,143,Manual,Petrol,7.8,,217500.0,5500
102951,mercedes-benz,Mercedes-Benz C 200,silver,08/2000,2000,163,Manual,Petrol,9.7,,261562.0,1590
230818,volkswagen,Volkswagen Golf Sportsvan,silver,03/2015,2015,125,Manual,Petrol,5.4,,59987.0,17990
21570,audi,Audi A5,white,05/2022,2022,204,Automatic,Petrol,6.3,,16450.0,24450
246196,volkswagen,Volkswagen T6.1 California,red,06/2023,2023,150,Automatic,Diesel,6.1,,20.0,83450
153216,opel,Opel Insignia,brown,11/2020,2020,122,Manual,Diesel,5.5,,52650.0,18980


Since most of our columns still contain string-based values, it would be beneficial to convert them to their appropriate data types.

In [352]:
df.dtypes

brand                        object
model                        object
color                        object
registration_date            object
year                         object
power_ps                     object
transmission_type            object
fuel_type                    object
fuel_consumption_l_100km     object
range_in_km                  object
mileage_in_km               float64
price_in_euro                object
dtype: object

In [353]:
columns_to_numeric = ['power_ps', 'fuel_consumption_l_100km', 'range_in_km', 'price_in_euro']
for column in columns_to_numeric:
    df[column] = df[column].astype(float)

df['registration_date'] = pd.to_datetime(df['registration_date'], format='%m/%Y')
df['year'] = df['registration_date'].dt.year
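
To illustrate the date conversion on its own, here is a short sketch with sample strings. The `errors="coerce"` argument is an optional safeguard that the cell above does not use; it turns unparseable strings into `NaT` instead of raising an error.

```python
import pandas as pd

dates = pd.Series(["09/2022", "12/1997", "not a date"])

# "%m/%Y" reads month/year; the day defaults to the first of the month.
parsed = pd.to_datetime(dates, format="%m/%Y", errors="coerce")

# The year is taken from the parsed date; NaT becomes NaN here.
years = parsed.dt.year
print(pd.DataFrame({"raw": dates, "parsed": parsed, "year": years}))
```

Deriving `year` from the parsed date keeps the two columns consistent with each other.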

In [354]:
df.isna().sum()

brand                            0
model                            0
color                          166
registration_date                0
year                             0
power_ps                       115
transmission_type                0
fuel_type                        0
fuel_consumption_l_100km         0
range_in_km                 244645
mileage_in_km                   62
price_in_euro                    0
dtype: int64

In [355]:
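# dropping the rows that have a missing value in any of these columns
# (assigning the filtered selection back to df[cols] would keep those rows and only blank out their values)
cols = ['color', 'power_ps', 'mileage_in_km']
df = df.dropna(subset=cols)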
df.sample(10)

Unnamed: 0,brand,model,color,registration_date,year,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,range_in_km,mileage_in_km,price_in_euro
164102,peugeot,Peugeot 3008,grey,2023-02-01,2023,131.0,Automatic,Diesel,5.6,,150.0,42990.0
210110,toyota,Toyota Aygo X,white,2021-07-01,2021,72.0,Manual,Petrol,4.1,,31200.0,13970.0
155774,opel,Opel Crossland X,white,2021-03-01,2021,131.0,Manual,Petrol,4.7,,44700.0,17380.0
80951,hyundai,Hyundai TUCSON,red,2019-03-01,2019,177.0,Automatic,Petrol,0.0,,47825.0,25990.0
241975,volkswagen,Volkswagen Tiguan,brown,2021-09-01,2021,131.0,Manual,Petrol,5.5,,18343.0,31850.0
204963,smart,smart forTwo,black,2021-03-01,2021,82.0,Automatic,Electric,0.0,0.0,7715.0,16686.0
212063,toyota,Toyota RAV 4,blue,2023-05-01,2023,218.0,Automatic,Hybrid,4.5,,20.0,41980.0
53854,fiat,Fiat,white,2016-11-01,2016,120.0,Manual,Diesel,0.0,,196290.0,11985.0
188701,seat,SEAT Ateca,black,2023-02-01,2023,150.0,Automatic,Petrol,6.7,,1100.0,33900.0
11369,audi,Audi TT,blue,2016-03-01,2016,230.0,Automatic,Petrol,6.7,,52300.0,28490.0
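
Removing the rows that have a missing value in a chosen set of columns can be sketched with `dropna(subset=...)`. The toy frame below is illustrative; note that `range_in_km` is deliberately left out of `cols`, since it is expected to be empty for non-electric cars.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "color": ["red", None, "blue", "grey"],
    "power_ps": [150.0, 90.0, np.nan, 120.0],
    "mileage_in_km": [1000.0, 2000.0, 3000.0, 4000.0],
    "range_in_km": [np.nan, np.nan, np.nan, np.nan],
})

cols = ["color", "power_ps", "mileage_in_km"]

# Drop a row if ANY of the listed columns is missing; other columns are ignored.
cleaned = toy.dropna(subset=cols)
print(cleaned)
```

Only the rows without gaps in `cols` remain, while the empty `range_in_km` values do not cause any row to be dropped.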


We have successfully cleaned our dataset of NaN and "garbage" values (the `range_in_km` column intentionally stays empty for non-electric vehicles). At this point, we can export our dataset back to a CSV file and continue with the exploratory part of our analysis.

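*Note*: We also export a second, filtered dataset. It keeps only the entries with a fuel consumption above 2 L/100 km, which gives more realistic values, since virtually no vehicle has a fuel consumption that low. Because missing consumption values were replaced with zero earlier, this filter also removes those rows, including the electric vehicles.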

In [356]:
df.to_csv('New - German_Used_Cars.csv', index=False)

filtered_df = df[df.fuel_consumption_l_100km > 2]
filtered_df.to_csv('New - Filtered_German_Used_Cars.csv', index=False)
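
Before relying on the filtered file, it may be worth checking which fuel types the `> 2` threshold removes. This is a minimal sketch on made-up rows; `toy`, `kept` and `removed_by_fuel` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "fuel_type": ["Petrol", "Petrol", "Electric", "Diesel", "Hybrid"],
    "fuel_consumption_l_100km": [6.1, 0.0, 0.0, 4.4, 1.5],
})

# Same condition as the export filter above.
kept = toy["fuel_consumption_l_100km"] > 2

# Count the removed rows per fuel type.
removed_by_fuel = toy.loc[~kept, "fuel_type"].value_counts()
print(removed_by_fuel)
```

In this toy example the filter removes the electric car, the zero-filled petrol entry and the low-consumption hybrid, which is the kind of side effect to keep in mind when comparing fuel types on the filtered data.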
