# German Used Cars - Data Cleaning & E.D.A.

In this project we take a closer look at Germany's used car market. Using the dataset available on [Kaggle](https://www.kaggle.com/datasets/wspirat/germany-used-cars-dataset-2023), we clean, analyse and explore the data to learn as much as we can about this market and to answer a number of questions about it.

## Part 0 - Importing the Libraries & Taking a Glance at the Data

In [554]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

raw_df = pd.read_csv('Data - German_Used_Cars.csv')
df = raw_df.copy()

In [555]:
df.sample(10)

,Unnamed: 0,brand,model,color,registration_date,year,price_in_euro,power_kw,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,fuel_consumption_g_km,mileage_in_km,offer_description
173854,173854,renault,Renault Clio,white,12/2019,2019,15300,74,101,Manual,Petrol,"4,7 l/100 km",107 g/km,37600.0,V Zen Sitzheizung Navigation
120131,120131,mercedes-benz,Mercedes-Benz C 300,black,06/2017,2017,31880,180,245,Automatic,Petrol,"6,3 l/100 km",146 g/km,49785.0,"Klasse Lim. Avantgarde, Standheizung, AMG"
233036,233036,volkswagen,Volkswagen Golf,white,07/2016,2016,13990,92,125,Manual,Petrol,"5,2 l/100 km",120 g/km,80000.0,VII 1.4 TSI Lim. Comfortline BMT/NAVI/SHZ
64453,64453,ford,Ford Fiesta,black,08/2016,2016,9700,59,80,Manual,Petrol,"4,6 l/100 km",105 g/km,91904.0,1.0 Celebration*SHZ*PDC*ALU*
1740,1740,audi,Audi A4,blue,01/1999,1999,1400,92,125,Manual,Petrol,"9,4 l/100 km",- (g/km),210000.0,Avant 1.8 quattro
92841,92841,kia,Kia Sportage,red,04/2023,2023,36429,100,136,Automatic,Diesel,"4,4 l/100 km",117 g/km,5000.0,1.6D 48V AWD VISION KOMFORTPAKET
58432,58432,ford,Ford S-Max,silver,12/2006,2006,5290,162,220,Manual,Petrol,"9,4 l/100 km",- (g/km),219500.0,2.5 Titanium
24053,24053,bmw,BMW 325,silver,09/2001,2001,6300,141,192,Manual,Petrol,9 l/100 km,- (g/km),228200.0,Ci
107399,107399,mercedes-benz,Mercedes-Benz CLK 200,blue,05/2008,2008,14999,135,184,Manual,Petrol,"8,9 l/100 km",212 g/km,84300.0,CLK Cabrio CLK 200 Kompressor KE18ML
81214,81214,hyundai,Hyundai i30,silver,01/2019,2019,23790,202,275,Manual,Petrol,"7,8 l/100 km",178 g/km,77500.0,"Fastback N Perf. 2.0 T-GDI PANO*LED*NAV*19"""


In [556]:
df.shape

(251079, 15)

In [557]:
df.columns

Index(['Unnamed: 0', 'brand', 'model', 'color', 'registration_date', 'year',
       'price_in_euro', 'power_kw', 'power_ps', 'transmission_type',
       'fuel_type', 'fuel_consumption_l_100km', 'fuel_consumption_g_km',
       'mileage_in_km', 'offer_description'],
      dtype='object')

In [558]:
df.dtypes

Unnamed: 0                    int64
brand                        object
model                        object
color                        object
registration_date            object
year                         object
price_in_euro                object
power_kw                     object
power_ps                     object
transmission_type            object
fuel_type                    object
fuel_consumption_l_100km     object
fuel_consumption_g_km        object
mileage_in_km               float64
offer_description            object
dtype: object

In [559]:
df.isna().sum()

Unnamed: 0                      0
brand                           0
model                           0
color                         166
registration_date               4
year                            0
price_in_euro                   0
power_kw                      134
power_ps                      129
transmission_type               0
fuel_type                       0
fuel_consumption_l_100km    26873
fuel_consumption_g_km           0
mileage_in_km                 152
offer_description               1
dtype: int64

After this brief glance at the data, there are a few things to keep in mind moving forward. More specifically:

- There is an extra column, namely 'Unnamed: 0', which serves no purpose and should be *dropped*.

- There is a considerable number of missing values, most notably 26,873 in 'fuel_consumption_l_100km' (roughly 11% of the 251,079 rows), and we have to treat them carefully so that we lose as little data as possible during the 'cleaning' phase.

- The majority of the columns are of type object (in particular **strings**), and some should be converted to more appropriate data types.
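To put the missing values into perspective, it can help to express them as a share of each column rather than as raw counts. The snippet below is a minimal sketch on a made-up toy frame (`toy_df` and its values are assumptions for illustration, not rows from the dataset); the same expression could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy_df = pd.DataFrame({
    'color': ['red', None, 'blue', 'black'],
    'power_ps': ['101', '245', None, None],
    'mileage_in_km': [37600.0, 49785.0, 80000.0, np.nan],
})

# isna().mean() gives the fraction of missing entries per column
missing_pct = (toy_df.isna().mean() * 100).round(2)
print(missing_pct.sort_values(ascending=False))
```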

## Part 1 - Data Cleaning

To improve the results of our analysis, we first need to clean the data, making it more accurate and removing unwanted clutter. The following steps do just that.

In [560]:
# dropping the columns we won't be using
df.drop(['Unnamed: 0', 'offer_description', 'power_kw'], axis=1, inplace=True)

In [561]:
df.fuel_type.unique()

array(['Petrol', 'Diesel', 'Hybrid', 'LPG', 'Other', '07/2004',
       '74.194 km', '110.250 km', '06/2014', 'CNG', 'Diesel Hybrid',
       '12/2016', 'Automatic', 'Electric', '12/2019', 'Unknown',
       '06/2023', 'Ethanol', 'Manual', '300.000 km', '264.000 km',
       'KETTE NEUE', '108.313 km', '05/2009', '180.000 km', '04/2013',
       '03/2014', '08/2014', '01/2016', '03/2017', '04/2008', '07/2007',
       '145.500 km', '12/2012', '25890', '10/2022', '06/2004', '09/2009',
       '12/2014', '02/2017', '12890', '11/2018', '08/2018', '03/2019',
       '19450', '11/2021', '20.600 km', 'Hydrogen', '07/2022', '05/2015',
       '03/2018', '04/2022', '160.629 km', '144.919 km', '02/1996',
       '04/2000', '200.000 km', '06/2009', '185.500 km', '13000',
       '05/2012', '11/2014', '10/2015', '350.000 km', '49.817 km',
       '34900', '35.487 km', '03/2021', '26890', '26990', '4.000 km',
       '11/2005', '07/2005', '08/2011', '02/2011', '03/2011', '10/2013',
       '09/2015', '02/2018',

In [562]:
# filtering the dataframe to include only these fuels in the 'fuel_type' column
fuels = ['Petrol', 'Diesel', 'Electric', 'Hybrid', 'LPG', 'CNG', 'Diesel Hybrid', 'Ethanol', 'Hydrogen']
# .copy() makes the result an independent dataframe, so later column assignments are unambiguous
df = df[df.fuel_type.isin(fuels)].copy()

In [563]:
# extracting the 'fuel_consumption_l_100km' numeric values
# (raw string for the regex; the pattern expects a decimal comma, e.g. '4,7')
df['fuel_consumption_l_100km'] = df['fuel_consumption_l_100km'].str.extract(r'(\d+,\d+)', expand=False)
df['fuel_consumption_l_100km'] = df['fuel_consumption_l_100km'].str.replace(',', '.', regex=False)
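One thing to be aware of: a pattern of the form `\d+,\d+` only matches values written with a decimal comma, so a whole-number entry such as `9 l/100 km` ends up as a missing value. A more permissive pattern might look like the sketch below. The sample strings are made up in the style of the column, and anchoring on the `l/100 km` unit is an extra assumption meant to avoid picking up digits from unrelated strings.

```python
import pandas as pd

# Made-up strings in the same style as the 'fuel_consumption_l_100km' column
raw = pd.Series(['4,7 l/100 km', '9 l/100 km', '- (l/100 km)', None])

# Optional decimal part: matches '4,7' as well as a bare '9'
pattern = r'(\d+(?:,\d+)?)\s*l/100 km'
extracted = raw.str.extract(pattern, expand=False)

# Swap the decimal comma for a dot and convert to float (unmatched rows stay NaN)
as_float = pd.to_numeric(extracted.str.replace(',', '.', regex=False))
print(as_float.tolist())
```

Using this pattern on the real data would change the missing-value counts reported further down, so it is shown here only as an alternative.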

Now, let's take the time to address the missing values in our dataframe. During our brief look at the dataset at the beginning of the project we noted that most of them are located in the Fuel Consumption column. Before we proceed with any data manipulation, let's take a moment to question why some of these values are missing.

- Electric vehicles do not consume fuel, but instead rely on a battery to move.

- The dataset is a collection of online car listings, so many entries are subject to human error. The fuel consumption entry is no exception, as many owners may simply not know this figure for their car.

In [564]:
missing_values_df = pd.DataFrame(df[df.fuel_consumption_l_100km.isna()]['fuel_type'].value_counts())
missing_values_df.columns = ['Missing Values']
missing_values_df['Total Rows'] = df['fuel_type'].value_counts()
missing_values_df['% of Missing Values'] = round(missing_values_df['Missing Values'] \
/missing_values_df['Total Rows']*100, 2)
missing_values_df.sort_values('% of Missing Values', ascending=False)

,Missing Values,Total Rows,% of Missing Values
Electric,5951,5967,99.73
Hydrogen,58,82,70.73
Ethanol,6,10,60.0
Diesel Hybrid,150,476,31.51
Hybrid,3884,12607,30.81
CNG,137,508,26.97
LPG,290,1255,23.11
Diesel,15289,86421,17.69
Petrol,24274,143280,16.94


What does the table above tell us? Almost all of the electric-vehicle rows (99.73%) have a missing value in the Fuel Consumption column. This makes sense, since electric vehicles have no fuel consumption to report.

Dropping these rows now would be disastrous for our later analysis, as it would remove nearly every electric vehicle from the dataset. The Ethanol rows, on the other hand, can be dropped: there are only 10 of them in the whole dataset, and 6 of those are missing a value.

So, for now, let's simply replace the missing values of all the fuel types with zero and drop the entries which we do not need.
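A caveat worth keeping in mind with this approach is that zero placeholders count as real values in any later aggregate. The following sketch, on made-up numbers, shows how a mean shifts after zero-filling and how the placeholders can be filtered out again when needed.

```python
import numpy as np
import pandas as pd

# Made-up consumption values, two of them missing
consumption = pd.Series([5.0, 7.0, np.nan, np.nan])

# NaN entries are skipped by default
mean_skipping_nan = consumption.mean()

# After zero-filling, the zeros are treated as genuine measurements
zero_filled = consumption.fillna(0)
mean_with_zeros = zero_filled.mean()

# Filtering out the placeholders restores the original mean
mean_excluding_zeros = zero_filled[zero_filled > 0].mean()

print(mean_skipping_nan, mean_with_zeros, mean_excluding_zeros)
```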

In [565]:
# removing the rows with the 'Ethanol' fuel type
df = df[df['fuel_type'] != 'Ethanol'].copy()

# replacing the NaN values with zero for the rest of the entries and printing the results
df['fuel_consumption_l_100km'] = df['fuel_consumption_l_100km'].fillna(0)
na_fuel_consumption_rows = df['fuel_consumption_l_100km'].isna().sum()
print('Number of rows with missing values in the Fuel Consumption column after cleaning: ', na_fuel_consumption_rows)


Number of rows with missing values in the Fuel Consumption column after cleaning:  0


In [566]:
df

,brand,model,color,registration_date,year,price_in_euro,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,fuel_consumption_g_km,mileage_in_km
0,alfa-romeo,Alfa Romeo GTV,red,10/1995,1995,1300,201,Manual,Petrol,10.9,260 g/km,160500.0
1,alfa-romeo,Alfa Romeo 164,black,02/1995,1995,24900,260,Manual,Petrol,0,- (g/km),190000.0
2,alfa-romeo,Alfa Romeo Spider,black,02/1995,1995,5900,150,Unknown,Petrol,0,- (g/km),129000.0
3,alfa-romeo,Alfa Romeo Spider,black,07/1995,1995,4900,150,Manual,Petrol,9.5,225 g/km,189500.0
4,alfa-romeo,Alfa Romeo 164,red,11/1996,1996,17950,179,Manual,Petrol,7.2,- (g/km),96127.0
...,...,...,...,...,...,...,...,...,...,...,...,...
251074,volvo,Volvo XC40,white,04/2023,2023,57990,261,Automatic,Hybrid,0,43 km Reichweite,1229.0
251075,volvo,Volvo XC90,white,03/2023,2023,89690,235,Automatic,Diesel,7.6,202 g/km,4900.0
251076,volvo,Volvo V60,white,05/2023,2023,61521,197,Automatic,Diesel,4.7,125 g/km,1531.0
251077,volvo,Volvo XC40,white,05/2023,2023,57890,179,Automatic,Hybrid,0,45 km Reichweite,1500.0


We still need to address the electric-vehicle rows whose missing values we replaced with zero in the previous step, so let's do that now.

In [567]:
df[df.fuel_type == 'Electric']

,brand,model,color,registration_date,year,price_in_euro,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,fuel_consumption_g_km,mileage_in_km
16552,audi,Audi e-tron,beige,09/2019,2019,51888,408,Automatic,Electric,0,359 km Reichweite,84800.0
16559,audi,Audi e-tron,beige,07/2019,2019,53990,408,Automatic,Electric,0,359 km Reichweite,51000.0
16561,audi,Audi e-tron,beige,11/2019,2019,54870,408,Automatic,Electric,0,0 g/km,82814.0
16571,audi,Audi e-tron,beige,12/2019,2019,61989,408,Automatic,Electric,0,0 g/km,55990.0
16579,audi,Audi e-tron,blue,02/2019,2019,32930,408,Automatic,Electric,0,359 km Reichweite,84300.0
...,...,...,...,...,...,...,...,...,...,...,...,...
251033,volvo,Volvo C40,black,05/2023,2023,52890,231,Automatic,Electric,0,400 km Reichweite,8.0
251037,volvo,Volvo XC40,black,04/2023,2023,49900,231,Automatic,Electric,0,0 g/km,14900.0
251048,volvo,Volvo C40,black,01/2023,2023,51990,231,Automatic,Electric,0,0 g/km,2106.0
251056,volvo,Volvo C40,black,05/2023,2023,60520,231,Automatic,Electric,0,400 km Reichweite,3000.0


For many electric vehicles, the column named 'fuel_consumption_g_km' actually holds the battery range (*Reichweite* is German for *range*), so we can extract that information into a new column with the battery range values.

In [568]:
el_vehicles = df[df.fuel_type == 'Electric'].index.values  # index values of all electric vehicles
# take the first run of digits; note that an entry such as '0 g/km' yields a range of 0
df = df.assign(range_in_km = df.loc[el_vehicles, 'fuel_consumption_g_km'].str.extract(pat=r'(\d+)', expand=False))
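Because a bare `(\d+)` also picks up the `0` in entries like `0 g/km`, a stricter variant could require the word *Reichweite* to follow the number. The sketch below illustrates this on a made-up toy frame (`toy_df` is an assumption for illustration only); rows without a stated range are left as NaN instead of 0.

```python
import pandas as pd

# Toy rows in the style of the dataset (values are made up)
toy_df = pd.DataFrame({
    'fuel_type': ['Electric', 'Electric', 'Petrol'],
    'fuel_consumption_g_km': ['359 km Reichweite', '0 g/km', '107 g/km'],
})

is_electric = toy_df['fuel_type'] == 'Electric'

# Only capture digits that are followed by 'km Reichweite'
range_str = toy_df.loc[is_electric, 'fuel_consumption_g_km'].str.extract(
    r'(\d+)\s*km Reichweite', expand=False
)

# Assignment aligns on the index, so non-electric rows become NaN
toy_df['range_in_km'] = pd.to_numeric(range_str)
print(toy_df)
```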

In [569]:
#dropping unnecessary columns and rearranging
df = df[['brand', 'model', 'color', 'registration_date', 'year', 'power_ps', \
'transmission_type', 'fuel_type', 'fuel_consumption_l_100km', 'range_in_km', 'mileage_in_km', 'price_in_euro']]
df.sample(10)

,brand,model,color,registration_date,year,power_ps,transmission_type,fuel_type,fuel_consumption_l_100km,range_in_km,mileage_in_km,price_in_euro
233292,volkswagen,Volkswagen T-Roc,blue,11/2017,2017,116,Manual,Petrol,5.1,,94690.0,16480
186901,seat,SEAT Tarraco,red,02/2022,2022,150,Manual,Diesel,4.8,,11950.0,38990
344,alfa-romeo,Alfa Romeo MiTo,red,05/2010,2010,95,Manual,Petrol,5.9,,208000.0,3380
54254,fiat,Fiat 500,grey,04/2018,2018,69,Manual,Petrol,4.9,,26000.0,11750
24096,bmw,BMW 320,silver,07/2001,2001,170,Manual,Petrol,0.0,,230000.0,6850
2137,audi,Audi TT,blue,09/2003,2003,150,Manual,Petrol,8.3,,176958.0,6290
52201,ferrari,Ferrari 458,grey,12/2014,2014,605,Automatic,Petrol,0.0,,5414.0,439000
154438,opel,Opel Corsa,orange,10/2020,2020,75,Manual,Petrol,5.9,,39367.0,12450
18816,audi,Audi A4,grey,12/2020,2020,190,Automatic,Diesel,4.2,,23.0,46575
13714,audi,Audi RS3,red,01/2017,2017,400,Automatic,Petrol,8.3,,97500.0,37388


Since most of our columns still contain string-based values, it would be beneficial to convert them to their appropriate data types.

In [575]:
columns_to_numeric = ['power_ps', 'fuel_consumption_l_100km', 'range_in_km', 'price_in_euro']
for column in columns_to_numeric:
    df[column] = df[column].astype(float)

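# pass the column itself (not its name as a string) and state the 'MM/YYYY' format explicitly
df['registration_date'] = pd.to_datetime(df['registration_date'], format='%m/%Y', errors='coerce')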
df['year'] = df['registration_date'].dt.year
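To see how the date conversion behaves, here is a small sketch on a few made-up 'MM/YYYY' strings (the sample values are assumptions, not rows from the dataset). With `errors='coerce'`, anything that does not match the format becomes `NaT`, and the derived year column is then stored as float because it has to hold NaN.

```python
import pandas as pd

# Made-up registration dates, including one invalid and one missing entry
dates = pd.Series(['12/2019', '06/2017', 'not a date', None])

# Explicit format: month/year; unparseable entries are coerced to NaT
parsed = pd.to_datetime(dates, format='%m/%Y', errors='coerce')

# .dt.year returns NaN for NaT rows, which makes the result float64
years = parsed.dt.year
print(parsed)
print(years)
```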

In [576]:
df.dtypes

brand                               object
model                               object
color                               object
registration_date           datetime64[ns]
year                               float64
power_ps                           float64
transmission_type                   object
fuel_type                           object
fuel_consumption_l_100km           float64
range_in_km                        float64
mileage_in_km                      float64
price_in_euro                      float64
dtype: object