# Exploring eBay Car Sales Data

This project is an exploratory data analysis of a dataset of used cars from eBay Kleinanzeigen, the classifieds section of the German eBay website. The dataset was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data).


The version of the dataset we are working with is a sample of 50,000 data points prepared by Dataquest, which was modified to simulate a less-cleaned version of the data.

The data dictionary provided with the data is as follows:

- dateCrawled - When this ad was first crawled. All field-values are taken from this date.
- name - Name of the car.
- seller - Whether the seller is private or a dealer.
- offerType - The type of listing
- price - The price on the ad to sell the car.
- abtest - Whether the listing is included in an A/B test.
- vehicleType - The vehicle type.
- yearOfRegistration - The year in which the car was first registered.
- gearbox - The transmission type.
- powerPS - The power of the car in PS.
- model - The car model name.
- kilometer - How many kilometers the car has driven.
- monthOfRegistration - The month in which the car was first registered.
- fuelType - What type of fuel the car uses.
- brand - The brand of the car.
- notRepairedDamage - If the car has a damage which is not yet repaired.
- dateCreated - The date on which the eBay listing was created.
- nrOfPictures - The number of pictures in the ad.
- postalCode - The postal code for the location of the vehicle.
- lastSeenOnline - When the crawler saw this ad last online.

The aim of this project is to clean the data and analyze the included used car listings.



In [3]:
# importing the libraries:

import pandas as pd
import numpy as np

In [4]:
# reading the data and creating the dataframe:

autos = pd.read_csv('autos.csv', encoding = "Latin-1")

In [5]:
# taking a first look at the data frame:

autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [6]:
# taking a first look at the information about the autos dataframe we just created:

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

The dataset contains 20 columns, most of which are strings.
Some columns have null values, but none have more than ~20% null values.
The column names use camelCase instead of Python's preferred snake_case, which means we can't just replace spaces with underscores.
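
The conversion from camelCase to snake_case could also be sketched programmatically. The snippet below is only an illustration: the helper name `camel_to_snake` is hypothetical, and a mechanical conversion cannot produce the hand-picked names this notebook assigns (for example `odometer_km` or `ad_created`), so those still have to be set manually.

```python
import re

def camel_to_snake(name):
    # Insert an underscore at every position where a lowercase letter or digit
    # is directly followed by an uppercase letter, then lowercase everything.
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

# A few of the original column names, used here as sample input.
example_columns = ['dateCrawled', 'yearOfRegistration', 'powerPS', 'notRepairedDamage']
converted = [camel_to_snake(c) for c in example_columns]
print(converted)
```

Runs of capitals such as the `PS` in `powerPS` are kept together because the lookbehind only matches lowercase letters and digits.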

In [7]:
# exploring the column names:

autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [8]:
# changing the column names to make them easier to understand and work with:

autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'not_repaired_damage', 'ad_created', 'num_pics', 'postal_code',
       'last_seen']

In [9]:
# checking the new column names on the data frame:

autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,not_repaired_damage,ad_created,num_pics,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [10]:
# performing a general description on the information in the data frame to check if some columns can be dropped:

autos.describe(include="all")

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,not_repaired_damage,ad_created,num_pics,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-19 17:36:18,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Some columns hold almost a single value, which makes them candidates to be dropped, and others need a first cleaning step:

- seller: 49999/50000 values are "privat";
- offer_type: 49999/50000 values are "Angebot";
- num_pics: every value is 0 (the mean, standard deviation, minimum and quartiles are all 0);
- odometer_km and price: stored as strings with non-numeric characters.
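
One way to spot such near-constant columns without reading the `describe` table by eye is to compute, for each column, the share taken by its most frequent value. The sketch below uses a small made-up frame (`toy`) and a hypothetical helper `dominant_share`; neither is part of the notebook.

```python
import pandas as pd

# Toy data standing in for the real frame; the values are illustrative only.
toy = pd.DataFrame({
    "seller": ["privat", "privat", "privat", "dealer"],
    "num_pics": [0, 0, 0, 0],
    "brand": ["bmw", "audi", "opel", "ford"],
})

def dominant_share(df):
    # value_counts sorts by frequency (descending), so the first entry
    # is the share of the most common value in each column.
    return df.apply(lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])

shares = dominant_share(toy)
print(shares)
```

Columns with a share close to 1 carry almost no information and are the natural candidates for dropping.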

In [11]:
# exploring the different formats of values in the column price:

autos['price'].value_counts() 

$0         1421
$500        781
$1,500      734
$2,500      643
$1,000      639
           ... 
$1,189        1
$16,650       1
$10,333       1
$34,550       1
$63,499       1
Name: price, Length: 2357, dtype: int64

In [12]:
# cleaning the price column values:

autos['price'] = autos['price'].str.replace('$', "", regex=False) # removing the $ symbol (literal match, not a regex)

autos['price'] = autos['price'].str.replace(',', "", regex=False) # removing the thousands separators

autos['price'] = autos['price'].astype(float) #converting the values to numeric

autos['price'].describe(include="all") #checking the cleaning

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [13]:
# exploring the different formats of values in the odometer column:

autos['odometer_km'].value_counts() 

150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer_km, dtype: int64

In [14]:
# cleaning the odometer column values:

autos['odometer_km'] = autos['odometer_km'].str.replace('km', "", regex=False) # removing the km suffix

autos['odometer_km'] = autos['odometer_km'].str.replace(',', "", regex=False) # removing the thousands separators

autos['odometer_km'] = autos['odometer_km'].astype(float) #converting the values to numeric

autos['odometer_km'].describe(include="all") #checking the cleaning

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
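
Since the price and odometer columns go through the same three steps, the cleaning could be factored into one small function. This is a sketch under that assumption; `strip_to_float` is a hypothetical name, and the sample strings are taken from the value formats shown above.

```python
import pandas as pd

def strip_to_float(series, tokens):
    # Remove each literal token (regex=False avoids treating '$' as a
    # regular-expression anchor), then convert the result to float.
    cleaned = series
    for token in tokens:
        cleaned = cleaned.str.replace(token, "", regex=False)
    return cleaned.astype(float)

prices = strip_to_float(pd.Series(["$5,000", "$8,500", "$0"]), ["$", ","])
odometer = strip_to_float(pd.Series(["150,000km", "5,000km"]), ["km", ","])
print(prices.tolist(), odometer.tolist())
```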

Let's perform some more exploration of the data, looking for values that don't make sense:

In [15]:
# checking how many unique price values there are in the data set:

autos['price'].unique().shape 

(2357,)

In [16]:
# checking descriptive statistical information about the prices:

autos['price'].describe() 

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [18]:
# checking for outliers with high values (bigger than 1 million):
    
autos['price'].value_counts().sort_index(ascending=False) 

99999999.0       1
27322222.0       1
12345678.0       3
11111111.0       2
10000000.0       1
              ... 
5.0              2
3.0              1
2.0              3
1.0            156
0.0           1421
Name: price, Length: 2357, dtype: int64

In [19]:
# checking for outliers with low values (below 2,000):


autos.loc[autos['price'] < 2000, "price"].value_counts().sort_index(ascending=False)

1999.0     322
1998.0       4
1996.0       1
1995.0       4
1990.0     117
          ... 
5.0          2
3.0          1
2.0          3
1.0        156
0.0       1421
Name: price, Length: 432, dtype: int64

In [20]:
# keeping only prices between 1,000 and 1,000,000 (inclusive):

autos = autos[autos['price'].between(1000,1000000)]

In [21]:
autos['price'].describe()

count     38629.000000
mean       7332.474359
std       13060.890754
min        1000.000000
25%        2200.000000
50%        4350.000000
75%        8950.000000
max      999999.000000
Name: price, dtype: float64

We now have only cars priced between 1,000 and 1,000,000, which is a more reliable source of information.
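
A detail worth keeping in mind is that `Series.between` includes both bounds by default, so a listing priced at exactly 1,000 or 1,000,000 is kept. The sketch below illustrates this on made-up prices (`toy_prices` is not from the dataset) and also shows how the share of removed rows could be checked before filtering.

```python
import pandas as pd

# Made-up prices chosen to sit on and around the two bounds.
toy_prices = pd.Series([0.0, 1.0, 999.0, 1000.0, 5000.0, 1000000.0, 12345678.0])

in_range = toy_prices.between(1000, 1000000)  # both ends inclusive by default
kept = toy_prices[in_range]
removed_share = 1 - in_range.mean()           # mean of a boolean mask = share of True

print(kept.tolist())
print(removed_share)
```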

In [22]:
# describing information about the mileage:

autos["odometer_km"].describe()

count     38629.000000
mean     122780.035724
std       40795.760641
min        5000.000000
25%      100000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

We can see an average mileage of about 122,780 km, with a minimum of 5,000 km and a maximum of 150,000 km. Nothing looks unusual here, but we will explore in more detail as a double check:

In [23]:
autos["odometer_km"].value_counts().sort_index(ascending=False)

150000.0    23316
125000.0     4341
100000.0     1860
90000.0      1569
80000.0      1334
70000.0      1154
60000.0      1099
50000.0       986
40000.0       795
30000.0       748
20000.0       692
10000.0       228
5000.0        507
Name: odometer_km, dtype: int64

Everything seems fine here. As a result of our outlier cleaning, we can check how many entries are left in our dataset:

In [24]:
autos

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,not_repaired_damage,ad_created,num_pics,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000.0,control,bus,2004,manuell,158,andere,150000.0,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500.0,control,limousine,1997,automatik,286,7er,150000.0,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990.0,test,limousine,2009,manuell,102,golf,70000.0,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350.0,control,kleinwagen,2007,automatik,71,fortwo,70000.0,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350.0,test,kombi,2003,manuell,0,focus,150000.0,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,24900.0,control,limousine,2011,automatik,239,q5,100000.0,1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,1980.0,control,cabrio,1996,manuell,75,astra,150000.0,5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,13200.0,test,cabrio,2014,automatik,69,500,5000.0,11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,22900.0,control,kombi,2013,manuell,150,a3,40000.0,11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [25]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38629 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   date_crawled         38629 non-null  object 
 1   name                 38629 non-null  object 
 2   seller               38629 non-null  object 
 3   offer_type           38629 non-null  object 
 4   price                38629 non-null  float64
 5   abtest               38629 non-null  object 
 6   vehicle_type         35873 non-null  object 
 7   registration_year    38629 non-null  int64  
 8   gearbox              37195 non-null  object 
 9   power_ps             38629 non-null  int64  
 10  model                37011 non-null  object 
 11  odometer_km          38629 non-null  float64
 12  registration_month   38629 non-null  int64  
 13  fuel_type            36271 non-null  object 
 14  brand                38629 non-null  object 
 15  not_repaired_damage  32857 non-null 

By removing all cars with prices lower than 1,000 or higher than 1,000,000, we went from 50,000 to 38,629 entries.

Exploring the date columns. There are 5 columns with date information:

- date_crawled: added by the crawler
- last_seen: added by the crawler
- ad_created: from the website
- registration_month: from the website
- registration_year: from the website

The first three columns hold string values; the other two hold numeric values.

In [26]:
# understanding how the values in the three string columns are formatted:

autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50
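
Because the three string columns share the `YYYY-MM-DD HH:MM:SS` layout shown above, they could also be parsed into real datetimes instead of being sliced as text. The following is a sketch of that alternative, not the approach this notebook takes; it uses two of the sample rows displayed above and the variable names are illustrative.

```python
import pandas as pd

# Two rows copied from the sample output above.
sample = pd.DataFrame({
    "date_crawled": ["2016-03-26 17:47:46", "2016-03-12 16:58:10"],
    "last_seen": ["2016-04-06 06:45:54", "2016-03-15 03:16:28"],
})

fmt = "%Y-%m-%d %H:%M:%S"
crawled = pd.to_datetime(sample["date_crawled"], format=fmt)
last_seen = pd.to_datetime(sample["last_seen"], format=fmt)

# Whole days between the first crawl and the last sighting of each ad.
days_between = (last_seen - crawled).dt.days
# Equivalent to the .str[0:10] slice, but derived from parsed dates.
crawl_day = crawled.dt.strftime("%Y-%m-%d")

print(days_between.tolist())
print(crawl_day.tolist())
```

Parsed datetimes make date arithmetic possible, while the string slice is enough when only a per-day distribution is needed.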


In [27]:
# calculating the distribution of the crawled dates columns as percentages:

autos['date_crawled'].str[0:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025551
2016-03-06    0.013876
2016-03-07    0.035129
2016-03-08    0.032618
2016-03-09    0.032463
2016-03-10    0.033317
2016-03-11    0.032799
2016-03-12    0.037381
2016-03-13    0.015998
2016-03-14    0.036631
2016-03-15    0.033628
2016-03-16    0.029071
2016-03-17    0.030495
2016-03-18    0.012840
2016-03-19    0.035129
2016-03-20    0.038158
2016-03-21    0.037304
2016-03-22    0.032514
2016-03-23    0.032204
2016-03-24    0.029020
2016-03-25    0.030521
2016-03-26    0.033110
2016-03-27    0.031401
2016-03-28    0.035362
2016-03-29    0.033990
2016-03-30    0.033058
2016-03-31    0.031401
2016-04-01    0.034611
2016-04-02    0.036294
2016-04-03    0.039142
2016-04-04    0.036863
2016-04-05    0.013358
2016-04-06    0.003262
2016-04-07    0.001501
Name: date_crawled, dtype: float64

It seems the listings were crawled over roughly one month, between March 5th, 2016 and April 7th, 2016. The distribution is roughly uniform, at around 3% of the listings per day, with a few lower days (for example March 6th, 13th and 18th) and a drop-off over the final three days.

In [28]:
# calculating the distribution of the listing dates columns as percentages:

autos['ad_created'].str[0:10].value_counts(normalize=True, dropna=False).sort_index()

2015-06-11    0.000026
2015-08-10    0.000026
2015-09-09    0.000026
2015-11-10    0.000026
2015-12-30    0.000026
                ...   
2016-04-03    0.039452
2016-04-04    0.037278
2016-04-05    0.011986
2016-04-06    0.003365
2016-04-07    0.001320
Name: ad_created, Length: 74, dtype: float64

The ads were created between June 11th, 2015 and April 7th, 2016. Most of them were created within the crawling period (March 5th to April 7th, 2016). Some ads are considerably older, however: the oldest one was created in June 2015, roughly nine months before the crawl began.

In [29]:
# calculating the distribution of the last seen dates columns as percentages:

autos['last_seen'].str[0:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001087
2016-03-06    0.003572
2016-03-07    0.004556
2016-03-08    0.006239
2016-03-09    0.008905
2016-03-10    0.009811
2016-03-11    0.011727
2016-03-12    0.022185
2016-03-13    0.008387
2016-03-14    0.011986
2016-03-15    0.014989
2016-03-16    0.015455
2016-03-17    0.026379
2016-03-18    0.007378
2016-03-19    0.014600
2016-03-20    0.019804
2016-03-21    0.019674
2016-03-22    0.020787
2016-03-23    0.017914
2016-03-24    0.018535
2016-03-25    0.017759
2016-03-26    0.016076
2016-03-27    0.014083
2016-03-28    0.019441
2016-03-29    0.020787
2016-03-30    0.023454
2016-03-31    0.022729
2016-04-01    0.023195
2016-04-02    0.024904
2016-04-03    0.024438
2016-04-04    0.023376
2016-04-05    0.131119
2016-04-06    0.234694
2016-04-07    0.139973
Name: last_seen, dtype: float64

The crawler recorded the date it last saw any listing, which allows us to determine on what day a listing was removed, presumably because the car was sold.

The last three days contain a disproportionate amount of 'last seen' values. Given that these are 6-10x the values from the previous days, it's unlikely that there was a massive spike in sales, and more likely that these values are to do with the crawling period ending and don't indicate car sales.
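
The size of that spike can be checked with a little arithmetic on the shares printed above. The sketch below copies the last four values from the output by hand (the variable names are illustrative) and compares the final three days with April 4th, the last day before the spike.

```python
import pandas as pd

# Shares copied from the last_seen distribution above.
last_seen_share = pd.Series({
    "2016-04-04": 0.023376,
    "2016-04-05": 0.131119,
    "2016-04-06": 0.234694,
    "2016-04-07": 0.139973,
})

final_days = last_seen_share.iloc[-3:]
total_final = final_days.sum()                             # combined share of the last three days
ratios = final_days / last_seen_share["2016-04-04"]        # each day relative to April 4th

print(round(total_final, 6))
print(ratios.round(1))
```

Together the final three days account for about half of all `last_seen` values, which supports reading them as an artifact of the crawl ending rather than as sales.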



Exploring the distribution of the registration year:

In [30]:
autos['registration_year'].describe()

count    38629.000000
mean      2005.678713
std         86.681928
min       1000.000000
25%       2001.000000
50%       2005.000000
75%       2009.000000
max       9999.000000
Name: registration_year, dtype: float64

As we can see above, the earliest registration year is 1000 and the latest is 9999, which are clearly nonsensical values.

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.

In [31]:
autos["registration_year"].between(1900,2016).sum()/autos.shape[0]

0.963240052810065

We see that about 96% of the registration years fall inside a coherent range. In this case we can drop the values outside this range.
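
Before dropping rows, it can help to look at which values actually fall outside the range. The sketch below shows one way to do that on a made-up series (`years` is illustrative, not taken from the dataset); it also uses the mean of the boolean mask, which is equivalent to dividing the sum by the number of rows.

```python
import pandas as pd

# Made-up registration years, including values on and beyond the bounds.
years = pd.Series([1000, 1995, 2005, 2016, 2017, 9999])

valid = years.between(1900, 2016)   # inclusive on both ends
valid_share = valid.mean()          # same as valid.sum() / len(years)

# Frequency of each out-of-range year, sorted by year.
outside = years[~valid].value_counts().sort_index()

print(valid_share)
print(outside)
```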

In [32]:
# removing values outside the range:
autos = autos[autos["registration_year"].between(1900,2016)]

#checking the new distribution:
autos["registration_year"].value_counts(normalize=True, ascending=False)

2005    0.074847
2006    0.071246
2004    0.070091
2003    0.066570
2007    0.060711
          ...   
1950    0.000027
1948    0.000027
1938    0.000027
1939    0.000027
1952    0.000027
Name: registration_year, Length: 77, dtype: float64

We conclude that most of the vehicles were registered after 1995.

Exploring price by brand:

In [33]:
autos["brand"].value_counts(normalize=True, ascending=False)

volkswagen        0.210836
bmw               0.125346
mercedes_benz     0.111586
audi              0.097584
opel              0.089064
ford              0.058722
renault           0.037276
peugeot           0.027896
fiat              0.021070
skoda             0.019055
seat              0.017281
smart             0.016609
toyota            0.014620
mazda             0.014244
citroen           0.013894
nissan            0.013626
mini              0.010884
hyundai           0.010750
sonstige_autos    0.010454
volvo             0.008976
kia               0.007686
porsche           0.007471
honda             0.007337
mitsubishi        0.006880
chevrolet         0.006611
alfa_romeo        0.006235
suzuki            0.005724
dacia             0.003279
chrysler          0.003171
jeep              0.002768
land_rover        0.002634
jaguar            0.001854
subaru            0.001720
daihatsu          0.001693
saab              0.001371
daewoo            0.000914
trabant           0.000860
r

We can see above that not all brands are well represented in our dataset. So, in order to perform aggregations to analyze prices and mileages, we will restrict our analysis to the 10 most represented brands:

In [34]:
# checking the relative distribution of brands in the data set:
brand_counts = autos["brand"].value_counts(normalize=True)

# selecting only the 10 most represented brands in the dataset:
common_brands = brand_counts.iloc[:10]

print(common_brands)
print(type(common_brands))

volkswagen       0.210836
bmw              0.125346
mercedes_benz    0.111586
audi             0.097584
opel             0.089064
ford             0.058722
renault          0.037276
peugeot          0.027896
fiat             0.021070
skoda            0.019055
Name: brand, dtype: float64
<class 'pandas.core.series.Series'>


In [35]:
# we can use the .index attribute to get the labels of a series.
# let's store them so we can iterate over them to create a mean price table:

brands = common_brands.index
print(brands)

Index(['volkswagen', 'bmw', 'mercedes_benz', 'audi', 'opel', 'ford', 'renault',
       'peugeot', 'fiat', 'skoda'],
      dtype='object')


In [36]:
top_brands_mean_price = {}

for brand in brands:

    brand_only = autos[autos["brand"] == brand]
    mean_price = brand_only["price"].mean()
    top_brands_mean_price[brand] = int(mean_price)

top_brands_mean_price

{'volkswagen': 6898,
 'bmw': 9119,
 'mercedes_benz': 9302,
 'audi': 10322,
 'opel': 4219,
 'ford': 5786,
 'renault': 3590,
 'peugeot': 3955,
 'fiat': 4008,
 'skoda': 6836}

As we can see, Audi, Mercedes-Benz and BMW are the most expensive of the top brands, while Renault and Peugeot are the least expensive. We can check whether there is any relation between these values and the mean mileage for these brands.

In [37]:
# creating a table of the mean mileage of each of the top brands:

top_brands_mean_mileage = {}

for brand in brands:

    brand_only = autos[autos["brand"] == brand]
    mean_mileage = brand_only["odometer_km"].mean()
    top_brands_mean_mileage[brand] = int(mean_mileage)

top_brands_mean_mileage

{'volkswagen': 125771,
 'bmw': 132001,
 'mercedes_benz': 130062,
 'audi': 127491,
 'opel': 123952,
 'ford': 119622,
 'renault': 121423,
 'peugeot': 122341,
 'fiat': 107901,
 'skoda': 110063}

We can now create a dataframe containing both pieces of information and see if there is a direct relation between the two values.
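
As an aside, the two loops above could be replaced by a single `groupby` aggregation that yields both means at once. This is a sketch of that alternative on made-up rows (`toy` and `top` are illustrative), not the method this notebook follows; note that it keeps the means as floats rather than truncating them to integers.

```python
import pandas as pd

# Made-up listings; only the column names match the notebook.
toy = pd.DataFrame({
    "brand": ["audi", "audi", "bmw", "bmw", "fiat"],
    "price": [10000.0, 12000.0, 9000.0, 9500.0, 4000.0],
    "odometer_km": [125000.0, 150000.0, 150000.0, 100000.0, 90000.0],
})
top = ["audi", "bmw"]  # stands in for the list of most common brands

summary = (
    toy[toy["brand"].isin(top)]            # keep only the selected brands
    .groupby("brand")[["price", "odometer_km"]]
    .mean()                                # mean of both columns per brand
    .rename(columns={"price": "mean_price", "odometer_km": "mean_mileage"})
    .sort_values("mean_price", ascending=False)
)
print(summary)
```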

In [38]:
# creating data series for both values:

bmp_series = pd.Series(top_brands_mean_price).sort_values(ascending=False)
print(bmp_series)

bmm_series = pd.Series(top_brands_mean_mileage).sort_values(ascending=False)
print(bmm_series)

audi             10322
mercedes_benz     9302
bmw               9119
volkswagen        6898
skoda             6836
ford              5786
opel              4219
fiat              4008
peugeot           3955
renault           3590
dtype: int64
bmw              132001
mercedes_benz    130062
audi             127491
volkswagen       125771
opel             123952
peugeot          122341
renault          121423
ford             119622
skoda            110063
fiat             107901
dtype: int64


In [39]:
# creating data frames based on both series:

df_bmp = pd.DataFrame(bmp_series, columns = ['mean_price'])
df_bmm = pd.DataFrame(bmm_series, columns = ['mean_mileage'])

In [40]:
df_bmp

Unnamed: 0,mean_price
audi,10322
mercedes_benz,9302
bmw,9119
volkswagen,6898
skoda,6836
ford,5786
opel,4219
fiat,4008
peugeot,3955
renault,3590


In [41]:
df_bmm

Unnamed: 0,mean_mileage
bmw,132001
mercedes_benz,130062
audi,127491
volkswagen,125771
opel,123952
peugeot,122341
renault,121423
ford,119622
skoda,110063
fiat,107901


In [42]:
# we can now combine both data frames

df_bmp['mean_mileage'] = df_bmm['mean_mileage']
df_brand_info = df_bmp
df_brand_info

Unnamed: 0,mean_price,mean_mileage
audi,10322,127491
mercedes_benz,9302,130062
bmw,9119,132001
volkswagen,6898,125771
skoda,6836,110063
ford,5786,119622
opel,4219,123952
fiat,4008,107901
peugeot,3955,122341
renault,3590,121423


We can conclude that mean mileage does not vary as much as mean price across brands: mean mileage ranges from about 108,000 km to 132,000 km, while mean price ranges from about 3,600 to 10,300. There is also a slight trend for the most expensive brands to have higher mileage.
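
One way to put a number on that trend is a Pearson correlation between the two columns. The sketch below rebuilds the table from the values printed above (`brand_table` is an illustrative name) and calls `Series.corr`; with only ten brands the coefficient should be read as a rough indication, not as evidence of a relationship.

```python
import pandas as pd

# Values copied from the combined table above.
brand_table = pd.DataFrame(
    {
        "mean_price": [10322, 9302, 9119, 6898, 6836, 5786, 4219, 4008, 3955, 3590],
        "mean_mileage": [127491, 130062, 132001, 125771, 110063,
                         119622, 123952, 107901, 122341, 121423],
    },
    index=["audi", "mercedes_benz", "bmw", "volkswagen", "skoda",
           "ford", "opel", "fiat", "peugeot", "renault"],
)

# Pearson correlation (the default method of Series.corr).
r = brand_table["mean_price"].corr(brand_table["mean_mileage"])
print(round(r, 2))
```

A positive coefficient is consistent with the trend described above, while a value well below 1 reflects exceptions such as Skoda, which combines a mid-range price with comparatively low mileage.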