## Exploring eBay Car Sales Data

In this project, we will be working with a dataset of used cars from eBay.  The dataset contains 50,000 data points that describe the types and conditions of cars being sold.

The goal of this project is to clean the data and analyze the used car listings.

In [250]:
#Import NumPy and Pandas

import numpy as np
import pandas as pd

In [251]:
#Read in csv file

autos = pd.read_csv("autos.csv", encoding = "Latin-1")

In [252]:
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Initial data exploration reveals the dataset contains 20 columns, most of which are stored as strings. A few of the columns have null values, but no column has more than ~20% null values. Additionally, some of the columns contain German words or follow German conventions, e.g. odometer readings in kilometers rather than miles.
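As a rough check on the null-value observation, the share of missing entries per column can be computed directly. The sketch below uses a small made-up frame (`sample`, a hypothetical stand-in for `autos`) so that it runs on its own; the same expression should carry over to the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for `autos`; the values are illustrative only
sample = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", "limousine", None],
    "gearbox": ["manuell", "automatik", None, "manuell", "manuell"],
    "brand": ["peugeot", "bmw", "ford", "smart", "opel"],
})

# isnull() marks missing cells; the column mean of those booleans is the null fraction
null_share = sample.isnull().mean().sort_values(ascending=False)
print(null_share)
```

In the `info()` output above, `notRepairedDamage` has 40,171 non-null entries out of 50,000, so about 19.7% of its values are missing, the highest share of any column.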

We'll start by cleaning the column names to make the data easier to work with.

## Data Cleaning: Column Names

In [253]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

We'll make a few changes to these column names to enhance clarity:
  1. Reword some names to better describe the data in the column
  2. Convert the names from camelCase to snake_case

In [254]:
col_names = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'powerPS', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'number_pictures', 'postal_code',
       'last_seen']
autos.columns = col_names
autos.head(2)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
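The case conversion part of the renaming could also be automated rather than typed by hand. The helper below (`camel_to_snake`, a hypothetical name not used elsewhere in this notebook) is a minimal sketch of that idea. It only changes the casing, so rewordings such as `yearOfRegistration` to `registration_year` would still need to be applied manually.

```python
import re

def camel_to_snake(name):
    # Insert "_" wherever a lowercase letter or digit is followed by an uppercase letter,
    # then lowercase the whole string
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

original = ["dateCrawled", "offerType", "nrOfPictures", "abtest"]
converted = [camel_to_snake(col) for col in original]
print(converted)
```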


## Data Cleaning: Removing Columns & Converting Data Types

We'll start by exploring the data to find obvious candidates for data cleaning.

In [255]:
autos.describe(include = "all")

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-05 16:57:05,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


There are a couple of columns where most of the values are the same and don't provide much utility to our analysis:
* seller
* offer_type
  
There is also a column that looks a little odd, which we should investigate:
* number_pictures  

In [256]:
autos["number_pictures"].value_counts()

0    50000
Name: number_pictures, dtype: int64

It looks like none of the listings have any pictures.  We'll go ahead and delete the pictures column, as well as the two others that mostly have one value. 

In [257]:
autos = autos.drop(columns = ["number_pictures", "seller", "offer_type"])
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 17 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
price                 50000 non-null object
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gearbox               47320 non-null object
powerPS               50000 non-null int64
model                 47242 non-null object
odometer              50000 non-null object
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int64(4), object(13)
memory usage: 6.5+ MB
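Columns dominated by a single value can also be found programmatically instead of by reading the `describe()` table. The sketch below assumes a small made-up frame (`sample`) and an arbitrary 75% threshold; both are illustrative choices rather than part of the original analysis.

```python
import pandas as pd

# Hypothetical stand-in for `autos`; the values are illustrative only
sample = pd.DataFrame({
    "seller": ["privat", "privat", "privat", "gewerblich"],
    "number_pictures": [0, 0, 0, 0],
    "brand": ["bmw", "ford", "opel", "bmw"],
})

# For each column, the fraction of rows taken up by its most common value
top_share = sample.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)

# Columns whose most common value covers at least 75% of rows (arbitrary threshold)
near_constant = top_share[top_share >= 0.75].index.tolist()
print(near_constant)
```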


Now we'll look for columns where numeric data is being stored in strings.

In [258]:
autos.dtypes

date_crawled          object
name                  object
price                 object
abtest                object
vehicle_type          object
registration_year      int64
gearbox               object
powerPS                int64
model                 object
odometer              object
registration_month     int64
fuel_type             object
brand                 object
unrepaired_damage     object
ad_created            object
postal_code            int64
last_seen             object
dtype: object

It looks like the price and odometer columns are being stored in strings.  We'll remove non-numeric characters and convert the columns to the numeric data type.

In [259]:
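# Strip the currency symbol and thousands separators, then convert to float.
# regex = False makes "$" a literal character rather than a regex anchor.
autos["price"] = (autos["price"]
                  .str.replace("$", "", regex = False)
                  .str.replace(",", "", regex = False)
                  .astype(float)
                 )
autos["odometer"] = (autos["odometer"]
                     .str.replace(",", "", regex = False)
                     .str.replace("km", "", regex = False)
                     .astype(float)
                    )
autos.rename({"odometer": "odometer_km"}, axis = 1, inplace = True)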
print(autos["price"].head())
print(autos["odometer_km"].head())

0    5000.0
1    8500.0
2    8990.0
3    4350.0
4    1350.0
Name: price, dtype: float64
0    150000.0
1    150000.0
2     70000.0
3     70000.0
4    150000.0
Name: odometer_km, dtype: float64
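One detail worth checking in string cleaning like this is that `$` is an end-of-string anchor in regular expressions, and how `str.replace` interprets its pattern has differed between pandas versions. Passing `regex=False` sidesteps the question by treating the pattern as plain text. A minimal sketch on a made-up series (`raw_price` is a hypothetical name):

```python
import pandas as pd

# Hypothetical price strings in the same format as the dataset
raw_price = pd.Series(["$5,000", "$8,500", "$0"])

# regex=False treats "$" and "," as literal characters, not regex syntax
clean_price = (raw_price
               .str.replace("$", "", regex=False)
               .str.replace(",", "", regex=False)
               .astype(float))
print(clean_price.tolist())
```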


## Data Cleaning: Searching for Outliers

Now that our numeric data is stored as floating-point values, we will determine whether the price and odometer columns have any outliers. We'll start with the price column.

In [260]:
print(autos["price"].unique().shape)
print("\n")
print(autos["price"].describe())
print("\n")
print(autos["price"].value_counts().head(15))
print("\n")
print(autos["price"].value_counts().sort_index(ascending = False).head(25))
print("\n")
print(autos["price"].value_counts().sort_index(ascending = True).head(15))

(2357,)


count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64


0.0       1421
500.0      781
1500.0     734
2500.0     643
1200.0     639
1000.0     639
600.0      531
800.0      498
3500.0     498
2000.0     460
999.0      434
750.0      433
900.0      420
650.0      419
850.0      410
Name: price, dtype: int64


99999999.0    1
27322222.0    1
12345678.0    3
11111111.0    2
10000000.0    1
3890000.0     1
1300000.0     1
1234566.0     1
999999.0      2
999990.0      1
350000.0      1
345000.0      1
299000.0      1
295000.0      1
265000.0      1
259000.0      1
250000.0      1
220000.0      1
198000.0      1
197000.0      1
194000.0      1
190000.0      1
180000.0      1
175000.0      1
169999.0      1
Name: price, dtype: int64


0.0     1421
1.0      156
2.0        3
3.0        1
5.0        2
8.0        1
9.0        1
10.0       7

It looks like there are ~1,400 listings with a price of \$0. Since these make up less than 3% of the data and are unlikely to reflect real asking prices, we will remove them.

There are also a number of listings with very low prices, for example, \$1. Given that eBay is an auction site, it isn't unreasonable that there could be opening bids of \$1, so we will keep these entries.

On the other hand, there are also a small number of listings with very large price tags. Since prices increase steadily up to \$350,000 but become erratic at higher values, we will remove the listings with a price above \$350,000 (14 entries).

In [261]:
# between() includes both endpoints by default; copy() avoids chained-assignment warnings later
autos = autos[autos["price"].between(0.01, 350000)].copy()
autos["price"].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64
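Before dropping rows, it can help to count how many fall on each side of the chosen bounds. The sketch below does this on a handful of made-up prices (`prices` is hypothetical); the bounds mirror the ones used above.

```python
import pandas as pd

# Hypothetical prices; the bounds mirror the ones chosen above
prices = pd.Series([0.0, 1.0, 500.0, 350000.0, 999999.0, 12345678.0])

in_range = prices.between(0.01, 350000)  # both endpoints are included by default
too_low = int((prices < 0.01).sum())
too_high = int((prices > 350000).sum())
print(int(in_range.sum()), too_low, too_high)
```

In the actual data, the row count drops from 50,000 to 48,565, a loss of 1,435 rows, which matches the 1,421 zero-priced listings plus the 14 listings above \$350,000.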

Now we will explore the odometer column a little more.  

In [262]:
print(autos["odometer_km"].unique().shape)
print("\n")
print(autos["odometer_km"].describe())
print("\n")
print(autos["odometer_km"].value_counts())

(13,)


count     48565.000000
mean     125770.101925
std       39788.636804
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64


150000.0    31414
125000.0     5057
100000.0     2115
90000.0      1734
80000.0      1415
70000.0      1217
60000.0      1155
50000.0      1012
5000.0        836
40000.0       815
30000.0       780
20000.0       762
10000.0       253
Name: odometer_km, dtype: int64


All of the odometer readings are rounded numbers.  This suggests that there might be a selection of odometer readings the seller has to choose from.  Generally, it doesn't appear that there are any obvious outliers.

## Data Cleaning: Cleaning Date & Time Values

Now that we've cleaned up the numeric data, let's explore the dataset's date and time columns.   

In [263]:
print(autos[["date_crawled", "ad_created", "last_seen", "registration_month", "registration_year"]].dtypes)
autos[["date_crawled", "ad_created", "last_seen", "registration_month", "registration_year"]][0:5]

date_crawled          object
ad_created            object
last_seen             object
registration_month     int64
registration_year      int64
dtype: object


Unnamed: 0,date_crawled,ad_created,last_seen,registration_month,registration_year
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54,3,2004
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08,6,1997
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37,7,2009
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28,6,2007
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50,7,2003


The "date_crawled", "ad_created", and "last_seen" columns are all currently stored as strings in the form `YYYY-MM-DD HH:MM:SS`. To better understand how these values are distributed by day, we will keep only the date portion (the first 10 characters) of each string.

In [264]:
#Extract only the date component from the datetime strings

autos["date_crawled"] = autos["date_crawled"].str[:10]
autos["ad_created"] = autos["ad_created"].str[:10]
autos["last_seen"] = autos["last_seen"].str[:10]
autos[["date_crawled", "ad_created", "last_seen"]].head(10)

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26,2016-03-26,2016-04-06
1,2016-04-04,2016-04-04,2016-04-06
2,2016-03-26,2016-03-26,2016-04-06
3,2016-03-12,2016-03-12,2016-03-15
4,2016-04-01,2016-04-01,2016-04-01
5,2016-03-21,2016-03-21,2016-04-06
6,2016-03-20,2016-03-20,2016-03-23
7,2016-03-16,2016-03-16,2016-04-07
8,2016-03-22,2016-03-22,2016-03-26
9,2016-03-16,2016-03-16,2016-04-06
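Slicing the first ten characters keeps the dates as strings, which is enough for counting listings per day. If date arithmetic were needed later, one alternative would be to parse the values with `pd.to_datetime`. This is a sketch of that alternative on two made-up timestamps, not the approach used in this notebook:

```python
import pandas as pd

# Hypothetical timestamps in the same format as the dataset
stamps = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

# Parse the strings into datetime64 values, then reset the time part to midnight
parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")
dates_only = parsed.dt.normalize()

# Parsed dates support arithmetic, e.g. the number of days between two rows
gap_days = (dates_only.iloc[1] - dates_only.iloc[0]).days
print(dates_only.dt.strftime("%Y-%m-%d").tolist(), gap_days)
```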


In [265]:
(autos["date_crawled"]
    .value_counts(normalize = True, dropna = False)
    .sort_index()
    )

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

The crawler ran daily from 5 March to 7 April 2016. Apart from a few lighter days (for example 6, 13, and 18 March, and the final three days), the share of listings crawled each day is roughly uniform at about 3%.

In [266]:
(autos["ad_created"]
    .value_counts(normalize = True, dropna = False)
    .sort_index()
    )

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
                ...   
2016-03-09    0.033151
2016-03-10    0.031895
2016-03-11    0.032904
2016-03-12    0.036755
2016-03-13    0.017008
2016-03-14    0.035190
2016-03-15    0.034016
2016-03-16    0.030125
2016-03-17    0.031278
2016-03-18    0.013590
2016-03-19    0.033687
2016-03-20    0.037949
2016-03-21 

In [267]:
(autos["ad_created"]
    .value_counts(normalize = True, dropna = False)
    .sort_values()
    )

2016-02-22    0.000021
2016-02-16    0.000021
2016-02-09    0.000021
2015-12-30    0.000021
2016-02-17    0.000021
2016-02-11    0.000021
2016-02-08    0.000021
2016-01-13    0.000021
2015-08-10    0.000021
2016-01-16    0.000021
2016-02-01    0.000021
2016-01-03    0.000021
2016-01-29    0.000021
2016-01-07    0.000021
2015-06-11    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2016-02-07    0.000021
2015-12-05    0.000021
2016-01-22    0.000021
2016-01-14    0.000021
2016-02-24    0.000041
2016-02-05    0.000041
2016-02-18    0.000041
2016-02-20    0.000041
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-02    0.000041
2016-02-26    0.000041
2016-01-10    0.000041
                ...   
2016-03-06    0.015320
2016-03-13    0.017008
2016-03-05    0.022897
2016-03-24    0.029280
2016-03-16    0.030125
2016-03-27    0.030989
2016-03-17    0.031278
2016-03-25    0.031751
2016-03-31    0.031875
2016-03-10    0.031895
2016-03-23    0.032060
2016-03-26    0.032266
2016-03-22 

There is a wide range of dates on which ads were created. Most fall within a month or two of the crawl period, but a small number of ads are as much as 8-9 months older.

In [268]:
(autos["last_seen"]
    .value_counts(normalize = True, dropna = False)
    .sort_values(ascending = False)
    )

2016-04-06    0.221806
2016-04-07    0.131947
2016-04-05    0.124761
2016-03-17    0.028086
2016-04-03    0.025203
2016-04-02    0.024915
2016-03-30    0.024771
2016-04-04    0.024483
2016-03-31    0.023783
2016-03-12    0.023783
2016-04-01    0.022794
2016-03-29    0.022341
2016-03-22    0.021373
2016-03-28    0.020859
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-23    0.018532
2016-03-26    0.016802
2016-03-16    0.016452
2016-03-15    0.015876
2016-03-19    0.015834
2016-03-27    0.015649
2016-03-14    0.012602
2016-03-11    0.012375
2016-03-10    0.010666
2016-03-09    0.009595
2016-03-13    0.008895
2016-03-08    0.007413
2016-03-18    0.007351
2016-03-07    0.005395
2016-03-06    0.004324
2016-03-05    0.001071
Name: last_seen, dtype: float64

The "last_seen" column records the date the crawler last saw a listing.  Presumably, when a listing no longer exists, the seller either removed it or the car was sold.

The last three days the crawler was active contain a disproportionate number of last seen values.  There doesn't seem to be a compelling reason car sales would spike at the same time the crawler stops crawling, so we think this distribution has to do more with the crawling period coming to an end.

Now we'll look at the distribution of the registration years.

In [269]:
print(autos["registration_year"].describe())

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64


The distribution of registration years has some odd values. The minimum value is 1000, long before cars were invented, and the maximum value is 9999, which is millennia in the future.

We will treat any registration year before 1913, the year Ford introduced the moving assembly line that made mass production of the Model T possible, as invalid and delete those rows.

Similarly, we will delete any row with a registration year after 2016, the year in which this data was collected.

In [270]:
# between() includes both endpoints by default
autos = autos[autos["registration_year"].between(1913, 2016)]
autos["registration_year"].value_counts(normalize = True).head(10)

2000    0.067615
2005    0.062902
1999    0.062066
2004    0.057910
2003    0.057824
2006    0.057203
2001    0.056474
2002    0.053261
1998    0.050626
2007    0.048783
Name: registration_year, dtype: float64
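The claim about recent registrations can be checked by measuring the share of valid rows registered in 1996 or later. The sketch below runs on made-up years (`years` is hypothetical) and is only meant to show the calculation.

```python
import pandas as pd

# Hypothetical registration years, including two impossible values
years = pd.Series([1000, 1985, 1999, 2004, 2008, 2016, 9999])

# Keep only years in the accepted range (both endpoints included)
valid_years = years[years.between(1913, 2016)]

# Share of valid rows registered in 1996 or later (within 20 years of 2016)
recent_share = (valid_years >= 2016 - 20).mean()
print(len(valid_years), recent_share)
```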

Most of the vehicles in the listings were registered within the past 20 years.

## Exploring Data: Brand & Price

We will now explore the unique values in the "brand" column and aggregate price data by our selected brands.

In [271]:
autos["brand"].value_counts(normalize = True)

volkswagen        0.211286
bmw               0.110057
opel              0.107550
mercedes_benz     0.096474
audi              0.086576
ford              0.069907
renault           0.047133
peugeot           0.029844
fiat              0.025645
seat              0.018275
skoda             0.016411
nissan            0.015276
mazda             0.015190
smart             0.014161
citroen           0.014011
toyota            0.012705
hyundai           0.010027
sonstige_autos    0.009791
volvo             0.009148
mini              0.008763
mitsubishi        0.008227
honda             0.007841
kia               0.007070
alfa_romeo        0.006642
porsche           0.006127
suzuki            0.005935
chevrolet         0.005699
chrysler          0.003514
dacia             0.002635
daihatsu          0.002507
jeep              0.002271
subaru            0.002142
land_rover        0.002100
saab              0.001650
jaguar            0.001564
daewoo            0.001500
trabant           0.001371
r

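German manufacturers account for the majority of the listings: the five most common brands (Volkswagen, BMW, Opel, Mercedes-Benz, and Audi) are all German and together make up roughly 61% of the data. This makes sense, given that the dataset is from the German eBay car marketplace.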

To keep our analysis streamlined, we will only be analyzing brands that comprise at least 1% of the data.

In [272]:
#Aggregate price data by selected brands

brand_counts = autos["brand"].value_counts(normalize = True)
selected_brands = brand_counts[brand_counts >= 0.01].index

top_brands_price = {}
for brand in selected_brands:
    brand_rows = autos[autos["brand"] == brand]  # listings for this brand only
    mean_price = brand_rows["price"].mean()
    top_brands_price[brand] = int(mean_price)
    
top_brands_df = pd.DataFrame.from_dict(top_brands_price, orient = "index")
print(top_brands_df[0].sort_values(ascending = False))

audi             9336
mercedes_benz    8628
bmw              8332
skoda            6368
volkswagen       5402
hyundai          5365
toyota           5167
nissan           4743
seat             4397
mazda            4112
citroen          3779
ford             3749
smart            3580
peugeot          3094
opel             2976
fiat             2813
renault          2475
Name: 0, dtype: int64
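The same per-brand aggregation can be written without an explicit loop by using `groupby`. The sketch below is an equivalent formulation on a small made-up frame (`sample`), with a 30% share threshold chosen only so the toy data has something to filter out; the analysis above uses 1%.

```python
import pandas as pd

# Hypothetical stand-in for `autos`; the values are illustrative only
sample = pd.DataFrame({
    "brand": ["audi", "audi", "opel", "opel", "fiat"],
    "price": [9000.0, 10000.0, 3000.0, 2000.0, 2800.0],
})

# Keep brands that make up at least 30% of rows
brand_share = sample["brand"].value_counts(normalize=True)
selected = brand_share[brand_share >= 0.3].index

# Mean price per selected brand, highest first
mean_price_by_brand = (sample[sample["brand"].isin(selected)]
                       .groupby("brand")["price"]
                       .mean()
                       .sort_values(ascending=False))
print(mean_price_by_brand)
```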


Audi, Mercedes-Benz, and BMW are the most expensive brands listed, while Renault, Fiat, and Opel are the least expensive. Volkswagen, Hyundai, and Toyota fall in the middle. Perhaps Volkswagen's popularity can be explained by it being both a German brand and a good middle ground on price.

## Exploring Data: Brand & Mileage

Now we will explore how mileage differs across brands, and whether there is a relationship between mileage and price.

In [289]:
# Aggregate mileage data by selected brands

top_brands_mileage = {}
for brand in selected_brands:
    brand_rows = autos[autos["brand"] == brand]  # listings for this brand only
    mean_mileage = brand_rows["odometer_km"].mean()
    top_brands_mileage[brand] = int(mean_mileage)
    
price_series = pd.Series(top_brands_price)
mileage_series = pd.Series(top_brands_mileage)

price_mileage_df = pd.DataFrame(price_series, columns = ["mean_price"])
price_mileage_df["mean_mileage"] = mileage_series

print(price_mileage_df["mean_mileage"].describe())
print("\n")
print(price_mileage_df["mean_price"].describe())
price_mileage_df.sort_values(by = "mean_mileage", ascending = False)

count        17.000000
mean     121375.352941
std        9246.181739
min       99326.000000
25%      117121.000000
50%      124266.000000
75%      128707.000000
max      132572.000000
Name: mean_mileage, dtype: float64


count      17.000000
mean     4959.764706
std      2097.775236
min      2475.000000
25%      3580.000000
50%      4397.000000
75%      5402.000000
max      9336.000000
Name: mean_price, dtype: float64


Unnamed: 0,mean_price,mean_mileage
bmw,8332,132572
mercedes_benz,8628,130788
opel,2976,129311
audi,9336,129157
volkswagen,5402,128707
renault,2475,128127
peugeot,3094,127153
mazda,4112,124464
ford,3749,124266
seat,4397,121131


Mean mileage varies far less across brands than mean price does. The brand mileages all fall within ~20% of the overall average (and the ten highest-mileage brands shown above fall within ~10%), whereas brand prices differ from the average by as much as 88%.

There is a slight tendency for the more expensive brands (BMW, Mercedes-Benz, Audi) to have higher mean mileage. However, inexpensive brands such as Opel and Renault also sit near the top of the mileage table, so the relationship is weak.
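The percentage figures quoted here can be reproduced from the `describe()` summaries printed above. The sketch below copies those summary values by hand and uses a small helper (`max_relative_deviation`, a hypothetical name) to compute the largest distance from the mean as a fraction of the mean.

```python
# Summary statistics copied from the describe() output above
mean_price, min_price, max_price = 4959.764706, 2475.0, 9336.0
mean_km, min_km, max_km = 121375.352941, 99326.0, 132572.0

def max_relative_deviation(mean, low, high):
    # Largest distance from the mean, expressed as a fraction of the mean
    return max(mean - low, high - mean) / mean

price_dev = max_relative_deviation(mean_price, min_price, max_price)
km_dev = max_relative_deviation(mean_km, min_km, max_km)
print(f"price: {price_dev:.1%}, mileage: {km_dev:.1%}")
```

The price figure is driven by the most expensive brand (Audi), while the mileage figure is driven by the brand with the lowest mean mileage.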