# Exploring eBay Car Sales Data

This project explores sales data for cars listed on eBay. The dataset contains 50,000 listings, sampled from all car sales on eBay, and its columns are described in the data dictionary below.

Dictionary Information:

* dateCrawled: The date the ad was first crawled
* name: The name of the car
* seller: Whether the seller is private or a dealer
* offerType: Type of listing
* price: Ad price
* abtest: Whether the listing is included in an A/B test
* vehicleType: Type of vehicle
* yearOfRegistration: The year in which the car was first registered
* gearbox: The transmission type
* powerPS: The power of the car in PS
* model: The car model name
* odometer: How many kilometers the car has driven
* monthOfRegistration: The month in which the car was first registered
* fuelType: Type of fuel
* brand: Brand of the car
* notRepairedDamage: Whether the car has damage that has not been repaired
* dateCreated: The date that the eBay listing was created
* nrOfPictures: The number of pictures in the ad
* postalCode: The postal code for the vehicle location
* lastSeen: When the crawler last saw the ad online

In [1]:
import numpy as np
import pandas as pd

In [2]:
autos = pd.read_csv('autos.csv', encoding='Latin-1')

In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

In [4]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Scanning through this data, a few things stand out. The first is that the name column is more specific than the car's model: the ad name also includes details such as the trim level and on-board equipment.

The second is that the categorical values are in German, at least in the first few rows. Some of this information will likely need to be translated into English later.

Of the 20 columns, 15 have object dtype. Price and odometer are stored as text and should be converted to numeric values, and postal code should probably be converted to a string, since arithmetic on postal codes is not meaningful.
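The postal code conversion mentioned above could look like the sketch below. It runs on a small made-up frame rather than on autos.csv, and it assumes that every postal code is meant to be five digits long, so that a code read in as an integer (and stripped of its leading zero) can be padded back.

```python
import pandas as pd

# Toy stand-in for the real data; the values are illustrative only.
sample = pd.DataFrame({'postalCode': [79588, 1067, 35394]})

# Cast to string, then restore any leading zero lost when the column was read as int64.
# Assumption: every code should be exactly five digits long.
sample['postalCode'] = sample['postalCode'].astype(str).str.zfill(5)

print(sample['postalCode'].tolist())
```

Storing the codes as strings also prevents summary statistics such as a mean postal code from being computed by accident.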

In [5]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [6]:
new_columns = [
    'date_crawled', 
    'name',
    'seller',
    'offer_type',
    'price',
    'abtest',
    'vehicle_type',
    'registration_year',
    'gearbox',
    'power_ps',
    'model',
    'odometer',
    'registration_month',
    'fuel_type',
    'brand',
    'unrepaired_damage',
    'ad_created',
    'number_of_pictures',
    'postal_code',
    'last_seen'
              ]

autos.columns = new_columns

In [7]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


### Column Name Changes
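The column names were changed from camelCase to snake_case for readability and to follow Python naming conventions. Several were also reworded, for example:
* nrOfPictures became number_of_pictures
* monthOfRegistration became registration_month
* yearOfRegistration became registration_year
* notRepairedDamage became unrepaired_damage
* dateCreated became ad_created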
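The mechanical part of this renaming (camelCase to snake_case) can be automated. The helper below is a hypothetical sketch, not what the notebook uses: it only inserts underscores and lowercases, so reworded names such as registration_month would still have to be set by hand.

```python
import re

def camel_to_snake(name):
    """Insert an underscore wherever a lowercase letter or digit is followed by a capital, then lowercase."""
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

original = ['dateCrawled', 'offerType', 'powerPS', 'nrOfPictures']
converted = [camel_to_snake(col) for col in original]
print(converted)
```

Because the pattern only splits after a lowercase letter or digit, a run of capitals such as the PS in powerPS stays together.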

In [8]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-04-02 11:37:04,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


In [9]:
autos['offer_type'].value_counts()

Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64

In [10]:
autos['seller'].value_counts()

privat        49999
gewerblich        1
Name: seller, dtype: int64

The value counts show that the offer_type and seller columns can be dropped: in each of them, 49,999 of the 50,000 rows share the same value.

The price and odometer columns have object dtype, when they should have either int or float dtype.

The registration_year and registration_month columns should probably be treated as categorical (object) values rather than as numbers, and power_ps needs more investigation before deciding what to do with it.
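Columns dominated by a single value, like offer_type and seller, can also be found programmatically. The following is a minimal sketch on made-up data; the 0.8 cutoff is an arbitrary assumption chosen for the toy frame, not a value taken from this analysis.

```python
import pandas as pd

# Toy frame: two nearly constant columns and one varied column.
toy = pd.DataFrame({
    'seller': ['privat'] * 9 + ['gewerblich'],
    'offer_type': ['Angebot'] * 10,
    'brand': ['bmw', 'audi', 'opel', 'ford', 'bmw', 'audi', 'opel', 'ford', 'bmw', 'audi'],
})

# Share of rows taken up by the most common value in each column.
top_share = toy.apply(lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])

# Hypothetical cutoff: flag a column when one value fills more than 80% of its rows.
threshold = 0.8
to_drop = top_share[top_share > threshold].index.tolist()

trimmed = toy.drop(columns=to_drop)
print(to_drop)
```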

In [11]:
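# regex=False treats '$' as a literal character rather than the regex end-of-string anchor
autos['price'] = autos['price'].str.replace('$', '', regex=False)
autos['price'] = autos['price'].str.replace(',', '', regex=False)
autos['price'] = autos['price'].astype(int)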

In [12]:
autos['odometer'] = autos['odometer'].str.replace('km','')
autos['odometer'] = autos['odometer'].str.replace(',','')
autos['odometer'] = autos['odometer'].astype(int)

In [13]:
autos.rename(columns={'odometer': 'odometer_km'}, inplace=True)

In [14]:
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'number_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

In [15]:
autos['price'].unique().shape[0]

2357

In [16]:
autos['price'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [17]:
autos['price'].value_counts().sort_index(ascending=False).head(20)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

In [18]:
autos['price'].value_counts().sort_index(ascending=True).head(50)

0      1421
1       156
2         3
3         1
5         2
8         1
9         1
10        7
11        2
12        3
13        2
14        1
15        2
17        3
18        1
20        4
25        5
29        1
30        7
35        1
40        6
45        4
47        1
49        4
50       49
55        2
59        1
60        9
65        5
66        1
70       10
75        5
79        1
80       15
89        1
90        5
99       19
100     134
110       3
111       2
115       2
117       1
120      39
122       1
125       8
129       1
130      15
135       1
139       1
140       9
Name: price, dtype: int64

In [19]:
autos = autos[autos['price'].between(1000,360000)]

In [20]:
autos['price'].describe()

count     38626.000000
mean       7255.376275
std        9698.439853
min        1000.000000
25%        2200.000000
50%        4350.000000
75%        8950.000000
max      350000.000000
Name: price, dtype: float64

In [21]:
autos['odometer_km'].describe()

count     38626.000000
mean     122778.568840
std       40796.873127
min        5000.000000
25%      100000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [22]:
autos['odometer_km'].value_counts().sort_index(ascending=False).head()

150000    23314
125000     4340
100000     1860
90000      1569
80000      1334
Name: odometer_km, dtype: int64

In [23]:
autos['odometer_km'].value_counts().sort_index(ascending=True).head()

5000     507
10000    228
20000    692
30000    748
40000    795
Name: odometer_km, dtype: int64

The price column contained a handful of implausibly large values (up to 99,999,999) and many very low ones, including 1,421 listings priced at 0. I kept only listings priced between 1,000 and 360,000; the highest price that remains is 350,000. Some prices below 1,000 may have been legitimate, but others may have been entered incorrectly or given in thousands.

After removing the price outliers, the odometer_km values mostly look reasonable. One thing to note is that 150,000 is both the highest value and by far the most common one (23,314 listings). The raw column had only 13 distinct values, which suggests the readings were picked from preset options. If so, 150,000 may act as a cap and include vehicles that have driven much further.
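The effect of a range filter like the one used on price can be checked before it is applied. The sketch below uses a short made-up list of prices; it shows that `Series.between` includes both endpoints by default and counts how many rows the filter would keep and drop.

```python
import pandas as pd

# Made-up prices covering the kinds of values seen in the raw column.
prices = pd.Series([0, 1, 999, 1000, 4350, 350000, 999990, 99999999])

# between() includes both endpoints by default, so a price of exactly 1000 is kept.
keep = prices.between(1000, 360000)

print(f'rows kept: {keep.sum()}, rows dropped: {(~keep).sum()}')
print(prices[keep].tolist())
```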

In [24]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [32]:
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)

2016-03-05    0.025553
2016-03-06    0.013877
2016-03-07    0.035132
2016-03-08    0.032621
2016-03-09    0.032465
2016-03-10    0.033320
2016-03-11    0.032802
2016-03-12    0.037384
2016-03-13    0.016000
2016-03-14    0.036633
2016-03-15    0.033630
2016-03-16    0.029074
2016-03-17    0.030472
2016-03-18    0.012841
2016-03-19    0.035132
2016-03-20    0.038161
2016-03-21    0.037281
2016-03-22    0.032517
2016-03-23    0.032206
2016-03-24    0.029022
2016-03-25    0.030523
2016-03-26    0.033112
2016-03-27    0.031404
2016-03-28    0.035365
2016-03-29    0.033967
2016-03-30    0.033061
2016-03-31    0.031404
2016-04-01    0.034614
2016-04-02    0.036297
2016-04-03    0.039145
2016-04-04    0.036866
2016-04-05    0.013359
2016-04-06    0.003262
2016-04-07    0.001502
Name: date_crawled, dtype: float64

All crawling took place between March 5 and April 7, 2016. The distribution is not perfectly uniform, but most days account for roughly 3-4% of the listings, with a few low days (March 6, 13, and 18) and a sharp drop-off over the last three days.
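Slicing the first ten characters works because the timestamps share one fixed format. An alternative, sketched below on four timestamps taken from the first rows of the data, is to parse the column with `pd.to_datetime` and normalize each value to midnight; the result is a real datetime column that supports date arithmetic later on.

```python
import pandas as pd

crawled = pd.Series([
    '2016-03-26 17:47:46',
    '2016-04-04 13:38:56',
    '2016-03-26 18:57:24',
    '2016-03-12 16:58:10',
])

# Parse once, then drop the time of day by normalizing to midnight.
crawl_dates = pd.to_datetime(crawled, format='%Y-%m-%d %H:%M:%S').dt.normalize()

# Share of listings per calendar day, in chronological order.
share_per_day = crawl_dates.value_counts(normalize=True).sort_index()
print(share_per_day)
```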

In [31]:
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)

2015-06-11    0.000026
2015-08-10    0.000026
2015-09-09    0.000026
2015-11-10    0.000026
2015-12-30    0.000026
2016-01-03    0.000026
2016-01-07    0.000026
2016-01-10    0.000052
2016-01-13    0.000026
2016-01-14    0.000026
2016-01-16    0.000026
2016-01-22    0.000026
2016-01-27    0.000078
2016-01-29    0.000026
2016-02-01    0.000026
2016-02-02    0.000052
2016-02-05    0.000052
2016-02-07    0.000026
2016-02-09    0.000026
2016-02-11    0.000026
2016-02-12    0.000052
2016-02-14    0.000052
2016-02-16    0.000026
2016-02-17    0.000026
2016-02-18    0.000052
2016-02-19    0.000078
2016-02-20    0.000026
2016-02-21    0.000052
2016-02-22    0.000026
2016-02-23    0.000104
                ...   
2016-03-09    0.032646
2016-03-10    0.032983
2016-03-11    0.033061
2016-03-12    0.037125
2016-03-13    0.017657
2016-03-14    0.034976
2016-03-15    0.033449
2016-03-16    0.029643
2016-03-17    0.030161
2016-03-18    0.013281
2016-03-19    0.034070
2016-03-20    0.038264
2016-03-21 

The ad_created column covers a wide range of dates. A few ads date from 2015 and early 2016, but the majority were created close to the dates when the crawler was running.

Later I will dive into the length of time between creation and when the ad was crawled. 
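One way that comparison might be set up is sketched below. The first two rows reuse timestamps from the data shown earlier; the third row is invented to show a longer gap. The column name days_until_crawled is a hypothetical choice.

```python
import pandas as pd

toy = pd.DataFrame({
    'ad_created': ['2016-03-26 00:00:00', '2016-03-12 00:00:00', '2016-01-10 00:00:00'],
    'date_crawled': ['2016-03-26 17:47:46', '2016-03-12 16:58:10', '2016-03-05 09:00:00'],
})

created = pd.to_datetime(toy['ad_created'])
crawled = pd.to_datetime(toy['date_crawled'])

# Whole days between the day the ad was created and the day it was crawled.
toy['days_until_crawled'] = (crawled.dt.normalize() - created.dt.normalize()).dt.days
print(toy['days_until_crawled'].tolist())
```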

In [30]:
autos['last_seen'].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True)

2016-03-05    0.001087
2016-03-06    0.003573
2016-03-07    0.004557
2016-03-08    0.006239
2016-03-09    0.008906
2016-03-10    0.009812
2016-03-11    0.011728
2016-03-12    0.022187
2016-03-13    0.008388
2016-03-14    0.011987
2016-03-15    0.014990
2016-03-16    0.015456
2016-03-17    0.026381
2016-03-18    0.007378
2016-03-19    0.014602
2016-03-20    0.019805
2016-03-21    0.019676
2016-03-22    0.020789
2016-03-23    0.017915
2016-03-24    0.018537
2016-03-25    0.017760
2016-03-26    0.016077
2016-03-27    0.014084
2016-03-28    0.019417
2016-03-29    0.020763
2016-03-30    0.023456
2016-03-31    0.022731
2016-04-01    0.023197
2016-04-02    0.024906
2016-04-03    0.024439
2016-04-04    0.023378
2016-04-05    0.131129
2016-04-06    0.234686
2016-04-07    0.139983
Name: last_seen, dtype: float64

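The last_seen shares vary widely: most days account for roughly 0.1% to 2.6% of the listings, while the final three days (April 5-7) together account for about half. That spike probably reflects the crawl ending rather than a surge in sales. I am curious which days of the week saw the most ads removed. Were most cars sold on weekends, with few during the week? Or was it the other way around?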
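The weekday question could be approached along the lines of the sketch below, which uses four last_seen timestamps from the first rows of the data. It assumes that last_seen is a reasonable proxy for the day an ad was removed, which the data does not confirm.

```python
import pandas as pd

last_seen = pd.Series([
    '2016-04-06 06:45:54',
    '2016-04-06 14:45:08',
    '2016-03-15 03:16:28',
    '2016-04-01 14:38:50',
])

# day_name() maps each timestamp to its weekday name (English by default).
weekday = pd.to_datetime(last_seen).dt.day_name()
print(weekday.value_counts())
```

On the full column, the last three crawl days would dominate the counts, so they would probably need to be excluded before comparing weekdays.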

In [33]:
autos['registration_year'].describe()

count    38626.000000
mean      2005.679801
std         86.685138
min       1000.000000
25%       2001.000000
50%       2005.000000
75%       2009.000000
max       9999.000000
Name: registration_year, dtype: float64

Registration years should fall in the 1900s or 2000s. Values as low as 1000 and as high as 9999 point to incorrectly entered data.

In [40]:
bad_value_count = autos[(autos['registration_year'] < 1900) | (autos['registration_year'] > 2016)].shape[0]

In [41]:
bad_value_count / autos.shape[0]

0.03676280225754673

A car cannot have been registered after the ads were crawled in 2016, so 2016 is the latest plausible year. Because the registration years outside the 1900-2016 range make up only about 3.7% of the data, I am comfortable dropping those rows.
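The same share can be computed in one step, since the mean of a boolean mask is the fraction of True values. This is a sketch on made-up years, not a replacement for the cells above.

```python
import pandas as pd

# Made-up registration years, including a few implausible ones.
years = pd.Series([2004, 1997, 1000, 2009, 9999, 2016, 2017, 1900])

# True where the year falls outside the inclusive 1900-2016 range.
out_of_range = ~years.between(1900, 2016)

# The mean of a boolean mask is the fraction of rows that are True.
print(out_of_range.mean())
```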

In [42]:
autos = autos[autos['registration_year'].between(1900, 2016)]

In [43]:
autos.shape

(37206, 20)

In [51]:
autos['registration_year'].value_counts(normalize=True).sort_values(ascending=False)

2005    0.074854
2006    0.071252
2004    0.070096
2003    0.066575
2007    0.060716
2008    0.059238
2002    0.057383
2009    0.055797
2001    0.055502
2000    0.053782
1999    0.046337
2011    0.043461
2010    0.042574
2012    0.035102
1998    0.034484
2013    0.021368
1997    0.021153
2016    0.017497
2014    0.017443
1996    0.014568
1995    0.011907
2015    0.009703
1994    0.007284
1993    0.005859
1992    0.005859
1991    0.005564
1990    0.005080
1989    0.003306
1988    0.002849
1985    0.002123
          ...   
1966    0.000564
1977    0.000538
1975    0.000484
1969    0.000484
1965    0.000457
1960    0.000457
1964    0.000242
1963    0.000215
1959    0.000161
1961    0.000161
1962    0.000108
1937    0.000108
1956    0.000108
1958    0.000081
1954    0.000054
1955    0.000054
1941    0.000054
1957    0.000054
1934    0.000054
1951    0.000054
1931    0.000027
1938    0.000027
1948    0.000027
1950    0.000027
1939    0.000027
1929    0.000027
1927    0.000027
1943    0.0000

Registration years range from 1927 through 2016, with the majority occurring in the late 1990s and the 2000s.

In [64]:
autos['brand'].value_counts()

volkswagen        7843
bmw               4664
mercedes_benz     4152
audi              3631
opel              3314
ford              2184
renault           1387
peugeot           1038
fiat               784
skoda              709
seat               643
smart              618
toyota             544
mazda              530
citroen            517
nissan             507
mini               405
hyundai            400
sonstige_autos     389
volvo              334
kia                286
porsche            278
honda              273
mitsubishi         256
chevrolet          246
alfa_romeo         232
suzuki             213
dacia              122
chrysler           118
jeep               103
land_rover          98
jaguar              69
subaru              64
daihatsu            63
saab                51
daewoo              34
trabant             32
rover               27
lancia              25
lada                23
Name: brand, dtype: int64

In [70]:
# Keep only brands with more than 200 listings
brand_counts = autos['brand'].value_counts()
brands = brand_counts[brand_counts > 200].index

In [71]:
mean_price_per_brand = {}

for brand in brands:
    temp = autos[autos['brand'] == brand]
    mean = temp['price'].mean()
    mean_price_per_brand[brand] = mean

In [72]:
mean_price_per_brand

{'volkswagen': 6645.13260232054,
 'bmw': 9119.20218696398,
 'mercedes_benz': 9302.614402697494,
 'audi': 10322.269347287249,
 'opel': 4219.954737477368,
 'ford': 5331.478021978022,
 'renault': 3590.942321557318,
 'peugeot': 3955.169556840077,
 'fiat': 4008.174744897959,
 'skoda': 6836.696755994359,
 'seat': 5638.640746500778,
 'smart': 3780.4692556634304,
 'toyota': 5573.57169117647,
 'mazda': 5309.526415094339,
 'citroen': 4614.970986460348,
 'nissan': 6428.428007889546,
 'mini': 10715.237037037037,
 'hyundai': 6181.8725,
 'sonstige_autos': 14454.552699228792,
 'volvo': 6151.583832335329,
 'kia': 6810.160839160839,
 'porsche': 46955.15107913669,
 'honda': 5297.282051282052,
 'mitsubishi': 4812.5078125,
 'chevrolet': 7177.414634146341,
 'alfa_romeo': 5255.379310344828,
 'suzuki': 5149.8779342723}

In [75]:
sorted_mean_price = {}
for brand in sorted(mean_price_per_brand, key=mean_price_per_brand.get, reverse=True):
    sorted_mean_price[brand] = mean_price_per_brand[brand]


In [76]:
sorted_mean_price

{'porsche': 46955.15107913669,
 'sonstige_autos': 14454.552699228792,
 'mini': 10715.237037037037,
 'audi': 10322.269347287249,
 'mercedes_benz': 9302.614402697494,
 'bmw': 9119.20218696398,
 'chevrolet': 7177.414634146341,
 'skoda': 6836.696755994359,
 'kia': 6810.160839160839,
 'volkswagen': 6645.13260232054,
 'nissan': 6428.428007889546,
 'hyundai': 6181.8725,
 'volvo': 6151.583832335329,
 'seat': 5638.640746500778,
 'toyota': 5573.57169117647,
 'ford': 5331.478021978022,
 'mazda': 5309.526415094339,
 'honda': 5297.282051282052,
 'alfa_romeo': 5255.379310344828,
 'suzuki': 5149.8779342723,
 'mitsubishi': 4812.5078125,
 'citroen': 4614.970986460348,
 'opel': 4219.954737477368,
 'fiat': 4008.174744897959,
 'peugeot': 3955.169556840077,
 'smart': 3780.4692556634304,
 'renault': 3590.942321557318}
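The loop and the sorted dictionary above can also be expressed with `groupby`, which returns the means as a Series that can be sorted directly. The sketch below uses a made-up frame, and its cutoff of more than one listing is a stand-in for the notebook's cutoff of more than 200.

```python
import pandas as pd

toy = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'audi', 'audi', 'audi', 'lada'],
    'price': [8500, 9500, 10000, 12000, 11000, 1500],
})

# Hypothetical cutoff for this toy data; the notebook keeps brands with more than 200 listings.
min_listings = 1

counts = toy['brand'].value_counts()
common = counts[counts > min_listings].index

# Mean price per common brand, most expensive first.
mean_price = (
    toy[toy['brand'].isin(common)]
    .groupby('brand')['price']
    .mean()
    .sort_values(ascending=False)
)
print(mean_price)
```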