# Exploring eBay Car Sales Data

### Introduction

This notebook analyses used car listings from the classifieds section of the German eBay website (*eBay Kleinanzeigen*). 
The data was scraped and placed on [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data); dataquest.io then sampled 50,000 data points from it and prepared them to simulate real scraped data.

The learning objectives of this project are to practise cleaning data and to use the pandas library in a Jupyter notebook. 

The data dictionary provided details each column:

- dateCrawled - When this ad was first crawled. All field-values are taken from this date.
- name - Name of the car.
- seller - Whether the seller is private or a dealer.
- offerType - The type of listing
- price - The price on the ad to sell the car.
- abtest - Whether the listing is included in an A/B test.
- vehicleType - The vehicle Type.
- yearOfRegistration - The year in which the car was first registered.
- gearbox - The transmission type.
- powerPS - The power of the car in PS.
- model - The car model name.
- kilometer - How many kilometers the car has driven.
- monthOfRegistration - The month in which the car was first registered.
- fuelType - What type of fuel the car uses.
- brand - The brand of the car.
- notRepairedDamage - Whether the car has damage that has not yet been repaired.
- dateCreated - The date on which the eBay listing was created.
- nrOfPictures - The number of pictures in the ad.
- postalCode - The postal code for the location of the vehicle.
- lastSeenOnline - When the crawler saw this ad last online.

### Reading the data

In [52]:
import numpy as np
import pandas as pd

autos = pd.read_csv("autos.csv", encoding='Latin-1')

In [53]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
5,2016-03-21 13:47:45,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,privat,Angebot,"$7,900",test,bus,2006,automatik,150,voyager,"150,000km",4,diesel,chrysler,,2016-03-21 00:00:00,0,22962,2016-04-06 09:45:21
6,2016-03-20 17:55:21,VW_Golf_III_GT_Special_Electronic_Green_Metall...,privat,Angebot,$300,test,limousine,1995,manuell,90,golf,"150,000km",8,benzin,volkswagen,,2016-03-20 00:00:00,0,31535,2016-03-23 02:48:59
7,2016-03-16 18:55:19,Golf_IV_1.9_TDI_90PS,privat,Angebot,"$1,990",control,limousine,1998,manuell,90,golf,"150,000km",12,diesel,volkswagen,nein,2016-03-16 00:00:00,0,53474,2016-04-07 03:17:32
8,2016-03-22 16:51:34,Seat_Arosa,privat,Angebot,$250,test,,2000,manuell,0,arosa,"150,000km",10,,seat,nein,2016-03-22 00:00:00,0,7426,2016-03-26 18:18:10
9,2016-03-16 13:47:02,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,privat,Angebot,$590,control,bus,1997,manuell,90,megane,"150,000km",7,benzin,renault,nein,2016-03-16 00:00:00,0,15749,2016-04-06 10:46:35


In [54]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

In [55]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Of the 20 columns, 15 store strings and 5 store numbers (int64).

There are a few columns with null values, but no columns have more than ~20% null values. There are some columns that contain dates stored as strings.
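The null share can be read off the `info()` output (for example, notRepairedDamage has 40,171 non-null values out of 50,000, so about 19.7% are null), but it may be easier to compute it directly. The following is a minimal sketch on a small made-up frame; `sample` and its values are illustrative and not taken from autos.csv.

```python
import pandas as pd

# Tiny made-up frame; the values are illustrative, not taken from autos.csv.
sample = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", "limousine", None],
    "gearbox": ["manuell", "automatik", None, "manuell", "manuell"],
    "brand": ["peugeot", "bmw", "ford", "smart", "seat"],
})

# isnull() gives a boolean frame; the mean of each boolean column
# is the fraction of null values in that column.
null_share = sample.isnull().mean().sort_values(ascending=False)
print(null_share)
```

On the real data, the same expression applied to `autos` should list the columns from most to least incomplete.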

The column names use camelCase (instead of the snake_case preferred in Python). We will convert them to snake_case and rename some columns so they are easier to understand.
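The notebook renames the columns by hand, which also allows rewording some of them (for example yearOfRegistration becomes registration_year). The purely mechanical part of the conversion could be sketched with a regular expression; `camel_to_snake` is a hypothetical helper, not something the notebook defines.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase identifier to snake_case (hypothetical helper)."""
    # Insert an underscore at every position where an uppercase letter
    # follows a lowercase letter or digit, then lowercase everything.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

columns = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "abtest"]
print([camel_to_snake(c) for c in columns])
```

A helper like this only changes the casing; columns that need a different name altogether would still have to be renamed explicitly.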

### Changing column names

In [56]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [57]:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']


In [58]:
autos.head(1)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54


### Cleaning the data

In [59]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-14 20:50:02,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


There are two columns where almost all of the values are the same: seller and offer_type (the most frequent value appears in 49,999 of the 50,000 rows). These columns can be removed from the DataFrame. 

The nr_of_pictures column looks odd, so we will analyse it separately.

The price and odometer columns also look odd (they have no mean or other descriptive statistics), so we will analyse them further. 

In [60]:
autos = autos.drop(["seller", "offer_type"], axis=1)

In [61]:
autos['nr_of_pictures'].head(10)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: nr_of_pictures, dtype: int64

Every value in nr_of_pictures is zero (the describe output above shows a mean and standard deviation of 0), probably because of a scraping error, so we will remove the column from the DataFrame.
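Looking at the first ten rows does not by itself show that a column is constant. One way to check this directly is to count distinct values per column; the sketch below uses a made-up two-column frame, and `constant_cols` is an illustrative name.

```python
import pandas as pd

# Made-up sample: one constant column and one varying column.
sample = pd.DataFrame({
    "nr_of_pictures": [0, 0, 0, 0],
    "postal_code": [79588, 71034, 35394, 33729],
})

# A column with a single distinct value carries no information.
constant_cols = [col for col in sample.columns if sample[col].nunique() == 1]
print(constant_cols)
```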

In [62]:
autos = autos.drop(["nr_of_pictures"], axis=1)

In [63]:
autos['price'].head()

0    $5,000
1    $8,500
2    $8,990
3    $4,350
4    $1,350
Name: price, dtype: object

In [64]:
autos['odometer'].head()

0    150,000km
1    150,000km
2     70,000km
3     70,000km
4    150,000km
Name: odometer, dtype: object

Both columns (price and odometer) are stored as text instead of numbers because of the currency symbol, unit and thousands separators. We will convert them to numeric values and move the odometer unit (km) into the column name.

In [65]:
autos['odometer'] = autos['odometer'].str.replace("km","").str.replace(',', '')
autos['odometer'] = autos['odometer'].astype(int)
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
autos["odometer_km"].head()

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer_km, dtype: int64

In [66]:
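# regex=False makes "$" a literal character rather than a regex end-of-string anchor
autos['price'] = autos['price'].str.replace("$", "", regex=False).str.replace(',', '', regex=False)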
autos['price'] = autos['price'].astype(int)
autos['price'].head()

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64
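The two conversions above follow the same pattern: strip a few literal substrings, then cast to int. As a sketch, that pattern could be wrapped in a small function; `to_int` is a hypothetical helper and the sample strings are made up in the same format as the dataset.

```python
import pandas as pd

def to_int(series, drop):
    """Strip each literal substring in `drop`, then cast to int (hypothetical helper)."""
    cleaned = series
    for token in drop:
        # regex=False treats the token as plain text, which matters for "$".
        cleaned = cleaned.str.replace(token, "", regex=False)
    return cleaned.astype(int)

prices = pd.Series(["$5,000", "$300", "$0"])
odometer = pd.Series(["150,000km", "70,000km", "5,000km"])

print(to_int(prices, ["$", ","]).tolist())
print(to_int(odometer, ["km", ","]).tolist())
```

Note that `astype(int)` raises an error if any value is still non-numeric after stripping, which is a useful signal that the column contains an unexpected format.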

### Analysing prices

In [67]:
autos['price'].unique().shape

(2357,)

In [68]:
autos['price'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [69]:
autos['price'].value_counts().sort_index(ascending=False).head(15)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
Name: price, dtype: int64

In [70]:
autos['price'].value_counts().sort_index(ascending=True).head(10)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
Name: price, dtype: int64

Some cars are listed at implausibly high prices, up to 99,999,999 (nearly 100 million). Prices rise steadily up to 350,000 and then jump sharply, so we will remove the 14 listings priced above 350,000.

At the low end, 1,421 cars are listed at a price of zero, and some others at less than `$` 20. We will remove the zero-priced listings but keep everything from `$` 1 upwards, since eBay is an auction site and `$` 1 could be a starting price. 

In [71]:
autos = autos[autos["price"].between(1, 350000)]
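`Series.between` includes both endpoints by default, so prices of exactly 1 and exactly 350,000 are kept. With the counts shown above, this should drop 1,421 zero-priced rows and 14 rows above 350,000, leaving 50,000 - 1,421 - 14 = 48,565 rows. The sketch below illustrates the boundary behaviour on a handful of made-up prices.

```python
import pandas as pd

# Made-up prices chosen to sit on and around the two boundaries.
prices = pd.Series([0, 1, 20, 350000, 999990, 99999999])

# between() is inclusive on both ends by default.
mask = prices.between(1, 350000)
kept = prices[mask]

print(kept.tolist())
print(f"removed: {(~mask).sum()} of {len(prices)}")
```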

### Analysing mileage

In [72]:
autos['odometer_km'].unique().shape

(13,)

In [73]:
autos['odometer_km'].describe()

count     48565.000000
mean     125770.101925
std       39788.636804
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [74]:
autos['odometer_km'].value_counts()

150000    31414
125000     5057
100000     2115
90000      1734
80000      1415
70000      1217
60000      1155
50000      1012
5000        836
40000       815
30000       780
20000       762
10000       253
Name: odometer_km, dtype: int64

There are only 13 unique values in the odometer_km column, and all of them are rounded. This suggests that sellers choose from preset mileage ranges on the site rather than entering an exact figure.

Also, there are more cars with high mileage than low (the mean is about 125,770 km).

### Analysing date fields

In [75]:
autos['date_crawled'].str[:10].value_counts().sort_index()

2016-03-05    1230
2016-03-06     682
2016-03-07    1749
2016-03-08    1617
2016-03-09    1607
2016-03-10    1563
2016-03-11    1582
2016-03-12    1793
2016-03-13     761
2016-03-14    1775
2016-03-15    1665
2016-03-16    1438
2016-03-17    1536
2016-03-18     627
2016-03-19    1689
2016-03-20    1840
2016-03-21    1815
2016-03-22    1602
2016-03-23    1565
2016-03-24    1425
2016-03-25    1535
2016-03-26    1564
2016-03-27    1510
2016-03-28    1693
2016-03-29    1656
2016-03-30    1636
2016-03-31    1546
2016-04-01    1636
2016-04-02    1723
2016-04-03    1875
2016-04-04    1772
2016-04-05     636
2016-04-06     154
2016-04-07      68
Name: date_crawled, dtype: int64

In [76]:
autos['ad_created'].str[:7].value_counts().sort_index()

2015-06        1
2015-08        1
2015-09        1
2015-11        1
2015-12        2
2016-01       12
2016-02       61
2016-03    40673
2016-04     7813
Name: ad_created, dtype: int64

In [81]:
autos['last_seen'].str[:10].value_counts().sort_index()

2016-03-05       52
2016-03-06      210
2016-03-07      262
2016-03-08      360
2016-03-09      466
2016-03-10      518
2016-03-11      601
2016-03-12     1155
2016-03-13      432
2016-03-14      612
2016-03-15      771
2016-03-16      799
2016-03-17     1364
2016-03-18      357
2016-03-19      769
2016-03-20     1003
2016-03-21     1002
2016-03-22     1038
2016-03-23      900
2016-03-24      960
2016-03-25      933
2016-03-26      816
2016-03-27      760
2016-03-28     1013
2016-03-29     1085
2016-03-30     1203
2016-03-31     1155
2016-04-01     1107
2016-04-02     1210
2016-04-03     1224
2016-04-04     1189
2016-04-05     6059
2016-04-06    10772
2016-04-07     6408
Name: last_seen, dtype: int64

In [78]:
autos['registration_month'].value_counts()

3     5003
0     4480
6     4271
4     4036
5     4031
7     3857
10    3588
12    3374
9     3330
11    3313
1     3219
8     3126
2     2937
Name: registration_month, dtype: int64

In [79]:
autos['registration_year'].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The date_crawled column shows that the data was crawled over roughly a month, between 2016-03-05 and 2016-04-07.

The ads were created between June 2015 and April 2016, but 99.8% of them were created in March or April 2016. The ads were last seen between 2016-03-05 and 2016-04-07, the same date range in which the data was crawled.
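The 99.8% figure follows from the monthly counts: (40,673 + 7,813) / 48,565 ≈ 0.998. The notebook slices the date strings; an alternative, sketched here on a few made-up timestamps in the same format, is to parse them with `pd.to_datetime` and ask for normalised counts per month.

```python
import pandas as pd

# Made-up timestamps in the same format as the ad_created column.
ad_created = pd.Series([
    "2015-06-11 00:00:00",
    "2016-03-26 00:00:00",
    "2016-03-12 00:00:00",
    "2016-04-01 00:00:00",
])

# Parse the strings, then count the share of ads per year-month.
parsed = pd.to_datetime(ad_created, format="%Y-%m-%d %H:%M:%S")
by_month = parsed.dt.strftime("%Y-%m").value_counts(normalize=True).sort_index()
print(by_month)
```

Parsing has the side benefit of failing loudly on malformed dates, which string slicing would silently pass through.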

Looking at registration, there is no clear monthly pattern, but 4,480 cars have a registration month of 0, which is not a valid month. Registration years range from 1000 (long before the car was invented) to 9999 (far in the future). We will count the cars registered between 1886 (the invention of the car) and 2016 (the year the data was crawled) to see whether it is safe to remove the rest.

In [80]:
total_cars = autos['registration_year'].shape[0]
correct_range_cars = autos[autos['registration_year'].between(1886, 2016)].shape[0]
correct_range_percentage = correct_range_cars / total_cars
correct_range_percentage

0.961206630289303

96.1% of the cars are from the range 1886 - 2016, so it's safe to remove the other cars from outside this range. 

In [82]:
autos = autos[autos['registration_year'].between(1886, 2016)]

### Analysing brands (price and mileage)

In [85]:
autos['brand'].value_counts(normalize=True)

volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
mitsubishi        0.008226
honda             0.007840
kia               0.007069
alfa_romeo        0.006641
porsche           0.006127
suzuki            0.005934
chevrolet         0.005698
chrysler          0.003513
dacia             0.002635
daihatsu          0.002506
jeep              0.002271
subaru            0.002142
land_rover        0.002099
saab              0.001649
jaguar            0.001564
daewoo            0.001500
trabant           0.001392
r

There are 40 brands in the dataset, but the top 5 (all German) account for over 60% of the cars listed. We will limit the analysis to brands with more than 3% of the listings, which corresponds to the top 7 brands; together they account for about 73% of the cars listed.
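The cumulative shares can be checked with a running sum. The sketch below copies the seven largest shares from the `value_counts(normalize=True)` output above; on the full data, chaining `.cumsum()` onto that call should give the same figures.

```python
import pandas as pd

# Shares of the seven largest brands, copied from the output above.
shares = pd.Series({
    "volkswagen": 0.211264, "bmw": 0.110045, "opel": 0.107581,
    "mercedes_benz": 0.096463, "audi": 0.086566, "ford": 0.069900,
    "renault": 0.047150,
})

# Running total: the value at each brand is the share of that brand
# plus all larger brands.
cumulative = shares.cumsum()
print(cumulative.round(3))
```

The running total reaches about 0.612 at audi (the fifth brand) and about 0.729 at renault (the seventh).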

In [100]:
cars = autos['brand'].value_counts().head(7).index

In [103]:
cars

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford',
       'renault'],
      dtype='object')

In [108]:
brand_prices = {}

for car in cars:
    data = autos[autos['brand'] == car]
    mean = data['price'].mean()
    brand_prices[car] = mean

brand_prices

{'audi': 9336.687453600594,
 'bmw': 8332.820517811953,
 'ford': 3749.4695065890287,
 'mercedes_benz': 8628.450366422385,
 'opel': 2975.2419354838707,
 'renault': 2474.8646069968195,
 'volkswagen': 5402.410261610221}

The brands analysed fall into three classes by average price. The first class, with the highest average prices (around $ 8,770 across the three), consists of three luxury brands: audi, mercedes_benz and bmw.

Next comes a middle class containing only volkswagen, with an average price of around $ 5,402.

Lastly, the other three brands (ford, opel and renault) have lower average prices, around $ 3,066. 

To understand this pattern better, we will look at mileage by brand.

In [110]:
brand_mileage = {}

for car in cars:
    data = autos[autos['brand'] == car]
    mean = data['odometer_km'].mean()
    brand_mileage[car] = mean

brand_mileage

{'audi': 129157.38678544914,
 'bmw': 132572.51313996495,
 'ford': 124266.01287159056,
 'mercedes_benz': 130788.36331334666,
 'opel': 129310.0358422939,
 'renault': 128071.33121308497,
 'volkswagen': 128707.15879132022}

In [114]:
# Transforming the dictionaries into series

bp_series = pd.Series(brand_prices)
bm_series = pd.Series(brand_mileage)

# Transforming the series to a dataframe

price_and_mileage = pd.DataFrame(bp_series, columns=['mean_price'])
price_and_mileage['mean_mileage'] = bm_series
price_and_mileage

Unnamed: 0,mean_price,mean_mileage
audi,9336.687454,129157.386785
bmw,8332.820518,132572.51314
ford,3749.469507,124266.012872
mercedes_benz,8628.450366,130788.363313
opel,2975.241935,129310.035842
renault,2474.864607,128071.331213
volkswagen,5402.410262,128707.158791


Comparing price and mileage, there doesn't appear to be a relationship between a brand's average price and its average mileage: prices differ significantly between brands, but mileages are similar. We will analyse the whole dataset to see whether price and mileage are correlated.
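The per-brand means above were built with two loops and two dictionaries. As an alternative sketch, a single `groupby` could produce the same kind of table; the frame below is made up, and `top_brands` stands in for the `cars` index used in the notebook.

```python
import pandas as pd

# Made-up listings; the real analysis would use the cleaned autos frame.
sample = pd.DataFrame({
    "brand": ["audi", "audi", "opel", "opel", "ford"],
    "price": [9000, 10000, 3000, 2000, 3500],
    "odometer_km": [125000, 150000, 150000, 100000, 125000],
})
top_brands = ["audi", "opel"]  # stand-in for the top-7 brand index

# Keep only the selected brands, then average both columns per brand.
summary = (
    sample[sample["brand"].isin(top_brands)]
    .groupby("brand")[["price", "odometer_km"]]
    .mean()
    .rename(columns={"price": "mean_price", "odometer_km": "mean_mileage"})
)
print(summary)
```

This avoids the intermediate dictionaries and series, at the cost of being a little less explicit about each step.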

### Analysing correlation

In [117]:
autos[['price', 'odometer_km']].corr()

Unnamed: 0,price,odometer_km
price,1.0,-0.385825
odometer_km,-0.385825,1.0


There is a moderate negative correlation (about -0.39) between price and mileage: cars with more mileage tend to be listed at lower prices. Next, we will check whether unrepaired_damage has an effect on car prices. 
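By default, `DataFrame.corr()` computes Pearson's correlation coefficient: the covariance of the two columns divided by the product of their standard deviations. The sketch below works through that formula with NumPy on five made-up cars; the numbers are illustrative and will not reproduce the -0.39 above.

```python
import numpy as np

# Made-up prices and mileages for five hypothetical cars.
price = np.array([9000.0, 7000.0, 5000.0, 3000.0, 2500.0])
odometer_km = np.array([50000.0, 90000.0, 125000.0, 150000.0, 150000.0])

# Pearson's r: sum of products of deviations, divided by the square root
# of the product of the summed squared deviations.
dp = price - price.mean()
dk = odometer_km - odometer_km.mean()
r = (dp * dk).sum() / np.sqrt((dp ** 2).sum() * (dk ** 2).sum())

print(round(r, 3))
# Cross-check against NumPy's built-in correlation matrix.
print(np.isclose(r, np.corrcoef(price, odometer_km)[0, 1]))
```

A value near -1 means price falls almost linearly as mileage rises; a value near 0 means no linear relationship.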

In [122]:
autos['unrepaired_damage'].value_counts()

nein    33834
ja       4540
Name: unrepaired_damage, dtype: int64

In [148]:
not_damaged_price = autos[autos['unrepaired_damage'] == 'nein']['price'].mean()
damaged_price = autos[autos['unrepaired_damage'] == 'ja']['price'].mean()
not_damaged_mileage = autos[autos['unrepaired_damage'] == 'nein']['odometer_km'].mean()
damaged_mileage = autos[autos['unrepaired_damage'] == 'ja']['odometer_km'].mean()

print("Not damaged cars\nprice: " + str(not_damaged_price) + "\nmileage: " + str(not_damaged_mileage))
print("\nDamaged cars\nprice: " + str(damaged_price) + "\nmileage: " + str(damaged_mileage))

Not damaged cars
price: 7164.033102796004
mileage: 122912.30714665721

Damaged cars
price: 2241.146035242291
mileage: 135356.8281938326


Cars without unrepaired damage had a significantly higher average price than damaged ones (about 220% higher). 

This effect could be confounded with mileage if the damaged cars had much more mileage than the undamaged ones. However, the damaged cars had only about 10% more mileage, so mileage does not explain most of the price difference.
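The two percentages can be reproduced from the printed means. `pct_higher` is a hypothetical helper introduced only for this check.

```python
# Means copied from the printed output above.
not_damaged_price = 7164.033102796004
damaged_price = 2241.146035242291
not_damaged_mileage = 122912.30714665721
damaged_mileage = 135356.8281938326

def pct_higher(a, b):
    """Return how much higher a is than b, in percent (hypothetical helper)."""
    return (a / b - 1) * 100

price_gap = pct_higher(not_damaged_price, damaged_price)
mileage_gap = pct_higher(damaged_mileage, not_damaged_mileage)

print(round(price_gap))    # 220
print(round(mileage_gap))  # 10
```

This is only a comparison of group means; it does not control for brand, age or other factors that may differ between the two groups.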