# German Used Cars on eBay
INTRODUCTION

## Data
The dataset, available on [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/downloads/used-cars-database.zip/3), contains over 370,000 used-car listings scraped with Scrapy from eBay Kleinanzeigen. The data is in German, so some values may need translating. The file autos.csv includes the following fields:

- **dateCrawled** : when this ad was first crawled, all field-values are taken from this date
- **name** : "name" of the car
- **seller** : private or dealer
- **offerType**
- **price** : the price on the ad to sell the car
- **abtest**
- **vehicleType**
- **yearOfRegistration** : the year in which the car was first registered
- **gearbox**
- **powerPS** : power of the car in PS
- **model**
- **kilometer** : how many kilometers the car has driven
- **monthOfRegistration** : the month in which the car was first registered
- **fuelType**
- **brand**
- **notRepairedDamage** : whether the car has damage that has not yet been repaired
- **dateCreated** : the date on which the eBay ad was created
- **nrOfPictures** : number of pictures in the ad (because of a bug in the crawler, this field is 0 everywhere and is therefore useless)
- **postalCode**
- **lastSeen** : when the crawler last saw this ad online

The fields lastSeen and dateCreated could be used to estimate a lower bound on how long a listing stays online before the car is sold.
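As a rough illustration of that idea, the sketch below computes the number of whole days between dateCreated and lastSeen. It uses a tiny hand-built frame with two timestamp pairs taken from the dataset preview further down; the names `sample` and `days_online` are made up for this example and are not part of the original analysis.

```python
import pandas as pd

# Two timestamp pairs copied from the dataset preview; `sample` is a toy frame.
sample = pd.DataFrame({
    'dateCreated': ['2016-03-24 00:00:00', '2016-03-17 00:00:00'],
    'lastSeen': ['2016-04-07 03:16:57', '2016-03-17 17:40:17'],
})

# Parse the strings into timestamps before subtracting.
created = pd.to_datetime(sample['dateCreated'])
last_seen = pd.to_datetime(sample['lastSeen'])

# Whole days between ad creation and the crawler's last sighting.
# `days_online` is a hypothetical column name.
sample['days_online'] = (last_seen - created).dt.days
print(sample)
```

Since the crawler only records when it last saw an ad, this value is a lower bound rather than the exact time until the sale.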

In [1]:
import pandas as pd
import numpy as np

In [2]:
autos = pd.read_csv('autos.csv', encoding='Latin-1')

In [3]:
print(autos.info())
autos.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
dateCrawled            371528 non-null object
name                   371528 non-null object
seller                 371528 non-null object
offerType              371528 non-null object
price                  371528 non-null int64
abtest                 371528 non-null object
vehicleType            333659 non-null object
yearOfRegistration     371528 non-null int64
gearbox                351319 non-null object
powerPS                371528 non-null int64
model                  351044 non-null object
kilometer              371528 non-null int64
monthOfRegistration    371528 non-null int64
fuelType               338142 non-null object
brand                  371528 non-null object
notRepairedDamage      299468 non-null object
dateCreated            371528 non-null object
nrOfPictures           371528 non-null int64
postalCode             371528 non-null int64
lastSeen              

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


The dataset contains 20 columns, most of which are strings. Some columns contain null values, but none has more than about 20% nulls, so we should be able to work with most of them.
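The share of nulls per column can be checked directly instead of being read off the `info()` output. The sketch below shows one way to do it on a toy frame (the rows in `toy` are invented for illustration), followed by the same arithmetic for notRepairedDamage using the counts printed above.

```python
import pandas as pd

# Toy frame with invented rows; only the pattern of missing values matters.
toy = pd.DataFrame({
    'brand': ['volkswagen', 'audi', 'jeep', 'skoda'],
    'vehicleType': [None, 'coupe', 'suv', 'kleinwagen'],
    'notRepairedDamage': [None, 'ja', None, 'nein'],
})

# isnull() gives a boolean frame; the column mean is the fraction of nulls.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)

# Same calculation from the info() counts: 299468 non-null out of 371528 rows.
damage_null_share = 1 - 299468 / 371528
print(round(damage_null_share, 3))
```

On the full data the equivalent call would be `autos.isnull().mean()`.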

Let's start by renaming a few of the camelCase columns to more descriptive names in Python's snake_case.

In [4]:
autos.rename(columns={'yearOfRegistration': 'registration_year',
                      'monthOfRegistration': 'registration_month',
                      'notRepairedDamage': 'unrepaired_damage',
                      'dateCreated': 'ad_created'},
                       inplace=True)
print(autos.columns)

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'registration_year', 'gearbox', 'powerPS', 'model',
       'kilometer', 'registration_month', 'fuelType', 'brand',
       'unrepaired_damage', 'ad_created', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
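The rename above touches only four columns, so the rest keep their camelCase names. If a uniform convention is wanted, a small helper could convert every remaining name mechanically. The function `to_snake_case` below is a hypothetical helper written for this sketch, not part of the original notebook, and it does not reproduce the hand-picked names such as registration_year.

```python
import re

def to_snake_case(name):
    """Insert '_' before each capital that follows a lowercase letter or digit."""
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

# Column names as they stand after the manual rename.
columns = ['dateCrawled', 'offerType', 'vehicleType', 'powerPS',
           'fuelType', 'nrOfPictures', 'postalCode', 'lastSeen',
           'registration_year']

converted = [to_snake_case(c) for c in columns]
print(converted)
```

On the DataFrame this could be applied with `autos.columns = [to_snake_case(c) for c in autos.columns]`. Names that are already snake_case pass through unchanged.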


Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for text columns where all or almost all values are the same, since these typically don't have useful information for analysis.

In [5]:
autos.describe(include='all')

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,registration_year,gearbox,powerPS,model,kilometer,registration_month,fuelType,brand,unrepaired_damage,ad_created,nrOfPictures,postalCode,lastSeen
count,371528,371528,371528,371528,371528.0,371528,333659,371528.0,351319,371528.0,351044,371528.0,371528.0,338142,371528,299468,371528,371528.0,371528.0,371528
unique,280500,233531,2,2,,2,8,,2,,251,,,7,40,2,114,,,182806
top,2016-03-24 14:49:47,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:45:59
freq,7,657,371525,371516,,192585,95894,,274214,,30070,,,223857,79640,263182,14450,,,17
mean,,,,,17295.14,,,2004.577997,,115.549477,,125618.688228,5.734445,,,,,0.0,50820.66764,
std,,,,,3587954.0,,,92.866598,,192.139578,,40112.337051,3.712412,,,,,0.0,25799.08247,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1150.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30459.0,
50%,,,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49610.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71546.0,


In [6]:
print(autos['seller'].value_counts())
print()
print(autos['offerType'].value_counts())

privat        371525
gewerblich         3
Name: seller, dtype: int64

Angebot    371516
Gesuch         12
Name: offerType, dtype: int64


As noted in the column descriptions at the top of the Data section, nrOfPictures is zero for every row, so it carries no information and can safely be dropped.

Additionally, the seller and offerType columns consist almost entirely of a single value, so let's get rid of those as well.
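One way to put a number on "almost entirely a single value" is the share of rows taken by the most frequent value in each column. The sketch below does this on an invented ten-row frame; `top_value_share` is a hypothetical helper and the 0.9 cut-off is an arbitrary choice for the example.

```python
import pandas as pd

def top_value_share(df):
    """Fraction of non-null rows taken by the most frequent value, per column."""
    return df.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Invented ten-row frame mimicking the skew seen in the real columns.
toy = pd.DataFrame({
    'seller': ['privat'] * 9 + ['gewerblich'],
    'nrOfPictures': [0] * 10,
    'brand': ['volkswagen'] * 5 + ['audi'] * 3 + ['bmw'] * 2,
})

shares = top_value_share(toy)
print(shares)

# Columns dominated by one value (arbitrary 0.9 threshold) are drop candidates.
drop_candidates = shares[shares >= 0.9].index.tolist()
print(drop_candidates)
```

For comparison, the counts above put the top value of seller at 371525 of 371528 rows.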

In [7]:
autos.drop(['nrOfPictures', 'seller', 'offerType'], axis=1, inplace=True)

Now let's look at the price column in more depth.

In [16]:
print(autos['price'].describe())
autos['price'].value_counts().head(30)

count    3.715280e+05
mean     1.729514e+04
std      3.587954e+06
min      0.000000e+00
25%      1.150000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.147484e+09
Name: price, dtype: float64


0       10778
500      5670
1500     5394
1000     4649
1200     4594
2500     4438
600      3819
3500     3792
800      3784
2000     3432
999      3364
750      3203
650      3150
4500     3053
850      2946
2200     2936
700      2936
1800     2886
900      2874
950      2793
1100     2772
1300     2757
300      2731
3000     2720
550      2591
1600     2570
5500     2543
350      2514
400      2442
1250     2441
Name: price, dtype: int64

We can see that a large number of cars are listed at a price of 0. Since this is not realistic, we should remove them.

In [18]:
autos['price'].value_counts().sort_index(ascending=True).head(10)

0     10778
1      1189
2        12
3         8
4         1
5        26
7         3
8         9
9         8
10       84
Name: price, dtype: int64

In [19]:
autos['price'].value_counts().sort_index(ascending=False).head(50)

2147483647     1
99999999      15
99000000       1
74185296       1
32545461       1
27322222       1
14000500       1
12345678       9
11111111      10
10010011       1
10000000       8
9999999        3
3895000        1
3890000        1
2995000        1
2795000        1
1600000        2
1300000        1
1250000        2
1234566        1
1111111        2
1010010        1
1000000        5
999999        13
999990         1
911911         1
849000         1
820000         1
780000         1
745000         2
725000         1
700000         1
650000         1
619000         1
600000         2
599000         1
585000         1
579000         1
517895         1
500000         2
488997         1
487000         1
485000         1
466000         1
445000         1
440000         1
420000         1
399997         1
395000         1
390000         1
Name: price, dtype: int64

Given that eBay is an auction site, there could legitimately be items where the opening bid is 1 euro, so we will keep the listings priced at 1 but remove anything above 390,000, since prices seem to increase steadily up to that number and then jump to less realistic values.

In [20]:
autos = autos[autos['price'].between(1, 390000)]
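It is worth checking that the bounds behave as intended: `Series.between` is inclusive on both ends by default, so listings priced at exactly 1 and exactly 390000 are kept. The sketch below demonstrates this on a handful of prices taken from the outputs above; the `prices` series is a toy stand-in for `autos['price']`.

```python
import pandas as pd

# A few prices from the value counts above, including both boundary values.
prices = pd.Series([0, 1, 480, 390000, 395000, 2147483647], name='price')

# between() includes both endpoints by default.
mask = prices.between(1, 390000)

kept = prices[mask]
removed = prices[~mask]
print(kept.tolist())
print(removed.tolist())
```

Comparing `len(autos)` before and after the filter is a quick way to see how many rows the two cut-offs remove.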