# Analyzing eBay Kleinanzeigen data
In this project we'll look at a sample of 50,000 data points from the original eBay Kleinanzeigen dataset. The full dataset can be found __[here](https://data.world/data-society/used-cars-data)__.

Our goal is to clean up the data and then run a few analyses to get a better picture of the used car market of that time period.

## 1. Getting to know the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

autos = pd.read_csv("autos.csv",
                    encoding="Latin-1"
                   )

In [2]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [3]:
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

| | dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
| 1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
| 2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
| 3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
| 4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |


### Key observations:

For a number of cars we are missing information about:

- The vehicle type (10.2%)
- The transmission type (5.4%)
- The model name (5.5%)
- The type of fuel the car uses (9.0%)
- Whether the car has unrepaired damage (19.7%), possibly because not all cars had damage to report
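
The shares above can be derived directly from the frame instead of by hand. The following is a minimal sketch on a small made-up frame; `sample` is a hypothetical stand-in for `autos`, and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `autos`; None marks a missing value.
sample = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", "limousine"],
    "gearbox": ["manuell", "automatik", None, None],
    "brand": ["peugeot", "bmw", "ford", "smart"],
})

# isna() flags missing cells; the column mean of those flags is the missing share.
missing_pct = (sample.isna().mean() * 100).round(1)

# Keep only columns that actually have gaps, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Running the same three lines on `autos` should list the five incomplete columns with their percentages.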

#### Let's rename the headers so it's easier to use them later on:

In [4]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [5]:
autos.rename(columns={'yearOfRegistration' : 'registration_year',
                     'monthOfRegistration':'registration_month',
                     'notRepairedDamage':'unrepaired_damage',
                     'dateCreated':'ad_created',
                     'nrOfPictures':'pictures',
                     'postalCode':'postal_code',
                     'offerType':'offer_type',
                     'vehicleType':'vehicle_type',
                     'fuelType':'fuel_type',
                     'lastSeen':'last_seen_online',
                     'dateCrawled':'date_crawled',
                     'abtest':'AB_test'},inplace=True)

## 2. Data exploration and cleaning
Let's have a closer look at the data to see where we might need to clean it.

In [7]:
autos.describe(include="all")

| | date_crawled | name | seller | offer_type | price | AB_test | vehicle_type | registration_year | gearbox | powerPS | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | pictures | postal_code | last_seen_online |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.0 | 47320 | 50000.0 | 47242 | 50000 | 50000.0 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.0 | 50000 |
| unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | | 2 | | 245 | 13 | | 7 | 40 | 2 | 76 | | | 39481 |
| top | 2016-03-19 17:36:18 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | | manuell | | golf | 150,000km | | benzin | volkswagen | nein | 2016-04-03 00:00:00 | | | 2016-04-07 06:17:27 |
| freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | | 36993 | | 4024 | 32424 | | 30107 | 10687 | 35232 | 1946 | | | 8 |
| mean | | | | | | | | 2005.07 | | 116.36 | | | 5.72 | | | | | 0.0 | 50813.63 | |
| std | | | | | | | | 105.71 | | 209.22 | | | 3.71 | | | | | 0.0 | 25779.75 | |
| min | | | | | | | | 1000.0 | | 0.0 | | | 0.0 | | | | | 0.0 | 1067.0 | |
| 25% | | | | | | | | 1999.0 | | 70.0 | | | 3.0 | | | | | 0.0 | 30451.0 | |
| 50% | | | | | | | | 2003.0 | | 105.0 | | | 6.0 | | | | | 0.0 | 49577.0 | |
| 75% | | | | | | | | 2008.0 | | 150.0 | | | 9.0 | | | | | 0.0 | 71540.0 | |


#### This uncovers a few things:

1. The columns **price** and **odometer** are numeric values stored as strings. We'll need to convert them.
2. The **seller** and **offer_type** columns contain the same values in all but one entry. This makes them redundant for our analysis as they don't add much information about the listings.
3. The **registration_year** column has some extreme values. We'll delete them from our data.
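
Near-constant columns such as **seller** and **offer_type** can also be spotted programmatically. The sketch below is one possible way to do it, not part of the original analysis: `sample`, the helper `dominant_share`, and the 0.95 cutoff are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `autos` with one almost-constant column.
sample = pd.DataFrame({
    "seller": ["privat"] * 99 + ["gewerblich"],
    "brand": ["volkswagen", "bmw", "opel", "audi"] * 25,
})

def dominant_share(column):
    """Share of rows taken by the most frequent value of a column."""
    return column.value_counts(normalize=True, dropna=False).iloc[0]

# apply() runs the helper once per column and returns one share per column.
shares = sample.apply(dominant_share)

# Columns where a single value covers more than 95% of rows carry little information.
# The 0.95 cutoff is an arbitrary choice for this sketch.
near_constant = shares[shares > 0.95].index.tolist()
print(near_constant)
```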

### 2.1 Cleaning price and odometer columns

Just by looking at the values above, we know that we'll have to remove '$' and 'km' from the two columns. However, we can't yet be sure whether the price uses a ",", a "." or just a space as its thousands separator.

Let's check that first:

In [8]:
autos["price"].head(5)

0    $5,000
1    $8,500
2    $8,990
3    $4,350
4    $1,350
Name: price, dtype: object

We see that a comma is used as the thousands separator. We must therefore remove that character from the column before we can convert the data to a numeric type.

In [8]:
autos["price"] = autos["price"].str.replace('$','',regex=False).str.replace(',','',regex=False).astype(int)


Let's clean the **odometer** column next. 

We already know that we need to remove the 'km' but what about any potential commas or periods?

In [9]:
autos["odometer"].head()

0    150,000km
1    150,000km
2     70,000km
3     70,000km
4    150,000km
Name: odometer, dtype: object

A comma is used to separate the thousands, so we'll remove it as well before we convert the data to numeric type:

In [10]:
autos["odometer"] = autos["odometer"].str.replace('km','').str.replace(',','').astype(int)
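
Both conversions follow the same pattern, so they could be wrapped in a small helper. This is only a sketch: `to_int` is a hypothetical name, and the sample values are made up to mirror the raw format. It passes `regex=False` so that `'$'` is always treated as a literal character rather than a regular-expression anchor.

```python
import pandas as pd

def to_int(column, junk):
    """Strip each literal substring in `junk` from a string column, then cast to int."""
    for token in junk:
        # regex=False makes '$' a literal character rather than a regex anchor.
        column = column.str.replace(token, "", regex=False)
    return column.astype(int)

# Hypothetical sample values in the same format as the raw columns.
price = pd.Series(["$5,000", "$8,500", "$0"])
odometer = pd.Series(["150,000km", "70,000km", "5,000km"])

print(to_int(price, ["$", ","]).tolist())
print(to_int(odometer, ["km", ","]).tolist())
```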


**Now that both price and odometer columns are numeric, we can check for outliers**

We'll start with the price column.

In [30]:
autos["price"].unique().shape

(2357,)

In [11]:
autos["price"].describe()

count      50000.00
mean        9840.04
std       481104.38
min            0.00
25%         1100.00
50%         2950.00
75%         7200.00
max     99999999.00
Name: price, dtype: float64

#### Observations:

- There are 2,357 unique prices in the data set.
- The average price of a car is 9,840 USD, although the very large standard deviation suggests that this mean is distorted by outliers.
- The cheapest car is actually free, priced at 0.
- The most expensive car seems to be priced at almost 100 million. That might be an error.

We'll need to look closer at really big prices so that our analysis isn't skewed down the line:

In [12]:
autos["price"].value_counts().sort_index(ascending=False).head(20)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

There is a big jump in price after 350,000 USD. Some of the prices look like arbitrary input, such as 99999999 and 12345678. Overall, all of the prices above 350,000 USD seem very high for a used car.

Let's look at the names and models of these cars to see if the prices make sense.

In [13]:
price_filter = (autos["price"] > 350000)
strange_price = autos[price_filter]
strange_price.loc[:,["name","price","model"]].sort_values(by=["price"],ascending=False)

| | name | price | model |
|---|---|---|---|
| 39705 | Tausch_gegen_gleichwertiges | 99999999 | s_klasse |
| 42221 | Leasinguebernahme | 27322222 | c4 |
| 27371 | Fiat_Punto | 12345678 | punto |
| 39377 | Tausche_volvo_v40_gegen_van | 12345678 | v40 |
| 47598 | Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... | 12345678 | vectra |
| 2897 | Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 | 11111111 | escort |
| 24384 | Schlachte_Golf_3_gt_tdi | 11111111 | |
| 11137 | suche_maserati_3200_gt_Zustand_unwichtig_laufe... | 10000000 | |
| 47634 | Ferrari_FXX | 3890000 | |
| 7814 | Ferrari_F40 | 1300000 | |


Only a few of the entries with high prices are actually luxury cars, for which such prices are more plausible. We can therefore assume that, at most, the Maserati and the two Ferraris are legitimate listings. Even the Maserati entry is doubtful, since its name starts with "suche" (German for "looking for") and so reads like a wanted ad.

However, these three cars are much more expensive than the average. Given that we are mostly dealing with non-luxury brands, these three entries might again skew our analysis without adding much value.

We will therefore remove them as well.

In [14]:
cars_drop = autos["price"] > 350000
autos = autos.drop(autos.index[cars_drop])
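
The same filter can also be written as a keep-condition with a boolean mask, which avoids the detour through `drop` and `index`. A minimal sketch on made-up prices (`sample` and `cleaned` are hypothetical names):

```python
import pandas as pd

# Hypothetical prices; 350000 is the cutoff used above.
sample = pd.DataFrame({"price": [0, 4350, 350000, 1300000, 99999999]})

# Keep rows at or below the cutoff instead of dropping the rows above it.
cleaned = sample[sample["price"] <= 350000]

print(len(sample) - len(cleaned), "rows removed")
print(cleaned["price"].max())
```

Printing the number of removed rows is a cheap sanity check that the filter did what was intended.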

#### Let us look into the column odometer next

In [71]:
autos["odometer"].unique().shape

(13,)

In [72]:
autos["odometer"].describe()

count     49986.000000
mean     125736.506222
std       40038.133399
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer, dtype: float64

On average, a listed car has 125,737 km on the odometer. Since the median equals the maximum of 150,000 km (the minimum is 5,000 km), we can conclude that most cars have high mileage. 

In [73]:
autos["odometer"].value_counts().sort_index(ascending=False).head(20)

150000    32416
125000     5169
100000     2168
90000      1757
80000      1436
70000      1230
60000      1164
50000      1025
40000       818
30000       789
20000       784
10000       264
5000        966
Name: odometer, dtype: int64

Almost 65% of all cars for sale have a mileage of 150,000 km, and almost 80% have a mileage of at least 100,000 km. 

None of the values stick out so we won't be removing any.
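
The cumulative shares can be checked with a running total over the mileage buckets. The sketch below re-enters the counts from the `value_counts()` output above; the variable names are chosen for illustration only.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    150000: 32416, 125000: 5169, 100000: 2168, 90000: 1757, 80000: 1436,
    70000: 1230, 60000: 1164, 50000: 1025, 40000: 818, 30000: 789,
    20000: 784, 10000: 264, 5000: 966,
})

share = counts / counts.sum()

# Running total from the highest mileage bucket downwards:
# each entry is the share of cars with at least that mileage.
at_least = share.sort_index(ascending=False).cumsum()

print(at_least.loc[[150000, 125000, 100000]].round(3))
```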

### Exploring date values

In [74]:
(autos["date_crawled"]
 .str[:10]
 .value_counts(normalize=True,dropna=False)
 .sort_index()
)

2016-03-05    0.025387
2016-03-06    0.013944
2016-03-07    0.035970
2016-03-08    0.033269
2016-03-09    0.033209
2016-03-10    0.032129
2016-03-11    0.032489
2016-03-12    0.036770
2016-03-13    0.015564
2016-03-14    0.036630
2016-03-15    0.033990
2016-03-16    0.029508
2016-03-17    0.031509
2016-03-18    0.013064
2016-03-19    0.034910
2016-03-20    0.037831
2016-03-21    0.037490
2016-03-22    0.032909
2016-03-23    0.032389
2016-03-24    0.029108
2016-03-25    0.031749
2016-03-26    0.032489
2016-03-27    0.031049
2016-03-28    0.034850
2016-03-29    0.034150
2016-03-30    0.033629
2016-03-31    0.031909
2016-04-01    0.033809
2016-04-02    0.035410
2016-04-03    0.038691
2016-04-04    0.036490
2016-04-05    0.013104
2016-04-06    0.003181
2016-04-07    0.001420
Name: date_crawled, dtype: float64

The ads were crawled between 5 March and 7 April 2016. The distribution is fairly even at roughly 3% per day, apart from a few lower days and the tail at the very end. 
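
Slicing the first ten characters works because the timestamps are strings in a fixed format. An alternative, sketched here under the assumption that every value follows the `YYYY-MM-DD HH:MM:SS` pattern, is to parse the column into real datetimes first. The sample timestamps are made up.

```python
import pandas as pd

# Hypothetical timestamps in the same "YYYY-MM-DD HH:MM:SS" format as date_crawled.
crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-03-26 18:57:24",
    "2016-04-04 13:38:56",
    "2016-04-01 14:38:50",
])

# Parse to datetime64 and truncate each timestamp to midnight of its day.
days = pd.to_datetime(crawled, format="%Y-%m-%d %H:%M:%S").dt.normalize()

daily_share = days.value_counts(normalize=True).sort_index()
print(daily_share)
print(days.min().date(), "to", days.max().date())
```

With real datetimes, the first and last crawl day come straight from `min()` and `max()`.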

In [75]:
(autos["ad_created"]
 .str[:10]
 .value_counts(normalize=True,dropna=False)
 .sort_index()
)

2015-06-11    0.000020
2015-08-10    0.000020
2015-09-09    0.000020
2015-11-10    0.000020
2015-12-05    0.000020
                ...   
2016-04-03    0.038931
2016-04-04    0.036850
2016-04-05    0.011843
2016-04-06    0.003261
2016-04-07    0.001280
Name: ad_created, Length: 76, dtype: float64

In [76]:
(autos["ad_created"]
 .str[:10]
 .value_counts(normalize=True,dropna=False)
 .sort_index(ascending=False)
)

2016-04-07    0.001280
2016-04-06    0.003261
2016-04-05    0.011843
2016-04-04    0.036850
2016-04-03    0.038931
                ...   
2015-12-05    0.000020
2015-11-10    0.000020
2015-09-09    0.000020
2015-08-10    0.000020
2015-06-11    0.000020
Name: ad_created, Length: 76, dtype: float64

The ads were created between June 2015 and April 2016. It appears that the majority of ads are from 2016. 

This makes sense since the website was only crawled in March and April 2016, and cars sold before that would already have been removed from the site.

In [77]:
(autos["last_seen_online"]
 .str[:10]
 .value_counts(normalize=True,dropna=False)
 .sort_index()
)

2016-03-05    0.001080
2016-03-06    0.004421
2016-03-07    0.005362
2016-03-08    0.007582
2016-03-09    0.009843
2016-03-10    0.010763
2016-03-11    0.012524
2016-03-12    0.023807
2016-03-13    0.008983
2016-03-14    0.012804
2016-03-15    0.015884
2016-03-16    0.016445
2016-03-17    0.027928
2016-03-18    0.007422
2016-03-19    0.015744
2016-03-20    0.020706
2016-03-21    0.020726
2016-03-22    0.021586
2016-03-23    0.018585
2016-03-24    0.019565
2016-03-25    0.019205
2016-03-26    0.016965
2016-03-27    0.016024
2016-03-28    0.020846
2016-03-29    0.022326
2016-03-30    0.024847
2016-03-31    0.023827
2016-04-01    0.023106
2016-04-02    0.024887
2016-04-03    0.025367
2016-04-04    0.024627
2016-04-05    0.124275
2016-04-06    0.220982
2016-04-07    0.130957
Name: last_seen_online, dtype: float64

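The last three days of the crawl period (5 to 7 April 2016) show a spike: together they account for almost half of all last-seen dates. Since these are also the final days on which ads were crawled, the spike most likely reflects the end of the crawl rather than a surge in sales. 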

### Exploring the registration year

In [78]:
autos["registration_year"].describe()

count    49986.000000
mean      2005.075721
std        105.727161
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The mean registration year is about 2005 and the median is 2003. We can also see some clearly wrong values:

- The earliest registration year is 1000, which is simply not possible.
- The latest year is 9999, which is equally impossible.

We will have to investigate further before removing these values.

Technically, some very old cars could have been up for sale when this data was crawled, so we cannot discard a listing just because it is old. We can, however, confidently remove any listing with a registration year before 1908, the year the first mass-produced car was sold. 

Additionally, no car could have been registered after the listings were crawled, which was in 2016. 

Therefore, let us look at all listings with registration years outside the 1908-2016 time period.

In [79]:
(autos["registration_year"]
 .value_counts()
 .sort_index()
)

1000    1
1001    1
1111    1
1500    1
1800    2
       ..
6200    1
8888    1
9000    2
9996    1
9999    4
Name: registration_year, Length: 97, dtype: int64

In [80]:
years_drop1 = autos["registration_year"] > 2016
autos = autos.drop(autos.index[years_drop1])
years_drop2 = autos["registration_year"] < 1908
autos = autos.drop(autos.index[years_drop2])

In [81]:
autos["registration_year"].describe()

count    48016.000000
mean      2002.806002
std          7.306212
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

As we can see, we removed 1,970 listings based on these criteria.
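
The two-step drop could also be expressed as a single range check that reports how many rows it removes. A minimal sketch on made-up years (`sample`, `valid` and `cleaned` are hypothetical names):

```python
import pandas as pd

# Hypothetical registration years, including impossible ones.
sample = pd.DataFrame({"registration_year": [1000, 1910, 2003, 2016, 2017, 9999]})

# between() is inclusive on both ends by default, so 1908 and 2016 are kept.
valid = sample["registration_year"].between(1908, 2016)
cleaned = sample[valid]

print("removed:", (~valid).sum())
print(cleaned["registration_year"].tolist())
```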

In [82]:
autos["registration_year"].value_counts(normalize=True).sort_index()

1910    0.000187
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000042
          ...   
2012    0.027553
2013    0.016786
2014    0.013850
2015    0.008310
2016    0.027408
Name: registration_year, Length: 78, dtype: float64

Now we see that the oldest car is from 1910 and the newest car is from 2016.

### Exploring the data about specific brands

Let's have a look at the top 10 brands sold:

In [83]:
autos["brand"].value_counts().head(10)

volkswagen       10185
bmw               5283
opel              5194
mercedes_benz     4579
audi              4149
ford              3350
renault           2274
peugeot           1418
fiat              1242
seat               873
Name: brand, dtype: int64

The most popular car brand is Volkswagen: about 21% of all cars for sale are of that brand, roughly as many as the next two brands combined. 

German brands dominate the list, taking all of the top five spots.

Let us see the average price for each of the top 10 brands:

In [84]:
top_10 = autos["brand"].value_counts().iloc[0:10]

print(top_10)

volkswagen       10185
bmw               5283
opel              5194
mercedes_benz     4579
audi              4149
ford              3350
renault           2274
peugeot           1418
fiat              1242
seat               873
Name: brand, dtype: int64


In [85]:
top_brand = {}

brands = top_10.index.unique()

for b in brands:
    selected_brand = autos[autos["brand"] == b]
    mean_price = selected_brand["price"].mean()
    price = round(mean_price)
    top_brand[b] = price
    
top_brand

{'volkswagen': 5231.0,
 'bmw': 8103.0,
 'opel': 2877.0,
 'mercedes_benz': 8485.0,
 'audi': 9094.0,
 'ford': 3652.0,
 'renault': 2395.0,
 'peugeot': 3039.0,
 'fiat': 2712.0,
 'seat': 4296.0}

In [86]:
autos["price"].describe()

count     48016.000000
mean       5811.516953
std        9102.630877
min           0.000000
25%        1150.000000
50%        2990.000000
75%        7399.000000
max      350000.000000
Name: price, dtype: float64

#### Observations:

- Audi, Mercedes, and BMW are the most expensive of the top 10 brands on average. Their average prices are roughly 1.4 to 1.6 times the overall average of about 5,812 USD. 
- Renault is the cheapest brand of all, with an average price of 2,395 USD.
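
To put the gap between the brand averages and the overall average into numbers, the ratios can be computed from the figures printed above. A small sketch in plain Python, with the values copied from the outputs and variable names chosen for illustration:

```python
# Average prices copied from the outputs above.
brand_mean = {"audi": 9094.0, "mercedes_benz": 8485.0, "bmw": 8103.0, "renault": 2395.0}
overall_mean = 5811.516953

# Ratio of each brand's average price to the overall average.
ratios = {brand: round(price / overall_mean, 2) for brand, price in brand_mean.items()}
print(ratios)
```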

### Looking at average mileage for the most popular brands

Let's first calculate the average mileage for the same top 10 brands that we looked at previously.

In [87]:
top_mile = {}

brands = top_10.index.unique()

for b in brands:
    selected_brand = autos[autos["brand"] == b]
    mean_mileage = selected_brand["odometer"].mean()
    mileage = round(mean_mileage)
    top_mile[b] = mileage
    
top_mile

{'volkswagen': 128724.0,
 'bmw': 132431.0,
 'opel': 129223.0,
 'mercedes_benz': 130856.0,
 'audi': 129288.0,
 'ford': 124069.0,
 'renault': 128184.0,
 'peugeot': 127137.0,
 'fiat': 116554.0,
 'seat': 121564.0}

Now let's combine both avg price and avg mileage into one dataframe for easier analysis.

In [91]:
prices = pd.Series(data=top_brand)
prices

volkswagen       5231.0
bmw              8103.0
opel             2877.0
mercedes_benz    8485.0
audi             9094.0
ford             3652.0
renault          2395.0
peugeot          3039.0
fiat             2712.0
seat             4296.0
dtype: float64

In [89]:
mileages = pd.Series(data=top_mile)
mileages

volkswagen       128724.0
bmw              132431.0
opel             129223.0
mercedes_benz    130856.0
audi             129288.0
ford             124069.0
renault          128184.0
peugeot          127137.0
fiat             116554.0
seat             121564.0
dtype: float64

In [92]:
top_10_brands = {"avg_price": prices,"avg_mileage":mileages}

agg_brands = pd.DataFrame(data=top_10_brands)
agg_brands

| | avg_price | avg_mileage |
|---|---|---|
| volkswagen | 5231.0 | 128724.0 |
| bmw | 8103.0 | 132431.0 |
| opel | 2877.0 | 129223.0 |
| mercedes_benz | 8485.0 | 130856.0 |
| audi | 9094.0 | 129288.0 |
| ford | 3652.0 | 124069.0 |
| renault | 2395.0 | 128184.0 |
| peugeot | 3039.0 | 127137.0 |
| fiat | 2712.0 | 116554.0 |
| seat | 4296.0 | 121564.0 |
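
The two loops and the manual assembly of the frame could be condensed into a single `groupby` with named aggregation. This is an alternative sketch, not the approach used above; `sample` is a hypothetical stand-in for the cleaned `autos` frame, and it keeps only the top 2 brands to stay small.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned `autos` frame.
sample = pd.DataFrame({
    "brand": ["bmw", "bmw", "fiat", "fiat", "fiat", "seat"],
    "price": [8000, 9000, 2500, 3000, 2600, 4300],
    "odometer": [150000, 125000, 100000, 150000, 90000, 125000],
})

# Pick the n most frequent brands (n=10 in the analysis above, 2 in this toy example).
top = sample["brand"].value_counts().head(2).index

# One pass: filter to those brands, group, and compute both means with named aggregation.
agg = (sample[sample["brand"].isin(top)]
       .groupby("brand")
       .agg(avg_price=("price", "mean"), avg_mileage=("odometer", "mean"))
       .round())
print(agg)
```

Note that `groupby` sorts the result by brand name, whereas the loop-based version keeps the brands in order of frequency.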


#### Observations

Average mileage doesn't seem to be the dominant factor affecting price. The average mileage of all top 10 brands falls in a fairly narrow band of roughly 117,000 to 132,000 km (Fiat being the lowest), while average prices differ almost fourfold. 
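
One rough way to compare how much the two columns vary is the ratio of the largest to the smallest brand average. The sketch below rebuilds `agg_brands` from the table above and computes that ratio per column; it is a crude spread measure chosen for illustration, not a statistical test.

```python
import pandas as pd

# Values copied from the agg_brands table above.
agg_brands = pd.DataFrame(
    {
        "avg_price": [5231, 8103, 2877, 8485, 9094, 3652, 2395, 3039, 2712, 4296],
        "avg_mileage": [128724, 132431, 129223, 130856, 129288,
                        124069, 128184, 127137, 116554, 121564],
    },
    index=["volkswagen", "bmw", "opel", "mercedes_benz", "audi",
           "ford", "renault", "peugeot", "fiat", "seat"],
)

# Ratio of the largest to the smallest brand average in each column.
spread = (agg_brands.max() / agg_brands.min()).round(2)
print(spread)
```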

### Further data cleaning
Let's take a look at the name column to see if it contains any useful information.

In [100]:
autos.loc[:,["name","brand","model"]].head(20)

| | name | brand | model |
|---|---|---|---|
| 0 | Peugeot_807_160_NAVTECH_ON_BOARD | peugeot | andere |
| 1 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | bmw | 7er |
| 2 | Volkswagen_Golf_1.6_United | volkswagen | golf |
| 3 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | smart | fortwo |
| 4 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | ford | focus |
| 5 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | chrysler | voyager |
| 6 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | volkswagen | golf |
| 7 | Golf_IV_1.9_TDI_90PS | volkswagen | golf |
| 8 | Seat_Arosa | seat | arosa |
| 9 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | renault | megane |
