# Exploring eBay car sales data

 * The dataset contains used car listings from eBay Kleinanzeigen, the classifieds section of the German eBay website.
 * The aim of this project is to clean the data and analyze the included used car listings, and along the way to become familiar with some of the benefits Jupyter Notebook provides when working with pandas.
 
Data dictionary:
 * dateCrawled - When this ad was first crawled. All field-values are taken from this date.
 * name - Name of the car.
 * seller - Whether the seller is private or a dealer.
 * offerType - The type of listing.
 * price - The price on the ad to sell the car.
 * abtest - Whether the listing is included in an A/B test.
 * vehicleType - The vehicle type.
 * yearOfRegistration - The year in which the car was first registered.
 * gearbox - The transmission type.
 * powerPS - The power of the car in PS.
 * model - The car model name.
 * odometer - How many kilometers the car has driven.
 * monthOfRegistration - The month in which the car was first registered.
 * fuelType - What type of fuel the car uses.
 * brand - The brand of the car.
 * notRepairedDamage - Whether the car has damage that has not yet been repaired.
 * dateCreated - The date on which the eBay listing was created.
 * nrOfPictures - The number of pictures in the ad.
 * postalCode - The postal code for the location of the vehicle.
 * lastSeen - When the crawler last saw this ad online.

In [27]:
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv",encoding="Latin-1")
display(autos.head())

autos.info()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

From the output above, we can make the following observations:

* The dataset contains 20 columns, most of which are strings.
* Some columns have null values, but none have more than ~20% null values.
* The column names use camelCase instead of Python's preferred snake_case, which means we can't just replace spaces with underscores.
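The conversion from camelCase to snake_case can also be done programmatically. The following is a minimal sketch, assuming a hypothetical helper `camel_to_snake` (it is not part of pandas) applied to a few of the column names listed above:

```python
import re

def camel_to_snake(name):
    """Hypothetical helper: insert "_" before each capital letter that
    follows a lowercase letter or digit, then lowercase the result."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

sample_columns = ["dateCrawled", "yearOfRegistration", "powerPS", "nrOfPictures"]
snake_columns = [camel_to_snake(c) for c in sample_columns]
print(snake_columns)
```

A mechanical conversion like this only inserts underscores; it does not reorder or reword names, so renames such as `registration_year` still have to be written by hand.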

In [59]:

autos.columns = ['datecrawled', 'name', 'seller', 'offertype', 'price', 'abtest',
       'vehicletype', 'registration_year', 'gearbox', 'powerps', 'model',
       'odometer', 'registration_month', 'fueltype', 'brand',
       'unrepaired_damage', 'ad_created', 'nrofpictures', 'postalcode',
       'lastseen']
autos.head()

Unnamed: 0,datecrawled,name,seller,offertype,price,abtest,vehicletype,registration_year,gearbox,powerps,model,odometer,registration_month,fueltype,brand,unrepaired_damage,ad_created,nrofpictures,postalcode,lastseen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000.0,control,bus,2004,manuell,158,andere,150000.0,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500.0,control,limousine,1997,automatik,286,7er,150000.0,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990.0,test,limousine,2009,manuell,102,golf,70000.0,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350.0,control,kleinwagen,2007,automatik,71,fortwo,70000.0,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350.0,test,kombi,2003,manuell,0,focus,150000.0,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [61]:
autos.describe(include='all')

Unnamed: 0,datecrawled,name,seller,offertype,price,abtest,vehicletype,registration_year,gearbox,powerps,model,odometer_km,registration_month,fueltype,brand,unrepaired_damage,ad_created,nrofpictures,postalcode,lastseen
count,50000,50000,50000,50000,50000.0,50000,44905,50000.0,47320,50000.0,47242,50000.0,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,,2,8,,2,,245,,,7,40,2,76,,,39481
top,2016-03-23 19:38:20,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,,25756,12859,,36993,,4024,,,30107,10687,35232,1946,,,8
mean,,,,,9840.044,,,2005.07328,,116.35592,,125732.7,5.72336,,,,,0.0,50813.6273,
std,,,,,481104.4,,,105.712813,,209.216627,,40042.211706,3.711984,,,,,0.0,25779.747957,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1100.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30451.0,
50%,,,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49577.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71540.0,


* The "seller" and "offertype" columns each hold almost a single value (49,999 of 50,000 rows), so they are candidates to be dropped from the dataframe.
* "price" and "odometer" are text columns that need to be cleaned and converted to numbers.
* "nrofpictures" needs more investigation: every summary statistic for it is 0.
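One way to spot constant or nearly constant columns is to measure how large a share of the rows the most frequent value takes. The sketch below uses a made-up miniature frame; only the column names mirror the dataset, and the value `"dealer"` is a placeholder.

```python
import pandas as pd

# Made-up miniature frame; only the column names mirror the dataset.
sample = pd.DataFrame({
    "seller": ["privat", "privat", "privat", "dealer"],
    "nrofpictures": [0, 0, 0, 0],
    "brand": ["bmw", "opel", "ford", "audi"],
})

# Share of rows taken by the most frequent value in each column.
top_share = sample.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Columns in which a single value fills every row.
constant_cols = top_share[top_share == 1.0].index.tolist()
print(top_share)
print(constant_cols)
```

A column whose top share is at or near 1.0 carries little information for analysis and is a reasonable candidate for dropping.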

In [None]:
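# Strip "$" and "," (treated literally, not as regex), then convert to float.
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)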


In [57]:
autos["odometer"] = autos["odometer"].str.replace("km","").str.replace(",","").astype(int)
autos["odometer"].head()

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

In [60]:
autos.rename({"odometer":"odometer_km"},axis=1,inplace=True)

In [77]:
display(autos["odometer_km"].max())
display(autos["odometer_km"].min())
display(autos["odometer_km"].unique().shape)
display(autos["odometer_km"].describe())
display(autos["odometer_km"].value_counts().sort_index(ascending = False))

150000.0

5000.0

(13,)

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

150000.0    32424
125000.0     5170
100000.0     2169
90000.0      1757
80000.0      1436
70000.0      1230
60000.0      1164
50000.0      1027
40000.0       819
30000.0       789
20000.0       784
10000.0       264
5000.0        967
Name: odometer_km, dtype: int64

* The 'odometer_km' column is negatively (left) skewed: the median and the maximum are both 150,000 km, while the minimum of 5,000 km is 30 times smaller. The column takes only 13 distinct, rounded values. For now we don't change anything for "odometer_km".
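The skew can be quantified with `Series.skew()`. The sketch below rebuilds the column from the value counts printed above, so it runs without the CSV file; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Value counts copied from the output above (odometer_km -> listings).
counts = {150000: 32424, 125000: 5170, 100000: 2169, 90000: 1757,
          80000: 1436, 70000: 1230, 60000: 1164, 50000: 1027,
          40000: 819, 30000: 789, 20000: 784, 10000: 264, 5000: 967}

# Rebuild the full column by repeating each value by its count.
odometer = pd.Series(np.repeat(list(counts.keys()), list(counts.values())))

print(len(odometer), odometer.mean())
print(odometer.skew())  # a negative value indicates a left skew
```

The mean lying below the median (125,732.7 versus 150,000) points the same way as a negative skewness.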

In [81]:
display(autos["price"].max())
display(autos["price"].min())
display(autos["price"].unique().shape)
display(autos["price"].describe())
display(autos["price"].value_counts().sort_index(ascending= True).head(10))
display(autos["price"].value_counts().sort_index(ascending= False).head(200))

99999999.0

0.0

(2357,)

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

0.0     1421
1.0      156
2.0        3
3.0        1
5.0        2
8.0        1
9.0        1
10.0       7
11.0       2
12.0       3
Name: price, dtype: int64

99999999.0    1
27322222.0    1
12345678.0    3
11111111.0    2
10000000.0    1
3890000.0     1
1300000.0     1
1234566.0     1
999999.0      2
999990.0      1
350000.0      1
345000.0      1
299000.0      1
295000.0      1
265000.0      1
259000.0      1
250000.0      1
220000.0      1
198000.0      1
197000.0      1
194000.0      1
190000.0      1
180000.0      1
175000.0      1
169999.0      1
169000.0      1
163991.0      1
163500.0      1
155000.0      1
151990.0      1
             ..
47500.0       3
47499.0       1
47000.0       4
46999.0       1
46990.0       1
46911.0       1
46900.0       3
46800.0       1
46500.0       1
46200.0       1
46000.0       2
45950.0       1
45949.0       1
45900.0       3
45800.0       1
45500.0       4
45000.0       5
44996.0       1
44990.0       1
44900.0       5
44777.0       1
44500.0       1
44499.0       2
44497.0       1
44444.0       1
44200.0       2
44000.0       4
43900.0       5
43500.0       2
43461.0       1
Name: price, Length: 200

In [84]:
autos["price"] = autos.loc[autos["price"].between(0,30000),"price"]
autos["price"].describe()

count    49206.000000
mean      5025.773483
std       5679.154441
min          0.000000
25%       1100.000000
50%       2850.000000
75%       6900.000000
max      30000.000000
Name: price, dtype: float64

Prices above $30,000 occur in only 794 of the 50,000 listings, but they dramatically distort the distribution, so the cell above sets them to missing. A price of $0 appears 1,421 times, about 2.8% of all listings, so we keep it for now. Anything between $0 and $30,000 seems reasonable.
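There are two common ways to handle out-of-range prices: blank the values but keep the rows, or drop the rows entirely. The sketch below contrasts them on made-up prices; `Series.where` is used here as an equivalent of assigning the `.loc` selection back to the column.

```python
import pandas as pd

# Made-up prices; 99999999 mimics the extreme values seen above.
sample = pd.DataFrame({"price": [0.0, 1350.0, 8500.0, 45000.0, 99999999.0]})
in_range = sample["price"].between(0, 30000)  # bounds are inclusive

# Option 1: keep all rows, replace out-of-range prices with NaN.
masked = sample["price"].where(in_range)

# Option 2: drop the out-of-range rows entirely.
filtered = sample[in_range]

print(masked.isna().sum(), len(filtered))
```

Option 1 keeps the other columns of those listings available for later analysis, at the cost of having to remember that `price` now contains missing values.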

In [85]:
autos[["datecrawled","ad_created", "lastseen"]]

Unnamed: 0,datecrawled,ad_created,lastseen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50
5,2016-03-21 13:47:45,2016-03-21 00:00:00,2016-04-06 09:45:21
6,2016-03-20 17:55:21,2016-03-20 00:00:00,2016-03-23 02:48:59
7,2016-03-16 18:55:19,2016-03-16 00:00:00,2016-04-07 03:17:32
8,2016-03-22 16:51:34,2016-03-22 00:00:00,2016-03-26 18:18:10
9,2016-03-16 13:47:02,2016-03-16 00:00:00,2016-04-06 10:46:35


In [96]:
date_crawled = autos["datecrawled"].str[:10]

date_crawled.value_counts(normalize=True, dropna=False).mul(100).round(1).sort_index(ascending=True).astype(str) + "%"

2016-03-05    2.5%
2016-03-06    1.4%
2016-03-07    3.6%
2016-03-08    3.3%
2016-03-09    3.3%
2016-03-10    3.2%
2016-03-11    3.2%
2016-03-12    3.7%
2016-03-13    1.6%
2016-03-14    3.7%
2016-03-15    3.4%
2016-03-16    2.9%
2016-03-17    3.2%
2016-03-18    1.3%
2016-03-19    3.5%
2016-03-20    3.8%
2016-03-21    3.8%
2016-03-22    3.3%
2016-03-23    3.2%
2016-03-24    2.9%
2016-03-25    3.2%
2016-03-26    3.2%
2016-03-27    3.1%
2016-03-28    3.5%
2016-03-29    3.4%
2016-03-30    3.4%
2016-03-31    3.2%
2016-04-01    3.4%
2016-04-02    3.5%
2016-04-03    3.9%
2016-04-04    3.7%
2016-04-05    1.3%
2016-04-06    0.3%
2016-04-07    0.1%
Name: datecrawled, dtype: object

In [112]:
date_crawled.describe()


count          50000
unique            34
top       2016-04-03
freq            1934
Name: datecrawled, dtype: object

* The data was crawled over roughly one month, from 5 March to 7 April 2016. The distribution of listings across crawl days is more or less uniform.
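Instead of slicing the first ten characters of the string, the timestamps can be parsed into real datetimes, which also makes it easy to measure the crawl period. The sketch below uses a few `datecrawled` values from the first rows of the dataset shown earlier.

```python
import pandas as pd

# A few datecrawled values from the first rows of the dataset.
crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-03-12 16:58:10",
])

parsed = pd.to_datetime(crawled, format="%Y-%m-%d %H:%M:%S")

# Share of listings per calendar day, in chronological order.
daily_share = parsed.dt.date.value_counts(normalize=True).sort_index()

# Whole days between the first and the last crawl in this sample.
span_days = (parsed.max() - parsed.min()).days
print(daily_share)
print(span_days)
```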

In [109]:
ad_created = autos["ad_created"].str[:10]
ad_created.value_counts(normalize=True,dropna=False).mul(100).round(2).sort_index(ascending = True).astype(str) + "%"

2015-06-11     0.0%
2015-08-10     0.0%
2015-09-09     0.0%
2015-11-10     0.0%
2015-12-05     0.0%
2015-12-30     0.0%
2016-01-03     0.0%
2016-01-07     0.0%
2016-01-10     0.0%
2016-01-13     0.0%
2016-01-14     0.0%
2016-01-16     0.0%
2016-01-22     0.0%
2016-01-27    0.01%
2016-01-29     0.0%
2016-02-01     0.0%
2016-02-02     0.0%
2016-02-05     0.0%
2016-02-07     0.0%
2016-02-08     0.0%
2016-02-09     0.0%
2016-02-11     0.0%
2016-02-12    0.01%
2016-02-14     0.0%
2016-02-16     0.0%
2016-02-17     0.0%
2016-02-18     0.0%
2016-02-19    0.01%
2016-02-20     0.0%
2016-02-21    0.01%
              ...  
2016-03-09    3.32%
2016-03-10    3.19%
2016-03-11    3.28%
2016-03-12    3.66%
2016-03-13    1.69%
2016-03-14    3.52%
2016-03-15    3.37%
2016-03-16     3.0%
2016-03-17    3.12%
2016-03-18    1.37%
2016-03-19    3.38%
2016-03-20    3.79%
2016-03-21    3.77%
2016-03-22    3.28%
2016-03-23    3.22%
2016-03-24    2.91%
2016-03-25    3.19%
2016-03-26    3.26%
2016-03-27    3.09%


In [113]:
ad_created.describe()

count          50000
unique            76
top       2016-04-03
freq            1946
Name: ad_created, dtype: object

* The ad creation dates range from June 2015 to April 2016. The vast majority (roughly 97%) of ads in the dataset were created after crawling first began.

* This makes sense, as most listings are only live for a short period of time.

In [111]:
lastseen = autos["lastseen"].str[:10]
lastseen.value_counts(normalize = True, dropna = False).mul(100).round(1).sort_index(ascending = True).astype(str) + "%"

2016-03-05     0.1%
2016-03-06     0.4%
2016-03-07     0.5%
2016-03-08     0.8%
2016-03-09     1.0%
2016-03-10     1.1%
2016-03-11     1.3%
2016-03-12     2.4%
2016-03-13     0.9%
2016-03-14     1.3%
2016-03-15     1.6%
2016-03-16     1.6%
2016-03-17     2.8%
2016-03-18     0.7%
2016-03-19     1.6%
2016-03-20     2.1%
2016-03-21     2.1%
2016-03-22     2.2%
2016-03-23     1.9%
2016-03-24     2.0%
2016-03-25     1.9%
2016-03-26     1.7%
2016-03-27     1.6%
2016-03-28     2.1%
2016-03-29     2.2%
2016-03-30     2.5%
2016-03-31     2.4%
2016-04-01     2.3%
2016-04-02     2.5%
2016-04-03     2.5%
2016-04-04     2.5%
2016-04-05    12.4%
2016-04-06    22.1%
2016-04-07    13.1%
Name: lastseen, dtype: object

In [114]:
lastseen.describe()

count          50000
unique            34
top       2016-04-06
freq           11050
Name: lastseen, dtype: object

* The last seen dates are fairly evenly distributed until the final three days, which contain a disproportionate share of 'last seen' values (12.4%, 22.1% and 13.1%). Given that these are roughly 5 to 9 times the values of the preceding days, it's unlikely that there was a massive spike in sales. It's more likely that these values reflect the end of the crawling period and don't indicate car sales.

In [120]:
registration_year = autos["registration_year"]
display(registration_year.describe())

count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

In [121]:
registration_year.unique()

array([2004, 1997, 2009, 2007, 2003, 2006, 1995, 1998, 2000, 2017, 2010,
       1999, 1982, 1990, 2015, 2014, 1996, 1992, 2005, 2002, 2012, 2011,
       2008, 1985, 2016, 1994, 1986, 2001, 2018, 2013, 1972, 1993, 1988,
       1989, 1967, 1973, 1956, 1976, 4500, 1987, 1991, 1983, 1960, 1969,
       1950, 1978, 1980, 1984, 1963, 1977, 1961, 1968, 1934, 1965, 1971,
       1966, 1979, 1981, 1970, 1974, 1910, 1975, 5000, 4100, 2019, 1959,
       9996, 9999, 6200, 1964, 1958, 1800, 1948, 1931, 1943, 9000, 1941,
       1962, 1927, 1937, 1929, 1000, 1957, 1952, 1111, 1955, 1939, 8888,
       1954, 1938, 2800, 5911, 1500, 1953, 1951, 4800, 1001], dtype=int64)

In [122]:
registration_year.value_counts(dropna=False)

2000    3354
2005    3015
1999    3000
2004    2737
2003    2727
2006    2708
2001    2703
2002    2533
1998    2453
2007    2304
2008    2231
2009    2098
1997    2028
2011    1634
2010    1597
2017    1453
1996    1444
2012    1323
2016    1316
1995    1313
2013     806
2014     666
1994     660
2018     492
1993     445
2015     399
1990     395
1992     391
1991     356
1989     181
        ... 
1950       3
1955       2
9000       2
1954       2
1800       2
1957       2
1941       2
1951       2
1934       2
4100       1
4800       1
1953       1
1111       1
1927       1
6200       1
4500       1
1943       1
5911       1
1939       1
1938       1
2800       1
8888       1
1000       1
1500       1
1948       1
1931       1
1929       1
1001       1
9996       1
1952       1
Name: registration_year, Length: 97, dtype: int64

* The minimum and maximum registration years look implausible.

* The lowest registration year is 1000, which must be incorrect because cars only started appearing in the late 1800s. The maximum is 9999, which lies far in the future. Any registration year before 1885 (the first patented practical automobile) can therefore be treated as invalid.

* A car can't be first registered after its listing was seen. The ad_created column only contains dates up to 2016, so any registration year above 2016 is definitely inaccurate.

* Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s. As a first check, we count the listings that fall outside a generous 1800-2016 interval to see whether it is safe to remove those rows entirely or whether they require custom logic.
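Counting how many listings fall outside an interval comes down to a boolean mask. The sketch below uses a handful of registration years taken from the `unique()` output above, with 1800 and 2016 as the generous working bounds.

```python
import pandas as pd

# A few registration years taken from the unique() output above.
years = pd.Series([2004, 1997, 2017, 1000, 9999, 1910, 2016, 2018])

valid = years.between(1800, 2016)   # bounds are inclusive
share_invalid = (~valid).mean()     # mean of booleans = fraction of True
print(years[~valid].tolist(), share_invalid)
```

If the invalid share is small, dropping those rows is usually simpler than writing custom correction logic.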

In [126]:
autos[autos["registration_year"].between(1800, 2016)].describe()

Unnamed: 0,price,registration_year,powerps,odometer_km,registration_month,nrofpictures,postalcode
count,47245.0,48030.0,48030.0,48030.0,48030.0,48030.0,48030.0
mean,5094.483014,2002.795066,117.140496,125539.142203,5.767604,0.0,50936.383094
std,5729.183673,7.426905,195.449149,40113.458962,3.696805,0.0,25791.666655
min,0.0,1800.0,0.0,5000.0,0.0,0.0,1067.0
25%,1100.0,1999.0,71.0,100000.0,3.0,0.0,30459.0
50%,2900.0,2003.0,107.0,150000.0,6.0,0.0,49696.0
75%,6999.0,2008.0,150.0,150000.0,9.0,0.0,71665.0
max,30000.0,2016.0,17700.0,150000.0,12.0,0.0,99998.0


In [129]:
autos["registration_year"].between(1800, 2016).value_counts()

True     48030
False     1970
Name: registration_year, dtype: int64

* Only 1,970 of the 50,000 listings (about 3.9%) fall outside the interval. Within it there is still a large variety of registration years. The mean of approximately 2003 (2002.8) with a standard deviation of about 7.4 years indicates that most cars were roughly 6 to 21 years old in 2016.

In [133]:
autos["brand"].value_counts(normalize = True, dropna = False).mul(100).round(1).sort_index(ascending=True).astype(str) + "%"

alfa_romeo         0.7%
audi               8.6%
bmw               10.9%
chevrolet          0.6%
chrysler           0.4%
citroen            1.4%
dacia              0.3%
daewoo             0.2%
daihatsu           0.3%
fiat               2.6%
ford               7.0%
honda              0.8%
hyundai            1.0%
jaguar             0.2%
jeep               0.2%
kia                0.7%
lada               0.1%
lancia             0.1%
land_rover         0.2%
mazda              1.5%
mercedes_benz      9.5%
mini               0.8%
mitsubishi         0.8%
nissan             1.5%
opel              10.9%
peugeot            2.9%
porsche            0.6%
renault            4.8%
rover              0.1%
saab               0.2%
seat               1.9%
skoda              1.6%
smart              1.4%
sonstige_autos     1.1%
subaru             0.2%
suzuki             0.6%
toyota             1.2%
trabant            0.2%
volkswagen        21.4%
volvo              0.9%
Name: brand, dtype: object

In [131]:
autos["brand"].unique()

array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)

When working with data on cars, it's good to explore variations across different car brands. Aggregation can be used to understand the brand column.
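One compact way to aggregate by brand is `groupby`. The sketch below uses made-up listings; the brand names occur in the dataset, but the prices are illustrative only.

```python
import pandas as pd

# Made-up listings; brand names occur in the dataset, prices are illustrative.
sample = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel", "ford"],
    "price": [8000.0, 6000.0, 3000.0, None, 3500.0],
})

# Mean price per brand; mean() skips missing prices by default.
mean_price = sample.groupby("brand")["price"].mean().sort_values(ascending=False)
print(mean_price)
```

An explicit loop over brands with a boolean mask gives the same numbers; `groupby` simply does the split, the aggregation and the recombination in one step.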



In [135]:
autos_brand_20 = autos["brand"].value_counts().head(20).index
print(autos_brand_20)

Index(['volkswagen', 'opel', 'bmw', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'nissan', 'smart',
       'citroen', 'toyota', 'sonstige_autos', 'hyundai', 'volvo', 'mini'],
      dtype='object')


In [144]:
brand_mean_price = {}
for brand in autos_brand_20:
    mean_price = round(autos.loc[autos["brand"] == brand, "price"].mean())
    brand_mean_price[brand] = mean_price
display(brand_mean_price)

{'volkswagen': 4945,
 'opel': 2834,
 'bmw': 7142,
 'mercedes_benz': 7209,
 'audi': 7640,
 'ford': 3421,
 'renault': 2314,
 'peugeot': 3011,
 'fiat': 2698,
 'seat': 4188,
 'skoda': 6271,
 'mazda': 3773,
 'nissan': 4511,
 'smart': 3483,
 'citroen': 3645,
 'toyota': 4984,
 'sonstige_autos': 6692,
 'hyundai': 5317,
 'volvo': 4686,
 'mini': 10281}

In [147]:
brand_mean_price = dict(sorted(brand_mean_price.items(),key=lambda item:item[1], reverse=True))
brand_mean_price

{'mini': 10281,
 'audi': 7640,
 'mercedes_benz': 7209,
 'bmw': 7142,
 'sonstige_autos': 6692,
 'skoda': 6271,
 'hyundai': 5317,
 'toyota': 4984,
 'volkswagen': 4945,
 'volvo': 4686,
 'nissan': 4511,
 'seat': 4188,
 'mazda': 3773,
 'citroen': 3645,
 'smart': 3483,
 'ford': 3421,
 'peugeot': 3011,
 'opel': 2834,
 'fiat': 2698,
 'renault': 2314}

We aggregated across brands to compare mean prices. Among the six most common brands there is a distinct price gap:

Audi, BMW and Mercedes-Benz are more expensive (mean prices above 7,000).

Ford and Opel are less expensive (mean prices below 3,500).

Volkswagen is in between, at 4,945.

Overall, Mini has the highest mean price among the 20 most common brands in our dataset, followed by Audi, Mercedes-Benz, BMW and sonstige_autos (other cars).

In [150]:
bmp_series = pd.Series(brand_mean_price)
bmp_series

mini              10281
audi               7640
mercedes_benz      7209
bmw                7142
sonstige_autos     6692
skoda              6271
hyundai            5317
toyota             4984
volkswagen         4945
volvo              4686
nissan             4511
seat               4188
mazda              3773
citroen            3645
smart              3483
ford               3421
peugeot            3011
opel               2834
fiat               2698
renault            2314
dtype: int64

In [152]:
brand_mean_mileage = {}
for brand in autos_brand_20:
    mileage_mean = round(autos.loc[autos["brand"] == brand, "odometer_km"].mean())
    brand_mean_mileage[brand] = mileage_mean
bmm_series = pd.Series(brand_mean_mileage)    
display(bmm_series)

volkswagen        128955
opel              129299
bmw               132522
mercedes_benz     130886
audi              129644
ford              124132
renault           128224
peugeot           127352
fiat              117037
seat              122062
skoda             110948
mazda             125132
nissan            118979
smart             100756
citroen           119765
toyota            115989
sonstige_autos     87189
hyundai           106783
volvo             138632
mini               89375
dtype: int64

In [158]:
brand_df = pd.DataFrame(bmp_series, columns = ['mean_price'])
display(brand_df)

Unnamed: 0,mean_price
mini,10281
audi,7640
mercedes_benz,7209
bmw,7142
sonstige_autos,6692
skoda,6271
hyundai,5317
toyota,4984
volkswagen,4945
volvo,4686


In [157]:
bmm_df = pd.DataFrame(bmm_series, columns=['mean_mileage'])
display(bmm_df)

Unnamed: 0,mean_mileage
volkswagen,128955
opel,129299
bmw,132522
mercedes_benz,130886
audi,129644
ford,124132
renault,128224
peugeot,127352
fiat,117037
seat,122062


In [160]:
brand_df["mean_mileage"] = bmm_series
display(brand_df)

Unnamed: 0,mean_price,mean_mileage
mini,10281,89375
audi,7640,129644
mercedes_benz,7209,130886
bmw,7142,132522
sonstige_autos,6692,87189
skoda,6271,110948
hyundai,5317,106783
toyota,4984,115989
volkswagen,4945,128955
volvo,4686,138632


* The table shows mean price and mean mileage side by side, and there is no consistent relationship between them. Audi, Mercedes-Benz and BMW combine high mean mileage (about 130,000 km) with high mean prices, while Mini has the highest mean price and one of the lowest mean mileages. From these aggregates it is therefore impossible to conclude whether higher mileage affects the price. Within a brand, many other variables affect price (such as car type, engine type and registration year). To confirm whether mileage affects the price, we would need a slice of the dataset in which those variables are kept as similar as possible.
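As a rough numerical check, the correlation between the two aggregates can be computed across brands. The sketch below copies the ten rows of the table above; with so few aggregated points the result is only an indication and says nothing about cause and effect.

```python
import pandas as pd

# Values copied from the ten rows of the table above.
brand_stats = pd.DataFrame(
    {
        "mean_price": [10281, 7640, 7209, 7142, 6692,
                       6271, 5317, 4984, 4945, 4686],
        "mean_mileage": [89375, 129644, 130886, 132522, 87189,
                         110948, 106783, 115989, 128955, 138632],
    },
    index=["mini", "audi", "mercedes_benz", "bmw", "sonstige_autos",
           "skoda", "hyundai", "toyota", "volkswagen", "volvo"],
)

# Pearson correlation between mean price and mean mileage across brands.
r = brand_stats["mean_price"].corr(brand_stats["mean_mileage"])
print(round(r, 2))
```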

In [161]:
autos.head(10)

Unnamed: 0,datecrawled,name,seller,offertype,price,abtest,vehicletype,registration_year,gearbox,powerps,model,odometer_km,registration_month,fueltype,brand,unrepaired_damage,ad_created,nrofpictures,postalcode,lastseen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000.0,control,bus,2004,manuell,158,andere,150000.0,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500.0,control,limousine,1997,automatik,286,7er,150000.0,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990.0,test,limousine,2009,manuell,102,golf,70000.0,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350.0,control,kleinwagen,2007,automatik,71,fortwo,70000.0,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350.0,test,kombi,2003,manuell,0,focus,150000.0,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
5,2016-03-21 13:47:45,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,privat,Angebot,7900.0,test,bus,2006,automatik,150,voyager,150000.0,4,diesel,chrysler,,2016-03-21 00:00:00,0,22962,2016-04-06 09:45:21
6,2016-03-20 17:55:21,VW_Golf_III_GT_Special_Electronic_Green_Metall...,privat,Angebot,300.0,test,limousine,1995,manuell,90,golf,150000.0,8,benzin,volkswagen,,2016-03-20 00:00:00,0,31535,2016-03-23 02:48:59
7,2016-03-16 18:55:19,Golf_IV_1.9_TDI_90PS,privat,Angebot,1990.0,control,limousine,1998,manuell,90,golf,150000.0,12,diesel,volkswagen,nein,2016-03-16 00:00:00,0,53474,2016-04-07 03:17:32
8,2016-03-22 16:51:34,Seat_Arosa,privat,Angebot,250.0,test,,2000,manuell,0,arosa,150000.0,10,,seat,nein,2016-03-22 00:00:00,0,7426,2016-03-26 18:18:10
9,2016-03-16 13:47:02,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,privat,Angebot,590.0,control,bus,1997,manuell,90,megane,150000.0,7,benzin,renault,nein,2016-03-16 00:00:00,0,15749,2016-04-06 10:46:35


* vehicletype, gearbox, fueltype and unrepaired_damage are columns whose values are in German.



In [162]:
autos["vehicletype"].unique()

array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv',
       'cabrio', 'andere'], dtype=object)

In [164]:
autos["gearbox"].unique()

array(['manuell', 'automatik', nan], dtype=object)

In [165]:
autos["fueltype"].unique()

array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro',
       'andere'], dtype=object)

In [167]:
translate_map = {
    "kleinwagen": "small car",
    "kombi": "estate car",
    "cabrio": "convertible",
    "andere": "other",
    "elektro": "electric",
    "benzin": "petrol",
    "manuell": "manual",
    "automatik": "automatic",
    "bus": "bus",
    "limousine": "limousine",
    "coupe": "coupe",
    "suv": "suv",
    "lpg": "lpg",
    "diesel": "diesel",
    "cng": "cng",
    "hybrid": "hybrid",
    "Unknown": "Unknown",
    "nein": "no",
    "ja": "yes",
}
categorical = ["fueltype", "gearbox", "vehicletype", "unrepaired_damage"]

In [168]:
for col in categorical:
    autos[col] = autos[col].map(translate_map)

In [171]:
for c in categorical:
    autos.loc[autos[c].isnull(), c] ="Unknown"
    
autos["gearbox"].unique()   

array(['manual', 'automatic', 'Unknown'], dtype=object)

In [172]:
autos["vehicletype"].unique()

array(['bus', 'limousine', 'small car', 'estate car', 'Unknown', 'coupe',
       'suv', 'convertible', 'other'], dtype=object)

In [173]:
autos["fueltype"].unique()

array(['lpg', 'petrol', 'diesel', 'Unknown', 'cng', 'hybrid', 'electric',
       'other'], dtype=object)

In [174]:
autos["unrepaired_damage"].unique()

array(['no', 'Unknown', 'yes'], dtype=object)
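One caveat of translating with `Series.map` and a dictionary is that any value missing from the dictionary silently becomes NaN, and would then be relabelled "Unknown" together with the real gaps. The sketch below shows a fallback that keeps untranslated values; the data is made up, and "wasserstoff" is deliberately left out of the mapping.

```python
import pandas as pd

# Made-up values; "wasserstoff" is deliberately absent from the mapping.
fuel = pd.Series(["benzin", "diesel", "wasserstoff", None])
mapping = {"benzin": "petrol", "diesel": "diesel"}

mapped = fuel.map(mapping)               # unmapped values become NaN
safe = fuel.map(mapping).fillna(fuel)    # fall back to the original value
final = safe.fillna("Unknown")           # only true gaps become "Unknown"
print(mapped.tolist())
print(final.tolist())
```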

In [178]:
date_cols = ["datecrawled","ad_created","lastseen"]

for d in date_cols:
    cleaned_date = autos[d].str[:10].str.replace("-","").astype(int)
    autos[d] = cleaned_date

In [179]:
autos.head()

Unnamed: 0,datecrawled,name,seller,offertype,price,abtest,vehicletype,registration_year,gearbox,powerps,model,odometer_km,registration_month,fueltype,brand,unrepaired_damage,ad_created,nrofpictures,postalcode,lastseen
0,20160326,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000.0,control,bus,2004,manual,158,andere,150000.0,3,lpg,peugeot,no,20160326,0,79588,20160406
1,20160404,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500.0,control,limousine,1997,automatic,286,7er,150000.0,6,petrol,bmw,no,20160404,0,71034,20160406
2,20160326,Volkswagen_Golf_1.6_United,privat,Angebot,8990.0,test,limousine,2009,manual,102,golf,70000.0,7,petrol,volkswagen,no,20160326,0,35394,20160406
3,20160312,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350.0,control,small car,2007,automatic,71,fortwo,70000.0,6,petrol,smart,no,20160312,0,33729,20160315
4,20160401,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350.0,test,estate car,2003,manual,0,focus,150000.0,7,petrol,ford,no,20160401,0,39218,20160401
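Integer dates such as 20160326 sort correctly, but arithmetic on them is not meaningful (20160401 minus 20160331 is 70, not 1 day). When day differences are needed, the integers can be parsed back into datetimes, as in this sketch using three `ad_created` values from the table above:

```python
import pandas as pd

# Integer dates as stored after the conversion above.
ad_created = pd.Series([20160326, 20160404, 20160312])

# Parse the YYYYMMDD integers back into real datetimes.
as_dates = pd.to_datetime(ad_created.astype(str), format="%Y%m%d")

# Days elapsed since the earliest date in the sample.
days_since_first = (as_dates - as_dates.min()).dt.days
print(days_since_first.tolist())
```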


* Next, we look for the most common brand/model combinations, which can be done with aggregation.



In [181]:
unique_brands = autos["brand"].unique()
unique_brands

array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)

In [186]:
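dict_model_brand = {}
for b in unique_brands:
    # Caution: .index.max() returns the alphabetically last model name of the
    # brand, not the most frequent one (that would be .index[0]).
    brands = autos.loc[autos["brand"] == b, "model"].value_counts().index.max()
    dict_model_brand[b] = brands
display(dict_model_brand)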

{'peugeot': 'andere',
 'bmw': 'z_reihe',
 'volkswagen': 'up',
 'smart': 'roadster',
 'ford': 'transit',
 'chrysler': 'voyager',
 'seat': 'toledo',
 'renault': 'twingo',
 'mercedes_benz': 'vito',
 'audi': 'tt',
 'sonstige_autos': nan,
 'opel': 'zafira',
 'mazda': 'rx_reihe',
 'porsche': 'cayenne',
 'mini': 'one',
 'toyota': 'yaris',
 'dacia': 'sandero',
 'nissan': 'x_trail',
 'jeep': 'wrangler',
 'saab': 'andere',
 'volvo': 'xc_reihe',
 'mitsubishi': 'pajero',
 'jaguar': 'x_type',
 'fiat': 'stilo',
 'skoda': 'yeti',
 'subaru': 'legacy',
 'kia': 'sportage',
 'citroen': 'c5',
 'chevrolet': 'spark',
 'hyundai': 'tucson',
 'honda': 'jazz',
 'daewoo': 'nubira',
 'suzuki': 'swift',
 'trabant': 'andere',
 'land_rover': 'range_rover_sport',
 'alfa_romeo': 'spider',
 'lada': 'samara',
 'rover': 'rangerover',
 'daihatsu': 'terios',
 'lancia': 'ypsilon'}

In [187]:
dict_m_b_series = pd.Series(dict_model_brand)
dict_m_b_series

peugeot                      andere
bmw                         z_reihe
volkswagen                       up
smart                      roadster
ford                        transit
chrysler                    voyager
seat                         toledo
renault                      twingo
mercedes_benz                  vito
audi                             tt
sonstige_autos                  NaN
opel                         zafira
mazda                      rx_reihe
porsche                     cayenne
mini                            one
toyota                        yaris
dacia                       sandero
nissan                      x_trail
jeep                       wrangler
saab                         andere
volvo                      xc_reihe
mitsubishi                   pajero
jaguar                       x_type
fiat                          stilo
skoda                          yeti
subaru                       legacy
kia                        sportage
citroen                     

In [189]:
df = pd.DataFrame(dict_m_b_series, columns=["model"])
df

Unnamed: 0,model
peugeot,andere
bmw,z_reihe
volkswagen,up
smart,roadster
ford,transit
chrysler,voyager
seat,toledo
renault,twingo
mercedes_benz,vito
audi,tt
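
The brand/model counting described above can also be sketched with `groupby`, which counts each pair directly and ranks by frequency rather than by name. This is only a minimal sketch: `sample` is a hypothetical toy frame standing in for the cleaned `autos` data, and `pair_counts` and `top_model` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned `autos` frame (hypothetical rows, not the real data).
sample = pd.DataFrame({
    "brand": ["bmw", "bmw", "bmw", "ford", "ford", "sonstige_autos"],
    "model": ["3er", "3er", "z_reihe", "focus", "focus", None],
})

# Count every brand/model pair; rows with a missing model are dropped by default.
pair_counts = (
    sample.groupby(["brand", "model"])
    .size()
    .sort_values(ascending=False)
)

# After sorting by count, the first row seen for each brand is its most frequent model.
top_model = pair_counts.reset_index(name="count").drop_duplicates("brand")
print(top_model)
```

Brands whose listings have no model at all (such as the toy `sonstige_autos` row) simply do not appear in the result, instead of showing up as `nan`.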


* Using aggregation to see whether average prices follow any pattern based on mileage.


In [192]:
group_odemeter = autos["odometer_km"].unique().tolist()
group_odemeter = sorted(group_odemeter)
group_odemeter

[5000.0,
 10000.0,
 20000.0,
 30000.0,
 40000.0,
 50000.0,
 60000.0,
 70000.0,
 80000.0,
 90000.0,
 100000.0,
 125000.0,
 150000.0]

In [193]:
dict_avg_mil_price = {}
for km in group_odemeter:
    mean_mil = autos.loc[autos["odometer_km"] == km, "price"].mean()
    dict_avg_mil_price[km] = mean_mil
dict_avg_mil_price

{5000.0: 3743.7122692725297,
 10000.0: 12652.788018433179,
 20000.0: 12008.02643171806,
 30000.0: 12416.991561181434,
 40000.0: 12365.930388219545,
 50000.0: 11109.581664910433,
 60000.0: 10098.23891402715,
 70000.0: 9450.564516129032,
 80000.0: 8523.173725771716,
 90000.0: 7448.366180758017,
 100000.0: 6919.559905660378,
 125000.0: 5637.48087431694,
 150000.0: 3519.070613696089}

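* From 10,000 km upward, the average price falls as mileage increases, from about 12,650 at 10,000 km to about 3,520 at 150,000 km. The 5,000 km group is an exception, with a mean of only about 3,744. Unrepaired damage may contribute to this pattern, but further analysis is needed.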
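
The loop over mileage values can be expressed as a single `groupby`. The sketch below assumes a frame with the notebook's `odometer_km` and `price` columns; the rows in `sample` are hypothetical, and `price_by_km` is an illustrative name. Including the listing count next to the mean helps judge how much weight each bucket's average deserves.

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned `autos` frame.
sample = pd.DataFrame({
    "odometer_km": [5000.0, 5000.0, 50000.0, 50000.0, 150000.0, 150000.0],
    "price": [1000.0, 3000.0, 10000.0, 12000.0, 3000.0, 4000.0],
})

# One row per mileage bucket: mean price plus the number of listings behind it.
price_by_km = (
    sample.groupby("odometer_km")["price"]
    .agg(["mean", "count"])
    .sort_index()
)
print(price_by_km)
```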

* Checking how much cheaper cars with unrepaired damage are than their undamaged counterparts.

In [194]:
autos["unrepaired_damage"].unique()

array(['no', 'Unknown', 'yes'], dtype=object)

In [195]:
autos["unrepaired_damage"].value_counts()

no         35232
Unknown     9829
yes         4939
Name: unrepaired_damage, dtype: int64

In [196]:
no_damaged_counterparts = autos["unrepaired_damage"].unique().tolist()


In [200]:
dic_non_damaged = {}
for no in no_damaged_counterparts:
    mean_non_dam_price = autos.loc[autos["unrepaired_damage"] == no, "price"].mean()
    mean_non_dam_price = round(mean_non_dam_price,2)
    dic_non_damaged[no] = mean_non_dam_price
dic_non_damaged

{'no': 6091.86, 'Unknown': 2768.66, 'yes': 2043.63}

In [203]:
dic_non_series = pd.Series(dic_non_damaged)
dic_non_df = pd.DataFrame(dic_non_series, columns=["non_damaged_counterparts"])
dic_non_df

Unnamed: 0,non_damaged_counterparts
no,6091.86
Unknown,2768.66
yes,2043.63


In [205]:
dif_price = dic_non_damaged["no"] - dic_non_damaged["yes"]
dif_price

4048.2299999999996

* Undamaged cars sell for about 4,000 dollars more on average than cars with unrepaired damage (6,091.86 vs. 2,043.63).

* On average, cars with unrepaired damage are the cheapest group, and listings with unknown damage status (2,768.66) fall between the two.
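
Since both mileage and damage are associated with price, one possible follow-up is to compare damaged and undamaged cars within the same mileage bucket. The sketch below uses `pivot_table` on hypothetical rows; `sample`, `by_km_and_damage`, and the `gap` column are illustrative names, not part of the notebook, and the numbers are made up for the example.

```python
import pandas as pd

# Hypothetical rows with the notebook's column names (not the real data).
sample = pd.DataFrame({
    "odometer_km": [50000.0, 50000.0, 50000.0, 150000.0, 150000.0, 150000.0],
    "unrepaired_damage": ["no", "no", "yes", "no", "yes", "yes"],
    "price": [12000.0, 10000.0, 5000.0, 4000.0, 1500.0, 2500.0],
})

# Mean price for each mileage bucket (rows) and damage status (columns).
by_km_and_damage = sample.pivot_table(
    index="odometer_km",
    columns="unrepaired_damage",
    values="price",
    aggfunc="mean",
)

# Price gap between undamaged and damaged cars within the same mileage bucket.
by_km_and_damage["gap"] = by_km_and_damage["no"] - by_km_and_damage["yes"]
print(by_km_and_damage)
```

If the gap stays positive in every bucket of the real data, damage would be lowering prices independently of mileage; that would need to be checked against the full `autos` frame.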