# Project: Exploring eBay Car Sales Data

In this project, we explore a dataset of used cars from eBay _Kleinanzeigen_, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data). We've made a few modifications to the original dataset that was uploaded to Kaggle:

- We sampled 50,000 data points from the full dataset, to ensure your code runs quickly in our hosted environment.
- We dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with).

The aim of this project is to clean the data and analyze the included used car listings.

The data dictionary provided with the data is as follows:

    dateCrawled - When this ad was first crawled. All field-values are taken from this date.
    name - Name of the car.
    seller - Whether the seller is private or a dealer.
    offerType - The type of listing
    price - The price on the ad to sell the car.
    abtest - Whether the listing is included in an A/B test.
    vehicleType - The vehicle Type.
    yearOfRegistration - The year in which the car was first registered.
    gearbox - The transmission type.
    powerPS - The power of the car in PS.
    model - The car model name.
    kilometer - How many kilometers the car has driven.
    monthOfRegistration - The month in which the car was first registered.
    fuelType - What type of fuel the car uses.
    brand - The brand of the car.
    notRepairedDamage - Whether the car has damage that has not yet been repaired.
    dateCreated - The date on which the eBay listing was created.
    nrOfPictures - The number of pictures in the ad.
    postalCode - The postal code for the location of the vehicle.
    lastSeenOnline - When the crawler saw this ad last online.

In [1]:
# import libraries
import pandas as pd
import numpy as np

## capture data and get basic information

In [2]:
autos = pd.read_csv('autos.csv', delimiter=',', encoding='Latin-1')

In [3]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
5,2016-03-21 13:47:45,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,privat,Angebot,"$7,900",test,bus,2006,automatik,150,voyager,"150,000km",4,diesel,chrysler,,2016-03-21 00:00:00,0,22962,2016-04-06 09:45:21
6,2016-03-20 17:55:21,VW_Golf_III_GT_Special_Electronic_Green_Metall...,privat,Angebot,$300,test,limousine,1995,manuell,90,golf,"150,000km",8,benzin,volkswagen,,2016-03-20 00:00:00,0,31535,2016-03-23 02:48:59
7,2016-03-16 18:55:19,Golf_IV_1.9_TDI_90PS,privat,Angebot,"$1,990",control,limousine,1998,manuell,90,golf,"150,000km",12,diesel,volkswagen,nein,2016-03-16 00:00:00,0,53474,2016-04-07 03:17:32
8,2016-03-22 16:51:34,Seat_Arosa,privat,Angebot,$250,test,,2000,manuell,0,arosa,"150,000km",10,,seat,nein,2016-03-22 00:00:00,0,7426,2016-03-26 18:18:10
9,2016-03-16 13:47:02,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,privat,Angebot,$590,control,bus,1997,manuell,90,megane,"150,000km",7,benzin,renault,nein,2016-03-16 00:00:00,0,15749,2016-04-06 10:46:35


In [4]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

**Observations:**
The following columns have missing data:
- vehicleType
- gearbox
- model
- fuelType
- notRepairedDamage

Columns are either object (15/20) or int (5/20) data types.
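The share of missing values per column can also be computed directly instead of being read off the `info()` output. The following is a minimal sketch that uses a small, made-up frame (`sample`) in place of `autos`; the values in it are illustrative only.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the autos DataFrame (illustrative values, not the real data).
sample = pd.DataFrame({
    'brand': ['bmw', 'ford', 'seat', 'smart'],
    'vehicleType': ['limousine', 'kombi', np.nan, 'kleinwagen'],
    'notRepairedDamage': ['nein', np.nan, np.nan, 'nein'],
})

# isnull() flags missing cells; the mean of the booleans is the missing share.
null_share = sample.isnull().mean() * 100

# Keep only the columns that actually have missing data, largest share first.
missing = null_share[null_share > 0].sort_values(ascending=False)
print(missing)
```

Applied to the real data, the same expression (`autos.isnull().mean() * 100`) should report about 19.7% missing for notRepairedDamage, since 40,171 of its 50,000 values are non-null.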

In [5]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


**Observations:**
- The data isn't sorted in any obvious way (e.g. not in order of date crawled).
- The name column follows a format of 
    - car_manufacturer_car_model_engine...
- The data is not in English, as expected for a German website.
- The monthOfRegistration column is stored as a number.
- Some columns look like they have been cleaned up and are in lower case format.
- Units are included in the values (odometer data includes km).
- dateCreated includes a time portion, which appears to be set to 00:00:00.

**More Observations:**
- The dataset contains 20 columns, most of which are strings.
- Some columns have null values, but none have more than ~20% null values.
- The column names use [camelcase](https://en.wikipedia.org/wiki/Camel_case) instead of Python's preferred [snakecase](https://en.wikipedia.org/wiki/Snake_case), which means we can't just replace spaces with underscores.

---

## convert column names from camelcase to snakecase

In [6]:
# current columns
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [7]:
# map old names to new names
mapping_dict = {
    'dateCrawled': 'date_crawled',
    'offerType': 'offer_type',
    'vehicleType': 'vehicle_type',
    'yearOfRegistration': 'registration_year',
    'powerPS': 'power_ps',
    'monthOfRegistration': 'registration_month',
    'fuelType': 'fuel_type',
    'notRepairedDamage': 'unrepaired_damage',
    'dateCreated': 'ad_created',
    'nrOfPictures': 'num_photos',
    'postalCode': 'postal_code',
    'lastSeen': 'last_seen',
}

In [8]:
# rename columns
autos.rename(mapping_dict, axis=1, inplace=True)

In [9]:
# new columns
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
       'last_seen'],
      dtype='object')

In [10]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_photos,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


### Steps Taken
At this stage, we rename the columns to follow a consistent format:
- converting them from camelcase to snakecase
- renaming certain columns to simpler versions (nrOfPictures -> num_photos)
- leaving alone the columns that were already in the right format (name, price, model, etc.)

The approach I took was to create a dictionary mapping the old names to the new names, then use the **df.rename()** method to rename the respective columns in place.
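For columns that only need a mechanical camelcase-to-snakecase conversion, the mapping could also be generated rather than typed by hand. The sketch below is one possible way to do that with a regular expression; `camel_to_snake` is a hypothetical helper, not part of pandas or of the notebook above.

```python
import re

def camel_to_snake(name):
    """Insert '_' before each capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

# A few of the original column names from the dataset.
columns = ['dateCrawled', 'offerType', 'powerPS', 'nrOfPictures', 'abtest']
converted = [camel_to_snake(c) for c in columns]
print(converted)
```

This only handles the mechanical part. Renames that change the wording (yearOfRegistration -> registration_year, nrOfPictures -> num_photos) still need an explicit entry in the mapping dictionary.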

---

## Basic Data Exploration

In [11]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_photos,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-12 16:06:22,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


**Observations:**
- a few columns have only a handful of unique values - (seller, offer_type, abtest, vehicle_type, gearbox, fuel_type, unrepaired_damage)
- price and odometer have to be converted from string to numerical values
- num_photos, unrepaired_damage, registration_month, offer_type, abtest, seller and vehicle_type deserve a closer look

We investigate a few columns in more detail:

In [12]:
autos['num_photos'].value_counts()

0    50000
Name: num_photos, dtype: int64

In [13]:
autos['unrepaired_damage'].value_counts()

nein    35232
ja       4939
Name: unrepaired_damage, dtype: int64

In [14]:
autos['registration_month'].value_counts()

0     5075
3     5071
6     4368
5     4107
4     4102
7     3949
10    3651
12    3447
9     3389
11    3360
1     3282
8     3191
2     3008
Name: registration_month, dtype: int64

In [15]:
autos['registration_year'].value_counts()

2000    3354
2005    3015
1999    3000
2004    2737
2003    2727
2006    2708
2001    2703
2002    2533
1998    2453
2007    2304
2008    2231
2009    2098
1997    2028
2011    1634
2010    1597
2017    1453
1996    1444
2012    1323
2016    1316
1995    1313
2013     806
2014     666
1994     660
2018     492
1993     445
2015     399
1990     395
1992     391
1991     356
1989     181
        ... 
1950       3
1955       2
9000       2
1954       2
1800       2
1957       2
1941       2
1951       2
1934       2
4100       1
4800       1
1953       1
1111       1
1927       1
6200       1
4500       1
1943       1
5911       1
1939       1
1938       1
2800       1
8888       1
1000       1
1500       1
1948       1
1931       1
1929       1
1001       1
9996       1
1952       1
Name: registration_year, Length: 97, dtype: int64

In [16]:
autos['odometer'].value_counts()

150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64

In [17]:
autos['abtest'].value_counts()

test       25756
control    24244
Name: abtest, dtype: int64

In [18]:
autos['offer_type'].value_counts()

Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64

In [19]:
autos['seller'].value_counts()

privat        49999
gewerblich        1
Name: seller, dtype: int64

In [20]:
autos['vehicle_type'].value_counts()

limousine     12859
kleinwagen    10822
kombi          9127
bus            4093
cabrio         3061
coupe          2537
suv            1986
andere          420
Name: vehicle_type, dtype: int64

### drop columns

We observed two columns in which all but one of the rows share the same value:
- offer_type
- seller

We also observed that all values in the num_photos column were 0, so none of the listings in our dataset recorded any photos.

These 3 columns carry almost no information, so we can drop them.

In [21]:
autos = autos.drop(['offer_type', 'seller', 'num_photos'], axis=1)
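Near-constant columns like these can also be detected programmatically instead of by inspecting each `value_counts()` output. The following is a rough sketch on made-up data; the helper `top_value_share` and the 0.9 threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up stand-in for autos: two near-constant columns and one varied column.
sample = pd.DataFrame({
    'seller': ['privat'] * 19 + ['gewerblich'],
    'num_photos': [0] * 20,
    'brand': ['bmw', 'ford', 'seat', 'smart', 'audi'] * 4,
})

def top_value_share(df):
    """Return, per column, the fraction of rows taken by the most frequent value."""
    return pd.Series({
        col: df[col].value_counts(normalize=True, dropna=False).iloc[0]
        for col in df.columns
    })

shares = top_value_share(sample)

THRESHOLD = 0.9  # assumed cut-off for "nearly all rows share one value"
near_constant = shares[shares >= THRESHOLD].index.tolist()
print(near_constant)
```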

### convert numeric values stored as text to numeric values - price and odometer

In [22]:
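# convert price column to int type after cleaning non-numeric values
# regex=False makes '$' a literal character, not the regex end-of-string anchor
autos['price'] = (autos['price']
                      .str.replace('$', '', regex=False)
                      .str.replace(',', '', regex=False)
                      .astype(int)
                 )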

In [23]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 17 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
price                 50000 non-null int64
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gearbox               47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer              50000 non-null object
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int64(5), object(12)
memory usage: 6.5+ MB


In [24]:
# convert odometer column to int type after cleaning non-numeric values
autos['odometer'] = (autos['odometer']
                         .str.replace('km', '')
                         .str.replace(',', '')
                         .astype(int)
                    )

In [25]:
# rename odometer column name to include distance information
autos.rename({'odometer': 'odometer_km'}, axis=1, inplace=True)
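Because the price and odometer conversions follow the same pattern, they could share one helper. This is a sketch only: `strip_to_int` is a hypothetical function name, and the sample values are taken from the first rows shown earlier.

```python
import pandas as pd

def strip_to_int(series, tokens):
    """Remove each literal token from a string Series, then cast to int."""
    cleaned = series
    for token in tokens:
        # regex=False so that characters such as '$' are matched literally.
        cleaned = cleaned.str.replace(token, '', regex=False)
    return cleaned.astype(int)

price = pd.Series(['$5,000', '$300', '$1,350'])
odometer = pd.Series(['150,000km', '70,000km', '5,000km'])

price_int = strip_to_int(price, ['$', ','])
odometer_int = strip_to_int(odometer, ['km', ','])
print(price_int.tolist(), odometer_int.tolist())
```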

In [26]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 17 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
price                 50000 non-null int64
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gearbox               47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer_km           50000 non-null int64
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int64(6), object(11)
memory usage: 6.5+ MB


---

## Continue Data Exploration

### Explore prices

In [27]:
autos['price'].unique().shape

(2357,)

In [28]:
autos['price'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [29]:
autos['price'].value_counts().head(20)

0       1421
500      781
1500     734
2500     643
1000     639
1200     639
600      531
800      498
3500     498
2000     460
999      434
750      433
900      420
650      419
850      410
700      395
4500     394
300      384
2200     382
950      379
Name: price, dtype: int64

In [30]:
autos['price'].value_counts().sort_index(ascending=False).head(20)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

In [31]:
autos['price'].value_counts().sort_index(ascending=True).head(20)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

**Observations:**
- There are 2357 unique prices
- 1421 entries have a price of \$0
- The most expensive listing is \$99,999,999 (roughly \$100M)
- The 20 highest distinct prices are all \$197,000 and up
- The 1st quartile is \$1,100, the median is \$2,950 and the 3rd quartile is \$7,200
- Prices increase gradually up to \$350,000, then jump to nearly \$1M and much higher
- eBay being a bidding platform, it is possible for prices to start as low as \$1

In [32]:
# count the entries we would keep: prices from $1 to $350,000 (inclusive)
autos[autos['price'].between(1, 350000)].shape

(48565, 17)

Keeping entries with prices between \$1 and \$350,000, we retain 48,565 records. This is 97.13% of the data.
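The retained share can be computed from the boolean mask itself, which avoids typing the row counts by hand. The sketch below uses a handful of made-up prices to show that `Series.between` includes both endpoints by default, then reproduces the percentage from the counts reported above.

```python
import pandas as pd

# Made-up prices covering both boundaries and both kinds of outliers.
prices = pd.Series([0, 1, 500, 350000, 999990, 99999999])

# between() is inclusive on both ends by default, so 1 and 350000 are kept.
in_range = prices.between(1, 350000)

# The mean of a boolean mask is the fraction of rows that pass the filter.
kept_share = in_range.mean()
print(in_range.tolist(), kept_share)

# Same calculation using the counts from the notebook output.
kept_pct = round(48565 / 50000 * 100, 2)
print(kept_pct)
```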

In [33]:
autos = autos[autos['price'].between(1, 350000)]

In [34]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48565 entries, 0 to 49999
Data columns (total 17 columns):
date_crawled          48565 non-null object
name                  48565 non-null object
price                 48565 non-null int64
abtest                48565 non-null object
vehicle_type          43979 non-null object
registration_year     48565 non-null int64
gearbox               46222 non-null object
power_ps              48565 non-null int64
model                 46107 non-null object
odometer_km           48565 non-null int64
registration_month    48565 non-null int64
fuel_type             44535 non-null object
brand                 48565 non-null object
unrepaired_damage     39464 non-null object
ad_created            48565 non-null object
postal_code           48565 non-null int64
last_seen             48565 non-null object
dtypes: int64(6), object(11)
memory usage: 6.7+ MB


### Explore odometer_km

In [35]:
autos['odometer_km'].unique()

array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000, 100000,  40000])

In [36]:
autos['odometer_km'].describe()

count     48565.000000
mean     125770.101925
std       39788.636804
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [37]:
autos['odometer_km'].value_counts().sort_index(ascending=True)

5000        836
10000       253
20000       762
30000       780
40000       815
50000      1012
60000      1155
70000      1217
80000      1415
90000      1734
100000     2115
125000     5057
150000    31414
Name: odometer_km, dtype: int64

In [38]:
# % of cars with odometer reading of 150,000 km
(31414/48565) * 100

64.68444352929063

The odometer readings range from 5,000 km to 150,000 km, with a majority (64.68%) at the maximum. The values are all round numbers, which suggests that sellers picked from a set of preset options rather than entering an exact reading. There are a lot of high-mileage vehicles.
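The percentage above was computed from hard-coded counts. As a sketch of an alternative, the shares for every odometer bucket can be derived in one step; here the counts are copied from the `value_counts()` output above, whereas on the real data `autos['odometer_km'].value_counts(normalize=True)` would give the same shares directly.

```python
import pandas as pd

# Counts per odometer bucket, copied from the value_counts() output above.
counts = pd.Series({
    5000: 836, 10000: 253, 20000: 762, 30000: 780, 40000: 815,
    50000: 1012, 60000: 1155, 70000: 1217, 80000: 1415, 90000: 1734,
    100000: 2115, 125000: 5057, 150000: 31414,
})

# Dividing by the total turns counts into shares of all listings.
share = counts / counts.sum()

max_bucket_pct = round(share.loc[150000] * 100, 2)
print(counts.sum(), max_bucket_pct)
```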

---

## Explore Date Columns

There are several date columns in the data:
- date_crawled
- ad_created
- registration_month
- registration_year
- last_seen

Some of these came directly from the website and others were created by the web crawler.

In [39]:
dates = ['date_crawled', 'ad_created', 'last_seen', 'registration_month', 'registration_year']
autos[dates].describe(include='all')

Unnamed: 0,date_crawled,ad_created,last_seen,registration_month,registration_year
count,48565,48565,48565,48565.0,48565.0
unique,46882,76,38474,,
top,2016-03-05 16:57:05,2016-04-03 00:00:00,2016-04-07 06:17:27,,
freq,3,1887,8,,
mean,,,,5.782251,2004.755421
std,,,,3.685595,88.643887
min,,,,0.0,1000.0
25%,,,,3.0,1999.0
50%,,,,6.0,2004.0
75%,,,,9.0,2008.0


In [40]:
autos[dates].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48565 entries, 0 to 49999
Data columns (total 5 columns):
date_crawled          48565 non-null object
ad_created            48565 non-null object
last_seen             48565 non-null object
registration_month    48565 non-null int64
registration_year     48565 non-null int64
dtypes: int64(2), object(3)
memory usage: 2.2+ MB


**Observation:**
registration_month and registration_year are numeric data types. registration_month has values ranging from 0 to 12: 1 to 12 represent the months of the year, and 0 probably means that no registration month was recorded. registration_year ranges from 1000 to 9999. Some of these values do not look valid.

date_crawled, ad_created and last_seen are timestamps stored as strings. To understand them quantitatively, we will extract the date part of each timestamp and look at the distribution of the values for each column.
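The cells that follow extract the date by slicing the first 10 characters of each string. An alternative, sketched here under the assumption that every value follows the `YYYY-MM-DD HH:MM:SS` layout seen in the sample rows, is to parse the strings with `pd.to_datetime`, which also gives access to date components such as the month.

```python
import pandas as pd

# Three timestamps taken from the date_crawled column shown earlier.
raw = pd.Series(['2016-03-26 17:47:46', '2016-04-04 13:38:56', '2016-03-12 16:58:10'])

# Parse with an explicit format so malformed strings raise instead of being guessed.
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S')

# Day-level strings, equivalent to raw.str[:10] for well-formed input.
by_day = parsed.dt.strftime('%Y-%m-%d')

# Relative frequency per day, in chronological order.
day_share = by_day.value_counts(normalize=True).sort_index()
print(day_share)
```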

In [41]:
autos[dates].head()

Unnamed: 0,date_crawled,ad_created,last_seen,registration_month,registration_year
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54,3,2004
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08,6,1997
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37,7,2009
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28,6,2007
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50,7,2003


**date_crawled**

In [42]:
(autos['date_crawled']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_index(ascending=True)
     .count()
)

34

There are 34 dates where the web crawler captured data.

In [43]:
(autos['date_crawled']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_index(ascending=True)
)

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

In [44]:
(autos['date_crawled']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_values()
)

2016-04-07    0.001400
2016-04-06    0.003171
2016-03-18    0.012911
2016-04-05    0.013096
2016-03-06    0.014043
2016-03-13    0.015670
2016-03-05    0.025327
2016-03-24    0.029342
2016-03-16    0.029610
2016-03-27    0.031092
2016-03-25    0.031607
2016-03-17    0.031628
2016-03-31    0.031834
2016-03-10    0.032184
2016-03-26    0.032204
2016-03-23    0.032225
2016-03-11    0.032575
2016-03-22    0.032987
2016-03-09    0.033090
2016-03-08    0.033296
2016-03-30    0.033687
2016-04-01    0.033687
2016-03-29    0.034099
2016-03-15    0.034284
2016-03-19    0.034778
2016-03-28    0.034860
2016-04-02    0.035478
2016-03-07    0.036014
2016-04-04    0.036487
2016-03-14    0.036549
2016-03-12    0.036920
2016-03-21    0.037373
2016-03-20    0.037887
2016-04-03    0.038608
Name: date_crawled, dtype: float64

The web crawler captured data daily for just over a month, from March 5th 2016 to April 7th 2016. The distribution is fairly uniform, apart from a few lighter days and a tail-off over the final three days.

**ad_created**

In [45]:
(autos['ad_created']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_index(ascending=True)
     .count()
)

76

In [46]:
(autos['ad_created']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_index(ascending=True)
)

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
                ...   
2016-03-09    0.033151
2016-03-10    0.031895
2016-03-11    0.032904
2016-03-12    0.036755
2016-03-13    0.017008
2016-03-14    0.035190
2016-03-15    0.034016
2016-03-16    0.030125
2016-03-17    0.031278
2016-03-18    0.013590
2016-03-19    0.033687
2016-03-20    0.037949
2016-03-21 

In [47]:
(autos['ad_created']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_values()
)

2016-02-01    0.000021
2015-12-05    0.000021
2015-11-10    0.000021
2016-01-16    0.000021
2016-02-16    0.000021
2016-01-03    0.000021
2016-01-14    0.000021
2016-02-22    0.000021
2016-02-07    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-17    0.000021
2015-12-30    0.000021
2016-01-07    0.000021
2016-01-29    0.000021
2016-01-22    0.000021
2016-02-08    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-06-11    0.000021
2016-01-13    0.000021
2016-02-05    0.000041
2016-02-24    0.000041
2016-02-20    0.000041
2016-02-26    0.000041
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-02    0.000041
2016-02-18    0.000041
2016-01-10    0.000041
                ...   
2016-03-06    0.015320
2016-03-13    0.017008
2016-03-05    0.022897
2016-03-24    0.029280
2016-03-16    0.030125
2016-03-27    0.030989
2016-03-17    0.031278
2016-03-25    0.031751
2016-03-31    0.031875
2016-03-10    0.031895
2016-03-23    0.032060
2016-03-26    0.032266
2016-03-22 

There are 76 distinct dates on which ads were created, spanning from June 11th 2015 to April 7th 2016. The distribution is heavily skewed toward the crawl period: in the output shown, each date before March 2016 accounts for less than 0.01% of the listings, while dates in March 2016 each account for roughly 1% to 4%.

**last_seen**

In [48]:
(autos['last_seen']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_index(ascending=True)
     .count()
)

34

In [49]:
(autos['last_seen']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_index(ascending=True)
)

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64

In [50]:
(autos['last_seen']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_values()
)

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-18    0.007351
2016-03-08    0.007413
2016-03-13    0.008895
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-14    0.012602
2016-03-27    0.015649
2016-03-19    0.015834
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-26    0.016802
2016-03-23    0.018532
2016-03-25    0.019211
2016-03-24    0.019767
2016-03-21    0.020632
2016-03-20    0.020653
2016-03-28    0.020859
2016-03-22    0.021373
2016-03-29    0.022341
2016-04-01    0.022794
2016-03-12    0.023783
2016-03-31    0.023783
2016-04-04    0.024483
2016-03-30    0.024771
2016-04-02    0.024915
2016-04-03    0.025203
2016-03-17    0.028086
2016-04-05    0.124761
2016-04-07    0.131947
2016-04-06    0.221806
Name: last_seen, dtype: float64

The last_seen date represents when the crawler last saw the ad online. Ideally, this would be the day the car was sold and the seller removed the listing. The distribution is fairly consistent across the days until the last 3 days, where it jumps to roughly 6 to 12 times the typical daily share of the preceding days. Unless there was a massive surge in sales during those 3 days, this spike is more likely related to the final days of the web crawler's activity than to actual sales.
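To put a number on the jump in the final three days, the shares can be compared against the median share of the earlier days. This is a sketch using values copied from the output above; choosing the median as the "typical" day is an assumption made for illustration.

```python
import pandas as pd

# Daily last_seen shares for 2016-03-05 through 2016-04-04, copied from the output above.
earlier = pd.Series([
    0.001071, 0.004324, 0.005395, 0.007413, 0.009595, 0.010666, 0.012375,
    0.023783, 0.008895, 0.012602, 0.015876, 0.016452, 0.028086, 0.007351,
    0.015834, 0.020653, 0.020632, 0.021373, 0.018532, 0.019767, 0.019211,
    0.016802, 0.015649, 0.020859, 0.022341, 0.024771, 0.023783, 0.022794,
    0.024915, 0.025203, 0.024483,
])

# Shares for the final three crawl days.
final_days = pd.Series({
    '2016-04-05': 0.124761,
    '2016-04-06': 0.221806,
    '2016-04-07': 0.131947,
})

# Use the median of the earlier days as the "typical" daily share (an assumption).
typical = earlier.median()
ratio = final_days / typical
final_pct = round(final_days.sum() * 100, 2)

print(ratio.round(1))
print(final_pct)
```

On these figures, the final three days together account for just under 48% of all last_seen values.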

**registration_year**

In [51]:
autos['registration_year'].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

In [52]:
(autos['registration_year']
     .value_counts()
     .sort_index(ascending=False)
)

9999       3
9000       1
8888       1
6200       1
5911       1
5000       4
4800       1
4500       1
4100       1
2800       1
2019       2
2018     470
2017    1392
2016    1220
2015     392
2014     663
2013     803
2012    1310
2011    1623
2010    1589
2009    2085
2008    2215
2007    2277
2006    2670
2005    2936
2004    2703
2003    2699
2002    2486
2001    2636
2000    3156
        ... 
1964      12
1963       8
1962       4
1961       6
1960      23
1959       6
1958       4
1957       2
1956       4
1955       2
1954       2
1953       1
1952       1
1951       2
1950       3
1948       1
1943       1
1941       2
1939       1
1938       1
1937       4
1934       2
1931       1
1929       1
1927       1
1910       5
1800       2
1111       1
1001       1
1000       1
Name: registration_year, Length: 95, dtype: int64

Registration year is the year that the car was first registered. This would be close to, if not the same as, the year the car was built.

The minimum (1000) and maximum (9999) years are odd. There were no cars in the year 1000, and 9999 is far in the future. If the web crawler captured ads during March and April of 2016, we would expect the registration years to range from 19XX to 2016.

---

## Cleaning up registration years

We observed a few odd values in the registration_year column. We need to dive deeper to clean up the column.

A car can't be first registered after its listing was seen, so any car with a registration year above 2016 is inaccurate. We also need to look at the earlier dates to see what we can remove.

In [53]:
years = [2016, 2017, 2018, 2019, 2020]
for year in years:
    num = autos[autos['registration_year'] > year].shape[0]
    print('There are {0} vehicles with registration years higher than {1}'.format(num, year))

There are 1879 vehicles with registration years higher than 2016
There are 487 vehicles with registration years higher than 2017
There are 17 vehicles with registration years higher than 2018
There are 15 vehicles with registration years higher than 2019
There are 15 vehicles with registration years higher than 2020


In [54]:
years = [2016, 2017, 2018, 2019, 2020]
for year in years:
    num = autos[autos['registration_year'] == year].shape[0]
    print('There are {0} vehicles with registration years in {1}'.format(num, year))

There are 1220 vehicles with registration years in 2016
There are 1392 vehicles with registration years in 2017
There are 470 vehicles with registration years in 2018
There are 2 vehicles with registration years in 2019
There are 0 vehicles with registration years in 2020


There are almost 1,900 cars with registration years later than 2016, the year in which the web crawler captured this data. Nearly all of them are dated 2017 (1,392) or 2018 (470), with 2 more in 2019 and 15 in implausible years beyond 2020. Regardless, these will be removed.

In [55]:
years = [1900, 1950, 2000, 2010, 2020]
for year in years:
    num = autos[autos['registration_year'] < year].shape[0]
    print('There are {0} vehicles with registration years lower than {1}'.format(num, year))

There are 5 vehicles with registration years lower than 1900
There are 25 vehicles with registration years lower than 1950
There are 13223 vehicles with registration years lower than 2000
There are 39086 vehicles with registration years lower than 2010
There are 48550 vehicles with registration years lower than 2020


In [56]:
(autos['registration_year']
     .value_counts()
     .sort_index(ascending=True)
     .head(10)
)

1000    1
1001    1
1111    1
1800    2
1910    5
1927    1
1929    1
1931    1
1934    2
1937    4
Name: registration_year, dtype: int64

There are a handful of cars with registration years lower than 1900. This seems very unlikely. Looking at those years further (1800, 1111, 1001, 1000), cars did not even exist during this time! These should be removed.

In [57]:
autos.shape[0]

48565

In [58]:
autos[autos['registration_year'].between(1900, 2016)].shape[0]

46681

In [59]:
(46681/48565) * 100

96.1206630289303

In [60]:
(46681/50000) * 100

93.362

We will retain 96% of the current dataset by removing cars with registration years earlier than 1900 or later than 2016. From the original dataset of 50,000, we will retain ~93%.
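Before dropping rows, it can help to look at exactly which values fall outside the accepted range. The following is a minimal sketch on a made-up dataframe (`toy_autos` and its values are hypothetical, not the real data); it relies on `Series.between` being inclusive on both ends by default.

```python
import pandas as pd

# Hypothetical stand-in for the autos dataframe; the years are made up.
toy_autos = pd.DataFrame(
    {'registration_year': [1000, 1910, 1999, 2005, 2016, 2017, 9999]}
)

# between() is inclusive on both ends, so 1900 and 2016 are both kept.
in_range = toy_autos['registration_year'].between(1900, 2016)

# The rows that would be removed, for a quick sanity check.
removed_years = toy_autos.loc[~in_range, 'registration_year']
print(removed_years.tolist())

# The mean of a boolean mask is the share of True values.
retained_percent = in_range.mean() * 100
print("{:.2f}% of rows would be retained".format(retained_percent))
```

Taking the mean of the boolean mask gives the retained share directly, without hard-coding the row counts.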

In [61]:
autos = autos[autos['registration_year'].between(1900, 2016)]

In [62]:
autos.shape[0]

46681

Let's look at the distribution of the remaining registration years:

In [63]:
(autos['registration_year']
     .value_counts(normalize=True)
     .sort_values()
)

1952    0.000021
1953    0.000021
1943    0.000021
1929    0.000021
1931    0.000021
1938    0.000021
1948    0.000021
1927    0.000021
1939    0.000021
1955    0.000043
1957    0.000043
1934    0.000043
1951    0.000043
1941    0.000043
1954    0.000043
1950    0.000064
1962    0.000086
1937    0.000086
1958    0.000086
1956    0.000086
1910    0.000107
1959    0.000129
1961    0.000129
1963    0.000171
1964    0.000257
1965    0.000364
1975    0.000386
1969    0.000407
1976    0.000450
1977    0.000471
          ...   
1985    0.002035
1988    0.002892
1989    0.003727
1991    0.007262
1990    0.007433
1992    0.007926
2015    0.008397
1993    0.009104
1994    0.013474
2014    0.014203
2013    0.017202
2016    0.026135
1995    0.026285
2012    0.028063
1996    0.029412
2010    0.034040
2011    0.034768
1997    0.041794
2009    0.044665
2008    0.047450
2007    0.048778
1998    0.050620
2002    0.053255
2001    0.056468
2006    0.057197
2003    0.057818
2004    0.057904
1999    0.062060

In [64]:
(autos['registration_year']
     .value_counts(normalize=True)
     .sort_index(ascending=True)
)

1910    0.000107
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000043
1937    0.000086
1938    0.000021
1939    0.000021
1941    0.000043
1943    0.000021
1948    0.000021
1950    0.000064
1951    0.000043
1952    0.000021
1953    0.000021
1954    0.000043
1955    0.000043
1956    0.000086
1957    0.000043
1958    0.000086
1959    0.000129
1960    0.000493
1961    0.000129
1962    0.000086
1963    0.000171
1964    0.000257
1965    0.000364
1966    0.000471
1967    0.000557
1968    0.000557
          ...   
1987    0.001542
1988    0.002892
1989    0.003727
1990    0.007433
1991    0.007262
1992    0.007926
1993    0.009104
1994    0.013474
1995    0.026285
1996    0.029412
1997    0.041794
1998    0.050620
1999    0.062060
2000    0.067608
2001    0.056468
2002    0.053255
2003    0.057818
2004    0.057904
2005    0.062895
2006    0.057197
2007    0.048778
2008    0.047450
2009    0.044665
2010    0.034040
2011    0.034768
2012    0.028063
2013    0.017202
2014    0.014203

Looking through the distribution, we can see that most cars were registered from the 1990s onwards. The cars registered in the earlier years (1910 - 1970) make up a small percentage. We can look at the actual number of registered cars in groups of year periods:

In [65]:
years = [(1900, 1950), (1950, 1960), (1960, 1970), (1970, 1980),
         (1980, 1990), (1990, 2000), (2000, 2010), (2010, 2017)]

# each period includes its start year and excludes its end year,
# so the last period ends at 2017 in order to count cars registered in 2016
for start, end in years:
    in_period = (autos['registration_year'] >= start) & (autos['registration_year'] < end)
    num = autos[in_period].shape[0]
    print("There were {} registered cars between {} and {}".format(num, start, end))

There were 20 registered cars between 1900 and 1950
There were 27 registered cars between 1950 and 1960
There were 163 registered cars between 1960 and 1970
There were 283 registered cars between 1970 and 1980
There were 804 registered cars between 1980 and 1990
There were 11921 registered cars between 1990 and 2000
There were 25863 registered cars between 2000 and 2010
There were 7600 registered cars between 2010 and 2017
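
The same kind of period count can also be sketched with `pd.cut`, which assigns each year to a bin in one pass. This is only an illustration on made-up years (`toy_years` is hypothetical); `right=False` makes every bin include its start year and exclude its end year, and the last edge is set to 2017 here so that 2016 falls inside the final bin.

```python
import pandas as pd

# Hypothetical stand-in for autos['registration_year']; not the real data.
toy_years = pd.Series([1937, 1955, 1968, 1985, 1990, 1999, 2000, 2009, 2010, 2016])

# Bin edges for half-open periods [start, end).
edges = [1900, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2017]

# value_counts on the binned (categorical) result also reports empty bins;
# sort_index puts the periods back into chronological order.
period_counts = (pd.cut(toy_years, bins=edges, right=False)
                   .value_counts()
                   .sort_index()
                )
print(period_counts)
```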


In [66]:
before_90 = autos[autos['registration_year'] < 1990]
before_90_num = before_90.shape[0]
before_90_percent = (before_90_num/autos.shape[0]) * 100
print("{:.2f}% of cars were registered before 1990".format(before_90_percent))

# 1990 or later
after_90 = autos[autos['registration_year'] >= 1990]
after_90_num = after_90.shape[0]
after_90_percent = (after_90_num/autos.shape[0]) * 100
print("{:.2f}% of cars were registered in 1990 or later".format(after_90_percent))

2.78% of cars were registered before 1990
97.22% of cars were registered in 1990 or later


**Observations**

We can now see more clearly that most of the cars were registered within roughly the past 25 years. Less than 3% of cars were registered before 1990, while over 97% were registered in 1990 or later. Looking at the distribution, every year from 1994 onwards accounts for at least 1% of the cars, with the exception of 2015.

---

## Exploring Price By Brand

We will explore variations across different car brands by using aggregation.

In [67]:
# confirm brand column has no missing data
autos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46681 entries, 0 to 49999
Data columns (total 17 columns):
date_crawled          46681 non-null object
name                  46681 non-null object
price                 46681 non-null int64
abtest                46681 non-null object
vehicle_type          43977 non-null object
registration_year     46681 non-null int64
gearbox               44571 non-null object
power_ps              46681 non-null int64
model                 44488 non-null object
odometer_km           46681 non-null int64
registration_month    46681 non-null int64
fuel_type             43363 non-null object
brand                 46681 non-null object
unrepaired_damage     38374 non-null object
ad_created            46681 non-null object
postal_code           46681 non-null int64
last_seen             46681 non-null object
dtypes: int64(6), object(11)
memory usage: 6.4+ MB


We can confirm that the 'brand' column has no missing data.

In [68]:
autos['brand'].value_counts()

volkswagen        9862
bmw               5137
opel              5022
mercedes_benz     4503
audi              4041
ford              3263
renault           2201
peugeot           1393
fiat              1197
seat               853
skoda              766
nissan             713
mazda              709
smart              661
citroen            654
toyota             593
hyundai            468
sonstige_autos     458
volvo              427
mini               409
mitsubishi         384
honda              366
kia                330
alfa_romeo         310
porsche            286
suzuki             277
chevrolet          266
chrysler           164
dacia              123
daihatsu           117
jeep               106
subaru             100
land_rover          98
saab                77
jaguar              73
daewoo              70
trabant             65
rover               62
lancia              50
lada                27
Name: brand, dtype: int64

In [69]:
# get the distribution of brands in descending order
(autos['brand']
         .value_counts(normalize=True)
         .sort_values(ascending=False)
)

volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
mitsubishi        0.008226
honda             0.007840
kia               0.007069
alfa_romeo        0.006641
porsche           0.006127
suzuki            0.005934
chevrolet         0.005698
chrysler          0.003513
dacia             0.002635
daihatsu          0.002506
jeep              0.002271
subaru            0.002142
land_rover        0.002099
saab              0.001649
jaguar            0.001564
daewoo            0.001500
trabant           0.001392
rover             0.001328

In [70]:
# capture top 20 brands
top_20_brands = (autos['brand']
                     .value_counts()
                     .sort_values(ascending=False)
                     .index[:20]
                )

In [71]:
top_20_brands

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'nissan', 'mazda', 'smart',
       'citroen', 'toyota', 'hyundai', 'sonstige_autos', 'volvo', 'mini'],
      dtype='object')

In [72]:
# combined share of the top 5 brands
autos['brand'].value_counts(normalize=True).sort_values(ascending=False)[:5].sum()

0.6119191962468671

**Observations**

- The top 4 brands account for just over 50% of all listings (about 52.5%).
- 23 brands account for less than 1% each, with 13 of these accounting for less than 0.5% each.
- The top 5 brands are all German car manufacturers, accounting for over 60% of all cars. The 6th is American (Ford), followed by a French brand (Renault).
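
Statements such as "the top N brands account for X%" can be read off a cumulative sum of the normalized counts. The following is a minimal sketch on a made-up brand column (`toy_brands` is hypothetical); it assumes the default behaviour of `value_counts`, which sorts in descending order.

```python
import pandas as pd

# Hypothetical brand column standing in for autos['brand'].
toy_brands = pd.Series(
    ['volkswagen'] * 4 + ['bmw'] * 3 + ['opel'] * 2 + ['ford'] * 1
)

# value_counts sorts in descending order, so cumsum gives the share of
# listings covered by the top 1, top 2, ... brands.
brand_share = toy_brands.value_counts(normalize=True)
cumulative_share = brand_share.cumsum()
print(cumulative_share)

# Number of brands needed to cover at least half of the listings:
# count the brands whose running total is still below 0.5, then add one.
brands_for_half = int((cumulative_share < 0.5).sum()) + 1
print(brands_for_half)
```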

Let's calculate the average brand price:

In [73]:
# calculate the average price for each of the top 20 brands
average_brand_price = {}

for brand in top_20_brands:
    # select the listings for this brand and average their prices
    brand_autos = autos[autos['brand'] == brand]
    average_brand_price[brand] = brand_autos['price'].mean()

print('Average Brand Price:')
average_brand_price

Average Brand Price:


{'audi': 9336.687453600594,
 'bmw': 8332.820517811953,
 'citroen': 3779.1391437308866,
 'fiat': 2813.748538011696,
 'ford': 3749.4695065890287,
 'hyundai': 5365.254273504273,
 'mazda': 4112.596614950635,
 'mercedes_benz': 8628.450366422385,
 'mini': 10613.459657701711,
 'nissan': 4743.40252454418,
 'opel': 2975.2419354838707,
 'peugeot': 3094.0172290021537,
 'renault': 2474.8646069968195,
 'seat': 4397.230949589683,
 'skoda': 6368.0,
 'smart': 3580.2239031770046,
 'sonstige_autos': 12338.550218340612,
 'toyota': 5167.091062394604,
 'volkswagen': 5402.410261610221,
 'volvo': 4946.501170960188}

In [74]:
sorted(average_brand_price.items(), key=lambda x: x[1])

[('renault', 2474.8646069968195),
 ('fiat', 2813.748538011696),
 ('opel', 2975.2419354838707),
 ('peugeot', 3094.0172290021537),
 ('smart', 3580.2239031770046),
 ('ford', 3749.4695065890287),
 ('citroen', 3779.1391437308866),
 ('mazda', 4112.596614950635),
 ('seat', 4397.230949589683),
 ('nissan', 4743.40252454418),
 ('volvo', 4946.501170960188),
 ('toyota', 5167.091062394604),
 ('hyundai', 5365.254273504273),
 ('volkswagen', 5402.410261610221),
 ('skoda', 6368.0),
 ('bmw', 8332.820517811953),
 ('mercedes_benz', 8628.450366422385),
 ('audi', 9336.687453600594),
 ('mini', 10613.459657701711),
 ('sonstige_autos', 12338.550218340612)]
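
As an alternative to the explicit loop, the per-brand average can be sketched with `groupby`. This is an illustration on a made-up dataframe (`toy_autos` and its prices are hypothetical), keeping only the most common brands before aggregating.

```python
import pandas as pd

# Hypothetical miniature of the autos dataframe (values are made up).
toy_autos = pd.DataFrame({
    'brand': ['audi', 'audi', 'audi', 'opel', 'opel', 'ford', 'ford', 'lada'],
    'price': [9000, 11000, 10000, 2500, 3500, 4000, 5000, 1500],
})

# Keep only the most common brands (top 3 here; the notebook uses the top 20).
top_brands = toy_autos['brand'].value_counts().index[:3]

# Group the remaining rows by brand and average the price column.
mean_price_by_brand = (toy_autos[toy_autos['brand'].isin(top_brands)]
                           .groupby('brand')['price']
                           .mean()
                           .sort_values()
                      )
print(mean_price_by_brand)
```

The result is a Series indexed by brand, so it can be sorted directly instead of sorting dictionary items with a lambda.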

**Observations**

Looking at the top 6 brands, BMW, Audi and Mercedes Benz were amongst the most expensive. Opel and Ford were amongst the cheaper brands, with Opel being the third cheapest overall. Volkswagen was in the middle of the pack.

The most expensive entries were sonstige_autos (German for "other cars", so most likely a catch-all category rather than a single brand) and Mini, both with average prices over $10,000. For Mini, this may be because it is more of a niche brand: it has few models compared with Volkswagen, which sells many different models across various price ranges. The same applies to the other top brands, as they are much larger car companies.

---

## Use aggregation to understand average mileage of top brands

Create brand_mean_price and brand_mean_mileage

In [75]:
# capture top 20 brands
top_20_brands = (autos['brand']
                     .value_counts()
                     .sort_values(ascending=False)
                     .index[:20]
                )

In [83]:
# calculate mean mileage and mean price for top 20 brands
brand_mean_price = {}
brand_mean_mileage = {}

for brand in top_20_brands:
    # select the listings for this brand
    brand_autos = autos[autos['brand'] == brand]
    # store the brand's mean price and mean mileage
    brand_mean_price[brand] = brand_autos['price'].mean()
    brand_mean_mileage[brand] = brand_autos['odometer_km'].mean()

In [84]:
brand_mean_price

{'audi': 9336.687453600594,
 'bmw': 8332.820517811953,
 'citroen': 3779.1391437308866,
 'fiat': 2813.748538011696,
 'ford': 3749.4695065890287,
 'hyundai': 5365.254273504273,
 'mazda': 4112.596614950635,
 'mercedes_benz': 8628.450366422385,
 'mini': 10613.459657701711,
 'nissan': 4743.40252454418,
 'opel': 2975.2419354838707,
 'peugeot': 3094.0172290021537,
 'renault': 2474.8646069968195,
 'seat': 4397.230949589683,
 'skoda': 6368.0,
 'smart': 3580.2239031770046,
 'sonstige_autos': 12338.550218340612,
 'toyota': 5167.091062394604,
 'volkswagen': 5402.410261610221,
 'volvo': 4946.501170960188}

In [80]:
autos.columns

Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type',
       'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'postal_code', 'last_seen'],
      dtype='object')

In [85]:
brand_mean_mileage

{'audi': 129157.38678544914,
 'bmw': 132572.51313996495,
 'citroen': 119694.18960244648,
 'fiat': 117121.9715956558,
 'ford': 124266.01287159056,
 'hyundai': 106442.30769230769,
 'mazda': 124464.03385049365,
 'mercedes_benz': 130788.36331334666,
 'mini': 88105.13447432763,
 'nissan': 118330.99579242637,
 'opel': 129310.0358422939,
 'peugeot': 127153.62526920316,
 'renault': 128071.33121308497,
 'seat': 121131.30128956624,
 'skoda': 110848.5639686684,
 'smart': 99326.77760968229,
 'sonstige_autos': 89956.33187772926,
 'toyota': 115944.35075885328,
 'volkswagen': 128707.15879132022,
 'volvo': 138067.9156908665}

Convert both dictionaries to series objects via the Series constructor

In [86]:
bmp_series = pd.Series(brand_mean_price)
bmp_series

audi               9336.687454
bmw                8332.820518
citroen            3779.139144
fiat               2813.748538
ford               3749.469507
hyundai            5365.254274
mazda              4112.596615
mercedes_benz      8628.450366
mini              10613.459658
nissan             4743.402525
opel               2975.241935
peugeot            3094.017229
renault            2474.864607
seat               4397.230950
skoda              6368.000000
smart              3580.223903
sonstige_autos    12338.550218
toyota             5167.091062
volkswagen         5402.410262
volvo              4946.501171
dtype: float64

In [87]:
bmm_series = pd.Series(brand_mean_mileage)
bmm_series

audi              129157.386785
bmw               132572.513140
citroen           119694.189602
fiat              117121.971596
ford              124266.012872
hyundai           106442.307692
mazda             124464.033850
mercedes_benz     130788.363313
mini               88105.134474
nissan            118330.995792
opel              129310.035842
peugeot           127153.625269
renault           128071.331213
seat              121131.301290
skoda             110848.563969
smart              99326.777610
sonstige_autos     89956.331878
toyota            115944.350759
volkswagen        128707.158791
volvo             138067.915691
dtype: float64

Create a dataframe from the mean price series (bmp_series) via the DataFrame constructor

In [107]:
# convert from float to int
brands = pd.DataFrame(bmp_series.astype(int), columns=['mean_price'])
brands

                mean_price
audi                  9336
bmw                   8332
citroen               3779
fiat                  2813
ford                  3749
hyundai               5365
mazda                 4112
mercedes_benz         8628
mini                 10613
nissan                4743


Assign the other series as a new column in this dataframe

In [108]:
# convert from float to int
brands['mean_mileage_km'] = bmm_series.astype(int)
brands

                mean_price  mean_mileage_km
audi                  9336           129157
bmw                   8332           132572
citroen               3779           119694
fiat                  2813           117121
ford                  3749           124266
hyundai               5365           106442
mazda                 4112           124464
mercedes_benz         8628           130788
mini                 10613            88105
nissan                4743           118330
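
The dictionary, Series and column-assignment steps above can also be collapsed into a single `groupby` with named aggregation. This is a minimal sketch on a made-up dataframe (`toy_autos` and its values are hypothetical), not the approach the notebook itself takes.

```python
import pandas as pd

# Hypothetical miniature of the autos dataframe (values are made up).
toy_autos = pd.DataFrame({
    'brand': ['audi', 'audi', 'opel', 'opel'],
    'price': [9000, 11000, 2500, 3500],
    'odometer_km': [125000, 150000, 100000, 150000],
})

# Named aggregation: output_column=(input_column, aggregation).
# astype(int) truncates the float means, as in the cells above.
toy_brands = (toy_autos
                  .groupby('brand')
                  .agg(mean_price=('price', 'mean'),
                       mean_mileage_km=('odometer_km', 'mean'))
                  .astype(int)
             )
print(toy_brands)
```

Because both columns come from the same grouped object, they share one index and no alignment step is needed.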


Analyze dataframe

In [109]:
brands

                mean_price  mean_mileage_km
audi                  9336           129157
bmw                   8332           132572
citroen               3779           119694
fiat                  2813           117121
ford                  3749           124266
hyundai               5365           106442
mazda                 4112           124464
mercedes_benz         8628           130788
mini                 10613            88105
nissan                4743           118330


In [102]:
# brands with a mean mileage below 100,000 km
brands[brands['mean_mileage_km'] < 100000]

                mean_price  mean_mileage_km
mini                 10613            88105
smart                 3580            99326
sonstige_autos       12338            89956


In [110]:
# sort by mean mileage
brands.sort_values(by='mean_mileage_km')

                mean_price  mean_mileage_km
mini                 10613            88105
sonstige_autos       12338            89956
smart                 3580            99326
hyundai               5365           106442
skoda                 6368           110848
toyota                5167           115944
fiat                  2813           117121
nissan                4743           118330
citroen               3779           119694
seat                  4397           121131


In [111]:
# sort by mean price
brands.sort_values(by='mean_price')

                mean_price  mean_mileage_km
renault               2474           128071
fiat                  2813           117121
opel                  2975           129310
peugeot               3094           127153
smart                 3580            99326
ford                  3749           124266
citroen               3779           119694
mazda                 4112           124464
seat                  4397           121131
nissan                4743           118330


Limiting our focus to the top 6 brands

In [116]:
top_6 = ['mercedes_benz', 'audi', 'bmw', 'volkswagen', 'ford', 'opel']
top_6_brands = brands.loc[top_6]
top_6_brands

                mean_price  mean_mileage_km
mercedes_benz         8628           130788
audi                  9336           129157
bmw                   8332           132572
volkswagen            5402           128707
ford                  3749           124266
opel                  2975           129310


In [117]:
# sort by price
top_6_brands.sort_values(by='mean_price')

                mean_price  mean_mileage_km
opel                  2975           129310
ford                  3749           124266
volkswagen            5402           128707
bmw                   8332           132572
mercedes_benz         8628           130788
audi                  9336           129157


In [118]:
# sort by mileage
top_6_brands.sort_values(by='mean_mileage_km')

                mean_price  mean_mileage_km
ford                  3749           124266
volkswagen            5402           128707
audi                  9336           129157
opel                  2975           129310
mercedes_benz         8628           130788
bmw                   8332           132572


**Observations:**

There does not seem to be a direct relationship between mean price and mean mileage when looking at all the brands. When focusing on the most expensive of the top 6 brands (Mercedes Benz, BMW, Audi), we observe a slight inverse relationship: as price goes up, mileage goes down. But this is a very select subset of the larger brands data. There is a similar relationship between Ford and Opel. Opel has ~5,000 km more in mileage and is ~$800 cheaper than Ford. However, Volkswagen breaks this relationship, as it has more mileage than Ford yet is more expensive than either Ford or Opel.
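
One way to put a number on this impression is a correlation coefficient between the two columns. The sketch below rebuilds the 20-brand table from the truncated means shown in the dictionaries earlier in this notebook (the name `brand_table` is ours) and computes the Pearson correlation with `Series.corr`. With only 20 aggregated points, the result is a rough indication rather than a firm conclusion.

```python
import pandas as pd

# (mean_price, mean_mileage_km) per brand, truncated to integers,
# taken from the brand_mean_price and brand_mean_mileage outputs above.
brand_values = {
    'audi': (9336, 129157), 'bmw': (8332, 132572),
    'citroen': (3779, 119694), 'fiat': (2813, 117121),
    'ford': (3749, 124266), 'hyundai': (5365, 106442),
    'mazda': (4112, 124464), 'mercedes_benz': (8628, 130788),
    'mini': (10613, 88105), 'nissan': (4743, 118330),
    'opel': (2975, 129310), 'peugeot': (3094, 127153),
    'renault': (2474, 128071), 'seat': (4397, 121131),
    'skoda': (6368, 110848), 'smart': (3580, 99326),
    'sonstige_autos': (12338, 89956), 'toyota': (5167, 115944),
    'volkswagen': (5402, 128707), 'volvo': (4946, 138067),
}

brand_table = pd.DataFrame.from_dict(
    brand_values, orient='index', columns=['mean_price', 'mean_mileage_km']
)

# Pearson correlation between mean price and mean mileage across brands.
price_mileage_corr = brand_table['mean_price'].corr(brand_table['mean_mileage_km'])
print("Correlation: {:.2f}".format(price_mileage_corr))
```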


All of the top 6 brands were in the top 10 in terms of mean mileage, with average mileages of over 124,000 km each. What is interesting is that Audi, Mercedes Benz and BMW are amongst the most expensive cars despite having some of the highest mileages. This could be due to a few related things:
- These brands retain more of their value over time compared to other brands
- These brands have longevity from a mechanical perspective
- Being more expensive to begin with, these brands will hold a higher price over time than an initially cheaper competitor, e.g. a car bought new for $100,000 vs. $50,000
- The brands' models could be older, more luxurious variants, skewing the prices towards the high end. We would need to dive deeper into the data and take a look at which models are being sold

Volkswagen, although not as expensive, had very high mileage as well. Opel had one of the highest mileages (4th overall) yet was amongst the cheaper brands. Ford had the lowest mileage (124,266 km) of the top 6 brands.

---

## Conclusion

In this guided project, we practiced applying a variety of pandas methods to explore and understand a data set on car listings. Here are some next steps for you to consider:

- Data cleaning next steps:
    - Identify categorical data that uses German words, translate them and map the values to their English counterparts
    - Convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer 20160321.
    - See if there are particular keywords in the name column that you can extract as new columns
- Analysis next steps:
    - Find the most common brand/model combinations
    - Split the odometer_km into groups, and use aggregation to see if average prices follow any patterns based on the mileage.
    - How much cheaper are cars with damage than their non-damaged counterparts?
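
Two of the data cleaning steps above can be sketched as follows. This is only an illustration on made-up values: `toy_dates` and `toy_gearbox` are hypothetical, and the German gearbox labels and their translations are assumed examples rather than a verified list of the values in the dataset.

```python
import pandas as pd

# Hypothetical date strings in the same "YYYY-MM-DD HH:MM:SS" layout.
toy_dates = pd.Series(['2016-03-21 00:00:00', '2016-04-05 00:00:00'])

# Keep the first 10 characters (the date part), drop the dashes, cast to int.
date_as_int = (toy_dates
                   .str[:10]
                   .str.replace('-', '', regex=False)
                   .astype(int)
              )
print(date_as_int.tolist())

# Assumed example labels; check the real column with value_counts() first.
toy_gearbox = pd.Series(['manuell', 'automatik', 'manuell'])
german_to_english = {'manuell': 'manual', 'automatik': 'automatic'}

# map() returns NaN for any value that is missing from the dictionary,
# which makes untranslated labels easy to spot afterwards.
gearbox_english = toy_gearbox.map(german_to_english)
print(gearbox_english.tolist())
```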