# Cleaning and Analyzing eBay Car Sales Data

In this project we'll work with a dataset of used cars from eBay Kleinanzeigen, the classifieds section of the German eBay website. The original dataset is available [here](https://data.world/data-society/used-cars-data). Our dataset is a 50,000-row sample of the original that DataQuest has purposely 'dirtied', since the goal of this project is data cleaning. After cleaning the dataset, we'll do a bit of analysis on how mileage relates to car price.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

autos = pd.read_csv('autos.csv', encoding='Latin-1')

### Exploring Dataset

In [2]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

The column names in our dataset are in camel case. We'll start by converting them to snake case and simplifying a couple of the names to make them easier to work with.

Five columns contain null values, but none of them is missing more than 20% of its values.
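The null-value share per column can be checked directly rather than read off the `info()` output. The sketch below uses a small made-up frame in place of `autos` (the values are illustrative, not taken from the dataset); the same expression should work on the real frame.

```python
import pandas as pd

# Hypothetical toy frame standing in for `autos`; values are illustrative only.
sample = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", "suv", None],
    "gearbox": ["manuell", "automatik", None, "manuell", "manuell"],
    "brand": ["ford", "bmw", "audi", "opel", "fiat"],
})

# isna() marks missing cells; the column-wise mean is the fraction missing.
null_pct = sample.isna().mean().mul(100).sort_values(ascending=False)
print(null_pct)
```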

### Clean Column Names

In [4]:
autos.columns = ['date_crawled', 'name', 'seller',
                 'offer_type', 'price', 'ab_test',
                 'vehicle_type', 'registration_year',
                 'gearbox', 'power_ps', 'model', 
                 'odometer', 'registration_month',
                 'fuel_type', 'brand', 
                 'unrepaired_damage', 'ad_created', 
                 'num_of_pictures', 'postal_code',
                 'last_seen']

autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

In [5]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-12 16:06:22,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


### Convert Values to Numeric

The price and odometer columns have numerical values stored as text due to the extra characters ('$', ',' and 'km'), which makes these columns harder to analyze. We'll clean the values by removing the extra characters and converting the columns to numeric types. Since 'km' indicated the unit of measurement, we'll add it to the column name (odometer_km) so we don't lose that information.

In [6]:
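#converting columns to numeric
#regex=False treats '$' as a literal character rather than a regex end-of-string anchor
autos['price'] = autos['price'].str.replace('$','', regex=False).str.replace(',','', regex=False).astype(float)
autos['odometer'] = autos['odometer'].str.replace('km','', regex=False).str.replace(',','', regex=False).astype(int)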
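`astype(float)` raises an error if any value still contains unexpected text after the characters are stripped. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN` so they can be inspected afterwards. The sample strings, including the malformed `"n/a"` entry, are made up for illustration.

```python
import pandas as pd

# Made-up price strings, including one malformed entry (assumption for illustration).
raw_price = pd.Series(["$5,000", "$8,500", "$0", "n/a"])

stripped = (
    raw_price
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

# errors="coerce" converts anything unparseable to NaN instead of raising.
price = pd.to_numeric(stripped, errors="coerce")

# Rows that failed to parse can then be reviewed explicitly.
print(raw_price[price.isna()])
```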

In [7]:
#adding measurement unit to column name since we removed it from values
autos.rename(columns = {'odometer':'odometer_km'}, inplace=True)

In [8]:
autos['odometer_km'].value_counts().sort_index(ascending=True)

5000        967
10000       264
20000       784
30000       789
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64

In [9]:
pd.set_option("display.max_rows", None) #show all instead of truncating output
autos['price'].value_counts().sort_index(ascending=True)

0.0           1421
1.0            156
2.0              3
3.0              1
5.0              2
8.0              1
9.0              1
10.0             7
11.0             2
12.0             3
13.0             2
14.0             1
15.0             2
17.0             3
18.0             1
20.0             4
25.0             5
29.0             1
30.0             7
35.0             1
40.0             6
45.0             4
47.0             1
49.0             4
50.0            49
55.0             2
59.0             1
60.0             9
65.0             5
66.0             1
70.0            10
75.0             5
79.0             1
80.0            15
89.0             1
90.0             5
99.0            19
100.0          134
110.0            3
111.0            2
115.0            2
117.0            1
120.0           39
122.0            1
125.0            8
129.0            1
130.0           15
135.0            1
139.0            1
140.0            9
145.0            2
149.0            7
150.0       

In [10]:
#calculating the percentage of price values less than $500 and price values greater than $100,000
print((sum(autos['price'] < 500) / len(autos['price'])) * 100)
print((sum(autos['price'] > 100000) / len(autos['price'])) * 100)

9.778
0.106


When looking at the values in the odometer_km and price columns, the odometer_km values fall into a reasonable range. However, the price column doesn't look right: we see values from \\$0 all the way up to \\$100 million. Nearly 10% of our data has a price below \\$500, with nearly 3\% at \\$0. We'll keep values between \\$500 and \\$100,000 in the price column, which removes roughly 10% of the rows. Any values outside that range will be considered outliers and removed.
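Before settling on a cutoff, it can help to compare how much data different candidate ranges would keep. The sketch below is one way to do that; `share_kept` is a hypothetical helper and the prices are made-up values, not rows from the dataset.

```python
import pandas as pd

# Made-up prices for illustration, including implausibly low and high values.
prices = pd.Series([0, 1, 250, 500, 1350, 5000, 8990, 99900, 12345678], dtype=float)

def share_kept(prices, low, high):
    """Percentage of rows whose price lies in [low, high], bounds inclusive."""
    return prices.between(low, high).mean() * 100

# Compare two candidate ranges side by side.
for low, high in [(100, 100_000), (500, 100_000)]:
    print(f"{low}-{high}: {share_kept(prices, low, high):.1f}% kept")
```

Note that `Series.between` includes both endpoints by default, so a price of exactly 500 is kept.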

In [11]:
autos = autos.loc[autos["price"].between(500,100000)].copy() #copy so later column assignments don't act on a view

In [12]:
autos['price'].describe()

count    45058.000000
mean      6177.407342
std       7636.932471
min        500.000000
25%       1500.000000
50%       3500.000000
75%       7900.000000
max      99900.000000
Name: price, dtype: float64

### Clean Date Columns

The date columns (date_crawled, ad_created, and last_seen) are stored as strings. We'll convert them to datetime objects and drop the timestamps, since the date is the only part we're concerned with.

In [13]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000.0,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500.0,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990.0,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350.0,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350.0,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [14]:
#convert date columns to datetime, only keep date
date_cols = ['date_crawled', 'ad_created', 'last_seen']

for col in date_cols:
    autos[col] = pd.to_datetime(autos[col])
    autos[col] = autos[col].dt.date
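One thing to be aware of: `.dt.date` returns plain Python `date` objects, so the column dtype becomes `object` and the `.dt` accessor is no longer available on it. If later steps need date arithmetic, a possible alternative is `.dt.normalize()`, which sets the time to midnight but keeps the `datetime64` dtype. A minimal sketch on two timestamps copied from the preview above:

```python
import pandas as pd

# Two timestamps taken from the date_crawled preview.
stamps = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

# normalize() zeroes the time component but keeps the datetime64 dtype.
dates = pd.to_datetime(stamps).dt.normalize()

# Datetime arithmetic still works on the normalized column.
span_days = (dates.max() - dates.min()).days
print(dates.dtype, span_days)
```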

In [15]:
autos[['date_crawled', 'ad_created', 'last_seen']].head()

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26,2016-03-26,2016-04-06
1,2016-04-04,2016-04-04,2016-04-06
2,2016-03-26,2016-03-26,2016-04-06
3,2016-03-12,2016-03-12,2016-03-15
4,2016-04-01,2016-04-01,2016-04-01


In [16]:
autos['registration_year'].describe()

count    45058.000000
mean      2005.063918
std         89.689852
min       1000.000000
25%       2000.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The registration_year column has a minimum of 1000 and a maximum of 9999, neither of which is plausible. Cars were not widely available before about 1900, and our data was crawled in 2016, so a car cannot have been registered after that. We'll only keep registration years from 1900 to 2016.

In [17]:
autos['registration_year'].between(1900, 2016).sum()

43284

In [18]:
autos = autos.loc[autos['registration_year'].between(1900,2016)].copy()

### Translating Data

As mentioned in the intro, our dataset was taken from the German eBay website. Some of the columns contain German words, which we'll translate to English so we can better understand the data.

In [19]:
autos['vehicle_type'].value_counts().head(20)

limousine     12069
kleinwagen     9216
kombi          8541
bus            3952
cabrio         2958
coupe          2346
suv            1953
andere          347
Name: vehicle_type, dtype: int64

In [20]:
#one lookup dictionary per column
vehicle_type_map = {'limousine': 'sedan',
                    'kleinwagen': 'small_car',
                    'kombi': 'station_wagon', #'Kombi' is German for station wagon
                    'bus': 'bus',
                    'cabrio': 'convertible',
                    'coupe': 'coupe',
                    'suv': 'suv',
                    'andere': 'other'
                   }

gearbox_map = {'manuell': 'manual',
               'automatik': 'automatic'
              }

damage_map = {'nein': 'no',
              'ja': 'yes'}

#map() leaves missing values as NaN and also turns any value without a dictionary entry into NaN
autos['vehicle_type'] = autos['vehicle_type'].map(vehicle_type_map)
autos['gearbox'] = autos['gearbox'].map(gearbox_map)
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(damage_map)
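`Series.map` with a dictionary returns `NaN` for every value that has no entry, which is fine when the dictionary covers all categories but can silently drop data when it doesn't. The sketch below shows a fallback that keeps the original value when no translation exists. The `fuel` series and the deliberately incomplete `lookup` dictionary are made up for illustration.

```python
import pandas as pd

# Made-up fuel values; the lookup is intentionally incomplete (assumption).
fuel = pd.Series(["benzin", "diesel", "lpg", None])
lookup = {"benzin": "petrol"}

# Strict: anything without a dictionary entry becomes NaN.
strict = fuel.map(lookup)

# Lenient: fall back to the original value where no translation exists.
lenient = fuel.map(lookup).fillna(fuel)

print(strict.tolist())
print(lenient.tolist())
```

Comparing `strict.isna().sum()` with `fuel.isna().sum()` is a quick way to see whether a mapping dropped any categories.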

Now that we've cleaned up our dataset, we'll begin doing some analysis. There are 40 unique brands in our dataset. We'll look at the 10 most popular brands, meaning the brands with the most cars listed.

We want to determine average prices and average mileage for the different brands and determine if there is any relationship between the two.

### Exploring Price by Brand

In [21]:
brand_counts = list(autos['brand'].value_counts()[:10].index)
brand_counts

['volkswagen',
 'bmw',
 'mercedes_benz',
 'opel',
 'audi',
 'ford',
 'renault',
 'peugeot',
 'fiat',
 'seat']

In [22]:
autos[autos['brand'].isin(brand_counts)].groupby('brand')['price'].mean().sort_values(ascending=False)

brand
audi             9571.457398
mercedes_benz    8666.677208
bmw              8447.069880
volkswagen       5783.622985
seat             4810.883871
ford             4247.120482
opel             3394.039568
peugeot          3360.920597
fiat             3256.152110
renault          2819.059411
Name: price, dtype: float64

The output above shows the average price by brand for the top 10 brands (by number of listings). The brands fall into three price brackets:
* Max - \\$8-10k
* Mid - \\$4-6k
* Min - Around \\$3k

Audi, Mercedes, and BMW are the most expensive brands, followed by Volkswagen, Seat, and Ford, which are mid-range. The rest of the brands are the cheapest at around \\$3k. Seven of the 10 brands fall into the roughly \\$3k-6k range.

### Exploring Mileage by Brand

In [23]:
autos[autos['brand'].isin(brand_counts)].groupby('brand')['odometer_km'].mean().sort_values(ascending=False)

brand
bmw              132928.714859
mercedes_benz    131083.126271
audi             128941.326531
volkswagen       128234.749455
opel             128012.422360
renault          126351.209253
peugeot          126073.113208
ford             123520.552799
seat             120058.064516
fiat             114416.094210
Name: odometer_km, dtype: float64

Above we see the average mileage by brand, in kilometers. All of the brands are within roughly 19,000 km of each other. The top five have an average mileage of about 128k-133k km, and the bottom five fall between 120k and 126k km, except for Fiat at around 114k km.

### Comparing Price and Mileage by Brand

In [24]:
autos[autos['brand'].isin(brand_counts)].groupby('brand')[['price', 'odometer_km']].mean().sort_values(by='price', ascending=False)

brand,price,odometer_km
audi,9571.457398,128941.326531
mercedes_benz,8666.677208,131083.126271
bmw,8447.06988,132928.714859
volkswagen,5783.622985,128234.749455
seat,4810.883871,120058.064516
ford,4247.120482,123520.552799
opel,3394.039568,128012.42236
peugeot,3360.920597,126073.113208
fiat,3256.15211,114416.09421
renault,2819.059411,126351.209253


When comparing the average price to the average mileage of the top 10 brands, we can see somewhat of a trend between mileage and price within the price brackets (Max, Mid, Min). The pattern is clearest in the Max bracket (Audi, Mercedes, BMW): as the average price decreases, the average mileage increases. It is less consistent in the Mid and Min brackets, so brand-level averages alone are not conclusive.
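The brand-level comparison can be quantified with a correlation coefficient. The sketch below re-enters the ten brand averages from the table above (rounded to two decimals) and computes Pearson's r for all brands and for the Max bracket alone. With only ten points, and three in the bracket, the numbers are a rough indication at best.

```python
import pandas as pd

# Brand averages copied from the table above, rounded to two decimals.
brand_means = pd.DataFrame(
    {
        "price": [9571.46, 8666.68, 8447.07, 5783.62, 4810.88,
                  4247.12, 3394.04, 3360.92, 3256.15, 2819.06],
        "odometer_km": [128941.33, 131083.13, 132928.71, 128234.75, 120058.06,
                        123520.55, 128012.42, 126073.11, 114416.09, 126351.21],
    },
    index=["audi", "mercedes_benz", "bmw", "volkswagen", "seat",
           "ford", "opel", "peugeot", "fiat", "renault"],
)

# Pearson correlation across all ten brands.
overall_r = brand_means["price"].corr(brand_means["odometer_km"])

# Same measure restricted to the Max bracket.
max_bracket = brand_means.loc[["audi", "mercedes_benz", "bmw"]]
max_r = max_bracket["price"].corr(max_bracket["odometer_km"])

print(f"all ten brands: r = {overall_r:.2f}")
print(f"Max bracket only: r = {max_r:.2f}")
```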

### Comparing Price and Mileage Overall

In [25]:
#group cars into mileage buckets (odometer readings are in km)
odometer_buckets = {}

#Series.items() replaces iteritems(), which was removed in pandas 2.0
for index, value in autos['odometer_km'].items():
    km = int(value)
    if 5000 <= km < 30000:
        odometer_buckets[index] = '5k to 30k'
    elif 30000 <= km < 70000:
        odometer_buckets[index] = '30k to 70k'
    elif 70000 <= km < 100000:
        odometer_buckets[index] = '70k to 100k'
    elif 100000 <= km <= 125000:
        odometer_buckets[index] = '100k to 125k'
    elif km > 125000:
        odometer_buckets[index] = '125k+'

#the new Series is aligned back to autos by index
ob_series = pd.Series(odometer_buckets)
autos['odometer_km_group'] = ob_series
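The same bucketing can also be expressed without an explicit loop using `pd.cut`. This is a sketch of an alternative, not the approach used in the notebook. The bin edges are chosen so that the buckets match the loop above for integer readings: with `right=False` each bin is closed on the left, and the edge `125001` keeps 125,000 km inside the '100k to 125k' bucket.

```python
import numpy as np
import pandas as pd

# A few odometer readings sitting exactly on bucket boundaries.
odometer_km = pd.Series([5000, 30000, 70000, 100000, 125000, 150000])

bins = [5000, 30000, 70000, 100000, 125001, np.inf]
labels = ["5k to 30k", "30k to 70k", "70k to 100k", "100k to 125k", "125k+"]

# right=False makes each bin include its left edge and exclude its right edge.
groups = pd.cut(odometer_km, bins=bins, labels=labels, right=False)
print(groups.tolist())
```

`pd.cut` returns an ordered categorical, so grouped results sort by mileage bucket rather than alphabetically.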

The majority of cars have high mileage: 63% of cars are in the 125k+ km group, and nearly 80% have 100,000 km or more. I would expect this to be the case, as most people sell their car after they've owned it for a while.

In [26]:
autos['odometer_km_group'].value_counts()

125k+           27442
100k to 125k     6623
70k to 100k      4123
30k to 70k       3606
5k to 30k        1490
Name: odometer_km_group, dtype: int64

There is a negative correlation between mileage and price: as the mileage increases, the average price decreases. On average, a car with more than 125k km costs around \\$4k, while a car with 5k-30k km costs around \\$14.5k.

In [27]:
autos.groupby('odometer_km_group')['price'].mean().sort_values()

odometer_km_group
125k+            4088.215910
100k to 125k     6970.396044
70k to 100k      9789.884308
30k to 70k      14150.301997
5k to 30k       14550.914094
Name: price, dtype: float64
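The bucket averages suggest a negative relationship. A row-level correlation between the two numeric columns is one way to put a number on it. The sketch below uses a small made-up frame in place of `autos`; on the real data the call would be `autos['odometer_km'].corr(autos['price'])`.

```python
import pandas as pd

# Made-up listings for illustration; not rows from the dataset.
listings = pd.DataFrame({
    "odometer_km": [5000, 50000, 100000, 125000, 150000, 150000],
    "price": [15000.0, 12000.0, 7000.0, 6500.0, 4000.0, 3500.0],
})

# Pearson correlation: values near -1 indicate a strong negative linear relationship.
r = listings["odometer_km"].corr(listings["price"])
print(f"r = {r:.2f}")
```

Because the odometer values in this dataset are rounded to a handful of levels, the coefficient should be read alongside the bucket averages rather than on its own.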