# Exploring eBay Car Sales Data

We will be working on a dataset of used cars from *eBay Kleinanzeigen*, a [classifieds](https://en.wikipedia.org/wiki/Classified_advertising) section of the German eBay website.

The dataset was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data). The version of the dataset we are working with is a sample of over 370,000 data points that was prepared by [Dataquest](https://www.dataquest.io).

The data dictionary provided with the data is as follows:

- `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
- `name` - Name of the car.
- `seller` - Whether the seller is private or a dealer.
- `offerType` - The type of listing.
- `price` - The price on the ad to sell the car.
- `abtest` - Whether the listing is included in an A/B test.
- `vehicleType` - The vehicle type.
- `yearOfRegistration` - The year in which the car was first registered.
- `gearbox` - The transmission type.
- `powerPS` - The power of the car in PS.
- `model` - The car model name.
- `kilometer` - How many kilometers the car has driven.
- `monthOfRegistration` - The month in which the car was first registered.
- `fuelType` - What type of fuel the car uses.
- `brand` - The brand of the car.
- `notRepairedDamage` - Whether the car has damage that has not yet been repaired.
- `dateCreated` - The date on which the eBay listing was created.
- `nrOfPictures` - The number of pictures in the ad.
- `postalCode` - The postal code for the location of the vehicle.
- `lastSeen` - When the crawler last saw this ad online.


The aim of this project is to clean the data and analyze the included used car listings.

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np

In [2]:
# Read the CSV file and assign it to a variable name
autos = pd.read_csv('autos.csv', encoding='Latin-1')

In [3]:
# Examine the dataframe
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-nu

In [4]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


Several columns have missing values: `vehicleType`, `gearbox`, `model`, `fuelType`, and `notRepairedDamage`. For `notRepairedDamage`, this is likely acceptable, since not every car listed has damage that is still unrepaired.

In addition, the column names use camelCase instead of snake_case. In the following section, we'll convert the column names from camelCase to snake_case and reword some of them, based on the data dictionary, to be more descriptive.

# Cleaning Column Names

In [5]:
# Print the existing column names
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'kilometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [6]:
# Convert column names from camelCase to snake_case and reword some of them
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_PS', 'model',
       'kilometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_of_pictures', 'postal_code',
       'last_seen']
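
Typing the new names by hand works for 20 columns, but the mechanical part of the conversion can also be automated. The following is a minimal sketch, not the approach used in this notebook: `camel_to_snake` and the `overrides` dictionary are hypothetical helpers introduced for illustration, and a small stand-in dataframe takes the place of `autos`.

```python
import re
import pandas as pd

def camel_to_snake(name):
    """Convert a camelCase name to snake_case (hypothetical helper)."""
    # Put an underscore between a lowercase letter or digit and the uppercase letter after it
    with_underscores = re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', name)
    return with_underscores.lower()

# Columns we want to reword rather than just convert (illustrative choices)
overrides = {
    'yearOfRegistration': 'registration_year',
    'monthOfRegistration': 'registration_month',
    'notRepairedDamage': 'unrepaired_damage',
    'dateCreated': 'ad_created',
    'nrOfPictures': 'num_of_pictures',
}

# Stand-in dataframe with a few of the original column names
demo = pd.DataFrame(columns=['dateCrawled', 'yearOfRegistration', 'powerPS', 'nrOfPictures'])
demo.columns = [overrides.get(col, camel_to_snake(col)) for col in demo.columns]
print(list(demo.columns))
```

Note that this helper lowercases everything, so `powerPS` becomes `power_ps` rather than the `power_PS` used in the rest of this notebook.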

In [7]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_PS,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_of_pictures,postal_code,last_seen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


# Initial Exploration and Cleaning

Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially, we will look for:

- Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis.
- Examples of numeric data stored as text which can be cleaned and converted.

The following methods are helpful for exploring the data:

- `DataFrame.describe()` (with `include='all'` to get both categorical and numeric columns).
- `Series.value_counts()` and `Series.head()` if any columns need a closer look.
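
As a rough illustration of the first check, the share of the most frequent value in each column can be computed with `value_counts(normalize=True)`. This is only a sketch under stated assumptions: the `sample` dataframe is made up, `dominant_share` is a hypothetical helper, and the 0.85 cut-off is an arbitrary choice rather than a rule.

```python
import pandas as pd

# Made-up stand-in for the real data
sample = pd.DataFrame({
    'seller': ['privat'] * 9 + ['gewerblich'],
    'brand': ['audi', 'bmw', 'opel', 'ford', 'audi', 'bmw', 'opel', 'ford', 'audi', 'bmw'],
    'num_of_pictures': [0] * 10,
})

def dominant_share(df):
    """Return, for each column, the fraction of rows taken by its most frequent value."""
    return pd.Series({
        col: df[col].value_counts(normalize=True, dropna=False).iloc[0]
        for col in df.columns
    })

shares = dominant_share(sample)
# Columns where a single value covers at least 85% of rows (arbitrary threshold)
near_constant = shares[shares >= 0.85].index.tolist()
print(shares)
print(near_constant)
```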

In [8]:
# Look at descriptive statistics for all columns
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_PS,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_of_pictures,postal_code,last_seen
count,371528,371528,371528,371528,371528.0,371528,333659,371528.0,351319,371528.0,351044,371528.0,371528.0,338142,371528,299468,371528,371528.0,371528.0,371528
unique,280500,233531,2,2,,2,8,,2,,251,,,7,40,2,114,,,182806
top,2016-03-24 14:49:47,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:45:59
freq,7,657,371525,371516,,192585,95894,,274214,,30070,,,223857,79640,263182,14450,,,17
mean,,,,,17295.14,,,2004.577997,,115.549477,,125618.688228,5.734445,,,,,0.0,50820.66764,
std,,,,,3587954.0,,,92.866598,,192.139578,,40112.337051,3.712412,,,,,0.0,25799.08247,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1150.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30459.0,
50%,,,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49610.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71546.0,


We need to answer some questions before moving on:

- Are there columns with mostly one value that are candidates to be dropped?
  `seller`, `offer_type`, and `num_of_pictures` look suspicious. The first two have only two unique values each, with one value accounting for nearly every row, while the last contains only 0.
- Are there columns that need more investigation?
  We'll take a closer look at these three columns.
- Are there examples of numeric data stored as text that need to be cleaned?
  No. The numeric columns (`price`, `power_PS`, `kilometer`, and so on) are already stored as integers.

In [9]:
# See the 'seller' column
autos['seller'].value_counts()

privat        371525
gewerblich         3
Name: seller, dtype: int64

This doesn't give us any useful insights, since virtually all of the sellers are listed as "privat," or private in English.

In [10]:
# See the offer_type column
autos['offer_type'].value_counts()

Angebot    371516
Gesuch         12
Name: offer_type, dtype: int64

We can say the same for `offer_type`, since one offer type dominates. We'll drop this column as well.

In [11]:
# See the num_of_pictures column
autos['num_of_pictures'].value_counts()

0    371528
Name: num_of_pictures, dtype: int64

Since none of these three columns yields any useful insight, we'll drop all of them.

In [12]:
# Drop three columns
autos = autos.drop(['seller', 'offer_type', 'num_of_pictures'], axis=1)

In [13]:
# Check the columns again
autos.columns

Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type',
       'registration_year', 'gearbox', 'power_PS', 'model', 'kilometer',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'postal_code', 'last_seen'],
      dtype='object')

# Exploring the Kilometer and Price Columns

We are going to explore the `kilometer` and `price` columns with the following steps:

- Analyze each column's minimum and maximum values and look for values that are unrealistically high or low (outliers) that we might want to remove.
- To do so, we'll use:
  - `Series.unique().shape` to see how many unique values there are.
  - `Series.describe()` to view the min, max, median, mean, and so on.
  - `Series.value_counts()`, with some variations:
    - Chained to `.head()` if there are lots of values.
    - Because `Series.value_counts()` returns a series, we can chain `Series.sort_index()` with `ascending=True` or `ascending=False` to view the highest and lowest values with their counts (this can also be chained to `.head()`).
  - `df[df["col"].between(x, y)]` to remove outliers.
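
To make the `value_counts()` and `sort_index()` combination concrete, here is a small sketch on a made-up `prices` series (the values are illustrative, not taken from the dataset). Sorting the counts by index orders them by price instead of by frequency, which makes the extremes easy to inspect.

```python
import pandas as pd

# Made-up prices, including zeros and one absurdly high value
prices = pd.Series([0, 0, 500, 1500, 1500, 2950, 7200, 99999999], name='price')

# value_counts() returns a series indexed by the distinct prices
counts = prices.value_counts()

# Sorting by index orders the rows by price rather than by count
highest = counts.sort_index(ascending=False).head(3)
lowest = counts.sort_index(ascending=True).head(3)

print(highest)
print(lowest)
```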

In [14]:
# See how many unique values
print('The shape of unique values for both kilometers and price is {} and {}, respectively'.format(autos['kilometer'].unique().shape, autos['price'].unique().shape))

The shape of unique values for both kilometers and price is (13,) and (5597,), respectively


In [15]:
# Calculating the number of cars based on kilometers
autos['kilometer'].value_counts()

150000    240797
125000     38067
100000     15920
90000      12523
80000      11053
70000       9773
60000       8669
50000       7615
5000        7069
40000       6376
30000       6041
20000       5676
10000       1949
Name: kilometer, dtype: int64

The majority of cars have run over 100,000 kilometers: about three quarters of the listings are at 125,000 or 150,000 km. The values are all round numbers, which suggests that sellers had to select from preset options for this field.

Now, let's look at the `price` column.

In [16]:
# Look into price data
autos['price'].describe()

count    3.715280e+05
mean     1.729514e+04
std      3.587954e+06
min      0.000000e+00
25%      1.150000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.147484e+09
Name: price, dtype: float64

In [52]:
# Value_counts based on price
autos['price'].value_counts().head(20)

0       10014
500      5468
1500     5093
1000     4366
1200     4332
2500     4244
3500     3653
600      3636
800      3581
2000     3240
999      3202
750      3026
650      2997
4500     2929
850      2798
2200     2787
700      2776
1800     2745
900      2664
300      2649
Name: price, dtype: int64

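About 10,000 cars (roughly 2.7% of the listings) have a price of 0. At the other extreme, the maximum of about 2.1 billion is clearly unrealistic. We are going to keep data points with prices between 0 and 350,000. Note that `Series.between()` includes both endpoints, so this filter removes only the extremely high prices; the zero-price listings stay in the data, as the minimum of 0 in the output below confirms.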

In [18]:
autos = autos[autos['price'].between(0,350000)]
autos['price'].describe()

count    371413.000000
mean       5727.498932
std        8792.694663
min           0.000000
25%        1150.000000
50%        2950.000000
75%        7200.000000
max      350000.000000
Name: price, dtype: float64
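
Because `Series.between()` includes both endpoints, a lower bound of 0 keeps the zero-price listings. If the goal is to exclude them as well, one option is to raise the lower bound to 1. The sketch below uses a made-up `prices` series rather than the real column, and the bound of 1 is an assumption, not a choice made in this notebook.

```python
import pandas as pd

# Made-up prices: two zeros, some plausible values, and one extreme value
prices = pd.Series([0, 0, 480, 1500, 18300, 350000, 2147483647], name='price')

# Lower bound of 0: zeros are kept, only the extreme value is removed
loose = prices[prices.between(0, 350000)]

# Lower bound of 1: zeros are removed as well
strict = prices[prices.between(1, 350000)]

print(len(loose), loose.min())
print(len(strict), strict.min())
```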

# Exploring the Date Column

Let's now move on to the date columns and understand the date range the data covers. Some of these columns were created by the crawler, and some came from the website itself.

- `date_crawled` : added by the crawler
- `last_seen` : added by the crawler
- `ad_created` : from the website
- `registration_month` : from the website
- `registration_year` : from the website

We'll start with the first three. Since pandas reads the `date_crawled`, `last_seen`, and `ad_created` columns as strings, we need to convert them into a form we can analyze quantitatively. Here, we'll extract the date portion of each value and look at how the listings are distributed across dates.

In [19]:
# Examine how those three columns are formatted
autos[['date_crawled', 'last_seen', 'ad_created']][0:5]

Unnamed: 0,date_crawled,last_seen,ad_created
0,2016-03-24 11:52:17,2016-04-07 03:16:57,2016-03-24 00:00:00
1,2016-03-24 10:58:45,2016-04-07 01:46:50,2016-03-24 00:00:00
2,2016-03-14 12:52:21,2016-04-05 12:47:46,2016-03-14 00:00:00
3,2016-03-17 16:54:04,2016-03-17 17:40:17,2016-03-17 00:00:00
4,2016-03-31 17:25:20,2016-04-06 10:17:21,2016-03-31 00:00:00


In [20]:
# Select the first 10 characters (date) and calculate the distribution of values
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025551
2016-03-06    0.014464
2016-03-07    0.035699
2016-03-08    0.033451
2016-03-09    0.034213
2016-03-10    0.032565
2016-03-11    0.032724
2016-03-12    0.036194
2016-03-13    0.015740
2016-03-14    0.036270
2016-03-15    0.033456
2016-03-16    0.030155
2016-03-17    0.031657
2016-03-18    0.013126
2016-03-19    0.035295
2016-03-20    0.036353
2016-03-21    0.035726
2016-03-22    0.032471
2016-03-23    0.031972
2016-03-24    0.029915
2016-03-25    0.032936
2016-03-26    0.031967
2016-03-27    0.030274
2016-03-28    0.035117
2016-03-29    0.034170
2016-03-30    0.033534
2016-03-31    0.031878
2016-04-01    0.034108
2016-04-02    0.035077
2016-04-03    0.038728
2016-04-04    0.037616
2016-04-05    0.012819
2016-04-06    0.003164
2016-04-07    0.001618
Name: date_crawled, dtype: float64

This shows that the crawler collected a roughly similar number of listings on most days from March 5 to April 4, 2016, with a sharp drop-off over the last three days.
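
Slicing the first 10 characters works because every timestamp has the same `YYYY-MM-DD HH:MM:SS` layout. An alternative, sketched here on a few made-up values rather than the real column, is to parse the strings with `pd.to_datetime` and then group by calendar date. The parsed values also support date arithmetic, which string slices do not.

```python
import pandas as pd

# A few timestamps in the same layout as the date columns
crawled = pd.Series([
    '2016-03-24 11:52:17',
    '2016-03-24 10:58:45',
    '2016-03-14 12:52:21',
    '2016-03-17 16:54:04',
])

# Parse the strings into datetime64 values
parsed = pd.to_datetime(crawled, format='%Y-%m-%d %H:%M:%S')

# Share of rows per calendar date, in chronological order
daily_share = parsed.dt.date.value_counts(normalize=True).sort_index()
print(daily_share)
```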

In [21]:
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001290
2016-03-06    0.004136
2016-03-07    0.005264
2016-03-08    0.008053
2016-03-09    0.009994
2016-03-10    0.011564
2016-03-11    0.013047
2016-03-12    0.023400
2016-03-13    0.008492
2016-03-14    0.012299
2016-03-15    0.016410
2016-03-16    0.016424
2016-03-17    0.028755
2016-03-18    0.006925
2016-03-19    0.016313
2016-03-20    0.019905
2016-03-21    0.020131
2016-03-22    0.020608
2016-03-23    0.018150
2016-03-24    0.019237
2016-03-25    0.019097
2016-03-26    0.016163
2016-03-27    0.016911
2016-03-28    0.022272
2016-03-29    0.023311
2016-03-30    0.023858
2016-03-31    0.024243
2016-04-01    0.024019
2016-04-02    0.025005
2016-04-03    0.025352
2016-04-04    0.025661
2016-04-05    0.126202
2016-04-06    0.217844
2016-04-07    0.129667
Name: last_seen, dtype: float64

There is a surge in the last three days of the observation period, which together account for almost half of the `last_seen` values. Were all of those ads really removed during those days? That seems unlikely. The most plausible explanation is that the crawling period ended then, so these dates mark when the crawler stopped rather than when the listings disappeared.

In [22]:
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2014-03-10    0.000003
2015-03-20    0.000003
2015-06-11    0.000003
2015-06-18    0.000003
2015-08-07    0.000003
                ...   
2016-04-03    0.038887
2016-04-04    0.037745
2016-04-05    0.011650
2016-04-06    0.003156
2016-04-07    0.001556
Name: ad_created, Length: 114, dtype: float64

Lastly, the `ad_created` column shows little beyond the fact that most ads were created during the crawling period, with more toward its end; a handful date back as far as March 2014. Let's now look at the distribution of `registration_year`.

In [23]:
autos['registration_year'].describe()

count    371413.000000
mean       2004.561152
std          91.937676
min        1000.000000
25%        1999.000000
50%        2003.000000
75%        2008.000000
max        9999.000000
Name: registration_year, dtype: float64

Some of these values are clearly incorrect. A car cannot have been registered in the year 1000, long before cars were invented, and the year 9999 lies far in the future. We'll deal with these incorrect values next.

# Dealing with Incorrect Registration Year Data

Since the listings were crawled in 2016, any vehicle with a registration year after 2016 is definitely inaccurate. We'll count the listings whose registration year falls outside the 1900-2016 interval and decide whether it is safe to remove those rows entirely or whether we need more custom logic.

In [24]:
# Check the number of cars that fall outside the interval
autos[(autos['registration_year'] < 1900) | (autos['registration_year'] > 2016)]['registration_year'].value_counts()

2017    10542
2018     3991
1000       38
2019       26
9999       26
5000       18
3000        7
6000        6
1500        5
1800        5
9000        5
2500        4
7000        4
1234        4
1111        3
4000        3
5555        2
5911        2
7500        2
1600        2
2222        2
1300        2
4500        2
8000        2
8888        2
2800        2
8455        1
1688        1
9996        1
1039        1
1602        1
7800        1
5600        1
9229        1
7777        1
7100        1
8200        1
1253        1
3800        1
9450        1
1400        1
3200        1
2200        1
4100        1
2066        1
5300        1
2900        1
5900        1
1200        1
6200        1
8500        1
1255        1
3700        1
4800        1
6500        1
2290        1
1001        1
3500        1
Name: registration_year, dtype: int64

So there are almost 15,000 data points with an incorrect registration year, which is roughly 4% of the total data. Since this is a small share, we are going to remove those rows.
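
The count and the share of out-of-range rows can also be computed directly from a boolean mask instead of being read off the `value_counts()` output. Here is a minimal sketch on a made-up `years` series (the values are illustrative only):

```python
import pandas as pd

# Made-up registration years, half of them outside 1900-2016
years = pd.Series([1000, 1999, 2003, 2008, 2016, 2017, 2018, 9999])

# True where the year falls outside the accepted interval
invalid = ~years.between(1900, 2016)

invalid_count = invalid.sum()    # number of out-of-range rows
invalid_share = invalid.mean()   # fraction of out-of-range rows

print(invalid_count, invalid_share)
```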

In [25]:
# Remove values that fall outside the interval
autos = autos[autos['registration_year'].between(1900,2016)]

In [26]:
# Calculate the distribution of the remaining values
autos['registration_year'].value_counts(normalize=True).head(10)

2000    0.068802
1999    0.063823
2005    0.062556
2006    0.056707
2001    0.056674
2003    0.055709
2004    0.055353
2002    0.053794
1998    0.050320
2007    0.049547
Name: registration_year, dtype: float64

In [27]:
autos['registration_year'].value_counts(normalize=True).tail(10)

1928    0.000006
1946    0.000006
1944    0.000006
1927    0.000006
1940    0.000006
1911    0.000003
1919    0.000003
1915    0.000003
1920    0.000003
1925    0.000003
Name: registration_year, dtype: float64

The majority of cars were registered in the 1990s and 2000s, while those registered early in the twentieth century make up only a tiny portion of the overall data.

# Exploring Price by Brand

When working with data on cars, it's natural to explore variations across different car brands. We can use aggregation to understand the `brand` column. Here is the process:

- Identify the unique values we want to aggregate by
- Create an empty dictionary to store our aggregate data
- Loop over the unique values, and for each one: 1) subset the dataframe by that value, 2) calculate the mean of whichever column we're interested in, 3) store the value and its mean in the dictionary as a key/value pair.

In [28]:
# Explore the unique values in the brand column and select the top 20 brands
autos['brand'].value_counts(normalize=True).head(20)

volkswagen        0.212396
bmw               0.109697
opel              0.107092
mercedes_benz     0.095964
audi              0.089356
ford              0.068856
renault           0.047573
peugeot           0.029868
fiat              0.025763
seat              0.018636
skoda             0.015412
mazda             0.015350
smart             0.014108
citroen           0.013875
nissan            0.013573
toyota            0.012751
sonstige_autos    0.010584
hyundai           0.009833
mini              0.009213
volvo             0.009132
Name: brand, dtype: float64

These are the top 20 brands by share of total listings. Only Volkswagen, BMW, and Opel each account for more than 10% of the listings, and the shares fall off quickly after that. We'll limit our analysis to the brands with more than 5% of the listings: Volkswagen, BMW, Opel, Mercedes Benz, Audi, and Ford.

In [39]:
# Create an empty dictionary and loop over the selected brands
aggregate_price = {}
selected_brands = ['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford']

for brand in selected_brands:
    brand_only = autos[autos['brand'] == brand]
    mean_price = brand_only['price'].mean()
    aggregate_price[brand] = int(mean_price)

In [41]:
# Checking the dictionary
aggregate_price

{'volkswagen': 5231,
 'bmw': 8224,
 'opel': 2870,
 'mercedes_benz': 8387,
 'audi': 8849,
 'ford': 3595}

Audi, Mercedes Benz, and BMW are the most expensive ones, on average. Ford and Opel are less expensive, while Volkswagen is in between.
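
The same aggregation can be written without an explicit loop by using `groupby`. The sketch below is an alternative to the loop above, not a replacement for it, and it runs on a made-up `sample` dataframe with illustrative prices.

```python
import pandas as pd

# Made-up listings for illustration
sample = pd.DataFrame({
    'brand': ['audi', 'audi', 'opel', 'opel', 'ford', 'fiat'],
    'price': [9000, 8000, 3000, 2000, 3500, 1000],
})
selected_brands = ['audi', 'opel', 'ford']

# Keep only the selected brands, then take the mean price per brand
subset = sample[sample['brand'].isin(selected_brands)]
mean_price_by_brand = subset.groupby('brand')['price'].mean().astype(int).to_dict()

print(mean_price_by_brand)
```

Like `int(mean_price)` in the loop, `astype(int)` truncates the mean toward zero rather than rounding it.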

# Storing Aggregate Data in a DataFrame

For the top 6 brands, we are going to use aggregation to understand the average mileage for those cars and if there's any visible link with mean price. We'll combine the data from both series objects into a single dataframe (with a shared index) and display the dataframe directly.

In [43]:
# Create an empty dictionary and loop over the selected brands
aggregate_kilometer = {}
selected_brands = ['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford']

for brand in selected_brands:
    brand_only = autos[autos['brand'] == brand]
    mean_kilometer = brand_only['kilometer'].mean()
    aggregate_kilometer[brand] = int(mean_kilometer)

In [44]:
# Check the aggregate_kilometer dictionary
aggregate_kilometer

{'volkswagen': 128338,
 'bmw': 132666,
 'opel': 128756,
 'mercedes_benz': 130585,
 'audi': 129498,
 'ford': 123617}

In [45]:
# Convert both dictionaries to Series objects using the Series constructor
price_series = pd.Series(aggregate_price)
kilometer_series = pd.Series(aggregate_kilometer)

In [47]:
# Create a dataframe from the first series object
df = pd.DataFrame(price_series, columns=['mean_price'])

In [48]:
# Assign the other series as a new column
df['mean_kilometer'] = kilometer_series

In [50]:
# Add a column with the mean price per kilometer
df['price per km'] = df['mean_price'] / df['mean_kilometer']

In [51]:
# Print the dataframe
print(df)

               mean_price  mean_kilometer  price per km
volkswagen           5231          128338      0.040760
bmw                  8224          132666      0.061990
opel                 2870          128756      0.022290
mercedes_benz        8387          130585      0.064226
audi                 8849          129498      0.068333
ford                 3595          123617      0.029082


There is no clear negative link between mean price and mean mileage across these brands. Mean mileage varies little, from about 124,000 km to about 133,000 km, and if anything the relationship looks positive: BMW, Mercedes Benz, and Audi are the most expensive brands, yet they also have the highest mean mileage. Based on price per km, Opel has the lowest value of the six, so by this rough measure it appears to offer the best value.
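
As a compact alternative to the step-by-step construction above, both series can be passed to the `DataFrame` constructor at once, and the relationship between the two columns can be summarized with `Series.corr()`. This sketch reuses the six brand means reported above. The name `summary` is introduced here for illustration, and with only six data points the correlation coefficient is a rough indication at best.

```python
import pandas as pd

# Mean values per brand, copied from the dictionaries above
aggregate_price = {'volkswagen': 5231, 'bmw': 8224, 'opel': 2870,
                   'mercedes_benz': 8387, 'audi': 8849, 'ford': 3595}
aggregate_kilometer = {'volkswagen': 128338, 'bmw': 132666, 'opel': 128756,
                       'mercedes_benz': 130585, 'audi': 129498, 'ford': 123617}

# Build the dataframe in one step; the series are aligned on the brand index
summary = pd.DataFrame({
    'mean_price': pd.Series(aggregate_price),
    'mean_kilometer': pd.Series(aggregate_kilometer),
})
summary['price_per_km'] = summary['mean_price'] / summary['mean_kilometer']

# Pearson correlation between the two brand-level means
correlation = summary['mean_price'].corr(summary['mean_kilometer'])

print(summary)
print(correlation)
```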