# Exploring eBay Car Sales Data on eBay Kleinanzeigen

In this project, we'll clean and analyse used car listings from eBay Kleinanzeigen, the [classifieds](https://en.wikipedia.org/wiki/Classified_advertising) section of the German eBay website.

The [original data set](https://www.kaggle.com/orgesleka/used-cars-database/data) was scraped and uploaded to Kaggle, but a few modifications have been made to it. Firstly, the data set used in this project is a sample of 50 000 data points from the full data set. It has also been dirtied to more closely resemble a scraped data set, as the original version was cleaned. This version of the data set was prepared by [Dataquest](https://www.dataquest.io/).

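The data dictionary provided with this data is as follows (in the modified file used here, `kilometer` appears as `odometer` and `lastSeenOnline` as `lastSeen`):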

| Column names | Description | 
| :--- | :--- |
|`dateCrawled`| When this ad was first crawled. All field-values are taken from this date.
|`name`| Name of the car.
|`seller`| Whether the seller is private or a dealer.
|`offerType`| The type of listing.
|`price`| The price on the ad to sell the car.
|`abtest`| Whether the listing is included in an A/B test.
|`vehicleType`| The vehicle type.
|`yearOfRegistration`| The year in which the car was first registered.
|`gearbox`| The transmission type.
|`powerPS`| The power of the car in PS.
|`model`| The car model name.
|`kilometer`| How many kilometers the car has driven.
|`monthOfRegistration`| The month in which the car was first registered.
|`fuelType`| What type of fuel the car uses.
|`brand`| The brand of the car.
|`notRepairedDamage`| If the car has a damage which is not yet repaired.
|`dateCreated`| The date on which the eBay listing was created.
|`nrOfPictures`| The number of pictures in the ad.
|`postalCode`| The postal code for the location of the vehicle.
|`lastSeenOnline`| When the crawler saw this ad last online.

## Introduction
Firstly, we'll import the pandas and NumPy libraries.

In [1]:
import pandas as pd
import numpy as np

Next, we'll read the `autos.csv` file into pandas and assign it to a variable.

In [2]:
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')

We'll now print information about the `autos` dataframe and the first few rows.

In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

In [4]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


**Column types**

Of the 20 columns, 5 are int64 columns and 15 are object columns, which hold strings.

**Null Values**

The `vehicleType`, `gearbox`, `model`, `fuelType` and `notRepairedDamage` columns have null values in them. The `notRepairedDamage` column has the most null values: 9 829 of its 50 000 entries, or about 19.7%, are missing.
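
The null shares can also be computed rather than read off the `info()` output. The following is a minimal sketch on a small made-up frame (the `sample` data is hypothetical and only stands in for `autos`); the same expression should work on the real dataframe.

```python
import pandas as pd

# Toy frame standing in for `autos`; the values are made up for illustration.
sample = pd.DataFrame({
    "brand": ["bmw", "ford", "opel", "audi"],
    "gearbox": ["manuell", None, "automatik", "manuell"],
    "notRepairedDamage": ["nein", None, None, "ja"],
})

# isnull() marks missing cells; the column-wise mean is the fraction of nulls.
null_pct = (sample.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(null_pct)
```

Sorting in descending order puts the column with the most missing values first.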

**Dates**

The `dateCrawled`, `dateCreated` and `lastSeen` columns hold dates that are stored as strings.

## Cleaning Column Names

We'll start by cleaning the column names.

In [5]:
# Print an array of existing column names
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

We'll make 2 changes to the column names:
1. Convert column names from [camel case](https://en.wikipedia.org/wiki/Camel_case) to [snake case](https://en.wikipedia.org/wiki/Snake_case), which is the naming style Python recommends for variable names.
2. Reword some column names to be more descriptive.

In [6]:
# Edit column names
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']
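
The cell above renames every column by hand, which covers both changes at once. The purely mechanical camel case to snake case part could also be sketched with a regular expression. The helper name `camel_to_snake` is our own, not part of pandas, and the descriptive renames (for example `yearOfRegistration` to `registration_year`) would still need a manual mapping.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each capital that follows a lowercase
    letter or digit, then lowercase the whole name."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

original = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "abtest"]
converted = [camel_to_snake(c) for c in original]
print(converted)
```

Because the underscore is only inserted after a lowercase letter or digit, an all-caps suffix such as `PS` stays together as `power_ps`.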

In [7]:
# Print first 5 rows of dataframe
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


## Initial Exploration and Cleaning
Now let's look into the descriptive statistics of the `autos` dataframe to do some basic data exploration and determine what other cleaning tasks need to be done.

In [8]:
autos.describe(include = 'all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-21 16:37:21,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Based on the descriptive statistics above, the following observations are noted:
1. There are two columns where numeric data is stored as text, the `price` and `odometer` columns.
2. The minimum `registration_year` is 1000.
3. The minimum `power_ps` is 0.
4. The maximum `nr_of_pictures` is 0.
5. In the `seller` column, 49 999 out of 50 000 of the sellers are private sellers.
6. In the `offer_type` column, 49 999 out of 50 000 are 'Angebot'.
7. In the `price` column, the price with the highest frequency is $0 with 1421 listings.

We will investigate each of these observations in turn.

**Observation 1:** There are two columns where numeric data is stored as text, the `price` and `odometer` columns.

*Outcome:* 
- Remove any non-numeric characters from these two columns.
- Convert both columns to a numeric dtype.
- Rename the `odometer` column to `odometer_km`.

We'll first find the unique values in the `price` column.

In [9]:
autos['price'].unique()

array(['$5,000', '$8,500', '$8,990', ..., '$385', '$22,200', '$16,995'],
      dtype=object)

We'll need to remove the '$' and ',' characters to convert the column to a numeric dtype. We'll convert it to an int as there are no decimal places.

In [10]:
# regex=False treats "$" as a literal character rather than a regex anchor
autos['price'] = autos['price'].str.replace("$", "", regex=False)
autos['price'] = autos['price'].str.replace(",", "", regex=False)
autos['price'] = autos['price'].astype(int)

In [11]:
autos['price'].head()

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int32

Now we'll work on the `odometer` column.

In [12]:
autos['odometer'].unique()

array(['150,000km', '70,000km', '50,000km', '80,000km', '10,000km',
       '30,000km', '125,000km', '90,000km', '20,000km', '60,000km',
       '5,000km', '100,000km', '40,000km'], dtype=object)

We'll need to remove the 'km' and ',' characters to convert the column to a numeric dtype. We'll convert it to an int as there are no decimal places.

In [13]:
# regex=False makes both replacements plain string matches
autos['odometer'] = autos['odometer'].str.replace("km", "", regex=False)
autos['odometer'] = autos['odometer'].str.replace(",", "", regex=False)
autos['odometer'] = autos['odometer'].astype(int)

Now we'll rename the column to `odometer_km`.

In [14]:
autos.rename({"odometer":"odometer_km"}, axis = 1, inplace = True)
autos['odometer_km'].head()

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer_km, dtype: int32
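
Both conversions above follow the same pattern: strip the non-numeric characters, then cast to int. As a rough sketch, the pattern could be wrapped in one helper. `strip_to_int` is a hypothetical name, and the approach assumes the values are non-negative whole numbers, since it would also discard minus signs and decimal points.

```python
import pandas as pd

def strip_to_int(series):
    """Drop every non-digit character from a string Series, then cast to int."""
    return series.str.replace(r"\D", "", regex=True).astype(int)

# Made-up samples in the same formats as the `price` and `odometer` columns.
prices = pd.Series(["$5,000", "$8,500", "$0"])
odometer = pd.Series(["150,000km", "70,000km", "5,000km"])

clean_prices = strip_to_int(prices)
clean_odometer = strip_to_int(odometer)
print(clean_prices.tolist())
print(clean_odometer.tolist())
```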

**Observation 2:** The minimum `registration_year` is 1000.

We will investigate this further to see if it's a one-off occurrence or if there are other similar values.

In [15]:
sorted_rows = autos.sort_values("registration_year")
sorted_rows["registration_year"].unique()

array([1000, 1001, 1111, 1500, 1800, 1910, 1927, 1929, 1931, 1934, 1937,
       1938, 1939, 1941, 1943, 1948, 1950, 1951, 1952, 1953, 1954, 1955,
       1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966,
       1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977,
       1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988,
       1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,
       2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,
       2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2800, 4100,
       4500, 4800, 5000, 5911, 6200, 8888, 9000, 9996, 9999], dtype=int64)

There are several unrealistic registration years: 1000, 1001, 1111, 1500, 1800, 2800, 4100, 4500, 4800, 5000, 5911, 6200, 8888, 9000, 9996 and 9999.

*Outcome:*
- We will not clean these values for now but will be mindful of their existence in our future calculations.

**Observation 3:** The minimum `power_ps` is 0.

We'll look into how many cars listed have a `power_ps` of 0.

In [16]:
autos["power_ps"].value_counts().head()

0      5500
75     3171
60     2195
150    2046
140    1884
Name: power_ps, dtype: int64

There are 5500 cars with a `power_ps` of 0. These cars may not be in working order and are being sold for parts, or the seller may not know the car's power.

*Outcome:*
- We will not clean these values for now but will be mindful of their existence in our future calculations.

**Observation 4:** The maximum `nr_of_pictures` is 0.

In [17]:
autos["nr_of_pictures"].value_counts()

0    50000
Name: nr_of_pictures, dtype: int64

*Outcome:* 
- Since none of the listings has any pictures, this column carries no information and we will drop it.

In [18]:
autos = autos.drop(["nr_of_pictures"], axis = 1)

In [19]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 19 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
seller                50000 non-null object
offer_type            50000 non-null object
price                 50000 non-null int32
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gearbox               47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer_km           50000 non-null int32
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int32(2), int64(4), object(13)
memory usage: 6.9+ MB


**Observation 5:** In the `seller` column, 49 999 out of 50 000 of the sellers are private sellers.

**Observation 6:** In the `offer_type` column, 49 999 out of 50 000 are 'Angebot'.

*Outcome:*
- Since almost all of the values in each of these two columns are identical, we will drop both columns.

In [20]:
autos = autos.drop(["seller", "offer_type"], axis = 1)

In [21]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 17 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
price                 50000 non-null int32
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gearbox               47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer_km           50000 non-null int32
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int32(2), int64(4), object(11)
memory usage: 6.1+ MB
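
Observations 4 to 6 were spotted by eye in the `describe` output. As a sketch, near-constant columns could also be flagged programmatically by checking how large a share of each column is taken by its most frequent value. The helper name `dominant_share`, the 0.85 cut-off and the toy data are all assumptions made for this illustration.

```python
import pandas as pd

def dominant_share(df):
    """Share of rows taken by the most frequent value in each column."""
    return df.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )

# Toy frame standing in for `autos`; the values are made up.
sample = pd.DataFrame({
    "seller": ["privat"] * 9 + ["other"],
    "nr_of_pictures": [0] * 10,
    "brand": ["bmw", "ford", "opel", "audi", "bmw",
              "fiat", "seat", "mini", "kia", "saab"],
})

shares = dominant_share(sample)
near_constant = shares[shares > 0.85].index.tolist()
print(near_constant)
```

Columns flagged this way are only candidates for dropping; whether the rare values matter is still a judgement call.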


**Observation 7:** In the `price` column, the price with the highest frequency is $0 with 1421 listings.

In [22]:
autos["price"].value_counts().head()

0       1421
500      781
1500     734
2500     643
1000     639
Name: price, dtype: int64

*Outcome:* 
- These cars might be listed to be given away for free. 
- We will not clean these values for now but will be mindful of their existence in our future calculations.

## Exploring the Odometer and Price Columns

In our last step, we converted the `odometer` and `price` columns to numeric data types and renamed the `odometer` column to `odometer_km`.

We'll continue to explore data in these two columns to specifically find data that are unrealistic.

These outliers will then be removed.

**The `odometer_km` column**

In [23]:
# Returns unique values in the column
autos["odometer_km"].unique()

array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000, 100000,  40000], dtype=int64)

In [24]:
# Returns descriptive statistics for the column
autos["odometer_km"].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [25]:
# Returns the count of each unique value in ascending order
autos["odometer_km"].value_counts().sort_index(ascending = True)

5000        967
10000       264
20000       784
30000       789
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64

The values seem reasonable and there do not appear to be any outliers. Therefore, we won't make any changes to this column. However, it should be noted that most of the listed cars have a high mileage: 32 424 of the 50 000 listings are at 150 000 km.

**The `price` column**

In [26]:
# Returns the unique values in the column
print(autos["price"].unique())

# Returns the number of unique values in the column
print(autos["price"].unique().shape)

# Returns descriptive statistics for the column
print(autos["price"].describe())

[ 5000  8500  8990 ...   385 22200 16995]
(2357,)
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64


In [27]:
# Returns the 10 most frequent values and their counts
autos["price"].value_counts().head(10)

0       1421
500      781
1500     734
2500     643
1000     639
1200     639
600      531
800      498
3500     498
2000     460
Name: price, dtype: int64

As noted in *observation 7* in our previous section, there are 1421 car listings with the price of $0.

In [28]:
print('Percentage of $0 cars listed: ' + str(round((1421/50000*100),2)) + "%")

Percentage of $0 cars listed: 2.84%


As these listings make up only 2.84% of the data, we might consider removing them.

We also noticed that the maximum price is $99 999 999 (shown as 1.0e+08 above), which is very high. We'll look further into the prices listed in the lower range as well as the higher range. 

In [29]:
autos["price"].value_counts().sort_index().head(20)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

In [30]:
autos["price"].value_counts().sort_index().tail(20)

197000      1
198000      1
220000      1
250000      1
259000      1
265000      1
295000      1
299000      1
345000      1
350000      1
999990      1
999999      2
1234566     1
1300000     1
3890000     1
10000000    1
11111111    2
12345678    3
27322222    1
99999999    1
Name: price, dtype: int64

If we look carefully at the descriptive statistics of this column, the mean is 9840.04.

However, since the outliers in this dataset skew the mean, the median price of 2950 is a more representative figure. According to the 75th percentile, 75% of the cars cost 7200 or less, so it seems sensible to estimate that most cars cost less than 10 000.

However, before removing any data, let's look at how many listed cars have a price greater than 100 000, 50 000 and 10 000 respectively.

In [31]:
above_100k = autos["price"] > 100000
above_100k.value_counts()

False    49947
True        53
Name: price, dtype: int64

In [32]:
above_50k = autos["price"] > 50000
above_50k.value_counts()

False    49800
True       200
Name: price, dtype: int64

In [33]:
above_10k = autos["price"] > 10000
above_10k.value_counts()

False    41927
True      8073
Name: price, dtype: int64

In [34]:
print('Percentage of cars with price > $100 000: ' + str(round(53/50000*100, 2)) + '%')
print('Percentage of cars with price > $50 000: ' + str(round(200/50000*100, 2)) + '%')
print('Percentage of cars with price > $10 000: ' + str(round(8073/50000*100, 2)) + '%')

Percentage of cars with price > $100 000: 0.11%
Percentage of cars with price > $50 000: 0.4%
Percentage of cars with price > $10 000: 16.15%
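
The percentages above are typed in by hand from the earlier counts. A possible alternative, sketched here on a made-up `prices` Series, is to compute each share directly: the mean of a boolean mask is the fraction of `True` values, so nothing has to be copied between cells.

```python
import pandas as pd

# Made-up prices for illustration; on the real data this would be autos["price"].
prices = pd.Series([0, 500, 1500, 12000, 60000, 120000, 999999, 2500, 7200, 300])

shares = {}
for threshold in (100_000, 50_000, 10_000):
    # The mean of a boolean mask is the fraction of rows above the threshold.
    shares[threshold] = (prices > threshold).mean() * 100
    print(f"Percentage of cars with price > ${threshold:,}: {shares[threshold]:.2f}%")
```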


Even though only 0.11% of cars have prices greater than 100 000, the values increase steadily up to 350 000 before jumping to 999 990. Therefore, we will remove listings priced above 350 000.

As for the lower end of the price range, very low prices are plausible because eBay allows bidding, so a listing may start at $1. We will keep listings priced at $1 or more and remove the $0 listings.

In [35]:
autos = autos[autos["price"].between(1,350000)]

In [36]:
autos["price"].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

We notice that the mean has decreased from about 9840 to about 5889, while the median has barely moved (from 2950 to 3000).
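
One detail of the filter is worth spelling out: `Series.between` includes both endpoints by default, so listings priced at exactly 1 or exactly 350 000 are kept. A small sketch on made-up prices shows this, along with a way to count how many rows a filter removes.

```python
import pandas as pd

# Made-up prices chosen to sit on and around the two boundaries.
prices = pd.Series([0, 1, 350, 2950, 350000, 350001, 99999999])

# between() is inclusive on both ends by default.
keep = prices.between(1, 350000)
kept = prices[keep].tolist()
dropped = int((~keep).sum())
print(kept)
print(dropped, "rows dropped")
```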

## Exploring the Date Columns

Now we are going to look into the date columns. In our dataset, there are 5 columns that represent date values. 

Two of the columns that are added by the *crawler* are:
- `date_crawled` (values stored as strings)
- `last_seen` (values stored as strings)

Three of the columns that are obtained from the *website* itself are:
- `ad_created` (values stored as strings)
- `registration_month` (values are numeric)
- `registration_year` (values are numeric)

Firstly, let's look into the three columns that are stored as strings.

In [37]:
autos[["date_crawled", "last_seen", "ad_created"]][0:5]

| | date_crawled | last_seen | ad_created |
| :--- | :--- | :--- | :--- |
| 0 | 2016-03-26 17:47:46 | 2016-04-06 06:45:54 | 2016-03-26 00:00:00 |
| 1 | 2016-04-04 13:38:56 | 2016-04-06 14:45:08 | 2016-04-04 00:00:00 |
| 2 | 2016-03-26 18:57:24 | 2016-04-06 20:15:37 | 2016-03-26 00:00:00 |
| 3 | 2016-03-12 16:58:10 | 2016-03-15 03:16:28 | 2016-03-12 00:00:00 |
| 4 | 2016-04-01 14:38:50 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 |


We are only interested in extracting the date values so that we are able to understand the date range. Therefore, we only need to select the first 10 characters in each column.
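
Slicing the first 10 characters works because every timestamp has the same `YYYY-MM-DD HH:MM:SS` layout. As a hedged alternative, the strings could be parsed with `pd.to_datetime`, which would also fail loudly on a malformed value. The sketch below uses a few made-up timestamps in that layout and shows that both routes give the same daily shares.

```python
import pandas as pd

# A few made-up timestamps in the same layout as `date_crawled`.
crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
])

# String route: keep only the date part of each string.
by_slice = crawled.str[:10].value_counts(normalize=True).sort_index()

# Datetime route: parse first, then format back to a date string.
parsed = pd.to_datetime(crawled, format="%Y-%m-%d %H:%M:%S")
by_date = parsed.dt.strftime("%Y-%m-%d").value_counts(normalize=True).sort_index()

print(by_slice)
print(parsed.min(), parsed.max())
```

Parsing also gives the date range directly through `min()` and `max()`, without having to scan the sorted output.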

**The `date_crawled` column**

In [38]:
# Obtaining the percentages for each date, sorted by date
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna = False).sort_index()

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

In [39]:
# Obtaining the percentages for each date, sorted by percentages
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna = False).sort_values()

2016-04-07    0.001400
2016-04-06    0.003171
2016-03-18    0.012911
2016-04-05    0.013096
2016-03-06    0.014043
2016-03-13    0.015670
2016-03-05    0.025327
2016-03-24    0.029342
2016-03-16    0.029610
2016-03-27    0.031092
2016-03-25    0.031607
2016-03-17    0.031628
2016-03-31    0.031834
2016-03-10    0.032184
2016-03-26    0.032204
2016-03-23    0.032225
2016-03-11    0.032575
2016-03-22    0.032987
2016-03-09    0.033090
2016-03-08    0.033296
2016-04-01    0.033687
2016-03-30    0.033687
2016-03-29    0.034099
2016-03-15    0.034284
2016-03-19    0.034778
2016-03-28    0.034860
2016-04-02    0.035478
2016-03-07    0.036014
2016-04-04    0.036487
2016-03-14    0.036549
2016-03-12    0.036920
2016-03-21    0.037373
2016-03-20    0.037887
2016-04-03    0.038608
Name: date_crawled, dtype: float64

The site was crawled over roughly one month, from 5 March to 7 April 2016. The share of listings crawled each day is fairly consistent, at approximately 3% on most days.

**The `last_seen` column**

In [40]:
# Obtaining the percentages for each date, sorted by date
autos["last_seen"].str[:10].value_counts(normalize=True, dropna = False).sort_index()

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64

The `last_seen` column records the date on which the crawler last saw each listing online. We are assuming that a listing is removed once the car has been sold. Very few listings were last seen during the first few days, probably because the crawling period had just started and most of those listings were seen again later in the month. After that initial period, the daily share stays roughly uniform, at about 1% to 3%, until the final days. On each of the last 3 days the share is roughly 6 to 10 times higher (12% to 22%). This spike most likely reflects the end of the crawling period rather than a massive increase in car sales.

**The `ad_created` column**

In [41]:
# Obtaining the percentages for each date, sorted by date
autos["ad_created"].str[:10].value_counts(normalize=True, dropna = False).sort_index()

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
                ...   
2016-03-09    0.033151
2016-03-10    0.031895
2016-03-11    0.032904
2016-03-12    0.036755
2016-03-13    0.017008
2016-03-14    0.035190
2016-03-15    0.034016
2016-03-16    0.030125
2016-03-17    0.031278
2016-03-18    0.013590
2016-03-19    0.033687
2016-03-20    0.037949
2016-03-21 

In [42]:
# Obtaining the percentages for each date, sorted by percentages
autos["ad_created"].str[:10].value_counts(normalize=True, dropna = False).sort_values()

2015-12-05    0.000021
2015-06-11    0.000021
2016-01-22    0.000021
2016-01-07    0.000021
2016-01-29    0.000021
2016-02-16    0.000021
2016-02-09    0.000021
2016-01-13    0.000021
2016-01-14    0.000021
2015-12-30    0.000021
2016-01-16    0.000021
2016-01-03    0.000021
2016-02-07    0.000021
2016-02-22    0.000021
2016-02-01    0.000021
2015-09-09    0.000021
2015-08-10    0.000021
2016-02-17    0.000021
2016-02-11    0.000021
2015-11-10    0.000021
2016-02-08    0.000021
2016-02-12    0.000041
2016-02-26    0.000041
2016-01-10    0.000041
2016-02-05    0.000041
2016-02-24    0.000041
2016-02-14    0.000041
2016-02-20    0.000041
2016-02-02    0.000041
2016-02-18    0.000041
                ...   
2016-03-06    0.015320
2016-03-13    0.017008
2016-03-05    0.022897
2016-03-24    0.029280
2016-03-16    0.030125
2016-03-27    0.030989
2016-03-17    0.031278
2016-03-25    0.031751
2016-03-31    0.031875
2016-03-10    0.031895
2016-03-23    0.032060
2016-03-26    0.032266
2016-03-22 

Approximately 0% of ads were created in 2015 and the beginning of 2016. In March the daily share rises suddenly from about 0% to 1% and then 3% within the first few days. This level is maintained until the first few days of April, after which the share of ads created drops to almost 0% again. 

The ad creation dates span a wide range, from June 2015 to April 2016. However, most of the ads were created in the two-month period of March and April 2016.



**The `registration_year` column**

In our initial exploration of the data, we observed that the minimum registration year is 1000. We looked further into the data and found a number of unrealistic registration years: 1000, 1001, 1111, 1500, 1800, 2800, 4100, 4500, 4800, 5000, 5911, 6200, 8888, 9000, 9996 and 9999.

Let's look into this once again to see if there are other unrealistic values.

In [43]:
autos["registration_year"].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

After analysing the date columns, we know that the ads were created in 2015 and 2016 and were last seen in 2016. Therefore, a car cannot have been first registered after 2016. It is also unlikely that a car first registered more than 100 years ago is still being sold.

We will look at how reasonable it is to remove these odd values in the next section.

## Dealing with Incorrect Registration Year Data

We'll now check how many listings fall outside the 1910-2016 range to see how much removing these rows would affect our dataset.

In [44]:
outside_range = (autos["registration_year"] < 1910) | (autos["registration_year"] > 2016)
outside_range.value_counts(normalize = True)

False    0.961207
True     0.038793
Name: registration_year, dtype: float64

Since only about 3.9% of the listings fall outside this range, we will remove these rows.

In [45]:
autos = autos[autos["registration_year"].between(1910, 2016)] 
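
The upper bound of 2016 comes from the crawl dates. As a sketch, the bound could be derived from the `last_seen` column rather than typed in, so the filter stays consistent with the data. The `sample` frame below is made up, and 1910 remains a manually chosen lower bound.

```python
import pandas as pd

# Made-up rows standing in for `autos`.
sample = pd.DataFrame({
    "registration_year": [1000, 1910, 1999, 2016, 2017, 9999],
    "last_seen": ["2016-04-06 06:45:54"] * 6,
})

# The first four characters of each timestamp are the year.
latest_year = int(sample["last_seen"].str[:4].max())

# between() is inclusive, so 1910 and latest_year themselves are kept.
valid = sample["registration_year"].between(1910, latest_year)
outside_share = (~valid).mean()
cleaned = sample[valid]
print(latest_year, outside_share)
```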

In [46]:
autos["registration_year"].value_counts(normalize = True).sort_index()

1910    0.000107
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000043
1937    0.000086
1938    0.000021
1939    0.000021
1941    0.000043
1943    0.000021
1948    0.000021
1950    0.000064
1951    0.000043
1952    0.000021
1953    0.000021
1954    0.000043
1955    0.000043
1956    0.000086
1957    0.000043
1958    0.000086
1959    0.000129
1960    0.000493
1961    0.000129
1962    0.000086
1963    0.000171
1964    0.000257
1965    0.000364
1966    0.000471
1967    0.000557
1968    0.000557
          ...   
1987    0.001542
1988    0.002892
1989    0.003727
1990    0.007433
1991    0.007262
1992    0.007926
1993    0.009104
1994    0.013474
1995    0.026285
1996    0.029412
1997    0.041794
1998    0.050620
1999    0.062060
2000    0.067608
2001    0.056468
2002    0.053255
2003    0.057818
2004    0.057904
2005    0.062895
2006    0.057197
2007    0.048778
2008    0.047450
2009    0.044665
2010    0.034040
2011    0.034768
2012    0.028063
2013    0.017202
2014    0.0142

In [47]:
autos["registration_year"].value_counts(normalize = True).sort_values().tail(30)

1985    0.002035
1988    0.002892
1989    0.003727
1991    0.007262
1990    0.007433
1992    0.007926
2015    0.008397
1993    0.009104
1994    0.013474
2014    0.014203
2013    0.017202
2016    0.026135
1995    0.026285
2012    0.028063
1996    0.029412
2010    0.034040
2011    0.034768
1997    0.041794
2009    0.044665
2008    0.047450
2007    0.048778
1998    0.050620
2002    0.053255
2001    0.056468
2006    0.057197
2003    0.057818
2004    0.057904
1999    0.062060
2005    0.062895
2000    0.067608
Name: registration_year, dtype: float64

From our observation, the majority of the cars were first registered between 1994 and 2016.
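
One way to put a number on an observation like this is to sum the normalized counts over the year range of interest. The following is a minimal sketch on made-up data; the `years` series is hypothetical and only stands in for `autos["registration_year"]`.

```python
import pandas as pd

# Hypothetical toy registration years, not taken from the real dataset
years = pd.Series([1970, 1985, 1994, 1999, 2000, 2000, 2003, 2005, 2010, 2016])

# Relative frequency of each year, ordered by year
year_share = years.value_counts(normalize=True).sort_index()

# Label-based slicing on a sorted index includes both endpoints
share_1994_2016 = year_share.loc[1994:2016].sum()
print(round(share_1994_2016, 2))
```

Applied to the real column, the same slice would give the share of listings registered in that period.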

## Exploring Price by Brand
We'll now explore the prices of listed cars on eBay by brand.

In [48]:
autos["brand"].value_counts(normalize = True).sort_values(ascending = False)

volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
mitsubishi        0.008226
honda             0.007840
kia               0.007069
alfa_romeo        0.006641
porsche           0.006127
suzuki            0.005934
chevrolet         0.005698
chrysler          0.003513
dacia             0.002635
daihatsu          0.002506
jeep              0.002271
subaru            0.002142
land_rover        0.002099
saab              0.001649
jaguar            0.001564
daewoo            0.001500
trabant           0.001392
r

As there are numerous car brands in our dataset, we'll look only at brands that each make up more than 5% of the listings.

In [49]:
brands_ordered = autos["brand"].value_counts(normalize = True)
sel_brands = brands_ordered[brands_ordered > 0.05].index
sel_brands

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')

The top 6 brands of cars listed on the site, with the home country of each manufacturer, are:
1. Volkswagen (Germany)
2. BMW (Germany)
3. Opel (Germany)
4. Mercedes Benz (Germany)
5. Audi (Germany)
6. Ford (United States)

5 out of 6 of these manufacturers are German. As we are analysing data from a German eBay website, it makes sense that German brands are more popular on the site.

Next, we'll look at the mean prices of these brands.

In [50]:
# Calculate the mean price of selected brands
mean_price = {}
for b in sel_brands:    
    mean_price[b] = round(autos[autos["brand"] == b]["price"].mean(), 2)
    
mean_price

{'volkswagen': 5402.41,
 'bmw': 8332.82,
 'opel': 2975.24,
 'mercedes_benz': 8628.45,
 'audi': 9336.69,
 'ford': 3749.47}

BMW, Mercedes Benz and Audi have the highest mean prices, while Opel and Ford are at the lower end. Volkswagen's mean price sits in between, making it comparatively affordable. This may partly explain why Volkswagen is the most listed brand on the site.
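
The per-brand loop can also be written as a single `groupby` aggregation. This is a sketch of that alternative on made-up data, not the method used in this project; the `toy` DataFrame and `sel` list are hypothetical stand-ins for `autos` and `sel_brands`.

```python
import pandas as pd

# Hypothetical toy listings standing in for the `autos` DataFrame
toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel", "ford", "kia"],
    "price": [8000, 9000, 3000, 2000, 4000, 5000],
})
sel = ["bmw", "opel", "ford"]  # stand-in for sel_brands

# Keep only the selected brands, then average the price within each brand
mean_price_by_brand = (
    toy[toy["brand"].isin(sel)]
    .groupby("brand")["price"]
    .mean()
    .round(2)
    .sort_values(ascending=False)
)
print(mean_price_by_brand.to_dict())
```

The result is a Series indexed by brand, which can be sorted directly and avoids filtering the DataFrame once per brand.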

## Exploring Mileage

We'll now find the mean mileage of the top 6 car brands listed.

In [51]:
# Calculate the mean mileage of selected brands
mean_mileage = {}
for b in sel_brands:    
    mean_mileage[b] = round(autos[autos["brand"] == b]["odometer_km"].mean(),2)

mean_mileage

{'volkswagen': 128707.16,
 'bmw': 132572.51,
 'opel': 129310.04,
 'mercedes_benz': 130788.36,
 'audi': 129157.39,
 'ford': 124266.01}

To compare the two, we'll create a DataFrame that combines our `mean_price` and `mean_mileage` dictionaries.

In [52]:
mp_series = pd.Series(mean_price)
mm_series = pd.Series(mean_mileage)

mmp_df = pd.DataFrame(mp_series, columns = ['mean_price'])
mmp_df["mean_mileage"] = mm_series

mmp_df

| | mean_price | mean_mileage |
| :--- | ---: | ---: |
| volkswagen | 5402.41 | 128707.16 |
| bmw | 8332.82 | 132572.51 |
| opel | 2975.24 | 129310.04 |
| mercedes_benz | 8628.45 | 130788.36 |
| audi | 9336.69 | 129157.39 |
| ford | 3749.47 | 124266.01 |


We can see that the mean mileages of the top 6 brands are very close, ranging from 124266.01 km to 132572.51 km. BMW has the highest mean mileage and Ford the lowest.
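
The comparison can be checked programmatically. The sketch below rebuilds the table from the values reported in the outputs above, using one `pd.DataFrame` call as an alternative construction, and then looks up the extremes. The variable name `spread_km` is introduced here for illustration only.

```python
import pandas as pd

# Values copied from the mean_price and mean_mileage outputs above
mean_price = {"volkswagen": 5402.41, "bmw": 8332.82, "opel": 2975.24,
              "mercedes_benz": 8628.45, "audi": 9336.69, "ford": 3749.47}
mean_mileage = {"volkswagen": 128707.16, "bmw": 132572.51, "opel": 129310.04,
                "mercedes_benz": 130788.36, "audi": 129157.39, "ford": 124266.01}

# Build both columns at once; the Series share the same brand index
mmp_df = pd.DataFrame({
    "mean_price": pd.Series(mean_price),
    "mean_mileage": pd.Series(mean_mileage),
})

# Brands with the highest and lowest mean mileage
highest = mmp_df["mean_mileage"].idxmax()
lowest = mmp_df["mean_mileage"].idxmin()

# Gap between the highest and lowest mean mileage, in km
spread_km = mmp_df["mean_mileage"].max() - mmp_df["mean_mileage"].min()
print(highest, lowest, round(spread_km, 2))
```

The gap between the two extremes is roughly 8,300 km, which is small relative to mean mileages of around 130,000 km.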

## Summary

In summary, these are the things that we have achieved in this project:
- We cleaned the column names to be snake case and more descriptive.
- We converted columns with numbers from string to numeric types.
- We dropped the columns whose values were almost all the same, namely `nr_of_pictures`, `seller` and `offer_type`.
- We cleaned the `price` column of unrealistic values. These were values that were too high or 0.
- We analysed the distribution of values in the date columns, namely `date_crawled`, `last_seen` and `ad_created`.
- We removed unrealistic dates from the `registration_year` column. These were the dates that were outside of the 1910-2016 range.
- We explored the mean price and mean mileage of the top 6 car brands listed on the site. These brands are Volkswagen, BMW, Opel, Mercedes Benz, Audi and Ford. We found that Volkswagen is the most listed brand, possibly because its mean price is comparatively affordable. Additionally, the mean mileages of these brands are close in value (124k - 132k km).