# Exploring eBay Car Listings

## Introduction
> "eBay Kleinanzeigen", which translates into English as "eBay Small Ads", is a German marketplace where users can find deals on everything from household goods to clothing, garden tools, electronics, and second-hand items. More interestingly, it is a popular platform for selling cars. Throughout this project, we will be working with a dataset of 50,000 used car listings. The aim of this project is to clean the various columns of the dataset and explore the characteristics of used cars in this marketplace. The first part of this notebook focuses on preparing and cleaning the dataset, while the second part explores the cleaned data and highlights interesting findings.

In [1]:
# Importing Libraries
import pandas as pd
import numpy as np
import datetime as dt
from tabulate import tabulate

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Loading in the dataset
autos = pd.read_csv("/Users/omarstinner/Data Files/Python Projects/Files/Guided Project - Exploring eBay Car Sales Data/autos.csv", encoding = "Latin-1")

In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

> Before cleaning and exploring our data, let’s first get a macro overview of the dataset as a whole and take a look at some of the characteristics of our columns. Looking at the output above, we see that the "vehicleType", "gearbox", "model", "fuelType", and "notRepairedDamage" columns all include null values. However, none of these columns has more than 20% null values. We can also tell that the following columns have been assigned an unsuitable data type (object): "dateCrawled", "price", "odometer", "dateCreated", and "lastSeen". For "price" and "odometer", this is due to symbols in the column contents, such as "$" (for price) and "km" (for odometer); the date columns are simply stored as text.
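
> As a rough illustration of the null-value check described above, the share of missing values per column can be computed directly. This is only a sketch: the `sample` DataFrame and its values are made up and stand in for the real dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
sample = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0", "$1,200"],
    "odometer": ["150,000km", "70,000km", "125,000km", "5,000km"],
    "gearbox": ["manuell", None, "automatik", "manuell"],
})

# isnull() marks missing cells; the column mean of those flags is the null share
null_share = sample.isnull().mean() * 100
print(null_share)
```

> Applied to `autos`, the same expression would give the percentage of nulls for each of the 20 columns at once.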

## Part 1: Cleaning The Data

In [4]:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']

> **What's Happening?** We are changing the column names from camel case to snake case because snake case is easier to read and is the conventional naming style in Python. We are also rewording some of the column names so that they describe the column's contents more accurately.
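
> The notebook renames the columns by hand, which also allows rewording. For the purely mechanical part of the conversion, a small helper could be used instead. The function name `camel_to_snake` below is hypothetical and not part of the original notebook.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

original = ["dateCrawled", "offerType", "powerPS", "yearOfRegistration"]
converted = [camel_to_snake(col) for col in original]
print(converted)
```

> Note that this only changes the case style: "yearOfRegistration" becomes "year_of_registration", not the reworded "registration_year" chosen above.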

In [5]:
# Looking at the descriptive statistics for all the columns
display(autos.describe(include = "all"))

# Further inspecting the "nr_of_pictures" and "postal_code" columns
print(autos["nr_of_pictures"].value_counts())
print(autos["postal_code"].value_counts())

# Dropping the columns "seller", "offer_type", "nr_of_pictures"
autos = autos.drop(["seller", "offer_type", "nr_of_pictures"], axis = 1)

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-11 22:38:16,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


0    50000
Name: nr_of_pictures, dtype: int64
10115    109
65428    104
66333     54
45888     50
44145     48
        ... 
71576      1
76776      1
76872      1
91233      1
67585      1
Name: postal_code, Length: 7014, dtype: int64


> **What's Happening?** We are looking at descriptive statistics of the dataset to find columns we can remove entirely. The first candidates are "seller" and "offer_type", because each has only 2 unique values and 49,999 of the 50,000 rows share the same value. "ab_test", "gearbox", and "unrepaired_damage" also have only 2 unique values, but their rows are split more evenly between the two values. The "nr_of_pictures" and "postal_code" columns show "NaN" for the "unique" and "top" statistics, which is simply because they are numeric columns. After further exploration, we see that only "nr_of_pictures" warrants a drop, as every value in the column is 0, while the "postal_code" column has 7,014 unique values that could be useful in our analysis, so we keep it. We have therefore dropped "seller", "offer_type", and "nr_of_pictures".
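
> The reasoning above (drop a column when almost every row holds the same value) can be sketched as a small check. The `sample` data and the `threshold` value are assumptions chosen for illustration, not values used in the notebook.

```python
import pandas as pd

# Made-up sample: one near-constant column, one constant column, one balanced column
sample = pd.DataFrame({
    "seller": ["privat"] * 9 + ["gewerblich"],
    "nr_of_pictures": [0] * 10,
    "ab_test": ["test", "control"] * 5,
})

# Share of rows taken by the most frequent value in each column
top_share = {
    col: sample[col].value_counts(normalize=True).iloc[0]
    for col in sample.columns
}

threshold = 0.85  # hypothetical cutoff for "almost constant"
drop_candidates = [col for col, share in top_share.items() if share >= threshold]
print(drop_candidates)
```

> A check like this only proposes candidates; whether a column is actually dropped remains a judgment call, as with "postal_code" above.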

In [6]:
# Converting "vehicle_type" column to English
autos["vehicle_type"] = autos["vehicle_type"].str.replace("kleinwagen","miniature car")
autos["vehicle_type"] = autos["vehicle_type"].str.replace("kombi","station wagon")
autos["vehicle_type"] = autos["vehicle_type"].str.replace("andere","other")

# Converting "gearbox" column to English
autos["gearbox"] = autos["gearbox"].str.replace("manuell", "manual")
autos["gearbox"] = autos["gearbox"].str.replace("automatik", "automatic")

# Converting "fuel_type" column to english
fuel_type_dict = {"lpg" : "lpg",
                 "benzin" : "petrol",
                 "diesel" : "diesel",
                 "nan" : "nan",
                 "cng" : "cng",
                 "hybrid" : "hybrid",
                 "elektro" : "electric",
                 "andere" : "other"}
autos["fuel_type"] = autos["fuel_type"].map(fuel_type_dict)

# Converting "unrepaired_damage" column to english
unrepaired_damage_dict = {"nein" : "no",
                         "ja" : "yes",
                         "nan" : "nan"}
autos["unrepaired_damage"] = autos["unrepaired_damage"].map(unrepaired_damage_dict)

# Converting "model" column to english
autos["model"] = autos["model"].str.replace("andere", "other")
autos["model"] = autos["model"].str.replace("_reihe", "_series")
autos.loc[(autos["brand"] == "bmw") & (autos["model"].str[-2:] == "er"), "model"] = autos.loc[(autos["brand"] == "bmw") & (autos["model"].str[-2:] == "er"), "model"].str.replace("er", "_series")
autos["model"] = autos["model"].str.replace("klasse", "class")
autos["model"] = autos["model"].str.replace("oth_series", "other")

# Converting the "brand" column to english
autos["brand"] = autos["brand"].str.replace("sonstige_autos", "miscellaneous_cars")

> **What's Happening?** The values in the "vehicle_type", "gearbox", "fuel_type", "unrepaired_damage", "model", and "brand" columns are in German, so we are translating them into English.
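
> The cell above uses both `str.replace` and `map` for translation. One difference worth keeping in mind is how values missing from the dictionary are handled. The sketch below uses a made-up series, and the value "halbautomatik" is hypothetical, included only to show the behaviour.

```python
import pandas as pd

# Made-up series; "halbautomatik" is a hypothetical value absent from the dictionary
gearbox = pd.Series(["manuell", "automatik", None, "halbautomatik"])
translations = {"manuell": "manual", "automatik": "automatic"}

# map() returns NaN for every value that is not a key of the dictionary
mapped = gearbox.map(translations)

# replace() leaves values without a dictionary entry unchanged
replaced = gearbox.replace(translations)

print(mapped.tolist())
print(replaced.tolist())
```

> With `map`, an untranslated value silently becomes NaN, so it is worth checking `value_counts()` before and after to confirm nothing was lost.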

In [7]:
# Removing Symbols from the "price" column
autos["price"] = autos["price"].str.replace(",", "")
autos["price"] = autos["price"].str.replace("$", "", regex = False).astype("int64")

# Removing Symbols from the "odometer" column
autos["odometer"] = autos["odometer"].str.replace(",","")
autos["odometer"] = autos["odometer"].str.replace("km","").astype("int64")
autos.rename({"odometer" : "odometer_km"}, axis = 1, inplace = True)

print(autos["price"].value_counts().head(20))
print(autos["price"].unique().shape)
print(autos["price"].value_counts().sort_index(ascending = False))
print(autos["price"].max())
print(autos["price"].value_counts().sort_index(ascending = False).head(20))
print(autos["price"].value_counts().sort_index(ascending = False).tail(40))    

0       1421
500      781
1500     734
2500     643
1000     639
1200     639
600      531
800      498
3500     498
2000     460
999      434
750      433
900      420
650      419
850      410
700      395
4500     394
300      384
2200     382
950      379
Name: price, dtype: int64
(2357,)
99999999       1
27322222       1
12345678       3
11111111       2
10000000       1
            ... 
5              2
3              1
2              3
1            156
0           1421
Name: price, Length: 2357, dtype: int64
99999999
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64
111       2
110       3
100     134
99       19
90        5
89        1
80       15
79        1
75        5
70       10
66        1
65        5
60        9
59      

> **What's Happening?** Here we are cleaning the "price" and "odometer_km" columns. We remove the "," symbol from both columns, the "\$" symbol from the "price" column, and the "km" suffix from the "odometer_km" column so that both can be converted to integers. We then examine the "price" column further to identify inaccurate rows. We can see that 1,421 cars are priced at \$0, which does not seem accurate, as people would not give their cars away for nothing. We also see listings priced above \$1 million, which is also hard to believe. Some cars are in fact worth this much, but such sales usually take place in a more formal setting, such as an auction house, rather than on eBay. For these reasons, we will remove such outliers in the next cell.
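
> The symbol removal described above can be tried on a few made-up values. Passing `regex=False` makes it explicit that "$" is meant as a literal character rather than a regular-expression anchor; the sample strings are assumptions for illustration.

```python
import pandas as pd

# Made-up raw values in the same format as the dataset
raw_price = pd.Series(["$5,000", "$0", "$1,234,566"])
raw_odometer = pd.Series(["150,000km", "5,000km", "70,000km"])

# Strip the symbols as literal text, then convert to integers
price = (raw_price.str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype("int64"))
odometer_km = (raw_odometer.str.replace("km", "", regex=False)
                           .str.replace(",", "", regex=False)
                           .astype("int64"))

print(price.tolist())
print(odometer_km.tolist())
```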

In [8]:
# Creating a function that detects outliers and removes them
def outlier_remover(df, cols_list, change = False):
    data = []
    index_list = []
    print("number of rows before removing outliers: ", df.shape[0])
    for col in cols_list:
        # Defining the upper and lower limits for outlier detection.
        # Note: q1 and q3 hold the 5th and 95th percentiles here (not the quartiles),
        # so "iqr" is the 5th-95th percentile range, which makes the limits more inclusive
        q1 = df[col].quantile(0.05)
        q3 = df[col].quantile(0.95)
        iqr = q3 - q1
        
        upper_limit = q3 + 1.5 * iqr
        lower_limit = q1 - 1.5 * iqr
        
        # If the lower_limit is below 0, set it to 1 so that values of 0 are treated as outliers
        if lower_limit < 0:
            lower_limit = 1
        
        # Selecting the rows outside the limits, then recording whether any exist and how many
        outlier_rows = df[(df[col] < lower_limit) | (df[col] > upper_limit)]
        count = outlier_rows.shape[0]
        true_or_false = count > 0
        # Adding to the index_list the index labels of the rows that have an outlier
        index_list.extend(outlier_rows.index)
            
        
        # Append to the data list
        data.append([col, true_or_false, count, lower_limit, upper_limit])
    
    # Making sure that the same index labels/numbers aren't repeated
    index_list = list(set(index_list))

    # Removing the rows deemed outliers
    if change:
        df = df.drop(index_list)
        
    
    table = tabulate(data, headers = ["column", "outliers (T/F)", "count", "lower limit", "upper limit" ], tablefmt = "rst", numalign = "right")
    print("number of rows after removing outliers: ", df.shape[0])
    print(table)
    return df

# Removing outliers from both the "price" and "odometer_km" column
autos = outlier_remover(autos, ["price", "odometer_km"], change = True)

number of rows before removing outliers:  50000
number of rows after removing outliers:  48364
column       outliers (T/F)      count    lower limit    upper limit
price        True                 1636              1          49450
odometer_km  False                   0              1         330000


> **What's Happening?** We created a function that detects outliers and, when `change = True`, removes the rows that contain them. The limits are computed from the 5th and 95th percentiles of each column, extended by 1.5 times the range between them. To ensure that listings priced at $0 or with an odometer reading of 0 km are not kept, the lower limit is set to 1 whenever the computed limit is negative. The function prints a table showing whether each column has outliers, how many there are, and the lower and upper limits. We see that the "price" column had 1,636 outliers, while "odometer_km" had none. We have used the function to remove these rows.
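
> To make the limit calculation concrete, here is a small worked example on made-up prices. It follows the same arithmetic as `outlier_remover` (5th and 95th percentiles, 1.5 times their range, lower limit reset to 1), but the data and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up prices: 99 plausible values plus a free listing and an absurd one
prices = pd.Series([0] + list(range(100, 10000, 100)) + [99999999])

low = prices.quantile(0.05)    # 5th percentile
high = prices.quantile(0.95)   # 95th percentile
spread = high - low

upper_limit = high + 1.5 * spread
lower_limit = low - 1.5 * spread
if lower_limit < 0:
    lower_limit = 1  # same rule as in outlier_remover: exclude values of 0

kept = prices[(prices >= lower_limit) & (prices <= upper_limit)]
print(lower_limit, upper_limit, len(kept))
```

> Both the 0 and the 99,999,999 listings fall outside the limits here, while all 99 ordinary prices are kept.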

In [9]:
# Converting the data type of the date related columns to a datetime object
autos["date_crawled"] = pd.to_datetime(autos["date_crawled"])
autos["ad_created"] = pd.to_datetime(autos["ad_created"])
autos["last_seen"] = pd.to_datetime(autos["last_seen"])

# Only selecting rows where registration_year is between 1900 and 2016
autos = autos[autos["registration_year"].between(1900,2016)]

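# Creating a new "MM-YYYY" column from the registration month and year
autos["registration_month+year"] = ("0" + autos["registration_month"].astype(str) + "-" + autos["registration_year"].astype(str)).str[-7:]
# Listings with registration month 0 (unknown) have no valid date.
# Note: this assignment sets every column of those rows to NaN, not only the new column
autos[autos["registration_month+year"].str[:2] == "00"] = np.nan
autos["registration_month+year"] = pd.to_datetime(autos["registration_month+year"], format = "%m-%Y")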

> **What's Happening?** We are converting the "date_crawled", "ad_created", and "last_seen" columns into datetime objects. Since every listing was last seen in 2016, a car cannot have been registered after that year. For that reason, we keep only listings with registration years between 1900 and 2016. Although this removes most of the invalid rows, we still need to deal with listings registered in 2016 but in a month after their "last_seen" date. This means we need to compare the two dates at the monthly level, so we created another column called "registration_month+year", which lets us compare the registration date with the last seen date. Listings with a registration month of 0 have no valid date and are set to NaN.
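
> In the cell above, the boolean-mask assignment sets every column of the matching rows to NaN. If the intent is to invalidate only the date, one possible alternative is to let `pd.to_datetime` coerce the invalid month. This is a sketch on made-up data; the column name `registration_date` is hypothetical.

```python
import pandas as pd

# Made-up rows; the second has an unknown registration month (0)
cars = pd.DataFrame({
    "registration_year": [2003, 2016, 1999],
    "registration_month": [5, 0, 12],
    "brand": ["ford", "bmw", "opel"],
})

# Build zero-padded "MM-YYYY" strings
month_year = (cars["registration_month"].astype(str).str.zfill(2)
              + "-" + cars["registration_year"].astype(str))

# errors="coerce" turns the invalid month "00" into NaT in this column only
cars["registration_date"] = pd.to_datetime(month_year, format="%m-%Y", errors="coerce")
print(cars)
```

> With this approach the other columns of the affected rows stay intact, so later statistics would be computed on more rows than in the notebook's outputs.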

In [10]:
# Removing rows where the "registration_month+year" is greater than the "last_seen"
remove_rows = autos[autos["registration_month+year"] > autos["last_seen"]].index
autos = autos.drop(remove_rows)

> **What's Happening?** We removed the rows where the registration date (month and year) falls after the "last_seen" date.

## Part 2: Exploring The Data

### Exploring The Date Columns

In [11]:
autos["date_crawled"].astype(str).str[:10].value_counts(normalize = True).sort_index()

2016-03-05    0.022936
2016-03-06    0.013159
2016-03-07    0.032997
2016-03-08    0.030181
2016-03-09    0.030160
2016-03-10    0.029527
2016-03-11    0.030225
2016-03-12    0.033848
2016-03-13    0.014360
2016-03-14    0.033302
2016-03-15    0.030989
2016-03-16    0.026930
2016-03-17    0.028479
2016-03-18    0.011784
2016-03-19    0.031316
2016-03-20    0.034742
2016-03-21    0.034000
2016-03-22    0.029658
2016-03-23    0.029461
2016-03-24    0.026646
2016-03-25    0.028326
2016-03-26    0.029439
2016-03-27    0.028065
2016-03-28    0.031600
2016-03-29    0.030945
2016-03-30    0.030640
2016-03-31    0.029156
2016-04-01    0.031251
2016-04-02    0.032844
2016-04-03    0.035899
2016-04-04    0.033498
2016-04-05    0.011784
2016-04-06    0.002815
2016-04-07    0.001309
NaT           0.087729
Name: date_crawled, dtype: float64

> We can see that all the listings were scraped between 5 March and 7 April 2016. The daily share of scraped listings is roughly uniform, at around 3% per day. In March, only the 6th, 13th, and 18th have noticeably fewer scraped listings, and the share drops sharply over the last three days of the crawl (5 to 7 April). The NaT entry (about 8.8%) corresponds to the rows that were set to NaN in cell 9.
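
> The daily shares above are obtained by slicing the timestamp strings. Since the column is already a datetime, the `.dt` accessor offers an equivalent route; the sketch below uses made-up timestamps.

```python
import pandas as pd

# Made-up crawl timestamps
crawled = pd.to_datetime(pd.Series([
    "2016-03-05 14:06:30", "2016-03-05 09:15:00",
    "2016-03-06 18:20:11", "2016-04-07 06:17:27",
]))

# Keep only the calendar date, then compute each day's share of the rows
daily_share = crawled.dt.date.value_counts(normalize=True).sort_index()
print(daily_share)
```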

In [12]:
autos["last_seen"].astype(str).str[:10].value_counts(normalize = True).sort_index()

2016-03-05    0.001004
2016-03-06    0.003579
2016-03-07    0.004932
2016-03-08    0.006612
2016-03-09    0.008751
2016-03-10    0.009559
2016-03-11    0.011239
2016-03-12    0.021627
2016-03-13    0.007878
2016-03-14    0.011370
2016-03-15    0.014425
2016-03-16    0.014752
2016-03-17    0.025009
2016-03-18    0.006765
2016-03-19    0.014054
2016-03-20    0.018441
2016-03-21    0.018615
2016-03-22    0.018702
2016-03-23    0.016476
2016-03-24    0.017611
2016-03-25    0.016957
2016-03-26    0.015145
2016-03-27    0.013749
2016-03-28    0.018637
2016-03-29    0.019881
2016-03-30    0.022063
2016-03-31    0.021474
2016-04-01    0.021278
2016-04-02    0.023089
2016-04-03    0.022892
2016-04-04    0.021692
2016-04-05    0.116383
2016-04-06    0.205661
2016-04-07    0.121969
NaT           0.087729
Name: last_seen, dtype: float64

> The "last_seen" column shows the last day a listing was seen online by the crawler. For the purpose of this project, we will assume that a listing disappeared because the car was sold. Looking at the distribution of cars "sold" per day, we see a disproportionate share in the last three days (5 to 7 April), which together account for about 44% of the rows. This spike does not seem reasonable as a real jump in sales, so we interpret it as an artifact of the scraping process (for example, the crawl coming to an end) rather than a genuine surge.

In [13]:
autos["ad_created"].astype(str).str[:10].value_counts(normalize = True).sort_index()

2015-06-11    0.000022
2015-08-10    0.000022
2015-09-09    0.000022
2015-11-10    0.000022
2015-12-05    0.000022
                ...   
2016-04-04    0.033848
2016-04-05    0.010650
2016-04-06    0.002924
2016-04-07    0.001157
NaT           0.087729
Name: ad_created, Length: 75, dtype: float64

> We see that most of the ads were created around the same time the listings were crawled. However, a few ads date back to 2015, with the oldest created in June 2015, roughly 10 months before the last "date_crawled" date.

In [14]:
print(autos["registration_year"].describe())
print(autos["registration_year"].mode())
display(autos[autos["registration_year"] == 1927])

count    41803.000000
mean      2002.981485
std          6.754847
min       1927.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64
0    2005.0
dtype: float64


Unnamed: 0,date_crawled,name,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen,registration_month+year
21416,2016-03-12 08:36:21,Essex_super_six__Ford_A,16500.0,control,cabrio,1927.0,manual,40.0,other,5000.0,5.0,petrol,ford,,2016-03-12,74821.0,2016-03-15 12:45:12,1927-05-01


> We see that the most common registration year is 2005, about 11 years before the 2016 crawl. Based on this data, this suggests that owners keep their car for roughly a decade before selling it (assuming the original owner is the one selling the car). It is interesting to see a car from 1927 being sold on eBay (a Ford convertible, listed as "cabrio"). There are also owners of recently registered (2016) cars who are looking to sell.

### Exploring The "price" and "odometer_km" columns

In [15]:
brand_mean_prices = {}
brand_mean_kilometerage = {}
brand_count = {}
brands = autos["brand"].unique()

for b in brands:
    b_rows = autos[autos["brand"] == b]
    count = b_rows.shape[0]
    mean_price = b_rows["price"].mean()
    brand_count[b] = count
    brand_mean_prices[b] = mean_price
    
for b in brands:
    b_rows = autos[autos["brand"] == b]
    mean_kilometerage = b_rows["odometer_km"].mean()
    brand_mean_kilometerage[b] = mean_kilometerage
    
bmp_series = pd.Series(brand_mean_prices)
bmk_series = pd.Series(brand_mean_kilometerage)
bc_series = pd.Series(brand_count)

bmp_bmk_bc_df = pd.DataFrame(bmp_series, columns = ["mean_price"])
bmp_bmk_bc_df["mean_kilometerage"] = bmk_series
bmp_bmk_bc_df["count"] = bc_series

print(bc_series.sort_values(ascending = False))

print(bmp_bmk_bc_df.sort_values(["mean_price","mean_kilometerage"], ascending = False).iloc[:10])

volkswagen            8716
bmw                   4716
opel                  4334
mercedes_benz         4179
audi                  3690
ford                  2872
renault               1948
peugeot               1255
fiat                  1066
seat                   765
skoda                  721
nissan                 647
mazda                  626
smart                  603
citroen                576
toyota                 559
hyundai                430
mini                   395
volvo                  394
miscellaneous_cars     382
mitsubishi             336
honda                  328
kia                    318
alfa_romeo             276
suzuki                 253
chevrolet              241
porsche                184
chrysler               145
dacia                  116
daihatsu                99
jeep                    99
subaru                  84
land_rover              84
saab                    73
jaguar                  65
daewoo                  64
rover                   55
t

> Looking at the results for the top 10 most expensive car brands, we see that the brand with the highest average price is "porsche", with a mean price of roughly 23,700 dollars, followed by "land_rover", whose average price is 37% lower than Porsche’s. The most popular brand is Volkswagen with 8,716 listings, followed by BMW (4,716), Opel (4,334), and Mercedes-Benz (4,179). Together with Audi, these five German brands account for roughly 61% of the listings, which is a reasonable outcome considering that this is a German marketplace. Lastly, we see that the mean mileage is quite similar across brands, which could indicate that after around 120,000 km, people are ready to sell their cars. However, it should be noted that "mini" cars have the lowest mileage, which is plausible given the nature and purpose of the car: the Mini is a city car mainly used for short distances, unlike the other cars, which are built for heavier use.
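
> The per-brand mean price, mean mileage, and listing count are built with explicit loops above. The same table can be sketched more compactly with `groupby` and named aggregation. The `cars` data below is made up for illustration.

```python
import pandas as pd

# Made-up listings for two brands
cars = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford", "ford"],
    "price": [8000, 10000, 3000, 4000, 5000],
    "odometer_km": [150000, 125000, 100000, 150000, 125000],
})

# One row per brand with the three statistics used in the analysis
summary = (cars.groupby("brand")
               .agg(mean_price=("price", "mean"),
                    mean_kilometerage=("odometer_km", "mean"),
                    count=("price", "size"))
               .sort_values("mean_price", ascending=False))
print(summary)
```

> Grouping also skips rows whose brand is missing, so no separate handling of NaN brands is needed.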

In [16]:
autos.groupby(["brand", "model"]).size().sort_values(ascending = False)[:10]

brand          model   
volkswagen     golf        3297
bmw            3_series    2403
opel           corsa       1373
volkswagen     polo        1372
               passat      1232
opel           astra       1196
audi           a4          1130
mercedes_benz  c_class     1073
bmw            5_series    1067
mercedes_benz  e_class      905
dtype: int64

## Conclusion
> Throughout this project, we have cleaned the data in various ways and have determined that "eBay Kleinanzeigen" is a marketplace dominated by German cars. Most people who sell their cars on this website tend to keep their car for roughly 10 years and sell it when the mileage reaches around 120,000 km. Among non-German brands, Ford and the French brands Renault and Peugeot are the most common. The most popular brand-model combination is the Volkswagen Golf, with 3,297 listings.
