# Exploring eBay Car Sales Data

The aim of this project is to clean and analyze used car listings from __eBay Kleinanzeigen__, the [classifieds](https://en.wikipedia.org/wiki/Classified_advertising) section of the German eBay website.

The dataset was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/). The version we work with here is __a sample of 50,000 data points__ prepared by [Dataquest](https://www.dataquest.io/), who also modified it to simulate a less-cleaned version of the data.

The data dictionary provided with the data is as follows:
- <code>dateCrawled</code> - When this ad was first crawled. All field-values are taken from this date.<br>
- <code>name</code> - Name of the car.<br>
- <code>seller</code> - Whether the seller is private or a dealer.<br>
- <code>offerType</code> - The type of listing.<br>
- <code>price</code> - The price on the ad to sell the car.<br>
- <code>abtest</code> - Whether the listing is included in an A/B test.<br>
- <code>vehicleType</code> - The vehicle Type.<br>
- <code>yearOfRegistration</code> - The year in which the car was first registered.<br>
- <code>gearbox</code> - The transmission type.<br>
- <code>powerPS</code> - The power of the car in PS.<br>
- <code>model</code> - The car model name.<br>
- <code>kilometer</code> - How many kilometers the car has driven.<br>
- <code>monthOfRegistration</code> - The month in which the car was first registered.<br>
- <code>fuelType</code> - What type of fuel the car uses.<br>
- <code>brand</code> - The brand of the car.<br>
- <code>notRepairedDamage</code> - Whether the car has damage that is not yet repaired.<br>
- <code>dateCreated</code> - The date on which the eBay listing was created.<br>
- <code>nrOfPictures</code> - The number of pictures in the ad.<br>
- <code>postalCode</code> - The postal code for the location of the vehicle.<br>
- <code>lastSeenOnline</code> - When the crawler saw this ad last online.<br>

---
## Data exploration

We will start with importing the libraries and reading the data. 

In [4]:
import pandas as pd
import numpy as np

# Read csv into pandas
autos = pd.read_csv("autos.csv", encoding = "Latin-1")

In [5]:
# Get information about the dataframe
autos.info()
# Print first 5 rows
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Initial observation:
- The dataset has 20 columns and 50,000 rows
- There are two data types: int64 and object
- Columns <code>vehicleType</code>, <code>gearbox</code>, <code>model</code>, <code>fuelType</code> and <code>notRepairedDamage</code> contain null values
- The column names use camelCase and will need to be changed to snake_case
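
A minimal sketch of how the null counts behind these observations could be summarised per column. The `toy` frame and its values are made up for illustration; in the notebook the same two calls would be applied to `autos`.

```python
import pandas as pd

# Hypothetical stand-in for `autos` with a few missing values
toy = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", None],
    "gearbox": ["manuell", "automatik", None, "manuell"],
    "brand": ["peugeot", "bmw", "volkswagen", "smart"],
})

null_counts = toy.isnull().sum()   # number of missing values per column
null_share = toy.isnull().mean()   # fraction of missing values per column

# Show only the columns that actually contain nulls
print(null_share[null_share > 0])
```

The share is often more useful than the raw count when deciding whether a column is still usable.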

---
## Data cleaning

First we will convert the column names from camelCase to snake_case and reword some of them to be more descriptive.

In [5]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [17]:
autos.columns = ['date_crawled', 'name', 'seller', 'offertype', 'price', 'abtest',
       'vehicletype', 'registration_year', 'gearbox', 'powerps', 'model',
       'odometer', 'registration_month', 'fueltype', 'brand',
       'unrepaired_damage', 'ad_created', 'nrofpictures', 'postalcode',
       'last_seen']
autos.head()

Unnamed: 0,date_crawled,name,seller,offertype,price,abtest,vehicletype,registration_year,gearbox,powerps,model,odometer,registration_month,fueltype,brand,unrepaired_damage,ad_created,nrofpictures,postalcode,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990,test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350,control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350,test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
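
The renaming above was done by hand. As an alternative, here is a small sketch of a helper that converts camelCase names mechanically. `camel_to_snake` is a hypothetical name, not part of pandas, and its output differs from the hand-written list in places (for example it yields `offer_type` rather than `offertype`), so the rest of this notebook keeps using the manual names.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase identifier to snake_case."""
    # Put an underscore before every capital that follows a lowercase letter or digit
    with_underscores = re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name)
    return with_underscores.lower()

columns = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "postalCode"]
print([camel_to_snake(c) for c in columns])
# ['date_crawled', 'offer_type', 'power_ps', 'nr_of_pictures', 'postal_code']
```

A helper like this only handles the mechanical part. Descriptive renames such as `yearOfRegistration` to `registration_year` would still need an explicit mapping.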


### Defining cleaning tasks

We will check what tasks need to be performed to clean the dataset. The following methods are helpful for exploring the data: <code>DataFrame.describe()</code>, <code>Series.value_counts()</code> and <code>Series.head()</code> if any columns need a closer look.

In [7]:
autos.describe(include='all')

Unnamed: 0,datecrawled,name,seller,offertype,price,abtest,vehicletype,registration_year,gearbox,powerps,model,odometer,registration_month,fueltype,brand,unrepaired_damage,ad_created,nrofpictures,postalcode,lastseen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-30 19:48:02,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


__Observations__:
- Columns <code>seller</code> and <code>offertype</code> each hold the same value in 49,999 of the 50,000 rows, so they carry almost no information and we can drop them
- Column <code>nrofpictures</code> looks suspicious (mean and standard deviation are both 0), so we will take a closer look at it
- Columns <code>price</code> and <code>odometer</code> are stored as text and need to be converted to numeric columns
- <code>registration_year</code> has a minimum of 1000 and a maximum of 9999, both implausible
- <code>registration_month</code> has a minimum of 0, which is not a valid month

Let's look at the column <code>nrofpictures</code> first.

In [8]:
autos["nrofpictures"].value_counts()

0    50000
Name: nrofpictures, dtype: int64

Every row in column <code>nrofpictures</code> stores the value 0. We will drop this column together with <code>seller</code> and <code>offertype</code>, since none of them will be helpful in our analysis.

In [9]:
autos = autos.drop(["nrofpictures", "seller", "offertype"], axis = 1)
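
Columns like these can also be found programmatically. The sketch below flags any column whose most frequent value covers at least a chosen share of the rows. The helper name `dominant_share`, the 85% threshold and the toy values are assumptions made for illustration.

```python
import pandas as pd

def dominant_share(series):
    """Return the share of rows taken by the most frequent value."""
    return series.value_counts(normalize=True, dropna=False).iloc[0]

# Hypothetical stand-in for `autos`
toy = pd.DataFrame({
    "seller": ["privat"] * 9 + ["gewerblich"],
    "nrofpictures": [0] * 10,
    "brand": ["bmw", "ford", "opel", "audi", "bmw",
              "ford", "opel", "audi", "seat", "fiat"],
})

shares = toy.apply(dominant_share)                      # one share per column
low_information = shares[shares >= 0.85].index.tolist()
print(low_information)  # ['seller', 'nrofpictures']
```

The threshold is a judgment call: a column dominated by one value can still matter if the rare values are the ones of interest.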

### Converting text values to numeric

Columns <code>price</code> and <code>odometer</code> store numeric values with special characters. We will clean these and convert them to numeric-only values.

In [10]:
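# regex=False treats "$" as a literal character rather than a regex anchor
autos["price"] = (autos["price"]
                  .str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype(int)
                 )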

# rename the column
autos.rename({"price" : "price_dollars"}, axis = 1, inplace = True)
autos["price_dollars"].head()

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price_dollars, dtype: int64

In [11]:
autos["odometer"] = (autos["odometer"]
                     .str.replace("km", "")
                     .str.replace(",", "")
                     .astype(int)
                    )

# rename the column
autos.rename({"odometer" : "odometer_km"}, axis = 1, inplace = True)
autos["odometer_km"].head()

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer_km, dtype: int64
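
Both conversions follow the same pattern, so they could share one helper. This is a sketch under the assumption that every value is a non-negative whole number. Stripping all non-digits would silently mangle negative or decimal values. `to_int` is a hypothetical name.

```python
import pandas as pd

def to_int(series):
    """Strip every non-digit character, then convert to integers."""
    digits_only = series.str.replace(r"\D", "", regex=True)
    return digits_only.astype(int)

prices = pd.Series(["$5,000", "$8,500", "$0"])
odometer = pd.Series(["150,000km", "70,000km"])

print(to_int(prices).tolist())    # [5000, 8500, 0]
print(to_int(odometer).tolist())  # [150000, 70000]
```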

### Price column - removing outliers

In [11]:
# Number of unique values
print(autos["price_dollars"].unique().shape)

(2357,)


In [12]:
# Descriptive statistics of the data
print(autos["price_dollars"].describe())

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_dollars, dtype: float64


In [13]:
# Count the 30 most frequent prices
print(autos["price_dollars"].value_counts().head(30))

0       1421
500      781
1500     734
2500     643
1000     639
1200     639
600      531
800      498
3500     498
2000     460
999      434
750      433
900      420
650      419
850      410
700      395
4500     394
300      384
2200     382
950      379
1100     376
1300     371
3000     365
550      356
1800     355
5500     340
1250     335
350      335
1600     327
1999     322
Name: price_dollars, dtype: int64


We can see that __1421 cars are listed with a price of $0__. Since this is less than 3% of the dataset, we can consider removing these listings. The maximum price is close to 100 million dollars, which is implausible for a used car. Let's take a closer look at the highest prices.

In [15]:
autos["price_dollars"].value_counts().sort_index(ascending=False).head(15)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
Name: price_dollars, dtype: int64

In [16]:
autos["price_dollars"].value_counts().sort_index(ascending=True).head(15)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
Name: price_dollars, dtype: int64

A handful of cars are priced above 1 million dollars. At the other end of the spectrum we see very low prices such as 1 dollar. Since eBay is an auction site, we assume these are opening bids and keep them. We will remove the $0 listings and anything above $350,000, since prices increase steadily up to that number and then jump to less realistic values.

In [16]:
autos = autos[autos["price_dollars"].between(1, 350000)]
autos["price_dollars"].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price_dollars, dtype: float64
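
As a quick consistency check, the row count after filtering can be reproduced from the value counts printed earlier. The numbers below are copied from those outputs; the variable names are ours.

```python
total_rows = 50_000
zero_price = 1_421

# Counts of the ten distinct prices above 350,000 in the top-15 table
high_price_counts = [1, 1, 3, 2, 1, 1, 1, 1, 2, 1]

removed = zero_price + sum(high_price_counts)
remaining = total_rows - removed
print(removed, remaining)  # 1435 48565
```

The result matches the `count` of 48565 reported by `describe()`, so the filter removed the $0 listings and the 14 extreme prices and nothing else.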

### Date columns

In the data set we have several columns with date information:
- `date_crawled`: added by the crawler
- `last_seen`: added by the crawler
- `ad_created`: from the website
- `registration_month`: from the website
- `registration_year`: from the website

These are a mix of dates taken from the listings themselves and metadata added by the crawler. Values in <code>date_crawled</code>, <code>last_seen</code> and <code>ad_created</code> are stored as strings.

Let's have a closer look at these.

In [23]:
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


From the above result, we can see that the first 10 characters of each value represent the day (for example 2016-03-26).

To understand the date range, we first select only the first 10 characters of each value. We then compute the relative frequency of each day (including missing values) and sort the result by the index.

In [24]:
(autos["date_crawled"]
         .str[:10]
         .value_counts(normalize=True, dropna=False)
         .sort_index()
)

2016-03-05    0.02538
2016-03-06    0.01394
2016-03-07    0.03596
2016-03-08    0.03330
2016-03-09    0.03322
2016-03-10    0.03212
2016-03-11    0.03248
2016-03-12    0.03678
2016-03-13    0.01556
2016-03-14    0.03662
2016-03-15    0.03398
2016-03-16    0.02950
2016-03-17    0.03152
2016-03-18    0.01306
2016-03-19    0.03490
2016-03-20    0.03782
2016-03-21    0.03752
2016-03-22    0.03294
2016-03-23    0.03238
2016-03-24    0.02910
2016-03-25    0.03174
2016-03-26    0.03248
2016-03-27    0.03104
2016-03-28    0.03484
2016-03-29    0.03418
2016-03-30    0.03362
2016-03-31    0.03192
2016-04-01    0.03380
2016-04-02    0.03540
2016-04-03    0.03868
2016-04-04    0.03652
2016-04-05    0.01310
2016-04-06    0.00318
2016-04-07    0.00142
Name: date_crawled, dtype: float64

In [26]:
(autos["date_crawled"]
         .str[:10]
         .value_counts(normalize=True, dropna=False)
         .sort_values()
)

2016-04-07    0.00142
2016-04-06    0.00318
2016-03-18    0.01306
2016-04-05    0.01310
2016-03-06    0.01394
2016-03-13    0.01556
2016-03-05    0.02538
2016-03-24    0.02910
2016-03-16    0.02950
2016-03-27    0.03104
2016-03-17    0.03152
2016-03-25    0.03174
2016-03-31    0.03192
2016-03-10    0.03212
2016-03-23    0.03238
2016-03-26    0.03248
2016-03-11    0.03248
2016-03-22    0.03294
2016-03-09    0.03322
2016-03-08    0.03330
2016-03-30    0.03362
2016-04-01    0.03380
2016-03-15    0.03398
2016-03-29    0.03418
2016-03-28    0.03484
2016-03-19    0.03490
2016-04-02    0.03540
2016-03-07    0.03596
2016-04-04    0.03652
2016-03-14    0.03662
2016-03-12    0.03678
2016-03-21    0.03752
2016-03-20    0.03782
2016-04-03    0.03868
Name: date_crawled, dtype: float64

It seems the site was crawled daily over roughly one month, from early March to early April 2016. The distribution of listings crawled on each day is roughly uniform, apart from a few quieter days and a drop-off over the last three days.
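
Slicing the first 10 characters works because the strings share one fixed format. A sketch of the same idea using real datetime values is shown below. It assumes every value follows the `YYYY-MM-DD HH:MM:SS` pattern seen above, and the three sample values are copied from the first rows of the table.

```python
import pandas as pd

crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
])

# Parse the strings, then drop the time of day so values group by calendar day
parsed = pd.to_datetime(crawled, format="%Y-%m-%d %H:%M:%S")
daily_share = parsed.dt.normalize().value_counts(normalize=True).sort_index()
print(daily_share)
```

Parsing costs a little more than slicing, but it fails loudly on malformed values and makes later date arithmetic (such as the number of days a listing stayed online) straightforward.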

In [27]:
(autos["last_seen"]
         .str[:10]
         .value_counts(normalize=True, dropna=False)
         .sort_index()
)

2016-03-05    0.00108
2016-03-06    0.00442
2016-03-07    0.00536
2016-03-08    0.00760
2016-03-09    0.00986
2016-03-10    0.01076
2016-03-11    0.01252
2016-03-12    0.02382
2016-03-13    0.00898
2016-03-14    0.01280
2016-03-15    0.01588
2016-03-16    0.01644
2016-03-17    0.02792
2016-03-18    0.00742
2016-03-19    0.01574
2016-03-20    0.02070
2016-03-21    0.02074
2016-03-22    0.02158
2016-03-23    0.01858
2016-03-24    0.01956
2016-03-25    0.01920
2016-03-26    0.01696
2016-03-27    0.01602
2016-03-28    0.02086
2016-03-29    0.02234
2016-03-30    0.02484
2016-03-31    0.02384
2016-04-01    0.02310
2016-04-02    0.02490
2016-04-03    0.02536
2016-04-04    0.02462
2016-04-05    0.12428
2016-04-06    0.22100
2016-04-07    0.13092
Name: last_seen, dtype: float64

The crawler recorded the date on which it last saw each listing. A listing that is no longer seen has presumably been removed, most likely because the car was sold on that day or the following one.

Over the last three days the share of listings seen for the last time rises sharply, to roughly 5 to 9 times the level of the preceding days. It is unlikely that car sales spiked in just those three days. More likely the jump reflects the end of the crawling period, when every listing still online was necessarily seen for the last time.
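
The size of the jump can be checked with a few lines of arithmetic. The shares below are copied from the output above; using 2016-04-04 as the baseline day is our choice.

```python
baseline = 0.02462  # share of listings last seen on 2016-04-04

last_days = {
    "2016-04-05": 0.12428,
    "2016-04-06": 0.22100,
    "2016-04-07": 0.13092,
}

# How many times higher each of the final days is than the baseline
ratios = {day: round(share / baseline, 1) for day, share in last_days.items()}
print(ratios)  # {'2016-04-05': 5.0, '2016-04-06': 9.0, '2016-04-07': 5.3}
```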

In [21]:
print(autos["ad_created"].unique().shape)

(76,)


In [30]:
(autos["ad_created"]
        .str[:10]
        .value_counts(normalize=True, dropna=False)
        .sort_values(ascending=False)
)

2016-04-03    0.03892
2016-03-20    0.03786
2016-03-21    0.03772
2016-04-04    0.03688
2016-03-12    0.03662
               ...   
2016-01-14    0.00002
2016-01-16    0.00002
2015-12-05    0.00002
2016-02-01    0.00002
2016-01-07    0.00002
Name: ad_created, Length: 76, dtype: float64

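There is a wide variety of ad creation dates. The most common one is 2016-04-03, which accounts for 3.9% of the listings. Most ads were created during the crawling period, but a few are older, going back to January 2016 and December 2015 in the output shown here.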

In [33]:
autos["registration_year"].describe()

count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

We can see some odd values in the <code>registration_year</code> column: the minimum is 1000, long before cars were invented, and the maximum is 9999, far in the future. Let's have a closer look at those.

### Registration year data

Some registration years make no sense at all, like 9000 or 4800. In fact, any year greater than 2016 must be wrong, since the data was crawled in 2016. At the low end, the first cars appeared in the late 1800s, so 1900 is a reasonable and fairly generous lower bound.

Before we remove these rows, let's check the relative frequency of cars whose registration year falls outside the 1900 - 2016 interval and see if it's safe to remove them.

In [24]:
(~autos["registration_year"].between(1900, 2016)).sum() / autos.shape[0]

0.038793369710697

Since these rows make up only about 3.9% of the data set, let's remove them.

In [25]:
autos = autos[autos["registration_year"].between(1900,2016)]
autos["registration_year"].value_counts(normalize=True).head(10)

2000    0.067608
2005    0.062895
1999    0.062060
2004    0.057904
2003    0.057818
2006    0.057197
2001    0.056468
2002    0.053255
1998    0.050620
2007    0.048778
Name: registration_year, dtype: float64

It appears that most of the vehicles were first registered in the past 20 years.

---
## Aggregating data by brand

In [26]:
autos["brand"].describe()

count          46681
unique            40
top       volkswagen
freq            9862
Name: brand, dtype: object

In [27]:
autos["brand"].value_counts(normalize=True)

volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
mitsubishi        0.008226
honda             0.007840
kia               0.007069
alfa_romeo        0.006641
porsche           0.006127
suzuki            0.005934
chevrolet         0.005698
chrysler          0.003513
dacia             0.002635
daihatsu          0.002506
jeep              0.002271
subaru            0.002142
land_rover        0.002099
saab              0.001649
jaguar            0.001564
daewoo            0.001500
trabant           0.001392
r

All of the top five brands are German, and together they account for about 61% of the listings. Volkswagen is by far the most popular brand, with roughly as many cars for sale as the next two brands combined.

We will limit our analysis to __brands representing more than 5% of total listings__, as the remaining brands have few listings compared to the top of the spectrum.
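
The combined shares can be added up directly from the `value_counts` output above. The figures in this sketch are copied from that output; the variable names are ours.

```python
# Relative frequencies of the five most common brands, from the output above
shares = {
    "volkswagen": 0.211264,
    "bmw": 0.110045,
    "opel": 0.107581,
    "mercedes_benz": 0.096463,
    "audi": 0.086566,
}

top_five_total = sum(shares.values())
next_two = shares["bmw"] + shares["opel"]

print(round(top_five_total, 3))  # 0.612
print(round(next_two, 3))        # 0.218
```

Volkswagen's share (0.211) is therefore slightly below that of BMW and Opel combined.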

In [28]:
brands = autos["brand"].value_counts(normalize=True)
common_brands = brands[brands > .05].index
print(common_brands)

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')


In [29]:
brands_mean = {}

for b in common_brands:
    selected_brands = autos[autos["brand"] == b]
    mean_price = selected_brands["price_dollars"].mean()
    brands_mean[b] = int(mean_price)
    
print(brands_mean)

{'bmw': 8332, 'ford': 3749, 'mercedes_benz': 8628, 'audi': 9336, 'volkswagen': 5402, 'opel': 2975}


Among the six selected brands, there is a distinct price gap:
- Audi, BMW and Mercedes Benz are more expensive
- Ford and Opel are less expensive
- Volkswagen is in between, which may explain its popularity: it may be a 'best of both worlds' option.

### Mileage exploration

To compare the mean mileage with mean prices of the selected brands we will combine the data from both series objects into a single dataframe (with a shared index) and display the dataframe directly. 

In [30]:
bm_series = pd.Series(brands_mean)
pd.DataFrame(bm_series, columns=["mean_price"])

Unnamed: 0,mean_price
audi,9336
bmw,8332
ford,3749
mercedes_benz,8628
opel,2975
volkswagen,5402


In [31]:
brand_mean_mileage = {}

for b in common_brands:
    selected_rows = autos[autos["brand"] == b]
    mean_mileage = selected_rows["odometer_km"].mean()
    brand_mean_mileage[b] = mean_mileage
    
bmm_series = pd.Series(brand_mean_mileage)

In [32]:
mean_mileage = bmm_series.sort_values(ascending=False)
mean_prices = pd.Series(brands_mean).sort_values(ascending=False)

In [33]:
brand_info = pd.DataFrame(mean_mileage, columns=["mean_mileage"])
brand_info

Unnamed: 0,mean_mileage
bmw,132572.51314
mercedes_benz,130788.363313
opel,129310.035842
audi,129157.386785
volkswagen,128707.158791
ford,124266.012872


In [34]:
brand_info["mean_prices"] = mean_prices
brand_info

Unnamed: 0,mean_mileage,mean_prices
bmw,132572.51314,8332
mercedes_benz,130788.363313,8628
opel,129310.035842,2975
audi,129157.386785,9336
volkswagen,128707.158791,5402
ford,124266.012872,3749


Mean mileage varies far less by brand than mean price does: all six brands fall within about 7% of each other. There is a slight tendency for the more expensive brands to have higher mileage, although Opel (cheap, yet with the third-highest mean mileage) does not follow it.
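
The two per-brand loops above could also be written as a single `groupby`. This is a sketch of that alternative on a made-up `toy` frame, not the method used in this notebook; the column names match the ones created earlier.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned `autos` frame
toy = pd.DataFrame({
    "brand": ["audi", "audi", "ford", "ford", "opel"],
    "price_dollars": [9000, 10000, 3000, 4000, 2500],
    "odometer_km": [125000, 150000, 100000, 150000, 90000],
})
common = ["audi", "ford"]  # brands to keep, like `common_brands` above

brand_info = (
    toy[toy["brand"].isin(common)]
    .groupby("brand")[["price_dollars", "odometer_km"]]
    .mean()
    .rename(columns={"price_dollars": "mean_price",
                     "odometer_km": "mean_mileage"})
    .sort_values("mean_price", ascending=False)
)
print(brand_info)
```

Computing both means in one pass keeps them aligned on the same index, so there is no need to build separate series and join them afterwards.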

---
## Conclusion

The main goal of this project was to apply data cleaning techniques to a dataset of used cars from eBay Kleinanzeigen, the classifieds section of the German eBay website. We focused on gathering valuable information about the whole dataset, removing non-numeric characters, renaming columns, removing outliers and filtering date columns. Additionally, we used aggregation to identify the top brands in this dataset.

The top brands by listings:
- Volkswagen (21.13%)
- BMW (11.00%)
- Opel (10.76%)
- Mercedes Benz (9.65%)
- Audi (8.66%)
- Ford (6.99%)

Ideas for further analysis:
- Find out which models are most popular in the dataset
- Find the most expensive car in the dataset
- Compare prices of cars with and without unrepaired damage