# Exploring German eBay's Used Car Sales Data

Used car sales in Germany make up a market worth roughly 103 billion euros. Here we work with a much smaller slice of that market: used car listings from eBay Kleinanzeigen, the classifieds section of the German eBay site.

### Original Dataset Source: [eBay Kleinanzeigen](https://data.world/data-society/used-cars-data)

### Our Dataset: [Dataquest](https://app.dataquest.io/72353caf-14fb-4ab8-a40c-11b89dc29a21)

## Features:
1. 50,000 data points
2. 20 columns
3. Non-traditional encoding - Latin-1
4. Needs cleaning
5. German language

## Goals:
1. Find and highlight areas for cleaning.
2. Explore and identify data patterns.
3. Use those patterns to extract useful insights.

eBay is a well-known e-commerce site that focuses primarily on consumer-to-consumer sales. While this collection covers only used cars listed on the German site, we can still learn a good deal about the auto market as we comb through the data.

Datasets like this usually need some cleaning before you dive in. 50,000 data points is a small amount in the world of big data, but for us it's a good chance to break out pandas and NumPy.
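The Latin-1 encoding listed in the features matters because the German text contains characters such as Ü. The sketch below uses only the standard library and a made-up three-letter sample (it does not read autos.csv) to show why decoding those bytes as UTF-8 would fail.

```python
# "TÜV" as it would be stored in a Latin-1 encoded file: Ü is the single byte 0xDC.
raw = "TÜV".encode("latin-1")

# Decoding with the matching codec recovers the text.
decoded = raw.decode("latin-1")

# Decoding the same bytes as UTF-8 fails, because 0xDC opens a
# two-byte sequence that the following byte does not complete.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(decoded, utf8_ok)
```

This is the same kind of mismatch that passing the encoding argument to read_csv avoids.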

## Data Dictionary
### A brief explanation of what each column provides

| Column name           | Description                               |
| --------------------- | -----------                               |
|**dateCrawled**| - When this ad was first crawled. All field-values are taken from this date.|
|**name**| - Name of the car.|
|**seller**| - Whether the seller is private or a dealer.|
|**offerType**| - The type of listing.|
|**price**| - The price on the ad to sell the car.|
|**abtest**| - Whether the listing is included in an A/B test.|
|**vehicleType**| - The vehicle Type.|
|**yearOfRegistration**| - The year in which the car was first registered.|
|**gearbox**| - The transmission type.|
|**powerPS**| - The power of the car in PS.|
|**model**| - The car model name.|
|**odometer**| - How many kilometers the car has driven.|
|**monthOfRegistration**| - The month in which the car was first registered.|
|**fuelType**| - What type of fuel the car uses.|
|**brand**| - The brand of the car.|
|**notRepairedDamage**| - If the car has a damage which is not yet repaired.|
|**dateCreated**| - The date on which the eBay listing was created.|
|**nrOfPictures**| - The number of pictures in the ad.|
|**postalCode**| - The postal code for the location of the vehicle.|
|**lastSeen**| - When the crawler saw this ad last online.|

In [1]:
import numpy as np
import pandas as pd

autos = pd.read_csv("autos.csv", encoding="Latin-1")

# Initial Exploration

What does our dataset look like? The easiest way to start out is to just view the initial dataset itself. This gives you a nice overview before any alterations are made. Follow this up with an info method call.

In [2]:
# An overview
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

# Observations so far

20 columns total

1. The price and odometer columns hold **numeric values**, but they are stored as text (object dtype) because of the currency symbol, unit suffix, and thousands separators.
2. The vehicleType, gearbox, model, fuelType, and notRepairedDamage columns have **missing values**, i.e. fewer than 50,000 non-null entries.
3. The date format used is YYYY-MM-DD HH:MM:SS, on a **24-hour** clock.
4. The **name** column uses _ instead of spaces.
5. The model column must stay as text in order to retain information (for example, 7er and 500 are both model names).
6. Our column names use camelCase.
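The missing-value observation can also be checked directly rather than read off the info() output. Here is a minimal sketch on a toy frame; the `sample` DataFrame and its values are made up for illustration and stand in for `autos`.

```python
import pandas as pd

# Toy frame standing in for `autos`; the values are invented for illustration.
sample = pd.DataFrame(
    {
        "vehicleType": ["bus", None, "kombi", "cabrio"],
        "gearbox": ["manuell", "automatik", None, None],
        "brand": ["peugeot", "bmw", "ford", "opel"],
    }
)

# Number of missing entries per column.
missing_count = sample.isna().sum()

# Fraction of rows that are missing per column.
missing_share = sample.isna().mean()

print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

Looking at the share as well as the count makes it easier to judge whether a column is still usable.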

## Next Steps

1. Convert each column name to snake_case, the usual Python naming convention.
2. Locate any data that can be safely dropped.
3. Clean and convert numeric data stored as objects.

## Convert column names to the snake_case standard

"Snake case (stylized as snake_case) refers to the style of writing in which each space is replaced by an underscore (_) character, and the first letter of each word written in lowercase."

In [4]:
# print columns for easy copying
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [5]:
# rename() takes a mapping of old column names to new ones
autos.rename(
    {
        "dateCrawled": "date_crawled",
        "offerType": "offer_type",
        "vehicleType": "vehicle_type",
        "yearOfRegistration": "registration_year",
        "monthOfRegistration": "registration_month",
        "fuelType": "fuel_type",
        "notRepairedDamage": "unrepaired_damage",
        "dateCreated": "date_ad_created",
        "nrOfPictures": "number_of_pics",
        "postalCode": "postal_code",
        "lastSeen": "last_seen",
    },
    axis=1,
    inplace=True,
)

# verify
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,date_ad_created,number_of_pics,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
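The mapping above was written by hand, which also let us reword some names (yearOfRegistration became registration_year). For a purely mechanical conversion, a small regex helper is one option. `camel_to_snake` below is a hypothetical helper, not part of pandas, and it would not reproduce the reworded names.

```python
import re


def camel_to_snake(name: str) -> str:
    """Insert "_" before a capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


columns = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "name"]
converted = [camel_to_snake(c) for c in columns]
print(converted)
```

With a real frame, the same function could be passed as `autos.rename(columns=camel_to_snake)`, since rename() accepts a callable.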


In [6]:
autos.describe(include="all")

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,date_ad_created,number_of_pics,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-04-02 11:37:04,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


## Continuing

At first glance, the seller, offer_type, abtest, gearbox, and unrepaired_damage columns could **potentially** be dropped, as each contains only 2 unique values. The case is strongest for seller and offer_type, where 49,999 of the 50,000 rows share a single value, and for number_of_pics, which is 0 in every row.

However, dropping the gearbox and unrepaired_damage columns may cause us to lose valuable information.

As noted previously, the **price** and **odometer** columns should be cleaned and converted to a **numeric dtype**.
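A column with 2 unique values is only a drop candidate if one value dominates. One way to check is to measure the share of the most common value per column. This is a sketch on invented data; the `sample` frame and the 80% threshold are assumptions for illustration.

```python
import pandas as pd

# Invented data: one nearly constant column, one balanced, one fully constant.
sample = pd.DataFrame(
    {
        "seller": ["privat"] * 9 + ["gewerblich"],
        "gearbox": ["manuell"] * 6 + ["automatik"] * 4,
        "number_of_pics": [0] * 10,
    }
)

# Share of rows taken by the most common value in each column.
top_share = {
    col: sample[col].value_counts(normalize=True, dropna=False).iloc[0]
    for col in sample.columns
}

# Flag columns where a single value covers more than 80% of the rows.
near_constant = [col for col, share in top_share.items() if share > 0.8]
print(near_constant)
```

Here gearbox survives the check even though it has only 2 unique values, which matches the caution above about dropping it.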


## Clean and Convert Price/Odometer Column

### Steps
1. Remove any non-numeric characters.
2. Convert the column to a numeric dtype.
3. Use DataFrame.rename() to rename the columns to price_usd and odometer_km.
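One thing to watch in step 1: "$" is a special character in regular expressions (it anchors the end of the string), so whether it is removed depends on how the pattern is interpreted. The sketch below shows the regex behaviour with the standard `re` module and then a literal replacement in pandas. The three sample prices are invented.

```python
import re

import pandas as pd

raw_prices = pd.Series(["$5,000", "$8,500", "$0"])

# As a regular expression, "$" matches the end of the string, not the symbol,
# so substituting it with "" leaves the text unchanged.
still_dirty = re.sub("$", "", "$5,000")

# regex=False treats "$" and "," as literal characters.
cleaned = (
    raw_prices.str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

print(still_dirty)
print(cleaned.tolist())
```

Passing regex=False keeps the result the same regardless of how a given pandas version handles single-character patterns.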


In [7]:
autos["price"].head()

0    $5,000
1    $8,500
2    $8,990
3    $4,350
4    $1,350
Name: price, dtype: object

In [8]:
# remove , and $ from the text
# regex=False treats both as literal characters; as a regex, "$" means end-of-string
autos["price"] = autos["price"].str.replace(",", "", regex=False)
autos["price"] = autos["price"].str.replace("$", "", regex=False)

# convert to integer type
autos["price"] = autos["price"].astype(int)

# rename columns
autos.rename({"price": "price_usd"}, axis=1, inplace=True)

# verify
autos["price_usd"]

0         5000
1         8500
2         8990
3         4350
4         1350
         ...  
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price_usd, Length: 50000, dtype: int64

In [9]:
# remove km and , (literal matches, so regex=False)
autos["odometer"] = autos["odometer"].str.replace("km", "", regex=False)
autos["odometer"] = autos["odometer"].str.replace(",", "", regex=False)

# convert to integer type
autos["odometer"] = autos["odometer"].astype(int)
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)

#verify
autos["odometer_km"]

0        150000
1        150000
2         70000
3         70000
4        150000
          ...  
49995    100000
49996    150000
49997      5000
49998     40000
49999    150000
Name: odometer_km, Length: 50000, dtype: int64

## Next Steps
Now we will check the price and odometer columns for outliers. Outliers often mark an area worth exploring, and many useful insights come from locating **outliers in a dataset**.

Since our dataset deals with sales, the **price** column is a good starting place. The **odometer** column is another good candidate, because a used car's mileage can greatly influence the asking price.

In [10]:
# how many unique values in the price_usd column?
autos["price_usd"].unique().shape

(2357,)

In [11]:
# Stats: min max median mean
autos["price_usd"].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_usd, dtype: float64

In [12]:
# a general overview
autos["price_usd"].value_counts().head(10)

0       1421
500      781
1500     734
2500     643
1000     639
1200     639
600      531
800      498
3500     498
2000     460
Name: price_usd, dtype: int64

In [13]:
# The 20 highest prices and how many listings share each one
autos["price_usd"].value_counts().sort_index(ascending=False).head(20)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price_usd, dtype: int64

In [14]:
# The 20 lowest prices and how many listings share each one
autos["price_usd"].value_counts().sort_index(ascending=True).head(20)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price_usd, dtype: int64

## Price Outliers

What do we know so far?

1. The mean asking price is around $ 9,840.

2. A quarter of the listings are priced at $ 1,100 or less.

3. 75% of the listings are priced at $ 7,200 or less, and the median is $ 2,950.

4. The maximum is roughly $ 100 million, far above the rest of the data. Values like this are outliers that skew the mean and standard deviation.

5. $ 3,890,000 seems to be our highest realistic value. Anything above it will be treated as an outlier.

6. 1,421 cars are listed at $ 0 and will also be treated as outliers.

## Using Outliers to create a range 

Since we know where our maximum and minimum outliers lie, we can use these values to create a **range** that contains all our **non-outlier values**.
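Series.between() includes both endpoints by default, which matters when choosing the cutoffs. A small sketch with invented prices around the two bounds:

```python
import pandas as pd

# Invented prices placed on and around the two cutoffs.
prices = pd.Series([0, 1, 499, 500, 7200, 3_890_000, 3_890_001, 99_999_999])

lower, upper = 499, 3_890_001

# between() is inclusive on both ends by default.
mask = prices.between(lower, upper)
kept = prices[mask]
removed = len(prices) - len(kept)

print(kept.tolist(), removed)
```

Because the bounds are inclusive, a listing priced at exactly 499 or 3,890,001 would be kept.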

In [15]:
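x = 499  # lower bound: drops the $0 listings and other very low prices
y = 3890001  # upper bound: just above the highest realistic value

# keep only the rows whose price falls within [x, y]; between() is inclusive
non_outliers = autos[autos["price_usd"].between(x, y)]

# reassign and verify
autos = non_outliers
autos.head()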

Unnamed: 0,date_crawled,name,seller,offer_type,price_usd,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,date_ad_created,number_of_pics,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [16]:
# Previously we had 2357 unique price values.
autos["price_usd"].unique().shape

(2214,)

### Unique price values dropped from 2357 to 2214.

Filtering removed **4,805** rows from our data set (50,000 before, 45,195 after).

Cars listed for less than \\$ 499 or more than \\$ 3.89 million were removed.

$ 499 is our lower cutoff. It sits well below the 25th percentile of $ 1,100, so it trims only the cheapest listings.

$ 3.89 million is our highest non-outlier value.
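The number of unique prices and the number of rows are two different measures, and a filter changes them by different amounts. A quick sketch on invented prices shows the gap:

```python
import pandas as pd

# Invented prices with repeats, so rows and unique values differ.
prices = pd.Series([0, 0, 500, 500, 500, 1200, 99_999_999])
kept = prices[prices.between(499, 3_890_001)]

# Rows removed by the filter.
rows_removed = len(prices) - len(kept)

# Distinct price values removed by the filter.
unique_removed = prices.nunique() - kept.nunique()

print(rows_removed, unique_removed)
```

When reporting how much data a filter dropped, the row count is usually the figure to quote.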

## Cleaning up Dates

We have 5 columns that revolve around dates.

3 of the 5 (date_crawled, last_seen, and date_ad_created) are stored as strings rather than as numeric or datetime values. This makes them harder to work with programmatically.

Rather than converting them, we will extract the date portion of each string and look at how the values are distributed.

In [17]:
# what are the types?
autos[
    [
        "date_crawled",
        "last_seen",
        "date_ad_created",
        "registration_month",
        "registration_year",
    ]
].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45195 entries, 0 to 49999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date_crawled        45195 non-null  object
 1   last_seen           45195 non-null  object
 2   date_ad_created     45195 non-null  object
 3   registration_month  45195 non-null  int64 
 4   registration_year   45195 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 2.1+ MB


In [18]:
# How are they formatted?
autos[
    [
        "date_crawled",
        "last_seen",
        "date_ad_created",
        "registration_month",
        "registration_year",
    ]
].head()

Unnamed: 0,date_crawled,last_seen,date_ad_created,registration_month,registration_year
0,2016-03-26 17:47:46,2016-04-06 06:45:54,2016-03-26 00:00:00,3,2004
1,2016-04-04 13:38:56,2016-04-06 14:45:08,2016-04-04 00:00:00,6,1997
2,2016-03-26 18:57:24,2016-04-06 20:15:37,2016-03-26 00:00:00,7,2009
3,2016-03-12 16:58:10,2016-03-15 03:16:28,2016-03-12 00:00:00,6,2007
4,2016-04-01 14:38:50,2016-04-01 14:38:50,2016-04-01 00:00:00,7,2003


### Date and Time Format
2016-03-26 17:47:46

YYYY-MM-DD HH:MM:SS


Let's gather more info about the **range of dates**.

The first **10 characters**, YYYY-MM-DD, give the date portion of each timestamp. By extracting these values we can dig into the distribution of dates for further insight.
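String slicing is the approach used in this notebook. An alternative, shown here only as a sketch, is to parse the column with pd.to_datetime, which yields real datetime values that support arithmetic. The three timestamps are copied from the first rows shown earlier.

```python
import pandas as pd

crawled = pd.Series(
    ["2016-03-26 17:47:46", "2016-04-04 13:38:56", "2016-03-26 18:57:24"]
)

# String slicing: keeps the values as text.
as_text = crawled.str[:10]

# Parsing: yields datetime64 values.
parsed = pd.to_datetime(crawled, format="%Y-%m-%d %H:%M:%S")
as_dates = parsed.dt.strftime("%Y-%m-%d")

# Whole days between the earliest and latest timestamp.
span_days = (parsed.max() - parsed.min()).days

print(as_text.tolist() == as_dates.tolist(), span_days)
```

Both routes give the same date labels; parsing costs a little more up front but makes ranges and differences easy to compute.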

In [19]:
# grab the first 10 characters, the date
# count each unique date
# normalize=True displays the counts as proportions of the total
# dropna=False keeps missing values in the counts
# sort the dates by earliest -> latest
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025578
2016-03-06    0.014117
2016-03-07    0.036177
2016-03-08    0.033145
2016-03-09    0.032880
2016-03-10    0.032658
2016-03-11    0.033013
2016-03-12    0.037261
2016-03-13    0.015533
2016-03-14    0.036309
2016-03-15    0.034030
2016-03-16    0.029317
2016-03-17    0.031132
2016-03-18    0.012878
2016-03-19    0.034716
2016-03-20    0.038079
2016-03-21    0.037836
2016-03-22    0.033057
2016-03-23    0.032371
2016-03-24    0.028986
2016-03-25    0.031132
2016-03-26    0.032636
2016-03-27    0.031110
2016-03-28    0.034805
2016-03-29    0.033322
2016-03-30    0.033344
2016-03-31    0.031685
2016-04-01    0.033920
2016-04-02    0.035823
2016-04-03    0.038787
2016-04-04    0.036663
2016-04-05    0.013187
2016-04-06    0.003164
2016-04-07    0.001350
Name: date_crawled, dtype: float64

### Distribution of date_ad_created values

In [20]:
autos["date_ad_created"].str[:10].value_counts(
    normalize=True, dropna=False
).sort_index()

2015-06-11    0.000022
2015-08-10    0.000022
2015-09-09    0.000022
2015-11-10    0.000022
2015-12-05    0.000022
                ...   
2016-04-03    0.039009
2016-04-04    0.037039
2016-04-05    0.011926
2016-04-06    0.003253
2016-04-07    0.001195
Name: date_ad_created, Length: 76, dtype: float64

### Distribution of last_seen values

In [21]:
autos["last_seen"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001084
2016-03-06    0.004182
2016-03-07    0.005222
2016-03-08    0.007036
2016-03-09    0.009448
2016-03-10    0.010289
2016-03-11    0.012037
2016-03-12    0.023874
2016-03-13    0.008873
2016-03-14    0.012280
2016-03-15    0.015643
2016-03-16    0.016174
2016-03-17    0.027658
2016-03-18    0.007390
2016-03-19    0.015444
2016-03-20    0.020400
2016-03-21    0.020666
2016-03-22    0.021263
2016-03-23    0.018431
2016-03-24    0.019538
2016-03-25    0.018630
2016-03-26    0.016484
2016-03-27    0.015466
2016-03-28    0.020577
2016-03-29    0.021463
2016-03-30    0.024140
2016-03-31    0.023476
2016-04-01    0.022857
2016-04-02    0.024892
2016-04-03    0.024981
2016-04-04    0.024295
2016-04-05    0.126541
2016-04-06    0.225224
2016-04-07    0.134041
Name: last_seen, dtype: float64

### Date Ranges

We can now see what share of the listings each date accounts for. Weekends, holidays, and other important dates can heavily influence sales. Does our data reflect any of this?

Although we were able to pull out a date range and sort it using value_counts(), the values are still strings rather than numbers.

Thus, Series.describe() cannot report a meaningful mean or quartiles for these columns. It is important not to draw conclusions from incorrectly gathered data.
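To look at the weekend question, the strings would first need to be parsed into datetimes. The following is a sketch, not something run against the full dataset. It uses the five date_ad_created values shown earlier and the `dt.day_name()` accessor.

```python
import pandas as pd

# date_ad_created values from the first five rows shown earlier.
created = pd.Series(
    [
        "2016-03-26 00:00:00",
        "2016-04-04 00:00:00",
        "2016-03-26 00:00:00",
        "2016-03-12 00:00:00",
        "2016-04-01 00:00:00",
    ]
)

# Parse the strings, then map each date to its weekday name.
weekday = pd.to_datetime(created).dt.day_name()

# Share of listings created on each weekday.
weekday_share = weekday.value_counts(normalize=True)

print(weekday_share)
```

Five rows say nothing about the real pattern, but the same two lines applied to the full column would show whether weekends stand out.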

## Exploring registration_year

The registration_year column holds the year in which the vehicle was first registered.

We know people pay higher prices for newer cars.

Series.describe() gives us a first look at how these years are distributed.



In [22]:
autos["registration_year"].describe()

count    45195.000000
mean      2005.051886
std         89.555897
min       1000.000000
25%       2000.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

## Registration Outliers
Here we have two outliers.

A minimum registration **year of 1000** and a maximum **year of 9999**. Both are impossible values.

Our last_seen column shows that the realistic maximum for registration_year is 2016, since a car cannot be first registered after its listing was last seen. Cars also did not exist in the year 1000.

A filter of **1920-2017** removes these outliers. Note that the upper bound of 2017 is one year more generous than the 2016 limit suggested by last_seen.
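Instead of hard-coding the upper bound, it could be derived from the last_seen column. This is a sketch under that assumption, on an invented six-row frame; the 1920 lower bound is carried over from the text.

```python
import pandas as pd

# Invented rows: two impossible years, one year past the crawl, three plausible.
sample = pd.DataFrame(
    {
        "registration_year": [1000, 1927, 2004, 2016, 2017, 9999],
        "last_seen": ["2016-04-06 06:45:54"] * 6,
    }
)

# Latest year in which any listing was seen.
latest_year = pd.to_datetime(sample["last_seen"]).dt.year.max()

# Keep registrations between 1920 and that year, inclusive.
valid = sample["registration_year"].between(1920, latest_year)
kept_years = sample.loc[valid, "registration_year"].tolist()

print(latest_year, kept_years)
```

With this bound, the 2017 registration is dropped along with the impossible values.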

In [23]:
# filter by years (named valid_years to avoid shadowing the built-in filter())
valid_years = autos["registration_year"].between(1920, 2017)
autos = autos[valid_years]

# recheck describe
autos["registration_year"].describe()

count    44714.000000
mean      2003.621908
std          7.322404
min       1927.000000
25%       2000.000000
50%       2004.000000
75%       2008.000000
max       2017.000000
Name: registration_year, dtype: float64

## registration_year Recap

After filtering for outliers, we have years that make sense.

Our mean registration_year is around 2004 (2003.6), and the median is also 2004.

Half of our cars were registered between 2000 and 2008.

## Brands Aggregation


Data aggregation means taking raw data, extracting insights, and presenting them in a summarized format.

A car's brand has a large impact on price, as a brand's reputation can make or break a potential sale.

Let's narrow the focus of our data aggregation down to specific car brands in order to find out how branding can influence price points.


### Steps
1. Extract the unique brand values, leaving out the catch-all sonstige_autos category.
2. Store them in a separate index for comparison.
3. Find the mean price for each brand ("the arithmetic average of the values").
4. Draw conclusions. 
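The steps above are done with a loop and a dictionary in this notebook. The same aggregation can be expressed with groupby, which also makes it easy to keep the listing count next to each mean. This is a sketch on an invented six-row frame standing in for `autos`.

```python
import pandas as pd

# Invented rows standing in for `autos`.
sample = pd.DataFrame(
    {
        "brand": ["bmw", "bmw", "opel", "opel", "opel", "sonstige_autos"],
        "price_usd": [8000, 10000, 3000, 3500, 4000, 12000],
    }
)

# Drop the catch-all category, then compute mean price and listing count per brand.
brand_stats = (
    sample[sample["brand"] != "sonstige_autos"]
    .groupby("brand")["price_usd"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

print(brand_stats)
```

Keeping the count visible helps when comparing brands, since a mean over a few dozen listings is much less stable than one over several thousand.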

In [24]:
# find # unique values
print("Number of unique brands: ", autos["brand"].nunique(), "\n")
print("Car Brands:")
autos["brand"].value_counts()

Number of unique brands:  40 

Car Brands:


volkswagen        9537
bmw               5075
mercedes_benz     4539
opel              4534
audi              4023
ford              2907
renault           1992
peugeot           1305
fiat              1065
seat               816
skoda              754
smart              682
mazda              674
nissan             674
citroen            637
toyota             597
hyundai            455
sonstige_autos     428
volvo              415
mini               415
honda              356
mitsubishi         343
kia                329
alfa_romeo         294
porsche            278
chevrolet          266
suzuki             260
chrysler           161
dacia              128
jeep               107
land_rover          99
daihatsu            96
subaru              87
saab                73
jaguar              70
daewoo              62
rover               58
trabant             51
lancia              46
lada                26
Name: brand, dtype: int64

In [30]:
# grab all brands
# store in a separate 1d index
car_brands = autos["brand"].value_counts().index
car_brands = car_brands.drop("sonstige_autos")  # "other cars" is a catch-all, not a brand
car_brands

Index(['volkswagen', 'bmw', 'mercedes_benz', 'opel', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'smart', 'mazda', 'nissan',
       'citroen', 'toyota', 'hyundai', 'volvo', 'mini', 'honda', 'mitsubishi',
       'kia', 'alfa_romeo', 'porsche', 'chevrolet', 'suzuki', 'chrysler',
       'dacia', 'jeep', 'land_rover', 'daihatsu', 'subaru', 'saab', 'jaguar',
       'daewoo', 'rover', 'trabant', 'lancia', 'lada'],
      dtype='object')

In [26]:
# holds our mean x price data
mean_brand_price = {}

for brand in car_brands:
    mean_price = autos.loc[autos['brand'] == brand, 'price_usd'].mean()
    # key: brand, value: price
    mean_brand_price[brand] = mean_price
    
# verify
mean_brand_price

{'volkswagen': 5925.152353989724,
 'bmw': 8773.91802955665,
 'mercedes_benz': 8658.356245869134,
 'opel': 3351.909572121747,
 'audi': 9497.976137211037,
 'ford': 4612.264533883729,
 'renault': 2768.13202811245,
 'peugeot': 3321.872796934866,
 'fiat': 3208.955868544601,
 'seat': 4748.568627450981,
 'skoda': 6527.81299734748,
 'smart': 3559.690615835777,
 'mazda': 4396.721068249258,
 'nissan': 5074.501483679525,
 'citroen': 3977.287284144427,
 'toyota': 5248.839195979899,
 'hyundai': 5637.687912087912,
 'volvo': 5117.616867469879,
 'mini': 10616.96626506024,
 'honda': 4293.820224719101,
 'mitsubishi': 3821.5801749271136,
 'kia': 6129.936170212766,
 'alfa_romeo': 4372.057823129252,
 'porsche': 46955.15107913669,
 'chevrolet': 6811.804511278196,
 'suzuki': 4459.534615384616,
 'chrysler': 3678.360248447205,
 'dacia': 5920.3828125,
 'jeep': 11590.214953271028,
 'land_rover': 18934.272727272728,
 'daihatsu': 1942.3958333333333,
 'subaru': 4511.137931034483,
 'saab': 3392.4383561643835,
 'jagu

In [27]:
sorted(mean_brand_price.values(), reverse=True)

[46955.15107913669,
 18934.272727272728,
 12129.6,
 11590.214953271028,
 10616.96626506024,
 9497.976137211037,
 8773.91802955665,
 8658.356245869134,
 6811.804511278196,
 6527.81299734748,
 6129.936170212766,
 5925.152353989724,
 5920.3828125,
 5637.687912087912,
 5248.839195979899,
 5117.616867469879,
 5074.501483679525,
 4748.568627450981,
 4612.264533883729,
 4511.137931034483,
 4459.534615384616,
 4396.721068249258,
 4372.057823129252,
 4293.820224719101,
 3977.287284144427,
 3821.5801749271136,
 3678.360248447205,
 3669.0,
 3559.690615835777,
 3392.4383561643835,
 3351.909572121747,
 3321.872796934866,
 3208.955868544601,
 2780.153846153846,
 2768.13202811245,
 2272.8627450980393,
 1942.3958333333333,
 1741.896551724138,
 1272.9193548387098]

In [28]:
sorted(mean_brand_price, key=mean_brand_price.get, reverse=True)

['porsche',
 'land_rover',
 'jaguar',
 'jeep',
 'mini',
 'audi',
 'bmw',
 'mercedes_benz',
 'chevrolet',
 'skoda',
 'kia',
 'volkswagen',
 'dacia',
 'hyundai',
 'toyota',
 'volvo',
 'nissan',
 'seat',
 'ford',
 'subaru',
 'suzuki',
 'mazda',
 'alfa_romeo',
 'honda',
 'citroen',
 'mitsubishi',
 'chrysler',
 'lancia',
 'smart',
 'saab',
 'opel',
 'peugeot',
 'fiat',
 'lada',
 'renault',
 'trabant',
 'daihatsu',
 'rover',
 'daewoo']

In [29]:
# sort (brand, mean) pairs together so names and values cannot drift apart
ranked = sorted(mean_brand_price.items(), key=lambda pair: pair[1], reverse=True)

# pad every brand name to the width of the longest one
width = max(len(brand) for brand, _ in ranked)

# label
print("Brand - Mean\n")

# verify + formatting
for brand, mean in ranked:
    print(f"{brand:<{width}}  {mean}")

Brand - Mean

porsche        46955.15107913669
land_rover     18934.272727272728
jaguar         12129.6
jeep           11590.214953271028
mini           10616.96626506024
audi           9497.976137211037
bmw            8773.91802955665
mercedes_benz  8658.356245869134
chevrolet      6811.804511278196
skoda          6527.81299734748
kia            6129.936170212766
volkswagen     5925.152353989724
dacia          5920.3828125
hyundai        5637.687912087912
toyota         5248.839195979899
volvo          5117.616867469879
nissan         5074.501483679525
seat           4748.568627450981
ford           4612.264533883729
subaru         4511.137931034483
suzuki         4459.534615384616
mazda          4396.721068249258
alfa_romeo     4372.057823129252
honda          4293.820224719101
citroen        3977.287284144427
mitsubishi     3821.5801749271136
chrysler       3678.360248447205
lancia         3669.0
smart          3559.690615835777
saab           3392.4383561643835
opel           3351.

## Brand Insights

Well-known luxury brands such as **Porsche, Land Rover, and Jaguar** top our list, taking the **top 3** spots as expected. Chevrolet and Skoda get an honorable mention for coming in at 9th and 10th place.

**Daihatsu, Rover, and Daewoo**, which are Japanese, British, and South Korean manufacturers respectively, fall into our **last** 3 spots. Their listings carry much lower asking prices than those of German premium brands such as Porsche. Rover was owned by BMW from 1994 to 2000 and stopped producing cars in 2005, so every Rover in this 2016 dataset is an older vehicle, which likely explains its low mean price.
