# Exploring eBay Car Sales Data

This project is mainly an exercise in data cleaning.

The dataset comes from the classifieds section of eBay Germany and is available to download <a href="https://data.world/data-society/used-cars-data">here</a>. Because it comes from eBay Germany, the values are in German while the column headers are in English. The data dictionary is as follows:

| column name | description |
| --- | --- |
| dateCrawled | When this ad was first crawled. All field values are taken from this date. |
| name | Name of the car/ad |
| seller | Type of seller: private or commercial (a dealer) |
| offerType | The type of listing: offer or request |
| price | The asking price in the ad |
| abtest | Whether the listing is a test or control group in the A/B test |
| vehicleType | Type of vehicle |
| yearOfRegistration | The year in which the car was first registered |
| gearbox | Type of transmission: manual or automatic |
| powerPS | The power of the car in PS |
| model | The car model name |
| kilometer | How many kilometers the car has been driven |
| monthOfRegistration | The month in which the car was first registered |
| fuelType | Type of fuel the car uses |
| brand | The brand of the car |
| notRepairedDamage | Whether the car has damage that has not yet been repaired |
| dateCreated | The date on which the eBay listing was created |
| nrOfPictures | The number of pictures in the ad |
| postalCode | The postal code for the location of the vehicle |
| lastSeen | When the crawler last saw this ad online |



In [1]:
import numpy as np
import pandas as pd

autos = pd.read_csv("autos.csv")

In [2]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-nu

In [3]:
print(autos.head(5))

           dateCrawled                            name  seller offerType  \
0  2016-03-24 11:52:17                      Golf_3_1.6  privat   Angebot   
1  2016-03-24 10:58:45            A5_Sportback_2.7_Tdi  privat   Angebot   
2  2016-03-14 12:52:21  Jeep_Grand_Cherokee_"Overland"  privat   Angebot   
3  2016-03-17 16:54:04              GOLF_4_1_4__3TÜRER  privat   Angebot   
4  2016-03-31 17:25:20  Skoda_Fabia_1.4_TDI_PD_Classic  privat   Angebot   

   price abtest vehicleType  yearOfRegistration    gearbox  powerPS  model  \
0    480   test         NaN                1993    manuell        0   golf   
1  18300   test       coupe                2011    manuell      190    NaN   
2   9800   test         suv                2004  automatik      163  grand   
3   1500   test  kleinwagen                2001    manuell       75   golf   
4   3600   test  kleinwagen                2008    manuell       69  fabia   

   kilometer  monthOfRegistration fuelType       brand notRepairedDamage  

The dataset has 371,528 rows and 20 columns. The column types look consistent with the column names. The column names use camelCase, which can be hard to read. Note that five columns contain null values: `vehicleType`, `gearbox`, `model`, `fuelType`, and `notRepairedDamage`.
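
The missing values noted above can also be counted directly rather than read off the `info()` output. The following is a minimal sketch; the small `sample` frame is made up for illustration, and on the real data the same calls would be applied to `autos`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the `autos` DataFrame.
sample = pd.DataFrame({
    "vehicleType": [np.nan, "coupe", "suv", "kleinwagen"],
    "gearbox": ["manuell", "manuell", "automatik", np.nan],
    "price": [480, 18300, 9800, 1500],
})

# Count missing values per column and keep only the columns that have any.
null_counts = sample.isnull().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)
```

Filtering on `null_counts > 0` keeps the listing short when most columns are complete.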

## Clean data
### Clean columns

In [4]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'kilometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

Since the column names use camelCase, we convert them all to snake_case: every letter is changed to lowercase and an underscore (_) separates the words in a name. Additionally, we rename some columns to be more descriptive and easier to understand.
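
The mechanical part of this conversion could be automated. Below is a minimal sketch using a regular expression; `to_snake_case` is a hypothetical helper, not something the notebook defines, and it does not cover the descriptive renames (for example `yearOfRegistration` to `registration_year`), which still need a manual mapping.

```python
import re

def to_snake_case(name):
    """Insert '_' before each capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# A few of the original column names.
original = ["dateCrawled", "offerType", "powerPS", "nrOfPictures"]
converted = [to_snake_case(col) for col in original]
print(converted)
```

Because the lookbehind only matches after a lowercase letter or digit, an acronym such as `PS` is kept together as `ps`.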

In [5]:
new_col_name = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'kilometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'num_pictures', 'postal_code', 'last_seen']

autos.columns = new_col_name
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_pictures,postal_code,last_seen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


## Explore data

Now that the column headers are cleaned, we will start exploring the data.

### Remove unnecessary columns

In [6]:
autos.describe(include="all")

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_pictures,postal_code,last_seen
count,371528,371528,371528,371528,371528.0,371528,333659,371528.0,351319,371528.0,351044,371528.0,371528.0,338142,371528,299468,371528,371528.0,371528.0,371528
unique,280500,233528,2,2,,2,8,,2,,251,,,7,40,2,114,,,182806
top,2016-03-24 14:49:47,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-06 13:45:54
freq,7,657,371525,371516,,192585,95894,,274214,,30070,,,223857,79640,263182,14450,,,17
mean,,,,,17295.14,,,2004.577997,,115.549477,,125618.688228,5.734445,,,,,0.0,50820.66764,
std,,,,,3587954.0,,,92.866598,,192.139578,,40112.337051,3.712412,,,,,0.0,25799.08247,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1150.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30459.0,
50%,,,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49610.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71546.0,


Several columns contain fewer than 10 unique values: `seller`, `offer_type`, `abtest`, `vehicle_type`, `gearbox`, `fuel_type`, and `unrepaired_damage`. We will explore each of them to decide whether to keep or drop it.

In [7]:
print(autos["seller"].value_counts())
print()
print(autos["offer_type"].value_counts())
print()
print(autos["abtest"].value_counts())

privat        371525
gewerblich         3
Name: seller, dtype: int64

Angebot    371516
Gesuch         12
Name: offer_type, dtype: int64

test       192585
control    178943
Name: abtest, dtype: int64


The `seller` column shows the type of seller: "privat" (private) or "gewerblich" (commercial). Since more than 99.99% of the rows belong to the "privat" type, this column would add little to the analysis and will be removed.

The `offer_type` column shows the type of ad: "Angebot" (offer) or "Gesuch" (request). As with `seller`, more than 99.99% of the rows belong to a single type, so this column will also be removed.

The `abtest` column is more evenly split between its two values. It is part of eBay's feature-testing process: an ad marked "test" receives a new feature, while an ad marked "control" keeps the normal features. We don't know yet whether this testing affects the car ads, so this column will be kept for now.

In [8]:
print(autos["vehicle_type"].value_counts())
print()
print(autos["gearbox"].value_counts())
print()
print(autos["fuel_type"].value_counts())
print()
print(autos["unrepaired_damage"].value_counts())

limousine     95894
kleinwagen    80023
kombi         67564
bus           30201
cabrio        22898
coupe         19015
suv           14707
andere         3357
Name: vehicle_type, dtype: int64

manuell      274214
automatik     77105
Name: gearbox, dtype: int64

benzin     223857
diesel     107746
lpg          5378
cng           571
hybrid        278
andere        208
elektro       104
Name: fuel_type, dtype: int64

nein    263182
ja       36286
Name: unrepaired_damage, dtype: int64


These columns describe the cars being advertised, even though they have few unique values. Hence we will keep them for further analysis.

In [9]:
print(autos["num_pictures"].value_counts())

0    371528
Name: num_pictures, dtype: int64


The summary statistics for the `num_pictures` column are all 0. Checking the values confirms that every row is 0, so we will remove this column as well.

In conclusion, we will drop 3 columns (`seller`, `offer_type`, and `num_pictures`) from our dataset.
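
The manual inspection above could be complemented by computing, for each column, the share of rows taken by its most frequent value. This is only a sketch: the `sample` frame is invented, `dominant_share` is a hypothetical helper, and the 0.98 threshold is an arbitrary illustrative choice rather than a rule used in this notebook.

```python
import pandas as pd

# Hypothetical miniature frame; the real check would run on `autos`.
sample = pd.DataFrame({
    "seller": ["privat"] * 99 + ["gewerblich"],
    "abtest": ["test"] * 52 + ["control"] * 48,
    "num_pictures": [0] * 100,
})

def dominant_share(series):
    """Return the fraction of rows taken by the most frequent value."""
    return series.value_counts(normalize=True, dropna=False).iloc[0]

# One share per column.
shares = sample.apply(dominant_share)

# Illustrative threshold only: flag columns dominated by a single value.
low_information = shares[shares >= 0.98].index.tolist()
print(low_information)
```

A column flagged this way is only a candidate for removal; whether it is actually uninformative still depends on the question being asked.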

In [10]:
autos = autos.drop(["seller", "offer_type", "num_pictures"], axis=1)
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 17 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        371528 non-null  object
 1   name                371528 non-null  object
 2   price               371528 non-null  int64 
 3   abtest              371528 non-null  object
 4   vehicle_type        333659 non-null  object
 5   registration_year   371528 non-null  int64 
 6   gearbox             351319 non-null  object
 7   power_ps            371528 non-null  int64 
 8   model               351044 non-null  object
 9   kilometer           371528 non-null  int64 
 10  registration_month  371528 non-null  int64 
 11  fuel_type           338142 non-null  object
 12  brand               371528 non-null  object
 13  unrepaired_damage   299468 non-null  object
 14  ad_created          371528 non-null  object
 15  postal_code         371528 non-null  int64 
 16  la

### Remove outliers
Next we will check whether our dataset contains any outliers or unrealistic values that could mislead the analysis. We will focus on the `price` and `kilometer` columns.

*Remove outliers for* `price`

In [11]:
print(autos["price"].unique().shape[0])
print()
print(autos["price"].describe())

5597

count    3.715280e+05
mean     1.729514e+04
std      3.587954e+06
min      0.000000e+00
25%      1.150000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.147484e+09
Name: price, dtype: float64


In [12]:
autos["price"].value_counts().sort_index(ascending=True).head(10)

0     10778
1      1189
2        12
3         8
4         1
5        26
7         3
8         9
9         8
10       84
Name: price, dtype: int64

In [13]:
autos_zero = autos[autos["price"] < 100].head()
print(autos_zero)

            date_crawled                                               name  \
7    2016-03-21 18:54:38                       VW_Derby_Bj_80__Scheunenfund   
40   2016-03-26 22:06:17                  Suche_Opel_corsa_a_zu_verschenken   
60   2016-03-29 15:48:15  TAUSCHE_BMW_E38_740i_g._SUV_/_GELÄNDEWAGEN_LES...   
91   2016-03-28 09:37:01  MERCEDES_BENZ_W124_250D_83KW_/_113PS___SCHLACH...   
115  2016-03-19 18:40:12                                    Golf_IV_1.4_16V   

     price   abtest vehicle_type  registration_year  gearbox  power_ps  \
7        0     test    limousine               1980  manuell        50   
40       0     test          NaN               1990      NaN         0   
60       1  control          suv               1994  manuell       286   
91       1  control    limousine               1995  manuell       113   
115      0     test          NaN               2017  manuell         0   

        model  kilometer  registration_month fuel_type           brand  \
7     

Although these prices seem too low for a car, they may be starting prices. Hence we will keep these rows.

Next we check the price at the higher end.

In [14]:
autos["price"].value_counts().sort_index(ascending=False).head(20)

2147483647     1
99999999      15
99000000       1
74185296       1
32545461       1
27322222       1
14000500       1
12345678       9
11111111      10
10010011       1
10000000       8
9999999        3
3895000        1
3890000        1
2995000        1
2795000        1
1600000        2
1300000        1
1250000        2
1234566        1
Name: price, dtype: int64

There are many unrealistic prices. For example, the maximum price is over 2 billion (2,147,483,647, which is the largest 32-bit signed integer and therefore probably a placeholder rather than a real price). We also notice a jump from 3,895,000 to 9,999,999. However, each price among the top 20 occurs only a handful of times, so we will count the ads in each price range to better determine our cut-off.

In [15]:
# Create new column for price range (initialized with strings so the labels below match its dtype)
autos["price_range"] = ""
autos.loc[(autos["price"]) <= 1000, "price_range"] = "0 - 1,000"
autos.loc[(autos["price"] > 1000) & (autos["price"] <= 10000), "price_range"] = "1,001 - 10,000"
autos.loc[(autos["price"] > 10000) & (autos["price"] <= 100000), "price_range"] = "10,001 - 100,000"
autos.loc[(autos["price"] > 100000) & (autos["price"] <= 200000), "price_range"] = "100,001 - 200,000"
autos.loc[(autos["price"] > 200000) & (autos["price"] <= 300000), "price_range"] = "200,001 - 300,000"
autos.loc[autos["price"] > 300000, "price_range"] = "> 300,000"

print(autos["price_range"].shape[0])
autos["price_range"].value_counts().sort_index(ascending=True)

371528


0 - 1,000             87984
1,001 - 10,000       223142
10,001 - 100,000      59999
100,001 - 200,000       233
200,001 - 300,000        48
> 300,000               122
Name: price_range, dtype: int64

There are 403 car ads priced above 100,000 euro (233 + 48 + 122), which account for 0.11% of our data set. We will therefore drop any rows with a price above 100,000 euro.
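
As an alternative to assigning each range with a separate `.loc` statement, the same buckets could be built with `pd.cut`. This is a sketch under stated assumptions, not the method used in this notebook: the `prices` series is made up so that every bucket, including both edges, is hit.

```python
import numpy as np
import pandas as pd

# Hypothetical prices chosen to land in every bucket.
prices = pd.Series([0, 1000, 1001, 10000, 10001, 150000, 250000, 2147483647])

bins = [0, 1000, 10000, 100000, 200000, 300000, np.inf]
labels = ["0 - 1,000", "1,001 - 10,000", "10,001 - 100,000",
          "100,001 - 200,000", "200,001 - 300,000", "> 300,000"]

# right=True (the default) makes each bin include its upper edge;
# include_lowest=True keeps a price of exactly 0 in the first bin.
price_range = pd.cut(prices, bins=bins, labels=labels, include_lowest=True)

# The result is categorical, so sort_index follows the bucket order.
counts = price_range.value_counts().sort_index()
print(counts)
```

Because the result is an ordered categorical, the buckets sort by range rather than alphabetically, which a plain string column only does here by coincidence of the label text.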

In [16]:
autos = autos[autos["price"] <= 100000]
autos["price"].describe()

count    371125.000000
mean       5607.804734
std        7503.923817
min           0.000000
25%        1150.000000
50%        2950.000000
75%        7199.000000
max      100000.000000
Name: price, dtype: float64

*Remove outliers for* `kilometer`

In [17]:
print(autos["kilometer"].unique().shape[0])
print()
print(autos["kilometer"].describe())

13

count    371125.000000
mean     125692.206130
std       40031.064165
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: kilometer, dtype: float64


In [18]:
autos["kilometer"].value_counts().sort_index(ascending=False)

150000    240720
125000     38054
100000     15896
90000      12516
80000      11038
70000       9762
60000       8650
50000       7598
40000       6358
30000       6015
20000       5632
10000       1922
5000        6964
Name: kilometer, dtype: int64

The kilometer values appear to come from preset options on eBay: they are all rounded and there are only 13 unique values. Since there are no outliers in this column, no rows are removed.

In [19]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 371125 entries, 0 to 371527
Data columns (total 18 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        371125 non-null  object
 1   name                371125 non-null  object
 2   price               371125 non-null  int64 
 3   abtest              371125 non-null  object
 4   vehicle_type        333298 non-null  object
 5   registration_year   371125 non-null  int64 
 6   gearbox             350983 non-null  object
 7   power_ps            371125 non-null  int64 
 8   model               350750 non-null  object
 9   kilometer           371125 non-null  int64 
 10  registration_month  371125 non-null  int64 
 11  fuel_type           337804 non-null  object
 12  brand               371125 non-null  object
 13  unrepaired_damage   299129 non-null  object
 14  ad_created          371125 non-null  object
 15  postal_code         371125 non-null  int64 
 16  la

We will continue working with 371,125 rows and 18 columns.

### Date columns
There are 5 date-related columns in our data set. Some were added by the crawler and some come from the website. We will convert the dates stored as strings into a form we can work with quantitatively.

|Column|Type|Remark|
|------|----|------|
|`date_crawled` | String | Added by the crawler|
|`registration_year` | Integer | From the website|
|`registration_month` | Integer | From the website|
|`ad_created` | String | From the website|
|`last_seen` | String | Added by the crawler|

In [20]:
autos[['date_crawled','ad_created','last_seen']][0:5]

|  | date_crawled | ad_created | last_seen |
| --- | --- | --- | --- |
| 0 | 2016-03-24 11:52:17 | 2016-03-24 00:00:00 | 2016-04-07 03:16:57 |
| 1 | 2016-03-24 10:58:45 | 2016-03-24 00:00:00 | 2016-04-07 01:46:50 |
| 2 | 2016-03-14 12:52:21 | 2016-03-14 00:00:00 | 2016-04-05 12:47:46 |
| 3 | 2016-03-17 16:54:04 | 2016-03-17 00:00:00 | 2016-03-17 17:40:17 |
| 4 | 2016-03-31 17:25:20 | 2016-03-31 00:00:00 | 2016-04-06 10:17:21 |


The date format of all 3 columns is "YYYY-MM-DD hh:mm:ss". We will extract the date section from these columns.
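
The cells in this section slice the first ten characters of each string. A different option, sketched here as an assumption rather than as the approach this notebook takes, is to parse the strings with `pd.to_datetime`, which also allows date arithmetic. The two rows are copied from the preview above; `days_online` is a hypothetical name.

```python
import pandas as pd

# Two rows copied from the preview of the date columns.
sample = pd.DataFrame({
    "ad_created": ["2016-03-24 00:00:00", "2016-03-17 00:00:00"],
    "last_seen": ["2016-04-07 03:16:57", "2016-03-17 17:40:17"],
})

# Parse the strings into datetime64 values using the observed format.
fmt = "%Y-%m-%d %H:%M:%S"
created = pd.to_datetime(sample["ad_created"], format=fmt)
seen = pd.to_datetime(sample["last_seen"], format=fmt)

# Whole days between ad creation and the crawler's last sighting.
days_online = (seen - created).dt.days
print(days_online.tolist())
```

Passing an explicit `format` avoids per-row format inference and fails loudly if a value does not match the expected layout.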

In [21]:
(autos['date_crawled']
                    .str[:10]
                    .value_counts(normalize=True, dropna=False)
                    .sort_index(ascending=True)
)

2016-03-05    0.025557
2016-03-06    0.014464
2016-03-07    0.035697
2016-03-08    0.033458
2016-03-09    0.034220
2016-03-10    0.032574
2016-03-11    0.032728
2016-03-12    0.036209
2016-03-13    0.015749
2016-03-14    0.036257
2016-03-15    0.033450
2016-03-16    0.030154
2016-03-17    0.031652
2016-03-18    0.013122
2016-03-19    0.035295
2016-03-20    0.036357
2016-03-21    0.035721
2016-03-22    0.032463
2016-03-23    0.031973
2016-03-24    0.029917
2016-03-25    0.032946
2016-03-26    0.031976
2016-03-27    0.030267
2016-03-28    0.035118
2016-03-29    0.034164
2016-03-30    0.033525
2016-03-31    0.031873
2016-04-01    0.034115
2016-04-02    0.035091
2016-04-03    0.038720
2016-04-04    0.037596
2016-04-05    0.012812
2016-04-06    0.003161
2016-04-07    0.001617
Name: date_crawled, dtype: float64

The crawler scraped data over roughly 1 month, from 5 March to 7 April 2016. Most days account for roughly 3% to 4% of the rows. A few days (6, 13, and 18 March and 5 April) account for only about 1.3% to 1.6%, and the share drops sharply on the last two days, 6 and 7 April (0.3% and 0.2%).

In [22]:
(autos['ad_created']
                    .str[:10]
                    .value_counts(normalize=True, dropna=False)
                    .sort_index(ascending=True)
)

2014-03-10    0.000003
2015-03-20    0.000003
2015-06-11    0.000003
2015-06-18    0.000003
2015-08-07    0.000003
                ...   
2016-04-03    0.038876
2016-04-04    0.037729
2016-04-05    0.011643
2016-04-06    0.003153
2016-04-07    0.001555
Name: ad_created, Length: 114, dtype: float64

There are 114 unique dates in the `ad_created` column. The oldest ad in our dataset was created on 10 March 2014.

In [23]:
(autos['last_seen']
                    .str[:10]
                    .value_counts(normalize=True, dropna=False)
                    .sort_index(ascending=True)
)

2016-03-05    0.001291
2016-03-06    0.004139
2016-03-07    0.005265
2016-03-08    0.008057
2016-03-09    0.009997
2016-03-10    0.011573
2016-03-11    0.013047
2016-03-12    0.023413
2016-03-13    0.008496
2016-03-14    0.012292
2016-03-15    0.016407
2016-03-16    0.016426
2016-03-17    0.028764
2016-03-18    0.006928
2016-03-19    0.016321
2016-03-20    0.019915
2016-03-21    0.020123
2016-03-22    0.020616
2016-03-23    0.018150
2016-03-24    0.019244
2016-03-25    0.019112
2016-03-26    0.016164
2016-03-27    0.016916
2016-03-28    0.022281
2016-03-29    0.023313
2016-03-30    0.023865
2016-03-31    0.024234
2016-04-01    0.024024
2016-04-02    0.025008
2016-04-03    0.025355
2016-04-04    0.025665
2016-04-05    0.126146
2016-04-06    0.217816
2016-04-07    0.129638
Name: last_seen, dtype: float64

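This is the date on which the crawler last saw each ad. An ad may have disappeared because the car was sold. However, there is an unusual spike from 5 to 7 April. The peak is on 6 April, whose share (about 21.8%) is more than seven times that of any day before 5 April. These are also the last days of the crawling period, so the spike more likely reflects ads that were still online when crawling ended than a surge in sales. Hence, the `last_seen` values for these 3 days might not be an indication of car sales.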

In [24]:
autos["registration_year"].describe()

count    371125.000000
mean       2004.548785
std          91.260186
min        1000.000000
25%        1999.000000
50%        2003.000000
75%        2008.000000
max        9999.000000
Name: registration_year, dtype: float64

The values in `registration_year` range from a minimum of *1000* to a maximum of *9999*. Neither can be a real registration year. We need to explore this column further in order to deal with such invalid values.

### Clean `registration_year` column

In [25]:
np.sort(autos["registration_year"].unique())

array([1000, 1001, 1039, 1111, 1200, 1234, 1253, 1255, 1300, 1400, 1500,
       1600, 1602, 1688, 1800, 1910, 1911, 1915, 1919, 1920, 1923, 1925,
       1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937,
       1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948,
       1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959,
       1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970,
       1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981,
       1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992,
       1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
       2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014,
       2015, 2016, 2017, 2018, 2019, 2066, 2200, 2222, 2290, 2500, 2800,
       2900, 3000, 3200, 3500, 3700, 3800, 4000, 4100, 4500, 4800, 5000,
       5300, 5555, 5600, 5900, 5911, 6000, 6200, 6500, 7000, 7100, 7500,
       7777, 7800, 8000, 8200, 8455, 8500, 8888, 90

The first modern car was invented in 1886, so we will drop rows with a registration year before 1886. We will also drop rows with a registration year after 2016: the data was crawled in March and April 2016, so any later year would lie in the future.
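
The filter relies on `Series.between`, which includes both end points by default, so 1886 and 2016 themselves are kept. A minimal sketch with made-up years around both cut-offs:

```python
import pandas as pd

# Hypothetical years around both cut-offs.
years = pd.Series([1000, 1885, 1886, 2003, 2016, 2017, 9999])

# Series.between is inclusive on both ends by default.
valid = years[years.between(1886, 2016)]

# Fraction of rows that the filter would remove.
removed_share = 1 - len(valid) / len(years)
print(valid.tolist())
```

On the real data, the row counts reported in this notebook go from 371,125 before the filter to 356,388 after it, so about 4% of the rows are removed.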

In [26]:
autos = autos[autos["registration_year"].between(1886,2016)]
autos["registration_year"].describe()

count    356388.000000
mean       2002.784084
std           7.373305
min        1910.000000
25%        1999.000000
50%        2003.000000
75%        2008.000000
max        2016.000000
Name: registration_year, dtype: float64

In [27]:
autos["registration_year"].value_counts(normalize=True).sort_values(ascending=False).head(10)

2000    0.068844
1999    0.063871
2005    0.062600
2006    0.056739
2001    0.056702
2003    0.055748
2004    0.055386
2002    0.053837
1998    0.050338
2007    0.049570
Name: registration_year, dtype: float64

About 57% of the cars in our data set were first registered between 1998 and 2007, the ten most common registration years. This is consistent with the quartiles shown above (25th percentile 1999, 75th percentile 2008).

## Analyze data
### Price-Brand analysis
Here we will analyze which brands are more expensive and which are cheaper on average.

In [28]:
print("Total number of brands: ",autos["brand"].unique().shape[0])
print()
print(autos["brand"].value_counts(normalize=True))

Total number of brands:  40

volkswagen        0.212552
bmw               0.109746
opel              0.107178
mercedes_benz     0.095932
audi              0.089403
ford              0.068905
renault           0.047608
peugeot           0.029892
fiat              0.025784
seat              0.018648
skoda             0.015421
mazda             0.015362
smart             0.014119
citroen           0.013887
nissan            0.013584
toyota            0.012759
sonstige_autos    0.010418
hyundai           0.009840
mini              0.009220
volvo             0.009139
mitsubishi        0.008272
honda             0.007596
kia               0.006886
alfa_romeo        0.006353
suzuki            0.006319
porsche           0.005716
chevrolet         0.005017
chrysler          0.003945
dacia             0.002452
daihatsu          0.002189
jeep              0.002186
land_rover        0.002130
subaru            0.002127
jaguar            0.001700
trabant           0.001619
saab              0.001453

In [29]:
# Select top 10 common brands
top10 = autos["brand"].value_counts(normalize=True).head(10).index

# Find average price of the top 10 brands
brand_mean_price = {}

for brand in top10:
    car_brand = autos[autos["brand"] == brand]
    mean_price = car_brand["price"].mean()
    brand_mean_price[brand] = int(mean_price)

bmp_series = pd.Series(brand_mean_price)
brand_analysis = pd.DataFrame(bmp_series, columns=["mean_price"])
brand_analysis.sort_values("mean_price", ascending=False)

| brand | mean_price |
| --- | --- |
| audi | 8810 |
| mercedes_benz | 8215 |
| bmw | 8165 |
| volkswagen | 5222 |
| seat | 4397 |
| ford | 3585 |
| peugeot | 3206 |
| opel | 2870 |
| fiat | 2803 |
| renault | 2360 |


The `brand` column has 40 unique values: 39 brands plus one catch-all category for other brands (`sonstige_autos`). Since the top 10 brands already cover approximately 80% of the data, we focus on them.

From the information above, there is a big price gap between the top-tier brands and the others. *Audi*, *Mercedes Benz*, and *BMW* cost more than 8,000 euro on average. The least expensive brands are *Opel*, *Fiat*, and *Renault*, at less than 3,000 euro. The brands in the middle range are *Volkswagen*, *Seat*, *Ford*, and *Peugeot*.
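
The per-brand loops in this section could also be written as a single `groupby` aggregation. The sketch below uses a made-up `sample` frame and only the top 2 brands to stay small; `brand_summary` is a hypothetical name, and unlike the loops the means are not truncated to integers.

```python
import pandas as pd

# Hypothetical miniature listing table.
sample = pd.DataFrame({
    "brand": ["audi", "audi", "fiat", "fiat", "opel"],
    "price": [9000, 8000, 2500, 3500, 2800],
    "kilometer": [150000, 125000, 100000, 125000, 150000],
})

# Most common brands (top 2 here; the notebook uses the top 10).
top_brands = sample["brand"].value_counts().head(2).index

# One pass over the data computes both means per brand.
brand_summary = (
    sample[sample["brand"].isin(top_brands)]
    .groupby("brand")
    .agg(mean_price=("price", "mean"), mean_mileage=("kilometer", "mean"))
    .sort_values("mean_price", ascending=False)
)
print(brand_summary)
```

Named aggregation keeps the output column names explicit, so further metrics can be added as extra keyword arguments instead of extra loops.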

### Mileage-Brand analysis
Next we will analyze the mean mileage of each of the top 10 brands.

In [30]:
brand_mean_mileage = {}

for brand in top10:
    car_brand = autos[autos["brand"] == brand]
    mean_km = car_brand["kilometer"].mean()
    brand_mean_mileage[brand] = int(mean_km)

brand_analysis["mean_mileage"] = pd.Series(brand_mean_mileage)
brand_analysis.sort_values("mean_mileage", ascending=False)

| brand | mean_price | mean_mileage |
| --- | --- | --- |
| bmw | 8165 | 132688 |
| mercedes_benz | 8215 | 130671 |
| audi | 8810 | 129529 |
| opel | 2870 | 128756 |
| volkswagen | 5222 | 128341 |
| renault | 2360 | 127877 |
| peugeot | 3206 | 124599 |
| ford | 3585 | 123621 |
| seat | 4397 | 120911 |
| fiat | 2803 | 116523 |


*BMW*, *Mercedes Benz*, and *Audi* are also at the top for mileage. However, mean mileage is similar across brands: the gap between the 1st (BMW, 132,688 km) and the last (Fiat, 116,523 km) is about 16,000 km, or roughly 12% of BMW's figure.

## Conclusion
The price of each brand tends to follow the brand positioning. The top-tier brands (Audi, Mercedes Benz, and BMW) are listed at more than 8,000 euro on average, while the less expensive brands (Fiat, Opel, and Renault) are listed at less than 3,000 euro.

The more expensive brands tend to have higher mileage on average. However, the difference between the 1st (BMW) and the last (Fiat) ranks is only about 12%.

## Next step
### Transmission-Price analysis
We will analyze whether the transmission type is related to the price of a car, focusing only on the common brands from above.

In [31]:
autos["gearbox"].value_counts()

manuell      263282
automatik     74901
Name: gearbox, dtype: int64

In [32]:
# Find average price for manual transmission
brand_manual_price = {}

for brand in top10:
    car_brand = autos[(autos["brand"] == brand) & (autos["gearbox"] == "manuell")]
    mean_price = car_brand["price"].mean()
    brand_manual_price[brand] = int(mean_price)

brand_analysis["manual_mean_price"] = pd.Series(brand_manual_price)
brand_analysis.sort_values("manual_mean_price")


# Find average price for automatic transmission
brand_auto_price = {}

for brand in top10:
    car_brand = autos[(autos["brand"] == brand) & (autos["gearbox"] == "automatik")]
    mean_price = car_brand["price"].mean()
    brand_auto_price[brand] = int(mean_price)

brand_analysis["auto_mean_price"] = pd.Series(brand_auto_price)
brand_analysis.sort_values("auto_mean_price", ascending=False)

| brand | mean_price | mean_mileage | manual_mean_price | auto_mean_price |
| --- | --- | --- | --- | --- |
| audi | 8810 | 129529 | 7052 | 12832 |
| bmw | 8165 | 132688 | 6248 | 12202 |
| seat | 4397 | 120911 | 4140 | 11517 |
| mercedes_benz | 8215 | 130671 | 5223 | 10695 |
| volkswagen | 5222 | 128341 | 4605 | 10157 |
| ford | 3585 | 123621 | 3350 | 7433 |
| fiat | 2803 | 116523 | 2810 | 5443 |
| peugeot | 3206 | 124599 | 3139 | 5133 |
| opel | 2870 | 128756 | 2863 | 3649 |
| renault | 2360 | 127877 | 2346 | 3493 |


The table shows that transmission type is associated with price. For every one of the top 10 brands, cars with automatic transmission are more expensive on average than cars with manual transmission. For most brands, the mean price of manual cars is lower than the brand's overall mean price; the exception is *Fiat* (2,810 versus 2,803 euro), which is possible because ads with a missing `gearbox` value also count toward the overall mean. *Opel* has the smallest price difference between the two transmissions (786 euro), while *Seat* has the largest (7,377 euro).
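
The comparison above could also be produced in one step with `pivot_table`, which places the two transmission types side by side and makes the difference easy to compute. This is a sketch on an invented `sample` frame; `price_by_gearbox` and the `difference` column are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature listing table, including one missing gearbox value.
sample = pd.DataFrame({
    "brand": ["audi", "audi", "audi", "fiat", "fiat", "fiat"],
    "gearbox": ["manuell", "automatik", np.nan, "manuell", "automatik", "manuell"],
    "price": [7000, 13000, 1000, 2500, 5500, 3500],
})

# Mean price per brand and transmission; rows with a missing gearbox are left out.
price_by_gearbox = sample.pivot_table(
    index="brand", columns="gearbox", values="price", aggfunc="mean"
)

# Price gap between automatic and manual cars for each brand.
price_by_gearbox["difference"] = (
    price_by_gearbox["automatik"] - price_by_gearbox["manuell"]
)
print(price_by_gearbox.sort_values("difference", ascending=False))
```

Rows with a missing `gearbox` fall out of the pivot but would still count toward a brand's overall mean, which is why the overall mean need not lie between the manual and automatic means.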