# Exploring eBay Car Sales Data: Pandas
This is a guided project done as part of the [DataQuest](https://www.dataquest.io/) Data Scientist in Python path. 
In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a [classifieds](https://en.wikipedia.org/wiki/Classified_advertising) section of the German eBay website.

The dataset was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data). A few modifications were made to the original Kaggle dataset:

- 50,000 data points were sampled from the full dataset so that the code runs quickly in the hosted environment.
- The Kaggle dataset was relatively clean, so it was intentionally dirtied to resemble a scraped dataset more closely.

The data dictionary provided with the data is as follows:
- `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
- `name` - Name of the car.
- `seller` - Whether the seller is private or a dealer.
- `offerType` - The type of listing
- `price` - The price on the ad to sell the car.
- `abtest` - Whether the listing is included in an A/B test.
- `vehicleType` - The vehicle Type.
- `yearOfRegistration` - The year in which the car was first registered.
- `gearbox` - The transmission type.
- `powerPS` - The power of the car in PS.
- `model` - The car model name.
- `kilometer` - How many kilometers the car has driven.
- `monthOfRegistration` - The month in which the car was first registered.
- `fuelType` - What type of fuel the car uses.
- `brand` - The brand of the car.
- `notRepairedDamage` - If the car has a damage which is not yet repaired.
- `dateCreated` - The date on which the eBay listing was created.
- `nrOfPictures` - The number of pictures in the ad.
- `postalCode` - The postal code for the location of the vehicle.
- `lastSeenOnline` - When the crawler saw this ad last online.

The aim of this project is to clean the data and analyze the included used car listings. 

In [1]:
import pandas as pd
import numpy as np

autos = pd.read_csv('autos.csv', encoding = 'Latin-1')


In [2]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

`DataFrame.info()` summarizes the dataframe. For `autos`, we can observe the following:
- The dataset has 20 columns, most of which are stored as strings (`object` dtype); five are `int64`.
- There are 50,000 rows.
- Five columns contain missing (null) values: `vehicleType`, `gearbox`, `model`, `fuelType` and `notRepairedDamage`.
- The column names are in camelCase rather than Python's preferred snake_case.
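
The missing-value counts can also be read off directly instead of comparing each non-null count with the row total. The following is a minimal sketch on a small invented frame (`toy` and its values are assumptions standing in for `autos`):

```python
import pandas as pd

# Toy frame standing in for `autos`; the values are invented for illustration.
toy = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", "limousine"],
    "gearbox": ["manuell", "automatik", None, None],
    "brand": ["peugeot", "bmw", "ford", "smart"],
})

# Number of missing entries per column, keeping only columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0]
print(missing)

# Share of missing entries per column (mean of a boolean mask).
missing_share = toy.isnull().mean()
print(missing_share)
```

Applied to `autos`, the same two calls would list the five incomplete columns together with how much of each is missing.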

In [3]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


`DataFrame.head()` lists the first 5 rows of a DataFrame. For `autos`, we can observe that:
- The `price` column has to be cleaned to remove '$' and ','.
- The `odometer` column has to be cleaned to remove ',' and 'km'.
- `dateCrawled`, `dateCreated` and `lastSeen` are date fields.

In [4]:
org_col_names = autos.columns

In [5]:
org_col_names

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [6]:
mod_col_names = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
                'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
                 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
                 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'
                ]

In [7]:
autos.columns = mod_col_names

In [8]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gear_box,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


The column names are converted from camelCase to snake_case, and four of them (`yearOfRegistration`, `monthOfRegistration`, `notRepairedDamage` and `dateCreated`) are also reworded to be more descriptive.
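
For the columns that only change case, the conversion could also be automated rather than typed by hand. This is a sketch, not the approach used above: `camel_to_snake` is a hypothetical helper, and the reworded columns (for example `dateCreated` to `ad_created`) would still need a manual mapping.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before a capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# A few of the original column names from the dataset.
original = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "postalCode"]
converted = [camel_to_snake(c) for c in original]
print(converted)
```

Note that a name without interior capitals, such as `gearbox`, is left unchanged by this helper, whereas the list above spells it `gear_box`.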

### Initial exploration and cleaning

In [9]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gear_box,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-21 16:37:21,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


The following columns have only two unique values:
- `seller`, `offer_type`, `abtest`, `gear_box`, `unrepaired_damage`

We will look at the unique values of each of these columns in detail.

In [10]:
sel_cols = ['seller', 'offer_type', 'abtest', 'gear_box', 'unrepaired_damage']

for col in sel_cols:
    series = autos[col]
    print('\n\nColumn: ',col)
    print(series.value_counts())



Column:  seller
privat        49999
gewerblich        1
Name: seller, dtype: int64


Column:  offer_type
Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64


Column:  abtest
test       25756
control    24244
Name: abtest, dtype: int64


Column:  gear_box
manuell      36993
automatik    10327
Name: gear_box, dtype: int64


Column:  unrepaired_damage
nein    35232
ja       4939
Name: unrepaired_damage, dtype: int64


#### Observations 

1. The columns `seller` and `offer_type` carry almost no information, since all but one row share the same value. According to Google Translate:
    - Only one entry in `seller` is 'gewerblich' (commercial); all the rest are 'privat' (private).
    - Only one entry in `offer_type` is 'Gesuch' (request); all the rest are 'Angebot' (offer).
2. The `price` and `odometer` columns have to be cleaned and converted to a numeric type.

In [11]:
# 1. Drop the seller and offer_type columns.
# Note: re-running this cell raises an error, because drop() cannot find
# the columns once they have been removed.
autos = autos.drop(columns = ['seller', 'offer_type'])

# 2. Convert the price and odometer columns to float.
# regex=False makes '$' a literal character rather than a regex end-of-string anchor.
autos['price'] = autos['price'].str.replace('$','', regex=False).str.replace(',','', regex=False).astype(float)
autos['odometer'] = autos['odometer'].str.replace('km','', regex=False).str.replace('Km','', regex=False).str.replace(',','', regex=False).astype(float)
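
One way to make the drop step safe to re-run is the `errors="ignore"` option of `DataFrame.drop`. The sketch below uses a small invented frame (`toy` is an assumption, not the real `autos`) and runs the drop twice to show that no error is raised:

```python
import pandas as pd

# Toy frame with the same column names as the cleaned dataset; values invented.
toy = pd.DataFrame({
    "seller": ["privat", "privat"],
    "offer_type": ["Angebot", "Angebot"],
    "price": ["$5,000", "$1,350"],
})

# errors="ignore" skips columns that are already gone, so repeating the
# call (as happens when a notebook cell is executed twice) is harmless.
for _ in range(2):
    toy = toy.drop(columns=["seller", "offer_type"], errors="ignore")

print(list(toy.columns))
```

The trade-off is that a misspelled column name is also silently ignored, so the default behaviour can be the safer choice during exploration.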


In [12]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gear_box,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000.0,50000,44905,50000.0,47320,50000.0,47242,50000.0,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,,2,8,,2,,245,,,7,40,2,76,,,39481
top,2016-03-21 16:37:21,Ford_Fiesta,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,,25756,12859,,36993,,4024,,,30107,10687,35232,1946,,,8
mean,,,9840.044,,,2005.07328,,116.35592,,125732.7,5.72336,,,,,0.0,50813.6273,
std,,,481104.4,,,105.712813,,209.216627,,40042.211706,3.711984,,,,,0.0,25779.747957,
min,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,1100.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30451.0,
50%,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49577.0,
75%,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71540.0,


Note that the minimum price is $0, so at least one car was listed for free. Let's look at the `price` column in detail.

In [13]:
autos['price'].value_counts()

0.0           1421
500.0          781
1500.0         734
2500.0         643
1200.0         639
1000.0         639
600.0          531
800.0          498
3500.0         498
2000.0         460
999.0          434
750.0          433
900.0          420
650.0          419
850.0          410
700.0          395
4500.0         394
300.0          384
2200.0         382
950.0          379
1100.0         376
1300.0         371
3000.0         365
550.0          356
1800.0         355
5500.0         340
1250.0         335
350.0          335
1600.0         327
1999.0         322
              ... 
2225.0           1
69997.0          1
139997.0         1
69999.0          1
4780.0           1
8930.0           1
21599.0          1
15911.0          1
10000000.0       1
5180.0           1
919.0            1
1247.0           1
5998.0           1
27020.0          1
21888.0          1
46500.0          1
2001.0           1
2459.0           1
345000.0         1
34940.0          1
2785.0           1
5248.0      

In fact, 1,421 entries have a price of $0. These are most likely incorrect inputs. Before dealing with them, we will rename the `odometer` column to `odometer_km` so that the unit is clear.

In [14]:
autos.rename(columns={'odometer':'odometer_km'}, inplace=True)

In [15]:
#checking whether changes are reflected or not
autos.head(3)

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gear_box,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,5000.0,control,bus,2004,manuell,158,andere,150000.0,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,8500.0,control,limousine,1997,automatik,286,7er,150000.0,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,8990.0,test,limousine,2009,manuell,102,golf,70000.0,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37


### Exploring the `odometer_km` and `price` columns

In [16]:
autos['odometer_km'].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [17]:
autos['price'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [18]:
autos['price'].value_counts().sort_index(ascending = False).head(10)

99999999.0    1
27322222.0    1
12345678.0    3
11111111.0    2
10000000.0    1
3890000.0     1
1300000.0     1
1234566.0     1
999999.0      2
999990.0      1
Name: price, dtype: int64

1. `odometer_km` has values between 5,000 and 150,000. This is a plausible range, and there are no outliers in this column.
2. `price`, in contrast, varies wildly: some listings are as low as 0 to 10 dollars, others exceed 75,000 dollars, and the highest is 99,999,999 dollars. Hence the `price` column has to be cleaned.
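
When a column contains a few extreme values, quantiles can help in choosing cut-off bounds because, unlike the mean, they are barely affected by those values. A minimal sketch on invented prices (the series below is an assumption, not taken from the dataset):

```python
import pandas as pd

# Invented prices, including a free listing and a placeholder-style extreme value.
prices = pd.Series([0.0, 1.0, 500.0, 1500.0, 2950.0, 7200.0, 12000.0, 99999999.0])

# Quantiles describe where the bulk of the listings sit.
print(prices.quantile([0.05, 0.25, 0.5, 0.75, 0.95]))

# The single extreme value drags the mean far above the median.
print("mean:", prices.mean(), "median:", prices.median())
```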

In [19]:
autos = autos[autos['price'].between(1000,75000)]

In [20]:
autos['price'].describe()

count    38558.000000
mean      7029.210203
std       7586.142730
min       1000.000000
25%       2200.000000
50%       4300.000000
75%       8900.000000
max      75000.000000
Name: price, dtype: float64

Since the price column contains many implausible entries, we keep only prices between 1,000 and 75,000 dollars (inclusive). This reduces the dataset to 38,558 rows.

### Exploring the date columns

There are five columns related to dates:
- `date_crawled`
- `last_seen`
- `ad_created`
- `registration_month`
- `registration_year`

Of these five, only `registration_month` and `registration_year` are integer fields. The other three are stored as strings and have to be converted to a date or numerical representation for meaningful analysis.

In [21]:
autos[['date_crawled','last_seen','ad_created']].head(3)

Unnamed: 0,date_crawled,last_seen,ad_created
0,2016-03-26 17:47:46,2016-04-06 06:45:54,2016-03-26 00:00:00
1,2016-04-04 13:38:56,2016-04-06 14:45:08,2016-04-04 00:00:00
2,2016-03-26 18:57:24,2016-04-06 20:15:37,2016-03-26 00:00:00


These three columns are stored as strings, so we first look at how their raw values are distributed.
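
If a proper date type is wanted, the strings could be parsed with `pd.to_datetime`. The sketch below is one possible approach rather than the one taken in this notebook; it uses the three `date_crawled` values shown above and assumes every row follows the same `YYYY-MM-DD HH:MM:SS` layout:

```python
import pandas as pd

# The three timestamps displayed in the table above.
stamps = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
])

# Parse the strings; an explicit format avoids guessing and is faster.
parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")
print(parsed.dtype)

# Once parsed, components are available through the .dt accessor.
print(parsed.dt.day.tolist())
```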

In [22]:
date_cols = ['date_crawled','last_seen','ad_created','registration_month','registration_year']

print(autos['date_crawled'].value_counts(normalize = True, dropna = False).sort_index())
print(autos['last_seen'].value_counts(normalize = True, dropna = False).sort_index())
autos['ad_created'].value_counts(normalize = True, dropna = False).sort_index()

2016-03-05 14:06:30    0.000026
2016-03-05 14:06:40    0.000026
2016-03-05 14:07:21    0.000026
2016-03-05 14:07:26    0.000026
2016-03-05 14:07:40    0.000026
2016-03-05 14:07:45    0.000026
2016-03-05 14:08:00    0.000052
2016-03-05 14:08:05    0.000052
2016-03-05 14:08:42    0.000026
2016-03-05 14:09:02    0.000052
2016-03-05 14:09:05    0.000026
2016-03-05 14:09:22    0.000026
2016-03-05 14:09:38    0.000026
2016-03-05 14:09:46    0.000026
2016-03-05 14:09:57    0.000026
2016-03-05 14:09:58    0.000052
2016-03-05 14:10:20    0.000026
2016-03-05 14:10:46    0.000026
2016-03-05 14:11:03    0.000026
2016-03-05 14:11:05    0.000026
2016-03-05 14:11:14    0.000026
2016-03-05 14:11:25    0.000026
2016-03-05 14:11:40    0.000026
2016-03-05 14:11:56    0.000026
2016-03-05 14:12:20    0.000026
2016-03-05 14:12:32    0.000026
2016-03-05 14:12:34    0.000026
2016-03-05 14:12:35    0.000026
2016-03-05 14:12:41    0.000026
2016-03-05 14:12:54    0.000026
                         ...   
2016-04-

2015-06-11 00:00:00    0.000026
2015-08-10 00:00:00    0.000026
2015-09-09 00:00:00    0.000026
2015-11-10 00:00:00    0.000026
2015-12-30 00:00:00    0.000026
2016-01-03 00:00:00    0.000026
2016-01-07 00:00:00    0.000026
2016-01-10 00:00:00    0.000052
2016-01-13 00:00:00    0.000026
2016-01-14 00:00:00    0.000026
2016-01-16 00:00:00    0.000026
2016-01-22 00:00:00    0.000026
2016-01-27 00:00:00    0.000078
2016-01-29 00:00:00    0.000026
2016-02-01 00:00:00    0.000026
2016-02-02 00:00:00    0.000052
2016-02-05 00:00:00    0.000052
2016-02-07 00:00:00    0.000026
2016-02-09 00:00:00    0.000026
2016-02-11 00:00:00    0.000026
2016-02-12 00:00:00    0.000052
2016-02-14 00:00:00    0.000052
2016-02-16 00:00:00    0.000026
2016-02-17 00:00:00    0.000026
2016-02-18 00:00:00    0.000052
2016-02-19 00:00:00    0.000078
2016-02-20 00:00:00    0.000026
2016-02-21 00:00:00    0.000052
2016-02-22 00:00:00    0.000026
2016-02-23 00:00:00    0.000104
                         ...   
2016-03-

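Two things can be observed from this:
- `date_crawled` and `last_seen`: the timestamps are recorded to the second, so almost every value is unique and the raw counts show no useful distribution. Grouping by the date part alone (the first 10 characters) would be more informative.
- `ad_created`: every timestamp is at midnight (`00:00:00`), so the time component carries no information and can be dropped, leaving only the date.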
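
A per-day distribution can be obtained by slicing off the first 10 characters of each timestamp before counting. This is a minimal sketch using four `date_crawled` values from the preview shown earlier; on the full column the same chain would give one row per crawl day:

```python
import pandas as pd

# Four timestamps taken from the head() output earlier in the notebook.
crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-03-12 16:58:10",
])

# Keep only the "YYYY-MM-DD" part, then compute the share of rows per day.
per_day = (crawled.str[:10]
           .value_counts(normalize=True, dropna=False)
           .sort_index())
print(per_day)
```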

In [23]:
print(autos['registration_year'].value_counts(normalize = True, dropna = False).sort_index(ascending = True))
print(autos['registration_month'].value_counts(normalize = True, dropna = False).sort_index())

1000    0.000026
1001    0.000026
1927    0.000026
1929    0.000026
1931    0.000026
1934    0.000052
1937    0.000104
1938    0.000026
1939    0.000026
1941    0.000052
1943    0.000026
1948    0.000026
1950    0.000026
1951    0.000026
1952    0.000026
1953    0.000026
1954    0.000052
1955    0.000026
1956    0.000104
1957    0.000052
1958    0.000078
1959    0.000156
1960    0.000441
1961    0.000156
1962    0.000104
1963    0.000182
1964    0.000233
1965    0.000441
1966    0.000545
1967    0.000648
          ...   
1999    0.044712
2000    0.051870
2001    0.053530
2002    0.055371
2003    0.064241
2004    0.067612
2005    0.072229
2006    0.068728
2007    0.058561
2008    0.057083
2009    0.053763
2010    0.040951
2011    0.041781
2012    0.033741
2013    0.020437
2014    0.016728
2015    0.009233
2016    0.016728
2017    0.026142
2018    0.010322
2019    0.000026
2800    0.000026
4100    0.000026
4500    0.000026
5000    0.000052
5911    0.000026
6200    0.000026
8888    0.0000

Some values in `registration_year` are clearly erroneous, such as 1000 or 8888. Since the data was scraped in 2016, any registration year after 2016 must be wrong. We also choose to discard cars registered before 1990, keeping only the years 1990 to 2016.
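
Before filtering, it can be useful to check how much of the data the chosen range would keep. The sketch below uses invented years (an assumption for illustration) and relies on `Series.between` being inclusive on both ends by default:

```python
import pandas as pd

# Invented registration years, mixing plausible and impossible values.
years = pd.Series([1000, 1985, 1999, 2005, 2016, 2017, 8888, 2010])

# Boolean mask of rows inside the accepted range (both bounds included).
valid = years.between(1990, 2016)

# The mean of a boolean mask is the share of True values.
print("share kept:", valid.mean())

filtered = years[valid]
print(filtered.tolist())
```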

In [24]:
autos = autos[autos['registration_year'].between(1990,2016)]

In [25]:
autos['registration_year'].value_counts(normalize = True)

2005    0.077262
2006    0.073517
2004    0.072324
2003    0.068718
2007    0.062642
2008    0.061061
2002    0.059230
2009    0.057510
2001    0.057260
2000    0.055485
1999    0.047828
2011    0.044693
2010    0.043805
2012    0.036093
1998    0.035482
2013    0.021861
1997    0.021778
2016    0.017894
2014    0.017894
1996    0.015009
1995    0.012234
2015    0.009876
1994    0.007518
1993    0.006048
1992    0.006020
1991    0.005743
1990    0.005216
Name: registration_year, dtype: float64

### Exploring Price by Brand

In [26]:
brands_val = autos['brand'].value_counts()
print(brands_val)

volkswagen        7604
bmw               4574
mercedes_benz     3917
audi              3599
opel              3244
ford              2108
renault           1374
peugeot           1037
fiat               753
skoda              707
seat               643
smart              618
toyota             536
mazda              527
citroen            503
nissan             502
mini               402
hyundai            400
volvo              317
kia                286
honda              264
mitsubishi         255
sonstige_autos     253
chevrolet          216
alfa_romeo         213
suzuki             212
porsche            203
dacia              122
chrysler           113
jeep               100
land_rover          92
jaguar              66
daihatsu            63
subaru              61
saab                50
daewoo              34
rover               27
lancia              23
lada                20
trabant              8
Name: brand, dtype: int64


In [27]:
tot_cars = autos['brand'].value_counts().sum()
print(tot_cars)

36046


The total number of cars across all brands is 36,046. We will consider only the brands that make up more than 1% of the listed cars.
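
The 1% threshold can also be applied without an explicit loop, by asking `value_counts` for shares and filtering the result. A minimal sketch on invented brand counts (the series below is an assumption, not the real data):

```python
import pandas as pd

# Invented listing: 200 cars, of which a single one is a trabant (0.5%).
brands = pd.Series(["volkswagen"] * 120 + ["bmw"] * 79 + ["trabant"] * 1)

# normalize=True returns each brand's share of all rows, sorted descending.
shares = brands.value_counts(normalize=True)

# Keep only the brands above the 1% threshold.
interested = shares[shares > 0.01].index.tolist()
print(interested)
```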


In [28]:
interested_brands = [] #making list of brands which have more than 1% listings
for brand, val in brands_val.items():  # Series.iteritems() was removed in pandas 2.0
    if (val/tot_cars)>0.01:
        interested_brands.append(brand)
        
interested_brands


['volkswagen',
 'bmw',
 'mercedes_benz',
 'audi',
 'opel',
 'ford',
 'renault',
 'peugeot',
 'fiat',
 'skoda',
 'seat',
 'smart',
 'toyota',
 'mazda',
 'citroen',
 'nissan',
 'mini',
 'hyundai']

In [30]:
brand_mean_prices = {}
brand_mean_mileage = {}
for brand in interested_brands:
    avg_brand = autos.loc[autos['brand'] == brand,'price'].mean()
    avg_mileage = autos.loc[autos['brand'] == brand,'odometer_km'].mean()
    brand_mean_prices[brand] = avg_brand
    brand_mean_mileage[brand] = avg_mileage
    
print(pd.Series(brand_mean_prices).sort_values(ascending = True))
#print()

renault           3510.434498
smart             3780.469256
fiat              3884.151394
peugeot           3948.376085
opel              4203.332922
citroen           4471.831014
ford              4917.289374
mazda             5327.455408
toyota            5563.102612
seat              5638.640747
hyundai           6181.872500
nissan            6445.464143
volkswagen        6647.213835
skoda             6841.906648
bmw               9012.056843
mercedes_benz     9107.126628
audi             10314.614893
mini             10742.965174
dtype: float64


**Analysis**:
- The costliest brand is mini, the second costliest is audi
- The cheapest brand is renault followed by smart
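
The per-brand means computed in the loop above could alternatively be obtained with `groupby`, which aggregates both columns in one pass. This is a sketch on an invented frame (`toy`, `selected` and all values are assumptions):

```python
import pandas as pd

# Invented listings with the same column names as the cleaned dataset.
toy = pd.DataFrame({
    "brand": ["mini", "mini", "renault", "renault", "audi"],
    "price": [10000.0, 12000.0, 3000.0, 4000.0, 9000.0],
    "odometer_km": [80000.0, 100000.0, 150000.0, 125000.0, 150000.0],
})

# Hypothetical list of brands of interest.
selected = ["mini", "renault"]

# Filter to the selected brands, then average price and mileage per brand.
means = (toy[toy["brand"].isin(selected)]
         .groupby("brand")[["price", "odometer_km"]]
         .mean()
         .sort_values("price"))
print(means)
```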

### Storing aggregate data in a DataFrame

We want to store the two dictionaries, `brand_mean_prices` and `brand_mean_mileage`, in a single DataFrame. This is achieved in two steps:
- First, we convert each dictionary into a `pd.Series`.
- Second, we combine these two `pd.Series` into a `pd.DataFrame`.
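
The two steps can also be collapsed into a single constructor call, since `pd.DataFrame` accepts a dictionary of Series and aligns them on their index. The sketch below uses three of the brand averages reported in this notebook; the variable names are assumptions:

```python
import pandas as pd

# Averages for three brands, as reported in the outputs of this notebook.
prices = {"audi": 10314.614893, "bmw": 9012.056843, "mini": 10742.965174}
mileage = {"mini": 88843.283582, "audi": 127357.599333, "bmw": 132176.432007}

# Each dictionary key becomes a column; rows are matched by brand name,
# so the differing key order of the two dictionaries does not matter.
df = pd.DataFrame({
    "mean_price": pd.Series(prices),
    "mean_mileage": pd.Series(mileage),
})
print(df)
```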

In [45]:
brand_mean_prices_series = pd.Series(brand_mean_prices)
brand_mean_mileage_series = pd.Series(brand_mean_mileage)
brand_mean_prices_mileage_df = pd.DataFrame(brand_mean_prices_series)
brand_mean_prices_mileage_df.columns = ['mean_price']
brand_mean_prices_mileage_df['mean_mileage'] = brand_mean_mileage_series

In [46]:
brand_mean_prices_mileage_df

Unnamed: 0,mean_price,mean_mileage
audi,10314.614893,127357.599333
bmw,9012.056843,132176.432007
citroen,4471.831014,114801.192843
fiat,3884.151394,109183.266932
ford,4917.289374,120915.559772
hyundai,6181.8725,101862.5
mazda,5327.455408,119981.024668
mercedes_benz,9107.126628,130195.302527
mini,10742.965174,88843.283582
nissan,6445.464143,110139.442231


In [48]:
(brand_mean_prices_mileage_df['mean_price']/brand_mean_prices_mileage_df['mean_mileage']).sort_values(ascending = True)

renault          0.028855
peugeot          0.032255
opel             0.033736
fiat             0.035575
smart            0.038665
citroen          0.038953
ford             0.040667
mazda            0.044402
seat             0.048565
toyota           0.048926
volkswagen       0.052695
nissan           0.058521
hyundai          0.060688
skoda            0.062032
bmw              0.068182
mercedes_benz    0.069950
audi             0.080989
mini             0.120920
dtype: float64

### Conclusion
The point here is that a nearly new Honda may be more valuable than an Audi that has run 100,000 km. So we shouldn't look at `mean_price` alone but at `mean_price`/`mean_mileage`, which gives the price relative to the distance a car has clocked. Ideally we want a car with low mileage; if a brand's price stays high even at higher mileage, that brand commands more value. Based on this criterion, we can say that:
- `mini` is the costliest brand
- `audi` is the next costliest brand and,
- `renault` commands the lowest brand value in the given data set
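
As a quick check of the ratio behind this ranking, the price per kilometre can be recomputed by hand for a few brands. The sketch below uses only the averages printed in the table above (Renault's mean mileage is not shown there, so it is left out):

```python
# Mean price and mean mileage for three brands, copied from the table above.
mean_price = {"mini": 10742.965174, "audi": 10314.614893, "bmw": 9012.056843}
mean_mileage = {"mini": 88843.283582, "audi": 127357.599333, "bmw": 132176.432007}

# Price per kilometre driven for each brand.
ratio = {brand: mean_price[brand] / mean_mileage[brand] for brand in mean_price}

# Brands ordered from highest to lowest price per kilometre.
ranking = sorted(ratio, key=ratio.get, reverse=True)
for brand in ranking:
    print(brand, round(ratio[brand], 6))
```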

