# Exploring and cleaning eBay Car Sales Data

In this project we explore a dataset of used cars from eBay and perform cleaning operations on it.

This is a guided project from the Data Analyst in Python certification from Dataquest.

We will be using a dataset of used cars from a section of the German eBay website, which was uploaded to Kaggle.

The objective of this project is to clean the data and analyze the included used car listings.



## Importing the data

In [2]:
# import Pandas and Numpy libraries
import numpy as np
import pandas as pd

# reading the csv file containing the data

autos = pd.read_csv('autos.csv', encoding = 'Latin-1')

In [3]:
# review information about the data set
# such as number of columns and rows and types of each column

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

In [4]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Initial observations:

* The dataset contains 50,000 rows and 20 columns, most of which are stored as strings
* Column names use camelCase instead of snake_case, the convention Python prefers

## Renaming column names

Changing the column names to the snake_case convention

In [5]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [6]:
# editing the column names

new_col_names = ['date_crawled', 
             'name', 
             'seller', 
             'offer_type', 
             'price', 
             'abtest',
             'vehicle_type', 
             'registration_year', 
             'gearbox', 
             'power_ps', 
             'model',
             'odometer', 
             'registration_month', 
             'fuel_type', 
             'brand',
             'unrepaired_damage', 
             'ad_created', 
             'number_pictures', 
             'postal_code',
             'last_seen']

autos.columns = new_col_names
autos.columns


Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'number_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
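
The renaming above was done by hand. For a wider table, a mechanical camelCase to snake_case conversion can save typing. The sketch below is only an illustration: `camel_to_snake` is a hypothetical helper, not part of pandas, and it only changes the casing, so it would produce `year_of_registration` rather than the reworded `registration_year` chosen above.

```python
import re

def camel_to_snake(name):
    """Hypothetical helper: convert a camelCase name to snake_case."""
    # Put an underscore before every capital letter that follows
    # a lowercase letter or a digit, then lowercase everything.
    with_underscores = re.sub(r'(?<=[a-z0-9])([A-Z])', r'_\1', name)
    return with_underscores.lower()

original = ['dateCrawled', 'offerType', 'powerPS', 'nrOfPictures', 'abtest']
converted = [camel_to_snake(col) for col in original]
print(converted)
```

With a DataFrame, the same function could be applied as `autos.columns = [camel_to_snake(c) for c in autos.columns]`, followed by a manual `rename` for the few columns that need different wording.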

## Data Exploration

Let's run some basic exploration to find areas where the data needs cleaning.

In [7]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-30 19:48:02,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Initial observations:

* The seller and offer_type columns contain the same value for almost every row
* The number_pictures column appears to contain only zeros. We will need to investigate further
* The price and odometer columns are numbers stored as strings

## Analyzing columns number_pictures, seller and offer_type

In [8]:
print('Values count for column number_pictures:')
print(autos['number_pictures'].value_counts())
print()  # blank line between sections
print('Values count for column seller:')
print(autos['seller'].value_counts())
print()  # blank line between sections
print('Values count for column offer_type:')
print(autos['offer_type'].value_counts())

Values count for column number_pictures:
0    50000
Name: number_pictures, dtype: int64

Values count for column seller:
privat        49999
gewerblich        1
Name: seller, dtype: int64

Values count for column offer_type:
Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64


Each of these three columns holds a single value for all, or all but one, of the rows, so they carry no useful information. Let's remove them from our dataset.

In [9]:
autos = autos.drop(columns= ['number_pictures','seller','offer_type'])
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 17 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
price                 50000 non-null object
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gearbox               47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer              50000 non-null object
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int64(4), object(13)
memory usage: 6.5+ MB
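
The three columns above were spotted by eye in the `describe` output. As a rough sketch, the same check can be automated by measuring how much of each column is taken up by its most frequent value. The helper name `near_constant_columns`, the 0.95 threshold, and the toy DataFrame are all assumptions made for illustration.

```python
import pandas as pd

def near_constant_columns(df, threshold=0.95):
    """Hypothetical helper: list columns dominated by a single value."""
    flagged = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the top share
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Toy data standing in for the real dataset
toy = pd.DataFrame({
    'seller': ['privat'] * 99 + ['gewerblich'],
    'number_pictures': [0] * 100,
    'brand': ['bmw', 'audi'] * 50,
})
flagged = near_constant_columns(toy)
print(flagged)
```

A column flagged this way is only a candidate for removal; whether it is safe to drop still depends on what the rare values mean.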


## Changing data type for columns price and odometer

The odometer and price columns hold numeric values stored as text. Let's fix that.

In [10]:
autos['price']

0         $5,000
1         $8,500
2         $8,990
3         $4,350
4         $1,350
5         $7,900
6           $300
7         $1,990
8           $250
9           $590
10          $999
11          $350
12        $5,299
13        $1,350
14        $3,999
15       $18,900
16          $350
17        $5,500
18          $300
19        $4,150
20        $3,500
21       $41,500
22       $25,450
23        $7,999
24       $48,500
25           $90
26          $777
27            $0
28        $5,250
29        $4,999
          ...   
49970    $15,800
49971       $950
49972     $3,300
49973     $6,000
49974         $0
49975     $9,700
49976     $5,900
49977     $5,500
49978       $900
49979    $11,000
49980       $400
49981     $2,000
49982     $1,950
49983       $600
49984         $0
49985     $1,000
49986    $15,900
49987    $21,990
49988     $9,550
49989       $150
49990    $17,500
49991       $500
49992     $4,800
49993     $1,650
49994     $5,000
49995    $24,900
49996     $1,980
49997    $13,2

In [11]:
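# removing '$' and ',' characters from column price
# regex=False treats '$' as a literal character rather than a regex anchor

autos['price'] = autos['price'].str.replace('$', '', regex=False)
autos['price'] = autos['price'].str.replace(',', '', regex=False)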

autos['price']

0         5000
1         8500
2         8990
3         4350
4         1350
5         7900
6          300
7         1990
8          250
9          590
10         999
11         350
12        5299
13        1350
14        3999
15       18900
16         350
17        5500
18         300
19        4150
20        3500
21       41500
22       25450
23        7999
24       48500
25          90
26         777
27           0
28        5250
29        4999
         ...  
49970    15800
49971      950
49972     3300
49973     6000
49974        0
49975     9700
49976     5900
49977     5500
49978      900
49979    11000
49980      400
49981     2000
49982     1950
49983      600
49984        0
49985     1000
49986    15900
49987    21990
49988     9550
49989      150
49990    17500
49991      500
49992     4800
49993     1650
49994     5000
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price, Length: 50000, dtype: object

In [12]:
# Changing data type of column price to int
autos['price'] = autos['price'].astype(int)
autos.info()

autos['price']

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 17 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
price                 50000 non-null int64
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gearbox               47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer              50000 non-null object
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int64(5), object(12)
memory usage: 6.5+ MB


0         5000
1         8500
2         8990
3         4350
4         1350
5         7900
6          300
7         1990
8          250
9          590
10         999
11         350
12        5299
13        1350
14        3999
15       18900
16         350
17        5500
18         300
19        4150
20        3500
21       41500
22       25450
23        7999
24       48500
25          90
26         777
27           0
28        5250
29        4999
         ...  
49970    15800
49971      950
49972     3300
49973     6000
49974        0
49975     9700
49976     5900
49977     5500
49978      900
49979    11000
49980      400
49981     2000
49982     1950
49983      600
49984        0
49985     1000
49986    15900
49987    21990
49988     9550
49989      150
49990    17500
49991      500
49992     4800
49993     1650
49994     5000
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price, Length: 50000, dtype: int64

In [13]:
autos['odometer']

0        150,000km
1        150,000km
2         70,000km
3         70,000km
4        150,000km
5        150,000km
6        150,000km
7        150,000km
8        150,000km
9        150,000km
10       150,000km
11       150,000km
12        50,000km
13       150,000km
14       150,000km
15        80,000km
16       150,000km
17       150,000km
18       150,000km
19       150,000km
20       150,000km
21       150,000km
22        10,000km
23       150,000km
24        30,000km
25       150,000km
26       125,000km
27       150,000km
28       150,000km
29       150,000km
           ...    
49970     60,000km
49971    150,000km
49972    150,000km
49973    150,000km
49974    150,000km
49975    100,000km
49976    150,000km
49977    150,000km
49978    150,000km
49979     70,000km
49980    125,000km
49981    150,000km
49982     90,000km
49983    150,000km
49984    150,000km
49985    150,000km
49986    125,000km
49987     50,000km
49988    150,000km
49989    150,000km
49990     30,000km
49991    150

In [14]:
# removing 'km' and ',' characters from column odometer

autos['odometer'] = autos['odometer'].str.replace('km', '', regex=False)
autos['odometer'] = autos['odometer'].str.replace(',', '', regex=False)

autos['odometer']

0        150000
1        150000
2         70000
3         70000
4        150000
5        150000
6        150000
7        150000
8        150000
9        150000
10       150000
11       150000
12        50000
13       150000
14       150000
15        80000
16       150000
17       150000
18       150000
19       150000
20       150000
21       150000
22        10000
23       150000
24        30000
25       150000
26       125000
27       150000
28       150000
29       150000
          ...  
49970     60000
49971    150000
49972    150000
49973    150000
49974    150000
49975    100000
49976    150000
49977    150000
49978    150000
49979     70000
49980    125000
49981    150000
49982     90000
49983    150000
49984    150000
49985    150000
49986    125000
49987     50000
49988    150000
49989    150000
49990     30000
49991    150000
49992    125000
49993    150000
49994    150000
49995    100000
49996    150000
49997      5000
49998     40000
49999    150000
Name: odometer, Length: 

In [15]:
# Changing data type of column odometer to int
autos['odometer'] = autos['odometer'].astype(int)

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 17 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
price                 50000 non-null int64
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gearbox               47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer              50000 non-null int64
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int64(6), object(11)
memory usage: 6.5+ MB
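
The price and odometer conversions follow the same pattern: strip a few literal tokens, then cast to an integer. A minimal sketch of a shared helper is shown below. The name `to_int_series` and its `junk` parameter are hypothetical, and the sample values are copied from the outputs above.

```python
import pandas as pd

def to_int_series(series, junk=('$', ',', 'km')):
    """Hypothetical helper: strip literal tokens and convert to integers."""
    cleaned = series.astype(str)
    for token in junk:
        # regex=False makes every token a plain substring, so '$' is safe
        cleaned = cleaned.str.replace(token, '', regex=False)
    # to_numeric raises if anything non-numeric is left behind
    return pd.to_numeric(cleaned.str.strip(), errors='raise').astype(int)

prices = to_int_series(pd.Series(['$5,000', '$8,500', '$0']))
odometer = to_int_series(pd.Series(['150,000km', '70,000km']))
print(prices.tolist())
print(odometer.tolist())
```

Raising on leftover text is a deliberate choice in this sketch: a silent conversion to NaN would hide values that do not fit the expected format.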


In [16]:
# Renaming column odometer to odometer_km

autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 17 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
price                 50000 non-null int64
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gearbox               47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer_km           50000 non-null int64
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int64(6), object(11)
memory usage: 6.5+ MB


Let's analyze the price and odometer_km columns more deeply

## Exploring Odometer and Price

In [17]:
print('Column odometer_km has {} unique values:'.format(autos['odometer_km'].nunique()))
print(autos['odometer_km'].value_counts())


Column odometer_km has 13 unique values:
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64


The odometer values look rounded to preset brackets, and most of the cars have high mileage: 32,424 of the 50,000 listings are at 150,000 km.

In [20]:
print('Column price has {} unique values:'.format(autos['price'].nunique()))
print(autos['price'].value_counts().head(10))

Column price has 2357 unique values:
0       1421
500      781
1500     734
2500     643
1000     639
1200     639
600      531
800      498
3500     498
2000     460
Name: price, dtype: int64


Prices look rounded as well. There are also 1,421 cars with a price of $0.

In [26]:
print('Cars with highest prices:')
autos['price'].value_counts().sort_index(ascending=False).head(20)

Cars with highest prices:


99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

In [27]:
print('Cars with lowest prices:')
autos['price'].value_counts().sort_index(ascending=True).head(20)

Cars with lowest prices:


0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

For the purposes of this exercise, a list price of \$0 seems to be incorrect.
At the other end, there are 14 entries with prices higher than \$350,000.
The next price after \$350,000 is \$999,990, which seems unrealistic and likely a mistake.
We are therefore going to create a new dataset containing only cars with prices between \$1 and \$350,000.

In [38]:
# Creating a copy of the dataset with all the prices
autos_all_prices = autos.copy()

autos = autos[autos['price'].between(1, 350000)]

autos['price'].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64
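
One detail worth keeping in mind when filtering like this is that `Series.between` includes both endpoints by default. The toy prices below are made up to show the behaviour at the boundaries.

```python
import pandas as pd

# Made-up prices, chosen to sit on and around the two boundaries
prices = pd.Series([0, 1, 500, 350000, 999990])

# between() is inclusive on both ends by default
kept = prices[prices.between(1, 350000)]
dropped = prices[~prices.between(1, 350000)]

print(kept.tolist())
print(dropped.tolist())
```

Both 1 and 350000 are kept, while 0 and 999990 are dropped.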

## Exploring date columns

In [46]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48565 entries, 0 to 49999
Data columns (total 17 columns):
date_crawled          48565 non-null object
name                  48565 non-null object
price                 48565 non-null int64
abtest                48565 non-null object
vehicle_type          43979 non-null object
registration_year     48565 non-null int64
gearbox               46222 non-null object
power_ps              48565 non-null int64
model                 46107 non-null object
odometer_km           48565 non-null int64
registration_month    48565 non-null int64
fuel_type             44535 non-null object
brand                 48565 non-null object
unrepaired_damage     39464 non-null object
ad_created            48565 non-null object
postal_code           48565 non-null int64
last_seen             48565 non-null object
dtypes: int64(6), object(11)
memory usage: 6.7+ MB


The dataset has the following date columns:

* date_crawled
* registration_year
* registration_month
* ad_created
* last_seen

The registration dates are stored as integers, and the other columns are stored as strings. We will analyze the date columns in groups according to their data types.

In [50]:
autos[['date_crawled','ad_created','last_seen']].head(10)


Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50
5,2016-03-21 13:47:45,2016-03-21 00:00:00,2016-04-06 09:45:21
6,2016-03-20 17:55:21,2016-03-20 00:00:00,2016-03-23 02:48:59
7,2016-03-16 18:55:19,2016-03-16 00:00:00,2016-04-07 03:17:32
8,2016-03-22 16:51:34,2016-03-22 00:00:00,2016-03-26 18:18:10
9,2016-03-16 13:47:02,2016-03-16 00:00:00,2016-04-06 10:46:35


Let's analyze the date_crawled column

In [71]:
(autos['date_crawled']
        .str[:10]
        .value_counts()
        .sort_values(ascending=True)
)

2016-04-07      68
2016-04-06     154
2016-03-18     627
2016-04-05     636
2016-03-06     682
2016-03-13     761
2016-03-05    1230
2016-03-24    1425
2016-03-16    1438
2016-03-27    1510
2016-03-25    1535
2016-03-17    1536
2016-03-31    1546
2016-03-10    1563
2016-03-26    1564
2016-03-23    1565
2016-03-11    1582
2016-03-22    1602
2016-03-09    1607
2016-03-08    1617
2016-04-01    1636
2016-03-30    1636
2016-03-29    1656
2016-03-15    1665
2016-03-19    1689
2016-03-28    1693
2016-04-02    1723
2016-03-07    1749
2016-04-04    1772
2016-03-14    1775
2016-03-12    1793
2016-03-21    1815
2016-03-20    1840
2016-04-03    1875
Name: date_crawled, dtype: int64

The listings were crawled over roughly a month, from March 5 to April 7, 2016. Most days account for between about 1,400 and 1,900 listings, with noticeably fewer on a handful of days such as April 6 and April 7.
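
Slicing the first ten characters works because the timestamps share a fixed format. An alternative, sketched here on three sample values taken from the table above, is to parse the strings into real datetimes. The format string is an assumption based on how the values look.

```python
import pandas as pd

stamps = pd.Series([
    '2016-03-26 17:47:46',
    '2016-04-04 13:38:56',
    '2016-03-26 18:57:24',
])

# Parse the strings, then keep only the calendar date of each timestamp
parsed = pd.to_datetime(stamps, format='%Y-%m-%d %H:%M:%S')
per_day = parsed.dt.date.value_counts().sort_index()
print(per_day)
```

Sorting by index rather than by count keeps the days in chronological order, which makes gaps in the crawl easier to see.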

Now let's analyze the last_seen column

In [74]:
(autos['last_seen']
        .str[:10]
        .value_counts(normalize=True)
        .sort_values(ascending=True)
)

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-18    0.007351
2016-03-08    0.007413
2016-03-13    0.008895
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-14    0.012602
2016-03-27    0.015649
2016-03-19    0.015834
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-26    0.016802
2016-03-23    0.018532
2016-03-25    0.019211
2016-03-24    0.019767
2016-03-21    0.020632
2016-03-20    0.020653
2016-03-28    0.020859
2016-03-22    0.021373
2016-03-29    0.022341
2016-04-01    0.022794
2016-03-12    0.023783
2016-03-31    0.023783
2016-04-04    0.024483
2016-03-30    0.024771
2016-04-02    0.024915
2016-04-03    0.025203
2016-03-17    0.028086
2016-04-05    0.124761
2016-04-07    0.131947
2016-04-06    0.221806
Name: last_seen, dtype: float64

The last_seen column covers the same range of dates, from March 2016 to April 2016, but the distribution is concentrated on the final three days, April 5th to April 7th, which hold about 48% of the ads.

If we assume the last_seen date is either the day the car was sold or the day the ad ended, it is more likely that these three days represent ads ending rather than car sales.

For the rest of the dates, last_seen more probably represents the day the car was sold.
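
The combined share of the last three days can be checked with a quick sum. The three values below are copied from the `value_counts(normalize=True)` output above.

```python
# Shares of last_seen for the final three days, taken from the output above
last_three_days = {
    '2016-04-05': 0.124761,
    '2016-04-06': 0.221806,
    '2016-04-07': 0.131947,
}

combined_share = sum(last_three_days.values())
print('{:.1%}'.format(combined_share))
```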

Let's analyze the last of the non-registration date columns

In [76]:
(autos["ad_created"]
        .str[:10]
        .value_counts(normalize=True, dropna=False)
        .sort_index()
        )

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
                ...   
2016-03-09    0.033151
2016-03-10    0.031895
2016-03-11    0.032904
2016-03-12    0.036755
2016-03-13    0.017008
2016-03-14    0.035190
2016-03-15    0.034016
2016-03-16    0.030125
2016-03-17    0.031278
2016-03-18    0.013590
2016-03-19    0.033687
2016-03-20    0.037949
2016-03-21 

The ad_created column spans a wider range of dates than the other two columns, with the oldest ad created in June 2015. Within the crawling period, ad creation by date is still quite uniform.

Let's analyze the two registration date columns

In [79]:
autos['registration_year'].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

registration_year represents the year the car was first registered, which we use here as a proxy for when it was manufactured. The column statistics show some odd values: the min and max are 1000 and 9999, neither of which is a plausible year. We will need to fix that:

## Cleaning incorrect registration year rows

According to Wikipedia, large-scale, production-line manufacturing of affordable cars started in 1901. Since we treat the registration year as a proxy for the manufacturing date, any year before 1901 is inaccurate.

Also, because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is also inaccurate.

Let's keep only the rows inside this range and look at the distribution of the remaining years.

In [91]:
# Before we remove rows outside the range 1901 - 2016,
# let's first make another copy of autos with all registration year rows

autos_all_registration_year = autos.copy()

autos = autos[autos['registration_year'].between(1901,2016)]

print(autos['registration_year'].value_counts(normalize=True).sort_index().head(20))

print(autos['registration_year'].value_counts(normalize=True).sort_index(ascending=False).head(20))


1910    0.000107
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000043
1937    0.000086
1938    0.000021
1939    0.000021
1941    0.000043
1943    0.000021
1948    0.000021
1950    0.000064
1951    0.000043
1952    0.000021
1953    0.000021
1954    0.000043
1955    0.000043
1956    0.000086
1957    0.000043
1958    0.000086
Name: registration_year, dtype: float64
2016    0.026135
2015    0.008397
2014    0.014203
2013    0.017202
2012    0.028063
2011    0.034768
2010    0.034040
2009    0.044665
2008    0.047450
2007    0.048778
2006    0.057197
2005    0.062895
2004    0.057904
2003    0.057818
2002    0.053255
2001    0.056468
2000    0.067608
1999    0.062060
1998    0.050620
1997    0.041794
Name: registration_year, dtype: float64
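
Before dropping rows it can be useful to count how many fall inside and outside the accepted range. The sketch below uses a handful of made-up years, not the real column, to show one way of doing that with a boolean mask.

```python
import pandas as pd

# Made-up registration years, including some implausible ones
years = pd.Series([1000, 1910, 1999, 2004, 2016, 2017, 9999])

# True for years inside the accepted range (both ends inclusive)
in_range = years.between(1901, 2016)

rows_inside = int(in_range.sum())
rows_outside = int((~in_range).sum())
print(rows_inside, rows_outside)
```

On the real data the same mask could be built from `autos['registration_year']`, and `in_range.mean()` would give the share of rows that survive the filter.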


## Explore Price by Brand

In [97]:
autos['brand'].value_counts(normalize=True).sort_values(ascending=False)

volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
mitsubishi        0.008226
honda             0.007840
kia               0.007069
alfa_romeo        0.006641
porsche           0.006127
suzuki            0.005934
chevrolet         0.005698
chrysler          0.003513
dacia             0.002635
daihatsu          0.002506
jeep              0.002271
subaru            0.002142
land_rover        0.002099
saab              0.001649
jaguar            0.001564
daewoo            0.001500
trabant           0.001392
r

As this is a dataset from eBay in Germany, it is expected to contain more German car brands.

The top five brands (Volkswagen, BMW, Opel, Mercedes-Benz, and Audi) are all German and together account for about 61% of the listings. Volkswagen is by far the most popular brand in the dataset, with nearly as many listings as the next two brands combined.

There are lots of brands that don't have a significant share of the listings. For this exercise we will limit our analysis to brands representing more than 2.5% of the total listings. This group is composed of 9 brands that together account for ~78% of the total listings.

In [115]:
# Creating a list of the brands with more than 2.5% of the listings
brand_counts = autos['brand'].value_counts(normalize=True)

common_brands = brand_counts[brand_counts > 0.025].index
common_brands

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat'],
      dtype='object')
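
As a quick check on the size of this group, the shares of the selected brands can be added up. The values below are copied from the `value_counts(normalize=True)` output above; only the first eleven brands are included, which is enough to cover the 2.5% cutoff.

```python
import pandas as pd

# Brand shares copied from the output above (top eleven brands only)
brand_share = pd.Series({
    'volkswagen': 0.211264, 'bmw': 0.110045, 'opel': 0.107581,
    'mercedes_benz': 0.096463, 'audi': 0.086566, 'ford': 0.069900,
    'renault': 0.047150, 'peugeot': 0.029841, 'fiat': 0.025642,
    'seat': 0.018273, 'skoda': 0.016409,
})

# Same filter as above: keep brands with more than 2.5% of the listings
common = brand_share[brand_share > 0.025]
n_common = len(common)
total_share = round(float(common.sum()), 3)
print(n_common, total_share)
```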

In [133]:
# Creating a dictionary with the average price for each brand
brand_avg_price = {}

for brand in common_brands:
    brand_series = autos[autos['brand'] == brand]
    avg_price = brand_series['price'].mean()
    brand_avg_price[brand] = int(avg_price)
    
brand_avg_price

{'audi': 9336,
 'bmw': 8332,
 'fiat': 2813,
 'ford': 3749,
 'mercedes_benz': 8628,
 'opel': 2975,
 'peugeot': 3094,
 'renault': 2474,
 'volkswagen': 5402}

Audi is the most expensive brand, followed by Mercedes-Benz and BMW.

Volkswagen sits in between, and Ford, Peugeot, Opel, Fiat, and Renault are the less expensive popular brands.

## Exploring the relation between mileage and price

In [136]:
# Creating a dictionary with the average mileage for each brand
brand_avg_mileage = {}

for brand in common_brands:
    brand_series = autos[autos['brand']==brand]
    avg_mileage = brand_series['odometer_km'].mean()
    brand_avg_mileage[brand] = int(avg_mileage)
    
brand_avg_mileage

{'audi': 129157,
 'bmw': 132572,
 'fiat': 117121,
 'ford': 124266,
 'mercedes_benz': 130788,
 'opel': 129310,
 'peugeot': 127153,
 'renault': 128071,
 'volkswagen': 128707}

In [141]:
# Creating series for the average price dictionary
# and the average mileage dictionary

df_avg_price = pd.Series(brand_avg_price).sort_values(ascending=False)
df_avg_mileage = pd.Series(brand_avg_mileage).sort_values(ascending=False)


In [147]:
# Creating a dataframe from the average price series

price_mileage = pd.DataFrame(df_avg_price,columns=['Avg Price'])
price_mileage


Unnamed: 0,Avg Price
audi,9336
mercedes_benz,8628
bmw,8332
volkswagen,5402
ford,3749
peugeot,3094
opel,2975
fiat,2813
renault,2474


In [150]:
# Adding the mileage series to the price_mileage DataFrame

price_mileage['Avg Mileage'] = df_avg_mileage

price_mileage

Unnamed: 0,Avg Price,Avg Mileage
audi,9336,129157
mercedes_benz,8628,130788
bmw,8332,132572
volkswagen,5402,128707
ford,3749,124266
peugeot,3094,127153
opel,2975,129310
fiat,2813,117121
renault,2474,128071


There is a slight relationship between price and mileage at the brand level. More expensive brands tend to have higher average mileage, while less expensive brands tend to have lower average mileage, although the differences are small.
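
One way to put a number on this relationship is a Pearson correlation over the nine brand averages. The values below are copied from the table above. With only nine aggregated points the result should be read as a rough indication, not as evidence about individual cars.

```python
import pandas as pd

# Brand averages copied from the price_mileage table above
price_mileage = pd.DataFrame(
    {'Avg Price': [9336, 8628, 8332, 5402, 3749, 3094, 2975, 2813, 2474],
     'Avg Mileage': [129157, 130788, 132572, 128707, 124266,
                     127153, 129310, 117121, 128071]},
    index=['audi', 'mercedes_benz', 'bmw', 'volkswagen', 'ford',
           'peugeot', 'opel', 'fiat', 'renault'],
)

# Series.corr uses the Pearson method by default
r = price_mileage['Avg Price'].corr(price_mileage['Avg Mileage'])
print(round(r, 2))
```

A positive value is consistent with the observation that the pricier brands in this table also show somewhat higher average mileage.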