## Exploring Ebay Car Sales Data

This project was created to practice data cleaning skills with pandas. It consists of two parts: a large preparatory stage of data exploration and cleaning, followed by some data analysis.
The original dataset of used cars comes from eBay Kleinanzeigen, a classifieds section of the German eBay website, and was scraped by Kaggle user [orgesleka](https://www.kaggle.com/orgesleka). The original file can be found [here](https://data.world/data-society/used-cars-data).

The original dataset was modified by the authors of the challenge:
- a sample of 50,000 data points was selected from the full dataset
- the dataset was "dirtified" a bit to more closely resemble what you would expect from a scraped dataset

The dataset includes the following columns:
 - `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
 - `name` - Name of the car.
 - `seller` - Whether the seller is private or a dealer.
 - `offerType` - The type of listing
 - `price` - The price on the ad to sell the car.
 - `abtest` - Whether the listing is included in an A/B test.
 - `vehicleType` - The vehicle Type.
 - `yearOfRegistration` - The year in which the car was first registered.
 - `gearbox` - The transmission type.
 - `powerPS` - The power of the car in PS.
 - `model` - The car model name.
 - `odometer` - How many kilometers the car has driven.
 - `monthOfRegistration` - The month in which the car was first registered.
 - `fuelType` - What type of fuel the car uses.
 - `brand` - The brand of the car.
 - `notRepairedDamage` - If the car has damage which is not yet repaired.
 - `dateCreated` - The date on which the eBay listing was created.
 - `nrOfPictures` - The number of pictures in the ad.
 - `postalCode` - The postal code for the location of the vehicle.
 - `lastSeen` - When the crawler saw this ad last online.
 
The aim of this project is to **clean the data** and **analyze the included used car listings with regard to their prices**.

In [1]:
import pandas as pd
import numpy as np

autos = pd.read_csv("autos.csv", encoding="Latin-1")

In [2]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [3]:
# show column names, non-null counts and dtypes
autos.info()
# show the first five rows
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


The first look at the dataset shows that there are 50,000 rows and 20 columns. 

- There are 5 columns with *Null* values: `vehicleType`, `gearbox`, `model`, `fuelType`, `notRepairedDamage`. I should keep this in mind in case I use these columns in further analysis.
- The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.
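
The null counts read off `autos.info()` can also be tallied directly. The following is a minimal sketch on a made-up three-row frame; the `toy` frame and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few columns of autos
toy = pd.DataFrame({
    "vehicleType": ["bus", np.nan, "kombi"],
    "gearbox": ["manuell", "automatik", np.nan],
    "brand": ["peugeot", "bmw", "ford"],
})

# Count missing values per column, then keep only columns that have any
null_counts = toy.isna().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

Applied to the real frame, the same two lines would list only the columns that need attention before they are used in an analysis.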

First I'll convert the column names to snake_case and check which data formats the columns have.

In [4]:
# rename columns in dataframe
autos.rename({"dateCrawled": "date_crawled",
              "offerType":"offer_type", 
              "abtest": "a_b_test",
              "vehicleType":"vehicle_type",
              "yearOfRegistration":"registration_year",
              "powerPS":"powers_ps",
              "monthOfRegistration":"registration_month",
              "fuelType":"fuel_type",
              "notRepairedDamage":"unrepaired_damage",
              "dateCreated":"ad_created",
              "nrOfPictures":"number_of_photos",
              "postalCode":"postal_code",
              "lastSeen":"last_seen",
              }, axis=1, inplace=True)
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'a_b_test',
       'vehicle_type', 'registration_year', 'gearbox', 'powers_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'number_of_photos', 'postal_code',
       'last_seen'],
      dtype='object')
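
The hand-written mapping gives full control over the new names. For comparison, here is a hedged sketch of a purely mechanical conversion; the helper `camel_to_snake` is a hypothetical name introduced for illustration, and its output differs from the hand-picked names (for example it yields `year_of_registration`, not `registration_year`).

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each capital that follows a lowercase letter or digit."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# A few of the original column names
original = ["dateCrawled", "offerType", "yearOfRegistration", "powerPS", "abtest"]
converted = [camel_to_snake(c) for c in original]
print(converted)
```

A mechanical rule cannot split `abtest` or reorder words, which is one reason to prefer an explicit dictionary when readability of the final names matters.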

In [5]:
# show formats of data in columns and basic statistics for numeric ones
autos.describe(include="all")

Unnamed: 0,date_crawled,name,seller,offer_type,price,a_b_test,vehicle_type,registration_year,gearbox,powers_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_of_photos,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-27 22:55:05,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


After a first look at the dataset I have identified several types of columns that can be used in different ways:
 - columns which can be used to group data by categories: `seller`, `offer_type`, `gearbox`, `unrepaired_damage`, `vehicle_type`, `brand`, `model`, `fuel_type`.
 - columns with data that can be used for calculations: `price`, `odometer`, `powers_ps`
 - columns with dates: `date_crawled`, `ad_created`, `last_seen` as well as `registration_year` and `registration_month`

My task now is also to convert, where necessary, the values in these columns to the appropriate format: string, numeric and datetime respectively.

### Exploring columns with string formats

My goal is to determine whether the columns are useful for further analysis and to delete rows that are not. To explore their range I'll use the *Series.value_counts()* and *Series.head()* methods.

In [6]:
autos['seller'].value_counts()

privat        49999
gewerblich        1
Name: seller, dtype: int64

In [7]:
# delete a row with company ad
autos = autos[autos['seller'] == 'privat']

In [8]:
autos['offer_type'].value_counts()

Angebot    49998
Gesuch         1
Name: offer_type, dtype: int64

In [9]:
# delete a row with one search ad

autos = autos[autos['offer_type']=='Angebot']

Columns `offer_type` and `seller` don't provide me with interesting information, so I removed 2 rows to focus only on private ads offering cars for sale.

In [10]:
autos['gearbox'].value_counts()

manuell      36992
automatik    10327
Name: gearbox, dtype: int64

In [11]:
autos['unrepaired_damage'].value_counts()

nein    35232
ja       4939
Name: unrepaired_damage, dtype: int64

In [12]:
autos['vehicle_type'].value_counts()

limousine     12859
kleinwagen    10822
kombi          9126
bus            4092
cabrio         3061
coupe          2537
suv            1986
andere          420
Name: vehicle_type, dtype: int64

In [13]:
print(autos['brand'].value_counts().head(10))

# create a list with names of brands for further analysis
brands = autos['brand'].value_counts().index.tolist()

volkswagen       10686
opel              5461
bmw               5429
mercedes_benz     4734
audi              4283
ford              3479
renault           2403
peugeot           1456
fiat              1308
seat               941
Name: brand, dtype: int64


In [14]:
autos['model'].value_counts().head(10)

golf        4024
andere      3528
3er         2761
polo        1757
corsa       1735
astra       1454
passat      1425
a4          1291
5er         1183
c_klasse    1172
Name: model, dtype: int64

In [15]:
print(autos['fuel_type'].value_counts())

# create a list of fuel types for further analysis
fuel_types = autos['fuel_type'].value_counts().index.tolist()

benzin     30106
diesel     14567
lpg          691
cng           75
hybrid        37
andere        22
elektro       19
Name: fuel_type, dtype: int64


## Observations about categorical columns

 - In the column `gearbox`, more than 3 times as many cars have a manual gearbox as an automatic one.
 - In the column `unrepaired_damage`, roughly 1/8 of the cars with a known value have unrepaired damage.
 - Column `model` includes a very heterogeneous set of names, not exclusively model names. It was probably scraped from the title of the ad, and it is mainly of interest for showing the top-selling models.
 - Column `brand` has 40 unique values, the top 6 of which are Volkswagen, Opel, BMW, Mercedes-Benz, Audi and Ford. There is a clear preference for cars from German producers (with Volkswagen as the top brand).
 - Column `fuel_type` shows a predominance of gasoline (`benzin`) cars (66%) and diesel cars (32%). All other types of fuel take a modest combined share of 2%. Together with the column `vehicle_type` it could be used for some fine-tailored analysis of price differences, but that is out of the scope of this project.
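
The shares quoted in the list can be derived with `value_counts(normalize=True)`, which leaves missing values out of the denominator by default. A small sketch on made-up values follows; the `fuel` series is a hypothetical stand-in for `autos['fuel_type']`.

```python
import pandas as pd

# Hypothetical toy column standing in for autos['fuel_type']
fuel = pd.Series(["benzin", "diesel", "benzin", None, "benzin"])

# normalize=True returns fractions; missing values are excluded by default
shares = fuel.value_counts(normalize=True)
print(shares)

# With dropna=False the missing values get their own share
shares_with_missing = fuel.value_counts(normalize=True, dropna=False)
print(shares_with_missing)
```

Which variant to report depends on whether the percentage should describe all listings or only listings where the value is known.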
 
## Which columns will be used further?

For the analysis I decided to keep **the numerical columns that describe the car** as well as **columns with temporal information** (apart from `date_crawled`). I'll narrow down the scope of **string categorical columns** and delete the columns `name`, `seller`, `offer_type`, `a_b_test` and `vehicle_type`, together with `number_of_photos` (which is 0 in every row) and `postal_code`.

In [16]:
# delete columns that will not be used in the analysis
autos = autos.drop(['date_crawled', 'name', 'seller', 'offer_type', 'a_b_test', 'vehicle_type', 'number_of_photos', 'postal_code'], axis=1)

In [17]:
autos

Unnamed: 0,price,registration_year,gearbox,powers_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,last_seen
0,"$5,000",2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,2016-04-06 06:45:54
1,"$8,500",1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,2016-04-06 14:45:08
2,"$8,990",2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,2016-04-06 20:15:37
3,"$4,350",2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,2016-03-15 03:16:28
4,"$1,350",2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...
49995,"$24,900",2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,2016-04-01 13:47:40
49996,"$1,980",1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,2016-04-02 14:18:02
49997,"$13,200",2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,2016-04-04 11:47:27
49998,"$22,900",2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,2016-04-05 16:45:07


## Cleaning columns with numeric values

As mentioned above, apart from the string columns there are a few columns with numeric information in the dataset. They could be valuable for the analysis:
- `price`
- `odometer`
- `powers_ps`

The last column is already stored as an integer type.
The next step is to look at the columns `price` and `odometer` to decide how to format them.

In [18]:
autos['price'].value_counts().tail()

$155,000    1
$190,000    1
$2,033      1
$61,999     1
$51,990     1
Name: price, dtype: int64

In [19]:
autos['odometer'].value_counts()

150,000km    32422
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64

After taking a look at these columns I've found out what should be done next:
 - column `price` should be stripped of dollar signs and "," thousands separators, converted to integer and renamed `"price_dollars"`. There is also a large number of rows with a price of 0, which should be taken into consideration as well
 - column `odometer` should be stripped of the "km" suffix and "," thousands separators, converted to integer and renamed `"odometer_km"`

In [20]:
# clean and convert values in price column
autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '').astype(int)

# clean and convert values in odometer column
autos['odometer'] = autos['odometer'].str.replace('km', '').str.replace(',', '').astype(int)

In [21]:
# rename columns 
autos.rename({'price':'price_dollars', 'odometer':'odometer_km'}, axis=1, inplace=True)

In [22]:
autos['price_dollars'].describe()

count    4.999800e+04
mean     9.840435e+03
std      4.811140e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_dollars, dtype: float64

In [23]:
autos['odometer_km'].describe()
autos['odometer_km'].dtype

dtype('int64')

I have successfully converted the columns `price` and `odometer` to numeric values (both are integers now) and renamed them to `price_dollars` and `odometer_km`.
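
Since both columns needed the same kind of cleanup, the two conversions could be folded into one helper. This is only a sketch: `strip_to_int` is a hypothetical name, and it assumes the values contain no minus signs or decimal points, because every non-digit character is removed.

```python
import pandas as pd

def strip_to_int(series):
    """Remove every non-digit character and cast the result to int."""
    return series.str.replace(r"\D", "", regex=True).astype(int)

# Hypothetical sample values in the same format as the raw columns
raw_price = pd.Series(["$5,000", "$8,500", "$0"])
raw_odometer = pd.Series(["150,000km", "70,000km", "5,000km"])

clean_price = strip_to_int(raw_price)
clean_odometer = strip_to_int(raw_odometer)
print(clean_price.tolist())
print(clean_odometer.tolist())
```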

The next step is to identify outliers in the columns `price_dollars` and `odometer_km`.
I'll use `Series.value_counts()` and `Series.describe()` to see which values these columns include.

In [24]:
# identify outliers in the column price_dollars
print(autos['price_dollars'].value_counts())

# keep only rows without outliers
autos = autos[autos['price_dollars'].between(1,999989)]
print(autos['price_dollars'].describe())

0        1420
500       781
1500      734
2500      643
1000      639
         ... 
6770        1
61999       1
20987       1
6578        1
48600       1
Name: price_dollars, Length: 2357, dtype: int64
count     48564.000000
mean       5889.054794
std        9059.909948
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price_dollars, dtype: float64


There were a number of entries with unrealistic prices far above any plausible maximum in the `price_dollars` column, so rows with a price of 999,990 or more, as well as rows with a price of 0, were treated as outliers and removed.
The same operation for the `odometer_km` column doesn't show obvious outliers.
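
The price filter relies on `Series.between` being inclusive on both ends by default. A minimal sketch on invented prices shows which boundary values survive; the `prices` series is an illustrative assumption.

```python
import pandas as pd

# Hypothetical prices, including the kinds of extremes discussed above
prices = pd.Series([0, 1, 3000, 350000, 999990, 100000000])

# Series.between is inclusive on both ends by default
mask = prices.between(1, 999989)
kept = prices[mask]
dropped = int((~mask).sum())

print(kept.tolist())
print(f"dropped {dropped} of {len(prices)} rows")
```

Counting the rows a mask would drop before reassigning the frame is a cheap way to check that a filter is not more aggressive than intended.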

## Exploring date columns

The original dataset had 5 columns that represent date values; after dropping `date_crawled`, 4 remain. Some of these columns were created by the crawler, some came from the website itself. I can differentiate by referring to the data dictionary:

- `last_seen`: added by the crawler
- `ad_created`: from the website
- `registration_month`: from the website
- `registration_year`: from the website

Right now the `last_seen` and `ad_created` columns are both identified as string values by pandas. I need to convert the data in these columns to datetime format so I can understand it quantitatively. The other two columns are represented as numeric values, so I can use methods like `Series.describe()` to understand the distribution without any extra data processing.

In [25]:
print(autos['registration_year'].describe())
print(autos['registration_year'].value_counts().sort_index().head(10))
print(autos['registration_year'].value_counts().sort_index().tail(15))

count    48564.000000
mean      2004.755518
std         88.644797
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
1000    1
1001    1
1111    1
1800    2
1910    5
1927    1
1929    1
1931    1
1934    2
1937    4
Name: registration_year, dtype: int64
2015     392
2016    1220
2017    1392
2018     470
2019       2
2800       1
4100       1
4500       1
4800       1
5000       4
5911       1
6200       1
8888       1
9000       1
9999       3
Name: registration_year, dtype: int64


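The column `registration_year` includes implausible values: years later than the year the dataset was created (2016) and years earlier than 1910.
These values can be treated as outliers and filtered out. Note that the filter below keeps the range 1911 to 2017 inclusive, so listings with a registration year of 2017 are still retained.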

In [26]:
# remove strange values from registration_year
autos = autos[autos['registration_year'].between(1911, 2017)]
autos['registration_year'].describe()

count    48067.000000
mean      2003.328500
std          7.403593
min       1927.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2017.000000
Name: registration_year, dtype: float64

In [27]:
autos[['ad_created','last_seen']].head()

Unnamed: 0,ad_created,last_seen
0,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 00:00:00,2016-04-01 14:38:50


The column `ad_created` looks like a date that was converted to datetime and then to string format: its time part consists only of zeroes. That doesn't interfere with further analysis, so I will convert the column to datetime and perform basic data exploration.

In [28]:
# convert column ad_created to datetime format
autos['ad_created'] = pd.to_datetime(autos['ad_created'])
print(autos['ad_created'].dtype)

# find earliest and latest date
print(autos['ad_created'].min())
print(autos['ad_created'].max())

# percentages of unique dates
print(autos["ad_created"].value_counts(normalize=True, dropna=False).head(40))

datetime64[ns]
2015-06-11 00:00:00
2016-04-07 00:00:00
2016-04-03    0.038821
2016-03-20    0.037947
2016-03-21    0.037510
2016-04-04    0.036844
2016-03-12    0.036636
2016-04-02    0.035097
2016-03-28    0.035076
2016-03-14    0.034972
2016-03-07    0.034826
2016-03-29    0.034098
2016-03-15    0.033932
2016-04-01    0.033745
2016-03-19    0.033641
2016-03-30    0.033557
2016-03-08    0.033391
2016-03-09    0.033328
2016-03-11    0.032933
2016-03-22    0.032788
2016-03-26    0.032267
2016-03-23    0.032080
2016-03-10    0.031935
2016-03-25    0.031747
2016-03-31    0.031664
2016-03-17    0.031331
2016-03-27    0.031019
2016-03-16    0.030187
2016-03-24    0.029272
2016-03-05    0.022885
2016-03-13    0.017080
2016-03-06    0.015374
2016-03-18    0.013544
2016-04-05    0.011775
2016-04-06    0.003245
2016-03-04    0.001498
2016-04-07    0.001248
2016-03-03    0.000853
2016-02-28    0.000208
2016-02-29    0.000166
2016-02-27    0.000125
2016-03-02    0.000104
Name: ad_created, dtype: float64

The column `ad_created` shows that the ads were created between 11 June 2015 and 7 April 2016. Most of the ads were created in March and April 2016.

The next column is `last_seen`. It was created by the person who scraped the dataset and includes date and time parts. I will convert the data in this column to datetime format, keep only the date part and perform basic exploration.

In [29]:
# convert column last_seen to datetime format
# (plain column assignment replaces the column, so the new dtype is kept)
autos['last_seen'] = pd.to_datetime(autos['last_seen'])
print(autos['last_seen'].dtype)

# find earliest and latest date
print(autos['last_seen'].min())
print(autos['last_seen'].max())

# keep only the date part: normalize() sets the time to midnight
# and keeps the datetime64 dtype
autos['last_seen'] = autos['last_seen'].dt.normalize()
print(autos['last_seen'].dtype)

# percentages of unique dates
print(autos["last_seen"].value_counts(normalize=True, dropna=False).head(30))

datetime64[ns]
2016-03-05 14:45:46
2016-04-07 14:58:50
datetime64[ns]
2016-04-06    0.222044
2016-04-07    0.132107
2016-04-05    0.124971
2016-03-17    0.028107
2016-04-03    0.025152
2016-04-02    0.024820
2016-03-30    0.024820
2016-04-04    0.024362
2016-03-12    0.023779
2016-03-31    0.023634
2016-04-01    0.022802
2016-03-29    0.022323
2016-03-22    0.021324
2016-03-28    0.020888
2016-03-20    0.020638
2016-03-21    0.020492
2016-03-24    0.019702
2016-03-25    0.019140
2016-03-23    0.018516
2016-03-26    0.016831
2016-03-16    0.016331
2016-03-15    0.015915
2016-03-19    0.015790
2016-03-27    0.015645
2016-03-14    0.012566
2016-03-11    0.012420
2016-03-10    0.010735
2016-03-09    0.009695
2016-03-13    0.008925
2016-03-08    0.007427
Name: last_seen, dtype: float64


The column `last_seen` includes dates in the range from 5 March 2016 to 7 April 2016, with about 48% of rows falling on 5 to 7 April 2016.
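
A share like this can be computed directly instead of adding up rows of the printed output. The sketch below uses four made-up timestamps; the `last_seen` series here is a hypothetical stand-in for the real column.

```python
import pandas as pd

# Hypothetical last_seen timestamps
last_seen = pd.to_datetime(pd.Series([
    "2016-03-17 10:15:00",
    "2016-04-05 08:00:00",
    "2016-04-06 23:59:59",
    "2016-04-07 14:58:50",
]))

# Drop the time part but stay in datetime64 so comparisons still work
days = last_seen.dt.normalize()

# Fraction of rows whose date falls in the final three days
in_final_days = days.between(pd.Timestamp("2016-04-05"), pd.Timestamp("2016-04-07"))
share = in_final_days.mean()
print(share)
```

Taking the mean of a boolean mask gives the fraction of `True` values, which avoids manual summation of the normalized counts.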

Now I have a dataset which suits the analysis goals: **to explore the average prices, most popular ads and brands**.

## From data exploration and cleaning to analysis
### Top brands and average prices and mileage

During dataset exploration and data cleaning I already found the top-6 brands sold on eBay: 
 - Volkswagen
 - Opel
 - BMW
 - Mercedes-Benz
 - Audi
 - Ford
 
Next step is to examine price differences among these brands.

In [30]:
# create empty dictionaries to store the results
mean_price_by_brand = {}
mean_mileage_by_brand = {}

# slice a list brands to leave top-6 brands
top_6_brands = brands[:6]

# iterate over top-6 brands to collect mean prices and mileage
for b in top_6_brands:
    selected_rows = autos[autos["brand"]==b]
    price = selected_rows["price_dollars"].mean()
    mileage = selected_rows["odometer_km"].mean()
    mean_price_by_brand[b] = round(price, 2)
    mean_mileage_by_brand[b] = round(mileage, 2)
print(mean_price_by_brand)
print(mean_mileage_by_brand)

{'volkswagen': 5351.4, 'opel': 2953.44, 'bmw': 8284.05, 'mercedes_benz': 8528.27, 'audi': 9239.32, 'ford': 3732.27}
{'volkswagen': 128928.08, 'opel': 129417.24, 'bmw': 132666.79, 'mercedes_benz': 130962.04, 'audi': 129406.86, 'ford': 124255.29}


Among the top sellers it's quite understandable why Volkswagen is in first place: its mean price is 5,351 dollars, which sits in the middle between the price group of Mercedes-Benz, Audi and BMW on one side and Opel and Ford on the other.
Mean mileage fluctuates only slightly: for BMW and Mercedes-Benz it is a little higher.

**How does mileage affect these average numbers?**
Trying to answer that question I'll compare mean price and mean mileage for every top-6 brand.

To present that difference more clearly, on the next step I'll create a table with these average measures.

In [31]:
# create dataframe with series mean_price and mean_mileage
price_brand_serie = pd.Series(mean_price_by_brand)
means_df = pd.DataFrame(price_brand_serie, columns=["mean_price"])

mileage_brand_serie = pd.Series(data=mean_mileage_by_brand)
means_df.assign(mean_mileage=mileage_brand_serie)

Unnamed: 0,mean_price,mean_mileage
volkswagen,5351.4,128928.08
opel,2953.44,129417.24
bmw,8284.05,132666.79
mercedes_benz,8528.27,130962.04
audi,9239.32,129406.86
ford,3732.27,124255.29
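
The loop-and-dictionary approach keeps every step explicit. As a hedged alternative, the same table could be produced in a single pass with `groupby`, a swapped-in technique that the project itself does not use. The `toy` frame is made up; only its column names follow the cleaned dataset.

```python
import pandas as pd

# Hypothetical rows with the same column names as the cleaned dataset
toy = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "opel", "bmw"],
    "price_dollars": [5000, 6000, 3000, 8000],
    "odometer_km": [150000, 100000, 125000, 150000],
})

# One pass: mean price and mean mileage per brand
means = (
    toy.groupby("brand")[["price_dollars", "odometer_km"]]
    .mean()
    .round(2)
    .rename(columns={"price_dollars": "mean_price", "odometer_km": "mean_mileage"})
)
print(means)
```

To restrict the result to the top-6 brands, the frame could be filtered with `isin` on the brand list before grouping.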


### How does fuel type influence the average price?
The next step is to find out whether prices differ among cars with different types of fuel. The government of Germany is planning to limit the usage of diesel cars, which could significantly affect average prices for diesel cars.

In [32]:
# create an empty dictionary to store the results
mean_price_by_fuel = {}

# iterate over fuel types list created earlier to collect mean prices
for f in fuel_types:
    selected_rows = autos[autos["fuel_type"]==f]
    price = selected_rows["price_dollars"].mean()
    mean_price_by_fuel[f] = round(price, 2)
print(mean_price_by_fuel)

{'benzin': 5005.69, 'diesel': 8531.54, 'lpg': 4324.6, 'cng': 4825.9, 'hybrid': 14346.03, 'andere': 2912.94, 'elektro': 24716.37}


Diesel cars have a higher average price than gasoline cars. That can mean either that the planned limitation has not yet started to affect prices, or that the group of gasoline cars includes more old vehicles or vehicles with higher mileage. I'll try to answer these questions below by comparing the average registration year and mileage for every fuel type.

Electric and hybrid cars had the highest average prices.

In [33]:
# create empty dictionaries to store values
mean_year_fuel = {}
mean_mileage_fuel = {}

# iterate over fuel list created above
for f in fuel_types:
    selected_rows = autos[autos['fuel_type']==f]
    mean_year = selected_rows['registration_year'].mean()
    mean_mileage = selected_rows['odometer_km'].mean()
    mean_year_fuel[f] = round(mean_year)
    mean_mileage_fuel[f] = round(mean_mileage)
print(mean_year_fuel)
print(mean_mileage_fuel)

{'benzin': 2002, 'diesel': 2006, 'lpg': 2002, 'cng': 2006, 'hybrid': 2010, 'andere': 1999, 'elektro': 2009}
{'benzin': 122530, 'diesel': 131540, 'lpg': 140890, 'cng': 121944, 'hybrid': 86081, 'andere': 98438, 'elektro': 36053}


The hypothesis that gasoline cars have more mileage on average than diesel ones was not confirmed (122,530 km for gasoline against 131,540 km for diesel), but diesel cars were newer on average (mean registration year 2006 against 2002).

*Maybe the proportion of diesel/gasoline cars differs from brand to brand and share of diesel cars is higher in more expensive brands?*
Next step is to check the proportions of different fuel types for top-6 brands.

In [34]:
for b in top_6_brands:
    selected_rows = autos[autos['brand']==b]
    print('For {} the percentages of different fuel_types look like that:'.format(b))
    print(selected_rows['fuel_type'].value_counts(normalize=True, dropna=False))

For volkswagen the percentages of different fuel_types look like that:
benzin     0.550783
diesel     0.353914
NaN        0.084051
lpg        0.007828
cng        0.003229
elektro    0.000098
hybrid     0.000098
Name: fuel_type, dtype: float64
For opel the percentages of different fuel_types look like that:
benzin    0.706797
diesel    0.175307
NaN       0.101382
lpg       0.013633
cng       0.002496
andere    0.000192
hybrid    0.000192
Name: fuel_type, dtype: float64
For bmw the percentages of different fuel_types look like that:
benzin     0.552284
diesel     0.364749
NaN        0.063659
lpg        0.018161
cng        0.000574
elektro    0.000382
hybrid     0.000191
Name: fuel_type, dtype: float64
For mercedes_benz the percentages of different fuel_types look like that:
benzin    0.513015
diesel    0.411497
NaN       0.054013
lpg       0.020824
andere    0.000434
hybrid    0.000217
Name: fuel_type, dtype: float64
For audi the percentages of different fuel_types look like that:
benzin
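
The per-brand loop prints one series per brand. A compact alternative, sketched here under the assumption that a single table is preferred, is `pd.crosstab` with row-wise normalization. The `toy` frame is invented, and unlike the loop with `dropna=False`, `crosstab` leaves out rows with a missing fuel type by default.

```python
import pandas as pd

# Hypothetical brand / fuel_type pairs
toy = pd.DataFrame({
    "brand": ["opel", "opel", "opel", "opel", "bmw", "bmw"],
    "fuel_type": ["benzin", "benzin", "benzin", "diesel", "benzin", "diesel"],
})

# normalize="index" makes each row (brand) sum to 1
fuel_share = pd.crosstab(toy["brand"], toy["fuel_type"], normalize="index")
print(fuel_share)
```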

In [41]:
# create new column with time difference
autos['time_diff'] = autos['last_seen'] - autos['ad_created']
# view the differences as integers (nanoseconds)
autos['time_diff'].astype('int64')

0         950400000000000
1         172800000000000
2         950400000000000
3         259200000000000
4                       0
               ...       
49995     432000000000000
49996     432000000000000
49997     172800000000000
49998    2419200000000000
49999    2073600000000000
Name: time_diff, Length: 48067, dtype: int64

In [42]:
autos['time_diff'].describe()

count                        48067
mean     8 days 21:24:48.996608900
std      8 days 18:21:02.623070074
min                0 days 00:00:00
25%                2 days 00:00:00
50%                6 days 00:00:00
75%               14 days 00:00:00
max              300 days 00:00:00
Name: time_diff, dtype: object

The new column `time_diff` approximates the time span from posting an ad to the moment the ad was no longer of interest to the owner. The mean is about 9 days (8 days 21 hours), the median is 6 days and the standard deviation is about 9 days (8 days 18 hours). The distribution is skewed right. As a further step I would create several bins for ads with long, medium and short 'lifespans' and check how price, age or mileage affects them. That is outside the scope of this project, as are all sorts of aggregations with _groupby()_.
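
The binning idea could look roughly like the following sketch built on `pd.cut`. The bin edges (up to 2 days, up to 14 days, longer) and the labels are assumptions chosen for illustration, and the `time_diff` series here is made up.

```python
import pandas as pd

# Hypothetical time differences between ad_created and last_seen
time_diff = pd.Series(pd.to_timedelta([0, 2, 6, 14, 30, 300], unit="D"))

# Whole days are easier to bin than raw timedeltas
days = time_diff.dt.days

# Assumed bin edges: up to 2 days, up to 14 days, anything longer
lifespan = pd.cut(
    days,
    bins=[-1, 2, 14, int(days.max())],
    labels=["short", "medium", "long"],
)
print(lifespan.value_counts().sort_index())
```

The lower edge of -1 is there because `pd.cut` intervals are open on the left by default, so a lifespan of 0 days would otherwise fall outside the first bin.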

## Instead of conclusion

My main goal in this project was to practice data cleaning and formatting; in-depth data analysis was outside its scope.
Pandas is a powerful tool which enables fast data handling and the first stages of exploratory analysis.

The project will be continued; the dataset could be used for predictive modelling.