# Exploring eBay Car Sales Data

## Background 

The aim of this project is to clean the data and analyze the included used car listings. 

For this exercise, we'll work with a dataset of used cars from eBay [Kleinanzeigen](https://en.wikipedia.org/wiki/Classified_advertising), a classifieds section of the German eBay website. 

## Opening and Exploring the Data

The dataset was originally scraped with [Scrapy](https://scrapy.org) and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data). 

There are two modifications which have been made from the original dataset that was uploaded to Kaggle:
    1. The sample has been reduced to 50,000 rows, whereas the full dataset has over 370,000.
    2. Our dataset has been purposely 'dirtied' so that we can practice basic data cleaning.
    
The data dictionary provided with the data has the following structure:

   `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
   
   `name` - Name of the car.
   
   `seller` - Whether the seller is private or a dealer.
   
   `offerType` - The type of listing.
   
   `price` - The price on the ad to sell the car.
   
   `abtest` - Whether the listing is included in an A/B test.
   
   `vehicleType` - The vehicle Type.
   
   `yearOfRegistration` - The year in which the car was first registered.
   
   `gearbox` - The transmission type.
   
   `powerPS` - The power of the car in PS.
   
   `model` - The car model name.
   
   `odometer` - How many kilometers the car has driven.
   
   `monthOfRegistration` - The month in which the car was first registered.
   
   `fuelType` - What type of fuel the car uses.
   
   `brand` - The brand of the car.
   
   `notRepairedDamage` - If the car has a damage which is not yet repaired.
   
   `dateCreated` - The date on which the eBay listing was created.
   
   `nrOfPictures` - The number of pictures in the ad.
   
   `postalCode` - The postal code for the location of the vehicle.
   
   `lastSeen` - When the crawler saw this ad last online.


Let's start by importing the libraries we need and loading the data set into a pandas DataFrame object.

In [161]:
import numpy as np
import pandas as pd 

autos = pd.read_csv('autos.csv', encoding='Latin-1')
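
The `encoding='Latin-1'` argument matters because the listings contain German characters such as `Ü`. The sketch below illustrates this on a small in-memory file; the two-line CSV content is made up for the example and only assumes that `autos.csv` is Latin-1 encoded, as the call above suggests.

```python
import io
import pandas as pd

# Hypothetical two-line CSV, encoded as Latin-1 (assumed to match autos.csv).
raw_bytes = 'name,price\nFord_Focus_TÜV_neu,"$1,350"\n'.encode("latin-1")

# Reading with the matching encoding recovers the umlaut intact.
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="Latin-1")
print(sample.loc[0, "name"])

# The same bytes are not valid UTF-8, which is why the encoding must be stated.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("Valid UTF-8:", utf8_ok)
```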

### Take a look at the data

We can use `DataFrame.info()` to quickly view information about the `autos` DataFrame, and `DataFrame.head(5)` to view the first 5 rows.

In [162]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

In [163]:
autos.head(5)

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


### Initial observations

   1. We can see that this dataset contains 20 columns, most of which are strings.
   2. Some columns have null values, but none have more than ~20% null values.
   3. The column names use camelCase rather than Python's preferred snake_case, which means we can't just replace spaces with underscores.

## Make some initial changes so that our data is easier to work with: 

We will now convert the column names from `camelCase` to `snake_case` and reword some of the column names based on the data dictionary to be more descriptive. For reference we've printed the columns below.

In [164]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [165]:
autos.columns = [
       'date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'number_pictures', 'postal_code',
       'last_seen'
]

Above, we implemented `snake_case` for the column names. We also changed a few of the names to be easier to understand. We can check the first 5 rows again to see the changes below:

In [166]:
autos.head(5)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
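
The renaming above was done by hand. For the purely mechanical part of the conversion, a small regex helper is one option. The function `camel_to_snake` below is a hypothetical helper, not part of pandas, and it cannot produce the reworded names (for example `yearOfRegistration` to `registration_year`), which still need a manual mapping.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each capital that follows a lowercase letter or digit."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

original = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "abtest"]
converted = [camel_to_snake(col) for col in original]
print(converted)
# ['date_crawled', 'offer_type', 'power_ps', 'nr_of_pictures', 'abtest']
```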


## Determine what other cleaning tasks need to be done
Initially we will look for: 
- Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis. 
- Examples of numeric data stored as text which can be cleaned and converted.
- Any other data which is logically out-of-bounds or needs further investigation.

We'll now use `DataFrame.describe()` to look at descriptive statistics for all columns:

In [167]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-04-02 15:49:30,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


### Define basic cleaning strategy

From the above we can see:
   1. The `number_pictures` column can be dropped because every row has a value of 0. The `seller` and `offer_type` columns can also be dropped: in each, 49,999 of the 50,000 rows share the same value.
   2. `price` and `odometer` are numeric values stored as text.
   3. The date fields `date_crawled`, `ad_created`, and `last_seen` are being interpreted as strings and need some general cleaning.
   4. The min for `registration_month` is 0, a sign of inaccurate data, as is the min of 1000 for `registration_year`.
   5. The max for `power_ps` suggests inaccurate data as well.
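
The first check in the list above (columns where nearly all values are identical) can also be done programmatically. This is a minimal sketch on made-up data; the helper name `near_constant_columns` and the 95% threshold are assumptions chosen for illustration.

```python
import pandas as pd

def near_constant_columns(df, threshold=0.95):
    """Return columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        # Share of rows taken by the single most common value (NaN counted as a value).
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Toy data: 'seller' is almost constant, 'number_pictures' is constant, 'brand' is not.
toy = pd.DataFrame({
    "seller": ["privat"] * 99 + ["other"],
    "number_pictures": [0] * 100,
    "brand": ["bmw", "opel"] * 50,
})
flagged = near_constant_columns(toy)
print(flagged)  # ['seller', 'number_pictures']
```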
   
   

### Remove the unneeded data from our DataFrame

Below we will start cleaning by dropping the `number_pictures`, `seller`, and `offer_type` data.

In [168]:
autos = autos.drop(["number_pictures", "seller", "offer_type"], axis=1)

## Analysis and treatment of `price` and `odometer` values

We'll start reviewing and treating these values next:

In [169]:
autos['price'].unique()

array(['$5,000', '$8,500', '$8,990', ..., '$385', '$22,200', '$16,995'],
      dtype=object)

In [170]:
autos['odometer'].unique()

array(['150,000km', '70,000km', '50,000km', '80,000km', '10,000km',
       '30,000km', '125,000km', '90,000km', '20,000km', '60,000km',
       '5,000km', '100,000km', '40,000km'], dtype=object)

We'll convert these values below:

In [171]:
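autos['price'] = (autos['price']
                  .str.replace('$', '', regex=False)  # literal '$', not the regex end-of-string anchor
                  .str.replace(',', '', regex=False)
                  .astype(float)
                 )

autos['odometer'] = (autos['odometer']
                     .str.replace('km', '', regex=False)
                     .str.replace(',', '', regex=False)
                     .astype(float)
                    )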

# rename the odometer column                                              
autos = autos.rename({'odometer':'odometer_km'}, axis=1)
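
One detail worth spelling out when stripping the currency symbol: in a regular expression, `$` matches the end of the string rather than a dollar sign, and the default of the `regex` argument of `Series.str.replace` has changed across pandas releases. The sketch below, on three sample prices, shows the regex behaviour with the standard `re` module and the literal replacement with `regex=False`.

```python
import re
import pandas as pd

# As a regular expression, "$" matches the (empty) end of the string,
# so the dollar sign itself is left in place.
regex_result = re.sub("$", "", "$5,000")
print(regex_result)  # $5,000

# Passing regex=False treats "$" and "," as literal characters.
prices = pd.Series(["$5,000", "$8,500", "$0"])
clean_prices = (prices
                .str.replace("$", "", regex=False)
                .str.replace(",", "", regex=False)
                .astype(float))
print(clean_prices.tolist())  # [5000.0, 8500.0, 0.0]
```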

Let's have another look at our data:

In [172]:
print(autos['price'].unique().shape)
print(autos['price'].describe())
print(autos['price'].value_counts().head(20))

(2357,)
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
0.0       1421
500.0      781
1500.0     734
2500.0     643
1200.0     639
1000.0     639
600.0      531
800.0      498
3500.0     498
2000.0     460
999.0      434
750.0      433
900.0      420
650.0      419
850.0      410
700.0      395
4500.0     394
300.0      384
2200.0     382
950.0      379
Name: price, dtype: int64


### Findings

 - There are 2,357 unique values in the `price` column.
 - Around 3% of prices are 0 (1,421 rows); these rows will be considered for removal from our dataset.
 - The min `price` of 0 is too low, and the max `price` of roughly one hundred million looks far too large.

Let's explore further the lowest and the highest prices:

In [173]:
autos['price'].value_counts().sort_index(ascending=False).head(20)

99999999.0    1
27322222.0    1
12345678.0    3
11111111.0    2
10000000.0    1
3890000.0     1
1300000.0     1
1234566.0     1
999999.0      2
999990.0      1
350000.0      1
345000.0      1
299000.0      1
295000.0      1
265000.0      1
259000.0      1
250000.0      1
220000.0      1
198000.0      1
197000.0      1
Name: price, dtype: int64

In [174]:
autos['price'].value_counts().sort_index(ascending=True).head(20)

0.0     1421
1.0      156
2.0        3
3.0        1
5.0        2
8.0        1
9.0        1
10.0       7
11.0       2
12.0       3
13.0       2
14.0       1
15.0       2
17.0       3
18.0       1
20.0       4
25.0       5
29.0       1
30.0       7
35.0       1
Name: price, dtype: int64

### Findings:

 - A number of prices are under 30 dollars, the most frequent of which is 0 (1,421 rows).
 - 11 listings are priced above one million, which seems implausible.
 - Prices rise steadily up to 350,000, then jump to 999,990, nearly three times as high.
 - We will take into consideration that eBay is an auction site, so a listing with an opening price of 1 can be legitimate. Therefore:
     - Prices between 1 and 350,000 (inclusive) will be kept
     - Prices below or above that range will be discarded

In [175]:
autos = autos[autos['price'].between(1,350000)]
autos['price'].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

In [176]:
autos.tail(3)

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,13200.0,test,cabrio,2014,automatik,69,500,5000.0,11,benzin,fiat,nein,2016-04-02 00:00:00,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,22900.0,control,kombi,2013,manuell,150,a3,40000.0,11,diesel,audi,nein,2016-03-08 00:00:00,35683,2016-04-05 16:45:07
49999,2016-03-14 00:42:12,Opel_Vectra_1.6_16V,1250.0,control,limousine,1996,manuell,101,vectra,150000.0,1,benzin,opel,nein,2016-03-13 00:00:00,45897,2016-04-06 21:18:48


## Further cleaning in the Odometer column

We'll continue analyzing the `odometer_km` column in our DataFrame with the following steps:
- Analyze the column using minimum and maximum values and look for any values that look unrealistically high or low that we might want to remove
- We'll use:
    - `Series.unique().shape` to see how many unique values
    - `Series.describe()` to view `min`, `max`, `median`, `mean`, etc.
    - `Series.value_counts()`, with some variations: 
        - Chained to `.head()` if there are lots of values
        - Because `Series.value_counts()` returns a Series, we can use `Series.sort_index()` with `ascending=True` or `False` to view the highest and lowest values with their counts; this can also be chained to `.head()`
- When removing values, we'll opt for `df[df["col"].between(x,y)]`, as it is more readable than chaining comparisons as shown in Dataquest, e.g. `df[(df["col"] > x ) & (df["col"] < y )]`. Note that `between()` includes both endpoints by default, so it matches `>=` and `<=` rather than the strict comparisons.
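
The difference between `between()` and strict comparisons is easiest to see on a few values. The toy prices below are made up, with bounds chosen to mirror the price filter used earlier.

```python
import pandas as pd

df = pd.DataFrame({"price": [0, 1, 500, 350000, 999999]})

# between() keeps both endpoints by default.
inclusive = df[df["price"].between(1, 350000)]

# Strict comparisons drop rows that sit exactly on a bound.
strict = df[(df["price"] > 1) & (df["price"] < 350000)]

# The boolean-mask equivalent of between() uses >= and <=.
explicit = df[(df["price"] >= 1) & (df["price"] <= 350000)]

print(inclusive["price"].tolist())  # [1, 500, 350000]
print(strict["price"].tolist())     # [500]
```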

#### Use `Series.unique().shape`, `Series.describe()`, `Series.value_counts()`

In [177]:
autos['odometer_km'].unique().shape

(13,)

In [178]:
autos['odometer_km'].describe()

count     48565.000000
mean     125770.101925
std       39788.636804
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [179]:
autos['odometer_km'].value_counts()

150000.0    31414
125000.0     5057
100000.0     2115
90000.0      1734
80000.0      1415
70000.0      1217
60000.0      1155
50000.0      1012
5000.0        836
40000.0       815
30000.0       780
20000.0       762
10000.0       253
Name: odometer_km, dtype: int64

## Exploring the date columns
Let's now move on to the date columns to further understand the date range the data covers.

There are five columns that should represent date values, referring to the data dictionary:
  
 - `date_crawled`: added by Scrapy
  
 - `last_seen`: added by Scrapy
 - `ad_created`: from the website
 - `registration_month`: from the website
 - `registration_year`: from the website

The two columns added by Scrapy and the `ad_created` column are string values. To do any quantitative analysis, these would need to be converted into a numeric or datetime representation. 
The other two columns are already numeric, so the `Series.describe()` method can be used to understand their distribution without any extra data processing.
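
This notebook works with the strings directly, but as a sketch of what such a conversion could look like, `pd.to_datetime` parses the timestamps into a datetime dtype. The two rows are copied from the table shown earlier; the `days_between` calculation is only an illustration of what becomes possible after conversion and is not used in the rest of the analysis.

```python
import pandas as pd

sample = pd.DataFrame({
    "date_crawled": ["2016-03-26 17:47:46", "2016-04-04 13:38:56"],
    "last_seen": ["2016-04-06 06:45:54", "2016-04-06 14:45:08"],
})

# Parse the strings with an explicit format matching 'YYYY-MM-DD HH:MM:SS'.
for col in ["date_crawled", "last_seen"]:
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d %H:%M:%S")

# With datetime columns, differences can be computed directly.
days_between = (sample["last_seen"] - sample["date_crawled"]).dt.days
print(days_between.tolist())  # [10, 2]
```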

Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:

In [180]:
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


We notice that the first 10 characters represent the date stamp, for example 2016-03-12.

To understand the date range, we can extract just the date values, use `Series.value_counts()` to generate a distribution, and then sort by the index.
To select the first 10 characters in each column, we can use `Series.str[:10]`.
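
The chain described above can be sketched on a handful of timestamps taken from the rows shown earlier. The variable name `daily_share` is chosen for the example.

```python
import pandas as pd

stamps = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-03-12 16:58:10",
])

daily_share = (stamps
               .str[:10]                                    # keep only 'YYYY-MM-DD'
               .value_counts(normalize=True, dropna=False)  # share of rows per day
               .sort_index())                               # earliest day first
print(daily_share)
```

Because the dates are in ISO format, sorting the index alphabetically also sorts it chronologically.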

#### We'll start with `date_crawled`:

In [181]:
autos['date_crawled'].str[:10]

0        2016-03-26
1        2016-04-04
2        2016-03-26
3        2016-03-12
4        2016-04-01
            ...    
49995    2016-03-27
49996    2016-03-28
49997    2016-04-02
49998    2016-03-08
49999    2016-03-14
Name: date_crawled, Length: 48565, dtype: object

In [182]:
autos['date_crawled'].value_counts(normalize=True, dropna=False)

2016-03-14 20:50:02    0.000062
2016-03-25 19:57:10    0.000062
2016-04-02 11:37:04    0.000062
2016-04-04 16:40:33    0.000062
2016-03-23 19:38:20    0.000062
                         ...   
2016-03-12 12:55:05    0.000021
2016-03-28 13:49:44    0.000021
2016-03-27 18:38:38    0.000021
2016-03-17 09:39:48    0.000021
2016-03-26 19:38:48    0.000021
Name: date_crawled, Length: 46882, dtype: float64

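The counts above are per full timestamp rather than per day. Sorting the raw column by its index, as in the next cell, only lists the rows in their original order; the per-day distribution described earlier comes from chaining `.str[:10]`, `.value_counts()`, and `.sort_index()`.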

In [183]:
autos['date_crawled'].sort_index()

0        2016-03-26 17:47:46
1        2016-04-04 13:38:56
2        2016-03-26 18:57:24
3        2016-03-12 16:58:10
4        2016-04-01 14:38:50
                ...         
49995    2016-03-27 14:38:19
49996    2016-03-28 10:50:25
49997    2016-04-02 14:44:48
49998    2016-03-08 19:25:42
49999    2016-03-14 00:42:12
Name: date_crawled, Length: 48565, dtype: object

#### Findings: 

 - The data was crawled every day starting from 5 March 2016 and ending 7 April 2016
 - The distribution of listings crawled on each day is roughly uniform.

#### Now for `ad_created`:

In [184]:
autos['ad_created'].value_counts(normalize=True, dropna=False)

2016-04-03 00:00:00    0.038855
2016-03-20 00:00:00    0.037949
2016-03-21 00:00:00    0.037579
2016-04-04 00:00:00    0.036858
2016-03-12 00:00:00    0.036755
                         ...   
2016-02-11 00:00:00    0.000021
2016-02-08 00:00:00    0.000021
2015-12-05 00:00:00    0.000021
2015-09-09 00:00:00    0.000021
2016-02-09 00:00:00    0.000021
Name: ad_created, Length: 76, dtype: float64

In [185]:
autos['ad_created'].sort_index()

0        2016-03-26 00:00:00
1        2016-04-04 00:00:00
2        2016-03-26 00:00:00
3        2016-03-12 00:00:00
4        2016-04-01 00:00:00
                ...         
49995    2016-03-27 00:00:00
49996    2016-03-28 00:00:00
49997    2016-04-02 00:00:00
49998    2016-03-08 00:00:00
49999    2016-03-13 00:00:00
Name: ad_created, Length: 48565, dtype: object

#### Findings:

This distribution shows the share of ads created on each date. There is a large variety of values (76 distinct dates). It appears that the majority of ads were created in March 2016, although some are much older, created as much as 10 months before the crawl. 

#### Followed by `last_seen`:

In [190]:
autos['last_seen'].value_counts(normalize=True, dropna=False)

2016-04-07 06:17:27    0.000165
2016-04-06 21:17:51    0.000144
2016-04-07 03:16:17    0.000144
2016-04-06 14:44:55    0.000124
2016-04-06 20:48:27    0.000124
                         ...   
2016-03-20 00:15:49    0.000021
2016-03-15 17:45:40    0.000021
2016-04-06 20:15:48    0.000021
2016-03-14 19:16:05    0.000021
2016-04-04 00:47:10    0.000021
Name: last_seen, Length: 38474, dtype: float64

In [191]:
autos['last_seen'].sort_index()

0        2016-04-06 06:45:54
1        2016-04-06 14:45:08
2        2016-04-06 20:15:37
3        2016-03-15 03:16:28
4        2016-04-01 14:38:50
                ...         
49995    2016-04-01 13:47:40
49996    2016-04-02 14:18:02
49997    2016-04-04 11:47:27
49998    2016-04-05 16:45:07
49999    2016-04-06 21:18:48
Name: last_seen, Length: 48565, dtype: object

#### Findings:

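The `last_seen` value records when the crawler last saw a listing online, so it can serve as a proxy for when the ad was removed. The share of listings last seen during the final 3 days of the crawl is the highest. However, this is not necessarily due to an increase in the number of cars sold, since listings that were still online when crawling ended also carry those dates. 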

No definite conclusions are reached from this distribution.

#### A closer look at `registration_year`:

In [192]:
autos['registration_year'].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

### Handling incorrect `registration_year` Data

We find some erroneous values in the `registration_year` column, with min and max outside of logical values. 
    
  - We should expect the listings to include values in the 1900 - 2016 range:
      - Any `registration_year` after the listing was seen is invalid, so we can rule out values of `registration_year` above 2016.
      - Any year prior to 1900 is highly unlikely as well.

We will count the number of listings which are outside of this range, to determine if it is viable to remove those rows. 

In [193]:
(~autos['registration_year'].between(1900,2016)).sum()/autos.shape[0]

0.038793369710697

There are 1,884 such listings, or just under 4%, which is not a significant portion of our sample. We will remove the rows with erroneous `registration_year` values below:

In [194]:
autos = autos[autos['registration_year'].between(1900,2016)].copy()  # copy so later column assignments don't act on a slice
autos['registration_year'].value_counts(normalize=True).head(10)

2000    0.067608
2005    0.062895
1999    0.062060
2004    0.057904
2003    0.057818
2006    0.057197
2001    0.056468
2002    0.053255
1998    0.050620
2007    0.048778
Name: registration_year, dtype: float64

#### Findings

 - The majority of the autos were registered within the past 20 years, which appears to be in range.

In [195]:
autos.shape

(46681, 17)

## Exploring Price by Brand

When working with data on cars, it's natural to explore variations across different car brands. We can use aggregation to understand the brand column. Here's what the process looks like:

 - Identify the unique values we want to aggregate by
 - Create an empty dictionary to store our aggregate data
 - Loop over the unique values, and for each:
     - Subset the dataframe by the unique value
     - Calculate the mean of whichever column we're interested in
     - Assign the value/mean pair to the dictionary as key/value
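
As a side note, pandas offers `DataFrame.groupby` as a shorter route to the same aggregate. The sketch below runs the loop described above next to the `groupby` version on made-up data, to show that the two agree; the notebook itself continues with the loop.

```python
import pandas as pd

# Toy listings; brands and prices are invented for the example.
toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel", "audi"],
    "price": [8000.0, 9000.0, 3000.0, 2000.0, 9500.0],
})

# Loop version: subset per brand, take the mean, store it in a dictionary.
brand_mean_prices = {}
for b in toy["brand"].unique():
    b_only = toy[toy["brand"] == b]
    brand_mean_prices[b] = int(b_only["price"].mean())

# groupby version: one mean per brand in a single expression.
grouped = toy.groupby("brand")["price"].mean().astype(int)

print(brand_mean_prices)  # {'bmw': 8500, 'opel': 2500, 'audi': 9500}
print(grouped.to_dict() == brand_mean_prices)  # True
```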

In [196]:
autos['brand'].value_counts(normalize=True)

volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
mitsubishi        0.008226
honda             0.007840
kia               0.007069
alfa_romeo        0.006641
porsche           0.006127
suzuki            0.005934
chevrolet         0.005698
chrysler          0.003513
dacia             0.002635
daihatsu          0.002506
jeep              0.002271
subaru            0.002142
land_rover        0.002099
saab              0.001649
jaguar            0.001564
daewoo            0.001500
trabant           0.001392
...

The top five brands are all German and together account for about 61% of the listings. The most popular brand is Volkswagen, with roughly twice the share of each of the next two brands (BMW and Opel). 

For our analysis, we will focus only on the brands that account for more than 5 percent of listings. The brands which do not represent a significant percentage will be excluded.  

In [197]:
brand_counts = autos['brand'].value_counts(normalize=True)
common_brands = brand_counts[brand_counts > .05].index
print(common_brands)

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')


In [199]:
brand_mean_prices = {}

for b in common_brands:
    b_only = autos[autos['brand'] == b]
    mean_price = b_only['price'].mean()
    brand_mean_prices[b] = int(mean_price)

brand_mean_prices

{'volkswagen': 5402,
 'bmw': 8332,
 'opel': 2975,
 'mercedes_benz': 8628,
 'audi': 9336,
 'ford': 3749}

#### Findings:

 - The most expensive brand is Audi (mean price 9,336), followed by Mercedes-Benz and BMW.
 - The least expensive is Opel (2,975), followed by Ford.
 - Volkswagen sits in between at 5,402, which is consistent with it being the most listed brand. 

In [201]:
brand_mean_mileage = {}
for b in common_brands:
    b_only = autos[autos['brand'] == b]
    mean_mil = b_only['odometer_km'].mean()
    brand_mean_mileage[b] = int(mean_mil)
    

mean_mileage = pd.Series(brand_mean_mileage).sort_values(ascending=False)
mean_price = pd.Series(brand_mean_prices).sort_values(ascending=False)

In [202]:
brand_agg = pd.DataFrame(mean_mileage,columns=['mean_mileage'])
brand_agg

Unnamed: 0,mean_mileage
bmw,132572
mercedes_benz,130788
opel,129310
audi,129157
volkswagen,128707
ford,124266


In [203]:
brand_agg['mean_price'] = mean_price
brand_agg

Unnamed: 0,mean_mileage,mean_price
bmw,132572,8332
mercedes_benz,130788,8628
opel,129310,2975
audi,129157,9336
volkswagen,128707,5402
ford,124266,3749
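
Because pandas aligns Series on their index, the same table can also be assembled in one step from the two dictionaries. This sketch hard-codes the aggregate values printed above so that it runs on its own.

```python
import pandas as pd

# Aggregates copied from the outputs above.
brand_mean_mileage = {"volkswagen": 128707, "bmw": 132572, "opel": 129310,
                      "mercedes_benz": 130788, "audi": 129157, "ford": 124266}
brand_mean_prices = {"volkswagen": 5402, "bmw": 8332, "opel": 2975,
                     "mercedes_benz": 8628, "audi": 9336, "ford": 3749}

# Each Series becomes a column; rows are matched by brand name, not by position.
brand_agg = pd.DataFrame({
    "mean_mileage": pd.Series(brand_mean_mileage),
    "mean_price": pd.Series(brand_mean_prices),
}).sort_values("mean_mileage", ascending=False)
print(brand_agg)
```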


#### Findings:

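 - Mean mileage varies little across the top brands (from 124,266 km to 132,572 km, a spread of under 7%), while mean price varies about threefold.
 - There is a slight tendency for the more expensive brands to have higher mileage, with Opel as an exception.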

## Next steps:

### Data Translation 

You've probably seen that some values in the dataset are German words. We will translate them and map the values to their English counterparts below.

In [204]:
autos.head()

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,5000.0,control,bus,2004,manuell,158,andere,150000.0,3,lpg,peugeot,nein,2016-03-26 00:00:00,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,8500.0,control,limousine,1997,automatik,286,7er,150000.0,6,benzin,bmw,nein,2016-04-04 00:00:00,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,8990.0,test,limousine,2009,manuell,102,golf,70000.0,7,benzin,volkswagen,nein,2016-03-26 00:00:00,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,4350.0,control,kleinwagen,2007,automatik,71,fortwo,70000.0,6,benzin,smart,nein,2016-03-12 00:00:00,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,1350.0,test,kombi,2003,manuell,0,focus,150000.0,7,benzin,ford,nein,2016-04-01 00:00:00,39218,2016-04-01 14:38:50


The columns with data in German are the following: 

 - `vehicle_type`
 - `gearbox`
 - `fuel_type`
 - `unrepaired_damage`
 
Let's explore each column.

In [205]:
autos['vehicle_type'].unique()

array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv',
       'cabrio', 'andere'], dtype=object)

In [206]:
autos['gearbox'].unique()

array(['manuell', 'automatik', nan], dtype=object)

In [207]:
autos['fuel_type'].unique()

array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro',
       'andere'], dtype=object)

In [208]:
autos['unrepaired_damage'].unique()

array(['nein', nan, 'ja'], dtype=object)

We will now create our translation dictionary, with German words as keys and their English translations as values:

In [209]:
words_translated = {
    'bus':'bus',
    'limousine':'limousine',
    'kleinwagen':'supermini',
    'kombi':'station_wagon',
    'coupe':'coupe',
    'suv':'suv',
    'cabrio':'cabrio',
    'andere' :'other',
    'manuell':'manual',
    'automatik':'automatic',
    'lpg':'lpg',
    'benzin':'petrol',
    'diesel':'diesel',
    'cng':'cng',
    'hybrid':'hybrid',
    'elektro':'electric',
    'nein':'no',
    'ja':'yes'
}
for each in ['vehicle_type','gearbox','fuel_type','unrepaired_damage']:
    autos[each] = autos[each].map(words_translated)

In [210]:
print(autos['vehicle_type'].unique())
print(autos['gearbox'].unique())
print(autos['fuel_type'].unique())
print(autos['unrepaired_damage'].unique())

['bus' 'limousine' 'supermini' 'station_wagon' nan 'coupe' 'suv' 'cabrio'
 'other']
['manual' 'automatic' nan]
['lpg' 'petrol' 'diesel' nan 'cng' 'hybrid' 'electric' 'other']
['no' nan 'yes']


We've successfully translated those fields.
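
One property of `Series.map` is worth keeping in mind when translating this way: any value missing from the dictionary becomes `NaN`. The sketch below shows the behaviour on a toy column, where `'halbautomatik'` is a made-up value that does not appear in our data, and compares it with `Series.replace`, which leaves unmapped values unchanged.

```python
import pandas as pd

# Toy column; 'halbautomatik' is hypothetical and deliberately missing from the mapping.
gearbox = pd.Series(["manuell", "automatik", None, "halbautomatik"])
mapping = {"manuell": "manual", "automatik": "automatic"}

mapped = gearbox.map(mapping)        # unmapped values turn into NaN
replaced = gearbox.replace(mapping)  # unmapped values are kept as they are

# Values that were present before mapping but lost afterwards.
unmapped = gearbox[gearbox.notna() & mapped.isna()].unique().tolist()
print(unmapped)  # ['halbautomatik']
```

In our case every non-null value seen in the four columns has an entry in `words_translated`, so nothing is lost.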

## Completed!

### We've successfully finished our basic data analysis and cleaning. 