# Exploring eBay Car Sales Data

## Background 

The aim of this project is to clean the data and analyze the included used car listings. 

For this exercise, we'll work with a dataset of used cars from eBay [Kleinanzeigen](https://en.wikipedia.org/wiki/Classified_advertising), a classifieds section of the German eBay website. 


## Step 1: Opening and Exploring the Data

The dataset was originally scraped with [Scrapy](https://scrapy.org) and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data). 

Two modifications have been made to the original dataset that was uploaded to Kaggle:

1. The sample size has been reduced to 50,000 rows; the full dataset has over 370,000.
2. The dataset has been purposely 'dirtied' so that we can practice basic data cleaning.

The data dictionary provided with the data describes the following columns:

   `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
   
   `name` - Name of the car.
   
   `seller` - Whether the seller is private or a dealer.
   
   `offerType` - The type of listing
   
   `price` - The price on the ad to sell the car.
   
   `abtest` - Whether the listing is included in an A/B test.
   
   `vehicleType` - The vehicle Type.
   
   `yearOfRegistration` - The year in which the car was first registered.
   
   `gearbox` - The transmission type.
   
   `powerPS` - The power of the car in PS.
   
   `model` - The car model name.
   
   `kilometer` - How many kilometers the car has driven.
   
   `monthOfRegistration` - The month in which the car was first registered.
   
   `fuelType` - What type of fuel the car uses.
   
   `brand` - The brand of the car.
   
   `notRepairedDamage` - If the car has a damage which is not yet repaired.
   
   `dateCreated` - The date on which the eBay listing was created.
   
   `nrOfPictures` - The number of pictures in the ad.
   
   `postalCode` - The postal code for the location of the vehicle.
   
   `lastSeenOnline` - When the crawler saw this ad last online.


Let's start by importing the libraries we need and loading the data set into a pandas DataFrame object.

In [139]:
import numpy as np
import pandas as pd 

autos = pd.read_csv('autos.csv', encoding='Latin-1')

## Take a look at the data

We can use `DataFrame.info()` to quickly view information about the `autos` DataFrame, and `DataFrame.head(5)` to view the first 5 rows.

In [140]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

In [141]:
autos.head(5)

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


## Initial observations

1. The dataset contains 20 columns, most of which are stored as strings.
2. Some columns have null values, but none has more than ~20% null values.
3. The column names use camelCase rather than Python's preferred snake_case, so we can't simply replace spaces with underscores.

## Step 2: Make some initial changes so that our data is easier to work with: 

We will now convert the column names from `camelCase` to `snake_case` and reword some of the column names based on the data dictionary to be more descriptive. For reference we've printed the columns below.

In [142]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [143]:
autos.columns = [
       'date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'number_pictures', 'postal_code',
       'last_seen'
]

Above, we implemented `snake_case` for the column names. We also changed a few of the names to be easier to understand. We can check the first 5 rows again to see the changes below:

In [144]:
autos.head(5)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
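
The renaming above was typed out by hand. As a rough sketch of how the mechanical part could be automated, the hypothetical helper `camel_to_snake` below (not part of the original notebook) converts camelCase names with a regular expression. Descriptive renames such as `yearOfRegistration` to `registration_year` would still need an explicit mapping.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase identifier to snake_case (illustrative helper)."""
    # Put an underscore before any uppercase letter that directly follows
    # a lowercase letter or digit, then lowercase the whole string.
    with_underscores = re.sub(r'(?<=[a-z0-9])([A-Z])', r'_\1', name)
    return with_underscores.lower()

original = ['dateCrawled', 'offerType', 'powerPS', 'nrOfPictures']
converted = [camel_to_snake(col) for col in original]
print(converted)
```

Runs of capitals such as the `PS` in `powerPS` are kept together, so the result is `power_ps` rather than `power_p_s`.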


## Step 3: Determine what other cleaning tasks need to be done
Initially we will look for: 
- Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis. 
- Examples of numeric data stored as text which can be cleaned and converted.
- Any other data which is logically out-of-bounds or needs further investigation.
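
The first check in this list can also be done programmatically. The sketch below uses a hypothetical helper, `near_constant_columns`, and a small made-up sample rather than the real `autos` data; the 99% threshold is an assumption chosen for illustration.

```python
import pandas as pd

def near_constant_columns(df, threshold=0.99):
    """Return non-numeric columns whose most frequent value covers
    at least `threshold` of the rows (illustrative helper)."""
    flagged = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # only text-like columns are of interest here
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Made-up sample: 'seller' is almost constant, 'brand' is not.
sample = pd.DataFrame({
    'seller': ['privat'] * 199 + ['other'],
    'brand': ['bmw', 'ford'] * 100,
})
flagged = near_constant_columns(sample)
print(flagged)
```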

We'll now use `DataFrame.describe()` to look at descriptive statistics for all columns:

In [145]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-04-02 15:49:30,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


## Step 4: Define basic cleaning strategy

From the above we can see:
   1. The `number_pictures` column can be dropped, as every row has a value of 0. The `seller` and `offer_type` columns hold the same value in all but one row (49,999 of 50,000), so they can also be safely removed from the DataFrame.
   2. `price` and `odometer` are numeric values stored as text.
   3. The date fields `date_crawled`, `ad_created`, and `last_seen` are being interpreted as strings and need some general cleaning.
   4. The min for `registration_month` is 0, a sign of inaccurate data.
   5. The max for `power_ps` suggests inaccurate data as well.
   
   

### 4.1:  Remove the unneeded data from our DataFrame

Below we will start cleaning by dropping the `number_pictures`, `seller`, and `offer_type` data.

In [146]:
autos = autos.drop(["number_pictures", "seller", "offer_type"], axis=1)

### 4.2:  Analysis and treatment of `price` and `odometer` values

We'll start reviewing and treating these values next:

In [147]:
autos['price'].unique()

array(['$5,000', '$8,500', '$8,990', ..., '$385', '$22,200', '$16,995'],
      dtype=object)

In [148]:
autos['odometer'].unique()

array(['150,000km', '70,000km', '50,000km', '80,000km', '10,000km',
       '30,000km', '125,000km', '90,000km', '20,000km', '60,000km',
       '5,000km', '100,000km', '40,000km'], dtype=object)

We'll convert these values below:

In [149]:
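# Strip the currency symbol and thousands separators, then convert to float.
# regex=False treats '$' as a literal character rather than a regex anchor.
autos['price'] = autos['price'].str.replace('$', '', regex=False)
autos['price'] = autos['price'].str.replace(',', '', regex=False)
autos['price'] = autos['price'].astype(float)
# Strip the unit and thousands separators from the odometer readings.
autos['odometer'] = autos['odometer'].str.replace('km', '', regex=False)
autos['odometer'] = autos['odometer'].str.replace(',', '', regex=False)
autos['odometer'] = autos['odometer'].astype(float)
# Rename the odometer column so the unit is kept in the name.
autos.rename(columns={'odometer': 'odometer_km'}, inplace=True)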
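
Since both columns follow the same pattern (strip some literal characters, then cast), the conversion could be wrapped in a small function. The helper `to_number` below is hypothetical, and the sample values are copied from the `unique()` outputs above. This is a sketch, not the notebook's own method.

```python
import pandas as pd

def to_number(series, strip_chars):
    """Remove each literal substring in `strip_chars`, then cast to float."""
    cleaned = series
    for chars in strip_chars:
        # regex=False so that '$' is matched literally
        cleaned = cleaned.str.replace(chars, '', regex=False)
    return cleaned.astype(float)

sample = pd.DataFrame({
    'price': ['$5,000', '$8,500', '$385'],
    'odometer': ['150,000km', '70,000km', '5,000km'],
})
sample['price'] = to_number(sample['price'], ['$', ','])
sample['odometer_km'] = to_number(sample['odometer'], ['km', ','])
sample = sample.drop(columns='odometer')
print(sample)
```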

Let's have another look at our data:

In [150]:
autos.tail(3)

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,13200.0,test,cabrio,2014,automatik,69,500,5000.0,11,benzin,fiat,nein,2016-04-02 00:00:00,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,22900.0,control,kombi,2013,manuell,150,a3,40000.0,11,diesel,audi,nein,2016-03-08 00:00:00,35683,2016-04-05 16:45:07
49999,2016-03-14 00:42:12,Opel_Vectra_1.6_16V,1250.0,control,limousine,1996,manuell,101,vectra,150000.0,1,benzin,opel,nein,2016-03-13 00:00:00,45897,2016-04-06 21:18:48


### 4.2 Continued:  Further cleaning in Odometer and Price columns

We'll start analyzing the `odometer_km` and `price` columns in our DataFrame with the following steps:
- Analyze the columns using minimum and maximum values and look for any values that look unrealistically high or low that we might want to remove
- We'll use:
    - `Series.unique().shape` to see how many unique values there are
    - `Series.describe()` to view `min`, `max`, `median`, `mean`, etc.
    - `Series.value_counts()`, with some variations: 
        - Chained to `.head()` if there are lots of values
        - Because `Series.value_counts()` returns a Series, we can use `Series.sort_index()` with `ascending=True` or `False` to view the highest and lowest values with their counts; this can also be chained to `head()`
- When removing values, we'll opt for `df[df["col"].between(x,y)]`, as it is more readable than the method shown in Dataquest, which is `df[(df["col"] > x ) & (df["col"] < y )]`. Note that `between()` includes both endpoints by default, whereas the comparison form excludes them.
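
To make these techniques concrete, here is a minimal sketch on a made-up `prices` Series (the values are illustrative, not taken from the dataset). It shows the `value_counts().sort_index()` chain and how `between()` differs from strict comparisons at the endpoints.

```python
import pandas as pd

prices = pd.Series([0.0, 0.0, 500.0, 1500.0, 1500.0, 99999999.0], name='price')

# Highest values with their counts: sort the value_counts() result by its index.
top = prices.value_counts().sort_index(ascending=False).head(3)
print(top)

# between() includes both endpoints by default ...
inclusive = prices[prices.between(500, 1500)]
# ... whereas strict comparisons exclude them.
strict = prices[(prices > 500) & (prices < 1500)]
print(len(inclusive), len(strict))
```

With these values, `between(500, 1500)` keeps three rows while the strict comparison keeps none, so the two forms are only interchangeable if the bounds are adjusted accordingly.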

#### Use `Series.unique().shape`, `Series.describe()`, `Series.value_counts()`

In [131]:
autos['odometer_km'].unique().shape

(13,)

In [132]:
autos['odometer_km'].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [133]:
autos['odometer_km'].value_counts()

150000.0    32424
125000.0     5170
100000.0     2169
90000.0      1757
80000.0      1436
70000.0      1230
60000.0      1164
50000.0      1027
5000.0        967
40000.0       819
30000.0       789
20000.0       784
10000.0       264
Name: odometer_km, dtype: int64

In [134]:
autos['price'].unique().shape

(2357,)

In [135]:
autos['price'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [136]:
autos['price'].value_counts().head()

0.0       1421
500.0      781
1500.0     734
2500.0     643
1200.0     639
Name: price, dtype: int64

### 4.3:  Exploring the date columns
Let's now move on to the date columns to further understand the date range the data covers.

According to the data dictionary, five columns represent date values:

 - `date_crawled`: added by Scrapy
 - `last_seen`: added by Scrapy
 - `ad_created`: from the website
 - `registration_month`: from the website
 - `registration_year`: from the website

The two columns added by Scrapy and the `ad_created` column are stored as strings. To do any quantitative analysis, these will need to be converted into a numeric representation.
The other two columns are already numeric, so the `Series.describe()` method can be used to understand their distribution without any extra data processing.

Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:

In [137]:
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


We notice that the first 10 characters represent the date stamp, for example 2016-03-12.

To understand the date range, we can extract just the date values, use `Series.value_counts()` to generate a distribution, and then sort by the index.
To select the first 10 characters in each column, we can use `Series.str[:10]`.
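
Chained together, the three steps just described (slice, count, sort by index) might look like the sketch below. It runs on five timestamps copied from the preview above rather than on the full `autos` DataFrame.

```python
import pandas as pd

date_crawled = pd.Series([
    '2016-03-26 17:47:46', '2016-04-04 13:38:56', '2016-03-26 18:57:24',
    '2016-03-12 16:58:10', '2016-04-01 14:38:50',
], name='date_crawled')

distribution = (
    date_crawled
    .str[:10]                                    # keep only YYYY-MM-DD
    .value_counts(normalize=True, dropna=False)  # share of listings per day
    .sort_index()                                # earliest date first
)
print(distribution)
```

Because the dates are in ISO `YYYY-MM-DD` form, sorting the string index alphabetically also sorts it chronologically.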

#### We'll start with `date_crawled`:

In [104]:
autos['date_crawled'].str[:10]

0        2016-03-26
1        2016-04-04
2        2016-03-26
3        2016-03-12
4        2016-04-01
            ...    
49995    2016-03-27
49996    2016-03-28
49997    2016-04-02
49998    2016-03-08
49999    2016-03-14
Name: date_crawled, Length: 50000, dtype: object

In [105]:
autos['date_crawled'].value_counts(normalize=True, dropna=False)

2016-04-02 15:49:30    0.00006
2016-03-10 15:36:24    0.00006
2016-03-30 17:37:35    0.00006
2016-03-29 23:42:13    0.00006
2016-03-09 11:54:38    0.00006
                        ...   
2016-03-12 15:42:48    0.00002
2016-04-03 16:50:04    0.00002
2016-03-08 15:51:35    0.00002
2016-03-05 15:50:58    0.00002
2016-03-26 19:38:48    0.00002
Name: date_crawled, Length: 48213, dtype: float64

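Next we call `sort_index()`. Note that when it is applied to the column itself rather than to the `value_counts()` result, it simply returns the rows in their original order: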

In [106]:
autos['date_crawled'].sort_index()

0        2016-03-26 17:47:46
1        2016-04-04 13:38:56
2        2016-03-26 18:57:24
3        2016-03-12 16:58:10
4        2016-04-01 14:38:50
                ...         
49995    2016-03-27 14:38:19
49996    2016-03-28 10:50:25
49997    2016-04-02 14:44:48
49998    2016-03-08 19:25:42
49999    2016-03-14 00:42:12
Name: date_crawled, Length: 50000, dtype: object

#### Now for `ad_created`:

In [107]:
autos['ad_created'].value_counts(normalize=True, dropna=False)

2016-04-03 00:00:00    0.03892
2016-03-20 00:00:00    0.03786
2016-03-21 00:00:00    0.03772
2016-04-04 00:00:00    0.03688
2016-03-12 00:00:00    0.03662
                        ...   
2015-06-11 00:00:00    0.00002
2015-12-30 00:00:00    0.00002
2016-01-14 00:00:00    0.00002
2016-01-13 00:00:00    0.00002
2016-02-22 00:00:00    0.00002
Name: ad_created, Length: 76, dtype: float64

In [108]:
autos['ad_created'].sort_index()

0        2016-03-26 00:00:00
1        2016-04-04 00:00:00
2        2016-03-26 00:00:00
3        2016-03-12 00:00:00
4        2016-04-01 00:00:00
                ...         
49995    2016-03-27 00:00:00
49996    2016-03-28 00:00:00
49997    2016-04-02 00:00:00
49998    2016-03-08 00:00:00
49999    2016-03-13 00:00:00
Name: ad_created, Length: 50000, dtype: object

#### Followed by `last_seen`:

In [109]:
autos['last_seen'].value_counts(normalize=True, dropna=False)

2016-04-07 06:17:27    0.00016
2016-04-07 03:16:17    0.00014
2016-04-06 06:17:24    0.00014
2016-04-06 21:17:51    0.00014
2016-04-06 02:16:12    0.00012
                        ...   
2016-04-05 17:27:08    0.00002
2016-04-02 19:19:07    0.00002
2016-03-30 12:15:42    0.00002
2016-04-06 03:16:18    0.00002
2016-04-04 00:47:10    0.00002
Name: last_seen, Length: 39481, dtype: float64

In [110]:
autos['last_seen'].sort_index()

0        2016-04-06 06:45:54
1        2016-04-06 14:45:08
2        2016-04-06 20:15:37
3        2016-03-15 03:16:28
4        2016-04-01 14:38:50
                ...         
49995    2016-04-01 13:47:40
49996    2016-04-02 14:18:02
49997    2016-04-04 11:47:27
49998    2016-04-05 16:45:07
49999    2016-04-06 21:18:48
Name: last_seen, Length: 50000, dtype: object
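
String slicing is enough for a per-day distribution. For arithmetic on the timestamps, one option (not used in this notebook) is to parse them with `pd.to_datetime`. The sketch below uses the first two rows shown earlier; the `days_online` column name is an assumption introduced for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    'date_crawled': ['2016-03-26 17:47:46', '2016-04-04 13:38:56'],
    'last_seen': ['2016-04-06 06:45:54', '2016-04-06 14:45:08'],
})

# Parse the fixed 'YYYY-MM-DD HH:MM:SS' layout into datetime64 values.
for col in ['date_crawled', 'last_seen']:
    sample[col] = pd.to_datetime(sample[col], format='%Y-%m-%d %H:%M:%S')

# Whole days between the first crawl and the last time the ad was seen.
sample['days_online'] = (sample['last_seen'] - sample['date_crawled']).dt.days
print(sample['days_online'].tolist())
```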

#### A closer look at `registration_year`:

In [112]:
autos['registration_year'].describe()

count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

### Handling incorrect `registration_year` Data

We find some erroneous values in the `registration_year` column, with the min (1000) and max (9999) outside any plausible range.

  - We should expect the listings to include values in the 1900 - 2016 range:
      - A car cannot be first registered after its listing was seen, so any `registration_year` above 2016 is invalid.
      - Any year prior to 1900 is highly unlikely as well.

We will count the number of listings which fall outside this range to determine whether it is viable to remove those rows. Note that `Series.between()` includes both endpoints, so the bounds 1899 and 2017 used below also keep those two years; `between(1900, 2016)` would match the stated range exactly.

In [113]:
autos.loc[~ autos['registration_year'].between(1899, 2017), 'registration_year'].shape

(519,)

There are 519 such listings, roughly 1% of our 50,000-row sample, so removing them is viable. We first mark the erroneous `registration_year` values as missing and then drop those rows:

In [114]:
autos.loc[~ autos['registration_year'].between(1899, 2017), 'registration_year'] = np.nan

In [115]:
autos.dropna(subset=['registration_year'], axis=0, inplace=True)

In [116]:
autos['registration_year'].value_counts(normalize=True)

2000.0    0.067784
2005.0    0.060932
1999.0    0.060629
2004.0    0.055314
2003.0    0.055112
            ...   
1939.0    0.000020
1938.0    0.000020
1931.0    0.000020
1929.0    0.000020
1927.0    0.000020
Name: registration_year, Length: 79, dtype: float64

In [117]:
autos.shape

(49481, 17)
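
As an alternative sketch, the same kind of filtering can be done in one step with a boolean mask, which also keeps the column as integers instead of converting it to float via `NaN`. The sample values below are made up, and the bounds follow the 1900 - 2016 range stated in the text rather than the 1899 and 2017 used in the cells above.

```python
import pandas as pd

sample = pd.DataFrame(
    {'registration_year': [1000, 1899, 1900, 2003, 2016, 2017, 9999]}
)

# between() is inclusive on both ends, so 1900 and 2016 are kept.
valid = sample['registration_year'].between(1900, 2016)

n_invalid = int((~valid).sum())          # rows that would be dropped
share_invalid = float((~valid).mean())   # share of rows that would be dropped
print(n_invalid, share_invalid)

cleaned = sample[valid]
print(cleaned['registration_year'].tolist())
```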

## Step 7: Exploring Price by Brand


When working with data on cars, it's natural to explore variations across different car brands. We can use aggregation to understand the brand column. Here's what the process looks like:

- Identify the unique values we want to aggregate by
- Create an empty dictionary to store our aggregate data
- Loop over the unique values, and for each:
    - Subset the DataFrame by the unique value
    - Calculate the mean of whichever column we're interested in
    - Assign the value/mean pair to the dictionary as key/value
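
The aggregation steps above can be sketched as follows. The sample DataFrame and its prices are made up for illustration, and `mean_price_by_brand` is a hypothetical variable name. A `groupby` one-liner is included for comparison, since it should give the same result.

```python
import pandas as pd

sample = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'ford', 'ford', 'opel'],
    'price': [8500.0, 7500.0, 1350.0, 1650.0, 1250.0],
})

mean_price_by_brand = {}                    # empty dict for the aggregate data
for brand in sample['brand'].unique():      # loop over the unique values
    subset = sample[sample['brand'] == brand]                   # subset by brand
    mean_price_by_brand[brand] = float(subset['price'].mean())  # store key/value
print(mean_price_by_brand)

# The same aggregation using pandas' built-in groupby.
grouped = sample.groupby('brand')['price'].mean().to_dict()
```

The explicit loop mirrors the steps listed above; `groupby` is usually the more concise choice once the idea is clear.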
