# Exploring eBay Car Sales Data

- This project is focused on cleaning and analyzing a large used car data set. 
- Our dataset is from the German eBay classifieds called "Kleinanzeigen" and can be found on:
https://data.world/data-society/used-cars-data
- For the purpose of this project, the DQ team has "dirtied" the data to resemble a real world dataset that would be freshly scraped from the internet. The original, cleaned set was uploaded on Kaggle by user orgesleka.

## Importing the Ebay Dataset

- Importing our dataset using the default encoding UTF-8 results in an encoding error. Setting the encoding to 'Latin-1' fixes this problem. This is most likely due to the German umlaut characters (ä, ö, ü).

In [1]:
import numpy as np
import pandas as pd
autos = pd.read_csv("autos.csv",encoding = 'Latin-1')
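
The encoding failure can be reproduced without the CSV file. The sketch below is only an illustration: the sample string is made up (it is not a value from autos.csv), and it assumes the file stores umlauts as single Latin-1 bytes.

```python
# Hypothetical sample text containing umlauts; not taken from autos.csv.
sample = "Türöffner"

# Bytes as a Latin-1 encoded file would store them.
raw_bytes = sample.encode("latin-1")

# Decoding those bytes as UTF-8 fails because 0xFC (ü) is not a valid UTF-8 byte.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the matching encoding recovers the original text.
decoded = raw_bytes.decode("latin-1")
print(utf8_ok, decoded)
```

This is the same mismatch that `pd.read_csv` runs into when it is asked to read Latin-1 bytes as UTF-8.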


- Displaying the DataFrame inside the Jupyter notebook shows us the first and last 5 rows.

In [2]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

The following observations can be made from the initial inspection of the dataset:


- Dates are stored as objects (dateCrawled, lastSeen, dateCreated).
- Five columns contain null values (vehicleType, gearbox, model, fuelType, notRepairedDamage).
- There is a total of 20 columns.
- The column names use camelcase instead of snakecase.
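
The null-value observation can also be derived programmatically instead of reading it off the `info()` output. The following is a minimal sketch on a toy frame; the frame `toy` and its values are illustrative assumptions that stand in for `autos`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for autos; the values are illustrative only.
toy = pd.DataFrame({
    "vehicleType": ["bus", np.nan, "kombi"],
    "gearbox": ["manuell", "automatik", np.nan],
    "brand": ["ford", "bmw", "audi"],
})

# Count missing values per column and keep only the columns that have any.
null_counts = toy.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)
```

Applied to `autos`, the same two lines should list the five columns named above.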

## Cleaning the Data

### 1. Changing all Columns from Camelcase to Snakecase:

In [4]:
mapping_dict = {
    "yearOfRegistration":"registration_year",
    "monthOfRegistration":"registration_month",
    "notRepairedDamage":"unrepaired_damage",
    "dateCreated":"ad_created",
    "powerPS":"power_ps"
    
}
autos.rename(columns=mapping_dict,inplace=True)

autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuelType', 'brand',
       'unrepaired_damage', 'ad_created', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In the next cell, we iterate through the column names and convert them to snake case using the custom function `change_to_camelcase(string)`. Despite its name, the function converts camelCase to snake_case.

In [5]:
# Iterate through the string and prepend "_" to each uppercase character
# (after lowercasing it), then strip any "_" left at the start or end
# return the snakecase string
def change_to_camelcase(string):
    new_string = ""
    for char in string:
        if char.isupper():
            char = char.lower()
            new_string += "_"+char
        else:
            new_string += char
    new_string = new_string.strip("_")
    return new_string

# Extract columns as Index
camel_col_names = autos.columns

# Iterate through the columns Index and convert to snakecase
snake_col_names= []
for col in camel_col_names:
    col = change_to_camelcase(col)
    snake_col_names.append(col)

# Assign cleaned columns back to DataFrame:
autos.columns= snake_col_names

# Show the first rows with the renamed columns:
autos.head()


Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
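
The character loop above can also be written as a single regular expression. This is an alternative sketch, not the notebook's own method; the function name `camel_to_snake` is an assumption introduced here for illustration.

```python
import re

def camel_to_snake(name: str) -> str:
    """Insert "_" before every uppercase letter except a leading one, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

converted = [camel_to_snake(c) for c in ["dateCrawled", "nrOfPictures", "powerPS"]]
print(converted)  # ['date_crawled', 'nr_of_pictures', 'power_p_s']
```

Like the loop version, this splits consecutive capitals, so `powerPS` would become `power_p_s`. That is one reason to rename such columns explicitly with a mapping dictionary first.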


### 2. Exploring Data for Further Cleaning Tasks

In this section we explore text columns in which nearly all values are identical; these will be dropped from the dataset. We also look for numeric data that is stored as text, which will be cleaned and converted.

In the cell below we analyze the summary statistics for all numeric columns.

In [6]:
autos.describe()

Unnamed: 0,registration_year,power_ps,registration_month,nr_of_pictures,postal_code
count,50000.0,50000.0,50000.0,50000.0,50000.0
mean,2005.07328,116.35592,5.72336,0.0,50813.6273
std,105.712813,209.216627,3.711984,0.0,25779.747957
min,1000.0,0.0,0.0,0.0,1067.0
25%,1999.0,70.0,3.0,0.0,30451.0
50%,2003.0,105.0,6.0,0.0,49577.0
75%,2008.0,150.0,9.0,0.0,71540.0
max,9999.0,17700.0,12.0,0.0,99998.0


Further inspection is warranted on:
- registration_year (invalid values: min=1000, max=9999)
- power_ps (min=0, max=17700)
- registration_month (invalid min of 0)
- nr_of_pictures (every value is 0, so the column carries no information)

Expand the inspection by calling the `describe()` method with `include="all"`, which adds the non-numeric columns to the summary.

In [7]:
autos.describe(include = "all")

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-16 21:50:53,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


In the above cells the `top` descriptor shows us the most common value and `freq` tells us how often this top value appears inside the column.

Inspection of the above data:

- The columns "seller" and "offer_type" each show a single value repeated 49,999 times. These columns should be dropped, since they will not contribute to this analysis.
- Two columns that contain numeric data are stored with the object dtype and need conversion (price, odometer).
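
Columns dominated by a single value can also be flagged automatically. The sketch below is one possible approach under stated assumptions: the helper `near_constant_columns`, its 0.99 threshold, and the toy data (including the value "gewerblich") are all introduced here for illustration.

```python
import pandas as pd

def near_constant_columns(df: pd.DataFrame, threshold: float = 0.99) -> list:
    """Return columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        # Share of the most common value, counting NaN as a value of its own.
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Toy data: "seller" is almost constant, "brand" is evenly split.
toy = pd.DataFrame({
    "seller": ["privat"] * 199 + ["gewerblich"],
    "brand": ["ford", "bmw"] * 100,
})
flagged = near_constant_columns(toy)
print(flagged)
```

The threshold is a judgment call; a lower value would flag more columns as candidates for removal.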

### 3. Converting "price" and "odometer" Columns to Numeric

Check "price" and "odometer" value counts to examine what type of unique text symbols are used and see if any null values are present.

In [8]:
autos["price"].value_counts(dropna= False)

$0         1421
$500        781
$1,500      734
$2,500      643
$1,200      639
           ... 
$6,755        1
$193          1
$46,911       1
$51,990       1
$2,910        1
Name: price, Length: 2357, dtype: int64

In [9]:
autos["odometer"].value_counts(dropna=False)

150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64

Convert the "price" column by removing the thousands separator and the $ symbol from each string and then casting the column to int. In the cells below we convert the "odometer" column to int in the same way. Displaying each column afterwards confirms the int64 dtype.

In [10]:
autos["price"] = autos["price"].str.replace(',', '', regex=False)
# '$' is a regex metacharacter, so match it as a literal string
autos["price"] = autos["price"].str.replace('$', '', regex=False)

In [11]:
autos["price"] = autos["price"].astype(int)

autos["price"]

0         5000
1         8500
2         8990
3         4350
4         1350
         ...  
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price, Length: 50000, dtype: int64

In [12]:
autos["odometer"] = autos["odometer"].str.replace(",","")
autos["odometer"] = autos["odometer"].str.replace("km","")

In [13]:
autos["odometer"] = autos["odometer"].astype(int)
autos["odometer"] 

0        150000
1        150000
2         70000
3         70000
4        150000
          ...  
49995    100000
49996    150000
49997      5000
49998     40000
49999    150000
Name: odometer, Length: 50000, dtype: int64

In [14]:
autos.rename(columns={"odometer":"odometer_km"},inplace=True)
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
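
The two conversions follow the same pattern, so they could be wrapped in a small helper. This is a sketch rather than the notebook's code; the name `strip_to_int` is an assumption, and `regex=False` is passed so that `$` is treated as a literal character.

```python
import pandas as pd

def strip_to_int(series: pd.Series, tokens: list) -> pd.Series:
    """Remove each literal token from the strings, then cast the result to int."""
    cleaned = series
    for token in tokens:
        cleaned = cleaned.str.replace(token, "", regex=False)
    return cleaned.astype(int)

price = strip_to_int(pd.Series(["$5,000", "$0", "$24,900"]), ["$", ","])
odometer = strip_to_int(pd.Series(["150,000km", "5,000km"]), [",", "km"])
print(price.tolist(), odometer.tolist())
```

Note that `astype(int)` raises an error if any string still contains non-digit characters, which makes unexpected formats visible early.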


### 4. Removing Unnecessary Columns

The columns "seller" and "offer_type" each contain a single value in 49,999 of the 50,000 rows, so they will not contribute to our analysis and are removed. The "nr_of_pictures" column contains only 0 values, so we remove it as well.
We display the first three rows to confirm that the columns were dropped from the DataFrame.

In [15]:
autos = autos.drop("seller",axis = 1)
autos = autos.drop("offer_type",axis = 1)
autos = autos.drop("nr_of_pictures", axis = 1)

autos.head(3)

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,35394,2016-04-06 20:15:37


### 5. Exploring Odometer and Price Columns

In this section we will explore the number of unique values, the statistical description and the value counts of each column, including min/max values. We will also analyze and remove any outliers.

In [16]:
# Extract the column:
odometer_col = autos["odometer_km"]

# Number of unique values, returned as a one-element shape tuple:
odometer_col.unique().shape

(13,)

In [17]:
# Statistical summary of the column:
odometer_col.describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [18]:
odometer_col.value_counts()

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

The "odometer" column most likely had a pre-selected odometer range on eBay, because all values are round numbers (multiples of 5,000 km).

In the next cells we will explore the "price" column.

In [19]:
price_col = autos["price"]
price_col.unique().shape

(2357,)

In [20]:
# Show a statistical description and suppress the scientific notation using
# a lambda function
price_col.describe().apply(lambda x: format(x, 'f'))

count       50000.000000
mean         9840.043760
std        481104.380500
min             0.000000
25%          1100.000000
50%          2950.000000
75%          7200.000000
max      99999999.000000
Name: price, dtype: object

In [21]:
price_col.value_counts()

0        1421
500       781
1500      734
2500      643
1000      639
         ... 
20790       1
8970        1
846         1
2895        1
33980       1
Name: price, Length: 2357, dtype: int64

In [22]:
price_outliers = autos[autos["price"]<=100] 

price_outliers.head(10)

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
25,2016-03-21 21:56:18,Ford_escort_kombi_an_bastler_mit_ghia_ausstattung,90,control,kombi,1996,manuell,116,,150000,4,benzin,ford,ja,2016-03-21 00:00:00,27574,2016-04-01 05:16:49
27,2016-03-27 18:45:01,Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE,0,control,,2005,,0,,150000,0,,ford,,2016-03-27 00:00:00,66701,2016-03-27 18:45:01
30,2016-03-14 11:47:31,Peugeot_206_Unfallfahrzeug,80,test,kleinwagen,2002,manuell,60,2_reihe,150000,6,benzin,peugeot,ja,2016-03-14 00:00:00,57076,2016-03-14 11:47:31
55,2016-03-07 02:47:54,Mercedes_E320_AMG_zu_Tauschen!,1,test,,2017,automatik,224,e_klasse,125000,7,benzin,mercedes_benz,nein,2016-03-06 00:00:00,22111,2016-03-08 05:45:44
64,2016-04-05 07:36:19,Autotransport__Abschlepp_Schlepper,40,test,,2011,,0,5er,150000,5,,bmw,,2016-04-05 00:00:00,40591,2016-04-07 12:16:01
71,2016-03-28 19:39:35,Suche_Opel_Astra_F__Corsa_oder_Kadett_E_mit_Re...,0,control,,1990,manuell,0,,5000,0,benzin,opel,,2016-03-28 00:00:00,4552,2016-04-07 01:45:48
80,2016-03-09 15:57:57,Nissan_Primera_Hatchback_1_6_16v_73_Kw___99Ps_...,0,control,coupe,1999,manuell,99,primera,150000,3,benzin,nissan,ja,2016-03-09 00:00:00,66903,2016-03-09 16:43:50
87,2016-03-29 23:37:22,Bmw_520_e39_zum_ausschlachten,0,control,,2000,,0,5er,150000,0,,bmw,,2016-03-29 00:00:00,82256,2016-04-06 21:18:15
99,2016-04-05 09:48:54,Peugeot_207_CC___Cabrio_Bj_2011,0,control,cabrio,2011,manuell,0,2_reihe,60000,7,diesel,peugeot,nein,2016-04-05 00:00:00,99735,2016-04-07 12:17:34
118,2016-03-12 05:03:00,VW_Sharan_V6_204_PS_Karosse_Rohkarosse_mit_Pap...,0,control,bus,2001,manuell,204,sharan,150000,7,benzin,volkswagen,ja,2016-03-12 00:00:00,15370,2016-03-12 21:44:23


In [23]:
price_outliers = autos[autos["price"]>=300000] 
price_outliers
#price_outliers["price"].value_counts()

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
514,2016-03-17 09:53:08,Ford_Focus_Turnier_1.6_16V_Style,999999,test,kombi,2009,manuell,101,focus,125000,4,benzin,ford,nein,2016-03-17 00:00:00,12205,2016-04-06 07:17:35
2897,2016-03-12 21:50:57,Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000,11111111,test,limousine,1973,manuell,48,escort,50000,3,benzin,ford,nein,2016-03-12 00:00:00,94469,2016-03-12 22:45:27
7814,2016-04-04 11:53:31,Ferrari_F40,1300000,control,coupe,1992,,0,,50000,12,,sonstige_autos,nein,2016-04-04 00:00:00,60598,2016-04-05 11:34:11
11137,2016-03-29 23:52:57,suche_maserati_3200_gt_Zustand_unwichtig_laufe...,10000000,control,coupe,1960,manuell,368,,100000,1,benzin,sonstige_autos,nein,2016-03-29 00:00:00,73033,2016-04-06 21:18:11
14715,2016-03-30 08:37:24,Rolls_Royce_Phantom_Drophead_Coupe,345000,control,cabrio,2012,automatik,460,,20000,8,benzin,sonstige_autos,nein,2016-03-30 00:00:00,73525,2016-04-07 00:16:26
22947,2016-03-22 12:54:19,Bmw_530d_zum_ausschlachten,1234566,control,kombi,1999,automatik,190,,150000,2,diesel,bmw,,2016-03-22 00:00:00,17454,2016-04-02 03:17:32
24384,2016-03-21 13:57:51,Schlachte_Golf_3_gt_tdi,11111111,test,,1995,,0,,150000,0,,volkswagen,,2016-03-21 00:00:00,18519,2016-03-21 14:40:18
27371,2016-03-09 15:45:47,Fiat_Punto,12345678,control,,2017,,95,punto,150000,0,,fiat,,2016-03-09 00:00:00,96110,2016-03-09 15:45:47
36818,2016-03-27 18:37:37,Porsche_991,350000,control,coupe,2016,manuell,500,911,5000,3,benzin,porsche,nein,2016-03-27 00:00:00,70499,2016-03-27 18:37:37
37585,2016-03-29 11:38:54,Volkswagen_Jetta_GT,999990,test,limousine,1985,manuell,111,jetta,150000,12,benzin,volkswagen,ja,2016-03-29 00:00:00,50997,2016-03-29 11:38:54


In [24]:
# Displaying unique car prices for top 20 most expensive cars
autos["price"].value_counts().sort_index(ascending = False).head(20)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

We can see that 75% of car prices are at or below $7,200, which is far removed from the maximum price of roughly 100 million dollars. At the other end, 1,421 cars are listed for 0 dollars.

Free cars, or cars with a price tag of 1 dollar or similar, appear to be junk cars, cars offered for trade, or cars that were involved in an accident. We will keep these values, treating them as real prices at which cars were sold or traded.

We will exclude cars priced at 300,000 USD or more. Most of these listings carry arbitrary prices (for example 11111111 or 12345678), for the same reasons some cars are listed for free. They would heavily skew the mean and do not appear to be actual transaction prices.

In [25]:
# Check how many cars exist with a price over 300 000
outliers_price_before = autos.loc[autos["price"] >= 300000,"price"]
outliers_price_before

514        999999
2897     11111111
7814      1300000
11137    10000000
14715      345000
22947     1234566
24384    11111111
27371    12345678
36818      350000
37585      999990
39377    12345678
39705    99999999
42221    27322222
43049      999999
47598    12345678
47634     3890000
Name: price, dtype: int64

In [26]:
# Replace values of 300 000 USD or more with NaN
autos.loc[autos["price"]>=300000,"price"] = np.nan
# Double check if there are any cars left that cost 300 000 USD or more
autos["price"].describe()


count     49984.000000
mean       5707.849652
std        8719.752927
min           0.000000
25%        1100.000000
50%        2950.000000
75%        7200.000000
max      299000.000000
Name: price, dtype: float64

In [27]:
check_null = autos.loc[autos["price"].isnull(),"price"]
check_null

514     NaN
2897    NaN
7814    NaN
11137   NaN
14715   NaN
22947   NaN
24384   NaN
27371   NaN
36818   NaN
37585   NaN
39377   NaN
39705   NaN
42221   NaN
43049   NaN
47598   NaN
47634   NaN
Name: price, dtype: float64
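
The replacement above also explains why the "price" column changed from int64 to float64: NaN is a floating-point value. The sketch below shows the same effect with `Series.mask` on a few illustrative prices; it is an equivalent formulation, not the code used in this notebook.

```python
import pandas as pd

# A handful of illustrative prices, including two at or above the cutoff.
prices = pd.Series([5000, 0, 345000, 99999999], dtype="int64")

# mask() replaces values where the condition is True with NaN,
# which upcasts the integer column to float64.
masked = prices.mask(prices >= 300000)
print(masked.dtype, int(masked.isna().sum()))
```

Summary statistics such as `mean()` and `describe()` skip NaN by default, so the masked rows no longer affect them.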

### 6. Date and Time Columns

Using the info method we can see that the date columns "date_crawled", "ad_created" and "last_seen" are represented as strings (object dtype).

We will analyse these by extracting the date as a string from each column value (the first 10 characters represent the date).

Once we extract the date, we will use the value_counts() method to show the number of rows for each unique date.

In [28]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   date_crawled        50000 non-null  object 
 1   name                50000 non-null  object 
 2   price               49984 non-null  float64
 3   abtest              50000 non-null  object 
 4   vehicle_type        44905 non-null  object 
 5   registration_year   50000 non-null  int64  
 6   gearbox             47320 non-null  object 
 7   power_ps            50000 non-null  int64  
 8   model               47242 non-null  object 
 9   odometer_km         50000 non-null  int64  
 10  registration_month  50000 non-null  int64  
 11  fuel_type           45518 non-null  object 
 12  brand               50000 non-null  object 
 13  unrepaired_damage   40171 non-null  object 
 14  ad_created          50000 non-null  object 
 15  postal_code         50000 non-null  int64  
 16  last

In [29]:
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


In [30]:
date_crawl_date = autos["date_crawled"].str[:10]
date_crawl_date.value_counts( dropna = False).sort_index(ascending= False)

2016-04-07      71
2016-04-06     159
2016-04-05     655
2016-04-04    1826
2016-04-03    1934
2016-04-02    1770
2016-04-01    1690
2016-03-31    1596
2016-03-30    1681
2016-03-29    1709
2016-03-28    1742
2016-03-27    1552
2016-03-26    1624
2016-03-25    1587
2016-03-24    1455
2016-03-23    1619
2016-03-22    1647
2016-03-21    1876
2016-03-20    1891
2016-03-19    1745
2016-03-18     653
2016-03-17    1576
2016-03-16    1475
2016-03-15    1699
2016-03-14    1831
2016-03-13     778
2016-03-12    1839
2016-03-11    1624
2016-03-10    1606
2016-03-09    1661
2016-03-08    1665
2016-03-07    1798
2016-03-06     697
2016-03-05    1269
Name: date_crawled, dtype: int64

The dataset shows us that the listings were crawled over a period of roughly one month, from 5 March to 7 April 2016. The timestamps are consistent.
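
The crawl window can also be measured by parsing the strings into real timestamps instead of slicing them. This is a minimal sketch: the three sample timestamps are illustrative (the time of day in the first one is made up), and the format string assumes the `YYYY-MM-DD HH:MM:SS` layout seen in the table above.

```python
import pandas as pd

# Illustrative timestamps covering the first and last crawl dates.
stamps = pd.Series([
    "2016-03-05 14:06:30",
    "2016-03-26 17:47:46",
    "2016-04-07 06:17:27",
])

# Parse the strings into datetime64 values.
parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")

# Calendar span between the first and last day.
first_day, last_day = parsed.min().date(), parsed.max().date()
span_days = (last_day - first_day).days
print(first_day, last_day, span_days)  # 2016-03-05 2016-04-07 33
```

With parsed timestamps, per-day counts are available through `parsed.dt.date.value_counts().sort_index()`.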

In [31]:
ad_created_dates= autos["ad_created"].str[:10]
ad_created_dates.value_counts(dropna=False).sort_index(ascending=False)

2016-04-07      64
2016-04-06     163
2016-04-05     592
2016-04-04    1844
2016-04-03    1946
              ... 
2015-12-05       1
2015-11-10       1
2015-09-09       1
2015-08-10       1
2015-06-11       1
Name: ad_created, Length: 76, dtype: int64

In [32]:
last_seen_dates= autos["last_seen"].str[:10]
last_seen_dates.value_counts(dropna=False).sort_index(ascending=False)


2016-04-07     6546
2016-04-06    11050
2016-04-05     6214
2016-04-04     1231
2016-04-03     1268
2016-04-02     1245
2016-04-01     1155
2016-03-31     1192
2016-03-30     1242
2016-03-29     1117
2016-03-28     1043
2016-03-27      801
2016-03-26      848
2016-03-25      960
2016-03-24      978
2016-03-23      929
2016-03-22     1079
2016-03-21     1037
2016-03-20     1035
2016-03-19      787
2016-03-18      371
2016-03-17     1396
2016-03-16      822
2016-03-15      794
2016-03-14      640
2016-03-13      449
2016-03-12     1191
2016-03-11      626
2016-03-10      538
2016-03-09      493
2016-03-08      380
2016-03-07      268
2016-03-06      221
2016-03-05       54
Name: last_seen, dtype: int64

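We can see that the last three days (5 to 7 April) hold a disproportionate share of the "last_seen" values, with over 11,000 timestamps on 6 April alone. This probably reflects the end of the crawling period rather than a spike in sales.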

We will now take a look at the registration year column:

In [33]:
reg_year_dates= autos["registration_year"]

reg_year_dates.describe()

count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The registration year column contains inconsistent values. Outliers such as the years 1000 and 9999 stand out. We will remove rows whose registration year falls outside the interval 1900 to 2016.

In [34]:
# Keep only cars registered between 1900 and 2016 (inclusive)
autos = autos[autos["registration_year"].between(1900,2016)]

# Show the ten most common registration years as a share of the
# remaining rows.
autos["registration_year"].value_counts(normalize=True).head(10)

2000    0.069834
2005    0.062776
1999    0.062464
2004    0.056988
2003    0.056779
2006    0.056384
2001    0.056280
2002    0.052740
1998    0.051074
2007    0.047972
Name: registration_year, dtype: float64

In [35]:
autos["registration_year"].value_counts(dropna=False).sort_index(ascending=False)

2016    1316
2015     399
2014     666
2013     806
2012    1323
        ... 
1934       2
1931       1
1929       1
1927       1
1910       9
Name: registration_year, Length: 78, dtype: int64

In [36]:
autos["registration_year"].describe()

count    48028.00000
mean      2002.80351
std          7.31085
min       1910.00000
25%       1999.00000
50%       2003.00000
75%       2008.00000
max       2016.00000
Name: registration_year, dtype: float64

After applying the filter to the "registration_year" column we can see that 1,972 cars fell outside of our predefined range.
The most common registration year is 2000, with about 7% of the remaining cars.
Nine cars are listed with a registration year of 1910. These could be outliers or true data, but they are too few to matter in a dataset of about 48,000 cars.
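
The filter relies on `Series.between`, which includes both endpoints by default. The sketch below checks that behavior on a few illustrative years, some of which appear as extremes in the data above.

```python
import pandas as pd

# Illustrative registration years, including values just inside and outside the range.
years = pd.Series([1000, 1910, 2000, 2016, 2017, 9999])

# between() is inclusive on both ends by default, so 2016 is kept.
in_range = years.between(1900, 2016)
kept = years[in_range]
removed_count = int((~in_range).sum())
print(kept.tolist(), removed_count)  # [1910, 2000, 2016] 3
```

Counting `(~in_range).sum()` before filtering is a quick way to see how many rows a cutoff would remove.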


## Aggregating Car Brands by Price

In this section we aggregate car brands by their mean price. This is achieved by looping over the unique values of the "brand" column, calculating the mean price of each brand, and adding a key-value pair of brand name and mean price to a dictionary.

In [50]:
# Count the listings per brand and assign the result to unique_series
unique_series = autos["brand"].value_counts()


brands_col= autos[["brand","price"]]
#autos["brand"].value_counts() 

unique_series

volkswagen        10188
bmw                5284
opel               5195
mercedes_benz      4580
audi               4149
ford               3352
renault            2274
peugeot            1418
fiat               1242
seat                873
skoda               770
mazda               727
nissan              725
citroen             669
smart               668
toyota              599
sonstige_autos      526
hyundai             473
volvo               444
mini                415
mitsubishi          391
honda               377
kia                 341
alfa_romeo          318
porsche             293
suzuki              284
chevrolet           274
chrysler            176
dacia               123
daihatsu            123
jeep                108
subaru              105
land_rover           98
saab                 77
jaguar               76
trabant              75
daewoo               72
rover                65
lancia               52
lada                 29
Name: brand, dtype: int64

In [38]:
# Compute the mean of `column_worth` for every unique value in `column_val`.
def aggreate_col_val(dataframe,column_val: str,column_worth: str):
    # value_counts() orders the unique values from most to least frequent
    total_val= dataframe[column_val].value_counts()
    mean_dict = {}
    for value in total_val.index:
        # Select the rows for this value and take the vectorized mean
        avg_price = dataframe.loc[dataframe[column_val]==value,column_worth].mean()
        mean_dict[value]= int(avg_price)
    return mean_dict

unsorted_dict=aggreate_col_val(autos,"brand","price")
unsorted_dict


{'volkswagen': 5231,
 'bmw': 8102,
 'opel': 2876,
 'mercedes_benz': 8485,
 'audi': 9093,
 'ford': 3652,
 'renault': 2395,
 'peugeot': 3039,
 'fiat': 2711,
 'seat': 4296,
 'skoda': 6334,
 'mazda': 4010,
 'nissan': 4664,
 'citroen': 3699,
 'smart': 3542,
 'toyota': 5115,
 'sonstige_autos': 10164,
 'hyundai': 5308,
 'volvo': 4757,
 'mini': 10460,
 'mitsubishi': 3333,
 'honda': 3988,
 'kia': 5789,
 'alfa_romeo': 3984,
 'porsche': 43507,
 'suzuki': 3995,
 'chevrolet': 6488,
 'chrysler': 3229,
 'dacia': 5915,
 'daihatsu': 1556,
 'jeep': 11434,
 'subaru': 3765,
 'land_rover': 19108,
 'saab': 3211,
 'jaguar': 11176,
 'trabant': 1552,
 'daewoo': 1019,
 'rover': 1528,
 'lancia': 3246,
 'lada': 2502}

The aggregated result is ordered by listing count, not by price. Among the five most common brands, Audi, Mercedes-Benz and BMW have the highest mean prices, followed by Volkswagen and Opel. Across all brands, Porsche has by far the highest mean price (43,507).
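
The same per-brand means can be obtained without an explicit loop by using `groupby`, and sorting the result makes the price ranking explicit. This is an alternative sketch on a toy frame with made-up prices, not the method used in this notebook.

```python
import pandas as pd

# Toy listings with made-up prices; None becomes NaN and is skipped by mean().
toy = pd.DataFrame({
    "brand": ["audi", "audi", "opel", "opel", "opel"],
    "price": [9000.0, 10000.0, 2000.0, 3000.0, None],
})

# Mean price per brand, highest first.
mean_price_by_brand = (
    toy.groupby("brand")["price"].mean().sort_values(ascending=False)
)
print(mean_price_by_brand)
```

On the full dataset, `.head(5)` on the sorted result would list the five brands with the highest mean price directly.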

## Aggregating Car Mileage

In this section we aggregate the mean mileage for the six most common car brands in our dataset and compare it to the mean price.
First, we create two Series holding the mean mileage and the mean price for each of these brands.

In [42]:
mean_milage = aggreate_col_val(autos,"brand","odometer_km")
mean_mile_series= pd.Series(mean_milage)
mean_mile_series = mean_mile_series.head(6)
mean_mile_series

volkswagen       128730
bmw              132434
opel             129227
mercedes_benz    130860
audi             129287
ford             124046
dtype: int64

In [49]:
mean_price = aggreate_col_val(autos,"brand","price")
mean_price_series=pd.Series(mean_price)
mean_price_series = mean_price_series.head(6)
mean_price_series

volkswagen       5231
bmw              8102
opel             2876
mercedes_benz    8485
audi             9093
ford             3652
dtype: int64

We combine the two Series above into one DataFrame for visual clarity.

In [66]:
milage_mean_df= pd.DataFrame(mean_mile_series, columns=['mean_milage'])
milage_mean_df["mean_price"] = mean_price_series
milage_mean_df.index.name = "top_6_brands"
milage_mean_df


top_6_brands,mean_milage,mean_price
volkswagen,128730,5231
bmw,132434,8102
opel,129227,2876
mercedes_benz,130860,8485
audi,129287,9093
ford,124046,3652


The car brand with the highest mean mileage is BMW, and the car brand with the highest mean price is Audi, closely followed by Mercedes-Benz.
Ford has the lowest mean mileage and the second lowest mean price. This could be because the brand is less common among German consumers than the other top six brands. Other than this observation, there seems to be no clear relationship between mean mileage and mean price, although six brands are too few to draw a firm conclusion.
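
To look at the relationship more directly, one could compute a correlation coefficient on the six rows of the table above. The sketch below rebuilds that table by hand and uses the Pearson correlation from `Series.corr`; with only six data points, the result should be read as a rough indication at most.

```python
import pandas as pd

# Values copied from the top-6 table above.
top6 = pd.DataFrame(
    {
        "mean_milage": [128730, 132434, 129227, 130860, 129287, 124046],
        "mean_price": [5231, 8102, 2876, 8485, 9093, 3652],
    },
    index=["volkswagen", "bmw", "opel", "mercedes_benz", "audi", "ford"],
)

# Brands with the highest mean mileage and the highest mean price.
top_mileage_brand = top6["mean_milage"].idxmax()
top_price_brand = top6["mean_price"].idxmax()

# Pearson correlation between the two columns.
r = top6["mean_milage"].corr(top6["mean_price"])
print(top_mileage_brand, top_price_brand, round(r, 2))
```

A per-listing correlation between "odometer_km" and "price" on the full DataFrame would be more informative than one computed on brand averages.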
