# This project aims to clean the data and analyze the included used car listings

The dataset was originally scraped and uploaded to Kaggle by user orgesleka.
The original dataset isn't available on Kaggle anymore, but you can find it [here](https://data.world/data-society/used-cars-data). 

We've made a few modifications from the original dataset:

- We sampled 50,000 data points from the full dataset, to ensure your code runs quickly in our hosted environment
- We dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with)

The data dictionary provided with data is as follows:

- `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
- `name` - Name of the car.
- `seller` - Whether the seller is private or a dealer.
- `offerType` - The type of listing
- `price` - The price on the ad to sell the car.
- `abtest` - Whether the listing is included in an A/B test.
- `vehicleType` - The vehicle Type.
- `yearOfRegistration` - The year in which the car was first registered.
- `gearbox` - The transmission type.
- `powerPS` - The power of the car in PS.
- `model` - The car model name.
- `kilometer` - How many kilometers the car has driven.
- `monthOfRegistration` - The month in which the car was first registered.
- `fuelType` - What type of fuel the car uses.
- `brand` - The brand of the car.
- `notRepairedDamage` - If the car has a damage which is not yet repaired.
- `dateCreated` - The date on which the eBay listing was created.
- `nrOfPictures` - The number of pictures in the ad.
- `postalCode` - The postal code for the location of the vehicle.
- `lastSeenOnline` - When the crawler saw this ad last online.

Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
autos = pd.read_csv('autos.csv', encoding='Latin-1')

In [3]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371523,2016-03-14 17:48:27,Suche_t4___vito_ab_6_sitze,privat,Angebot,2200,test,,2005,,0,,20000,1,,sonstige_autos,,2016-03-14 00:00:00,0,39576,2016-04-06 00:46:52
371524,2016-03-05 19:56:21,Smart_smart_leistungssteigerung_100ps,privat,Angebot,1199,test,cabrio,2000,automatik,101,fortwo,125000,3,benzin,smart,nein,2016-03-05 00:00:00,0,26135,2016-03-11 18:17:12
371525,2016-03-19 18:57:12,Volkswagen_Multivan_T4_TDI_7DC_UY2,privat,Angebot,9200,test,bus,1996,manuell,102,transporter,150000,3,diesel,volkswagen,nein,2016-03-19 00:00:00,0,87439,2016-04-07 07:15:26
371526,2016-03-20 19:41:08,VW_Golf_Kombi_1_9l_TDI,privat,Angebot,3400,test,kombi,2002,manuell,100,golf,150000,6,diesel,volkswagen,,2016-03-20 00:00:00,0,40764,2016-03-24 12:45:21


## Print basic info about `autos` dataframe

In [4]:
print(autos.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-nu

## Print first n rows of the `autos` dataset

In [5]:
print(autos.head(5))

           dateCrawled                            name  seller offerType  \
0  2016-03-24 11:52:17                      Golf_3_1.6  privat   Angebot   
1  2016-03-24 10:58:45            A5_Sportback_2.7_Tdi  privat   Angebot   
2  2016-03-14 12:52:21  Jeep_Grand_Cherokee_"Overland"  privat   Angebot   
3  2016-03-17 16:54:04              GOLF_4_1_4__3TÜRER  privat   Angebot   
4  2016-03-31 17:25:20  Skoda_Fabia_1.4_TDI_PD_Classic  privat   Angebot   

   price abtest vehicleType  yearOfRegistration    gearbox  powerPS  model  \
0    480   test         NaN                1993    manuell        0   golf   
1  18300   test       coupe                2011    manuell      190    NaN   
2   9800   test         suv                2004  automatik      163  grand   
3   1500   test  kleinwagen                2001    manuell       75   golf   
4   3600   test  kleinwagen                2008    manuell       69  fabia   

   kilometer  monthOfRegistration fuelType       brand notRepairedDamage  

# Data cleaning start

## 1. Update column names

Change column names from camelCase to Python's preferred snake_case

In [6]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'kilometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [7]:
autos.rename({
    "dateCrawled":"date_crawled",
    "offerType":"offer_type",
    "vehicleType":"vehicle_type",
    "powerPS":"power_in_ps",
    "yearOfRegistration":"registration_year",
    "monthOfRegistration":"registration_month",
    "fuelType":"fuel_type",
    "notRepairedDamage":"not_repaired_damage",
    "dateCreated":"ad_created",
    "nrOfPictures":"num_of_pictures",
    "postalCode":"postal_code",
    "lastSeen":"last_seen",
    "kilometer":"odometer_km"
}, axis=1, inplace=True)
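
The mapping above is written by hand, which also lets it reorder words (`yearOfRegistration` becomes `registration_year`). As a rough sketch only, a purely mechanical conversion could be done with a regular expression. `camel_to_snake` is a hypothetical helper, not part of pandas, and it would yield `year_of_registration` rather than the hand-picked names used in this notebook.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase identifier to snake_case (hypothetical helper)."""
    # Put an underscore at every lowercase/digit -> uppercase boundary,
    # then lowercase the whole string.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# A few of the original column names from the dataset
original = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "abtest"]
converted = [camel_to_snake(col) for col in original]
print(converted)
```

A helper like this is handy when there are many columns, while the manual dictionary remains the better choice when the new names should differ from a literal translation.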

Display n rows with updated column names

In [8]:
autos.head(5)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_in_ps,model,odometer_km,registration_month,fuel_type,brand,not_repaired_damage,ad_created,num_of_pictures,postal_code,last_seen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


In [9]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_in_ps,model,odometer_km,registration_month,fuel_type,brand,not_repaired_damage,ad_created,num_of_pictures,postal_code,last_seen
count,371528,371528,371528,371528,371528.0,371528,333659,371528.0,351319,371528.0,351044,371528.0,371528.0,338142,371528,299468,371528,371528.0,371528.0,371528
unique,280500,233531,2,2,,2,8,,2,,251,,,7,40,2,114,,,182806
top,2016-03-24 14:49:47,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:45:59
freq,7,657,371525,371516,,192585,95894,,274214,,30070,,,223857,79640,263182,14450,,,17
mean,,,,,17295.14,,,2004.577997,,115.549477,,125618.688228,5.734445,,,,,0.0,50820.66764,
std,,,,,3587954.0,,,92.866598,,192.139578,,40112.337051,3.712412,,,,,0.0,25799.08247,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1150.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30459.0,
50%,,,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49610.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71546.0,


In [10]:
autos["registration_year"].unique()

array([1993, 2011, 2004, 2001, 2008, 1995, 1980, 2014, 1998, 2005, 1910,
       2016, 2007, 2009, 2002, 2018, 1997, 1990, 2017, 1981, 2003, 1994,
       1991, 1984, 2006, 1999, 2012, 2010, 2000, 1992, 2013, 1996, 1985,
       1989, 2015, 1968, 1982, 1976, 1983, 1959, 1973, 1111, 1969, 1971,
       1987, 1986, 1988, 1967, 1970, 1965, 1945, 1925, 1974, 1979, 1955,
       1978, 1972, 1977, 1961, 1963, 1964, 1960, 1966, 1975, 1937, 1936,
       5000, 1954, 1958, 9999, 1956, 3200, 1000, 1933, 1941, 1962, 8888,
       1500, 2200, 4100, 1929, 1951, 1957, 1940, 3000, 2066, 1949, 2019,
       1800, 1953, 1935, 1234, 8000, 5300, 9000, 2900, 6000, 5900, 5911,
       1400, 1950, 4000, 1948, 1952, 1200, 8500, 1932, 1255, 1927, 1923,
       1931, 3700, 3800, 4800, 1942, 7000, 1911, 6500, 2290, 2500, 1930,
       1001, 6200, 9450, 1944, 1943, 1947, 1934, 1938, 1688, 2800, 1253,
       1928, 7500, 1919, 5555, 7777, 5600, 1600, 1939, 2222, 1039, 9996,
       1300, 8455, 1915, 4500, 1920, 1602, 7800, 92

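#### Observations
- `registration_year` contains implausible values such as 1000, 1111, 5000 and 9999; these are examined in the `registration_year` analysis further down
- `price` has a standard deviation (about 3.59 million) far larger than its 75th percentile (7,200), which points to extreme outliers
- `num_of_pictures` has a mean and standard deviation of 0, so every value in that column is 0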


## 2. Investigate columns with low amount of unique values

Check the value distribution of columns with fewer than 10 unique values

In [11]:
autos["seller"].value_counts()

privat        371525
gewerblich         3
Name: seller, dtype: int64

In [12]:
autos["offer_type"].value_counts()

Angebot    371516
Gesuch         12
Name: offer_type, dtype: int64

In [13]:
autos["abtest"].value_counts()

test       192585
control    178943
Name: abtest, dtype: int64

In [14]:
autos["vehicle_type"].value_counts()

limousine     95894
kleinwagen    80023
kombi         67564
bus           30201
cabrio        22898
coupe         19015
suv           14707
andere         3357
Name: vehicle_type, dtype: int64

In [15]:
autos["gearbox"].value_counts()

manuell      274214
automatik     77105
Name: gearbox, dtype: int64

In [16]:
autos["fuel_type"].value_counts()

benzin     223857
diesel     107746
lpg          5378
cng           571
hybrid        278
andere        208
elektro       104
Name: fuel_type, dtype: int64

In [17]:
autos["not_repaired_damage"].value_counts()

nein    263182
ja       36286
Name: not_repaired_damage, dtype: int64

In [18]:
autos["num_of_pictures"].head(5)

0    0
1    0
2    0
3    0
4    0
Name: num_of_pictures, dtype: int64

#### Observations:
- Columns `seller` and `offer_type` are dominated by a single value (all but 3 and all but 12 rows, respectively), so they will be dropped
- Column `num_of_pictures` contains only 0 values and will be dropped
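
The per-column checks above could also be summarized in one pass. The following is a minimal sketch, assuming a hypothetical helper `dominant_share` and toy data rather than the real listings; it reports, for each low-cardinality column, the share taken by its most frequent value.

```python
import pandas as pd

def dominant_share(df, max_unique=10):
    """Share of the most frequent value for each low-cardinality column."""
    shares = {}
    for col in df.columns:
        # Relative frequency of each non-null value, most frequent first
        freq = df[col].value_counts(normalize=True)
        if 0 < len(freq) <= max_unique:
            shares[col] = freq.iloc[0]
    return pd.Series(shares, dtype=float).sort_values(ascending=False)

# Toy data, not the real listings
toy = pd.DataFrame({
    "seller": ["privat"] * 9 + ["gewerblich"],
    "gearbox": ["manuell"] * 5 + ["automatik"] * 5,
    "price": list(range(100, 1100, 100)),
})
shares = dominant_share(toy, max_unique=5)
print(shares)
```

Columns whose share is close to 1 carry almost no information and are candidates for dropping.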

## 3. Drop the broken column and columns dominated by a single value

In [19]:
autos.drop("num_of_pictures", axis=1, inplace=True)

In [20]:
autos.drop("seller", axis=1, inplace=True)
autos.drop("offer_type", axis=1, inplace=True)

In [21]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 17 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   date_crawled         371528 non-null  object
 1   name                 371528 non-null  object
 2   price                371528 non-null  int64 
 3   abtest               371528 non-null  object
 4   vehicle_type         333659 non-null  object
 5   registration_year    371528 non-null  int64 
 6   gearbox              351319 non-null  object
 7   power_in_ps          371528 non-null  int64 
 8   model                351044 non-null  object
 9   odometer_km          371528 non-null  int64 
 10  registration_month   371528 non-null  int64 
 11  fuel_type            338142 non-null  object
 12  brand                371528 non-null  object
 13  not_repaired_damage  299468 non-null  object
 14  ad_created           371528 non-null  object
 15  postal_code          371528 non-nu

## 4. Look for any values that look unrealistically high or low (outliers)

### 4.1 Determine outliers in the `autos["odometer_km"]` column

Analyze `odometer_km` column

In [22]:
autos["odometer_km"].unique().shape

(13,)

In [23]:
autos["odometer_km"].describe()

count    371528.000000
mean     125618.688228
std       40112.337051
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [24]:
autos["odometer_km"].value_counts()

150000    240797
125000     38067
100000     15920
90000      12523
80000      11053
70000       9773
60000       8669
50000       7615
5000        7069
40000       6376
30000       6041
20000       5676
10000       1949
Name: odometer_km, dtype: int64

In [25]:
autos["odometer_km"].describe()

count    371528.000000
mean     125618.688228
std       40112.337051
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

Using the interquartile rule to find outliers in `autos["odometer_km"]` column

#### Observations

- The median and the 75th percentile are both equal to the maximum value, 150,000 km
- The column holds only 13 distinct, rounded values, so the distribution looks reasonable given the bucketed nature of the data
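
To make the interquartile rule concrete for this column, here is a minimal sketch that rebuilds the odometer readings from the `value_counts()` output above and computes the 1.5 × IQR fences. The helper name `iqr_fences` is an assumption for illustration. Applied literally, the rule would flag every reading of 80,000 km or less (roughly 17% of rows), which illustrates why it is a poor fit for a column that is bucketed and capped at 150,000 km.

```python
import numpy as np
import pandas as pd

# Counts copied from the value_counts() output above
odometer_counts = {
    150000: 240797, 125000: 38067, 100000: 15920, 90000: 12523,
    80000: 11053, 70000: 9773, 60000: 8669, 50000: 7615, 5000: 7069,
    40000: 6376, 30000: 6041, 20000: 5676, 10000: 1949,
}
# Expand the counts back into one value per listing
odometer = pd.Series(np.repeat(list(odometer_counts.keys()),
                               list(odometer_counts.values())))

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(odometer)
n_below = (odometer < lower).sum()  # readings the rule would flag as too low
n_above = (odometer > upper).sum()  # readings the rule would flag as too high
print(lower, upper, n_below, n_above)
```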



### 4.2 Determine outliers in the `autos["price"]` column

In [26]:
autos["price"].value_counts()

0         10778
500        5670
1500       5394
1000       4649
1200       4594
          ...  
349000        1
8889          1
3440          1
1997          1
10985         1
Name: price, Length: 5597, dtype: int64

In [27]:
autos["price"].describe()

count    3.715280e+05
mean     1.729514e+04
std      3.587954e+06
min      0.000000e+00
25%      1.150000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.147484e+09
Name: price, dtype: float64

In [28]:
autos.shape

(371528, 17)

In [29]:
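n_rows = autos.shape[0]
# Count the zero-price rows instead of hard-coding the number
n_zero_prices = (autos["price"] == 0).sum()
percentage_zero_prices = round((n_zero_prices / n_rows) * 100)
print(f"Rows with a price of 0 make up {percentage_zero_prices}% of the total")

Rows with a price of 0 make up 3% of the total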


Using the interquartile rule to find outliers in `autos["price"]` column

In [30]:
sorted_prices = autos["price"].sort_values()
sorted_prices.head(5)

275109    0
215757    0
338766    0
253111    0
141603    0
Name: price, dtype: int64

In [31]:
Q1 = sorted_prices.quantile(.25)
Q3 = sorted_prices.quantile(.75)
IQR = Q3 - Q1
upperFence = Q3 + (1.5 * IQR)
lowerFence = Q1 - (1.5 * IQR)
print(f"The 25% quartile is {Q1}")
print(f"The 75% quartile is {Q3}")
print(f"The interquartile range is {IQR}")
print(f"The upper limit for outliers is {upperFence}")
print(f"The lower limit for outliers is {lowerFence}")

The 25% quartile is 1150.0
The 75% quartile is 7200.0
The interquartile range is 6050.0
The upper limit for outliers is 16275.0
The lower limit for outliers is -7925.0


In [32]:
values_above_upper_limit = autos[autos["price"] > upperFence]
autos_without_outliers = autos[autos["price"].between(1, upperFence)]

In [33]:
values_above_upper_limit["price"].count()

28108

In [34]:
autos_without_outliers["price"].count()

332642

Determine what share of all rows lies above the upper limit

In [35]:
prices_above_upper_limit = round((len(values_above_upper_limit) / n_rows) * 100)
print(f"Rows above the upper limit make up {prices_above_upper_limit}% of the total")

Rows above the upper limit make up 8% of the total


In [36]:
autos = autos[autos["price"].between(1, upperFence)]
autos.shape

(332642, 17)

In [37]:
autos["price"].describe()

count    332642.000000
mean       4109.250005
std        3844.906790
min           1.000000
25%        1150.000000
50%        2700.000000
75%        5999.000000
max       16270.000000
Name: price, dtype: float64

#### Observations
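- Rows with a price of 0 make up only about 3% of the data, so we treat them as invalid and exclude them
- Prices above the upper limit (16,275) make up about 8% of the data, so we treat them as outliers and exclude them
- After filtering, the mean (about 4,109) is higher than the median (2,700), so the distribution is still right-skewed, though both stay below Q3 (5,999)
- The remaining distribution looks acceptable for price data, although the minimum of 1 may indicate that some placeholder prices remain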

## 5. Analyze the range of the date columns

In [38]:
print(autos.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 332642 entries, 0 to 371526
Data columns (total 17 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   date_crawled         332642 non-null  object
 1   name                 332642 non-null  object
 2   price                332642 non-null  int64 
 3   abtest               332642 non-null  object
 4   vehicle_type         299207 non-null  object
 5   registration_year    332642 non-null  int64 
 6   gearbox              315439 non-null  object
 7   power_in_ps          332642 non-null  int64 
 8   model                315482 non-null  object
 9   odometer_km          332642 non-null  int64 
 10  registration_month   332642 non-null  int64 
 11  fuel_type            303495 non-null  object
 12  brand                332642 non-null  object
 13  not_repaired_damage  267503 non-null  object
 14  ad_created           332642 non-null  object
 15  postal_code          332642 non-nu

In [39]:
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-24 11:52:17,2016-03-24 00:00:00,2016-04-07 03:16:57
2,2016-03-14 12:52:21,2016-03-14 00:00:00,2016-04-05 12:47:46
3,2016-03-17 16:54:04,2016-03-17 00:00:00,2016-03-17 17:40:17
4,2016-03-31 17:25:20,2016-03-31 00:00:00,2016-04-06 10:17:21
5,2016-04-04 17:36:23,2016-04-04 00:00:00,2016-04-06 19:17:07


### 5.1 Analyze `date_crawled` column

- Extract year, month and day from `date_crawled` column
- Sort values by index
- Convert values to relative frequency

In [40]:
date_crawled = autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
date_crawled = date_crawled.apply(lambda x: x * 100)

In [41]:
date_crawled.describe()

count    34.000000
mean      2.941176
std       0.967754
min       0.159932
25%       3.017809
50%       3.291527
75%       3.496402
max       3.856098
Name: date_crawled, dtype: float64

#### Observations
- Listings were crawled at a broadly similar daily rate over 34 days (2016-03-05 to 2016-04-07), typically around 3% of listings per day
- On 6, 13 and 18 March 2016 the share of crawled listings drops to roughly 1.3% to 1.6%
- From 2016-04-05 onward the number of crawled listings falls sharply
- The standard deviation of the daily share is about 0.97 percentage points, so day-to-day variability is low
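
Slicing the first ten characters works because every timestamp has the same layout. As an alternative sketch, the timestamps could be parsed with `pd.to_datetime`, which raises an error if a value does not match the expected format. The helper name `daily_share` is an assumption for illustration, and the sample holds only the first four `date_crawled` values shown at the top of the notebook.

```python
import pandas as pd

def daily_share(timestamps):
    """Percentage of rows per calendar day, sorted by date."""
    # Parsing raises an error if a value does not match the expected layout
    parsed = pd.to_datetime(timestamps, format="%Y-%m-%d %H:%M:%S")
    days = parsed.dt.strftime("%Y-%m-%d")
    return days.value_counts(normalize=True).sort_index() * 100

# First four date_crawled values from the table at the top of the notebook
sample = pd.Series([
    "2016-03-24 11:52:17",
    "2016-03-24 10:58:45",
    "2016-03-14 12:52:21",
    "2016-03-17 16:54:04",
])
share = daily_share(sample)
print(share)
```

The same function can be reused for each date column, which avoids accidentally rescaling the wrong series when the cells are copied.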

### 5.2 Analyze `ad_created` column

- Extract year, month and day from `ad_created` column
- Sort values by index
- Convert values to relative frequency

In [42]:
ad_created = autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
# Scale the ad_created frequencies (not date_crawled) to percentages
ad_created = ad_created.apply(lambda x: x * 100)
print(ad_created)

In [43]:
ad_created.describe()

#### Observations
- In the unfiltered data, `describe()` reported 114 distinct `ad_created` values, compared with 34 crawl dates, so some ads were presumably created before the crawl period started
- The most common creation date in the unfiltered data is 2016-04-03, with 14,450 listings (about 3.9%)

### 5.3 Analyze `last_seen` column

- Extract year, month and day from `last_seen` column
- Sort values by index
- Convert values to relative frequency

In [44]:
last_seen = autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
# Scale the last_seen frequencies (not date_crawled) to percentages
last_seen = last_seen.apply(lambda x: x * 100)
print(last_seen)

In [45]:
last_seen.describe()

#### Observations:
- In the unfiltered data, `last_seen` holds 182,806 distinct timestamps, so truncating to the date is what makes a per-day comparison possible
- A listing last seen near the end of the crawl period may simply still have been online when crawling stopped, so a high share on the final days does not necessarily mean those cars were sold or removed

### 5.4 Analyze `registration_year` column

In [46]:
registration_year = autos["registration_year"].value_counts(normalize=True, dropna=False).sort_index()
# registration_year = registration_year.apply(lambda x: x * 100)
print(registration_year)

1000    0.000090
1001    0.000003
1039    0.000003
1111    0.000003
1234    0.000012
          ...   
8500    0.000003
8888    0.000003
9000    0.000006
9450    0.000003
9999    0.000048
Name: registration_year, Length: 137, dtype: float64


In [47]:
autos["registration_year"].describe()

count    332642.000000
mean       2003.878819
std          76.894784
min        1000.000000
25%        1999.000000
50%        2003.000000
75%        2007.000000
max        9999.000000
Name: registration_year, dtype: float64

Retrieve values before 1900

In [48]:
autos_before_1900 = autos[autos["registration_year"] < 1900]

In [49]:
autos_before_1900.shape

(52, 17)

In [50]:
autos_before_1900["registration_year"].count()

52

In [51]:
autos_before_1900["registration_year"].value_counts()

1000    30
1800     5
1234     4
1500     3
1600     2
1300     2
1400     1
1255     1
1001     1
1111     1
1039     1
1602     1
Name: registration_year, dtype: int64

Retrieve values after 2016

In [52]:
autos_after_2016 = autos[autos["registration_year"] > 2016]

In [53]:
autos_after_2016["registration_year"].value_counts().sort_index()

2017    9780
2018    3729
2019      15
2066       1
2200       1
2222       1
2290       1
2500       3
2800       1
2900       1
3000       6
3200       1
3700       1
3800       1
4000       3
4100       1
4500       2
4800       1
5000      16
5300       1
5555       2
5600       1
5900       1
5911       2
6000       4
6500       1
7000       4
7100       1
7800       1
8000       2
8200       1
8500       1
8888       1
9000       2
9450       1
9999      16
Name: registration_year, dtype: int64

Count the rows with a registration year before 1900 or after 2016 and determine what percentage of the data they represent

In [54]:
invalid_rows = (~(autos['registration_year'].between(1900, 2016))).sum()

In [55]:
invalid_rows_percentage = (invalid_rows / autos.shape[0]) * 100
invalid_rows_percentage

4.106216292590833

#### Observations:
- Invalid rows make up only about 4% of all rows
- They can therefore be removed from the `autos` dataframe

Remove invalid rows from `autos` dataframe

In [56]:
autos = autos[autos['registration_year'].between(1900, 2016)]
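
The upper bound of 2016 follows from the crawl dates: a car cannot have been registered after its listing was crawled. As a sketch under that assumption, the bound could be derived from `date_crawled` instead of being hard-coded. `filter_registration_year` is a hypothetical helper and the frame below is toy data.

```python
import pandas as pd

def filter_registration_year(df, min_year=1900):
    """Keep rows whose registration year lies between min_year and the last crawl year."""
    # A car cannot be registered after the listing was crawled
    max_year = pd.to_datetime(df["date_crawled"]).dt.year.max()
    valid = df["registration_year"].between(min_year, max_year)
    invalid_pct = (~valid).mean() * 100
    return df[valid], invalid_pct

# Toy data, not the real listings
toy = pd.DataFrame({
    "date_crawled": ["2016-03-24 11:52:17"] * 4,
    "registration_year": [1993, 1000, 2016, 2017],
})
kept, invalid_pct = filter_registration_year(toy)
print(kept["registration_year"].tolist(), invalid_pct)
```

The lower bound of 1900 remains a judgment call; the helper only removes the need to hard-code the upper one.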

In [57]:
autos["registration_year"].value_counts(normalize=True).apply(lambda x: x * 100).sort_values(ascending=False).head(10)

2000    7.223582
1999    6.887201
2005    6.692206
2001    6.142020
2003    6.086845
2006    6.044836
2004    6.026026
2002    5.847647
1998    5.376776
2007    5.128173
Name: registration_year, dtype: float64

In [58]:
autos["registration_year"].value_counts(normalize=True).apply(lambda x: x * 100).sort_values(ascending=False).head(10).sum()

61.455312665565245

#### Observations:
- Most listings are for cars registered between 1998 and 2007; these ten years account for about 61% of the data

## 6. Explore the `brand` column

Identify each unique `brand` in the data set.

In [59]:
brands = autos["brand"].unique()
brands

array(['volkswagen', 'jeep', 'skoda', 'bmw', 'peugeot', 'ford', 'mazda',
       'renault', 'mercedes_benz', 'seat', 'honda', 'fiat', 'opel',
       'mini', 'smart', 'hyundai', 'sonstige_autos', 'audi', 'nissan',
       'alfa_romeo', 'subaru', 'volvo', 'mitsubishi', 'kia', 'suzuki',
       'lancia', 'citroen', 'toyota', 'chevrolet', 'dacia', 'daihatsu',
       'trabant', 'chrysler', 'jaguar', 'daewoo', 'porsche', 'rover',
       'saab', 'land_rover', 'lada'], dtype=object)

Select 20 brands with the highest number of offers

In [60]:
top20 = autos["brand"].value_counts().index[:20]

Retrieve the average price for the top 20 car brands

In [75]:
avg_price_by_brand = {}

for brand in top20:
    # Mean price of all listings for this brand
    selected_brand = autos[autos["brand"] == brand]
    avg_price_by_brand[brand] = round(selected_brand["price"].mean(), 2)

# Sort once, after the loop, from most to least expensive
sorted_prices = sorted(avg_price_by_brand.items(), key=lambda x: x[1], reverse=True)

for brand, avg_price in sorted_prices:
    print(brand, ':', avg_price)

mini : 8243.06
bmw : 5714.7
audi : 5693.2
skoda : 5579.09
mercedes_benz : 5296.44
hyundai : 4847.2
toyota : 4591.92
volkswagen : 4151.89
volvo : 3892.79
seat : 3866.49
nissan : 3816.81
smart : 3559.53
mazda : 3542.57
citroen : 3468.14
peugeot : 3083.21
ford : 3017.07
mitsubishi : 2834.15
fiat : 2744.63
opel : 2728.23
renault : 2288.74


In [76]:
sorted_prices_series = pd.Series(sorted_prices)
sorted_prices_series

0              (mini, 8243.06)
1                (bmw, 5714.7)
2               (audi, 5693.2)
3             (skoda, 5579.09)
4     (mercedes_benz, 5296.44)
5            (hyundai, 4847.2)
6            (toyota, 4591.92)
7        (volkswagen, 4151.89)
8             (volvo, 3892.79)
9              (seat, 3866.49)
10           (nissan, 3816.81)
11            (smart, 3559.53)
12            (mazda, 3542.57)
13          (citroen, 3468.14)
14          (peugeot, 3083.21)
15             (ford, 3017.07)
16       (mitsubishi, 2834.15)
17             (fiat, 2744.63)
18             (opel, 2728.23)
19          (renault, 2288.74)
dtype: object
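
The loop above can also be expressed with `groupby`, which returns a Series indexed by brand instead of a Series of tuples. This is a minimal sketch on toy data; `mean_price_top_brands` is a hypothetical helper name.

```python
import pandas as pd

def mean_price_top_brands(df, n=20):
    """Mean price of the n most-listed brands, highest first."""
    top = df["brand"].value_counts().index[:n]
    return (
        df[df["brand"].isin(top)]
        .groupby("brand")["price"]
        .mean()
        .round(2)
        .sort_values(ascending=False)
    )

# Toy data, not the real listings
toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel", "opel", "lada"],
    "price": [6000, 5000, 2000, 3000, 2500, 900],
})
top2 = mean_price_top_brands(toy, n=2)
print(top2)
```

A brand-indexed Series is easier to plot or join with other per-brand aggregates than a list of `(brand, price)` tuples.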

#### Observations:
- Mini has the highest average price (8,243.06) among the 20 most-listed brands