## <p style="background-color:#033E3E; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Handling Missing Values</p>

**In this study, we continue from the data set we cleaned earlier. Here, we fill in the missing data and make the final decisions about the columns.**
**Note: The dataset was collected in 2019.**

## <p style="background-color:#033E3E; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Importing Libraries Needed in This Notebook</p>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")

pd.set_option('display.float_format', lambda x: '%.2f' % x)

## <p style="background-color:#033E3E; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Dataset Overview</p>

In [2]:
df = pd.read_csv("clean_scout.csv")
df.head(3).T

Unnamed: 0,0,1,2
url,https://www.autoscout24.com//offers/audi-a1-sp...,https://www.autoscout24.com//offers/audi-a1-1-...,https://www.autoscout24.com//offers/audi-a1-sp...
make_model,Audi A1,Audi A1,Audi A1
short_description,Sportback 1.4 TDI S-tronic Xenon Navi Klima,1.8 TFSI sport,Sportback 1.6 TDI S tronic Einparkhilfe plus+m...
body_type,Sedans,Sedans,Sedans
price,15770,14500,14640
vat,VAT deductible,Price negotiable,VAT deductible
km,56013.00,80000.00,83450.00
type,Used,Used,Used
next_inspection,06/2021,,
inspection_new,Yes,,


In [3]:
df.shape

(15919, 42)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 42 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   url                      15919 non-null  object 
 1   make_model               15919 non-null  object 
 2   short_description        15873 non-null  object 
 3   body_type                15859 non-null  object 
 4   price                    15919 non-null  int64  
 5   vat                      11406 non-null  object 
 6   km                       14895 non-null  float64
 7   type                     15917 non-null  object 
 8   next_inspection          3535 non-null   object 
 9   inspection_new           3932 non-null   object 
 10  warranty                 10499 non-null  object 
 11  full_service             8215 non-null   object 
 12  non_smoking_vehicle      7177 non-null   object 
 13  first_registration       14322 non-null  float64
 14  body_color            

In [5]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
price,15919.0,18019.9,7386.17,13.0,12850.0,16900.0,21900.0,74600.0
km,14895.0,34130.13,37352.98,0.0,5153.0,22740.0,49371.5,317000.0
first_registration,14322.0,2017.46,1.08,2016.0,2016.0,2018.0,2018.0,2019.0
nr_of_doors,15707.0,4.66,0.65,1.0,4.0,5.0,5.0,7.0
nr_of_seats,14942.0,4.95,0.49,2.0,5.0,5.0,5.0,7.0
displacement,15423.0,1423.54,333.53,1.0,1229.0,1461.0,1598.0,16000.0
cylinders,10239.0,3.8,0.42,1.0,4.0,4.0,4.0,8.0
weight,8945.0,1351.11,220.66,1.0,1165.0,1288.0,1487.0,2471.0
co2_emission,13483.0,117.95,20.24,0.0,104.0,116.0,129.0,990.0
gears,11207.0,5.92,0.85,1.0,5.0,6.0,6.0,50.0


In [6]:
df.describe(include="object").T

Unnamed: 0,count,unique,top,freq
url,15919,15919,https://www.autoscout24.com//offers/audi-a1-sp...,1
make_model,15919,9,Audi A3,3097
short_description,15873,10001,SPB 1.6 TDI 116 CV S tronic Sport,64
body_type,15859,9,Sedans,7903
vat,11406,2,VAT deductible,10980
type,15917,5,Used,11096
next_inspection,3535,77,06/2021,471
inspection_new,3932,1,Yes,3932
warranty,10499,515,"['\n', '\n', '\nEuro 6\n']",1868
full_service,8215,122,"['\n', '\n', '\n4 (Green)\n']",2235


In [7]:
df.isnull().sum()

url                            0
make_model                     0
short_description             46
body_type                     60
price                          0
vat                         4513
km                          1024
type                           2
next_inspection            12384
inspection_new             11987
warranty                    5420
full_service                7704
non_smoking_vehicle         8742
first_registration          1597
body_color                   597
paint_type                  5772
nr_of_doors                  212
nr_of_seats                  977
gearing_type                   0
displacement                 496
cylinders                   5680
weight                      6974
drive_chain                 6858
fuel                           0
co2_emission                2436
emission_class              3628
comfort_and_convenience      920
entertainment_and_media     1374
extras                      2962
safety_and_security          982
descriptio

In [8]:
df.isnull().sum() * 100 / df.shape[0]

url                        0.00
make_model                 0.00
short_description          0.29
body_type                  0.38
price                      0.00
vat                       28.35
km                         6.43
type                       0.01
next_inspection           77.79
inspection_new            75.30
warranty                  34.05
full_service              48.39
non_smoking_vehicle       54.92
first_registration        10.03
body_color                 3.75
paint_type                36.26
nr_of_doors                1.33
nr_of_seats                6.14
gearing_type               0.00
displacement               3.12
cylinders                 35.68
weight                    43.81
drive_chain               43.08
fuel                       0.00
co2_emission              15.30
emission_class            22.79
comfort_and_convenience    5.78
entertainment_and_media    8.63
extras                    18.61
safety_and_security        6.17
description                4.00
emission
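The counts and percentages from the two cells above can also be read side by side. The following is a minimal sketch, not part of the original notebook: `missing_report` is a hypothetical helper name, and the small `toy` frame only stands in for the real data.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts * 100 / len(frame)).round(2),
    })
    # Keep only the columns that actually have gaps.
    report = report[report["missing"] > 0]
    return report.sort_values("missing", ascending=False)


# Toy frame (assumption): three columns, two of them with gaps.
toy = pd.DataFrame({
    "make_model": ["Audi A1", "Audi A1", "Opel Astra", "Opel Astra"],
    "km": [56013.0, np.nan, 83450.0, np.nan],
    "vat": ["VAT deductible", None, None, None],
})
report = missing_report(toy)
print(report)
```

On the real data, `missing_report(df)` would give the same numbers as the two cells above in a single table.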

In [9]:
missing_values = [column for column in df.columns if df[column].isnull().any()]

print("Columns that have missing values :")

missing_values

Columns that have missing values :


['short_description',
 'body_type',
 'vat',
 'km',
 'type',
 'next_inspection',
 'inspection_new',
 'warranty',
 'full_service',
 'non_smoking_vehicle',
 'first_registration',
 'body_color',
 'paint_type',
 'nr_of_doors',
 'nr_of_seats',
 'displacement',
 'cylinders',
 'weight',
 'drive_chain',
 'co2_emission',
 'emission_class',
 'comfort_and_convenience',
 'entertainment_and_media',
 'extras',
 'safety_and_security',
 'description',
 'emission_label',
 'gears',
 'country_version',
 'previous_owners',
 'hp_kw',
 'warranty_new',
 'upholstery_type',
 'upholstery_color',
 'cons_comb',
 'cons_city',
 'cons_country']

#### <p style="background-color:#033E3E; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:left; border-radius:10px 10px;">Functions</p>

In [10]:
# Prints a summary of a column: missing values, unique values, and value counts
def summary(field) :
    print("Column name              : ", field)
    print("--------------------------------")
    print("Total missing value      : ", df[field].isnull().sum())
    print("Percentage of missing    : ", round(df[field].isnull().sum()*100 / df.shape[0], 2)  )
    print("Number of unique values  : ", df[field].nunique() )
    print()
    print("Unique values  : ", df[field].unique(), sep="\n")
    print()
    print("Number of values", df[field].value_counts(dropna=False), sep="\n")

In [11]:
# Fills missing values with the group mode, using one-, two-, or three-level grouping
def fill_most_freq(df, group_col, col_name):
    if len(group_col) == 1 :
        for group_member in df[group_col[0]].unique():
            cond = df[group_col[0]] == group_member
            # Keep the full list of modes; indexing [0] here would fail on an empty group
            mode = list(df[cond][col_name].mode())
            if mode != []:
                df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[cond][col_name].mode()[0])
            else:
                df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[col_name].mode()[0])
    
    elif len(group_col) == 2 :
        for group_member1 in df[group_col[0]].unique(): 
                for group_member2 in df[group_col[1]].unique(): 
                    cond1 = df[group_col[0]] == group_member1
                    cond2 = ( df[group_col[0]] == group_member1 ) & ( df[group_col[1]] == group_member2 )
                    mode1 = list(df[cond1][col_name].mode())
                    mode2 = list(df[cond2][col_name].mode()) 
                    if mode2 != []:
                        df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[cond2][col_name].mode()[0])
                    elif mode1 != []:
                        df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[cond1][col_name].mode()[0])
                    else:
                        df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[col_name].mode()[0])
                        
    elif len(group_col) == 3 :
        for group_member1 in df[group_col[0]].unique(): 
                for group_member2 in df[group_col[1]].unique(): 
                    for group_member3 in df[group_col[2]].unique(): 
                        cond1 = df[group_col[0]] == group_member1
                        cond2 = ( df[group_col[0]] == group_member1 ) & ( df[group_col[1]] == group_member2 )
                        cond3 = ( df[group_col[0]] == group_member1 ) & ( df[group_col[1]] == group_member2 ) & ( df[group_col[2]] == group_member3 )
                        mode1 = list(df[cond1][col_name].mode())
                        mode2 = list(df[cond2][col_name].mode()) 
                        mode3 = list(df[cond3][col_name].mode()) 
                        if mode3 != []:
                            df.loc[cond3, col_name] = df.loc[cond3, col_name].fillna(df[cond3][col_name].mode()[0])
                        elif mode2 != []:
                            df.loc[cond3, col_name] = df.loc[cond3, col_name].fillna(df[cond2][col_name].mode()[0])
                        elif mode1 != []:
                            df.loc[cond3, col_name] = df.loc[cond3, col_name].fillna(df[cond1][col_name].mode()[0])
                        else:
                            df.loc[cond3, col_name] = df.loc[cond3, col_name].fillna(df[col_name].mode()[0])
                            
    print("Number of NaN : ",df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))                            
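The nested loops in `fill_most_freq` can also be expressed with `groupby(...).transform(...)`. The sketch below is an alternative written for illustration, not the notebook's own function: `first_mode` and `fill_mode_by_groups` are hypothetical names, the toy frame is made up, and the function returns a new Series instead of modifying `df`.

```python
import numpy as np
import pandas as pd


def first_mode(values: pd.Series):
    """Return the most frequent non-null value, or NaN if there is none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else np.nan


def fill_mode_by_groups(frame, group_cols, col_name):
    """Fill NaNs in col_name with the group mode, falling back to coarser groups."""
    filled = frame[col_name].copy()
    # Try the finest grouping first, then drop the last key on each round.
    for depth in range(len(group_cols), 0, -1):
        keys = group_cols[:depth]
        group_mode = frame.groupby(keys)[col_name].transform(first_mode)
        filled = filled.fillna(group_mode)
    # Last resort: the mode of the whole column.
    return filled.fillna(first_mode(frame[col_name]))


# Toy data (assumption): the ("Audi A1", "Coupe") group has no known value,
# so it falls back to the mode of all "Audi A1" rows.
toy = pd.DataFrame({
    "make_model": ["Audi A1", "Audi A1", "Audi A1", "Audi A1", "Opel Astra", "Opel Astra"],
    "body_type": ["Sedans", "Sedans", "Sedans", "Coupe", "Van", "Van"],
    "nr_of_doors": [5.0, 5.0, np.nan, np.nan, 4.0, np.nan],
})
filled = fill_mode_by_groups(toy, ["make_model", "body_type"], "nr_of_doors")
print(filled)
```

The fallback order (finest group, then coarser groups, then the whole column) follows the same idea as the loops above; the group modes here are always computed from the original, unfilled values.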

In [12]:
def fill(df, group_col1, group_col2, col_name, method): # method can be "mode", "mean", "median", or "ffill"
    
    '''Fills missing values with the mode, mean, median, or ffill (then bfill) method according to two-level grouping'''
    
    if method == "mode":
        for group1 in list(df[group_col1].unique()):
            for group2 in list(df[group_col2].unique()):
                cond1 = df[group_col1]==group1
                cond2 = (df[group_col1]==group1) & (df[group_col2]==group2)
                mode1 = list(df[cond1][col_name].mode())
                mode2 = list(df[cond2][col_name].mode())
                if mode2 != []:
                    df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[cond2][col_name].mode()[0])
                elif mode1 != []:
                    df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[cond1][col_name].mode()[0])
                else:
                    df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[col_name].mode()[0])

    elif method == "mean":
        df[col_name].fillna(df.groupby([group_col1, group_col2])[col_name].transform("mean"), inplace = True)
        df[col_name].fillna(df.groupby(group_col1)[col_name].transform("mean"), inplace = True)
        df[col_name].fillna(df[col_name].mean(), inplace = True)
        
    elif method == "median":
        df[col_name].fillna(df.groupby([group_col1, group_col2])[col_name].transform("median"), inplace = True)
        df[col_name].fillna(df.groupby(group_col1)[col_name].transform("median"), inplace = True)
        df[col_name].fillna(df[col_name].median(), inplace = True)
        
    elif method == "ffill":           
        for group1 in list(df[group_col1].unique()):
            for group2 in list(df[group_col2].unique()):
                cond2 = (df[group_col1]==group1) & (df[group_col2]==group2)
                # Series.ffill()/bfill() replace the deprecated fillna(method=...) form
                df.loc[cond2, col_name] = df.loc[cond2, col_name].ffill().bfill()
                
        for group1 in list(df[group_col1].unique()):
            cond1 = df[group_col1]==group1
            df.loc[cond1, col_name] = df.loc[cond1, col_name].ffill().bfill()
           
        df[col_name] = df[col_name].ffill().bfill()
    
    print("Number of NaN : ",df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))

In [13]:
def fill_median(df, group_col1, group_col2, group_col3, col_name):
    
    df[col_name].fillna(df.groupby([group_col1, group_col2, group_col3])[col_name].transform("median"), inplace = True)
    df[col_name].fillna(df.groupby([group_col1, group_col2])[col_name].transform("median"), inplace = True)
    df[col_name].fillna(df.groupby(group_col1)[col_name].transform("median"), inplace = True)
    df[col_name].fillna(df[col_name].median(), inplace = True)

    print("Number of NaN : ", df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))  

In [14]:
def fill_mean(df, group_col1, group_col2, group_col3, col_name):
    
    df[col_name].fillna(df.groupby([group_col1, group_col2, group_col3])[col_name].transform("mean"), inplace = True)
    df[col_name].fillna(df.groupby([group_col1, group_col2])[col_name].transform("mean"), inplace = True)
    df[col_name].fillna(df.groupby(group_col1)[col_name].transform("mean"), inplace = True)
    df[col_name].fillna(df[col_name].mean(), inplace = True)

    print("Number of NaN : ", df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))   
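`fill_median` and `fill_mean` differ only in the statistic they use, so the same fallback idea can be written once. This is a hedged sketch under assumptions: `fill_stat_by_groups` is a hypothetical name, the toy frame is invented, and, unlike the functions above, every group statistic is computed from the original values rather than from values already filled in an earlier step.

```python
import numpy as np
import pandas as pd


def fill_stat_by_groups(frame, group_cols, col_name, stat="median"):
    """Fill NaNs with a group statistic, moving to coarser groups as needed."""
    if stat not in ("mean", "median"):
        raise ValueError("stat must be 'mean' or 'median'")
    filled = frame[col_name].copy()
    # Finest grouping first, then drop the last key on each round.
    for depth in range(len(group_cols), 0, -1):
        group_stat = frame.groupby(group_cols[:depth])[col_name].transform(stat)
        filled = filled.fillna(group_stat)
    # Last resort: the statistic of the whole column.
    overall = getattr(frame[col_name], stat)()
    return filled.fillna(overall)


# Toy data (assumption): one group with a known median, one group that needs
# the model-level fallback, and one model with no known value at all.
toy = pd.DataFrame({
    "make_model": ["Audi A1"] * 5 + ["Opel Astra"],
    "body_type": ["Sedans", "Sedans", "Sedans", "Compact", "Van", "Sedans"],
    "km": [10.0, 30.0, np.nan, 80.0, np.nan, np.nan],
})
by_median = fill_stat_by_groups(toy, ["make_model", "body_type"], "km", "median")
by_mean = fill_stat_by_groups(toy, ["make_model", "body_type"], "km", "mean")
print(by_median.tolist())
print(by_mean.tolist())
```

Returning a new Series instead of calling `fillna(..., inplace=True)` on a selected column is a design choice made here because newer pandas versions discourage that in-place pattern.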

## <p style="background-color:#033E3E; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">The Examination of Missing Values in the Columns</p>

**At first glance, a few columns stand out as key features: "make_model", "body_type", and "first_registration". Much of the missing-value filling below is organized around these columns.**

## 1. body_type 

In [15]:
summary("body_type")

Column name              :  body_type
--------------------------------
Total missing value      :  60
Percentage of missing    :  0.38
Number of unique values  :  9

Unique values  : 
['Sedans' 'Station wagon' 'Compact' 'Other' 'Coupe' 'Van' 'Off-Road'
 'Convertible' nan 'Transporter']

Number of values
Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
NaN                60
Off-Road           56
Coupe              25
Convertible         8
Name: body_type, dtype: int64


**Some rows carry the label "Other", which says nothing about the actual body type. We can treat these as null values.**

In [16]:
df.body_type.replace("Other", np.nan, inplace=True) 

In [17]:
df['body_type'].value_counts(dropna=False)

Sedans           7903
Station wagon    3553
Compact          3153
Van               783
NaN               350
Transporter        88
Off-Road           56
Coupe              25
Convertible         8
Name: body_type, dtype: int64

**We know that the "body_type" column is related to the make and model of the vehicle, so we can fill in the missing values according to the "make_model" column.**

**The "nr_of_doors" and "short_description" columns were also examined for a more precise fill, but no strong pattern was observed.**

**On the other hand, examining "body_type" within each "make_model" shows how the body types are distributed. Since the total number of missing values is small, it is reasonable to fill them with the mode of each group.**
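One way to look at that distribution is a row-normalized cross-tabulation. The snippet below is a small sketch on made-up rows (the `toy` frame and the names `shares` and `dominant` are assumptions for illustration), showing the share of each body type within a model and the dominant one that a mode-based fill would pick.

```python
import numpy as np
import pandas as pd

# Toy rows (assumption) standing in for the real listings.
toy = pd.DataFrame({
    "make_model": ["Audi A1", "Audi A1", "Audi A1", "Opel Astra", "Opel Astra", "Opel Astra"],
    "body_type": ["Sedans", "Sedans", "Compact", "Station wagon", "Station wagon", np.nan],
})

# Only rows with a known body type contribute to the shares.
known = toy.dropna(subset=["body_type"])

# Share of each body type within a model; every row sums to 1.
shares = pd.crosstab(known["make_model"], known["body_type"], normalize="index")

# The most frequent body type per model, i.e. what a mode fill would use.
dominant = shares.idxmax(axis=1)

print(shares.round(2))
print(dominant)
```

When the top share for a model is far from 1, as with the Sedans/Compact split seen for some models below, a mode fill is only an approximation; it is acceptable here mainly because so few rows are affected.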

In [18]:
df[df.body_type.isnull() == True][["make_model","short_description"]][0:50]

Unnamed: 0,make_model,short_description
744,Audi A1,SPORTBACK S line
1764,Audi A1,"Sportback sport 1.0 TFSI UPE 22.800,--"
1793,Audi A1,SPB 1.6 TDI 116 CV Metal plus
1819,Audi A1,SPB 1.6 TDI 116 CV Metal plus
2047,Audi A1,SPB 30 TFSI S tronic S line edition
2078,Audi A1,SPB 30 TFSI S tronic Admired
2104,Audi A1,SPB 30 TFSI 116 CV S tronic Admired
2256,Audi A1,Sportback 25 TFSI
2475,Audi A1,Sportback 25 TFSI
2522,Audi A1,Sportback 25 TFSI


In [19]:
df[(df.make_model == "Audi A1") ][["short_description","body_type"]].sample(10)

Unnamed: 0,short_description,body_type
876,Dsl 1.4 TDi ultra,Sedans
780,SPB 1.4 TDI S tronic Sport,Sedans
2581,SPB 30 TFSI - Pra.65884,Sedans
2448,Sportback 25 TFSI PDC LM Tempo Klima,Sedans
605,1.4 TDI SPB audi concert Display neopatentati,Compact
1196,1.6 TDI Sportback,Sedans
1994,advanced 30 TFSI STRONIC NAVI LED APS SHZ,Compact
1475,1.6 TDi,Sedans
1772,1.0 TFSI Adrenalin 70 kW (95 CV),Compact
1669,Sportback 1.0 TFSI AUT 5-DRS Pro Line (AC/NAV/...,Compact


In [20]:
df[(df.make_model == "Opel Astra") ][["short_description","body_type"]].sample(10)

Unnamed: 0,short_description,body_type
7750,"ST 1,6 CDTI Ultimate S/S Aut.",Station wagon
8027,K 1.0 Turbo Active S/S Sitzheizung,Sedans
7073,K Sports Tourer Innovation+,Station wagon
8044,1.6CDTi S/S Selective Pro 110,Compact
6511,J 1.6 CDTI Style ecoFlex Start/Stop,Station wagon
7358,1.0 Turbo ECOTEC Edition Start/Stop,Sedans
5712,Edition K Sports Tourer Automatik Navi 1 Hand,Station wagon
7159,"1.4, Autom, 1. Hand, Navi, Spurhalteassist",Station wagon
7700,1.6 D 136ch Innovation Automatique Euro6d-T,Sedans
5755,ST 1.6CDTi Excellence Aut. 136 (4.75),Station wagon


In [21]:
df.groupby("nr_of_doors")["body_type"].value_counts()

nr_of_doors  body_type    
1.00         Compact             1
2.00         Compact           157
             Sedans             50
             Convertible         7
             Coupe               1
             Station wagon       1
3.00         Compact           429
             Sedans            371
             Coupe              17
             Transporter         4
             Station wagon       2
4.00         Sedans           1403
             Station wagon     902
             Compact           658
             Van                90
             Coupe               2
             Off-Road            1
             Transporter         1
5.00         Sedans           5958
             Station wagon    2614
             Compact          1870
             Van               681
             Transporter        82
             Off-Road           54
             Coupe               4
             Convertible         1
7.00         Van                 1
Name: body_type, dtype: int6

In [22]:
df[df.body_type.isnull() == True]["make_model"].value_counts()

Opel Corsa        89
Opel Astra        74
Renault Clio      66
Opel Insignia     51
Renault Espace    34
Audi A3           23
Audi A1           13
Name: make_model, dtype: int64

In [23]:
df.groupby("make_model")["body_type"].value_counts()

make_model      body_type    
Audi A1         Sedans           1538
                Compact          1039
                Station wagon      21
                Coupe               2
                Van                 1
Audi A2         Off-Road            1
Audi A3         Sedans           2598
                Station wagon     282
                Compact           182
                Convertible         8
                Coupe               4
Opel Astra      Station wagon    1211
                Sedans           1053
                Compact           185
                Coupe               2
                Off-Road            1
Opel Corsa      Compact          1230
                Sedans            875
                Coupe              13
                Transporter         7
                Off-Road            3
                Van                 2
Opel Insignia   Station wagon    1611
                Sedans            900
                Compact            27
                Off-

**So we will fill the missing values with the most frequent "body_type" within each "make_model".**

In [24]:
# Please check the relevant functions in the Function section.
fill_most_freq(df, ["make_model"], "body_type")

Number of NaN :  0
------------------
Sedans           8005
Station wagon    3678
Compact          3242
Van               817
Transporter        88
Off-Road           56
Coupe              25
Convertible         8
Name: body_type, dtype: int64


In [25]:
missing_values

['short_description',
 'body_type',
 'vat',
 'km',
 'type',
 'next_inspection',
 'inspection_new',
 'warranty',
 'full_service',
 'non_smoking_vehicle',
 'first_registration',
 'body_color',
 'paint_type',
 'nr_of_doors',
 'nr_of_seats',
 'displacement',
 'cylinders',
 'weight',
 'drive_chain',
 'co2_emission',
 'emission_class',
 'comfort_and_convenience',
 'entertainment_and_media',
 'extras',
 'safety_and_security',
 'description',
 'emission_label',
 'gears',
 'country_version',
 'previous_owners',
 'hp_kw',
 'warranty_new',
 'upholstery_type',
 'upholstery_color',
 'cons_comb',
 'cons_city',
 'cons_country']

## 2. vat

In [26]:
summary("vat")

Column name              :  vat
--------------------------------
Total missing value      :  4513
Percentage of missing    :  28.35
Number of unique values  :  2

Unique values  : 
['VAT deductible' 'Price negotiable' nan]

Number of values
VAT deductible      10980
NaN                  4513
Price negotiable      426
Name: vat, dtype: int64


In [27]:
df["vat"].fillna("-", inplace = True)

In [28]:
df.groupby(["make_model", "body_type", "vat"]).price.describe()

make_model,body_type,vat,count,mean,std,min,25%,50%,75%,max
Audi A1,Compact,-,257.0,16692.86,3377.35,11100.0,14990.0,15850.0,17900.0,29181.0
Audi A1,Compact,Price negotiable,3.0,17631.67,1548.95,15950.0,16947.5,17945.0,18472.5,19000.0
Audi A1,Compact,VAT deductible,779.0,20019.0,4603.19,9950.0,16370.0,19990.0,22730.0,31990.0
Audi A1,Coupe,VAT deductible,2.0,14925.0,1378.86,13950.0,14437.5,14925.0,15412.5,15900.0
Audi A1,Sedans,-,498.0,17650.87,4225.47,8999.0,14819.0,16495.0,20300.0,37900.0
Audi A1,Sedans,Price negotiable,78.0,16224.31,3545.63,10800.0,13912.5,15299.5,18112.5,33900.0
Audi A1,Sedans,VAT deductible,975.0,19370.28,4470.52,10000.0,15950.0,18900.0,22315.0,35900.0
Audi A1,Station wagon,-,4.0,22332.0,7763.74,13999.0,16579.75,23165.0,28917.25,28999.0
Audi A1,Station wagon,VAT deductible,17.0,16747.71,2481.1,12950.0,15750.0,16290.0,16880.0,21450.0
Audi A1,Van,VAT deductible,1.0,29000.0,,29000.0,29000.0,29000.0,29000.0,29000.0


**When the "vat" values are analyzed by "make_model" and "body_type", no strong pattern is found.**

**This column contains too many missing values. Filling them would be a heavy manipulation that could directly affect the price analysis.**

**We can drop this column because of both the number of missing values and the amount of manipulation it would need.**
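The comparison behind this decision can be condensed into a small pivot table. This is an illustrative sketch on invented prices, not the notebook's data: the `toy` frame, the `"missing"` placeholder label, and the name `median_price` are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy listings (assumption); None marks a missing "vat" value.
toy = pd.DataFrame({
    "make_model": ["Audi A1"] * 4 + ["Opel Astra"] * 3,
    "vat": ["VAT deductible", "VAT deductible", None, "Price negotiable",
            "VAT deductible", None, None],
    "price": [20000, 22000, 16000, 15000, 18000, 17000, 19000],
})

# Label the gaps so they form their own column in the pivot,
# without changing the original frame.
labelled = toy.assign(vat=toy["vat"].fillna("missing"))

# Median price per model for each "vat" status.
median_price = labelled.pivot_table(
    index="make_model", columns="vat", values="price", aggfunc="median"
)
print(median_price)
```

If the `"missing"` column consistently tracked one of the known categories across models, that category would be a natural fill value; when it does not, as concluded above, dropping the column is the more cautious choice.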

In [29]:
df.drop("vat", axis=1, inplace=True)

## 3. nr_of_doors
**This column is associated with the make, model, and body type of the vehicle.**

In [30]:
summary("nr_of_doors")

Column name              :  nr_of_doors
--------------------------------
Total missing value      :  212
Percentage of missing    :  1.33
Number of unique values  :  6

Unique values  : 
[ 5.  3.  4.  2. nan  1.  7.]

Number of values
5.00    11575
4.00     3079
3.00      832
2.00      219
NaN       212
1.00        1
7.00        1
Name: nr_of_doors, dtype: int64


In [31]:
df.groupby(["make_model","body_type"])["nr_of_doors"].value_counts(dropna=False)[0:50]

make_model  body_type      nr_of_doors
Audi A1     Compact        5.00            666
                           4.00            207
                           3.00             80
                           2.00             69
                           NaN              17
            Coupe          2.00              1
                           5.00              1
            Sedans         5.00           1056
                           4.00            326
                           3.00            130
                           2.00             29
                           NaN              10
            Station wagon  5.00             17
                           4.00              3
                           3.00              1
            Van            5.00              1
Audi A2     Off-Road       5.00              1
Audi A3     Compact        5.00            164
                           3.00             11
                           4.00              6
                     

**There are few missing values within each "make_model" and "body_type" group, so we can fill each group with its mode.**

In [32]:
fill_most_freq(df, ["make_model","body_type"], "nr_of_doors")

Number of NaN :  0
------------------
5.00    11787
4.00     3079
3.00      832
2.00      219
1.00        1
7.00        1
Name: nr_of_doors, dtype: int64


## 4. nr_of_seats
**We can repeat the same filling process used for "nr_of_doors".**

In [33]:
summary("nr_of_seats")

Column name              :  nr_of_seats
--------------------------------
Total missing value      :  977
Percentage of missing    :  6.14
Number of unique values  :  6

Unique values  : 
[ 5.  4. nan  6.  3.  2.  7.]

Number of values
5.00    13336
4.00     1125
NaN       977
7.00      362
2.00      116
6.00        2
3.00        1
Name: nr_of_seats, dtype: int64


In [34]:
fill_most_freq(df, ["make_model","body_type"], "nr_of_seats")

Number of NaN :  0
------------------
5.00    14308
4.00     1127
7.00      362
2.00      119
6.00        2
3.00        1
Name: nr_of_seats, dtype: int64


## 5. emission_class

In [35]:
summary("emission_class")

Column name              :  emission_class
--------------------------------
Total missing value      :  3628
Percentage of missing    :  22.79
Number of unique values  :  3

Unique values  : 
['Euro 6' nan 'Euro 5' 'Euro 4']

Number of values
Euro 6    12173
NaN        3628
Euro 5       78
Euro 4       40
Name: emission_class, dtype: int64


**We can fill in missing data using the "short_description", "warranty", "full_service", "non_smoking_vehicle", and "description" columns, because some emission-class values ended up in these columns.**
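The idea is to search each free-text column for a pattern such as "Euro 6" and use the match only where "emission_class" is still empty. A minimal sketch on made-up rows follows; the `toy` frame and the names `pattern`, `source_cols`, and `filled` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows (assumption): the class is sometimes hidden in another column.
toy = pd.DataFrame({
    "emission_class": ["Euro 6", np.nan, np.nan, np.nan],
    "short_description": ["1.4 TDI", "1.0 TFSI Euro 6", "1.6 TDI", "1.6 TDI"],
    "warranty": [np.nan, np.nan, "\nEuro 5\n", np.nan],
})

pattern = r"(Euro \d)"  # raw string, so \d reaches the regex engine unchanged
source_cols = ["short_description", "warranty"]

filled = toy["emission_class"].copy()
for col in source_cols:
    # astype(str) turns NaN into the text "nan", which the pattern never matches.
    extracted = toy[col].astype(str).str.extract(pattern)[0]
    # fillna only touches rows that are still empty, so earlier sources win.
    filled = filled.fillna(extracted)

print(filled)
```

Rows where no source column mentions a class stay empty and are left for a later group-based fill.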

In [36]:
df.short_description.str.extract(r"(Euro \d)")[0]

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ... 
15914    NaN
15915    NaN
15916    NaN
15917    NaN
15918    NaN
Name: 0, Length: 15919, dtype: object

In [37]:
df.emission_class.fillna( df.short_description.str.extract(r"(Euro \d)")[0] , inplace=True)
df.emission_class.value_counts(dropna=False)

Euro 6    12185
NaN        3616
Euro 5       78
Euro 4       40
Name: emission_class, dtype: int64

In [38]:
df.emission_class.fillna( df.warranty.astype("str").str.extract(r"(Euro \d)")[0], inplace=True)
df.emission_class.value_counts(dropna=False)

Euro 6    12310
NaN        3491
Euro 5       78
Euro 4       40
Name: emission_class, dtype: int64

In [39]:
df.emission_class.fillna( df.full_service.astype("str").str.extract(r"(Euro \d)")[0] ,inplace=True)
df.emission_class.value_counts(dropna=False)

Euro 6    12737
NaN        3064
Euro 5       78
Euro 4       40
Name: emission_class, dtype: int64

In [40]:
df.emission_class.fillna( df.non_smoking_vehicle.astype("str").str.extract(r"(Euro \d)")[0], inplace=True)
df.emission_class.value_counts(dropna=False)

Euro 6    12857
NaN        2944
Euro 5       78
Euro 4       40
Name: emission_class, dtype: int64

In [41]:
df.emission_class.fillna( df.description.astype("str").str.extract(r"(Euro \d)")[0], inplace=True)
df.emission_class.value_counts(dropna=False)

Euro 6    13006
NaN        2790
Euro 5       79
Euro 4       40
Euro 3        2
Euro 2        1
Euro 1        1
Name: emission_class, dtype: int64

**After the last fill, the values Euro 3, Euro 2, and Euro 1 appeared, 4 rows in total.
We will set these values to null to reduce complexity for ML.**
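The same clean-up can be phrased as a general rule: categories seen fewer than some number of times are set to null. The sketch below is hypothetical; `rare_to_nan` and the `min_count` threshold are illustrative names and values, not something the notebook defines.

```python
import numpy as np
import pandas as pd


def rare_to_nan(values: pd.Series, min_count: int) -> pd.Series:
    """Replace categories that occur fewer than min_count times with NaN."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # where() keeps entries whose condition is True and blanks the rest.
    return values.where(~values.isin(rare))


# Toy column (assumption): two categories that each appear only once.
toy = pd.Series(["Euro 6"] * 5 + ["Euro 5"] * 3 + ["Euro 3", "Euro 1", np.nan])
cleaned = rare_to_nan(toy, min_count=3)
print(cleaned.value_counts(dropna=False))
```

An explicit list, as in the next cell, is easier to read when only a handful of known labels are involved; a count-based rule is mainly useful when the rare labels are not known in advance.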

In [42]:
df.emission_class.replace(["Euro 3", "Euro 2", "Euro 1"], np.nan, inplace=True)

In [43]:
df.groupby(["make_model","body_type"])["emission_class"].value_counts(dropna=False)

make_model      body_type      emission_class
Audi A1         Compact        Euro 6             939
                               NaN                 94
                               Euro 5               5
                               Euro 4               1
                Coupe          NaN                  1
                               Euro 6               1
                Sedans         Euro 6            1232
                               NaN                314
                               Euro 5               5
                Station wagon  Euro 6              19
                               NaN                  2
                Van            NaN                  1
Audi A2         Off-Road       Euro 6               1
Audi A3         Compact        Euro 6             131
                               NaN                 51
                Convertible    Euro 6               7
                               NaN                  1
                Coupe          Euro 

In [44]:
fill_most_freq(df, ["make_model","body_type"], "emission_class")

Number of NaN :  0
------------------
Euro 6    15798
Euro 5       79
Euro 4       42
Name: emission_class, dtype: int64


## 6. emission_label

In [45]:
summary("emission_label")

Column name              :  emission_label
--------------------------------
Total missing value      :  11974
Percentage of missing    :  75.22
Number of unique values  :  5

Unique values  : 
[nan '4 (Green)' '1 (No sticker)' '5 (Blue)' '3 (Yellow)' '2 (Red)']

Number of values
NaN               11974
4 (Green)          3553
1 (No sticker)      381
5 (Blue)              8
3 (Yellow)            2
2 (Red)               1
Name: emission_label, dtype: int64


In [46]:
df.warranty.astype("str").str.extract(r"(\d \((?:Blue|Green|Yellow|Red|No sticker)\))")[0].value_counts(dropna=False)

NaN               15207
4 (Green)           683
1 (No sticker)       21
5 (Blue)              8
Name: 0, dtype: int64

In [47]:
# One capturing group for the whole label, e.g. "4 (Green)"
label_pattern = r"(\d \((?:Blue|Green|Yellow|Red|No sticker)\))"

# Fill from each free-text column in turn; earlier columns take priority
for source_col in ["short_description", "warranty", "full_service", "non_smoking_vehicle", "description"]:
    df["emission_label"] = df["emission_label"].fillna(df[source_col].astype("str").str.extract(label_pattern)[0])


In [48]:
df.emission_label.value_counts(dropna=False)

NaN               7984
4 (Green)         7456
1 (No sticker)     435
5 (Blue)            41
3 (Yellow)           2
2 (Red)              1
Name: emission_label, dtype: int64

In [49]:
df.groupby(["emission_class"])["emission_label"].value_counts(dropna=False)

emission_class  emission_label
Euro 4          NaN                 37
                4 (Green)            5
Euro 5          NaN                 57
                4 (Green)           20
                1 (No sticker)       2
Euro 6          NaN               7890
                4 (Green)         7431
                1 (No sticker)     433
                5 (Blue)            41
                3 (Yellow)           2
                2 (Red)              1
Name: emission_label, dtype: int64

In [50]:
df.groupby(["make_model"])["emission_label"].value_counts(dropna=False)

make_model      emission_label
Audi A1         4 (Green)         1378
                NaN               1117
                1 (No sticker)     116
                5 (Blue)             3
Audi A2         4 (Green)            1
Audi A3         NaN               1894
                4 (Green)         1002
                1 (No sticker)     199
                5 (Blue)             2
Opel Astra      4 (Green)         1431
                NaN               1058
                1 (No sticker)      28
                5 (Blue)             8
                3 (Yellow)           1
Opel Corsa      4 (Green)         1194
                NaN                980
                1 (No sticker)      38
                5 (Blue)             6
                2 (Red)              1
Opel Insignia   4 (Green)         1520
                NaN               1043
                1 (No sticker)      20
                5 (Blue)            15
Renault Clio    NaN               1176
                4 (Green)        

In [51]:
df.groupby(["make_model","emission_class"])["emission_label"].value_counts()

make_model      emission_class  emission_label
Audi A1         Euro 4          4 (Green)            1
                Euro 5          4 (Green)            3
                Euro 6          4 (Green)         1374
                                1 (No sticker)     116
                                5 (Blue)             3
Audi A2         Euro 6          4 (Green)            1
Audi A3         Euro 5          4 (Green)            1
                Euro 6          4 (Green)         1001
                                1 (No sticker)     199
                                5 (Blue)             2
Opel Astra      Euro 5          4 (Green)            5
                                1 (No sticker)       1
                Euro 6          4 (Green)         1426
                                1 (No sticker)      27
                                5 (Blue)             8
                                3 (Yellow)           1
Opel Corsa      Euro 5          4 (Green)            1
                Eu

**Even after filling from the related columns, about half of the values (7,984 of 15,919) are still missing.**

**Broken down by "make_model" and "emission_class", no group shows a distinct pattern: "4 (Green)" leads everywhere by a wide margin. Filling the remaining values would produce a nearly constant column, which would contribute little to an ML model.**

**So we can drop the column.**

In [52]:
df.drop(columns="emission_label",inplace=True)
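
The decision to drop the column can be backed with two quick numbers: the share of rows that are still missing, and the share of the dominant label among the labelled rows. The sketch below rebuilds a Series from the counts printed in In [48]; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Counts copied from the value_counts output of In [48].
counts = {"4 (Green)": 7456, "1 (No sticker)": 435, "5 (Blue)": 41,
          "3 (Yellow)": 2, "2 (Red)": 1}
n_missing = 7984

labels = pd.Series(
    [label for label, n in counts.items() for _ in range(n)] + [np.nan] * n_missing,
    name="emission_label",
)

# Share of rows without a label.
missing_share = labels.isna().mean()
# Share of the most frequent label among the non-missing rows.
dominant_share = labels.value_counts(normalize=True).iloc[0]

print(f"missing share  : {missing_share:.3f}")
print(f"dominant share : {dominant_share:.3f}")
```

Roughly half of the rows have no label, and about 94% of the labelled rows carry the same value, which is why the column adds little information.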

## 7. co2_emission

In [53]:
summary("co2_emission")

Column name              :  co2_emission
--------------------------------
Total missing value      :  2436
Percentage of missing    :  15.3
Number of unique values  :  122

Unique values  : 
[ 99.    129.    109.     92.     98.     97.        nan 105.    112.
 103.    102.     95.    104.     91.     94.    117.    123.    106.
 108.    121.    107.    101.    113.    137.    100.    116.    114.
 118.    331.    115.    119.     90.    136.    134.    110.    111.
 120.     89.    142.    126.    122.    128.    127.    138.    130.
 125.     85.    124.    152.     88.    189.    194.    149.    153.
 188.     36.      1.06   96.    990.    146.    135.    158.     12.087
 141.    172.    154.    150.    167.    174.     93.    133.    131.
 145.    147.    156.     87.      5.    148.    139.    151.    144.
 168.    160.    170.     80.    132.    155.     14.    159.      0.
 143.    140.     82.     12.324  84.    165.     51.    157.    169.
 166.    253.    164.    175.    190

In [54]:
df.groupby(["fuel", "make_model", "body_type", "co2_emission"]).price.describe()

fuel,make_model,body_type,co2_emission,count,mean,std,min,25%,50%,75%,max
Benzine,Audi A1,Compact,97.00,77.00,15115.64,1746.38,10900.00,13820.00,15444.00,16669.00,18775.00
Benzine,Audi A1,Compact,98.00,23.00,17077.57,3010.23,13999.00,15290.00,16100.00,16900.00,25256.00
Benzine,Audi A1,Compact,100.00,3.00,15903.33,1859.90,14220.00,14905.00,15590.00,16745.00,17900.00
Benzine,Audi A1,Compact,102.00,106.00,17401.34,3100.21,9950.00,15772.50,16950.00,19889.25,29150.00
Benzine,Audi A1,Compact,103.00,30.00,21233.03,1794.55,18350.00,19225.00,22189.50,22448.50,23550.00
...,...,...,...,...,...,...,...,...,...,...,...
LPG/CNG,Opel Corsa,Sedans,110.00,1.00,7500.00,,7500.00,7500.00,7500.00,7500.00,7500.00
LPG/CNG,Opel Corsa,Sedans,113.00,6.00,9583.33,1338.91,7500.00,8900.00,9700.00,10687.50,10950.00
LPG/CNG,Opel Corsa,Sedans,124.00,2.00,7850.00,1202.08,7000.00,7425.00,7850.00,8275.00,8700.00
LPG/CNG,Opel Corsa,Sedans,135.00,1.00,12400.00,,12400.00,12400.00,12400.00,12400.00,12400.00


**CO2 emission values of electric and hybrid vehicles differ markedly from those of other fuel types, so I fill the missing values of these vehicles separately, before the group-wise fill.**

In [55]:
df.loc[df.fuel == "Electric", "co2_emission"]

13397   NaN
Name: co2_emission, dtype: float64

In [56]:
df.loc[df.fuel == "Hybrid", "co2_emission"]

3356     NaN
3612   36.00
3615   36.00
3617   36.00
Name: co2_emission, dtype: float64

In [57]:
df.loc[df.fuel == "Hybrid", "co2_emission"] = df.loc[df.fuel == "Hybrid", "co2_emission"].fillna(df.loc[df.fuel == "Hybrid", "co2_emission"].mode()[0])
df.loc[df.fuel == "Electric", "co2_emission"] = df.loc[df.fuel == "Electric", "co2_emission"].fillna(df.loc[df.fuel == "Hybrid", "co2_emission"].mode()[0])

In [58]:
df.loc[df.fuel == "Electric", "co2_emission"]

13397   36.00
Name: co2_emission, dtype: float64

In [59]:
df.loc[df.fuel == "Hybrid", "co2_emission"]

3356   36.00
3612   36.00
3615   36.00
3617   36.00
Name: co2_emission, dtype: float64

**For the remaining vehicles, we can fill the missing values with the group "median".**

In [60]:
fill_median(df, "make_model", "body_type", "fuel", "co2_emission")

Number of NaN :  0
------------------
120.00    1000
104.00     782
97.00      631
99.00      593
124.00     574
          ... 
80.00        1
14.00        1
51.00        1
165.00       1
193.00       1
Name: co2_emission, Length: 124, dtype: int64
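
`fill_median` is a helper defined outside this section. As a rough illustration of the idea only, the sketch below fills missing values with the median of progressively coarser groups. The function name `fill_median_sketch`, the fallback order, and the toy data are assumptions, not the original implementation.

```python
import numpy as np
import pandas as pd

def fill_median_sketch(df, group_col1, group_col2, group_col3, col_name):
    """Hypothetical stand-in for the notebook's fill_median helper."""
    out = df.copy()
    # Try the finest grouping first, then progressively coarser ones.
    for keys in ([group_col1, group_col2, group_col3],
                 [group_col1, group_col2],
                 [group_col1]):
        group_median = out.groupby(keys)[col_name].transform("median")
        out[col_name] = out[col_name].fillna(group_median)
    # Last resort: the overall median of the column.
    out[col_name] = out[col_name].fillna(out[col_name].median())
    return out

# Made-up rows, only to exercise the fallback logic.
toy = pd.DataFrame({
    "make_model": ["Audi A1"] * 4 + ["Opel Corsa"] * 2,
    "body_type": ["Compact", "Compact", "Sedans", "Sedans", "Sedans", "Sedans"],
    "fuel": ["Benzine", "Benzine", "Diesel", "Benzine", "Diesel", "Diesel"],
    "co2_emission": [97.0, np.nan, np.nan, 103.0, 110.0, np.nan],
})

filled = fill_median_sketch(toy, "make_model", "body_type", "fuel", "co2_emission")
print(filled["co2_emission"].tolist())
```

The third row has no other value in its (make_model, body_type, fuel) group, so it is filled from the coarser (make_model, body_type) group instead.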


## 8. cons_comb  & cons_city  &  cons_country

In [61]:
df[["cons_comb", "cons_city", "cons_country"]].isnull().sum()

cons_comb       3759
cons_city       4008
cons_country    3335
dtype: int64

**Some consumption values appear as text in the "warranty", "full_service" and "non_smoking_vehicle" columns, so we can use these columns to fill in part of the missing data.**

In [62]:
df.warranty.apply(lambda x : re.findall("(\d*\.?\d? l/100 km *\(comb\))", str(x))).value_counts()

[]                       15801
[5.4 l/100 km (comb)]       20
[5.6 l/100 km (comb)]       14
[6 l/100 km (comb)]         10
[6.1 l/100 km (comb)]       10
[5.2 l/100 km (comb)]        9
[4.9 l/100 km (comb)]        7
[6.3 l/100 km (comb)]        6
[5.9 l/100 km (comb)]        5
[4.3 l/100 km (comb)]        5
[5.1 l/100 km (comb)]        4
[4.7 l/100 km (comb)]        4
[5 l/100 km (comb)]          4
[6.2 l/100 km (comb)]        4
[6.7 l/100 km (comb)]        2
[5.7 l/100 km (comb)]        2
[5.8 l/100 km (comb)]        2
[6.4 l/100 km (comb)]        2
[6.6 l/100 km (comb)]        2
[4.5 l/100 km (comb)]        2
[6.5 l/100 km (comb)]        1
[5.5 l/100 km (comb)]        1
[4.4 l/100 km (comb)]        1
[7.4 l/100 km (comb)]        1
Name: warranty, dtype: int64

In [63]:
df["cons_comb_new"] = df.warranty.apply(lambda x : re.findall("(\d*\.?\d? l/100 km *\(comb\))", str(x)))
df["cons_city_new"] = df.warranty.apply(lambda x : re.findall("(\d*\.?\d? l/100 km *\(city\))", str(x)))
df["cons_country_new"] = df.warranty.apply(lambda x : re.findall("(\d*\.?\d? l/100 km *\(country\))", str(x)))

In [64]:
df["cons_comb_new"].fillna(df.full_service.apply(lambda x : re.findall("(\d*\.?\d? l/100 km *\(comb\))", str(x))),inplace=True)
df["cons_city_new"].fillna(df.full_service.apply(lambda x : re.findall("(\d*\.?\d? l/100 km *\(city\))", str(x))),inplace=True)
df["cons_country_new"].fillna(df.full_service.apply(lambda x : re.findall("(\d*\.?\d? l/100 km *\(country\))", str(x))),inplace=True)

df["cons_comb_new"].fillna(df.non_smoking_vehicle.apply(lambda x : re.findall("(\d*\.?\d? l/100 km *\(comb\))", str(x))),inplace=True)
df["cons_city_new"].fillna(df.non_smoking_vehicle.apply(lambda x : re.findall("(\d*\.?\d? l/100 km *\(city\))", str(x))),inplace=True)
df["cons_country_new"].fillna(df.non_smoking_vehicle.apply(lambda x : re.findall("(\d*\.?\d? l/100 km *\(country\))", str(x))),inplace=True)

In [65]:
df.cons_comb.fillna(   df["cons_comb_new"].astype("str").str.extract("(\d+)").loc[:,0] , inplace=True)
df.cons_city.fillna(   df["cons_city_new"].astype("str").str.extract("(\d+)").loc[:,0] , inplace=True)
df.cons_country.fillna(   df["cons_country_new"].astype("str").str.extract("(\d+)").loc[:,0] , inplace=True)

In [66]:
df[["cons_comb", "cons_city", "cons_country"]] = df[["cons_comb", "cons_city", "cons_country"]].astype("float")

In [67]:
df[["cons_comb", "cons_city", "cons_country"]].isnull().sum()

cons_comb       3745
cons_city       4006
cons_country    3328
dtype: int64
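
Two details of the extraction cells above are worth a hedged illustration. `re.findall` returns a list (an empty list when nothing matches), so the intermediate `*_new` columns never contain NaN for `fillna` to act on. In addition, the final `(\d+)` pattern keeps only the digits before the decimal point. The sketch below, on a made-up Series, shows one way to capture the full decimal value directly with `str.extract`; the pattern and the toy strings are assumptions for illustration, not the notebook's method.

```python
import re
import numpy as np
import pandas as pd

# Made-up free-text values of the kind found in the "warranty" column.
text = pd.Series([
    "5.4 l/100 km (comb)",
    "6 l/100 km (comb)",
    "12 months",
    np.nan,
])

# Capture the whole number, including an optional decimal part.
pattern = r"(\d+(?:\.\d+)?) l/100 km *\(comb\)"

# findall yields lists, which are never NaN, so a later fillna has nothing to fill.
as_lists = text.apply(lambda x: re.findall(pattern, str(x)))
print(as_lists.isna().sum())

# str.extract yields NaN where nothing matches, so the result can feed fillna directly.
extracted = text.astype("str").str.extract(pattern, expand=False).astype("float")
print(extracted.tolist())
```

With a result of this shape, a call such as `df["cons_comb"].fillna(extracted)` would keep the decimal part of the recovered values.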

**I will keep only "cons_comb" to represent these three columns. Before dropping the other two, we can use "cons_city" and "cons_country" to fill some of its missing values: their average serves as an estimate of the combined consumption.**

In [68]:
cons_comb_2 = round((df["cons_country"] + df["cons_city"])/2, 1)
cons_comb_2.value_counts(dropna=False)

NaN      4526
5.60      933
4.80      759
3.90      641
4.00      630
6.00      585
5.20      578
4.60      550
6.40      534
5.40      487
4.10      472
5.80      459
4.20      411
5.00      352
3.80      351
5.90      225
6.20      213
3.30      211
6.60      208
5.30      200
5.10      197
4.90      194
6.10      192
4.70      186
3.70      167
3.60      166
4.50      161
5.50      159
7.20      147
6.50      133
7.00       96
3.50       96
3.40       87
4.40       84
4.30       76
6.90       75
6.80       67
6.30       48
5.70       45
7.80       39
9.00       36
7.60       36
7.10       21
3.20       16
6.70       15
7.70       13
8.20       12
7.30        4
8.90        4
7.50        3
8.70        3
7.40        3
8.40        2
8.60        2
9.30        2
8.00        2
8.80        2
9.60        1
9.10        1
15.10       1
dtype: int64

In [69]:
df["cons_comb"] = df["cons_comb"].fillna(cons_comb_2)

In [70]:
df["cons_comb"].isnull().sum()

2459

In [71]:
df["cons_comb"].fillna("-", inplace=True)

In [72]:
df.groupby(["make_model", "body_type", "fuel", "cons_comb"]).price.describe()[0:20]

make_model,body_type,fuel,cons_comb,count,mean,std,min,25%,50%,75%,max
Audi A1,Compact,Benzine,4.2,87.0,15631.13,2420.35,10900.0,13970.0,15599.0,16822.5,25256.0
Audi A1,Compact,Benzine,4.3,17.0,16628.71,2467.67,13999.0,15290.0,15900.0,16900.0,22919.0
Audi A1,Compact,Benzine,4.4,193.0,17618.92,3250.85,9950.0,15689.0,16950.0,19990.0,28750.0
Audi A1,Compact,Benzine,4.5,11.0,21482.91,1918.66,18400.0,19744.0,22444.0,23075.0,23250.0
Audi A1,Compact,Benzine,4.6,52.0,20729.96,2195.46,14470.0,19890.0,21490.0,22095.0,22990.0
Audi A1,Compact,Benzine,4.7,33.0,22102.67,1889.93,15490.0,21490.0,22400.0,22790.0,27980.0
Audi A1,Compact,Benzine,4.8,93.0,25162.55,3080.47,17330.0,22785.0,25750.0,27980.0,29197.0
Audi A1,Compact,Benzine,4.9,176.0,23133.29,3723.04,13480.0,20288.0,22490.0,26980.0,28999.0
Audi A1,Compact,Benzine,5.1,50.0,18187.28,4445.45,12550.0,15850.0,15850.0,20759.75,28980.0
Audi A1,Compact,Benzine,5.2,22.0,19618.5,3531.95,13475.0,17142.5,19260.0,21937.5,28880.0


In [73]:
df.cons_comb.replace("-",np.nan, inplace=True)

**Consumption values of electric and hybrid vehicles differ from those of other fuel types, so I fill the missing consumption values of these vehicles separately.**

In [74]:
df.loc[df.fuel == "Electric", "cons_comb"]

13397   NaN
Name: cons_comb, dtype: float64

In [75]:
df.loc[df.fuel == "Hybrid", "co2_emission"]

3356   36.00
3612   36.00
3615   36.00
3617   36.00
Name: co2_emission, dtype: float64

In [76]:
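# Note: the fill value used here is the mode of the Hybrid "co2_emission" values (36.0), not a consumption figure.
df.loc[df.fuel == "Electric", "cons_comb"] = df.loc[df.fuel == "Electric", "cons_comb"].fillna(df.loc[df.fuel == "Hybrid", "co2_emission"].mode()[0])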
df.loc[df.fuel == "Electric", "cons_comb"]

13397   36.00
Name: cons_comb, dtype: float64
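
The cell above takes its fill value from the Hybrid `co2_emission` column, so an emission figure ends up in a consumption column. A minimal sketch of an alternative, assuming the intent was to borrow the typical Hybrid consumption instead. The toy values are made up and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows: three hybrids, one electric car without a consumption value.
toy = pd.DataFrame({
    "fuel": ["Hybrid", "Hybrid", "Hybrid", "Electric", "Benzine"],
    "cons_comb": [1.6, 1.6, 1.2, np.nan, 5.4],
    "co2_emission": [36.0, 36.0, 36.0, 36.0, 120.0],
})

is_electric = toy["fuel"] == "Electric"

# Take the mode from the Hybrid *consumption* column, not from co2_emission.
hybrid_cons_mode = toy.loc[toy["fuel"] == "Hybrid", "cons_comb"].mode()[0]
toy.loc[is_electric, "cons_comb"] = toy.loc[is_electric, "cons_comb"].fillna(hybrid_cons_mode)

print(toy.loc[is_electric, "cons_comb"])
```

Whether a hybrid value is a sensible proxy for an electric vehicle at all is a separate modelling choice; with a single electric row, dropping that row would be another option.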

In [77]:
fill_median(df, "make_model", "body_type", "fuel", "cons_comb")

Number of NaN :  0
------------------
5.40     1035
5.60      852
4.70      846
5.30      830
3.80      801
5.10      794
3.90      770
5.20      758
4.80      746
4.10      652
4.40      625
4.20      589
4.60      545
4.50      523
3.70      468
3.30      462
5.90      411
4.90      393
5.50      384
4.00      353
5.70      343
6.20      320
4.30      308
3.50      288
3.60      251
6.40      221
6.30      212
6.10      184
5.80      165
6.80      159
6.60      148
3.40      130
7.40       66
6.50       46
6.70       43
7.10       38
6.90       27
3.20       25
8.30       20
7.60       16
6.00       12
7.80        7
3.10        7
5.85        7
7.20        6
5.00        5
7.50        4
8.60        4
8.70        3
6.45        3
1.60        3
7.90        3
7.30        2
8.10        2
9.10        1
13.80       1
36.00       1
1.20        1
Name: cons_comb, dtype: int64


In [78]:
df.drop(columns=["cons_city" ,"cons_country" ,"cons_comb_new" ,"cons_city_new" ,"cons_country_new"],inplace=True)

## 9. body_color & paint_type

In [79]:
df[['body_color', 'paint_type']].value_counts()

body_color  paint_type 
Grey        Metallic       2730
Black       Metallic       2654
Silver      Metallic       1425
Blue        Metallic       1006
White       Metallic        865
Red         Metallic        526
Brown       Metallic        231
White       Uni/basic       145
Green       Metallic        115
Beige       Metallic         74
Black       Uni/basic        65
Grey        Uni/basic        62
Yellow      Metallic         37
Silver      Uni/basic        30
Red         Uni/basic        22
Blue        Uni/basic        17
Violet      Metallic         13
Bronze      Metallic          3
White       Perl effect       3
Green       Uni/basic         2
Beige       Uni/basic         2
Yellow      Uni/basic         2
Orange      Metallic          1
Brown       Perl effect       1
Red         Perl effect       1
Blue        Perl effect       1
dtype: int64

In [80]:
df.groupby(["make_model", "body_type", 'body_color']).price.describe()

make_model,body_type,body_color,count,mean,std,min,25%,50%,75%,max
Audi A1,Compact,Beige,6.00,20556.50,2475.51,16240.00,19766.75,21420.00,21700.00,23250.00
Audi A1,Compact,Black,320.00,18196.28,4206.97,9950.00,14990.00,16890.00,21390.00,28997.00
Audi A1,Compact,Blue,96.00,19145.41,4541.86,11444.00,15870.00,16925.00,22226.00,28980.00
Audi A1,Compact,Brown,9.00,16982.00,2964.39,11445.00,15993.00,16820.00,18850.00,20750.00
Audi A1,Compact,Green,17.00,23558.12,3849.70,19388.00,19388.00,22490.00,28240.00,28400.00
...,...,...,...,...,...,...,...,...,...,...
Renault Espace,Van,Brown,27.00,25718.26,9649.96,12614.00,19225.00,22990.00,27187.50,47990.00
Renault Espace,Van,Grey,301.00,30236.72,8635.88,15500.00,24800.00,28000.00,34500.00,64332.00
Renault Espace,Van,Silver,36.00,27373.00,4809.68,12990.00,24560.00,28500.00,29900.00,35100.00
Renault Espace,Van,Violet,11.00,24434.27,4603.89,19900.00,20499.50,23900.00,26299.00,34990.00


In [81]:
df.groupby(["make_model", "body_type", 'paint_type']).price.describe()

make_model,body_type,paint_type,count,mean,std,min,25%,50%,75%,max
Audi A1,Compact,Metallic,723.0,19412.38,4665.03,9950.0,15740.0,18750.0,22490.0,31990.0
Audi A1,Compact,Uni/basic,5.0,16815.2,3405.16,12900.0,13500.0,17900.0,19888.0,19888.0
Audi A1,Coupe,Metallic,2.0,14925.0,1378.86,13950.0,14437.5,14925.0,15412.5,15900.0
Audi A1,Sedans,Metallic,1000.0,18829.3,4396.32,10300.0,15680.0,16975.0,21900.0,32000.0
Audi A1,Sedans,Perl effect,1.0,28290.0,,28290.0,28290.0,28290.0,28290.0,28290.0
Audi A1,Sedans,Uni/basic,57.0,20329.4,4082.87,13900.0,16900.0,19999.0,22490.0,29000.0
Audi A1,Station wagon,Metallic,16.0,17666.38,3546.88,14440.0,15850.0,16451.0,17990.0,28890.0
Audi A1,Station wagon,Uni/basic,1.0,21450.0,,21450.0,21450.0,21450.0,21450.0,21450.0
Audi A1,Van,Metallic,1.0,29000.0,,29000.0,29000.0,29000.0,29000.0,29000.0
Audi A2,Off-Road,Metallic,1.0,28200.0,,28200.0,28200.0,28200.0,28200.0,28200.0


**There does not appear to be a significant relationship between "paint_type" and price. We can fill the missing values with the ffill/bfill method to preserve the current proportions.**

In [82]:
fill(df, "make_model", "body_type", 'body_color', "ffill")

Number of NaN :  0
------------------
Black     3900
Grey      3615
White     3520
Silver    1707
Blue      1522
Red        995
Brown      298
Green      166
Beige      116
Yellow      51
Violet      18
Bronze       6
Orange       3
Gold         2
Name: body_color, dtype: int64


In [83]:
fill(df, "make_model", "body_type", 'paint_type', "ffill")

Number of NaN :  0
------------------
Metallic       15250
Uni/basic        637
Perl effect       32
Name: paint_type, dtype: int64
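
`fill` is a helper defined outside this section. As an illustration of what a group-wise "ffill" fill could look like, the sketch below propagates values forward and backward within each group, then within coarser groups. The function name `fill_directional_sketch`, its fallback order, and the toy data are assumptions, not the original helper.

```python
import numpy as np
import pandas as pd

def fill_directional_sketch(df, group_col1, group_col2, col_name):
    """Hypothetical group-wise ffill/bfill, from fine to coarse groups."""
    out = df.copy()
    for keys in ([group_col1, group_col2], [group_col1]):
        # Propagate the last seen value forward inside each group...
        out[col_name] = out.groupby(keys)[col_name].ffill()
        # ...then fill leading gaps backward inside each group.
        out[col_name] = out.groupby(keys)[col_name].bfill()
    # Anything still missing is filled across the whole column.
    out[col_name] = out[col_name].ffill().bfill()
    return out

# Made-up rows, only to exercise the logic.
toy = pd.DataFrame({
    "make_model": ["Audi A1", "Audi A1", "Audi A1", "Opel Corsa", "Opel Corsa"],
    "body_type": ["Compact", "Compact", "Sedans", "Sedans", "Sedans"],
    "paint_type": ["Metallic", np.nan, np.nan, np.nan, "Uni/basic"],
})

filled = fill_directional_sketch(toy, "make_model", "body_type", "paint_type")
print(filled["paint_type"].tolist())
```

Because each gap copies a neighbouring observed value, frequent categories are copied more often than rare ones, which is what keeps the proportions roughly stable.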


## 10. drive_chain

In [84]:
summary("drive_chain")

Column name              :  drive_chain
--------------------------------
Total missing value      :  6858
Percentage of missing    :  43.08
Number of unique values  :  3

Unique values  : 
['front' nan '4WD' 'rear']

Number of values
front    8886
NaN      6858
4WD       171
rear        4
Name: drive_chain, dtype: int64


In [85]:
df["drive_chain"].fillna("-", inplace=True)

In [86]:
df.groupby(["make_model", "body_type","drive_chain"]).price.describe()[0:100]

make_model,body_type,drive_chain,count,mean,std,min,25%,50%,75%,max
Audi A1,Compact,-,352.0,17620.87,4226.12,10490.0,14990.0,15900.0,20885.75,29190.0
Audi A1,Compact,4WD,2.0,14790.0,1258.65,13900.0,14345.0,14790.0,15235.0,15680.0
Audi A1,Compact,front,685.0,20008.22,4511.35,9950.0,16430.0,19890.0,22690.0,31990.0
Audi A1,Coupe,-,2.0,14925.0,1378.86,13950.0,14437.5,14925.0,15412.5,15900.0
Audi A1,Sedans,-,561.0,17830.44,4362.32,8999.0,14900.0,16490.0,20700.0,37900.0
Audi A1,Sedans,4WD,1.0,15450.0,,15450.0,15450.0,15450.0,15450.0,15450.0
Audi A1,Sedans,front,989.0,19133.79,4441.97,10000.0,15838.0,18500.0,21999.0,32000.0
Audi A1,Station wagon,-,3.0,24593.0,7537.22,15890.0,22390.0,28890.0,28944.5,28999.0
Audi A1,Station wagon,front,18.0,16681.11,2493.67,12950.0,15000.0,16356.0,17300.0,21450.0
Audi A1,Van,front,1.0,29000.0,,29000.0,29000.0,29000.0,29000.0,29000.0


**Most of the time, the drive chain of a car is determined by its make_model and body_type. So I have decided to fill the missing values with the mode of the related group.**

In [87]:
df.drive_chain.replace("-", np.nan, inplace=True)

In [88]:
fill(df, "make_model", "body_type", "drive_chain", "mode")

Number of NaN :  0
------------------
front    15711
4WD        204
rear         4
Name: drive_chain, dtype: int64
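
A group-wise mode fill of the kind used above can be sketched as follows. The function name `fill_mode_sketch`, the fallback from fine to coarse groups, and the toy data are assumptions for illustration; the notebook's own `fill` helper is defined outside this section.

```python
import numpy as np
import pandas as pd

def fill_mode_sketch(df, group_col1, group_col2, col_name):
    """Hypothetical group-wise mode fill with coarser fallbacks."""
    out = df.copy()

    def group_mode(values):
        # mode() ignores NaN; an all-NaN group has no mode.
        modes = values.mode()
        return modes.iloc[0] if not modes.empty else np.nan

    for keys in ([group_col1, group_col2], [group_col1]):
        mode_per_row = out.groupby(keys)[col_name].transform(group_mode)
        out[col_name] = out[col_name].fillna(mode_per_row)

    # Last resort: the mode of the whole column.
    overall = out[col_name].mode()
    if not overall.empty:
        out[col_name] = out[col_name].fillna(overall.iloc[0])
    return out

# Made-up rows, only to exercise the logic.
toy = pd.DataFrame({
    "make_model": ["Audi A1"] * 5 + ["Renault Espace"],
    "body_type": ["Compact", "Compact", "Compact", "Compact", "Coupe", "Van"],
    "drive_chain": ["front", "front", "4WD", np.nan, np.nan, np.nan],
})

filled = fill_mode_sketch(toy, "make_model", "body_type", "drive_chain")
print(filled["drive_chain"].tolist())
```

Observed values such as the single "4WD" row are left untouched; only the gaps take the most frequent value of their group.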


## 11. weight

In [89]:
df["weight"]

0       1220.00
1       1255.00
2           NaN
3       1195.00
4           NaN
          ...  
15914   1758.00
15915   1708.00
15916       NaN
15917   1758.00
15918   1685.00
Name: weight, Length: 15919, dtype: float64

In [90]:
df["weight"].fillna("-", inplace=True)

In [91]:
df.groupby(["make_model", "body_type","weight"]).price.describe()

make_model,body_type,weight,count,mean,std,min,25%,50%,75%,max
Audi A1,Compact,102.0,1.00,19229.00,,19229.00,19229.00,19229.00,19229.00,19229.00
Audi A1,Compact,1010.0,2.00,15450.00,707.11,14950.00,15200.00,15450.00,15700.00,15950.00
Audi A1,Compact,1035.0,6.00,16796.67,2617.87,14390.00,15892.50,15900.00,16575.00,21900.00
Audi A1,Compact,1040.0,2.00,20424.50,2933.79,18350.00,19387.25,20424.50,21461.75,22499.00
Audi A1,Compact,1065.0,36.00,20971.78,1982.55,15500.00,18987.50,21690.00,22400.00,23550.00
...,...,...,...,...,...,...,...,...,...,...
Renault Espace,Van,2037.0,1.00,47950.00,,47950.00,47950.00,47950.00,47950.00,47950.00
Renault Espace,Van,2353.0,1.00,22990.00,,22990.00,22990.00,22990.00,22990.00,22990.00
Renault Espace,Van,2410.0,1.00,23990.00,,23990.00,23990.00,23990.00,23990.00,23990.00
Renault Espace,Van,2471.0,5.00,24738.00,8470.64,17400.00,20900.00,20900.00,25500.00,38990.00


In [92]:
df["weight"].replace("-", np.nan, inplace=True)

In [93]:
fill(df, "make_model", "body_type", "weight", "mode")

Number of NaN :  0
------------------
1163.00    1582
1360.00    1419
1487.00     966
1135.00     837
1425.00     744
           ... 
1331.00       1
1132.00       1
1252.00       1
1792.00       1
2037.00       1
Name: weight, Length: 434, dtype: int64


## 12. first_registration

In [94]:
summary("first_registration")

Column name              :  first_registration
--------------------------------
Total missing value      :  1597
Percentage of missing    :  10.03
Number of unique values  :  4

Unique values  : 
[2016. 2017. 2018.   nan 2019.]

Number of values
2018.00    4522
2016.00    3674
2017.00    3273
2019.00    2853
NaN        1597
Name: first_registration, dtype: int64


**The age of a vehicle is one of the criteria to be evaluated, and it can be derived from the "first_registration" column. In the next steps, I fill in the missing values on the derived "age" column.**

In [95]:
# The data used for this project were scraped in 2019.

df['age'] = 2019 - df['first_registration']

In [96]:
df['age'].fillna('-', inplace=True)

In [97]:
df["age"].value_counts(dropna=False)

1.0    4522
3.0    3674
2.0    3273
0.0    2853
-      1597
Name: age, dtype: int64

**We can estimate the missing ages from the mileage data: the km statistics of each known age group suggest which mileage ranges to use.**

In [98]:
df.groupby("age").km.describe()

age,count,mean,std,min,25%,50%,75%,max
0.0,2706.0,2085.36,5365.88,1.0,10.0,50.0,3000.0,127022.0
1.0,4484.0,18035.24,11052.52,1.0,9990.0,17872.0,25078.5,136000.0
2.0,3272.0,41754.94,28295.75,1.0,21541.75,34752.0,54805.5,317000.0
3.0,3674.0,77442.52,39170.14,10.0,48000.0,72914.5,99950.0,291800.0
-,759.0,934.5,7416.24,0.0,5.0,10.0,10.0,89982.0


In [99]:
df[df["age"] == "-"]["km"].value_counts(dropna=False)

NaN         838
10.00       369
1.00        146
5.00         58
20.00        32
15.00        21
0.00         19
11.00        12
8.00         11
50.00        10
12.00         8
100.00        8
7.00          7
3.00          4
9.00          4
4.00          3
250.00        3
25.00         3
30.00         3
3000.00       2
39962.00      2
2.00          2
22627.00      2
784.00        1
89692.00      1
3500.00       1
325.00        1
497.00        1
99.00         1
77.00         1
40.00         1
19500.00      1
6100.00       1
11000.00      1
89982.00      1
4307.00       1
141.00        1
34164.00      1
500.00        1
150.00        1
11200.00      1
20768.00      1
32084.00      1
142.00        1
81800.00      1
281.00        1
6.00          1
68485.00      1
85000.00      1
196.00        1
4500.00       1
60.00         1
5000.00       1
Name: km, dtype: int64

In [100]:
cond0 = (df['km'] < 10000)
cond1 = ((df['km'] >= 10000) & (df['km'] < 20000))
cond2 = ((df['km'] >= 20000) & (df['km'] < 45000))
cond3 = (df['km'] >= 45000)

In [101]:
df.loc[cond0,'age'] = df.loc[cond0,'age'].replace('-', 0)
df.loc[cond1,'age'] = df.loc[cond1,'age'].replace('-', 1)
df.loc[cond2,'age'] = df.loc[cond2,'age'].replace('-', 2)
df.loc[cond3,'age'] = df.loc[cond3,'age'].replace('-', 3)

In [102]:
df.age.value_counts(dropna=False)

1.0    4525
3.0    3679
0.0    3597
2.0    3280
-       838
Name: age, dtype: int64
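
The four mileage conditions above amount to binning `km` into half-open intervals. A minimal sketch of the same mapping with `pd.cut`, on made-up mileage values; the thresholds are the ones used in `cond0` to `cond3`, everything else is illustrative.

```python
import numpy as np
import pandas as pd

# Made-up mileage values that sit on and around the thresholds.
km = pd.Series([50.0, 9999.0, 10000.0, 19999.0, 20000.0, 44999.0, 45000.0, np.nan])

age_from_km = pd.cut(
    km,
    bins=[-np.inf, 10000, 20000, 45000, np.inf],
    labels=[0, 1, 2, 3],
    right=False,  # intervals are [low, high), matching cond0..cond3
).astype("float")

print(age_from_km.tolist())

# The estimate only fills ages that are missing; known ages are kept.
age = pd.Series([np.nan, 0.0, np.nan, 1.0, 3.0, np.nan, np.nan, np.nan])
age_filled = age.fillna(age_from_km)
print(age_filled.tolist())
```

Rows without a mileage value stay missing, which corresponds to the 838 rows still marked "-" in the output above.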

In [103]:
df.groupby(['make_model',"body_type", 'age']).price.describe()[:20]

make_model,body_type,age,count,mean,std,min,25%,50%,75%,max
Audi A1,Compact,0.0,198.0,23277.43,3510.41,14900.0,20503.5,22492.0,26798.5,31990.0
Audi A1,Compact,1.0,268.0,18596.04,2659.91,13980.0,16445.0,16980.0,20950.0,23829.0
Audi A1,Compact,2.0,161.0,16602.81,2085.38,10999.0,15450.0,15850.0,17700.0,22150.0
Audi A1,Compact,3.0,234.0,14532.91,1908.91,9950.0,13407.5,13994.5,15480.0,18900.0
Audi A1,Compact,-,178.0,23996.26,3383.85,16220.0,21515.0,22875.0,27380.0,29181.0
Audi A1,Coupe,2.0,1.0,15900.0,,15900.0,15900.0,15900.0,15900.0,15900.0
Audi A1,Coupe,3.0,1.0,13950.0,,13950.0,13950.0,13950.0,13950.0,13950.0
Audi A1,Sedans,0.0,376.0,24151.03,3175.53,15990.0,21800.0,23745.0,26890.0,37900.0
Audi A1,Sedans,1.0,466.0,18396.3,2582.45,13450.0,16365.0,16949.5,20285.0,33900.0
Audi A1,Sedans,2.0,269.0,16624.1,2252.47,11600.0,14970.0,15900.0,18000.0,23490.0


**The average price of the vehicles with a missing age is close to the average price of the vehicles with age zero. For this project, we can assume that the age of these cars is zero.**

In [104]:
df['age'].replace('-', 0, inplace=True)

In [105]:
df["age"].value_counts(dropna=False)

1.00    4525
0.00    4435
3.00    3679
2.00    3280
Name: age, dtype: int64

In [106]:
df.drop(columns="first_registration", inplace=True)

## 13. type

In [107]:
summary("type")

Column name              :  type
--------------------------------
Total missing value      :  2
Percentage of missing    :  0.01
Number of unique values  :  5

Unique values  : 
['Used' "Employee's car" 'New' 'Demonstration' 'Pre-registered' nan]

Number of values
Used              11096
New                1650
Pre-registered     1364
Employee's car     1011
Demonstration       796
NaN                   2
Name: type, dtype: int64


**We can fill it using "mode" as there are only 2 missing values**

In [108]:
fill(df,"make_model","body_type","type","mode")

Number of NaN :  0
------------------
Used              11098
New                1650
Pre-registered     1364
Employee's car     1011
Demonstration       796
Name: type, dtype: int64


## 14. next_inspection  &  inspection_new

In [109]:
df.next_inspection.value_counts()[:10]

06/2021    471
03/2021    210
05/2021    180
04/2021    171
02/2021    168
04/2022    144
05/2022    143
01/2021    132
03/2022    121
03/2020    113
Name: next_inspection, dtype: int64

In [110]:
df.inspection_new.value_counts()

Yes    3932
Name: inspection_new, dtype: int64

**The inspection date is not a factor that directly affects the car price, so we can drop the "next_inspection" column. Let's focus on the "inspection_new" column instead.**

**This column indicates whether the vehicle has a new inspection. It contains only "Yes" and null values. For this project, we can treat the null values as "No".**

In [111]:
df.drop(columns="next_inspection", inplace=True)

In [112]:
df.inspection_new.fillna("No", inplace=True)

In [113]:
summary("inspection_new")

Column name              :  inspection_new
--------------------------------
Total missing value      :  0
Percentage of missing    :  0.0
Number of unique values  :  2

Unique values  : 
['Yes' 'No']

Number of values
No     11987
Yes     3932
Name: inspection_new, dtype: int64


## 15. country_version

In [114]:
summary("country_version")

Column name              :  country_version
--------------------------------
Total missing value      :  8333
Percentage of missing    :  52.35
Number of unique values  :  23

Unique values  : 
[nan 'Germany' 'Italy' 'Belgium' 'Netherlands' 'Spain' 'European Union'
 'Switzerland' 'Austria' 'Luxembourg' 'France' 'Denmark' 'Poland'
 'Romania' 'Slovakia' 'Sweden' 'Czech Republic' 'Hungary' 'Slovenia'
 'Croatia' 'Egypt' 'Serbia' 'Bulgaria' 'Japan']

Number of values
NaN               8333
Germany           4502
Italy             1038
European Union     507
Netherlands        464
Spain              325
Belgium            314
Austria            208
Czech Republic      52
Poland              49
France              38
Denmark             33
Hungary             28
Japan                8
Slovakia             4
Croatia              4
Sweden               3
Romania              2
Bulgaria             2
Luxembourg           1
Switzerland          1
Slovenia             1
Egypt                1
Serb

**The "short_description" and "description" columns may contain clues about the country. The code below searches them for the known country names.**

In [115]:
list_of_search = set()
for i in df.country_version.unique()[1:]:
    search_in_short_description = df[df.short_description.str.contains(i, re.IGNORECASE) == True]["short_description"].values
    search_in_description = df[df.description.str.contains(i, re.IGNORECASE) == True]["description"].values

    if len(search_in_short_description) != 0 :
        list_of_search.update(search_in_short_description)
        
    if len(search_in_description) != 0 :
        list_of_search.update(search_in_description)
len(list_of_search)

82

In [116]:
df[df.short_description.str.contains("Austria", re.IGNORECASE) == True]["short_description"]

1750    1.0 TFSI Austria
2177    1.0 TFSI Austria
2590    1.0 TFSI Austria
2591    1.0 TFSI Austria
2593    1.0 TFSI Austria
Name: short_description, dtype: object
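
In `Series.str.contains`, the second positional parameter is `case`, so `re.IGNORECASE` passed positionally, as in the cells above, is read as a truthy `case` value and the search stays case-sensitive. A minimal sketch of a case-insensitive search on made-up strings, using the `flags` keyword (or `case=False`) and `na=False` in place of the `== True` comparison:

```python
import re
import numpy as np
import pandas as pd

# Made-up descriptions; note the lower-case "belgium" and the missing value.
short_description = pd.Series([
    "1.0 TFSI Austria",
    "1.2B belgium car",
    "Sportback 1.4 TDI",
    np.nan,
])

# Option 1: pass the regex flag by keyword.
mask_austria = short_description.str.contains("austria", flags=re.IGNORECASE, na=False)

# Option 2: switch off case sensitivity directly.
mask_belgium = short_description.str.contains("Belgium", case=False, na=False)

print(short_description[mask_austria].tolist())
print(short_description[mask_belgium].tolist())
```

With `na=False` the mask is purely boolean, so it can index the frame directly. The search term is treated as a regular expression by default; `re.escape` would be a safe wrapper for names that contain special characters.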

**We could try to fill in the missing data from the "short_description" and "description" columns.**

In [117]:
df[df.description.str.contains("Belgium", re.IGNORECASE) == True]["description"].iloc[1][975:1033]

'(mileage certificate in case the car comes from Belgium), '

**As the examination shows, some rows do mention country names. However, the "description" excerpt above shows that such a mention does not identify the vehicle's own country version, since vehicles can move between countries. So it does not make sense to fill "country_version" from these text columns.**

**Instead, we apply "ffill" and "bfill" so that the ratio of the values in the column stays roughly the same.**

In [118]:
df[df.short_description.str.contains("Belgium", re.IGNORECASE) == True][["short_description", "country_version"]]

Unnamed: 0,short_description,country_version
9760,1.2B Belgium Car - Navi met Touch Screen - All...,
9765,1.2B Belgium Car - Navigatie met Touch Screen ...,


In [119]:
df[df.short_description.str.contains("Austria", re.IGNORECASE) == True][["short_description", "country_version"]]

Unnamed: 0,short_description,country_version
1750,1.0 TFSI Austria,
2177,1.0 TFSI Austria,
2590,1.0 TFSI Austria,
2591,1.0 TFSI Austria,
2593,1.0 TFSI Austria,


In [120]:
df[df.short_description.str.contains(i, re.IGNORECASE) == True]["short_description"]

Series([], Name: short_description, dtype: object)

In [121]:
len('\n, Car immediately available!, *8900 Eur, : This price is without VAT (only for EU companies and/or export), *10769 Eur, : Price including VAT, without guarantee and Technical inspection, *11.569 Eur, : incl.VAT - For PRIVATE individuals: includes 1 Year of Guarantee and Technical inspection - Mandatory for Private customers in Belgium.,  Solafcars is a car dealer and specializes in selling second-hand vehicles provided by financial institutions and official brand dealers. All vehicles originate directly from , first owners, . All our vehicles are , maintained,  according to manufacturer’s regulations. Our employees carefully select quality vehicles from our partners and suppliers for a low cost without compromising on service. Although our main focus is on business-2-business sales we are always open to new private customers/individuals. ,  , Solafcars offers,  every vehicle with the following documents:,  *Full service history from official dealer,  *Carpass ')

975

In [122]:
df.country_version = df.country_version.ffill().bfill()
summary("country_version")

Column name              :  country_version
--------------------------------
Total missing value      :  0
Percentage of missing    :  0.0
Number of unique values  :  23

Unique values  : 
['Germany' 'Italy' 'Belgium' 'Netherlands' 'Spain' 'European Union'
 'Switzerland' 'Austria' 'Luxembourg' 'France' 'Denmark' 'Poland'
 'Romania' 'Slovakia' 'Sweden' 'Czech Republic' 'Hungary' 'Slovenia'
 'Croatia' 'Egypt' 'Serbia' 'Bulgaria' 'Japan']

Number of values
Germany           8835
Italy             2776
Netherlands        912
European Union     891
Spain              819
Belgium            785
Austria            404
Czech Republic     118
France              98
Poland              96
Denmark             73
Hungary             38
Japan               34
Slovakia            10
Croatia              7
Sweden               5
Romania              4
Switzerland          4
Serbia               3
Bulgaria             3
Luxembourg           2
Slovenia             1
Egypt                1
Name: country

## 16. upholstery_type & upholstery_color

In [123]:
summary("upholstery_type")

Column name              :  upholstery_type
--------------------------------
Total missing value      :  4871
Percentage of missing    :  30.6
Number of unique values  :  5

Unique values  : 
['Cloth' nan 'Part leather' 'Full leather' 'alcantara' 'Velour']

Number of values
Cloth           8423
NaN             4871
Part leather    1499
Full leather    1009
Velour            60
alcantara         57
Name: upholstery_type, dtype: int64


In [124]:
summary("upholstery_color")

Column name              :  upholstery_color
--------------------------------
Total missing value      :  6038
Percentage of missing    :  37.93
Number of unique values  :  9

Unique values  : 
['Black' 'Grey' nan 'White' 'Red' 'Blue' 'Orange' 'Brown' 'Beige' 'Yellow']

Number of values
Black     8201
NaN       6038
Grey      1376
Brown      207
Beige       54
Blue        16
White       13
Red          9
Yellow       4
Orange       1
Name: upholstery_color, dtype: int64


**We checked the "short_description" and "description" columns for mentions of the "upholstery_type" values.**

In [125]:
df[  df.upholstery_type.isnull()   &\
     df["short_description"].str.lower().str.contains("cloth|part leather|full leather|alcantara|velour")\
  ]["short_description"].shape  

(0,)

In [126]:
df[  df.upholstery_type.isnull()   &\
     df["description"].str.lower().str.contains("cloth|part leather|full leather|alcantara|velour")\
  ]["description"].shape  

(87,)

In [127]:
df[  df.upholstery_type.isnull()   &\
     df["description"].str.lower().str.contains("cloth|part leather|full leather|alcantara|velour")\
      ]["description"]

87       \nMagnifique Audi A1 Design S ligne 3 portes S...
230      \nAccessori: ,  Sensori parcheggio ,  Sensori ...
595      \nAirbags grand volume conducteur et passager ...
759      \nIch verkaufe schweren Herzens meinen geliebt...
1004     \nGarantie bis 03/2020, Das Fahrzeug ist in ei...
2394     \nJetzt SUPER PREISE BEI LEASING, Sie entschei...
2462     \n, Getriebe:,  Schaltgetriebe, Technik:,  Bor...
2818     \n(((Nettopreis:14285 ,-Euro))), A3 Sportback ...
2851     \nAudi A3 s-line Sedan DSG auto , Intérieur al...
3073     \nDEK:[2740823], Codice riferimento: 16659 - P...
3078     \nDEK:[2740823], Codice riferimento: 16659 - P...
3079     \nDEK:[2740823], Codice riferimento: 16659 - P...
3080     \nDEK:[2740823], Codice riferimento: 16659 - P...
3161     \n, Sehr guter Allgemeinzustand, im Kundenauft...
3165     \nPrecio financiado del vehículo:14. 500€, Pre...
3338     \nmacchina pari a nuovo sempre tenuta in garag...
3405     \nVettura in perfette condizioni, carrozzeria .

In [128]:
df[  df.upholstery_type.isnull()   &\
     df["description"].str.lower().str.contains("cloth|part leather|full leather|alcantara|velour")\
  ]["description"].iloc[1][288:320]

'Sedili in alcantara riscaldabili'

**translation:  "Heated Alcantara seats"**

In [129]:
df[  df.upholstery_type.isnull()   &\
     df["description"].str.lower().str.contains("cloth|part leather|full leather|alcantara|velour")\
      ]["description"].iloc[2][2367:2430]

"Moquette en velours de couleur assortie à celle de l'habitacle,"

**translation:  "Velour carpet in a color matching that of the passenger compartment,"**

**As the translation shows, a fabric keyword can refer to a part of the vehicle other than the seats (here, the carpet), so a keyword match alone does not tell us the upholstery type.**
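One way to reduce such false matches would be to require a seat word shortly before the fabric word. The sketch below is only an illustration: the word lists `SEAT_WORDS` and `FABRIC_WORDS` and the helper `mentions_seat_fabric` are hypothetical, and the 30-character window is an arbitrary assumption. It is not applied to the dataset in this notebook.

```python
import re

# Hypothetical word lists; a real list would need more languages and spellings.
SEAT_WORDS = r"(?:seats?|sedili|sitze|sièges?)"
FABRIC_WORDS = r"(?:alcantara|velours?|cloth|leather)"

# A seat word followed, within 30 non-comma characters, by a fabric word.
pattern = re.compile(SEAT_WORDS + r"[^,]{0,30}?" + FABRIC_WORDS, re.IGNORECASE)


def mentions_seat_fabric(text):
    """Return True when a fabric word appears shortly after a seat word."""
    return bool(pattern.search(text))


seat_text = "Sedili in alcantara riscaldabili"
carpet_text = "Moquette en velours de couleur assortie à celle de l'habitacle,"

print(mentions_seat_fabric(seat_text))
print(mentions_seat_fabric(carpet_text))
```

Phrases where the fabric comes first (for example "Alcantara seats") would be missed by this pattern, so it trades recall for precision.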

In [130]:
# French "tissu" == cloth
# German "stoff" == cloth
df[  df.upholstery_type.isnull()   &\
     df["description"].str.lower().str.contains("tissu|stoff")\
      ]["description"]

9        \nClim automatique,Ecran multifonction couleur...
29       \n, Sonderausstattung:, Airbag Beifahrerseite ...
55       \n, FINANZIERUNG OHNE ANZAHLUNG MIT EINER MONA...
64       \n, Gut erhaltener Audi A1 1.4 TDI ultra Sport...
180      \n, Highlights: , Qualitätssiegel GW:plus, 1. ...
                               ...                        
15618    \n, Sonderausstattung:, 3.Sitzreihe mit Einzel...
15659    \n, Klimaautomatik 2-fach, Fahrerairbag, Beifa...
15709    \n, Audio-Navigationssystem R-Link 2 mit Touch...
15771    \n, Klimaautomatik 2-fach, Fahrerairbag, Beifa...
15777    \n, Klimaautomatik 2-fach, Fahrerairbag, Beifa...
Name: description, Length: 451, dtype: object

**Searching the "description" column for the upholstery types in other languages does return results. However, to keep the EDA short, we will fill the missing values with "ffill" and "bfill".**
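The `fill` helper used in the next cells is defined earlier in the notebook. As a rough idea of what a group-wise forward/backward fill can look like, here is a minimal sketch with a hypothetical function `groupwise_ffill` and a made-up toy frame; it is an assumption about the general technique, not a copy of the notebook's helper.

```python
import numpy as np
import pandas as pd


def groupwise_ffill(frame, group_cols, target):
    """Forward- then backward-fill `target` inside each group.

    A global ffill/bfill is used as a fallback for groups that are
    entirely missing.
    """
    filled = frame.groupby(group_cols)[target].transform(
        lambda values: values.ffill().bfill()
    )
    return filled.ffill().bfill()


# Toy data (made up) with gaps in two different groups.
toy = pd.DataFrame({
    "make_model": ["A", "A", "A", "B", "B"],
    "body_type": ["Sedan", "Sedan", "Sedan", "Van", "Van"],
    "upholstery_type": ["Cloth", np.nan, "Velour", np.nan, "Full leather"],
})

toy["upholstery_type"] = groupwise_ffill(
    toy, ["make_model", "body_type"], "upholstery_type"
)
print(toy)
```

Filling inside each group keeps a value from leaking across unrelated car models, which a plain column-wide "ffill" cannot guarantee.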

In [131]:
fill(df, "make_model", "body_type", "upholstery_type", "ffill")

Number of NaN :  0
------------------
Cloth           12157
Part leather     2128
Full leather     1458
alcantara          95
Velour             81
Name: upholstery_type, dtype: int64


In [132]:
fill(df, "make_model", "body_type", "upholstery_color", "ffill")

Number of NaN :  0
------------------
Black     13035
Grey       2228
Brown       488
Beige        80
White        34
Blue         29
Red          18
Yellow        6
Orange        1
Name: upholstery_color, dtype: int64


## 17. previous_owners

In [133]:
summary("previous_owners")

Column name              :  previous_owners
--------------------------------
Total missing value      :  6640
Percentage of missing    :  41.71
Number of unique values  :  5

Unique values  : 
[ 2. nan  1.  0.  3.  4.]

Number of values
1.00    8294
NaN     6640
2.00     778
0.00     188
3.00      17
4.00       2
Name: previous_owners, dtype: int64


In [134]:
df["previous_owners"] = df["previous_owners"].fillna("-")

In [135]:
df.groupby(['make_model', 'body_type', 'age', 'previous_owners']).price.describe()[0:100]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,count,mean,std,min,25%,50%,75%,max
make_model,body_type,age,previous_owners,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Audi A1,Compact,0.0,0.0,16.0,22445.56,1877.17,19490.0,22174.75,22400.0,22480.0,28650.0
Audi A1,Compact,0.0,1.0,81.0,23396.51,3380.45,14900.0,21330.0,22900.0,25800.0,29179.0
Audi A1,Compact,0.0,2.0,1.0,21760.0,,21760.0,21760.0,21760.0,21760.0,21760.0
Audi A1,Compact,0.0,-,278.0,23756.33,3555.13,16220.0,20988.0,22725.0,27380.0,31990.0
Audi A1,Compact,1.0,1.0,195.0,18108.74,2381.81,13980.0,16430.0,16940.0,19989.0,23829.0
Audi A1,Compact,1.0,2.0,7.0,19319.71,2516.06,16960.0,16980.0,18470.0,21949.0,21950.0
Audi A1,Compact,1.0,-,66.0,19959.05,2981.8,14500.0,16800.0,20970.0,22448.5,23700.0
Audi A1,Compact,2.0,1.0,65.0,16785.78,2346.62,10999.0,15450.0,15975.0,17970.0,21490.0
Audi A1,Compact,2.0,2.0,43.0,16099.07,1080.32,14220.0,15850.0,15850.0,15850.0,20990.0
Audi A1,Compact,2.0,-,53.0,16787.09,2318.33,12490.0,15290.0,15900.0,18950.0,22150.0


In [136]:
df.groupby(['make_model', 'body_type', 'age', 'previous_owners']).km.describe()[200:300]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,count,mean,std,min,25%,50%,75%,max
make_model,body_type,age,previous_owners,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Opel Insignia,Station wagon,2.0,2.0,29.0,35447.41,12793.61,7120.0,26400.0,34700.0,46560.0,57400.0
Opel Insignia,Station wagon,2.0,3.0,1.0,37125.0,,37125.0,37125.0,37125.0,37125.0,37125.0
Opel Insignia,Station wagon,2.0,-,83.0,50995.65,33550.97,9500.0,20346.0,45000.0,80531.5,140000.0
Opel Insignia,Station wagon,3.0,1.0,305.0,103980.02,41252.59,10791.0,75185.0,96760.0,130089.0,232000.0
Opel Insignia,Station wagon,3.0,2.0,38.0,89875.53,43779.44,31000.0,56220.0,81030.0,115625.0,200000.0
Opel Insignia,Station wagon,3.0,3.0,1.0,67000.0,,67000.0,67000.0,67000.0,67000.0,67000.0
Opel Insignia,Station wagon,3.0,-,93.0,85331.02,32207.78,25259.0,63000.0,83000.0,101000.0,166500.0
Opel Insignia,Van,1.0,1.0,1.0,23726.0,,23726.0,23726.0,23726.0,23726.0,23726.0
Renault Clio,Compact,0.0,0.0,8.0,8.88,6.22,0.0,7.75,10.0,10.0,20.0
Renault Clio,Compact,0.0,1.0,55.0,361.75,980.54,1.0,32.5,150.0,150.0,5000.0


**There does not appear to be a significant relationship between "previous_owners" and the price/km columns, except for Renault Duster. It seems that the missing "previous_owners" values of Renault Duster should be "0".**

In [137]:
cond = (df["make_model"]=="Renault Duster") & (df["previous_owners"] == "-")
df.loc[cond, "previous_owners"] = 0.0

In [138]:
df["previous_owners"] = df["previous_owners"].replace("-", np.nan)

In [139]:
fill(df, "make_model", "age", "previous_owners", "ffill")

Number of NaN :  0
------------------
1.00    14153
2.00     1172
0.00      563
3.00       29
4.00        2
Name: previous_owners, dtype: int64


## 18. warranty_new

In [140]:
summary("warranty_new")

Column name              :  warranty_new
--------------------------------
Total missing value      :  11066
Percentage of missing    :  69.51
Number of unique values  :  41

Unique values  : 
[nan '12 months' '3 months' '6 months' '24 months' '50 months' '48 months'
 '36 months' '20 months' '23 months' '60 months' '13 months' '26 months'
 '46 months' '47 months' '49 months' '18 months' '56 months' '16 months'
 '22 months' '28 months' '10 months' '19 months' '25 months' '11 months'
 '72 months' '2 months' '1 months' '4 months' '8 months' '7 months'
 '15 months' '17 months' '45 months' '14 months' '9 months' '65 months'
 '21 months' '34 months' '33 months' '40 months' '30 months']

Number of values
NaN          11066
12 months     2594
24 months     1118
60 months      401
36 months      279
48 months      149
6 months       125
72 months       59
3 months        33
23 months       11
18 months       10
20 months        7
25 months        6
2 months         5
50 months        4
26 months

**Too many values are missing (about 70%). Filling them would mean manufacturing most of the column, and I don't think it would have much of an impact on the ML model. We can drop the column.**
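The same reasoning can be expressed as a simple rule: compute the share of missing values per column and flag the columns above a cut-off. The sketch below uses a made-up frame, and the 60% `threshold` is an assumed value chosen for illustration, not one stated in this study.

```python
import numpy as np
import pandas as pd

# Toy frame (made up): one mostly empty column, one mostly complete column.
toy = pd.DataFrame({
    "warranty": [np.nan, "12 months", np.nan, np.nan],
    "km": [10.0, np.nan, 30.0, 40.0],
})

# Percentage of missing values per column.
missing_pct = toy.isnull().mean().mul(100).round(2)

threshold = 60  # assumed cut-off, in percent
to_drop = missing_pct[missing_pct > threshold].index.tolist()

print(missing_pct)
print(to_drop)
```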

In [141]:
df.drop("warranty_new", axis=1, inplace=True)

## 19. km

In [142]:
summary("km")

Column name              :  km
--------------------------------
Total missing value      :  1024
Percentage of missing    :  6.43
Number of unique values  :  6689

Unique values  : 
[5.6013e+04 8.0000e+04 8.3450e+04 ... 2.8640e+03 1.5060e+03 5.7000e+01]

Number of values
10.00       1045
NaN         1024
1.00         367
5.00         170
50.00        148
            ... 
67469.00       1
43197.00       1
10027.00       1
35882.00       1
57.00          1
Name: km, Length: 6690, dtype: int64


In [143]:
df.groupby("age").km.describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
age,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0.0,3450.0,1647.36,4828.82,0.0,10.0,14.0,1432.75,127022.0
1.0,4487.0,18032.47,11049.82,1.0,9990.0,17869.0,25076.0,136000.0
2.0,3279.0,41730.52,28272.68,1.0,21565.0,34720.0,54669.0,317000.0
3.0,3679.0,77450.06,39145.12,10.0,48000.0,72945.0,99950.0,291800.0


In [144]:
df.groupby(["age", "previous_owners"]).km.describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,std,min,25%,50%,75%,max
age,previous_owners,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0.0,0.0,297.0,1222.21,5752.69,0.0,1.0,10.0,47.0,82400.0
0.0,1.0,3139.0,1666.53,4549.01,1.0,10.0,15.0,1510.0,127022.0
0.0,2.0,14.0,6368.36,20236.33,8.0,15.0,97.0,875.0,76300.0
1.0,0.0,13.0,27350.08,13275.72,4500.0,16749.0,30800.0,35616.0,50000.0
1.0,1.0,4361.0,17994.32,11045.15,1.0,9931.0,17869.0,25044.0,136000.0
1.0,2.0,113.0,18433.08,10592.47,10.0,11000.0,17719.0,25134.0,50000.0
2.0,0.0,16.0,40998.56,22780.88,9459.0,21250.0,38308.0,62082.25,86000.0
2.0,1.0,2792.0,43136.51,29127.58,1.0,22847.5,36020.5,57963.75,317000.0
2.0,2.0,457.0,33506.26,21038.55,1.0,17768.0,26820.0,47219.0,116184.0
2.0,3.0,13.0,31474.92,19173.78,4000.0,18890.0,27355.0,43847.0,66000.0


In [145]:
df.groupby(["age", "previous_owners"]).km.transform("mean") 

0       60064.31
1       43136.51
2       80920.81
3       80920.81
4       80920.81
          ...   
15914    1666.53
15915    1666.53
15916    1666.53
15917    1666.53
15918    1666.53
Name: km, Length: 15919, dtype: float64

In [146]:
df["km"] = df["km"].fillna(df.groupby(["age", "previous_owners"]).km.transform("mean"))

In [147]:
df.km.isnull().sum()

0
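For readers less familiar with `transform`, the following sketch repeats the same group-mean fill on a made-up frame. `transform("mean")` returns one value per row, aligned with the original index, which is what allows it to be passed straight to `fillna`.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two age groups with one missing km each.
toy = pd.DataFrame({
    "age": [0, 0, 0, 1, 1],
    "previous_owners": [1, 1, 1, 1, 1],
    "km": [10.0, np.nan, 30.0, 1000.0, np.nan],
})

# One mean per row, taken from the row's (age, previous_owners) group.
group_mean = toy.groupby(["age", "previous_owners"])["km"].transform("mean")

# Only the missing entries are replaced; observed values stay untouched.
toy["km"] = toy["km"].fillna(group_mean)
print(toy)
```

Since the km values within an age group are strongly skewed (see the gap between mean and median in the tables above), the group median would be a more outlier-resistant choice; that would be a change from what this study does.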

## 20. displacement 

In [148]:
df.groupby(["make_model", "body_type" ,"cylinders" ])["displacement"].describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,count,mean,std,min,25%,50%,75%,max
make_model,body_type,cylinders,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Audi A1,Compact,3.0,451.0,1060.36,149.89,999.0,999.0,999.0,999.0,1596.0
Audi A1,Compact,4.0,109.0,1506.73,166.32,999.0,1395.0,1498.0,1598.0,1984.0
Audi A1,Compact,8.0,1.0,999.0,,999.0,999.0,999.0,999.0,999.0
Audi A1,Coupe,3.0,1.0,999.0,,999.0,999.0,999.0,999.0,999.0
Audi A1,Sedans,3.0,895.0,1108.37,185.62,998.0,999.0,999.0,1422.0,1596.0
Audi A1,Sedans,4.0,277.0,1541.17,143.41,999.0,1395.0,1598.0,1598.0,1984.0
Audi A1,Station wagon,3.0,10.0,1252.8,218.44,999.0,999.0,1422.0,1422.0,1422.0
Audi A1,Station wagon,4.0,6.0,1462.67,104.83,1395.0,1395.0,1395.0,1547.25,1598.0
Audi A3,Compact,3.0,17.0,1057.82,166.05,999.0,999.0,999.0,999.0,1499.0
Audi A3,Compact,4.0,136.0,1547.41,141.77,1395.0,1395.0,1598.0,1598.0,1968.0


**We found a relationship between the number of cylinders and displacement: as the number of cylinders increases, the displacement tends to increase.**

In [149]:
fill_mean(df,"make_model", "body_type" ,"cylinders", "displacement" )

Number of NaN :  0
------------------
1598.00     4761
999.00      2438
1398.00     1314
1399.00      749
1229.00      677
            ... 
2967.00        1
1856.00        1
16000.00       1
1662.37        1
1800.00        1
Name: displacement, Length: 114, dtype: int64


## 21. hp_kw

In [150]:
summary("hp_kw")

Column name              :  hp_kw
--------------------------------
Total missing value      :  88
Percentage of missing    :  0.55
Number of unique values  :  80

Unique values  : 
[ 66. 141.  85.  70.  92. 112.  60.  71.  67. 110.  93. 147.  86. 140.
  87.  nan  81.  82. 135. 132. 100.  96. 162. 150. 294. 228. 270. 137.
   9. 133.  77. 101.  78. 103.   1.  74. 118.  84.  88.  80.  76. 149.
  44.  51.  55.  52.  63.  40.  65.  75. 125. 120. 184. 239. 121. 143.
 191.  89. 195. 127. 122. 154. 155. 104. 123. 146.  90.  53.  54.  56.
 164.   4. 163.  57. 119. 165. 117. 115.  98. 168. 167.]

Number of values
85.00     2542
66.00     2122
81.00     1402
100.00    1308
110.00    1112
70.00      888
125.00     707
51.00      695
55.00      569
118.00     516
92.00      466
121.00     392
147.00     380
77.00      345
56.00      286
54.00      276
103.00     253
87.00      232
165.00     194
88.00      177
60.00      160
162.00      98
NaN         88
74.00       81
96.00       72
71.00       59

In [151]:
df.groupby(["make_model", "body_type" ,"cylinders" ])["hp_kw"].describe()  

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,count,mean,std,min,25%,50%,75%,max
make_model,body_type,cylinders,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Audi A1,Compact,3.0,450.0,73.0,8.07,60.0,70.0,70.0,85.0,85.0
Audi A1,Compact,4.0,110.0,94.18,16.59,60.0,85.0,92.0,92.0,147.0
Audi A1,Compact,8.0,1.0,70.0,,70.0,70.0,70.0,70.0,70.0
Audi A1,Coupe,3.0,1.0,70.0,,70.0,70.0,70.0,70.0,70.0
Audi A1,Sedans,3.0,896.0,72.5,8.43,60.0,66.0,70.0,85.0,85.0
Audi A1,Sedans,4.0,280.0,89.53,13.56,60.0,85.0,85.0,92.0,147.0
Audi A1,Station wagon,3.0,10.0,66.6,2.99,60.0,66.0,66.0,69.0,70.0
Audi A1,Station wagon,4.0,6.0,92.67,9.16,85.0,86.75,92.0,92.0,110.0
Audi A3,Compact,3.0,17.0,87.94,8.3,85.0,85.0,85.0,85.0,110.0
Audi A3,Compact,4.0,136.0,89.49,11.73,81.0,81.0,85.0,92.0,150.0


In [152]:
df.groupby(["make_model", "body_type" ,"gears" ])["hp_kw"].describe()  

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,count,mean,std,min,25%,50%,75%,max
make_model,body_type,gears,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Audi A1,Compact,5.0,279.0,69.1,6.27,60.0,66.0,70.0,70.0,86.0
Audi A1,Compact,6.0,83.0,85.75,7.31,71.0,85.0,85.0,85.0,147.0
Audi A1,Compact,7.0,202.0,82.92,15.69,66.0,70.0,85.0,85.0,147.0
Audi A1,Compact,8.0,1.0,70.0,,70.0,70.0,70.0,70.0,70.0
Audi A1,Coupe,5.0,1.0,70.0,,70.0,70.0,70.0,70.0,70.0
Audi A1,Sedans,5.0,527.0,70.25,7.59,60.0,66.0,70.0,70.0,87.0
Audi A1,Sedans,6.0,101.0,87.33,11.17,66.0,85.0,85.0,85.0,147.0
Audi A1,Sedans,7.0,451.0,81.35,12.98,66.0,70.0,85.0,85.0,141.0
Audi A1,Sedans,8.0,1.0,60.0,,60.0,60.0,60.0,60.0,60.0
Audi A1,Station wagon,5.0,10.0,67.5,7.01,60.0,66.0,66.0,69.0,85.0


**We couldn't find a clear pattern in "hp_kw" by looking at the "cylinders" and "gears" columns, so we will fill it using the mode of each "make_model" and "body_type" group.**
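As an illustration of a group-wise mode fill, here is a minimal sketch. The function `groupwise_mode_fill` and the toy values are hypothetical, and the notebook's own `fill` helper (defined earlier) may behave differently, for example in how it treats groups with no observed values.

```python
import numpy as np
import pandas as pd


def groupwise_mode_fill(frame, group_cols, target):
    """Fill missing `target` values with the most frequent value of each group.

    Groups without any observed value fall back to the overall mode.
    """
    def fill_with_mode(values):
        modes = values.mode()  # NaN is ignored by default
        return values.fillna(modes.iloc[0]) if not modes.empty else values

    filled = frame.groupby(group_cols)[target].transform(fill_with_mode)
    overall = frame[target].mode()
    return filled.fillna(overall.iloc[0]) if not overall.empty else filled


# Toy data (made up): group "B" has no observed hp_kw at all.
toy = pd.DataFrame({
    "make_model": ["A", "A", "A", "A", "B", "B"],
    "hp_kw": [85.0, 85.0, 66.0, np.nan, np.nan, np.nan],
})

toy["hp_kw"] = groupwise_mode_fill(toy, ["make_model"], "hp_kw")
print(toy)
```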

In [153]:
fill(df, "make_model", "body_type", "hp_kw", "mode")

Number of NaN :  0
------------------
85.00     2543
66.00     2124
81.00     1403
100.00    1314
110.00    1113
70.00      890
125.00     711
51.00      696
55.00      589
118.00     550
92.00      466
121.00     392
147.00     380
77.00      353
56.00      294
54.00      276
103.00     253
87.00      232
165.00     194
88.00      177
60.00      160
162.00      98
74.00       81
96.00       72
71.00       59
101.00      47
67.00       40
154.00      39
122.00      35
119.00      30
164.00      27
135.00      24
52.00       22
82.00       22
1.00        20
78.00       20
294.00      18
146.00      18
141.00      16
57.00       10
120.00       8
104.00       8
112.00       7
191.00       7
155.00       6
117.00       6
184.00       5
65.00        4
90.00        4
76.00        4
168.00       3
98.00        3
149.00       3
80.00        3
93.00        3
167.00       2
228.00       2
53.00        2
143.00       2
150.00       2
140.00       2
270.00       2
63.00        2
40.00        2
12

## 22. cylinders

In [154]:
summary("cylinders")

Column name              :  cylinders
--------------------------------
Total missing value      :  5680
Percentage of missing    :  35.68
Number of unique values  :  7

Unique values  : 
[ 3.  4. nan  8.  5.  1.  6.  2.]

Number of values
4.00    8105
NaN     5680
3.00    2104
5.00      22
6.00       3
8.00       2
2.00       2
1.00       1
Name: cylinders, dtype: int64


In [155]:
def group_displacement(x):
    if np.isnan(x):
        return x
    elif x < 1000 :
        return 1
    elif x < 1200 :
        return 2
    elif x < 1400 :
        return 3
    elif x < 1500 :
        return 4
    elif x < 1600 :
        return 5
    else:
        return 6

In [156]:
df["displacement_new"]= df.displacement.apply(group_displacement)
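The same binning can also be written with `pd.cut`. The sketch below is an alternative under the assumption that the bin edges should match `group_displacement` exactly (left-closed intervals, hence `right=False`); the sample values are made up.

```python
import numpy as np
import pandas as pd

# Same edges as group_displacement: <1000, <1200, <1400, <1500, <1600, rest.
bins = [-np.inf, 1000, 1200, 1400, 1500, 1600, np.inf]
labels = [1, 2, 3, 4, 5, 6]

displacement = pd.Series([999, 1000, 1199, 1398, 1499, 1598, 2967, np.nan])

# right=False makes each bin [lower, upper), matching the "x < edge" tests.
binned = pd.cut(displacement, bins=bins, labels=labels, right=False)

# Convert the categorical result to floats so missing values stay NaN.
binned_numeric = binned.astype(float)
print(binned_numeric.tolist())
```

Keeping the edges in one list makes it easier to adjust the bins later than editing a chain of `elif` branches.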

In [157]:
df.groupby(["make_model", "body_type", "displacement_new", ]).cylinders.describe().iloc[0:100]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,count,mean,std,min,25%,50%,75%,max
make_model,body_type,displacement_new,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Audi A1,Compact,1,385.0,3.02,0.26,3.0,3.0,3.0,3.0,8.0
Audi A1,Compact,2,5.0,3.2,0.45,3.0,3.0,3.0,3.0,4.0
Audi A1,Compact,3,44.0,4.0,0.0,4.0,4.0,4.0,4.0,4.0
Audi A1,Compact,4,78.0,3.18,0.39,3.0,3.0,3.0,3.0,4.0
Audi A1,Compact,5,42.0,3.98,0.15,3.0,4.0,4.0,4.0,4.0
Audi A1,Compact,6,8.0,4.0,0.0,4.0,4.0,4.0,4.0,4.0
Audi A1,Coupe,1,1.0,3.0,,3.0,3.0,3.0,3.0,3.0
Audi A1,Coupe,5,0.0,,,,,,,
Audi A1,Sedans,1,667.0,3.01,0.11,3.0,3.0,3.0,3.0,4.0
Audi A1,Sedans,2,6.0,3.0,0.0,3.0,3.0,3.0,3.0,3.0


**The displacement groups do not show a clear, consistent pattern for "cylinders".**

**Columns such as "displacement", "hp_kw" and "gears" already carry similar engine information, and we can use them in the ML model. We can therefore drop this column.**

In [158]:
df.drop(columns=["displacement_new", "cylinders"], inplace=True)

## 23. gears

In [159]:
summary("gears")

Column name              :  gears
--------------------------------
Total missing value      :  4712
Percentage of missing    :  29.6
Number of unique values  :  10

Unique values  : 
[nan  7.  6.  5.  8.  1.  2. 50.  9.  3.  4.]

Number of values
6.00     5822
NaN      4712
5.00     3239
7.00     1908
8.00      224
9.00        6
1.00        2
3.00        2
4.00        2
2.00        1
50.00       1
Name: gears, dtype: int64


**Most of the time, the number of gears depends on the car's make_model, body_type and gearing_type. So we can fill the missing values with the mode of the related group.**

In [160]:
df.groupby(["make_model", "body_type", "gearing_type", "gears"]).price.describe()[:50]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,count,mean,std,min,25%,50%,75%,max
make_model,body_type,gearing_type,gears,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Audi A1,Compact,Automatic,5.0,3.0,22184.33,3421.3,18497.0,20648.5,22800.0,24028.0,25256.0
Audi A1,Compact,Automatic,6.0,6.0,21038.33,4282.04,16430.0,18725.0,20920.0,21060.0,28860.0
Audi A1,Compact,Automatic,7.0,199.0,22059.25,3918.12,13990.0,18970.0,21790.0,24365.0,29181.0
Audi A1,Compact,Automatic,8.0,1.0,16880.0,,16880.0,16880.0,16880.0,16880.0,16880.0
Audi A1,Compact,Manual,5.0,277.0,16329.47,3040.93,9950.0,13990.0,15900.0,16940.0,22990.0
Audi A1,Compact,Manual,6.0,77.0,20538.3,2061.0,12550.0,19588.0,20881.0,21990.0,22989.0
Audi A1,Compact,Semi-automatic,7.0,3.0,24028.33,7208.44,17945.0,20047.5,22150.0,27070.0,31990.0
Audi A1,Coupe,Manual,5.0,1.0,13950.0,,13950.0,13950.0,13950.0,13950.0,13950.0
Audi A1,Sedans,Automatic,5.0,14.0,18078.57,1329.65,15200.0,17500.0,18900.0,18900.0,19300.0
Audi A1,Sedans,Automatic,6.0,11.0,22974.36,5983.73,14500.0,17540.0,22990.0,28840.0,29000.0


In [161]:
fill_most_freq(df, ["make_model", "body_type", "gearing_type"], "gears"  )

Number of NaN :  0
------------------
6.00     8615
5.00     4255
7.00     2810
8.00      225
9.00        6
1.00        2
3.00        2
4.00        2
2.00        1
50.00       1
Name: gears, dtype: int64


## 24. comfort_and_convenience

In [162]:
summary("comfort_and_convenience")

Column name              :  comfort_and_convenience
--------------------------------
Total missing value      :  920
Percentage of missing    :  5.78
Number of unique values  :  6198

Unique values  : 
['Air conditioning, Armrest, Automatic climate control, Cruise control, Electrical side mirrors, Hill Holder, Leather steering wheel, Light sensor, Multi-function steering wheel, Navigation system, Park Distance Control, Parking assist system sensors rear, Power windows, Rain sensor, Seat heating, Start-stop system'
 'Air conditioning, Automatic climate control, Hill Holder, Leather steering wheel, Lumbar support, Parking assist system sensors rear, Power windows, Start-stop system, Tinted windows'
 'Air conditioning, Cruise control, Electrical side mirrors, Hill Holder, Leather steering wheel, Multi-function steering wheel, Navigation system, Park Distance Control, Parking assist system sensors front, Parking assist system sensors rear, Power windows, Seat heating, Start-stop system'
 .

In [163]:
fill(df, "make_model", "body_type", "comfort_and_convenience", "mode")

Number of NaN :  0
------------------
Air conditioning, Electrical side mirrors, Hill Holder, Power windows                                                                                                                                                                                                                                                                                                                                                                                                                                                      388
Air conditioning, Armrest, Automatic climate control, Cruise control, Electrical side mirrors, Leather steering wheel, Light sensor, Lumbar support, Multi-function steering wheel, Navigation system, Park Distance Control, Parking assist system sensors front, Parking assist system sensors rear, Power windows, Rain sensor, Seat heating, Start-stop system                                                                                                 

In [164]:
# Car Comfort & Convenience Packages

premium = ["Electrical side mirrors", "Parking assist", "Air conditioning", "Hill Holder", "Power windows"]
premium_plus = ["Keyless central door lock", "Heads-up", "Massage seats", "heating", "Automatic climate control", "Heated"]

comfort_package = df['comfort_and_convenience'].apply(
    lambda sentence: "Premium Plus" if all(word in sentence for word in premium_plus)
    else ("Premium" if all(word in sentence for word in premium) else "Standard")
)
comfort_package.value_counts()

Standard        10786
Premium          5045
Premium Plus       88
Name: comfort_and_convenience, dtype: int64
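To see how the tiers are assigned, the sketch below applies the same keyword rule to three strings. The first two are taken from the unique values printed above; the third is made up to show a "Premium Plus" match. The function name `comfort_tier` is hypothetical.

```python
premium = ["Electrical side mirrors", "Parking assist", "Air conditioning",
           "Hill Holder", "Power windows"]
premium_plus = ["Keyless central door lock", "Heads-up", "Massage seats",
                "heating", "Automatic climate control", "Heated"]


def comfort_tier(sentence):
    """Substring-based rule: every keyword of a tier must appear in the text."""
    if all(word in sentence for word in premium_plus):
        return "Premium Plus"
    if all(word in sentence for word in premium):
        return "Premium"
    return "Standard"


samples = [
    # From the summary output above: contains all five premium keywords.
    "Air conditioning, Armrest, Automatic climate control, Cruise control, "
    "Electrical side mirrors, Hill Holder, Leather steering wheel, Light sensor, "
    "Multi-function steering wheel, Navigation system, Park Distance Control, "
    "Parking assist system sensors rear, Power windows, Rain sensor, "
    "Seat heating, Start-stop system",
    # From the summary output above: no "Electrical side mirrors".
    "Air conditioning, Automatic climate control, Hill Holder, "
    "Leather steering wheel, Lumbar support, Parking assist system sensors rear, "
    "Power windows, Start-stop system, Tinted windows",
    # Made-up string that contains every premium_plus keyword.
    "Automatic climate control, Heads-up display, Heated steering wheel, "
    "Keyless central door lock, Massage seats, Seat heating",
]

tiers = [comfort_tier(text) for text in samples]
print(tiers)
```

Because the match is case-sensitive and substring-based, "heating" and "Heated" are two separate requirements, which helps explain why so few rows reach "Premium Plus".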

In [165]:
df.drop("comfort_and_convenience", axis=1, inplace=True)

## 25. entertainment_and_media

In [166]:
summary("entertainment_and_media")

Column name              :  entertainment_and_media
--------------------------------
Total missing value      :  1374
Percentage of missing    :  8.63
Number of unique values  :  346

Unique values  : 
['Bluetooth, Hands-free equipment, On-board computer, Radio'
 'Bluetooth, Hands-free equipment, On-board computer, Radio, Sound system'
 'MP3, On-board computer'
 'Bluetooth, CD player, Hands-free equipment, MP3, On-board computer, Radio, Sound system, USB'
 'Bluetooth, CD player, Hands-free equipment, MP3, On-board computer, Radio, USB'
 'Bluetooth, Hands-free equipment, On-board computer, Radio, Sound system, USB'
 'Bluetooth, CD player, Hands-free equipment, On-board computer, Radio, Sound system, USB'
 'CD player, MP3, Radio' 'Radio' nan
 'CD player, Hands-free equipment, On-board computer, Radio, USB'
 'Bluetooth, On-board computer, Radio'
 'Bluetooth, CD player, Hands-free equipment, On-board computer, Radio'
 'Bluetooth, CD player, Hands-free equipment, MP3, On-board computer, Rad

In [167]:
fill(df, "make_model", "body_type", "entertainment_and_media", "mode")

Number of NaN :  0
------------------
Bluetooth, Hands-free equipment, On-board computer, Radio, USB                         1738
Bluetooth, Hands-free equipment, MP3, On-board computer, Radio, USB                    1134
Bluetooth, CD player, Hands-free equipment, MP3, On-board computer, Radio, USB         1010
On-board computer                                                                       615
Radio                                                                                   558
                                                                                       ... 
Bluetooth, CD player, MP3                                                                 1
CD player, USB                                                                            1
Bluetooth, CD player, Digital radio, Radio, USB                                           1
Bluetooth, CD player, Digital radio, MP3, On-board computer, Radio, Television, USB       1
Hands-free equipment, On-board computer, R

In [168]:
# Car Entertainment & Media Packages

media_plus = ["Digital radio", "Hands-free", "Television"]

entertainment_media_package = df['entertainment_and_media'].apply(
    lambda sentence: "Plus" if any(word in sentence for word in media_plus) else "Standard"
)
entertainment_media_package.value_counts()

Plus        10811
Standard     5108
Name: entertainment_and_media, dtype: int64

In [169]:
df.drop("entertainment_and_media", axis=1, inplace=True)

## 26. extras

In [170]:
summary("extras")

Column name              :  extras
--------------------------------
Total missing value      :  2962
Percentage of missing    :  18.61
Number of unique values  :  659

Unique values  : 
['Alloy wheels, Catalytic Converter, Voice Control'
 'Alloy wheels, Sport seats, Sport suspension, Voice Control'
 'Alloy wheels, Voice Control' 'Alloy wheels, Sport seats, Voice Control'
 'Alloy wheels, Sport package, Sport suspension, Voice Control'
 'Alloy wheels, Sport package, Sport seats, Sport suspension'
 'Alloy wheels' nan 'Alloy wheels, Shift paddles'
 'Alloy wheels, Sport seats'
 'Alloy wheels, Catalytic Converter, Sport package, Sport seats, Sport suspension, Voice Control'
 'Alloy wheels, Sport seats, Sport suspension'
 'Alloy wheels, Sport package, Sport seats' 'Alloy wheels, Sport package'
 'Alloy wheels, Catalytic Converter, Shift paddles, Voice Control'
 'Alloy wheels, Shift paddles, Sport package, Voice Control'
 'Alloy wheels, Catalytic Converter, Sport seats, Voice Control, Winter ty

In [171]:
fill(df, "make_model", "body_type", "extras", "mode")

Number of NaN :  0
------------------
Alloy wheels                                                                                                     5786
Alloy wheels, Touch screen                                                                                        697
Roof rack                                                                                                         596
Alloy wheels, Voice Control                                                                                       582
Alloy wheels, Touch screen, Voice Control                                                                         544
                                                                                                                 ... 
Alloy wheels, Catalytic Converter, Shift paddles, Sport package, Sport seats, Sport suspension, Voice Control       1
Alloy wheels, Catalytic Converter, Roof rack, Sport package, Sport seats, Trailer hitch                             1
Alloy wheels, Cata

In [172]:
df.extras.str.count(",").add(1)

0        3
1        4
2        2
3        3
4        4
        ..
15914    2
15915    3
15916    1
15917    2
15918    2
Name: extras, Length: 15919, dtype: int64

In [173]:
df["extras"].apply(lambda x: x.count(",")).add(1)

0        3
1        4
2        2
3        3
4        4
        ..
15914    2
15915    3
15916    1
15917    2
15918    2
Name: extras, Length: 15919, dtype: int64

In [174]:
df["Num_of_Extras"] = df["extras"].apply(lambda x: x.count(",")).add(1)
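The count relies on the items being comma-separated: the number of items is the number of commas plus one. The sketch below shows this on made-up rows and also illustrates a difference between the two variants above: the vectorized `.str.count` returns NaN for a missing row, while the `apply` version would raise an error on it. Treating a missing row as zero extras is an assumption made here for illustration; in this study the column was filled with the mode before counting.

```python
import numpy as np
import pandas as pd

# Toy rows (made up); the last one is missing.
extras = pd.Series([
    "Alloy wheels, Catalytic Converter, Voice Control",
    "Alloy wheels",
    np.nan,
])

# commas + 1 = number of listed items; a missing row stays NaN until fillna.
num_extras = extras.str.count(",").add(1).fillna(0).astype(int)
print(num_extras.tolist())
```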

In [175]:
df.drop("extras", axis=1, inplace=True)

## 27. safety_and_security

In [176]:
summary("safety_and_security")

Column name              :  safety_and_security
--------------------------------
Total missing value      :  982
Percentage of missing    :  6.17
Number of unique values  :  4443

Unique values  : 
['ABS, Central door lock, Daytime running lights, Driver-side airbag, Electronic stability control, Fog lights, Immobilizer, Isofix, Passenger-side airbag, Power steering, Side airbag, Tire pressure monitoring system, Traction control, Xenon headlights'
 'ABS, Central door lock, Central door lock with remote control, Daytime running lights, Driver-side airbag, Electronic stability control, Head airbag, Immobilizer, Isofix, Passenger-side airbag, Power steering, Side airbag, Tire pressure monitoring system, Traction control, Xenon headlights'
 'ABS, Central door lock, Daytime running lights, Driver-side airbag, Electronic stability control, Immobilizer, Isofix, Passenger-side airbag, Power steering, Side airbag, Tire pressure monitoring system, Traction control'
 ...
 'ABS, Adaptive headlight

In [177]:
fill(df, "make_model", "body_type", "safety_and_security", "mode")

Number of NaN :  0
------------------
ABS, Central door lock, Daytime running lights, Driver-side airbag, Electronic stability control, Fog lights, Immobilizer, Isofix, Passenger-side airbag, Power steering, Side airbag, Tire pressure monitoring system, Traction control                                                                                                                                           729
ABS, Central door lock, Daytime running lights, Driver-side airbag, Electronic stability control, Immobilizer, Isofix, Passenger-side airbag, Power steering, Side airbag, Tire pressure monitoring system, Traction control                                                                                                                                                       480
ABS, Central door lock, Daytime running lights, Driver-side airbag, Electronic stability control, Fog lights, Immobilizer, Isofix, LED Daytime Running Lights, Passenger-side airbag, Power steering, Side airbag, T
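
The `fill` helper is defined earlier in the notebook and its exact implementation may differ. As a rough sketch of the idea behind a group-wise mode fill, assuming it replaces each NaN with the most frequent value among rows sharing the same `make_model` and `body_type`, something like the following could be used. The function name `fill_with_group_mode` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_group_mode(frame, group_cols, target):
    """Return `target` with NaN replaced by the most frequent value of its group.

    Hypothetical helper for illustration. Falls back to the overall mode
    when a group has no observed value at all.
    """
    def _mode_or_nan(s):
        m = s.mode()  # mode() ignores NaN by default
        return m.iloc[0] if not m.empty else np.nan

    # Broadcast each group's mode back onto the rows of that group
    group_mode = frame.groupby(group_cols)[target].transform(_mode_or_nan)
    filled = frame[target].fillna(group_mode)

    overall = frame[target].mode()
    if not overall.empty:
        filled = filled.fillna(overall.iloc[0])
    return filled

# Made-up rows, not taken from the dataset
demo = pd.DataFrame({
    "make_model": ["Audi A1", "Audi A1", "Audi A1", "Opel Astra", "Opel Astra"],
    "body_type": ["Sedans", "Sedans", "Sedans", "Station wagon", "Station wagon"],
    "safety_and_security": ["ABS, Isofix", "ABS, Isofix", np.nan, "ABS", np.nan],
})
demo["safety_and_security"] = fill_with_group_mode(
    demo, ["make_model", "body_type"], "safety_and_security"
)
print(demo["safety_and_security"].tolist())
```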

In [178]:
# Car Safety & Security Packages
# "premium" is checked first, so a row matching both lists is labelled Premium.

premium = ["Tire pressure", "Traction control", "Daytime running lights", "LED Headlight"]
premium_plus = ["Emergency brake assistant", "Electronic stability control"]

def safety_package(features):
    if any(word in features for word in premium):
        return "Safety Premium Package"
    plus = any(word in features for word in premium_plus)
    return "Safety Premium Plus Package" if plus else "Safety Standard Package"

safety_security_package = df['safety_and_security'].apply(safety_package)
safety_security_package.value_counts()

Safety Premium Package         14621
Safety Premium Plus Package      798
Safety Standard Package          500
Name: safety_and_security, dtype: int64
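
Because the premium keywords are tested before the premium plus keywords, the order of the checks decides the label for any row that matches both lists. The notebook does not state which ordering is intended, so the sketch below only illustrates the difference on one made-up feature string; `package_plus_first` is a hypothetical variant, not the author's method.

```python
premium = ["Tire pressure", "Traction control", "Daytime running lights", "LED Headlight"]
premium_plus = ["Emergency brake assistant", "Electronic stability control"]

def package_premium_first(features):
    """Same order as the cell above: premium keywords are tested first."""
    if any(word in features for word in premium):
        return "Safety Premium Package"
    if any(word in features for word in premium_plus):
        return "Safety Premium Plus Package"
    return "Safety Standard Package"

def package_plus_first(features):
    """Hypothetical variant: premium plus keywords are tested first."""
    if any(word in features for word in premium_plus):
        return "Safety Premium Plus Package"
    if any(word in features for word in premium):
        return "Safety Premium Package"
    return "Safety Standard Package"

# Made-up feature string that matches both keyword lists
example = "ABS, Electronic stability control, Traction control"
label_a = package_premium_first(example)
label_b = package_plus_first(example)
print(label_a)
print(label_b)
```

With the first ordering, a car can only be labelled Premium Plus if it has none of the premium keywords, which may explain why that class is comparatively small in the counts above.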

In [179]:
df.drop("safety_and_security", axis=1, inplace=True)

## Dropping Unneeded Columns

In [180]:
df.drop(columns=["url", 
                 'short_description',
                 'warranty',
                 'full_service',
                 'non_smoking_vehicle',
                 'description'
                ], inplace=True )

In [181]:
df.head(3).T

Unnamed: 0,0,1,2
make_model,Audi A1,Audi A1,Audi A1
body_type,Sedans,Sedans,Sedans
price,15770,14500,14640
km,56013.00,80000.00,83450.00
type,Used,Used,Used
inspection_new,Yes,No,No
body_color,Black,Red,Black
paint_type,Metallic,Metallic,Metallic
nr_of_doors,5.00,3.00,4.00
nr_of_seats,5.00,4.00,4.00


In [182]:
df.to_csv("filled_missing_scout.csv", index=False)
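
Before handing the file to the next notebook, it can be worth confirming that no NaN values remain and that the CSV reads back with the same shape. The following is a minimal sketch on a made-up two-row frame, writing to an in-memory buffer instead of `filled_missing_scout.csv`; the `demo` frame is hypothetical.

```python
import io
import pandas as pd

# Made-up stand-in for the cleaned frame
demo = pd.DataFrame({
    "make_model": ["Audi A1", "Audi A1"],
    "km": [56013.0, 80000.0],
})

# Count the remaining missing values per column
missing_per_column = demo.isna().sum()
assert missing_per_column.sum() == 0, "unexpected NaN left in the frame"

# Round trip through CSV (in memory here) and compare shapes
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.shape == demo.shape)
```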

**We have finished filling in the missing values. In the next notebook, we will look at outliers.**