# WELCOME!

## Introduction
Welcome to the "***AutoScout Data Analysis Project***". This is the capstone project of the ***Data Analysis*** module. The **Auto Scout** data used in this project was scraped from an online car trading company in 2019 and contains many features of 9 different car models. In this project, you will have the opportunity to apply many commonly used techniques for Data Cleaning and Exploratory Data Analysis, using Python libraries such as NumPy, Pandas, Matplotlib, Seaborn, and SciPy, to clean and analyze the dataset.

The project consists of 3 parts:
* The first part is about 'data cleaning'. It deals with Incorrect Headers, Incorrect Format, Anomalies, and Dropping useless columns.
* The second part is about 'filling data'. It deals with Missing Values. Categorical to numeric transformation is also done.
* The third part is about 'handling outliers of data' via Visualisation libraries. Some insights are extracted.

# PART- 2 `( Handling Missing Values )`

# Missing Values & Outliers

- # Handling Missing Values

**Missing value handling methods**

 1. <b>Deleting Rows</b> ---> if a row has more than 70-75% of its values missing

 2. <b>Replacing With Mean/Median/Mode (Imputation)</b> ---> mean and median can be applied to a feature that holds numeric data; the mode also works for categorical features

 3. <b>Assigning a Unique Category</b> ---> if a categorical feature has a definite number of classes, we can assign the missing values to an additional class

 4. <b>Predicting the Missing Values</b> ---> we can predict the nulls with the help of a machine learning algorithm such as linear regression

 5. <b>Using Algorithms Which Support Missing Values</b> ---> KNN is a machine learning algorithm that works on the principle of distance measures. It can be used when there are nulls in the dataset: a missing value is estimated from the K nearest neighbours, for example by taking their majority value
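The first three methods in the list above can be sketched on a tiny frame. This is only an illustration: the `toy` DataFrame, its values, the 70% threshold, and the `"Unknown"` label are assumptions made for the example, not part of the AutoScout data.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values (not taken from the AutoScout dataset).
toy = pd.DataFrame({
    "hp": [66.0, np.nan, 85.0, np.nan],
    "paint": ["Metallic", None, "Metallic", None],
    "note": [np.nan, np.nan, np.nan, "x"],
})

# 1. Deleting rows: keep only rows with at most 70% missing values.
row_missing_share = toy.isnull().mean(axis=1)
kept = toy[row_missing_share <= 0.70]

# 2. Imputation: fill a numeric feature with its median.
hp_filled = kept["hp"].fillna(kept["hp"].median())

# 3. Unique category: give missing categorical values their own class.
paint_filled = kept["paint"].fillna("Unknown")

print(kept.index.tolist())
print(hp_filled.tolist())
print(paint_filled.tolist())
```

The second row is dropped because all of its values are missing; the last row is kept because only two of its three values are missing.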

NaN, which stands for "not a number", is a special floating-point value used to represent any value that is undefined or unrepresentable.

For example, 0/0 is undefined as a real number and is, therefore, represented by NaN. The square root of a negative number is an imaginary number that cannot be represented as a real number, so, it is represented by NaN.

NaN is also used as a placeholder for values that are missing or have not been computed yet. This is how pandas marks missing entries in numeric columns.
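A few properties of NaN are worth seeing in code, because they explain why missing values need dedicated helpers such as `isnull()`. The snippet below is a small sketch using NumPy and pandas; the variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

nan = np.nan

# NaN is a float, and it is not equal to anything, not even to itself.
is_float = isinstance(nan, float)
equals_itself = (nan == nan)

# Arithmetic with NaN propagates NaN.
propagated = nan + 1

# 0/0 on NumPy floats yields NaN (the warning is silenced here on purpose).
with np.errstate(invalid="ignore"):
    zero_over_zero = np.float64(0.0) / np.float64(0.0)

# pandas skips NaN in reductions by default and detects it with isnull().
s = pd.Series([1.0, nan, 3.0])
mean_skipping_nan = s.mean()
null_mask = s.isnull().tolist()

print(is_float, equals_itself, propagated, zero_over_zero)
print(mean_skipping_nan, null_mask)
```

Because `nan == nan` is false, a comparison such as `df["km"] == np.nan` never finds missing values; `isnull()` is the reliable test.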

In [97]:
import pandas as pd
import numpy as np
import statistics as stat
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats.mstats import winsorize
import re

import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")

%matplotlib inline
# %matplotlib notebook

pd.set_option('display.max_rows',1000)
pd.set_option('display.max_columns', 1000)

# pd.options.display.float_format = '{:.2f}'.format
pd.set_option('display.float_format', lambda x: '%.3f' % x)

plt.rcParams["figure.figsize"] = (10,6)

In [98]:
df = pd.read_csv("clean_scout.csv")

In [99]:
df.shape

(15919, 36)

In [100]:
df.head().T

Unnamed: 0,0,1,2,3,4
make_model,Audi A1,Audi A1,Audi A1,Audi A1,Audi A1
body_type,Sedans,Sedans,Sedans,Sedans,Sedans
price,15770,14500,14640,14500,16790
vat,VAT deductible,Price negotiable,VAT deductible,,
km,56013.000,80000.000,83450.000,73000.000,16200.000
hp,66.000,141.000,85.000,66.000,66.000
type,Used,Used,Used,Used,Used
previous_owners,2.000,,1.000,1.000,1.000
inspection_new,Yes,,,,Yes
warranty,,,,,


In [101]:
(df.isnull().sum()/df.isnull().count()*100).sort_values(ascending=False)

inspection_new        75.300
warranty              69.514
country_version       52.346
weight                43.809
drive_chain           43.081
previous_owners       41.711
paint_type            36.259
cylinders             35.681
upholstery_color      31.899
upholstery_type       30.599
gears                 29.600
vat                   28.350
emission_class        22.790
extras                18.607
cons_city             15.302
co2_emission          15.302
cons_country          14.926
cons_comb             12.771
age                   10.032
entertainment_media    8.631
km                     6.433
safety_security        6.169
nr_of_seats            6.137
comfort_convenience    5.779
body_color             3.750
displacement           3.116
nr_of_doors            1.332
hp                     0.553
body_type              0.377
type                   0.013
price                  0.000
model                  0.000
make                   0.000
fuel                   0.000
gearing_type  

In [102]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 36 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   make_model           15919 non-null  object 
 1   body_type            15859 non-null  object 
 2   price                15919 non-null  int64  
 3   vat                  11406 non-null  object 
 4   km                   14895 non-null  float64
 5   hp                   15831 non-null  float64
 6   type                 15917 non-null  object 
 7   previous_owners      9279 non-null   float64
 8   inspection_new       3932 non-null   object 
 9   warranty             4853 non-null   float64
 10  make                 15919 non-null  object 
 11  model                15919 non-null  object 
 12  body_color           15322 non-null  object 
 13  paint_type           10147 non-null  object 
 14  nr_of_doors          15707 non-null  float64
 15  nr_of_seats          14942 non-null 

In [103]:
miss_val = [col for col in df.columns if df[col].isnull().any()]  # columns with at least one missing value
len(miss_val)


30

In [104]:
miss_val

['body_type',
 'vat',
 'km',
 'hp',
 'type',
 'previous_owners',
 'inspection_new',
 'warranty',
 'body_color',
 'paint_type',
 'nr_of_doors',
 'nr_of_seats',
 'displacement',
 'cylinders',
 'weight',
 'drive_chain',
 'co2_emission',
 'emission_class',
 'comfort_convenience',
 'entertainment_media',
 'extras',
 'safety_security',
 'gears',
 'country_version',
 'age',
 'upholstery_type',
 'upholstery_color',
 'cons_comb',
 'cons_city',
 'cons_country']

In [105]:
def missing_values(df):
    missing_number = df.isnull().sum().sort_values(ascending=False)
    missing_percent = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending=False)
    missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=['Missing_Number', 'Missing_Percent'])
    return missing_values[missing_values['Missing_Number']>0]
missing_values(df)

Unnamed: 0,Missing_Number,Missing_Percent
inspection_new,11987,75.3
warranty,11066,69.514
country_version,8333,52.346
weight,6974,43.809
drive_chain,6858,43.081
previous_owners,6640,41.711
paint_type,5772,36.259
cylinders,5680,35.681
upholstery_color,5078,31.899
upholstery_type,4871,30.599



<div class="alert alert-warning" role="alert">
 Function for taking a first look at a column:
</div>


In [106]:
def first_looking_col(col):
    print("column name    : ", col)
    print("--------------------------------")
    print("per_of_nulls   : ", "%", round(df[col].isnull().sum()/df.shape[0]*100, 2))
    print("num_of_nulls   : ", df[col].isnull().sum())
    print("num_of_uniques : ", df[col].nunique())
    print(df[col].value_counts(dropna = False))

<div class="alert alert-warning" role="alert">
 Functions to fill the missing values :
</div>


In [107]:
def fill_most(df, group_col, col_name):
    '''Fills the missing values with the most frequent value (mode) of the relevant column, computed within each group of group_col (single-stage grouping)'''
    for group in list(df[group_col].unique()):
        cond = df[group_col]==group
        mode = list(df[cond][col_name].mode())
        if mode != []:
            df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[cond][col_name].mode()[0])
        else:
            df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[col_name].mode()[0])
    print("Number of NaN : ",df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))
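The same single-stage idea can also be written without an explicit loop by using `groupby(...).transform(...)`. The sketch below is an alternative formulation, not the notebook's own function: `fill_most_grouped`, `group_mode`, and the `toy` frame are hypothetical names and data introduced for illustration, and the function returns a new Series instead of modifying the frame.

```python
import numpy as np
import pandas as pd

def fill_most_grouped(frame, group_col, col_name):
    """Return col_name with NaNs replaced by the mode of their group.

    Groups without any non-null value fall back to the overall mode.
    """
    def group_mode(values):
        modes = values.mode()          # mode() ignores NaN by default
        return modes.iloc[0] if not modes.empty else np.nan

    per_group = frame.groupby(group_col)[col_name].transform(group_mode)
    overall = frame[col_name].mode().iloc[0]
    return frame[col_name].fillna(per_group).fillna(overall)

# Toy data with made-up values.
toy = pd.DataFrame({
    "make_model": ["A", "A", "A", "B", "B", "C"],
    "nr_of_doors": [5.0, 5.0, np.nan, 3.0, np.nan, np.nan],
})
filled = fill_most_grouped(toy, "make_model", "nr_of_doors")
print(filled.tolist())
```

Group "C" has no known value at all, so its row receives the overall mode.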

In [108]:
def fill_prop(df, group_col, col_name):
    '''Fills the missing values with forward fill, then backward fill, within each group of group_col (single-stage grouping)'''
    for group in list(df[group_col].unique()):
        cond = df[group_col]==group
        df.loc[cond, col_name] = df.loc[cond, col_name].ffill().bfill()
    # Groups that are entirely NaN are still empty here, so finish with a fill over the whole column
    df[col_name] = df[col_name].ffill().bfill()
    print("Number of NaN : ",df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))

In [109]:
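def fill(df, group_col1, group_col2, col_name, method): # method can be "mode" or "median" or "ffill"
    '''Fills the missing values of col_name with two-stage grouping: first within (group_col1, group_col2), then within group_col1, and finally over the whole column'''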
    if method == "mode":
        for group1 in list(df[group_col1].unique()):
            for group2 in list(df[group_col2].unique()):
                cond1 = df[group_col1]==group1
                cond2 = (df[group_col1]==group1) & (df[group_col2]==group2)
                mode1 = list(df[cond1][col_name].mode())
                mode2 = list(df[cond2][col_name].mode())
                if mode2 != []:
                    df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[cond2][col_name].mode()[0])
                elif mode1 != []:
                    df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[cond1][col_name].mode()[0])
                else:
                    df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[col_name].mode()[0])
                
    elif method == "median":
        for group1 in list(df[group_col1].unique()):
            for group2 in list(df[group_col2].unique()):
                cond1 = df[group_col1]==group1
                cond2 = (df[group_col1]==group1) & (df[group_col2]==group2)
                df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[cond2][col_name].median()).fillna(df[cond1][col_name].median()).fillna(df[col_name].median())
                
    elif method == "ffill":
        # Stage 1: forward/backward fill within each (group_col1, group_col2) pair
        for group1 in list(df[group_col1].unique()):
            for group2 in list(df[group_col2].unique()):
                cond2 = (df[group_col1]==group1) & (df[group_col2]==group2)
                df.loc[cond2, col_name] = df.loc[cond2, col_name].ffill().bfill()

        # Stage 2: the same within each group_col1 value, for pairs that had no value at all
        for group1 in list(df[group_col1].unique()):
            cond1 = df[group_col1]==group1
            df.loc[cond1, col_name] = df.loc[cond1, col_name].ffill().bfill()

        # Stage 3: whatever is still missing is filled over the whole column
        df[col_name] = df[col_name].ffill().bfill()
    
    print("Number of NaN : ",df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))
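The "median" branch of the function above follows a fallback chain: the median of the finest group first, then of the coarser group, then of the whole column. A compact way to express that chain is sketched below with `groupby(...).transform("median")`. The name `fill_median_grouped` and the `toy` frame are assumptions for this example; unlike the loop version, the group medians here are always computed from the original (unfilled) values.

```python
import numpy as np
import pandas as pd

def fill_median_grouped(frame, group_cols, col_name):
    """Fill NaNs with the median of the finest available group, then coarser ones."""
    filled = frame[col_name]
    # Walk from the finest grouping (all columns) to the coarsest (first column only).
    for depth in range(len(group_cols), 0, -1):
        medians = frame.groupby(group_cols[:depth])[col_name].transform("median")
        filled = filled.fillna(medians)
    # Last resort: the median of the whole column.
    return filled.fillna(frame[col_name].median())

# Toy data with made-up values.
toy = pd.DataFrame({
    "make_model": ["A", "A", "A", "A", "B", "B", "C"],
    "body_type":  ["S", "S", "S", "W", "S", "W", "S"],
    "co2_emission": [100.0, 120.0, np.nan, np.nan, np.nan, 200.0, np.nan],
})
result = fill_median_grouped(toy, ["make_model", "body_type"], "co2_emission")
print(result.tolist())
```

Row 2 is filled from its (A, S) pair, rows 3 and 4 from their `make_model` group, and the last row from the overall median.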

# 1 - vat

In [110]:
first_looking_col("vat")

column name    :  vat
--------------------------------
per_of_nulls   :  % 28.35
num_of_nulls   :  4513
num_of_uniques :  2
VAT deductible      10980
NaN                  4513
Price negotiable      426
Name: vat, dtype: int64


In [111]:
df["vat"] = df["vat"].ffill().bfill()  # forward fill, then backward fill for a leading gap

In [112]:
df.vat.ffill().bfill().value_counts(dropna=False)

VAT deductible      15048
Price negotiable      871
Name: vat, dtype: int64

In [113]:
df.vat.ffill().value_counts(dropna=False)

VAT deductible      15048
Price negotiable      871
Name: vat, dtype: int64

In [18]:
df.vat.bfill().value_counts(dropna=False)

VAT deductible      15048
Price negotiable      871
Name: vat, dtype: int64

In [19]:
df.vat.value_counts(dropna=False)

VAT deductible      15048
Price negotiable      871
Name: vat, dtype: int64


<div class="alert alert-success" role="alert">
   "vat" does not appear to be related to any other column, so a plain forward/backward fill is used.
</div>
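How forward and backward fill behave is easiest to see on a short Series. The values below are made up for illustration. Note that both methods copy a neighbouring row's value, so the result depends on the row order of the data.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, "VAT deductible", np.nan, "Price negotiable", np.nan])

# Forward fill copies the last seen value downwards; a leading NaN has nothing to copy from.
forward = s.ffill()

# Chaining a backward fill covers the leading gap as well.
both = s.ffill().bfill()

print(forward.tolist())
print(both.tolist())
```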

# 2 - age

In [20]:
first_looking_col("age")

column name    :  age
--------------------------------
per_of_nulls   :  % 10.03
num_of_nulls   :  1597
num_of_uniques :  4
1.000    4522
3.000    3674
2.000    3273
0.000    2853
NaN      1597
Name: age, dtype: int64


In [21]:
df["age"] = df["age"].fillna("-")  # temporary placeholder so that the missing group shows up in groupby

In [22]:
df.age.value_counts(dropna=False)

1.0    4522
3.0    3674
2.0    3273
0.0    2853
-      1597
Name: age, dtype: int64

In [23]:
df.groupby("age").km.describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
age,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0.0,2706.0,2085.355,5365.881,1.0,10.0,50.0,3000.0,127022.0
1.0,4484.0,18035.239,11052.524,1.0,9990.0,17872.0,25078.5,136000.0
2.0,3272.0,41754.941,28295.748,1.0,21541.75,34752.0,54805.5,317000.0
3.0,3674.0,77442.521,39170.143,10.0,48000.0,72914.5,99950.0,291800.0
-,759.0,934.497,7416.244,0.0,5.0,10.0,10.0,89982.0


In [91]:
#df[df.age == "-"]["km"].value_counts(dropna=False)

In [25]:
df.loc[df.km < 10000, ["km","age"]].sample(10)

Unnamed: 0,km,age
13094,3744.0,2.000
14362,1600.0,1.000
3399,4300.0,2.000
5050,45.0,0.000
4551,7860.0,1.000
5401,10.0,0.000
12584,1.0,0.000
2090,5660.0,0.000
14270,5.0,-
15820,1550.0,0.000


In [26]:
# Preview only (the result is not assigned): for cars with km < 10000, show "age" with the "-" placeholder replaced by 0 (zero)

df.loc[df.km < 10000,'age'].replace('-', 0)

16      1.000
23      1.000
24      1.000
68      2.000
69      1.000
         ... 
15911   0.000
15913   0.000
15915   0.000
15916   0.000
15917   0.000
Name: age, Length: 4711, dtype: float64


<div class="alert alert-success" role="alert">
   Fill the missing "age" values according to the "km" value of each car.
</div>
   

In [27]:
cond1 = (df['km'] < 10000)
cond2 = ((df['km'] >= 10000) & (df['km'] < 28000))
cond3 = ((df['km'] >= 28000) & (df['km'] < 50000))
cond4 = (df['km'] >= 50000)

In [28]:
df.loc[cond1,'age'] = df.loc[cond1,'age'].replace('-', 0)
df.loc[cond2,'age'] = df.loc[cond2,'age'].replace('-', 1)
df.loc[cond3,'age'] = df.loc[cond3,'age'].replace('-', 2)
df.loc[cond4,'age'] = df.loc[cond4,'age'].replace('-', 3)
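The four conditions above can also be evaluated in a single step with `numpy.select`, which picks the first matching choice per row. The sketch below reuses the notebook's km thresholds on a made-up `toy` frame; it keeps NaN as the missing marker instead of the "-" placeholder, which is a deliberate simplification for the example.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; NaN marks a missing age or km.
toy = pd.DataFrame({
    "km":  [50.0, 15000.0, 30000.0, 90000.0, np.nan, 20000.0],
    "age": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
})

conditions = [
    toy["km"] < 10000,
    (toy["km"] >= 10000) & (toy["km"] < 28000),
    (toy["km"] >= 28000) & (toy["km"] < 50000),
    toy["km"] >= 50000,
]
choices = [0.0, 1.0, 2.0, 3.0]

# Rows that match no condition (km is NaN) get the default, NaN.
age_from_km = pd.Series(np.select(conditions, choices, default=np.nan), index=toy.index)

# Only missing ages are filled; known ages are left untouched.
toy["age"] = toy["age"].fillna(age_from_km)
print(toy["age"].tolist())
```

A row whose "km" is missing cannot be assigned an age this way, which mirrors the situation discussed in the cells that follow.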

In [29]:
df.groupby('age').km.mean()

age
0.0    1647.363
1.0   18035.130
2.0   41748.577
3.0   77450.063
-           NaN
Name: km, dtype: float64

In [30]:
df["km"].isnull().sum()  

1024


<div class="alert alert-success" role="alert">
   We used the "km" column to fill the missing values of the "age" column, but "km" itself has missing values.
</div>

In [31]:
df.age.value_counts(dropna=False)

1.0    4528
3.0    3679
0.0    3597
2.0    3277
-       838
Name: age, dtype: int64

<div class="alert alert-success" role="alert">
   Wherever "km" is known, "age" no longer has missing values; the remaining "-" entries are rows where "km" is missing too.
</div>

In [90]:
#df.groupby(['make_model', 'age']).km.describe()

<div class="alert alert-success" role="alert">
   If we check against "price", we still have missing values for "age".
</div>

In [89]:
#df.groupby(['make_model',"body_type", 'age']).price.describe()

<div class="alert alert-success" role="alert">
   Okay, let's fill all remaining missing values of the "age" column with 0.
</div>

In [34]:
df["age"] = df["age"].replace("-", 0).astype(float)  # no placeholders are left, so the column can be numeric again

In [35]:
df.groupby('age').km.mean()

age
0.000    1647.363
1.000   18035.130
2.000   41748.577
3.000   77450.063
Name: km, dtype: float64

In [36]:
df["age"].value_counts(dropna=False)

1.000    4528
0.000    4435
3.000    3679
2.000    3277
Name: age, dtype: int64

# 3 - upholstery_type

In [37]:
first_looking_col("upholstery_type")

column name    :  upholstery_type
--------------------------------
per_of_nulls   :  % 30.6
num_of_nulls   :  4871
num_of_uniques :  5
Cloth           8423
NaN             4871
Part leather    1499
Full leather    1009
Velour            60
alcantara         57
Name: upholstery_type, dtype: int64


In [38]:
df["upholstery_type"] = df["upholstery_type"].replace({"Velour": "Cloth", "alcantara": "Part/Full Leather", "Part leather": "Part/Full Leather", "Full leather": "Part/Full Leather"})

In [39]:
df.upholstery_type.value_counts(dropna=False)

Cloth                8483
NaN                  4871
Part/Full Leather    2565
Name: upholstery_type, dtype: int64

In [88]:
#df.groupby(["make_model", "body_type", "upholstery_type"])["make_model", "body_type", "upholstery_type"].head()

In [41]:
fill(df, "make_model", "body_type", "upholstery_type", "ffill")

Number of NaN :  0
------------------
Cloth                12267
Part/Full Leather     3652
Name: upholstery_type, dtype: int64


# 4 - upholstery_color

In [42]:
# df.drop("upholstery_color", axis=1, inplace=True)

# 5 - cons_comb

In [87]:
# first_looking_col("cons_comb")

In [44]:
cons_comb = (df["cons_country"] + df["cons_city"])/2

In [45]:
df["cons_comb"] = df["cons_comb"].fillna(cons_comb)
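The two cells above approximate the combined consumption as the average of the country and city values and use it only where "cons_comb" is missing. A minimal sketch on made-up numbers shows what happens in each case; the `toy` frame is an assumption for the example.

```python
import numpy as np
import pandas as pd

# Toy data with made-up consumption values (l/100 km).
toy = pd.DataFrame({
    "cons_city":    [6.0, 5.0, np.nan],
    "cons_country": [4.0, 4.0, 4.0],
    "cons_comb":    [np.nan, 4.4, np.nan],
})

# The average is NaN whenever one of the two inputs is NaN.
average = (toy["cons_country"] + toy["cons_city"]) / 2

# fillna only touches the missing entries, so existing values are kept.
filled = toy["cons_comb"].fillna(average)
print(filled.tolist())
```

The first row is filled with the average, the second keeps its original value, and the third stays missing because "cons_city" is unknown.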

In [86]:
# df["cons_comb"].value_counts(dropna=False)

In [47]:
df["cons_comb"] = df["cons_comb"].fillna("-")  # temporary placeholder for the remaining missing values

In [85]:
# df.groupby(["make_model", "body_type","cons_comb"]).price.describe()

In [49]:
df["cons_comb"] = df["cons_comb"].replace([0.0, 1.0, 1.2, 1.6, 10, 11, 13.8, 32.0, 33.0, 38.0, 40.0, 43.0, 46.0, 50.0, 51.0, 54.0, 55.0, "-"], np.nan)  # set the listed outlying values and the "-" placeholder back to NaN

In [83]:
# df["cons_comb"].value_counts(dropna=False)

In [84]:
# df.groupby(["make_model", "body_type", "cons_comb"])["make_model", "body_type", "cons_comb"].head()

In [82]:
# fill(df, "make_model", "body_type", "cons_comb", "median")

# 6 - cons_country

In [53]:
# df.drop("cons_country", axis = 1, inplace = True)

# 7 - cons_city 

In [54]:
# df.drop("cons_city", axis = 1, inplace = True)

# 8 - body_type

In [56]:
df["body_type"].isna().value_counts()

False    15859
True        60
Name: body_type, dtype: int64

In [55]:
df.body_type.value_counts(dropna=False)

Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
NaN                60
Off-Road           56
Coupe              25
Convertible         8
Name: body_type, dtype: int64

In [81]:
# df.groupby(["body_type", "make_model"]).price.describe()

In [58]:
for model in df['make_model'].unique():
    cond = df['make_model']==model
    mode = list(df[cond]['body_type'].mode())
    if mode != []:
        df.loc[cond, 'body_type'] = df.loc[cond, 'body_type'].fillna(df[cond]['body_type'].mode()[0])
    else:
        df.loc[cond, 'body_type'] = df.loc[cond, 'body_type'].fillna(df['body_type'].mode()[0])

In [59]:
df.body_type.value_counts(dropna=False)

Sedans           7925
Station wagon    3563
Compact          3155
Van               809
Other             290
Transporter        88
Off-Road           56
Coupe              25
Convertible         8
Name: body_type, dtype: int64

# 9 - km

In [60]:
df.km.value_counts(dropna=False)

10.000       1045
NaN          1024
1.000         367
5.000         170
50.000        148
             ... 
67469.000       1
43197.000       1
10027.000       1
35882.000       1
57.000          1
Name: km, Length: 6690, dtype: int64

In [61]:
df.groupby("age").km.mean()

age
0.000    1647.363
1.000   18035.130
2.000   41748.577
3.000   77450.063
Name: km, dtype: float64

In [None]:
df.groupby("age").km.transform("mean").sample(10)

In [62]:
df["km"] = df["km"].fillna(df.groupby("age")["km"].transform("mean"))  # fill each missing km with the mean km of its age group
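`transform` returns a Series aligned with the original rows, which is what makes it usable inside `fillna`. The sketch below shows this on made-up numbers and also computes the group median for comparison: in the `describe()` output earlier in this section, age 0 has a mean of about 2085 km but a median of 50 km, so the two statistics can give very different fill values. The `toy` frame is an assumption for the example.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; one large value skews the age-0 group.
toy = pd.DataFrame({
    "age": [0, 0, 0, 0, 1, 1, 1],
    "km":  [10.0, 50.0, 6000.0, np.nan, 18000.0, 22000.0, np.nan],
})

# One value per row: the statistic of the row's own age group.
group_mean = toy.groupby("age")["km"].transform("mean")
group_median = toy.groupby("age")["km"].transform("median")

km_mean_filled = toy["km"].fillna(group_mean)
km_median_filled = toy["km"].fillna(group_median)

print(km_mean_filled.tolist())
print(km_median_filled.tolist())
```

Which statistic is more appropriate depends on the column; for a strongly skewed group the median may be the more representative choice.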

In [63]:
df["km"].value_counts(dropna=False)

10.000       1045
1647.363      985
1.000         367
5.000         170
50.000        148
             ... 
67469.000       1
43197.000       1
10027.000       1
35882.000       1
57.000          1
Name: km, Length: 6692, dtype: int64

In [None]:
df["km"] = df["km"].fillna(df.groupby("age")["km"].transform("mean"))

In [None]:
df["km"] = df["km"].ffill().bfill()   # NaN values still remained, so I filled them this way. It did not seem very sensible, but I could not find anything better.

In [None]:
# df.km.fillna(method="ffill").fillna(method="bfill").value_counts(dropna=False)

In [64]:
df.km.isnull().any()

False

In [None]:
# fill(df, "age", "type", "km", "mode")

In [None]:
# df.groupby(['make_model',"age", "type"]).km.describe()

In [None]:
# df.groupby(['make_model',"age", 'km'])['make_model',"age", 'km'].head(100)

In [None]:
# df.km.value_counts(dropna=False)

In [None]:
# cond = (df.km == '-') & (df.age == 0) & (df.type == "")
# df.loc[cond, 'km'] = df.loc[cond, 'km'].map({'-':'15000.0'})

In [None]:
# cond = (df.km == '-') & (df.age == 1)
# df.loc[cond, 'km'] = df.loc[cond, 'km'].map({'-':'40000.0'})

In [None]:
# cond = df.km == '-'
# df.loc[cond, 'km'] = df.loc[cond, 'km'].map({'-':'01/2016'})      # my failed attempts

# 10 - hp

In [80]:
# df.hp.value_counts(dropna=False)

In [None]:
# df.groupby(["make_model","body_type", "hp"]).price.describe()

In [None]:
# fill(df, "make_model", "body_type", "hp", "mode")    # This raises an error for me. I could not figure out why.

In [None]:
# df.hp.isnull().any()

# 11 - type

In [66]:
df.type.value_counts(dropna=False)

Used              11096
New                1650
Pre-registered     1364
Employee's car     1011
Demonstration       796
NaN                   2
Name: type, dtype: int64

In [None]:
df.groupby(["type","make_model","age"])["price"].describe()

In [67]:
df["type"] = df["type"].ffill()

In [68]:
df.type.value_counts(dropna=False)

Used              11097
New                1650
Pre-registered     1365
Employee's car     1011
Demonstration       796
Name: type, dtype: int64

# 12 - previous_owners

In [92]:
df.previous_owners.value_counts(dropna=False)

1.000    8294
NaN      6640
2.000     778
0.000     188
3.000      17
4.000       2
Name: previous_owners, dtype: int64

In [93]:
df['Previous_Owners'] = df['previous_owners']

In [94]:
df.age.value_counts(dropna=False)

1.000    4528
0.000    4435
3.000    3679
2.000    3277
Name: age, dtype: int64

In [96]:
for age in list(df["age"].unique()): # loop over the unique age values
    cond = df["age"]==age
    df.loc[cond, "Previous_Owners"] = df.loc[cond, "Previous_Owners"].ffill().bfill()

In [None]:
df.Previous_Owners.value_counts(dropna=False)

# 13 - inspection_new


In [None]:
df.inspection_new.value_counts(dropna=False)

In [None]:
df["inspection_new"] = df["inspection_new"].fillna("No")

In [None]:
df.inspection_new.value_counts(dropna=False)

In [None]:
df.groupby(["make_model", "body_type", "age", "inspection_new"]).price.describe()

In [None]:
df["inspection_new"] = df["inspection_new"].replace({"Yes": 1, "No": 0})
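The two steps for this column (treat a missing value as "No", then encode the labels as numbers) can be sketched with `map`. The example rests on the same assumption the notebook makes, namely that a missing entry means no new inspection; the Series `s` holds made-up values.

```python
import numpy as np
import pandas as pd

s = pd.Series(["Yes", np.nan, "Yes", np.nan])

# Step 1: a missing entry is assumed to mean "No".
labels = s.fillna("No")

# Step 2: encode the two labels as 1 and 0.
# map() turns any label that is not in the dictionary into NaN, which makes typos visible.
encoded = labels.map({"Yes": 1, "No": 0})
print(encoded.tolist())
```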

In [None]:
df.inspection_new.value_counts(dropna=False)

# 14 - warranty

In [None]:
df.warranty.value_counts(dropna=False)

In [None]:
df["warranty"] = df["warranty"].replace(np.nan, "-")

In [None]:
df.groupby(['make_model', 'age', 'warranty']).price.describe()

In [None]:
df.warranty.isnull().sum()/len(df.warranty)*100

In [None]:
df.loc[df.make == 'Audi', 'warranty'] = df.loc[df.make == 'Audi', 'warranty'].replace('-', 24)

In [None]:
df.warranty.isnull().sum()/len(df.warranty)*100

In [None]:
# df.drop("warranty", axis =1 , inplace=True)

# 15 - body_color

In [70]:
df.body_color.value_counts(dropna=False)

Black     3745
Grey      3505
White     3406
Silver    1647
Blue      1431
Red        957
NaN        597
Brown      289
Green      154
Beige      108
Yellow      51
Violet      18
Bronze       6
Orange       3
Gold         2
Name: body_color, dtype: int64

In [None]:
df["body_color"] = df["body_color"].fillna("-")

In [None]:
df.groupby(["make_model", "body_type", 'body_color']).price.describe()

In [None]:
df["body_color"].value_counts(dropna=False)

In [None]:
df["body_color"] = df["body_color"].replace("-", np.nan)

In [None]:
fill(df, "make_model", "body_type", "body_color", "ffill")

In [None]:
# df.drop("body_color", axis=1, inplace=True)

# 16 - paint_type

In [None]:
df["paint_type"].value_counts(dropna=False)

In [None]:
df["paint_type"] = df["paint_type"].fillna("-")

In [None]:
df.groupby(["make_model", "body_type", "age", 'paint_type']).price.describe()

In [None]:
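df["paint_type"] = df["paint_type"].replace("-", np.nan)  # turn the "-" placeholder back into NaN so that fill() has something to fill
fill(df, "make_model", "body_type", "paint_type", "ffill")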

# 17 - nr_of_doors

In [None]:
df.nr_of_doors.value_counts(dropna=False)

In [None]:
fill(df, "make_model", "body_type", "nr_of_doors", "mode")

# 18 - nr_of_seats

In [None]:
df.nr_of_seats.value_counts(dropna=False)

In [None]:
fill(df, "make_model", "body_type", "nr_of_seats", "mode")

In [None]:
df.nr_of_seats.value_counts(dropna=False)

# 19 - make_model 

In [72]:
df.make_model.value_counts()

# 20 - price 

In [75]:
df.price.value_counts()

14990    154
15990    151
10990    139
15900    106
17990    102
        ... 
17559      1
17560      1
17570      1
17575      1
39875      1
Name: price, Length: 2956, dtype: int64

In [None]:
df.groupby("make_model")[["price"]].mean()

In [None]:
df.groupby("make_model").body_type.value_counts()

# 21 - displacement

In [None]:
df.displacement.value_counts(dropna=False)

In [None]:
df.displacement.sample(10)

In [None]:
df.displacement.isnull().sum()

In [None]:
df.displacement.mean()

In [None]:
df.groupby(["make_model","body_type","displacement"]).price.describe()

In [None]:
fill(df,"make_model","body_type","displacement","mode")

# 22 - cylinders

In [None]:
df.cylinders.value_counts(dropna=False)

In [None]:
fill(df, "make_model", "body_type", "cylinders", "mode")

In [None]:
# df.drop("cylinders", axis = 1, inplace = True)

# 23 - weight

In [None]:
df.weight.value_counts(dropna=False)

In [None]:
df.groupby(["make_model", "body_type","weight"]).price.describe()

In [None]:
fill(df, "make_model", "body_type", "weight", "mode")

# 24 - drive_chain

In [None]:
df.drive_chain.value_counts(dropna=False)

In [None]:
df.groupby(["make_model", "body_type", "drive_chain"]).price.describe()

In [None]:
fill(df, "make_model", "body_type", "drive_chain", "mode")

In [None]:
df.drive_chain.value_counts(dropna=False)

# 25 - fuel

In [78]:
df.fuel.value_counts()

Benzine                              8198
Diesel (Particulate Filter)          4315
Diesel                               2984
Super 95 (Particulate Filter)         268
Gasoline (Particulate Filter)          77
LPG/CNG                                51
Liquid petroleum gas (LPG)             10
Super E10 95 (Particulate Filter)       7
Electric                                5
CNG (Particulate Filter)                3
Others (Particulate Filter)             1
Name: fuel, dtype: int64

# 26 - co2_emission

In [None]:
df.co2_emission.value_counts(dropna=False)

In [None]:
df.groupby(["make_model","body_type", "co2_emission"]).price.describe()

In [None]:
fill(df,"make_model","body_type","co2_emission", "median")

In [None]:
# df.drop("co2_emission", axis=1, inplace=True)

# 27 - comfort_convenience

In [None]:
df.comfort_convenience.value_counts(dropna=False)

In [None]:
fill(df,"make_model", "body_type", "comfort_convenience", "mode")

# 28 - entertainment_media

In [None]:
df.entertainment_media.value_counts(dropna=False).head(20)

In [None]:
fill(df, "make_model", "body_type","entertainment_media", "mode")

# 29 - extras

In [None]:
df.extras.value_counts(dropna=False)

In [None]:
fill(df,"make_model","body_type","extras", "mode")

# 30 - safety_security

In [None]:
df.safety_security.value_counts(dropna=False)

In [None]:
fill(df, "make_model", "body_type", "safety_security","mode")

# 31 - gears

In [None]:
df.gears.value_counts(dropna=False)

In [None]:
df.groupby(["make_model", "body_type", "gearing_type", "gears"]).price.describe()

In [None]:
df["gears"] = df["gears"].replace([1,2,3,4,9,50,"-"], np.nan)  # treat the rarest values as missing

In [None]:
# Our fill function only takes 2 grouping columns, but we have 3 here

# df.groupby(["make_model", "body_type", "gearing_type", "gears"]).price.describe()


In [None]:
df[(df["make_model"]=="Renault Clio") & (df["body_type"]=="Sedans") & (df["gearing_type"]=="Automatic")]["gears"].mode()

In [None]:
for group1 in list(df["make_model"].unique()):
    for group2 in list(df["body_type"].unique()):
        for group3 in list(df["gearing_type"].unique()):
            cond1 = df["make_model"]==group1
            cond2 = cond1 & (df["body_type"]==group2)
            cond3 = cond2 & (df["gearing_type"]==group3)
            mode1 = list(df[cond1]["gears"].mode())
            mode2 = list(df[cond2]["gears"].mode())
            mode3 = list(df[cond3]["gears"].mode())
            # Use the most specific group that has a mode, then fall back to coarser groups
            if mode3 != []:
                df.loc[cond3, "gears"] = df.loc[cond3, "gears"].fillna(mode3[0])
            elif mode2 != []:
                df.loc[cond3, "gears"] = df.loc[cond3, "gears"].fillna(mode2[0])
            elif mode1 != []:
                df.loc[cond3, "gears"] = df.loc[cond3, "gears"].fillna(mode1[0])
            else:
                df.loc[cond3, "gears"] = df.loc[cond3, "gears"].fillna(df["gears"].mode()[0])
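The three nested loops above hard-code the number of grouping columns. One way to generalise the fallback chain to any number of columns is sketched below with `groupby(...).transform(...)`. `fill_mode_hierarchical`, `group_mode`, and the `toy` frame are hypothetical names and data for this example; the group modes are computed from the original values only, so in edge cases the result may differ slightly from the loop, which updates the frame as it goes.

```python
import numpy as np
import pandas as pd

def fill_mode_hierarchical(frame, group_cols, col_name):
    """Fill NaNs with the mode of the finest group that has one, then coarser groups."""
    def group_mode(values):
        modes = values.mode()          # mode() ignores NaN by default
        return modes.iloc[0] if not modes.empty else np.nan

    filled = frame[col_name]
    # From all grouping columns down to the first one only.
    for depth in range(len(group_cols), 0, -1):
        per_group = frame.groupby(group_cols[:depth])[col_name].transform(group_mode)
        filled = filled.fillna(per_group)
    # Last resort: the mode of the whole column.
    return filled.fillna(frame[col_name].mode().iloc[0])

# Toy data with made-up values.
toy = pd.DataFrame({
    "make_model":   ["Clio"] * 8 + ["Astra"],
    "body_type":    ["Sedans"] * 5 + ["Station wagon"] * 2 + ["Sedans"] * 2,
    "gearing_type": ["Manual", "Manual", "Manual", "Automatic", "Automatic",
                     "Automatic", "Manual", "Manual", "Manual"],
    "gears":        [5.0, 5.0, np.nan, 6.0, np.nan, np.nan, 6.0, 5.0, np.nan],
})
gears_filled = fill_mode_hierarchical(
    toy, ["make_model", "body_type", "gearing_type"], "gears"
)
print(gears_filled.tolist())
```

Rows 2 and 4 are filled from their full three-column group, row 5 from its (make_model, body_type) pair, and the Astra row from the overall mode because no Astra value is known.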

In [None]:
df["gears"].value_counts(dropna=False)

# 32 - country_version

# 33 - emission_class

In [None]:
df.emission_class.value_counts(dropna=False)

In [None]:
df.groupby(["make_model","age","emission_class"]).price.describe()

In [None]:
fill(df, "age", "fuel", "emission_class", "ffill")

In [None]:
df.drop("emission_class", axis=1, inplace=True)

# 34 - make

In [76]:
df.model.value_counts()

A3          3097
A1          2614
Insignia    2598
Astra       2526
Corsa       2219
Clio        1839
Espace       991
Duster        34
A2             1
Name: model, dtype: int64

In [None]:
# df.drop("model", axis = 1, inplace = True) 

# 35 - model

In [77]:
df.model.value_counts()

A3          3097
A1          2614
Insignia    2598
Astra       2526
Corsa       2219
Clio        1839
Espace       991
Duster        34
A2             1
Name: model, dtype: int64

In [None]:
# df.drop("model", axis = 1, inplace = True) 

# 36 - gearing_type

In [79]:
df.gearing_type.value_counts()

Manual            8153
Automatic         7297
Semi-automatic     469
Name: gearing_type, dtype: int64

In [None]:
df.shape

In [None]:
df.isnull().sum()/df.shape[0]*100

In [None]:
# df.to_csv("filled_scout.csv", index=False)