# CAR PRICE ESTIMATION MODEL PROJECT (EDA - DATA CLEANING)

## 1 - INTRODUCTION

Welcome to the "AutoScout Data Analysis Project", the capstone project of the Data Analysis Module. The AutoScout data used in this project was scraped from an online car trading company in 2019 and contains many features of 9 different car models. In this project, you will apply commonly used techniques for Data Cleaning and Exploratory Data Analysis with Python libraries such as NumPy, pandas, Matplotlib, Seaborn, and SciPy, and then analyze the cleaned dataset.

The project consists of 3 parts:
- The first part covers data cleaning: incorrect headers, incorrect formats, anomalies, and dropping useless columns.
- The second part covers filling data: missing values are handled and categorical features are transformed to numeric ones.
- The third part covers handling outliers with the help of visualisation libraries, and some insights are extracted.

## 2 - IMPORTING THE LIBRARIES

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import re
pd.set_option("display.max_columns",None)

## 3 - IMPORT THE DATASET

In [2]:
df1 = pd.read_json('scout_car.json', lines=True)
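
The `lines=True` argument tells `pd.read_json` to treat the input as JSON Lines, that is, one JSON object per line rather than a single JSON array. A minimal sketch on made-up in-memory records (the two sample rows are illustrative and not taken from `scout_car.json`):

```python
import io
import pandas as pd

# Two hypothetical records in JSON Lines form: one JSON object per line.
raw = (
    '{"make_model": "Audi A1", "price": 15770, "Type": ["", "Used", "", "Diesel"]}\n'
    '{"make_model": "Audi A1", "price": 14500, "km": "80,000 km"}\n'
)

sample = pd.read_json(io.StringIO(raw), lines=True)

# Keys missing from a record become NaN; JSON arrays stay Python lists.
print(sample.shape)
print(sample.dtypes)
```

This also shows why several columns of the dataset hold Python lists: JSON arrays are kept as list objects inside an `object` column.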

In [3]:
df = df1.copy()

In [4]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| url | https://www.autoscout24.com//offers/audi-a1-sp... | https://www.autoscout24.com//offers/audi-a1-1-... | https://www.autoscout24.com//offers/audi-a1-sp... | https://www.autoscout24.com//offers/audi-a1-1-... | https://www.autoscout24.com//offers/audi-a1-sp... |
| make_model | Audi A1 | Audi A1 | Audi A1 | Audi A1 | Audi A1 |
| short_description | Sportback 1.4 TDI S-tronic Xenon Navi Klima | 1.8 TFSI sport | Sportback 1.6 TDI S tronic Einparkhilfe plus+m... | 1.4 TDi Design S tronic | Sportback 1.4 TDI S-Tronic S-Line Ext. admired... |
| body_type | Sedans | Sedans | Sedans | Sedans | Sedans |
| price | 15770 | 14500 | 14640 | 14500 | 16790 |
| vat | VAT deductible | Price negotiable | VAT deductible | | |
| km | 56,013 km | 80,000 km | 83,450 km | 73,000 km | 16,200 km |
| registration | 01/2016 | 03/2017 | 02/2016 | 08/2016 | 05/2016 |
| prev_owner | 2 previous owners | | 1 previous owner | 1 previous owner | 1 previous owner |
| kW | | | | | |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 54 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   url                            15919 non-null  object 
 1   make_model                     15919 non-null  object 
 2   short_description              15873 non-null  object 
 3   body_type                      15859 non-null  object 
 4   price                          15919 non-null  int64  
 5   vat                            11406 non-null  object 
 6   km                             15919 non-null  object 
 7   registration                   15919 non-null  object 
 8   prev_owner                     9091 non-null   object 
 9   kW                             0 non-null      float64
 10  hp                             15919 non-null  object 
 11  Type                           15917 non-null  object 
 12  Previous Owners                9279 non-null  

## 4 - DATA CLEANING

### Column 0: url

In [6]:
#df.url.value_counts()
df.drop(columns = ['url'], inplace = True) #includes links for every index. Not necessary for our analysis

### Column 1: make_model

In [7]:
df["make_model"].value_counts(dropna=False)

Audi A3           3097
Audi A1           2614
Opel Insignia     2598
Opel Astra        2526
Opel Corsa        2219
Renault Clio      1839
Renault Espace     991
Renault Duster      34
Audi A2              1
Name: make_model, dtype: int64

### Column 2: short_description

In [8]:
df.short_description.value_counts(dropna=False)

SPB 1.6 TDI 116 CV S tronic Sport                     64
NaN                                                   46
1.4 66kW (90CV) Selective                             40
MOVE KLIMA CD USB ALLWETTER BLUETOOTH                 38
SPB 30 TFSI S tronic Admired                          35
                                                      ..
2,0 CDTI Aut 4x4 Innovat Xen Navi Leder                1
1,6 D (CDTI) Aut. ST Innov. Navi/SHZ/RFK               1
Sportback 1.4 TFSI Bose Panorama                       1
120 Jahre 5-Türer 1.4                                  1
1.2B Belgium Car - Navi met Touch Screen - All Sea     1
Name: short_description, Length: 10002, dtype: int64

In [9]:
#Take the first 'd.d' pattern (engine size in litres) and convert it to cc
df['short_description'] = df['short_description'].str.findall(r'\d\.\d').str[0].astype(float)*1000
df['short_description']

0        1400.0
1        1800.0
2        1600.0
3        1400.0
4        1400.0
          ...  
15914       NaN
15915       NaN
15916       NaN
15917       NaN
15918       NaN
Name: short_description, Length: 15919, dtype: float64
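
The extraction above keeps the first "digit, dot, digit" pattern found in each description and treats it as the engine size in litres. The sketch below shows which kinds of text match and which do not; the sample descriptions are hypothetical, and `str.extract` is used here as an assumed alternative to `findall(...).str[0]`.

```python
import pandas as pd

# Hypothetical descriptions in the style of the listings.
desc = pd.Series([
    "Sportback 1.4 TDI S-tronic",
    "1,6 D (CDTI) Aut. ST",        # decimal comma: not matched by \d\.\d
    "MOVE KLIMA CD USB",           # no engine size at all
    "SPB 30 TFSI S tronic",        # number without a decimal point
])

# str.extract returns the first match of the capture group, or NaN when nothing matches.
litres = desc.str.extract(r"(\d\.\d)", expand=False).astype(float)
cc = litres * 1000
print(cc.tolist())
```

Because any "d.d" text matches, values such as version numbers can also be picked up, which is one likely source of the implausible sizes that appear in the value counts further down.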

In [10]:
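#Map the nominal 1.6 l and 1.8 l sizes to the values used in the Displacement column (1598 and 1798 cc)
df['short_description'] = df['short_description'].replace(1600,1598).replace(1800,1798)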

In [11]:
df['short_description'].value_counts(dropna=False)

NaN       5068
1598.0    3891
1400.0    2535
1000.0    1334
1200.0     957
1500.0     890
2000.0     888
1300.0     135
1798.0      60
900.0       49
4000.0      28
2500.0      21
4300.0       9
5700.0       8
5000.0       3
5500.0       3
1700.0       3
1100.0       3
800.0        2
9800.0       2
3900.0       2
6000.0       2
3000.0       2
5100.0       1
7800.0       1
700.0        1
4600.0       1
2200.0       1
2300.0       1
2800.0       1
200.0        1
8800.0       1
8400.0       1
600.0        1
9900.0       1
7300.0       1
5300.0       1
0.0          1
4500.0       1
9600.0       1
5600.0       1
8900.0       1
6100.0       1
7900.0       1
300.0        1
8500.0       1
4200.0       1
Name: short_description, dtype: int64

In [12]:
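#Treat engine sizes below 800 cc as parsing errors
df.loc[df['short_description']<800,'short_description'] = np.nan

This column is used to fill missing values in Displacement (Column 32) and is dropped afterwards.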

### Column 3: body_type

In [13]:
df.body_type.value_counts(dropna=False)

Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
NaN                60
Off-Road           56
Coupe              25
Convertible         8
Name: body_type, dtype: int64

In [14]:
#Drop 'body_type' too, since Column 27 ('Body') contains the same values
df.drop(columns = ['body_type'], inplace = True)

### Column 4: price

In [15]:
df.price.describe()

count    15919.000000
mean     18019.896727
std       7386.169409
min         13.000000
25%      12850.000000
50%      16900.000000
75%      21900.000000
max      74600.000000
Name: price, dtype: float64

In [16]:
df.price = df.price.astype(float)
df.price.head(3)

0    15770.0
1    14500.0
2    14640.0
Name: price, dtype: float64

In [17]:
df.price.isnull().sum()

0

### Column 5: vat

In [18]:
df.vat.value_counts(dropna=False)

VAT deductible      10980
NaN                  4513
Price negotiable      426
Name: vat, dtype: int64

### Column 6: km

In [19]:
df.km.head()

0    56,013 km
1    80,000 km
2    83,450 km
3    73,000 km
4    16,200 km
Name: km, dtype: object

In [20]:
#Remove the commas, get the numeric values, and convert to float
df['km'] = df['km'].str.replace(',' , '').str.findall(r'\d+').str[0].astype(float)
df['km'].head()

0    56013.0
1    80000.0
2    83450.0
3    73000.0
4    16200.0
Name: km, dtype: float64

### Column 7: registration

In [21]:
#Check whether any non-numeric values remain after removing '/'
df.registration.loc[~(df.registration.replace('/', '', regex=True).str.isnumeric())]

122      -/-
710      -/-
734      -/-
741      -/-
743      -/-
        ... 
15896    -/-
15902    -/-
15907    -/-
15912    -/-
15914    -/-
Name: registration, Length: 1597, dtype: object

In [22]:
#Parse the 'MM/YYYY' strings and keep the year; non-date values such as '-/-' become NaN
df.registration = pd.to_datetime(df.registration, format='%m/%Y', errors='coerce').dt.year
df.registration.head(3)

0    2016.0
1    2017.0
2    2016.0
Name: registration, dtype: float64

In [23]:
df.registration.isnull().sum()

1597
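
When the date layout is known, passing it explicitly to `pd.to_datetime` avoids relying on format inference, and `errors='coerce'` turns placeholders into missing values. A small sketch on hypothetical values in the same `MM/YYYY` style:

```python
import pandas as pd

# Hypothetical values in the same MM/YYYY style, plus the '-/-' placeholder.
reg = pd.Series(["01/2016", "03/2017", "-/-"])

# An explicit format avoids format inference; unparseable entries become NaT.
parsed = pd.to_datetime(reg, format="%m/%Y", errors="coerce")
year = parsed.dt.year
print(year.tolist())
```

The year column ends up as float because NaN cannot be stored in a plain integer column.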

### Column 8: prev_owner

In [24]:
#This column contains the same information as Column 12 (Previous Owners)
#df['Previous Owners'].value_counts(dropna=False)
df.prev_owner.value_counts(dropna=False)

1 previous owner     8294
NaN                  6828
2 previous owners     778
3 previous owners      17
4 previous owners       2
Name: prev_owner, dtype: int64

In [25]:
#Some of the null values may be filled by comparing both columns.
#First, convert both columns to the same type
df[["prev_owner","Previous Owners"]].head(5)

| | prev_owner | Previous Owners |
|---|---|---|
| 0 | 2 previous owners | \n2\n |
| 1 | | |
| 2 | 1 previous owner | \n1\n |
| 3 | 1 previous owner | \n1\n |
| 4 | 1 previous owner | \n1\n |


In [26]:
df['prev_owner']=df['prev_owner'].str.findall(r"\d+").str[0].astype(float)
df["Previous Owners"]=df["Previous Owners"].str.strip().astype(float)

In [27]:
#Define a function to compare the columns and take whichever value is available.
#Return NaN if both columns are empty and 'conflict' if they disagree
def prev_owner_combine(p1,p2):
    if np.isnan(p1):
        return p2 #also covers the case where both are NaN
    if np.isnan(p2) or p1 == p2:
        return p1
    return 'conflict'

In [28]:
df["prev_owner"]=df.apply(lambda x: prev_owner_combine(x['prev_owner'],x['Previous Owners']), axis=1)

In [29]:
#Drop Column 12 (Previous Owners)
df.drop('Previous Owners',axis=1,inplace=True)

In [206]:
df["prev_owner"].value_counts(dropna=False)

1.0    8294
NaN    6665
2.0     778
0.0     163
3.0      17
4.0       2
Name: prev_owner, dtype: int64
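
The row-wise comparison above can also be expressed with vectorized operations. The following is a sketch under the assumption that both columns are already float; the sample values and the names `a`, `b`, `conflict`, and `combined` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical owner counts from the two sources after conversion to float.
a = pd.Series([2.0, np.nan, 1.0, np.nan, 1.0])
b = pd.Series([2.0, 0.0, np.nan, np.nan, 3.0])

# Rows where both sources have a value and the values disagree.
conflict = a.notna() & b.notna() & (a != b)

# Prefer a, and fall back to b where a is missing.
combined = a.combine_first(b)

print(combined.tolist())
print(conflict.tolist())
```

Unlike returning the string `'conflict'`, this keeps the combined column numeric and reports disagreements in a separate boolean mask, which can then be inspected or set to NaN.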

### Column 9: kW

In [30]:
df.kW.value_counts()

Series([], Name: kW, dtype: int64)

In [31]:
df.kW.unique()

array([nan])

In [32]:
#Drop the empty Column
df.drop(columns = 'kW', inplace = True)

### Column 10: hp

In [33]:
df.hp.value_counts()

85 kW     2542
66 kW     2122
81 kW     1402
100 kW    1308
110 kW    1112
          ... 
75 kW        1
239 kW       1
123 kW       1
44 kW        1
137 kW       1
Name: hp, Length: 81, dtype: int64

In [34]:
#remove ' kW'
df['hp'] = df['hp'].map(lambda x: x.rstrip(' kW'))

In [35]:
df.hp = df.hp.replace('-', np.nan).astype('float')

In [36]:
df.hp.value_counts(dropna=False)

85.0     2542
66.0     2122
81.0     1402
100.0    1308
110.0    1112
         ... 
239.0       1
115.0       1
163.0       1
4.0         1
75.0        1
Name: hp, Length: 81, dtype: int64
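
One detail worth keeping in mind: `str.rstrip(' kW')` removes any trailing characters from the set `{' ', 'k', 'W'}`, not the literal suffix. It works here because the values end in digits. The sketch below illustrates the difference and shows an assumed alternative based on `pd.to_numeric`; the sample values, including the `'- kW'` placeholder form, are hypothetical.

```python
import pandas as pd

# str.rstrip removes any trailing characters from the given set, not a suffix.
print(repr("85 kW".rstrip(" kW")))   # stripping stops at the digit '5'
print(repr("Wk kW".rstrip(" kW")))   # every character is in the set

# Hypothetical power values, including a placeholder for a missing value.
power = pd.Series(["85 kW", "100 kW", "- kW"])

# Remove the literal suffix, then let to_numeric turn non-numbers into NaN.
kw = pd.to_numeric(power.str.replace(" kW", "", regex=False), errors="coerce")
print(kw.tolist())
```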

### Column 11: Type

In [37]:
#The column consists of list values
df.Type.value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[, Used, , Diesel (Particulate Filter)]                                                                                                       3475
[, Used, , Diesel]                                                                                                                            2516
[, Used, , Gasoline]                                                                                                                          2367
[, Used, , Super 95]                                                                                                                          1818
[, Pre-registered, , Super 95]                                                                                                                 500
                                                                                                                                              ... 
[, Used, , Regular/Benzine 91 / Super 95 / Super Plus 98 / Super Plus E10 98 / Super E10 95 / Regular/Benzine E10 91 (

In [38]:
df.Type=df.Type.str[1].str.strip()
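
Lists are unhashable, which is what the `TypeError` in the output above reports. A sketch of two ways to work with such a column, using hypothetical cells shaped like the `Type` values:

```python
import pandas as pd

# Hypothetical cells shaped like the 'Type' column: lists of strings.
col = pd.Series([
    ["", "Used", "", "Diesel"],
    ["", "Used", "", "Diesel"],
    ["", "Pre-registered", "", "Super 95"],
])

# Tuples are hashable, so the converted column can be counted without errors.
counts = col.map(tuple).value_counts()
print(counts)

# .str[i] indexes into each list, which is how a single field is pulled out.
condition = col.str[1].str.strip()
print(condition.tolist())
```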

### Column 12: Previous Owners

In [39]:
#This column has already been merged into 'prev_owner' and dropped (see Column 8)

### Column 13: Next Inspection

In [40]:
#The column consists of list values
df['Next Inspection'].value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


\n04/2022\n                                                                                      62
\n03/2021\n                                                                                      38
\n03/2022\n                                                                                      36
\n06/2021\n                                                                                      34
\n01/2022\n                                                                                      32
                                                                                                 ..
[\n10/2018\n, \n119 g CO2/km (comb)\n]                                                            1
[\n11/2020\n, \n119 g CO2/km (comb)\n]                                                            1
[\n10/2020\n, \n153 g CO2/km (comb)\n]                                                            1
[\n05/2022\n, \n, 5 l/100 km (comb), \n, 5.9 l/100 km (city), \n, 4.5 l/100 km (country), \n]     1


In [41]:
next_ins = df["Next Inspection"].apply(pd.Series)
next_ins

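| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | \n06/2021\n | \n99 g CO2/km (comb)\n | | | | | | |
| 1 | | | | | | | | |
| 2 | | | | | | | | |
| 3 | | | | | | | | |
| 4 | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15914 | | | | | | | | |
| 15915 | \n01/2022\n | \n168 g CO2/km (comb)\n | | | | | | |
| 15916 | | | | | | | | |
| 15917 | | | | | | | | |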


In [42]:
#There are only 3535 non-null values in column 0 that shows Next Inspection date
next_ins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       3535 non-null   object
 1   1       2793 non-null   object
 2   2       242 non-null    object
 3   3       242 non-null    object
 4   4       242 non-null    object
 5   5       242 non-null    object
 6   6       239 non-null    object
 7   7       239 non-null    object
dtypes: object(8)
memory usage: 995.1+ KB


In [43]:
#Convert column 0 ('MM/YYYY' strings wrapped in newlines) to datetime and assign the year to df['Next Inspection']
df["Next Inspection"] = pd.to_datetime(next_ins[0].str.strip(), format='%m/%Y', errors='coerce').dt.year
df["Next Inspection"].head()

0    2021.0
1       NaN
2       NaN
3       NaN
4       NaN
Name: Next Inspection, dtype: float64
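
`apply(pd.Series)` builds one Series per row, which can be slow on larger frames. Since only the first element is needed here, a small helper is enough. Note that `.str[0]` alone would not be safe on this column, because on a plain string it returns the first character rather than the whole value. The helper name `first_item` and the sample cells below are hypothetical.

```python
import numpy as np
import pandas as pd

def first_item(cell):
    """Return the first element of a list cell, or the cell itself otherwise.

    Hypothetical helper, not part of the notebook.
    """
    if isinstance(cell, list):
        return cell[0] if cell else np.nan
    return cell

# Hypothetical cells mixing plain strings, lists, and missing values.
col = pd.Series(["\n04/2022\n", ["\n06/2021\n", "\n99 g CO2/km (comb)\n"], np.nan])

first = col.map(first_item).str.strip()
year = pd.to_datetime(first, format="%m/%Y", errors="coerce").dt.year
print(year.tolist())
```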

### Column 14: Inspection new

In [44]:
df['Inspection new'].value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[\nYes\n, \nEuro 6\n]                                                                          523
\nYes\n                                                                                        362
[\nYes\n, \n102 g CO2/km (comb)\n]                                                             174
[\nYes\n, \n4 (Green)\n]                                                                       166
[\nYes\n, \nEuro 6d-TEMP\n]                                                                    134
                                                                                              ... 
[\nYes\n, \n88 g CO2/km (comb)\n]                                                                1
[\nYes\n, \n, 6 l/100 km (comb), \n, 7.5 l/100 km (city), \n, 5.2 l/100 km (country), \n]        1
[\nYes\n, \n, 6 l/100 km (comb), \n, 8 l/100 km (city), \n, 4.9 l/100 km (country), \n]          1
[\nYes\n, \n87 g CO2/km (comb)\n]                                                                1
[\nYes\n, 

In [45]:
#Create a new dataset from list values
ins_new = df['Inspection new'].apply(pd.Series)
ins_new

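| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | \nYes\n | \nEuro 6\n | | | | | | |
| 1 | | | | | | | | |
| 2 | | | | | | | | |
| 3 | | | | | | | | |
| 4 | \nYes\n | \n109 g CO2/km (comb)\n | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15914 | | | | | | | | |
| 15915 | | | | | | | | |
| 15916 | \nYes\n | \nEuro 6d-TEMP\n | | | | | | |
| 15917 | | | | | | | | |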


In [46]:
ins_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       3932 non-null   object
 1   1       3490 non-null   object
 2   2       336 non-null    object
 3   3       336 non-null    object
 4   4       336 non-null    object
 5   5       336 non-null    object
 6   6       331 non-null    object
 7   7       331 non-null    object
dtypes: object(8)
memory usage: 995.1+ KB


In [47]:
#Fill the null values in column 0 with 'unknown' (assign back instead of filling a slice in place)
ins_new[0] = ins_new[0].fillna('unknown')

In [48]:
df['Inspection new'] = ins_new[0].map(lambda x: x.rstrip('\n').lstrip('\n'))

In [49]:
df['Inspection new'].value_counts(dropna=False)

unknown    11987
Yes         3932
Name: Inspection new, dtype: int64
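
When filling missing values in a single column, assigning the result back to the column is unambiguous, whereas calling `fillna(..., inplace=True)` on the result of `iloc` or another selection may modify only a temporary copy, depending on the pandas version. A minimal sketch on a hypothetical frame shaped like the expanded lists:

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the expanded 'Inspection new' lists.
parts = pd.DataFrame({0: ["\nYes\n", np.nan, "\nYes\n"],
                      1: ["\nEuro 6\n", np.nan, np.nan]})

# Assign the filled column back instead of filling a selection in place.
parts[0] = parts[0].fillna("unknown")

# str.strip removes the surrounding newlines in one step.
cleaned = parts[0].str.strip()
print(cleaned.value_counts().to_dict())
```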

### Column 15: Warranty

In [50]:
#Some values are lists and some are not
df.Warranty.value_counts(dropna=False)

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


NaN                                                                                                5420
[\n, \n, \nEuro 6\n]                                                                               1868
\n12 months\n                                                                                      1177
\n                                                                                                  979
\n24 months\n                                                                                       566
                                                                                                   ... 
[\n33 months\n, \nEuro 6\n]                                                                           1
[\n46 months\n, \n4 (Green)\n]                                                                        1
[\n72 months\n, \n154 g CO2/km (comb)\n]                                                              1
[\n, \n, \n144 g CO2/km (comb)\n]                               

In [51]:
#Define a function to clean the values
def clean_alt_list(a):
    #Take the first element if the value is a list
    if isinstance(a, list):
        a = a[0]
    #Extract the first run of digits from a string; no digits means no duration is given
    if isinstance(a, str):
        b = re.findall(r'\d+', a)
        return b[0] if b else np.nan
    #Leave NaN (and any other non-string value) unchanged
    return a

In [52]:
df['Warranty'] = df['Warranty'].apply(clean_alt_list)

In [53]:
#Rename the column
df = df.rename({'Warranty': 'Warranty(months)'}, axis=1)

In [54]:
df['Warranty(months)'] = df['Warranty(months)'].astype('float')

In [55]:
df['Warranty(months)'].value_counts(dropna=False)

NaN     11066
12.0     2594
24.0     1118
60.0      401
36.0      279
48.0      149
6.0       125
72.0       59
3.0        33
23.0       11
18.0       10
20.0        7
25.0        6
2.0         5
50.0        4
26.0        4
16.0        4
19.0        3
1.0         3
4.0         3
13.0        3
34.0        3
45.0        2
14.0        2
17.0        2
11.0        2
46.0        2
28.0        2
21.0        2
22.0        2
9.0         2
30.0        1
33.0        1
56.0        1
40.0        1
7.0         1
15.0        1
8.0         1
10.0        1
49.0        1
47.0        1
65.0        1
Name: Warranty(months), dtype: int64
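
To make the behaviour of the cleaning step concrete, the sketch below applies a compact variant to a few hypothetical cells in the shapes seen above. The helper name `warranty_months` is hypothetical, and unlike the original function it returns NaN for every non-string, non-list value.

```python
import re
import numpy as np
import pandas as pd

def warranty_months(cell):
    """Hypothetical compact variant of the cleaning step above."""
    if isinstance(cell, list):
        cell = cell[0] if cell else np.nan
    if isinstance(cell, str):
        match = re.search(r"\d+", cell)
        return float(match.group()) if match else np.nan
    return np.nan

samples = pd.Series([
    "\n12 months\n",
    ["\n24 months\n", "\nEuro 6\n"],
    ["\n", "\n", "\nEuro 6\n"],   # first element has no digits
    "\n",
    np.nan,
])
months = samples.map(warranty_months)
print(months.tolist())
```

Cells such as `[\n, \n, \nEuro 6\n]` yield NaN because their first element contains no digits. Whether that means a warranty without a stated duration cannot be told from the output shown here.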

### Column 16: Full Service

In [56]:
df['Full Service'].value_counts(dropna=False)

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


NaN                                                                                           7704
[\n, \n, \n4 (Green)\n]                                                                       2235
[\n, \n, \nEuro 6\n]                                                                          2097
[\n, \n]                                                                                      1702
[\n, \n, \nEuro 6d-TEMP\n]                                                                     399
                                                                                              ... 
[\n, \n, \n, 5.4 l/100 km (comb), \n, 7 l/100 km (city), \n, 4.5 l/100 km (country), \n]         1
[\n, \n, \n, 4.7 l/100 km (comb), \n, 5.4 l/100 km (city), \n, 4.2 l/100 km (country), \n]       1
[\n, \n, \n164 g CO2/km (comb)\n]                                                                1
[\n, \n, \n, 5.4 l/100 km (comb), \n, 7.3 l/100 km (city), \n, 4.2 l/100 km (country), \n]       1
[\n, \n, \

In [57]:
#Drop this column. Same values exist in other columns
df.drop(columns = ['Full Service'], inplace = True)

### Column 17: Non-smoking Vehicle

In [58]:
#Drop this column. Same values exist in other columns
df.drop(columns = ['Non-smoking Vehicle'], inplace = True)

### Column 18: Null

In [59]:
df.drop(columns = ['null'], inplace = True)

### Column 19: Make

In [60]:
#Same values as Column 1 (make_model)
df.Make.value_counts()

\nOpel\n       7343
\nAudi\n       5712
\nRenault\n    2864
Name: Make, dtype: int64

In [187]:
df.drop(columns = ['Make'], inplace = True)

### Column 20: Model

In [62]:
#Same values as Column 1 (make_model)
df.Model.value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[\n, A3, \n]          3097
[\n, A1, \n]          2614
[\n, Insignia, \n]    2598
[\n, Astra, \n]       2526
[\n, Corsa, \n]       2219
[\n, Clio, \n]        1839
[\n, Espace, \n]       991
[\n, Duster, \n]        34
[\n, A2, \n]             1
Name: Model, dtype: int64

In [188]:
df.drop(columns = ['Model'], inplace = True)

### Column 21: Offer Number

In [65]:
#Drop the column, since the values seem irrelevant for our analysis
df.drop(columns = ['Offer Number'], inplace = True)

### Column 22: First Registration

In [66]:
df['First Registration']

0        [\n, 2016, \n]
1        [\n, 2017, \n]
2        [\n, 2016, \n]
3        [\n, 2016, \n]
4        [\n, 2016, \n]
              ...      
15914               NaN
15915    [\n, 2019, \n]
15916    [\n, 2019, \n]
15917    [\n, 2019, \n]
15918    [\n, 2019, \n]
Name: First Registration, Length: 15919, dtype: object

In [67]:
#The values are the year part of Column 7 (registration). So, I will drop the column
df.drop(columns = ['First Registration'], inplace = True)

### Column 23: Body Color

In [68]:
df['Body Color'].value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[\n, Black, \n]     3745
[\n, Grey, \n]      3505
[\n, White, \n]     3406
[\n, Silver, \n]    1647
[\n, Blue, \n]      1431
[\n, Red, \n]        957
[\n, Brown, \n]      289
[\n, Green, \n]      154
[\n, Beige, \n]      108
[\n, Yellow, \n]      51
[\n, Violet, \n]      18
[\n, Bronze, \n]       6
[\n, Orange, \n]       3
[\n, Gold, \n]         2
Name: Body Color, dtype: int64

In [69]:
df['Body Color'] = df['Body Color'].str[1]
df['Body Color']

0        Black
1          Red
2        Black
3        Brown
4        Black
         ...  
15914     Grey
15915     Grey
15916    White
15917     Grey
15918     Grey
Name: Body Color, Length: 15919, dtype: object

In [70]:
df['Body Color'].isnull().sum()

597

### Column 24: Paint Type

In [71]:
df['Paint Type']

0        [\nMetallic\n]
1                   NaN
2        [\nMetallic\n]
3        [\nMetallic\n]
4        [\nMetallic\n]
              ...      
15914    [\nMetallic\n]
15915    [\nMetallic\n]
15916               NaN
15917               NaN
15918    [\nMetallic\n]
Name: Paint Type, Length: 15919, dtype: object

In [72]:
df['Paint Type'] = df['Paint Type'].str[0].fillna('unknown').map(lambda x: x.rstrip('\n').lstrip('\n'))

In [73]:
df['Paint Type']

0        Metallic
1         unknown
2        Metallic
3        Metallic
4        Metallic
           ...   
15914    Metallic
15915    Metallic
15916     unknown
15917     unknown
15918    Metallic
Name: Paint Type, Length: 15919, dtype: object

### Column 25: Body Color Original

In [74]:
df['Body Color Original']

0                 [\nMythosschwarz\n]
1                                 NaN
2        [\nmythosschwarz metallic\n]
3                                 NaN
4        [\nMythosschwarz Metallic\n]
                     ...             
15914              [\nGrigio scuro\n]
15915       [\nStahl-Grau Metallic\n]
15916               [\narktis-weiß\n]
15917                    [\nGrigio\n]
15918    [\nTitanium-Grau Metallic\n]
Name: Body Color Original, Length: 15919, dtype: object

In [75]:
df['Body Color Original'] = df['Body Color Original'].str[0].fillna('unknown').map(lambda x: x.rstrip('\n').lstrip('\n'))
df['Body Color Original']

0                 Mythosschwarz
1                       unknown
2        mythosschwarz metallic
3                       unknown
4        Mythosschwarz Metallic
                  ...          
15914              Grigio scuro
15915       Stahl-Grau Metallic
15916               arktis-weiß
15917                    Grigio
15918    Titanium-Grau Metallic
Name: Body Color Original, Length: 15919, dtype: object

In [76]:
df['Body Color Original'].value_counts()

unknown                      3759
Onyx Schwarz                  338
Bianco                        282
Mythosschwarz Metallic        238
Brillantschwarz               216
                             ... 
Noir mythic                     1
Cosmos Blau Met                 1
KOKOSNUSS BRAUN (M2)            1
NERO MYTHOS METALLIZZATO        1
Rouge Flamme (rood parelm       1
Name: Body Color Original, Length: 1928, dtype: int64

### Column 26: Upholstery

In [77]:
df.Upholstery

0               [\nCloth, Black\n]
1                [\nCloth, Grey\n]
2               [\nCloth, Black\n]
3                              NaN
4               [\nCloth, Black\n]
                   ...            
15914                          NaN
15915                  [\nCloth\n]
15916    [\nFull leather, Black\n]
15917           [\nPart leather\n]
15918    [\nFull leather, Brown\n]
Name: Upholstery, Length: 15919, dtype: object

In [78]:
df.Upholstery.str[0].value_counts()

\nCloth, Black\n           5821
\nPart leather, Black\n    1121
\nCloth\n                  1005
\nCloth, Grey\n             891
\nCloth, Other\n            639
\nFull leather, Black\n     575
\nBlack\n                   491
\nGrey\n                    273
\nOther, Other\n            182
\nPart leather\n            140
\nFull leather\n            139
\nPart leather, Grey\n      116
\nFull leather, Brown\n     116
\nOther, Black\n            110
\nFull leather, Other\n      72
\nFull leather, Grey\n       67
\nPart leather, Other\n      65
\nOther\n                    56
\nPart leather, Brown\n      50
\nalcantara, Black\n         47
\nVelour, Black\n            36
\nFull leather, Beige\n      36
\nCloth, Brown\n             28
\nVelour\n                   16
\nOther, Grey\n              15
\nCloth, Beige\n             13
\nBrown\n                    12
\nCloth, Blue\n              12
\nVelour, Grey\n              8
\nCloth, White\n              8
\nalcantara, Grey\n           6
\nCloth,

In [79]:
#Fill the null values with 'unknown' and strip the newlines
df.Upholstery = df.Upholstery.str[0].fillna('unknown').map(lambda x: x.rstrip('\n').lstrip('\n'))
df.Upholstery

0               Cloth, Black
1                Cloth, Grey
2               Cloth, Black
3                    unknown
4               Cloth, Black
                ...         
15914                unknown
15915                  Cloth
15916    Full leather, Black
15917           Part leather
15918    Full leather, Brown
Name: Upholstery, Length: 15919, dtype: object

In [80]:
#Textures and colors are mixed
df.Upholstery.value_counts()

Cloth, Black           5821
unknown                3720
Part leather, Black    1121
Cloth                  1005
Cloth, Grey             891
Cloth, Other            639
Full leather, Black     575
Black                   491
Grey                    273
Other, Other            182
Part leather            140
Full leather            139
Full leather, Brown     116
Part leather, Grey      116
Other, Black            110
Full leather, Other      72
Full leather, Grey       67
Part leather, Other      65
Other                    56
Part leather, Brown      50
alcantara, Black         47
Full leather, Beige      36
Velour, Black            36
Cloth, Brown             28
Velour                   16
Other, Grey              15
Cloth, Beige             13
Cloth, Blue              12
Brown                    12
Velour, Grey              8
Cloth, White              8
alcantara, Grey           6
Cloth, Red                5
Other, Yellow             4
Beige                     3
Part leather, Red   

In [81]:
#Define a function that prefixes 'unknown' to rows which contain only a color
def func1(x):
    if x in ['Black', 'Grey', 'Brown', 'Beige', 'Blue', 'White']:
        x = 'unknown, ' + x
    return x

In [82]:
df.Upholstery = df.Upholstery.apply(func1)

In [83]:
uph = df.Upholstery.str.split(',', n=1, expand=True)
uph

| | 0 | 1 |
|---|---|---|
| 0 | Cloth | Black |
| 1 | Cloth | Grey |
| 2 | Cloth | Black |
| 3 | unknown | |
| 4 | Cloth | Black |
| ... | ... | ... |
| 15914 | unknown | |
| 15915 | Cloth | |
| 15916 | Full leather | Black |
| 15917 | Part leather | |


In [84]:
uph[0].value_counts()

Cloth           8423
unknown         4503
Part leather    1499
Full leather    1009
Other            368
Velour            60
alcantara         57
Name: 0, dtype: int64

In [85]:
uph[uph[0] == 'Black']

| | 0 | 1 |
|---|---|---|


In [86]:
#Add both parts as separate columns to our dataset
df['Upholstery_Texture'] = uph[0]
df['Upholstery_Color'] = uph[1].str.strip() #remove the space left after the comma

In [87]:
df.drop(columns = ['Upholstery'], inplace = True)
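
The split into texture and color can be checked on a few values. In the sketch below, `n=1` limits the split to the first comma and `expand=True` returns one column per part; the sample values and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical values after the 'unknown, <color>' fix-up above.
uph_raw = pd.Series(["Cloth, Black", "unknown", "Part leather", "unknown, Grey"])

# n=1 splits on the first comma only; expand=True returns one column per part.
parts = uph_raw.str.split(",", n=1, expand=True)
texture = parts[0].str.strip()
color = parts[1].str.strip()   # strip the space that follows the comma

print(texture.tolist())
print(color.tolist())
```

Rows without a comma get a missing value in the second column, so the color stays missing for entries that only name a texture.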

### Column 27: Body

In [88]:
df.Body.str[1].value_counts()

Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
Off-Road           56
Coupe              25
Convertible         8
Name: Body, dtype: int64

In [89]:
df.Body = df.Body.str[1]

### Column 28: Nr. of Doors

In [90]:
df['Nr. of Doors'].str[0].value_counts(dropna=False)

\n5\n    11575
\n4\n     3079
\n3\n      832
\n2\n      219
NaN        212
\n7\n        1
\n1\n        1
Name: Nr. of Doors, dtype: int64

In [91]:
#Take the list element and keep its second character (the digit between the newlines)
df['Nr. of Doors'] = df['Nr. of Doors'].str[0].str[1].astype('float')

In [92]:
df['Nr. of Doors']

0        5.0
1        3.0
2        4.0
3        3.0
4        5.0
        ... 
15914    5.0
15915    5.0
15916    5.0
15917    5.0
15918    5.0
Name: Nr. of Doors, Length: 15919, dtype: float64
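
Indexing the second character works because every count shown above is a single digit. A sketch of an assumed alternative that strips the newlines and converts the whole number, so a two-digit value would survive as well (the sample cells, including the `12`, are hypothetical):

```python
import pandas as pd

# Hypothetical cells shaped like 'Nr. of Doors': one-element lists, or missing.
doors_raw = pd.Series([["\n5\n"], ["\n3\n"], None, ["\n12\n"]])

# .str[0] takes the list element; strip() removes the newlines around the number.
doors = pd.to_numeric(doors_raw.str[0].str.strip(), errors="coerce")
print(doors.tolist())

# Taking character 1 instead keeps only the first digit of '12'.
first_char = doors_raw.str[0].str[1]
print(first_char.tolist())
```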

### Column 29: Nr. of Seats

In [93]:
#Take the list element and keep its second character (the digit between the newlines)
df['Nr. of Seats'] = df['Nr. of Seats'].str[0].str[1].astype('float')

In [94]:
df['Nr. of Seats']

0        5.0
1        4.0
2        4.0
3        4.0
4        5.0
        ... 
15914    5.0
15915    5.0
15916    7.0
15917    7.0
15918    5.0
Name: Nr. of Seats, Length: 15919, dtype: float64

### Column 30: Model Code

In [95]:
df['Model Code']

0        [\n0588/BDF\n]
1        [\n0588/BCY\n]
2                   NaN
3                   NaN
4        [\n0588/BDF\n]
              ...      
15914               NaN
15915    [\n0000/000\n]
15916               NaN
15917               NaN
15918    [\n3333/BHJ\n]
Name: Model Code, Length: 15919, dtype: object

In [173]:
#Model code is unlikely to influence price, so I will drop this column
df.drop(columns = ['Model Code'], inplace = True)

### Column 31: Gearing Type

In [98]:
df['Gearing Type'] = df['Gearing Type'].str[1]
df['Gearing Type']

0        Automatic
1        Automatic
2        Automatic
3        Automatic
4        Automatic
           ...    
15914    Automatic
15915    Automatic
15916    Automatic
15917    Automatic
15918    Automatic
Name: Gearing Type, Length: 15919, dtype: object

### Column 32: Displacement

In [99]:
#Remove comma, get digits, and convert to float
df['Displacement'] = df['Displacement'].str[0].str.replace(',', '').str.findall(r'\d+').str[0].astype('float')
df['Displacement']

0        1422.0
1        1798.0
2        1598.0
3        1422.0
4        1422.0
          ...  
15914    1997.0
15915    1798.0
15916    1997.0
15917    1997.0
15918    1798.0
Name: Displacement, Length: 15919, dtype: float64

In [100]:
df['Displacement'].value_counts(dropna=False)

1598.0    4761
999.0     2438
1398.0    1314
1399.0     749
1229.0     677
          ... 
1368.0       1
1390.0       1
54.0         1
1856.0       1
1533.0       1
Name: Displacement, Length: 78, dtype: int64

In [190]:
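def disp(d1,d2):
    #Use d2 as a fallback when d1 is missing or outside the plausible 700-4000 cc range
    #d2 must already be numeric here, because np.isnan fails on strings
    if (d1>4000) or (d1<700) or np.isnan(d1):
        if np.isnan(d2):
            return d1
        else:
            return d2
    else:
        return d1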

In [192]:
df['Displacement'] = df.apply(lambda x: disp(x['Displacement'],x['short_description']),axis=1)

In [193]:
df['Displacement'].isnull().sum()

180

In [194]:
#Drop Column 2
df.drop(["short_description"],axis=1,inplace=True)

### Column 33: Cylinders

In [101]:
df['Cylinders'] = df['Cylinders'].str[0].str.findall(r'\d+').str[0].astype('float')
df['Cylinders']

0        3.0
1        4.0
2        NaN
3        3.0
4        3.0
        ... 
15914    4.0
15915    4.0
15916    4.0
15917    4.0
15918    4.0
Name: Cylinders, Length: 15919, dtype: float64

### Column 34: Weight

In [102]:
df['Weight'] = df['Weight'].str[0].str.replace(',', '').str.findall(r'\d+').str[0].astype('float')
df['Weight']

0        1220.0
1        1255.0
2           NaN
3        1195.0
4           NaN
          ...  
15914    1758.0
15915    1708.0
15916       NaN
15917    1758.0
15918    1685.0
Name: Weight, Length: 15919, dtype: float64

### Column 35: Drive chain

In [231]:
df["Drive chain"]=df1["Drive chain"].str[0].str.strip()
df['Drive chain']

0        front
1        front
2        front
3          NaN
4        front
         ...  
15914    front
15915    front
15916    front
15917    front
15918      4WD
Name: Drive chain, Length: 15919, dtype: object

### Column 36: Fuel

In [104]:
df["Fuel"]=df["Fuel"].str[1]

In [105]:
def func2(x):
    if 'Diesel' in x:
        return 'Diesel'
    elif 'Super' in x:
        return 'Benzine'
    elif 'Gasoline' in x:
        return 'Benzine'
    elif 'Benzine' in x:
        return 'Benzine'
    else:
        return x

In [106]:
df["Fuel"]=df["Fuel"].apply(func2)

In [107]:
df['Fuel'] = df['Fuel'].replace(['CNG','LPG', 'Liquid petroleum gas (LPG)','CNG (Particulate Filter)', 'Domestic gas H','Biogas'], 'Gas')

In [108]:
df['Fuel'] = df['Fuel'].replace(['Others (Particulate Filter)','Others','Electric'], 'Others')

In [109]:
df["Fuel"].value_counts(dropna=False)

Benzine    8549
Diesel     7299
Gas          64
Others        7
Name: Fuel, dtype: int64
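
The keyword matching in `func2` and the two `replace` calls could be combined into a single function. The sketch below is one possible arrangement, not the notebook's code: `normalize_fuel` and the constant names are hypothetical, the label sets are copied from the cells above, and the sample inputs in the `print` calls are made up. The keyword checks run before the exact matches because 'Gasoline' contains 'Gas'.

```python
# Label sets copied from the replace() calls above.
BENZINE_KEYWORDS = ('Super', 'Gasoline', 'Benzine')
GAS_LABELS = {'CNG', 'LPG', 'Liquid petroleum gas (LPG)',
              'CNG (Particulate Filter)', 'Domestic gas H', 'Biogas'}
OTHER_LABELS = {'Others (Particulate Filter)', 'Others', 'Electric'}

def normalize_fuel(label):
    """Map a raw fuel label to Diesel, Benzine, Gas, or Others."""
    if not isinstance(label, str):
        return label  # leave missing values untouched
    if 'Diesel' in label:
        return 'Diesel'
    if any(keyword in label for keyword in BENZINE_KEYWORDS):
        return 'Benzine'
    if label in GAS_LABELS:
        return 'Gas'
    if label in OTHER_LABELS:
        return 'Others'
    return label

print(normalize_fuel('Diesel (Particulate Filter)'))  # made-up sample input
print(normalize_fuel('Super 95'))                     # made-up sample input
print(normalize_fuel('LPG'))
```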

### Column 37: Consumption

In [197]:
df1.Consumption[2]

[['3.8 l/100 km (comb)'], ['4.4 l/100 km (city)'], ['3.4 l/100 km (country)']]

In [198]:
def consume_combined(a):
    if type(a) == list:
        if len(a) >3:
            for i in a:
                if 'comb' in i:
                    return i
        else:
            return a[0]            
    
    else:
        return a
    
df['Consumption_combined'] = df1['Consumption'].apply(consume_combined)

In [199]:
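def cleaning_consumption(a):
    #Take the first number in the text; \d+ keeps values such as 10.5 intact
    if type(a) == list:
        if len(a) > 0:
            b = re.findall(r"\d+\.?\d*", a[0])
            return b[0] if b else np.nan
        else:
            return np.nan
    elif type(a) == str:
        b = re.findall(r"\d+\.?\d*", a)
        return b[0] if b else np.nan
    else:
        return a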

In [200]:
def consume_city(a):
    if type(a) == list:
        if len(a) >3:
            for i in a:
                if 'city' in i:
                    return i
        else:
            return a[1]           
    
    else:
        return a
    
df['Consumption_city'] = df1['Consumption'].apply(consume_city)

In [201]:
def consume_country(a):
    if type(a)== list:
        if len(a) >3:
            for i in a:
                if 'country' in i:
                    return i
        else:
            return a[2]            
    
    else:
        return a
    
df['Consumption_country'] = df1['Consumption'].apply(consume_country)

In [202]:
df['Consumption_combined'] = df['Consumption_combined'].apply(cleaning_consumption).astype('float')
df['Consumption_city'] = df['Consumption_city'].apply(cleaning_consumption).astype('float')
df['Consumption_country'] = df['Consumption_country'].apply(cleaning_consumption).astype('float')
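
The three `consume_*` functions rely partly on list position (`a[0]`, `a[1]`, `a[2]`). An alternative is to select each value by its label. The sketch below assumes every entry looks like the sample shown for `df1.Consumption[2]`; `consumption_by_label` and `NUMBER` are hypothetical names.

```python
import re

NUMBER = re.compile(r'\d+(?:\.\d+)?')

def consumption_by_label(entries, label):
    """Return the l/100 km value whose text mentions `label`, else NaN."""
    if not isinstance(entries, list):
        return float('nan')
    for entry in entries:
        # Entries may be one-item lists or plain strings.
        text = entry[0] if isinstance(entry, list) and entry else entry
        if isinstance(text, str) and label in text:
            match = NUMBER.search(text)
            if match:
                return float(match.group())
    return float('nan')

sample = [['3.8 l/100 km (comb)'], ['4.4 l/100 km (city)'], ['3.4 l/100 km (country)']]
print(consumption_by_label(sample, 'comb'))
print(consumption_by_label(sample, 'city'))
print(consumption_by_label(sample, 'country'))
```

Selecting by label means a row that lists only some of the three values would not shift one value into another column.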

### Column 38: CO2 Emission

In [121]:
df['CO2 Emission'] = df['CO2 Emission'].str[0].str.findall(r'\d+').str[0]
df['CO2 Emission']

0         99
1        129
2         99
3         99
4        109
        ... 
15914    139
15915    168
15916    139
15917    139
15918    153
Name: CO2 Emission, Length: 15919, dtype: object

In [122]:
df['CO2 Emission'] = df['CO2 Emission'].astype('float')

### Column 39: Emission Class

In [123]:
df['Emission Class'] = df['Emission Class'].str[0].fillna('unknown').str.strip('\n')
df['Emission Class']

0              Euro 6
1              Euro 6
2              Euro 6
3              Euro 6
4              Euro 6
             ...     
15914         unknown
15915         unknown
15916    Euro 6d-TEMP
15917          Euro 6
15918          Euro 6
Name: Emission Class, Length: 15919, dtype: object

In [124]:
df['Emission Class'].value_counts(dropna=False)

Euro 6          10139
unknown          3021
Euro 6d-TEMP     1845
NaN               607
Euro 6c           127
Euro 5             78
Euro 6d            62
Euro 4             40
Name: Emission Class, dtype: int64

In [125]:
df['Emission Class'] = df['Emission Class'].fillna('unknown')

In [126]:
df['Emission Class'].value_counts(dropna=False)

Euro 6          10139
unknown          3628
Euro 6d-TEMP     1845
Euro 6c           127
Euro 5             78
Euro 6d            62
Euro 4             40
Name: Emission Class, dtype: int64

In [127]:
df['Emission Class'] = df['Emission Class'].replace(['Euro 6d-TEMP', 'Euro 6c', 'Euro 6d'], 'Euro 6')

In [128]:
df['Emission Class'].value_counts(dropna=False)

Euro 6     12173
unknown     3628
Euro 5        78
Euro 4        40
Name: Emission Class, dtype: int64
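
Collapsing the Euro 6 sub-classes can also be expressed as a pattern rather than a list of labels. The sketch below is illustrative only: `emission_family` is a hypothetical name, and it assumes that every class of interest starts with 'Euro' followed by a single digit.

```python
import re

def emission_family(label):
    """Reduce labels such as 'Euro 6d-TEMP' to 'Euro 6'; else 'unknown'."""
    if not isinstance(label, str):
        return 'unknown'
    # Allow leading whitespace such as the '\n' left over from scraping.
    match = re.match(r'\s*(Euro \d)', label)
    return match.group(1) if match else 'unknown'

print(emission_family('Euro 6d-TEMP'))
print(emission_family('\nEuro 6c\n'))
print(emission_family('Euro 5'))
```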

### Column 40: Comfort & Convenience

In [129]:
#Each value in this column is a list of Comfort & Convenience features.
df['\nComfort & Convenience\n']

0        [Air conditioning, Armrest, Automatic climate ...
1        [Air conditioning, Automatic climate control, ...
2        [Air conditioning, Cruise control, Electrical ...
3        [Air suspension, Armrest, Auxiliary heating, E...
4        [Air conditioning, Armrest, Automatic climate ...
                               ...                        
15914    [Air conditioning, Automatic climate control, ...
15915    [Air conditioning, Automatic climate control, ...
15916    [Air conditioning, Armrest, Automatic climate ...
15917    [Air conditioning, Automatic climate control, ...
15918    [Air conditioning, Automatic climate control, ...
Name: \nComfort & Convenience\n, Length: 15919, dtype: object

In [225]:
#Assuming the number of features influences price, I will store the feature count instead of the feature list.
df['Comfort & Convenience'] = df['\nComfort & Convenience\n'].str.len()

In [226]:
df.drop(columns = ['\nComfort & Convenience\n'], inplace = True)

### Column 41: Entertainment & Media

In [130]:
df['\nEntertainment & Media\n']

0        [Bluetooth, Hands-free equipment, On-board com...
1        [Bluetooth, Hands-free equipment, On-board com...
2                                 [MP3, On-board computer]
3        [Bluetooth, CD player, Hands-free equipment, M...
4        [Bluetooth, CD player, Hands-free equipment, M...
                               ...                        
15914    [Bluetooth, Digital radio, Hands-free equipmen...
15915    [Bluetooth, Digital radio, Hands-free equipmen...
15916    [Bluetooth, Hands-free equipment, On-board com...
15917               [Bluetooth, Digital radio, Radio, USB]
15918                                                [USB]
Name: \nEntertainment & Media\n, Length: 15919, dtype: object

In [219]:
df['\nEntertainment & Media\n'][0]

['Bluetooth', 'Hands-free equipment', 'On-board computer', 'Radio']

In [221]:
df['Entertainment & Media'] = df['\nEntertainment & Media\n'].str.len()

In [222]:
df.drop(columns = ['\nEntertainment & Media\n'], inplace = True)

### Column 42: Extras

In [131]:
#Consists of list values
df['\nExtras\n']

0        [Alloy wheels, Catalytic Converter, Voice Cont...
1        [Alloy wheels, Sport seats, Sport suspension, ...
2                            [Alloy wheels, Voice Control]
3               [Alloy wheels, Sport seats, Voice Control]
4        [Alloy wheels, Sport package, Sport suspension...
                               ...                        
15914                         [Alloy wheels, Touch screen]
15915          [Alloy wheels, Touch screen, Voice Control]
15916                                       [Alloy wheels]
15917                         [Alloy wheels, Touch screen]
15918                         [Alloy wheels, Touch screen]
Name: \nExtras\n, Length: 15919, dtype: object

In [211]:
df['\nExtras\n'][1]

['Alloy wheels', 'Sport seats', 'Sport suspension', 'Voice Control']

In [212]:
#Equate the column to the number of Extras features
df['Extras'] = df['\nExtras\n'].str.len()

In [213]:
df.drop(columns = ['\nExtras\n'], inplace = True)

### Column 43: Safety & Security

In [132]:
#Consists of list values
df['\nSafety & Security\n']

0        [ABS, Central door lock, Daytime running light...
1        [ABS, Central door lock, Central door lock wit...
2        [ABS, Central door lock, Daytime running light...
3        [ABS, Alarm system, Central door lock with rem...
4        [ABS, Central door lock, Driver-side airbag, E...
                               ...                        
15914    [ABS, Central door lock, Central door lock wit...
15915    [ABS, Adaptive Cruise Control, Blind spot moni...
15916    [ABS, Adaptive Cruise Control, Blind spot moni...
15917    [ABS, Blind spot monitor, Driver-side airbag, ...
15918    [ABS, Blind spot monitor, Daytime running ligh...
Name: \nSafety & Security\n, Length: 15919, dtype: object

In [216]:
df['\nSafety & Security\n'][0]

['ABS',
 'Central door lock',
 'Daytime running lights',
 'Driver-side airbag',
 'Electronic stability control',
 'Fog lights',
 'Immobilizer',
 'Isofix',
 'Passenger-side airbag',
 'Power steering',
 'Side airbag',
 'Tire pressure monitoring system',
 'Traction control',
 'Xenon headlights']

In [217]:
#Equate the column to the number of Safety & Security features
df['Safety & Security'] = df['\nSafety & Security\n'].str.len()

In [218]:
df.drop(columns = ['\nSafety & Security\n'], inplace = True)

### Column 44: description

In [133]:
df.description

0        [\n, Sicherheit:,  , Deaktivierung für Beifahr...
1        [\nLangstreckenfahrzeug daher die hohe Kilomet...
2        [\n, Fahrzeug-Nummer: AM-95365,  , Ehem. UPE 2...
3        [\nAudi A1: , - 1e eigenaar , - Perfecte staat...
4        [\n, Technik & Sicherheit:, Xenon plus, Klimaa...
                               ...                        
15914    [\nVettura visionabile nella sede in Via Roma ...
15915    [\nDach: Panorama-Glas-Schiebedach, Lackierung...
15916    [\n, Getriebe:,  Automatik, Technik:,  Bordcom...
15917    [\nDEK:[2691331], Renault Espace Blue dCi 200C...
15918    [\n, Sicherheit Airbags:,  , Seitenairbag,  , ...
Name: description, Length: 15919, dtype: object

In [134]:
df.drop(columns=['description'], inplace = True)

### Column 45: Emission Label

In [135]:
df['Emission Label'].value_counts(dropna=False)

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


NaN                     11934
[\n4 (Green)\n]          3553
[\n1 (No sticker)\n]      381
[[], [], []]               40
[\n5 (Blue)\n]              8
[\n3 (Yellow)\n]            2
[\n2 (Red)\n]               1
Name: Emission Label, dtype: int64

In [136]:
#The share of non-null values is too low
df.drop(columns = ['Emission Label'], inplace =True)
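
The `TypeError: unhashable type: 'list'` message above most likely appears because the cells hold Python lists, which cannot be hashed. The plain-Python sketch below shows the idea of converting lists to tuples before counting. `hashable` is a hypothetical helper and the sample cells are made up to resemble the output above.

```python
from collections import Counter

def hashable(value):
    """Recursively turn lists into tuples so the value can be counted."""
    if isinstance(value, list):
        return tuple(hashable(item) for item in value)
    return value

# Made-up cells shaped like the scraped 'Emission Label' values.
cells = [['\n4 (Green)\n'], ['\n4 (Green)\n'], [[], [], []], None]
counts = Counter(hashable(cell) for cell in cells)
print(counts.most_common())
```

In the notebook, a similar conversion could presumably be applied with `.apply(hashable)` before calling `value_counts`.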

### Column 46: Gears

In [137]:
df.Gears = df.Gears.str[0].str.findall(r'\d+').str[0].astype('float')
df.Gears

0        NaN
1        7.0
2        NaN
3        6.0
4        NaN
        ... 
15914    6.0
15915    7.0
15916    6.0
15917    6.0
15918    NaN
Name: Gears, Length: 15919, dtype: float64

### Column 47: Country version

In [138]:
df['Country version'].value_counts(dropna=False)

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


NaN                     8333
[\nGermany\n]           4502
[\nItaly\n]             1038
[\nEuropean Union\n]     507
[\nNetherlands\n]        464
[\nSpain\n]              325
[\nBelgium\n]            314
[\nAustria\n]            208
[\nCzech Republic\n]      52
[\nPoland\n]              49
[\nFrance\n]              38
[\nDenmark\n]             33
[\nHungary\n]             28
[\nJapan\n]                8
[\nSlovakia\n]             4
[\nCroatia\n]              4
[\nSweden\n]               3
[\nBulgaria\n]             2
[\nRomania\n]              2
[\nEgypt\n]                1
[\nSerbia\n]               1
[\nLuxembourg\n]           1
[\nSwitzerland\n]          1
[\nSlovenia\n]             1
Name: Country version, dtype: int64

In [139]:
df['Country version'] = df['Country version'].str[0].fillna('unknown').str.strip('\n')
df['Country version']

0        unknown
1        unknown
2        unknown
3        unknown
4        Germany
          ...   
15914    unknown
15915    Germany
15916    Austria
15917    unknown
15918    Germany
Name: Country version, Length: 15919, dtype: object

### Column 48: Electricity consumption

In [140]:
df['Electricity consumption'].value_counts(dropna=False)

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


NaN                          15782
[\n0 kWh/100 km (comb)\n]      137
Name: Electricity consumption, dtype: int64

In [141]:
df.drop(columns = ['Electricity consumption'], inplace = True)

### Column 49: Last Service Date

In [142]:
df['Last Service Date'].value_counts(dropna=False)

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


NaN                                       15353
[\n02/2019\n, \nEuro 6\n]                    23
[\n05/2019\n, \nEuro 6\n]                    16
[\n01/2018\n, \n118 g CO2/km (comb)\n]       15
[\n03/2019\n, \nEuro 6\n]                    15
                                          ...  
[\n06/2019\n, \n99 g CO2/km (comb)\n]         1
\n07/2019\n                                   1
[\n06/2019\n, \n94 g CO2/km (comb)\n]         1
[\n06/2018\n, \n120 g CO2/km (comb)\n]        1
[\n05/2019\n, \n135 g CO2/km (comb)\n]        1
Name: Last Service Date, Length: 267, dtype: int64

In [143]:
df.drop(columns = ['Last Service Date'], inplace = True)

### Column 50: Other Fuel Types

In [144]:
df['Other Fuel Types'].value_counts(dropna=False)

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


NaN             15039
[[], [], []]      880
Name: Other Fuel Types, dtype: int64

In [145]:
df.drop(columns = ['Other Fuel Types'], inplace = True)

### Column 51: Availability

In [146]:
df.Availability.value_counts(dropna=False)

NaN                              15284
\nin 90 days from ordering\n       196
\nin 120 days from ordering\n      182
\nin 1 day from ordering\n          51
\nin 5 days from ordering\n         35
\nin 3 days from ordering\n         35
\nin 180 days from ordering\n       24
\nin 14 days from ordering\n        24
\nin 7 days from ordering\n         20
\nin 150 days from ordering\n       18
\nin 2 days from ordering\n         16
\nin 60 days from ordering\n        13
\nin 42 days from ordering\n        10
\nin 21 days from ordering\n         8
\nin 4 days from ordering\n          2
\nin 6 days from ordering\n          1
Name: Availability, dtype: int64

In [147]:
df.drop(columns = ['Availability'], inplace = True)

### Column 52: Last Timing Belt Service Date

In [148]:
df.drop(columns = ['Last Timing Belt Service Date'], inplace = True)

### Column 53: Available from

In [149]:
df.drop(columns = ['Available from'], inplace = True)

### Dataframe After Cleaning

In [207]:
df.head(3).T

Unnamed: 0,0,1,2
make_model,Audi A1,Audi A1,Audi A1
price,15770.0,14500.0,14640.0
vat,VAT deductible,Price negotiable,VAT deductible
km,56013.0,80000.0,83450.0
registration,2016.0,2017.0,2016.0
prev_owner,2.0,,1.0
hp,66.0,141.0,85.0
Type,Used,Used,Used
Next Inspection,2021.0,,
Inspection new,Yes,unknown,unknown


In [232]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 36 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   make_model             15919 non-null  object 
 1   price                  15919 non-null  float64
 2   vat                    11406 non-null  object 
 3   km                     14895 non-null  float64
 4   registration           14322 non-null  float64
 5   prev_owner             9254 non-null   float64
 6   hp                     15831 non-null  float64
 7   Type                   15917 non-null  object 
 8   Next Inspection        3535 non-null   float64
 9   Inspection new         15919 non-null  object 
 10  Warranty(months)       4853 non-null   float64
 11  Body Color             15322 non-null  object 
 12  Paint Type             15919 non-null  object 
 13  Body Color Original    15919 non-null  object 
 14  Body                   15859 non-null  object 
 15  Nr

In [233]:
#Save the dataframe as a csv file
df.to_csv('auto_scout_cleaned.csv')