# AUTOSCOUT CAPSTONE PROJECT

<img src=https://i.ibb.co/wJW61Y2/Used-cars.jpg width="700" height="200">

## Introduction
Welcome to the "***AutoScout Data Analysis Project***", the capstone project of the ***Data Analysis*** module. The **Auto Scout** data used in this project was scraped from the online car trading company in 2019 and contains many features of 9 different car models. In this project, you will have the opportunity to apply many commonly used techniques for Data Cleaning and Exploratory Data Analysis with Python libraries such as NumPy, pandas, Matplotlib, Seaborn, and SciPy, and then analyze the cleaned dataset.

**Some Reminders on Exploratory Data Analysis (EDA)**

Exploratory data analysis (EDA) is an especially important activity in the routine of a data analyst or scientist. It enables an in-depth understanding of the dataset, helps define or discard hypotheses, and puts predictive models on a solid basis. It uses data manipulation techniques and several statistical tools to describe and understand the relationships between variables and how these can impact the business. Through EDA, we can obtain meaningful insights by working through the following questions (if a checklist is good enough for pilots to use on every flight, it's good enough for data scientists to use with every dataset):
1. What question are you trying to solve (or prove wrong)?
2. What kind of data do you have?
3. What’s missing from the data?
4. Where are the outliers?
5. How can you add, change or remove features to get more out of your data?
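
As a rough illustration, questions 2 to 4 of the checklist map onto a few pandas calls. The sketch below uses a small made-up frame (`toy` and its values are assumptions, not the AutoScout data) and Tukey's 1.5 × IQR rule as one possible outlier criterion.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset
toy = pd.DataFrame({
    "price": [14500, 15770, 14640, 16790, 99000],
    "km": [80000.0, 56013.0, np.nan, 16200.0, 73000.0],
    "fuel": ["Gasoline", "Diesel", "Diesel", None, "Diesel"],
})

# 2. What kind of data do you have?
print(toy.dtypes)

# 3. What's missing from the data? (share of nulls per column, in percent)
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# 4. Where are the outliers? (Tukey's 1.5 * IQR rule on one numeric column)
q1, q3 = toy["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)]
print(outliers)
```

The IQR rule is only one convention; domain knowledge about plausible car prices may be a better guide on the real data.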

**``Exploratory data analysis (EDA)``** is often an **iterative brainstorming process** where you pose a question, review the data, and develop further questions to investigate before beginning model development work. The image below shows how the brainstorming phase is connected with that of understanding the variables and how this in turn is connected again with the brainstorming phase.<br>

<img src=https://i.ibb.co/k0MC950/EDA-Process.png width="300" height="100">

[Image Credit: Andrew D.](https://towardsdatascience.com/exploratory-data-analysis-in-python-a-step-by-step-process-d0dfa6bf94ee)

**``In this context, the project consists of 3 parts in general:``**
* **The first part** is related to 'Data Cleaning'. It deals with Incorrect Headers, Incorrect Format, Anomalies, and Dropping useless columns.
* **The second part** is related to 'Filling Data', in other words 'Imputation'. It deals with Missing Values. Categorical to numeric transformation is done as well.
* **The third part** is related to 'Handling Outliers of Data' via Visualization libraries. So, some insights will be extracted.

**``NOTE:``**  However, you are free to create your own style. You do NOT have to stick to the steps above. We, the DA & DV instructors, recommend you study each part separately to create a source notebook for each part title for your further studies. 
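
To make the first two parts more concrete, here is a minimal sketch on made-up rows (the `cars` frame and its values are assumptions). It cleans a text column into numbers and then imputes missing values with the median of the same car model, which is one possible strategy among several.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the raw format of the dataset
cars = pd.DataFrame({
    "make_model": ["Audi A1", "Audi A1", "Audi A1", "Opel Corsa", "Opel Corsa"],
    "gears": [6.0, np.nan, 6.0, 5.0, np.nan],
    "km": ["56,013 km", "80,000 km", None, "16,200 km", "73,000 km"],
})

# Part 1 (cleaning): drop the thousands separator, keep the digits, cast to float
cars["km"] = (
    cars["km"]
    .str.replace(",", "", regex=False)
    .str.extract(r"(\d+)", expand=False)
    .astype(float)
)

# Part 2 (imputation): fill missing values with the median of the same model
for col in ["gears", "km"]:
    group_median = cars.groupby("make_model")[col].transform("median")
    cars[col] = cars[col].fillna(group_median)

print(cars)
```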

In [5]:
!jupyter nbconvert --to webpdf --allow-chromium-download AutoScout_Ismail-1.ipynb

[NbConvertApp] Converting notebook AutoScout_Ismail-1.ipynb to webpdf
[INFO] Starting Chromium download.


## Understanding Big Picture 

In [1]:
#Import python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import datetime
import plotly
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff


from pandas.plotting import register_matplotlib_converters
from pylab import rcParams
#from skimpy import clean_columns

import warnings
warnings.filterwarnings("ignore")

plt.rcParams["figure.figsize"] = (12, 8)
pd.set_option('display.max_columns', None)
sns.set_theme(font_scale=1.2, style="darkgrid")
#pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [2]:
# Reading file from json
df_origin = pd.read_json('scout_car.json', lines=True)
df = df_origin.copy()
df.head()

Unnamed: 0,url,make_model,short_description,body_type,price,vat,km,registration,prev_owner,kW,hp,Type,Previous Owners,Next Inspection,Inspection new,Warranty,Full Service,Non-smoking Vehicle,null,Make,Model,Offer Number,First Registration,Body Color,Paint Type,Body Color Original,Upholstery,Body,Nr. of Doors,Nr. of Seats,Model Code,Gearing Type,Displacement,Cylinders,Weight,Drive chain,Fuel,Consumption,CO2 Emission,Emission Class,\nComfort & Convenience\n,\nEntertainment & Media\n,\nExtras\n,\nSafety & Security\n,description,Emission Label,Gears,Country version,Electricity consumption,Last Service Date,Other Fuel Types,Availability,Last Timing Belt Service Date,Available from
0,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 1.4 TDI S-tronic Xenon Navi Klima,Sedans,15770,VAT deductible,"56,013 km",01/2016,2 previous owners,,66 kW,"[, Used, , Diesel (Particulate Filter)]",\n2\n,"[\n06/2021\n, \n99 g CO2/km (comb)\n]","[\nYes\n, \nEuro 6\n]","[\n, \n, \n4 (Green)\n]","[\n, \n]","[\n, \n]",[],\nAudi\n,"[\n, A1, \n]",[\nLR-062483\n],"[\n, 2016, \n]","[\n, Black, \n]",[\nMetallic\n],[\nMythosschwarz\n],"[\nCloth, Black\n]","[\n, Sedans, \n]",[\n5\n],[\n5\n],[\n0588/BDF\n],"[\n, Automatic, \n]","[\n1,422 cc\n]",[\n3\n],"[\n1,220 kg\n]",[\nfront\n],"[\n, Diesel (Particulate Filter), \n]","[[3.8 l/100 km (comb)], [4.3 l/100 km (city)],...",[\n99 g CO2/km (comb)\n],[\nEuro 6\n],"[Air conditioning, Armrest, Automatic climate ...","[Bluetooth, Hands-free equipment, On-board com...","[Alloy wheels, Catalytic Converter, Voice Cont...","[ABS, Central door lock, Daytime running light...","[\n, Sicherheit:, , Deaktivierung für Beifahr...",,,,,,,,,
1,https://www.autoscout24.com//offers/audi-a1-1-...,Audi A1,1.8 TFSI sport,Sedans,14500,Price negotiable,"80,000 km",03/2017,,,141 kW,"[, Used, , Gasoline]",,,,,,,[],\nAudi\n,"[\n, A1, \n]",,"[\n, 2017, \n]","[\n, Red, \n]",,,"[\nCloth, Grey\n]","[\n, Sedans, \n]",[\n3\n],[\n4\n],[\n0588/BCY\n],"[\n, Automatic, \n]","[\n1,798 cc\n]",[\n4\n],"[\n1,255 kg\n]",[\nfront\n],"[\n, Gasoline, \n]","[[5.6 l/100 km (comb)], [7.1 l/100 km (city)],...",[\n129 g CO2/km (comb)\n],[\nEuro 6\n],"[Air conditioning, Automatic climate control, ...","[Bluetooth, Hands-free equipment, On-board com...","[Alloy wheels, Sport seats, Sport suspension, ...","[ABS, Central door lock, Central door lock wit...",[\nLangstreckenfahrzeug daher die hohe Kilomet...,[\n4 (Green)\n],[\n7\n],,,,,,,
2,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 1.6 TDI S tronic Einparkhilfe plus+m...,Sedans,14640,VAT deductible,"83,450 km",02/2016,1 previous owner,,85 kW,"[, Used, , Diesel (Particulate Filter)]",\n1\n,,,"[\n, \n, \n99 g CO2/km (comb)\n]",,,[],\nAudi\n,"[\n, A1, \n]",[\nAM-95365\n],"[\n, 2016, \n]","[\n, Black, \n]",[\nMetallic\n],[\nmythosschwarz metallic\n],"[\nCloth, Black\n]","[\n, Sedans, \n]",[\n4\n],[\n4\n],,"[\n, Automatic, \n]","[\n1,598 cc\n]",,,[\nfront\n],"[\n, Diesel (Particulate Filter), \n]","[[3.8 l/100 km (comb)], [4.4 l/100 km (city)],...",[\n99 g CO2/km (comb)\n],[\nEuro 6\n],"[Air conditioning, Cruise control, Electrical ...","[MP3, On-board computer]","[Alloy wheels, Voice Control]","[ABS, Central door lock, Daytime running light...","[\n, Fahrzeug-Nummer: AM-95365, , Ehem. UPE 2...",[\n4 (Green)\n],,,,,,,,
3,https://www.autoscout24.com//offers/audi-a1-1-...,Audi A1,1.4 TDi Design S tronic,Sedans,14500,,"73,000 km",08/2016,1 previous owner,,66 kW,"[, Used, , Diesel (Particulate Filter)]",\n1\n,,,,"[\n, \n, \n99 g CO2/km (comb)\n]","[\n, \n, \nEuro 6\n]",[],\nAudi\n,"[\n, A1, \n]",,"[\n, 2016, \n]","[\n, Brown, \n]",[\nMetallic\n],,,"[\n, Sedans, \n]",[\n3\n],[\n4\n],,"[\n, Automatic, \n]","[\n1,422 cc\n]",[\n3\n],"[\n1,195 kg\n]",,"[\n, Diesel (Particulate Filter), \n]","[[3.8 l/100 km (comb)], [4.3 l/100 km (city)],...",[\n99 g CO2/km (comb)\n],[\nEuro 6\n],"[Air suspension, Armrest, Auxiliary heating, E...","[Bluetooth, CD player, Hands-free equipment, M...","[Alloy wheels, Sport seats, Voice Control]","[ABS, Alarm system, Central door lock with rem...","[\nAudi A1: , - 1e eigenaar , - Perfecte staat...",,[\n6\n],,,,,,,
4,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 1.4 TDI S-Tronic S-Line Ext. admired...,Sedans,16790,,"16,200 km",05/2016,1 previous owner,,66 kW,"[, Used, , Diesel (Particulate Filter)]",\n1\n,,"[\nYes\n, \n109 g CO2/km (comb)\n]","[\n, \n, \nEuro 6\n]","[\n, \n, \n4 (Green)\n]","[\n, \n]",[],\nAudi\n,"[\n, A1, \n]",[\nC1626\n],"[\n, 2016, \n]","[\n, Black, \n]",[\nMetallic\n],[\nMythosschwarz Metallic\n],"[\nCloth, Black\n]","[\n, Sedans, \n]",[\n5\n],[\n5\n],[\n0588/BDF\n],"[\n, Automatic, \n]","[\n1,422 cc\n]",[\n3\n],,[\nfront\n],"[\n, Diesel (Particulate Filter), \n]","[[4.1 l/100 km (comb)], [4.6 l/100 km (city)],...",[\n109 g CO2/km (comb)\n],[\nEuro 6\n],"[Air conditioning, Armrest, Automatic climate ...","[Bluetooth, CD player, Hands-free equipment, M...","[Alloy wheels, Sport package, Sport suspension...","[ABS, Central door lock, Driver-side airbag, E...","[\n, Technik & Sicherheit:, Xenon plus, Klimaa...",,,[\nGermany\n],,,,,,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 54 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   url                            15919 non-null  object 
 1   make_model                     15919 non-null  object 
 2   short_description              15873 non-null  object 
 3   body_type                      15859 non-null  object 
 4   price                          15919 non-null  int64  
 5   vat                            11406 non-null  object 
 6   km                             15919 non-null  object 
 7   registration                   15919 non-null  object 
 8   prev_owner                     9091 non-null   object 
 9   kW                             0 non-null      float64
 10  hp                             15919 non-null  object 
 11  Type                           15917 non-null  object 
 12  Previous Owners                9279 non-null  

In [4]:
df.head().T

Unnamed: 0,0,1,2,3,4
url,https://www.autoscout24.com//offers/audi-a1-sp...,https://www.autoscout24.com//offers/audi-a1-1-...,https://www.autoscout24.com//offers/audi-a1-sp...,https://www.autoscout24.com//offers/audi-a1-1-...,https://www.autoscout24.com//offers/audi-a1-sp...
make_model,Audi A1,Audi A1,Audi A1,Audi A1,Audi A1
short_description,Sportback 1.4 TDI S-tronic Xenon Navi Klima,1.8 TFSI sport,Sportback 1.6 TDI S tronic Einparkhilfe plus+m...,1.4 TDi Design S tronic,Sportback 1.4 TDI S-Tronic S-Line Ext. admired...
body_type,Sedans,Sedans,Sedans,Sedans,Sedans
price,15770,14500,14640,14500,16790
vat,VAT deductible,Price negotiable,VAT deductible,,
km,"56,013 km","80,000 km","83,450 km","73,000 km","16,200 km"
registration,01/2016,03/2017,02/2016,08/2016,05/2016
prev_owner,2 previous owners,,1 previous owner,1 previous owner,1 previous owner
kW,,,,,


## Preparation 

In [5]:
# df = clean_columns(df)  # normally done with skimpy; here the names are set by hand
df.columns = ['url', 'make_model', 'short_description', 'body_type', 'price', 'vat',
       'km', 'registration', 'prev_owner', 'k_w', 'hp', 'type',
       'previous_owners', 'next_inspection', 'inspection_new', 'warranty',
       'full_service', 'non_smoking_vehicle', 'null', 'make', 'model',
       'offer_number', 'first_registration', 'body_color', 'paint_type',
       'body_color_original', 'upholstery', 'body', 'nr_of_doors',
       'nr_of_seats', 'model_code', 'gearing_type', 'displacement',
       'cylinders', 'weight', 'drive_chain', 'fuel', 'consumption',
       'co_2_emission', 'emission_class', 'comfort_&_convenience',
       'entertainment_&_media', 'extras', 'safety_&_security', 'description',
       'emission_label', 'gears', 'country_version', 'electricity_consumption',
       'last_service_date', 'other_fuel_types', 'availability',
       'last_timing_belt_service_date', 'available_from']
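
If skimpy is not available, most of these names can also be derived rather than typed. The helper below is a hypothetical sketch (`to_snake_case` is not part of any library) and it does not reproduce every name in the list above: for example, it would leave `kW` as `kw` rather than `k_w`.

```python
import re

def to_snake_case(name: str) -> str:
    """Hypothetical helper: lowercase a header and join its words with underscores."""
    name = name.strip().lower()            # drop surrounding whitespace/newlines
    name = re.sub(r"[.\s]+", "_", name)    # dots and whitespace runs become "_"
    name = re.sub(r"_+", "_", name)        # collapse repeated underscores
    return name.strip("_")

raw_headers = ["Body Color", "Nr. of Doors", "\nSafety & Security\n", "make_model"]
clean_headers = [to_snake_case(h) for h in raw_headers]
print(clean_headers)
```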

In [6]:
# Helper that prints a quick overview of a single column (nulls, uniques, dtype, value counts)
def first_look(col):
    print('column name : ', col)
    print("--"*20)
    print('Per_of_Nulls   : ', '%', round(df[col].isnull().sum() / df.shape[0]*100, 2))
    print('Number of Nulls  : ', df[col].isnull().sum())
    print('Number of Uniques: ', df[col].nunique())
    print('Type of columns: ', df[col].dtype)
    print("--"*20)
    print('Unique values of columns: ', df[col].unique())
    print("--"*20)
    print(df[col].value_counts(dropna = False).sort_index())
    print("--"*20)
    print(df[col].value_counts(dropna = False))
    print("##"*40)
    print()
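
Note that `first_look` reads the global `df`. A possible variant, sketched below under the assumption that a returned summary is more convenient than printed text, takes the frame as an argument (`column_summary` and the toy frame are hypothetical names).

```python
import pandas as pd

def column_summary(frame: pd.DataFrame, col: str) -> dict:
    """Hypothetical variant of first_look that returns the numbers instead of printing them."""
    nulls = int(frame[col].isnull().sum())
    return {
        "column": col,
        "null_pct": round(nulls / len(frame) * 100, 2),
        "nulls": nulls,
        "uniques": int(frame[col].nunique()),
        "dtype": str(frame[col].dtype),
    }

toy = pd.DataFrame({"vat": ["VAT deductible", "Price negotiable", "VAT deductible", None]})
summary = column_summary(toy, "vat")
print(summary)
```

Returning a dict makes it easy to collect the summaries of all columns into one overview table with `pd.DataFrame(...)`.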

In [7]:
# Convert list-valued cells to comma-separated strings
df = df.applymap(lambda x: ','.join(map(str, x)) if isinstance(x, list) else x)
df.head()

Unnamed: 0,url,make_model,short_description,body_type,price,vat,km,registration,prev_owner,k_w,hp,type,previous_owners,next_inspection,inspection_new,warranty,full_service,non_smoking_vehicle,null,make,model,offer_number,first_registration,body_color,paint_type,body_color_original,upholstery,body,nr_of_doors,nr_of_seats,model_code,gearing_type,displacement,cylinders,weight,drive_chain,fuel,consumption,co_2_emission,emission_class,comfort_&_convenience,entertainment_&_media,extras,safety_&_security,description,emission_label,gears,country_version,electricity_consumption,last_service_date,other_fuel_types,availability,last_timing_belt_service_date,available_from
0,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 1.4 TDI S-tronic Xenon Navi Klima,Sedans,15770,VAT deductible,"56,013 km",01/2016,2 previous owners,,66 kW,",Used,,Diesel (Particulate Filter)",\n2\n,"\n06/2021\n,\n99 g CO2/km (comb)\n","\nYes\n,\nEuro 6\n","\n,\n,\n4 (Green)\n","\n,\n","\n,\n",,\nAudi\n,"\n,A1,\n",\nLR-062483\n,"\n,2016,\n","\n,Black,\n",\nMetallic\n,\nMythosschwarz\n,"\nCloth, Black\n","\n,Sedans,\n",\n5\n,\n5\n,\n0588/BDF\n,"\n,Automatic,\n","\n1,422 cc\n",\n3\n,"\n1,220 kg\n",\nfront\n,"\n,Diesel (Particulate Filter),\n","['3.8 l/100 km (comb)'],['4.3 l/100 km (city)'...",\n99 g CO2/km (comb)\n,\nEuro 6\n,"Air conditioning,Armrest,Automatic climate con...","Bluetooth,Hands-free equipment,On-board comput...","Alloy wheels,Catalytic Converter,Voice Control","ABS,Central door lock,Daytime running lights,D...","\n,Sicherheit:, ,Deaktivierung für Beifahrer-A...",,,,,,,,,
1,https://www.autoscout24.com//offers/audi-a1-1-...,Audi A1,1.8 TFSI sport,Sedans,14500,Price negotiable,"80,000 km",03/2017,,,141 kW,",Used,,Gasoline",,,,,,,,\nAudi\n,"\n,A1,\n",,"\n,2017,\n","\n,Red,\n",,,"\nCloth, Grey\n","\n,Sedans,\n",\n3\n,\n4\n,\n0588/BCY\n,"\n,Automatic,\n","\n1,798 cc\n",\n4\n,"\n1,255 kg\n",\nfront\n,"\n,Gasoline,\n","['5.6 l/100 km (comb)'],['7.1 l/100 km (city)'...",\n129 g CO2/km (comb)\n,\nEuro 6\n,"Air conditioning,Automatic climate control,Hil...","Bluetooth,Hands-free equipment,On-board comput...","Alloy wheels,Sport seats,Sport suspension,Voic...","ABS,Central door lock,Central door lock with r...",\nLangstreckenfahrzeug daher die hohe Kilomete...,\n4 (Green)\n,\n7\n,,,,,,,
2,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 1.6 TDI S tronic Einparkhilfe plus+m...,Sedans,14640,VAT deductible,"83,450 km",02/2016,1 previous owner,,85 kW,",Used,,Diesel (Particulate Filter)",\n1\n,,,"\n,\n,\n99 g CO2/km (comb)\n",,,,\nAudi\n,"\n,A1,\n",\nAM-95365\n,"\n,2016,\n","\n,Black,\n",\nMetallic\n,\nmythosschwarz metallic\n,"\nCloth, Black\n","\n,Sedans,\n",\n4\n,\n4\n,,"\n,Automatic,\n","\n1,598 cc\n",,,\nfront\n,"\n,Diesel (Particulate Filter),\n","['3.8 l/100 km (comb)'],['4.4 l/100 km (city)'...",\n99 g CO2/km (comb)\n,\nEuro 6\n,"Air conditioning,Cruise control,Electrical sid...","MP3,On-board computer","Alloy wheels,Voice Control","ABS,Central door lock,Daytime running lights,D...","\n,Fahrzeug-Nummer: AM-95365, ,Ehem. UPE 24.64...",\n4 (Green)\n,,,,,,,,
3,https://www.autoscout24.com//offers/audi-a1-1-...,Audi A1,1.4 TDi Design S tronic,Sedans,14500,,"73,000 km",08/2016,1 previous owner,,66 kW,",Used,,Diesel (Particulate Filter)",\n1\n,,,,"\n,\n,\n99 g CO2/km (comb)\n","\n,\n,\nEuro 6\n",,\nAudi\n,"\n,A1,\n",,"\n,2016,\n","\n,Brown,\n",\nMetallic\n,,,"\n,Sedans,\n",\n3\n,\n4\n,,"\n,Automatic,\n","\n1,422 cc\n",\n3\n,"\n1,195 kg\n",,"\n,Diesel (Particulate Filter),\n","['3.8 l/100 km (comb)'],['4.3 l/100 km (city)'...",\n99 g CO2/km (comb)\n,\nEuro 6\n,"Air suspension,Armrest,Auxiliary heating,Elect...","Bluetooth,CD player,Hands-free equipment,MP3,O...","Alloy wheels,Sport seats,Voice Control","ABS,Alarm system,Central door lock with remote...","\nAudi A1: ,- 1e eigenaar ,- Perfecte staat: s...",,\n6\n,,,,,,,
4,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 1.4 TDI S-Tronic S-Line Ext. admired...,Sedans,16790,,"16,200 km",05/2016,1 previous owner,,66 kW,",Used,,Diesel (Particulate Filter)",\n1\n,,"\nYes\n,\n109 g CO2/km (comb)\n","\n,\n,\nEuro 6\n","\n,\n,\n4 (Green)\n","\n,\n",,\nAudi\n,"\n,A1,\n",\nC1626\n,"\n,2016,\n","\n,Black,\n",\nMetallic\n,\nMythosschwarz Metallic\n,"\nCloth, Black\n","\n,Sedans,\n",\n5\n,\n5\n,\n0588/BDF\n,"\n,Automatic,\n","\n1,422 cc\n",\n3\n,,\nfront\n,"\n,Diesel (Particulate Filter),\n","['4.1 l/100 km (comb)'],['4.6 l/100 km (city)'...",\n109 g CO2/km (comb)\n,\nEuro 6\n,"Air conditioning,Armrest,Automatic climate con...","Bluetooth,CD player,Hands-free equipment,MP3,O...","Alloy wheels,Sport package,Sport suspension,Vo...","ABS,Central door lock,Driver-side airbag,Elect...","\n,Technik & Sicherheit:,Xenon plus,Klimaautom...",,,\nGermany\n,,,,,,
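
`DataFrame.applymap` has been deprecated since pandas 2.1 in favour of `DataFrame.map`. The sketch below shows the same list-flattening step on a made-up frame using a column-wise `Series.map`, which should run on both older and newer pandas versions (`join_if_list` and `toy` are hypothetical names).

```python
import pandas as pd

def join_if_list(value):
    # Leave scalars (str, int, NaN) untouched; flatten lists into one string
    return ",".join(map(str, value)) if isinstance(value, list) else value

# Hypothetical frame with one list-valued column, as produced by the scraper
toy = pd.DataFrame({
    "model": [["\n", "A1", "\n"], ["\n", "A3", "\n"]],
    "price": [15770, 14500],
})

# Column-wise Series.map has the same element-wise effect as applymap
flat = toy.apply(lambda column: column.map(join_if_list))
print(flat["model"].tolist())
```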


In [8]:
for i in df.select_dtypes(include="O"):
    first_look(i)

column name :  url
----------------------------------------
Per_of_Nulls   :  % 0.0
Number of Nulls  :  0
Number of Uniques:  15919
Type of columns:  object
----------------------------------------
Unique values of columns:  ['https://www.autoscout24.com//offers/audi-a1-sportback-1-4-tdi-s-tronic-xenon-navi-klima-diesel-black-bdab349a-caa5-41b0-98eb-c1345b84445e'
 'https://www.autoscout24.com//offers/audi-a1-1-8-tfsi-sport-gasoline-red-b2547f8a-e83f-6237-e053-e250040a56df'
 'https://www.autoscout24.com//offers/audi-a1-sportback-1-6-tdi-s-tronic-einparkhilfe-plus-music-diesel-black-6183cb6a-8570-4b86-a132-9b54214bca88'
 ...
 'https://www.autoscout24.com//offers/renault-espace-blue-dci-200-edc-initiale-paris-leder-led-navi-key-diesel-white-6256d1a3-ea68-4193-91de-b5b5ffa4631c'
 'https://www.autoscout24.com//offers/renault-espace-blue-dci-200cv-edc-business-nuova-da-immatricola-diesel-grey-5b0251a1-bd88-475c-a039-7e499da85d9d'
 'https://www.autoscout24.com//offers/renault-espace-initiale-

Per_of_Nulls   :  % 0.01
Number of Nulls  :  2
Number of Uniques:  169
Type of columns:  object
----------------------------------------
Unique values of columns:  [',Used,,Diesel (Particulate Filter)' ',Used,,Gasoline' ',Used,,Super 95'
 ',Used,,Regular/Benzine 91'
 ",Employee's car,,Diesel (Particulate Filter)" ',Used,,Diesel'
 ',Used,,Regular/Benzine 91 / Super Plus 98 / Regular/Benzine E10 91 / Super 95 / Super E10 95 / Super Plus E10 98'
 ",Employee's car,,Regular/Benzine 91" ",Employee's car,,Diesel"
 ",Employee's car,,Super E10 95" ',New,,Super 95 (Particulate Filter)'
 ',Used,,Super 95 / Regular/Benzine 91' ",Employee's car,,Super 95"
 ',Used,,Super 95 / Super Plus 98 / Super E10 95 / Super Plus E10 98'
 ',Used,,Super E10 95 / Super 95' ',Used,,Super E10 95'
 ',Used,,Super 95 / Regular/Benzine 91 / Super Plus 98'
 ',Demonstration,,Super 95'
 ',Used,,Super 95 / Super Plus 98 / Super E10 95'
 ',Used,,Super 95 / Super Plus 98' ",Employee's car,,Gasoline"
 ',Used,,Super 95 / Regula

Per_of_Nulls   :  % 19.94
Number of Nulls  :  3175
Number of Uniques:  11440
Type of columns:  object
----------------------------------------
Unique values of columns:  ['\nLR-062483\n' nan '\nAM-95365\n' ... '\nEspace16\n' '\n2691331\n'
 '\nRe_30000008029\n']
----------------------------------------
\n# 250678\n          1
\n# 8H6050830\n       1
\n# G1024529\n        1
\n# G6050580\n        1
\n#8023778\n          2
                   ... 
\nx_45689v\n          2
\ny8fx64x\n           1
\nzr11914\n           1
\nzr11916\n           1
NaN                3175
Name: offer_number, Length: 11441, dtype: int64
----------------------------------------
NaN                                             3175
\nLT67679\n                                       27
\nUN89904\n                                       27
\nXJ38068\n                                       27
\nJV03654\n                                       27
                                                ... 
\n160_dcbb6c3e-a6da-43a3-8

Per_of_Nulls   :  % 0.38
Number of Nulls  :  60
Number of Uniques:  9
Type of columns:  object
----------------------------------------
Unique values of columns:  ['\n,Sedans,\n' '\n,Station wagon,\n' '\n,Compact,\n' '\n,Other,\n'
 '\n,Coupe,\n' '\n,Van,\n' '\n,Off-Road,\n' '\n,Convertible,\n' nan
 '\n,Transporter,\n']
----------------------------------------
\n,Compact,\n          3153
\n,Convertible,\n         8
\n,Coupe,\n              25
\n,Off-Road,\n           56
\n,Other,\n             290
\n,Sedans,\n           7903
\n,Station wagon,\n    3553
\n,Transporter,\n        88
\n,Van,\n               783
NaN                      60
Name: body, dtype: int64
----------------------------------------
\n,Sedans,\n           7903
\n,Station wagon,\n    3553
\n,Compact,\n          3153
\n,Van,\n               783
\n,Other,\n             290
\n,Transporter,\n        88
NaN                      60
\n,Off-Road,\n           56
\n,Coupe,\n              25
\n,Convertible,\n         8
Name: body, 

Unique values of columns:  ['\n,Diesel (Particulate Filter),\n' '\n,Gasoline,\n' '\n,Super 95,\n'
 '\n,Regular/Benzine 91,\n' '\n,Diesel,\n'
 '\n,Regular/Benzine 91 / Super Plus 98 / Regular/Benzine E10 91 / Super 95 / Super E10 95 / Super Plus E10 98,\n'
 '\n,Super E10 95,\n' '\n,Super 95 (Particulate Filter),\n'
 '\n,Super 95 / Regular/Benzine 91,\n'
 '\n,Super 95 / Super Plus 98 / Super E10 95 / Super Plus E10 98,\n'
 '\n,Super E10 95 / Super 95,\n'
 '\n,Super 95 / Regular/Benzine 91 / Super Plus 98,\n'
 '\n,Super 95 / Super Plus 98 / Super E10 95,\n'
 '\n,Super 95 / Super Plus 98,\n'
 '\n,Super 95 / Regular/Benzine 91 / Super E10 95 / Super Plus E10 98 / Super Plus 98 / Regular/Benzine E10 91,\n'
 '\n,Others,\n' '\n,Super 95 / Super E10 95,\n'
 '\n,Gasoline (Particulate Filter),\n'
 '\n,Regular/Benzine E10 91 / Regular/Benzine 91 / Super 95 / Super Plus 98 / Super E10 95 / Super Plus E10 98,\n'
 '\n,Super E10 95 / Super 95 / Super Plus 98 / Super Plus E10 98 (Particulate Filter),\n

ABS                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        4
ABS,Adaptive Cruise Control,Adaptive headlights,Alarm system,Blind spot monitor,Central door lock with remote control,Daytime running lights,Driver-side airbag,Electronic stability control,Emergency system,Fog lights,Head airbag,Immobilizer,Isofix,LED Daytime Running Lights,Passenger-side airbag,Power steering,Side airbag,Tire pressure monitoring system,Traction control,Traffic sign recognition,Xenon headlights                                                                             1
              

Unique values of columns:  ['\n,Sicherheit:, ,Deaktivierung für Beifahrer-Airbag, ,ESC mit elektronischer Quersperre, ,Tagfahrlicht, ,Reifendruck-Kontrollanzeige, ,Kopfairbag-System mit Seiten-Airbags vorn, ,Sicherheitslenksäule,Assistenzsysteme:, ,Berganfahrassistent,Komfort:, ,Scheinwerferreinigung, ,Xenon plus inklusive Scheinwerfer-Reinigungsanlage, ,Scheinwerfer-Reinigungsanlage, ,Einparkhilfe hinten, ,Licht-/Regensensor, ,Funkfernbedienung, ,Elektrische Luftzusatzheizung,Interieur:, ,Rücksitzanlage 2 + 1, ,Multifunktions-Sportlederlenkrad im 3-Speichen-Design, ,automatische Leuchtweitenregulierung, ,Fahrerinformationssystem, ,Staub- und Pollenfilter, ,Kopfstützen hinten (3 Stück), ,Stoff Zeitgeist, ,Fahrersitz manuell höheneinstellbar, ,Scheiben seitlich und hinten in Wärmeschutzverglasung, ,Kindersitzbefestigung ISOFIX und Top Tether für die äußeren Fondsitze, ,Dachhimmel in Stoff titangrau, ,Waschwasser-Standanzeige, ,Nichtraucherfahrzeug,Exterieur:, ,Elektrische Aussenspiegel,

In [9]:
# Drop the null and k_w columns, which contain no data, and description, which is non-English free text (e.g. German)
df.drop(columns=["null","k_w","description"], inplace=True)
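
Columns without any data can also be found programmatically instead of by eye. The sketch below works on a made-up frame and treats both NaN and empty strings as "no data", on the assumption that empty lists were turned into empty strings by the earlier join step.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one useful column and two columns without information
toy = pd.DataFrame({
    "price": [15770, 14500, 14640],
    "k_w": [np.nan, np.nan, np.nan],
    "null": ["", "", ""],
})

# A column is empty when every value is NaN or an empty string
is_empty = (toy.isnull() | toy.eq("")).all()
empty_cols = toy.columns[is_empty].tolist()
print(empty_cols)

toy = toy.drop(columns=empty_cols)
```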

In [10]:
# Strip newlines (and a comma directly after them) from the consumption column
df["consumption"] = df["consumption"].str.replace("\n,?", "", regex=True)

In [11]:
# Check how many rows mention each consumption type (comb, city, country)
print("Total value of comb in consumption column = ", df["consumption"].str.contains("comb").value_counts().head(1).values)
print("Total value of city in consumption column = ", df["consumption"].str.contains("city").value_counts().head(1).values)
print("Total value of country in consumption column = ", df["consumption"].str.contains("country").value_counts().head(1).values)

Total value of comb in consumption column =  [13886]
Total value of city in consumption column =  [13483]
Total value of country in consumption column =  [13543]


In [12]:
# Split the consumption column into three parts (comb, city, country); raw strings keep the regex escapes intact
df["consumption_comb"] = df["consumption"].str.split(",",expand=True)[0].str.extract(r"(\d.*)\s\w+.*/100 km\s\(comb")
df["consumption_city"] = df["consumption"].str.split(",",expand=True)[1].str.extract(r"(\d.*)\s\w+.*/100 km\s\(city")
df["consumption_country"] = df["consumption"].str.split(",",expand=True)[2].str.extract(r"(\d.*)\s\w+.*/100 km\s\(country")
print("Total of non-null value in consumption_comb column = ", len(df["consumption_comb"]) - df["consumption_comb"].isnull().sum())
print("Total of non-null value in consumption_city column = ", len(df["consumption_city"]) - df["consumption_city"].isnull().sum())
print("Total of non-null value in consumption_country column = ", len(df["consumption_country"]) - df["consumption_country"].isnull().sum())


Total of non-null value in consumption_comb column =  13886
Total of non-null value in consumption_city column =  13483
Total of non-null value in consumption_country column =  13543
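
The split-then-extract approach relies on comb, city and country always appearing in that order. As an alternative sketch, each value can be extracted by its label regardless of position and cast to float in one go. The strings below are made up to resemble the joined format shown earlier, and the pattern assumes the unit is always `l/100 km`.

```python
import pandas as pd

# Hypothetical strings in the same shape as the joined consumption column
consumption = pd.Series([
    "['3.8 l/100 km (comb)'],['4.3 l/100 km (city)'],['3.5 l/100 km (country)']",
    "['5.6 l/100 km (comb)'],['7.1 l/100 km (city)']",
    None,
])

# One column per label; rows without that label stay NaN
parts = pd.DataFrame({
    kind: consumption.str.extract(
        rf"(\d+(?:\.\d+)?) l/100 km \({kind}\)", expand=False
    ).astype(float)
    for kind in ["comb", "city", "country"]
})
print(parts)
```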


In [13]:
# Remove newline characters (and the commas attached to them) from all object columns
for i in df.select_dtypes(include="O"):
    df[i] = df[i].str.replace(",?\n,?", "", regex=True)
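
The effect of this pattern is easiest to see on a few sample values. In the sketch below (the `raw` series is made up from values shown in the outputs above), only the commas attached to a newline are removed, while commas inside the value survive. The `regex=True` argument is spelled out because the default of `Series.str.replace` changed to `False` in pandas 2.0.

```python
import pandas as pd

raw = pd.Series(["\n,Sedans,\n", "\n1,422 cc\n", "\nCloth, Black\n", None])

# Optional comma, newline, optional comma: all removed together
cleaned = raw.str.replace(r",?\n,?", "", regex=True)
print(cleaned.tolist())
```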

In [14]:
df.head().T

Unnamed: 0,0,1,2,3,4
url,https://www.autoscout24.com//offers/audi-a1-sp...,https://www.autoscout24.com//offers/audi-a1-1-...,https://www.autoscout24.com//offers/audi-a1-sp...,https://www.autoscout24.com//offers/audi-a1-1-...,https://www.autoscout24.com//offers/audi-a1-sp...
make_model,Audi A1,Audi A1,Audi A1,Audi A1,Audi A1
short_description,Sportback 1.4 TDI S-tronic Xenon Navi Klima,1.8 TFSI sport,Sportback 1.6 TDI S tronic Einparkhilfe plus+m...,1.4 TDi Design S tronic,Sportback 1.4 TDI S-Tronic S-Line Ext. admired...
body_type,Sedans,Sedans,Sedans,Sedans,Sedans
price,15770,14500,14640,14500,16790
vat,VAT deductible,Price negotiable,VAT deductible,,
km,"56,013 km","80,000 km","83,450 km","73,000 km","16,200 km"
registration,01/2016,03/2017,02/2016,08/2016,05/2016
prev_owner,2 previous owners,,1 previous owner,1 previous owner,1 previous owner
hp,66 kW,141 kW,85 kW,66 kW,66 kW


In [15]:
# def check_other_column_values(row) :
#     if any(i in str(row) for i in col_list) :
#         return True
#     else :
#         return False

# def find_easy(col_main, col_other, pattern) :
#     global col_list
#     col_list = list(df[col_main].dropna().unique())
#     if df[col_other].apply(check_other_column_values).isnull().sum() > 0 :
#         df["find"] = df[col_other].str.extract(pattern)
#         x_sum = df[df[col_main].isnull()]["find"].notna().sum()
#         if x_sum > 0 :
#             return f"{col_other} has {x_sum} values. You can fill {col_main} with them."
#         else :
#             return f"There isn't any useful information in {col_other} to fill {col_main}."

In [16]:
df.duplicated().sum()

0
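
A result of 0 is expected here, because every row has its own `url` and so no two rows can be identical across all columns. A hedged sketch of a stricter check, on a made-up frame, compares rows on everything except `url`; whether such matches are true duplicates on the real data would still need a manual look.

```python
import pandas as pd

# Hypothetical listings: two offers with different URLs but identical content
toy = pd.DataFrame({
    "url": ["offer-a", "offer-b", "offer-c"],
    "make_model": ["Audi A1", "Audi A1", "Opel Corsa"],
    "price": [14500, 14500, 9900],
})

full_dupes = int(toy.duplicated().sum())

# Ignore the identifier column when looking for repeated content
content_cols = toy.columns.difference(["url"]).tolist()
content_dupes = int(toy.duplicated(subset=content_cols).sum())
print(full_dupes, content_dupes)
```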

##  Understanding Variables

### url

In [17]:
first_look("url")

column name :  url
----------------------------------------
Per_of_Nulls   :  % 0.0
Number of Nulls  :  0
Number of Uniques:  15919
Type of columns:  object
----------------------------------------
Unique values of columns:  ['https://www.autoscout24.com//offers/audi-a1-sportback-1-4-tdi-s-tronic-xenon-navi-klima-diesel-black-bdab349a-caa5-41b0-98eb-c1345b84445e'
 'https://www.autoscout24.com//offers/audi-a1-1-8-tfsi-sport-gasoline-red-b2547f8a-e83f-6237-e053-e250040a56df'
 'https://www.autoscout24.com//offers/audi-a1-sportback-1-6-tdi-s-tronic-einparkhilfe-plus-music-diesel-black-6183cb6a-8570-4b86-a132-9b54214bca88'
 ...
 'https://www.autoscout24.com//offers/renault-espace-blue-dci-200-edc-initiale-paris-leder-led-navi-key-diesel-white-6256d1a3-ea68-4193-91de-b5b5ffa4631c'
 'https://www.autoscout24.com//offers/renault-espace-blue-dci-200cv-edc-business-nuova-da-immatricola-diesel-grey-5b0251a1-bd88-475c-a039-7e499da85d9d'
 'https://www.autoscout24.com//offers/renault-espace-initiale-

### short_description

In [18]:
first_look("short_description")

column name :  short_description
----------------------------------------
Per_of_Nulls   :  % 0.29
Number of Nulls  :  46
Number of Uniques:  10001
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Sportback 1.4 TDI S-tronic Xenon Navi Klima' '1.8 TFSI sport'
 'Sportback 1.6 TDI S tronic Einparkhilfe plus+music' ...
 'ELYSEE ENERGY dCi 160 EDC' 'INITIALE Paris TCe 225 EDC GPF ACC EU6'
 'TCe 225 EDC GPF LIM Deluxe Pano,RFK']
----------------------------------------
 1,6 CDTI**NAVI*Klimaaut*Tempomat*EURO6*TÜV NEU     1
 1.4 75 cv5 porte b-color                           1
 1.4 TFSI Attraction (125 CV)                       6
 1.4 Turbo S&S Excellence (125 CV)                  2
 1.6 CDTi Dynamic (110 CV)                          4
                                                   ..
van 1.5 dci 75cv S S E6                             1
van 1.5 dci 75cv S&S E6                            24
zoé life                                            1


###  make_model & make & model

In [19]:
first_look("make_model")

column name :  make_model
----------------------------------------
Per_of_Nulls   :  % 0.0
Number of Nulls  :  0
Number of Uniques:  9
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Audi A1' 'Audi A2' 'Audi A3' 'Opel Astra' 'Opel Corsa' 'Opel Insignia'
 'Renault Clio' 'Renault Duster' 'Renault Espace']
----------------------------------------
Audi A1           2614
Audi A2              1
Audi A3           3097
Opel Astra        2526
Opel Corsa        2219
Opel Insignia     2598
Renault Clio      1839
Renault Duster      34
Renault Espace     991
Name: make_model, dtype: int64
----------------------------------------
Audi A3           3097
Audi A1           2614
Opel Insignia     2598
Opel Astra        2526
Opel Corsa        2219
Renault Clio      1839
Renault Espace     991
Renault Duster      34
Audi A2              1
Name: make_model, dtype: int64
################################################################################



In [20]:
first_look("make")

column name :  make
----------------------------------------
Per_of_Nulls   :  % 0.0
Number of Nulls  :  0
Number of Uniques:  3
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Audi' 'Opel' 'Renault']
----------------------------------------
Audi       5712
Opel       7343
Renault    2864
Name: make, dtype: int64
----------------------------------------
Opel       7343
Audi       5712
Renault    2864
Name: make, dtype: int64
################################################################################



In [21]:
first_look("model")

column name :  model
----------------------------------------
Per_of_Nulls   :  % 0.0
Number of Nulls  :  0
Number of Uniques:  9
Type of columns:  object
----------------------------------------
Unique values of columns:  ['A1' 'A2' 'A3' 'Astra' 'Corsa' 'Insignia' 'Clio' 'Duster' 'Espace']
----------------------------------------
A1          2614
A2             1
A3          3097
Astra       2526
Clio        1839
Corsa       2219
Duster        34
Espace       991
Insignia    2598
Name: model, dtype: int64
----------------------------------------
A3          3097
A1          2614
Insignia    2598
Astra       2526
Corsa       2219
Clio        1839
Espace       991
Duster        34
A2             1
Name: model, dtype: int64
################################################################################



In [22]:
df[["make_model","make","model"]].value_counts(dropna=False)

make_model      make     model   
Audi A3         Audi     A3          3097
Audi A1         Audi     A1          2614
Opel Insignia   Opel     Insignia    2598
Opel Astra      Opel     Astra       2526
Opel Corsa      Opel     Corsa       2219
Renault Clio    Renault  Clio        1839
Renault Espace  Renault  Espace       991
Renault Duster  Renault  Duster        34
Audi A2         Audi     A2             1
dtype: int64

In [23]:
df.drop(columns=["make","model"], inplace=True)
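
Before dropping `make` and `model`, their redundancy can also be confirmed directly. The following is a minimal sketch on a toy frame (the values are illustrative, not read from the dataset), assuming `make_model` is always `make`, a space, then `model`:

```python
import pandas as pd

# Toy frame standing in for the real df (illustrative values only)
toy = pd.DataFrame({
    "make_model": ["Audi A1", "Opel Astra", "Renault Clio"],
    "make": ["Audi", "Opel", "Renault"],
    "model": ["A1", "Astra", "Clio"],
})

# Rebuild make_model from its parts and compare row by row
rebuilt = toy["make"] + " " + toy["model"]
is_redundant = bool((rebuilt == toy["make_model"]).all())
print(is_redundant)  # True
```

If the check holds on the real data, the two columns carry no information beyond `make_model`.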

### body_type & body

In [24]:
first_look("body_type")

column name :  body_type
----------------------------------------
Per_of_Nulls   :  % 0.38
Number of Nulls  :  60
Number of Uniques:  9
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Sedans' 'Station wagon' 'Compact' 'Other' 'Coupe' 'Van' 'Off-Road'
 'Convertible' None 'Transporter']
----------------------------------------
Compact          3153
Convertible         8
Coupe              25
Off-Road           56
Other             290
Sedans           7903
Station wagon    3553
Transporter        88
Van               783
NaN                60
Name: body_type, dtype: int64
----------------------------------------
Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
NaN                60
Off-Road           56
Coupe              25
Convertible         8
Name: body_type, dtype: int64
################################################################################



In [25]:
first_look("body")

column name :  body
----------------------------------------
Per_of_Nulls   :  % 0.38
Number of Nulls  :  60
Number of Uniques:  9
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Sedans' 'Station wagon' 'Compact' 'Other' 'Coupe' 'Van' 'Off-Road'
 'Convertible' nan 'Transporter']
----------------------------------------
Compact          3153
Convertible         8
Coupe              25
Off-Road           56
Other             290
Sedans           7903
Station wagon    3553
Transporter        88
Van               783
NaN                60
Name: body, dtype: int64
----------------------------------------
Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
NaN                60
Off-Road           56
Coupe              25
Convertible         8
Name: body, dtype: int64
################################################################################



In [26]:
df[["body_type","body"]].value_counts(dropna=False)

body_type      body         
Sedans         Sedans           7903
Station wagon  Station wagon    3553
Compact        Compact          3153
Van            Van               783
Other          Other             290
Transporter    Transporter        88
NaN            NaN                60
Off-Road       Off-Road           56
Coupe          Coupe              25
Convertible    Convertible         8
dtype: int64

In [27]:
# For rows where body_type is null, list the text columns that mention a body type
for i in df.select_dtypes(include="O").columns:
    if df[df["body_type"].isnull()][i].str.contains("sedans|station wagon|compact|coupe|van|off-road|transporter", regex=True).any():
        print(i)

url
short_description
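
A scan like this can be wrapped in a small reusable helper. The sketch below is an illustration on a toy frame, not the notebook's own code: the name `columns_mentioning` and the toy values are hypothetical, and it matches case-insensitively, which the original cell does not.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

def columns_mentioning(frame, pattern):
    """Return the names of text columns in which any value matches `pattern`."""
    hits = []
    for col in frame.columns:
        values = frame[col]
        # Skip numeric, datetime and other non-text columns
        if not (is_object_dtype(values) or is_string_dtype(values)):
            continue
        # na=False treats missing values as "no match"
        if values.str.contains(pattern, case=False, regex=True, na=False).any():
            hits.append(col)
    return hits

# Toy rows standing in for the subset where body_type is null (illustrative values)
toy = pd.DataFrame({
    "url": ["https://example.invalid/offers/renault-clio-van",
            "https://example.invalid/offers/audi-a1"],
    "short_description": ["van 1.5 dci 75cv", "1.4 TFSI"],
    "body_color": ["White", None],
    "price": [8500, 15500],
})

print(columns_mentioning(toy, "van|coupe|transporter"))  # ['url', 'short_description']
```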


In [28]:
null_body = df["body_type"].isnull()
df[null_body & df["short_description"].str.contains("sedans|station wagon|compact|coupe|van|off-road|transporter", regex=True, na=False)]

Unnamed: 0,url,make_model,short_description,body_type,price,vat,km,registration,prev_owner,hp,type,previous_owners,next_inspection,inspection_new,warranty,full_service,non_smoking_vehicle,offer_number,first_registration,body_color,paint_type,body_color_original,upholstery,body,nr_of_doors,nr_of_seats,model_code,gearing_type,displacement,cylinders,weight,drive_chain,fuel,consumption,co_2_emission,emission_class,comfort_&_convenience,entertainment_&_media,extras,safety_&_security,emission_label,gears,country_version,electricity_consumption,last_service_date,other_fuel_types,availability,last_timing_belt_service_date,available_from,consumption_comb,consumption_city,consumption_country
13852,https://www.autoscout24.com//offers/renault-cl...,Renault Clio,van 1.5 dci 75cv S S E6,,8500,,"34,564 km",06/2017,,55 kW,",Used,,Diesel",,,,,,,902099,2017,White,,Bianco,,,5,,,Manual,"1,461 cc",4,,,Diesel,,,,,,,"ABS,Power steering",,,,,,,,,,,,


In [29]:
# The short_description of this row starts with "van", so its body_type is set to "Van".
mask = df["body_type"].isnull() & df["short_description"].str.contains("sedans|station wagon|compact|coupe|van|off-road|transporter", regex=True, na=False)
df.loc[mask, "body_type"] = "Van"

In [30]:
df.drop(columns="body", inplace=True)
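
Two columns that both contain missing values cannot be compared with `==` alone, because a missing value never equals another missing value. A sketch of a comparison that counts missing-versus-missing as a match, on toy data (the helper name `same_values` is hypothetical):

```python
import numpy as np
import pandas as pd

def same_values(left, right):
    """True if both Series agree row by row, counting missing-vs-missing as equal."""
    both_missing = left.isna() & right.isna()
    return bool((left.eq(right) | both_missing).all())

body_type = pd.Series(["Sedans", None, "Van"])
body = pd.Series(["Sedans", np.nan, "Van"])

print(same_values(body_type, body))                                    # True
print(same_values(body_type, pd.Series(["Sedans", np.nan, "Coupe"])))  # False
```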

### vat

In [31]:
first_look("vat")

column name :  vat
----------------------------------------
Per_of_Nulls   :  % 28.35
Number of Nulls  :  4513
Number of Uniques:  2
Type of columns:  object
----------------------------------------
Unique values of columns:  ['VAT deductible' 'Price negotiable' None]
----------------------------------------
Price negotiable      426
VAT deductible      10980
NaN                  4513
Name: vat, dtype: int64
----------------------------------------
VAT deductible      10980
NaN                  4513
Price negotiable      426
Name: vat, dtype: int64
################################################################################



### km

In [32]:
first_look("km")

column name :  km
----------------------------------------
Per_of_Nulls   :  % 0.0
Number of Nulls  :  0
Number of Uniques:  6690
Type of columns:  object
----------------------------------------
Unique values of columns:  ['56,013 km' '80,000 km' '83,450 km' ... '2,864 km' '1,506 km' '57 km']
----------------------------------------
- km         1024
0 km           19
1 km          367
1,000 km       46
1,001 km        4
             ... 
99,999 km       1
990 km          2
991 km          1
995 km          1
999 km          3
Name: km, Length: 6690, dtype: int64
----------------------------------------
10 km        1045
- km         1024
1 km          367
5 km          170
50 km         148
             ... 
67,469 km       1
43,197 km       1
10,027 km       1
35,882 km       1
57 km           1
Name: km, Length: 6690, dtype: int64
################################################################################



In [33]:
df["km"] = df["km"].str.strip(" km").str.replace(",","").replace("-",np.nan).astype(float)
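
Note that `str.strip(" km")` removes any of the characters space, `k` and `m` from both ends rather than the literal suffix `" km"`. That works here because every value ends in a digit or `-` once the unit is gone. A more explicit variant is sketched below on a few sample strings; it is an alternative, not the notebook's method, and relies on `pd.to_numeric(..., errors="coerce")` to turn the `-` placeholder into NaN:

```python
import pandas as pd

raw_km = pd.Series(["56,013 km", "- km", "0 km", "1,000 km"])

# Remove the literal unit suffix and the thousands separator
digits = raw_km.str.replace(" km", "", regex=False).str.replace(",", "", regex=False)

# Anything that is not a number (here "-") becomes NaN
km = pd.to_numeric(digits, errors="coerce")
print(km.tolist())  # [56013.0, nan, 0.0, 1000.0]
```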

In [34]:
first_look("km")

column name :  km
----------------------------------------
Per_of_Nulls   :  % 6.43
Number of Nulls  :  1024
Number of Uniques:  6689
Type of columns:  float64
----------------------------------------
Unique values of columns:  [5.6013e+04 8.0000e+04 8.3450e+04 ... 2.8640e+03 1.5060e+03 5.7000e+01]
----------------------------------------
0.0           19
1.0          367
2.0            6
3.0           33
4.0           15
            ... 
248000.0       1
260000.0       1
291800.0       1
317000.0       1
NaN         1024
Name: km, Length: 6690, dtype: int64
----------------------------------------
10.0       1045
NaN        1024
1.0         367
5.0         170
50.0        148
           ... 
67469.0       1
43197.0       1
10027.0       1
35882.0       1
57.0          1
Name: km, Length: 6690, dtype: int64
################################################################################



### registration & first_registration

In [35]:
first_look("registration")

column name :  registration
----------------------------------------
Per_of_Nulls   :  % 0.0
Number of Nulls  :  0
Number of Uniques:  48
Type of columns:  object
----------------------------------------
Unique values of columns:  ['01/2016' '03/2017' '02/2016' '08/2016' '05/2016' '03/2016' '06/2017'
 '05/2018' '09/2016' '06/2016' '10/2016' '04/2016' '06/2018' '11/2017'
 '07/2016' '12/2016' '09/2018' '04/2018' '07/2017' '02/2017' '11/2016'
 '01/2018' '05/2017' '04/2017' '07/2018' '01/2017' '02/2018' '03/2018'
 '-/-' '08/2017' '10/2017' '08/2018' '09/2017' '12/2017' '04/2019'
 '12/2018' '03/2019' '06/2019' '05/2019' '02/2019' '01/2019' '10/2018'
 '11/2018' '09/2019' '07/2019' '08/2019' '11/2019' '12/2019']
----------------------------------------
-/-        1597
01/2016     376
01/2017     306
01/2018     511
01/2019     541
02/2016     472
02/2017     368
02/2018     539
02/2019     585
03/2016     536
03/2017     471
03/2018     695
03/2019     543
04/2016     532
04/2017     380
04/2

In [36]:
first_look("first_registration")

column name :  first_registration
----------------------------------------
Per_of_Nulls   :  % 10.03
Number of Nulls  :  1597
Number of Uniques:  4
Type of columns:  object
----------------------------------------
Unique values of columns:  ['2016' '2017' '2018' nan '2019']
----------------------------------------
2016    3674
2017    3273
2018    4522
2019    2853
NaN     1597
Name: first_registration, dtype: int64
----------------------------------------
2018    4522
2016    3674
2017    3273
2019    2853
NaN     1597
Name: first_registration, dtype: int64
################################################################################



In [37]:
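# "-/-" marks a missing registration date; the other values are "MM/YYYY" strings, and first_registration holds the year only
df["registration"] = pd.to_datetime(df["registration"].replace("-/-", np.nan), format="%m/%Y")
df["first_registration"] = pd.to_datetime(df["first_registration"], format="%Y")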

In [38]:
df[~(df["registration"].dt.year == df["first_registration"].dt.year)][["registration","first_registration"]]

Unnamed: 0,registration,first_registration
122,NaT,NaT
710,NaT,NaT
734,NaT,NaT
741,NaT,NaT
743,NaT,NaT
...,...,...
15896,NaT,NaT
15902,NaT,NaT
15907,NaT,NaT
15912,NaT,NaT


In [39]:
df[(df["registration"].isnull())&(df["first_registration"].isnull())].shape

(1597, 51)
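
The filter in cell `In [38]` returns only rows where both dates are missing: an equality test involving NaT is never true, so its negation selects exactly those rows. To isolate genuine disagreements, both sides can be required to be present. A sketch on toy data (illustrative values only):

```python
import pandas as pd

toy = pd.DataFrame({
    "registration": pd.to_datetime(["01/2016", "-/-", "06/2018"],
                                   format="%m/%Y", errors="coerce"),
    "first_registration": pd.to_datetime(["2016", None, "2017"],
                                         format="%Y", errors="coerce"),
})

# Only compare years where both dates exist
both_present = toy["registration"].notna() & toy["first_registration"].notna()
year_differs = toy["registration"].dt.year != toy["first_registration"].dt.year

mismatch = toy[both_present & year_differs]
print(mismatch.index.tolist())  # [2]
```

An empty result on the real data would mean that `first_registration` adds nothing beyond the year already contained in `registration`.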

In [40]:
first_look("registration")

column name :  registration
----------------------------------------
Per_of_Nulls   :  % 10.03
Number of Nulls  :  1597
Number of Uniques:  47
Type of columns:  datetime64[ns]
----------------------------------------
Unique values of columns:  ['2016-01-01T00:00:00.000000000' '2017-03-01T00:00:00.000000000'
 '2016-02-01T00:00:00.000000000' '2016-08-01T00:00:00.000000000'
 '2016-05-01T00:00:00.000000000' '2016-03-01T00:00:00.000000000'
 '2017-06-01T00:00:00.000000000' '2018-05-01T00:00:00.000000000'
 '2016-09-01T00:00:00.000000000' '2016-06-01T00:00:00.000000000'
 '2016-10-01T00:00:00.000000000' '2016-04-01T00:00:00.000000000'
 '2018-06-01T00:00:00.000000000' '2017-11-01T00:00:00.000000000'
 '2016-07-01T00:00:00.000000000' '2016-12-01T00:00:00.000000000'
 '2018-09-01T00:00:00.000000000' '2018-04-01T00:00:00.000000000'
 '2017-07-01T00:00:00.000000000' '2017-02-01T00:00:00.000000000'
 '2016-11-01T00:00:00.000000000' '2018-01-01T00:00:00.000000000'
 '2017-05-01T00:00:00.000000000' '2017-04

In [41]:
df.drop(columns="first_registration", inplace=True)

### prev_owner & previous_owners

In [42]:
first_look("prev_owner")

column name :  prev_owner
----------------------------------------
Per_of_Nulls   :  % 42.89
Number of Nulls  :  6828
Number of Uniques:  4
Type of columns:  object
----------------------------------------
Unique values of columns:  ['2 previous owners' None '1 previous owner' '3 previous owners'
 '4 previous owners']
----------------------------------------
1 previous owner     8294
2 previous owners     778
3 previous owners      17
4 previous owners       2
NaN                  6828
Name: prev_owner, dtype: int64
----------------------------------------
1 previous owner     8294
NaN                  6828
2 previous owners     778
3 previous owners      17
4 previous owners       2
Name: prev_owner, dtype: int64
################################################################################



In [43]:
first_look("previous_owners")

column name :  previous_owners
----------------------------------------
Per_of_Nulls   :  % 41.71
Number of Nulls  :  6640
Number of Uniques:  101
Type of columns:  object
----------------------------------------
Unique values of columns:  ['2' nan '1' '0' '3' '4' '1102 g CO2/km (comb)' '1105 g CO2/km (comb)'
 '1110 g CO2/km (comb)' '1116 g CO2/km (comb)'
 '14.8 l/100 km (comb)5.9 l/100 km (city)4.2 l/100 km (country)'
 '00 kWh/100 km (comb)' '0105 g CO2/km (comb)' '0106 g CO2/km (comb)'
 '04.6 l/100 km (comb)5.8 l/100 km (city)3.9 l/100 km (country)'
 '0104 g CO2/km (comb)' '1111 g CO2/km (comb)'
 '14.9 l/100 km (comb)6 l/100 km (city)4.2 l/100 km (country)'
 '2127 g CO2/km (comb)' '199 g CO2/km (comb)' '2122 g CO2/km (comb)'
 '1114 g CO2/km (comb)' '1109 g CO2/km (comb)' '1107 g CO2/km (comb)'
 '15.2 l/100 km (comb)6 l/100 km (city)4.8 l/100 km (country)'
 '0117 g CO2/km (comb)' '0107 g CO2/km (comb)' '0114 g CO2/km (comb)'
 '2124 g CO2/km (comb)' '1125 g CO2/km (comb)' '2119 g CO2/k

In [44]:
df["previous_owners"].str.extract(r"(\d).*").value_counts()

1    8294
2     778
0     188
3      17
4       2
dtype: int64
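
In `previous_owners` the owner count appears to be fused with CO2 or consumption text (for example `'1102 g CO2/km (comb)'`), so only the leading digit is kept. A sketch on a few values copied from the output above; reading the first digit as the owner count is the notebook's assumption, not something the data guarantees:

```python
import pandas as pd

raw = pd.Series(["2", None, "1102 g CO2/km (comb)", "00 kWh/100 km (comb)"])

# Keep only the first character if it is a digit; missing values stay missing
owners = raw.str.extract(r"^(\d)", expand=False).astype(float)
print(owners.tolist())  # [2.0, nan, 1.0, 0.0]
```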

In [45]:
# Where prev_owner is missing but previous_owners starts with "0", record zero previous owners
mask = df["prev_owner"].isnull() & df["previous_owners"].str.startswith("0", na=False)
df.loc[mask, "prev_owner"] = "0 previous owner"

In [46]:
df["prev_owner"].value_counts()

1 previous owner     8294
2 previous owners     778
0 previous owner      188
3 previous owners      17
4 previous owners       2
Name: prev_owner, dtype: int64

In [47]:
df["prev_owner"] = df["prev_owner"].str.extract(r"(\d)").astype(float)

In [48]:
df["prev_owner"].value_counts(dropna=False)

1.0    8294
NaN    6640
2.0     778
0.0     188
3.0      17
4.0       2
Name: prev_owner, dtype: int64

In [49]:
first_look("prev_owner")

column name :  prev_owner
----------------------------------------
Per_of_Nulls   :  % 41.71
Number of Nulls  :  6640
Number of Uniques:  5
Type of columns:  float64
----------------------------------------
Unique values of columns:  [ 2. nan  1.  0.  3.  4.]
----------------------------------------
0.0     188
1.0    8294
2.0     778
3.0      17
4.0       2
NaN    6640
Name: prev_owner, dtype: int64
----------------------------------------
1.0    8294
NaN    6640
2.0     778
0.0     188
3.0      17
4.0       2
Name: prev_owner, dtype: int64
################################################################################



### hp 

In [50]:
first_look("hp")

column name :  hp
----------------------------------------
Per_of_Nulls   :  % 0.0
Number of Nulls  :  0
Number of Uniques:  81
Type of columns:  object
----------------------------------------
Unique values of columns:  ['66 kW' '141 kW' '85 kW' '70 kW' '92 kW' '112 kW' '60 kW' '71 kW' '67 kW'
 '110 kW' '93 kW' '147 kW' '86 kW' '140 kW' '87 kW' '- kW' '81 kW' '82 kW'
 '135 kW' '132 kW' '100 kW' '96 kW' '162 kW' '150 kW' '294 kW' '228 kW'
 '270 kW' '137 kW' '9 kW' '133 kW' '77 kW' '101 kW' '78 kW' '103 kW'
 '1 kW' '74 kW' '118 kW' '84 kW' '88 kW' '80 kW' '76 kW' '149 kW' '44 kW'
 '51 kW' '55 kW' '52 kW' '63 kW' '40 kW' '65 kW' '75 kW' '125 kW' '120 kW'
 '184 kW' '239 kW' '121 kW' '143 kW' '191 kW' '89 kW' '195 kW' '127 kW'
 '122 kW' '154 kW' '155 kW' '104 kW' '123 kW' '146 kW' '90 kW' '53 kW'
 '54 kW' '56 kW' '164 kW' '4 kW' '163 kW' '57 kW' '119 kW' '165 kW'
 '117 kW' '115 kW' '98 kW' '168 kW' '167 kW']
----------------------------------------
- kW        88
1 kW        20
100 kW    1

In [51]:
df["hp_kw"] = df.hp.str.extract(r"(\d+) kW").astype(float)

In [52]:
first_look("hp_kw")

column name :  hp_kw
----------------------------------------
Per_of_Nulls   :  % 0.55
Number of Nulls  :  88
Number of Uniques:  80
Type of columns:  float64
----------------------------------------
Unique values of columns:  [ 66. 141.  85.  70.  92. 112.  60.  71.  67. 110.  93. 147.  86. 140.
  87.  nan  81.  82. 135. 132. 100.  96. 162. 150. 294. 228. 270. 137.
   9. 133.  77. 101.  78. 103.   1.  74. 118.  84.  88.  80.  76. 149.
  44.  51.  55.  52.  63.  40.  65.  75. 125. 120. 184. 239. 121. 143.
 191.  89. 195. 127. 122. 154. 155. 104. 123. 146.  90.  53.  54.  56.
 164.   4. 163.  57. 119. 165. 117. 115.  98. 168. 167.]
----------------------------------------
1.0      20
4.0       1
9.0       1
40.0      2
44.0      1
         ..
228.0     2
239.0     1
270.0     2
294.0    18
NaN      88
Name: hp_kw, Length: 81, dtype: int64
----------------------------------------
85.0     2542
66.0     2122
81.0     1402
100.0    1308
110.0    1112
         ... 
84.0        1
195.0      

In [53]:
# For rows where hp_kw is null, list the text columns that mention kW
for i in df.select_dtypes(include="O").columns:
    if df[df["hp_kw"].isnull()][i].str.contains("kw|Kw|KW").any():
        print(i)

url
short_description


In [54]:
null_kw = df["hp_kw"].isnull()
df[null_kw & df["short_description"].str.contains("kw|Kw|KW", na=False)]

Unnamed: 0,url,make_model,short_description,body_type,price,vat,km,registration,prev_owner,hp,type,previous_owners,next_inspection,inspection_new,warranty,full_service,non_smoking_vehicle,offer_number,body_color,paint_type,body_color_original,upholstery,nr_of_doors,nr_of_seats,model_code,gearing_type,displacement,cylinders,weight,drive_chain,fuel,consumption,co_2_emission,emission_class,comfort_&_convenience,entertainment_&_media,extras,safety_&_security,emission_label,gears,country_version,electricity_consumption,last_service_date,other_fuel_types,availability,last_timing_belt_service_date,available_from,consumption_comb,consumption_city,consumption_country,hp_kw
1269,https://www.autoscout24.com//offers/audi-a1-ad...,Audi A1,ADRENALIN 1.6 TDI 85KW (116CV) SPORTBACK,Compact,15500,,11284.0,2018-06-01,,- kW,",Used,,Diesel",,,,36 months,,,2642843,Blue,,Azul,,5,,,Manual,,,,,Diesel,,,,,,,,,,,,,,,,,,,,
4259,https://www.autoscout24.com//offers/audi-a3-de...,Audi A3,DESIGN EDITION 1.6 TDI 85KW SPORTBACK,Compact,18700,,16316.0,2018-06-01,,- kW,",Used,,Diesel",,,,24 months,,,2683780,Grey,,Gris,,5,,,Manual,,,,,Diesel,,,,,,,,,,,,,,,,,,,,


In [55]:
# Where hp_kw is null, take the kW figure from short_description (e.g. "85KW") and store it as a float
mask = df["hp_kw"].isnull() & df["short_description"].str.contains("kw|Kw|KW", na=False)
df.loc[mask, "hp_kw"] = df.loc[mask, "short_description"].str.extract(r"(\d+)KW", expand=False).astype(float)

In [56]:
df.drop(columns="hp", inplace=True)
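
The kW figure can be spelled `KW`, `kW` or `kw` in free-text descriptions. A case-insensitive extraction is sketched below on three descriptions taken from the outputs in this notebook; the pattern, which also tolerates an optional space before the unit, is an assumption about how the figure is written:

```python
import re
import pandas as pd

desc = pd.Series([
    "ADRENALIN 1.6 TDI 85KW (116CV) SPORTBACK",
    "TCe Energy GLP Limited 66kW 90CV",
    "1.8 TFSI sport",
])

# Digits followed by an optional space and "kw" in any letter case
kw = desc.str.extract(r"(\d+)\s*kw", flags=re.IGNORECASE, expand=False).astype(float)
print(kw.tolist())  # [85.0, 66.0, nan]
```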

### type & fuel

In [57]:
first_look("type")

column name :  type
----------------------------------------
Per_of_Nulls   :  % 0.01
Number of Nulls  :  2
Number of Uniques:  169
Type of columns:  object
----------------------------------------
Unique values of columns:  [',Used,,Diesel (Particulate Filter)' ',Used,,Gasoline' ',Used,,Super 95'
 ',Used,,Regular/Benzine 91'
 ",Employee's car,,Diesel (Particulate Filter)" ',Used,,Diesel'
 ',Used,,Regular/Benzine 91 / Super Plus 98 / Regular/Benzine E10 91 / Super 95 / Super E10 95 / Super Plus E10 98'
 ",Employee's car,,Regular/Benzine 91" ",Employee's car,,Diesel"
 ",Employee's car,,Super E10 95" ',New,,Super 95 (Particulate Filter)'
 ',Used,,Super 95 / Regular/Benzine 91' ",Employee's car,,Super 95"
 ',Used,,Super 95 / Super Plus 98 / Super E10 95 / Super Plus E10 98'
 ',Used,,Super E10 95 / Super 95' ',Used,,Super E10 95'
 ',Used,,Super 95 / Regular/Benzine 91 / Super Plus 98'
 ',Demonstration,,Super 95'
 ',Used,,Super 95 / Super Plus 98 / Super E10 95'
 ',Used,,Super 95 / Super Pl

In [58]:
first_look("fuel")

column name :  fuel
----------------------------------------
Per_of_Nulls   :  % 0.0
Number of Nulls  :  0
Number of Uniques:  77
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Diesel (Particulate Filter)' 'Gasoline' 'Super 95' 'Regular/Benzine 91'
 'Diesel'
 'Regular/Benzine 91 / Super Plus 98 / Regular/Benzine E10 91 / Super 95 / Super E10 95 / Super Plus E10 98'
 'Super E10 95' 'Super 95 (Particulate Filter)'
 'Super 95 / Regular/Benzine 91'
 'Super 95 / Super Plus 98 / Super E10 95 / Super Plus E10 98'
 'Super E10 95 / Super 95' 'Super 95 / Regular/Benzine 91 / Super Plus 98'
 'Super 95 / Super Plus 98 / Super E10 95' 'Super 95 / Super Plus 98'
 'Super 95 / Regular/Benzine 91 / Super E10 95 / Super Plus E10 98 / Super Plus 98 / Regular/Benzine E10 91'
 'Others' 'Super 95 / Super E10 95' 'Gasoline (Particulate Filter)'
 'Regular/Benzine E10 91 / Regular/Benzine 91 / Super 95 / Super Plus 98 / Super E10 95 / Super Plus E10 98'
 'Super E

In [59]:
df[df["type"].isnull()]

Unnamed: 0,url,make_model,short_description,body_type,price,vat,km,registration,prev_owner,type,previous_owners,next_inspection,inspection_new,warranty,full_service,non_smoking_vehicle,offer_number,body_color,paint_type,body_color_original,upholstery,nr_of_doors,nr_of_seats,model_code,gearing_type,displacement,cylinders,weight,drive_chain,fuel,consumption,co_2_emission,emission_class,comfort_&_convenience,entertainment_&_media,extras,safety_&_security,emission_label,gears,country_version,electricity_consumption,last_service_date,other_fuel_types,availability,last_timing_belt_service_date,available_from,consumption_comb,consumption_city,consumption_country,hp_kw
2765,https://www.autoscout24.com//offers/audi-a3-sp...,Audi A3,SPB 2.0 TDI S tronic Sport,Sedans,17900,,115137.0,2016-10-01,,,,,,",,,Diesel",,,57551628,White,,,"Cloth, Other",5,5,,Automatic,"1,968 cc",4,,front,Diesel,"['4.5 l/100 km (comb)'],['5.3 l/100 km (city)'...",118 g CO2/km (comb),Euro 6,Air conditioning,"Bluetooth,Hands-free equipment","Alloy wheels,Sport seats,Sport suspension","ABS,Central door lock,Driver-side airbag,Isofi...",1 (No sticker),6,,,,,,,,4.5,5.3,4.1,110.0
5237,https://www.autoscout24.com//offers/audi-a3-sp...,Audi A3,SPB 1.6 TDI 116 CV S tronic,Sedans,25400,,,NaT,,,,,,,,,57247041,Grey,,,"Cloth, Other",5,5,,Automatic,"1,598 cc",4,,front,Diesel,"['3.9 l/100 km (comb)'],['4.1 l/100 km (city)'...",103 g CO2/km (comb),Euro 6,Air conditioning,"Bluetooth,Hands-free equipment",Alloy wheels,"ABS,Central door lock,Driver-side airbag,Isofi...",1 (No sticker),7,,,,,,,,3.9,4.1,3.7,85.0


In [60]:
# Check whether the fuel value is contained in the type string
def isin_control(column1, column2):
    return column1 in column2
df[["type","fuel"]].apply(lambda x: isin_control(str(x.fuel), str(x.type)), axis=1).value_counts()

True     15917
False        2
dtype: int64

In [61]:
df["type"] = df["type"].str.split(",", expand=True)[1]

In [62]:
first_look("type")

column name :  type
----------------------------------------
Per_of_Nulls   :  % 0.01
Number of Nulls  :  2
Number of Uniques:  5
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Used' "Employee's car" 'New' 'Demonstration' 'Pre-registered' nan]
----------------------------------------
Demonstration       796
Employee's car     1011
New                1650
Pre-registered     1364
Used              11096
NaN                   2
Name: type, dtype: int64
----------------------------------------
Used              11096
New                1650
Pre-registered     1364
Employee's car     1011
Demonstration       796
NaN                   2
Name: type, dtype: int64
################################################################################



In [63]:
df["Particulate_Filter"] = df["fuel"].transform(lambda x: 1 if "Particulate" in str(x) else 0)
df["Particulate_Filter"].value_counts()

0    11100
1     4819
Name: Particulate_Filter, dtype: int64
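
The same flag can be built without a Python-level lambda by using the vectorised string methods. A sketch on sample values from the `fuel` output above (it assumes the column has no missing values, as `first_look("fuel")` reports):

```python
import pandas as pd

fuel = pd.Series([
    "Diesel (Particulate Filter)",
    "Gasoline",
    "Super 95 (Particulate Filter)",
    "Diesel",
])

# True/False per row, then cast to 1/0
particulate_filter = fuel.str.contains("Particulate", regex=False).astype(int)
print(particulate_filter.tolist())  # [1, 0, 1, 0]
```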

In [64]:
def fuel_categorise(x):
    # The order matters: "Electric/Gasoline" must be tested before "Electric" and "Gasoline"
    if "Electric/Gasoline" in x:
        return "Hybrid"
    elif "LPG" in x or "gas" in x or "CNG" in x:
        return "Gas"
    elif "Electric" in x:
        return "Electric"
    elif "Diesel" in x:
        return "Diesel"
    elif "Gasoline" in x or "Super" in x or "Benzine" in x or "Regular" in x:
        return "Benzine"
    else:
        return "Other"
df["fuel"] = df["fuel"].apply(fuel_categorise)
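
In a keyword-based categoriser like this, the order of the checks decides the result: a hybrid string contains both "Electric" and "Gasoline", so it has to be caught first. One way to make that ordering explicit is a rule table. The sketch below is a hypothetical restructuring (`FUEL_RULES` and `categorise_fuel` are illustrative names) using the same keywords, with the lowercase `"gas"` keyword assumed to target strings such as "Liquid petroleum gas (LPG)":

```python
# Ordered (keywords, label) rules: the first rule with a matching keyword wins
FUEL_RULES = [
    (("Electric/Gasoline",), "Hybrid"),
    (("LPG", "gas", "CNG"), "Gas"),
    (("Electric",), "Electric"),
    (("Diesel",), "Diesel"),
    (("Gasoline", "Super", "Benzine", "Regular"), "Benzine"),
]

def categorise_fuel(raw):
    """Map a raw fuel string to a coarse category using the ordered rules."""
    for keywords, label in FUEL_RULES:
        if any(keyword in raw for keyword in keywords):
            return label
    return "Other"

samples = [
    "Diesel (Particulate Filter)",
    "Electric/Gasoline",
    "Super 95 / Regular/Benzine 91",
    "Liquid petroleum gas (LPG)",
    "Others",
]
print([categorise_fuel(s) for s in samples])
# ['Diesel', 'Hybrid', 'Benzine', 'Gas', 'Other']
```

Note that the lowercase keyword `"gas"` does not match "Gasoline", because the substring test is case-sensitive.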

In [65]:
df.fuel.value_counts()

Benzine     8545
Diesel      7299
Gas           64
Other          6
Hybrid         4
Electric       1
Name: fuel, dtype: int64

In [66]:
df[df["fuel"] == "Other"]

Unnamed: 0,url,make_model,short_description,body_type,price,vat,km,registration,prev_owner,type,previous_owners,next_inspection,inspection_new,warranty,full_service,non_smoking_vehicle,offer_number,body_color,paint_type,body_color_original,upholstery,nr_of_doors,nr_of_seats,model_code,gearing_type,displacement,cylinders,weight,drive_chain,fuel,consumption,co_2_emission,emission_class,comfort_&_convenience,entertainment_&_media,extras,safety_&_security,emission_label,gears,country_version,electricity_consumption,last_service_date,other_fuel_types,availability,last_timing_belt_service_date,available_from,consumption_comb,consumption_city,consumption_country,hp_kw,Particulate_Filter
819,https://www.autoscout24.com//offers/audi-a1-1-...,Audi A1,1.4TDI Sportback /Euro6 /Navi /SHZ /PDC,Sedans,14388,VAT deductible,25684.0,2016-10-01,1.0,Used,1.0,,Yes4 (Green),,,,0185N,White,Metallic,Gletscherweiss,"Cloth, Black",5,5.0,0588/BDF,Manual,"1,422 cc",,,front,Other,,,"[],[],[]","Armrest,Cruise control,Electrical side mirrors...","Bluetooth,Hands-free equipment,On-board comput...",,,4 (Green),5.0,,,,,,,,,,,66.0,0
2885,https://www.autoscout24.com//offers/audi-a3-sp...,Audi A3,SPORTBACK 1.4 TFSI G-Tron S-Tronic Adrenalin S...,Compact,16400,VAT deductible,123748.0,2016-04-01,,Used,,04/2020,,,Euro 6,,JH480V,White,,Wit,"Cloth, Black",5,5.0,,Automatic,"1,395 cc",4.0,"1,255 kg",,Other,,"[],[],[]",Euro 6,"Air conditioning,Armrest,Automatic climate con...","CD player,Hands-free equipment,On-board comput...","Alloy wheels,Sport package,Trailer hitch","ABS,Central door lock,Driver-side airbag,Fog l...",,7.0,Netherlands,,,,,,,,,,81.0,0
4003,https://www.autoscout24.com//offers/audi-a3-sp...,Audi A3,Sportback 1.6 TDI S-Tronic*LED*Navigation*APS,Sedans,19490,VAT deductible,49267.0,2017-08-01,1.0,Used,1.0,08/20204 (Green),,,,,M-289025,Red,Metallic,Tangorot Metallic,"Cloth, Black",5,5.0,,Automatic,"1,598 cc",4.0,,front,Other,,,"[],[],[]","Air conditioning,Automatic climate control,Cru...","Bluetooth,CD player,Hands-free equipment,On-bo...",Alloy wheels,"ABS,Central door lock,Daytime running lights,D...",4 (Green),7.0,Germany,,,,,,,,,,85.0,1
10374,https://www.autoscout24.com//offers/opel-corsa...,Opel Corsa,1.4 GLP Selective Pro 90,Sedans,11300,VAT deductible,5.0,2019-05-01,1.0,Pre-registered,1.0,,,24 months126 g CO2/km (comb),,,,White,,,"Cloth, Black",5,5.0,,Manual,"1,398 cc",4.0,"1,163 kg",,Other,"['5.1 l/100 km (comb)'],['6.5 l/100 km (city)'...",126 g CO2/km (comb),Euro 6d-TEMP,"Air conditioning,Cruise control,Electrical sid...","Bluetooth,Hands-free equipment,On-board comput...","Alloy wheels,Voice Control","ABS,Adaptive headlights,Central door lock with...",,5.0,Spain,,,,,,,5.1,6.5,4.3,66.0,0
11677,https://www.autoscout24.com//offers/opel-insig...,Opel Insignia,Edition,Other,18480,VAT deductible,14937.0,2018-02-01,,Used,,,Yes,,,,00-40-12992,,,,Other,4,,,Manual,"1,500 cc",,"1,518 kg",front,Other,,,"[],[],[]","Air conditioning,Cruise control,Electrical sid...","On-board computer,Radio,USB","Alloy wheels,Trailer hitch","Central door lock,Daytime running lights,Drive...",4 (Green),,Germany,,,,,,,,,,103.0,0
14500,https://www.autoscout24.com//offers/renault-cl...,Renault Clio,TCe Energy GLP Limited 66kW 90CV,Sedans,10800,VAT deductible,13000.0,2018-06-01,1.0,Employee's car,1.0,,,12 monthsEuro 6d-TEMP,,,,White,,BLANCO GLACIAR,"Cloth, Black",5,5.0,,Manual,898 cc,3.0,"1,082 kg",front,Other,"['4.7 l/100 km (comb)'],['5.7 l/100 km (city)'...",108 g CO2/km (comb),Euro 6d-TEMP,"Air conditioning,Automatic climate control,Cru...","Bluetooth,Hands-free equipment,Radio,Sound sys...","Alloy wheels,Touch screen","ABS,Central door lock with remote control,Dayt...",,5.0,Spain,,06/2019108 g CO2/km (comb),,,,,4.7,5.7,4.1,66.0,0


In [67]:
# Rows 819 and 4003 are TDI models according to short_description, so they are set to Diesel
df.loc[[819,4003],"fuel"] = "Diesel"

In [68]:
# The remaining "Other" rows are assigned to Benzine
df.loc[df["fuel"] == "Other", "fuel"] = "Benzine"

In [69]:
first_look("fuel")

column name :  fuel
----------------------------------------
Per_of_Nulls   :  % 0.0
Number of Nulls  :  0
Number of Uniques:  5
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Diesel' 'Benzine' 'Gas' 'Hybrid' 'Electric']
----------------------------------------
Benzine     8549
Diesel      7301
Electric       1
Gas           64
Hybrid         4
Name: fuel, dtype: int64
----------------------------------------
Benzine     8549
Diesel      7301
Gas           64
Hybrid         4
Electric       1
Name: fuel, dtype: int64
################################################################################



### next_inspection

In [70]:
first_look("next_inspection")

column name :  next_inspection
----------------------------------------
Per_of_Nulls   :  % 77.79
Number of Nulls  :  12384
Number of Uniques:  1384
Type of columns:  object
----------------------------------------
Unique values of columns:  ['06/202199 g CO2/km (comb)' nan '02/202097 g CO2/km (comb)' ...
 '02/2022153 g CO2/km (comb)'
 '06/20217.4 l/100 km (comb)9.2 l/100 km (city)6.3 l/100 km (country)'
 '01/2022168 g CO2/km (comb)']
----------------------------------------
01/1921120 g CO2/km (comb)                                               1
01/1955                                                                  1
01/1999                                                                  1
01/2001120 g CO2/km (comb)                                               1
01/2001122 g CO2/km (comb)                                               2
                                                                     ...  
12/2021167 g CO2/km (comb)                                            

In [71]:
df["next_inspection"].str.extract(r"(.*/\d\d\d\d)").value_counts(dropna=False)

NaN        12384
06/2021      471
03/2021      210
05/2021      180
04/2021      171
           ...  
05/2014        1
05/2016        1
05/2017        1
01/1955        1
01/1921        1
Length: 78, dtype: int64
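
The raw `next_inspection` strings fuse the date with CO2 or consumption text, so the `MM/YYYY` prefix has to be cut out before parsing. A sketch on values copied from the output above; anchoring the pattern at the start of the string and parsing with an explicit format are choices made for this illustration, not the notebook's exact method:

```python
import pandas as pd

raw = pd.Series([
    "06/202199 g CO2/km (comb)",
    None,
    "01/1955",
    "06/20217.4 l/100 km (comb)9.2 l/100 km (city)6.3 l/100 km (country)",
])

# Two digits, a slash and four digits at the very start of the string
prefix = raw.str.extract(r"^(\d{2}/\d{4})", expand=False)

# Unparseable or missing values become NaT
inspection = pd.to_datetime(prefix, format="%m/%Y", errors="coerce")
print(inspection.dt.year.tolist())  # [2021.0, nan, 1955.0, 2021.0]
```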

In [72]:
df["inspection_time"] = df["next_inspection"].str.extract(r"(.*/\d\d\d\d)")
df["inspection_time"] = pd.to_datetime(df["inspection_time"])
df["inspection_time"].value_counts(dropna=False) 

NaT           12384
2021-06-01      471
2021-03-01      210
2021-05-01      180
2021-04-01      171
              ...  
2014-05-01        1
2016-04-01        1
1955-01-01        1
2018-01-01        1
2022-11-01        1
Name: inspection_time, Length: 78, dtype: int64
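
The cells above pull the `MM/YYYY` prefix out of the fused `next_inspection` strings and convert it to a date. A minimal sketch of that step on toy data follows. The sample strings are made up to resemble the column, and the anchored pattern and the explicit `format="%m/%Y"` are assumptions rather than the notebook's own code.

```python
import pandas as pd

# Toy strings imitating the fused values seen in next_inspection (not real rows).
raw = pd.Series([
    "06/202199 g CO2/km (comb)",
    "01/1955",
    None,
    "02/2022153 g CO2/km (comb)",
])

# Capture only the leading MM/YYYY token; expand=False returns a Series.
month_year = raw.str.extract(r"^(\d{2}/\d{4})", expand=False)

# An explicit format avoids any day-first / month-first guessing.
parsed = pd.to_datetime(month_year, format="%m/%Y")

print(parsed)
```

Anchoring the pattern at the start of the string is stricter than the greedy `.*` used in the notebook, so it may be worth comparing both on the real column before switching.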

In [73]:
# Set inspection_time to NaT where the year is before 2019 or the date precedes the registration date
df.loc[df["inspection_time"].dt.year < 2019, "inspection_time"] = np.nan
df.loc[df["registration"] > df["inspection_time"], "inspection_time"] = np.nan

In [74]:
first_look("inspection_time")

column name :  inspection_time
----------------------------------------
Per_of_Nulls   :  % 78.21
Number of Nulls  :  12450
Number of Uniques:  53
Type of columns:  datetime64[ns]
----------------------------------------
Unique values of columns:  ['2021-06-01T00:00:00.000000000'                           'NaT'
 '2020-02-01T00:00:00.000000000' '2019-09-01T00:00:00.000000000'
 '2019-10-01T00:00:00.000000000' '2020-04-01T00:00:00.000000000'
 '2020-07-01T00:00:00.000000000' '2019-12-01T00:00:00.000000000'
 '2019-05-01T00:00:00.000000000' '2019-07-01T00:00:00.000000000'
 '2021-02-01T00:00:00.000000000' '2019-06-01T00:00:00.000000000'
 '2020-05-01T00:00:00.000000000' '2019-11-01T00:00:00.000000000'
 '2020-11-01T00:00:00.000000000' '2021-09-01T00:00:00.000000000'
 '2020-03-01T00:00:00.000000000' '2021-04-01T00:00:00.000000000'
 '2019-04-01T00:00:00.000000000' '2021-07-01T00:00:00.000000000'
 '2021-01-01T00:00:00.000000000' '2022-05-01T00:00:00.000000000'
 '2020-06-01T00:00:00.000000000' '201

### inspection_new

In [75]:
first_look("inspection_new")

column name :  inspection_new
----------------------------------------
Per_of_Nulls   :  % 75.3
Number of Nulls  :  11987
Number of Uniques:  201
Type of columns:  object
----------------------------------------
Unique values of columns:  ['YesEuro 6' nan 'Yes109 g CO2/km (comb)' 'Yes98 g CO2/km (comb)'
 'Yes97 g CO2/km (comb)' 'Yes112 g CO2/km (comb)' 'Yes0 kWh/100 km (comb)'
 'Yes104 g CO2/km (comb)' 'Yes102 g CO2/km (comb)' 'Yes91 g CO2/km (comb)'
 'Yes4.4 l/100 km (comb)5.2 l/100 km (city)3.9 l/100 km (country)'
 'Yes99 g CO2/km (comb)' 'Yes' 'Yes92 g CO2/km (comb)'
 'Yes5 l/100 km (comb)6.2 l/100 km (city)4.3 l/100 km (country)'
 'Yes103 g CO2/km (comb)'
 'Yes4.9 l/100 km (comb)6.2 l/100 km (city)4.2 l/100 km (country)'
 'Yes107 g CO2/km (comb)'
 'Yes4.4 l/100 km (comb)5.4 l/100 km (city)3.8 l/100 km (country)'
 'Yes4.6 l/100 km (comb)5.6 l/100 km (city)4 l/100 km (country)'
 'Yes121 g CO2/km (comb)' 'Yes123 g CO2/km (comb)'
 'Yes113 g CO2/km (comb)'
 'Yes4.7 l/100 km (comb)5.8 l/

In [76]:
df["inspection_situation"] = df["inspection_new"].apply(lambda x: 1 if "Yes" in str(x) else 0)
df["inspection_situation"].value_counts()

0    11987
1     3932
Name: inspection_situation, dtype: int64
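
The flag above is built with `apply` and a lambda. A vectorised variant is sketched below on toy values that imitate `inspection_new`; the Series name `flag` is illustrative only. Both versions map missing values to 0, so "no inspection" and "unknown" become indistinguishable in the new column.

```python
import numpy as np
import pandas as pd

# Toy values imitating inspection_new (not real rows).
inspection_new = pd.Series(["YesEuro 6", np.nan, "Yes109 g CO2/km (comb)", "Yes"])

# na=False maps missing values to False, so they become 0 after the cast.
flag = inspection_new.str.contains("Yes", na=False).astype(int)

print(flag.value_counts())
```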

In [77]:
first_look("inspection_situation")

column name :  inspection_situation
----------------------------------------
Per_of_Nulls   :  % 0.0
Number of Nulls  :  0
Number of Uniques:  2
Type of columns:  int64
----------------------------------------
Unique values of columns:  [1 0]
----------------------------------------
0    11987
1     3932
Name: inspection_situation, dtype: int64
----------------------------------------
0    11987
1     3932
Name: inspection_situation, dtype: int64
################################################################################



In [78]:
# Check whether any text column contains "Yes" in the rows where inspection_situation is null
for i in df.select_dtypes(include="O").columns:
    if df[df["inspection_situation"].isnull()][i].str.contains("Yes").any():
        print(i)
# Nothing is printed: inspection_situation has no nulls, so there is nothing to fill from other columns

### warranty

In [79]:
first_look("warranty")

column name :  warranty
----------------------------------------
Per_of_Nulls   :  % 34.05
Number of Nulls  :  5420
Number of Uniques:  506
Type of columns:  object
----------------------------------------
Unique values of columns:  ['4 (Green)' nan '99 g CO2/km (comb)' 'Euro 6' '12 monthsEuro 6'
 '3 months' '12 months' '6 months103 g CO2/km (comb)' ''
 '12 months105 g CO2/km (comb)' '12 months112 g CO2/km (comb)'
 '12 months97 g CO2/km (comb)' '12 months104 g CO2/km (comb)' '24 months'
 '12 months102 g CO2/km (comb)' '97 g CO2/km (comb)' '50 months4 (Green)'
 '12 months106 g CO2/km (comb)' '48 months' '107 g CO2/km (comb)'
 '108 g CO2/km (comb)' '113 g CO2/km (comb)' '102 g CO2/km (comb)'
 '112 g CO2/km (comb)' '36 months99 g CO2/km (comb)' '109 g CO2/km (comb)'
 '20 months4 (Green)' '12 months99 g CO2/km (comb)' '106 g CO2/km (comb)'
 '36 monthsEuro 6' '12 months109 g CO2/km (comb)' '100 g CO2/km (comb)'
 '12 months0 kWh/100 km (comb)' '12 months101 g CO2/km (comb)'
 '4.9 l/100 km (c

In [80]:
# Extract the number of months; a raw string avoids invalid-escape warnings for \d
df["warranty_month"] = df["warranty"].str.extract(r"(\d+) month").astype(float)

In [81]:
# Check whether any text column contains "month" in the rows where warranty_month is null
for i in df.select_dtypes(include="O").columns:
    if df[df["warranty_month"].isnull()][i].str.contains("month").any():
        print(i)
# Nothing is printed, so no other column can fill warranty_month

In [82]:
first_look("warranty_month")

column name :  warranty_month
----------------------------------------
Per_of_Nulls   :  % 69.51
Number of Nulls  :  11066
Number of Uniques:  41
Type of columns:  float64
----------------------------------------
Unique values of columns:  [nan 12.  3.  6. 24. 50. 48. 36. 20. 23. 60. 13. 26. 46. 47. 49. 18. 56.
 16. 22. 28. 10. 19. 25. 11. 72.  2.  1.  4.  8.  7. 15. 17. 45. 14.  9.
 65. 21. 34. 33. 40. 30.]
----------------------------------------
1.0         3
2.0         5
3.0        33
4.0         3
6.0       125
7.0         1
8.0         1
9.0         2
10.0        1
11.0        2
12.0     2594
13.0        3
14.0        2
15.0        1
16.0        4
17.0        2
18.0       10
19.0        3
20.0        7
21.0        2
22.0        2
23.0       11
24.0     1118
25.0        6
26.0        4
28.0        2
30.0        1
33.0        1
34.0        3
36.0      279
40.0        1
45.0        2
46.0        2
47.0        1
48.0      149
49.0        1
50.0        4
56.0        1
60.0      401
6

### co_2_emission

In [83]:
first_look("co_2_emission")

column name :  co_2_emission
----------------------------------------
Per_of_Nulls   :  % 11.36
Number of Nulls  :  1808
Number of Uniques:  123
Type of columns:  object
----------------------------------------
Unique values of columns:  ['99 g CO2/km (comb)' '129 g CO2/km (comb)' '109 g CO2/km (comb)'
 '92 g CO2/km (comb)' '98 g CO2/km (comb)' '97 g CO2/km (comb)' nan
 '105 g CO2/km (comb)' '112 g CO2/km (comb)' '103 g CO2/km (comb)'
 '102 g CO2/km (comb)' '95 g CO2/km (comb)' '104 g CO2/km (comb)'
 '[],[],[]' '91 g CO2/km (comb)' '94 g CO2/km (comb)'
 '117 g CO2/km (comb)' '123 g CO2/km (comb)' '106 g CO2/km (comb)'
 '108 g CO2/km (comb)' '121 g CO2/km (comb)' '107 g CO2/km (comb)'
 '101 g CO2/km (comb)' '113 g CO2/km (comb)' '137 g CO2/km (comb)'
 '100 g CO2/km (comb)' '116 g CO2/km (comb)' '114 g CO2/km (comb)'
 '118 g CO2/km (comb)' '331 g CO2/km (comb)' '115 g CO2/km (comb)'
 '119 g CO2/km (comb)' '90 g CO2/km (comb)' '136 g CO2/km (comb)'
 '134 g CO2/km (comb)' '110 g CO2/km (co

In [84]:
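# Take the numeric part before " g"; a comma is converted to a decimal point (e.g. "12,087" becomes 12.087)
df["co2_emission_gr"] = df["co_2_emission"].str.extract("(.*) g")[0].str.replace(",",".").astype(float)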

In [85]:
first_look("co2_emission_gr")

column name :  co2_emission_gr
----------------------------------------
Per_of_Nulls   :  % 15.3
Number of Nulls  :  2436
Number of Uniques:  122
Type of columns:  float64
----------------------------------------
Unique values of columns:  [ 99.    129.    109.     92.     98.     97.        nan 105.    112.
 103.    102.     95.    104.     91.     94.    117.    123.    106.
 108.    121.    107.    101.    113.    137.    100.    116.    114.
 118.    331.    115.    119.     90.    136.    134.    110.    111.
 120.     89.    142.    126.    122.    128.    127.    138.    130.
 125.     85.    124.    152.     88.    189.    194.    149.    153.
 188.     36.      1.06   96.    990.    146.    135.    158.     12.087
 141.    172.    154.    150.    167.    174.     93.    133.    131.
 145.    147.    156.     87.      5.    148.    139.    151.    144.
 168.    160.    170.     80.    132.    155.     14.    159.      0.
 143.    140.     82.     12.324  84.    165.     51.    

In [86]:
df[df["co2_emission_gr"].isnull()]

Unnamed: 0,url,make_model,short_description,body_type,price,vat,km,registration,prev_owner,type,previous_owners,next_inspection,inspection_new,warranty,full_service,non_smoking_vehicle,offer_number,body_color,paint_type,body_color_original,upholstery,nr_of_doors,nr_of_seats,model_code,gearing_type,displacement,cylinders,weight,drive_chain,fuel,consumption,co_2_emission,emission_class,comfort_&_convenience,entertainment_&_media,extras,safety_&_security,emission_label,gears,country_version,electricity_consumption,last_service_date,other_fuel_types,availability,last_timing_belt_service_date,available_from,consumption_comb,consumption_city,consumption_country,hp_kw,Particulate_Filter,inspection_time,inspection_situation,warranty_month,co2_emission_gr
9,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,SPORTBACK TFSI ULTRA 95 S-TRONIC AMB.,Sedans,17990,,16103.0,2017-06-01,,Used,,,,3 months,,,,White,,Blanc,,5,4,,Automatic,999 cc,,,,Benzine,,,,,,,,,7,,,,,,,,,,,70.0,0,NaT,0,3.0,
13,https://www.autoscout24.com//offers/audi-a1-1-...,Audi A1,1.4 TFSI 150ch COD Ambition Luxe S tronic 7,Sedans,18399,,45764.0,2016-06-01,,Used,,,,12 months,,,scap-ED-642-CV,Grey,,Gris Clair,,3,4,,Automatic,"1,395 cc",,,,Benzine,,,,"Air conditioning,Automatic climate control,Cru...","Bluetooth,On-board computer,Radio","Alloy wheels,Sport seats,Sport suspension","ABS,Adaptive headlights,Central door lock,Dayt...",,,,,,,,,,,,,112.0,0,NaT,0,12.0,
30,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 1.4TDI Attraction,Compact,11290,,38400.0,2016-06-01,,Used,,,,12 months,,,2780690,Black,,Negro,,5,5,,Manual,"1,422 cc",3,"1,195 kg",front,Diesel,"['3 l/100 km (comb)'],['3 l/100 km (city)'],['...",,,,,,,,5,,,,,,,,3,3,3,66.0,0,NaT,0,12.0,
39,https://www.autoscout24.com//offers/audi-a1-89...,Audi A1,"(€8950,-excl.VAT/Guarantee) SB 1.4 TDi Ultra Eur6",Compact,11630,VAT deductible,110715.0,2016-01-01,1.0,Used,1,,,,Euro 6,,64/06/2019,Silver,Metallic,Silver,Cloth,5,5,,Manual,"1,422 cc",,,,Diesel,,"[],[],[]",Euro 6,"Air conditioning,Automatic climate control,Cru...","Bluetooth,CD player,Hands-free equipment,MP3,O...","Alloy wheels,Sport suspension","ABS,Daytime running lights,Driver-side airbag,...",,5,,,,,,,,,,,66.0,1,NaT,0,,
43,https://www.autoscout24.com//offers/audi-a1-89...,Audi A1,"(€8900,-excl.VAT/Guarantee) SB 1.4TDI Ultra Eu...",Compact,11569,VAT deductible,111265.0,2016-03-01,1.0,Used,1,,,,Euro 6,,148/05/2019,Black,Metallic,Black,Cloth,5,5,,Manual,"1,422 cc",,,,Diesel,,"[],[],[]",Euro 6,"Air conditioning,Automatic climate control,Lig...","Bluetooth,CD player,Hands-free equipment,MP3,O...",Alloy wheels,"ABS,Daytime running lights,Driver-side airbag,...",,5,,,,,,,,,,,66.0,1,NaT,0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15883,https://www.autoscout24.com//offers/renault-es...,Renault Espace,RENAULT INITIALE BLUE DCI 200PS,Van,47950,,57.0,2019-05-01,,Demonstration,,,,48 months,,,8-900099,Black,Metallic,BLACK-PEARL MET.,,5,5,,Automatic,"1,997 cc",,"1,847 kg",,Diesel,,"[],[],[]",,"Air conditioning,Armrest,Automatic climate con...","Bluetooth,Hands-free equipment,MP3,On-board co...","Alloy wheels,Touch screen,Voice Control","ABS,Adaptive Cruise Control,Adaptive headlight...",,7,Austria,,,,,,,,,,147.0,0,NaT,0,48.0,
15903,https://www.autoscout24.com//offers/renault-es...,Renault Espace,1.8 TCe 225ch FAP Initiale Paris EDC,Van,39990,,10.0,2019-01-01,,Used,,,,12 months,,,re62c9-FD-691-GF,Black,,NOIR ETOILE,,5,5,,Automatic,"1,798 cc",,,,Benzine,,,,"Air conditioning,Automatic climate control,Cru...","Bluetooth,Hands-free equipment,On-board comput...",Alloy wheels,"ABS,Central door lock,Daytime running lights,D...",,,,,,,,,,,,,168.0,0,NaT,0,12.0,
15906,https://www.autoscout24.com//offers/renault-es...,Renault Espace,V Tce 225 EDC FAP Initiale Paris,Van,39990,VAT deductible,10.0,2019-01-01,,Used,,,,12 months,,,VO190299,White,,blanc,,5,,,Automatic,,,,,Benzine,"['7.4 l/100 km (comb)'],[],[]",,,"Cruise control,Power windows",,Alloy wheels,"ABS,Driver-side airbag,Passenger-side airbag,S...",,,,,,,,,,7.4,,,,0,NaT,0,12.0,
15908,https://www.autoscout24.com//offers/renault-es...,Renault Espace,1.8 TCe 225ch energy Initiale Paris EDC,Van,39990,,10.0,2019-02-01,,Used,,,,12 months,,,re74c2-FE-210-EB,Black,,NOIRE,,5,5,,Automatic,"1,798 cc",,,,Benzine,,,,"Air conditioning,Automatic climate control,Cru...","Bluetooth,Hands-free equipment,On-board comput...",Alloy wheels,"ABS,Central door lock,Daytime running lights,D...",,,,,,,,,,,,,167.0,0,NaT,0,12.0,


In [87]:
# Check which text columns contain "g CO2" in the rows where co2_emission_gr is null
for i in df.select_dtypes(include="O").columns:
    if df[df["co2_emission_gr"].isnull()][i].str.contains("g CO2").any():
        print(i)

full_service
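
The same "scan every text column for a pattern in the rows that are still missing" loop recurs throughout this section. A minimal sketch of how it could be wrapped in a helper is shown below. The function name `columns_containing` and the toy frame are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def columns_containing(frame, row_mask, pattern):
    """Return names of object columns that match `pattern` in the masked rows."""
    subset = frame.loc[row_mask]
    hits = []
    for col in subset.select_dtypes(include="object").columns:
        # na=False treats missing values as "no match".
        if subset[col].str.contains(pattern, regex=True, na=False).any():
            hits.append(col)
    return hits

# Toy frame (not real rows); explicit object dtype mirrors the notebook's text columns.
toy = pd.DataFrame({
    "co_2_emission": pd.Series([np.nan, "99 g CO2/km (comb)", np.nan], dtype=object),
    "full_service": pd.Series(["102 g CO2/km (comb)", "", np.nan], dtype=object),
    "fuel": pd.Series(["Benzine", "Diesel", "Diesel"], dtype=object),
})

missing = toy["co_2_emission"].isnull()
found = columns_containing(toy, missing, "g CO2")
print(found)
```

Returning a list instead of printing makes it easy to reuse the result, for example to loop over the candidate columns and inspect them one by one.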


In [88]:
# Rows with a missing co2_emission_gr whose full_service value contains a CO2 figure
df[df["co2_emission_gr"].isnull() & df["full_service"].str.contains("g CO2", na=False)]

Unnamed: 0,url,make_model,short_description,body_type,price,vat,km,registration,prev_owner,type,previous_owners,next_inspection,inspection_new,warranty,full_service,non_smoking_vehicle,offer_number,body_color,paint_type,body_color_original,upholstery,nr_of_doors,nr_of_seats,model_code,gearing_type,displacement,cylinders,weight,drive_chain,fuel,consumption,co_2_emission,emission_class,comfort_&_convenience,entertainment_&_media,extras,safety_&_security,emission_label,gears,country_version,electricity_consumption,last_service_date,other_fuel_types,availability,last_timing_belt_service_date,available_from,consumption_comb,consumption_city,consumption_country,hp_kw,Particulate_Filter,inspection_time,inspection_situation,warranty_month,co2_emission_gr
363,https://www.autoscout24.com//offers/audi-a1-1-...,Audi A1,1.0 TFSI S tronic Navi Automatik Sitzh.,Sedans,14959,VAT deductible,56967.0,2016-03-01,2.0,Used,2,,Yes4.4 l/100 km (comb)5.4 l/100 km (city)3.8 l...,12 months0 kWh/100 km (comb),102 g CO2/km (comb),Euro 6,13938,Black,Metallic,Mythosschwarz Metallic,"Cloth, Black",3,4,0588/BCV,Automatic,999 cc,3,,,Benzine,"4.4 l/100 km (comb),5.4 l/100 km (city),3.8 l/...",,,"Air conditioning,Armrest,Automatic climate con...","Bluetooth,CD player,Hands-free equipment,MP3,O...","Alloy wheels,Sport suspension","ABS,Central door lock,Daytime running lights,D...",4 (Green),7,Germany,0 kWh/100 km (comb),,"[],[],[]",,,,4.4,5.4,3.8,70.0,0,NaT,1,12.0,


In [89]:
# Fill missing co2_emission_gr from the "<n> g CO2/km" value found in full_service
mask = df["co2_emission_gr"].isnull() & df["full_service"].str.contains("g CO2", na=False)
df.loc[mask, "co2_emission_gr"] = df.loc[mask, "full_service"].str.extract(r"(\d+) g CO2/km")[0].astype(float)

### full_service

In [90]:
first_look("full_service")

column name :  full_service
----------------------------------------
Per_of_Nulls   :  % 48.39
Number of Nulls  :  7704
Number of Uniques:  121
Type of columns:  object
----------------------------------------
Unique values of columns:  ['' nan '99 g CO2/km (comb)' '4 (Green)' '92 g CO2/km (comb)' 'Euro 6'
 '5 (Blue)' '97 g CO2/km (comb)' '102 g CO2/km (comb)'
 '104 g CO2/km (comb)' '103 g CO2/km (comb)' '112 g CO2/km (comb)'
 'Euro 6d-TEMP' '1 (No sticker)' '91 g CO2/km (comb)'
 '114 g CO2/km (comb)' 'Euro 5' '118 g CO2/km (comb)'
 '115 g CO2/km (comb)' '117 g CO2/km (comb)'
 '4.4 l/100 km (comb)5.2 l/100 km (city)3.9 l/100 km (country)'
 '105 g CO2/km (comb)' '129 g CO2/km (comb)' '101 g CO2/km (comb)'
 '120 g CO2/km (comb)' '110 g CO2/km (comb)' '111 g CO2/km (comb)'
 '137 g CO2/km (comb)' '0 kWh/100 km (comb)' '108 g CO2/km (comb)'
 '98 g CO2/km (comb)' '126 g CO2/km (comb)' 'Euro 6d'
 '106 g CO2/km (comb)'
 '4.8 l/100 km (comb)5.9 l/100 km (city)4.2 l/100 km (country)'
 '4.9 l/100

### non_smoking_vehicle

In [91]:
first_look("non_smoking_vehicle")

column name :  non_smoking_vehicle
----------------------------------------
Per_of_Nulls   :  % 54.92
Number of Nulls  :  8742
Number of Uniques:  93
Type of columns:  object
----------------------------------------
Unique values of columns:  ['' nan 'Euro 6' '4 (Green)' '102 g CO2/km (comb)' '97 g CO2/km (comb)'
 '103 g CO2/km (comb)' '99 g CO2/km (comb)' '106 g CO2/km (comb)'
 '117 g CO2/km (comb)' '107 g CO2/km (comb)' 'Euro 5'
 '123 g CO2/km (comb)' 'Euro 6d-TEMP' '90 g CO2/km (comb)'
 '108 g CO2/km (comb)' '98 g CO2/km (comb)' '104 g CO2/km (comb)'
 '112 g CO2/km (comb)' '105 g CO2/km (comb)' '91 g CO2/km (comb)'
 '110 g CO2/km (comb)' '142 g CO2/km (comb)' '115 g CO2/km (comb)'
 '111 g CO2/km (comb)' '136 g CO2/km (comb)' '116 g CO2/km (comb)'
 '101 g CO2/km (comb)' '118 g CO2/km (comb)' '137 g CO2/km (comb)'
 '127 g CO2/km (comb)' '119 g CO2/km (comb)' '189 g CO2/km (comb)'
 '1.6 l/100 km (comb)' '126 g CO2/km (comb)' '134 g CO2/km (comb)'
 '1 (No sticker)' '122 g CO2/km (comb)'

### offer_number

In [92]:
first_look("offer_number")

column name :  offer_number
----------------------------------------
Per_of_Nulls   :  % 19.94
Number of Nulls  :  3175
Number of Uniques:  11440
Type of columns:  object
----------------------------------------
Unique values of columns:  ['LR-062483' nan 'AM-95365' ... 'Espace16' '2691331' 'Re_30000008029']
----------------------------------------
# 250678          1
# 8H6050830       1
# G1024529        1
# G6050580        1
#8023778          2
               ... 
x_45689v          2
y8fx64x           1
zr11914           1
zr11916           1
NaN            3175
Name: offer_number, Length: 11441, dtype: int64
----------------------------------------
NaN                                         3175
LT67679                                       27
UN89904                                       27
XJ38068                                       27
JV03654                                       27
                                            ... 
160_dcbb6c3e-a6da-43a3-8754-ccd994cec93b      

In [93]:
df.drop(columns="offer_number", inplace=True)

### body_color & body_color_original

In [94]:
first_look("body_color")

column name :  body_color
----------------------------------------
Per_of_Nulls   :  % 3.75
Number of Nulls  :  597
Number of Uniques:  14
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Black' 'Red' 'Brown' 'White' 'Grey' 'Silver' 'Blue' nan 'Beige' 'Violet'
 'Yellow' 'Green' 'Bronze' 'Orange' 'Gold']
----------------------------------------
Beige      108
Black     3745
Blue      1431
Bronze       6
Brown      289
Gold         2
Green      154
Grey      3505
Orange       3
Red        957
Silver    1647
Violet      18
White     3406
Yellow      51
NaN        597
Name: body_color, dtype: int64
----------------------------------------
Black     3745
Grey      3505
White     3406
Silver    1647
Blue      1431
Red        957
NaN        597
Brown      289
Green      154
Beige      108
Yellow      51
Violet      18
Bronze       6
Orange       3
Gold         2
Name: body_color, dtype: int64
#######################################################

In [95]:
first_look("body_color_original")

column name :  body_color_original
----------------------------------------
Per_of_Nulls   :  % 23.61
Number of Nulls  :  3759
Number of Uniques:  1927
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Mythosschwarz' nan 'mythosschwarz metallic' ... 'Grau - Stahl Grau'
 'titaniumgraumetallic' 'Perlmutt-Weiß Metallic (Weiß)']
----------------------------------------
"PLATIN"                          1
"satinsilber "                    1
(0C0C) Monsungrau Metallic        1
(B4B4) Cortinaweiss               1
(NNP)                             1
                               ... 
wählbar - ggf. mit Aufpreis     118
wählbar -ggfl. mit Aufpreis       1
wählbar, ggf gegen Aufpreis       2
zwart                             2
NaN                            3759
Name: body_color_original, Length: 1928, dtype: int64
----------------------------------------
NaN                              3759
Onyx Schwarz                      338
Bianco              

In [96]:
df[df["body_color"].isnull()]["body_color_original"].value_counts()

wählbar - ggf. mit Aufpreis       118
wählbar                            88
Metallic o. Uni (wählbar)          33
wählbar - ggf gegen Aufpreis       25
null                               15
                                 ... 
rouge eclat                         1
quarzgrau dunkel Grau metallic      1
Licht Grau M2                       1
graphitgrau Metallik                1
Farbe: Sonstige                     1
Name: body_color_original, Length: 86, dtype: int64

In [97]:
df[df["body_color"].isnull()]["body_color_original"].apply(lambda x: x in ['Schwarz' ,'Rot', 'Braun' ,'Weiß', 'Grau', 'Silber',  'Blau', 'Beige' ,'Violett', 'Gelb' ,'Grün' ,'Bronze' ,'Orange' ,'Gold']).value_counts()

False    597
Name: body_color_original, dtype: int64
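
The exact-match test above finds nothing because values such as `Mythosschwarz` embed the colour word inside a longer name. One possible alternative, sketched here under assumptions, is a case-insensitive substring lookup. The mapping `GERMAN_TO_ENGLISH` and the function `map_colour` are hypothetical, the sample strings are copied from the outputs above, and substring matching can produce false positives, so any result would need manual review.

```python
import numpy as np
import pandas as pd

# Hypothetical, incomplete German -> English colour mapping.
GERMAN_TO_ENGLISH = {
    "schwarz": "Black",
    "weiß": "White",
    "weiss": "White",
    "grau": "Grey",
    "silber": "Silver",
    "blau": "Blue",
    "rot": "Red",
}

def map_colour(text):
    """Return the first English colour whose German word occurs in `text`, else NaN."""
    if not isinstance(text, str):
        return np.nan
    lowered = text.lower()
    for german, english in GERMAN_TO_ENGLISH.items():
        if german in lowered:
            return english
    return np.nan

samples = pd.Series([
    "Mythosschwarz",
    "(B4B4) Cortinaweiss",
    "wählbar - ggf. mit Aufpreis",
    "graphitgrau Metallik",
])
mapped = samples.map(map_colour)
print(mapped)
```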

In [98]:
# Check which text columns contain an English colour name in the rows where body_color is null
for i in df.select_dtypes(include="O").columns:
    if df[df["body_color"].isnull()][i].str.contains('Black|Red|Brown|White|Grey|Silver|Blue |Beige|Violet|Yellow|Green|Bronze|Orange|Gold', regex=True).any():
        print(i)

short_description
inspection_new
warranty
full_service
non_smoking_vehicle
body_color_original
upholstery
emission_label


In [99]:
# Rows with a missing body_color whose body_color_original contains an English colour name
df[df["body_color"].isnull() & df["body_color_original"].str.contains('Black|Red|Brown|White|Grey|Silver|Blue|Beige|Violet|Yellow|Green|Bronze|Orange|Gold', regex=True, na=False)]

Unnamed: 0,url,make_model,short_description,body_type,price,vat,km,registration,prev_owner,type,previous_owners,next_inspection,inspection_new,warranty,full_service,non_smoking_vehicle,body_color,paint_type,body_color_original,upholstery,nr_of_doors,nr_of_seats,model_code,gearing_type,displacement,cylinders,weight,drive_chain,fuel,consumption,co_2_emission,emission_class,comfort_&_convenience,entertainment_&_media,extras,safety_&_security,emission_label,gears,country_version,electricity_consumption,last_service_date,other_fuel_types,availability,last_timing_belt_service_date,available_from,consumption_comb,consumption_city,consumption_country,hp_kw,Particulate_Filter,inspection_time,inspection_situation,warranty_month,co2_emission_gr
1706,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 1.0 TFSI Ultra *Sitzheizung*Start/St...,Sedans,16490,VAT deductible,100.0,2018-07-01,,Pre-registered,,,,,,,,,Cortina White,"Cloth, Black",5,5.0,,Manual,999 cc,,,front,Benzine,"['4.2 l/100 km (comb)'],['5 l/100 km (city)'],...",97 g CO2/km (comb),Euro 6,"Air conditioning,Armrest,Automatic climate con...","CD player,MP3,On-board computer,Radio",Alloy wheels,"ABS,Central door lock,Daytime running lights,E...",4 (Green),5.0,European Union,,,,,,,4.2,5.0,3.7,70.0,0,NaT,0,,97.0
9075,https://www.autoscout24.com//offers/opel-corsa...,Opel Corsa,E Edition Aut. SHZG LHZG GRA PDC Klima Bluetooth,Sedans,13890,VAT deductible,10.0,2018-08-01,1.0,Pre-registered,1.0,,Yes149 g CO2/km (comb),,,Euro 6,,Metallic,Graphit Grau/Graffiti Grey,,5,5.0,0035/BCB,Automatic,"1,398 cc",4.0,"1,248 kg",,Benzine,"['6.6 l/100 km (comb)'],['8.2 l/100 km (city)'...",149 g CO2/km (comb),Euro 6,"Air conditioning,Cruise control,Electrical sid...","Bluetooth,Hands-free equipment,MP3,On-board co...",Alloy wheels,"ABS,Central door lock,Daytime running lights,D...",4 (Green),6.0,,,,,,,,6.6,8.2,5.6,66.0,0,NaT,1,,149.0
11591,https://www.autoscout24.com//offers/opel-insig...,Opel Insignia,2 SPORT TOURER EDITION 136 CDTI,Station wagon,18654,,13953.0,2017-10-01,,Used,,,,,,,,,GAN Sovereign Silver,,5,,,Manual,,,,,Diesel,,"[],[],[]",,Park Distance Control,,,"Passenger-side airbag,Side airbag",,,,,,,,,,,,,100.0,0,NaT,0,,
11620,https://www.autoscout24.com//offers/opel-insig...,Opel Insignia,2 GRAND SPORT EDITION 110 CDTI,Coupe,17094,,28142.0,2017-06-01,,Used,,,,,,,,,GDX Darkmoon Blue,,4,,,Manual,,,,,Diesel,,"[],[],[]",,Park Distance Control,,,"Passenger-side airbag,Side airbag",,,,,,,,,,,,,81.0,0,NaT,0,,
11627,https://www.autoscout24.com//offers/opel-insig...,Opel Insignia,2 SPORT TOURER EDITION 136 CDTI,Station wagon,18054,,24178.0,2017-11-01,,Used,,,,,,,,,GF6 Satin Steel Grey,,5,,,Manual,,,,,Diesel,,"[],[],[]",,Park Distance Control,,,"Passenger-side airbag,Side airbag",,,,,,,,,,,,,100.0,0,NaT,0,,


In [100]:
df.loc[[1706,9075,11591,11620,11627],"body_color"] = ["White","Grey","Silver","Blue","Grey"]

In [101]:
# Rows with a missing body_color whose short_description contains an English colour name
df[df["body_color"].isnull() & df["short_description"].str.contains('Black|Red|Brown|White|Grey|Silver|Blue|Beige|Violet|Yellow|Green|Bronze|Orange|Gold', regex=True, na=False)]

Unnamed: 0,url,make_model,short_description,body_type,price,vat,km,registration,prev_owner,type,previous_owners,next_inspection,inspection_new,warranty,full_service,non_smoking_vehicle,body_color,paint_type,body_color_original,upholstery,nr_of_doors,nr_of_seats,model_code,gearing_type,displacement,cylinders,weight,drive_chain,fuel,consumption,co_2_emission,emission_class,comfort_&_convenience,entertainment_&_media,extras,safety_&_security,emission_label,gears,country_version,electricity_consumption,last_service_date,other_fuel_types,availability,last_timing_belt_service_date,available_from,consumption_comb,consumption_city,consumption_country,hp_kw,Particulate_Filter,inspection_time,inspection_situation,warranty_month,co2_emission_gr
1641,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 35 TFSI Black line S tronic,Compact,23032,,,NaT,,New,,,,,,,,,,,5,5,,Automatic,"1,498 cc",4.0,"1,180 kg",front,Benzine,"['5 l/100 km (comb)'],['6 l/100 km (city)'],['...",,,,,,,,7,,,,,,,,5.0,6.0,3.0,110.0,0,NaT,0,,
1680,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 30 TFSI Black line S tronic,Compact,21664,,,NaT,,New,,,,,,,,,,,5,5,,Automatic,999 cc,3.0,"1,200 kg",front,Benzine,"['4 l/100 km (comb)'],['5 l/100 km (city)'],['...",,,,,,,,7,,,,,,,,4.0,5.0,4.0,85.0,0,NaT,0,,
1928,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 40 TFSI Black line S tronic,Compact,28200,,,NaT,,New,,,,,,,,,,,5,5,,Automatic,"1,984 cc",4.0,"1,180 kg",front,Benzine,"['6 l/100 km (comb)'],['8 l/100 km (city)'],['...",,,,,,,,7,,,,,,,,6.0,8.0,4.0,147.0,0,NaT,0,,
2321,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 30 TFSI Black line,Compact,19826,,,NaT,,New,,,,,,,,,,,5,5,,Manual,999 cc,3.0,"1,180 kg",front,Benzine,"['4 l/100 km (comb)'],['5 l/100 km (city)'],['...",,,,,,,,6,,,,,,,,4.0,5.0,4.0,85.0,0,NaT,0,,
2361,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 25 TFSI Black line,Compact,18670,,,NaT,,New,,,,,,,,,,,5,5,,Manual,999 cc,3.0,"1,165 kg",front,Benzine,"['4 l/100 km (comb)'],['5 l/100 km (city)'],['...",,,,,,,,5,,,,,,,,4.0,5.0,3.0,70.0,0,NaT,0,,
2362,https://www.autoscout24.com//offers/audi-a1-sp...,Audi A1,Sportback 25 TFSI Black line,Compact,18670,,,NaT,,New,,,,,,,,,,,5,5,,Manual,999 cc,3.0,"1,165 kg",front,Benzine,"['4 l/100 km (comb)'],['5 l/100 km (city)'],['...",,,,,,,,5,,,,,,,,4.0,5.0,3.0,70.0,0,NaT,0,,
8804,https://www.autoscout24.com//offers/opel-corsa...,Opel Corsa,E Klimaanlage Radio Bluetooth ISOFIX,Compact,6450,VAT deductible,73519.0,2016-04-01,2.0,Used,2.0,,,60 months5.4 l/100 km (comb),128 g CO2/km (comb),Euro 6,,,,"Cloth, Black",2,5,,Manual,"1,229 cc",,,,Benzine,"5.4 l/100 km (comb),",128 g CO2/km (comb),,"Air conditioning,Electrical side mirrors,Hill ...","On-board computer,Radio",,"ABS,Central door lock,Daytime running lights,D...",4 (Green),5,Germany,,,"[],[],[]",,,,5.4,,,51.0,0,NaT,0,60.0,128.0
10131,https://www.autoscout24.com//offers/opel-corsa...,Opel Corsa,1.4 90CV aut. 5 porte Black Edition,Sedans,14900,,,2019-03-01,,Pre-registered,,,,,,,,,,,5,5,,Automatic,"1,398 cc",4.0,"1,199 kg",front,Benzine,"['6 l/100 km (comb)'],['8 l/100 km (city)'],['...",139 g CO2/km (comb),Euro 6,"Air conditioning,Electrical side mirrors,Power...","Bluetooth,CD player,Radio,USB","Alloy wheels,Sport package,Sport seats","ABS,Central door lock,Driver-side airbag,Elect...",,6,,,,,,,,6.0,8.0,4.9,66.0,0,NaT,0,,139.0
15493,https://www.autoscout24.com//offers/renault-es...,Renault Espace,Blue dCi TT Intens EDC 118kW,Van,32550,,,NaT,,New,,,,,,,,,,,5,5,,Automatic,"1,997 cc",4.0,"1,659 kg",front,Diesel,"['5 l/100 km (comb)'],['5 l/100 km (city)'],['...",,,,,,,,6,,,,,,,,5.0,5.0,4.0,118.0,0,NaT,0,,
15654,https://www.autoscout24.com//offers/renault-es...,Renault Espace,Blue dCi TT Limited EDC 118kW,Van,35146,,,NaT,,New,,,,,,,,,,,5,5,,Automatic,"1,997 cc",4.0,"1,659 kg",front,Diesel,"['5 l/100 km (comb)'],['5 l/100 km (city)'],['...",,,,,,,,6,,,,,,,,5.0,5.0,4.0,118.0,0,NaT,0,,


In [102]:
colors = 'Black|Red|Brown|White|Grey|Silver|Blue|Beige|Violet|Yellow|Green|Bronze|Orange|Gold'
mask = df["body_color"].isnull() & df["short_description"].str.contains(colors, regex=True, na=False)
# Keep the space outside the capture group so the extracted colour has no trailing whitespace
df.loc[mask, "short_description"].str.extract(r"(Black|Blue) ")

Unnamed: 0,0
1641,Black
1680,Black
1928,Black
2321,Black
2361,Black
2362,Black
8804,
10131,Black
15493,Blue
15654,Blue


In [103]:
# Fill missing body_color from short_description; the space stays outside the capture group
# so the stored value is "Black"/"Blue" rather than "Black "/"Blue "
colors = 'Black|Red|Brown|White|Grey|Silver|Blue|Beige|Violet|Yellow|Green|Bronze|Orange|Gold'
mask = df["body_color"].isnull() & df["short_description"].str.contains(colors, regex=True, na=False)
df.loc[mask, "body_color"] = df.loc[mask, "short_description"].str.extract(r"(Black|Blue) ")[0]

In [104]:
df.drop(columns="body_color_original", inplace=True)

### paint_type

In [105]:
first_look("paint_type")

column name :  paint_type
----------------------------------------
Per_of_Nulls   :  % 36.26
Number of Nulls  :  5772
Number of Uniques:  3
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Metallic' nan 'Uni/basic' 'Perl effect']
----------------------------------------
Metallic       9794
Perl effect       6
Uni/basic       347
NaN            5772
Name: paint_type, dtype: int64
----------------------------------------
Metallic       9794
NaN            5772
Uni/basic       347
Perl effect       6
Name: paint_type, dtype: int64
################################################################################



### upholstery

In [106]:
first_look("upholstery")

column name :  upholstery
----------------------------------------
Per_of_Nulls   :  % 23.37
Number of Nulls  :  3720
Number of Uniques:  46
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Cloth, Black' 'Cloth, Grey' nan 'Part leather, Black' 'Cloth, Other'
 'Full leather, Black' 'Cloth, White' 'Black' 'Cloth' 'Other, Black'
 'Part leather' 'Full leather' 'Full leather, Red' 'alcantara, Black'
 'Other, Other' 'Velour' 'Cloth, Blue' 'Cloth, Red' 'Grey' 'Blue'
 'Velour, Black' 'Velour, Grey' 'Other' 'Part leather, Grey'
 'Cloth, Orange' 'Part leather, Other' 'alcantara, Grey' 'Other, Grey'
 'Full leather, Grey' 'Part leather, Brown' 'alcantara, Other'
 'Full leather, Brown' 'Part leather, Beige' 'Cloth, Beige'
 'Full leather, Other' 'White' 'Cloth, Brown' 'Other, Yellow' 'alcantara'
 'Full leather, Beige' 'Beige' 'Part leather, White' 'Part leather, Red'
 'Brown' 'Full leather, Blue' 'Other, Brown' 'Full leather, White']
--------------------

In [107]:
# Split upholstery into material (type) and color
df["upholstery_type"] = df["upholstery"].str.extract(r"(Cloth|\w+ leather|Other|Velour|alcantara)")

In [108]:
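# "\sOther" only matches "Other" when it follows "<type>, "; strip the captured leading space
df["upholstery_color"] = df["upholstery"].str.extract(r"(Black|Grey|\sOther|Brown|Beige|Blue|Red|Yellow|White|Orange)", expand=False).str.strip()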

In [109]:
# Check whether any other text column holds a material for rows where upholstery_type is missing
for i in df.select_dtypes(include="O").columns:
    if df[df["upholstery_type"].isnull()][i].str.contains(r"Cloth|\w+ leather|Velour|alcantara", regex=True).any():
        print(i)
# Nothing is printed, so no other column can be used to fill the gaps

In [110]:
df.drop(columns="upholstery", inplace=True)
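
To make the split above easier to check, here is a minimal sketch on a handful of values copied from the unique-value listing. The names `sample`, `sample_type` and `sample_color` are illustrative only; the patterns mirror the ones used in the cells, with the leading space stripped from a captured `" Other"`.

```python
import pandas as pd

# A few upholstery strings copied from the unique values listed above.
sample = pd.Series(
    ["Cloth, Black", "Part leather, Other", "Grey", "Other", "alcantara", float("nan")]
)

# Material: a bare "Other" is read as a material, a bare color gives NaN.
sample_type = sample.str.extract(
    r"(Cloth|\w+ leather|Other|Velour|alcantara)", expand=False
)

# Color: "Other" only counts as a color when whitespace precedes it.
sample_color = sample.str.extract(
    r"(Black|Grey|\sOther|Brown|Beige|Blue|Red|Yellow|White|Orange)", expand=False
).str.strip()

print(pd.DataFrame({"raw": sample, "type": sample_type, "color": sample_color}))
```

A value such as `Grey` ends up with a color but no type, while a bare `Other` ends up with a type but no color, which matches how the two columns are filled above.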

### nr_of_doors & nr_of_seats

In [111]:
first_look("nr_of_doors")

column name :  nr_of_doors
----------------------------------------
Per_of_Nulls   :  % 1.33
Number of Nulls  :  212
Number of Uniques:  6
Type of columns:  object
----------------------------------------
Unique values of columns:  ['5' '3' '4' '2' nan '1' '7']
----------------------------------------
1          1
2        219
3        832
4       3079
5      11575
7          1
NaN      212
Name: nr_of_doors, dtype: int64
----------------------------------------
5      11575
4       3079
3        832
2        219
NaN      212
1          1
7          1
Name: nr_of_doors, dtype: int64
################################################################################



In [112]:
first_look("nr_of_seats")

column name :  nr_of_seats
----------------------------------------
Per_of_Nulls   :  % 6.14
Number of Nulls  :  977
Number of Uniques:  6
Type of columns:  object
----------------------------------------
Unique values of columns:  ['5' '4' nan '6' '3' '2' '7']
----------------------------------------
2        116
3          1
4       1125
5      13336
6          2
7        362
NaN      977
Name: nr_of_seats, dtype: int64
----------------------------------------
5      13336
4       1125
NaN      977
7        362
2        116
6          2
3          1
Name: nr_of_seats, dtype: int64
################################################################################



In [113]:
#Changing data types to float
df["nr_of_doors"] = df["nr_of_doors"].astype(float)
df["nr_of_seats"] = df["nr_of_seats"].astype(float)
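
The conversion to float (rather than int) is what lets the missing values survive as `NaN`. As a hedged aside, pandas also offers a nullable integer dtype that keeps whole numbers; the sketch below compares the two on made-up door counts (`doors_raw` is not part of the dataset).

```python
import pandas as pd

# Door counts as scraped: digit strings with a gap.
doors_raw = pd.Series(["5", "3", float("nan"), "4"])

# Plain float conversion, as in the cell above: the gap stays NaN.
doors_float = doors_raw.astype(float)

# Alternative: the nullable "Int64" dtype keeps integers and marks the gap as <NA>.
doors_int = pd.to_numeric(doors_raw).astype("Int64")

print(doors_float.tolist())
print(doors_int.tolist())
```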

### model_code 

In [114]:
first_look("model_code")

column name :  model_code
----------------------------------------
Per_of_Nulls   :  % 68.73
Number of Nulls  :  10941
Number of Uniques:  232
Type of columns:  object
----------------------------------------
Unique values of columns:  ['0588/BDF' '0588/BCY' nan '0588/BDC' '0588/BDB' '0588/BCV' '0588/BCZ'
 '0588/BHL' '0588/BDG' '0588/BCW' '0588/BDA' '0588/BDD' '0588/BCX'
 '0588/BHM' '0000/000' '0588/BDE' '0588/BNO' '0588/000' '0588/AUJ'
 '0588/BNQ' '0588/BN0' '0588/BNP' '0588/BNN' '0588/BJV' '0588/AWJ'
 '0588/AYB' '0588/BAH' '0588/BAI' '0588/AXC' '0588/BAF' '0588/AVO'
 '0588/AWS' '0588/BAM' '0588/BAD' '0588/AWX' '0588/BAE' '0588/BAJ'
 '0588/AZX' '0588/AVR' '0588/AYA' '0588/AVQ' '0588/BET' '0588/AWQ'
 '0588/AGL' '0588/BIB' '0588/AZY' '0588/BLH' '0588/BER' '0588/BHX'
 '0588/BLL' '0588/BLG' '0588/BLF' '0588/AZZ' '0588/BHT' '0588/AYC'
 '0588/BLK' '0035/BHZ' '0035/BHQ' '0035/BFM' '0035/BHV' '0035/BHP'
 '1844/ADY' '0035/BGH' '0035/BHM' '0035/BGL' '0035/ASL' '0035/BFQ'
 '1844/AFF' '0035/BKN' 

In [115]:
df.drop(columns="model_code", inplace=True)

### gearing_type

In [116]:
first_look("gearing_type")

column name :  gearing_type
----------------------------------------
Per_of_Nulls   :  % 0.0
Number of Nulls  :  0
Number of Uniques:  3
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Automatic' 'Manual' 'Semi-automatic']
----------------------------------------
Automatic         7297
Manual            8153
Semi-automatic     469
Name: gearing_type, dtype: int64
----------------------------------------
Manual            8153
Automatic         7297
Semi-automatic     469
Name: gearing_type, dtype: int64
################################################################################



### displacement

In [117]:
first_look("displacement")

column name :  displacement
----------------------------------------
Per_of_Nulls   :  % 3.12
Number of Nulls  :  496
Number of Uniques:  77
Type of columns:  object
----------------------------------------
Unique values of columns:  ['1,422 cc' '1,798 cc' '1,598 cc' '999 cc' '1,395 cc' '929 cc' nan
 '1,596 cc' '1,600 cc' '1,000 cc' '1,984 cc' '1,498 cc' '1,197 cc'
 '995 cc' '998 cc' '1,968 cc' '1,400 cc' '2,000 cc' '1,568 cc' '1,896 cc'
 '2,480 cc' '1,499 cc' '1,495 cc' '1,398 cc' '1,584 cc' '997 cc'
 '1,399 cc' '1,364 cc' '1,490 cc' '996 cc' '1,696 cc' '1,686 cc'
 '1,396 cc' '15,898 cc' '139 cc' '1,368 cc' '140 cc' '1,397 cc' '1,248 cc'
 '1,229 cc' '1,300 cc' '1,200 cc' '973 cc' '1,239 cc' '1,350 cc'
 '1,369 cc' '1,390 cc' '122 cc' '1,198 cc' '1,195 cc' '1,956 cc'
 '1,998 cc' '2 cc' '2,967 cc' '1,856 cc' '16,000 cc' '1,500 cc' '1,496 cc'
 '1,533 cc' '1 cc' '1,599 cc' '1,995 cc' '1,461 cc' '1,618 cc' '1,149 cc'
 '1,199 cc' '898 cc' '890 cc' '900 cc' '54 cc' '1,100 cc' '1,333 cc'
 '899

In [118]:
df["displacement"].str.contains("cc").value_counts(dropna=False)

True    15423
NaN       496
Name: displacement, dtype: int64

In [119]:
# "1,422 cc" -> 1422.0
df["engine_size"] = df["displacement"].str.replace(",","").str.extract(r"(\d+)").astype(float)

In [120]:
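# Look for an engine size written as "<digit>-<digit>-" (e.g. "1-6" for 1.6 l) in the other text columns
for i in df.select_dtypes(include="O").columns:
    if df[df["engine_size"].isnull()][i].str.contains(r"\w+-\d-\d-", regex=True).any():
        print(i)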


url


In [121]:
df.loc[df["engine_size"].isnull() & df["url"].str.contains(r"-\d-\d-", regex=True, na=False), "url"].unique()

array(['https://www.autoscout24.com//offers/audi-a1-1-6-tdi-metal-plus-s-tronic-bi-color-navi-xeno-diesel-black-2db3a838-ee5a-457d-853d-a88d77cd3426',
       'https://www.autoscout24.com//offers/audi-a1-sportback-sportback-1-4-tdi-s-tronic-adrenalin2-diesel-grey-8bbc9cf9-2598-4f9b-a6cb-35e6d48c12b3',
       'https://www.autoscout24.com//offers/audi-a1-spb-1-6-tdi-116-cv-s-tronic-diesel-white-4adfa6f2-1721-40a0-a505-bb9af9f435cd',
       'https://www.autoscout24.com//offers/audi-a1-sportback-sportback-1-0-tfsi-95cv-attraction-gasoline-black-7e4808de-7057-4ea6-9580-50af607954bd',
       'https://www.autoscout24.com//offers/audi-a1-sportback-1-4-tdi-ultra-5deurs-2016-46580-kms-diesel-black-0842913d-77a9-44fa-a7ef-7a851984018c',
       'https://www.autoscout24.com//offers/audi-a1-sportback-1-0-tfsi-95ch-adrenalin-s-tronic-7-gasoline-red-1a964d44-2688-46b7-9d56-ab891f529d46',
       'https://www.autoscout24.com//offers/audi-a1-adrenalin-1-4-tdi-66kw-90cv-sportback-somos-conc-diesel-black-19

In [122]:
df.loc[df["engine_size"].isnull() & df["url"].str.contains(r"-\d-\d-", regex=True, na=False), "url"].str.extract(r"\w+-(\d-\d)-")[0].str.replace("-","").astype(float)*100

142      1600.0
191      1400.0
412      1600.0
505      1000.0
636      1400.0
          ...  
15472    1600.0
15477    1600.0
15496    1600.0
15573    1600.0
15706    1600.0
Name: 0, Length: 320, dtype: float64

In [123]:
# Fill missing engine sizes from the litre figure embedded in the URL ("1-6" -> 1600 cc)
mask = df["engine_size"].isnull() & df["url"].str.contains(r"-\d-\d-", regex=True, na=False)
df.loc[mask, "engine_size"] = df.loc[mask, "url"].str.extract(r"\w+-(\d-\d)-")[0].str.replace("-","").astype(float)*100
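
The URL trick can be hard to follow in one line, so here is a small sketch of the same idea on a toy frame. The URLs are shortened versions of the ones shown above (the trailing listing IDs are dropped), and `toy`, `litres_pattern` and `litres` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "url": [
            "https://www.autoscout24.com//offers/audi-a1-1-6-tdi-metal-plus-s-tronic-diesel-black",
            "https://www.autoscout24.com//offers/audi-a1-sportback-1-0-tfsi-95cv-attraction-gasoline-black",
            "https://www.autoscout24.com//offers/audi-a3-spb-30-g-tron-gas-grey",
        ],
        "engine_size": [float("nan"), float("nan"), 1498.0],
    }
)

# "<digit>-<digit>" between hyphens is read as litres, e.g. "1-6" means 1.6 l.
litres_pattern = r"\w+-(\d-\d)-"

# Only look at rows that still lack an engine size.
missing = toy["engine_size"].isna()
litres = toy.loc[missing, "url"].str.extract(litres_pattern, expand=False).dropna()

# "1-6" -> "16" -> 16.0 -> 1600.0 cc
toy.loc[litres.index, "engine_size"] = (
    litres.str.replace("-", "", regex=False).astype(float) * 100
)

print(toy["engine_size"].tolist())
```

The result is a rounded figure (1600 rather than 1598), so values filled this way are approximations of the true displacement.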

In [124]:
first_look("engine_size")

column name :  engine_size
----------------------------------------
Per_of_Nulls   :  % 1.11
Number of Nulls  :  176
Number of Uniques:  79
Type of columns:  float64
----------------------------------------
Unique values of columns:  [1.4220e+03 1.7980e+03 1.5980e+03 9.9900e+02 1.3950e+03 9.2900e+02
 1.6000e+03 1.4000e+03 1.5960e+03        nan 1.0000e+03 1.9840e+03
 1.4980e+03 1.1970e+03 9.9500e+02 9.9800e+02 1.9680e+03 2.0000e+03
 1.5680e+03 1.8960e+03 2.4800e+03 1.4990e+03 1.4950e+03 1.3980e+03
 1.5840e+03 9.9700e+02 1.3990e+03 1.3640e+03 1.4900e+03 9.9600e+02
 1.6960e+03 1.6860e+03 1.3960e+03 1.5898e+04 1.3900e+02 1.3680e+03
 1.4000e+02 1.3970e+03 1.2480e+03 1.3000e+03 1.2290e+03 1.2000e+03
 9.7300e+02 1.2390e+03 1.3500e+03 1.3690e+03 1.3900e+03 1.2200e+02
 1.1980e+03 1.1950e+03 1.9560e+03 1.9980e+03 2.8000e+03 2.0000e+00
 2.9670e+03 1.8560e+03 1.6000e+04 1.5000e+03 1.4960e+03 1.5330e+03
 1.0000e+00 1.5990e+03 1.9950e+03 1.4610e+03 1.6180e+03 1.1490e+03
 1.1990e+03 8.9800e+02 4.1000

In [125]:
df.drop(columns="displacement", inplace=True)

### cylinders

In [126]:
first_look("cylinders")

column name :  cylinders
----------------------------------------
Per_of_Nulls   :  % 35.68
Number of Nulls  :  5680
Number of Uniques:  7
Type of columns:  object
----------------------------------------
Unique values of columns:  ['3' '4' nan '8' '5' '1' '6' '2']
----------------------------------------
1         1
2         2
3      2104
4      8105
5        22
6         3
8         2
NaN    5680
Name: cylinders, dtype: int64
----------------------------------------
4      8105
NaN    5680
3      2104
5        22
6         3
8         2
2         2
1         1
Name: cylinders, dtype: int64
################################################################################



In [127]:
df["cylinders"] = df["cylinders"].astype(float)

### weight 

In [128]:
first_look("weight")

column name :  weight
----------------------------------------
Per_of_Nulls   :  % 43.81
Number of Nulls  :  6974
Number of Uniques:  434
Type of columns:  object
----------------------------------------
Unique values of columns:  ['1,220 kg' '1,255 kg' nan '1,195 kg' '1,275 kg' '1,250 kg' '1,135 kg'
 '1,175 kg' '1,065 kg' '1,180 kg' '1,190 kg' '1,630 kg' '1,165 kg'
 '1,205 kg' '1,110 kg' '1,675 kg' '1,720 kg' '1,625 kg' '1,215 kg'
 '1,200 kg' '1,115 kg' '1,665 kg' '1,040 kg' '1,660 kg' '1,225 kg'
 '1,090 kg' '1,217 kg' '1,610 kg' '1,580 kg' '1,155 kg' '1,140 kg'
 '1,230 kg' '1,157 kg' '1,120 kg' '1,792 kg' '1,635 kg' '1,210 kg'
 '1,640 kg' '1,060 kg' '1,105 kg' '1,285 kg' '1,500 kg' '1,235 kg'
 '1,088 kg' '1,097 kg' '1,125 kg' '1,240 kg' '1,695 kg' '1,280 kg'
 '1,035 kg' '1,010 kg' '1,223 kg' '1,530 kg' '1,540 kg' '1,134 kg'
 '1,600 kg' '1,705 kg' '1,650 kg' '1,565 kg' '1,145 kg' '1,265 kg'
 '102 kg' '1,263 kg' '1,485 kg' '1,094 kg' '1,345 kg' '1,130 kg'
 '1,680 kg' '1,166 kg' '1,114 

In [129]:
# Check whether any other text column mentions a weight for rows where weight is missing
for i in df.select_dtypes(include="O").columns:
    if df[df["weight"].isnull()][i].str.contains(r"\d+ kg", regex=True).any():
        print(i)
# Only consumption matches; the next cell shows it is gas consumption in kg/100 km, not a vehicle weight

consumption


In [130]:
df[df["weight"].isnull() & df["consumption"].str.contains(r"\d+ kg", regex=True, na=False)]

Unnamed: 0,url,make_model,short_description,body_type,price,vat,km,registration,prev_owner,type,previous_owners,next_inspection,inspection_new,warranty,full_service,non_smoking_vehicle,body_color,paint_type,nr_of_doors,nr_of_seats,gearing_type,cylinders,weight,drive_chain,fuel,consumption,co_2_emission,emission_class,comfort_&_convenience,entertainment_&_media,extras,safety_&_security,emission_label,gears,country_version,electricity_consumption,last_service_date,other_fuel_types,availability,last_timing_belt_service_date,available_from,consumption_comb,consumption_city,consumption_country,hp_kw,Particulate_Filter,inspection_time,inspection_situation,warranty_month,co2_emission_gr,upholstery_type,upholstery_color,engine_size
5161,https://www.autoscout24.com//offers/audi-a3-sp...,Audi A3,SPB 30 g-tron S tronic Business,Sedans,26900,VAT deductible,60.0,2019-05-01,,Pre-registered,,,,24 months,,,Grey,Metallic,5.0,5.0,Automatic,4.0,,front,Gas,"['8.3 kg/100 km (comb)'],['10.5 kg/100 km (cit...",95 g CO2/km (comb),Euro 6d-TEMP,"Air conditioning,Armrest,Automatic climate con...","Bluetooth,CD player,Hands-free equipment,On-bo...","Alloy wheels,Voice Control","ABS,Central door lock,Daytime running lights,D...",1 (No sticker),7,,,,,,,,8.3,10.5,6.9,96.0,0,NaT,0,24.0,95.0,Cloth,Grey,1498.0


In [131]:
# "1,220 kg" -> 1220.0
df["weight"] = df["weight"].str.replace(",","").str.extract(r"(\d+)").astype(float)
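
The weight and displacement columns are parsed the same way, so the step could be wrapped in one function. The sketch below assumes a hypothetical helper named `leading_number`; it is not part of the notebook, and the sample values are copied from the listings above.

```python
import pandas as pd

def leading_number(text_series):
    """Hypothetical helper: drop thousands separators, return the first run of digits as float."""
    return (
        text_series.str.replace(",", "", regex=False)
        .str.extract(r"(\d+)", expand=False)
        .astype(float)
    )

weight_raw = pd.Series(["1,220 kg", "102 kg", float("nan")])
displacement_raw = pd.Series(["1,422 cc", "999 cc", "2 cc"])

weight_kg = leading_number(weight_raw)
engine_cc = leading_number(displacement_raw)

print(weight_kg.tolist())
print(engine_cc.tolist())
```

Values such as `102 kg` or `2 cc` parse without error but are implausible for a car, so they remain candidates for the outlier checks later on.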

### drive_chain

In [132]:
first_look("drive_chain")

column name :  drive_chain
----------------------------------------
Per_of_Nulls   :  % 43.08
Number of Nulls  :  6858
Number of Uniques:  3
Type of columns:  object
----------------------------------------
Unique values of columns:  ['front' nan '4WD' 'rear']
----------------------------------------
4WD       171
front    8886
rear        4
NaN      6858
Name: drive_chain, dtype: int64
----------------------------------------
front    8886
NaN      6858
4WD       171
rear        4
Name: drive_chain, dtype: int64
################################################################################



### emission_class

In [133]:
first_look("emission_class")

column name :  emission_class
----------------------------------------
Per_of_Nulls   :  % 18.98
Number of Nulls  :  3021
Number of Uniques:  7
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Euro 6' nan 'Euro 5' 'Euro 6d-TEMP' '[],[],[]' 'Euro 6c' 'Euro 4'
 'Euro 6d']
----------------------------------------
Euro 4             40
Euro 5             78
Euro 6          10139
Euro 6c           127
Euro 6d            62
Euro 6d-TEMP     1845
[],[],[]          607
NaN              3021
Name: emission_class, dtype: int64
----------------------------------------
Euro 6          10139
NaN              3021
Euro 6d-TEMP     1845
[],[],[]          607
Euro 6c           127
Euro 5             78
Euro 6d            62
Euro 4             40
Name: emission_class, dtype: int64
################################################################################



In [134]:
# Anything without "Euro" (e.g. the '[],[],[]' placeholder) carries no information; treat it as missing
df.loc[~df["emission_class"].str.contains("Euro", na=False), "emission_class"] = np.nan

In [135]:
# List the text columns that mention a Euro class for rows where emission_class is missing
for i in df.select_dtypes(include="O").columns:
    if df[df["emission_class"].isnull()][i].str.contains(r"Euro \d", regex=True).any():
        print(i)

short_description
inspection_new
warranty
full_service
non_smoking_vehicle
last_service_date


In [136]:
# Fill missing emission_class values from any text column that mentions a Euro class
for i in df.select_dtypes(include="O").columns:
    mask = df["emission_class"].isnull() & df[i].str.contains(r"Euro \d", regex=True, na=False)
    if mask.any():
        df.loc[mask, "emission_class"] = df.loc[mask, i].str.extract(r"(Euro \d+.*)")[0]

In [137]:
# Keep only the class itself; "Euro 6d-TEMP" collapses to "Euro 6d"
df["emission_class"] = df["emission_class"].str.extract(r"(Euro \d\w?|\d\w-TEMP)")
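
The effect of the final pattern is easiest to see on a few raw strings. This is a small sketch with values copied from the listings above; `raw_class` and `normalised` are illustrative names.

```python
import pandas as pd

raw_class = pd.Series(
    ["Euro 6", "Euro 6d-TEMP", "Euro 6c", "[],[],[]", "12/2018Euro 6", float("nan")]
)

# Same pattern as the cell above: "Euro", one digit, and an optional trailing letter.
normalised = raw_class.str.extract(r"(Euro \d\w?|\d\w-TEMP)", expand=False)

print(normalised.tolist())
```

Because the first alternative already matches `Euro 6d`, the `-TEMP` suffix is dropped, which is why `Euro 6d-TEMP` no longer appears in the output that follows.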

In [138]:
first_look("emission_class")   

column name :  emission_class
----------------------------------------
Per_of_Nulls   :  % 18.34
Number of Nulls  :  2920
Number of Uniques:  5
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Euro 6' nan 'Euro 5' 'Euro 6d' 'Euro 6c' 'Euro 4']
----------------------------------------
Euro 4        40
Euro 5        78
Euro 6     10625
Euro 6c      135
Euro 6d     2121
NaN         2920
Name: emission_class, dtype: int64
----------------------------------------
Euro 6     10625
NaN         2920
Euro 6d     2121
Euro 6c      135
Euro 5        78
Euro 4        40
Name: emission_class, dtype: int64
################################################################################



### comfort_&_convenience & entertainment_&_media & extras & safety_&_security & description

In [139]:
def get_diff_category_column(Series:pd.Series, exclude=''',/\n''', pattern=r'''[,\n]| /''', strip='''\n' "!?|.,*+-_/]['''):
    """
    exclude  - characters that mark a category separator within a row
    pattern  - regex used to split the data in a row
    strip    - irrelevant characters to strip from each value
    """
    import re
    column = Series.dropna().apply(str).str.strip(strip)
    diff_value = list()
    for row in column:
        if not any(x in exclude for x in row) and row not in diff_value:
            diff_value.append(row)
        else:
            for data in map(lambda x: x.strip(strip), filter(None, re.split(pattern, row))):
                if data not in diff_value:
                    diff_value.append(data)
    return dict(enumerate(sorted(diff_value)))
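
For comma-separated columns like these, a similar listing can be sketched with pandas built-ins alone. This is an alternative under the assumption that a comma is the only separator, not a replacement for the function above; `extras_sample` is made-up data in the style of the `extras` column.

```python
import pandas as pd

extras_sample = pd.Series(
    [
        "Alloy wheels,Catalytic Converter,Voice Control",
        "Alloy wheels,Sport seats",
        float("nan"),
        "Alloy wheels",
    ]
)

# One row per item, then the sorted distinct items, numbered like the function's output.
distinct = sorted(extras_sample.dropna().str.split(",").explode().str.strip().unique())
categories = dict(enumerate(distinct))
print(categories)

# The same column as 0/1 indicator columns, one per distinct item.
dummies = extras_sample.str.get_dummies(sep=",")
print(dummies)
```

Rows with a missing value get all zeros in `dummies`, so missingness has to be tracked separately if it matters for the analysis.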

In [140]:
first_look("comfort_&_convenience")

column name :  comfort_&_convenience
----------------------------------------
Per_of_Nulls   :  % 5.78
Number of Nulls  :  920
Number of Uniques:  6198
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Air conditioning,Armrest,Automatic climate control,Cruise control,Electrical side mirrors,Hill Holder,Leather steering wheel,Light sensor,Multi-function steering wheel,Navigation system,Park Distance Control,Parking assist system sensors rear,Power windows,Rain sensor,Seat heating,Start-stop system'
 'Air conditioning,Automatic climate control,Hill Holder,Leather steering wheel,Lumbar support,Parking assist system sensors rear,Power windows,Start-stop system,Tinted windows'
 'Air conditioning,Cruise control,Electrical side mirrors,Hill Holder,Leather steering wheel,Multi-function steering wheel,Navigation system,Park Distance Control,Parking assist system sensors front,Parking assist system sensors rear,Power windows,Seat heating,Start-stop sy

In [141]:
get_diff_category_column(df["comfort_&_convenience"])

{0: 'Air conditioning',
 1: 'Air suspension',
 2: 'Armrest',
 3: 'Automatic climate control',
 4: 'Auxiliary heating',
 5: 'Cruise control',
 6: 'Electric Starter',
 7: 'Electric tailgate',
 8: 'Electrical side mirrors',
 9: 'Electrically adjustable seats',
 10: 'Electrically heated windshield',
 11: 'Heads-up display',
 12: 'Heated steering wheel',
 13: 'Hill Holder',
 14: 'Keyless central door lock',
 15: 'Leather seats',
 16: 'Leather steering wheel',
 17: 'Light sensor',
 18: 'Lumbar support',
 19: 'Massage seats',
 20: 'Multi-function steering wheel',
 21: 'Navigation system',
 22: 'Panorama roof',
 23: 'Park Distance Control',
 24: 'Parking assist system camera',
 25: 'Parking assist system self-steering',
 26: 'Parking assist system sensors front',
 27: 'Parking assist system sensors rear',
 28: 'Power windows',
 29: 'Rain sensor',
 30: 'Seat heating',
 31: 'Seat ventilation',
 32: 'Split rear seats',
 33: 'Start-stop system',
 34: 'Sunroof',
 35: 'Tinted windows',
 36: 'Wind de

In [142]:
first_look("entertainment_&_media")

column name :  entertainment_&_media
----------------------------------------
Per_of_Nulls   :  % 8.63
Number of Nulls  :  1374
Number of Uniques:  346
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Bluetooth,Hands-free equipment,On-board computer,Radio'
 'Bluetooth,Hands-free equipment,On-board computer,Radio,Sound system'
 'MP3,On-board computer'
 'Bluetooth,CD player,Hands-free equipment,MP3,On-board computer,Radio,Sound system,USB'
 'Bluetooth,CD player,Hands-free equipment,MP3,On-board computer,Radio,USB'
 'Bluetooth,Hands-free equipment,On-board computer,Radio,Sound system,USB'
 'Bluetooth,CD player,Hands-free equipment,On-board computer,Radio,Sound system,USB'
 'CD player,MP3,Radio' 'Radio' nan
 'CD player,Hands-free equipment,On-board computer,Radio,USB'
 'Bluetooth,On-board computer,Radio'
 'Bluetooth,CD player,Hands-free equipment,On-board computer,Radio'
 'Bluetooth,CD player,Hands-free equipment,MP3,On-board computer,Radio'
 '

In [143]:
get_diff_category_column(df["entertainment_&_media"])

{0: 'Bluetooth',
 1: 'CD player',
 2: 'Digital radio',
 3: 'Hands-free equipment',
 4: 'MP3',
 5: 'On-board computer',
 6: 'Radio',
 7: 'Sound system',
 8: 'Television',
 9: 'USB'}

In [144]:
first_look("extras")

column name :  extras
----------------------------------------
Per_of_Nulls   :  % 18.61
Number of Nulls  :  2962
Number of Uniques:  659
Type of columns:  object
----------------------------------------
Unique values of columns:  ['Alloy wheels,Catalytic Converter,Voice Control'
 'Alloy wheels,Sport seats,Sport suspension,Voice Control'
 'Alloy wheels,Voice Control' 'Alloy wheels,Sport seats,Voice Control'
 'Alloy wheels,Sport package,Sport suspension,Voice Control'
 'Alloy wheels,Sport package,Sport seats,Sport suspension' 'Alloy wheels'
 nan 'Alloy wheels,Shift paddles' 'Alloy wheels,Sport seats'
 'Alloy wheels,Catalytic Converter,Sport package,Sport seats,Sport suspension,Voice Control'
 'Alloy wheels,Sport seats,Sport suspension'
 'Alloy wheels,Sport package,Sport seats' 'Alloy wheels,Sport package'
 'Alloy wheels,Catalytic Converter,Shift paddles,Voice Control'
 'Alloy wheels,Shift paddles,Sport package,Voice Control'
 'Alloy wheels,Catalytic Converter,Sport seats,Voice Control,W

In [145]:
get_diff_category_column(df["extras"])

{0: 'Alloy wheels',
 1: 'Cab or rented Car',
 2: 'Catalytic Converter',
 3: 'Handicapped enabled',
 4: 'Right hand drive',
 5: 'Roof rack',
 6: 'Shift paddles',
 7: 'Ski bag',
 8: 'Sliding door',
 9: 'Sport package',
 10: 'Sport seats',
 11: 'Sport suspension',
 12: 'Touch screen',
 13: 'Trailer hitch',
 14: 'Tuned car',
 15: 'Voice Control',
 16: 'Winter tyres'}

In [146]:
first_look("safety_&_security")

column name :  safety_&_security
----------------------------------------
Per_of_Nulls   :  % 6.17
Number of Nulls  :  982
Number of Uniques:  4443
Type of columns:  object
----------------------------------------
Unique values of columns:  ['ABS,Central door lock,Daytime running lights,Driver-side airbag,Electronic stability control,Fog lights,Immobilizer,Isofix,Passenger-side airbag,Power steering,Side airbag,Tire pressure monitoring system,Traction control,Xenon headlights'
 'ABS,Central door lock,Central door lock with remote control,Daytime running lights,Driver-side airbag,Electronic stability control,Head airbag,Immobilizer,Isofix,Passenger-side airbag,Power steering,Side airbag,Tire pressure monitoring system,Traction control,Xenon headlights'
 'ABS,Central door lock,Daytime running lights,Driver-side airbag,Electronic stability control,Immobilizer,Isofix,Passenger-side airbag,Power steering,Side airbag,Tire pressure monitoring system,Traction control'
 ...
 'ABS,Adaptive headl

In [147]:
get_diff_category_column(df["safety_&_security"])

{0: 'ABS',
 1: 'Adaptive Cruise Control',
 2: 'Adaptive headlights',
 3: 'Alarm system',
 4: 'Blind spot monitor',
 5: 'Central door lock',
 6: 'Central door lock with remote control',
 7: 'Daytime running lights',
 8: 'Driver drowsiness detection',
 9: 'Driver-side airbag',
 10: 'Electronic stability control',
 11: 'Emergency brake assistant',
 12: 'Emergency system',
 13: 'Fog lights',
 14: 'Head airbag',
 15: 'Immobilizer',
 16: 'Isofix',
 17: 'LED Daytime Running Lights',
 18: 'LED Headlights',
 20: 'Night view assist',
 21: 'Passenger-side airbag',
 22: 'Power steering',
 23: 'Rear airbag',
 24: 'Side airbag',
 25: 'Tire pressure monitoring system',
 26: 'Traction control',
 27: 'Traffic sign recognition',
 28: 'Xenon headlights'}

### emission_label 

In [148]:
first_look("emission_label")

column name :  emission_label
----------------------------------------
Per_of_Nulls   :  % 74.97
Number of Nulls  :  11934
Number of Uniques:  6
Type of columns:  object
----------------------------------------
Unique values of columns:  [nan '4 (Green)' '1 (No sticker)' '5 (Blue)' '[],[],[]' '3 (Yellow)'
 '2 (Red)']
----------------------------------------
1 (No sticker)      381
2 (Red)               1
3 (Yellow)            2
4 (Green)          3553
5 (Blue)              8
[],[],[]             40
NaN               11934
Name: emission_label, dtype: int64
----------------------------------------
NaN               11934
4 (Green)          3553
1 (No sticker)      381
[],[],[]             40
5 (Blue)              8
3 (Yellow)            2
2 (Red)               1
Name: emission_label, dtype: int64
################################################################################



In [149]:
# Fill missing emission_label values from any text column that contains a sticker label
label_pattern = r"4 \(Green\)|1 \(No sticker\)|5 \(Blue\)|3 \(Yellow\)|2 \(Red\)"
for i in df.select_dtypes(include="O").columns:
    mask = df["emission_label"].isnull() & df[i].str.contains(label_pattern, regex=True, na=False)
    if mask.any():
        df.loc[mask, "emission_label"] = df.loc[mask, i].str.extract(r"(\d \(.*\))")[0]
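
The same "search the other columns for a stray value" step is used for several features in this notebook. A minimal sketch of how it could be factored out is shown below; `fill_from_other_columns` is a hypothetical helper, and the toy frame only imitates the misplaced values seen in `last_service_date`.

```python
import pandas as pd

def fill_from_other_columns(frame, target, pattern):
    """Hypothetical helper: fill missing `target` values with the first match of
    `pattern` (one capture group) found in any other non-numeric column."""
    for col in frame.columns.drop(target):
        if pd.api.types.is_numeric_dtype(frame[col]):
            continue
        missing = frame[target].isna()
        found = (
            frame.loc[missing, col]
            .astype(str)
            .str.extract(pattern, expand=False)
            .dropna()
        )
        frame.loc[found.index, target] = found
    return frame

toy_labels = pd.DataFrame(
    {
        "emission_label": ["4 (Green)", None, None],
        "last_service_date": [None, "03/20195 (Blue)", "06/2017"],
        "km": [10.0, 20.0, 30.0],
    }
)

sticker = r"(\d \((?:Green|No sticker|Blue|Yellow|Red)\))"
toy_labels = fill_from_other_columns(toy_labels, "emission_label", sticker)
print(toy_labels["emission_label"].tolist())
```

Rows are only touched while their target is still missing, so a value found in an earlier column is never overwritten by a later one.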

In [150]:
# Keep only the sticker number; the '[],[],[]' placeholder becomes NaN
df["emission_label"] = df["emission_label"].str.extract(r"(\d)").astype(float)

In [151]:
first_look("emission_label")

column name :  emission_label
----------------------------------------
Per_of_Nulls   :  % 49.94
Number of Nulls  :  7950
Number of Uniques:  5
Type of columns:  float64
----------------------------------------
Unique values of columns:  [ 4. nan  1.  5.  3.  2.]
----------------------------------------
1.0     435
2.0       1
3.0       2
4.0    7488
5.0      43
NaN    7950
Name: emission_label, dtype: int64
----------------------------------------
NaN    7950
4.0    7488
1.0     435
5.0      43
3.0       2
2.0       1
Name: emission_label, dtype: int64
################################################################################



### gears

In [152]:
first_look("gears")

column name :  gears
----------------------------------------
Per_of_Nulls   :  % 29.6
Number of Nulls  :  4712
Number of Uniques:  10
Type of columns:  object
----------------------------------------
Unique values of columns:  [nan '7' '6' '5' '8' '1' '2' '50' '9' '3' '4']
----------------------------------------
1         2
2         1
3         2
4         2
5      3239
50        1
6      5822
7      1908
8       224
9         6
NaN    4712
Name: gears, dtype: int64
----------------------------------------
6      5822
NaN    4712
5      3239
7      1908
8       224
9         6
1         2
3         2
4         2
2         1
50        1
Name: gears, dtype: int64
################################################################################



In [153]:
df["gears"] = df["gears"].astype(float)

### country_version 

In [154]:
first_look("country_version")

column name :  country_version
----------------------------------------
Per_of_Nulls   :  % 52.35
Number of Nulls  :  8333
Number of Uniques:  23
Type of columns:  object
----------------------------------------
Unique values of columns:  [nan 'Germany' 'Italy' 'Belgium' 'Netherlands' 'Spain' 'European Union'
 'Switzerland' 'Austria' 'Luxembourg' 'France' 'Denmark' 'Poland'
 'Romania' 'Slovakia' 'Sweden' 'Czech Republic' 'Hungary' 'Slovenia'
 'Croatia' 'Egypt' 'Serbia' 'Bulgaria' 'Japan']
----------------------------------------
Austria            208
Belgium            314
Bulgaria             2
Croatia              4
Czech Republic      52
Denmark             33
Egypt                1
European Union     507
France              38
Germany           4502
Hungary             28
Italy             1038
Japan                8
Luxembourg           1
Netherlands        464
Poland              49
Romania              2
Serbia               1
Slovakia             4
Slovenia             1
Spain

### electricity_consumption 

In [155]:
first_look("electricity_consumption")

column name :  electricity_consumption
----------------------------------------
Per_of_Nulls   :  % 99.14
Number of Nulls  :  15782
Number of Uniques:  1
Type of columns:  object
----------------------------------------
Unique values of columns:  [nan '0 kWh/100 km (comb)']
----------------------------------------
0 kWh/100 km (comb)      137
NaN                    15782
Name: electricity_consumption, dtype: int64
----------------------------------------
NaN                    15782
0 kWh/100 km (comb)      137
Name: electricity_consumption, dtype: int64
################################################################################



In [156]:
df[df["electricity_consumption"] == '0 kWh/100 km (comb)']["fuel"].value_counts()

Benzine    105
Diesel      32
Name: fuel, dtype: int64

In [157]:
df.drop(columns="electricity_consumption", inplace = True)

### last_service_date 

In [158]:
first_look("last_service_date")

column name :  last_service_date
----------------------------------------
Per_of_Nulls   :  % 96.44
Number of Nulls  :  15353
Number of Uniques:  255
Type of columns:  object
----------------------------------------
Unique values of columns:  [nan '12/2018Euro 6' '02/2019Euro 6' '06/2018102 g CO2/km (comb)'
 '03/20195 (Blue)' '06/2017' '03/2019' '02/2019' '06/2019Euro 6'
 '01/2018114 g CO2/km (comb)' '03/2019Euro 6' '05/2019109 g CO2/km (comb)'
 '05/2019Euro 6' '05/2019101 g CO2/km (comb)' '02/20194 (Green)' '09/2018'
 '06/2017102 g CO2/km (comb)' '11/2018Euro 6' '06/201992 g CO2/km (comb)'
 '04/2019Euro 6' '11/2018' '01/201998 g CO2/km (comb)' '01/2019Euro 6'
 '03/2016Euro 6' '12/201899 g CO2/km (comb)' '07/201897 g CO2/km (comb)'
 '06/20194 (Green)' '06/2018Euro 6' '10/2018Euro 6'
 '10/201894 g CO2/km (comb)' '10/201897 g CO2/km (comb)'
 '03/201997 g CO2/km (comb)' '04/201990 g CO2/km (comb)' '09/2018Euro 6'
 '05/2018Euro 6' '07/2018102 g CO2/km (comb)' '09/201897 g CO2/km (comb)'
 '

In [159]:
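# Keep only the leading MM/YYYY part; some values have other fields appended (e.g. "12/2018Euro 6")
df["last_service_time"] = df["last_service_date"].str.extract(r"(\d{2}/\d{4})")
df["last_service_time"] = pd.to_datetime(df["last_service_time"], format="%m/%Y")
df["last_service_time"].value_counts()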

2019-05-01    61
2019-02-01    55
2019-01-01    51
2019-06-01    49
2019-04-01    48
2019-03-01    47
2018-12-01    32
2018-10-01    24
2018-06-01    21
2018-05-01    21
2018-07-01    21
2018-01-01    20
2018-09-01    17
2018-11-01    17
2018-04-01    16
2018-08-01    13
2018-03-01     9
2017-06-01     7
2018-02-01     5
2017-05-01     3
2017-12-01     3
2017-02-01     3
2017-10-01     3
2017-01-01     3
2017-11-01     2
2016-06-01     2
2016-04-01     2
2017-07-01     2
2019-11-01     1
2019-10-01     1
2019-07-01     1
2016-03-01     1
2019-09-01     1
2019-08-01     1
2017-09-01     1
2016-05-01     1
2017-04-01     1
Name: last_service_time, dtype: int64
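
The raw `last_service_date` strings often have another field glued onto the date. The sketch below, on values copied from the listing above, shows one way to isolate the `MM/YYYY` part and parse it with an explicit format; `raw_dates`, `month_year` and `parsed` are illustrative names.

```python
import pandas as pd

raw_dates = pd.Series(
    [
        "12/2018Euro 6",
        "06/2018102 g CO2/km (comb)",
        "03/20195 (Blue)",
        "06/2017",
        float("nan"),
    ]
)

# Two digits, a slash, four digits: the month and year, whatever follows them.
month_year = raw_dates.str.extract(r"(\d{2}/\d{4})", expand=False)

# An explicit format avoids any day/month guessing; the day defaults to the 1st.
parsed = pd.to_datetime(month_year, format="%m/%Y")

print(parsed.tolist())
```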

### other_fuel_types 

In [160]:
first_look("other_fuel_types")

column name :  other_fuel_types
----------------------------------------
Per_of_Nulls   :  % 94.47
Number of Nulls  :  15039
Number of Uniques:  1
Type of columns:  object
----------------------------------------
Unique values of columns:  [nan '[],[],[]']
----------------------------------------
[],[],[]      880
NaN         15039
Name: other_fuel_types, dtype: int64
----------------------------------------
NaN         15039
[],[],[]      880
Name: other_fuel_types, dtype: int64
################################################################################



In [161]:
df.drop(columns="other_fuel_types", inplace=True)

### availability 

In [162]:
first_look("availability")

column name :  availability
----------------------------------------
Per_of_Nulls   :  % 96.01
Number of Nulls  :  15284
Number of Uniques:  15
Type of columns:  object
----------------------------------------
Unique values of columns:  [nan 'in 90 days from ordering' 'in 180 days from ordering'
 'in 150 days from ordering' 'in 14 days from ordering'
 'in 7 days from ordering' 'in 1 day from ordering'
 'in 21 days from ordering' 'in 5 days from ordering'
 'in 60 days from ordering' 'in 120 days from ordering'
 'in 3 days from ordering' 'in 4 days from ordering'
 'in 42 days from ordering' 'in 2 days from ordering'
 'in 6 days from ordering']
----------------------------------------
in 1 day from ordering          51
in 120 days from ordering      182
in 14 days from ordering        24
in 150 days from ordering       18
in 180 days from ordering       24
in 2 days from ordering         16
in 21 days from ordering         8
in 3 days from ordering         35
in 4 days from ordering      

In [163]:
df.drop(columns="availability", inplace=True)

### last_timing_belt_service_date 

In [164]:
first_look("last_timing_belt_service_date")

column name :  last_timing_belt_service_date
----------------------------------------
Per_of_Nulls   :  % 99.9
Number of Nulls  :  15903
Number of Uniques:  15
Type of columns:  object
----------------------------------------
Unique values of columns:  [nan '12/1900' '07/2018' '01/1900' '05/2019' '09/2018' '05/2018Euro 6'
 '06/2017' '01/2019' '02/2019' '02/2018' '04/2016' '06/2019' '01/2018'
 '04/2019' '01/1970']
----------------------------------------
01/1900              1
01/1970              1
01/2018              1
01/2019              1
02/2018              1
02/2019              1
04/2016              2
04/2019              1
05/2018Euro 6        1
05/2019              1
06/2017              1
06/2019              1
07/2018              1
09/2018              1
12/1900              1
NaN              15903
Name: last_timing_belt_service_date, dtype: int64
----------------------------------------
NaN              15903
04/2016              2
12/1900              1
07/2018       

In [165]:
df.drop(columns="last_timing_belt_service_date", inplace=True)

### available_from 

In [166]:
first_look("available_from")

column name :  available_from
----------------------------------------
Per_of_Nulls   :  % 98.29
Number of Nulls  :  15647
Number of Uniques:  46
Type of columns:  object
----------------------------------------
Unique values of columns:  [nan '29/06/19' '27/06/19' '29/07/19' '31/10/19' '18/07/19' '06/12/19'
 '03/12/19' '05/12/19' '10/11/19' '01/07/19' '30/07/19' '10/12/19'
 '08/07/19' '05/08/19' '01/08/19' '26/06/19' '31/08/19' '01/09/19'
 '28/06/19' '30/06/19' '05/07/19' '11/08/19' '25/10/19' '15/07/19'
 '20/07/19' '16/07/19' '10/09/19' '10/10/19' '24/08/19' '17/08/19'
 '14/09/19' '24/09/19' '16/08/19' '15/08/19' '27/07/19' '18/08/19'
 '29/09/19' '24/07/19' '10/07/19' '02/07/19' '04/07/19' '19/08/19'
 '30/09/19' '19/07/19' '16/09/19' '03/08/19']
----------------------------------------
01/07/19       11
01/08/19        3
01/09/19        1
02/07/19        1
03/08/19        1
03/12/19        1
04/07/19        2
05/07/19        2
05/08/19        2
05/12/19        1
06/12/19        1
08/

In [167]:
df.drop(columns="available_from", inplace=True)
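
The four columns dropped above (`other_fuel_types`, `availability`, `last_timing_belt_service_date`, `available_from`) are each more than 94% null. As a rough sketch, such columns could also be located programmatically instead of one at a time. The helper name `high_null_columns`, the toy frame, and the threshold value are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "price": [15770, 14500, 14640, 14500],
    "availability": [np.nan, np.nan, np.nan, np.nan],
    "other_fuel_types": [np.nan, np.nan, np.nan, "[],[],[]"],
})

def high_null_columns(frame, threshold=90.0):
    """Return the columns whose share of nulls (in percent) exceeds `threshold`."""
    null_pct = frame.isnull().mean() * 100
    return null_pct[null_pct > threshold].index.tolist()

# The threshold is a hypothetical choice; a lower one is used here for the tiny sample.
sparse_cols = high_null_columns(toy, threshold=70.0)
print(sparse_cols)

# Drop the sparse columns without mutating the original frame.
cleaned = toy.drop(columns=sparse_cols)
```

Reviewing the returned list before dropping keeps the manual check that `first_look` provides, since a sparse column can still carry useful information.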

### consumption

In [168]:
first_look("consumption")

column name :  consumption
----------------------------------------
Per_of_Nulls   :  % 11.97
Number of Nulls  :  1906
Number of Uniques:  881
Type of columns:  object
----------------------------------------
Unique values of columns:  ["['3.8 l/100 km (comb)'],['4.3 l/100 km (city)'],['3.5 l/100 km (country)']"
 "['5.6 l/100 km (comb)'],['7.1 l/100 km (city)'],['4.7 l/100 km (country)']"
 "['3.8 l/100 km (comb)'],['4.4 l/100 km (city)'],['3.4 l/100 km (country)']"
 "['4.1 l/100 km (comb)'],['4.6 l/100 km (city)'],['3.8 l/100 km (country)']"
 "['3.5 l/100 km (comb)'],['4.3 l/100 km (city)'],['3.1 l/100 km (country)']"
 "['3.7 l/100 km (comb)'],['4.3 l/100 km (city)'],['3.4 l/100 km (country)']"
 "['3.7 l/100 km (comb)'],['4.2 l/100 km (city)'],['3.4 l/100 km (country)']"
 nan
 "['4 l/100 km (comb)'],['4.6 l/100 km (city)'],['3.6 l/100 km (country)']"
 "['4.9 l/100 km (comb)'],['6.2 l/100 km (city)'],['4.2 l/100 km (country)']"
 "['4.2 l/100 km (comb)'],['5 l/100 km (city)'],['3.7 l/100

In [169]:
# The consumption column has already been split into three columns

In [170]:
# For rows where consumption_comb is null, check whether any other text column still holds a "(comb)" value
for i in df.select_dtypes(include="O").columns:
    if df[df["consumption_comb"].isnull()][i].str.contains(r"100 km \(comb\)", regex=True, na=False).any():
        print(i)

In [171]:
# For rows where consumption_city is null, check whether any other text column still holds a "(city)" value
for i in df.select_dtypes(include="O").columns:
    if df[df["consumption_city"].isnull()][i].str.contains(r"100 km \(city\)", regex=True, na=False).any():
        print(i)

In [172]:
# For rows where consumption_country is null, check whether any other text column still holds a "(country)" value
for i in df.select_dtypes(include="O").columns:
    if df[df["consumption_country"].isnull()][i].str.contains(r"100 km \(country\)", regex=True, na=False).any():
        print(i)
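
This section does not show how the raw `consumption` strings were split into `consumption_comb`, `consumption_city`, and `consumption_country`. The following is only a minimal sketch of one way to do it with `Series.str.extract`, using two of the raw values printed above. The helper name `extract_consumption` is hypothetical, and the notebook's actual approach may differ.

```python
import pandas as pd

# Two raw values copied from the first_look output above, plus a missing entry.
raw = pd.Series([
    "['3.8 l/100 km (comb)'],['4.3 l/100 km (city)'],['3.5 l/100 km (country)']",
    "['4 l/100 km (comb)'],['4.6 l/100 km (city)'],['3.6 l/100 km (country)']",
    None,
])

def extract_consumption(series, label):
    """Pull the number that precedes 'l/100 km (<label>)' and convert it to float."""
    pattern = rf"(\d+(?:\.\d+)?) l/100 km \({label}\)"
    return series.str.extract(pattern, expand=False).astype(float)

# Build one numeric column per consumption type; missing rows stay NaN.
parts = pd.DataFrame({
    f"consumption_{label}": extract_consumption(raw, label)
    for label in ("comb", "city", "country")
})
print(parts)
```

Because the pattern is anchored on the label in parentheses, each value is matched by its type rather than by its position in the string.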

### price

In [173]:
first_look("price")

column name :  price
----------------------------------------
Per_of_Nulls   :  % 0.0
Number of Nulls  :  0
Number of Uniques:  2956
Type of columns:  int64
----------------------------------------
Unique values of columns:  [15770 14500 14640 ... 41390 39885 39875]
----------------------------------------
13       1
120      1
255      1
331      1
4950     1
        ..
64332    1
64900    1
67600    1
68320    1
74600    1
Name: price, Length: 2956, dtype: int64
----------------------------------------
14990    154
15990    151
10990    139
15900    106
17990    102
        ... 
17559      1
17560      1
17570      1
17575      1
39875      1
Name: price, Length: 2956, dtype: int64
################################################################################



In [174]:
df["price"] = df["price"].astype(float)
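
The sorted value counts above show a handful of very low prices (13, 120, 255, 331) next to the bulk of listings in the thousands. A minimal sketch for surfacing such rows for manual review follows. The cut-off `MIN_PLAUSIBLE_PRICE` is a hypothetical value chosen for illustration, not a rule taken from the notebook, and the sample series only reuses a few prices from the output above.

```python
import pandas as pd

# A small sample of prices taken from the first_look output above.
prices = pd.Series(
    [13, 120, 255, 331, 4950, 14990, 15990, 10990, 74600], name="price"
).astype(float)

# Hypothetical floor below which a listing price is treated as suspicious.
MIN_PLAUSIBLE_PRICE = 1000.0

# Look at the extreme low end first, then flag everything under the floor.
lowest = prices.nsmallest(4)
suspicious = prices[prices < MIN_PLAUSIBLE_PRICE]
print(suspicious)
```

Whether such rows are typos, placeholder prices, or genuine offers needs to be checked against the other columns before anything is removed.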

# END OF THE DATA CLEANING 

In [175]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| url | https://www.autoscout24.com//offers/audi-a1-sp... | https://www.autoscout24.com//offers/audi-a1-1-... | https://www.autoscout24.com//offers/audi-a1-sp... | https://www.autoscout24.com//offers/audi-a1-1-... | https://www.autoscout24.com//offers/audi-a1-sp... |
| make_model | Audi A1 | Audi A1 | Audi A1 | Audi A1 | Audi A1 |
| short_description | Sportback 1.4 TDI S-tronic Xenon Navi Klima | 1.8 TFSI sport | Sportback 1.6 TDI S tronic Einparkhilfe plus+m... | 1.4 TDi Design S tronic | Sportback 1.4 TDI S-Tronic S-Line Ext. admired... |
| body_type | Sedans | Sedans | Sedans | Sedans | Sedans |
| price | 15770.0 | 14500.0 | 14640.0 | 14500.0 | 16790.0 |
| vat | VAT deductible | Price negotiable | VAT deductible | | |
| km | 56013.0 | 80000.0 | 83450.0 | 73000.0 | 16200.0 |
| registration | 2016-01-01 00:00:00 | 2017-03-01 00:00:00 | 2016-02-01 00:00:00 | 2016-08-01 00:00:00 | 2016-05-01 00:00:00 |
| prev_owner | 2.0 | | 1.0 | 1.0 | 1.0 |
| type | Used | Used | Used | Used | Used |


In [176]:
df.drop(columns=["url","short_description","previous_owners","next_inspection","inspection_new","warranty","full_service"
                ,"non_smoking_vehicle","consumption","co_2_emission","last_service_date"], inplace=True)

In [177]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| make_model | Audi A1 | Audi A1 | Audi A1 | Audi A1 | Audi A1 |
| body_type | Sedans | Sedans | Sedans | Sedans | Sedans |
| price | 15770.0 | 14500.0 | 14640.0 | 14500.0 | 16790.0 |
| vat | VAT deductible | Price negotiable | VAT deductible | | |
| km | 56013.0 | 80000.0 | 83450.0 | 73000.0 | 16200.0 |
| registration | 2016-01-01 00:00:00 | 2017-03-01 00:00:00 | 2016-02-01 00:00:00 | 2016-08-01 00:00:00 | 2016-05-01 00:00:00 |
| prev_owner | 2.0 | | 1.0 | 1.0 | 1.0 |
| type | Used | Used | Used | Used | Used |
| body_color | Black | Red | Black | Brown | Black |
| paint_type | Metallic | | Metallic | Metallic | Metallic |


In [178]:
df.columns

Index(['make_model', 'body_type', 'price', 'vat', 'km', 'registration',
       'prev_owner', 'type', 'body_color', 'paint_type', 'nr_of_doors',
       'nr_of_seats', 'gearing_type', 'cylinders', 'weight', 'drive_chain',
       'fuel', 'emission_class', 'comfort_&_convenience',
       'entertainment_&_media', 'extras', 'safety_&_security',
       'emission_label', 'gears', 'country_version', 'consumption_comb',
       'consumption_city', 'consumption_country', 'hp_kw',
       'Particulate_Filter', 'inspection_time', 'inspection_situation',
       'warranty_month', 'co2_emission_gr', 'upholstery_type',
       'upholstery_color', 'engine_size', 'last_service_time'],
      dtype='object')

In [179]:
df.columns = ['Model', 'Body_Type', 'Price', 'Vat', 'Km', 'Registration_Date',
       'Prev_Owner', 'Type', 'Body_Color', 'Paint_Type', 'Door_Total',
       'Seat_Total', 'Gear_Type', 'Cylinders', 'Weight', 'Drive_Chain',
       'Fuel', 'Emission_Class', 'Comfort_Convenience',
       'Entertainment_Media', 'Extras', 'Safety_Security',
       'Emission_Label', 'Gears', 'Country', 'Consumption_Comb', 'Consumption_City', 'Consumption_Country', 'Hp',
       'Particulate_Filter', 'Inspection_Time', 'Inspection_Situation',
       'Warranty', 'Emission', 'Upholstery_Type',
       'Upholstery_Color', 'Engine_Size', 'LastServiceTime']
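
Assigning a full list to `df.columns` depends on the list being in exactly the same order as the existing columns. As an alternative sketch, the same renaming can be expressed as an explicit old-to-new mapping, which does not depend on column order. Only four of the pairs from the cell above are shown, and the toy frame is an assumption for illustration.

```python
import pandas as pd

# Empty toy frame carrying a few of the original column names.
toy = pd.DataFrame(columns=["make_model", "body_type", "price", "hp_kw"])

# Explicit old -> new mapping; these pairs mirror the positional list above.
rename_map = {
    "make_model": "Model",
    "body_type": "Body_Type",
    "price": "Price",
    "hp_kw": "Hp",
}

# errors="raise" makes pandas fail loudly if a key is not an existing column.
renamed = toy.rename(columns=rename_map, errors="raise")
print(list(renamed.columns))
```

With a mapping, a misspelled or already-dropped column raises a `KeyError` instead of silently shifting every later name by one position.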

In [180]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 38 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Model                 15919 non-null  object        
 1   Body_Type             15860 non-null  object        
 2   Price                 15919 non-null  float64       
 3   Vat                   11406 non-null  object        
 4   Km                    14895 non-null  float64       
 5   Registration_Date     14322 non-null  datetime64[ns]
 6   Prev_Owner            9279 non-null   float64       
 7   Type                  15917 non-null  object        
 8   Body_Color            15338 non-null  object        
 9   Paint_Type            10147 non-null  object        
 10  Door_Total            15707 non-null  float64       
 11  Seat_Total            14942 non-null  float64       
 12  Gear_Type             15919 non-null  object        
 13  Cylinders       

In [181]:
df.to_csv("AutoScout_Cleaned")
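
By default `to_csv` also writes the index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below illustrates this on a toy frame and shows the `index=False` option. It writes to an in-memory buffer instead of a file, and the toy data is an assumption for illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Model": ["Audi A1", "Audi A1"], "Price": [15770.0, 14500.0]})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False keeps only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded.columns))
```

If the saved file is meant to be read back in a later notebook, passing `index=False` (or `index_col=0` when reading) avoids carrying the extra column along.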