___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

#  The Project of Data Analytics 

# Car Price Prediction EDA

## Introduction
Welcome to the "***Car Price Prediction EDA Project***". The **Auto Scout** data used in this project was scraped from the online car trading company in 2019 and contains many features of 9 different car models. In this project, you will have the opportunity to apply many commonly used techniques for Data Cleaning and Exploratory Data Analysis using Python libraries such as NumPy, Pandas, Matplotlib, Seaborn, and SciPy.

The project consists of 4 parts:
* A quick look at the data.
* Data cleaning: fixing incorrect headers and incorrect formats, handling anomalies, and dropping useless columns.
* Filling data: handling missing values and converting categorical features to numeric ones.
* Handling outliers with visualisation libraries and extracting some insights.

# PART- 1 `( Take a Quick Look )`

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")

%matplotlib inline
# %matplotlib notebook

plt.rcParams["figure.figsize"] = (10,6)
# plt.rcParams['figure.dpi'] = 100

sns.set_style("whitegrid")
pd.set_option('display.float_format', lambda x: '%.3f' % x)

pd.options.display.max_rows = 100
pd.options.display.max_columns = 100


## 1. Take a quick look at DataFrame

In [2]:
# Read DataFrame.
df = pd.read_json('scout_car.json', lines = True)
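The `lines = True` argument matters here: it tells pandas to treat the file as JSON Lines, with one JSON object per line, rather than a single JSON document. A minimal sketch on two made-up records (the values and the `toy` name are illustrative, not taken from `scout_car.json`):

```python
import io
import pandas as pd

# Two hypothetical records, one JSON object per line (JSON Lines format).
raw = (
    '{"make_model": "Audi A1", "price": 15770}\n'
    '{"make_model": "Audi A1", "price": 14500}\n'
)

# lines=True makes pandas parse each line as a separate record (row).
toy = pd.read_json(io.StringIO(raw), lines=True)
print(toy.shape)
```

Each line becomes one row, and the keys become the columns.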


In [3]:
df.head(4).T

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| url | https://www.autoscout24.com//offers/audi-a1-sp... | https://www.autoscout24.com//offers/audi-a1-1-... | https://www.autoscout24.com//offers/audi-a1-sp... | https://www.autoscout24.com//offers/audi-a1-1-... |
| make_model | Audi A1 | Audi A1 | Audi A1 | Audi A1 |
| short_description | Sportback 1.4 TDI S-tronic Xenon Navi Klima | 1.8 TFSI sport | Sportback 1.6 TDI S tronic Einparkhilfe plus+m... | 1.4 TDi Design S tronic |
| body_type | Sedans | Sedans | Sedans | Sedans |
| price | 15770 | 14500 | 14640 | 14500 |
| vat | VAT deductible | Price negotiable | VAT deductible | |
| km | 56,013 km | 80,000 km | 83,450 km | 73,000 km |
| registration | 01/2016 | 03/2017 | 02/2016 | 08/2016 |
| prev_owner | 2 previous owners | | 1 previous owner | 1 previous owner |
| kW | | | | |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 54 columns):
url                              15919 non-null object
make_model                       15919 non-null object
short_description                15873 non-null object
body_type                        15859 non-null object
price                            15919 non-null int64
vat                              11406 non-null object
km                               15919 non-null object
registration                     15919 non-null object
prev_owner                       9091 non-null object
kW                               0 non-null float64
hp                               15919 non-null object
Type                             15917 non-null object
Previous Owners                  9279 non-null object
Next Inspection                  3535 non-null object
Inspection new                   3932 non-null object
Warranty                         10499 non-null object
Full Service       

In [5]:
df.columns

Index(['url', 'make_model', 'short_description', 'body_type', 'price', 'vat',
       'km', 'registration', 'prev_owner', 'kW', 'hp', 'Type',
       'Previous Owners', 'Next Inspection', 'Inspection new', 'Warranty',
       'Full Service', 'Non-smoking Vehicle', 'null', 'Make', 'Model',
       'Offer Number', 'First Registration', 'Body Color', 'Paint Type',
       'Body Color Original', 'Upholstery', 'Body', 'Nr. of Doors',
       'Nr. of Seats', 'Model Code', 'Gearing Type', 'Displacement',
       'Cylinders', 'Weight', 'Drive chain', 'Fuel', 'Consumption',
       'CO2 Emission', 'Emission Class', '\nComfort & Convenience\n',
       '\nEntertainment & Media\n', '\nExtras\n', '\nSafety & Security\n',
       'description', 'Emission Label', 'Gears', 'Country version',
       'Electricity consumption', 'Last Service Date', 'Other Fuel Types',
       'Availability', 'Last Timing Belt Service Date', 'Available from'],
      dtype='object')

In [6]:
df["Comfort_Convenience"] = df["\nComfort & Convenience\n"]
df["Entertainment_Media"] = df["\nEntertainment & Media\n"]
df["Extras"] = df["\nExtras\n"]
df["Safety_Security"] = df["\nSafety & Security\n"]


In [7]:
drop_columns = ["\nComfort & Convenience\n","\nEntertainment & Media\n","\nExtras\n","\nSafety & Security\n"]
df.drop(drop_columns, axis = 1, inplace = True)
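The copy-then-drop steps above could also be expressed as a single rename. The sketch below is an alternative, not the notebook's method, and uses a toy frame with made-up values; only the header strings come from `df.columns`. One difference to be aware of: `rename` keeps each column in its original position, while copy-then-drop moves the cleaned columns to the end.

```python
import pandas as pd

# Toy frame with two of the newline-wrapped headers (values are hypothetical).
toy = pd.DataFrame({
    "price": [15770],
    "\nComfort & Convenience\n": [["Air conditioning"]],
    "\nExtras\n": [["Alloy wheels"]],
})

# Mapping from the raw header to the clean name used in the notebook.
new_names = {
    "\nComfort & Convenience\n": "Comfort_Convenience",
    "\nExtras\n": "Extras",
}

# rename returns a new DataFrame; the original toy frame is left untouched.
renamed = toy.rename(columns=new_names)
print(list(renamed.columns))
```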


**Dropping columns in which 90% or more of the values are missing.**

In [8]:
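def show_nans(df, limit):
    # Return the percentage of missing values per column,
    # keeping only the columns at or above `limit` percent.
    missing = df.isnull().sum() * 100 / df.shape[0]
    return missing.loc[lambda x: x >= limit]

def perc_nans(serial):
    # Return the percentage of missing values in a Series.
    return serial.isnull().sum() / serial.shape[0] * 100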
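To see what `show_nans` returns, here is a small sketch on a toy frame. The function is repeated so the block runs on its own; the toy data and the 50% limit are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def show_nans(df, limit):
    # Percentage of missing values per column, filtered to >= limit.
    missing = df.isnull().sum() * 100 / df.shape[0]
    return missing.loc[lambda x: x >= limit]

# Toy data: "price" has no gaps, "vat" is half missing, "kW" is fully missing.
toy = pd.DataFrame({
    "price": [1, 2, 3, 4],
    "vat": ["a", None, None, "b"],
    "kW": [np.nan] * 4,
})

result = show_nans(toy, 50)
print(result)
```

Only `vat` (50%) and `kW` (100%) pass the filter; `price` is excluded because none of its values are missing.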


In [9]:
show_nans(df,90)


kW                              100.000
Electricity consumption          99.139
Last Service Date                96.445
Other Fuel Types                 94.472
Availability                     96.011
Last Timing Belt Service Date    99.899
Available from                   98.291
dtype: float64

In [10]:
drop_columns = show_nans(df,90).index
drop_columns


Index(['kW', 'Electricity consumption', 'Last Service Date',
       'Other Fuel Types', 'Availability', 'Last Timing Belt Service Date',
       'Available from'],
      dtype='object')

In [11]:
df.drop(drop_columns, axis = 1, inplace = True)
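For comparison, pandas can apply a similar rule directly through `DataFrame.dropna` with `thresh`, which is the minimum number of non-null values a column needs in order to be kept. This is a sketch of an alternative, not what the notebook does; the toy frame and the `min_non_null` name are illustrative. A column is at least 90% missing exactly when its non-null count is at most one tenth of the rows, so the threshold is `n_rows // 10 + 1`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "full": range(10),
    "mostly_empty": [1.0] + [np.nan] * 9,  # 90% missing, should be dropped
    "sparse": [1.0, 2.0] + [np.nan] * 8,   # 80% missing, should be kept
})

n_rows = len(toy)
# Keep a column only if strictly more than 10% of its values are present.
min_non_null = n_rows // 10 + 1

trimmed = toy.dropna(axis=1, thresh=min_non_null)
print(list(trimmed.columns))
```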


In [12]:
df.drop("null", axis = 1, inplace = True)


In [13]:
df.shape

(15919, 46)
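The 46 columns can be checked against the steps so far. The counts below are read off the outputs shown in this notebook; the variable names are only for illustration.

```python
# Column bookkeeping for the cleaning steps, using counts from the outputs shown.
original_columns = 54     # reported by df.info() before any changes
clean_copies_added = 4    # Comfort_Convenience, Entertainment_Media, Extras, Safety_Security
raw_headers_dropped = 4   # the newline-wrapped originals of those four columns
high_nan_dropped = 7      # columns returned by show_nans(df, 90)
null_dropped = 1          # the column literally named "null"

remaining = (original_columns + clean_copies_added - raw_headers_dropped
             - high_nan_dropped - null_dropped)
print(remaining)  # 46
```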

In [14]:
df.to_csv("autoScout1.csv", index=False) 
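One thing to keep in mind when saving to CSV: every cell is written as text, so a column holding Python lists comes back as plain strings when the file is read again. Whether the equipment columns here actually hold lists is an assumption; the sketch below uses a toy frame and an in-memory buffer instead of `autoScout1.csv`.

```python
import ast
import io
import pandas as pd

# Toy frame with a list-valued column (values are hypothetical).
toy = pd.DataFrame({"price": [15770], "Extras": [["Alloy wheels", "Sport seats"]]})

# Round trip through CSV using an in-memory buffer.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(type(toy.loc[0, "Extras"]))       # list before saving
print(type(restored.loc[0, "Extras"]))  # str after reading back

# ast.literal_eval can turn the string representation back into a list.
recovered = ast.literal_eval(restored.loc[0, "Extras"])
```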

### mehmetfatih