# AUTOSCOUT CAPSTONE PROJECT

<img src=https://i.ibb.co/wJW61Y2/Used-cars.jpg width="700" height="200">

## Introduction
Welcome to the "***AutoScout Data Analysis Project***", the capstone project of the ***Data Analysis*** module. The **AutoScout** data used in this project was scraped from an online car trading company in 2019 and contains many features of 9 different car models. In this project, you will apply many commonly used techniques for data cleaning and exploratory data analysis using Python libraries such as NumPy, pandas, Matplotlib, Seaborn, and SciPy, and you will then analyze the cleaned dataset.

**Some Reminders on Exploratory Data Analysis (EDA)**

Exploratory data analysis (EDA) is an especially important activity in the routine of a data analyst or scientist. It enables an in-depth understanding of the dataset, helps define or discard hypotheses, and lets you build predictive models on a solid basis. It uses data manipulation techniques and several statistical tools to describe and understand the relationships between variables and how these can affect the business. Through EDA, we can obtain meaningful insights by working through the following questions (if a checklist is good enough for pilots to use on every flight, it's good enough for data scientists to use with every dataset):
1. What question are you trying to solve (or prove wrong)?
2. What kind of data do you have?
3. What’s missing from the data?
4. Where are the outliers?
5. How can you add, change or remove features to get more out of your data?
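
As a rough illustration, questions 2 to 4 of the checklist can be turned into a few pandas calls. The sketch below is only one possible approach: the `toy` DataFrame and the `eda_checklist` helper are hypothetical names introduced here, the values are invented, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Toy data: column names mimic the project, but the values are invented.
toy = pd.DataFrame({
    "price": [15770.0, 14500.0, 14640.0, 14500.0, 16790.0, 99000.0],
    "km": [56013.0, 80000.0, np.nan, 73000.0, 16200.0, 61000.0],
    "body_color": ["Black", "Red", "Black", None, "Black", "Red"],
})

def eda_checklist(df: pd.DataFrame) -> dict:
    """Summarise data types, missing shares and IQR-based outlier counts."""
    numeric = df.select_dtypes(include="number")
    # Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return {
        "dtypes": df.dtypes.astype(str).to_dict(),                # question 2
        "missing_share": df.isna().mean().round(3).to_dict(),     # question 3
        "outlier_counts": outside.sum().to_dict(),                # question 4
    }

report = eda_checklist(toy)
print(report)
```

Questions 1 and 5 depend on the problem at hand, so they are deliberately left out of the helper.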

**``Exploratory data analysis (EDA)``** is often an **iterative brainstorming process** where you pose a question, review the data, and develop further questions to investigate before beginning model development work. The image below shows how the brainstorming phase is connected with that of understanding the variables and how this in turn is connected again with the brainstorming phase.<br>

<img src=https://i.ibb.co/k0MC950/EDA-Process.png width="300" height="100">

[Image Credit: Andrew D.](https://towardsdatascience.com/exploratory-data-analysis-in-python-a-step-by-step-process-d0dfa6bf94ee)

**``In this context, the project consists of 3 parts in general:``**
* **The first part** is related to 'Data Cleaning'. It deals with Incorrect Headers, Incorrect Format, Anomalies, and Dropping useless columns.
* **The second part** is related to 'Filling Data', in other words 'Imputation'. It deals with Missing Values. Categorical to numeric transformation is done as well.
* **The third part** is related to 'Handling Outliers of Data' via Visualization libraries. So, some insights will be extracted.

**``NOTE:``**  However, you are free to create your own style. You do NOT have to stick to the steps above. We, the DA & DV instructors, recommend you study each part separately to create a source notebook for each part title for your further studies. 
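
As a minimal sketch of what the first two parts might look like in code, the example below normalises headers, imputes missing values per car model, and converts one categorical column to numeric indicators. Everything in it is an assumption made for illustration: the `raw` DataFrame, its messy headers, and the `group_mode` helper are hypothetical, and filling by group median or group mode is only one of several reasonable imputation strategies.

```python
import numpy as np
import pandas as pd

# Invented rows with deliberately messy headers; not taken from the real file.
raw = pd.DataFrame({
    "Make Model": ["Audi A1", "Audi A1", "Audi A1",
                   "Opel Corsa", "Opel Corsa", "Opel Corsa"],
    "Prev. Owner": [2.0, np.nan, 2.0, 1.0, np.nan, 1.0],
    "Body Color": ["Black", None, "Black", "Red", "Red", None],
    "KM": [56013.0, 80000.0, np.nan, 73000.0, 16200.0, np.nan],
})

# Part 1 (data cleaning): normalise headers to snake_case.
df = raw.copy()
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(r"[^0-9a-z]+", "_", regex=True)
    .str.strip("_")
)

def group_mode(s: pd.Series):
    """Most frequent non-null value of a group, or NaN if none exists."""
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else np.nan

# Part 2 (imputation): fill a numeric column with the median of its group.
df["km"] = df["km"].fillna(df.groupby("make_model")["km"].transform("median"))

# Fill discrete or categorical columns with the most frequent group value.
for col in ["prev_owner", "body_color"]:
    df[col] = df[col].fillna(df.groupby("make_model")[col].transform(group_mode))

# Categorical-to-numeric transformation via one-hot (dummy) columns.
encoded = pd.get_dummies(df, columns=["body_color"], dtype=int)
print(encoded)
```

Grouping by model before imputing keeps the filled values plausible for each car, which a single global median or mode would not guarantee.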

## Filling Data (Imputation)

In [12]:
#Import python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import datetime
import plotly
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff


from pandas.plotting import register_matplotlib_converters
from pylab import rcParams
#from skimpy import clean_columns

import warnings
warnings.filterwarnings("ignore")

plt.rcParams["figure.figsize"] = (12, 8)
pd.set_option('display.max_columns', None)
sns.set_theme(font_scale=1.2, style="darkgrid")
#pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [13]:
# Read the cleaned dataset from a CSV file
df_origin = pd.read_csv("AutoScout_Cleaned", index_col=[0])
df = df_origin.copy()
df.head().T

|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| body_type | Sedans | Sedans | Sedans | Sedans | Sedans |
| price | 15770.0 | 14500.0 | 14640.0 | 14500.0 | 16790.0 |
| vat | VAT deductible | Price negotiable | VAT deductible | NaN | NaN |
| km | 56013.0 | 80000.0 | 83450.0 | 73000.0 | 16200.0 |
| registration | 2016-01-01 | 2017-03-01 | 2016-02-01 | 2016-08-01 | 2016-05-01 |
| prev_owner | 2.0 | NaN | 1.0 | 1.0 | 1.0 |
| type | Used | Used | Used | Used | Used |
| make | Audi | Audi | Audi | Audi | Audi |
| model | A1 | A1 | A1 | A1 | A1 |
| body_color | Black | Red | Black | Brown | Black |


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15919 entries, 0 to 15918
Data columns (total 40 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   body_type             15860 non-null  object 
 1   price                 15919 non-null  float64
 2   vat                   11406 non-null  object 
 3   km                    14895 non-null  float64
 4   registration          14322 non-null  object 
 5   prev_owner            9279 non-null   float64
 6   type                  15917 non-null  object 
 7   make                  15919 non-null  object 
 8   model                 15919 non-null  object 
 9   body_color            15338 non-null  object 
 10  paint_type            10147 non-null  object 
 11  nr_of_doors           15707 non-null  float64
 12  nr_of_seats           14942 non-null  float64
 13  gearing_type          15919 non-null  object 
 14  cylinders             10239 non-null  float64
 15  weight             