# Capstone Project: EDA and Initial Report
**Overview**: In this project, our goal is to build a model that accurately predicts the sale price of a car. To that end, we selected a dataset that contains a significant amount of information about each car being sold. We will compare four models: LinearRegression, DecisionTree, RandomForest, and XGBoost.

**Data:**
Our dataset can be found [here](https://www.kaggle.com/datasets/imgowthamg/car-price/data) and contains data on categories such as wheelbase, weight, horsepower, and much more.

### Importing and Reading Data

In [2]:
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as XGBoost

In [3]:
train = pd.read_csv('train-data.csv')
test = pd.read_csv('test-data.csv')

In [4]:
train.head()

| | Unnamed: 0 | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.6 km/kg | 998 CC | 58.16 bhp | 5.0 | NaN | 1.75 |
| 1 | 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 kmpl | 1582 CC | 126.2 bhp | 5.0 | NaN | 12.5 |
| 2 | 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.2 kmpl | 1199 CC | 88.7 bhp | 5.0 | 8.61 Lakh | 4.5 |
| 3 | 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 kmpl | 1248 CC | 88.76 bhp | 7.0 | NaN | 6.0 |
| 4 | 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.2 kmpl | 1968 CC | 140.8 bhp | 5.0 | NaN | 17.74 |


In [10]:
test.head()

| | Unnamed: 0 | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Maruti Alto K10 LXI CNG | Delhi | 2014 | 40929 | CNG | Manual | First | 32.26 km/kg | 998 CC | 58.2 bhp | 4.0 | NaN |
| 1 | 1 | Maruti Alto 800 2016-2019 LXI | Coimbatore | 2013 | 54493 | Petrol | Manual | Second | 24.7 kmpl | 796 CC | 47.3 bhp | 5.0 | NaN |
| 2 | 2 | Toyota Innova Crysta Touring Sport 2.4 MT | Mumbai | 2017 | 34000 | Diesel | Manual | First | 13.68 kmpl | 2393 CC | 147.8 bhp | 7.0 | 25.27 Lakh |
| 3 | 3 | Toyota Etios Liva GD | Hyderabad | 2012 | 139000 | Diesel | Manual | First | 23.59 kmpl | 1364 CC | null bhp | 5.0 | NaN |
| 4 | 4 | Hyundai i20 Magna | Mumbai | 2014 | 29000 | Petrol | Manual | First | 18.5 kmpl | 1197 CC | 82.85 bhp | 5.0 | NaN |


### Data Cleaning
This section contains our checks for missing values and duplicate rows.

In [5]:
train.isnull().sum()

| Column | Null count |
|---|---|
| Unnamed: 0 | 0 |
| Name | 0 |
| Location | 0 |
| Year | 0 |
| Kilometers_Driven | 0 |
| Fuel_Type | 0 |
| Transmission | 0 |
| Owner_Type | 0 |
| Mileage | 2 |
| Engine | 36 |


In [8]:
test.isnull().sum()

| Column | Null count |
|---|---|
| Unnamed: 0 | 0 |
| Name | 0 |
| Location | 0 |
| Year | 0 |
| Kilometers_Driven | 0 |
| Fuel_Type | 0 |
| Transmission | 0 |
| Owner_Type | 0 |
| Mileage | 0 |
| Engine | 10 |


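Among the columns shown, a few contain null values: `Mileage` (2) and `Engine` (36) in the training set, and `Engine` (10) in the test set. We will have to decide whether to drop or impute the affected rows.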
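One way to inform that decision is to look at the share of missing values per column, and to check for values that are missing in disguise, such as the `null bhp` entry visible in `test.head()`. The sketch below runs on a small made-up frame rather than the real data. The helper name `to_numeric_with_units` and the use of median imputation are assumptions for illustration, not the approach this project has settled on.

```python
import pandas as pd

# Toy stand-in for a few rows of the car data (values are made up).
toy = pd.DataFrame({
    "Engine": ["998 CC", None, "1582 CC", "1199 CC"],
    "Power": ["58.16 bhp", "null bhp", "126.2 bhp", "88.7 bhp"],
})

def to_numeric_with_units(series: pd.Series) -> pd.Series:
    """Keep the leading token of strings like '998 CC' and convert it to a float.

    Tokens that are not numbers (for example 'null') become NaN.
    """
    leading = series.str.split().str[0]
    return pd.to_numeric(leading, errors="coerce")

cleaned = toy.apply(to_numeric_with_units)

# Share of missing values per column, after exposing disguised nulls.
missing_share = cleaned.isnull().mean()

# Option A: drop every row that has at least one missing value.
dropped = cleaned.dropna()

# Option B: fill each column with its own median.
imputed = cleaned.fillna(cleaned.median())

print(missing_share)
```

Dropping is simplest when only a handful of rows are affected; imputation keeps those rows at the cost of introducing estimated values.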

In [6]:
train.duplicated().sum()

0

In [11]:
test.duplicated().sum()

0

Neither the training set nor the test set contains duplicate rows.
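Note that `duplicated()` compares whole rows, so an identifier-like column such as `Unnamed: 0` makes every row unique by construction. A hedged sketch of a stricter check, again on a made-up frame, is to leave that column out of the comparison. Whether the real data contains such repeated listings is not established here.

```python
import pandas as pd

# Made-up rows: the last two cars are identical except for the id column.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Name": ["Honda Jazz V", "Maruti Ertiga VDI", "Maruti Ertiga VDI"],
    "Year": [2011, 2012, 2012],
})

# Whole-row comparison: the unique id hides the repeated listing.
full_row_dupes = toy.duplicated().sum()

# Compare on every column except the id.
feature_cols = [c for c in toy.columns if c != "Unnamed: 0"]
feature_dupes = toy.duplicated(subset=feature_cols).sum()

print(full_row_dupes, feature_dupes)
```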

In [7]:
train.describe()

| | Unnamed: 0 | Year | Kilometers_Driven | Seats | Price |
|---|---|---|---|---|---|
| count | 6019.0 | 6019.0 | 6019.0 | 5977.0 | 6019.0 |
| mean | 3009.0 | 2013.358199 | 58738.38 | 5.278735 | 9.479468 |
| std | 1737.679967 | 3.269742 | 91268.84 | 0.80884 | 11.187917 |
| min | 0.0 | 1998.0 | 171.0 | 0.0 | 0.44 |
| 25% | 1504.5 | 2011.0 | 34000.0 | 5.0 | 3.5 |
| 50% | 3009.0 | 2014.0 | 53000.0 | 5.0 | 5.64 |
| 75% | 4513.5 | 2016.0 | 73000.0 | 5.0 | 9.95 |
| max | 6018.0 | 2019.0 | 6500000.0 | 10.0 | 160.0 |


In [12]:
test.describe()

| | Unnamed: 0 | Year | Kilometers_Driven | Seats |
|---|---|---|---|---|
| count | 1234.0 | 1234.0 | 1234.0 | 1223.0 |
| mean | 616.5 | 2013.400324 | 58507.288493 | 5.284546 |
| std | 356.369424 | 3.1797 | 35598.702098 | 0.825622 |
| min | 0.0 | 1996.0 | 1000.0 | 2.0 |
| 25% | 308.25 | 2011.0 | 34000.0 | 5.0 |
| 50% | 616.5 | 2014.0 | 54572.5 | 5.0 |
| 75% | 924.75 | 2016.0 | 75000.0 | 5.0 |
| max | 1233.0 | 2019.0 | 350000.0 | 10.0 |


From the row counts, about 83% of the data is in the training set and about 17% is in the test set. Both sets are also missing a small number of values; for example, the `Seats` count is lower than the row count in each. Next, we should decide which columns we need and drop the ones we don't.
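These figures follow directly from the `count` rows of the two `describe()` outputs. As a quick check, using only the numbers printed above:

```python
# Row counts and non-null Seats counts, copied from the describe() outputs.
n_train, n_test = 6019, 1234
seats_train, seats_test = 5977, 1223

# Share of all rows that falls in each set.
train_share = n_train / (n_train + n_test)
test_share = n_test / (n_train + n_test)

# Rows where Seats is missing: total rows minus non-null Seats values.
missing_seats_train = n_train - seats_train
missing_seats_test = n_test - seats_test

print(f"train: {train_share:.1%}, test: {test_share:.1%}")
print(missing_seats_train, missing_seats_test)  # 42 11
```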

### Unnamed: 0
This column appears to be a row identifier (it runs from 0 to 6018 in the training set and from 0 to 1233 in the test set) and will not be useful for analysis, so we will drop it.

In [13]:
print(train.shape, test.shape)
train.drop('Unnamed: 0', axis=1, inplace=True)
test.drop('Unnamed: 0', axis=1, inplace=True)
print(train.shape, test.shape)

(6019, 14) (1234, 13)
(6019, 13) (1234, 12)
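Before dropping, it can be worth confirming that the column really is just a copy of the row index, and writing the drop so that the cell can be re-run without raising a `KeyError`. A minimal sketch on a made-up frame follows; the helper name `drop_index_copy` is hypothetical and not part of the notebook.

```python
import pandas as pd

def drop_index_copy(df: pd.DataFrame, col: str = "Unnamed: 0") -> pd.DataFrame:
    """Return df without `col` if `col` merely repeats the row index.

    The frame is returned unchanged when the column is absent or when it
    carries information beyond the index.
    """
    if col in df.columns and (df[col].to_numpy() == df.index.to_numpy()).all():
        return df.drop(columns=col)
    return df

toy = pd.DataFrame({"Unnamed: 0": [0, 1, 2], "Year": [2010, 2015, 2011]})
toy = drop_index_copy(toy)
toy = drop_index_copy(toy)  # second call is a no-op instead of an error
print(toy.shape)  # (3, 1)
```

Returning a new frame instead of using `inplace=True` also makes it easier to keep the original data around for comparison.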
