
Problem Statement: 

Context:

There is a huge demand for used cars in the Indian market today. As sales of new cars have slowed down in the recent past, the pre-owned car market has continued to grow and is now larger than the new car market. Cars4U is a budding tech start-up that aims to find a foothold in this market.

In 2018-19, while new car sales were recorded at 3.6 million units, around 4 million second-hand cars were bought and sold. The slowdown in new car sales could mean that demand is shifting towards the pre-owned market. In fact, some car sellers replace their old cars with pre-owned cars instead of buying new ones. Unlike new cars, where price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturers), except for dealership-level discounts that come into play only in the last stage of the customer journey, used cars are very different beasts with huge uncertainty in both pricing and supply. Keeping this in mind, the pricing scheme for used cars becomes important in order to grow in this market.

As a senior data scientist at Cars4U, you have to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.

Objective

To explore and visualize the dataset, build a linear regression model to predict the prices of used cars, and generate a set of insights and recommendations that will help the business.
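To make the modelling part of this objective concrete, here is a minimal sketch of the fit-and-evaluate loop with scikit-learn. It runs on synthetic data, not on the Cars4U dataset: the two features (`age`, `new_price`), the coefficients, and the noise level are all made-up assumptions used only to illustrate the workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for two predictors (assumed, not taken from the real data)
n = 200
age = rng.uniform(0, 15, n)        # car age in years
new_price = rng.uniform(4, 60, n)  # price of a new car, INR Lakhs

# Made-up linear relationship plus noise, so a linear model can recover it
price = 0.6 * new_price - 0.8 * age + rng.normal(0, 1.0, n)

X = np.column_stack([age, new_price])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.3, random_state=1
)

# Fit on the training split only, then score on the held-out split
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("Coefficients (age, new_price):", model.coef_)
print("R^2 on test split:", r2_score(y_test, pred))
print("MAE on test split:", mean_absolute_error(y_test, pred))
```

On the real data the same three steps apply (split, fit, score), but only after the text columns have been converted to numbers and missing values have been handled.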

Data Description

The data contains the different attributes of used cars sold in different locations. The detailed data dictionary is given below.

Data Dictionary

- `S.No.`: Serial number
- `Name`: Name of the car, which includes the brand name and model name
- `Location`: Location in which the car is being sold or is available for purchase (cities)
- `Year`: Manufacturing year of the car
- `Kilometers_Driven`: The total kilometers driven in the car by the previous owner(s), in km
- `Fuel_Type`: The type of fuel used by the car (Petrol, Diesel, Electric, CNG, LPG)
- `Transmission`: The type of transmission used by the car (Automatic/Manual)
- `Owner_Type`: Type of ownership
- `Mileage`: The standard mileage offered by the car company, in kmpl or km/kg
- `Engine`: The displacement volume of the engine, in CC
- `Power`: The maximum power of the engine, in bhp
- `Seats`: The number of seats in the car
- `New_Price`: The price of a new car of the same model, in INR Lakhs (1 Lakh = 100,000 INR)
- `Price`: The price of the used car, in INR Lakhs

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/usedcarsdata/usedcarsdata.csv


In [2]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to split the data into train and test
from sklearn.model_selection import train_test_split

# to build linear regression_model
from sklearn.linear_model import LinearRegression

# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# to suppress warnings
import warnings

warnings.filterwarnings("ignore")

In [3]:
df = pd.read_csv("/kaggle/input/usedcarsdata/usedcarsdata.csv")

In [4]:
df.head()

Unnamed: 0,S.No.,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
0,0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,5.51,1.75
1,1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,16.06,12.5
2,2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,8.61,4.5
3,3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,11.27,6.0
4,4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,53.14,17.74


In [5]:
df.shape

(7253, 14)
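The 14 columns reported here match the 14 entries in the data dictionary. Column names are case-sensitive, so a spelling variant such as `Kilometers_driven` would not match `Kilometers_Driven`. A quick way to check that the names line up is sketched below; `compare_columns` is a hypothetical helper, and the demo frame is a stand-in for `df`, not the real data.

```python
import pandas as pd

# Column names as they appear in the df.head() output
expected = [
    "S.No.", "Name", "Location", "Year", "Kilometers_Driven", "Fuel_Type",
    "Transmission", "Owner_Type", "Mileage", "Engine", "Power", "Seats",
    "New_Price", "Price",
]


def compare_columns(frame, expected_names):
    """Return (missing, extra) column names relative to expected_names."""
    actual = set(frame.columns)
    wanted = set(expected_names)
    return sorted(wanted - actual), sorted(actual - wanted)


# Stand-in frame with one deliberately misspelled column name
demo_columns = [c if c != "Kilometers_Driven" else "Kilometers_driven" for c in expected]
demo = pd.DataFrame(columns=demo_columns)

missing, extra = compare_columns(demo, expected)
print("Missing:", missing)
print("Unexpected:", extra)
```

In the notebook itself, calling `compare_columns(df, expected)` would return two empty lists if every name matches.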

In [6]:
df.sample(10)

Unnamed: 0,S.No.,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
6046,6046,Honda Jazz 1.5 V i DTEC,Pune,2015,52000,Diesel,Manual,First,27.3 kmpl,1498 CC,98.6 bhp,5.0,9.6,
3107,3107,Maruti Omni 8 Seater BSIV,Hyderabad,2017,21260,Petrol,Manual,First,14.0 kmpl,796 CC,35 bhp,5.0,7.88,2.9
1543,1543,Hyundai Verna 1.6 CRDI,Mumbai,2011,74000,Diesel,Manual,Second,22.32 kmpl,1582 CC,126.3 bhp,5.0,14.255,3.75
4781,4781,Maruti Alto LXI,Hyderabad,2007,52195,Petrol,Manual,First,19.7 kmpl,796 CC,46.3 bhp,5.0,4.36,1.75
3781,3781,Toyota Etios Liva GD,Coimbatore,2015,49894,Diesel,Manual,First,23.59 kmpl,1364 CC,67.06 bhp,5.0,8.525,5.27
851,851,Hyundai Verna 1.6 CRDi EX AT,Delhi,2013,66000,Diesel,Automatic,First,22.32 kmpl,1582 CC,126.3 bhp,5.0,14.255,5.7
1924,1924,BMW 5 Series 2013-2017 530d M Sport,Coimbatore,2017,27313,Diesel,Automatic,First,14.69 kmpl,2993 CC,258 bhp,5.0,67.87,48.63
7145,7145,Toyota Etios Liva G,Kolkata,2012,37212,Petrol,Manual,First,18.3 kmpl,1197 CC,,5.0,8.525,
6110,6110,Mercedes-Benz B Class B180 Sports,Mumbai,2015,25700,Petrol,Automatic,First,11.9 kmpl,1595 CC,120.7 bhp,5.0,37.03,
578,578,Volkswagen Polo 1.5 TDI Highline,Coimbatore,2018,33558,Diesel,Manual,First,20.14 kmpl,1498 CC,88 bhp,5.0,10.32,8.12


**Observations**

- `S.No.` is just an index for the data entry and will add no value to our analysis. So, we will drop it.

- `Name` contains a lot of model information. Let us check how many unique names we have. If there are too many, we can process this column to extract important information.

- `Mileage`, `Engine`, and `Power` columns will also need some processing before we are able to explore them. We'll have to extract numerical information from these columns.
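
The three observations above can be sketched as follows. This is a rough illustration on a toy frame built from the first rows of the `df.head()` output, not the notebook's actual cleaning code. The `_num` column suffix and the `Brand` column are hypothetical names, and taking the first word of `Name` as the brand is an assumption that breaks for multi-word brands.

```python
import pandas as pd

# Toy rows copied from the df.head() output; the notebook would use df instead
toy = pd.DataFrame(
    {
        "S.No.": [0, 1, 2],
        "Name": [
            "Maruti Wagon R LXI CNG",
            "Hyundai Creta 1.6 CRDi SX Option",
            "Honda Jazz V",
        ],
        "Mileage": ["26.6 km/kg", "19.67 kmpl", "18.2 kmpl"],
        "Engine": ["998 CC", "1582 CC", "1199 CC"],
        "Power": ["58.16 bhp", "126.2 bhp", "88.7 bhp"],
    }
)

# 1. Drop the serial-number column, which only duplicates the index
toy = toy.drop(columns=["S.No."])

# 2. Count distinct names, and pull out the first word as a candidate brand
n_names = toy["Name"].nunique()
toy["Brand"] = toy["Name"].str.split().str[0]

# 3. Extract the first number from each unit-bearing string column.
#    Entries with no number become NaN through errors="coerce".
for col in ["Mileage", "Engine", "Power"]:
    number = toy[col].str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    toy[col + "_num"] = pd.to_numeric(number, errors="coerce")

print(n_names)
print(toy[["Brand", "Mileage_num", "Engine_num", "Power_num"]])
```

Note that `Mileage` mixes two units (kmpl and km/kg), so the extracted numbers are not directly comparable across fuel types without further handling.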