# Price Prediction of Used Ford Cars

The resale price of a Ford car depends on various factors such as model, mileage, etc. This dataset includes such information on used Ford cars manufactured between 1996 and 2020.

**Attributes/Columns**

- model - model of the car
- year - year of manufacture
- price - price of the car
- transmission - type of transmission in the car (Automatic, Manual and Semi-automatic)
- mileage - mileage of the car
- fuelType - type of fuel used in the car (Petrol, Diesel, Electric, Hybrid and Others)
- mpg - miles the car runs per gallon of fuel
- engineSize - size of the engine used in the car

**Data Source**
<br> https://www.kaggle.com/aishwaryamuthukumar/cars-dataset-audi-bmw-ford-hyundai-skoda-vw

### Importing Packages

In [1]:
import pandas as pd
import numpy as np

# to install scikit-learn, run <pip install -U scikit-learn> in Command Prompt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

### Reading the Dataset

In [2]:
# Naming the DataFrame - df
# Reading the .csv file using pandas: pd.read_csv("<location of dataset>")
# (an .xlsx file would need pd.read_excel instead)
df = pd.read_csv("ford.csv")

# Displaying the first 5 rows of the DataFrame
df.head()

Unnamed: 0,model,year,price,transmission,mileage,fuelType,mpg,engineSize
0,Fiesta,2017,12000,Automatic,15944,Petrol,57.7,1.0
1,Focus,2018,14000,Manual,9083,Petrol,57.7,1.0
2,Focus,2017,13000,Manual,12456,Petrol,57.7,1.0
3,Fiesta,2019,17500,Manual,10460,Petrol,40.3,1.5
4,Fiesta,2019,16500,Automatic,1482,Petrol,48.7,1.0


### Number of rows and columns

In [3]:
# <name_of_DataFrame>.shape
df.shape

# output - (total number of rows, total number of columns)

(17964, 8)

### Data Types and Missing Values

In [4]:
# <name_of_DataFrame>.info()
# info() prints its summary directly and returns None, so wrapping it in display() is not needed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17964 entries, 0 to 17963
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         17964 non-null  object 
 1   year          17964 non-null  int64  
 2   price         17964 non-null  int64  
 3   transmission  17964 non-null  object 
 4   mileage       17964 non-null  int64  
 5   fuelType      17964 non-null  object 
 6   mpg           17964 non-null  float64
 7   engineSize    17964 non-null  float64
dtypes: float64(2), int64(3), object(3)
memory usage: 1.1+ MB

**Overview:**

- Total number of observations/rows: 17964
- Total number of attributes/columns: 8
- Columns of object (string/mixed) data type: 3 (model, transmission, fuelType)
- Columns of integer data type: 3 (year, price, mileage)
- Columns of float (floating-point number) data type: 2 (mpg, engineSize)
- No missing data: every column has 17964 non-null values
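
The "no missing data" conclusion can also be checked directly rather than read off the `info()` table. A minimal sketch, assuming a small stand-in DataFrame (`sample`, built from the five rows shown by `df.head()`) in place of the full `ford.csv`:

```python
import pandas as pd

# Stand-in for df: the five rows shown by df.head(), not the full dataset
sample = pd.DataFrame({
    "model": ["Fiesta", "Focus", "Focus", "Fiesta", "Fiesta"],
    "year": [2017, 2018, 2017, 2019, 2019],
    "price": [12000, 14000, 13000, 17500, 16500],
    "transmission": ["Automatic", "Manual", "Manual", "Manual", "Automatic"],
    "mileage": [15944, 9083, 12456, 10460, 1482],
    "fuelType": ["Petrol"] * 5,
    "mpg": [57.7, 57.7, 57.7, 40.3, 48.7],
    "engineSize": [1.0, 1.0, 1.0, 1.5, 1.0],
})

# Number of missing values in each column (all zeros means no missing data)
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Number of columns of each data type
dtype_counts = sample.dtypes.value_counts()
print(dtype_counts)
```

Running the same two calls on `df` should reproduce the counts listed in the overview.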

### Descriptive Statistics

In [5]:
# Creating Descriptive Statistics table
# <name_of_DataFrame>.describe()

df.describe()

Unnamed: 0,year,price,mileage,mpg,engineSize
count,17964.0,17964.0,17964.0,17964.0,17964.0
mean,2016.864173,12280.078435,23361.880149,57.907832,1.350824
std,2.024987,4741.318119,19471.243292,10.125632,0.432383
min,1996.0,495.0,1.0,20.8,0.0
25%,2016.0,8999.0,9987.0,52.3,1.0
50%,2017.0,11291.0,18242.5,58.9,1.2
75%,2018.0,15299.0,31052.0,65.7,1.5
max,2020.0,54995.0,177644.0,201.8,5.0


**Observations**
 - Half of the cars were manufactured in 2017 or later (median year 2017, latest year 2020).
 - The average price of the used cars is \\$12280. The lowest price is \\$495 and the highest price is \\$54995, while half of the observations have a price of \\$11291 or less.
 - The mileage ranges from 1 to 177644; 75% of the observations have a mileage of 9987 or more, and the average is about 23362.
 - The mpg varies from 20.8 to 201.8.
 - The engine size varies from 0 to 5 L. A size of 0 is not physically meaningful and may indicate an electric car or a missing entry.
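
Statements about "75% of the observations" come from the quartile rows of `describe()`. A minimal sketch of that reading, assuming the five `mileage` values from `df.head()` as stand-in data (so the numbers differ from those of the full dataset):

```python
import pandas as pd

# Stand-in data: the mileage values of the five rows shown by df.head()
mileage = pd.Series([15944, 9083, 12456, 10460, 1482], name="mileage")

# The 25th percentile is the value reported in the "25%" row of describe()
q25 = mileage.quantile(0.25)

# Share of observations at or above that value (at least 75% by definition)
share_at_or_above = (mileage >= q25).mean()
print(q25, share_at_or_above)
```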

### Listing the columns

In [6]:
print(df.columns)

Index(['model', 'year', 'price', 'transmission', 'mileage', 'fuelType', 'mpg',
       'engineSize'],
      dtype='object')


### Identifying the feature and target variable

In [7]:
features = ['mileage', 'year', 'mpg', 'engineSize']
target = ['price']

X = df[features]
y = df[target]

# Printing the number of rows and columns in the feature and target variables
print(X.shape, y.shape)

(17964, 4) (17964, 1)
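
`y` has shape `(17964, 1)` rather than `(17964,)` because `target` is a list, and selecting with a list of column names returns a DataFrame instead of a Series. A small sketch of the difference, using three rows from `df.head()` as stand-in data:

```python
import pandas as pd

# Stand-in data: mileage and price of the first three rows of df.head()
sample = pd.DataFrame({
    "mileage": [15944, 9083, 12456],
    "price": [12000, 14000, 13000],
})

y_frame = sample[["price"]]   # list of column names -> 2-D DataFrame
y_series = sample["price"]    # single column name   -> 1-D Series
print(y_frame.shape, y_series.shape)  # (3, 1) (3,)
```

Either form can be passed to `LinearRegression`. With the 2-D form, `predict` also returns a 2-D array, which is why the predictions print as a column of one-element rows.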


### Splitting the train and test set

In [8]:
# X_train is the training set of feature variables
# X_test is the testing set of feature variables
# y_train is the training set of the target variable
# y_test is the testing set of the target variable
# Split into 80% train and 20% test
# random_state = seed of the pseudo-random number generator, fixed so that every time the code is run, the same observations end up in the train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(14371, 4) (3593, 4) (14371, 1) (3593, 1)
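
The split sizes follow from 0.2 × 17964 = 3592.8, which scikit-learn rounds up to 3593 test rows, leaving 14371 for training. The sketch below illustrates this and the effect of `random_state`, assuming a plain array of row numbers (`rows`) as a stand-in for the actual data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the dataset: one row number per observation
rows = np.arange(17964)

# Two splits with the same seed, one with a different seed
train_a, test_a = train_test_split(rows, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=42)
train_c, test_c = train_test_split(rows, test_size=0.2, random_state=0)

print(len(train_a), len(test_a))

# Same seed -> identical test rows; different seed -> different test rows
same_seed_same_split = np.array_equal(test_a, test_b)
other_seed_same_split = np.array_equal(test_a, test_c)
print(same_seed_same_split, other_seed_same_split)
```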


### Running Linear Regression Model

In [9]:
model = LinearRegression()
model = model.fit(X_train, y_train)

In [10]:
y_pred = model.predict(X_test)

print(y_pred)

[[12511.62021941]
 [11728.1369248 ]
 [12146.83147139]
 ...
 [ 7682.61502627]
 [ 5954.1544702 ]
 [16941.0886637 ]]
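
Each prediction is the fitted intercept plus a weighted sum of the four feature values. A minimal sketch of that relationship, assuming randomly generated toy data (`X_toy`, `y_toy` and `true_weights` are hypothetical and unrelated to the Ford dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: 50 observations with 4 features each
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 4))
true_weights = np.array([2.0, -1.0, 0.5, 3.0])
y_toy = X_toy @ true_weights + 10.0 + rng.normal(scale=0.1, size=50)

toy_model = LinearRegression().fit(X_toy, y_toy)

# A prediction is intercept + sum(coefficient * feature value)
manual_pred = X_toy @ toy_model.coef_ + toy_model.intercept_
sklearn_pred = toy_model.predict(X_toy)

print(toy_model.coef_, toy_model.intercept_)
print(np.allclose(manual_pred, sklearn_pred))
```

On the fitted `model` from this notebook, `model.coef_` and `model.intercept_` hold the corresponding values for `mileage`, `year`, `mpg` and `engineSize`.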


In [11]:
print(y_test)

       price
1087   16700
9367    9690
4705   10999
10336  29350
8509   11250
...      ...
14866  13487
11183  15299
13788   5495
17265   5685
16043  16495

[3593 rows x 1 columns]


### Finding the Root Mean Squared Error (RMSE) 

In [12]:
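# RMSE is the square root of the mean of the squared residuals (prediction errors),
# expressed in the same unit as the price
# np.sqrt is applied to the MSE because the squared=False argument is deprecated in newer scikit-learn releases

RMSE = np.sqrt(mean_squared_error(y_test, y_pred))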
print(RMSE)

2471.5647447160777
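
The RMSE can be reproduced by hand as the square root of the mean squared residual. The sketch below uses only the first three actual prices and predictions printed above, so its result differs from the RMSE of the full test set:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# First three actual prices (y_test) and predictions (y_pred) printed above
y_true = np.array([16700.0, 9690.0, 10999.0])
y_hat = np.array([12511.62021941, 11728.1369248, 12146.83147139])

residuals = y_true - y_hat

# RMSE computed by hand and via scikit-learn
rmse_manual = np.sqrt(np.mean(residuals ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
print(rmse_manual, rmse_sklearn)

# RMSE equals the standard deviation of the residuals only when their mean is zero:
# RMSE^2 = std^2 + mean^2
print(np.std(residuals), np.mean(residuals))
```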
