# Linear Regression

This dataset contains over 370,000 used-car listings scraped from Ebay-Kleinanzeigen. The content of the data is in German. The data is available [here](https://www.kaggle.com/orgesleka/used-cars-database). The fields included in the file data/autos.csv are:

- seller : private or dealer
- offerType
- vehicleType
- yearOfRegistration : the year in which the car was first registered
- gearbox
- powerPS : power of the car in PS
- model
- kilometer : how many kilometers the car has driven
- monthOfRegistration : the month in which the car was first registered
- fuelType
- brand
- notRepairedDamage : whether the car has damage that has not been repaired yet
- price : the asking price in the ad

**Goal**  
Given the characteristics/features of a car, predict its sale price.

In [None]:
#import the required libraries
import pandas as pd
import numpy as np

In [None]:
#Read the dataset
cars = pd.read_csv("../input/autos.csv", encoding='latin1')

In [None]:
#Display the first few rows
cars.head()

In [None]:
#Display the columns in the dataset
cars.columns

In [None]:
#what are the types of the columns?
cars.dtypes

In [None]:
#Does the data have missing values?
#Count the missing values in each column
cars.isnull().sum()

In [None]:
#Find proportion of data that is missing for each of the columns
cars.isnull().sum()/cars.shape[0] * 100
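
The percentage of missing values can also be written as the mean of the boolean mask. Here is a minimal sketch on a made-up toy frame (the values below are illustrative and are not taken from autos.csv) showing that both forms give the same result:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `cars`; the values are invented for illustration.
toy = pd.DataFrame({
    "gearbox": ["manuell", None, "automatik", "manuell"],
    "powerPS": [75, 190, np.nan, 102],
    "price": [1500, 18300, 9800, 3600],
})

# Percentage of missing values per column, computed two equivalent ways.
pct_by_sum = toy.isnull().sum() / toy.shape[0] * 100
pct_by_mean = toy.isnull().mean() * 100

print(pct_by_mean)
```

The `mean()` form works because `True` counts as 1 and `False` as 0, so the mean of the mask is the fraction of missing entries.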

In [None]:
#For this exercise, let's drop the rows that have null values

cars_updated = cars.dropna()

In [None]:
cars.shape, cars_updated.shape

In [None]:
#check if there are any missing values
cars_updated.isnull().sum()

In [None]:
#Display first few records of cars_updated
cars_updated.head()

In [None]:
cars_updated.columns

In [None]:
#Let's use only the following columns for our modeling now:
#the 12 feature columns come first and price (the target) goes last
cars_updated = cars_updated.iloc[:, [2,3,6,7,8,9,10,11,12,13,14,15,4]]
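
Selecting columns by position breaks silently if the column order of the file ever changes. A hedged alternative is to select by name. The sketch below assumes that the positions used above correspond to the fields listed at the top of this notebook. The toy frame and its `extraColumn` are hypothetical stand-ins.

```python
import pandas as pd

# Feature names taken from the field list at the top of the notebook.
feature_cols = ["seller", "offerType", "vehicleType", "yearOfRegistration",
                "gearbox", "powerPS", "model", "kilometer",
                "monthOfRegistration", "fuelType", "brand", "notRepairedDamage"]
target_col = "price"

# One-row toy frame standing in for cars_updated; "extraColumn" is a
# hypothetical column that should not be used for modeling.
row = {name: 0 for name in feature_cols}
row.update({"price": 1000, "extraColumn": "x"})
toy = pd.DataFrame([row])

# Keep the 12 features first and the target last.
selected = toy[feature_cols + [target_col]]
print(selected.columns.tolist())
```

With the target in the last position, the later `iloc[:, :12]` and `iloc[:, 12]` slices still pick the features and the target respectively.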

In [None]:
#Convert text to numeric using Label Encoding
from sklearn import preprocessing

In [None]:
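#encode the data
#note: the encoder is applied to every column, so numeric columns
#(including price) are also replaced by integer codes
cars_encoded = cars_updated.apply(preprocessing.LabelEncoder().fit_transform)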
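
Because the encoding step above is applied to every column, numeric columns such as price end up as rank-like codes rather than their original values. If you would rather keep the numeric columns untouched, one option is to encode only the non-numeric columns. This is a sketch on a made-up toy frame, not the approach used in the rest of this notebook:

```python
import pandas as pd
from sklearn import preprocessing

# Toy frame with invented values: two text columns and two numeric columns.
toy = pd.DataFrame({
    "gearbox": ["manuell", "automatik", "manuell"],
    "fuelType": ["benzin", "diesel", "diesel"],
    "powerPS": [75, 190, 102],
    "price": [1500, 18300, 3600],
})

# Encoding every column: price becomes a code, not an amount.
encoded_all = toy.apply(preprocessing.LabelEncoder().fit_transform)

# Encoding only the non-numeric columns: price and powerPS keep their values.
text_cols = toy.select_dtypes(exclude="number").columns
encoded_text_only = toy.copy()
encoded_text_only[text_cols] = toy[text_cols].apply(
    preprocessing.LabelEncoder().fit_transform
)

print(encoded_all["price"].tolist())
print(encoded_text_only["price"].tolist())
```

With the second variant, the model's error is expressed in the original price units, which makes it easier to interpret.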

In [None]:
#Display the first few records
cars_encoded.head()

In [None]:
cars_encoded.columns

In [None]:
#Exploratory data analysis
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
# plt.rcParams['figure.figsize'] = (10, 6)

In [None]:
#Plot year vs price (both columns hold label-encoded codes at this point, not raw values)
plt.scatter(cars_encoded.yearOfRegistration,
            cars_encoded.price,
            s=150, alpha=0.1)
plt.xlabel('year')
plt.ylabel('price')

### Linear Regression Model

In [None]:
from sklearn import linear_model

In [None]:
#Instantiate the model
model_sklearn = linear_model.LinearRegression()

In [None]:
#fit the model: the first 12 columns are the features, the last column (price) is the target
model_sklearn.fit(cars_encoded.iloc[:,:12], cars_encoded.iloc[:,12])

In [None]:
#Regression coefficients
model_sklearn.coef_

In [None]:
#Model intercept
model_sklearn.intercept_
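
The coefficient array is easier to read when each value is paired with its feature name. Here is a minimal sketch on toy data with a known linear relationship (the columns `a` and `b` are hypothetical):

```python
import pandas as pd
from sklearn import linear_model

# Toy data that follows y = 3*a - 2*b + 5 exactly.
X = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0, 4.0],
                  "b": [1.0, 0.0, 2.0, 1.0, 3.0]})
y = 3 * X["a"] - 2 * X["b"] + 5

model = linear_model.LinearRegression().fit(X, y)

# Label each coefficient with the column it belongs to.
coef_by_feature = pd.Series(model.coef_, index=X.columns)
print(coef_by_feature)
print(model.intercept_)
```

The same pattern should apply to the model fitted above by using the feature columns of `cars_encoded` as the index.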

### Validation

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
#Split into train (80%) and validation/test (20%) sets
x_train, x_test, y_train, y_test = train_test_split(cars_encoded.iloc[:,:12], 
                                                    cars_encoded.iloc[:,12],
                                                    test_size=0.2)
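
Without a fixed seed, the split is different on every run, so the error reported later will vary slightly. Passing `random_state` makes the split repeatable. This is a small sketch on toy arrays, and the seed value 42 is an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same random_state produce identical splits.
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```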

In [None]:
#Display data shape
cars_encoded.shape, x_train.shape, y_train.shape, x_test.shape, y_test.shape

In [None]:
#Instantiate the model
model_sklearn_tv = linear_model.LinearRegression()

In [None]:
#fit the model
model_sklearn_tv.fit(x_train, y_train)

In [None]:
y_pred = model_sklearn_tv.predict(x_test)

In [None]:
#Find the error: root mean squared error (RMSE) on the held-out data,
#expressed in units of the encoded price column
np.sqrt(np.mean((y_test - y_pred)**2))
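
An RMSE value is hard to judge on its own. One common sanity check is to compare it with a naive baseline that always predicts the mean. The sketch below uses made-up numbers, and for simplicity the baseline uses the mean of the same targets it is scored on. In practice you would probably use the mean of `y_train`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Invented targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

model_rmse = rmse(y_true, y_hat)
baseline_rmse = rmse(y_true, np.full_like(y_true, y_true.mean()))

print(model_rmse, baseline_rmse)
```

A model that is doing something useful should have an RMSE clearly below the baseline.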
