# Car Price Prediction for Germany Using AutoScout24 Data
AutoScout24 is one of Europe's largest marketplaces for new and used cars. The dataset covers the years 2011 to 2021 and includes basic fields such as make, model, mileage and horsepower. Dataset link: https://www.kaggle.com/datasets/ander289386/cars-germany

### 1 Import of Data and Required Packages
Importing pandas, NumPy, Matplotlib, Seaborn and the warnings library.

In [20]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

Import the CSV data as a pandas DataFrame

In [21]:
df = pd.read_csv('data/autoscout24-germany-dataset.csv')
df.head()

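|   | mileage | make | model | fuel | gear | offerType | price | hp | year |
|---|---------|------|-------|------|------|-----------|-------|----|------|
| 0 | 235000 | BMW | 316 | Diesel | Manual | Used | 6800 | 116.0 | 2011 |
| 1 | 92800 | Volkswagen | Golf | Gasoline | Manual | Used | 6877 | 122.0 | 2011 |
| 2 | 149300 | SEAT | Exeo | Gasoline | Manual | Used | 6900 | 160.0 | 2011 |
| 3 | 96200 | Renault | Megane | Gasoline | Manual | Used | 6950 | 110.0 | 2011 |
| 4 | 156000 | Peugeot | 308 | Gasoline | Manual | Used | 6950 | 156.0 | 2011 |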


In [22]:
# Let's find out a bit more about the variables in the dataframe:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46405 entries, 0 to 46404
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   mileage    46405 non-null  int64  
 1   make       46405 non-null  object 
 2   model      46262 non-null  object 
 3   fuel       46405 non-null  object 
 4   gear       46223 non-null  object 
 5   offerType  46405 non-null  object 
 6   price      46405 non-null  int64  
 7   hp         46376 non-null  float64
 8   year       46405 non-null  int64  
dtypes: float64(1), int64(3), object(5)
memory usage: 3.2+ MB


### 2 Check Missing Values

In [23]:
df.isna().sum()

mileage        0
make           0
model        143
fuel           0
gear         182
offerType      0
price          0
hp            29
year           0
dtype: int64

The dataframe has 46405 rows and 9 variables. Of these, 4 are numerical (dtype "int64" or "float64") and 5 are categorical (dtype "object"). The 'model', 'gear' and 'hp' variables have 143, 182 and 29 null values respectively, each under 0.5% of the rows. We handle these by dropping the corresponding rows, which should have little impact because so few rows are affected.
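To back the claim that the missing values are a small fraction of the data, the null counts can be expressed as percentages. The following is a minimal sketch: the helper name `missing_share` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    # isna() yields booleans, so the column mean is the fraction of nulls.
    return (frame.isna().mean() * 100).sort_values(ascending=False)


# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "model": ["Golf", None, "308", "Exeo"],
    "gear": ["Manual", "Manual", None, None],
    "price": [6877, 6900, 6950, 7000],
})

shares = missing_share(toy)
print(shares)
```

In the notebook itself, calling `missing_share(df)` should give the same information for the full dataset.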

In [24]:
# Numerical columns (select_dtypes is the public API; _get_numeric_data is private)
num_cols = df.select_dtypes(include='number').columns.tolist()
print(num_cols)

['mileage', 'price', 'hp', 'year']


In [25]:
# Categorical columns
cat_cols = set(df.columns) - set(num_cols)
print(cat_cols)

{'model', 'fuel', 'make', 'gear', 'offerType'}
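A Python set has no guaranteed order, so the printed categorical columns may appear in a different order from run to run. If a stable order matters later (for example when building feature matrices), one option is to select by dtype in both directions. This is a sketch on a made-up toy frame, not the notebook's own cell:

```python
import pandas as pd

# Toy frame with the same kinds of columns as the dataset (illustrative values).
toy = pd.DataFrame({
    "mileage": [235000, 92800],
    "make": ["BMW", "Volkswagen"],
    "fuel": ["Diesel", "Gasoline"],
    "hp": [116.0, 122.0],
})

# Both lists keep the original column order of the frame.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(num_cols)
print(cat_cols)
```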


### 3 Data Cleaning

#### Replacing the year variable with the age of the vehicles
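We transform the 'year' variable into an 'age' variable giving the car's age in years relative to the current year. Because the age is computed from `datetime.now()`, the values depend on when the notebook is run; the output below reflects a run in 2023, so cars from 2011 have an age of 12.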

In [26]:
from datetime import datetime

# Create a new column: 'age'
df['age'] = datetime.now().year - df['year']

# Drop the 'year' column
df = df.drop('year', axis=1)

# Show the top five rows of the cars dataset
df.head()

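|   | mileage | make | model | fuel | gear | offerType | price | hp | age |
|---|---------|------|-------|------|------|-----------|-------|----|-----|
| 0 | 235000 | BMW | 316 | Diesel | Manual | Used | 6800 | 116.0 | 12 |
| 1 | 92800 | Volkswagen | Golf | Gasoline | Manual | Used | 6877 | 122.0 | 12 |
| 2 | 149300 | SEAT | Exeo | Gasoline | Manual | Used | 6900 | 160.0 | 12 |
| 3 | 96200 | Renault | Megane | Gasoline | Manual | Used | 6950 | 110.0 | 12 |
| 4 | 156000 | Peugeot | 308 | Gasoline | Manual | Used | 6950 | 156.0 | 12 |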
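Since `datetime.now().year` changes over time, rerunning the cell in a later year shifts every age by the same amount. For reproducible results, one alternative is to pin a reference year. The sketch below assumes 2021 (the last year covered by the dataset) as that reference; both the constant `REFERENCE_YEAR` and the helper `add_age` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Assumption: 2021 is chosen as the reference because the dataset ends in 2021.
REFERENCE_YEAR = 2021


def add_age(frame: pd.DataFrame, reference_year: int = REFERENCE_YEAR) -> pd.DataFrame:
    """Return a copy with 'year' replaced by 'age' relative to reference_year."""
    out = frame.copy()
    out["age"] = reference_year - out["year"]
    return out.drop(columns="year")


# Toy rows (illustrative values only).
toy = pd.DataFrame({"make": ["BMW", "SEAT"], "year": [2011, 2019]})
aged = add_age(toy)
print(aged)
```

A constant offset in age does not change the ordering of the cars, but pinning the year keeps tables and any fitted coefficients identical between runs.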


In [27]:
# Drop the rows with null values
df = df.dropna()

# Display the total number of null values in the resulting dataframe
df.isna().sum()

mileage      0
make         0
model        0
fuel         0
gear         0
offerType    0
price        0
hp           0
age          0
dtype: int64

In [28]:
# It's also a good idea to drop duplicate rows:
df = df.drop_duplicates(keep='first')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 43947 entries, 0 to 46399
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   mileage    43947 non-null  int64  
 1   make       43947 non-null  object 
 2   model      43947 non-null  object 
 3   fuel       43947 non-null  object 
 4   gear       43947 non-null  object 
 5   offerType  43947 non-null  object 
 6   price      43947 non-null  int64  
 7   hp         43947 non-null  float64
 8   age        43947 non-null  int64  
dtypes: float64(1), int64(3), object(5)
memory usage: 3.4+ MB
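After dropping rows, the index still carries the original labels (here it runs up to 46399 although only 43947 rows remain), which can be surprising with positional logic later on. A minimal sketch of counting the duplicates first and then resetting the index, shown on a made-up toy frame:

```python
import pandas as pd

# Toy frame with two exact copies of the first row (illustrative values).
toy = pd.DataFrame({
    "make": ["BMW", "BMW", "SEAT", "BMW"],
    "price": [6800, 6800, 6900, 6800],
})

# Count rows that repeat an earlier row before removing them.
n_duplicates = int(toy.duplicated(keep="first").sum())

# Drop the repeats and renumber the index from 0 without keeping the old labels.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates)
print(deduped)
```

Whether to reset the index is a design choice: keeping the original labels makes it easier to trace rows back to the raw CSV, while a fresh range index is simpler for later positional operations.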
