### Reading the data 

First, we read our car sales data from the CSV file into a pandas DataFrame.

In [1]:
import pandas as pd

# This line requires the car sales data to be located in a folder called data in the same directory as the notebook
car_data = pd.read_csv("./data/car_sales.csv", low_memory=False)

Now, let's take a first look at what our data looks like.

In [2]:
car_data.head()

Unnamed: 0,Maker,Genmodel,Genmodel_ID,Adv_ID,Adv_year,Adv_month,Color,Reg_year,Bodytype,Runned_Miles,Engin_size,Gearbox,Fuel_type,Price,Seat_num,Door_num
0,Bentley,Arnage,10_1,10_1$$1,2018,4,Silver,2000.0,Saloon,60000,6.8L,Automatic,Petrol,21500,5.0,4.0
1,Bentley,Arnage,10_1,10_1$$2,2018,6,Grey,2002.0,Saloon,44000,6.8L,Automatic,Petrol,28750,5.0,4.0
2,Bentley,Arnage,10_1,10_1$$3,2017,11,Blue,2002.0,Saloon,55000,6.8L,Automatic,Petrol,29999,5.0,4.0
3,Bentley,Arnage,10_1,10_1$$4,2018,4,Green,2003.0,Saloon,14000,6.8L,Automatic,Petrol,34948,5.0,4.0
4,Bentley,Arnage,10_1,10_1$$5,2017,11,Grey,2003.0,Saloon,61652,6.8L,Automatic,Petrol,26555,5.0,4.0


### Preparing the data

For convenience, we will set all of the column names to lowercase.

In [3]:
car_data.columns = car_data.columns.str.lower()

In [4]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 268255 entries, 0 to 268254
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   maker         268255 non-null  object 
 1   genmodel      268255 non-null  object 
 2   genmodel_id   268255 non-null  object 
 3   adv_id        268255 non-null  object 
 4   adv_year      268255 non-null  int64  
 5   adv_month     268255 non-null  int64  
 6   color         246380 non-null  object 
 7   reg_year      268248 non-null  float64
 8   bodytype      267301 non-null  object 
 9   runned_miles  267200 non-null  object 
 10  engin_size    266191 non-null  object 
 11  gearbox       268088 non-null  object 
 12  fuel_type     267846 non-null  object 
 13  price         268255 non-null  object 
 14  seat_num      261781 non-null  float64
 15  door_num      263702 non-null  float64
dtypes: float64(3), int64(2), object(11)
memory usage: 32.7+ MB


In [5]:
car_data.shape

(268255, 16)

From the above we can see that our dataset consists of 16 columns and 268255 rows. Conceptually, 8 of these columns are nominal, for example **maker**, **genmodel**, or **color**. The other 8 are numeric, for example **reg_year**, **runned_miles**, or **door_num**, where the first two are continuous and the last one is discrete. Note that pandas currently stores 11 columns as object, because **runned_miles**, **engin_size**, and **price** were read as strings even though they hold numeric information. <br><br>
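To check which text columns actually hold numeric information, one option is to measure how many of their values parse as numbers. The following is a minimal sketch on a small made-up sample, not the real file; the helper `numeric_share` and all sample values (including the placeholder string in the price column) are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for car_data; all values here are made up for illustration.
sample = pd.DataFrame({
    "maker": ["Bentley", "BMW", "Zenos"],
    "runned_miles": ["60000", "44000", "1538"],
    "engin_size": ["6.8L", "3.0L", "2.0L"],
    "price": ["21500", "not available", "27950"],
    "seat_num": [5.0, 4.0, 2.0],
})

def numeric_share(column: pd.Series) -> float:
    """Return the fraction of non-null values that parse as numbers."""
    non_null = column.dropna()
    if non_null.empty:
        return 0.0
    parsed = pd.to_numeric(non_null, errors="coerce")
    return float(parsed.notna().mean())

# Only look at columns that pandas does not already treat as numeric.
text_columns = [
    name for name in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[name])
]
shares = {name: numeric_share(sample[name]) for name in text_columns}
print(shares)
```

A share close to 1 suggests the column can be converted directly with `pd.to_numeric`. A share of 0 for a column like engin_size suggests that a unit suffix has to be stripped first.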

We can also observe that some columns contain null values. To look further into this issue, we can do the following:

In [6]:
car_data.isnull().sum()

maker               0
genmodel            0
genmodel_id         0
adv_id              0
adv_year            0
adv_month           0
color           21875
reg_year            7
bodytype          954
runned_miles     1055
engin_size       2064
gearbox           167
fuel_type         409
price               0
seat_num         6474
door_num         4553
dtype: int64

It looks like we have a significant number of null values in the **color** column. However, that should not be a big issue, as we expect color to have at most a weak relationship with price. The next largest counts are in **seat_num** and **door_num**, which might contribute more to the price of the car. The remaining attributes with null values are **engin_size**, **runned_miles**, **bodytype**, **fuel_type**, **gearbox**, and **reg_year**, listed in decreasing order of null count. Since all of these attributes are important, we will need to take care of the missing values later on. <br><br>
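To put these counts in perspective, it can help to express them as a share of all rows. The sketch below uses the null counts and the row count reported above; the variable names are illustrative only.

```python
import pandas as pd

# Null counts copied from the car_data.isnull().sum() output above.
null_counts = pd.Series({
    "color": 21875,
    "reg_year": 7,
    "bodytype": 954,
    "runned_miles": 1055,
    "engin_size": 2064,
    "gearbox": 167,
    "fuel_type": 409,
    "seat_num": 6474,
    "door_num": 4553,
})
total_rows = 268255  # from car_data.shape

# Percentage of rows with a missing value, largest first.
missing_pct = (null_counts / total_rows * 100).sort_values(ascending=False).round(2)
print(missing_pct)

# With the real DataFrame loaded, the equivalent would be:
# (car_data.isnull().mean() * 100).sort_values(ascending=False).round(2)
```

On these numbers, color is missing in roughly 8% of the rows, while every other column is missing in less than 3%.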

Now let's have a closer look at the nominal attributes.

In [7]:
car_data.select_dtypes(include=['object'])

Unnamed: 0,maker,genmodel,genmodel_id,adv_id,color,bodytype,runned_miles,engin_size,gearbox,fuel_type,price
0,Bentley,Arnage,10_1,10_1$$1,Silver,Saloon,60000,6.8L,Automatic,Petrol,21500
1,Bentley,Arnage,10_1,10_1$$2,Grey,Saloon,44000,6.8L,Automatic,Petrol,28750
2,Bentley,Arnage,10_1,10_1$$3,Blue,Saloon,55000,6.8L,Automatic,Petrol,29999
3,Bentley,Arnage,10_1,10_1$$4,Green,Saloon,14000,6.8L,Automatic,Petrol,34948
4,Bentley,Arnage,10_1,10_1$$5,Grey,Saloon,61652,6.8L,Automatic,Petrol,26555
...,...,...,...,...,...,...,...,...,...,...,...
268250,Westfield,Sport,97_1,97_1$$1,Yellow,Convertible,1800,2.2L,Manual,Petrol,8750
268251,Westfield,Sport,97_1,97_1$$2,Yellow,Convertible,2009,,Manual,,7995
268252,Zenos,E10,99_1,99_1$$1,Red,Convertible,6,2.0L,Manual,Petrol,27950
268253,Zenos,E10,99_1,99_1$$2,Green,Convertible,1538,2.0L,Manual,Petrol,34950


In [16]:
# Convert price and mileage from strings to numbers; entries that cannot be parsed become NaN
car_data['price'] = pd.to_numeric(car_data['price'], errors='coerce')
car_data['runned_miles'] = pd.to_numeric(car_data['runned_miles'], errors='coerce')

# Identify rows with NaN values in the 'price' column
rows_with_nan = car_data[car_data['price'].isna()]

print(rows_with_nan)

       maker  genmodel genmodel_id     adv_id  adv_year  adv_month   color  \
196134   BMW  6 Series        8_11    8_11$$1      2018          2   Black   
196135   BMW  6 Series        8_11    8_11$$2      2018          3    Grey   
196136   BMW  6 Series        8_11    8_11$$3      2017          7   Black   
196137   BMW  6 Series        8_11    8_11$$4      2018          3  Silver   
196138   BMW  6 Series        8_11    8_11$$5      2017          9   Black   
...      ...       ...         ...        ...       ...        ...     ...   
200029   BMW        M2        8_29  8_29$$235      2018          4    Blue   
200032   BMW        M2        8_29  8_29$$238      2018          3   White   
200033   BMW        M2        8_29  8_29$$239      2018          3   White   
200035   BMW        M2        8_29  8_29$$241      2018          2   Black   
200036   BMW        M2        8_29  8_29$$242      2018          3    Grey   

        reg_year     bodytype  runned_miles engin_size    gearb

In [14]:
car_data.isnull().sum()

maker               0
genmodel            0
genmodel_id         0
adv_id              0
adv_year            0
adv_month           0
color           21875
reg_year            7
bodytype          954
runned_miles     1055
engin_size       2064
gearbox           167
fuel_type         409
price            1145
seat_num         6474
door_num         4553
dtype: int64
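
The **price** column had no null values before the conversion and has 1145 afterwards, which suggests that those entries contained text that could not be parsed as a number. A minimal sketch of how such entries could be inspected before the column is overwritten is shown here. It runs on a hypothetical toy series, and the placeholder string in it is an assumption, not a value taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values; the placeholder text is made up for illustration.
raw_price = pd.Series(
    ["21500", "28750", "not available", None, "29999"], name="price"
)

# Parse into a separate series so the raw strings stay available.
parsed = pd.to_numeric(raw_price, errors="coerce")

# Entries that were present in the raw data but became NaN after parsing.
failed_mask = raw_price.notna() & parsed.isna()
print(raw_price[failed_mask].value_counts())
```

Applying the same mask to the raw price column of the real data, before it is overwritten, would list the strings responsible for the new null values.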