# Car data

In this notebook I will clean the car price data and prepare it for analysis. The data dictionary below explains the meaning of each column in the initial data:

1. **car_ID**: Unique ID of each observation (Integer)

2. **symboling**: The assigned insurance risk rating. A value of +3 indicates that the car is risky; -3 indicates that it is probably pretty safe. (Categorical)

3. **CarName**: Name of the car company followed by the model name (Categorical)

4. **fueltype**: Car fuel type, i.e., gas or diesel (Categorical)

5. **aspiration**: Aspiration used in a car (Categorical)
    - 'std' refers to a naturally aspirated engine, which means it draws in air at atmospheric pressure without any extra components.
    - 'turbo' refers to a turbocharged engine, which uses a turbine powered by exhaust gases to force more air into the engine, resulting in more power.

6. **doornumber**: Number of doors in a car (Categorical)

7. **carbody**: Body style of the car (Categorical)
    - 'convertible': A car with a retractable roof that can be a soft top (fabric) or a hardtop (metal).
    - 'hatchback': A car with a rear hatch door that swings upward, combining the passenger area and cargo space.
    - 'sedan': A car with four doors and a separate, enclosed trunk.
    - 'wagon': Similar to a sedan but with an extended roofline that continues over a large cargo area, often with a rear liftgate.
    - 'hardtop': A car without a pillar between the front and rear side windows, giving it a more open, airy feel when the windows are down.

8. **drivewheel**: type of drive wheel (Categorical)
    - 'rwd' (Rear-Wheel Drive): Power is sent to the rear wheels. This is common in sports cars and trucks.
    - 'fwd' (Front-Wheel Drive): Power is sent to the front wheels. This is the most common setup for passenger cars.
    - '4wd' (Four-Wheel Drive): Power is sent to all four wheels. This is designed for better traction in off-road or slippery conditions.

9. **enginelocation**: Location of car engine (Categorical)

10. **wheelbase**: Wheelbase of the car (Numeric). The distance between the center points of the front and rear wheels. A longer wheelbase generally results in a smoother ride, while a shorter wheelbase can make a car more agile.

11. **carlength**: Length of car (Numeric)

12. **carwidth**: Width of car (Numeric)

13. **carheight**: Height of car (Numeric)

14. **curbweight**: The weight of a car without occupants or baggage. (Numeric)

15. **enginetype**: Type of engine. (Categorical)
    - 'OHC' (Overhead Cam): The camshaft is located in the cylinder head. This is the general term for this design.
    - 'DOHC' (Dual Overhead Cam): The most common modern design, with two separate camshafts per cylinder bank—one for intake valves and one for exhaust valves. This allows for better engine breathing and more power.
    - 'OHCV': Most likely an overhead-cam engine with its cylinders arranged in a V. The dataset does not define this abbreviation, so this reading is an assumption.
    - 'OHCF': Most likely an overhead-cam flat (horizontally opposed, or "boxer") engine. The dataset does not define this abbreviation, so this reading is an assumption.
    - 'l': This likely stands for an L-head engine, an older design where the intake and exhaust valves were located side-by-side in the engine block.
    - 'rotor': A rotary engine, which uses triangular rotors to convert pressure into rotational motion instead of pistons.

16. **cylindernumber**: Number of cylinders in the engine (Categorical). A cylinder is a chamber where fuel is combusted to generate power.

17. **enginesize**: Size of the engine (Numeric). This refers to the engine's displacement: the total volume of air and fuel the engine can draw in during one cycle.

18. **fuelsystem**: Fuel system of car (Categorical). This is the method used to deliver fuel to the engine's combustion chambers.
    - 'mpfi' (Multi-Port Fuel Injection): Each cylinder has its own fuel injector, which is the most common modern design.
    - 'spfi' (Single-Point Fuel Injection): A single injector feeds all cylinders, typically a less efficient system than MPFI.
    - 'mfi' (Mechanical Fuel Injection): An older system that uses mechanical pumps and injectors instead of electronic ones.
    - 'idi' (Indirect Diesel Injection): The fuel is injected into a pre-chamber before entering the main combustion chamber.
    - '1bbl', '2bbl', '4bbl': These refer to carburetor-based systems. bbl stands for "barrel," which is a passageway in the carburetor. More barrels generally allow for more air and fuel to enter the engine, thus more power.

19. **boreratio**: Bore ratio of the engine (Numeric). The bore-stroke ratio is the ratio of the cylinder's diameter (bore) to the piston's travel distance (stroke). This ratio helps determine how the engine produces power.

20. **stroke**: Stroke of the engine (Numeric). The distance the piston travels up and down inside the cylinder. This is a critical factor in calculating the engine's displacement.

21. **compressionratio**: Compression ratio of the engine (Numeric). This is the ratio of the volume inside a cylinder when the piston is at the bottom of its stroke to the volume when the piston is at the top. A higher compression ratio can result in more power and efficiency but may require higher-octane fuel.

22. **horsepower**: Horsepower (Numeric). A measurement of an engine's power output, that is, how quickly the engine can do work. It relates directly to a car's top speed and acceleration.

23. **peakrpm**: Peak RPM of the engine (Numeric). RPM stands for revolutions per minute, a measure of how fast the engine crankshaft is rotating. Peak RPM is the engine speed at which the engine produces its maximum power.

24. **citympg**: Mileage in city (Numeric). It measures fuel economy under typical urban driving conditions, with frequent stops and starts.

25. **highwaympg**: Mileage on highway (Numeric). It measures fuel economy during a consistent, higher-speed drive, like on a highway, without stopping. Highway MPG is almost always higher because a car is more fuel-efficient when it doesn't have to constantly accelerate from a stop.

26. **price** (Dependent variable): Price of car (Numeric)
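
To make the link between bore, stroke, and displacement mentioned in the dictionary concrete, here is a minimal sketch of the standard geometric formula (the volume swept by one piston, times the number of cylinders). The function name and the example numbers are illustrative assumptions and are not taken from this dataset.

```python
import math

def engine_displacement(bore: float, stroke: float, cylinders: int) -> float:
    """Swept volume of all cylinders: (pi / 4) * bore^2 * stroke * cylinders.

    The result is in cubic units of whatever length unit bore and stroke use.
    """
    return math.pi / 4 * bore ** 2 * stroke * cylinders

# Hypothetical engine: 3.0 in bore, 3.0 in stroke, 4 cylinders
displacement = engine_displacement(3.0, 3.0, 4)
print(round(displacement, 1))
```

The function expects the bore as a diameter, so it only applies to a column that stores the bore itself and not a ratio.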

In [75]:
import pandas as pd

## Loading the data
Let's first load the data, see its general shape and check for missing values:

In [76]:
df = pd.read_csv("../data/raw/CarPrice_Assignment.csv")
print(df.shape)
df.head(10)

(205, 26)


,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0
5,6,2,audi fox,gas,std,two,sedan,fwd,front,99.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250.0
6,7,1,audi 100ls,gas,std,four,sedan,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710.0
7,8,1,audi 5000,gas,std,four,wagon,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,18920.0
8,9,1,audi 4000,gas,turbo,four,sedan,fwd,front,105.8,...,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875.0
9,10,0,audi 5000s (diesel),gas,turbo,two,hatchback,4wd,front,99.5,...,131,mpfi,3.13,3.4,7.0,160,5500,16,22,17859.167


In [77]:
df.isnull().sum()

car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64

There are no missing values in this dataset. Let's confirm the data type of each of the columns:

In [78]:
df.dtypes

car_ID                int64
symboling             int64
CarName              object
fueltype             object
aspiration           object
doornumber           object
carbody              object
drivewheel           object
enginelocation       object
wheelbase           float64
carlength           float64
carwidth            float64
carheight           float64
curbweight            int64
enginetype           object
cylindernumber       object
enginesize            int64
fuelsystem           object
boreratio           float64
stroke              float64
compressionratio    float64
horsepower            int64
peakrpm               int64
citympg               int64
highwaympg            int64
price               float64
dtype: object

### Unique values
Before moving on, let's get to know the data. I will list the unique values of each categorical column:

##### fueltype:

In [79]:
print(df['fueltype'].unique())

['gas' 'diesel']


##### aspiration:

In [80]:
print(df['aspiration'].unique())

['std' 'turbo']


##### doornumber:

In [81]:
print(df['doornumber'].unique())

['two' 'four']


##### carbody:

In [82]:
print(df['carbody'].unique())

['convertible' 'hatchback' 'sedan' 'wagon' 'hardtop']


##### drivewheel:

In [83]:
print(df['drivewheel'].unique())

['rwd' 'fwd' '4wd']


##### enginelocation:

In [84]:
print(df['enginelocation'].unique())

['front' 'rear']


##### enginetype:

In [85]:
print(df['enginetype'].unique())

['dohc' 'ohcv' 'ohc' 'l' 'rotor' 'ohcf' 'dohcv']


##### cylindernumber:

In [86]:
print(df['cylindernumber'].unique())

['four' 'six' 'five' 'three' 'twelve' 'two' 'eight']


##### fuelsystem:

In [87]:
print(df['fuelsystem'].unique())

['mpfi' '2bbl' 'mfi' '1bbl' 'spfi' '4bbl' 'idi' 'spdi']
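
The per-column checks above can also be condensed into a single loop. The sketch below is one possible approach, shown on a small toy frame with illustrative values instead of the real `df`; it treats every non-numeric column as categorical, which is an assumption that happens to hold for this kind of data.

```python
import pandas as pd

# Toy frame standing in for the car data (values are illustrative only)
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "aspiration": ["std", "turbo", "std"],
    "price": [13495.0, 16500.0, 13950.0],
})

# Treat every non-numeric column as categorical
categorical_cols = toy.select_dtypes(exclude="number").columns

# Collect the unique values of each categorical column in one pass
unique_values = {col: toy[col].unique().tolist() for col in categorical_cols}
for col, values in unique_values.items():
    print(f"{col}: {values}")
```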


## Data cleaning and transformation

This data is already clean enough to plot. However, two transformations will make it even better suited for that purpose:

### Converting `cylindernumber` and `doornumber` to numerical values
The `cylindernumber` column can be encoded as numbers directly. Since we know all of its unique values, I will first create a dictionary that maps each number word to its integer, and then use it to convert the column.

In [88]:
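# Map each number word that appears in the data to its integer value
cylinder_dict = {
    'four': 4, 'six': 6, 'five': 5, 'three': 3, 'twelve': 12, 'two': 2, 'eight': 8
}

# Assign the result back to the column; calling replace(..., inplace=True) on a
# selected column is unreliable in recent pandas versions (Copy-on-Write)
df['cylindernumber'] = df['cylindernumber'].map(cylinder_dict)
df['cylindernumber'].unique()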


array([ 4,  6,  5,  3, 12,  2,  8], dtype=int64)

Although less likely to be useful, we can do the same with `doornumber`, using the same dictionary:

In [89]:
# Same mapping, assigned back to the column instead of replacing in place
df['doornumber'] = df['doornumber'].map(cylinder_dict)
df['doornumber'].unique()

array([2, 4], dtype=int64)
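
A dictionary-based conversion like this depends on the dictionary covering every value in the column. The following is a minimal sketch of a coverage check, using a hypothetical mapping and a toy series in which one value ('seven') is deliberately left out of the dictionary.

```python
import pandas as pd

# Hypothetical mapping and column; 'seven' is deliberately missing from the dictionary
word_to_number = {"two": 2, "four": 4, "six": 6}
cylinders = pd.Series(["four", "six", "seven", "two"], name="cylindernumber")

# Values that the dictionary cannot translate
unmapped = sorted(set(cylinders.unique()) - set(word_to_number))
print(unmapped)

# `map` turns unmapped entries into NaN instead of leaving the original text
converted = cylinders.map(word_to_number)
print(converted.isna().sum())
```

Running a check like this before converting makes a gap in the dictionary visible immediately, instead of surfacing later as a mixed-type or partly empty column.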

### Extracting the brand name from the `CarName` string
The car brand is probably correlated with the price to some degree, due to factors like marketing or brand recognition, so it is worth extracting as its own variable. The brand name is the first word of the car name, which makes the extraction easy:

In [90]:
df['brand'] = df['CarName'].str.split(' ', n=1).str[0]
print(df['brand'].unique())
df.head()

['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'maxda' 'mazda' 'buick' 'mercury' 'mitsubishi' 'Nissan' 'nissan'
 'peugeot' 'plymouth' 'porsche' 'porcshce' 'renault' 'saab' 'subaru'
 'toyota' 'toyouta' 'vokswagen' 'volkswagen' 'vw' 'volvo']


,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price,brand
0,1,3,alfa-romero giulia,gas,std,2,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0,alfa-romero
1,2,3,alfa-romero stelvio,gas,std,2,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0,alfa-romero
2,3,1,alfa-romero Quadrifoglio,gas,std,2,hatchback,rwd,front,94.5,...,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0,alfa-romero
3,4,2,audi 100 ls,gas,std,4,sedan,fwd,front,99.8,...,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0,audi
4,5,2,audi 100ls,gas,std,4,sedan,4wd,front,99.4,...,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0,audi


The output shows several misspelled values (e.g. maxda instead of mazda) and values that use an alternative name (e.g. vw instead of volkswagen). To fix this, I will create a dictionary that corrects and standardizes the brand names.

In [91]:
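# Map misspelled or alternative brand names to their standard spelling
brand_dict = {
    'maxda': 'mazda',
    'Nissan': 'nissan',
    'porcshce': 'porsche',
    'toyouta': 'toyota',
    'vokswagen': 'volkswagen',
    'vw': 'volkswagen'
}

# Assign the result back; names not listed in the dictionary stay unchanged
df['brand'] = df['brand'].replace(brand_dict)
df['brand'].unique()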

array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda',
       'isuzu', 'jaguar', 'mazda', 'buick', 'mercury', 'mitsubishi',
       'nissan', 'peugeot', 'plymouth', 'porsche', 'renault', 'saab',
       'subaru', 'toyota', 'volkswagen', 'volvo'], dtype=object)
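
Spotting misspellings by eye works for a list this short. For longer lists, a similarity check can flag candidates automatically. The sketch below uses `difflib.get_close_matches` from the Python standard library on a hypothetical list of brand names; the helper name `likely_misspellings` and the 0.75 cutoff are assumptions chosen for illustration.

```python
from difflib import get_close_matches

# Hypothetical list of raw brand names, including two misspellings
brands = ["mazda", "maxda", "toyota", "toyouta", "volvo"]

def likely_misspellings(names, cutoff=0.75):
    """Pair each name with other names that are suspiciously similar to it."""
    suspects = {}
    for name in names:
        others = [other for other in names if other != name]
        matches = get_close_matches(name, others, n=3, cutoff=cutoff)
        if matches:
            suspects[name] = matches
    return suspects

suspects = likely_misspellings(brands)
print(suspects)
```

A check like this only flags pairs for manual review: it cannot tell which spelling is the correct one, and it will not catch abbreviations such as vw, which share few characters with the full name.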

### Other problems with the data
As the head of the table shows, some models have very similar names (e.g. audi 100 ls and two entries for audi 100ls). However, they have different properties, and without more context it is impossible to know what distinguishes them. They might be different trims (one a base model, the other a more expensive luxury or performance-focused trim), or the same model released in different years. I will therefore assume they are different models and keep these near-duplicates.

I also see no reason to clean misspellings in the `CarName` string: it won't be analyzed, and correcting it properly would require a lot of research.
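
One way to check whether similarly named rows are true duplicates is to compare them on every column except the identifier. The following is a minimal sketch on a toy frame; the column names are borrowed from the dataset, but the rows are made up for illustration.

```python
import pandas as pd

# Toy rows: two share a name but differ in specs, two are identical except for the ID
toy = pd.DataFrame({
    "car_ID": [1, 2, 3, 4],
    "CarName": ["audi 100ls", "audi 100ls", "audi fox", "audi fox"],
    "drivewheel": ["4wd", "fwd", "fwd", "fwd"],
    "price": [17450.0, 17710.0, 15250.0, 15250.0],
})

# Compare rows on everything except the unique identifier
feature_cols = [col for col in toy.columns if col != "car_ID"]
is_duplicate = toy.duplicated(subset=feature_cols, keep=False)
print(toy[is_duplicate])
```

With `keep=False`, every member of a duplicate group is marked, so rows that share a name but differ in any other column are left unflagged.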

## Save the data

Let's first make `car_ID` the index, so that the saved file does not carry a second, redundant index column:

In [94]:
df.set_index('car_ID', inplace=True)
df.head()

car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,carlength,...,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price,brand
1,3,alfa-romero giulia,gas,std,2,convertible,rwd,front,88.6,168.8,...,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0,alfa-romero
2,3,alfa-romero stelvio,gas,std,2,convertible,rwd,front,88.6,168.8,...,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0,alfa-romero
3,1,alfa-romero Quadrifoglio,gas,std,2,hatchback,rwd,front,94.5,171.2,...,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0,alfa-romero
4,2,audi 100 ls,gas,std,4,sedan,fwd,front,99.8,176.6,...,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0,audi
5,2,audi 100ls,gas,std,4,sedan,4wd,front,99.4,176.6,...,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0,audi


And then let's save the data:

In [95]:
df.to_csv("../data/cleaned/CarPrice_cleaned.csv")
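
When a file saved this way is loaded again, the index has to be restored explicitly. Below is a minimal round-trip sketch on a toy frame with made-up values, written to a temporary directory under a hypothetical file name so that it does not touch the project's data folders.

```python
import os
import tempfile

import pandas as pd

# Toy frame indexed by car_ID, standing in for the cleaned data
toy = pd.DataFrame(
    {"brand": ["audi", "bmw"], "price": [13950.0, 16430.0]},
    index=pd.Index([4, 11], name="car_ID"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cleaned.csv")  # hypothetical file name
    toy.to_csv(path)
    # index_col restores car_ID as the index instead of creating a new 0..n-1 index
    reloaded = pd.read_csv(path, index_col="car_ID")

print(reloaded)
```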

## Encoding
Although encoding is not needed for data visualization, it might be useful later, when the data is used to train a machine learning model.
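
The notebook does not specify which encoding is intended. As one common option, the sketch below shows one-hot encoding with `pd.get_dummies` on a toy frame with illustrative values; the choice of technique and of which columns to encode is an assumption.

```python
import pandas as pd

# Toy frame with two categorical columns (values illustrative)
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "drivewheel": ["rwd", "fwd", "4wd"],
    "price": [13495.0, 16500.0, 13950.0],
})

# One indicator column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["fueltype", "drivewheel"], dtype=int)
print(encoded.columns.tolist())
```

One-hot encoding suits unordered categories such as `fueltype`. For a column with many distinct values, such as a brand column, it produces one new column per value, which is worth keeping in mind before applying it to the whole dataset.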