# Used car price predictor

In [97]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OrdinalEncoder
import joblib

# Load and preprocess the dataset
data = pd.read_csv('Auto_Dataset.csv', encoding='unicode_escape')
data['price'] = data['price'].replace({'\\$': '', ',': ''}, regex=True).astype(float)
data['odometer'] = data['odometer'].str.replace('km', '').str.replace(',', '').astype(float)

# Drop unnecessary columns and handle missing values
data = data.drop(columns=['dateCrawled', 'seller', 'offerType', 'abtest', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen']).dropna()

# Encode categorical columns
ordinal_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
    data[column] = encoder.fit_transform(data[[column]])
    ordinal_encoders[column] = encoder

# Train a Random Forest model
X = data.drop(columns=['price'])
y = data['price']
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Save the model
joblib.dump(model, 'random_forest_model.pkl')

# Build the input row for the car whose price we want to predict
test_input = pd.DataFrame({
    'year': [2010],
    'odometer': [150000],
    'fuelType': ['diesel'],
    'gearbox': ['manuell'],
    'brand': ['ford']
})

# Align the input with the training columns (features not supplied above are
# filled with a placeholder 0), then encode the categorical columns
test_input = test_input.reindex(columns=X.columns, fill_value=0)
for column in test_input.select_dtypes(include=['object']).columns:
    test_input[column] = ordinal_encoders[column].transform(test_input[[column]])

# Predict car price
predicted_price = model.predict(test_input)

# Print the predicted price
print('The predicted price for the car is: ($)', predicted_price[0])


The predicted price for the car is: ($) 27485.84


In [98]:
# Build the input row for a second car to see how the prediction changes with different feature values
car_2_input = pd.DataFrame({
    'year': [2000],
    'odometer': [10000],
    'fuelType': ['benzin'],
    'gearbox': ['automatik'],
    'brand': ['ford']
})

# Align the input with the training columns (features not supplied above are
# filled with a placeholder 0), then encode the categorical columns
car_2_input = car_2_input.reindex(columns=X.columns, fill_value=0)
for column in car_2_input.select_dtypes(include=['object']).columns:
    car_2_input[column] = ordinal_encoders[column].transform(car_2_input[[column]])

# Predict car price for the second car
car_2_predicted_price = model.predict(car_2_input)

# Print the predicted price for the second car
print('The predicted price for the second car is: ($)', car_2_predicted_price[0])

The predicted price for the second car is: ($) 26457.78
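
Both cells repeat the same alignment and encoding steps, so they could be wrapped in a single function. The following is a minimal sketch under stated assumptions: a tiny made-up training frame stands in for `Auto_Dataset.csv`, and `predict_price`, `train`, `features`, and `forest` are hypothetical names that do not appear in the notebook. Unlike the cells above, this version raises an error when a feature is missing instead of filling it with 0.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OrdinalEncoder

# Made-up stand-in for the cleaned dataset; the values are illustrative only.
train = pd.DataFrame({
    'year': [2005, 2010, 2000, 2015, 2008, 2012],
    'odometer': [150000.0, 90000.0, 200000.0, 30000.0, 120000.0, 60000.0],
    'fuelType': ['diesel', 'benzin', 'benzin', 'diesel', 'benzin', 'diesel'],
    'gearbox': ['manuell', 'automatik', 'manuell', 'automatik', 'manuell', 'manuell'],
    'brand': ['ford', 'bmw', 'ford', 'bmw', 'ford', 'bmw'],
    'price': [3000.0, 9000.0, 1200.0, 21000.0, 4500.0, 12000.0],
})
categorical = ['fuelType', 'gearbox', 'brand']

# Fit one encoder per categorical column, as the notebook does.
encoders = {}
for column in categorical:
    encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
    train[column] = encoder.fit_transform(train[[column]]).ravel()
    encoders[column] = encoder

features = train.drop(columns=['price'])
forest = RandomForestRegressor(n_estimators=20, random_state=42)
forest.fit(features, train['price'])


def predict_price(car):
    """Encode one car (a dict of raw feature values) and return its predicted price."""
    missing = [name for name in features.columns if name not in car]
    if missing:
        # Design choice of this sketch: fail loudly rather than fill a placeholder.
        raise ValueError(f'missing features: {missing}')
    row = pd.DataFrame([car])[list(features.columns)].copy()
    for column, encoder in encoders.items():
        row[column] = encoder.transform(row[[column]]).ravel()
    return float(forest.predict(row)[0])


first = predict_price({'year': 2010, 'odometer': 150000, 'fuelType': 'diesel',
                       'gearbox': 'manuell', 'brand': 'ford'})
second = predict_price({'year': 2000, 'odometer': 10000, 'fuelType': 'benzin',
                        'gearbox': 'automatik', 'brand': 'ford'})
print(first, second)
```

With a helper like this, comparing further cars only needs one call per car, and a forgotten feature shows up as an error rather than as a silent placeholder.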


## Conclusion
In the first scenario, a 2010 Ford with 150,000 km on the odometer, a diesel engine, and a manual gearbox was predicted to sell for $27,485.84. The result reflects the combined influence of the supplied features: model year, mileage, fuel type, gearbox type, and brand. Any training feature not supplied in the input was filled with a placeholder value of 0 by the `reindex` step, so the prediction also depends on those placeholders. This car is the newer of the two compared here, but it has high mileage.
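
How much each feature influences the forest's predictions can be inspected through the `feature_importances_` attribute of a fitted `RandomForestRegressor`. The sketch below is only an illustration: it uses synthetic data with a made-up price formula instead of the notebook's dataset, and the names `X_demo`, `y_demo`, and `forest` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; the price formula below is an assumption for illustration.
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    'year': rng.integers(1995, 2016, n),
    'odometer': rng.integers(5000, 150001, n).astype(float),
    'gearbox': rng.integers(0, 2, n).astype(float),  # already-encoded category
})
# Price rises with year, falls with mileage, and ignores gearbox entirely.
y_demo = (500 * (X_demo['year'] - 1995)
          - 0.03 * X_demo['odometer']
          + rng.normal(0, 300, n))

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances, one per feature; they sum to 1.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Running the same two lines at the end on the notebook's own `model` and `X` would show which columns the trained forest actually relies on.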
In the second scenario, the inputs were changed to a 2000 Ford with only 10,000 km on the odometer, a petrol engine, and an automatic gearbox, and the predicted price dropped to $26,457.78, a difference of $1,028.06. Despite the much lower mileage, the older model year together with the changes in fuel type and gearbox type yields a slightly lower price than for the first car. Because four inputs changed at once, this single comparison cannot attribute the drop to any one feature. It only suggests that, for these inputs, the model weights a newer model year more heavily than low mileage.
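
Since the two cars differ in several features at once, one way to separate the effects is to change a single feature at a time and compare each prediction with the baseline. The following sketch assumes synthetic data and a made-up price formula, so its numbers say nothing about the real dataset; `predict_one`, `baseline`, `changes`, and `deltas` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; the price formula is an assumption for illustration.
rng = np.random.default_rng(1)
n = 400
X_demo = pd.DataFrame({
    'year': rng.integers(1995, 2016, n),
    'odometer': rng.integers(5000, 200001, n).astype(float),
})
y_demo = (500 * (X_demo['year'] - 1995)
          - 0.03 * X_demo['odometer']
          + rng.normal(0, 300, n))

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_demo, y_demo)


def predict_one(car):
    """Predict the price for one car given as a dict of feature values."""
    row = pd.DataFrame([car])[list(X_demo.columns)]
    return float(forest.predict(row)[0])


baseline = {'year': 2010, 'odometer': 150000.0}
changes = {'year': 2000, 'odometer': 10000.0}

base_price = predict_one(baseline)
deltas = {}
for name, value in changes.items():
    # Copy the baseline and overwrite exactly one feature.
    varied = {**baseline, name: value}
    deltas[name] = predict_one(varied) - base_price
    print(f'{name} -> {value}: change in predicted price {deltas[name]:+.0f}')
```

Applied to the notebook's model, the same loop over `year`, `odometer`, `fuelType`, and `gearbox` would show how much of the $1,028.06 gap each change accounts for on its own.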