# 🚗 Used Car Selling Price Analysis

## Project Overview
This notebook analyzes factors influencing the selling price of used cars. We'll explore a dataset containing various car attributes to help understand what drives prices in the used car market.

### Problem Statement
Our friend Otis wants to sell his car but isn't sure about the price. He wants to maximize profit while ensuring a reasonable deal for buyers. To help Otis, we'll analyze the dataset and determine the factors affecting car prices.

## Step 1: Import Required Libraries

In [None]:
# Import necessary libraries for data analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ Libraries imported successfully!")

## Step 2: Load and Prepare the Dataset

First, we'll load the dataset and remove any unnecessary columns.

In [None]:
# Load the dataset
# Note: Make sure 'output.csv' is in the same directory as this notebook
try:
    df = pd.read_csv('output.csv')
    # Drop the first column, which is assumed to be a saved row index
    df = df.iloc[:, 1:]
    print("✅ Dataset loaded successfully!")
    print(f"Dataset shape: {df.shape}")
except FileNotFoundError:
    print("❌ Error: 'output.csv' not found. Please ensure the file is in the correct directory.")
    print("You can download the dataset from the tutorial link.")
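
The loading cell drops the first column unconditionally, which is only right if that column really is a saved index. A more defensive sketch follows, assuming a saved index looks like 0, 1, 2, ...; the helper name `drop_saved_index` and the toy frames are hypothetical and not part of the tutorial.

```python
import pandas as pd

def drop_saved_index(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop the first column only if it looks like a saved 0..n-1 row index."""
    first = frame.iloc[:, 0]
    looks_like_index = (
        pd.api.types.is_integer_dtype(first)
        and first.tolist() == list(range(len(frame)))
    )
    return frame.iloc[:, 1:] if looks_like_index else frame

# Toy frames standing in for the real file (values are made up)
with_index = pd.DataFrame({"Unnamed: 0": [0, 1, 2], "make": ["audi", "bmw", "honda"]})
without_index = pd.DataFrame({"make": ["audi", "bmw", "honda"], "price": [13950, 16430, 6479]})

print(drop_saved_index(with_index).columns.tolist())
print(drop_saved_index(without_index).columns.tolist())
```

If the check is too strict for your file, inspecting `df.head()` before dropping anything is the simplest safeguard.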

## Step 3: Assign Column Headers

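The dataset doesn't have descriptive column names, so we'll assign headers based on the tutorial. Note that `pd.read_csv` treats the first row as a header by default; if `output.csv` has no header row, pass `header=None` in Step 2 so the first record isn't lost.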

In [None]:
# Define column headers
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
           "num-of-doors", "body-style", "drive-wheels", "engine-location",
           "wheel-base", "length", "width", "height", "curb-weight",
           "engine-type", "num-of-cylinders", "engine-size", "fuel-system",
           "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm",
           "city-mpg", "highway-mpg", "price"]

# Assign headers to the dataframe
df.columns = headers

# Display the first few rows to verify
print("✅ Column headers assigned successfully!")
print("\nFirst 5 rows of the dataset:")
df.head()

## Step 4: Check for Missing Values

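It's important to identify any missing values that might affect our analysis. Keep in mind that `isna()` only detects true `NaN`/`None` entries; placeholder strings such as `'?'` (which Step 6 handles for the price column) are not counted.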

In [None]:
# Check for missing values in each column
print("Missing values per column:")
print(df.isna().any())

# Count total missing values
print(f"\nTotal missing values: {df.isna().sum().sum()}")

# Display columns with missing values and their counts
missing_cols = df.columns[df.isna().any()].tolist()
if missing_cols:
    print("\nColumns with missing values:")
    for col in missing_cols:
        print(f"{col}: {df[col].isna().sum()} missing values")
else:
    print("\n✅ No missing values found!")
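
Because `isna()` ignores placeholder strings, a dataset that marks gaps with `'?'` can report zero missing values. The sketch below shows one way to surface them, using a made-up toy frame rather than the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with '?' standing in for missing entries (values are made up)
toy = pd.DataFrame({
    "normalized-losses": ["?", "164", "158"],
    "price": ["13495", "?", "17450"],
})

# '?' is an ordinary string, so nothing is reported as missing yet
print(toy.isna().sum().sum())

# Turn the placeholders into real NaN values, then count again
toy_clean = toy.replace("?", np.nan)
missing_per_column = toy_clean.isna().sum()
print(missing_per_column)
```

Running the same `replace` on `df` before the check above would make the per-column counts reflect the `'?'` entries as well.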

## Step 5: Convert MPG to L/100km

Fuel consumption is reported differently across regions, so we'll convert `city-mpg` from miles per gallon (MPG) to liters per 100 kilometers (L/100km) using L/100km ≈ 235 / MPG, which holds for US gallons.

In [None]:
# Create a copy of the dataframe to work with
data = df.copy()

# Convert city-mpg to L/100km (235 / MPG)
data['city-mpg'] = 235 / data['city-mpg']

# Rename the column to reflect the new unit
data.rename(columns={'city-mpg': 'city-L/100km'}, inplace=True)

print("✅ MPG to L/100km conversion completed!")
print("\nUpdated column names:")
print(data.columns.tolist())

print("\nData types after conversion:")
print(data.dtypes)
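
The factor 235 is a rounded constant. As a sketch of where it comes from, the conversion can be written from the unit definitions (US gallon and statute mile). The helper name `mpg_to_l_per_100km` is hypothetical, and converting `highway-mpg` as well is an illustration, not something the notebook does.

```python
import pandas as pd

LITERS_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344

def mpg_to_l_per_100km(mpg: pd.Series) -> pd.Series:
    """Convert US miles per gallon to liters per 100 km."""
    return 100 * LITERS_PER_US_GALLON / (KM_PER_MILE * mpg)

# Toy values standing in for the real columns
toy = pd.DataFrame({"city-mpg": [21, 19, 24], "highway-mpg": [27, 26, 30]})

for col in ["city-mpg", "highway-mpg"]:
    new_name = col.replace("mpg", "L/100km")
    toy[new_name] = mpg_to_l_per_100km(toy[col])

# Keep only the converted columns
toy = toy.drop(columns=["city-mpg", "highway-mpg"])
print(toy.round(2))
```

The exact factor works out to roughly 235.21, so using 235 changes results only in the fourth significant digit.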

## Step 6: Clean and Convert Price Column

The price column may use '?' to mark missing values. We drop those rows and then convert the column to integers.

In [None]:
# Check unique values in price column
print("Unique values in price column:")
print(data['price'].unique()[:10])  # Show first 10 unique values

# Remove rows where price is '?'; .copy() avoids pandas' SettingWithCopyWarning
# when the column is reassigned below
data = data[data['price'] != '?'].copy()

# Convert price to integer
data['price'] = data['price'].astype(int)

print("\n✅ Price column cleaned and converted to integer!")
print(f"Price range: ${data['price'].min():,} - ${data['price'].max():,}")
print(f"Average price: ${data['price'].mean():,.2f}")

print("\nUpdated data types:")
print(data.dtypes)
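
Filtering on `'?'` only handles that one placeholder. An alternative sketch, assuming any unparseable price should be treated as missing, uses `pd.to_numeric` with `errors="coerce"`; the toy frame is made up.

```python
import pandas as pd

toy = pd.DataFrame({"make": ["audi", "bmw", "honda"], "price": ["13950", "?", "6479"]})

# errors="coerce" turns anything unparseable (such as '?') into NaN
toy["price"] = pd.to_numeric(toy["price"], errors="coerce")

# Drop the rows without a price, then convert to integer
toy = toy.dropna(subset=["price"]).copy()
toy["price"] = toy["price"].astype(int)
print(toy)
```

This variant also copes with a price column that pandas has already parsed as numeric with `NaN` gaps.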

## Step 7: Normalize Features

To put features on a comparable scale, we divide `length`, `width`, and `height` by their maximum values (simple feature scaling), so the largest value in each column becomes 1.

In [None]:
# Scale length, width, and height by their maximum (the largest value becomes 1;
# the smallest stays above 0 rather than mapping to exactly 0)
data['length'] = data['length'] / data['length'].max()
data['width'] = data['width'] / data['width'].max()
data['height'] = data['height'] / data['height'].max()

print("✅ Length, width, and height normalized!")
print("\nAfter normalization (values should be between 0 and 1):")
print(f"Length range: {data['length'].min():.3f} - {data['length'].max():.3f}")
print(f"Width range: {data['width'].min():.3f} - {data['width'].max():.3f}")
print(f"Height range: {data['height'].min():.3f} - {data['height'].max():.3f}")
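
Dividing by the maximum is only one of several common scaling schemes. For comparison, here is a sketch of three of them on made-up length values; which one is appropriate depends on the later analysis, and the notebook itself uses only the first.

```python
import pandas as pd

length = pd.Series([168.8, 171.2, 176.6, 192.7], name="length")  # made-up values

# Simple feature scaling: the maximum maps to 1, the minimum stays above 0
simple = length / length.max()

# Min-max scaling: the range is exactly [0, 1]
min_max = (length - length.min()) / (length.max() - length.min())

# Z-score standardization: mean 0, sample standard deviation 1
z_score = (length - length.mean()) / length.std()

print(pd.DataFrame({"simple": simple, "min_max": min_max, "z_score": z_score}).round(3))
```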

## Step 8: Create Price Categories (Binning)

We'll categorize cars into three equal-width price ranges: Low, Medium, and High.

In [None]:
# Create bins for price categorization
bins = np.linspace(min(data['price']), max(data['price']), 4)
group_names = ['Low', 'Medium', 'High']

# Create a new column with price categories
data['price-binned'] = pd.cut(data['price'], bins, 
                              labels=group_names, 
                              include_lowest=True)

print("✅ Price categories created!")
print("\nDistribution of price categories:")
print(data['price-binned'].value_counts())

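# Visualize the distribution
# reindex keeps the bars in Low/Medium/High order so the colors match the labels
plt.figure(figsize=(8, 5))
data['price-binned'].value_counts().reindex(group_names).plot(kind='bar', color=['green', 'yellow', 'red'])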
plt.title('Distribution of Price Categories')
plt.xlabel('Price Category')
plt.ylabel('Number of Cars')
plt.xticks(rotation=0)
plt.show()
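
Equal-width bins can end up very uneven when prices are skewed. The sketch below contrasts them with equal-frequency bins from `pd.qcut` on made-up prices; `pd.qcut` is a swapped-in alternative for comparison, not what the notebook uses.

```python
import numpy as np
import pandas as pd

prices = pd.Series([5000, 6000, 7000, 8000, 9000, 45000])  # made-up, skewed prices
labels = ["Low", "Medium", "High"]

# Equal-width bins: each bin spans the same price range
edges = np.linspace(prices.min(), prices.max(), 4)
equal_width = pd.cut(prices, edges, labels=labels, include_lowest=True)

# Equal-frequency bins: each bin holds roughly the same number of cars
equal_freq = pd.qcut(prices, q=3, labels=labels)

print(equal_width.value_counts().reindex(labels))
print(equal_freq.value_counts().reindex(labels))
```

With one expensive outlier, nearly every car lands in the equal-width "Low" bin, while the quantile-based bins stay balanced.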

## Step 9: Convert Categorical Data to Numerical

Most machine learning models require numerical inputs. As an example, we'll one-hot encode the categorical `fuel-type` column.

In [None]:
# Example: Convert 'fuel-type' to dummy variables
fuel_dummies = pd.get_dummies(data['fuel-type'])
print("One-hot encoding for fuel-type (first 5 rows):")
fuel_dummies.head()
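
The cell above builds the dummy columns but leaves them separate from the data. A minimal sketch of joining them back, on a made-up toy frame, could look like this; the `prefix` and `dtype` choices are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas"], "price": [13950, 16430, 6479]})

# prefix= keeps the new column names readable; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(toy["fuel-type"], prefix="fuel-type", dtype=int)

# Replace the original categorical column with its encoded columns
encoded = pd.concat([toy.drop(columns="fuel-type"), dummies], axis=1)
print(encoded)
```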

In [None]:
# Get statistical summary of numerical columns
print("Statistical summary of numerical features:")
data.describe()

## Step 10: Data Visualization

Let's create various visualizations to understand the relationships in our data.

In [None]:
# Box plot of price distribution
plt.figure(figsize=(8, 6))
plt.boxplot(data['price'])
plt.title('Box Plot of Car Prices')
plt.ylabel('Price ($)')
plt.grid(True)
plt.show()

print("Price statistics:")
print(f"Median: ${data['price'].median():,.2f}")
print(f"Q1: ${data['price'].quantile(0.25):,.2f}")
print(f"Q3: ${data['price'].quantile(0.75):,.2f}")
print(f"IQR: ${data['price'].quantile(0.75) - data['price'].quantile(0.25):,.2f}")

In [None]:
# Box plot of price by drive-wheels
plt.figure(figsize=(10, 6))
sns.boxplot(x='drive-wheels', y='price', data=data)
plt.title('Price Distribution by Drive Wheels Type')
plt.xlabel('Drive Wheels')
plt.ylabel('Price ($)')
plt.show()

In [None]:
# Scatter plot of engine size vs price
plt.figure(figsize=(10, 6))
plt.scatter(data['engine-size'], data['price'], alpha=0.6)
plt.title('Engine Size vs Price')
plt.xlabel('Engine Size')
plt.ylabel('Price ($)')
plt.grid(True)
plt.show()

# Calculate correlation
correlation = data['engine-size'].corr(data['price'])
print(f"Correlation between engine size and price: {correlation:.3f}")
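
`Series.corr` returns only the coefficient. If a p-value is also wanted, `scipy.stats.pearsonr` provides both. The sketch below uses made-up engine sizes and prices, so the printed numbers say nothing about the real dataset.

```python
import pandas as pd
from scipy import stats

# Made-up values standing in for the real columns
toy = pd.DataFrame({
    "engine-size": [97, 109, 130, 152, 181, 209],
    "price": [7000, 9500, 13500, 16500, 21000, 30000],
})

# pearsonr returns the coefficient together with a two-sided p-value
r, p_value = stats.pearsonr(toy["engine-size"], toy["price"])

# Series.corr uses Pearson by default, so the two coefficients should agree
r_pandas = toy["engine-size"].corr(toy["price"])

print(f"Pearson r (scipy):  {r:.3f}, p-value: {p_value:.4f}")
print(f"Pearson r (pandas): {r_pandas:.3f}")
```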

## Step 11: Group Analysis

Let's analyze average prices by drive-wheels and body-style.

In [None]:
# Select relevant columns and group by drive-wheels and body-style
test = data[['drive-wheels', 'body-style', 'price']]
data_grp = test.groupby(['drive-wheels', 'body-style'], 
                        as_index=False).mean()

print("Average price by drive-wheels and body-style:")
data_grp

In [None]:
# Create a pivot table for better visualization
data_pivot = data_grp.pivot(index='drive-wheels',
                            columns='body-style',
                            values='price')

print("Pivot table of average prices:")
data_pivot
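
The group-then-pivot sequence can also be done in one call with `pivot_table`. This is a sketch on made-up rows; it also counts the combinations that have no cars, which appear as `NaN` in the table.

```python
import pandas as pd

toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "rwd"],
    "body-style": ["sedan", "hatchback", "sedan", "sedan", "convertible"],
    "price": [9000, 7000, 20000, 24000, 30000],
})

# pivot_table groups and averages in a single call
toy_pivot = toy.pivot_table(index="drive-wheels", columns="body-style",
                            values="price", aggfunc="mean")

# Combinations with no cars show up as NaN; counting them is safer than
# filling with 0, since a 0 would read as a real average price on a heatmap
empty_cells = int(toy_pivot.isna().sum().sum())
print(toy_pivot)
print("Empty cells:", empty_cells)
```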

In [None]:
# Create a heatmap of the pivot table
plt.figure(figsize=(10, 6))
sns.heatmap(data_pivot, annot=True, cmap='RdBu', fmt='.0f', 
            linewidths=1, cbar_kws={'label': 'Average Price ($)'})
plt.title('Average Price Heatmap: Drive Wheels vs Body Style')
plt.tight_layout()
plt.show()

## Step 12: Statistical Analysis - ANOVA Test

Let's perform a one-way ANOVA test to check whether the mean prices of Honda and Subaru cars differ significantly.

In [None]:
# Prepare data for ANOVA
# Import the stats submodule explicitly; sp.stats is not guaranteed to be loaded
from scipy import stats

data_anova = data[['make', 'price']]
# Group by a single key (not a one-element list) so get_group accepts a plain string
grouped_anova = data_anova.groupby('make')

# Perform ANOVA test between Honda and Subaru
try:
    anova_results = stats.f_oneway(
        grouped_anova.get_group('honda')['price'],
        grouped_anova.get_group('subaru')['price']
    )

    print("ANOVA Test Results: Honda vs Subaru")
    print(f"F-statistic: {anova_results.statistic:.4f}")
    print(f"P-value: {anova_results.pvalue:.4f}")

    if anova_results.pvalue < 0.05:
        print("✅ Result: Significant difference in prices (p < 0.05)")
    else:
        print("❌ Result: No significant difference in prices (p >= 0.05)")

except KeyError as e:
    print(f"❌ Error: {e}. Make sure 'honda' and 'subaru' exist in the dataset.")
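
With only two groups, a one-way ANOVA is equivalent to an equal-variance two-sample t-test: the F-statistic equals the squared t-statistic and the p-values match. The sketch below illustrates this on made-up prices for two hypothetical makes.

```python
import numpy as np
from scipy import stats

# Made-up prices for two hypothetical makes
group_a = np.array([6500, 7100, 7300, 8800, 9100])
group_b = np.array([7000, 7900, 8500, 9900, 11200])

f_stat, p_anova = stats.f_oneway(group_a, group_b)
t_stat, p_ttest = stats.ttest_ind(group_a, group_b)  # equal variances assumed by default

print(f"ANOVA:  F   = {f_stat:.4f}, p = {p_anova:.4f}")
print(f"t-test: t^2 = {t_stat**2:.4f}, p = {p_ttest:.4f}")
```

ANOVA becomes the more useful tool once three or more makes are compared at the same time.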

In [None]:
# Regression plot of engine size vs price
plt.figure(figsize=(10, 6))
sns.regplot(x='engine-size', y='price', data=data, scatter_kws={'alpha':0.5})
plt.title('Engine Size vs Price with Regression Line')
plt.xlabel('Engine Size')
plt.ylabel('Price ($)')
plt.ylim(bottom=0)
plt.show()
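
`sns.regplot` draws the fitted line but does not report its parameters. One way to obtain them, sketched here on made-up values, is `scipy.stats.linregress`; the numbers it prints do not describe the real dataset.

```python
import numpy as np
from scipy import stats

# Made-up values standing in for engine-size and price
engine_size = np.array([97, 109, 130, 152, 181, 209])
price = np.array([7000, 9500, 13500, 16500, 21000, 30000])

# Ordinary least-squares fit of price on engine size
fit = stats.linregress(engine_size, price)

print(f"slope: {fit.slope:.1f} dollars per unit of engine size")
print(f"intercept: {fit.intercept:.1f}")
print(f"R^2: {fit.rvalue**2:.3f}")
```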

## Summary and Conclusions

Based on our analysis, we've discovered several key insights:

1. **Engine Size Impact**: Engine size is positively correlated with price (see the correlation coefficient in Step 10 and the regression plot), suggesting that cars with larger engines tend to sell for more.

2. **Drive Wheels Influence**: The box plot in Step 10 shows price distributions differing by drive-wheel type, with some configurations priced noticeably higher. This is a visual comparison, not a formal test.

3. **Body Style Variations**: Different body styles show distinct price patterns when combined with drive wheel types.

4. **Statistical Significance**: The ANOVA test indicates whether the price difference between Honda and Subaru is statistically significant at the 0.05 level.

### Recommendations for Otis:
* Focus on key features like engine size and drive wheels when determining a fair price
* Compare similar body styles and brands in the market
* Use the visualizations to understand where his car fits in the overall price distribution

### Next Steps:
* Consider building a predictive model to estimate price based on features
* Explore more advanced feature engineering
* Analyze additional factors like mileage and condition if data becomes available
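
As a starting point for the predictive model mentioned in the next steps, a linear fit can be sketched with plain NumPy. The feature choice (`engine-size`, `curb-weight`) and all values are assumptions for illustration; a real model would be fit on the cleaned data and evaluated on held-out rows.

```python
import numpy as np

# Made-up feature rows: [engine-size, curb-weight]; targets are made-up prices
X = np.array([[97, 2000], [109, 2300], [130, 2500],
              [152, 2900], [181, 3100], [209, 3400]], dtype=float)
y = np.array([7000, 9500, 13500, 16500, 21000, 30000], dtype=float)

# Add an intercept column and solve the least-squares problem
X_design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Training error only; this says nothing about performance on unseen cars
predicted = X_design @ coef
rmse = float(np.sqrt(np.mean((predicted - y) ** 2)))

print("coefficients (intercept, engine-size, curb-weight):", coef.round(2))
print(f"training RMSE: {rmse:.1f}")
```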

In [None]:
# Final dataset information
print("Final dataset shape:", data.shape)
print("\nColumns in final dataset:")
for col in data.columns:
    print(f"- {col}: {data[col].dtype}")

print("\n✅ Analysis complete!")