# **Title of Project**

# **Mileage Prediction using Regression Analysis**

## **Objective**

Predict a car's fuel efficiency in miles per gallon (MPG) from features such as cylinders, displacement, horsepower, weight, acceleration, and origin.

## **Data Source**

The dataset is sourced from [YBI Foundation Dataset](https://github.com/YBI-Foundation/Dataset/raw/main/MPG.csv).

## **Import Library**

In [None]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score



## **Import Data**

In [None]:
url = 'https://github.com/YBI-Foundation/Dataset/raw/main/MPG.csv'
data = pd.read_csv(url)
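
The cells below index columns named `MPG`, `Horsepower`, `Origin`, and `Car Name`. A quick check right after loading can catch a naming mismatch early, for example a CSV whose headers are lowercase. This is a minimal sketch: `missing_columns` is a hypothetical helper, and the expected names are taken from the cells in this notebook, not verified against the file.

```python
def missing_columns(columns, expected):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in expected if name not in present]


# Column names the later cells rely on (assumed from this notebook).
EXPECTED = ['MPG', 'Horsepower', 'Origin', 'Car Name']

# Hypothetical header that uses different names, for illustration only.
example_header = ['mpg', 'cylinders', 'horsepower', 'origin', 'name']
print(missing_columns(example_header, EXPECTED))
```

In the notebook this would be called as `missing_columns(data.columns, EXPECTED)`; an empty list means every column the later cells index is present.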

## **Describe Data**

In [None]:
print("First five rows of the dataset:")
print(data.head())

print("\nInformation about the dataset:")
data.info()  # info() prints its report directly and returns None

print("\nSummary statistics of the dataset:")
print(data.describe())

# Check for missing values
print("\nMissing values in each column:")
print(data.isnull().sum())

## **Data Visualization**

In [None]:
sns.histplot(data['MPG'], kde=True)
plt.title('Distribution of MPG')
plt.xlabel('Miles Per Gallon (MPG)')
plt.ylabel('Frequency')
plt.show()


## **Data Preprocessing**

In [None]:
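# Handle missing values in 'Horsepower': '?' marks a missing entry
data['Horsepower'] = data['Horsepower'].replace('?', np.nan).astype(float)
data['Horsepower'] = data['Horsepower'].fillna(data['Horsepower'].median())

# Encode the categorical 'Origin' column and drop the free-text car name
data = pd.get_dummies(data, columns=['Origin'], drop_first=True)
data = data.drop(columns=['Car Name'])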
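
To see what these preprocessing steps do, here is a sketch on a four-row toy frame. The values are made up, and the column names are assumed to match the ones used above. It uses `pd.to_numeric(..., errors='coerce')` in place of the `replace`/`astype` pair, which turns any non-numeric entry, not only `'?'`, into `NaN`.

```python
import pandas as pd

# Toy frame with made-up values that mimics the columns used above.
toy = pd.DataFrame({
    'MPG': [18.0, 25.0, 31.0, 22.0],
    'Horsepower': ['130', '?', '65', '95'],
    'Origin': ['usa', 'japan', 'europe', 'usa'],
    'Car Name': ['a', 'b', 'c', 'd'],
})

# Non-numeric entries such as '?' become NaN; the rest become floats.
toy['Horsepower'] = pd.to_numeric(toy['Horsepower'], errors='coerce')

# Fill the gap with the median of the observed values, assigning back
# to the column rather than modifying it in place.
toy['Horsepower'] = toy['Horsepower'].fillna(toy['Horsepower'].median())

# One-hot encode Origin; drop_first removes the first level in sorted order.
toy = pd.get_dummies(toy, columns=['Origin'], drop_first=True)
toy = toy.drop(columns=['Car Name'])

print(toy.columns.tolist())
print(toy['Horsepower'].tolist())
```

With `drop_first=True`, the dropped level (`europe` in this toy frame) becomes the baseline: a row with zeros in every remaining `Origin_*` column belongs to it.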




## **Define Target Variable (y) and Feature Variables (X)**

In [None]:
# Features are every remaining column except the target
X = data.drop('MPG', axis=1)
# Target variable
y = data['MPG']

## **Train Test Split**

In [None]:
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## **Modeling**

In [None]:
# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict MPG for the held-out test rows
y_pred = model.predict(X_test)
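
A fitted `LinearRegression` exposes its weights through `coef_` and `intercept_`, which helps when checking which features drive the prediction. The sketch below uses synthetic, noise-free data with made-up feature names `a` and `b` so that the recovered coefficients are known in advance; it is an illustration, not output from this notebook's model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: target = 3*a - 2*b + 5, with no noise.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({'a': rng.normal(size=50), 'b': rng.normal(size=50)})
y_demo = 3 * X_demo['a'] - 2 * X_demo['b'] + 5

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its column name for readability.
coefficients = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coefficients)
print('Intercept:', demo_model.intercept_)
```

For the model above, the equivalent would be `pd.Series(model.coef_, index=X_train.columns)`. Each coefficient is expressed in the units of its feature, so magnitudes are not directly comparable unless the features are scaled first.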

## **Model Evaluation**

In [None]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'\nMean Squared Error: {mse}')
print(f'R-squared: {r2}')
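
For reference, the two metrics can be written out directly: MSE is the mean of the squared residuals, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes both with NumPy and adds RMSE, which is in the same units as MPG. `regression_metrics` is a hypothetical helper, and the numbers passed to it are made up.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, and R-squared from actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = np.mean(residuals ** 2)            # average squared error
    rmse = np.sqrt(mse)                      # back in the target's units
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {'mse': mse, 'rmse': rmse, 'r2': r2}


# Made-up values for illustration only.
metrics = regression_metrics([20.0, 30.0, 40.0], [22.0, 29.0, 37.0])
print(metrics)
```

Calling `regression_metrics(y_test, y_pred)` should agree with `mean_squared_error` and `r2_score` up to floating-point rounding.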

## **Prediction**

In [None]:
# Plot actual vs predicted MPG; points near the diagonal are well predicted
plt.scatter(y_test, y_pred)
plt.xlabel('Actual MPG')
plt.ylabel('Predicted MPG')
plt.title('Actual vs Predicted MPG')
plt.show()
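
Predicting for a car that is not in the dataset requires a row with exactly the columns the model was trained on, including the dummy columns created during preprocessing. A minimal sketch, assuming a fitted model and a training frame: the frame, the `Weight` column, and all values here are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data standing in for X_train / y_train.
X_train_demo = pd.DataFrame({
    'Weight': [2000, 2500, 3000, 3500],
    'Origin_japan': [1, 0, 0, 0],
    'Origin_usa': [0, 1, 1, 0],
})
y_train_demo = pd.Series([34.0, 27.0, 22.0, 18.0])
demo_model = LinearRegression().fit(X_train_demo, y_train_demo)

# A new car described only by the columns that are non-zero for it.
new_car = pd.DataFrame({'Weight': [2800], 'Origin_usa': [1]})

# Align to the training columns: absent dummies become 0, order matches.
new_car_aligned = new_car.reindex(columns=X_train_demo.columns, fill_value=0)

predicted_mpg = demo_model.predict(new_car_aligned)
print(predicted_mpg)
```

`reindex` with `fill_value=0` is only safe for dummy columns; a missing numeric feature such as weight should be supplied rather than defaulted to zero.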




## **Conclusion**

The Linear Regression model was trained to predict the MPG of cars. Its performance on the 20% test split is summarized by the Mean Squared Error (`mse`) and the R-squared value (`r2`) printed in the Model Evaluation cell. MSE is the average squared prediction error, in squared MPG units; R-squared is the fraction of the variance in MPG that the features explain.
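
To report the two metrics rounded to two decimals, the summary sentence can be built with an f-string from the `mse` and `r2` variables. A minimal sketch: `conclusion_text` is a hypothetical helper, and the numbers passed to it are placeholders, not results from this notebook.

```python
def conclusion_text(mse, r2):
    """Fill the summary sentence with the computed metrics."""
    return (
        f'The Mean Squared Error of the model is {mse:.2f} '
        f'and the R-squared value is {r2:.2f}.'
    )


# Placeholder numbers for illustration only.
summary = conclusion_text(12.3456, 0.789)
print(summary)
```

In the notebook, `print(conclusion_text(mse, r2))` placed after the evaluation cell would print the sentence with the actual values.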
