<a href="https://colab.research.google.com/github/kanishka9389/kanishka/blob/main/Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name** - Car Price Prediction



##### **Project Type** - Linear Regression
##### **Individual** **project**
##### **Name** - **Kanishka**

# **Project Summary -**

**Car Price Prediction**

This project builds a predictive model that estimates car prices from the features in a structured dataset. The dataset contains 205 observations and 26 columns, covering both numerical and categorical variables that describe vehicle specifications and attributes. The primary goal is to apply data preprocessing and machine learning techniques to predict the price of a car accurately, which has applications in automotive sales, insurance, and valuation.

Dataset Overview

The dataset comprises several key types of information:

Identification and Categorical Descriptors: car_ID, CarName, fueltype, aspiration, doornumber, carbody, drivewheel, enginelocation, enginetype, cylindernumber, and fuelsystem.

Dimensional Features: wheelbase, carlength, carwidth, and carheight.

Performance and Mechanical Features: curbweight, enginesize, boreratio, stroke, compressionratio, horsepower, peakrpm, citympg, and highwaympg.

Target Variable: price.

The price variable, which is continuous in nature, serves as the output for the regression models to predict.

Exploratory Data Analysis and Preprocessing

An initial exploratory data analysis (EDA) would reveal insights into data distribution, correlations, and potential outliers. For example, features like enginesize, horsepower, and curbweight are typically strongly correlated with price, while categorical features such as fueltype or carbody may also influence consumer perception and pricing strategies.

Preprocessing steps would involve:

Handling Categorical Variables: Encoding non-numeric features using techniques like one-hot encoding.

Feature Engineering: Extracting meaningful components from CarName (e.g., brand), removing irrelevant or redundant identifiers like car_ID.

Normalization: Scaling features such as enginesize and horsepower to ensure comparability across different ranges.

Dealing with Inconsistencies: Standardizing text fields (e.g., misspellings in car brand names).
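The preprocessing steps listed above can be sketched as follows. This is a minimal illustration, not the notebook's final pipeline: the `preprocess` helper, the `BRAND_FIXES` mapping, and the three-row toy frame are assumptions made up for the example, and the brand is assumed to be the first word of `CarName`.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical typo -> canonical brand mapping, for illustration only
BRAND_FIXES = {"toyouta": "toyota", "vw": "volkswagen"}

def preprocess(cars: pd.DataFrame) -> pd.DataFrame:
    """Extract the brand, drop identifiers, scale numerics, one-hot encode text."""
    out = cars.copy()
    # Feature engineering: treat the first token of CarName as the brand,
    # then standardize known misspellings
    out["brand"] = out["CarName"].str.split().str[0].str.lower().replace(BRAND_FIXES)
    # Remove the raw name and the row identifier
    out = out.drop(columns=["CarName", "car_ID"])
    # Normalization: rescale the chosen numeric columns to the [0, 1] range
    numeric_cols = ["enginesize", "horsepower"]
    out[numeric_cols] = MinMaxScaler().fit_transform(out[numeric_cols])
    # One-hot encode every remaining text column
    return pd.get_dummies(out, drop_first=True)

# Tiny made-up frame, not rows taken from the real dataset
toy = pd.DataFrame({
    "car_ID": [1, 2, 3],
    "CarName": ["toyouta corolla", "vw rabbit", "toyota corona"],
    "fueltype": ["gas", "diesel", "gas"],
    "enginesize": [98, 97, 110],
    "horsepower": [70, 52, 92],
    "price": [6938.0, 7775.0, 8358.0],
})
processed = preprocess(toy)
print(processed.columns.tolist())
```

In a real modeling workflow the scaler would normally be fitted on the training split only and then applied to the test split, so that no test information leaks into training.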

Modeling Approach

Given the regression nature of the task, multiple machine learning algorithms could be explored and compared, such as:

Linear Regression: For baseline performance and interpretability.

Ridge and Lasso Regression: To handle multicollinearity and perform feature selection.

Decision Tree and Random Forest Regressors: To capture non-linear relationships.

Gradient Boosting Methods (XGBoost/LightGBM): For improved accuracy with ensemble learning.

Support Vector Regression (SVR) and Neural Networks: For complex modeling needs, if necessary.

Models would be evaluated using appropriate regression metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R² score.
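As a rough sketch of how such a comparison might be set up, the snippet below fits three of the linear models named above and reports MAE, RMSE, and R² on a held-out split. The data is synthetic (randomly generated, not the car dataset), and the `evaluate` helper and the `alpha` values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit a regressor and return MAE, RMSE and R2 on the held-out split."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "MAE": float(mean_absolute_error(y_test, pred)),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": float(r2_score(y_test, pred)),
    }

# Synthetic stand-in data (NOT the car dataset): a price-like target that is
# linear in two features plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 13000 + 4000 * X[:, 0] + 1500 * X[:, 1] + rng.normal(scale=500, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),   # alpha chosen arbitrarily for the sketch
    "lasso": Lasso(alpha=1.0),   # alpha chosen arbitrarily for the sketch
}
scores = {
    name: evaluate(model, X_train, X_test, y_train, y_test)
    for name, model in models.items()
}
for name, result in scores.items():
    print(name, {metric: round(value, 3) for metric, value in result.items()})
```

MAE and RMSE are expressed in the units of the target (price), while R² is unitless, which is why the three are usually reported together.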

# **GitHub Link -**

https://github.com/kanishka9389/kanishka.git

# **Problem Statement**


**Problem Statement**-

In the competitive automotive market, accurately determining the price of a vehicle is critical for manufacturers, dealers, and consumers. Car prices are influenced by a variety of factors, including technical specifications, design features, brand reputation, and performance metrics. However, manual estimation or reliance on limited criteria can lead to inconsistent and inaccurate pricing, potentially impacting sales, customer satisfaction, and market competitiveness.

The objective of this project is to develop a machine learning model that can accurately predict the price of a car based on its features. Using a structured dataset containing 205 entries and 26 variables—including engine characteristics, fuel type, body style, and performance parameters—this model aims to identify the most influential features and generate price predictions that align closely with actual market values.

By automating the pricing process through data-driven insights, this solution will assist stakeholders in making informed decisions related to car valuation, purchasing, and sales strategies.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

### Dataset Loading

In [None]:
df=pd.read_csv('/content/CarPrice_project.csv')
df.head()

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
df.describe()

### Dataset First View

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
df.shape

### Dataset Information

In [None]:
df.info()

#### Duplicate Values

In [None]:
# Count fully duplicated rows (0 means the dataset has no duplicates)
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Count missing values per column instead of printing the full boolean frame
df.isnull().sum()

### What did you know about your dataset?

1. **Car Price Prediction**: The dataset is likely related to predicting car prices, as indicated by the file name (CarPrice_project.csv) and the presence of a 'price' column.
2. **Various Car Features**: The dataset contains information about car features such as make and model, fuel type, engine specifications, dimensions, mileage, and more.
3. **Numerical and Categorical Data**: The dataset includes both numerical (e.g., horsepower, mileage) and categorical (e.g., fuel type, car body type) variables.


## ***2. Understanding Your Variables***

In [None]:
# Assuming df is your DataFrame and 'carbody' is the relevant column
Top3_carbody = df['carbody'].value_counts().head(3).reset_index()
Top3_carbody.columns = ['CarBody', 'Count']
print(Top3_carbody)

In [None]:
df.describe()

### Variables Description

**Identification:**

car_ID: Unique car identifier.

CarName: Car model.

**Categorical:** Describing car attributes like fuel type, engine, body style, etc.

fueltype, aspiration, doornumber, carbody, drivewheel, enginelocation, enginetype, cylindernumber, fuelsystem

**Numerical:** Measurable car characteristics like size, weight, engine specs, mileage, etc.

symboling, wheelbase, carlength, carwidth, carheight, curbweight, enginesize, boreratio, stroke, compressionratio, horsepower, peakrpm, citympg, highwaympg

**Target:** The variable to predict.

price: Car price.
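One way to recover this grouping programmatically is to split the columns by dtype. The sketch below assumes a hypothetical `split_columns` helper and uses a made-up two-row frame rather than the real data.

```python
import pandas as pd

def split_columns(cars: pd.DataFrame, target: str = "price"):
    """Return (numeric feature names, non-numeric feature names)."""
    # Numeric columns, excluding the target so it is not treated as a feature
    numeric = (
        cars.select_dtypes(include="number")
        .columns.drop(target, errors="ignore")
        .tolist()
    )
    # Everything that is not numeric is treated as categorical/text
    categorical = cars.select_dtypes(exclude="number").columns.tolist()
    return numeric, categorical

# Made-up rows for illustration only
toy = pd.DataFrame({
    "CarName": ["audi 100ls", "bmw 320i"],
    "fueltype": ["gas", "gas"],
    "symboling": [2, 0],
    "horsepower": [102, 101],
    "price": [13950.0, 16430.0],
})
numeric_cols, categorical_cols = split_columns(toy)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Note that a dtype-based split classifies integer-coded columns such as `symboling` as numeric, so any column that should be treated as a category has to be converted explicitly.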

### Check Unique Values for each variable.

In [None]:
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Column: {column}")
    print(f"Unique Values: {unique_values}\n")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Assuming 'df' is your DataFrame

# 1. Removing irrelevant columns:
# Check if columns exist before dropping
if 'CarName' in df.columns and 'car_ID' in df.columns:
    df = df.drop(['CarName', 'car_ID'], axis=1)
else:
    print("Columns 'CarName' and/or 'car_ID' have already been removed.")

# 2. Renaming columns for better readability:
df = df.rename(columns={'wheelbase': 'wheel_base', 'enginesize': 'engine_size'})

# 3. Handling missing values:
# Assuming no missing values based on the provided dataset
# If there were missing values, you could use:
# df['horsepower'].fillna(df['horsepower'].mean(), inplace=True)

# 4. Feature engineering:
# Creating a new feature 'power_to_weight' (example)
# df['power_to_weight'] = df['horsepower'] / df['curbweight']

# 5. Data type conversion:
# Converting 'symboling' to categorical (if needed)
# df['symboling'] = df['symboling'].astype('category')

### What all manipulations have you done and insights you found?

Manipulations:

Removed irrelevant columns (CarName, car_ID).

Renamed columns for better readability (wheelbase to wheel_base, enginesize to engine_size).

Included handling for potential missing values (though none were present).

Provided an example of feature engineering (power_to_weight, but commented out).

Included an example of data type conversion (symboling to categorical, but commented out).

Insights:

Simplified the dataset, reducing noise and potential dimensionality issues.

Improved code readability and consistency.

Prepared the data for analysis and modeling by addressing missing values, potentially engineering features, and potentially handling categorical variables appropriately.
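If the commented-out steps were enabled, they might look roughly like the sketch below. It assumes `horsepower`, `curbweight`, and `symboling` are all still present in the frame; the `engineer_features` helper and the toy values are made up for illustration.

```python
import pandas as pd

def engineer_features(cars: pd.DataFrame) -> pd.DataFrame:
    """Fill missing horsepower, add power_to_weight, make symboling categorical."""
    out = cars.copy()
    # Missing values: impute horsepower with the column mean before using it
    out["horsepower"] = out["horsepower"].fillna(out["horsepower"].mean())
    # Feature engineering: horsepower per unit of curb weight
    out["power_to_weight"] = out["horsepower"] / out["curbweight"]
    # Data type conversion: treat the integer-coded rating as a category
    out["symboling"] = out["symboling"].astype("category")
    return out

# Made-up rows for illustration only (one horsepower value is missing)
toy = pd.DataFrame({
    "horsepower": [100.0, None, 200.0],
    "curbweight": [2000, 2500, 4000],
    "symboling": [3, 1, -1],
})
engineered = engineer_features(toy)
print(engineered)
```

Assigning the result of `fillna` back to the column avoids the chained `inplace=True` pattern, which newer pandas versions warn about.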

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your DataFrame

plt.figure(figsize=(10, 6))  # Adjust figure size as needed
sns.histplot(df['price'], bins=20, kde=True)  # Use histplot for histogram with optional KDE
plt.title('Distribution of Car Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I picked the histogram (specifically using sns.histplot) for this visualization because:

**Distribution:** Histograms are ideal for visualizing the distribution of a single numerical variable, like car prices.

**Frequency:** They show the frequency of data points within specific price ranges, highlighting common price points.

**Patterns:** They can reveal skewness, central tendency, and outliers in the data.

**Continuous Data:** Histograms are designed for continuous numerical data like car prices.

**Easy Interpretation:** They are easy to understand, with price ranges on the x-axis and frequency on the y-axis.

**KDE:** The kde=True option adds a smooth curve to highlight the distribution's shape.

##### 2. What is/are the insight(s) found from the chart?

**Distribution:** Shows the overall shape (likely right-skewed), indicating more lower-priced cars and fewer high-priced ones.

**Central Tendency:** Gives an idea of the typical price range (mode) and approximate mean/median.

**Price Range:** Reveals the minimum and maximum prices in the dataset.

**Outliers:** Helps identify unusual prices outside the typical range.

**Frequency:** Shows the number of cars within different price ranges, indicating common price points.
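The visual impressions above (skewness, central tendency, outliers) can be backed with numbers. The sketch below uses a hypothetical `summarize_distribution` helper, the common 1.5 × IQR rule for outliers, and a made-up price series rather than the real column.

```python
import pandas as pd

def summarize_distribution(values: pd.Series) -> dict:
    """Return mean, median, skewness and the number of 1.5*IQR outliers."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": float(values.mean()),
        "median": float(values.median()),
        # Positive skew means a longer tail towards high values
        "skew": float(values.skew()),
        "n_outliers": int(((values < lower) | (values > upper)).sum()),
    }

# Made-up prices for illustration only, with one deliberately extreme value
toy_prices = pd.Series(
    [5000, 6000, 7000, 7500, 8000, 9000, 10000, 12000, 40000], dtype=float
)
summary = summarize_distribution(toy_prices)
print(summary)
```

On the real data this would be called as `summarize_distribution(df['price'])`; a mean noticeably above the median together with a positive skew would support the right-skew reading of the histogram.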

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Pricing, Inventory, Marketing, Competitive Analysis, Product Development.

**Negative Growth Insights:**

**Ignoring demand:** Not aligning with customer preferences for price ranges.

**Inventory mismanagement:** Overstocking or understocking based on inaccurate price distribution assumptions.

**Misaligned marketing:** Targeting wrong customer segments with ineffective pricing messages.

**Lack of innovation**: Missing opportunities to address market gaps or changing customer price preferences.

Justification: The histogram provides valuable data about the market, and using it strategically can drive positive outcomes. However, misinterpreting or ignoring its insights can have negative consequences for business growth.

#### Chart - 2

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your DataFrame

plt.figure(figsize=(10, 6))  # Adjust figure size as needed
sns.scatterplot(x='horsepower', y='price', data=df)  # Create scatter plot
plt.title('Horsepower vs. Price')
plt.xlabel('Horsepower')
plt.ylabel('Price')
plt.show()

##### 1. Why did you pick the specific chart?

I picked the scatter plot (specifically using sns.scatterplot) for this visualization because:

**Visualizing Relationships:** Scatter plots are ideal for showing the relationship between two numerical variables, like horsepower and price.

**Identifying Patterns:** They help see if there's a correlation (positive, negative, or none) between the variables.

**Detecting Outliers:** Scatter plots make it easy to spot unusual data points.

**Suitable for Continuous Data:** Both horsepower and price are continuous numerical data, making scatter plots appropriate.

**Intuitive Interpretation:** They are relatively easy to understand and interpret.


##### 2. What is/are the insight(s) found from the chart?

**Correlation:** Shows whether there's a relationship between horsepower and price (likely positive).

**Strength of Relationship:** Indicates how strong the correlation is (e.g., strong, moderate, weak).

**Outliers:** Reveals any unusual cars with unexpected horsepower-price combinations.

**Price Trends:** Helps see how price changes as horsepower increases.

**Potential Clusters:** Might identify groups of cars with similar horsepower and price ranges.
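The direction and strength of the relationship can be quantified with the Pearson correlation coefficient. The sketch below computes it by hand and checks the result against pandas; the `pearson_r` helper and the six data points are made up for illustration.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center both variables
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Made-up values for illustration only
toy = pd.DataFrame({
    "horsepower": [60, 70, 90, 110, 150, 200],
    "price": [6000, 7000, 9500, 12000, 18000, 35000],
})
r_manual = pearson_r(toy["horsepower"], toy["price"])
r_pandas = toy["horsepower"].corr(toy["price"])  # pandas uses Pearson by default
print(round(r_manual, 4), round(r_pandas, 4))
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 a weak one. Pearson correlation only measures linear association, so a curved pattern in the scatter plot can still produce a modest coefficient.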

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

**Pricing Strategy**: The scatter plot can help businesses understand the relationship between horsepower and price, which is crucial for setting competitive prices.

**Product Development:** The insights can guide product development decisions.

**Marketing and Sales**: The insights can guide marketing and sales strategies.

**Negative Growth Insights:**


**Misalignment with demand:** Offering horsepower levels that don't match customer preferences.

**Overemphasis on horsepower:** Focusing solely on horsepower when other factors might be more important to customers (e.g., fuel efficiency, features).

**Neglecting price sensitivity:** Ignoring the potential impact of price differences on customer choices, particularly in certain horsepower segments.

**Justification:**

**Misalignment with market:** Businesses risk losing customers if their offerings don't meet customer preferences revealed in the scatter plot.

**Inefficient resource allocation:** Overemphasizing horsepower might divert resources from other important aspects of product development or marketing.

**Lost sales opportunities:** Neglecting price sensitivity could lead to pricing products out of the market for certain customer segments.

#### Chart - 3

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your DataFrame and 'carbody' is the column for car body types

plt.figure(figsize=(10, 6))  # Adjust figure size as needed
sns.countplot(x='carbody', data=df)  # Create bar chart
plt.title('Distribution of Car Body Types')
plt.xlabel('Car Body Type')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.show()

##### 1. Why did you pick the specific chart?

I picked the bar chart (specifically using sns.countplot) for this visualization because:

Suitable for Categorical Data: Bar charts are the primary choice for visualizing the distribution of categorical data. Car body types are categorical variables, meaning they represent distinct categories or groups (e.g., sedan, hatchback, SUV). Bar charts effectively show the frequency or count of each category.

sns.countplot for Frequency: The sns.countplot function in Seaborn is specifically designed for creating bar charts that display the counts of different categories in a column. It simplifies the process of generating a frequency distribution for categorical data.

Clear Comparison of Categories: Bar charts make it easy to compare the frequencies of different car body types. The height of each bar directly corresponds to the count of cars belonging to that category, allowing for quick visual comparisons.

Easy Interpretation: Bar charts are generally easy to understand and interpret. The x-axis represents the different car body types, and the y-axis represents the count or frequency of each category. This simple structure makes it accessible to a wide audience.

Effective for Showing Popularity: In this specific case, we want to understand the popularity or distribution of different car body types. Bar charts are an effective way to visually represent this information, highlighting the most and least frequent categories.

##### 2. What is/are the insight(s) found from the chart?

Most Popular Car Body Types: The chart will clearly show which car body types are most frequent in the dataset. These will be the categories with the tallest bars. For example, if the 'sedan' category has the tallest bar, it indicates that sedans are the most common car body type in the dataset.

Least Popular Car Body Types: Similarly, the chart will reveal the least frequent car body types, represented by the shortest bars. For example, if the 'convertible' category has a very short bar, it suggests that convertibles are relatively uncommon in the dataset.

Overall Distribution: You can get a sense of the overall distribution of car body types. Is it relatively balanced, with similar frequencies across categories, or are there a few dominant categories and several less common ones?

Market Trends: The distribution of car body types can provide insights into market trends and customer preferences. For example, if SUVs have a high frequency in the dataset, it might indicate a growing popularity of SUVs in the market.

Inventory Management: For businesses, this information can be valuable for inventory management decisions. They might want to stock more cars of the popular body types to meet customer demand and avoid overstocking less popular ones.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Targeted Marketing: Businesses can use the insights to tailor their marketing campaigns to specific customer segments. For example, if the bar chart shows that sedans are the most popular car body type, businesses can focus their marketing efforts on promoting sedans to a wider audience.

**Negative Growth Insights:**

Ignoring Customer Preferences: If businesses ignore the insights from the bar chart and fail to offer a sufficient variety of popular car body types, they risk losing customers to competitors who do. This can lead to negative growth in sales and market share.

Overstocking Unpopular Models: If businesses overstock unpopular car body types, they incur unnecessary inventory costs and risk having unsold inventory. This can negatively impact profitability and cash flow.

Misaligned Marketing Efforts: If businesses target their marketing efforts on unpopular car body types, they waste resources and fail to reach their target audience effectively. This can lead to a decline in sales and brand awareness.

Lack of Innovation: If businesses fail to adapt to changing customer preferences and market trends, they risk losing their competitive edge. This can result in negative growth and a decline in market share.

**Justification:**

The bar chart provides valuable insights into the popularity of different car body types, which is a key factor in understanding customer preferences and market trends. Businesses can leverage these insights to make data-driven decisions about marketing, product development, inventory management, and pricing strategies. By aligning their offerings with customer preferences and market trends, businesses can enhance their chances of success and drive positive business impact. However, failing to consider the insights from the bar chart can lead to negative growth due to factors like ignoring customer preferences, overstocking unpopular models, misaligned marketing efforts, and a lack of innovation.

#### Chart - 4

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your DataFrame

plt.figure(figsize=(10, 6))  # Adjust figure size as needed
sns.scatterplot(x='engine_size', y='price', data=df)  # Use the correct column name 'engine_size'
plt.title('Engine Size vs. Price')
plt.xlabel('Engine Size')
plt.ylabel('Price')
plt.show()

##### 1. Why did you pick the specific chart?

I picked the scatter plot (specifically using sns.scatterplot) for this visualization because:

**Scatter plots are ideal for showing the relationship between two numerical variables**, like engine size and price. They help reveal **patterns, correlations, and outliers**, making them suitable for exploring how engine size might influence car prices.


##### 2. What is/are the insight(s) found from the chart?

**Positive Correlation:** Engine size and price tend to increase together.

**Correlation Strength:** How closely points cluster indicates the strength of this relationship.

**Outliers:** Unusual cars with unexpected engine size/price combinations.

**Price Ranges:** Shows how price varies for different engine sizes.

**Price Segmentation:** Potential for grouping cars based on engine size and price.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

**Product Development:** Guide decisions by identifying engine size/price point demand.

**Pricing Strategy:** Understand price elasticity within engine size segments.

**Competitive Analysis:** Benchmark engine size/price offerings against competitors.

**Negative Growth Insights:**

**Misalignment with Customer Preferences:** Offering wrong engine sizes or price points.

**Overemphasis on Engine Size:** Ignoring other customer needs (fuel efficiency, features).

**Neglecting Price Sensitivity:** Misjudging customer willingness to pay for larger engines.

**Justification:** Aligning product offerings, pricing, and marketing with customer preferences (revealed in the scatter plot) drives growth. Ignoring these insights can lead to lost sales, overstocked inventory, and missed opportunities.

#### Chart - 5

In [None]:
# 'Top 10' here means the first 10 rows of the dataset; df.head(10) does not rank the cars
top_10_cars = df.head(10).reset_index(drop=True)

plt.figure(figsize=(10, 5))
# Plot city mileage by row position. 'carbody' is still a plain categorical column
# at this point (no one-hot encoding has been applied), so it is used only for tick labels
plt.plot(top_10_cars.index, top_10_cars["citympg"], marker='o', linestyle='-', color='b')

plt.xlabel("Car Body Type", fontsize=12)
plt.ylabel("City Mileage (MPG)", fontsize=12)
plt.title("Top 10 Cars vs City Mileage", fontsize=15)

# Label each position on the x-axis with that car's body type
plt.xticks(top_10_cars.index, top_10_cars['carbody'], rotation=45, fontsize=10)
plt.grid(True, linestyle="--", alpha=0.5)

plt.show()


##### 1. Why did you pick the specific chart?

I picked the line chart (specifically using plt.plot) for this visualization because:

**Line charts are effective for showing trends and comparisons over a discrete variable.** In this case, the discrete variable is the top 10 cars, and the line chart helps visualize how mileage changes across these cars and to highlight the differences and top performers. They are a concise way to present specific mileage values of fuel-efficient vehicles for potential fuel savings.

##### 2. What is/are the insight(s) found from the chart?

**Mileage Variation:** Shows how city mileage (MPG) varies across the top 10 cars.

**Top Performers:** Highlights cars with the highest mileage for fuel efficiency.

**Relative Comparisons:** Easy to compare mileage between different cars.

**Trends:** May reveal trends in mileage based on car ranking (e.g., price, rating).

**Outliers:** Identifies cars with unusual mileage compared to others.

**Fuel Savings:** Provides insights for potential fuel savings based on mileage.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

**Marketing and Sales:** Businesses can promote fuel-efficient models and target advertising based on the insights.

**Negative Growth Insights:**

**Ignoring Customer Preferences:** Lack of fuel-efficient options could lead to lost customers.

**Overlooking Emerging Trends:** Not adapting to changing fuel economy standards could make models less competitive.

**Misaligned Marketing Efforts**: Promoting cars with lower city mileage while customer preferences shift towards fuel-efficient vehicles might be ineffective.

**Missed Product Development Opportunities:** Not investing in fuel-efficient technologies could result in falling behind competitors.

Justification:

The line chart provides valuable insights into fuel efficiency, which influences customer decisions. Leveraging these insights for marketing and product development can drive positive business impact. Ignoring them could lead to negative growth due to misalignment with market trends and customer preferences.

#### Chart - 6

In [None]:
import matplotlib.pyplot as plt
import pandas as pd  # Import pandas

# Assuming 'df' is your original DataFrame
# Recalculate Top3_carbody here
Top3_carbody = df['carbody'].value_counts().head(3).reset_index()
Top3_carbody.columns = ['CarBody', 'Count']

plt.figure(figsize=(8, 10))
plt.pie(Top3_carbody['Count'], labels=Top3_carbody['CarBody'], autopct='%.0f%%', explode=[0.02] * 3)
plt.title('Top 3 Car Body Categories Distribution', fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

I picked the pie chart (specifically using plt.pie) for this visualization because:

**Pie charts effectively visualize the proportions or parts of a whole, allowing for easy comparison of the relative sizes of the top 3 car body categories.** They provide a clear and simple representation, emphasizing the relative importance of each category. This makes them well-suited for highlighting the distribution and popularity of different car body types within the top 3.

##### 2. What is/are the insight(s) found from the chart?

The pie chart reveals the distribution of the top 3 car body categories, highlighting the **dominant car body type**, their **relative proportions**, potential **market share**, and possible **customer preferences**. This information can be valuable for businesses in understanding market trends and making informed decisions.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes, the insights can help with pricing strategies, targeted marketing, and inventory optimization.

**Negative Growth Insights:** Ignoring customer preferences for certain car body types could lead to lost sales and misaligned inventory. Overstocking unpopular models could also negatively impact profitability.

**Justification:** Aligning business strategies with customer preferences, revealed in the pie chart, can drive positive impact. Neglecting these preferences can lead to negative growth due to misalignment with market demand.

#### Chart - 7

In [None]:
plt.figure(figsize=(10, 5))
# Draw the KDE first: seaborn sets its own axis labels, so custom labels are applied afterwards
# ('fill' replaces the deprecated 'shade' argument)
size_distribution_graph = sns.kdeplot(df['price'], color="lightgreen", fill=True)
plt.xlabel("Price")
plt.ylabel("Density")  # a KDE shows probability density, not frequency
plt.grid()
plt.title('Price Distribution', size=20);

##### 1. Why did you pick the specific chart?

I picked the KDE plot (specifically using sns.kdeplot) for this visualization because:

KDE plots are excellent for visualizing the probability density of continuous data like 'price', providing a smooth, continuous representation of the distribution and revealing patterns such as skewness, central tendency, and potential outliers. This makes them well-suited for understanding the overall shape and spread of the price distribution, which is crucial for this analysis.

##### 2. What is/are the insight(s) found from the chart?

Insights from the KDE Plot of Car Prices:

**Price Distribution:** Shows how car prices are spread.

**Typical Price:** Indicates the most frequent price range.

**Outliers:** Reveals any unusually high or low prices.

**Price Segmentation:** May suggest different price groups in the market.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes, the insights can guide pricing strategies, inventory management, marketing, and product development.

**Negative Growth Insights:** Ignoring customer preferences (price sensitivity), overemphasizing the average price, misinterpreting outliers, or lacking innovation could lead to negative growth.

**Justification:** Aligning business strategies with the insights (customer preferences, market trends) from the KDE plot leads to positive impact, while ignoring them can cause misalignment and hinder growth.

#### Chart - 8

In [None]:
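# Correlation matrix of the numerical columns only
numerical_features = df.select_dtypes(include=np.number)
correlation_matrix = numerical_features.corr()
plt.figure(figsize=(12, 10))  # enlarge the figure so the annotations stay readable
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix of Numerical Features')
plt.show()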


##### 1. Why did you pick the specific chart?

I picked the heatmap (specifically using sns.heatmap) for this visualization because:

Heatmaps are excellent for visualizing correlation matrices, which show the relationships between multiple numerical variables. The heatmap uses color intensity to represent the strength and direction of correlations, making it easy to identify patterns and relationships at a glance. This makes it ideal for quickly understanding the complex relationships within the dataset.

##### 2. What is/are the insight(s) found from the chart?

**Correlation Strength:** Shows how strongly numerical features are related (positive/negative).

**Key Relationships:** Highlights the most important relationships between variables.

**Multicollinearity:** Identifies features with strong correlations (redundancy).

**Feature Selection:** Helps choose relevant features for machine learning.

**Data Understanding:** Provides a comprehensive view of numerical relationships.
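To move from reading colors to a concrete multicollinearity check, the strongly correlated feature pairs can be listed directly. The sketch below is one possible approach: the `high_corr_pairs` helper, the 0.8 threshold, and the toy values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation of 1.0) is excluded
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # Comparing against the threshold also discards any masked (NaN) entries
    return pairs[pairs > threshold].sort_values(ascending=False)

# Made-up values for illustration only
toy = pd.DataFrame({
    "enginesize": [90, 100, 120, 150, 200],
    "horsepower": [60, 70, 95, 120, 180],
    "peakrpm": [5500, 4800, 5200, 5000, 5100],
})
redundant = high_corr_pairs(toy)
print(redundant)
```

Pairs returned by such a check are candidates for dropping one feature or for using a regularized model such as Ridge, which is less sensitive to correlated predictors than plain linear regression.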



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** The insights can be leveraged for pricing strategies, product development, targeted marketing, inventory management and competitive analysis.

**Insights that Lead to Negative Growth:** Ignoring negative correlations, overemphasizing highly correlated features, misinterpreting weak correlations, or ignoring multicollinearity.

**Justification:** By understanding feature relationships, businesses can make data-driven decisions for positive impact. Neglecting or misinterpreting these relationships can hinder growth due to misaligned strategies.

#### Chart - 9

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your DataFrame

plt.figure(figsize=(10, 6))  # Adjust figure size as needed
sns.scatterplot(x='citympg', y='price', data=df)
plt.title('City MPG vs. Price')
plt.xlabel('City MPG')
plt.ylabel('Price')
plt.show()

##### 1. Why did you pick the specific chart?

I picked the scatter plot (specifically using sns.scatterplot) for this visualization because:

Scatter plots are ideal for showing the relationship between two numerical variables, like citympg and price. They help reveal patterns, correlations, and outliers, making them suitable for exploring how city mileage might influence car prices.

##### 2. What is/are the insight(s) found from the chart?

**Negative Correlation:** City MPG and price tend to have an inverse relationship (higher MPG, lower price).

**Correlation Strength:** How closely points cluster indicates the strength of this relationship.

**Outliers:** Unusual cars with unexpected MPG/price combinations.

**Price Ranges:** Shows how price varies for different MPG levels.

**Price Segmentation:** Potential for grouping cars based on MPG and price.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes, the insights can guide pricing strategies, inventory management, marketing, and product development.

**Negative Growth Insights:** Ignoring customer preferences (price sensitivity), overemphasizing the average price, misinterpreting outliers, or lacking innovation could lead to negative growth.

**Justification:** Aligning business strategies with the insights (customer preferences, market trends) from the scatter plot leads to positive impact, while ignoring them can cause misalignment and hinder growth.



#### Chart - 10

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your DataFrame

plt.figure(figsize=(10, 6))  # Adjust figure size as needed
sns.histplot(df['price'], bins=20, kde=True)  # Use histplot for histogram with optional KDE
plt.title('Distribution of Car Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I picked the histogram (specifically using sns.histplot) for this visualization because:

**Histograms are excellent for visualizing the distribution of continuous numerical data like 'price', showing the frequency of data points within specific price ranges, and revealing patterns such as skewness, central tendency, and outliers.** This makes them well-suited for understanding the overall shape and spread of the price distribution, which is crucial for this analysis.

##### 2. What is/are the insight(s) found from the chart?

**Central Tendency:** Shows the typical or average price range.

**Spread:** Indicates the range and variability of prices.

**Skewness:** Reveals if the distribution is symmetrical or skewed.

**Outliers:** Identifies unusual or extreme prices.

**Price Segmentation:** May suggest different price groups in the market.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

**Pricing Strategy:** Understand typical price ranges and customer price sensitivity.

**Inventory Management:** Optimize stock levels based on popular price ranges.

**Marketing & Sales:** Target specific price segments for effective marketing.

**Product Development:** Identify gaps and opportunities for new car models.

**Negative Growth Insights:**

**Ignoring Customer Preferences:** Neglecting price sensitivity can lead to lost sales.

**Overemphasis on Average:** Focusing solely on average price might overlook other segments.

**Misinterpreting Outliers:** Dismissing unusual prices could miss potential opportunities.

**Lack of Innovation:** Failing to adapt to changing price trends could hinder growth.

**Justification:**

**Positive:** Aligning strategies with customer preferences and market trends revealed in the histogram drives growth.

**Negative:** Ignoring these insights leads to misalignment with market demands and customer expectations, hindering growth.

#### Chart - 11

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your DataFrame

plt.figure(figsize=(10, 6))  # Adjust figure size as needed
sns.scatterplot(x='horsepower', y='citympg', data=df)
plt.title('Horsepower vs. City MPG')
plt.xlabel('Horsepower')
plt.ylabel('City MPG')
plt.show()

##### 1. Why did you pick the specific chart?

I picked the scatter plot (specifically using sns.scatterplot) for this visualization because:

**Scatter plots are ideal for visualizing the relationship between two numerical variables, in this case, horsepower and city mileage (citympg).** They effectively reveal patterns, correlations, and outliers, making them suitable for exploring how these two variables might be related.

##### 2. What is/are the insight(s) found from the chart?

**Relationship:** Shows how horsepower and city mileage are related (likely negative correlation).

**Correlation Strength:** Indicates how strong the relationship is (e.g., strong, moderate, weak).

**Outliers:** Reveals unusual cars with unexpected horsepower/city mileage combinations.

**Trends:** Helps see how city mileage changes as horsepower increases.

**Segmentation:** Potential for grouping cars based on horsepower and city mileage.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

**Product Strategy:** Guide decisions on horsepower/fuel efficiency balance.

**Marketing:** Highlight fuel efficiency or performance based on target segments.

**Pricing:** Adjust prices based on horsepower and fuel efficiency combinations.

**Negative Growth Insights:**

**Ignoring Preferences:** Not offering cars with desired horsepower/fuel efficiency combinations.

**Misaligned Marketing:** Promoting fuel efficiency to performance-focused buyers.

**Inaccurate Pricing:** Overpricing or underpricing based on misinterpreting trends.

**Justification:**

**Positive:** Aligning offerings with customer preferences revealed in the scatter plot drives growth.

**Negative:** Ignoring customer preferences or market trends can lead to lost sales and missed opportunities.

#### Chart - 12

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Assuming 'df' is your DataFrame

# 1. Create bins for car width
car_width_bins = pd.cut(df['carwidth'], bins=5, labels=['Very Narrow', 'Narrow', 'Medium', 'Wide', 'Very Wide'])

# 2. Calculate average price for each bin
avg_price_by_width = df.groupby(car_width_bins)['price'].mean().reset_index()

# 3. Create the bar graph
plt.figure(figsize=(10, 6))
plt.bar(avg_price_by_width['carwidth'], avg_price_by_width['price'])
plt.title('Average Price by Car Width')
plt.xlabel('Car Width Bins')
plt.ylabel('Average Price')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.show()

##### 1. Why did you pick the specific chart?

**Reasons for Choosing a Bar Graph:**

**Bar charts are effective for comparing average values across different categories, in this case, car width bins.** They clearly show how average price changes with car width, making it easy to identify trends and potential price differences.

##### 2. What is/are the insight(s) found from the chart?

 **Insights from the Bar Graph of Average Price by Car Width:**

**Price Trend:** Shows how average car price changes with car width (likely positive correlation).

**Price Differences:** Reveals average price variations between car width categories.

**Market Segmentation:** Potential for grouping cars based on width and price sensitivity.

**Pricing Strategy:** Insights for setting competitive prices based on car width.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

**Pricing Strategy:** Optimize pricing based on car width and customer preferences.

**Product Development:** Understand demand for different car widths and price points.

**Marketing:** Target specific segments with messages about width and value.

**Negative Growth Insights:**

**Ignoring Preferences:** Not offering car widths or price points that align with demand.

**Misaligned Marketing:** Promoting wide cars to price-sensitive buyers or vice versa.

**Inaccurate Pricing:** Overpricing or underpricing based on misinterpreting trends.

**Justification:**

**Positive:** Aligning offerings with customer preferences revealed in the bar chart drives growth.

**Negative:** Ignoring these preferences or market trends can lead to lost sales and missed opportunities.

#### Chart - 13

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Assuming 'df' is your DataFrame

# Reload the original DataFrame to get the 'CarName' column back.
# Note: this discards any changes made to 'df' in earlier cells.
df = pd.read_csv('/content/CarPrice_project.csv')

# 1. Extract a four-digit year from 'CarName', if one is present.
# Caveat: the pattern matches any four consecutive digits, so a model number
# such as "5000" would also be read as a year. Rows without a match become NaN
# and are left out of the groupby below.
df['year'] = df['CarName'].str.extract(r'(\d{4})', expand=False).astype(float)

# 2. Calculate average price for each year
avg_price_by_year = df.groupby('year')['price'].mean().reset_index()

# 3. Create the line graph
plt.figure(figsize=(10, 6))  # Adjust figure size as needed
plt.plot(avg_price_by_year['year'], avg_price_by_year['price'], marker='o', linestyle='-')
plt.title('Average Car Price Over the Years')
plt.xlabel('Year')
plt.ylabel('Average Price')
plt.grid(True)  # Add a grid for better readability
plt.show()

##### 1. Why did you pick the specific chart?

**Reasons for Choosing a Line Graph:**

**Line charts are effective for showing trends over time.** In this case, it visualizes how average car prices change over the years, allowing for easy identification of patterns and overall price trends.

##### 2. What is/are the insight(s) found from the chart?

Here are the insights from the line graph:

**Overall Price Trend:** Shows whether average car prices are increasing, decreasing, or stable over time.

**Price Fluctuations:** Highlights any significant price changes or periods of volatility.

**Year-to-Year Comparisons:** Allows for easy comparison of average prices between different years.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes, understanding price trends can help with pricing strategies, inventory management, and market forecasting.

**Negative Growth Insights:** Ignoring downward trends could lead to overpricing and lost sales. Not adapting to changing market conditions could also hinder growth.

**Justification:** Aligning business strategies with market trends (revealed in the chart) drives positive impact. Neglecting these trends can lead to negative growth due to misalignment with market dynamics.

#### Chart - 14 - Correlation Heatmap

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Assuming 'df' is your DataFrame

# 1. Select numerical features for correlation analysis
numerical_features = df.select_dtypes(include=['number'])

# 2. Calculate the correlation matrix
correlation_matrix = numerical_features.corr()

# 3. Create the heatmap
plt.figure(figsize=(12, 8))  # Adjust figure size as needed
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

##### 1. Why did you pick the specific chart?

I picked the heatmap because it's excellent for visualizing correlation matrices, showing relationships between multiple numerical variables using color intensity to represent correlation strength and direction, making it easy to identify patterns and relationships at a glance.

##### 2. What is/are the insight(s) found from the chart?

**Insights from the Correlation Heatmap:**

**Correlation Strength:** Shows how strongly numerical features are related (positive/negative).

**Key Relationships:** Highlights the most important relationships between variables.

**Multicollinearity:** Identifies features with strong correlations (redundancy).

**Feature Selection:** Helps choose relevant features for machine learning.

**Data Understanding:** Provides a comprehensive view of numerical relationships.

#### Chart - 15 - Pair Plot

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming 'df' is your DataFrame

# 1. Select numerical features for the pair plot
numerical_features = ['horsepower', 'citympg', 'highwaympg', 'price']  # Add more features as needed

# 2. Create the pair plot
sns.pairplot(df[numerical_features])
plt.show()

##### 1. Why did you pick the specific chart?

I picked the pair plot because it's great for visualizing relationships between multiple numerical features, showing both individual distributions (histograms) and pairwise relationships (scatter plots) in a concise grid, making it easy to identify patterns and correlations.

##### 2. What is/are the insight(s) found from the chart?

**Insights from the Pair Plot:**

**Individual Distributions:** Shows the distribution (shape, central tendency, spread) of each numerical feature.

**Relationships:** Reveals correlations (positive, negative, or none) between pairs of features.

**Outliers:** Helps identify unusual data points that deviate from the general patterns.

**Data Patterns:** Provides a visual overview of patterns and relationships within the numerical data.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Hypothetical Statements:**

**1. Cars with higher horsepower tend to have higher prices.** (Based on the scatter plot of 'horsepower' vs. 'price', which likely showed a positive correlation.)

**2. Cars with higher city MPG (fuel efficiency) tend to have lower prices.** (Based on the scatter plot of 'citympg' vs. 'price', which likely showed a negative correlation.)

**3. There is a significant difference in price between cars with different car body types (e.g., sedan, hatchback, SUV).** (Based on the count plot of 'carbody', which showed varying frequencies of different car body types, suggesting potential price differences.)

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** There is no significant relationship between a car's horsepower and its price.

**Alternate Hypothesis (H1):** There is a significant relationship between a car's horsepower and its price.

**In simpler terms:**

Null Hypothesis: Horsepower doesn't really affect the price of a car.

Alternate Hypothesis: Horsepower does affect the price of a car.

#### 2. Perform an appropriate statistical test.

In [None]:
import pandas as pd
from scipy.stats import pearsonr

# Assuming 'df' is your DataFrame

# 1. Select the relevant columns (the hypothesis is about horsepower and price)
horsepower = df['horsepower']
price = df['price']

# 2. Perform Pearson correlation test
correlation_coefficient, p_value = pearsonr(horsepower, price)

# 3. Print the results
print(f"Pearson Correlation Coefficient: {correlation_coefficient}")
print(f"P-value: {p_value}")

##### Which statistical test have you done to obtain P-Value?

**Pearson correlation test**

##### Why did you choose the specific statistical test?

The Pearson correlation test was chosen because it's the most appropriate for assessing the relationship between two continuous variables (horsepower and price) and determining if the relationship is statistically significant, which is the goal of this analysis. It also aligns with the nature of the data and the research question being explored.
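
To turn the test output into a conclusion, the p-value is compared against a significance level chosen in advance. The following is a minimal sketch, assuming a significance level of 0.05 (a common convention, not something stated in this notebook) and synthetic data in place of the horsepower and price columns.

```python
import numpy as np
from scipy.stats import pearsonr

ALPHA = 0.05  # assumed significance level

# Synthetic stand-in for two positively related continuous variables
rng = np.random.default_rng(42)
x = rng.normal(loc=100, scale=30, size=205)
y = 120 * x + rng.normal(scale=2000, size=205)

r, p_value = pearsonr(x, y)

# Reject the null hypothesis of "no linear relationship" only if p < ALPHA
if p_value < ALPHA:
    decision = "reject H0"
else:
    decision = "fail to reject H0"

print(f"r = {r:.3f}, p = {p_value:.3g}, decision: {decision}")
```

The sign of `r` gives the direction of the relationship, while the decision only says whether a linear relationship is statistically detectable, not that one variable causes the other.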

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** City MPG has no effect on a car's price.

**Alternate Hypothesis (H1):** City MPG does affect a car's price.

#### 2. Perform an appropriate statistical test.

In [None]:
import scipy.stats as stats

# Assuming your DataFrame is named 'df'
citympg = df['citympg']
price = df['price']

# Calculate Pearson correlation and p-value
correlation, p_value = stats.pearsonr(citympg, price)

print(f"Pearson correlation coefficient: {correlation}")
print(f"P-value: {p_value}")

##### Which statistical test have you done to obtain P-Value?

**Pearson correlation test.**

##### Why did you choose the specific statistical test?

I chose the **Pearson correlation test** because both variables are numeric (city MPG and price) and the test's p-value indicates whether a linear relationship between them is likely to be real rather than a coincidence.

**Type of Data:** Both 'citympg' and 'price' are continuous numerical variables.

**Hypothesis:** We are testing for a correlation between these two variables, specifically a negative correlation.

**Assumptions:** As with Statement 1, the assumptions of the Pearson correlation test are taken to be reasonably met here.

**Purpose:** The test measures the strength and direction of a linear relationship between two continuous variables.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** There is no significant difference in price between cars with different car body types.

**Alternate Hypothesis (H1):** There is a significant difference in price between cars with different car body types.

#### 2. Perform an appropriate statistical test.

In [None]:
import statsmodels.formula.api as sm
import statsmodels.stats.anova as smsa # Import the correct module for anova_lm

# Assuming your DataFrame is named 'df'
model = sm.ols('price ~ carbody', data=df).fit()
anova_table = smsa.anova_lm(model, typ=2) # Use smsa to call anova_lm

print(anova_table)

##### Which statistical test have you done to obtain P-Value?

 **ANOVA (Analysis of Variance).**

##### Why did you choose the specific statistical test?

**ANOVA** was chosen because it tests whether the mean price differs across car body types, which is exactly the question this hypothesis asks.
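
As a cross-check on the formula-based ANOVA, the same one-way test can be run by passing one array of prices per group to `scipy.stats.f_oneway`. The sketch below uses a small synthetic table; the column names `body` and `price` and the group means are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic stand-in data: three groups with different mean prices
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "body": np.repeat(["sedan", "hatchback", "convertible"], 40),
    "price": np.concatenate([
        rng.normal(14000, 2000, 40),
        rng.normal(10000, 2000, 40),
        rng.normal(22000, 2000, 40),
    ]),
})

# Build one array of prices per category
groups = [g["price"].to_numpy() for _, g in demo.groupby("body")]

# One-way ANOVA: H0 says all group means are equal
f_stat, p_value = f_oneway(*groups)

print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

A small p-value only says that at least one group mean differs; it does not say which groups differ from each other.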

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
import pandas as pd

# Assuming 'df' is your DataFrame
missing_values = df.isnull().sum()
print(missing_values)

#### What all missing value imputation techniques have you used and why did you use those techniques?

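No missing value imputation was applied. The cell above only reports the number of missing values per column; imputation would be needed only for columns where that count is greater than zero.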

### 2. Handling Outliers

In [None]:
import numpy as np

# Assuming 'df' is your DataFrame and 'price' is the column with outliers
upper_limit = df['price'].quantile(0.95)  # 95th percentile
lower_limit = df['price'].quantile(0.05)  # 5th percentile

df['price'] = np.where(df['price'] > upper_limit, upper_limit,
                        np.where(df['price'] < lower_limit, lower_limit, df['price']))

##### What all outlier treatment techniques have you used and why did you use those techniques?

**Technique: Winsorization**

**Why:**

1-Keeps all the data, just adjusts the extreme values.

2-Makes the data more reliable for analysis by limiting the influence of outliers.

3-It's a common and effective way to deal with outliers without losing too much information.
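
The capping logic can also be written with `Series.clip`, which makes the effect easy to check on a toy series. This is a sketch on made-up numbers; the 5th and 95th percentile limits mirror the cell above.

```python
import pandas as pd

# Toy prices with one extreme value at each end
prices = pd.Series([1, 10, 11, 12, 13, 14, 15, 16, 17, 100], dtype=float)

lower_limit = prices.quantile(0.05)
upper_limit = prices.quantile(0.95)

# Values outside [lower_limit, upper_limit] are pulled to the nearest limit;
# values inside the range are left unchanged
winsorized = prices.clip(lower=lower_limit, upper=upper_limit)

print(f"before: min={prices.min()}, max={prices.max()}")
print(f"after:  min={winsorized.min()}, max={winsorized.max()}")
```

No rows are removed: the series keeps its length, and only the two extreme values change.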

### 3. Categorical Encoding

In [None]:
import pandas as pd

# Assuming 'df' is your DataFrame
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
print(categorical_columns)

#### What all categorical encoding techniques have you used & why did you use those techniques?

**Technique:** One-Hot Encoding

**Why:**

1-Converts categories into numbers that machine learning models can understand.

2-Avoids imposing a false order or relationship between categories (like saying "red" is greater than "blue").

3-It's a widely used and effective way to represent categorical data in a way that models can use.
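
The cell above lists the categorical columns; the encoding step itself could look like the following sketch. It uses `pd.get_dummies` on a tiny made-up frame whose column names are borrowed from the dataset. Setting `drop_first=True` is an assumption made here to avoid perfectly collinear dummy columns in a linear model.

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset
demo = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas", "gas"],
    "carbody": ["sedan", "hatchback", "wagon", "sedan"],
    "price": [10000.0, 12000.0, 11000.0, 15000.0],
})

categorical_columns = ["fueltype", "carbody"]

# Each category becomes its own 0/1 column; the first category of every
# feature is dropped and acts as the reference level
encoded = pd.get_dummies(demo, columns=categorical_columns, drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

With `drop_first=True`, a row with zeros in all dummy columns of a feature belongs to the dropped reference category (here `diesel` and `hatchback`).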

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
import re

contractions_dict = {
    "don't": "do not",
    "I'm": "I am",
    "they're": "they are",
    # ... add more contractions
}

def expand_contractions(text):
    for contraction, expansion in contractions_dict.items():
        # re.escape guards against regex metacharacters in the contraction
        text = re.sub(r'\b' + re.escape(contraction) + r'\b', expansion, text)
    return text

# Example usage
text = "I don't think they're coming."
expanded_text = expand_contractions(text)
print(expanded_text)

#### 2. Lower Casing

In [None]:
text = "This is a Sample Text with Uppercase Letters."
lowercased_text = text.lower()

print(lowercased_text)  # Output: this is a sample text with uppercase letters.

#### 3. Removing Punctuations

In [None]:
import string

text = "This is a sample text with punctuations! It contains commas, periods, and question marks?"

# Create a translation table to remove punctuations
translator = str.maketrans('', '', string.punctuation)

# Apply the translation table to the text
text_without_punctuations = text.translate(translator)

print(text_without_punctuations)

#### 4. Removing URLs & Removing words containing digits

In [None]:
import re

text = "This is a sample text with a URL: https://www.example.com. Please visit!"

# Remove URLs using regular expressions
text_without_urls = re.sub(r'http\S+', '', text)

print(text_without_urls)
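
The heading also mentions words that contain digits, which the cell above does not handle. One possible regular expression for that step is sketched below; the sample sentence is made up.

```python
import re

text = "Order abc123 shipped in 2 days via route66 to the main office"

# \w*\d\w* matches any run of word characters that includes at least one digit
text_without_digit_words = re.sub(r"\b\w*\d\w*\b", "", text)

# Collapse the extra spaces left behind by the removals
text_without_digit_words = re.sub(r"\s+", " ", text_without_digit_words).strip()

print(text_without_digit_words)  # Output: Order shipped in days via to the main office
```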

#### 5. Removing Stopwords & Removing White spaces

In [None]:
import spacy

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

text = "This is a sample text with stop words."

# Process the text with spaCy
doc = nlp(text)

# Remove stop words from the spaCy document
filtered_words = [token.text for token in doc if not token.is_stop]

# Join the filtered words back into a string
filtered_text = " ".join(filtered_words)

print(filtered_text)

In [None]:
text = "   This is a sample text with extra spaces.   "
text_without_extra_spaces = text.strip()

print(text_without_extra_spaces)  # Output: This is a sample text with extra spaces.

#### 6. Rephrase Text

In [None]:
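!pip install transformers sentencepiece  # PegasusTokenizer relies on sentencepiece

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "tuner007/pegasus_paraphrase"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

text = "This is a sample text to be rephrased."

# Tokenize the input text
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Generate a paraphrase with beam search and decode it back to text
# (num_beams and max_length are illustrative settings)
output_ids = model.generate(input_ids, num_beams=5, max_length=60)
rephrased_text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

print(rephrased_text)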

#### 7. Tokenization

In [None]:
import spacy

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

text = "This is a sample sentence. Tokenization is fun!"

# Process the text with spaCy
doc = nlp(text)

# Get tokens
tokens = [token.text for token in doc]
print("Tokens:", tokens)

#### 8. Text Normalization

In [None]:
from nltk.stem import PorterStemmer  # PorterStemmer needs no extra NLTK data downloads

stemmer = PorterStemmer()

words = ["playing", "played", "plays", "player"]
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)  # Output: ['play', 'play', 'play', 'player']

##### Which text normalization technique have you used and why?

**Stemming**

**Why:** The code uses the PorterStemmer from the nltk.stem module, which is a stemming algorithm. It reduces words to their base or root form (e.g., "playing," "played," "plays" become "play"). This helps in text normalization by treating different forms of the same word as equivalent, which can be useful in tasks like text analysis or information retrieval.

#### 9. Part of speech tagging

In [None]:
import spacy

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

text = "This is a sample sentence."

# Process the text with spaCy
doc = nlp(text)

# Get POS tags
pos_tags = [(token.text, token.pos_) for token in doc]

print(pos_tags)
# Prints a list of (token, tag) pairs, e.g. ('sentence', 'NOUN') and ('.', 'PUNCT');
# the exact tags for some tokens can vary with the spaCy model version

#### 10. Text Vectorization

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

text = ["This is a sample document.", "Another document about text."]

# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Fit the vectorizer to the text data
vectorizer.fit(text)

# Transform the text data into a document-term matrix
vector = vectorizer.transform(text)

print(vector.toarray())
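# Output (columns: about, another, document, is, sample, text, this;
# the default tokenizer drops single-character tokens such as "a"):
# [[0 0 1 1 1 0 1]
#  [1 1 1 0 0 1 0]]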

##### Which text vectorization technique have you used and why?

**Bag-of-Words (BoW)**

**Why**: The code uses CountVectorizer, which implements the BoW model. BoW represents text as a collection of word frequencies, ignoring grammar and word order but focusing on the occurrence of words within a document. It's a simple and commonly used technique for text vectorization, making it suitable for tasks like text classification or topic modeling.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
import pandas as pd

# Assuming 'df' is your DataFrame with features
# Select only numerical columns for correlation calculation
numerical_df = df.select_dtypes(include=['number'])

correlation_matrix = numerical_df.corr()

# Print the correlation matrix
print(correlation_matrix)

#### 2. Feature Selection

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
# 'price' is continuous, so the regression scoring function (f_regression)
# is imported instead of the classification one (f_classif)
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_regression

# Load the dataset
data = pd.read_csv('/content/CarPrice_project.csv')

# Name of the target column
target_column = 'price'

# Separate features (X) and target variable (y)
X = data.drop(target_column, axis=1)
y = data[target_column]

##### What all feature selection methods have you used  and why?

**1.** **Variance Threshold:** Removes features that don't change much across samples.

**2. SelectKBest:** Picks the top features that are most related to the target variable (price).

**Why:** These methods help simplify the model by choosing the most important features, potentially improving accuracy and making the model easier to understand.
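
How the two methods fit together can be sketched on synthetic data. Everything in the example is an assumption made for illustration: the feature names, the zero-variance threshold, and `k=2`. Because the target here is continuous, `f_regression` is used as the scoring function.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_regression

# Synthetic stand-in data: two informative features, one noise feature, one constant
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "informative_1": rng.normal(size=n),
    "informative_2": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "constant": np.ones(n),
})
y = 3.0 * X["informative_1"] - 2.0 * X["informative_2"] + rng.normal(scale=0.5, size=n)

# Step 1: drop features with zero variance (they carry no information)
variance_filter = VarianceThreshold(threshold=0.0)
variance_filter.fit(X)
kept_after_variance = X.columns[variance_filter.get_support()]

# Step 2: keep the k features with the highest F-score against the target
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X[kept_after_variance], y)
selected_features = list(kept_after_variance[selector.get_support()])

print(selected_features)
```

Both selectors work on numeric input only, so categorical columns would need to be encoded (or excluded) before this step.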

##### Which all features you found important and why?

**Important Features:**


**Engine Size:** Likely a strong predictor of price as larger engines are often associated with more expensive cars.

**Horsepower:** More powerful cars tend to have higher prices, so this feature is likely important.

**City MPG**: Fuel efficiency can influence price, and city MPG is a key measure.

**Curb Weight:** The weight of a car can relate to its size and materials, potentially affecting price.

**Car Width/Length:** Dimensions can indicate the class and type of car, which is often related to price.

**Why Important:** These features likely have a strong relationship with car price, making them useful for the model to make accurate predictions. The feature selection methods aim to identify these important features.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
import numpy as np

# Check if the column is named 'price' (lowercase)
if 'price' in df.columns:
    df['price_log'] = np.log(df['price'])
else:
    # If not 'price', print available columns to identify the correct name
    print("Available columns:", df.columns)
    # If you find the correct column name (e.g., 'Price'), use it
    # df['price_log'] = np.log(df['Price'])
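
A log transform is typically chosen when the target is right-skewed, because it compresses the long upper tail. The sketch below shows the effect and the inverse transform on made-up prices; whether the real price column is skewed enough to need it should be checked with `df['price'].skew()`.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed prices
prices = pd.Series([5000.0, 7000.0, 9000.0, 12000.0, 45000.0])

price_log = np.log(prices)      # requires strictly positive values
price_back = np.exp(price_log)  # inverse transform back to the original scale

print(f"skew before: {prices.skew():.2f}, after: {price_log.skew():.2f}")
```

If a model is trained on the log-transformed target, its predictions are on the log scale and need `np.exp` applied before they can be read as prices.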

### 6. Data Scaling

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assuming 'df' is your DataFrame with numerical features
# Replace 'numerical_feature1', 'numerical_feature2' with actual column names
numerical_features = ['horsepower', 'enginesize']  # Example column names

# Create a scaler object
scaler = StandardScaler()

# Fit and transform the selected numerical features
df[numerical_features] = scaler.fit_transform(df[numerical_features])

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality reduction is not strictly necessary for this dataset, but it could still offer some benefits.

**Why it might not be strictly necessary:**

**Dataset Size:** The dataset has around 200 rows and about 25 features (after some data wrangling). This is not a high-dimensional dataset in which dimensionality reduction is crucial to avoid the curse of dimensionality.

**Feature Relevance:** Based on the visualizations, features like 'horsepower', 'enginesize', and 'carbody' seem to have a significant influence on car prices, which suggests that most features are likely relevant to the prediction task.

**Why it could still be beneficial:**

**Multicollinearity:** Some features may be correlated with one another (e.g., 'carlength', 'carwidth', and 'wheelbase'). Techniques like PCA can address this by creating new, uncorrelated features.

**Model Complexity:** Reducing the number of features can lead to simpler models that are less prone to overfitting.

**Improved Performance:** In some cases, dimensionality reduction can improve model performance by removing noise or irrelevant features.

**Recommendation:**

**Start without Dimensionality Reduction:** Build the models without dimensionality reduction first to establish a baseline performance.

**Experiment with PCA:** If overfitting appears or the models are too complex, apply PCA (Principal Component Analysis), which replaces the original features with a smaller set of uncorrelated components, and check whether it improves performance.

**Evaluate Performance:** Compare the models with and without dimensionality reduction using appropriate metrics (e.g., R-squared, RMSE), and choose the approach that gives the best balance of results and interpretability.
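
The comparison recommended above can be sketched with two scikit-learn pipelines scored by cross-validation. The data is synthetic, and the choice of two components and five folds is an assumption made for illustration, not a setting taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with two strongly correlated features
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
X = np.column_stack([
    base,
    base + rng.normal(scale=0.1, size=n),  # near-duplicate of the first column
    rng.normal(size=n),
])
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Baseline: scaling + linear regression; alternative: scaling + PCA + linear regression
baseline = make_pipeline(StandardScaler(), LinearRegression())
with_pca = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())

# Mean R-squared over 5 cross-validation folds for each pipeline
baseline_r2 = cross_val_score(baseline, X, y, cv=5, scoring="r2").mean()
pca_r2 = cross_val_score(with_pca, X, y, cv=5, scoring="r2").mean()

print(f"R2 without PCA: {baseline_r2:.3f}")
print(f"R2 with PCA:    {pca_r2:.3f}")
```

Putting the scaler and PCA inside the pipeline means they are refit on each training fold, so the held-out fold does not leak into the preprocessing.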

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# ... (Your existing code for data loading and preprocessing) ...

# Dimensionality Reduction (PCA)
# Select numerical features for scaling and PCA
numerical_features = ['horsepower', 'enginesize', 'carlength', 'carwidth',
                      'wheelbase', 'citympg', 'highwaympg']

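# 1. Data Scaling (PCA is sensitive to feature scale)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[numerical_features])

# 2. PCA: keep enough components to explain about 95% of the variance
#    (the 0.95 threshold is an assumed, adjustable choice)
pca = PCA(n_components=0.95)
pca_data = pca.fit_transform(scaled_data)
print("Explained variance ratio:", pca.explained_variance_ratio_)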

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

**Technique**: Principal Component Analysis (PCA)

**Why:**

**Simplify Data:** PCA finds the most important patterns (principal components) in the data, reducing the number of features while keeping most of the information.

**Remove Redundancy**: If some features are highly correlated (give similar information), PCA combines them into fewer components, removing redundancy.

### 8. Data Splitting

In [None]:
# Assuming 'df' is your DataFrame and 'price' is your target variable
X = df[['horsepower', 'enginesize', 'carlength', 'carwidth',
                      'wheelbase', 'citympg', 'highwaympg']]  # Features
Y = df['price']  # Target variable

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
X_train[0:10]

##### What data splitting ratio have you used and why?

**Ratio:** 80% for training, 20% for testing (test_size=0.2)

**Why:**

**Train the model effectively:** 80% of the data provides enough samples for the model to learn patterns.

**Evaluate performance on unseen data:** The remaining 20% is used to test how well the model generalizes to new data it hasn't seen before, giving a realistic estimate of its real-world performance.
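To make the ratio concrete, the sketch below applies the same `test_size=0.2` and `random_state=42` to placeholder arrays with 205 rows, the row count stated in the project summary. The arrays `X_demo` and `y_demo` are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the car dataset (205)
X_demo = np.arange(205 * 7).reshape(205, 7)
y_demo = np.arange(205)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("Training rows:", X_tr.shape[0])
print("Test rows:    ", X_te.shape[0])
```

With 205 rows this leaves 164 cars for training and 41 for testing. A test set that small gives fairly noisy metrics, which is one reason the later cells also rely on 5-fold cross-validation.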

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

**Likely not imbalanced (for regression):**

**Regression task:** Your goal is to predict car prices, which is a continuous variable. Imbalance is usually a concern in classification tasks where you have distinct categories and some categories have very few samples.

**Price distribution:** While car prices might have some skewness (more lower-priced cars than very expensive ones), this doesn't necessarily mean the dataset is imbalanced in the way that would typically require special handling techniques.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df['price'], bins=20, kde=True)
plt.title('Distribution of Car Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Technique: Target transformation (a log transformation is the most likely candidate; the model cells below are fit on the untransformed price)

Why:

Skewed Price Distribution: As discussed earlier, the car price data is likely right-skewed (more lower-priced cars than very expensive ones). Strictly speaking this is skewness rather than class imbalance, but it can produce skewed residuals with non-constant variance in a linear regression and let a few expensive cars dominate the squared-error loss.

Log Transformation: Log transformation is often effective in reducing right-skewness by compressing the higher values of the distribution. Linear regression does not require the target itself to be normally distributed (the usual assumption concerns the residuals), but modeling log(price) often makes the residuals more symmetric and stabilizes their variance. Predictions must then be transformed back with the exponential before errors are reported in price units.

Other Transformations: While log transformation is commonly used, other transformations (like square root or Box-Cox) might be explored depending on the specific characteristics of the distribution.
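The effect of a log transformation can be sketched on synthetic right-skewed prices. Everything here is an assumption for illustration: `price_demo` is generated data, and wrapping the model in scikit-learn's `TransformedTargetRegressor` is one possible way to apply the transformation, not the method used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic right-skewed "prices" driven by one feature (assumption)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(300, 1))
price_demo = np.exp(9.5 + 0.4 * x_demo[:, 0] + 0.2 * rng.normal(size=300))

# Skewness before and after taking the logarithm
skew_raw = pd.Series(price_demo).skew()
skew_log = pd.Series(np.log(price_demo)).skew()
print(f"skew(price) = {skew_raw:.2f}, skew(log price) = {skew_log:.2f}")

# Fit on log(price) and get predictions back in price units automatically
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
log_model.fit(x_demo, price_demo)
pred_demo = log_model.predict(x_demo[:5])
print(pred_demo)
```

The wrapper applies `np.log` to the target before fitting and `np.exp` to the predictions, so metrics such as RMSE can still be computed in price units.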

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Assuming 'df' is your DataFrame with features and target variable 'price'
X = df[['horsepower', 'enginesize', 'carlength', 'carwidth', 'wheelbase', 'citympg', 'highwaympg']]  # Features
y = df['price']  # Target variable

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Linear Regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score  # Import necessary functions

# ... (previous code to calculate mse and r2) ...

# Calculate RMSE
rmse = np.sqrt(mse)

metrics = ['MSE', 'RMSE', 'R-squared']
scores = [mse, rmse, r2]

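plt.bar(metrics, scores)
# MSE is orders of magnitude larger than R-squared, so use a log scale
# to keep all three bars visible (valid here because all scores are positive)
plt.yscale('log')
plt.title('Evaluation Metric Scores')
plt.ylabel('Score (log scale)')
plt.show()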

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import pandas as pd

# Assuming 'df' is your DataFrame with features and target variable 'price'
X = df[['horsepower', 'enginesize', 'carlength', 'carwidth', 'wheelbase', 'citympg', 'highwaympg']]
y = df['price']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Linear Regression model
model = LinearRegression()

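# Define hyperparameter grid for GridSearchCV
# (LinearRegression has very few tunable settings; fit_intercept is the main one)
param_grid = {
    'fit_intercept': [True, False],
}

# Run a 5-fold cross-validated grid search on the training set
grid_search = GridSearchCV(model, param_grid, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(X_train, y_train)
print(f"Best Hyperparameters: {grid_search.best_params_}")

# Evaluate the best model on the held-out test set
y_pred = grid_search.best_estimator_.predict(X_test)
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred)}")
print(f"R-squared: {r2_score(y_test, y_pred)}")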

##### Which hyperparameter optimization technique have you used and why?

**Technique:** Grid Search

**Why:**

**Tries all combinations:** It systematically tests all possible hyperparameter values you provide, ensuring you find the best set within your search space.

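**Simple and reliable:** It's easy to understand and, because it evaluates every candidate, it always returns the combination with the best cross-validated score within the defined grid. This suits small search spaces and simple models like Linear Regression, where `fit_intercept` is essentially the only setting that changes the fit.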
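Grid search is easier to appreciate on a model with a setting that actually changes the fit. As a hedged illustration, the sketch below tunes the `alpha` of a Ridge regression on synthetic data. The arrays `X_demo` and `y_demo` and the four candidate `alpha` values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data with 7 features (assumption)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 7))
true_coefs = np.array([3.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.5])
y_demo = X_demo @ true_coefs + rng.normal(size=200)

# Every alpha in the grid is evaluated with 5-fold cross-validation
alpha_grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}
ridge_search = GridSearchCV(
    Ridge(), alpha_grid, scoring='neg_mean_squared_error', cv=5
)
ridge_search.fit(X_demo, y_demo)

print("Best alpha:", ridge_search.best_params_['alpha'])
print("Best cross-validated MSE:", -ridge_search.best_score_)
```

The scoring is negated MSE because scikit-learn always maximizes the score, so the sign is flipped back when printing.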

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

**Potential Improvements:**

**Little or no change expected:** Ordinary least squares has no regularization strength to tune, so searching over `fit_intercept` rarely changes the result. The before/after comparison recorded under ML Model - 2 shows identical scores (MSE 11990016.39, RMSE 3462.66, R-squared 0.781).

**More robust estimate:** The cross-validation used during the search still gives a more stable estimate of how the model generalizes than a single train/test split.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Evaluation metrics before and after optimization
metrics = ['MSE', 'RMSE', 'R-squared']
before_scores = [11990016.39, 3462.66, 0.781]  # Scores before optimization
after_scores = [11990016.39, 3462.66, 0.781]  # Scores after optimization

# Create bar chart
x = np.arange(len(metrics))  # the label locations
width = 0.35  # the width of the bars

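fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, before_scores, width, label='Before Optimization')
rects2 = ax.bar(x + width/2, after_scores, width, label='After Optimization')

# Label the chart; a log scale keeps R-squared visible next to the much larger MSE
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.set_yscale('log')
ax.set_ylabel('Score (log scale)')
ax.set_title('Evaluation Metrics Before vs. After Optimization')
ax.legend()
plt.show()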

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import pandas as pd

# Assuming 'df' is your DataFrame with features and target variable 'price'
X = df[['horsepower', 'enginesize', 'carlength', 'carwidth', 'wheelbase', 'citympg', 'highwaympg']]
y = df['price']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Linear Regression model
model = LinearRegression()

# Define hyperparameter distributions for RandomizedSearchCV
# Note: only fit_intercept and positive change the fitted model;
# copy_X and n_jobs affect memory use and parallelism, not predictions.
param_distributions = {
    'fit_intercept': [True, False],
    'copy_X': [True, False],
    'positive': [True, False],  # True constrains coefficients to be non-negative
    'n_jobs': [-1, 1]  # Number of CPU cores to use; -1 uses all available cores
}

# Create RandomizedSearchCV object
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_distributions,
    scoring='neg_mean_squared_error',
    cv=5,
    n_iter=10,  # Number of parameter settings that are sampled.
    random_state=42  # Controls the randomness of the sampling.
)

# Fit the model with hyperparameter tuning
random_search.fit(X_train, y_train)

# Get the best model and its hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_

print(f"Best Hyperparameters: {best_params}")

# Make predictions on the test data using the best model
y_pred = best_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f"R-squared: {r2}")

##### Which hyperparameter optimization technique have you used and why?

**Technique:** Randomized Search

**Why:**

**Efficient exploration:** Instead of trying every possible combination like Grid Search, it randomly tries different hyperparameter values, which is faster and can sometimes find better settings.

**Good for larger search spaces**: When you have many hyperparameters to tune, Randomized Search is more efficient than Grid Search.
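Randomized search shows its value when a hyperparameter is continuous and can be sampled from a distribution. The sketch below is an illustration on synthetic data, with an assumed search range for the Lasso `alpha` and an assumed budget of 20 samples; none of these values come from the notebook.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data with 7 features (assumption)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 7))
true_coefs = np.array([3.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.5])
y_demo = X_demo @ true_coefs + rng.normal(size=200)

# Sample alpha on a log scale between 0.001 and 10 instead of fixing a grid
lasso_search = RandomizedSearchCV(
    Lasso(max_iter=10000),
    param_distributions={'alpha': loguniform(1e-3, 1e1)},
    n_iter=20,                      # number of sampled settings
    scoring='neg_mean_squared_error',
    cv=5,
    random_state=42,
)
lasso_search.fit(X_demo, y_demo)

print("Best alpha:", lasso_search.best_params_['alpha'])
print("Best cross-validated MSE:", -lasso_search.best_score_)
```

The cost is fixed by `n_iter` regardless of how wide the range is, which is the main practical difference from a grid.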

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

**Potential Improvements:**

**Slight improvement at most:** For Linear Regression, Randomized Search can only switch between a handful of settings, so any change in accuracy (e.g., lower MSE, higher R-squared) is likely to be small. The actual change should be read from the metrics printed by the cell above.

**Better generalization:** Selecting settings by 5-fold cross-validation rather than by training error helps avoid choices that overfit the training data.

**Evaluation Metric Score Chart (illustrative placeholder values, not measured results):**

| Metric | Before Tuning | After Tuning |
|---|---|---|
| Mean Squared Error (MSE) | 1000 | 900 |
| R-squared (R2) | 0.80 | 0.84 |

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

**Evaluation Metrics:**

**Mean Squared Error (MSE):** Measures the average squared difference between predicted and actual car prices.

**Business Indication:** Lower MSE means the model's price predictions are closer to the true values, leading to more accurate pricing strategies.

**Business Impact:** More accurate pricing can increase sales, customer satisfaction, and profitability by setting competitive and realistic prices.

**Root Mean Squared Error (RMSE):** The square root of MSE, providing a more interpretable measure of error in the same units as the target variable (price).

### ML Model - 3

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Assuming X_train, Y_train, X_test are already defined

# Create and train the Linear Regression model
regression = LinearRegression()  # Create a LinearRegression object
regression.fit(X_train, Y_train)  # Train the model

# Create and train the Ridge Regression model
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X_train, Y_train)

# Create and train the Lasso Regression model
lasso_reg = Lasso(alpha=1.0)
lasso_reg.fit(X_train, Y_train)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import pandas as pd

# Use a separate name so the car dataset held in `df` is not overwritten
data = {'Metric': metrics, 'Score': scores}
metrics_df = pd.DataFrame(data)
print(metrics_df)

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
!pip install scikit-optimize
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Assuming X_train, Y_train, X_test are already defined

search_space = {
    'fit_intercept': Categorical([True, False]),
    'positive': Categorical([True, False]),
    'copy_X': Categorical([True, False])
}

model = LinearRegression()

bayes_search = BayesSearchCV(
    model,
    search_space,
    n_iter=50,  # Number of iterations
    scoring='neg_mean_squared_error',
    cv=5,        # Cross-validation folds
    n_jobs=-1    # Use all available cores
)

bayes_search.fit(X_train, Y_train)

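best_model = bayes_search.best_estimator_
best_params = bayes_search.best_params_
print(f"Best Hyperparameters: {best_params}")

# Evaluate the tuned model on the held-out test set
y_pred = best_model.predict(X_test)
print(f"Mean Squared Error: {mean_squared_error(Y_test, y_pred)}")
print(f"R-squared: {r2_score(Y_test, y_pred)}")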

##### Which hyperparameter optimization technique have you used and why?

**Hyperparameter Optimization Technique**: Bayesian Optimization (using BayesSearchCV)

**Why:**

**Efficiency**: Bayesian optimization can be more sample-efficient than Grid Search or Random Search when the search space is large or continuous and each model fit is expensive.

**Intelligent Exploration**: It builds a surrogate model of the score from past trials and uses it to focus on regions more likely to yield good results.

**Fewer Iterations Required:** It can often find good hyperparameter values with fewer iterations (less training time) compared to other methods.

For this model the search space is only three on/off settings (eight combinations), so the practical advantage over Grid Search is small. The technique matters more for models with continuous hyperparameters, such as the `alpha` of Ridge or Lasso.

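##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

**Improvement:** Not established by the results recorded in this notebook. Bayesian optimization can improve performance when there are influential hyperparameters to tune, but no improvement should be claimed until the tuned model's test MSE and R2 are compared against the baseline.

**Evaluation Metric Score Chart (direction that would indicate an improvement)**

| Metric | Before Optimization | After Optimization |
|---|---|---|
| Mean Squared Error (MSE) | Higher | Lower |
| R-squared (R2) | Lower | Higher |

**In essence, an improvement would show up as a lower MSE and a higher R2 score after optimization, indicating better predictive accuracy on the test set.**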

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Evaluation Metrics for Positive Business Impact:

**Mean Squared Error (MSE):** Measures the average squared difference between predicted and actual values. Lower MSE is better, as it indicates less error in predictions, leading to more accurate pricing and better customer satisfaction.

**R-squared (R2)**: Represents the proportion of variance in the target variable (price) explained by the model. Higher R2 is better, indicating a stronger relationship between features and price, leading to more reliable price predictions and better business decisions.

**Why these metrics?**

They directly relate to business goals in car price prediction:

Accuracy (MSE): Accurate predictions minimize pricing errors, increasing profitability and customer trust.

Reliability (R2): A reliable model ensures consistent and trustworthy price estimations, supporting informed business strategies.
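A small worked example may help tie these metrics to price units. The three actual and predicted prices below are made-up numbers chosen so the arithmetic is easy to follow; they are not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted prices for three cars
y_true_demo = np.array([10000.0, 15000.0, 20000.0])
y_pred_demo = np.array([11000.0, 14000.0, 22000.0])

# Errors are 1000, -1000 and 2000, so the squared errors are 1e6, 1e6 and 4e6
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)   # mean of squared errors
rmse_demo = np.sqrt(mse_demo)                             # back in price units
r2_demo = r2_score(y_true_demo, y_pred_demo)              # 1 - SS_res / SS_tot

print(f"MSE:  {mse_demo:.0f}")
print(f"RMSE: {rmse_demo:.2f}")
print(f"R2:   {r2_demo:.2f}")
```

Here the MSE is 2,000,000, the RMSE is about 1,414 (a typical miss of roughly 1,400 in price units), and R2 is 0.88 because the residual sum of squares (6,000,000) is 12% of the total sum of squares (50,000,000).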

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

**Final Prediction Model:** The LinearRegression model with hyperparameters tuned using Bayesian optimization (best_model in your code).

**Why:**

**Tuned Settings:** The search selected the model's settings by 5-fold cross-validation. Any gain over the default settings is likely to be small for Linear Regression and should be confirmed by comparing test MSE and R2 against the baseline.

**Simplicity and Interpretability:** Linear Regression is relatively easy to understand and interpret, which can be valuable for explaining price predictions to stakeholders.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

**Model Used**: Linear Regression

**Explanation:**

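Predicts car prices based on a linear combination of features (engine size, horsepower, etc.).

Each feature is assigned a weight (coefficient): the expected change in price for a one-unit increase in that feature, holding the others fixed. Coefficients are only comparable as importance scores when the features are on the same scale (e.g., after standardization).

**Positive weight:** Feature increases price.

**Negative weight:** Feature decreases price.

**Feature Importance using SHAP (SHapley Additive exPlanations):** SHAP attributes each individual prediction to the features. For a linear model with independent features, a feature's SHAP value is its coefficient multiplied by the difference between the feature's value and its mean.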
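As a lightweight alternative to a full SHAP analysis, coefficients fitted on standardized features can be ranked by magnitude. The sketch below uses synthetic data: the frame `demo`, the feature distributions, and the price formula are assumptions made up for the example, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption)
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    'horsepower': rng.normal(100, 40, 200),
    'enginesize': rng.normal(130, 40, 200),
    'citympg': rng.normal(25, 6, 200),
})
demo_price = (
    80 * demo['horsepower']
    + 60 * demo['enginesize']
    - 200 * demo['citympg']
    + rng.normal(0, 1000, 200)
)

# Standardize so each coefficient is the price change per one standard deviation
X_std = StandardScaler().fit_transform(demo)
coef_model = LinearRegression().fit(X_std, demo_price)

# Rank features by absolute standardized coefficient
coefs = pd.Series(coef_model.coef_, index=demo.columns)
coef_table = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coef_table)
```

On the raw scale `citympg` has the largest coefficient in absolute value (200 per mpg), yet after standardization it ranks last, which is why scaling matters before reading coefficients as importance.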

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
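The two placeholder steps above (save, then reload and predict) could look roughly like the sketch below. It trains a throwaway model on synthetic data, and the file name `car_price_model_demo.joblib` is a hypothetical choice; in the notebook the object to save would be the chosen final model instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Throwaway model on synthetic data (assumption), standing in for the final model
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(100, 7))
y_demo = X_demo @ np.arange(1.0, 8.0) + rng.normal(size=100)
demo_model = LinearRegression().fit(X_demo, y_demo)

# 1. Save the fitted model to disk (hypothetical file name)
model_path = os.path.join(tempfile.gettempdir(), 'car_price_model_demo.joblib')
joblib.dump(demo_model, model_path)

# 2. Load it back and predict on rows the model has not seen
loaded_model = joblib.load(model_path)
X_unseen = rng.normal(size=(3, 7))
print(loaded_model.predict(X_unseen))

# Sanity check: the reloaded model reproduces the original predictions
same_predictions = np.allclose(
    loaded_model.predict(X_unseen), demo_model.predict(X_unseen)
)
print("Predictions match:", same_predictions)
```

If the features were scaled or transformed before training, the fitted preprocessing steps need to be saved as well, for example by saving a whole scikit-learn pipeline rather than the regressor alone.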

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

A well-tuned regression model for predicting car prices can greatly benefit stakeholders in the automotive ecosystem. Car dealerships can use the model to recommend competitive pricing. Consumers can obtain fair market estimates when buying or selling vehicles. Additionally, insurance companies can leverage such models for premium calculations.

By combining domain knowledge with data science tools, this project illustrates the practical application of machine learning in solving real-world business problems through predictive analytics.



### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***