### Problem Statement

##### The car dataset contains information about various automobiles, including attributes such as price, engine size, horsepower, fuel efficiency, dimensions, and other technical specifications. However, like most real-world datasets, it contains missing values, inconsistent data formats, and variables that require transformation before meaningful analysis or modeling can be performed.

##### The objective of this project is to perform comprehensive data preprocessing to improve the dataset's quality, consistency, and usability. This involves identifying and handling missing values, correcting data types, standardizing and normalizing numerical variables, and transforming data into suitable formats for analysis.

##### Specifically, the preprocessing tasks include identifying missing values and applying appropriate techniques to handle them, correcting incorrect data formats to ensure numerical and categorical variables are properly represented, standardizing units and formats for consistency, and normalizing numerical features through centering and scaling to ensure comparability. Additionally, binning will be applied to convert continuous variables into categorical groups for easier interpretation, and indicator variables will be created to represent categorical features in a numerical format suitable for analysis and machine learning models.

##### By performing these preprocessing steps, the dataset will be transformed into a clean, structured, and analysis-ready format, enabling accurate statistical analysis, visualization, and predictive modeling.

In [None]:
#importing required libraries
import pandas as pd
import numpy as np
import matplotlib as plt
from matplotlib import pyplot

##### The dataset doesn't contain column headers, so we define a list of column names to supply them.

In [None]:
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

In [None]:
#loading the dataset and viewing the first 5 rows
data=pd.read_csv("car_data.csv",names=headers)
data.head()

#### Data Cleaning

##### The dataset marks missing entries with "?" instead of NaN.
##### We convert them to NaN using the replace function.

In [None]:
data.replace("?",np.nan,inplace=True)

In [None]:
data.head()

##### The output above shows that the "?" entries have been replaced with NaN.
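As an alternative to calling replace after loading, the "?" markers can be converted while the file is read. The sketch below uses a small made-up, headerless sample (not the real file) and the `na_values` parameter of `pd.read_csv`; the variable names `raw`, `cols` and `sample` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same headerless style as the car data.
raw = io.StringIO("3,?,alfa-romero,gas\n2,164,audi,gas\n")
cols = ["symboling", "normalized-losses", "make", "fuel-type"]

# na_values tells read_csv to parse "?" as NaN while loading.
sample = pd.read_csv(raw, names=cols, na_values="?")

print(sample["normalized-losses"].isna().tolist())  # [True, False]
print(sample["normalized-losses"].dtype)            # float64
```

A side effect of this approach is that numeric columns containing "?" are parsed as floats straight away instead of as text.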

##### Next, we count the missing entries in each column.

In [None]:
# Using this ".info()" method we can also get the data type of the columns
data.info()

Based on the summary above, the dataset has 205 rows, and seven of the columns contain missing values:
<ol>
    <li>"normalized-losses": 41 missing values</li>
    <li>"num-of-doors": 2 missing values</li>
    <li>"bore": 4 missing values</li>
    <li>"stroke": 4 missing values</li>
    <li>"horsepower": 2 missing values</li>
    <li>"peak-rpm": 2 missing values</li>
    <li>"price": 4 missing values</li>
</ol>
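Counts like the ones listed above can also be computed directly rather than read off the `.info()` output. This is a minimal sketch on a made-up three-row frame (`toy` and `missing` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in two of its three columns.
toy = pd.DataFrame({
    "bore": ["3.47", np.nan, "3.19"],
    "num-of-doors": ["two", "four", np.nan],
    "make": ["audi", "bmw", "audi"],
})

# isna() flags missing cells; sum() counts them per column.
missing = toy.isna().sum()

# Keep only the columns that actually have gaps.
missing = missing[missing > 0]
print(missing)
```

Applied to the car data, the same two lines would list only the columns that need attention.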


##### We handle these gaps using the mean (for numerical columns) or the mode (for categorical columns).
##### Also note that the data types of the numerical columns change in the process.

In [None]:
# "normalized-losses" was read as text: convert it to float, then fill the gaps with the column mean
data["normalized-losses"]=data["normalized-losses"].astype(float)
data["normalized-losses"]=data["normalized-losses"].fillna(data["normalized-losses"].mean())

In [None]:
# convert "stroke" to float, then fill the gaps with the column mean
data["stroke"]=data["stroke"].astype(float)
data["stroke"]=data["stroke"].fillna(data["stroke"].mean())

In [None]:
# convert "bore" to float, then fill the gaps with the column mean
data["bore"]=data["bore"].astype(float)
data["bore"]=data["bore"].fillna(data["bore"].mean())

In [None]:
# convert "horsepower" to float, then fill the gaps with the column mean
data["horsepower"]=data["horsepower"].astype(float)
data["horsepower"]=data["horsepower"].fillna(data["horsepower"].mean())

In [None]:
# convert "peak-rpm" to float, then fill the gaps with the column mean
data["peak-rpm"]=data["peak-rpm"].astype(float)
data["peak-rpm"]=data["peak-rpm"].fillna(data["peak-rpm"].mean())

In [None]:
# The "num-of-doors" column has only 2 missing values; we replace them with the mode
# mode() returns a Series, so [0] picks the most frequent value
data["num-of-doors"]=data["num-of-doors"].fillna(data["num-of-doors"].mode()[0])

In [None]:
# The "price" column has 4 missing entries; we drop those rows
data.dropna(subset=["price"], inplace=True)
data.reset_index(drop=True, inplace=True)

In [None]:
data.isnull().sum()

##### From the above we can see that we have a dataset without any empty entries
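The imputation steps above follow one pattern per column type, so they can be condensed into loops. The sketch below is one possible way to do that on a made-up frame; the lists `numeric_cols` and `categorical_cols` are assumptions introduced for illustration and would need to name the real columns.

```python
import numpy as np
import pandas as pd

# Made-up frame: numbers stored as text, with one gap per column.
toy = pd.DataFrame({
    "bore": ["3.0", np.nan, "4.0"],
    "num-of-doors": ["four", np.nan, "four"],
})

numeric_cols = ["bore"]              # filled with the column mean
categorical_cols = ["num-of-doors"]  # filled with the most frequent value

for col in numeric_cols:
    # Convert text to float first so the mean is computed on numbers.
    toy[col] = toy[col].astype(float)
    toy[col] = toy[col].fillna(toy[col].mean())

for col in categorical_cols:
    # mode() returns a Series; its first element is the most frequent value.
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Converting to float before filling keeps each numerical column in a single data type instead of a mix of text and numbers.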

#### Data Standardization

##### Standardization is the process of transforming data into a common format so that meaningful comparisons can be made.
##### In the dataset, the fuel consumption columns "city-mpg" and "highway-mpg" are expressed in mpg (miles per gallon).
##### Let's assume we're developing an application for a country that reports fuel consumption in L/100km.
##### We therefore need to convert mpg to L/100km.
##### We use this formula for the unit conversion:

##### L/100km = 235 / mpg

In [None]:
# conversion
data["city-mpg L/100km"]=235/data["city-mpg"]
data["highway-mpg L/100km"]=235/data["highway-mpg"]
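To make the conversion concrete, here is a small worked sketch. The helper name `mpg_to_l_per_100km` and its `factor` parameter are hypothetical; the default of 235 matches the rounded constant used above, while the more precise factor for US gallons is about 235.215.

```python
def mpg_to_l_per_100km(mpg, factor=235.0):
    """Convert miles per gallon to litres per 100 km using L/100km = factor / mpg."""
    if mpg <= 0:
        raise ValueError("mpg must be positive")
    return factor / mpg

# 23.5 mpg corresponds to about 10 L/100km with the rounded factor.
print(round(mpg_to_l_per_100km(23.5), 2))  # 10.0
# Doubling the mpg halves the consumption figure.
print(round(mpg_to_l_per_100km(47.0), 2))  # 5.0
```

Because the relationship is inverse, a higher mpg value always maps to a lower L/100km value.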

#### Data Normalization
<p>Normalization is the process of transforming values of several variables into a similar range.</p>
<p><b>Approach:</b> replace the original value by (original value)/(maximum value)</p>


In [None]:
# normalize the "length", "width" and "height" columns by dividing each by its maximum
data["length"]=data["length"]/data["length"].max()
data["height"]=data["height"]/data["height"].max()
data["width"]=data["width"]/data["width"].max()
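Dividing by the maximum is one of several common scaling schemes. For comparison, the sketch below applies it next to min-max scaling and z-score scaling (centering and scaling) on a made-up series; these two alternatives are shown for illustration and are not what the notebook applies to the car data.

```python
import pandas as pd

# Made-up lengths, chosen so the results are easy to check by hand.
length = pd.Series([150.0, 175.0, 200.0], name="length")

# Simple feature scaling (the approach used above): values end up in (0, 1].
simple = length / length.max()

# Min-max scaling: values end up in [0, 1].
min_max = (length - length.min()) / (length.max() - length.min())

# Z-score: centered at 0 with unit sample standard deviation.
z_score = (length - length.mean()) / length.std()

print(simple.tolist())   # [0.75, 0.875, 1.0]
print(min_max.tolist())  # [0.0, 0.5, 1.0]
print(z_score.tolist())  # [-1.0, 0.0, 1.0]
```

Which scheme fits best depends on the downstream analysis; the simple approach keeps the ratios between values intact.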

#### Binning
<p>
    Binning is a process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis.
</p>
<p>
    In the dataset, "horsepower" is a real-valued variable ranging from 48 to 288, and it has 59 unique values.
    We can classify the cars into three groups: high, medium, and low horsepower.</p>

##### We'll plot a histogram of horsepower to see its distribution.

In [None]:
data["horsepower"]=data["horsepower"].astype(float)

In [None]:
pyplot.hist(data["horsepower"])

# set x/y labels and plot title
pyplot.xlabel("horsepower")
pyplot.ylabel("count")
pyplot.title("horsepower bins")

In [None]:
# three equal-width bins need four evenly spaced edges between the min and max horsepower
bins=np.linspace(data["horsepower"].min(),data["horsepower"].max(),4)
data["horsepower-binned"]=pd.cut(data["horsepower"],bins=bins,labels=["Low","medium","high"],include_lowest=True)
data[["horsepower", "horsepower-binned"]].head(5)

In [None]:
data["horsepower-binned"].value_counts()
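To see how the bin edges decide each label, the binning step above can be traced on a handful of made-up horsepower values. The series `hp` is illustrative; its minimum and maximum are set to 48 and 288 to mirror the range mentioned earlier.

```python
import numpy as np
import pandas as pd

# Made-up horsepower values spanning the range 48 to 288.
hp = pd.Series([48.0, 100.0, 150.0, 200.0, 288.0])

# Four evenly spaced edges define three equal-width bins.
edges = np.linspace(hp.min(), hp.max(), 4)
print(edges.tolist())  # [48.0, 128.0, 208.0, 288.0]

# include_lowest=True makes the first bin include its left edge (48).
binned = pd.cut(hp, bins=edges, labels=["Low", "medium", "high"], include_lowest=True)
print(binned.astype(str).tolist())  # ['Low', 'Low', 'medium', 'medium', 'high']
```

Without `include_lowest=True`, the minimum value would fall outside the first interval and be assigned NaN.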

In [None]:
# sort_index() puts the counts in bin order so they line up with the labels
pyplot.bar(["Low","medium","high"], data["horsepower-binned"].value_counts().sort_index())

# set x/y labels and plot title
pyplot.xlabel("horsepower")
pyplot.ylabel("count")
pyplot.title("horsepower bins")

<p> The bar chart above shows that binning has reduced the 59 unique horsepower values to 3 categories.
</p>


#### Indicator Variable


<p>
    An indicator variable (or dummy variable) is a numerical variable used to label categories. 
</p>

<p>
    The column "fuel-type" has two unique values: "gas" and "diesel".
    Regression models work with numbers, not text labels.
    To use this attribute in regression analysis, we convert "fuel-type" to indicator variables.
</p>

In [None]:
dummy_variable_1 = pd.get_dummies(data["fuel-type"]).astype(int)
dummy_variable_1.rename(
    columns={
        'gas':'fuel-type-gas',
        'diesel':'fuel-type-diesel'
    },
    inplace=True)
dummy_variable_1.head(5)

In [None]:
# Then join the original dataset and the new subset with the binary values
data=pd.concat([data,dummy_variable_1],axis=1)
# drop the original column "fuel-type" from "data"
data.drop("fuel-type", axis = 1, inplace=True)
data
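The create, rename, concatenate and drop steps above can also be done in a single call. The sketch below is one possible shortcut on a made-up frame, using the `columns`, `prefix_sep` and `dtype` parameters of `pd.get_dummies`; `toy` and `encoded` are illustrative names.

```python
import pandas as pd

# Made-up frame with one categorical column to encode.
toy = pd.DataFrame({
    "make": ["audi", "bmw", "vw"],
    "fuel-type": ["gas", "diesel", "gas"],
})

# columns= encodes only the listed column and drops the original;
# the column name is used as the prefix, joined with prefix_sep.
encoded = pd.get_dummies(toy, columns=["fuel-type"], prefix_sep="-", dtype=int)

print(list(encoded.columns))  # ['make', 'fuel-type-diesel', 'fuel-type-gas']
print(encoded["fuel-type-gas"].tolist())  # [1, 0, 1]
```

With only two categories, the two indicator columns carry the same information; passing `drop_first=True` keeps just one of them, which some regression setups prefer.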

<p>
    The column "aspiration" has two unique values: "std" and "turbo".
    As with "fuel-type", regression models need these labels as numbers.
    To use this attribute in regression analysis, we convert "aspiration" to indicator variables.
</p>

In [None]:
dummy_variable_2=pd.get_dummies(data["aspiration"]).astype(int)
dummy_variable_2.rename(
    columns={
        "std":"aspiration-std",
        "turbo":"aspiration-turbo"
    },
     inplace=True
)
dummy_variable_2.head(5)

In [None]:
# add the dummy_variable_2 subset to the dataset
data=pd.concat([data,dummy_variable_2],axis=1)
# dropping the aspiration column
data.drop("aspiration",axis=1,inplace=True)
data.head(5)

In [None]:
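# Now we save the clean dataset to a new file
# index=False leaves out the row index so it is not written as an extra unnamed column
data.to_csv("clean_car_data.csv", index=False)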
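Whether `to_csv` writes the row index affects what comes back when the file is reloaded. The sketch below illustrates the difference on a made-up frame, writing to in-memory buffers instead of real files; all variable names are illustrative.

```python
import io
import pandas as pd

# Made-up two-row frame standing in for the cleaned data.
toy = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950.0, 16430.0]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without = pd.read_csv(without_index)

print(list(reloaded_with.columns))     # ['Unnamed: 0', 'make', 'price']
print(list(reloaded_without.columns))  # ['make', 'price']
```

If the index is written, it can be restored on reload with `pd.read_csv(..., index_col=0)` instead of appearing as an extra column.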



## Author

<a href="https://www.linkedin.com/in/tanimowo-possible/" target="_blank">Tanimowo Possible</a>
