# Phase 1: Data Gathering

## a. Import required libraries

In [None]:
# The version numbers noted in the cells below were the latest releases when this project was created

# Uncomment the line below to install dependencies for the required libraries
#!pip install -r requirements.txt

# cross check with the cloud based version on the course

### Pandas

**Pandas** is a library that provides many tools for working with data in Python. It is imported here to perform a wide variety of data manipulation and analysis tasks. In this project, pandas is mainly used for reading and writing CSV files, including importing the dataset, adding headers, and reviewing its contents with the df.head(n) and df.tail(n) methods. It is also used to convert "?" placeholders to NaN so that missing values can be identified, and to obtain a brief overview of the dataset's shape, data types, and descriptive statistics.

In [None]:
# pandas 1.5.3
import pandas as pd

### Numpy
**NumPy** is a widely used library for Python that simplifies data analysis and scientific computing by providing a variety of useful features for working with numerical data in arrays and matrices. 

In [None]:
# numpy 1.24.1
import numpy as np

### Matplotlib
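**Matplotlib** is a Python library for making plots and visualizing data. It offers a user-friendly interface for creating a variety of plots, such as line plots, scatter plots, bar plots, and histograms.

Two of its modules are imported below. pyplot is Matplotlib's state-based plotting interface and the usual entry point for creating figures. pylab is a legacy convenience module that pulls pyplot and NumPy into a single namespace; the Matplotlib documentation discourages it in new code, but it is imported here because the histogram in Phase 2 is drawn with it.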


In [None]:
# matplotlib 3.6.3
import matplotlib.pylab as plt_lab
import matplotlib.pyplot as plt_plot

### Seaborn
**Seaborn** is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Seaborn is used to create a heatmap to visualize the correlation between the features in the dataset.


In [None]:
# seaborn 0.12.2
import seaborn as sns

### Scipy
**SciPy** is a Python library for scientific and technical computing, covering a variety of mathematical, scientific, and engineering tasks. In this project, its stats module is imported for statistical tests such as one-way ANOVA, which checks whether the means of several groups differ significantly.

In [None]:
# scipy 1.2.1
from scipy import stats

### Scikit-learn
**Scikit-learn** is a Python library for machine learning that supports tasks such as classification, regression, and clustering. In this project, the imports below cover linear and ridge regression, error metrics (mean squared error and R²), cross-validation and grid search, train/test splitting, polynomial features, feature scaling, and pipelines.

In [None]:
# scikit-learn 1.2.1

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import Pipeline

### Ipywidgets
**Ipywidgets** is a Python library for creating interactive widgets in Jupyter notebooks. The interact helpers imported below attach controls such as sliders to a function so that its output updates as the values change.

In [None]:
# ipywidgets 8.0.4
from ipywidgets import interact, interactive, fixed, interact_manual

### Tqdm
**tqdm** is a Python library for adding progress bars to loops. It is imported to show the progress of long-running loops.

In [None]:
# tqdm 4.64.1
from tqdm import tqdm

## b. 1985 Auto Imports Database
This dataset was donated by Jeffrey C. Schlimmer on May 19, 1987. Its sources are:
1. 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook
2. Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038
3. Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037

### Attributes
The attributes of this dataset represent:
- specification of an auto in terms of various characteristics
- assigned insurance risk rating (symboling)
- normalized losses in use as compared to other cars

    *This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc...), and represents the average loss per car per year.*

1. **symboling** - the degree to which the auto is more risky than its price indicates (3 = very risky, -3 = very safe)
2. **normalized-losses** - relative average loss payment per insured vehicle year, normalized within the car's size classification
3. **make** - make of the car (e.g., Honda, Toyota, etc.)
4. **fuel-type** - type of fuel used by the car (gas or diesel)
5. **aspiration** - type of engine aspiration (standard or turbocharged)
6. **num-of-doors** - number of doors on the car 
7. **body-style** - style of the car's body (e.g., sedan, hatchback, etc.)
8. **drive-wheels** - configuration of the car's drive wheels (front, rear, or four-wheel)
9. **engine-location** - location of the car's engine (front or rear)
10. **wheel-base** - distance between the centers of the front and rear wheels 
11. **length** - length of the car 
12. **width** - width of the car 
13. **height** - height of the car
14. **curb-weight** - weight of the car without any passengers or cargo
15. **engine-type** - type of engine in the car (e.g., dohc, ohc, etc.)
16. **num-of-cylinders** - number of cylinders in the car's engine
17. **engine-size** - size of the car's engine
18. **fuel-system** - fuel system used by the car (e.g., mpfi, 2bbl, etc.)
19. **bore** - diameter of each engine cylinder
20. **stroke** - distance the piston travels in the cylinder between its top and bottom positions
21. **compression-ratio** - ratio of the volume of the combustion chamber at the bottom of the piston stroke to the volume at the top
22. **horsepower** - power of the car's engine
23. **peak-rpm** - engine speed, in revolutions per minute, at which peak horsepower is produced
24. **city-mpg** - number of miles per gallon the car can get in the city
25. **highway-mpg** - number of miles per gallon the car can get on the highway
26. **price** - price of the car

### Replace dataset headers with attribute names

In [None]:
# rename the columns to their proper attribute labels
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

### Read and save the dataset


In [None]:
dataset = "./auto.csv"

df = pd.read_csv(dataset, header=None, names=headers)

df.head()
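
The read step above can be tried without the CSV file. The following is a minimal sketch on two made-up rows with a shortened, three-column header list; the `na_values="?"` argument is an optional alternative shown for illustration only, since this notebook converts "?" in a later cell instead.

```python
import io
import pandas as pd

# Two made-up rows in the same headerless, comma-separated layout (3 columns only).
raw = io.StringIO("3,?,alfa-romero\n1,164,audi\n")
cols = ["symboling", "normalized-losses", "make"]  # shortened header list for the demo

# header=None: the first line is data, not column names.
# names=cols:  supply the column labels ourselves.
# na_values="?": optional; parses "?" as NaN while loading.
sample = pd.read_csv(raw, header=None, names=cols, na_values="?")

print(sample)
print(sample.dtypes)
```

With `na_values="?"`, a column such as normalized-losses is parsed as a numeric column straight away rather than as text.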

## c. Basic insight on the dataset

In [None]:
print("The dataset has", df.shape[0] , "rows and", df.shape[1], "columns")

### Data Types

In [None]:
print("The types of the columns are as follows:\n", df.dtypes)

### Describe
This provides a statistical summary of all columns, including object-typed attributes.

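*Note: Some values in the table below show as "NaN" because a statistic does not apply to that column's data type; for example, mean is undefined for object columns, and unique, top, and freq are undefined for numeric columns.*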

In [None]:
df.describe(include = "all")
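
As a small illustration of where the "NaN" entries in such a summary come from, the sketch below runs the same call on a made-up two-column frame that contains no missing data at all.

```python
import pandas as pd

demo = pd.DataFrame({
    "make": ["audi", "bmw", "audi"],     # object (text) column
    "length": [176.6, 189.0, 177.3],     # numeric column
})

summary = demo.describe(include="all")
print(summary)

# Text-only statistics are NaN for the numeric column, and vice versa.
print(pd.isna(summary.loc["top", "length"]))
print(pd.isna(summary.loc["mean", "make"]))
```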

### Info
This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

In [None]:
df.info()

# Phase 2: Cleansing Data

## a. Identify and handle missing values

In [None]:
# Convert "?" to NaN
df.replace("?", np.nan, inplace = True)

# Identify missing values
print("The number of missing values per column are:\n", df.isnull().sum())

#### There are several ways to deal with missing data:
1. **Drop data** - a data point (row) is removed if it contains a missing value
2. **Replace data** - missing value is replaced by mean or frequency
3. **Drop attribute** - the entire column is removed if it contains too many missing values
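
The three options can be sketched on a small made-up frame. The column names mirror the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bore": [3.47, np.nan, 3.19, 3.31],
    "num-of-doors": ["two", "four", np.nan, "four"],
    "price": [13495.0, 16500.0, 13950.0, np.nan],
})

# 1. Drop data: remove rows where a chosen column is missing.
dropped_rows = toy.dropna(subset=["price"])

# 2. Replace data: mean for a numeric column, most frequent value for a categorical one.
filled = toy.copy()
filled["bore"] = filled["bore"].fillna(filled["bore"].mean())
filled["num-of-doors"] = filled["num-of-doors"].fillna(filled["num-of-doors"].mode()[0])

# 3. Drop attribute: remove a whole column.
dropped_column = toy.drop(columns=["bore"])

print(dropped_rows, filled, dropped_column, sep="\n\n")
```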


### Normalized Losses
The normalized losses attribute has 41 missing values, about 20% of the dataset. That is too small a share to justify dropping the attribute. Since it is a numerical variable and its distribution is only slightly asymmetrical, replacing the missing values with the mean is recommended.

In [None]:
# visualize the distribution of normalized losses
plt_lab.hist(df["normalized-losses"].astype("float"), bins = 10)

In [None]:
# calculate the mean value for normalized losses
mean_normalized_losses = df["normalized-losses"].astype("float").mean(axis=0)
print("Average of normalized losses:", mean_normalized_losses)

# replace NaN with mean value in "normalized-losses" column
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean_normalized_losses)
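
One way to check whether the mean is a reasonable fill value is to compare it with the median and look at the skewness. The sketch below does this on made-up values; using `pd.to_numeric` and `skew()` here is an illustrative choice, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the column after "?" has become NaN.
losses = pd.Series(["164", "158", np.nan, "192", "148", np.nan, "161"])

# The column holds strings, so convert to numbers first (NaN stays NaN).
numeric = pd.to_numeric(losses)

print("mean:  ", numeric.mean())
print("median:", numeric.median())
print("skew:  ", numeric.skew())  # values near 0 suggest a roughly symmetric distribution

# Mean imputation written as an assignment rather than an in-place edit.
imputed = numeric.fillna(numeric.mean())
print(imputed)
```

When the mean and median are far apart, the median is often the safer fill value.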

### Num of Doors
The num of doors attribute has only 2 missing values. It is a categorical variable, so replacing the missing values with the most frequent value is recommended.

In [None]:
# calculate the frequency of each value in the "num-of-doors" column
print("The frequency of each value in the num-of-doors column is:\n", df["num-of-doors"].value_counts())

# replace the missing 'num-of-doors' values by the most frequent
df["num-of-doors"] = df["num-of-doors"].replace(np.nan, "four")
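
The most frequent value can also be looked up programmatically instead of being typed in by hand. This is a minimal sketch on a made-up column; `value_counts().idxmax()` is one of several ways to find the mode.

```python
import numpy as np
import pandas as pd

doors = pd.Series(["four", "two", "four", np.nan, "four", "two"], name="num-of-doors")

# value_counts() ignores NaN by default; idxmax() returns the label with the highest count.
most_frequent = doors.value_counts().idxmax()
print("Most frequent value:", most_frequent)

doors_filled = doors.fillna(most_frequent)
print(doors_filled)
```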


### Bore
The bore attribute has only 4 missing values. It is a numerical variable, so replacing the missing values with the mean is recommended.

In [None]:
# calculate the mean value for bore
mean_bore = df["bore"].astype("float").mean(axis=0)
print("Average of bore:", mean_bore)

# replace NaN with mean value in "bore" column
df["bore"] = df["bore"].replace(np.nan, mean_bore)

### Stroke
The stroke attribute has only 4 missing values. It is a numerical variable, so replacing the missing values with the mean is recommended.

In [None]:
# calculate the mean value for stroke
mean_stroke = df["stroke"].astype("float").mean(axis=0)
print("Average of stroke:", mean_stroke)

# replace NaN with mean value in "stroke" column
df["stroke"] = df["stroke"].replace(np.nan, mean_stroke)

### Horsepower
The horsepower attribute has only 2 missing values. It is a numerical variable, so replacing the missing values with the mean is recommended.


In [None]:
# calculate the mean value for horsepower
mean_horsepower = df["horsepower"].astype("float").mean(axis=0)
print("Average horsepower:", mean_horsepower)

# replace NaN with mean value in "horsepower" column
df["horsepower"] = df["horsepower"].replace(np.nan, mean_horsepower)

### Peak RPM
The peak rpm attribute has only 2 missing values. It is a numerical variable, so replacing the missing values with the mean is recommended.

In [None]:
# calculate the mean value for peak-rpm
mean_peak_rpm = df["peak-rpm"].astype("float").mean(axis=0)
print("Average peak rpm:", mean_peak_rpm)

# replace NaN with mean value in "peak-rpm" column
df["peak-rpm"] = df["peak-rpm"].replace(np.nan, mean_peak_rpm)
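
Bore, stroke, horsepower, and peak rpm all follow the same mean-replacement pattern, so the steps could be folded into one helper. The sketch below uses made-up values, and `fill_with_mean` is a hypothetical name. Unlike the cells above, it also converts each column to float as it goes.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bore": ["3.47", np.nan, "3.19"],
    "stroke": ["2.68", "3.47", np.nan],
    "horsepower": ["111", np.nan, "154"],
    "peak-rpm": [np.nan, "5000", "5500"],
})

def fill_with_mean(frame, columns):
    """Return a copy where NaN in each listed column is replaced by that column's mean."""
    out = frame.copy()
    for col in columns:
        values = out[col].astype("float")      # text -> numbers, NaN stays NaN
        out[col] = values.fillna(values.mean())
    return out

cleaned = fill_with_mean(toy, ["bore", "stroke", "horsepower", "peak-rpm"])
print(cleaned)
```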

### Price
The price attribute has only 4 missing values. However, it is the target variable. As such, the rows with missing values are dropped.

In [None]:
# drop all rows that do not have price data
df.dropna(subset=["price"], axis=0, inplace=True)

# reset the index, because dropping rows leaves gaps in the row labels
df.reset_index(drop=True, inplace=True)
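
The effect of the index reset is easiest to see on a tiny made-up frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"make": ["audi", "bmw", "isuzu", "porsche"],
                    "price": [13950.0, np.nan, np.nan, 22018.0]})

after_drop = toy.dropna(subset=["price"])
print(after_drop.index.tolist())   # [0, 3] -> original labels survive, leaving gaps

renumbered = after_drop.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1] -> labels renumbered from 0
```

Passing `drop=True` discards the old labels instead of keeping them as a new "index" column.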

## b. Data standardization
In the auto dataset, data standardization can be applied to make variables comparable and interpretable. For example, "normalized-losses", "wheel-base", "length", "width", and "height" are measured in different units and on different scales, which makes them difficult to compare directly. Standardizing them transforms them into a common format, such as z-scores.

Standardization can also help reveal the relationship between the independent and dependent variables. Subtracting the mean and dividing by the standard deviation puts every variable on the same scale, which can expose patterns that are not apparent in the original units.
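
The z-score transformation described above can be sketched in a few lines. The values are made up, and the calculation uses plain pandas arithmetic rather than any particular scaler class.

```python
import pandas as pd

toy = pd.DataFrame({
    "wheel-base": [88.6, 94.5, 99.8, 105.8],   # inches
    "curb-weight": [2548, 2823, 2337, 2844],   # pounds
})

# z = (x - mean) / standard deviation, computed column by column.
# pandas' std() uses the sample standard deviation (ddof=1) by default.
z_scores = (toy - toy.mean()) / toy.std()

print(z_scores.round(3))
```

After the transformation each column has mean 0 and standard deviation 1, so inches and pounds can be compared on the same footing.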

### Correcting the data format of the columns
It has been observed that some columns in the dataset have incorrect data types. To fix this issue, the "astype()" method will be used to convert the data types of each column to the appropriate format.

In [None]:
# convert attributes to correct data format
df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")
df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
df[["price"]] = df[["price"]].astype("float")
df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")

df.dtypes

### Unit Conversion
To compare fuel economy across vehicles, the values need a common unit of measurement. L/100km is widely used internationally for fuel consumption, so mpg is converted to L/100km here.

In [None]:
# convert mpg to L/100km by mathematical operation (235 divided by mpg) 
df['city-L/100km'] = 235/df["city-mpg"]
df['highway-L/100km'] = 235/df["highway-mpg"]

# drop the original city-mpg and highway-mpg columns
df.drop(['city-mpg', 'highway-mpg'], axis=1, inplace=True)
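
The constant 235 is a rounded conversion factor. Assuming the dataset's mpg figures use US gallons, it can be derived as follows; `mpg_to_l_per_100km` is a hypothetical helper name.

```python
LITRES_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344

# L/100 km = 100 * (litres per gallon) / (km per mile) / mpg
factor = 100 * LITRES_PER_US_GALLON / KM_PER_MILE
print(round(factor, 3))  # 235.215

def mpg_to_l_per_100km(mpg):
    """Convert miles per US gallon to litres per 100 km."""
    return factor / mpg

print(round(mpg_to_l_per_100km(21), 2))
```

Rounding the factor to 235 changes the result by roughly 0.1%, which is negligible for this analysis.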

## c. Data normalization
In the auto dataset, normalization is useful when comparing variables that have different scales and units of measurement. For example, "normalized-losses", "wheel-base", "length", "width", and "height" have different ranges, so scaling them to similar ranges makes comparisons and analysis more meaningful. The cell below applies simple feature scaling, dividing each value by the column maximum.

In [None]:
# replace original value by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max()
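
For comparison, the sketch below places division by the maximum next to min-max scaling, a common alternative that is not used in this notebook. The lengths are made-up values.

```python
import pandas as pd

length = pd.Series([168.8, 171.2, 176.6, 192.7], name="length")

# Simple feature scaling: the largest value maps to 1, the rest stay above 0.
simple = length / length.max()

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
min_max = (length - length.min()) / (length.max() - length.min())

print(pd.DataFrame({"simple": simple, "min_max": min_max}).round(3))
```

Simple scaling preserves the ratios between values, while min-max scaling stretches them over the full 0 to 1 range.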

## d. Indicator variable
Indicator (dummy) variables are used for the "fuel-type" and "aspiration" attributes so that these categorical variables can be included in regression analysis, which allows their effect on the dependent variable to be examined.

### Fuel Type

In [None]:
# create dummy variables for fuel-type
dummy_variable_1 = pd.get_dummies(df["fuel-type"])
dummy_variable_1.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)

# merge the dummy variables into the original dataframe
df = pd.concat([df, dummy_variable_1], axis=1)

# drop the original column "fuel-type" from "df"
df.drop("fuel-type", axis = 1, inplace=True)


### Aspiration

In [None]:
# create dummy variables for aspiration
dummy_variable_2 = pd.get_dummies(df['aspiration'])
dummy_variable_2.rename(columns={'std':'aspiration-std', 'turbo':'aspiration-turbo'}, inplace=True)

# merge the dummy variables into the original dataframe
df = pd.concat([df, dummy_variable_2], axis=1)

# drop the original column "aspiration" from "df"
df.drop('aspiration', axis = 1, inplace=True)
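
The create, rename, merge, and drop steps above can also be done in a single call. The sketch below is an alternative shown on made-up rows, not the approach this notebook takes; `prefix_sep="-"` is chosen only so the generated names match the ones used above.

```python
import pandas as pd

toy = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas"],
                    "aspiration": ["std", "turbo", "std"],
                    "price": [13495.0, 31600.0, 16500.0]})

# Encode both columns at once; the original columns are replaced by the dummies.
encoded = pd.get_dummies(toy, columns=["fuel-type", "aspiration"], prefix_sep="-")
print(encoded.columns.tolist())

# drop_first=True keeps one indicator fewer per attribute, which avoids
# perfectly collinear columns in a linear regression.
reduced = pd.get_dummies(toy, columns=["fuel-type", "aspiration"],
                         prefix_sep="-", drop_first=True)
print(reduced.columns.tolist())
```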

# Phase 3: Exploratory Data Analysis

## Analyzing individual feature patterns using visualization

### Continuous Numerical Variables

### Categorical Variables

# Data Modeling

# Data Evaluation