# Classification of Car Acceptability

Let's walk through a machine learning workflow to classify the acceptability of a car into one of four classes (`unacc`, `acc`, `good`, `vgood`) based on the features provided. We'll start by adding the headers, then proceed step by step through the ML workflow.

The dataset comes from [Kaggle](https://www.kaggle.com/datasets/elikplim/car-evaluation-data-set).

## Data Attributes
1. `buying`: Car buying price (categorical: 'vhigh', 'high', 'med', 'low')
2. `maint`: Maintenance price (categorical: 'vhigh', 'high', 'med', 'low')
3. `door`: Number of doors (categorical: '2', '3', '4', '5more')
4. `persons`: Person capacity (categorical: '2', '4', 'more')
5. `lug_boot`: Luggage boot size (categorical: 'small', 'med', 'big')
6. `safety`: Safety of the car (categorical: 'low', 'med', 'high')
7. `class`: Acceptability of the car (categorical: 'unacc', 'acc', 'good', 'vgood')

## Step 1: Load and Add Headers to the Data
First, we'll load the data from the CSV file and add the headers.

Explanation:
- We'll use pandas to load the data.
- Add headers to the dataset: `buying`, `maint`, `door`, `persons`, `lug_boot`, `safety`, `class`.

In [None]:
import pandas as pd

# Load the data without headers
data = pd.read_csv('car-evaluation.csv', header=None)

# Add headers to the dataset
headers = ['buying', 'maint', 'door', 'persons', 'lug_boot', 'safety', 'class']
data.columns = headers

# Display the first few rows of the dataset
data.head()
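
As an alternative sketch, `read_csv` can assign the column names at load time through its `names` parameter. The example below uses a few inline rows in place of `car-evaluation.csv` so that it runs on its own; the rows are illustrative placeholders and are not guaranteed to match the real file.

```python
import io

import pandas as pd

# Illustrative placeholder rows standing in for car-evaluation.csv
sample_csv = """vhigh,vhigh,2,2,small,low,unacc
med,low,4,4,big,high,vgood
low,med,5more,more,med,med,acc
"""

headers = ['buying', 'maint', 'door', 'persons', 'lug_boot', 'safety', 'class']

# header=None: the file has no header row; names=...: assign column names on load.
# dtype=str keeps values such as '2' and '4' as strings, i.e. as category labels.
sample = pd.read_csv(io.StringIO(sample_csv), header=None, names=headers, dtype=str)

print(sample.shape)
print(sample['class'].value_counts())
```

Calling `value_counts()` on the `class` column of the real data is a quick way to see how balanced the four classes are before modelling.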

## Step 2: Data Preprocessing

We'll encode categorical variables, handle any missing values, and prepare the data for training.

Explanation:
- Convert categorical variables to numerical using one-hot encoding.
- Check for missing values and handle them if necessary.

In [None]:
# Convert categorical variables to numerical using one-hot encoding
data_encoded = pd.get_dummies(data, columns=['buying', 'maint', 'door', 'persons', 'lug_boot', 'safety'])

# Display the first few rows of the encoded dataset
print(data_encoded.head())

# Check for missing values
print(data_encoded.isnull().sum())
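
To make the effect of one-hot encoding concrete, here is a minimal sketch on a two-row toy frame (the frame `toy` is made up for illustration). It shows that `get_dummies` creates one indicator column per category that actually occurs in the data, which matters later when encoding a single new car.

```python
import pandas as pd

# Two toy rows; the values are taken from the attribute list above
toy = pd.DataFrame({
    'safety': ['low', 'high'],
    'lug_boot': ['small', 'big'],
    'class': ['unacc', 'acc'],
})

# Each listed column is replaced by one indicator column per observed category;
# 'class' is not listed, so it is left untouched.
toy_encoded = pd.get_dummies(toy, columns=['safety', 'lug_boot'], dtype=int)

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Note that no `safety_med` or `lug_boot_med` column appears, because `med` does not occur in the toy rows.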

## Step 3: Split the Data into Training and Testing Sets

We need to divide the dataset into training and testing sets to evaluate the model's performance.

Explanation:
- We'll use the train_test_split function from scikit-learn to split the data.
- A common split is 80% training and 20% testing.

In [None]:
from sklearn.model_selection import train_test_split

# Define features and target
X = data_encoded.drop('class', axis=1)
y = data_encoded['class']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Display the shape of the training and testing sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
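
The call above shuffles differently on every run, so the metrics reported later will vary. A possible refinement, shown here as a sketch on hypothetical toy labels rather than the real dataset, is to pass `random_state` for reproducibility and `stratify` to keep the class proportions similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels, used only to illustrate the mechanics
y_toy = pd.Series(['unacc'] * 70 + ['acc'] * 20 + ['good'] * 5 + ['vgood'] * 5)
X_toy = pd.DataFrame({'row_id': range(len(y_toy))})

# random_state fixes the shuffle so the split is reproducible;
# stratify keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.value_counts(normalize=True).round(2))
print(y_te.value_counts(normalize=True).round(2))
```

In the notebook itself, the equivalent change would be adding `random_state=...` and `stratify=y` to the existing `train_test_split` call.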

## Step 4: Train a Machine Learning Model

We'll train a model using a classification algorithm. For simplicity, we'll start with a K-Nearest Neighbours Classifier.

Explanation:
- We'll use the `KNeighborsClassifier` class from scikit-learn.
- Fit the model to the training data.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize and train the model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
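
`KNeighborsClassifier` uses Euclidean distance by default. On one-hot features this distance depends only on how many attributes differ between two cars, as the following sketch with two made-up cars illustrates.

```python
import numpy as np
import pandas as pd

features = ['buying', 'maint', 'door', 'persons', 'lug_boot', 'safety']

# Two example cars that differ in exactly two attributes (door and safety)
cars = pd.DataFrame([
    {'buying': 'med', 'maint': 'med', 'door': '2', 'persons': '4', 'lug_boot': 'big', 'safety': 'high'},
    {'buying': 'med', 'maint': 'med', 'door': '4', 'persons': '4', 'lug_boot': 'big', 'safety': 'low'},
])

encoded = pd.get_dummies(cars, columns=features, dtype=int)

# Each differing attribute flips two indicator columns (one 1->0, one 0->1),
# so the squared Euclidean distance is 2 * (number of differing attributes).
n_differing = int((cars.iloc[0] != cars.iloc[1]).sum())
squared_distance = int(((encoded.iloc[0] - encoded.iloc[1]) ** 2).sum())

print(n_differing, squared_distance, np.sqrt(squared_distance))
```

One consequence is that the encoding treats `low` and `high` as no further apart than `low` and `med`; the order of the categories is not used.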

## Step 5: Evaluate the Model

We'll evaluate the model's performance using metrics such as `accuracy`, `precision`, `recall`, and `F1-score`.

Explanation:
- Predict the target values for the test set.
- Calculate evaluation metrics to assess model performance.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict the target values for the test set
y_pred = knn.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-score: {f1}')
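
Because the target has four classes, a per-class breakdown is often more informative than the four summary numbers. The sketch below uses hypothetical labels (`y_true_toy` and `y_pred_toy` are made up) to show how `confusion_matrix` and `classification_report` expose per-class results, and how the macro precision is the plain mean of the per-class precisions.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, precision_score

labels = ['unacc', 'acc', 'good', 'vgood']

# Hypothetical true and predicted labels, for illustration only
y_true_toy = ['unacc', 'unacc', 'unacc', 'unacc', 'acc', 'acc', 'good', 'vgood']
y_pred_toy = ['unacc', 'unacc', 'unacc', 'acc', 'acc', 'unacc', 'acc', 'vgood']

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
print(cm)

# Per-class precision = diagonal / column sum (0 where a class is never predicted)
col_sums = cm.sum(axis=0)
per_class_precision = np.divide(
    cm.diagonal(), col_sums, out=np.zeros(len(labels)), where=col_sums > 0
)

macro_precision = precision_score(
    y_true_toy, y_pred_toy, labels=labels, average='macro', zero_division=0
)
print(per_class_precision.mean(), macro_precision)

print(classification_report(y_true_toy, y_pred_toy, labels=labels, zero_division=0))
```

In the notebook, the same two functions can be called with `y_test` and `y_pred` to see which classes the model confuses.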

### Model Evaluation Metrics

#### Accuracy:

Accuracy=0.8757

- Explanation: Accuracy is the proportion of test cars whose predicted class matches the true class, counted across all four classes. An accuracy of 0.8757 means that the model correctly classified about 87.57% of the test cases in this run.
- Interpretation: The overall accuracy is fairly high. Accuracy alone can be misleading when the classes are imbalanced, so it should be read together with the macro-averaged metrics below.

#### Precision:

Precision=0.7524

- Explanation: For a single class, precision is the ratio of true positive predictions to all predictions of that class (true positives + false positives). Because the code uses `average='macro'`, the reported value is the unweighted mean of the per-class precisions over `unacc`, `acc`, `good`, and `vgood`. A macro precision of 0.7524 means that, averaged over the four classes, about 75.24% of the predictions for a class are correct.
- Interpretation: The precision is moderately high, so when the model predicts a class it is right most of the time, although at least some classes attract false positives.

#### Recall:

Recall=0.5763

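- Explanation: For a single class, recall (also known as sensitivity or true positive rate) is the ratio of true positive predictions to the total number of actual members of that class (true positives + false negatives). With `average='macro'`, the reported value is the unweighted mean of the per-class recalls. A macro recall of 0.5763 means that, averaged over the four classes, about 57.63% of the cars that truly belong to a class are identified as such.
- Interpretation: The macro recall is well below both the accuracy and the macro precision, indicating that the model misses a significant share of cars in at least one class (false negatives). Since macro averaging weights every class equally, weak recall on a small class pulls the value down noticeably.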

#### F1-score:

F1-score=0.6156

- Explanation: For a single class, the F1-score is the harmonic mean of that class's precision and recall. With `average='macro'`, the reported value is the unweighted mean of the per-class F1-scores, so it need not equal the harmonic mean of the macro precision and macro recall reported above. A macro F1-score of 0.6156 suggests there is room for improvement.
- Interpretation: The F1-score is moderate, indicating that the model performs noticeably worse on some classes than the overall accuracy suggests.
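
The `average='macro'` setting used in the code above can be written out explicitly. This is a sketch of the standard definitions, where $K$ is the number of classes (four here) and $TP_k$, $FP_k$, $FN_k$ are the true positives, false positives, and false negatives for class $k$; these symbols are introduced here for illustration and do not appear in the code.

$$
P_k = \frac{TP_k}{TP_k + FP_k}, \qquad
R_k = \frac{TP_k}{TP_k + FN_k}, \qquad
F_{1,k} = \frac{2\,P_k\,R_k}{P_k + R_k}
$$

$$
P_{\text{macro}} = \frac{1}{K}\sum_{k=1}^{K} P_k, \qquad
R_{\text{macro}} = \frac{1}{K}\sum_{k=1}^{K} R_k, \qquad
F_{1,\text{macro}} = \frac{1}{K}\sum_{k=1}^{K} F_{1,k}
$$

As a quick check on the reported numbers, the harmonic mean of the two macro values is

$$
\frac{2 \times 0.7524 \times 0.5763}{0.7524 + 0.5763} \approx 0.653,
$$

which differs from the reported F1-score of 0.6156, as expected when the F1-score is averaged per class.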


### Summary
- `Accuracy` (87.57%): High overall share of correct predictions across the four classes.
- `Precision` (75.24%, macro): Averaged over the classes, most predictions of a class are correct, but there are some false positives.
- `Recall` (57.63%, macro): Averaged over the classes, a notable proportion of cars is not assigned to its true class, leading to false negatives.
- `F1-score` (61.56%, macro): The per-class balance of precision and recall is moderate, which suggests that there is room for improvement.

## Step 6: Predict Car Acceptability for a New Car

- `Collect New Car Data`: Ensure the new data includes values for the features buying, maint, door, persons, lug_boot, and safety.
- `Preprocess the New Data`: Ensure the new data is in the same format as the training data (i.e., one-hot encoded).
- `Make Predictions`: Use the trained model to predict the class for the new data.

In [15]:
# Data attributes information
# `buying`: Car buying price (categorical: 'vhigh', 'high', 'med', 'low')
# `maint`: Maintenance price (categorical: 'vhigh', 'high', 'med', 'low')
# `door`: Number of doors (categorical: '2', '3', '4', '5more')
# `persons`: Person capacity (categorical: '2', '4', 'more')
# `lug_boot`: Luggage boot size (categorical: 'small', 'med', 'big')
# `safety`: Safety of the car (categorical: 'low', 'med', 'high')

# New car data
new_car = {
    'buying': 'med',
    'maint': 'med',
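    # NOTE: '1' is not one of the documented `door` categories ('2', '3', '4', '5more'),
    # so every door_* column is 0 after reindexing; use a valid value for a meaningful prediction.
    'door': '1',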
    'persons': 'more',
    'lug_boot': 'big',
    'safety': 'high'
}

# Convert the new car data to a DataFrame
new_car_data = pd.DataFrame([new_car])

# One-hot encode the new car data to match the training data format
new_car_encoded = pd.get_dummies(new_car_data)
new_car_encoded = new_car_encoded.reindex(columns=X_train.columns, fill_value=0)

# Make a prediction
new_car_prediction = knn.predict(new_car_encoded)

print(f'Predicted acceptability for the new car: {new_car_prediction[0]}')

Predicted acceptability for the new car: vgood
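
Since `reindex` silently fills unknown categories with zeros, it can help to validate a new car before encoding it. The following is a minimal sketch, not part of the original notebook: `ALLOWED`, `encode_car`, and `train_columns` are hypothetical names, and `train_columns` stands in for `X_train.columns` so the example runs on its own.

```python
import pandas as pd

# Allowed values per feature, copied from the attribute list in this document
ALLOWED = {
    'buying': ['vhigh', 'high', 'med', 'low'],
    'maint': ['vhigh', 'high', 'med', 'low'],
    'door': ['2', '3', '4', '5more'],
    'persons': ['2', '4', 'more'],
    'lug_boot': ['small', 'med', 'big'],
    'safety': ['low', 'med', 'high'],
}


def encode_car(car, columns):
    """Hypothetical helper: validate one car dict and one-hot encode it.

    `columns` is the list of training feature columns, e.g. X_train.columns.
    """
    for feature, allowed in ALLOWED.items():
        value = car.get(feature)
        if value not in allowed:
            raise ValueError(f'{feature}={value!r} is not one of {allowed}')
    encoded = pd.get_dummies(pd.DataFrame([car]), dtype=int)
    return encoded.reindex(columns=columns, fill_value=0)


# Stand-in for X_train.columns so the sketch runs on its own
train_columns = [f'{f}_{v}' for f, values in ALLOWED.items() for v in values]

valid_car = {'buying': 'med', 'maint': 'med', 'door': '4',
             'persons': 'more', 'lug_boot': 'big', 'safety': 'high'}
row = encode_car(valid_car, train_columns)
print(int(row.iloc[0].sum()))  # six features, so six indicator columns are set

try:
    encode_car({**valid_car, 'door': '1'}, train_columns)
except ValueError as err:
    print(err)
```

With the trained model available, the returned row could be passed to `knn.predict` in place of `new_car_encoded`.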
