# Third Assignment

For this assignment you will have to identify a suitable dataset and implement a machine learning pipeline, comprising the following steps:

* Presentation of the problem (6 points)
* data exploration (4 points)
* data preparation (10 points)
* model selection: try 2 models (max 3) (20 points)
* evaluation on test set (3 points)
* conclusion commenting on the results and comparing the models (7 points)

**Total: 50 points**

Some possible datasets:
* [Titanic dataset](https://www.kaggle.com/c/titanic): binary classification
* [Wine dataset](https://www.kaggle.com/datasets/yasserh/wine-quality-dataset): multi-class classification/regression
* [Abalone dataset](https://www.kaggle.com/datasets/rodolfomendes/abalone-dataset): regression
* [California housing price](https://www.kaggle.com/datasets/camnugent/california-housing-prices): regression
* [CIFAR-10](http://www.cs.toronto.edu/~kriz/cifar.html): multi-class classification in images

Look for more on [Kaggle](https://www.kaggle.com/datasets) or elsewhere.

**Deadline:** To be confirmed by the tutor - please see the date on our Canvas page.

**Submission:** Please email your solutions and your completed Declaration of Authorship (DoA) form to weeklyclasses@conted.ox.ac.uk 

# Presentation of the problem (6 points)

For this assignment, I decided to create a machine learning model that can predict which country their car was manufactured in. Although many people could instantly know which country their car was manufactured through their car model I wanted to create this classifier as something that someone can use out of curiosity to see what country their car originated from. This classifier might be helpful for individuals who are not into cars that much and might not even know about different car models and where they are manufactured. Another potential application for this problem is for car part companies that store many different car parts. Having this classifier might help car repair companies perform a initial diagnosis of a individuals car (knowing where the car was manufactured could help with identifying what repair parts will be needed)

To create this ML model I will be analysing the mpg dataset which contains information about 38 new car models which have been released every year from 1999 and 2008. I will be using the milers per gallon -mpg, car displacement, horsepower, and car weight. The reason why I picked these variables is because I believe that they might have a correlation with being manufactured at a specific country (for example cars manufactured from the US might have lower mpg or horsepower). I briefly explored the mpg dataset in the "python part 1" course and noticed that the US manufactured cars that had lower mpg's, displacements, etc when compared to other countries. Seeing these results sparked my curiosity to classification algorithm that can predict where a car is manufactured by just taking in a few stats about the car.

# data exploration (4 points)

I have definded the dataset columns below and have also indicated what type of variable it is (ex. categorical or numerical):

*   **mpg** -continuous numerical:
      *   mpg means miles per gallon and it is the distance of each car measured in miles. It is referred to the *fuel economy* of a car since it represents the amount travelled by the car and the amount of fuel consumed. A *higher* mpg is better since it shows that your car can consume less fuel as you drive.
*   **cylinders** -nominal categorical:
      *   A car cylinder is a important part of the engine. This is where fuel is consumed and the car's energy is generated. If a car has *more* cylinders that means the cars power can be generated *faster*. However, there is a tradeoff to this since you need more fuel to generate more power. So if your care has more cylinders then you will end up paying for more fuel.
*    **displacement** -continuous numerical:
      *  A cars displacement refers to the amount of air that can be pushed into the cylinders. The higher the engine displacement the more power the engine can generate.
*    **horsepower** -discrete numerical:
      *  The cars horsepower is a measurment of how fast a cars engine can work. The more horsepower a car has the faster it can go
*    **weight** -continuous numerical:
      *  Car weight. The heavier a car, the more fuel it requires
*    **acceleration** -continuous numerical:
      *  Acceleration describes the rate at which the car can increase its speed. Having a high car acceleration will use more fuel
*    **model_year** -ordinal categorical:
      *  Gives the year the car was made/manufactured
*    **origin** -nominal categorical:
      *  Where the car was manufactured
*    **name** -nominal categorical:
      *  The car model

In [None]:
#installing the mpg python data package which contains information about fuel economy from 1999 and 2008 for 38 popular models of cars
#pip install mpg

In [50]:
import seaborn as sns
mpg = sns.load_dataset("mpg") #loading the dataset 
mpg[0:5]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


In [3]:
mpg.describe() #creating the summary statistic for the data

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year
count,398.0,398.0,398.0,392.0,398.0,398.0,398.0
mean,23.514573,5.454774,193.425879,104.469388,2970.424623,15.56809,76.01005
std,7.815984,1.701004,104.269838,38.49116,846.841774,2.757689,3.697627
min,9.0,3.0,68.0,46.0,1613.0,8.0,70.0
25%,17.5,4.0,104.25,75.0,2223.75,13.825,73.0
50%,23.0,4.0,148.5,93.5,2803.5,15.5,76.0
75%,29.0,8.0,262.0,126.0,3608.0,17.175,79.0
max,46.6,8.0,455.0,230.0,5140.0,24.8,82.0


# data preparation (10 points)

To prepare the data I am first going to remove the columns that I don't need/will not use for analysis

In [51]:
mpg = mpg.loc[:, ['mpg', 'displacement', 'horsepower', 'weight', 'origin']]
mpg.head(5)

Unnamed: 0,mpg,displacement,horsepower,weight,origin
0,18.0,307.0,130.0,3504,usa
1,15.0,350.0,165.0,3693,usa
2,18.0,318.0,150.0,3436,usa
3,16.0,304.0,150.0,3433,usa
4,17.0,302.0,140.0,3449,usa


I am checking to see if there are any missing/NA values in any of the columns. I notice that the horsepower column has NA values so I need to impute the values in the column and replace the NA values with a huge negative number like -999 so that the ML model can learn that these are NA values. My second option is to remove these rows from the dataset.

I checked and the number of NA values in the horsepower column is 6 so I decided to remove those rows since they did not affect a huge portion of the dataset/sample.

In [52]:
print('mpg column: ', mpg['mpg'].isna().unique(), '\n',
    'displacement column: ', mpg['displacement'].isna().unique(),'\n',
     'horsepower column: ', mpg['horsepower'].isna().unique(), '\n',
     'weight column: ', mpg['weight'].isna().unique(), '\n',
     'origin column: ', mpg['origin'].isna().unique())

print()

print('Number of NA values in horepower column: ', mpg['horsepower'].isna().sum())
print('Number of non-NA values in horspower column: ',mpg['horsepower'].count())
print('Total number of rows in dataset: ',mpg['horsepower'].size)

mpg column:  [False] 
 displacement column:  [False] 
 horsepower column:  [False  True] 
 weight column:  [False] 
 origin column:  [False]

Number of NA values in horepower column:  6
Number of non-NA values in horspower column:  392
Total number of rows in dataset:  398


In [53]:
mpg = mpg[mpg['horsepower'].notna()]
print()
print('Number of NA values in horepower column: ', mpg['horsepower'].isna().sum())
print('Number of non-NA values in horspower column: ',mpg['horsepower'].count())
print('Total number of rows in dataset: ',mpg['horsepower'].size)


Number of NA values in horepower column:  0
Number of non-NA values in horspower column:  392
Total number of rows in dataset:  392


I am going to print out the unique values in the origin column and will label these values numerically since it is good practice to label the categories by number.

In [54]:
print(mpg['origin'].unique()) #checking to see how many countries are represented in this dataset

mpg['origin'] = mpg['origin'].replace({'usa': 0, 'japan': 1, 'europe': 2})

['usa' 'japan' 'europe']


In [56]:
print(mpg['origin'].unique())

[0 1 2]


I am now going to split the training and testing data

In [55]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

#creating training and test data
features = mpg[['mpg', 'displacement', 'horsepower', 'weight']]
target = mpg['origin']

#data split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42, stratify = target)

#feature scaling
scaler_minmax = MinMaxScaler()#importing the scaler

X_train_scaled = scaler_minmax.fit_transform(X_train)
X_test_scaled = scaler_minmax.fit_transform(X_test)

# model selection: try 2 models (max 3) (20 points)

I will be using logistic regression, random forests, and Support Vector Machines -SVM to implement my classifier. I will first run the results on the training data and then the test data and will share my findings/conclusions at the end.

## Logistic Regression

In [74]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize the model
logreg_model = LogisticRegression(multi_class='auto', solver='lbfgs')
logreg_model.fit(X_train_scaled, y_train) # Fit the model to the training data
y_pred_logreg = logreg_model.predict(X_train_scaled) # Make predictions on training

# Evaluate the model on training data
accuracy = accuracy_score(y_train, y_pred_logreg)
print(f"Logistic Regression Accuracy: {accuracy:.2f}")

print("\nConfusion Matrix on training set:\n", confusion_matrix(y_train, y_pred_logreg))
print("\nClassification Report on training set:\n", classification_report(y_train, y_pred_logreg))

Logistic Regression Accuracy: 0.70

Confusion Matrix on training set:
 [[178  18   0]
 [ 20  41   2]
 [ 26  27   1]]

Classification Report on training set:
               precision    recall  f1-score   support

           0       0.79      0.91      0.85       196
           1       0.48      0.65      0.55        63
           2       0.33      0.02      0.04        54

    accuracy                           0.70       313
   macro avg       0.53      0.53      0.48       313
weighted avg       0.65      0.70      0.65       313



## Random Forests

In [65]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the model
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train_scaled, y_train) # Fit the model to the training data
y_pred_rf = rf_classifier.predict(X_train_scaled) # Make predictions on training

# Evaluate the model on traing data
accuracy_rf = accuracy_score(y_train, y_pred_rf)
print(f"Random Forest Classifier Accuracy: {accuracy_rf:.2f}")

print("\nConfusion Matrix on training set:\n", confusion_matrix(y_train, y_pred_rf))
print("\nClassification Report on training set:\n", classification_report(y_train, y_pred_rf))

Random Forest Classifier Accuracy: 1.00

Confusion Matrix on training set:
 [[196   0   0]
 [  0  63   0]
 [  0   0  54]]

Classification Report on training set:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       196
           1       1.00      1.00      1.00        63
           2       1.00      1.00      1.00        54

    accuracy                           1.00       313
   macro avg       1.00      1.00      1.00       313
weighted avg       1.00      1.00      1.00       313



## Support Vector Machines (SVM)

In [73]:
from sklearn.svm import SVC

# Initialize the model
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train_scaled, y_train) # Fit the model to the training data
y_pred_svm = svm_classifier.predict(X_train_scaled) # Make predictions on training data

# Evaluate the model
accuracy_svm = accuracy_score(y_train, y_pred_svm)
print(f"SVM Classifier Accuracy: {accuracy_svm:.2f}")

print("\nConfusion Matrix on training set:\n", confusion_matrix(y_train, y_pred_svm))
print("\nClassification Report on training set:\n", classification_report(y_train, y_pred_svm,zero_division=0))

SVM Classifier Accuracy: 0.69

Confusion Matrix on training set:
 [[175  21   0]
 [ 22  41   0]
 [ 24  30   0]]

Classification Report on training set:
               precision    recall  f1-score   support

           0       0.79      0.89      0.84       196
           1       0.45      0.65      0.53        63
           2       0.00      0.00      0.00        54

    accuracy                           0.69       313
   macro avg       0.41      0.51      0.46       313
weighted avg       0.59      0.69      0.63       313



# evaluation on test set (3 points)

## Logistic Regression

In [71]:
y_pred_logreg_test = logreg_model.predict(X_test_scaled) # Make predictions on test data

# Evaluate the model on training data
accuracy = accuracy_score(y_test, y_pred_logreg_test)
print(f"Logistic Regression Accuracy: {accuracy:.2f}")

print("\nConfusion Matrix on training set:\n", confusion_matrix(y_test, y_pred_logreg_test))
print("\nClassification Report on training set:\n", classification_report(y_test, y_pred_logreg_test,zero_division=0))

Logistic Regression Accuracy: 0.70

Confusion Matrix on training set:
 [[46  3  0]
 [ 7  9  0]
 [10  4  0]]

Classification Report on training set:
               precision    recall  f1-score   support

           0       0.73      0.94      0.82        49
           1       0.56      0.56      0.56        16
           2       0.00      0.00      0.00        14

    accuracy                           0.70        79
   macro avg       0.43      0.50      0.46        79
weighted avg       0.57      0.70      0.62        79



## Random Forsests

In [69]:
y_pred_rf_test = rf_classifier.predict(X_test_scaled) # Make predictions on test data

# Evaluate the model on traing data
accuracy_rf = accuracy_score(y_test, y_pred_rf_test)
print(f"Random Forest Classifier Accuracy: {accuracy_rf:.2f}")

print("\nConfusion Matrix on training set:\n", confusion_matrix(y_test, y_pred_rf_test))
print("\nClassification Report on training set:\n", classification_report(y_test, y_pred_rf_test))

Random Forest Classifier Accuracy: 0.75

Confusion Matrix on training set:
 [[49  0  0]
 [ 7  7  2]
 [ 4  7  3]]

Classification Report on training set:
               precision    recall  f1-score   support

           0       0.82      1.00      0.90        49
           1       0.50      0.44      0.47        16
           2       0.60      0.21      0.32        14

    accuracy                           0.75        79
   macro avg       0.64      0.55      0.56        79
weighted avg       0.71      0.75      0.71        79



## Support Vector Machines (SVM)

In [72]:
# Make predictions on training data
y_pred_svm_test = svm_classifier.predict(X_test_scaled)

# Evaluate the model
accuracy_svm = accuracy_score(y_test, y_pred_svm_test)
print(f"SVM Classifier Accuracy: {accuracy_svm:.2f}")

print("\nConfusion Matrix on training set:\n", confusion_matrix(y_test, y_pred_svm_test))
print("\nClassification Report on training set:\n", classification_report(y_test, y_pred_svm_test,zero_division=0))

SVM Classifier Accuracy: 0.70

Confusion Matrix on training set:
 [[46  3  0]
 [ 7  9  0]
 [10  4  0]]

Classification Report on training set:
               precision    recall  f1-score   support

           0       0.73      0.94      0.82        49
           1       0.56      0.56      0.56        16
           2       0.00      0.00      0.00        14

    accuracy                           0.70        79
   macro avg       0.43      0.50      0.46        79
weighted avg       0.57      0.70      0.62        79



# conclusion commenting on the results and comparing the models (7 points)

Before I share my analysis I want to show the results of the model accuracy score, the confusion matrix results, and F1 scores in tables below. 

### Accuracy Score

| Data | Logistic Regression | Random Forest | Support Vector Machine (SVM) |
| ---  | --- | --- | --- |
| Training Data | 0.70 | 1.00 | 0.69 |
| Testing Data | 0.70 | 0.75 | 0.70 |
| Difference | 0 | -0.25 | +0.01 |

### Confusion Matrix Results

** I am only displaying the correctly classified results of model in this table. The order of the results is USA-Japan-Europe

| Data | Logistic Regression | Random Forest | Support Vector Machine (SVM) |
| ---  | --- | --- | --- |
| Training Data | 178--41--1 | 196--63--54 | 175--41--0 |
| Testing Data | 46--9--0 | 49--7--3 | 46--9--0 |

### F1 Score Results

The order of the results is USA-Japan-Europe

| Data | Logistic Regression | Random Forest | Support Vector Machine (SVM) |
| ---  | --- | --- | --- |
| Training Data | 0.85--0.55--0.04 | 1--1--1 | 0.84--0.53--0 |
| Testing Data | 0.82--0.56--0 | 0.75--0.47--0.32 | 0.82--0.56--0 |

### Conclusions and Analysis

After looking and analysing the results of each classification model I would say that the Random Forest classfier was the best model to correctly classify all of the car's country of manufactuting. First off, the Random Forest classifier has a accuracy score of 1 on the training data which was suprising since I never imagined that it would be able to correctly classify all of the car's country of origin. Even though the Random Forest accuracy decreased on the test data by 0.25 the accuracy was still the highest compared to the logistic regression and SVM (although the accuracy scores of these two algorithms were not far off from the random forest since they were both 0.70).

I was leaning towards saying that all of the models were fairly accurate since on the test data, the accuracy scores were not too far off. But when looking at the confusion matrix results I realised that none of the models was able to correctly classify the cars that were manufactured in Europe. The Random Forest classifier was able to correctly identify Europe as the car's manufactured place in both, the training and test datasets. Similarily, you notice that the F1 scores for Europe in the Random Forest classifier is the highest in the training and testing data while the logistic regression and SVM both had a F1 score of zero for Europe.