# Palmer Penguins Solution

Below are several solutions to the problem. To solve it, we use a number of different machine learning algorithms to classify the data:

- Random Forest
- XGBoost (gradient-boosted trees)
- Support Vector Machines
- K-Nearest Neighbour
- Logistic Regression

We will run these, find which one works best and then try to explain why.

In [35]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier


In [36]:
df = pd.read_csv("./data/penguin_data.csv")

## Encoding

In ML, feature "encoding" means converting text data to numbers so that the model can work with it. For example, our data has two text feature columns, "island" and "sex", and the target column, "species", is text as well. We need to convert these into numbers. This can be done with small mapping functions that are applied to each column with `apply`.

In [37]:
# Define a function to map 'island' values to numeric values
def map_island_to_numeric(island_name):
    island_mapping = {
        'Torgersen': 0,
        'Biscoe': 1,
        'Dream': 2
    }
    
    return island_mapping.get(island_name, -1)  # Return -1 for unknown or missing values

In [38]:

# Define a function to map 'sex' values to 0 for male and 1 for female
def map_sex_to_binary(sex):
    if sex == 'male':
        return 0
    elif sex == 'female':
        return 1
    else:
        return None  # Handle missing or other cases if necessary


In [43]:
# Define a function to map 'species' values to numeric values
def map_species_to_numeric(species):
    species_mapping = {
        'Adelie': 0,
        'Chinstrap': 1,
        'Gentoo': 2
    }
    
    return species_mapping.get(species, -1) 
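
As an alternative to the three functions above, the same lookups can be written with pandas' `Series.map` and a plain dictionary. The sketch below is illustrative only: the `toy` frame and its values are made up, and the lowercase `male`/`female` spellings are an assumption carried over from `map_sex_to_binary`. It also shows `pd.get_dummies`, which produces one column per island instead of a single integer code.

```python
import pandas as pd

# Toy frame standing in for the penguin data (values are illustrative only)
toy = pd.DataFrame({
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
    "sex": ["male", "female", "female", None],
})

island_codes = {"Torgersen": 0, "Biscoe": 1, "Dream": 2}
sex_codes = {"male": 0, "female": 1}

# Series.map looks each value up in the dict; values not found become NaN
toy["island_numeric"] = toy["island"].map(island_codes)
toy["sex_binary"] = toy["sex"].map(sex_codes)

# One-hot encoding: one indicator column per island, so no order (0 < 1 < 2) is implied
island_one_hot = pd.get_dummies(toy["island"], prefix="island")

print(toy)
print(island_one_hot)
```

Integer codes are compact, but distance-based models such as KNN will treat island 2 as "further" from island 0 than island 1 is. One-hot columns avoid that at the cost of extra features.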

## Train Test Split

We need to understand how our model is performing. To do this, we separate the data into training and test sets:

- Training data: the model is fitted to this data.
- Test data: the model is tested against this data.

We can then calculate how effective our model is.
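
The split itself can be made repeatable and class-balanced. The following is a minimal sketch on made-up data (`X_toy` and `y_toy` are hypothetical), using the `random_state` and `stratify` parameters of `train_test_split`. It is not the exact call used in this notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, imbalanced across three classes
X_toy = [[i, i * 2] for i in range(20)]
y_toy = [0] * 10 + [1] * 5 + [2] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,     # 20% of the rows go to the test set
    random_state=42,   # fixes the shuffle so the split is repeatable
    stratify=y_toy,    # keeps class proportions the same in both sets
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```

Stratifying matters most for the smallest class: with only a handful of test rows, an unlucky random split can leave it almost unrepresented.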

In [44]:
df.dropna(inplace=True)

# Apply the function to create a new 'island_numeric' column
df['island_numeric'] = df['island'].apply(map_island_to_numeric)

# Apply the function to create a new 'sex_binary' column
df['sex_binary'] = df['sex'].apply(map_sex_to_binary)

# Apply the function to replace the text 'species' labels with numeric codes
df['species'] = df['species'].apply(map_species_to_numeric)

# Define the features (X) and target (y)
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex_binary', 'island_numeric', 'year']]
y = df['species']

# Split the data into a training set (80%) and a test set (20%)
# No random_state is set here, so the split (and the scores below) will differ between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Standardize the feature values (important for distance- and margin-based models
# such as KNN and SVM; tree-based models are largely unaffected by scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
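
To make the scaling step concrete, here is a small hand-rolled sketch of what `fit_transform` on the training data followed by `transform` on the test data amounts to for a single feature. The body-mass values are made up for illustration.

```python
from statistics import fmean, pstdev

# Illustrative body masses in grams (not taken from the real dataset)
train_mass = [3750.0, 3800.0, 3250.0, 4675.0]
test_mass = [3450.0, 5000.0]

# "fit": learn the mean and standard deviation from the training data only
mu = fmean(train_mass)
sigma = pstdev(train_mass)  # population std (ddof=0), which is what StandardScaler uses

# "transform": apply the same training statistics to both sets
train_scaled = [(x - mu) / sigma for x in train_mass]
test_scaled = [(x - mu) / sigma for x in test_mass]

print("train:", [round(v, 3) for v in train_scaled])
print("test: ", [round(v, 3) for v in test_scaled])
```

The test values are scaled with the training mean and standard deviation, so no information from the test set leaks into preprocessing.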

## Random Forest Classifier

A random forest classifier builds many decision trees, each trained on a random (bootstrap) sample of the training data and considering a random subset of features at each split. To classify a new sample, the forest combines the predictions of all its trees. Because the trees differ from one another, their individual errors tend to cancel out, which reduces overfitting compared with a single decision tree.
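
The way a forest combines its trees can be inspected directly. The sketch below uses synthetic data from `make_classification` as a stand-in for the penguin features (an assumption), fits a small forest, and tallies what each individual tree predicts for one sample.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class data standing in for the penguin features (an assumption)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

sample = X_demo[:1]  # a single row, kept 2-D as scikit-learn expects

# Each fitted tree is available in estimators_; tally its prediction for the sample
tree_votes = Counter(int(tree.predict(sample)[0]) for tree in forest.estimators_)
forest_prediction = int(forest.predict(sample)[0])

print("Predictions per class across trees:", dict(tree_votes))
print("Forest prediction:", forest_prediction)
```

Note that scikit-learn averages the trees' class probabilities rather than counting hard votes. With fully grown trees the two approaches usually give the same answer.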

In [45]:
# Define and train a Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train_scaled, y_train)

# Make predictions
rf_predictions = rf_classifier.predict(X_test_scaled)

## Model Evaluation

**Accuracy Score**: The accuracy score measures the proportion of total correct predictions made by a model, providing a straightforward metric of its overall performance on a dataset.

**Confusion Matrix**: A confusion matrix is a table used to describe the performance of a classification model on a set of test data for which the true values are known, showing the counts of correct and incorrect predictions across different categories.

**Classification Report**: A classification report provides key metrics like precision, recall, and F1-score for each class, offering a detailed view of a model’s performance, especially useful for imbalanced datasets where accuracy alone might be misleading.



In [46]:
# Evaluate the Random Forest classifier
rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_confusion_matrix = confusion_matrix(y_test, rf_predictions)
rf_classification_report = classification_report(y_test, rf_predictions)

In [54]:
# Display the results
print("Random Forest Classifier Accuracy:", rf_accuracy)
print("\n")
print("Random Forest Classifier Confusion Matrix:\n", rf_confusion_matrix)
print("\n")
print("Random Forest Classifier Classification Report:\n", rf_classification_report)

Random Forest Classifier Accuracy: 0.9850746268656716


Random Forest Classifier Confusion Matrix:
 [[35  0  0]
 [ 1 10  0]
 [ 0  0 21]]


Random Forest Classifier Classification Report:
               precision    recall  f1-score   support

           0       0.97      1.00      0.99        35
           1       1.00      0.91      0.95        11
           2       1.00      1.00      1.00        21

    accuracy                           0.99        67
   macro avg       0.99      0.97      0.98        67
weighted avg       0.99      0.99      0.98        67
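
The numbers in this report can be recomputed by hand from the confusion matrix printed above, where rows are the true classes and columns are the predicted classes. The following sketch uses plain Python to make the definitions explicit.

```python
# Confusion matrix from the random forest output above
# (rows = true class, columns = predicted class)
cm = [
    [35, 0, 0],
    [1, 10, 0],
    [0, 0, 21],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))
accuracy = correct / total  # 66 correct out of 67

per_class = []
for k in range(n_classes):
    predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum: predicted as k
    actual_k = sum(cm[k])                                   # row sum: truly k
    precision = cm[k][k] / predicted_k
    recall = cm[k][k] / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    per_class.append((precision, recall, f1))
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

print("accuracy:", accuracy)
```

The single misclassified penguin is a Chinstrap (class 1) predicted as Adelie (class 0), which is why class 1 loses recall and class 0 loses precision.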



## XGBoost

XGBoost (Extreme Gradient Boosting) builds an ensemble of decision trees sequentially: each new tree is fitted to correct the errors made by the trees before it. This differs from a random forest, where the trees are trained independently of one another and their predictions are combined at the end. The `XGBClassifier` used below is the gradient-boosted model. XGBoost also provides a separate random forest variant, `XGBRFClassifier`, which is not used here.
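
The idea of trees being added one after another to correct earlier mistakes can be observed round by round. The sketch below deliberately swaps in scikit-learn's `GradientBoostingClassifier` rather than XGBoost, because it exposes `staged_predict`; the data is synthetic and the settings are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic three-class data (an assumption, not the penguin data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

booster = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)
booster.fit(X_demo, y_demo)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 50 boosting rounds
staged_accuracy = [
    accuracy_score(y_demo, stage_predictions)
    for stage_predictions in booster.staged_predict(X_demo)
]

print("Training accuracy after 1 round:  ", staged_accuracy[0])
print("Training accuracy after 50 rounds:", staged_accuracy[-1])
```

This tracks training accuracy only, so it shows how the ensemble is built up rather than how well it generalizes.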

In [55]:
# Define and train an XGBoost classifier
xgb_classifier = xgb.XGBClassifier(n_estimators=100, random_state=42)
xgb_classifier.fit(X_train_scaled, y_train)
xgb_predictions = xgb_classifier.predict(X_test_scaled)

# Evaluate the XGBoost classifier
xgb_accuracy = accuracy_score(y_test, xgb_predictions)
xgb_confusion_matrix = confusion_matrix(y_test, xgb_predictions)
xgb_classification_report = classification_report(y_test, xgb_predictions)

print("\nXGBoost Classifier Accuracy:", xgb_accuracy)
print("\n")
print("XGBoost Classifier Confusion Matrix:\n", xgb_confusion_matrix)
print("\n")
print("XGBoost Classifier Classification Report:\n", xgb_classification_report)


XGBoost Classifier Accuracy: 0.9850746268656716


XGBoost Classifier Confusion Matrix:
 [[35  0  0]
 [ 1 10  0]
 [ 0  0 21]]


XGBoost Classifier Classification Report:
               precision    recall  f1-score   support

           0       0.97      1.00      0.99        35
           1       1.00      0.91      0.95        11
           2       1.00      1.00      1.00        21

    accuracy                           0.99        67
   macro avg       0.99      0.97      0.98        67
weighted avg       0.99      0.99      0.98        67



## Support Vector Machine

Support Vector Machine (SVM) is a powerful machine learning model used for both classification and regression tasks, which finds the optimal hyperplane that best separates different classes in the feature space. This model is particularly effective in high-dimensional spaces and for cases where the number of dimensions exceeds the number of samples.
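
For a linear kernel and two classes, the separating hyperplane is just a weight vector `w` and an intercept `b`, and a sample's score is `w · x + b`. The sketch below, on a made-up two-class toy set, reads both from a fitted `SVC` and checks them against `decision_function`.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data with two classes
X_demo = np.array([
    [0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
    [3.0, 3.0], [4.0, 3.5], [3.5, 4.0],
])
y_demo = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear", C=1.0)
svm.fit(X_demo, y_demo)

# With a linear kernel, the hyperplane is available as coef_ and intercept_
w = svm.coef_[0]
b = svm.intercept_[0]

# Signed score of each sample relative to the hyperplane: positive means class 1
manual_scores = X_demo @ w + b

print("w =", w, "b =", b)
print("scores:", manual_scores)
print("support vectors:\n", svm.support_vectors_)
```

With three classes, as in the penguin data, `SVC` trains one such hyperplane per pair of classes and combines their decisions.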

In [57]:
# Define and train a Support Vector Machine classifier
svm_classifier = SVC(kernel='linear', C=1.0, random_state=42)
svm_classifier.fit(X_train_scaled, y_train)

# Make predictions
svm_predictions = svm_classifier.predict(X_test_scaled)

# Evaluate the SVM classifier
svm_accuracy = accuracy_score(y_test, svm_predictions)
svm_confusion_matrix = confusion_matrix(y_test, svm_predictions)
svm_classification_report = classification_report(y_test, svm_predictions)

# Display the results
print("Support Vector Machine Classifier Accuracy:", svm_accuracy)
print("\n")
print("Support Vector Machine Classifier Confusion Matrix:\n", svm_confusion_matrix)
print("\n")
print("Support Vector Machine Classifier Classification Report:\n", svm_classification_report)


Support Vector Machine Classifier Accuracy: 0.9850746268656716


Support Vector Machine Classifier Confusion Matrix:
 [[34  1  0]
 [ 0 11  0]
 [ 0  0 21]]


Support Vector Machine Classifier Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.97      0.99        35
           1       0.92      1.00      0.96        11
           2       1.00      1.00      1.00        21

    accuracy                           0.99        67
   macro avg       0.97      0.99      0.98        67
weighted avg       0.99      0.99      0.99        67



## KNN

K-Nearest Neighbors (KNN) is a simple, intuitive machine learning algorithm that classifies a new data point by the majority vote of its 'k' nearest neighbors in the feature space. It works well when samples of the same class lie close together. Because it relies on distances between points, the features need to be on comparable scales, which is why they are standardized first.
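
The voting rule is short enough to write out by hand. This is a minimal sketch in plain Python with made-up points and labels; `knn_predict` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Indices of the training points, sorted by Euclidean distance to the query
    by_distance = sorted(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]


# Two well-separated made-up clusters
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (2.0, 2.0), (2.1, 1.9), (1.8, 2.2)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (0.1, 0.1)))  # A
print(knn_predict(points, labels, (1.9, 2.0)))  # B
```

`KNeighborsClassifier` follows the same idea but uses faster neighbor-search structures, which matters on larger datasets.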

In [59]:
# Define and train the KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=10)  # You can adjust the number of neighbors (k)
knn_classifier.fit(X_train_scaled, y_train)

# Make predictions
knn_predictions = knn_classifier.predict(X_test_scaled)

# Evaluate the KNN classifier
knn_accuracy = accuracy_score(y_test, knn_predictions)
knn_confusion_matrix = confusion_matrix(y_test, knn_predictions)
knn_classification_report = classification_report(y_test, knn_predictions)

# Display the results
print("K-Nearest Neighbors Classifier Accuracy:", knn_accuracy)
print("\n")
print("K-Nearest Neighbors Classifier Confusion Matrix:\n", knn_confusion_matrix)
print("\n")
print("K-Nearest Neighbors Classifier Classification Report:\n", knn_classification_report)

K-Nearest Neighbors Classifier Accuracy: 0.9701492537313433


K-Nearest Neighbors Classifier Confusion Matrix:
 [[34  1  0]
 [ 1 10  0]
 [ 0  0 21]]


K-Nearest Neighbors Classifier Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.97      0.97        35
           1       0.91      0.91      0.91        11
           2       1.00      1.00      1.00        21

    accuracy                           0.97        67
   macro avg       0.96      0.96      0.96        67
weighted avg       0.97      0.97      0.97        67

