<a href="https://colab.research.google.com/github/lucib3196/Machine_Learning_Projects/blob/main/Species_Classification_of_Penguins_Using_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

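# Logistic Regression and GPT Analysis
In this notebook, we classify penguin species from physical measurements (culmen length and depth, flipper length, body mass) together with island and sex, using logistic regression. We then ask a GPT model to interpret the resulting confusion matrix and classification report.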

In [6]:
!pip install openai # Install OpenAI

Collecting openai
  Downloading openai-1.12.0-py3-none-any.whl (226 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.4-py3-none-any.whl (77 kB)

In [7]:
# Import libraries
import os
from getpass import getpass

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from openai import OpenAI

In [None]:
df = pd.read_csv("/content/penguins_size.csv")
# Drop every row that has at least one missing value
df = df.dropna()
# Preview the first rows
df.head()

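|   | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---------|--------|------------------|-----------------|-------------------|-------------|-----|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | MALE |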


In [None]:
# One-hot encode the categorical columns
df = pd.get_dummies(df, columns=["island", "sex"])


In [None]:
df.columns

Index(['species', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm',
       'body_mass_g', 'island_Biscoe', 'island_Dream', 'island_Torgersen',
       'sex_.', 'sex_FEMALE', 'sex_MALE'],
      dtype='object')
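
The `sex_.` column in the listing above suggests that at least one row stores `.` as its sex value, which one-hot encoding turns into a feature of its own. Below is a minimal sketch of one way to handle this, assuming such placeholder rows should be treated as missing and dropped before encoding. `drop_placeholder_sex` is a hypothetical helper, and the small frame is made-up data rather than the real file.

```python
import pandas as pd

def drop_placeholder_sex(frame, placeholder="."):
    """Return a copy of the frame without rows whose 'sex' equals the placeholder."""
    return frame[frame["sex"] != placeholder].copy()

# Made-up rows that only mimic the shape of the penguin data
demo = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Biscoe", "Biscoe"],
    "sex": ["MALE", ".", "FEMALE"],
})

cleaned = drop_placeholder_sex(demo)
encoded = pd.get_dummies(cleaned, columns=["island", "sex"])
print(list(encoded.columns))
```

In the notebook this would run on `df` right after the missing-value step, so that `get_dummies` only sees `FEMALE` and `MALE`.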

In [None]:
# Separate the features (X) from the target (y)
X = df.drop('species', axis=1)  # Every column except 'species' is a feature
y = df["species"]  # 'species' is the target variable

# Hold out 30% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Create the classifier and fit it to the training data
model = LogisticRegression()
model.fit(x_train, y_train)

# Predict species for the held-out test set
model_test = model.predict(x_test)

# Evaluate the predictions
print(confusion_matrix(y_test, model_test))
print(classification_report(y_test, model_test))


[[46  0  0]
 [ 2 13  0]
 [ 0  0 40]]
              precision    recall  f1-score   support

      Adelie       0.96      1.00      0.98        46
   Chinstrap       1.00      0.87      0.93        15
      Gentoo       1.00      1.00      1.00        40

    accuracy                           0.98       101
   macro avg       0.99      0.96      0.97       101
weighted avg       0.98      0.98      0.98       101
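
To see where the numbers in the report come from, the per-class metrics can be recomputed by hand from the confusion matrix, whose rows are true classes and whose columns are predicted classes. The sketch below uses plain Python; `per_class_metrics` is a hypothetical helper, and the matrix values are copied from the output above.

```python
# Confusion matrix copied from the output above:
# rows are true classes, columns are predicted classes
labels = ["Adelie", "Chinstrap", "Gentoo"]
matrix = [
    [46, 0, 0],
    [2, 13, 0],
    [0, 0, 40],
]

def per_class_metrics(matrix, labels):
    """Compute (precision, recall, f1) for each class from a confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted_total = sum(row[i] for row in matrix)  # column sum
        actual_total = sum(matrix[i])                    # row sum (support)
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

metrics = per_class_metrics(matrix, labels)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / sum(sum(row) for row in matrix)

for label, (p, r, f1) in metrics.items():
    print(f"{label:<10} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The two Chinstrap penguins predicted as Adelie explain both the Chinstrap recall of 13/15 and the Adelie precision of 46/48.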



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
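
The warning above offers two remedies: scale the features or raise `max_iter`. The sketch below applies both in a scikit-learn pipeline. It is not the notebook's original model: the data is a synthetic stand-in from `make_classification`, one column is inflated to imitate `body_mass_g` being much larger than the other features, and the `stratify` and `random_state` arguments are added assumptions to make the split reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the penguin features (assumption: not the real data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
# Inflate one column to mimic a feature on a much larger scale
X_demo[:, 0] *= 1000

# Stratify so each class keeps its share in both splits
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Standardize the features, then fit logistic regression with a higher iteration cap
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(x_tr, y_tr)
print(f"test accuracy: {scaled_model.score(x_te, y_te):.2f}")
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no information from the test set leaks into the scaling.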


### Data Analysis Using GPT

In [9]:
# Prompt for a personal API key without echoing it to the screen
api_key = getpass('Enter your OpenAI API Key: ')
os.environ['OPENAI_API_KEY'] = api_key

Enter your OpenAI API Key: ··········
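
When the notebook is re-run in a session where the key is already set, prompting again is unnecessary. A minimal sketch of that pattern follows, assuming the key should be read from the environment first. `load_api_key` is a hypothetical helper, and the demo uses a stand-in variable name with a stubbed prompt so that nothing has to be typed.

```python
import os
from getpass import getpass

def load_api_key(env_var="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is absent."""
    key = os.environ.get(env_var)
    if not key:
        key = prompt_fn(f"Enter your {env_var}: ")
        os.environ[env_var] = key
    return key

# Demo with a stand-in variable and a stubbed prompt (hypothetical values)
os.environ.pop("DEMO_API_KEY", None)
first = load_api_key("DEMO_API_KEY", prompt_fn=lambda _: "sk-demo")
second = load_api_key("DEMO_API_KEY", prompt_fn=lambda _: "unused")
print(first == second)
```

In the notebook itself, calling `load_api_key()` with its defaults would prompt through `getpass` only on the first run.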


In [10]:
client = OpenAI()
confusion_mat = confusion_matrix(y_test, model_test)
classification_rep = classification_report(y_test, model_test)
# Prompt that embeds both evaluation outputs for the model to interpret
analysis_prompt = f"""I have the following data on classifying penguin species. You have access
to the following
confusion matrix: {confusion_mat}
classification report: {classification_rep}
You are tasked with analyzing these structures and returning an analysis of the model's performance
Description: Description of the numerical meaning of the data
Analysis: Analysis of the validity of the model

Your goal is to explain the data concisely and to the point, and to back up your reasoning
"""


response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful data analyst with machine learning expertise."},
        {"role": "user", "content": analysis_prompt},
    ],
    temperature=0,  # Keep the output as deterministic as possible
)
print(response.choices[0].message.content)


### Description:
- The confusion matrix shows the model's performance on classifying penguin species. 
- The diagonal elements represent the number of correctly classified instances for each class, while off-diagonal elements represent misclassifications.
- The classification report provides metrics such as precision, recall, and F1-score for each class, as well as overall accuracy and other averages.

### Analysis:
- The model has high precision, recall, and F1-score for all three classes, indicating good performance in classifying penguin species.
- The overall accuracy of 98% is high, suggesting that the model is effective in making correct predictions.
- The high precision values indicate that when the model predicts a class, it is usually correct.
- The high recall values indicate that the model can identify most of the instances of each class.
- The F1-scores, which consider both precision and recall, are also high for all classes, indicating a good balance between precision and recall.
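
The prompt passes the confusion matrix as a bare array, so the model has to guess which row belongs to which species. One possible refinement, sketched below, is to attach the class labels and serialize the matrix as JSON before placing it in the prompt. `build_report_payload` is a hypothetical helper, and the values are copied from the notebook output.

```python
import json

def build_report_payload(matrix, labels):
    """Pair each confusion-matrix row with its class label and return JSON text."""
    payload = {
        "labels": labels,
        "rows_are": "true class",
        "columns_are": "predicted class",
        "confusion_matrix": {
            label: dict(zip(labels, row)) for label, row in zip(labels, matrix)
        },
    }
    return json.dumps(payload, indent=2)

# Values copied from the confusion matrix printed earlier in the notebook
labels = ["Adelie", "Chinstrap", "Gentoo"]
matrix = [[46, 0, 0], [2, 13, 0], [0, 0, 40]]

prompt_block = build_report_payload(matrix, labels)
print(prompt_block)
```

With the fitted model, the inputs would presumably be `confusion_mat.tolist()` and `model.classes_.tolist()`, since NumPy integers are not JSON serializable as-is.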