# Super‑Basic ML with scikit‑learn

**Sections**
1) Linear regression (predict a numeric target)  
2) Logistic regression (predict a simple binary label)  
3) Mini 'Kaggle'-style analysis on one of the provided datasets (fit on the train file, then evaluate generalization on the separate test file)

The goal is to keep everything **simple and readable**.


## 0) Setup & Data Loading

We'll use the **Palmer Penguins** dataset again.
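
One possible shape for `load_penguins()` is sketched here, assuming a penguins CSV is available on disk. The `csv_path` argument, the demo file name, and the demo rows are all hypothetical and not part of the assignment; the tiny file exists only so the sketch runs anywhere. It also shows what the column-name normalization in the setup cell does to messy headers.

```python
import os
import tempfile

import pandas as pd


def load_penguins(csv_path):
    """Read a penguins CSV into a DataFrame (csv_path is a hypothetical local file)."""
    if not os.path.exists(csv_path):
        raise FileNotFoundError(f"No penguins file found at {csv_path}")
    return pd.read_csv(csv_path)


# Tiny stand-in file so the sketch is runnable; the rows are illustrative only.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "penguins_demo.csv")
with open(demo_path, "w") as f:
    f.write("Species,Flipper Length mm, Body Mass g\n")
    f.write("Adelie,181,3750\n")
    f.write("Gentoo,217,5200\n")

demo = load_penguins(demo_path)
# Same normalization as the setup cell: trim, lowercase, spaces -> underscores.
demo.columns = [c.strip().lower().replace(" ", "_") for c in demo.columns]
print(list(demo.columns))
```

In the assignment itself, point the function at wherever your copy of the penguins data lives.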



In [None]:
# TODO: write a load_penguins() function that returns the penguins data as a pandas DataFrame
import warnings
warnings.filterwarnings("ignore")

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# YOUR CODE HERE

penguins = load_penguins()
penguins.columns = [c.strip().lower().replace(' ','_') for c in penguins.columns]
penguins.head()


## 1) Linear Regression (super basic)

**Task.** Predict `body_mass_g` from **one feature**: `flipper_length_mm`.

We'll do a simple train/test split and compute **R²** and **RMSE**. We'll also draw a quick **true vs predicted** scatter.
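
The same workflow can be sketched end to end on synthetic data. This is a minimal illustration, not the penguins solution: the feature range, slope, and noise level are made-up assumptions chosen so that one feature roughly predicts the target. RMSE is computed as the square root of MSE, which should work regardless of the installed scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the penguins): y is roughly linear in one feature plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(170, 230, size=200)
y = 50.0 * x - 5800.0 + rng.normal(0, 300, size=200)
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature array

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scale the feature, then fit ordinary least squares.
pipe = Pipeline([("scaler", StandardScaler()), ("linreg", LinearRegression())])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE = sqrt(MSE)
print(f"R^2: {r2:.3f}, RMSE: {rmse:.1f}")
```

RMSE is in the units of the target, so on the penguins data it reads as a typical error in grams.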


In [None]:
# TODO: Build X (flipper_length_mm) and y (body_mass_g), drop missing rows
# TODO: Split into train/test (test_size=0.25, random_state=42)
# TODO: Create a scikit-learn Pipeline: StandardScaler -> LinearRegression
# TODO: Fit on train, predict on test, compute R^2 and RMSE, then plot y_true vs y_pred

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# 1) Build X, y
# YOUR CODE HERE
# X = ...
# y = ...

# 2) Split
# X_train, X_test, y_train, y_test = train_test_split(...)

# 3) Pipeline
# pipe = Pipeline([('scaler', StandardScaler()), ('linreg', LinearRegression())])

# 4) Fit & predict
# pipe.fit(X_train, y_train)
# y_pred = pipe.predict(X_test)

# 5) Metrics
# r2 = r2_score(y_test, y_pred)
# rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE = sqrt(MSE); works across scikit-learn versions
# print(f"R^2: {r2:.3f}, RMSE: {rmse:.1f} g")

# 6) Plot: true vs predicted
# plt.figure(figsize=(5,4))
# plt.scatter(y_test, y_pred, alpha=0.7)
# plt.xlabel("True body_mass_g"); plt.ylabel("Predicted body_mass_g")
# min_v = min(y_test.min(), y_pred.min()); max_v = max(y_test.max(), y_pred.max())
# plt.plot([min_v, max_v], [min_v, max_v], linestyle='--')
# plt.title("Linear regression: true vs predicted")
# plt.tight_layout()
# plt.show()


## 2) Logistic Regression (super basic)

**Task.** Classify **Adelie vs. not‑Adelie** using only **two features**: `bill_length_mm` and `bill_depth_mm`.

We'll compute a simple **accuracy** and display a tiny **confusion matrix**.
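
As a rough illustration of the classification workflow, the sketch below fits the same kind of pipeline on two synthetic Gaussian blobs. The blob centers, spreads, and class sizes are made-up assumptions, not penguin measurements. It also checks that accuracy equals the diagonal of the confusion matrix divided by the total count.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the penguins): two Gaussian blobs in two features.
rng = np.random.default_rng(0)
X_neg = rng.normal(loc=[47.0, 16.0], scale=[3.0, 1.5], size=(120, 2))  # label 0
X_pos = rng.normal(loc=[39.0, 18.5], scale=[2.5, 1.2], size=(80, 2))   # label 1
X = np.vstack([X_neg, X_pos])
y_bin = np.concatenate([np.zeros(120, dtype=int), np.ones(80, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y_bin, test_size=0.25, random_state=42
)

clf = Pipeline([("scaler", StandardScaler()), ("logreg", LogisticRegression(max_iter=1000))])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
# Rows are true labels, columns are predicted labels, in the order given by `labels`.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(f"Accuracy: {acc:.3f}")
print(cm)
# Accuracy is the share of samples on the diagonal of the confusion matrix.
print(np.isclose(acc, np.trace(cm) / cm.sum()))
```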


In [None]:
# TODO: Make a binary label y_bin = 1 if species == 'Adelie' else 0
# TODO: Build X with [bill_length_mm, bill_depth_mm], drop rows with NaNs in X or y
# TODO: Train/test split (test_size=0.25, random_state=42)
# TODO: Pipeline: StandardScaler -> LogisticRegression
# TODO: Fit, predict, print accuracy, and plot a small confusion matrix

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# 1) Build X, y_bin
# YOUR CODE HERE
# X = ...
# y_bin = ...

# 2) Split
# X_train, X_test, y_train, y_test = train_test_split(...)

# 3) Pipeline
# clf = Pipeline([('scaler', StandardScaler()), ('logreg', LogisticRegression(max_iter=1000))])

# 4) Fit & predict
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)

# 5) Accuracy
# acc = accuracy_score(y_test, y_pred)
# print(f"Accuracy: {acc:.3f}")

# 6) Confusion matrix plot
# cm = confusion_matrix(y_test, y_pred, labels=[0,1])
# fig, ax = plt.subplots(figsize=(4,4))
# im = ax.imshow(cm, interpolation='nearest')
# ax.set_xticks([0,1]); ax.set_yticks([0,1])
# ax.set_xticklabels(['not Adelie','Adelie']); ax.set_yticklabels(['not Adelie','Adelie'])
# for i in range(cm.shape[0]):
#     for j in range(cm.shape[1]):
#         ax.text(j, i, str(cm[i,j]), ha='center', va='center')
# ax.set_xlabel('Predicted'); ax.set_ylabel('True'); ax.set_title('Confusion matrix')
# fig.colorbar(im, ax=ax, shrink=0.8); fig.tight_layout(); plt.show()


## 3) Mini 'Kaggle' style analysis

Pick **one** of the **provided datasets** at https://github.com/XenoQueer/Class_repo-liam/tree/main/data/prog_sci_data. Each dataset has a **train** file whose name ends with `_dataset` and a separate **test** file.

- The test file **has labels**, so you can compute accuracy (classification) or RMSE (regression) on it.
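
The overall workflow can be sketched as follows: hold out part of the train file for refinement, pick a setting on that held-out split, and score the labeled test file only once at the end. This is a minimal sketch under stated assumptions. `make_fake_split` is a hypothetical helper that generates made-up data in place of the real files, and the two candidate `C` values are arbitrary. The `X_val`/`y_val` names here play the role of `X_test`/`y_test` in the assignment cell.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_fake_split(n, seed):
    """Generate a made-up binary classification set (stand-in for a train or test file)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y


def make_pipe(C):
    """Standardize, then fit logistic regression with regularization strength C."""
    return Pipeline([("scaler", StandardScaler()),
                     ("logreg", LogisticRegression(C=C, max_iter=1000))])


X_full, y_full = make_fake_split(300, seed=1)  # plays the role of the train file
X_gen, y_gen = make_fake_split(100, seed=2)    # plays the role of the labeled test file

# Hold out part of the training data for model refinement.
X_train, X_val, y_train, y_val = train_test_split(
    X_full, y_full, test_size=0.25, random_state=42
)

# Try a couple of settings and keep the one with the best held-out accuracy.
best_C, best_val_acc = None, -1.0
for C in (0.01, 1.0):
    val_acc = accuracy_score(y_val, make_pipe(C).fit(X_train, y_train).predict(X_val))
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Refit with the chosen setting, then score once on the generalization data.
final = make_pipe(best_C).fit(X_train, y_train)
gen_acc = accuracy_score(y_gen, final.predict(X_gen))
print(f"best C: {best_C}, held-out acc: {best_val_acc:.3f}, generalization acc: {gen_acc:.3f}")
```

Keeping the generalization data out of the refinement loop is what makes the final number an honest estimate.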

We'll keep the modeling **super basic** and numeric‑only:
- impute missing numeric values with the column mean, then standardize;
- use **LinearRegression** if your task is **regression**;
- use **LogisticRegression** if your task is **classification**.
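
A numeric-only version of this recipe might look like the sketch below. The frame, its column names (`feat_a`, `feat_b`, `category`, `target`), and the injected gaps are all made up for illustration; your dataset's columns will differ. For a classification task, the last step would be `LogisticRegression(max_iter=1000)` instead.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up frame with one text column and some missing numeric values.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "feat_a": rng.normal(size=n),
    "feat_b": rng.normal(size=n),
    "category": ["x", "y"] * (n // 2),
})
df["target"] = 3.0 * df["feat_a"] - 2.0 * df["feat_b"] + rng.normal(scale=0.1, size=n)
df.loc[::10, "feat_a"] = np.nan  # blank out every 10th value to simulate gaps

# Separate the target, then keep numeric feature columns only.
y = df["target"]
X = df.drop(columns=["target"]).select_dtypes(include="number")

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # fill gaps with the column mean
    ("scaler", StandardScaler()),                 # zero mean, unit variance
    ("linreg", LinearRegression()),
])
pipe.fit(X, y)
print(list(X.columns), f"train R^2: {pipe.score(X, y):.3f}")
```

Because the imputer and scaler sit inside the pipeline, their statistics are learned from the data passed to `fit` and reused unchanged at prediction time.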



In [None]:
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score
import pandas as pd


# === Load ===
# 0) Load the data
# YOUR CODE HERE
# make sure you load in both the training and the testing data

# === Clean data ===
# 1) Make sure your data is cleaned
# YOUR CODE HERE
# perform some basic visualization to make sure everything is ready for your analysis
# you can re-use your past work from previous analyses if you'd like!

# === Build X and y ===
# 2) Parse your data into X_train, y_train, X_gen, y_gen
# YOUR CODE HERE
# you'll want to run and refine your analysis for just the training data and then at the end test on the generalization data

# === Data split ===
# 3) Split only your training data
# YOUR CODE HERE
# X_train, X_test, y_train, y_test = train_test_split(...)

# === Pipeline ===
# 4) Create your pipeline
# YOUR CODE HERE
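# for example (pick the one that matches your task):
# pipe = Pipeline([('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler()), ('logreg', LogisticRegression(max_iter=1000))])
# pipe = Pipeline([('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler()), ('linreg', LinearRegression())])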

# === Fit & predict ===
# 5) Fit your model on the training split and predict on the held-out test split (not the generalization data yet)
# YOUR CODE HERE
# pipe.fit(X_train, y_train)
# y_pred = pipe.predict(X_test)

# === Metrics ===
# 6) Use appropriate metrics for your analysis (regression vs classification) to assess your model
# YOUR CODE HERE

# === Refine model ===
# 7) Repeat steps 4-6 with different parameters/pipeline changes until satisfied with test results
# YOUR CODE HERE

# === Evaluate model ===
# 8) Assess the generalization of your model using X_gen and y_gen
# YOUR CODE HERE
# Don't forget to visualize your results in the appropriate manner! (Scatter or confusion matrix)




---

## ✅ What to submit
- **Section 1**: your pipeline code and the R²/RMSE printout (+ the scatter).
- **Section 2**: your pipeline code and the accuracy printout (+ the confusion matrix).
- **Section 3**: your metric printouts for the held-out split and for the generalization data (+ the scatter or confusion matrix).

## Scoring (100 pts)
- *Linear regression* — 35 pts
- *Logistic regression* — 35 pts
- *Mini Kaggle analysis* — 30 pts

> Bonus (+5): In one or two sentences, explain what worked in your mini Kaggle analysis and why.
