# Linear

<div class="admonition danger">
    <p class="admonition-title">DRAFT</p>
    <p style="padding-top: 1em">
        This page is a work in progress and is subject to change at any moment.
    </p>
</div>

TODO:

In [1]:
import numpy as np
import pandas as pd

RANDOM_STATE = 32665

CSV_PATH = "https://gitlab.com/oasci/courses/pitt/biosc1540-2024s/-/raw/main/biosc1540/files/csv/mushrooms.csv"

df = pd.read_csv(CSV_PATH)

# Every column except "class" is a feature; "class" holds the label (e or p).
df_X = df.loc[:, df.columns != "class"]
df_y = df["class"]

## Throw it into sklearn

TODO: Logistic regression

In [2]:
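from sklearn.preprocessing import LabelEncoder

# Convert each categorical column to integer codes.
# apply() calls fit_transform once per column, so the encoder is refit for every column.
label_encoder = LabelEncoder()
df_X_encoded = df_X.apply(label_encoder.fit_transform)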

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_X_encoded, df_y, test_size=0.3, random_state=RANDOM_STATE
)

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

In [5]:
y_pred = model.predict(X_test)
classification_rep = classification_report(y_test, y_pred)
print(classification_rep)

              precision    recall  f1-score   support

           e       0.94      0.96      0.95      1276
           p       0.96      0.94      0.95      1162

    accuracy                           0.95      2438
   macro avg       0.95      0.95      0.95      2438
weighted avg       0.95      0.95      0.95      2438



## Metrics

### Precision

Precision measures how often the model's positive predictions are correct.
It is the number of correctly predicted positive observations divided by the total number of observations predicted as positive.

In the classification report above, precision is high for both classes: 0.94 for edible (`e`) and 0.96 for poisonous (`p`).
When the model predicts that a mushroom is edible or poisonous, it is usually correct.
High precision matters most when a false positive is costly.
For mushrooms, the most dangerous error is labeling a poisonous mushroom as edible, and that error lowers the precision of the edible class.
A precision of 0.94 means that roughly 6 out of every 100 mushrooms predicted to be edible are actually poisonous.
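
The definition of precision can be checked by hand on a small example. The sketch below uses made-up labels (not the mushroom data) and a hypothetical helper named `precision`; it counts true and false positives for whichever class is treated as positive.

```python
# Hypothetical toy labels, invented for illustration (not from the mushroom dataset).
y_true = ["e", "e", "p", "p", "e", "p", "e", "p"]
y_pred = ["e", "p", "p", "p", "e", "e", "e", "p"]


def precision(y_true, y_pred, positive):
    """Fraction of predictions of `positive` that are actually `positive`."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    false_pos = sum(t != positive and p == positive for t, p in pairs)
    return true_pos / (true_pos + false_pos)


precision_e = precision(y_true, y_pred, "e")
precision_p = precision(y_true, y_pred, "p")
print(f"precision (e): {precision_e:.2f}")
print(f"precision (p): {precision_p:.2f}")
```

In this toy data, four mushrooms are predicted edible and three of them really are, so the precision for `e` is 0.75.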

### Recall

Recall, also known as sensitivity, measures how many of the actual positive observations the model captures.
It is the number of correctly predicted positive observations divided by the total number of observations that actually belong to the positive class.

In the classification report above, recall is also high for both classes: 0.96 for edible (`e`) and 0.94 for poisonous (`p`).
In simpler terms, the model correctly labels about 96% of the edible mushrooms and about 94% of the poisonous mushrooms in the test set.
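
Recall can be computed the same way as a ratio of counts. This is a minimal sketch on made-up labels (not the mushroom data); the helper name `recall` is hypothetical.

```python
# Hypothetical toy labels, invented for illustration (not from the mushroom dataset).
y_true = ["p", "p", "p", "p", "e", "e", "e", "e", "e", "e"]
y_pred = ["p", "p", "p", "e", "e", "e", "e", "e", "e", "p"]


def recall(y_true, y_pred, positive):
    """Fraction of actual `positive` observations that were predicted as `positive`."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    false_neg = sum(t == positive and p != positive for t, p in pairs)
    return true_pos / (true_pos + false_neg)


recall_p = recall(y_true, y_pred, "p")
recall_e = recall(y_true, y_pred, "e")
print(f"recall (p): {recall_p:.2f}")
print(f"recall (e): {recall_e:.2f}")
```

Here one of the four poisonous mushrooms is missed, so the recall for `p` is 0.75, while five of the six edible mushrooms are found.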

### Precision vs. recall

Recall focuses on the model's ability to capture all actual positive instances, aiming to minimize false negatives.
It's about not missing relevant cases.
Precision focuses on the accuracy of positive predictions, aiming to minimize false positives.
It's about ensuring that when the model predicts positive, it's likely to be correct.
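
One way to see the tension between the two metrics is to vary the probability cutoff used to call an observation positive. The sketch below is an illustration under assumptions: the probabilities and labels are made up, and the decision threshold is not something this page discusses. Raising the threshold removes false positives (higher precision) but misses more true positives (lower recall).

```python
# Hypothetical true labels (1 = positive) and predicted probabilities of the
# positive class, invented for illustration.
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
y_prob = [0.95, 0.90, 0.80, 0.65, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10]


def precision_recall_at(threshold):
    """Precision and recall when probabilities >= threshold are predicted positive."""
    y_pred = [int(prob >= threshold) for prob in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  recall={recall:.3f}")
```

With these numbers, a threshold of 0.3 finds every positive but includes three false positives, while a threshold of 0.7 produces no false positives but misses two of the five positives.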

### F1-score

The F1-score summarizes the performance of a classification model in a single number that accounts for both precision and recall.
Computed as the harmonic mean of these two metrics, it is especially useful when the classes are imbalanced.

$$
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

The harmonic mean places more weight on the lower of the two values, so the F1-score is high only when both precision and recall are high.
This matters when one class significantly outnumbers the other.
In such cases, relying solely on accuracy can be misleading, because a model can achieve high accuracy by favoring the majority class.
Because the F1-score accounts for both false positives and false negatives, it offers a more informative evaluation of models trained on imbalanced class distributions.
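
The formula can be tried on a few numbers. In this sketch the helper `f1` is hypothetical, the first pair of values is the rounded precision and recall reported for class `e` in the first classification report, and the remaining values (a lopsided model and a 95-to-5 imbalanced dataset) are made up for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall for class "e" from the first report.
f1_edible = f1(0.94, 0.96)

# Made-up lopsided case: perfect precision but very low recall.
f1_lopsided = f1(1.00, 0.10)
mean_lopsided = (1.00 + 0.10) / 2  # arithmetic mean, for comparison

# Made-up imbalanced case: 95 negatives, 5 positives, and a model that
# always predicts the negative (majority) class.
n_pos, n_neg = 5, 95
tp, fp, fn, tn = 0, 0, n_pos, n_neg
accuracy = (tp + tn) / (n_pos + n_neg)
precision_pos = tp / (tp + fp) if tp + fp else 0.0
recall_pos = tp / (tp + fn)
f1_pos = f1(precision_pos, recall_pos)

print(f"F1 (edible):      {f1_edible:.2f}")
print(f"F1 (lopsided):    {f1_lopsided:.2f} vs arithmetic mean {mean_lopsided:.2f}")
print(f"always-negative:  accuracy={accuracy:.2f}, F1 (positive class)={f1_pos:.2f}")
```

The lopsided case shows how the harmonic mean is pulled toward the lower value, and the always-negative model shows how accuracy can look strong while the F1-score for the minority class is zero.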



## Are all features needed?

TODO:

In [6]:
feature_names = model.feature_names_in_
feature_coeffs = model.coef_[0]

# Print the fitted coefficient for each feature.
for feature_name, feature_coef in zip(feature_names, feature_coeffs):
    print(f"{feature_name}: {feature_coef:.3f}")

cap_shape: -0.057
cap_surface: 0.442
cap_color: -0.043
bruises: -0.837
odor: -0.520
gill_attachment: -1.663
gill_spacing: -6.438
gill_size: 7.080
gill_color: -0.105
stalk_shape: 0.108
stalk_root: -1.608
stalk_surface_above_ring: -4.212
stalk_surface_below_ring: -0.323
stalk_color_above_ring: -0.142
stalk_color_below_ring: -0.049
veil_type: 0.000
veil_color: 6.114
ring_number: 1.404
ring_type: 0.792
spore_print_color: -0.206
population: -0.296
habitat: 0.059


In [7]:
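# Indices that sort the features by coefficient magnitude, smallest first.
idx_sort = np.argsort(np.abs(feature_coeffs))

# The five features with the largest absolute coefficients.
features_top = feature_names[idx_sort[-5:]]
print(features_top)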

['gill_attachment' 'stalk_surface_above_ring' 'veil_color' 'gill_spacing'
 'gill_size']


In [8]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train[features_top], y_train)

y_pred = model.predict(X_test[features_top])
classification_rep = classification_report(y_test, y_pred)
print(classification_rep)

              precision    recall  f1-score   support

           e       0.92      0.96      0.94      1276
           p       0.96      0.91      0.93      1162

    accuracy                           0.94      2438
   macro avg       0.94      0.94      0.94      2438
weighted avg       0.94      0.94      0.94      2438



In [9]:
# The ten features with the smallest absolute coefficients.
features_bottom = feature_names[idx_sort[:10]]
print(features_bottom)

['veil_type' 'cap_color' 'stalk_color_below_ring' 'cap_shape' 'habitat'
 'gill_color' 'stalk_shape' 'stalk_color_above_ring' 'spore_print_color'
 'population']


In [10]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train[features_bottom], y_train)

y_pred = model.predict(X_test[features_bottom])
classification_rep = classification_report(y_test, y_pred)
print(classification_rep)

              precision    recall  f1-score   support

           e       0.83      0.85      0.84      1276
           p       0.83      0.81      0.82      1162

    accuracy                           0.83      2438
   macro avg       0.83      0.83      0.83      2438
weighted avg       0.83      0.83      0.83      2438

