# Import and prepare Data

Here we import the data and add two rows:

- dayofyear: in order to standardize how we use the dates, and in a more efficient manner: we use the day of the year *(ex: 3 January = 3)*
- We attach a varietyId to each variety since the algorithm cannot handle strings 

In [204]:
import pandas as pd
from sklearn.model_selection import train_test_split
from datetime import datetime

col_names = ["date", "site name", "latitude", "longitude", "Altitude" ,"Genre" ,"species", "variety", "stage_code", "phenological scale", "data origin", "Dispositif"]

peach_data = pd.read_csv("peach_data.csv", parse_dates=["date"], header=None, names=col_names)
peach_data['dayofyear'] = peach_data['date'].dt.dayofyear
peach_data['variety_id'] = peach_data.groupby('variety').ngroup().fillna(0)

  peach_data = pd.read_csv("peach_data.csv", parse_dates=["date"], header=None, names=col_names)


AttributeError: Can only use .dt accessor with datetimelike values

# Load data into our training data

- We will use the columns dayofyear, and variety_id as our **feature columns** (The columns that will be used for the prediction)
- The expected data is our **"stage code"**

In [200]:

from sklearn.model_selection import train_test_split

feature_columns = ['dayofyear', 'variety_id']
X = peach_data[feature_columns]
y = peach_data.stage_code

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=16)



AttributeError: 'DataFrame' object has no attribute 'stage_code'

# Model Development and Prediction

- We create a Logistic Regression classifier using the LogisticRegression() function
    - We use random_state for reproducibility
    - Our solver here is **lbfgs**
        - Other solvers are: **liblinear**, **newton-cg**, **sag**, **saga** and **lbfgs** solvers. 
    - We put the max_iter (*max iterations*) at 500 in order to avoid the error: ***STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.***
    -  [see here](https://stackoverflow.com/questions/62658215/convergencewarning-lbfgs-failed-to-converge-status-1-stop-total-no-of-iter)
- We fit our model on the training set using fit()
- We perform prediction on the set using predict()

 *class sklearn.linear_model.LogisticRegression(penalty='l2',* * *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1,class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)*

In [129]:
from sklearn.linear_model import LogisticRegression

# instationtian of model using the default parameters
logreg = LogisticRegression(random_state=16, solver='lbfgs', max_iter=1000)

logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

**From StackOverflow Response** to avoid *STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.*

- [Try a different optimizer](https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-definitions/52388406#52388406)
- [Scale your data](https://scikit-learn.org/stable/modules/preprocessing.html)
- [Add engineered features](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)
- [Data pre-processing](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)
- Add more data


# Model Evaluation using [Confusion Matrix](https://www.sciencedirect.com/topics/engineering/confusion-matrix#:~:text=5.5%20Confusion%20matrix,performance%20of%20a%20classification%20algorithm.)

*Confusion matrix is a very popular measure used while solving classification problems. It can be applied to binary classification as well as for multiclass classification problems.*

| Empty Cell | Empty Cell | Predicted | 
|------------|------------|-----------|
| Actual | Negative | Positive |
| Negative | TN | FP |
| Positive | FN | TP |




In [None]:
# import the metrics class
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

In [None]:
#Confusion Matrix Evaluation Metrics

from sklearn.metrics import classification_report
target_names = ["stage code", 'dayofyear', 'variety_id']
print(classification_report(y_test, y_pred, target_names=target_names))


- **Precision**
    - Measures the propotion of true positives among all the positive predictions
- **Recall**
    - Measures the proportion of the true positives amoung all the actual positives
- **f1-score**
    - Harmonic mean of precision and recall
- **Support**
    - The number of samples in each class
- **Macro Avg**
    - The average metrics across all the classes with all classes receving equal weight
- **Weighted Avg**
    - The average metrics across all the classes, with classes with more samples reveiving more weight

 
# **Definitions**

#### True/False Positive/Negative
> ` An Aesop's Fable: The Boy Who Cried Wolf (compressed) A shepherd boy gets bored tending the town's flock. To have some fun, he cries out, "Wolf!" even though no wolf is in sight. The villagers run to protect the flock, but then get really mad when they realize the boy was playing a joke on them.
>
> [Iterate previous paragraph N times.]
>
> One night, the shepherd boy sees a real wolf approaching the flock and calls out, "Wolf!" The villagers refuse to be fooled again and stay intheir houses. The hungry wolf turns the flock into lamb chops. The town goes hungry. Panic ensues.`
>

Let's make the following definitions:

    "Wolf" is a positive class.
    "No wolf" is a negative class.


> ![Capture.PNG](attachment:4e00753e-db25-4b57-b1cd-d49dc380f050.PNG)


A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.

A false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.

#### [Harmonic Mean](https://en.wikipedia.org/wiki/Harmonic_mean)

> In mathematics, the harmonic mean is one of several kinds of average, and in particular, one of the Pythagorean means. It is sometimes appropriate for situations when the average rate is desired.

#### [ROC curve & ROC AUC Score](https://www.google.com/url?sa=t&source=web&rct=j&opi=89978449&url=https://www.evidentlyai.com/classification-metrics/explain-roc-curve&ved=2ahUKEwjsvcLZvOeFAxWoTKQEHR_mDwQQFnoECCEQAw&usg=AOvVaw2nDCZVdZe_dHvcgKUvermT)

- The ROC curve shows the performance of a binary classifier with different decision thresholds. It plots the True Positive rate (TPR) against the False Positive rate (FPR).

- The ROC AUC score is the area under the ROC curve. It sums up how well a model can produce relative scores to discriminate between positive or negative instances across all classification thresholds. 

- The ROC AUC score ranges from 0 to 1, where 0.5 indicates random guessing, and 1 indicates perfect performance.

In [199]:
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

solvers = ["liblinear", "newton-cg", "sag", "saga", "newton-cholesky"]


for solver in solvers:
     # instationtian of model using the default parameters
    logreg = LogisticRegression(random_state=16, solver=solver, max_iter=5000)

    logreg.fit(X_train, y_train)
    y_pred = logreg.predict(X_test)

    cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
    cnf_matrix

    target_names = ["stage code", 'dayofyear', 'variety_id']
    print(solver, classification_report(y_test, y_pred, target_names=target_names))


    # Assuming logreg is already fitted and X_test is prepared
    preds = logreg.predict_proba(X_test)[::,1]
    fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
    print(preds)

liblinear               precision    recall  f1-score   support

  stage code       0.78      0.83      0.81       606
   dayofyear       0.82      0.78      0.80       620
  variety_id       1.00      1.00      1.00       476

    accuracy                           0.86      1702
   macro avg       0.87      0.87      0.87      1702
weighted avg       0.86      0.86      0.86      1702



ValueError: multiclass format is not supported