# Logistic Regression Exercise:
## Classification of *Senecio* ecotypes

<img src="senecio.png" alt="drawing" width="400"/>

In this exercise we will be exploring how well we can predict *Senecio* ecotypes based on various morphological features. The data is taken from an experiment where seeds from the Australian wildflower *Senecio lautus* were collected, grown in glasshouses under uniform conditions, and a range of morphological traits measured. The seeds were collected from populations belonging to four distinct ecotypes (dune, headland, tableland and woodland) that may be in the process of forming separate species. 
###### Walter, G.M., Aguirre, J.D., Blows, M.W. & Ortiz-Barrientos, D. Evolution of Genetic Variance during Adaptive Radiation. The American Naturalist 191: E108–E128 (2018), https://www.journals.uchicago.edu/doi/10.1086/696123

The dataset consists of the following features:

- **Ecotype**: One of four ecotypes

- **Population**: Population ID

- **VegHeight**: Vegetative height of the plant, in mm

- **MSL_W**: Ratio between main stem length and mean plant width

- **SB**: Number of branches

- **MSD**: Main stem diameter, in mm

- **Area**: Leaf area, in mm²

- **P2A2**: Leaf perimeter squared / area squared (an indicator of leaf complexity)

- **Circularity**: Leaf circularity

- **Nindents.Peri**: Number of leaf indents divided by leaf perimeter

- **IndentWidth**: Leaf indent width, in mm

- **IndentDepth**: Leaf indent depth, in mm

### Libraries
**Import whichever libraries you think you will need.**

### Import and inspect data
**Read in the ecotype-OB.csv file and set it to a dataframe called eco.**

**Check the head of eco.**

**Use .info and .describe to familiarise yourself with the data.**

**Because this is a logistic-regression-based classification problem, we will restrict ourselves to the two most dominant ecotypes: "Dune" and "Headland". Make a new dataframe `ecosub` that only includes the "Dune" and "Headland" ecotypes.**
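The inspection and subsetting steps can be sketched as follows. Because `ecotype-OB.csv` is not reproduced here, this sketch uses a tiny made-up frame (`toy`, with invented values); with the real data, `eco = pd.read_csv("ecotype-OB.csv")` would take its place. The label spellings "Dune" and "Headland" are taken from the exercise text, so it may be worth checking `eco["Ecotype"].unique()` in case the file spells them differently.

```python
import pandas as pd

# Toy stand-in for eco: the values are invented, only the column names follow the exercise
toy = pd.DataFrame({
    "Ecotype": ["Dune", "Headland", "Tableland", "Woodland", "Dune", "Headland"],
    "MSD": [4.1, 2.9, 3.3, 3.0, 4.4, 2.7],
})

# Quick look at the structure and summary statistics
print(toy.head())
toy.info()
print(toy.describe())

# Keep only the rows whose Ecotype is one of the two classes of interest
keep = ["Dune", "Headland"]
toy_sub = toy[toy["Ecotype"].isin(keep)].copy()

print(toy_sub["Ecotype"].value_counts())
```

The `.copy()` is there so that later edits to the subset do not trigger pandas' chained-assignment warning.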

### Data exploration

**Use seaborn to make a pairplot of the ecosub dataset with a different colour for each ecotype.**

**Generate a heatmap showing the correlation between the dataset features.**
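One possible way to build the heatmap is to compute the correlation matrix with pandas and pass it to `seaborn.heatmap`. The sketch below again uses random stand-in data; `toy` and `corr` are hypothetical names.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in data (random numbers, not the Senecio measurements)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["VegHeight", "MSD", "Area"])
toy["Ecotype"] = rng.choice(["Dune", "Headland"], size=50)

# Correlations are only defined for numeric columns, so select those first
corr = toy.select_dtypes("number").corr()

fig, ax = plt.subplots(figsize=(4, 3))
# Fixing the colour scale to [-1, 1] keeps zero correlation at the midpoint
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```

Strongly correlated feature pairs carry overlapping information, which is useful to know when choosing a single feature for the model.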

### Model training
**Separate the dataset into label and feature arrays (use only one feature)**

**Split into training and test data using train_test_split**

**Import logistic regression from sklearn and fit model on training data**

**Predict values from the test data**

**Generate a classification report and confusion matrix**
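The training steps above can be sketched end to end on synthetic data. The numbers below are randomly generated and are not the *Senecio* measurements; the 0/1 label coding, the 30% test size, and the `random_state` values are all assumptions made for illustration. The names `logmodel`, `X_train`, and `y_train` match those used in the plotting code further down.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic one-feature data: class 1 tends to have larger values than class 0
rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(3.0, 0.5, 100), rng.normal(4.5, 0.5, 100)]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

# Hold out 30% for testing; stratify keeps the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit on the training data only, then predict the held-out labels
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)

print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
```

In the confusion matrix returned by scikit-learn, rows are the true classes and columns are the predicted classes.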

**Plot the estimated probabilities as we did for the Palmer penguins dataset (code provided, but at the very least you will likely need to modify the range of X_new in `np.linspace()`)**

In [None]:
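import numpy as np
import matplotlib.pyplot as plt

# Grid of feature values as a single column; adjust the range to span your data
X_new = np.linspace(2, 6, 1000).reshape(-1, 1)
y_proba = logmodel.predict_proba(X_new)  # columns follow logmodel.classes_
plt.figure(figsize=(8, 3))
# Assumes labels are coded 0 = Headland, 1 = Dune; swap the labels if yours differ
plt.plot(X_new, y_proba[:, 1], "g-", label="Dune")
plt.plot(X_new, y_proba[:, 0], "b--", label="Headland")
# First grid value where P(class 1) reaches 0.5 (assumes it increases with the feature)
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0, 0]
# Training points plotted at their 0/1 labels
plt.plot(X_train[y_train == 0], y_train[y_train == 0], "bs")
plt.plot(X_train[y_train == 1], y_train[y_train == 1], "g^")
plt.plot([decision_boundary, decision_boundary], [-1, 2], "k:", linewidth=2)
plt.text(decision_boundary + 0.02, 0.15, "Decision boundary", fontsize=14, color="k", ha="center")
plt.xlabel("MSD", fontsize=14)
plt.ylabel("Probability", fontsize=14)
plt.legend(loc="center right", fontsize=14)
plt.axis([2, 6, -0.02, 1.02])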

**Generate ROC curve**
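A ROC curve plots the true positive rate against the false positive rate as the probability threshold is varied. The sketch below uses scikit-learn's `roc_curve` and `roc_auc_score` on synthetic data (random numbers with an assumed 0/1 label coding, not the *Senecio* measurements).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic one-feature data: class 1 tends to have larger values than class 0
rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(3.0, 0.5, 100), rng.normal(4.5, 0.5, 100)]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
logmodel = LogisticRegression().fit(X_train, y_train)

# roc_curve needs scores, not hard predictions: use the probability of class 1
probs = logmodel.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)

plt.figure(figsize=(4, 4))
plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], "k--", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend(loc="lower right")
```

If the labels are strings rather than 0/1, `roc_curve` needs its `pos_label` argument set to the class treated as positive.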

**Find optimum decision boundary**
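The exercise does not say which criterion defines "optimum", so the sketch below assumes one common choice: the threshold that maximises Youden's J statistic (true positive rate minus false positive rate). It then converts that probability threshold back to a feature value by inverting the fitted logistic function. The data are synthetic and the names `best_threshold` and `x_boundary` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic one-feature data: class 1 tends to have larger values than class 0
rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(3.0, 0.5, 100), rng.normal(4.5, 0.5, 100)]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
logmodel = LogisticRegression().fit(X_train, y_train)

probs = logmodel.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)

# Youden's J: pick the threshold with the largest gap between TPR and FPR
best = np.argmax(tpr - fpr)
best_threshold = thresholds[best]

# Invert p = 1 / (1 + exp(-(b0 + b1 * x))) to get the matching feature value
b0 = logmodel.intercept_[0]
b1 = logmodel.coef_[0, 0]
x_boundary = (np.log(best_threshold / (1 - best_threshold)) - b0) / b1

print(f"Probability threshold: {best_threshold:.3f}")
print(f"Feature value at that threshold: {x_boundary:.3f}")
```

Other criteria, such as the point closest to the top-left corner of the ROC plot or a threshold weighted by misclassification costs, may give a different answer.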

## Good work!