## Intermediate Data Science

#### University of Redlands - DATA 201
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data201.joannabieri.com](https://joannabieri.com/data201_intermediate.html)

In [1]:
# Some basic package imports
import os
import numpy as np
import pandas as pd

# Visualization packages
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
import seaborn as sns

# ML packages
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score 
from sklearn.metrics import confusion_matrix, classification_report

## Do the two You Try problems from the lecture. 

- Open the Day13_LinearLogisticRegression.ipynb code and run it.
- Do the two You Try problems to make sure you understand the models and over- vs. underfitting.



---------------
## Linear and Logistic Regression - Day13 HW


Let's do an analysis of the Avocado Ripeness data from [https://www.kaggle.com/datasets/amldvvs/avocado-ripeness-classification-dataset](https://www.kaggle.com/datasets/amldvvs/avocado-ripeness-classification-dataset)

This is the same data from last class!

Explore some of the following research questions:

- Can you predict avocado firmness given one or more of the numerical features?
    - What model should you use here and why? (e.g. Linear Regression vs. Logistic Regression, linear model vs. nonlinear model)
    - Use appropriate numerical measures to talk about how good your model is (e.g. for Linear Regression we would talk about MSE and $R^2$, but for Logistic Regression we might do a classification report).

- Can you predict whether or not an avocado is ripe based on one or more of the other features? **NOTE - below I add a column for ripe 0/1.** Use this as your target.
    - What model should you use here and why? (e.g. Linear Regression vs. Logistic Regression, linear model vs. nonlinear model)
    - Use appropriate numerical measures to talk about how good your model is (e.g. for Linear Regression we would talk about MSE and $R^2$, but for Logistic Regression we might do a classification report).


- In both cases play around with the models a little bit and see how you can get your best predictions!
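
As a starting point for the firmness question, here is a minimal sketch that compares a linear fit with a degree-2 polynomial fit and reports MSE and $R^2$ for each. It uses only the nine rows shown in the `df` preview in this notebook, and the feature choice (`sound_db`, `size_cm3`) and the helper name `fit_and_score` are assumptions for illustration. The scores are computed on the same rows used for fitting, so they say nothing about overfitting; on the full dataset you would score on a held-out test set.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Only the nine rows displayed in the notebook's df preview (not the full dataset).
sample = pd.DataFrame({
    "firmness": [14.5, 71.7, 88.5, 93.8, 42.5, 94.1, 21.6, 14.0, 61.5],
    "sound_db": [34, 69, 79, 75, 63, 72, 47, 37, 65],
    "size_cm3": [261, 185, 143, 140, 227, 134, 240, 274, 162],
})

X = sample[["sound_db", "size_cm3"]]  # feature choice is an assumption
y = sample["firmness"]                # continuous target, so regression


def fit_and_score(degree):
    """Fit a polynomial regression of the given degree; return in-sample (MSE, R^2)."""
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X, y)
    y_pred = model.predict(X)
    return mean_squared_error(y, y_pred), r2_score(y, y_pred)


mse_linear, r2_linear = fit_and_score(1)
mse_quad, r2_quad = fit_and_score(2)

print(f"degree 1: MSE={mse_linear:.2f}, R^2={r2_linear:.3f}")
print(f"degree 2: MSE={mse_quad:.2f}, R^2={r2_quad:.3f}")
```

The degree-2 model can never score worse than the linear one on its own training rows, which is exactly why training scores alone cannot tell you whether the extra flexibility helps.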


NOTE - You should be able to look at the sns.pairplot() and say ahead of time which of the variables would be good for predicting either firmness or ripeness, just based on the shape of the graphs!
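
A numerical companion to the pairplot is the correlation of each feature with the target. The sketch below is an illustration on the nine rows shown in the `df` preview only, so the values will differ from what you get on the full dataset; the variable names `firmness_corr` and `ripe_corr` are placeholders.

```python
import pandas as pd

# Only the nine rows displayed in the notebook's df preview (not the full dataset).
sample = pd.DataFrame({
    "firmness": [14.5, 71.7, 88.5, 93.8, 42.5, 94.1, 21.6, 14.0, 61.5],
    "hue": [19, 53, 60, 105, 303, 83, 17, 4, 63],
    "saturation": [40, 69, 94, 87, 58, 80, 36, 40, 87],
    "brightness": [26, 75, 46, 41, 32, 58, 19, 17, 75],
    "sound_db": [34, 69, 79, 75, 63, 72, 47, 37, 65],
    "weight_g": [175, 206, 220, 299, 200, 254, 182, 188, 261],
    "size_cm3": [261, 185, 143, 140, 227, 134, 240, 274, 162],
    "ripe": [1, 0, 0, 0, 0, 0, 1, 1, 0],
})

# Pearson correlation between every pair of numeric columns.
corr = sample.corr()

# How strongly does each feature move with firmness, and with the ripe flag?
firmness_corr = corr["firmness"].drop("firmness")
ripe_corr = corr["ripe"].drop("ripe")

print(firmness_corr)
print(ripe_corr)
```

Features with correlations near +1 or -1 show up in the pairplot as tight, roughly straight bands. Correlation only captures linear relationships, so a curved pattern in the pairplot can still be useful even when this number is small.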


Please write up your conclusions.

**Your final notebooks should:**

- [ ] Be a completely new notebook with just the Day13 homework in it (no You Try problems). Read in the data and make the plots. Make sure to discuss what you see and comment on why your plots are great!
- [ ] **Contain your "best models" for both questions ALONG WITH a discussion of what other things you tried and why these are your best results.**
- [ ] Be reproducible with junk code removed.
- [ ] Have lots of language describing what you are doing, especially for questions you are asking or things that you find interesting about the data. Use complete sentences, nice headings, and good markdown formatting: https://www.markdownguide.org/cheat-sheet/
- [ ] Run without errors from start to finish.

In [2]:
import kagglehub
# Download latest version
path = kagglehub.dataset_download("amldvvs/avocado-ripeness-classification-dataset")

print("Path to dataset files:", path)

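# Note: this downloads three files. We read the first entry that os.listdir()
# returns; that order is arbitrary, so confirm the file name is the one you expect.
file = os.path.join(path, os.listdir(path)[0])
df = pd.read_csv(file)
# Binary target: 1 for any label containing 'ripe' (e.g. 'ripe', 'firm-ripe'), else 0.
df['ripe'] = df['ripeness'].apply(lambda x: 1 if 'ripe' in x else 0)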
df

Path to dataset files: /Users/mac/.cache/kagglehub/datasets/amldvvs/avocado-ripeness-classification-dataset/versions/1


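|  | firmness | hue | saturation | brightness | color_category | sound_db | weight_g | size_cm3 | ripeness | ripe |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.5 | 19 | 40 | 26 | black | 34 | 175 | 261 | ripe | 1 |
| 1 | 71.7 | 53 | 69 | 75 | green | 69 | 206 | 185 | pre-conditioned | 0 |
| 2 | 88.5 | 60 | 94 | 46 | dark green | 79 | 220 | 143 | hard | 0 |
| 3 | 93.8 | 105 | 87 | 41 | dark green | 75 | 299 | 140 | hard | 0 |
| 4 | 42.5 | 303 | 58 | 32 | purple | 63 | 200 | 227 | breaking | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 245 | 94.1 | 83 | 80 | 58 | dark green | 72 | 254 | 134 | hard | 0 |
| 246 | 21.6 | 17 | 36 | 19 | black | 47 | 182 | 240 | firm-ripe | 1 |
| 247 | 14.0 | 4 | 40 | 17 | black | 37 | 188 | 274 | ripe | 1 |
| 248 | 61.5 | 63 | 87 | 75 | green | 65 | 261 | 162 | pre-conditioned | 0 |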


1. EDA
2. Train/test split

In [7]:
df.keys()

Index(['firmness', 'hue', 'saturation', 'brightness', 'color_category',
       'sound_db', 'weight_g', 'size_cm3', 'ripeness', 'ripe'],
      dtype='object')

In [22]:
features = ['weight_g', 'brightness']
target = ['ripe']

X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
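
One optional variation, not used in this notebook, is to pass `stratify` so that the train and test sets keep the same share of each class. The sketch below uses made-up placeholder arrays (`X_demo`, `y_demo`) rather than the avocado data, just to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 40 of them labeled 1 (not the avocado dataset).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 40 + [0] * 60)

# stratify=y_demo keeps the 40/60 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test: ", y_te.mean())
```

With a small dataset, an unstratified split can leave the test set with noticeably more or fewer ripe avocados than the training set, which makes the classification report harder to interpret.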

In [23]:
X_train

|  | weight_g | brightness |
|---|---|---|
| 132 | 265 | 58 |
| 225 | 208 | 48 |
| 238 | 234 | 66 |
| 119 | 241 | 53 |
| 136 | 219 | 10 |
| ... | ... | ... |
| 106 | 259 | 59 |
| 14 | 184 | 52 |
| 92 | 215 | 68 |
| 179 | 258 | 54 |


In [24]:
joanna = LogisticRegression()
joanna.fit(X_train, y_train.to_numpy().reshape(-1))

In [25]:
y_true = y_test.to_numpy().reshape(-1)
y_pred = joanna.predict(X_test)
accuracy_score(y_true, y_pred)

0.86

In [26]:
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.82      0.87        28
           1       0.80      0.91      0.85        22

    accuracy                           0.86        50
   macro avg       0.86      0.87      0.86        50
weighted avg       0.87      0.86      0.86        50
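
To see where the numbers in the report come from, it helps to rebuild them from confusion-matrix counts. The counts below are not printed anywhere in this notebook; they are inferred from the rounded report (support of 28 and 22, recalls of 0.82 and 0.91), so treat them as an assumption. In the notebook itself, `confusion_matrix(y_true, y_pred)` would give the actual counts.

```python
# Counts inferred from the rounded classification report (an assumption,
# not output from the notebook). Class 1 = ripe, class 0 = not ripe.
tn, fp = 23, 5   # the 28 true class-0 avocados: predicted 0, predicted 1
fn, tp = 2, 20   # the 22 true class-1 avocados: predicted 0, predicted 1

# Precision: of everything predicted as a class, how much really is that class?
precision_0 = tn / (tn + fn)
precision_1 = tp / (tp + fp)

# Recall: of everything that really is a class, how much did we find?
recall_0 = tn / (tn + fp)
recall_1 = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall.
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Read this way, the model misses only 2 of the 22 ripe avocados but labels 5 of the 28 unripe ones as ripe, which is why recall is higher than precision for class 1.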



In [27]:
X_test

|  | weight_g | brightness |
|---|---|---|
| 142 | 258 | 74 |
| 6 | 187 | 57 |
| 97 | 214 | 57 |
| 60 | 182 | 53 |
| 112 | 245 | 77 |
| 181 | 291 | 68 |
| 197 | 229 | 56 |
| 184 | 283 | 58 |
| 9 | 232 | 78 |
| 104 | 235 | 74 |


In [34]:
weight = 300
brightness = 30

ex_df = pd.DataFrame({'weight_g': [weight], 'brightness': [brightness]})
joanna.predict(ex_df)


array([0])
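
`predict` only returns the class label. If you also want to know how confident the model is, `predict_proba` returns the estimated probability of each class. The sketch below is self-contained, so it fits a separate model on just the nine rows shown in the `df` preview and adds a `StandardScaler` step; both of those are assumptions for illustration and differ from the `joanna` model above, so the probabilities will not match what that model would give.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Only the nine rows displayed in the notebook's df preview (not the full dataset).
sample = pd.DataFrame({
    "weight_g": [175, 206, 220, 299, 200, 254, 182, 188, 261],
    "brightness": [26, 75, 46, 41, 32, 58, 19, 17, 75],
    "ripe": [1, 0, 0, 0, 0, 0, 1, 1, 0],
})

# Scaling is an added step (an assumption); the notebook fits on raw features.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(sample[["weight_g", "brightness"]], sample["ripe"])

# Same hypothetical avocado as in the cell above.
ex_df = pd.DataFrame({"weight_g": [300], "brightness": [30]})

label = model.predict(ex_df)        # class label, 0 or 1
proba = model.predict_proba(ex_df)  # columns: P(ripe = 0), P(ripe = 1)

print("predicted label:", label[0])
print("P(not ripe), P(ripe):", proba[0])
```

A probability close to 0.5 means the avocado sits near the decision boundary, so that prediction deserves less trust than one close to 0 or 1.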