<font color='darkred'>Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, you'll create a `train.py` file. The *apputil.py* file is only needed for the bonus exercises.

## Exercise 1

Recall the [simple streamlit app](https://github.com/leontoddjohnson/simple_streamlit) and the [coffee analysis data](https://raw.githubusercontent.com/leontoddjohnson/datasets/refs/heads/main/data/coffee_analysis.csv) used.

Write a Python script called `train.py` that does the following:

- Loads the [coffee analysis data](https://raw.githubusercontent.com/leontoddjohnson/datasets/refs/heads/main/data/coffee_analysis.csv) (from the URL).
- Trains a (Scikit-Learn) linear regression model to predict `rating` based on the single feature `100g_USD`.
- Saves the trained model in this repository as a pickle file called `model_1.pickle`.
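
One way to sanity-check the save step is to load the pickle back and confirm that the restored model gives the same predictions. The sketch below is only an illustration: it uses a small made-up dataframe in place of the coffee data and writes to a temporary directory, so the names `toy`, `path`, and `restored` are assumptions rather than part of the exercise.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-in for the coffee data (the real script reads the CSV from the URL).
toy = pd.DataFrame({
    "100g_USD": [4.0, 6.0, 8.0, 10.0],
    "rating": [90.0, 91.0, 92.0, 93.0],
})

model = LinearRegression().fit(toy[["100g_USD"]], toy["rating"])

# Write to a temporary directory so a real model_1.pickle is never overwritten.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_1.pickle")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict exactly what the original does.
new_X = pd.DataFrame({"100g_USD": [5.0, 12.0]})
print(restored.predict(new_X))
print(model.predict(new_X))
```

Note that unpickling generally expects the same Scikit-Learn version that was used to save the model.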

## Exercise 2

Update the script to train a **Decision Tree Regressor** model that predicts `rating` based on *both* `100g_USD` and `roast`, and saves the trained model as `model_2.pickle`. Notice that the `roast` column is categorical, so you'll need to convert it into a numerical label format:

- Create a function called `roast_category` that maps *every* roast value in the data to a number the model can accept as input. *Note: missing values are okay.*
    - For example, you might have `roast_category('Medium-Light') --> 1`
- Use this function along with `.map` or `.apply` (in pandas) to create a corresponding numerical column, `roast_cat`.
- Train your model on `100g_USD` and `roast_cat`.
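
The mapping steps above could be sketched as follows. This is a rough illustration, not the required solution: the list of roast labels and their order are assumptions (check them against `data['roast'].unique()`), and the `toy` dataframe is made up.

```python
import numpy as np
import pandas as pd

# Assumed labels, ordered lightest to darkest; verify against the real data.
ROAST_ORDER = ["Light", "Medium-Light", "Medium", "Medium-Dark", "Dark"]
ROAST_CODES = {name: code for code, name in enumerate(ROAST_ORDER)}


def roast_category(roast):
    """Return an integer code for a roast label, or NaN if it is missing or unknown."""
    return ROAST_CODES.get(roast, np.nan)


# Made-up column with one missing value.
toy = pd.DataFrame({"roast": ["Medium-Light", "Dark", None, "Light"]})
toy["roast_cat"] = toy["roast"].map(roast_category)
print(toy)
```

Returning `NaN` for anything outside the dictionary keeps missing values missing instead of assigning them an arbitrary code.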

For example, once the decision tree is trained, we should be able to run the `dtr.predict` cell below (the second of the two cells).

In [1]:

import pandas as pd
from sklearn.linear_model import LinearRegression
import pickle

# Load the data
url = "https://raw.githubusercontent.com/leontoddjohnson/datasets/refs/heads/main/data/coffee_analysis.csv"
data = pd.read_csv(url)

# Prepare the model
X = data[['100g_USD']]   # predictor
y = data['rating']       # response

# Train the model
model = LinearRegression()
model.fit(X, y)

# Save the trained model
with open("model_1.pickle", "wb") as f:
    pickle.dump(model, f)

# Print confirmation message
print("Model training complete. File 'model_1.pickle' created successfully.")


Model training complete. File 'model_1.pickle' created successfully.


In [None]:
import numpy as np

# One row per coffee: price per 100g, then the numeric roast code
df_X = pd.DataFrame([
    [10.00, 1],
    [15.00, 3],
    [8.50, np.nan]],
    columns=["100g_USD", "roast_cat"])

y_pred = dtr.predict(df_X.values)   # `dtr` is your fitted DecisionTreeRegressor
y_pred

**Note: Do not worry about model performance here.** In fact, try `roast_cat` values that don't make sense, such as `-99` or `2938.24` (try large numbers!). This is not an issue for this week, but consider what this indicates about how the decision tree behaves ...
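
To experiment with the note above without touching the coffee data, a small synthetic sketch like the following may help. Everything here is an assumption made for illustration: the random data, the `max_depth=3` setting, and the `probe` rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: column 0 plays the role of price, column 1 of roast_cat (0 to 4).
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(3, 30, size=200), rng.integers(0, 5, size=200)])
y = 90 + 0.1 * X[:, 0] + 0.5 * X[:, 1]

dtr = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Same price each time; roast_cat is in range (0 and 4), then far outside it.
probe = np.array([
    [10.0, 0],
    [10.0, 4],
    [10.0, -99],
    [10.0, 2938.24],
])
print(dtr.predict(probe))
```

Each split in the tree is a threshold comparison learned from the training values, so a value far outside the training range follows the same branches as the nearest extreme value that was seen in training.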

## (Optional) 

### Bonus Exercise 3

Update the *apputil.py* file to include a `predict_rating(df_X)` function that takes a two-column dataframe, `df_X`, with columns `100g_USD` (numerical) and `roast` (in original text form), and returns an array of the corresponding predicted `rating` values. If a `roast` value is not one of the roast values in the training data, the function should use only the `100g_USD` value to make that prediction (recall `model_1.pickle`). Otherwise, it should use both features.

In [None]:
import pandas as pd
from apputil import predict_rating

df_X = pd.DataFrame([
    [10.00, "Dark"],
    [15.00, "Very Light"]], 
    columns=["100g_USD", "roast"])
y_pred = predict_rating(df_X)
y_pred
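
The fallback logic described for this exercise might look roughly like the sketch below. It is an illustration under stated assumptions: the `ROAST_CODES` dictionary is a guess at the mapping, and the two toy models are fitted in memory to stand in for the models you would load from `model_1.pickle` and `model_2.pickle`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Assumed roast codes; the real mapping is whatever roast_category uses in train.py.
ROAST_CODES = {"Light": 0, "Medium-Light": 1, "Medium": 2, "Medium-Dark": 3, "Dark": 4}

# Toy models standing in for the unpickled model_1.pickle and model_2.pickle.
train = pd.DataFrame({
    "100g_USD": [4.0, 6.0, 8.0, 10.0, 12.0],
    "roast_cat": [0, 1, 2, 3, 4],
    "rating": [90.0, 91.0, 92.0, 93.0, 94.0],
})
model_1 = LinearRegression().fit(train[["100g_USD"]].values, train["rating"])
model_2 = DecisionTreeRegressor(random_state=0).fit(
    train[["100g_USD", "roast_cat"]].values, train["rating"]
)


def predict_rating(df_X):
    """Use model_2 for known roasts and fall back to model_1 otherwise."""
    price = df_X["100g_USD"].to_numpy(dtype=float)
    # Labels missing from the dictionary become NaN.
    roast_cat = df_X["roast"].map(ROAST_CODES).to_numpy(dtype=float)
    known = ~np.isnan(roast_cat)

    y_pred = np.empty(len(df_X))
    if known.any():
        both = np.column_stack([price[known], roast_cat[known]])
        y_pred[known] = model_2.predict(both)
    if (~known).any():
        y_pred[~known] = model_1.predict(price[~known].reshape(-1, 1))
    return y_pred


df_X = pd.DataFrame([[10.00, "Dark"], [15.00, "Very Light"]], columns=["100g_USD", "roast"])
print(predict_rating(df_X))
```

Splitting the rows with a boolean mask lets each model predict only the rows it applies to while keeping the output in the original row order.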

### Bonus Exercise 4

Vectorize the `desc_3` column in the coffee analysis data using TF-IDF vectorization. Train a linear regression model to predict `rating` based only on the vectorized text data, and save the trained model as `model_3.pickle`.

Adjust your `predict_rating` function so that it accepts a `text` argument, as in `predict_rating(X, text=True)`. Setting `text=True` indicates that `X` is an array of strings of text (in the style of the reviews in `desc_3`), and in that case the function should return predicted ratings based on the text.

Note: you'll need to figure out what to do when the input text contains words that were not in the training data!
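
As a starting point for thinking about unseen words, the sketch below shows what Scikit-Learn's `TfidfVectorizer` does by default when asked to transform text containing words outside its fitted vocabulary. The toy reviews and ratings are made up, and this is one possible setup rather than the expected solution.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Toy reviews and ratings standing in for desc_3 and rating.
train_text = [
    "bright citrus and floral notes",
    "rich chocolate and caramel sweetness",
    "bold smoky finish with dark chocolate",
    "delicate floral aroma with citrus finish",
]
train_rating = [94.0, 93.0, 90.0, 95.0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)
model = LinearRegression().fit(X_train, train_rating)

# transform() reuses the fitted vocabulary; words it has never seen are ignored.
new_text = ["delightfull chocolate and caramel", "zzz qqq"]
X_new = vectorizer.transform(new_text)

print(X_new.shape[1] == X_train.shape[1])  # same columns as in training
print(X_new[1].nnz)                         # non-zero entries for the all-unseen text
print(model.predict(X_new))
```

Because `transform` relies on the fitted vocabulary, the fitted vectorizer has to be available at prediction time along with the model. A text made up entirely of unseen words becomes an all-zero row, so the linear model's prediction for it reduces to its intercept.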

In [None]:
X = pd.DataFrame([
    "A delightfull coffee with hints of chocolate and caramel.",
    "A strong coffee with a bold flavor and a smoky finish."], 
    columns=["text"])
y = predict_rating(X, text=True)
y