<a href="https://colab.research.google.com/github/joeljohnston/mediagen/blob/master/Week_3_Regression_with_Scikit_learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> 1. DUPLICATE THIS COLAB TO START WORKING ON IT: Use **File** > **Save a copy in Drive**
> 2. SHARE SETTINGS: In the new notebook, set the sharing settings to **Anyone with the link** by clicking **Share** on the top-right corner.

<center>
  <img src=https://freedesignfile.com/upload/2021/01/Cartoon-illustration-home-office-vector.jpg width="500" align="center" />
</center>
<br/>


# Week 3: Pricing your new Airbnb listings!

👋 Hello 👋 Our third week project for *Python for Machine Learning* is all about seeing if you can use the dataset to make predictions for new Airbnb listings.

This week's lecture and material on Uplimit demonstrated a simple flow for how to structure small machine learning projects so that you can create your own models. Now we'll test your skills a bit with this project. At the end, you'll be able to use the model that you created as an app! Let's get started 💪💪!

## Downloading the dataset

You'll need to download some prerequisite Python packages in order to run all of the code below. Let's install them!

In [1]:
%%capture
!pip install numpy streamlit pandas==1.5.2 scikit-learn==1.2.0

We will download the dataset from Google Drive just like we did last week, but this time the dataset is stored as a [Pickle](https://pythonnumericalmethods.berkeley.edu/notebooks/chapter11.03-Pickle-Files.html) file.

In [2]:
!wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1KTF77Sj0kWyft9gNT3_6k84gauPA95rG' -O listings.pkl

--2024-07-31 17:29:02--  https://drive.google.com/uc?export=download&id=1KTF77Sj0kWyft9gNT3_6k84gauPA95rG
Resolving drive.google.com (drive.google.com)... 74.125.197.102, 74.125.197.101, 74.125.197.100, ...
Connecting to drive.google.com (drive.google.com)|74.125.197.102|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1KTF77Sj0kWyft9gNT3_6k84gauPA95rG&export=download [following]
--2024-07-31 17:29:02--  https://drive.usercontent.google.com/download?id=1KTF77Sj0kWyft9gNT3_6k84gauPA95rG&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 74.125.197.132, 2607:f8b0:400e:c03::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|74.125.197.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 493925 (482K) [application/octet-stream]
Saving to: ‘listings.pkl’


2024-07-31 17:29:04 (85.0 MB/s) - ‘listings.pkl’ saved [493925/493

In [3]:
import pandas as pd

# Show all columns (instead of truncating the columns in the middle)
pd.set_option("display.max_columns", None)
# Don't show numbers in scientific notation
pd.set_option("display.float_format", "{:.2f}".format)

## Preprocessing the dataset
Please load the downloaded file as a DataFrame (df). The method for loading these datasets is the same as what we did on the Uplimit platform.

#### Part 1: Read Pickle

Read the Python Pickle file that we've just downloaded as `df_list`.

In [4]:
# Read a Python Pickle file
df_list = pd.read_pickle("listings.pkl")

<details>
<summary>Show Solution</summary>

```python
df_list = pd.read_pickle("listings.pkl")
```
</details>
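For intuition about what a Pickle file stores, here is a minimal sketch of a round trip through `to_pickle` and `read_pickle`. The toy DataFrame and the temporary file name are made up for illustration and are not part of the Airbnb dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data; the real listings.pkl holds the Airbnb DataFrame.
toy = pd.DataFrame(
    {"room_type": ["Private room", "Shared room"], "price_in_dollar": [127.0, 62.0]}
)
toy["room_type"] = toy["room_type"].astype("category")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(path)              # Serialize the DataFrame to disk
    restored = pd.read_pickle(path)  # Load it back into memory

# The round trip keeps dtypes such as "category" intact.
print(restored.dtypes)
same = restored.equals(toy)
```

Keep in mind that unpickling can execute arbitrary code, so only load Pickle files from sources you trust.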

Now let's have a look at the **Listings DataFrame** to get a feel for how the data looks. Show the first two rows.

In [6]:
# Show first two rows of the dataset
df_list.head(2)

| | id | host_acceptance_rate | neighbourhood | room_type | price_in_dollar | amenities | accommodates | host_is_superhost | has_availability | review_scores_rating | instant_bookable | number_of_reviews_l30d | discount_per_5_days_booked | discount_per_10_days_booked | discount_per_30_and_more_days_booked | host_reported_average_tip | service_cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23726716 | 0.95 | De Pijp - Rivierenbuurt | Private room | 127.0 | 15 | 7 | False | True | 4.61 | False | 3 | 8.0 | 15.0 | 16.0 | 1.03 | $4.99 |
| 1 | 35815046 | 1.0 | De Baarsjes - Oud-West | Shared room | 62.0 | 13 | 2 | False | True | 4.38 | False | 6 | 4.0 | 10.0 | 16.0 | 1.26 | $2.99 |


<details>
<summary>Show Solution</summary>

```python
df_list.head(2)
```
</details>

Awesome! Our next step is to get an overview of the columns that are in this particular DataFrame 🧐.

#### Part 2: Print column names, types, and non-null values

Let's try and get an overview of the **Listings DataFrame**, called `df_list`, with the [`info()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) command. This should show us some details about the columns in the DataFrame, like the column names, their data types, and the number of non-null values.

In [8]:
df_list.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4817 entries, 0 to 6172
Data columns (total 17 columns):
 #   Column                                Non-Null Count  Dtype   
---  ------                                --------------  -----   
 0   id                                    4817 non-null   int64   
 1   host_acceptance_rate                  4817 non-null   float64 
 2   neighbourhood                         4817 non-null   category
 3   room_type                             4817 non-null   category
 4   price_in_dollar                       4817 non-null   float64 
 5   amenities                             4817 non-null   int64   
 6   accommodates                          4817 non-null   int64   
 7   host_is_superhost                     4817 non-null   bool    
 8   has_availability                      4817 non-null   bool    
 9   review_scores_rating                  4817 non-null   float64 
 10  instant_bookable                      4817 non-null   bool    
 11  numb

<details>
<summary>Show Solution</summary>

```python
df_list.info()
```
</details>

This info printout provides a good overview of which columns we can use to make our predictions. Often you don't need to use all of the data or all of the columns to make great predictions.

Also, we saw on Uplimit this week that columns with the dtype **category** and **bool** require some processing before we can use them in Scikit-learn algorithms. The algorithms from Scikit-learn that we will use are not made to handle text unless it is encoded properly. These algorithms work with numerical values.

#### Part 3: Make a selection

There are a total of 17 columns. For readability purposes, we are going to prematurely drop a lot of these columns. In practice, you often drop columns at a later stage once you've determined their value. This is something we'll look into next week. For now we want you to drop the columns:

- **id** (it has no meaning for the ML algorithm)
- **host_acceptance_rate**
- **host_is_superhost**
- **has_availability**
- **number_of_reviews_l30d**
- **discount_per_5_days_booked**
- **discount_per_10_days_booked**
- **discount_per_30_and_more_days_booked**
- **service_cost**


In [9]:
# `columns=` already implies the column axis, so `axis=1` is not needed
df_list = df_list.drop(columns=["id", "host_acceptance_rate", "host_is_superhost", "has_availability", "number_of_reviews_l30d",
                                "discount_per_5_days_booked", "discount_per_10_days_booked", "discount_per_30_and_more_days_booked", "service_cost"])

<details>

<summary>Show Solution</summary>

```python
# `columns=` already implies the column axis, so `axis=1` is not needed
df_list = df_list.drop(columns=["id", "host_acceptance_rate", "host_is_superhost", "has_availability", "number_of_reviews_l30d",
                                "discount_per_5_days_booked", "discount_per_10_days_booked", "discount_per_30_and_more_days_booked", "service_cost"])
```
</details>

Now take another look at the leftover information contained in the DataFrame.

In [10]:
df_list.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4817 entries, 0 to 6172
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   neighbourhood              4817 non-null   category
 1   room_type                  4817 non-null   category
 2   price_in_dollar            4817 non-null   float64 
 3   amenities                  4817 non-null   int64   
 4   accommodates               4817 non-null   int64   
 5   review_scores_rating       4817 non-null   float64 
 6   instant_bookable           4817 non-null   bool    
 7   host_reported_average_tip  4817 non-null   float64 
dtypes: bool(1), category(2), float64(3), int64(2)
memory usage: 240.1 KB


<details>
<summary>Show Solution</summary>

```python
df_list.info()

```
</details>

Here we see that we are left with two categorical columns and one boolean column. These need to be encoded in a format that a Scikit-learn algorithm can work with.

Also, it is important that we determine what we actually want our algorithm to predict. We'd like to predict the price of an Airbnb listing, so our target variable will be "price_in_dollar", while the other variables are our features. This means that our problem is yet again a regression problem, since our target variable is a continuous numerical value.


#### Part 4: Create the target and feature variables

It's necessary that we split our dataset before we start training our algorithm. Remember the "student taking a test" metaphor?

Let's split up our DataFrame into target (y) and feature variables (X), using the leftover columns from the previous task.

In [11]:
X, y = (
    df_list[["neighbourhood", "room_type", "host_reported_average_tip", "amenities",
             "accommodates", "review_scores_rating", "instant_bookable"]],
    df_list[["price_in_dollar"]],
)

<details>
<summary>Show Solution</summary>

```python
# features (X), label (y)
X, y = (
    df_list[["neighbourhood", "room_type", "host_reported_average_tip", "amenities",
             "accommodates", "review_scores_rating", "instant_bookable"]],
    df_list[["price_in_dollar"]],
)

```
</details>

#### Part 5: Split the dataset into train, validation, and test sets

Let's make a function that we can use to split our dataset into three separate parts: train, validation, and test. Before we can do this, we need to take care of a small prerequisite: setting a seed, so that our results are reproducible when we run our training on any computer at any time.

In [12]:
# Setting seed allows us to generate a random dataset split that
# is the same on every computer. Otherwise, every time you run
# the split, you'd get a different dataset split.
SEED = 42

<details>
<summary>Show Solution</summary>

```python
SEED = 42

```
</details>
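To see what a fixed seed buys us, here is a small sketch that uses NumPy's random generator as a stand-in for the shuffling that happens inside a dataset split. The generator calls are only an illustration; in this notebook the seed is passed to `train_test_split` through `random_state`.

```python
import numpy as np

SEED = 42

# Two generators created with the same seed produce the same shuffle.
first = np.random.default_rng(SEED).permutation(10)
second = np.random.default_rng(SEED).permutation(10)

# A generator with a different seed will usually produce another order.
other = np.random.default_rng(SEED + 1).permutation(10)

print(first, second, other)
same_seed_matches = bool((first == second).all())
```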

Then we should create a function that splits our dataset into three. Please have a peek over at Uplimit to see how we made this function!

In [13]:
import sklearn
from sklearn.model_selection import train_test_split

def train_validation_test_split(
    X, y, train_ratio: float, validation_ratio: float, test_ratio: float
):
    # Split up dataset into train and test, of which we split up the test
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=(1 - train_ratio), random_state=SEED
    )

    # Split up test into two (validation and test)
    X_val, X_test, y_val, y_test = train_test_split(
        X_test,
        y_test,
        test_size=(test_ratio / (test_ratio + validation_ratio)),
        random_state=SEED,
    )

    # Return the splits
    return X_train, X_val, X_test, y_train, y_val, y_test

<details>
<summary>Show Solution</summary>

```python

def train_validation_test_split(
    X, y, train_ratio: float, validation_ratio: float, test_ratio: float
):
    # Split up dataset into train and test, of which we split up the test.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=(1 - train_ratio), random_state=SEED
    )

    # Split up test into two (validation and test).
    X_val, X_test, y_val, y_test = train_test_split(
        X_test,
        y_test,
        test_size=(test_ratio / (test_ratio + validation_ratio)),
        random_state=SEED,
    )

    # Return the splits
    return X_train, X_val, X_test, y_train, y_val, y_test

```
</details>

This time we want a different ratio. We want to split our dataset into a ratio of 75%/15%/10% for train, validation, and test, respectively. Please do so below.

In [14]:
# Splits according to ratio of 75/15/10
X_train, X_val, X_test, y_train, y_val, y_test = train_validation_test_split(
    X, y, 0.75, 0.15, 0.1
)

<details>
<summary>Show Solution</summary>

```python
# Splits according to ratio of 75/15/10
X_train, X_val, X_test, y_train, y_val, y_test = train_validation_test_split(
    X, y, 0.75, 0.15, 0.1
)

```
</details>

#### Part 6: Verify the dataset sizes

Since you split the dataset into three parts, let's manually verify if their shape corresponds with what is expected, given the previous ratio. Use Pandas [`shape`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) to perform counts for each dataset.

In [15]:
X_train.shape, X_val.shape, X_test.shape

((3612, 7), (723, 7), (482, 7))

<details>
<summary>Show Solution</summary>

```python
X.shape, X_train.shape, X_val.shape, X_test.shape

```
</details>
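The printed shapes can also be cross-checked by hand. The sketch below redoes the arithmetic of the two-step split, assuming that a fractional `test_size` is rounded up to the next whole row and that the remainder goes to the other part (an assumption about how `train_test_split` rounds, not something this notebook states).

```python
import math

n_samples = 4817  # Number of rows reported by df_list.info()
train_ratio, validation_ratio, test_ratio = 0.75, 0.15, 0.10

# First split: everything that is not training data is held out.
first_test_size = 1 - train_ratio
n_holdout = math.ceil(first_test_size * n_samples)  # Assumed rounding rule
n_train = n_samples - n_holdout

# Second split: the held-out part is divided into validation and test.
second_test_size = test_ratio / (test_ratio + validation_ratio)
n_test = math.ceil(second_test_size * n_holdout)
n_val = n_holdout - n_test

print(n_train, n_val, n_test)
```

Under that assumption the three counts come out as 3612, 723, and 482 rows, which matches the shapes shown above. Note that the second `test_size` is 0.4 rather than 0.1, because it is a fraction of the held-out 25%, not of the full dataset.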

#### Part 7: Convert `bool` to `int`

As you might remember from above, we have three variables that we need to encode properly: the boolean and the categorical values. Please see below.

In [16]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4817 entries, 0 to 6172
Data columns (total 7 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   neighbourhood              4817 non-null   category
 1   room_type                  4817 non-null   category
 2   host_reported_average_tip  4817 non-null   float64 
 3   amenities                  4817 non-null   int64   
 4   accommodates               4817 non-null   int64   
 5   review_scores_rating       4817 non-null   float64 
 6   instant_bookable           4817 non-null   bool    
dtypes: bool(1), category(2), float64(2), int64(2)
memory usage: 202.5 KB


We normally do encoding after splitting the dataset in order to prevent data leakage (as described on Uplimit). This also means that every encoding step now has to be applied three times, once per split. Let's start with the boolean column "instant_bookable". For each split, please convert this column from a boolean to an int.

In [17]:
# Boolean to Int
X_train["instant_bookable"] = X_train["instant_bookable"].astype(int)
X_val["instant_bookable"] = X_val["instant_bookable"].astype(int)
X_test["instant_bookable"] = X_test["instant_bookable"].astype(int)

<details>
<summary>Show Solution</summary>

```python
X_train["instant_bookable"] = X_train["instant_bookable"].astype(int)
X_val["instant_bookable"] = X_val["instant_bookable"].astype(int)
X_test["instant_bookable"] = X_test["instant_bookable"].astype(int)
```
</details>
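If pandas shows a `SettingWithCopyWarning` when you run the conversion, it is because the splits are slices of the original DataFrame. One possible way around it, sketched here on a made-up mini-split rather than the real `X_train`, is to work on an explicit copy first.

```python
import pandas as pd

# Hypothetical mini-split standing in for X_train.
toy_split = pd.DataFrame(
    {"instant_bookable": [True, False, True], "accommodates": [2, 4, 1]}
)

# Working on an explicit copy makes it unambiguous that we modify
# this DataFrame and not a view of another one.
toy_split = toy_split.copy()
toy_split["instant_bookable"] = toy_split["instant_bookable"].astype(int)

print(toy_split["instant_bookable"].tolist())
```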

Then please inspect your data by using [`head(2)`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) on one of the three splits to see if our encoding worked. It should display 0 or 1 now for "instant_bookable".

In [18]:
# Show what the DataFrame looks like
X_train.head(2)

| | neighbourhood | room_type | host_reported_average_tip | amenities | accommodates | review_scores_rating | instant_bookable |
|---|---|---|---|---|---|---|---|
| 2791 | Westerpark | Shared room | 2.31 | 10 | 1 | 4.33 | 1 |
| 2897 | Centrum-West | Private room | 13.64 | 17 | 4 | 4.95 | 0 |


<details>
<summary>Show Solution</summary>

```python
X_train.head(2)

```
</details>

#### Part 8: Convert categorical to one-hot encoding

The most "elaborate" encoding step for us will be encoding our categorical columns to one-hot encoding. Please refer back to Uplimit to see how it's done, because that's exactly what you are going to do below, only you need to change the columns accordingly.

<details>
<summary>Show Solution</summary>

```python
# Define how the encoding should work.
oh_encoder = OneHotEncoder(  # Define one-hot encoding
    sparse_output=False,  # Sparse matrix doesn't work well with Pandas DataFrame.
    dtype="int",  # Set type to integer
)

# Define which columns to transform.
oh_enc_transformer = make_column_transformer(  # Define how to output columns
    (oh_encoder, ["room_type", "neighbourhood"]),
    verbose_feature_names_out=False,  # Column names are "more concise"
    remainder="passthrough",  # All other columns should be left untouched
)

# Train (fit) the transformation on the training set
oh_encoded = oh_enc_transformer.fit(X_train)  # Change from category to number

```
</details>

Above you "train" (fit) an encoder on the training set, after which we apply what it has "learned" (transform) to each of the dataset splits. We show you here how to do it for X_train. You'll have to do it yourself for X_val and X_test.

In [None]:
# One-hot encoder
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Define how the encoding should work
oh_encoder = OneHotEncoder(  # Define one-hot encoding
    sparse_output=False,  # Sparse matrix doesn't work well with Pandas DataFrame
    dtype="int",  # Set type to integer
)

# Define which columns to transform
oh_enc_transformer = make_column_transformer(  # Define how to output columns
    (oh_encoder, ["room_type", "neighbourhood"]),
    verbose_feature_names_out=False,  # Column names are "more concise"
    remainder="passthrough",  # All other columns should be left untouched
)

# Train (fit) the transformation on the training set
oh_encoded = oh_enc_transformer.fit(X_train)  # Change from category to number

In [None]:
# Transform the train columns into one-hot encoding
X_train_oh_enc = oh_encoded.transform(X_train)

# Turn the encoded columns into a df
X_train = pd.DataFrame(
    X_train_oh_enc,  # Input the transformed dataset
    columns=oh_encoded.get_feature_names_out(),  # Set column names
    index=X_train.index,  # Keep index numbering of original df
)

In [None]:
# Transform the validation columns into one-hot encoding
X_val_oh_enc = oh_encoded.transform(X_val)

# Turn the encoded columns into a dataframe.
X_val = pd.DataFrame(
    X_val_oh_enc,  # Input the transformed dataset
    columns=oh_encoded.get_feature_names_out(),  # Set column names
    index=X_val.index,
)

<details>
<summary>Show Solution</summary>

```python
X_val_oh_enc = oh_encoded.transform(X_val)

# Turn the encoded columns into a dataframe.
X_val = pd.DataFrame(
    X_val_oh_enc,  # Input the transformed dataset
    columns=oh_encoded.get_feature_names_out(),  # Set column names
    index=X_val.index,
)

```
</details>

In [None]:

# Transform the columns into one-hot-encoding.
X_test_oh_enc = oh_encoded.transform(X_test)

# Turn the encoded columns into a dataframe.
X_test = pd.DataFrame(
    X_test_oh_enc,  # Input the transformed dataset
    columns=oh_encoded.get_feature_names_out(),  # Set column names
    index=X_test.index,  # Keep index numbering of original df
)

<details>
<summary>Show Solution</summary>

```python

# Transform the columns into one-hot-encoding.
X_test_oh_enc = oh_encoded.transform(X_test)

# Turn the encoded columns into a dataframe.
X_test = pd.DataFrame(
    X_test_oh_enc,  # Input the transformed dataset
    columns=oh_encoded.get_feature_names_out(),  # Set column names
    index=X_test.index,  # Keep index numbering of original df
)

```
</details>

Now when you print out the [`info()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) of the data, you might see that the data types have all changed to "object". Don't worry, we have a quick fix for that.

In [None]:
X_train = X_train.convert_dtypes()
X_val = X_val.convert_dtypes()
X_test = X_test.convert_dtypes()
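As a rough illustration of what `convert_dtypes()` does, the sketch below builds a made-up DataFrame whose columns are stored as generic Python objects and lets pandas infer more specific dtypes. The column names `flag` and `score` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical object-dtype frame, similar to what mixed transformer output can produce.
raw = pd.DataFrame(
    np.array([[1, 2.5], [0, 4.5]], dtype=object), columns=["flag", "score"]
)
print(raw.dtypes)  # Both columns are stored as "object"

converted = raw.convert_dtypes()
print(converted.dtypes)  # pandas infers nullable numeric dtypes
```

The inferred dtypes are pandas' nullable extension types (written with a capital letter, such as `Int64`), which is why the output of `info()` may look slightly different from the lowercase `int64` seen earlier.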

Now, let's inspect if the numbers look right.

In [None]:
# Show what the DataFrame looks like
X_train.head(2)

... and verify if now our [`info()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) prints out the dtypes we've expected it to display.

In [None]:
X_train.info()

Seems about right! On to the next part!

---

#### (Optional) Part 9: Correlation and SPLOM matrix

As discussed on Uplimit, correlation is a simple and intuitive measure for seeing which features could be useful for our model later on. There exist [many more extensive and better methods](https://neptune.ai/blog/feature-selection-methods) for selecting features; for now, however, we'll focus only on this one.

Let's recreate the correlation matrix, but now with a few more features and a different target variable.

In [None]:
... # COPY THE CODE FROM UPLIMIT

<details>
<summary>Show Solution</summary>

```python
# Correlation - what are the best features?
import numpy as np
import plotly.express as px

# Exclude "neighbourhood" columns for better visualization.
X_train_filtered = X_train.filter(regex="^((?!neighbourhood).)*$")

# Combine X_train with y_train
ndf_list = pd.concat([X_train_filtered, y_train], axis=1)

# Create a dataframe that can be used as a heatmap
fig = px.imshow(
    ndf_list.corr().round(2),
    text_auto=True,
    aspect="auto",
    color_continuous_scale="rdylgn",
)
fig.show()

```
</details>

This reveals that for "host_reported_average_tip", there are many features that might be interesting to use. For example, "amenities", "price_in_dollar", and "room_type_Shared room" seem to be quite correlated. To expand our analysis, it might also be worthwhile to visualize these relations with a [SPLOM](https://plotly.com/python/splom/) (scatter plot matrix). Can you create a SPLOM using the provided link and the `ndf_list` DataFrame? We recommend setting the `height` parameter to at least 1200.

In [None]:
... # USE THE CODE FROM THE "SPLOM" LINK ABOVE

<details>
<summary>Show Solution</summary>

```python
# Splom
import plotly.express as px

fig = px.scatter_matrix(ndf_list, height=1200)
fig.show()

```
</details>

Now that you've seen the dataset, you might understand why some variables are more heavily correlated than others, since we can observe some linear trends between a few of these features.
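The correlation coefficient used in the heatmap (Pearson's, the default of `DataFrame.corr()`) only measures *linear* trends. A tiny sketch with made-up numbers shows the difference between a linear and a non-linear relationship:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_linear = 3 * x + 1  # Perfectly linear in x
y_curved = x ** 2     # Depends on x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]
print(round(r_linear, 2), round(r_curved, 2))
```

The first coefficient is 1 and the second is 0, even though `y_curved` is fully determined by `x`. A low correlation therefore does not always mean that a feature is useless, which is one reason a SPLOM is a helpful companion to the correlation matrix.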

---

#### Part 10: Linear Regression

We used the regression version of the decision tree this week on Uplimit. In this project, we are going to use Linear Regression to predict the price of our Airbnb listings. For now, you don't need to go too deep into the details if you just want to be able to use it. This is up to you.

First you need to import the [Linear Regression Model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). Please import this down below and assign this to the model variable.

In [None]:
# Get the algorithm
... # IMPORT YOUR CODE HERE

# Create a regression algorithm
model = ... # YOUR CODE HERE

<details>
<summary>Show Solution</summary>

```python
# Get the algorithm
from sklearn.linear_model import LinearRegression

# Create a regression algorithm
model = LinearRegression()

```
</details>

#### Part 11: Model training, pt. 1

Let's train our algorithm using the `X_train` dataset, where we only use the features that are most highly correlated with the target variable. In this case, let's use `amenities`.

A tip: If Scikit-learn shows a warning that recommends using `ravel()` on the target variable, you can use [`np.squeeze()`](https://numpy.org/doc/stable/reference/generated/numpy.squeeze.html) instead.

In [None]:
# Fit the model - Pick one feature "amenities"
model.fit(  # Train it ("Learn the material")
    ... , # YOUR CODE HERE
    ... # YOUR CODE HERE
)

<details>
<summary>Show Solution</summary>

```python
import numpy as np  # Needed for np.squeeze if you skipped the optional Part 9

# Fit the model - Pick one feature "amenities"
model.fit(  # Train it ("Learn the material")
    X_train[["amenities"]],
    np.squeeze(y_train),
)

```
</details>
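If you are curious what "fitting" means for a single feature, here is a minimal sketch of ordinary least squares, the method `LinearRegression` is documented to use. The numbers are made up and follow `price = 2 * amenities + 1` exactly; they are not taken from the Airbnb dataset.

```python
import numpy as np

# Hypothetical toy data that follows price = 2 * amenities + 1 exactly.
amenities = np.array([1.0, 2.0, 3.0, 4.0])
price = np.array([3.0, 5.0, 7.0, 9.0])

# Closed-form ordinary least squares for a single feature.
x_centered = amenities - amenities.mean()
y_centered = price - price.mean()
slope = (x_centered * y_centered).sum() / (x_centered ** 2).sum()
intercept = price.mean() - slope * amenities.mean()

print(slope, intercept)
```

After `model.fit(...)`, the corresponding values of the Scikit-learn model are available as `model.coef_` and `model.intercept_`.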

#### Part 12: The right metric


Let's re-use the [$R^2$ score metric](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html) for now because it is quite intuitive. There are [many other metrics](https://www.qualdo.ai/blog/complete-list-of-performance-metrics-for-monitoring-regression-models/) that may be preferable in other situations. However, exploring them is not the goal of this project 😁.

In [None]:
# Get the R2 metric
... # IMPORT YOUR CODE HERE

<details>
<summary>Show Solution</summary>

```python
from sklearn.metrics import r2_score

```
</details>

#### Part 13: Predict and SCORE!!!

Since we trained our model, let's use it to make predictions with examples of the validation set.

In [None]:
# Predict
y_predict = ... # YOUR CODE HERE

<details>
<summary>Show Solution</summary>

```python
y_predict = model.predict(X_val[["amenities"]])

```
</details>

Now let's use these predictions to calculate our $R^2$ score and round it to four decimal places!

In [None]:
... # YOUR CODE HERE

<details>
<summary>Show Solution</summary>

```python
# r2_score expects the true values first, then the predictions
round(r2_score(y_val, y_predict), 4)

```
</details>
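The order of the arguments matters here, because $R^2$ is not symmetric: Scikit-learn's signature is `r2_score(y_true, y_pred)`. The sketch below implements the formula $R^2 = 1 - SS_{res} / SS_{tot}$ by hand on made-up numbers; the helper `r2` is written for this illustration and is not part of any library.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()         # Unexplained variation
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # Total variation of the true values
    return 1 - ss_res / ss_tot


# Hypothetical true values and predictions.
y_true = [3, 5, 7, 9]
y_pred = [2.5, 5.5, 6.5, 9.5]

correct_order = r2(y_true, y_pred)
swapped_order = r2(y_pred, y_true)
print(round(correct_order, 4), round(swapped_order, 4))
```

With these numbers the correct order gives 0.95, while swapping the arguments gives 0.96, because the total variation is then computed around the mean of the predictions instead of the true values.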

Awesome, the scoring is quite good! Let's try some other features as well!

#### (Optional) Part 14: Scatter Plot and Regression Line

In this section, we are going to use Plotly to plot our `amenities` feature on the x-axis and `price_in_dollar` on the y-axis. Then we will plot the regression line obtained from training our Linear Regression model.

In [None]:
import plotly.express as px
import plotly.graph_objects as go

... #YOUR CODE HERE

<details>
<summary>Show Solution</summary>

```python
import plotly.express as px
import plotly.graph_objects as go
updatedX = np.array(X_val['amenities'])

reference_line = go.Scatter(x=updatedX,
                            y=y_predict,
                            mode="lines",
                            line=go.scatter.Line(color="gray"),
                            showlegend=False)

fig = px.scatter(x=X_val['amenities'], y=y_val['price_in_dollar'])

fig.add_trace(reference_line)

fig.show()
```
</details>

---

#### (Optional) Part 15: Model training, pt. 2

Please use some other features that you think might be interesting for training. What happens if you use an extra feature? What happens if you use a weakly correlated feature? What happens if you use all features?

In [None]:
# Repeat the previous four cells (Parts 11 through 13)
# Choose some other features
# Fit, Predict, Score

... # YOUR CODE HERE

<details>
<summary>Show Solution</summary>

```python

# Fit the model - Pick the two features that you think are best
model.fit(  # Train it ("Learn the material")
    X_train[["accommodates", "room_type_Shared room"]],
    np.squeeze(y_train),
)

# Predict
y_predict = model.predict(X_val[["accommodates", "room_type_Shared room"]])  # Do a "final exam"

# Score
# Compare the expected values vs. the algorithm's "final exam" (true values go first).
round(r2_score(y_val, y_predict), 4)

```
</details>

---

#### Part 16: Model training, pt. 3

Select the three features "amenities", "accommodates", and "instant_bookable" for training and creating predictions for our model.

In [None]:
# Repeat the previous four cells (Parts 11 through 13)
# Choose the three features
# Fit, Predict, Score

... # YOUR CODE HERE

<details>
<summary>Show Solution</summary>

```python
# Fit, Predict, Score
model.fit(  # Train it ("Learn the material")
    X_train[["amenities", "accommodates", "instant_bookable"]],
    np.squeeze(y_train),
)

# Predict
y_predict = model.predict(X_test[["amenities", "accommodates", "instant_bookable"]])  # Do a "final exam"

# Score
# Compare the expected values vs. the algorithm's "final exam" (true values go first).
round(r2_score(y_test, y_predict), 4)


```
</details>

#### Part 17: Example prediction

[*\[Related section on Uplimit\]*](https://uplimit.com/course/python-for-machine-learning/v2/module/first-algo#corise_clcuzv5kf00062a73t4hr7w8i)

We've trained a model using three features. Now let's create an example based on these three features and inspect the price the model would predict!

In [None]:
# Predict a listing price based on "50" amenities, "4" accommodates, instant_bookable=True (1)
example = [[...]] # YOUR CODE HERE

# Round to 2 decimal places
... # YOUR CODE HERE

<details>
<summary>Show Solution</summary>

```python
# 50 amenities, accommodates 4, instant_bookable=True (1), as asked in the task
example = [[50, 4, 1]]

model.predict(example).round(2)

```
</details>
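Under the hood, a linear model's prediction is just the intercept plus a weighted sum of the feature values. Here is a sketch on a made-up dataset where the target is constructed as `10 + 2 * amenities + 3 * accommodates + 5 * instant_bookable`; the data and the names `X_toy`, `y_toy`, and `toy_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: target = 10 + 2*amenities + 3*accommodates + 5*instant_bookable.
X_toy = np.array(
    [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1], [2, 1, 0]], dtype=float
)
y_toy = 10 + X_toy @ np.array([2.0, 3.0, 5.0])

toy_model = LinearRegression().fit(X_toy, y_toy)

# One listing: 20 amenities, accommodates 4, instantly bookable.
example = np.array([[20, 4, 1]], dtype=float)
by_model = toy_model.predict(example)
by_hand = toy_model.intercept_ + example @ toy_model.coef_
print(by_model.round(2), by_hand.round(2))
```

Both routes give the same number, which is also why the order of the values in `example` has to match the order of the feature columns used during training.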

Awesome! Today you've made a machine learning project by using NumPy, Pandas, Plotly, and Scikit-learn to create predictions given the Airbnb dataset! This is a simplified version of a professional workflow, but that is often how companies start — first by doing the bare minimum, and then expanding the complexity of the model. That way you can have something up-and-running quickly that steadily improves with a very quick feedback loop!

Now as an extra we'll explore how to deploy this model as a Streamlit app!

### (Optional) Make an app for your portfolio!

<center>
  <img src=https://griddb-pro.azureedge.net/en/wp-content/uploads/2021/08/streamlit-1160x650.png width="500" align="center" />
</center>
<br/>

**Participants such as yourselves often want to use the weekly Uplimit projects for their portfolios. To facilitate that, we've created this section. It might seem like a lot, but it's actually just following instructions and copy-pasting. Reach out on Slack if you get stuck!**

You will make an app that wraps the model you just created in a neat Streamlit interface, where you can provide input through sliders!

<!-- <center>
  <img src=https://i.ibb.co/N9JKbd8/Screen-Shot-2022-11-10-at-4- width="500" align="center" />
</center>
<br/> -->

To visualize this, we will again use a library called [Streamlit](https://streamlit.io/). For now you are not expected to know how Streamlit works, but you are expected to be able to copy-paste and follow instructions if you want to share this project as part of your portfolio!

We are going to use [Streamlit Share](https://share.streamlit.io/) to host your projects. It's a website that allows us to host our interactive projects for free online! Again, we don't expect you to understand how to use and/or modify the code we will show below. We do expect you to read the instructions and copy-paste our code to the Streamlit Share platform. Feel free to change it any way you like. Some great starting points are [here](https://python.plainenglish.io/how-to-build-web-app-using-streamlit-pandas-numpy-5e134f0cf552), [here](https://docs.streamlit.io/library/get-started/create-an-app), [here](https://streamlit.io/components), and [here](https://streamlit.io/gallery)!

In [None]:
import pickle

# Use a context manager so the file is closed after writing
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

In [None]:
from google.colab import files

# Download the file locally
files.download('model.pkl')

In [None]:
%%writefile streamlit_app.py
import streamlit as st
import sklearn
import pickle

# Load the trained model; the context manager closes the file afterwards
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

st.title("Week 3: The Airbnb dataset of Amsterdam")
st.markdown(
    "The dataset contains modifications with regards to the original for illustrative & learning purposes"
)

amenities = st.slider('How many amenities does the listing have?', 0, 50, 20)
accommodates = st.slider('How many people does the listing accommodate?', 1, 16, 2)
instant_bookable = st.radio(
    "Is the listing instantly bookable?",
    ("True", "False"))
instant_bookable = 1 if instant_bookable == "True" else 0

user_input = [[amenities, accommodates, instant_bookable]]

if st.button('Predict?'):
    st.write("The model predicts that the price for this listing is:", model.predict(user_input).round(2))

The **%%writefile [FILE_NAME].[FILE_EXTENSION]** command lets us save the code written in the cell as a file in your Google Colab instance. Having it saved like that enables us to download it as a file, as seen below:

In [None]:
from google.colab import files

# Download the file locally
files.download('streamlit_app.py')

In [None]:
%%writefile requirements.txt
streamlit
pandas
scikit-learn

In [None]:
from google.colab import files

# Download the file locally
files.download('requirements.txt')

Please verify that you've downloaded three files:
- `model.pkl`
- `streamlit_app.py`
- `requirements.txt`

Now let's head over to GitHub and [create an account](https://github.com/signup).

Then, once you are logged in, [go to GitHub.com](https://github.com), click the **+** icon in the top-right corner, and select **New repository**.

<center>
  <img src=https://i.ibb.co/4gkPBCp/Screen-Shot-2022-11-28-at-1-51-02-PM.png width="300" align="center" />
</center>
<br/>

Here you provide:
- **Repository name**: Up to you
- **License**: Up to you. We recommend **apache-2.0**.

- **Public or private?** Public, otherwise you can't host it on [Streamlit Share](https://share.streamlit.io)!

<center>
  <img src=https://i.ibb.co/0B533dw/Screen-Shot-2022-11-28-at-1-55-14-PM.png width="450" align="center" />
</center>
<br/>

Then upload the three files to this URL below. ***Please modify it before copy-pasting it***:

```https://github.com/[YOUR_ACCOUNT_NAME]/[YOUR_REPOSITORY_NAME]/upload/main```

<center>
  <img src=https://i.ibb.co/jTsrgJw/Screen-Shot-2022-11-28-at-1-58-31-PM.png width="500" align="center" />
</center>
<br/>

Commit directly to the `main` branch, then click **Commit changes**.

Next, you have to create an account on [Streamlit Share](https://share.streamlit.io/signup).

<center>
  <img src=https://i.ibb.co/znFngJc/Screen-Shot-2022-11-28-at-1-59-47-PM.png width="500" align="center" />
</center>
<br/>

It's recommended to click **Continue with GitHub**.

Then, select **New app** **>** **Deploy a new app...** **>** **From existing repo**.

<center>
  <img src=https://i.ibb.co/VQPQzt3/Screen-Shot-2022-11-28-at-2-05-04-PM.png width="500" align="center" />
</center>

Followed by providing your:

```[GITHUB_ACCOUNT_NAME]/[GITHUB_REPOSITORY]```

<center>
  <img src=https://i.ibb.co/PDSQccD/Screen-Shot-2022-11-28-at-2-10-47-PM.png width="500" align="center" />
</center>

You will have to wait around 1 to 5 minutes, after which a link to your new website is generated automatically. The link follows this pattern:

```https://[GITHUB_ACCOUNT_NAME]-[GITHUB_REPOSITORY]-[RANDOM_6_LETTER_STRING].streamlit.app/```

***Please modify the link before copy-pasting it.***

---

# 🎉 CONGRATULATIONS!!!

You've made it to the end of the Week 3 assignment! You should be proud.

If you have any lingering questions, post them on Slack! As you know, we're always here to help.

And if you want any additional challenge questions, check out the bonus extensions below.

---

## (Optional) Extensions

<center>
  <img src=https://upload.wikimedia.org/wikipedia/commons/c/c6/Celebration_fireworks.jpg width="500" align="center" />
</center>
<br/>

🎉🎉 Amazing 🎉🎉 You completed this week's project! Have you thought about extending it? Some ideas:

- Use different [regression algorithms](https://towardsdatascience.com/choosing-a-scikit-learn-linear-regression-algorithm-dd96b48105f5) from the Scikit-learn toolkit.
- Use [different metrics](https://www.qualdo.ai/blog/complete-list-of-performance-metrics-for-monitoring-regression-models/) for the model.
- Train the model on different features and modify the Streamlit app to reflect that.
- ...

The possibilities are endless!

# Next up?
Next week we will ramp up our knowledge of Scikit-learn and introduce some other new algorithms! While this week was focused on supervised machine learning problems, regression in particular, next week is more focused on classification and a few unsupervised algorithms!