# Problem Set 2
## Part I

1\. Suppose that we wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”) based on X, last year’s percent profit. We examine a large number of companies and discover that the mean value of X for companies that issued a dividend was $\bar{X} = 10$, while the mean for those that didn’t was $\bar{X} = 0$. In addition, the variance of X for these two sets of companies was $\hat{\sigma}^2 = 36$. Finally, 80% of companies issued dividends. Assuming that X follows a normal distribution, predict the probability that a company will issue a dividend this year given that its percentage profit was X = 4 last year. Note: You may write your solution using LaTeX within your Jupyter Notebook by enclosing equations with dollar signs, or you may create a set of variables and calculate the probability using Python.


We can apply Linear Discriminant Analysis (LDA) to estimate this probability from what we already know about the distribution of X.
- In this problem, we want to know the probability that the company *will issue* a dividend given that last year's percentage profit was 4.
- We can express this as $Pr(Y=\text{Yes} \mid X=4)$. We know that the prior probability of a company issuing a dividend is $\pi_{\text{Yes}} = 0.8$.
- We also know that X is normally distributed within each class with a shared variance, so the class-conditional density is $f_{k}(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2\sigma^2}(x-\mu_{k})^2}$
- By Bayes' theorem, $Pr(Y=k \mid X=x) = \frac{\pi_{k} f_{k}(x)}{\sum_{l} \pi_{l} f_{l}(x)}$
  
Based on this information, we can do the following calculation.

In [1]:
################################## import packages necessary for the rest of the code
import math
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso
import sklearn.metrics

In [2]:
############################## values for LDA
# Saving the values needed for the LDA calculation into variables

proportion_yes = 0.8
proportion_no = 1 - proportion_yes
variance = 36
std_deviation = math.sqrt(variance)
avg_yes = 10
avg_no = 0
prediction_for = 4

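# I create a function for the density of the normal distribution
def normal_distribution(std_dev, x, mean):
    # normalizing constant: 1 / (sqrt(2*pi) * sigma)
    coefficient = 1 / (math.sqrt(2 * math.pi) * std_dev)
    # exponent: -(x - mu)^2 / (2 * sigma^2)
    exponent = -((x - mean) ** 2) / (2 * std_dev ** 2)
    return coefficient * math.exp(exponent)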



In [3]:
############################## calculating for LDA
estimated_prob = proportion_yes * normal_distribution(std_deviation, prediction_for, avg_yes)/\
    (proportion_yes * normal_distribution(std_deviation, prediction_for, avg_yes)+proportion_no * normal_distribution(std_deviation, prediction_for, avg_no))

print("When the percentage profit is "+ str(prediction_for) + " last year, the probability that a company will issue a dividend this year is "+str(estimated_prob))

When the percentage profit is 4 last year, the probability that a company will issue a dividend this year is 0.7518524532975263
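
As a hand check of the printed value, the same numbers can be substituted into Bayes' theorem directly. This is only a worked substitution of the quantities given in the problem; the factor $\frac{1}{\sqrt{2\pi}\sigma}$ is common to both classes and cancels, so only the exponentials matter.

$$
\Pr(Y=\text{Yes}\mid X=4)
= \frac{0.8\, e^{-\frac{(4-10)^2}{2\cdot 36}}}{0.8\, e^{-\frac{(4-10)^2}{2\cdot 36}} + 0.2\, e^{-\frac{(4-0)^2}{2\cdot 36}}}
= \frac{0.8\, e^{-1/2}}{0.8\, e^{-1/2} + 0.2\, e^{-2/9}}
\approx \frac{0.4852}{0.4852 + 0.1601}
\approx 0.752
$$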


2\. We perform best subset, forward stepwise, and backward stepwise selection on a single dataset. For each approach, we obtain p + 1 models, containing 0, 1, 2, ..., p predictors. Explain your answers:

* a) Which of the three models with k predictors has the smallest training RSS?  
  
    Best subset selection has the smallest training RSS. For each k, it compares all $\binom{p}{k}$ models with k predictors ($2^p$ models in total) and keeps the one with the lowest training RSS, whereas forward and backward stepwise selection search only a restricted path of $1+\frac{p(p+1)}{2}$ models (James et al., 2021). The k-predictor models chosen by the stepwise methods are among the candidates that best subset considers, so their training RSS can match but never beat that of the best subset model.  

* b) Which of the three models with k predictors has the smallest test RSS?  
  
    We cannot know in advance which of the three models will have the smallest test RSS. Best subset chooses the k-predictor model that minimizes the RSS on the training data, but a low training RSS does not guarantee a low test RSS, and because best subset searches so many models it can also overfit the training data. Any of the three approaches could do best, so we need to calculate the RSS on test data for all three models to check which has the smallest test RSS. 

* c) True or False:
    - The predictors of the k-variable model identified by forward stepwise are a subset of the predictors in the (k + 1)-variable model identified by forward stepwise selection.   
      
        **True** At each step, forward stepwise selection adds to the current model the single variable that reduces the RSS the most. Thus, the (k+1)-variable model consists of the k-variable model plus exactly one additional variable.

        Let's assume that there is a model with p = 4, with the variables income, gender, race, and has_disability. Forward stepwise selection begins from the null model, which has no variables, and in each step adds the variable that improves the fit the most.  

        e.g. *The list of predictors for each k value*
                (begins with null) (k=0)  
                income (k=1)   
                income, race (k=2)  
                income, race, gender (k=3)  
                income, race, gender, has_disability (k=4)  

        Here, the predictors of the (k+1)-variable model always include all the predictors of the k-variable model. 
      
    - The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)-variable model identified by backward stepwise selection.   
      
        **True** Backward stepwise selection begins with the full model containing all p variables and, at each step, removes the single variable whose removal yields the smallest RSS among the resulting models. I will use the same example to demonstrate backward stepwise selection.  
        
        e.g. *The list of predictors for each k value*  
                income, race, gender, has_disability (k = p = 4)   
                race, gender, has_disability (k = 3)  
                race, gender (k=2)  
                race (k=1)  
  
        In this case, since the k-variable model is obtained by removing one variable from the (k+1)-variable model, all variables in the k-variable model are included in the (k+1)-variable model.    
          
    - The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)-variable model identified by forward stepwise selection.  
    
        **False** We cannot assume that the predictors of the k-variable model identified by backward stepwise are contained in the (k+1)-variable model identified by forward stepwise selection. Let's use the example from problem c-2. Backward stepwise eliminated income at the first step (k = 3), so it never considers models such as "income, race", "income, gender", or "income, has_disability" afterwards. Forward stepwise, however, selected income at k = 1, so income is part of every forward model. The two paths are built independently, and nothing forces a variable kept by backward stepwise to have already been added by forward stepwise. For instance, had backward stepwise ended with "gender" as its one-variable model, it would not be a subset of the forward two-variable model "income, race". 

    - The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k + 1)-variable model identified by backward stepwise selection.

        **False** The forward and backward paths are built independently, so the k-variable model identified by forward stepwise need not be contained in the (k+1)-variable model identified by backward stepwise. Forward stepwise keeps the first predictor it selects in every later model; in the example from problem c-1, income is selected at k = 1. Backward stepwise, on the other hand, can never bring back a variable it has already dropped; in the example from problem c-2, income is dropped first. The forward one-variable model "income" is therefore not a subset of the backward two-variable model "race, gender".

    - The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k+1)-variable model identified by best subset selection.

        **False** The k-variable model and the (k+1)-variable model chosen by best subset selection are not necessarily nested. To choose the k-variable model, best subset compares all possible k-variable models that can be formed from the p variables. Similarly, it compares all possible (k+1)-variable models to choose the (k+1)-variable model. These two searches are independent of each other, so there is no guarantee that one model will include all predictors of the other.    
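
The answers to a) and c) can be checked numerically. The sketch below is only an illustration on synthetic data: the data and the helper names `rss`, `best_subset`, and `forward_stepwise` are assumptions made for this example and are not part of the problem set. It fits ordinary least squares with an intercept, runs an exhaustive best subset search and a greedy forward search, and compares the training RSS at each model size.

```python
from itertools import combinations

import numpy as np


def rss(X, y, cols):
    """Training RSS of a least-squares fit with an intercept on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def best_subset(X, y):
    """For each k = 0..p, the k columns with the lowest training RSS."""
    p = X.shape[1]
    return [min(combinations(range(p), k), key=lambda c: rss(X, y, c))
            for k in range(p + 1)]


def forward_stepwise(X, y):
    """Greedy path: at each step add the column that lowers the RSS the most."""
    p = X.shape[1]
    chosen, path = [], [()]
    for _ in range(p):
        remaining = [j for j in range(p) if j not in chosen]
        best = min(remaining, key=lambda j: rss(X, y, chosen + [j]))
        chosen.append(best)
        path.append(tuple(chosen))
    return path


# Synthetic data (assumed for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, 0.0, -2.0, 0.5]) + rng.normal(size=200)

best = best_subset(X, y)
forward = forward_stepwise(X, y)
for k in range(len(best)):
    print(k, "best subset:", best[k], round(rss(X, y, best[k]), 2),
          "| forward:", forward[k], round(rss(X, y, forward[k]), 2))
```

At every k the best subset RSS is less than or equal to the forward stepwise RSS, and each forward model contains the previous one. With p = 4 the exhaustive search fits $2^4 = 16$ models, while a stepwise path visits $1 + \frac{4 \cdot 5}{2} = 11$.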

## Part II
For this part of the assignment, you will analyze a dataset of New York City AirBnB listings scraped in March 2020. Your task will be to predict the price of an AirBnB listing. You can download the nyc_airbnb_listings.csv file from Canvas.

1\. Examine the dataset and discuss the following:

In [4]:
####################### Read the data
df = pd.read_csv("nyc_airbnb_listings.csv")
df.head()

Unnamed: 0,listing_id,host_id,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,neighbourhood_group,...,number_of_reviews_ltm,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,reviews_per_month
0,2060,2259,0.22,0.5,0.0,0.0,0.0,1.0,0.0,Manhattan,...,0,80.0,,,,,,,0,0.01
1,2595,2845,0.87,0.38,0.0,6.0,6.0,1.0,1.0,Manhattan,...,5,94.0,9.0,9.0,10.0,10.0,10.0,9.0,0,0.38
2,3831,4869,0.83,0.96,0.0,1.0,1.0,1.0,1.0,Brooklyn,...,69,90.0,9.0,9.0,10.0,10.0,10.0,8.0,0,4.71
3,5099,7322,,0.71,0.0,1.0,1.0,1.0,0.0,Manhattan,...,8,90.0,10.0,9.0,10.0,10.0,10.0,9.0,0,0.59
4,5114,7345,0.5,,0.0,3.0,3.0,1.0,0.0,Manhattan,...,0,94.0,10.0,10.0,10.0,10.0,10.0,10.0,0,0.56


a) Attribute types

In [5]:
print("These are the attributes of the dataset: \n"+', '.join(df.columns.values))

These are the attributes of the dataset: 
listing_id, host_id, host_response_rate, host_acceptance_rate, host_is_superhost, host_listings_count, host_total_listings_count, host_has_profile_pic, host_identity_verified, neighbourhood_group, room_type, accommodates, bathrooms, bedrooms, beds, price, security_deposit, cleaning_fee, guests_included, extra_people, has_availability, availability_30, availability_60, availability_90, availability_365, number_of_reviews, number_of_reviews_ltm, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, instant_bookable, reviews_per_month


In [6]:
df.describe()

Unnamed: 0,listing_id,host_id,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,accommodates,...,number_of_reviews_ltm,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,reviews_per_month
count,50796.0,50796.0,31790.0,36781.0,50791.0,50791.0,50791.0,50791.0,50791.0,50796.0,...,50796.0,39365.0,39330.0,39344.0,39317.0,39333.0,39314.0,39314.0,50796.0,40343.0
mean,22623080.0,84993400.0,0.929213,0.824063,0.195684,23.64478,23.64478,0.997165,0.442893,2.862095,...,9.159737,93.903137,9.613552,9.284211,9.734848,9.740066,9.599812,9.386351,0.377471,1.282087
std,13173950.0,97570760.0,0.179629,0.253269,0.39673,165.173607,165.173607,0.053171,0.496733,1.890896,...,16.542545,8.849797,0.859817,1.084723,0.751754,0.771954,0.750225,0.939184,0.484759,1.625176
min,2060.0,2259.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,20.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,0.01
25%,10916870.0,9531972.0,0.97,0.74,0.0,1.0,1.0,1.0,0.0,2.0,...,0.0,92.0,9.0,9.0,10.0,10.0,9.0,9.0,0.0,0.18
50%,22227310.0,38625530.0,1.0,0.95,0.0,1.0,1.0,1.0,0.0,2.0,...,1.0,96.0,10.0,10.0,10.0,10.0,10.0,10.0,0.0,0.64
75%,34907510.0,138515600.0,1.0,1.0,0.0,2.0,2.0,1.0,1.0,4.0,...,11.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,1.0,1.9
max,42892720.0,341439900.0,1.0,1.0,1.0,2345.0,2345.0,1.0,1.0,22.0,...,730.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,1.0,66.36


This dataset has various data objects with diverse attribute types. I classified the attribute types in this dataset based on the Inside Airbnb data dictionary and the output of the pandas describe function.


  
**ID attributes**
  
| attribute  | discrete/continuous/binary  | qualitative/quantitative | nominal/ordinal/interval/ratio | additional description |
|------------|------------------------------|---------------------------|--------------------------------|-------------------------|
| listing_id | discrete                     | categorical               | nominal                        |                         |
| host_id    | discrete                     | categorical               | nominal                        |                         |



**Attributes related to Host**

| attribute               | discrete/continuous/binary  | qualitative/quantitative | nominal/ordinal/interval/ratio | additional description                |
|-------------------------|-----------------------------|---------------------------|----------------------------------|---------------------------------------|
| host_response_rate      | continuous                  | numeric                   | ratio                            | range: 0-1.0                          |
| host_acceptance_rate    | continuous                  | numeric                   | ratio                            | range: 0-1.0                          |
| host_is_superhost       | binary                      | qualitative               | nominal                          | asymmetric, denoted with 1,0           |
| host_listings_count     | discrete                    | numeric                   | ratio                            |                                       |
| host_total_listings_count | discrete                  | numeric                   | ratio                            |                                       |
| host_has_profile_pic    | binary                      | qualitative               | nominal                          | denoted with 1,0                       |
| host_identity_verified  | binary                      | qualitative               | nominal                          | asymmetric, denoted with 1,0           |




**Attributes related to House**

| Attribute               | Discrete/Continuous/Binary  | Qualitative/Quantitative | Nominal/Ordinal/Interval/Ratio | Additional Description                |
|-------------------------|------------------------------|---------------------------|--------------------------------|---------------------------------------|
| neighbourhood_group     | Discrete                     | Categorical               | Nominal                        | 'Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx' |
| room_type                | Discrete                     | Categorical               | Nominal                        | 'Private room', 'Entire home/apt', 'Shared room', 'Hotel room' |
| accommodates             | Discrete                     | Numeric                   | Ratio                          |                                       |
| bathrooms                | Discrete                     | Numeric                   | Ratio                          |                                       |
| bedrooms                 | Discrete                     | Numeric                   | Ratio                          |                                       |
| beds                     | Discrete                     | Numeric                   | Ratio                          |                                       |


**Attributes related to Price**
  
| Attribute         | Discrete/Continuous/Binary | Qualitative/Quantitative | Nominal/Ordinal/Interval/Ratio | Additional Description |
|--------------------|-----------------------|-----------------------|---------------------------|----------------------------|
| price               | continuous           | numeric               | ratio                     |                            |
| security_deposit    | continuous           | numeric               | ratio                     |                            |
| cleaning_fee        | continuous           | numeric               | ratio                     |                            |
| guests_included     | discrete             | numeric               | ratio                     |                            |
| extra_people        | continuous           | numeric               | ratio                     |                            |


**Attributes related to Availability**
  
| Attribute              | Discrete/Continuous/Binary  | Qualitative/Quantitative | Nominal/Ordinal/Interval/Ratio | Additional Description                |
|------------------------|------------------------------|---------------------------|-------------------------------------|----------------------------------------|
| has_availability | binary | qualitative | nominal | denoted with 1,0 |
| availability_30 | discrete | quantitative | ratio | |
| availability_60 | discrete | quantitative | ratio | |
| availability_90 | discrete | quantitative | ratio | |
| availability_365 | discrete | quantitative | ratio | |
| instant_bookable | binary | qualitative | nominal | denoted with 1,0 |


**Attributes related to Reviews**
  
| Attribute               | Discrete/Continuous/Binary  | Qualitative/Quantitative | Nominal/Ordinal/Interval/Ratio | Additional Description                |
|------------------------|------------------------------|----------------------------|---------------------------------|----------------------------------------|
| number_of_reviews       | discrete                     | quantitative               | ratio                           |                                        |
| number_of_reviews_ltm   | discrete                     | quantitative               | ratio                           |                                        |
| review_scores_rating    | continuous                   | quantitative               | ratio                           | range: 20-100                          |
| review_scores_accuracy  | continuous                   | quantitative               | ratio                           | range: 2.0 - 10.0                      |
| review_scores_cleanliness| continuous                   | quantitative               | ratio                           | range: 2.0 - 10.0                      |
| review_scores_checkin   | continuous                   | quantitative               | ratio                           | range: 2.0 - 10.0                      |
| review_scores_communication | continuous                | quantitative               | ratio                           | range: 2.0 - 10.0                      |
| review_scores_location  | continuous                   | quantitative               | ratio                           | range: 2.0 - 10.0                      |
| review_scores_value     | continuous                   | quantitative               | ratio                           | range: 2.0 - 10.0                      |
| reviews_per_month       | continuous                   | quantitative               | ratio                           |                                        |




b) Resolution and possible alternative units of analysis 
  
Each row of data represents a single listing. We know this because every row has a unique listing id (listing_id). The unit of analysis can be the listing, as given in the data, or we can aggregate the rows by host (using host_id) or by region (neighbourhood_group) to lower the resolution. However, for the price prediction, I think the listing level is the appropriate unit of analysis, because the task asks for the price of an "AirBnB listing" and not of any other unit.
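
Lowering the resolution as described above amounts to a group-by. The following is a minimal sketch on a hypothetical toy frame; only the column names mirror the dataset, and the values and the names `listings`, `by_host`, and `by_borough` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy listings; the values are invented for this example
listings = pd.DataFrame({
    "listing_id": [1, 2, 3, 4],
    "host_id": [10, 10, 20, 30],
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn"],
    "price": [100.0, 200.0, 80.0, 120.0],
})

# One row per host: number of listings and mean price
by_host = listings.groupby("host_id").agg(
    n_listings=("listing_id", "count"),
    mean_price=("price", "mean"),
)

# One value per borough: mean price
by_borough = listings.groupby("neighbourhood_group")["price"].mean()

print(by_host)
print(by_borough)
```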

c) Dimensionality  

In [7]:
# c) dimensionality
print("This dataset has "+str(df.shape[0])+" rows and "+str(df.shape[1])+" columns")

This dataset has 50796 rows and 36 columns


d) Missing values

In [8]:
# d) Missing values
i = 0 
for columns in df.columns.values:
    if df[columns].isna().sum() > 0:
        print(columns+" has "+str(df[columns].isna().sum())+" missing values")
        i+=1

print("\n"+str(i)+" out of "+str(df.shape[1])+" variables have missing values")

host_response_rate has 19006 missing values
host_acceptance_rate has 14015 missing values
host_is_superhost has 5 missing values
host_listings_count has 5 missing values
host_total_listings_count has 5 missing values
host_has_profile_pic has 5 missing values
host_identity_verified has 5 missing values
bathrooms has 54 missing values
bedrooms has 77 missing values
beds has 482 missing values
security_deposit has 17325 missing values
cleaning_fee has 10528 missing values
review_scores_rating has 11431 missing values
review_scores_accuracy has 11466 missing values
review_scores_cleanliness has 11452 missing values
review_scores_checkin has 11479 missing values
review_scores_communication has 11463 missing values
review_scores_location has 11482 missing values
review_scores_value has 11482 missing values
reviews_per_month has 10453 missing values

20 out of 36 variables have missing values
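
The same count can be obtained without an explicit loop. This is a small sketch on a hypothetical toy frame (the name `toy` and its values are assumptions), using `isna().sum()` for counts and `isna().mean()` for the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing values
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [1.0, np.nan, 3.0],
    "c": [np.nan, np.nan, 1.0],
})

# Number of missing values per column, keeping only columns that have any
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]

# Share of missing values per column
missing_share = toy.isna().mean()

print(missing_counts)
print(f"{len(missing_counts)} out of {toy.shape[1]} variables have missing values")
```

Looking at the share rather than the count makes it easier to judge whether dropping rows or dropping a column loses less data.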


e) Potential multicollinearity

When we include all attributes in the model, there is a potential for multicollinearity between the attributes. For instance, several variables measure the same quantity, such as the number of reviews or the availability, over different time periods. If the number of reviews and the availability do not change much from month to month or year to year, these variables will likely be collinear. For example, if availability is stable throughout the year, the availability over the next 60 days (availability_60) will be about twice the availability over the next 30 days (availability_30). In addition, host_listings_count and host_total_listings_count have identical summary statistics in the describe output above, which suggests that the two columns carry the same information.
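
One way to check this suspicion is to list the pairs of numeric columns whose absolute correlation exceeds a threshold. The sketch below uses synthetic data; the helper name `highly_correlated_pairs`, the 0.9 threshold, and the generated values are assumptions for illustration, and only the two availability column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: availability_60 is built to be roughly twice availability_30
rng = np.random.default_rng(0)
a30 = rng.integers(0, 31, size=500)
toy = pd.DataFrame({
    "availability_30": a30,
    "availability_60": 2 * a30 + rng.integers(0, 5, size=500),
    "unrelated": rng.normal(size=500),
})


def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, |corr|) for column pairs above the threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    return [(a, b, float(corr.loc[a, b]))
            for i, a in enumerate(cols)
            for b in cols[i + 1:]
            if corr.loc[a, b] > threshold]


pairs = highly_correlated_pairs(toy)
print(pairs)
```

Pairwise correlation only detects collinearity between two columns at a time; a column that is a combination of several others would need a different diagnostic.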

2\. Preprocess the dataset to eliminate missing values and convert categorical variables to dummy variables. Create a training and test set with 30% of the data held out for testing.

In [9]:
## eliminate missing values 
# I drop every row that contains at least one missing value
df_dropna = df.dropna()

In [10]:
# check the dataset without missing values
df_dropna.head()

Unnamed: 0,listing_id,host_id,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,neighbourhood_group,...,number_of_reviews_ltm,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,reviews_per_month
1,2595,2845,0.87,0.38,0.0,6.0,6.0,1.0,1.0,Manhattan,...,5,94.0,9.0,9.0,10.0,10.0,10.0,9.0,0,0.38
8,5238,7549,1.0,0.26,1.0,4.0,4.0,1.0,1.0,Manhattan,...,5,94.0,10.0,9.0,10.0,10.0,9.0,9.0,0,1.26
9,5441,7989,1.0,0.56,1.0,1.0,1.0,1.0,1.0,Manhattan,...,39,97.0,10.0,10.0,10.0,10.0,10.0,10.0,0,1.59
10,5552,8380,1.0,0.2,0.0,1.0,1.0,1.0,1.0,Manhattan,...,1,97.0,10.0,9.0,10.0,10.0,10.0,10.0,0,0.51
11,5803,9744,1.0,0.99,1.0,3.0,3.0,1.0,1.0,Brooklyn,...,17,94.0,10.0,10.0,10.0,10.0,10.0,10.0,0,1.35


In [11]:
## convert categorical variables to dummy variables
df_dummy = pd.get_dummies(df_dropna[["neighbourhood_group", "room_type"]], drop_first=True)

## concatenate the dataframe with dummy variables (df_dummy) with the dataset without missing values
df_clean = pd.concat([df_dropna, df_dummy], axis = 1)
df_clean = df_clean.copy().drop(["neighbourhood_group", "room_type"], axis=1)
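
To see what `drop_first=True` does, here is a minimal sketch on a hypothetical toy column (the frame `toy` and its rows are made up; the category labels are taken from the room_type values listed earlier). The alphabetically first category becomes the baseline and gets no column of its own.

```python
import pandas as pd

# Hypothetical toy column with three of the room_type categories
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Private room"],
})

# "Entire home/apt" sorts first, so it is dropped and acts as the reference level
dummies = pd.get_dummies(toy[["room_type"]], drop_first=True)
print(dummies)
```

Dropping one level per categorical variable avoids the perfect collinearity that would otherwise exist between the full set of dummies and the intercept.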


In [12]:
## Create a training and test set with 30% of the data held out for testing.

# Before splitting into training and test sets, I divide the dataset into the outcome and the features
#   Outcome (y)
y = df_clean['price']

#   Features(x)
X = df_clean.drop('price', axis=1)

# I standardize the features for scaling using StandardScaler
scaler = StandardScaler()
Xsc = pd.DataFrame(scaler.fit_transform(X), columns = X.columns)

# dividing the y and X to train and test set
Xtrain, Xtest, ytrain, ytest = train_test_split(Xsc, y, test_size=0.3, random_state= 500)
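
The cell above standardizes all rows before splitting, so the scaler's means and standard deviations also use the held-out rows. An alternative ordering, shown here only as a sketch on synthetic data and not what produced the results in this notebook, is to split first and fit the scaler on the training rows alone. The names `X_toy`, `y_toy`, and `scaler_toy` are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and outcome (assumed for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50, scale=10, size=(100, 3))
y_toy = rng.normal(size=100)

# Split first, holding out 30% for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=500
)

# Fit the scaler on the training rows only, then apply it to both sets
scaler_toy = StandardScaler().fit(X_tr)
X_tr_sc = scaler_toy.transform(X_tr)
X_te_sc = scaler_toy.transform(X_te)

print(X_tr_sc.mean(axis=0), X_te_sc.mean(axis=0))
```

With this ordering the training columns have mean 0 and standard deviation 1 exactly, while the test columns are only approximately standardized, which is the intended behavior.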

3\. Define a function called r2_scores() that accepts three arguments, ytrue and ypredicted, and the number of predictors p, and returns a tuple with the R2 and Adjusted R2 scores.

In [23]:
def r2_scores(ytrue, ypredicted, p):
    """Return (R2, adjusted R2) for predictions from a model with p predictors."""
    # ytrue and ypredicted must contain the same number of observations
    if len(ytrue) != len(ypredicted):
        raise ValueError("ytrue and ypredicted must have the same length")

    # n = number of observations
    n = len(ytrue)

    # calculate R2 score
    r2 = sklearn.metrics.r2_score(ytrue, ypredicted)

    # calculate Adjusted R2 score
    # The equation for Adjusted R2 comes from the Wikipedia page for Coefficient of determination
    adjusted_r2 = 1 - ((1 - r2) * (n - 1) / (n - p - 1))

    return r2, adjusted_r2
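
As a cross-check of the formulas, the sketch below recomputes both scores with NumPy only, on a small hypothetical example that can be verified by hand. The function name `r2_scores_np` and the five data points are assumptions for illustration; it is meant to mirror, not replace, the function above.

```python
import numpy as np


def r2_scores_np(ytrue, ypredicted, p):
    """NumPy-only version: R2 = 1 - SS_res / SS_tot, then the adjustment for p."""
    ytrue = np.asarray(ytrue, dtype=float)
    ypredicted = np.asarray(ypredicted, dtype=float)
    n = len(ytrue)
    ss_res = np.sum((ytrue - ypredicted) ** 2)    # residual sum of squares
    ss_tot = np.sum((ytrue - ytrue.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return float(r2), float(adjusted_r2)


# Hand-checkable example: SS_res = 0.10 and SS_tot = 10, so R2 = 0.99
r2, adj = r2_scores_np([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], p=1)
print(r2, adj)
```

With n = 5 and p = 1 the adjusted score is $1 - 0.01 \cdot \frac{4}{3} \approx 0.9867$, slightly below the unadjusted 0.99, as expected.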

4\. Fit a linear regression model with all parameters included. Repeat this process using a LASSO regression model using the default alpha of 0.5, and again with an alpha value of 5.0. Discuss the impact of increasing alpha on the number of coefficients that shrink to zero.

In [24]:
## Fitting Linear Regression using the training data
lm = LinearRegression().fit(Xtrain, ytrain)

In [25]:
## Fitting a LASSO regression with the alpha value of 0.5
lasso05 =  Lasso(alpha=0.5).fit(Xtrain, ytrain)

In [26]:
## Fitting a LASSO regression with the alpha value of 5
lasso50 =  Lasso(alpha = 5.0).fit(Xtrain, ytrain)

In [27]:
## saving the coefficients of the linear regression for later comparison
coef_value = pd.DataFrame(columns= ["variable","linear", "lasso_05", "lasso_5"])
coef_value["variable"] = Xsc.columns
coef_value["linear"] = lm.coef_

## saving the coefficients of the LASSO regression with the alpha value of 0.5 for later comparison
coef_value["lasso_05"] = lasso05.coef_

## saving the coefficients of the LASSO regression with the alpha value of 5.0 for later comparison
coef_value["lasso_5"]= lasso50.coef_

## check which variables shrink to 0 by increasing the alpha from 0.5 to 5
coef_value["shrink_to_0"] = (coef_value["lasso_05"] != 0) & (coef_value["lasso_5"] == 0)

In [28]:
# observe variables that shrink to 0 by increasing the alpha from 0.5 to 5
coef_value[coef_value.shrink_to_0]

Unnamed: 0,variable,linear,lasso_05,lasso_5,shrink_to_0
3,host_acceptance_rate,4.160558,3.436458,0.0,True
4,host_is_superhost,1.935912,1.03411,-0.0,True
5,host_listings_count,-12396950000000.0,-3.437649,-0.0,True
6,host_total_listings_count,12396950000000.0,-4.489503e-16,-0.0,True
7,host_has_profile_pic,1.531939,0.9717051,0.0,True
8,host_identity_verified,2.975047,2.091468,-0.0,True
12,beds,-14.70934,-12.12599,-0.0,True
15,guests_included,0.4577866,0.1822846,0.0,True
16,extra_people,3.577225,3.194576,0.0,True
24,review_scores_rating,-7.116412,-5.678743,-0.0,True


In [29]:
# print the number of variables that shrink to 0 by increasing the alpha from 0.5 to 5
n_shrunk = coef_value["shrink_to_0"].sum()
print(f"When alpha increases from 0.5 to 5, {n_shrunk} additional variables shrink to 0")

When alpha increases from 0.5 to 5, 14 additional variables shrink to 0


As the output above shows, 14 additional variables shrink to 0 when the alpha value increases from 0.5 to 5.0. This decrease in the number of variables is expected: alpha weights the L1 penalty on the size of the coefficients, so a larger alpha shrinks the coefficients more strongly and sets more of them exactly to 0. Hence, some variables that survive when alpha is 0.5 are reduced to 0 when alpha increases to 5.
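
The mechanism is easiest to see in an idealized case. Assuming standardized and mutually uncorrelated predictors (which does not hold exactly for this dataset) and the scaling of the penalty used by scikit-learn's `Lasso`, the LASSO coefficients are the least-squares coefficients pulled toward zero by alpha and clipped at zero. The sketch below applies this soft-thresholding rule to hypothetical coefficient values; the function name `soft_threshold` and the numbers are assumptions for illustration.

```python
import numpy as np


def soft_threshold(ols_coefs, alpha):
    """Shrink each coefficient toward 0 by alpha; set it to 0 if |coef| <= alpha."""
    ols_coefs = np.asarray(ols_coefs, dtype=float)
    return np.sign(ols_coefs) * np.maximum(np.abs(ols_coefs) - alpha, 0.0)


# Hypothetical least-squares coefficients (not the fitted values from this notebook)
ols = np.array([49.7, -12.9, 4.2, 1.9, 0.5])

for alpha in (0.5, 5.0):
    shrunk = soft_threshold(ols, alpha)
    print(alpha, shrunk, "zeros:", int(np.sum(shrunk == 0)))
```

Coefficients whose magnitude is below alpha are removed entirely, which is why raising alpha from 0.5 to 5.0 eliminates the variables with small or moderate coefficients while the largest ones survive in shrunken form.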

5\. Using the r2_scores() function you defined, compute and discuss the scores for the linear regression and LASSO models.

In [30]:
## For comparison, I will use the test data to compute the predicted outcome values of each model
# I will store the predicted y values in the dataframe y_predicts
# for linear regression
y_predicts = pd.DataFrame(columns=["linear", "lasso_05", "lasso_5"])
y_predicts["linear"]=lm.predict(Xtest)

# for LASSO regression with the alpha value of 0.5
y_predicts["lasso_05"]=lasso05.predict(Xtest)

# for LASSO regression with the alpha value of 5
y_predicts["lasso_5"]=lasso50.predict(Xtest)


In [31]:
for model in y_predicts.columns:
    number_of_variable = int((coef_value[model] != 0).sum())
    print("For model "+ model+ " (R2, Adjusted R2): ")
    print(r2_scores(ytest, y_predicts[model], number_of_variable))

For model linear (R2, Adjusted R2): 
(0.17659601982389617, 0.17110116510133289)
For model lasso_05 (R2, Adjusted R2): 
(0.1781219652373136, 0.17291326521713657)
For model lasso_5 (R2, Adjusted R2): 
(0.1796130719090019, 0.17633698434258194)


On the test set, the linear model has the lowest R2 and adjusted R2, and the LASSO model with an alpha of 5.0 has the highest. The differences are small (R2 ranges from about 0.177 to 0.180), and all three models explain only about 18% of the variance in price. The ordering is nevertheless consistent with how the models are fitted. The linear model only minimizes the residual sum of squares on the training data, so it has the highest R2 on the training set, but that guarantee does not carry over to unseen data. LASSO adds a penalty, weighted by alpha, that shrinks the coefficients and trades a slightly worse training fit for lower variance; the linear model can be seen as the special case with alpha = 0. Because the scores here are computed on held-out test data, it makes sense that the regularized models do slightly better. The gain is somewhat larger in adjusted R2, because the LASSO models retain fewer nonzero predictors and are therefore penalized less for model size.
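
The distinction between training and test scores can be illustrated with a small NumPy sketch. Everything in it is an assumption for illustration: the data are synthetic, and halving the least-squares coefficients is only a crude stand-in for a penalized fit, not LASSO. The point it demonstrates is that least squares always has the highest R2 on the data it was fitted to, while nothing fixes the ordering on held-out data.

```python
import numpy as np

# Synthetic data with many predictors relative to the number of rows
rng = np.random.default_rng(1)
n, p = 60, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]
y = X @ beta + rng.normal(size=n)
X_tr, X_te, y_tr, y_te = X[:40], X[40:], y[:40], y[40:]


def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    return 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)


# Least-squares fit on the training rows
ols, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
# Crude stand-in for a shrunken fit; this is NOT a LASSO solution
shrunk = 0.5 * ols

for name, coef in [("ols", ols), ("shrunk", shrunk)]:
    print(name, "train R2:", r2(y_tr, X_tr @ coef), "test R2:", r2(y_te, X_te @ coef))
```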

6\. Create and display a dataframe with the variable names, linear regression coefficients, and LASSO (alpha=5.0) coefficients as columns. Provide an interpretation of the results.

In [32]:
# create a dataframe with the variable names, linear regression coefficient, and LASSO (alpha=5.0) coefficients
coef_value_df = pd.DataFrame(columns= ["variable", "linear", "lasso_5"])

# saving the variables names 
coef_value_df["variable"] = Xsc.columns

# saving the coefficients of the linear regression 
coef_value_df["linear"] = lm.coef_

# saving the coefficients of the LASSO regression with the alpha value of 5.0 for later comparison
coef_value_df["lasso_5"]= lasso50.coef_

# display dataframe
coef_value_df

Unnamed: 0,variable,linear,lasso_5
0,listing_id,2.981283,0.419338
1,host_id,9.934262,8.492892
2,host_response_rate,-12.95939,-6.958285
3,host_acceptance_rate,4.160558,0.0
4,host_is_superhost,1.935912,-0.0
5,host_listings_count,-12396950000000.0,-0.0
6,host_total_listings_count,12396950000000.0,-0.0
7,host_has_profile_pic,1.531939,0.0
8,host_identity_verified,2.975047,-0.0
9,accommodates,49.7041,45.223051


Most attributes related to the house remain after applying the LASSO regression. Attributes such as accommodates ("the maximum capacity of the listing"; Inside Airbnb, 2022), bathrooms, and bedrooms still have sizeable coefficients in the LASSO regression. The review attributes show a similar trend: although the coefficients have generally moved closer to 0, the LASSO regression keeps all review attributes except review_scores_rating in the equation. From this tendency, I infer that the attributes of the house are likely major contributors to price.
  
There are some attribute groups that only partially survive the LASSO regression. For instance, being located in Manhattan (neighbourhood_group_Manhattan) keeps a nonzero coefficient, but the coefficients for the Bronx, Brooklyn, and Staten Island (neighbourhood_group_Bronx, neighbourhood_group_Brooklyn, neighbourhood_group_Staten Island) shrink to 0, so at this penalty level the model does not distinguish these boroughs from the baseline. Similarly, the security deposit (security_deposit) and cleaning fee (cleaning_fee) are the only price-related attributes that keep nonzero coefficients in the LASSO regression. From these results, I note that only a specific location and part of the pricing structure appear relevant to estimating the price.

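Finally, most of the variables related to the host have shrunk to 0 in the LASSO regression; in the table above, only the host ID (host_id) and host_response_rate keep nonzero coefficients. I would not consider the host ID a genuine contributor to the price, because it is just an identifier assigned by Airbnb. The linear coefficients of host_listings_count and host_total_listings_count (about -1.24e13 and +1.24e13) are also not meaningful: the two columns have identical summary statistics in the describe output, so they are almost certainly perfectly collinear, and the huge offsetting values are a symptom of that. It is therefore understandable that the LASSO regression removes these variables before the others when it regularizes the regression. Overall, I would consider most host-related attributes not to be critical factors in determining the price. 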
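
The pair of enormous, opposite-signed linear coefficients for host_listings_count and host_total_listings_count is what one would expect if the two columns are copies of each other. A check along the following lines could confirm it before fitting; this is a sketch on a hypothetical toy frame, and the helper name `duplicate_column_pairs` and the values are assumptions.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical toy frame in which two columns hold identical values
toy = pd.DataFrame({
    "host_listings_count": [1.0, 6.0, 1.0, 3.0],
    "host_total_listings_count": [1.0, 6.0, 1.0, 3.0],
    "accommodates": [2.0, 4.0, 2.0, 1.0],
})


def duplicate_column_pairs(frame):
    """Pairs of columns with identical values (rows with NaN never compare equal)."""
    return [(a, b) for a, b in combinations(frame.columns, 2)
            if np.array_equal(frame[a].to_numpy(), frame[b].to_numpy())]


dupes = duplicate_column_pairs(toy)
# Keep the first column of each duplicated pair and drop the second
deduplicated = toy.drop(columns=[b for _, b in dupes])
print(dupes)
print(list(deduplicated.columns))
```

Dropping one column of such a pair leaves the predictions of the linear model unchanged in principle while making its coefficients interpretable again.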



## References

- Inside Airbnb. (2022, August). Inside airbnb data dictionary. Inside Airbnb: Explore the Data. Retrieved February 24, 2023, from https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit?usp=sharing
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning : with applications in R (Second edition.). Springer.
- Tan, P.-N., Steinbach, M., Karpatne, A., & Kumar, V. (2019). Introduction to data mining (Second edition.). Pearson Education, Inc.
- Wikimedia Foundation. (2023, February 11). Coefficient of determination. Wikipedia. Retrieved February 24, 2023, from https://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2