
Q0 - 1
A model is "linear" if it is linear in its parameters. This means that the predicted outcome, y, is a weighted sum of the features, where each weight is a coefficient (parameter) and each feature may itself be a nonlinear transformation of x, such as x^2 or log(x). Linear models do not imply that the relationship between y and x must visually appear linear, only that the model is linear with respect to its parameters.

Q0 - 2
The coefficient for a dummy variable represents the difference in the mean outcome between the category represented by the dummy variable and the baseline (reference) category. This interpretation depends on the intercept term. If there is an intercept, the coefficient for the dummy variable shows how much the predicted outcome changes when switching from the baseline category to the category indicated by the dummy variable.

Q0 - 3
Linear regression can technically be applied to classification, particularly binary classification, but it is not typically appropriate. Linear regression predictions for binary outcomes can exceed the range [0, 1], which makes probability interpretation problematic. Logistic regression or other classification methods that handle probabilities bounded within [0, 1] are generally preferred for classification.
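The range problem can be seen in a minimal sketch. The data below are a made-up toy example, and the logistic coefficients are hand-picked for illustration rather than fitted:

```python
import numpy as np

# Toy binary outcome: y switches from 0 to 1 as x grows (illustrative data only).
x = np.arange(10, dtype=float)
y = (x >= 5).astype(float)

# Ordinary least squares with an intercept: design matrix columns are [1, x].
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict at the edges of the observed range and slightly beyond it.
x_new = np.array([-2.0, 0.0, 9.0, 12.0])
p_linear = beta[0] + beta[1] * x_new

# A logistic curve always stays strictly between 0 and 1.
# (Slope 1 and midpoint 4.5 are assumptions chosen for illustration.)
p_logistic = 1.0 / (1.0 + np.exp(-(x_new - 4.5)))

print("linear:  ", p_linear)
print("logistic:", p_logistic)
```

The linear predictions fall below 0 at the low end and above 1 at the high end, so they cannot be read as probabilities, while the logistic values always can.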

Q0 - 4
Signs of overfitting include:

Very high accuracy on training data but poor performance on test data.
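Performance on held-out data gets worse as more features or higher-order terms are added, even though the training fit keeps improving, indicating the model is fitting noise rather than just the signal. (Patterns in the training residuals, by contrast, usually point to underfitting or a misspecified model.)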
The model may have a high variance, where small changes in the data lead to significant changes in the model parameters.
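One way to check for the train/test gap is to fit a simple and a very flexible model to the same training sample and compare errors. This is a sketch on synthetic data; the data-generating line, noise level, sample sizes, and polynomial degrees are all assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic sample: the true relationship is a straight line plus noise."""
    x = rng.uniform(-1, 1, n)
    y = 2.0 * x + rng.normal(0, 0.5, n)
    return x, y

def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return np.mean((np.polyval(coefs, x) - y) ** 2)

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_test, y_test))
    print(f"degree {degree:2d}: train MSE {results[degree][0]:.3f}, "
          f"test MSE {results[degree][1]:.3f}")
```

The degree-10 fit can never have a higher training error than the straight line, because the straight line is a special case of it. What matters is the test column: when the flexible model's test error is much larger than its training error, it is overfitting.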

Q0 - 5
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, leading to unstable coefficient estimates with inflated variances. This can make it difficult to determine the effect of each variable on the dependent variable. The two-stage least squares (2SLS) method is used in situations with endogenous predictors (where a predictor is correlated with the error term). The first stage regresses the endogenous predictor on one or more instruments (variables that are correlated with the predictor but not with the error term), along with any exogenous predictors, to obtain predicted values. In the second stage, these predicted values (instrumented values) replace the original variable in the regression of the outcome. This addresses endogeneity; it is not by itself a remedy for multicollinearity.
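A minimal sketch of two-stage least squares on synthetic data, assuming a single instrument `z` that moves the predictor but is unrelated to the error. All variable names, coefficients, and the sample size are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

z = rng.normal(size=n)                  # instrument: affects x, unrelated to the error
u = rng.normal(size=n)                  # unobserved confounder
x = z + u                               # endogenous predictor (correlated with u)
y = 2.0 * x + u + rng.normal(size=n)    # true slope on x is 2

def ols(X, y):
    """Least-squares coefficients for y regressed on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)

# Naive OLS of y on x: biased, because x is correlated with the error through u.
b_ols = ols(np.column_stack([const, x]), y)[1]

# Stage 1: regress x on the instrument and keep the fitted values.
Z = np.column_stack([const, z])
x_hat = Z @ ols(Z, x)

# Stage 2: regress y on the stage-1 fitted values.
b_2sls = ols(np.column_stack([const, x_hat]), y)[1]

print("OLS slope: ", b_ols)
print("2SLS slope:", b_2sls)
```

In this setup the OLS slope is pulled away from 2 by the confounder, while the 2SLS slope lands close to it. Standard errors from a hand-rolled second stage like this one are not valid as printed; dedicated 2SLS routines correct them.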

Q0 - 6
Polynomial Terms: You can add polynomial terms (e.g., x^2, x^3) for features to allow for curved relationships.
Transformations: Applying transformations like logarithmic, exponential, or square root to the features or the target variable can introduce nonlinear effects into a model.
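Both ideas can be sketched with plain least squares. The data are noiseless and made up so that the recovered coefficients are easy to check; the specific coefficients are assumptions for illustration:

```python
import numpy as np

x = np.linspace(-2, 2, 9)

# Polynomial term: y is curved in x, but linear in the coefficients on [1, x, x^2].
y_quad = 1.0 + 0.5 * x + 3.0 * x**2
X_quad = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X_quad, y_quad, rcond=None)
print("quadratic fit:", beta)

# Transformation: y grows exponentially in x, so log(y) is linear in x.
y_exp = 2.0 * np.exp(0.7 * x)
X_lin = np.column_stack([np.ones_like(x), x])
gamma, *_ = np.linalg.lstsq(X_lin, np.log(y_exp), rcond=None)
print("log-linear fit:", gamma)   # intercept is log(2), slope is 0.7
```

In both cases the fitting step is ordinary linear regression; only the columns handed to it have changed.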

Q0 - 7
Intercept: The intercept is the predicted value of the outcome variable when all predictors are zero.

Slope Coefficient: The slope coefficient represents the change in the outcome variable for a one-unit increase in the predictor, holding other variables constant.
Dummy Variable Coefficient: The coefficient for a dummy variable indicates the difference in the outcome between the reference group (usually coded as 0) and the group represented by the dummy variable (coded as 1), holding other variables constant.

Q1 Compute the average prices and scores by Neighbourhood; which borough is the most expensive on average? Create a kernel density plot of price and log price, grouping by Neighbourhood.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv('./sample_data/Q1_clean.csv')
df.head()

Unnamed: 0,Price,Review Scores Rating,Neighbourhood,Property Type,Room Type
0,549,96.0,Manhattan,Apartment,Private room
1,149,100.0,Brooklyn,Apartment,Entire home/apt
2,250,100.0,Manhattan,Apartment,Entire home/apt
3,90,94.0,Brooklyn,Apartment,Private room
4,270,90.0,Manhattan,Apartment,Entire home/apt


In [3]:
df.loc[:,['Price','Neighbourhood '] ].groupby('Neighbourhood ').describe()


Neighbourhood,Price count,Price mean,Price std,Price min,Price 25%,Price 50%,Price 75%,Price max
Bronx,217.0,75.276498,39.755468,10.0,50.0,60.0,90.0,244.0
Brooklyn,8487.0,127.747378,106.038466,20.0,75.0,100.0,150.0,4500.0
Manhattan,11763.0,183.664286,170.434606,25.0,103.0,150.0,214.0,10000.0
Queens,1590.0,96.857233,61.712648,25.0,60.0,80.0,115.0,950.0
Staten Island,96.0,146.166667,508.462029,35.0,54.75,71.0,99.0,5000.0


Q1 - Regress price on Neighbourhood  by creating the appropriate dummy/one-hot-encoded variables, without an intercept in the linear model and using all the data. Compare the coefficients in the regression to the table from part 1. What pattern do you see? What are the coefficients in a regression of a continuous variable on one categorical variable?

In [4]:
y = df['Price']
X = pd.get_dummies(df['Neighbourhood '], dtype='int')

from sklearn import linear_model
reg = linear_model.LinearRegression(fit_intercept=False).fit(X,y) # Run regression

results = pd.DataFrame({'variable':reg.feature_names_in_, 'coefficient': reg.coef_}) # Regression coefficients
results

,variable,coefficient
0,Bronx,75.276498
1,Brooklyn,127.747378
2,Manhattan,183.664286
3,Queens,96.857233
4,Staten Island,146.166667
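The coefficients above match the mean column of the `describe()` table: with no intercept and one dummy per category, each coefficient is that category's average price. A small sketch on toy data (made up for illustration, not the Airbnb sample) shows the same pattern:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b", "c"],
    "price": [10.0, 20.0, 30.0, 50.0, 40.0, 70.0],
})

# One dummy column per group, no intercept column.
D = pd.get_dummies(toy["group"], dtype="int")
coef, *_ = np.linalg.lstsq(D.to_numpy(dtype=float), toy["price"].to_numpy(), rcond=None)

group_means = toy.groupby("group")["price"].mean()
comparison = pd.DataFrame(
    {"coefficient": coef, "group mean": group_means.to_numpy()}, index=D.columns
)
print(comparison)
```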


Q1 - Repeat part 2, but leave an intercept in the linear model. How do you have to handle the creation of the dummies differently? What is the intercept? Interpret the coefficients. How can I get the coefficients in part 2 from these new coefficients?

In [5]:
y = df['Price']
X = pd.get_dummies(df['Neighbourhood '], dtype='int', drop_first = True)

from sklearn import linear_model
reg = linear_model.LinearRegression().fit(X,y) # Run regression

results = pd.DataFrame({'variable':reg.feature_names_in_, 'coefficient': reg.coef_}) # Regression coefficients
results

,variable,coefficient
0,Brooklyn,52.470881
1,Manhattan,108.387789
2,Queens,21.580735
3,Staten Island,70.890169


In [6]:
print(reg.intercept_)

75.27649769585331


In [7]:
results = pd.DataFrame({'variable':reg.feature_names_in_,
                        'coefficient': reg.coef_+reg.intercept_}) # Regression coefficients
results

,variable,coefficient
0,Brooklyn,127.747378
1,Manhattan,183.664286
2,Queens,96.857233
3,Staten Island,146.166667
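With an intercept, one category has to be dropped (here Bronx, via `drop_first=True`). The intercept is then the mean of the dropped category, and each remaining coefficient is the difference from it. That is why adding the intercept back recovers the part 2 coefficients, with the baseline's own mean being the intercept itself. A sketch on toy data (invented for illustration, not the Airbnb sample):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b", "c"],
    "price": [10.0, 20.0, 30.0, 50.0, 40.0, 70.0],
})

# Drop the first category ("a"), which becomes the baseline absorbed by the intercept.
D = pd.get_dummies(toy["group"], dtype="int", drop_first=True)
X = np.column_stack([np.ones(len(toy)), D.to_numpy(dtype=float)])
beta, *_ = np.linalg.lstsq(X, toy["price"].to_numpy(), rcond=None)

intercept, diffs = beta[0], beta[1:]

# Baseline mean is the intercept; every other mean is intercept + its coefficient.
recovered_means = np.concatenate([[intercept], intercept + diffs])
print("intercept:", intercept)
print("differences from baseline:", diffs)
print("recovered group means:", recovered_means)
```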


Q1 - Split the sample 80/20 into a training and a test set. Run a regression of Price on Review Scores Rating and Neighbourhood. What is the $R^2$ and RMSE on the test set? What is the coefficient on Review Scores Rating? What is the most expensive kind of property you can rent?

In [8]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split

y = df['Price']
X = df.loc[:,['Review Scores Rating', 'Neighbourhood '] ]
X_train, X_test, y_train, y_test = train_test_split(X,y, # Feature and target variables
                                                    test_size=.2, # Split the sample 80 train/ 20 test
                                                    random_state=100) # For replication purposes

Z_train = pd.concat([X_train['Review Scores Rating'],
                     pd.get_dummies(X_train['Neighbourhood '], dtype='int')], axis = 1)
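Z_test = pd.concat([X_test['Review Scores Rating'],
                    pd.get_dummies(X_test['Neighbourhood '], dtype='int')], axis = 1)
Z_test = Z_test.reindex(columns=Z_train.columns, fill_value=0) # Match training columns if a category is absent from the test set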

reg = linear_model.LinearRegression(fit_intercept=False).fit(Z_train,y_train) # Run regression

y_hat = reg.predict(Z_test)
print('Rsq: ', reg.score(Z_test,y_test)) # R2
rmse = np.sqrt( np.mean( (y_test - y_hat)**2 ))
print('RMSE: ', rmse) # RMSE

results = pd.DataFrame({'variable':reg.feature_names_in_, 'coefficient': reg.coef_}) # Regression coefficients
results

Rsq:  0.06701086106947296
RMSE:  125.01092061382933


,variable,coefficient
0,Review Scores Rating,1.032257
1,Bronx,-17.261392
2,Brooklyn,32.180888
3,Manhattan,89.42102
4,Queens,4.050208
5,Staten Island,61.576393


Q1 - Split the sample 80/20 into a training and a test set. Run a regression of Price on Review Scores Rating and Neighbourhood and Property Type. What is the $R^2$ and RMSE on the test set? What is the coefficient on Review Scores Rating? What is the most expensive kind of property you can rent?

In [9]:
y = df['Price']
X = df.loc[:,['Review Scores Rating', 'Neighbourhood ', 'Room Type'] ]
X_train, X_test, y_train, y_test = train_test_split(X,y, # Feature and target variables
                                                    test_size=.2, # Split the sample 80 train/ 20 test
                                                    random_state=100) # For replication purposes

Z_train = pd.concat([X_train['Review Scores Rating'],
                    pd.get_dummies(X_train['Neighbourhood '], dtype='int'),
                    pd.get_dummies(X_train['Room Type'], dtype='int')],
                    axis = 1)
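Z_test = pd.concat([X_test['Review Scores Rating'],
                    pd.get_dummies(X_test['Neighbourhood '], dtype='int'),
                    pd.get_dummies(X_test['Room Type'], dtype='int')],
                    axis = 1)
Z_test = Z_test.reindex(columns=Z_train.columns, fill_value=0) # Match training columns if a category is absent from the test set
# Note: the Neighbourhood dummies and the Room Type dummies each sum to 1, so without
# dropping a category the dummy coefficients are not uniquely identified (predictions are).
reg = linear_model.LinearRegression(fit_intercept=False).fit(Z_train,y_train) # Run regression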
y_hat = reg.predict(Z_test)
print('Rsq: ', reg.score(Z_test,y_test)) # R2
rmse = np.sqrt( np.mean( (y_test - y_hat)**2 ))
print('RMSE: ', rmse) # RMSE
results = pd.DataFrame({'variable':reg.feature_names_in_, 'coefficient': reg.coef_}) # Regression coefficients
results

Rsq:  0.22035348129282306
RMSE:  114.27692123130633


,variable,coefficient
0,Review Scores Rating,0.626912
1,Bronx,-13.022765
2,Brooklyn,10.378456
3,Manhattan,53.693304
4,Queens,-6.83333
5,Staten Island,50.003022
6,Entire home/apt,110.61782
7,Private room,3.101341
8,Shared room,-19.500474


A 100-rated Entire home/apt in Manhattan would cost: 110.618 + 53.693 + 100*0.6269 ≈ 227.00
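Plugging the coefficients from the regression table above into the fitted equation, the prediction is the sum of the room-type coefficient, the neighbourhood coefficient, and the rating coefficient times the rating. The variable names below are illustrative; the numbers are copied from the table:

```python
# Coefficients copied from the regression output table.
coef_rating = 0.626912        # Review Scores Rating
coef_manhattan = 53.693304    # Manhattan dummy
coef_entire_home = 110.61782  # Entire home/apt dummy

rating = 100
predicted_price = coef_entire_home + coef_manhattan + coef_rating * rating
print(round(predicted_price, 2))  # 227.0
```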

Q1 - What does the coefficient on Review Scores Rating mean if it changes from part 4 to 5? Hint: Think about how multiple linear regression works.

In part 4, it was 1.03 and in part 5 it was 0.63. When we do multiple linear regression, you can imagine first regressing your $y$ and $x$ of interest on all the other variables, then saving the residuals, then regressing those residuals on each other. So in part 4 we didn't include the information about the room type, and in part 5 we did. What we're seeing is that some of the variation in prices by rating and neighbourhood is explained by the room types available in those neighbourhoods. Once we control for room type, the other variables become less powerful predictors because some of their predictive power is correlated with room type. That's why the coefficient on rating shrinks.
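The residual-on-residual description can be checked numerically. The sketch below uses synthetic data in which `w` stands in for the omitted control (room type in the text); the names, coefficients, and sample size are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

w = rng.normal(size=n)                        # control variable
x = 0.8 * w + rng.normal(size=n)              # predictor of interest, correlated with w
y = 1.0 + 2.0 * x - 3.0 * w + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients for y regressed on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)
controls = np.column_stack([const, w])

# Full multiple regression: coefficient on x while controlling for w.
b_full = ols(np.column_stack([const, x, w]), y)[1]

# Partialling out: residualize x and y on the controls, then regress residual on residual.
x_resid = x - controls @ ols(controls, x)
y_resid = y - controls @ ols(controls, y)
b_partial = ols(x_resid[:, None], y_resid)[0]

# Short regression that leaves w out: x picks up part of w's effect.
b_short = ols(np.column_stack([const, x]), y)[1]

print("full:", b_full, "partialled out:", b_partial, "without control:", b_short)
```

The first two numbers agree, which is the residual-on-residual view of multiple regression. The third differs because, without the control, `x` absorbs part of the effect of `w`, just as the rating coefficient absorbed part of the room-type effect in part 4.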