# Homework 1: Text Mining
## Part 3: Difference in Difference

Group Members: Matias Borrel, Pol Garcia, Marvin Ernst

#### Importing relevant Libraries:

In [1]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np

#### Load the cleaned data:

In [2]:
df = pd.read_csv("final_cleaned_data.csv")

In [3]:
print(df.head()) 

                                     Name  Barcelona  Price  Rating  stars  \
0                        Isabella's House          1   2142    8.58    4.0   
1                         Sonder Casa Luz          1   4870    8.48    4.0   
2  Axel TWO Barcelona 4 Sup - Adults Only          1   2149    8.38    4.0   
3                           Acta Voraport          1   1540    8.88    3.0   
4                         Tembo Barcelona          1   2840    8.88    4.0   

   num_coments                                        Description  \
0        1.390  Isabella's House se encuentra a 2,9 km de Pase...   
1        2.889  El Sonder Casa Luz se encuentra en Barcelona y...   
2        2.742  El TWO Hotel Barcelona by Axel está situado en...   
3       13.850  Acta Voraport está en Barcelona, a 10 min a pi...   
4        2.697  Tembo Barcelona está en Barcelona, a 16 min a ...   

   During_Treatment                                Cleaned_Description  \
0                 1  isabell hous encuentr

## Explanation of the Dataset

- **Name**: The name of the hotel.
- **Barcelona**: This column indicates whether the hotel is in the treatment or control group:
  - `0`: The hotel is located in **Madrid** (control group).
  - `1`: The hotel is located in **Barcelona** (treatment group).
- **Price**: The price of the hotel in **Euros** during the given time period.
- **Rating**: The rating of the hotel based on user reviews from the booking platform.
- **stars**: The star classification of the hotel (e.g., **3-star, 4-star, etc.**).
- **num_coments**: The number of reviews left by users on the platform.
- **Description**: The original, unprocessed hotel description as found on the website.
- **During_Treatment**: This column indicates whether the observation belongs to the treatment period:
  - `0`: The prices were recorded **two weeks after the event** (control period).
  - `1`: The prices correspond to **the event week** (treatment period).
- **Cleaned_Description**: The processed hotel description after **stemming, lowercasing, removing stopwords, and preserving relevant numeric mentions** (such as distances and durations).
- **Word_Count**: The total number of words in the hotel’s **raw** description.
- **Sentence_Count**: The number of sentences in the hotel’s **raw** description.
- **Avg_Word_Length**: The average length of words in the hotel’s **raw** description.
- **Special_Mentions**: The number of times the description contains **important location and amenity keywords**, such as:
  - `"centro"` (city center)
  - `"playa"` (beach)
  - `"aeropuerto"` (airport)
  - `"wifi"`, `"metro"`, and `"piscina"` (pool)
- **Sentiment_Score**: A computed **sentiment polarity score** based on the `TextBlob` sentiment analysis of the **cleaned** description.
  - **Ranges from** `-1` (negative) **to** `1` (positive), **with** `0` **indicating a neutral description**.
- **Luxury_Score**: The number of times the **cleaned** description includes words associated with **luxury accommodations**, such as:
  - `"spa"`, `"lujoso"` (luxurious), `"exclusivo"`, `"premium"`
  - `"sofisticado"` (sophisticated), `"personalizado"`, `"masajes"` (massages), and `"sauna"`
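The text-derived columns above could be computed roughly as follows. This is only a sketch under our own assumptions: the helper names `count_keywords` and `basic_text_stats` are hypothetical, the tokenisation is a simple regular expression, and the counts are taken on lowercased raw text, whereas `Luxury_Score` in the dataset is based on the stemmed description.

```python
import re

# Keyword lists taken from the column descriptions above.
SPECIAL_KEYWORDS = ["centro", "playa", "aeropuerto", "wifi", "metro", "piscina"]
LUXURY_KEYWORDS = ["spa", "lujoso", "exclusivo", "premium",
                   "sofisticado", "personalizado", "masajes", "sauna"]


def count_keywords(text, keywords):
    """Count whole-word, case-insensitive occurrences of any keyword."""
    tokens = re.findall(r"\w+", text.lower())
    wanted = set(keywords)
    return sum(token in wanted for token in tokens)


def basic_text_stats(text):
    """Word count, sentence count and average word length of a raw description."""
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "Word_Count": len(words),
        "Sentence_Count": len(sentences),
        "Avg_Word_Length": avg_len,
    }


# Made-up description, used only to illustrate the helpers.
example = "Hotel con spa y sauna, cerca del centro y la playa. Wifi gratis."
special_mentions = count_keywords(example, SPECIAL_KEYWORDS)
luxury_score = count_keywords(example, LUXURY_KEYWORDS)
stats = basic_text_stats(example)
print(special_mentions, luxury_score, stats)
```

Matching whole tokens rather than substrings avoids counting, for instance, "spa" inside an unrelated longer word.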


## (a) Fixed Effects Regression

Write down a fixed effects regression equation that allows you to derive a difference-in-difference estimate of the effect of the event on prices. Think of controls to add, why is this relevant? Explain why you need a second city for this.

### Base Model:

First, we run a regression without controls.

Create the interaction term:

In [4]:
df["Barcelona_x_Treatment"] = df["Barcelona"] * df["During_Treatment"]
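The base specification estimated below can be written out explicitly. The following is a sketch in our own shorthand, where $B_i$ stands for `Barcelona`, $D_t$ for `During_Treatment`, and the outcome is the log price introduced in the next subsection:

$$
\log(\text{Price}_{it}) = \beta_0 + \beta_1 D_t + \beta_2 B_i + \beta_3 (B_i \times D_t) + \varepsilon_{it}
$$

With two cities and two periods, $\beta_2$ plays the role of a city fixed effect and $\beta_1$ that of a period fixed effect. The difference-in-difference estimate is $\beta_3$, which equals the change in Barcelona minus the change in Madrid:

$$
\beta_3 = \big(E[y \mid B=1, D=1] - E[y \mid B=1, D=0]\big) - \big(E[y \mid B=0, D=1] - E[y \mid B=0, D=0]\big)
$$

A second city is needed because the change in Barcelona prices between the two periods mixes the event effect with anything else that differs between the two weeks, such as seasonality or general demand. Madrid's change over the same weeks estimates that common component, and subtracting it isolates the event effect under the common-trends assumption.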

#### Why Take the Log of the Price?

We take the logarithm of the price (outcome variable) because:

- **Interpretability**: The coefficient estimates can be interpreted as approximate percentage changes rather than absolute changes (exactly, a coefficient `b` on a dummy corresponds to a change of `exp(b) - 1`). This is particularly useful when analyzing price effects, as percentage changes are more intuitive in economic contexts.

- **Reduces Skewness**: Prices often have a right-skewed distribution, meaning there are a few very high values that could disproportionately influence the regression results. Taking the log makes the distribution more symmetric, which limits the influence of extreme prices on the estimates.

- **Stabilizes Variance**: Log transformation can help reduce heteroskedasticity, i.e. it makes the variance of the residuals more constant across different levels of the independent variables. Constant variance (homoskedasticity) is what the default OLS standard errors assume.

- **Better Fit for Multiplicative Relationships**: Many economic relationships are multiplicative rather than additive (e.g., a 10% increase in demand leads to a 5% increase in price, rather than a fixed amount). A log-linear model captures these effects better than a linear model.

By applying the log transformation, we ensure that our estimates reflect relative price changes, improving both the robustness and interpretability of our regression results.

In [5]:
df["Log_Price"] = np.log(df["Price"])
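To illustrate the skewness argument, the sketch below uses simulated log-normal prices rather than the hotel data. The parameters `7.0` and `0.5`, the sample size, and the helper `skewness` are arbitrary assumptions chosen only for illustration.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


random.seed(0)
# Simulated right-skewed prices (not the hotel data).
prices = [random.lognormvariate(7.0, 0.5) for _ in range(5000)]
log_prices = [math.log(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(f"skewness of prices:     {skew_raw:.2f}")
print(f"skewness of log prices: {skew_log:.2f}")
```

The raw prices show clear positive skewness, while the logged prices are close to symmetric. The same check could be run on `df["Price"]` and `df["Log_Price"]`.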

In [6]:
model1 = smf.ols(
    "Log_Price ~ During_Treatment + Barcelona + Barcelona_x_Treatment ", 
    data=df
).fit()

print(model1.summary())

                            OLS Regression Results                            
Dep. Variable:              Log_Price   R-squared:                       0.408
Model:                            OLS   Adj. R-squared:                  0.407
Method:                 Least Squares   F-statistic:                     338.9
Date:                Wed, 05 Feb 2025   Prob (F-statistic):          2.20e-167
Time:                        21:46:18   Log-Likelihood:                -762.90
No. Observations:                1480   AIC:                             1534.
Df Residuals:                    1476   BIC:                             1555.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                 7.06

#### Key Takeaways from the Base Model

- **R-squared (0.408):** The model explains about 40.8% of the variance in log-transformed hotel prices. While this is a reasonable fit, it suggests that other factors influence prices beyond our current variables.

- **During_Treatment (-0.0541, p = 0.059):** Because the model includes the interaction term, this coefficient describes Madrid, the control city: prices there were about 5% lower in the event week than two weeks later. The difference is not statistically significant at the 5% level.

- **Barcelona (-0.0356, p = 0.234):** This is the price gap between Barcelona and Madrid in the control period (two weeks after the event). It is negative but not statistically significant, so baseline hotel prices in the two cities do not differ detectably.

- **Interaction Term (Barcelona × Treatment: 0.8597, p < 0.001):** This is the most relevant coefficient for our Difference-in-Differences analysis. The highly significant and large positive effect suggests that hotel prices in Barcelona **increased** significantly during the Mobile World Congress, relative to Madrid. A coefficient of 0.86 log points corresponds to an increase of roughly exp(0.86) - 1 ≈ 136%, which indicates a strong event effect under the common-trends assumption.
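In a model with only the two dummies and their interaction, the interaction coefficient is exactly the difference of the two before/after differences in group means. The sketch below checks this on a small made-up dataset (the eight values are invented for illustration, and `numpy.linalg.lstsq` stands in for the `statsmodels` fit).

```python
import numpy as np
import pandas as pd

# Toy data: two observations per city-period cell (values are made up).
toy = pd.DataFrame({
    "Barcelona":        [0, 0, 0, 0, 1, 1, 1, 1],
    "During_Treatment": [0, 0, 1, 1, 0, 0, 1, 1],
    "Log_Price":        [7.0, 7.2, 7.0, 7.1, 6.9, 7.1, 7.8, 8.0],
})

# 2x2 table of cell means: rows = Barcelona, columns = During_Treatment.
means = (
    toy.groupby(["Barcelona", "During_Treatment"])["Log_Price"]
    .mean()
    .unstack()
)
change_madrid = means.loc[0, 1] - means.loc[0, 0]
change_barcelona = means.loc[1, 1] - means.loc[1, 0]
did_from_means = change_barcelona - change_madrid

# Same estimate from OLS with an intercept, both dummies and the interaction.
X = np.column_stack([
    np.ones(len(toy)),
    toy["During_Treatment"],
    toy["Barcelona"],
    toy["Barcelona"] * toy["During_Treatment"],
])
beta, *_ = np.linalg.lstsq(X, toy["Log_Price"].to_numpy(), rcond=None)
did_from_ols = beta[3]

print(means)
print(did_from_means, did_from_ols)
```

Once controls are added, the interaction coefficient no longer equals the raw difference of means, but the interpretation as a difference of differences carries over.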

#### Potential Controls:


- **Hotel Rating:** Higher-rated hotels tend to have higher prices, as positive reviews often signal better quality, amenities, and service. Including this as a control helps account for differences in pricing due to customer satisfaction.

- **Number of Stars:** The official star rating of a hotel (e.g., 3-star, 4-star, 5-star) is a strong determinant of price. More luxurious hotels generally charge higher prices, so including this variable helps isolate the effect of the event from differences in accommodation quality.

- **Number of Comments:** Hotels with a higher number of reviews might have different pricing strategies. A well-reviewed hotel may charge premium prices due to its established reputation, while a newer or less-reviewed hotel may offer lower prices to attract customers. Controlling for this helps adjust for differences in consumer demand based on past guest experiences.


Unfortunately, we are not able to include any time-dependent variables. We therefore assume common trends: absent the event, prices in Barcelona and Madrid would have evolved in parallel.

###### Rescaling the number of comments to thousands:

In [7]:
df['num_coments'] = df['num_coments'] / 1000

##### Adding these controls:

In [8]:
model2 = smf.ols(
    "Log_Price ~ During_Treatment + Barcelona + Barcelona_x_Treatment + Rating + stars + num_coments", 
    data=df
).fit()

print(model2.summary())

                            OLS Regression Results                            
Dep. Variable:              Log_Price   R-squared:                       0.607
Model:                            OLS   Adj. R-squared:                  0.606
Method:                 Least Squares   F-statistic:                     379.0
Date:                Wed, 05 Feb 2025   Prob (F-statistic):          3.39e-294
Time:                        21:46:18   Log-Likelihood:                -459.56
No. Observations:                1478   AIC:                             933.1
Df Residuals:                    1471   BIC:                             970.2
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                 5.54

#### Key Takeaways from the Regression Results

- **Event Effect (Barcelona × Treatment: 0.8598, p < 0.001):**  
  The strong and highly significant positive coefficient confirms that hotel prices in Barcelona increased significantly during the Mobile World Congress, compared to Madrid. This provides strong evidence that the event had a substantial impact on hotel prices.

- **During_Treatment (-0.0542, p = 0.021):**  
  Because the interaction term is included, this coefficient refers to Madrid, the control city: prices there were about 5% lower in the event week than two weeks later. The effect is small but statistically significant at the 5% level.

- **Barcelona (-0.1359, p < 0.001):**  
  In the control period, hotels in Barcelona are roughly 13% cheaper than comparable hotels in Madrid, after controlling for rating, stars, and number of comments. This could be due to differences in market structure, competition, or supply constraints.

- **Hotel Characteristics Matter:**  
  - **Rating (0.0939, p < 0.001):** Higher-rated hotels tend to have significantly higher prices.  
  - **Stars (0.2383, p < 0.001):** The number of stars is a strong predictor of price, with higher-star hotels charging more.  
  - **Number of Comments (0.0815, p = 0.032):** Hotels with more reviews tend to have slightly higher prices, possibly reflecting higher demand or reputation effects.  

## (b) Text Features as Controls

Incorporating text-based features from hotel descriptions allows us to capture additional factors that might influence hotel prices. Below are key text-derived variables that can serve as controls in our analysis:

- **Special Mentions:** The number of times key location and amenity words appear in the description (e.g., "centro", "playa", "aeropuerto", "wifi", "metro", "piscina"). These terms indicate proximity to important landmarks or the availability of services that could drive up hotel prices.

- **Sentiment Score:** A numerical measure of the overall tone of the hotel description, capturing whether the description is more positive, neutral, or negative. More positive descriptions might correlate with higher prices.

- **Luxury Score:** The count of words associated with luxury (e.g., "spa", "lujoso", "exclusivo", "premium", "sofisticado", "personalizado", "masajes", "sauna"). Hotels emphasizing luxury features in their descriptions are likely to have higher prices.

- **Mentions of Key Amenities:** Some amenities, such as "wifi" or "piscina", could impact pricing. Hotels highlighting more amenities may justify higher prices.

- **Sentence Count:** A measure of the length of the description, which could serve as a proxy for how much information is provided about the hotel.

- **Word Count:** Longer descriptions might indicate better marketing efforts, higher-end accommodations, or more comprehensive information. This could act as a proxy for hotel reputation and demand.

By integrating these text-based features into our regression model, we can better control for qualitative aspects of hotel listings that could affect price differences.

On the other hand, words like "Barcelona" would not help because they appear in almost every description for accommodations in Barcelona and could only serve to distinguish the treatment and control groups, rather than providing meaningful variation within the treatment group. Including such a term in the regression would not add explanatory power, as it would be highly collinear with the city dummy variable. 

Moreover, since "Barcelona" is already accounted for in the fixed effects or city dummy, its presence in the text does not provide additional information about hotel characteristics, amenities, or price determinants. Instead, we should focus on words that differentiate hotels based on location, quality, or services offered.

##### The following model extends the model from part (a) with the text-based controls described above:

In [9]:
model3 = smf.ols(
    "Log_Price ~ During_Treatment + Barcelona + Barcelona_x_Treatment + Rating + stars + num_coments + Word_Count + Sentence_Count +  Avg_Word_Length + Special_Mentions + Sentiment_Score + Luxury_Score",
    data=df
).fit()

print(model3.summary())

                            OLS Regression Results                            
Dep. Variable:              Log_Price   R-squared:                       0.612
Model:                            OLS   Adj. R-squared:                  0.609
Method:                 Least Squares   F-statistic:                     192.5
Date:                Wed, 05 Feb 2025   Prob (F-statistic):          1.06e-290
Time:                        21:46:18   Log-Likelihood:                -450.48
No. Observations:                1478   AIC:                             927.0
Df Residuals:                    1465   BIC:                             995.8
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                 5.67

#### Main Takeaways from the Regression Results

- **Event Effect (`Barcelona_x_Treatment = 0.8598, p < 0.001`)**:  
  - The strong and statistically significant interaction term confirms that the **Mobile World Congress (MWC) led to a substantial increase in hotel prices in Barcelona** compared to Madrid during the treatment period.  
  - The magnitude of **0.86 log points corresponds to an increase of roughly exp(0.86) - 1 ≈ 136% in Barcelona hotel prices during the event**, relative to Madrid. Reading a log coefficient directly as a percentage is accurate only for small values.

- **Baseline Price Differences (`Barcelona = -0.1254, p < 0.001`)**:  
  - In the control period, hotel prices in **Barcelona are lower than in Madrid**, holding the controls fixed.

- **Treatment Period Effect (`During_Treatment = -0.0542, p = 0.020`)**:  
  - There is a small but significant **decrease in Madrid hotel prices during the treatment period** relative to the control period, which may reflect seasonality or other market dynamics unrelated to the event.

- **Hotel Characteristics Matter**:  
  - **Higher ratings (`Rating = 0.0983, p < 0.001`) and more stars (`stars = 0.2318, p < 0.001`) are strongly associated with higher hotel prices**.
  - **Number of reviews (`num_coments = 8.356e-05, p = 0.029`) has a small but positive effect on prices**, suggesting that more frequently reviewed hotels tend to be priced higher.

- **Text Features as Controls**:  
  - **Word count (`Word_Count = 0.0012, p = 0.003`) has a small but positive association with prices**, potentially capturing more detailed or extensive descriptions related to high-end accommodations.  
  - **Sentence count (`Sentence_Count = -0.0213, p = 0.003`) has a negative association**: holding word count fixed, descriptions split into more (and therefore shorter) sentences go with lower prices, possibly reflecting lower-end accommodations or a different marketing style.  
  - **Sentiment Score, Special Mentions, and Luxury Score are not statistically significant**, meaning they do not provide strong predictive power in this model.  
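Because the outcome is the log price, a dummy coefficient `b` translates into a percentage change of `exp(b) - 1`, which differs noticeably from `b` itself when `b` is large. A minimal sketch using the coefficients reported above (the helper name `log_points_to_percent` is our own):

```python
import math


def log_points_to_percent(coef):
    """Exact percentage change implied by a coefficient in a log-outcome model."""
    return (math.exp(coef) - 1.0) * 100.0


event_effect_pct = log_points_to_percent(0.8598)    # Barcelona_x_Treatment
period_effect_pct = log_points_to_percent(-0.0542)  # During_Treatment

print(f"Event effect:  {event_effect_pct:.1f}%")
print(f"Period effect: {period_effect_pct:.1f}%")
```

For the small `During_Treatment` coefficient the exact change is close to the coefficient itself, while for the event effect the exact change is far larger than 86%.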

## (c) Heterogeneous Treatment Effects

To explore whether the effect of the Mobile World Congress (MWC) on hotel prices varies with **hotel quality**, we add interaction terms between the treatment-period dummy (`During_Treatment`) and hotel characteristics: **rating, stars, and luxury score**. Because these interactions use `During_Treatment` rather than `Barcelona_x_Treatment`, they measure how the event-week price change varies with quality across both cities pooled.

#### Choosing the Right Interaction Terms

- **Luxury Score:** This text-derived metric captures mentions of luxury-related words in hotel descriptions. It may provide additional insights beyond star ratings but could also overlap conceptually.
- **Stars and Rating:** These two variables might be highly correlated, as higher-rated hotels tend to have more stars. Including both interactions may introduce **multicollinearity**, making it difficult to interpret the individual effects. We therefore check their correlation.

##### Correlation between Stars and Rating:

In [10]:
correlation = df[['stars', 'Rating']].corr()
print(correlation)

           stars    Rating
stars   1.000000  0.396621
Rating  0.396621  1.000000


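The correlation of about 0.40 is moderate, so the two variables carry largely distinct information and multicollinearity between them should not be severe.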
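As a rough worked check, the correlation can be turned into a variance inflation factor. The formula `1 / (1 - r**2)` used below is the VIF for a regression with only these two regressors, which is a simplifying assumption. In the full model the other controls also matter, so this is a lower bound rather than the exact figure.

```python
# Correlation between stars and Rating, taken from the output above.
r = 0.396621

r_squared = r ** 2             # share of variance the two variables have in common
vif = 1.0 / (1.0 - r_squared)  # two-regressor variance inflation factor

print(f"shared variance: {r_squared:.3f}, VIF: {vif:.3f}")
```

A VIF of about 1.19 is far below the common rule-of-thumb thresholds of 5 or 10.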

**Multiple Interactions Together**  
We include interactions with **all three hotel characteristics** (stars, rating, and luxury score) to capture multiple dimensions of heterogeneity. We add these to our previous model:

In [11]:
model4 = smf.ols(
    "Log_Price ~ During_Treatment + Barcelona + Barcelona_x_Treatment + Rating + stars + num_coments + Word_Count + Sentence_Count + Avg_Word_Length + Special_Mentions + Sentiment_Score + Luxury_Score + During_Treatment:Rating + During_Treatment:stars + During_Treatment:Luxury_Score",
    data=df
).fit()

print(model4.summary())

                            OLS Regression Results                            
Dep. Variable:              Log_Price   R-squared:                       0.614
Model:                            OLS   Adj. R-squared:                  0.610
Method:                 Least Squares   F-statistic:                     154.9
Date:                Wed, 05 Feb 2025   Prob (F-statistic):          1.00e-288
Time:                        21:46:18   Log-Likelihood:                -447.14
No. Observations:                1478   AIC:                             926.3
Df Residuals:                    1462   BIC:                             1011.
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept     

#### Key Takeaways from Heterogeneous Treatment Analysis Based on Hotel Quality

- **Overall Model Fit**:  
  - The model explains **61.4%** of the variance in log hotel prices (**R² = 0.614**), only marginally more than the previous model (R² = 0.612).  
  - This suggests that the quality interactions add little explanatory power.

- **Main Treatment Effect**:  
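  - The **During_Treatment** coefficient is **-0.3829 (p = 0.017)**. With the interaction terms included, it is the event-week price change for a Madrid hotel whose rating, stars, and luxury score are all zero, so it should not be read as an average period effect.  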
  - The **Barcelona_x_Treatment** coefficient remains strong and positive (**0.8516, p < 0.001**), confirming that hotel prices in **Barcelona increased significantly** during the **Mobile World Congress (MWC)** compared to Madrid.

- **Hotel Characteristics & Price Levels**:  
  - **Higher ratings and more stars significantly increase hotel prices**:  
    - **Rating**: **0.0828 (p < 0.001)** → Higher-rated hotels tend to have higher prices.  
    - **Stars**: **0.2216 (p < 0.001)** → More stars strongly correlate with higher prices.  
  - **Number of reviews has a small but positive effect** (**p = 0.028**), suggesting that more reviewed hotels tend to have slightly higher prices.

- **Heterogeneous Treatment Effects (Interaction Terms)**:  
  - **During_Treatment × Rating (0.0311, p = 0.144)**:  
    - The coefficient is **positive but not statistically significant**, meaning we do **not** find strong evidence that **higher-rated hotels reacted differently** to MWC in terms of price increases.  
  - **During_Treatment × Stars (0.0203, p = 0.387)**:  
    - The effect of **stars on price changes** during the event is **also not significant**, suggesting that **luxury hotels did not systematically increase their prices more than lower-tier hotels**.  
  - **During_Treatment × Luxury_Score (0.0300, p = 0.212)**:  
    - The **luxury score does not significantly moderate the treatment effect**, indicating that explicitly “luxury-marketed” hotels **did not price differently** due to MWC.

- **Conclusion**:  
  - The event had a **substantial price effect overall**, particularly in **Barcelona**.  
  - However, we do **not find strong evidence that higher-end hotels** (luxury, more stars, or higher ratings) **systematically increased their prices more than lower-end hotels** during the event.  
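  - This is consistent with **MWC affecting all hotels similarly**, rather than having a **disproportionate effect on high-end accommodations**. Since the interactions are with `During_Treatment` and pool both cities, this conclusion is tentative.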
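One way to test heterogeneity of the event effect itself, rather than of the period effect, would be a triple interaction between city, period, and a quality measure. The sketch below runs on simulated data only: the column names mirror our dataset, but the data, sample size, and true coefficients are made up for illustration and say nothing about the actual hotels.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000

# Simulated hotels (not the real data).
sim = pd.DataFrame({
    "Barcelona": rng.integers(0, 2, n),
    "During_Treatment": rng.integers(0, 2, n),
    "stars": rng.integers(1, 6, n),
})

# Assumed data-generating process: the event effect grows by 0.1 per star.
sim["Log_Price"] = (
    6.0
    + 0.2 * sim["stars"]
    + 0.5 * sim["Barcelona"] * sim["During_Treatment"]
    + 0.1 * sim["Barcelona"] * sim["During_Treatment"] * sim["stars"]
    + rng.normal(0.0, 0.1, n)
)

# `a * b * c` expands to all main effects, two-way and three-way interactions.
triple = smf.ols(
    "Log_Price ~ Barcelona * During_Treatment * stars", data=sim
).fit()

term = "Barcelona:During_Treatment:stars"
print(triple.params[term], triple.pvalues[term])
```

The coefficient on `Barcelona:During_Treatment:stars` measures how the difference-in-difference estimate changes with each additional star. The same formula could be applied to `df`, with `Rating` or `Luxury_Score` in place of `stars`.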