# House Price Prediction

### Data Cleaning

Load the dataset into a Pandas DataFrame, display the first few rows, and generate a statistical summary of its columns.
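A minimal sketch of this step is shown below. The inline CSV text and its column names (`LotArea`, `Neighborhood`, `GarageType`, `SalePrice`) are hypothetical stand-ins for the real dataset; in practice the file path would be passed to `pd.read_csv` directly.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file; replace with pd.read_csv(<path to your dataset>).
csv_text = """LotArea,Neighborhood,GarageType,SalePrice
8450,CollgCr,Attchd,208500
9600,Veenker,,181500
11250,CollgCr,Attchd,223500
9550,Crawfor,Detchd,140000
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())                    # first few rows
print(df.describe(include="all"))   # summary of numeric and categorical columns
```

Passing `include="all"` makes `describe` report on categorical columns as well as numeric ones.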

Sort the columns by the number of missing values they contain, then handle the missing values using two different strategies. Create a separate dataset for each strategy.
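One possible pair of strategies is sketched below: dropping incomplete data versus imputing it. The toy DataFrame, the 50% cutoff for "mostly empty" columns, and the `"Missing"` placeholder category are all assumptions made for illustration, not requirements of the exercise.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names.
df = pd.DataFrame({
    "LotArea": [8450, 9600, np.nan, 9550, 14260],
    "GarageType": ["Attchd", None, None, "Detchd", None],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Columns ranked by number of missing values, most incomplete first.
missing = df.isna().sum().sort_values(ascending=False)
print(missing)

# Strategy 1: drop columns that are mostly empty, then drop the remaining incomplete rows.
mostly_empty = missing[missing / len(df) > 0.5].index
df_dropped = df.drop(columns=mostly_empty).dropna()

# Strategy 2: impute, using the median for numeric columns and an explicit category otherwise.
df_imputed = df.copy()
for col in df_imputed.columns:
    if pd.api.types.is_numeric_dtype(df_imputed[col]):
        df_imputed[col] = df_imputed[col].fillna(df_imputed[col].median())
    else:
        df_imputed[col] = df_imputed[col].fillna("Missing")
```

The first dataset is smaller but contains only observed values; the second keeps every row at the cost of introducing estimated values.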

### Evaluating Features

Use the `groupby` function in Pandas to visualize the distribution of house prices for categorical features. Identify which categorical features provide the most useful information for prediction based on their impact on price variation.
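As a rough illustration, the sketch below ranks categorical features by how far apart their per-category median prices are. The helper `price_spread`, the toy data, and the choice of "standard deviation of category medians" as the score are assumptions; other measures of price variation would be equally defensible.

```python
import pandas as pd

# Toy data with hypothetical column names.
df = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C"],
    "Street": ["Pave", "Grvl", "Pave", "Grvl", "Pave", "Grvl"],
    "SalePrice": [100, 110, 200, 210, 300, 310],
})

def price_spread(frame, feature, target="SalePrice"):
    """Standard deviation of the per-category median price (larger means more separation)."""
    return frame.groupby(feature)[target].median().std()

ranking = pd.Series(
    {col: price_spread(df, col) for col in ["Neighborhood", "Street"]}
).sort_values(ascending=False)
print(ranking)

# Per-category summary for the top feature.
# df.boxplot(column="SalePrice", by=ranking.index[0]) would draw the same comparison.
print(df.groupby(ranking.index[0])["SalePrice"].describe())
```

In this toy data `Neighborhood` separates prices far more than `Street`, so it ranks first.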

### Model Setup

In this section, implement **LightGBM, Random Forest, and CatBoost** to predict house prices. Randomly split the dataset into **training and validation sets**, ensuring fair model evaluation. Choose an appropriate **evaluation metric** (e.g., RMSE, MAE, or R²) and report the results on the **validation dataset**.
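The split-and-evaluate pattern can be sketched as follows, assuming scikit-learn is available. The synthetic `X` and `y`, the 80/20 split, and the helper name `validation_rmse` are illustrative choices. Only the random forest is run here; `lightgbm.LGBMRegressor` and `catboost.CatBoostRegressor` expose the same `fit`/`predict` interface and could be passed to the same helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned feature matrix and price vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Random split with a fixed seed so every model sees the same validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

def validation_rmse(model):
    """Fit on the training split and return RMSE on the validation split."""
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    return float(np.sqrt(mean_squared_error(y_val, pred)))

rf_rmse = validation_rmse(RandomForestRegressor(n_estimators=100, random_state=42))
print(f"Random forest validation RMSE: {rf_rmse:.3f}")
```

Reusing one split and one metric function across all three models keeps the comparison fair.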

#### Random Forest

#### LightGBM

#### CatBoost

# Window-Based Signed Return Labeling Function


Develop a Python function that computes and assigns labels to a financial time series based on the cumulative return over a specified time horizon (window) and a predetermined threshold. This function will label events as follows:
- **1** if the cumulative return is at or above the positive threshold $\tau$ (indicating an upward trend),
- **-1** if the cumulative return is at or below the negative threshold $-\tau$ (indicating a downward trend), and
- **0** otherwise (indicating no significant movement).



You are given a time series of stock prices and a series of event timestamps (which indicate the starting points for the labeling process). For each event, you will compute the cumulative return over a specified time horizon $w$ (in days) and compare it to a threshold $\tau$ (expressed in log return terms). Then, you will assign a label according to the following mathematical formulation:

1. **Cumulative Return Over Window:**
   $$
   R_t^{(w)} = \ln\left(\frac{P_{t+w}}{P_t}\right)
   $$
   where $P_t$ is the stock price at time $t$.

2. **Labeling Rule:**
   $$
   s_t^{(w, \tau)} = \begin{cases}
   1 & \text{if } R_t^{(w)} \geq \tau, \\
   -1 & \text{if } R_t^{(w)} \leq -\tau, \\
   0 & \text{if } -\tau < R_t^{(w)} < \tau.
   \end{cases}
   $$
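To make the two formulas concrete, here is a small worked example for a single event. The helper name `signed_label` and the prices used are made up for illustration.

```python
import math

def signed_label(p_start: float, p_end: float, tau: float) -> int:
    """Apply the labeling rule to one (P_t, P_{t+w}) pair."""
    r = math.log(p_end / p_start)   # cumulative log return over the window
    if r >= tau:
        return 1
    if r <= -tau:
        return -1
    return 0

# With tau = 0.03:
up = signed_label(100.0, 105.0, 0.03)     # ln(1.05) is about 0.0488, above tau
flat = signed_label(100.0, 98.0, 0.03)    # ln(0.98) is about -0.0202, inside the band
down = signed_label(100.0, 95.0, 0.03)    # ln(0.95) is about -0.0513, below -tau
print(up, flat, down)
```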



Implement a function named `label_events` that meets the following requirements:


```python
import pandas as pd

def label_events(prices: pd.Series, events: pd.Series, horizon: int, threshold: float) -> pd.DataFrame:
    """
    Labels events in a financial time series based on cumulative returns.

    Parameters:
    - prices (pd.Series): A pandas Series of stock prices indexed by date.
    - events (pd.Series): A pandas Series of booleans or timestamps indicating the starting points for labeling events.
    - horizon (int): The time horizon (window size in days) over which to compute the cumulative return.
    - threshold (float): The threshold value (in log return terms) used to determine significant movement.

    Returns:
    - pd.DataFrame: A DataFrame containing the event timestamps, the computed cumulative return, and the assigned label.
    """
```
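
One possible way to fill in the body is sketched below; it is not the only valid reading of the specification. It assumes that `horizon` counts rows of `prices` (trading days) rather than calendar days, that a boolean `events` Series shares its index with `prices`, and that events whose window runs past the end of the data are dropped. The column names `ret` and `label` are also illustrative.

```python
import numpy as np
import pandas as pd

def label_events(prices: pd.Series, events: pd.Series, horizon: int, threshold: float) -> pd.DataFrame:
    # Accept either a boolean mask aligned with prices or a Series of timestamps.
    if pd.api.types.is_bool_dtype(events):
        event_times = events.index[events.to_numpy()]
    else:
        event_times = pd.Index(events.to_numpy())

    # Assumption: horizon counts rows (trading days), not calendar days.
    start = prices.index.get_indexer(event_times)
    end = start + horizon
    valid = (start >= 0) & (end < len(prices))   # drop unknown dates and incomplete windows
    start, end = start[valid], end[valid]

    values = prices.to_numpy()
    ret = np.log(values[end] / values[start])
    label = np.where(ret >= threshold, 1, np.where(ret <= -threshold, -1, 0))
    # Event timestamps are kept as the index of the result.
    return pd.DataFrame({"ret": ret, "label": label}, index=prices.index[start].rename("event"))

# Usage on made-up prices.
prices = pd.Series([100, 101, 106, 100, 95, 96], index=pd.date_range("2024-01-01", periods=6))
by_time = label_events(prices, pd.Series(prices.index[[0, 2, 4]]), horizon=2, threshold=0.04)
by_mask = label_events(prices, pd.Series([True, True, False, False, False, False], index=prices.index), 2, 0.04)
print(by_time)
```

The last event in `by_time` is dropped because its two-day window extends beyond the final price.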

## A

Select **10 stock market datasets** from the dataset provided in class. Apply the **labeling algorithm**, using the **CUSUM filter** for event detection, as demonstrated in class. Ensure that the labeled dataset is correctly structured for further analysis.
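For reference, a symmetric CUSUM filter on log returns can be sketched as below. The version demonstrated in class may differ in detail; the function name `cusum_events`, the trigger level `h`, and the toy prices are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def cusum_events(prices: pd.Series, h: float) -> pd.DatetimeIndex:
    """Return timestamps where cumulative up or down drift in log returns exceeds h."""
    events = []
    s_pos, s_neg = 0.0, 0.0
    log_ret = np.log(prices).diff().dropna()
    for t, r in log_ret.items():
        s_pos = max(0.0, s_pos + r)   # running upward drift, floored at zero
        s_neg = min(0.0, s_neg + r)   # running downward drift, capped at zero
        if s_pos > h:
            s_pos = 0.0               # reset the side that triggered
            events.append(t)
        elif s_neg < -h:
            s_neg = 0.0
            events.append(t)
    return pd.DatetimeIndex(events)

# Usage on made-up prices.
prices = pd.Series(
    [100, 101, 102, 103, 100, 97, 97.5],
    index=pd.date_range("2024-01-01", periods=7),
)
events = cusum_events(prices, h=0.025)
print(events)
```

The resulting timestamps can be wrapped in a `pd.Series` and passed as the `events` argument of `label_events`.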

## B

Now use a random forest classifier to predict the labels.
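A minimal sketch of this step, assuming scikit-learn, is given below. The features are synthetic placeholders; in the assignment they would be computed from information available at each event time. The chronological 80/20 split and `class_weight="balanced"` are design assumptions, chosen because shuffling time-ordered events can leak future information and because the three labels are often imbalanced.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(1)
# Hypothetical features and labels in {-1, 0, 1}, ordered by event time.
X = rng.normal(size=(400, 3))
y = np.where(X[:, 0] > 0.5, 1, np.where(X[:, 0] < -0.5, -1, 0))

# Chronological split: train on the earlier 80% of events, validate on the rest.
split = int(0.8 * len(X))
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X[:split], y[:split])
pred = clf.predict(X[split:])

print(classification_report(y[split:], pred, zero_division=0))
```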

## C

Compare the results of the **labeling method introduced in class** with the **one implemented here**. Assess which method performs better by defining **clear evaluation criteria**. Justify your conclusion with a **logical and reasonable evaluation** that aligns with the principles discussed in class. Avoid any obvious mistakes in the evaluation process.
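One way to organize such a comparison is to collect the same criteria for each labeling method in a single table. The sketch below uses share of neutral labels, macro F1, and balanced accuracy as example criteria; these choices, the helper `compare_labelings`, and the toy predictions are assumptions, and the criteria actually used should follow the principles from class. Because the two methods define different targets, the scores are only indicative and should be read alongside the label distributions.

```python
import pandas as pd
from sklearn.metrics import balanced_accuracy_score, f1_score

def compare_labelings(results: dict) -> pd.DataFrame:
    """results maps a method name to (y_true, y_pred) from the same out-of-sample period."""
    rows = {}
    for name, (y_true, y_pred) in results.items():
        y_true = pd.Series(y_true)
        rows[name] = {
            "share_of_zero_labels": float((y_true == 0).mean()),
            "macro_f1": f1_score(y_true, y_pred, average="macro"),
            "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        }
    return pd.DataFrame(rows).T

# Made-up labels and predictions, for illustration only.
table = compare_labelings({
    "method_a": ([1, -1, 0, 1, -1, 0], [1, -1, 0, 1, -1, 0]),
    "method_b": ([1, 1, 0, 0, -1, -1], [1, 0, 0, 0, -1, 1]),
})
print(table)
```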