# Apartment Prices Dataset

## Dataset Overview

This synthetic dataset simulates apartment prices based on various features, including apartment size, number of rooms, age of the building, floor level, and city. The dataset was generated to study and predict apartment prices based on these factors. It can be useful for regression modeling, data exploration, and machine learning experimentation.

This dataset includes 500 records, with each row representing a single apartment. Prices are calculated based on predefined factors, and cities have a specific influence on the base price per square meter.

## Features and Field Information

- **Square_Area**: The area of the apartment in square meters. Randomly generated as an integer between 60 and 199.

- **Num_Rooms**: The number of rooms in the apartment. This is a randomly generated integer between 1 and 5.

- **Age_of_Building**: The age of the building in years. Randomly generated as an integer between 1 and 19, with older buildings having lower prices due to depreciation.

- **Floor_Level**: The floor on which the apartment is located. This is a random integer between 1 and 19.

- **City**: The city in which the apartment is located. Three cities are represented, with each city influencing the base price per square meter differently:
  - **Amman**: Higher base price multiplier (1.5).
  - **Irbid**: Standard base price multiplier (1.0).
  - **Aqaba**: Medium base price multiplier (1.2).

- **Price**: The target variable representing the total price of the apartment in Jordanian Dinars (JDs). This price is calculated based on the following formula:
  
  $$
  \text{Price} = (\text{Square\_Area} \times \text{Base Price per Square Meter} \times \text{City Factor}) + (\text{Num\_Rooms} \times 5000) - (\text{Age\_of\_Building} \times 1000) + (\text{Floor\_Level} \times 1000)
  $$

  Any negative result is clipped to zero, so prices are always non-negative.
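
To make the formula concrete, here is a minimal sketch that evaluates it for a single apartment. The function name `apartment_price` and the constant names are hypothetical, chosen for illustration; the numeric values are the ones listed in this document. The first call uses the first row of the sample output shown at the end of this document, and the second uses made-up inputs (not a row from the dataset) to show the clipping at zero.

```python
# Constants as listed in this document's pricing factors.
BASE_PRICE_PER_SQM = 300      # JDs per square meter
PRICE_PER_ROOM = 5000         # JDs added per room
DEPRECIATION_PER_YEAR = 1000  # JDs subtracted per year of building age
PRICE_PER_FLOOR = 1000        # JDs added per floor level
CITY_FACTOR = {"Amman": 1.5, "Irbid": 1.0, "Aqaba": 1.2}


def apartment_price(square_area, num_rooms, age_of_building, floor_level, city):
    """Hypothetical helper: evaluate the price formula and clip negatives to zero."""
    raw = (
        square_area * BASE_PRICE_PER_SQM * CITY_FACTOR[city]
        + num_rooms * PRICE_PER_ROOM
        - age_of_building * DEPRECIATION_PER_YEAR
        + floor_level * PRICE_PER_FLOOR
    )
    return max(raw, 0.0)


# First row of the sample output: 162 m², 1 room, 15 years old, floor 12, Amman.
sample_price = apartment_price(162, 1, 15, 12, "Amman")

# Made-up inputs: a small, old apartment whose raw price would be negative.
clipped_price = apartment_price(60, 1, 40, 1, "Irbid")

print(sample_price, clipped_price)
```

The first call reproduces the 74900.0 listed for that row; the second would be -16000 before clipping and is returned as 0.0.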

## Data Generation Methodology

The data was generated using the following steps:
1. **Square Area**: Randomly generated as an integer from 60 to 199 (`np.random.randint(60, 200, n)`; the upper bound is exclusive).
2. **Number of Rooms**: Randomly generated as an integer from 1 to 5.
3. **Age of Building**: Randomly generated as an integer from 1 to 19; depreciation based on age is applied in the price calculation (step 6).
4. **Floor Level**: Randomly generated as an integer from 1 to 19.
5. **City**: Randomly selected from three options: Amman, Irbid, and Aqaba, each influencing the base price per square meter differently.
6. **Price Calculation**: The price is calculated based on the square area, number of rooms, age of the building, floor level, and city. A base price per square meter is modified by a city factor, and additional price adjustments are made based on rooms, age (depreciation), and floor level.

## Pricing Factors

The price calculation uses the following constants:
- **Base Price per Square Meter**: 300 JDs.
- **Price per Room**: 5000 JDs.
- **Depreciation per Year**: -1000 JDs (lower prices for older buildings).
- **Price per Floor Level**: 1000 JDs.
- **City Factors**:
  - **Amman**: 1.5 multiplier on the base price.
  - **Irbid**: 1.0 multiplier (no change to base price).
  - **Aqaba**: 1.2 multiplier on the base price.
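
As a quick check on these constants, the effective price per square meter in each city is the base price multiplied by the city factor:

$$
\begin{aligned}
\text{Amman:}\quad & 300 \times 1.5 = 450 \text{ JDs per square meter} \\
\text{Irbid:}\quad & 300 \times 1.0 = 300 \text{ JDs per square meter} \\
\text{Aqaba:}\quad & 300 \times 1.2 = 360 \text{ JDs per square meter}
\end{aligned}
$$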

## General Theme

This dataset reflects a **real estate market simulation** where prices are influenced by multiple apartment characteristics and a geographic factor (city). The goal is to provide a controlled, synthetic dataset that can be used to explore relationships between apartment features and prices, and to build predictive models.

The synthetic nature of this dataset allows for flexibility in adjusting parameters to suit various modeling tasks without privacy concerns. This makes it ideal for educational purposes, regression analysis, and experimenting with data transformations and machine learning techniques.

## Potential Use Cases

- **Regression Modeling**: Predicting apartment prices based on key features.
- **Feature Engineering**: Analyzing the impact of categorical encoding for the City column and of feature scaling for the numeric columns.
- **Data Visualization**: Visualizing how each feature impacts price, particularly across different cities.
- **Model Evaluation**: Testing the performance of various machine learning models (e.g., linear regression, tree-based models) on a controlled dataset.
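
The regression and feature engineering use cases can be sketched together. Because the formula multiplies `Square_Area` by a city factor, a purely additive one-hot encoding of `City` cannot represent it exactly, whereas an area column per city can. The sketch below is an illustration under stated assumptions, not the author's workflow: it builds a small stand-in sample with the same constants and `randint` ranges as the generation script (but a different seed and size) instead of reading the CSV, and it fits ordinary least squares with `np.linalg.lstsq`. All variable names are hypothetical.

```python
import numpy as np

# Hypothetical stand-in sample: same constants and integer ranges as the
# generation script, but a different seed and size (assumptions for illustration).
rng = np.random.default_rng(0)
n = 200
area = rng.integers(60, 200, n)
rooms = rng.integers(1, 6, n)
age = rng.integers(1, 20, n)
floor = rng.integers(1, 20, n)
cities = ["Amman", "Irbid", "Aqaba"]
city = rng.choice(cities, n)
factor = {"Amman": 1.5, "Irbid": 1.0, "Aqaba": 1.2}
price = (area * 300 * np.array([factor[c] for c in city])
         + rooms * 5000 - age * 1000 + floor * 1000)

# One area column per city (the area where the row is in that city, else 0),
# followed by the three additive features. No intercept, as in the formula.
X = np.column_stack(
    [area * (city == c) for c in cities] + [rooms, age, floor]
).astype(float)

# Ordinary least squares via NumPy.
coef, residuals, rank, _ = np.linalg.lstsq(X, price, rcond=None)

names = [f"area_x_{c}" for c in cities] + ["rooms", "age", "floor"]
for name, value in zip(names, coef):
    print(f"{name:>14}: {value:10.2f}")
```

With this encoding, and as long as no price was clipped to zero, the fitted coefficients should come out close to the generating constants: 450, 300, and 360 JDs per square meter for Amman, Irbid, and Aqaba, then 5000, -1000, and 1000 for rooms, age, and floor.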

## Licensing and Acknowledgments

This dataset is synthetic and was generated for educational and research purposes. There are no real-world privacy concerns, and the dataset can be freely used for analysis, modeling, and experimentation.



```python
import numpy as np
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

# Number of data points
n = 500

# Apartment size in square meters.
# Note: randint's upper bound is exclusive, so this draws integers from 60 to 199.
square_area = np.random.randint(60, 200, n)

# Generate other features (upper bounds are exclusive here as well)
num_rooms = np.random.randint(1, 6, n)  # number of rooms: 1 to 5
age_of_building = np.random.randint(1, 20, n)  # building age in years: 1 to 19
floor_level = np.random.randint(1, 20, n)  # floor level: 1 to 19
city = np.random.choice(['Amman', 'Irbid', 'Aqaba'], n)  # categorical feature: city

# Pricing factors
base_price_per_sqm = 300  # base price per square meter in JDs
price_per_room = 5000  # additional price per room
price_per_year = -1000  # depreciation per year of building age
price_per_floor = 1000  # increase in price per floor level

# City factors affecting the base price per square meter
city_factor = {'Amman': 1.5, 'Irbid': 1.0, 'Aqaba': 1.2}

# Calculate base price influenced by city factor only
base_price = square_area * base_price_per_sqm * np.array([city_factor[c] for c in city])

# Generate the target variable (price)
price = (base_price
         + num_rooms * price_per_room
         + age_of_building * price_per_year
         + floor_level * price_per_floor)

# Ensure prices are non-negative
price = np.maximum(price, 0)

# Convert to DataFrame
df = pd.DataFrame({
    'Square_Area': square_area,
    'Num_Rooms': num_rooms,
    'Age_of_Building': age_of_building,
    'Floor_Level': floor_level,
    'City': city,
    'Price': price
})

# Save the DataFrame to a CSV file (the ../datasets directory must already exist)
file_path = '../datasets/apartment_prices.csv'
df.to_csv(file_path, index=False)

# Display the first few rows of the dataset
df.head()
```


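| | Square_Area | Num_Rooms | Age_of_Building | Floor_Level | City | Price |
|---|---|---|---|---|---|---|
| 0 | 162 | 1 | 15 | 12 | Amman | 74900.0 |
| 1 | 152 | 5 | 8 | 8 | Aqaba | 79720.0 |
| 2 | 74 | 3 | 2 | 8 | Irbid | 43200.0 |
| 3 | 166 | 1 | 3 | 18 | Irbid | 69800.0 |
| 4 | 131 | 3 | 14 | 15 | Aqaba | 63160.0 |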
