# Exploratory Data Analysis of Apartment Prices in Tirana

## Objective

The goal of this notebook is to perform a thorough Exploratory Data Analysis (EDA) on a real-estate dataset stored as `house_price.json`. The dataset contains apartment listings (mainly in Tirana, Albania), including information about price, surface area, floor, number of rooms, furnishing status, and geolocation.[file:2]

This analysis aims to:
- Understand the structure and quality of the data
- Engineer meaningful features such as price per square meter
- Explore distributions of key variables
- Study relationships between price and other factors (floor, furnishing, etc.)
- Identify patterns that could be useful for future predictive modeling

The structure and methodology follow the style of the *Medical Diagnosis Assistant* notebook: data loading, quality checks, descriptive statistics, visual EDA, and interpretation.[file:1]

## 1. Libraries and Data Loading

In this section we import the necessary Python libraries and load the JSON dataset into a Pandas DataFrame. The JSON file is expected to contain a list of objects, where each object corresponds to one listing (apartment).[file:2]

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)

# Load the dataset from JSON
df = pd.read_json("house_price.json")  # ensure the file is in the same folder as this notebook

print("DATASET OVERVIEW")
print(f"Number of rows:   {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
df.head()
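
The flattened column names used later in this notebook (for example `main_property_floor`) suggest that nested listing objects were flattened with an underscore separator. If `pd.read_json` leaves nested dictionaries inside cells instead, a sketch along the following lines may help. The two records are invented for illustration and are not taken from `house_price.json`.

```python
import pandas as pd

# Hypothetical records that mimic a nested listing structure (not real data)
records = [
    {"price_in_euro": 120000, "main_property": {"floor": 3, "property_square": 85}},
    {"price_in_euro": 95000, "main_property": {"floor": 1, "property_square": 60}},
]

# json_normalize flattens nested dicts; sep="_" joins parent and child keys
flat = pd.json_normalize(records, sep="_")
print(sorted(flat.columns))
```

With real data, the list of records would come from `json.load` on the file rather than from a literal.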

### Initial Observations

- Each row represents one property listing.
- The columns contain a mix of numeric values (e.g., price, square meters, floor) and categorical/text features (e.g., description, furnishing status, formatted address).[file:2]
- To proceed, we need to inspect data types, missing values, and potential duplicates.

## 2. Data Structure and Quality Assessment

This section evaluates data quality through:
- Missing values (overall and per column)
- Duplicate rows
- Data types (numeric vs non-numeric)

This mirrors the early quality check phase used in the medical dataset notebook.[file:1]

In [None]:
print("DATA QUALITY CHECK\n")

# Total number of missing values in the whole DataFrame
missing_total = df.isna().sum().sum()
print(f"Total missing values: {missing_total}")

# Missing values per column (top 20)
print("\nMissing values per column (top 20):")
print(df.isna().sum().sort_values(ascending=False).head(20))

# Number of duplicated rows
duplicates = df.duplicated().sum()
print(f"\nDuplicate rows: {duplicates}")

# Summary of data types
print("\nData types summary:")
print(df.dtypes.value_counts())

# Full dtypes (optional for inspection)
df.dtypes
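
Raw missing-value counts are easier to judge next to the share of rows they represent. The sketch below builds such a report on a toy frame. The column names follow the notebook, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values invented for illustration)
toy = pd.DataFrame({
    "price_in_euro": [120000, 95000, np.nan, 150000],
    "main_property_has_garage": [True, None, None, False],
})

# Count of missing values per column and the same figure as a percentage of rows
n_missing = toy.isna().sum()
report = pd.DataFrame({
    "n_missing": n_missing,
    "pct_missing": (n_missing / len(toy) * 100).round(1),
}).sort_values("pct_missing", ascending=False)
print(report)
```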

### Comment on Data Quality

- The dataset contains several numeric features (e.g., `price_in_euro`, `main_property_property_square`, `main_property_floor`).[file:2]
- Some columns have missing values (especially boolean flags like `main_property_has_garage`, `main_property_has_garden`, etc.).[file:2]
- Duplicate listings, if present, should be treated carefully: in this EDA we only report them and do not remove them automatically.
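
Before deciding what to do with duplicates, it can help to look at all copies side by side. A minimal sketch on invented rows, using `keep=False` so that every member of a duplicate group is flagged:

```python
import pandas as pd

# Hypothetical listings; the first two rows are exact copies of each other
toy = pd.DataFrame({
    "price_in_euro": [120000, 120000, 95000],
    "main_property_property_square": [85, 85, 60],
    "main_property_floor": [3, 3, 1],
})

# keep=False marks every row that has at least one identical twin
dup_mask = toy.duplicated(keep=False)
print(toy[dup_mask])

# Optional: keep only the first occurrence of each duplicated row
deduped = toy.drop_duplicates()
print(len(deduped))
```

Both methods also accept a `subset=` argument, which restricts the comparison to a chosen list of columns.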

In the next steps, we will focus on a subset of key variables relevant to price analysis rather than cleaning every single field.

## 3. Understanding Key Features

From the dataset preview and the JSON sample, we identify the following core features:[file:2]

- **Price and size**:
  - `price_in_euro`: Total price in EUR (normalized field)
  - `main_property_price`: Original price value from the listing (may be total or per m²)
  - `main_property_property_square`: Surface area of the property in m²

- **Composition and structure**:
  - `main_property_property_composition_bedrooms`
  - `main_property_property_composition_bathrooms`
  - `main_property_property_composition_living_rooms`
  - `main_property_property_composition_balconies`
  - `main_property_floor`: Floor number

- **Categorical descriptors**:
  - `main_property_furnishing_status`: e.g., `unfurnished`, `partially_furnished`, `fully_furnished`
  - `main_property_property_type`: e.g., `apartment`
  - `main_property_property_status`: e.g., `for_sale`
  - `main_property_location_city_zone_city_city_name`: city (often `tirane`)
  - `main_property_location_city_zone_formatted_address`: formatted address / zone

- **Geolocation**:
  - `main_property_location_lat`, `main_property_location_lng`

Our EDA will concentrate on `price_in_euro`, `main_property_property_square`, `main_property_floor`, room counts, furnishing, and city/zone information.[file:2]
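
Because later cells depend on these column names, a quick presence check can catch schema differences early. The sketch below uses an empty toy frame with one column deliberately left out. The variable `key_columns` is a name introduced here for illustration.

```python
import pandas as pd

# Columns the analysis relies on (names taken from the list above)
key_columns = [
    "price_in_euro",
    "main_property_property_square",
    "main_property_floor",
    "main_property_furnishing_status",
]

# Hypothetical frame that lacks the furnishing column
toy = pd.DataFrame(columns=[
    "price_in_euro",
    "main_property_property_square",
    "main_property_floor",
])

present = [c for c in key_columns if c in toy.columns]
missing = [c for c in key_columns if c not in toy.columns]
print("Present:", present)
print("Missing:", missing)
```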

## 4. Descriptive Statistics for Numeric Features

We first compute descriptive statistics (count, mean, standard deviation, quartiles, min and max) for all numeric columns. This helps us understand the general scale of prices and surfaces, as well as detect possible outliers.[file:2]

In [None]:
numeric_columns = df.select_dtypes(include=[np.number]).columns
print("Numeric columns:")
print(list(numeric_columns))

print("\nDescriptive statistics (numeric features):")
df[numeric_columns].describe().T  # transposed for readability

### Interpretation of Descriptive Statistics

We can focus on a few key variables:
- `price_in_euro`: Shows the typical price level and how wide the price range is (from minimum to maximum).[file:2]
- `main_property_property_square`: Shows the distribution of apartment sizes (small studios vs large apartments).[file:2]
- `main_property_floor`: Shows which floor levels the listed apartments are on (low vs high floors). This is the apartment's own floor, so it only hints at building height.

Extremely high values in price or size may correspond to luxury or very large properties. Such observations stretch the axes of plots and can distort correlation coefficients, so they should be kept in mind when reading later sections.
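
One common convention for flagging extreme values is Tukey's 1.5 × IQR rule. It is not part of the original notebook. The sketch below applies it to invented prices to show the mechanics.

```python
import pandas as pd

# Hypothetical prices in EUR with one very expensive listing
prices = pd.Series([60000, 75000, 80000, 90000, 95000, 110000, 450000])

# Quartiles and interquartile range
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Tukey fences: values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers.tolist())
```

Flagged rows are candidates for inspection, not automatic removal, since genuine luxury listings would also fall outside the fences.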

## 5. Cleaning and Feature Engineering

To perform meaningful EDA on prices, we restrict the dataset to listings that have valid price and surface values, and focus on apartments that are currently for sale.

### Cleaning steps

1. Remove rows where `price_in_euro` or `main_property_property_square` is missing.
2. Remove rows where `main_property_property_square` is zero or negative.
3. Keep only rows where `main_property_property_type == 'apartment'` and `main_property_property_status == 'for_sale'` (if those columns exist).[file:2]
4. Create a new feature: `price_per_sqm = price_in_euro / main_property_property_square`.

The feature `price_per_sqm` (price per square meter) is a central indicator for comparing properties across different sizes and locations.[file:2]

In [None]:
# Start from the original DataFrame
df_clean = df.copy()

# Remove entries without price or square meters
df_clean = df_clean.dropna(subset=["price_in_euro", "main_property_property_square"])

# Remove non-positive square meters
df_clean = df_clean[df_clean["main_property_property_square"] > 0]

# Restrict to apartments for sale, if those columns are present
if "main_property_property_type" in df_clean.columns:
    df_clean = df_clean[df_clean["main_property_property_type"] == "apartment"]
if "main_property_property_status" in df_clean.columns:
    df_clean = df_clean[df_clean["main_property_property_status"] == "for_sale"]

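# Create price per square meter on an explicit copy, so the assignment
# does not act on a filtered slice (avoids pandas' SettingWithCopyWarning)
df_clean = df_clean.copy()
df_clean["price_per_sqm"] = df_clean["price_in_euro"] / df_clean["main_property_property_square"]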

print("CLEANED DATASET OVERVIEW")
print(f"Rows after cleaning:   {df_clean.shape[0]}")
print(f"Columns after cleaning: {df_clean.shape[1]}")
df_clean[["price_in_euro", "main_property_property_square", "price_per_sqm"]].head()

After cleaning and feature engineering:
- The dataset contains only relevant listings for analysis (apartments with valid price and size).[file:2]
- The new variable `price_per_sqm` is ready for further exploration and comparison.

From this point onward, we will work with `df_clean` only.
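
To see how many rows each cleaning step removes, the same filters can be applied one at a time while recording the row count. The sketch below does this on invented data. The category values `villa` and `for_rent` are assumptions used only to make the filters visible.

```python
import numpy as np
import pandas as pd

# Hypothetical listings, one row per failure mode plus one valid row
toy = pd.DataFrame({
    "price_in_euro": [120000, np.nan, 90000, 200000, 60000],
    "main_property_property_square": [80, 70, 0, 100, 50],
    "main_property_property_type": ["apartment", "apartment", "apartment", "villa", "apartment"],
    "main_property_property_status": ["for_sale", "for_sale", "for_sale", "for_sale", "for_rent"],
})

steps = {"start": len(toy)}

# Step 1: drop rows without price or surface
step = toy.dropna(subset=["price_in_euro", "main_property_property_square"])
steps["after_dropna"] = len(step)

# Step 2: drop non-positive surfaces
step = step[step["main_property_property_square"] > 0]
steps["after_positive_area"] = len(step)

# Step 3: keep apartments that are for sale
step = step[step["main_property_property_type"] == "apartment"]
steps["after_type_filter"] = len(step)
step = step[step["main_property_property_status"] == "for_sale"].copy()
steps["after_status_filter"] = len(step)

# Step 4: price per square meter
step["price_per_sqm"] = step["price_in_euro"] / step["main_property_property_square"]
print(steps)
```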

## 6. Univariate Analysis

In this section we examine the distributions of key numeric variables:
- `price_in_euro` (total price)
- `main_property_property_square` (surface area)
- `price_per_sqm` (price per square meter)
- `main_property_floor` (floor number)

Histograms with optional Kernel Density Estimation (KDE) help identify skewness, typical values, and potential outliers.[file:2]

In [None]:
features_to_plot = [
    "price_in_euro",
    "main_property_property_square",
    "price_per_sqm"
]

fig, axes = plt.subplots(1, len(features_to_plot), figsize=(18, 4))

for ax, col in zip(axes, features_to_plot):
    sns.histplot(df_clean[col], kde=True, ax=ax)
    ax.set_title(col)

plt.tight_layout()
plt.show()

# Floor distribution (if the column exists)
if "main_property_floor" in df_clean.columns:
    plt.figure(figsize=(6, 4))
    sns.histplot(df_clean["main_property_floor"].dropna(), bins=20, kde=False)
    plt.title("Distribution of main_property_floor")
    plt.xlabel("Floor")
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()
else:
    print("Column 'main_property_floor' not found.")

Interpretation guidelines:
- **Total price** is usually right-skewed: many mid-range apartments and a smaller number of very expensive listings.[file:2]
- **Surface area** typically shows most apartments in a medium size range, with fewer extremely large units.
- **Price per m²** is often more concentrated than total price and is useful for comparing locations.
- The **floor distribution** indicates whether listings are concentrated on low or high floors, which gives only an indirect hint about building height.
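
Skewness can also be quantified instead of judged by eye. The sketch below computes the sample skewness of invented prices before and after a base-10 log transform. The log transform is a common option for right-skewed prices and is an assumption here, not a step of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical prices in EUR with a long right tail
prices = pd.Series([60000, 75000, 80000, 90000, 95000, 110000, 450000])

# Positive skewness indicates a longer right tail
raw_skew = prices.skew()

# A log transform compresses large values and typically reduces right skew
log_skew = np.log10(prices).skew()

print(f"Skewness (raw):   {raw_skew:.2f}")
print(f"Skewness (log10): {log_skew:.2f}")
```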

## 7. Categorical Variables

We now explore the distribution of selected categorical variables:
- `main_property_furnishing_status`
- `main_property_location_city_zone_city_city_name`

This helps answer questions such as:
- How many listings are unfurnished vs fully furnished?
- Is the dataset dominated by Tirana, or are there multiple cities represented?[file:2]

In [None]:
categorical_columns = []
if "main_property_furnishing_status" in df_clean.columns:
    categorical_columns.append("main_property_furnishing_status")
if "main_property_location_city_zone_city_city_name" in df_clean.columns:
    categorical_columns.append("main_property_location_city_zone_city_city_name")

for col in categorical_columns:
    plt.figure(figsize=(7, 4))
    df_clean[col].value_counts(dropna=False).head(20).plot(kind="bar")
    plt.title(f"Counts – {col}")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()
    
    print(f"\nValue counts for {col}:")
    print(df_clean[col].value_counts(dropna=False).head(20))
    print("\n" + "-" * 60 + "\n")

Typical findings:
- The majority of listings are located in `tirane`, confirming the dataset focuses on the Tirana area.[file:2]
- Furnishing status often shows a mix of `unfurnished`, `partially_furnished`, and `fully_furnished`, which can have a direct impact on price per m².
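
Shares are often easier to compare than raw counts. A minimal sketch with invented furnishing labels, using `normalize=True` to return proportions and `dropna=False` to keep missing values as their own category:

```python
import pandas as pd

# Hypothetical furnishing labels, including one missing value
status = pd.Series([
    "unfurnished",
    "fully_furnished",
    "fully_furnished",
    None,
    "partially_furnished",
    "fully_furnished",
])

# Proportion of listings per category, missing values included
shares = status.value_counts(normalize=True, dropna=False)
print(shares.round(3))
```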

## 8. Bivariate Analysis

To understand how price behaves with respect to other variables, we analyze:
- Price vs surface area
- Price per m² vs floor groups
- Price per m² vs furnishing status

This is analogous to comparing features across classes in the medical notebook, but here the focus is on continuous outcomes (price) rather than a binary target.[file:1][file:2]

### 8.1 Price vs Surface Area

A scatter plot shows how total price varies with surface area. Coloring by city (if available) can reveal potential differences between locations.[file:2]

In [None]:
plt.figure(figsize=(7, 5))
hue_col = "main_property_location_city_zone_city_city_name" if "main_property_location_city_zone_city_city_name" in df_clean.columns else None

sns.scatterplot(
    data=df_clean,
    x="main_property_property_square",
    y="price_in_euro",
    hue=hue_col,
    alpha=0.7
)
plt.xlabel("Surface area (m²)")
plt.ylabel("Total price (EUR)")
plt.title("Price vs Surface area")
if hue_col is not None:
    plt.legend(title=hue_col, bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
plt.show()

Interpretation guidelines:
- We expect a positive relationship: larger apartments tend to have higher total prices.[file:2]
- Scatter density can reveal whether there are typical clusters (e.g., standard sizes and price levels for popular zones).
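
The visual impression from the scatter plot can be backed by two numbers: the Pearson correlation and the slope of a straight-line fit. The sketch below uses invented values. The slope can be read as an approximate price in EUR per additional m², under the assumption that a linear relationship is a reasonable description.

```python
import numpy as np
import pandas as pd

# Hypothetical surface areas (m²) and total prices (EUR)
toy = pd.DataFrame({
    "main_property_property_square": [50, 60, 80, 100, 120],
    "price_in_euro": [70000, 85000, 110000, 140000, 170000],
})

# Pearson correlation between surface and price
r = toy["main_property_property_square"].corr(toy["price_in_euro"])

# Least-squares line: price is approximately slope * area + intercept
slope, intercept = np.polyfit(
    toy["main_property_property_square"], toy["price_in_euro"], 1
)

print(f"Pearson r: {r:.3f}")
print(f"Slope:     {slope:.0f} EUR per m²")
```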

### 8.2 Price per m² by Floor Group

We discretize floor numbers into groups and compare `price_per_sqm` across those groups using a boxplot. This lets us check visually whether higher floors are associated with a premium or a discount in price per m². A boxplot is not a formal statistical test.[file:2]

In [None]:
if "main_property_floor" in df_clean.columns:
    plt.figure(figsize=(7, 5))
    # Open-ended outer bins so floors at or below -1 and above 100 are not dropped
    floor_groups = pd.cut(
        df_clean["main_property_floor"],
        bins=[-np.inf, 3, 6, 10, np.inf],
        labels=["≤3", "4–6", "7–10", ">10"]
    )
    sns.boxplot(
        x=floor_groups,
        y="price_per_sqm",
        data=df_clean
    )
    plt.xlabel("Floor (group)")
    plt.ylabel("Price per sqm (EUR/m²)")
    plt.title("Price per sqm by floor group")
    plt.tight_layout()
    plt.show()
else:
    print("Column 'main_property_floor' not found.")

If the boxplots show clear differences between floor groups, this suggests that floor level is a relevant factor for explaining price per m² (for example, higher floors with better views may be more expensive).[file:2]
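
A numeric companion to the boxplot is the median price per m² in each floor group. The sketch below uses invented values and open-ended outer bin edges, which is an assumption of this sketch so that basement floors and very high floors still land in a group.

```python
import numpy as np
import pandas as pd

# Hypothetical floors and prices per m² (EUR/m²)
toy = pd.DataFrame({
    "main_property_floor": [-1, 0, 2, 5, 6, 8, 12],
    "price_per_sqm": [900, 1000, 1100, 1300, 1400, 1500, 1700],
})

# Right-closed bins: (-inf, 3], (3, 6], (6, 10], (10, inf)
toy["floor_group"] = pd.cut(
    toy["main_property_floor"],
    bins=[-np.inf, 3, 6, 10, np.inf],
    labels=["≤3", "4–6", "7–10", ">10"],
)

# Median price per m² per group; observed=True skips empty categories
medians = toy.groupby("floor_group", observed=True)["price_per_sqm"].median()
print(medians)
```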

### 8.3 Price per m² by Furnishing Status

Next, we analyze how `price_per_sqm` varies across different furnishing categories. Fully furnished apartments are often expected to have a higher price per m² than unfurnished ones, if the market values furniture and finishing quality.[file:2]

In [None]:
if "main_property_furnishing_status" in df_clean.columns:
    plt.figure(figsize=(7, 5))
    sns.boxplot(
        data=df_clean,
        x="main_property_furnishing_status",
        y="price_per_sqm"
    )
    plt.xticks(rotation=45, ha="right")
    plt.xlabel("Furnishing status")
    plt.ylabel("Price per sqm (EUR/m²)")
    plt.title("Price per sqm by furnishing status")
    plt.tight_layout()
    plt.show()
else:
    print("Column 'main_property_furnishing_status' not found.")

When medians and interquartile ranges differ clearly across categories, this suggests that furnishing level is associated with price per m². Other factors, such as location, may contribute to the difference, so it should not be read as a causal effect.[file:2]
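
The medians and interquartile ranges shown in the boxplot can also be tabulated. A minimal sketch on invented values, built on `describe()` for each furnishing category:

```python
import pandas as pd

# Hypothetical prices per m² (EUR/m²) for two furnishing categories
toy = pd.DataFrame({
    "main_property_furnishing_status": [
        "unfurnished", "unfurnished", "unfurnished",
        "fully_furnished", "fully_furnished", "fully_furnished",
    ],
    "price_per_sqm": [1000, 1100, 1200, 1300, 1500, 1700],
})

# describe() returns count, mean, std, min, quartiles and max per group
desc = toy.groupby("main_property_furnishing_status")["price_per_sqm"].describe()

# Keep the quartiles and add the interquartile range
summary = desc[["count", "25%", "50%", "75%"]].copy()
summary["iqr"] = summary["75%"] - summary["25%"]
print(summary)
```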

## 9. Correlation Analysis

Finally, we compute the correlation matrix for numeric features and visualize it with a heatmap. This step is similar to the correlation analysis performed in the medical diagnosis notebook, but here the focus is on price-related variables.[file:1][file:2]

We pay particular attention to correlations with:
- `price_in_euro`
- `price_per_sqm`

In [None]:
# Select numeric columns from the cleaned dataset
numeric_clean = df_clean.select_dtypes(include=[np.number]).columns
corr = df_clean[numeric_clean].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(
    corr,
    cmap="coolwarm",
    center=0,
    annot=False,
    linewidths=0.5,
    square=True
)
plt.title("Correlation matrix (numeric features)")
plt.tight_layout()
plt.show()

# Correlations with key targets
for target in ["price_in_euro", "price_per_sqm"]:
    if target in corr.columns:
        print(f"\nCorrelations with {target}:")
        print(corr[target].sort_values(ascending=False))
    else:
        print(f"\n{target} not found in correlation matrix.")

Typical expectations:
- `price_in_euro` tends to correlate strongly and positively with `main_property_property_square`, because larger apartments cost more in total.[file:2]
- `price_per_sqm` may correlate with floor, number of bedrooms, and location coordinates, which would suggest that some structural and locational aspects are associated with how expensive each square meter is. Correlation alone does not establish influence.[file:2]

These insights are useful for selecting features and forming hypotheses for regression or other predictive models in future work.
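
Pearson correlation, the default of `DataFrame.corr`, is sensitive to extreme prices. Spearman rank correlation is a common complement for skewed data. It is not used in the original notebook. The sketch below compares both on invented values with one very expensive listing.

```python
import pandas as pd

# Hypothetical data: price rises with area, with one extreme listing
toy = pd.DataFrame({
    "main_property_property_square": [50, 60, 80, 100, 120],
    "price_in_euro": [70000, 85000, 110000, 140000, 900000],
})

# Pearson measures linear association; Spearman measures monotonic association
pearson = toy.corr(method="pearson").loc[
    "main_property_property_square", "price_in_euro"
]
spearman = toy.corr(method="spearman").loc[
    "main_property_property_square", "price_in_euro"
]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because the ordering of prices matches the ordering of areas exactly, the rank-based coefficient is not pulled down by the extreme value, while the linear one is.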

## 10. Summary of Findings

The EDA of the `house_price.json` dataset has shown that:[file:2]
- The dataset primarily contains apartments for sale in Tirana, with rich textual descriptions and structured attributes.[file:2]
- After basic cleaning and filtering, we can focus on a consistent subset with valid price and surface information.
- The engineered feature `price_per_sqm` is a central metric for comparing listings across different sizes and zones.[file:2]
- Visual analysis confirms intuitive patterns: a larger surface area is associated with a higher total price, while price per m² varies across floors and furnishing categories.
- Correlations between numeric features reveal which variables are most strongly associated with price and can guide feature selection for predictive modeling.

This notebook provides a structured foundation for building a house price prediction model, following a similar workflow to the medical diagnosis assistant project (data understanding → EDA → preprocessing → modeling).[file:1]