# House Price Analysis from JSON
## Exploring Apartment Listings in Tirana

### The Problem

Real estate markets are driven by many factors: location, size, floor, number of rooms, furnishing level, and other amenities. Understanding how these variables relate to apartment prices is essential for buyers, sellers, and agents.

In this notebook we analyze a JSON dataset containing apartment listings (mainly in Tirana, Albania). Each listing describes:
- Price (in EUR)
- Surface area in square meters
- Number of rooms (bedrooms, bathrooms, living rooms, balconies)
- Floor
- Furnishing status
- Location (address and coordinates)

### Our Mission

We will:
1. **Load and inspect the data** from a JSON file
2. **Convert JSON → Pandas DataFrame** for easier analysis
3. **Clean and rename columns** to more meaningful English names
4. **Perform EDA (Exploratory Data Analysis)** to understand distributions and relationships


## Phase 1: Core Libraries

We start by importing the core Python libraries for data analysis and visualization, and by setting some basic plotting preferences.

In [None]:
# Standard library
import warnings

# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Visualization settings
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

# Suppress warnings for cleaner output (this also hides deprecation notices)
warnings.filterwarnings("ignore")

print("Libraries loaded successfully!")

## Phase 2: Load JSON and Inspect Raw Data

In this step we load the `house_price.json` file into a Pandas DataFrame, check its size, and preview the raw columns and a few example rows.

In [None]:
# Load data from JSON file
df = pd.read_json("house_price.json")

print("DATASET OVERVIEW")
print(f"Number of listings (rows): {df.shape[0]}")
print(f"Number of features (columns): {df.shape[1]}")

print("\nColumn names (raw):")
print(df.columns.tolist())

# Preview first 5 rows
df.head()
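The layout of `house_price.json` is not shown here, so the following is only a sketch of how `pd.read_json` handles one common layout: a JSON array of flat records. The two listings and their values are invented for illustration; the keys reuse raw column names that appear later in this notebook.

```python
import json
from io import StringIO

import pandas as pd

# Two hypothetical listings (values invented for illustration)
records = [
    {"price_in_euro": 95000, "main_property_property_square": 80, "main_property_floor": 3},
    {"price_in_euro": 150000, "main_property_property_square": 110, "main_property_floor": 7},
]

# Wrap the JSON text in StringIO so read_json treats it as a file-like object
toy_df = pd.read_json(StringIO(json.dumps(records)))

print(toy_df.shape)             # one row per record, one column per key
print(toy_df.columns.tolist())
```

If the real file nests objects (for example a `main_property` dictionary inside each record), `pd.json_normalize` would likely be needed to flatten it before analysis.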

## Phase 3: Column Renaming and Selection

The raw JSON uses long, technical column names. Here we map them to shorter, clearer names and then select the main variables of interest for analysis (price, size, rooms, floor, and key amenities).

In [None]:
# Make a copy and rename columns to clearer English names
df_renamed = df.copy()

df_renamed = df_renamed.rename(columns={
    "main_property_description_text_content_original_text": "description",
    "main_property_floor": "floor",
    "main_property_furnishing_status": "furnishing_status",
    "main_property_has_carport": "has_carport",
    "main_property_has_elevator": "has_elevator",
    "main_property_has_garage": "has_garage",
    "main_property_has_garden": "has_garden",
    "main_property_has_parking_space": "has_parking_space",
    "main_property_has_terrace": "has_terrace",
    "main_property_location_city_zone_city_city_name": "city",
    "main_property_price_currency": "price_currency",
    "main_property_property_composition_balconies": "balconies",
    "main_property_property_composition_bathrooms": "bathrooms",
    "main_property_property_composition_bedrooms": "bedrooms",
    "main_property_property_composition_kitchens": "kitchens",
    "main_property_property_composition_living_rooms": "living_rooms",
    "main_property_property_status": "property_status",
    "main_property_property_type": "property_type",
    "price_in_euro": "price_eur",
    "main_property_property_square": "area_sqm"
})

print("Columns after renaming:")
print(df_renamed.columns.tolist())

# Keep and order main analysis columns
main_cols = [
    "price_eur",
    "area_sqm",
    "floor",
    "bedrooms",
    "bathrooms",
    "balconies",
    "living_rooms",
    "furnishing_status",
    "has_elevator",
    "has_parking_space",
    "has_garage",
    "has_terrace",
    "has_garden",
    "city",
    "property_type",
    "property_status",
    "price_currency",
    "description"
]

# Only keep columns that actually exist
main_cols = [c for c in main_cols if c in df_renamed.columns]

df_clean = df_renamed[main_cols].copy()

print("\nShape of cleaned DataFrame:", df_clean.shape)
df_clean.head()
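One detail worth knowing about the renaming step: `DataFrame.rename` silently ignores mapping keys that do not match any column, which is why the existence filter on `main_cols` matters. The sketch below shows this on a toy frame; the data and the `wanted`, `present`, and `missing` names are hypothetical.

```python
import pandas as pd

# Toy frame with only two of the raw columns (values invented)
toy = pd.DataFrame({
    "price_in_euro": [95000, 150000],
    "main_property_floor": [3, 7],
})

mapping = {
    "price_in_euro": "price_eur",
    "main_property_floor": "floor",
    "main_property_has_garden": "has_garden",  # not present in the toy frame
}

# Keys that match no column are ignored by default (no error is raised)
renamed = toy.rename(columns=mapping)

wanted = ["price_eur", "floor", "has_garden"]
present = [c for c in wanted if c in renamed.columns]
missing = [c for c in wanted if c not in renamed.columns]

print("Present:", present)
print("Missing:", missing)
```

Printing the missing names makes it easier to spot a typo in the mapping than silently dropping the column.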

## Phase 4: Data Quality Checks

Before plotting, we inspect basic statistics, missing values, and duplicates to understand the overall data quality. The summary statistics also help flag outliers and implausible values, such as zero areas or extreme prices.

In [None]:
# Basic info
print("DATA TYPES AND NON-NULL COUNTS")
df_clean.info()

# Missing values per column
print("\nMISSING VALUES PER COLUMN")
print(df_clean.isna().sum())

# Duplicates
num_duplicates = df_clean.duplicated().sum()
print(f"\nNumber of duplicated rows: {num_duplicates}")

# Basic numeric description (only for numeric columns)
print("\nNUMERIC SUMMARY STATISTICS")
print(df_clean.describe(include=["number"]))
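Raw missing-value counts can be hard to judge without the row total, and the duplicate count above is reported but not acted on. As a possible follow-up, the sketch below computes the share of missing values per column and drops exact duplicates on a toy frame (values invented). Whether duplicates should be dropped in the real data is a judgment call this notebook does not make.

```python
import numpy as np
import pandas as pd

# Toy frame: one exact duplicate row and one missing price (values invented)
toy = pd.DataFrame({
    "price_eur": [95000, 95000, 150000, np.nan],
    "area_sqm": [80, 80, 110, 60],
})

# Fraction of missing values per column (mean of a boolean mask)
missing_share = toy.isna().mean()
print(missing_share)

# Drop rows that are identical across all columns, keeping the first occurrence
deduped = toy.drop_duplicates()
print(f"Rows before: {len(toy)}, after: {len(deduped)}")
```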

## Phase 5: Feature Engineering – Price per Square Meter

A key indicator for real estate is the price per square meter. We derive this feature from total price and area, then explore its distribution.

In [None]:
# Create price_per_sqm; rows with a missing or non-positive area get NaN
# (pandas returns inf for division by zero rather than raising an error)
df_clean["price_per_sqm"] = np.where(
    (df_clean["area_sqm"] > 0) & (~df_clean["area_sqm"].isna()),
    df_clean["price_eur"] / df_clean["area_sqm"],
    np.nan
)

print("New column added: price_per_sqm")
df_clean[["price_eur", "area_sqm", "price_per_sqm"]].head()
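To make the guard concrete, here is a small worked example on invented values. It uses `Series.where` as an alternative way to mask invalid areas before dividing; this is a sketch of the same idea, not the notebook's own implementation.

```python
import numpy as np
import pandas as pd

# Toy listings: a normal row, a zero area, a missing area, another normal row
toy = pd.DataFrame({
    "price_eur": [100000, 150000, 80000, 60000],
    "area_sqm": [80, 0, np.nan, 50],
})

# Keep only strictly positive areas; everything else becomes NaN
valid_area = toy["area_sqm"].where(toy["area_sqm"] > 0)

# Dividing by NaN yields NaN, so invalid rows never produce inf
toy["price_per_sqm"] = toy["price_eur"] / valid_area

print(toy)
# price_per_sqm: 1250.0, NaN, NaN, 1200.0
```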

## Phase 6: Univariate Distributions

In this section we inspect the distributions of the main numeric variables: total price, area in square meters, and price per square meter. Simple histograms show whether the values are concentrated or heavily skewed.

In [None]:
# Histogram of total price in EUR (rows at or above the 99th percentile are excluded to reduce the influence of extreme outliers)
plt.figure()
sns.histplot(data=df_clean[df_clean["price_eur"] < df_clean["price_eur"].quantile(0.99)],
             x="price_eur", bins=50, kde=True)
plt.title("Distribution of Apartment Prices (EUR) – 99th Percentile Cap")
plt.xlabel("Price in EUR")
plt.ylabel("Count of listings")
plt.show()

# Histogram of area in square meters
plt.figure()
sns.histplot(data=df_clean[df_clean["area_sqm"] < df_clean["area_sqm"].quantile(0.99)],
             x="area_sqm", bins=50, kde=True, color="orange")
plt.title("Distribution of Apartment Area (sqm) – 99th Percentile Cap")
plt.xlabel("Area (sqm)")
plt.ylabel("Count of listings")
plt.show()

# Histogram of price per square meter
plt.figure()
sns.histplot(data=df_clean[df_clean["price_per_sqm"] < df_clean["price_per_sqm"].quantile(0.99)],
             x="price_per_sqm", bins=50, kde=True, color="green")
plt.title("Distribution of Price per Square Meter (EUR/sqm) – 99th Percentile Cap")
plt.xlabel("EUR per sqm")
plt.ylabel("Count of listings")
plt.show()
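The three histograms repeat the same filter expression. One way to reduce the repetition is a small helper; `below_quantile` is a hypothetical name introduced here, not part of pandas. Note that this filter drops rows rather than clipping values, and rows with a missing value in the column are dropped as well.

```python
import pandas as pd


def below_quantile(frame, column, q=0.99):
    """Return rows whose value in `column` is strictly below its q-th quantile."""
    cutoff = frame[column].quantile(q)
    return frame[frame[column] < cutoff]


# Toy data: prices 1..100 (values invented)
toy = pd.DataFrame({"price_eur": list(range(1, 101))})
trimmed = below_quantile(toy, "price_eur", q=0.99)

print(len(toy), "->", len(trimmed))  # 100 -> 99
```

With the helper in place, a histogram call could take `data=below_quantile(df_clean, "price_eur")` instead of the inline boolean mask.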

## Phase 7: Bivariate Analysis – Price vs Size and Floor

Here we examine how total price varies with apartment size and how price per square meter varies with floor. A scatter plot and a grouped boxplot allow us to observe general trends and the presence of outliers.

In [None]:
# Scatter: Area vs Price
plt.figure()
sns.scatterplot(data=df_clean,
                x="area_sqm",
                y="price_eur",
                alpha=0.4)
plt.title("Relationship between Area (sqm) and Price (EUR)")
plt.xlabel("Area (sqm)")
plt.ylabel("Price (EUR)")
plt.show()

# Boxplot: Floor vs Price per sqm (grouping floors)
# Bins are right-closed: (-2, 0], (0, 3], ...; floors outside (-2, 40] become NaN
df_clean["floor_group"] = pd.cut(
    df_clean["floor"],
    bins=[-2, 0, 3, 6, 10, 40],
    labels=["Basement/0", "1-3", "4-6", "7-10", ">10"]
)

plt.figure()
sns.boxplot(data=df_clean, x="floor_group", y="price_per_sqm")
plt.title("Price per sqm by Floor Group")
plt.xlabel("Floor group")
plt.ylabel("EUR per sqm")
plt.show()
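Because `pd.cut` uses right-closed intervals by default, it helps to check which floor lands in which group. The sketch below applies the same bins and labels to a handful of invented floor values, including the edge cases.

```python
import pandas as pd

# Invented floor values covering the bin edges and out-of-range cases
floors = pd.Series([-2, -1, 0, 1, 3, 4, 10, 11, 45])

groups = pd.cut(
    floors,
    bins=[-2, 0, 3, 6, 10, 40],
    labels=["Basement/0", "1-3", "4-6", "7-10", ">10"],
)

print(pd.DataFrame({"floor": floors, "floor_group": groups}))
# -2 and 45 map to NaN: -2 sits on the open left edge, 45 is above the last bin
```

Listings whose floor falls outside the bins get a missing `floor_group`, so they would not appear in a boxplot grouped on that column. Passing `include_lowest=True` would bring floor -2 into the first group if that is desired.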

## Phase 8: Categorical Variables – Furnishing and City

Furnishing status and city can also influence price. We check how price per square meter behaves across these categorical groups using boxplots.

In [None]:
# Boxplot: Furnishing status vs price per sqm
plt.figure()
sns.boxplot(data=df_clean, x="furnishing_status", y="price_per_sqm")
plt.title("Price per sqm by Furnishing Status")
plt.xlabel("Furnishing status")
plt.ylabel("EUR per sqm")
plt.xticks(rotation=45)
plt.show()

# To avoid clutter, keep only the top 5 cities by count
top_cities = df_clean["city"].value_counts().head(5).index
df_top_cities = df_clean[df_clean["city"].isin(top_cities)]

plt.figure()
sns.boxplot(data=df_top_cities, x="city", y="price_per_sqm")
plt.title("Price per sqm for Top 5 Cities")
plt.xlabel("City")
plt.ylabel("EUR per sqm")
plt.xticks(rotation=45)
plt.show()
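Boxplots do not show how many listings sit behind each box, and a group with very few listings can look misleadingly extreme. A grouped summary with counts and medians can complement them. The sketch below uses invented data, and the category labels `furnished` and `unfurnished` are assumptions, since the actual values of `furnishing_status` are not shown.

```python
import pandas as pd

# Toy data (values and category labels invented for illustration)
toy = pd.DataFrame({
    "furnishing_status": ["furnished", "furnished", "unfurnished", "unfurnished", None],
    "price_per_sqm": [1500.0, 1700.0, 1200.0, 1000.0, 1300.0],
})

# Count and median per group; rows with a missing group key are excluded by default
summary = (
    toy.groupby("furnishing_status")["price_per_sqm"]
    .agg(["count", "median"])
    .sort_values("median", ascending=False)
)

print(summary)
```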

## Phase 9: Correlation Matrix for Numeric Features

Finally, we examine the linear correlations between the main numeric variables, including `price_per_sqm`. This helps identify which factors are most strongly associated with price and price per square meter.

In [None]:
# Compute correlation matrix for numeric columns
numeric_df = df_clean.select_dtypes(include=["number"])
corr = numeric_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", center=0, linewidths=0.5, square=True)
plt.title("Correlation Matrix of Numeric Features")
plt.tight_layout()
plt.show()

# Correlations with price_per_sqm
if "price_per_sqm" in corr.columns:
    print("Correlations with price_per_sqm (sorted):")
    print(corr["price_per_sqm"].sort_values(ascending=False))
else:
    print("price_per_sqm not found in correlation matrix.")
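`DataFrame.corr` computes the Pearson coefficient by default, which measures linear association only. As a worked illustration on invented values, the sketch below computes it by hand from the definition r = sum((x - mean_x)(y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2)) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Toy data (values invented for illustration)
toy = pd.DataFrame({
    "area_sqm": [50.0, 70.0, 90.0, 120.0],
    "price_eur": [60000.0, 90000.0, 100000.0, 160000.0],
})

x = toy["area_sqm"].to_numpy()
y = toy["price_eur"].to_numpy()

# Deviations from the mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r from its definition
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Same quantity via pandas (Pearson is the default method)
r_pandas = toy["area_sqm"].corr(toy["price_eur"])

print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```

A coefficient near zero does not rule out a non-linear relationship, and a high coefficient does not by itself show that one variable causes the other.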

## Conclusions

This notebook has:
- Loaded apartment listing data from a JSON file and checked it for missing values and duplicates.
- Renamed columns to shorter, more intuitive names and created a focused analysis DataFrame.
- Engineered a key feature (price per square meter) and explored its distribution.
- Performed univariate and bivariate EDA, including correlations, to see which variables are most strongly associated with apartment prices.

These steps provide a starting point for building a predictive model of apartment prices in a separate notebook.