# House Price Analysis from JSON
## Exploring Apartment Listings in Tirana

### The Problem

Real estate markets are driven by many factors: location, size, floor, number of rooms, furnishing level, and other amenities. Understanding how these variables relate to apartment prices is essential for buyers, sellers, and agents.

In this notebook we analyze a JSON dataset containing apartment listings (mainly in Tirana, Albania). Each listing describes:
- Price (in EUR)
- Surface area in square meters
- Number of rooms (bedrooms, bathrooms, living rooms, balconies)
- Floor
- Furnishing status
- Location (address and coordinates)

### Our Mission

We will:
1. **Load and inspect the data** from a JSON file
2. **Convert JSON → Pandas DataFrame** for easier analysis
3. **Perform EDA (Exploratory Data Analysis)** to understand distributions and relationships

This structure mirrors the early phases of your previous notebook (data loading, exploration, and EDA) but applied to a house price dataset instead of a medical dataset.

## Phase 1: Core Libraries

We start by importing the core Python libraries for data analysis and visualization, and by setting some basic plotting preferences.

In [None]:
# Core Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Visualization Settings
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")

print("Libraries loaded successfully!")

## Phase 2: Data Loading & Exploration

Before any modeling or advanced analysis, we need to understand the data we have. The key questions are:
1. **Size:** How many listings do we have?
2. **Structure:** What are the main features and what do they represent?
3. **Quality:** Are there missing values or anomalies?
4. **Distribution:** What do the numbers look like for price and size?

The data is stored in a JSON file named `house_price.json`.

In [None]:
# 1. Load the dataset from JSON
# The JSON file contains a list of property objects
df = pd.read_json("house_price.json")

# 2. Basic Information
print("DATASET OVERVIEW")
print(f"Number of Listings: {df.shape[0]}")
print(f"Number of Features: {df.shape[1]}")

# First 5 rows
df.head()
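
`pd.read_json` works when the file is already a flat list of records. The long column names used in this notebook (for example `main_property_property_square`) suggest the data may originally have been nested; if that is the case, `pd.json_normalize` is one way to flatten it. The sketch below is illustrative only: the nested record is hypothetical, and the real file's layout may differ.

```python
import pandas as pd

# Hypothetical nested record (an assumption, not taken from house_price.json).
records = [
    {
        "price_in_euro": 120000,
        "main_property": {
            "floor": 3,
            "property": {"square": 85.0, "composition": {"bedrooms": 2}},
        },
    }
]

# sep="_" joins nested keys with underscores, producing names such as
# main_property_property_square.
flat = pd.json_normalize(records, sep="_")
print(sorted(flat.columns))
```

If the file is already flat, this step is unnecessary and `pd.read_json` is enough.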

### Understanding the Features

From the JSON structure, each row represents a single apartment listing. Important features include:

- **Price and Size**:
  - `price_in_euro`: Final price in EUR
  - `main_property_price`: Original price in the listing
  - `main_property_property_square`: Surface area (m²)

- **Composition**:
  - `main_property_property_composition_bedrooms`
  - `main_property_property_composition_bathrooms`
  - `main_property_property_composition_living_rooms`
  - `main_property_property_composition_balconies`

- **Location**:
  - `main_property_location_city_zone_city_city_name` (e.g., `tirane`)
  - `main_property_location_city_zone_formatted_address`
  - `main_property_location_lat`, `main_property_location_lng`

- **Other details**:
  - `main_property_floor`
  - `main_property_furnishing_status`
  - `main_property_property_type` (e.g., `apartment`)
  - `main_property_property_status` (e.g., `for_sale`)

Next we inspect data quality and basic statistics for these features.

### Data Quality Check

We now verify whether the dataset is ready for analysis by checking:
- Missing values
- Duplicate rows
- Data types (numeric vs non-numeric)

This mirrors the quality checks in your previous notebook (missing, duplicates, dtypes, and sample rows).

In [None]:
print("\nDATA QUALITY CHECK")

# Missing values
missing_total = df.isna().sum().sum()
print(f"Missing Values (total): {missing_total}")

print("\nMissing values per column (top 20):")
print(df.isna().sum().sort_values(ascending=False).head(20))

# Duplicate rows
duplicates = df.duplicated().sum()
print(f"\nDuplicate Rows: {duplicates}")

# Data types
print("\nData Types (summary):")
print(df.dtypes.value_counts())

print("\nSAMPLE DATA (First 5 Rows)")
df.head()

The output typically shows:
- Many features are numeric (prices, sizes, coordinates, floors).
- Several boolean or categorical fields may have missing values (e.g., `main_property_has_garage`, `main_property_has_garden`).
- A few duplicate rows may exist; for pure EDA we report them but do not necessarily drop them.
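
Raw missing counts are easier to judge next to the share of rows they represent. The helper below is a minimal sketch of such a report; `missing_report` and the toy frame are hypothetical names and data used only for illustration.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(1),
    })
    return report.sort_values("missing", ascending=False)

# Toy data standing in for the real listings (values are made up).
toy = pd.DataFrame({
    "price_in_euro": [100000, 85000, np.nan, 85000],
    "main_property_has_garage": [True, None, None, None],
})

report = missing_report(toy)
print(report)
print("Duplicate rows:", toy.duplicated().sum())
```

On the real data the same function could be called as `missing_report(df)`.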

Next we compute statistical summaries for the numeric features.

## Phase 3: Exploratory Data Analysis (EDA)

In this phase we:
1. Summarize numeric variables (prices, sizes, floors)
2. Engineer a useful feature: **price per square meter**
3. Visualize distributions of price and size
4. Explore relationships between price and other variables

The goal is to understand patterns in the housing market, similar to how EDA was used to understand medical measurements in the previous project.

### 3.1 Statistical Summary

We start with a statistical summary of all numeric columns (count, mean, standard deviation, min, quartiles, max). This gives us a quick overview of price ranges and apartment sizes.

In [None]:
numeric_cols = df.select_dtypes(include=[np.number]).columns
print("NUMERIC COLUMNS:")
print(list(numeric_cols))

print("\nSTATISTICAL SUMMARY (numeric features):")
df[numeric_cols].describe().T  # Transpose for better readability

From this summary we focus on:
- `price_in_euro`: how expensive the apartments are on average and the range of prices
- `main_property_property_square`: typical apartment sizes
- `main_property_floor`: distribution of floors

Very large values may correspond to luxury or unusually large apartments; they will stretch the axes of later visualizations, so it is worth noting them now.
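
One common way to flag such extreme values is the interquartile-range (IQR) rule. This is a sketch of that general technique, not a step the notebook itself performs; `iqr_bounds`, the multiplier `k=1.5`, and the toy prices are assumptions for illustration.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up prices with one obviously extreme listing.
prices = pd.Series([60000, 80000, 90000, 100000, 120000, 950000])

lower, upper = iqr_bounds(prices)
flagged = prices[(prices < lower) | (prices > upper)]
print(f"Fences: {lower:.0f} .. {upper:.0f}")
print(flagged.tolist())
```

Flagged rows are candidates for inspection, not automatic removal: a genuine luxury listing is still valid data.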

### 3.2 Cleaning and Feature Engineering

For price analysis we need valid price and size information. We:
- Drop rows with missing `price_in_euro` or `main_property_property_square`
- Remove rows where size is zero or negative
- Keep only apartments for sale (if that information exists)
- Create `price_per_sqm = price_in_euro / main_property_property_square`

The new feature `price_per_sqm` lets us compare listings of different sizes on a like-for-like basis.

In [None]:
# Work on a cleaned copy
df_clean = df.copy()

# Keep only entries with valid price and size
df_clean = df_clean.dropna(subset=["price_in_euro", "main_property_property_square"])
df_clean = df_clean[df_clean["main_property_property_square"] > 0]

# Optionally focus on apartments for sale
if "main_property_property_type" in df_clean.columns:
    df_clean = df_clean[df_clean["main_property_property_type"] == "apartment"]
if "main_property_property_status" in df_clean.columns:
    df_clean = df_clean[df_clean["main_property_property_status"] == "for_sale"]

# Create price per square meter
df_clean["price_per_sqm"] = df_clean["price_in_euro"] / df_clean["main_property_property_square"]

print("CLEANED DATASET OVERVIEW")
print(f"Rows after cleaning:   {df_clean.shape[0]}")
print(f"Columns after cleaning: {df_clean.shape[1]}")
df_clean[["price_in_euro", "main_property_property_square", "price_per_sqm"]].head()

`df_clean` is now our main dataset for EDA, focusing on apartment listings with valid price and surface.

We can now analyze distributions and relationships.
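
If price or size arrive as strings in the JSON (an assumption; the notebook does not say so), the division above would fail or misbehave. The sketch below wraps the same cleaning steps in a hypothetical helper, `add_price_per_sqm`, and coerces both columns to numbers first.

```python
import pandas as pd

def add_price_per_sqm(frame, price_col="price_in_euro",
                      area_col="main_property_property_square"):
    """Coerce price/area to numeric, drop invalid rows, add price_per_sqm."""
    out = frame.copy()
    # Non-numeric entries become NaN and are dropped below.
    out[price_col] = pd.to_numeric(out[price_col], errors="coerce")
    out[area_col] = pd.to_numeric(out[area_col], errors="coerce")
    out = out.dropna(subset=[price_col, area_col])
    out = out[out[area_col] > 0].copy()
    out["price_per_sqm"] = out[price_col] / out[area_col]
    return out

# Toy rows: one valid, one with a bad area, one with no price, one with zero area.
toy = pd.DataFrame({
    "price_in_euro": ["120000", 90000, None, 50000],
    "main_property_property_square": [80, "n/a", 60, 0],
})

cleaned = add_price_per_sqm(toy)
print(cleaned)
```

Only the first toy row survives, because the other three fail one of the validity checks.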

### 3.3 Distributions of Price and Size

We visualize the distributions of:
- `price_in_euro`
- `main_property_property_square`
- `price_per_sqm`
- `main_property_floor` (if available)

Histograms help us see whether the variables are symmetric, skewed, or contain extreme outliers.

In [None]:
features_to_plot = [
    "price_in_euro",
    "main_property_property_square",
    "price_per_sqm"
]

fig, axes = plt.subplots(1, len(features_to_plot), figsize=(18, 4))

for ax, col in zip(axes, features_to_plot):
    sns.histplot(df_clean[col], kde=True, ax=ax)
    ax.set_title(col)

plt.tight_layout()
plt.show()

# Floor distribution (optional)
if "main_property_floor" in df_clean.columns:
    plt.figure(figsize=(6, 4))
    sns.histplot(df_clean["main_property_floor"].dropna(), bins=20, kde=False)
    plt.title("Distribution of main_property_floor")
    plt.xlabel("Floor")
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()
else:
    print("Column 'main_property_floor' not found.")

Typical patterns:
- Total prices are often right-skewed: many mid-priced apartments, fewer very expensive ones.
- Surface areas tend to cluster around common apartment sizes (e.g., 50–100 m²).
- Price per m² may be more concentrated and is easier to compare across zones.
- Floor distribution reveals how many listings are on low vs high floors.
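
Right skew can also be quantified rather than judged by eye. The sketch below compares the sample skewness of made-up prices before and after a log transform, a common remedy for right-skewed monetary values; it is an illustration on toy numbers, not a result from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up prices: mostly mid-range, one very expensive listing.
prices = pd.Series([60000, 70000, 80000, 90000, 100000, 120000, 150000, 900000])

raw_skew = prices.skew()
log_skew = np.log1p(prices).skew()

print(f"Skewness of raw prices: {raw_skew:.2f}")
print(f"Skewness of log prices: {log_skew:.2f}")
```

A positive skewness indicates a long right tail; plotting `np.log1p(df_clean["price_in_euro"])` instead of the raw column is one option when the tail dominates the histogram.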

### 3.4 Price vs Surface Area

We now examine the relationship between total price and surface area. A scatter plot helps us see how price grows with size and whether there are different clusters (e.g., by city).

In [None]:
plt.figure(figsize=(7, 5))
hue_col = "main_property_location_city_zone_city_city_name" if "main_property_location_city_zone_city_city_name" in df_clean.columns else None

sns.scatterplot(
    data=df_clean,
    x="main_property_property_square",
    y="price_in_euro",
    hue=hue_col,
    alpha=0.7
)
plt.xlabel("Surface area (m²)")
plt.ylabel("Total price (EUR)")
plt.title("Price vs Surface area")
if hue_col is not None:
    plt.legend(title=hue_col, bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
plt.show()

Observations:
- There is usually a clear positive trend: larger apartments have higher total prices.
- If different colors are visible, they can indicate differences between cities or zones (e.g., different price levels for the same size).
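
The strength of the trend can be summarized with a correlation coefficient and a straight-line fit. The following is a minimal sketch on made-up numbers; the variable names are hypothetical, and a linear fit is only one possible summary.

```python
import numpy as np
import pandas as pd

# Made-up listings (area in m², price in EUR).
toy = pd.DataFrame({
    "main_property_property_square": [45, 60, 75, 90, 120],
    "price_in_euro": [70000, 88000, 115000, 130000, 185000],
})

# Pearson correlation between size and price.
corr = toy["main_property_property_square"].corr(toy["price_in_euro"])

# Degree-1 polynomial fit: price ≈ slope * area + intercept.
slope, intercept = np.polyfit(
    toy["main_property_property_square"], toy["price_in_euro"], 1
)

print(f"Pearson correlation: {corr:.3f}")
print(f"Fitted line: price = {slope:.0f} * area + {intercept:.0f}")
```

The slope can be read as an approximate marginal price per additional square meter, which is related to but not the same as the average `price_per_sqm`.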

### 3.5 Price per m² by Floor Group

We group floors into categories and compare `price_per_sqm` across these groups using a boxplot. This can show whether lower or higher floors tend to be more expensive per square meter.

In [None]:
if "main_property_floor" in df_clean.columns:
    plt.figure(figsize=(7, 5))
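    # Open-ended outer bins so basement (negative) and very high floors are not silently dropped
    floor_groups = pd.cut(
        df_clean["main_property_floor"],
        bins=[-np.inf, 3, 6, 10, np.inf],
        labels=["≤3", "4–6", "7–10", ">10"]
    )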
    sns.boxplot(
        x=floor_groups,
        y="price_per_sqm",
        data=df_clean
    )
    plt.xlabel("Floor group")
    plt.ylabel("Price per sqm (EUR/m²)")
    plt.title("Price per sqm by floor group")
    plt.tight_layout()
    plt.show()
else:
    print("Column 'main_property_floor' not found.")

If the boxplots differ visibly between groups, floor is likely associated with price per m², although the plot alone does not show that floor causes the difference.

For example, apartments with better views or less noise (often higher floors) may have higher price per m².
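
A numeric table is a useful companion to the boxplot, especially for checking how many listings fall into each group. The sketch below uses made-up floors and prices; the ASCII group labels and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up listings.
toy = pd.DataFrame({
    "main_property_floor": [0, 2, 5, 6, 8, 12],
    "price_per_sqm": [1200, 1400, 1500, 1700, 1600, 2000],
})

# Same cut points as the boxplot, with open-ended outer bins.
floor_group = pd.cut(
    toy["main_property_floor"],
    bins=[-np.inf, 3, 6, 10, np.inf],
    labels=["<=3", "4-6", "7-10", ">10"],
)

# Count and median price per m² for each floor group.
summary = toy.groupby(floor_group, observed=True)["price_per_sqm"].agg(
    ["count", "median"]
)
print(summary)
```

Groups with very few listings deserve caution: a median based on one or two apartments says little about the floor effect.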

### 3.6 Price per m² by Furnishing Status

Finally, we compare `price_per_sqm` across different furnishing levels (unfurnished, partially furnished, fully furnished). Furnishing can influence how much buyers are willing to pay per square meter.

In [None]:
if "main_property_furnishing_status" in df_clean.columns:
    plt.figure(figsize=(7, 5))
    sns.boxplot(
        data=df_clean,
        x="main_property_furnishing_status",
        y="price_per_sqm"
    )
    plt.xticks(rotation=45, ha="right")
    plt.xlabel("Furnishing status")
    plt.ylabel("Price per sqm (EUR/m²)")
    plt.title("Price per sqm by furnishing status")
    plt.tight_layout()
    plt.show()
else:
    print("Column 'main_property_furnishing_status' not found.")

Clear differences between categories (in median or spread) suggest that furnishing level is associated with price per square meter, although a boxplot alone cannot separate its effect from correlated factors such as location.
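
Boxplots are easier to compare when the categories are sorted by their median. The sketch below computes such an ordering on made-up data; the status strings are hypothetical and may not match the values in the real column.

```python
import pandas as pd

# Made-up listings; the furnishing labels are assumptions.
toy = pd.DataFrame({
    "main_property_furnishing_status": [
        "unfurnished", "unfurnished", "partially_furnished",
        "fully_furnished", "fully_furnished",
    ],
    "price_per_sqm": [1100, 1300, 1500, 1800, 2000],
})

# Median price per m² for each furnishing status, lowest first.
medians = (
    toy.groupby("main_property_furnishing_status")["price_per_sqm"]
    .median()
    .sort_values()
)
order = list(medians.index)
print(medians)
```

The resulting list can be passed to the `order` parameter of `sns.boxplot` so that the boxes appear from cheapest to most expensive category.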

This completes an initial EDA similar in spirit to your previous notebook: we have loaded the JSON data, converted it to a DataFrame, assessed quality, and explored key relationships relevant to house prices.