# Exploratory Data Analysis (EDA)
## StreetEasy Rental Dataset

This notebook explores the `streeteasy.csv` rental dataset.
It is intended as teaching material for data analytics, OLAP, and data warehousing.

In [None]:

import pandas as pd
import matplotlib.pyplot as plt


## Load the Dataset

In [None]:

df = pd.read_csv("streeteasy.csv")
df.head()


## Dataset Structure and Summary

In [None]:

df.info()
df.describe()


## Missing Values Analysis

In [None]:

df.isnull().sum().sort_values(ascending=False)


In [None]:

df.isnull().sum().plot(kind="bar")
plt.title("Missing Values per Column")
plt.ylabel("Count")
plt.show()
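
Raw counts are easier to judge as a share of all rows. The cell below is a minimal sketch: `missing_summary` is a hypothetical helper, not part of the original notebook, and the toy frame holds made-up values so the cell runs without `streeteasy.csv`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Toy frame with made-up values, only to show the shape of the result.
toy = pd.DataFrame({
    "rent": [2500, 3100, np.nan, 4200],
    "size_sqft": [480, np.nan, np.nan, 900],
    "borough": ["Brooklyn", "Queens", "Manhattan", "Manhattan"],
})
toy_missing = missing_summary(toy)
print(toy_missing)
```

On the real data, `missing_summary(df)` would produce the same two columns for every field in the dataset.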


## Rent Distribution

In [None]:

# Drop missing rents first: NaN values can make the histogram's bin range undefined
plt.hist(df["rent"].dropna(), bins=30)
plt.xlabel("Monthly Rent ($)")
plt.ylabel("Frequency")
plt.title("Rent Distribution")
plt.show()
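
The shape of the histogram can also be put into numbers. A rough sketch, assuming a hypothetical helper `skew_report` and a made-up rent series: it compares mean and median and reports the sample skewness before and after a log transform.

```python
import numpy as np
import pandas as pd


def skew_report(rent: pd.Series) -> dict:
    """Summarize how lopsided a rent distribution is."""
    rent = rent.dropna()
    return {
        "mean": rent.mean(),
        "median": rent.median(),
        "skew": rent.skew(),                # > 0 means a longer right tail
        "log_skew": np.log1p(rent).skew(),  # skewness after a log transform
    }


# Made-up rents with one expensive outlier, for illustration only.
toy_rent = pd.Series([1800, 2100, 2300, 2500, 2800, 3200, 9500])
report = skew_report(toy_rent)
print(report)
```

A mean well above the median together with a positive skew is the usual numeric signature of a right-skewed distribution. In the notebook, `skew_report(df["rent"])` would apply the same check to the real column.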


## Rent by Borough (Roll-up Analysis)

In [None]:

df.boxplot(column="rent", by="borough")
plt.xlabel("Borough")
plt.ylabel("Rent")
plt.title("Rent Distribution by Borough")
plt.suptitle("")
plt.show()
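
In OLAP terms, a roll-up aggregates a measure to a coarser level of a dimension, and a drill-down goes back to a finer level. The sketch below shows both with `groupby`. The helper `rollup_rent` and the toy values are assumptions for illustration; only the column names `borough`, `neighborhood`, and `rent` come from this notebook.

```python
import pandas as pd


def rollup_rent(frame: pd.DataFrame, levels: list) -> pd.DataFrame:
    """Aggregate rent to the given dimension levels (coarser list = roll-up)."""
    return (
        frame.groupby(levels)["rent"]
             .agg(listings="count", median_rent="median", mean_rent="mean")
             .sort_values("median_rent", ascending=False)
    )


# Made-up listings, for illustration only.
toy = pd.DataFrame({
    "borough": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Brooklyn"],
    "neighborhood": ["Soho", "Chelsea", "Williamsburg", "Williamsburg", "Bushwick"],
    "rent": [5000, 4000, 3200, 3000, 2800],
})

by_borough = rollup_rent(toy, ["borough"])                       # roll-up
by_neighborhood = rollup_rent(toy, ["borough", "neighborhood"])  # drill-down
print(by_borough)
print(by_neighborhood)
```

Reporting the listing count next to the median and mean makes it visible when a group is too small for its average to be trusted.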


## Top 10 Neighborhoods by Average Rent

In [None]:

avg_rent_neighborhood = (
    df.groupby("neighborhood")["rent"]
      .mean()
      .sort_values(ascending=False)
      .head(10)
)

avg_rent_neighborhood.plot(kind="bar")
plt.title("Top 10 Neighborhoods by Average Rent")
plt.ylabel("Average Rent")
plt.show()
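
A ranking by mean rent can be dominated by neighborhoods with only one or two listings. One possible guard, sketched here with a hypothetical helper `top_neighborhoods` and made-up values, is to require a minimum number of listings before a neighborhood is ranked. The threshold itself is an assumption to tune, not a value taken from the dataset.

```python
import pandas as pd


def top_neighborhoods(frame: pd.DataFrame, n: int = 10,
                      min_listings: int = 5) -> pd.DataFrame:
    """Rank neighborhoods by mean rent, skipping those with too few listings."""
    stats = frame.groupby("neighborhood")["rent"].agg(
        listings="count", mean_rent="mean"
    )
    stats = stats[stats["listings"] >= min_listings]
    return stats.sort_values("mean_rent", ascending=False).head(n)


# Made-up listings: Tribeca has a single, very expensive entry.
toy = pd.DataFrame({
    "neighborhood": ["Tribeca", "Astoria", "Astoria", "Astoria",
                     "Bushwick", "Bushwick", "Bushwick"],
    "rent": [9000, 3000, 3200, 3400, 2000, 2100, 2200],
})
top = top_neighborhoods(toy, n=2, min_listings=2)
print(top)
```

With `min_listings=2`, the single Tribeca listing is excluded and the ranking rests only on neighborhoods with several observations.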


## Rent vs Apartment Size

In [None]:

plt.scatter(df["size_sqft"], df["rent"], alpha=0.4)
plt.xlabel("Size (sqft)")
plt.ylabel("Rent")
plt.title("Rent vs Apartment Size")
plt.show()


## Rent vs Distance to Subway

In [None]:

plt.scatter(df["min_to_subway"], df["rent"], alpha=0.4)
plt.xlabel("Minutes to Subway")
plt.ylabel("Rent")
plt.title("Rent vs Distance to Subway")
plt.show()


## Impact of Amenities on Rent (Example: Doorman)

In [None]:

# Assumes has_doorman is coded 0/1, so the sorted groups line up with the tick labels
df.groupby("has_doorman")["rent"].mean().plot(kind="bar")
plt.xticks([0, 1], ["No Doorman", "Doorman"], rotation=0)
plt.ylabel("Average Rent")
plt.title("Impact of Doorman on Rent")
plt.show()


## Bedrooms vs Rent

In [None]:

df.groupby("bedrooms")["rent"].mean().plot(kind="bar")
plt.xlabel("Bedrooms")
plt.ylabel("Average Rent")
plt.title("Average Rent by Number of Bedrooms")
plt.show()


## Derived Metric: Rent per Square Foot

In [None]:

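# Treat zero or missing sizes as NaN so the ratio never becomes infinite
df["rent_per_sqft"] = df["rent"] / df["size_sqft"].where(df["size_sqft"] > 0)

plt.hist(df["rent_per_sqft"].dropna(), bins=30)
plt.xlabel("Rent per Sqft ($)")
plt.ylabel("Frequency")
plt.title("Rent per Square Foot Distribution")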
plt.show()


## Correlation Analysis

In [None]:

numeric_cols = [
    "rent", "bedrooms", "bathrooms", "size_sqft",
    "min_to_subway", "floor", "building_age_yrs"
]

corr = df[numeric_cols].corr()
corr


In [None]:

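# Fix the color scale to [-1, 1] with a diverging map so sign and strength are readable
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)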
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.title("Correlation Matrix")
plt.show()
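
A heatmap is hard to read for exact ordering, so it can help to list the variables by the strength of their correlation with rent. A minimal sketch, assuming a hypothetical helper `rank_by_correlation` and a toy frame of made-up values (the column names are the ones used in this notebook):

```python
import pandas as pd


def rank_by_correlation(frame: pd.DataFrame, target: str = "rent") -> pd.Series:
    """Pearson correlation of each numeric column with the target, strongest first."""
    corr = frame.select_dtypes(include="number").corr()
    with_target = corr[target].drop(target)
    # Order by absolute value but keep the sign in the result
    order = with_target.abs().sort_values(ascending=False).index
    return with_target.reindex(order)


# Made-up values, for illustration only.
toy = pd.DataFrame({
    "rent": [2000, 2500, 3000, 3500, 4000],
    "size_sqft": [400, 500, 600, 700, 800],
    "min_to_subway": [9, 7, 6, 3, 2],
    "floor": [3, 1, 4, 1, 5],
})
ranked = rank_by_correlation(toy)
print(ranked)
```

In the notebook, `rank_by_correlation(df[numeric_cols])` would give the same ranking for the real data. Pearson correlation captures only linear association, so a weak value does not rule out a nonlinear relationship.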


## EDA Summary
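- Rent distribution is right-skewed: a tail of expensive listings stretches far above the bulk of the data
- Rent varies strongly by borough and by neighborhood
- Apartment size and subway proximity are the variables examined here as key drivers of rent; the scatter plots and correlation matrix show association, not causation
- Listings with amenities such as a doorman show higher average rent, although this comparison does not control for location or size
- Derived metrics such as rent per square foot make listings of different sizes comparable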
