# House Sales Price Prediction Analysis
**Final Project - Data Analyst Role for Real Estate Investment Trust**

This notebook walks through the analysis and modeling of house prices in King County.
Each section corresponds to one of the 10 required questions for submission.


## Question 1
**Display the data types of each column using `dtypes`.**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

%matplotlib inline

# Load the King County data. Download kc_house_data.csv from Kaggle
# and place it in the working directory before running this cell.
df = pd.read_csv("kc_house_data.csv")
df.dtypes


## Question 2
**Drop `id` and `Unnamed: 0`, then describe the dataset.**

In [None]:
# errors='ignore' skips any listed column that is absent from this copy of the CSV
df.drop(['id', 'Unnamed: 0'], axis=1, inplace=True, errors='ignore')
df.describe()

## Question 3
**Use `value_counts` on `floors` and convert to a DataFrame.**

In [None]:
floors_df = df['floors'].value_counts().to_frame(name='count')
floors_df

## Question 4
**Boxplot: Determine if waterfront homes have more price outliers.**

In [None]:
plt.figure(figsize=(8,6))
sns.boxplot(x='waterfront', y='price', data=df)
plt.title('Price vs Waterfront')
plt.show()
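
The boxplot shows outliers visually; a count per group makes the comparison explicit. The sketch below applies the 1.5 × IQR rule that seaborn's boxplot uses by default for its whiskers. The `toy` frame and the `count_outliers` helper are hypothetical stand-ins, not part of the dataset or the assignment.

```python
import pandas as pd

def count_outliers(prices: pd.Series) -> int:
    """Count values lying outside the 1.5 * IQR whiskers."""
    q1, q3 = prices.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)
    return int(outside.sum())

# Hypothetical toy data standing in for kc_house_data.csv (prices in $1000s)
toy = pd.DataFrame({
    "waterfront": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "price": [300, 320, 310, 330, 305, 900,
              1500, 1600, 1550, 1650, 1580, 1620],
})

# One outlier count per waterfront group
outliers_by_group = toy.groupby("waterfront")["price"].apply(count_outliers)
print(outliers_by_group)
```

On the real data, the same call would be `df.groupby('waterfront')['price'].apply(count_outliers)`. Because the two groups differ greatly in size, comparing the share of outliers in each group may be more informative than the raw counts.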

## Question 5
**Use `regplot` to see if `sqft_above` is correlated with price.**

In [None]:
plt.figure(figsize=(8,6))
sns.regplot(x='sqft_above', y='price', data=df)
plt.title('Price vs Sqft Above')
plt.show()
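
The regression plot suggests the direction of the relationship; the Pearson coefficient from `Series.corr` puts a number on it. The sketch below uses randomly generated stand-in data, so the slope and noise level are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: price rises with sqft_above, plus noise
rng = np.random.default_rng(0)
sqft_above = rng.uniform(500, 4000, size=200)
price = 250 * sqft_above + rng.normal(0, 150_000, size=200)
toy = pd.DataFrame({"sqft_above": sqft_above, "price": price})

# Pearson correlation: +1 is perfectly positive, 0 is no linear relationship
r = toy["sqft_above"].corr(toy["price"])
print(f"Pearson r = {r:.3f}")
```

On the real data, the equivalent call is `df['sqft_above'].corr(df['price'])`.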

## Question 6
**Linear regression using `sqft_living` to predict price.**

In [None]:
X = df[['sqft_living']]
y = df['price']
lm = LinearRegression()
lm.fit(X, y)
lm.score(X, y)  # R^2 on the same data the model was fit on
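
For a `LinearRegression`, `score` returns the coefficient of determination R². The sketch below checks this against `r2_score` on synthetic stand-in data, and also shows that with a single feature and an intercept, R² equals the squared Pearson correlation. The `_toy` names and the generating slope are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical synthetic data: one feature, linear trend plus noise
rng = np.random.default_rng(1)
X_toy = rng.uniform(500, 5000, size=(100, 1))
y_toy = 280 * X_toy[:, 0] - 40_000 + rng.normal(0, 100_000, size=100)

lm_toy = LinearRegression().fit(X_toy, y_toy)

r2_from_score = lm_toy.score(X_toy, y_toy)                 # what .score reports
r2_manual = r2_score(y_toy, lm_toy.predict(X_toy))         # same quantity, computed explicitly
r = np.corrcoef(X_toy[:, 0], y_toy)[0, 1]                  # Pearson correlation

print(f"R^2 = {r2_from_score:.3f}, r^2 = {r**2:.3f}")
print(f"slope = {lm_toy.coef_[0]:.1f}, intercept = {lm_toy.intercept_:.1f}")
```

The fitted slope in `coef_` reads as the estimated change in price per additional square foot.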

## Question 7
**Multiple linear regression using several features.**

In [None]:
features = ['floors', 'waterfront', 'lat', 'bedrooms', 'sqft_basement',
            'view', 'bathrooms', 'sqft_living15', 'sqft_above', 'grade', 'sqft_living']
X_multi = df[features]
y = df['price']
lm_multi = LinearRegression()
lm_multi.fit(X_multi, y)
lm_multi.score(X_multi, y)  # R^2 on the same data the model was fit on

## Question 8
**Create a pipeline with scaler, polynomial transform, and linear regression.**

In [None]:
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('model', LinearRegression())
])
pipe.fit(X_multi, y)
pipe.score(X_multi, y)  # R^2 on the same data the pipeline was fit on
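
The polynomial step widens the input considerably. With the 11 columns in `features`, a degree-2 expansion without the bias column yields 77 inputs: 11 linear terms, 11 squares, and 55 pairwise products. The sketch below confirms the count on a dummy array; `poly_demo` is a throwaway name used only here.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 11  # length of the `features` list used for X_multi

# Fitting only needs the number of columns, so a dummy row of zeros is enough
poly_demo = PolynomialFeatures(degree=2, include_bias=False)
poly_demo.fit(np.zeros((1, n_features)))

n_out = poly_demo.n_output_features_
print(n_out)
```

More inputs give the model more flexibility, which is one reason a training-set R² from this pipeline can look better than the model's performance on unseen data.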

## Question 9
**Train/test split with Ridge regression (`alpha=0.1`).**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_multi, y, test_size=0.15, random_state=1)
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
ridge.score(X_test, y_test)
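
Reporting the training R² next to the test R² helps show whether the model generalizes: a large gap between the two is a common sign of overfitting. The sketch below uses synthetic stand-in data with the same split settings; the coefficients and noise level are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: three features with a known linear signal
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(300, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.15, random_state=1
)

ridge_toy = Ridge(alpha=0.1).fit(X_tr, y_tr)
train_r2 = ridge_toy.score(X_tr, y_tr)  # fit quality on seen data
test_r2 = ridge_toy.score(X_te, y_te)   # fit quality on held-out data
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

On the real data, the training figure is `ridge.score(X_train, y_train)`.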

## Question 10
**Polynomial transform (degree=2) + Ridge regression (`alpha=0.1`) on train/test.**

In [None]:
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
ridge_poly = Ridge(alpha=0.1)
ridge_poly.fit(X_train_poly, y_train)
ridge_poly.score(X_test_poly, y_test)
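
Ridge penalizes coefficient size, so features measured on very different scales are penalized unevenly. One possible variant, which the assignment does not require, is to wrap scaling, the polynomial transform, and Ridge in a single `Pipeline` so that every step is fit on the training split only. The sketch below uses synthetic stand-in data; the step names and the generating formula are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical synthetic data: two features on very different scales
rng = np.random.default_rng(3)
X_toy = rng.uniform(0, 10, size=(400, 2)) * np.array([1.0, 1000.0])
y_toy = X_toy[:, 0] ** 2 + 0.001 * X_toy[:, 1] + rng.normal(0, 1.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.15, random_state=1
)

ridge_pipe = Pipeline([
    ("scale", StandardScaler()),                               # fit on training rows only
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", Ridge(alpha=0.1)),
])
ridge_pipe.fit(X_tr, y_tr)

pipe_test_r2 = ridge_pipe.score(X_te, y_te)  # transforms X_te, then scores
print(f"test R^2 = {pipe_test_r2:.3f}")
```

Calling `score` on the pipeline applies the fitted scaler and polynomial transform to the test rows automatically, which removes the need for separate `fit_transform` and `transform` calls.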