### Multiple Linear Regression Practice Notebook

#### Dataset to use: `housing_data_dirty.csv`

#### 1. Load & Explore
- Import libraries (`pandas`, `numpy`, `seaborn`, `matplotlib`, etc.)
- Load `housing_data_dirty.csv`
- Inspect with `.info()`, `.describe()`, `.head()`
- Show missing-value counts
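
A minimal sketch of the loading and inspection steps. The small in-memory CSV is a made-up stand-in so the cell runs on its own; in the notebook, read `housing_data_dirty.csv` instead.

```python
import io

import pandas as pd

# Toy stand-in for housing_data_dirty.csv (values are invented for illustration).
# In the notebook, use: df = pd.read_csv("housing_data_dirty.csv")
toy_csv = io.StringIO(
    "Price,Area,Location,HOA_Fees\n"
    "250000,1200,Downtown,150\n"
    "310000,1500,down town,\n"
    "180000,,Suburb,90\n"
)
df = pd.read_csv(toy_csv)

df.info()             # dtypes and non-null counts per column
print(df.describe())  # summary statistics for numeric columns
print(df.head())      # first few rows

# Missing-value counts per column, largest first
missing_counts = df.isna().sum().sort_values(ascending=False)
print(missing_counts)
```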

#### 2. Data Cleaning Activities

##### 2.1 Data Types
- Convert dates to `datetime`
- Binary/Boolean fields (`Yes/No`) ➝ `0/1`
- Categorical features as `category` dtype
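
The three conversions above could look roughly like this. The toy frame and its values are assumptions; only the column names come from this outline.

```python
import pandas as pd

# Toy data (invented) with the kinds of problems the dirty file may contain
df = pd.DataFrame({
    "Date_Sold": ["2021-03-15", "2021-07-01", "not a date"],
    "Has_Basement": ["Yes", "No", "yes"],
    "Location": ["Downtown", "Suburb", "Downtown"],
})

# Dates: unparseable strings become NaT instead of raising an error
df["Date_Sold"] = pd.to_datetime(df["Date_Sold"], errors="coerce")

# Yes/No to 1/0: normalize case and whitespace first; unknown labels become <NA>
yes_no = {"yes": 1, "no": 0}
df["Has_Basement"] = (
    df["Has_Basement"].str.strip().str.lower().map(yes_no).astype("Int64")
)

# Categorical features as category dtype
df["Location"] = df["Location"].astype("category")

print(df.dtypes)
```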

##### 2.2 Missing Values
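- Detect NaNs in `HOA_Fees`, `Crime_Rate`, etc.
- Impute with the per-location median, or interpolate where the rows have a natural order (e.g. by sale date)
- Drop rows with more than 50% of their values missing, or experiment with other thresholds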
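
One possible order of operations, sketched on invented toy data: drop mostly-empty rows first, then fill the remaining gaps with the median of the same `Location`. The 50% threshold is the one suggested above; treat it as a starting point.

```python
import numpy as np
import pandas as pd

# Toy data (invented); the last row is missing 3 of its 4 values
df = pd.DataFrame({
    "Location": ["Downtown", "Downtown", "Downtown", "Suburb", "Suburb", "Suburb"],
    "HOA_Fees": [100.0, np.nan, 300.0, 50.0, 70.0, np.nan],
    "Crime_Rate": [5.0, 6.0, np.nan, 2.0, np.nan, np.nan],
    "Price": [250000, 260000, 270000, 180000, 190000, np.nan],
})

# Fraction of missing values in each row
row_missing_frac = df.isna().mean(axis=1)

# Keep rows with at most 50% missing
df = df.loc[row_missing_frac <= 0.5].copy()

# Fill remaining NaNs with the median of the same Location
for col in ["HOA_Fees", "Crime_Rate"]:
    location_median = df.groupby("Location")[col].transform("median")
    df[col] = df[col].fillna(location_median)

print(df)
```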

##### 2.3 Categorical Cleanup
- Standardize neighborhood names ("Downtown", "down town", "DownTown")
- Check unique labels and unify
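
A sketch of label unification, using the spelling variants listed above plus an invented "Suburb" pair. The `canonical` mapping is hypothetical and would need to be built from the labels actually found in the data.

```python
import pandas as pd

# Spelling variants from the outline plus an invented second neighborhood
locations = pd.Series(["Downtown", "down town", "DownTown", " Suburb ", "suburb"])

# Step 1: look at what is there
print(locations.value_counts())

# Step 2: build a normalized key (trim, lowercase, remove inner whitespace)
key = locations.str.strip().str.lower().str.replace(r"\s+", "", regex=True)

# Step 3: map keys to canonical labels (hypothetical mapping);
# labels without a mapping fall back to the trimmed original
canonical = {"downtown": "Downtown", "suburb": "Suburb"}
cleaned = key.map(canonical).fillna(locations.str.strip())

print(cleaned.value_counts())
```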


##### 2.4 Outlier Detection
- Use IQR or Z-score for `Price`, `Area`, `Crime_Rate`
- Visualize with boxplots
- Decide to clip, cap, or remove outliers
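
A sketch of the IQR variant on invented `Price` values, using the conventional 1.5 × IQR fences. It shows both clipping and removal so the two outcomes can be compared; a Z-score rule would follow the same pattern with a different mask.

```python
import pandas as pd

# Toy prices (invented, in thousands) with one obvious outlier
price = pd.Series([200, 210, 220, 230, 240, 250, 260, 1000], name="Price", dtype=float)

q1 = price.quantile(0.25)
q3 = price.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (price < lower) | (price > upper)

# Option A: clip (cap) values to the fences
clipped = price.clip(lower=lower, upper=upper)

# Option B: remove the flagged rows
removed = price[~is_outlier]

print(f"fences: [{lower}, {upper}], outliers flagged: {is_outlier.sum()}")
```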

##### 2.5 Duplicates
- Detect duplicates on house descriptors plus sale date
- Remove or consolidate near-duplicates
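
A sketch of duplicate detection on a subset of columns. Which columns count as "house descriptors" is an assumption here (`Location`, `Area`, `Bedrooms`), and the data is invented. Near-duplicates that differ only in text formatting surface once the text is normalized.

```python
import pandas as pd

# Toy data (invented); rows 0 and 1 differ only in how Location is written
df = pd.DataFrame({
    "Location": ["Downtown", "downtown ", "Suburb", "Suburb"],
    "Area": [1200, 1200, 900, 950],
    "Bedrooms": [3, 3, 2, 2],
    "Date_Sold": pd.to_datetime(
        ["2021-03-15", "2021-03-15", "2021-05-01", "2021-05-01"]
    ),
    "Price": [250000, 250000, 180000, 185000],
})

# Assumed key: house descriptors plus sale date
key_cols = ["Location", "Area", "Bedrooms", "Date_Sold"]

# Exact duplicates on the key (none yet, because of the label variant)
exact_dupes = df.duplicated(subset=key_cols, keep="first")

# Normalize the text column, then check again to catch near-duplicates
df["Location"] = df["Location"].str.strip().str.title()
near_dupes = df.duplicated(subset=key_cols, keep="first")

# Keep the first occurrence of each key
deduped = df.drop_duplicates(subset=key_cols, keep="first")
print(deduped)
```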


##### 2.6 Text Cleanup (Optional)
- `Description`: extract keywords (e.g. “reno”, “needs work”) using regex
- Drop the free-text column if it is unused
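
A sketch of keyword extraction with regex. The descriptions are invented, and the flag names `is_renovated` and `needs_work` are hypothetical.

```python
import pandas as pd

# Toy descriptions (invented), including a missing value
df = pd.DataFrame({
    "Description": [
        "Recently renovated kitchen",
        "Needs work, sold as is",
        "Full RENO in 2019",
        None,
    ]
})

# Words starting with "reno" (reno, renovated, renovation), case-insensitive
df["is_renovated"] = (
    df["Description"].str.contains(r"\breno", case=False, na=False).astype(int)
)

# The phrase "needs work" with any amount of whitespace between the words
df["needs_work"] = (
    df["Description"].str.contains(r"needs\s+work", case=False, na=False).astype(int)
)

# Drop the free-text column once the flags are extracted
df = df.drop(columns=["Description"])
print(df)
```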

#### 3. Exploratory Data Analysis

##### 3.1 Univariate
- Histograms (`Price`, `Area`, `Age`)
- Bar charts (`Location`, `Has_Basement`, `Pool`)
- Boxplot of `Price` by `Location`

##### 3.2 Bivariate
- Scatter plots: `Area` vs `Price`, `Crime_Rate` vs `Price`, `Distance_to_city` vs `Price`
- Heatmap of correlations
- Boxplots: `Bedrooms` vs `Price`, `School_Rating` vs `Price`
- Pairplot sample

##### 3.3 Time-Series
- Plot `Price` vs `Date_Sold`, observe trends/seasonality
- Extract and group by sale month/year
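
A sketch of the grouping step on invented sales. It assumes `Date_Sold` has already been converted to `datetime`. The resulting series can be passed straight to `.plot()` to look for trends.

```python
import pandas as pd

# Toy sales (invented)
df = pd.DataFrame({
    "Date_Sold": pd.to_datetime(
        ["2021-01-10", "2021-01-25", "2021-02-05", "2022-01-15"]
    ),
    "Price": [200000, 220000, 250000, 300000],
})

# Median price per calendar month (year-month periods keep years separate)
monthly = df.groupby(df["Date_Sold"].dt.to_period("M"))["Price"].median()

# Mean price per year
by_year = df.groupby(df["Date_Sold"].dt.year)["Price"].mean()

print(monthly)
print(by_year)
```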

##### 3.4 Location-Based Insights
- Group by `Location`: mean price, avg crime rate, school rating
- Bar chart (e.g. seaborn `barplot`) with error bars
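
A sketch of the per-location summary on invented data. The output column names are hypothetical. The standard error column is one option for sizing error bars.

```python
import pandas as pd

# Toy data (invented)
df = pd.DataFrame({
    "Location": ["Downtown", "Downtown", "Suburb", "Suburb", "Suburb"],
    "Price": [300000, 320000, 200000, 220000, 240000],
    "Crime_Rate": [6.0, 7.0, 2.0, 3.0, 2.5],
    "School_Rating": [6, 7, 8, 9, 8],
})

# Named aggregation: one row per Location
summary = df.groupby("Location").agg(
    mean_price=("Price", "mean"),
    avg_crime_rate=("Crime_Rate", "mean"),
    avg_school_rating=("School_Rating", "mean"),
    n_sales=("Price", "size"),
    price_sem=("Price", "sem"),  # standard error of the mean, for error bars
)
print(summary)
```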

#### 4. Feature Engineering
- Create `price_per_sqft = Price / Area` (useful for EDA; it is derived from the target, so keep it out of the predictors for `Price`)
- One-hot encode `Location`, `Has_Basement`, `Pool`
- Extract numeric month and year from `Date_Sold`
- Log-transform skewed numeric variables
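
A sketch of these steps on invented data. The names `sale_month`, `sale_year`, and `log_Area` are hypothetical. `drop_first=True` is a choice made here to avoid a redundant dummy column in a model with an intercept; the outline does not prescribe it.

```python
import numpy as np
import pandas as pd

# Toy data (invented)
df = pd.DataFrame({
    "Price": [250000.0, 310000.0, 180000.0],
    "Area": [1250.0, 1550.0, 900.0],
    "Location": ["Downtown", "Suburb", "Downtown"],
    "Date_Sold": pd.to_datetime(["2021-03-15", "2021-07-01", "2021-11-20"]),
})

# Ratio feature
df["price_per_sqft"] = df["Price"] / df["Area"]

# Numeric month and year from the sale date
df["sale_month"] = df["Date_Sold"].dt.month
df["sale_year"] = df["Date_Sold"].dt.year

# Log-transform a skewed numeric column; log1p computes log(1 + x)
df["log_Area"] = np.log1p(df["Area"])

# One-hot encode; drop_first removes one level per feature
df = pd.get_dummies(df, columns=["Location"], drop_first=True, dtype=int)

print(df)
```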

#### 5. Multicollinearity Check
- Compute the variance inflation factor (VIF) for each predictor
- Examine correlation between numeric predictors
- Decide which redundant variables to drop
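
A dependency-light sketch of the VIF computation on synthetic predictors. It relies on the identity that each predictor's VIF equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix (for a model with an intercept). `statsmodels` also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`. The data is generated so that `Bedrooms` is nearly collinear with `Area`.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: Bedrooms is built to be nearly collinear with Area
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 300, n)
bedrooms = area / 400 + rng.normal(0, 0.1, n)
crime_rate = rng.normal(5, 1, n)
X = pd.DataFrame({"Area": area, "Bedrooms": bedrooms, "Crime_Rate": crime_rate})

# Pairwise correlations between numeric predictors
corr = X.corr()
print(corr.round(2))

# VIF_j = j-th diagonal entry of the inverse correlation matrix
vif = pd.Series(
    np.diag(np.linalg.inv(corr.to_numpy())), index=X.columns, name="VIF"
)
print(vif.round(1))
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity; here `Area` and `Bedrooms` would be flagged, and one of them is a candidate to drop.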

#### 6. Modeling Prep
- Split into train/test (e.g. 80/20)
- Scale numeric variables if needed (fit the scaler on the training split only)
- Baseline regression: `Price ~ Area + Bedrooms + Bathrooms + Location + ...`
- Train using `statsmodels` or `sklearn.linear_model`
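
A sketch of the split, scale, and fit sequence with `sklearn`, on synthetic data whose generating coefficients are invented. Wrapping the scaler and the model in one pipeline means the scaler is fitted on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (coefficients and noise level are invented)
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "Area": rng.uniform(600, 3000, n),
    "Bedrooms": rng.integers(1, 6, n),
    "Bathrooms": rng.integers(1, 4, n),
})
y = (
    50_000
    + 120 * X["Area"]
    + 8_000 * X["Bedrooms"]
    + 5_000 * X["Bathrooms"]
    + rng.normal(0, 10_000, n)
)

# 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling happens inside the pipeline, using training data only
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

r2_test = model.score(X_test, y_test)  # R² on the held-out split
print(f"test R²: {r2_test:.3f}")
```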

#### 7. Diagnostics & Evaluation
- Residuals vs fitted plot
- Q-Q plot for residuals
- Evaluate R², RMSE, MAE
- Analyze feature coefficients and p-values
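
The three metrics written out from their definitions, on a tiny invented example, so it is clear what each one measures. In the notebook, `y_true` and `y_pred` would be the held-out targets and the model's predictions.

```python
import numpy as np

# Toy targets and predictions (invented, in thousands)
y_true = np.array([200.0, 250.0, 300.0, 350.0])
y_pred = np.array([210.0, 240.0, 310.0, 340.0])

# Residuals: the quantity plotted in residuals-vs-fitted and Q-Q plots
residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))         # mean absolute error
rmse = np.sqrt(np.mean(residuals ** 2))  # root mean squared error

ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  R²={r2:.3f}")
```

For coefficients and p-values, a model fitted with `statsmodels` OLS reports both in its `.summary()` output.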

#### 8. Iteration
- Add/remove features based on significance/VIF
- Experiment with transformations and interactions
- Re-evaluate model

#### 9. Extend the model to use regularization techniques
- Ridge/Lasso regression
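
A sketch comparing the two penalties on synthetic data in which one of three predictors is pure noise. The `alpha` values are arbitrary choices for illustration; in practice they are usually tuned by cross-validation (for example with `RidgeCV` or `LassoCV`).

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: columns 0 and 2 drive y, column 1 is irrelevant
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(0, 0.1, n)

# Penalties depend on feature scale, so standardize first
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.5)).fit(X, y)

ridge_coefs = ridge.named_steps["ridge"].coef_
lasso_coefs = lasso.named_steps["lasso"].coef_

# Ridge shrinks coefficients toward zero; Lasso can set some exactly to zero
print("ridge:", ridge_coefs.round(3))
print("lasso:", lasso_coefs.round(3))
```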