# 🏡 Real Estate Price Prediction - Internship Project Task Flow

This project is divided into task-based modules. Complete the tasks in sequence; at the end you will have a complete, deployable real estate price prediction app built with Streamlit.

---

## ✅ Phase 1: Data Preparation

**Task 01: data-preprocessing-flats**  
- Load flat data using Pandas  
- Clean column names, fix data types  
- Drop irrelevant or redundant columns  
- Save cleaned file as `flats_cleaned.csv`
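
The steps in Task 01 could look roughly like the sketch below. The column names and the inline sample data are hypothetical stand-ins for the real flats file, so adapt them to whatever your raw data actually contains.

```python
import io

import pandas as pd

# Hypothetical raw data standing in for the real flats file.
raw_csv = io.StringIO(
    "Property Name, Price (Cr) ,Area Sqft,Link\n"
    'Flat A,1.25,"1,200",http://example.com/a\n'
    "Flat B,0.80,950,http://example.com/b\n"
)
df = pd.read_csv(raw_csv)

# Clean column names: trim, lowercase, and replace runs of
# non-alphanumeric characters with a single underscore.
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(r"[^a-z0-9]+", "_", regex=True)
    .str.strip("_")
)

# Fix data types: remove thousands separators, then convert to numeric.
df["area_sqft"] = pd.to_numeric(df["area_sqft"].astype(str).str.replace(",", ""))

# Drop a column that carries no predictive information.
df = df.drop(columns=["link"])

df.to_csv("flats_cleaned.csv", index=False)
```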

**Task 02: data-preprocessing-houses**  
- Load house data  
- Clean and standardize it using the same steps as for flats  
- Save as `houses_cleaned.csv`

**Task 03: data-preprocessing-level-2**  
- Apply deeper preprocessing: normalize column names and harmonize value ranges and units across both datasets  
- Convert categorical values to consistent labels  
- Save intermediate cleaned data

**Task 04: merge-flats-and-house**  
- Merge flat and house datasets with a new column `property_type`  
- Ensure merged dataframe has uniform structure  
- Save as `merged_data.csv`
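
One possible way to do the merge in Task 04 is shown below. The two small frames and their columns are made up for illustration, and keeping only the shared columns is just one way to get a uniform structure; you may prefer to align columns by renaming or filling instead.

```python
import pandas as pd

# Hypothetical cleaned frames; the real ones come from the earlier tasks.
flats = pd.DataFrame({"price": [1.2, 0.8], "area": [1200, 950], "floor": [3, 7]})
houses = pd.DataFrame({"price": [2.5], "area": [2400], "plot_area": [3000]})

# Tag each dataset before combining so the origin is not lost.
flats["property_type"] = "flat"
houses["property_type"] = "house"

# Keep only the columns both datasets share so the result is uniform.
common = [c for c in flats.columns if c in houses.columns]
merged = pd.concat([flats[common], houses[common]], ignore_index=True)

merged.to_csv("merged_data.csv", index=False)
```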

---

## 🛠️ Phase 2: Feature Engineering and EDA

**Task 05: feature-engineering**  
- Create or modify columns, for example a floor category and a luxury category  
- Convert text or numeric columns to categorical where appropriate
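
As a rough illustration of turning a numeric column into a category, the sketch below bins a floor number with `pd.cut`. The column name `floor_num`, the bin edges, and the labels are all assumptions chosen for the example, not values prescribed by the task.

```python
import pandas as pd

df = pd.DataFrame({"floor_num": [0, 2, 5, 11, 25]})

# Bin edges and labels are assumptions chosen for illustration only.
# Intervals are right-inclusive: (-1, 2], (2, 10], (10, inf).
df["floor_category"] = pd.cut(
    df["floor_num"],
    bins=[-1, 2, 10, float("inf")],
    labels=["low", "mid", "high"],
)
```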

**Task 06: eda-univariate-analysis**  
- Perform individual column analysis using plots (matplotlib/seaborn)  
- Focus on variables like price, area, furnishing type

**Task 07: eda-multivariate-analysis**  
- Check correlations between variables  
- Visualize price vs area, price vs bedrooms, etc.
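
A correlation check might start from something like this sketch. The numbers are invented; in the real task you would load `merged_data.csv` or the engineered dataset instead.

```python
import pandas as pd

# Made-up sample; replace with the project dataset.
df = pd.DataFrame({
    "price": [50, 75, 100, 125, 160],
    "area": [500, 750, 1000, 1250, 1600],
    "bedrooms": [1, 2, 2, 3, 4],
})

# Pearson correlation between all numeric columns.
corr = df.select_dtypes("number").corr()

# Rank the other variables by their correlation with price.
price_corr = corr["price"].drop("price").sort_values(ascending=False)
```

The `corr` matrix can then be passed to a plotting call such as `seaborn.heatmap(corr, annot=True)` for a visual overview.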

**Task 08: eda-pandas-profiling**  
- Use `ydata-profiling` (the renamed successor of `pandas_profiling`) to auto-generate an EDA report  
- Save it as HTML

---

## 🧹 Phase 3: Data Cleaning and Selection

**Task 09: outlier-treatment**  
- Use IQR/z-score method to detect and treat outliers  
- Decide whether to remove or cap values
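
The IQR approach can be sketched as follows, showing both options (removing and capping) side by side. The price values are invented, and the 1.5 multiplier is the conventional choice rather than a project requirement.

```python
import pandas as pd

prices = pd.Series([40, 45, 50, 55, 60, 65, 500], name="price")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (prices < lower) | (prices > upper)

# Option 1: drop the outlying rows.
removed = prices[~is_outlier]

# Option 2: cap values at the fences instead of dropping them.
capped = prices.clip(lower=lower, upper=upper)
```

Removing is simpler, but capping keeps the row count intact, which matters when the dataset is small.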

**Task 10: missing-value-imputation**  
- Identify missing data  
- Fill using mean/median/mode or predictive methods
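
A minimal imputation sketch, assuming a numeric `area` column and a categorical `furnishing` column (both names are illustrative). It uses the median for the numeric column because it is less sensitive to extreme values than the mean, and the mode for the categorical one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "area": [900.0, np.nan, 1100.0, 5000.0],
    "furnishing": ["furnished", None, "furnished", "unfurnished"],
})

# Identify missing data per column.
missing_counts = df.isna().sum()

# Numeric column: fill with the median.
df["area"] = df["area"].fillna(df["area"].median())

# Categorical column: fill with the most frequent value.
df["furnishing"] = df["furnishing"].fillna(df["furnishing"].mode()[0])
```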

**Task 11: feature-selection-and-feature-engineering**  
- Perform both: drop irrelevant features and add new meaningful ones  
- Encode categorical features

**Task 12: feature-selection**  
- Apply statistical/ML-based methods (e.g., correlation, tree-based importance)  
- Select top features for modeling
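
Tree-based importance could be computed along these lines. The data is synthetic, generated so that `area` dominates the target, and the choice of keeping the top two features is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: price depends strongly on area, weakly on bedrooms,
# and not at all on the "noise" column.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "area": rng.uniform(500, 3000, 200),
    "bedrooms": rng.integers(1, 6, 200),
    "noise": rng.normal(size=200),
})
y = 100 * X["area"] + 5000 * X["bedrooms"] + rng.normal(scale=1000, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, highest first.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

# Keeping two features is an arbitrary cut-off for this example.
top_features = importances.head(2).index.tolist()
```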

---

## 🤖 Phase 4: Modeling and Evaluation

**Task 13: baseline-model**  
- Build a simple regression model (LinearRegression, DecisionTree, etc.)  
- Evaluate using RMSE, R²
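
A baseline along these lines might be a reasonable starting point. The data is synthetic (price as a noisy linear function of area), so the scores here say nothing about what to expect on the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: price is a noisy linear function of area.
rng = np.random.default_rng(42)
area = rng.uniform(500, 3000, size=(200, 1))
price = 0.05 * area[:, 0] + rng.normal(scale=5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    area, price, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
```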

**Task 14: model-selection**  
- Try multiple models: Ridge, Lasso, RandomForest, XGBoost  
- Compare performance using cross-validation  
- **(Optional but Appreciated)**: Use **boosting algorithms** like **XGBoost, LightGBM, CatBoost**  
- **(Optional but Appreciated)**: Apply **Optuna for hyperparameter tuning** to improve model performance  
- Select best model and save pipeline as `pipeline.pkl`  
- Save final data used as `df.pkl`
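
The comparison and saving steps could be sketched as below. It uses only scikit-learn models to stay self-contained; the synthetic data, the two candidate models, and the R² scoring choice are assumptions, and boosting libraries or Optuna would slot into the same loop.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the final modeling dataset.
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "area": rng.uniform(500, 3000, n),
    "property_type": rng.choice(["flat", "house"], n),
})
y = 0.05 * X["area"] + 20 * (X["property_type"] == "house") + rng.normal(scale=5, size=n)

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["area"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["property_type"]),
])

candidates = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated R² for each candidate pipeline.
scores = {}
for name, estimator in candidates.items():
    pipe = Pipeline([("preprocess", preprocess), ("model", estimator)])
    scores[name] = cross_val_score(pipe, X, y, cv=5, scoring="r2").mean()

# Refit the best candidate on all data and persist it with the input frame.
best_name = max(scores, key=scores.get)
pipeline = Pipeline([("preprocess", preprocess), ("model", candidates[best_name])])
pipeline.fit(X, y)

with open("pipeline.pkl", "wb") as f:
    pickle.dump(pipeline, f)
with open("df.pkl", "wb") as f:
    pickle.dump(X, f)
```

Saving the preprocessing and the model together in one pipeline means the app can pass raw user inputs straight to `predict` without repeating the encoding steps.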

---

## 🚀 Phase 5: Deployment (Streamlit App)

**Final Task: app.py**  
- Create a web app using Streamlit  
- Build a frontend with dropdowns and number inputs to accept user inputs  
- Load trained model (`pipeline.pkl`) and dataset info (`df.pkl`)  
- Predict and display price range based on input features  
- Ensure user-friendly layout with proper titles and sections
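
The prediction step of `app.py` might be structured like the sketch below. It covers only the logic, not the Streamlit UI. The helper `predict_price_range` and its `margin` parameter are hypothetical: the task asks for a price range but does not say how to derive it, so a fixed ±10% band around the point prediction is assumed here. A tiny in-memory pipeline stands in for the one loaded from `pipeline.pkl`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


def predict_price_range(pipeline, features, margin=0.1):
    """Return (low, high) around the point prediction for one property.

    `margin` is an assumed +/- fraction, not something the task specifies.
    """
    # The pipeline expects a DataFrame with one row per property.
    row = pd.DataFrame([features])
    point = float(pipeline.predict(row)[0])
    return point * (1 - margin), point * (1 + margin)


# Stand-in for the pipeline that app.py would load from pipeline.pkl.
train = pd.DataFrame({"area": [500, 1000, 1500, 2000]})
prices = [50.0, 100.0, 150.0, 200.0]
pipeline = Pipeline([("model", LinearRegression())]).fit(train, prices)

# In app.py these values would come from widgets such as
# st.number_input and st.selectbox rather than being hard-coded.
low, high = predict_price_range(pipeline, {"area": 1200})
```

In the app, `low` and `high` could then be shown with a call such as `st.write`, and the dropdown options could be populated from the unique values in `df.pkl`.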

---

📝 **Instructions:**  
- Each task should be committed separately with a proper title  
- Push your task code and outputs to your GitHub repo  
- Ask for help only after you have made a genuine attempt on your own  

Good luck and happy learning! 🚀
