Surprise Housing, a US-based real estate company, is entering the Australian market and uses data analytics to identify promising investment opportunities. A dataset of Australian property sales is used to build a regression model. The goal is to predict the actual value of prospective properties, which supports acquisition decisions and clarifies which variables influence house prices.
The key questions guiding this analysis are which variables are significant predictors of price and which lambda values are optimal for ridge and lasso regression. Answering them lets Surprise Housing base its investment decisions in the Australian real estate market on data.
The objective is to develop a predictive model for house prices that gives insight into the relationship between prices and the factors that drive them. The model supports strategic decision-making and helps optimize investment strategy.
Details of variables are provided in the data description file.
The exploration begins by doubling the alpha values in ridge and lasso regression (alpha is scikit-learn's name for the regularization strength lambda) and examining how the models and the importance of the predictor variables shift. It also covers the choice between ridge and lasso, the optimal lambda values, rebuilding the models without the most important predictors, and ensuring that the models are robust and generalize well.
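The alpha-doubling comparison described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the data is a synthetic stand-in generated with `make_regression`, and the base alpha values (10.0 for ridge, 1.0 for lasso) and the helper `fit_and_summarize` are hypothetical choices made only for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (assumption: not the real dataset).
X, y = make_regression(n_samples=400, n_features=20, n_informative=6,
                       noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Regularized models are sensitive to feature scale, so standardize first.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)


def fit_and_summarize(model_cls, alpha):
    """Fit one model and report how strongly its coefficients are shrunk."""
    model = model_cls(alpha=alpha).fit(X_train_s, y_train)
    coef = model.coef_
    return {
        "alpha": alpha,
        "test_r2": r2_score(y_test, model.predict(X_test_s)),
        "coef_l1_norm": float(np.abs(coef).sum()),
        "coef_l2_norm": float(np.sqrt((coef ** 2).sum())),
        "n_nonzero": int(np.count_nonzero(coef)),
        # Indices of the five largest coefficients by absolute value.
        "top_features": np.argsort(-np.abs(coef))[:5].tolist(),
    }


# Hypothetical base alphas; in practice these would come from tuning.
base_alphas = {Ridge: 10.0, Lasso: 1.0}

results = {}
for model_cls, alpha in base_alphas.items():
    rows = [fit_and_summarize(model_cls, alpha),
            fit_and_summarize(model_cls, 2 * alpha)]
    results[model_cls.__name__] = rows
    for row in rows:
        print(model_cls.__name__, row)
```

Comparing the two rows per model shows the expected effect of stronger regularization: coefficient norms shrink, and for lasso the number of nonzero coefficients may also drop. Whether the ranking of the top predictors changes depends on the data.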
- Programming Languages:
  - Python
- Libraries and Frameworks:
  - Pandas
  - NumPy
  - Matplotlib
  - Seaborn
  - Scikit-learn
- Data Handling:
  - Data loading using Pandas
  - Handling missing values with SimpleImputer
  - Scaling numerical features with StandardScaler
- Data Visualization:
  - Matplotlib and Seaborn for creating various plots and visualizations
- Machine Learning Models:
  - Ridge and Lasso regression models implemented using Scikit-learn
- Model Evaluation:
  - Metrics such as Mean Squared Error (MSE), R-squared (R²), and Mean Absolute Error (MAE) for evaluating model performance
- Data Preprocessing and Hyperparameter Tuning:
  - OneHotEncoder for handling categorical variables
  - GridSearchCV for hyperparameter tuning
- Data Analysis:
  - Exploratory Data Analysis (EDA) techniques, including a correlation matrix heatmap and box plots
- Data Pipeline:
  - Scikit-learn's Pipeline for streamlined and reproducible model building
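One way the components listed above might fit together is sketched below. It is an illustration under stated assumptions rather than the project's actual pipeline: the columns `area`, `age`, `zone`, and `price` are invented toy data, the alpha grid is arbitrary, and `ColumnTransformer` is added here (it is not named in the list) to route numeric and categorical columns to different preprocessing steps.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with invented column names (assumption: not the real dataset).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "area": rng.normal(150, 40, n),
    "age": rng.integers(0, 60, n).astype(float),
    "zone": rng.choice(["A", "B", "C"], n),
})
zone_premium = df["zone"].map({"A": 0, "B": 20_000, "C": 50_000})
df["price"] = (2000 * df["area"] - 1500 * df["age"] + zone_premium
               + rng.normal(0, 10_000, n))
# Blank out some numeric values so the imputer has work to do.
df.loc[rng.choice(n, 20, replace=False), "area"] = np.nan

numeric = ["area", "age"]
categorical = ["zone"]
X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["price"], test_size=0.25, random_state=0)

# Numeric columns: impute then scale. Categorical columns: impute then encode.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical),
])
pipe = Pipeline([("preprocess", preprocess), ("model", Ridge())])

# Search over both model families and an arbitrary grid of alpha values.
alphas = [0.1, 1.0, 10.0, 100.0]
param_grid = [
    {"model": [Ridge()], "model__alpha": alphas},
    {"model": [Lasso(max_iter=10_000)], "model__alpha": alphas},
]
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

pred = search.predict(X_test)
metrics = {
    "mse": mean_squared_error(y_test, pred),
    "mae": mean_absolute_error(y_test, pred),
    "r2": r2_score(y_test, pred),
}
print(search.best_params_)
print(metrics)
```

Keeping imputation, scaling, and encoding inside the pipeline means each cross-validation fold fits them on its own training portion only, which avoids leaking information from the validation fold into preprocessing.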
Detailed analysis can be found here.
For the Python code, click here.
The data used in this analysis was generously provided by Upgrade Academy. I extend my sincere gratitude to the dedicated faculty members at Upgrade Academy for their invaluable support and guidance throughout the analysis process. Their expertise and commitment significantly contributed to the success of this project.