Hannah_Reber
20.10.2020
https://www.coursera.org/learn/ibm-exploratory-data-analysis-for-machine-learning/home/welcome
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data "With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home."
- The original dataset needed a lot of cleaning, but it is very diverse and well suited for cleaning practice
"Predict sales prices and practice feature engineering, RFs, and gradient boosting"
- NA handling
- Assigning numeric values to all columns based on value-count ranks
- Merging repetitive columns
- Assigning constant
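The first two cleaning steps above can be sketched in pandas as follows. This is a minimal illustration on a toy frame, not the notebook's actual code: the helper name `rank_encode`, the "Missing" placeholder, the median fill for numeric gaps, and the toy values are all assumptions.

```python
import pandas as pd

def rank_encode(series: pd.Series) -> pd.Series:
    """Map each category to its value-count rank (1 = most frequent)."""
    counts = series.value_counts()  # sorted by frequency, descending
    ranks = {cat: rank for rank, cat in enumerate(counts.index, start=1)}
    return series.map(ranks)

# Toy data standing in for two Ames columns (values are made up).
df = pd.DataFrame({
    "GarageType": ["Attchd", "Detchd", None, "Attchd", "Attchd", "Detchd"],
    "LotFrontage": [65.0, None, 80.0, 60.0, None, 70.0],
})

# NA handling (assumed strategy): categorical NAs become their own level,
# numeric NAs are filled with the column median.
df["GarageType"] = df["GarageType"].fillna("Missing")
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())

# Rank encoding: replace each category with its frequency rank.
df["GarageType_rank"] = rank_encode(df["GarageType"])
print(df)
```

Encoding by frequency rank keeps every column numeric, but the resulting numbers reflect how common a category is, not any inherent order of the categories.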
Correlations Top 5:
- RANK 1: 'OverallQual'
- RANK 2: 'GrLivArea'
- RANK 3: 'GarageCars'
- RANK 4: 'GarageArea'
- RANK 5: 'TotalBsmtSF'
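A ranking like the one above could be produced by correlating every numeric column with the target and sorting by absolute value. The sketch below is only an illustration on synthetic data: it assumes the target column is `SalePrice` (the Kaggle name) and that Pearson correlation was used, neither of which is stated here, and the helper `top_correlations` is hypothetical.

```python
import numpy as np
import pandas as pd

def top_correlations(df: pd.DataFrame, target: str, n: int = 5) -> pd.Series:
    """Return the n numeric columns with the largest absolute Pearson r to target."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(n)

# Synthetic data: the values are random and only demonstrate the call.
rng = np.random.default_rng(0)
quality = rng.integers(1, 11, size=200)
area = rng.normal(1500, 400, size=200)
demo = pd.DataFrame({
    "OverallQual": quality,
    "GrLivArea": area,
    "Unrelated": rng.normal(0, 1, size=200),
    "SalePrice": 20000 * quality + 100 * area + rng.normal(0, 10000, size=200),
})

result = top_correlations(demo, "SalePrice", n=2)
print(result)
```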
Time dependency (SEASONALITY):
- density (= number of houses sold) varies over time: peaks in summers
- price range: most houses sold for between 10K and 30K; only a handful sold for <10K or >30K
- relation: prices also seem to increase during the selling season (= summer)
- H1: Sale prices depend on seasonality
- H2: Summer is the best selling season
- H3: 'OverallQual' (overall quality) is the most important house attribute
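The seasonality observations behind H1 and H2 can be explored by grouping sales by month. The following is a rough sketch, assuming the Kaggle column names `MoSold` and `SalePrice`; the toy values are invented and do not reproduce the notebook's results.

```python
import pandas as pd

# Toy sales records; the values are invented for illustration only.
sales = pd.DataFrame({
    "MoSold":    [1, 6, 6, 7, 7, 7, 12],
    "SalePrice": [120000, 180000, 200000, 210000, 190000, 230000, 130000],
})

# Per month: number of sales (density) and median sale price.
monthly = (
    sales.groupby("MoSold")["SalePrice"]
    .agg(n_sold="count", median_price="median")
    .reindex(range(1, 13))  # keep months without any sales
)
monthly["n_sold"] = monthly["n_sold"].fillna(0).astype(int)

busiest_month = monthly["n_sold"].idxmax()
print(monthly)
print("Busiest month:", busiest_month)
```

Comparing `n_sold` and `median_price` side by side separates the two claims: a summer peak in the number of sales does not by itself imply a summer peak in prices.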
Significance was tested via OLS modeling; results were visualized with scatterplots, boxplots, correlations, and time series.
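These notes do not say which library was used for the OLS fits. As one possible sketch, the block below fits ordinary least squares with NumPy alone and reports a t-statistic per coefficient. The helper `ols_t_stats` and the synthetic data are assumptions made for illustration. As a rough rule for large samples, an absolute t-statistic well above 2 suggests significance at about the 5% level.

```python
import numpy as np

def ols_t_stats(X, y):
    """Fit y = b0 + X @ b by least squares; return coefficients and t-statistics."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])   # prepend the constant (intercept) column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = A.shape[0] - A.shape[1]                # residual degrees of freedom
    sigma2 = resid @ resid / dof                 # unbiased residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)        # coefficient covariance matrix
    return beta, beta / np.sqrt(np.diag(cov))

# Synthetic data: price depends on quality but not on the second predictor.
rng = np.random.default_rng(42)
quality = rng.integers(1, 11, size=300).astype(float)
irrelevant = rng.normal(size=300)
price = 50000 + 30000 * quality + rng.normal(0, 20000, size=300)

beta, t = ols_t_stats(np.column_stack([quality, irrelevant]), price)
print("coefficients:", beta)
print("t-statistics:", t)
```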
Suggested next step: use the cleaned data to train a deep learning model and compare its predictions.
main_notebook = statistical-analysis.ipynb
png_images_generated_in_sub_notebooks_and_integrated_in_main_notebook = statistical-analysis.ipynb
input_and_output_csv = data
all_sub_notebooks_with_detailed_analysis_steps = subs