This project is divided into three major parts: Data Preparation, Feature Selection, and Modeling and Validation. We begin by explaining how we handled missing data values, and then describe the hypothesis on which we based our modeling.

In the feature selection stage, we try several feature selection methods and provide supporting visualizations that justify reducing the number of features, both by removing redundant features and by creating relevant new ones.

In the modeling stage, we use four main models: Poisson regression, negative binomial regression, random forests, and gradient boosting (XGBoost); a brief sketch of this setup appears below. We ran many iterations of these models to select the best parameters, and many of these code chunks are commented out because they take about 15-20 minutes to run. The algorithms we chose do not lend themselves to meaningful visualizations, so we have not included any, but we provide an in-depth analysis of why we chose each model and why we believe the random forest approach is the best of the four. After selecting the best approach, we show how it behaves on the test dataset and compare the predicted values to the actual values.
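As a rough illustration of the modeling stage, the sketch below (written in R, assuming the report's apparent R Markdown setup) fits all four model families to a count outcome and compares their held-out error. This is a minimal sketch, not the project's actual code: the data frame `df`, its columns, and all tuning values are hypothetical placeholders.

```r
# Minimal sketch of the four candidate models on a count outcome.
# The data frame `df` and its columns are hypothetical placeholders.
library(MASS)          # glm.nb for negative binomial regression
library(randomForest)
library(xgboost)

set.seed(42)
df <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
df$y <- rpois(500, lambda = exp(0.3 * df$x1 + 0.1))   # synthetic counts
train_idx <- sample(nrow(df), 0.75 * nrow(df))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# GLMs: Poisson and negative binomial regression
m_pois <- glm(y ~ x1 + x2, data = train, family = poisson)
m_nb   <- glm.nb(y ~ x1 + x2, data = train)

# Tree ensembles: random forest and gradient boosting (XGBoost)
m_rf  <- randomForest(y ~ x1 + x2, data = train, ntree = 500)
m_xgb <- xgboost(data = as.matrix(train[, c("x1", "x2")]),
                 label = train$y, nrounds = 200,
                 objective = "count:poisson", verbose = 0)

# Compare held-out RMSE across the four candidates
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
preds <- list(
  poisson       = predict(m_pois, test, type = "response"),
  neg_binomial  = predict(m_nb, test, type = "response"),
  random_forest = predict(m_rf, test),
  xgboost       = predict(m_xgb, as.matrix(test[, c("x1", "x2")]))
)
sapply(preds, function(p) rmse(test$y, p))
```

In a comparison like this, all four models are evaluated on the same held-out split so that the error metric, rather than in-sample fit, drives the choice of the final model.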
All the source code can be found on the gh-pages branch.
Link to the report: https://info370a-w19.github.io/a4-poojaram/