https://joygeo007-machineknight-hackathon-main-2agjg1.streamlitapp.com/
The dataset consists of housing properties located in Bengaluru and Chennai. The objective is to build an ML model that predicts the rent of a house from the given properties. The model was trained on the train data and makes predictions for the test data. Train.csv has dimensions 20500 rows × 25 columns, whereas test.csv has 4500 rows × 24 columns.
- No null values were found in the dataset
- No duplicate values were found in the dataset
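The null and duplicate checks above can be sketched with pandas; the miniature DataFrame here is a stand-in for the loaded train.csv (its column names are assumptions):

```python
import pandas as pd

# Tiny stand-in frame; in the project this would be the loaded train.csv.
df = pd.DataFrame({
    "property_size": [1000, 650, 1200],
    "bathroom": [2, 1, 2],
    "rent": [18000, 9000, 20000],
})

null_count = df.isnull().sum().sum()  # total missing values across all columns
dup_count = df.duplicated().sum()     # fully duplicated rows
print(null_count, dup_count)
```

Both counts come out to zero here, matching what the project reports for the full dataset.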
- Most of the East-facing flats have rent in the range 15000-20000
- Most of the North-facing flats have rent below 10000
- Most of the 2 BHKs have rent in the range 15000-20000
- There was a slight correlation between the total number of floors and the floor number
- There was a slight correlation between bathroom, property_size, and rent
- Properties with any one of the gym, swimming pool, or lift facilities are more likely to have the other two amenities.
- Dropped columns with high cardinality
- We did not take locality into consideration, since the latitude and longitude variables already capture location
- The amenities column had already been processed, so we removed it
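The high-cardinality drop can be sketched as below; the example columns and the 90% distinctness threshold are assumptions, not the project's actual values:

```python
import pandas as pd

# Hypothetical sample; "id" is unique per row, so it is high-cardinality.
df = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],
    "facing": ["East", "North", "East", "West"],
    "rent": [18000, 9000, 20000, 12000],
})

threshold = 0.9  # drop object columns where almost every value is distinct
high_card = [c for c in df.select_dtypes("object")
             if df[c].nunique() / len(df) > threshold]
df = df.drop(columns=high_card)
print(high_card)  # columns that were dropped
```

Columns like a row identifier carry no generalizable signal for a regressor and blow up one-hot dimensionality, which is the usual rationale for dropping them.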
- Since it is a regression problem, we encoded all the categorical variables:
  - Categorical variables with a distinct hierarchy were label-encoded
  - The remaining categorical variables were one-hot encoded
- Separated the target (rent) from the predictor variables
- Scaled the train and test data using StandardScaler
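The scaling step can be sketched as follows (the feature values are made up); the scaler is fit on the train split only and then applied to the test split, so no test statistics leak into training:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices (e.g. property_size, bathroom).
X_train = np.array([[1000.0, 2.0], [650.0, 1.0], [1200.0, 2.0]])
X_test = np.array([[900.0, 1.0]])

scaler = StandardScaler().fit(X_train)   # statistics from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.mean(axis=0), X_train_s.std(axis=0))
```

After the transform each train column has mean 0 and unit standard deviation, which keeps scale-sensitive models such as Ridge, Lasso, and SVR well-behaved.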
- After removing the columns mentioned above, we performed feature engineering and EDA to gain initial insights from the dataset
- We used Sweetviz for automated EDA, in addition to our own manual analysis
- Next, we checked for the correlation between the columns
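The correlation check reduces to a call to `DataFrame.corr`; the numbers below are an illustrative stand-in, not the project's data:

```python
import pandas as pd

# Hypothetical numeric slice of the data.
df = pd.DataFrame({
    "property_size": [650, 800, 1000, 1200, 1500],
    "bathroom": [1, 1, 2, 2, 3],
    "rent": [9000, 12000, 18000, 20000, 27000],
})

corr = df.corr(numeric_only=True)  # pairwise Pearson correlations
print(corr["rent"].round(2))       # how each column correlates with rent
```

Inspecting the `rent` column of this matrix is what surfaces relationships like the bathroom/property_size/rent correlation noted earlier.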
- We started with a Linear Regression model, followed by Ridge and Lasso
- After this, we used the Gradient Boosting Regressor and Support Vector Regressor
- This was followed by the Random Forest Regressor and Decision Tree Regressor
- To find the best model, we calculated each model's root mean squared error (RMSE) and R² score
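The model comparison loop can be sketched as below, using synthetic regression data in place of the prepared rent features:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the scaled train data.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(),
    "lasso": Lasso(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    results[name] = (rmse, r2_score(y_te, pred))
    print(f"{name}: RMSE={rmse:.1f}, R2={results[name][1]:.3f}")
```

Lower RMSE and higher R² together identify the stronger model, which is how the tree-based regressors were singled out.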
- We found that the Decision Tree and Random Forest regressors gave the best scores
- We also deduced that the Decision Tree regressor was overfitting
- We first tried post-pruning the Decision Tree. Then we used RandomizedSearchCV over a specific range of parameters to find the best score
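Post-pruning a fitted decision tree is typically done via cost-complexity pruning; a minimal sketch on synthetic data, where the `ccp_alpha` value is an illustrative assumption:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unpruned tree grows one leaf per training sample and overfits.
unpruned = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# ccp_alpha > 0 collapses subtrees whose complexity is not worth their
# impurity reduction; the alpha here is arbitrary for the sketch.
pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=50.0).fit(X_tr, y_tr)

print(unpruned.get_n_leaves(), pruned.get_n_leaves())
```

`cost_complexity_pruning_path` can be used to enumerate candidate alphas and pick one by cross-validation rather than by hand.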
- We did the same with the Random Forest Regressor, which gave us a narrowed range of parameters producing the best score of any model so far
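A sketch of the randomized search over the Random Forest; the candidate parameter values and the small `n_iter` budget are assumptions for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 150, 200],
        "max_depth": [3, 6, 9, 12, 15],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=5,                                # small budget for the sketch
    scoring="neg_root_mean_squared_error",   # maximize = minimize RMSE
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Randomized search samples a handful of combinations cheaply, and the winning region of the parameter space then seeds the finer grid search.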
- We then ran GridSearchCV over that narrowed parameter range and found the parameters that gave the best possible model score
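The refinement step can be sketched as an exhaustive grid over a small neighborhood; the grid values here are illustrative assumptions, not the project's actual tuned range:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

# Exhaustively evaluate every combination in a narrowed grid.
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={
        "n_estimators": [100, 150],
        "max_depth": [8, 10, 12],
    },
    scoring="neg_root_mean_squared_error",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(-grid.best_score_, 2))  # best combo and its CV RMSE
```

Unlike the randomized pass, the grid search tries all 6 combinations here, so it is affordable only once the range is already narrow.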