The loan prediction model predicts whether a loan will be approved, based on the applicant's profile information.
In this model, I applied the following preprocessing and modelling steps:
- Hypothesis generation before model building: Hypotheses come from domain knowledge. I chose some key factors that applicants commonly face when applying for a bank loan.
- Univariate and Bivariate Analysis: Visualising the distribution of each feature and its relationship with the target reveals insights and trends that help later with feature selection and outlier treatment.
- Data Cleaning: Missing-value treatment using the mode or median, and outlier removal using the percentile or boxplot method.
- Feature Engineering: Add or remove features after visualising correlations with a heatmap or pairplot. Multicollinearity can also be checked this way, since it can make coefficient estimates unstable and slow the convergence of gradient descent.
- Standardisation: Rescales each feature to mean = 0 and standard deviation = 1. Standardisation alone does not change the shape of the distribution, so if a feature is right-skewed, a log transformation can be applied first to bring it closer to a normal (Gaussian) distribution.
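- Principal component analysis (PCA): A dimensionality-reduction step that can be applied before fitting any ML algorithm. It projects correlated features onto a smaller set of uncorrelated components.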
- Model Building: Train different ML algorithms and compare their training and validation accuracy to select the best model.
- Imbalanced dataset handling: If the dataset is imbalanced, the model becomes biased towards the majority class, which may lead to incorrect results. This can be handled effectively with SMOTE as an oversampling technique, or with ensembles combined with undersampling.
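- Model performance: If the model has low bias but high variance (for example, training accuracy exceeds validation accuracy by more than about 6%), it is overfitting. If the bias is high, so that both training and validation accuracy are low, it is underfitting.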
- Validation Technique: StratifiedKFold is effective for detecting overfitting on an imbalanced dataset, because each fold preserves the class proportions.
- Feature engineering (revisited): Adding new features, based on the hypotheses generated earlier, can improve model performance.
- Model Selection: I applied LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, SVC, Naive Bayes, KNN and Linear Discriminant Analysis.
- Hyper-parameter optimisation: Tune hyper-parameters to improve performance, then select the final model. I chose Logistic Regression because it gave better accuracy, precision and F1 score.
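
The data cleaning and standardisation steps above can be illustrated with a minimal sketch. It is not the project's actual code: the toy DataFrame and the column names `ApplicantIncome` and `Property_Area` are assumptions made for illustration, and the boxplot (IQR) rule is used here as one of the two outlier methods mentioned.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are assumptions, not the real schema.
df = pd.DataFrame({
    "ApplicantIncome": [2500.0, 4000.0, np.nan, 5200.0, 3100.0, 81000.0],
    "Property_Area": ["Urban", "Rural", "Urban", None, "Urban", "Semiurban"],
})

# Missing-value treatment: median for numeric, mode for categorical columns.
df["ApplicantIncome"] = df["ApplicantIncome"].fillna(df["ApplicantIncome"].median())
df["Property_Area"] = df["Property_Area"].fillna(df["Property_Area"].mode()[0])

# Outlier removal with the boxplot (IQR) rule: drop rows above the upper whisker.
q1, q3 = df["ApplicantIncome"].quantile([0.25, 0.75])
upper_whisker = q3 + 1.5 * (q3 - q1)
df = df[df["ApplicantIncome"] <= upper_whisker].reset_index(drop=True)

# Log transform to reduce right skew, then standardise to mean 0 and std 1.
log_income = np.log1p(df[["ApplicantIncome"]])
scaled_income = StandardScaler().fit_transform(log_income)

print(df)
print(scaled_income.ravel())
```

In a real workflow the median, mode, whisker and scaler would be fitted on the training split only and then reused on the validation data, to avoid leaking information between the two.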
- Python - 3.9.6
- Google Colab - for analysis.
- PyCharm - for building the Streamlit web app.
- Heroku - for deployment.
The dataset was obtained from Kaggle.
- Pandas - 1.2.4
- NumPy - 1.20.4
- Matplotlib - 3.3.4
- Seaborn - 0.11.0
- Scikit-learn - 0.24.1
- Streamlit - 0.88.0
- Logistic Regression --> 81%
- Polynomial Regression --> 77%
- Decision Tree classifier --> 79%
- Random Forest classifier --> 80%
- Support vector classifier --> 61%
- KNN --> 62%
- Naive Bayes --> 77%
- Linear Discriminant Analysis --> 79%
- XGBoost --> 85%
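
A comparison like the one above could be produced with a loop along the following lines. This is a minimal sketch rather than the project's code: it uses a synthetic dataset from `make_classification` as a stand-in for the Kaggle data, only three of the listed models, and default hyper-parameters, so its numbers will not match the accuracies reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced stand-in for the loan data (an assumption).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.3, 0.7], random_state=42
)

# Scaling sits inside the pipeline so it is refitted on each training fold.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree classifier": DecisionTreeClassifier(random_state=42),
    "Random Forest classifier": RandomForestClassifier(
        n_estimators=100, random_state=42
    ),
}

# Stratified folds keep the approved/rejected ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = {}
for name, model in models.items():
    fold_accuracy = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    scores[name] = fold_accuracy.mean()
    print(f"{name} --> {scores[name]:.0%}")
```

Passing `scoring="precision"` or `scoring="f1"` to `cross_val_score` reports the other metrics mentioned in the model selection step.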
Streamlit deployment through GitHub: the loan-prediction web app is available at https://kolhesamiksha-loan-prediction-demo-8boafr.streamlit.app/