### <span style="color: #1f77b4;">Problem Statement</span>

Understanding the problem statement is the first and foremost step. It gives you an intuition of what lies ahead. Let us look at the problem statement.

<span style="color: #1f77b4; font-weight: bold;">Nairobi Housing Finance Company</span> offers all types of housing loans and operates across urban, semi-urban, and rural locations. Customers first apply for a home loan, and the company then verifies their eligibility. The organization wants to <span style="font-style: italic;">automate the loan eligibility process (in real time)</span> based on the information customers provide when filling out the online application form. This information includes <span style="font-weight: bold;">Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History</span>, and others. To automate the process, the company has posed the problem of identifying <span style="font-style: italic; color: #1f77b4;">client segments that are eligible for loan amounts</span> so that it can target these customers directly.

It is a <span style="color: #2ca02c; font-weight: bold;">classification problem</span> where we have to predict whether a loan will be approved or not. In these kinds of problems, we have to predict <span style="font-style: italic;">discrete values</span> based on a given set of independent variable(s).

- <span style="font-weight: bold;">Binary Classification</span>: Predicting one of two given classes. For example, classifying loan applications as either <span style="font-style: italic;">approved</span> or <span style="font-style: italic;">rejected</span> based on customer data.

- <span style="font-weight: bold;">Multiclass Classification</span>: Categorizing data into more than two classes. For instance, you might classify loan applications as <span style="font-style: italic;">approved</span>, <span style="font-style: italic;">rejected</span>, or <span style="font-style: italic;">under review</span> based on customer attributes.
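
To make the distinction concrete, the sketch below encodes a few made-up application outcomes as integer class labels, which is the form most classifiers expect. The label values and the `encode_labels` helper are illustrative assumptions and do not come from the dataset.

```python
import pandas as pd


def encode_labels(labels, classes):
    """Map each label to its integer position in `classes` (hypothetical helper)."""
    mapping = {name: code for code, name in enumerate(classes)}
    return pd.Series(labels).map(mapping)


# Binary case: two possible outcomes.
binary = encode_labels(
    ["approved", "rejected", "approved"],
    classes=["rejected", "approved"],
)

# Multiclass case: three possible outcomes.
multi = encode_labels(
    ["approved", "under review", "rejected"],
    classes=["rejected", "approved", "under review"],
)

print(binary.tolist())
print(multi.tolist())
```

The only difference between the two cases is the number of distinct codes the model has to choose from.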

<span style="font-style: italic;">Loan prediction</span> is a very common real-life problem that every retail bank faces at least once in its lifetime. Done correctly, it can save the bank a lot of man-hours.

---

![Loan prediction](images/image_1.png)

### <span style="color: #1f77b4;">Hypothesis Generation</span>

After looking at the problem statement, we now move on to hypothesis generation. It is the process of listing all the possible factors that can affect the outcome.

Below are some of the factors that I think can affect Loan Approval (the dependent variable for this loan prediction problem):

- <span style="font-weight: bold;">Salary</span>: Applicants with a higher income should have a better chance of approval.
- <span style="font-weight: bold;">Previous History</span>: Applicants who have repaid their previous debts should have a better chance of approval.
- <span style="font-weight: bold;">Loan Amount</span>: The smaller the loan amount, the higher the chance of approval.
- <span style="font-weight: bold;">Loan Term</span>: The shorter the loan term, the higher the chance of approval.
- <span style="font-weight: bold;">EMI</span>: The lower the monthly installment, the higher the chance of approval.
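
The EMI (equated monthly installment) is not independent of the other factors: it follows from the loan amount and the loan term. A minimal sketch using the standard amortization formula is shown below. The interest rate is an assumption made purely for illustration, since the problem statement does not mention one, and `monthly_emi` is a hypothetical helper name.

```python
def monthly_emi(principal, annual_rate, months):
    """Equated monthly installment for a fully amortizing loan.

    principal   -- amount borrowed
    annual_rate -- nominal yearly interest rate as a fraction (assumed value)
    months      -- loan term in months
    """
    if months <= 0:
        raise ValueError("months must be positive")
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        # Without interest the principal is simply split evenly.
        return principal / months
    growth = (1 + monthly_rate) ** months
    return principal * monthly_rate * growth / (growth - 1)


# Same amount and assumed 12% rate, two different terms.
print(round(monthly_emi(100_000, 0.12, 12), 2))
print(round(monthly_emi(100_000, 0.12, 24), 2))
```

For a fixed amount and rate, a longer term lowers the EMI, so the Loan Term and EMI hypotheses can pull in opposite directions. That is worth keeping in mind when these factors are checked against the data.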

---

### <span style="color: #1f77b4;">Getting the System Ready and Loading the Data</span>

We will use Python for this problem, along with the libraries listed below.

#### <span style="font-weight: bold;">Loading Packages</span>

In [1]:
import pandas as pd 
import numpy as np                     # For mathematical calculations 
import seaborn as sns                  # For data visualization 
import matplotlib.pyplot as plt        # For plotting graphs 
%matplotlib inline 
import warnings                        # To ignore any warnings
warnings.filterwarnings("ignore")

#### <span style="font-weight: bold;">Data</span>

For this problem, we have been given two CSV files: train and test.

- <span style="font-weight: bold;">The train file</span> is used for training the model, i.e. our model learns from this file. It contains all the independent variables and the target variable.
- <span style="font-weight: bold;">The test file</span> contains all the independent variables, but not the target variable. We will apply the model to predict the target variable for the test data.
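
One quick way to confirm this relationship is to compare the column sets of the two files. The sketch below does so on two small in-memory frames. The column names are hypothetical stand-ins rather than the real headers, and `target_columns` is an illustrative helper.

```python
import pandas as pd


def target_columns(train, test):
    """Return the columns present in `train` but absent from `test`."""
    return [col for col in train.columns if col not in test.columns]


# Toy frames with made-up column names, standing in for the real CSV files.
train = pd.DataFrame(
    {"income": [5000, 3000], "loan_amount": [120, 66], "status": ["Y", "N"]}
)
test = pd.DataFrame({"income": [4200], "loan_amount": [100]})

print(target_columns(train, test))
```

On the real data, the same call on the loaded train and test frames should return exactly one column, the target.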


In [2]:
# Load the data
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

# # Display some basic information about the datasets
# print("Train Data Info:")
# print(train_data.info())
# print("\nTest Data Info:")
# print(test_data.info())

# # Display the first few rows of the datasets
# print("\nFirst 5 rows of Train Data:")
# print(train_data.head())
# print("\nFirst 5 rows of Test Data:")
# print(test_data.head())

# # Data Exploration and Visualization
# # You can explore and visualize the data using seaborn and matplotlib.

# # For example, to create a histogram of a numeric feature in the train data:
# plt.figure(figsize=(8, 5))
# sns.histplot(train_data['numeric_feature'], kde=True)
# plt.title("Histogram of Numeric Feature")
# plt.xlabel("Value")
# plt.ylabel("Frequency")
# plt.show()

# # Preprocessing
# # Handle missing values, encode categorical variables, and perform other data preprocessing steps.

# # For example, to fill missing values with the mean in a specific column:
# train_data['column_name'] = train_data['column_name'].fillna(train_data['column_name'].mean())

# # Split the data into features and target variable
# X_train = train_data.drop('target', axis=1)
# y_train = train_data['target']

# # Model Training
# # Choose a machine learning model, train it, and evaluate it.

# # For example, to train a simple Logistic Regression classifier
# # (loan approval is a class label, so a regression model such as LinearRegression does not fit):
# from sklearn.linear_model import LogisticRegression
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import accuracy_score

# X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# model = LogisticRegression(max_iter=1000)
# model.fit(X_train, y_train)
# y_pred = model.predict(X_val)

# # Evaluate the model
# accuracy = accuracy_score(y_val, y_pred)
# print("Validation accuracy:", accuracy)

# # Making Predictions
# # Use the trained model to make predictions on the test data.

# # For example, to make predictions on the test data:
# test_predictions = model.predict(test_data)

# # Create a submission DataFrame
# submission = pd.DataFrame({'ID': test_data['ID'], 'PredictedTarget': test_predictions})

# # Save the submission to a CSV file
# submission.to_csv('submission.csv', index=False)
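
`pd.read_csv` raises `FileNotFoundError` when the file is not in the notebook's working directory. A small wrapper can fail early with a message that says where the file was expected. This is only a sketch: `load_csv` is a hypothetical helper, not part of pandas.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV file, failing early with a clearer message if it is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path.name} not found in {csv_path.resolve().parent}; "
            "place the file there or pass its full path."
        )
    return pd.read_csv(csv_path)
```

With this helper, the loading step becomes `train_data = load_csv("train.csv")` and `test_data = load_csv("test.csv")`.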

