## Part - A
In this section, we import the datasets, which are supplied in different file formats, and store each one as a data frame in a Python object. We then gather basic information about the data frames we have created.


**SRN: BP0272399**

**Module : Programming for Data Analysis**

**Date: 18 September 2023**


Importing the necessary libraries: Pandas, for creating and storing data in a data frame as a Python object, and NumPy, for numerical array operations.

In [None]:
#importing modules
import numpy as np
import pandas as pd

### Get data from Excel File
We read the data from the 'Loan_Data_XLS_DEMO.xlsx' input file and store it in the 'data1' object as a data frame. The first rows are then displayed in tabular format to confirm the basic structure of the data frame.

In [None]:
#read and store XLS data in a variable called data1
data1 = pd.read_excel('Loan_Data_XLS_DEMO.xlsx')

In [None]:
data1.head()

### Extract data from PDF to Excel in Python
We now install and import the tabula-py library and use it to convert the tabular data in the 'Loans_Database_Table_PDF_DEMO.pdf' file to .xlsx format. The data from the resulting Excel file is then stored as a data frame in the 'data2' object.

In [None]:
!pip install tabula-py

### Get data from PDF File
The code below extracts the data from the provided PDF file and then writes it to an Excel file.<br />
```
import tabula
variable_name = tabula.read_pdf("File Path", pages = 'all')[0]
variable_name.to_excel('File Path for excel')
```

In [None]:
#importing modules
import tabula

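#read_pdf returns a list of data frames, one per detected table; [0] keeps the first table
convert = tabula.read_pdf("Loans_Database_Table_PDF_DEMO.pdf", pages = 'all')[0]

#convert into Excel File.  This will appear in the Google Colab Files Workspace area
#index=False stops the row index being written as an extra 'Unnamed: 0' column
convert.to_excel('Loans_Database_Table.xlsx', index=False)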

#store from PDF converted XLS data in a variable called data2
data2 = pd.read_excel('Loans_Database_Table.xlsx')

In [None]:
data2.head()

Fetching a summary of data1 (column names, non-null counts and data types)

In [None]:
data1.info()

Fetching a summary of data2 (column names, non-null counts and data types)

In [None]:
data2.info()

Checking the data frame for the number of null values in data1

In [None]:
data1.isna().sum()

Checking the data frame for the number of null values in data2


In [None]:
data2.isna().sum()
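As a minimal sketch of what `isna().sum()` reports, the toy frame below (hypothetical values, not the loan data) has one missing entry in each of two columns. `dropna()` is shown as one possible way to remove the affected rows; whether rows should be dropped or filled in for the real data is an assumption left open here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the calls
toy = pd.DataFrame({
    'Loan_ID': ['LP001', 'LP002', 'LP003'],
    'Gender': [1, np.nan, 0],
    'Credit_History': [1.0, 0.0, np.nan],
})

# Number of missing values in each column
null_counts = toy.isna().sum()
print(null_counts)

# One option: keep only the rows that have no missing values
toy_clean = toy.dropna()
print(toy_clean)
```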


Combining the two data frames into a single one to make the data processing and analysis easier in the following steps.<br /> <br />
Using the Pandas **concat()** function, the rows of the two data frames, data2 and data1, are stacked into a single data frame, data3. Setting ignore_index=True renumbers the rows of the result from 0.

In [None]:
data3 = pd.concat([data2, data1], ignore_index=True)

data3
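The row-wise stacking performed by `concat()` can be sketched on two small frames. The frames `left` and `right` are hypothetical stand-ins for data2 and data1 and share the same columns.

```python
import pandas as pd

# Hypothetical toy frames standing in for data2 and data1
left = pd.DataFrame({'Loan_ID': ['LP001', 'LP002'], 'Loan_Status': ['Y', 'N']})
right = pd.DataFrame({'Loan_ID': ['LP003'], 'Loan_Status': ['Y']})

# Stack the rows of both frames; ignore_index=True renumbers the result from 0
combined = pd.concat([left, right], ignore_index=True)
print(combined)
```

Without `ignore_index=True`, each frame would keep its own row labels, so the combined index would contain repeated values.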

## Part - B
### Data Processing

Data processing is an important part of data analysis. Data collected from people often contains errors and usually needs to be cleaned. If the data is not cleaned, the analysis may be misleading and any model trained on the data may be unreliable.<br /><br />
The rectifications to be carried out on the data are as follows:

*   Remove duplicate values
*   Remove null and empty values
*   Perform sorting
<br /><br />

**Remove duplicate values**<br />
When analyzing data, duplicate rows can distort counts, averages and percentages, which undermines the analysis. During model training, duplicates give the repeated records extra weight and can encourage overfitting, a modelling error in which the fitted function follows the training data points too closely and generalises poorly to new data.

**Remove null and empty values**<br />
Null and empty values are stored by Pandas as missing values (NaN), not as 0. Aggregations such as mean() skip them by default, which can silently change the results and is risky in business analytics. In model training, many scikit-learn estimators, Logistic Regression among them, raise an error when the input contains missing values, so these values must be removed or filled in beforehand.

**Perform sorting**<br />
Sorting re-arranges the rows by a key attribute chosen by the analyst, which makes the data easier to read and inspect.


Checking for any duplicate values in the dataframe that has been produced.

In [None]:
data3.duplicated().sum()
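A minimal sketch of how the duplicate count relates to removing the duplicates, using a hypothetical toy frame in which one row is repeated. `drop_duplicates()` is one way to carry out the removal step listed above; it is an illustration, not a cell from the original analysis.

```python
import pandas as pd

# Hypothetical toy data: the LP002 row appears twice
toy = pd.DataFrame({
    'Loan_ID': ['LP001', 'LP002', 'LP002', 'LP003'],
    'Loan_Status': ['Y', 'N', 'N', 'Y'],
})

# duplicated() flags the second and later copies of a fully identical row
n_duplicates = toy.duplicated().sum()
print(n_duplicates)

# drop_duplicates() keeps the first copy of each row by default
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```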

Sorting the data frame in ascending order of Loan_ID, the primary attribute. Note that sort_values() returns a sorted copy for display; data3 itself is left unchanged.

In [None]:
data3.sort_values('Loan_ID')
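The sketch below, on a hypothetical toy frame, shows that `sort_values()` returns a new frame and that the result has to be assigned to a name if the sorted order is to be kept.

```python
import pandas as pd

# Hypothetical toy data in unsorted order
toy = pd.DataFrame({
    'Loan_ID': ['LP003', 'LP001', 'LP002'],
    'Loan_Status': ['Y', 'N', 'Y'],
})

# sort_values returns a sorted copy; assign it to keep the sorted order
sorted_toy = toy.sort_values('Loan_ID').reset_index(drop=True)
print(sorted_toy)

# The original frame still has its original row order
print(toy)
```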

### For Model Training
The data_for_model object is constructed according to the assessment handbook's guidelines and is used in the following sections. The cell below also displays the data_for_model object.

In [None]:
data_for_model = data3[['Loan_ID','Gender','Married','Dependents','Graduate','Self_Employed','Credit_History','Property_Area','Loan_Status']].copy()
data_for_model

## Part - C
### For Data Analysis

Data analysis is a systematic process of evaluating and summarising data using logical and statistical methods. Data visualization often accompanies it, because charts make the resulting patterns easier to see.

Data analysis is a growing field because it allows for the extraction of vital information from large amounts of data. Knowing about the latest trends, for example, is critical for e-commerce companies like Amazon, which want their customers to use their services more frequently. This is only possible if we can glean useful information from data on products purchased recently by customers at a specific time of year.

Furthermore, data analysis is critical for obtaining usable information for training machine learning models. The insights gained from the analysis determine which task the ML model should perform.

In [None]:
analysis = data3[['Loan_ID','Gender','Graduate','Self_Employed','ApplicantIncome','Loan_Status']].copy()

In [None]:
analysis

**Solutions for Problem Statements**

Problem statement 1 - The percentage of female applicants that had their loan approved

In [None]:
statement1 = analysis[analysis['Gender'] == 1]

#approved applications as a percentage of the rows with a recorded Loan_Status
solution1 = (statement1['Loan_Status'] == 'Y').sum()*100/statement1['Loan_Status'].count()

print(f'Percentage: {solution1}%')
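The percentage is the number of approved applications in the group divided by the number of applications in the group, times 100. A minimal sketch on hypothetical toy data follows; it assumes, as the cell above does, that `Gender == 1` selects the group of interest.

```python
import pandas as pd

# Hypothetical toy data: four rows with Gender == 1, three of them approved
toy = pd.DataFrame({
    'Gender': [1, 1, 1, 1, 0, 0],
    'Loan_Status': ['Y', 'Y', 'Y', 'N', 'N', 'Y'],
})

group = toy[toy['Gender'] == 1]

# Count the approved rows and the rows with a recorded status
approved = (group['Loan_Status'] == 'Y').sum()
total = group['Loan_Status'].count()

percentage = approved * 100 / total
print(f'Percentage: {percentage}%')
```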

Problem statement 2 - The average income of all applicants

In [None]:
solution2 = analysis[['ApplicantIncome']].mean()

print(solution2)

Problem statement 3 - The average income of all applicants that are self-employed

In [None]:
statement3 = analysis[analysis['Self_Employed'] == 1]

solution3 = statement3[['ApplicantIncome']].mean()
print(solution3)

Problem statement 4 - The average income of all applicants that are not self-employed

In [None]:
statement4 = analysis[analysis['Self_Employed'] == 0]

solution4 = statement4[['ApplicantIncome']].mean()
print(solution4)
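Problem statements 3 and 4 filter on the two values of the same column, so both averages can also be obtained in one step with `groupby()`. The sketch below uses hypothetical toy values and is an alternative formulation, not the method used in the cells above.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Self_Employed': [1, 0, 1, 0],
    'ApplicantIncome': [6000, 3000, 4000, 5000],
})

# Mean income for each value of Self_Employed
mean_income = toy.groupby('Self_Employed')['ApplicantIncome'].mean()
print(mean_income)
```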

Problem statement 5 - The average income of all graduate applicants

In [None]:
statement5 = analysis[analysis['Graduate'] == 1]

solution5 = statement5[['ApplicantIncome']].mean()
print(solution5)

Problem statement 6 - The percentage of graduate applicants that had their loan status approved

In [None]:
statement6 = analysis[analysis['Graduate'] == 1]

#approved applications as a percentage of the rows with a recorded Loan_Status
solution6 = (statement6['Loan_Status'] == 'Y').sum()*100/statement6['Loan_Status'].count()

print(f'Percentage: {solution6}%')

## Saving the dataframe (data3) as an Excel File (myfile.xlsx)

In [None]:
#Install XlsxWriter
!pip install XlsxWriter
#specify a writer from Pandas; the with-block saves and closes the file on exit
#(writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter') as writer:
    #write the dataframe (data3) to a file
    data3.to_excel(writer, sheet_name='Sheet1')

## Part - D
This section explains how to train a Machine Learning [ML] model to predict whether a Loan Application will be approved or denied in the future depending on the information provided by the applicant.

A machine learning model is trained to automate decisions that would otherwise require human effort. A well-trained model can also complete the work quickly and consistently. We must train an ML model before we can use it. There are three main approaches to training an ML model:


*   Supervised Learning -> learning from labelled examples
*   Unsupervised Learning -> learning patterns from unlabelled data
*   Reinforcement Learning -> learning from rewards

**Supervised Learning**

It is a type of learning in which labelled datasets are used to train an ML model. Part of the labelled data is used to train the model, and the remainder is used to test the model's accuracy.

**Unsupervised Learning**

In this type of learning, we do not supervise the ML model by giving it labelled data. The goal is to detect patterns in the data. More specifically, unsupervised learning is concerned with getting a machine to find structure in a dataset on its own.

**Reinforcement Learning**

Reinforcement Learning is a type of learning in which the model is trained much as one would train a pet. The model receives a reward when the outcome is correct and no reward when the outcome is incorrect. The model is set up so that, like a pet, it tries to collect as many rewards as possible.

### Model Training
The ML model is developed using **Supervised Learning** with the data provided with the assessment, while adhering to the guidelines outlined in the assessment manual.

Logistic Regression is used to predict whether or not an application is approved for the loan. It is a natural choice here because the target, Loan_Status, has two classes (approved or not approved).

In [None]:
#Loading Modules
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

Displaying the data in tabular format

In [None]:
data_for_model.head()

Converting the data frame into X and Y NumPy arrays. X holds every column except Loan_ID and Loan_Status, and Y holds Loan_Status.

In [None]:
x_data=data_for_model.iloc[:,1:-1].values
y_data=data_for_model.iloc[:,-1].values
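The `iloc` slices can be checked on a small frame. In this sketch the column names mirror the real data, but the values are hypothetical; `1:-1` drops the first column (the ID) and the last column (the label), while `-1` selects only the label.

```python
import pandas as pd

# Hypothetical toy data with an ID column first and the label column last
toy = pd.DataFrame({
    'Loan_ID': ['LP001', 'LP002'],
    'Gender': [1, 0],
    'Credit_History': [1, 0],
    'Loan_Status': ['Y', 'N'],
})

# Feature matrix: every column except the first and the last
features = toy.iloc[:, 1:-1].values
# Label vector: the last column only
labels = toy.iloc[:, -1].values

print(features.shape, labels.shape)
```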

Printing the X array along with its shape

In [None]:
print('X data: ', x_data)
print('X shape:', x_data.shape)

Printing the Y array along with its shape

In [None]:
print('Y data:', y_data)
print('Y shape:', y_data.shape)

Splitting the X array into x_train and x_test arrays.

Similarly, splitting the Y array into y_train and y_test arrays.

Here, the train data and test data are in the ratio of 70 : 30 (test_size=0.30), and the same split is applied to both X and Y.

In [None]:
x_train, x_test, y_train, y_test = model_selection.train_test_split(x_data, y_data, test_size=0.30, random_state=7)
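A minimal sketch of the split on hypothetical data with ten rows, showing how `test_size=0.30` divides the rows between the training and test sets. The arrays `x` and `y` are invented for the illustration.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical data: 10 samples with 2 features each, and 10 labels
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 30% of the rows for testing; random_state makes the split repeatable
x_tr, x_te, y_tr, y_te = model_selection.train_test_split(
    x, y, test_size=0.30, random_state=7
)

print(len(x_tr), len(x_te))
```

The features and labels are shuffled together, so each row of `x_tr` still lines up with its label in `y_tr`.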

We create a Logistic Regression model and fit it with x_train and y_train, which trains it. The model then predicts labels for the x_test array, and the predicted values are compared with y_test to produce an accuracy score for our model.

In [None]:
model = LogisticRegression()
model.fit(x_train,y_train)
predictions = model.predict(x_test)
score = accuracy_score(y_test, predictions)
print(f'Accuracy: {round(score*100,3)}%')

Our ML model achieved an accuracy of 85.542% on the test data.
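For reference, the accuracy score is the fraction of test rows whose predicted label matches the true label. The sketch below checks this on hypothetical label arrays; it does not reproduce the figure reported above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels: 4 of the 5 predictions match
y_true = np.array(['Y', 'N', 'Y', 'Y', 'N'])
y_pred = np.array(['Y', 'N', 'N', 'Y', 'N'])

# Accuracy computed by hand: share of positions where the labels agree
manual_accuracy = (y_true == y_pred).mean()

# The same quantity from scikit-learn
toy_score = accuracy_score(y_true, y_pred)

print(f'Accuracy: {round(toy_score*100,3)}%')
```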