Dream Housing Finance is a credit company specializing in home loans. It operates in all urban, semi-urban and rural areas. For every customer who applies for a home loan, the company must check that applicant's eligibility.

The company wants to automate the loan eligibility process based on the details the customer provides when filling in the online application form.

These details include gender, marital status, education, number of dependents, income, loan amount, credit history and others. To automate the process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can target those customers specifically. A partial dataset is provided here.

The variable to predict is therefore *Loan_Status*.


The data are described below:

**The Data**  

*Variable* 	: *Description* 

Loan_ID: 	Unique Loan ID     
Gender: 	Male/ Female  
Married: 	Applicant married (Y/N)  
Dependents: 	Number of dependents  
Education: 	Applicant Education (Graduate/ Under Graduate)  
Self_Employed: 	Self employed (Y/N)  
ApplicantIncome: 	Applicant income  
CoapplicantIncome: 	Coapplicant income    
LoanAmount: 	Loan amount in thousands  
Loan_Amount_Term: 	Term of loan in months  
Credit_History: 	credit history meets guidelines  
Property_Area: 	Urban/ Semi Urban/ Rural  
Loan_Status: 	Loan approved (Y/N)  

**Challenge:**


Carry out a graphical analysis to explore the data.  
Build a first model to predict the target variable, and evaluate its performance with 5-fold cross-validation.


In [0]:
import pandas as pd

url = "https://raw.githubusercontent.com/shri1407/Loan-Prediction-Dataset/master/train.csv"
loan = pd.read_csv(url)
loan.head(3)


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y


In [0]:
loan.describe(include='all')

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
count,614,601,611,599.0,614,582,614.0,614.0,592.0,600.0,564.0,614,614
unique,614,2,2,4.0,2,2,,,,,,3,2
top,LP001616,Male,Yes,0.0,Graduate,No,,,,,,Semiurban,Y
freq,1,489,398,345.0,480,500,,,,,,233,422
mean,,,,,,,5403.459283,1621.245798,146.412162,342.0,0.842199,,
std,,,,,,,6109.041673,2926.248369,85.587325,65.12041,0.364878,,
min,,,,,,,150.0,0.0,9.0,12.0,0.0,,
25%,,,,,,,2877.5,0.0,100.0,360.0,1.0,,
50%,,,,,,,3812.5,1188.5,128.0,360.0,1.0,,
75%,,,,,,,5795.0,2297.25,168.0,360.0,1.0,,


In [0]:
loan.dtypes

Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

In [0]:
loan.shape

(614, 13)

In [0]:
loan.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
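
The challenge asks for a graphical exploration, which the cells in this notebook do not show. A minimal sketch of what it could look like follows. The `sample` frame and its values are hypothetical and only keep the example self-contained; with the real data, the `loan` frame loaded above would take its place. The choice of plots (approval share by `Credit_History`, distribution of `LoanAmount`) is an assumption, not the author's analysis.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows for illustration only; replace `sample` with `loan` on the real data.
sample = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "LoanAmount": [128.0, 66.0, 120.0, 141.0, 267.0, 95.0],
    "Loan_Status": ["N", "Y", "N", "Y", "N", "Y"],
})

# Share of refused (N) and approved (Y) loans per credit-history value; each row sums to 1.
approval = pd.crosstab(sample["Credit_History"], sample["Loan_Status"], normalize="index")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
approval.plot(kind="bar", stacked=True, ax=axes[0], title="Loan_Status by Credit_History")
sample["LoanAmount"].plot(kind="hist", bins=5, ax=axes[1], title="LoanAmount distribution")
fig.tight_layout()
# In a notebook the figure renders inline; in a script, call plt.show().
```

A stacked bar of row-normalized counts makes it easy to compare approval rates between groups of different sizes.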

In [0]:
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing

In [0]:

# from sklearn import preprocessing
# le = preprocessing.LabelEncoder()
# le.fit(loan['Gender'])
# LabelEncoder()
# list(le.classes_)
# loan['sex']=le.transform(loan['Gender'])

In [0]:
loan.head(3)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y


In [0]:
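# Keep one 0/1 indicator per categorical column (first level dropped).
# Caveats: get_dummies encodes missing values as 0, and the three-level
# Property_Area is reduced to a single indicator (Semiurban vs. the rest).
colonne = ['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area']
for col in colonne:
    loan[col] = pd.get_dummies(loan[col], drop_first=True).iloc[:, 0].astype(int)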
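
The loop above stores a single indicator per column, so a column with three levels such as `Property_Area` cannot be fully represented. A possible alternative, sketched here on a hypothetical `toy` frame, is to let `pd.get_dummies` expand the whole frame so that each level beyond the first gets its own column.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
})

# One indicator column per level, minus the first (alphabetical) level of each variable.
encoded = pd.get_dummies(
    toy, columns=["Gender", "Property_Area"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
# ['Gender_Male', 'Property_Area_Semiurban', 'Property_Area_Urban']
```

With this layout, a `Rural` row is the one where both `Property_Area_*` indicators are 0.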

In [0]:
loan.head(3)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,1,0,0,0,0,5849,0.0,,360.0,1.0,0,Y
1,LP001003,1,1,1,0,0,4583,1508.0,128.0,360.0,1.0,0,N
2,LP001005,1,1,0,0,1,3000,0.0,66.0,360.0,1.0,0,Y


In [0]:
loan_clean = loan.dropna()

In [0]:
loan_clean

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,LP001003,1,1,1,0,0,4583,1508.0,128.0,360.0,1.0,0,N
2,LP001005,1,1,0,0,1,3000,0.0,66.0,360.0,1.0,0,Y
3,LP001006,1,1,0,1,0,2583,2358.0,120.0,360.0,1.0,0,Y
4,LP001008,1,0,0,0,0,6000,0.0,141.0,360.0,1.0,0,Y
5,LP001011,1,1,2,0,1,5417,4196.0,267.0,360.0,1.0,0,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,0,0,0,0,0,2900,0.0,71.0,360.0,1.0,0,Y
610,LP002979,1,1,3+,0,0,4106,0.0,40.0,180.0,1.0,0,Y
611,LP002983,1,1,1,0,0,8072,240.0,253.0,360.0,1.0,0,Y
612,LP002984,1,1,2,0,0,7583,0.0,187.0,360.0,1.0,0,Y
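
Dropping every incomplete row, as done above, is the simplest option but discards observations. One common alternative, shown here only as a sketch on a hypothetical `toy` frame, is to fill numeric gaps with the median and categorical-like gaps with the most frequent value. Whether this is appropriate for `Credit_History` in particular is a modelling assumption that would need checking.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "LoanAmount": [128.0, np.nan, 66.0, 120.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

filled = toy.copy()
# Continuous variable: the median is robust to large loan amounts.
filled["LoanAmount"] = filled["LoanAmount"].fillna(filled["LoanAmount"].median())
# 0/1 flag: use the most frequent value.
filled["Credit_History"] = filled["Credit_History"].fillna(filled["Credit_History"].mode()[0])

print(len(toy.dropna()), "rows survive dropna;", len(filled), "rows survive imputation")
```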


In [0]:
dependents= loan_clean['Dependents'].apply(str)

In [0]:
# Encode the Dependents labels (strings such as '0' and '3+') as integer codes.
le = preprocessing.LabelEncoder()
le.fit(dependents)
# assign() returns a new frame, which avoids pandas' SettingWithCopyWarning.
loan_clean = loan_clean.assign(Dependents=le.transform(dependents))


In [0]:
loan_clean

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,LP001003,1,1,1,0,0,4583,1508.0,128.0,360.0,1.0,0,N
2,LP001005,1,1,0,0,1,3000,0.0,66.0,360.0,1.0,0,Y
3,LP001006,1,1,0,1,0,2583,2358.0,120.0,360.0,1.0,0,Y
4,LP001008,1,0,0,0,0,6000,0.0,141.0,360.0,1.0,0,Y
5,LP001011,1,1,2,0,1,5417,4196.0,267.0,360.0,1.0,0,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,0,0,0,0,0,2900,0.0,71.0,360.0,1.0,0,Y
610,LP002979,1,1,3,0,0,4106,0.0,40.0,180.0,1.0,0,Y
611,LP002983,1,1,1,0,0,8072,240.0,253.0,360.0,1.0,0,Y
612,LP002984,1,1,2,0,0,7583,0.0,187.0,360.0,1.0,0,Y
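
`LabelEncoder` assigns codes in the sorted order of the labels, which is why `3+` ends up as 3 here. A more explicit way to get the same numbers, assuming the labels are the strings `'0'`, `'1'`, `'2'` and `'3+'`, is to strip the `+` and convert to integers. The series below is hypothetical.

```python
import pandas as pd

# Hypothetical labels for illustration only.
dependents_raw = pd.Series(["0", "1", "3+", "2", "0"])

# Remove the literal '+' so that '3+' becomes '3', then cast to int.
dependents_num = dependents_raw.str.replace("+", "", regex=False).astype(int)
print(dependents_num.tolist())  # [0, 1, 3, 2, 0]
```

This keeps the ordinal meaning visible in the code and does not depend on how the labels happen to sort.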


In [0]:
from sklearn.model_selection import train_test_split

In [0]:
# Features: every column except the identifier and the target.
X = loan_clean.drop(['Loan_ID', 'Loan_Status'], axis=1)
y = loan_clean['Loan_Status']



In [0]:
from sklearn.linear_model import LogisticRegression

In [0]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.33, random_state=42)

In [0]:
modelLR= LogisticRegression().fit(X_train, y_train)

In [0]:
y_predict = modelLR.predict(X_test)
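
The predictions on the held-out split are computed above but not scored in this notebook. A minimal sketch of how they could be evaluated follows. It uses synthetic data from `make_classification` as a stand-in (an assumption made only so the example runs on its own); with the real data, `y_test` and `y_predict` would be passed to the same metric functions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = model.predict(X_te)

acc = accuracy_score(y_te, y_hat)   # fraction of correct predictions
cm = confusion_matrix(y_te, y_hat)  # rows: true class, columns: predicted class
print(acc)
print(cm)
```

Because roughly two thirds of the loans in this dataset are approved (422 of 614), the confusion matrix is worth reading alongside accuracy: a model that always predicts `Y` would already score about 0.69.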

In [0]:
modelLR

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
from sklearn.model_selection import cross_val_score
print(cross_val_score(modelLR, X_train, y_train, cv=5))


[0.72857143 0.82608696 0.72463768 0.7826087  0.84057971]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
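
The five fold scores printed above average to roughly 0.78. The convergence warning suggests scaling the data or raising `max_iter`; a sketch of the scaling option is given below, using a pipeline so that the scaler is refit inside each fold. The synthetic data and the inflated first feature are assumptions that only mimic unscaled income-like columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scores reported by the cell above.
reported = np.array([0.72857143, 0.82608696, 0.72463768, 0.7826087, 0.84057971])
print(round(reported.mean(), 4))

# Synthetic stand-in data; the first feature is blown up to mimic an income column.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_demo[:, 0] *= 5000

# Standardize features, then fit the logistic regression, within each fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv)
print(scores.mean(), scores.std())
```

Reporting the mean together with the spread across folds gives a fairer picture than the raw list of five numbers.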