## Label Encoding
### This notebook demonstrates how to use Label Encoding
### Label Encoding assigns a unique integer to each distinct value, numbered in sorted (alphabetical) order
Dataset: [https://github.com/subashgandyer/datasets/blob/main/loan_prediction.zip]
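
As a rough illustration of the rule stated above, the following sketch builds the mapping by hand in plain Python: it sorts the distinct values and numbers them from 0. The `values` list is a small hand-picked sample of `Property_Area` values, not the full dataset, and the code is not meant to mirror any library's internals.

```python
# Hand-rolled illustration of label encoding (not scikit-learn's implementation).
values = ["Urban", "Semiurban", "Semiurban", "Urban", "Rural"]

# Sort the distinct values, then number them starting from 0.
mapping = {label: code for code, label in enumerate(sorted(set(values)))}
encoded = [mapping[v] for v in values]

print(mapping)   # {'Rural': 0, 'Semiurban': 1, 'Urban': 2}
print(encoded)   # [2, 1, 1, 2, 0]
```

Note that the integers only reflect sort order; they carry no meaning about how the categories relate to each other.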

In [37]:
import pandas as pd
import numpy as np

In [38]:
data = pd.read_csv("X_train.csv")
data

,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001032,Male,No,0,Graduate,No,4950,0.0,125,360,1,Urban
1,LP001824,Male,Yes,1,Graduate,No,2882,1843.0,123,480,1,Semiurban
2,LP002928,Male,Yes,0,Graduate,No,3000,3416.0,56,180,1,Semiurban
3,LP001814,Male,Yes,2,Graduate,No,9703,0.0,112,360,1,Urban
4,LP002244,Male,Yes,0,Graduate,No,2333,2417.0,136,360,1,Urban
...,...,...,...,...,...,...,...,...,...,...,...,...
379,LP002585,Male,Yes,0,Graduate,No,3597,2157.0,119,360,0,Rural
380,LP001841,Male,No,0,Not Graduate,Yes,2583,2167.0,104,360,1,Rural
381,LP002820,Male,Yes,0,Graduate,No,5923,2054.0,211,360,1,Rural
382,LP001744,Male,No,0,Graduate,No,2971,2791.0,144,360,1,Semiurban


### How many categorical variables are in the dataset, and what are they?

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 384 entries, 0 to 383
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            384 non-null    object 
 1   Gender             384 non-null    object 
 2   Married            384 non-null    object 
 3   Dependents         384 non-null    object 
 4   Education          384 non-null    object 
 5   Self_Employed      384 non-null    object 
 6   ApplicantIncome    384 non-null    int64  
 7   CoapplicantIncome  384 non-null    float64
 8   LoanAmount         384 non-null    int64  
 9   Loan_Amount_Term   384 non-null    int64  
 10  Credit_History     384 non-null    int64  
 11  Property_Area      384 non-null    object 
dtypes: float64(1), int64(4), object(7)
memory usage: 36.1+ KB
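
Besides reading the `Dtype` column above, one possible way to answer the question programmatically is to select the non-numeric columns. The sketch below uses a small hand-built frame with a few of the dataset's columns (an assumption made so the example runs without `X_train.csv`); on the full data, the output above reports seven `object` columns.

```python
import pandas as pd

# Small stand-in for the real data, built from three rows shown earlier.
sample = pd.DataFrame({
    "Loan_ID": ["LP001032", "LP001824", "LP002928"],
    "Gender": ["Male", "Male", "Male"],
    "Dependents": ["0", "1", "0"],
    "ApplicantIncome": [4950, 2882, 3000],
    "CoapplicantIncome": [0.0, 1843.0, 3416.0],
    "Property_Area": ["Urban", "Semiurban", "Semiurban"],
})

# Everything that is not a numeric dtype is treated as categorical here.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(len(categorical_cols), categorical_cols)
```

A non-numeric dtype is only a first hint: `Loan_ID` is an identifier rather than a category, and `Dependents` is stored as text only because of the `3+` value.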


### Import label encoder 

### The label_encoder object learns the mapping from word labels to integers

### Encode labels in column 'Property_Area'. 
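
The three steps named in the headings above might look like the following sketch. It assumes scikit-learn's `LabelEncoder` is the encoder intended here, and it uses a short hand-built `Property_Area` column in place of the full `data` frame so that it runs on its own.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for the real frame: a few Property_Area values from the dataset.
data = pd.DataFrame(
    {"Property_Area": ["Urban", "Semiurban", "Semiurban", "Urban", "Rural"]}
)

# fit_transform learns the sorted set of labels and returns their integer codes.
label_encoder = LabelEncoder()
data["Property_Area_Clean"] = label_encoder.fit_transform(data["Property_Area"])

print(list(label_encoder.classes_))  # ['Rural', 'Semiurban', 'Urban']
print(data)
```

The position of each label in `classes_` is its integer code, which matches the `Rural` = 0, `Semiurban` = 1, `Urban` = 2 values in the output that follows.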

,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Property_Area_Clean
0,LP001032,Male,No,0,Graduate,No,4950,0.0,125,360,1,Urban,2
1,LP001824,Male,Yes,1,Graduate,No,2882,1843.0,123,480,1,Semiurban,1
2,LP002928,Male,Yes,0,Graduate,No,3000,3416.0,56,180,1,Semiurban,1
3,LP001814,Male,Yes,2,Graduate,No,9703,0.0,112,360,1,Urban,2
4,LP002244,Male,Yes,0,Graduate,No,2333,2417.0,136,360,1,Urban,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
379,LP002585,Male,Yes,0,Graduate,No,3597,2157.0,119,360,0,Rural,0
380,LP001841,Male,No,0,Not Graduate,Yes,2583,2167.0,104,360,1,Rural,0
381,LP002820,Male,Yes,0,Graduate,No,5923,2054.0,211,360,1,Rural,0
382,LP001744,Male,No,0,Graduate,No,2971,2791.0,144,360,1,Semiurban,1


In [43]:
data[['Property_Area', 'Property_Area_Clean']]

,Property_Area,Property_Area_Clean
0,Urban,2
1,Semiurban,1
2,Semiurban,1
3,Urban,2
4,Urban,2
...,...,...
379,Rural,0
380,Rural,0
381,Rural,0
382,Semiurban,1


## Convert all other categorical variables into numerical variables
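
A minimal sketch of this step, assuming the same `LabelEncoder` approach is repeated for `Gender`, `Education`, and `Self_Employed` (the columns that have a `_Clean` counterpart in the `data.columns` output further down). The frame below is a hand-built stand-in, and the `encoders` dictionary is a hypothetical helper for keeping each fitted encoder around.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in rows; the real notebook would use the full `data` frame instead.
data = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Education": ["Graduate", "Not Graduate", "Graduate"],
    "Self_Employed": ["No", "No", "Yes"],
})

# Keep one fitted encoder per column so the same mapping can be reused later.
encoders = {}
for col in ["Gender", "Education", "Self_Employed"]:
    encoders[col] = LabelEncoder()
    data[col + "_Clean"] = encoders[col].fit_transform(data[col])

print(data)
```

Using a separate encoder per column matters because each call to `fit` replaces whatever mapping the encoder had learned before.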

In [44]:

data.head()

    Loan_ID Gender Married Dependents Education Self_Employed  \
0  LP001032   Male      No          0  Graduate            No   
1  LP001824   Male     Yes          1  Graduate            No   
2  LP002928   Male     Yes          0  Graduate            No   
3  LP001814   Male     Yes          2  Graduate            No   
4  LP002244   Male     Yes          0  Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             4950                0.0         125               360   
1             2882             1843.0         123               480   
2             3000             3416.0          56               180   
3             9703                0.0         112               360   
4             2333             2417.0         136               360   

   Credit_History Property_Area  Property_Area_Clean  Gender_Clean  \
0               1         Urban                    2             1   
1               1     Semiurban           

In [45]:
data.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area',
       'Property_Area_Clean', 'Gender_Clean', 'Education_Clean',
       'Self_Employed_Clean'],
      dtype='object')

In [46]:
data = data[['Dependents', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area_Clean', 'Gender_Clean', 'Education_Clean',
       'Self_Employed_Clean']]
data

,Dependents,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area_Clean,Gender_Clean,Education_Clean,Self_Employed_Clean
0,0,4950,0.0,125,360,1,2,1,0,0
1,1,2882,1843.0,123,480,1,1,1,0,0
2,0,3000,3416.0,56,180,1,1,1,0,0
3,2,9703,0.0,112,360,1,2,1,0,0
4,0,2333,2417.0,136,360,1,2,1,0,0
...,...,...,...,...,...,...,...,...,...,...
379,0,3597,2157.0,119,360,0,0,1,0,0
380,0,2583,2167.0,104,360,1,0,1,1,1
381,0,5923,2054.0,211,360,1,0,1,0,0
382,0,2971,2791.0,144,360,1,1,1,0,0


In [47]:
data.Dependents.value_counts()

0     225
2      69
1      61
3+     29
Name: Dependents, dtype: int64

### Convert 3+ value into 3 for Dependents

In [48]:
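def clean_dep(x):
    # Strip the trailing "+" so that "3+" becomes 3; other values convert directly
    return int(str(x).rstrip('+'))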

### Apply the function clean_dep

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Dependents_Clean'] = data['Dependents'].apply(clean_dep)
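
The warning above typically appears when a column is assigned on a frame that was itself produced by selecting columns from another frame, as in cell `In [46]`. One common way to avoid it, sketched here under the assumption that an independent copy is acceptable, is to call `.copy()` on the selection before adding the new column. The `raw` frame, its values, and this body of `clean_dep` are illustrative.

```python
import pandas as pd

def clean_dep(x):
    # "3+" becomes 3; plain digit strings convert directly.
    return int(str(x).rstrip("+"))

# Invented values for illustration only.
raw = pd.DataFrame({
    "Dependents": ["0", "1", "2", "3+"],
    "ApplicantIncome": [4950, 2882, 9703, 5167],
    "Married": ["No", "Yes", "Yes", "Yes"],
})

# .copy() makes `data` an independent frame, so the assignment below is unambiguous.
data = raw[["Dependents", "ApplicantIncome"]].copy()
data["Dependents_Clean"] = data["Dependents"].apply(clean_dep)

print(data)
```

The original `raw` frame is left untouched; only `data` gains the `Dependents_Clean` column.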


,Dependents,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area_Clean,Gender_Clean,Education_Clean,Self_Employed_Clean,Dependents_Clean
0,0,4950,0.0,125,360,1,2,1,0,0,0
1,1,2882,1843.0,123,480,1,1,1,0,0,1
2,0,3000,3416.0,56,180,1,1,1,0,0,0
3,2,9703,0.0,112,360,1,2,1,0,0,2
4,0,2333,2417.0,136,360,1,2,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
379,0,3597,2157.0,119,360,0,0,1,0,0,0
380,0,2583,2167.0,104,360,1,0,1,1,1,0
381,0,5923,2054.0,211,360,1,0,1,0,0,0
382,0,2971,2791.0,144,360,1,1,1,0,0,0


### Select the numerical variables as the feature matrix

In [50]:
data = data[['Dependents_Clean', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area_Clean', 'Gender_Clean', 'Education_Clean',
       'Self_Employed_Clean']]
data

,Dependents_Clean,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area_Clean,Gender_Clean,Education_Clean,Self_Employed_Clean
0,0,4950,0.0,125,360,1,2,1,0,0
1,1,2882,1843.0,123,480,1,1,1,0,0
2,0,3000,3416.0,56,180,1,1,1,0,0
3,2,9703,0.0,112,360,1,2,1,0,0
4,0,2333,2417.0,136,360,1,2,1,0,0
...,...,...,...,...,...,...,...,...,...,...
379,0,3597,2157.0,119,360,0,0,1,0,0
380,0,2583,2167.0,104,360,1,0,1,1,1
381,0,5923,2054.0,211,360,1,0,1,0,0
382,0,2971,2791.0,144,360,1,1,1,0,0


In [51]:
print(X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)

(384, 12) (384, 1) (96, 12) (96, 1)


### Import LogisticRegression and Accuracy Metrics

### Create a Logistic Regression model and train it



LogisticRegression(C=0.1)
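
A minimal sketch of these two steps, assuming the estimator is scikit-learn's `LogisticRegression` with `C=0.1`, as the printed repr above suggests. The toy `X_train` and `Y_train` below, including the `Target` column name, are invented so the example runs; they are not taken from the loan dataset. Passing the labels as a one-dimensional array (here via `.values.ravel()`) avoids the conversion warning scikit-learn emits for a single-column frame such as the `(384, 1)` shape printed earlier.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration only.
X_train = pd.DataFrame({
    "Credit_History": [1, 1, 0, 1, 0, 0, 1, 0],
    "Education_Clean": [0, 1, 0, 0, 1, 0, 1, 1],
})
Y_train = pd.DataFrame({"Target": [1, 1, 0, 1, 0, 0, 1, 0]})

# C is the inverse regularization strength; smaller values regularize more.
model = LogisticRegression(C=0.1)
model.fit(X_train, Y_train.values.ravel())  # ravel() turns (n, 1) into (n,)

predictions = model.predict(X_train)
print(predictions)
```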

### Accuracy score

ValueError: X has 12 features per sample; expecting 10

### What is the issue?
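
One reading of the error: the model was fitted on the 10 encoded columns selected in cell `In [50]`, while the raw test frame still has its original 12 columns. The test set therefore needs the same preprocessing as the training set, using the encoders fitted on the training data. The sketch below shows the idea on tiny invented frames; `preprocess`, `encoders`, `CATEGORICAL`, and the reduced column list are hypothetical names chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

CATEGORICAL = ["Property_Area", "Gender"]

def preprocess(df, encoders):
    """Apply already-fitted encoders and keep only the model's feature columns."""
    out = df.copy()
    for col in CATEGORICAL:
        out[col + "_Clean"] = encoders[col].transform(out[col])
    out["Dependents_Clean"] = out["Dependents"].str.rstrip("+").astype(int)
    return out[["Dependents_Clean", "ApplicantIncome",
                "Property_Area_Clean", "Gender_Clean"]]

# Invented rows for illustration only.
train = pd.DataFrame({
    "Loan_ID": ["A1", "A2", "A3"],
    "Gender": ["Male", "Female", "Male"],
    "Dependents": ["0", "1", "3+"],
    "ApplicantIncome": [4950, 2882, 3597],
    "Property_Area": ["Urban", "Semiurban", "Rural"],
})
test = pd.DataFrame({
    "Loan_ID": ["B1", "B2"],
    "Gender": ["Female", "Male"],
    "Dependents": ["0", "3+"],
    "ApplicantIncome": [3400, 39999],
    "Property_Area": ["Rural", "Semiurban"],
})

# Fit the encoders on the training data only, then reuse them for both sets.
encoders = {col: LabelEncoder().fit(train[col]) for col in CATEGORICAL}
X_train_clean = preprocess(train, encoders)
X_test_clean = preprocess(test, encoders)

print(X_train_clean.shape, X_test_clean.shape)
```

Reusing the fitted encoders keeps the integer codes consistent between the two sets. A label that appears only in the test data would make `transform` raise an error, which is one reason to check the categories in both files.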

,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP002684,Female,No,0,Not Graduate,No,3400,0,95,360,1,Rural
1,LP001907,Male,Yes,0,Graduate,No,14583,0,436,360,1,Semiurban
2,LP001205,Male,Yes,0,Graduate,No,2500,3796,120,360,1,Urban
3,LP001275,Male,Yes,1,Graduate,No,3988,0,50,240,1,Urban
4,LP002455,Male,Yes,2,Graduate,No,3859,0,96,360,1,Semiurban
...,...,...,...,...,...,...,...,...,...,...,...,...
91,LP001536,Male,Yes,3+,Graduate,No,39999,0,600,180,0,Semiurban
92,LP001367,Male,Yes,1,Graduate,No,3052,1030,100,360,1,Urban
93,LP002160,Male,Yes,3+,Graduate,No,5167,3167,200,360,1,Semiurban
94,LP002964,Male,Yes,2,Not Graduate,No,3987,1411,157,360,1,Rural


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Loan_ID            96 non-null     object
 1   Gender             96 non-null     object
 2   Married            96 non-null     object
 3   Dependents         96 non-null     object
 4   Education          96 non-null     object
 5   Self_Employed      96 non-null     object
 6   ApplicantIncome    96 non-null     int64 
 7   CoapplicantIncome  96 non-null     int64 
 8   LoanAmount         96 non-null     int64 
 9   Loan_Amount_Term   96 non-null     int64 
 10  Credit_History     96 non-null     int64 
 11  Property_Area      96 non-null     object
dtypes: int64(5), object(7)
memory usage: 9.1+ KB


,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Property_Area_Clean,Gender_Clean,Education_Clean,Self_Employed_Clean,Dependents_Clean
0,LP002684,Female,No,0,Not Graduate,No,3400,0,95,360,1,Rural,0,0,1,0,0
1,LP001907,Male,Yes,0,Graduate,No,14583,0,436,360,1,Semiurban,1,1,0,0,0
2,LP001205,Male,Yes,0,Graduate,No,2500,3796,120,360,1,Urban,2,1,0,0,0
3,LP001275,Male,Yes,1,Graduate,No,3988,0,50,240,1,Urban,2,1,0,0,1
4,LP002455,Male,Yes,2,Graduate,No,3859,0,96,360,1,Semiurban,1,1,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91,LP001536,Male,Yes,3+,Graduate,No,39999,0,600,180,0,Semiurban,1,1,0,0,3
92,LP001367,Male,Yes,1,Graduate,No,3052,1030,100,360,1,Urban,2,1,0,0,1
93,LP002160,Male,Yes,3+,Graduate,No,5167,3167,200,360,1,Semiurban,1,1,0,0,3
94,LP002964,Male,Yes,2,Not Graduate,No,3987,1411,157,360,1,Rural,0,1,1,0,2


,Dependents_Clean,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area_Clean,Gender_Clean,Education_Clean,Self_Employed_Clean
0,0,3400,0,95,360,1,0,0,1,0
1,0,14583,0,436,360,1,1,1,0,0
2,0,2500,3796,120,360,1,2,1,0,0
3,1,3988,0,50,240,1,2,1,0,0
4,2,3859,0,96,360,1,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...
91,3,39999,0,600,180,0,1,1,0,0
92,1,3052,1030,100,360,1,2,1,0,0
93,3,5167,3167,200,360,1,1,1,0,0
94,2,3987,1411,157,360,1,0,1,1,0


0.71875
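
The value above is presumably the accuracy score, i.e. the fraction of test rows whose predicted label matches the true one. With 96 test rows, 0.71875 corresponds to 69 correct predictions. The sketch below assumes scikit-learn's `accuracy_score` is the metric used and runs on invented labels.

```python
from sklearn.metrics import accuracy_score

# Invented labels for illustration; 3 of the 4 predictions are correct.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

print(accuracy_score(y_true, y_pred))  # 0.75

# The notebook's score is consistent with 69 correct out of 96 test rows.
print(69 / 96)  # 0.71875
```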

### Challenges ???