# Logistic regression project

We begin by doing the necessary imports.

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

## The Dataset
We'll use the [Credit Card Approval dataset](http://archive.ics.uci.edu/ml/datasets/credit+approval) from the UCI Machine Learning Repository.
    
We explore the variables within this dataset in the sections below. 

### Reading in the data

First, we load and view the dataset. Because the data are confidential, the contributor of the dataset has anonymized the feature names.

In [69]:
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/89fee4463f428f55d31a254924e18501a3c468c3/Data/classification_sprint/cc_approvals.data',header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


The output may appear a bit confusing at first glance, but let's try to figure out the most important features of a credit card application. The features of this dataset have been anonymized to protect privacy, but [this blog](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html) gives a good overview of the probable features. The probable features in a typical credit card application are <code>Gender</code>, <code>Age</code>, <code>Debt</code>, <code>Married</code>, <code>BankCustomer</code>, <code>EducationLevel</code>, <code>Ethnicity</code>, <code>YearsEmployed</code>, <code>PriorDefault</code>, <code>Employed</code>, <code>CreditScore</code>, <code>DriversLicense</code>, <code>Citizen</code>, <code>ZipCode</code>, <code>Income</code> and finally <code>ApprovalStatus</code>.

This gives us a pretty good starting point, and we can map these features with respect to the columns in the output.   
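As a minimal sketch of that mapping, the probable names can be assigned to the 16 positional columns in the order listed. This assumes the list order matches the column order; the names are the blog's guesses rather than confirmed labels. `sample` is a small stand-in frame built from the first two rows of the `df.head()` output so the snippet runs without downloading the file.

```python
import pandas as pd

# Probable feature names, assumed to be in column order (guesses, not confirmed).
probable_names = [
    "Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus",
]

# Stand-in for the real DataFrame: the first two rows shown by df.head().
sample = pd.DataFrame([
    ["b", "30.83", 0.0, "u", "g", "w", "v", 1.25, "t", "t", 1, "f", "g", "202", 0, "+"],
    ["a", "58.67", 4.46, "u", "g", "q", "h", 3.04, "t", "t", 6, "f", "g", "43", 560, "+"],
])

# Rename a copy so that code relying on the integer column labels keeps working.
named = sample.copy()
named.columns = probable_names
print(named[["Age", "Debt", "PriorDefault", "ApprovalStatus"]])
```

Renaming a copy rather than `df` itself keeps later cells, which refer to columns by integer label, unchanged.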

As we can see from our first glance at the data, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing.

In [3]:
df.tail(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
670,b,47.17,5.835,u,g,w,v,5.5,f,f,0,f,g,465,150,-
671,b,25.83,12.835,u,g,cc,v,0.5,f,f,0,f,g,0,2,-
672,a,50.25,0.835,u,g,aa,v,0.5,f,f,0,t,g,240,117,-
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB



<ul>
<li>Our dataset contains both numeric and non-numeric data (of types <code>float64</code>, <code>int64</code> and <code>object</code>). Specifically, features 2, 7, 10 and 14 contain numeric values (of types float64, float64, int64 and int64, respectively) and all the other features contain non-numeric values.</li>
<li>The dataset also contains values from several ranges. Some features have a value range of 0 - 28, some have a range of 2 - 67, and some have a range of 1017 - 100000.</li>
<li>Finally, the dataset has missing values, which we'll take care of in this task. The missing values are labelled with '?', as can be seen in row 673 of the <code>df.tail()</code> output above. Because '?' is an ordinary string, <code>df.info()</code> still reports every column as fully non-null.</li>
</ul>
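These observations can be checked directly. The sketch below uses a tiny stand-in frame (three rows and four columns copied from the outputs above) rather than the full dataset, so the counts it prints describe the stand-in only.

```python
import pandas as pd

# Stand-in frame: rows 0, 1 and 673 from the outputs above, first four columns only.
sample = pd.DataFrame(
    [["b", "30.83", 0.0, "u"],
     ["a", "58.67", 4.46, "u"],
     ["?", "29.5", 2.0, "y"]],
    index=[0, 1, 673],
)

# Columns pandas holds as numbers versus columns it holds as strings.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Count the '?' placeholders per column; isnull() does not see them.
placeholder_counts = (sample == "?").sum()

print("numeric:", numeric_cols, "non-numeric:", non_numeric_cols)
print(placeholder_counts)
print("nulls reported by pandas:", int(sample.isnull().sum().sum()))
```

Running the same three checks on `df` would show which columns actually need imputation.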

## Data Cleaning

We start by writing a function to clean the given data. The function should:
* Replace the '?'s with NaN.
* Impute the missing values of numeric columns with the column mean.
* Impute the missing values of non-numeric columns with the most frequent value in the respective column.

_**Function Specifications:**_
* Should take a pandas DataFrame and a column name as input and return a list as output.
* The list should contain the count of each unique value in the column.

In [38]:
def data_cleaning(data, column_name):

    # Replace the '?' placeholders with NaN so pandas treats them as missing
    cleaned = data.replace('?', np.nan)

    # Impute with the mean for numeric columns and with the mode otherwise.
    # Note: a numeric feature that was read in as object (because it contained '?')
    # is not numeric as far as pandas is concerned, so it takes the mode branch.
    if pd.api.types.is_numeric_dtype(cleaned[column_name]):
        mean_value = cleaned[column_name].mean()
        cleaned[column_name] = cleaned[column_name].fillna(mean_value)
    else:
        mode_value = cleaned[column_name].mode()[0]
        cleaned[column_name] = cleaned[column_name].fillna(mode_value)

    # Return the count of each unique value in the column, most frequent first
    return list(cleaned[column_name].value_counts())

In [39]:
data_cleaning(df, 9)

[395, 295]
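`data_cleaning` works on one column per call and returns counts rather than the cleaned data. A whole-frame variant could look like the sketch below. The helper name `clean_frame` and the toy data are hypothetical. The rule for deciding that a column is numeric (every non-missing value parses as a number) is an assumption, and it would also treat a column of digit strings, such as the probable ZipCode, as numeric.

```python
import numpy as np
import pandas as pd


def clean_frame(data):
    """Hypothetical whole-frame variant of data_cleaning (not the original function)."""
    cleaned = data.replace("?", np.nan)
    for col in cleaned.columns:
        # Try to read the column as numbers; entries that fail to parse become NaN.
        as_numeric = pd.to_numeric(cleaned[col], errors="coerce")
        if as_numeric.notna().sum() == cleaned[col].notna().sum():
            # Every non-missing value parsed: impute with the column mean.
            cleaned[col] = as_numeric.fillna(as_numeric.mean())
        else:
            # Otherwise impute with the most frequent value.
            cleaned[col] = cleaned[col].fillna(cleaned[col].mode()[0])
    return cleaned


# Toy data (made up): one categorical column, one numeric column stored as
# strings, and one column that is already numeric.
toy = pd.DataFrame({
    0: ["b", "a", "?", "b"],
    1: ["30.0", "?", "20.0", "40.0"],
    2: [0.0, 4.5, 2.0, 1.5],
})
cleaned_toy = clean_frame(toy)
print(cleaned_toy)
```

In the toy frame, the '?' in column 0 is filled with the mode `'b'`, and the '?' in column 1 is filled with the mean of 30, 20 and 40.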

## Data Preprocessing

We then write a function to pre-process the data so that we can run it through the classifier. The function should:
* Convert the non-numeric data into numeric using sklearn's `LabelEncoder`.
* Drop features 11 and 13 and convert the DataFrame to a NumPy array.
* Split the data into features and labels.
* Scale the features to the range [0, 1] using sklearn's `MinMaxScaler`.
* Split the data into 80% training and 20% testing data.
* Use the `train_test_split` function from `sklearn` to do this.
* Set `random_state` to 42 for this split.

_**Function Specifications:**_
* Should take a dataframe as input.
* The input should be the raw unprocessed dataframe df.
* Should return two `tuples` of the form `(X_train, y_train), (X_test, y_test)`.

In [88]:
def data_preprocess(df):

    # Converting non-numeric data into numeric
    le = LabelEncoder()
    df_encoded = df.copy()  # Make a copy of the DataFrame to avoid modifying the original
    for column in df.select_dtypes(include=['object']):
        df_encoded[column] = le.fit_transform(df[column])

    # Dropping features 11 and 13 and casting the df as an array
    df_encoded_arr = np.array(df_encoded.drop(columns=[11, 13]))

    # Splitting the data into features and labels
    X = df_encoded_arr[:,:-1] # Features
    y = df_encoded_arr[:,-1] # Labels

    # Scaling each feature to the [0, 1] range (min-max scaling)
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    # Splitting the data into training and testing data
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

    return (X_train, y_train), (X_test, y_test)
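
`LabelEncoder` assigns integer codes in sorted order of the labels, so it is worth checking which approval status becomes class 1. The sketch below encodes the two labels found in column 15 on their own. It suggests that in the arrays returned by `data_preprocess`, a label of 1 stands for '-' and 0 for '+'.

```python
from sklearn.preprocessing import LabelEncoder

# The two approval labels as they appear in column 15.
le = LabelEncoder()
codes = le.fit_transform(["+", "-", "-", "+"])

# classes_ is sorted; the position of a label in it is its integer code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)   # {'+': 0, '-': 1}
print(codes)     # [0 1 1 0]
```

The same encoder also turns any '?' placeholder left in a feature column into a category of its own, since it treats '?' like any other string.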

In [89]:
(X_train, y_train), (X_test, y_test) = data_preprocess(df)
print(X_train[:1])
print(y_train[:1])
print(X_test[:1])
print(y_test[:1])

[[1.         0.25787966 0.48214286 1.         1.         0.42857143
  0.33333333 0.         0.         0.         0.         0.
  0.        ]]
[1.]
[[0.5        1.         0.05357143 0.66666667 0.33333333 0.42857143
  0.33333333 0.         0.         1.         0.02985075 0.
  0.00105   ]]
[1.]
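
`data_preprocess` fits the scaler on all rows before splitting, so the test rows influence the minimum and maximum used for scaling. A common alternative, sketched here on synthetic data rather than the credit data, is to split first and fit `MinMaxScaler` on the training rows only. The helper name `split_then_scale` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def split_then_scale(X, y, test_size=0.2, random_state=42):
    """Split first, then fit the scaler on the training rows only."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scaler = MinMaxScaler().fit(X_train)  # min and max come from training data
    return (scaler.transform(X_train), y_train), (scaler.transform(X_test), y_test)


# Synthetic stand-in data: 50 rows, 3 features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * [1.0, 100.0, 10000.0]
y = rng.integers(0, 2, size=50)

(X_tr, y_tr), (X_te, y_te) = split_then_scale(X, y)
print(X_tr.min(axis=0), X_tr.max(axis=0))  # training features span [0, 1]
print(X_te.min(axis=0), X_te.max(axis=0))  # test features may fall outside [0, 1]
```

With this ordering the test set plays no part in choosing the scaling parameters, at the cost of test values that can land slightly outside the [0, 1] range.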


## Training the model

Now that we have formatted our data, we can fit a model using sklearn's `LogisticRegression` class with solver 'lbfgs'. Write a function that will take as input `(X_train, y_train)` that we created previously, and return a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* The returned model should be fitted to the data.

In [80]:
def train_model(X_train, y_train):
    
    # Training our model
    lr = LogisticRegression(solver='lbfgs')
    model = lr.fit(X_train, y_train)

    return model

In [81]:
lm = train_model(X_train, y_train)
print(lm.intercept_[0])
print(lm.coef_)

0.3709671903129163
[[ 4.94801124e-01 -2.87318292e-03 -1.55699147e-05  1.06465557e+00
   6.94353620e-01  1.78587134e-02 -1.67566958e-01 -1.99136235e-02
  -2.13340055e+00 -1.73525327e-01 -8.18853840e-02  2.82992496e-02
  -1.37812114e-02]]
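
The intercept and coefficients printed above are all the fitted model needs to make a prediction: it computes a weighted sum of the features and passes it through the logistic (sigmoid) function. The sketch below checks this on synthetic data from `make_classification`, which stands in for the credit data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the credit data).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(solver="lbfgs").fit(X, y)

# Linear score z = X . w + b, followed by the sigmoid function.
z = X @ model.coef_.ravel() + model.intercept_[0]
manual_prob = 1.0 / (1.0 + np.exp(-z))

# predict_proba returns one column per class; column 1 is P(y = 1).
sklearn_prob = model.predict_proba(X)[:, 1]
print(np.allclose(manual_prob, sklearn_prob))
```

A positive coefficient therefore raises the predicted probability of class 1 as the feature grows, and a negative one lowers it.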


## Testing the model

The ROC AUC score is a performance measure for classification problems that summarises behaviour across all decision thresholds. The ROC curve plots the true positive rate against the false positive rate as the threshold varies, and the area under it (AUC) measures separability: how well the model distinguishes between the classes. We therefore write a function that returns the ROC AUC score of the trained model when evaluated on the test set.

_**Function Specifications:**_
* Should take the fitted model and two numpy `arrays` `X_test, y_test` as input.
* Should return a `float` of the roc auc score of the model. This number should be between zero and one.

In [1]:
def roc_score(lm, X_test, y_test):

    # Predicted probability of class 1 for each test row
    y_prob = lm.predict_proba(X_test)[:, 1]

    # Calculating the AUC-ROC score from the probabilities
    return roc_auc_score(y_test, y_prob)

In [83]:
print(roc_score(lm,X_test,y_test))

0.8497899159663865


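The model has a ROC AUC score of about 0.85 on the test set. For reference, 0.5 corresponds to ranking the two classes at random and 1.0 to separating them perfectly.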
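
One way to read the ROC AUC score is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on four made-up points and compares the result with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores (made up for illustration).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score with every negative score; ties count as half.
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
pairwise_auc = (wins + 0.5 * ties) / (len(pos) * len(neg))

print(pairwise_auc)                    # 0.75
print(roc_auc_score(y_true, y_score))  # 0.75
```

Three of the four positive-negative pairs are ranked correctly here, which gives 0.75.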

We proceed to write a function which calculates the Accuracy, Precision, Recall and F1 scores.

_**Function Specifications:**_
* Should take the fitted model and two numpy `arrays` `X_test, y_test` as input.
* Should return a tuple in the form (`Accuracy`, `Precision`, `Recall`, `F1-Score`)

In [66]:
def scores(lm, X_test, y_test):

    # Predicted class labels for the test set
    y_pred = lm.predict(X_test)

    # Calculating the individual scores
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)

    # F1 is the harmonic mean of precision and recall
    f1 = f1_score(y_test, y_pred)

    return (accuracy, precision, recall, f1)

In [67]:
(accuracy, precision, recall, f1) = scores(lm, X_test, y_test)    

print('Accuracy: %f' % accuracy)
print('Precision: %f' % precision)
print('Recall: %f' % recall)
print('F1 score: %f' % f1)

Accuracy: 0.855072
Precision: 0.875000
Recall: 0.823529
F1 score: 0.848485
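
Precision, recall and F1 can all be derived from the confusion matrix. The sketch below does so on made-up labels and predictions. Note the layout sklearn uses for binary labels, `[[TN, FP], [FN, TP]]`, which is easy to read the wrong way round.

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels and predictions (made up for illustration); 1 is the positive class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# For binary labels, sklearn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```

Both lines print the same three values, since the sklearn functions apply the same formulas to the positive class.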
