# Data Science Workflow

## 1. Define the Problem

1. What is the problem? Provide formal and informal definitions.
2. Why does the problem need to be solved? Motivation, benefits, how it will be used.
3. How would I solve the problem? Describe how the problem would be solved manually to flush out domain knowledge.

### Problem definition for this notebook
1. A company wants to automate its loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate the process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can specifically target those customers.

#### Variable Descriptions
| Variable | Description |
| --- | --- |
| Loan_ID | Unique loan ID |
| Gender | Male / Female |
| Married | Applicant married (Y/N) |
| Dependents | Number of dependents |
| Education | Applicant education (Graduate / Under Graduate) |
| Self_Employed | Self-employed (Y/N) |
| ApplicantIncome | Applicant income |
| CoapplicantIncome | Coapplicant income |
| LoanAmount | Loan amount in thousands |
| Loan_Amount_Term | Term of loan in months |
| Credit_History | Credit history meets guidelines |
| Property_Area | Urban / Semi Urban / Rural |
| Loan_Status | Loan approved (Y/N) |

## 2. Prepare Data
1. Data Selection. Availability, what is missing, what can be removed.
2. Data Preprocessing. Organize selected data by formatting, cleaning and sampling.
3. Data Transformation. Feature engineering using scaling, attribute decomposition and attribute aggregation.
4. Data Visualization. Explore the distributions of the variables, for example with histograms.
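
The visualization step can be sketched without the real data. The example below is a minimal illustration, assuming a synthetic, right-skewed stand-in for an income column (the values and the name `income` are made up for the example). It computes histogram bin counts on the raw and on the log scale, which shows why a log transform is a common feature-engineering step for skewed amounts.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for an income column (NOT the real data).
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=8.5, sigma=0.6, size=500), name="income")
log_income = np.log(income)

# Histogram bin counts on the raw scale and on the log scale.
raw_counts, raw_edges = np.histogram(income, bins=10)
log_counts, log_edges = np.histogram(log_income, bins=10)

print("raw bin counts:", raw_counts)
print("log bin counts:", log_counts)
print("skewness raw: {:.2f}, log: {:.2f}".format(income.skew(), log_income.skew()))
```

When matplotlib is installed, `income.hist(bins=10)` draws the same bins as a chart.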

In [1]:
# Get data
import io

import pandas as pd
import requests

# Read the training dataset into a DataFrame
url_name = "https://raw.githubusercontent.com/3Blades/notebook-templates/master/python/Datasets/lp_train.csv"
response = requests.get(url_name)
response.raise_for_status()  # fail early if the download did not succeed
df = pd.read_csv(io.StringIO(response.content.decode('utf-8')))

# Read the test dataset into a DataFrame
url_name = "https://raw.githubusercontent.com/3Blades/notebook-templates/master/python/Datasets/lp_test.csv"
response = requests.get(url_name)
response.raise_for_status()
df_test = pd.read_csv(io.StringIO(response.content.decode('utf-8')))

In [2]:
# Count the missing values in each column; these need to be filled in
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
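
Raw counts are easier to judge next to the share of rows they represent. The following is a small sketch on a hand-made frame (the `toy` data is invented for illustration and is not the loan dataset); it lists only the columns that have gaps, sorted by how many values are missing.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the loan data.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [120.0, np.nan, np.nan, 66.0],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})

# Missing count and percentage per column.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
})

# Keep only columns with gaps, worst first.
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```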

In [None]:
# Show all non-numeric columns (uses the public select_dtypes API)
nonum_cols = df.select_dtypes(exclude=['number', 'bool']).columns.tolist()
print(nonum_cols)

In [None]:
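# Clean the data: fill in missing values and encode non-numeric columns.
# Assigning the result of fillna back to the column avoids chained in-place
# updates, which newer pandas versions do not apply to the original frame.

# Gender: 13 missing, fill with 'Male'
df['Gender'] = df['Gender'].fillna('Male')
df = pd.get_dummies(df, columns=["Gender"])

# Married: 3 missing, fill with 'Yes'
df['Married'] = df['Married'].fillna('Yes')
df = pd.get_dummies(df, columns=["Married"])

# Dependents: 15 missing, fill with '0', then strip any non-numeric
# characters (for example a trailing '+') before converting to numbers
df['Dependents'] = df['Dependents'].fillna('0')
df['Dependents'] = pd.to_numeric(df['Dependents'].str.replace(r'[^-\d.]', '', regex=True))

# Self_Employed: 32 missing, fill with 'No'
df['Self_Employed'] = df['Self_Employed'].fillna('No')
df = pd.get_dummies(df, columns=["Self_Employed"])

# LoanAmount: 22 missing, fill with the column mean
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())

# Loan_Amount_Term: 14 missing; 360 months is by far the most common term,
# so assume the missing values are 360 as well
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(360)

# Credit_History: 50 missing; assume that no data means the credit history
# does not meet guidelines
df['Credit_History'] = df['Credit_History'].fillna(0)

df = pd.get_dummies(df, columns=["Education"])
df = pd.get_dummies(df, columns=["Property_Area"])
# Loan_Status is the outcome variable and is left as Y/N
#df = pd.get_dummies(df, columns=["Loan_Status"])

# Loan_ID is an identifier with no predictive value
df = df.drop(["Loan_ID"], axis=1)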
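
The cell above hard-codes its fill values, and `df_test` is loaded but not prepared. One possible alternative, shown here as a sketch on made-up frames (`train`, `test`, and `fill_values` are illustrative names, not part of the notebook), is to learn the fill values from the training data and reuse them on the test data, then align the one-hot columns so both frames share the same layout.

```python
import numpy as np
import pandas as pd

# Made-up training and test frames (NOT the loan data).
train = pd.DataFrame({
    "Gender": ["Male", "Male", None, "Female"],
    "LoanAmount": [100.0, np.nan, 150.0, 500.0],
})
test = pd.DataFrame({
    "Gender": [None, "Male"],
    "LoanAmount": [np.nan, 80.0],
})

# Learn fill values from the training data only.
fill_values = {
    "Gender": train["Gender"].mode()[0],          # most frequent category
    "LoanAmount": train["LoanAmount"].median(),   # robust to large outliers
}

# Apply the same values to both frames.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

# One-hot encode, then force the test frame into the training column layout.
# Categories absent from the test data become all-zero columns.
train_encoded = pd.get_dummies(train_filled, columns=["Gender"])
test_encoded = pd.get_dummies(test_filled, columns=["Gender"]).reindex(
    columns=train_encoded.columns, fill_value=0
)

print(fill_values)
print(test_encoded)
```

The median is used for the amount here instead of the mean because a few very large loans can pull the mean upward; whether that matters for this dataset is an assumption worth checking against a histogram.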

In [None]:
# Recheck: no column should have missing values any more
print(df.isnull().sum())
# Show all remaining non-numeric columns (only the outcome is expected)
nonum_cols = df.select_dtypes(exclude=['number', 'bool']).columns.tolist()
print(nonum_cols)

In [None]:
outcome_var = 'Loan_Status'
predictor_var = [x for x in df.columns if x not in [outcome_var]]

In [None]:
from sklearn import model_selection
X = df[predictor_var].values
Y = df[outcome_var]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
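
When the outcome classes are unbalanced, a plain random split can leave the validation set with a different approval rate than the training set. The sketch below, on synthetic labels (70 "Y" and 30 "N", chosen only for illustration), uses the `stratify` argument of `train_test_split` to keep the class shares roughly equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels (NOT the loan data): 70 approved, 30 rejected.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array(["Y"] * 70 + ["N"] * 30)

# stratify=y_demo keeps the Y/N proportions similar in both splits.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=7, stratify=y_demo
)

print("approved share, train:", (y_tr == "Y").mean())
print("approved share, validation:", (y_val == "Y").mean())
```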

## 3. Spot Check Algorithms
1. Build a test harness with default values.
2. Run a family of algorithms across all the transformed and scaled versions of the dataset.
3. Compare the results with box plots.
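
The three steps above can be sketched as a small cross-validation harness. This is an illustration under stated assumptions, not the notebook's own method: the data comes from `make_classification` instead of the loan dataset, and the choice of four models, five folds, and the short names in `models` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared loan data (NOT the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=7)

# Candidate models with default settings, apart from fixed seeds and a
# higher iteration cap for logistic regression.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=7),
    "RF": RandomForestClassifier(n_estimators=50, random_state=7),
}

# The same seeded folds are used for every model so the scores are comparable.
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
results = {
    name: cross_val_score(model, X_demo, y_demo, cv=kfold, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in results.items():
    print("{}: mean={:.3f} std={:.3f}".format(name, scores.mean(), scores.std()))
```

The per-fold arrays in `results` are the natural input for a box plot, for example `matplotlib.pyplot.boxplot(list(results.values()))` when matplotlib is available.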

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print("\n** Validate Model on Validation Data **")
print("1. Accuracy: {}".format(accuracy_score(Y_validation, predictions)))
print("2. Confusion Matrix:\n{}".format(pd.crosstab(Y_validation, predictions, rownames=['True'], colnames=['Predicted'], margins=True)))
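
The cell above imports `classification_report` and `confusion_matrix` without using them. As a minimal sketch on hand-written labels (the two lists are invented for the example), they can report per-class precision and recall alongside accuracy, which is useful when one class is much more common than the other.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-written true and predicted labels (NOT model output).
y_true = ["Y", "Y", "N", "N", "Y", "N"]
y_pred = ["Y", "N", "N", "Y", "Y", "Y"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["N", "Y"])
accuracy = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred, labels=["N", "Y"])

print(cm)
print("accuracy:", accuracy)
print(report)
```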

## 4. Improve Results (Tuning)
1. Algorithm Tuning: search the model's parameter space for the best-performing configuration. This may include hyperparameter optimization with additional helper services.
2. Ensemble Methods: where the predictions made by multiple models are combined.
3. Feature Engineering: where the attribute decomposition and aggregation seen in data preparation is tested further.
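
Algorithm tuning can be illustrated with a small grid search. The sketch below is one way to do it, assuming synthetic data from `make_classification` and an arbitrary two-by-two grid for a random forest; the parameter values are placeholders, not recommendations for the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared loan data (NOT the real dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

# Small illustrative grid: 2 x 2 = 4 parameter combinations.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

# Each combination is scored with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=7),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy: {:.3f}".format(search.best_score_))
```

After fitting, `search.best_estimator_` holds a model refit on all of the supplied data with the best parameters, so it can be used directly for prediction.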

In [None]:
model = LogisticRegression()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print("\n** Validate Model on Validation Data **")
print("1. Accuracy: {}".format(accuracy_score(Y_validation, predictions)))
print("2. Confusion Matrix:\n{}".format(pd.crosstab(Y_validation, predictions, rownames=['True'], colnames=['Predicted'], margins=True)))

## 5. Present Results
1. Context (Why): how the problem definition arose in the first place.
2. Problem (Question): describe the problem as a question.
3. Solution (Answer): describe the answer to the question in the previous step.
4. Findings: bulleted lists of discoveries you made along the way that interest the audience. These may include discoveries in the data, methods that did or did not work, or the model performance benefits you observed.
5. Limitations: describe where the model does not work.
6. Conclusions (Why+Question+Answer)