# Lab: Spark Machine Learning

In this lab you will dig into the world of credit markets and peer-to-peer lending. Loan decisions from financial institutions depend on a vast trove of data on borrower characteristics, market trends, and more. Machine learning is a natural way to harness data at this scale. An [article by the Brookings Institution](https://www.brookings.edu/research/credit-denial-in-the-age-of-ai/) discusses the impact of AI on loan decisions and highlights that “If there are data out there on you, there is probably a way to integrate it into a credit model.”

We will use Spark Machine Learning to build a machine learning pipeline that can model and predict creditworthiness. The data we will be using comes from Lending Club, a company that conducted peer-to-peer lending from 2007 to 2020. There are over 2.9 million loans in the dataset. Each row contains data about the loan, the borrower's credit history, and the success of the loan. Credit history variables include number of accounts, past missed payments, occupation, years of experience, etc. You can read about the source data [from Kaggle](https://www.kaggle.com/ethon0426/lending-club-20072020q1).

## Understanding your data

Use the `findspark.init()` function to locate your Spark-related environment variables and make `pyspark` importable:

In [1]:
import findspark
findspark.init()

Create your SparkSession. You are only going to create a `SparkSession`, not a `SparkContext`.

In [2]:
import pyspark.sql.functions as f
from pyspark.sql.functions import col, lit
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName("lendingclub").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/18 20:43:54 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


Make sure your SparkSession is active:

In [4]:
spark

## 1. Read in the data

The data is stored as an Apache Parquet file, a columnar file format well suited to processing on clusters, which you can read more about [here](https://drill.apache.org/docs/parquet-format/). You will be working with this file using [Spark DataFrame API and Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) as we did in the SparkSQL lab.

Start off by copying the data from the central bucket to your personal bucket.

In [5]:
!aws s3 cp s3://bigdatateaching/lendingclub s3://mc2582 --recursive

copy: s3://bigdatateaching/lendingclub/lendingclub_loan_data.parquet to s3://mc2582/lendingclub_loan_data.parquet


Load the Parquet file using the [generic load functions for Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources). With the copy command above, the file is located at `s3://[[YOUR-BUCKET-NAME]]/lendingclub_loan_data.parquet`.

Create a DataFrame called `df_in`, which should contain many rows and over 140 columns.

In [7]:
df_in = spark.read.parquet('s3://mc2582/lendingclub_loan_data.parquet')

Whenever you load in a new dataset, you will always do the following three actions:

1. Get a count of the rows
2. Print the schema of the data
3. View the first few rows of the data (there are a lot of columns so just look at 3)

Do these steps now!

In [8]:
df_in.count()

22/10/18 15:45:07 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

2925492

In [9]:
df_in.printSchema()

root
 |-- id: string (nullable = true)
 |-- loan_amnt: double (nullable = true)
 |-- funded_amnt: double (nullable = true)
 |-- funded_amnt_inv: double (nullable = true)
 |-- term: string (nullable = true)
 |-- int_rate: string (nullable = true)
 |-- installment: double (nullable = true)
 |-- grade: string (nullable = true)
 |-- sub_grade: string (nullable = true)
 |-- emp_title: string (nullable = true)
 |-- emp_length: string (nullable = true)
 |-- home_ownership: string (nullable = true)
 |-- annual_inc: double (nullable = true)
 |-- verification_status: string (nullable = true)
 |-- issue_d: string (nullable = true)
 |-- loan_status: string (nullable = true)
 |-- pymnt_plan: string (nullable = true)
 |-- url: string (nullable = true)
 |-- purpose: string (nullable = true)
 |-- title: string (nullable = true)
 |-- zip_code: string (nullable = true)
 |-- addr_state: string (nullable = true)
 |-- dti: double (nullable = true)
 |-- delinq_2yrs: double (nullable = true)
 |-- earliest_cr

In [11]:
df_in.show(3)



+-------+---------+-----------+---------------+----------+--------+-----------+-----+---------+---------+----------+--------------+----------+-------------------+--------+-----------+----------+--------------------+--------------+--------------------+--------+----------+-----+-----------+----------------+--------------+---------------+--------------+----------------------+----------------------+--------+-------+---------+----------+---------+-------------------+---------+-------------+---------------+---------------+---------------+-------------+------------------+----------+-----------------------+------------+---------------+------------+------------------+--------------------+-------------------+--------------------------+---------------------------+-----------+----------------+----------------+---------+-------------------------+--------------+------------+-----------+-----------+-----------+-----------+-----------+------------------+------------+-------+-----------+-----------+---

                                                                                

## 2.1 Clean data to prepare for machine learning

No dataset is ever ready for modeling without cleaning. There are several steps required to get the data ready:
    
1. Select only the columns `['int_rate', 'term', 'loan_status', 'home_ownership', 'tot_cur_bal', 'annual_inc', 'grade', 'bc_util']`, then drop any rows that still contain missing (NA) values with `.dropna()` ([read more about the method here](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.dropna.html)).

These columns mean:
* `int_rate`: interest rate
* `term`: length of the loan
* `loan_status`: loan paid, in progress, or charged off
* `home_ownership`: home ownership status
* `tot_cur_bal`: total current balance on all credit accounts
* `annual_inc`: self-reported annual income of the borrower
* `grade`: The grade of the loan
* `bc_util`: bank card utilization rate

2. Extract the numbers from the column `int_rate`. First use the `select` and `show` methods to see what the data looks like. Then use the [regex replace technique](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.regexp_replace.html) `regexp_replace` or a Python user-defined function. **There are both XX.XX% and X.XX% rate types in this column**. Finally, cast `int_rate` to float.

3. Filter to only rows with `loan_status` of "Fully Paid" or "Charged Off". These loans have reached "maturity", so we can determine whether each loan was successful (fully paid) or unsuccessful (charged off).

4. Check the potential values for `term`, `home_ownership`, and `grade`. What are the possible categories of each?

5. Drop rows with unusual values of the `home_ownership` variable. There are several categories that are small and have ambiguous meaning. There are three to drop and three to keep.

6. Report the number of rows left in your DataFrame.

1. Selecting columns

In [8]:
df = df_in.select(['int_rate', 'term', 'loan_status', 'home_ownership', 'tot_cur_bal', 'annual_inc', 'grade', 'bc_util'])

In [19]:
df.printSchema()

root
 |-- int_rate: string (nullable = true)
 |-- term: string (nullable = true)
 |-- loan_status: string (nullable = true)
 |-- home_ownership: string (nullable = true)
 |-- tot_cur_bal: double (nullable = true)
 |-- annual_inc: double (nullable = true)
 |-- grade: string (nullable = true)
 |-- bc_util: double (nullable = true)



2. String extraction and data type conversion

In [79]:
# always look at the data before
df.select('int_rate').show(10)



+--------+
|int_rate|
+--------+
|  10.65%|
|  15.27%|
|  15.96%|
|  13.49%|
|  12.69%|
|   7.90%|
|  15.96%|
|  18.64%|
|  21.28%|
|  12.69%|
+--------+
only showing top 10 rows



                                                                                

In [9]:
df = df.withColumn('int_rate', f.regexp_extract('int_rate', r'(.*)%', 1).cast('float')*0.01)
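
To see what the extraction does to individual values, here is a plain-Python sketch that uses the standard `re` module rather than Spark. The helper names `parse_rate` and `as_float32` are hypothetical. The sketch also mimics a 32-bit `float` cast with `struct`, which is a likely reason that a value such as `0.10649999618530273` appears instead of `0.1065` when the converted column is shown.

```python
import re
import struct

# optional whitespace, digits with an optional decimal part, then a percent sign
RATE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s*$")


def parse_rate(text):
    """Turn '10.65%' or ' 7.90%' into a fraction such as 0.1065; None if no match."""
    match = RATE_PATTERN.match(text)
    if match is None:
        return None
    return float(match.group(1)) * 0.01


def as_float32(x):
    """Round-trip a Python float through 32-bit precision, similar to cast('float')."""
    return struct.unpack("f", struct.pack("f", x))[0]


rates = [parse_rate(s) for s in ["10.65%", "  7.90%", "n/a"]]
single_precision = as_float32(10.65) * 0.01  # close to, but not exactly, 0.1065
print(rates, single_precision)
```

Casting to `double` instead of `float` in the Spark expression would likely avoid most of this rounding noise.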

In [10]:
# always look at the data after
df.select('int_rate').show(10)

22/10/18 20:44:34 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+-------------------+
|           int_rate|
+-------------------+
|0.10649999618530273|
|0.15270000457763672|
|0.15960000038146974|
|0.13489999771118164|
|0.12689999580383302|
|0.07900000095367432|
|0.15960000038146974|
|0.18639999389648437|
| 0.2128000068664551|
|0.12689999580383302|
+-------------------+
only showing top 10 rows



                                                                                

3. Drop loans that have not completed, keeping only `Charged Off` or `Fully Paid`

In [33]:
# a count function by distinct value is quite important in pyspark!
df.groupBy('loan_status').count().collect()

                                                                                

[Row(loan_status='In Grace Period', count=10028),
 Row(loan_status='Does not meet the credit policy. Status:Fully Paid', count=1988),
 Row(loan_status='Late (31-120 days)', count=16154),
 Row(loan_status='Issued', count=2062),
 Row(loan_status='Does not meet the credit policy. Status:Charged Off', count=761),
 Row(loan_status='Fully Paid', count=1497783),
 Row(loan_status='Default', count=433),
 Row(loan_status='Charged Off', count=362548),
 Row(loan_status='Current', count=1031016),
 Row(loan_status='Late (16-30 days)', count=2719)]

In [11]:
df = df.filter(col('loan_status').isin(['Charged Off','Fully Paid']))

In [35]:
df.groupBy('loan_status').count().collect()

                                                                                

[Row(loan_status='Fully Paid', count=1497783),
 Row(loan_status='Charged Off', count=362548)]

4. Check possible values for categorical columns

In [12]:
df.show(2)



+-------------------+----------+-----------+--------------+-----------+----------+-----+-------+
|           int_rate|      term|loan_status|home_ownership|tot_cur_bal|annual_inc|grade|bc_util|
+-------------------+----------+-----------+--------------+-----------+----------+-----+-------+
|0.10649999618530273| 36 months| Fully Paid|          RENT|       null|   24000.0|    B|   null|
|0.15270000457763672| 60 months|Charged Off|          RENT|       null|   30000.0|    C|   null|
+-------------------+----------+-----------+--------------+-----------+----------+-----+-------+
only showing top 2 rows



                                                                                

In [39]:
df.groupBy('term').count().collect()

                                                                                

[Row(term=' 60 months', count=467032), Row(term=' 36 months', count=1393299)]

In [41]:
df.groupBy('home_ownership').count().collect()

                                                                                

[Row(home_ownership='MORTGAGE', count=918028),
 Row(home_ownership='ANY', count=1174),
 Row(home_ownership='NONE', count=49),
 Row(home_ownership='RENT', count=735358),
 Row(home_ownership='OWN', count=205578),
 Row(home_ownership='OTHER', count=144)]
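
The counts above suggest which `home_ownership` categories are rare. The sketch below uses those counts in plain Python to pick them out. The 1% threshold and the names `home_counts`, `keep`, and `drop` are assumptions for illustration, not values given by the lab. The commented line shows how the resulting list could be applied with `isin`, the same method used earlier for `loan_status`.

```python
# Counts copied from the groupBy output above.
home_counts = {
    "MORTGAGE": 918028, "ANY": 1174, "NONE": 49,
    "RENT": 735358, "OWN": 205578, "OTHER": 144,
}

total = sum(home_counts.values())
threshold = 0.01 * total  # assumed cutoff: categories under 1% of rows count as "small"

keep = sorted(k for k, n in home_counts.items() if n >= threshold)
drop = sorted(k for k, n in home_counts.items() if n < threshold)
print("keep:", keep, "drop:", drop)

# In the notebook, the filter could then be written as:
# df = df.filter(col('home_ownership').isin(keep))
```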

In [42]:
df.groupBy('grade').count().collect()

                                                                                

[Row(grade='D', count=275001),
 Row(grade='B', count=546603),
 Row(grade='G', count=10695),
 Row(grade='F', count=37108),
 Row(grade='C', count=528107),
 Row(grade='E', count=116500),
 Row(grade='A', count=346317)]

5. Drop rows with explicitly missing data

In [13]:
df = df.na.drop('any')

6. Report how much data remains after your cleaning steps

In [14]:
df.count()

                                                                                

1771614

**Run this cell to export 5 rows of your data to a json for grading**

In [46]:
import json
# the with-block closes the file so the JSON is fully written to disk
with open('data-soln-1.json', 'w') as fp:
    json.dump({'df' : df.limit(5).toPandas().to_dict('records'),
               'home_counts' : [x.asDict() for x in df.groupby('home_ownership').count().collect()]
              }, fp)

                                                                                

## 2.2 Split data into train, test, and predict

In this section you will split your data into train, test, and predict datasets. Create three splits of the cleaned DataFrame `df` (train, test, predict) by using the `randomSplit` method. Read more about the method [here](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.randomSplit.html).

In [27]:
train_data, test_data, predict_data = df.randomSplit([0.8, 0.18, 0.02], 24)

After splitting into three datasets, report the number of rows for each split.

In [28]:
print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))
print("Number of prediction records : " + str(predict_data.count()))

                                                                                

Number of training records: 1416786


                                                                                

Number of testing records : 319634




Number of prediction records : 35194
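
The split sizes are close to, but not exactly, 80%, 18%, and 2% of the rows. One way to picture why, sketched below in plain Python, is that each row gets a single random draw and lands in the split whose cumulative-weight interval contains that draw. This illustrates the idea only and is not Spark's actual implementation; `random_split` is a hypothetical helper.

```python
import random
from itertools import accumulate


def random_split(rows, weights, seed):
    """Assign each row to one split using cumulative normalized weights."""
    total = sum(weights)
    bounds = list(accumulate(w / total for w in weights))  # upper bound of each split
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        # first split whose upper bound exceeds the draw; fall back to the last split
        idx = next((i for i, b in enumerate(bounds) if draw < b), len(weights) - 1)
        splits[idx].append(row)
    return splits


train, test, predict = random_split(range(10_000), [0.8, 0.18, 0.02], seed=24)
print(len(train), len(test), len(predict))
```

Because the assignment is random per row, the sizes vary slightly around the requested proportions, while a fixed seed keeps the split reproducible.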


                                                                                

## 2.3 Create pipeline and train a model

For this lab, your job is to build a model that classifies the `loan_status`. In this section you will create a machine learning pipeline and then train the model. The next cell imports all the packages you will need:

In [15]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model

In the following step, convert all the string fields to numeric indices. Look at the DataFrame schema to determine which ones you need to convert. Use the `StringIndexer` estimator. You need a separate `StringIndexer` for each column you want to convert, because each one must be fitted to learn the distinct values of its column in order to assign consistent integer values.

The format of the command will be `StringIndexer(inputCol="[INPUT COL NAME]", outputCol="[OUTPUT COL NAME]")`

In [86]:
df.printSchema()

root
 |-- int_rate: double (nullable = true)
 |-- term: string (nullable = true)
 |-- loan_status: string (nullable = true)
 |-- home_ownership: string (nullable = true)
 |-- tot_cur_bal: double (nullable = true)
 |-- annual_inc: double (nullable = true)
 |-- grade: string (nullable = true)
 |-- bc_util: double (nullable = true)



In [16]:
stringIndexer_loan = StringIndexer(inputCol="loan_status", outputCol="loan_success")
stringIndexer_term = StringIndexer(inputCol="term", outputCol="term_ix")
stringIndexer_home = StringIndexer(inputCol='home_ownership', outputCol='home_ownership_ix')
stringIndexer_grade = StringIndexer(inputCol='grade', outputCol='grade_ix')

Try looking at the values for one of the re-encoded columns using the `labels` attribute. Does it work?

In [17]:
stringIndexer_grade.labels

AttributeError: 'StringIndexer' object has no attribute 'labels'

To see the labels, the estimator first has to be fitted, which returns a transformer (a `StringIndexerModel`). You can do so with the `fit` method. Try that now:

In [18]:
model_loan = stringIndexer_loan.fit(df)

                                                                                

In [19]:
model_loan.labels

['Fully Paid', 'Charged Off']
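
The order of `labels` is not arbitrary. By default `StringIndexer` orders labels by descending frequency, so the most common value gets index 0. The plain-Python sketch below reproduces that ordering from the `loan_status` counts reported earlier (taken before rows with missing values were dropped, so treat them as approximate). `frequency_desc_labels` is a hypothetical helper, and breaking ties alphabetically is an assumption.

```python
def frequency_desc_labels(counts):
    """Order labels by descending count (ties alphabetical); list position = index."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ordered]


# Counts copied from the loan_status groupBy output earlier in the lab.
loan_counts = {"Fully Paid": 1497783, "Charged Off": 362548}
labels = frequency_desc_labels(loan_counts)
index_of = {label: float(i) for i, label in enumerate(labels)}
print(labels, index_of)
```

So in `loan_success`, 0.0 stands for "Fully Paid" and 1.0 stands for "Charged Off".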

In this step, one-hot encode the indexed variables that have more than two levels. Use `OneHotEncoder` [(read more here)](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.OneHotEncoder.html). The format is the same as for `StringIndexer`.

In [20]:
onehot_home = OneHotEncoder(inputCol="home_ownership_ix", outputCol="home_ownership_vec")
onehot_grade = OneHotEncoder(inputCol="grade_ix", outputCol="grade_vec")
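
As a rough picture of what the encoder produces: with its default `dropLast=True`, a variable with n levels becomes a vector of length n - 1, and the last level is represented by all zeros. The sketch below shows this in plain Python with dense lists (Spark stores the result as a sparse vector). `one_hot` is a hypothetical helper, not the Spark class.

```python
def one_hot(index, n_levels, drop_last=True):
    """Dense one-hot encoding of a category index, optionally dropping the last level."""
    size = n_levels - 1 if drop_last else n_levels
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0  # the dropped last level keeps an all-zero vector
    return vec


encoded = [one_hot(i, 3) for i in range(3)]
print(encoded)
```

This is also why a two-level variable such as `term` can stay as a single 0/1 index without one-hot encoding.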

In the following step, create a feature vector by combining all the categorical features using the `VectorAssembler` transformer:

In [21]:
vectorAssembler_features = VectorAssembler(
    inputCols=['term_ix', 'home_ownership_vec', 'grade_vec'], 
    outputCol= "features")

What does the resulting transformer look like? There is no data here because the transformer is just the blueprint for the change in the DataFrame.

In [22]:
vectorAssembler_features

VectorAssembler_4e043a22878e
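
Once the earlier stages have produced the input columns, the assembler concatenates them row by row into a single vector. A plain-Python sketch of that concatenation follows; the example values are made up for illustration, and `assemble` is a hypothetical helper.

```python
def assemble(*parts):
    """Concatenate scalars and vectors into one flat feature list."""
    features = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            features.extend(float(x) for x in part)
        else:
            features.append(float(part))
    return features


# made-up row: term_ix = 1.0, a 2-element home vector, a 3-element grade vector
features = assemble(1.0, [0.0, 1.0], [0.0, 0.0, 1.0])
print(features)
```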

Next, define the estimator you want to use for classification. You will build a Random Forest using the `RandomForestClassifier` [estimator](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html). This constructor is also where you would set hyperparameters for your model, such as `numTrees`.

In [23]:
rf = RandomForestClassifier(labelCol="loan_success", featuresCol="features", numTrees=50)

Finally, convert the indexed labels back to the original labels.

In [24]:
labelConverter = IndexToString(inputCol="prediction", 
                               outputCol="predictedLoanStatus", 
                               labels= ['Fully Paid', 'Charged Off'])

Let's build the pipeline now. A pipeline is an ordered list of stages, each of which is either a transformer or an estimator.

In [25]:
pipeline_rf = Pipeline(stages=[stringIndexer_loan, 
                               stringIndexer_term, 
                               stringIndexer_home, 
                               stringIndexer_grade, 
                               onehot_home,
                               onehot_grade,
                               vectorAssembler_features, 
                               rf, labelConverter])
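
Conceptually, fitting a pipeline walks through the stages in order: an estimator is fitted on the data as transformed so far and replaced by the resulting transformer, while a transformer is applied directly. The toy classes below (`ToyIndexer`, `ToyIndexerModel`, `fit_pipeline`) are hypothetical stand-ins written in plain Python to illustrate that flow; they are not the Spark classes.

```python
class ToyIndexerModel:
    """Fitted transformer: maps a column's strings to the learned integer indices."""
    def __init__(self, column, mapping):
        self.column, self.mapping = column, mapping

    def transform(self, rows):
        return [{**r, self.column + "_ix": self.mapping[r[self.column]]} for r in rows]


class ToyIndexer:
    """Estimator: learns the distinct values of a column when fitted."""
    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        # alphabetical order keeps the toy simple (Spark defaults to frequency order)
        values = sorted({r[self.column] for r in rows})
        return ToyIndexerModel(self.column, {v: i for i, v in enumerate(values)})


def fit_pipeline(stages, rows):
    """Fit estimators in order, passing the transformed data on to the next stage."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted, rows


data = [{"grade": "B", "term": "36"}, {"grade": "A", "term": "60"}]
fitted, out = fit_pipeline([ToyIndexer("grade"), ToyIndexer("term")], data)
print(out)
```

The list of fitted stages plays the role of the fitted pipeline: every element is a transformer, so new data can be pushed through all of them in sequence.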

Now, you can train your Random Forest model by using the previously defined **pipeline** and **train data**.

In [26]:
# Run the split cell in section 2.2 first so that train_data is defined.
train_data.printSchema()

In [29]:
train_data.show(10)



+-------------------+----------+-----------+--------------+-----------+----------+-----+-------+
|           int_rate|      term|loan_status|home_ownership|tot_cur_bal|annual_inc|grade|bc_util|
+-------------------+----------+-----------+--------------+-----------+----------+-----+-------+
|0.05309999942779541| 36 months|Charged Off|      MORTGAGE|    13153.0|   65400.0|    A|   70.3|
|0.05309999942779541| 36 months|Charged Off|      MORTGAGE|    15169.0|   85000.0|    A|   22.2|
|0.05309999942779541| 36 months|Charged Off|      MORTGAGE|    16313.0|  130000.0|    A|   21.4|
|0.05309999942779541| 36 months|Charged Off|      MORTGAGE|    16848.0|   45000.0|    A|   13.3|
|0.05309999942779541| 36 months|Charged Off|      MORTGAGE|    29641.0|  161000.0|    A|   29.8|
|0.05309999942779541| 36 months|Charged Off|      MORTGAGE|    36754.0|  120000.0|    A|   34.7|
|0.05309999942779541| 36 months|Charged Off|      MORTGAGE|    37312.0|   90000.0|    A|   21.2|
|0.05309999942779541| 36 month

                                                                                

In [30]:
model_rf = pipeline_rf.fit(train_data)

                                                                                

What does `model_rf` look like? What is it, an estimator or a transformer?

In [31]:
type(model_rf)

pyspark.ml.pipeline.PipelineModel

What happens when you send the train data through the transform method?

In [32]:
train_pred = model_rf.transform(train_data)

In [33]:
train_pred.printSchema()

root
 |-- int_rate: double (nullable = true)
 |-- term: string (nullable = true)
 |-- loan_status: string (nullable = true)
 |-- home_ownership: string (nullable = true)
 |-- tot_cur_bal: double (nullable = true)
 |-- annual_inc: double (nullable = true)
 |-- grade: string (nullable = true)
 |-- bc_util: double (nullable = true)
 |-- loan_success: double (nullable = false)
 |-- term_ix: double (nullable = false)
 |-- home_ownership_ix: double (nullable = false)
 |-- grade_ix: double (nullable = false)
 |-- home_ownership_vec: vector (nullable = true)
 |-- grade_vec: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)
 |-- predictedLoanStatus: string (nullable = true)



In [34]:
train_pred.filter(col('loan_status')==col('predictedLoanStatus')).count()/train_pred.count()

                                                                                

0.803683830867894

## 3.1 Model Test Results

In this section, we will evaluate the **model accuracy**. This requires using the evaluator `MulticlassClassificationEvaluator`, which you can read about in the doc [here](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.MulticlassClassificationEvaluator.html). To evaluate the model, use **test data**.

First, let's transform the test data using our model pipeline.

In [35]:
predictions = model_rf.transform(test_data)

Next, let's run the `MulticlassClassificationEvaluator` by passing in the label column (actual result), prediction column (from our model), and the metric we want to calculate.

In [36]:
evaluatorRF = MulticlassClassificationEvaluator(labelCol="loan_success", predictionCol="prediction", metricName="accuracy")
accuracy = evaluatorRF.evaluate(predictions)

                                                                                

In [37]:
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))

Accuracy = 0.803225
Test Error = 0.196775


Is the test error similar to the training error? Compute the same accuracy metric on the training data.

In [38]:
train_pred = model_rf.transform(train_data)

In [39]:
# use a separate name so the test `accuracy` computed above is not overwritten
train_accuracy = evaluatorRF.evaluate(train_pred)

                                                                                

In [40]:
print("Accuracy = %g" % train_accuracy)
print("Train Error = %g" % (1.0 - train_accuracy))

Accuracy = 0.803684
Train Error = 0.196316


## 3.2 Confusion Matrix

In this section, we will build the confusion matrix for the model using scikit-learn. We first have to extract the predicted label and the true label columns from the test predictions; the code below uses the string labels (`predictedLoanStatus` and `loan_status`). Finally, send them through the `confusion_matrix` function.

In [41]:
from sklearn.metrics import confusion_matrix

In [42]:
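# collect both columns in one pass so true and predicted labels stay aligned
rows = predictions.select("loan_status", "predictedLoanStatus").collect()
y_orig = [row["loan_status"] for row in rows]
y_pred = [row["predictedLoanStatus"] for row in rows]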

                                                                                

In [43]:
cm = confusion_matrix(y_orig, y_pred)
print("Confusion Matrix:")
print(cm)

Confusion Matrix:
[[     0  62896]
 [     0 256738]]
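
scikit-learn orders the rows and columns of the matrix by sorted label, so here the first row and column correspond to "Charged Off" and the second to "Fully Paid". Rows are true labels and columns are predictions. The sketch below reads accuracy and per-class recall off the matrix printed above using plain Python; the variable names are illustrative.

```python
# Matrix copied from the output above; rows = true label, columns = predicted label.
labels = ["Charged Off", "Fully Paid"]
cm = [[0, 62896],
      [0, 256738]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total        # diagonal / all rows
recall = {labels[i]: cm[i][i] / sum(cm[i]) for i in range(len(cm))}
print(f"accuracy = {accuracy:.6f}", recall)
```

The accuracy agrees with the test accuracy reported in section 3.1, but the recall of 0 for "Charged Off" shows that the model predicts "Fully Paid" for every test loan.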


## 3.3 ROC Curve

In this section, we will evaluate the **model area under the curve**. This requires using the evaluator `BinaryClassificationEvaluator`, which you can read about in the doc [here](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.BinaryClassificationEvaluator.html). To evaluate the model, use **test data**.

In [44]:
evaluatorRF = BinaryClassificationEvaluator(labelCol="loan_success", rawPredictionCol="prediction", metricName="areaUnderROC")
roc_result = evaluatorRF.evaluate(predictions)
roc_result

                                                                                

0.5
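
An area of exactly 0.5 is what a scorer with no ranking ability produces. One possible explanation here is that the evaluator was given the hard `prediction` column, and the confusion matrix shows that every test loan received the same prediction, so positives and negatives cannot be ranked apart. The plain-Python sketch below computes AUC as the probability that a random positive outscores a random negative, with ties counting one half. `auc` is a hypothetical helper, not the Spark evaluator.

```python
def auc(labels, scores):
    """Pairwise AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y = [0, 0, 0, 1, 1]
constant = auc(y, [0.0] * 5)                 # same score for every row
ranked = auc(y, [0.1, 0.2, 0.4, 0.3, 0.9])   # scores that partly separate the classes
print(constant, ranked)
```

Passing the `rawPrediction` or `probability` column as `rawPredictionCol` should give the evaluator continuous scores to rank, which is usually what an ROC analysis needs.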

**Run this cell to export model metrics to a json for grading**

In [45]:
import json
# the with-block closes the file so the JSON is fully written to disk
with open('data-soln-2.json', 'w') as fp:
    json.dump({'roc' : roc_result,
               'cm' : str(cm),
               'acc' : accuracy
              }, fp)

# MAKE SURE YOU STOP YOUR EMR CLUSTER!

In [None]:
spark.stop()