# Aggregations, Joins & Classification

In this lesson we will build aggregations and joins with Spark DataFrames, and then train classifiers on a customer churn dataset.

## Summary
- <a href='#1'>1. Context and Motivation</a>
- <a href='#2'>2. Aggregations</a>
    - <a href='#2.1'>2.1. Aggregation Functions</a>
    - <a href='#2.2'>2.2. Grouping</a>
    - <a href='#2.3'>2.3. Window Functions</a>
    - <a href='#2.4'>2.4. User-Defined Aggregation Functions</a>
- <a href='#3'>3. Joins</a>
    - <a href='#3.1'>3.1. Join Types</a>
        - <a href='#3.1.1'>3.1.1 Inner Joins</a>
        - <a href='#3.1.2'>3.1.2 Outer Joins</a>
        - <a href='#3.1.3'>3.1.3 Left Outer Joins</a>
        - <a href='#3.1.4'>3.1.4 Right Outer Joins</a>
        - <a href='#3.1.5'>3.1.5 Left Semi Joins</a>
        - <a href='#3.1.6'>3.1.6 Left Anti Joins</a>
        - <a href='#3.1.7'>3.1.7 Natural Joins</a>
        - <a href='#3.1.8'>3.1.8 Cross (Cartesian) Joins</a>
    - <a href='#3.2'>3.2. How Spark Performs Joins</a>
- <a href='#4'>4.  Exercises</a>
    - <a href='#4.1'>4.1. EDA</a>
    - <a href='#4.2'>4.2. Classification</a>
        - <a href='#4.2.1'>4.2.1 Logistic Regression</a>
        - <a href='#4.2.2'>4.2.2 Support Vector Machine (SVM)</a>
        - <a href='#4.2.3'>4.2.3 Decision Trees</a>
        - <a href='#4.2.4'>4.2.4 Feature Importance</a>
    - <a href='#4.3'>4.3. Evaluation</a>
- <a href='#5'>5.  References</a>

# <a id='1'>1. Context and Motivation</a>

When we work with data, we usually need to transform it into a shape that answers the question at hand. These transformations almost always involve **aggregations**.


We also often need to combine several datasets. To do that efficiently, we need to know which kind of **join** to use and how Spark executes it.


# <a id='2'>2. Aggregations</a>

Aggregation is the act of collecting something together. In an aggregation we specify a key or grouping, plus an aggregation function that determines how one or more columns are transformed.   
Aggregations let us summarize one or more columns, group the data, and view it the way we want.


In [None]:
df = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load("retail_data_2010-12-01.csv")

df.cache()

In [None]:
df.printSchema()

In [None]:
df.show(5)

## <a id='2.1'>2.1. Aggregation Functions</a>

All aggregations are available as functions, and most of them are listed at: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

Function **count(col)** 
* Aggregate function - Returns the number of items in a group.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions.
* `count(*)` counts every row, including rows that contain nulls, whereas `count(col)` skips rows where `col` is null.

In [None]:
from pyspark.sql.functions import count
df.select(count("StockCode")).show()  # number of rows with a non-null StockCode
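
The difference between counting a column and counting rows can be sketched in plain Python, with `None` standing in for a SQL null. This only illustrates the semantics and is not Spark code; the `rows` list and its values are made up for the example.

```python
# Hypothetical rows; None plays the role of a SQL null.
rows = [
    {"StockCode": "85123A", "Quantity": 6},
    {"StockCode": None, "Quantity": 2},
    {"StockCode": "71053", "Quantity": None},
]

# count(col): only rows where the column is not null are counted.
count_stock_code = sum(1 for r in rows if r["StockCode"] is not None)

# count(*): every row is counted, whether or not it contains nulls.
count_star = len(rows)

print(count_stock_code, count_star)
```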

Function **countDistinct(col, *cols)** 
* Aggregate function - Returns a new Column for distinct count of col or cols.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

In [None]:
from pyspark.sql.functions import countDistinct
df.select(countDistinct("StockCode")).show()

Function **approx_count_distinct(col, rsd=None)** 
* Aggregate function: returns a new Column for approximate distinct count of column col.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

In [None]:
from pyspark.sql.functions import approx_count_distinct
df.select(approx_count_distinct("StockCode", 0.1)).show()

Function **first(col, ignorenulls=False)** 
* Aggregate function: returns the first value in a group.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

Function **last(col, ignorenulls=False)** 
* Aggregate function: returns the last value in a group.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

In [None]:
from pyspark.sql.functions import first, last
df.select(first("StockCode"), last("StockCode")).show()

Function **min(col)** 
* Aggregate function: returns the minimum value of the expression in a group.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.min

Function **max(col)** 
* Aggregate function: returns the maximum value of the expression in a group.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.max

In [None]:
from pyspark.sql.functions import min, max
df.select(min("Quantity"), max("Quantity")).show()

Function **sum(col)** 
* Aggregate function: returns the sum of all values in the expression.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum


In [None]:
from pyspark.sql.functions import sum
df.select(sum("Quantity")).show()

Function **sumDistinct(col)** 
* Aggregate function: returns the sum of distinct values in the expression.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sumDistinct


In [None]:
from pyspark.sql.functions import sumDistinct
df.select(sumDistinct("Quantity")).show()

Function **avg(col)** 
* Aggregate function: returns the average of the values in a group.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.avg

Function **expr(str)** 
* Parses the expression string into the column that it represents.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.expr


In [None]:
from pyspark.sql.functions import sum, count, avg, expr
df.select(
    count("Quantity").alias("total_transactions"),
    sum("Quantity").alias("total_purchases"),
    avg("Quantity").alias("avg_purchases"),
    expr("mean(Quantity)").alias("mean_purchases"))\
    .selectExpr(
        "total_purchases/total_transactions",
        "avg_purchases",
        "mean_purchases").show()


Function **var_pop(col)** 
* Aggregate function: returns the population variance of the values in a group.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.var_pop

Function **stddev_pop(col)** 
* Aggregate function: returns population standard deviation of the expression in a group.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.stddev_pop

Function **var_samp(col)** 
* Aggregate function: returns the unbiased sample variance of the values in a group.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.var_samp

Function **stddev_samp(col)** 
* Aggregate function: returns the unbiased sample standard deviation of the expression in a group.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.stddev_samp

In [None]:
from pyspark.sql.functions import var_pop, stddev_pop
from pyspark.sql.functions import var_samp, stddev_samp
df.select(var_pop("Quantity"), var_samp("Quantity"),
stddev_pop("Quantity"), stddev_samp("Quantity")).show()
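
The population and sample variants differ only in the divisor (`n` versus `n - 1`). A small plain-Python sketch with the standard `statistics` module shows the same distinction; the `quantities` values are made up and this is not Spark code.

```python
import statistics

# Made-up quantities for illustration.
quantities = [2, 4, 4, 4, 5, 5, 7, 9]

# Population statistics divide by n (the analogue of var_pop / stddev_pop).
pop_var = statistics.pvariance(quantities)
pop_std = statistics.pstdev(quantities)

# Sample statistics divide by n - 1 (the analogue of var_samp / stddev_samp).
samp_var = statistics.variance(quantities)
samp_std = statistics.stdev(quantities)

print(pop_var, samp_var, pop_std, samp_std)
```

The sample values are always slightly larger than the population values, and the gap shrinks as the number of rows grows.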

### Aggregating to Complex Types

Function **agg(*exprs)** 
* Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.agg 

Function **collect_set(col)** 
* Aggregate function: returns a set of objects with duplicate elements eliminated.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.collect_set

Function **collect_list(col)** 
* Aggregate function: returns a list of objects with duplicates.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.collect_list

In [None]:
from pyspark.sql.functions import collect_set, collect_list
df.agg(collect_set("Country"), collect_list("Country")).show(20,True)

In [None]:
from pyspark.sql.functions import collect_set
df.agg(collect_set("Country")).show(20,False)

In [None]:
from pyspark.sql.functions import  collect_list
df.agg(collect_list("Country")).show(1,True)

## <a id='2.2'>2.2. Grouping</a>



Function **groupBy(*cols)** 
* Groups the DataFrame using the specified columns, so we can run aggregation on them.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy 

In [None]:
df.groupBy("InvoiceNo", "CustomerId").count().show()

In [None]:
from pyspark.sql.functions import count, expr
df.groupBy("InvoiceNo").agg(
    count("Quantity").alias("quantity_count"),
    expr("count(Quantity)")).show()

In [None]:
df.groupBy("InvoiceNo").agg(expr("avg(Quantity)"),expr("stddev_pop(Quantity)"))\
.show()
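
Conceptually, a grouped aggregation first buckets rows by key and then applies the aggregation function to each bucket. The following plain-Python sketch mimics that for a count and an average per invoice; the rows are made up and the code is an illustration, not how Spark executes the query.

```python
from collections import defaultdict

# Made-up (InvoiceNo, Quantity) pairs.
rows = [("536365", 6), ("536365", 8), ("536366", 6), ("536367", 32), ("536367", 6)]

# Step 1: group the quantities by key.
groups = defaultdict(list)
for invoice_no, quantity in rows:
    groups[invoice_no].append(quantity)

# Step 2: apply the aggregation functions to every group.
summary = {
    invoice_no: {"count": len(values), "avg": sum(values) / len(values)}
    for invoice_no, values in groups.items()
}

print(summary)
```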

## <a id='2.3'>2.3. Window Functions</a>

Function **dense_rank()** 
* Window function: returns the rank of rows within a window partition, without any gaps.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.dense_rank 
    
Function **rank()** 
* Window function: returns the rank of rows within a window partition. Tied rows share a rank, and the ranks that follow are skipped.
* See https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.rank 

We are going to establish the maximum purchase quantity for each customer and date, along with a rank for every purchase.

In [None]:
from pyspark.sql.functions import col, to_date,desc, max, dense_rank, rank
from pyspark.sql.window import Window

dfWithDate = df.withColumn("date", to_date(col("InvoiceDate"), "MM/d/yyyy H:mm"))


In [None]:
windowSpec = Window\
    .partitionBy("CustomerId", "date")\
    .orderBy(desc("Quantity"))\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

In [None]:
maxPurchaseQuantity = max(col("Quantity")).over(windowSpec)

In [None]:
purchaseDenseRank = dense_rank().over(windowSpec)
purchaseRank = rank().over(windowSpec)

In [None]:
dfWithDate.where("CustomerId IS NOT NULL").orderBy("CustomerId")\
    .select(
        col("CustomerId"),
        col("date"),
        col("Quantity"),
        purchaseRank.alias("quantityRank"),
        purchaseDenseRank.alias("quantityDenseRank"),
        maxPurchaseQuantity.alias("maxPurchaseQuantity")).show()
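
To see how `rank`, `dense_rank` and the running maximum behave on ties, here is a plain-Python sketch for a single window partition ordered by descending quantity. The helper `rank_rows` and its input are hypothetical; the code only imitates the semantics of the window above.

```python
def rank_rows(quantities):
    """Return (quantity, rank, dense_rank, running_max) tuples for one partition."""
    ordered = sorted(quantities, reverse=True)  # like orderBy(desc("Quantity"))
    result = []
    rank = dense = 0
    previous = None
    running_max = None
    for position, quantity in enumerate(ordered, start=1):
        if quantity != previous:
            rank = position   # rank jumps to the row position, leaving gaps after ties
            dense += 1        # dense_rank always increases by exactly one
            previous = quantity
        # Frame "unbounded preceding .. current row": max over the rows seen so far.
        running_max = quantity if running_max is None else max(running_max, quantity)
        result.append((quantity, rank, dense, running_max))
    return result


rows = rank_rows([12, 36, 36, 6, 12])
for row in rows:
    print(row)
```

Because the partition is sorted in descending order, the running maximum equals the first quantity for every row.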

## <a id='2.4'>2.4. User-Defined Aggregation Functions</a>

**User-defined aggregation functions (UDAFs) can be defined, but only in Java or Scala.**

# <a id='3'>3. Joins</a>
A join brings together two sets of data, the left and the right, by comparing the value of one or
more keys of the left and right and evaluating the result of a join expression that determines
whether Spark should bring together the left set of data with the right set of data.

## <a id='3.1'>3.1. Join Types</a>
Whereas the join expression determines whether two rows should join, the join type determines
what should be in the result set. 

In [None]:
person = spark.createDataFrame([
(0, "Bill Chambers", 0, [100]),
(1, "Matei Zaharia", 1, [500, 250, 100]),
(2, "Michael Armbrust", 1, [250, 100])])\
.toDF("id", "name", "graduate_program", "spark_status")

graduateProgram = spark.createDataFrame([
(0, "Masters", "School of Information", "ISCTE"),
(2, "Masters", "School of Information", "ISCTE"),
(1, "Ph.D.", "School of Information", "ISCTE")])\
.toDF("id", "degree", "department", "school")

sparkStatus = spark.createDataFrame([
(500, "Vice President"),
(250, "PMC Member"),
(100, "Contributor")])\
.toDF("id", "status")

### <a id='3.1.1'>3.1.1. Inner Joins</a>
Inner joins evaluate the keys in both of the DataFrames or tables and include (and join together)
only the rows that evaluate to true.   
**Inner joins (keep rows with keys that exist in the left and right datasets)**

In [None]:
joinExpression = person["graduate_program"] == graduateProgram['id']

In [None]:
joinExpression

In [None]:
# Would this expression also work?
# newJoinExpression = person["name"] == graduateProgram["school"]

#### join in dataframe 

In [None]:
person.join(graduateProgram, joinExpression).show()

In [None]:
joinType = "inner"

In [None]:
person.join(graduateProgram, joinExpression, joinType).show()
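
The inner join semantics can be sketched in plain Python as a filter over all pairs of rows. The tuples below are simplified copies of the `person` and `graduateProgram` rows; this is an illustration only, since Spark does not evaluate joins by enumerating every pair.

```python
# (id, name, graduate_program)
person = [(0, "Bill Chambers", 0), (1, "Matei Zaharia", 1), (2, "Michael Armbrust", 1)]
# (id, degree)
graduate_program = [(0, "Masters"), (2, "Masters"), (1, "Ph.D.")]

# Keep only the pairs for which the join expression is true:
# person.graduate_program == graduateProgram.id
inner = [(p, g) for p in person for g in graduate_program if p[2] == g[0]]

for p, g in inner:
    print(p[1], "->", g[1])
```

Program `2` has no matching person, so it does not appear in the result.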

### <a id='3.1.2'>3.1.2. Outer Joins</a>
Outer joins evaluate the keys in both of the DataFrames or tables and include (and join
together) the rows that evaluate to true or false. If there is no equivalent row in either the left or
right DataFrame, Spark will insert null:

**Outer joins (keep rows with keys in either the left or right datasets)**

In [None]:
joinType = "outer"

In [None]:
person.join(graduateProgram, joinExpression, joinType).show()
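
A full outer join keeps unmatched rows from both sides and pads the missing side with nulls. The sketch below shows this in plain Python with `None` as the null; `full_outer_join` is a hypothetical helper written for this illustration, not a Spark API.

```python
def full_outer_join(left, right, left_key, right_key):
    """Pair matching rows; unmatched rows are paired with None."""
    result = []
    matched_right = set()
    for l in left:
        found = False
        for i, r in enumerate(right):
            if left_key(l) == right_key(r):
                result.append((l, r))
                matched_right.add(i)
                found = True
        if not found:
            result.append((l, None))   # left row without a partner
    for i, r in enumerate(right):
        if i not in matched_right:
            result.append((None, r))   # right row without a partner
    return result


person = [(0, "Bill Chambers", 0), (1, "Matei Zaharia", 1), (2, "Michael Armbrust", 1)]
graduate_program = [(0, "Masters"), (2, "Masters"), (1, "Ph.D.")]

outer = full_outer_join(person, graduate_program, lambda p: p[2], lambda g: g[0])
for pair in outer:
    print(pair)
```

Dropping the second loop gives a left outer join; dropping the `(l, None)` branch instead gives a right outer join.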

### <a id='3.1.3'>3.1.3. Left Outer Joins</a>

Left outer joins evaluate the keys in both of the DataFrames or tables and include all rows from
the left DataFrame as well as any rows in the right DataFrame that have a match in the left
DataFrame.   

**Left outer joins (keep rows with keys in the left dataset)**

In [None]:
joinType = "left_outer"

In [None]:
graduateProgram.join(person, joinExpression, joinType).show()

### <a id='3.1.4'>3.1.4. Right Outer Joins</a>

Right outer joins evaluate the keys in both of the DataFrames or tables and include all rows
from the right DataFrame as well as any rows in the left DataFrame that have a match in the right
DataFrame.

**Right outer joins (keep rows with keys in the right dataset)**


In [None]:
joinType = "right_outer"

In [None]:
person.join(graduateProgram, joinExpression, joinType).show()

###   <a id='3.1.5'>3.1.5. Left Semi Joins</a>

Semi joins are a bit of a departure from the other joins. They do not actually include any values
from the right DataFrame. They only compare values to see if the value exists in the second
DataFrame. If the value does exist, those rows will be kept in the result, even if there are
duplicate keys in the left DataFrame. 


**Left semi joins (keep the rows in the left, and only the left, dataset where the key appears in the right dataset)**

In [None]:
joinType = "left_semi"

In [None]:
graduateProgram.join(person, joinExpression, joinType).show()

In [None]:
gradProgram2 = graduateProgram.union(spark.createDataFrame([
(0, "Masters", "Duplicated Row", "Duplicated School")]))

In [None]:
gradProgram2.join(person, joinExpression, joinType).show()
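
A left semi join behaves like a filter on the left side: rows are kept when their key exists on the right, and no right-hand columns are added. A plain-Python sketch with simplified, partly made-up rows (including a duplicated key on the left):

```python
# (id, degree) rows on the left, with key 0 appearing twice.
graduate_program = [
    (0, "Masters"),
    (2, "Masters"),
    (1, "Ph.D."),
    (0, "Masters (duplicated row)"),
]
# graduate_program values found in the person table (the right side).
person_program_ids = {0, 1}

# Keep left rows whose key appears on the right; nothing from the right is added.
semi = [g for g in graduate_program if g[0] in person_program_ids]

print(semi)
```

Both rows with key `0` survive, which matches the remark above about duplicate keys in the left DataFrame.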

###  <a id='3.1.6'>3.1.6. Left Anti Joins</a>

Left anti joins are the opposite of left semi joins. Like left semi joins, they do not actually
include any values from the right DataFrame.    
They only compare values to see if the value exists in the second DataFrame.   
However, rather than keeping the values that exist in the second
DataFrame, they keep only the values that do not have a corresponding key in the second
DataFrame.

**Left anti joins (keep the rows in the left, and only the left, dataset where they do not appear in the right dataset)**

In [None]:
joinType = "left_anti"

In [None]:
graduateProgram.join(person, joinExpression, joinType).show()
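
The anti join is the same filter with the condition negated, similar to a SQL `NOT IN` test. A plain-Python sketch using simplified copies of the rows above (an illustration, not Spark code):

```python
# (id, degree) rows on the left.
graduate_program = [(0, "Masters"), (2, "Masters"), (1, "Ph.D.")]
# graduate_program values found in the person table (the right side).
person_program_ids = {0, 1}

# Keep only the left rows whose key does NOT appear on the right.
anti = [g for g in graduate_program if g[0] not in person_program_ids]

print(anti)
```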

### <a id='3.1.7'>3.1.7.  Natural Joins</a>

Natural joins make implicit guesses at the columns on which you would like to join. It finds
matching columns and returns the results. Left, right, and outer natural joins are all supported.

**Natural joins (perform a join by implicitly matching the columns between the two datasets with the same names)**
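
Implicit matching can silently pick the wrong columns. The plain-Python sketch below joins on every shared column name, which for our two tables is only `id`; the helper `natural_join` is hypothetical and the dictionaries are simplified copies of the example rows.

```python
def natural_join(left, right):
    """Join two lists of dict rows on every column name they have in common."""
    joined = []
    for l in left:
        for r in right:
            shared = l.keys() & r.keys()
            if shared and all(l[c] == r[c] for c in shared):
                joined.append({**l, **r})
    return joined


person = [
    {"id": 0, "name": "Bill Chambers", "graduate_program": 0},
    {"id": 1, "name": "Matei Zaharia", "graduate_program": 1},
    {"id": 2, "name": "Michael Armbrust", "graduate_program": 1},
]
graduate_program = [
    {"id": 0, "degree": "Masters"},
    {"id": 2, "degree": "Masters"},
    {"id": 1, "degree": "Ph.D."},
]

# The only shared column is "id", so people are matched on their own id
# rather than on graduate_program.
result = natural_join(person, graduate_program)
degrees = {row["name"]: row["degree"] for row in result}
print(degrees)
```

Michael Armbrust is paired with the program whose `id` is 2 instead of his actual program (1), which is why an explicit join expression is usually the safer choice.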

### <a id='3.1.8'>3.1.8.  Cross (Cartesian) Joins</a>

The last of our joins are cross-joins or cartesian products. Cross-joins in simplest terms are inner
joins that do not specify a predicate. Cross joins will join every single row in the left DataFrame
to every single row in the right DataFrame.


**Cross (or Cartesian) joins (match every row in the left dataset with every row in the right dataset)**

In [None]:
joinType = "cross"
graduateProgram.join(person, joinExpression, joinType).show()

In [None]:
person.crossJoin(graduateProgram).show()
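
The size of a cross join is the product of the two row counts, which is why it grows so quickly. A plain-Python sketch with `itertools.product`, using simplified copies of the example rows:

```python
from itertools import product

person = [(0, "Bill Chambers"), (1, "Matei Zaharia"), (2, "Michael Armbrust")]
graduate_program = [(0, "Masters"), (2, "Masters"), (1, "Ph.D.")]

# Every left row is paired with every right row; there is no predicate.
cross = list(product(person, graduate_program))

print(len(cross))
```

With 3 rows on each side the result has 9 rows; two tables of one million rows each would already produce 10^12 pairs.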

## <a id='3.2'>3.2. How Spark Performs Joins</a>

### Big table–to–big table
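When you join a big table to another big table, you end up with a shuffle join, such as the one illustrated below. In a shuffle join, every node talks to every other node, and rows are exchanged according to which node is responsible for a given key or set of keys.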

<img src="big-to-big.png" width="500px"/>
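
The idea behind a shuffle join can be sketched in plain Python: both sides are repartitioned by the hash of the join key, so rows with equal keys land in the same partition and each partition can be joined on its own. The helpers `shuffle` and `shuffle_join` are hypothetical, and the sketch uses a dictionary lookup inside each partition for brevity, whereas Spark typically uses a sort-merge strategy there.

```python
from collections import defaultdict


def shuffle(rows, key, num_partitions):
    """Send every row to the partition chosen by the hash of its key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(key(row)) % num_partitions].append(row)
    return partitions


def shuffle_join(left, right, left_key, right_key, num_partitions=4):
    left_parts = shuffle(left, left_key, num_partitions)
    right_parts = shuffle(right, right_key, num_partitions)
    joined = []
    for p in range(num_partitions):
        # Rows with equal keys are in the same partition, so each
        # partition can be joined without looking at the others.
        lookup = defaultdict(list)
        for r in right_parts.get(p, []):
            lookup[right_key(r)].append(r)
        for l in left_parts.get(p, []):
            for r in lookup.get(left_key(l), []):
                joined.append((l, r))
    return joined


person = [(0, "Bill Chambers", 0), (1, "Matei Zaharia", 1), (2, "Michael Armbrust", 1)]
graduate_program = [(0, "Masters"), (2, "Masters"), (1, "Ph.D.")]

result = shuffle_join(person, graduate_program, lambda p: p[2], lambda g: g[0])
print(sorted(result))
```

The expensive part in a real cluster is the `shuffle` step, because it moves rows of both tables across the network.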

### Big table–to–small table
When the table is small enough to fit into the memory of a single worker node, we can optimize our join. Although we can use a big table–to–big table communication strategy, it can often be more efficient to use a broadcast join. What this means is that we will replicate our small DataFrame onto every worker node in the cluster (be it
located on one machine or many). Now this sounds expensive. However, what this does is
prevent us from performing the all-to-all communication during the entire join process. Instead,
we perform it only once at the beginning and then let each individual worker node perform the
work without having to wait or communicate with any other worker node.

<img src="big-to-small.png" width="700px"/>

In [None]:
from pyspark.sql.functions import broadcast 

# broadcast() marks a DataFrame as small enough for use in broadcast joins.
person.join(broadcast(graduateProgram), joinExpression).explain()
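
A broadcast join can be pictured as follows: the small table is turned into a lookup structure once, a copy is given to every worker, and each partition of the big table is then joined locally. The plain-Python sketch below imitates this with a hypothetical `broadcast_join` helper and made-up partitions; it is not how Spark is implemented internally.

```python
def broadcast_join(big_partitions, small_table, big_key, small_key):
    """Join each partition of the big table against one shared lookup table."""
    # Built once; in a cluster this is what would be copied to every worker.
    lookup = {}
    for s in small_table:
        lookup.setdefault(small_key(s), []).append(s)

    joined = []
    for partition in big_partitions:       # each partition works independently
        for b in partition:
            for s in lookup.get(big_key(b), []):
                joined.append((b, s))
    return joined


# The "big" table, already split into two partitions.
person_partitions = [
    [(0, "Bill Chambers", 0), (1, "Matei Zaharia", 1)],
    [(2, "Michael Armbrust", 1)],
]
graduate_program = [(0, "Masters"), (2, "Masters"), (1, "Ph.D.")]

result = broadcast_join(
    person_partitions, graduate_program, lambda p: p[2], lambda g: g[0]
)
print(result)
```

Unlike the shuffle strategy, the rows of the big table never leave their partition; only the small table is moved.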

### Little table–to–little table
When performing joins with small tables, it’s usually best to let Spark decide how to join them.   
You can always force a broadcast join if you’re noticing strange behavior.

# <a id='4'>4. Exercises</a>

# Customer Churn

Customer churn, also known as customer attrition, customer turnover, or customer defection, is the loss of clients or customers.

Telephone service companies, Internet service providers, pay TV companies, insurance firms, and alarm monitoring services, often use customer churn analysis and customer churn rates as one of their key business metrics because the cost of retaining an existing customer is far less than acquiring a new one. Companies from these sectors often have customer service branches which attempt to win back defecting clients, because recovered long-term customers can be worth much more to a company than newly recruited clients.

Companies usually make a distinction between voluntary churn and involuntary churn. Voluntary churn occurs due to a decision by the customer to switch to another company or service provider; involuntary churn occurs due to circumstances such as a customer's relocation to a long-term care facility, death, or relocation to a distant location. In most applications, involuntary reasons for churn are excluded from the analytical models. Analysts tend to concentrate on voluntary churn, because it typically occurs due to factors of the company-customer relationship which companies control, such as how billing interactions are handled or how after-sales help is provided.

Predictive analytics use churn prediction models that predict customer churn by assessing their propensity of risk to churn. Since these models generate a small prioritized list of potential defectors, they are effective at focusing customer retention marketing programs on the subset of the customer base who are most vulnerable to churn.

## Column Description   

| Column     | Type       | Description |
|--------|---------|:---------|
| **customerID** | String | Customer ID |
| **gender** | String | Whether the customer is a male or a female |
| **SeniorCitizen** | Integer | Whether the customer is a senior citizen or not (1, 0) |
| **Partner** | String | Whether the customer has a partner or not (Yes, No) |
| **Dependents** | String | Whether the customer has dependents or not (Yes, No) |
| **tenure** | Integer | Number of months the customer has stayed with the company |
| **PhoneService** | String | Whether the customer has a phone service or not (Yes, No) |
| **MultipleLines** | String | Whether the customer has multiple lines or not (Yes, No, No phone service) |
| **InternetService** | String | Customer’s internet service provider (DSL, Fiber optic, No) |
| **OnlineSecurity** | String | Whether the customer has online security or not (Yes, No, No internet service) |
| **OnlineBackup** | String | Whether the customer has online backup or not (Yes, No, No internet service) |
| **DeviceProtection** | String | Whether the customer has device protection or not (Yes, No, No internet service) |
| **TechSupport** | String | Whether the customer has tech support or not (Yes, No, No internet service) |
| **StreamingTV** | String | Whether the customer has streaming TV or not (Yes, No, No internet service) |
| **StreamingMovies** | String | Whether the customer has streaming movies or not (Yes, No, No internet service) |
| **Contract** | String | The contract term of the customer (Month-to-month, One year, Two year) |
| **PaperlessBilling** | String | Whether the customer has paperless billing or not (Yes, No) |
| **PaymentMethod** | String | The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)) |
| **MonthlyCharges** | Double | The amount charged to the customer monthly |
| **TotalCharges** | String | The total amount charged to the customer |
| **Churn** | String | Whether the customer churned or not (Yes or No) |

In [None]:
#Create the dataframe.
df = spark.read.format("csv")\
.option("header","true")\
.option("inferSchema","true")\
.load("WA_Fn-UseC_-Telco-Customer-Churn.csv")

## <a id='4.1'>4.1. EDA (Exploratory Data Analysis)</a>

Convert string columns to numeric types where possible.

In [None]:
from pyspark.sql.types import DoubleType, IntegerType
from pyspark.sql.functions import col, when

In [None]:
df = df.withColumn("TotalCharges", df["TotalCharges"].cast(DoubleType()))

In [None]:
df = df.withColumn('Label', when(df["Churn"] == "Yes" , 1).otherwise(0)) # convert into 0 or 1

In [None]:
df.where(col("TotalCharges").isNull()).count() # check null values

In [None]:
df.where(col("SeniorCitizen").isNull()).count() # check null values

In [None]:
df = df.na.drop(subset=["TotalCharges"]) ## Drop null values

## <a id='4.2'>4.2. Classification</a>

### <a id='4.2.1'>4.2.1. Logistic Regression</a>



In [None]:
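from pyspark.ml.feature import RFormula
# Note: "." expands to every column except the formula's label, so Churn (the source of Label)
# and customerID are also used as features here; consider dropping them before modelling.
supervised = RFormula(formula="label ~ . + Churn:TotalCharges + Churn:MonthlyCharges + Churn:SeniorCitizen")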

In [None]:
fittedRF = supervised.fit(df)
preparedDF = fittedRF.transform(df)
preparedDF.show(1)

Exercise: **Create a logistic regression model**  
See https://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression

In [None]:
from pyspark.ml.classification import LogisticRegression

train, test = preparedDF.randomSplit([0.7, 0.3])

lr = LogisticRegression(labelCol="Label",featuresCol="features",  regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(train)

# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

# We can also use the multinomial family for binary classification
mlr = LogisticRegression(labelCol="Label",featuresCol="features", maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial")

# Fit the model
mlrModel = mlr.fit(train)

# Print the coefficients and intercepts for logistic regression with multinomial family
print("Multinomial coefficients: " + str(mlrModel.coefficientMatrix))
print("Multinomial intercepts: " + str(mlrModel.interceptVector))

### <a id='4.2.2'>4.2.2. Support Vector Machine</a>
Create a Support Vector Machine classifier.   
See https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-support-vector-machine

### <a id='4.2.3'>4.2.3. Decision Trees </a>
Create a Decision Tree classifier.   
See https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier

### <a id='4.2.4'>4.2.4. Feature importance </a>

Print the tree and check the most important features.

## <a id='4.3'>4.3. Evaluation</a>

In [None]:
# Extract the summary from the returned LogisticRegressionModel instance trained
# in the earlier example (lrModel, so that the threshold set on lr below matches it)
trainingSummary = lrModel.summary

# Obtain the objective per iteration
objectiveHistory = trainingSummary.objectiveHistory
print("objectiveHistory:")
for objective in objectiveHistory:
    print(objective)

trainingSummary.roc.show()
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))

# Set the model threshold to maximize F-Measure
fMeasure = trainingSummary.fMeasureByThreshold
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']) \
    .select('threshold').head()['threshold']
lr.setThreshold(bestThreshold)

In [None]:
print("False positive rate by label:")
for i, rate in enumerate(trainingSummary.falsePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print("True positive rate by label:")
for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print("Precision by label:")
for i, prec in enumerate(trainingSummary.precisionByLabel):
    print("label %d: %s" % (i, prec))

print("Recall by label:")
for i, rec in enumerate(trainingSummary.recallByLabel):
    print("label %d: %s" % (i, rec))

print("F-measure by label:")
for i, f in enumerate(trainingSummary.fMeasureByLabel()):
    print("label %d: %s" % (i, f))

accuracy = trainingSummary.accuracy
falsePositiveRate = trainingSummary.weightedFalsePositiveRate
truePositiveRate = trainingSummary.weightedTruePositiveRate
fMeasure = trainingSummary.weightedFMeasure()
precision = trainingSummary.weightedPrecision
recall = trainingSummary.weightedRecall

### <a id='4.3.1'>4.3.1. Confusion Matrix</a>
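
As a starting point, a confusion matrix simply counts how often each (label, prediction) pair occurs. The plain-Python sketch below uses made-up labels and predictions and derives precision, recall and F1 from the four counts; with a Spark model, the pairs would come from the `Label` and `prediction` columns of the transformed test set.

```python
from collections import Counter

# Made-up true labels and model predictions (1 = churned, 0 = stayed).
labels      = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

counts = Counter(zip(labels, predictions))
tp = counts[(1, 1)]  # churners correctly flagged
fn = counts[(1, 0)]  # churners that were missed
fp = counts[(0, 1)]  # loyal customers wrongly flagged
tn = counts[(0, 0)]  # loyal customers correctly left alone

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print([[tn, fp], [fn, tp]])
print(precision, recall, f1)
```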

### <a id='4.3.2'>4.3.2. AUC</a>
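
The area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes it that way by comparing all positive/negative pairs; the helper `area_under_roc` and the scores are made up, and the quadratic loop is only suitable for small examples.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted churn probabilities.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]

auc = area_under_roc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of churners from non-churners.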

# <a id='5'>5. References</a>

https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/

https://spark.apache.org/docs/latest/api/python/

https://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression

https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier

https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-support-vector-machine