# Logistic regression: a consulting project example

Logistic regression predicts a binary outcome (binomial logistic regression) or a multiclass outcome (multinomial logistic regression). We use the method when classifying into yes/no categories, such as smokers and non-smokers, sick and healthy, or exposed and non-exposed.
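
As a brief reminder of the model behind the method (standard textbook notation, not taken from the project itself): binomial logistic regression passes a linear combination of the features through the logistic (sigmoid) function to obtain a probability. Here $x$ is the feature vector, and $w$ and $b$ are the learned weights and intercept:

$$
P(y = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}
$$

The predicted class is 1 when this probability exceeds a chosen threshold, commonly 0.5.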

## Client churn - binary option

A marketing agency has many clients who use its services to create advertisements for their websites. Recently, the agency has noticed that many clients are churning. At the moment it assigns client advisors at random, but it wants a machine learning model that predicts which clients will leave (stop buying its service), so that advisors can be assigned to the clients most likely to leave.

Data are stored in the file **test_customer_churn.csv**. Its structure is as follows:

    Names: Name and surname of the customer
    Age: Age of the customer
    Total_Purchase: Total money spent on our services
    Account_Manager: 0 - no customer advisor assigned, 1 - customer advisor assigned
    Years: How long they have been our customer
    Num_Sites: Number of websites that use our services
    Onboard_date: Last contract date
    Location: Client address
    Company: Company name
    Churn (only in test set): 1 or 0 (churn or not)

### Initialize Spark

In [1]:
import findspark
findspark.init()
import pyspark
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.port.maxRetries", "60").getOrCreate()
print(spark.version)

3.3.0


### Read data from test_customer_churn.csv

In [2]:
data = spark.read.csv('test_customer_churn.csv',inferSchema=True,header=True)

### Show the dataset schema

In [3]:
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



### Overview of the loaded data

In [4]:
data.describe().show()

+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+--------------------+--------------------+-------------------+
|summary|        Names|              Age|   Total_Purchase|   Account_Manager|            Years|         Num_Sites|            Location|             Company|              Churn|
+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+--------------------+--------------------+-------------------+
|  count|          900|              900|              900|               900|              900|               900|                 900|                 900|                900|
|   mean|         null|41.81666666666667|10062.82403333334|0.4811111111111111| 5.27315555555555| 8.587777777777777|                null|                null|0.16666666666666666|
| stddev|         null|6.127560416916251|2408.644531858096|0.4999208935073339|1.274449013194616|1.764835592035

In [5]:
data.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

### Data processing - get only fields needed for calculations

The factors that will determine the outcome are:
- age of the customer
- the total amount paid for our service
- whether an advisor is assigned
- how long they have been our client
- the number of websites where they use our service

In [6]:
from pyspark.ml.feature import VectorAssembler

In [7]:
assembler = VectorAssembler(inputCols=[
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites'],outputCol='features')

In [8]:
output = assembler.transform(data)

In [9]:
final_data = output.select('features','churn')
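
To make the role of the `features` column concrete, here is a plain-Python sketch of what the assembling step does conceptually: it concatenates the selected columns of each row into a single numeric vector, in the order listed. The helper `assemble_row` and the sample row are hypothetical illustrations and not part of Spark.

```python
# Plain-Python illustration; `assemble_row` is a hypothetical helper, not a Spark API.
FEATURE_COLS = ['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites']

def assemble_row(row, cols=FEATURE_COLS):
    """Return the values of `cols` from `row` as a list of floats, in order."""
    return [float(row[c]) for c in cols]

# Made-up sample row (values are illustrative only).
sample_row = {
    'Names': 'Jane Doe', 'Age': 42.0, 'Total_Purchase': 11066.8,
    'Account_Manager': 0, 'Years': 7.22, 'Num_Sites': 8.0,
    'Company': 'Example Inc', 'Churn': 1,
}

features = assemble_row(sample_row)
print(features)
```

Columns that are not listed (names, dates, addresses) are simply left out of the vector, which is why only the five numeric fields influence the model.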

### Splitting the dataset into train (70%) and test (30%) datasets

In [10]:
train_churn,test_churn = final_data.randomSplit([0.7,0.3])
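
`randomSplit` assigns rows to the splits at random, so the proportions are only approximately 70/30 (consistent with the 625 training rows and 275 test rows seen in the outputs of this notebook) and, without a seed, change from run to run. The method accepts an optional `seed` argument to make the split repeatable. The plain-Python sketch below illustrates the idea with a hypothetical `random_split` helper; it is a simplified model of the behavior, not Spark's implementation.

```python
import random

def random_split(rows, train_fraction=0.7, seed=None):
    """Send each row to train with probability `train_fraction`, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(900))  # stand-in for the 900 customers

train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

# Same seed, same split; sizes are close to 630/270 but usually not exact.
print(len(train_a), len(test_a))
print(train_a == train_b)
```

Because every metric reported later depends on which rows land in the test set, fixing the seed is what makes the numbers repeatable between runs.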

### Model fit using logistic regression algorithm

In [11]:
from pyspark.ml.classification import LogisticRegression

In [12]:
lr_churn = LogisticRegression(labelCol='churn')

In [13]:
fitted_churn_model = lr_churn.fit(train_churn)

In [14]:
training_sum = fitted_churn_model.summary

In [15]:
training_sum.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|              churn|         prediction|
+-------+-------------------+-------------------+
|  count|                625|                625|
|   mean|             0.1664|              0.128|
| stddev|0.37273761995984966|0.33435740128621594|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+



### Calculation of results on a test set - we predict whether the customer will churn or not

In [16]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [17]:
pred_and_labels = fitted_churn_model.evaluate(test_churn)

In [18]:
pred_and_labels.predictions.show(100)

+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[22.0,11254.38,1....|    0|[4.47648927766323...|[0.98875462515543...|       0.0|
|[27.0,8628.8,1.0,...|    0|[5.40274206821269...|[0.99551598383050...|       0.0|
|[28.0,11245.38,0....|    0|[3.74208778845215...|[0.97684433359088...|       0.0|
|[29.0,10203.18,1....|    0|[3.61018961253980...|[0.97366554265775...|       0.0|
|[29.0,11274.46,1....|    0|[4.49634473643090...|[0.98897326735655...|       0.0|
|[29.0,13255.05,1....|    0|[4.09856610865091...|[0.98367448974746...|       0.0|
|[30.0,6744.87,0.0...|    0|[3.63268262463781...|[0.97423618120374...|       0.0|
|[30.0,7960.64,1.0...|    1|[3.19230113426170...|[0.96054352514310...|       0.0|
|[31.0,8688.21,0.0...|    0|[7.09973445468506...|[0.99917535654297...|       0.0|
|[31.0,10058.87,

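What does the above table show?
For each customer, `rawPrediction` holds the raw model scores for classes 0 and 1, and `probability` holds the corresponding class probabilities. `prediction` is 1 (customer churned) when the probability of class 1 exceeds the threshold, which is 0.5 by default. In the rows shown, the first probability (class 0) is well above 0.5, so the prediction is 0.
The prediction does not always match the actual value in `churn`: for example, the row starting with [30.0,7960.64,... has churn = 1 but prediction = 0.0. Overall, the algorithm predicts fewer churned customers than the actual data contain.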
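
The relationship between the columns can be checked by hand. The sketch below is plain Python and assumes the standard logistic link, under which the probability of a class is the sigmoid of its raw score. The raw score is the first value of `rawPrediction` in the first row of the table (truncated as displayed there), and `predict_from_class0_score` is a hypothetical helper.

```python
import math

def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_from_class0_score(raw_class0, threshold=0.5):
    """Return (p_class0, p_class1, prediction) from the raw score of class 0."""
    p0 = sigmoid(raw_class0)
    p1 = 1.0 - p0
    prediction = 1.0 if p1 > threshold else 0.0
    return p0, p1, prediction

# First row of the table: rawPrediction starts with 4.47648927766323...
p0, p1, prediction = predict_from_class0_score(4.47648927766323)
print(round(p0, 4), round(p1, 4), prediction)
```

The computed class-0 probability agrees with the value shown in the `probability` column for that row, and since the class-1 probability is far below 0.5 the prediction is 0.0.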

In [19]:
pred_and_labels.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|              churn|         prediction|
+-------+-------------------+-------------------+
|  count|                275|                275|
|   mean|0.16727272727272727|0.14545454545454545|
| stddev| 0.3738996242282269|0.35320130414186124|
|    min|                  0|                0.0|
|    max|                  1|                1.0|
+-------+-------------------+-------------------+



Comparing the means of the two columns shows that the model predicts churn slightly less often than it actually occurs: the mean of prediction (about 0.145) is lower than the mean of churn (about 0.167).

This suggests that in some cases the algorithm predicts that a customer has not left when in fact they have. The means alone do not show which customers are misclassified.
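
Comparing column means cannot tell which individual customers were misclassified. A confusion matrix counts the four possible outcomes directly. The sketch below is plain Python on made-up labels (the lists `actual` and `predicted` are hypothetical, not taken from the dataset); false negatives are the customers who churned although the model predicted they would stay.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count true/false positives/negatives for 0/1 labels (1 = churned)."""
    names = {(1, 1): 'TP', (0, 0): 'TN', (0, 1): 'FP', (1, 0): 'FN'}
    counts = Counter(names[(a, p)] for a, p in zip(actual, predicted))
    return {k: counts.get(k, 0) for k in ('TP', 'TN', 'FP', 'FN')}

# Made-up example: 3 customers churned, the model flagged 2 customers.
actual    = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
predicted = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

cm = confusion_counts(actual, predicted)
recall = cm['TP'] / (cm['TP'] + cm['FN'])  # share of churners that were caught
print(cm, recall)
```

For the advisor-assignment use case, the false negatives are the costly errors, since those customers leave without ever being offered an advisor.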

### Evaluating the performance of the algorithm with AUC

**AUC** (area under the ROC curve) is used primarily in binary classification. The ROC curve plots the true positive rate against the false positive rate as the decision threshold varies, and the AUC summarizes it as a single number between 0 and 1: it equals the probability that a randomly chosen positive example (a churned customer) receives a higher score than a randomly chosen negative one. A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.

**Higher score is better.**
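
One way to read the AUC is as the probability that a randomly chosen churned customer gets a higher score than a randomly chosen retained one. The plain-Python sketch below computes it that way on made-up data (the function `pairwise_auc` and the sample values are hypothetical). It also shows that feeding hard 0/1 predictions instead of continuous scores can give a noticeably different value, because all information about the ranking within each predicted class is lost.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Made-up labels and model scores (probability of churn).
labels = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.6, 0.55, 0.3, 0.2, 0.4]
hard = [1 if s > 0.5 else 0 for s in scores]  # thresholded 0/1 predictions

auc_scores = pairwise_auc(labels, scores)  # uses the full ranking
auc_hard = pairwise_auc(labels, hard)      # uses only the 0/1 decisions
print(auc_scores, auc_hard)
```

In this toy example the score-based AUC is 8/9 while the AUC of the thresholded predictions is 2/3, so which column is handed to an evaluator matters when interpreting the result.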

In [20]:
churn_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                           labelCol='churn')

In [21]:
auc = churn_eval.evaluate(pred_and_labels.predictions)

In [22]:
auc

0.7911999240554396

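Based on the classification levels in the source below, an AUC of about 0.79 falls within the "sufficient (fair)" range. Note that the evaluator above is given the hard 0/1 `prediction` column rather than `rawPrediction`, so this value reflects a single decision threshold rather than the full ROC curve.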

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2935260/#S9

# Execution of the same algorithm on other input data
Now the model must predict, without known labels, whether a given customer will abandon our services or not. For this purpose it is first refitted on all of the labeled data.

Data file: **new_customers.csv**

In [23]:
final_lr_model = lr_churn.fit(final_data)

In [24]:
new_customers = spark.read.csv('new_customers.csv',inferSchema=True,header=True)

In [25]:
new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: integer (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)



In [26]:
test_new_customers = assembler.transform(new_customers)

In [27]:
test_new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: integer (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- features: vector (nullable = true)



In [28]:
final_results = final_lr_model.transform(test_new_customers)

In [29]:
final_results.select('Company','prediction').orderBy("prediction", ascending=False).show()

+-------------+----------+
|      Company|prediction|
+-------------+----------+
|     Orviston|       1.0|
|     Westboro|       1.0|
|        Onton|       1.0|
|   Tecolotito|       1.0|
|      Hickory|       1.0|
|      Zortman|       1.0|
|       Harmon|       1.0|
|       Chapin|       1.0|
|       Colton|       1.0|
|        Nicut|       1.0|
|   Brambleton|       1.0|
|   Stagecoach|       1.0|
|        Yonah|       1.0|
|     Monument|       1.0|
|     Woodruff|       1.0|
|  Interlochen|       1.0|
|     Cazadero|       1.0|
|      Hilltop|       1.0|
| Bartonsville|       1.0|
|      Advance|       1.0|
+-------------+----------+
only showing top 20 rows



It follows from the above that we should assign advisors to the companies Orviston, Westboro, Onton, ...

In [30]:
final_results.select("prediction").where("prediction=0").describe().show()

+-------+----------+
|summary|prediction|
+-------+----------+
|  count|      8686|
|   mean|       0.0|
| stddev|       0.0|
|    min|       0.0|
|    max|       0.0|
+-------+----------+



In [31]:
final_results.select("prediction").where("prediction=1").describe().show()

+-------+----------+
|summary|prediction|
+-------+----------+
|  count|      1314|
|   mean|       1.0|
| stddev|       0.0|
|    min|       1.0|
|    max|       1.0|
+-------+----------+



The tables above show that the model predicts churn for 1314 of the 10000 new customers, so advisors should be assigned to those 1314 customers.
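
With more than a thousand flagged customers, the 0/1 prediction alone does not say whom to contact first. One possible refinement, sketched below in plain Python, is to rank customers by their predicted churn probability and assign the available advisors from the top. The probabilities, the advisor capacity, and the helper `top_k_by_risk` are all hypothetical; only the company names are borrowed from the output above.

```python
def top_k_by_risk(churn_probability, k):
    """Return the k company names with the highest churn probability."""
    ranked = sorted(churn_probability.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Made-up probabilities for a few companies from the table above.
churn_probability = {
    'Orviston': 0.91,
    'Westboro': 0.58,
    'Onton': 0.77,
    'Hickory': 0.64,
}

advisors_available = 2  # hypothetical capacity
priority = top_k_by_risk(churn_probability, advisors_available)
print(priority)
```

In the notebook itself, the corresponding information would come from the `probability` column of `final_results` rather than from a hand-written dictionary.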

# Close Spark session

In [32]:
spark.stop()