# Logistic Regression with Apache Spark
### gyleodhis@outlook.com
### [@gyleodhis](https://www.twitter.com/gyleodhis)
### ![@gyleodhis](./data/gyle.jpg)
#### Licence:
You can use this code for anything you wish; just leave this page intact:
#### AS IS; HOW IS, WHERE IS

In [1]:
from pyspark.sql import SparkSession
import pandas as pd
spark=SparkSession.builder.appName('Logistic Regression').getOrCreate()

### One major use of logistic regression is to predict whether a user will purchase a product or not.
We can look at it as estimating the chance that a desired event occurs, out of all possible outcomes.
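To make the idea of "chance of an event" concrete, here is a minimal plain-Python sketch of the logistic (sigmoid) function that logistic regression uses to turn a linear score into a probability. The scores below are hypothetical and are not taken from the model fitted later in this notebook.

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds of the event: probability it happens over probability it does not."""
    return p / (1.0 - p)

# Hypothetical linear scores, chosen only for illustration.
for z in (-2.0, 0.0, 2.0):
    p = sigmoid(z)
    print(f"score={z:+.1f}  probability={p:.3f}  odds={odds(p):.3f}")
```

A score of 0 corresponds to a probability of 0.5, and larger scores push the probability toward 1.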

In [2]:
# Here we will use some dummy web data.
web = spark.read.csv("./data/Log_Reg_dataset.csv", inferSchema=True, header=True)
web.columns # Let us check the names of the columns in the dataset.

['Country', 'Age', 'Repeat_Visitor', 'Platform', 'Web_pages_viewed', 'Status']

In [21]:
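# Prints (rows, columns). This cell was run as In [21], presumably after three derived
# columns had been added, which would explain 9 columns here versus the 6 listed above.
print((web.count(),len(web.columns)))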

(20000, 9)


### Column Datatype checking and casting
We now validate the datatypes of the input columns to check whether any of them need to be cast to a different type.

In [5]:
web.printSchema()

root
 |-- Country: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Repeat_Visitor: integer (nullable = true)
 |-- Platform: string (nullable = true)
 |-- Web_pages_viewed: integer (nullable = true)
 |-- Status: integer (nullable = true)



In [7]:
web.show()

+---------+---+--------------+--------+----------------+------+
|  Country|Age|Repeat_Visitor|Platform|Web_pages_viewed|Status|
+---------+---+--------------+--------+----------------+------+
|    India| 41|             1|   Yahoo|              21|     1|
|   Brazil| 28|             1|   Yahoo|               5|     0|
|   Brazil| 40|             0|  Google|               3|     0|
|Indonesia| 31|             1|    Bing|              15|     1|
| Malaysia| 32|             0|  Google|              15|     1|
|   Brazil| 32|             0|  Google|               3|     0|
|   Brazil| 32|             0|  Google|               6|     0|
|Indonesia| 27|             0|  Google|               9|     0|
|Indonesia| 32|             0|   Yahoo|               2|     0|
|Indonesia| 31|             1|    Bing|              16|     1|
| Malaysia| 27|             1|  Google|              21|     1|
|Indonesia| 29|             1|   Yahoo|               9|     1|
|Indonesia| 33|             1|   Yahoo| 

In [8]:
web.describe().show() # A brief statistical description of the columns.

+-------+--------+-----------------+-----------------+--------+-----------------+------------------+
|summary| Country|              Age|   Repeat_Visitor|Platform| Web_pages_viewed|            Status|
+-------+--------+-----------------+-----------------+--------+-----------------+------------------+
|  count|   20000|            20000|            20000|   20000|            20000|             20000|
|   mean|    null|         28.53955|           0.5029|    null|           9.5533|               0.5|
| stddev|    null|7.888912950773227|0.500004090187782|    null|6.073903499824976|0.5000125004687693|
|    min|  Brazil|               17|                0|    Bing|                1|                 0|
|    max|Malaysia|              111|                1|   Yahoo|               29|                 1|
+-------+--------+-----------------+-----------------+--------+-----------------+------------------+



We can observe that the average age of visitors is about 28.5 years, and they view between 9 and 10 web pages per visit on average. The maximum age of 111 looks like an outlier or a data-entry error.

In [9]:
web.groupBy('Age').count().orderBy('count', ascending=True).show() # Count visitors per age, least common ages first

+---+-----+
|Age|count|
+---+-----+
| 63|    1|
| 62|    1|
| 65|    1|
|111|    1|
| 59|    3|
| 56|    4|
| 60|    4|
| 58|    5|
| 61|    5|
| 57|    9|
| 53|   14|
| 55|   14|
| 54|   18|
| 52|   32|
| 51|   39|
| 50|   48|
| 49|   51|
| 48|   98|
| 47|  100|
| 46|  125|
+---+-----+
only showing top 20 rows



## Which country has the maximum number of visitors?

In [10]:
web.groupBy("Country").count().show()

+---------+-----+
|  Country|count|
+---------+-----+
| Malaysia| 1218|
|    India| 4018|
|Indonesia|12178|
|   Brazil| 2586|
+---------+-----+



#### We can see that most of our visitors come from Indonesia.

### Which search engine brings in more visitors?

In [11]:
web.groupBy("Platform").count().orderBy("count", ascending=True).show() # Clearly Yahoo leads the way.

+--------+-----+
|Platform|count|
+--------+-----+
|    Bing| 4360|
|  Google| 5781|
|   Yahoo| 9859|
+--------+-----+



In [12]:
web.groupBy("Status").mean().show()

+------+--------+-------------------+---------------------+-----------+
|Status|avg(Age)|avg(Repeat_Visitor)|avg(Web_pages_viewed)|avg(Status)|
+------+--------+-------------------+---------------------+-----------+
|     1| 26.5435|             0.7019|              14.5617|        1.0|
|     0| 30.5356|             0.3039|               4.5449|        0.0|
+------+--------+-------------------+---------------------+-----------+



#### From the above results it is very clear that the visitors who ended up making a purchase viewed many more pages (about 14.6 on average, versus 4.5).

### Now let us see how many visitors bought products and how many turned away.

In [13]:
web.groupBy("Status").count().show()

+------+-----+
|Status|count|
+------+-----+
|     1|10000|
|     0|10000|
+------+-----+



In [14]:
# Mean statistics per country.
web.groupBy("Country").mean().show()

+---------+------------------+-------------------+---------------------+--------------------+
|  Country|          avg(Age)|avg(Repeat_Visitor)|avg(Web_pages_viewed)|         avg(Status)|
+---------+------------------+-------------------+---------------------+--------------------+
| Malaysia|27.792282430213465| 0.5730706075533661|   11.192118226600986|  0.6568144499178982|
|    India|27.976854156296664| 0.5433051269288203|   10.727227476356397|  0.6212045793927327|
|Indonesia| 28.43159796354081| 0.5207751683363442|    9.985711939563148|  0.5422893742814913|
|   Brazil|30.274168600154677|  0.322892498066512|    4.921113689095128|0.038669760247486466|
+---------+------------------+-------------------+---------------------+--------------------+



##### The above table returns some very interesting statistics. Malaysia leads in repeat visitors as well as in customers who ended up buying, while Brazil is lowest on both.
##### The results also suggest that the higher the age, the lower the likelihood of purchase.

## Feature Engineering
Since we are dealing with two categorical columns, we will have to convert the country and search engine (Platform) columns into numerical form, because the machine learning model only accepts numeric input.

In [15]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler

The first step is to use StringIndexer to convert each categorical column into numerical form. It assigns a unique index to each category of the column, with the most frequent category getting 0.0. So the three values of the search engine column (Yahoo, Google, Bing) are assigned the values (0.0, 1.0, 2.0), which is visible in the column named Platform Number further down. We start with the Country column.
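As a rough illustration of frequency-based indexing, here is a plain-Python sketch. It is not Spark's implementation; the helper name `fit_string_index` is made up for this example, and the alphabetical tie-break is an assumption added to keep the result deterministic.

```python
from collections import Counter

def fit_string_index(values):
    """Return {category: index}, most frequent category first (index 0.0)."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically (an assumption of this sketch).
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: float(i) for i, category in enumerate(ordered)}

# Platform counts taken from the groupBy("Platform") output earlier in this notebook.
platforms = ["Yahoo"] * 9859 + ["Google"] * 5781 + ["Bing"] * 4360
platform_index = fit_string_index(platforms)
print(platform_index)  # {'Yahoo': 0.0, 'Google': 1.0, 'Bing': 2.0}
```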

In [16]:
country_indexer = StringIndexer(inputCol="Country", outputCol = "Country Number").fit(web)
web = country_indexer.transform(web)
web.show(3)

+-------+---+--------------+--------+----------------+------+--------------+
|Country|Age|Repeat_Visitor|Platform|Web_pages_viewed|Status|Country Number|
+-------+---+--------------+--------+----------------+------+--------------+
|  India| 41|             1|   Yahoo|              21|     1|           1.0|
| Brazil| 28|             1|   Yahoo|               5|     0|           2.0|
| Brazil| 40|             0|  Google|               3|     0|           2.0|
+-------+---+--------------+--------+----------------+------+--------------+
only showing top 3 rows



In [17]:
search_engine_indexer = StringIndexer(inputCol="Platform", outputCol="Platform Number").fit(web)
web = search_engine_indexer.transform(web)
web.show(3)

+-------+---+--------------+--------+----------------+------+--------------+---------------+
|Country|Age|Repeat_Visitor|Platform|Web_pages_viewed|Status|Country Number|Platform Number|
+-------+---+--------------+--------+----------------+------+--------------+---------------+
|  India| 41|             1|   Yahoo|              21|     1|           1.0|            0.0|
| Brazil| 28|             1|   Yahoo|               5|     0|           2.0|            0.0|
| Brazil| 40|             0|  Google|               3|     0|           2.0|            1.0|
+-------+---+--------------+--------+----------------+------+--------------+---------------+
only showing top 3 rows



The next step is to represent each of these index values as a one-hot encoded vector. Spark stores the result as a sparse vector, which records the size of the vector together with the positions and values of its non-zero entries.

In [18]:
from pyspark.ml.feature import OneHotEncoder
# Note: in Spark 3.0+ OneHotEncoder is an estimator, so call .fit(web) before .transform(web).
platform_encoder = OneHotEncoder(inputCol="Platform Number", outputCol="Platform Vector")
web = platform_encoder.transform(web)
web.show(5)

+---------+---+--------------+--------+----------------+------+--------------+---------------+---------------+
|  Country|Age|Repeat_Visitor|Platform|Web_pages_viewed|Status|Country Number|Platform Number|Platform Vector|
+---------+---+--------------+--------+----------------+------+--------------+---------------+---------------+
|    India| 41|             1|   Yahoo|              21|     1|           1.0|            0.0|  (2,[0],[1.0])|
|   Brazil| 28|             1|   Yahoo|               5|     0|           2.0|            0.0|  (2,[0],[1.0])|
|   Brazil| 40|             0|  Google|               3|     0|           2.0|            1.0|  (2,[1],[1.0])|
|Indonesia| 31|             1|    Bing|              15|     1|           0.0|            2.0|      (2,[],[])|
| Malaysia| 32|             0|  Google|              15|     1|           3.0|            1.0|  (2,[1],[1.0])|
+---------+---+--------------+--------+----------------+------+--------------+---------------+---------------+
only showing top 5 rows

In [19]:
web.groupBy("Platform Vector").count().orderBy("count", ascending=True).show()

+---------------+-----+
|Platform Vector|count|
+---------------+-----+
|      (2,[],[])| 4360|
|  (2,[1],[1.0])| 5781|
|  (2,[0],[1.0])| 9859|
+---------------+-----+



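### Meaning of the Vectors in the Platform Vector
(2,[0],[1.0]) represents a vector of length 2 with a single non-zero value:
    Size of vector – 2
    Position of the non-zero value – index 0
    Value at that position – 1.0

(2,[],[]) is a vector of length 2 with no non-zero values. The encoder drops the last category by default, so Bing (index 2.0) is represented by all zeros.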
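The sparse notation can be reproduced with a small plain-Python sketch. The helpers `one_hot_drop_last` and `to_dense` are hypothetical names for this illustration; they mimic the output shown above rather than Spark's internals.

```python
def one_hot_drop_last(index, num_categories):
    """Encode a category index as a sparse (size, indices, values) triple.

    The last category gets no slot of its own, so it is encoded as all zeros.
    """
    size = num_categories - 1
    i = int(index)
    if i < size:
        return (size, [i], [1.0])
    return (size, [], [])

def to_dense(sparse):
    """Expand a (size, indices, values) triple into a plain list."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

for name, idx in [("Yahoo", 0.0), ("Google", 1.0), ("Bing", 2.0)]:
    sparse = one_hot_drop_last(idx, num_categories=3)
    print(name, sparse, to_dense(sparse))
```

Dropping the last category loses no information: a row that is neither Yahoo nor Google must be Bing.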

### Let’s repeat the same procedure for the other categorical column (Country).

In [24]:
# Country was already indexed above (In [16]), so these lines stay commented out.
#country_indexer = StringIndexer(inputCol="Country",
#outputCol="Country Number").fit(web)
#web = country_indexer.transform(web)
web.groupBy('Country').count().orderBy('count',ascending=True).show(5,False)

+---------+-----+
|Country  |count|
+---------+-----+
|Malaysia |1218 |
|Brazil   |2586 |
|India    |4018 |
|Indonesia|12178|
+---------+-----+



In [26]:
web.groupBy('Country Number').count().orderBy('count', ascending=False).show()

+--------------+-----+
|Country Number|count|
+--------------+-----+
|           0.0|12178|
|           1.0| 4018|
|           2.0| 2586|
|           3.0| 1218|
+--------------+-----+



In [30]:
# Run these two lines once only; re-running them fails because "Country Vector" already exists.
country_encoder = OneHotEncoder(inputCol="Country Number", outputCol="Country Vector")
web = country_encoder.transform(web)
web.select(['Country','Country Number','Country Vector']).show()

+---------+--------------+--------------+
|  Country|Country Number|Country Vector|
+---------+--------------+--------------+
|    India|           1.0| (3,[1],[1.0])|
|   Brazil|           2.0| (3,[2],[1.0])|
|   Brazil|           2.0| (3,[2],[1.0])|
|Indonesia|           0.0| (3,[0],[1.0])|
| Malaysia|           3.0|     (3,[],[])|
|   Brazil|           2.0| (3,[2],[1.0])|
|   Brazil|           2.0| (3,[2],[1.0])|
|Indonesia|           0.0| (3,[0],[1.0])|
|Indonesia|           0.0| (3,[0],[1.0])|
|Indonesia|           0.0| (3,[0],[1.0])|
| Malaysia|           3.0|     (3,[],[])|
|Indonesia|           0.0| (3,[0],[1.0])|
|Indonesia|           0.0| (3,[0],[1.0])|
|Indonesia|           0.0| (3,[0],[1.0])|
|    India|           1.0| (3,[1],[1.0])|
|Indonesia|           0.0| (3,[0],[1.0])|
|Indonesia|           0.0| (3,[0],[1.0])|
|Indonesia|           0.0| (3,[0],[1.0])|
| Malaysia|           3.0|     (3,[],[])|
|Indonesia|           0.0| (3,[0],[1.0])|
+---------+--------------+--------------+

In [33]:
web.groupBy('Country Vector').count().orderBy('count',ascending=False).show(5)

+--------------+-----+
|Country Vector|count|
+--------------+-----+
| (3,[0],[1.0])|12178|
| (3,[1],[1.0])| 4018|
| (3,[2],[1.0])| 2586|
|     (3,[],[])| 1218|
+--------------+-----+



#### Now that we have converted both categorical columns into numerical form, we need to assemble all of the input columns into a single vector that acts as the input feature for the model.
So, we select the input columns to combine into a single feature vector and name the output vector features.

In [38]:
web_assembler = VectorAssembler(inputCols=['Platform Vector','Country Vector','Age',
                                           'Repeat_Visitor','Web_pages_viewed'],
                                outputCol="features")
web = web_assembler.transform(web)
web.printSchema()

root
 |-- Country: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Repeat_Visitor: integer (nullable = true)
 |-- Platform: string (nullable = true)
 |-- Web_pages_viewed: integer (nullable = true)
 |-- Status: integer (nullable = true)
 |-- Country Number: double (nullable = true)
 |-- Platform Number: double (nullable = true)
 |-- Platform Vector: vector (nullable = true)
 |-- Country Vector: vector (nullable = true)
 |-- features: vector (nullable = true)



In [40]:
web.select("features","Status").show(5, False)

+-----------------------------------+------+
|features                           |Status|
+-----------------------------------+------+
|[1.0,0.0,0.0,1.0,0.0,41.0,1.0,21.0]|1     |
|[1.0,0.0,0.0,0.0,1.0,28.0,1.0,5.0] |0     |
|(8,[1,4,5,7],[1.0,1.0,40.0,3.0])   |0     |
|(8,[2,5,6,7],[1.0,31.0,1.0,15.0])  |1     |
|(8,[1,5,7],[1.0,32.0,15.0])        |1     |
+-----------------------------------+------+
only showing top 5 rows
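The mix of dense rows like [1.0,0.0,...] and sparse rows like (8,[1,4,5,7],[...]) in this output can be puzzling. The plain-Python sketch below rebuilds the third row by hand, assuming the inputs are concatenated in the order given to inputCols. The helper names are hypothetical, and the remark that Spark appears to keep whichever form is more compact is an assumption based on the output, not something this notebook verifies.

```python
def to_dense(sparse):
    """Expand a (size, indices, values) triple into a plain list."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

def assemble(platform_vec, country_vec, age, repeat_visitor, pages_viewed):
    """Concatenate the inputs in the same order as inputCols above."""
    return (to_dense(platform_vec) + to_dense(country_vec)
            + [float(age), float(repeat_visitor), float(pages_viewed)])

def to_sparse(dense):
    """Keep only the non-zero entries as (size, indices, values)."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    return (len(dense), indices, [dense[i] for i in indices])

# Third row of the dataset: Brazil, age 40, not a repeat visitor, Google, 3 pages.
row = assemble((2, [1], [1.0]), (3, [2], [1.0]), 40, 0, 3)
print(row)             # [0.0, 1.0, 0.0, 0.0, 1.0, 40.0, 0.0, 3.0]
print(to_sparse(row))  # (8, [1, 4, 5, 7], [1.0, 1.0, 40.0, 3.0])
```

Positions 0 to 1 hold the platform, positions 2 to 4 the country, and positions 5 to 7 hold Age, Repeat_Visitor and Web_pages_viewed.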



## Building the Regression Model
Let us select only features column as input and the Status column as output for training the logistic regression model.

In [42]:
# Note: despite its name, `model` is a DataFrame of model inputs, not a fitted model.
model = web.select(['features','Status'])

### Splitting the Dataset
We have to split the dataset into training and test sets in order to train the model and evaluate its performance. Here we will use a 75%/25% ratio, meaning 75% of the data is used for training and 25% is held out for testing. The split is random, so the resulting counts are close to, but not exactly, 15,000 and 5,000.

In [56]:
training,testing =model.randomSplit([0.75,0.25])
print("Number of records in Training set:", training.count())
print("Number of records in Testing set:", testing.count())

Number of records in Training set: 14991
Number of records in Testing set: 5009
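Because the split is random, re-running the cell gives slightly different counts each time. PySpark's `randomSplit` accepts an optional `seed` argument for reproducibility. The plain-Python sketch below illustrates the idea of a seeded random split; the function `random_split` is a hypothetical stand-in, not Spark's algorithm.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by drawing one uniform number per row."""
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.75, 1.0] for weights [0.75, 0.25].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for k, b in enumerate(bounds):
            # The last split catches anything left over from rounding.
            if u < b or k == len(bounds) - 1:
                splits[k].append(row)
                break
    return splits

rows = list(range(20000))
train, test = random_split(rows, [0.75, 0.25], seed=42)
print(len(train), len(test))  # close to, but not exactly, 15000 and 5000
```

Calling the function again with the same seed returns exactly the same two lists.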


### Building The Logistic Regression Model
We will still use features as input and Status as the output column.

In [57]:
from pyspark.ml.classification import LogisticRegression
log_reg = LogisticRegression(labelCol = "Status").fit(training)


### Predictions on the Training Data
Here we use the evaluate method of the fitted model, which scores a dataset and returns a summary whose predictions attribute holds the results.
The prediction column holds the class label that the model has predicted for the given row, and the probability column contains two probabilities (the probability of the negative class at index 0 and the probability of the positive class at index 1).

In [59]:
train_results = log_reg.evaluate(training).predictions
train_results.filter(train_results['Status']==1).filter(train_results['prediction']==1).select(['Status','prediction','probability']).show(10,False)

+------+----------+----------------------------------------+
|Status|prediction|probability                             |
+------+----------+----------------------------------------+
|1     |1.0       |[0.29277010838362977,0.7072298916163702]|
|1     |1.0       |[0.29277010838362977,0.7072298916163702]|
|1     |1.0       |[0.29277010838362977,0.7072298916163702]|
|1     |1.0       |[0.16376903333782275,0.8362309666621772]|
|1     |1.0       |[0.16376903333782275,0.8362309666621772]|
|1     |1.0       |[0.16376903333782275,0.8362309666621772]|
|1     |1.0       |[0.16376903333782275,0.8362309666621772]|
|1     |1.0       |[0.08479376587117754,0.9152062341288225]|
|1     |1.0       |[0.08479376587117754,0.9152062341288225]|
|1     |1.0       |[0.08479376587117754,0.9152062341288225]|
+------+----------+----------------------------------------+
only showing top 10 rows
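The link between the probability and prediction columns can be sketched in plain Python. This assumes the default decision threshold of 0.5; the function name is hypothetical and the 0.8 threshold is only there to show the effect of a stricter cut-off.

```python
def predict_from_probability(probability, threshold=0.5):
    """Return 1.0 when the positive-class probability exceeds the threshold."""
    p_negative, p_positive = probability
    return 1.0 if p_positive > threshold else 0.0

# First probability vector shown in the table above.
prob = [0.29277010838362977, 0.7072298916163702]
print(sum(prob))                            # the two entries sum to roughly 1.0
print(predict_from_probability(prob))       # 1.0
print(predict_from_probability(prob, 0.8))  # 0.0 under the stricter threshold
```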



### Logistic Regression Model on Test Data
It is time to check the performance of this model on unseen test data. We again use the evaluate function, this time to make predictions on the test set. We assign the predictions DataFrame to results, which now contains five columns.

In [60]:
results = log_reg.evaluate(testing).predictions
results.printSchema()

root
 |-- features: vector (nullable = true)
 |-- Status: integer (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)



In [61]:
# Selecting only the Status and prediction columns.
results.select(["Status","prediction"]).show(10)

+------+----------+
|Status|prediction|
+------+----------+
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
+------+----------+
only showing top 10 rows



### Confusion Matrix
Since this is a classification problem, we will use a confusion matrix to gauge the performance of the model.
We will manually compute the true positives, true negatives, false positives, and false negatives, rather than using a built-in function, to understand them better.

In [66]:
tp = results[(results.Status == 1) & (results.prediction == 1)].count() # True positive
tn = results[(results.Status == 0) & (results.prediction == 0)].count() # True Negative
fp = results[(results.Status == 0) & (results.prediction == 1)].count() # False Positive
fn = results[(results.Status == 1) & (results.prediction == 0)].count() # False Negative
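The four counts above each scan the results separately. As a sketch of the same bookkeeping done in one pass, here is a plain-Python version over a list of (Status, prediction) pairs. The pairs are hypothetical and the helper `confusion_counts` is not part of Spark.

```python
from collections import Counter

def confusion_counts(pairs):
    """Count (label, prediction) pairs and return (tp, tn, fp, fn)."""
    c = Counter((int(label), int(pred)) for label, pred in pairs)
    return c[(1, 1)], c[(0, 0)], c[(0, 1)], c[(1, 0)]

# Hypothetical (Status, prediction) pairs, not the notebook's actual results.
pairs = [(1, 1.0), (1, 1.0), (0, 0.0), (0, 1.0), (1, 0.0), (0, 0.0)]
counts = confusion_counts(pairs)
print(counts)  # (2, 2, 1, 1)
```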

### Model Accuracy Evaluation
Accuracy is the most basic metric for evaluating any classifier; however, it can be misleading on its own because it depends on the balance of the target classes. Our classes are evenly split (10,000 each), so we will still use it here.
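To see why class balance matters, consider a small sketch with hypothetical, heavily imbalanced labels. A "model" that always predicts 0 scores high on accuracy while finding none of the buyers.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the prediction matches the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def recall(labels, predictions):
    """Fraction of actual positives that were predicted positive."""
    positives = sum(1 for y in labels if y == 1)
    hits = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    return hits / positives if positives else 0.0

# Hypothetical imbalanced data: 95 non-buyers and 5 buyers.
labels = [0] * 95 + [1] * 5
always_zero = [0] * 100

print(accuracy(labels, always_zero))  # 0.95
print(recall(labels, always_zero))    # 0.0
```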

In [68]:
accuracy=float((tp+tn) /(results.count()))
print("The accuracy of the model is:", accuracy)

The accuracy of the model is: 0.9431024156518267


### WOOOW !!! Our model has achieved 94% accuracy. Incredible. 

### Recall Rate:
This shows how much of the positive class cases we are able to predict correctly out of the total positive class observations.

In [72]:
recall = float(tp)/(tp + fn)
print("Our model has a recall rate of",recall * 100,"percent")

Our model has a recall rate of 94.0935192780968 percent


### Precision Rate
This is the number of true positives out of all the observations predicted as positive.

In [73]:
precision = float(tp) / (tp +fp)
print("Our precision rate is:", precision * 100, "percent")

Our precision rate is: 94.2094455852156 percent


Our recall rate and precision rate are in the same range as the accuracy, which is what we would expect given that our target class was well balanced.
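The three metrics can be gathered into one helper that also guards against division by zero. This is a minimal sketch; the function name is hypothetical, the counts are made up for illustration, and the F1 score (the harmonic mean of precision and recall) is an extra that the notebook itself does not compute.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts; substitute the tp, tn, fp, fn computed earlier in the notebook.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```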

#### In this code I went through what I consider the building blocks of a logistic regression model.
#### I really hope it helps.
#### Bye.