# PySpark ML - Classification

Now that you are familiar with getting data into Spark, you'll move on to building two types of classification models: Decision Trees and Logistic Regression. You'll also learn a few approaches to data preparation.

## Preparing the environment

### Importing libraries

In [1]:
from environment import SEED
from pyspark.sql.types import (StructType, StructField,
                               DoubleType, IntegerType, StringType)
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import (StringIndexer, VectorAssembler, Tokenizer,
                                StopWordsRemover, HashingTF, IDF)
from pyspark.ml.classification import DecisionTreeClassifier, LogisticRegression
from pyspark.ml.evaluation import (MulticlassClassificationEvaluator, 
                                   BinaryClassificationEvaluator)

### Connect to Spark

In [2]:
spark = (SparkSession.builder
                     .master('local[*]')
                     .appName('spark_application')
                     .config("spark.sql.repl.eagerEval.enabled", True)  # eval DataFrame in notebooks
                     .getOrCreate())

sc = spark.sparkContext
print(f'Spark version: {spark.version}')

Spark version: 3.5.1


### Loading data

In [3]:
schema_flights = StructType([
    StructField("mon", IntegerType()),
    StructField("dom", IntegerType()),
    StructField("dow", IntegerType()),
    StructField("carrier", StringType()),
    StructField("flight", IntegerType()),
    StructField("org", StringType()),
    StructField("mile", IntegerType()),
    StructField("depart", DoubleType()),
    StructField("duration", IntegerType()),
    StructField("delay", IntegerType())
])

flights_data = spark.read.csv('data-sources/flights.csv', header=True, schema=schema_flights, nullValue='NA')
flights_data.createOrReplaceTempView("flights")
flights_data.printSchema()
flights_data.limit(2)

root
 |-- mon: integer (nullable = true)
 |-- dom: integer (nullable = true)
 |-- dow: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- org: string (nullable = true)
 |-- mile: integer (nullable = true)
 |-- depart: double (nullable = true)
 |-- duration: integer (nullable = true)
 |-- delay: integer (nullable = true)



mon,dom,dow,carrier,flight,org,mile,depart,duration,delay
11,20,6,US,19,JFK,2153,9.48,351,
0,22,2,UA,1107,ORD,316,16.33,82,30.0


In [4]:
schema_sms = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType()),
    StructField("label", IntegerType())
])

sms_data = spark.read.csv("data-sources/sms.csv", sep=';', header=False, schema=schema_sms)
sms_data.createOrReplaceTempView("sms")
sms_data.printSchema()
sms_data.limit(2)

root
 |-- id: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- label: integer (nullable = true)



id,text,label
1,"Sorry, I'll call ...",0
2,Dont worry. I gue...,0


In [5]:
schema_cars = StructType([
    StructField("maker", StringType()),
    StructField("model", StringType()),
    StructField("origin", StringType()),
    StructField("type", StringType()),
    StructField("cyl", IntegerType()),
    StructField("size", DoubleType()),
    StructField("weight", IntegerType()),
    StructField("length", DoubleType()),
    StructField("rpm", IntegerType()),
    StructField("consumption", DoubleType())
])

cars_data = spark.read.csv('data-sources/cars.csv', header=True, schema=schema_cars, nullValue='NA')
cars_data.createOrReplaceTempView("cars")
cars_data.printSchema()
cars_data.limit(2)

root
 |-- maker: string (nullable = true)
 |-- model: string (nullable = true)
 |-- origin: string (nullable = true)
 |-- type: string (nullable = true)
 |-- cyl: integer (nullable = true)
 |-- size: double (nullable = true)
 |-- weight: integer (nullable = true)
 |-- length: double (nullable = true)
 |-- rpm: integer (nullable = true)
 |-- consumption: double (nullable = true)



maker,model,origin,type,cyl,size,weight,length,rpm,consumption
Mazda,RX-7,non-USA,Sporty,,1.3,2895,169.0,6500,4.0
Geo,Metro,non-USA,Small,3.0,1.0,1695,151.0,5700,2.0


In [6]:
schema_books = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType())
])

books_data = spark.read.csv("data-sources/books.csv", sep=';', header=True, schema=schema_books)
books_data.createOrReplaceTempView("books")
books_data.printSchema()
books_data.limit(2)

root
 |-- id: integer (nullable = true)
 |-- text: string (nullable = true)



id,text
0,"Forever, or a Lon..."
1,Winnie-the-Pooh


### Tables catalogue

In [7]:
spark.catalog.listTables()

[Table(name='books', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='cars', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='flights', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True),
 Table(name='sms', catalog=None, namespace=[], description=None, tableType='TEMPORARY', isTemporary=True)]

# Cars Dataset

## Data Preparation

### Dropping columns

In [8]:
# Either drop the columns you don't want...
df_cars = cars_data.drop('maker', 'model')

# ... or select the columns you want to retain.
df_cars = cars_data.select('origin', 'type', 'cyl', 'size', 'weight', 'length', 'rpm', 'consumption')
df_cars.limit(2)

origin,type,cyl,size,weight,length,rpm,consumption
non-USA,Sporty,,1.3,2895,169.0,6500,4.0
non-USA,Small,3.0,1.0,1695,151.0,5700,2.0


### Filtering out missing data

In [9]:
# How many missing values? Count nulls per column and keep only the non-zero counts.
null_counts = {col: df_cars.filter(df_cars[f"`{col}`"].isNull()).count() for col in df_cars.columns}
null_counts = {k: v for k, v in null_counts.items() if v != 0}
print(f'''
Missing values in cyl: {df_cars.filter('cyl IS NULL').count()}
Total nulls found in cars data: {null_counts}
''')

# Drop records with missing values in the cylinders column.
df_cars = df_cars.filter('cyl IS NOT NULL')

# Drop records with missing values in any column.
df_cars = df_cars.dropna()
df_cars.limit(2)


Missing values in cyl: 1
Total nulls found in cars data: {'cyl': 1}



origin,type,cyl,size,weight,length,rpm,consumption
non-USA,Small,3,1.0,1695,151.0,5700,2.0
non-USA,Small,4,1.5,2350,173.0,5900,2.17


### Mutating columns

In [10]:
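# Create a new 'mass' column (kg), assuming 'weight' is in pounds (1 kg = 2.205 lb)
df_cars = df_cars.withColumn('mass', F.round(df_cars.weight / 2.205, 0))

# Convert length to metres, assuming it is given in inches (1 in = 0.0254 m)
df_cars = df_cars.withColumn('length', F.round(df_cars.length * 0.0254, 3))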
df_cars.show(2)

+-------+-----+---+----+------+------+----+-----------+------+
| origin| type|cyl|size|weight|length| rpm|consumption|  mass|
+-------+-----+---+----+------+------+----+-----------+------+
|non-USA|Small|  3| 1.0|  1695| 3.835|5700|        2.0| 769.0|
|non-USA|Small|  4| 1.5|  2350| 4.394|5900|       2.17|1066.0|
+-------+-----+---+----+------+------+----+-----------+------+
only showing top 2 rows



### Indexing categorical data

In [11]:
# Assign index values to strings
indexer_cars = StringIndexer(inputCol='type', outputCol='type_idx')
indexer_cars = indexer_cars.fit(df_cars)

# Create column with index values
df_cars = indexer_cars.transform(df_cars)
df_cars.show(2)
df_cars.select('type', 'type_idx').distinct().orderBy('type_idx').show()

+-------+-----+---+----+------+------+----+-----------+------+--------+
| origin| type|cyl|size|weight|length| rpm|consumption|  mass|type_idx|
+-------+-----+---+----+------+------+----+-----------+------+--------+
|non-USA|Small|  3| 1.0|  1695| 3.835|5700|        2.0| 769.0|     1.0|
|non-USA|Small|  4| 1.5|  2350| 4.394|5900|       2.17|1066.0|     1.0|
+-------+-----+---+----+------+------+----+-----------+------+--------+
only showing top 2 rows

+-------+--------+
|   type|type_idx|
+-------+--------+
|Midsize|     0.0|
|  Small|     1.0|
|Compact|     2.0|
| Sporty|     3.0|
|  Large|     4.0|
|    Van|     5.0|
+-------+--------+



In [12]:
df_cars = StringIndexer(inputCol="origin", outputCol="label").fit(df_cars).transform(df_cars)
df_cars.show(2)
df_cars.select('origin', 'label').distinct().orderBy('label').show()

+-------+-----+---+----+------+------+----+-----------+------+--------+-----+
| origin| type|cyl|size|weight|length| rpm|consumption|  mass|type_idx|label|
+-------+-----+---+----+------+------+----+-----------+------+--------+-----+
|non-USA|Small|  3| 1.0|  1695| 3.835|5700|        2.0| 769.0|     1.0|  1.0|
|non-USA|Small|  4| 1.5|  2350| 4.394|5900|       2.17|1066.0|     1.0|  1.0|
+-------+-----+---+----+------+------+----+-----------+------+--------+-----+
only showing top 2 rows

+-------+-----+
| origin|label|
+-------+-----+
|    USA|  0.0|
|non-USA|  1.0|
+-------+-----+



### Assembling columns

In [13]:
assembler_cars = VectorAssembler(inputCols=['cyl', 'size'], outputCol='features')
df_cars = assembler_cars.transform(df_cars)
df_cars.select('cyl', 'size', 'features').show(5)

+---+----+---------+
|cyl|size| features|
+---+----+---------+
|  3| 1.0|[3.0,1.0]|
|  4| 1.5|[4.0,1.5]|
|  3| 1.3|[3.0,1.3]|
|  4| 1.6|[4.0,1.6]|
|  4| 1.9|[4.0,1.9]|
+---+----+---------+
only showing top 5 rows



## Decision Tree

### Split train/test

In [14]:
# Specify a seed for reproducibility
df_cars_train, df_cars_test = df_cars.randomSplit([0.8, 0.2], seed=SEED)

# Two DataFrames: df_cars_train and df_cars_test.
[df_cars_train.count(), df_cars_test.count()]

[75, 17]

### Build a Decision Tree model

In [15]:
# Create a Decision Tree classifier.
tree_model_cars = DecisionTreeClassifier(featuresCol='features', labelCol='label',
                                         predictionCol='pred', seed=SEED)

# Learn from the training data.
tree_model_cars = tree_model_cars.fit(df_cars_train)
tree_model_cars

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_80f16da06485, depth=5, numNodes=21, numClasses=2, numFeatures=2

### Predicting

In [16]:
df_cars_pred_tree = tree_model_cars.transform(df_cars_test)
df_cars_pred_tree.select('label', 'pred', 'rawPrediction', 'probability').show(12)

+-----+----+-------------+--------------------+
|label|pred|rawPrediction|         probability|
+-----+----+-------------+--------------------+
|  0.0| 0.0|   [10.0,8.0]|[0.55555555555555...|
|  0.0| 0.0|    [2.0,0.0]|           [1.0,0.0]|
|  0.0| 0.0|   [10.0,0.0]|           [1.0,0.0]|
|  0.0| 0.0|   [10.0,0.0]|           [1.0,0.0]|
|  0.0| 0.0|   [10.0,8.0]|[0.55555555555555...|
|  0.0| 0.0|   [10.0,0.0]|           [1.0,0.0]|
|  0.0| 1.0|   [4.0,12.0]|         [0.25,0.75]|
|  0.0| 1.0|   [4.0,12.0]|         [0.25,0.75]|
|  0.0| 0.0|   [10.0,0.0]|           [1.0,0.0]|
|  0.0| 0.0|   [10.0,0.0]|           [1.0,0.0]|
|  0.0| 0.0|   [10.0,0.0]|           [1.0,0.0]|
|  1.0| 0.0|   [10.0,8.0]|[0.55555555555555...|
+-----+----+-------------+--------------------+
only showing top 12 rows
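
The `rawPrediction` and `probability` columns above look related: each probability vector is the raw vector divided by its sum. A plausible reading is that `rawPrediction` holds the per-class record counts at the leaf a row lands in. The sketch below reproduces that normalisation in plain Python under this assumption. The helper names are illustrative and are not part of PySpark.

```python
def leaf_probabilities(raw_counts):
    """Normalise per-class counts at a leaf into class probabilities."""
    total = sum(raw_counts)
    return [count / total for count in raw_counts]


def predict_class(raw_counts):
    """Return the index of the class with the largest count, as a float."""
    return float(max(range(len(raw_counts)), key=lambda i: raw_counts[i]))


# Raw vectors taken from the prediction table above.
print(leaf_probabilities([10.0, 8.0]))   # first row of the table
print(leaf_probabilities([4.0, 12.0]))   # a row predicted as class 1
print(predict_class([4.0, 12.0]))
```

Because every test record falling into the same leaf shares the same counts, only a handful of distinct probability vectors appear in the output.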



### Confusion matrix

In [17]:
# Labels for confusion matrix
df_cars_cm_tree = df_cars_pred_tree.groupBy("label", "pred").count().toPandas().sort_values(["pred", "label"])
df_cars_cm_tree.index = ['True negative (TN)', 'False negative (FN)',
                         'False positive (FP)', 'True positive (TP)']
TN, FN, FP, TP = df_cars_cm_tree['count'].to_list()
accuracy_cars = (TN + TP) / (TN + TP + FN + FP)
print(f'Accuracy: {accuracy_cars}')
df_cars_cm_tree

Accuracy: 0.7058823529411765


Unnamed: 0,label,pred,count
True negative (TN),0.0,0.0,9
False negative (FN),1.0,0.0,3
False positive (FP),0.0,1.0,2
True positive (TP),1.0,1.0,3


## Logistic Regression

### Build a Logistic Regression model

In [18]:
# Create a Logistic Regression classifier.
lr_model_cars = LogisticRegression(featuresCol='features', labelCol='label', predictionCol='pred')

# Learn from the training data.
lr_model_cars = lr_model_cars.fit(df_cars_train)
lr_model_cars

LogisticRegressionModel: uid=LogisticRegression_e54b689aee4c, numClasses=2, numFeatures=2

### Predictions

In [19]:
df_cars_pred_lr = lr_model_cars.transform(df_cars_test)
df_cars_pred_lr.select('label', 'pred', 'rawPrediction', 'probability').show(12, truncate=False)

+-----+----+------------------------------------------+-----------------------------------------+
|label|pred|rawPrediction                             |probability                              |
+-----+----+------------------------------------------+-----------------------------------------+
|0.0  |1.0 |[-0.10779485172257308,0.10779485172257308]|[0.4730773515143992,0.5269226484856008]  |
|0.0  |0.0 |[1.3298097249608083,-1.3298097249608083]  |[0.7908091593227016,0.20919084067729843] |
|0.0  |0.0 |[0.4608954276344823,-0.4608954276344823]  |[0.6132265749554279,0.38677342504457213] |
|0.0  |0.0 |[4.773709157684628,-4.773709157684628]    |[0.9916218038829928,0.008378196117007186]|
|0.0  |1.0 |[-0.10779485172257308,0.10779485172257308]|[0.4730773515143992,0.5269226484856008]  |
|0.0  |0.0 |[0.6405959997199053,-0.6405959997199053]  |[0.6548881747283181,0.34511182527168194] |
|0.0  |1.0 |[-1.3656988563205323,1.3656988563205323]  |[0.2033156498119474,0.7966843501880526]  |
|0.0  |1.0 |[-1.1859
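
The rows above are consistent with the usual binary Logistic Regression relationship: `rawPrediction` is `[-margin, margin]` and the probability of class 1 is the logistic (sigmoid) function of the margin. The following plain-Python sketch checks this against the first two rows of the table. The function names are illustrative, not PySpark API.

```python
import math


def sigmoid(margin):
    """Logistic function mapping a real-valued margin to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def probabilities_from_margin(margin):
    """Return [P(class 0), P(class 1)] for a binary model."""
    p1 = sigmoid(margin)
    return [1.0 - p1, p1]


# Second element of rawPrediction in the first two rows of the table above.
first_row = probabilities_from_margin(0.10779485172257308)
second_row = probabilities_from_margin(-1.3298097249608083)
print(first_row)
print(second_row)
```

A positive margin gives a class 1 probability above 0.5, which matches `pred = 1.0` in the first row.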

### Precision and recall

In [20]:
# Labels for confusion matrix
df_cars_cm_lr = df_cars_pred_lr.groupBy("label", "pred").count().toPandas().sort_values(["pred", "label"])
df_cars_cm_lr.index = ['True negative (TN)', 'False negative (FN)',
                       'False positive (FP)', 'True positive (TP)']

TN, FN, FP, TP = df_cars_cm_lr['count'].to_list()
accuracy_cars = (TN + TP) / (TN + TP + FN + FP)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print(f'''
Accuracy : {accuracy_cars}
Precision: {precision}
Recall   : {recall}
''')
df_cars_cm_lr


Accuracy : 0.7058823529411765
Precision: 0.5555555555555556
Recall   : 0.8333333333333334



Unnamed: 0,label,pred,count
True negative (TN),0.0,0.0,7
False negative (FN),1.0,0.0,1
False positive (FP),0.0,1.0,4
True positive (TP),1.0,1.0,5


### Weighted metrics

In [21]:
evaluator_cars_lr = MulticlassClassificationEvaluator(labelCol='label', 
                                                      predictionCol='pred')
weightedPrecision = evaluator_cars_lr.evaluate(df_cars_pred_lr, 
                                               {evaluator_cars_lr.metricName: 'weightedPrecision'})
weightedRecall = evaluator_cars_lr.evaluate(df_cars_pred_lr, 
                                            {evaluator_cars_lr.metricName: 'weightedRecall'})
accuracy = evaluator_cars_lr.evaluate(df_cars_pred_lr, {evaluator_cars_lr.metricName: 'accuracy'})
f1 = evaluator_cars_lr.evaluate(df_cars_pred_lr, {evaluator_cars_lr.metricName: 'f1'})

print(f'''
weightedPrecision: {weightedPrecision}
   weightedRecall: {weightedRecall}
         accuracy: {accuracy}
               f1: {f1}
''')


weightedPrecision: 0.7622549019607843
   weightedRecall: 0.7058823529411764
         accuracy: 0.7058823529411765
               f1: 0.7120743034055728



# Flights Dataset

## Ex. 1 - Removing columns and rows

You previously loaded airline flight data from a CSV file. You're going to develop a model which will predict whether or not a given flight will be delayed.

In this exercise you need to trim those data down by:
- removing an uninformative column and
- removing rows which do not have information about whether or not a flight was delayed.

**Instructions:**

1. Remove the `flight` column.
2. Find out how many records have missing values in the delay column.
3. Remove records with missing values in the delay column.
4. Remove records with missing values in any column and get the number of remaining rows.

In [22]:
# Loading the data
df_flights = flights_data.select('*')

# Remove the 'flight' column
df_flights = df_flights.drop('flight')

# Number of records with missing 'delay' values
missing_delay = df_flights.filter('delay IS NULL').count()
print('Missing delay row count:', missing_delay)

# Remove records with missing 'delay' values
df_flights = df_flights.filter('delay IS NOT NULL')

# Remove records with missing values in any column and get the number of remaining rows
df_flights = df_flights.dropna()
print(df_flights.count())

Missing delay row count: 2978
47022


## Ex. 2 - Column manipulation
The Federal Aviation Administration (FAA) considers a flight to be "delayed" when it arrives 15 minutes or more after its scheduled time.

The next step of preparing the flight data has two parts:
- convert the units of distance, replacing the `mile` column with a `km` column; and
- create a Boolean column indicating whether or not a flight was delayed.

**Instructions:**

1. Import a function which will allow you to `round` a number to a specific number of decimal places.
2. Derive a new `km` column from the `mile` column, rounding to zero decimal places. One mile is `1.60934` km.
3. Remove the `mile` column.
4. Create a `label` column with a value of `1` indicating the `delay` was `15` minutes or more and `0` otherwise. Think carefully about the logical condition.
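
Before writing the Spark version, it can help to pin down the two rules on single values. This is a plain-Python sketch; the constant and function names are illustrative. Note that Python's `round` resolves exact `.5` ties to the nearest even number, so results could differ from a Spark rounding function on tie values only.

```python
KM_PER_MILE = 1.60934
DELAY_THRESHOLD_MIN = 15


def miles_to_km(miles):
    """Convert miles to km, rounded to zero decimal places."""
    return float(round(miles * KM_PER_MILE))


def delay_label(delay_minutes):
    """Return 1 if the delay is 15 minutes or more, otherwise 0."""
    return int(delay_minutes >= DELAY_THRESHOLD_MIN)


print(miles_to_km(316))   # distance from the second sample row of flights.csv
print(delay_label(15))    # exactly on the threshold counts as delayed
print(delay_label(14))
print(delay_label(-8))    # early arrivals are not delayed
```

The condition is `>=` rather than `>`, because a delay of exactly 15 minutes already counts as delayed.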

In [23]:
# Convert 'mile' to 'km' and drop 'mile' column (1 mile is equivalent to 1.60934 km)
df_flights = df_flights.withColumn('km', F.round(df_flights['mile'] * 1.60934, 0)) \
                       .drop('mile')

# Create 'label' column indicating whether flight delayed (1) or not (0)
df_flights = df_flights.withColumn('label', (df_flights['delay']>=15).cast('integer'))

# Check first five records
df_flights.show(5)

+---+---+---+-------+---+------+--------+-----+------+-----+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|
+---+---+---+-------+---+------+--------+-----+------+-----+
|  0| 22|  2|     UA|ORD| 16.33|      82|   30| 509.0|    1|
|  2| 20|  4|     UA|SFO|  6.17|      82|   -8| 542.0|    0|
|  9| 13|  1|     AA|ORD| 10.33|     195|   -5|1989.0|    0|
|  5|  2|  1|     UA|SFO|  7.98|     102|    2| 885.0|    0|
|  7|  2|  6|     AA|ORD| 10.83|     135|   54|1180.0|    1|
+---+---+---+-------+---+------+--------+-----+------+-----+
only showing top 5 rows



## Ex. 3 - Categorical columns

In the flights data there are two columns, `carrier` and `org`, which hold categorical data. You need to transform those columns into indexed numerical values.

**Instructions:**

1. Import the appropriate class and create an indexer object to transform the `carrier` column from a string to a numeric index.
2. Prepare the indexer object on the flights data.
3. Use the prepared indexer to create the numeric index column.
4. Repeat the process for the `org` column.
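
By default, `StringIndexer` orders categories by descending frequency, so the most common value receives index `0.0`. The plain-Python sketch below mimics that ordering on a small hypothetical list of carriers. Breaking ties alphabetically is an assumption made here for determinism, and `fit_string_index` is an illustrative helper, not a PySpark function.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: float(index) for index, value in enumerate(ordered)}


carriers = ['UA', 'AA', 'UA', 'OO', 'UA', 'AA']  # hypothetical sample
mapping = fit_string_index(carriers)
print(mapping)
print([mapping[c] for c in carriers])
```

The indices carry no numeric meaning beyond identifying a category, which is worth remembering when they are fed to a model that treats features as ordered numbers.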

In [24]:
# Create an indexer
indexer_flights = StringIndexer(inputCol='carrier', outputCol='carrier_idx')

# Indexer identifies categories in the data
indexer_flights = indexer_flights.fit(df_flights)

# Indexer creates a new column with numeric index values
df_flights = indexer_flights.transform(df_flights)

# Repeat the process for the other categorical feature
df_flights = StringIndexer(inputCol='org', outputCol='org_idx').fit(df_flights).transform(df_flights)
df_flights.show(5)

+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|carrier_idx|org_idx|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
|  0| 22|  2|     UA|ORD| 16.33|      82|   30| 509.0|    1|        0.0|    0.0|
|  2| 20|  4|     UA|SFO|  6.17|      82|   -8| 542.0|    0|        0.0|    1.0|
|  9| 13|  1|     AA|ORD| 10.33|     195|   -5|1989.0|    0|        1.0|    0.0|
|  5|  2|  1|     UA|SFO|  7.98|     102|    2| 885.0|    0|        0.0|    1.0|
|  7|  2|  6|     AA|ORD| 10.83|     135|   54|1180.0|    1|        1.0|    0.0|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
only showing top 5 rows



## Ex. 4 - Assembling columns

The final stage of data preparation is to consolidate all of the predictor columns into a single column.

We are going to use the following predictor columns:
- mon, dom and dow
- carrier_idx (indexed value from carrier)
- org_idx (indexed value from org)
- km
- depart
- duration

**Instructions:**

1. Import the class which will assemble the predictors.
2. Create an assembler object that will allow you to merge the predictor columns into a single column.
3. Use the assembler to generate a new consolidated column.

In [25]:
# Create an assembler object
feature_cols_flights = ['mon', 'dom', 'dow', 'carrier_idx', 'org_idx', 
                        'km', 'depart', 'duration']
assembler_flights = VectorAssembler(inputCols=feature_cols_flights, outputCol='features')

# Consolidate predictor columns
df_flights = assembler_flights.transform(df_flights)

# Check the resulting column
df_flights.select('features', 'delay', 'label').show(5, truncate=False)

+-----------------------------------------+-----+-----+
|features                                 |delay|label|
+-----------------------------------------+-----+-----+
|[0.0,22.0,2.0,0.0,0.0,509.0,16.33,82.0]  |30   |1    |
|[2.0,20.0,4.0,0.0,1.0,542.0,6.17,82.0]   |-8   |0    |
|[9.0,13.0,1.0,1.0,0.0,1989.0,10.33,195.0]|-5   |0    |
|[5.0,2.0,1.0,0.0,1.0,885.0,7.98,102.0]   |2    |0    |
|[7.0,2.0,6.0,1.0,0.0,1180.0,10.83,135.0] |54   |1    |
+-----------------------------------------+-----+-----+
only showing top 5 rows



## Ex. 5 - Train/test split

To objectively assess a Machine Learning model you need to be able to test it on an independent set of data. You can't use the same data that you used to train the model: of course the model will perform (relatively) well on those data!

You will split the data into two components:
- training data (used to train the model) and
- testing data (used to test the model).
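
A random split assigns each record to a set by chance, so the resulting proportions are only approximately equal to the requested weights. The plain-Python sketch below illustrates the idea with a seeded generator. It is not Spark's sampling algorithm, and `random_split` is an illustrative helper.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split at random, in proportion to weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on the interval [0, 1).
    bounds = []
    cumulative = 0.0
    for weight in weights:
        cumulative += weight / total
        bounds.append(cumulative)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            # The last split also catches any floating-point shortfall.
            if draw < bound or index == len(bounds) - 1:
                splits[index].append(row)
                break
    return splits


train, test = random_split(range(10_000), [0.8, 0.2], seed=42)
print(len(train) / 10_000)
```

Fixing the seed makes the split repeatable, which is why the exercises pass `SEED` everywhere a random choice is involved.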

**Instructions:**

1. Randomly split the flights data into two sets with `80:20` proportions. For repeatability set a random number seed of `SEED` for the split.
2. Check that the training data has roughly `80%` of the records from the original data.

In [26]:
# Split into training and testing sets in an 80:20 ratio
df_flights_train, df_flights_test = df_flights.randomSplit([0.8, 0.2], seed=SEED)

# Check that training set has around 80% of records
df_flights_training_ratio = df_flights_train.count() / df_flights.count()
print('Proportion of the training set:', df_flights_training_ratio)
df_flights_train.show(3)

Proportion of the training set: 0.7996469737569648
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+--------------------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|carrier_idx|org_idx|            features|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+--------------------+
|  0|  1|  2|     AA|JFK|  6.58|     230|   50|2570.0|    1|        1.0|    2.0|[0.0,1.0,2.0,1.0,...|
|  0|  1|  2|     AA|JFK|   7.0|     385|  -16|4162.0|    0|        1.0|    2.0|[0.0,1.0,2.0,1.0,...|
|  0|  1|  2|     AA|JFK|  17.0|     379|  -10|3983.0|    0|        1.0|    2.0|[0.0,1.0,2.0,1.0,...|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+--------------------+
only showing top 3 rows



## Ex. 6 - Build a Decision Tree

Now that you've split the flights data into training and testing sets, you can use the training set to fit a Decision Tree model.

**Instructions:**

1. Import the class for creating a Decision Tree classifier.
2. Create a classifier object and fit it to the training data.
3. Make predictions for the testing data and take a look at the predictions.

In [27]:
# Create a classifier object and fit to the training data
tree_model_flights = DecisionTreeClassifier(featuresCol='features', 
                                            labelCol='label',
                                            predictionCol='pred', 
                                            seed=SEED)
tree_model_flights = tree_model_flights.fit(df_flights_train)

# Create predictions for the testing data and take a look at the predictions
df_flights_pred_tree = tree_model_flights.transform(df_flights_test)
df_flights_pred_tree.select('label', 'pred', 'probability').show(5, False)

+-----+----+---------------------------------------+
|label|pred|probability                            |
+-----+----+---------------------------------------+
|0    |1.0 |[0.3568441766080433,0.6431558233919568]|
|0    |1.0 |[0.3568441766080433,0.6431558233919568]|
|1    |1.0 |[0.3568441766080433,0.6431558233919568]|
|1    |0.0 |[0.5779060181368508,0.4220939818631492]|
|1    |1.0 |[0.3568441766080433,0.6431558233919568]|
+-----+----+---------------------------------------+
only showing top 5 rows



## Ex. 7 - Evaluate the Decision Tree

You can assess the quality of your model by evaluating how well it performs on the testing data. Because the model was not trained on these data, this represents an objective assessment of the model.

A confusion matrix gives a useful breakdown of predictions versus known values. It has four cells which represent the counts of:
- True Negatives (TN) — model predicts negative outcome & known outcome is negative
- True Positives (TP) — model predicts positive outcome & known outcome is positive
- False Negatives (FN) — model predicts negative outcome but known outcome is positive
- False Positives (FP) — model predicts positive outcome but known outcome is negative.
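
The four cells can be counted directly from paired labels and predictions. The sketch below does this in plain Python on a small made-up sample, treating `1` as the positive class. The function names and the data are illustrative only.

```python
def confusion_counts(labels, preds):
    """Count TN, FN, FP and TP for binary labels, with 1 as positive."""
    counts = {'TN': 0, 'FN': 0, 'FP': 0, 'TP': 0}
    for label, pred in zip(labels, preds):
        if pred == 1:
            counts['TP' if label == 1 else 'FP'] += 1
        else:
            counts['FN' if label == 1 else 'TN'] += 1
    return counts


def accuracy(counts):
    """Proportion of predictions that match the known outcome."""
    correct = counts['TN'] + counts['TP']
    return correct / sum(counts.values())


labels = [0, 0, 1, 1, 1, 0]  # made-up known outcomes
preds = [0, 1, 1, 0, 1, 0]   # made-up model predictions
cells = confusion_counts(labels, preds)
print(cells)
print(accuracy(cells))
```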

**Instructions:**

1. Create a confusion matrix by counting the combinations of label and prediction. Display the result.
2. Count the number of True Negatives, True Positives, False Negatives and False Positives.
3. Calculate the accuracy.

In [28]:
# Create a confusion matrix
df_flights_pred_tree.groupBy('label', 'pred').count().show()

# Calculate the elements of the confusion matrix
TN = df_flights_pred_tree.filter('pred = 0 AND label = 0').count()
TP = df_flights_pred_tree.filter('pred = 1 AND label = 1').count()
FN = df_flights_pred_tree.filter('pred = 0 AND label = 1').count()
FP = df_flights_pred_tree.filter('pred = 1 AND label = 0').count()

# Accuracy measures the proportion of correct predictions
accuracy_flights = (TN + TP) / (TN + TP + FN + FP)
print('Accuracy:', accuracy_flights)

+-----+----+-----+
|label|pred|count|
+-----+----+-----+
|    1| 0.0| 1374|
|    0| 0.0| 2540|
|    1| 1.0| 3512|
|    0| 1.0| 1995|
+-----+----+-----+

Accuracy: 0.6423946502494428


## Ex. 8 - Build a Logistic Regression model

You've already built a Decision Tree model using the flights data. Now you're going to create a Logistic Regression model on the same data. The objective is to predict whether a flight is likely to be delayed by at least 15 minutes (label 1) or not (label 0).

**Instructions:**

1. Import the class for creating a Logistic Regression classifier.
2. Create a classifier object and train it on the training data.
3. Make predictions for the testing data and create a confusion matrix.

In [29]:
# Create a classifier object and train on training data
lr_model_flights = LogisticRegression(featuresCol='features',
                                      labelCol='label',
                                      predictionCol='pred').fit(df_flights_train)

# Create predictions for the testing data and show confusion matrix
df_flights_pred_lr = lr_model_flights.transform(df_flights_test)

# Labels for confusion matrix
df_flights_cm_lr = df_flights_pred_lr.groupBy("label", "pred").count().toPandas().sort_values(["pred", "label"])
df_flights_cm_lr.index = ['True negative (TN)', 'False negative (FN)',
                          'False positive (FP)', 'True positive (TP)']

TN, FN, FP, TP = df_flights_cm_lr['count'].to_list()
accuracy_flights_lr = (TN + TP) / (TN + TP + FN + FP)
print(f'Accuracy: {accuracy_flights_lr}')
df_flights_cm_lr

Accuracy: 0.6147967307079928


Unnamed: 0,label,pred,count
True negative (TN),0,0.0,2584
False negative (FN),1,0.0,1678
False positive (FP),0,1.0,1951
True positive (TP),1,1.0,3208


## Ex. 9 - Evaluate the Logistic Regression model

Accuracy can be a misleading metric when the target classes are imbalanced: a model that always predicts the most common class can score well without being useful.

There are two other useful metrics:
- precision and
- recall.

In terms of the confusion matrix, precision is `TP / (TP + FP)` and recall is `TP / (TP + FN)`.

**Precision** is the proportion of positive predictions which are correct. For all flights which are predicted to be delayed, what proportion is actually delayed?

**Recall** is the proportion of positive outcomes which are correctly predicted. For all delayed flights, what proportion is correctly predicted by the model?

The precision and recall are generally formulated in terms of the positive target class. But it's also possible to calculate weighted versions of these metrics which look at both target classes.
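
One common way to weight precision is to compute it separately for each class and average the results, weighting each class by its number of true records. The plain-Python sketch below applies that definition to the confusion matrix counts from Ex. 8. The function names are illustrative, and the weighting scheme is stated here as an assumption about what the evaluator reports.

```python
def precision_recall(tn, fn, fp, tp):
    """Precision and recall for the positive class."""
    return tp / (tp + fp), tp / (tp + fn)


def weighted_precision(tn, fn, fp, tp):
    """Per-class precision averaged with weights equal to class support."""
    precision_pos = tp / (tp + fp)  # correct among predicted positives
    precision_neg = tn / (tn + fn)  # correct among predicted negatives
    support_pos = tp + fn           # records whose true label is 1
    support_neg = tn + fp           # records whose true label is 0
    total = support_pos + support_neg
    return (support_pos * precision_pos + support_neg * precision_neg) / total


# Counts from the Logistic Regression confusion matrix in Ex. 8.
TN, FN, FP, TP = 2584, 1678, 1951, 3208
precision, recall = precision_recall(TN, FN, FP, TP)
weighted = weighted_precision(TN, FN, FP, TP)
print(f'precision = {precision:.2f}')
print(f'recall    = {recall:.2f}')
print(f'weighted precision = {weighted:.4f}')
```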

**Instructions:**

1. Find the precision and recall.
2. Create a multi-class evaluator and evaluate weighted precision.
3. Create a binary evaluator and evaluate AUC using the `"areaUnderROC"` metric.
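
AUC depends on how the model ranks records by score rather than on the hard 0/1 predictions. One equivalent view is the probability that a randomly chosen positive record scores higher than a randomly chosen negative one, counting ties as half. The plain-Python sketch below computes that on made-up scores. A library evaluator typically derives the area numerically from the ROC curve, so its values may differ slightly.

```python
def auc_from_scores(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError('AUC needs at least one record of each class')
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]            # made-up known outcomes
scores = [0.1, 0.4, 0.35, 0.8]   # made-up scores for the positive class
print(auc_from_scores(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.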

In [30]:
# Calculate precision and recall
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print(f'''
precision = {precision:.2f}
recall    = {recall:.2f}
''')

# Find weighted precision
multi_evaluator = MulticlassClassificationEvaluator(labelCol='label', predictionCol='pred')
weighted_precision = multi_evaluator.evaluate(df_flights_pred_lr, 
                                              {multi_evaluator.metricName: "weightedPrecision"})

# Find AUC
binary_evaluator = BinaryClassificationEvaluator(labelCol='label')
auc = binary_evaluator.evaluate(df_flights_pred_lr, {binary_evaluator.metricName: "areaUnderROC"})

print(f'''
weightedPrecision: {weighted_precision}
              AUC: {auc}
''')


precision = 0.62
recall    = 0.66


weightedPrecision: 0.6143464789852423
              AUC: 0.6537184972838267



# Book Dataset

## Turning Text into Tables

### Removing punctuation

In [31]:
df_books = books_data.select('*')

# Regular expression (REGEX) to match commas and hyphens
REGEX = '[,\\-]'
df_books = df_books.withColumn('raw_without_punc', F.regexp_replace(df_books.text, REGEX, ' '))
df_books = df_books.withColumn('text_without_punc', 
                               F.regexp_replace(df_books.raw_without_punc, ' +', ' '))
df_books.show(2, truncate=100)

+---+-----------------------------+-----------------------------+---------------------------+
| id|                         text|             raw_without_punc|          text_without_punc|
+---+-----------------------------+-----------------------------+---------------------------+
|  0|Forever, or a Long, Long Time|Forever  or a Long  Long Time|Forever or a Long Long Time|
|  1|              Winnie-the-Pooh|              Winnie the Pooh|            Winnie the Pooh|
+---+-----------------------------+-----------------------------+---------------------------+
only showing top 2 rows



### Text to tokens

In [32]:
df_books = df_books.drop('tokens')
df_books = Tokenizer(inputCol="text_without_punc", outputCol="tokens").transform(df_books)
df_books.show(2, truncate=25)

+---+-------------------------+-------------------------+-------------------------+-------------------------+
| id|                     text|         raw_without_punc|        text_without_punc|                   tokens|
+---+-------------------------+-------------------------+-------------------------+-------------------------+
|  0|Forever, or a Long, Lo...|Forever  or a Long  Lo...|Forever or a Long Long...|[forever, or, a, long,...|
|  1|          Winnie-the-Pooh|          Winnie the Pooh|          Winnie the Pooh|      [winnie, the, pooh]|
+---+-------------------------+-------------------------+-------------------------+-------------------------+
only showing top 2 rows



### What are stop words?

In [33]:
stopwords = StopWordsRemover()
stops = stopwords.getStopWords()
print(stops)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

### Removing stop words

In [34]:
# Specify the input and output column names
df_books = df_books.drop('words')
stopwords = stopwords.setInputCol('tokens').setOutputCol('words')
df_books = stopwords.transform(df_books)
df_books.show(2, truncate=20)

+---+--------------------+--------------------+--------------------+--------------------+--------------------+
| id|                text|    raw_without_punc|   text_without_punc|              tokens|               words|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+
|  0|Forever, or a Lon...|Forever  or a Lon...|Forever or a Long...|[forever, or, a, ...|[forever, long, l...|
|  1|     Winnie-the-Pooh|     Winnie the Pooh|     Winnie the Pooh| [winnie, the, pooh]|      [winnie, pooh]|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 2 rows
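
As a rough illustration of what the cell above does, the plain-Python sketch below filters a token list against a stop-word set. It is not Spark's implementation: `STOP_WORDS` (a small subset of the list printed earlier) and `remove_stop_words` are hypothetical names introduced here, and the case-insensitive comparison is an assumption modelled on the remover's default behaviour.

```python
# A small, illustrative subset of the default English stop words printed above
STOP_WORDS = {"a", "an", "the", "or", "and", "of", "to", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words(["winnie", "the", "pooh"]))
# ['winnie', 'pooh']
```

The result matches the `words` column for the second book title in the output above.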



### Feature hashing

In [35]:
df_books = df_books.drop('hash')
hasher = HashingTF(inputCol="words", outputCol="hash", numFeatures=32)
df_books = hasher.transform(df_books)
df_books.show(2, truncate=17)

+---+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
| id|             text| raw_without_punc|text_without_punc|           tokens|            words|             hash|
+---+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
|  0|Forever, or a ...|Forever  or a ...|Forever or a L...|[forever, or, ...|[forever, long...|(32,[8,13],[2....|
|  1|  Winnie-the-Pooh|  Winnie the Pooh|  Winnie the Pooh|[winnie, the, ...|   [winnie, pooh]|(32,[24,31],[1...|
+---+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
only showing top 2 rows



### Dealing with common words

In [36]:
df_books = df_books.drop('features')
df_books = IDF(inputCol="hash", outputCol="features").fit(df_books).transform(df_books)
df_books.show(2, truncate=12)

+---+------------+----------------+-----------------+------------+------------+------------+------------+
| id|        text|raw_without_punc|text_without_punc|      tokens|       words|        hash|    features|
+---+------------+----------------+-----------------+------------+------------+------------+------------+
|  0|Forever, ...|    Forever  ...|     Forever o...|[forever,...|[forever,...|(32,[8,13...|(32,[8,13...|
|  1|Winnie-th...|    Winnie th...|     Winnie th...|[winnie, ...|[winnie, ...|(32,[24,3...|(32,[24,3...|
+---+------------+----------------+-----------------+------------+------------+------------+------------+
only showing top 2 rows



# SMS Dataset

## Ex. 10 - Punctuation, numbers and tokens

At the end of the previous chapter you loaded a dataset of SMS messages which had been labeled as either "spam" (label 1) or "ham" (label 0). You're now going to use those data to build a classifier model.

But first you'll need to prepare the SMS messages as follows:
- remove punctuation and numbers
- tokenize (split into individual words)
- remove stop words
- apply the hashing trick
- convert to TF-IDF representation.

In this exercise you'll remove punctuation and numbers, then tokenize the messages.

**Instructions:**

1. Import the function for regular-expression replacement and the tokenizer class. (`regexp_replace`, `Tokenizer`: Already done!)
2. Replace all punctuation characters in the `text` column with a space. Do the same for all numbers in the `text` column.
3. Split the cleaned text into tokens. Name the output column `words`.

In [37]:
# Loading data
df_sms = sms_data.select('*')

# Remove punctuation (REGEX provided) and numbers
df_sms = df_sms.withColumn('fixed_text', F.regexp_replace(df_sms.text, '[_():;,.!?\\-]', ' '))
df_sms = df_sms.withColumn('fixed_text', F.regexp_replace(df_sms.fixed_text, '\\d', ' '))

# Merge multiple spaces
df_sms = df_sms.withColumn('fixed_text', F.regexp_replace(df_sms.fixed_text, ' +', ' '))

# Split the fixed_text into words
df_sms = Tokenizer(inputCol='fixed_text', outputCol='words').transform(df_sms)

df_sms.show(5, truncate=30)

+---+------------------------------+-----+------------------------------+------------------------------+
| id|                          text|label|                    fixed_text|                         words|
+---+------------------------------+-----+------------------------------+------------------------------+
|  1|Sorry, I'll call later in m...|    0|Sorry I'll call later in me...|[sorry, i'll, call, later, ...|
|  2|Dont worry. I guess he's busy.|    0| Dont worry I guess he's busy |[dont, worry, i, guess, he'...|
|  3|Call FREEPHONE 0800 542 057...|    1|           Call FREEPHONE now |        [call, freephone, now]|
|  4|Win a 1000 cash prize or a ...|    1|Win a cash prize or a prize...|[win, a, cash, prize, or, a...|
|  5|Go until jurong point, craz...|    0|Go until jurong point crazy...|[go, until, jurong, point, ...|
+---+------------------------------+-----+------------------------------+------------------------------+
only showing top 5 rows
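
The three `regexp_replace` calls and the tokenizing step can be mimicked on a single string with Python's `re` module. This is a minimal sketch, not the Spark code path: `clean_and_tokenize` is a hypothetical helper, the sample message is made up, and the lower-casing and whitespace split are an approximation of what `Tokenizer` is understood to do.

```python
import re

def clean_and_tokenize(text):
    """Mimic the cleaning steps above, then lower-case and split on spaces."""
    text = re.sub(r"[_():;,.!?\-]", " ", text)  # punctuation -> space
    text = re.sub(r"\d", " ", text)             # each digit -> space
    text = re.sub(r" +", " ", text)             # collapse runs of spaces
    # strip() is a simplification of this sketch: it avoids empty edge tokens
    return text.lower().strip().split(" ")

message = "Win a 1000 cash prize!"  # made-up message, not taken from the dataset
print(clean_and_tokenize(message))
# ['win', 'a', 'cash', 'prize']

# Without the "merge multiple spaces" step, splitting on a single space
# leaves empty strings where the digits used to be.
unmerged = "win a      cash prize".split(" ")
print(unmerged)
```

The second `print` suggests why the extra whitespace is merged before tokenizing: repeated separators would otherwise show up as empty tokens.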



## Ex. 11 - Stop words and hashing

The next steps are to remove stop words and then apply the hashing trick, converting the results into a TF-IDF representation.

A quick reminder about these concepts:
- The hashing trick provides a fast and space-efficient way to map a very large (possibly infinite) set of items (in this case, all words contained in the SMS messages) onto a smaller, finite number of values.
- The TF-IDF matrix reflects how important a word is to each document. It takes into account both the frequency of the word within each document and the frequency of the word across all of the documents in the collection.
- The tokenized SMS data are stored in `df_sms` in a column named `words`. You've cleaned up the handling of spaces in the data so that the tokenized text is neater.
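
The two ideas in the reminder can be sketched in plain Python. Everything here is an assumption made for illustration: MD5 stands in for whatever hash function Spark uses, `NUM_FEATURES` is kept tiny so that collisions are easy to see, the three documents are made up, and the smoothed formula `log((N + 1) / (df + 1))` is the form Spark's `IDF` is understood to apply.

```python
import hashlib
import math
from collections import Counter

NUM_FEATURES = 8  # deliberately tiny; the exercise uses a much larger value

def bucket(term, num_features=NUM_FEATURES):
    """Map a term to a bucket index using a stable hash (MD5 as a stand-in)."""
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features

def term_frequencies(terms):
    """Hashed term frequencies for one document: bucket index -> count."""
    return Counter(bucket(t) for t in terms)

def inverse_document_frequencies(hashed_docs):
    """Smoothed IDF per bucket: log((N + 1) / (df + 1))."""
    n_docs = len(hashed_docs)
    # Iterating a Counter yields each bucket once, so this counts documents
    doc_freq = Counter(b for doc in hashed_docs for b in doc)
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}

# Made-up documents, already tokenized and stripped of stop words
docs = [["call", "freephone", "now"],
        ["sorry", "call", "later"],
        ["win", "cash", "prize"]]

hashed = [term_frequencies(d) for d in docs]
idf = inverse_document_frequencies(hashed)
tfidf = [{b: tf * idf[b] for b, tf in doc.items()} for doc in hashed]
print(tfidf[0])
```

A bucket that occurs in every document gets an IDF of zero, so terms that appear everywhere contribute nothing to the features. With only 8 buckets, unrelated words may share an index; a larger feature count makes such collisions less likely.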

**Instructions:**

1. Import the `StopWordsRemover`, `HashingTF` and `IDF` classes. (Already done!)
2. Create a `StopWordsRemover` object (input column `words`, output column `terms`).
3. Create a `HashingTF` object (input results from previous step, output column `hash`).
4. Create an `IDF` object (input results from previous step, output column `features`).

In [38]:
df_sms = df_sms.drop(*['terms', 'hash', 'features'])

# Remove stop words.
df_sms = StopWordsRemover(inputCol='words', outputCol='terms').transform(df_sms)

# Apply the hashing trick
df_sms = HashingTF(inputCol="terms", outputCol="hash", numFeatures=1024).transform(df_sms)

# Convert hashed symbols to TF-IDF
df_sms = IDF(inputCol="hash", outputCol="features").fit(df_sms).transform(df_sms)
      
df_sms.show(4, truncate=10)

+---+----------+-----+----------+----------+----------+----------+----------+
| id|      text|label|fixed_text|     words|     terms|      hash|  features|
+---+----------+-----+----------+----------+----------+----------+----------+
|  1|Sorry, ...|    0|Sorry I...|[sorry,...|[sorry,...|(1024,[...|(1024,[...|
|  2|Dont wo...|    0|Dont wo...|[dont, ...|[dont, ...|(1024,[...|(1024,[...|
|  3|Call FR...|    1|Call FR...|[call, ...|[call, ...|(1024,[...|(1024,[...|
|  4|Win a 1...|    1|Win a c...|[win, a...|[win, c...|(1024,[...|(1024,[...|
+---+----------+-----+----------+----------+----------+----------+----------+
only showing top 4 rows



## Ex. 12 - Training a spam classifier

The `SMS` data have now been prepared for building a classifier. Specifically, this is what you have done:
- removed numbers and punctuation
- split the messages into words (or "tokens")
- removed stop words
- applied the hashing trick and
- converted to a TF-IDF representation.

Next you'll need to split the TF-IDF data into training and testing sets. Then you'll use the training data to fit a Logistic Regression model and finally evaluate the performance of that model on the testing data.

**Instructions:**

1. Split the data into training and testing sets in a `4:1` ratio. Set the random number seed to `SEED` to ensure repeatability.
2. Create a `LogisticRegression` object and fit it to the training data.
3. Generate predictions on the testing data.
4. Use the predictions to form a confusion matrix.

In [39]:
# Split the data into training and testing sets
df_sms_train, df_sms_test = df_sms.randomSplit([0.80, 0.20], seed=SEED)

# Fit a Logistic Regression model to the training data
sms_model_lr = LogisticRegression(regParam=0.2).fit(df_sms_train)

# Make predictions on the testing data
df_sms_cm_lr = sms_model_lr.transform(df_sms_test)

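# Create a confusion matrix, comparing predictions to known labels.
# Sorting by (prediction, label) gives the row order TN, FN, FP, TP;
# this assumes all four label/prediction combinations occur in the test set.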
df_sms_cm_lr = df_sms_cm_lr.groupBy("label", "prediction").count().toPandas().sort_values(["prediction", "label"])
df_sms_cm_lr.index = ['True negative (TN)', 'False negative (FN)',
                      'False positive (FP)', 'True positive (TP)']
TN, FN, FP, TP = df_sms_cm_lr['count'].to_list()
accuracy_sms = (TN + TP) / (TN + TP + FN + FP)
print(f'Accuracy: {accuracy_sms}')
df_sms_cm_lr

Accuracy: 0.9561157796451915


                     label  prediction  count
True negative (TN)       0         0.0    928
False negative (FN)      1         0.0     46
False positive (FP)      0         1.0      1
True positive (TP)       1         1.0     96
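
Accuracy alone can hide how the errors are distributed between the two classes. As a small follow-up sketch (not part of the original exercise), precision and recall for the spam class can be computed from the same four counts, which are copied here from the confusion matrix above.

```python
# Counts copied from the confusion matrix above
TN, FN, FP, TP = 928, 46, 1, 96

accuracy = (TN + TP) / (TN + FN + FP + TP)
precision = TP / (TP + FP)  # share of predicted spam that really is spam
recall = TP / (TP + FN)     # share of actual spam that was caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.956 precision=0.990 recall=0.676
```

On these counts the model almost never flags ham as spam, but it misses roughly a third of the spam messages.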


# Close session

In [40]:
spark.stop()