# Make your life easier by using unittest.**mock**

### Also, a small introduction to mocking, APIs and cats 🐈

## What is a mock?
### Is it the same as a *stub*? Or as a *dummy*? Or as a *fake*?

![SO](http://localhost:5500/lib/img3.png)

#### A mock is an *interaction-based* object whose purpose is to *override* other objects and return user-defined values. 

Its main implementation in Python is the standard-library module `unittest.mock`. The [official documentation](https://docs.python.org/3/library/unittest.mock.html) says that
>`unittest.mock` is a library for testing in Python. It allows you to replace parts of your system under test with mock objects and make assertions about how they have been used.
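
To make the definition concrete, here is a minimal sketch of both halves of it: returning a user-defined value and asserting on the interaction. The `fetcher` object and its `get_fact` method are made up for illustration and do not come from any real library.

```python
from unittest import mock

# A mock stands in for a real collaborator (here, a hypothetical fact fetcher)
fetcher = mock.Mock()
fetcher.get_fact.return_value = "Cats purr."

# The calling code only sees the mock and receives the canned value
result = fetcher.get_fact(max_length=50)
print(result)  # Cats purr.

# Interaction-based: we assert on *how* the mock was used
fetcher.get_fact.assert_called_once_with(max_length=50)
print(fetcher.get_fact.call_count)  # 1
```

If the call had been made with different arguments, `assert_called_once_with` would raise an `AssertionError`.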

In [2]:
# Hidden imports go here
import sys
import ipytest
import requests
import pyspark
import unittest
import time
from unittest import mock

Let's start with the *very* basics:

In [4]:
def a_random_function(arg1: str, arg2: str) -> str:
    return arg2 + arg1

a_random_function("AMRO", "ABN")

'ABNAMRO'

Now let's make things slightly less linear:

In [5]:
def another_function(arg1: int, arg2: int, arg3: str, arg4: str) -> str:
    return a_random_function(arg3, arg4) + str((arg1 + arg2)//2)

another_function(2, 4, "ʘ=)∫", "(=ʘᆽ")

'(=ʘᆽʘ=)∫3'

### How is this related to unit testing?    
Let's take the definition of *unit testing*. According to Wikipedia (the main source of information for our century), we can define unit testing as:
>"A software testing method by which individual units of source code — sets of one or more computer program modules together with associated control data, usage procedures, and operating procedures — are tested to determine whether they are fit for use."

The dilemma is:
> "How can we test something over which we have no control?"

In our specific case, let's take `a_random_function(arg1, arg2)` as given.    
Let's say we import it from some other package that doesn't belong to us.    

We run into this issue many times: the most common use case is when we interact with the filesystem.
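
As a sketch of the filesystem case, `unittest.mock` ships a helper, `mock_open`, that builds a fake `open`. The `count_lines` function and the file path below are hypothetical and only serve the example.

```python
from unittest import mock


def count_lines(path: str) -> int:
    """Hypothetical helper: count the lines of a text file."""
    with open(path) as handle:
        return len(handle.readlines())


# mock_open builds a fake `open` that serves the given text
fake_open = mock.mock_open(read_data="first\nsecond\nthird\n")

# While the patch is active, no real file is touched
with mock.patch("builtins.open", fake_open):
    n_lines = count_lines("/no/such/file.txt")

print(n_lines)  # 3
fake_open.assert_called_once_with("/no/such/file.txt")
```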

Let's redefine our `another_function()` as follows:

In [6]:
def another_function(arg1: int, arg2: int) -> str:
    return a_random_function(sys.argv[0], sys.argv[1]) + " - " + str((arg1 + arg2)//2)

another_function(10, 200)

'-f/home/gian/.pyenv/versions/3.7.0/envs/venv/lib/python3.7/site-packages/ipykernel_launcher.py - 105'

As you can see, I have no control over `sys.argv`: it gets defined at runtime, once I run my application.    
How can we account for them, or get *some* degree of control?

Fortunately, `unittest.mock` allows us to perform this task in a relatively relaxed way.

In [7]:
@mock.patch("sys.argv", ["One", "Two"])
def another_function(arg1: int, arg2: int) -> str:
    return a_random_function(sys.argv[0], sys.argv[1]) + " - " + str((arg1 + arg2)//2)

another_function(10, 200)

'TwoOne - 105'

In [8]:
print(sys.argv)

['/home/gian/.pyenv/versions/3.7.0/envs/venv/lib/python3.7/site-packages/ipykernel_launcher.py', '-f', '/home/gian/.local/share/jupyter/runtime/kernel-7ba6d5dc-340f-49dd-91f2-09f973813763.json']
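
The last cell shows that the patch only lives for the duration of the decorated call: afterwards `sys.argv` is back to its real value. A minimal sketch of the same behaviour, using `mock.patch` as a context manager instead of a decorator:

```python
import sys
from unittest import mock

original_argv = list(sys.argv)

# Inside the block, sys.argv is replaced by our list
with mock.patch("sys.argv", ["One", "Two"]):
    inside = list(sys.argv)

print(inside)  # ['One', 'Two']

# Outside the block, the original value has been restored
print(sys.argv == original_argv)  # True
```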


![cat](http://localhost:5500/lib/img1.jpg)

In [3]:
# Test-related imports go here

ipytest.autoconfig()
test_args = ["--showlocals", 
            "-x", 
            "--cov-report", 
            "term-missing",
            "--cov",
            "neon.functions"]

## Let's get to our cats 🐈
Since I really love cats, I'd like to know more about them.    
Fortunately, someone created the **completely free** [Cat facts API](https://catfact.ninja/) which can return nice (and interesting) facts about our feline friends.

In [9]:
requests.get(url="https://catfact.ninja/fact").json()["fact"]

'The first commercially cloned pet was a cat named "Little Nicky." He cost his owner $50,000, making him one of the most expensive cats ever.'
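
A call like this depends on the network, which is exactly what a test should avoid. As a minimal sketch, assuming the standard library's `urllib` instead of `requests` (so that the example has no dependencies), the HTTP call can be replaced with a canned response. `fetch_fact` is a hypothetical helper, not part of any application shown here.

```python
import json
import urllib.request
from unittest import mock


def fetch_fact(url: str = "https://catfact.ninja/fact") -> str:
    """Hypothetical helper: download one fact and return its text."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())["fact"]


# Build a fake response object that works in a `with` block
fake_body = json.dumps({"fact": "Cats sleep a lot.", "length": 17}).encode()
fake_response = mock.MagicMock()
fake_response.__enter__.return_value.read.return_value = fake_body

# No network traffic happens while the patch is active
with mock.patch("urllib.request.urlopen", return_value=fake_response) as patched:
    fact = fetch_fact()

patched.assert_called_once_with("https://catfact.ninja/fact")
print(fact)  # Cats sleep a lot.
```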

For the purpose of this presentation, I made a **whole application** to collect facts about cats. The application does two main things:
- Collects a certain number of facts by querying the API
- Saves everything as a nice table into my local Hive metastore

I called the application **Neon** (just because that's the first name I got from a random generator).    
The application code and its tests are freely available on my GitHub profile (link at the end of the presentation).

### Let's start making use of it then!

In [10]:
from neon.functions import *

![outline](http://localhost:5500/lib/img4.png)

In [11]:
interesting_facts = process_data(usernumber=3)

2021-08-25 10:39:09,966 - DEBUG - Now arranging API call to https://catfact.ninja/fact
2021-08-25 10:39:11,972 - DEBUG - Time elapsed: 2.0 seconds
2021-08-25 10:39:11,974 - DEBUG - Starting call now
2021-08-25 10:39:11,979 - DEBUG - Starting new HTTPS connection (1): catfact.ninja:443
2021-08-25 10:39:12,529 - DEBUG - https://catfact.ninja:443 "GET /fact HTTP/1.1" 200 None
2021-08-25 10:39:12,532 - DEBUG - Load 1 of 3 is done
2021-08-25 10:39:12,533 - DEBUG - Now arranging API call to https://catfact.ninja/fact
2021-08-25 10:39:15,539 - DEBUG - Time elapsed: 3.0 seconds
2021-08-25 10:39:15,540 - DEBUG - Starting call now
2021-08-25 10:39:15,543 - DEBUG - Starting new HTTPS connection (1): catfact.ninja:443
2021-08-25 10:39:16,074 - DEBUG - https://catfact.ninja:443 "GET /fact HTTP/1.1" 200 None
2021-08-25 10:39:16,077 - DEBUG - Load 2 of 3 is done
2021-08-25 10:39:16,078 - DEBUG - Now arranging API call to https://catfact.ninja/fact
2021-08-25 10:39:17,082 - DEBUG - Time elapsed: 1.0 s

Each API call returns a JSON object.    
`process_data()` conveniently packs them into a list, over which we can easily iterate: 

In [12]:
for entry in interesting_facts:
    print(entry["fact"])

A domestic cat can run at speeds of 30 mph.
The first official cat show in the UK was organised at Crystal Palace in 1871.
The smallest pedigreed cat is a Singapura, which can weigh just 4 lbs (1.8 kg), or about five large cans of cat food. The largest pedigreed cats are Maine Coon cats, which can weigh 25 lbs (11.3 kg), or nearly twice as much as an average cat weighs.


Let's now store these important facts inside my table, so that they don't get lost:

In [13]:
spark = establish_spark()
group_and_save(spark, facts=interesting_facts)
df = spark.read.table("default.random_cats_facts")

2021-08-25 10:40:05,946 - DEBUG - Now establishing the Spark Session
21/08/25 10:40:08 WARN Utils: Your hostname, XPS-13 resolves to a loopback address: 127.0.1.1; using 172.26.53.226 instead (on interface eth0)
21/08/25 10:40:08 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/08/25 10:40:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-08-25 10:40:15,071 - DEBUG - Spark session is established
2021-08-25 10:40:15,073 - DEBUG - Grouping cat facts list, 3 elements
2021-08-25 10:40:15,074 - DEBUG - Spark context retrieved
2021-08-25 10:40:21,014 - DEBUG - Dataframe created                             
21/08/25 10:40:23 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout d

In [15]:
df.show(100,0)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+--------------------------+
|fact                                                                                                                                                                                                                                                     |length|load_dts                  |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+--------------------------+
|The smallest pedigreed cat is a Singapura, which can weigh just 4 lbs (1.8 kg), or about five large cans of cat food. The largest pedigreed c

As you can see, the whole procedure is **very slow**.

This is due to a random `waiting` variable that I added in order to simulate a slow connection, a very inefficient backend on the server side, or anything that can get in between.

Since the `waiting` variable draws a random number of seconds between 1 and 10, we're talking here about an *average* waiting time of *roughly 5 seconds per request* (5.5, to be precise, if the draw is uniform).
In case you don't believe me, check the [law of large numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) for statistical reference 🤓
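
A quick sanity check of that figure, assuming the wait is an integer drawn uniformly from 1 to 10 (the actual draw inside Neon is not shown here): the exact mean is 5.5, and a simulated average over many draws lands close to it.

```python
import random
from statistics import mean

# Assumption: the wait is an integer drawn uniformly from 1..10
exact = mean(range(1, 11))
print(exact)  # 5.5

# Simulate many requests; the sample average approaches the exact mean
random.seed(42)
simulated = mean(random.randint(1, 10) for _ in range(100_000))
print(round(simulated, 1))
```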

### What can I do to make the whole thing more efficient, while testing for the main functionality?

There are two main issues over here:
1) **I don't wanna wait 5 seconds for each call to the API**, but I'd like to be sure that the call works nonetheless (i.e. no test data, I don't care about the data: I want my functionality to work)    
2) **I don't wanna save the data I retrieve every single time**, since I have no control over its saving process. I just know that, as soon as I run the application, the data will be saved somewhere. That's not good for testing purposes!
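
Before turning to Neon itself, here is a minimal self-contained sketch of both ideas. The `collect` function, with its `fetch` and `save` parameters, is a hypothetical stand-in for a pipeline that waits, fetches and persists; it is not Neon's actual code.

```python
import time
from unittest import mock


def collect(fetch, save, n: int, waiting: int = 5) -> list:
    """Hypothetical pipeline: wait, fetch n facts, then persist them."""
    facts = []
    for _ in range(n):
        time.sleep(waiting)  # issue 1: the slow part
        facts.append(fetch())
    save(facts)  # issue 2: the side effect we don't want in tests
    return facts


fetch = mock.Mock(return_value={"fact": "This is a random fact", "length": 21})
save = mock.Mock()

# With time.sleep patched, the waits are recorded but never actually happen
with mock.patch("time.sleep") as patched_sleep:
    t0 = time.perf_counter()
    facts = collect(fetch, save, n=3)
    elapsed = time.perf_counter() - t0

print(len(facts))  # 3
patched_sleep.assert_called_with(5)
save.assert_called_once_with(facts)
```

The function still runs end to end, but nothing sleeps and nothing is written anywhere.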

Let's use our mocks:

In [16]:
class TestAPI(unittest.TestCase):
    # Patch make_request where neon.functions looks it up; the mock returns None
    @mock.patch("neon.functions.make_request", return_value=None)
    def test_retrieve_data_without_waiting(self, patched_request):
        t0 = time.perf_counter()
        actual: dict = retrieve_data(waiting=5)
        elapsed = time.perf_counter() - t0
        actual_keys: list = list(actual)
        expected_keys: list = ["fact", "length"]
        self.assertEqual(actual_keys, expected_keys)
        self.assertLess(elapsed, 10)
    
    # Replace retrieve_data with a canned fact, so process_data gets its data without calling the API
    @mock.patch("neon.functions.retrieve_data", return_value={"fact" : "This is a random fact", "length" : "-99"})
    def test_process_data_with_custom_load(self, patched_retrieve):
        t0 = time.perf_counter()
        actual: list = process_data(usernumber=5, waiting=5)
        elapsed = time.perf_counter() - t0
        expected: list = [{"fact" : "This is a random fact", "length" : "-99"}]*5
        self.assertEqual(actual, expected)
        self.assertLess(elapsed, 10)

In [17]:
first_test_suite = TestAPI()
ipytest.run(*test_args)

2021-08-25 10:46:39,400 - DEBUG - Load 1 of 5 is done
2021-08-25 10:46:39,403 - DEBUG - Load 2 of 5 is done
2021-08-25 10:46:39,404 - DEBUG - Load 3 of 5 is done
2021-08-25 10:46:39,406 - DEBUG - Load 4 of 5 is done
2021-08-25 10:46:39,407 - DEBUG - Load 5 of 5 is done


[32m.[0m

2021-08-25 10:46:39,458 - DEBUG - Now arranging API call to https://catfact.ninja/fact
2021-08-25 10:46:39,461 - DEBUG - Starting call now
2021-08-25 10:46:39,475 - DEBUG - Starting new HTTPS connection (1): catfact.ninja:443
2021-08-25 10:46:40,063 - DEBUG - https://catfact.ninja:443 "GET /fact HTTP/1.1" 200 None


[32m.[0m[32m                                                                                           [100%][0m

----------- coverage: platform linux, python 3.7.0-final-0 -----------
Name                                                             Stmts   Miss  Cover   Missing
----------------------------------------------------------------------------------------------
/home/gian/my-repos/pytest-mock-presentation/neon/functions.py      53     41    23%   4-37, 52, 69-110
----------------------------------------------------------------------------------------------
TOTAL                                                               53     41    23%

[32m[32m[1m2 passed[0m[32m in 0.76s[0m[0m


And now to the Spark part:

In [18]:
class TestSparkFunctions(unittest.TestCase):
    @classmethod
    def setUpClass(cls) -> None:
        cls.mocked_data: List[dict] = [{"fact": "This is a random fact", "length": "-99"}]

    # Patch the writer method, so the real saveAsTable is never executed
    @mock.patch("pyspark.sql.readwriter.DataFrameWriter.saveAsTable")
    def test_group_and_save_with_patching(self, patched_writer):
        spark: SparkSession = establish_spark()
        data: List[dict] = self.mocked_data
        group_and_save(spark, data)
        patched_writer.assert_called()

    @mock.patch("neon.functions.make_request", return_value=None)
    def test_group_and_save_with_api_load(self, patched_request):
        # The autospec'd mock keeps group_and_save's signature but runs none of its code
        mocked_save = mock.create_autospec(group_and_save)
        spark: SparkSession = establish_spark()
        data: List[dict] = process_data(usernumber=5, waiting=2)
        result = mocked_save(spark, data)
        mocked_save.assert_called()
        # By default a mock returns another mock, which is truthy
        self.assertTrue(result)

In [19]:
second_test_suite = TestSparkFunctions()
ipytest.run(*test_args)

2021-08-25 10:52:50,787 - DEBUG - Load 1 of 5 is done
2021-08-25 10:52:50,789 - DEBUG - Load 2 of 5 is done
2021-08-25 10:52:50,791 - DEBUG - Load 3 of 5 is done
2021-08-25 10:52:50,793 - DEBUG - Load 4 of 5 is done
2021-08-25 10:52:50,795 - DEBUG - Load 5 of 5 is done


[32m.[0m

2021-08-25 10:52:50,807 - DEBUG - Now arranging API call to https://catfact.ninja/fact
2021-08-25 10:52:50,809 - DEBUG - Starting call now
2021-08-25 10:52:50,821 - DEBUG - Starting new HTTPS connection (1): catfact.ninja:443
2021-08-25 10:52:51,416 - DEBUG - https://catfact.ninja:443 "GET /fact HTTP/1.1" 200 None


[32m.[0m

2021-08-25 10:52:51,435 - DEBUG - Now establishing the Spark Session
2021-08-25 10:52:51,448 - DEBUG - Spark session is established
2021-08-25 10:52:51,450 - DEBUG - Now arranging API call to https://catfact.ninja/fact
2021-08-25 10:52:51,452 - DEBUG - Starting call now
2021-08-25 10:52:51,461 - DEBUG - Starting new HTTPS connection (1): catfact.ninja:443
2021-08-25 10:52:51,986 - DEBUG - https://catfact.ninja:443 "GET /fact HTTP/1.1" 200 None
2021-08-25 10:52:51,991 - DEBUG - Load 1 of 5 is done
2021-08-25 10:52:51,993 - DEBUG - Now arranging API call to https://catfact.ninja/fact
2021-08-25 10:52:51,996 - DEBUG - Starting call now
2021-08-25 10:52:52,004 - DEBUG - Starting new HTTPS connection (1): catfact.ninja:443
2021-08-25 10:52:52,535 - DEBUG - https://catfact.ninja:443 "GET /fact HTTP/1.1" 200 None
2021-08-25 10:52:52,540 - DEBUG - Load 2 of 5 is done
2021-08-25 10:52:52,542 - DEBUG - Now arranging API call to https://catfact.ninja/fact
2021-08-25 10:52:52,544 - DEBUG - Startin

[32m.[0m

2021-08-25 10:52:54,221 - DEBUG - Now establishing the Spark Session
2021-08-25 10:52:54,232 - DEBUG - Spark session is established
2021-08-25 10:52:54,234 - DEBUG - Grouping cat facts list, 1 elements
2021-08-25 10:52:54,236 - DEBUG - Spark context retrieved
2021-08-25 10:52:54,556 - DEBUG - Dataframe created
2021-08-25 10:52:54,619 - DEBUG - Table has been saved


[32m.[0m[32m                                                                                         [100%][0m

----------- coverage: platform linux, python 3.7.0-final-0 -----------
Name                                                             Stmts   Miss  Cover   Missing
----------------------------------------------------------------------------------------------
/home/gian/my-repos/pytest-mock-presentation/neon/functions.py      53     25    53%   4-37, 52, 69, 88, 109-110
----------------------------------------------------------------------------------------------
TOTAL                                                               53     25    53%

[32m[32m[1m4 passed[0m[32m in 3.90s[0m[0m


### Problem solved!

![cat](http://localhost:5500/lib/img2.jpg)

### Useful links:
[Catfacts (API reference)](https://catfact.ninja/)    
[This application (Git repo)](https://github.com/jean-n92/pytest-mock-presentation)    
[Where to patch (Documentation)](https://docs.python.org/3/library/unittest.mock.html#where-to-patch)    
[Quick start with mock (Documentation)](https://docs.python.org/3/library/unittest.mock.html#quick-guide)    
[Difference between mock and stub (StackOverflow)](https://stackoverflow.com/questions/3459287/whats-the-difference-between-a-mock-stub)