
Krasnoperov Viacheslav

BitPeak recruitment task #1

Using the pandas and faker libraries, implement a function named generate_ssns that returns a Series with the number of records given by its input parameter, whose values are randomly generated PESEL numbers.

Implement the generate_unique_ssns function (in any way) so that it returns a Series with the number of records given by its input parameter, whose values are random PESEL numbers that are unique (only within the returned collection), match the requested gender (female / male), and have birth dates within the range (from-to) given by the input parameters.

Then call generate_ssns and generate_unique_ssns for 1,000, 10,000, and 100,000 records, with a selected gender and birth dates from 1990-01-01 to 1990-01-19. Measure and display the execution time of each call of each function separately.

Implement a function called validate_ssn that takes a PESEL number together with the expected gender (female / male / any) and date of birth (a specific date or any), and returns whether the PESEL number is correct. The function should verify the syntactic correctness of the number and take the expected gender and date of birth into account. Then test validate_ssn with sample data.



Start by installing the libraries with pip

In [1]:
!pip install -q faker
 

Import all modules. The pl_PL SSN provider class is imported under the name PESELProvider

In [2]:
import random
import re
import pandas as pd
import time

from datetime import date, timedelta
from faker import Faker
from faker.providers.ssn.pl_PL import Provider as PESELProvider

The class below is the generator used by the second function to produce unique SSNs. The task did not say how to do this, so there were two options: generate random SSNs and then check whether they fit the criteria, or make faker generate values that satisfy the criteria from the start, which is much faster. I implemented both: I replaced the default provider's generator with my own, and I still check each PESEL as well.
The generator's date_time method returns a random date from the requested range

In [3]:
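class CustomBirthDateGenerator:
    """Minimal stand-in for the faker generator that PESELProvider reads from."""

    def __init__(self, date_from, date_to):
        self.date_from = date_from
        self.date_to = date_to
        self.period = self.date_to - self.date_from
        # Kept as an attribute because provider helpers such as random_digit
        # draw from generator.random.
        self.random = random

    def date_time(self):
        # randint is inclusive on both ends, so date_to can be returned too.
        return self.date_from + timedelta(days=self.random.randint(0, self.period.days))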
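
The date-sampling idea can be checked in isolation with the standard library alone. The sketch below is an illustration, not part of the notebook: random_date_between is a hypothetical helper name, and the seeded random.Random is an assumption made only so that the run is repeatable.

```python
import random
from datetime import date, timedelta


def random_date_between(date_from: date, date_to: date, rng: random.Random) -> date:
    """Return a uniformly chosen date in the inclusive range [date_from, date_to]."""
    span_days = (date_to - date_from).days
    # randint is inclusive on both ends, so date_to itself can be returned.
    return date_from + timedelta(days=rng.randint(0, span_days))


rng = random.Random(42)  # hypothetical seed, used only for repeatability
samples = [
    random_date_between(date(1990, 1, 1), date(1990, 1, 19), rng)
    for _ in range(1000)
]
# Every sample should fall inside the requested range.
print(min(samples), max(samples))
```

Drawing a day offset rather than a timestamp keeps the result a plain date, which is all the PESEL encoding needs.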

Create the Faker instance

In [4]:
fake = Faker()

A simple function that generates SSNs by iteration

In [5]:
def generate_ssns(rec_num: int) -> pd.Series:
    fake.add_provider(PESELProvider)
    ssns = (fake.ssn() for _ in range(rec_num))
    return pd.Series(ssns)

The function that generates unique SSNs. As mentioned, it uses my own generator but still checks that each value satisfies the criteria. To keep the SSNs unique I used a set, which checks whether a PESEL already exists in constant time on average. To parse a PESEL I wrote a separate function, split_pesel. I also added a check that the gender is spelled correctly (male or female)

In [6]:
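def generate_unique_ssns(rec_num: int, date_from: date, date_to: date, sex: str) -> pd.Series:
    correct_sex_str = re.match(r'^(fe)?male$', sex)
    if not correct_sex_str:
        raise ValueError(sex)
    # Each birth date allows 10,000 serial numbers, 5,000 per sex. Asking for
    # more unique values than that would keep the loop below running forever.
    max_unique = ((date_to - date_from).days + 1) * 5000
    if rec_num > max_unique:
        raise ValueError(f'only {max_unique} unique PESELs exist for this range and sex')
    custom_provider = PESELProvider(CustomBirthDateGenerator(date_from, date_to))
    fake.add_provider(custom_provider)
    ssns = set()  # set membership checks take constant time on average
    while len(ssns) < rec_num:
        pesel = fake.ssn()
        if pesel in ssns:
            continue
        cur_date, pesel_sex, _ = split_pesel(pesel)

        # Keep only values that satisfy the requested date range and sex.
        if date_from <= cur_date <= date_to and pesel_sex == sex:
            ssns.add(pesel)

    return pd.Series(list(ssns))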
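
One consequence of requiring uniqueness is worth working out. A PESEL has six date digits, four serial digits (the last of which encodes sex by parity) and one check digit, so a single birth date admits 10,000 serial values, 5,000 per sex. The sketch below computes that upper bound for a date range; max_unique_pesels is a hypothetical helper name. For 1990-01-01 to 1990-01-19 the bound is 95,000, so a request for 100,000 unique single-sex numbers in that range cannot be satisfied, and a loop that keeps drawing until the set is full would not terminate.

```python
from datetime import date

# 10**4 serial values per birth date; half have an odd sex digit (male),
# half an even one (female).
SERIALS_PER_DAY_PER_SEX = 5_000


def max_unique_pesels(date_from: date, date_to: date) -> int:
    """Upper bound on distinct single-sex PESELs for an inclusive date range."""
    days = (date_to - date_from).days + 1
    return days * SERIALS_PER_DAY_PER_SEX


capacity = max_unique_pesels(date(1990, 1, 1), date(1990, 1, 19))
print(capacity)  # 95000 (19 days * 5,000)
```

As the set fills up, collisions also become more frequent, so run time grows faster than linearly when the requested count approaches the bound.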

The next function checks whether a PESEL is valid. The checks are:

1.  The value is an 11-digit number
2.  If an expected gender is given, it matches the gender encoded in the PESEL
3.  If a birth date is given, it matches the date encoded in the PESEL
4.  The last digit (the check digit) is correct; this is a separate function

The PESEL is split with the split_pesel function (its implementation is below). The function also checks that the gender argument is spelled correctly (male, female, any)



In [7]:
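def validate_ssn(number: str, sex: str = 'any', birth_date: date = None) -> bool:
    # 1. The number must consist of exactly 11 digits.
    if not re.fullmatch(r'\d{11}', number):
        return False
    # The digits must also encode a real calendar date.
    try:
        birth_date_p, sex_p, parity_num = split_pesel(number)
    except ValueError:
        return False
    # 2. Compare the gender unless it is 'any' (or None).
    if sex:
        # The outer group makes both anchors apply to every alternative.
        if not re.match(r'^((fe)?male|any)$', sex):
            raise ValueError(sex)
        if sex != 'any' and sex != sex_p:
            return False
    # 3. Compare the birth date when one is expected.
    if birth_date and birth_date != birth_date_p:
        return False
    # 4. Verify the check digit.
    return is_correct_last_digit(number, parity_num)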

In split_pesel, the birth date, gender and check digit are extracted with string slices. The century is encoded in the month number: for the 20th century (1900-1999) the month is stored as is, and for the 21st century (2000-2099) the stored month is the real month + 20. These are the only two centuries of interest here

In [8]:
def split_pesel(pesel) -> tuple:
    birth_day = int(pesel[4:6])
    pesel_year = pesel[0:2]
    pesel_month = int(pesel[2:4])
    gender = 'male' if int(pesel[9]) % 2 else 'female'
    parity_num = pesel[-1]
    if pesel_month > 12:
        birth_year = int('20' + pesel_year)
        birth_month = pesel_month - 20
    else:
        birth_year = int('19' + pesel_year)
        birth_month = pesel_month
    return date(day=birth_day, year=birth_year, month=birth_month), gender, parity_num
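
The PESEL scheme encodes three more centuries in the same way: birth years in the 1800s add 80 to the month, the 2100s add 40, and the 2200s add 60. The sketch below generalizes the two-branch logic to all five offsets. It is an illustration rather than a replacement for split_pesel, and decode_birth_date is a hypothetical helper name.

```python
from datetime import date

# Month offset -> first year of the century, per the PESEL encoding.
CENTURY_BY_OFFSET = {0: 1900, 20: 2000, 40: 2100, 60: 2200, 80: 1800}


def decode_birth_date(pesel: str) -> date:
    """Decode the birth date from the first six digits of a PESEL."""
    year_part = int(pesel[0:2])
    month_part = int(pesel[2:4])
    day = int(pesel[4:6])
    offset = (month_part // 20) * 20  # 0, 20, 40, 60 or 80
    month = month_part - offset
    # date() raises ValueError for impossible values such as month 0 or 13.
    return date(CENTURY_BY_OFFSET[offset] + year_part, month, day)


print(decode_birth_date("90011741966"))  # 1990-01-17
print(decode_birth_date("00210100000"))  # 2000-01-01 (only the date digits matter here)
```

Letting date() reject impossible combinations avoids a separate range check on the month and day.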

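The last digit of a PESEL is a check digit. For the first ten digits A to J, compute S = A×1 + B×3 + C×7 + D×9 + E×1 + F×3 + G×7 + H×9 + I×1 + J×3; the check digit is (10 - S mod 10) mod 10. The next function performs this check

In [9]:
def is_correct_last_digit(pesel: str, parity_num: str) -> bool:
    # One weight for each of the first ten digits; zip stops before the check digit.
    multipliers = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)
    s = 0
    for number, multiplier in zip(pesel, multipliers):
        s += int(number) * multiplier
    # The check digit complements the weighted sum to a multiple of 10.
    expected_last_digit = (10 - s % 10) % 10
    return int(parity_num) == expected_last_digit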
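
As a worked example of the checksum rule: weight the first ten digits by 1, 3, 7, 9, 1, 3, 7, 9, 1, 3, sum the products, and take (10 - sum mod 10) mod 10. The standalone sketch below does this for the sample number used in the tests and for one other number that satisfies the rule; pesel_check_digit is a hypothetical helper name.

```python
WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def pesel_check_digit(first_ten: str) -> int:
    """Compute the PESEL check digit from the first ten digits."""
    total = sum(int(digit) * weight for digit, weight in zip(first_ten, WEIGHTS))
    return (10 - total % 10) % 10


print(pesel_check_digit("9001174196"))  # 6, the weighted sum is 104
print(pesel_check_digit("4405140135"))  # 9, the weighted sum is 101
```

The second number shows why the tenth weight and the complement step both matter: summing only nine products and taking the remainder gives 86 mod 10 = 6 instead of 9.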

In [11]:
test_pesel_samples = [{"num": "90011741966", "date": None, "sex": None},  # True
                      {"num": "90011741963", "date": None, "sex": None},  # False
                      {"num": "90011741966", "date": date(year=1990, month=1, day=17), "sex": 'female'},  # True
                      {"num": "900117466", "date": None, "sex": None},  # False
                      {"num": "90011741966", "date": date(year=1990, month=1, day=17), "sex": 'male'}  # False
                      ]
for sample in test_pesel_samples:
    answer = validate_ssn(number=sample["num"],
                          birth_date=sample.get("date"), sex=sample.get("sex"))
    result["validate_ssn"].append(answer)

In [12]:
print(result)

{'generate_snss': {'1000 records': '0.0733 secs', '10000 records': '0.4814 secs'}, 'generate_unique_snss': {'1000 records': '0.1036 secs', '10000 records': '0.6895 secs'}, 'validate_ssn': [True, False, True, False, False]}
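
The cell that filled the timing entries of result is not shown here. One way such measurements could be collected is sketched below. Both time_call and dummy_generator are hypothetical: the latter stands in for generate_ssns so that the block runs without faker, and the key and value formats only imitate the output above.

```python
import time
from typing import Any, Callable, Dict


def time_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> float:
    """Return the wall-clock duration of a single call, in seconds."""
    start = time.perf_counter()
    func(*args, **kwargs)
    return time.perf_counter() - start


def dummy_generator(rec_num: int) -> list:
    # Hypothetical stand-in for generate_ssns: builds rec_num 11-character strings.
    return [str(i).zfill(11) for i in range(rec_num)]


timings: Dict[str, str] = {}
for rec_num in (1_000, 10_000, 100_000):
    elapsed = time_call(dummy_generator, rec_num)
    timings[f"{rec_num} records"] = f"{elapsed:.4f} secs"
print(timings)
```

time.perf_counter is used here because it is a monotonic, high-resolution clock intended for measuring short durations.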
