# Software Engineering BKMs in Data Science

Data science is a field that requires a diverse set of skills, including statistical analysis, machine learning, and software engineering. 

As the field continues to evolve and grow, it is becoming increasingly important for data scientists to not only have a deep understanding of the statistical and mathematical concepts, but also to be skilled in software engineering practices. Best Known Methods (BKM) in Software Engineering for Data Scientists are critical for building scalable and maintainable data science projects. 

Ref:  
https://www.freecodecamp.org/news/clean-coding-for-beginners/

# Clean Code



__What is it?__

Clean code is code that is easy to understand and easy to change.

__How do you know if code is clean?__
![clean code](https://commadot.com/wp-content/uploads/2009/02/wtf.png)

>"Any fool can write code that a computer can understand. Good programmers write code that humans can understand."                                       – Martin Fowler



# Naming

![img](https://i.redd.it/pn6292mmqsy31.jpg)
The process of creating high-quality software requires attention to detail in various aspects, and one of these critical aspects is naming. Naming might seem trivial, but it has a significant impact on the readability, maintainability, and overall quality of your code. 

> "There are only two hard things in Computer Science: cache invalidation and naming things."                                                                                                                – Phil Karlton


__Why Naming Matters:__

*Readability*: Code is read more often than it is written. Good naming practices make it easier for others (and yourself) to understand the purpose of variables, functions, and classes, improving the overall readability of your code.

*Maintainability*: When working on large projects or collaborating with a team, clear and descriptive naming makes it easier to maintain the code, as it helps to understand the functionality of different components quickly.

*Debugging*: Meaningful names help identify potential issues during the debugging process, as they provide context and make it easier to pinpoint the source of problems.

__How to Create Meaningful Names:__  

Do not use comments to explain why a variable is used. If a name requires a comment, then you should take your time to rename that variable instead of writing a comment.

> "A name should tell you why it exists, what it does, and how it is used. If a name requires a comment, then the name does not reveal its intent."                 – Clean Code

```Python
# Bad
d = 0  # elapsed time in days
```

It is a common misconception that you should hide your mess with comments. Do not use letters like x, y, a, or b as variable names unless there is a good reason (loop variables are an exception to this).

```Python
# Good
elapsed_time_in_days = 0
days_since_creation = 0
days_since_modification = 0
```

These names are so much better. They tell you what is being measured and the unit of that measurement.

__Drop the Noise Words__

 Noise words are words that add no value to the meaning of the name and can make it harder to read and understand the code. Examples of noise words include:
 - "the", 
 - "info" 
 - "data" 
 - "variable" 
 - "object" 
 - "manager" 

If your class is named `ProductInfo`, you can remove the `Info` and make it `Product`. You can use `SPC` instead of `SPCData`.
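As a minimal sketch of the renaming above, the class below is called `Product` rather than `ProductInfo`. The fields `name` and `price` and the helper `format_label` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Product:  # rather than ProductInfo: "Info" adds no meaning
    name: str   # hypothetical field
    price: float  # hypothetical field


def format_label(product: Product) -> str:
    """Return a short display label for a product."""
    return f"{product.name}: {product.price:.2f}"


widget = Product(name="widget", price=9.5)
label = format_label(widget)
```

The shorter name reads naturally at every call site (`format_label(product)`), and nothing is lost by dropping the suffix.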

__Pronounceable Words__

Using pronounceable names in your code is part of the clean code practice. But wait, why do we need pronounceable names?  

Well, think about it: if you can't pronounce a name, how are you going to talk about it with your team?

```Python
from datetime import datetime

now = datetime.now()

# Bad
yyyymmdd_str = now.strftime("%Y/%m/%d")

# Good
current_date = now.strftime("%Y/%m/%d")
```

__Use Searchable Names__

Name your constants, and avoid abbreviations and single-letter names. A named constant is easy to search for across the codebase; a bare number like `0.8` is not.

```Python
# Bad
accuracy = 0

if accuracy < 0.8:
    # do something with your model
    pass
```

```Python
# Good
ACCURACY_THRESHOLD = 0.8
accuracy = 0

if accuracy < ACCURACY_THRESHOLD:
    # do something with your model
    pass
```

This is much better because `ACCURACY_THRESHOLD` can be used in many places in code. If we need to change it to 0.9 in the future, we can just change the constant.

The bad example creates question marks in the reader's mind, like what is the importance of `0.8`?

__Summary__  

| Type                | Convention                                 | Example          |
|---------------------|--------------------------------------------|------------------|
| Function and variable | Lowercase and underscore separated       | func_name, var_name |
| Constant            | Uppercase                                  | PI, TAU          |
| Class               | CapWords                                   | MyClass, TypeVar |
| Filename - class    | CapWords                                   | MyClass.py, TypeVar.py |
| Filename - others   | Lowercase, can be underscore separated but discouraged | module, module_pack |


***AND SERIOUSLY, DO NOT LEAVE WHITESPACE IN THE NAMES OF YOUR `.py` FILES***

# Functions

__Keep them Small__

Create functions that are focused, simple, and easy to understand. A function that is too long and complex can be difficult to read and comprehend, and may also be harder to modify and maintain in the future.

To keep functions small, it's recommended to follow the Single Responsibility Principle (SRP), which states that a function should do one thing and do it well. This means that a function should have a clear and specific purpose, and should not be responsible for multiple tasks or responsibilities.

__Make Sure They Just Do One Thing__

>Functions should do one thing. They should do it well. They should do it only. – Clean Code

Let's look at code that processes data and sends an email once each item is processed.
```Python

#Bad
def process_data(data):
    for item in data:
        # Do some processing
        # ...
        # Send an email to the user
        send_email(item['email'], 'Data processed successfully')
        # ...
        # Save the data to the database
        save_to_database(item)


```

In this example, the `process_data()` function is responsible for three different tasks: processing the data, sending an email to the user, and saving the data to the database. This violates the Single Responsibility Principle and makes the code harder to read, understand, and maintain. If there is an issue with one of the tasks, it can be difficult to locate and fix it without affecting the other tasks.  

Let's look at how we can make it better.

```Python
# Good

def process_data(data):
    processed_data = []
    for item in data:
        processed_item = process_item(item)
        processed_data.append(processed_item)
    return processed_data

def process_item(item):
    # Do some processing
    # ...
    processed_item = item  # placeholder so the sketch runs; real processing goes here
    return processed_item

def send_email(email, message):
    # Send an email to the user
    # ...
    pass  # placeholder body; a function with only comments is a syntax error

def save_to_database(item):
    # Save the data to the database
    # ...
    pass  # placeholder body
```

In this example, the `process_data()` function is only responsible for processing the data. Sending an email and saving to the database live in their own functions, which the calling code invokes when needed. Each function is focused on doing one thing and doing it well, which makes the code easier to read, understand, and maintain. If there is an issue with one of the tasks, it can be located and fixed without affecting the other tasks.
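One way the calling code might tie the single-purpose functions together is sketched below. This is an illustration under assumptions, not the only arrangement: `run_pipeline` is a hypothetical coordinator, the processing step (normalizing the email address) is invented for the example, and `send_email` and `save_to_database` take extra `outbox` and `database` list arguments as stand-ins for a real mail client and database.

```python
def process_item(item):
    # Hypothetical processing step: normalize the email address.
    return {**item, "email": item["email"].strip().lower()}


def process_data(data):
    return [process_item(item) for item in data]


def send_email(email, message, outbox):
    # Stand-in for a real mail client: record the message instead of sending it.
    outbox.append((email, message))


def save_to_database(item, database):
    # Stand-in for a real database write.
    database.append(item)


def run_pipeline(data, outbox, database):
    """Coordinate the steps; each step remains a single-purpose function."""
    for item in process_data(data):
        save_to_database(item, database)
        send_email(item["email"], "Data processed successfully", outbox)


outbox, database = [], []
run_pipeline([{"email": " Ada@Example.com "}], outbox, database)
```

Because each step is its own function, any one of them can be tested or replaced without touching the others.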

__Encapsulate Conditionals in Functions__

Encapsulating conditionals in functions is considered an important practice in clean code development. The idea behind this practice is to create functions that are focused and have a clear and specific purpose, rather than having conditionals scattered throughout the codebase.

By encapsulating conditionals in functions, you can improve the readability and maintainability of your code, as well as make it easier to test and debug. Here's an example to illustrate this:

Bad Example:

```Python
# Scattered conditional statements
if temperature > 30 and humidity > 60 and time_of_day == 'afternoon':
    fan.turn_on()
    air_conditioner.turn_on()
elif temperature < 10 and time_of_day == 'morning':
    heater.turn_on()
elif temperature > 20 and time_of_day == 'evening':
    fan.turn_on()
    air_conditioner.turn_on()
else:
    fan.turn_off()
    air_conditioner.turn_off()
    heater.turn_off()
```
In this example, the raw conditions are written inline, so the reader has to decode each comparison to understand the intent. As the number of conditions grows, it becomes increasingly hard to locate and modify the relevant logic.

Good Example:

```Python

# Encapsulated conditionals in functions
def is_hot_afternoon(temperature, humidity, time_of_day):
    return temperature > 30 and humidity > 60 and time_of_day == 'afternoon'

def is_cold_morning(temperature, time_of_day):
    return temperature < 10 and time_of_day == 'morning'

def is_warm_evening(temperature, time_of_day):
    return temperature > 20 and time_of_day == 'evening'

def handle_hot_afternoon():
    fan.turn_on()
    air_conditioner.turn_on()

def handle_cold_morning():
    heater.turn_on()

def handle_warm_evening():
    fan.turn_on()
    air_conditioner.turn_on()

def handle_normal_conditions():
    fan.turn_off()
    air_conditioner.turn_off()
    heater.turn_off()

if is_hot_afternoon(temperature, humidity, time_of_day):
    handle_hot_afternoon()
elif is_cold_morning(temperature, time_of_day):
    handle_cold_morning()
elif is_warm_evening(temperature, time_of_day):
    handle_warm_evening()
else:
    handle_normal_conditions()
```

__Do Not Have Side Effects__  

Side effects occur when a function modifies something outside of its own scope, such as changing the value of a global variable or modifying an object that was passed as an argument.

Functions with side effects can be harder to test and debug, as their behavior can depend on external factors that may be difficult to control. Additionally, side effects can make it harder to reason about the behavior of the code and can lead to unexpected bugs and errors.

A good practice is to make functions "pure", meaning they only rely on their input parameters and don't modify anything outside of their own scope. This makes the functions more predictable and easier to test and debug.

```Python
# Bad
total = 0

def add_to_total(amount):
    global total
    total += amount
```


The `add_to_total()` function modifies the value of the global variable `total`, which is outside of its own scope. This can make it difficult to track changes to `total` and can lead to unexpected behavior if other parts of the code depend on its value.

```Python
# Good
def calculate_total(amounts):
    total = 0
    for amount in amounts:
        total += amount
    return total
```

In this example, the `calculate_total()` function takes in a list of amounts as an input parameter and returns the total. It doesn't modify anything outside of its own scope, making it easier to test and debug. Additionally, by returning the total instead of modifying a global variable, this function can be used in a wider variety of contexts.
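The other kind of side effect mentioned earlier, modifying an object passed as an argument, can be sketched the same way. The function names `add_bonus_in_place` and `with_bonus` and the sample scores are hypothetical, chosen only to contrast the two styles.

```python
def add_bonus_in_place(scores, bonus):
    # Side effect: mutates the caller's list.
    for i in range(len(scores)):
        scores[i] += bonus


def with_bonus(scores, bonus):
    # Pure: builds and returns a new list, leaving the input untouched.
    return [score + bonus for score in scores]


original = [70, 80]
adjusted = with_bonus(original, 5)
```

After this runs, `original` is unchanged and `adjusted` holds the new values, so any other code still holding a reference to `original` sees no surprises.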

__Are you DRY?__  


DRY stands for "Don't Repeat Yourself", which is a principle of clean code development that encourages developers to avoid duplicating code as much as possible. Code repetition can be a major problem in software development, as it can make code harder to maintain, more error-prone, and more difficult to update. When you encounter repeated code segments, it's important to take steps to refactor the code and reduce the amount of duplication as much as possible.

One effective way to refactor repeated code segments is to use your IDE's refactoring features to extract the duplicated code into a separate method or function. This can help to reduce code duplication and make it easier to update the code in the future.
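As a small illustration of extracting duplicated logic, the sketch below standardizes two columns, first with the formula written out twice and then with a single reusable function. The data and the name `standardize` are assumptions made for the example.

```python
from statistics import mean, stdev

heights = [150.0, 160.0, 170.0]
weights = [50.0, 60.0, 70.0]

# Repetitive: the same standardization logic is written out for each column.
heights_scaled_repeated = [(h - mean(heights)) / stdev(heights) for h in heights]
weights_scaled_repeated = [(w - mean(weights)) / stdev(weights) for w in weights]


def standardize(values):
    """Scale values to zero mean and unit sample standard deviation."""
    center = mean(values)
    spread = stdev(values)
    return [(value - center) / spread for value in values]


# DRY: one function, reused for every column.
heights_scaled = standardize(heights)
weights_scaled = standardize(weights)
```

If the scaling rule ever changes (for example, to use the population standard deviation), only `standardize` needs to be edited.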

# Comments
You may use comments to explain the purpose, inputs, and outputs of functions. But you need to avoid over-commenting and stating the obvious.
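A rough sketch of the difference: the docstring below explains purpose, inputs, and failure behavior, while the trailing comment on `count` only restates the code. The function `mean_absolute_error` is a hypothetical example, not taken from any particular library.

```python
def mean_absolute_error(actual, predicted):
    """Return the mean absolute difference between two sequences.

    Both sequences must be non-empty and the same length;
    otherwise a ValueError is raised.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


count = 0
count += 1  # increment count by one (states the obvious; better removed)
```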

![img](https://res.cloudinary.com/practicaldev/image/fetch/s--gPLdAhbi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fal0udi31n6r3pwg8d73.png)

Oh, one more thing to add: __DO NOT LEAVE CODE IN COMMENTS!__ This one is serious, because others who see the commented-out code will be afraid to delete it, not knowing whether it is there for a reason. It will stay there for a long time. Then, when variable or method names change, it becomes irrelevant, yet still nobody deletes it.

# PEP8  

PEP8 is a set of guidelines that dictate how Python code should be formatted and written. The goal of these guidelines is to make code more readable and consistent, so that it's easier to maintain and understand.

Some of the key principles of PEP8 include using four spaces for indentation (not tabs), limiting line length to 79 characters, using descriptive names for variables and functions, and using whitespace effectively to improve readability.
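To make a few of these guidelines concrete, here is a minimal before-and-after sketch. Both functions are hypothetical and behave identically; only the formatting and naming differ.

```python
# Not PEP8-friendly: camelCase name, no spaces after commas or around
# operators, and the body crammed onto the same line as the definition.
def calcArea(w,h): return w*h


# PEP8-friendly: snake_case name, descriptive parameters, spaces after
# commas and around operators, and four-space indentation.
def calculate_area(width, height):
    return width * height
```

Linters such as `flake8` or `pycodestyle` can flag issues like those in the first version automatically.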

By following PEP8, you'll not only improve the readability and maintainability of your own code, but you'll also make it easier for others to understand and work with your code. Additionally, adhering to PEP8 makes it easier to collaborate with other developers on large projects, since everyone will be following the same set of guidelines.

So if you're looking to improve your Python coding skills, take some time to familiarize yourself with PEP8. With a little practice, you'll be writing beautiful, readable Python code in no time!

Finally, here's a [PEP8](https://www.youtube.com/watch?v=hgI0p1zf31k) song for you!