# Programming and data manipulation

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/phonchi/nsysu-math105A/blob/master/static_files/presentations/02_Data_manipulation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/phonchi/nsysu-math105A/blob/master/static_files/presentations/02_Data_manipulation.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>

## Introduction

One of the essential features of computer programming languages such as `Python` is that they shield users from the tedious details of working with the 
lowest levels of the machine. Having just covered much of the lowest-level data manipulation in a computer, it is instructive to 
review some of the major details that `Python` scripts shield the programmer 
from needing to worry about.

As we will explore in greater detail in Chapter 9, high-level programming language statements are mapped down to low-level machine instructions in order to be executed. A single `Python` statement might map to a single machine instruction, or to many tens or even hundreds of machine instructions, depending on the complexity of the statement and the efficiency of the machine language. Different implementations of the `Python` language interpreter, in concert with other elements of the computer's operating system software, take care of this mapping process for each particular computer processor. As a result, the `Python` programmer does not need to know whether she is executing her `Python` script on a RISC processor or a CISC processor.

We can recognize many `Python` operations that correspond closely to 
the basic machine instructions for modern computers or our hypothetical computer. Addition of `Python` integers and floating point numbers clearly resembles the `ADDI` and `ADDF` op-codes of our simple machine. 
Assigning values to variables surely involves the `LOAD`, `STORE`, and `MOVE` op-codes in some arrangement. `Python` shields us from worrying about which 
processor registers are in use, but leverages the op-codes of the machine to 
carry out our instructions. **We cannot see the instruction register, program 
counter, or memory cell addresses, but the `Python` script executes sequentially, one statement after the other, in the same way as the simple machine language programs.**
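
As a rough analogy, the standard CPython interpreter first translates a script into *bytecode* for its own virtual machine, and the standard `dis` module can list those instructions. The sketch below is illustrative only: the function name `add_and_store` is made up for this example, bytecode is not the machine language of the processor, and the exact instruction names differ between `Python` versions.

```python
import dis

def add_and_store(a, b):
    # One high-level assignment statement...
    total = a + b
    return total

# ...becomes several low-level instructions (loads, an add, a store, a return).
opnames = [instr.opname for instr in dis.Bytecode(add_and_store)]
print(opnames)
```

Loading operands, adding them, and storing the result appear as separate steps, much like the `LOAD`, `ADDI` and `STORE` op-codes of the simple machine.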

### Boolean expressions

A boolean expression is an expression that is either true or false. The following examples use the operator `==`, which compares two operands and produces `True` if they are equal and `False` otherwise:

In [5]:
5 == 5, 5 == 6

(True, False)

`True` and `False` are special values that belong to the class bool; they are not strings:

In [2]:
type(True), type(False)

(bool, bool)

The `==` operator is one of the comparison operators; the others are:


| Operator   | Meaning                         |
|------------|---------------------------------|
| x != y     | x is not equal to y             |
| x > y      | x is greater than y             |
| x < y      | x is less than y                |
| x >= y     | x is greater than or equal to y |
| x <= y     | x is less than or equal to y    |
| x is y     | x is the same object as y       |
| x is not y | x is not the same object as y   |


> The `Python` symbols are different from the mathematical symbols for the same operations. A common error is to use a single equal sign `=` instead of a double equal sign `==`. Remember that `=` is an assignment operator and `==` is a comparison operator. There is no such thing as `=<` or `=>`.
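
The difference between `==` and `is` is easiest to see with two objects that hold equal values. The following minimal sketch uses lists, which are covered later; the variable names are chosen only for illustration.

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a separate list with equal contents
c = a           # a second name for the very same list object

print(a == b)   # True: the contents are equal
print(a is b)   # False: a and b are two distinct objects
print(a is c)   # True: a and c refer to the same object
```

For comparing values, `==` is almost always the operator you want; `is` asks whether two names refer to one and the same object.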

### Logic and Shift Operations

Logic and shift operations can be executed on any kind of integer data, but 
because they often deal with individual bits of data, it is easiest to illustrate these operations with binary values. Just as `Python` uses the `0x` prefix to specify values in hexadecimal, the `0b` prefix can be used to specify values in binary.

In [None]:
x  = 0b00110011
mask = 0b00001111

Note that this is effectively no different from assigning `x` the value 51 (which is `110011` in binary) or `0x33` (which is 51 expressed in hexadecimal), or from assigning mask the value 15 (which is `1111` in binary) or `0x0F` (15 in hexadecimal). The representation we use to express the integer value in the `Python` assignment statement does not change how it is represented in the computer, only how human readers understand it.


Built-in Python operators exist for each of the bitwise logical operators 
described in Chapter 4.

In [6]:
print(0b00000101 ^ 0b00000100)     # Prints 5 XOR 4, which is 1
print(0b00000101 | 0b00000100)     # Prints 5 OR 4, which is 5
print(0b00000101 & 0b00000100)     # Prints 5 AND 4, which is 4
print(~0b00000101)            # Prints NOT 5, which is -6

1
5
4
-6


In [7]:
print(0b00000101 or 0b00000100)     # Logical or: returns the first truthy operand, 5
print(0b00000101 and 0b00000100)    # Logical and: both operands are truthy, so it returns the last one, 4

5
4


Strictly speaking, the operands of the logical operators should be boolean expressions, but `Python` is not very strict. **Any nonzero number is interpreted as “true.”**

In [10]:
17 and True

True
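
Because `and` and `or` return one of their operands rather than combining bits, they can give different answers from `&` and `|`. A short sketch, with variable names chosen only for illustration:

```python
# Logical operators return one of their operands, not necessarily a bool.
both_truthy = 5 and 4     # 4: every operand is truthy, so the last one is returned
stops_early = 0 and 4     # 0: `and` stops at the first falsy operand
first_truthy = 5 or 4     # 5: `or` stops at the first truthy operand
fallback = 0 or 4         # 4: the first operand is falsy, so the second is returned

print(both_truthy, stops_early, first_truthy, fallback)   # 4 0 5 4

# Bitwise and logical operators can disagree on the same operands.
print(6 & 1)      # 0: 0b110 and 0b001 share no set bits
print(6 and 1)    # 1: both operands are nonzero, so the last is returned
```

Use `&`, `|` and `^` when you mean to work on bits, and `and`, `or` and `not` when you mean to combine conditions.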

For all of these examples, `Python` prints the result in its default output 
representation, which is base 10, as the next cell shows. To display the output 
in binary notation instead, the built-in `bin()` function converts any integer 
value into the string of zero and one characters for the corresponding binary 
representation, as in the cell after it.

In [20]:
print(0b10011010 & 0b11001001)   # Prints 136, which is 0b10001000
print(0b10011010 | 0b11001001)   # Prints 219, which is 0b11011011
print(0b10011010 ^ 0b11001001)   # Prints 83, which is 0b1010011
print(~0b00000101)               # Prints -6

136
219
83
-6


In [11]:
print(bin(0b10011010 & 0b11001001))   # Prints "0b10001000"
print(bin(0b10011010 | 0b11001001))   # Prints "0b11011011"
print(bin(0b10011010 ^ 0b11001001))   # Prints "0b1010011"
print(bin(~0b00000101))               # Prints "-0b110"; see https://realpython.com/python-bitwise-operators/

0b10001000
0b11011011
0b1010011
-0b110


Because `Python` integers can use an arbitrary number of digits, leading zeros are not printed. Thus, the third line above 
prints only seven digits, rather than eight. Integers in `Python` are not fixed size: there is no `int16`, `int32`, `uint8` and so on, and `Python` simply adds bits as it needs to. As a result, a negative number is not displayed as its two's complement bit pattern: `0b11111011` is not `-5`, as it would be for an `int8`, but `251`. Since binary numbers are ***potentially infinite***, there is no fixed position for a sign bit, so `Python` writes negative values with an explicit unary `-`.
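
One common way to view a value as a fixed-width bit pattern is to mask off the bits you want and pad the result with zeros using `format()`. This is a minimal sketch that assumes an 8-bit width; the variable names are illustrative.

```python
value = -5
pattern = value & 0xFF            # keep only the lowest 8 bits

print(pattern)                    # 251
print(format(pattern, '08b'))     # 11111011

# The same format specifier restores leading zeros for positive values.
padded = format(0b1010011, '08b')
print(padded)                     # 01010011
```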

If you want to get the binary representation for negative, fixed size integers, you can use the function below:

In [12]:
# https://stackoverflow.com/questions/39959491/how-to-convert-signed-string-to-its-binary-equivalent-in-python
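def bin_int(number, size=8):
    # Non-negative values need no adjustment
    if number >= 0:
        return bin(number)
    # 2**size, e.g. 0b100000000 (256) for size=8; adding it to a negative
    # number gives its two's complement bit pattern in `size` bits
    modulus = int('0b1' + ('0' * size), 2)
    return bin(number + modulus)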

In [15]:
print(bin_int(~0b00000101))

0b11111010


We will come back to this function later on.

`Python`'s built-in operators for performing logical shift operations consist 
of doubled greater-than and less-than symbols, visually suggesting the direction 
of shift. The operand on the right of the operator indicates the number of bit 
positions to shift.

In [None]:
print(0b00111100 >> 2)   # Prints  "15",  which is 0b00001111
print(0b00111100 << 2)   # Prints "240",  which is 0b11110000

15
240


Remember that in addition to shifting bit masks left or right, bit shift operators are also an efficient way to multiply (left shift) or divide (right shift) by powers of 2.
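
A quick check of this equivalence, using arbitrary example values:

```python
n = 13
times_eight = n << 3          # shifting left by 3 multiplies by 2**3
print(times_eight, n * 8)     # 104 104

quarter = 100 >> 2            # shifting right by 2 divides by 2**2
print(quarter, 100 // 4)      # 25 25

# Right shift matches floor division, so negative values round downward.
half_negative = -7 >> 1
print(half_negative, -7 // 2) # -4 -4
```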

For arithmetic shifts, refer to https://realpython.com/python-bitwise-operators/

## Control Structures

### Conditional execution

Control op-codes like `JUMP` in machine language afford us a mechanism for jumping from one part of a program to another. In higher-level languages like `Python`, this enables what are called **control structures**, syntax patterns that allow us to express algorithms more succinctly. One example of this is the **if-statement**, which allows a segment of code to be conditionally skipped if a Boolean value in the script is not true.

<center><img src="https://www.py4e.com/images/if.svg"></center>
<div align="center"> source: https://www.py4e.com/html3/03-conditional </div>

In [None]:
x = 150
if x > 0:
    print('x is positive')

x is positive


The boolean expression after the `if` keyword is called the condition. The `if` statement consists of a header line that ends with a colon character (`:`), followed by an indented block. If the condition is true, the indented block is executed; if it is false, the block is skipped. Statements like this are called ***compound statements*** because they stretch across more than one line.

A second form of the `if` statement is alternative execution, in which there are two possibilities and the condition determines which one gets executed. The syntax looks like this:

In [None]:
if x%2 == 0:
    print('x is even')
else:
    print('x is odd')

x is even


<center><img src="https://www.py4e.com/images/if-else.svg"></center>
<div align="center"> source: https://www.py4e.com/html3/03-conditional </div>

Since the condition must either be true or false, exactly one of the alternatives will be executed. The alternatives are called ***branches***, because they are branches in the flow of execution.

If the remainder when `x` is divided by 2 is 0, then we know that `x` is even, and the program displays a message to that effect. If the condition is false, the second set of statements is executed.

Sometimes there are more than two possibilities and we need more than two branches. One way to express a computation like that is a chained conditional:

In [None]:
y = 5
if x < y:
    print('x is less than y')
elif x > y:
    print('x is greater than y')
else:
    print('x and y are equal')

x is greater than y


<center><img src="https://www.py4e.com/images/elif.svg"></center>
<div align="center"> source: https://www.py4e.com/html3/03-conditional </div>

### Loops and Iterations

Another control structure is the looping construct `while`, which allows 
a segment of code to be executed multiple times, often subject to some 
condition.

In [None]:
n = 0
while n < 10:
    print(n)
    n = n + 1

0
1
2
3
4
5
6
7
8
9


More formally, here is the flow of execution for a `while` statement:

1. Evaluate the condition, yielding `True` or `False`.

2. If the condition is false, exit the `while` statement and continue execution at the next statement.

3. If the condition is true, execute the body and then go back to step 1.

This type of flow is called a ***loop*** because the third step loops back around to the top. We call each time we execute the body of the loop an iteration. The body of the loop should change the value of one or more variables so that eventually the condition becomes false and the loop terminates. We call the variable that changes each time the loop executes and controls when the loop finishes the iteration variable. If there is no iteration variable, the loop will repeat forever, resulting in an ***infinite loop***.
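
As a small illustration tied to the earlier shift operators, the following sketch counts how many binary digits a number needs. Here `n` is the iteration variable: each right shift makes it smaller, so the condition eventually becomes false. The starting value is arbitrary.

```python
n = 0b00110011        # 51
digits = 0
while n > 0:          # the condition is evaluated before every iteration
    n = n >> 1        # drop the lowest bit, moving n toward 0
    digits = digits + 1

print(digits)         # 6
```

If the line `n = n >> 1` were left out, `n` would never change and the loop would never terminate.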

Sometimes we want to loop through a set of things such as a list of words, the lines in a file, or a list of numbers. When we have a list of things to loop through, we can construct a definite loop using a `for` statement. We call the `while` statement an indefinite loop because it simply loops until some condition becomes `False`, whereas the `for` loop is looping through a known set of items so it runs through as many iterations as there are items in the set.

In [None]:
friends = ['Joseph', 'Glenn', 'Sally']
for friend in friends:
    print('Happy New Year:', friend)
print('Done!')

Happy New Year: Joseph
Happy New Year: Glenn
Happy New Year: Sally
Done!
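
A `for` loop is often combined with a variable that accumulates a result across iterations. The following sketch, using made-up numbers, computes a running total and tracks the largest value seen so far:

```python
numbers = [3, 41, 12, 9, 74, 15]

total = 0
largest = numbers[0]          # start with the first item as the best so far
for value in numbers:
    total = total + value
    if value > largest:
        largest = value

print(total, largest)         # 154 74
```

The loop body runs exactly once per item, so no explicit iteration variable needs to be updated by hand.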


We will spend more time examining these and other control structures in 
Chapter 9 and beyond. **For now, we focus on a mechanism that allows us to 
jump to another part of the program, carry out a desired task, and then return 
to the program point from which we came.**

## Functions

We have already seen several built-in `Python` operations that do not follow the same syntactic form as the arithmetic and logic operators. The `print()`, `str()` and `bin()` operations are invoked using given names instead of symbols, and also involve parentheses wrapped around their operands. All of these are examples of a `Python` language feature called functions. A function gives a name to a series of operations that should be performed on the given parameter or parameters; this abstraction allows us to reuse those operations many times without having 
to duplicate them. Due to the way that this language feature is mapped to lower-level machine languages, the appearance of a function in an expression or statement is known as a ***function call***, or sometimes ***calling*** a function. 

From now on, we will follow the convention of including the parentheses 
when talking about `Python` functions, such as `print()`, so as to clearly denote them as distinct from variables or other items.

Functions come in many varieties beyond what we have already seen. 
Some functions take more than one argument, such as the `max()` function:

In [None]:
x = 1034
y = 1056
z = 2078
biggest = max(x, y, z)
print(biggest)   

2078


Multiple arguments are separated by commas within the parentheses. Some 
functions ***return*** a value, which is to say that the function call itself can appear as part of a more complex expression, or as the right-hand side of an assignment statement. These are sometimes called ***fruitful functions***. This is the case for both `max()` (as above) and `bin()`. Other functions do not return a value, and usually are used as standalone statements, as is the case for `print()`. Functions that do not return a value are sometimes called ***void functions***, or ***procedures***, although `Python` makes no distinction in its syntax rules. It makes no sense to assign the result of a void function to a variable, as in

In [None]:
x = print('hello world!')     # x is assigned None
type(x)

hello world!


NoneType

Each of the functions we have seen so far is one of the few dozen built-in 
functions that `Python` knows about, but there are extensive libraries of additional functions that a more advanced script can refer to. The `Python` library modules contain many useful functions that may not normally be required, but can be called upon when needed.

In [22]:
# Calculates the hypotenuse of a right triangle
import math
sideA = 3.0
sideB = 4.0

# Calculate third side via Pythagorean Theorem
hypotenuse = math.sqrt(sideA**2 + sideB**2)
print(hypotenuse)

5.0


In this example, the `import` statement tells the `Python` interpreter that the script refers to the library called `math`, which is one of the 
standard library modules that `Python` comes equipped with. The `sqrt()` 
function defined within the `math` library module returns the square root of 
its argument, which in this case is the expression `sideA` squared plus 
`sideB` squared. Note that the library function call includes both the module 
name (`math`) and the function name (`sqrt`), joined by a period.

So far, we have only been using the functions that come with `Python`, but it is also possible to **add new functions**. A function definition specifies the name of a new function and the sequence of statements that execute when the function is called. Once we define a function, we can reuse the function over and over throughout our program.

In [None]:
def print_lyrics():
    print("I'm a lumberjack, and I'm okay.")
    print('I sleep all night and I work all day.')

`def` is a keyword that indicates that this is a function definition. The name of the function is `print_lyrics`. The rules for function names are the same as for variable names: letters, digits and the underscore character are legal, but the first character can’t be a digit. You can’t use a keyword as the name of a function, and you should avoid having a variable and a function with the same name.

The empty parentheses after the name indicate that this function doesn’t take any arguments. **The first line of the function definition is called the header; the rest is called the body. The header has to end with a colon and the body has to be indented. By convention, the indentation is always four spaces. The body can contain any number of statements.**

Some of the built-in functions we have seen require arguments. Inside the function, the arguments are assigned to variables called parameters. Here is an example of a user-defined function that takes an argument:

In [23]:
def print_twice(bruce):
    print(bruce)
    print(bruce)

This function assigns the argument to a parameter named `bruce`. When the function is called, it prints the value of the parameter (whatever it is) twice.

In [24]:
print_twice('Spam')

Spam
Spam
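
User-defined functions can also be fruitful. The `return` statement hands a value back to the caller, so the call can be used inside a larger expression. The function below is a hypothetical helper written for this illustration; it wraps the earlier hypotenuse calculation.

```python
import math

def hypotenuse(side_a, side_b):
    """Return the length of the hypotenuse of a right triangle."""
    return math.sqrt(side_a**2 + side_b**2)

longest = hypotenuse(3.0, 4.0)       # the returned value can be assigned
print(longest)                       # 5.0
print(hypotenuse(5, 12) + 1)         # 14.0: or used within an expression
```

In contrast, `print_twice()` above has no `return` statement, so calling it yields `None`.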


## Input and Output

`Python`'s built-in operators allow numeric values to be manipulated and combined in a variety of familiar ways.

The previous example snippets and scripts have used the built-in `Python` 
`print()` function to output results. Many programming languages provide 
similar mechanisms for achieving input and output, providing programmers 
with a convenient abstraction to move data in or out of the computer processor. In fact, these I/O built-ins communicate with the hardware controllers 
and peripheral devices discussed in Chapter 5.

None of our example scripts thus far have required any input from the 
user. Simple user input can be accomplished with the built-in `Python` `input()` function.

In [25]:
echo = input('Please enter a string to echo: ')
print(echo * 3)

Please enter a string to echo: 5
555


The `input()` function takes as an optional argument a prompt string to present to the user when waiting for input. When run, this script will pause after displaying "Please enter a string to echo:" and wait for the user to type something. When the user hits the enter key, the script assigns the string 
of characters typed (**not including the enter key**) to the variable `echo`. The second line of the script then outputs the string repeated three times. (Recall that the `*` operator replicates string operands.)

Armed with the ability to acquire input, let's rewrite our hypotenuse script to prompt a user for the side lengths rather than hardcode the values into assignment statements.

In [26]:
# Calculates the hypotenuse of a right triangle
import math
# Inputting the side lengths, first try
sideA = input('Length of side A? ')
sideB = input('Length of side B? ')
# Calculate third side via Pythagorean Theorem
hypotenuse = math.sqrt(sideA**2 + sideB**2)
print(hypotenuse)

Length of side A? 3
Length of side B? 4


TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'

At this point, the `Python` interpreter aborts the script! This type of error is easy to create in a dynamically typed language like 
`Python`. Our hypotenuse calculation, which worked in the earlier version of 
the script, now causes an error when the values are read as input from 
the user instead. The problem is indeed a `TypeError`: `Python` does not know how to square the variable `sideA`, 
because `sideA` is now a **character string** in this version of the script, rather than a number as before. The root cause lies earlier in the script, where the values of `sideA` and `sideB` are returned from `input()`. This, too, is common when debugging. **The `Python` interpreter reports the line of the script where the error occurred, but the real culprit is often earlier in the script.**

In the string-echoing snippet above, it was clear that the value assigned 
to echo should be the string of characters typed in by the user. The `input()` 
function behaves in the same way in the hypotenuse program, even though 
the programmer's intent is now to enter integer values. The representation of 
the ASCII- or UTF-8-encoded string `“4”` differs from the two's complement 
representation of the integer 4, and the Python script must explicitly make 
the conversion from one representation to the other before proceeding to calculations with integers.
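
The difference is visible without any user interaction. In the sketch below the variable `typed` stands in for what `input()` would return if the user typed 4; the name is chosen only for illustration.

```python
typed = '4'                     # input() always returns a string like this

print(type(typed))              # <class 'str'>
print(ord(typed))               # 52: the character code stored for '4'
print(typed * 2)                # 44: string replication, not arithmetic

as_number = int(typed)          # explicit conversion to an integer
print(type(as_number))          # <class 'int'>
print(as_number * 2)            # 8
```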

Fortunately, another built-in function provides this capability. The `int()` function attempts to convert its argument into an integer representation. If it cannot, an appropriate error message is produced.

In [27]:
# Calculates the hypotenuse of a right triangle
import math
# Inputting the side lengths, with integer conversion
sideA = int(input('Length of side A? '))
sideB = int(input('Length of side B? '))
# Calculate third side via Pythagorean Theorem
hypotenuse = math.sqrt(sideA**2 + sideB**2)
print(hypotenuse)

Length of side A? 3
Length of side B? 4
5.0


The revised script operates as intended, and can be used for many right triangles without having to edit the script, unlike the pre-input version.
As a final note, the `int()` function performs its conversion by carefully 
examining the string argument and interpreting it as a number. If the input 
string is a number but not an integer, as for example `"3.14"`, `int()` does not discard the fractional portion: it raises a `ValueError`. To accept such input, convert it with `float()` instead; applying `int()` to the resulting floating point number then truncates the fractional portion.
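
The behaviour of `int()` on different kinds of input can be checked directly. This sketch uses a `try`/`except` block, which has not been introduced yet, only to keep the script running after the failed conversion.

```python
print(int('42'))              # 42: a string of digits converts cleanly
print(int(3.14))              # 3: a float is truncated toward zero
print(int(float('3.14')))     # 3: convert the string to a float first

try:
    int('3.14')               # a string with a fractional part is rejected
    failed = False
except ValueError as err:
    failed = True
    print('ValueError:', err)
```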

## Bit manipulation and the converter

In [28]:
def set_bit(num, i):
    """
    Shift 1 left by i bits, creating a mask such as 0b0001000. ORing the
    mask with num sets bit i to 1 and leaves every other bit unchanged.
    """
    return num | (1 << i)

def clear_bit(num, i):
    """
    Almost the reverse of set_bit: invert the mask so that only bit i is 0,
    then AND it with num to clear bit i and keep every other bit.
    """
    mask = ~(1 << i)
    return num & mask

In [29]:
print(bin(set_bit(0b0000, 3)))
print(bin(clear_bit(0b1111, 0)))

0b1000
0b1110
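
The same masking idea extends to reading and flipping a single bit. The two helpers below are hypothetical companions to `set_bit()` and `clear_bit()`, sketched here for illustration rather than taken from the original notebook.

```python
def get_bit(num, i):
    """Return bit i of num (0 or 1) by shifting it down to position 0."""
    return (num >> i) & 1

def toggle_bit(num, i):
    """Flip bit i of num: XOR with a one-bit mask inverts only that bit."""
    return num ^ (1 << i)

print(get_bit(0b1000, 3))              # 1
print(get_bit(0b1000, 0))              # 0
print(bin(toggle_bit(0b1010, 1)))      # 0b1000
print(bin(toggle_bit(0b1010, 0)))      # 0b1011
```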


In [30]:
def bin_to_decimal(bin_string):
    """
    Convert a binary value to its decimal equivalent
    >>> bin_to_decimal("101")
    5
    >>> bin_to_decimal(" 1010   ")
    10
    >>> bin_to_decimal("-11101")
    -29
    >>> bin_to_decimal("0")
    0
    """
    bin_string = str(bin_string).strip()
    is_negative = bin_string[0] == "-"
    if is_negative:
        bin_string = bin_string[1:]
    decimal_number = 0
    for char in bin_string:
        decimal_number = 2 * decimal_number + int(char)
    return -decimal_number if is_negative else decimal_number

In [31]:
bin_to_decimal("101"), bin_to_decimal(" 1010   "), bin_to_decimal("-11101"), bin_to_decimal("0")

(5, 10, -29, 0)

In [34]:
def decimal_to_binary(num):

    """
    Convert an Integer Decimal Number to a Binary Number as str.
    >>> decimal_to_binary(0)
    '0b0'
    >>> decimal_to_binary(2)
    '0b10'
    >>> decimal_to_binary(7)
    '0b111'
    >>> decimal_to_binary(35)
    '0b100011'
    >>> # negatives work too
    >>> decimal_to_binary(-2)
    '-0b10'
    """
    if num == 0:
        return "0b0"
    negative = False

    if num < 0:
        negative = True
        num = -num

    binary = []
    while num > 0:
        binary.insert(0, num % 2)
        num >>= 1

    if negative:
        return "-0b" + "".join(str(e) for e in binary)

    return "0b" + "".join(str(e) for e in binary)

In [35]:
decimal_to_binary(0), decimal_to_binary(2), decimal_to_binary(7), decimal_to_binary(35), decimal_to_binary(-2)

('0b0', '0b10', '0b111', '0b100011', '-0b10')
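
The hand-written converters can be cross-checked against `Python`'s built-ins: `int()` accepts a base as its second argument, and `bin()` and `format()` go in the other direction. A brief sketch using the same sample values:

```python
# Binary string -> integer
print(int('101', 2))          # 5
print(int(' 1010   ', 2))     # 10: surrounding whitespace is ignored
print(int('-11101', 2))       # -29

# Integer -> binary string
print(bin(35))                # 0b100011
print(bin(-2))                # -0b10
print(format(35, 'b'))        # 100011: no prefix
print(format(35, '08b'))      # 00100011: zero-padded to 8 digits
```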

## The little computer

Check out https://computerscience.chemeketa.edu/cs160Reader/ProgrammingLanguages/LittleComputer1.html, https://eseo-tech.github.io/emulsiV/ or https://www.101computing.net/lmc-simulator/

### Exercise 1: 
Try to extend the decimal-to-binary converter so that it accepts real numbers that contain a fractional part.

### Exercise 2:

Try to understand the internals of `bin_to_decimal()`, `decimal_to_binary()` and `bin_int()`.