# Session 1 - Introduction to Python and Core Libraries <img src="./Resources/sla.png" width="100" align ="right"/>

## Python

We briefly introduce the essentials of programming in Python. This session does not cover every detail of the language; please consider the following resources if you would like to learn more. 

* https://www.codecademy.com/learn/python
* http://docs.python-guide.org/en/latest/intro/learning/
* https://www.codementor.io/learn-python-online
* http://mbakker7.github.io/exploratory_computing_with_python/

### Data Types

The basic data types in Python are integers, floats, strings and booleans.
<p>Integers: <tt>2, 25, -13, 1000000</tt>
<p>Floats: <tt>1.2, 0.0001, -3.4, 1.5e-10</tt>
<p>Strings: <tt>"hello", "this is a string", "iːvən juːnɪkəʊd", 'single quotes are also OK', "can't"</tt>
<p>Booleans: <tt>True</tt>, <tt>False</tt>

### Print statement
You can display values with the <tt>print()</tt> function. Separate values with commas.
<p>The <tt>type()</tt> function will report the type of the expression.

In [56]:
print(2021,"hello","everyone",True)
print(type(2),type(1.2),type("hello"),type(True))

2021 hello everyone True
<class 'int'> <class 'float'> <class 'str'> <class 'bool'>


### Expressions
The basic types can be combined into expressions with the following operators:
<table>
  <tr><th>Operator</th><th>Example</th><th>Operation</th></tr>
  <tr><td>+</td><td><tt>3+4</tt></td><td>Addition</td></tr>
  <tr><td>-</td><td><tt>-5</tt></td><td>Negation</td></tr>
  <tr><td>-</td><td><tt>5-3</tt></td><td>Subtraction</td></tr>
  <tr><td>*</td><td><tt>3*4</tt></td><td>Multiplication</td></tr>
  <tr><td>/</td><td><tt>12/4</tt></td><td>Division (float result)</td></tr>
  <tr><td>//</td><td><tt>12//4</tt></td><td>Floor division (int result for int operands)</td></tr>
  <tr><td>%</td><td><tt>10%4</tt></td><td>Modulus (int result for int operands)</td></tr>
  <tr><td>**</td><td><tt>2**8</tt></td><td>Exponentiation (int result for int operands with a non-negative exponent)</td></tr>  
  <tr><td>+</td><td><tt>"Hello "+"world!"</tt></td><td>String concatenation (string result)</td></tr>
</table>
<p>The following operators create a boolean result
<table>
  <tr><th>Operator</th><th>Example</th><th>Operation</th></tr>
  <tr><td>&lt;</td><td><tt>3&lt;4</tt></td><td>Less than</td></tr>
  <tr><td>&lt;=</td><td><tt>3&lt;=4</tt></td><td>Less than or equal</td></tr>
  <tr><td>==</td><td><tt>3==4</tt></td><td>Equal</td></tr>
  <tr><td>!=</td><td><tt>3!=4</tt></td><td>Not equal</td></tr>
  <tr><td>&gt;=</td><td><tt>3&gt;=4</tt></td><td>Greater than or equal</td></tr>
  <tr><td>&gt;</td><td><tt>3&gt;4</tt></td><td>Greater than</td></tr>
  <tr><td>and</td><td><tt>0&lt;3 and 3&lt;5</tt></td><td>Logical and</td></tr>
  <tr><td>or</td><td><tt>0&lt;3 or 3&lt;5</tt></td><td>Logical or</td></tr>
  <tr><td>not</td><td><tt>not True</tt></td><td>Logical not</td></tr>
</table>
 <p>Parentheses may be used to control order of evaluation in complex expressions:
 <p>Compare: <tt>2+3*4</tt> vs <tt>(2+3)*4</tt>
   

In [1]:
print(3+4,-5,5-3,3*4,12/4,12//4,10%3,2**8,"Hello "+"QGIS!")
print(3<4,3<=4,3==4,3!=4,3>=4,3>4,0<3 and 3<5,0<3 or 3<5,not True)

7 -5 2 12 3.0 3 1 256 Hello QGIS!
True True False True False False True True False


### Lists and Dicts

Python has two very useful data structures built into the language:

* lists: []
* dictionaries (hash tables): {}

In [9]:
mylist=[100,"QGIS",True,3.14159]
print(mylist)
# element=mylist[1]
# print(element)
# mylist[1]="Fred"
# print(mylist)
# mylist.append("new value")
# print(mylist)
# del mylist[2]
# print(mylist)
# lst=['a','b','c','d','e'];
# print(lst[1:3])

[100, 'QGIS', True, 3.14159]


In [4]:
mydict={'firstname':"Mike",'employed':True,'height':1.8 }
print(mydict)
# element=mydict["employed"]
# print(element)
# mydict['status']="academic"
# print(mydict)
# del mydict["height"]
# print(mydict)

{'firstname': 'Mike', 'employed': True, 'height': 1.8}


### Defining Functions

In [18]:
def plus(a, b):
    return a + b

In [19]:
plus(3, 4)

7

In [20]:
# Default arguments
def plus(a=1, b=2):
    return a + b

In [21]:
plus()

3

In [26]:
# the default argument can be a list
def some_function(start=[]):
    start.append(1)
    return start

In [27]:
result = some_function()

In [28]:
result.append(2)
result

[1, 2]

In [51]:
# argument can be a function
def some_function2(fun):
    fun()
    print("Hello world")
    
def fun():
    print("This is a function")

some_function2(fun)

This is a function
Hello world


In [49]:
# a simple example of a decorator
def some_function3(fun):
    def wrappedFunc():
        print("Hello world")
        fun()
    return wrappedFunc

@some_function3
def fun():
    print("This is a function")
    
fun()

Hello world
This is a function


### Conditional statements
<p>The basic structure of a conditional statement is:
  <ul>if &lt;boolean expression>:
    <ul>&lt;statements to be executed only if expression is true></ul>
  </ul>
<p>Note the use of the colon ':' symbol to indicate the start of a controlled block. 

In [52]:
val=int(input("Type a number: "))
if val > 100:
    print("warning: value is too big")
    val=100
elif val < 0:
    print("warning: value is too small")
    val=0
else:
    print("value is OK")
print("value=",val)

Type a number:  1


value is OK
value= 1


### Scope of variables

In [9]:
y = 0
for x in range(10):
    y = x

In [10]:
x

9

In [11]:
[x for x in range(10, 20)]

[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [12]:
y

9

In [13]:
x

9

In [15]:
def scope_of_var():
    x = 20
scope_of_var()

In [16]:
x

9

Python follows the LEGB Rule (after https://www.amazon.com/dp/0596513984/):

* L, Local: Names assigned in any way within a function (def or lambda), and not declared global in that function.
* E, Enclosing function locals: Name in the local scope of any and all enclosing functions (def or lambda), from inner to outer.
* G, Global (module): Names assigned at the top-level of a module file, or declared global in a def within the file.
* B, Built-in (Python): Names preassigned in the built-in names module : open, range, SyntaxError,...

In [8]:
x = 3
def foo():
    x=4
    def bar():
        print(x)  # Accesses x from foo's scope
    bar()  # Prints 4
    x=5
    bar()  # Prints 5

In [9]:
foo()

4
5
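The LEGB order governs how names are looked up. Assigning to a name inside a function normally creates a new local name, unless the name is declared `global` or `nonlocal`. The following is a minimal sketch of both declarations; the names `counter`, `bump_global`, and `make_counter` are made up for illustration.

```python
counter = 0  # global (module-level) name

def bump_global():
    global counter       # rebind the module-level name instead of creating a local
    counter += 1

def make_counter():
    count = 0            # lives in the enclosing scope of step()
    def step():
        nonlocal count   # rebind the enclosing function's variable
        count += 1
        return count
    return step

bump_global()
step = make_counter()
first, second = step(), step()
print(counter, first, second)   # 1 1 2
```

Without the `global` or `nonlocal` declarations, the `+= 1` lines would raise an `UnboundLocalError`, because the assignment makes the name local to the function.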


See [scope_resolution_legb_rule.ipynb](scope_resolution_legb_rule.ipynb) for some additional readings on scope.

### Built-in functions
Python has a large number of built-in functions. Here are the most commonly used:
<p><table>
  <tr><th>Function</th><th>Example</th><th>Description</th></tr>
  <tr><td><tt>abs</tt></td><td><tt>y=abs(x)</tt></td><td>Absolute value</td></tr>
  <tr><td><tt>chr</tt></td><td><tt>ch=chr(x)</tt></td><td>Character with specified code</td></tr>
  <tr><td><tt>float</tt></td><td><tt>x=float(str)</tt></td><td>Convert string to floating point number</td></tr>
  <tr><td><tt>input</tt></td><td><tt>y=input('Type your name: ')</tt></td><td>Input string value from user</td></tr>
  <tr><td><tt>int</tt></td><td><tt>n=int(str)</tt></td><td>Convert string to integer number</td></tr>
  <tr><td><tt>len</tt></td><td><tt>n=len(mylist)</tt></td><td>Number of items in list or characters in string</td></tr>
  <tr><td><tt>max</tt></td><td><tt>x=max(mylist)</tt></td><td>Largest value in list</td></tr>
  <tr><td><tt>min</tt></td><td><tt>x=min(mylist)</tt></td><td>Smallest value in list</td></tr>
  <tr><td><tt>ord</tt></td><td><tt>n=ord(ch)</tt></td><td>Character code from string</td></tr>
  <tr><td><tt>print</tt></td><td><tt>print(x,y,str)</tt></td><td>Convert values to string form and print</td></tr>
  <tr><td><tt>range</tt></td><td><tt>r=range(1,10,2)</tt></td><td>Generate number sequence</td></tr>
  <tr><td><tt>str</tt></td><td><tt>s=str(x)</tt></td><td>String representation of value</td></tr>
  <tr><td><tt>type</tt></td><td><tt>print(type(x))</tt></td><td>Type of an expression</td></tr>
</table>

Here is a [full list of the built-in functions](https://docs.python.org/3/library/functions.html).


### Default arguments

In [45]:
def do_something(a, b, c):
    return (a, b, c)

In [46]:
do_something(1, 2, 3)

(1, 2, 3)

In [47]:
def do_something_else(a=1, b=2, c=3):
    return (a, b, c)

In [48]:
do_something_else()

(1, 2, 3)

In [49]:
def some_function(start=[]):
    start.append(1)
    return start

In [50]:
result = some_function()

In [51]:
result

[1]

In [None]:
result.append(2)

In [None]:
other_result = some_function()

In [None]:
other_result
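The cells above illustrate a common pitfall: default values are evaluated once, when the function is defined, so every call that omits `start` shares the same list. The sketch below replays those cells and then shows one widely used workaround, defaulting to `None`; the name `some_function_safe` is hypothetical.

```python
def some_function(start=[]):
    start.append(1)
    return start

result = some_function()
result.append(2)
other_result = some_function()     # reuses the same default list
print(other_result)                # [1, 2, 1]
print(result is other_result)      # True

def some_function_safe(start=None):
    # create a fresh list on every call that omits the argument
    if start is None:
        start = []
    start.append(1)
    return start

print(some_function_safe(), some_function_safe())   # [1] [1]
```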

### List comprehension

"List comprehension" is the idea of writing some code inside of a list that will generate a list.

Consider the following:

In [None]:
[x ** 2 for x in range(10)]

In [None]:
temp_list = []
for x in range(10):
    temp_list.append(x ** 2)
temp_list

But list comprehension is much more concise.

In [92]:
# dict comprehension
dict1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
# Double each value in the dictionary
double_dict1 = {k:v*2 for k,v in dict1.items()}
double_dict1

{'a': 2, 'b': 4, 'c': 6, 'd': 8, 'e': 10}

### Basic Intro to Object-oriented (OO) programming 

In [2]:
class Person():
    def __init__(self):
        print("Initiating a person object")
    
A = Person()
A.hobby="TT"
B = Person()
B.hobby="SW"
C = Person()
C.hobby="CC"

print(A.hobby)
print(B.hobby)
print(C.hobby)

class Employee(Person):
   def __init__(self, department):
       super().__init__()
       self.department = department

e1 = Employee("Department 1")
e2 = Employee("Department 1")
print(e1.department)
print(e2.department)

Initiating a person object
Initiating a person object
Initiating a person object
TT
SW
CC
Initiating a person object
Initiating a person object
Department 1
Department 1


\_\_init__ is a double-underscore (so-called "dunder") method. 

1. Basic Customizations
* \_\_new__(cls) returns a new object (an instance of the class). It is called before \_\_init__.  
* \_\_init__(self) is called to initialize the newly created object. It plays the role of a constructor.  
* \_\_del__(self) is called when the object is about to be destroyed, for example after del removes the last reference to it. Can be used to commit unsaved data or close connections.  
* \_\_repr__(self) for repr() function. Returns a string representation of the object, intended for developers to debug. If not implemented, a default showing the class name and memory address is used.  
* \_\_str__(self) for str() function and print(). Returns a string intended for users to see a pretty and useful output. If not implemented, \_\_repr__ is used as a fallback.  
* \_\_bytes__(self) for bytes() function. Returns a bytes object that represents the object.  
* \_\_format__(self, format_spec) for format() function and formatted string literals. Interprets a format specification such as ‘%’ for percentage and ‘b’ for binary.  
* \_\_lt__(self, anotherObj) for < operator.  
* \_\_le__(self, anotherObj) for <= operator.  
* \_\_eq__(self, anotherObj) for == operator.  
* \_\_ne__(self, anotherObj) for != operator.  
* \_\_gt__(self, anotherObj) for > operator.  
* \_\_ge__(self, anotherObj) for >= operator.  
2. Arithmetic Operators
* \_\_add__(self, anotherObj) for + operator.
* \_\_sub__(self, anotherObj) for - operator.
* \_\_mul__(self, anotherObj) for * operator.
* \_\_matmul__(self, anotherObj) for @ operator (matrix multiplication, as used by NumPy).
* \_\_truediv__(self, anotherObj) for / (true division) operator.
* \_\_floordiv__(self, anotherObj) for // (floor division) operator.
3. Type Conversion
* \_\_abs__(self) for abs() function. Returns the absolute value.
* \_\_int__(self) for int() function. Returns the integer value of the object.
* \_\_float__(self) for float() function. Returns the float equivalent of the object.
* \_\_complex__(self) for complex() function. Returns the complex representation of the object.
* \_\_round__(self, nDigits) for round() function. Returns the value rounded to nDigits digits.
* \_\_trunc__(self) for trunc() function of math module. Returns the value truncated to an integer.
* \_\_ceil__(self) for ceil() function of math module. Returns the ceiling of the object.
* \_\_floor__(self) for floor() function of math module. Returns the floor of the object.
4. Emulating Container Types
* \_\_len__(self) for len() function. Returns the number of items in the container.
* \_\_getitem__(self, key) to support indexing: container[key] implicitly calls container.\_\_getitem__(key).
* \_\_setitem__(self, key, value) supports item assignment (items can be changed by index), like container[key] = otherElement.
* \_\_delitem__(self, key) for the del statement, as in del container[key]. Deletes the value at the given key.
* \_\_iter__(self) returns an iterator over all values in the container.

These methods make it possible to create classes that behave like native data structures such as lists, tuples, dictionaries, and sets.
The special methods provide a common API, so user-defined classes can work with built-in functions and operators.

In [7]:
# declare our own string class
class String:

    # magic method to initiate object
    def __init__(self, string):
        self.string = string

    # print our string object
    # def __repr__(self):
    #     return 'Object: {}'.format(self.string)
    
    # def __add__(self, other):
    #     return self.string + other

# object creation
string1 = String('Hello')

# print object location
print(string1)

# concatenate String object and a string
# print(string1 + ' world')


<__main__.String object at 0x0000019399740710>
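With the commented-out methods enabled, the object prints readably and supports `+`. The sketch below follows the same `String` class; the `__len__` and `__eq__` methods, and the choice to return a new `String` from `__add__`, are additions for illustration rather than part of the original cell.

```python
class String:
    def __init__(self, string):
        self.string = string

    def __repr__(self):
        # used by print() as well, because __str__ is not defined
        return 'Object: {}'.format(self.string)

    def __add__(self, other):
        # accept either a plain str or another String on the right-hand side
        other_text = other.string if isinstance(other, String) else other
        return String(self.string + other_text)

    def __len__(self):
        return len(self.string)

    def __eq__(self, other):
        return isinstance(other, String) and self.string == other.string

string1 = String('Hello')
print(string1)                      # Object: Hello
print(string1 + ' world')           # Object: Hello world
print(len(string1))                 # 5
print(string1 == String('Hello'))   # True
```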


### Exercise FizzBuzz

Suppose we have a number n. We have to display a string representation of all numbers from 1 to n, subject to some constraints:

* If the number is divisible by 3, write Fizz instead of the number
* If the number is divisible by 5, write Buzz instead of the number
* If the number is divisible by both 3 and 5, write FizzBuzz instead of the number

To solve this, we will follow these steps. For all numbers from 1 to n:

* if a number is divisible by both 3 and 5, print "FizzBuzz"
* otherwise, when the number is divisible by 3, print "Fizz"
* otherwise, when the number is divisible by 5, print "Buzz"
* otherwise, write the number as a string

In [5]:
# hint: i % 3 == 0 means the number i is divisible by 3
class Solution(object):
    def fizzBuzz(self, n):
        result = []
        # for loop from 1 to n
#         for i in :
        # i % 3 == 0 means the number i is divisible by 3
#             if :
#                 result.append()
#             elif :
#                 result.append()
#             elif :
#                 result.append()
#             else:
#                 result.append()
        return result

ob1 = Solution()
print(ob1.fizzBuzz(30))

[]


## Python standard library
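<p>Some commonly used modules. The first five (<tt>os</tt> to <tt>datetime</tt>) are part of the standard library; the rest are third-party packages that must be installed separately: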
<p><table><tr><th>Module Name</th><th>Description</th></tr>
  <tr><td>os</td><td>Access operating system</td></tr>
  <tr><td>sys</td><td>Access computer resources</td></tr>
  <tr><td>math</td><td>Access standard mathematical functions</td></tr>
  <tr><td>random</td><td>Access pseudo-random number generator</td></tr>
  <tr><td>datetime</td><td>Functions for manipulating dates and times</td></tr>
  <tr><td>matplotlib</td><td>Simple plotting functions</td></tr>
  <tr><td>pandas</td><td>Data Manipulation and Visualization</td></tr>
  <tr><td>geopandas</td><td>Geospatial Data Manipulation and Visualization</td></tr>
  <tr><td>pysal</td><td>Spatial Analysis Library</td></tr>
  <tr><td>numpy</td><td>Numeric computing library for Python</td></tr>
  <tr><td>scipy</td><td>Scientific computing library for Python</td></tr>
 </table>
  
  You can print a list of the functions available in a module with the `dir()` command.
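As a small illustration, `dir()` can be applied to the standard `math` module. It returns every name defined in the module, not only functions, so this sketch filters out names that start with an underscore; that filter is a convention chosen here, not something `dir()` does itself.

```python
import math

# keep only the public names (those not starting with an underscore)
public_names = [name for name in dir(math) if not name.startswith('_')]

print(public_names[:5])
print('sqrt' in public_names)   # True
```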

### Importing modules

One of the great advantages of Python is the ease with which you can incorporate code written by others into your programs. These packages of code, called modules, can be readily downloaded from the internet and integrated into your programs.

To access these modules, you need to explicitly import them at the start of your program.

For example, to access the `random` module you would write
```
import random
```
Then you would be able to access functions within the random module by name. For example, the function `random.randint(low,high)` returns a pseudo-random integer between low and high, both inclusive.


When importing modules, you can optionally assign them a short name to save typing:
```
import numpy as np
x=np.zeros((2,2))
print(x)
```

You can choose to import only specifically named functions from within a module. This can prevent name clashes between your own functions and the functions in the module. For example:
```
from random import randint
print(randint(1,6))
```

## NumPy

NumPy (Numerical Python) is a C implementation of arrays in Python. It provides an efficient interface to store and operate on dense data buffers. NumPy arrays provide much more efficient storage and data operations than Python lists, especially as the arrays grow larger in size. 

The CPython implementation is written in C. When we define an integer in Python, such as x = 100, x is actually a pointer to a compound C structure, which contains several values. 
A single integer in Python contains four pieces:

- ``ob_refcnt``, a reference count that helps Python silently handle memory allocation and deallocation
- ``ob_type``, which encodes the type of the variable
- ``ob_size``, which specifies the size of the following data members
- ``ob_digit``, which contains the actual integer value that we expect the Python variable to represent.
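One rough way to observe this overhead is `sys.getsizeof`, which reports the bytes used by a whole Python object, compared with the `itemsize` of a NumPy `int64` element. This is only a sketch: the Python figure depends on the interpreter version and platform, so no value is shown for it.

```python
import sys
import numpy as np

x = 100
py_int_bytes = sys.getsizeof(x)                           # whole Python int object, header included
np_item_bytes = np.array([x], dtype=np.int64).itemsize    # raw value stored in the array

print("Python int object:", py_int_bytes, "bytes")
print("NumPy int64 element:", np_item_bytes, "bytes")     # 8 bytes
```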

In [15]:
import numpy as np

In [4]:
np?

Type:        module
String form: <module 'numpy' from 'C:\\OSGEO4~1\\apps\\Python37\\lib\\site-packages\\numpy\\__init__.py'>
File:        c:\osgeo4~1\apps\python37\lib\site-packages\numpy\__init__.py
Docstring:  
NumPy
=====

Provides
  1. An array object of arbitrary homogeneous items
  2. Fast mathematical operations over arrays
  3. Linear Algebra, Fourier Transforms, Random Number Generation

How to use the documentation
----------------------------
Documentation is available in two forms: docstrings provided
with the code, and a loose standing reference guide, available from
`the NumPy homepage <https://www.scipy.org>`_.

We recommend exploring the docstrings using
`IPython <https://ipython.org>`_, an advanced Python shell with
TAB-completion and introspection capabilities.  See below for further
instructions.

The docstring examples assume that `numpy` has been imported as `np`::

  >>> import numpy as np

Code snippets are indicated b

In [6]:
dir(np)

['ALLOW_THREADS',
 'AxisError',
 'BUFSIZE',
 'CLIP',
 'DataSource',
 'ERR_CALL',
 'ERR_DEFAULT',
 'ERR_IGNORE',
 'ERR_LOG',
 'ERR_PRINT',
 'ERR_RAISE',
 'ERR_WARN',
 'FLOATING_POINT_SUPPORT',
 'FPE_DIVIDEBYZERO',
 'FPE_INVALID',
 'FPE_OVERFLOW',
 'FPE_UNDERFLOW',
 'False_',
 'Inf',
 'Infinity',
 'MAXDIMS',
 'MAY_SHARE_BOUNDS',
 'MAY_SHARE_EXACT',
 'MachAr',
 'NAN',
 'NINF',
 'NZERO',
 'NaN',
 'PINF',
 'PZERO',
 'RAISE',
 'SHIFT_DIVIDEBYZERO',
 'SHIFT_INVALID',
 'SHIFT_OVERFLOW',
 'SHIFT_UNDERFLOW',
 'ScalarType',
 'Tester',
 'TooHardError',
 'True_',
 'UFUNC_BUFSIZE_DEFAULT',
 'UFUNC_PYVALS_NAME',
 'WRAP',
 '_NoValue',
 '_UFUNC_API',
 '__NUMPY_SETUP__',
 '__all__',
 '__builtins__',
 '__cached__',
 '__config__',
 '__dir__',
 '__doc__',
 '__file__',
 '__getattr__',
 '__git_revision__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_add_newdoc_ufunc',
 '_distributor_init',
 '_globals',
 '_mat',
 '_pytesttester',
 'abs',
 'absolute',
 'add',
 'add_

### Understand Python Data Types 

There is some overhead in storing an integer in Python as compared to C, as illustrated in the following figure:

![Integer Memory Layout](Resources/Figures/cint_vs_pyint.png)

An integer in C is a label for a position in memory whose bytes encode an integer value. A Python integer is a pointer to a position in memory containing all the Python object information, including the bytes (ob_digit) that contain the integer value. Everything in Python is an object, and this extra information is what allows Python to be coded so freely and dynamically. Now let's look at Python lists: because of dynamic typing, a list can contain items of different data types (a heterogeneous list).

In [59]:
L3 = [True, "2", 3.0, 4]
[type(item) for item in L3]

[bool, str, float, int]

The cost of this flexibility is that each item in the list must contain its own type info, even if all items are the same type. In this case, much of the information is redundant, and it can be much more efficient to store data in a fixed-type array (NumPy-style).

![Integer Memory Layout](Resources/Figures/array_vs_list.png)
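The difference can be estimated in code. The sketch below adds up the size of a list and of the integer objects it points to, and compares that with the contiguous buffer of a fixed-type array. The list figure is a rough estimate that varies by Python version and platform; the array size of 1000 is arbitrary.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# list: the container's pointer table plus every int object it refers to
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(item) for item in py_list)
# array: one contiguous buffer of fixed-size values
array_bytes = np_array.nbytes

print(list_bytes, array_bytes)
print(np_array.dtype)   # a single dtype shared by every element
```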

### Understand Basic Numpy Vectorization 

Sometimes, when all items in a list are numbers, we want to add a number to every item. For example:

In [6]:
list1= [1, 2, 3, 4]
list1

[1, 2, 3, 4]

In [8]:
# raises TypeError: a list and an int cannot be added
# list1 + 10

In [9]:
[i +10 for i in list1]

[11, 12, 13, 14]

In [10]:
list2= [10, 20, 30, 40]
list2

[10, 20, 30, 40]

In [12]:
# concatenates the two lists instead of adding them element-wise
# list1+list2

Although Python's built-in array module also provides array-based storage, the ndarray object of the NumPy package is much more useful, because NumPy adds efficient, vectorized operations on that data.

In [16]:
a = np.array([10, 20, 90, 50])
b = 50
# the next line raises a ValueError (foo is defined in a later cell):
# the truth value of the array expression a >= b is ambiguous
# foo(a,b)

In [17]:
a

array([10, 20, 90, 50])

In [18]:
a+b

array([ 60,  70, 140, 100])

In [62]:
def foo(a, b):
    if a >= b:
       return a + b
    else:
       return a - b

In [75]:
# Create a vectorized version of foo
vecfoo = np.vectorize(foo)
vecfoo(a,b)

array([-40, -30, 140, 100])
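The NumPy documentation describes `np.vectorize` as a convenience that is essentially a loop, not a performance tool. For a simple branch like the one in `foo`, `np.where` is a natively vectorized alternative. A minimal sketch, assuming the same `a` and `b` as above:

```python
import numpy as np

a = np.array([10, 20, 90, 50])
b = 50

# element-wise: a + b where a >= b, otherwise a - b
result = np.where(a >= b, a + b, a - b)
print(result)   # [-40 -30 140 100]
```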

In [22]:
c=[10,20]
d=1
results=[]
for i in c:
    results.append(d+i)
results

[11, 21]

In [77]:
[i+b if i>=b else i-b for i in a]

[-40, -30, 140, 100]

### Create Numpy Arrays from scratch

What if b is also an array? How do we add a and b? This is related to the later topic of broadcasting. Let's first look at how to create arrays from scratch using functions built into NumPy.

In [33]:
a = np.array([10, 20, 90, 100])
b = np.array([50,60])
# a+b
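As a brief preview of broadcasting, the sketch below shows why `a + b` fails for these two shapes and one way to make the shapes compatible. Adding a new axis with `np.newaxis` is just one option chosen for illustration; it produces every pairwise sum.

```python
import numpy as np

a = np.array([10, 20, 90, 100])   # shape (4,)
b = np.array([50, 60])            # shape (2,)

try:
    a + b
except ValueError as err:
    print("cannot broadcast:", err)

# giving a a trailing axis of length 1 makes shapes (4, 1) and (2,) compatible
pairwise = a[:, np.newaxis] + b
print(pairwise.shape)   # (4, 2)
print(pairwise)
```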

In [34]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

In [38]:
np.zeros(10, dtype='int16')

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int16)

| Data type	    | Description |
|---------------|-------------|
| ``bool_``     | Boolean (True or False) stored as a byte |
| ``int_``      | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)| 
| ``intc``      | Identical to C ``int`` (normally ``int32`` or ``int64``)| 
| ``intp``      | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)| 
| ``int8``      | Byte (-128 to 127)| 
| ``int16``     | Integer (-32768 to 32767)|
| ``int32``     | Integer (-2147483648 to 2147483647)|
| ``int64``     | Integer (-9223372036854775808 to 9223372036854775807)| 
| ``uint8``     | Unsigned integer (0 to 255)| 
| ``uint16``    | Unsigned integer (0 to 65535)| 
| ``uint32``    | Unsigned integer (0 to 4294967295)| 
| ``uint64``    | Unsigned integer (0 to 18446744073709551615)| 
| ``float_``    | Shorthand for ``float64``.| 
| ``float16``   | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa| 
| ``float32``   | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa| 
| ``float64``   | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa| 
| ``complex_``  | Shorthand for ``complex128``.| 
| ``complex64`` | Complex number, represented by two 32-bit floats| 
| ``complex128``| Complex number, represented by two 64-bit floats| 
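The ranges in the table can be checked with `np.iinfo` (for integer types) and `np.finfo` (for floating-point types). The sketch also shows that casting an out-of-range integer array with `astype` wraps around silently instead of raising an error; the value 300 is arbitrary.

```python
import numpy as np

for dtype in (np.int8, np.int16, np.uint8):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max)

print(np.finfo(np.float32).eps)   # machine epsilon of single precision

# values outside the target range wrap around when cast with astype
wrapped = np.array([300]).astype(np.uint8)
print(wrapped)   # [44], because 300 - 256 = 44
```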

In [34]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

In [34]:
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

In [36]:
# Create a 3x3 identity matrix
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [37]:
# Create an uninitialized array of three floats (the default dtype)
# The values will be whatever happens to already exist at that memory location
np.empty(3)

array([1., 1., 1.])

In [83]:
# Create an array filled with a linear sequence
# Starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [84]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [87]:
# create a 2D array
a = np.array([[1,2],[3,4],[5,6]])
print(a.shape)
print(a.ndim)
print(a.size)
a

(3, 2)
2
6


array([[1, 2],
       [3, 4],
       [5, 6]])

### Create Random Numbers 

In [35]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))

array([[0.40121118, 0.47760715, 0.78165763],
       [0.50812425, 0.33595897, 0.54053657],
       [0.02746623, 0.44913132, 0.57396857]])

In [41]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

array([[-0.32250767,  0.52947906,  0.93137695],
       [-0.76296453, -0.34475489, -0.28861167],
       [-0.80162761, -1.87238651, -0.43157154]])

In [47]:
# normal distribution
x = np.random.normal(loc=1, scale=2, size=(2, 3))
print(x)

[[-1.97949149  0.07088962 -0.55938813]
 [ 1.22186105  4.23425055 -0.62086966]]


In [42]:
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))

array([[9, 5, 5],
       [9, 6, 4],
       [1, 0, 1]])

In [44]:
x = np.random.choice([3, 5, 7, 9])
print(x)

3


In [45]:
x = np.random.choice([3, 5, 7, 9], size=(3, 5))
print(x)

[[7 5 7 7 5]
 [9 5 7 7 9]
 [5 3 5 5 7]]


In [46]:
# data distribution
x = np.random.choice([3, 5, 7, 9], p=[0.1, 0.3, 0.6, 0.0], size=(100))
print(x)

[7 3 7 7 7 7 3 7 5 7 3 7 7 7 7 7 7 7 7 7 5 3 7 7 7 7 7 7 7 7 5 3 7 7 5 7 7
 7 3 5 3 3 7 7 3 5 7 3 3 7 7 7 7 7 7 7 7 7 3 7 3 5 5 7 7 7 7 7 7 5 7 7 7 7
 7 5 7 5 5 7 7 7 7 7 5 5 3 7 7 7 7 7 7 7 5 7 5 7 7 7]


In [48]:
# binomial distribution
x = np.random.binomial(n=10, p=0.5, size=10)
print(x)

[7 5 3 7 6 4 5 4 4 4]


In [49]:
x = np.random.poisson(lam=2, size=10)
print(x)

[0 2 1 3 1 3 3 1 2 2]


In [50]:
x = np.random.uniform(size=(2, 3))

In [51]:
x = np.random.logistic(loc=1, scale=2, size=(2, 3))

In [52]:
x = np.random.multinomial(n=6, pvals=[1/6, 1/6, 1/6, 1/6, 1/6, 1/6])

In [53]:
x = np.random.exponential(scale=2, size=(2, 3))

In [54]:
x = np.random.chisquare(df=2, size=(2, 3))
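The outputs above change on every run. When reproducible results are needed, a seeded generator can be used. This sketch relies on `np.random.default_rng`, which assumes NumPy 1.17 or later; the cells above use the older `np.random.*` functions, which remain available. The seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
first = rng.normal(loc=0, scale=1, size=(2, 3))

# a second generator with the same seed reproduces the same draws
rng_again = np.random.default_rng(seed=42)
second = rng_again.normal(loc=0, scale=1, size=(2, 3))

print(np.array_equal(first, second))       # True
print(rng.integers(0, 10, size=(3, 3)))    # random integers in [0, 10)
```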

### Array Indexing 

In [68]:
x=np.random.normal(0, 10, (4, 4))
x

array([[ -6.4031933 ,  -8.80990286,  -8.52264142,   4.70639473],
       [-21.80222489, -13.17149572,   8.18129848,   5.48014523],
       [ -4.02745804,   8.09985177,   2.58034132,  -5.26890741],
       [ 22.20111718,  -4.708983  ,  20.58172123,  11.43269941]])

In [69]:
x[0,1]

-8.809902859676939

In [70]:
x[-1,1]

-4.708982998326853

### Array Slicing

Just as we can use square brackets to access individual array elements, we can also use them to access NumPy subarrays with the slice notation, marked by the colon (:) character.
```
x[start:stop:step]
```
If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1 (for a negative step, the defaults for start and stop are swapped). We'll take a look at accessing sub-arrays in one dimension and in multiple dimensions.

In [74]:
x[:3]  # first three rows

array([[ -6.4031933 ,  -8.80990286,  -8.52264142,   4.70639473],
       [-21.80222489, -13.17149572,   8.18129848,   5.48014523],
       [ -4.02745804,   8.09985177,   2.58034132,  -5.26890741]])

In [76]:
x[:3,2:] # row 1 to 3, column 3 to 4 

array([[-8.52264142,  4.70639473],
       [ 8.18129848,  5.48014523],
       [ 2.58034132, -5.26890741]])

In [77]:
x[:-1,1]

array([ -8.80990286, -13.17149572,   8.09985177])

In [79]:
x[::-1] #reverse row

array([[ 22.20111718,  -4.708983  ,  20.58172123,  11.43269941],
       [ -4.02745804,   8.09985177,   2.58034132,  -5.26890741],
       [-21.80222489, -13.17149572,   8.18129848,   5.48014523],
       [ -6.4031933 ,  -8.80990286,  -8.52264142,   4.70639473]])

In [81]:
x[::2]

array([[-6.4031933 , -8.80990286, -8.52264142,  4.70639473],
       [-4.02745804,  8.09985177,  2.58034132, -5.26890741]])

In [80]:
x[::-2]

array([[ 22.20111718,  -4.708983  ,  20.58172123,  11.43269941],
       [-21.80222489, -13.17149572,   8.18129848,   5.48014523]])

In [88]:
x=np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [91]:
x[3::2]

array([6, 4, 2, 0])

In [92]:
x[3::-2]

array([6, 8])

Note: if step is negative, start should be larger than stop.

In [98]:
x[1:8:2]

array([8, 6, 4, 2])

In [99]:
x[8:1:-2]

array([1, 3, 5, 7])

### Accessing array rows and columns 

In [100]:
x=np.random.normal(0, 10, (4, 4))
x

array([[  8.34601073,  11.15262237,  11.01328401,  -7.42832058],
       [-30.30107503, -16.8928445 ,  14.64111392, -11.54815956],
       [ -8.06878909,   3.52226986,  13.60124825,  -7.78128454],
       [ 11.78343973,   4.96366682, -23.03305137, -11.82068973]])

In [102]:
x[0] # first row, equals x[0,:]

array([ 8.34601073, 11.15262237, 11.01328401, -7.42832058])

In [104]:
x[:,0] # first column

array([  8.34601073, -30.30107503,  -8.06878909,  11.78343973])

### Subarrays as no-copy views 

One **important thing** to know about array slices is that they return views rather than copies of the array data (whereas slices of Python lists are copies). 

In [107]:
subarray = x[:,0]
subarray

array([  8.34601073, -30.30107503,  -8.06878909,  11.78343973])

In [109]:
subarray[1] = 100
subarray

array([  8.34601073, 100.        ,  -8.06878909,  11.78343973])

In [110]:
x

array([[  8.34601073,  11.15262237,  11.01328401,  -7.42832058],
       [100.        , -16.8928445 ,  14.64111392, -11.54815956],
       [ -8.06878909,   3.52226986,  13.60124825,  -7.78128454],
       [ 11.78343973,   4.96366682, -23.03305137, -11.82068973]])

We can make an independent copy of the subarray using the copy method.

In [111]:
subarray = x[:,0].copy()
subarray[1] = 1000
x

array([[  8.34601073,  11.15262237,  11.01328401,  -7.42832058],
       [100.        , -16.8928445 ,  14.64111392, -11.54815956],
       [ -8.06878909,   3.52226986,  13.60124825,  -7.78128454],
       [ 11.78343973,   4.96366682, -23.03305137, -11.82068973]])
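Whether two arrays share data can be checked directly with `np.shares_memory`. The sketch below uses a small integer array in place of the random one above so that the printed values are predictable; the names `view` and `copied` are chosen for illustration.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

view = x[:, 0]            # slice: a view onto x's data
copied = x[:, 0].copy()   # independent copy

print(np.shares_memory(x, view))     # True
print(np.shares_memory(x, copied))   # False

view[1] = 100     # changes x as well
copied[2] = -1    # leaves x untouched
print(x[1, 0], x[2, 0])              # 100 8
```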

Change a one-dimensional array into a two-dimensional array with the reshape method, or into a row or column matrix with np.newaxis. The size of the initial array must match the size of the reshaped array. Where possible, the reshape method will use a no-copy view of the initial array.

In [116]:
x =np.arange(1, 10)
x

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [119]:
y=x.reshape((3, 3))
y

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [121]:
y[1,1] =100
y

array([[  1,   2,   3],
       [  4, 100,   6],
       [  7,   8,   9]])

In [122]:
x

array([  1,   2,   3,   4, 100,   6,   7,   8,   9])

In [126]:
y

array([[  1,   2,   3],
       [  4, 100,   6],
       [  7,   8,   9]])

In [127]:
x[:,np.newaxis]

array([[  1],
       [  2],
       [  3],
       [  4],
       [100],
       [  6],
       [  7],
       [  8],
       [  9]])

In [129]:
x[np.newaxis,:] # note that this is a matrix with 1 row and 9 columns. 

array([[  1,   2,   3,   4, 100,   6,   7,   8,   9]])

### Array Concatenation and Splitting

np.concatenate takes a tuple or list of arrays as its first argument.

In [130]:
# combining array
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
arr

array([1, 2, 3, 4, 5, 6])

In [132]:
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])

In [133]:
arr = np.concatenate((arr1, arr2), axis=0)
arr

array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])

In [134]:
arr_axis1 = np.concatenate((arr1, arr2), axis=1)
arr_axis1

array([[1, 2, 5, 6],
       [3, 4, 7, 8]])

For working with arrays of mixed dimensions, it can be clearer to use the np.vstack (vertical stack) and np.hstack (horizontal stack) functions:

In [136]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr_h = np.hstack((arr1, arr2))
arr_h

array([1, 2, 3, 4, 5, 6])

In [138]:
arr_v = np.vstack((arr1, arr2))
arr_v 

array([[1, 2, 3],
       [4, 5, 6]])

In [147]:
arr_v.shape

(2, 3)
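The examples above stack arrays of the same dimension. As a small sketch of the mixed-dimension case, the code below stacks a one-dimensional array onto a two-dimensional one; the names row, grid, and col are made up for illustration.

```python
import numpy as np

row = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# vstack treats the 1-D array as a single row and stacks it on top of the grid.
stacked_v = np.vstack([row, grid])
print(stacked_v.shape)   # (3, 3)

# hstack appends a column; the column must have the same number of rows as grid.
col = np.array([[99],
                [99]])
stacked_h = np.hstack([grid, col])
print(stacked_h.shape)   # (2, 4)
```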

Similarly, np.dstack will stack arrays along the third axis.

In [142]:
arr1

array([1, 2, 3])

In [143]:
arr2

array([4, 5, 6])

In [146]:
arr_d = np.dstack((arr1, arr2))
arr_d 

array([[[1, 4],
        [2, 5],
        [3, 6]]])

In [145]:
arr_d.shape

(1, 3, 2)

### Splitting of Arrays

The opposite of concatenation is splitting, which is implemented by the functions np.split, np.hsplit, and np.vsplit. For each of these, we can pass a list of indices giving the split points.

array_split allows indices_or_sections to be an integer that does not equally divide the axis. For an array of length l that should be split into n sections, it returns l % n sub-arrays of size l//n + 1 and the rest of size l//n.

In [22]:
# split array
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])
newarr1,newarr2 = np.array_split(arr, 2, axis=1)
newarr1,newarr2

(array([[ 1,  2],
        [ 4,  5],
        [ 7,  8],
        [10, 11],
        [13, 14],
        [16, 17]]),
 array([[ 3],
        [ 6],
        [ 9],
        [12],
        [15],
        [18]]))

In [23]:
x1, x2, x3 = np.split(arr, [3, 5])
x1, x2, x3 

(array([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]),
 array([[10, 11, 12],
        [13, 14, 15]]),
 array([[16, 17, 18]]))

In [162]:
haha = np.array_split(arr, 4) #array_split does not require equal size
haha

[array([[1, 2, 3],
        [4, 5, 6]]),
 array([[ 7,  8,  9],
        [10, 11, 12]]),
 array([[13, 14, 15]]),
 array([[16, 17, 18]])]
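The size rule quoted earlier for array_split can be checked directly. This is a minimal sketch on a 6-row array split into 4 sections; length, sections, and expected are illustrative names, not part of NumPy.

```python
import numpy as np

arr = np.arange(18).reshape((6, 3))   # 6 rows, as in the example above
length, sections = arr.shape[0], 4

parts = np.array_split(arr, sections)
sizes = [len(p) for p in parts]
print(sizes)                          # [2, 2, 1, 1]

# length % sections parts get length // sections + 1 rows; the rest get length // sections.
extra = length % sections
expected = [length // sections + 1] * extra + [length // sections] * (sections - extra)
print(sizes == expected)              # True

# np.split, by contrast, raises ValueError when the split is not even.
try:
    np.split(arr, sections)
except ValueError as err:
    print("np.split failed:", err)
```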

In [None]:
# search
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
x
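The search cell above has no recorded output. As a sketch of what to expect: when np.where is called with only a condition, it returns a tuple containing one index array per dimension. The names matches and positions are illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])

# With only a condition, np.where returns a tuple of index arrays.
matches = np.where(arr == 4)
print(len(matches))          # 1 (one entry per dimension of arr)

# The first entry holds the positions where the condition is true.
positions = matches[0]
print(positions.tolist())    # [3, 5, 6]

# The index array can be used directly for fancy indexing.
print(arr[positions])        # [4 4 4]
```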

### NumPy Arrays Universal Functions

The following table provides a list of useful aggregation functions available in NumPy:

|Function Name      |   NaN-safe Version  | Description                                   |
|-------------------|---------------------|-----------------------------------------------|
| ``np.sum``        | ``np.nansum``       | Compute sum of elements                       |
| ``np.prod``       | ``np.nanprod``      | Compute product of elements                   |
| ``np.mean``       | ``np.nanmean``      | Compute mean of elements                      |
| ``np.std``        | ``np.nanstd``       | Compute standard deviation                    |
| ``np.var``        | ``np.nanvar``       | Compute variance                              |
| ``np.min``        | ``np.nanmin``       | Find minimum value                            |
| ``np.max``        | ``np.nanmax``       | Find maximum value                            |
| ``np.argmin``     | ``np.nanargmin``    | Find index of minimum value                   |
| ``np.argmax``     | ``np.nanargmax``    | Find index of maximum value                   |
| ``np.median``     | ``np.nanmedian``    | Compute median of elements                    |
| ``np.percentile`` | ``np.nanpercentile``| Compute rank-based statistics of elements     |
| ``np.any``        | N/A                 | Evaluate whether any elements are true        |
| ``np.all``        | N/A                 | Evaluate whether all elements are true        |
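To illustrate the table, the sketch below compares an ordinary aggregate with its NaN-safe version and shows the effect of the axis argument. The array scores is made-up data.

```python
import numpy as np

scores = np.array([[1.0, 2.0, np.nan],
                   [4.0, 5.0, 6.0]])

# A single NaN propagates through the ordinary aggregate...
print(np.sum(scores))                 # nan
# ...while the NaN-safe version ignores missing values.
print(np.nansum(scores))              # 18.0

# axis=0 collapses the rows, giving one result per column.
print(np.nanmax(scores, axis=0))      # [4. 5. 6.]
# axis=1 collapses the columns, giving one result per row.
print(np.nanmean(scores, axis=1))     # [1.5 5. ]
```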

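Python's built-in functions also work on NumPy arrays, but they loop over the elements one at a time in the interpreter. The bottleneck is the type-checking and function dispatch that CPython must do at each cycle of the loop: for every element, Python first examines the object's type and does a dynamic lookup of the correct function to use for that type. NumPy's own functions run the same loop in compiled code, which is why np.sum is far faster than the built-in sum in the timing below.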

In [24]:
big_array = np.random.randint(1, 100, size=1000000)

In [25]:
%timeit sum(big_array)
%timeit np.sum(big_array)

69.4 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
346 µs ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
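The per-element overhead can be made visible by writing the loop out by hand. The sketch below computes reciprocals both ways and checks that the results agree; compute_reciprocals is a hypothetical helper written for this illustration, not a NumPy function.

```python
import numpy as np

def compute_reciprocals(values):
    """Compute 1/x element by element in a Python-level loop."""
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

values = np.random.randint(1, 10, size=5)

looped = compute_reciprocals(values)    # type check and dispatch on every element
vectorized = 1.0 / values               # one call; the loop runs in compiled code

print(np.allclose(looped, vectorized))  # True
```

Timing the two versions with %timeit on a large array should show a gap similar to the sum example above.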


In [172]:
x = np.arange(9).reshape((3, 3))
2 ** x

array([[  1,   2,   4],
       [  8,  16,  32],
       [ 64, 128, 256]], dtype=int32)

In [175]:
x = np.arange(4)
print("x     =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2)  # floor division
print("-x     = ", -x)
print("x ** 3 = ", x ** 3)
print("x % 3  = ", x % 3)

x     = [0 1 2 3]
x + 5 = [5 6 7 8]
x - 5 = [-5 -4 -3 -2]
x * 2 = [0 2 4 6]
x / 2 = [0.  0.5 1.  1.5]
x // 2 = [0 0 1 1]
-x     =  [ 0 -1 -2 -3]
x ** 3 =  [ 0  1  8 27]
x % 3  =  [0 1 2 0]


### Exercise Numpy

1. Use NumPy to create a 5 * 5 matrix.
2. Sum the matrix by rows.
3. Select the 1st, 2nd, and 5th columns of the matrix.
4. Create a 5 * 3 NumPy matrix.
5. Concatenate the two matrices along columns.
6. Calculate the mean of each column.

Hint: 
``` 
x = np.arange(9).reshape((3, 3)) # create a 3 * 3 matrix
x.sum(axis=0) # sum over the rows (axis 0), giving one total per column
```

## Pandas 

Pandas is a package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.

NumPy's limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting, such as groupings and pivots. Pandas builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.

In [33]:
import pandas as pd

### Pandas Series Object

The Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes. The values are simply a familiar NumPy array.

In [18]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [19]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [36]:
data.index # unlike NumPy arrays, Index objects are immutable

RangeIndex(start=0, stop=4, step=1)

The essential difference between a NumPy array and a Pandas Series is: 
* the NumPy array has an implicitly defined integer index used to access the values.
* the Pandas Series has an explicitly defined index associated with the values.
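Because the index is explicit, it does not have to be a contiguous range starting at zero. The following sketch uses an arbitrary integer index (the labels 2, 5, 3, 7 are made up) to show that a lookup goes through the labels rather than the positions.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# The label 5 is looked up in the explicit index, not counted as a position.
print(data[5])          # 0.5

# Positional access goes through iloc instead.
print(data.iloc[0])     # 0.25
```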

In [21]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [22]:
data.b

0.5

In [23]:
data['b']

0.5

We can think of a Pandas Series a bit like a specialization of a Python dictionary. 

A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.

In [6]:
# Python Dictionary

population_dict = {'PHILLIP': 0,
                   'TIONG BAHRU': 12830,
                   'CRAWFORD': 9200,
                   'MOUNT EMILY': 1210,
                   'BOULEVARD': 420}
population = pd.Series(population_dict)
population

PHILLIP            0
TIONG BAHRU    12830
CRAWFORD        9200
MOUNT EMILY     1210
BOULEVARD        420
dtype: int64

Unlike a dictionary, the Series supports array-style operations such as slicing:

In [27]:
population['PHILLIP':'MOUNT EMILY'] # a plain dictionary does not support this; try population_dict['PHILLIP':'MOUNT EMILY']

PHILLIP            0
TIONG BAHRU    12830
CRAWFORD        9200
MOUNT EMILY     1210
dtype: int64

### Series Indexing and Selection

In [39]:
population

PHILLIP            0
TIONG BAHRU    12830
CRAWFORD        9200
MOUNT EMILY     1210
BOULEVARD        420
dtype: int64

In [40]:
population['PHILLIP']

0

In [41]:
population.PHILLIP

0

In [42]:
'CRAWFORD' in population

True

In [44]:
9200 in population

False

In [None]:
x = [1,2,3]

1 in x

True

In [None]:
y = {'a':1,'b':2}
'a' in y

True
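The cells above show that, like a dictionary, a Series tests membership against its index labels rather than its values. A minimal sketch of how to search the values instead, using a shortened copy of the population data:

```python
import pandas as pd

population = pd.Series({'PHILLIP': 0, 'TIONG BAHRU': 12830, 'CRAWFORD': 9200})

print('CRAWFORD' in population)         # True: 'CRAWFORD' is an index label
print(9200 in population)               # False: 9200 is a value, not a label
print(9200 in population.values)        # True: search the underlying array
print((population == 9200).any())       # True: element-wise comparison
```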

In [63]:
population.keys()

Index(['PHILLIP', 'TIONG BAHRU', 'CRAWFORD', 'MOUNT EMILY', 'BOULEVARD'], dtype='object')

In [61]:
population.values

array([    0, 12830,  9200,  1210,   420], dtype=int64)

In [59]:
population.items()

<zip at 0x18c245f49c0>

In [60]:
list(population.items())

[('PHILLIP', 0),
 ('TIONG BAHRU', 12830),
 ('CRAWFORD', 9200),
 ('MOUNT EMILY', 1210),
 ('BOULEVARD', 420)]

In [66]:
population['TEST']=300

In [67]:
population

PHILLIP            0
TIONG BAHRU    12830
CRAWFORD        9200
MOUNT EMILY     1210
BOULEVARD        420
TEST             300
dtype: int64

In [68]:
# slicing by explicit index
population['PHILLIP':'CRAWFORD']

PHILLIP            0
TIONG BAHRU    12830
CRAWFORD        9200
dtype: int64

In [69]:
# slicing by implicit integer index
population[0:2]

PHILLIP            0
TIONG BAHRU    12830
dtype: int64
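Note that the two slices above treat their end points differently: the label slice includes its last label, while the integer slice stops before its last position. The sketch below makes the distinction explicit with the loc and iloc indexers, using a shortened copy of the population data.

```python
import pandas as pd

population = pd.Series({'PHILLIP': 0, 'TIONG BAHRU': 12830,
                        'CRAWFORD': 9200, 'MOUNT EMILY': 1210})

# Label-based slice: the end label is included.
by_label = population.loc['PHILLIP':'CRAWFORD']
print(len(by_label))       # 3

# Position-based slice: the end position is excluded.
by_position = population.iloc[0:2]
print(len(by_position))    # 2
```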

In [71]:
# masking
population[(population > 1000) & (population < 10000)]

CRAWFORD       9200
MOUNT EMILY    1210
dtype: int64

In [72]:
# fancy indexing
population[['PHILLIP', 'CRAWFORD']]

PHILLIP        0
CRAWFORD    9200
dtype: int64

### Pandas DataFrame Object

A Pandas DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. 

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.

In [7]:
area_dict = {'PHILLIP': 39437,
                   'TIONG BAHRU': 448127,
                   'CRAWFORD': 850853,
                   'MOUNT EMILY': 193992,
                   'BOULEVARD': 460550}
area = pd.Series(area_dict)
                 
sing = pd.DataFrame({'population': population,
                       'area': area})
sing

|             | population | area   |
|-------------|------------|--------|
| PHILLIP     | 0          | 39437  |
| TIONG BAHRU | 12830      | 448127 |
| CRAWFORD    | 9200       | 850853 |
| MOUNT EMILY | 1210       | 193992 |
| BOULEVARD   | 420        | 460550 |
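The "aligned" part matters when the Series do not share exactly the same labels. In this sketch, with deliberately mismatched made-up data, rows are matched by label and gaps are filled with NaN.

```python
import pandas as pd

population = pd.Series({'PHILLIP': 0, 'CRAWFORD': 9200})
area = pd.Series({'CRAWFORD': 850853, 'BOULEVARD': 460550})

# Rows are matched by index label, not by position.
frame = pd.DataFrame({'population': population, 'area': area})
print(frame)

# Labels missing from one Series are filled with NaN.
print(frame.loc['PHILLIP', 'area'])   # nan
```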


In [33]:
sing.index

Index(['PHILLIP', 'TIONG BAHRU', 'CRAWFORD', 'MOUNT EMILY', 'BOULEVARD'], dtype='object')

In [34]:
sing.columns

Index(['population', 'area'], dtype='object')

Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.

In [35]:
sing['area']

PHILLIP         39437
TIONG BAHRU    448127
CRAWFORD       850853
MOUNT EMILY    193992
BOULEVARD      460550
Name: area, dtype: int64

Notice the difference here:
* In a two-dimensional NumPy array, data[0] will return the first **row**. 
* For a DataFrame, data['col0'] will return the first **column**. 

Because of this, it is probably better to think about DataFrames as generalized dictionaries rather than generalized arrays, though both ways of looking at the situation can be useful.
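The row-versus-column difference can be seen side by side in this small sketch; the column names col0 and col1 are placeholders.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4]])
frame = pd.DataFrame(values, columns=['col0', 'col1'])

print(values[0])                 # [1 2]  -> first row of the array
print(frame['col0'].tolist())    # [1, 3] -> first column of the DataFrame
print(frame.iloc[0].tolist())    # [1, 2] -> first row, via positional indexing
```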

### DataFrame Indexing and Selection

First, we can treat the DataFrame as a dictionary of related Series objects. The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

In [73]:
sing['area']

PHILLIP         39437
TIONG BAHRU    448127
CRAWFORD       850853
MOUNT EMILY    193992
BOULEVARD      460550
Name: area, dtype: int64

In [74]:
sing.area

PHILLIP         39437
TIONG BAHRU    448127
CRAWFORD       850853
MOUNT EMILY    193992
BOULEVARD      460550
Name: area, dtype: int64

In [76]:
sing

|             | population | area   |
|-------------|------------|--------|
| PHILLIP     | 0          | 39437  |
| TIONG BAHRU | 12830      | 448127 |
| CRAWFORD    | 9200       | 850853 |
| MOUNT EMILY | 1210       | 193992 |
| BOULEVARD   | 420        | 460550 |


In [77]:
sing['density'] = sing['population'] / sing['area']

In [78]:
sing

|             | population | area   | density  |
|-------------|------------|--------|----------|
| PHILLIP     | 0          | 39437  | 0.0      |
| TIONG BAHRU | 12830      | 448127 | 0.02863  |
| CRAWFORD    | 9200       | 850853 | 0.010813 |
| MOUNT EMILY | 1210       | 193992 | 0.006237 |
| BOULEVARD   | 420        | 460550 | 0.000912 |


We can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute:

In [79]:
sing.values

array([[0.00000000e+00, 3.94370000e+04, 0.00000000e+00],
       [1.28300000e+04, 4.48127000e+05, 2.86302767e-02],
       [9.20000000e+03, 8.50853000e+05, 1.08126786e-02],
       [1.21000000e+03, 1.93992000e+05, 6.23737061e-03],
       [4.20000000e+02, 4.60550000e+05, 9.11953100e-04]])

In [85]:
sing.index

Index(['PHILLIP', 'TIONG BAHRU', 'CRAWFORD', 'MOUNT EMILY', 'BOULEVARD'], dtype='object')

In [83]:
sing.keys() # keys is a method; like a dictionary, it returns the column names

Index(['population', 'area', 'density'], dtype='object')

### Additional indexing conventions

In [87]:
sing['PHILLIP':'MOUNT EMILY']

|             | population | area   | density  |
|-------------|------------|--------|----------|
| PHILLIP     | 0          | 39437  | 0.0      |
| TIONG BAHRU | 12830      | 448127 | 0.02863  |
| CRAWFORD    | 9200       | 850853 | 0.010813 |
| MOUNT EMILY | 1210       | 193992 | 0.006237 |


In [129]:
sing.loc['CRAWFORD':,:'density']

|             | population | area   | density  |
|-------------|------------|--------|----------|
| CRAWFORD    | 9200       | 850853 | 0.010813 |
| MOUNT EMILY | 1210       | 193992 | 0.006237 |
| BOULEVARD   | 420        | 460550 | 0.000912 |


In [130]:
sing.iloc[1:,:2]

|             | population | area   |
|-------------|------------|--------|
| TIONG BAHRU | 12830      | 448127 |
| CRAWFORD    | 9200       | 850853 |
| MOUNT EMILY | 1210       | 193992 |
| BOULEVARD   | 420        | 460550 |


In [10]:
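# a stop of -1 excludes the last column (here, density)
sing.iloc[:,:-1] 

|             | population | area   |
|-------------|------------|--------|
| PHILLIP     | 0          | 39437  |
| TIONG BAHRU | 12830      | 448127 |
| CRAWFORD    | 9200       | 850853 |
| MOUNT EMILY | 1210       | 193992 |
| BOULEVARD   | 420        | 460550 |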
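The difference between the two indexers used above can be summarised in a short sketch: loc selects by label and includes both end points, while iloc selects by integer position and excludes the stop. The frame below is a cut-down, illustrative copy of the data.

```python
import pandas as pd

frame = pd.DataFrame({'population': [0, 12830, 9200],
                      'area': [39437, 448127, 850853]},
                     index=['PHILLIP', 'TIONG BAHRU', 'CRAWFORD'])

# loc: labels, end points included.
by_label = frame.loc['PHILLIP':'TIONG BAHRU', 'population':'area']
print(by_label.shape)                      # (2, 2)

# iloc: integer positions, stop excluded.
by_position = frame.iloc[0:1, 0:1]
print(by_position.shape)                   # (1, 1)

# A negative stop counts from the end: drop the last column.
print(list(frame.iloc[:, :-1].columns))    # ['population']
```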


In [132]:
sing[area>50000]

|             | population | area   | density  |
|-------------|------------|--------|----------|
| TIONG BAHRU | 12830      | 448127 | 0.02863  |
| CRAWFORD    | 9200       | 850853 | 0.010813 |
| MOUNT EMILY | 1210       | 193992 | 0.006237 |
| BOULEVARD   | 420        | 460550 | 0.000912 |


In [120]:
sing.loc[area>50000,['population']]

|             | population |
|-------------|------------|
| TIONG BAHRU | 12830      |
| CRAWFORD    | 9200       |
| MOUNT EMILY | 1210       |
| BOULEVARD   | 420        |


In [123]:
sing.loc[area>50000,'population']

TIONG BAHRU    12830
CRAWFORD        9200
MOUNT EMILY     1210
BOULEVARD        420
Name: population, dtype: int64

In [122]:
type(sing.loc[area>50000,'population'])

pandas.core.series.Series
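The last three cells differ only in how the column is named: a list of labels keeps the result two-dimensional, while a single label returns a Series. A minimal sketch with made-up rows:

```python
import pandas as pd

frame = pd.DataFrame({'population': [0, 12830], 'area': [39437, 448127]},
                     index=['PHILLIP', 'TIONG BAHRU'])
mask = frame['area'] > 50000

as_frame = frame.loc[mask, ['population']]   # list of labels -> DataFrame
as_series = frame.loc[mask, 'population']    # single label   -> Series

print(type(as_frame).__name__)    # DataFrame
print(type(as_series).__name__)   # Series
```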

In [114]:
sing[area>50000].filter(['density','population'])

|             | density  | population |
|-------------|----------|------------|
| TIONG BAHRU | 0.02863  | 12830      |
| CRAWFORD    | 0.010813 | 9200       |
| MOUNT EMILY | 0.006237 | 1210       |
| BOULEVARD   | 0.000912 | 420        |


### Pandas DataFrame Manipulation

In [95]:
df = pd.DataFrame(
[[4, 7, 10],
[5, 8, 11],
[6, 9, 12]],
index=[1, 2, 3],
columns=['a', 'b', 'c'])

In [96]:
df

|   | a | b | c  |
|---|---|---|----|
| 1 | 4 | 7 | 10 |
| 2 | 5 | 8 | 11 |
| 3 | 6 | 9 | 12 |


In [47]:
pd_melt = pd.melt(df, id_vars='a')
pd_melt

|   | a | variable | value |
|---|---|----------|-------|
| 0 | 4 | b        | 7     |
| 1 | 5 | b        | 8     |
| 2 | 6 | b        | 9     |
| 3 | 4 | c        | 10    |
| 4 | 5 | c        | 11    |
| 5 | 6 | c        | 12    |


In [51]:
pd.pivot(pd_melt, index ='a',columns='variable',values='value')

| a | b | c  |
|---|---|----|
| 4 | 7 | 10 |
| 5 | 8 | 11 |
| 6 | 9 | 12 |
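The two cells above are inverses of each other: melt turns a wide table into a long one, and pivot spreads it back out. The sketch below walks through the round trip on the same numbers; wide, long, and restored are illustrative names.

```python
import pandas as pd

wide = pd.DataFrame({'a': [4, 5, 6], 'b': [7, 8, 9], 'c': [10, 11, 12]})

# Wide -> long: one row per (a, variable) pair.
long = pd.melt(wide, id_vars='a')
print(long.shape)                      # (6, 3)

# Long -> wide: spread 'variable' back into columns, indexed by 'a'.
restored = long.pivot(index='a', columns='variable', values='value')
print(restored.loc[5, 'c'])            # 11

# Move 'a' back into a column and drop the 'variable' label on the columns.
restored = restored.reset_index().rename_axis(columns=None)
print(list(restored.columns))          # ['a', 'b', 'c']
```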


In [57]:
df.a.unique()

array([4, 5, 6], dtype=int64)

In [60]:
df

|   | a | b | c  |
|---|---|---|----|
| 1 | 4 | 7 | 10 |
| 2 | 5 | 8 | 11 |
| 3 | 6 | 9 | 12 |


In [97]:
# make new columns
df = df.assign(d = df.a +df.b)
df

|   | a | b | c  | d  |
|---|---|---|----|----|
| 1 | 4 | 7 | 10 | 11 |
| 2 | 5 | 8 | 11 | 13 |
| 3 | 6 | 9 | 12 | 15 |


In [98]:
df['e']=df.c + df.d

In [99]:
df

|   | a | b | c  | d  | e  |
|---|---|---|----|----|----|
| 1 | 4 | 7 | 10 | 11 | 21 |
| 2 | 5 | 8 | 11 | 13 | 24 |
| 3 | 6 | 9 | 12 | 15 | 27 |


In [100]:
df2=pd.DataFrame(np.array([[1,2,3,4,5]]),columns=list('abcde'))
df2

|   | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 |


In [107]:
df3 = pd.concat([df, df2])
df3

|   | a | b | c  | d  | e  |
|---|---|---|----|----|----|
| 1 | 4 | 7 | 10 | 11 | 21 |
| 2 | 5 | 8 | 11 | 13 | 24 |
| 3 | 6 | 9 | 12 | 15 | 27 |
| 0 | 1 | 2 | 3  | 4  | 5  |
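Notice that concat keeps the original row labels, which is why the index above reads 1, 2, 3, 0. If the labels carry no meaning, one option is the ignore_index argument of pd.concat, sketched here on a smaller made-up pair of frames.

```python
import pandas as pd

top = pd.DataFrame({'a': [4, 5, 6], 'b': [7, 8, 9]}, index=[1, 2, 3])
bottom = pd.DataFrame({'a': [1], 'b': [2]})

# Default: the original row labels are kept.
kept = pd.concat([top, bottom])
print(list(kept.index))        # [1, 2, 3, 0]

# ignore_index=True discards the labels and numbers the rows from 0.
renumbered = pd.concat([top, bottom], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3]
```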


In [108]:
df4=pd.concat([df3, pd.DataFrame({'index':range(4)})],axis=1)
df4 = df4.set_index('index')
df4

| index | a | b | c  | d  | e  |
|-------|---|---|----|----|----|
| 0     | 1 | 2 | 3  | 4  | 5  |
| 1     | 4 | 7 | 10 | 11 | 21 |
| 2     | 5 | 8 | 11 | 13 | 24 |
| 3     | 6 | 9 | 12 | 15 | 27 |


In [109]:
df5=pd.concat([df4, pd.DataFrame({'category':list('ab')*2})],axis=1)
df5

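|   | a | b | c  | d  | e  | category |
|---|---|---|----|----|----|----------|
| 0 | 1 | 2 | 3  | 4  | 5  | a        |
| 1 | 4 | 7 | 10 | 11 | 21 | b        |
| 2 | 5 | 8 | 11 | 13 | 24 | a        |
| 3 | 6 | 9 | 12 | 15 | 27 | b        |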


In [123]:
df5.sort_values('category')

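|   | a | b | c  | d  | e  | category |
|---|---|---|----|----|----|----------|
| 0 | 1 | 2 | 3  | 4  | 5  | a        |
| 2 | 5 | 8 | 11 | 13 | 24 | a        |
| 1 | 4 | 7 | 10 | 11 | 21 | b        |
| 3 | 6 | 9 | 12 | 15 | 27 | b        |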


In [122]:
df5.groupby('category').agg('mean')

| category | a   | b   | c    | d    | e    |
|----------|-----|-----|------|------|------|
| a        | 3.0 | 5.0 | 7.0  | 8.5  | 14.5 |
| b        | 5.0 | 8.0 | 11.0 | 13.0 | 24.0 |


In [128]:
df5.groupby('category').apply(np.mean)

| category | a   | b   | c    | d    | e    |
|----------|-----|-----|------|------|------|
| a        | 3.0 | 5.0 | 7.0  | 8.5  | 14.5 |
| b        | 5.0 | 8.0 | 11.0 | 13.0 | 24.0 |
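The agg method also accepts a list of function names, which computes several statistics per group in one pass. This is a small sketch on a reduced, illustrative copy of the data; selecting the numeric columns first keeps the aggregation away from the grouping key.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 4, 5, 6],
                      'e': [5, 21, 24, 27],
                      'category': ['a', 'b', 'a', 'b']})

# Select the numeric columns, then apply several aggregations at once.
summary = frame.groupby('category')[['a', 'e']].agg(['mean', 'max'])
print(summary)

# The result has two-level column labels: (column, aggregation).
print(summary.loc['a', ('e', 'mean')])   # 14.5
```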


![image.png](attachment:4333c508-3809-4ba4-8fa1-0d886e84bd5b.png)

### Exercise Pandas

1. Use Pandas to create two 5 * 5 DataFrames. DataFrame 1 has columns a, b, c, d, e, and DataFrame 2 has columns a, f, g, h, i.
2. The two DataFrames contain random integers from 1 to 100 generated by np.random. 
    >Hint:  
    Create a 3x3 array of random integers in the interval [0, 10)  
    np.random.randint(0, 10, (3, 3))

3. Merge the two DataFrames into DataFrame 3 on column a.
4. For DataFrame 3, create a new column x = b + f.
5. Create a new column y that is True where dataframe3.e > dataframe3.i and False otherwise.  
6. Group DataFrame 3 by column y, and calculate the mean of columns x, f, and h.

# References

[Introduction to Python](https://jupyter.brynmawr.edu/services/public/dblank/CS245%20Programming%20Languages/2016-Fall/Labs/Chapter%2002%20-%20Introduction%20to%20Python.ipynb)

[How to use Jupyter Notebook](https://athena.brynmawr.edu/jupyter/hub/dblank/public/Jupyter%20Notebook%20Users%20Manual.ipynb)

[Tutorial Introduction to Python](https://www.phon.ucl.ac.uk/courses/pals0039/)