## Numpy arrays

## generate integer vectors 
Use `range()`, which accepts one, two, or three arguments. The stop value is never included.
- 1 parameter: `range(5)` returns 0 to 4 (5 is not included).
- 2 parameters: `range(3, 6)` returns 3 to 5 (6 is not included).
- 3 parameters: `range(4, 10, 2)` returns 4, 6, 8 (10 is not included).
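
A quick way to check the three forms listed above is to turn each range into a list. This is only a minimal sketch; the variable names are made up for illustration.

```python
# Materialize each range as a list to see exactly which values it yields.
one_arg = list(range(5))            # start defaults to 0, stop (5) is excluded
two_args = list(range(3, 6))        # start=3, stop=6 (excluded)
three_args = list(range(4, 10, 2))  # start=4, stop=10 (excluded), step=2

print(one_arg)     # [0, 1, 2, 3, 4]
print(two_args)    # [3, 4, 5]
print(three_args)  # [4, 6, 8]
```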

In [None]:
import numpy as np

In [13]:
for i in range(5):
    print(i)


0
1
2
3
4


### self-defined float range (`frange`)

In [16]:
def frange(start, stop, step):
    i = start
    while i < stop:
        yield i
        i += step
        
for i in frange(0.5, 1.0, 0.1):
    print(i)

0.5
0.6
0.7
0.7999999999999999
0.8999999999999999
0.9999999999999999
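
The output above shows floating-point drift: repeatedly adding `0.1` accumulates rounding error, so the loop also yields `0.9999999999999999`, a value that should have been excluded as `1.0`. One possible workaround, sketched here under the assumption that the number of points is known up front, is `np.linspace` with `endpoint=False`. The variable name `values` is illustrative only.

```python
import numpy as np

# Assumption: we want exactly five evenly spaced values starting at 0.5
# and stopping before 1.0. Asking for a count avoids the `while i < stop`
# comparison that goes wrong when rounding error accumulates.
values = np.linspace(0.5, 1.0, num=5, endpoint=False)

for v in values:
    print(v)
```

Because the count is fixed, the result always has five elements regardless of rounding in the individual values.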


In [1]:
np_array1 = np.array([1,2,3])
np_array2 = np.array([4,5,6])
meas = np.array([np_array1, np_array2]) # stacks the two 1D arrays into one 2D array of shape (2, 3)
print(meas)

[[1 2 3]
 [4 5 6]]


### vector (1D array)
For a 1D array, use `len()` or `.size` to get the number of elements.
`.shape` also works: it returns a tuple with one entry per dimension, so a 1D array of length 3 gives `(3,)`.

In [2]:
len(np_array1)

3

In [11]:
np_array1.shape

(3,)

In [12]:
np_array1.size

3
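
To make the relationship between `len()`, `.shape`, and `.size` concrete, here is a small sketch comparing a 1D and a 2D array. The names `vec` and `grid` are illustrative and not part of the notes above.

```python
import numpy as np

vec = np.array([1, 2, 3])                  # 1D: one axis of length 3
grid = np.array([[1, 2, 3], [4, 5, 6]])    # 2D: 2 rows, 3 columns

print(vec.shape, vec.ndim, vec.size)       # (3,) 1 3
print(grid.shape, grid.ndim, grid.size)    # (2, 3) 2 6

# len() only counts along the first axis, so it differs from .size in 2D.
print(len(vec), len(grid))                 # 3 2
```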

### matrix

Use `.shape` to get the dimensions as `(rows, columns)` and `.size` to get the total number of elements. Both are attributes, not methods, so they take no parentheses.

In [4]:
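# np.matrix() builds a strictly 2D matrix. A list of 1D arrays also works:
# B = np.matrix([np_array1,
#                np_array2])
# Note: the NumPy docs now recommend regular 2D arrays (np.array) over np.matrix.
A = np.matrix([[1,2],
               [3,4]])
A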

matrix([[1, 2],
        [3, 4]])

In [8]:
A.shape

(2, 2)

In [10]:
A.size

4

### List and index method

In [17]:
rw = [0]
print(rw)

rw.append(3)
print(rw)
rw.index(3)    # index(value) returns the position of the first matching element

[0]
[0, 3]


1

In [1]:
# 2 lists, use the index to connect 

pop = [30.55, 2.77, 39.21]
countries = ['afghanistan', 'albania', 'algeria']
index_alb = countries.index('albania')  # similar to which() in R 

# print index
index_alb

1

In [2]:
pop[index_alb]

2.77

## dictionary

The key is the country name and the value is the population. Keys must be immutable (hashable) and are unique within a dictionary: assigning to an existing key overwrites its value.
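
The two key rules can be checked with a short sketch. The dictionaries and values here (`coords`, `demo`, the coordinate pair) are made up purely for illustration.

```python
# Tuples are immutable, so they are valid dictionary keys.
coords = {(34.5, 69.2): 'kabul'}

# Lists are mutable (unhashable), so using one as a key raises TypeError.
try:
    bad = {[34.5, 69.2]: 'kabul'}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

# Keys are unique: assigning to an existing key overwrites the value
# instead of creating a second entry.
demo = {'albania': 2.77}
demo['albania'] = 2.8

print(list_key_allowed)  # False
print(demo)              # {'albania': 2.8}
```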

In [3]:
world = {'afghanistan': 30.55, 
         'albania': 2.77, 
         'algeria': 39.21}
world['albania']

2.77

In [4]:
world.keys()   # keys() is a method, so it is called on the object with dot notation; there is no built-in keys(world) function

dict_keys(['afghanistan', 'albania', 'algeria'])

In [5]:
# to add a new pair

world['sealand'] = 0.000027
'sealand' in world

True

In [6]:
world

{'afghanistan': 30.55, 'albania': 2.77, 'algeria': 39.21, 'sealand': 2.7e-05}

In [7]:
# to remove (del is a statement, so no parentheses are needed)
del world['sealand']

In [8]:
world

{'afghanistan': 30.55, 'albania': 2.77, 'algeria': 39.21}

In [None]:
# dictionary of dictionaries

europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }

# adding, updating, and deleting entries work the same way as for a flat dictionary
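
As a sketch of how a dictionary of dictionaries is read and extended, the example below chains two key lookups and then adds a new inner dictionary. It uses a shortened copy of `europe`; the `'italy'` entry and its numbers are illustrative values, not data from these notes.

```python
europe_demo = {'spain': {'capital': 'madrid', 'population': 46.77},
               'france': {'capital': 'paris', 'population': 66.03}}

# Chain the keys: the outer key picks the country, the inner key picks the field.
print(europe_demo['france']['capital'])  # paris

# Adding a country means assigning a whole inner dictionary to a new outer key.
# The values below are placeholders for illustration.
europe_demo['italy'] = {'capital': 'rome', 'population': 59.83}

# Updating a single field of an existing country.
europe_demo['spain']['population'] = 47.0

print(sorted(europe_demo.keys()))  # ['france', 'italy', 'spain']
```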

# pandas

## data frame from dictionary

In [2]:
import pandas as pd

In [12]:
# keys: column labels. values: data.
# named brics_dict rather than dict, to avoid shadowing the built-in dict type

brics_dict = {
    'country':['brazil', 'russia', 'china'], 
    'capital':['brasilia', 'moscow', 'beijing'],
    'area':[8.516, 17.10, 9.597],
    'population':[200.4, 143.5, 1357]}

brics_manual = pd.DataFrame(brics_dict)
brics_manual

|   | country | capital | area | population |
|---|---------|---------|------|------------|
| 0 | brazil | brasilia | 8.516 | 200.4 |
| 1 | russia | moscow | 17.1 | 143.5 |
| 2 | china | beijing | 9.597 | 1357.0 |


In [None]:
# set index 

brics_manual.index = ['BR', 'RU', 'CN']

In [3]:
brics = pd.read_csv('brics.csv', index_col=0) # index_col=0 uses the first column of the CSV as the row index
brics


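|    | country | capital | area | population |
|----|---------|---------|------|------------|
| BR | Brazil | Brasilia | 8.516 | 200.4 |
| RU | Russia | Moscow | 17.1 | 143.5 |
| IN | India | New Delhi | 3.286 | 1252.0 |
| CH | China | Beijing | 9.597 | 1357.0 |
| SA | South Africa | Pretoria | 1.221 | 52.98 |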


## Basics of manipulating data
A quick refresher, since I've forgotten most of the commands.
### List
`listname[-1]` will give the last item

### data frame

`df.describe()` gives summary statistics for the numeric columns. To get one column:

`df.colname` returns a pandas Series

`df['colname']` also returns a pandas Series

`df[['colname']]` (double brackets) returns a DataFrame, and the inner list can name multiple columns.
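
The return types can be checked directly. This is a minimal sketch that rebuilds a two-row frame from the BRICS values in these notes instead of reading `brics.csv`; the names `df`, `attr_col`, `single`, `double`, and `multi` are illustrative.

```python
import pandas as pd

# Small frame built from the first two BRICS rows shown in these notes.
df = pd.DataFrame(
    {'country': ['Brazil', 'Russia'],
     'capital': ['Brasilia', 'Moscow'],
     'area': [8.516, 17.10]},
    index=['BR', 'RU'])

attr_col = df.country               # attribute access
single = df['country']              # single brackets
double = df[['country']]            # double brackets, one column
multi = df[['country', 'capital']]  # double brackets, two columns

print(type(attr_col).__name__)  # Series
print(type(single).__name__)    # Series
print(type(double).__name__)    # DataFrame
print(type(multi).__name__)     # DataFrame
```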


In [16]:
brics[['country', 'capital']]

|    | country | capital |
|----|---------|---------|
| BR | Brazil | Brasilia |
| RU | Russia | Moscow |
| IN | India | New Delhi |
| CH | China | Beijing |
| SA | South Africa | Pretoria |


Row selection is slightly different: `df[0:2]` selects the first 2 rows (positions 0 and 1; the stop position 2 is excluded) without specifying columns, and `df[2:]` selects everything from the 3rd row (position 2) onward.

In [17]:
brics[0:2]

|    | country | capital | area | population |
|----|---------|---------|------|------------|
| BR | Brazil | Brasilia | 8.516 | 200.4 |
| RU | Russia | Moscow | 17.1 | 143.5 |


In [21]:
brics[2:]

|    | country | capital | area | population |
|----|---------|---------|------|------------|
| IN | India | New Delhi | 3.286 | 1252.0 |
| CH | China | Beijing | 9.597 | 1357.0 |
| SA | South Africa | Pretoria | 1.221 | 52.98 |


## loc and iloc
`loc` is label based: specify row and columns based on row and column labels

`iloc` is integer index based. 
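
One practical difference worth checking is how the two accessors treat slices: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position. The sketch below rebuilds a small frame from the BRICS values in these notes (the name `brics_demo` is illustrative) instead of reading `brics.csv`.

```python
import pandas as pd

brics_demo = pd.DataFrame(
    {'country': ['Brazil', 'Russia', 'India', 'China', 'South Africa'],
     'capital': ['Brasilia', 'Moscow', 'New Delhi', 'Beijing', 'Pretoria']},
    index=['BR', 'RU', 'IN', 'CH', 'SA'])

# The same cell, addressed by label and by integer position.
same_cell = brics_demo.loc['RU', 'capital'] == brics_demo.iloc[1, 1]

# Label slices include the end label; position slices exclude the end position.
label_slice = brics_demo.loc['BR':'IN', 'capital']
position_slice = brics_demo.iloc[0:2, 1]

print(same_cell)                   # True
print(list(label_slice.index))     # ['BR', 'RU', 'IN']
print(list(position_slice.index))  # ['BR', 'RU']
```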

### select based on row

In [22]:
# select a single row as a Series
# (both lines return the same row; the cell only displays the last one)

brics.loc['IN']
brics.iloc[2]

country           India
capital       New Delhi
area              3.286
population         1252
Name: IN, dtype: object

In [23]:
# select rows as a DataFrame by passing a list
# (both lines return the same rows; the cell only displays the last one)

brics.loc[['IN', 'CH']]
brics.iloc[[2, 3]]

|    | country | capital | area | population |
|----|---------|---------|------|------------|
| IN | India | New Delhi | 3.286 | 1252.0 |
| CH | China | Beijing | 9.597 | 1357.0 |


### select based on column

Note that the Series vs. DataFrame distinction only matters when selecting a single column. Selecting two or more columns always returns a DataFrame.

In [30]:
brics.iloc[:, 1]  # equivalent to brics.loc[:, 'capital']

BR     Brasilia
RU       Moscow
IN    New Delhi
CH      Beijing
SA     Pretoria
Name: capital, dtype: object

In [32]:
brics.iloc[:, [1,2]] # equivalent to brics.loc[:, ['capital', 'area']]

|    | capital | area |
|----|---------|------|
| BR | Brasilia | 8.516 |
| RU | Moscow | 17.1 |
| IN | New Delhi | 3.286 |
| CH | Beijing | 9.597 |
| SA | Pretoria | 1.221 |


### select based on both row and column 

In [27]:
# select a single value with one [row, column] index

brics.iloc[1, 1]

'Moscow'

In [28]:
# also can be used to select more than one row/col

brics.loc[['IN', 'CH'], ['capital']]

|    | capital |
|----|---------|
| IN | New Delhi |
| CH | Beijing |


## filtering (subsetting)

In [4]:
brics

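|    | country | capital | area | population |
|----|---------|---------|------|------------|
| BR | Brazil | Brasilia | 8.516 | 200.4 |
| RU | Russia | Moscow | 17.1 | 143.5 |
| IN | India | New Delhi | 3.286 | 1252.0 |
| CH | China | Beijing | 9.597 | 1357.0 |
| SA | South Africa | Pretoria | 1.221 | 52.98 |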


In [6]:
brics['area'] > 8

BR     True
RU     True
IN    False
CH     True
SA    False
Name: area, dtype: bool

In [7]:
# combine together

is_huge = brics['area']>8
brics[is_huge]

|    | country | capital | area | population |
|----|---------|---------|------|------------|
| BR | Brazil | Brasilia | 8.516 | 200.4 |
| RU | Russia | Moscow | 17.1 | 143.5 |
| CH | China | Beijing | 9.597 | 1357.0 |
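
The same boolean-mask idea extends to more than one condition. As a sketch (the thresholds and the name `brics_demo` are chosen only for illustration), conditions are combined element-wise with `&` (and) or `|` (or), and each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`.

```python
import pandas as pd

# Area and population values copied from the brics table in these notes.
brics_demo = pd.DataFrame(
    {'area': [8.516, 17.10, 3.286, 9.597, 1.221],
     'population': [200.4, 143.5, 1252.0, 1357.0, 52.98]},
    index=['BR', 'RU', 'IN', 'CH', 'SA'])

# Both conditions must hold: large area AND population below 1000.
large_and_fewer = brics_demo[(brics_demo['area'] > 8) & (brics_demo['population'] < 1000)]

# Either condition may hold: large area OR population below 100.
large_or_tiny = brics_demo[(brics_demo['area'] > 8) | (brics_demo['population'] < 100)]

print(list(large_and_fewer.index))  # ['BR', 'RU']
print(list(large_or_tiny.index))    # ['BR', 'RU', 'CH', 'SA']
```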


## drop and dropna


`axis=1` refers to columns and `axis=0` (the default) refers to rows.
```
df.drop(['colname'], axis = 1)
# is equivalent to 

df.drop(columns = ['colname'])

```
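
A short sketch of the two equivalent spellings, plus the row case, on a small frame rebuilt from the BRICS values in these notes (the names `df`, `by_axis`, `by_keyword`, and `without_br` are illustrative). Note that `drop` returns a new DataFrame and leaves the original unchanged unless the result is assigned back.

```python
import pandas as pd

df = pd.DataFrame(
    {'country': ['Brazil', 'Russia'],
     'capital': ['Brasilia', 'Moscow'],
     'area': [8.516, 17.10]},
    index=['BR', 'RU'])

by_axis = df.drop(['area'], axis=1)      # drop a column via axis=1
by_keyword = df.drop(columns=['area'])   # same result via the columns keyword
without_br = df.drop(['BR'], axis=0)     # axis=0 drops rows by index label

print(list(by_axis.columns))    # ['country', 'capital']
print(list(without_br.index))   # ['RU']
print('area' in df.columns)     # True, the original frame is untouched
```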
For `dropna`, see *Preprocessing*.