# Week 2
- Compose different types of Python sequences and sequence operations.
- Create dictionaries and sets in Python.
- Write list comprehensions that efficiently create lists.
- Construct generator expressions and functions.

## List Comprehension

In [1]:
output =[]
for x in range(10):
    output.append(x**2)
    
output

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

The for loop can be replaced by a list comprehension:

In [2]:
output = [x**2 for x in range(10)]
output

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

### Map

In [3]:
import random
scores = []
for i in range(5):
    scores.append(random.randint(i,10))
    
scores

[2, 10, 5, 10, 9]

The above step is called mapping a function onto a sequence. This can be replaced by:

In [4]:
[random.randint(i,10) for i in range(5)]

[3, 8, 4, 7, 5]

### Filter

In [5]:
caps = []

for letter in "Henry Honey":
    if letter.isupper():
        caps.append(letter)
        
caps

['H', 'H']

In [6]:
[letter for letter in "Henry Honey" if letter.isupper()]

['H', 'H']
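
Mapping and filtering can also be combined in a single comprehension. A minimal sketch, assuming a made-up word list (`words` and `capital_lengths` are illustrative names, not from the notes above):

```python
# Hypothetical word list, chosen only for illustration.
words = ["Henry", "honey", "Hive", "bee"]

# Filter: keep only the words that start with an uppercase letter.
# Map: replace each kept word with its length.
capital_lengths = [len(word) for word in words if word[0].isupper()]
print(capital_lengths)  # [5, 4]
```

The `if` clause runs first for each item, so the mapped expression is evaluated only for the items that pass the filter.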

### Nest

In [7]:
list_of_lists = [['a','b','c'],['d','e','f'],['g','h','i']]
flat = []

for sub_list in list_of_lists:
    for item in sub_list:
        flat.append(item)
        
flat

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

In [8]:
[item for sub_list in list_of_lists for item in sub_list]

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

## Generator Expressions

In [9]:
large_num = 9999999
l_square = [x**2 for x in range(large_num)]

In [10]:
import sys
sys.getsizeof(l_square)

89095160

The same computation written as a generator expression, which produces values on demand instead of storing them all:

In [11]:
g_square = (x**2 for x in range(large_num))
sys.getsizeof(g_square)

112
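
The size difference comes from the generator holding only its current state rather than every value. A small sketch of the same comparison on a shorter range (the value of `n` is an arbitrary choice for illustration), which also shows that a generator expression can be passed straight to a function such as `sum` without ever building the list:

```python
import sys

n = 1000  # arbitrary small size, for illustration only

as_list = [x**2 for x in range(n)]   # all 1000 values are stored
as_gen = (x**2 for x in range(n))    # values are produced on demand

# The generator object is far smaller than the list.
list_is_bigger = sys.getsizeof(as_list) > sys.getsizeof(as_gen)

# A generator expression can be consumed directly by sum().
total = sum(x**2 for x in range(n))
print(list_is_bigger, total)
```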

In [12]:
g_square

<generator object <genexpr> at 0x7fa33f4aa350>

The output above only shows the generator object, not its values. To get values out of a generator expression one at a time, we use the next() function:

In [13]:
next(g_square)

0

In [14]:
next(g_square)

1

We can also use a for loop to get values from a generator expression. It continues from where the next() calls stopped:

In [15]:
for x in g_square:
    print(x)
    if x > 12:
        break

4
9
16
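
Because a generator remembers its position, it can be consumed only once. A minimal sketch of that behaviour (the names `squares`, `first_pass`, and `second_pass` are illustrative):

```python
squares = (x**2 for x in range(4))

first_pass = list(squares)   # consumes every value
second_pass = list(squares)  # the generator is now exhausted

print(first_pass)   # [0, 1, 4, 9]
print(second_pass)  # []
```

If the values are needed more than once, either rebuild the generator or store the results in a list.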


### Chaining Generator Expressions

In [16]:
evens = (x for x in range(0,100,2))
div_three = (y for y in evens if y%3 == 0)
# next(div_three)

In [17]:
next(evens)

0

In [18]:
[x for x in div_three]

[6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90, 96]
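
Note that 0 is missing from the result even though it is divisible by 3: `div_three` pulls its values from `evens`, and the `next(evens)` call had already consumed the 0. A smaller sketch of the same effect, using a shorter range for readability:

```python
evens = (x for x in range(0, 20, 2))
div_three = (y for y in evens if y % 3 == 0)

# Taking a value from the source generator removes it for everyone downstream.
skipped = next(evens)

# div_three only sees what is left in evens: 2, 4, ..., 18.
remaining = list(div_three)
print(skipped, remaining)  # 0 [6, 12, 18]
```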

## Generator Functions

In [19]:
def return_num():
    for x in range(5):
        return x

return_num()

0

To make `return_num` return a generator object instead of stopping at the first value, change return to yield:

In [20]:
def return_num():
    for x in range(5):
        yield x

gen_num = return_num()

In [21]:
next(gen_num)

0

Every time next() is called, the generator resumes where it left off in the for loop, so yield behaves like a return statement that pauses the function instead of ending it. Now, let's create a generator that counts for us.

In [22]:
def counter(x):
    while True: # this is an infinite generator
        yield x
        x += 1
         
count = counter(17) 

In [23]:
next(count)

17

Fibonacci generator:

In [24]:
def fib():
    for cur in (0,1):
        last = cur
        yield cur
    while True:
        yield cur
        last, cur = cur, last + cur

In [25]:
f = fib()

In [26]:
next(f)

0
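
Infinite generators such as `counter` and `fib` cannot be turned into a list directly, because the loop never ends. One way to take a finite number of values is `itertools.islice` from the standard library. A sketch that repeats the two definitions so it runs on its own:

```python
from itertools import islice

def counter(x):
    while True:  # infinite generator
        yield x
        x += 1

def fib():
    for cur in (0, 1):
        last = cur
        yield cur
    while True:
        yield cur
        last, cur = cur, last + cur

# islice stops after the requested number of items.
first_counts = list(islice(counter(17), 3))
first_fibs = list(islice(fib(), 8))

print(first_counts)  # [17, 18, 19]
print(first_fibs)    # [0, 1, 1, 2, 3, 5, 8, 13]
```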

# Introduction to Pandas DataFrames

## Creating Pandas DataFrames in Python

In [27]:
import pandas as pd # need to download pandas and wheel (pip3 install wheel)

### From Dictionary

In [28]:
data = { "Name": ['Carl', 'Carol', 'Cas'],
         "Age": [43, 23, 30],
         "Score": [123, 268, 14]
       }

In [29]:
pd.DataFrame(data)

Unnamed: 0,Name,Age,Score
0,Carl,43,123
1,Carol,23,268
2,Cas,30,14


^ This is how a DataFrame is displayed in a Jupyter notebook: as a table.

### From List of Lists

In [30]:
data = [['Carl', 43, 123],   # each inner list is one row
        ['Carol', 23, 268],
        ['Cas', 30, 14]]

In [31]:
df = pd.DataFrame(data)
df

Unnamed: 0,0,1,2
0,Carl,43,123
1,Carol,23,268
2,Cas,30,14


If we want to have named columns:

In [32]:
df = pd.DataFrame(data, columns=['Name', 'age', 'score'])
df

Unnamed: 0,Name,age,score
0,Carl,43,123
1,Carol,23,268
2,Cas,30,14


### From File

In [33]:
file_path= './student.csv'

In [34]:
pd.read_csv(file_path)

Unnamed: 0,id,name,class,mark,gender
0,1,John Deo,Four,75,female
1,2,Max Ruin,Three,85,male
2,3,Arnold,Three,55,male
3,4,Krish Star,Four,60,female
4,5,John Mike,Four,60,female
5,6,Alex John,Four,55,male
6,7,My John Rob,Fifth,78,male
7,8,Asruid,Five,85,male
8,9,Tes Qry,Six,78,male
9,10,Big John,Four,55,female


## Looking at DataFrame Data

In [35]:
file_path='./student.csv'
df = pd.read_csv(file_path)

### Heads/Tails

In [36]:
df.head(3)

Unnamed: 0,id,name,class,mark,gender
0,1,John Deo,Four,75,female
1,2,Max Ruin,Three,85,male
2,3,Arnold,Three,55,male


In [37]:
df.tail()

Unnamed: 0,id,name,class,mark,gender
30,31,Marry Toeey,Four,88,male
31,32,Binn Rott,Seven,90,female
32,33,Kenn Rein,Six,96,female
33,34,Gain Toe,Seven,69,male
34,35,Rows Noump,Six,88,female


### Descriptive Statistics

In [38]:
df.describe()

Unnamed: 0,id,mark
count,35.0,35.0
mean,18.0,74.657143
std,10.246951,16.401117
min,1.0,18.0
25%,9.5,62.5
50%,18.0,79.0
75%,26.5,88.0
max,35.0,96.0


In [39]:
df.min()

id                1
name      Alex John
class         Eight
mark             18
gender       female
dtype: object

In [40]:
# df.std() raises an error on recent pandas versions because of the text columns; select the numeric columns first or pass numeric_only=True
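
Two ways to restrict a reduction such as `std` to the numeric columns are sketched below. The small `students` frame is a stand-in built from the first three rows shown earlier, not the full CSV file:

```python
import pandas as pd

# Stand-in data: the first three rows of the student table.
students = pd.DataFrame({
    "name": ["John Deo", "Max Ruin", "Arnold"],
    "mark": [75, 85, 55],
})

# Option 1: ask the reduction to skip non-numeric columns.
numeric_std = students.std(numeric_only=True)

# Option 2: select the numeric columns first, then reduce.
selected_std = students.select_dtypes(include="number").std()

print(numeric_std)
print(selected_std)
```

Both calls return a Series indexed by the numeric column names only.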

### Selecting Columns

In [41]:
# if you are not using a large amount of data
df.columns

Index(['id', 'name', 'class', 'mark', 'gender'], dtype='object')

In [42]:
df['name']

0        John Deo
1        Max Ruin
2          Arnold
3      Krish Star
4       John Mike
5       Alex John
6     My John Rob
7          Asruid
8         Tes Qry
9        Big John
10         Ronald
11          Recky
12            Kty
13           Bigy
14       Tade Row
15          Gimmy
16          Tumyu
17          Honny
18          Tinny
19         Jackly
20     Babby John
21         Reggid
22          Herod
23      Tiddy Now
24       Giff Tow
25         Crelea
26       Big Nose
27      Rojj Base
28    Tess Played
29      Reppy Red
30    Marry Toeey
31      Binn Rott
32      Kenn Rein
33       Gain Toe
34     Rows Noump
Name: name, dtype: object

In [43]:
df[['name','class']]

Unnamed: 0,name,class
0,John Deo,Four
1,Max Ruin,Three
2,Arnold,Three
3,Krish Star,Four
4,John Mike,Four
5,Alex John,Four
6,My John Rob,Fifth
7,Asruid,Five
8,Tes Qry,Six
9,Big John,Four


In [44]:
df.id

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
12    13
13    14
14    15
15    16
16    17
17    18
18    19
19    20
20    21
21    22
22    23
23    24
24    25
25    26
26    27
27    28
28    29
29    30
30    31
31    32
32    33
33    34
34    35
Name: id, dtype: int64

In [45]:
df.gender

0     female
1       male
2       male
3     female
4     female
5       male
6       male
7       male
8       male
9     female
10    female
11    female
12    female
13    female
14      male
15      male
16      male
17      male
18      male
19    female
20    female
21    female
22      male
23      male
24      male
25      male
26    female
27    female
28      male
29    female
30      male
31    female
32    female
33      male
34    female
Name: gender, dtype: object

### Selecting Columns and Rows

In [46]:
# For large amount of data
df.iloc[3] # returns the row at position 3 (the fourth row, since positions start at 0)

id                 4
name      Krish Star
class           Four
mark              60
gender        female
Name: 3, dtype: object

In [47]:
df.iloc[3:10] # this will return rows 3 to 9 and all the columns

Unnamed: 0,id,name,class,mark,gender
3,4,Krish Star,Four,60,female
4,5,John Mike,Four,60,female
5,6,Alex John,Four,55,male
6,7,My John Rob,Fifth,78,male
7,8,Asruid,Five,85,male
8,9,Tes Qry,Six,78,male
9,10,Big John,Four,55,female


In [48]:
df.iloc[3:10,3] # this will return rows 3 to 9 and column 3

3    60
4    60
5    55
6    78
7    85
8    78
9    55
Name: mark, dtype: int64

In [49]:
df.iloc[3:10,1:3] # this will return rows 3 to 9 and columns 1 to 2 (the end of each slice is excluded)

Unnamed: 0,name,class
3,Krish Star,Four
4,John Mike,Four
5,Alex John,Four
6,My John Rob,Fifth
7,Asruid,Five
8,Tes Qry,Six
9,Big John,Four


df.loc selects by label (index and column names) instead of integer position

In [50]:
df.head()

Unnamed: 0,id,name,class,mark,gender
0,1,John Deo,Four,75,female
1,2,Max Ruin,Three,85,male
2,3,Arnold,Three,55,male
3,4,Krish Star,Four,60,female
4,5,John Mike,Four,60,female


In [51]:
df.loc[4] # returns the row labelled 4 (the fifth row here)

id                5
name      John Mike
class          Four
mark             60
gender       female
Name: 4, dtype: object

In [52]:
df.loc[2:5] # returns the rows labelled 2 to 5; unlike iloc, the end label is included

Unnamed: 0,id,name,class,mark,gender
2,3,Arnold,Three,55,male
3,4,Krish Star,Four,60,female
4,5,John Mike,Four,60,female
5,6,Alex John,Four,55,male


In [53]:
df.loc[2:5, ['name','class']] # returns rows 2 to 5 and the two column names passed

Unnamed: 0,name,class
2,Arnold,Three
3,Krish Star,Four
4,John Mike,Four
5,Alex John,Four
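
With the default index, labels and positions happen to be the same numbers, which hides the difference between `loc` and `iloc`. A sketch with a hypothetical non-default index (the labels 10 to 40 are made up for illustration) makes it visible:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
marks = pd.DataFrame({"mark": [75, 85, 55, 60]}, index=[10, 20, 30, 40])

# iloc slices by position and excludes the end: positions 1 and 2.
by_position = marks.iloc[1:3]

# loc slices by label and includes the end: labels 20, 30 and 40.
by_label = marks.loc[20:40]

print(list(by_position.index))  # [20, 30]
print(list(by_label.index))     # [20, 30, 40]
```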


## Selecting Data in a Pandas DataFrame

In [54]:
# File opened in previous steps
# df.head(2)
df

Unnamed: 0,id,name,class,mark,gender
0,1,John Deo,Four,75,female
1,2,Max Ruin,Three,85,male
2,3,Arnold,Three,55,male
3,4,Krish Star,Four,60,female
4,5,John Mike,Four,60,female
5,6,Alex John,Four,55,male
6,7,My John Rob,Fifth,78,male
7,8,Asruid,Five,85,male
8,9,Tes Qry,Six,78,male
9,10,Big John,Four,55,female


### Boolean Mask

In [55]:
mask = [False for _ in range(len(df))] # create a list of False values with the same length as the DataFrame
mask[3:7] = [True] * 4  # set True at the positions of the rows you want from the DataFrame
mask

[False,
 False,
 False,
 True,
 True,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False]

In [56]:
df[mask] # pass the list to the DataFrame inside square brackets

Unnamed: 0,id,name,class,mark,gender
3,4,Krish Star,Four,60,female
4,5,John Mike,Four,60,female
5,6,Alex John,Four,55,male
6,7,My John Rob,Fifth,78,male


### Creating masks using comparison operators

In [57]:
mask = df.loc[:,'gender'] == 'male' # True for every row whose gender is male
# comparisons can also be made between two columns
df[mask]

Unnamed: 0,id,name,class,mark,gender
1,2,Max Ruin,Three,85,male
2,3,Arnold,Three,55,male
5,6,Alex John,Four,55,male
6,7,My John Rob,Fifth,78,male
7,8,Asruid,Five,85,male
8,9,Tes Qry,Six,78,male
14,15,Tade Row,Four,88,male
15,16,Gimmy,Four,88,male
16,17,Tumyu,Six,54,male
17,18,Honny,Five,75,male


### Pandas boolean operators

* And &
* Or |
* Not ~

In [58]:
df.describe()

Unnamed: 0,id,mark
count,35.0,35.0
mean,18.0,74.657143
std,10.246951,16.401117
min,1.0,18.0
25%,9.5,62.5
50%,18.0,79.0
75%,26.5,88.0
max,35.0,96.0


Let's say we want all the marks between 60 and 90:

In [59]:
mask = (df.loc[:, 'mark'] > 60) & (df.loc[:, 'mark'] < 90)

In [161]:
# mask

In [61]:
df[mask]

Unnamed: 0,id,name,class,mark,gender
0,1,John Deo,Four,75,female
1,2,Max Ruin,Three,85,male
6,7,My John Rob,Fifth,78,male
7,8,Asruid,Five,85,male
8,9,Tes Qry,Six,78,male
10,11,Ronald,Six,89,female
12,13,Kty,Seven,88,female
13,14,Bigy,Seven,88,female
14,15,Tade Row,Four,88,male
15,16,Gimmy,Four,88,male
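
The other two operators work the same way. A sketch using `|` and `~` on a four-row stand-in frame (the data is copied from the first rows of the student table; the variable names are illustrative). Each comparison needs its own parentheses because `&`, `|` and `~` bind more tightly than `>` or `==`:

```python
import pandas as pd

# Stand-in data: marks and genders of the first four students.
sample = pd.DataFrame({
    "mark": [75, 85, 55, 60],
    "gender": ["female", "male", "male", "female"],
})

# Or: marks at or below 60, or at or above 90.
outside = sample[(sample["mark"] <= 60) | (sample["mark"] >= 90)]

# Not: every row whose gender is not male.
not_male = sample[~(sample["gender"] == "male")]

print(list(outside.index))   # [2, 3]
print(list(not_male.index))  # [0, 3]
```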


### Creating new column

Let's say I want to add a column called 'mark/average' which divides each mark by the mean mark:

In [62]:
df.columns

Index(['id', 'name', 'class', 'mark', 'gender'], dtype='object')

In [63]:
df.loc[:, 'mark/average'] = df.loc[:,'mark'] / df['mark'].mean()

In [64]:
df.columns

Index(['id', 'name', 'class', 'mark', 'gender', 'mark/average'], dtype='object')

# Accessing Data in Pandas DataFrame

## Manipulating Pandas DataFrames

In [65]:
data = { 'first': ['Carl', 'Francis', 'Sam'],
         'last': ['Po', 'Nygeun', 'Smith'],
         'age': [43, 23, 30],
         'CH_count': [12,14,39]
       }
clients = pd.DataFrame(data)
clients

Unnamed: 0,first,last,age,CH_count
0,Carl,Po,43,12
1,Francis,Nygeun,23,14
2,Sam,Smith,30,39


### Rename a column

In [66]:
clients.rename(columns={'last':'First Column'}) # the column name changed but the change doesn't affect the original df

Unnamed: 0,first,First Column,age,CH_count
0,Carl,Po,43,12
1,Francis,Nygeun,23,14
2,Sam,Smith,30,39


In [67]:
# clients

### Rename a row

In [68]:
clients.rename(index={0:'a',1:'b',2:'c'}, inplace=True) # setting argument 'inplace' to 'True' will change the original df
clients

Unnamed: 0,first,last,age,CH_count
a,Carl,Po,43,12
b,Francis,Nygeun,23,14
c,Sam,Smith,30,39


### Reset indexes

In [69]:
clients.reset_index(inplace=True)

In [70]:
clients

Unnamed: 0,index,first,last,age,CH_count
0,a,Carl,Po,43,12
1,b,Francis,Nygeun,23,14
2,c,Sam,Smith,30,39


### Drop columns

In [71]:
clients.drop(columns='first') # returns a new df

Unnamed: 0,index,last,age,CH_count
0,a,Po,43,12
1,b,Nygeun,23,14
2,c,Smith,30,39


### Drop rows

In [72]:
clients.drop(index=0) # returns a new df

Unnamed: 0,index,first,last,age,CH_count
1,b,Francis,Nygeun,23,14
2,c,Sam,Smith,30,39


### Set a column's data type

In [73]:
clients.age

0    43
1    23
2    30
Name: age, dtype: int64

In [74]:
clients.age.astype(int)

0    43
1    23
2    30
Name: age, dtype: int64
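
The cell above converts an int64 column to `int`, so the output looks unchanged, and `astype` returns a new Series rather than modifying the DataFrame. A sketch of a conversion that is visible and is stored back in the frame (a small stand-in for `clients` is rebuilt here so the example runs on its own):

```python
import pandas as pd

# Stand-in for the clients frame, with only the age column.
people = pd.DataFrame({"age": [43, 23, 30]})

# astype returns a new Series; assign it back to keep the change.
people["age"] = people["age"].astype(float)

print(people["age"].dtype)  # float64
```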

## Updating Pandas Data

In [75]:
# using the variables 'data' and 'clients' from above
data
# clients
clients.drop(columns='index', inplace = True)

In [76]:
new_data = {'first':['Sue','Boya'],
            'last': ['Rankler','Maple'],
            'age': [93,12],
            'CH_count': [22,1]}
new_clients = pd.DataFrame(new_data)
new_clients

Unnamed: 0,first,last,age,CH_count
0,Sue,Rankler,93,22
1,Boya,Maple,12,1


### Adding Rows

In [77]:
# clients.append(new_clients) # DataFrame.append was deprecated and then removed in pandas 2.0; use pandas.concat
clients = pd.concat([clients,new_clients], ignore_index = True) # ignore_index is False by default; True renumbers the rows from 0

In [78]:
clients

Unnamed: 0,first,last,age,CH_count
0,Carl,Po,43,12
1,Francis,Nygeun,23,14
2,Sam,Smith,30,39
3,Sue,Rankler,93,22
4,Boya,Maple,12,1


### Setting specific value

In [79]:
clients.loc[1, 'first'] = 'Frankie'
clients

Unnamed: 0,first,last,age,CH_count
0,Carl,Po,43,12
1,Frankie,Nygeun,23,14
2,Sam,Smith,30,39
3,Sue,Rankler,93,22
4,Boya,Maple,12,1


In [80]:
# To change a value in multiple locations
clients.loc[0:1, 'CH_count'] = -1 # CH_count will change in index 0 and 1; it is inclusive of both values, unlike lists
clients

Unnamed: 0,first,last,age,CH_count
0,Carl,Po,43,-1
1,Frankie,Nygeun,23,-1
2,Sam,Smith,30,39
3,Sue,Rankler,93,22
4,Boya,Maple,12,1


### Math Operations

In [81]:
clients.CH_count + 1 # this returns a new Series; clients is unchanged

0     0
1     0
2    40
3    23
4     2
Name: CH_count, dtype: int64

In [82]:
# math operations have no inplace argument; use an augmented assignment instead:
clients.CH_count -= 3
clients

Unnamed: 0,first,last,age,CH_count
0,Carl,Po,43,-4
1,Frankie,Nygeun,23,-4
2,Sam,Smith,30,36
3,Sue,Rankler,93,19
4,Boya,Maple,12,-2


### Replace

In [83]:
clients.replace(10,0) # this returns a new df

Unnamed: 0,first,last,age,CH_count
0,Carl,Po,43,-4
1,Frankie,Nygeun,23,-4
2,Sam,Smith,30,36
3,Sue,Rankler,93,19
4,Boya,Maple,12,-2


In [84]:
clients = clients.replace(-13,99) # assigning the result back has the same effect as inplace=True

In [85]:
clients

Unnamed: 0,first,last,age,CH_count
0,Carl,Po,43,-4
1,Frankie,Nygeun,23,-4
2,Sam,Smith,30,36
3,Sue,Rankler,93,19
4,Boya,Maple,12,-2
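
Neither 10 nor -13 occurs in `clients`, so the two calls above leave the table unchanged. A sketch with a value that does occur, on a two-row stand-in frame built only for illustration:

```python
import pandas as pd

# Stand-in frame containing a value that replace() will actually match.
accounts = pd.DataFrame({"first": ["Carl", "Sam"], "CH_count": [-4, 36]})

# replace returns a new DataFrame; the original is left untouched.
replaced = accounts.replace(-4, 0)

print(replaced["CH_count"].tolist())  # [0, 36]
print(accounts["CH_count"].tolist())  # [-4, 36]
```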


## Applying Functions in a Pandas DataFrame

In [86]:
data = {'even': range(20,0,-2),
        'odd':  range(1,21,2)}
df = pd.DataFrame(data)
df

Unnamed: 0,even,odd
0,20,1
1,18,3
2,16,5
3,14,7
4,12,9
5,10,11
6,8,13
7,6,15
8,4,17
9,2,19


In [87]:
# we are going to use a built-in function in python
sum(range(10))

45

### Apply

### Across columns

In [88]:
df.apply(sum) # by default it adds up all the values in the columns

even    110
odd     100
dtype: int64

### Across rows

In [89]:
df.apply(sum, axis = 1) # now it adds up the values in each row

0    21
1    21
2    21
3    21
4    21
5    21
6    21
7    21
8    21
9    21
dtype: int64

### Define Column Function

In [90]:
def hundred_plus(col):
    if sum(col) > 100:
        return "Greater than a 100"
    return "Not greater than a 100"

In [91]:
df.apply(hundred_plus)

even        Greater than a 100
odd     Not greater than a 100
dtype: object

### Apply your own functions

In [92]:
def label(row):
    if row['even'] % 3 == 0:
        return True
    elif row['odd'] % 3 == 0:
        return True
    return False

In [93]:
df.apply(label,axis=1)

0    False
1     True
2    False
3    False
4     True
5    False
6    False
7     True
8    False
9    False
dtype: bool

### Expanding results

In [94]:
def ret_list(row):
    ret_val = [False, False]
    if row['even'] > 6:
        ret_val[0] = True
        
    if row['odd'] > 6:
        ret_val[1] = True
        
    return ret_val

In [95]:
df.apply(ret_list, axis = 1)

0    [True, False]
1    [True, False]
2    [True, False]
3     [True, True]
4     [True, True]
5     [True, True]
6     [True, True]
7    [False, True]
8    [False, True]
9    [False, True]
dtype: object

In [96]:
# to expand the results so that each result has its own column
df.apply(ret_list, axis = 1, result_type='expand')

Unnamed: 0,0,1
0,True,False
1,True,False
2,True,False
3,True,True
4,True,True
5,True,True
6,True,True
7,False,True
8,False,True
9,False,True


### Apply functions to a single column

In [97]:
def div_three(row):
    if row % 3 ==0:
        return 'Divisible by 3'
    return 'Not divisible by 3'

In [98]:
df.even.apply(div_three)

0    Not divisible by 3
1        Divisible by 3
2    Not divisible by 3
3    Not divisible by 3
4        Divisible by 3
5    Not divisible by 3
6    Not divisible by 3
7        Divisible by 3
8    Not divisible by 3
9    Not divisible by 3
Name: even, dtype: object
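
For simple arithmetic tests like these, the same results can usually be produced without `apply`, by operating on whole columns at once. A sketch of that alternative, assuming the same `even`/`odd` frame as above (`np.where` is one possible way to turn a boolean column into labels):

```python
import numpy as np
import pandas as pd

numbers = pd.DataFrame({"even": range(20, 0, -2), "odd": range(1, 21, 2)})

# Same result as df.apply(label, axis=1), computed column-wise.
either_div_three = (numbers["even"] % 3 == 0) | (numbers["odd"] % 3 == 0)

# Same labels as df.even.apply(div_three).
labels = np.where(numbers["even"] % 3 == 0,
                  "Divisible by 3", "Not divisible by 3")

print(either_div_three.tolist())
print(list(labels))
```

Column-wise operations avoid calling a Python function once per row, which tends to matter on large DataFrames.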

# Week 3: Alternatives to Pandas DataFrame
- Create Pandas DataFrames in Python.
- Write statements to select columns and rows from a DataFrame.
- Apply comparison and boolean operators as a method of selecting data.
- Explain when it is appropriate to use alternatives to Pandas DataFrames.

- ndarray
    - size set at creation
    - single data type
    - Not limited to two dimensions
    - Can be reshaped

## Creating Numpy Arrays in Python

In [99]:
import numpy as np

### Creating Arrays

In [100]:
data = [1,2,3,4]
first_array = np.array(data)
first_array

array([1, 2, 3, 4])

In [101]:
data = [[1,2,3],
        [4,5,6],
        [7,8,9]]
my_array = np.array(data)
my_array

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Other ways to create arrays

In [102]:
np.ones(12)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [103]:
np.zeros(3)

array([0., 0., 0.])

In [104]:
np.arange(10) # this is similar to Python's built-in range

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [105]:
np.arange(3,13,4)

array([ 3,  7, 11])

### Dimensions and Shape

In [106]:
oned = np.arange(21)
oned

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20])

In [107]:
oned.shape # the shape is a tuple with one entry per dimension; here one dimension of length 21

(21,)

In [108]:
oned.size

21

In [109]:
oned.ndim

1

In [110]:
list_o_list = [[1,2,3],
               [4,5,6],
               [7,8,9]]
twod = np.array(list_o_list)
twod

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [111]:
twod.ndim # two dimensions

2

In [112]:
twod.size

9

In [113]:
twod.shape

(3, 3)

In [114]:
oned = np.arange(12)
oned

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [115]:
twod = oned.reshape(3,4)
twod

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [116]:
twod.shape

(3, 4)

In [117]:
twod.ndim

2

`For the reshape method to work and not throw an error, the new shape must hold exactly as many values as the old array. For example, the twod array has 12 values, so a new two-dimensional shape must also hold 12 values, which means it should be (1,12), (2,6), (3,4), (4,3), (6,2) or (12,1)`

In [118]:
# to create higher dimension arrays
twod.reshape(2,2,3)

array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]]])
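
A short sketch of the reshape rule in practice: passing -1 for one dimension asks NumPy to work out that size itself, and a shape that cannot hold exactly 12 values raises a `ValueError` (the variable names are illustrative):

```python
import numpy as np

values = np.arange(12)

# -1 lets NumPy infer the missing dimension: 12 values / 3 rows = 4 columns.
inferred = values.reshape(3, -1)
print(inferred.shape)  # (3, 4)

# A shape with 15 slots cannot hold 12 values.
try:
    values.reshape(5, 3)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```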

### Setting Data type

In [119]:
darray = np.arange(100)
darray

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

In [120]:
darray.dtype

dtype('int64')

In [121]:
darray.nbytes

800

`To change the type:`

In [122]:
darray = np.arange(100, dtype=np.int8)
darray

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99],
      dtype=int8)

In [123]:
darray.dtype

dtype('int8')

In [124]:
darray.nbytes

100

`This is very important to remember when using huge data sets on a machine with limited memory and CPU`

In [125]:
# Numpy doesn't allow more than 1 type in each array
# darray[12] ='a'

In [126]:
darray[12] =0.4
darray[12] # lost the decimal value

0
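
If decimal values are needed, the array has to be converted to a floating-point type first, at the cost of more memory per element. A sketch using `astype` (the array names are illustrative):

```python
import numpy as np

small_ints = np.arange(5, dtype=np.int8)   # 1 byte per element
as_floats = small_ints.astype(np.float64)  # 8 bytes per element, new array

# The float array keeps the decimal part.
as_floats[2] = 0.4

print(as_floats[2])                        # 0.4
print(small_ints.nbytes, as_floats.nbytes) # 5 40
```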

### Broadcasting

In [127]:
A1 = np.array([[1,2,3],
              [4,5,6],
              [7,8,9]])

In [128]:
A1.shape

(3, 3)

In [129]:
A1.ndim

2

In [130]:
A2 = np.array([[1,1,1],
              [1,1,1],
              [1,1,1]])

In [131]:
A1 + A2

array([[ 2,  3,  4],
       [ 5,  6,  7],
       [ 8,  9, 10]])

In [132]:
A1 + 3 # the single value is added to every element; the result has the same shape as A1

array([[ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [133]:
A3 = np.array([1,1,1])

In [134]:
A3.shape

(3,)

In [135]:
A3.ndim

1

In [136]:
A1 + A3 # this is called broadcasting: arrays with different numbers of dimensions are combined, and the result
        # has as many dimensions as the input with the most dimensions

array([[ 2,  3,  4],
       [ 5,  6,  7],
       [ 8,  9, 10]])

In [137]:
(A1 + A3).shape # same shape as A1

(3, 3)

In [138]:
(A1 + A3).ndim # same dim as A1

2

In [139]:
A4 = np.arange(10).reshape(2,1,5)
A4

array([[[0, 1, 2, 3, 4]],

       [[5, 6, 7, 8, 9]]])

In [140]:
A5 = np.arange(14).reshape(2,7,1)

In [141]:
A6 = A4 + A5

In [142]:
A6.shape # along each axis, the result takes the larger size of the two inputs (a size of 1 is stretched)
            # 2 is the same in both A4 and A5
            # 7 is from A5
            # 5 is from A4

(2, 7, 5)
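
Broadcasting only works when, for every axis (compared from the last axis backwards), the two sizes are equal or one of them is 1. A sketch showing one compatible and one incompatible pair of shapes (the arrays are filled with ones purely for illustration):

```python
import numpy as np

a = np.ones((2, 1, 5))
b = np.ones((2, 7, 1))

# Compatible: each axis is either equal (2 and 2) or contains a 1.
combined_shape = (a + b).shape
print(combined_shape)  # (2, 7, 5)

# Incompatible: the last axes are 3 and 2, and neither is 1.
try:
    np.ones((3, 3)) + np.ones(2)
    incompatible = False
except ValueError as err:
    incompatible = True
    print("cannot broadcast:", err)
```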

### Matrix Operations

In [143]:
M1 = np.arange(9).reshape(3,3)
M1

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [144]:
M2 = np.arange(2,11).reshape(3,3)
M2

array([[ 2,  3,  4],
       [ 5,  6,  7],
       [ 8,  9, 10]])

In [145]:
M1.transpose()

array([[0, 3, 6],
       [1, 4, 7],
       [2, 5, 8]])

In [146]:
M1.diagonal()

array([0, 4, 8])

In [147]:
M1 @ M2 # matrix to matrix product 

array([[ 21,  24,  27],
       [ 66,  78,  90],
       [111, 132, 153]])
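
The `@` operator should not be confused with `*`, which multiplies element by element. A sketch contrasting the two on the same matrices (`elementwise` and `matrix_product` are illustrative names):

```python
import numpy as np

M1 = np.arange(9).reshape(3, 3)
M2 = np.arange(2, 11).reshape(3, 3)

elementwise = M1 * M2      # each element multiplied by its counterpart
matrix_product = M1 @ M2   # rows of M1 combined with columns of M2

print(elementwise)
print(matrix_product)

# For 2-D arrays, np.dot gives the same result as @.
same_as_dot = np.array_equal(matrix_product, np.dot(M1, M2))
```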

- When to use **ndarray**:
    - Multi dimensional data
    - single data type
    - complex numerical calculations

- When to use **DataFrame**:
    - Two dimensional data
    - Multiple data types
    - data analysis
    - data visualization

## Spark and PySpark DataFrames in Python

- pandas is designed to work on a single machine
- Performance is bounded by that machine's memory
- chunking
    - you load the data in chunks rather than all at once, and handle one chunk at a time
- upper limit: gigabytes
    - how big can your data be and still use pandas? (roughly 1 - 5 GB without chunking; up to about 100 GB with chunking)
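
Chunking can be sketched with the `chunksize` argument of `pd.read_csv`, which returns an iterator of smaller DataFrames instead of one large one. The example below reads from an in-memory string so it runs on its own; the five rows and the chunk size of 2 are made up for illustration, and with a real file you would pass its path instead:

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a file too large to load at once.
csv_text = "id,mark\n1,75\n2,85\n3,55\n4,60\n5,60\n"

total = 0
rows = 0
# Each chunk is an ordinary DataFrame with at most 2 rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["mark"].sum()
    rows += len(chunk)

mean_mark = total / rows
print(rows, total, mean_mark)  # 5 335 67.0
```

Only running totals are kept between chunks, so memory use depends on the chunk size rather than on the size of the whole file.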
    
**Big Data**
- Terabytes of data
    - can't use pandas for this
- Solution: **Hadoop and Spark**
    - they use distributed computing
    - use multiple nodes (different VMs or computers)


- **Spark**
    - distributed DataFrames, running on the JVM
    - written in Scala
    - PySpark library
    - Data sources include Hadoop HDFS, S3 and streaming
    - uses lazy evaluation
    
    ********
    
- Eager (Pandas)
    - Each operation calculates its result before the next operation is run
    - debugging is straightforward
- Lazy (Spark)
    - operations are stacked up, and results are calculated only when an output is actually requested, not after each operation
    - debugging is challenging
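
The difference between eager and lazy evaluation can be illustrated in plain Python, without Spark: a list comprehension behaves eagerly and a generator behaves lazily. This is only an analogy for the idea, not how Spark is implemented; `double` and `log` are hypothetical names:

```python
log = []

def double(values):
    # Lazy: nothing in this body runs until a value is requested.
    for v in values:
        log.append(f"double {v}")
        yield v * 2

# Eager: the result is calculated as soon as this line runs.
eager = [v * 2 for v in [1, 2, 3]]

# Lazy: this line only sets up the work.
lazy = double([1, 2, 3])
log_before = list(log)   # still empty, no work has been done

# Requesting the output triggers the calculation.
result = list(lazy)

print(eager, log_before, result, len(log))
```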

## Creating Dask DataFrames in Python

### Dask

- python native
- distributed operations
- wraps both pandas DataFrame and NumPy arrays
- Arbitrary Python code

PS: Dask is basically equivalent to PySpark, but better

In [148]:
import pandas as pd
import random

In [149]:
leng = 1000000 # this is to mimic big data
data = {'a': (random.randint(0,100) for _ in range(leng)),
        'b': (random.randint(2,200) for _ in range(leng))
       }

In [150]:
df = pd.DataFrame(data)
df.head()
# this will take a bit of time

Unnamed: 0,a,b
0,38,46
1,85,15
2,0,164
3,73,45
4,80,78


In [151]:
df.std()

a    29.141312
b    57.511943
dtype: float64

`Now let us create a Dask DF with the same data`
- need to install Dask via `python3 -m pip install dask`

In [152]:
import dask.dataframe as dd

In [153]:
ddf = dd.from_pandas(df, npartitions=3) # npartitions is the number of partitions you want to divide your data into

In [154]:
ddf # there is no actual values, because of lazy evaluation

Unnamed: 0_level_0,a,b
npartitions=3,Unnamed: 1_level_1,Unnamed: 2_level_1
0,int64,int64
333334,...,...
666667,...,...
999999,...,...


In [155]:
ddf.std() # no value either

Dask Series Structure:
npartitions=1
a    float64
b        ...
dtype: float64
Dask Name: astype, 8 graph layers

In [156]:
# to force an evaluation, use
ddf.std().compute()

a    29.141312
b    57.511943
dtype: float64

In [157]:
result = ddf.a.sum() - ddf.b.sum()
result # again, not computed unless we use .compute()

dd.Scalar<sub-f02..., dtype=int64>

In [158]:
result.compute()

-51063488

Dask creates a task graph behind the scenes

In [159]:
result.dask

layer_type,MaterializedLayer
is_materialized,True
number of outputs,3
npartitions,3
columns,"['a', 'b']"
type,dask.dataframe.core.DataFrame
dataframe_type,pandas.core.frame.DataFrame
series_dtypes,"{'a': dtype('int64'), 'b': dtype('int64')}"

layer_type,Blockwise
is_materialized,False
number of outputs,3
depends on,from_pandas-1eaa284e7487eebadc269b1167c6f22b

layer_type,Blockwise
is_materialized,False
number of outputs,3
depends on,getitem-be20ba1a2aa26e5ac4869a963ca276f7

layer_type,DataFrameTreeReduction
is_materialized,True
number of outputs,1
depends on,series-sum-chunk-cb30747309ae95f10eb7e5b658aa519f-74493da623d77ca1d129f5a7f0bac58f

layer_type,Blockwise
is_materialized,False
number of outputs,3
depends on,from_pandas-1eaa284e7487eebadc269b1167c6f22b

layer_type,Blockwise
is_materialized,False
number of outputs,3
depends on,getitem-e049fa3a5604208363a42fedd20f9a10

layer_type,DataFrameTreeReduction
is_materialized,True
number of outputs,1
depends on,series-sum-chunk-894c29fa95391060b665f8509a04e9c4-667cb3b3dd77cc5301d285312ebc9917

layer_type,MaterializedLayer
is_materialized,True
number of outputs,1
depends on,series-sum-agg-894c29fa95391060b665f8509a04e9c4
,series-sum-agg-cb30747309ae95f10eb7e5b658aa519f


In [160]:
# Dask can also visualize the graph
# result.visualize() # raised an error here because neither the graphviz nor the ipycytoscape package was installed

#### How to choose a framework

- Pandas or NumPy vs PySpark or Dask
    - depends on the size of the data: big data => PySpark or Dask
- Pandas vs NumPy
    - Nature of the data: a single data type with highly efficient numerical operations => NumPy
    - tabular data with mixed data types => Pandas (simple analysis or visualization)
- PySpark vs Dask
    - Nature of the enterprise: PySpark is a mature framework
    - Dask is Python native and lets you take advantage of pandas and NumPy

# Week 4
- Utilize Vim and VSCode to write Python code
- Develop your project with Git for version control

### VIM modes
- Normal
- Insert
- Visual
- Command line
***
- Esc key takes you to normal mode
***
- h - move left
- j - move down
- k - move up
- l - move right
- w - move forward to the start of the next word
- b - beginning of the current word
- e - end of the current word
- gg - top of the file
- G - end of the file
***
- y - copy highlighted text
- yw - copy word
- yy - copy line
- p - paste
- x - delete a character
- dd - delete a line
**********
### Insert Mode
- i - before cursor
- I - before line
- a - after cursor
- A - end of line
- o - new line below cursor
- O - new line above cursor
**********
### Visual mode
- v - by character selection
- V - by line selection
- Ctrl-v - visual block
**********
## Working with Vim Command Line
### Entering Command line
- : - execute command
- / - search down
- ? - search up
- n - move to next match (when you are searching for something)
- N - move to the previous match (when you are searching for something)
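- search and replace: `:s/pass/cat/g`
    - s for substitute
    - pass is the pattern we are searching for
    - cat is the text replacing pass
    - g replaces every match on the line, not just the first (use `:%s/pass/cat/g` to cover the whole file)
- :! - run an external shell command (with a range, filter those lines through it)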
**********
### Basic command
- help - documentation
- q - close buffer
- w - save buffer
- e - open file
- new - new buffer
- sav - save buffer as
***
## Vim Configuration
- configure Vim the way you want
- put your settings in .vimrc
- search online for example configurations