## Numpy Arrays
* multi-dimensional arrays of data
* can be 1d, 2d, 3d, or higher
* computationally efficient for linear algebra operations
* elementwise operations
* row and column wise calculations
* https://docs.scipy.org/doc/numpy-1.11.0/numpy-user-1.11.0.pdf

## Arrays vs. Matrix Class
* "Short Answer: Use Arrays."
* They are the standard vector/matrix/tensor type of numpy. Many numpy functions return arrays, not matrices.
* The array class is intended to be a general-purpose n-dimensional array for many kinds of numerical computing, while matrix is intended to facilitate linear algebra computations specifically.

In [1]:
import numpy as np

#### 1-D arrays
* note this only has one dimension
* we can use the reshape function to change this (more on this later)

In [5]:
a = np.array([1,2,3,4,5,6])

In [6]:
a.shape

(6,)

In [7]:
b = np.reshape(a, (3,2))
b.shape

(3, 2)

In [8]:
b

array([[1, 2],
       [3, 4],
       [5, 6]])

In [11]:
b[1,0]

3

In [12]:
a[1]

2

#### indexing and slicing

In [13]:
a = np.array([1,2,3,4,5,6])

In [14]:
a[0]

1

In [17]:
a[0:2]

array([1, 2])

In [18]:
a[1:3]

array([2, 3])

In [19]:
a[-1]

6

In [20]:
type(a)

numpy.ndarray

In [15]:
for i in a:
    print(i)

1
2
3
4
5
6


#### looping
* numpy loops through the left most axis

In [16]:
array = np.array([
    [1,2,3],
    [2,3,4]
])

array.shape

(2, 3)

In [17]:
# since we have 2 items in the first axis we will end up with 2 items in our iteration
for i in array:
    print(i)

[1 2 3]
[2 3 4]


In [18]:
array = np.array([
    [1,2,3,4],
    [2,3,4,4],
    [2,3,4,4]
])

array1 = np.array([
    [5,6,7,4],
    [4,6,9,4],
    [2,3,4,4]
])

In [27]:
# depth 2
# rows 3
# columns 4
tensor = np.array([array, array1])
tensor.shape

(2, 3, 4)

* remember, numpy iterates over the left-most axis
* in this 3-d cube, that means iterating over the depth
* so we end up printing our 2 initial matrices

In [28]:
for i in tensor:
    print(i, "\n")

[[1 2 3 4]
 [2 3 4 4]
 [2 3 4 4]] 

[[5 6 7 4]
 [4 6 9 4]
 [2 3 4 4]] 



#### performing matrix operations
* axis 0 is the vertical (up and down) axis
* its length is the number of rows
* so any operation done along axis 0 runs down each column and returns one value per column
* axis 1 is the horizontal axis; its length is the number of columns
* any sum or other operation along axis 1 runs across each row and returns one value per row

In [19]:
import numpy as np 
array = np.array([
    [1,2,3,4],
    [2,3,4,4],
    [2,3,4,4]
])

In [20]:
import pandas as pd

In [29]:
df = pd.DataFrame(array)
df.columns = ["A", "B", "C", "D"]
df
#df.sum(axis = 0)

Unnamed: 0,A,B,C,D
0,1,2,3,4
1,2,3,4,4
2,2,3,4,4


In [28]:
np.sum(array, axis = 0)

array([ 5,  8, 11, 12])

In [17]:
df.sum(axis = 0)

A     5
B     8
C    11
D    12
dtype: int64

In [2]:
array.shape

(3, 4)

In [3]:
array

array([[1, 2, 3, 4],
       [2, 3, 4, 4],
       [2, 3, 4, 4]])

In [25]:
assert len(np.sum(array, axis = 0)) == array.shape[1], "axis don't align"

In [5]:
np.sum(array, axis = 0)

array([ 5,  8, 11, 12])

In [35]:
np.median(array, axis = 0)

array([2., 3., 4., 4.])

In [6]:
np.median(array, axis = 1)

array([2.5, 3.5, 3.5])

* the alternative is to loop in Python, which is a far more costly operation

In [7]:
for i in array:
    print(sum(i))

10
13
13
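
To make the cost difference concrete, here is a minimal sketch that times a Python-level loop against `np.sum` on a larger array. The array size and the use of `time.perf_counter` are assumptions chosen for illustration; exact timings will vary by machine.

```python
import time

import numpy as np

# a larger array than the toy example, so the difference is measurable
big = np.arange(200_000).reshape(50_000, 4)

# row sums with a Python loop over the left-most axis
start = time.perf_counter()
loop_sums = [sum(row) for row in big]
loop_seconds = time.perf_counter() - start

# the same row sums with a single vectorized call
start = time.perf_counter()
vector_sums = np.sum(big, axis=1)
vector_seconds = time.perf_counter() - start

# both approaches give identical results; only the cost differs
print(np.array_equal(loop_sums, vector_sums))
print(f"loop: {loop_seconds:.4f}s, vectorized: {vector_seconds:.4f}s")
```

On most machines the vectorized call should be noticeably faster, because the loop runs in compiled code inside numpy rather than in the Python interpreter.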


In [9]:
array

array([[1, 2, 3, 4],
       [2, 3, 4, 4],
       [2, 3, 4, 4]])

In [10]:
array.T

array([[1, 2, 2],
       [2, 3, 3],
       [3, 4, 4],
       [4, 4, 4]])

In [8]:
for i in array.T:
    print(sum(i))

5
8
11
12


In [39]:
array

array([[1, 2, 3, 4],
       [2, 3, 4, 4],
       [2, 3, 4, 4]])

#### transpose, turn rows into columns

In [40]:
array.T

array([[1, 2, 2],
       [2, 3, 3],
       [3, 4, 4],
       [4, 4, 4]])

<h3 style="color:blue">Make a 2-d array in numpy:</h3>
<p style="color:blue">- find the row wise mean</p>
<p style="color:blue">- find the column wise mean</p>
<p style="color:blue">- do the same for the transpose of the array you've made</p>

#### other useful metrics
* https://docs.scipy.org/doc/numpy/reference/generated/numpy.quantile.html
* https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html

In [46]:
array

array([[1101, 2020, 3212],
       [2212, 3112, 4121]])

In [47]:
np.std(array, axis = 0)

array([555.5, 546. , 454.5])

In [48]:
np.mean(array, axis = 0)

array([1656.5, 2566. , 3666.5])

In [49]:
np.min(array, axis = 0)

array([1101, 2020, 3212])

In [50]:
np.max(array, axis = 0)

array([2212, 3112, 4121])

In [51]:
np.quantile(array,.5, axis = 0)

array([1656.5, 2566. , 3666.5])

In [52]:
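# note: percentile expects q on a 0-100 scale, so .5 here is the 0.5th percentile,
# not the median; np.percentile(array, 50, axis = 0) matches np.quantile(array, .5, axis = 0)
np.percentile(array,.5, axis = 0)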

array([1106.555, 2025.46 , 3216.545])
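
The two functions differ only in the scale of `q`. A small sketch, reusing the values from the cells above (the variable names are my own), shows how the median can be obtained from either one:

```python
import numpy as np

values = np.array([
    [1101, 2020, 3212],
    [2212, 3112, 4121]
])

# quantile takes q between 0 and 1; percentile takes q between 0 and 100
median_from_quantile = np.quantile(values, 0.5, axis=0)
median_from_percentile = np.percentile(values, 50, axis=0)

# both match the column-wise median
print(np.allclose(median_from_quantile, median_from_percentile))
print(np.allclose(median_from_quantile, np.median(values, axis=0)))
```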

#### reshape
* change the shape of your data
* can add or remove dimensions, as long as the total number of elements stays the same
* https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html

In [34]:
# this is indexed by one element
array = np.array([1,2,3,4])
array

array([1, 2, 3, 4])

In [55]:
array[0]

1

In [56]:
array = np.reshape(array, (4,1))
array.shape

(4, 1)

In [57]:
array[0][0]

1

In [35]:
# array is currently (4, 1); reshape it again into 2 rows and 2 columns
array1 = np.reshape(array,(2,2))

In [36]:
array1

array([[1, 2],
       [3, 4]])

In [37]:
array = np.array([1,2,3,4,5,6,7,8])

In [38]:
array

array([1, 2, 3, 4, 5, 6, 7, 8])

In [39]:
array1 = np.reshape(array, (2,2,2))

In [40]:
array1

array([[[1, 2],
        [3, 4]],

       [[5, 6],
        [7, 8]]])

In [41]:
array1.shape

(2, 2, 2)

In [42]:
for i in array1:
    print(i, "\n")

[[1 2]
 [3 4]] 

[[5 6]
 [7 8]] 
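
Reshape only works when the new shape holds exactly the same number of elements. The sketch below illustrates that rule and the `-1` placeholder, which asks numpy to infer one dimension; the variable names are illustrative only.

```python
import numpy as np

flat = np.arange(1, 9)          # 8 elements: 1..8

# -1 asks numpy to infer that dimension from the total size
two_rows = flat.reshape(2, -1)  # shape (2, 4)
cube = flat.reshape(-1, 2, 2)   # shape (2, 2, 2)

print(two_rows.shape, cube.shape)

# a shape whose sizes do not multiply to 8 is rejected
try:
    flat.reshape(3, 3)
except ValueError as err:
    print("reshape failed:", err)
```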



#### norm of a vector
* https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html
* defaults to the l2 norm

In [43]:
array = np.array([1,2,3])
np.linalg.norm(array)

3.7416573867739413

In [67]:
np.sqrt(14)

3.7416573867739413

#### matrix multiplication/dot product

In [44]:
array = np.array([
    [1,2],
    [2,3]
])

array1 = np.array([
    [5,6],
    [4,6]
])

In [45]:
# shortcut
array@array1

array([[13, 18],
       [22, 30]])

In [143]:
np.matmul(array, array1)

array([[13, 18],
       [22, 30]])

* https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html

In [144]:
np.dot(array, array1)

array([[13, 18],
       [22, 30]])

#### remember the dimension constraints

In [46]:
array = np.array([
    [1,2,3],
    [2,3,3]
])

array1 = np.array([
    [5,6,2],
    [4,6,1]
])

In [47]:
print(array.shape, array1.shape)

(2, 3) (2, 3)


In [48]:
assert array.shape[1] == array1.shape[0], "The dimensions aren't aligned"

array@array1

AssertionError: The dimensions aren't aligned

In [49]:
array = np.array([
    [1,2,3],
    [2,3,3]
])

array1 = np.array([
    [5,6,2,5,3,2],
    [4,6,1,3,4,5],
    [2,3,2,2,1,3]
])

print(array.shape, array1.shape)
assert array.shape[1] == array1.shape[0], "The dimensions aren't aligned"

array@array1

(2, 3) (3, 6)


array([[19, 27, 10, 17, 14, 21],
       [28, 39, 13, 25, 21, 28]])

<h3 style="color:blue">Make udf that:</h3>
<p style="color:blue">- takes 2 2-d arrays as params</p>
<p style="color:blue">- tests to make sure they can be multiplied using assert</p>
<p style="color:blue">- return the dot product of the two matrices</p>

#### broadcasting
* numpy stretches the smaller array across the larger one so that their shapes match before the elementwise operation
* https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html

In [78]:
array = np.array([
    [1,2,3,1],
    [2,3,3,1]
])

array1 = np.array([1,1,1,1])
array1 = np.reshape(array1, (4,))
array1

array([1, 1, 1, 1])

In [79]:
array + array1

array([[2, 3, 4, 2],
       [3, 4, 4, 2]])

In [58]:
array

array([[1, 2, 3],
       [2, 3, 3]])

In [59]:
array1

array([[1, 1, 1]])

In [60]:
np.add(array, array1)

array([[2, 3, 4],
       [3, 4, 4]])

In [120]:
np.subtract(array, array1)

array([[0, 1, 2],
       [1, 2, 2]])

In [73]:
array + array1

array([[2, 3, 4],
       [3, 4, 4]])

In [74]:
array - array1

array([[0, 1, 2],
       [1, 2, 2]])
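
Broadcasting compares shapes from the right-most dimension, and a dimension of size 1 (or a missing one) can be stretched. The following sketch, with made-up values, shows a row vector, a column vector, and a shape that cannot be broadcast:

```python
import numpy as np

matrix = np.array([
    [1, 2, 3, 1],
    [2, 3, 3, 1]
])                                   # shape (2, 4)

row = np.array([10, 20, 30, 40])     # shape (4,)   -> added to every row
col = np.array([[100], [200]])       # shape (2, 1) -> added to every column

print(matrix + row)
print(matrix + col)

# shapes are compared from the right; (2, 4) and (3,) do not line up
try:
    matrix + np.array([1, 2, 3])
except ValueError as err:
    print("cannot broadcast:", err)
```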

#### elementwise powers
* elementwise operations like this are what make arrays so powerful

In [127]:
array**5

array([[  1,  32, 243],
       [ 32, 243, 243]])

In [128]:
np.power(array, 5)

array([[  1,  32, 243],
       [ 32, 243, 243]])

<h3 style="color:blue">Make udf that:</h3>
<p style="color:blue">- takes 2 2-d arrays as params</p>
<p style="color:blue">- iterates the rows and columns and squares each element</p>
<p style="color:blue">- puts the squared items in a list</p>
<p style="color:blue">- returns the list</p>

In [88]:
array = np.array([
    [1,2,3,1],
    [2,3,3,1]
])

lst = [
    [1,2,3,1],
    [2,3,3,1]
]

pd.DataFrame(lst)

Unnamed: 0,0,1,2,3
0,1,2,3,1
1,2,3,3,1


In [96]:
array.sum()

16

In [97]:
x = np.sum(lst, axis = 1)
x

array([7, 9])

In [98]:
x = x.sum()

In [86]:
array.mean()

2.0

#### argmax and argmin
* returns the index of the max/min value
* In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned.
* https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html

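In [None]:
# naive approach (not meant to be run): a nested Python loop over every user/cluster pair
for user in range(100000000):
    for cluster in range(1000):
        pass  # compute the distance between this user and this cluster

In [None]:
# vectorized plan: build the two matrices, compute one distance matrix, then take argmin per row
user_matrix = []

cluster_center_matrix = []

# distance matrix: one row per user, one column per cluster
# np.argmin(distance_matrix, axis = 1) then gives the closest cluster for each user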

In [109]:
feature = ["C1", "C2", "C3", "C4", "C5"]
array = np.array([1,2,3,4,15])
max_idx = np.argmax(array)  # index of the largest value
feature[max_idx]

'C5'

In [110]:
np.argmin(array)

0

In [115]:
array = np.array([
    [1,6,3,4],
    [3,10,5,2],
    [15,8,9,2]
])

In [118]:
list(np.argmax(array, axis = 0)) # column level

[2, 1, 2, 0]

In [117]:
np.argmax(array, axis = 1) # row level

array([1, 1, 0])

#### let's say these are distances
* each row is an item
* and each column is the distance to a cluster

In [119]:
array = np.array([
    [1,6,3],
    [3,10,5],
    [15,8,9]
])

In [120]:
# this could then give us the index of the min distance
np.argmin(array, axis = 1)

array([0, 0, 1])

In [124]:
items = ["user_A", "user_B", "user_C"]
clusters = ["A", "B", "C"]
closest_cluster_idx = np.argmin(array, axis = 1)

In [125]:
closest_cluster_idx

array([0, 0, 1])

In [126]:
closest_cluster_labels = [clusters[idx] for idx in closest_cluster_idx]
closest_cluster_labels

['A', 'A', 'B']

In [130]:
assign_df = pd.DataFrame(list(zip(items, closest_cluster_labels)))
assign_df.columns = ["user", "cluster"]
assign_df

Unnamed: 0,user,cluster
0,user_A,A
1,user_B,A
2,user_C,B
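
The list comprehension used for the labels can also be written with numpy indexing. This is a sketch under the assumption that the labels are stored in a numpy array; the `distances` name is mine.

```python
import numpy as np

distances = np.array([
    [1, 6, 3],
    [3, 10, 5],
    [15, 8, 9]
])
clusters = np.array(["A", "B", "C"])

# index of the smallest distance in each row
closest_idx = np.argmin(distances, axis=1)

# indexing an array with an array of positions returns the matching labels
closest_labels = clusters[closest_idx]
# the smallest distance itself, picked out row by row
closest_dist = distances[np.arange(len(distances)), closest_idx]

print(closest_labels)
print(closest_dist)
```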


In [132]:
array = np.array([
    items,
    closest_cluster_labels
]).T

array

array([['user_A', 'A'],
       ['user_B', 'A'],
       ['user_C', 'B']], dtype='<U6')

In [81]:
array = np.array([
    items,
    [clusters[idx] for idx in np.argmin(array, axis = 1)]
]).T

array

array([['item_A', 'B'],
       ['item_B', 'B'],
       ['item_C', 'B']], dtype='<U6')

In [133]:
np.argmax(np.array([1,1,2,2,5,5,5]))

4

#### argsort

In [139]:
list(df.columns)

['A', 'B', 'C', 'D']

In [143]:
array = np.array([1,3,6,0,51,12])

In [144]:
sorted_args = np.argsort(array)

In [145]:
sorted_args

array([3, 0, 1, 2, 5, 4])

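In [136]:
# df.columns has only 4 labels, so positions 4 and 5 are out of range for this 6-element array
[list(df.columns)[idx] for idx in sorted_args]

IndexError: list index out of range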

In [31]:
[array[idx] for idx in sorted_args]

[0, 1, 3, 6, 12, 51]

In [148]:
array = np.array([
    [1,6,3],
    [3,10,5],
    [15,8,9]
])

In [None]:
items = ["item_A", "item_B", "item_C"]
clusters = ["A", "B", "C"]
closest_cluster = np.argmin(array, axis = 1)

In [38]:
array

array([[ 1,  6,  3],
       [ 3, 10,  5],
       [15,  8,  9]])

In [37]:
np.argsort(array, axis = 1)

array([[0, 2, 1],
       [0, 2, 1],
       [1, 2, 0]])

In [149]:
def myfunc(x):
    items = ["item_A", "item_B", "item_C"]
    return items[x]

vectorized_func = np.vectorize(myfunc)

In [150]:
# array holds distances, not positions, so values such as 6 or 15 are not valid indices into items
vectorized_func(array)

IndexError: list index out of range
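
One way to avoid the error is to look up labels by the positions that `argsort` returns rather than by the raw distances. A minimal sketch, assuming the labels are held in a numpy array so that array indexing can replace `np.vectorize`:

```python
import numpy as np

distances = np.array([
    [1, 6, 3],
    [3, 10, 5],
    [15, 8, 9]
])
items = np.array(["item_A", "item_B", "item_C"])

# argsort returns positions 0..2, which are always valid indices into items
order = np.argsort(distances, axis=1)

# label each position; the result has the same shape as order
ranked_items = items[order]
print(ranked_items)
```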

## Distance Measures
* used in many ML algorithms
* recommender systems, k-means, k-nearest neighbors
* https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

#### euclidean distance
* magnitude makes a difference

In [153]:
a = np.array([1,2,3])
b = np.array([4,2,500])

In [154]:
np.sqrt(np.sum((a-b)**2))

497.00905424348156

#### cosine similarity

In [18]:
# a and b are the vectors from the euclidean example above
round(np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b)), 3)

0.806
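
Unlike euclidean distance, cosine similarity depends only on direction, not magnitude. The sketch below contrasts the two on a vector and a scaled copy of it; the helper function names are my own.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return np.linalg.norm(u - v)

a = np.array([1, 2, 3])
scaled = 10 * a   # same direction as a, ten times the magnitude

# direction is unchanged, so cosine similarity stays at 1 (up to rounding)
print(cosine_similarity(a, scaled))
# magnitude changed, so the euclidean distance is large
print(euclidean_distance(a, scaled))
```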

#### jaccard coefficient

In [155]:
a = ["the", "old", "man", "by", "the", "sea"]
b = ["the", "old", "man", "by", "the", "chair"]

In [156]:
def jaccard_similarity(a, b):
    return len(set(a).intersection(set(b))) / len(set(a).union(set(b)))

In [157]:
jaccard_similarity(a,b)

0.6666666666666666

#### distance matrix
* pdist returns a condensed distance matrix (only the upper triangle), since distance matrices are symmetric
* squareform expands it into the full square matrix

In [160]:
from scipy.spatial.distance import pdist, squareform

In [161]:
array = np.array([
    [1,2,3,4],
    [2,4,2,1],
    [2,3,4,5]
    
])

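In [None]:
# the naive alternative: compare every row with every other row in a Python loop
for i in array:
    for i2 in array:
        print(np.linalg.norm(i - i2))  # euclidean distance between the two rows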

In [163]:
squareform(pdist(array.T, "euclidean"))

array([[0.        , 2.44948974, 2.82842712, 4.35889894],
       [2.44948974, 0.        , 2.44948974, 4.12310563],
       [2.82842712, 2.44948974, 0.        , 1.73205081],
       [4.35889894, 4.12310563, 1.73205081, 0.        ]])
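
For intuition, the same kind of distance matrix can be built with plain numpy broadcasting. This is only a sketch of the idea, not how scipy implements `pdist`; note that it measures distances between rows, whereas the call above passes `array.T` and so compares columns.

```python
import numpy as np

points = np.array([
    [1, 2, 3, 4],
    [2, 4, 2, 1],
    [2, 3, 4, 5]
])

# differences between every pair of rows: shape (3, 3, 4)
diffs = points[:, None, :] - points[None, :, :]

# sum the squared differences over the last axis, then take the root
dist_matrix = np.sqrt((diffs ** 2).sum(axis=-1))

print(dist_matrix.round(3))
```

The intermediate `diffs` array grows with the square of the number of rows, so this approach suits small inputs.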

### Pandas
* data wrangling
* groupby
* extends numpy with dataframe capabilities
* similar to R dataframes
* integrates with numpy
* https://pandas.pydata.org/

In [164]:
import pandas as pd

## Series
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html
* One-dimensional ndarray with axis labels (including time series)

In [165]:
vals = [1,2,3,4,5]
idxs = ["a","b","c","d","e"]

my_series = pd.Series(vals, idxs)

In [172]:
my_series[["a","b"]]

a    1
b    2
dtype: int64

In [170]:
df["A"]

0    1
1    2
2    2
Name: A, dtype: int64

### Dataframe
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
* core object in pandas
* 2 dimensional
* index and columns

In [198]:
array1 = np.array([
    [5,6,2,5,3,2],
    [4,6,1,3,4,5],
    [2,3,2,2,1,3]
])

In [199]:
df = pd.DataFrame(array1)

In [200]:
#df.shape

In [201]:
#df.head(1)
#df.tail(1)

In [202]:
df

Unnamed: 0,0,1,2,3,4,5
0,5,6,2,5,3,2
1,4,6,1,3,4,5
2,2,3,2,2,1,3


In [203]:
df.columns = ["a", "b", "c", "d", "e", "f"]

In [204]:
df

Unnamed: 0,a,b,c,d,e,f
0,5,6,2,5,3,2
1,4,6,1,3,4,5
2,2,3,2,2,1,3


In [205]:
df = df.set_index("b")
df

b,a,c,d,e,f
6,5,2,5,3,2
6,4,1,3,4,5
3,2,2,2,1,3


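In [None]:
# pattern sketch: pass a dataframe through a series of functions, reassigning the result each time
# (func, func1 and func2 are placeholders and are not defined in this notebook)
# df_total = func(df)
# df_total = func1(df_total)
# df = func2(df)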

In [197]:
df

b,a,c,d,e,f
6,5,2,5,3,2
6,4,1,3,4,5
3,2,2,2,1,3


In [207]:
array = [
    ["a",2],
    ["a",2],
    ["b",2]
]

df = pd.DataFrame(array)
df.columns = ["aa", "bb"]
df

Unnamed: 0,aa,bb
0,a,2
1,a,2
2,b,2


In [213]:
df.groupby("aa").sum().reset_index().set_index("bb")

bb,aa
4,a
2,b


In [187]:
df.index

Int64Index([5, 4, 2], dtype='int64', name='a')

In [188]:
df

a,b,c,d,e,f
5,6,2,5,3,2
4,6,1,3,4,5
2,3,2,2,1,3


#### change headers

In [215]:
df

Unnamed: 0,aa,bb
0,a,2
1,a,2
2,b,2


In [239]:
"Flavor / Scent ".rstrip().replace(" ", "")

'Flavor/Scent'

In [230]:
df.columns = ["AfterHour", "b"]

In [233]:
# attribute access cannot be used when a column name contains spaces
df.a a a

SyntaxError: invalid syntax (<ipython-input-233-7dd68b996720>, line 1)

In [227]:
df.a_a_a

0    a
1    a
2    b
Name: a_a_a, dtype: object

In [None]:
"10201020101"

In [None]:
df["a"] = df["a"].apply(int)

In [237]:
assert df.dtypes["b"] == int, "column must be integer"

#### accessing columns
* returns series
* index still attached to the series

In [107]:
df["a"]

0    5
1    4
2    2
Name: a, dtype: int64

In [108]:
type(df["a"])

pandas.core.series.Series

In [109]:
df["a"].tolist()

[5, 4, 2]

In [110]:
df["a"].index

RangeIndex(start=0, stop=3, step=1)

In [111]:
df[["a", "b"]]

Unnamed: 0,a,b
0,5,6
1,4,6
2,2,3


#### index subsetting
* iloc[] is position selecting
* iloc[rows, columns]
* rows/columns can be a list of indices (integers), or use : to separate a range
* [2:] means give me 2 and everything after for slicing
* [1] means give me index 1
* [1,2,3] means give me index 1,2,3
* [1],[0,1] means give me row 1 and columns 0 and 1

In [112]:
df.iloc[1]

a    4
b    6
c    1
d    3
e    4
f    5
Name: 1, dtype: int64

In [113]:
# need integer based subsetting
df.iloc[1,[0,1,2]]

a    4
b    6
c    1
Name: 1, dtype: int64

In [114]:
df.iloc[[1,2]]

Unnamed: 0,a,b,c,d,e,f
1,4,6,1,3,4,5
2,2,3,2,2,1,3


In [115]:
df.iloc[[0,2,1]]

Unnamed: 0,a,b,c,d,e,f
0,5,6,2,5,3,2
2,2,3,2,2,1,3
1,4,6,1,3,4,5


In [116]:
df.iloc[0:,0:]

Unnamed: 0,a,b,c,d,e,f
0,5,6,2,5,3,2
1,4,6,1,3,4,5
2,2,3,2,2,1,3


In [117]:
df.iloc[0:,0:1]

Unnamed: 0,a
0,5
1,4
2,2


In [118]:
df.iloc[[1],[0,1]]

Unnamed: 0,a,b
1,4,6


#### loc
* used when we have labels rather than integer positions

In [119]:
df.index

RangeIndex(start=0, stop=3, step=1)

In [120]:
df.index = ["aa", "bb", "cc"]

In [121]:
df

Unnamed: 0,a,b,c,d,e,f
aa,5,6,2,5,3,2
bb,4,6,1,3,4,5
cc,2,3,2,2,1,3


In [122]:
df.loc["aa"]

a    5
b    6
c    2
d    5
e    3
f    2
Name: aa, dtype: int64

In [123]:
df.loc[["aa", "bb"]]

Unnamed: 0,a,b,c,d,e,f
aa,5,6,2,5,3,2
bb,4,6,1,3,4,5


In [124]:
df.loc[["aa"], ["a","b"]]

Unnamed: 0,a,b
aa,5,6


<h3 style="color:blue">how might we</h3>
<p style="color:blue">- filter if we have a large number of indices?</p>
<p style="color:blue">- we don't want to type out a list of 1000s of index labels</p>

In [126]:
rows = ["aa", "bb"]
cols = ["a", "c"]

df.loc[rows,cols]

Unnamed: 0,a,c
aa,5,2
bb,4,1


#### subset for columns

In [11]:
df1 = df[["a", "b", "c", "d"]]

In [12]:
df1

Unnamed: 0,a,b,c,d
0,5,6,2,5
1,4,6,1,3
2,2,3,2,2


#### describe

In [13]:
df1.describe()

Unnamed: 0,a,b,c,d
count,3.0,3.0,3.0,3.0
mean,3.666667,5.0,1.666667,3.333333
std,1.527525,1.732051,0.57735,1.527525
min,2.0,3.0,1.0,2.0
25%,3.0,4.5,1.5,2.5
50%,4.0,6.0,2.0,3.0
75%,4.5,6.0,2.0,4.0
max,5.0,6.0,2.0,5.0


#### correlate

In [14]:
df1.corr()

Unnamed: 0,a,b,c,d
a,1.0,0.944911,-0.188982,0.928571
b,0.944911,1.0,-0.5,0.755929
c,-0.188982,-0.5,1.0,0.188982
d,0.928571,0.755929,0.188982,1.0


#### summations

In [17]:
df1.sum(axis = 0)

a    11
b    15
c     5
d    10
dtype: int64

In [18]:
df1.sum(axis = 1)

0    18
1    14
2     9
dtype: int64

#### median

In [19]:
df1.median(axis = 0)

a    4.0
b    6.0
c    2.0
d    3.0
dtype: float64

#### new column

In [22]:
# df1 was created by selecting columns from df, so pandas warns that we may be writing to a copy;
# building it with df[["a", "b", "c", "d"]].copy() avoids the warning
df1["class"] = ["AA", "AA", "BB"]

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [23]:
df1

Unnamed: 0,a,b,c,d,class
0,5,6,2,5,AA
1,4,6,1,3,AA
2,2,3,2,2,BB


In [24]:
df1.groupby("class").sum()

class,a,b,c,d
AA,9,12,3,8
BB,2,3,2,2


In [25]:
df1.groupby("class").count()

class,a,b,c,d
AA,2,2,2,2
BB,1,1,1,1


In [27]:
# count distinct
df1.groupby("class").nunique()

class,a,b,c,d,class
AA,2,1,2,2,1
BB,1,1,1,1,1


#### reset index

In [28]:
df1.groupby("class").count().index

Index(['AA', 'BB'], dtype='object', name='class')

In [29]:
df1.groupby("class").count().reset_index()

Unnamed: 0,class,a,b,c,d
0,AA,2,2,2,2
1,BB,1,1,1,1


In [30]:
df1.groupby("class").count().reset_index().index

RangeIndex(start=0, stop=2, step=1)

#### reading in csv

In [63]:
path = "data/iris.csv"
df = pd.read_csv(path, sep = ",")

In [33]:
df.head(5)

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


<h3 style="color:blue">read in the iris csv:</h3>
<p style="color:blue">- find the descriptive statistics</p>
<p style="color:blue">- create a correlation matrix </p>

#### subset on rules

In [36]:
df1 = df[df["sepal.length"] > 6]
df1.shape

(61, 5)

In [37]:
df1 = df[(df["sepal.length"] > 6) & (df["sepal.width"] > 3)]
df1.shape

(23, 5)

In [38]:
df1 = df[df["variety"] == "Setosa"]
df1.shape

(50, 5)

#### isin()

In [39]:
df["variety"].unique()

array(['Setosa', 'Versicolor', 'Virginica'], dtype=object)

In [40]:
lst = ["Setosa", "Versicolor"]

df1 = df[df["variety"].isin(lst)]
df1.shape

(100, 5)

#### like

In [42]:
df1 = df[df["variety"].str.contains("Versi")]
df1.shape

(50, 5)

#### apply

In [43]:
def my_func(x):
    return -x

In [44]:
df.head(1)

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa


In [45]:
df["sepal.length"] = df["sepal.length"].apply(my_func)

In [46]:
df.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,-5.1,3.5,1.4,0.2,Setosa
1,-4.9,3.0,1.4,0.2,Setosa
2,-4.7,3.2,1.3,0.2,Setosa
3,-4.6,3.1,1.5,0.2,Setosa
4,-5.0,3.6,1.4,0.2,Setosa


In [48]:
df.columns

Index(['sepal.length', 'sepal.width', 'petal.length', 'petal.width',
       'variety'],
      dtype='object')

In [49]:
cols = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']

In [50]:
for col in cols:
    df[col] = df[col].apply(my_func)

In [51]:
df.head(5)

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,-3.5,-1.4,-0.2,Setosa
1,4.9,-3.0,-1.4,-0.2,Setosa
2,4.7,-3.2,-1.3,-0.2,Setosa
3,4.6,-3.1,-1.5,-0.2,Setosa
4,5.0,-3.6,-1.4,-0.2,Setosa


In [52]:
df["sepal.length"] = df["sepal.length"].apply(lambda x: -x)

In [54]:
df.head(5)

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,-5.1,-3.5,-1.4,-0.2,Setosa
1,-4.9,-3.0,-1.4,-0.2,Setosa
2,-4.7,-3.2,-1.3,-0.2,Setosa
3,-4.6,-3.1,-1.5,-0.2,Setosa
4,-5.0,-3.6,-1.4,-0.2,Setosa


#### datatypes

In [56]:
df.dtypes

sepal.length    float64
sepal.width     float64
petal.length    float64
petal.width     float64
variety          object
dtype: object

In [57]:
df["sepal.length"] = df["sepal.length"].apply(str)

In [58]:
df.dtypes

sepal.length     object
sepal.width     float64
petal.length    float64
petal.width     float64
variety          object
dtype: object

#### get dummies
* for string or object variables
* can specify specific columns
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

<h3 style="color:blue">what might we need dummies for?</h3>

In [64]:
df = pd.get_dummies(df)

In [65]:
df.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety_Setosa,variety_Versicolor,variety_Virginica
0,5.1,3.5,1.4,0.2,1,0,0
1,4.9,3.0,1.4,0.2,1,0,0
2,4.7,3.2,1.3,0.2,1,0,0
3,4.6,3.1,1.5,0.2,1,0,0
4,5.0,3.6,1.4,0.2,1,0,0


In [129]:
col_1 = np.array(["A", "B", "C"])
col_2 = np.array(["AA", "BB", "CC"])

df = pd.DataFrame({
    "col_1":col_1,
    "col_2":col_2
})

df

Unnamed: 0,col_1,col_2
0,A,AA
1,B,BB
2,C,CC


<h3 style="color:blue">perform get dummies on col_1 only?</h3>

#### iterate

In [66]:
for idx, r in df.head(5).iterrows():
    print(idx, r)

0 sepal.length          5.1
sepal.width           3.5
petal.length          1.4
petal.width           0.2
variety_Setosa        1.0
variety_Versicolor    0.0
variety_Virginica     0.0
Name: 0, dtype: float64
1 sepal.length          4.9
sepal.width           3.0
petal.length          1.4
petal.width           0.2
variety_Setosa        1.0
variety_Versicolor    0.0
variety_Virginica     0.0
Name: 1, dtype: float64
2 sepal.length          4.7
sepal.width           3.2
petal.length          1.3
petal.width           0.2
variety_Setosa        1.0
variety_Versicolor    0.0
variety_Virginica     0.0
Name: 2, dtype: float64
3 sepal.length          4.6
sepal.width           3.1
petal.length          1.5
petal.width           0.2
variety_Setosa        1.0
variety_Versicolor    0.0
variety_Virginica     0.0
Name: 3, dtype: float64
4 sepal.length          5.0
sepal.width           3.6
petal.length          1.4
petal.width           0.2
variety_Setosa        1.0
variety_Versicolor    0.0
variety_Vi

In [67]:
for idx, r in df.head(5).iterrows():
    print(r["petal.length"])

1.4
1.4
1.3
1.5
1.4


#### tolist()

In [69]:
df["petal.width"].tolist()

[0.2,
 0.2,
 0.2,
 0.2,
 0.2,
 0.4,
 0.3,
 0.2,
 0.2,
 0.1,
 0.2,
 0.2,
 0.1,
 0.1,
 0.2,
 0.4,
 0.4,
 0.3,
 0.3,
 0.3,
 0.2,
 0.4,
 0.2,
 0.5,
 0.2,
 0.2,
 0.4,
 0.2,
 0.2,
 0.2,
 0.2,
 0.4,
 0.1,
 0.2,
 0.2,
 0.2,
 0.2,
 0.1,
 0.2,
 0.2,
 0.3,
 0.3,
 0.2,
 0.6,
 0.4,
 0.3,
 0.2,
 0.2,
 0.2,
 0.2,
 1.4,
 1.5,
 1.5,
 1.3,
 1.5,
 1.3,
 1.6,
 1.0,
 1.3,
 1.4,
 1.0,
 1.5,
 1.0,
 1.4,
 1.3,
 1.4,
 1.5,
 1.0,
 1.5,
 1.1,
 1.8,
 1.3,
 1.5,
 1.2,
 1.3,
 1.4,
 1.4,
 1.7,
 1.5,
 1.0,
 1.1,
 1.0,
 1.2,
 1.6,
 1.5,
 1.6,
 1.5,
 1.3,
 1.3,
 1.3,
 1.2,
 1.4,
 1.2,
 1.0,
 1.3,
 1.2,
 1.3,
 1.3,
 1.1,
 1.3,
 2.5,
 1.9,
 2.1,
 1.8,
 2.2,
 2.1,
 1.7,
 1.8,
 1.8,
 2.5,
 2.0,
 1.9,
 2.1,
 2.0,
 2.4,
 2.3,
 1.8,
 2.2,
 2.3,
 1.5,
 2.3,
 2.0,
 2.0,
 1.8,
 2.1,
 1.8,
 1.8,
 1.8,
 2.1,
 1.6,
 1.9,
 2.0,
 2.2,
 1.5,
 1.4,
 2.3,
 2.4,
 1.8,
 1.8,
 2.1,
 2.4,
 2.3,
 1.9,
 2.3,
 2.5,
 2.3,
 1.9,
 2.0,
 2.3,
 1.8]

<h3 style="color:blue">Using iris, group by variety and find the count of the other columns?</h3>
<h3 style="color:blue">Using iris, group by variety and find the sum of the other columns?</h3>

In [132]:
col_1 = np.array(["A", "B", "C", "D", "E"])
col_2 = np.array(["A", "B", "C"])

df = pd.DataFrame({
    "col_1":col_1,
    "col_1_ind":1
})

df1 = pd.DataFrame({
    "col_1":col_2,
    "col_2_ind":1
})
df

Unnamed: 0,col_1,col_1_ind
0,A,1
1,B,1
2,C,1
3,D,1
4,E,1


In [133]:
df1

Unnamed: 0,col_1,col_2_ind
0,A,1
1,B,1
2,C,1


#### concat
* https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

In [135]:
pd.concat([df, df1], axis = 0, sort = False)

Unnamed: 0,col_1,col_1_ind,col_2_ind
0,A,1.0,
1,B,1.0,
2,C,1.0,
3,D,1.0,
4,E,1.0,
0,A,,1.0
1,B,,1.0
2,C,,1.0


#### merge
* https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.merge.html
* defaults to an inner join

In [142]:
df.merge(df1, how = "inner", left_on = "col_1", right_on = "col_1")

Unnamed: 0,col_1,col_1_ind,col_2_ind
0,A,1,1
1,B,1,1
2,C,1,1


In [141]:
df.merge(df1, how = "left", left_on = "col_1", right_on = "col_1")

Unnamed: 0,col_1,col_1_ind,col_2_ind
0,A,1,1.0
1,B,1,1.0
2,C,1,1.0
3,D,1,
4,E,1,


## Data Imputation
* https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html#pandas.DataFrame.interpolate
* missing values
* When summing data, NA (missing) values will be treated as zero.
* If the data are all NA, the result will be 0.
* Cumulative methods like cumsum() and cumprod() ignore NA values by default, but preserve them in the resulting arrays. To override this behaviour and include NA values, use skipna=False.
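
The points above about sums and cumulative methods can be checked on a small series. This is a minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])
all_missing = pd.Series([np.nan, np.nan])

total = s.sum()                        # the missing value is skipped
empty_total = all_missing.sum()        # an all-NA series sums to 0
running = s.cumsum()                   # NA is skipped but stays in its position
running_strict = s.cumsum(skipna=False)  # NA propagates to every later position

print(total, empty_total)
print(running.tolist())
print(running_strict.tolist())
```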


In [2]:
import pandas as pd
data = [1,2,3,4,None,5,2,1]
df = pd.DataFrame(data)

In [3]:
df

Unnamed: 0,0
0,1.0
1,2.0
2,3.0
3,4.0
4,
5,5.0
6,2.0
7,1.0


In [4]:
df.isna()

Unnamed: 0,0
0,False
1,False
2,False
3,False
4,True
5,False
6,False
7,False


In [6]:
df.notna()

Unnamed: 0,0
0,True
1,True
2,True
3,True
4,False
5,True
6,True
7,True


In [7]:
df.fillna(0)

Unnamed: 0,0
0,1.0
1,2.0
2,3.0
3,4.0
4,0.0
5,5.0
6,2.0
7,1.0


In [8]:
df.fillna(100)

Unnamed: 0,0
0,1.0
1,2.0
2,3.0
3,4.0
4,100.0
5,5.0
6,2.0
7,1.0


In [9]:
df.dropna(axis=0)

Unnamed: 0,0
0,1.0
1,2.0
2,3.0
3,4.0
5,5.0
6,2.0
7,1.0


In [13]:
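# axis=1 drops every column that contains a missing value; our only column has one, so nothing is left
df.dropna(axis=1)

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7]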


In [14]:
df.fillna(df.mean())

Unnamed: 0,0
0,1.0
1,2.0
2,3.0
3,4.0
4,2.571429
5,5.0
6,2.0
7,1.0


In [15]:
df.fillna(df.median())

Unnamed: 0,0
0,1.0
1,2.0
2,3.0
3,4.0
4,2.0
5,5.0
6,2.0
7,1.0


In [20]:
df[0].fillna(df[0].mean()).reset_index()

Unnamed: 0,index,0
0,0,1.0
1,1,2.0
2,2,3.0
3,3,4.0
4,4,2.571429
5,5,5.0
6,6,2.0
7,7,1.0
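
Besides filling with a constant, mean, or median, the `interpolate` method linked at the top of this section estimates a missing value from its neighbours. A short sketch on the same data, relying on the default linear method:

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, None, 5, 2, 1])

# linear interpolation: the gap is filled halfway between its neighbours (4 and 5)
filled = df.interpolate()

print(filled[0].tolist())
```

Interpolation assumes the row order is meaningful (for example a time series); for unordered data a mean or median fill is usually the safer choice.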


## JSON data

In [19]:
path = "/Users/conagrabrands/Documents/PythonForAnalytics/Lectures/data/ca.json"

In [7]:
with open(path, "r") as file:
    line = file.readlines()
    for l in line[0:50]:
        print(l)

[

    {

        "city": "Toronto", 

        "admin": "Ontario", 

        "country": "Canada", 

        "population_proper": "3934421", 

        "iso2": "CA", 

        "capital": "admin", 

        "lat": "43.666667", 

        "lng": "-79.416667", 

        "population": "5213000"

    }, 

    {

        "city": "Montr\u00e9al", 

        "admin": "Qu\u00e9bec", 

        "country": "Canada", 

        "population_proper": "2356556", 

        "iso2": "CA", 

        "capital": "", 

        "lat": "45.5", 

        "lng": "-73.583333", 

        "population": "3678000"

    }, 

    {

        "city": "Vancouver", 

        "admin": "British Columbia", 

        "country": "Canada", 

        "population_proper": "603502", 

        "iso2": "CA", 

        "capital": "", 

        "lat": "49.25", 

        "lng": "-123.133333", 

        "population": "2313328"

    }, 

    {

        "city": "Ottawa", 

        "admin": "Ontario", 

        "country": "Canada", 

        "po

In [8]:
import pandas as pd

columns = ["age", "job", "city"]
data = [
    [31, "data scientist", "chicago"],
    [28, "data scientist", "new york"],
    [28,None,None]
]

df = pd.DataFrame(data, columns = columns)

df

In [10]:
df = pd.read_json(path)

In [None]:
df

* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html

* Nested JSON files can be time-consuming and difficult to flatten and load into pandas.
* Let’s unpack the works column into a standalone dataframe. We’ll also grab the flat columns.
* nycphil = json_normalize(d['programs'])

In [18]:
data = [
        {'id': 1, 'name': "Cole Volk",'fitness': 
             {'height': 130, 'weight': 60}},
    
        {'name': "Mose Reg",'fitness': 
             {'height': 130, 'weight': 60}},
    
        {'id': 2, 'name': 'Faye Raker','fitness': 
             {'height': 130, 'weight': 60}}
]

In [22]:
# in pandas 1.0 and later, prefer pd.json_normalize; importing from pandas.io.json is deprecated
from pandas.io.json import json_normalize

In [23]:
json_normalize(data)

Unnamed: 0,id,name,fitness.height,fitness.weight
0,1.0,Cole Volk,130,60
1,,Mose Reg,130,60
2,2.0,Faye Raker,130,60


In [24]:
json_normalize(data, max_level = 0)

Unnamed: 0,id,name,fitness
0,1.0,Cole Volk,"{'height': 130, 'weight': 60}"
1,,Mose Reg,"{'height': 130, 'weight': 60}"
2,2.0,Faye Raker,"{'height': 130, 'weight': 60}"


In [28]:
data = [
        {'state': 'Florida','shortname': 'FL',
            'info': {'governor': 'Rick Scott', "gender":"m"},
            'counties': [{'name': 'Dade', 'population': 12345},
                         {'name': 'Broward', 'population': 40000},
                         {'name': 'Palm Beach', 'population': 60000}]},
    
         {'state': 'Ohio', 'shortname': 'OH',
          'info': {'governor': 'John Kasich', "gender":"m"},
          'counties': [{'name': 'Summit', 'population': 1234},
                       {'name': 'Cuyahoga', 'population': 1337}]}
]

In [29]:
json_normalize(data, max_level = 0)

Unnamed: 0,state,shortname,info,counties
0,Florida,FL,"{'governor': 'Rick Scott', 'gender': 'm'}","[{'name': 'Dade', 'population': 12345}, {'name..."
1,Ohio,OH,"{'governor': 'John Kasich', 'gender': 'm'}","[{'name': 'Summit', 'population': 1234}, {'nam..."


In [30]:
json_normalize(data, max_level = 1)

Unnamed: 0,state,shortname,counties,info.governor,info.gender
0,Florida,FL,"[{'name': 'Dade', 'population': 12345}, {'name...",Rick Scott,m
1,Ohio,OH,"[{'name': 'Summit', 'population': 1234}, {'nam...",John Kasich,m


In [33]:
json_normalize(data)

Unnamed: 0,state,shortname,counties,info.governor,info.gender
0,Florida,FL,"[{'name': 'Dade', 'population': 12345}, {'name...",Rick Scott,m
1,Ohio,OH,"[{'name': 'Summit', 'population': 1234}, {'nam...",John Kasich,m


In [53]:
# make each row a county, then start parsing data as such
json_normalize(data, "counties", [["info","governor"], ["info", "gender"]])

Unnamed: 0,name,population,info.governor,info.gender
0,Dade,12345,Rick Scott,m
1,Broward,40000,Rick Scott,m
2,Palm Beach,60000,Rick Scott,m
3,Summit,1234,John Kasich,m
4,Cuyahoga,1337,John Kasich,m


## Chunks
* Can use chunks to process pieces of a dataframe at a time if it won't fit into memory

In [1]:
import pandas as pd

In [2]:
chunk_size = 100

In [3]:
path = "/Users/conagrabrands/Documents/PythonForAnalytics/Lectures/data/iris.csv"

In [11]:
for c in pd.read_csv(path, chunksize = chunk_size):
    print(c.shape)

(100, 5)
(50, 5)


In [15]:
for c in pd.read_csv(path, chunksize = chunk_size):
    df = c.groupby("variety").sum()
    print(df.shape)

(2, 4)
(1, 4)
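
In the loop above each chunk's result replaces the previous one. To get totals for the whole file, the per-chunk results need to be kept and combined. Here is a minimal sketch using a small in-memory CSV as a stand-in for a large file; the data and the chunk size are made up for illustration.

```python
import io

import pandas as pd

# a tiny in-memory stand-in for a large CSV file (hypothetical data)
csv_text = (
    "variety,petal.length\n"
    "Setosa,1.4\n"
    "Setosa,1.3\n"
    "Virginica,5.1\n"
    "Setosa,1.5\n"
    "Virginica,5.9\n"
)

partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # aggregate each chunk on its own so only small results are kept in memory
    partials.append(chunk.groupby("variety")["petal.length"].sum())

# a group can appear in several chunks, so combine the partial sums
total = pd.concat(partials).groupby(level=0).sum()
print(total)
```

This works for sums and counts because partial results add up; a mean would need the sum and the count tracked separately and divided at the end.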
