# Why Python for data analysis and machine learning?
There are many reasons to use Python for data science. It is one of the more recent arrivals in the data science ecosystem (compared to, say, R and SAS), but it is used just as frequently for analysis as SAS and R. A good foundation in Python and R (and SAS or SPSS) should be a *must* for **every data scientist** and machine learning enthusiast.

In this course, Python gives us an open source way of performing machine learning that runs on just about any machine. So let's start by looking at the NumPy and pandas packages for analyzing data.

With that in mind, let's go over the following:
- Numpy matrices
- Simple operations on arrays and matrices
- Indexing with numpy
- Pandas for tabular data
- Representing categorical data (discussion point)

In [35]:
import sys
import numpy as np

print(sys.version)
print(np.__version__)

3.8.5 (default, Sep  4 2020, 02:22:02) 
[Clang 10.0.0 ]
1.19.2


In [36]:
x = np.random.rand(5,3)
# rand() is uniform(0,1)
# np array size is (5,3)
# x

In [37]:
x.shape 
# shape is a public attribute:
# its name has no leading underscore

(5, 3)

In [38]:
x.dtype

dtype('float64')

In [39]:
y = np.random.rand(3,4)
# z = x*y
# * is elementwise multiplication; it would raise a ValueError here
# because shapes (5,3) and (3,4) cannot be broadcast together

In [40]:
# we can ask for matrix multiplication explicitly with np.dot
z = np.dot(x,y)
# dot is matrix multiplication
# z

In [41]:
# or we can use the overloaded matrix multiplication operator
z = x @ y
# @ is also matrix multiplication
# z
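To make the difference between the two products concrete, here is a small sketch with fixed values instead of random ones. The arrays `a` and `b` are illustrative and not part of the notebook above. Elementwise `*` needs matching (or broadcastable) shapes, while `@` needs the inner dimensions to agree.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[10, 20],
              [30, 40]])

elementwise = a * b   # multiplies entries at the same position
matmul = a @ b        # rows of a dotted with columns of b

print(elementwise)
print(matmul)

# shapes (5,3) and (3,4) work for @ but not for *
x = np.random.rand(5, 3)
y = np.random.rand(3, 4)
try:
    x * y
    elementwise_failed = False
except ValueError as err:
    elementwise_failed = True
    print("elementwise product failed:", err)

print((x @ y).shape)
```

The result of `x @ y` has one row per row of `x` and one column per column of `y`.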

# Indexing

In [42]:
x1 = np.array([[1,2,3],
               [4,5,6],
               [7,8,9]])
# x1 is a 2D np array, built here from a list of lists.

In [43]:
# for row in range(x1.shape[0]):
#     print(x1[row,1])
# it is slow to run loops

In [44]:
print(x1[:,1])
print(x1[:,1]>3)
# boolean mask indexing: keep the rows whose second column is > 3
print(x1[ x1[:,1]>3 ])

[2 5 8]
[False  True  True]
[[4 5 6]
 [7 8 9]]


In [45]:
x2 = np.arange(10)
print(x2)
x2.shape # 1D array

[0 1 2 3 4 5 6 7 8 9]


(10,)

In [46]:
idx = x2 > 5
# print(idx)
print(x2[idx])

[6 7 8 9]


In [47]:
x2[x2>5]

array([6, 7, 8, 9])
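As a rough sketch of why the mask form is preferred, the loop below and the boolean mask select the same rows of `x1`; the mask does it in one vectorized expression. The variable names `rows_loop`, `rows_mask`, and `both` are made up for this illustration.

```python
import numpy as np

x1 = np.array([[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]])

# loop version: collect rows whose second column exceeds 3
rows_loop = []
for row in range(x1.shape[0]):
    if x1[row, 1] > 3:
        rows_loop.append(x1[row])
rows_loop = np.array(rows_loop)

# vectorized version: build a boolean mask, then index with it
mask = x1[:, 1] > 3
rows_mask = x1[mask]

# conditions combine with & (and), | (or), ~ (not);
# each condition needs its own parentheses
both = x1[(x1[:, 1] > 3) & (x1[:, 2] < 9)]

print(rows_mask)
print(both)
```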

# Named columns
So what if we have a matrix of data where each row is one observation and each column holds the values of one feature?

In [48]:
col_names = ['temperature','time','day']
data = np.array([[64,2100,1],
                 [50,2200,4],
                 [48,2300,3],
                 [34,0,   2],
                 [30,100, 5]])
# data

In [49]:
# with plain numpy we have to remember that column 1 is 'time'
data2 = data[data[:,1]>1500]
# data2

In [50]:
# pandas to the rescue
import pandas as pd

df = pd.DataFrame(data,columns = col_names)
df

   temperature  time  day
0           64  2100    1
1           50  2200    4
2           48  2300    3
3           34     0    2
4           30   100    5


In [51]:
# can always access the backend numpy array with .to_numpy() (the older spelling is .values)
print(type(df.to_numpy()))
df.to_numpy()

<class 'numpy.ndarray'>


array([[  64, 2100,    1],
       [  50, 2200,    4],
       [  48, 2300,    3],
       [  34,    0,    2],
       [  30,  100,    5]])

In [52]:
df[df.time > 1500]

   temperature  time  day
0           64  2100    1
1           50  2200    4
2           48  2300    3
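The following sketch puts the NumPy and pandas filters side by side on the same data, then combines two conditions and picks columns by name with `.loc`. The threshold of 50 degrees and the name `warm_and_late` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

col_names = ['temperature', 'time', 'day']
data = np.array([[64, 2100, 1],
                 [50, 2200, 4],
                 [48, 2300, 3],
                 [34, 0,    2],
                 [30, 100,  5]])
df = pd.DataFrame(data, columns=col_names)

# numpy: the column is identified by position
late_np = data[data[:, 1] > 1500]
# pandas: the column is identified by name
late_df = df[df['time'] > 1500]

# combine conditions with & and select columns by name with .loc
warm_and_late = df.loc[(df['time'] > 1500) & (df['temperature'] >= 50),
                       ['temperature', 'day']]
print(warm_and_late)
```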


In [53]:
# let's get a description of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   temperature  5 non-null      int64
 1   time         5 non-null      int64
 2   day          5 non-null      int64
dtypes: int64(3)
memory usage: 248.0 bytes


In [54]:
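# use .loc for assignment; chained indexing like df.day[mask] = ... can
# trigger SettingWithCopyWarning and may not modify df
df.loc[df.day == 1, 'day'] = 'Mon'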
df

   temperature  time  day
0           64  2100  Mon
1           50  2200    4
2           48  2300    3
3           34     0    2
4           30   100    5


In [55]:
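# there is almost always a more efficient built-in pandas function
df.day.replace(to_replace = range(7),
               value = ['Su','Mon','Tues','Wed','Th','Fri','Sat'],
               inplace = True )
# inplace = True modifies the original data instead of returning a new object
# (newer pandas versions warn about inplace on a selected column;
#  assigning the result back to df['day'] is the safer form there)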
df

   temperature  time   day
0           64  2100   Mon
1           50  2200    Th
2           48  2300   Wed
3           34     0  Tues
4           30   100   Fri


In [56]:
# notice how the dtype of the column has changed to object (strings);
# the data is categorical, but this is not the pandas "category" dtype
# df.info()

In [57]:
df2 = df.copy()
# df2 is a deep copy of df
# any change made to df2 will not affect df 

In [58]:
# one hot encoding example
pd.get_dummies(df.day)

   Fri  Mon  Th  Tues  Wed
0    0    1   0     0    0
1    0    0   1     0    0
2    0    0   0     0    1
3    0    0   0     1    0
4    1    0   0     0    0
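As a discussion aid for representing categorical data, here is a minimal sketch of the pandas `category` dtype next to one-hot encoding. It assumes the seven day labels used above and a calendar ordering; the names `days`, `order`, and `days_cat` are illustrative only.

```python
import pandas as pd

days = pd.Series(['Mon', 'Th', 'Wed', 'Tues', 'Fri'], name='day')

# assumed category order: the same calendar order used in the replace() call
order = ['Su', 'Mon', 'Tues', 'Wed', 'Th', 'Fri', 'Sat']
days_cat = days.astype(pd.CategoricalDtype(categories=order, ordered=True))

print(days_cat.dtype)

# each row is stored as a small integer code pointing into the category list
codes = days_cat.cat.codes
print(codes.tolist())

# one-hot encoding a categorical column keeps a column for every category,
# including the ones that do not occur in the data
one_hot = pd.get_dummies(days_cat)
print(one_hot.shape)
```

Integer codes imply an ordering and a distance between categories, which one-hot encoding avoids at the cost of more columns. Which representation fits depends on the model being trained.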


# Some Pandas Syntax

In [73]:
# print(df.day)
# print(df['day'])
# df[['day','temperature']]

In [60]:
print(df.day[2])
print(df.day[2:]) # not fast

Wed
2     Wed
3    Tues
4     Fri
Name: day, dtype: object


In [61]:
df.iloc[3:] # fast
# iloc is explicit positional indexing: rows from position 3 onward

   temperature  time   day
3           34     0  Tues
4           30   100   Fri


In [62]:
df.iloc[3:][['day','temperature']]

    day  temperature
3  Tues           34
4   Fri           30


In [63]:
df[['day','temperature']].iloc[3:]

    day  temperature
3  Tues           34
4   Fri           30
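With the default `RangeIndex`, labels and positions coincide, which hides the difference between `.loc` and `.iloc`. The sketch below uses a made-up index starting at 10 (an assumption for illustration) so the two can be told apart.

```python
import pandas as pd

s = pd.Series(['Mon', 'Th', 'Wed', 'Tues', 'Fri'],
              index=[10, 11, 12, 13, 14])

by_label = s.loc[12]        # row whose index label is 12
by_position = s.iloc[2]     # third row, counting from 0

label_slice = s.loc[12:13].tolist()     # label slices include both endpoints
position_slice = s.iloc[2:3].tolist()   # position slices exclude the end

print(by_label, by_position)
print(label_slice)
print(position_slice)
```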


In [74]:
# df.mean(numeric_only=True)
# mean for each numeric column in df
# (newer pandas versions need numeric_only=True to skip the string column)

In [75]:
# df.std(numeric_only=True)

In [77]:
df.mean(numeric_only=True)/df.std(numeric_only=True)

temperature    3.321375
time           1.135349
dtype: float64
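The same column-wise statistics can be used to standardize the numeric columns. This is a sketch, assuming we select numeric columns with `select_dtypes` and use the sample standard deviation that `std()` returns by default; the frame is rebuilt here so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'temperature': [64, 50, 48, 34, 30],
    'time': [2100, 2200, 2300, 0, 100],
    'day': ['Mon', 'Th', 'Wed', 'Tues', 'Fri'],
})

# keep only the numeric columns so the string column is not involved
numeric = df.select_dtypes(include='number')

# same ratio as in the cell above
ratio = numeric.mean() / numeric.std()
print(ratio)

# z-score: subtract the column mean, divide by the column std
standardized = (numeric - numeric.mean()) / numeric.std()
print(standardized)
```

After this step each numeric column has mean 0 and standard deviation 1, up to floating point error.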

In [67]:
df.time.unique()

array([2100, 2200, 2300,    0,  100])

# Pandas Block Manager
Let's take a look at some important points from the following post:
 - https://uwekorn.com/2020/05/24/the-one-pandas-internal.html

This is the pandas BlockManager, which groups columns of the same dtype into shared blocks to make things fast:
<img src="https://uwekorn.com/images/pd-df-perception.002.png" width=200 height=200 />

In [79]:
print(df._data.nblocks) # _data is private (leading underscore); newer pandas versions name it _mgr
df._data

2


BlockManager
Items: Index(['temperature', 'time', 'day'], dtype='object')
Axis 1: RangeIndex(start=0, stop=5, step=1)
IntBlock: slice(0, 2, 1), 2 x 5, dtype: int64
ObjectBlock: slice(2, 3, 1), 1 x 5, dtype: object

## Advantages and disadvantages:
This layout can speed up operations because an operation along columns (a sum, for example) can be applied to a whole block in a single pass over the data, so compiled code (C, via NumPy and Cython) does much of the heavy lifting.

But **it might be bad** when you are adding columns to the data, because that can trigger consolidation of blocks, which means copying data over in NumPy to create a new matrix. The slowdown also doesn't show up until a needed column is accessed (lazy data copying). Let's do an example from: https://uwekorn.com/2020/05/24/the-one-pandas-internal.html

**Block consolidation is triggered after 100 blocks of data are reached.**

In [82]:
df_example = pd.DataFrame({
    'int64': np.arange(1024 * 1024, dtype=np.int64),
    'float64': np.arange(1024 * 1024, dtype=np.float64),
})
# df_example

In [85]:
%%time 
# %%time is a Jupyter cell magic; it only works in Jupyter/IPython.
# It reports the time needed to run the entire cell.
for i in range(97):
    df_example[f'new_{i}'] = df_example['int64'].values
    
print(df_example._data.nblocks)
# df_example

99
CPU times: user 223 ms, sys: 3.21 ms, total: 226 ms
Wall time: 225 ms


In [91]:
%time df_example['dummy_name1'] = df_example['int64'].values # copy over some new columns
print('Number of blocks in data:',df_example._data.nblocks)

# print(df_example)
# %time (single %) reports the time needed to run one line of code

%time df_example['dummy_name2'] = df_example['int64'].values # copy over some new columns
print('Number of blocks in data:',df_example._data.nblocks)


CPU times: user 3.1 ms, sys: 1.19 ms, total: 4.29 ms
Wall time: 2.61 ms
Number of blocks in data: 4
CPU times: user 2.77 ms, sys: 117 µs, total: 2.89 ms
Wall time: 2.33 ms
Number of blocks in data: 4
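One way to sidestep repeated column insertion, sketched here under the assumption that all new columns are known up front, is to collect them in a dict and attach them with a single `pd.concat`. The smaller row count and the names `base`, `extra`, and `wide` are choices made for this example, and whether it is faster on a given pandas version is something to measure with `%time` rather than take for granted.

```python
import numpy as np
import pandas as pd

n_rows = 1024  # smaller than the example above so the sketch runs quickly
base = pd.DataFrame({
    'int64': np.arange(n_rows, dtype=np.int64),
    'float64': np.arange(n_rows, dtype=np.float64),
})

# build all new columns first ...
new_cols = {f'new_{i}': base['int64'].to_numpy() for i in range(97)}
extra = pd.DataFrame(new_cols, index=base.index)

# ... then attach them in one step instead of 97 separate insertions
wide = pd.concat([base, extra], axis=1)
print(wide.shape)
```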
