# Why Python for data analysis and machine learning?
There are many reasons to use Python for data science. It is a relative newcomer to the data science ecosystem compared with, say, R and SAS, but it is now used for analysis just as frequently. A good foundation in Python, R, and SAS should be a *must* for **every data scientist** and machine learning enthusiast.

In this course, Python gives us an open-source way to do machine learning that runs on just about any machine. Let's start by looking at the NumPy and pandas packages for analyzing data.

With that in mind, let's go over the following:
- NumPy matrices
- Simple operations on arrays and matrices
- Indexing with NumPy
- pandas for tabular data
- Representing categorical data (discussion point)

In [1]:
import sys # system specific parameters and functions 
import numpy as np

print(sys.version)
print(np.__version__)

3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
1.15.0


In [4]:
# Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).
x = np.random.rand(5,3)
x

array([[0.11656552, 0.44206924, 0.52958223],
       [0.85324776, 0.8649684 , 0.68833811],
       [0.690906  , 0.66720319, 0.14971898],
       [0.39244887, 0.30822911, 0.89256576],
       [0.18161866, 0.60604304, 0.53780784]])

In [5]:
x.shape

(5, 3)

In [6]:
x.dtype

dtype('float64')

In [9]:
y = np.random.rand(3,4)
print(y)
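# * multiplies element by element, so shapes (5,3) and (3,4) cannot be broadcast and this line raises an error
z = x*y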
z

[[0.02303992 0.49794438 0.53980624 0.83080689]
 [0.45063912 0.58522209 0.13870141 0.68008081]
 [0.57892999 0.32330561 0.87990077 0.96465386]]


ValueError: operands could not be broadcast together with shapes (5,3) (3,4) 
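The error arises because `*` works element by element, which only makes sense when the two shapes match or can be broadcast to a common shape. Here is a minimal sketch of that behaviour. The small arrays `a`, `b`, and `row` are hypothetical and are not part of the notebook above; they are chosen only so the results are easy to check by hand.

```python
import numpy as np

# Hypothetical small arrays, chosen so the arithmetic is easy to verify.
a = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
b = np.array([[10, 20, 30],
              [40, 50, 60]])       # shape (2, 3), same as a
row = np.array([1, 0, -1])         # shape (3,)

elementwise = a * b                # same shapes: multiplies matching entries
broadcasted = a * row              # (2, 3) * (3,): row is reused for each row of a

print(elementwise)
print(broadcasted)

# Shapes (2, 3) and (3, 2) are not broadcastable, so this raises ValueError.
try:
    a * b.T
except ValueError as err:
    print("ValueError:", err)
```
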

In [6]:
# np.dot performs matrix multiplication on 2-D arrays: (5,3) times (3,4) gives (5,4)
z = np.dot(x,y)
z

array([[0.49075975, 0.63223609, 0.82279248, 1.10694356],
       [0.26218682, 0.3802152 , 0.52492076, 0.57943639],
       [0.33431131, 0.25819678, 0.66437645, 0.81252861],
       [0.3611374 , 0.48572955, 0.60110577, 0.80780374],
       [0.42703138, 0.35626193, 0.81520351, 1.02852813]])

In [10]:
# or we can use the overloaded matrix multiplication operator
z = x @ y
print(z.shape)
z

(5, 4)


array([[0.50849039, 0.48796874, 0.59021824, 0.9083498 ],
       [0.80794693, 1.15361212, 1.18623004, 1.96114055],
       [0.40326309, 0.78289979, 0.59723524, 1.17218855],
       [0.66467518, 0.66437172, 1.03996746, 1.39668694],
       [0.58864427, 0.61898206, 0.65531544, 1.08184669]])
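The two outputs above contain different numbers, presumably because `x` and `y` were drawn again between runs (the cell numbers are out of order). For 2-D arrays, `np.dot` and `@` should give the same result. The following sketch checks this on freshly generated arrays; the seeded generator and the names `a` and `b` are assumptions introduced here for repeatability.

```python
import numpy as np

# Seeded generator so the check is repeatable (an assumption, not in the notebook).
rng = np.random.default_rng(0)
a = rng.random((5, 3))
b = rng.random((3, 4))

via_dot = np.dot(a, b)       # function form
via_operator = a @ b         # operator form

print(via_dot.shape)                        # (5, 4)
print(np.allclose(via_dot, via_operator))   # True
```
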

# Indexing

In [11]:
x1 = np.array([[1,2,3],
               [4,5,6],
               [7,8,9]])
x1

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [12]:
x1.shape

(3, 3)

In [9]:
for row in range(x1.shape[0]):
    print(x1[row,1])

2
5
8


In [18]:
print(x1[0,:])
print(x1[:,2])

[1 2 3]
[3 6 9]


In [12]:
x1[:,1]

array([2, 5, 8])

In [13]:
x1[1,1]

5

In [13]:
x1[:,1]>3

array([False,  True,  True])

In [14]:
# boolean mask indexing: keep the rows whose second column is greater than 3
x1[ x1[:,1]>3 ]

array([[4, 5, 6],
       [7, 8, 9]])

In [15]:
x2 = np.arange(10)
x2

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [16]:
x2.shape

(10,)

In [17]:
idx = x2>5
idx

array([False, False, False, False, False, False,  True,  True,  True,
        True])

In [18]:
x2[idx]

array([6, 7, 8, 9])

In [19]:
x2[x2>5]

array([6, 7, 8, 9])
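Boolean masks can also be combined. A small sketch, reusing the same ten-element array; the names `middle`, `ends`, and `not_big` are made up for illustration.

```python
import numpy as np

x2 = np.arange(10)

# Combine conditions with & (and), | (or), and ~ (not). Each comparison needs
# its own parentheses because these operators bind more tightly than < and >.
middle = x2[(x2 > 2) & (x2 < 7)]
ends = x2[(x2 < 2) | (x2 > 7)]
not_big = x2[~(x2 > 5)]

print(middle)    # [3 4 5 6]
print(ends)      # [0 1 8 9]
print(not_big)   # [0 1 2 3 4 5]
```
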

# Named columns
What if we have a matrix of data in which each row is one observation and each column holds the values of one feature?

In [29]:
col_names = ['temperature','time','day']
data = np.array([[64,2100,1],
                 [50,2200,4],
                 [48,2300,3],
                 [48,2000,3],
                 [34,0,   2],
                 [30,100, 5]])
data

array([[  64, 2100,    1],
       [  50, 2200,    4],
       [  48, 2300,    3],
       [  48, 2000,    3],
       [  34,    0,    2],
       [  30,  100,    5]])

In [20]:
data2 = data[data[:,1]>1500]
data2

array([[  64, 2100,    1],
       [  50, 2200,    4],
       [  48, 2300,    3],
       [  48, 2000,    3]])

In [30]:
# pandas to the rescue
import pandas as pd

df = pd.DataFrame(data,columns=col_names)
df

|   | temperature | time | day |
|---|-------------|------|-----|
| 0 | 64 | 2100 | 1 |
| 1 | 50 | 2200 | 4 |
| 2 | 48 | 2300 | 3 |
| 3 | 48 | 2000 | 3 |
| 4 | 34 | 0 | 2 |
| 5 | 30 | 100 | 5 |


In [22]:
df[df.time>1500]

|   | temperature | time | day |
|---|-------------|------|-----|
| 0 | 64 | 2100 | 1 |
| 1 | 50 | 2200 | 4 |
| 2 | 48 | 2300 | 3 |
| 3 | 48 | 2000 | 3 |


In [23]:
# let's get a description of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
temperature    5 non-null int32
time           5 non-null int32
day            5 non-null int32
dtypes: int32(3)
memory usage: 140.0 bytes


In [32]:
df.day[df.day==1] = 'Mon'
# this doesn't work: Python's `and` cannot combine two boolean Series
df.day[df.day==3 and df.time==2000] = 'Tue'
df

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
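The error is raised because `and` asks for a single truth value from a whole Series. One way the same selection could be written is with the element-wise `&` operator and `.loc`. This is a sketch that rebuilds the frame so it runs on its own; it mirrors the labels used in the cell above, and the cast to `object` is an assumption made so that strings and integers can share the column.

```python
import pandas as pd

col_names = ['temperature', 'time', 'day']
df = pd.DataFrame([[64, 2100, 1],
                   [50, 2200, 4],
                   [48, 2300, 3],
                   [48, 2000, 3],
                   [34, 0,    2],
                   [30, 100,  5]], columns=col_names)

# Cast first so the column can hold both integers and strings.
df['day'] = df['day'].astype(object)

# .loc[row_mask, column] assigns in place without chained indexing.
df.loc[df['day'] == 1, 'day'] = 'Mon'
# Element-wise "and" is &, with each comparison in its own parentheses.
df.loc[(df['day'] == 3) & (df['time'] == 2000), 'day'] = 'Tue'

print(df)
```
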

In [33]:
# this doesn't work either: a conditional expression needs an else branch
df = df.day.apply(lambda x:'Tue' if x==2)
print(df)

SyntaxError: invalid syntax (<ipython-input-33-1c81e666d28f>, line 1)
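The syntax error comes from the missing `else`: a conditional expression has to say what to return in both cases. A minimal sketch of a version that runs, using a standalone Series with the same day values; the name `relabelled` is hypothetical, and the result is stored under a new name rather than overwriting `df`.

```python
import pandas as pd

day = pd.Series([1, 4, 3, 3, 2, 5], name='day')

# Supply both branches: return 'Tue' for 2, otherwise keep the value unchanged.
relabelled = day.apply(lambda x: 'Tue' if x == 2 else x)

print(relabelled.tolist())   # [1, 4, 3, 3, 'Tue', 5]
```
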

In [26]:
# there is almost always a more efficient built-in pandas function;
# assign the result back to the column rather than relying on inplace=True
df['day'] = df['day'].replace(to_replace=list(range(7)),
                              value=['Su','Mon','Tues','Wed','Th','Fri','Sat'])
df

|   | temperature | time | day |
|---|-------------|------|-----|
| 0 | 64 | 2100 | Mon |
| 1 | 50 | 2200 | Th |
| 2 | 48 | 2300 | Wed |
| 3 | 34 | 0 | Tues |
| 4 | 30 | 100 | Fri |


In [27]:
# notice that the column's dtype is now object (strings), which we treat as categorical data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
temperature    5 non-null int64
time           5 non-null int64
day            5 non-null object
dtypes: int64(2), object(1)
memory usage: 200.0+ bytes
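An `object` column only stores the strings; pandas does not yet know they form a fixed set of categories. As a discussion aid, here is a sketch of converting such a column to the dedicated `category` dtype. The standalone Series `days` and the name `as_category` are assumptions made for illustration, and the values are taken from the output above.

```python
import pandas as pd

# dtype=object is set explicitly to mirror the column shown above.
days = pd.Series(['Mon', 'Th', 'Wed', 'Tues', 'Fri'], name='day', dtype=object)

as_category = days.astype('category')

print(as_category.dtype)                   # category
# Categories are stored once, sorted; each row holds a small integer code.
print(list(as_category.cat.categories))    # ['Fri', 'Mon', 'Th', 'Tues', 'Wed']
print(as_category.cat.codes.tolist())      # [1, 2, 4, 3, 0]
```
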


In [28]:
# one hot encoding example
pd.get_dummies(df.day)

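|   | Fri | Mon | Th | Tues | Wed |
|---|-----|-----|----|------|-----|
| 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 |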
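The call above returns only the indicator columns. If the encoded columns should stay alongside the numeric features, one option is to pass the whole frame and name the column to encode. This is a sketch under that assumption; the frame is rebuilt from the five rows shown earlier so the block runs on its own, and `encoded` is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame({'temperature': [64, 50, 48, 34, 30],
                   'time': [2100, 2200, 2300, 0, 100],
                   'day': ['Mon', 'Th', 'Wed', 'Tues', 'Fri']})

# columns=['day'] encodes only that column and keeps the others.
# dtype=int requests 0/1 values; recent pandas versions default to booleans.
encoded = pd.get_dummies(df, columns=['day'], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each new column is named after the original column and the category it marks, for example `day_Mon`.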
