# Why Python for data analysis?
There are many reasons to use Python for data science. It is a relatively recent arrival in the data science ecosystem (compared with, say, R and SAS), yet it is used for analysis just as frequently as either. A good foundation in Python, R, and SAS should be a *must* for **every data scientist**.

In this course, Python provides an open-source way to perform machine learning that runs on just about any machine. Let's start by looking at the NumPy and pandas packages for analyzing data.

With that in mind, let's go over the following:
- NumPy matrices
- Simple operations on arrays and matrices
- Indexing with NumPy
- pandas for tabular data
- Representing categorical data (discussion point)

In [1]:
import numpy as np

x = np.random.rand(5,3)
x

array([[0.41692445, 0.26797325, 0.76390328],
       [0.62033705, 0.39400496, 0.75837898],
       [0.79553523, 0.53961973, 0.35636875],
       [0.1530034 , 0.62451399, 0.39696761],
       [0.47501647, 0.54509539, 0.4492403 ]])

In [2]:
x.shape

(5, 3)

In [3]:
x.dtype

dtype('float64')

In [4]:
y = np.random.rand(3,4)
z = x*y  # * is element-wise, so shapes (5,3) and (3,4) cannot be combined
z

ValueError: operands could not be broadcast together with shapes (5,3) (3,4) 
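The error arises because `*` multiplies arrays element by element, which requires shapes that match or can be broadcast to match. The following is a minimal sketch of that rule; the arrays `a`, `b`, `row`, and `c` are illustrative names that do not appear in the notebook.

```python
import numpy as np

a = np.arange(15).reshape(5, 3)   # shape (5, 3)
b = np.arange(15).reshape(5, 3)   # same shape, so element-wise product works
elementwise = a * b               # shape (5, 3)

row = np.array([1, 10, 100])      # shape (3,) is broadcast across all 5 rows
scaled = a * row                  # shape (5, 3)

c = np.ones((3, 4))               # shape (3, 4): trailing dimensions 3 and 4 differ
try:
    a * c
    broadcast_ok = True
except ValueError:
    broadcast_ok = False          # same failure as in the cell above
```

Broadcasting compares shapes from the last dimension backwards; each pair of dimensions must be equal or one of them must be 1.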

In [5]:
z = x @ y  # @ is matrix multiplication: (5,3) @ (3,4) gives (5,4)

z

array([[1.12584244, 0.7018893 , 0.65830929, 0.29644248],
       [1.32167909, 0.87565909, 0.97452776, 0.32785765],
       [1.12419143, 0.8855276 , 1.28208108, 0.24136954],
       [0.58547789, 0.7762429 , 0.7478308 , 0.21071869],
       [0.91876454, 0.82980165, 0.98019355, 0.24299231]])
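To make the shape rule for `@` concrete, here is a small sketch. It assumes a seeded generator (`np.random.default_rng(0)`) purely for reproducibility, so the numbers differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(0)    # seeded generator (an assumption, not in the notebook)
x = rng.random((5, 3))
y = rng.random((3, 4))

z = x @ y                         # inner dimensions (3 and 3) must agree

# Entry (i, j) is the dot product of row i of x with column j of y
z_00 = sum(x[0, k] * y[k, 0] for k in range(3))

same_as_matmul = np.allclose(z, np.matmul(x, y))   # @ is shorthand for np.matmul here
```

The result has one row per row of `x` and one column per column of `y`, which is why a (5, 3) array times a (3, 4) array yields a (5, 4) array.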

In [6]:
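# np.mat was removed in NumPy 2.0; np.asmatrix is the equivalent call
x = np.asmatrix(x)
y = np.asmatrix(y)
z = x*y  # for the matrix type, * means matrix multiplication
z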

matrix([[1.12584244, 0.7018893 , 0.65830929, 0.29644248],
        [1.32167909, 0.87565909, 0.97452776, 0.32785765],
        [1.12419143, 0.8855276 , 1.28208108, 0.24136954],
        [0.58547789, 0.7762429 , 0.7478308 , 0.21071869],
        [0.91876454, 0.82980165, 0.98019355, 0.24299231]])
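The matrix type used in the cell above behaves differently from a plain array in more ways than `*`. The sketch below contrasts the two on a small example; `a` and `m` are illustrative names. NumPy's documentation recommends regular arrays with `@` for new code, so treat the matrix type mainly as something to recognize in older code.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)    # plain ndarray
m = np.asmatrix(a)                # matrix view of the same data

row_a = a[0]                      # ndarray indexing drops a dimension: shape (3,)
row_m = m[0]                      # matrix indexing stays 2-D: shape (1, 3)

prod_m = m * m.T                  # for the matrix type, * is a matrix product
prod_a = a @ a.T                  # the ndarray equivalent

back = np.asarray(m)              # convert back to a plain ndarray
```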

# Indexing

In [7]:
x1 = np.array([[1,2,3],
               [4,5,6],
               [7,8,9]])
x1

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [8]:
for row in range(x1.shape[0]):
    print(x1[row, 1])

2
5
8


In [9]:
x1[:,1]

array([2, 5, 8])

In [10]:
x1[:,1]>3

array([False,  True,  True])

In [11]:
x1[ x1[:,1]>3 ]

array([[4, 5, 6],
       [7, 8, 9]])
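Boolean masks can also be combined. The sketch below reuses `x1` and adds a second condition; the particular thresholds are chosen only for illustration.

```python
import numpy as np

x1 = np.array([[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]])

# Parentheses are required because & binds more tightly than the comparisons
mask = (x1[:, 1] > 3) & (x1[:, 2] < 9)

rows = x1[mask]      # rows whose second column is > 3 and third column is < 9
others = x1[~mask]   # ~ negates the mask, selecting the remaining rows
```

Use `&`, `|`, and `~` for element-wise logic on arrays; the Python keywords `and`, `or`, and `not` raise an error on arrays with more than one element.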

In [12]:
x2 = np.array(range(10))
x2

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [13]:
x2.shape

(10,)

In [14]:
idx = x2>5
idx

array([False, False, False, False, False, False,  True,  True,  True,
        True])

In [15]:
x2[idx]

array([6, 7, 8, 9])

In [16]:
x2[x2>5]

array([6, 7, 8, 9])
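A mask can be used for assignment as well as selection, and `np.where` turns a condition into positions or into a choice between two values. A brief sketch, where the names `clipped`, `positions`, and `labels` are illustrative:

```python
import numpy as np

x2 = np.arange(10)                # same values as np.array(range(10))

clipped = x2.copy()               # copy so x2 itself is left untouched
clipped[clipped > 5] = 5          # assign through a boolean mask

positions = np.where(x2 > 5)[0]   # integer positions where the condition holds
labels = np.where(x2 > 5, 'high', 'low')   # element-wise choice between two values
```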

# Named columns
What if we have a matrix of data in which each row is one observation and each column holds the values of one feature?

In [17]:
col_names = ['temperature','time','day']
data = np.array([[64,2100,1],
                 [50,2200,1],
                 [48,2300,1],
                 [34,0,   2],
                 [30,100, 2]])
data

array([[  64, 2100,    1],
       [  50, 2200,    1],
       [  48, 2300,    1],
       [  34,    0,    2],
       [  30,  100,    2]])

In [18]:
data2 = data[data[:,1]>1500]
data2

array([[  64, 2100,    1],
       [  50, 2200,    1],
       [  48, 2300,    1]])
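With plain NumPy, the link between `col_names` and column positions has to be maintained by hand. One way to do that, sketched below, is a lookup dictionary; `col_idx` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np

col_names = ['temperature', 'time', 'day']
data = np.array([[64, 2100, 1],
                 [50, 2200, 1],
                 [48, 2300, 1],
                 [34,    0, 2],
                 [30,  100, 2]])

# Hypothetical helper: map each column name to its position
col_idx = {name: i for i, name in enumerate(col_names)}

late = data[data[:, col_idx['time']] > 1500]     # same filter as data[:, 1] > 1500
late_temps = late[:, col_idx['temperature']]     # temperatures of those rows
```

This works, but every reordering or added column means updating the bookkeeping, which is the problem a labeled table type solves.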

In [19]:
# pandas to the rescue
import pandas as pd

df = pd.DataFrame(data,columns=col_names)
df

|   | temperature | time | day |
|---|---|---|---|
| 0 | 64 | 2100 | 1 |
| 1 | 50 | 2200 | 1 |
| 2 | 48 | 2300 | 1 |
| 3 | 34 | 0 | 2 |
| 4 | 30 | 100 | 2 |


In [20]:
df[df.time>1500]

|   | temperature | time | day |
|---|---|---|---|
| 0 | 64 | 2100 | 1 |
| 1 | 50 | 2200 | 1 |
| 2 | 48 | 2300 | 1 |
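The same filter can be written in a few equivalent ways. The sketch below rebuilds the table from a dictionary so that it runs on its own; the temperature threshold of 49 is an arbitrary value chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'temperature': [64, 50, 48, 34, 30],
                   'time': [2100, 2200, 2300, 0, 100],
                   'day': [1, 1, 1, 2, 2]})

late = df[df['time'] > 1500]                   # bracket form of df[df.time > 1500]

# .loc filters rows and picks columns in one step
late_warm = df.loc[(df['time'] > 1500) & (df['temperature'] > 49),
                   ['temperature', 'time']]

by_query = df.query('time > 1500')             # string-expression alternative
```

Bracket access (`df['time']`) is generally safer than attribute access (`df.time`), since it also works for column names that contain spaces or clash with DataFrame methods.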


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
temperature    5 non-null int32
time           5 non-null int32
day            5 non-null int32
dtypes: int32(3)
memory usage: 188.0 bytes


In [22]:
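df['day'] = df['day'].astype(object)  # let the column hold both strings and integers
df.loc[df.day == 1, 'day'] = 'Mon'    # .loc avoids chained assignment, which may not modify df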

In [23]:
df

|   | temperature | time | day |
|---|---|---|---|
| 0 | 64 | 2100 | Mon |
| 1 | 50 | 2200 | Mon |
| 2 | 48 | 2300 | Mon |
| 3 | 34 | 0 | 2 |
| 4 | 30 | 100 | 2 |


In [24]:
day_names = ['Su','Mon','Tues','Wed','Th','Fri','Sat']
# map day codes 0-6 to names and assign back, rather than a chained inplace=True
df['day'] = df['day'].replace(dict(enumerate(day_names)))
df

|   | temperature | time | day |
|---|---|---|---|
| 0 | 64 | 2100 | Mon |
| 1 | 50 | 2200 | Mon |
| 2 | 48 | 2300 | Mon |
| 3 | 34 | 0 | Tues |
| 4 | 30 | 100 | Tues |
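When a column still holds only integer codes, `Series.map` with a dictionary is another way to do this kind of recoding. A minimal sketch on a standalone Series; `days` and `code_to_name` are illustrative names.

```python
import pandas as pd

day_names = ['Su', 'Mon', 'Tues', 'Wed', 'Th', 'Fri', 'Sat']
days = pd.Series([1, 1, 1, 2, 2], name='day')

code_to_name = dict(enumerate(day_names))   # {0: 'Su', 1: 'Mon', 2: 'Tues', ...}
named = days.map(code_to_name)              # codes missing from the dict would become NaN
```

Unlike `replace`, `map` does not leave unmatched values as they are, so a stray code shows up as a missing value instead of passing through silently.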


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
temperature    5 non-null int32
time           5 non-null int32
day            5 non-null object
dtypes: int32(2), object(1)
memory usage: 208.0+ bytes
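As a starting point for the discussion of categorical data, note that `day` is now stored as generic Python objects. The sketch below shows two common alternatives, a category dtype and one-hot encoding, on a reduced copy of the table. It is one possible representation under these assumptions, not the only reasonable choice.

```python
import pandas as pd

day_names = ['Su', 'Mon', 'Tues', 'Wed', 'Th', 'Fri', 'Sat']
df = pd.DataFrame({'temperature': [64, 50, 48, 34, 30],
                   'day': ['Mon', 'Mon', 'Mon', 'Tues', 'Tues']})

# Category dtype: stores small integer codes plus one shared list of labels
df['day'] = pd.Categorical(df['day'], categories=day_names)
codes = df['day'].cat.codes            # integer code per row, by position in day_names

# One-hot encoding: one indicator column per category, including unused ones
one_hot = pd.get_dummies(df['day'])
```

Integer codes are compact but imply an ordering and spacing that many models will take literally; one-hot columns avoid that at the cost of more columns.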
