# GA Data Science 19 (DAT19) - Class 5
## Developing Mastery of Pandas, Numpy & Bokeh

Justin Breucop (with parts from Craig Sakuma)

## Lab goals

- NumPy: Entering the Matrix
- Pandas: DataFrames as Bamboo
- Bokeh: Picture-Perfect Visuals

##NumPy
As we've seen in lecture, linear algebra is the branch of mathematics that deals with vectors, matrices, and the linear transformations between vector spaces. This matters in practice because a large part of data cleansing is converting data between formats, and certain algorithms require their input to be in a specific shape.

NumPy is a package for scientific computing, built around an N-dimensional array object.

###Creating an array

In [78]:
import numpy as np
a = np.arange(25).reshape(5,5)
# arange(n) is a function that creates a 1-D array of the integers 0 to n-1
# reshape(M,N) is a method that rearranges an array into a matrix of size MxN
a

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

We can convert lists to arrays. Note, however, that unlike a list, all elements of an array must have the same datatype.

In [79]:
alist = [[ 0,  1,  2,  3,  4],[ 5,  6,  7,  8,  9],[10, 11, 12, 13, 14],[15, 16, 17, 18, 19],[20, 21, 22, 23, 24]]
type(alist)

list

In [80]:
np.array(alist)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
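
To illustrate the same-datatype rule, here is a small sketch (the values are made up for illustration) of how `np.array` settles on one common dtype when the input list mixes types:

```python
import numpy as np

ints = np.array([1, 2, 3])         # all integers -> integer dtype
mixed = np.array([1, 2.5, 3])      # an int/float mix is upcast to float
texty = np.array([1, 'two', 3])    # a number/string mix becomes strings

print(ints.dtype.kind)    # 'i' (integer)
print(mixed.dtype)        # float64
print(texty.dtype.kind)   # 'U' (unicode string)
```

The upcast is silent, so it is worth checking `.dtype` after building an array from mixed input.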

In [81]:
biga = a*10
biga

array([[  0,  10,  20,  30,  40],
       [ 50,  60,  70,  80,  90],
       [100, 110, 120, 130, 140],
       [150, 160, 170, 180, 190],
       [200, 210, 220, 230, 240]])

In [82]:
print biga.mean()
print biga.mean(0) #Average per column
print biga.mean(1) #average per row
type(biga.mean(1))

120.0
[ 100.  110.  120.  130.  140.]
[  20.   70.  120.  170.  220.]


numpy.ndarray

In [83]:
bigm = np.matrix(biga-20)
bigm

matrix([[-20, -10,   0,  10,  20],
        [ 30,  40,  50,  60,  70],
        [ 80,  90, 100, 110, 120],
        [130, 140, 150, 160, 170],
        [180, 190, 200, 210, 220]])

In [84]:
bigm * biga # equal to np.matrix(bigm) * np.matrix(biga)

matrix([[  5000,   5000,   5000,   5000,   5000],
        [ 30000,  32500,  35000,  37500,  40000],
        [ 55000,  60000,  65000,  70000,  75000],
        [ 80000,  87500,  95000, 102500, 110000],
        [105000, 115000, 125000, 135000, 145000]])

In [85]:
np.linalg.inv(biga-20)

array([[ -2.81474977e+13,  -1.52777778e-03,   5.62949953e+13,
         -2.22222222e-02,  -2.81474977e+13],
       [  3.51843721e+13,   2.25000000e-02,  -5.27765581e+13,
         -3.51843721e+13,   5.27765581e+13],
       [ -4.22212465e+13,   9.38249922e+13,  -7.97512434e+13,
          4.69124961e+13,  -1.87649984e+13],
       [  9.14793674e+13,  -1.87649984e+14,   9.26521798e+13,
          1.17281240e+13,  -8.20968682e+12],
       [ -5.62949953e+13,   9.38249922e+13,  -1.64193736e+13,
         -2.34562481e+13,   2.34562481e+12]])
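
The very large entries above (on the order of 1e13) are a warning sign: the rows of `biga-20` are linearly dependent, so the matrix is singular, has no true inverse, and the result is dominated by floating-point error. A minimal sketch of one way to check this before inverting, rebuilding the same matrix from scratch:

```python
import numpy as np

m = np.arange(25).reshape(5, 5) * 10 - 20   # same values as biga-20

# Every row is the first row plus a multiple of [1, 1, 1, 1, 1],
# so only two rows are linearly independent.
rank = np.linalg.matrix_rank(m)
print(rank)   # 2, well below the 5 needed for an inverse to exist
```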

In [86]:
print bigm,'\n', biga
bigm * biga

[[-20 -10   0  10  20]
 [ 30  40  50  60  70]
 [ 80  90 100 110 120]
 [130 140 150 160 170]
 [180 190 200 210 220]] 
[[  0  10  20  30  40]
 [ 50  60  70  80  90]
 [100 110 120 130 140]
 [150 160 170 180 190]
 [200 210 220 230 240]]


matrix([[  5000,   5000,   5000,   5000,   5000],
        [ 30000,  32500,  35000,  37500,  40000],
        [ 55000,  60000,  65000,  70000,  75000],
        [ 80000,  87500,  95000, 102500, 110000],
        [105000, 115000, 125000, 135000, 145000]])
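
Note that `*` performs matrix multiplication here only because `bigm` is an `np.matrix`. For plain arrays, `*` multiplies element by element, and the matrix product is written with `dot`. A small sketch with made-up 2x2 values:

```python
import numpy as np

a = np.arange(4).reshape(2, 2)   # [[0, 1], [2, 3]]

elementwise = a * a              # [[0, 1], [4, 9]]: each entry squared
matrix_product = a.dot(a)        # [[2, 3], [6, 11]]: rows times columns

print(elementwise)
print(matrix_product)
```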

####Slices

In [87]:
bigm = np.array(bigm)
bigm[0]

array([-20, -10,   0,  10,  20])

In [88]:
#Same thing, but demonstrating the full slice with a colon
bigm[0,:]
#biga

array([-20, -10,   0,  10,  20])

In [89]:
print biga
biga[:,3]

[[  0  10  20  30  40]
 [ 50  60  70  80  90]
 [100 110 120 130 140]
 [150 160 170 180 190]
 [200 210 220 230 240]]


array([ 30,  80, 130, 180, 230])

The same slicing rules work for data with more dimensions.

In [90]:
compa = np.arange(30).reshape(5,3,2)
compa

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]],

       [[12, 13],
        [14, 15],
        [16, 17]],

       [[18, 19],
        [20, 21],
        [22, 23]],

       [[24, 25],
        [26, 27],
        [28, 29]]])

In [91]:
# let's describe it
print compa.shape
print compa.ndim
print compa.dtype

(5, 3, 2)
3
int64


In [92]:
compa[3,:,1]

array([19, 21, 23])

In [93]:
compa[0,0,0]

0

In [94]:
compa[0,0,0] = 5.9
compa[0,0,0]

5

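NumPy casts an assigned value to the array's datatype rather than changing the array, sometimes to our dismay: above, 5.9 was truncated to 5 because `compa` holds integers. Converting the array to floats first avoids this.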

In [95]:
compa = compa.astype(float)
compa[0,0,0] = 5.75
compa[0,0,0]
type(compa[1,1,1])

numpy.float64

####Random Numbers
Random numbers are very helpful and are necessary at times for testing data pipelines and running statistical analyses. Functions for creating random values are under numpy.random.

In [96]:
# Create a randomized array
# draw 5x5 random numbers, uniformly distributed over [0, 1)
rm = np.random.rand(5,5)
rm

array([[ 0.5019966 ,  0.87621992,  0.23848262,  0.34411253,  0.78879982],
       [ 0.84446473,  0.25190578,  0.65036279,  0.68366312,  0.01631988],
       [ 0.68544353,  0.72873324,  0.97244003,  0.79364348,  0.82339282],
       [ 0.78487125,  0.80761541,  0.83841237,  0.91179345,  0.5542711 ],
       [ 0.33679144,  0.63146089,  0.50794777,  0.21234856,  0.4907863 ]])

In [97]:
np.random.rand() # use Shift+Tab in the notebook to see a description of the function

0.29838015655082606

In [98]:
rm.shape

(5, 5)

In [99]:
np.random.normal(0,10,50) # (mean, standard deviation, size)

array([  9.24643574e+00,   4.53825570e+00,  -1.30828191e+01,
         1.00648289e+01,  -2.96656744e+00,  -8.24066218e+00,
        -8.81077803e-03,  -5.38082255e+00,   1.99358029e+01,
         5.80938378e-01,  -4.82399980e+00,   1.04718343e+00,
         8.47530562e+00,  -1.41423332e+01,   1.22658120e+01,
        -7.39970798e+00,   1.37517572e+01,   1.11366156e+01,
        -4.59752452e+00,   9.27993228e-01,  -3.27593393e+00,
        -7.05306529e+00,   4.12913246e+00,  -1.77021843e+00,
         4.41765908e+00,   4.82497912e+00,  -1.62898519e+01,
         5.31527810e+00,   1.51337248e+01,   1.05101632e+01,
         2.93828361e+00,  -1.99961588e+01,   6.38057560e+00,
         8.53318065e+00,  -3.48128435e+00,   6.50290695e+00,
         3.22701343e+00,   1.73905649e-01,   6.37859091e+00,
         5.26397113e+00,  -3.65508179e+00,  -8.43393365e+00,
        -1.04249133e+01,   8.77205191e+00,  -9.41336672e+00,
        -6.13914279e+00,   1.24614829e+01,   1.02431728e+01,
        -4.64205712e+00,

In [100]:
print rm.mean()
print rm.mean(0) #Average per column
print rm.mean(1) #average per row

0.611051177128
[ 0.63071351  0.65918705  0.64152912  0.58911223  0.53471398]
[ 0.5499223   0.48934326  0.80073062  0.77939272  0.43586699]


In [101]:
# for a different Normal Distribution, use np.random.normal
rm = np.random.normal(5,9,(30,30))
rm

array([[  8.55806543e+00,  -1.17277945e+01,   7.93710176e+00,
         -6.64615650e+00,   7.36872730e+00,  -1.34412556e+00,
          3.70018924e-01,   1.51814674e+01,   1.09687394e+01,
          1.28711838e+01,   9.49737630e+00,   8.50252887e+00,
          6.20618011e+00,   8.29288984e+00,  -5.18277336e+00,
          1.76350654e+01,   1.24331080e+01,   2.17974351e+00,
          1.39527471e+01,   6.31094415e+00,   5.23709004e+00,
          1.49392361e+00,   9.47554019e+00,   7.40666146e+00,
          1.86753820e+01,   1.52338329e+01,   3.65715380e+00,
         -2.25718538e-01,   1.33883505e+01,   2.96634149e+00],
       [  7.52768215e+00,   2.16481115e+01,   4.32229097e+00,
          3.90769375e+00,   1.08377311e+00,  -4.83209642e+00,
          6.42481303e+00,   1.96171075e+01,   1.13648270e+01,
         -4.17599903e-01,   1.67270081e+01,   2.36736895e+00,
          1.93377506e+01,   1.96094704e+00,   6.44960418e+00,
          4.85541224e+00,   2.42113032e+00,   1.17360266e+01,
       

In [102]:
print rm.mean(), "which is hopefully close to the input mean"
print rm.var(), "which is the variance (stdev squared)"
print np.median(rm)

5.61397096709 which is hopefully close to the input mean
81.9574192376 which is the variance (stdev squared)
5.2385588772


Find more distributions and random functions here: http://docs.scipy.org/doc/numpy/reference/routines.random.html
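
Because random draws change on every run, tests of a data pipeline are easier to reproduce when the generator is seeded first. A minimal sketch (the seed value 42 is arbitrary):

```python
import numpy as np

np.random.seed(42)
first = np.random.rand(3)

np.random.seed(42)         # reset to the same starting state
second = np.random.rand(3)

print(np.array_equal(first, second))   # True: same seed, same numbers
```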

###Exercise 1
1) Create a 4x5 array of integers numbering 0 to 19.

In [103]:
np.arange(20).reshape(4,5)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

2) Create a 50x500 array with a mean of 20 and variance of 100. Save it to a variable called  `biggie`

In [104]:
biggie = np.random.normal(20,10,(50,500))
print biggie.shape
print biggie.mean()
print biggie.var()

(50, 500)
19.9778435859
100.214040388


3) Change the mean of the array to a value within 1 of 0 and the variance within 1 of 25. Think about what the mean and the variance represent and try using various mathematical operations.

In [105]:
morph = (biggie - 20)/2
print morph.mean()
print morph.var()

-0.0110782070491
25.053510097
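
Why this works: subtracting a constant shifts the mean but leaves the variance untouched, while dividing by a constant c divides the mean by c and the variance by c squared. A short sketch on freshly generated data (seeded here only so the run is repeatable):

```python
import numpy as np

np.random.seed(0)
sample = np.random.normal(20, 10, (50, 500))   # mean about 20, variance about 100

shifted = sample - 20    # mean moves to about 0, variance unchanged
scaled = shifted / 2     # variance is divided by 2**2 = 4

print(np.isclose(shifted.var(), sample.var()))      # True
print(np.isclose(scaled.var(), sample.var() / 4))   # True
```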


##Pandas: DataFrames as Bamboo
You've already been exposed to DataFrames in the previous labs, so let's look more closely at how we can work with them.

In [106]:
import pandas as pd

data = pd.read_csv("../data/titanic.csv")
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


In [107]:
data.describe() 

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [108]:
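data['Fare'] = 0 # assigning a scalar overwrites every row of the column
data
# example only; it does not make sense for this data, and Fare stays 0 for the rest of the lab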

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,0,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,0,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,0,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,0,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,0,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,0,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,0,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,0,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,0,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14,1,0,237736,0,,C


In [109]:
data[data.Age>65] # a boolean condition selects the rows where it is True

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
33,34,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,0,,S
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,0,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,0,,Q
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,0,,C
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,0,A23,S
672,673,0,2,"Mitchell, Mr. Henry Michael",male,70.0,0,0,C.A. 24580,0,,S
745,746,0,1,"Crosby, Capt. Edward Gifford",male,70.0,1,1,WE/P 5735,0,B22,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,0,,S


In [110]:
data[data.Age<65] # column names are case sensitive and must be spelled exactly

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,0,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,0,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,0,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,0,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,0,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,0,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,0,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,0,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14,1,0,237736,0,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4,1,1,PP 9549,0,G6,S


In [111]:
data[(data.Age==11)&(data.SibSp==5)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
59,60,0,3,"Goodwin, Master. William Frederick",male,11,5,2,CA 2144,0,,S


In [112]:
data[['Name','Age']][data.Age>65] # same thing as data[data.Age>65][['Name','Age']]
# the inner [] creates a list of column names
# equivalent to columns=['Name', 'Age'] followed by
# data[data.Age>65][columns]

Unnamed: 0,Name,Age
33,"Wheadon, Mr. Edward H",66.0
96,"Goldschmidt, Mr. George B",71.0
116,"Connors, Mr. Patrick",70.5
493,"Artagaveytia, Mr. Ramon",71.0
630,"Barkworth, Mr. Algernon Henry Wilson",80.0
672,"Mitchell, Mr. Henry Michael",70.0
745,"Crosby, Capt. Edward Gifford",70.0
851,"Svensson, Mr. Johan",74.0


In [113]:
data[(data.Age==11)|(data.SibSp==5)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
59,60,0,3,"Goodwin, Master. William Frederick",male,11,5,2,CA 2144,0,,S
71,72,0,3,"Goodwin, Miss. Lillian Amy",female,16,5,2,CA 2144,0,,S
386,387,0,3,"Goodwin, Master. Sidney Leonard",male,1,5,2,CA 2144,0,,S
480,481,0,3,"Goodwin, Master. Harold Victor",male,9,5,2,CA 2144,0,,S
542,543,0,3,"Andersson, Miss. Sigrid Elisabeth",female,11,4,2,347082,0,,S
683,684,0,3,"Goodwin, Mr. Charles Edward",male,14,5,2,CA 2144,0,,S
731,732,0,3,"Hassan, Mr. Houssein G N",male,11,0,0,2699,0,,C
802,803,1,1,"Carter, Master. William Thornton II",male,11,1,2,113760,0,B96 B98,S
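
The two filters above combine conditions with `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons such as `==`. A small sketch on a made-up frame (the values are hypothetical, not taken from the Titanic data):

```python
import pandas as pd

people = pd.DataFrame({'Age':   [11, 16, 11, 40],
                       'SibSp': [5,  5,  0,  1]})

both = people[(people.Age == 11) & (people.SibSp == 5)]     # rows meeting both conditions
either = people[(people.Age == 11) | (people.SibSp == 5)]   # rows meeting at least one

print(len(both))     # 1
print(len(either))   # 3
```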


In [114]:
data.values # the underlying data as a NumPy array
data.columns

Index([u'PassengerId', u'Survived', u'Pclass', u'Name', u'Sex', u'Age',
       u'SibSp', u'Parch', u'Ticket', u'Fare', u'Cabin', u'Embarked'],
      dtype='object')

###Cleaning Data

In [115]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null int64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(1), int64(6), object(5)
memory usage: 90.5+ KB


####Working with nulls
Exclude rows with missing data

In [116]:
# data[data.Age.isnull()]
data[data.Age.notnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,0,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,0,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,0,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,0,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,0,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,0,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,0,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,0,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14,1,0,237736,0,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4,1,1,PP 9549,0,G6,S


In [117]:
# You can also just replace the nulls
# (fillna returns a new Series; data itself is not changed here)
data.Age[data.Age.isnull()].fillna(0)

5      0
17     0
19     0
26     0
28     0
29     0
31     0
32     0
36     0
42     0
45     0
46     0
47     0
48     0
55     0
64     0
65     0
76     0
77     0
82     0
87     0
95     0
101    0
107    0
109    0
121    0
126    0
128    0
140    0
154    0
      ..
718    0
727    0
732    0
738    0
739    0
740    0
760    0
766    0
768    0
773    0
776    0
778    0
783    0
790    0
792    0
793    0
815    0
825    0
826    0
828    0
832    0
837    0
839    0
846    0
849    0
859    0
863    0
868    0
878    0
888    0
Name: Age, dtype: float64

In [118]:
# Replace with the mean so that the column's average is preserved
avg_age = data.Age[data.Age.notnull()].mean()
print avg_age
data.Age.fillna(avg_age)

29.6991176471


0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
5      29.699118
6      54.000000
7       2.000000
8      27.000000
9      14.000000
10      4.000000
11     58.000000
12     20.000000
13     39.000000
14     14.000000
15     55.000000
16      2.000000
17     29.699118
18     31.000000
19     29.699118
20     35.000000
21     34.000000
22     15.000000
23     28.000000
24      8.000000
25     38.000000
26     29.699118
27     19.000000
28     29.699118
29     29.699118
         ...    
861    21.000000
862    48.000000
863    29.699118
864    24.000000
865    42.000000
866    27.000000
867    31.000000
868    29.699118
869     4.000000
870    26.000000
871    47.000000
872    33.000000
873    47.000000
874    28.000000
875    15.000000
876    20.000000
877    19.000000
878    29.699118
879    56.000000
880    25.000000
881    33.000000
882    22.000000
883    28.000000
884    25.000000
885    39.000000
886    27.000000
887    19.000000
888    29.6991

####Replace with random normal distribution

In [119]:
# Get values of mean and standard deviation
data.Age[data.Age.notnull()].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [120]:
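# Replace null values with a draw from a normal distribution
# (np.random.normal(29.7,14.5) returns a single number, so every null gets the same value)
data.Age.fillna(np.random.normal(29.7,14.5),inplace=True)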

In [121]:
data.Age.fillna(np.random.normal(29.7,14.5)).describe()

count    891.000000
mean      27.773730
std       13.565520
min        0.420000
25%       20.006914
50%       24.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64
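
Because `np.random.normal(29.7,14.5)` without a size argument produces one number, every missing age above received the same value. If the goal is an independent draw for each missing entry, one possible approach is sketched below on a made-up Series (seeded only to make the sketch repeatable):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0, np.nan])

missing = ages.isnull()
draws = np.random.normal(29.7, 14.5, size=int(missing.sum()))   # one draw per null

filled = ages.copy()
filled[missing] = draws
print(filled.isnull().sum())   # 0
```

A normal distribution can produce negative values, so for ages it may be sensible to clip the draws to a plausible range afterwards.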

###Convert categorical data to numerical

In [122]:
data.Sex=='female'

0      False
1       True
2       True
3       True
4      False
5      False
6      False
7      False
8       True
9       True
10      True
11      True
12     False
13     False
14      True
15      True
16     False
17     False
18      True
19      True
20     False
21     False
22      True
23     False
24      True
25      True
26     False
27     False
28      True
29     False
       ...  
861    False
862     True
863     True
864    False
865     True
866     True
867    False
868    False
869    False
870    False
871     True
872    False
873    False
874     True
875     True
876    False
877    False
878    False
879     True
880     True
881    False
882     True
883    False
884    False
885     True
886    False
887     True
888     True
889    False
890    False
Name: Sex, dtype: bool

In [123]:
data.rename(columns={'Sex':'Is Female'},inplace=True)
data['Is Female']=data['Is Female']=='female'
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Is Female,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",False,22,1,0,A/5 21171,0,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",True,38,1,0,PC 17599,0,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",True,26,0,0,STON/O2. 3101282,0,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",True,35,1,0,113803,0,C123,S
4,5,0,3,"Allen, Mr. William Henry",False,35,0,0,373450,0,,S


In [124]:
# get unique values of Embarked
data.Embarked.unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [125]:
# replace values with numbers
data.Embarked.replace(['S', 'C', 'Q'],[1,2,3],inplace=True)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Is Female,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",False,22,1,0,A/5 21171,0,,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",True,38,1,0,PC 17599,0,C85,2
2,3,1,3,"Heikkinen, Miss. Laina",True,26,0,0,STON/O2. 3101282,0,,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",True,35,1,0,113803,0,C123,1
4,5,0,3,"Allen, Mr. William Henry",False,35,0,0,373450,0,,1
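
Mapping 'S', 'C', 'Q' to 1, 2, 3 implies an ordering between the ports that does not really exist. One common alternative is one-hot encoding, sketched here with `pd.get_dummies` on a made-up Series:

```python
import numpy as np
import pandas as pd

embarked = pd.Series(['S', 'C', 'Q', np.nan], name='Embarked')

dummies = pd.get_dummies(embarked, prefix='Embarked')   # one column per port
print(list(dummies.columns))   # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```

With the default settings, the row with a missing port gets no flag in any of the three columns.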


###Selecting with .loc, .iloc, & .ix

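Selecting data in pandas can be tricky. The main takeaway is that .loc selects by index label, .iloc selects by integer position, and .ix accepts a mix of the two. Note that .ix was deprecated in pandas 0.20 and removed in 1.0, so newer code should use .loc or .iloc.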

In [126]:
df = pd.DataFrame(np.random.randn(6,4),index=list('abcdef'),columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
a,0.236301,-1.231242,2.141954,-0.43649
b,-0.015346,-0.94966,1.112765,1.058777
c,0.607472,-0.937408,-0.66574,0.675953
d,0.076937,-2.068285,-0.730544,-1.305667
e,-1.621248,0.189123,-0.509118,0.153752
f,1.765439,1.752432,-0.951042,-0.05909


In [127]:
df.loc['f']

A    1.765439
B    1.752432
C   -0.951042
D   -0.059090
Name: f, dtype: float64

In [128]:
df.iloc[len(df.index)-1]

A    1.765439
B    1.752432
C   -0.951042
D   -0.059090
Name: f, dtype: float64

In [129]:
df.A.ix['f'] == df.A.ix[-1]

True
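
A minimal sketch of the same label-versus-position comparison using only `.loc` and `.iloc`, which also runs on pandas 1.0 and later, where `.ix` no longer exists (the frame below uses made-up integer values):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(24).reshape(6, 4),
                     index=list('abcdef'), columns=list('ABCD'))

by_label = frame.loc['f', 'A']      # row label 'f', column label 'A'
by_position = frame.iloc[-1, 0]     # last row, first column

print(by_label == by_position)      # True
```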

In [130]:
cc = list('cookies')
cc[-4]

'k'

###Group by

In [131]:
# Find the average age of passengers who survived vs. died, within each class
data.groupby(['Pclass','Survived'])['Age'].mean()
# groups the dataset by every (Pclass, Survived) combination
# to sort the result, save it to a variable first:
# titanic_status = ...
# titanic_status.sort()
# titanic_status

Pclass  Survived
1       0           38.957633
        1           33.786888
2       0           32.567509
        1           25.630548
3       0           24.759960
        1           20.463488
Name: Age, dtype: float64

In [132]:
# Count number of female passengers
data.groupby('Is Female')['PassengerId'].count()

Is Female
False    577
True     314
Name: PassengerId, dtype: int64

In [133]:
data.groupby(['Survived','Pclass'])['PassengerId'].count()

Survived  Pclass
0         1          80
          2          97
          3         372
1         1         136
          2          87
          3         119
Name: PassengerId, dtype: int64
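
Several summaries can be computed per group in one pass with `agg`. A small sketch on a made-up frame (the column names mirror the Titanic data, the values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'Survived': [0, 0, 1, 1, 1],
                    'Age':      [40.0, 30.0, 20.0, 10.0, 30.0]})

# one row per group, one column per summary
summary = toy.groupby('Survived')['Age'].agg(['mean', 'count'])
print(summary)
```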

###Apply

In [134]:
# Convert ticket prices to USD
# (apply returns a new Series; data is unchanged unless the result is assigned back)
data.Fare.apply(lambda x: x*1.6)
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Is Female,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",False,22.000000,1,0,A/5 21171,0,,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",True,38.000000,1,0,PC 17599,0,C85,2
2,3,1,3,"Heikkinen, Miss. Laina",True,26.000000,0,0,STON/O2. 3101282,0,,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",True,35.000000,1,0,113803,0,C123,1
4,5,0,3,"Allen, Mr. William Henry",False,35.000000,0,0,373450,0,,1
5,6,0,3,"Moran, Mr. James",False,20.006914,0,0,330877,0,,3
6,7,0,1,"McCarthy, Mr. Timothy J",False,54.000000,0,0,17463,0,E46,1
7,8,0,3,"Palsson, Master. Gosta Leonard",False,2.000000,3,1,349909,0,,1
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",True,27.000000,0,2,347742,0,,1
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",True,14.000000,1,0,237736,0,,2
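
`apply` returns a new Series rather than modifying the frame, so the result has to be assigned back to the column to keep it. A sketch on made-up fares (1.6 is the conversion rate used above):

```python
import pandas as pd

fares = pd.DataFrame({'Fare': [10.0, 20.0, 5.0]})

fares['Fare'] = fares.Fare.apply(lambda x: x * 1.6)   # store the converted values
print(fares.Fare.tolist())
```

For simple arithmetic like this, `fares.Fare * 1.6` gives the same result without `apply`.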


In [135]:
data.Name

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
5                                       Moran, Mr. James
6                                McCarthy, Mr. Timothy J
7                         Palsson, Master. Gosta Leonard
8      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                    Nasser, Mrs. Nicholas (Adele Achem)
10                       Sandstrom, Miss. Marguerite Rut
11                              Bonnell, Miss. Elizabeth
12                        Saundercock, Mr. William Henry
13                           Andersson, Mr. Anders Johan
14                  Vestrom, Miss. Hulda Amanda Adolfina
15                      Hewlett, Mrs. (Mary D Kingcome) 
16                                  Rice, Master. Eugene
17                          Wil

In [136]:
data.Name.apply(lambda x: x.split(",")[0])

0               Braund
1              Cumings
2            Heikkinen
3             Futrelle
4                Allen
5                Moran
6             McCarthy
7              Palsson
8              Johnson
9               Nasser
10           Sandstrom
11             Bonnell
12         Saundercock
13           Andersson
14             Vestrom
15             Hewlett
16                Rice
17            Williams
18       Vander Planke
19          Masselmani
20              Fynney
21             Beesley
22             McGowan
23              Sloper
24             Palsson
25             Asplund
26                Emir
27             Fortune
28             O'Dwyer
29            Todoroff
            ...       
861              Giles
862              Swift
863               Sage
864               Gill
865            Bystrom
866       Duran y More
867           Roebling
868      van Melkebeke
869            Johnson
870             Balkic
871           Beckwith
872           Carlsson
873    Vand

###Concatenate

In [137]:
data_first_half = data.iloc[0:10,:]
data_first_half.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 12 columns):
PassengerId    10 non-null int64
Survived       10 non-null int64
Pclass         10 non-null int64
Name           10 non-null object
Is Female      10 non-null bool
Age            10 non-null float64
SibSp          10 non-null int64
Parch          10 non-null int64
Ticket         10 non-null object
Fare           10 non-null int64
Cabin          3 non-null object
Embarked       10 non-null float64
dtypes: bool(1), float64(2), int64(6), object(3)
memory usage: 970.0+ bytes


In [147]:
data_second_half = data.iloc[10:,:]

remake_data = pd.concat([data_first_half,data_second_half])
remake_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null object
Name           891 non-null object
Is Female      891 non-null bool
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null int64
Cabin          204 non-null object
Embarked       889 non-null float64
dtypes: bool(1), float64(2), int64(5), object(4)
memory usage: 84.4+ KB
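
A small sketch of the same split-and-rejoin round trip on a made-up frame, checking that `pd.concat` restores the original rows and index:

```python
import pandas as pd

toy = pd.DataFrame({'PassengerId': [1, 2, 3, 4], 'Survived': [0, 1, 1, 0]})

top = toy.iloc[0:2, :]      # first two rows
bottom = toy.iloc[2:, :]    # the remaining rows

rejoined = pd.concat([top, bottom])
print(rejoined.equals(toy))   # True
```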


###Exercise 2
1) Replace Pclass numbers with 'First Class', 'Second Class', 'Third Class'

In [139]:
data.Pclass.replace([1,2,3],['First Class', 'Second Class', 'Third Class'],inplace=True)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Is Female,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,Third Class,"Braund, Mr. Owen Harris",False,22,1,0,A/5 21171,0,,1
1,2,1,First Class,"Cumings, Mrs. John Bradley (Florence Briggs Th...",True,38,1,0,PC 17599,0,C85,2
2,3,1,Third Class,"Heikkinen, Miss. Laina",True,26,0,0,STON/O2. 3101282,0,,1
3,4,1,First Class,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",True,35,1,0,113803,0,C123,1
4,5,0,Third Class,"Allen, Mr. William Henry",False,35,0,0,373450,0,,1


2) What was the average ticket price for survivors vs. dead passengers?

In [140]:
data.groupby(['Survived'])['Fare'].mean()

Survived
0    0
1    0
Name: Fare, dtype: int64

###Bonus!!!
Round all ages to the nearest year using `apply`

##Bokeh: Picture Perfect Visuals

To install Bokeh, go to a terminal and type:

`conda install bokeh` 

Bokeh is built by the same people who created Anaconda (Continuum Analytics) and is designed for web display out of the box, which makes it well suited to creating presentation-ready, interactive visuals quickly. Labs in this course will be shown in Bokeh. Check out http://bokeh.pydata.org/en/latest/docs/quickstart.html#concepts to see some of the range of capabilities.

In [148]:
from bokeh.plotting import figure, output_notebook,show,vplot
output_notebook()

In [152]:
# pandas.io.data was later moved out of pandas into the separate pandas-datareader package
import pandas.io.data
import datetime
fb = pd.io.data.get_data_yahoo('FB', 
                                 start=datetime.datetime(2015, 4, 1), 
                                 end=datetime.datetime(2015, 4, 28))


In [154]:
# prepare some data
x = fb.Low
y = fb.High

# create a new plot with a title and axis labels
p = figure(title="Stock High vs. Low", x_axis_label='Low', y_axis_label='High')

# These are glyphs
p.circle(x, y,size=90,alpha=0.5,)
p.line(x,x*y.mean()/x.mean())

# show the results
show(p)

In [155]:
y.mean()

83.27526336842107

In [156]:
x*y.mean()/x.mean()

Date
2015-04-01    82.332731
2015-04-02    82.913040
2015-04-06    82.261465
2015-04-07    83.707147
2015-04-08    83.320269
2015-04-09    83.187920
2015-04-10    83.391543
2015-04-13    83.401718
2015-04-14    83.931127
2015-04-15    83.758047
2015-04-16    83.635882
2015-04-17    81.823687
2015-04-20    82.709418
2015-04-21    85.051022
2015-04-22    85.163013
2015-04-23    83.900586
2015-04-24    82.953764
2015-04-27    83.106471
2015-04-28    81.681155
Name: Low, dtype: float64

In [157]:
fb.Low

Date
2015-04-01    80.870003
2015-04-02    81.440002
2015-04-06    80.800003
2015-04-07    82.220001
2015-04-08    81.839996
2015-04-09    81.709999
2015-04-10    81.910004
2015-04-13    81.919998
2015-04-14    82.440002
2015-04-15    82.269997
2015-04-16    82.150002
2015-04-17    80.370003
2015-04-20    81.239998
2015-04-21    83.540001
2015-04-22    83.650002
2015-04-23    82.410004
2015-04-24    81.480003
2015-04-27    81.629997
2015-04-28    80.230003
Name: Low, dtype: float64

At its core, Bokeh is built from Plots and Glyphs. A plot is created with the figure function, and glyphs are the visual shapes (circles, lines, and so on) that are added to it. The visuals are scalable, interactive, and savable. You can even create vectorized colors.

In [158]:
# prepare some data
N = 4000
x = np.random.random(size=N) * 100
y = np.random.random(size=N) * 100
radii = np.random.random(size=N) * 1.5
colors = ["#%02x%02x%02x" % (int(r), int(g), 150) for r, g in zip(50+2*x, 30+2*y)]  # %02x needs integers

TOOLS="resize,crosshair,pan,wheel_zoom,box_zoom,reset,box_select,lasso_select"

# create a new plot with the tools above, and explicit ranges
p = figure(tools=TOOLS, x_range=(0,100), y_range=(0,100))

# add a circle renderer with vectorized colors and sizes
p.circle(x,y, radius=radii, fill_color=colors, fill_alpha=0.6, line_color=None)

# show the results
show(p)

In [None]:
p1 = figure(title="Titanic Ages Dead",x_axis_label = 'Age',y_axis_label = 'Count')
#construct the histogram
hist, edges = np.histogram(data.Age[data.Survived==0].values, density=True, bins=50)
#Construct your x axis
x = np.linspace(data.Age.min(),data.Age.max(),100)
#add the bars, scaling the value to the full count of people
p1.quad(top=hist*len(data.Age), bottom=0, left=edges[:-1], right=edges[1:],line_color='black')

p2 = figure(title="Titanic Ages Survived",x_axis_label = 'Age',y_axis_label = 'Count')

hist, edges = np.histogram(data.Age[data.Survived==1].values, density=True, bins=50)
x = np.linspace(data.Age.min(),data.Age.max(),100)
p2.quad(top=hist*len(data.Age), bottom=0, left=edges[:-1], right=edges[1:],line_color='black')

dummy_line = list(range(len(x)))  # placeholder y-values, one per point in x
p2.line(x, dummy_line)

show(vplot(p1,p2))

In [None]:
%matplotlib inline
data.Age.hist()