# Loading Libraries

Once you have installed a package (the easiest way is to run the command pip3 install package_name in the console), you can load it into your workspace as follows. People often abbreviate package names so that they are easier to reference when calling functions. Note: if you installed Python using Anaconda, you probably already have these packages installed.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import patsy
import statsmodels.api as sm
import statsmodels.formula.api as smf
#the next line causes graphics to be displayed inline (Jupyter/IPython only)
%matplotlib inline

# NumPy
NumPy is a Python library for manipulating vectors and arrays in a way that is more similar to R, so you can do element-wise operations, matrix math, and so on.

## NumPy Arrays

NumPy arrays are N-dimensional arrays that contain data. A one-dimensional NumPy array looks very similar to a Python list, but arithmetic on it is applied element by element.

In [7]:
#create a numpy array
print(np.array([1,2,3]))

[1 2 3]


In [8]:
heights = [1.71, 1.62, 1.74]
weights = [65.1, 59.4, 63.9]
np_heights = np.array(heights)
np_weights = np.array(weights)
# note that we can do element-wise operations, like with R vectors
# meanwhile, bmis = weights / heights ** 2 on the plain lists would throw a TypeError
bmis = np_weights / np_heights ** 2 #all good

#print rounded bmis
print(bmis.round(2))

[22.26 22.63 21.11]
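To make the contrast with plain lists concrete, here is a minimal sketch that attempts the same BMI calculation on the lists, catches the resulting error, and compares the NumPy result with a list comprehension. The names `list_error` and `bmis_loop` are illustrative and not part of the original notebook.

```python
import numpy as np

heights = [1.71, 1.62, 1.74]
weights = [65.1, 59.4, 63.9]

# Plain lists do not support element-wise exponentiation or division
try:
    bmis_list = weights / heights ** 2
except TypeError as err:
    list_error = type(err).__name__
    print("Lists failed with:", list_error)

# A list comprehension is the pure-Python equivalent
bmis_loop = [w / h ** 2 for w, h in zip(weights, heights)]

# The NumPy version expresses the same calculation in one vectorized line
bmis = np.array(weights) / np.array(heights) ** 2

# Both approaches give the same numbers
print(np.allclose(bmis, bmis_loop))
```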


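NumPy also allows you to subset using conditionals, like in R. A comparison returns an array of booleans, which can then be used to select the matching elements: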

In [10]:
numbers = [1,2,3,4,5]
np_numbers = np.array(numbers)
print(np_numbers > 3)
print(np_numbers[np_numbers > 3])

[False False False  True  True]
[4 5]
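Conditions can also be combined. The sketch below uses the element-wise operators `&` (and), `|` (or), and `~` (not); the variable names are illustrative. Each comparison is wrapped in parentheses because these operators bind more tightly than `>` and `<`.

```python
import numpy as np

np_numbers = np.array([1, 2, 3, 4, 5])

# Both conditions must hold
middle = np_numbers[(np_numbers > 1) & (np_numbers < 5)]
# Either condition may hold
ends = np_numbers[(np_numbers < 2) | (np_numbers > 4)]
# Negate a condition
not_big = np_numbers[~(np_numbers > 3)]

print(middle)   # [2 3 4]
print(ends)     # [1 5]
print(not_big)  # [1 2 3]
```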


Here is an example of a 2-D NumPy array, built from a list of lists where each inner list is one row:

In [12]:
np_2d = np.array([[1,2,3,4,5],
                  [10,9,8,7,6]])
print(np_2d)

[[ 1  2  3  4  5]
 [10  9  8  7  6]]


In [15]:
#note: shape is an attribute, not a method
#gives number of rows, then columns
print(np_2d.shape)

nrow, ncol = np_2d.shape
print(nrow)
print(ncol)

(2, 5)
2
5
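For comparison, the following sketch shows how `shape` looks for a 1-D array, along with two related attributes, `ndim` and `size`. The array `np_1d` is introduced here only for illustration.

```python
import numpy as np

np_1d = np.array([1, 2, 3])
np_2d = np.array([[1, 2, 3, 4, 5],
                  [10, 9, 8, 7, 6]])

print(np_1d.shape)  # (3,)  a 1-D array has a one-element shape tuple
print(np_2d.ndim)   # 2     number of dimensions
print(np_2d.size)   # 10    total number of elements (rows * columns)
print(len(np_2d))   # 2     len() gives the size of the first dimension (rows)
```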


### Subsetting a (2D) NumPy Array

In [16]:
print(np_2d[0]) #first row
print(np_2d[0][1]) #first row, second column
print(np_2d[0,1]) #same thing

#other examples:
print(np_2d[:,1]) #second column
print(np_2d[:,1:3]) #all rows, columns 2-3
print(np_2d[0,:]) #row 1, all columns

[1 2 3 4 5]
2
2
[2 9]
[[2 3]
 [9 8]]
[1 2 3 4 5]
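A few more subsetting patterns follow the same rules. This sketch (variable names are illustrative) uses negative positions, a list of column positions, and a boolean mask on the same 2-D array.

```python
import numpy as np

np_2d = np.array([[1, 2, 3, 4, 5],
                  [10, 9, 8, 7, 6]])

last_col = np_2d[:, -1]        # last column of every row
bottom_right = np_2d[-1, -1]   # last row, last column
picked = np_2d[:, [0, 2, 4]]   # columns 1, 3 and 5, chosen by position
big = np_2d[np_2d > 5]         # a boolean mask returns the matches as a 1-D array

print(last_col)      # [5 6]
print(bottom_right)  # 6
print(picked)
print(big)           # [10  9  8  7  6]
```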


### Basic Summary Stats from NumPy Arrays

In [17]:
#mean and median
print(np.mean(np_2d[0,:])) #mean of first row
print(np.median(np_2d[:,1])) #median of second column

3.0
5.5
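Rather than slicing out one row or column at a time, the summary functions accept an `axis` argument. A minimal sketch, with illustrative variable names: `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row).

```python
import numpy as np

np_2d = np.array([[1, 2, 3, 4, 5],
                  [10, 9, 8, 7, 6]])

col_means = np.mean(np_2d, axis=0)      # one mean per column
row_means = np.mean(np_2d, axis=1)      # one mean per row
row_medians = np.median(np_2d, axis=1)  # one median per row
overall_mean = np.mean(np_2d)           # no axis: mean of all elements

print(col_means)     # [5.5 5.5 5.5 5.5 5.5]
print(row_means)     # [3. 8.]
print(row_medians)   # [3. 8.]
print(overall_mean)  # 5.5
```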


# Pandas
Pandas is another Python library, built on top of NumPy, that provides a data frame object like the one in R.

## Pandas DataFrame

You can create a data frame in Pandas using a dictionary object as follows:  

In [2]:
#First, load pandas as "pd"
import pandas as pd

In [72]:
borough = ['The Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island']
population =  [1432132, 2582830, 1628701, 2278906, 476179]
sq_miles = [42, 71, 23, 109, 58]
#use a name other than "dict" so the built-in dict type is not shadowed
data = { 'borough':borough, 'pop':population, 'sq_miles':sq_miles }
nyc = pd.DataFrame(data)
print(nyc)

         borough      pop  sq_miles
0      The Bronx  1432132        42
1       Brooklyn  2582830        71
2      Manhattan  1628701        23
3         Queens  2278906       109
4  Staten Island   476179        58


In [73]:
#two different ways to access the borough column:
print(nyc['borough'])
print(nyc.borough)

0        The Bronx
1         Brooklyn
2        Manhattan
3           Queens
4    Staten Island
Name: borough, dtype: object
0        The Bronx
1         Brooklyn
2        Manhattan
3           Queens
4    Staten Island
Name: borough, dtype: object
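The two access styles are not always interchangeable. As a small sketch of one pitfall: the `pop` column in this data frame shares its name with the existing `DataFrame.pop` method, so attribute access returns the method rather than the column. Bracket access works for any column name.

```python
import pandas as pd

nyc = pd.DataFrame({
    'borough': ['The Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'],
    'pop': [1432132, 2582830, 1628701, 2278906, 476179],
    'sq_miles': [42, 71, 23, 109, 58],
})

# Attribute access finds the DataFrame.pop method, not the 'pop' column
print(callable(nyc.pop))        # True: this is a method

# Bracket access always returns the column
pop_col = nyc['pop']
print(type(pop_col).__name__)   # Series

# Attribute access is fine when the name does not clash with a method
print(nyc.sq_miles.equals(nyc['sq_miles']))  # True
```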


In [74]:
#create a new column: density, which is pop/sq_mi
nyc['density'] = round(nyc['pop']/nyc['sq_miles'])

In [75]:
#use the borough column as the row index (row names)
nyc = nyc.set_index('borough')
print(nyc)

                   pop  sq_miles  density
borough                                  
The Bronx      1432132        42  34098.0
Brooklyn       2582830        71  36378.0
Manhattan      1628701        23  70813.0
Queens         2278906       109  20907.0
Staten Island   476179        58   8210.0
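Note that `set_index` returns a new data frame rather than changing the original, which is why the result is assigned back to `nyc` above. The sketch below illustrates this and shows `reset_index` as the reverse operation; the names `nyc_indexed` and `restored` are illustrative.

```python
import pandas as pd

nyc = pd.DataFrame({
    'borough': ['The Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'],
    'pop': [1432132, 2582830, 1628701, 2278906, 476179],
    'sq_miles': [42, 71, 23, 109, 58],
})

# set_index returns a new DataFrame; the original keeps its borough column
nyc_indexed = nyc.set_index('borough')
print('borough' in nyc.columns)          # True
print('borough' in nyc_indexed.columns)  # False
print(nyc_indexed.index.name)            # borough

# reset_index moves the index back into an ordinary column
restored = nyc_indexed.reset_index()
print(list(restored.columns))            # ['borough', 'pop', 'sq_miles']
```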


We can now access specific rows and columns using either their labels with the .loc[ ] indexer, or their integer positions with the .iloc[ ] indexer. Both use square brackets rather than parentheses:

In [76]:
#pop column, all rows
nyc.loc[:,'pop']

borough
The Bronx        1432132
Brooklyn         2582830
Manhattan        1628701
Queens           2278906
Staten Island     476179
Name: pop, dtype: int64

In [77]:
#pop and sq_miles columns, all rows (pass a list of labels to select several columns)
nyc.loc[:,['pop','sq_miles']]

                   pop  sq_miles
borough
The Bronx      1432132        42
Brooklyn       2582830        71
Manhattan      1628701        23
Queens         2278906       109
Staten Island   476179        58


In [78]:
#pop and sq_miles columns, Brooklyn and Manhattan rows
nyc.loc[['Brooklyn', 'Manhattan'],['pop','sq_miles']]

               pop  sq_miles
borough
Brooklyn   2582830        71
Manhattan  1628701        23


In [79]:
#Queens and Staten Island rows, all columns
nyc.loc[['Queens', 'Staten Island'],:]

                   pop  sq_miles  density
borough
Queens         2278906       109  20907.0
Staten Island   476179        58   8210.0


In [83]:
#rows from The Bronx through Manhattan (label slices include the end label)
nyc.loc['The Bronx':'Manhattan']

               pop  sq_miles  density
borough
The Bronx  1432132        42  34098.0
Brooklyn   2582830        71  36378.0
Manhattan  1628701        23  70813.0


In [84]:
#first three rows, using integer positions (all columns)
nyc.iloc[0:3,:]

               pop  sq_miles  density
borough
The Bronx  1432132        42  34098.0
Brooklyn   2582830        71  36378.0
Manhattan  1628701        23  70813.0
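The last two cells return the same rows even though the slices look different. A short sketch of why: a label slice with .loc includes its end label, while a position slice with .iloc excludes its end position, as with ordinary Python slicing. The variable names below are illustrative.

```python
import pandas as pd

nyc = pd.DataFrame({
    'borough': ['The Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'],
    'pop': [1432132, 2582830, 1628701, 2278906, 476179],
    'sq_miles': [42, 71, 23, 109, 58],
}).set_index('borough')

# .loc slices by label and INCLUDES the end label
by_label = nyc.loc['The Bronx':'Manhattan']

# .iloc slices by position and EXCLUDES the end position
by_position = nyc.iloc[0:3]

print(by_label.equals(by_position))  # True: both return the first three rows
print(list(nyc.iloc[0:2].index))     # ['The Bronx', 'Brooklyn']
```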


In [88]:
#first three rows, second and third columns (positions 1 and 2)
nyc.iloc[0:3,1:3]

           sq_miles  density
borough
The Bronx        42  34098.0
Brooklyn         71  36378.0
Manhattan        23  70813.0


In [93]:
#first, second, and fourth rows; first and third columns
nyc.iloc[[0,1,3],[0,2]]

               pop  density
borough
The Bronx  1432132  34098.0
Brooklyn   2582830  36378.0
Queens     2278906  20907.0


In [106]:
#we can also select rows using conditions; wrap each condition in parentheses and combine with &
nyc[(nyc['density']>30000) & (nyc['sq_miles']<50)]

               pop  sq_miles  density
borough
The Bronx  1432132        42  34098.0
Manhattan  1628701        23  70813.0
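As a sketch of two variations on conditional selection: conditions can be joined with `|` (or) instead of `&`, and the same filter can be written as a string with the `DataFrame.query` method. The variable names are illustrative, and the data frame is rebuilt here so that the example runs on its own.

```python
import pandas as pd

nyc = pd.DataFrame({
    'borough': ['The Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'],
    'pop': [1432132, 2582830, 1628701, 2278906, 476179],
    'sq_miles': [42, 71, 23, 109, 58],
}).set_index('borough')
nyc['density'] = (nyc['pop'] / nyc['sq_miles']).round()

# Rows where EITHER condition holds
dense_or_small = nyc[(nyc['density'] > 30000) | (nyc['sq_miles'] < 50)]
print(list(dense_or_small.index))  # ['The Bronx', 'Brooklyn', 'Manhattan']

# The original "and" filter, written as a query string
dense_and_small = nyc.query('density > 30000 and sq_miles < 50')
print(list(dense_and_small.index))  # ['The Bronx', 'Manhattan']
```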
