## Data Analysis in Python
Statistics Club
October 27, 2019

Python is a more general-purpose programming language than R or Matlab. It has steadily gained popularity for data analysis and scientific computing, but those tasks rely on additional modules. Some of the most popular modules are:

1. NumPy - N-dimensional array
2. SciPy - Scientific computing (linear algebra, numerical integration, optimization, etc)
3. Matplotlib - 2D Plotting (similar to Matlab)
4. Pandas - Data analysis (provides a data frame structure similar to R)
5. Seaborn - Data visualization (for more attractive and informative graphics)
6. scikit-learn - Machine learning (simple and efficient machine learning tools)

NumPy and Pandas are used in this presentation. 
We will discuss Matplotlib, Seaborn, and scikit-learn in the subsequent presentation.

## Object Oriented Programming
- All Python data types are objects (lists, strings, dicts, etc.)
- Objects combine value(s) and methods (functions) 
- Advanced users can create their own classes of objects 
- Important OOP topics not covered today: encapsulation, polymorphism

In [1]:
l=[1,2,3]
l.clear()
l

[]

## Defining New Classes

In [2]:
class Dog:

    # Class Attribute
    species = 'mammal'

    # Initializer / Instance Attributes
    def __init__(self, name, age):
        self.name = name
        self.age = age
        
    # Method
    def printAge(self):
        print("The age of ", self.name, " is ", self.age)

## OOP Example

In [3]:
a = Dog("Apollo", 11)
a.printAge()

The age of  Apollo  is  11


In [4]:
type(a)

__main__.Dog

In [5]:
b = Dog("Apollo", 11)
a == b # False: no __eq__ method is defined, so == falls back to identity (a and b are different objects)

False
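
As the comment above notes, comparing by value requires an `__eq__` method. The following is a minimal sketch of one way to add it; comparing on `name` and `age` is an assumption made for illustration, not something the original class specifies.

```python
class Dog:
    # Same class as above, with an __eq__ method added for illustration
    species = 'mammal'

    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __eq__(self, other):
        # Only compare against other Dog instances
        if not isinstance(other, Dog):
            return NotImplemented
        # Assumption: two dogs are "equal" when name and age both match
        return (self.name, self.age) == (other.name, other.age)


a = Dog("Apollo", 11)
b = Dog("Apollo", 11)
c = Dog("Zeus", 3)
print(a == b)  # True
print(a == c)  # False
```

Note that a class defining `__eq__` without also defining `__hash__` produces unhashable instances, so these objects could not be used as dict keys or set members.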

### OOP Example 2

We would like to model a bank account with support for 'deposit' and 'withdraw' operations

In [6]:
class BankAccount:
    def __init__(self):
        self.balance = 0

    def withdraw(self, amount):
        self.balance -= amount
        return self.balance

    def deposit(self, amount):
        self.balance += amount
        return self.balance

In [7]:
a = BankAccount()
b = BankAccount()

In [8]:
a.deposit(100)
a.balance

100

In [9]:
a.withdraw(10)
a.balance

90

Now we would like to model an account with a fixed minimum balance

In [10]:
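class MinimumBalanceAccount(BankAccount):
    def __init__(self, minimum_balance):
        super().__init__()  # run BankAccount's initializer (balance starts at 0)
        self.minimum_balance = minimum_balance

    def withdraw(self, amount):
        # Refuse any withdrawal that would drop the balance below the minimum
        if self.balance - amount < self.minimum_balance:
            print('Sorry, minimum balance must be maintained.')
        else:
            # Return the new balance, like BankAccount.withdraw does
            return super().withdraw(amount)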

In [11]:
myAccount = MinimumBalanceAccount(500)
myAccount.deposit(1000)
myAccount.balance

1000

In [12]:
myAccount.withdraw(800)

Sorry, minimum balance must be maintained.


## Numpy
- Numpy is a fundamental package for scientific computing with Python.

In [13]:
import numpy as np # importing numpy module

## Array 
- A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers.
- The number of dimensions is the rank of the array.
- The shape of an array is a tuple of integers giving the size of the array along each dimension.
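
A short sketch of these three ideas on a small array (the values are arbitrary and chosen only for illustration):

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

print(a.ndim)    # 2: number of dimensions (the "rank")
print(a.shape)   # (2, 3): 2 rows, 3 columns
print(a.dtype)   # one integer type shared by all elements (exact type is platform-dependent)
print(a[1, 2])   # 6: indexed by the tuple (row 1, column 2)
```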

## Creating Arrays
There are many ways to initialize an array. The examples below initialize arrays from lists and with several numpy functions.

In [14]:
# Create 2x3 array initialized to all zeroes
a = np.zeros((2,3))
print(a)
print(a.shape) # prints dimensions of array

[[0. 0. 0.]
 [0. 0. 0.]]
(2, 3)


In [15]:
# Create array initialized by list of lists
b = np.array([[0,1,2],[3,4,5]])
b

array([[0, 1, 2],
       [3, 4, 5]])

In [16]:
# Create 2x2 array filled with random values
c = np.random.random((2,2)) 
print(c)

[[0.73727209 0.69028787]
 [0.32437459 0.7728979 ]]
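
A few other creation functions are often useful as well. This sketch is an addition to the examples above, and the shapes and values are arbitrary:

```python
import numpy as np

ones = np.ones((2, 2))         # 2x2 array of ones
sevens = np.full((2, 2), 7)    # 2x2 array filled with the constant 7
identity = np.eye(3)           # 3x3 identity matrix
evens = np.arange(0, 10, 2)    # start, stop (excluded), step
grid = np.linspace(0, 1, 5)    # 5 evenly spaced points from 0 to 1 (inclusive)

print(evens)  # [0 2 4 6 8]
print(grid)   # [0.   0.25 0.5  0.75 1.  ]
```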


## Array Slicing
- Similar to Python lists, numpy arrays can be sliced.
- Since arrays may be multidimensional, you must specify a slice for each dimension of the array.

In [17]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(a)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


In [18]:
# Use slicing to select the subarray consisting of the first 2 rows
b1 = a[:2,:]
print(b1)

[[1 2 3 4]
 [5 6 7 8]]


In [19]:
# Use slicing to select the subarray consisting of columns 2 and 3
b2 = a[:, 1:3]
print(b2)

[[ 2  3]
 [ 6  7]
 [10 11]]


## Exercise: Use slicing to select the subarray consisting of the first 2 rows and the last 2 columns of 'a' 

In [20]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(a)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


In [21]:
b3 = a[:2, 2:]
print(b3)

[[3 4]
 [7 8]]


## Modifying Arrays

In [22]:
a = np.ones((2,3)) 
a[0,1] = 5                 # Change an element of the array
a

array([[1., 5., 1.],
       [1., 1., 1.]])

In [23]:
a[0,:] = np.array([10,11,12], dtype=np.float64) # Modify first row of the array
a

array([[10., 11., 12.],
       [ 1.,  1.,  1.]])

A slice of an array is a view into the same data, so modifying it will modify the original array.

In [24]:
b4 = a[1, :] # select the second row of the array (a view, not a copy)
b4[0] = 12
print(a)

[[10. 11. 12.]
 [12.  1.  1.]]
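
When the original array should stay untouched, one option is to take an explicit copy. A minimal sketch (the variable names `row_view` and `row_copy` are illustrative):

```python
import numpy as np

a = np.ones((2, 3))

row_view = a[1, :]          # a view: shares memory with a
row_copy = a[1, :].copy()   # an independent copy

row_copy[0] = 99            # does not touch a
print(a[1, 0])              # 1.0

row_view[0] = 12            # writes through to a
print(a[1, 0])              # 12.0

print(np.shares_memory(a, row_view))  # True
print(np.shares_memory(a, row_copy))  # False
```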


## Boolean Indexing
- Boolean array indexing can be used to select the elements of an array that satisfy some condition.

In [25]:
a = np.array([[1,2], [3, 4], [5, 6]])

bool_idx = (a > 2)  # Find the elements of a that are bigger than 2;
                    # this returns a numpy array of Booleans of the same
                    # dimensions as a, where each element of bool_idx  
                    # corresponds to whether that element of a is > 2.

print(bool_idx)

[[False False]
 [ True  True]
 [ True  True]]


In [26]:
# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values of bool_idx
print(a[bool_idx])

[3 4 5 6]


In [27]:
# We can do all of the above in a single concise statement:
print(a[a > 2])

[3 4 5 6]
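
Boolean masks can also be combined and used for assignment. The sketch below extends the example above; the conditions are arbitrary and chosen only for illustration:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

# Combine conditions with & (and) or | (or); the parentheses are required
mask = (a > 2) & (a % 2 == 0)
print(a[mask])   # [4 6]

# A mask on the left-hand side assigns to the selected elements
b = a.copy()
b[b > 4] = 0
print(b)
```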


## Array Math - Elementwise Operations

In [28]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

In [29]:
# Elementwise sum
print(x + y) # equivalently can run 'print (np.add(x,y))'

[[ 6  8]
 [10 12]]


In [30]:
# Elementwise difference
print(x - y) # equivalently can run 'print (np.subtract(x,y))'

[[-4 -4]
 [-4 -4]]


In [31]:
# Elementwise product
print(x * y) # equivalently can run 'print (np.multiply(x,y))'

[[ 5 12]
 [21 32]]


In [32]:
# Elementwise division
print(x / y) # equivalently can run 'print (np.divide(x,y))'

[[0.2        0.33333333]
 [0.42857143 0.5       ]]


## Array Math - Matrix Operations

In [33]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])
v = np.array([9,10])
w = np.array([11, 12])

In [34]:
# Inner product of vectors
print(np.dot(v,w)) # equivalently can run 'print(v.dot(w))'

219


In [35]:
# Matrix / vector product
print(np.dot(x, v)) # equivalently can run 'print(x.dot(v))'

[29 67]


In [36]:
# Matrix / matrix product
print(np.dot(x, y)) # equivalently can run 'print(x.dot(y))'

[[19 22]
 [43 50]]
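
For the vectors and matrices used here, the `@` operator (available since Python 3.5) gives the same results as `np.dot`. A brief sketch repeating the three products above:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
v = np.array([9, 10])
w = np.array([11, 12])

print(v @ w)   # inner product: 219
print(x @ v)   # matrix / vector product: [29 67]
print(x @ y)   # matrix / matrix product
```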


## Array Math - Other Operations

In [37]:
## Perform sum on arrays
x = np.array([[1,2],[3,4]])
print(np.sum(x))          # Compute sum of all elements

10


In [38]:
print(np.sum(x, axis=0))  # Compute sum of each column

[4 6]


In [39]:
print(np.sum(x, axis=1))  # Compute sum of each row

[3 7]


In [40]:
## Matrix transpose
print(x.T)

[[1 3]
 [2 4]]
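
The `axis` argument works the same way for other reductions. A small sketch using `np.mean` and `np.max` (these functions are additions, not part of the examples above):

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])

print(np.mean(x, axis=0))   # mean of each column: [2. 3.]
print(np.max(x, axis=1))    # max of each row: [2 4]

# keepdims=True keeps the reduced axis with length 1
row_sums = np.sum(x, axis=1, keepdims=True)
print(row_sums.shape)       # (2, 1)
```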


## Pandas 
In Python we use DataFrames from Pandas for data analysis. 
A DataFrame is very similar to data frames and tibbles in R. Note that every column in a DataFrame is a Series object.  
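
To illustrate the point about columns, here is a minimal sketch with a tiny two-row DataFrame (the frame is built only for this illustration):

```python
import pandas as pd

df = pd.DataFrame({'mass': [0.330, 4.87], 'gravity': [3.7, 8.9]},
                  index=['Mercury', 'Venus'])

col = df['gravity']            # selecting one column returns a Series
print(type(col))               # <class 'pandas.core.series.Series'>
print(col['Venus'])            # 8.9: the Series keeps the row labels
print(list(col.index))         # ['Mercury', 'Venus']
```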

In [41]:
import pandas as pd

## Creating DataFrame from Numpy Array

In [42]:
planets_array = np.array([[0.330, 4879, 3.7, 88.0],
            [4.87, 12104, 8.9, 224.7],
            [5.97, 12756, 9.8, 365.2 ],
            [0.642, 6792, 3.7, 687.0],
            [1898, 142984, 23.1, 4331],
            [568, 120536, 9.0, 10747],
            [86.8, 51118, 8.7, 30589],
            [102, 49528, 11.0, 59800]])
planets_array

array([[3.30000e-01, 4.87900e+03, 3.70000e+00, 8.80000e+01],
       [4.87000e+00, 1.21040e+04, 8.90000e+00, 2.24700e+02],
       [5.97000e+00, 1.27560e+04, 9.80000e+00, 3.65200e+02],
       [6.42000e-01, 6.79200e+03, 3.70000e+00, 6.87000e+02],
       [1.89800e+03, 1.42984e+05, 2.31000e+01, 4.33100e+03],
       [5.68000e+02, 1.20536e+05, 9.00000e+00, 1.07470e+04],
       [8.68000e+01, 5.11180e+04, 8.70000e+00, 3.05890e+04],
       [1.02000e+02, 4.95280e+04, 1.10000e+01, 5.98000e+04]])

In [43]:
# Data may be shared between the array and the DataFrame (this depends on the pandas version)
planets = pd.DataFrame(planets_array, 
                  columns=['mass', 'diameter', 'gravity', 'period'],
                  index=['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn','Uranus','Neptune'])
planets

| | mass | diameter | gravity | period |
|---|---|---|---|---|
| Mercury | 0.33 | 4879.0 | 3.7 | 88.0 |
| Venus | 4.87 | 12104.0 | 8.9 | 224.7 |
| Earth | 5.97 | 12756.0 | 9.8 | 365.2 |
| Mars | 0.642 | 6792.0 | 3.7 | 687.0 |
| Jupiter | 1898.0 | 142984.0 | 23.1 | 4331.0 |
| Saturn | 568.0 | 120536.0 | 9.0 | 10747.0 |
| Uranus | 86.8 | 51118.0 | 8.7 | 30589.0 |
| Neptune | 102.0 | 49528.0 | 11.0 | 59800.0 |


## Data Exploration

In [44]:
planets.shape # dimensions of DataFrame

(8, 4)

In [45]:
planets.dtypes # data types of each column 

mass        float64
diameter    float64
gravity     float64
period      float64
dtype: object

In [46]:
planets.head(3) # View the first 3 entries

| | mass | diameter | gravity | period |
|---|---|---|---|---|
| Mercury | 0.33 | 4879.0 | 3.7 | 88.0 |
| Venus | 4.87 | 12104.0 | 8.9 | 224.7 |
| Earth | 5.97 | 12756.0 | 9.8 | 365.2 |


In [47]:
planets.describe() # extract summary statistics of each column 

| | mass | diameter | gravity | period |
|---|---|---|---|---|
| count | 8.0 | 8.0 | 8.0 | 8.0 |
| mean | 333.3265 | 50087.125 | 9.7375 | 13353.9875 |
| std | 660.538057 | 53916.366175 | 6.040089 | 21447.657907 |
| min | 0.33 | 4879.0 | 3.7 | 88.0 |
| 25% | 3.813 | 10776.0 | 7.45 | 330.075 |
| 50% | 46.385 | 31142.0 | 8.95 | 2509.0 |
| 75% | 218.5 | 68472.5 | 10.1 | 15707.5 |
| max | 1898.0 | 142984.0 | 23.1 | 59800.0 |


## Accessing Data in DataFrame

In [48]:
# Extract value by integer position (row 5 is 'Saturn', column 2 is 'gravity')
planets.iloc[5,2]

9.0

In [49]:
# Extract the same value by label (row 'Saturn', column 'gravity')
planets.loc['Saturn','gravity']

9.0

In [50]:
# Get the first 3 columns (positions 0 through 2; the end of the slice is excluded)
planets.iloc[:,0:3]

| | mass | diameter | gravity |
|---|---|---|---|
| Mercury | 0.33 | 4879.0 | 3.7 |
| Venus | 4.87 | 12104.0 | 8.9 |
| Earth | 5.97 | 12756.0 | 9.8 |
| Mars | 0.642 | 6792.0 | 3.7 |
| Jupiter | 1898.0 | 142984.0 | 23.1 |
| Saturn | 568.0 | 120536.0 | 9.0 |
| Uranus | 86.8 | 51118.0 | 8.7 |
| Neptune | 102.0 | 49528.0 | 11.0 |
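
One difference between the two accessors is easy to miss: `iloc` slices exclude the end position, while `loc` slices include the end label. A minimal sketch on a three-row subset of the planets data (the subset is rebuilt here so the block is self-contained):

```python
import pandas as pd

planets = pd.DataFrame(
    {'mass': [0.330, 4.87, 5.97],
     'diameter': [4879.0, 12104.0, 12756.0],
     'gravity': [3.7, 8.9, 9.8]},
    index=['Mercury', 'Venus', 'Earth'])

# Position-based: rows 0-1 and columns 0-1 (end positions excluded)
by_position = planets.iloc[0:2, 0:2]

# Label-based: both end labels are included
by_label = planets.loc['Mercury':'Venus', 'mass':'diameter']

print(by_position.equals(by_label))  # True
```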


## Filtering and Sorting Data

Example: Sort planets by gravity in decreasing order

In [51]:
planets.sort_values(by = "gravity", ascending = False)

| | mass | diameter | gravity | period |
|---|---|---|---|---|
| Jupiter | 1898.0 | 142984.0 | 23.1 | 4331.0 |
| Neptune | 102.0 | 49528.0 | 11.0 | 59800.0 |
| Earth | 5.97 | 12756.0 | 9.8 | 365.2 |
| Saturn | 568.0 | 120536.0 | 9.0 | 10747.0 |
| Venus | 4.87 | 12104.0 | 8.9 | 224.7 |
| Uranus | 86.8 | 51118.0 | 8.7 | 30589.0 |
| Mercury | 0.33 | 4879.0 | 3.7 | 88.0 |
| Mars | 0.642 | 6792.0 | 3.7 | 687.0 |


Example: How many planets have a longer period than Earth?

In [52]:
planets[planets.period > planets.period['Earth']]

| | mass | diameter | gravity | period |
|---|---|---|---|---|
| Mars | 0.642 | 6792.0 | 3.7 | 687.0 |
| Jupiter | 1898.0 | 142984.0 | 23.1 | 4331.0 |
| Saturn | 568.0 | 120536.0 | 9.0 | 10747.0 |
| Uranus | 86.8 | 51118.0 | 8.7 | 30589.0 |
| Neptune | 102.0 | 49528.0 | 11.0 | 59800.0 |
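
To get the count directly instead of reading it off the filtered rows, the Boolean mask can be summed, since each `True` counts as 1. A minimal sketch using only the `period` column (rebuilt here as a Series so the block is self-contained):

```python
import pandas as pd

period = pd.Series(
    [88.0, 224.7, 365.2, 687.0, 4331.0, 10747.0, 30589.0, 59800.0],
    index=['Mercury', 'Venus', 'Earth', 'Mars',
           'Jupiter', 'Saturn', 'Uranus', 'Neptune'])

longer = period > period['Earth']   # Boolean Series, one entry per planet
print(longer.sum())                 # 5
print(list(period[longer].index))   # names of the matching planets
```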


## Modifying Data in DataFrame

In [53]:
m = np.random.rand(2,3)
df = pd.DataFrame(m, columns=['a', 'b', 'c'], index=['A','B'])
df

| | a | b | c |
|---|---|---|---|
| A | 0.099937 | 0.487708 | 0.534249 |
| B | 0.040199 | 0.358316 | 0.283793 |


In [54]:
# Modify element to particular value
df.iloc[1,1] = 100.0
df

| | a | b | c |
|---|---|---|---|
| A | 0.099937 | 0.487708 | 0.534249 |
| B | 0.040199 | 100.0 | 0.283793 |


In [55]:
# Set column to particular value
df.iloc[:,2] = 200.0
df

| | a | b | c |
|---|---|---|---|
| A | 0.099937 | 0.487708 | 200.0 |
| B | 0.040199 | 100.0 | 200.0 |


Here the DataFrame shares its data with the Numpy array it was built from, so modifying the array also modifies the DataFrame.
This sharing depends on the pandas version and its copy settings, so it should not be relied on.

In [56]:
m[0,0] = 10.0
df

| | a | b | c |
|---|---|---|---|
| A | 10.0 | 0.487708 | 200.0 |
| B | 0.040199 | 100.0 | 200.0 |
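
If the DataFrame should stay independent of the source array, one option is to build it from an explicit copy. A minimal sketch (the zero-filled array is used only so the result is easy to check):

```python
import numpy as np
import pandas as pd

m = np.zeros((2, 3))

# Passing m.copy() gives the DataFrame its own data
df = pd.DataFrame(m.copy(), columns=['a', 'b', 'c'], index=['A', 'B'])

m[0, 0] = 10.0             # change the original array
print(df.loc['A', 'a'])    # 0.0: the DataFrame is unaffected
```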


## Grouping
Equivalent to 'group_by' with 'dplyr' package in R

In [57]:
# na_filter=False keeps the continent code 'NA' (North America) from being read as a missing value
drinks = pd.read_csv('drinks.csv', na_filter=False)
drinks.head(5)

| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | AS |
| 1 | Albania | 89 | 132 | 54 | 4.9 | EU |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | AF |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | EU |
| 4 | Angola | 217 | 57 | 45 | 5.9 | AF |


In [58]:
drinks.shape

(193, 6)

Example: Which continent drinks the most beer on average?

In [59]:
drinks.groupby('continent').beer_servings.mean()

continent
AF     61.471698
AS     37.045455
EU    193.777778
NA    145.434783
OC     89.687500
SA    175.083333
Name: beer_servings, dtype: float64
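
The grouped result can be sorted, or several columns can be summarized at once with `agg`. The sketch below uses a small made-up table rather than the real `drinks.csv`, so the numbers are purely hypothetical:

```python
import pandas as pd

# Hypothetical toy data standing in for drinks.csv
toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AF', 'AF', 'AS'],
    'beer_servings': [200, 100, 50, 30, 10],
    'wine_servings': [100, 50, 10, 20, 0],
})

# Mean beer servings per continent, largest first
means = toy.groupby('continent')['beer_servings'].mean().sort_values(ascending=False)
print(means.idxmax())   # EU

# Several summaries at once using named aggregation
summary = toy.groupby('continent').agg(
    beer_mean=('beer_servings', 'mean'),
    wine_max=('wine_servings', 'max'),
)
print(summary)
```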

## Adding Columns in DataFrame - 'assign'
Equivalent to 'mutate' with 'dplyr' package in R

In [60]:
df # Remind ourselves what df looks like

| | a | b | c |
|---|---|---|---|
| A | 10.0 | 0.487708 | 200.0 |
| B | 0.040199 | 100.0 | 200.0 |


In [61]:
df.assign(d=df['c']*df['b']) # Add new column 'd'

| | a | b | c | d |
|---|---|---|---|---|
| A | 10.0 | 0.487708 | 200.0 | 97.541625 |
| B | 0.040199 | 100.0 | 200.0 | 20000.0 |
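
Note that `assign` returns a new DataFrame and leaves the original unchanged. A minimal sketch contrasting it with in-place column assignment (the small frame here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'b': [0.5, 100.0], 'c': [200.0, 200.0]}, index=['A', 'B'])

with_d = df.assign(d=df['c'] * df['b'])   # new DataFrame with column 'd'
print(list(df.columns))       # ['b', 'c']: original is unchanged
print(list(with_d.columns))   # ['b', 'c', 'd']

df['d'] = df['c'] * df['b']   # in-place alternative: modifies df itself
print(list(df.columns))       # ['b', 'c', 'd']
```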


## Adding Rows in DataFrame - 'concat'

In [62]:
df1 = pd.DataFrame([[1,2], [3,4]], columns = list('AB'))
df2 = pd.DataFrame([[5,6], [7,8]], columns = list('AB'))
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df3 = pd.concat([df1, df2])
df3

| | A | B |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 3 | 4 |
| 0 | 5 | 6 |
| 1 | 7 | 8 |
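
Stacking two frames keeps each frame's original row labels, so the index in the result repeats (0, 1, 0, 1). If a fresh index is wanted, `pd.concat` accepts `ignore_index=True`. A minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# ignore_index=True discards the old row labels and renumbers from 0
df3 = pd.concat([df1, df2], ignore_index=True)
print(list(df3.index))   # [0, 1, 2, 3]
```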


## Creating DataFrame from Dictionaries

In [63]:
moons = pd.DataFrame({'diameter':[4821, 5262, 3122, 3643],
                  'mass':[107.6, 148.2, 48.0, 89.3]},
                   index=['Callisto','Ganymede','Europa','Io'])
moons

| | diameter | mass |
|---|---|---|
| Callisto | 4821 | 107.6 |
| Ganymede | 5262 | 148.2 |
| Europa | 3122 | 48.0 |
| Io | 3643 | 89.3 |


## Summary
Some topics to review/learn:

1. OOP: Inheritance, Encapsulation, Polymorphism
2. Numpy: Array Math, Slicing & Indexing Arrays, Array Manipulation
3. Pandas: Data Exploration, Filtering & Sorting, Grouping, Apply, Merge

For more tutorials and exercises on Numpy and Pandas, check out:
https://www.tutorialspoint.com/numpy/ and https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html

Also, check out this blog post by alumnus George You: http://georgeyou.net/2019/07/09/python-map-for-data-scientists/