# **PLSC - Political Science Poli.Analysis & Data w/ Python**

## **Summer 2023**
### **Module 3**

## Data Analysis Basics

**Agenda for this week:**

We will learn essential libraries for data analysis:
  - `numpy`
  - `scipy`
  - `pandas`
  - `statsmodels`

# **NumPy**


- `NumPy` stands for *Numerical Python*
- a linear algebra library for Python
- used for numerical and scientific computing
- supports multi-dimensional arrays and matrices
- provides a collection of high-level mathematical functions
- many data analysis libraries, including `scipy` and `pandas`, are built on top of it



We import NumPy with this statement:

`import numpy as np`

You could also write `import numpy`, but that is not the convention. Almost everyone uses the alias `np`.

With a plain `import numpy`, we would have to type `numpy` every time we use a function from the library. Importing it under the short alias `np` makes our life a bit easier.


In [1]:
#import numpy library

import numpy as np

NumPy has many built-in functions and capabilities. (Remember last week's built-in functions.)

We won't cover the whole library; that is neither possible nor sensible.

You can read a package's documentation for a specific function when needed. Here is the official website for [NumPy](https://numpy.org/doc/stable/index.html).


Instead, we will focus on some of the most important aspects of NumPy:
- vectors, arrays, matrices, and number generation.

## Numpy Arrays

NumPy arrays are the main way we will use NumPy.

NumPy provides a data structure called `ndarray` (n-dimensional array).

NumPy arrays essentially come in two flavors: vectors and matrices.

Vectors are strictly 1-d arrays and matrices are 2-d (but note that a matrix can still have only one row or one column).

Arrays can store various types of data, such as integers, floats, or complex numbers.

NumPy offers powerful tools for indexing, slicing, and selecting elements from arrays, making it easy to work with specific subsets of data.

Let's begin our introduction by exploring how to create NumPy arrays.

In [2]:
# Creating NumPy Arrays
# Remember Python lists --> [ ]

my_list = [1,2,3]
my_list

[1, 2, 3]

In [3]:
# we can turn the list into a NumPy array
# essentially we create a one-dimensional data structure, or vector
# think of the array as a variable that stores the ages of students

my_array=np.array(my_list)
my_array

array([1, 2, 3])

Notice `np.array`.

Here `np` stands for NumPy. We are using a function from the `numpy` library called `array()`.

Each time we use a function from a library in Python, we first write the name or alias of the library, such as `np`, and then call the function, such as `.array()`.
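
One reason to convert a list into an array is that arithmetic then works element by element. The sketch below reuses the values from the cells above; the names `doubled_list` and `doubled_array` are only for illustration.

```python
import numpy as np

my_list = [1, 2, 3]
my_array = np.array(my_list)

# Multiplying a Python list repeats the list
doubled_list = my_list * 2
print(doubled_list)     # [1, 2, 3, 1, 2, 3]

# Multiplying a NumPy array multiplies each element
doubled_array = my_array * 2
print(doubled_array)    # [2 4 6]
```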

In [4]:
#how to read documentation for the function?

#first way: using "?" at the end (this works in Jupyter/IPython, not in plain Python)

np.array?


In [5]:
# or using help(); pass the function itself, without calling it

help(np.array)

Help on built-in function array in module numpy:

array(...)
    array(object, dtype=None, *, copy=True, order='K', subok=False, ndmin=0,
          like=None)
    
    Create an array.
    
    Parameters
    ----------
    object : array_like
        An array, any object exposing the array interface, an object whose
        __array__ method returns an array, or any (nested) sequence.
        If object is a scalar, a 0-dimensional array containing object is
        returned.
    dtype : data-type, optional
        The desired data-type for the array.  If not given, then the type will
        be determined as the minimum type required to hold the objects in the
        sequence.
    copy : bool, optional
        If true (default), then the object is copied.  Otherwise, a copy will
        only be made if __array__ returns a copy, if obj is a nested sequence,
        or if a copy is needed to satisfy any of the other requirements
        (`dtype`, `order`, etc.).
    order : {'K', 'A', '

In [6]:
# matrix

# remember matrices from math:
# a matrix is a two-dimensional array
# a two-dimensional array has rows and columns

# here we have lists nested within a list
my_nested_list = [[1,2,3],[4,5,6],[7,8,9]]
my_nested_list


[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

In [7]:
# we can create a matrix using np.array
# think of this as three rows and three columns
# notice there are no column or row names
# we would need to know column names/variables for data analysis
# pandas is built on top of numpy to provide that

my_matrix =np.array(my_nested_list)
my_matrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

## Some functions in `numpy` to create arrays

  - `.arange()`
  - `.zeros()`
  - `.ones()`
  - `.linspace()`
  - `.eye()`
  - `.random` (a submodule, not a function)
    - `.random.rand()`
    - `.random.randn()`
    - `.random.randint()`

In [8]:
# arange
# similar to range()
# Returns evenly spaced values within a given interval; the stop value is excluded.

np.arange(0,10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [9]:
np.arange(0,11,2)

array([ 0,  2,  4,  6,  8, 10])

In [10]:
np.arange?

In [11]:
# zeros and ones

# Generate arrays of zeros or ones

np.zeros(3)

array([0., 0., 0.])

In [12]:
np.zeros((5,5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [13]:
np.ones(3)

array([1., 1., 1.])

In [14]:
np.ones((3,3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [15]:
# linspace
#Return evenly spaced numbers over a specified interval.


np.linspace(0,10,3)



array([ 0.,  5., 10.])

In [16]:
np.linspace(0,10,50)

array([ 0.        ,  0.20408163,  0.40816327,  0.6122449 ,  0.81632653,
        1.02040816,  1.2244898 ,  1.42857143,  1.63265306,  1.83673469,
        2.04081633,  2.24489796,  2.44897959,  2.65306122,  2.85714286,
        3.06122449,  3.26530612,  3.46938776,  3.67346939,  3.87755102,
        4.08163265,  4.28571429,  4.48979592,  4.69387755,  4.89795918,
        5.10204082,  5.30612245,  5.51020408,  5.71428571,  5.91836735,
        6.12244898,  6.32653061,  6.53061224,  6.73469388,  6.93877551,
        7.14285714,  7.34693878,  7.55102041,  7.75510204,  7.95918367,
        8.16326531,  8.36734694,  8.57142857,  8.7755102 ,  8.97959184,
        9.18367347,  9.3877551 ,  9.59183673,  9.79591837, 10.        ])
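
`arange` and `linspace` look similar but take different third arguments: `arange` takes a step size and excludes the stop value, while `linspace` takes the number of points and includes the stop value by default. A small sketch of the difference (the variable names are only for illustration):

```python
import numpy as np

# arange: start, stop, STEP; the stop value 10 is not included
by_step = np.arange(0, 10, 2.5)
print(by_step)     # [0.  2.5 5.  7.5]

# linspace: start, stop, NUMBER OF POINTS; the stop value 10 is included
by_count = np.linspace(0, 10, 5)
print(by_count)    # [ 0.   2.5  5.   7.5 10. ]
```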

In [17]:
#eye
#Creates an identity matrix

np.eye(4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

## `random`

NumPy also has many ways to create arrays of random numbers:

### `rand`
Create an array of the given shape and populate it with
random samples from a uniform distribution
over ``[0, 1)``.

In [18]:
#rand
# uniform distribution:

np.random.rand(2)

array([0.27447838, 0.02631316])

In [19]:
np.random.rand(5,5)

array([[0.27728794, 0.88987657, 0.98135144, 0.30583135, 0.77088601],
       [0.18356157, 0.43874293, 0.37257179, 0.49939422, 0.56507977],
       [0.15224957, 0.22421743, 0.19949405, 0.21800001, 0.31017525],
       [0.49467036, 0.41090752, 0.59603672, 0.31946508, 0.77102019],
       [0.82314943, 0.72830805, 0.83197673, 0.90282859, 0.16641578]])

In [20]:
### randn

#Return a sample (or samples) from the "standard normal" distribution.

np.random.randn(2)

array([-0.66579374, -1.02906829])

In [21]:
np.random.randn(5,5)

array([[ 0.72105511, -1.06398901,  1.17893733, -0.91072249,  0.72001668],
       [-0.29678605, -1.24798476, -0.32850382, -0.52380575,  0.92291977],
       [ 1.11668004,  1.77774809,  1.65448717, -0.93724843, -0.46829488],
       [ 1.88048625,  0.30064123, -0.97640997, -0.58109289,  0.01825987],
       [ 0.09172822,  0.6910925 ,  2.02937707, -1.03908317, -0.39386021]])

In [22]:
# randint
#Return random integers from `low` (inclusive) to `high` (exclusive).

np.random.randint(1,100)

17
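
The random outputs above will differ every time the cells are run. If you need the same numbers on every run (for example, to share results with a co-author), one option is to create a seeded generator. This is a minimal sketch using NumPy's `default_rng`; the seed value 42 and the variable names are arbitrary choices, not something used elsewhere in this notebook.

```python
import numpy as np

# Two generators created with the same seed produce the same draws
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# integers(low, high, size): low is inclusive, high is exclusive
draw_a = rng_a.integers(1, 100, size=5)
draw_b = rng_b.integers(1, 100, size=5)

print(draw_a)
print(draw_b)
print(np.array_equal(draw_a, draw_b))   # True
```

The older `np.random.seed()` function plays a similar role for `np.random.rand`, `np.random.randn`, and `np.random.randint`.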

In [23]:
var1 = np.arange(25)
var1

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])

In [24]:
var2 = np.random.randint(0,50,10)
var2

array([23, 10, 12, 22,  8, 41, 35, 30, 20, 38])

In [25]:
## Reshape
# Returns an array containing the same data with a new shape.
# five rows, five columns

var1.reshape(5,5)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
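
Reshaping only works when the new shape holds exactly the same number of elements. You can also pass `-1` for one dimension and let NumPy work it out. A short sketch, recreating `var1` so the block runs on its own (`grid` is an illustrative name):

```python
import numpy as np

var1 = np.arange(25)

# -1 means "infer this dimension": 25 elements / 5 rows = 5 columns
grid = var1.reshape(5, -1)
print(grid.shape)    # (5, 5)

# 4 x 6 = 24 does not match 25 elements, so this raises a ValueError
try:
    var1.reshape(4, 6)
except ValueError as err:
    print("reshape failed:", err)
```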

In [26]:
# max, min, argmax, argmin

# useful methods for finding max or min values.
# find their index locations using argmin or argmax

var2.max()

41

In [27]:
# index position of the max value
var2.argmax()

5

In [28]:
var2.min()

8

In [29]:
var2.argmin()

4
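
`argmax` and `argmin` return positions, so using them as an index gives back the max and min values. The sketch below hard-codes the values that `var2` happened to contain in the output above, since a fresh random draw would differ.

```python
import numpy as np

# Same values as the var2 output shown above
var2 = np.array([23, 10, 12, 22, 8, 41, 35, 30, 20, 38])

largest_position = var2.argmax()
smallest_position = var2.argmin()

print(largest_position, var2[largest_position])      # 5 41
print(smallest_position, var2[smallest_position])    # 4 8
```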

## Shape

Shape is an **attribute** that arrays have (not a method):

In [30]:
var1

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])

In [31]:
# Vector: one-dimensional
# no parentheses for attributes

var1.shape

(25,)

In [32]:
# Notice the two sets of brackets
# two-dimensional

var1.reshape(1,25)

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24]])

In [33]:
# 1 row, 25 columns
# think of this as 25 features of one single person
var1.reshape(1,25).shape

(1, 25)

In [34]:
# now it is different:
# 25 rows and 1 column
# 25 people and one variable
var1.reshape(25,1)

array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12],
       [13],
       [14],
       [15],
       [16],
       [17],
       [18],
       [19],
       [20],
       [21],
       [22],
       [23],
       [24]])

In [35]:
# Create a 3D array
# notice [[[  ]]]
arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
shape_3d = np.shape(arr_3d)
print(shape_3d)

(2, 2, 2)
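
Alongside `shape`, arrays carry two related attributes: `ndim` (the number of dimensions) and `size` (the total number of elements). A quick sketch on the same 3D array:

```python
import numpy as np

arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(arr_3d.shape)   # (2, 2, 2), same result as np.shape(arr_3d)
print(arr_3d.ndim)    # 3
print(arr_3d.size)    # 8
```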


In [36]:
# dtype
# You can also grab the data type of the elements in the array.
# int64 is a 64-bit signed integer, which can hold large whole numbers.
# It means that each element in the array is a whole number (integer).

var1.dtype

dtype('int64')
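
An array holds a single data type, and NumPy picks it from the values you pass in. The sketch below shows a few cases; the variable names are illustrative, and the exact integer type (for example `int64`) can depend on your platform and NumPy version.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1.0, 2.5, 3.0])

# Mixing integers and floats: everything is stored as float
mixed = np.array([1, 2.5, 3])

# astype() returns a converted copy; ints itself is unchanged
ints_as_float = ints.astype(float)

print(ints.dtype)            # an integer type
print(floats.dtype)          # float64
print(mixed.dtype)           # float64
print(ints_as_float.dtype)   # float64
```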

## NumPy Indexing and Selection

This section shows how to select elements or groups of elements from an array.

The simplest way to pick one or more elements of an array looks very similar to indexing Python lists.

In [37]:
#Creating sample array
var3 = np.arange(0,11)

var3

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [38]:
#Get a value at an index
var3[8]

8

In [39]:
#Get values in a range
# the stop index is exclusive

var3[1:5]

array([1, 2, 3, 4])

In [40]:
# Setting every value in an index range to one number (broadcasting)
var3[0:5]=100

var3

array([100, 100, 100, 100, 100,   5,   6,   7,   8,   9,  10])
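
One thing to watch out for: a slice of a NumPy array is a view of the original data, not a copy, so changing the slice changes the original array too. If you want an independent piece, call `.copy()`. A small sketch (the names `original`, `fresh`, `view`, and `safe` are only for illustration):

```python
import numpy as np

original = np.arange(0, 11)
view = original[0:5]      # a view: shares data with original
view[:] = 100
print(original)           # the first five values are now 100

fresh = np.arange(0, 11)
safe = fresh[0:5].copy()  # an independent copy
safe[:] = 100
print(fresh)              # unchanged: 0 through 10
```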

## Indexing a 2D array (matrices)

The general format is `var4[row][col]` or `var4[row,col]`.

I usually use the comma notation.

In [41]:
var4 = np.array(([5,10,15],[20,25,30],[35,40,45]))

var4

array([[ 5, 10, 15],
       [20, 25, 30],
       [35, 40, 45]])

In [42]:
#Indexing row
var4[1]

array([20, 25, 30])

In [43]:
# Format is var4[row][col]

var4[1][0]

20

In [44]:

#var4[row,col]

var4[1,0]

20

In [45]:
var4

array([[ 5, 10, 15],
       [20, 25, 30],
       [35, 40, 45]])

In [46]:
# 2D array slicing

# Shape (2,2) from the top right corner
# var4[row_slice, column_slice]

var4[:2,1:]

array([[10, 15],
       [25, 30]])

In [47]:
# Bottom row
var4[2]

array([35, 40, 45])

## More Indexing Help
Indexing a 2D matrix can be a bit confusing at first, especially when you start to add a step size.

Try a Google image search for "NumPy indexing" to find useful images, like this one:

<img src= 'https://scipy-lectures.org/_images/numpy_indexing.png' width=500/>
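
As a small illustration of step sizes, a slice can be written as `start:stop:step` on each axis. The sketch below uses a 5x5 array built just for this example (`grid` is not a variable from the cells above):

```python
import numpy as np

grid = np.arange(25).reshape(5, 5)

# Every second row: rows 0, 2, and 4
every_other_row = grid[::2]
print(every_other_row.shape)   # (3, 5)

# Step of 4 on both axes: rows 0 and 4, columns 0 and 4 (the four corners)
corners = grid[::4, ::4]
print(corners)
# [[ 0  4]
#  [20 24]]
```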

## Selection from arrays

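A comparison on an array returns an array of `True`/`False` values, which can then be used to select the matching elements. We will see more on this when we cover pandas.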



In [48]:
var1

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])

In [49]:
var1>4

array([False, False, False, False, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True])

In [50]:
condition=var1>4

In [51]:
var1[condition]

array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
       22, 23, 24])

In [52]:
var1[var1>4]

array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
       22, 23, 24])
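
Conditions can also be combined. With NumPy arrays, use `&` for "and" and `|` for "or" (not the Python keywords `and`/`or`), and wrap each condition in parentheses. A sketch, recreating `var1` so the block runs on its own:

```python
import numpy as np

var1 = np.arange(25)

# Values greater than 4 AND less than 10
middle = var1[(var1 > 4) & (var1 < 10)]
print(middle)    # [5 6 7 8 9]

# Values less than 2 OR greater than 22
edges = var1[(var1 < 2) | (var1 > 22)]
print(edges)     # [ 0  1 23 24]
```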

## NumPy Operations

### Arithmetic

You can easily perform array-with-array arithmetic or scalar-with-array arithmetic. Let's see some examples:



In [53]:
arr = np.arange(0,10)

arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [54]:
arr + arr

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [55]:
arr * arr

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [56]:
arr - arr

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [57]:
arr**3

array([  0,   1,   8,  27,  64, 125, 216, 343, 512, 729])
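
The cells above mostly combine an array with itself. Scalar-with-array arithmetic works the same way: the single number is applied to every element. A short sketch (the result names are illustrative):

```python
import numpy as np

arr = np.arange(0, 10)

shifted = arr + 100   # add 100 to every element
doubled = arr * 2     # multiply every element by 2
halved = arr / 2      # division returns floats

print(shifted)   # [100 101 102 103 104 105 106 107 108 109]
print(doubled)   # [ 0  2  4  6  8 10 12 14 16 18]
print(halved)    # [0.  0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5]
```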

## Mathematical operations

NumPy comes with many [universal array functions](http://docs.scipy.org/doc/numpy/reference/ufuncs.html), which apply a mathematical operation to every element of an array.

In [58]:
#Taking Square Roots
np.sqrt(arr)

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

In [59]:
#Calculating exponential (e^)
np.exp(arr)

array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])

In [60]:
np.sin(arr)

array([ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ,
       -0.95892427, -0.2794155 ,  0.6569866 ,  0.98935825,  0.41211849])

In [61]:
# Natural log; log(0) returns -inf and triggers a RuntimeWarning (divide by zero)
np.log(arr)


array([      -inf, 0.        , 0.69314718, 1.09861229, 1.38629436,
       1.60943791, 1.79175947, 1.94591015, 2.07944154, 2.19722458])
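
The `-inf` in the first position comes from taking the log of zero. Two possible ways to deal with it are sketched below: silencing the warning with `np.errstate` when `-inf` is acceptable, or taking the log of the positive values only. Which one is appropriate depends on your data; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(0, 10)

# Option 1: keep -inf but silence the divide-by-zero warning
with np.errstate(divide="ignore"):
    logs = np.log(arr)
print(logs[0])    # -inf

# Option 2: select the positive values first, then take the log
positive_logs = np.log(arr[arr > 0])
print(positive_logs.shape)    # (9,)
```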

# **SciPy**



SciPy is a scientific computing library for Python.

It is built on top of NumPy and provides additional functionality for scientific and technical computing.

SciPy is widely used in various scientific disciplines, including mathematics, physics, engineering, biology, and more.

For more, see the [SciPy website](https://scipy.org/).


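The `scipy.stats.norm` object provides various methods for working with the normal (Gaussian) distribution. Called without extra arguments, it describes the standard normal distribution (mean 0, standard deviation 1).

I mostly use SciPy for the probability density function (PDF) and the cumulative distribution function (CDF).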


<img src= 'https://www.researchgate.net/publication/8588176/figure/fig1/AS:281581576048646@1444145687424/The-probability-distribution-function-PDF-and-cumulative-distribution-function-CDF-of.png' width=500/>

<img src= 'https://rovdownloads.com/blog/wp-content/uploads/2014/07/fig-3.png' width=500/>




In [62]:

import scipy.stats as sps

x = 1.5  # point at which we evaluate the standard normal PDF

pdf_value = sps.norm.pdf(x)
print("PDF at x =", x, ":", pdf_value)



PDF at x = 1.5 : 0.12951759566589174


In [63]:
x = 1.5  # point at which we evaluate the standard normal CDF

cdf_value = sps.norm.cdf(x)

print("CDF at x =", x, ":", cdf_value)

CDF at x = 1.5 : 0.9331927987311419
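
To see what these two numbers mean, they can be reproduced from the closed-form expressions for the standard normal distribution, using only Python's built-in `math` module. This is just a cross-check sketch, not how SciPy computes them internally; `pdf_manual` and `cdf_manual` are illustrative names.

```python
import math

x = 1.5

# Standard normal PDF: exp(-x^2 / 2) / sqrt(2 * pi)
pdf_manual = math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

# Standard normal CDF written with the error function:
# 0.5 * (1 + erf(x / sqrt(2)))
cdf_manual = 0.5 * (1 + math.erf(x / math.sqrt(2)))

print("PDF at x =", x, ":", pdf_manual)   # about 0.1295
print("CDF at x =", x, ":", cdf_manual)   # about 0.9332
```

The CDF value says that about 93% of the probability mass of a standard normal variable lies at or below 1.5.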
