# Week 1 Lesson 1 Python Review

Short review of core Python concepts, illustrated with objects from the NumPy library.

- Recall basic vocabulary
- Practice markdown syntax

## Libraries and packages

**Library:** collection of code that we can use to perform a specific task in our programs. It can be one or multiple files.

**NumPy:**

- Core library for numerical computing
- Many libraries use NumPy arrays as building blocks
- Computations on NumPy objects are optimized for speed and memory usage

Let's import NumPy with its **standard abbreviation** `np`:

In [None]:
import numpy as np  
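As a rough illustration of why arrays are convenient building blocks, the sketch below doubles every element of a small collection, first with a plain list and then with a NumPy array. The names `values`, `doubled_list`, and `doubled_arr` are made up for this example.

```python
import numpy as np

values = [1, 2, 3, 4]
arr = np.array(values)

# With a plain list we loop over the elements (here, a list comprehension)
doubled_list = [2 * v for v in values]

# With a NumPy array the operation is applied to every element at once
doubled_arr = 2 * arr

print(doubled_list)  # [2, 4, 6, 8]
print(doubled_arr)   # [2 4 6 8]
```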

## Variables

**Variable:** a name we assign to a particular object in Python.

Example:

In [2]:
# Assign a small array to a variable a
a = np.array([[1, 1, 2], [3, 5, 8]])

To view a variable's value in a Jupyter notebook:

In [3]:
# Run cell with variable name to show value
a

array([[1, 1, 2],
       [3, 5, 8]])

In [4]:
# Use `print` function to print value
print(a)

[[1 1 2]
 [3 5 8]]


## Convention: Use `snake_case` for naming variables

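This is the convention we will use in the course. Why `my_variable` and not `my-variable`, `MyVariable`, or `myVariable`? PEP 8, the style guide for Python code, recommends snake case for variable names. (`my-variable` is not a valid name at all, since Python reads the hyphen as a minus sign.)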

**Variable names should be both descriptive and concise**
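A small sketch of the naming convention. All names and values here are invented for illustration only.

```python
# Harder to read: not descriptive, or not snake_case
t = 31.5
MaxTemp = 31.5

# Preferred: descriptive, concise, and in snake_case
max_temp_c = 31.5
station_name = 'station_a'

print(station_name, max_temp_c)  # station_a 31.5
```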

## Objects

**Object:** (informal) a bundle of *properties* and *actions* about something specific 

Example:

- Object: data frame
- Properties: number of rows, names of columns, date created
- Actions: selecting a specific row, adding a new column

A variable is the name we give a specific object, and the same object can be referenced by different variables.

In practice, we can use variable and object interchangeably 
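The point that one object can have several names can be checked directly. In this minimal sketch, `b` and `c` are names introduced only for the example; `is` tests whether two names refer to the same object.

```python
import numpy as np

a = np.array([[1, 1, 2], [3, 5, 8]])

b = a             # b is a second name for the same array object
print(b is a)     # True

b[0, 0] = 100     # changing the object through b...
print(a[0, 0])    # 100  (...is visible through a as well)

c = a.copy()      # copy() creates a new, independent object
print(c is a)     # False
```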

## Types

Every object in Python has a **type**, which tells us what kind of object we have. The type is also known as the object's **class**: type and class refer to the same thing.


In [22]:
print(a)

[[1 1 2]
 [3 5 8]]


In [24]:
# See the type/class of a variable/object by using the type function
type(a)

numpy.ndarray

The `numpy.ndarray` is the core object/data type of the NumPy package

In [26]:
print(a[0,0]) # first row, first column of a

type(a[0,0])

1


numpy.int64

`numpy.int64` is not the standard Python integer type `int`.

`numpy.int64` is a special data type in NumPy telling us that 1 is an integer stored as a 64-bit number. Storing every element with the same fixed size is part of how NumPy keeps memory usage low.
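A short sketch of the difference between `numpy.int64` and `int`. The array is created with an explicit `dtype=np.int64` so the result does not depend on the platform default; the names `x` and `y` are only for this example.

```python
import numpy as np

# Request 64-bit integers explicitly so the example behaves the same everywhere
a = np.array([[1, 1, 2], [3, 5, 8]], dtype=np.int64)

x = a[0, 0]
print(type(x))             # <class 'numpy.int64'>
print(isinstance(x, int))  # False: it is not the built-in int

y = int(x)                 # convert to a standard Python integer
print(type(y))             # <class 'int'>

# Each element occupies 8 bytes (64 bits)
print(a.dtype, a.itemsize)  # int64 8
```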

Check-in: access the value 5 in array `a`.

In [30]:
# Python indexing starts at 0, so 5 is at row index 1, column index 1
print(a[1,1])

5
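A few more indexing examples on the same array, as a sketch of zero-based indexing. Negative indices count from the end, and `:` selects a whole row or column.

```python
import numpy as np

a = np.array([[1, 1, 2], [3, 5, 8]])

print(a[0, 0])    # 1  (first row, first column)
print(a[1, 1])    # 5  (second row, second column)
print(a[-1, -1])  # 8  (last row, last column)
print(a[1, :])    # [3 5 8]  (entire second row)
print(a[:, 0])    # [1 3]    (entire first column)
```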


## Functions

`print` was our first **function**

Functions take in a set of **arguments**, separated by commas, and use those arguments to create an **output**

We'll use argument and parameter interchangeably, but they have slightly different meanings. 

We can ask for info about what a function does by executing `?` followed by the function name

In [32]:
?print

Docstring:
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file:  a file-like object (stream); defaults to the current sys.stdout.
sep:   string inserted between values, default a space.
end:   string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
Type:      builtin_function_or_method


What we obtain is called a **docstring**, a special type of comment used to document how a function (or class or module) works. 

Notice that there are two types of arguments:

- **non-optional arguments:** you need to specify for the function to work
- **optional arguments:** pre-filled with default value but you can override them

Example:

`end` is a parameter of `print` whose default value is a newline. We can pass the value `' :)'` to this parameter instead:

In [35]:
print('Change the parameter', end=' :)')

Change the parameter :)
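The same ideas (docstring, non-optional and optional arguments) apply to functions we write ourselves. The function `scale` below is a hypothetical example, not part of Python or NumPy: `value` is non-optional, while `factor` is optional with a default of 2.

```python
def scale(value, factor=2):
    """Multiply value by factor (hypothetical example function)."""
    return value * factor

print(scale(5))            # 10  (factor uses its default value)
print(scale(5, factor=3))  # 15  (default overridden)
print(scale.__doc__)       # the docstring is stored on the function

# Leaving out the non-optional argument raises an error
try:
    scale()
except TypeError as err:
    print('TypeError:', err)
```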

## Attributes and methods

Objects in Python have attributes and methods:

- **attribute:** a property of the object, some piece of info about it
- **method:** a procedure associated with the object, an action where the main ingredient is the object itself
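To make the two definitions concrete, here is a minimal sketch of a made-up class `Cat`. The class, its attributes `name` and `age`, and its method `meow` are all hypothetical and exist only for this illustration.

```python
class Cat:
    """A hypothetical class used to illustrate attributes and methods."""

    def __init__(self, name, age):
        # Attributes: pieces of information about the object
        self.name = name
        self.age = age

    def meow(self):
        # Method: an action whose main ingredient is the object itself
        return f'{self.name} says meow'


whiskers = Cat('Whiskers', 3)
print(whiskers.age)     # 3  (attribute: no parentheses)
print(whiskers.meow())  # Whiskers says meow  (method: called with parentheses)
```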

## Check-in

Make a diagram like the cat example for a class called `fish`.

Example:

NumPy arrays have many methods and attributes.

In [37]:
a

array([[1, 1, 2],
       [3, 5, 8]])

In [39]:
# T is an example of an attribute, it returns the transpose of an array
print(a.T)

[[1 3]
 [1 5]
 [2 8]]


In [40]:
type(a.T)

numpy.ndarray

In [43]:
# shape is another attribute
print(a.shape)
type(a.shape) # attributes for the same object can have many different types

(2, 3)


tuple

In [45]:
# ndim: number of array dimensions
print('dim', a.ndim, '| type', type(a.ndim))

dim 2 | type <class 'int'>


Attributes can have many different types

Some examples of methods:

In [46]:
# min method returns minimum value in an array along axis
print(a)
a.min(axis=0)

[[1 1 2]
 [3 5 8]]


array([1, 1, 2])

In [47]:
# run min method without axis
a.min()

1
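The `axis` argument decides the direction in which the array is collapsed. A brief sketch on the same array: `axis=0` works down the rows (one result per column), `axis=1` works across the columns (one result per row).

```python
import numpy as np

a = np.array([[1, 1, 2], [3, 5, 8]])

print(a.min(axis=0))  # [1 1 2]  (minimum of each column)
print(a.min(axis=1))  # [1 3]    (minimum of each row)
print(a.max(axis=0))  # [3 5 8]  (maximum of each column)
print(a.sum())        # 20       (no axis: all elements)
```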

In [49]:
# method tolist turns array into a list
a.tolist()

[[1, 1, 2], [3, 5, 8]]

In [50]:
type(a.tolist)

builtin_function_or_method
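One practical way to tell a method from an attribute is that a method can be called. The sketch below uses the built-in `callable` function to check this on the same array.

```python
import numpy as np

a = np.array([[1, 1, 2], [3, 5, 8]])

print(callable(a.tolist))  # True: a method, call it with parentheses
print(callable(a.shape))   # False: an attribute, just a stored value

# Without parentheses we get the method itself; with them, its result
print(type(a.tolist()))    # <class 'list'>
```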

## Exercise

1. Read the `print` function help. What type of argument is `sep`? Does it have a default value or not? Why?
2. Create two new variables, one with the integer value 77 and the other with the string value '99'.
3. Use your variables to print 77%99%77 by changing the value of one of the optional arguments in `print`.

In [52]:
help(print)

Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.



In [None]:
b = 77
c = '99'

In [54]:
type(b)

int

In [55]:
type(c)

str

In [57]:
print(b, c, b, sep="%")

77%99%77


# Pandas Series and Data Frames

In [5]:
import pandas as pd
import numpy as np

The first core object of pandas is the series. A series is a one-dimensional array of indexed data.

Having an index is the main difference between a pandas.Series and a NumPy array. Let's see the difference:

In [6]:
# A numpy array
arr = np.random.randn(4) # random values from std normal distribution
print(type(arr))
print(arr, "\n")

# A pandas series made from the previous array
s = pd.Series(arr)
print(type(s))
print(s)

<class 'numpy.ndarray'>
[-0.23731022  0.83334894  1.11844478  0.60532948] 

<class 'pandas.core.series.Series'>
0   -0.237310
1    0.833349
2    1.118445
3    0.605329
dtype: float64
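The index becomes more useful once it carries labels. In this sketch (with made-up values and labels), the array is accessed by position while the series can be accessed by label.

```python
import numpy as np
import pandas as pd

arr = np.array([10.0, 20.0, 30.0])
s = pd.Series(arr, index=['a', 'b', 'c'])

print(arr[1])    # 20.0  (arrays are accessed by position)
print(s['b'])    # 20.0  (a series can be accessed by index label)

print(s.index)   # the index object holding the labels
print(s.values)  # the underlying values as a NumPy array
```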


## Creating a pandas.Series

The basic method to create a pandas.Series is to call


`s = pd.Series(data, index=index)`

The `data` parameter can be:

- a list or NumPy array,
- a Python dictionary, or
- a single number, boolean (True/False), or string.

In [7]:
# A series from a numpy array 
pd.Series(np.arange(3), index=[2023, 2024, 2025])

2023    0
2024    1
2025    2
dtype: int64

In [8]:
# A series from a list of strings with default index
pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

0    EDS 220
1    EDS 222
2    EDS 223
3    EDS 242
dtype: object

In [9]:
# pandas Series from a dictionary
# Construct dictionary
d = {'key_0':2, 'key_1':'3', 'key_2':5}

# Initialize series using a dictionary
pd.Series(d)

key_0    2
key_1    3
key_2    5
dtype: object
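The `object` dtype above appears because the dictionary mixes integers with the string `'3'`. As a sketch, compare it with a dictionary holding only integers; `mixed` and `ints` are names chosen for this example.

```python
import pandas as pd

mixed = pd.Series({'key_0': 2, 'key_1': '3', 'key_2': 5})
ints = pd.Series({'key_0': 2, 'key_1': 3, 'key_2': 5})

print(mixed.dtype)  # object: values of different types share one series
print(ints.dtype)   # an integer dtype when every value is an integer

# The individual values keep their original Python types
print([type(v).__name__ for v in mixed])  # ['int', 'str', 'int']
```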

In [10]:
# pandas Series from a single value
pd.Series(3.0, index = ['A', 'B', 'C'])

A    3.0
B    3.0
C    3.0
dtype: float64

## Simple operations


In [11]:
# Define a series
s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# Divide each element in series by 10
print(s / 10, '\n')

# Take the exponential of each element in series
print(np.exp(s), '\n')

# Original series is unchanged
print(s)


Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64 

Andrea      3.637971e+42
Beth        5.052394e+31
Carolina    1.694889e+28
dtype: float64 

Andrea      98
Beth        73
Carolina    65
dtype: int64


We can also produce new pandas.Series with True/False values indicating whether the elements in a series satisfy a condition or not:

In [12]:
s > 70

Andrea       True
Beth         True
Carolina    False
dtype: bool
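A True/False series like this one is commonly used to select the elements that satisfy the condition. A minimal sketch with the same data, where `above_70` is a name introduced for the example:

```python
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

above_70 = s > 70       # boolean series
print(s[above_70])      # keeps only the entries where the condition is True
print(above_70.sum())   # 2  (True counts as 1, so this counts matches)
```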

## Identify missing values

In [13]:
# Series with NAs in it
s = pd.Series([1, 2, np.nan, 4, np.nan])
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

In [14]:
# Check if series has NAs
s.hasnans

True

In [15]:
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool
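Once missing values are identified, a few common follow-ups are counting, dropping, or filling them. This is a sketch of three pandas methods, not a recommendation on how missing data should be handled in any particular analysis.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

print(s.isna().sum())        # 2  (number of missing values)
print(s.dropna().tolist())   # [1.0, 2.0, 4.0]  (missing values removed)
print(s.fillna(0).tolist())  # [1.0, 2.0, 0.0, 4.0, 0.0]  (missing values replaced by 0)
```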

# Data Frames

## Creating a pandas.DataFrame
There are many ways of creating a pandas.DataFrame. We present a simple one in this section.

Each column of a pandas.DataFrame is a pandas.Series. In fact, a pandas.DataFrame can be thought of as a dictionary of pandas.Series, with each column name being the key and the column values being the key's value. Thus, we can create a pandas.DataFrame in this way:

In [16]:
# Initialize dictionary with columns' data 
d = {'col_name_1' : pd.Series(np.arange(3)),
     'col_name_2' : pd.Series([3.1, 3.2, 3.3]),
     }

# Create data frame
df = pd.DataFrame(d)
df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3
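The dictionary analogy can be checked by pulling a column back out with its name as the key. A short sketch using the same data; `col` is a name introduced for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name_1': pd.Series(np.arange(3)),
                   'col_name_2': pd.Series([3.1, 3.2, 3.3])})

# Access a column with its name, like a dictionary key
col = df['col_name_2']
print(isinstance(col, pd.Series))  # True: each column is a series
print(col.tolist())                # [3.1, 3.2, 3.3]

print(list(df.columns))            # ['col_name_1', 'col_name_2']
print(df.shape)                    # (3, 2)  (rows, columns)
```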


In [17]:
# Change index
df.index = ['a','b','c']
df

   col_name_1  col_name_2
a           0         3.1
b           1         3.2
c           2         3.3


## Check in exercise

The integer number -999 is often used to represent missing values. Create a pandas.Series named s with four integer values, two of which are -999. The index of this series should be the letters A through D.

In the pandas.Series documentation, look for the method mask(). Use this method to update the series s so that the -999 values are replaced by NA values. HINT: check the first example in the method’s documentation.

In [18]:
# s = pd.Series([24, -999, 3, -999])
# s = pd.DataFrame(s)
# s.index = ['a', 'b', 'c', 'd']

In [19]:
# Note: this builds a one-column DataFrame rather than the Series the exercise asks for
s = {'value' : pd.Series([24, -999, 3, -999])}
s = pd.DataFrame(s)
s.index = ['A', 'B', 'C', 'D']
s

   value
A     24
B   -999
C      3
D   -999


In [62]:
t = pd.Series([24, -999, 5, -999], index=['A', 'B', 'C', 'D'])
t

A     24
B   -999
C      5
D   -999
dtype: int64

In [64]:
t = t.mask(t == -999)
t

A    24.0
B     NaN
C     5.0
D     NaN
dtype: float64
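For comparison, here is a sketch of two other pandas methods that give the same result on this data: `where` keeps values where the condition is True (the opposite convention to `mask`), and `replace` swaps one value for another. The names `masked`, `kept`, and `replaced` are introduced for the example.

```python
import numpy as np
import pandas as pd

t = pd.Series([24, -999, 5, -999], index=['A', 'B', 'C', 'D'])

# mask: replace values where the condition is True (NaN by default)
masked = t.mask(t == -999)

# where: keep values where the condition is True, replace the rest
kept = t.where(t != -999)

# replace: swap a specific value for another one
replaced = t.replace(-999, np.nan)

print(masked.equals(kept))    # True
print(replaced.isna().sum())  # 2
```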

In [20]:
# Without a replacement value, mask() fills the matching entries with NaN
s.mask(s == -999)

   value
A   24.0
B    NaN
C    3.0
D    NaN


In [65]:
df = df.rename(columns={"col_name_1": "C1", "col_name_2": "C2"})
df

   C1   C2
a   0  3.1
b   1  3.2
c   2  3.3


In [66]:
# Initialize dictionary with columns' data 
d = {'col_name_1' : pd.Series(np.arange(3)),
     'col_name_2' : pd.Series([3.1, 3.2, 3.3]),
     }

# Create data frame
df = pd.DataFrame(d)
df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3


In [69]:
df.columns = ['C1', 'C2'] # assigning to df.columns updates df in place, no need to reassign
df

   C1   C2
0   0  3.1
1   1  3.2
2   2  3.3
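The two renaming approaches differ in whether the original data frame changes. A minimal sketch, assuming the same column names as above: `rename` returns a new data frame and leaves the original untouched, while assigning to `df.columns` modifies `df` itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name_1': np.arange(3),
                   'col_name_2': [3.1, 3.2, 3.3]})

# rename returns a new data frame; df keeps its old column names
renamed = df.rename(columns={'col_name_1': 'C1', 'col_name_2': 'C2'})
print(list(df.columns))       # ['col_name_1', 'col_name_2']
print(list(renamed.columns))  # ['C1', 'C2']

# Assigning to df.columns changes df directly
df.columns = ['C1', 'C2']
print(list(df.columns))       # ['C1', 'C2']
```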
