# LAB 1 - INTRO TO PYTHON

This lab has three parts:

- 1) Introduction to Jupyter Notebooks

- 2) Python Language and NumPy Library

- 3) Data Manipulation with Pandas Library


## 1) INTRODUCTION TO JUPYTER NOTEBOOKS

Open the file "Lab1.ipynb":

`"File -> Open..."` 

Navigate to the folder where you saved the files you downloaded for this class and double-click the "Lab1.ipynb" file.

### Running notebook cells

The notebook is divided into cells. Each cell can contain text, code, or HTML. Running a non-code cell simply advances to the next cell. Always type commands in a Code cell; you can verify the cell type in the drop-down menu in the toolbar above.

To run a code cell use `shift + enter` or `ctrl + enter`. 

Try the following commands:

1) `shift + enter`: run cell, select below

2) `ctrl + enter`: run cell

3) `option + enter` (`alt + enter` on Windows): run cell, insert below

In [2]:
8*6

48

In [2]:
2**16

65536

If a command is incomplete, Jupyter displays a `SyntaxError`:

In [3]:
2^

SyntaxError: invalid syntax (<ipython-input-3-0deb74d06d3d>, line 1)

**EXERCISE:**

In [3]:
# Compute 284455 divided by 3.67778
284455/3.67778

77344.21308506763

Note: In Python, any text following a hash sign (`#`) in a code cell is a comment.

### Interrupting the kernel

When debugging, we often want to interrupt the currently running process. This can be done by pressing the stop button. 

While a process is running, the circle in the upper-right corner is filled. When the kernel is idle, the circle is empty.

Interrupting sometimes does not work. In that case, you can reset the state by restarting the kernel, either by clicking Kernel/Restart or the Refresh button in the toolbar above.

### Undoing

To undo changes in each cell, hit `Command-z` for Mac and `Ctrl-z` for Windows.
To undo `Delete Cell`, select `Edit->Undo Delete Cell`.

### Saving the notebook

To save your notebook, either select `"File->Save and Checkpoint"` or hit `Command-s` on Mac or `Ctrl-s` on Windows.

### Other Notebook tips
- To add a new cell, either select `"Insert->Insert New Cell Below"` or click the white plus button
- You can change the cell mode from code to text in the pulldown menu by selecting `Markdown`.
- `Help->Keyboard Shortcuts` has a list of keyboard shortcuts

## 2) PYTHON LANGUAGE AND NUMPY LIBRARY

##  Data Types

### Floats and Integers

In [5]:
x = 4
print(x, type(x))

4 <class 'int'>


In [6]:
x = 1 / 4
print(x, type(x))

0.25 <class 'float'>
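
As a small supplementary sketch (not one of the original lab cells), comparing true division `/` with floor division `//` shows when Python produces a `float` and when it keeps an `int`. The variable names `a`, `b`, and `c` are only for illustration.

```python
# True division always returns a float, even when the result is whole
a = 8 / 4
# Floor division of two ints returns an int
b = 8 // 4
# Mixing an int with a float gives a float
c = 4 + 1.0

print(a, type(a))   # 2.0 <class 'float'>
print(b, type(b))   # 2 <class 'int'>
print(c, type(c))   # 5.0 <class 'float'>
```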


### Strings

Double quotes and single quotes are interchangeable; both delimit strings. The `+` operator concatenates strings.

In [7]:
"IEOR " + '242'

'IEOR 242'

### Lists

A list is a mutable collection of data, which means that we can change it after it is created. A list can be created using square brackets `[]`.


Important operations: 
- `+` concatenates lists. 
- `len(x)` returns the length of a list.

In [8]:
x = ["IEOR"] + [2, 4, 2]
print(x)

['IEOR', 2, 4, 2]


In [9]:
print(len(x))

4
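
To make "mutable" concrete, here is a minimal sketch that changes a list after it has been created. The list mirrors the one used in this section; the appended value `"Lab"` is an arbitrary illustration.

```python
x = ["IEOR", 2, 4, 2]

# Lists are mutable: item assignment changes the list in place
x[1] = 1
# append() adds a single element to the end
x.append("Lab")

print(x)        # ['IEOR', 1, 4, 2, 'Lab']
print(len(x))   # 5
```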


### Tuples

A tuple is an immutable collection of data. Tuples can be created using round brackets `()`. 
They are often used as inputs and outputs of functions.

In [10]:
t = ("I", "E", "O", "R") + (2, 4, 2)
print(t)

('I', 'E', 'O', 'R', 2, 4, 2)


In [11]:
# cannot do assignment to a tuple after creation - it's immutable
t[4] = 3 # will cause error

# Note: errors in notebook appear inline

TypeError: 'tuple' object does not support item assignment
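
Because a tuple cannot be modified, one common workaround is to build a new tuple from pieces of the old one. The following is a sketch under that assumption; the name `t_new` is hypothetical.

```python
t = ("I", "E", "O", "R", 2, 4, 2)

# Item assignment on a tuple raises TypeError
try:
    t[4] = 3
except TypeError as err:
    print("TypeError:", err)

# Build a new tuple from slices instead of modifying the old one
t_new = t[:4] + (3,) + t[5:]
print(t_new)   # ('I', 'E', 'O', 'R', 3, 4, 2)
print(t)       # the original tuple is unchanged
```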

## Functions and Variables

A function can take in several arguments or inputs, and returns an output value.
Python has some built-in functions:

In [12]:
abs(-65)

65

In [13]:
max([2, 4, 2])

4

In [14]:
# Get help on any function:
max?

Basic variable naming rules: 
- Don't use spaces (use underscores or capital letters instead)
- Don't start names with a number
- Variable names are case sensitive - capital and lowercase letters are different
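
The naming rules above can be illustrated with a short sketch. All variable names here are made up for the example.

```python
# Two common ways to avoid spaces in a name
seconds_per_minute = 60   # underscores ("snake_case")
MinutesPerHour = 60       # capital letters ("CamelCase")

# Names are case sensitive: these are two different variables
rate = 0.05
Rate = 0.10

print(seconds_per_minute * MinutesPerHour)   # 3600
print(rate == Rate)                          # False

# A name that starts with a digit is a SyntaxError, e.g. `2nd_value = 1`
```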

**EXERCISE:**

In [15]:
# Create a variable called "SecondsDay" that is equal to the number of 
# seconds in a day, and output its value.
SecondsDay = 24*60*60
SecondsDay

86400

### User-defined Functions
We can define functions ourselves by using `def`, listing the expected inputs, and stating the returned output with `return`. In this example we create a function that takes two numbers `x` and `y` and returns their sum.

In [16]:
def my_function(x, y):
    
    result = x+y
    
    return result

In [17]:
my_function(5, 3)

8
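
Since tuples are often used as function outputs, a function can return several values at once. This is a minimal sketch; `sum_and_product` is a hypothetical name, not part of the lab.

```python
def sum_and_product(x, y):
    """Return the sum and the product of x and y as a tuple."""
    return x + y, x * y

# The returned tuple can be unpacked into separate variables
s, p = sum_and_product(5, 3)
print(s, p)   # 8 15
```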

## Linear Algebra with Numpy

The NumPy array, also known as an `ndarray`, is like a list with multidimensional support and many more functions.
Linear algebra routines are documented at https://numpy.org/doc/stable/reference/routines.linalg.html

Important NumPy array attributes and functions:

- `.shape` returns the dimensions of the array as a tuple.

- `.ndim` returns the number of dimensions. 

- `.size` returns the total number of entries in the array.

- `len()` returns the length of the first dimension.


To use functions in NumPy, we have to import NumPy to our workspace. This is done by the command `import numpy`. By convention, we rename `numpy` as `np` for convenience.
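
As a quick illustration of `.shape`, `.ndim`, `.size`, and `len()`, here is a sketch on a hypothetical 2x3 array `M`:

```python
import numpy as np

# A 2x3 array built from a nested list
M = np.array([[1, 2, 3],
              [4, 5, 6]])

print(M.shape)   # (2, 3)
print(M.ndim)    # 2
print(M.size)    # 6
print(len(M))    # 2
```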

### Arrays

NumPy arrays are made up of two parts:
* **data buffer**: block of raw elements (numbers)
* **view**: how NumPy interprets the data buffer

In [4]:
import numpy as np

a = np.array([1,2,3]) # NumPy array indexed by single element from 0 to 2
a

array([1, 2, 3])

In [5]:
a.shape

(3,)

In [6]:
a[1]

2

In [7]:
a[0, 1] 
# Will result in an error because we have NOT 
# reshaped the 'view' of the data in matrix form

IndexError: too many indices for array

In [8]:
a = a.reshape(1,3) # Now we reshaped to a 1x3 matrix 
a

array([[1, 2, 3]])

In [23]:
a.shape

(1, 3)

In [9]:
a[0, 1] # Now we can index by [i, j] from the newly shaped matrix

2

In [10]:
# Multiplication of a constant and a vector
a = np.array([1,2,3])
2*a

array([2, 4, 6])

In [12]:
# Element-wise multiplication
b = np.array([3,3,3])
np.multiply(a,b)

array([3, 6, 9])

Note: The two examples above illustrate two important concepts for NumPy arrays: vectorization and broadcasting.

It is a good habit to employ vectorization and broadcasting whenever possible when dealing with linear algebra in NumPy. It will avoid unnecessary loops and significantly improve the efficiency of your code. 

Read more of the documentation at https://numpy.org/doc/stable/user/basics.broadcasting.html
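
A rough sketch of both ideas, reusing the values of `a` and `b` from the cells above. The 2x3 matrix `M` and the names `looped` and `vectorized` are hypothetical additions for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([3, 3, 3])

# Loop version: multiply element by element in plain Python
looped = np.array([a[i] * b[i] for i in range(len(a))])

# Vectorized version: one expression, the loop runs inside NumPy
vectorized = a * b
print(np.array_equal(looped, vectorized))   # True

# Broadcasting: the shape-(3,) array is stretched across both rows of M
M = np.array([[1, 2, 3],
              [4, 5, 6]])
print(M + a)
# [[2 4 6]
#  [5 7 9]]
```

On arrays this small the speed difference is negligible; the benefit of the vectorized form shows up on large arrays.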

In [14]:
# Inner product
inner_product = np.dot(a,b)
print(inner_product)

18


### Slicing

Slicing a NumPy array creates a view into the existing array rather than an implicit copy. This is particularly helpful with very large arrays because copying can be slow.

In [15]:
x = np.array([1, 2, 3, 4])
x

array([1, 2, 3, 4])

In [16]:
y = x[0:3]

In [17]:
y

array([1, 2, 3])

In [18]:
y[0] = 1000
y

array([1000,    2,    3])

In [19]:
x # Note that changing NumPy array y changes NumPy array x

array([1000,    2,    3,    4])

Note: since slicing does not copy the array, any change made to `y` also changes `x`. To create an object `y` that is not bound to the original object `x`, one needs to make a copy of the data. 

To achieve this, call the array's own `.copy()` method on the slice. Python's `copy` library provides the same operation for general objects. (Documentation: https://docs.python.org/3/library/copy.html)
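
A minimal sketch of the copy step, reusing the values of `x` from the slicing example. `np.shares_memory` is used here only to confirm that the two arrays are independent.

```python
import numpy as np

x = np.array([1, 2, 3, 4])

# .copy() allocates a new data buffer, so y no longer shares memory with x
y = x[0:3].copy()
y[0] = 1000

print(y)                        # [1000    2    3]
print(x)                        # [1 2 3 4]
print(np.shares_memory(x, y))   # False
```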

### Matrices

In [21]:
# Create a matrix
A = np.array([[1, 2, 8],
             [3, 2, 9]])
print(A)

[[1 2 8]
 [3 2 9]]


In [22]:
A.shape

(2, 3)

In [24]:
# Matrix multiplication
B = np.array([[1, 2],
              [3, 8],
              [2, 9]])

# There are two ways to perform matrix multiplication:
print(np.matmul(A,B))

# Alternatively:
print(A@B)

[[ 23  90]
 [ 27 103]]
[[ 23  90]
 [ 27 103]]


In [26]:
# Transpose a matrix
A.T

array([[1, 3],
       [2, 2],
       [8, 9]])

In [27]:
# Compute the inverse
C = np.array([[1, 2],
             [3, 2]])
D = np.linalg.inv(C)

C@D # note: the off-diagonal entries are essentially zero, so the output is the identity matrix up to rounding error.
# Remember A @ A^-1 = I (for invertible square matrices)

array([[1.00000000e+00, 5.55111512e-17],
       [0.00000000e+00, 1.00000000e+00]])
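
Because of rounding error, the product is only approximately the identity matrix. One way to check this, sketched here with the same `C`, is to compare within a tolerance using `np.allclose` rather than testing for exact equality.

```python
import numpy as np

C = np.array([[1, 2],
              [3, 2]])
D = np.linalg.inv(C)

# Exact equality can fail because of rounding error in the last digits,
# so compare against the identity matrix within a small tolerance instead
print(np.allclose(C @ D, np.eye(2)))   # True
```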

In [29]:
# Reshape 1-d NumPy Array to 2-d matrix
X = np.array([1,2,3])
print(X)
print(X.shape)

[1 2 3]
(3,)


In [30]:
Y = np.reshape(X,(-1,1))
print(Y)
print(Y.shape)
# This technique is useful when you want to convert a 1-d vector to a 2-d array (a matrix with only 1 column).  

[[1]
 [2]
 [3]]
(3, 1)
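
The `-1` in `reshape` tells NumPy to infer that dimension from the number of entries. A short sketch on a hypothetical 6-element array (the names `col`, `row`, and `two_rows` are illustrative):

```python
import numpy as np

X = np.arange(6)              # array([0, 1, 2, 3, 4, 5])

# -1 asks NumPy to infer that dimension from the array's size
col = X.reshape(-1, 1)        # 6 rows, 1 column
row = X.reshape(1, -1)        # 1 row, 6 columns
two_rows = X.reshape(2, -1)   # 2 rows, 3 columns (inferred)

print(col.shape, row.shape, two_rows.shape)   # (6, 1) (1, 6) (2, 3)
```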


## 3) DATA MANIPULATION WITH PANDAS LIBRARY

`pandas` is designed to make it easier to work with structured data. Most of the analyses you might perform will likely involve using tabular data, e.g., from .csv files or relational databases (e.g., SQL). The `DataFrame` object in `pandas` is "a two-dimensional tabular, column-oriented data structure with both row and column labels."

If you're curious:

>The `pandas` name itself is derived from *panel data*, an econometrics term for multidimensional structured data sets, and *Python data analysis* itself. After getting introduced, you can consult the full [`pandas` documentation](http://pandas.pydata.org/pandas-docs/stable/).

### Setting the working directory

Before loading the data, let's begin by setting the right working directory. To check or change the working directory, we use the `os` library.

In [31]:
import os
os.getcwd()

'C:\\Users\\Hyungki Im\\OneDrive\\UCB Files\\2022 Fall\\IEOR 142\\Lab\\Lab1'

Use the following lines to change the working directory to the folder where your Python files and data files are saved:

In [32]:
# new_path = 
# os.chdir(new_path)

### Loading CSV files

Now we can use the `pandas` library to load the data. Import `pandas` using the conventional abbreviation `pd` and call the `read_csv` function on your file's path name.

In [34]:
import pandas as pd

# WHO = pd.read_csv("WHO.csv")
# The file's encoding is given explicitly; if ISO-8859-1 does not work, try encoding="unicode_escape"
WHO = pd.read_csv("WHO.csv", encoding="ISO-8859-1")
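
To see why the `encoding` argument can matter, here is a sketch that reads a tiny in-memory file instead of `WHO.csv`. The two-row table and the names `raw` and `df` are hypothetical; the point is only that bytes written as ISO-8859-1 need to be decoded with the same encoding.

```python
import io
import pandas as pd

# Hypothetical two-row file, encoded as ISO-8859-1 (Latin-1) bytes
raw = "City,Count\nMálaga,1\nZürich,2\n".encode("ISO-8859-1")

# Decoding with the matching encoding recovers the accented characters
df = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(df)

# The same bytes are not valid UTF-8, so the default decoding fails
try:
    pd.read_csv(io.BytesIO(raw))
except UnicodeDecodeError as err:
    print("UnicodeDecodeError:", err.reason)
```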

### The Dataframe

In [35]:
WHO

Unnamed: 0,Country,Region,Population,Under15,Over60,FertilityRate,LifeExpectancy,ChildMortality,CellularSubscribers,LiteracyRate,GNI,PrimarySchoolEnrollmentMale,PrimarySchoolEnrollmentFemale
0,Afghanistan,Eastern Mediterranean,29825,47.42,3.82,5.40,60,98.5,54.26,,1140.0,,
1,Albania,Europe,3162,21.33,14.93,1.75,74,16.7,96.39,,8820.0,,
2,Algeria,Africa,38482,27.42,7.17,2.83,73,20.0,98.99,,8310.0,98.2,96.4
3,Andorra,Europe,78,15.20,22.86,,82,3.2,75.49,,,78.4,79.4
4,Angola,Africa,20821,47.58,3.84,6.10,51,163.5,48.38,70.1,5230.0,93.1,78.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
189,Venezuela (Bolivarian Republic of),Americas,29955,28.84,9.17,2.44,75,15.3,97.78,,12430.0,94.7,95.1
190,Viet Nam,Western Pacific,90796,22.87,9.32,1.79,75,23.0,143.39,93.2,3250.0,,
191,Yemen,Eastern Mediterranean,23852,40.72,4.54,4.35,64,60.0,47.05,63.9,2170.0,85.5,70.5
192,Zambia,Africa,14075,46.73,3.95,5.77,55,88.5,60.59,71.2,1490.0,91.4,93.9


In [36]:
# Structure of the data
WHO.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194 entries, 0 to 193
Data columns (total 13 columns):
Country                          194 non-null object
Region                           194 non-null object
Population                       194 non-null int64
Under15                          194 non-null float64
Over60                           194 non-null float64
FertilityRate                    183 non-null float64
LifeExpectancy                   194 non-null int64
ChildMortality                   194 non-null float64
CellularSubscribers              184 non-null float64
LiteracyRate                     103 non-null float64
GNI                              162 non-null float64
PrimarySchoolEnrollmentMale      101 non-null float64
PrimarySchoolEnrollmentFemale    101 non-null float64
dtypes: float64(9), int64(2), object(2)
memory usage: 19.8+ KB


In [45]:
# Recent statistics from the World Health Organization (WHO)
# The variables are: 
# 1) the name of the country
# 2) the region the country is in
# 3) the population in thousands
# 4) the percentage of the population under 15 
# 5) the percentage of the population over 60
# 6) the fertility rate (average number of children per woman)
# 7) the Life Expectancy in years
# 8) the Child Mortality rate (the number of children who die by age 5 per 1000 births)
# 9) the number of cellular subscribers per 100 population
# 10) the literacy rate among adults aged >= 15
# 11) the gross national income per capita
# 12) the percentage of male children enrolled in primary school
# 13) the percentage of female children enrolled in primary school

In [37]:
# Statistical summary of the data:
WHO.describe()

Unnamed: 0,Population,Under15,Over60,FertilityRate,LifeExpectancy,ChildMortality,CellularSubscribers,LiteracyRate,GNI,PrimarySchoolEnrollmentMale,PrimarySchoolEnrollmentFemale
count,194.0,194.0,194.0,183.0,194.0,194.0,184.0,103.0,162.0,101.0,101.0
mean,36359.97,28.732423,11.16366,2.940656,70.010309,36.148969,93.641522,83.71068,13320.925926,90.850495,89.632673
std,137903.1,10.534573,7.149331,1.480984,9.259075,37.992935,41.400447,17.530645,15192.98865,11.017147,12.817614
min,1.0,13.12,0.81,1.26,47.0,2.2,2.57,31.1,340.0,37.2,32.5
25%,1695.75,18.7175,5.2,1.835,64.0,8.425,63.5675,71.6,2335.0,87.7,87.3
50%,7790.0,28.65,8.53,2.4,72.5,18.6,97.745,91.8,7870.0,94.7,95.1
75%,24535.25,37.7525,16.6875,3.905,76.0,55.975,120.805,97.85,17557.5,98.1,97.9
max,1390000.0,49.99,31.92,7.58,83.0,181.6,196.41,99.8,86440.0,100.0,100.0


In [39]:
# Display a few data points at the "head" (start) of the dataset, i.e. the first few records
# By default, the first 5 records are shown; this can be overridden by specifying the number of records in the parentheses.
WHO.head(10) # try WHO.head() or WHO.head(2)

Unnamed: 0,Country,Region,Population,Under15,Over60,FertilityRate,LifeExpectancy,ChildMortality,CellularSubscribers,LiteracyRate,GNI,PrimarySchoolEnrollmentMale,PrimarySchoolEnrollmentFemale
0,Afghanistan,Eastern Mediterranean,29825,47.42,3.82,5.4,60,98.5,54.26,,1140.0,,
1,Albania,Europe,3162,21.33,14.93,1.75,74,16.7,96.39,,8820.0,,
2,Algeria,Africa,38482,27.42,7.17,2.83,73,20.0,98.99,,8310.0,98.2,96.4
3,Andorra,Europe,78,15.2,22.86,,82,3.2,75.49,,,78.4,79.4
4,Angola,Africa,20821,47.58,3.84,6.1,51,163.5,48.38,70.1,5230.0,93.1,78.2
5,Antigua and Barbuda,Americas,89,25.96,12.35,2.12,75,9.9,196.41,99.0,17900.0,91.1,84.5
6,Argentina,Americas,41087,24.42,14.97,2.2,76,14.2,134.92,97.8,17130.0,,
7,Armenia,Europe,2969,20.34,14.06,1.74,71,16.4,103.57,99.6,6100.0,,
8,Australia,Western Pacific,23050,18.95,19.46,1.89,82,4.9,108.34,,38110.0,96.9,97.5
9,Austria,Europe,8464,14.51,23.52,1.44,81,4.0,154.78,,42050.0,,


In [40]:
# Display the last few records.
WHO.tail()

Unnamed: 0,Country,Region,Population,Under15,Over60,FertilityRate,LifeExpectancy,ChildMortality,CellularSubscribers,LiteracyRate,GNI,PrimarySchoolEnrollmentMale,PrimarySchoolEnrollmentFemale
189,Venezuela (Bolivarian Republic of),Americas,29955,28.84,9.17,2.44,75,15.3,97.78,,12430.0,94.7,95.1
190,Viet Nam,Western Pacific,90796,22.87,9.32,1.79,75,23.0,143.39,93.2,3250.0,,
191,Yemen,Eastern Mediterranean,23852,40.72,4.54,4.35,64,60.0,47.05,63.9,2170.0,85.5,70.5
192,Zambia,Africa,14075,46.73,3.95,5.77,55,88.5,60.59,71.2,1490.0,91.4,93.9
193,Zimbabwe,Africa,13724,40.24,5.68,3.64,54,89.8,72.13,92.2,,,


### Subsets of data

In [45]:
# find the subset with only the countries in Europe
WHO_Europe = WHO[WHO['Region'] == 'Europe']
WHO_Europe.head()

Unnamed: 0,Country,Region,Population,Under15,Over60,FertilityRate,LifeExpectancy,ChildMortality,CellularSubscribers,LiteracyRate,GNI,PrimarySchoolEnrollmentMale,PrimarySchoolEnrollmentFemale
1,Albania,Europe,3162,21.33,14.93,1.75,74,16.7,96.39,,8820.0,,
3,Andorra,Europe,78,15.2,22.86,,82,3.2,75.49,,,78.4,79.4
7,Armenia,Europe,2969,20.34,14.06,1.74,71,16.4,103.57,99.6,6100.0,,
9,Austria,Europe,8464,14.51,23.52,1.44,81,4.0,154.78,,42050.0,,
10,Azerbaijan,Europe,9309,22.25,8.24,1.96,71,35.2,108.75,,8960.0,85.3,84.1


In [43]:
# Let us go through the expression above step by step, starting with the 'Region' column
WHO['Region']

0      Eastern Mediterranean
1                     Europe
2                     Africa
3                     Europe
4                     Africa
               ...          
189                 Americas
190          Western Pacific
191    Eastern Mediterranean
192                   Africa
193                   Africa
Name: Region, Length: 194, dtype: object

In [46]:
WHO['Region'] == 'Europe'

0      False
1       True
2      False
3       True
4      False
       ...  
189    False
190    False
191    False
192    False
193    False
Name: Region, Length: 194, dtype: bool

In [47]:
WHO_Europe.count()

Country                          53
Region                           53
Population                       53
Under15                          53
Over60                           53
FertilityRate                    50
LifeExpectancy                   53
ChildMortality                   53
CellularSubscribers              51
LiteracyRate                     26
GNI                              48
PrimarySchoolEnrollmentMale      38
PrimarySchoolEnrollmentFemale    38
dtype: int64

In [48]:
# Use compound boolean operators in the conditional expression
# Use & for and; | for or. Wrap each condition in its own parentheses.
WHO_AsiaEurope = WHO[(WHO['Region'] == 'Europe') | (WHO['Region'] == 'South-East Asia') | (WHO['Region'] == "Eastern Mediterranean")] 
WHO_AsiaEurope.head()

Unnamed: 0,Country,Region,Population,Under15,Over60,FertilityRate,LifeExpectancy,ChildMortality,CellularSubscribers,LiteracyRate,GNI,PrimarySchoolEnrollmentMale,PrimarySchoolEnrollmentFemale
0,Afghanistan,Eastern Mediterranean,29825,47.42,3.82,5.4,60,98.5,54.26,,1140.0,,
1,Albania,Europe,3162,21.33,14.93,1.75,74,16.7,96.39,,8820.0,,
3,Andorra,Europe,78,15.2,22.86,,82,3.2,75.49,,,78.4,79.4
7,Armenia,Europe,2969,20.34,14.06,1.74,71,16.4,103.57,99.6,6100.0,,
9,Austria,Europe,8464,14.51,23.52,1.44,81,4.0,154.78,,42050.0,,
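
The cell above combines conditions with `|`. As a complementary sketch, the following uses `&` and the `isin` method on a hypothetical mini-table (not the WHO data); all names and values are made up.

```python
import pandas as pd

# Hypothetical mini-table, not the WHO data
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Region": ["Europe", "Africa", "Europe", "Americas"],
    "Population": [500, 60000, 80000, 300],
})

# & means "and": each condition needs its own parentheses
big_europe = df[(df["Region"] == "Europe") & (df["Population"] > 50000)]
print(big_europe["Country"].tolist())   # ['C']

# isin() is a compact alternative to chaining several | conditions
subset = df[df["Region"].isin(["Europe", "Americas"])]
print(subset["Country"].tolist())       # ['A', 'C', 'D']
```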


### Saving dataframe to CSV file

In [49]:
WHO_AsiaEurope.to_csv("WHO_AsiaEurope.csv") 
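
By default, `to_csv` also writes the row index as an extra unnamed column. The sketch below shows the effect of `index=False` using an in-memory buffer instead of a file; the mini-table and the names `cols_default` and `cols_no_index` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical mini-table standing in for WHO_AsiaEurope
df = pd.DataFrame({"Country": ["A", "B"], "Population": [10, 20]}, index=[3, 7])

# Default: the index is written as an extra, unnamed first column
buf = io.StringIO()
df.to_csv(buf)
cols_default = pd.read_csv(io.StringIO(buf.getvalue())).columns.tolist()
print(cols_default)    # ['Unnamed: 0', 'Country', 'Population']

# index=False leaves the row labels out of the file
buf = io.StringIO()
df.to_csv(buf, index=False)
cols_no_index = pd.read_csv(io.StringIO(buf.getvalue())).columns.tolist()
print(cols_no_index)   # ['Country', 'Population']
```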

**EXERCISE:**

In [50]:
# How many countries have population greater than 50 million? 
len(WHO[WHO['Population']>50000])

25

### More Data Analysis

To access a variable (column) in a data frame, you have to refer to it through the data frame: use square brackets and pass its name as a string.

In [51]:
# This will give you an error!
LifeExpectancy

NameError: name 'LifeExpectancy' is not defined

In [52]:
# Now, run this.
WHO['LifeExpectancy']

0      60
1      74
2      73
3      82
4      51
       ..
189    75
190    75
191    64
192    55
193    54
Name: LifeExpectancy, Length: 194, dtype: int64

### Statistics

In [53]:
# Statistics of a variable
print(WHO['LifeExpectancy'].mean())
print(WHO['LifeExpectancy'].max())
print(WHO['LifeExpectancy'].min())

70.01030927835052
83
47


In [54]:
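# Standard deviation (note the parentheses: without them, the method is not called)
WHO['LifeExpectancy'].std()
# Only the last expression in a cell is displayed, so the output shown is from describe()
WHO['LifeExpectancy'].describe()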

count    194.000000
mean      70.010309
std        9.259075
min       47.000000
25%       64.000000
50%       72.500000
75%       76.000000
max       83.000000
Name: LifeExpectancy, dtype: float64

In [55]:
WHO['GNI'].describe()
# what's different here?

count      162.000000
mean     13320.925926
std      15192.988650
min        340.000000
25%       2335.000000
50%       7870.000000
75%      17557.500000
max      86440.000000
Name: GNI, dtype: float64

In [56]:
# Identify countries corresponding to max and min
# idxmin/idxmax return the index label of the row holding the min/max value
idx_min = WHO['LifeExpectancy'].idxmin()
print(WHO.loc[idx_min, 'Country'])

idx_max = WHO['LifeExpectancy'].idxmax()
print(WHO.loc[idx_max, 'Country'])

Sierra Leone
Japan


**EXERCISE:**

In [57]:
# What is the largest population value among all countries?
print(WHO['Population'].max())
# Which country has the largest population?
idx_max = WHO['Population'].idxmax()
print(WHO.loc[idx_max, 'Country'])

1390000
China
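
pandas distinguishes between the index label of an extreme value (`idxmin`, `idxmax`) and its integer position (`argmin`, `argmax`, as they behave in recent pandas versions). The two only coincide when the index is 0, 1, 2, and so on. A minimal sketch on a hypothetical series `s` whose labels differ from its positions:

```python
import pandas as pd

# Hypothetical series whose index labels are not 0, 1, 2, ...
s = pd.Series([72, 54, 83], index=[10, 20, 30])

print(s.idxmin())   # 20  (index label of the smallest value)
print(s.argmin())   # 1   (integer position of the smallest value)

# Labels go with .loc, positions go with .iloc
print(s.loc[s.idxmin()], s.iloc[s.argmin()])   # 54 54
```

This matters after filtering or dropping rows, because the remaining index labels then no longer match the row positions.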


### Dealing with missing data

In [59]:
# Dealing with NA
# Try:
WHO['LiteracyRate'].head()

0     NaN
1     NaN
2     NaN
3     NaN
4    70.1
Name: LiteracyRate, dtype: float64

In [60]:
WHO.dropna(subset=['LiteracyRate'], inplace=True)
WHO['LiteracyRate'].head()
# Note: setting inplace=True modifies the original dataframe.
# With inplace=False (the default), dropna returns a new dataframe and leaves the original unchanged.

4     70.1
5     99.0
6     97.8
7     99.6
12    91.9
Name: LiteracyRate, dtype: float64
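
Before dropping rows, it can help to count the missing values and to keep the original table intact. The following sketch uses a hypothetical three-row table rather than the WHO data; `cleaned` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table with one missing literacy rate
df = pd.DataFrame({"Country": ["A", "B", "C"],
                   "LiteracyRate": [np.nan, 70.1, 99.0]})

# Count the missing values in the column before dropping anything
print(df["LiteracyRate"].isna().sum())   # 1

# Without inplace=True, dropna returns a new DataFrame
cleaned = df.dropna(subset=["LiteracyRate"])
print(len(cleaned), len(df))             # 2 3
```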

## References
- [1] Special thanks to the [EECS127 Fall 2019](https://inst.eecs.berkeley.edu/~ee127/fa19/) for providing a great starting point for Intro to Jupyter
- [2] D-lab intro to pandas
- [3] The official Python 3 language documentation. [Link](https://docs.python.org/3/).
- [4] The official numpy and scipy documentation. [Link](https://docs.scipy.org/doc/).


